
Manipulating Data

Selecting Rows / Observations


Selecting Columns / Fields
Merging Data
Relabeling the Column Names
Converting Variable Types
Data Sorting
Data Aggregation

Selecting Rows / Observations


Top N Obs

Random Sampling
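A minimal sketch of both ideas, assuming a small hypothetical data frame `df` (the column names and values are invented for illustration and are not the course data set):

```r
# Hypothetical data frame standing in for the course data set
set.seed(42)                         # make the random draws reproducible
df <- data.frame(Cust_ID = 1:100,
                 Age     = sample(21:60, 100, replace = TRUE))

# Top N observations: first 10 rows (tail() gives the last N)
top10 <- head(df, 10)
last5 <- tail(df, 5)

# Random sampling: pick 20 row numbers without replacement, then index
idx  <- sample(nrow(df), 20)
samp <- df[idx, ]
```

`set.seed()` is only there so that the same sample is drawn on every run.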

K2Analytics.co.in



Filtering

Subsetting and selecting required columns
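As an illustration, filtering and subsetting might look like the sketch below. The data frame `df` and its columns are assumptions made up for this example.

```r
# Hypothetical data frame for illustration
df <- data.frame(
  Cust_ID = c("C1", "C2", "C3", "C4"),
  Age     = c(25, 42, 37, 51),
  Gender  = c("M", "F", "F", "M"),
  Balance = c(1000, 5200, 300, 7600)
)

# Filtering: a logical condition in the row position keeps matching rows
over_35 <- df[df$Age > 35, ]

# Subsetting: subset() filters rows and selects required columns in one call
f_small <- subset(df, Gender == "F" & Balance < 1000,
                  select = c(Cust_ID, Age))
```

Inside `subset()` the column names can be used directly, without the `df$` prefix.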


Selecting Columns | KEEP


Selecting Columns | DROP


Summarizing
data_frame [ row, col ]
data_frame [ row, ]
data_frame [, col ]
data_frame [col ]
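The indexing forms above can be sketched as follows, covering both KEEP and DROP. The data frame `df` and its columns are hypothetical.

```r
# Hypothetical data frame for illustration
df <- data.frame(
  Cust_ID = c("C1", "C2", "C3"),
  Age     = c(25, 42, 37),
  Gender  = c("M", "F", "F"),
  Balance = c(1000, 5200, 300)
)

# KEEP: name the columns you want
kept  <- df[, c("Cust_ID", "Age")]     # data_frame[, col]
kept2 <- df[c("Cust_ID", "Age")]       # data_frame[col], same result

# DROP: exclude columns by name or by position
dropped     <- df[, !(names(df) %in% c("Gender", "Balance"))]
dropped_pos <- df[, -c(3, 4)]

# Rows and single cells
row1 <- df[1, ]                        # data_frame[row, ]
cell <- df[2, "Age"]                   # data_frame[row, col]
```

Note that `df[, "Age"]` with a single column returns a plain vector; add `drop = FALSE` if a one-column data frame is needed.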


Merging Data
Let us first get the two datasets required for merging. Note that we
will build on these for our logistic regression

Alternatively, we can load them from our saved RData file

Let us load the second dataset

Merge lr_ds1 and lr_ds2 to create lr_ds, the logistic regression data
frame to be used in subsequent steps
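A minimal sketch of the merge step. The contents of `lr_ds1` and `lr_ds2` below are invented stand-ins (the real files are not reproduced here), and joining on `Cust_ID` is an assumption:

```r
# Hypothetical stand-ins for the two course datasets
lr_ds1 <- data.frame(Cust_ID = c("C1", "C2", "C3"),
                     Target  = c(0L, 1L, 0L))
lr_ds2 <- data.frame(Cust_ID = c("C1", "C2", "C3"),
                     Age     = c(25, 42, 37),
                     SCR     = c(610, 480, 720))

# merge() joins on the shared key column; the default is an inner join
lr_ds <- merge(lr_ds1, lr_ds2, by = "Cust_ID")
```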


Merging Data | various joins

Let us try one more merge example

Remember that R is case-sensitive: LR_DF is not the same as lr_df
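The various joins can be sketched with the `all`, `all.x` and `all.y` arguments of `merge()`. The two data frames here are hypothetical and deliberately share only some keys:

```r
# Hypothetical data frames with partly overlapping keys
left_df  <- data.frame(Cust_ID = c("C1", "C2", "C3"), Age = c(25, 42, 37))
right_df <- data.frame(Cust_ID = c("C2", "C3", "C4"), SCR = c(480, 720, 555))

inner <- merge(left_df, right_df, by = "Cust_ID")                # keys in both
left  <- merge(left_df, right_df, by = "Cust_ID", all.x = TRUE)  # all left rows
right <- merge(left_df, right_df, by = "Cust_ID", all.y = TRUE)  # all right rows
full  <- merge(left_df, right_df, by = "Cust_ID", all = TRUE)    # all rows

# Unmatched rows are filled with NA, e.g. SCR for C1 in the left join
left
```

Because R is case-sensitive, referring to `LEFT_DF` instead of `left_df` here would fail with an "object not found" error.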



Relabeling the Column Name


Let us say we wish to re-label the column SCR to Score

Renaming Score back to SCR

Another way of renaming


Renaming Score back to SCR
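Two common base R ways of renaming, sketched on a hypothetical two-column `lr_ds` (these may or may not be the exact calls used on the original slide):

```r
# Hypothetical data frame with the SCR column
lr_ds <- data.frame(Cust_ID = c("C1", "C2"), SCR = c(610, 480))

# Way 1: find the column by its current name
names(lr_ds)[names(lr_ds) == "SCR"] <- "Score"
after_rename <- names(lr_ds)
names(lr_ds)[names(lr_ds) == "Score"] <- "SCR"    # back to SCR

# Way 2: address the column by position
colnames(lr_ds)[2] <- "Score"
colnames(lr_ds)[2] <- "SCR"                       # back to SCR
```

Renaming by name is usually safer than by position, since it does not break if the column order changes.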


Converting Variable Types


Let us first get the structure of our data frame and then experiment
with converting variable data types

Let us convert Cust_ID from Factor to Character
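A short sketch of these two steps, assuming a hypothetical `lr_ds` in which `Cust_ID` is stored as a factor:

```r
# Hypothetical data frame; Cust_ID is deliberately created as a factor
lr_ds <- data.frame(Cust_ID = factor(c("C1", "C2", "C3")),
                    Target  = c(0L, 1L, 0L))

str(lr_ds)                       # structure: column names, types, first values
before <- class(lr_ds$Cust_ID)   # "factor"

# Convert Cust_ID from Factor to Character
lr_ds$Cust_ID <- as.character(lr_ds$Cust_ID)
after <- class(lr_ds$Cust_ID)    # "character"
```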



Let us convert Target from Integer to Factor and back to Integer

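Target variable values have changed from 0 & 1 to 1 & 2. Why?
Calling as.integer() on a factor returns its internal level codes (1, 2, ...), not the original labels.

Let us first get back to the original state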


You can convert numeric to integer using as.integer
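The effect can be reproduced with a small hypothetical `Target` vector:

```r
# Hypothetical 0/1 target stored as integer
Target <- c(0L, 1L, 1L, 0L)

# Integer -> Factor: the levels become "0" and "1"
Target_f <- factor(Target)

# Factor -> Integer with as.integer() returns the level codes, not the labels
codes <- as.integer(Target_f)    # 1 2 2 1 instead of 0 1 1 0

# Numeric -> Integer: as.integer() truncates the decimal part
num      <- c(1.9, 2.0, 3.7)
num_int  <- as.integer(num)      # 1 2 3
```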



Let us now see how a Factor is converted into an Integer

Using levels() is a better approach than type casting values with
as.character() and then to integer. With as.character() you may run
into issues with missing values

*Other common conversion functions: as.Date(), as.numeric()
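Both conversion routes, sketched on a hypothetical factor. The date and number strings are arbitrary examples:

```r
# Hypothetical factor with labels "0" and "1"
Target_f <- factor(c(0, 1, 1, 0))

# levels() approach: convert the (few) levels once, then index by the factor
via_levels <- as.integer(levels(Target_f))[Target_f]

# Type-casting approach: factor -> character -> integer
via_char <- as.integer(as.character(Target_f))

# Other common conversions
d <- as.Date("2016-01-31")       # character -> Date (ISO format by default)
n <- as.numeric("3.14")          # character -> numeric
```

On this small example both routes recover the original 0/1 values; the `levels()` route only converts each distinct level once.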



Data Sorting
Let's say we want to sort our data frame by Age and Holding_Period
Use order() to sort
Prefix a minus sign (-) for a descending sort

Note the usage of attach(): because of attach(), we do not need to prefix column names with the data frame name
To sort vectors, you can use the sort() function
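A minimal sorting sketch on a hypothetical data frame. It uses `with()` rather than `attach()` to avoid the `df$` prefix; that is a substitution made for this example, chosen because `with()` does not leave the data frame attached afterwards.

```r
# Hypothetical data frame for illustration
df <- data.frame(Age            = c(30, 25, 30, 41),
                 Holding_Period = c(5, 12, 9, 3))

# Ascending by Age, then Holding_Period
asc <- df[order(df$Age, df$Holding_Period), ]

# Descending on both: prefix each column with a minus sign
desc <- df[order(-df$Age, -df$Holding_Period), ]

# Mixed order, using with() so column names need no prefix
mixed <- df[with(df, order(Age, -Holding_Period)), ]

# Sorting a plain vector
v_sorted <- sort(c(3, 1, 2))
v_rev    <- sort(c(3, 1, 2), decreasing = TRUE)
```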

Data Aggregation | Freq & Cross-Tabs
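Frequencies and cross-tabs are typically produced with `table()`. A sketch on a hypothetical data frame:

```r
# Hypothetical data frame for illustration
df <- data.frame(Gender = c("M", "F", "F", "M", "F"),
                 Target = c(0, 1, 0, 0, 1))

freq <- table(df$Gender)                # one-way frequency
xt   <- table(df$Gender, df$Target)     # two-way cross-tab: Gender x Target
prop <- prop.table(xt, margin = 1)      # row-wise proportions
tot  <- addmargins(xt)                  # cross-tab with row and column totals
```

`margin = 1` makes each row sum to 1; use `margin = 2` for column-wise proportions.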


Data Aggregations | Group By
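One base R way to do a group-by is `aggregate()` with a formula. This is a sketch on hypothetical data and may differ from the function used on the original slide:

```r
# Hypothetical data frame for illustration
df <- data.frame(Gender  = c("M", "F", "F", "M"),
                 Balance = c(1000, 5200, 300, 7600))

# Mean Balance for each Gender: read "Balance by Gender"
agg <- aggregate(Balance ~ Gender, data = df, FUN = mean)
agg
```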


Data Aggregations | Group By as


apply, sapply, lapply
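A small sketch contrasting the three functions on made-up inputs:

```r
# apply(): works over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)        # 2 x 3 matrix, filled column by column
row_sums <- apply(m, 1, sum)      # one value per row
col_max  <- apply(m, 2, max)      # one value per column

# lapply(): applies a function to each list element, returns a list
lst   <- list(a = 1:3, b = 4:6)
l_out <- lapply(lst, mean)

# sapply(): like lapply(), but simplifies the result to a vector if it can
s_out <- sapply(lst, mean)
```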


tapply
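`tapply()` applies a function to a vector split by a grouping variable. A sketch with hypothetical values:

```r
# Hypothetical vectors for illustration
Age    <- c(25, 42, 37, 51)
Gender <- c("M", "F", "F", "M")

# Mean Age within each Gender group
mean_age <- tapply(Age, Gender, mean)
mean_age
```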


Using Functions in R
Commonly used mathematics functions
Commonly used summary functions
Commonly used string functions
Creating user defined functions
Local and Global variables

Commonly used summary functions
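A few of the usual summary functions, shown on an arbitrary example vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # arbitrary example values

m   <- mean(x)        # arithmetic mean
med <- median(x)      # middle value
v   <- var(x)         # sample variance (divides by n - 1)
s   <- sd(x)          # sample standard deviation
r   <- range(x)       # minimum and maximum
q   <- quantile(x, probs = c(0.25, 0.75))   # quartiles

summary(x)            # min, quartiles, median, mean and max in one call
```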


Commonly used mathematical functions

R does not have a built-in function for the statistical mode
(the base mode() function returns an object's storage mode instead)

Sample code to get mode
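One possible way to compute the mode is sketched below. The function name `stat_mode` is made up for this example, and the original slide's code may have differed:

```r
# Hypothetical helper: most frequent value of a vector
stat_mode <- function(x) {
  ux <- unique(x)                      # distinct values
  counts <- tabulate(match(x, ux))     # how often each distinct value occurs
  ux[which.max(counts)]                # value with the highest count
}

m_num <- stat_mode(c(2, 4, 4, 5, 7, 4, 5))
m_chr <- stat_mode(c("a", "b", "b"))
```

If several values tie for the highest count, this sketch returns the one that appears first in the vector.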


Commonly used String function
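A few frequently used base R string functions, shown on an arbitrary example string:

```r
s <- "K2 Analytics"                          # arbitrary example string

n_chars  <- nchar(s)                         # number of characters
upper    <- toupper(s)                       # upper case (tolower() for lower)
part     <- substr(s, 4, 12)                 # characters 4 to 12
joined   <- paste("K2", "Analytics", sep = "_")   # concatenate with a separator
pieces   <- strsplit("a,b,c", ",")[[1]]      # split on a delimiter
no_space <- gsub(" ", "", s)                 # replace every match
has_lyt  <- grepl("lyt", s)                  # does the pattern occur?
trimmed  <- trimws("  hi ")                  # strip leading/trailing spaces
```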


Creating User Defined Functions


Syntax for creating a user defined function
fun_name <- function(<arg1>, <arg2>, ..., <argn>)
{
  code statements
  return(object)
}

For arguments where we wish to provide default values, we can write
the arguments as
fun_name <- function(<arg1 = default value>, <arg2>, ..., <argn>)
{
  code statements
  return(object)
}
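A concrete instance of the syntax above. The function `area_rect` and its arguments are invented for illustration:

```r
# Hypothetical function: area of a rectangle; width defaults to 1
area_rect <- function(length, width = 1) {
  area <- length * width
  return(area)
}

a1 <- area_rect(5)                      # width falls back to its default
a2 <- area_rect(5, 2)                   # arguments matched by position
a3 <- area_rect(width = 3, length = 4)  # arguments matched by name
```

Arguments with defaults are usually placed after the required ones, so that positional calls stay readable.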

Local and Global variables

Note the usage of <<- to assign to a global variable from inside a function
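The difference between `<-` and `<<-` inside a function can be sketched as follows (the names `counter`, `bump_local` and `bump_global` are made up for this example):

```r
counter <- 0                      # variable defined outside the functions

bump_local <- function() {
  counter <- counter + 1          # creates a local copy; outer counter untouched
  counter
}

bump_global <- function() {
  counter <<- counter + 1         # assigns to the counter outside the function
  counter
}

a <- bump_local()
after_local <- counter            # still 0
b <- bump_global()
after_global <- counter           # now 1
```

Strictly speaking, `<<-` assigns in the enclosing environments of the function; for functions defined at the top level, that is the global environment.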



R Programming Structures
For Loops
While Loops
If-Else
Arithmetic and Boolean Operators

Loops
For Loop
for (variable in sequence)
{
statements
}

While Loop
while (condition)
{
statements
}
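Both loop forms in a small runnable sketch (the calculations are arbitrary examples):

```r
# For loop: iterate over the values of a sequence
total <- 0
for (i in 1:5)
{
  total <- total + i              # 1 + 2 + 3 + 4 + 5
}

# While loop: repeat as long as the condition holds
n <- 1
while (n < 100)
{
  n <- n * 2                      # doubles until n reaches 100 or more
}
```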


IF-ELSE
IF-ELSE Syntax
if (condition) { statements }
if (condition) { statements } else {statement}

ifelse (condition, value if true, value if false)

Note the processing time required by the ifelse code as compared to the for
loop in the previous slide. Note: for loops are generally less efficient
than vectorized functions such as ifelse()
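A sketch of such a comparison on simulated data. The `Age` vector, the cut-off of 40 and the labels are assumptions for this example, and actual timings vary with the machine and the R version:

```r
set.seed(1)
Age <- sample(18:70, 1e5, replace = TRUE)    # simulated ages

# Element-by-element with a for loop and if-else
t_loop <- system.time({
  grp_loop <- character(length(Age))         # pre-allocate the result
  for (i in seq_along(Age)) {
    if (Age[i] >= 40) grp_loop[i] <- "Senior" else grp_loop[i] <- "Junior"
  }
})

# Vectorized with ifelse(): the whole vector in one call
t_vec <- system.time(grp_vec <- ifelse(Age >= 40, "Senior", "Junior"))

same <- identical(grp_loop, grp_vec)         # both give the same labels
print(t_loop); print(t_vec)
```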


Using function and sapply
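A sketch of this pattern: a user defined function written for a single value, applied to every element with `sapply()`. The function `age_band` and its cut-offs are hypothetical.

```r
# Hypothetical function that classifies one age value
age_band <- function(age) {
  if (age < 30) {
    "Young"
  } else if (age < 50) {
    "Middle"
  } else {
    "Senior"
  }
}

# sapply() calls age_band() once per element and returns a character vector
bands <- sapply(c(25, 42, 67), age_band)
```

This is needed because `if` only accepts a single condition, so `age_band()` cannot be called on a whole vector directly.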


Arithmetic & Boolean Operators


Operator      Description
+ - * /       Plus, Minus, Multiply, Divide
^ **          Exponentiation; e.g. A <- 2^4 will set the value of A to 16
%%            Modulus; usage X %% Y
%/%           Integer Division; usage X %/% Y
< > <= >=     Less Than, Greater Than, Less Than or Equal To, Greater Than or Equal To
== !=         Equal To; Not Equal To
& &&          AND
| ||          OR
!             Negation
%in%          X %in% Y; check whether values of X are in Y
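The operators above in action, on arbitrary values:

```r
A <- 2^4                                  # exponentiation; 2 ** 4 is equivalent
r <- 7 %% 3                               # modulus (remainder)
q <- 7 %/% 3                              # integer division

# & and | compare element by element; && and || work on single values
elem_and   <- c(TRUE, FALSE) & c(TRUE, TRUE)
elem_or    <- c(TRUE, FALSE) | c(FALSE, FALSE)
single_and <- TRUE && FALSE
single_or  <- TRUE || FALSE
neg        <- !TRUE

# %in% returns one TRUE/FALSE per element of the left-hand side
found <- c(1, 5) %in% c(1, 2, 3)
```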


Charts & Plots


Histogram
Pareto Charts
Box Plots
Pie Chart
Line Graph
Scatter Plot
Overlaying Graphs

Histogram
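A minimal histogram sketch with base graphics, using simulated ages (the data, title and colour are assumptions for this example):

```r
set.seed(7)
Age <- round(rnorm(500, mean = 40, sd = 10))   # simulated ages

# hist() draws the chart and also returns the bin breaks and counts
h <- hist(Age,
          main = "Distribution of Age",
          xlab = "Age",
          col  = "lightblue")
h$counts                                       # observations per bin
```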


Pareto Chart


Box Plot

Note: Compare the two box plots to see the impact of the formula Age ~ Gender


Colourful Box Plot

Note:
levels() gives you the distinct values in a Factor
length() gives you the number of values in the vector returned by levels()
For col (colours) we are using their corresponding values, like 2 for Red, 3 for Green, 4 for Blue
1 is for Black, so we start from 2; hence the need for +1
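The colour logic described in the note can be sketched as follows, on simulated data (the data frame and its values are assumptions for this example):

```r
set.seed(3)
df <- data.frame(
  Age    = round(runif(60, min = 21, max = 60)),
  Gender = factor(sample(c("F", "M"), 60, replace = TRUE))
)

n_levels  <- length(levels(df$Gender))    # number of distinct Gender values
fill_cols <- seq_len(n_levels) + 1        # skip 1 (black): colours 2, 3, ...

# One box per Gender level, each filled with its own colour
boxplot(Age ~ Gender, data = df, col = fill_cols,
        main = "Age by Gender")
```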


Pie Chart


Bar Plot


Scatter Plot | Importing dummy data


Scatter Plot
If we simply call plot() with the data frame name, R will draw a
scatter plot for every possible pair of columns, so be careful while
working with big data sets: it may hang the machine
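A sketch of the safer alternatives, using simulated data (the data frame `df` and the sample size of 100 are assumptions for this example):

```r
set.seed(11)
df <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))

# Scatter plot of one chosen pair of columns
plot(df$x, df$y, xlab = "x", ylab = "y")

# For the all-pairs view, plot a random sample of rows
# and only the columns of interest
small <- df[sample(nrow(df), 100), c("x", "y", "z")]
plot(small)                       # scatter plot matrix on the reduced data
```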


Line Plot


Overlaying Charts


Thank you
End of Part 2
