
Self-Learning R Essentials

Rajesh Jakhotia
22 Jun 2014

About K2 Analytics
At K2 Analytics, we believe that skill development is very important for the
growth of an individual, which in turn leads to the growth of Society & Industry
and ultimately the Nation as a whole. For this it is important that access to
knowledge and skill development trainings should be made available easily
and economically to every individual.
Our Vision: To be the preferred partner for training and skill development
Our Mission: To provide training and skill development to individuals,
make them skilled & industry ready, and create a pool of skilled resources
readily available to the industry

We have chosen Business Intelligence and Analytics as our focus area. With
this endeavour we make this Self-Learning R Essentials accessible to all
those who wish to learn R. We hope it is of help to you. If you have any feedback
or suggestions, or are looking for a job in analytics, feel free to write to
us at ar.jakhotia@k2analytics.co.in
Welcome to R!!!
K2Analytics.co.in

Welcome to R

Content
Introduction to R
Understanding R data structures
Importing Data
Managing Data
R Programming Structures
Basic Charting and Plotting
Performing Logistic Regression
Introduction to Rattle


Introduction to R
What is R?
Why R?
Installing R
Understanding the R interface
R environment variables and startup files
How to get help in R
R Console & R Editor

What is R?
Free software environment for statistical computing and graphics
Compiles and runs on a wide variety of UNIX platforms, Windows
and Mac OS
Official download site (CRAN): http://cran.r-project.org/
R's source code is freely available under the GNU General Public
License

Originally created by Ross Ihaka and Robert Gentleman (hence the
name R) at the University of Auckland, New Zealand


Why R?
Free and exceptionally good statistical tool
Provides cutting-edge statistical techniques comparable to those available in many
expensive paid software packages
Has decent data handling and data manipulation capabilities
Provides connectors to social media sites, and one can also easily
get streaming data
R can work with Big Data

Easily installed and usable with training


Over time, the popularity of R in the analytics fraternity has been growing
(as a case in point, R was extensively used by the analytics
team working on Obama's presidential election campaign in 2012)

Installing R
Go to website: http://cran.r-project.org/
Click the link based on OS Environment
Click base

Download R installer
Double click on the installer
Select Run

Follow the instruction steps


R Interface

R Console is where you execute the code

R Editor is where you write and save code


Basic commands to know your environment
help() ## type this on your R Console

Note: R Syntax and variable names are case sensitive



Environment variables (contd.)

Getting environment variables, e.g.
## We request the learner to keep typing the commands on the R Console and attempt to interpret the
## output. It does not matter if you do not understand everything in the first run.

Sys.getenv("R_HOME")
Sys.getenv("R_PROFILE")
Sys.getenv("R_PROFILE_USER")
Sys.getenv("R_DATA")

Sys.getenv(c("R_HOME", "R_PROFILE", "R_PROFILE_USER", "R_DATA"))
For more details:
http://stat.ethz.ch/R-manual/R-patched/library/base/html/Startup.html

Environment variables (contd.)

R_HOME: The R home directory is the top-level directory of
the R installation being run.
R_PROFILE: The path of the site-wide startup profile file of R code.
The default is R_HOME/etc/Rprofile.site
R_PROFILE_USER: The path of the file containing user-specific
profile customization. If this is unset, a file called .Rprofile is
searched for in the current directory or in the user's home directory
(in that order).

R_DATA: The path from where R loads the last saved image (from
the current directory, if there is one). The extension of the image file is .RData


Customizing R Startup
At startup, R searches for the Renviron.site file
Default location is R_HOME/etc/Renviron.site
R_HOME is the path where you installed R. In my case it is
C:/Program Files/R/R-3.0.3
The factory installation does not come with a Renviron.site file.
You have to create one in Notepad and save it at the above path

Note: The Renviron.site file should be used only for setting environment
variables


Creating Renviron.site file in Notepad

For example, let us create the environment file and store it with the following
content (one name=value pair per line)
R_PROFILE_USER=C:/R/Startup.Rprofile
R_PROFILE=C:/R/Rprofile.site


Customizing R Startup (contd.)

After the Renviron file, R looks for the Rprofile files (site-wide first, then user).
If R_PROFILE is not set then R looks for the Rprofile.site file in the
default location, R_HOME/etc
Two important functions you can define in the Rprofile file, besides other
functions you create, are
.First() This will get executed at startup
.Last() This will get executed when we end the R session

Note: when the site file and profile file are loaded, only the base
package is loaded. If you have to refer to any other packages
then they need to be explicitly loaded.


Customizing R Startup (contd.)

Let us have the following function written in the Rprofile.site file
.First <- function() {
  setwd("C:/K2-Analytics/Rtraining_Codes")
}
If your R is running, then
close and restart R
Go to R Editor
Click File > Save or Save As
Note the default folder path
Note: the above is just an example. There is a lot you can do as part of
customizing your R startup


Customizing R Startup (contd.)

Let us have the following function written in the startup.Rprofile file
A <- function() {
  getwd()
}
The utility of startup.Rprofile is that here you can define all the
functions that you may wish to use frequently


Running R Code
Interactive Mode
You run R by typing the code at the R command prompt

Script Mode
You run your code written in a script file saved with the .R extension
Syntax: source("myprog.R")
Let us create the file and save it in the working directory path.
To get the working directory path use the getwd() command
Assume we have the following statement in the myprog.R file
cat("Welcome to R\n")
## \n is the escape sequence for a new line

Batch Mode
R CMD BATCH c:\Training\myprog.R c:\Training\myprog.Rout
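
Script mode can be tried out end to end with the small sketch below. It writes the one-line script into R's temporary folder and then runs it with source(). The use of tempdir() is an assumption made only so that the example runs on any machine; in practice you would save myprog.R in your working directory.

```r
# Create a one-line script file in the session's temporary folder
script <- file.path(tempdir(), "myprog.R")
writeLines('cat("Welcome to R\\n")', script)

# Run the script; its output appears on the R Console
source(script)
```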

R Tip
Set your R interface with R Console
& R Editor placed side-by-side
Write all your code in R Editor
Select the code, or keep the cursor on
the line which you wish to execute,
in R Editor
Click the run icon to execute the
code in R Console

Understanding R Data Structure


Variables in R
Scalars
Vectors
Matrices
Lists
Data Frames
Using c, cbind, rbind, attach and detach functions in R
Factors

Variables in R
Variable names in R are case sensitive (A and a are two different
variables in R)
A name can be alphanumeric and can contain _ or . as part of the variable
name
It cannot contain operators (+ - / * < > % =) or special characters like
~{}?#$@
A variable name cannot start with a number

You may be able to create a variable having the same name as
some built-in symbol. In that case you may not be able to use
that built-in symbol (so it is better to avoid such names)
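
A minimal sketch of the naming rules above. The variable names are made up for illustration; make.names() is a base R function that shows how R itself would repair an invalid name.

```r
# Valid names: letters, digits, _ and . are allowed
total_sales <- 100
total.sales <- 200
Total_Sales <- 300   # a different variable from total_sales (case sensitive)

# make.names() turns invalid names into valid ones
repaired <- make.names(c("2ndQuarter", "net-profit", "valid_name"))
print(repaired)
```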


Scalar Variables
Scalar Variable: a single-value variable. Scalars in R are vectors
of length 1

Note: In R you can use = or <- as the assignment operator
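
A small sketch of scalar variables, using illustrative names x, y and z. It shows both assignment operators and confirms that a scalar is simply a vector of length 1.

```r
x <- 5        # assignment with <-
y = 10        # assignment with =
z <- x + y    # arithmetic on scalars

print(z)
print(is.vector(x))   # a scalar is a vector ...
print(length(x))      # ... of length 1
```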


Vector Variables
Vector Variable: a sequence of values of the same mode (for example, a sequence of numbers)

Note that small x and capital X are two different vectors. R is case sensitive
c is the concatenate function
You can easily do mathematical operations on two vectors of the same size just
as you would do on two scalar variables
All vector elements must be of the same mode; it can be integer, numeric,
character (string), logical, etc.
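
The points above can be sketched as follows. The values are illustrative; the last lines show what happens when values of different modes are combined in one vector.

```r
x <- c(1, 2, 3, 4)       # c() concatenates values into a vector
X <- c(10, 20, 30, 40)   # X is a different vector from x

s <- x + X   # element-wise addition
p <- x * X   # element-wise multiplication
print(s)
print(p)

# Mixing modes: every element is converted to a common mode
mixed <- c(1, "a", TRUE)
print(mode(mixed))
```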


Matrices
A matrix variable is a two-way table structure having rows and columns

Note the subtle difference in the way the values get populated in matrices m & M

Also note that to create a matrix we use the R function named matrix,
passing certain arguments and values
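
A possible reconstruction of the two matrices, assuming the difference between m and M is the byrow argument (the original screenshot is not available, so this is a guess). By default matrix() fills values column by column; with byrow = TRUE it fills row by row.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)                # filled column-wise
M <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled row-wise

print(m)
print(M)
print(dim(m))   # number of rows and columns
```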


How to get help?

Let us get help on the matrix function
The help syntax is help(matrix) or ?matrix

The help content opens in your web browser


Lists
In a Vector all values must be of one mode type
If you wish to store values of different mode types then you
should use Lists. Sample Syntax:
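
A minimal sketch of creating a list. The element names and values (an employee record) are invented for illustration.

```r
# A list can hold values of different modes side by side
emp <- list(name = "Asha", age = 30, permanent = TRUE)

str(emp)             # compact view of the list structure
print(length(emp))   # number of elements in the list
```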


Lists (contd.)
Vectors in R are similar to arrays in C. Elements cannot be deleted
from Vectors; if you wish to delete elements, use Lists
Adding to Lists

Deleting from Lists

List having List and Vector as its elements
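
The three operations named above can be sketched as below, with made-up element names. Assigning NULL to an element is the usual way to delete it from a list.

```r
lst <- list(a = 1, b = "two")

# Adding to a list
lst$c <- c(3, 4, 5)
lst[["d"]] <- TRUE

# Deleting from a list
lst$b <- NULL
print(names(lst))

# A list having a list and a vector as its elements
nested <- list(inner = list(p = 1, q = "x"), vec = 1:3)
str(nested)
```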


Accessing List Elements


Note the difference between [[ ]] and [ ] in the two examples below

[ ] returns a sublist; [[ ]] returns a value

List elements can also be accessed using name tags as shown below
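
A short sketch of the three ways of accessing list elements, using an invented list with the name tags name and scores.

```r
lst <- list(name = "Asha", scores = c(80, 90))

sub <- lst[2]     # [ ] returns a sublist (still a list)
val <- lst[[2]]   # [[ ]] returns the value itself

print(class(sub))
print(class(val))

# Access by name tag
print(lst$scores)
print(lst[["scores"]])
```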


List unlist
unlist() returns a vector

E.g. 1: The List did not have name tags and as such the vector created from unlist
does not have names
E.g. 2: Name tags exist. The mode of the vector is character (LCD rule)
E.g. 3: Note the suffixes 1, 2, 3, and 4 given to the VectorElement tags of the List
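
The three cases can be reproduced with the sketch below. The list contents are assumptions (the original examples are not available); only the tag name VectorElement comes from the text above.

```r
# E.g. 1: no name tags, so the resulting vector has no names
u1 <- unlist(list(1, 2, 3))
print(u1)

# E.g. 2: name tags exist; mixing a number and a string gives a character vector
u2 <- unlist(list(id = 1, city = "Pune"))
print(u2)

# E.g. 3: a tagged vector element gets suffixes 1, 2, 3, 4 on its tag
u3 <- unlist(list(VectorElement = c(10, 20, 30, 40)))
print(u3)
```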

Summarizing List
Lists are a kind of vector that can store values of different modes
We can add / delete values from a list
List values can be given name tags
List elements can be accessed by [[ ]], [ ], or name tags
[[ ]] returns the value; [ ] returns a sublist
unlist() returns a vector; the mode of the vector created from unlist()
depends on the Least Common Denominator (LCD)
Finally, if we want the length of the list we can use length()


Data Frames
A Data Frame is used for storing data tables.
Very simply said, what we call a Table in SQL parlance, or a Dataset in
SAS, is called a Data Frame in R terminology
The columns are Vectors
Small e.g. to create a Data Frame
The first line of the data table, showing the
column names, is called the header.
Each horizontal line representing a record
is called a row
Each data member of a row is called a cell
The cell data is accessed by specifying its
row and column coordinates in [ ]
brackets
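
A small sketch of creating a data frame and accessing a cell. The column names and values are invented for illustration.

```r
# Each column is a vector; all columns have the same length
cust <- data.frame(
  Cust_ID = c("C1", "C2", "C3"),
  Age     = c(25, 40, 33),
  stringsAsFactors = FALSE
)

print(cust)
print(names(cust))      # the header (column names)
print(cust[2, "Age"])   # cell at row 2, column Age
print(cust[2, 2])       # the same cell, by column position
print(cust$Age)         # a whole column as a vector
```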

rm - Remove
rm is used to remove objects that are no longer needed
(cleanup)
Note: R does in-memory processing and hence it is
advisable to keep removing objects which are not required.

To get the objects currently in R memory use
the function ls()

To remove a column from a data frame you set
it to NULL
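
The cleanup steps above can be sketched as follows, with illustrative object names.

```r
scratch_values <- 1:10
d <- data.frame(a = 1:3, b = 4:6)

print(ls())          # objects currently in memory

rm(scratch_values)   # remove an object that is no longer needed
print(ls())

d$b <- NULL          # remove column b from the data frame
print(names(d))
```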


cbind, c, and rbind

Adding a column to a data frame using cbind

Creating new variables from existing fields

Reordering the columns

Adding a row using rbind

Note: c() adds them head to tail; cbind() combines them column-wise into matrix
form; rbind() adds them row-wise
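
A sketch of the operations listed above, in the same order. The data frame d and its columns are invented for illustration.

```r
# c(), cbind() and rbind() on two plain vectors
print(c(1:3, 4:6))       # head to tail: one vector of length 6
print(cbind(1:3, 4:6))   # 3 x 2 matrix
print(rbind(1:3, 4:6))   # 2 x 3 matrix

d <- data.frame(id = 1:3, sales = c(100, 200, 300))

# Adding a column using cbind
d <- cbind(d, region = c("N", "S", "E"))

# Creating a new variable from an existing field
d$sales_k <- d$sales / 1000

# Reordering the columns
d <- d[, c("id", "region", "sales", "sales_k")]

# Adding a row using rbind
d <- rbind(d, data.frame(id = 4, region = "W", sales = 400, sales_k = 0.4))
print(d)
```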

attach() and detach()


attach() takes a data frame or list as argument
It helps refer to the columns of the data frame or list without having to
prefix them with the object name
If there are objects in the Global Environment having the same names
as columns in the data frame / list, then those columns will have
to be accessed with the $ symbol
Using rm() we have removed the global variable, then detached
and re-attached the data frame


attach() and detach()


Note: In R, lists and data frames can only be attached at position
2 or above, and what is attached is a copy of the
original object. You can alter the attached values via <<- or assign(),
but the original list or data frame is unchanged.
To make a change to the original list or data frame, the column has
to be referred to along with the $ symbol
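
A minimal sketch of attach() and detach(), using an invented data frame named loans. It shows that an assignment to a bare column name leaves the original data frame unchanged, while an assignment through $ changes it.

```r
loans <- data.frame(amount = c(100, 200, 300), tenure = c(12, 24, 36))

attach(loans)
m1 <- mean(amount)        # columns can be used without the loans$ prefix

amount <- amount * 2      # creates a separate variable; loans is unchanged
unchanged <- loans$amount

rm(amount)                # remove the separate variable again
loans$amount <- loans$amount * 2   # this changes the original data frame

detach(loans)
attach(loans)             # re-attach to pick up the changed values
m2 <- mean(amount)
detach(loans)

print(c(m1, m2))
```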


Factors
We are creating a vector named
data and it is of type character

Using the factor function we are
converting the type from character
to factor
Note the Levels (please note, I have executed this command after
narrowing the R Console so that the integer values for some of the
levels can be displayed)

Factors provide an efficient way of storing data in R. If you have a large data frame having
a categorical variable, then factor() converts the categorical values into levels and each level
corresponds to an integer; for the factor column, this integer value is stored in the
data frame rather than the actual value.

I hope this clarifies things further
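
A small sketch of the conversion described above. The region values are illustrative; as.integer() reveals the integer code stored for each value.

```r
data <- c("East", "West", "East", "North", "West", "East")
print(class(data))        # character

fdata <- factor(data)     # convert character to factor
print(class(fdata))       # factor
print(levels(fdata))      # the distinct values, sorted
print(as.integer(fdata))  # the integer code stored for each value
```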



Factors (contd.)
From the previous e.g. and this e.g.
you can see that the levels are in
ascending order

Assigning Labels to Factor Levels
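
Labels can be assigned through the levels and labels arguments of factor(), as in this sketch. The rating codes and label names are made up for illustration.

```r
rating <- c(3, 1, 2, 3, 1)

# levels lists the raw values in order; labels gives the name for each level
f <- factor(rating, levels = c(1, 2, 3), labels = c("Low", "Medium", "High"))

print(f)
print(levels(f))
```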


Importing Data
Reading tabular data files
Reading CSV files
Importing data from Excel
Importing data from SAS
Accessing Database
Saving in Rdata
Loading Rdata Objects
Writing to files

read.table
The read.table function reads data from a txt / csv file and returns a Data
Frame
Arguments
file = <the file path>
sep = the field separator
header = TRUE if the first row of the data contains column names
stringsAsFactors = FALSE prevents character
variables from being converted to Factors
as.is = can be used to suppress factor conversion for
certain specific columns; TRUE will ensure suppression of factor
conversion
There are many other arguments; run the ?read.table command to get full help on all the arguments


read.table example
Sample data file
Loan_Cross_Sell_Logistic_Regression_Sample.CSV
Data Import Syntax

Note: The columns have been named V1, V2 ... V8



read.table (contd.)

Note that the columns now have the proper names, as given in the first row of
the data file
In case the file is tab delimited, the sep argument becomes
sep = "\t"
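
The effect of the header and sep arguments can be seen with the sketch below. It does not use the training data file; instead it writes a tiny three-column CSV (invented values) to a temporary file so that the example is self-contained.

```r
# Write a small sample CSV file to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Cust_ID,Age,Balance", "C1,25,1000", "C2,40,2500"), csv_file)

# Without header = TRUE the columns are named V1, V2, V3
no_header <- read.table(csv_file, sep = ",")
print(names(no_header))

# With header = TRUE the first row supplies the column names
with_header <- read.table(csv_file, sep = ",", header = TRUE,
                          stringsAsFactors = FALSE)
print(names(with_header))
str(with_header)

# A tab-delimited file is read the same way, with sep = "\t"
tab_file <- tempfile(fileext = ".txt")
writeLines(c("Cust_ID\tAge", "C1\t25"), tab_file)
tab_data <- read.table(tab_file, sep = "\t", header = TRUE,
                       stringsAsFactors = FALSE)
```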


read.fwf
read.fwf is used to read a Fixed Width Format file
~ is to be replaced by the full folder path

Note that the headers, if present, should be separated by some
separator; the default separator is tab "\t"

Let us check the data types / class of
the two variables
Cust_ID is an identifier field. If we do not wish to have automatic factor
conversion for character variables then use the option stringsAsFactors = FALSE
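
A minimal sketch of read.fwf on a made-up fixed-width file in which the first 4 characters are Cust_ID and the next 2 are Age. The file contents, the widths, and the Age column are assumptions for illustration; only the Cust_ID name comes from the text above.

```r
# Write a small fixed-width file to a temporary location
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("C00125", "C00240"), fwf_file)

# widths gives the number of characters in each field
fw <- read.fwf(fwf_file, widths = c(4, 2),
               col.names = c("Cust_ID", "Age"),
               stringsAsFactors = FALSE)

print(fw)
print(class(fw$Cust_ID))   # character, because factor conversion is off
print(class(fw$Age))
```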


Importing data from MS Excel


Importing using RODBC Library

Importing using XLConnect Library
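
A hedged sketch of both approaches. The file path, the sheet name, and the helper function names are hypothetical. The RODBC and XLConnect packages must be installed separately, and odbcConnectExcel() is available only on Windows. The helpers are only defined here and are called only if the package and the file are actually present.

```r
# Hypothetical location of the Excel workbook
excel_file <- "C:/Training/sample.xlsx"

# Importing using the RODBC package (Windows only; helper name is illustrative)
import_excel_rodbc <- function(path, sheet_name) {
  channel <- RODBC::odbcConnectExcel(path)
  on.exit(RODBC::odbcClose(channel))   # always close the connection
  RODBC::sqlFetch(channel, sheet_name)
}

# Importing using the XLConnect package (helper name is illustrative)
import_excel_xlconnect <- function(path, sheet = 1) {
  wb <- XLConnect::loadWorkbook(path)
  XLConnect::readWorksheet(wb, sheet = sheet)
}

# Run the import only when the package and the file are available
if (requireNamespace("XLConnect", quietly = TRUE) && file.exists(excel_file)) {
  xl_data <- import_excel_xlconnect(excel_file)
  str(xl_data)
}
```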


Importing data from SAS dataset


To import data from SAS you require the sas7bdat library

Load the library and call the read function to import from a SAS dataset


Reading data from Database


Database Access using RODBC package
Open Database Connection

with trusted connection

with login / password

Get the contents of a database table in Data Frame

Close Connection
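
The steps above might look like the sketch below. The DSN, server, database, table, and function names are all hypothetical, and the RODBC package must be installed. The functions are only defined here; calling them requires a working ODBC data source.

```r
# Open with login / password, fetch a table into a data frame, close connection
fetch_table <- function(dsn, table, uid = "", pwd = "") {
  channel <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  on.exit(RODBC::odbcClose(channel))   # close the connection when done
  RODBC::sqlFetch(channel, table)
}

# Open with a trusted connection (connection string values are placeholders)
fetch_table_trusted <- function(server, database, table) {
  conn_string <- paste0("driver={SQL Server};server=", server,
                        ";database=", database, ";trusted_connection=true")
  channel <- RODBC::odbcDriverConnect(conn_string)
  on.exit(RODBC::odbcClose(channel))
  RODBC::sqlFetch(channel, table)
}

# Example call (not run here):
# customers <- fetch_table("MyDSN", "Customers", uid = "user", pwd = "secret")
```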


Saving Objects
Let us start the R session afresh and try the below

Use the header=TRUE option if the column headers are the first row in
the file


Loading Saved Objects


Loading a saved image will load all the objects which were in memory
at time of saving the image

Or you may choose to load only specific saved objects
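
A minimal sketch of saving and loading objects with save() and load(). The object names and the temporary file locations are assumptions for illustration. To save the whole workspace in one go, save.image() can be used in place of save().

```r
a <- 1:5
b <- "hello"

all_file <- file.path(tempdir(), "session_objects.RData")
one_file <- file.path(tempdir(), "only_a.RData")

save(a, b, file = all_file)   # save several objects together
save(a, file = one_file)      # save only one specific object

rm(a, b)                      # simulate starting afresh

load(one_file)                # brings back only a
print(a)

load(all_file)                # brings back a and b
print(b)
```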


Writing / Exporting data to a file


write.table

Note: by default the write.table command writes the row names (here row numbers are the row names)
as an additional column in the output file. To avoid this, use the option row.names = FALSE
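
A small sketch of write.table with and without row names. The data frame and the temporary output files are invented for illustration.

```r
d <- data.frame(id = 1:2, name = c("A", "B"))

with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.table(d, with_rows, sep = ",")                       # row names included
write.table(d, without_rows, sep = ",", row.names = FALSE) # row names dropped

print(readLines(with_rows))
print(readLines(without_rows))
```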


Writing / Exporting (contd.)


Writing output of summary statistics or other things to a file
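
One way to do this is with sink(), which redirects console output to a file until sink() is called again with no arguments. This is a sketch under the assumption that sink() is the intended approach; the data and file name are illustrative.

```r
values <- c(1, 2, 3, 4, 100)
out_file <- tempfile(fileext = ".txt")

sink(out_file)           # start sending console output to the file
print(summary(values))
sink()                   # stop redirecting; output returns to the console

print(readLines(out_file))
```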


Thank you
End of Part 1