Section 1
PART 1 WHAT IS R
Introduction
What is R?
Why R?
Summary
PART 2 INSTALLATION
Introduction
How can R be installed?
Overview of the R GUI
Installation of R Studio
Overview of R Studio GUI
Summary
Section 2
PART 1 DATA TYPES
Introduction
Why are DATA TYPES important?
Data types
Creating data types in R
Summary
PART 2 DATA STRUCTURES & VECTORS
Introduction
What is a data structure
What is the difference between data structure and data type
Type of data structure: Vectors
How to create a Vector in R
Mixing up data types in a Vector
Replacing the contents of a Vector
Arithmetic functions between Vectors
Identifying elements in a Vector
Replacing contents in a Vector
Using a function to Index
Speeding up the task with Operators
Summary
PART 3 DATAFRAMES
Introduction
What is a Dataframe
Creating a Dataframe in R
Functions that can be carried out with Dataframes
Summary
PART 4 LIST & MATRIX
Introduction
What is a List
What is a Matrix
Creating a List in R
Creating a Matrix in R
Creating a Matrix out of a Dataframe
Summary
PART 5 FACTORS
Introduction
What is a factor
How to create a factor in R
Summary
Section 3
PART 1 PACKAGES
Introduction
What is a Package
Installing and Loading a Package
Importing an Excel file into R
Importing a CSV file into R
Summary
PART 2 Exporting and Reading data in R
Introduction
Exporting to Excel
Exporting to CSV
Reading a file in R
Summary
Section 4
PART 1 Logical operators and If condition
Introduction
What is a logical operator
How to execute a logical operator in R
What is IF Condition in R
Summary
Hello,
Thanks to the many students who have signed up for our courses, we are delighted to
offer all our online lectures as downloadable material. We know that learning should
be continuous, so through this material we hope that you will take your time within
your busy schedule to really understand the concepts and techniques of this
fascinating open source tool R!
We have tried to make your learning easy by highlighting key takeaways and screen
grabs of the tool so that you can continue your learning offline as well.
Enjoy the learning experience and thank you for choosing us to be your partner in
your journey of discovery!
All rights reserved. Without limiting rights under the copyright reserved above, no
part of this publication may be reproduced, stored, introduced into a retrieval
system, distributed or transmitted in any form or by any means, including without
limitation photocopying, recording, or other electronic or mechanical methods,
without the prior written permission of the publisher, except in the case of brief
quotations embodied in critical reviews and other non-commercial uses permitted by
copyright law. The scanning, uploading, and/or distribution of this document via the
internet or via any other means without the permission of the publisher is illegal and
punishable by law. Please do not participate in or encourage electronic piracy of
copyrightable materials.
Here are a few things that you would probably find helpful before you begin.
Sections:
There are 5 sections in this material starting with the installation of R and R Studio
right up to generating a Word Cloud in R which showcases the text mining capabilities
of R.
Online videos:
This material is a supplement to the online videos available on
https://www.udemy.com/analyticstraining/?dtcode=VQRaQsx1KWR2
This material corresponds to the section on R.
Material:
The online class format supports downloadable material for each section. So perhaps
it would be a good idea to check each section for additional downloadable material
like case studies or sample data to work on and so forth.
Ready to begin?
INTRODUCTION
R is an open source statistical tool which not only manages data but also carries out
many sophisticated analytical processes.
Before looking at how R works, it is important to get a good overview of R.
- What is R?
- Why R?
WHAT IS R?
2. R can manage and analyse data. It can execute a wide range of statistical
techniques such as linear regression, logistic regression, forecasting, decision
trees and many others.
WHY R?
So what makes R stand out when compared to other statistical tools? Let us break it
down.
1. Firstly, R can work with any type of data and can handle data of any size. So
whether the data you are working with is small or really big, R will be able to
handle it.
2. R can work with data received in any type of file format, whether text, CSV,
SAS and so on.
3. R offers really great visualization of data. It can connect with Google maps and
Motion charts.
4. Next, and this is what makes R so much more powerful than other statistical
tools: it is open source. Open source does not just mean that it can be used
for free, but that anyone can contribute to it as well.
6. As was mentioned earlier, R being open source means anyone can contribute
to it. This is why R has a huge community of contributors who almost on a
daily basis keep adding functionality to it. This is the reason why even the
most complicated techniques can be executed in R by just calling a function.
So, when using R we as users do not need to worry about how to perform a
linear regression or a logistic regression. The code to execute these and many
other advanced analytical functions is already built in and refined by those in
the R community on a regular basis.
SUMMARY
In this material, we covered the reasons which make R a powerful statistical tool.
To summarize,
R can work with big or small data, and also with the different formats in which
data is usually presented.
It does not use much code and offers great data visualization making it a
popular statistical tool in many global corporations.
INTRODUCTION
- Installation of R
- Overview of a typical GUI or Graphic User Interface of R
- Installation of R Studio
- Overview of the GUI of R Studio
http://cran.r-project.org/bin/windows/base/old/3.0.2
When you update your version of R, the earlier version is NOT automatically
uninstalled. Further, R Studio allows you to run multiple versions of R (though not in
the same session). Therefore, in R Studio, find out which version of R is running by
typing R.Version().
The default version of R that R Studio runs can be changed from Tools > Options > R
General.
Before proceeding with the rest of this tutorial, we suggest that you download R if
you have not already done so.
At first glance, all that is visible on the R GUI is a single screen, which is known as the
Console.
The Console can be used to input data as well as view output. But we recommend
that the Console is used to only view output.
Commands or inputs in R are referred to as Scripts. To write a script, go to File and
select New Script.
Think of a Script in R as code or syntax that is written in order to tell R what it needs
to execute.
For example, let's enter a = 1, which means R is being told to create a variable a and
store a value of 1 against it.
To execute this script or code, press Control + Enter. As shown in the image below,
the command is executed and the output is displayed in the Console.
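In script form, the assignment described above could look like the following minimal sketch. The second line, which prints the variable by typing its name, is an addition for illustration.

```r
# Create a variable a and store the value 1 against it
a = 1

# Typing the variable name on its own prints its value in the Console
a
```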
INSTALLATION OF R STUDIO
A more user friendly option available to users is R Studio. It has a better GUI and
comes with more options.
http://www.rstudio.com/ide/download/
The first section is the Editor section where the script or code that R needs to execute
is written. To add more than one script, use the plus sign on the top left hand corner.
It is possible to add as many scripts as required using this option.
Using the example looked at earlier, the script or code a = 1 is entered. To execute
this code, press Control plus Enter.
The output appears in the Console window which can be found right below the Editor
window. When values appear in the Console section it means that the script or the
code has been executed.
Section 3: Workspace
To the right of the Editor section, is the Workspace section, where the data being
worked on can be viewed.
This includes even data that has been imported from an external source.
In the example used, a new variable a with a value of 1 was created. Since this is
the data currently being worked on, both a and 1 are displayed in the Workspace
section.
The other sections in R Studio are File, Plots, Packages and the Help section.
The Help section helps in locating functions in R Studio. In the Search field, type what
is being searched for and press Enter. For example, if plot is entered in Search,
everything related to it is displayed just below.
The other tabs available are packages (which will be covered later), Plot and Files
which displays all the files that are currently being worked on.
For the rest of the series of tutorials on R, we will be working with R Studio as it has a
better GUI than R Editor.
We are now ready to start using R to manage data and carry out other types of
actions on data.
To summarize:
An overview of the typical GUI of R has been looked at. Since individual
screens need to be opened, a better option is R Studio.
R Studio is more user friendly as all the relevant sections are available at a
single glance removing the need to have multiple screens open at a time.
INTRODUCTION
What is Analytics without data? Likewise, how can you leverage the amazing
capabilities of R without understanding data?
To begin, it is important to understand why data types are useful and why it is
necessary to be able to distinguish between different types of data.
Suppose you have been asked to evaluate five different brands of cars. Let us call
them Brand A, Brand B, Brand C, Brand D and Brand E. If you were asked to calculate
the mean of these five cars, how would you go about it?
It most likely would be an impossible operation to carry out because all you have is
the name or the brand of these cars and as you know you cannot calculate the mean
of names!
Now, the situation would have been different if you had some numeric data about
these cars. This emphasizes the need to understand the type of data you have to
work with, because certain functions can be carried out only on certain types of
data. For example, calculating a mean is not possible with character data types like
names or brands.
Data can be of different types. The different types of data one would commonly
come across are:
Numeric:
Integer:
Eg: 1, 2, 3..
Logical:
Character:
Factor:
Numeric data type is any number or numeric value like 2.1, 1.2 and so on. It could be
an integer or a decimal value.
In R Studio, to create a numeric data type the syntax y<-3.1 (or y is equal to 3.1) is
used. This means that a variable y is being created against which a numeric value of
3.1 is being stored.
Integer data type indicates any data which stores integer values.
In R Studio, numeric data types can be converted to integer data types by using the
following syntax:
as.integer(numeric value)
Eg: as.integer (3.1)
Logical data type indicates any data where the value is either True or False, but never
both.
In R Studio, the following syntax can be used to create a logical data type:
if x <-1, y<-2, then x > y is FALSE
(x is equal to 1, y is equal to 2, then x being greater than y is false)
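The comparison described above can be tried out directly. This is a minimal sketch; the variable name result is introduced here only for illustration.

```r
# Store 1 in x and 2 in y
x <- 1
y <- 2

# Comparing the two values produces a logical value
result <- x > y
result          # FALSE, because 1 is not greater than 2
class(result)   # "logical"
```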
In R Studio, character values have to be written within double quotes. For example,
the text learning would be written as "learning".
Factor data type refers to categorical types of data like gender or cities.
In R Studio, let's create a numeric data type with a variable name of num1 and a
numeric value of 3.1 stored against it.
num1<-3.1
The output is displayed in the Console area in blue indicating that the code has been
executed. Simultaneously, the values num1 and 3.1 are displayed in the Workspace
section.
In order to identify the data type of the variable num1, use a function called class.
Type the word class followed by the name of the variable in brackets as shown below.
class(num1)
In the Console area numeric is displayed indicating that the data type of num1 is
numeric.
In the example used above, the number 3.1 when converted to an integer gives a
value of 3. To convert this numeric data type to an integer data type in R Studio, the
function as.integer(numeric variable) is used. R is case sensitive, so the function
name must be written in lower case.
Let us create a new variable num3 and store the integer value against this variable.
num3 <-as.integer(num1)
and press Control + Enter. The values will be displayed in the Workspace section
when the code is executed.
In order to determine the data type of num3, the following function will be used
class(num3)
To print the value of any variable like num1 or num3, enter the name of the variable, say
num3
The value will be displayed in the Console like in the case of num3, where the value 3
is displayed.
To create a character data type in R, let us create a variable char1 and store a value
of "hello" against it.
char1<-"hello"
Remember that assignment can also be written using the equal to sign (=) in place of <-.
When this code is executed, in the Workspace section a variable char1 has been
created and a value "hello" stored against it.
To find out the data type of this variable, use the class function discussed earlier.
Enter the code
class(char1)
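Putting the steps of this part together, a short sketch using the same variable names as the text (num1, num3 and char1) might look like this:

```r
# Numeric data type
num1 <- 3.1
class(num1)     # "numeric"

# Converting to an integer drops the decimal part
num3 <- as.integer(num1)
num3            # 3
class(num3)     # "integer"

# Character data type: the value is written within double quotes
char1 <- "hello"
class(char1)    # "character"
```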
Logical and factor data types will be discussed in more depth in a later section.
We have covered important data types in this tutorial. Understanding these data
types will help in managing and working with data in R.
To summarize:
It is important to understand data types in order to determine what type of
actions can be carried out with a specific type of data.
Different data types are available: numeric, integer, character, logical and
factor.
Different data types can be created in R using the proper syntax. Eg: num1<-3.1,
as.integer(3.1), char1<-"hello"
INTRODUCTION
The data that you are working on needs to work for you. In other words, it has to be
arranged in a way that helps you manage, store and analyze it better.
Let us understand this better with the help of an example. Shown here is a table with
different types of information stored in it.
When storing information of different types, it will need to be stored across more
than one variable. For example, if the data to be stored relates to employee records, then
the variables across which this data would be stored would be Name, Age, Address,
Nationality, Assessment scores and so on. This collection of information displayed
across different variables is referred to as a data structure.
A data structure is different from data type because of the number of values stored.
Let's look at this with the help of an example. If a variable Name has been created,
and a value "Bob" stored against it, it will result in the creation of a character data
type. In a data type only one value is stored.
But when different information related to Bob apart from his name, is stored, like his
age, address, nationality and assessment score then it results in the creation of a data
structure. A data structure stores more than one value.
A simple way to look at a data structure is to think of an Excel sheet with rows and
columns where the columns are made up of different data types. In the example
used, the Name column will store character data types, the Age column will store
integer data types, the Score column will store numeric data types and so on.
The first type of data structure that will be discussed is referred to as Vectors.
A Vector is like a column in an Excel sheet. Going back to the example used earlier,
Vectors would be Name, Age, Address, Nationality and so on.
All the elements within a Vector should be of the same data type.
So, if Age is a Vector, then all the elements under Age should be of the data type
integer. This Vector cannot hold values of any other data type, such as character,
nor can it hold a combination of data types.
A Vector is therefore a data structure which contains elements of the same data
type. Visualize a single column in an Excel sheet which contains values of the same
data type.
vector1<-c(9,8,2,7)
First, the Console will display vector1 with its corresponding values.
Second, in the Workspace section the variable vector1 will be displayed along with
the data type of its values (which is numeric) and the number of values, which is 4.
Now let us look at something interesting. As discussed, a Vector can only contain
elements of the same data type. There can be no mixing of data types within a
Vector. So what happens if a second Vector is created and along with numeric data
types, a character data type is inserted into it?
Shown here is the code to create a new Vector called vector2 with some values.
Inserted into these values is a character value "bob".
When the contents of vector2 are printed, all values in the Vector are displayed in
the Console in quotes. This indicates that by default R has converted all numeric data
types in the Vector to character data types by adding quotes to all the numbers. This
is why R does not display any error on executing this code!
R recognizes the rule of common data types and converts uncommon values to a
single data type.
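The original values of vector2 are not reproduced in this text, so the sketch below uses a shorter, hypothetical Vector called mixed to show the same conversion next to the numeric vector1 from earlier.

```r
# A Vector of numbers only: the data type is numeric
vector1 <- c(9, 8, 2, 7)
class(vector1)    # "numeric"
length(vector1)   # 4

# Hypothetical values: numbers mixed with one character value
mixed <- c(1, 2, "bob", 4)
mixed             # "1" "2" "bob" "4"
class(mixed)      # "character": every number was converted to text
```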
Values in a Vector can also be overwritten. So one data type can always be replaced
with another data type within the same Vector.
In the example we looked at earlier, vector2 contains 11 values all of which are
character data types. Suppose we want to replace these 11 values with 4 values of
numeric data type. These 4 values are 1, 2, 3 and 8.
vector2<-c(1,2,3,8)
In the Workspace the data type of vector2 has now changed to numeric and has 4
values stored against it.
It is also possible to carry out arithmetic functions between Vectors like addition,
subtraction, multiplication and division. The only prerequisite for a clean
element-by-element result is that the Vectors should be of equal length.
As you can see in the Workspace, both vector1 and vector2 are of numeric data type
and have 4 values each, which means they are both of the same length.
It is possible to carry out any type of arithmetic function on these 2 vectors such as
vector1 + vector2 or vector1 - vector2 and so on.
vector1 + vector2
Let's cross check these values. vector2 comprises the values 1, 2, 3 and 8. To check
the values of vector1, enter vector1 and press Control + Enter. The values displayed
in the Console are 9, 8, 2 and 7.
So, when 9, the first element of vector1, is added to 1, the first element of vector2,
the result is 10 which is shown in the Console.
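The element-by-element behaviour can be checked with the two Vectors used in this part. Subtraction and multiplication are included as a sketch of the other arithmetic functions mentioned above.

```r
vector1 <- c(9, 8, 2, 7)
vector2 <- c(1, 2, 3, 8)

# Each element is combined with the element in the same position
vector1 + vector2   # 10 10  5 15
vector1 - vector2   #  8  6 -1 -1
vector1 * vector2   #  9 16  6 56
```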
In this example the vectors were both of equal length. Let's look at what happens
when the vectors are of unequal length.
vector1 has 4 elements. Let us add this to a new vector which has 3 elements: 1, 2
and 3.
vector1 + c(1,2,3)
On executing this code, a warning message is displayed in the console but the
addition function has still been executed. How?
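One way to see what happened is the sketch below. Base R reuses (recycles) the elements of the shorter vector from the start until it matches the length of the longer one, and warns because 4 is not a multiple of 3. The names shorter and total are introduced here only for illustration.

```r
vector1 <- c(9, 8, 2, 7)
shorter <- c(1, 2, 3)

# R treats shorter as c(1, 2, 3, 1) for this addition and issues the warning
# "longer object length is not a multiple of shorter object length"
total <- vector1 + shorter
total   # 10 10 5 8
```

The fourth value, 8, comes from adding 7 to the first element of the shorter vector, which R used a second time.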
For example, we know that vector1 contains 4 elements: 9, 8, 2 and 7. Let us suppose
that we want to find out the third element in vector1, which is 2.
vector1[3]
We can see a value of 2 displayed in the Console which, as we know, is the third
element in vector1.
So to index a Vector, next to the name of the Vector enter within square brackets the
number of the element that needs to be accessed. Eg, vector1[3]
Now, let us suppose that we want to create a new Vector called new_vector. In this
new Vector we want to populate the same elements as vector 1 but without the
second element. So in new_vector we only want to store the first, third and fourth
elements of vector 1.
new_vector<-vector1[-2]
Entering minus next to 2 indicates that we want to exclude the second element of
vector 1 in new_vector.
When the code is executed we can see in the Workspace section that the vector
new_vector has been created with three values of numeric data type.
To view the contents of new_vector, enter the name of the vector and press Control
+ Enter.
In the console, 9,2 and 7 are displayed. 8 is not displayed as it is the second element
in vector 1 and hence has been excluded.
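Both kinds of indexing described above, picking one element and excluding one element, can be sketched together as follows:

```r
vector1 <- c(9, 8, 2, 7)

# A positive number in square brackets picks the element at that position
vector1[3]                 # 2

# A negative number excludes the element at that position
new_vector <- vector1[-2]
new_vector                 # 9 2 7
length(new_vector)         # 3
```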
If a Vector has only three elements but a value of 10 is entered in square
brackets, then we are trying to index an element beyond the number of elements in
the Vector. If there are only 3 elements in a Vector, then how can you locate the
10th element?! Hence the term Index out of Boundary.
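As a small sketch of what base R does in this situation (the Vector name and values below are hypothetical): single square bracket indexing past the last element does not stop with an error, it returns NA, which is R's marker for a missing value.

```r
# Hypothetical Vector with only three elements
short_vector <- c(5, 6, 7)

# Asking for the 10th element returns NA rather than a value
short_vector[10]          # NA
is.na(short_vector[10])   # TRUE
```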
Indexing in Vectors can also be done with the help of logical functions. Here's how.
Let us create a new Vector called vector2 with 4 elements in it: 1, 2, 3, 4.
Enter the code
vector2=c(1,2,3,4)
We already have a Vector called vector1 which has the elements 9, 8, 2, 7.
Let us now use a logical function to find the third element in vector1.
vector1[vector2==3]
By entering vector2==3, we are trying to locate the position of the value 3 in
vector2. The value 3 is the third element in vector2. So, when the code is executed,
the third element in vector1 should be displayed in the Console. Since the third
element in vector1 is 2, we should be able to see this number in the Console.
Using the position of 3 in vector2, the logical function tries to find the equivalent
position in vector1.
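To make the intermediate step visible, the sketch below first prints the logical values produced by the comparison and then uses them as the index. The call to which() is an addition for illustration; it reports the matching position as a number.

```r
vector1 <- c(9, 8, 2, 7)
vector2 <- c(1, 2, 3, 4)

# The comparison gives one TRUE or FALSE per element of vector2
vector2 == 3            # FALSE FALSE  TRUE FALSE

# Only the positions marked TRUE are picked from vector1
vector1[vector2 == 3]   # 2

# which() returns the position of the TRUE value
which(vector2 == 3)     # 3
```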
Operators help in executing certain types of tasks quickly and more efficiently. Let us
understand this better with the help of an example.
Let us assume that a Vector called Age needs to be created which needs to store the
first 100 natural numbers i.e., numbers from 1 to 100. One way to execute this is to
write the code age<-c(1,2,3,...) and so on, listing all numbers till 100. This
obviously is not a feasible option. Sometimes numbers could run till 100, at other
times till even 1000!
In these types of situations, a good option would be to use Operators. Here are a few
common Operators that are used in R.
Colon Operator
The Colon Operator can be used to create Vectors like the Age Vector quite easily by
using the code
age<-1:100
To view the contents of the Vector, enter age and press Control + Enter.
In the Console values from 1 to 100 are displayed.
Sequence Operator
Let us suppose that a Vector is to be created with some numbers which are not
continuous but have some sort of order to them. An example would be 1, 3, 5, 7, 9 and
so on. To create this Vector, the Sequence Operator can be used.
Let's create a Vector called Age and populate it with the values 1, 3, 5, 7, 9 and
so on till 101. To do this, enter the code
age<-seq(1,101,2)
In the code entered, 1 represents the start point, 101 represents the end point and 2
represents how the numbers should increment.
Sequence operators can populate vectors with data that follow a logical sequence.
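Both shortcuts can be compared side by side in the sketch below. Note that in R, seq is strictly a function rather than an operator; the name odd_age is introduced here so that the two Vectors do not overwrite each other.

```r
# Colon operator: every whole number from 1 to 100
age <- 1:100
length(age)       # 100

# seq(start, end, step): 1, 3, 5, ... up to 101
odd_age <- seq(1, 101, 2)
head(odd_age)     # 1 3 5 7 9 11
length(odd_age)   # 51
```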
So, now we know that vectors can be created in 3 different ways: firstly through c or
concatenate, secondly with a colon operator and lastly with a sequence operator.
To summarize:
Data structures are needed to store and organize data.
A Vector is a data structure which can store values of a single data type like
only characters or only numbers or only integers.
INTRODUCTION
Vectors, the first type of data structure that was looked at, are actually quite closely
linked to the next type of data structure that will be discussed.
- Understanding Dataframes
- Creating Dataframes in R
- Different functions related to Dataframes
WHAT IS A DATAFRAME
Shown here is a table with columns like Name, Age and Score.
Each column is in fact a Vector. So, Name constitutes one Vector, Age another and
Score another. So, a Dataframe is nothing but a collection of Vectors of equal length.
Here is a table with some data. This data needs to be converted into a Dataframe
called Records.
The table has 4 columns which individually become 4 Vectors in the Dataframe.
So, the first step in creating the Dataframe is to create the four Vectors.
To create the four Vectors in R Studio, viz, Name, Gender, Age and Income, enter the
code
The order in which the Vectors are entered is important. If Name is entered first, it
will be the first Vector displayed in the Dataframe. Likewise if Gender is entered first
it will be the first Vector displayed in the Dataframe.
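The steps above can be sketched as follows. Only the column names and the entry Gopal (gender M) in the second row come from this material; every other value is made up for illustration. The argument stringsAsFactors = FALSE is an addition so that character columns stay character regardless of the R version in use.

```r
# Step 1: create the four Vectors (values are hypothetical apart from Gopal / M)
Name   <- c("Asha", "Gopal", "Meena", "Ravi")
Gender <- c("F", "M", "F", "M")
Age    <- c(28, 35, 41, 30)
Income <- c(42000, 55000, 61000, 48000)

# Step 2: combine them; the order given here is the column order in the Dataframe
Records <- data.frame(Name, Gender, Age, Income, stringsAsFactors = FALSE)

Records        # prints the whole Dataframe
dim(Records)   # 4 4: four rows and four columns
```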
On executing this statement, the Console shows that the code has been executed.
In Workspace the name of the Dataframe Records is displayed together with its
Vectors, their data types and the number of values in each.
When we double click on Records we can see the entire Dataframe displayed.
To find out the names of the variables in a Dataframe, enter the code names
followed by the name of the Dataframe whose variables need to be determined.
So, to find out the names of the variables in the Dataframe Records, enter the code
names(Records)
The Console displays all the variables of the Dataframe Records, which are
Name, Gender, Age and Income.
This is a useful function when working with a Dataframe that contains a large number
of variables.
In this tutorial, a simple Dataframe with just four variables has been created. There
could be a situation where a large Excel table with lots of variables is imported and
the names of the variables used in this table need to be determined.
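To locate a specific value in a Dataframe, specify its row and column within square
brackets after the name of the Dataframe. For example, enter the code
Records[2,2]
In the code, the first 2 indicates the second row in the Dataframe where the value
Gopal is displayed. The second 2 indicates the second column, Gender.
When this code is executed, M is displayed in the Console indicating that the gender
of Gopal is male.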
It is also possible to view all the elements in any row of a Dataframe. For example, to view
the elements of only the first row in the Dataframe Records, enter the code
Records[1, ]
The space left after the comma indicates that all the elements in the row need to be
fetched. When this code is executed, all the elements of row 1 of Records are
displayed in the Console.
Several rows can also be viewed at once. To view rows 1 to 4, enter the code
Records[c(1:4),]
On executing this code the elements of the 4 rows are displayed in the Console.
Given below is also the code to access only rows 3 and 4, and the resulting output in
the Console.
There are three ways to find out the content/s of a particular column in a Dataframe.
Let us look at each of these with the help of an example.
Let us find out the contents of the column Name in the Dataframe Records.
Records$Name
On executing this code we can see all the values under the Name column displayed.
Records[,"Name"]
Here the first field is empty because it relates to rows and the second field is the
name of the column whose contents are to be retrieved.
On executing this code the contents of the column are displayed in the Console.
Records[,1]
In this case we know that Name is the first column in Records. On executing this
code, the contents of Name are displayed in the Console.
Of the three ways to find the values in a column, two work with the name or the label
of the column and the third requires the number of the column.
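The row and column lookups described above can be tried together in one short sketch. The values in Records are hypothetical and stand in for the tutorial's table.

```r
# Hypothetical sample values (assumed for illustration)
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Ravi", "Umesh", "Meera", "Anita"),
  Gender = c("M", "M", "M", "M", "F", "F"),
  Age    = c(20, 21, 25, 30, 24, 23),
  Income = c(30000, 35000, 45000, 60000, 50000, 38000),
  stringsAsFactors = FALSE
)

Records[2, 2]        # single value: row 2, column 2
Records[c(3, 4), ]   # only rows 3 and 4, all columns

# Three ways to fetch the contents of the column Name
by_dollar <- Records$Name        # by label, using the dollar sign
by_label  <- Records[, "Name"]   # by label, inside square brackets
by_number <- Records[, 1]        # by position of the column
```

Looking up a column by its label is usually safer than by its number, because the code keeps working if the order of the columns changes.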
To add a column to a Dataframe, the only prerequisite is that the new column to be
added should be the same size as the other columns in the Dataframe.
In the Dataframe Records, there are 6 rows and 4 columns. To add a fifth column to
this Dataframe enter the code
Records$newc<-(100:106)
When this code is executed, an error is displayed in the Console as 100:106 contains
seven values and not six. To match the six rows of Records, enter the code
Records$newc<-(100:105)
On executing this code in the Workspace the number of columns in Records is now 5.
Also, when Records is opened, the column newc is displayed with values from 100 to
105.
It is also possible to remove a column from a Dataframe. For example, to remove the
column newc from the Dataframe Records enter the code
Records$newc<-NULL
When this code is executed the data in the Workspace is updated to show only 4
columns.
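Adding and removing a column can be sketched as below, again with hypothetical values in Records. The helper names msg and cols_after_add are assumptions made for this illustration; tryCatch is used only so that the failed attempt does not stop the script.

```r
# Hypothetical sample values (assumed for illustration)
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Ravi", "Umesh", "Meera", "Anita"),
  Gender = c("M", "M", "M", "M", "F", "F"),
  Age    = c(20, 21, 25, 30, 24, 23),
  Income = c(30000, 35000, 45000, 60000, 50000, 38000),
  stringsAsFactors = FALSE
)

# Seven values do not fit into six rows, so R reports an error
msg <- tryCatch({
  Records$newc <- 100:106
  "no error"
}, error = function(e) conditionMessage(e))
msg

# Six values match the six rows, so the column is added
Records$newc <- 100:105
cols_after_add <- ncol(Records)

# Assigning NULL removes the column again
Records$newc <- NULL
ncol(Records)
```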
In this tutorial we have looked at another important data structure called Dataframe.
To summarize:
There are various types of functions that can be carried out with Dataframes.
These are:
a. Printing the contents of a Dataframe
b. Indexing or locating specific values in a Dataframe
c. Finding out the values of a row in a Dataframe
d. Finding out the values of a column in a Dataframe
e. Adding a column with values to a Dataframe
f. Removing a column from a Dataframe
INTRODUCTION
This tutorial will deal with two more types of data structures: List and Matrix.
- Understanding Lists
- Understanding a Matrix
- Creating Lists in R
- Creating a Matrix in R
- Different functions related to List and Matrix
WHAT IS A LIST
Just like a Dataframe, a List is also made up of Vectors. But unlike the Dataframe, the
Vectors in a List can be of equal or unequal length.
However, each Vector in a List should comprise elements of a single data type.
For example, in the table below, n, s and b are 3 Vectors of different data types: numeric,
character and logical. They are not all of equal length, as n has 3 elements while s and
b have 5 each. A combination of these Vectors can make up a List.
n    s    b
2    aa   TRUE
3    bb   FALSE
5    cc   TRUE
     dd   FALSE
     ee   FALSE
Unlike a Dataframe where each column stores different elements like Name or Age,
in a Matrix all the columns need to have the same type of elements either only
numbers or only characters and so on.
A Matrix cannot have one column with character data types, one column with
integers and so on.
To understand how to create a List in R, the table shown earlier will be converted into
a List.
In that table, column n has only numeric data, s has only character data and b
contains logical data (only TRUE or FALSE). Each of these columns is a Vector.
To create these Vectors, enter the code
n = c(2,3,5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
In the Workspace section, each Vector n, s and b is displayed together with its data
type and the number of values in it.
To create a List X with the three Vectors n, s and b and a fourth element, the number 3,
enter the code
X = list(n,s,b,3)
On executing this statement, in the Workspace section X with a value List against it is
visible. It also indicates that the List has 4 elements.
To view the contents of the List, select the name of the List and press Enter. The
values in X will be displayed in the Console.
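The steps for the List can be run together as in the sketch below. The double square brackets used to pull a single Vector out of the List are not covered in the text above and are shown here as an additional illustration.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# A List can hold Vectors of different types and different lengths
X <- list(n, s, b, 3)

length(X)            # number of elements in the List
X[[2]]               # the second element, which is the Vector s
X[[2]][4]            # the fourth value of that Vector
sapply(X, length)    # length of each element in the List
```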
Let us create a matrix in R, called my.matrix with 5 columns and 2 rows. This Matrix
needs to store 10 elements.
Here, the first argument indicates the number of elements to be stored in the Matrix,
the second argument relates to the number of columns in the Matrix and the third
argument relates to the number of rows in the Matrix.
On executing this code the workspace indicates that a Matrix has been created.
On double clicking the name of the Matrix, a 2x5 matrix with 10 elements is visible.
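One way to write such a statement is sketched below. It assumes that the 10 elements are the numbers 1 to 10, which the text does not specify, and it names the arguments so that rows and columns cannot be confused.

```r
# 10 elements arranged in 2 rows and 5 columns
# (the values 1:10 are an assumption made for illustration)
my.matrix <- matrix(1:10, ncol = 5, nrow = 2)

my.matrix        # print the Matrix
dim(my.matrix)   # number of rows, then number of columns
my.matrix[2, 3]  # the value in row 2, column 3
```

By default R fills a Matrix column by column, so the first column holds 1 and 2, the second 3 and 4, and so on.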
A Dataframe can also be converted into a Matrix. Let us understand this with the help
of an example. First the Dataframe needs to be created, with some sample elements.
To create a Dataframe called data_frame, enter the code
data_frame<-data.frame(a=c("1","2","3"), b=c(1,2,3))
To convert this Dataframe to a Matrix (let us call it next.matrix), use the function
next.matrix<-as.matrix(data_frame)
Now let us find out the data type of the second column in the Dataframe data_frame.
The data type of the second column is numeric, and this can be confirmed by using
the code
class(data_frame$b)
Now let us find out the data type of the second column in the Matrix next.matrix.
The second column in next.matrix is b. To find out the data type of b, enter the code
class(next.matrix[,2])
This time the Console shows that the data type is character.
So, if the same elements in the Dataframe were used to create the Matrix, why does
the data type of the column differ? Column a or the first column in the dataframe
that was used to create next.matrix has elements of the character data type. So as a
Matrix needs to have elements of the same data type, every element in the Matrix
including the elements in the second column b have been converted to character
data type. This is why the data type of the second column of the Matrix is character.
This underlines the key difference between a Dataframe and a Matrix, i.e., all the
elements in a Matrix need to be of the same data type.
In this tutorial we have covered two more data structures List and Matrix.
To summarize:
There are two ways in which a List can be created in R. The first is by
generating the Vectors in the List individually. The second is by combining the
Vectors into a consolidated List.
A Matrix can be created using code where the number of elements, columns
and rows is specified.
INTRODUCTION
An important data type used in data structures is factor. Factor as already mentioned
refers to data types of categorical nature.
- Understanding factors
- Creating a factor in R
WHAT IS A FACTOR
In R, let us assume that a Vector called fac_list has been created with names of cities
like city1, city2, city3 etc.
The names of these cities are categories in themselves. So each city which is
originally a character data type can be converted into factor or a separate category in
R.
Let us take another example. In a Vector like gender, there are invariably two values,
male and female, each of which are categories in their own right.
So, the utility of the data type factors is to convert values into categories.
First create the Vector fac_list using the code mentioned above.
fact1<-as.factor(fac_list)
class(fact1)
summary(fact1)
The values under each category indicate the number of times it appears in the Vector
fac_list.
For example, city1 appears only once hence the value 1, but city2 appears twice which is
indicated by the value 2. Likewise the number of times the other categories appear is
also indicated.
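The whole exercise can be sketched as follows. The contents of fac_list are assumed for illustration, chosen so that city1 appears once and city2 twice as described above.

```r
# Hypothetical contents of fac_list (assumed for illustration)
fac_list <- c("city1", "city2", "city2", "city3")

fact1 <- as.factor(fac_list)   # convert the character Vector into a factor

class(fact1)     # confirms that the data type is factor
levels(fact1)    # the distinct categories
summary(fact1)   # how many times each category appears
```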
To summarize:
Factors are a data type which converts values into categories. For eg, names
of cities to city, male and female to gender and so on.
INTRODUCTION
One section of the R Studio GUI is dedicated to Packages. They allow for many
important functions to be carried out.
- Understanding packages
WHAT IS A PACKAGE
Packages are collections of R functions, data and compiled code put together in a
well defined format. They can be thought of as prepared routines that are available
in R.
Packages are like a bundle of everything that is needed to carry out a specific
function in R.
Suppose we want to carry out a linear regression in R to create a linear model. One
way to do this is to write all the logic and code to carry out a linear regression and
then execute it. Another way is to access a ready-made linear regression function from an
external file, pass your data through it and execute it. Pre-made functions of this kind
are what packages provide in R.
By using the right package in R, one can save time and effort in carrying out a
particular function.
Let us assume that in one of the drives in the system being used, an Excel file called
Excel_import is to be imported into R. In R the code to import an Excel file is
read.xlsx.
But if we were to execute this code, it would not work. This is because read.xlsx is a
function present within a package called xlsx and it will only work if this package is
installed. So certain functions in R are linked to packages and will only work if those
packages are installed in R.
Let us now look at how to install a package. The option Packages is available in R
Studio on the right hand side.
In the field Packages enter the name of the package that needs to be installed. In
the example being used, the package to be imported is called xlsx. So, enter xlsx.
Make sure that when installing a package like xlsx you are connected to the internet,
as R will need to download the package from a server. In the case of xlsx it will be
downloaded from the package repository.
After entering the name of the package to be installed, click on the Install button.
Loading a package
Installing a package adds it to your system, but after that the package needs to be
loaded. Loading means making the package available in R so that its functions can be executed.
To load a package in R, the common code that is used is library followed by the name
of the package within brackets. So, enter the code
library(xlsx)
In the console the text in red indicates that the package has been loaded in R.
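Installing and loading can also be done entirely from code. The sketch below uses a hypothetical helper called load_or_install, which is not part of R or of this tutorial, and it is demonstrated with the stats package because that package ships with every installation of R.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # needs an internet connection
  }
  library(pkg, character.only = TRUE)  # load the package by its name
}

load_or_install("stats")   # already installed, so nothing is downloaded
# load_or_install("xlsx")  # would download xlsx first if it is missing
```

The xlsx line is left as a comment so that nothing is downloaded unless you choose to run it.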
Now let us import an Excel file into R, as the package to import the file is installed and
loaded in R.
To do this the code to import the Excel file needs to be entered. A breakdown of this
code is mentioned here:
Let us assume that the name of the sample Excel file to be imported is Excel_import.
To find out the file path of this Excel file, right click next to the file and look under
Properties.
In the space left for the file path, paste the file path of the Excel file. When pasting or
writing the file path, make sure that back slash is entered twice in the file path.
After the file path enter the name of the file which is Excel_import. Then the sheet
to be imported needs to be mentioned. We can either enter 1 or sheetIndex = 1.
To import a comma separated value file or a CSV file, the code read.csv is to be used.
Importing a CSV file does not require any package to be installed as this function is
inbuilt in R.
- file path.csv is the file path or the location of the file to be imported
- sep = "," indicates that the file to be imported is a comma separated value
file
Assume that a csv file called CSV_Import is to be imported. Copy the file path of this
file which can be found under Properties.
In R, enter the code to import the file by first entering the name of the Dataframe
where the imported file will be stored, which is data2. Then enter the code read.csv
followed by the filepath which has been copied earlier. Then the name of the file to
be imported is entered which is CSV_Import followed by the file type which is CSV.
Remember to add 2 back slashes to the file path just as we did in the case of the
Excel file import. The last part in the code is the separator which is a comma.
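The import described above can be sketched as follows. So that the example runs on any system, a small temporary file with assumed contents stands in for CSV_Import; in practice the file path copied from Properties would be used instead.

```r
# A temporary file with hypothetical contents stands in for CSV_Import.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Aryan,20", "Gopal,21"), csv_path)

# On Windows a file path is written with doubled back slashes,
# for example "C:\\folder\\CSV_Import.csv" (a hypothetical location)

# data2 is the Dataframe in which the imported file is stored
data2 <- read.csv(csv_path, sep = ",")

data2          # print the imported data
names(data2)   # column names are taken from the first line of the file
```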
A CSV file can also be imported without typing the file path, by entering the code
read.csv(file.choose())
The function file.choose() opens a window in which the file can be selected from the
system. This option is a menu driven option and removes the need to copy and paste the
file path in the code.
After pressing Control + Enter, in the Select File option which appears look for the
CSV file to be imported (which in our case is CSV_Import).
To summarize:
Packages are a bundle of pre defined functions. They help in executing certain
processes in R with ease.
INTRODUCTION
Just like data can be imported into R whether in Excel or CSV format it can also be
exported from R.
- Reading a file in R
In an earlier section, a Dataframe called a has already been created when CSV files
were imported to R. Let us assume that the contents of this Dataframe will now be
exported to an Excel sheet.
When this code is executed, the data is exported to the location specified. You can
always check this by going to the location where the Excel file is saved, and checking
its contents.
Exporting to a CSV file is similar to exporting to an Excel file. The code to carry out
this function is shown here:
In the code shown, a is the Dataframe whose contents are to be exported, filepath is
the location where the CSV file is to be saved and the comma against the separator
(sep) indicates that the data has to be exported in CSV format.
After executing the code, go to the desktop of your system and look for the location
where the CSV file has been stored. Verify that the contents of the Dataframe a
have been exported in CSV format.
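A minimal sketch of such an export is shown below. It uses write.table, because that function accepts the sep argument described above, and it writes to a temporary file rather than to the desktop; the contents of a are assumed for illustration.

```r
# Hypothetical contents for the Dataframe a
a <- data.frame(Name = c("Aryan", "Gopal"), Age = c(20, 21))

# A temporary location stands in for the file path on your system
out_path <- tempfile(fileext = ".csv")

# sep = "," writes comma separated values; row.names = FALSE leaves out row numbers
write.table(a, file = out_path, sep = ",", row.names = FALSE)

# Read the file back to verify what was exported
check <- read.csv(out_path)
check
```

The function write.csv(a, out_path, row.names = FALSE) is a shorter alternative that always uses a comma as the separator.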
Assume that on the desktop of your system, a text file called Consultants is
available whose contents are to be read through R. Assume that this file contains a
set of email ids all separated by commas. When this data is read in R, we want to
make sure that each email id is an element in itself.
In the code, the Dataframe in which the contents of the text file are to be stored is
mentioned first, which here is a. The location of the text file is given next. Comma
is written against separator as all values in the text file Consultants are separated
by commas. When we execute this code, in the Console a red dot appears indicating
that the data is being compiled.
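One possible way to read such a file so that each email id becomes a separate element is sketched below. It uses the base R function scan, which may differ from the code used in the tutorial, and a temporary file with made-up addresses stands in for Consultants.

```r
# A temporary file with made-up addresses stands in for Consultants.txt
txt_path <- tempfile(fileext = ".txt")
writeLines("a@example.com, b@example.com, c@example.com", txt_path)

# sep = "," splits at every comma; strip.white removes the spaces around each id
emails <- scan(txt_path, what = character(), sep = ",",
               strip.white = TRUE, quiet = TRUE)

emails           # each email id is now an element in itself
length(emails)   # number of email ids that were read

# Store the ids in a Dataframe, one id per row
a <- data.frame(email = emails, stringsAsFactors = FALSE)
```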
To summarize:
INTRODUCTION
Locating values in R is fairly simple with the use of logical operators and conditions.
- Understanding IF condition
- Executing IF condition in R
An example will be used to understand each of these terms better. Assume there is a
table, that lists a few names along with certain particulars related to those names like
gender, age and income.
So, if in this sample table, one wants to identify all those names where the age is
Greater than 23, then the logical operator Greater than is used. Here is the result of
using this operator on the sample table. Looking at the table we can identify 3 names
where the age is greater than 23.
Let us take a look at another example. Suppose we want to identify all those names
whose gender is male and whose income is greater than 40000. Here we need to use
3 logical operators to identify these names. These are gender Equal to male,
followed by the logical operator AND, followed by income Greater than 40000.
Here is the result of applying these logical operators.
So from these examples we can see that logical operators are very useful in
extracting particular information from a Dataframe or a table.
We will now look at how to work with these logical operators in R through a simple
exercise. In an earlier section, a Dataframe called Records was created using the
information mentioned in the sample table above. But for purposes of this exercise,
we will create this Dataframe again.
To create the Dataframe Records again, first create the vectors Name, Gender, Age and
Income before creating Records.
After the Dataframe has been created, the following three tasks will be carried out:
Let us begin with finding out the elements or rows where the age is less than 23.
From the table, we know that there are 2 rows where the age is less than 23. These
can be found against the names Aryan and Gopal.
Let us now find out the elements or rows in this Dataframe where the age is less than
23. When discussing data structures we touched upon the code to find out the
number of rows. The first rule to remember is to use square brackets after the name
of the Dataframe, and the second rule is that the first argument within the bracket
relates to rows and the second argument relates to columns.
So as we need to find out the rows where the age is less than 23, the logical
statement is mentioned in the first argument and the second argument has been left
blank as in nothing is mentioned after the comma.
The code to find the rows where the age is less than 23 is:
Records[Records$Age<23,]
First mention the Dataframe name which is Records followed by the dollar sign ($)
and the name of the column from where data needs to be identified. In our example
this would be Age. Then enter the logical operator less than ( < ) followed by 23.
On pressing Control plus Enter, we can see in the console section two rows. The rows
displayed tally with the results that we arrived at when we looked at the data
displayed in the table.
Remember to identify rows, enter only the first argument in the code, as the second
argument relates to columns.
To count these rows, first create a dataset data1 and assign to it the result of the code
we used earlier.
data1<-Records[Records$Age<23,]
As you recall from earlier sections, this has the effect of assigning the results of the
code to the dataset data1. So the two rows that we saw in the Console now belong
to the dataset data1.
Now to find out the number of rows in data1, enter the code
nrow(data1)
Going back to the table, we can see that there are two records which meet the
conditions of gender being male and age being over 21. These are found against the
names Ravi and Umesh.
Records[Records$Gender=="M" & Records$Age>21,]
So what we have effectively stated in this code is to find in the Dataframe Records, all
rows with gender Equal to M and with age Greater than 21.
On pressing Control plus Enter, in the Console the rows with Ravi and Umesh are
displayed. This as we have seen exactly matches the requirements of all rows with
gender male and age greater than 21.
Remember that when entering character data types in R, the values need to be
entered within double quotes.
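Both filters can be tried in one short sketch. The values in Records are hypothetical, chosen so that the results agree with the names mentioned in the text; the names young and males_over_21 are illustrative only.

```r
# Hypothetical sample values (assumed for illustration)
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Ravi", "Umesh", "Meera", "Anita"),
  Gender = c("M", "M", "M", "M", "F", "F"),
  Age    = c(20, 21, 25, 30, 24, 23),
  Income = c(30000, 35000, 45000, 60000, 50000, 38000),
  stringsAsFactors = FALSE
)

# Rows where the age is less than 23 (condition before the comma, nothing after it)
young <- Records[Records$Age < 23, ]
nrow(young)

# Rows where gender is M AND age is greater than 21
males_over_21 <- Records[Records$Gender == "M" & Records$Age > 21, ]
males_over_21$Name
```

Note that == with two equals signs tests for equality, whereas a single = assigns a value.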
Let us begin by opening the Dataframe Records. In this Dataframe, let us assume we
want to add another variable called Gender_dummy.
The values to be displayed against this variable are 1 against all those rows where M
(male) is displayed and 0 against all those rows where F (female) is displayed.
Records$Gender_dummy<-ifelse(Records$Gender=="M",1,0)
ifelse indicates that IF the value in the column Gender is M display 1, ELSE display 0.
Remember when entering the code to precede the name of the column with a dollar
symbol.
Press Control plus Enter. In the Workspace section, the number of variables has
increased to 5 (where it was earlier 4).
On opening the Dataframe we can see that a new variable Gender_dummy
has been created. In this column 1 has been added against all those
elements where the gender is M or male and 0 against all those elements where the
gender is F or female.
So let us run through the code one more time. IF the statement gender is equal to
male is true, display 1, Else display 0 (which means that if the statement gender is
equal to male is not true, then display 0)
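The ifelse step can be checked with the short sketch below, which again uses hypothetical values for Records.

```r
# Hypothetical sample values (assumed for illustration)
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Ravi", "Umesh", "Meera", "Anita"),
  Gender = c("M", "M", "M", "M", "F", "F"),
  Age    = c(20, 21, 25, 30, 24, 23),
  Income = c(30000, 35000, 45000, 60000, 50000, 38000),
  stringsAsFactors = FALSE
)

# IF Gender is "M" display 1, ELSE display 0, worked out for every row at once
Records$Gender_dummy <- ifelse(Records$Gender == "M", 1, 0)

Records[, c("Name", "Gender", "Gender_dummy")]   # compare the old and new columns
ncol(Records)                                    # one more variable than before
```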
To summarize:
Logical operators are used to identify certain elements in a data structure. Examples
are greater than, less than, equal to etc.
An IF condition looks for the presence of certain conditions before carrying out a
specific function.
INTRODUCTION
Data in different tables can be merged in R.
- Executing a merge in R
A second table which we will refer to as Table 2, stores Employee ID, Address and
Nationality.
Assume that the organization wants to combine the information in these two tables
into a single table. To do this, Merge will be used. So, Merge is an operation which
helps in combining data which are present in different tables.
Shown here are two data sets. The first data set has three columns, k1, k2 and data.
The second data set has two columns, k1 and k3.
In our example, the column which is common between the two datasets is k1. So it is
possible to merge these two datasets as k1 is common between the 2.
Full merge
The first type of merge possible is called the Full merge. In our example, we have three
columns in the first dataset and two in the next dataset. Of these, one column k1 is
common. After a full merge one dataset with four columns will be created: k1, k2,
k3 and data. So a full merge is a concatenated table with all the unique columns and
data present in the tables that were merged.
So let us look at the merged table that has been created after a full merge of the 2
datasets.
Inner merge
The second type of merge is called Inner merge. In this type of merge only the rows
with matching elements in the common column of the datasets to be merged are
brought together.
In our example, the only column that is common between the two datasets is k1.
Within k1, the only common element between the 2 columns is the number 1. So
when an Inner merge is carried out only the row which has the common element is
merged. The figure shown on your screen indicates the result of an Inner merge
between the two datasets.
The third type of merge is called the left outer merge. In this type of merge a
consolidated table is created that keeps every row of the dataset to the left and adds
the matching elements from the dataset to the right. Rows of the left dataset that have
no match are still kept, with missing values in the columns that come from the right.
The figure shown on your screen displays the results of a left outer merge.
The inverse of a left outer merge would be a right outer merge, where every row of the
dataset to the right is kept and the matching elements from the left are added.
The first thing we will do is create two dataframes x and y which will contain the
elements of the datasets that we used as an example earlier.
The dataset x will comprise the columns k1, k2 and data, whereas the dataset y will
comprise the columns k1 and k3.
Let us now look at the syntax or code to carry out a full merge of both the datasets
x and y. Enter the code:
x and y are the datasets that are going to be merged. k1 is the column that is
common between the datasets x and y.
To indicate that a full merge needs to be carried out, all = TRUE is specified. On
pressing Control plus Enter, a fully merged dataset is displayed in the Console.
Press Control plus Enter. In the Console, the results of an inner merge are shown,
wherein the common elements in the common column are merged.
For a left outer merge, all.x = TRUE is specified, as x is the dataset to the left. On
pressing Control plus Enter, the results are shown in the Console.
For a right outer merge, all.y = TRUE is specified, as y is the dataset to the right. On
pressing Control plus Enter, the results of the right outer merge are shown in the Console.
For datasets to be merged there has to be at least one column in common between
them.
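The four types of merge can be compared in the sketch below. The values in x and y are assumed for illustration, chosen so that the number 1 is the only element of k1 that the two datasets share, as in the example above.

```r
# Hypothetical values (assumed for illustration)
x <- data.frame(k1 = c(1, 2, 3), k2 = c("a", "b", "c"), data = c(10, 20, 30))
y <- data.frame(k1 = c(1, 4, 5), k3 = c("p", "q", "r"))

full_merge  <- merge(x, y, by = "k1", all = TRUE)    # every row of x and of y
inner_merge <- merge(x, y, by = "k1")                # only rows with a matching k1
left_merge  <- merge(x, y, by = "k1", all.x = TRUE)  # every row of x
right_merge <- merge(x, y, by = "k1", all.y = TRUE)  # every row of y

full_merge          # NA marks the values that have no match
nrow(inner_merge)   # only the row where k1 is 1 remains
```

Comparing the number of rows in the four results is a quick way to check which type of merge was actually carried out.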
SUMMARY
In this section, the different ways two data structures can be merged have been
looked at in some detail.
To summarize:
To carry out a merge, at least one of the columns in each of the datasets to be
merged must be common.
A full merge combines all of the elements in the datasets into a consolidated
dataset.
An inner merge combines only the elements of the row which have elements in
common (within the common column)
A left outer merge keeps all the rows of the table or dataset to the left and adds the
matching elements from the right. A right outer merge keeps all the rows of the table
or dataset to the right and adds the matching elements from the left.
Copyright (c) 2014 Redwood Associates Business Solutions Private Limited 101
All rights reserved
PART 1 UNDERSTANDING TEXT ANALYTICS
INTRODUCTION
Analyzing text is extremely powerful and is an integral part of our social media and
web activity.
So if we were to define text analytics we could say that it is the process of deriving
high quality information from unstructured text. Simply put, it is making sense or
giving structure to data or information which is not structured.
HOW IS TEXT ANALYTICS USEFUL
Let us suppose that you have been searching on the web for anything related to
computer games. On your search results page you will often find ads and
recommended pages related to computer games. A lot of what you see is dependent
on your search history or the keywords that you have been using.
Likewise, when you are on the Newsfeed page of Facebook, you can see posts on
Suggested pages or Ads displayed on the right hand side of your page. Maybe you
have been looking for something specific on Facebook or have been spending time on
a certain company page. Those suggested pages or ads could be very similar to the
pages that you have been looking for or spending time on in Facebook.
If you have a Gmail account, you would find in your Spam folder a lot of mail that you
yourself did not actually send to Spam. All of these examples that we have cited
are a result of using text analytics. Take the example of Spam filtering. There have
been instances when you have flagged mail from a certain sender as Spam. Your
mail service provider will now automatically look for those words in a string and send
any mail with that text to Spam. Likewise, on Facebook what you search for or write is
being analysed to come up with suggested pages and display ads.
Text analytics is an exciting and useful part of analytics. To understand this concept
better, in the sections ahead we are going to focus on two aspects:
1. Understanding the common terms used in text analytics; and
2. Completing a text analytics project using data from a popular social medium
- Twitter.
The project will focus on the framework to create a Word Cloud out of a set of
tweets on Big data, R and analytics.
You will need to execute this project in R using the concepts that will be
discussed in this tutorial. We will of course be guiding you along the way.
IMPORTANT TERMS IN TEXT ANALYTICS
Corpus
The first term that we will look at is called Corpus. Corpus is the data structure that is
used to manage the text that is being analyzed.
It is a data structure of the relevant words in a piece of text. Let us assume that we
are analyzing a blog on democracy. The corpus will be a list of all relevant words in
the blog related to democracy stored in a structured format. So like in a dictionary
when you look up the term democracy you will find all words associated with it listed
in one place, the corpus will list all the relevant words from the blog in a single place.
An important point to remember about a Corpus is that just like in the case of a
dataset, it needs to be cleaned up.
Cleaning up a Corpus
Stopwords
So what do we typically clean from a Corpus? Firstly, words which do not really make
sense by themselves need to be removed. For example, if the blog that we are analyzing
uses the words the, or, of, am, is, are and was quite frequently, these words
carry little or no meaning and hence need to be removed from
the Corpus. These types of words are referred to as stopwords. There are around
196 stopwords that have been identified.
You need not worry about identifying these words by yourself, because in R we will
be using a Text Mining or TM package which will help you in identifying and removing
stopwords from your Corpus.
In addition to the 196 stopwords identified, you can also add your own stopwords
based on what you think is useful or not. For example, if you think that names of
Numbers
Secondly, we can also remove numbers from the Corpus. So, if numbers have been
used to demarcate points like 1, 2, 3 and so on, these can be removed from the
Corpus as they have no meaning by themselves.
Punctuation
Thirdly, we can also remove punctuation marks like commas, semicolons, colons, full stops
etc. from the Corpus.
Treatment of case
Fourthly, we can decide whether the same words used in a text need to begin with
upper case or lower case.
For example, if democracy is spelt in one place with lower case but in another sentence
begins with upper case, then we need to decide if in the Corpus democracy should
always start with upper case or lower case.
Stemming
The next type of clean up that can be done is through a process called Stemming.
To understand Stemming, let us assume that in the blog we are analyzing, the word
participate has been mentioned in different ways like participated,
participating, participatory etc. across the blog. All these words relate to the same
root word which is participate.
The process of Stemming will ensure that all these words will eventually add up to
that one word no matter the tense used. Another example would be a verb like fly
which can be represented in an article as flew, flying, flown etc. Stemming aims to
ensure that in the end this is all represented by the one word fly. Note that stemmers
work mainly by trimming regular endings such as -ed and -ing, so irregular forms
like flew may not always be reduced automatically.
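The clean up steps described above can be illustrated with a few lines of base R. This is only a rough sketch and not the TM package itself: the sample text and the very short stopword list are made up, and the last step is a crude stand-in for a real stemmer that only handles the endings -ed and -ing in this example.

```r
# Made-up sample text and a deliberately tiny stopword list
text <- "The Parliament is the seat of Democracy: 1. People participated, 2. people are participating."
stopwords_small <- c("the", "is", "of", "are")

clean <- tolower(text)                          # treatment of case: all lower case
clean <- gsub("[[:digit:]]+", " ", clean)       # remove numbers
clean <- gsub("[[:punct:]]+", " ", clean)       # remove punctuation
words <- strsplit(trimws(clean), "\\s+")[[1]]   # split the text into words
words <- words[!words %in% stopwords_small]     # remove stopwords

# Crude stand-in for stemming: replace a trailing "ed" or "ing" with "e"
stems <- sub("(ed|ing)$", "e", words)
stems
table(stems)   # participated and participating now count as one word
```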
Framework
So, the framework to start analyzing text begins with creating a Corpus which is a
data structure to store text. This is then followed by the process of cleaning up the
Corpus wherein the following is carried out:
- Removing stopwords
- Removing numbers
- Removing punctuation
- Treating case consistently
- Stemming
Another important term is Tokenize. In this process a sentence is broken down into
individual tokens so that each word in that sentence is a separate entity. So the
sentence "Parliament is the seat of democracy", when tokenized would be:
Parliament, is, the, seat, of, democracy.
This method is also used in search engines like Google when they look at keywords.
For example, if the keywords "analytics jobs" are entered, they would be first broken down
into 2 tokens: analytics and jobs.
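Tokenizing the example sentence can be sketched in base R as follows. Splitting at every space is the simplest possible approach and is shown only for illustration.

```r
sentence <- "Parliament is the seat of democracy"

# Break the sentence into tokens at every space
tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]

tokens           # each word is now a separate element
length(tokens)   # number of tokens
```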
TDM
Having arrived at a clean Corpus, we now need to decide what to do with it.
Remember that a Corpus in itself is not an output, but a dictionary to be used to
create something else. So if our final objective is to create a Word Cloud out of a
Corpus, the Corpus needs to be converted into a format which enables a Word Cloud
to be created from it.
To understand this better, we need to know what is required to create a Word Cloud.
Two very simple components make up a Word Cloud: words and the number of
times or frequency with which those words appear. For example, look at the table
shown below:
Words Frequency
People 20
Democracy 35
Freedom 40
The numbers next to those words represent the number of times they appear in a
piece of text like a blog or an article.
When a Word Cloud is created, the frequency will determine the size of the word
within the Word Cloud.
For example, in the image shown below, the larger a word appears, the more
frequently it recurred in the content or the text from which this Word Cloud was
created.
So, the structure which has been described above is referred to as a Term Document
Matrix or TDM. So to take a Corpus and make it Word Cloud ready, we need to create
a TDM. A TDM is made up of rows and columns. The rows represent the words, the
columns represent the documents, and each cell holds the number of times that word
occurs in that document.
So let's stop for a while and ask ourselves a question: "I have a blog and I want to
create a Word Cloud out of it. How can I do it?"
Well, everything we have discussed so far should answer our question. Quite simply:
1. Create a Corpus
2. Clean up the Corpus
3. Create a TDM or Term Document Matrix out of it
4. Create your Word Cloud
For example, to help with Step 2, which is cleaning up the Corpus, the tm package
provides a function called tm_map. To carry out processes like removing Stopwords,
the appropriate transformation needs to be passed as an argument to tm_map.
The tm package comes with some really good documentation which you need to go
through to understand how to execute each of the steps we have talked about.
Remember to also use the Help feature in R for specific queries.
Before we move on to our project of creating a Word Cloud out of a set of tweets,
let's make sure we do the following.
1. Download the file comprising the tweets that we need to convert into a Word
Cloud. You can find this in the Download section of this tutorial.
When opening this file, remember to right click and select Open with R Studio.
2. When the file is opened in R Studio, it will be visible in the Workspace section
along with the number of tweets in it, which is 320.
3. Import and install the list of packages that are mentioned in the Download
section of this tutorial. The packages are:
a) twitteR: This package is needed to read the tweets that have been
downloaded
b) wordcloud: This is required to create the Word Cloud
c) tm: As mentioned earlier, the tm or Text Mining package is needed to
create the Corpus, clean up the Corpus and create the TDM
d) SnowballC: This package is required to enable Stemming to be carried out.
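Before moving on, it can help to check which of these packages are already installed.
The lines below are a minimal sketch. They assume that the package names on CRAN are
twitteR, wordcloud, tm and SnowballC, so do compare these with the names given in the
Download section.

```r
# Assumed CRAN names of the four packages used in this project
pkgs <- c("twitteR", "wordcloud", "tm", "SnowballC")

# requireNamespace() returns TRUE when a package is installed, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# To install whatever is missing, remove the comment sign from the next line
# install.packages(pkgs[!installed])
```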
SUMMARY
In this section, the meaning and importance of Text Analytics were covered. Some
important terms in Text Analytics and the framework to create a Word Cloud have also
been explained.
To summarize:
INTRODUCTION
Word Clouds are a product of text analytics. They are not so difficult to create.
Before we begin, you should have downloaded the sample set of tweets and
opened it in R Studio. You should have also imported and installed the list of packages
that were specified in the previous section.
As you can see in the Workspace section, a list of 320 tweets is displayed.
Double click on this list and you will see a list displayed in Notepad.
From this list it is pretty evident that there are no details of the actual content of the
tweets. So what we have is essentially unstructured text.
To convert this unstructured text into a Dataframe, enter the code shown below:
library(twitteR)
df <- do.call(rbind, lapply(tweets, as.data.frame))
do.call = a function which calls another function once, passing it a list of
arguments. In the case of our code, the function being called is rbind or row bind,
which stacks all the items of the list on top of each other as rows.
For a detailed explanation of the syntax do.call, go to the Help section in R Studio.
Type the words do.call in the Search field and press Enter. As you can see, a detailed
explanation of the function do.call is shown in Help. You can go through this
explanation to understand this function better.
lapply = a function which applies another function to every item of a list. In our
code it applies as.data.frame to each tweet, so that each tweet becomes a one-row
dataframe ready to be combined by rbind.
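To see the same pattern at work without the twitter data, here is a minimal sketch in
base R. The list tweets below is a made-up stand-in with only two fields per item.
The real tweets are objects from the twitter package and carry more fields.

```r
# A made-up stand-in for the list of tweets: each item is a small list
tweets <- list(
  list(text = "R is great for text mining", screenName = "user_a"),
  list(text = "Word clouds with the tm package", screenName = "user_b"),
  list(text = "Big data analysis in R", screenName = "user_c")
)

# lapply turns every item into a one-row dataframe
rows <- lapply(tweets, as.data.frame, stringsAsFactors = FALSE)

# do.call hands the whole list of rows to rbind in a single call
df <- do.call(rbind, rows)

# Number of rows (tweets) and columns (variables)
print(dim(df))
```

Here dim reports 3 rows and 2 columns, in the same way as it reports the size of the
dataframe of tweets in our project.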
Let us now execute the code mentioned above. Press Control plus Enter. As you can
see in the Workspace section, a dataframe df with 320 observations is visible. These
320 observations are nothing but the twitter data which has been converted into a
dataframe.
Let us open the dataframe. As you can see, each row is numbered, with the first row
relating to the first tweet, the second row to the second tweet and so on. This
dataframe will run into 320 rows, which correspond to the 320 tweets in our original
data structure.
The dataframe has 14 variables. For the purposes of our exercise we will focus on the
text column of the dataframe.
dim(df)
Press Control plus Enter and you can see in the Console the numbers 320 and 14
displayed.
You can see two links shown here: a Description file, and Overview of user guides
and package vignettes.
We will click on the second option, which is the Overview of user guides and package
vignettes.
Once we do that, a PDF file on the Introduction to the tm package will open.
As we scroll through this document, you will see every conceivable task that is
possible with the tm package listed.
It covers everything from eliminating stopwords and carrying out stemming to creating
a Term Document Matrix. The Term Document Matrix or TDM, as we know, is essential to
creating a Word Cloud as it lists out the set of words along with the frequency or the
number of times they appear in a given text.
As you can see in the image below, an example of a Document Term Matrix has been
shown. Here the documents are mentioned in rows.
Let us interpret this matrix. Listed under Docs are the names or numbers of the texts
that have been analysed. Listed as columns are the words which appear in these
documents. So, if the number 10 was shown against Doc 127 and under the word able,
it would mean that the word able has appeared 10 times in document 127.
To create the Corpus, use the syntax shown below.
myCorpus <- Corpus(VectorSource(df$text))
If we open the dataframe df, you can see that the column which contains the contents
of the tweets is referred to as text.
So in the syntax we mention df$text. Press Control plus Enter to create the Corpus.
Within the document, you will find the code required to carry out various processes,
from eliminating white space or blank space from the Corpus and converting it to
lower case to removing stop words.
In the Help document, in the code shown to transform the Corpus, a sample Corpus
named reuters has been used. For the purposes of our project we need to use the
same code but replace reuters with the name myCorpus.
In R Studio, against each line of code or syntax mentioned, we will hit Control plus
Enter and start cleaning up the Corpus. We will first convert to lower case, then
remove punctuation and then remove numbers from the Corpus.
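The three clean up steps can be sketched as shown below. This is only an illustration
and not the exact code from the Help document. It assumes the tm package is installed,
uses two made-up lines of text in place of df$text, and creates the Corpus with
VCorpus where our project uses Corpus.

```r
library(tm)

# Two made-up lines standing in for the text column of the dataframe
tweet_text <- c("Text Mining with R: 3 Examples!",
                "Big Data, Big Analysis in 2014.")

# VCorpus is used in this sketch; the project itself calls Corpus()
myCorpus <- VCorpus(VectorSource(tweet_text))

# tm_map applies one transformation to every document in the Corpus
myCorpus <- tm_map(myCorpus, content_transformer(tolower))  # lower case
myCorpus <- tm_map(myCorpus, removePunctuation)             # punctuation
myCorpus <- tm_map(myCorpus, removeNumbers)                 # numbers

# Pull the cleaned text back out of the Corpus to look at it
cleaned <- vapply(seq_len(length(myCorpus)),
                  function(i) paste(content(myCorpus[[i]]), collapse = " "),
                  character(1))
print(cleaned)
```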
Removing URLs
It is also possible to remove URLs from the Corpus with the help of a user defined
function which is shown below.
To find out the meaning of gsub, which is used in the function, type out the word
gsub in the search field of the Help section. As you can see from the text displayed,
gsub is a function which replaces every match of a pattern within a piece of text.
So in the removeurl function, gsub is replacing any text starting with http with a
blank.
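A function of this kind can be sketched in base R as shown below. The pattern used
here is an assumption and may differ from the one in the function shown in this
tutorial. It matches http followed by everything up to the next white space.

```r
# Replace anything that starts with http, up to the next white space, with a blank
removeurl <- function(x) gsub("http[^[:space:]]*", "", x)

sample_text <- "read this http://example.com/page now"
print(removeurl(sample_text))
```

To apply such a function to a whole Corpus, it is passed to tm_map like any other
transformation. In recent versions of the tm package it is first wrapped in
content_transformer.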
We will now look at how to remove stop words from the Corpus. As you can see in
the Console, there are a number of words which by themselves carry little meaning.
Some of these words are once, why, each, in, to, etc.
There are around 196 Stop Words that have been identified, but you can include
more as well.
In addition, we also want to include some other words which, for the purposes of our
project, are of no value or utility. These words are English, available and via.
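Extending the list of stop words can be sketched as follows. This assumes the tm
package is installed. The name myStopwords and the sample sentence are only
illustrative, and the extra words are written in lower case on the assumption that
the text has already been converted to lower case.

```r
library(tm)

# The English stop words that ship with tm, plus three words of our own
myStopwords <- c(stopwords("english"), "english", "available", "via")

# removeWords deletes every listed word from a piece of text
sample_text <- "this guide is available via the english site"
result <- removeWords(sample_text, myStopwords)
print(result)
```

On a Corpus, the same step would be written as tm_map(myCorpus, removeWords, myStopwords).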
Stemming
Now that we have removed stop words, we will move on to another important
process in the clean up of the Corpus, which is called Stemming. To do this we first
create a copy of the Corpus by using the code shown.
Stemming will convert words like eating, eaten, etc. to one root word, eat.
Since we have already installed this package we will click on Cancel, but in case you
have not then click on Install instead.
To carry out stemming, we need to use a function called stemDocument, which is
provided by the tm package and relies on the SnowballC package to do the stemming.
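Stemming a Corpus can be sketched as shown below. This assumes that both the tm and
SnowballC packages are installed. The one-line Corpus and the name myCorpusCopy are
made up for illustration.

```r
library(tm)  # stemDocument also needs the SnowballC package to be installed

# A made-up Corpus holding different forms of the same word
myCorpus <- VCorpus(VectorSource("eating eats eat"))

# Keep a copy of the Corpus as it was before stemming
myCorpusCopy <- myCorpus

# Reduce every word in every document to its stem
myCorpus <- tm_map(myCorpus, stemDocument)
stemmed <- content(myCorpus[[1]])
print(stemmed)
```

Do keep in mind that the stemmer works by applying rules to word endings, so
irregular forms of a word are not guaranteed to be reduced to the same root.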
Shown here is the code to convert the Corpus into a Term Document Matrix.
The code indicates that any word with a frequency from one to infinity needs to be
added to the Term Document Matrix.
This need not be mentioned in the code, because by default words with all types of
frequencies will be added to the term document matrix.
Press Control plus Enter. The Term Document Matrix has been created.
To view the contents of the Term Document Matrix, go to the Workspace section
where you can see the value myTdm displayed. However, the Term Document
Matrix is in the form of a List, whereas we would like to see it in the form of a matrix.
In order to do this, we create a matrix called m and convert the List into
this matrix using the code shown below.
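Both steps can be sketched as shown below, using three made-up documents in place of
the tweets. This assumes the tm package is installed and is not necessarily the exact
code used in this tutorial. One point to note: in the tm package, the control option
that takes a range such as 1 to infinity is wordLengths, which refers to the number of
characters in a word. Its documented default is c(3, Inf), so very short words such as
r are dropped unless the range is widened.

```r
library(tm)

# Three made-up documents standing in for the cleaned tweets
docs <- c("r is big", "big data analysis in r", "text analysis")
myCorpus <- VCorpus(VectorSource(docs))

# Keep words of any length, from one character upwards
myTdm <- TermDocumentMatrix(myCorpus,
                            control = list(wordLengths = c(1, Inf)))

# Convert the Term Document Matrix into an ordinary matrix
m <- as.matrix(myTdm)
print(dim(m))      # number of words, number of documents
print(m["big", ])  # how often the word big appears in each document
```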
In the Workspace section, a matrix m is displayed. Double click on this and our Term
Document Matrix opens up! So let us break this down.
The first column, row names, indicates the words that are contained in the 320
tweets (remember all stop words have been removed, so these are the actual usable
words). The columns which are numbered 1, 2, 3 and so on are the tweets,
which we know will run into 320. The numbers indicate the number of times these
words appear in each of these 320 tweets. In most cases the number is zero,
indicating that they have not appeared in those tweets. To find out the frequency or
the cumulative number of times a word appears across the 320 tweets, we will need
to look at the sum of each row. So for example, to find out the frequency of the word
big, we will need to add up all the numbers under each of the 320 columns against
the row big.
Let us deconstruct this code. The term rowSums with m within brackets indicates
that the summation of each row in the Term Document Matrix m will be carried out.
decreasing = TRUE means that the summed amounts will be arranged in descending
order. Press Control plus Enter. The result will be stored against wordfreq.
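The calculation described above can be sketched in base R as shown below. It assumes
that rowSums is wrapped in sort, which is the function that takes the decreasing
option. The small matrix m is made up and has 3 words and 3 tweets only.

```r
# A made-up Term Document Matrix: words in rows, tweets in columns
m <- matrix(c(1, 0, 2,
              0, 1, 1,
              3, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(Terms = c("big", "data", "r"),
                            Docs = c("1", "2", "3")))

# Add up each row, then arrange the totals from highest to lowest
wordfreq <- sort(rowSums(m), decreasing = TRUE)
print(wordfreq)

# The same result as a one-column matrix, which is easier to open and read
wordfreq1 <- as.matrix(wordfreq)
print(wordfreq1)
```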
To view the frequencies that have been calculated, select wordfreq and press Control
plus Enter. The results are displayed in the form of a List.
So an easier alternative would be to convert the List wordfreq into a matrix
wordfreq1 using the code that is shown on the screen.
In the Workspace double click on the matrix wordfreq1. Shown on the screen is a
matrix of all the words in the Corpus myCorpus along with their frequencies or the
cumulative number of times they appear in the 320 tweets. Also, the frequencies
have been arranged in descending order from the highest to the lowest.
STEP 8: CREATING THE WORD CLOUD
Now all that is left to be done is to generate the Word Cloud. Let us go to the Help
section in R Studio and enter the words Word Cloud. Click on the link which appears.
As you can see the arguments necessary to create a Word Cloud will be listed.
The first requirement shown is words, followed by frequencies. There are many
other options listed so that one can create a Word Cloud based on different
conditions. But a Word Cloud can be generated with just 2 pieces of information:
words and their frequencies.
The matrix of word frequencies that we will be using to generate the Word Cloud is
called wordfreq1. To generate the Word Cloud enter the following code:
The Word Cloud creates the words with the highest frequencies first. So, words like
r, analysis, research and example have high frequencies and hence are displayed
quite prominently in the Word Cloud.
In the matrix, there were many words with a frequency of 1. We can choose not to
show those words in the Word Cloud. To exclude these words from the Word Cloud,
enter in the code an option to include only those words with a frequency of say 5 and
above. The code to execute this is shown below:
Press Control plus Enter. You can see that fewer words are being added to the Word
Cloud.
Another thing to remember is that each time a Word Cloud is generated the position
of the words will change. As we can see, r which was earlier vertical is now horizontal
and is located in a different place. In order to ensure that the position of a word does
not change each time the Word Cloud is generated, we can use the function
set.seed.
In order to limit the number of words to be shown in the Word Cloud, we can use the
option max.words. We can also determine the colour of the Word Cloud by using
the option colors, for example colors = "red".
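The options discussed so far can be put together as shown below. This is a minimal
sketch which assumes the wordcloud package is installed. The words and frequencies
are made up, and the seed value 123 is an arbitrary choice.

```r
library(wordcloud)

# Made-up words and frequencies; in our project these come from wordfreq
wordfreq <- c(r = 40, analysis = 25, research = 18, example = 12,
              data = 6, mining = 3, text = 1)

# Fix the random seed so the words land in the same place on every run
set.seed(123)

wordcloud(words = names(wordfreq), freq = wordfreq,
          min.freq = 5,     # leave out words that appear fewer than 5 times
          max.words = 50,   # show at most 50 words
          colors = "red")   # colour of the words
```

With these made-up numbers, mining and text fall below the minimum frequency of 5 and
are left out of the Word Cloud.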
So we have completed the objective of this project which was to generate a Word
Cloud. To generate a Word Cloud all that is needed are words and their frequencies.
Other parameters can also be defined. Modifications can also be done on the Word
Cloud like minimum frequency, maximum number of words to be displayed and
colour.
Do try out the other parameters available by referring to the content in the Help
section under Word Cloud.
SUMMARY
Creating a Word Cloud in R is a function of using the right package with the right set
of text or words.
To summarize:
The tm package in R, which is needed to carry out text analysis, comes with
detailed documentation.
Apart from stop words, numbers and punctuation, URLs can also be
removed through a user defined function.
To calculate the frequency of words in a TDM, the numbers in the row against each
word in the matrix need to be summed up.
A Word Cloud can be created once words and their frequencies are mapped
out.