
MOVIE REVIEWS

(Project in Machine Learning CS771)

Department of CSE, IIT KANPUR

Report with Code

Submitted to:
Prof. Harish Karnick

By:
Venu Gopal Reddy (14111043)
Sushil Kumar Verma (14111038)
Rishabh Dev Shukla (14111029)
Banothu Raj Kumar (14111007)

Acknowledgement

We would like to express our special thanks and gratitude to our instructor, Prof. Harish Karnick, who gave us the golden opportunity to do this wonderful project on the topic Movie Reviews in the course Machine Learning (CS771). The project also led us to do a lot of research, and we came to know about so many new things. We are really thankful to him.

Secondly, we would also like to thank our group members (team), who helped us a lot in finalizing this project within the limited time frame.

Contents


About this project
About dataset
Basic idea
Bag of words
TfIdf
Distributed representation
Google's word2vec
Decision tree
Random forest
Neural networks
Other classifiers
Results
Issues in implementation
Further improvements
RTextTools
References
Appendix: Code











About This Project

Sentiment analysis is a challenging task in machine learning. People express their feelings in language and with plays on words, which can be very misleading for both humans and computers.

We are given a dataset of IMDB movie reviews and are expected to predict one of two sentiments, good or bad, based on the reviews, i.e. 0 (IMDB rating < 5) or 1 (IMDB rating >= 7).

The IMDB movie reviews are text data, and we have converted those text reviews into a numerical data table by using the Bag of Words, TfIdf, and Google's Word2Vec techniques.

We built models such as neural networks, SVM, LDA, QDA, Random Forest and Decision Tree by varying their parameters, and noted the accuracy of each model in the result table.

The accuracy values marked with * in the Results section of this document are the accuracies obtained from Kaggle submissions.


About Dataset

Labeled Train Data: 25000 instances
Test Data: 25000 instances

These datasets are given on Kaggle.

Details:
id: Unique ID of each review
sentiment: Sentiment of the review: 1 for positive reviews and 0 for negative reviews
review: Text of the review











Basic Idea

Basically there are three major steps in our project:
1. processing the dataset
2. creating a classifier model
3. training and testing the classifier

Bag of Words

Bag of Words is one of the powerful techniques in natural language processing. In this model, every document is represented as a multiset of the words present in the document. The model considers neither grammar nor word order. The multiset can be written as a vector, which serves as the feature vector of the document. The issue with this model is that it is not good enough to handle negation, that is, thoughts expressed using negative words such as "not" or "don't". For example, instead of saying "bad movie" someone can just as well say "not good movie".

If we use the raw bag of words model, some words lose their specific nature and become general words. For example, let our vocabulary list be [This, is, not, good, movie]; then documents X and Y can be represented as below. Here each entry is the number of times that word occurs in the document.
X: This is good movie: [1 1 0 1 1]
Y: This is not good movie: [1 1 1 1 1]
Let X be labeled as a positive review and Y as a negative review. In general the word "good" is positive, and a review containing it tends towards being a positive review. Because of negation, however, this word occurs in both positive and negative reviews, so it loses its positive nature and becomes a general word like "this", which is present in both reviews.

The solution to the above issue is to replace the k words after a negation term with versions that have NOT_ prepended. Ideally we need to replace all words like this until a punctuation symbol appears, but most of these review writers did not use punctuation consistently. In our implementation we took k = 3. The implementation details can be found in Appendix a. One more issue with this modified model is that the vocabulary becomes larger compared to the general model.
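
The effect of the NOT_ prefix on the X and Y example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names mark_negation, bag_of_words and NEGATIONS are hypothetical, the negation list is a small assumed sample, and unlike the code in Appendix a this sketch keeps the negation word itself in the output.

```python
from collections import Counter

NEGATIONS = {"not", "don't", "didn't"}  # assumed sample of negation terms

def mark_negation(tokens, k=3):
    """Prefix the k tokens that follow a negation term with NOT_."""
    marked, remaining = [], 0
    for token in tokens:
        if token in NEGATIONS:
            remaining = k              # start (or restart) the negation window
            marked.append(token)
        elif remaining > 0:
            marked.append("NOT_" + token)
            remaining -= 1
        else:
            marked.append(token)
    return marked

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

x = "this is good movie".split()
y = "this is not good movie".split()

raw_vocab = ["this", "is", "not", "good", "movie"]
print(bag_of_words(x, raw_vocab))  # [1, 1, 0, 1, 1]
print(bag_of_words(y, raw_vocab))  # [1, 1, 1, 1, 1]

# With negation marking, "good" and "NOT_good" are separate features.
marked_vocab = raw_vocab + ["NOT_good", "NOT_movie"]
print(bag_of_words(mark_negation(x), marked_vocab))  # [1, 1, 0, 1, 1, 0, 0]
print(bag_of_words(mark_negation(y), marked_vocab))  # [1, 1, 1, 0, 0, 1, 1]
```

After marking, "good" no longer appears in the negative review Y, at the cost of two extra vocabulary entries.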

TfIdf

There are still some problems with the Bag of Words model. Some words are very frequent and are present in nearly every document. We can remove such words using a stop-word list. Stop words are the most frequently used words of a particular language; because the list is global to the language, it does not handle the words that are frequent in the given dataset.

One more issue is the handling of rare words. These words occur in very few documents and may not be present in any future document. These two issues can be handled by dropping the frequent words and the rare words together; such words can be recognised from their TfIdf statistics.

TFIDF (Term Frequency - Inverse Document Frequency):
This score contains two factors, Tf and Idf; by multiplying these two factors we get the TfIdf score of a word.

Tf(w, d) = 0.5 + 0.5 \cdot \frac{f(w, d)}{\max\{ f(w', d) : w' \in d \}}

Idf(w, D) = \log \frac{N}{|\{ d \in D : w \in d \}|}

f(w, d): frequency of word w in document d
N: total number of documents in the dataset D

Here, the Tf score tells us the importance of the word to the particular document. Idf is a scaling factor that tells us how frequent the word is in the particular dataset. Words that occur in nearly every document get an Idf close to zero, so their TfIdf score is low compared to other words, and we can eliminate them by ignoring the words with low TfIdf scores. Rare words are removed separately with a minimum document frequency, as described below. The representation of the feature vector is just like in the Bag of Words model, but the frequency of each word is replaced with its TfIdf score.

Implementation: we use TfidfVectorizer from scikit-learn to calculate these scores and build the feature vectors. We compute the TfIdf of a word only if it is present in at least 5 documents; in this way we eliminate rare words a priori.
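
The two factors can be worked through on a toy corpus. The sketch below follows the augmented Tf and logarithmic Idf definitions given in this section; it is not the scikit-learn implementation (which uses a different smoothing), the function names and the three-document corpus are made up for illustration, and idf assumes the word occurs in at least one document.

```python
import math
from collections import Counter

def tf(word, doc):
    """Augmented term frequency: 0.5 + 0.5 * f(w, d) / max f(w', d)."""
    counts = Counter(doc)
    return 0.5 + 0.5 * counts[word] / max(counts.values())

def idf(word, docs):
    """log(N / number of documents that contain the word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [
    "this movie is good".split(),
    "this movie is bad".split(),
    "this plot is good good".split(),
]

# "this" occurs in every document, so its Idf (and TfIdf) is zero.
print(tfidf("this", docs[0], docs))  # 0.0
# "good" is more specific to the third document than "is".
print(tfidf("good", docs[2], docs) > tfidf("is", docs[2], docs))  # True
```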



Distributed Representation

There are some more problems that we cannot handle with bag of words models, as these models represent each word as an atomic quantity and do not capture information about the similarity of words. For example, "hotel" and "motel" are similar in meaning, but if you represent them using the bag of words model, the relation between "hotel" and "motel" is the same as that between "hotel" and "home", because this model treats all words alike.

A distributed representation of words solves this problem by representing each word as a vector in a vector space such that similar words are closer to each other than dissimilar words, that is, words of a similar type can be clustered together. For example, let our vocabulary contain three types of words: positive, negative and neutral words.

We can assign values in a d-dimensional vector space such that all similar words are clustered together: all positive words are together, all negative words are together and far from the positive words, and all neutral words are at the same distance from the positive and the negative words.

We use Google Word2Vec to obtain the vector embeddings of words.
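
How "close" two word vectors are is commonly measured with cosine similarity. The following sketch shows the idea on the hotel/motel/home example; the three-dimensional vectors are invented by hand purely for illustration and are not taken from any trained Word2Vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand for illustration only.
embeddings = {
    "hotel": [0.9, 0.8, 0.1],
    "motel": [0.8, 0.9, 0.2],
    "home":  [0.1, 0.3, 0.9],
}

sim_hotel_motel = cosine_similarity(embeddings["hotel"], embeddings["motel"])
sim_hotel_home = cosine_similarity(embeddings["hotel"], embeddings["home"])
print(sim_hotel_motel > sim_hotel_home)  # True
```

In a bag of words model every pair of distinct words is equally unrelated, so no comparable similarity ordering exists.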









Google's Word2Vec

This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.

The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representations of words.

There are two main learning algorithms in word2vec: continuous bag-of-words and continuous skip-gram. The switch -cbow allows the user to pick one of these learning algorithms. Both algorithms learn the representation of a word that is useful for prediction of other words in the sentence.

Several factors influence the quality of the word vectors:
amount and quality of the training data
size of the vectors
training algorithm

Word2vec is also available in Python in the gensim library by Radim Rehurek.

Command in Linux to generate word2vec vectors from the terminal:

./word2vec -train ~/Desktop/sentences.txt -output ~/Desktop/vec.csv -size 50 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 5 -iter 5
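
With -binary 0 the tool writes a text file whose first line holds the vocabulary size and vector size, followed by one "word v1 v2 ..." entry per line. A review can then be turned into a fixed-length feature vector by averaging the vectors of its words, which is what the R script in Appendix c does. The Python sketch below illustrates that idea only; the function names and the tiny two-dimensional vectors are assumptions made for the example.

```python
def load_vectors(lines):
    """Parse word2vec text output: one 'word v1 v2 ...' entry per line."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:   # skip blank lines and the 'vocab_size dim' header
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors

def review_vector(tokens, vectors, size):
    """Average the vectors of the tokens that are in the vocabulary."""
    total, count = [0.0] * size, 0
    for token in tokens:
        if token in vectors:
            total = [t + v for t, v in zip(total, vectors[token])]
            count += 1
    # A review with no known words keeps the all-zero vector.
    return [t / count for t in total] if count else total

# Toy stand-in for the contents of vec.csv (2 dimensions instead of 50).
lines = ["3 2", "good 1.0 0.0", "bad -1.0 0.0", "movie 0.0 2.0"]
vectors = load_vectors(lines)
print(review_vector("good movie unknownword".split(), vectors, size=2))  # [0.5, 1.0]
```

Returning the zero vector for reviews without any known word avoids a division by zero.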




Decision Tree

We varied the impurity threshold at a node through the complexity parameter (CP) of the rpart library in R. The best accuracy, 73.06%, was obtained for CP = 0.1; CP = 0.01 gave an accuracy of 70.14% and CP = 0.02 gave 70.89%. As the CP value decreases, the height of the tree increases, and beyond a certain height this leads to a decrease in accuracy.

A fully grown tree built with the oblique.tree library gave an accuracy of 70.84%, and after pruning the accuracy increased to 75.32%, because pruning removes the sections of the tree that provide little power to classify instances. This technique of growing and then pruning is used in order to avoid the horizon effect.

Splits based on the number of instances were studied by varying the minsplit parameter in R. The best accuracy, 74.32%, was obtained at minsplit = 83, while minsplit = 75 gave an accuracy of 72.87%. As the number of instances required at a node decreases (relative to the optimum), the height of the tree increases and the accuracy decreases.












Random Forest

We generated random forests on our preprocessed data using the randomForest library of R, and did 5-fold cross validation. First we varied the ntree (number of trees to grow) parameter of our randomForest model. We started with ntree = 2 and varied it over different values up to 400. Our results are as follows:

ntree      2     52      102    152   202    252    302     352     400
Accuracy   72.8  81.288  81.74  81.7  81.81  81.92  82.064  82.008  82.01

Then we fixed ntree to be 302, for which we got the best result, and varied the mtry parameter of our random forest model.

mtry: Number of variables randomly sampled as candidates at each split.
We generated random forests for ntree = 302 and mtry = (2, 4, 8).
The results are as follows:

mtry       2      4       8
Accuracy   81.62  82.148  81.936

So the best accuracy which we got from generating random forests was for ntree = 302 and mtry = 4, which was 82.148%.
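
The selection procedure used here, 5-fold cross validation followed by picking the parameter value with the highest average accuracy, can be sketched as follows. The fold construction is an assumption about how the 25000 rows are split (contiguous folds of equal size, as in the appendix scripts), the function name k_fold_indices is hypothetical, and the accuracies are simply the ntree values reported in this section.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train, test) index lists for k contiguous, equal-sized folds."""
    fold_size = n_rows // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = list(range(start, stop))
        # Everything outside the test fold is used for training.
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, test

folds = list(k_fold_indices(25000, k=5))
print(len(folds[0][0]), len(folds[0][1]))  # 20000 5000

# Cross-validated accuracies reported in this section for each ntree value.
accuracy_by_ntree = {2: 72.8, 52: 81.288, 102: 81.74, 152: 81.7, 202: 81.81,
                     252: 81.92, 302: 82.064, 352: 82.008, 400: 82.01}
best_ntree = max(accuracy_by_ntree, key=accuracy_by_ntree.get)
print(best_ntree)  # 302
```

Tuning ntree first and mtry second, as done here, is a greedy search; a full grid over both parameters would need many more model fits.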


Neural Networks

Neural networks are one of the powerful techniques for supervised learning. We use the bag of words model with TfIdf scores as the feature representation of each document, and a network with one hidden layer.

After construction of the feature vectors, every document is represented by a vector of about 250,000 dimensions. All these vectors together form a sparse matrix called the data matrix. Because of this sparsity the data matrix of the reviews has low rank, so it is better to use dimensionality reduction to get better performance.

We use the TruncatedSVD module in scikit-learn for dimensionality reduction. To our knowledge this is the only dimensionality reduction module in scikit-learn that takes a sparse matrix as input, and it uses a randomized algorithm. Using it, we reduced each row of the data matrix to a 300-dimensional vector.

We use a simple neural network with one hidden layer of 25 neurons to train on the data. This module is taken from http://danielfrg.com/, as a supervised neural network is not implemented in scikit-learn yet.

By doing this we achieved our best accuracy, 0.86596. We could probably do better than this by increasing the number of neurons in the hidden layer, but the model then takes too much time to train.
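
The shape of the network described above (300 inputs, one hidden layer of 25 neurons, one output) can be illustrated with a forward pass in plain Python. This is only a sketch under assumptions: the sigmoid activation, the weight initialisation and all names are chosen for the example, the weights are random and untrained, and it is not the module from danielfrg.com that was used in the project.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Small random weights plus zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def layer_forward(inputs, weights, biases):
    """Apply one fully connected layer followed by a sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def predict_proba(features, hidden, output):
    """Score of the positive class for one 300-dimensional document vector."""
    hidden_activations = layer_forward(features, *hidden)
    return layer_forward(hidden_activations, *output)[0]

rng = random.Random(0)
hidden = init_layer(300, 25, rng)   # 300 SVD features -> 25 hidden neurons
output = init_layer(25, 1, rng)     # 25 hidden neurons -> 1 sentiment score
features = [rng.uniform(-1, 1) for _ in range(300)]  # stand-in for one reduced review
p = predict_proba(features, hidden, output)
print(0.0 < p < 1.0)  # True
```

Training would adjust the weights by backpropagation; each extra hidden neuron adds 301 parameters here, which is one reason larger hidden layers train more slowly.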





Other Classifiers

The other classifiers on which we focus are the Naive Bayesian classifier, Support Vector Machine (SVM), LDA and QDA.

Naive Bayesian classifier:
We use the library e1071 in R for the Bayesian classifier. With some tuning of parameters such as Laplace smoothing and the threshold value, we get an accuracy of 73.4% with five-fold cross validation on the training data.

Support Vector Machine (SVM):
We use the library e1071 in R for SVM. With some tuning of parameters, we get an accuracy of 84.03% with five-fold cross validation on the training data.

Quadratic Discriminant Analysis (QDA):
We use the qda function in R (provided by the MASS library). With a straightforward call of this function, we get an accuracy of 83.0% with five-fold cross validation on the training data.

Linear Discriminant Analysis (LDA):
We use the lda function in R (provided by the MASS library). With a straightforward call of this function, we get an accuracy of 83.01% (slightly better than QDA) with five-fold cross validation on the training data.

Results

The results (accuracies) of the various classifiers using the various processing methods are given in the table below. The processing methods are given along the rows. These are: Word2Vec, TfIdf and Bag of Words. The classifiers are given along the columns. These are: Decision Tree, Random Forest, Neural Networks, SVM, LDA, QDA and Naive Bayesian classifier.
The values with a star mark are the accuracies that we obtained from Kaggle by submitting the predictions of our classifier. The best accuracy we obtained is 86.6%. The other values are the accuracies of five-fold cross validation on the training data.





%  Decision Tree  Random Forest  Neural Nets  SVM  LDA  QDA  Bayesian Classifier

Word2Vec  72.14  82.14  84.14  84.03  83.01  83.0  73.4

TFIDF  86.6*  80.5

Bag of Words  71.25  85.14*





Issues in Implementation

Here we mention some issues that we handled during our project work. These are:

1. Negation words:
We handle this issue in the Bag of Words model. We add negated words to our vocabulary by attaching NOT_ to the words that follow a negation term.

2. Useless words that are not in the stop words:
There are some words, like "movie", which are useless for this dataset but are not in the English stop-word list. We treat these kinds of words with TfIdf.

3. Large number of inputs:
In our dataset, the input size is large. It can raise a memory error or a computation error with some classifiers. To handle these errors, we sample our dataset and randomly choose a smaller number of rows.

4. Different types of processing on the input:
For our project, different kinds of processing are available. These are bag of words, word to vector and TfIdf score. We have tried every processing method on the input with some classifier.

5. Large corpus, tables and matrices:
We use dimensionality reduction where it is required to reduce the size of a table or matrix without affecting the result.

6. Same classifier in different libraries:
In R and Python, there are many classifiers which are implemented in more than one library, for example KNN and neural nets. We have studied and used such alternative libraries in the project and reported which one gives better results.

7. Better CPU and memory requirements:
Our project requires faster machines with enough memory. We exchanged our laptops during implementation when it was possible and required.

Further Improvements

We can improve the accuracy of our classifiers by using some other processing techniques on the input. Some techniques are:

Term Document Matrix:
The Term Document Matrix in R is a useful tool to make a corpus of the dataset. TDM is available in the library tm in R. It shows which term (word) occurs how many times in which document (review). The rows of this matrix are terms and the columns are documents.

Deep learning:
Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations. There are some libraries in R and Python for deep learning.

Text Mining:
Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.

Word to Phrases:
Google is developing a tool like Word2Vec which can show relations between words and phrases of documents.

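
The layout of such a matrix (terms along the rows, documents along the columns) can be illustrated without the tm library. The sketch below is a plain Python stand-in written for this explanation; the function name and the three toy reviews are made up, and it does none of tm's preprocessing beyond lower-casing and splitting on whitespace.

```python
def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({token for tokens in tokenized for token in tokens})
    matrix = [[tokens.count(term) for tokens in tokenized] for term in terms]
    return terms, matrix

reviews = ["good movie", "bad movie", "good good plot"]
terms, matrix = term_document_matrix(reviews)
for term, row in zip(terms, matrix):
    print(term, row)
# bad [0, 1, 0]
# good [1, 0, 2]
# movie [1, 1, 0]
# plot [0, 0, 1]
```

Transposing this matrix gives the document-term layout (documents along the rows) that classifiers usually expect.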
RTextTools

RTextTools is a very good and new library in R for text analysis. It is available with R 3.2.0. Some commands of this library are:

To create a document term matrix / corpus:
DocTermMatrix <- create_matrix(train, language="english", removeNumbers=TRUE, weighting=tm::weightTfIdf)

Make a container, giving the labels and the train/test sizes:
container <- create_container(DocTermMatrix, train$sentiment, trainSize=1:100, testSize=101:125)

Train models, giving the algorithms you want to use:
models <- train_models(container, algorithms=c("SVM","NNET"))

Classify the test documents of the container with the trained models:
results <- classify_models(container, models)

View the results and other analysis of the classifiers:
analytics <- create_analytics(container, results)








References

Thanks to these references, with the help of which we have worked on this project:
https://www.kaggle.com/c/word2vec-nlp-tutorial
http://cran.r-project.org/
http://scikit-learn.org/stable/
http://stackoverflow.com/questions/tagged/machine-learning
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
http://www.rtexttools.com/
http://en.wikipedia.org/wiki/Machine_learning
https://store.continuum.io/cshop/anaconda/
https://code.google.com/p/word2vec/
ftp://hk.cse.iitk.ac.in/
http://danielfrg.com/








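Appendix: Code

a. Bag of Words with Negation Case Handled:

import re
from bs4 import BeautifulSoup

def review_to_wordlist(review, remove_stopwords=False):
    # strip HTML markup, then keep only letters, apostrophes and periods
    review_text = BeautifulSoup(review, "html.parser").get_text()
    review_text = re.sub("[^a-zA-Z.']", " ", review_text)
    document = []
    k = 0  # number of upcoming words that still get the NOT_ prefix
    # handling negation
    words = review_text.split()
    for i in range(len(words)):
        if words[i] in ["not", "didn't", "don't"]:
            k = 3
        elif k > 0:
            document.append('NOT_' + words[i])
            k -= 1
        else:
            document.append(words[i])
    return document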

b. Decision Tree:

library(tree)
library(oblique.tree)
Auto = read.csv(file = "moviereviews.csv", header = TRUE)
# Attach dataset to R
Auto = Auto[1:5000, ]

attach(Auto)
head(Auto)
length(Auto$class)
# create categorical variable
High = as.factor(ifelse(V51 >= 1, "Yes", "No"))
length(High)
Auto = data.frame(Auto, High)
Auto = Auto[, -51]  # drop the numeric label column V51
names(Auto)

# split testing and training sets

set.seed(1)
training = sample(1:nrow(Auto), 4 * nrow(Auto) / 5)
testing = -training  # every row that was not sampled for training
training_data = Auto[training, ]
dim(training_data)
testing_data = Auto[testing, ]
dim(testing_data)
testing_High = High[testing]
length(testing_High)

# fit the tree model using training data

tree_model = tree(High ~ ., training_data)
plot(tree_model)
text(tree_model, pretty = 0)
title(main = "Classification Tree")

# grow the tree as long as possible

ob_tree = oblique.tree(formula = High ~ .,
                       data = Auto,
                       oblique.splits = "on")
plot(ob_tree)
text(ob_tree)
title(main = "Maximum Grown Tree")
tree_pred = predict(ob_tree, testing_data, type = "class")
mean(tree_pred != testing_High)

# check how the model is doing using test data

tree_pred1 = predict(tree_model, testing_data, type = "class")
errr = mean(tree_pred1 != testing_High)
print(errr)

# do pruning to reduce misclassification error
# to do pruning, first do cross validation to know where to stop pruning

set.seed(2)
cv_tree = cv.tree(tree_model, FUN = prune.misclass)
names(cv_tree)
plot(cv_tree$size,
     cv_tree$dev,
     type = "b")
title(main = "Cross Validation")

# now prune tree
# create prune model

pruned_model = prune.misclass(tree_model, best = 4)
plot(pruned_model)
text(pruned_model, pretty = 0)
title(main = "Pruned Tree")

# check misclassification error % now

tree_pred2 = predict(pruned_model, testing_data, type = "class")
err = mean(tree_pred2 != testing_High)
print(err)

c. Actual working data generation in R using Google's word2vec:

project.R

# sushil kumar verma initial project in r
# date 02/04/2015
# https://www.kaggle.com/c/word2vec-nlp-tutorial/details/
# generating actual working table from google word2vec result
# google command
# ./word2vec -train ~/Desktop/sentences.txt -output ~/Desktop/vec.csv -size 50 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 5 -iter 5

setwd("C:/data")  # working directory
train = read.table("labeledTrainData.tsv", sep = "\t",
                   header = TRUE, quote = "")  # kaggle train data
vec = read.csv(file = "vec.csv", sep = " ",
               header = FALSE, na.strings = "?")  # result of google word2vec
vec = vec[, 1:51]
library(tm.plugin.webmining)  # remove html tag
library(tm)  # stopwords

# tab_frame = data.frame(tab)
# write.table(tab_frame, file = "tab_frame.csv",
#             row.names = FALSE, na = "", col.names = FALSE, sep = ",")  # writing csv
#
# tab_frame2 = read.csv(file = "tab_frame.csv", header = FALSE, na.strings = "0")  # reading tab_frame
my_cor <- Corpus(VectorSource(train$review))  # make corpus of training data

tab_request = matrix(0, nrow = 25000, ncol = 51, byrow = T)  # making final required table matrix

for (i in 1:25000)
{
  x3 = strsplit(gsub("[^[:alnum:] ]", "", my_cor[[i]]), " +")  # remove symbols
  y3 = unlist(x3)  # unlist review
  c = 0  # number of words of this review found in the word2vec vocabulary
  r1.vector = vector(mode = "numeric", length = 50)
  for (j in 1:(length(y3)))
  {
    row = which(vec$V1 == y3[j])
    if (!(is.integer(row) && length(row) == 0L))
    {
      r2.vector = as.numeric(vec[row, 2:51])
      r1.vector = as.numeric(r2.vector) + as.numeric(r1.vector)
      c = c + 1
    }
  }
  tab_request[i, 1:50] = r1.vector
  tab_request[i, ] = (tab_request[i, ]) / c  # average of the word vectors
  print(i)
}

tab_request[, 51] = train$sentiment
write.table(tab_request, file = "my_final_data.csv",
            row.names = FALSE, quote = FALSE, na = "", col.names = FALSE,
            sep = ",")  # writing output csv

# the file was written without a header row, so read it with header = FALSE
my_final_data = read.csv(file = "my_final_data.csv",
                         header = FALSE, na.strings = c("", "NA", "NULL"))

# replace missing or non-numeric entries with 0
for (i in 1:25000)
{
  for (j in 1:50)
  {
    if (is.na(my_final_data[i, j]) || !(is.numeric(my_final_data[i, j])))
    {
      my_final_data[i, j] = 0
      print(i)
    }
  }
}

write.csv(x = my_final_data, file = "final.csv", row.names = FALSE,
          quote = FALSE)

########### data generated

d. Neural Network in R:

proj_nn.R

# sushil kumar verma project using neural net

setwd("C:/data")  # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE,
                      na.strings = "?")
set.seed(17)  # seed=2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]  # shuffle the rows
attach(my_rawdata)
library(nnet)  # for nn

for (k in 1:5)
{
  print("size of hidden layer is")
  print(2 * k)
  sum = 0
  for (i in 1:5)
  {
    # i = 1
    test = c(((i - 1) * 5000 + 1):(i * 5000))  # rows of the i-th fold
    train = -test                              # all remaining rows
    testing_data = my_rawdata[test, ]
    test_output_col = V51[test]
    test_output_col = test_output_col[c(1:5000)]
    training_data = my_rawdata[train, ]

    my_model = nnet(as.factor(V51) ~ ., data = training_data, size = 2 * k,
                    maxit = 150, trace = FALSE)

    my_predict <- predict(my_model, testing_data, type = "class")
    # for (m in 1:5000) {
    #   if (my_predict[m] > 0.5) {my_predict[m] = 1}
    #   else {my_predict[m] = 0}}

    x = mean(my_predict != test_output_col)  # misclassification rate of this fold
    print((1 - x) * 100)
    sum = sum + x

  }
  print("average error is")  # the value printed below is the average accuracy
  print((1 - sum / 5) * 100)  # average result 84.144

}
# [1] "size of hidden layer is"
# [1] 1
# [1] 84.28
# [1] 84.78
# [1] 83.9
# [1] 84.08
# [1] 83.76
# [1] "average error is"
# [1] 84.16
# [1] "size of hidden layer is"
# [1] 2
# [1] 84.08
# [1] 84.64
# [1] 83.74
# [1] 84.08
# [1] 83.88
# [1] "average error is"
# [1] 84.084
# [1] "size of hidden layer is"
# [1] 4
# [1] 83.8
# [1] 84.7
# [1] 83.96
# [1] 83.48
# [1] 83.5
# [1] "average error is"
# [1] 83.888
# [1] "size of hidden layer is"
# [1] 6
# [1] 83.4
# [1] 84.42
# [1] 84.1
# [1] 83.78
# [1] 83.54
# [1] "average error is"
# [1] 83.848
# [1] "size of hidden layer is"
# [1] 8
# [1] 83.42
# [1] 83.86
# [1] 83.48
# [1] 83.54
# [1] 83.84
# [1] "average error is"
# [1] 83.628
# [1] "size of hidden layer is"
# [1] 10
# [1] 82.56
# [1] 84.44
# [1] 82.94
# [1] 83.7
# [1] 83.34
# [1] "average error is"
# [1] 83.396

e. Bayesian classifier in R:

proj_bayes.R

# sushil kumar verma project using bayes classifier

setwd("C:/data")  # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE, na.strings = "?")
set.seed(2)  # seed=2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]  # shuffle the rows
attach(my_rawdata)
library(e1071)  # for naiveBayes
sum = 0
for (i in 1:5)
{
  # i = 1
  test = c(((i - 1) * 5000 + 1):(i * 5000))  # rows of the i-th fold
  train = -test                              # all remaining rows
  testing_data = my_rawdata[test, ]
  test_output_col = V51[test]
  test_output_col = test_output_col[c(1:5000)]
  training_data = my_rawdata[train, ]

  my_model <- naiveBayes(as.factor(V51) ~ ., data = training_data)

  my_predict <- predict(my_model, testing_data, type = c("class"),
                        threshold = 0.001, eps = 0)
  # for (m in 1:5000) {
  #   if (my_predict[m] > 0.5) {my_predict[m] = 1}
  #   else {my_predict[m] = 0}}

  x = mean(my_predict != test_output_col)  # misclassification rate of this fold
  print((1 - x) * 100)
  sum = sum + x
}

## five times cross validation result
# [1] 72.76
# [1] 72.16
# [1] 72.06
# [1] 73.4
# [1] 73.1

print((1 - sum / 5) * 100)  # average result 72.7

f. LDA in R:

proj_lda.R

# sushil kumar verma project using LDA

setwd("C:/data")  # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE, na.strings = "?")
set.seed(2)  # seed=2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]  # shuffle the rows
attach(my_rawdata)
library(MASS)  # for lda
sum = 0
for (i in 1:5)
{
  # i = 1
  test = c(((i - 1) * 5000 + 1):(i * 5000))  # rows of the i-th fold
  train = -test                              # all remaining rows
  testing_data = my_rawdata[test, ]
  test_output_col = V51[test]
  test_output_col = test_output_col[c(1:5000)]
  training_data = my_rawdata[train, ]
  # V51 ~ . keeps the label out of the predictors
  z <- lda(V51 ~ ., training_data, prior = c(1, 1) / 2)
  my_lda = predict(z, testing_data[c(1:5000), ])

  x = mean(my_lda$class != testing_data$V51)  # misclassification rate of this fold
  # print((1 - x) * 100)
  sum = sum + x

}
print((1 - sum / 5) * 100)  # average result 83.74

g. QDA in R:

proj_qda.R

# sushil kumar verma project using QDA

setwd("C:/data")  # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE, na.strings = "?")
set.seed(11)  # seed=2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]  # shuffle the rows
attach(my_rawdata)
library(MASS)  # for qda
sum = 0
for (i in 1:5)
{
  # i = 1
  test = c(((i - 1) * 5000 + 1):(i * 5000))  # rows of the i-th fold
  train = -test                              # all remaining rows
  testing_data = my_rawdata[test, ]
  test_output_col = V51[test]
  test_output_col = test_output_col[c(1:5000)]
  training_data = my_rawdata[train, ]
  # V51 ~ . keeps the label out of the predictors
  z <- qda(V51 ~ ., training_data)
  my_qda = predict(z, testing_data[c(1:5000), ])

  x = mean(my_qda$class != testing_data$V51)  # misclassification rate of this fold
  # print((1 - x) * 100)
  sum = sum + x

}
print((1 - sum / 5) * 100)  # average result 83.74

h. SVM in R:

proj_svm.R

# sushil kumar verma project using svm

setwd("C:/data")  # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE, na.strings = "?")
set.seed(2)  # seed=2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]  # shuffle the rows
attach(my_rawdata)
library(e1071)  # for svm
sum = 0
for (i in 1:5)
{
  # i = 1
  test = c(((i - 1) * 5000 + 1):(i * 5000))  # rows of the i-th fold
  train = -test                              # all remaining rows
  testing_data = my_rawdata[test, ]
  test_output_col = V51[test]
  test_output_col = test_output_col[c(1:5000)]
  training_data = my_rawdata[train, ]

  my_model <- svm(V51 ~ ., data = training_data, type = "C-classification")

  my_predict <- predict(my_model, testing_data)
  my_predict = (unlist(my_predict, use.names = FALSE))
  x = mean(my_predict != test_output_col)  # misclassification rate of this fold
  print((1 - x) * 100)
  sum = sum + x
}

print((1 - sum / 5) * 100)

i. Random Forest in R:

# generating random forests for ntree=302, and varying mtry
library(randomForest)
# attach(my_final_data)

setwd("/home/rishabh/IITK/ML")  # working directory
my_final_data = read.csv(file = "final.csv", header = TRUE,
                         na.strings = c("", "NA", "NULL"))

# shuffle the rows
my_data = my_final_data[sample(1:nrow(my_final_data),
                               nrow(my_final_data)), ]

names(my_data)
attach(my_data)

my_final_data[18959, ]
error = array(1:5)
acc = array(1:4)
set.seed(2)
k = 1
i = 302  # ntree

m = 2    # mtry, doubled after every round
while (m <= 8)
{
  sum = 0
  for (j in 1:5)
  {
    test = c(((5000 * (j - 1)) + 1):(5000 * (j)))  # rows of the j-th fold
    train = -test                                  # all remaining rows

    training_data = my_data[train, ]
    testing_data = my_data[test, ]

    label = V51[test]
    label = label[c(1:5000)]

    rand_For = randomForest(as.factor(V51) ~ ., training_data, ntree = i,
                            mtry = m, na.action = na.omit)

    tree_pred = predict(rand_For, testing_data, type = "response")

    # print(tree_pred)
    # print(label)
    error1 = mean(tree_pred != label)
    error[j] = error1
    print(error1)
    sum = sum + error1
  }

  m = m * 2
  avg = sum / 5
  # print(avg)
  accuracy = 1 - avg
  print(error)
  print(accuracy)
  acc[k] = accuracy
  k = k + 1
}

j. Use of RTextTools in R:

test.R

# sushil: RTEXTTOOLS in project
# reference http://www.rtexttools.com/

library(RTextTools)  # rtexttools
library(tm)

setwd("C:/data")  # working directory
train = read.table("labeledTrainData.tsv", sep = "\t",
                   header = TRUE, quote = "")  # kaggle train data

DocTermMatrix <- create_matrix(train, language = "english",
                               removeNumbers = TRUE, stemWords = TRUE, weighting = tm::weightTfIdf)
container <- create_container(DocTermMatrix, train$sentiment,
                              trainSize = 1:100, testSize = 101:125, virgin = FALSE)
models <- train_models(container, algorithms = c("NNET"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
# analytics@algorithm_summary  #: SUMMARY OF PRECISION, RECALL, F-SCORES, AND ACCURACY SORTED BY TOPIC CODE FOR EACH ALGORITHM
# analytics@label_summary      #: SUMMARY OF LABEL (e.g. TOPIC) ACCURACY
# analytics@document_summary   #: RAW SUMMARY OF ALL DATA AND SCORING
# analytics@ensemble_summary   #: SUMMARY OF ENSEMBLE PRECISION/COVERAGE. USES THE n VARIABLE PASSED INTO create_analytics()

summary(analytics)
# ENSEMBLE SUMMARY
#
#        n-ENSEMBLE COVERAGE  n-ENSEMBLE RECALL
# n >= 1                   1               0.78
#
#
# ALGORITHM PERFORMANCE
#
# SVM_PRECISION  SVM_RECALL  SVM_FSCORE
#         0.805       0.785       0.780

models <- train_models(container, algorithms = c("MAXENT"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)
# ENSEMBLE SUMMARY
#
#        n-ENSEMBLE COVERAGE  n-ENSEMBLE RECALL
# n >= 1                   1               0.82
#
#
# ALGORITHM PERFORMANCE
#
# MAXENTROPY_PRECISION  MAXENTROPY_RECALL  MAXENTROPY_FSCORE
#                0.825              0.820              0.820
