(Project in Machine Learning, CS771)
Department of CSE, IIT Kanpur
Report with Code
Submitted to:
Prof. Harish Karnick
By:
Venu Gopal Reddy (14111043)
Sushil Kumar Verma (14111038)
Rishabh Dev Shukla (14111029)
Banothu Raj Kumar (14111007)
Acknowledgement
Secondly, we would also like to thank our group members, who helped us a lot in finalizing this project within the limited time frame.
Contents
About this project
About dataset
Basic idea
Bag of words
TfIdf
Distributed representation
Google's word2vec
Decision tree
Random forest
Neural networks
Other classifiers
Results
Issues in implementation
Further improvements
RTextTools
References
Appendix: Code
About This Project
The IMDB movie reviews are text data; we converted these text reviews into numerical data tables using the Bag of Words, TfIdf, and Google's Word2Vec techniques.
The accuracy values marked with * in the Results section of this document are the accuracies obtained on Kaggle submissions.
About Dataset
Labeled train data: 25000 instances
Test data: 25000 instances
These datasets are provided on Kaggle.
Details:
id - unique ID of each review
sentiment - sentiment of the review: 1 for positive reviews and 0 for negative reviews
review - text of the review
Basic Idea
Basically, there are three major steps in our project:
1. processing the dataset
2. creating a classifier model
3. training and testing the classifier
Bag of Words
Bag of Words is one of the most widely used techniques in natural language processing. In this model, every document is represented as a multiset of the words present in the document. The model does not consider grammar or even word ordering. This multiset can be represented in vector form to give the feature vector. The issue with this model is that it is not good enough to handle negation. Negation means expressing thoughts using negative words like "not", "don't", etc.; for example, instead of saying "bad movie" someone can say "not good movie" as well.
If we use the raw Bag of Words model, we will lose the specific nature of some words, and they will become more general words. For example, let our vocabulary list be [This, is, not, good, movie]; then documents X and Y can be represented as below, where each entry represents the number of times that word occurs in the document.
X: This is good movie: [1 1 0 1 1]
Y: This is not good movie: [1 1 1 1 1]
Let X be labeled as a positive review and Y as a negative review. The general meaning of "good" is positive, and any review containing this word tends towards being a positive review. But because of negation, this word occurs in both positive and negative reviews, so it loses its positive nature and becomes a general word like "this", which is present in both reviews.
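As a minimal Python sketch (assuming scikit-learn is available), the X and Y vectors above can be reproduced with a fixed vocabulary:

# Minimal Bag of Words sketch; the fixed vocabulary mirrors the
# [This, is, not, good, movie] list above (lowercased by the vectorizer).
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(vocabulary=["this", "is", "not", "good", "movie"])
docs = ["This is good movie", "This is not good movie"]
print(vectorizer.transform(docs).toarray())
# [[1 1 0 1 1]
#  [1 1 1 1 1]]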
TfIdf
There are still some problems with the Bag of Words model: some words are very frequent and are present in almost every document. We can remove such words using a stop-words list. Stop words are global to a particular language; they are the words most frequently used in that language. As this list is global to the language, it does not handle the frequent words of a given dataset.
TfIdf (Term Frequency-Inverse Document Frequency):
This score contains two factors, Tf and Idf; by multiplying these two factors we get the TfIdf score of a word:
Tf(w, d) = f(w, d)
Idf(w) = log(N / n_w)
TfIdf(w, d) = Tf(w, d) x Idf(w)
where f(w, d) is the frequency of word w in document d, N is the total number of documents in the dataset, and n_w is the number of documents that contain w.
Here, the Tf score tells the importance of a word to a particular document. Idf is a scaling factor that tells how frequent the word is in the whole dataset. For very frequent (and very rare) words this score is low compared to other words, so we can eliminate them by ignoring the words with low TfIdf scores.
The representation of the feature vector is just like the Bag of Words model, but the frequency of each word is replaced with its TfIdf score.
Implementation: we use TfidfVectorizer from scikit-learn to calculate these scores and build the feature vectors. Here we consider a word for the TfIdf calculation only if it is present in at least 5 documents; like this, we eliminate rare words a priori.
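A minimal sketch of this step, assuming scikit-learn is available; reviews stands for the list of 25000 training review strings:

# TfIdf feature extraction sketch. min_df=5 keeps a word only if it
# appears in at least 5 documents, eliminating rare words a priori;
# stop_words="english" drops the global stop-words list discussed above.
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf_features(reviews):
    vectorizer = TfidfVectorizer(min_df=5, stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(reviews)   # sparse matrix
    return vectorizer, tfidf_matrix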
Distributed Representation
There are some more problems we cannot handle with Bag of Words models: these models represent each word as an atomic quantity and do not capture any information about the similarity of words. For example, "hotel" and "motel" are close in meaning, but if we represent them using the Bag of Words model, the relation between "hotel" and "motel" is the same as that between "hotel" and "home", as this model treats all words the same.
Instead, we assign each word a point in a d-dimensional vector space such that all similar words are clustered together: all positive words are together, all negative words are together and far from the positive words, and all neutral words are at the same distance from the positive and negative words.
We use Google's Word2Vec to represent the vector embeddings of words.
Google's Word2Vec
Several factors influence the quality of the word vectors:
amount and quality of the training data
size of the vectors
training algorithm
Word2vec is also available in Python, in the gensim library by Radim Rehurek.
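A minimal sketch of training word vectors with gensim (assumed installed; the toy sentences stand in for the tokenized reviews). Note that in recent gensim versions the size and iter arguments are named vector_size and epochs:

# Train 50-dimensional word vectors with gensim's Word2Vec; the
# parameters mirror the word2vec command given below.
from gensim.models import Word2Vec

sentences = [["this", "is", "a", "good", "movie"],
             ["this", "is", "not", "a", "good", "movie"]]
model = Word2Vec(sentences, size=50, window=5, sample=1e-4,
                 negative=5, hs=0, sg=0, iter=5, min_count=1)
vector = model.wv["movie"]   # the 50-dimensional embedding of "movie"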
Command in Linux to generate word2vec from the terminal (the same command appears in the appendix code):
./word2vec -train ~/Desktop/sentences.txt -output ~/Desktop/vec.csv -size 50 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 5 -iter 5
Decision Tree
The rpart library in R gave the best accuracy, 73.06%, for CP = 0.1; we observed an accuracy of 70.14% at CP = 0.01 and 70.89% at CP = 0.02. This is because, as the CP value decreases, the height of the tree increases beyond a certain point, which leads to a decrease in accuracy.
A fully grown tree using the oblique.tree library gave an accuracy of 70.84%, and after pruning the accuracy increased to 75.32%, because in pruning we remove the sections of the tree that provide little power to classify instances. This grow-and-prune technique is used in order to avoid the horizon effect.
Splits based on the number of instances were examined by varying the minsplit parameter in R; this gave a best accuracy of 74.32% at minsplit = 83 and an accuracy of 72.87% at minsplit = 75. As the number of instances required at a node decreases (relative to the optimum), the height of the tree increases, and thus the accuracy decreases.
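Our decision-tree code is given in Appendix b. As an analogous Python sketch (not the rpart/oblique.tree code we used), scikit-learn's ccp_alpha plays a role similar to rpart's CP, and min_samples_split to minsplit:

# Cost-complexity-pruned decision tree with 5-fold cross-validation.
# X, y stand for the feature matrix and the sentiment labels.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def tree_accuracy(X, y, alpha=0.01, minsplit=83):
    clf = DecisionTreeClassifier(ccp_alpha=alpha,
                                 min_samples_split=minsplit)
    return cross_val_score(clf, X, y, cv=5).mean()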
Random Forest
We generated random forests on our preprocessed data using the randomForest library of R and did 5-fold cross-validation. First we varied the ntree (number of trees to grow) parameter of our randomForest model. We started with ntree = 2 and varied it up to 400; our results are as follows:

ntree:    2     52      102    152   202    252    302     352     400
Accuracy: 72.8  81.288  81.74  81.7  81.81  81.92  82.064  82.008  82.01
Then we fixed ntree to be 302, for which we got the best result, and varied the mtry parameter of our random forest model.
mtry: number of variables randomly sampled as candidates at each split.
We generated random forests for ntree = 302 and mtry = (2, 4, 8). The results are as follows:

mtry:     2      4       8
Accuracy: 81.62  82.148  81.936

So the best accuracy which we got from generating random forests was 82.148%, for ntree = 302 and mtry = 4.
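Our random-forest code is given in Appendix i. An analogous Python sketch (scikit-learn assumed; n_estimators corresponds to ntree and max_features to mtry) of the best setting:

# Random forest with the best parameters found above.
# X, y stand for the feature matrix and the sentiment labels.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forest_accuracy(X, y, ntree=302, mtry=4):
    clf = RandomForestClassifier(n_estimators=ntree, max_features=mtry)
    return cross_val_score(clf, X, y, cv=5).mean()   # 5-fold CV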
Neural Networks
Neural networks are one of the most powerful techniques for supervised learning. We use the Bag of Words model with the TfIdf score feature representation to represent each document, and a network with one hidden layer.
We use the TruncatedSVD module in scikit-learn for dimensionality reduction. This is the only dimensionality reduction module there that takes a sparse matrix as input, and it is a randomized process; using it, we reduced the data matrix to 300-dimensional vectors.
We use a simple neural network with one hidden layer of 25 neurons to train on the data. This module is taken from http://danielfrg.com/, as a supervised neural network is not implemented in scikit-learn yet.
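A minimal sketch of the dimensionality-reduction step, assuming scikit-learn; tfidf_matrix stands for the sparse document-term matrix built earlier:

# Reduce the sparse TfIdf matrix to 300 dimensions with TruncatedSVD,
# which accepts sparse input directly and uses a randomized algorithm.
from sklearn.decomposition import TruncatedSVD

def reduce_to_300(tfidf_matrix):
    svd = TruncatedSVD(n_components=300, random_state=0)
    return svd.fit_transform(tfidf_matrix)   # dense 300-d vectors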
Other Classifiers
The other classifiers on which we focused are the Naive Bayes classifier, Support Vector Machine (SVM), LDA, and QDA.
Naive Bayes classifier:
We used the library e1071 in R for the Bayes classifier. With some tuning of parameters like Laplace smoothing and the threshold value, we got an accuracy of 73.4% with five-fold cross-validation on the training data.
Support Vector Machine (SVM):
We used the library e1071 in R for SVM. With some tuning of parameters, we got an accuracy of 84.03% with five-fold cross-validation on the training data.
Quadratic Discriminant Analysis (QDA):
We used the qda() function (from the MASS library) in R for QDA. With the use of this straightforward function, we got an accuracy of 83.0% with five-fold cross-validation on the training data.
Linear Discriminant Analysis (LDA):
We used the lda() function (from the MASS library) in R for LDA. With the use of this straightforward function, we got an accuracy of 83.01% (slightly better than QDA) with five-fold cross-validation on the training data.
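Our R code for these classifiers is given in Appendix e-h. An analogous Python sketch (scikit-learn assumed; X, y stand for the features and labels) runs all four with five-fold cross-validation:

# Cross-validate the four remaining classifiers, as in the report.
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

def other_classifier_accuracies(X, y):
    classifiers = {"Naive Bayes": GaussianNB(), "SVM": SVC(),
                   "LDA": LinearDiscriminantAnalysis(),
                   "QDA": QuadraticDiscriminantAnalysis()}
    return {name: cross_val_score(clf, X, y, cv=5).mean()
            for name, clf in classifiers.items()}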
Results
The results (accuracies) of the various classifiers using the various processing methods are given in the table below. Processing methods are given along the rows: Word2Vec, TfIdf, and Bag of Words. The classifiers are given along the columns: Decision Tree, Random Forest, Neural Networks, SVM, LDA, QDA, and Naive Bayes classifier.
The values with a star mark (*) are the accuracies which we got from Kaggle by submitting our classifier's predictions. The best accuracy we got is 86.6%. The other values are the accuracies of five-fold cross-validation on the training data.
[Results table: rows are Word2Vec, TfIdf, and Bag of Words; columns are Decision Tree, Random Forest, Neural Nets, SVM, LDA, QDA, and Bayesian Classifier. Of the original entries, only the TfIdf row values 86.6* and 80.5 are recoverable.]
Issues in Implementation
Here we mention some issues which we handled during our project work. These are:
1. Negation words:
We handled this issue in the Bag of Words model. We add negation-marked words to our vocabulary by prefixing NOT_ to the few words that follow a negation word; the code is given in Appendix a.
2. Useless words that are not stop words:
There are some words, like "movie", which are useless for this dataset but are not among the English stop words. We treat these kinds of words with TfIdf.
3. Large number of inputs:
In our dataset, the input size is large. It can raise a memory or computation error with some classifiers. We sample our dataset, choosing a smaller number of rows randomly, to handle these errors.
4. Different types of processing on the input:
For our project, different kinds of processing are available: Bag of Words, Word2Vec, and TfIdf scores. We applied each kind of processing to the input with some classifier.
5. Large corpus, tables, and matrices:
We use dimensionality reduction where it is required, to reduce the size of a table or matrix without affecting the result.
6. The same classifier in different libraries:
In R and Python, many classifiers are implemented in more than one library, for example KNN and neural nets. We studied and used both libraries in the project and reported which one is better.
7. CPU and memory requirements:
Our project requires faster machines with enough memory. We exchanged our laptops during implementation when it was possible and required.
Further Improvements
Term-Document Matrix:
The TermDocumentMatrix in R is a useful tool to build a corpus representation of a dataset. It is available in the library tm in R. It shows which term (word) occurs how many times in which document (review). The rows of this matrix are terms and the columns are documents.
Deep learning:
Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations. There are some libraries in R and Python for deep learning.
Text mining:
Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output.
Words to phrases:
Google is developing tools like Word2Vec which can show the relations of words to phrases of documents.
RTextTools
RTextTools is an R package that bundles several text classification algorithms (such as SVM, NNET, and MAXENT) behind a single interface. Our usage of it, with the resulting accuracies, is given in Appendix j.
References
Thanks to these references, from which we worked on this project:
https://www.kaggle.com/c/word2vec-nlp-tutorial
http://cran.r-project.org/
http://scikit-learn.org/stable/
http://stackoverflow.com/questions/tagged/machine-learning
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
http://www.rtexttools.com/
http://en.wikipedia.org/wiki/Machine_learning
https://store.continuum.io/cshop/anaconda/
https://code.google.com/p/word2vec/
ftp://hk.cse.iitk.ac.in/
http://danielfrg.com/
Appendix: Code
a. Bag of Words with Negation Case Handled:
# Preprocessing with negation handling: the three words following a
# negation word are prefixed with "NOT_" so that, e.g., "not good"
# keeps a distinct feature from "good".
import re
from bs4 import BeautifulSoup

def review_to_wordlist(review, remove_stopwords=False):
    review_text = BeautifulSoup(review).get_text()        # strip HTML tags
    review_text = re.sub("[^a-zA-Z.']", " ", review_text) # keep letters, ., '
    document = []
    k = 0
    # handling negation
    words = review_text.split()
    for i in range(len(words)):
        if words[i] in ["not", "didn't", "don't"]:
            k = 3                     # mark the next 3 words as negated
        elif k > 0:
            document.append('NOT_' + words[i])
            k = k - 1
        else:
            document.append(words[i])
    return document
b. Decision Tree:
library(tree)
library(oblique.tree)
Auto = read.csv(file = "moviereviews.csv", header = TRUE)
# attach dataset to R
Auto = Auto[1:5000, ]
attach(Auto)
head(Auto)
length(Auto$class)
# create categorical variable
High = ifelse(V51 >= 1, "Yes", "No")
length(High)
Auto = data.frame(Auto, High)
Auto = Auto[, -51]   # drop the numeric label column
names(Auto)
# split testing and training sets
set.seed(1)
training = sample(1:nrow(Auto), 4 * nrow(Auto) / 5)
testing = -training
training_data = Auto[training, ]
dim(training_data)
testing_data = Auto[testing, ]
dim(testing_data)
testing_High = High[testing]
length(testing_High)
# fit the tree model using the training data
tree_model = tree(High ~ ., training_data)
plot(tree_model)
text(tree_model, pretty = 0)
title(main = "Classification Tree")
# grow the tree as large as possible
ob_tree = oblique.tree(formula = High ~ .,
                       data = Auto,
                       oblique.splits = "on")
plot(ob_tree)
text(ob_tree)
title(main = "Maximum Grown Tree")
tree_pred = predict(ob_tree, testing_data, type = "class")
mean(tree_pred != testing_High)
# check how the model does using test data
tree_pred1 = predict(tree_model, testing_data, type = "class")
errr = mean(tree_pred1 != testing_High)
print(errr)
# do pruning to reduce the misclassification error;
# first do cross-validation to know where to stop pruning
set.seed(2)
cv_tree = cv.tree(tree_model, FUN = prune.misclass)
names(cv_tree)
plot(cv_tree$size,
     cv_tree$dev,
     type = "b")
title(main = "Cross Validation")
# now prune the tree: create the pruned model
pruned_model = prune.misclass(tree_model, best = 4)
plot(pruned_model)
text(pruned_model, pretty = 0)
title(main = "Pruned Tree")
# check the misclassification error now
tree_pred2 = predict(pruned_model, testing_data, type = "class")
err = mean(tree_pred2 != testing_High)
print(err)
c. Actual working data generation in R using Google's word2vec:
project.R
# sushil kumar verma - initial project in R
# date 02/04/2015
# https://www.kaggle.com/c/word2vec-nlp-tutorial/details/
# generating the actual working table from the google word2vec result
# google command:
# ./word2vec -train ~/Desktop/sentences.txt -output ~/Desktop/vec.csv
#   -size 50 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 5 -iter 5
setwd("C:/data")   # working directory
train = read.table("labeledTrainData.tsv", sep = "\t",
                   header = TRUE, quote = "")   # kaggle train data
vec = read.csv(file = "vec.csv", sep = " ",
               header = FALSE, na.strings = "?")   # result of google word2vec
vec = vec[, 1:51]
library(tm.plugin.webmining)   # remove html tags
library(tm)                    # stop words
# tab_frame = data.frame(tab)
# write.table(tab_frame, file = "tab_frame.csv",
#             row.names = FALSE, na = "", col.names = FALSE, sep = ",")
# tab_frame2 = read.csv(file = "tab_frame.csv", header = FALSE,
#                       na.strings = "0")   # reading tab_frame
my_cor <- Corpus(VectorSource(train$review))   # make corpus of training data
tab_request = matrix(0, nrow = 25000, ncol = 51, byrow = T)  # final table matrix
for (i in 1:25000)
{
  x3 = strsplit(gsub("[^[:alnum:] ]", "", my_cor[[i]]), " +")  # remove symbols
  y3 = unlist(x3)   # unlist review
  c = 0
  r1.vector = vector(mode = "numeric", length = 50)
  for (j in 1:(length(y3)))
  {
    row = which(vec$V1 == y3[j])
    if (!(is.integer(row) && length(row) == 0L))
    { r2.vector = as.numeric(vec[row, 2:51])
      r1.vector = as.numeric(r2.vector) + as.numeric(r1.vector)
      c = c + 1 }
  }
  tab_request[i, 1:50] = r1.vector
  tab_request[i, ] = (tab_request[i, ]) / c   # average the word vectors
  print(i)
}
tab_request[, 51] = train$sentiment
write.table(tab_request, file = "my_final_data.csv",
            row.names = FALSE, quote = FALSE, na = "", col.names = FALSE,
            sep = ",")   # writing output csv
# read back (added so the NA-fixing loop below operates on the written file)
my_final_data = read.csv(file = "my_final_data.csv", header = FALSE)
for (i in 1:25000)
{
  for (j in 1:50)
  {
    if (is.na(my_final_data[i, j]) || !(is.numeric(my_final_data[i, j])))
    {
      my_final_data[i, j] = 0   # replace NA/NaN entries by 0
      print(i)
    }
  }
}
write.csv(x = my_final_data, file = "final.csv", row.names = FALSE,
          quote = FALSE)
########### data generated
d. Neural Network in R:
proj_nn.R
# sushil kumar verma - project using a neural net
setwd("C:/data")   # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE,
                      na.strings = "?")
set.seed(17)   # seed = 2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]   # shuffle the rows
attach(my_rawdata)
library(nnet)   # for nnet
for (k in 1:5)
{ print("size of hidden layer is")
  print(2 * k)
  sum = 0
  for (i in 1:5)
  {
    # i = 1
    test = c(((i - 1) * 5000 + 1):(i * 5000))
    train = -test
    testing_data = my_rawdata[test, ]
    test_output_col = V51[test]
    test_output_col = test_output_col[c(1:5000)]
    training_data = my_rawdata[train, ]
    # hidden layer size taken from the loop variable, matching the size
    # printed above (the size argument was garbled in the original listing)
    my_model = nnet(as.factor(V51) ~ ., data = training_data,
                    size = 2 * k, maxit = 150, trace = FALSE)
    my_predict <- predict(my_model, testing_data, type = "class")
    # for (m in 1:5000) {
    #   if (my_predict[m] > 0.5) { my_predict[m] = 1 }
    #   else { my_predict[m] = 0 } }
    x = mean(my_predict != test_output_col)
    print((1 - x) * 100)
    sum = sum + x
  }
  print("average accuracy is")
  print((1 - sum / 5) * 100)   # average result 84.144
}
# [1] "size of hidden layer is"
# [1] 1
# [1] 84.28
# [1] 84.78
# [1] 83.9
# [1] 84.08
# [1] 83.76
# [1] "average accuracy is"
# [1] 84.16
# [1] "size of hidden layer is"
# [1] 2
# [1] 84.08
# [1] 84.64
# [1] 83.74
# [1] 84.08
# [1] 83.88
# [1] "average accuracy is"
# [1] 84.084
# [1] "size of hidden layer is"
# [1] 4
# [1] 83.8
# [1] 84.7
# [1] 83.96
# [1] 83.48
# [1] 83.5
# [1] "average accuracy is"
# [1] 83.888
# [1] "size of hidden layer is"
# [1] 6
# [1] 83.4
# [1] 84.42
# [1] 84.1
# [1] 83.78
# [1] 83.54
# [1] "average accuracy is"
# [1] 83.848
# [1] "size of hidden layer is"
# [1] 8
# [1] 83.42
# [1] 83.86
# [1] 83.48
# [1] 83.54
# [1] 83.84
# [1] "average accuracy is"
# [1] 83.628
# [1] "size of hidden layer is"
# [1] 10
# [1] 82.56
# [1] 84.44
# [1] 82.94
# [1] 83.7
# [1] 83.34
# [1] "average accuracy is"
# [1] 83.396
e. Bayesian classifier in R:
proj_bayes.R
# sushil kumar verma - project using the naive Bayes classifier
setwd("C:/data")   # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE, na.strings = "?")
set.seed(2)   # seed = 2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]   # shuffle the rows
attach(my_rawdata)
library(e1071)   # for naiveBayes
sum = 0
for (i in 1:5)
{
  # i = 1
  test = c(((i - 1) * 5000 + 1):(i * 5000))
  train = -test
  testing_data = my_rawdata[test, ]
  test_output_col = V51[test]
  test_output_col = test_output_col[c(1:5000)]
  training_data = my_rawdata[train, ]
  my_model <- naiveBayes(as.factor(V51) ~ ., data = training_data)
  my_predict <- predict(my_model, testing_data)   # added: compute predictions
  x = mean(my_predict != test_output_col)
  print((1 - x) * 100)
  sum = sum + x
}
## five-fold cross-validation results
# [1] 72.76
# [1] 72.16
# [1] 72.06
# [1] 73.4
# [1] 73.1
print((1 - sum / 5) * 100)   # average result 72.7
f. LDA in R:
proj_lda.R
# sushil kumar verma - project using LDA
setwd("C:/data")   # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE, na.strings = "?")
set.seed(2)   # seed = 2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]   # shuffle the rows
attach(my_rawdata)
library(MASS)   # for lda
sum = 0
for (i in 1:5)
{
  # i = 1
  test = c(((i - 1) * 5000 + 1):(i * 5000))
  train = -test
  testing_data = my_rawdata[test, ]
  test_output_col = V51[test]
  test_output_col = test_output_col[c(1:5000)]
  training_data = my_rawdata[train, ]
  z <- lda(V51 ~ ., data = training_data, prior = c(1, 1) / 2)
  my_lda = predict(z, testing_data[c(1:5000), ])
  x = mean(my_lda$class != testing_data$V51)
  # print((1 - x) * 100)
  sum = sum + x
}
print((1 - sum / 5) * 100)   # average result 83.74
g. QDA in R:
proj_qda.R
# sushil kumar verma - project using QDA
setwd("C:/data")   # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE, na.strings = "?")
set.seed(11)   # seed = 2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]   # shuffle the rows
attach(my_rawdata)
library(MASS)   # for qda
sum = 0
for (i in 1:5)
{
  # i = 1
  test = c(((i - 1) * 5000 + 1):(i * 5000))
  train = -test
  testing_data = my_rawdata[test, ]
  test_output_col = V51[test]
  test_output_col = test_output_col[c(1:5000)]
  training_data = my_rawdata[train, ]
  z <- qda(V51 ~ ., data = training_data)
  my_qda = predict(z, testing_data[c(1:5000), ])
  x = mean(my_qda$class != testing_data$V51)
  # print((1 - x) * 100)
  sum = sum + x
}
print((1 - sum / 5) * 100)   # average result 83.74
h. SVM in R:
proj_svm.R
# sushil kumar verma - project using svm
setwd("C:/data")   # working directory
my_rawdata = read.csv(file = "final.csv", header = TRUE, na.strings = "?")
set.seed(2)   # seed = 2
my_rawdata <- my_rawdata[sample(nrow(my_rawdata)), ]   # shuffle the rows
attach(my_rawdata)
library(e1071)   # for svm
sum = 0
for (i in 1:5)
{
  # i = 1
  test = c(((i - 1) * 5000 + 1):(i * 5000))
  train = -test
  testing_data = my_rawdata[test, ]
  test_output_col = V51[test]
  test_output_col = test_output_col[c(1:5000)]
  training_data = my_rawdata[train, ]
  my_model <- svm(as.factor(V51) ~ ., data = training_data,
                  type = "C-classification")
  my_predict <- predict(my_model, testing_data)
  my_predict = (unlist(my_predict, use.names = FALSE))
  x = mean(my_predict != test_output_col)
  print((1 - x) * 100)
  sum = sum + x
}
print((1 - sum / 5) * 100)
i. Random Forest in R:
# generating random forests for ntree = 302 and varying mtry
library(randomForest)
# attach(my_final_data)
setwd("/home/rishabh/IITK/ML")   # working directory
my_final_data = read.csv(file = "final.csv", header = TRUE,
                         na.strings = c("", "NA", "NULL"))
my_data = my_final_data[sample(1:nrow(my_final_data),
                               nrow(my_final_data)), ]   # shuffle the rows
names(my_data)
attach(my_data)
my_final_data[18959, ]
error = array(1:5)
acc = array(1:4)
set.seed(2)
k = 1
i = 302   # ntree
m = 2     # starting mtry
while (m <= 8)
{
  sum = 0
  j = 1
  for (j in 1:5)
  {
    test = c(((5000 * (j - 1)) + 1):(5000 * (j)))
    train = -test
    training_data = my_data[train, ]
    testing_data = my_data[test, ]
    label = V51[test]
    label = label[c(1:5000)]
    rand_For = randomForest(as.factor(V51) ~ ., training_data, ntree = i,
                            mtry = m, na.action = na.omit)
    tree_pred = predict(rand_For, testing_data, type = "response")
    # print(tree_pred)
    # print(label)
    error1 = mean(tree_pred != label)
    error[j] = error1
    print(error1)
    sum = sum + error1
  }
  m = m * 2
  avg = sum / 5
  # print(avg)
  accuracy = 1 - avg
  print(error)
  print(accuracy)
  acc[k] = accuracy
  k = k + 1
}
j. Use of RTextTools in R:
test.R
# sushil: RTextTools in project
# reference: http://www.rtexttools.com/
library(RTextTools)   # rtexttools
library(tm)
setwd("C:/data")   # working directory
train = read.table("labeledTrainData.tsv", sep = "\t",
                   header = TRUE, quote = "")   # kaggle train data
DocTermMatrix <- create_matrix(train$review, language = "english",
                               removeNumbers = TRUE, stemWords = TRUE,
                               weighting = tm::weightTfIdf)
container <- create_container(DocTermMatrix, train$sentiment,
                              trainSize = 1:100, testSize = 101:125,
                              virgin = FALSE)
models <- train_models(container, algorithms = c("NNET"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
# analytics@algorithm_summary: summary of precision, recall, F-scores,
#   and accuracy sorted by topic code for each algorithm
# analytics@label_summary: summary of label (e.g. topic) accuracy
# analytics@document_summary: raw summary of all data and scoring
# analytics@ensemble_summary: summary of ensemble precision/coverage;
#   uses the n variable passed into create_analytics()
summary(analytics)
# ENSEMBLE SUMMARY
#
#        n-ENSEMBLE COVERAGE  n-ENSEMBLE RECALL
# n >= 1                   1               0.78
#
#
# ALGORITHM PERFORMANCE
#
# SVM_PRECISION  SVM_RECALL  SVM_FSCORE
#         0.805       0.785       0.780
models <- train_models(container, algorithms = c("MAXENT"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)
# ENSEMBLE SUMMARY
#
#        n-ENSEMBLE COVERAGE  n-ENSEMBLE RECALL
# n >= 1                   1               0.82
#
#
# ALGORITHM PERFORMANCE
#
# MAXENTROPY_PRECISION  MAXENTROPY_RECALL  MAXENTROPY_FSCORE
#                0.825              0.820              0.820