Natural Language Processing in a Kaggle Competition for Movie Reviews

I decided to try playing around with a Kaggle competition. In this case, I entered the "When bag of words meets bags of popcorn" contest. This contest isn't for money; it is just a way to learn about various machine learning approaches.

The competition was trying to showcase Google's Word2Vec. This essentially uses deep learning to find features in text that can be used to help in classification tasks.
Specifically, in the case of this contest, the goal involves labeling the sentiment of a movie review from IMDB. Ratings were on a 10 point scale, and any review of 7 or greater was considered a positive movie review.

Originally, I was going to try out Word2Vec and train it on unlabeled reviews, but then one of the competitors pointed out that you could simply use a less complicated classifier to do this and still get a good result.

I took this basic inspiration and tried a few different classifiers to see what I could come up with. The highest my score reached was 6th place, back in December of 2014, but then people started using ensemble methods to combine various models together and get a perfect score after a lot of fine tuning of the ensemble weights.

Hopefully, this post will help you understand some basic NLP (Natural Language Processing) techniques, along with some tips on using scikit-learn to make your classification models.

Cleaning the Reviews
The first thing we need to do is create a simple function that will clean the reviews into a format we can use. We just want the raw text, not all of the other associated HTML, symbols, or other junk.

We will need a couple of very nice libraries for this task: BeautifulSoup for taking care of anything HTML related, and re for regular expressions.
import re
from bs4 import BeautifulSoup

Now set up our function. This will clean all of the reviews for us.
def review_to_wordlist(review):
    '''
    Meant for converting each of the IMDB reviews into a list of words.
    '''
    # First remove the HTML.
    review_text = BeautifulSoup(review).get_text()

    # Use regular expressions to only include words.
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # Convert words to lower case and split them into separate words.
    words = review_text.lower().split()

    # Return a list of words.
    return(words)

Great! Now it is time to go ahead and load our data in. For this, pandas is definitely the library of choice. If you want to follow along with a downloaded version of the attached IPython notebook yourself, make sure you obtain the data from Kaggle. You will need a Kaggle account in order to access it.
import pandas as pd

# Import both the training and test data.
train = pd.read_csv('labeledTrainData.tsv', header=0,
                    delimiter="\t", quoting=3)
test = pd.read_csv('testData.tsv', header=0,
                   delimiter="\t", quoting=3)

Now it is time to get the labels from the training set for our reviews. That way, we can teach our classifier which reviews are positive vs. negative.
y_train = train['sentiment']

Now we need to clean both the train and test data to get it ready for the next part of our program.
traindata = []
for i in xrange(0, len(train['review'])):
    traindata.append(" ".join(review_to_wordlist(train['review'][i])))

testdata = []
for i in xrange(0, len(test['review'])):
    testdata.append(" ".join(review_to_wordlist(test['review'][i])))

TF-IDF Vectorization
The next thing we are going to do is make TF-IDF (term frequency-inverse document frequency) vectors of our reviews. In case you are not familiar with what this is doing, essentially we are going to evaluate how often a certain term occurs in a review, but normalize this somewhat by how many reviews that term also occurs in. Wikipedia has an explanation that is sufficient if you want further information.

This can be a great technique for helping to determine which words (or n-grams of words) will make good features to classify a review as positive or negative.
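If you want to see that normalization in action before we apply it to the reviews, here is a tiny made-up example (just a sketch, not part of the competition pipeline; the sentences and variable names are only for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "documents" for illustration only.
toy_docs = ["the movie was great",
            "the movie was terrible",
            "the acting was great"]

toy_tfv = TfidfVectorizer()
toy_vectors = toy_tfv.fit_transform(toy_docs)

# "the" and "was" appear in every document, so their IDF weighting is low.
# "terrible" appears in only one document, so it gets a relatively high
# weight in that document's vector.
print toy_tfv.get_feature_names()
print toy_vectors.toarray()

The distinctive words end up with the largest values in their documents, which is exactly the property we want when building features for a classifier.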
To do this, we are going to use the TF-IDF vectorizer from scikit-learn. Then, decide what settings to use. The documentation for the TF-IDF class is available here.

In the case of the example code on Kaggle, they decided to remove all stop words, along with n-grams up to a size of two (you could use more, but this will require a LOT of memory, so be careful which settings you use!).
from sklearn.feature_extraction.text import TfidfVectorizer as TFIV

tfv = TFIV(min_df=3, max_features=None,
           strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
           ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1,
           stop_words='english')

Now that we have the vectorization object, we need to run this on all of the data (both training and testing) to make sure it is applied to both data sets. This could take some time on your computer!
X_all = traindata + testdata  # Combine both to fit the TF-IDF vectorization.
lentrain = len(traindata)

tfv.fit(X_all)  # This is the slow part!
X_all = tfv.transform(X_all)

X = X_all[:lentrain]  # Separate back into training and test sets.
X_test = X_all[lentrain:]

Making Our Classifiers
Because we are working with text data, and we just made feature vectors of every word (that isn't a stop word, of course) in all of the reviews, we are going to have some quite large sparse matrices to deal with. Just to show you what I mean, let's examine the shape of our training set.
X.shape

(25000, 309798)

That means we have 25,000 training examples (or rows) and 309,798 features (or columns). We need something that is going to be somewhat computationally efficient given how many features we have. Using something like a random forest to classify would be unwieldy (plus random forests can't yet work with sparse matrices in scikit-learn). That means we need something lightweight and fast that scales well to many dimensions. Some possible candidates are:
Naive Bayes
Logistic Regression
SGD Classifier (utilizes Stochastic Gradient Descent for much faster runtime)
Let's just try all three as submissions to Kaggle and see how they perform.
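Before that, if you are curious just how sparse X actually is, here is a quick check you could run (just a sketch using the scipy sparse matrix produced by the vectorizer above; it is not part of the submission pipeline):

# Count the stored (non-zero) entries and compare with the full matrix size.
n_rows, n_cols = X.shape
density = float(X.nnz) / (n_rows * n_cols)

print "Non-zero entries:", X.nnz
print "Fraction of entries that are non-zero: %.5f" % density

Only a tiny fraction of the entries are non-zero, which is why lightweight, sparse-friendly models are a better fit here.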
First up: Logistic Regression (see the scikit-learn documentation here).

While in theory L1 regularization should work well because p >> n (many more features than training examples), I actually found through a lot of testing that L2 regularization got better results. You could set up your own trials using scikit-learn's built-in GridSearchCV class, which makes things a lot easier to try. I found through my testing that using a parameter C of 30 got the best results.
from sklearn.linear_model import LogisticRegression as LR
from sklearn.grid_search import GridSearchCV

grid_values = {'C': [30]}  # Decide which settings you want for the grid search.

model_LR = GridSearchCV(LR(penalty='L2', dual=True, random_state=0),
                        grid_values, scoring='roc_auc', cv=20)
# Try to set the scoring on what the contest is asking for.
# The contest says scoring is for area under the ROC curve, so use this.

model_LR.fit(X, y_train)  # Fit the model.

GridSearchCV(cv=20, estimator=LogisticRegression(C=1.0, class_weight=None, dual=True,
        fit_intercept=True, intercept_scaling=1, penalty='L2', random_state=0, tol=0.0001),
        fit_params={}, iid=True, loss_func=None, n_jobs=1,
        param_grid={'C': [30]}, pre_dispatch='2*n_jobs', refit=True,
        score_func=None, scoring='roc_auc', verbose=0)

You can investigate which parameters did the best and what scores they received by looking at the model_LR object.
model_LR.grid_scores_

[mean: 0.96459, std: 0.00489, params: {'C': 30}]
model_LR.best_estimator_

LogisticRegression(C=30, class_weight=None, dual=True, fit_intercept=True,
        intercept_scaling=1, penalty='L2', random_state=0, tol=0.0001)

Feel free, if you have an interactive version of the notebook, to play around with various settings inside the grid_values object to optimize your ROC AUC score. Otherwise, let's move on to the next classifier, Naive Bayes.

Unlike Logistic Regression, Naive Bayes doesn't have a regularization parameter to tune. You just have to choose which flavor of Naive Bayes to use.

According to the documentation on Naive Bayes from scikit-learn, Multinomial is our best version to use, since we no longer have just a 1 or 0 for a word feature: it has been normalized by TF-IDF, so our values will be BETWEEN 0 and 1 (most of the time, although having a few TF-IDF scores exceed 1 is technically possible). If we were just looking at word occurrence vectors (with no counting), Bernoulli would have been a better fit since it is based on binary values.
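Just to make that distinction concrete, here is a sketch (not used for the submission) of what the Bernoulli alternative would look like; BernoulliNB's binarize parameter turns any non-zero TF-IDF value into a 1, i.e. "word present".

from sklearn.naive_bayes import BernoulliNB

# The alternative flavor: treat each feature as simply present/absent.
# binarize=0.0 converts any value greater than 0 into a 1 before fitting.
model_BNB = BernoulliNB(binarize=0.0)
model_BNB.fit(X, y_train)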
Let's make our Multinomial Naive Bayes object, and train it.
from sklearn.naive_bayes import MultinomialNB as MNB

model_NB = MNB()
model_NB.fit(X, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Pretty fast, right? This speed comes at a price, however. Naive Bayes assumes all of your features are ENTIRELY independent from each other. In the case of word vectors, that seems like a somewhat reasonable assumption, but with the n-grams we included, that probably isn't always the case. Because of this, Naive Bayes tends to be less accurate than other classification algorithms, especially if you have a smaller number of training examples.
Why don't we see how Naive Bayes does (at least in a 20-fold CV comparison) so we have a rough idea of how well it performs compared to our Logistic Regression classifier?

You could use a grid search again, but that seems like overkill. There is a simpler method we can import from scikit-learn for this task.
from sklearn.cross_validation import cross_val_score
import numpy as np

print "20 Fold CV Score for Multinomial Naive Bayes: ", \
    np.mean(cross_val_score(model_NB, X, y_train, cv=20, scoring='roc_auc'))

# This will give us a 20-fold cross-validation score that looks at ROC AUC so we can compare with the Logistic Regression model.

20 Fold CV Score for Multinomial Naive Bayes: 0.949631232

Well, it wasn't quite as good as our well-tuned Logistic Regression classifier, but that is a pretty good score considering how little we had to do!

One last classifier to try is the SGD classifier, which comes in handy when you need speed on a really large number of training examples/features.

Which machine learning algorithm it ends up using depends on what you set for the loss function. If we chose loss='log', it would essentially be identical to our previous logistic regression model. We want to try something different, but we also want a loss option that includes probabilities. We need those probabilities if we are going to be able to calculate the area under a ROC curve. Looking at the documentation, it seems a modified_huber loss would do the trick! This will be a Support Vector Machine that uses a linear kernel.
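As a quick sanity check of that reasoning (just a sketch, not part of the submission code): in recent versions of scikit-learn, SGDClassifier only exposes predict_proba when the loss is 'log' or 'modified_huber', so a plain hinge-loss SVM would not give us the probabilities we need.

from sklearn.linear_model import SGDClassifier

# A hinge-loss model (a standard linear SVM) has no predict_proba method...
hinge_model = SGDClassifier(loss='hinge')
print hasattr(hinge_model, 'predict_proba')    # False

# ...while modified_huber provides probability estimates, which we need
# for computing the area under the ROC curve.
huber_model = SGDClassifier(loss='modified_huber')
print hasattr(huber_model, 'predict_proba')    # True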
from sklearn.linear_model import SGDClassifier as SGD
sgd_params = {'alpha': [0.00006, 0.00007, 0.00008, 0.0001, 0.0005]}  # Regularization parameter values to try.

model_SGD = GridSearchCV(SGD(random_state=0, shuffle=True, loss='modified_huber'),
                         sgd_params, scoring='roc_auc', cv=20)

model_SGD.fit(X, y_train)  # Fit the model.

GridSearchCV(cv=20, estimator=SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1,
        loss='modified_huber', n_iter=5, n_jobs=1, penalty='l2',
        power_t=0.5, random_state=0, shuffle=True, verbose=0,
        warm_start=False),
        fit_params={}, iid=True, loss_func=None, n_jobs=1,
        param_grid={'alpha': [6e-05, 7e-05, 8e-05, 0.0001, 0.0005]},
        pre_dispatch='2*n_jobs', refit=True, score_func=None,
        scoring='roc_auc', verbose=0)

Again, similar to the Logistic Regression model, we can see which parameter did the best.
model_SGD.grid_scores_

[mean: 0.96477, std: 0.00484, params: {'alpha': 6e-05},
 mean: 0.96484, std: 0.00481, params: {'alpha': 7e-05},
 mean: 0.96486, std: 0.00480, params: {'alpha': 8e-05},
 mean: 0.96479, std: 0.00480, params: {'alpha': 0.0001},
 mean: 0.95869, std: 0.00484, params: {'alpha': 0.0005}]

Looks like this beat our previous Logistic Regression model by a very small amount.

Now that we have our three models, we can work on submitting our final scores in the proper format. The contest participants found that submitting the predicted probability of each review being positive, instead of the final predicted class, worked better for evaluation, so we want to output this instead.
First, do our Logistic Regression submission.

LR_result = model_LR.predict_proba(X_test)[:, 1]  # We only need the probabilities that the review is positive.

LR_output = pd.DataFrame(data={"id": test["id"], "sentiment": LR_result})  # Create our dataframe for output.

LR_output.to_csv('Logistic_Reg_Proj2.csv', index=False, quoting=3)  # Get the .csv file we will submit to Kaggle.

Repeat this with the other two.
# Repeat this for Multinomial Naive Bayes.
MNB_result = model_NB.predict_proba(X_test)[:, 1]
MNB_output = pd.DataFrame(data={"id": test["id"], "sentiment": MNB_result})
MNB_output.to_csv('MNB_Proj2.csv', index=False, quoting=3)

# Last, do the Stochastic Gradient Descent model with modified Huber loss.
SGD_result = model_SGD.predict_proba(X_test)[:, 1]
SGD_output = pd.DataFrame(data={"id": test["id"], "sentiment": SGD_result})
SGD_output.to_csv('SGD_Proj2.csv', index=False, quoting=3)

Submitting the SGD result (using the linear SVM with modified Huber loss), I received a score of 0.95673 on the Kaggle leaderboard. That was good enough for sixth place back in December of 2014.

Ideas for Improvement and Summary
In this post, we examined a text classification problem and cleaned unstructured review data. Next, we created a vector of features using TF-IDF normalization on a Bag of Words. We then trained these features on three different classifiers, some of which were optimized using 20-fold cross-validation, and made a submission to a Kaggle competition.

Possible ideas for improvement:
Try increasing the number of n-grams to 3 or 4 in the TF-IDF vectorization and see if this makes a difference
Blend the models together into an ensemble that uses a majority vote for the classifiers (a simple blending variant is sketched after this list)
Try utilizing Word2Vec and creating feature vectors from the unlabeled training data. More data usually helps!
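For the blending idea, here is a minimal sketch of one simple variant: instead of a hard majority vote, average the predicted probabilities of the three models (a "soft" vote), since the contest is scored on ROC AUC and expects probabilities anyway. It reuses the result arrays created above, and the output filename is just made up for this example.

# Sketch: blend the three submissions by averaging their predicted probabilities.
# You could also weight the models differently and tune those weights by cross-validation.
blend_result = (LR_result + MNB_result + SGD_result) / 3.0

blend_output = pd.DataFrame(data={"id": test["id"], "sentiment": blend_result})
blend_output.to_csv('Blended_Proj2.csv', index=False, quoting=3)  # Hypothetical filename.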
If you would like the IPython Notebook for this blog post, you can find it here.

Written on March 13, 2015
