Natural Language Processing in a Kaggle Competition for Movie Reviews
Jesse Steinweg-Woods, Ph.D., Data Scientist
Source: https://jessesw.com/NLPMovieReviews/
I decided to try playing around with a Kaggle competition. In this case, I entered the "When bag of words meets bags of popcorn" contest. This contest isn't for money; it is just a way to learn about various machine learning approaches.

The competition was trying to showcase Google's Word2Vec. This essentially uses deep learning to find features in text that can be used to help in classification tasks.
Specifically, in the case of this contest, the goal involves labeling the sentiment of a movie review from IMDB. Ratings were on a 10-point scale, and any review of 7 or greater was considered a positive movie review.

Originally, I was going to try out Word2Vec and train it on unlabeled reviews, but then one of the competitors pointed out that you could simply use a less complicated classifier to do this and still get a good result.

I decided to take this basic inspiration and try a few different classifiers to see what I could come up with. The highest my score reached was 6th place, back in December of 2014, but then people started using ensemble methods to combine various models together and get a near-perfect score after a lot of fine-tuning of the ensemble weights.

Hopefully, this post will help you understand some basic NLP (Natural Language Processing) techniques, along with some tips on using scikit-learn to make your classification models.
Cleaning the Reviews
The first thing we need to do is create a simple function that will clean the reviews into a format we can use. We just want the raw text, not all of the other associated HTML, symbols, or other junk.

We will need a couple of very nice libraries for this task: BeautifulSoup for taking care of anything HTML related, and re for regular expressions.
import re
from bs4 import BeautifulSoup
Now set up our function. This will clean all of the reviews for us.
def review_to_wordlist(review):
    '''
    Meant for converting each of the IMDB reviews into a list of words.
    '''
    # First remove the HTML.
    review_text = BeautifulSoup(review).get_text()

    # Use regular expressions to only include words.
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # Convert words to lower case and split them into separate words.
    words = review_text.lower().split()

    # Return a list of words.
    return words
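To see what the regex and tokenization steps do on their own, here is a minimal sketch with the BeautifulSoup HTML stripping omitted (`simple_wordlist` is just an illustrative name, not part of the project):

```python
import re

def simple_wordlist(text):
    # Replace everything that is not a letter with a space,
    # then lowercase and split on whitespace.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower().split()

print(simple_wordlist("This movie was GREAT!!! 10/10, would watch again."))
# -> ['this', 'movie', 'was', 'great', 'would', 'watch', 'again']
```

Note that the digits and punctuation vanish entirely; only alphabetic tokens survive.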
Great! Now it is time to go ahead and load our data in. For this, pandas is definitely the library of choice. If you want to follow along with a downloaded version of the attached IPython notebook yourself, make sure you obtain the data from Kaggle. You will need a Kaggle account in order to access it.
import pandas as pd

# Import both the training and test data.
train = pd.read_csv('labeledTrainData.tsv', header=0,
                    delimiter="\t", quoting=3)
test = pd.read_csv('testData.tsv', header=0, delimiter="\t",
                   quoting=3)
Now it is time to get the labels from the training set for our reviews. That way, we can teach our classifier which reviews are positive vs. negative.
y_train = train['sentiment']
Now we need to clean both the train and test data to get it ready for the next part of our program.
traindata = []
for i in xrange(0, len(train['review'])):
    traindata.append(" ".join(review_to_wordlist(train['review'][i])))

testdata = []
for i in xrange(0, len(test['review'])):
    testdata.append(" ".join(review_to_wordlist(test['review'][i])))
TF-IDF Vectorization
The next thing we are going to do is make TF-IDF (term frequency-inverse document frequency) vectors of our reviews. In case you are not familiar with what this is doing, essentially we are going to evaluate how often a certain term occurs in a review, but normalize this somewhat by how many reviews a certain term also occurs in. Wikipedia has an explanation that is sufficient if you want further information.

This can be a great technique for helping to determine which words (or ngrams of words) will make good features to classify a review as positive or negative.
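As a toy illustration of the arithmetic (using the plain log(N/df) variant of IDF, not scikit-learn's exact smoothed formula):

```python
import math

# Toy corpus of three "reviews", already tokenized.
docs = [["great", "movie", "great"],
        ["terrible", "movie"],
        ["great", "acting"]]

def tfidf(term, doc, docs):
    # Term frequency: raw count of the term in this document.
    tf = doc.count(term)
    # Document frequency: how many documents contain the term.
    df = sum(1 for d in docs if term in d)
    # Up-weight terms frequent in this doc but rare across docs.
    return tf * math.log(len(docs) / float(df))

# "movie" appears in 2 of 3 docs, so it is down-weighted
# relative to "terrible", which appears in only 1 of 3.
print(tfidf("movie", docs[1], docs))     # tf=1, idf=log(3/2)
print(tfidf("terrible", docs[1], docs))  # tf=1, idf=log(3/1)
```

A word like "the" that appeared in every review would get idf = log(1) = 0, which is the same intuition behind removing stop words.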
To do this, we are going to use the TF-IDF vectorizer from scikit-learn. Then we need to decide what settings to use. The documentation for the TF-IDF class is available here.

In the case of the example code on Kaggle, they decided to remove all stop words, along with ngrams up to a size of two (you could use more, but this will require a LOT of memory, so be careful which settings you use!).
from sklearn.feature_extraction.text import TfidfVectorizer as TFIV

tfv = TFIV(min_df=3, max_features=None,
           strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
           ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1,
           stop_words='english')
Now that we have the vectorization object, we need to run this on all of the data (both training and testing) to make sure it is applied to both datasets. This could take
some time on your computer!
X_all = traindata + testdata  # Combine both to fit the TF-IDF vectorization.
lentrain = len(traindata)

tfv.fit(X_all)  # This is the slow part!
X_all = tfv.transform(X_all)

X = X_all[:lentrain]  # Separate back into training and test sets.
X_test = X_all[lentrain:]
Making Our Classifiers
Because we are working with text data, and we just made feature vectors of every word (that isn't a stop word, of course) in all of the reviews, we are going to have sparse matrices to deal with that are quite large in size. Just to show you what I mean, let's examine the shape of our training set.
X.shape

(25000, 309798)
That means we have 25,000 training examples (or rows) and 309,798 features (or columns). We need something that is going to be somewhat computationally efficient given how many features we have. Using something like a random forest to classify would be unwieldy (plus, random forests can't work with sparse matrices yet in scikit-learn anyway). That means we need something lightweight and fast that scales to many dimensions well. Some possible candidates are:

- Naive Bayes
- Logistic Regression
- SGD Classifier (utilizes Stochastic Gradient Descent for much faster runtime)

Let's just try all three as submissions to Kaggle and see how they perform.
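One reason linear models cope so well here: a prediction is just a dot product, and with a sparse review vector only the handful of nonzero features contribute. A minimal sketch, using a dict of {feature index: value} as a stand-in for a sparse row (illustrative only, not scikit-learn's actual implementation, and the numbers are made up):

```python
# Sparse review: only the nonzero TF-IDF entries, keyed by feature index.
review = {12: 0.4, 95801: 0.7, 240003: 0.2}

# A few of the ~310k learned weights of a linear classifier (toy values).
weights = {12: 1.5, 95801: -2.0, 240003: 0.5}

def sparse_decision(review, weights, bias=0.0):
    # Cost scales with the nonzeros in the review,
    # not with all 309,798 columns.
    return bias + sum(value * weights.get(idx, 0.0)
                      for idx, value in review.items())

score = sparse_decision(review, weights)
print(score)  # 0.4*1.5 + 0.7*(-2.0) + 0.2*0.5, roughly -0.7
```

A tree ensemble, by contrast, would need to consider splits over the full feature space, which is why it is unwieldy at this dimensionality.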
First up: Logistic Regression (see the scikit-learn documentation here).

While in theory L1 regularization should work well because p >> n (many more features than training examples), I actually found through a lot of testing that L2 regularization got better results. You could set up your own trials using scikit-learn's built-in GridSearch class, which makes things a lot easier to try. I found through my testing that using a parameter C of 30 got the best results.
from sklearn.linear_model import LogisticRegression as LR
from sklearn.grid_search import GridSearchCV

grid_values = {'C': [30]}  # Decide which settings you want for the grid search.

model_LR = GridSearchCV(LR(penalty='l2', dual=True, random_state=0),
                        grid_values, scoring='roc_auc', cv=20)
# Try to set the scoring on what the contest is asking for.
# The contest says scoring is area under the ROC curve, so use this.
model_LR.fit(X, y_train)  # Fit the model.
GridSearchCV(cv=20, estimator=LogisticRegression(C=1.0, class_weight=None, dual=True,
        fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=0),
        fit_params={}, iid=True, loss_func=None, n_jobs=1,
        param_grid={'C': [30]}, pre_dispatch='2*n_jobs', refit=True,
        score_func=None, scoring='roc_auc', verbose=0)
You can investigate which parameters did the best and what scores they received by looking at the model_LR object.
model_LR.grid_scores_

[mean: 0.96459, std: 0.00489, params: {'C': 30}]
model_LR.best_estimator_

LogisticRegression(C=30, class_weight=None, dual=True, fit_intercept=True,
        intercept_scaling=1, penalty='l2', random_state=0, tol=0.0001)
Feel free, if you have an interactive version of the notebook, to play around with various settings inside the grid_values object to optimize your ROC AUC score. Otherwise, let's move on to the next classifier, Naive Bayes.

Unlike Logistic Regression, Naive Bayes doesn't have a regularization parameter to tune. You just have to choose which "flavor" of Naive Bayes to use.

According to the documentation on Naive Bayes from scikit-learn, Multinomial is our best version to use, since we no longer have just a 1 or 0 for a word feature: it has been normalized by TF-IDF, so our values will be BETWEEN 0 and 1 (most of the time, although having a few TF-IDF scores exceed 1 is technically possible). If we were just looking at word occurrence vectors (with no counting), Bernoulli would have been a better fit since it is based on binary values.
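The reason fractional TF-IDF values slot in naturally: the multinomial model's class score is just a log-prior plus feature-weighted log-likelihoods, and nothing in that sum requires the weights to be integer counts. A bare-bones sketch with made-up parameters (illustrative, not scikit-learn's implementation):

```python
import math

def mnb_log_score(x, log_prior, log_theta):
    # log P(c) + sum_i x_i * log P(feature_i | c); each x_i may be a
    # fractional TF-IDF weight rather than an integer word count.
    return log_prior + sum(xi * lt for xi, lt in zip(x, log_theta))

# Toy 3-feature model for a "positive" class (made-up parameters).
log_prior_pos = math.log(0.5)
log_theta_pos = [math.log(0.6), math.log(0.3), math.log(0.1)]

x = [0.9, 0.1, 0.0]  # TF-IDF-style fractional feature values
print(mnb_log_score(x, log_prior_pos, log_theta_pos))
```

The Bernoulli variant instead models each feature as present/absent, which throws away exactly the graded information TF-IDF gives us.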
Let's make our Multinomial Naive Bayes object, and train it.
from sklearn.naive_bayes import MultinomialNB as MNB

model_NB = MNB()
model_NB.fit(X, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Pretty fast, right? This speed comes at a price, however. Naive Bayes assumes all of your features are ENTIRELY independent from each other. In the case of word vectors, that seems like a somewhat reasonable assumption, but with the ngrams we
included, that probably isn't always the case. Because of this, Naive Bayes tends to be less accurate than other classification algorithms, especially if you have a smaller number of training examples.

Why don't we see how Naive Bayes does (at least in a 20-fold CV comparison) so we have a rough idea of how well it performs compared to our Logistic Regression classifier?
You could use GridSearch again, but that seems like overkill. There is a simpler method we can import from scikit-learn for this task.
from sklearn.cross_validation import cross_val_score
import numpy as np

# This will give us a 20-fold cross-validation score that looks at ROC AUC
# so we can compare with our Logistic Regression model.
print "20 Fold CV Score for Multinomial Naive Bayes: ", np.mean(cross_val_score(
    model_NB, X, y_train, cv=20, scoring='roc_auc'))

20 Fold CV Score for Multinomial Naive Bayes: 0.949631232
Well, it wasn't quite as good as our well-tuned Logistic Regression classifier, but that is a pretty good score considering how little we had to do!

One last classifier to try is the SGD classifier, which comes in handy when you need speed on a really large number of training examples/features.
Which machine learning algorithm it ends up using depends on what you set for the loss function. If we chose loss='log', it would essentially be identical to our previous Logistic Regression model. We want to try something different, but we also want a loss option that includes probabilities. We need those probabilities if we are going to be able to calculate the area under a ROC curve. Looking at the documentation, it seems a modified_huber loss would do the trick! This will be a Support Vector Machine that uses a linear kernel.
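To see why graded scores matter for this metric: ROC AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. A tiny rank-based sketch (equivalent to, but not the same algorithm as, scikit-learn's implementation):

```python
def roc_auc(y_true, scores):
    # P(score of random positive > score of random negative),
    # counting ties as half a win.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / float(len(pos) * len(neg))

y = [1, 1, 0, 0]
print(roc_auc(y, [0.9, 0.6, 0.4, 0.2]))  # perfect ranking -> 1.0
print(roc_auc(y, [1, 0, 1, 0]))          # ties plus an inversion -> 0.5
```

With hard 0/1 predictions every positive gets the same score, so the metric cannot reward a model for being more confident about the clear-cut reviews; hence the need for predict_proba (or modified_huber, which supports it).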
from sklearn.linear_model import SGDClassifier as SGD
# Regularization parameter alpha.
sgd_params = {'alpha': [0.00006, 0.00007, 0.00008, 0.0001, 0.0005]}

model_SGD = GridSearchCV(SGD(random_state=0, shuffle=True, loss='modified_huber'),
                         sgd_params, scoring='roc_auc', cv=20)
model_SGD.fit(X, y_train)  # Fit the model.
GridSearchCV(cv=20, estimator=SGDClassifier(alpha=0.0001, class_weight=None,
        loss='modified_huber', n_iter=5, n_jobs=1, penalty='l2',
        power_t=0.5, random_state=0, shuffle=True, verbose=0,
        warm_start=False),
        fit_params={}, iid=True, loss_func=None, n_jobs=1,
        param_grid={'alpha': [6e-05, 7e-05, 8e-05, 0.0001, 0.0005]},
        pre_dispatch='2*n_jobs', refit=True, score_func=None,
        scoring='roc_auc', verbose=0)
Again, similar to the Logistic Regression model, we can see which parameter did the best.
model_SGD.grid_scores_

[mean: 0.96477, std: 0.00484, params: {'alpha': 6e-05},
 mean: 0.96484, std: 0.00481, params: {'alpha': 7e-05},
 mean: 0.96486, std: 0.00480, params: {'alpha': 8e-05},
 mean: 0.96479, std: 0.00480, params: {'alpha': 0.0001},
 mean: 0.95869, std: 0.00484, params: {'alpha': 0.0005}]
Looks like this beat our previous Logistic Regression model by a very small amount.

Now that we have our three models, we can work on submitting our final scores in the proper format. It was found that submitting the predicted probabilities instead of the final predicted class worked better for evaluation, according to the contest
participants, so we want to output this instead.
First, do our Logistic Regression submission.
# We only need the probabilities that the review is positive.
LR_result = model_LR.predict_proba(X_test)[:, 1]
# Create our dataframe that will be written.
LR_output = pd.DataFrame(data={"id": test["id"], "sentiment": LR_result})
# Get the .csv file we will submit.
LR_output.to_csv('Logistic_Reg_Proj2.csv', index=False, quoting=3)
Repeat this with the other two.
# Repeat this for Multinomial Naive Bayes.
MNB_result = model_NB.predict_proba(X_test)[:, 1]
MNB_output = pd.DataFrame(data={"id": test["id"], "sentiment": MNB_result})
MNB_output.to_csv('MNB_Proj2.csv', index=False, quoting=3)

# Last, do the Stochastic Gradient Descent model with modified Huber loss.
SGD_result = model_SGD.predict_proba(X_test)[:, 1]
SGD_output = pd.DataFrame(data={"id": test["id"], "sentiment": SGD_result})
SGD_output.to_csv('SGD_Proj2.csv', index=False, quoting=3)
Submitting the SGD result (using the linear SVM with modified Huber loss), I received a score of 0.95673 on the Kaggle leaderboard. That was good enough for sixth place back in December of 2014.
Ideas for Improvement and Summary
In this post, we examined a text classification problem and cleaned unstructured review data. Next, we created a vector of features using TF-IDF normalization on a Bag of Words. We then trained these features on three different classifiers, some of which were optimized using 20-fold cross-validation, and made a submission to a Kaggle competition.

Possible ideas for improvement:
- Try increasing the number of ngrams to 3 or 4 in the TF-IDF vectorization and see if this makes a difference
- Blend the models together into an ensemble that uses a majority vote for the classifiers
- Try utilizing Word2Vec and creating feature vectors from the unlabeled training data. More data usually helps!
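The blending idea above can be a hard majority vote or, since the contest scores probabilities, a weighted average of each model's predicted probabilities. A sketch of the soft version (the weights and probabilities here are toy values; the top contest entries tuned such weights heavily):

```python
def blend(prob_lists, weights):
    # Weighted average of per-model predicted probabilities,
    # one blended probability per review.
    total = float(sum(weights))
    return [sum(w * p[i] for w, p in zip(weights, prob_lists)) / total
            for i in range(len(prob_lists[0]))]

# Toy positive-class probabilities for two reviews from three models.
lr_probs  = [0.92, 0.30]
mnb_probs = [0.85, 0.40]
sgd_probs = [0.95, 0.25]

print(blend([lr_probs, mnb_probs, sgd_probs], [2, 1, 2]))
```

Soft blending tends to work better than a hard vote for AUC-style metrics because it preserves each model's confidence rather than collapsing it to 0/1.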
If you would like the IPython Notebook for this blog post, you can find it here.

Written on March 13, 2015