
Keeping it fresh: Predict Restaurant Inspections

EECS 349 Machine Learning Final Project
By: Bhavita Jaiswal, Kirti Maharwal, Rawan Alharbi, Surabhi Ravishankar

Task:
Our goal is to predict the number of health violations found during an inspection of a
restaurant in the city of Boston on a specific date using Yelp data. The City of Boston
regularly inspects restaurants to monitor whether they are following food safety and
public health rules. It records health violations for all restaurants at three different
levels: * (one star) "minor", ** (two stars) "major", and *** (three stars) "severe".
Currently the health inspections are random, which wastes time and effort on inspecting
clean restaurants and misses opportunities to improve health and hygiene at places with
more serious food safety issues.

Data:
We retrieved the data from DrivenData, which hosts social challenges in the field of data
science. We were provided historical hygiene violation records from the City of Boston
and Yelp's consumer review data, which includes business descriptions, restaurant
reviews, restaurant tips, user review history, and check-ins. We used the Yelp restaurant
reviews for predicting violations. The data has 1,911 records for restaurants, 27,088
examples of historical hygiene violations and 228,805 examples of restaurant reviews for
the given restaurants in the city of Boston. We used 90% of the data for training and 10%
for validation. For testing we used the test data provided by DrivenData and submitted
our results on the website.

Figure 1: Plot of violations for all 3 levels

Methods:
Because the data came from different sources, we first mapped the Boston city hygiene
data to the Yelp reviews. Our first approach was to use business data, review data
(except review text) and Boston's historical violation data for modeling. We applied
linear regression models and the resulting RMSLE was 3.1991. Then we segregated the data
by season and applied linear regression, which gave an RMSLE of 3.56. We realized that
restaurant reviews would play an important role in the prediction of violations.
So in the next approach we used only Yelp reviews for predicting violations. We
processed the review text written before the inspection date and applied scikit-learn's
TF-IDF (term frequency, inverse document frequency) to create a feature matrix of the
words. This resulted in a sparse matrix, from which we chose 1500 features for modeling.
The 1500 features in the sparse matrix acted as the input, and the count of violations
for each category (*, **, and ***) was the output.
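
The feature extraction step described above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn's TfidfVectorizer with its max_features parameter was the mechanism for keeping 1500 features; the report does not state the exact settings, and the three documents below are made-up stand-ins for the concatenated reviews of a restaurant before an inspection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: one string per (restaurant, inspection) pair, built by
# joining all reviews written before that inspection date.
documents = [
    "dirty tables and a rude waiter, found a hair in my soup",
    "great pizza, clean kitchen, friendly staff",
    "the restroom was filthy and the food arrived cold",
]

# max_features keeps only the most frequent terms in the corpus. The report
# used 1500; this toy corpus has far fewer distinct words, so all are kept.
vectorizer = TfidfVectorizer(max_features=1500)
X = vectorizer.fit_transform(documents)  # SciPy sparse matrix, one row per document

print(X.shape)
print(sorted(vectorizer.vocabulary_)[:5])  # a few of the learned terms
```

Each row of X would then be paired with the three violation counts (*, **, ***) recorded at the corresponding inspection.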

We used root mean squared logarithmic error (RMSLE) as the evaluation criterion and set
an Ordinary Least Squares regression model as our benchmark (RMSLE = 1.1386). We applied
Ridge regression, Naive Bayes, nearest neighbors, SVM, generalized boosted trees, and
random forest, among others. For nearest neighbors we tried different numbers of
neighbors, and the best RMSLE was achieved with 3 nearest neighbors.
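
For reference, the evaluation metric can be written in a few lines. The sketch below shows the standard unweighted RMSLE. The results table reports a weighted RMSLE, but the weights are not given in this report, so they are left out here. The counts are invented for illustration.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error for two equal-length sequences of counts."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2  # log1p(x) = log(1 + x), defined at 0
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical violation counts for four inspections at one severity level.
actual = [0, 3, 1, 7]
predicted = [1, 3, 0, 5]
print(rmsle(predicted, actual))
```

Because the error is taken on log(1 + count), missing a violation at a clean restaurant costs more than being off by one at a restaurant with many violations. Lower values are better.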
We also tried sentiment analysis on the text of all the reviews written before the
violation date and added the polarity score as an additional feature, which gave us a
slightly better score than the previous scores.

Figure 2: Plot of RMSLE vs number of nearest neighbors
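
A search over the number of neighbors like the one plotted in Figure 2 might look like the sketch below. It assumes scikit-learn's KNeighborsRegressor and uses synthetic data in place of the TF-IDF matrix and violation counts, so the scores it prints say nothing about the real task. The range of k values tried is also an assumption.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and the three violation counts
# (*, **, ***); the real features and targets are not reproduced here.
X = rng.random((200, 20))
y = rng.poisson(lam=[3.0, 1.0, 0.5], size=(200, 3))

# 90% / 10% train/validation split, as in the report.
split = int(0.9 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

def rmsle(pred, true):
    return float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(true)) ** 2)))

# Fit one model per candidate k and record its validation RMSLE.
scores = {}
for k in range(1, 11):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    scores[k] = rmsle(model.predict(X_val), y_val)

best_k = min(scores, key=scores.get)  # lowest RMSLE wins
print(best_k, scores[best_k])
```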

Then we varied the number of features to determine whether it has any effect on the
performance of the various algorithms. The graph shows that the number of features does
not have much influence on the performance of the algorithms.

Figure 3: Plot of performance of different algorithms using different numbers of features

As the last approach gave good results, we tested whether using just one year of review
data before the inspection would perform better. We used the best performing algorithm
from the above approach for this test. It gave an RMSLE of 1.08 with the random forest
estimator.
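
Restricting the reviews to the year before an inspection amounts to a simple date filter. The sketch below is one way to express it. The function name, the (date, text) tuple layout and the 365-day window are illustrative assumptions, not details taken from the project code.

```python
from datetime import date, timedelta

def reviews_in_window(reviews, inspection_date, days=365):
    """Return the texts of reviews dated within `days` before the inspection date."""
    start = inspection_date - timedelta(days=days)
    return [text for review_date, text in reviews
            if start <= review_date < inspection_date]

# Hypothetical (date, text) pairs for one restaurant.
reviews = [
    (date(2012, 3, 1), "old review"),
    (date(2014, 1, 15), "recent review"),
    (date(2014, 6, 30), "written after the inspection"),
]
recent = reviews_in_window(reviews, inspection_date=date(2014, 6, 1))
print(recent)
```

Only the selected texts would then be joined and passed to the TF-IDF step.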

Results:
While Ridge regression, Naive Bayes and SVM performed below the benchmark (a higher
RMSLE), nearest neighbors (K = 3) and generalized boosted trees performed above it (a
lower RMSLE). The best performance was achieved using the Random Forest model, which
gave an RMSLE of 0.9992. We used 500 estimators and 5 maximum features for the Random
Forest.
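
The Random Forest configuration above could be set up roughly as follows. This sketch assumes scikit-learn's RandomForestRegressor and that "5 maximum features" refers to its max_features parameter. The data is synthetic, so the printed score is unrelated to the 0.9992 reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the TF-IDF features and the (*, **, ***) counts.
X = rng.random((300, 50))
y = rng.poisson(lam=[3.0, 1.0, 0.5], size=(300, 3))

# 90% / 10% train/validation split, as in the report.
split = int(0.9 * len(X))

# 500 trees, each split considering 5 randomly chosen features.
model = RandomForestRegressor(n_estimators=500, max_features=5, random_state=0)
model.fit(X[:split], y[:split])

# One prediction per severity level for every validation inspection.
pred = model.predict(X[split:])
val_rmsle = float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(y[split:])) ** 2)))
print(pred.shape, val_rmsle)
```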

Table 1: Summary of the approaches tried, the algorithms applied and the accuracy
achieved

| Approach | Regression on | Algorithm | Result (Weighted RMSLE) |
|---|---|---|---|
| Performing Machine Learning Algorithms without text analysis | Business data, review data (except review text) and Boston's historical violation data | Ordinary Least Squares Regression | 3.624 |
| | | REPTree | 3.1991 |
| | Business data and seasonally aggregated Boston's historical violation data | Ordinary Least Squares Regression | 3.56 |
| Performing Machine Learning Algorithms with text analysis | Boston's historical violation data and TF-IDF features extracted from the restaurant text reviews before the violation date | Ordinary Least Squares Regression | 1.1386 (Baseline) |
| | | Ridge Regression | 1.153 |
| | | K Nearest Neighbor Regression with K = 3 | 1.0893 |
| | | Support Vector Regression | 1.2245 |
| | | Gaussian Naive Bayes | 2.0511 |
| | | Random Forest estimators | 0.9992 (Best Performance) |
| | Boston's historical violation data and TF-IDF features extracted from the recent restaurant text reviews before the violation date | Random Forest estimators | 1.08 |
| Performing Machine Learning Algorithms and text analysis with Sentiment Analysis and plotting polarity score for all the text reviews before the date | Boston's historical violation data, TF-IDF features extracted from the restaurant text reviews before the violation date, and business data | Ordinary Least Squares Regression | 1.3 |
| | Boston's historical violation data and TF-IDF features extracted from the restaurant text reviews before the violation date, plus one more polarity score feature on the text reviews | Ordinary Least Squares Regression | 1.1498 |
| | | Ridge Regression | 1.0026 |
| | Boston's historical violation data and TF-IDF features extracted from the recent restaurant text reviews before the violation date | Random Forest estimators | 1.003 |

Future work:
For the given data, the approach used to extract features from the customer reviews is
really important. As we used a basic approach, TF-IDF (term frequency, inverse document
frequency), for feature extraction, we will have to find a better NLP technique to
convert the reviews into features. We are currently in 9th position on the DrivenData
leaderboard, and we will continue this project for a few more weeks to improve our
ranking.
