
Review collection: Many sources on the internet provide product reviews. I chose to pull
reviews from Amazon because of the wide range of products it covers and the large number of
choices it offers the consumer. It also has a considerable number of reviews for each item.
The reviews are obtained using Amazon’s Web Services API, which returns an XML response. This
XML is then parsed to obtain the reviews. The system currently fetches up to 20 pages of
reviews, with 5 reviews per page (at most 100 reviews per product). This limit can be changed
to any integer.
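
The parsing step can be sketched as follows. This is only a minimal illustration in Python: the element names "Review" and "Content" are placeholders, not Amazon's actual response schema, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MAX_PAGES = 20  # page limit described above; any integer can be used


def parse_reviews(xml_text):
    """Return the review texts found in one XML response page."""
    root = ET.fromstring(xml_text)
    # "Review" and "Content" are placeholder tags (assumption), not the real schema.
    return [
        (review.findtext("Content") or "").strip()
        for review in root.iter("Review")
    ]


def collect_reviews(pages, max_pages=MAX_PAGES):
    """Parse at most max_pages XML pages and concatenate their reviews."""
    reviews = []
    for xml_text in pages[:max_pages]:
        reviews.extend(parse_reviews(xml_text))
    return reviews


sample_page = """<Reviews>
  <Review><Content>Great battery life.</Content></Review>
  <Review><Content>The lens is a bit soft.</Content></Review>
</Reviews>"""
```

Calling collect_reviews on a list of response strings returns a flat list of review texts, ready for the tagging step.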
Sentence segmentation and POS tagging: I used the NLProcessor program for this step. The
program is available for both Windows and Unix. Once we have the product reviews, we run them
through NLProcessor to obtain output in the format it defines.
Frequent feature identification: All the nouns and noun phrases occurring in each sentence are
chosen as candidate features and are aggregated into a transaction file, with one transaction
per sentence. A variant of the Apriori algorithm is then run on this file to identify the
features that are frequently commented upon, on the assumption that these are the features
that really matter for the product. For the Apriori part, a CPAN package named
“Data::Mining::AssociationRules” is used. From this, we get a set of frequent patterns, which
are the candidate features for the product.
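
To make the frequent-pattern step concrete, here is a minimal level-wise Apriori sketch in Python. It is not the CPAN package used in the system; the function name, the max_size cap and the support threshold in the example are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise Apriori: return {itemset: count} for itemsets meeting min_support.

    transactions: iterable of sets of nouns (one set per sentence).
    min_support: fraction of transactions an itemset must appear in.
    """
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)
    # Level 1 candidates: every single item that occurs anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates and size <= max_size:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: keep a candidate only if all its smaller subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


transactions = [
    {"battery", "life"},
    {"battery", "life", "camera"},
    {"picture", "quality"},
    {"battery"},
    {"picture", "quality", "camera"},
]
result = frequent_itemsets(transactions, min_support=0.4)
```

With these toy transactions, "battery life" and "picture quality" survive as two-word candidates, while pairs that co-occur only once are dropped.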
Feature Pruning: Once we have a set of candidate features, we can use a couple of heuristics
to remove items that are unlikely to be relevant features. I have implemented the Compactness
and Redundancy pruning heuristics, as described in the paper by Minqing Hu and Bing Liu.
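
The two heuristics can be sketched roughly as below. This is a simplified Python reading of the paper's description, not the implementation used here: compactness keeps a multi-word phrase only if its words appear close together in enough sentences, and redundancy drops a feature that is contained in a longer feature and rarely appears without it. The thresholds (a gap of 3 words, 2 compact sentences, a minimum pure support of 3) are defaults assumed for illustration and should be checked against the paper.

```python
def is_compact(phrase, sentence, max_gap=3):
    """True if all phrase words occur in the token list with adjacent gaps <= max_gap.

    Simplification: only the first occurrence of each word is considered.
    """
    positions = []
    for word in phrase:
        if word not in sentence:
            return False
        positions.append(sentence.index(word))
    positions.sort()
    return all(b - a <= max_gap for a, b in zip(positions, positions[1:]))


def compactness_prune(phrases, sentences, min_compact=2):
    """Keep single words, and multi-word phrases compact in >= min_compact sentences."""
    kept = []
    for phrase in phrases:
        if len(phrase) == 1:
            kept.append(phrase)
        elif sum(is_compact(phrase, s) for s in sentences) >= min_compact:
            kept.append(phrase)
    return kept


def redundancy_prune(features, sentences, min_psupport=3):
    """Drop features that are subsets of a longer feature and have low pure support."""
    kept = []
    for f in features:
        supersets = [g for g in features if set(f) < set(g)]
        if not supersets:
            kept.append(f)
            continue
        # Pure support: sentences containing f but none of its superset features.
        psupport = sum(
            1 for s in sentences
            if set(f) <= set(s) and not any(set(g) <= set(s) for g in supersets)
        )
        if psupport >= min_psupport:
            kept.append(f)
    return kept


sentences = [
    ["the", "battery", "life", "is", "long"],
    ["battery", "life", "could", "be", "better"],
    ["life", "is", "short", "and", "the", "old", "spare", "battery", "died"],
    ["i", "love", "the", "picture", "quality"],
]
```

Features are represented here as tuples of words, for example ("battery", "life"), and sentences as token lists.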
Opinion Words Extraction: Now, we have a set of product features and we need to identify the
opinion words that describe them. For this, we extract the adjectives that are within some fixed
distance from each of the feature words. Thus, we get a list of adjectives describing each of the
features.
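
As an illustration of this step, the sketch below collects adjectives within a fixed window around each feature word. It assumes Penn Treebank style tags (adjectives start with "JJ"), single-word features, and a window of 3 tokens; the actual distance used by the system is not stated, so that value is only a placeholder.

```python
from collections import defaultdict


def nearby_adjectives(tagged_sentences, features, window=3):
    """Map each feature word to the adjectives found within `window` tokens of it.

    tagged_sentences: list of sentences, each a list of (word, pos_tag) pairs.
    features: set of lower-case single-word features.
    """
    opinions = defaultdict(set)
    for sentence in tagged_sentences:
        for i, (word, _tag) in enumerate(sentence):
            if word.lower() not in features:
                continue
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for other, tag in sentence[lo:hi]:
                if tag.startswith("JJ"):  # JJ, JJR, JJS
                    opinions[word.lower()].add(other.lower())
    return dict(opinions)


tagged = [[
    ("The", "DT"), ("battery", "NN"), ("is", "VBZ"), ("excellent", "JJ"),
    ("but", "CC"), ("the", "DT"), ("heavy", "JJ"), ("lens", "NN"),
    ("annoys", "VBZ"), ("me", "PRP"),
]]
opinion_words = nearby_adjectives(tagged, {"battery", "lens"})
```

In this example "excellent" is attached to "battery" and "heavy" to "lens", because each adjective falls inside only one feature's window.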
Opinion Orientation Identification: Once we have a set of opinion words, we need to
calculate their orientation, i.e., whether each opinion word expresses a positive or a negative
opinion. For this, I have used the data from SentiWordNet.
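
One simple way to turn per-word scores into an orientation is to compare the positive and negative scores, as in the sketch below. The score table is a made-up stand-in with illustrative values, not real SentiWordNet data, and the comparison rule and threshold are assumptions rather than the exact method used here.

```python
# Illustrative (positive, negative) scores only; NOT actual SentiWordNet values.
TOY_SCORES = {
    "excellent": (0.9, 0.0),
    "poor": (0.0, 0.8),
    "digital": (0.0, 0.0),
}


def orientation(word, scores=TOY_SCORES, threshold=0.0):
    """Return +1 (positive), -1 (negative) or 0 (neutral or unknown) for a word."""
    pos, neg = scores.get(word, (0.0, 0.0))
    diff = pos - neg
    if diff > threshold:
        return 1
    if diff < -threshold:
        return -1
    return 0
```

Words missing from the table fall back to a neutral orientation, so they do not influence later steps.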
Opinion sentence orientation identification: Now that we have the orientations of individual
opinion words, we can try to estimate the orientation of the sentence containing them. For this,
I have implemented the algorithm described in the paper by “Minqing Hu and Bing Liu”. Only
the sentences that contain at least one feature word are considered.
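
A simplified sketch of this idea is shown below: sum the orientations of the opinion words in a sentence, flip a word's orientation when a negation word appears shortly before it, and fall back to the previous sentence's orientation on a tie. It leaves out parts of the paper's algorithm, and the negation list and window size are assumptions.

```python
NEGATIONS = {"not", "no", "never", "n't"}  # assumed list, for illustration


def sentence_orientation(tokens, word_orientation, previous=0, neg_window=2):
    """Return +1, -1 or `previous` for a tokenised sentence.

    word_orientation: dict mapping opinion words to +1 or -1.
    previous: orientation of the preceding sentence, used to break ties.
    """
    total = 0
    for i, token in enumerate(tokens):
        value = word_orientation.get(token, 0)
        if value == 0:
            continue
        # Flip the orientation if a negation word occurs just before the opinion word.
        if any(t in NEGATIONS for t in tokens[max(0, i - neg_window):i]):
            value = -value
        total += value
    if total > 0:
        return 1
    if total < 0:
        return -1
    return previous


lexicon = {"good": 1, "great": 1, "bad": -1}
```

For example, "the picture is not good" comes out negative because "not" precedes "good" within the window.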
Opinion Summarization: We can calculate the total number of positive and negative sentences
that describe each of the features. The features are ranked first by the number of terms they
contain and then by the number of times they appear in the reviews (frequency). The result
for each feature is a tuple of <Feature, Positive scores, Negative scores>.
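
The summary can be sketched as a counting and sorting step. The code below is a minimal Python illustration; it assumes that features with more terms are listed first and that ties are broken by descending frequency, since the ranking direction is not stated explicitly.

```python
from collections import Counter


def summarize(labelled_sentences):
    """Build ranked (feature, positive_count, negative_count) tuples.

    labelled_sentences: iterable of (features_in_sentence, orientation) pairs,
    where orientation is +1, -1 or 0.
    """
    pos, neg, freq = Counter(), Counter(), Counter()
    for features, orient in labelled_sentences:
        for feature in features:
            freq[feature] += 1
            if orient > 0:
                pos[feature] += 1
            elif orient < 0:
                neg[feature] += 1
    # Assumed order: more terms first, then higher frequency, then alphabetical.
    ranked = sorted(freq, key=lambda f: (-len(f.split()), -freq[f], f))
    return [(f, pos[f], neg[f]) for f in ranked]


data = [
    (["battery life", "lens"], 1),
    (["lens"], -1),
    (["lens"], 1),
    (["battery life"], -1),
    (["zoom"], 1),
]
summary = summarize(data)
```

Each tuple in the result lists a feature with its count of positive and negative sentences.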
