
MIS 441 Final Project

Introduction:
The Jack Bauer family is moving to Pittsburgh and is recruiting a butler
to help them make decisions. The tasks are:
1. A house. The Jack Bauer family wants to buy a house. The requirements are:
a) The price is less than 500,000 USD.
b) It has investment potential.
c) It is close to medical centers/hospitals, universities and supermarkets/malls
(Target, Walmart, Whole Foods, Costco, etc.).
d) Traffic in the surrounding areas is excellent.
2. Technology setup. Jack Bauer is a field agent and asks for a recommendation
on a set of tech equipment:
a) A camera (Canon PowerShot SD500, Canon S100, Nikon Coolpix
4300 or Canon G3).
b) A router (Hitachi router or Linksys router).
You are required to give a presentation to the Jack Bauer family to help them make the
above decisions.

Proposal 1: A house
To recommend a house to Jack and his family, we divide this task into 2 steps:
1) Neighborhood selection - to select a top neighborhood in Pittsburgh
2) House selection - to select a specific house in the chosen top neighborhood

Neighborhood Selection
1. Evaluate neighborhoods in Pittsburgh using 4 criteria: traffic, crime & safety, good for
families and health & fitness.

- Use Google Maps to see typical traffic for each neighborhood:

- Use the scorecards on niche.com to evaluate crime & safety, good for families and
health & fitness:
https://www.niche.com/places-to-live/search/best-neighborhoods-for-families/m/pittsburgh-metro-area/

Then evaluate these 4 criteria using 4 grade levels: 100, 75, 50, 25. The higher the score, the
better a neighborhood meets a criterion.
Create a spreadsheet that has all grades for the neighborhoods:

2
In the spreadsheet, filter for neighborhoods that have scores of 75 or greater in traffic. 65
neighborhoods are left.

Import the remaining neighborhoods into RapidMiner Studio. Do not include the Traffic column.
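
The traffic filter can also be done outside the spreadsheet. The following is a minimal sketch, assuming the scores are exported to CSV with the column names shown; the three sample rows and the helper name keep_good_traffic are invented for illustration and are not our actual data.

```python
import csv
import io

# Hypothetical excerpt of the score spreadsheet (rows invented for illustration).
SAMPLE_CSV = """Neighborhood,Traffic,Crime & Safety,Good for Families,Health & Fitness
Example A,100,75,100,75
Example B,50,100,75,100
Example C,75,50,75,50
"""


def keep_good_traffic(rows, threshold=75):
    """Keep rows whose Traffic score is >= threshold and drop the Traffic column."""
    kept = []
    for row in rows:
        if int(row["Traffic"]) >= threshold:
            kept.append({k: v for k, v in row.items() if k != "Traffic"})
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
remaining = keep_good_traffic(rows)
print([r["Neighborhood"] for r in remaining])
```

The remaining rows no longer carry the Traffic column, which matches what is imported into RapidMiner Studio.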

Change the role of Neighborhood to id. Click Next and Finish to import the data.

Follow these steps to group the remaining neighborhoods:


1) Retrieve the imported data.

2) Impute missing values using k-NN.

3) Set the role of Crime & Safety to label.

4) Use Filter Examples to keep neighborhoods that have scores of 75 or greater in
Crime & Safety.

5) Use k-means to cluster the neighborhoods. Try different k values from 2 to 6.

6) Compare the performance of different k values using Cluster Distance Performance
and find the optimal k value. In our case, 4 is the optimal number of clusters because it
gives the smallest Davies-Bouldin index (DBI), 0.316, compared to smaller k values. When
we increase k to 5 or more, at least one cluster is empty.
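
To make the DBI comparison in step 6 concrete, here is a minimal sketch of the standard Davies-Bouldin index with Euclidean distance. It is not RapidMiner's implementation, whose sign and distance conventions may differ, and the four sample points are invented for illustration. Lower values mean tighter, better-separated clusters.

```python
import math


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index over the non-empty clusters in labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    ids = sorted(clusters)
    cents = {c: centroid(clusters[c]) for c in ids}
    # Scatter: average distance from each member to its cluster centroid.
    scatter = {
        c: sum(math.dist(p, cents[c]) for p in clusters[c]) / len(clusters[c])
        for c in ids
    }
    total = 0.0
    for i in ids:
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in ids
            if j != i
        )
    return total / len(ids)


# Invented example: two tight, well-separated groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = davies_bouldin(points, [0, 0, 1, 1])
bad = davies_bouldin(points, [0, 1, 0, 1])
print(good, bad)
```

The labeling that matches the real groups gives the smaller index, which is the same rule we apply when comparing k values.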

The whole process and connections are shown below.

The following graphs show some results of clustering.

We choose the cluster in the upper-right corner because the neighborhoods in this cluster are
good for families and have great access to health & fitness facilities. This narrows the list down
to 6 top neighborhoods: Oakwood, Highland Park, Point Breeze, Swisshelm Park, Shadyside
and Regent Square.

To continue narrowing down to a single top neighborhood, we use a table to check the surrounding
facilities of these 6 neighborhoods on Google Maps and find that only Shadyside is close to
supermarkets, universities and medical centers.

Neighborhood     Nearby Supermarket   Nearby Medical Center   Nearby University
Oakwood          yes                  yes                     no
Highland Park    yes                  no                      yes
Point Breeze     yes                  no                      yes
Swisshelm Park   yes                  no                      no
Shadyside        yes                  yes                     yes
Regent Square    yes                  no                      no
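
The check against the table above can be sketched in a few lines of Python. The yes/no values are copied from the table; the dictionary layout and key names are only an illustrative choice.

```python
# Facility checks copied from the table (True = yes, False = no).
facilities = {
    "Oakwood": {"supermarket": True, "medical_center": True, "university": False},
    "Highland Park": {"supermarket": True, "medical_center": False, "university": True},
    "Point Breeze": {"supermarket": True, "medical_center": False, "university": True},
    "Swisshelm Park": {"supermarket": True, "medical_center": False, "university": False},
    "Shadyside": {"supermarket": True, "medical_center": True, "university": True},
    "Regent Square": {"supermarket": True, "medical_center": False, "university": False},
}

# Keep only neighborhoods that satisfy all three requirements.
meets_all = [name for name, checks in facilities.items() if all(checks.values())]
print(meets_all)  # ['Shadyside']
```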

Shadyside is the top neighborhood from our analysis. UPMC and Trader Joe's are in this
neighborhood. Chatham University is 0.6 miles away and Carnegie Mellon University is 1.5
miles away.
Then we look at all houses for sale in Shadyside.

House Selection

To meet Jack's requirements, we used the real-estate website Zillow to look up all houses with
a price between $100,000 and $500,000, three or more beds, and home type "house", and then
checked their investment potential. (We used the search function to select all houses in
Shadyside, then used the website's built-in filters to keep only what we wanted.)
(https://www.zillow.com/homes/for_sale/11525498_zpid/40.510602,-79.843569,40.398921,-80.028105_rect/12_zm/)

After applying the built-in filters, three houses are left. We compared their Zestimate
graphs (price of the house over time) and found that the house at 5447 Potter St, Pittsburgh, PA
15232 shows a clear upward trend in its price forecast over time. It has 4 beds, 2.5 bathrooms
and 2,117 sqft, and is priced at about $385,000.

We also found some nice pictures of the house, and everything looks great.

Proposal 2: Technology Setup

For this proposal, we need to recommend a router and a camera to Jack. Before looking at
a specific router or camera, we analyze the given review data using sentiment analysis.

We use the movie review data given in a previous lab session as training data and the product
review data given in this project as test data.

The whole process and connections are shown below:

The subprocess for Process Documents from Files is shown below. We use Term Frequency
for vector creation.

In addition to Filter Stopwords (English), we also create another list of stopwords and use
Filter Stopwords (Dictionary) to get rid of [n] and ## in the product review data. Our list of
stopwords includes:

The parameters for the SVM are shown below:

The sentiment analysis predicts positive for all products.

Then we look at the router and the camera separately to find the best product for Jack.

Router

First we look at the term frequency for each router's reviews to see which features
customers focus on most.

The result shows that words like cut, tool, depth and wood appear frequently in the reviews for
the Hitachi router. This tells us that the Hitachi router is a power tool used to cut wood.
Therefore, the Hitachi router is not what Jack is looking for as a field agent. We then move on
to the Linksys router.
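
The term-frequency check can be illustrated outside RapidMiner with a minimal sketch. The sample sentence and the short stopword list are invented for illustration; they are not taken from the review data or from the stopword lists we used.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the list used in RapidMiner).
STOPWORDS = {"the", "a", "to", "and", "is", "it", "this", "of", "i", "for", "in"}


def top_terms(text, n=5):
    """Return the n most frequent lowercase words after removing stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


# Invented sample in the style of a woodworking-router review.
sample = (
    "This router makes a clean cut. The depth of cut is easy to set "
    "and the tool cuts wood well."
)
print(top_terms(sample, 3))
```

No stemming is applied here, so "cut" and "cuts" are counted as different terms.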

We use Python to split the Linksys review text file into 48 separate files and use these files to
run the sentiment analysis again.
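
A minimal sketch of such a split is shown below. It assumes that each review starts with a line beginning with the marker [t]; that marker, the sample text, the output file names and the helper names split_reviews and write_reviews are assumptions for illustration, not a record of the script we ran.

```python
import tempfile
from pathlib import Path


def split_reviews(text, marker="[t]"):
    """Split raw text into reviews, each starting at a line that begins with marker."""
    reviews = []
    current = None  # stays None until the first marker line is seen
    for line in text.splitlines():
        if line.startswith(marker):
            if current:
                reviews.append("\n".join(current))
            current = [line]
        elif current is not None and line.strip():
            current.append(line)
    if current:
        reviews.append("\n".join(current))
    return reviews


def write_reviews(reviews, out_dir, prefix="review"):
    """Write each review to its own numbered text file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, review in enumerate(reviews, start=1):
        path = out / f"{prefix}_{i:02d}.txt"
        path.write_text(review, encoding="utf-8")
        paths.append(path)
    return paths


# Invented two-review sample; the real file would be read with Path(...).read_text().
sample = (
    "[t] great router\n"
    "range[+2]##the range is great\n"
    "[t] hard to set up\n"
    "setup[-1]##setup was hard\n"
)
reviews = split_reviews(sample)
with tempfile.TemporaryDirectory() as tmp:
    written = [p.name for p in write_reviews(reviews, tmp)]
print(len(reviews), written)
```

Each output file can then be loaded by Process Documents from Files as one document.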

The result of sentiment analysis shows that there are many more positive reviews than negative
reviews. Now we are confident enough to choose Linksys for Jack.

We look at the official website https://www.linksys.com/us/c/wireless-routers and compare
different models of this brand. We apply two criteria, Best for Multiple Devices and Best for
Working from Home, which narrows the choice down to two models. We also add a cheaper
model as a reference.

Next, we gather reviews for these three models as our test data. We choose the
most recent 20 reviews from Amazon. Again, we run the sentiment analysis with exactly
the same process in RapidMiner for each of the three models.

Product    Positive reviews   Negative reviews
EA 7300    85%                15%
EA 9300    90%                10%
EA 9500    85%                15%

The result of the sentiment analysis shows that the EA 9300 has the largest share of
positive reviews (90%). Therefore, we choose the Linksys EA 9300 for Jack.
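
The percentages above follow directly from the predicted labels. The sketch below assumes 20 reviews per model, so that 85% corresponds to 17 positive predictions and 90% to 18; the label lists are reconstructed from the table for illustration and are not the raw RapidMiner output.

```python
def sentiment_share(predictions):
    """Return the percentage of positive and negative predictions."""
    total = len(predictions)
    positive = sum(1 for p in predictions if p == "positive")
    return {
        "positive": 100 * positive / total,
        "negative": 100 * (total - positive) / total,
    }


# Reconstructed from the table, assuming 20 reviews per model.
predicted = {
    "EA 7300": ["positive"] * 17 + ["negative"] * 3,
    "EA 9300": ["positive"] * 18 + ["negative"] * 2,
    "EA 9500": ["positive"] * 17 + ["negative"] * 3,
}

shares = {model: sentiment_share(labels) for model, labels in predicted.items()}
best = max(shares, key=lambda model: shares[model]["positive"])
print(best, shares[best])
```

With only 20 reviews per model, one review changes the share by 5 percentage points, so the gap between the models is small.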

Camera

The camera selection process largely repeats the process used for the router.

The sentiment analysis reveals two major issues with the 4 cameras provided. 1) The reviews
of all four cameras are predicted positive, so the result doesn't give us enough
information to choose among them. 2) All 4 cameras were released many years ago and
are now out of date. So we need to select newer products for Jack.

First, before we select camera candidates, we look at the term frequency for each camera's
reviews to see which features customers focus on most.

Based on the term frequency, customers care about several features: picture
quality, zoom range, look, shot and ease of use.
Using these features and three other criteria, 1) Brand: Nikon & Canon, 2) Category: Entry-level
Digital SLR Camera, 3) Price range: $400-$600, we choose four camera candidates: Nikon
D3300, Nikon D3400, Canon T6, Canon EOS Rebel T5.

Next, we gather reviews for these four models as our test data. We randomly
choose 20 reviews for each model from Amazon. Again, we run the sentiment analysis with
exactly the same process in RapidMiner for each of the four models.

(Result for one of the four models.)

Product              Positive reviews   Negative reviews
Canon T6             19                 1
Canon EOS Rebel T5   17                 3
Nikon D3300          17                 3
Nikon D3400          16                 4

The result shows that the Canon T6 is the best, with 19 positive reviews and only 1 negative review.
So we choose the Canon T6 for Jack to purchase.

More Business Applications

Based on our proposals for Jack Bauer, we see several further business applications.
First, we can choose different evaluation criteria based on each customer's demands. Since we
considered Jack's job and family in this case, we evaluated good for families and health &
fitness. However, these criteria don't work for every customer.
Second, we can use different criteria for clustering. The priority, again, is decided by the specific
requirements of different customers. Although the clustering methodology doesn't change, it
is important to adapt the criteria we cluster on.
Last but not least, we find that term frequency can help us learn what customers
focus on when they buy a specific product. Knowing their focus can help us better target our
products in the market and make more profit.

