
BUS 443: Business Analytics

Data Mining Case


PART 1: DATA MINING TECHNIQUES TO FIND PATTERNS (UNSUPERVISED LEARNING)
Problem 1: Hierarchical Cluster Analysis with the Football Bowl Subdivision (FBS)
We started this example in class and will now do some further analysis. The Football Bowl Subdivision
(FBS) of the National Collegiate Athletic Association (NCAA) consists of over 100 schools. Most of
these schools belong to one of several conferences, or collections of schools, that compete with each other
on a regular basis in collegiate sports. Suppose the NCAA has commissioned a study that will propose the
formation of conferences based on the similarities of the constituent schools.
1. Open the FBS file (found in the Chapter 6 textbook files) that contains rows of information on
constituent FBS schools. Apply hierarchical clustering with 10 clusters using football stadium
capacity, latitude, longitude, endowment, and enrollment as variables. Use Ward's method as the
clustering algorithm. Be sure to normalize the data. Copy the assigned cluster column to the data
sheet.
2. Use a Pivot Table on the data in the HC_Clusters sheet to identify the cluster with the largest
average football stadium capacity. Which cluster and school have the highest?
a. Cluster 2 has the largest average stadium capacity
b. Tennessee has the largest stadium capacity
3. How would you characterize the universities in this cluster?
a. The schools in this cluster are in the Southeast and have high-capacity stadiums as
well as large enrollments.
4. What is the smallest cluster (the one with the fewest observations) and what makes it unique?
a. The smallest cluster was cluster 4 (Stanford)
b. Stanford has a large endowment and it is the only school in its cluster
5. Examine the dendrogram on the HC_Dendrogram worksheet (as well as the sequence of clustering
stages in the HC_Output sheet). What number of clusters seems to be the most natural fit based on
the distance?
a. After examining the dendrogram, we found that somewhere between 9 and 11 clusters would
be ideal.
6. Create another pivot table and count the number of schools per cluster. Analyze the results. Why
aren't these cluster results appropriate, or (restated) why should we rerun the cluster analysis using
different variables or a different number of clusters?
a. We had one cluster with 30 schools and another with only 1. This is unacceptable because
clusters are supposed to group similar schools together, and the cluster sizes here are far
from uniform.
b. This is included in our large pivot table and is highlighted in red.
7. Apply hierarchical clustering again with 10 clusters using just latitude and longitude as the
variables. Be sure to normalize the data and specify single linkage as the clustering method. Use a
Pivot Table on the data in HC_Clusters. You can also visualize the clusters with a scatter plot with
longitude as the x-variable and latitude as the y-variable. Compare the clusters to the previous
method. Which is the better method?
a. We found that Ward's method was the superior clustering technique. With single linkage,
the schools were not spread evenly across the clusters: one large cluster contained
98 schools, and several clusters contained only one school. Ultimately, longitude and
latitude alone are not good variables for clustering colleges, and single linkage
yielded a poor result.
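
The contrast between Ward's method and single linkage in steps 1 and 7 can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not XLMiner's implementation: the helper names (zscore, merge_cost, agglomerate) and the eight one-variable toy values are hypothetical and do not come from the FBS file.

```python
import math
from statistics import mean, pstdev


def zscore(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant column: avoid dividing by 0
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]


def centroid(points, members):
    """Mean position of the points whose indices are in members."""
    return [mean(points[i][d] for i in members) for d in range(len(points[0]))]


def merge_cost(points, a, b, linkage):
    """Cost of merging clusters a and b (lists of point indices)."""
    if linkage == "single":
        # Single linkage: distance between the two closest members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)
    # Ward: increase in within-cluster sum of squares caused by the merge.
    ca, cb = centroid(points, a), centroid(points, b)
    return len(a) * len(b) / (len(a) + len(b)) * math.dist(ca, cb) ** 2


def agglomerate(points, k, linkage):
    """Merge the cheapest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        pairs = [
            (merge_cost(points, clusters[x], clusters[y], linkage), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        _, x, y = min(pairs)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical toy data: six closely spaced values and two distant ones.
toy = zscore([(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,), (9.0,), (12.5,)])
for method in ("single", "ward"):
    sizes = sorted((len(c) for c in agglomerate(toy, 3, method)), reverse=True)
    print(method, sizes)
```

On this toy data, single linkage chains the six closely spaced points into one cluster and leaves the two distant points as singletons (sizes 6, 1, 1), while Ward's method produces the more balanced sizes 4, 2, 2. That is the same pattern reported above, where single linkage gave one cluster of 98 schools and several clusters of one.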
Problem 2: k-Means Cluster Analysis with the Football Bowl Subdivision (FBS)
1. Open the FBS file used in Problem 1 and copy the data to a new workbook. Delete the cluster
column from the hierarchical clustering in Problem 1.
2. Apply k-Means clustering with k=10 using football stadium capacity, latitude, longitude,
endowment, and enrollment as variables. Specify 50 iterations and 10 random starts and
normalize the data.
3. Analyze the resultant clusters. What is the smallest cluster (the one with the fewest observations)?
a. The smallest cluster is cluster 5
4. What is the least dense (aka most diverse) cluster, as measured by the largest average distance in
the cluster? What makes the least dense cluster so diverse?
a. Cluster 1 is the least dense
b. It is so diverse because its observations are more spread out than those of a tightly
concentrated cluster. The density is low because of the distances between these 5
universities and the relatively small number of observations in the cluster.
5. What problems do you see with the plan of defining the school membership of the 10 conferences
directly with these 10 clusters?
a. Cluster 2 has only 3 schools, which would be awful for an FBS conference.
b. Cluster 5 is also too small, with only 1 school.
c. Cluster 7 is an outlier, with 27 schools.
d. Overall, the cluster sizes span a wide range, from 1 to 27 schools, which makes for a
lot of variance.
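
The k-means settings in step 2 (a fixed number of iterations and several random starts) and the size and density measures used in steps 3 and 4 can be sketched as follows. This is a plain-Python illustration, not XLMiner's algorithm: the function names, the seed parameter, and the nine toy points are hypothetical, and the points are assumed to be already normalized.

```python
import math
import random
from statistics import mean


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans_once(points, k, iterations, rng):
    """One k-means run from a random choice of k starting centers."""
    centers = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(
                tuple(mean(dim) for dim in zip(*members)) if members else centers[c]
            )
        if new_centers == centers:  # converged before the iteration limit
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return sse, labels, centers


def kmeans(points, k, iterations=50, starts=10, seed=0):
    """Keep the run with the lowest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, iterations, rng) for _ in range(starts))
    return min(runs, key=lambda run: run[0])


def cluster_report(points, labels, centers):
    """Size and average distance to the centroid for each cluster."""
    report = {}
    for c, center in enumerate(centers):
        dists = [math.dist(p, center) for p, lab in zip(points, labels) if lab == c]
        report[c] = (len(dists), mean(dists) if dists else 0.0)
    return report


# Hypothetical toy points: two tight groups and one isolated observation.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (30.0, 0.0)]
sse, labels, centers = kmeans(toy, k=3)
for cluster, (size, avg_dist) in cluster_report(toy, labels, centers).items():
    print(cluster, size, round(avg_dist, 3))
```

The smallest cluster is the one with the lowest size, and the least dense cluster is the one with the largest average distance. Several random starts matter because a single run can settle on a poor local solution; keeping the run with the lowest sum of squares can never do worse than the first start alone.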
Problem 3: Both Types of Cluster Analysis with the Football Bowl Subdivision (FBS)
The NCAA has a preference for conferences consisting of similar schools with respect to their
endowment, enrollment, and football stadium size, but these conferences must be in the same geographic
region to reduce traveling costs. Take the following steps to address this desire.
1. Apply k-means clustering again (in a new worksheet) using latitude and longitude as variables
with k=3. Be sure to normalize and specify 50 iterations and 10 random starts. Then create one
distinct data set (one spreadsheet) for each of the three regional clusters (east, west, and south).
2. For the west cluster, apply hierarchical clustering with Ward's method and use normalized data to
form two sub-clusters using football stadium capacity, endowment, and enrollment as variables.
Use a PivotTable on the data in HC_Clusters to report the characteristics of each cluster.
Row Labels    Average of Enrollment    Average of StadiumCapacity    Average of Endowment ($000)    Count of SubCluster
1             26589.2381               49088.71429                   842519.4762                    21
2             19945                    50000                         16502606                       1
Grand Total   26287.22727              49130.13636                   1554341.591                    22

Cluster 1 has 21 schools while cluster 2 has only 1 school. Cluster 2 has a significantly higher
endowment.

3. Do the same for the east cluster, using three sub-clusters.


Row Labels    Average of Stadium Capacity    Average of Endowment ($000)    Average of Enrollment    Count of Sub-Cluster
1             63568.4                        1336091.8                      32963.4                  25
2             63347.66667                    5866583.5                      21313                    6
3             34350.73077                    193019.3462                    24231.80769              26
Grand Total   50217.80702                    1291584.193                    27754.21053              57

Clusters 1 and 3 have similar numbers of schools, while cluster 2 is made up of only 6
schools.
a. Cluster 1
4. Do the same for the south cluster, using four sub-clusters.
Row Labels    Count of SubCluster    Average of StadiumCapacity    Average of Endowment ($000)    Average of Enrollment
1             17                     39736.11765                   113253.5882                    25873.17647
2             2                      85812                         3652205.5                      22330.5
3             21                     66754.7619                    547584.5238                    29637.04762
4             8                      66461.125                     1191190.375                    22726
Grand Total   48                     57930.77083                   630385.8333                    26847.72917

Cluster 2 has only 2 schools. The number of schools in each cluster isn't balanced.

5. What problems do you see with this plan? How could this approach be tweaked to solve the
problem?
a. Latitude and longitude don't necessarily capture the proximity of the schools to
each other. For example, Hawaii was in the south division when logically it should be
in the west. It might be necessary to manually alter some of the clusters because of this.
b. Within each region there is still an uneven number of schools in each sub-cluster.
This problem could be reduced by adding a North region, and creating further geographical
regions beyond East, South, West, and North could extend this solution. Getting
more data on each school, such as rankings, would also help cluster them.
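
When reading the PivotTable output in steps 2 to 4, it helps to remember that each Grand Total average is the count-weighted mean of the sub-cluster averages, not the plain mean of those averages. The short check below uses the west and south figures reported above; the function names are hypothetical, and the size share is only one simple way to express how unbalanced the sub-clusters are.

```python
def grand_average(averages, counts):
    """Count-weighted mean of per-cluster averages (the Grand Total row)."""
    return sum(a * n for a, n in zip(averages, counts)) / sum(counts)


def size_shares(counts):
    """Share of all schools that falls in each sub-cluster."""
    total = sum(counts)
    return [n / total for n in counts]


# West region: 21 schools in sub-cluster 1 and 1 school in sub-cluster 2.
west_counts = [21, 1]
west_enrollment = [26589.2381, 19945]
print(round(grand_average(west_enrollment, west_counts), 3))

# South region: sub-cluster sizes 17, 2, 21, and 8.
south_counts = [17, 2, 21, 8]
print([round(share, 3) for share in size_shares(south_counts)])
```

The weighted mean reproduces the reported west Grand Total for enrollment (about 26287.227), which confirms that the single-school sub-cluster barely moves the regional averages even though its endowment is far larger.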
Problem 4: Market Basket Analysis on Cookie Monster, Inc. (Problem 8 in our Textbook)
Cookie Monster Inc. is a company that specializes in the development of software that tracks Web
browsing history of individuals.
1. Open the CookieMonster file and review the binary matrix format. The entry in row i and column j
indicates whether the website in column j was visited by the user in row i. Using a minimum support of
800 transactions and a minimum confidence of 50%, use XLMiner to generate a list of
association rules.
2. Review the top 14 rules. What information does this analysis provide Cookie Monster regarding
the online behavior of individuals? Be sure to address the lift ratios (and the meaning of the lift
ratios) in common terms that a business user would immediately understand.
a. The lift ratio is a measure of the usefulness of a rule. A rule's confidence is the support of
(antecedent and consequent) divided by the support of the antecedent. The lift ratio is that
confidence divided by the share of all transactions that contain the consequent, so a lift
ratio above 1 means users who visited the antecedent sites are more likely than the average
user to visit the consequent site. This information regarding online behavior indicates that
there is a strong association among Facebook, Twitter, and YouTube. The highest lift ratios
come from any combination of two of these, which leads to the third. This also allows us to
identify the rules with low lift ratios, which are less effective at predicting customers'
click patterns. If you know customers are going to go to all three of these sites, you could
save money by advertising on only one, or flood the market by advertising on all three.
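
The relationship between support, confidence, and lift can be worked through on a small binary matrix. This is a minimal sketch under stated assumptions: the ten user rows are invented for illustration and are not taken from the CookieMonster file, and rule_metrics is a hypothetical helper, not an XLMiner function.

```python
SITES = ("Facebook", "Twitter", "YouTube")

# Hypothetical visits for ten users: 1 = visited, 0 = did not visit.
MATRIX = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 0),
    (0, 0, 1), (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),
]
rows = [dict(zip(SITES, visits)) for visits in MATRIX]


def rule_metrics(rows, antecedent, consequent):
    """Return (support count, confidence, lift) for antecedent -> consequent."""
    n = len(rows)
    has_a = [all(r[site] for site in antecedent) for r in rows]
    has_c = [all(r[site] for site in consequent) for r in rows]
    n_a = sum(has_a)
    n_c = sum(has_c)
    n_both = sum(1 for a, c in zip(has_a, has_c) if a and c)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent or consequent never occurs")
    confidence = n_both / n_a          # P(consequent | antecedent)
    lift = confidence / (n_c / n)      # confidence relative to the baseline rate
    return n_both, confidence, lift


support, confidence, lift = rule_metrics(rows, ["Facebook", "Twitter"], ["YouTube"])
print(f"support={support} confidence={confidence:.2f} lift={lift:.3f}")
```

In this toy matrix, 4 of the 10 users visit YouTube, but 3 of the 4 users who visit both Facebook and Twitter also visit YouTube. The confidence is 0.75 and the lift is 0.75 / 0.40 = 1.875, meaning those users are almost twice as likely as a typical user to visit YouTube.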
