Section 1 Background
Section 2 Distance Measures
2.1 Euclidean Distance
2.2 Distance Measures with Weights
Section 3 Clustering Methods
3.1 Hierarchical Clustering
3.2 Partition Based Clustering
Section 4 Variable Clustering
Section 5 Practical Considerations
Section 6 Clustering with SAS Enterprise Miner
Section 7 Case Study 1: K-Means Clustering with the Clustering Node
Section 8 Case Study 2: Clustering with KBM Data Set
Section 1 Background
Cluster analysis (also known as data segmentation in the data mining community) has a
variety of goals. The goals are invariably related to grouping or segmenting a collection
of objects into disjoint subsets or clusters such that objects within each cluster are
similar to one another while objects assigned to different clusters are dissimilar.
Data mining applications typically deal with a humongous volume of data and it is likely
that the data are heterogeneous. This means that the data might fall into several distinct
groups, with objects within each subgroup being similar to each other but different from
objects from other groups. Since it is possible that there are different models or patterns
pertinent to each group, it is very challenging to spot any single pattern or model that is
germane to the whole data set. Creating clusters of similar objects reduces the model
complexity within the clusters which then enhances the chances for data mining
techniques to perform more successfully within each cluster. Even if the data do not have
natural groupings, partitioning data into homogeneous groups (empirically without
regard to a specific explanation for each cluster) can be very useful. For example, it is
well known that customer preferences for products depend on geographic and
demographic factors. Thus, we can use geographic and demographic factors to group
customers into several segments and to develop marketing strategies in each segment.
Although customers do not form these marketing segments naturally, it is much easier
and more effective to develop efficient marketing strategies for each segment separately
than to identify a single one-size-fits-all marketing strategy that targets all customers.
Clustering is an important unsupervised data-mining tool. Unlike supervised data
mining tools that are driven by user direction, cluster analysis makes no a priori assumptions
concerning the number of clusters or the cluster structure. The basic objective in clustering
is to discover natural groupings of the cases or variables based on some similarity or
distance (dissimilarity) measures. Although there is no target variable to be predicted
explicitly at the clustering stages, a clustering technique can be used in many ways. First,
it can be used in missing value imputation. This method was illustrated in the SHOES
example in the previous session in which we used clustering to impute missing values for
the interval input variables (age, miles/week, races/year, years running). Although we
could have replaced the missing values by the population average of the non-missing
variables, such an approach would likely mask the basic structure among the variables and
could weaken the relationship between the input variables and the
target variable Number of Purchases (one pair or at least two pairs). Second, one can use
clustering to detect outliers because outliers typically belong to clusters with only one
case. Third, one can use clustering to discover the number of clusters if one suspects that
there are meaningful groupings that may represent groups of cases. After finding these
meaningful groups, one can then develop different ways to deal with each group such as
target marketing that will be discussed later. Fourth, we can use cluster analysis to
partition a complex data structure into several subsets in order to give supervised
data-mining techniques such as decision trees or neural networks a better chance of finding a
good predictive model.
Since the benefits of cluster analysis are evident, there have been extensive research
efforts in the past several decades on cluster analysis and there are many automatic
clustering techniques available. However, different clustering techniques lead to
different types of clusters, and because cluster analysis is an unsupervised data mining
exercise it is very difficult to tell whether a cluster analysis has been successful. To
use clustering techniques effectively, we need to understand at least several important
aspects of choosing a clustering technique, especially the selection of a similarity
(or dissimilarity) measure.
Select Similarity Measure
In order to decide whether a set of objects can be split into
subgroups, where members of the same group are more similar to one another than
they are to members of different groups, we need to define what "similar" means.
Suppose we want to group a deck of ordinary playing cards into clusters. There are many
ways to do so, such as:
Two clusters: One cluster has all the face cards and another cluster has all other
cards.
Four clusters: Each suit of thirteen cards forms a cluster.
Two clusters: Red suits are in one cluster and black suits are in another cluster.
Thirteen clusters: Each cluster has all cards that have the same face value.
Obviously, the clusters obtained differ greatly depending on the similarity measure used.
Thus, we need to know how to choose a similarity measure before selecting the clustering
technique.
Select the Right Number of Clusters
Another important question to ask in cluster analysis is: what is the right number of
clusters? The choice of the number of clusters depends on the goal of the study. For
example, in market segmentation analysis, the number of appropriate clusters depends
on the number of divisions in the company. However, the number of natural clusters can
also be estimated using descriptive statistics to guide the choice. In the well-known
k-means clustering algorithm, the chosen value of k determines the number of clusters
that will be found. If this number does not match the natural structure of the data, the
technique will obtain poor results. Unless there is good prior knowledge of how many
clusters exist in the data, it is very difficult to choose k before applying k-means
clustering. We will discuss how to choose the right number in Section 5. In general, the
best set of clusters is the one that does the best job of keeping the distance between
objects within the same cluster small and the distance between objects of adjacent
clusters large. However, if the purpose of clustering is to detect unexpected patterns, the
right number of clusters might be the one that can find unexpected patterns from the data.
Cluster Interpretation
Clustering is a powerful unsupervised knowledge discovery
technique; however, it has some weaknesses and limitations. For example, if one does
not know what one is looking for, one may not recognize it when one finds it. Although
the clustering technique can help to find clusters, it is up to the user to interpret them.
The following approaches can help the user to understand clusters:
Use graphical tools or summary statistics to examine the within-cluster distribution
for each variable (StatExplore node)
Use graphical tools such as box plots to study the within cluster distribution
for each continuous variable
Use graphical tools such as the Mosaic plot to study the within cluster
distribution for each categorical variable
Study the within cluster statistics for each variable
Use other visualization tools to see how the clusters are affected by each variable,
for example with the normalized mean plot (Clustering node)
Build a decision tree (or other supervised data mining tools) with the cluster label
as the target variable and use it to derive rules explaining how to assign new
records to the correct cluster
Section 2 Distance Measures
2.1 Euclidean Distance
Let $x(i) = \big(x_1(i), x_2(i), \ldots, x_m(i)\big)$ and $x(j) = \big(x_1(j), x_2(j), \ldots, x_m(j)\big)$ be any two
objects with m variables (features), and suppose all variables are quantitative (interval type),
i.e., they can be represented by continuous real-valued numbers. The most common choice of
distance measure is the Euclidean distance, a special case of the Minkowski metric ($L_p$ norm).
The Minkowski metric is defined as

$$d\big(x(i), x(j)\big) = d(i, j) = \left( \sum_{k=1}^{m} \big| x_k(i) - x_k(j) \big|^{p} \right)^{1/p}. \qquad (1)$$

For p = 2, d(i, j) becomes the Euclidean distance. For p = 1, d(i, j) becomes the sum of
absolute deviations (the city-block or Manhattan distance) between the two objects. For
p = ∞, d(i, j) becomes the maximum
absolute deviation between the two objects.
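To make equation (1) concrete, the following small sketch (in Python with NumPy, which is not part of the original notes; the two three-variable objects are made up for illustration) computes the Minkowski distance for p = 1, p = 2, and the limiting case p = ∞.

import numpy as np

def minkowski(x_i, x_j, p):
    # Minkowski (Lp) distance of equation (1) between two objects
    diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    if np.isinf(p):
        return diff.max()                  # p = infinity: maximum absolute deviation
    return (diff ** p).sum() ** (1.0 / p)

x_i = [2.0, 5.0, 1.0]                      # hypothetical object i
x_j = [4.0, 1.0, 3.0]                      # hypothetical object j
print(minkowski(x_i, x_j, 1))              # city-block distance: 2 + 4 + 2 = 8
print(minkowski(x_i, x_j, 2))              # Euclidean distance: sqrt(24), about 4.90
print(minkowski(x_i, x_j, np.inf))         # maximum absolute deviation: 4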
Another popular similarity measure for numerical variables is the Pearson correlation
coefficient, defined by

$$\rho\big(x(i), x(j)\big) = \frac{\displaystyle\sum_{k=1}^{m} \big(x_k(i) - \bar{x}_i\big)\big(x_k(j) - \bar{x}_j\big)}{\left[ \displaystyle\sum_{k=1}^{m} \big(x_k(i) - \bar{x}_i\big)^2 \sum_{k=1}^{m} \big(x_k(j) - \bar{x}_j\big)^2 \right]^{1/2}}, \qquad (2)$$

where $\bar{x}_i = \sum_{k=1}^{m} x_k(i)/m$ and $\bar{x}_j = \sum_{k=1}^{m} x_k(j)/m$ are the sample averages over variables for
objects i and j, respectively. It is worth noting that clustering based on correlation is
equivalent to clustering based on Euclidean distance if the inputs are standardized first
(Problem 1 in Appendix 7).
If some features are not quantitative (say they are nominal or ordinal), Euclidean distance
may not be appropriate (nor is equation (2)). Typically, we create a dummy variable for
each level of a nominal variable and replace the i-th of the M ordered categories of an
ordinal variable with

$$\frac{i - 1/2}{M}, \qquad i = 1, 2, \ldots, M, \qquad (3)$$

where M is the number of categories. The transformed variables can then be treated as
quantitative variables on this scale.
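As a small illustration of these conversions (a Python sketch; the category labels and values are hypothetical), the code below maps an ordinal variable with M ordered categories onto the (i - 1/2)/M scale of equation (3) and expands a nominal variable into dummy variables.

# Ordinal variable: replace the i-th ordered category (i = 1, ..., M) with (i - 1/2)/M
levels = ["low", "medium", "high"]                      # ordered categories, M = 3
M = len(levels)
ordinal_score = {lvl: (i - 0.5) / M for i, lvl in enumerate(levels, start=1)}
print(ordinal_score)                                    # {'low': 0.166..., 'medium': 0.5, 'high': 0.833...}

# Nominal variable: one 0/1 dummy variable per category
def dummies(value, categories):
    return [1 if value == c else 0 for c in categories]

print(dummies("suburban", ["urban", "suburban", "rural"]))   # [0, 1, 0]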
2.2 Distance Measures with Weights
When the variables are not equally important, each variable can be assigned a weight $w_k$
and the Minkowski metric generalizes to

$$d_{ws}(i, j) = \left( \sum_{k=1}^{m} w_k \big| x_k'(i) - x_k'(j) \big|^{p} \right)^{1/p}, \qquad (4)$$

where $x_k'$ denotes the (possibly transformed or standardized) value of variable k.
If the goal of clustering is to discover the natural grouping of the data, some
variables may exhibit more of a grouping tendency than others, and variables that are more
relevant in separating the groups should be assigned higher weights. If the goal of
clustering is to segment the data into groups of similar cases, all variables might
not contribute equally to the (problem-dependent) notion of dissimilarity between cases.
In either situation, the domain expert should play an important role in assigning the weight to each variable.
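A weighted version of equation (4) is only a small change to the earlier distance sketch; the weights below are purely illustrative and would in practice be supplied by the domain expert.

import numpy as np

def weighted_minkowski(x_i, x_j, w, p=2):
    # Weighted Lp distance of equation (4); w_k expresses the importance of variable k
    diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return (np.asarray(w, dtype=float) * diff ** p).sum() ** (1.0 / p)

# Down-weighting the second (hypothetical) variable to one quarter of the others
print(weighted_minkowski([2.0, 5.0, 1.0], [4.0, 1.0, 3.0], w=[1.0, 0.25, 1.0]))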
Section 3 Clustering Methods
3.1 Hierarchical Clustering
In single linkage, the distance between two clusters is the smallest distance between a case
in one cluster and a case in the other; in complete linkage, it is the largest such distance.
In average linkage, the distance between two clusters $C_i$ and $C_j$ is the average of the
distances over all pairs of cases, one case from each cluster:

$$D_{AL}(C_i, C_j) = \frac{1}{n_{C_i}\, n_{C_j}} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y).$$

Unlike single and complete linkage, the clusters produced by average linkage can be
affected by the particular distance measure used, even if different distance measures have the same
relative ordering. This makes average linkage unappealing in data mining applications.
Agglomerative clustering only requires a distance matrix to initiate the clustering
procedure. This means that it does not need to store all variable values for each case. If
we can compute the distance between variables, these methods can be applied in
variable clustering as well. We will address this issue in Section 4. The methods
mentioned here have several drawbacks. First, a case will stay in the same cluster once it
is assigned to this cluster. This indicates that reallocation is not allowed in the clustering
process even if one case has been wrongly assigned to a cluster. Second, these methods
are sensitive to outliers and noise. Thus, we need to try several different clustering
methods and, within each method, try several distance measures. If the outcomes from
all methods are roughly consistent with one another, perhaps a set of good clusters has
been identified. We can also add a small random perturbation to each case before applying the
clustering method to see how stable the clusters are.
Other than the linkage methods, there are centroid methods (the distance between two
clusters is the Euclidean distance between their centroids) and Ward's method (the
distance between two clusters is the ANOVA sum of squares between the two clusters added
up over all variables).
Example 1 Single Linkage, Complete Linkage, and Average Linkage
The distances between pairs of five cases are given below.
      1    2    3    4    5
 1    0
 2    9    0
 3    3    7    0
 4    6    5    9    0
 5   11   10    2    8    0
Cluster the five cases using each procedure, draw the dendrograms (tree structures),
and compare the results.
(a) Single linkage hierarchical procedure.
(b) Complete linkage hierarchical procedure.
(c) Average linkage hierarchical procedure.
<Solutions>:
(a) Step 1: Merge case 3 and case 5 since min(d_ij) = d_35 = 2.
Step 2: d_(35),1 = min(d_31, d_51) = 3
        d_(35),2 = min(d_32, d_52) = 7
        d_(35),4 = min(d_34, d_54) = 8
Thus, the new distance matrix is
        (35)   1    2    4
 (35)    0
  1      3    0
  2      7    9    0
  4      8    6    5    0
Step 3: Merge cluster (35) and case 1 since the minimum distance is 3.
Step 4: The new distance matrix is
        (135)  2    4
 (135)   0
  2      7    0
  4      6    5    0
Step 5: Merge case 2 and case 4 since the minimum distance is 5.
Step 6: Merge all cases together; the minimum distance is 6.
(b) Step 1: Merge case 3 and case 5 since min(d_ij) = d_35 = 2.
Step 2: d_(35),1 = max(d_31, d_51) = 11
        d_(35),2 = max(d_32, d_52) = 10
        d_(35),4 = max(d_34, d_54) = 9
Thus, the new distance matrix is
        (35)   1    2    4
 (35)    0
  1     11    0
  2     10    9    0
  4      9    6    5    0
Step 3: Merge case 2 and case 4 since the minimum distance is 5.
Step 4: The new distance matrix is
        (35)  (24)   1
 (35)    0
 (24)   10     0
  1     11     9    0
Step 5: Merge cluster (24) and case 1 since the minimum distance is 9.
Step 6: Merge all cases together; the minimum distance is 11.
(c) Step 1: Merge case 3 and case 5 since min(d_ij) = d_35 = 2.
Step 2: d_(35),1 = (d_31 + d_51)/2 = 7
        d_(35),2 = (d_32 + d_52)/2 = 8.5
        d_(35),4 = (d_34 + d_54)/2 = 8.5
Thus, the new distance matrix is
        (35)   1    2    4
 (35)    0
  1      7    0
  2     8.5   9    0
  4     8.5   6    5    0
Step 3: Merge case 2 and case 4 since the minimum distance is 5.
Step 4: The new distance matrix is
        (35)  (24)   1
 (35)    0
 (24)   8.5    0
  1      7    7.5   0
Step 5: Merge cluster (35) and case 1 since the minimum distance is 7.
Step 6: Merge all cases together; the distance between clusters (135) and (24) is the
average of the six remaining pairwise distances, (9 + 6 + 7 + 9 + 10 + 8)/6 ≈ 8.17.
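These hand calculations can be checked with standard software. The sketch below (Python with SciPy, not part of the original notes) feeds the Example 1 distance matrix to scipy.cluster.hierarchy.linkage for the three linkage rules; the merge heights it reports correspond to the minimum distances found in parts (a)-(c) above.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Symmetric distance matrix for the five cases of Example 1
D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
], dtype=float)

condensed = squareform(D)                      # condensed form required by linkage()
for method in ("single", "complete", "average"):
    print(method)
    print(linkage(condensed, method=method))   # columns: merged clusters, merge height, cluster size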
3.2 Partition Based Clustering
In partition based clustering, the task is to partition the data set into k disjoint clusters
of cases such that the cases within each cluster are as homogeneous as possible. Given a set
of n cases D = {x(1), x(2), ..., x(n)}, our task is to find k clusters C = {C1, C2, ..., Ck}
such that each case x(i) is assigned to a unique cluster Ck. There are many score functions
that can be used to measure the quality of a clustering. The centroid method uses the
distance between two cluster centroids to measure the distance between two clusters. The
average method uses the average distance between all pairs of points (one point from each
cluster). Ward's statistic uses the between-cluster sum of squares. Once the score function
is selected, finding the clusters is an optimization problem, and many optimization
procedures can be applied. Here, we only introduce the popular K-means clustering method.
The K-means algorithm is intended for situations in which all variables are of the
quantitative type, and squared Euclidean distance is chosen as the dissimilarity measure.
Let D = {x(1), x(2), ..., x(n)} be the n cases; the task is to find K clusters C = {C1, C2, ..., CK}:

  for k = 1, 2, ..., K let r(k) be a randomly selected point in D;
  while changes in the clusters Ck occur do
    form clusters:
      for k = 1, 2, ..., K do
        Ck = {x in D | d(r(k), x) <= d(r(j), x) for all j = 1, 2, ..., K, j != k}
      end;
    compute the new cluster centers:
      for k = 1, 2, ..., K do
        r(k) = the vector mean of the cases in Ck
      end;
  end;
The time complexity of the K-means algorithm is O(KnI), where I is the number of
iterations. Since K, the number of clusters, is fixed in partition based clustering methods,
the selection of K is critical. If the number of natural clusters is different from K, a
partition based clustering algorithm cannot obtain good results. We will address how
to select the right value of K in Section 5.
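A direct NumPy translation of the pseudocode above is sketched below; the two-dimensional sample data and the choice K = 2 are only for illustration.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    # Basic K-means: assign each case to its nearest center, then recompute the means
    rng = np.random.default_rng(seed)
    r = X[rng.choice(len(X), size=K, replace=False)]        # initial seeds drawn from the data
    for _ in range(n_iter):
        # form clusters: label each case with the index of its closest center
        d = ((X[:, None, :] - r[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # compute the new cluster centers as the vector means of the assigned cases
        new_r = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_r, r):                           # stop when the centers no longer change
            break
        r = new_r
    return labels, r

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centers = kmeans(X, K=2)
print(labels, centers)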
Section 4 Variable Clustering
All methods discussed in Section 3.1 except the average linkage method can be used in
variable clustering, provided the distance between variables can be computed. The most popular
distance measure between two quantitative variables is the Pearson correlation coefficient
(Appendix 2). For categorical variables, we typically use a measure of association as the
distance between them. It can be shown that the correlation and the association are
equivalent if both variables are binary (see Problem 2 in Appendix 7).
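The equivalence for binary variables is easy to check numerically. The sketch below (Python; the 0/1 data are made up) compares the Pearson correlation of two binary variables with the association computed from their 2 x 2 contingency table, using the formula stated in Problem 2 of Appendix 7.

import numpy as np

x1 = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
x2 = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

pearson = np.corrcoef(x1, x2)[0, 1]                      # ordinary Pearson correlation

# Association (phi coefficient) from the 2 x 2 table counts a, b, c, d
a = np.sum((x1 == 0) & (x2 == 0))
b = np.sum((x1 == 0) & (x2 == 1))
c = np.sum((x1 == 1) & (x2 == 0))
d = np.sum((x1 == 1) & (x2 == 1))
phi = (a * d - b * c) / np.sqrt((a + b) * (a + c) * (b + d) * (c + d))

print(pearson, phi)                                      # the two values agree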
As an example of variable clustering, consider eight variables X1 through X8 with the
following correlation matrix:

        X1      X2      X3      X4      X5      X6      X7      X8
 X1   1.000
 X2   0.643   1.000
 X3  -0.103  -0.348   1.000
 X4  -0.820  -0.086   0.100   1.000
 X5  -0.259  -0.260   0.435   0.034   1.000
 X6  -0.152  -0.010   0.028  -0.288   0.176   1.000
 X7   0.045   0.211   0.115  -0.164  -0.019  -0.374   1.000
 X8  -0.013  -0.328   0.005   0.486  -0.007  -0.561  -0.185   1.000

[The original pages contained step-by-step worksheets (Steps 1 through 7) applying single
linkage and complete linkage to these eight variables, using the correlations above as the
similarity measure; in both worksheets the first merge joins X3 and X8.]
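Variable clustering of this kind can be reproduced with standard tools. The sketch below (Python with SciPy; converting correlations to distances via d = 1 - r is one common choice and is not necessarily the convention used in the original worksheets) applies single and complete linkage to the eight variables.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Correlation matrix of the eight variables (symmetric completion of the table above)
R = np.array([
    [ 1.000,  0.643, -0.103, -0.820, -0.259, -0.152,  0.045, -0.013],
    [ 0.643,  1.000, -0.348, -0.086, -0.260, -0.010,  0.211, -0.328],
    [-0.103, -0.348,  1.000,  0.100,  0.435,  0.028,  0.115,  0.005],
    [-0.820, -0.086,  0.100,  1.000,  0.034, -0.288, -0.164,  0.486],
    [-0.259, -0.260,  0.435,  0.034,  1.000,  0.176, -0.019, -0.007],
    [-0.152, -0.010,  0.028, -0.288,  0.176,  1.000, -0.374, -0.561],
    [ 0.045,  0.211,  0.115, -0.164, -0.019, -0.374,  1.000, -0.185],
    [-0.013, -0.328,  0.005,  0.486, -0.007, -0.561, -0.185,  1.000],
])

D = 1.0 - R                              # correlation-based distance: identical variables get distance 0
condensed = squareform(D, checks=False)  # condensed form required by linkage()

for method in ("single", "complete"):
    print(method)
    print(linkage(condensed, method=method))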
Section 5 Practical Considerations
Besides the selection of a distance measure between cases and a distance measure between
clusters, there are four important practical issues: imputing missing values, converting
qualitative variables to quantitative variables, selecting the initial seeds, and deciding
the right number of clusters.
To impute missing values: We can either exclude cases with one or more missing
variables from the analysis or impute the missing values with the methods discussed in the
data preparation course.
To convert to quantitative variables:
   Ordinal variable: Replace the ordered categories by the numerical values defined
   in equation (3).
   Nominal variable: Replace each category by one binary dummy variable.
   Group of related binary variables: Use a method similar to the one used to obtain the
   missing value pattern (MVP) to group the related binary variables first, and then
   consider the MVP as an ordinary variable.
To select the initial seeds: The initial seeds must be complete cases, that is, cases
that have no missing values, and they should be chosen to be as far apart as possible.
They can be selected either at random or with the optimization strategies recommended
by Hastie et al. (2001, p. 470).
To decide the right number of clusters: The choice of the right number of clusters
depends on the goal of the clustering. Data-driven methods for estimating the right
number K typically examine the within-cluster dissimilarity $W_K$ as a function of the
number of clusters K: separate within-cluster dissimilarity measures $W_1, W_2, \ldots, W_{K_{max}}$
are calculated for $K \in \{1, 2, \ldots, K_{max}\}$. The sequence $\{W_1, W_2, \ldots, W_{K_{max}}\}$
is monotone decreasing, and a sharp drop (an elbow) in the sequence suggests a natural
number of clusters.
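A minimal sketch of this idea (Python with scikit-learn, not part of the original notes; the synthetic data and K_max = 8 are illustrative) computes the within-cluster sum of squares W_K for K = 1, ..., K_max so that the elbow can be inspected.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic data with three natural groups
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

K_max = 8
W = []
for K in range(1, K_max + 1):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    W.append(km.inertia_)                # within-cluster sum of squares W_K

for K, w in enumerate(W, start=1):
    print(K, round(w, 1))                # look for the sharp drop (elbow), here near K = 3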
Section 6 Clustering with SAS Enterprise Miner
The Clustering node lets you choose how the distance between two clusters is measured:
   Average: the distance between two clusters is the average distance between
   pairs of observations, one in each cluster.
   Centroid: the distance between two clusters is the Euclidean distance between
   their centroids or means.
   Ward: the distance between two clusters is the ANOVA sum of squares
   between the two clusters added up over all the variables.
Dummy variable representation of nominal variables can be problematic in k-means
clustering since the dummies tend to dominate the analysis. One way to reduce
their dominance is to use the rank representation discussed in Section 5.
Full Replacement: Select initial seeds that are very well separated using a full
replacement algorithm
Princomp: Select evenly-spaced seeds along the first principal component
Partial Replacement: Select initial seeds that are well separated using a partial
replacement algorithm
Section 7 Case Study 1: K-Means Clustering with the Clustering Node
This case study shows you how to use the popular K-means clustering method in the
Clustering node and how to use the Clustering Result Browser to identify interesting
patterns. The diagram shows the steps that take place in the course of this case study.
Data Source:
Select PROSPECT from Lec12 library
Since the variable Climate is a grouping of the variable Location, we set the
model role of Location to Rejected; Climate supersedes Location.
StatExplore Node: We can use the StatExplore node to perform data exploration.
Impute Node:
Since the amount of missing values is only about 2% of the data, the missing
value indicator variables are not very important and so we do not need to create
missing value indicator variables.
Set the default imputation method for both class and interval variables to Tree.
Clustering Node:
Set the Internal Standardization property to Standardization because K-means
clustering is very sensitive to the scales of measurement of the different input
variables. Using this option means that all variables have the same influence on
the overall dissimilarity between pairs of cases.
Since we already imputed all missing values in Impute node, we keep the
default for all Missing Values properties.
Results:
1. Examine the Segment Size pie chart. The observations are divided fairly evenly
between the four segments.
2. Maximize the Segment Plot in the Results window to begin examining the differences
between the clusters.
When you use your cursor to point at a particular section of a graph, information on
that section appears in a pop-up window. Some initial conclusions you might draw
from the segment plot are:
The segments appear to be similar with respect to the variables CLIMATE and
FICO.
The individuals in segment 3 are all homeowners. There are some homeowners in
segment 4, and a few homeowners in segment 2.
Most of the individuals in segment 1 are married, while most of those in segment 4
are unmarried.
Younger individuals are in segment 4.
Most of the individuals in segment 1 are males while most of those in segment 2 are
females.
Income appears to be lower in segment 2.
3. Restore the Segment Plot window to its original size and maximize the Mean
Statistics window.
The window gives descriptive statistics and other information about the clusters such
as the frequency of the cluster, the nearest cluster, and the average value of the input
variables in the cluster.
Scroll to the right to view the statistics for each variable for each cluster. These
statistics confirm some of the conclusions drawn based on the graphs. For example,
the average age of individuals in cluster 4 is approximately 35.5, while the average
ages in clusters 1, 2, and 3 are approximately 49.2, 45.5, and 48.8 respectively.
You can use the Plot feature to graph some of these mean statistics. For example,
create a graph to compare the average income in the clusters.
4. With the Mean Statistics window selected, select View > Graph Wizard, or select
the Plot button.
8. Select Finish.
Another way to examine the clusters is with the cluster profile tree.
9. Select View > Cluster Profile Tree.
You can use the ActiveX feature of the graph to see the statistics for each node and you
can control what is displayed with the Edit menu. The tree lists the percentages and
numbers of cases assigned to each cluster and the threshold values of each input variable
displayed as a hierarchical tree. It enables you to see which input variables are most
effective in grouping cases into clusters.
10. Close the Cluster results window when you have finished exploring the results.
In summary, the four clusters can be described as follows:
Cluster 1: married males
Cluster 2: lower income females
Cluster 3: married homeowners
Cluster 4: younger, unmarried people
These clusters may or may not be useful for marketing strategies, depending on the line
of business and planned campaigns.
Be careful: the measurement levels of many variables have been changed in the
course of the analysis, and some variables are excluded from the clustering analysis after
looking at the clustering results.
The subsequent screen shots are self-explanatory and consistent with how we proceeded
in Case Study 7.
Clusters Tab:
Seeds Tab:
Missing Tab:
Clustering Results:
Selecting Screen Shots:
(1) CCC Plot:
[Table: mean values of the input variables for the four clusters, including the cluster
frequencies, Degree of Urbanization, number of meals per week in the board charge, percent
Black non-Hispanic, percent American Indian/Alaskan Native, percent Asian/Pacific Islander,
percent Hispanic, combined charge for room and board, dormitory capacity, graduate and
undergraduate 12-month unduplicated counts, full-time first-time degree-seeking
undergraduates, typical number of credit hours for a FTFY undergraduate, and in-state and
out-of-state tuition and fees for undergraduates and graduates.]
The variance of variable k is

$$\sigma_k^2 = \frac{1}{n} \sum_{i=1}^{n} \big( x_k(i) - \mu_k \big)^2,$$

or, using the sample mean,

$$\sigma_k = \left[ \frac{1}{n} \sum_{i=1}^{n} \big( x_k(i) - \bar{x}_k \big)^2 \right]^{1/2}$$

over all i. The standardized distance between cases i and j is then

$$d_{std}(i, j) = \left( \sum_{k=1}^{m} \big| x_k'(i) - x_k'(j) \big|^{p} \right)^{1/p}, \quad\text{where } x_k'(i) = \frac{x_k(i) - \bar{x}_k}{\sigma_k}.$$
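Written in code (a Python sketch over a small made-up data matrix), the standardization and the standardized distance look as follows.

import numpy as np

X = np.array([[25.0, 52000.0],
              [40.0, 61000.0],
              [33.0, 47000.0]])          # rows are cases, columns are variables (hypothetical)

xbar = X.mean(axis=0)
sigma = X.std(axis=0)                    # divide by n, matching the formula above
X_std = (X - xbar) / sigma               # x'_k(i) = (x_k(i) - xbar_k) / sigma_k

# Standardized Euclidean (p = 2) distance between case 0 and case 1
d_std = np.sqrt(((X_std[0] - X_std[1]) ** 2).sum())
print(d_std)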
The covariance is a measure of how two numerically valued variables X1 and X2 vary
together. Large values of X1 tend to be associated with large values of X2 when the
covariance has a large positive value, and large values of X1 tend to be associated with
small values of X2 when the covariance has a large negative value. Since the covariance
depends on the units used in measuring X1 and X2, the definition of "large" is
problematic. To overcome this weakness, one can use the correlation instead of the covariance.
Let $x_i$ and $x_j$ be any two variables measured on n cases, with values $x_i(k)$ and
$x_j(k)$ for case k. The covariance between the two variables is defined as

$$\mathrm{Cov}(x_i, x_j) = \frac{1}{n} \sum_{k=1}^{n} \big( x_i(k) - \bar{x}_i \big)\big( x_j(k) - \bar{x}_j \big),$$

and the correlation between them is

$$\rho(x_i, x_j) = \frac{\displaystyle\sum_{k=1}^{n} \big( x_i(k) - \bar{x}_i \big)\big( x_j(k) - \bar{x}_j \big)}{\left[ \displaystyle\sum_{k=1}^{n} \big( x_i(k) - \bar{x}_i \big)^2 \sum_{k=1}^{n} \big( x_j(k) - \bar{x}_j \big)^2 \right]^{1/2}}.$$
The data set PROSPECT has 5,055 observations and 9 variables from a catalog
company. The company periodically purchases demographic information from outside
sources. It wants to use this data set to design a test mail campaign to learn its
potential customers' preferences for several new products. Based on experience, the
company knows that customer preference for its products depends on several geographic
and demographic variables, so it wants to segment its customers with respect to these
variables. After the potential customers have been segmented, a random sample of
prospective customers within each segment will be mailed one or several offers. The
results of the test mail campaign can provide the company an estimate of the potential
profits for these new products. The output from PROC CONTENTS is as follows:
Alphabetic List of Variables and Attributes

 Variable    Type   Format    Informat   Label
 Age         Num    BEST12.   F12.
 Climate     Char   $F2.      $F2.
 FICO        Num    BEST12.   F12.       Credit Score
 Gender      Char   $F1.      $F1.
 HomeOwner   Num    BEST12.   F12.
 ID          Char   $F9.      $F9.       Identification Code
 Income      Num    BEST12.   F12.       Income ($K)
 Location    Char   $F1.      $F1.
 Married     Num    BEST12.   F12.
This data set is part of IC98_HD from IPEDS (the Integrated Postsecondary Education Data
System). Interested students can check the IPEDS web site to find out more about this data
set.
  #  Variable   Type  Len  Pos  Format     Label
 18  ACCRD1     Num   8    136  YESNOA.    National or specialized accrediting
 19  ACCRD2     Num   8    144  YESNOA.    Regional accrediting agency
 20  ACCRD3     Num   8    152  YESNOA.    State accrediting or approval agency
  3  AFFIL      Num   8     16  PRIFMTA.   Affiliation of institution
 33  BOARD      Num   8    256  YESNOA.    Institution provides board or meal plan
 36  BOARDAMT   Num   8    280             Typical board charge for academic year
 22  CALSYS     Num   8    168             Calendar system
 43  CDACTGA    Num   8    336             Graduate credit hour activity
 42  CDACTUA    Num   8    328             Undergraduate credit hour activity
 38  ENROLMNT   Num   8    296             Corrected fall enrollment count
 44  GSAA154    Num   8    344             Generated total for faculty 9/10 month contract
 10  HBCU       Num   8     72  YESNOB.    Historically Black College or University
  4  HLOFFER    Num   8     24  HLOFFERF.  Highest level of offering
  1  ID         Num   8      0
 11  LOCALE     Num   8     80             Degree of Urbanization
 35  MEALSVRY   Num   8    272  FIX.       Number meals/wk/BORDAMT/ROOMAMT
 34  MEALSWK    Num   8    264             Number of meals per week in board charge
 24  MIL1INSL   Num   8    184  YESNOA.    MILI in states and/or territories
 25  MIL2INSL   Num   8    192  YESNOA.    MILI at military installations abroad
 23  MILI       Num   8    176  YESNOA.    Courses at military installations
  6  PCTMIN1    Num   8     40             Percent Black, non-Hispanic
  7  PCTMIN2    Num   8     48             Percent American Indian/Alaskan Native
  8  PCTMIN3    Num   8     56             Percent Asian/Pacific Islander
  9  PCTMIN4    Num   8     64             Percent Hispanic
 12  PEO1ISTR   Num   8     88  YESNOB.    Occupational
 13  PEO4ISTR   Num   8     96  YESNOB.    Recreational or avocational
 14  PEO5ISTR   Num   8    104  YESNOB.    Adult basic remedial or HS equivalent
 26  PG300      Num   8    200  YESNOA.    Programs at least 300 contact hrs.
 17  PRIVATE    Num   8    128  PRIFMT.    Private control
 15  PUBLIC1    Num   8    112  YESNOA.    Federal
 16  PUBLIC2    Num   8    120  YESNOA.    State
 37  RMBRDAMT   Num   8    288             Combined charge for room and board
 32  ROOMCAP    Num   8    248             Total dormitory capacity
 21  SACCR      Num   8    160  YESNOA.    Accrd by US Dept Ed recognized agency
  2  SECTOR     Num   8      8  SECTORF.   Sector of institution
 41  TOSTUCG    Num   8    320             Graduate unduplicated count in 12-month
 39  TOSTUCU    Num   8    304             UG 12-month unduplicated count
 40  TOSTUFR    Num   8    312             FT 1st time degree seek UG
 29  TPUGCRED   Num   8    224             Typical # of crd. Hrs. FTFY UG student
 27  TUITION2   Num   8    208             Tuition & fees FTFY UG in-state
 28  TUITION3   Num   8    216             Tuition & fees FTFY UG out-of-state
 30  TUITION6   Num   8    232             Tuition & fees FTFY Grad in-state
 31  TUITION7   Num   8    240             Tuition & fees FTFY Grad out-of-state
  5  UGOFFER    Num   8     32  YESNOA.    Undergraduate offering
 45  USTIER     Num   8    352             US News and World Report Rating
The best way to use the CCC is to plot its value against the number of clusters, ranging
from one cluster up to about one-tenth the number of observations. The CCC may not
behave well if the average number of observations per cluster is less than ten. The
following guidelines should be used for interpreting the CCC:
Peaks on the plot with the CCC greater than 2 or 3 indicate good clusterings.
Peaks with the CCC between 0 and 2 indicate possible clusters but should be
interpreted cautiously.
Very distinct nonhierarchical spherical clusters usually show a sharp rise before
the peak followed by a gradual decline.
Very distinct nonhierarchical elliptical clusters often show a sharp rise to the
correct number of clusters followed by a further gradual increase and eventually a
gradual decline.
If all values of the CCC are negative and decreasing for two or more clusters, the
distribution is probably unimodal or long-tailed.
Very negative values of the CCC, say, -30, may be due to outliers. Outliers
generally should be removed before clustering and their removal documented.
A final and very important warning: neither the CCC nor R2 is an appropriate criterion
for clusters that are highly elongated or irregularly shaped. If you do not have prior
substantive reasons for expecting compact clusters, use a nonparametric clustering
method such as Wong and Lane's (1983) rather than Ward's method or k-means
clustering.
Appendix 6 References
Hand, D., Mannila, H., and Smyth, P. (2001), Principles of Data Mining, Chapter 9, MIT Press.
Berry, M. J. A. and Linoff, G. S. (2000), Mastering Data Mining, Chapter 5, John Wiley & Sons, Inc.: New York, NY.
Johnson, R. A. and Wichern, D. W. (1982), Applied Multivariate Statistical Analysis, Chapter 11, Prentice-Hall, Inc.: Englewood Cliffs, NJ.
Rud, O. P. (2001), Data Mining Cookbook, John Wiley & Sons, Inc.: New York, NY.
Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning, Chapter 14, Springer.
Tibshirani, R., Walther, G., and Hastie, T. (2001), "Estimating the Number of Clusters in a Data Set via the Gap Statistic," Journal of the Royal Statistical Society, Series B, 63, 411-423.
Wong, M. A. and Lane, T. (1983), "A kth Nearest Neighbor Clustering Procedure," Journal of the Royal Statistical Society, Series B, 45, 362-368.
Appendix 7 Exercises
Problem 1: Let $x(i) = \big(x_1(i), x_2(i), \ldots, x_m(i)\big)$ and $x(j) = \big(x_1(j), x_2(j), \ldots, x_m(j)\big)$
be any two objects with m variables (features). Let $\bar{x}_i = \sum_{k=1}^{m} x_k(i)/m$ and
$\bar{x}_j = \sum_{k=1}^{m} x_k(j)/m$ be the averages over variables for objects i and j,
respectively. Also, let

$$s_i = \left[ \frac{\sum_{k=1}^{m} \big( x_k(i) - \bar{x}_i \big)^2}{m - 1} \right]^{1/2} \quad\text{and}\quad s_j = \left[ \frac{\sum_{k=1}^{m} \big( x_k(j) - \bar{x}_j \big)^2}{m - 1} \right]^{1/2}$$

be the standard deviations over variables for objects i and j, and define the standardized values

$$w_k(i) = \frac{x_k(i) - \bar{x}_i}{s_i} \quad\text{and}\quad w_k(j) = \frac{x_k(j) - \bar{x}_j}{s_j}.$$

Show that the squared Euclidean distance between the standardized objects depends on x(i)
and x(j) only through their correlation; specifically,

$$\sum_{k=1}^{m} \big( w_k(i) - w_k(j) \big)^2 = 2\,(m - 1)\,\big[ 1 - \rho\big(x(i), x(j)\big) \big],$$

where $\rho$ is the correlation defined in equation (2).
Problem 2: Let X1 and X2 be two binary (0/1) variables whose joint frequencies over n cases
are given by the 2 x 2 table

              X2 = 0    X2 = 1
   X1 = 0        a         b
   X1 = 1        c         d

Show that the Pearson correlation coefficient between X1 and X2 equals the association measure

$$r = \frac{ad - bc}{\big[ (a + b)(a + c)(b + d)(c + d) \big]^{1/2}}.$$