
Session 13 Clustering

Section 1 Background
Section 2 Distance Measures
2.1 Euclidean Distance
2.2 Distance Measures with Weights
Section 3 Clustering Methods
3.1 Hierarchical Clustering
3.2 Partition Based Clustering
Section 4 Variable Clustering
Section 5 Practical Considerations
Section 6 Clustering with SAS Enterprise Miner
Section 7 Case Study 1: K-Means Clustering with the Clustering Node
Section 8 Case Study 2: Clustering with KBM Data Set


Appendix 1 Distance Measures after Standardization


Appendix 2 Covariance and the Pearson Correlation Coefficient
Appendix 3 Data Used in Session 13, Section 7
Appendix 4 Data Used in Session 13, Section 8
Appendix 5 Cubic Clustering Criterion
Appendix 6 References
Appendix 7 Exercises


Section 1 Background
Cluster analysis (also known as data segmentation in the data mining community) has a
variety of goals. The goals are invariably related to grouping or segmenting a collection
of objects into disjoint subsets or clusters such that those objects within each cluster are
similar to each other, while objects assigned to different clusters are dissimilar.
Data mining applications typically deal with a humongous volume of data and it is likely
that the data are heterogeneous. This means that the data might fall into several distinct
groups, with objects within each subgroup being similar to each other but different from
objects from other groups. Since it is possible that there are different models or patterns
pertinent to each group, it is very challenging to spot any single pattern or model that is
germane to the whole data set. Creating clusters of similar objects reduces the model
complexity within the clusters which then enhances the chances for data mining
techniques to perform more successfully within each cluster. Even if the data do not have
natural groupings, partitioning data into homogeneous groups (empirically without
regard to a specific explanation for each cluster) can be very useful. For example, it is
well known that customer preferences for products depend on geographic and
demographic factors. Thus, we can use geographic and demographic factors to group
customers into several segments and to develop marketing strategies in each segment.
Although customers do not form these marketing segments naturally, it is much easier
and more effective to develop efficient marketing strategies for each segment separately
than to identify a single one-size-fits-all marketing strategy to target all customers.
Clustering is an important unsupervised data-mining tool. Unlike supervised data
mining tools that are driven by user direction, cluster analysis makes no a priori assumptions
concerning the number of clusters or the cluster structure. The basic objective in clustering
is to discover natural groupings of the cases or variables based on some similarity or
distance (dissimilarity) measures. Although there is no target variable to be predicted
explicitly at the clustering stages, a clustering technique can be used in many ways. First,
it can be used in missing value imputation. This method was illustrated in the SHOES
example in the previous session in which we used clustering to impute missing values for
the interval input variables (age, miles/week, races/year, years running). Although we
could have replaced the missing values by the population average of the non-missing
variables, such an approach would likely mask the basic structure among the variables while
potentially weakening the relationship between the input variables and the
target variable Number of Purchases (one pair or at least two pairs). Second, one can use
clustering to detect outliers because outliers typically belong to clusters with only one
case. Third, one can use clustering to discover the number of clusters if one suspects that
there are meaningful groupings that may represent groups of cases. After finding these
meaningful groups, one can then develop different ways to deal with each group such as
target marketing that will be discussed later. Fourth, we can use cluster analysis to
partition a complex data structure into several subsets in order to give supervised data-mining
techniques such as decision trees or neural networks a better chance of finding a
good predictive model.


Since the benefits of cluster analysis are evident, there have been extensive research
efforts in the past several decades on cluster analysis and there are many automatic
clustering techniques available. However, different clustering techniques lead to
different types of clusters and it is very difficult to tell whether a cluster analysis exercise
has been successful because cluster analysis is an unsupervised data mining exercise. To
use cluster techniques effectively, we need to at least understand several important
aspects of choosing clustering techniques--especially the selection of a similarity measure
(or dissimilarity measure).
Select Similarity Measure In order to decide whether a set of objects can be split into
subgroups, where members from the same group are more similar to one another than
they are to members from different groups, we need to define what similar means.
Suppose we want to group a deck of ordinary playing cards into clusters. There are many
ways to do so such as
Two clusters: One cluster has all the face cards and another cluster has all other
cards.
Four clusters: Each suit of thirteen cards forms a cluster.
Two clusters: Red suits are in one cluster and black suits are in another cluster.
Thirteen clusters: Each cluster has all cards that have the same face value.
Obviously, the clusters obtained are very different with different similarity measures.
Thus, we need to know how to choose a similarity measure before selecting the clustering
technique.
Select the Right Number of Clusters Another important question to ask in cluster
analysis is what is the right number of clusters? The choice of the number of clusters
depends on the goal of the study. For example, in market segmentation analysis, the
number of appropriate clusters depends on the number of divisions in the company.
However, the number of natural clusters can be decided using descriptive statistics to
guide the choice of the natural number of groups. In the well-known k-means clustering
algorithm, the chosen value of k determines the number of clusters that will be
found. If this number does not match the natural structure of the data, the technique
will obtain poor results. Unless there is good prior knowledge of how many clusters
exist in the data, it is very difficult to choose the number k before applying k-means
clustering. We will discuss how to choose the right number in Section 5. In general, the
best set of clusters is the one that does the best job of keeping the distance between
objects within the same cluster small and the distance between objects of adjacent
clusters large. However, if the purpose of clustering is to detect unexpected patterns, the
right number of clusters might be the one that can find unexpected patterns from the data.
Cluster Interpretation Clustering is a powerful, unsupervised knowledge discovery
technique; however, it has some weaknesses and limitations. For example, if one does
not know what one is looking for, one may not recognize it when one finds it. Although
the clustering technique can help to find clusters, it is up to the user to interpret them.
The following approaches can help the user to understand clusters:


- Use graphical tools or summary statistics to examine the within-cluster distribution
  for each variable (StatExplore node)
    - Use graphical tools such as box plots to study the within-cluster distribution
      for each continuous variable
    - Use graphical tools such as the mosaic plot to study the within-cluster
      distribution for each categorical variable
    - Study the within-cluster statistics for each variable
- Use other visualization tools to see how the clusters are affected by changes in
  each variable, for example with the normalization mean plot (Clustering node)
- Build a decision tree (or another supervised data mining tool) with the cluster label
  as the target variable and use it to derive rules explaining how to assign new
  records to the correct cluster

Section 2 Distance Measures (or Dissimilarity Measures)


Many data mining techniques, such as association analysis, cluster analysis,
multidimensional scaling, and classification analysis, are based on a similarity measure
between objects. One can measure the similarity directly through a survey. For example,
one can conduct a survey to find out the similarity between two different brands of beer.
However, typically the similarity between two objects cannot be measured directly.
Instead, one measures the similarity between two objects through the corresponding
vectors of property measurements.
Section 2.1 Euclidean Distance
Let x(i) = (x_1(i), x_2(i), ..., x_m(i)) and x(j) = (x_1(j), x_2(j), ..., x_m(j)) be any two
objects with m variables (features). If all variables are quantitative (interval type), i.e.,
can be represented by continuous real-valued numbers, the most common choice of
distance measure is the Euclidean distance, a special case of the Minkowski metric (Lp
norm). The Minkowski metric is defined as

    d(x(i), x(j)) = d(i, j) = \left( \sum_{k=1}^{m} |x_k(i) - x_k(j)|^p \right)^{1/p}.        (1)

For p = 2, d(i, j) becomes the Euclidean distance. For p = 1, d(i, j) becomes the sum of
absolute deviations (the city-block or Manhattan distance) between the two objects. For
p = \infty, d(i, j) becomes the maximum absolute deviation between the two objects.

The Minkowski metric satisfies the following properties:

    d(i, j) = d(j, i)                      (symmetry)
    d(i, j) > 0 if i \ne j                 (positivity)
    d(i, j) = 0 if i = j
    d(i, j) \le d(i, k) + d(j, k)          (triangle inequality)
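As a concrete illustration of equation (1), the short Python sketch below (not part of the
original notes; NumPy is assumed to be available) computes the Minkowski distance for
p = 1, p = 2, and the limiting maximum-deviation case:

    import numpy as np

    def minkowski(x_i, x_j, p=2):
        """Minkowski distance of equation (1); p=2 gives the Euclidean distance."""
        diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
        if np.isinf(p):
            return diff.max()              # maximum absolute deviation
        return (diff ** p).sum() ** (1.0 / p)

    x_i = [2.0, 5.0, 1.0]
    x_j = [4.0, 1.0, 3.0]
    print(minkowski(x_i, x_j, p=1))        # city-block distance: 8.0
    print(minkowski(x_i, x_j, p=2))        # Euclidean distance: about 4.899
    print(minkowski(x_i, x_j, p=np.inf))   # maximum absolute deviation: 4.0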


Another popular similarity measure for numerical variables is the Pearson correlation
coefficient defined by

    \rho(x(i), x(j)) = \frac{\sum_{k=1}^{m} (x_k(i) - \bar{x}_i)(x_k(j) - \bar{x}_j)}
                            {\left[ \sum_{k=1}^{m} (x_k(i) - \bar{x}_i)^2 \; \sum_{k=1}^{m} (x_k(j) - \bar{x}_j)^2 \right]^{1/2}},        (2)

where \bar{x}_i = \sum_{k=1}^{m} x_k(i)/m and \bar{x}_j = \sum_{k=1}^{m} x_k(j)/m are the sample
averages over variables for objects i and j, respectively. It is worth noting that clustering
based on correlation is equivalent to that based on Euclidean distance if the inputs are
standardized first (Problem 1 in Appendix 7).
If some features are not quantitative (say they are nominal or ordinal), Euclidean distance
may not be appropriate (as well as being unsuitable for applying equation (2)). Typically,
we can create dummy variables for each level of a nominal variable and replace each of
the original categories of an ordinal variable with

    \frac{i - 1/2}{M}, \quad i = 1, 2, \ldots, M,        (3)

where M is the number of categories. The transformed variables can be treated as
quantitative variables on this scale.
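A minimal sketch of these conversions (assuming pandas is available; the column names
and category ordering are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "size":   ["small", "medium", "large", "medium"],   # ordinal, M = 3
        "region": ["north", "south", "east", "north"],      # nominal
    })

    # Ordinal: replace category i (i = 1..M) with (i - 1/2) / M, per equation (3)
    order = ["small", "medium", "large"]
    M = len(order)
    df["size_num"] = df["size"].map({c: (i + 1 - 0.5) / M for i, c in enumerate(order)})

    # Nominal: one binary (0/1) dummy column per level
    dummies = pd.get_dummies(df["region"], prefix="region")
    df = pd.concat([df.drop(columns=["size", "region"]), dummies], axis=1)
    print(df)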

Section 2.2 Distance Measure with Weights


Distance measures presume some degree of commensurability between the different
variables. Thus, it would be effective if each variable were measured using the same
units and hence, each variable would be equally important. However, it is very unlikely
that all variables in a data mining exercise were measured with the same units. Recalling
the SHOES example from an earlier session we had a variable given in miles per week
while another variable was age in years of the respondent. One way to deal with this
incommensurability is to standardize the data by dividing each of the variables by its
sample standard deviation. Standardization is equivalent to setting all variables,
irrespective of the data type, to have the same influence on the overall dissimilarity
between pairs of cases. Although this approach is reasonable and often recommended
from a pure statistical perspective, it can cause problems because variables might not
contribute equally to the notion of dissimilarity between cases. Thus, if we have a notion
of the relative importance of these variables based on domain knowledge, we can weight
them (after standardization) to yield the weighted standardized distance measure.
The weighted standardized similarity measure (Appendix 1) becomes

    d_{ws}(i, j) = \left( \sum_{k=1}^{m} w_k |x'_k(i) - x'_k(j)|^p \right)^{1/p}.        (4)

If the goal of clustering is to discover the natural grouping of the data, some
variables may exhibit more of a grouping tendency than others, and variables that are more
relevant in separating the groups should be assigned a higher weight. If, instead, the goal
of clustering is to segment the data into groups of similar cases, all variables might not
contribute equally to the (problem-dependent) notion of dissimilarity between cases.
The domain expert should play an important role in assigning the weight to each variable.
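A sketch of equation (4) in code (assuming NumPy; the weights shown are arbitrary
illustrative values that a domain expert would normally supply):

    import numpy as np

    def weighted_std_distance(X, i, j, weights, p=2):
        """Equation (4): weighted Minkowski distance on standardized columns of X."""
        X = np.asarray(X, dtype=float)
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize each variable
        diff = np.abs(Xs[i] - Xs[j])
        return (weights * diff ** p).sum() ** (1.0 / p)

    X = [[25, 10, 640], [52, 3, 710], [37, 8, 580]]      # e.g., age, miles/week, score
    w = np.array([1.0, 2.0, 0.5])                        # illustrative weights
    print(weighted_std_distance(X, 0, 1, w))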

Section 3 Clustering Methods


The guaranteed way to find the best set of clusters is to examine all possible clusters.
However, complete consideration is computationally prohibitive, even with the fastest
computers and optimized algorithms. Because of this problem, a wide variety of
heuristic clustering algorithms have emerged that find reasonable clusters without
examining exhaustively all possible clusters.
3.1 Hierarchical Clustering Hierarchical clustering techniques proceed by either a series
of successive merges or a series of successive partitions. Agglomerative hierarchical
clustering methods start with the individual cases. Thus, initially there are as many
clusters as cases. The most similar cases are merged first to form a reduced number of
clusters. This is repeated until a single cluster containing all cases remains.
Let D = {x(1), x(2), ..., x(n)} be n cases and D(Ci, Cj) be the distance measure between
any two clusters Ci and Cj. Then, an agglomerative algorithm for clustering can be
described as follows:
for i = 1 to n let Ci = {x(i)};
while there is more than one cluster left do
    let Ci and Cj be the two clusters minimizing
        the distance D(Ci, Cj) between any two clusters;
    Ci = Ci ∪ Cj;
    remove Cj;
end;
In the Single Linkage method, the distance between two clusters is defined as
D_SL(Ci, Cj) = min{ d(x, y) | x ∈ Ci and y ∈ Cj }. The clusters formed by the single
linkage method will not be affected by the distance measures used if these distance
measures have the same relative ordering. It also has the property that if two pairs of
clusters are equidistant, it does not matter which pair is merged first; the overall result
will be the same. The single linkage method is the only one of these clustering methods
that can find non-ellipsoidal clusters. The tendency of the single linkage method to pick up
long, string-like clusters is known as chaining. This tendency, combined with its sensitivity
to outliers and perturbations of the data, makes it less useful in customer segmentation
applications than the other methods described subsequently.


In the Complete Linkage method, the distance between two clusters is defined as
D_CL(Ci, Cj) = max{ d(x, y) | x ∈ Ci and y ∈ Cj }. In other words, the distance between
two clusters is the distance between the two most dissimilar points (one from each
cluster). The clusters formed by the complete linkage method will not be affected by the
distance measures used if these distance measures have the same relative ordering.
The complete linkage method tends to find clusters of equal size in terms of the volume of
space occupied, making it particularly suitable in customer segmentation applications.
In the Average Linkage method, the distance between two clusters is defined as

    D_{AL}(C_i, C_j) = \frac{1}{n_{C_i} n_{C_j}} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y),

the average of the distances between all pairs of points (one from each cluster). The
clusters formed by the average linkage method will be affected by the distance measures
used even if these distance measures have the same relative ordering. This makes
average linkage unappealing in data mining applications.
Agglomerative clustering only requires a distance matrix to initiate the clustering
procedure. This means that it does not need to store all variable values for each case. If
we can compute the distance between variables, these methods can be applied in
variable clustering as well. We will address this issue in Section 4. The methods
mentioned here have several drawbacks. First, a case stays in the same cluster once it
is assigned to that cluster; reallocation is not allowed in the clustering
process even if a case has been wrongly assigned. Second, these methods
are sensitive to outliers and noise. Thus, we need to try several different clustering
methods and, within each method, try several distance measures. If the outcomes from
all methods are roughly consistent with one another, perhaps a set of good clusters has
been identified. We can also add a small random error to each case before applying a
clustering method to see how stable the clusters are, as in the sketch below.
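A sketch of such a stability check (not from the original notes; it assumes SciPy and
scikit-learn are available, and the data set here is synthetic):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))                      # stand-in for a real data set

    def cluster_labels(data, k=3, method="complete"):
        Z = linkage(data, method=method)               # agglomerative clustering
        return fcluster(Z, t=k, criterion="maxclust")

    labels = cluster_labels(X)
    labels_noisy = cluster_labels(X + rng.normal(scale=0.05, size=X.shape))
    print("agreement:", adjusted_rand_score(labels, labels_noisy))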
Other than linkage methods, there are centroid methods (the distance between two
clusters is the Euclidean distance between their centroids) and Ward's method (the
distance between two clusters is the ANOVA sum of squares between the two clusters
summed over all variables).
Example 1 Single Linkage, Complete Linkage, and Average Linkage
The distances between pairs of five cases are given below.

         1    2    3    4    5
    1    0
    2    9    0
    3    3    7    0
    4    6    5    9    0
    5   11   10    2    8    0


Cluster the five cases using each procedure and draw the dendrograms (tree structures)
and compare the results.
(a) Single linkage hierarchical procedure.
(b) Complete linkage hierarchical procedure.
(c) Average linkage hierarchical procedure.
<Solutions>:
(a) Step 1: Merge case 3 and case 5 since min(d_ij) = d_35 = 2.
    Step 2: d_(35),1 = min(d_31, d_51) = 3
            d_(35),2 = min(d_32, d_52) = 7
            d_(35),4 = min(d_34, d_54) = 8
    Thus, the new distance matrix is

            (35)    1    2    4
      (35)    0
        1     3     0
        2     7     9    0
        4     8     6    5    0

    Step 3: Merge cluster (35) and case 1 since the minimum distance is 3.
    Step 4: The new distance matrix is

            (135)   2    4
      (135)   0
        2     7     0
        4     6     5    0

    Step 5: Merge case 2 and case 4 since the minimum distance is 5.
    Step 6: Merge all cases together; the minimum distance is 6.
(b) Step 1: Merge case 3 and case 5 since min(d_ij) = d_35 = 2.
    Step 2: d_(35),1 = max(d_31, d_51) = 11
            d_(35),2 = max(d_32, d_52) = 10
            d_(35),4 = max(d_34, d_54) = 9
    Thus, the new distance matrix is

            (35)    1    2    4
      (35)    0
        1    11     0
        2    10     9    0
        4     9     6    5    0

    Step 3: Merge case 2 and case 4 since the minimum distance is 5.
    Step 4: The new distance matrix is

            (35)  (24)   1
      (35)    0
      (24)   10     0
        1    11     9    0

    Step 5: Merge cluster (24) and case 1 since the minimum distance is 9.
    Step 6: Merge all cases together; the minimum distance is 11.
(c) Step 1: Merge case 3 and case 5 since min(d_ij) = d_35 = 2.
    Step 2: d_(35),1 = avg(d_31, d_51) = 7
            d_(35),2 = avg(d_32, d_52) = 8.5
            d_(35),4 = avg(d_34, d_54) = 8.5
    Thus, the new distance matrix is

            (35)    1    2    4
      (35)    0
        1     7     0
        2    8.5    9    0
        4    8.5    6    5    0

    Step 3: Merge case 2 and case 4 since the minimum distance is 5.
    Step 4: The new distance matrix is

            (35)  (24)   1
      (35)    0
      (24)   8.5    0
        1     7    7.5   0

    Step 5: Merge cluster (35) and case 1 since the minimum distance is 7.
    Step 6: Merge all cases together; the minimum distance is 8.5.
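The single and complete linkage results in parts (a) and (b) can be cross-checked with
SciPy (a sketch, not part of the original notes; the condensed vector lists the distances
d(1,2), d(1,3), ..., d(4,5) from the matrix above):

    from scipy.cluster.hierarchy import linkage

    # Upper-triangular distances d(1,2), d(1,3), d(1,4), d(1,5), d(2,3), ...
    condensed = [9, 3, 6, 11, 7, 5, 10, 9, 2, 8]

    for method in ("single", "complete"):
        Z = linkage(condensed, method=method)
        print(method)
        print(Z[:, 2])   # merge heights: single -> [2, 3, 5, 6], complete -> [2, 5, 9, 11]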

3.2 Partition Based Clustering In partition based clustering, the task is to partition the
data set into k disjoint clusters of cases such that the cases within each cluster are as
homogeneous as possible. Given a set of n cases D = {x(1), x(2), ..., x(n)}, our task is to
find k clusters C = {C1, C2, ..., Ck} such that each case x(i) is assigned to a unique cluster
Ck. There are many score functions that can be used to measure the quality of a clustering.
The centroid method uses the distance between two cluster centroids to measure the
distance between two clusters. The average method uses the average distance between all
pairs of points (one point from each cluster) to measure the distance between two clusters.
Ward's statistic uses the between-cluster sum of squares to measure the distance between two
clusters. Once the score function is selected, finding the clusters is an optimization process.
Many optimization procedures can be applied. Here, we only introduce the popular K-means
clustering method. The K-means algorithm is intended for situations in which all
variables are of the quantitative type, and squared Euclidean distance is chosen as the
dissimilarity measure.
Let D = {x(1), x(2), ..., x(n)} be n cases; our task is to find K clusters C = {C1, C2, ..., CK}:

Let {r(k): k = 1, 2, ..., K} be K randomly selected points in D (the initial cluster centers);
repeat
    form clusters:
    for k = 1, 2, ..., K do
        Ck = {x ∈ D | d(r(k), x) ≤ d(r(j), x) for all j = 1, 2, ..., K, j ≠ k}
    end;
    compute the new cluster centers:
    for k = 1, 2, ..., K do
        r(k) = the vector mean of the cases in Ck
    end;
until the cluster assignments no longer change;
The time complexity of the K-means algorithm is O(KIn), where I is the number of
iterations. Since K, the number of clusters, is fixed in partition based clustering methods,
the selection of K is critical. If the number of natural clusters is different from K, a
partition based clustering algorithm cannot obtain good results. We will address how
to select the right value of K in Section 5. A minimal implementation sketch follows.
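The sketch below follows the pseudocode with NumPy (illustrative only; in practice one
would typically call a library routine such as scikit-learn's KMeans):

    import numpy as np

    def kmeans(X, k, n_iter=20, seed=0):
        """Plain K-means with squared Euclidean distance, following the pseudocode."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial seeds
        for _ in range(n_iter):
            # assign each case to its nearest center
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # recompute each center as the mean of its assigned cases
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    X = np.vstack([np.random.default_rng(1).normal(m, 0.3, size=(50, 2)) for m in (0, 3, 6)])
    labels, centers = kmeans(X, k=3)
    print(centers.round(2))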

Section 4 Variable Clustering

All methods discussed in Section 3.1 except the average linkage method can be used for
variable clustering if the distance between variables can be computed. The most popular
distance measure between two quantitative variables is the Pearson correlation coefficient
(Appendix 2). For categorical variables, we typically use a measure of association as the
distance between them. It can be shown that the correlation and the association are
equivalent if both variables are binary (see Problem 2 in Appendix 7).
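As an illustration of variable clustering in code (a sketch, with the assumption that
1 - |r| is used as the dissimilarity between variables, which is a common choice; the
data and column names are synthetic):

    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("ABCDE"))
    df["F"] = df["A"] * 0.9 + rng.normal(scale=0.1, size=100)   # F is highly related to A

    corr = df.corr().to_numpy()
    dist = 1.0 - np.abs(corr)                 # dissimilarity: 1 - |Pearson correlation|
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    print(dict(zip(df.columns, fcluster(Z, t=3, criterion="maxclust"))))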

Example 2 Suppose the correlation matrix for eight variables X1, ..., X8 is

          X1      X2      X3      X4      X5      X6      X7      X8
  X1    1.000
  X2     .643   1.000
  X3    -.103   -.348   1.000
  X4    -.820   -.086    .100   1.000
  X5    -.259   -.260    .435    .034   1.000
  X6    -.152   -.010    .028   -.288    .176   1.000
  X7     .045    .211    .115   -.164   -.019   -.374   1.000
  X8    -.013   -.328    .005    .486   -.007   -.561   -.185   1.000

Use single linkage and complete linkage to find the clusters. (In this example the
correlation values themselves are used as the distances between variables, and pairs are
compared by the magnitude of the correlation.)


<Solutions>:
Single Linkage

Step 1: The smallest distance in magnitude is d_38 = 0.005, so merge X3 and X8 into cluster (38).
Step 2: Update the distances to (38), keeping the entry of smaller magnitude:
        d_(38),1 = min(d_31, d_81) = -0.013
        d_(38),2 = min(d_32, d_82) = -0.328
        d_(38),4 = min(d_34, d_84) =  0.100
        d_(38),5 = min(d_35, d_85) = -0.007
        d_(38),6 = min(d_36, d_86) =  0.028
        d_(38),7 = min(d_37, d_87) =  0.115
Step 3: The smallest remaining distance is d_(38),5 = -0.007, so merge X5 with (38) to form (358).
        Updating as before: d_(358),1 = -0.013, d_(358),2 = -0.260, d_(358),4 = 0.034,
        d_(358),6 = 0.028, d_(358),7 = -0.019.
Step 4: The smallest remaining distance is d_26 = -0.010, so merge X2 and X6 into (26).
        Updating: d_(26),(358) = 0.028, d_(26),1 = -0.152, d_(26),4 = -0.086, d_(26),7 = 0.211.
Step 5: The smallest remaining distance is d_(358),1 = -0.013, so merge X1 with (358) to form (1358).
        Updating: d_(1358),(26) = 0.028, d_(1358),4 = 0.034, d_(1358),7 = -0.019.
Step 6: The smallest remaining distance is d_(1358),7 = -0.019, so merge X7 with (1358) to form (13578).
        Updating: d_(13578),(26) = 0.028, d_(13578),4 = 0.034.
Step 7: Merge (13578) and (26) at distance 0.028; finally, X4 joins at distance 0.034.

Complete Linkage

Step 1: As before, the smallest distance in magnitude is d_38 = 0.005, so merge X3 and X8 into (38).
Step 2: Update the distances to (38), keeping the entry of larger magnitude:
        d_(38),1 = max(d_31, d_81) = -0.103
        d_(38),2 = max(d_32, d_82) = -0.348
        d_(38),4 = max(d_34, d_84) =  0.486
        d_(38),5 = max(d_35, d_85) =  0.435
        d_(38),6 = max(d_36, d_86) = -0.561
        d_(38),7 = max(d_37, d_87) = -0.185
Step 3: The smallest remaining distance is d_26 = -0.010, so merge X2 and X6 into (26).
        Updating: d_(26),(38) = -0.561, d_(26),1 = 0.643, d_(26),4 = -0.288,
        d_(26),5 = -0.260, d_(26),7 = -0.374.
Step 4: The smallest remaining distance is d_57 = -0.019, so merge X5 and X7 into (57).
        Updating: d_(57),(38) = 0.435, d_(57),(26) = -0.374, d_(57),1 = -0.259, d_(57),4 = -0.164.
Step 5: The smallest remaining distance is d_(38),1 = -0.103, so merge X1 with (38) to form (138).
        Updating: d_(138),(26) = 0.643, d_(138),(57) = 0.435, d_(138),4 = -0.820.
Step 6: Merge (26) and X4 at d_(26),4 = -0.288 to form (246).
        Updating: d_(246),(138) = -0.820, d_(246),(57) = -0.374.
Step 7: Merge (246) and (57) at -0.374 to form (24567); the final merge joins (138) and
        (24567) at -0.820, so the two-cluster solution is {X1, X3, X8} and {X2, X4, X5, X6, X7}.

Section 5 Practical Considerations

Other than the selection of distance measures between cases and distance measures
between clusters, there are four important practical issues: imputing missing values,
converting qualitative variables to quantitative variables, selecting the initial clusters
(seeds), and deciding the right number of clusters.

- To impute missing values: We can either exclude cases with one or more missing
  variables from the analysis or impute the missing values with the methods discussed in
  the data preparation course.
- To convert to quantitative variables:
    - Ordinal variable: Replace the ordered categories by the numerical values defined
      in equation (3).
    - Nominal variable: Replace each category by one binary dummy variable.
    - Group of related binary variables: Use a method similar to the one used to obtain
      the missing value pattern (MVP) to group the related binary variables first, and then
      treat the MVP as an ordinary variable.
- To select the initial seeds: The initial seeds must be complete cases, that is, cases
  that have no missing values, and they should be chosen to be as far apart as possible.
  They can be selected either at random or with one of the optimization strategies
  recommended by Hastie et al. (2001, page 470).
- To decide the right number of clusters: The choice of the right number of clusters
  depends on the goal of the clustering. Data-driven methods for estimating the
  right number K typically examine the within-cluster dissimilarity W_K
  as a function of the number of clusters K. Separate within-cluster dissimilarity
  measures W_1, W_2, ..., W_{Kmax} are calculated for K ∈ {1, 2, ..., Kmax}. The
  sequence {W_1, W_2, ..., W_{Kmax}} is monotone decreasing, with a sharp drop at the
  optimal number of clusters K*: the successive decreases W_K − W_{K+1} are much larger
  for K < K* than for K ≥ K*. Consequently, an estimate of K* can be obtained by
  identifying a sharp drop in the successive differences W_K − W_{K+1}, or by identifying
  a kink (elbow) in the plot of W_K as a function of K. The gap statistic proposed by
  Tibshirani et al. (2001) is based on this idea. A short sketch of this elbow inspection
  follows.
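A sketch of the elbow inspection using scikit-learn (the inertia_ attribute of KMeans,
its within-cluster sum of squares, plays the role of W_K here; the data are synthetic
with three natural groups):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in ((0, 0), (5, 0), (0, 5))])

    W = {}
    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        W[k] = km.inertia_                      # within-cluster sum of squares, W_K
    for k in range(2, 9):
        print(k, round(W[k - 1] - W[k], 1))     # successive drops; sharp drop near K* = 3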

Section 6 Clustering with Enterprise Miner

The Clustering node uses two SAS procedures, FASTCLUS and CLUSTER. The
FASTCLUS procedure is designed to handle a much larger data set than PROC CLUSTER
can accommodate. The FASTCLUS procedure performs nonhierarchical cluster analysis,
which means that the clusters obtained do not have the tree structure they have in
hierarchical cluster analysis algorithms such as the CLUSTER procedure. In order to obtain
hierarchical clusters for a very large data set, one can use PROC FASTCLUS to find
initial clusters and then use those initial clusters as input to PROC CLUSTER to find
clusters with the final tree structure, as sketched below.
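The same two-stage strategy can be mimicked outside SAS with a short script (a sketch
assuming scikit-learn and SciPy; it is an analogy to, not a reproduction of, the
FASTCLUS/CLUSTER procedures, and the data here are synthetic):

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 4))                   # stand-in for a large data set

    # Stage 1: reduce the data to 50 preliminary cluster centers (FASTCLUS-like step)
    km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

    # Stage 2: hierarchical clustering of the 50 centers (CLUSTER-like step)
    Z = linkage(km.cluster_centers_, method="ward")
    center_labels = fcluster(Z, t=4, criterion="maxclust")

    # Map every original case to the final cluster of its nearest preliminary center
    final_labels = center_labels[km.labels_]
    print(np.bincount(final_labels)[1:])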


By default, the FASTCLUS procedure uses the K-means clustering method discussed in
Section 3.2. K, the number of clusters, can be determined either in advance
by the user or as part of the clustering procedure. By default, the Clustering node
uses the CLUSTER procedure with the Cubic Clustering Criterion (CCC, Appendix 5),
based on a sample of 5000 observations, to estimate the appropriate number of
clusters (between 2 and 50). If you do not want to use the default setting to
choose the number of clusters, you can change the setting of
Specification Method from Automatic to User Specify.

Enterprise Miner has three methods for calculating cluster distances:

Average: the distance between two clusters is the average distance between
pairs of observations, one in each cluster.
Centroid: the distance between two clusters is the Euclidean distance between
their centroids or means.
Ward: the distance between two clusters is the ANOVA sum of squares
between the two clusters added up over all the variables.

K-means clustering is very sensitive to the scale of measurement of the different input
variables. Consequently, it is advisable to use one of the standardization options if
the data have not been standardized. The two standardization methods discussed in
Appendix 1 are available in Enterprise Miner.

The dummy variable representation of nominal variables can be problematic in K-means
clustering since the dummies tend to dominate the analysis. One way to reduce
their domination is to use the rank representation discussed in Section 5.

Five Seed Initialization Methods are available in Clustering node:


First: Select the first k complete cases as the initial seeds

MacQueen: Select the initial seeds based on MacQueen k-means algorithm

Full Replacement: Select initial seeds that are very well separated using a full
replacement algorithm

Princomp: Select evenly spaced seeds along the first principal component

Partial Replacement: Select initial seeds that are well separated using a partial
replacement algorithm

Section 7 Case Study 1: K-Means Clustering with Clustering Node

This case study shows you how to use the popular K-means clustering method in the
Clustering node and how to use the clustering Results browser to identify interesting
patterns. The diagram suggests the steps that will take place in the course of this
case study.

Data Source:
Select PROSPECT from the Lec12 library.
Since the variable Climate is a grouping of the variable Location, we set the
model role of Location to Rejected; Climate supersedes Location.


StatExplore Node: We can use the StatExplore node to perform data exploration.
Impute Node:
Since the amount of missing values is only about 2% of the data, missing value
indicator variables are not very important, so we do not need to create them.

Set the default imputation method for both class and interval variables to Tree

Clustering Node:
Set the Internal Standardization property to Standardization because K-means
clustering is very sensitive to the scale of measurement units of the different input
variables. Use of this option means that all variables have the same influence on
the overall dissimilarity between pairs of cases.

If we want either to put different weights on the standardized variables or to change
the values assigned to each level of an ordinal-scale variable, we can add a SAS Code
node to do so.
Since we do not know the optimal number of clusters, we keep the defaults for
both Number of Clusters and Selection Criterion. This allows Enterprise
Miner to use the CCC to pick the optimal number of clusters (between 2 and 50).


We keep the defaults for all options in Training Options

Since we already imputed all missing values in the Impute node, we keep the
defaults for all Missing Values properties.
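For readers who want to reproduce a similar workflow outside Enterprise Miner, the
following sketch uses scikit-learn; the file name prospect.csv is hypothetical, the column
names follow the PROSPECT data description in Appendix 3, and the imputation choices are
simplifications of what the Impute node does:

    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.cluster import KMeans

    df = pd.read_csv("prospect.csv")                   # assumed export of the PROSPECT data
    num_cols = ["Age", "Income", "FICO", "HomeOwner", "Married"]
    cat_cols = ["Gender", "Climate"]                   # Location rejected; Climate supersedes it

    prep = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), num_cols),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ])

    model = Pipeline([("prep", prep),
                      ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0))])
    df["segment"] = model.fit_predict(df[num_cols + cat_cols])
    print(df.groupby("segment")[num_cols].mean().round(1))   # per-segment means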


Results:

1. Examine the Segment Size pie chart. The observations are divided fairly evenly
between the four segments.

2. Maximize the Segment Plot in the Results window to begin examining the differences
between the clusters.


When you use your cursor to point at a particular section of a graph, information on
that section appears in a pop-up window. Some initial conclusions you might draw
from the segment plot are:
The segments appear to be similar with respect to the variables CLIMATE and
FICO.
The individuals in segment 3 are all homeowners. There are some homeowners in
segment 4, and a few homeowners in segment 2.
Most of the individuals in segment 1 are married, while most of those in segment 4
are unmarried.
Younger individuals are in segment 4.
Most of the individuals in segment 1 are males while most of those in segment 2 are
females.
Income appears to be lower in segment 2.
3. Restore the Segment Plot window to its original size and maximize the Mean
Statistics window.

The window gives descriptive statistics and other information about the clusters such
as the frequency of the cluster, the nearest cluster, and the average value of the input
variables in the cluster.
Scroll to the right to view the statistics for each variable for each cluster. These
statistics confirm some of the conclusions drawn based on the graphs. For example,
the average age of individuals in cluster 4 is approximately 35.5, while the average
ages in clusters 1, 2, and 3 are approximately 49.2, 45.5, and 48.8 respectively.
You can use the Plot feature to graph some of these mean statistics. For example,
create a graph to compare the average income in the clusters.


4. With the Mean Statistics window selected, select View → Graph Wizard, or select
the plot button.


5. Select Bar as the Chart Type and then select Next>.


6. Select Response as the Role for the variable IMP_INCOME.
7. Select Category as the Role for the variable _SEGMENT_.

8. Select Finish.


Another way to examine the clusters is with the cluster profile tree.
9. Select View → Cluster Profile Tree.

You can use the ActiveX feature of the graph to see the statistics for each node and you
can control what is displayed with the Edit menu. The tree lists the percentages and
numbers of cases assigned to each cluster and the threshold values of each input variable
displayed as a hierarchical tree. It enables you to see which input variables are most
effective in grouping cases into clusters.
10. Close the Cluster results window when you have finished exploring the results.
In summary, the four clusters can be described as follows:
    Cluster 1: married males
    Cluster 2: lower-income females
    Cluster 3: married homeowners
    Cluster 4: younger, unmarried people
These clusters may or may not be useful for marketing strategies, depending on the line
of business and planned campaigns.


Section 8 Case Study 2: Clustering with KBM Data Set

This data set is described in Appendix 4. Briefly, it represents descriptive information on


educational institutions ranging from large fully accredited institutions to some schools
for profit.
Data Tab:

Be careful: the measurement levels for many variables have been changed in the
course of the analysis, and some variables are excluded from the clustering analysis after
looking at the clustering results.
The subsequent screen shots are self-explanatory and consistent with how we proceeded
in Case Study 1 (Section 7).


Clusters Tab:

Seeds Tab:

Missing Tab:


Clustering Results:
Selecting Screen Shots:
(1) CCC Plot:

(2) Variable Importance:


(3) Cluster Statistics:

(4) Distance Plot:


(5) Means for Numerical Variables:


[Table: cluster means for the four clusters (cluster IDs 1-4, with frequencies 1016, 1069,
100, and 1383) on the numerical inputs: typical board charge for the academic year,
graduate and undergraduate credit hour activity, corrected fall enrollment count,
generated total for faculty on 9/10 month contracts, degree of urbanization, number of
meals per week in the board charge, percent Black (non-Hispanic), percent American
Indian/Alaskan Native, percent Asian/Pacific Islander, percent Hispanic, combined charge
for room and board, total dormitory capacity, graduate and undergraduate 12-month
unduplicated counts, full-time first-time degree-seeking undergraduates, typical credit
hours for a FTFY undergraduate student, and in-state and out-of-state tuition and fees for
FTFY undergraduates and graduates.]


(6) Cluster Definitions and Interpretation:

Segment 1: Private institutions that do not provide a board and meal plan (1016 institutions)
    - Low degree of urbanization
    - Most institutions do not have a board and meal plan
    - For the private institutions in this segment that do provide a board and meal plan:
      high percentage of graduate students
    - In-state and out-of-state students pay the same tuition
    - Most institutions are not ranked by US News and World Report
Segment 2: Public institutions that do not have a board and meal plan, or non-state
public institutions that have a board and meal plan (1069 institutions)
    - Highest degree of urbanization
    - Most institutions do not have a board and meal plan
    - In-state tuition is much cheaper than out-of-state tuition
    - Most institutions are not ranked by US News and World Report
Segment 3: Historically Black Colleges and Universities (100 institutions)
    - High percentage of African American students and low percentage of other
      minority students
    - Most institutions provide a board and meal plan
    - Low percentage of graduate students
    - High dormitory-to-student ratio
Segment 4: State and private institutions that provide a board and meal plan (1383 institutions)
    - Second highest degree of urbanization
    - More meals provided than at historically Black colleges and universities


Appendix 1 Distance Measure after Standardization

The measure of distance (dissimilarity) typically assumes some degree of
commensurability between variables. It would be effective if the variables were measured
using the same units and were equally important. However, it is very unlikely that all
variables in a data mining exercise were measured in the same units. One way to deal
with this incommensurability is to standardize the data by dividing each variable by its
standard deviation

    \sigma_k = \left( \frac{1}{n} \sum_{i=1}^{n} (x_k(i) - \mu_k)^2 \right)^{1/2},

or

    \hat{\sigma}_k = \left( \frac{1}{n} \sum_{i=1}^{n} (x_k(i) - \bar{x}_k)^2 \right)^{1/2}

if the population mean is unknown. Another way to perform standardization is to divide
each variable by its sample range

    range_k = \max_{all\ i} (x_k(i)) - \min_{all\ i} (x_k(i)).

Then, the similarity measure after standardization becomes

    d_{std}(i, j) = \left( \sum_{k=1}^{m} |x'_k(i) - x'_k(j)|^p \right)^{1/p},

where x'_k(i) = \frac{x_k(i) - \bar{x}_k}{\sigma_k}.


Appendix 2 Covariance and Pearson Correlation Coefficient

The covariance is a measure of how two numerically valued variables X1 and X2 vary
together. Large values of X1 tend to be associated with large values of X2 if the
covariance has a large positive value, and with small values of X2 if the covariance has a
large negative value. Since the covariance depends on the measurement units used in
measuring both X1 and X2, the definition of "large" is problematic. To overcome this
weakness, one can use the correlation instead of the covariance.
Let x(i) = (x_1(i), x_2(i), ..., x_m(i)) and x(j) = (x_1(j), x_2(j), ..., x_m(j)) be any two
objects with m variables (features). The covariance between two variables X_i and X_j is
defined as

    Cov(x_i, x_j) = \frac{1}{n} \sum_{k=1}^{n} (x_i(k) - \bar{x}_i)(x_j(k) - \bar{x}_j),

where \bar{x}_i = \frac{1}{n} \sum_{k=1}^{n} x_i(k) and \bar{x}_j = \frac{1}{n} \sum_{k=1}^{n} x_j(k).
The correlation between two variables X_i and X_j is defined as

    \rho(x_i, x_j) = \frac{\sum_{k=1}^{n} (x_i(k) - \bar{x}_i)(x_j(k) - \bar{x}_j)}
                          {\left[ \sum_{k=1}^{n} (x_i(k) - \bar{x}_i)^2 \; \sum_{k=1}^{n} (x_j(k) - \bar{x}_j)^2 \right]^{1/2}}.

Appendix 3 Data Used in Section 7

The data set PROSPECT has 5,055 observations and 9 variables from a catalog
company. The company periodically purchases demographic information from outside
sources. It wants to use this data set to design a test mail campaign to learn the
preferences of its potential customers for several of its new products. Based on its
experience, the company knows that customer preference for its products depends on
several geographic and demographic variables, so it wants to segment its customers with
respect to these variables. After the potential customers have been segmented, a random
sample of prospective customers within each segment will be mailed one or several
offers. The results of the test mail campaign can provide the company with an estimate
of its potential profits for these new products. The output from PROC CONTENTS is as
follows:
Alphabetic List of Variables and Attributes

  Variable    Type   Format    Informat   Label
  Age         Num    BEST12.   F12.
  Climate     Char   $F2.      $F2.       Climate Code for Residence
  FICO        Num    BEST12.   F12.       Credit Score
  Gender      Char   $F1.      $F1.
  HomeOwner   Num    BEST12.   F12.
  ID          Char   $F9.      $F9.       Identification Code
  Income      Num    BEST12.   F12.       Income ($K)
  Location    Char   $F1.      $F1.       Location Code for Residence
  Married     Num    BEST12.   F12.


Appendix 4 Data Used in Section 8

This data set is part of IC98_HD from IPEDS (Integrated Postsecondary Education Data
System). Interested students can check IPEDS web site to find out more about this data
set.
  #  Variable  Type  Len  Pos  Format     Label
 18  ACCRD1    Num   8    136  YESNOA.    National or specialized accrediting
 19  ACCRD2    Num   8    144  YESNOA.    Regional accrediting agency
 20  ACCRD3    Num   8    152  YESNOA.    State accrediting or approval agency
  3  AFFIL     Num   8    16   PRIFMTA.   Affiliation of institution
 33  BOARD     Num   8    256  YESNOA.    Institution provides board or meal plan
 36  BOARDAMT  Num   8    280             Typical board charge for academic year
 22  CALSYS    Num   8    168             Calendar system
 43  CDACTGA   Num   8    336             Graduate credit hour activity
 42  CDACTUA   Num   8    328             Undergraduate credit hour activity
 38  ENROLMNT  Num   8    296             Corrected fall enrollment count
 44  GSAA154   Num   8    344             Generated total for faculty 9/10 month contract
 10  HBCU      Num   8    72   YESNOB.    Historically Black College or University
  4  HLOFFER   Num   8    24   HLOFFERF.  Highest level of offering
  1  ID        Num   8    0
 11  LOCALE    Num   8    80              Degree of Urbanization
 35  MEALSVRY  Num   8    272  FIX.       Number meals/wk/BORDAMT/ROOMAMT
 34  MEALSWK   Num   8    264             Number of meals per week in board charge
 24  MIL1INSL  Num   8    184  YESNOA.    MILI in states and/or territories
 25  MIL2INSL  Num   8    192  YESNOA.    MILI at military installations abroad
 23  MILI      Num   8    176  YESNOA.    Courses at military installations
  6  PCTMIN1   Num   8    40              Percent Black, non-Hispanic
  7  PCTMIN2   Num   8    48              Percent American Indian/Alaskan Native
  8  PCTMIN3   Num   8    56              Percent Asian/Pacific Islander
  9  PCTMIN4   Num   8    64              Percent Hispanic
 12  PEO1ISTR  Num   8    88   YESNOB.    Occupational
 13  PEO4ISTR  Num   8    96   YESNOB.    Recreational or avocational
 14  PEO5ISTR  Num   8    104  YESNOB.    Adult basic remedial or HS equivalent
 26  PG300     Num   8    200  YESNOA.    Programs at least 300 contact hrs.
 17  PRIVATE   Num   8    128  PRIFMT.    Private control
 15  PUBLIC1   Num   8    112  YESNOA.    Federal
 16  PUBLIC2   Num   8    120  YESNOA.    State
 37  RMBRDAMT  Num   8    288             Combined charge for room and board
 32  ROOMCAP   Num   8    248             Total dormitory capacity
 21  SACCR     Num   8    160  YESNOA.    Accrd by US Dept Ed recognized agency
  2  SECTOR    Num   8    8    SECTORF.   Sector of institution
 41  TOSTUCG   Num   8    320             Graduate unduplicated count in 12-month
 39  TOSTUCU   Num   8    304             UG 12-month unduplicated count
 40  TOSTUFR   Num   8    312             FT 1st time degree seek UG
 29  TPUGCRED  Num   8    224             Typical # of crd. Hrs. FTFY UG student
 27  TUITION2  Num   8    208             Tuition & fees FTFY UG in-state
 28  TUITION3  Num   8    216             Tuition & fees FTFY UG out-of-state
 30  TUITION6  Num   8    232             Tuition & fees FTFY Grad in-state
 31  TUITION7  Num   8    240             Tuition & fees FTFY Grad out-of-state
  5  UGOFFER   Num   8    32   YESNOA.    Undergraduate offering
 45  USTIER    Num   8    352             US News and World Report Rating


Appendix 5 Cubic Clustering Criterion

The best way to use the CCC is to plot its value against the number of clusters, ranging
from one cluster up to about one-tenth the number of observations. The CCC may not
behave well if the average number of observations per cluster is less than ten. The
following guidelines should be used for interpreting the CCC:

Peaks on the plot with the CCC greater than 2 or 3 indicate good clusterings.

Peaks with the CCC between 0 and 2 indicate possible clusters but should be
interpreted cautiously.

There may be several peaks if the data has a hierarchical structure.

Very distinct nonhierarchical spherical clusters usually show a sharp rise before
the peak followed by a gradual decline.

Very distinct nonhierarchical elliptical clusters often show a sharp rise to the
correct number of clusters followed by a further gradual increase and eventually a
gradual decline.

If all values of the CCC are negative and decreasing for two or more clusters, the
distribution is probably unimodal or long-tailed.

Very negative values of the CCC, say, -30, may be due to outliers. Outliers
generally should be removed before clustering and their removal documented.

If the CCC increases continually as the number of clusters increases, the


distribution may be grainy or the data may have been excessively rounded or
recorded with just a few digits.

A final and very important warning: neither the CCC nor R2 is an appropriate criterion
for clusters that are highly elongated or irregularly shaped. If you do not have prior
substantive reasons for expecting compact clusters, use a nonparametric clustering
method such as Wong and Lane's (1983) rather than Ward's method or k-means
clustering.


Appendix 6 References

David Hand, Heikki Mannila, and Padhraic Smyth (2001), Chapter 9 of Principles of
Data Mining, MIT Press: Cambridge, Massachusetts.
Michael J. A. Berry and Gordon S. Linoff (2000), Chapter 5 of Mastering Data Mining,
John Wiley & Sons, Inc.: New York, New York.
Richard A. Johnson and Dean W. Wichern (1982), Chapter 11 of Applied Multivariate
Statistical Analysis, Prentice-Hall, Inc.: Englewood Cliffs, New Jersey.
Rud, O. P. (2001), Data Mining Cookbook, John Wiley & Sons, Inc.: New York, New York.
Hastie, T., Tibshirani, R., and Friedman, J. (2001), Chapter 14 of The Elements of
Statistical Learning, Springer: New York.
Tibshirani, R., Walther, G., and Hastie, T. (2001), "Estimating the Number of Clusters in a
Data Set via the Gap Statistic," Journal of the Royal Statistical Society, Series B.
Wong, M. A. and Lane, T. (1983), "A kth Nearest Neighbor Clustering Procedure,"
Journal of the Royal Statistical Society, Series B, 45, 362-368.


Appendix 7 Exercises
Problem 1 Suppose x(i) = (x_1(i), x_2(i), ..., x_m(i)) and x(j) = (x_1(j), x_2(j), ..., x_m(j))
are any two objects with m variables (features). Let \bar{x}_i = \sum_{k=1}^{m} x_k(i)/m and
\bar{x}_j = \sum_{k=1}^{m} x_k(j)/m be the averages over variables for objects i and j,
respectively. Also, let

    s_i = \left( \frac{\sum_{k=1}^{m} (x_k(i) - \bar{x}_i)^2}{m - 1} \right)^{1/2}
    and
    s_j = \left( \frac{\sum_{k=1}^{m} (x_k(j) - \bar{x}_j)^2}{m - 1} \right)^{1/2}

be the standard deviations over variables for objects i and j, respectively. Show that

    \sum_{k=1}^{m} (w_k(i) - w_k(j))^2 = 2\left(1 - \rho(w(i), w(j))\right)

if we first standardize all inputs, i.e.,

    w_k(i) = \frac{x_k(i) - \bar{x}_i}{s_i}
    and
    w_k(j) = \frac{x_k(j) - \bar{x}_j}{s_j}.

Problem 2 Show that the sample correlation coefficient, r, can be written as

    r = \frac{ad - bc}{\left[ (a + b)(a + c)(b + d)(c + d) \right]^{1/2}}

for two binary variables with the contingency table

             0    1
        0    a    c
        1    b    d
