Amrender Kumar
I.A.S.R.I., Library Avenue, New Delhi – 110 012
akjha@iasri.res.in
Cluster analysis encompasses many diverse techniques for discovering structure within
complex bodies of data. In a typical example, one has a sample of data units (subjects, persons,
cases), each described by scores on selected variables (attributes, characteristics,
measurements). The objective is to group either the data units or the variables into clusters such
that elements within a cluster have a high degree of "natural association" among themselves
while the clusters are "relatively distinct" from one another. Searching the data for a "natural"
grouping structure is an important exploratory technique. The two most important techniques
for data classification are
(1) Cluster analysis
(2) Discriminant analysis
Although both cluster analysis and discriminant analysis classify objects into categories,
discriminant analysis requires group membership to be known for the cases used to derive the
classification rule, whereas in cluster analysis group membership is unknown for all cases. In
addition to membership, the number of groups is also generally unknown. Cluster analysis is
the more primitive technique in that no assumptions are made concerning the number of groups
or the group structure. Grouping is done on the basis of similarities or distances (dissimilarities).
Thus, in cluster analysis the inputs are similarity measures, or the data from which
these can be computed. The definition of similarity or homogeneity varies from analysis to
analysis and depends on the objective of the study.
Cluster analysis is a technique used for combining observations into groups such that:
(a) Each group is homogeneous or compact with respect to certain characteristics, i.e.,
observations in each group are similar to each other.
(b) Each group is different from the other groups with respect to the same characteristics,
i.e., observations of one group differ from the observations of the other groups.
The need for cluster analysis arises naturally in many fields, such as the life sciences,
medicine, engineering, agriculture and the social sciences. In biology, cluster analysis is used to
identify diseases and their stages. For example, by examining patients who are diagnosed as
depressed, one may find several distinct sub-groups of patients with different types
of depression. In marketing, cluster analysis is used to identify persons with similar buying
habits; by examining their characteristics it becomes possible to plan future marketing
strategies more efficiently.
Cluster Analysis
Distance: d_ij = sqrt( Σ_k ( X_ik − X_jk )² )   (Euclidean distance)

where the X's are standardised scores and k runs over the variables; the distance can also be
computed for a single variable.
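As a small illustrative sketch (hypothetical standardised scores; NumPy assumed), the Euclidean distance is just the square root of the summed squared differences:

```python
import numpy as np

# Hypothetical standardised scores for two data units on three variables
x_i = np.array([1.2, -0.5, 0.3])
x_j = np.array([0.2,  0.5, -0.7])

# Euclidean distance: sqrt of the sum of squared coordinate differences
d_ij = np.sqrt(np.sum((x_i - x_j) ** 2))
print(round(float(d_ij), 4))  # differences are (1, -1, 1), so this is sqrt(3) ≈ 1.7321
```

Standardising the variables first keeps one variable with a large scale (e.g. calories) from dominating the distance.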
A number of different rules or methods have been suggested for computing the distance between
two clusters. In fact, the various hierarchical clustering algorithms differ mainly
in how the distance between two clusters is computed. Some of the popular
methods are:
• Single linkage – works on the principle of the smallest distance, or nearest
neighbour.
• Complete linkage – works on the principle of the most distant (farthest) neighbour,
i.e., the largest dissimilarity.
• Average linkage – works on the principle of average distance (the average of the
distances between each unit of one cluster and each unit of the other cluster).
• Centroid – this method assigns each item to the cluster having the nearest centroid
(mean). The process has three steps:
1. Partition the items into k initial clusters.
2. Proceed through the list of items, assigning each item to the cluster
whose centroid (mean) is nearest; recalculate the centroid (mean)
of the cluster receiving the item and of the cluster losing the
item.
3. Repeat step 2 until no more reassignments take place.
• Ward's – forms clusters by maximising within-cluster homogeneity; the within-group
sum of squares is used as the measure of homogeneity.
• Two-stage density linkage – units are first assigned to modal entities on the basis of
densities (frequencies) estimated from the kth nearest neighbours; modal
entities are then allowed to join later on.
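As an illustrative sketch of how these methods differ only in the inter-cluster distance rule (hypothetical two-variable data; SciPy assumed, whose `scipy.cluster.hierarchy.linkage` implements single, complete, average, centroid and Ward linkage):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical standardised data: six units forming two well-separated groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
    print(method, labels.tolist())
```

On clearly separated data such as this, all five rules recover the same two groups; they diverge mainly on elongated or chained clusters.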
The TREE procedure produces a tree diagram, also known as a dendrogram, from a data set
created by the CLUSTER or VARCLUS procedure; it can also create output data sets giving
the results of hierarchical clustering as a tree structure. The TREE procedure uses these
output data sets to print the diagram. The following terminology relates to the TREE
procedure:
Leaves          The objects that are clustered
Root            The cluster containing all the objects
Branch          A cluster containing at least two of the objects but not all of them
Node            A general term for leaves, branches and the root
Parent & Child  If cluster A is the union of clusters B and C, then A is the parent
                and B and C are the children
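To make this terminology concrete, here is a small sketch (SciPy assumed; PROC TREE itself is a SAS procedure): four leaves merge into two branches, which then merge into the root.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Four objects (leaves) on one variable, forming two obvious pairs
X = np.array([[0.0], [0.5], [4.0], [4.5]])
Z = linkage(X, method="average")   # 3 merges: two branches, then the root

# no_plot=True returns the tree layout instead of drawing it
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])                 # leaf labels in plotting order
```

Each row of Z records one parent node and its two children; the last row is the root.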
Specifications
The TREE procedure is invoked by the following statements:
If the data sets have been created by CLUSTER or VARCLUS, the only required statement is
PROC TREE. The other, optional, statements listed above are described after the
PROC TREE statement.
Example: Given below are food nutrient data on calories, protein, fat, calcium and iron. The
objective of the study is to identify suitable clusters of the foods based on the five
variables.
;
proc cluster method=centroid rmsstd rsquare
nonorm out=tree;
id food;
var Calories Protein Fat Calcium Iron;
run;
id food;
copy Calories Protein Fat Calcium Iron;
proc sort;
by cluster;
proc print;
by cluster;
var food Calories Protein Fat Calcium Iron;
run;
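For readers without SAS, a rough Python analogue of this program (centroid hierarchical clustering on the raw values, as with NONORM, then listing the items by cluster) might look like the following sketch; the nutrient values for the four foods are taken from the printed data, and SciPy is assumed:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Calories, Protein, Fat, Calcium, Iron for four of the foods
foods = ["food 1", "food 3", "food 17", "food 18"]
X = np.array([
    [340, 20, 28,  9, 2.6],
    [420, 15, 39,  7, 2.0],
    [ 70, 11,  1, 82, 6.0],
    [ 45,  7,  1, 74, 5.4],
])

# method="centroid" mirrors METHOD=CENTROID; the data are used as-is (NONORM)
Z = linkage(X, method="centroid")
clusters = fcluster(Z, t=2, criterion="maxclust")

# Group the foods by cluster, like PROC SORT / PROC PRINT BY CLUSTER
for c in sorted(set(clusters.tolist())):
    print(c, [f for f, k in zip(foods, clusters) if k == c])
```

The high-calorie meats (foods 1 and 3) and the high-calcium items (foods 17 and 18) fall into separate clusters, matching the grouping visible in the dendrogram.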
Output From SAS
The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis
Variable Mean Std Dev Skewness Kurtosis Bimodality
Obs  food  Calories  Protein  Fat  Calcium  Iron
1 1 340 20 28 9 2.6
2 11 340 20 28 9 2.5
3 12 340 19 29 9 2.5
4 13 355 19 30 9 2.4
5 2 245 21 17 9 2.7
6 9 265 20 20 9 2.6
7 4 375 19 32 9 2.6
8 10 300 18 25 9 2.3
9 3 420 15 39 7 2.0
10 7 170 25 7 12 1.5
11 26 170 25 7 7 1.2
12 14 205 18 14 7 2.5
13 21 200 19 13 5 1.0
14 5 180 22 10 17 3.7
15 15 185 23 9 9 2.7
16 23 195 16 11 14 1.3
17 16 135 22 4 25 0.6
18 20 135 16 5 15 0.5
19 8 160 26 5 14 5.9
20 6 115 20 3 8 1.4
21 17 70 11 1 82 6.0
22 18 45 7 1 74 5.4
23 22 155 16 9 157 1.8
24 24 120 17 5 159 0.7
25 19 90 14 2 38 0.8
26 27 170 23 1 98 2.6
Dendrogram of the foods (distance on the vertical axis, 0 to 400; Food on the horizontal
axis), with the leaves in the order 1 11 12 13 4 3 2 9 10 5 15 7 26 8 14 21 23 6 16 20 19
17 18 22 24 27 25.
In SPSS, cluster analysis is reached through the menu path Analyze → Classify, which
offers K-Means Cluster, Hierarchical Cluster and Discriminant.