
CLUSTER ANALYSIS

Amrender Kumar
I.A.S.R.I., Library Avenue, New Delhi – 110 012
akjha@iasri.res.in

Cluster analysis encompasses many diverse techniques for discovering structure within
complex bodies of data. In a typical example, one has a sample of data units (subjects, persons,
cases), each described by scores on selected variables (attributes, characteristics,
measurements). The objective is to group either the data units or the variables into clusters such
that elements within a cluster have a high degree of "natural association" among themselves
while the clusters are relatively distinct from one another. Searching the data for a structure of
"natural" groupings is an important exploratory technique. The most important techniques for
data classification are
(1) Cluster analysis
(2) Discriminant analysis

Although both cluster and discriminant analysis classify objects into categories, discriminant
analysis requires group membership to be known for the cases used to derive the
classification rule, whereas in cluster analysis group membership is unknown for all cases. In
addition to membership, the number of groups is also generally unknown. Cluster analysis is
the more primitive technique in that no assumptions are made concerning the number of groups
or the group structure. Grouping is done on the basis of similarities or distances (dissimilarities).
Thus, in cluster analysis the inputs are similarity measures or the data from which
these can be computed. The definition of similarity or homogeneity varies from analysis to
analysis and depends on the objective of the study.

Cluster analysis is a technique used for combining observations into groups such that:
(a) Each group is homogeneous or compact with respect to certain characteristics, i.e.,
observations in each group are similar to each other.
(b) Each group is different from the other groups with respect to those characteristics,
i.e., observations of one group differ from the observations of the other groups.

The need for cluster analysis arises naturally in many fields such as the life sciences,
medicine, engineering, agriculture and the social sciences. In biology, cluster analysis is used to
identify diseases and their stages. For example, by examining patients who are diagnosed as
depressed, one may find that there are several distinct sub-groups of patients with different types
of depression. In marketing, cluster analysis is used to identify persons with similar buying
habits; by examining their characteristics it becomes possible to plan future marketing
strategies more efficiently.

Steps in Cluster analysis


The objective of cluster analysis is to group observations into clusters such that each cluster is
as homogeneous as possible with respect to the clustering variables. The various steps in
cluster analysis are:
(i) Select a measure of similarity.
(ii) Decide on the type of clustering technique to be used.
(iii) Select the clustering method for the chosen technique.
(iv) Decide on the number of clusters.
(v) Interpret the cluster solution.
No single generalisation about cluster analysis is possible, as a vast number of clustering
methods have been developed in several different fields, with different definitions of clusters
and similarity. There are many kinds of clusters, namely:
• Disjoint clusters, where every object appears in a single cluster.
• Hierarchical clusters, where one cluster can be completely contained in another
cluster, but no other kind of overlap is permitted.
• Overlapping clusters.
• Fuzzy clusters, defined by a probability of membership of each object in each
cluster.
Similarity Measures
A measure of closeness is required to form simple group structures from complex data sets. A
great deal of subjectivity is involved in the choice of similarity measure. Important
considerations are the nature of the variables (discrete, continuous or binary), the scale of
measurement (nominal, ordinal, interval, ratio, etc.) and subject-matter knowledge. If items
are to be clustered, proximity is usually indicated by some sort of distance; variables,
however, are grouped on the basis of some measure of association, such as the correlation
coefficient. Some of these measures are given below.
Qualitative Variables
Consider k binary variables observed on each of n units. For a pair of units i and j, the k
responses can be cross-classified as:

                               jth unit
                       Yes          No           Total
 ith unit   Yes        K11          K12          K11+K12
            No         K21          K22          K21+K22
 Total                 K11+K21      K12+K22      K

Simple matching coefficient (% matches): dij = (K11 + K22)/K, i, j = 1, 2, ..., n.

This can easily be generalised to polytomous responses.
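As a small illustration, the coefficient can be computed directly in a SAS DATA step. The sketch below is not part of the original example; the two response vectors and the variable names are hypothetical.

/* Hypothetical sketch: simple matching coefficient between an ith and a jth
   unit observed on k = 5 binary (yes = 1 / no = 0) variables */
data matching;
   array yi{5} yi1-yi5 (1 0 1 1 0);        /* responses of the ith unit */
   array yj{5} yj1-yj5 (1 0 0 1 1);        /* responses of the jth unit */
   matches = 0;
   do v = 1 to 5;
      if yi{v} = yj{v} then matches + 1;   /* agreements: K11 + K22 */
   end;
   d_ij = matches / 5;                     /* (K11 + K22) / K */
   put 'Simple matching coefficient: ' d_ij=;
run;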


Quantitative Variables
In the case of k quantitative variables recorded on n cases, the observations can be expressed
as
X11   X12   X13   ...   X1k
X21   X22   X23   ...   X2k
 .     .     .           .
 .     .     .           .
Xn1   Xn2   Xn3   ...   Xnk


Similarity: rij (i, j = 1, 2, ..., n), the correlation between the observations Xik and Xjk
computed over the k variables (this is not the same as the correlation between variables).

Distance: dij = [ Σk (Xik − Xjk)² ]^(1/2), the Euclidean distance between units i and j.
The X's are usually standardised before the distances are computed, and the distance can be
calculated even when only one variable is observed.
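In SAS, such distances can be obtained, for instance, by standardising the variables with PROC STDIZE and then computing pairwise Euclidean distances with PROC DISTANCE. The sketch below uses the food-nutrient data set of the later example; the options shown are stated as assumptions to be checked against the SAS documentation.

proc stdize data = test method = std out = test_std;   /* mean 0, std dev 1 */
   var Calories Protein Fat Calcium Iron;
run;

proc distance data = test_std method = euclid out = dist;
   var interval(Calories Protein Fat Calcium Iron);     /* interval-scaled variables */
   id Food;
run;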

Hierarchical Clustering Techniques


Hierarchical clustering techniques proceed by either a series of successive mergers or a series
of successive divisions. Consider the natural (agglomerative) process of grouping; a short SAS
sketch tracing this process follows the list.
• Each unit is an entity to start with.
• Merge first the two units that are most similar (smallest dij); they now become a
single entity.
• Examine the mutual distances between the resulting (n − 1) entities.
• Merge the two that are most similar.
• Repeat the process, merging at each step, until all units are merged into one entity.
• At each stage of the agglomerative process, note the distance between the two
merging entities.
• Choose the stage that shows a sudden jump in this distance, since the jump indicates
that two very dissimilar entities are being merged. This choice can be subjective.
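This merge-and-inspect process can be traced in SAS by saving the merge history and examining the height (distance) at which each merger takes place. A minimal sketch, assuming the food-nutrient variables of the later example:

/* _NCL_ is the number of clusters remaining after a merge and _HEIGHT_ the
   distance at which the two entities were joined; a sudden jump in _HEIGHT_
   suggests a stopping point */
proc cluster data = test method = centroid outtree = merges noprint;
   var Calories Protein Fat Calcium Iron;
   id Food;
run;

proc print data = merges;
   var _ncl_ _height_;
run;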

A number of different rules or methods have been suggested for computing the distance between
two clusters. In fact, the various hierarchical clustering algorithms differ mainly in how the
distance between two clusters is computed. Some of the popular methods are listed below,
followed by a short SAS sketch:
• Single linkage – works on the principle of the smallest distance, i.e., the nearest
neighbour.
• Complete linkage – works on the principle of the largest distance, i.e., the farthest
neighbour.
• Average linkage – works on the principle of average distance (the average of the
distances between every unit of one entity and every unit of the second entity).
• Centroid – the distance between two clusters is the distance between their centroids
(means), and at each stage the two clusters with the nearest centroids are merged. A
closely related non-hierarchical procedure assigns each item to the cluster having the
nearest centroid; it has three steps:
   - Partition the items into k initial clusters.
   - Proceed through the list of items, assigning each item to the cluster whose
     centroid (mean) is nearest, and recalculate the centroids of the cluster
     receiving the new item and the cluster losing it.
   - Repeat the previous step until no more reassignments take place.
• Ward's method – forms clusters by maximising within-cluster homogeneity; the
within-group sum of squares is used as the measure of homogeneity, and at each stage
the two clusters whose merger gives the smallest increase in this sum of squares are
joined.
• Two-stage density linkage –
   - units are first assigned to modal entities on the basis of density estimates
     (kth-nearest-neighbour frequencies);
   - the modal entities are allowed to join later on.
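In SAS the rule is selected through the METHOD= option of PROC CLUSTER. A sketch, again using the food-nutrient variables of the later example; the same call can be re-run with method = single, complete, average, centroid or ward to compare the resulting hierarchies (two-stage density linkage additionally needs a density option such as K=).

/* Ward's minimum-variance clustering; change METHOD= to try other linkage rules */
proc cluster data = test method = ward outtree = tree_ward;
   var Calories Protein Fat Calcium Iron;
   id Food;
run;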


Non-hierarchical Clustering Techniques


In non-hierarchical clustering the data are divided into k partitions or groups, each partition
representing a cluster. Therefore, as opposed to hierarchical clustering, the number of clusters
must be known a priori. Non-hierarchical clustering techniques basically follow these steps
(a SAS sketch is given after the list):
1. Select k initial cluster centroids or seeds, where k is the number of clusters desired.
2. Assign each observation to the cluster to which it is closest.
3. Reassign or re-allocate each observation to one of the k clusters according to a pre-
determined stopping rule.
4. Stop if there is no re-allocation of data points or if the reassignment satisfies the
criterion set by the stopping rule; otherwise go to step 2.
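A non-hierarchical (k-means type) solution is available in SAS through PROC FASTCLUS. A minimal sketch, assuming the food-nutrient data of the example below and k = 3 clusters:

/* k = 3 clusters specified a priori; the OUT= data set contains the cluster
   membership (CLUSTER) and the distance of each observation from its centroid */
proc fastclus data = test maxclusters = 3 maxiter = 100 out = kmout;
   var Calories Protein Fat Calcium Iron;
   id Food;
run;

proc print data = kmout;
   var Food cluster distance;
run;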

SAS Cluster Procedures


The SAS procedures for clustering are oriented towards finding disjoint or hierarchical clusters
from coordinate data, distances, or a correlation or covariance matrix. The following procedures
are used for clustering (a short VARCLUS sketch is given after the list):
CLUSTER    Performs hierarchical clustering of observations.
FASTCLUS   Finds disjoint clusters of observations using a k-means method applied to
           coordinate data; recommended for large data sets.
VARCLUS    Performs hierarchical as well as non-hierarchical (disjoint) clustering of variables.
TREE       Draws tree diagrams, or dendrograms, using the output from the CLUSTER or
           VARCLUS procedure.
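For completeness, a sketch of PROC VARCLUS, which clusters the variables rather than the observations; the options shown (MAXCLUSTERS=, OUTTREE=, HORIZONTAL) are stated as assumptions to verify against the SAS documentation.

/* divisive clustering of the five nutrient variables into at most two groups */
proc varclus data = test maxclusters = 2 outtree = vtree;
   var Calories Protein Fat Calcium Iron;
run;

proc tree data = vtree horizontal;   /* dendrogram of the variable clusters */
run;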

The TREE procedure is considered very important because it produces a tree diagram, also
known as a dendrogram, from a data set created by the CLUSTER or VARCLUS procedure,
and it can also create output data sets giving the results of hierarchical clustering as a tree
structure; a sketch illustrating this structure is given after the terminology list below. The
following terminology is used with the TREE procedure:
Leaves          Objects that are clustered.
Root            The cluster containing all the objects.
Branch          A cluster containing at least two objects but not all of them.
Node            A general term for leaves, branches and the root.
Parent & Child  If A is the union of clusters B and C, then A is the parent and B and C
                are the children.
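The parent-child structure that PROC TREE reads is stored in the _NAME_ and _PARENT_ variables of the tree data set created by PROC CLUSTER or VARCLUS. A sketch, assuming the tree data set produced in the example below:

/* each observation is a node: the leaves are the original food items and
   _PARENT_ names the cluster (e.g. CL26) into which the node is merged */
proc print data = tree;
   var _name_ _parent_ _ncl_ _freq_ _height_;
run;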
Specifications
The TREE procedure is invoked by the following statement:

PROC TREE <options>;

Optional statements:
NAME variable
HEIGHT variable
PARENT variable
BY variables
COPY variables
FREQ variable
ID variable

If the tree data set has been created by CLUSTER or VARCLUS, the only required statement is
PROC TREE. The optional statements listed above, if needed, are specified after the
PROC TREE statement.

Example: Given below are food nutrient data on calories, protein, fat, calcium and iron for 27
food items. The objective of the study is to identify suitable clusters of the food items based on
these five variables.

Food Items Calories Protein Fat Calcium Iron


1 340 20 28 9 2.6
2 245 21 17 9 2.7
3 420 15 39 7 2
4 375 19 32 9 2.6
5 180 22 10 17 3.7
6 115 20 3 8 1.4
7 170 25 7 12 1.5
8 160 26 5 14 5.9
9 265 20 20 9 2.6
10 300 18 25 9 2.3
11 340 20 28 9 2.5
12 340 19 29 9 2.5
13 355 19 30 9 2.4
14 205 18 14 7 2.5
15 185 23 9 9 2.7
16 135 22 4 25 0.6
17 70 11 1 82 6
18 45 7 1 74 5.4
19 90 14 2 38 0.8
20 135 16 5 15 0.5
21 200 19 13 5 1
22 155 16 9 157 1.8
23 195 16 11 14 1.3
24 120 17 5 159 0.7
25 180 22 9 367 2.5
26 170 25 7 7 1.2
27 170 23 1 98 2.6
Data test;
   Input Food Calories Protein Fat Calcium Iron;
   /* the 27 data lines from the table above are entered after the Cards statement */
Cards;
;
proc cluster data = test method = centroid rmsstd rsquare nonorm outtree = tree;
   id Food;
   var Calories Protein Fat Calcium Iron;
run;


proc tree data = tree out = clus3 nclusters = 3;

/* Cut the dendrogram into three clusters: in the cluster history below, the
   distance at which clusters are joined jumps sharply once fewer than three
   clusters remain (from 111.66 to 188.52), suggesting a three-cluster solution */

   id Food;
   copy Calories Protein Fat Calcium Iron;
run;
proc sort data = clus3;
   by cluster;
run;
proc print data = clus3;
   by cluster;
   var Food Calories Protein Fat Calcium Iron;
run;
Output From SAS
The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis
Variable     Mean       Std Dev    Skewness   Kurtosis   Bimodality

Calories     209.6      99.6332     0.5320     -0.6024     0.4619
Protein       19.0000    4.2517    -0.8237      1.3274     0.3565
Fat           13.4815   11.2570     0.7900     -0.6237     0.5892
Calcium       43.9630   78.0343     3.1590     11.3445     0.7456
Iron           2.3815    1.4613     1.2298      1.4689     0.5182

Root-Mean-Square Total-Sample Standard Deviation = 56.85606


Cluster History
NCL  ----Clusters Joined----  FREQ  RMS STD  SPRSQ  RSQ  Cent Dist

26 1 11 2 0.0316 0.0000 1.00 0.1
25 CL26 12 3 0.3661 0.0000 1.00 1.4151
24 7 26 2 1.5840 0.0000 1.00 5.009
23 14 21 2 1.8235 0.0000 1.00 5.7663
22 5 15 2 3.0332 0.0001 1.00 9.5917
21 CL23 23 3 3.2444 0.0002 1.00 11.531
20 16 20 2 3.7015 0.0002 .999 11.705
19 CL24 8 3 3.3143 0.0002 .999 12.081
18 CL25 13 4 3.3914 0.0004 .999 15.108
17 CL22 CL19 5 4.9157 0.0008 .998 16.519
16 2 9 2 6.4032 0.0005 .998 20.249
15 6 CL20 3 6.5865 0.0009 .997 23.409
14 17 18 2 8.3986 0.0008 .996 26.559
13 CL17 CL21 8 7.7564 0.0036 .992 28.445
12 CL18 4 5 6.9370 0.0019 .990 31.423
11 22 24 2 11.1679 0.0015 .989 35.316
10 CL15 19 4 11.3232 0.0035 .985 44.563
9 CL16 10 3 12.5993 0.0033 .982 45.537
8 CL13 CL10 12 16.8070 0.0274 .955 65.69
7 CL11 27 3 19.4452 0.0075 .947 68.821
6 CL12 3 6 14.3419 0.0099 .937 70.822
5 CL6 CL9 9 24.3675 0.0405 .897 92.253
4 CL14 CL7 5 30.4181 0.0342 .862 109.44
3 CL8 CL4 17 31.2342 0.1047 .758 111.66
2 CL5 CL3 26 49.8743 0.4977 .260 188.52
1 CL2 25 27 56.8561 0.2601 .000 336.92


------------------------------------------ CLUSTER=1 ------------------------------------

Obs Food Calories Protein Fat Calcium Iron

1 1 340 20 28 9 2.6
2 11 340 20 28 9 2.5
3 12 340 19 29 9 2.5
4 13 355 19 30 9 2.4
5 2 245 21 17 9 2.7
6 9 265 20 20 9 2.6
7 4 375 19 32 9 2.6
8 10 300 18 25 9 2.3
9 3 420 15 39 7 2.0

------------------------------------------ CLUSTER=2 ------------------------------------

Obs Food Calories Protein Fat Calcium Iron

10 7 170 25 7 12 1.5
11 26 170 25 7 7 1.2
12 14 205 18 14 7 2.5
13 21 200 19 13 5 1.0
14 5 180 22 10 17 3.7
15 15 185 23 9 9 2.7
16 23 195 16 11 14 1.3
17 16 135 22 4 25 0.6
18 20 135 16 5 15 0.5
19 8 160 26 5 14 5.9
20 6 115 20 3 8 1.4
21 17 70 11 1 82 6.0
22 18 45 7 1 74 5.4
23 22 155 16 9 157 1.8
24 24 120 17 5 159 0.7
25 19 90 14 2 38 0.8
26 27 170 23 1 98 2.6

------------------------------------------ CLUSTER=3 ------------------------------------

Obs Food Calories Protein Fat Calcium Iron

27 25 180 22 9 367 2.5


[Dendrogram from PROC TREE: merging distance (0 to 400) on the vertical axis and the food
items, in the order 1 11 12 13 4 3 2 9 10 5 15 7 26 8 14 21 23 6 16 20 19 17 18 22 24 27 25,
on the horizontal axis labelled Food.]

Data Entry and Procedure in SPSS

After entering the data, clustering in SPSS is carried out from the menus:
Analyze…
   Classify…
      K-Means Cluster
      Hierarchical Cluster
      Discriminant
For the present example, choose Hierarchical Cluster.



SPSS syntax for the above data


CLUSTER Calories Protein Fat Calcium Iron
/METHOD CENTROID
/MEASURE= SEUCLID
/PRINT SCHEDULE CLUSTER(3)
/PRINT DISTANCE
/PLOT DENDROGRAM VICICLE.


