
Cluster Analysis

Narendra K Sharma
Department of Industrial and Management Engineering Indian Institute of Technology Kanpur Kanpur 208 016

The concept of cluster analysis


Also called segmentation analysis or taxonomy analysis
A technique to identify homogeneous subgroups
Other techniques of clustering (obtaining homogeneous subgroups)
  Q-mode factor analysis
  Latent class analysis
March 26, 2011 NKS/IME/IITK CLUSTER ANALYSIS 2

Key concepts
Cluster formation
Similarity and distance
Method
Summary measures


Cluster formation
Procedure for determining how clusters are created
Two procedures
  Agglomerative hierarchical clustering
  Divisive clustering


Similarity and distance


Distance: how far apart two observations are
  Euclidean distance
  Other distances

Similarity: how alike two cases are
  Pearson correlation or cosine (for interval data)
  Other measures of similarity

The proximity matrix
  Shows the actual distances or similarities
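
As a minimal sketch of the measures listed above, the following Python shows one way to compute a Euclidean distance, a cosine similarity, and a proximity matrix. The function names and the three sample cases are illustrative assumptions, not part of the slides.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero observations."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def proximity_matrix(cases, measure):
    """Square matrix holding the chosen measure for every pair of cases."""
    return [[measure(a, b) for b in cases] for a in cases]

# Illustrative cases (assumed values, not taken from the slides)
cases = [(1.0, 2.0), (4.0, 6.0), (7.0, 10.0)]
distances = proximity_matrix(cases, euclidean)
print(distances[0][1])  # 5.0
```

Passing `cosine_similarity` instead of `euclidean` would give a similarity matrix rather than a distance matrix.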

Method: linkage
Nearest neighbour (single linkage)
Furthest neighbour (complete linkage)
UPGMA (unweighted pair-group method using averages)
Average linkage within groups
Ward's method
Centroid method
Median method
Correlation of items
Binary matching
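
To make the difference between some of these linkage rules concrete, here is a hedged sketch that computes four of them for the same pair of groups. The two groups and all function names are assumptions chosen for illustration; statistical packages may define these rules on squared distances instead.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(group_a, group_b):
    """All distances between a member of group_a and a member of group_b."""
    return [euclidean(a, b) for a in group_a for b in group_b]

def nearest_neighbour(group_a, group_b):
    """Distance between the closest pair of members."""
    return min(pairwise(group_a, group_b))

def furthest_neighbour(group_a, group_b):
    """Distance between the most distant pair of members."""
    return max(pairwise(group_a, group_b))

def upgma(group_a, group_b):
    """Average of all between-group pairwise distances."""
    d = pairwise(group_a, group_b)
    return sum(d) / len(d)

def centroid_distance(group_a, group_b):
    """Distance between the two group mean points."""
    def centroid(group):
        return tuple(sum(c) / len(group) for c in zip(*group))
    return euclidean(centroid(group_a), centroid(group_b))

# Illustrative groups (assumed values)
group_a = [(0.0, 0.0), (0.0, 2.0)]
group_b = [(4.0, 0.0), (4.0, 2.0)]
print(nearest_neighbour(group_a, group_b))   # 4.0
print(centroid_distance(group_a, group_b))   # 4.0
```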

Summary measures
Assess how the clusters differ from one another
  Means and variances

Linkage tables
  Show the relation of the cases to the clusters
  Cluster membership table
  Agglomeration schedule

Linkage plots
  Icicle plots
  Dendrograms (tree diagrams)

Relative importance plots



Hierarchical cluster analysis


Hierarchical clustering
  For smaller samples (typically < 250)
  Specify
    How similarity or distance is defined
    How clusters are aggregated (or divided)
    How many clusters are needed

Clustering
  Forward clustering
  Backward clustering
  Clustering variables
  Cluster procedure (as called by SPSS)

K-means cluster analysis


Uses Euclidean distance
Specify in advance
  The desired number of clusters, K

Clustering process
  Cluster centres
  Large datasets
  Method
  Agglomerative K-means clustering
  Getting different clusters
  Getting relationship with other variables
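
The slides do not spell out the iteration itself. As a rough sketch, assuming the common Lloyd-style procedure (assign each case to the nearest centre, then recompute the centres), K-means with user-supplied starting centres might look like this. The data points, starting centres, and function names are illustrative assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, centres, max_iter=100):
    """Lloyd-style K-means; K is the number of starting centres supplied."""
    centres = list(centres)
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for each point
        labels = [
            min(range(len(centres)), key=lambda k: euclidean(p, centres[k]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members
        new_centres = []
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its previous centre
            new_centres.append(mean_point(members) if members else centres[k])
        if new_centres == centres:
            break  # centres stopped moving
        centres = new_centres
    return labels, centres

# Illustrative data with K = 2 (assumed values)
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centres = k_means(points, centres=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Different starting centres can lead to different final clusters, which is one reason the number of clusters and the starting points deserve attention.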

Assumptions of cluster analysis


Interval data or true dichotomies (for hierarchical and K-means)
Independence of observations
Standardization of variables
Same assumptions as for correlation, regression and factor analysis
K-means cluster analysis
  Assumes a large sample size (for example, > 200)
  Very sensitive to outliers; remove outliers
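
Standardization matters because a variable measured on a larger scale would otherwise dominate the distance. One common choice, assumed here, is the z-score using the sample standard deviation; the slides do not say which standardization is intended, and the values below are illustrative.

```python
import statistics

def standardize(values):
    """Z-scores: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Illustrative values (assumed, not from the slides)
incomes = [10.0, 20.0, 30.0]
z_scores = standardize(incomes)
print(z_scores)
```

After standardization each variable has mean 0 and standard deviation 1, so no single variable outweighs the others in a Euclidean distance.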

An example of cluster analysis


Hypothetical data

Subject ID   Income (Rs thousand)   Education (Years)
S1             5                      5
S2             6                      6
S3            15                     14
S4            16                     15
S5            25                     20
S6            30                     19

Similarity measures using Euclidean distances

Entries are squared Euclidean distances; for example, d(S1, S3) = (15 - 5)^2 + (14 - 5)^2 = 181.

        S1      S2      S3      S4      S5      S6
S1    0.00    2.00  181.00  221.00  625.00  821.00
S2    2.00    0.00  145.00  181.00  557.00  745.00
S3  181.00  145.00    0.00    2.00  136.00  250.00
S4  221.00  181.00    2.00    0.00  106.00  212.00
S5  625.00  557.00  136.00  106.00    0.00   26.00
S6  821.00  745.00  250.00  212.00   26.00    0.00
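
The matrix entries can be checked from the hypothetical income and education data. The sketch below assumes the entries are squared Euclidean distances, which is what the values work out to; the variable and function names are illustrative.

```python
# Income (Rs thousand) and education (years) from the hypothetical data
subjects = {
    "S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
    "S4": (16, 15), "S5": (25, 20), "S6": (30, 19),
}

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

names = list(subjects)
matrix = {
    (p, q): squared_euclidean(subjects[p], subjects[q])
    for p in names for q in names
}
print(matrix[("S1", "S3")])  # 181

# Closest pair among distinct subjects; ties go to the first pair found
pairs = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]]
closest = min(pairs, key=lambda pair: matrix[pair])
print(closest)  # ('S1', 'S2')
```

S1 and S2 tie with S3 and S4 at a distance of 2, so either pair could be merged first; this sketch simply takes the first one it meets.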


Centroid method: Five clusters


Data for five clusters

Cluster   Cluster members   Income (Rs thousand)   Education (Years)
1         S1 & S2             5.5                    5.5
2         S3                 15                     14
3         S4                 16                     15
4         S5                 25                     20
5         S6                 30                     19
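
In the centroid method a merged cluster is represented by the mean of its members on each variable. A small sketch of that step, with an illustrative function name:

```python
def centroid(points):
    """Mean of the member points on each variable."""
    return tuple(sum(c) / len(points) for c in zip(*points))

s1, s2 = (5, 5), (6, 6)  # income, education for S1 and S2
print(centroid([s1, s2]))  # (5.5, 5.5)
```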

Centroid method: Five clusters


Similarity matrix

          S1 & S2      S3      S4      S5      S6
S1 & S2      0.00  162.50  200.50  590.50  782.50
S3         162.50    0.00    2.00  136.00  250.00
S4         200.50    2.00    0.00  106.00  212.00
S5         590.50  136.00  106.00    0.00   26.00
S6         782.50  250.00  212.00   26.00    0.00
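
After S1 and S2 are merged, only the distances involving the new cluster need to be recomputed. The sketch below recomputes that row from the centroid (5.5, 5.5), assuming squared Euclidean distance; names are illustrative.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

merged = (5.5, 5.5)  # centroid of S1 and S2
others = {"S3": (15, 14), "S4": (16, 15), "S5": (25, 20), "S6": (30, 19)}

row = {name: squared_euclidean(merged, point) for name, point in others.items()}
print(row)  # {'S3': 162.5, 'S4': 200.5, 'S5': 590.5, 'S6': 782.5}
```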

Centroid method: Four clusters


Data for four clusters

Cluster   Cluster members   Income (Rs thousand)   Education (Years)
1         S1 & S2             5.5                    5.5
2         S3 & S4            15.5                   14.5
3         S5                 25                     20
4         S6                 30                     19

Centroid method: Four clusters


Similarity matrix

          S1 & S2  S3 & S4      S5      S6
S1 & S2      0.00   181.00  590.50  782.50
S3 & S4    181.00     0.00  120.50  230.50
S5         590.50   120.50    0.00   26.00
S6         782.50   230.50   26.00    0.00

Centroid method: Three clusters


Data for three clusters

Cluster   Cluster members   Income (Rs thousand)   Education (Years)
1         S1 & S2             5.5                    5.5
2         S3 & S4            15.5                   14.5
3         S5 & S6            27.5                   19.5

Centroid method: Three clusters


Similarity matrix

          S1 & S2  S3 & S4  S5 & S6
S1 & S2      0.00   181.00   680.00
S3 & S4    181.00     0.00   169.00
S5 & S6    680.00   169.00     0.00
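
The sequence of merges from six subjects down to three clusters can be put together in one loop. This is a sketch under stated assumptions: squared Euclidean distance between centroids, ties broken by taking the first closest pair, and illustrative function names; it is not the exact procedure of any particular package.

```python
from itertools import combinations

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def centroid_gap(data, a, b):
    """Squared distance between the centroids of clusters a and b."""
    return squared_euclidean(
        centroid([data[n] for n in a]), centroid([data[n] for n in b])
    )

def centroid_clustering(data, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [(name,) for name in data]  # each subject starts alone
    schedule = []  # (cluster a, cluster b, squared distance) per merge
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: centroid_gap(data, *pair))
        schedule.append((a, b, centroid_gap(data, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters, schedule

subjects = {
    "S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
    "S4": (16, 15), "S5": (25, 20), "S6": (30, 19),
}
clusters, schedule = centroid_clustering(subjects, n_clusters=3)
print(clusters)  # [('S1', 'S2'), ('S3', 'S4'), ('S5', 'S6')]
for a, b, d in schedule:
    print(a, b, d)
```

The merge distances come out as 2, 2 and 26, matching the smallest off-diagonal entries of the successive matrices.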

Dendrogram for hypothetical data

[Figure: dendrogram of the hypothetical data. Vertical axis: Euclidean distance (0 to 20); horizontal axis: Observation.]


An exercise on cluster analysis: Find the clusters for the following hypothetical data
Hypothetical data
Subject ID S1 S2 S3 S4 S5 S6

Residence in a locality (years)

Attitude towards environment 10 10 10 20 20 30 50 10 15 15 40 30



