Cluster Analysis

Narendra K Sharma
Department of Industrial and Management Engineering Indian Institute of Technology Kanpur Kanpur 208 016
The concept of cluster analysis

Also called segmentation analysis or taxonomy analysis
A technique to identify homogeneous subgroups
Other techniques of clustering (obtaining homogeneous subgroups)
  Q-mode factor analysis
  Latent class analysis
March 26, 2011 NKS/IME/IITK CLUSTER ANALYSIS 2
Key concepts
Cluster formation
Similarity and distance
Method
Summary measures
Cluster formation
Procedure for determining how clusters are created
Two procedures
  Agglomerative hierarchical clustering
  Divisive clustering
Similarity and distance

Distance: how far apart two observations are
  Euclidean distance
  Other distances
Similarity: how alike two cases are
  Pearson correlation or cosine (for interval data)
  Other measures of similarity
The proximity matrix

Shows the actual distances or similarities
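
The three measures named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the two example cases are assumptions made for the sketch, not part of the slides.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two cases (smaller = closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine of the angle between two cases (larger = more alike)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centred profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centred_x = [a - mean_x for a in x]
    centred_y = [b - mean_y for b in y]
    return cosine(centred_x, centred_y)

# Two hypothetical cases measured on three interval variables.
case_a = [1.0, 2.0, 3.0]
case_b = [2.0, 4.0, 6.0]

print(euclidean(case_a, case_b))  # the cases are some distance apart ...
print(cosine(case_a, case_b))     # ... yet their profiles point the same way
print(pearson(case_a, case_b))
```

The example cases are chosen so that the distance is clearly non-zero while both similarity measures come out at (approximately) 1. Distance and similarity answer different questions about the same pair of cases.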
Method: linkage
Nearest neighbour
Furthest neighbour
UPGMA (unweighted pair-group method using averages)
Average linkage within groups
Ward's method
Centroid method
Median method
Correlation of items
Binary matching
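
Several of the linkage rules above differ only in how they reduce the pairwise distances between two clusters to a single number. The sketch below illustrates four of them. The function names are assumptions made for the sketch, and squared Euclidean distance is used only because it keeps the arithmetic exact.

```python
def sq_euclidean(x, y):
    """Squared Euclidean distance between two cases."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest_neighbour(c1, c2, dist=sq_euclidean):
    """Single linkage: distance between the closest pair of cases."""
    return min(dist(x, y) for x in c1 for y in c2)

def furthest_neighbour(c1, c2, dist=sq_euclidean):
    """Complete linkage: distance between the most distant pair of cases."""
    return max(dist(x, y) for x in c1 for y in c2)

def average_between_groups(c1, c2, dist=sq_euclidean):
    """UPGMA: mean of all between-cluster pairwise distances."""
    return sum(dist(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def centroid_linkage(c1, c2, dist=sq_euclidean):
    """Centroid method: distance between the two cluster means."""
    mean1 = tuple(sum(col) / len(c1) for col in zip(*c1))
    mean2 = tuple(sum(col) / len(c2) for col in zip(*c2))
    return dist(mean1, mean2)

# Two small illustrative clusters of (income, education) cases.
cluster_a = [(5, 5), (6, 6)]
cluster_b = [(15, 14), (16, 15)]

print(nearest_neighbour(cluster_a, cluster_b))       # 145
print(furthest_neighbour(cluster_a, cluster_b))      # 221
print(average_between_groups(cluster_a, cluster_b))  # 182.0
print(centroid_linkage(cluster_a, cluster_b))        # 181.0
```

The same two clusters get four different distances, which is why the choice of linkage can change which clusters are merged next.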
Summary measures
Assess how the clusters differ from one another
  Means and variances
Linkage tables
  Show the relation of the cases to the clusters
  Cluster membership table
  Agglomeration schedule
Linkage plots
  Icicle plots
  Dendrograms (tree diagrams)
Relative importance plots

Hierarchical cluster analysis

Hierarchical clustering
For smaller samples (typically < 250)
Specify
  How similarity or distance is defined
  How clusters are aggregated (or divided)
  How many clusters are needed
Clustering
  Forward clustering
  Backward clustering
  Clustering variables
  Cluster procedure (as called by SPSS)
K-means cluster analysis

Uses Euclidean distance
Specify in advance
  The desired number of clusters, K
Clustering process
  Cluster centres
  Large datasets
  Method
  Agglomerative K-means clustering
  Getting different clusters
  Getting relationship with other variables
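
The basic K-means loop (assign each case to its nearest centre, then recompute the centres) can be sketched as below. This is a plain illustration of the standard procedure, not the SPSS implementation. The function names, the six example cases and the starting centres are assumptions made for the sketch.

```python
def sq_euclidean(x, y):
    """Squared Euclidean distance between two cases."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(points, centres, max_iter=100):
    """Alternate assignment and update steps until the centres stop moving."""
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each case joins the cluster with the nearest centre.
        labels = [
            min(range(len(centres)), key=lambda k: sq_euclidean(p, centres[k]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its current members.
        new_centres = []
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centres.append(centres[k])  # keep the centre of an empty cluster
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres

# Six hypothetical cases (income in Rs thousand, education in years), K = 3.
points = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]
initial_centres = [(5, 5), (15, 14), (25, 20)]  # assumed starting centres

labels, centres = k_means(points, initial_centres)
print(labels)   # [0, 0, 1, 1, 2, 2]
print(centres)  # [(5.5, 5.5), (15.5, 14.5), (27.5, 19.5)]
```

Different starting centres can lead to different final clusters, so in practice the procedure is often rerun from several starting points.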
Assumptions of cluster analysis

Interval data or true dichotomies (for hierarchical and K-means clustering)
Independence of observations
Standardization of variables
Same assumptions as for correlation, regression and factor analysis
K-means cluster analysis
  Assumes a large sample size (for example, > 200)
  Very sensitive to outliers; remove outliers
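
Standardization matters because Euclidean distance is dominated by whichever variable has the largest spread. A common choice, assumed here for illustration, is the z-score: subtract the mean and divide by the sample standard deviation. The function name and the example values are hypothetical.

```python
import statistics

def standardize(values):
    """Convert raw scores to z-scores (mean 0, sample standard deviation 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Hypothetical incomes in Rs thousand.
income = [5, 6, 15, 16, 25, 30]
z_income = standardize(income)

print([round(z, 2) for z in z_income])
print(round(statistics.mean(z_income), 6), round(statistics.stdev(z_income), 6))
```

Each variable would be standardized separately before the distances are computed, so that all variables contribute on a comparable scale.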
An example of cluster analysis

Hypothetical data
Subject ID   Income (Rs thousand)   Education (Years)
S1            5                      5
S2            6                      6
S3           15                     14
S4           16                     15
S5           25                     20
S6           30                     19
Similarity measures using Euclidean distances

Entries are squared Euclidean distances between subjects.

        S1      S2      S3      S4      S5      S6
S1    0.00    2.00  181.00  221.00  625.00  821.00
S2    2.00    0.00  145.00  181.00  557.00  745.00
S3  181.00  145.00    0.00    2.00  136.00  250.00
S4  221.00  181.00    2.00    0.00  106.00  212.00
S5  625.00  557.00  136.00  106.00    0.00   26.00
S6  821.00  745.00  250.00  212.00   26.00    0.00
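
The matrix can be reproduced directly from the hypothetical income and education data. The sketch below assumes the entries are squared Euclidean distances (the values only match on that reading); the variable and function names are illustrative.

```python
# Hypothetical data: (income in Rs thousand, education in years).
data = {
    "S1": (5, 5),
    "S2": (6, 6),
    "S3": (15, 14),
    "S4": (16, 15),
    "S5": (25, 20),
    "S6": (30, 19),
}

def sq_euclidean(x, y):
    """Squared Euclidean distance between two subjects."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

subjects = list(data)
matrix = [[sq_euclidean(data[a], data[b]) for b in subjects] for a in subjects]

# Print the proximity matrix one row per subject.
print("    " + "".join(f"{s:>8}" for s in subjects))
for subject, row in zip(subjects, matrix):
    print(f"{subject:<4}" + "".join(f"{d:8.2f}" for d in row))
```

For example, S1 and S3 differ by 10 on income and 9 on education, giving 10² + 9² = 181.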


Centroid method: Five clusters

Data for five clusters
Cluster   Cluster members   Income (Rs thousand)   Education (Years)
1         S1 and S2          5.5                    5.5
2         S3                15                     14
3         S4                16                     15
4         S5                25                     20
5         S6                30                     19
Centroid method: Five clusters

Similarity matrix
           S1 & S2      S3      S4      S5      S6
S1 & S2       0.00  162.50  200.50  590.50  782.50
S3          162.50    0.00    2.00  136.00  250.00
S4          200.50    2.00    0.00  106.00  212.00
S5          590.50  136.00  106.00    0.00   26.00
S6          782.50  250.00  212.00   26.00    0.00
Centroid method: Four clusters

Data for four clusters
Cluster   Cluster members   Income (Rs thousand)   Education (Years)
1         S1 & S2            5.5                    5.5
2         S3 & S4           15.5                   14.5
3         S5                25                     20
4         S6                30                     19
Centroid method: Four clusters

Similarity matrix
           S1 & S2  S3 & S4      S5      S6
S1 & S2       0.00   181.00  590.50  782.50
S3 & S4     181.00     0.00  120.50  230.50
S5          590.50   120.50    0.00   26.00
S6          782.50   230.50   26.00    0.00
Centroid method: Three clusters

Data for three clusters
Cluster   Cluster members   Income (Rs thousand)   Education (Years)
1         S1 & S2            5.5                    5.5
2         S3 & S4           15.5                   14.5
3         S5 & S6           27.5                   19.5
Centroid method: Three clusters

Similarity matrix
           S1 & S2  S3 & S4  S5 & S6
S1 & S2       0.00   181.00   680.00
S3 & S4     181.00     0.00   169.00
S5 & S6     680.00   169.00     0.00
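
The sequence of merges worked through above (six clusters down to three) can be reproduced with a short agglomerative loop. This is a sketch of the centroid method under two assumptions: squared Euclidean distances between cluster means, and ties broken in favour of the pair found first. The function names are illustrative, and the last two merges go beyond what the slides show.

```python
data = {
    "S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
    "S4": (16, 15), "S5": (25, 20), "S6": (30, 19),
}

def sq_euclidean(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(members):
    """Mean income and education of the subjects in a cluster."""
    points = [data[name] for name in members]
    return tuple(sum(col) / len(points) for col in zip(*points))

def agglomerate():
    """Repeatedly merge the two clusters whose centroids are closest."""
    clusters = [[name] for name in data]
    schedule = []  # one (cluster, cluster, distance) entry per merge
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_euclidean(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:  # strict '<' keeps the first tie
                    best = (d, i, j)
        d, i, j = best
        schedule.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.insert(i, merged)
    return schedule

schedule = agglomerate()
for left, right, distance in schedule:
    print(f"merge {left} + {right} at {distance:.2f}")
```

The first three merges (S1 with S2, S3 with S4, S5 with S6) match the five-, four- and three-cluster solutions in the tables. The printed list plays the role of an agglomeration schedule.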
Dendrogram for hypothetical data

[Figure: tree diagram of the hypothetical data. Vertical axis: Euclidean distance (0 to 20). Horizontal axis: Observation.]
An exercise on cluster analysis: Find the clusters for the following hypothetical data
Hypothetical data
Subject ID S1 S2 S3 S4 S5 S6
Residence in a locality (years)
Attitude towards environment 10 10 10 20 20 30 50 10 15 15 40 30

