
MSc Business Administration

Research Methodology: Tools


Applied Data Analysis (with SPSS)

Lecture 04: Cluster Analysis

March 2011
Prof. Dr. Jürg Schwarz

juerg.schwarz@hslu.ch

Slide 2

Contents

Aims (slide 5)
Introduction (slide 6)
Outline (slide 9)
Concepts of Cluster Analysis (slide 10)
Cluster Analysis with SPSS: A detailed example (slide 24)

Slide 3

Table of contents

Aims (slide 5)
  Aims of the lecture (slide 5)
Introduction (slide 6)
  Example (slide 6)
Outline (slide 9)
Concepts of Cluster Analysis (slide 10)
  Key steps in using a cluster analysis (slide 10)
  How to measure proximity (slide 11)
  Proximity measure with interval variables (slide 13)
  Proximity measure with binary variables (slide 15)
  How to form Clusters (slide 18)
    How to define similarity? (slide 18)
    Cluster formation tree (rules for cluster formation) (slide 20)
    Pros and cons (slide 21)
    Example of hierarchical method: Single linkage (nearest neighbor) (slide 22)
    Example of hierarchical method: Complete linkage (furthest neighbor) (slide 23)

Slide 4
Cluster Analysis with SPSS: A detailed example (slide 24)
  Marketing research: Customer survey on brand awareness (slide 24)
  SPSS Elements: <Analyze><Classify><Hierarchical ... (slide 25)
  First step: Measure of distance or similarity between objects (slide 27)
    Output (slide 27)
  Second step: Formation of clusters (slide 28)
    Between-groups linkage (slide 28)
    Dendrogram (slide 29)
  Third step: Determining the number of clusters (slide 31)
  Fourth step: Display and save cluster membership (slide 33)
    Output table of cluster membership (slide 33)
    Saving the cluster membership (slide 34)
    Scatter plot: <Graphs><Chart Builder> (slide 35)
  Fifth step: Interpretation of clusters (slide 37)
    Taking into account means (slide 37)
    Example of Lecture 01: Marketing survey on consumer buying behavior (slide 37)

Slide 5

Aims
Aims of the lecture
You know different types of measures of distance / similarity.

You know the key steps in conducting a cluster analysis.

You can conduct a cluster analysis with SPSS
(hierarchical agglomerative methods: Between-groups linkage and Ward).
In particular, you know how to
choose the appropriate measure of distance / similarity
interpret the agglomeration schedule
use the dendrogram to determine the number of clusters
interpret the meaning of a cluster

Slide 6

Introduction
Example
Marketing research: Customer survey on brand awareness ("Markenbewusstsein")
Survey features
Sample of n = 150 customers
The brand awareness index consists of 3 items:
How likely is it that you will use the brand again in the future?
How likely would you be to recommend the brand to your friends?
Overall, how satisfied are you with the brand?
Also included in the dataset: yearly income

[Figure: scatter plot of brand awareness [Index] against yearly income [Index]]

Slide 7

Question
Is there a linear relation between brand awareness and yearly income?
Hypothesis: The higher a person's income, the higher his/her brand awareness.

Conduct regression analysis with SPSS

Output (summarized)
Overall model test (F-test): significance p = .014
Test of coefficients: constant p = .000, income p = .014
Coefficient of determination: R Square = .040
The relation is statistically significant, but income explains only 4% of the variance in brand awareness, so this is a very poor model.
There seems to be structure in the brand awareness dataset.

[Figure: scatter plot of brand awareness [Index] against yearly income [Index]]

Slide 8

Question
Is there structure in the brand awareness dataset?
Are there clusters for the combination of yearly income and brand awareness?

Conduct cluster analysis

Output
SPSS identified 3 distinct clusters

Interpretation
People with low income are least aware because they lack money.
People with middle income have the highest brand awareness because of the dream of being richer.
People with high income are moderately brand aware because they have a certain status but don't need to show off.

[Figure: scatter plot of brand awareness [Index] against yearly income [Index]]

Slide 9

Outline
Cluster analysis is a multivariate procedure for detecting natural groupings in data.
The grouping is based on the scores of several measures (e.g. income and awareness).

Goals in conducting cluster analysis
Elements within a group should be as similar as possible <=> distance d should be small
Similarities between the groups should be minimal <=> distance D should be large

Features
Because all information is used for grouping, cluster analysis is more objective than just a subjective impression.
There is no optical illusion.

[Figure: scatter plot of brand awareness [Index] against yearly income [Index], with the within-cluster distance d marked]
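
The two goals can be checked numerically. The following is a minimal Python sketch, assuming two small clusters of (income, awareness) points that are invented for illustration and are not the lecture data. The helper names `mean_within` and `mean_between` are hypothetical.

```python
import math
from itertools import combinations, product

# Hypothetical (income, awareness) points, invented for illustration only.
cluster_a = [(1.0, 1.2), (1.2, 1.0), (0.9, 0.8)]
cluster_b = [(3.0, 3.1), (3.2, 2.9), (2.8, 3.3)]

def mean_within(cluster):
    """Mean Euclidean distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_between(first, second):
    """Mean Euclidean distance over all pairs with one point in each cluster."""
    pairs = list(product(first, second))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

d = max(mean_within(cluster_a), mean_within(cluster_b))  # within-cluster distance
D = mean_between(cluster_a, cluster_b)                   # between-cluster distance
print(f"d = {d:.3f}, D = {D:.3f}")  # a good grouping has d much smaller than D
```

Using the mean pairwise distance is only one possible way to summarize d and D; the slide does not prescribe a specific summary.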

Slide 10

Concepts of Cluster Analysis


Key steps in using a cluster analysis
1. Measure of distance or similarity between objects (also called proximity measure)
Depends on type of data: interval, counts, binary
Distance: geometrical measure. Similarity: content-related measure
2. Formation of clusters
Calculation of proximity matrix
Many different procedures: Hierarchical / non-hierarchical, agglomerative / divisive etc.
3. Tools / criteria for determining the number of clusters
Tools: Agglomeration schedule, structural chart, dendrogram, icicle plot ("Eiszapfen-Plot")
Criteria (not available in SPSS): F-value, information criterion, etc.
4. Display and save cluster membership
Done by SPSS
5. Interpretation of clusters
Taking into account means (possibly variances) of cluster members

Slide 11

How to measure proximity


From dataset ...

            Variable 1   Variable 2   Variable 3   ...   Variable j
Object 1
Object 2
Object 3
:
Object k
(cells contain the raw data)

... to proximity matrix (done by SPSS internally)

            Object 1   Object 2   Object 3   ...   Object k
Object 1
Object 2
Object 3
:
Object k
(cells contain the distance or similarity between the two objects)

Slide 12

Different proximity measures, depending on type of data


The measure specifies the distance (d) or similarity (s) to be used in clustering.
Interval (e.g. brand awareness, yearly income)
Euclidean distance (d)
City block distance (d)
Pearson correlation (s)
:
Counts (e.g. number of clients)
Chi-square measure (d)
Phi-square measure (d)
:
Binary (e.g. yes/no, female/male)
Euclidean distance (d)
Russell and Rao (s)
Simple matching (s)
Dice (s)
(only a selection of 27!)

Slide 13

Proximity measure with interval variables


Example: Brand awareness

Coordinates {x-axis, y-axis} of the two persons: {0.97, 2.95} and {1.67, 1.73}

[Figure: right triangle with legs a and b and hypotenuse c = 1.407 connecting the two points]

Theorem of Pythagoras for a right triangle

$a^2 + b^2 = c^2 \;\Rightarrow\; c = \sqrt{a^2 + b^2}$

Distance between "pers_001" and "pers_002"

$d_{001,002} = \left[ (1.67 - 0.97)^2 + (2.95 - 1.73)^2 \right]^{1/2} = [0.490 + 1.488]^{1/2} = 1.407$
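
The worked example can be reproduced in a few lines. This is a minimal Python sketch using the two coordinate pairs from the slide; the function name `euclidean_distance` is an illustrative choice.

```python
import math

# Coordinates {x-axis, y-axis} of the two persons, as given on the slide.
point_1 = (0.97, 2.95)
point_2 = (1.67, 1.73)

def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

distance = euclidean_distance(point_1, point_2)
print(round(distance, 3))  # 1.407
```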

Slide 14

Generalized equation
Minkowski distance (Hermann Minkowski, 1864 - 1909, German mathematician)

$d_{k,l} = \left[ \sum_{j=1}^{J} \left| x_{kj} - x_{lj} \right|^{r} \right]^{1/r}$

r = Minkowski's constant
dk,l = Distance between objects k and l (e.g. distance between persons 001 and 002)
J = Number of cluster variables (e.g. variables income and awareness)
xkj, xlj = Values of variable j of objects k and l (e.g. income of persons 001 and 002)

Values of Minkowski's constant
r = 1: City block distance (also called L1-norm, Manhattan distance or taxi distance)
r = 2: Euclidean distance (also called L2-norm)

[Figure: two points connected by the L2 path (straight line) and the L1 path (along the axes)]
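
To see how Minkowski's constant r changes the result, here is a minimal Python sketch of the generalized equation, applied to the two persons from the previous slide. The function name `minkowski_distance` is an illustrative choice.

```python
def minkowski_distance(x_k, x_l, r):
    """Minkowski distance between objects k and l for a constant r >= 1."""
    if len(x_k) != len(x_l):
        raise ValueError("both objects need the same number of variables")
    return sum(abs(a - b) ** r for a, b in zip(x_k, x_l)) ** (1 / r)

person_k = (0.97, 2.95)
person_l = (1.67, 1.73)

city_block = minkowski_distance(person_k, person_l, r=1)  # L1-norm
euclidean = minkowski_distance(person_k, person_l, r=2)   # L2-norm
print(round(city_block, 2), round(euclidean, 3))  # 1.92 1.407
```

The city block distance is never smaller than the Euclidean distance, because it follows the axes instead of the straight line.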

Slide 15

Proximity measure with binary variables


Example: Car configuration
Identification of similarities between two objects by means of comparison

Configuration (0 = feature not present, 1 = feature present)

            ABS   Airbag   ESP   Navi   Metallic
Mercedes     0      1       1     1        0
BMW          0      1       1     0        1
Case         D      A       A     B        C

4 Cases
A = Feature exists in both comparison objects
B, C = Feature exists in only one of the two comparison objects
D = Feature exists in neither of the comparison objects

The joint absence of a feature (case D) can also be an important similarity when defining proximity.
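
The four cases can be counted mechanically. Below is a minimal Python sketch using the car configuration from the slide. It assumes that case B means "present only in the first object" and case C "present only in the second object"; the slide does not fix this order, so treat it as an assumption.

```python
FEATURES = ["ABS", "Airbag", "ESP", "Navi", "Metallic"]
mercedes = [0, 1, 1, 1, 0]
bmw = [0, 1, 1, 0, 1]

def count_cases(first, second):
    """Return (a, b, c, d) for two equally long 0/1 feature vectors.

    a: present in both, b: only in the first, c: only in the second,
    d: present in neither. The b/c order is an assumption.
    """
    a = sum(1 for x, y in zip(first, second) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(first, second) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(first, second) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(first, second) if x == 0 and y == 0)
    return a, b, c, d

print(count_cases(mercedes, bmw))  # (2, 1, 1, 1)
```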

Slide 16

a = Number of features falling under case "A"
b = Number of features falling under case "B"
:
The proximity measure between two objects i and j depends on whether and how
the cases are included and how they are weighted (weights λ, δ1 and δ2).

Binary proximity measures

General case: Simple Matching Coefficient*

$S_{ij} = \frac{a + \delta_1 d}{a + \lambda (b + c) + \delta_2 d}$

Variants (description and definition)

Russell and Rao: Case d reduces proximity

$S_{ij} = \frac{a}{a + b + c + d}$

Simple matching: Case d raises proximity

$S_{ij} = \frac{a + d}{a + b + c + d}$

Dice: Case d is not taken into account; shared features (case a) are weighted double

$S_{ij} = \frac{2a}{2a + b + c}$

*Sokal, R.R. and Michener, C.D., A statistical method for evaluating systematic relationships, University of Kansas Science Bulletin, 38:1409-1438, 1958.

Slide 17

Example: Car configuration

Configuration (0 = feature not present, 1 = feature present)

            ABS   Airbag   ESP   Navi   Metallic
Mercedes     0      1       1     1        0
BMW          0      1       1     0        1
Case         D      A       A     B        C

Count of cases: a = 2, b = 1, c = 1, d = 1

Measure and proximity

Russell and Rao

$S_{ij} = \frac{2}{2 + 1 + 1 + 1} = \frac{2}{5} = 0.4$

Simple matching

$S_{ij} = \frac{2 + 1}{2 + 1 + 1 + 1} = \frac{3}{5} = 0.6$

Dice

$S_{ij} = \frac{2 \cdot 2}{2 \cdot 2 + 1 + 1} = \frac{4}{6} = 0.67$

Some remarks
Sij varies between 0 and 1
There is no "right" proximity measure

Important question/decision:
Is non-existence important?
(<=> taking case d into account?)
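
The three coefficients differ only in how they treat case d, which a short computation makes visible. This is a minimal Python sketch with the case counts of the car example; the function names are illustrative choices.

```python
def russell_rao(a, b, c, d):
    """Joint absence (d) only enlarges the denominator, so it lowers proximity."""
    return a / (a + b + c + d)

def simple_matching(a, b, c, d):
    """Joint absence (d) counts as a match, so it raises proximity."""
    return (a + d) / (a + b + c + d)

def dice(a, b, c, d):
    """Joint absence (d) is ignored; joint presence (a) is weighted double.

    Undefined (division by zero) if the two objects share no present feature
    and do not differ in any feature.
    """
    return 2 * a / (2 * a + b + c)

a, b, c, d = 2, 1, 1, 1  # counts from the Mercedes/BMW example
for measure in (russell_rao, simple_matching, dice):
    print(measure.__name__, round(measure(a, b, c, d), 2))
# russell_rao 0.4
# simple_matching 0.6
# dice 0.67
```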

Slide 18

How to form Clusters


[Figure: cluster A and cluster B connected by three distances, numbered 1., 2. and 3.]

How to define similarity?


Similarity between cluster A and cluster B is measured by
1. Nearest neighbor (also called single linkage in the cluster formation tree on slide 20)
... the minimum of all possible distances between the cases in cluster A and the cases in B.
2. Centroid clustering (also called other linkage)
... the distance between the centroids of cluster A and of cluster B.
3. Furthest neighbor (also called complete linkage)
... the maximum of all possible distances between the cases in cluster A and the cases in B.

Slide 19

Similarity between cluster A and cluster B is measured by


Between-groups linkage (also called average linkage)
... the average of all the possible distances between the cases in cluster A and the cases in B.
Within-groups linkage (also called other linkage)
... the average of all the possible distances between the cases within a single new cluster
determined by combining cluster A and cluster B.
Median clustering (also called other linkage)
... the distance between the SPSS-determined median of the cases in cluster A and the median
of the cases in cluster B.
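
Four of these definitions can be written down directly. The following minimal Python sketch assumes Euclidean distance and two small clusters that are invented for illustration. Within-groups linkage and median clustering are left out, and the function names are illustrative choices.

```python
import math
from itertools import product

def pair_distances(cluster_a, cluster_b):
    """All distances between one case in cluster A and one case in cluster B."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Nearest neighbor: minimum of all pairwise distances."""
    return min(pair_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Furthest neighbor: maximum of all pairwise distances."""
    return max(pair_distances(cluster_a, cluster_b))

def between_groups_linkage(cluster_a, cluster_b):
    """Average linkage: mean of all pairwise distances."""
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

def centroid(cluster):
    """Mean of each coordinate over all cases in the cluster."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))

def centroid_distance(cluster_a, cluster_b):
    """Centroid clustering: distance between the two centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters, invented for illustration only.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
for rule in (single_linkage, between_groups_linkage, complete_linkage, centroid_distance):
    print(rule.__name__, round(rule(cluster_a, cluster_b), 3))
```

For any pair of clusters the average linkage value lies between the single and the complete linkage value.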

Special case, taking into account sum of squares


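Ward's method
For a cluster, the sum of squares is the sum of the squared distances of each case from the centroid.
At each step, Ward's method merges the two clusters whose fusion causes the smallest increase in the total within-cluster sum of squares.

Sum of squared distances

$d_1^2 + d_2^2 + \dots + d_k^2 = \sum_{i=1}^{k} d_i^2$

[Figure: cluster with its centroid and the distances d1, d2, ... of the cases from it]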
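
The sum of squares and the increase caused by a merge can be computed as follows. This is a minimal Python sketch with two clusters invented for illustration; `ward_increase` is a hypothetical helper name for the quantity that Ward's method keeps as small as possible at each merge.

```python
def centroid(cluster):
    """Mean of each coordinate over all cases in the cluster."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))

def sum_of_squares(cluster):
    """Sum of squared Euclidean distances of each case from the centroid."""
    center = centroid(cluster)
    return sum(
        sum((value - mean) ** 2 for value, mean in zip(case, center))
        for case in cluster
    )

def ward_increase(cluster_a, cluster_b):
    """Increase in the total sum of squares if the two clusters are merged."""
    merged = cluster_a + cluster_b
    return sum_of_squares(merged) - sum_of_squares(cluster_a) - sum_of_squares(cluster_b)

# Hypothetical clusters, invented for illustration only.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(6.0, 0.0), (8.0, 0.0)]
print(sum_of_squares(cluster_a))            # 2.0
print(ward_increase(cluster_a, cluster_b))  # 36.0
```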

Slide 20

Cluster formation tree (rules for cluster formation)

There are several types of clustering procedures:

Cluster algorithms
  Hierarchical
    Agglomerative
      Linkage methods
        Single linkage
        Complete linkage
        Average linkage
        Other linkage
      Variance methods
        Ward's procedure
    Divisive
  Non-hierarchical
    k-Means procedure

The chart highlights the procedures used in this course.


Non-hierarchical clustering is also called k-means clustering.
Average linkage between groups is the default in SPSS ("Between-groups linkage")

Slide 21

Pros and cons

Hierarchical clustering
No a priori decision about the number of clusters
Can be very slow

Non-hierarchical clustering
Need to specify the number of clusters (can be an arbitrary number)
Faster, more reliable
Features

Procedure          Proximity measure         Remark
Single linkage     distance or similarity    tendency to form chains
Complete linkage   distance or similarity    tendency to smaller groups of same size
Average linkage    distance or similarity    "between" single and complete linkage
Other linkage      only distance             no remark
Ward's method      only distance             tendency to groups of same size

Slide 22

Example of hierarchical method: Single linkage (nearest neighbor)


Tendency to form chains
Suitable for the detection of outliers
Close groups are badly separated

[Figure: single linkage at step k and step k + 1; joining the nearest neighbor produces a "chain". Source: Cognitive Psychology Unit at Saarland University (www.uni-saarland.de), date of access: March 2011]

Slide 23

Example of hierarchical method: Complete linkage (furthest neighbor)


Tendency to form smaller groups with same size
Not suitable for detecting outliers

[Figure: complete linkage at step k and step k + 1, based on the furthest neighbor. Source: Cognitive Psychology Unit at Saarland University (www.uni-saarland.de), date of access: March 2011]

Slide 24

Cluster Analysis with SPSS: A detailed example


Marketing research: Customer survey on brand awareness

Data
Random sub-sample of n = 15
(Why this small sub-sample? Just to keep track of what SPSS does.)

Data set: cluster_small.sav
Syntax: cluster_small.sps

[Figure: scatter plot of brand awareness [Index] against yearly income [Index] for the sub-sample]

Slide 25

SPSS Elements: <Analyze><Classify><Hierarchical ..

Slide 26

Syntax

CLUSTER income awareness        Variables included
  /METHOD BAVERAGE              Cluster method "Between-groups linkage"
  /MEASURE= EUCLID              Proximity measure "Euclidean distance"
  /ID=person                    Labels cases in plots and tables
  /PRINT SCHEDULE               Schedule of cluster algorithm
  /PRINT DISTANCE               Matrix with distances ("Proximity Matrix")
  /PLOT DENDROGRAM VICICLE.     Tools for determining the number of clusters

Cluster method "Between-groups linkage" (which is default)
<=> A better choice would be Ward's method. Here "Between-groups linkage" was used to
show in detail how SPSS runs a cluster analysis.

Proximity measure "Euclidean distance" (used to show how SPSS performs a cluster analysis)
<=> The squared Euclidean measure (which is default) should be used when the
BAVERAGE, CENTROID, MEDIAN, or WARD cluster method is requested.

Slide 27

First step: Measure of distance or similarity between objects


Output

Proximity Matrix (Distances or similarities between items)

Example:
Distance between cases 9 and 7

Values represent Euclidean distances

Slide 28

Second step: Formation of clusters

Between-groups linkage

Agglomeration schedule: Displays the clusters combined at each stage.

Stage 1: Cases 7 and 9 have smallest distance ("Coefficients" = .203) => first cluster {7,9}
First cluster {7,9} will be clustered with case 10 in stage 5 => cluster {7,9,10}
Stage 2: Cases 13 and 14 have second smallest distance => second cluster {13,14}
Second cluster {13,14} will be clustered with case 11 in stage 3 => cluster {11,13,14}
:
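
How such a schedule comes about can be traced with a small program. This is a minimal Python sketch of agglomerative clustering with between-groups (average) linkage and Euclidean distance. It is not the SPSS implementation, and the five cases are invented for illustration, not taken from cluster_small.sav.

```python
import math
from itertools import combinations, product

def average_linkage(points, first, second):
    """Mean distance over all case pairs with one case in each cluster."""
    pairs = list(product(first, second))
    return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)

def agglomeration_schedule(points):
    """Merge the two closest clusters until only one cluster is left.

    Returns a list of (stage, cluster_1, cluster_2, coefficient). Each cluster
    is named after its lowest case number.
    """
    clusters = {case: [case] for case in points}  # every case starts alone
    schedule = []
    stage = 0
    while len(clusters) > 1:
        stage += 1
        first, second = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: average_linkage(points, clusters[pair[0]], clusters[pair[1]]),
        )
        coefficient = average_linkage(points, clusters[first], clusters[second])
        clusters[first] = clusters[first] + clusters.pop(second)
        schedule.append((stage, first, second, round(coefficient, 3)))
    return schedule

# Hypothetical (income, awareness) values for five cases, not the lecture data.
points = {1: (1.0, 1.0), 2: (1.2, 1.1), 3: (3.0, 3.0), 4: (3.1, 3.3), 5: (5.0, 1.0)}
schedule = agglomeration_schedule(points)
for row in schedule:
    print(row)
```

Each printed row corresponds to one stage: the two clusters that were combined and the distance ("coefficient") at which this happened.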

Slide 29

Dendrogram

[Figure: dendrogram with the merges of stages 1, 2, 3 and 5 marked]

Slide 30

Icicle plot

The figure is called an icicle plot because the columns look like icicles hanging from above.

The plot shows how cases are merged into clusters.
Read it from bottom to top:

14 clusters: Cases 7 and 9 in one cluster, all others each in their own clusters.
13 clusters: 7 and 9 in one cluster, 13 and 14 in one cluster, all others each in their clusters.
12 clusters: 7 and 9 in one cluster, 11, 13 and 14 in one cluster, all others each in their clusters.
:

Slide 31

Third step: Determining the number of clusters


0) Theoretical and empirical reasons (But, be careful about optical illusion!)
In the case of brand awareness there are some indications for three clusters.
A) Elbow criterion in the structure chart (can't be done with SPSS, but with Excel)

[Figure: structure chart of proximity ("Coefficients", 0.0 to 3.0) against the number of clusters (= sample size - "Stage", 1 to 15); elbow => choose 3 clusters]

Please note: There is usually a large jump between 1 cluster and 2 clusters, which is not the "elbow".
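
One simple way to make the elbow criterion concrete is to look for the largest jump between successive coefficients while ignoring the final merge. The following minimal Python sketch takes that approach, which is an assumption and not a rule stated in the lecture. Only the first coefficient (.203) comes from the slides; all other values are invented for illustration.

```python
def clusters_at_elbow(coefficients):
    """Number of clusters just before the largest jump in the coefficients.

    coefficients[i] belongs to stage i + 1. The last jump (from 2 clusters to
    1 cluster) is ignored, because it is usually large without being the elbow.
    """
    n_cases = len(coefficients) + 1
    jumps = [later - earlier for earlier, later in zip(coefficients, coefficients[1:])]
    candidate_jumps = jumps[:-1]  # drop the jump into the final merge
    stage_before_jump = candidate_jumps.index(max(candidate_jumps)) + 1
    return n_cases - stage_before_jump  # number of clusters = sample size - stage

# 14 stages for n = 15 cases; values after the first are hypothetical.
coefficients = [0.203, 0.25, 0.30, 0.33, 0.36, 0.40, 0.45,
                0.50, 0.55, 0.62, 0.70, 0.80, 1.90, 2.80]
print(clusters_at_elbow(coefficients))  # 3
```

A visual check of the chart remains advisable, since the largest jump is only a heuristic for where the curve bends.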

Slide 32

B) Dendrogram
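Choose the number of clusters at the largest increase in heterogeneity, i.e. cut the dendrogram where the standardized distance between two successive merges is largest.

[Figure: dendrogram over the standardized distance, with the largest increase in heterogeneity marked]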

Slide 33

Fourth step: Display and save cluster membership


Output table of cluster membership

If you're not sure about the number of clusters, choose a full range of solutions.

Example of brand awareness: assumed 3 clusters

Slide 34

Saving the cluster membership

Used for drawing a scatter plot, for example.

Range of solutions: 2 to 5

Example of brand awareness: assumed 3 clusters
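
Cluster membership for a range of solutions follows directly from the order of the merges. This is a minimal Python sketch, assuming a hypothetical merge list for five cases (not the SPSS output of the lecture) and a hypothetical helper `cluster_membership`.

```python
def cluster_membership(cases, merges, n_clusters):
    """Replay the first len(cases) - n_clusters merges and label the clusters.

    merges is a list of (cluster_1, cluster_2) pairs in stage order, where each
    cluster is named after its lowest case number.
    """
    label = {case: case for case in cases}  # every case starts in its own cluster
    for first, second in merges[: len(cases) - n_clusters]:
        for case in cases:
            if label[case] == second:
                label[case] = first
    # Number the clusters 1, 2, ... in order of first appearance in the case list.
    numbering = {}
    for case in cases:
        numbering.setdefault(label[case], len(numbering) + 1)
    return {case: numbering[label[case]] for case in cases}

# Hypothetical example: five cases and the order in which they were merged.
cases = [1, 2, 3, 4, 5]
merges = [(1, 2), (3, 4), (1, 3), (1, 5)]

for k in range(2, 6):  # range of solutions: 2 to 5 clusters
    print(k, cluster_membership(cases, merges, k))
```

The resulting mapping of case to cluster number plays the role of the saved membership variable, for example as a grouping variable in a scatter plot.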

Slide 35

Scatter plot: <Graphs><Chart Builder>

Slide 36

One point was assigned incorrectly

Slide 37

Fifth step: Interpretation of clusters


In the case of the brand awareness example, the interpretation is obvious and straightforward.

Taking into account means

The means of the clusters with respect to the original variables indicate how the clusters can be interpreted.

Example of Lecture 01: Marketing survey on consumer buying behavior

Questionnaire to ask people about their attitudes (data of 64 people).

Among other questions:
"What is your general attitude to life?" (variable x1)
"What is your attitude to innovation?" (variable x2)
"What is your willingness to take risks?" (variable x3)

Scales of variables vary from extremely negative (1) to extremely positive (7)

Attributes by object

Objects     general attitude to life   attitude to innovation   willingness to take risks
Person A    1                          2                        2
Person B    1                          3                        3
Person C    2                          4                        2
Person D    5                          4                        3
Person E    5                          4                        4
Person F    7                          6                        7

Slide 38

Mean of clusters, regarding the cluster variables

Cluster      general attitude to life   attitude to innovation   willingness to take risks
(A, B, C)    1.3                        3                        2.3
(D, E)       5                          4                        3.5
(F)          7                          6                        7

Cluster 1 (A, B, C): pessimistic people who live in fear
Cluster 2 (D, E): slightly optimistic ordinary people
Cluster 3 (F): life-affirming adventurer
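
The cluster means can be recomputed from the answers of the six persons. This is a minimal Python sketch; the value 4 for Person D's attitude to innovation is not legible in the extracted table and is inferred here from the cluster mean, so treat it as an assumption.

```python
# Answers as (general attitude to life, attitude to innovation,
# willingness to take risks). D's second value is an inferred assumption.
answers = {
    "A": (1, 2, 2), "B": (1, 3, 3), "C": (2, 4, 2),
    "D": (5, 4, 3), "E": (5, 4, 4), "F": (7, 6, 7),
}
clusters = {1: ["A", "B", "C"], 2: ["D", "E"], 3: ["F"]}

def cluster_means(answers, members):
    """Mean of each variable over the members of one cluster, to one decimal."""
    rows = [answers[person] for person in members]
    return tuple(round(sum(column) / len(column), 1) for column in zip(*rows))

for number, members in clusters.items():
    print(number, members, cluster_means(answers, members))
# 1 ['A', 'B', 'C'] (1.3, 3.0, 2.3)
# 2 ['D', 'E'] (5.0, 4.0, 3.5)
# 3 ['F'] (7.0, 6.0, 7.0)
```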
