Applied Data Analysis (With SPSS)

MSc Business Administration
Research Methodology: Tools

Applied Data Analysis (with SPSS)
Lecture 04: Cluster Analysis
March 2011
Prof. Dr. Jürg Schwarz
juerg.schwarz@hslu.ch
Slide 2
Contents
Aims ... 5
Introduction ... 6
Outline ... 9
Concepts of Cluster Analysis ... 10
Cluster Analysis with SPSS: A detailed example ... 24
Slide 3
Table of contents
Aims ... 5
  Aims of the lecture ... 5
Introduction ... 6
  Example ... 6
Outline ... 9
Concepts of Cluster Analysis ... 10
  Key steps in using a cluster analysis ... 10
  How to measure proximity ... 11
  Proximity measure with interval variables ... 13
  Proximity measure with binary variables ... 15
  How to form Clusters ... 18
    How to define similarity? ... 18
    Cluster formation tree (rules for cluster formation) ... 20
    Pros and cons ... 21
    Example of hierarchical method: Single linkage (nearest neighbor) ... 22
    Example of hierarchical method: Complete linkage (furthest neighbor) ... 23
Slide 4
Cluster Analysis with SPSS: A detailed example ... 24
  Marketing research: Customer survey on brand awareness ... 24
  SPSS Elements: <Analyze><Classify><Hierarchical ... 25
  First step: Measure of distance or similarity between objects ... 27
    Output ... 27
  Second step: Formation of clusters ... 28
    Between-groups linkage ... 28
    Dendrogram ... 29
  Third step: Determining the number of clusters ... 31
  Fourth step: Display and save cluster membership ... 33
    Output table of cluster membership ... 33
    Saving the cluster membership ... 34
    Scatter plot: <Graphs><Chart Builder> ... 35
  Fifth step: Interpretation of clusters ... 37
    Taking into account means ... 37
    Example of Lecture 01: Marketing survey on consumer buying behavior ... 37
Slide 5
Aims
Aims of the lecture
You know different types of measures of distance / similarity.
You know the key steps in conducting a cluster analysis.
You can conduct a cluster analysis with SPSS
(hierarchical agglomerative methods: between-groups linkage and Ward's method).
In particular, you know how to
choose the appropriate measure of distance / similarity
interpret the agglomeration schedule
use the dendrogram to determine the number of clusters
interpret the meaning of a cluster
Slide 6
Introduction
Example
Marketing research: Customer survey on brand awareness ("Markenbewusstsein")
Survey features
Sample of n = 150 customers
The brand awareness index consists of 3 items:
  How likely is it that you will use the brand again in the future?
  How likely would you be to recommend the brand to your friends?
  Overall, how satisfied are you with the brand?
Also included in the dataset: yearly income
[Figure: plot with the axes "Yearly income [Index]" and "Brand awareness [Index]"]
Slide 7
Question
Is there a linear relation between brand awareness and yearly income?
Hypothesis: The higher a person's income, the higher his/her brand awareness.
Conduct regression analysis with SPSS

Output (summarized)
Overall model test (F-test): significance p = .014
Test of coefficients: constant p = .000, income p = .014
Coefficient of determination: R Square = .040
The model is significant but fits very poorly: income explains only 4% of the variance in brand awareness.
There seems to be structure in the brand awareness dataset.
Slide 8
Question
Is there structure in the brand awareness dataset?
Are there clusters for the combination of yearly income and brand awareness?
Conduct cluster analysis

Output
SPSS identified 3 distinct clusters

Interpretation
People with low income are least aware because they lack money.
People with middle income have the highest brand awareness because of the dream of being richer.
People with high income are moderately brand aware because they have a certain status but don't need to show off.
Slide 9
Outline
Cluster analysis is a multivariate procedure for detecting natural groupings in data.
The grouping is based on the scores of several measures (e.g. income and awareness).
Goals in conducting cluster analysis
Elements within a group should be as similar as possible
<=> distance d (within a group) should be small
Similarities between the groups should be minimal
<=> distance D (between groups) should be large
Features
Because all information is used for grouping, cluster analysis is more objective than just a subjective impression.
There is no optical illusion.
Slide 10
Concepts of Cluster Analysis

Key steps in using a cluster analysis
1. Measure of distance or similarity between objects (also called proximity measure)
Depends on type of data: interval, counts, binary
Distance: geometrical measure. Similarity: content-related measure
2. Formation of clusters
Calculation of proximity matrix
Many different procedures: Hierarchical / non-hierarchical, agglomerative / divisive etc.
3. Tools / criteria for determining the number of clusters
Tools: Agglomeration schedule, structural chart, dendrogram, icicle plot ("Eiszapfen-Plot")
Criteria (not available in SPSS): F-value, information criterion, etc.
4. Display and save cluster membership
Done by SPSS
5. Interpretation of clusters
Taking into account means (possibly variances) of cluster members
Slide 11
How to measure proximity

From dataset (raw data) ...

            Variable 1   Variable 2   Variable 3   ...   Variable j
Object 1
Object 2
Object 3
:
Object k

... to proximity matrix (done by SPSS internally); each cell holds a distance or similarity

            Object 1   Object 2   Object 3   ...   Object k
Object 1
Object 2
Object 3
:
Object k
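The step from raw data to a proximity matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what SPSS does internally: the object names and the helper functions `euclidean` and `proximity_matrix` are hypothetical, the first two rows reuse the coordinates of the brand awareness example, and the other rows are made up.

```python
import math

# Hypothetical raw data: 4 objects (rows) measured on 2 variables (columns).
raw_data = {
    "object_1": [0.97, 2.95],
    "object_2": [1.67, 1.73],
    "object_3": [3.10, 0.40],   # made-up values
    "object_4": [0.80, 3.20],   # made-up values
}

def euclidean(x, y):
    """Euclidean distance between two equally long lists of values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def proximity_matrix(data, measure):
    """Return a nested dict holding the proximity of every pair of objects."""
    return {k: {l: measure(data[k], data[l]) for l in data} for k in data}

matrix = proximity_matrix(raw_data, euclidean)

# Print the symmetric k x k matrix; the diagonal is 0 for a distance measure.
names = list(raw_data)
print(" " * 10 + "".join(f"{n:>10}" for n in names))
for k in names:
    print(f"{k:<10}" + "".join(f"{matrix[k][l]:10.3f}" for l in names))
```

Any other proximity measure with the same two-argument signature could be passed in place of `euclidean`.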
Slide 12
Different proximity measures, depending on type of data

Measure allows specifying the distance (d) or similarity (s) to be used in clustering.
Interval (e.g. brand awareness, yearly income)
Euclidean distance (d)
City block distance (d)
Pearson correlation (s)
:
Counts (e.g. number of clients)
Chi-square measure (s)
Phi-square measure (s)
:
Binary (e.g. yes/no, female/male)
Euclidean distance (d)
Russel and Rao (s)
Simple matching (s)
Dice (s)
(only a selection of 27!)
Slide 13
Proximity measure with interval variables

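Example: Brand awareness
Coordinates {x-axis, y-axis} of the two persons: {0.97, 2.95} and {1.67, 1.73}
[Figure: right triangle between the two points with legs a and b and hypotenuse c = 1.407]
Theorem of Pythagoras for a right triangle
$a^2 + b^2 = c^2 \;\Rightarrow\; c = \sqrt{a^2 + b^2}$
Distance between "pers_001" and "pers_002"
$d_{001,002} = \left[ (1.67 - 0.97)^2 + (2.95 - 1.73)^2 \right]^{1/2} = \left[ 0.490 + 1.488 \right]^{1/2} = 1.407$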
Slide 14
Generalized equation
Minkowski distance (Hermann Minkowski, 1864 - 1909, German mathematician)
$d_{k,l} = \left[ \sum_{j=1}^{J} \left| x_{kj} - x_{lj} \right|^{r} \right]^{1/r}$
r = Minkowski's constant
$d_{k,l}$ = Distance between objects k and l (e.g. distance between persons 001 and 002)
J = Number of cluster variables (e.g. variables income and awareness)
$x_{kj}$, $x_{lj}$ = Values of variable j of objects k and l (e.g. income of persons 001 and 002)
Values of Minkowski's constant
r = 1: City block distance (also called L1-norm, Manhattan distance or taxi distance)
r = 2: Euclidean distance (also called L2-norm)
[Figure: the L1 path (along the axes) and the L2 path (straight line) between two points]
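As a quick check of the generalized equation, the following minimal Python sketch evaluates the Minkowski distance for r = 1 and r = 2 on the two points of the brand awareness example. The function name `minkowski` and the variable names are hypothetical.

```python
def minkowski(x, y, r):
    """Minkowski distance between two objects for a constant r >= 1."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

# The two points of the brand awareness example.
p = [0.97, 2.95]
q = [1.67, 1.73]

city_block = minkowski(p, q, r=1)   # |0.70| + |1.22|
euclid = minkowski(p, q, r=2)       # square root of 0.490 + 1.488

print(round(city_block, 3), round(euclid, 3))
```

With r = 2 the result reproduces the Euclidean distance of 1.407; with r = 1 the two coordinate differences are simply added, which gives a larger value.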
Slide 15
Proximity measure with binary variables

Example: Car configuration
Identification of similarities between two objects by means of comparison

Configuration   Mercedes   BMW   Case
ABS             0          0     D
Airbag          1          1     A
ESP             1          1     A
Navi            1          0     B
Metallic        0          1     C

0 = feature not present, 1 = feature present
4 Cases
A = Feature exists in both comparison objects
B, C = Feature exists in only one of the comparison objects
D = Feature exists in neither of the comparison objects
Non-existence is also an important similarity in the definition of proximity
Slide 16
a = Number of features falling under case "A"
b = Number of features falling under case "B"
:
The proximity measure between two objects i and j depends on whether and how
the cases are included and how they are weighted (weights $\lambda$, $\delta_1$ and $\delta_2$).
Binary proximity measures
General case: Simple Matching Coefficient*
$S_{ij} = \frac{a + \delta_1 d}{a + \lambda (b + c) + \delta_2 d}$

Variant           Description                                                              Definition
Russel and Rao    Case d reduces proximity                                                 $S_{ij} = \frac{a}{a + b + c + d}$
Simple matching   Case d raises proximity                                                  $S_{ij} = \frac{a + d}{a + b + c + d}$
Dice              Case d is not taken into account; similar features are weighted more     $S_{ij} = \frac{2a}{2a + b + c}$

*Sokal, R.R. and Michener, C.D., Statistical method for evaluating systematic relationships, University of Kansas Science Bulletin, 38:1409-1438, 1958.
Slide 17
Example: Car configuration

Configuration   Mercedes   BMW   Case
ABS             0          0     D
Airbag          1          1     A
ESP             1          1     A
Navi            1          0     B
Metallic        0          1     C

0 = feature not present, 1 = feature present
Count of cases: a = 2, b = 1, c = 1, d = 1

Measure           Proximity
Russel and Rao    $S_{ij} = \frac{2}{2 + 1 + 1 + 1} = \frac{2}{5} = 0.4$
Simple matching   $S_{ij} = \frac{2 + 1}{2 + 1 + 1 + 1} = \frac{3}{5} = 0.6$
Dice              $S_{ij} = \frac{2 \cdot 2}{2 \cdot 2 + 1 + 1} = \frac{4}{6} = 0.67$
Some remarks
Sij varies between 0 and 1
There is no "right" proximity measure
Important question/decision:
Is non-existence important?
(<=> taking case d into account?)
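The counts a, b, c, d and the three similarity measures of the car configuration example can be reproduced with a short Python sketch. The function names are hypothetical, and treating case B as "present in the first object only" and case C as "present in the second object only" is an assumption; it does not affect the results here because b = c.

```python
def count_cases(x, y):
    """Count the four cases for two binary feature vectors of equal length."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # present in both
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # only in first (assumed)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # only in second (assumed)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # present in neither
    return a, b, c, d

def russel_rao(a, b, c, d):
    return a / (a + b + c + d)

def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)

def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)

# Features in the order ABS, Airbag, ESP, Navi, Metallic.
mercedes = [0, 1, 1, 1, 0]
bmw = [0, 1, 1, 0, 1]

counts = count_cases(mercedes, bmw)
print(counts)
for measure in (russel_rao, simple_matching, dice):
    print(measure.__name__, round(measure(*counts), 2))
```

The only difference between the three functions is how case d enters the formula, which is exactly the decision raised in the remarks above.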
Slide 18
How to form Clusters

[Figure: cluster A and cluster B with the three distances 1., 2. and 3. defined below]
How to define similarity?

Similarity between cluster A and cluster B is measured by
1. Nearest neighbor (also called single linkage in the cluster formation tree on slide 20)
... the minimum of all possible distances between the cases in cluster A and the cases in B.
2. Centroid clustering (also called other linkage)
... the distance between the centroids of cluster A and of cluster B.
3. Furthest neighbor (also called complete linkage)
... the maximum of all possible distances between the cases in cluster A and the cases in B.
Slide 19
Similarity between cluster A and cluster B is measured by

Between-groups linkage (also called average linkage)
... the average of all the possible distances between the cases in cluster A and the cases in B.
Within-groups linkage (also called other linkage)
... the average of all the possible distances between the cases within a single new cluster
determined by combining cluster A and cluster B.
Median clustering (also called other linkage)
... the distance between the SPSS determined median for the cases in cluster A and the median
for the cases in cluster B.
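The definitions above differ only in which distances between the two clusters are summarized and how. The following Python sketch computes five of them for two small clusters; the coordinates and variable names are hypothetical, and median clustering is left out.

```python
import math
from itertools import combinations

# Hypothetical clusters, each a list of (x, y) cases.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

def centroid(cluster):
    """Mean of each coordinate over the cases in a cluster."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))

# All distances between a case in A and a case in B.
between = [math.dist(p, q) for p in cluster_a for q in cluster_b]

single = min(between)                    # nearest neighbor
complete = max(between)                  # furthest neighbor
average = sum(between) / len(between)    # between-groups linkage
centroid_dist = math.dist(centroid(cluster_a), centroid(cluster_b))

# Within-groups linkage: average over all pairs in the combined cluster.
merged = cluster_a + cluster_b
within_pairs = [math.dist(p, q) for p, q in combinations(merged, 2)]
within = sum(within_pairs) / len(within_pairs)

for name, value in [("single", single), ("complete", complete),
                    ("average", average), ("centroid", centroid_dist),
                    ("within", within)]:
    print(f"{name:>10}: {value:.3f}")
```

For these points the single-linkage distance is the smallest and the complete-linkage distance the largest, with the averaged measures in between.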
Special case, taking into account sum of squares
Ward's method
For a cluster the sum of squares is the sum of squared distances of each case from the centroid.
Sum of squared distances
$d_1^2 + d_2^2 + \dots = \sum_{i=1}^{k} d_i^2$
[Figure: cases of a cluster with their distances $d_1$, $d_2$, ... from the centroid]
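The sum of squares can be computed directly from its definition. The sketch below uses hypothetical coordinates and function names. It also shows the quantity that Ward's method is generally described as minimizing at each step, namely the increase in the total sum of squares caused by a merge; that criterion is added here as background and is not spelled out on the slide.

```python
def centroid(cluster):
    """Mean of each coordinate over the cases in a cluster."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))

def sum_of_squares(cluster):
    """Sum of squared distances of each case from the cluster centroid."""
    center = centroid(cluster)
    return sum(
        sum((x - m) ** 2 for x, m in zip(case, center))
        for case in cluster
    )

# Hypothetical clusters, each a list of (x, y) cases.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

ss_a = sum_of_squares(cluster_a)
ss_b = sum_of_squares(cluster_b)
ss_merged = sum_of_squares(cluster_a + cluster_b)

# Increase in the total sum of squares if A and B were merged.
increase = ss_merged - (ss_a + ss_b)
print(ss_a, ss_b, ss_merged, increase)
```

Two clusters that are already compact and lie close together produce a small increase, so they are merged early.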
Slide 20
Cluster formation tree (rules for cluster formation)
There are several types of clustering procedures:

Cluster algorithms
  Hierarchical
    Agglomerative
      Linkage methods
        Single linkage
        Complete linkage
        Average linkage
        Other linkage
      Variance methods
        Ward's procedure
    Divisive
  Non-hierarchical
    k-Means procedure

Used in this course: average linkage (between-groups linkage) and Ward's procedure.
Non-hierarchical clustering is also called k-means clustering.
Average linkage between groups is the default in SPSS ("Between-groups linkage")
Slide 21
Pros and cons
Hierarchical clustering
No a priori decision about the number of clusters
Can be very slow
Non-hierarchical clustering
Need to specify the number of clusters (can be an arbitrary number)
Faster, more reliable
Features (procedure: proximity measure, remark)
Single linkage: tendency to form chains
Complete linkage: tendency to smaller groups of same size
Average linkage: "between" single and complete linkage
Other linkage: only distance, no remark
Ward's method: only distance, tendency to groups of same size
Slide 22
Example of hierarchical method: Single linkage (nearest neighbor)

Tendency to form chains
Suitable for the detection of outliers
Close groups are badly separated
[Figure: step k and step k + 1 of single linkage; the nearest neighbor is added, so that a "chain" forms. Source: Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)]
Slide 23
Example of hierarchical method: Complete linkage (furthest neighbor)

Tendency to form smaller groups with same size
Not suitable for detecting outliers
[Figure: step k and step k + 1 of complete linkage, based on the furthest neighbor. Source: Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)]
Slide 24
Cluster Analysis with SPSS: A detailed example

Marketing research: Customer survey on brand awareness
Data
Random sub-sample of n = 15
(Why this small sub-sample?
Just to keep track of what SPSS does.)
Data set: cluster_small.sav

Syntax: cluster_small.sps
Slide 25
SPSS Elements: <Analyze><Classify><Hierarchical ..
Slide 26
Syntax
CLUSTER income awareness          Variables included
  /METHOD BAVERAGE                Cluster method "Between-groups linkage"
  /MEASURE=EUCLID                 Proximity measure "Euclidean distance"
  /ID=person                      Labels cases in plots and tables
  /PRINT SCHEDULE                 Schedule of cluster algorithm
  /PRINT DISTANCE                 Matrix with distances ("Proximity Matrix")
  /PLOT DENDROGRAM VICICLE.       Tools for determining the number of clusters

Cluster method "Between-groups linkage" (which is default)
<=> A better choice would be Ward's method. Here "Between-groups linkage" was used to
show in detail how SPSS runs a cluster analysis.
Proximity measure "Euclidean distance" (used to show how SPSS performs a cluster analysis)
<=> The squared Euclidean measure (which is default) should be used when the
BAVERAGE, CENTROID, MEDIAN, or WARD cluster method is requested.
Slide 27
First step: Measure of distance or similarity between objects

Output
Proximity Matrix (Distances or similarities between items)
Example: Distance between cases 9 and 7
Values represent Euclidean distances
Slide 28
Second step: Formation of clusters
Agglomeration schedule: Displays the clusters combined at each stage.
Between-groups linkage
Stage 1: Cases 7 and 9 have smallest distance ("Coefficients" = .203) => first cluster {7,9}
First cluster {7,9} will be clustered with case 10 in stage 5 => cluster {7,9,10}
Stage 2: Cases 13 and 14 have second smallest distance => second cluster {13,14}
Second cluster {13,14} will be clustered with case 11 in stage 3 => cluster {11,13,14}
:
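The stage-by-stage logic behind an agglomeration schedule can be sketched in plain Python. This is a minimal illustration of agglomerative clustering with between-groups (average) linkage on Euclidean distances, not a reproduction of the SPSS implementation or of the data in cluster_small.sav. The five cases and the function name `average_linkage_schedule` are hypothetical.

```python
import math
from itertools import combinations

def average_linkage_schedule(cases):
    """Merge clusters step by step and record (stage, cluster 1, cluster 2, coefficient).

    cases maps a case id to a tuple of coordinates. A merged cluster keeps
    the lower of the two ids, as in the agglomeration schedule.
    """
    clusters = {case_id: [case_id] for case_id in cases}
    schedule = []
    stage = 0
    while len(clusters) > 1:
        best = None
        for k, l in combinations(sorted(clusters), 2):
            # Average of all distances between cases in cluster k and cluster l.
            dists = [math.dist(cases[i], cases[j])
                     for i in clusters[k] for j in clusters[l]]
            coefficient = sum(dists) / len(dists)
            if best is None or coefficient < best[0]:
                best = (coefficient, k, l)
        coefficient, k, l = best
        stage += 1
        schedule.append((stage, k, l, round(coefficient, 3)))
        clusters[k] = clusters[k] + clusters.pop(l)
    return schedule

# Hypothetical cases: id -> (income, awareness).
cases = {
    1: (1.0, 1.0),
    2: (1.2, 1.0),
    3: (4.0, 4.0),
    4: (4.0, 4.5),
    5: (1.0, 2.0),
}

schedule = average_linkage_schedule(cases)
for stage, k, l, coefficient in schedule:
    print(f"Stage {stage}: clusters {k} and {l} combined, coefficient {coefficient}")
```

In this sketch cases 1 and 2 are combined first because they have the smallest distance, and case 5 joins their cluster in stage 3. The coefficients grow from stage to stage, which is what the tools for choosing the number of clusters rely on.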
Slide 29
Dendrogram
[Figure: dendrogram with the merges of stages 1, 2, 3 and 5 marked]
Slide 30
Icicle plot
The figure is called an icicle plot because the columns look like icicles hanging from above.
The plot shows how cases are merged into clusters.
Read it from bottom to top
14 clusters: Cases 7 and 9 in one cluster, all others each in their own clusters.
13 clusters: 7 and 9 in one cluster, 13 and 14 in one cluster, all others each in their clusters.
12 clusters: 7 and 9 in one cluster, 11, 13 and 14 in one cluster, all others each in their clusters.
:
Slide 31
Third step: Determining the number of clusters

0) Theoretical and empirical reasons (But, be careful about optical illusion!)
In the case of brand awareness there are some indications for three clusters.
A) Elbow criterion in the structure chart (can't be done with SPSS, but with Excel)
[Figure: structure chart of proximity ("Coefficients", 0.0 to 3.0) against the number of clusters (= sample size - "Stage"); elbow => choose 3 clusters]
Please note: There is usually a large jump between the 1-cluster and the 2-cluster solution, which is not the "elbow".
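One simple way to make the elbow criterion concrete is to look for the largest increase in the coefficients while ignoring the final merge from 2 clusters to 1. The Python sketch below does this under stated assumptions: the function name `elbow` is hypothetical, and the coefficients are made up for a sample of 15 cases (only the first value, .203, is taken from the example). Taking the largest jump is a heuristic stand-in for judging the chart by eye.

```python
def elbow(coefficients):
    """Return (number of clusters, jump) at the largest increase in the coefficients.

    coefficients[i] belongs to stage i + 1. The last stage (2 -> 1 clusters)
    is skipped, because that jump is usually large without being the elbow.
    """
    n = len(coefficients) + 1               # sample size: n - 1 merge stages
    best_jump, best_k = 0.0, None
    for i in range(1, len(coefficients) - 1):
        jump = coefficients[i] - coefficients[i - 1]
        clusters_before = n - i             # clusters left before stage i + 1
        if jump > best_jump:
            best_jump, best_k = jump, clusters_before
    return best_k, best_jump

# Hypothetical agglomeration coefficients for stages 1 to 14 (n = 15 cases).
coefficients = [0.203, 0.25, 0.31, 0.35, 0.40, 0.46, 0.52,
                0.60, 0.68, 0.75, 0.85, 0.95, 1.90, 2.80]

k, jump = elbow(coefficients)
print(f"Suggested number of clusters: {k} (increase of {jump:.2f})")
```

With these made-up values the merge that would reduce 3 clusters to 2 causes the largest increase, so the sketch suggests stopping at 3 clusters.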
Slide 32
B) Dendrogram
Choose the number of clusters at the largest increase in heterogeneity
[Figure: dendrogram over the standardized distance, with the largest increase in heterogeneity marked]
Slide 33
Fourth step: Display and save cluster membership

Output table of cluster membership
If you're not sure about the number of clusters, choose a full range
Example of brand awareness: assumed 3 clusters
Slide 34
Saving the cluster membership
Used for drawing a scatter plot, for example.
Range of solutions: 2 to 5
Example of brand awareness: assumed 3 clusters
Slide 35
Scatter plot: <Graphs><Chart Builder>
Slide 36
One point was assigned incorrectly
Slide 37
Fifth step: Interpretation of clusters

In the case of the brand awareness example, the interpretation is obvious and straightforward.
Taking into account means
The means of the clusters with respect to the original variables indicate how the clusters can be interpreted.
Example of Lecture 01: Marketing survey on consumer buying behavior
Questionnaire to ask people about their attitudes.
Among other questions:
"What is your general attitude to life?" (variable x1)
"What is your attitude to innovation?" (variable x2)
"What is your willingness to take risks?" (variable x3)
Scales of variables vary from extremely negative (1) to extremely positive (7)
Data of 6 people

            Attributes
Objects     general attitude to life   attitude to innovation   willingness to take risks
Person A    1                          2                        2
Person B    1                          3                        3
Person C    2                          4                        2
Person D    5                          4                        3
Person E    5                          4                        4
Person F    7                          6                        7
Slide 38
Mean of clusters, regarding the cluster variables

             Attributes
Cluster      general attitude to life   attitude to innovation   willingness to take risks
(A, B, C)    1.3                        3                        2.3
(D, E)       5                          4                        3.5
(F)          7                          6                        7

Cluster 1 (A, B, C): pessimistic people who live in fear
Cluster 2 (D, E): slightly optimistic ordinary people
Cluster 3 (F): life-affirming adventurer
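The cluster means used for this interpretation are plain averages of the members' values on each variable. The following Python sketch recomputes them. The values are as read from the attribute table; the rows for persons D and E are a reconstruction that is consistent with the reported means, so treat them as an assumption. The function name `cluster_means` is hypothetical.

```python
# Attribute values (x1, x2, x3) per person; D and E are reconstructed (assumption).
persons = {
    "A": (1, 2, 2),
    "B": (1, 3, 3),
    "C": (2, 4, 2),
    "D": (5, 4, 3),
    "E": (5, 4, 4),
    "F": (7, 6, 7),
}

# Cluster membership as given in the example.
clusters = {1: ["A", "B", "C"], 2: ["D", "E"], 3: ["F"]}

def cluster_means(persons, members):
    """Mean of each variable over the members of one cluster, rounded to 1 decimal."""
    rows = [persons[m] for m in members]
    return tuple(round(sum(col) / len(rows), 1) for col in zip(*rows))

means = {c: cluster_means(persons, members) for c, members in clusters.items()}
for c, values in means.items():
    print(f"Cluster {c} {clusters[c]}: {values}")
```

Low means on all three variables lead to the "pessimistic" label for cluster 1, and high means on all three to the "adventurer" label for cluster 3.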
