
Cluster Analysis

Segmenting the market

Cluster Analysis (also called classification analysis or numerical taxonomy):
a class of techniques used to classify objects or cases into relatively homogeneous groups, called clusters, based on the set of variables considered. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.
Objects: either variables or observations.
Likeness: calculated from the measurements for each object.

Applications:
1. Market segmentation: e.g., benefit segmentation, clustering consumers on the basis of benefits sought from the purchase of a product.
2. Understanding buyer behavior: e.g., by clustering consumers to identify homogeneous groups, a firm can examine the buying behavior or information-seeking behavior of each group.
3. Identifying new product opportunities: e.g., by clustering brands and products to identify competitive sets within the market, a firm can examine its current offerings compared to those of its competitors to identify potential new product opportunities.
4. Selecting test markets: e.g., by clustering cities into homogeneous clusters, a firm can select comparable cities to test various marketing strategies.

Distance measures for individual observations

To measure similarity between two observations, a distance measure is needed.
With a single variable, similarity is straightforward. Example: income. Two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases.
Multiple variables require an aggregate distance measure. With many characteristics (e.g., income, age, consumption habits, brand loyalty, purchase frequency, family composition, education level, ...), it becomes more difficult to define similarity with a single value.
The best-known measure of distance is the Euclidean distance, the concept we use in everyday life for spatial coordinates.

Model:
Data: each object is characterized by a set of numbers (measurements), e.g.,
object 1: (x11, x12, ..., x1n)
object 2: (x21, x22, ..., x2n)
...
object p: (xp1, xp2, ..., xpn)
Distance: Euclidean distance, dij:

d_ij = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2 )

Example

Household   Income   Size
A           50K      5
B           50K      4
C           20K      2
D           20K      1

With income expressed in units of 10K, the Euclidean distances include, for example:
d(A, C) = sqrt(3^2 + 3^2) = 4.24
d(B, C) = sqrt(3^2 + 2^2) = 3.61
Three Cluster Diagram Showing Between-Cluster and Within-Cluster Variation

Between-Cluster Variation = Maximize
Within-Cluster Variation = Minimize

Scatter Diagram for Cluster Observations

[Figure: scatter diagram of frequency of eating out (low to high) versus frequency of going to fast food restaurants (low to high)]


Comparison of Score Profiles for Factor Analysis and Hierarchical Cluster Analysis

[Figure: profile chart of scores (1 to 7) across the variables for Respondents A, B, C and D]

Clustering procedures
Hierarchical procedures
Agglomerative (start from n clusters to
get to 1 cluster)
Divisive (start from 1 cluster to get to n
clusters)
Non-hierarchical procedures
K-means clustering

Hierarchical clustering
Agglomerative:
Each of the n observations constitutes a separate cluster
The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
In the second step another cluster is formed (n-2 clusters), by nesting the two clusters that are most similar, and so on
There is a merging in each step until all observations end up in a single
cluster in the final step.

Divisive
All observations are initially assumed to belong to a single cluster
The most dissimilar observation(s) are extracted to form a separate cluster
In step 1 there will be 2 clusters, in the second step three clusters and so on, until the final step produces as many clusters as the number of observations. This technique is used mainly in medical research and is outside the scope of our course.

The number of clusters determines the stopping rule for the algorithms
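The agglomerative procedure can be sketched with SciPy (an assumption; the course itself uses SPSS): `linkage` performs the successive merges, and `fcluster` applies the stopping rule by cutting the hierarchy at a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up observations on two variables, forming two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Agglomerative clustering: start from 6 singleton clusters and merge
# the two closest clusters at each of the n-1 steps
Z = linkage(X, method="average", metric="euclidean")

# Stopping rule: cut the hierarchy to obtain exactly 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # first three observations share one label, last three the other
```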

Non-hierarchical clustering
These algorithms do not follow a hierarchy and produce a
single partition
Knowledge of the number of clusters (c) is required
In the first step, initial cluster centres (the seeds) are
determined for each of the c clusters, either by the
researcher or by the software.
Each iteration allocates observations to each of the c
clusters, based on their distance from the cluster centres
Cluster centres are computed again and observations may
be reallocated to the nearest cluster in the next iteration
When no observations can be reallocated or a stopping rule
is met, the process stops

Distance between clusters

Algorithms vary according to the way the distance between two clusters is defined. The most common algorithms for hierarchical methods include:
centroid method
single linkage method
complete linkage method
average linkage method
Ward algorithm

Linkage methods
Single linkage method (nearest neighbour):
distance between two clusters is the minimum
distance among all possible distances between
observations belonging to the two clusters.
Complete linkage method (furthest neighbour):
nests two clusters using as a basis the maximum
distance between observations belonging to
separate clusters.
Average linkage method: the distance between
two clusters is the average of all distances
between observations in the two clusters.
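The three linkage rules reduce to the min, max, and mean of the same matrix of pairwise distances. A minimal sketch with made-up data, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters of observations (second coordinate fixed at 0)
cluster_1 = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_2 = np.array([[4.0, 0.0], [6.0, 0.0]])

d = cdist(cluster_1, cluster_2)  # all four pairwise distances: 4, 6, 3, 5

single   = d.min()    # nearest neighbour          -> 3.0
complete = d.max()    # furthest neighbour         -> 6.0
average  = d.mean()   # mean of all four distances -> 4.5
print(single, complete, average)
```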

Ward algorithm
1. The sum of squared distances is computed within each cluster, considering all distances between observations within the same cluster.
2. The algorithm proceeds by choosing the aggregation of two clusters which generates the smallest increase in the total sum of squared distances.
It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.
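A minimal sketch of one Ward step in pure NumPy (the data are made up; the "sum of squared distances" is implemented here as squared distances to the cluster centroid, the usual formulation of Ward's criterion): try every pairwise merge and keep the one with the smallest increase in the total within-cluster sum of squares.

```python
import numpy as np
from itertools import combinations

def wss(cluster):
    """Within-cluster sum of squared distances to the centroid."""
    centroid = cluster.mean(axis=0)
    return float(((cluster - centroid) ** 2).sum())

# Three current clusters (made-up observations)
clusters = [np.array([[0.0, 0.0], [0.0, 1.0]]),
            np.array([[5.0, 5.0], [5.0, 6.0]]),
            np.array([[0.5, 0.5]])]

# One Ward step: evaluate every candidate merge
best_pair, best_increase = None, float("inf")
for i, j in combinations(range(len(clusters)), 2):
    merged = np.vstack([clusters[i], clusters[j]])
    increase = wss(merged) - wss(clusters[i]) - wss(clusters[j])
    if increase < best_increase:
        best_pair, best_increase = (i, j), increase

print(best_pair)  # (0, 2): the two nearby clusters are merged
```

The loop over all pairs at every step is exactly what makes the method computationally intensive.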

Non-hierarchical clustering: K-means method
1. The number k of clusters is fixed
2. An initial set of k seeds (aggregation centres) is provided, e.g., the first k elements
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Units can be reassigned in successive steps (optimising partitioning)
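The k-means steps above can be sketched in a few lines of NumPy (a sketch, not production code: it seeds with the first k elements as described above and ignores the empty-cluster case):

```python
import numpy as np

def k_means(X, k, max_iter=100):
    centres = X[:k].copy()                 # first k elements as seeds
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # assign each unit to the nearest cluster seed
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the seeds as the cluster means
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop when no reclassification changes the centres
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Made-up data; the first two rows come from different groups,
# so "first k elements" is a workable seeding here
X = np.array([[1.0, 1.0], [8.0, 8.0], [1.1, 0.9],
              [0.9, 1.0], [8.1, 7.9], [7.9, 8.1]])
labels, centres = k_means(X, k=2)
print(labels)  # observations 0, 2, 3 form one cluster; 1, 4, 5 the other
```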

Hierarchical vs. non-hierarchical methods

Hierarchical methods:
No decision about the number of clusters needed in advance
Problems when data contain a high level of error
Can be very slow; preferable with small data sets
At each step they require computation of the full proximity matrix

Non-hierarchical methods:
Faster, more reliable, work with large data sets
Need to specify the number of clusters
Need to set the initial seeds
Only cluster distances to seeds need to be computed in each iteration

How many clusters?

There are no hard and fast rules; consider:
a. theoretical, conceptual, or practical considerations;
b. the distances at which clusters are combined in a hierarchical clustering;
c. the relative size of the clusters should be meaningful, etc.
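Criterion (b) can be made concrete: in a hierarchical run, look for a large jump in the merging distances. A sketch with simulated data, assuming SciPy is available:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Simulate two well-separated groups of 10 observations each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
               rng.normal(5.0, 0.3, (10, 2))])

Z = linkage(X, method="ward")

# Column 2 of the linkage matrix holds the distance of each merge.
# A big jump between successive merges suggests stopping before it.
merge_dist = Z[:, 2]
jumps = np.diff(merge_dist)

# Here the largest jump is at the final merge (2 clusters -> 1),
# so a 2-cluster solution is indicated
print(int(np.argmax(jumps)) == len(Z) - 2)  # True
```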

Outliers
An outlier will affect your cluster solution if you don't remove it!
It will also affect your cluster solution if you do remove it (small sample size)!

Should we standardize clustering variables?
What is the effect of multi-collinearity in cluster analysis?
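On the first question, a small numeric sketch (made-up numbers) shows why standardization usually matters: with income in raw dollars it swamps household size, while z-scoring puts both variables on the same footing.

```python
import numpy as np

# Three respondents: (income in dollars, household size)
X = np.array([[50000.0, 5.0],
              [50000.0, 1.0],
              [20000.0, 5.0]])

def dist(i, j, data):
    return float(np.linalg.norm(data[i] - data[j]))

# Raw data: income dominates the distance completely
print(dist(0, 1, X), dist(0, 2, X))  # 4.0 vs 30000.0

# z-score standardization: subtract the mean, divide by the std dev
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(round(dist(0, 1, Z), 2), round(dist(0, 2, Z), 2))  # both 2.12
```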

Cluster Analysis Variable Selection

Variables are typically measured metrically, but the technique can be applied to non-metric variables with caution.
Variables are logically related to a single underlying concept or construct.

Variable  Description                                                          Type

Work Environment Measures
X1        I am paid fairly for the work I do.                                  Metric
X2        I am doing the kind of work I want.                                  Metric
X3        My supervisor gives credit and praise for work well done.            Metric
X4        There is a lot of cooperation among the members of my work group.    Metric
X5        My job allows me to learn new skills.                                Metric
X6        My supervisor recognizes my potential.                               Metric
X7        My work gives me a sense of accomplishment.                          Metric
X8        My immediate work group functions as a team.                         Metric
X9        My pay reflects the effort I put into doing my work.                 Metric
X10       My supervisor is friendly and helpful.                               Metric
X11       The members of my work group have the skills and/or training
          to do their job well.                                                Metric
X12       The benefits I receive are reasonable.                               Metric

Relationship Measures
X13       I have a sense of loyalty to McDonald's restaurant.                  Metric
X14       I am willing to put in a great deal of effort beyond that
          expected to help McDonald's restaurant to be successful.             Metric
X15       I am proud to tell others that I work for McDonald's restaurant.     Metric

Classification Variables
X16       Intention to Search                                                  Metric
X17       Length of Time an Employee                                           Nonmetric
X18       Work Type = Part-Time vs. Full-Time                                  Nonmetric
X19       Gender                                                               Nonmetric
X20       Age                                                                  Metric
X21       Performance                                                          Metric

Using SPSS to Identify Clusters

For this example we are looking for subgroups among all 63 employees of McDonald's restaurant using the organizational commitment variables. The SPSS click-through sequence is: Analyze > Classify > Hierarchical Cluster. This will take you to a dialog box where you select and move variables X13, X14 and X15 into the Variables box. Next go to the Statistics box; Agglomeration schedule is selected as the default option, and Cluster membership is set to None by default. We shall continue with the default options here. Next click on the Plots box, check Dendrogram, and in the Icicle window click on the None button, then Continue. Next click on the Method box and select Ward's under Cluster Method (it is the last option). Squared Euclidean distance is the default under Measure and we will use it; we do not need to standardize these data. We will not select anything in the Save option now. Now click OK to run the program.

Notice the change in the coefficients in the last two stages.

Identify the number of clusters from the dendrogram.

Using SPSS to Identify Clusters

The next step in the SPSS click-through sequence is: Analyze > Classify > K-Means Cluster. This will take you to a dialog box where you select and move variables X13, X14 and X15 into the Variables box. In the Number of Clusters box, put 3 in place of 2. Next go to the Save box and check Cluster membership. Next click on Options, uncheck the Initial cluster centers option and check ANOVA table. Now click OK to run the program.


Determine if clusters exist . . . Run ANOVA with the cluster IDs and the organizational commitment variables.


ANOVA

1. Move the three cluster variables into the Dependent List window.
2. Move the cluster ID variable into the Factor window.
3. Click on Options, check Descriptive, next Continue, and then OK.


Step 1: Determine if clusters exist? 2-Cluster ANOVA Results

Three issues to examine: (1) statistical significance, (2) cluster sample sizes, and (3) variable means.

Conclusion:
Cluster 1 = More Committed
Cluster 2 = Less Committed

Step 2: Determine if clusters exist? 3-Cluster ANOVA

Must run post-hoc tests:
1. Take the 2-cluster ID variable out and insert the 3-cluster ID variable.
2. Click on the Post Hoc button and check Scheffe.


Conclusions:
Cluster 1 = Least Committed
Cluster 2 = Moderately Committed
Cluster 3 = Most Committed
Individual cluster sample sizes are OK.
Clusters are significantly different, but we must examine the post hoc tests.


Step 2: Determine if clusters exist? 3-Cluster ANOVA


Step 3: Determine if clusters exist? 4-Cluster ANOVA

1. Remove the 3-cluster ID variable and insert the 4-cluster ID variable.
2. Click OK to run.


Determine if clusters exist? 4-Cluster ANOVA

Conclusions:
1. Group sample sizes are still OK.
2. Clusters are significantly different.
3. The means of the four clusters are more difficult to interpret; we may want to examine polar extremes. The most likely approach is to combine clusters 1 and 2 and do a three-cluster solution, or to remove groups 1 and 2 and compare the extreme groups (3 & 4).


Four-Cluster ANOVA Post Hoc Results

1. All clusters are significantly different.
2. The largest differences are consistently between clusters 3 and 4.


Error Reduction:
1 to 2 clusters = 58.4%
2 to 3 clusters = 25.5%
3 to 4 clusters = 22.8%
4 to 5 clusters = 22.2%
Conclusion: the benefit is similar or smaller after 3 clusters.

Decide number of clusters . . .

1. Examine the cluster analysis Agglomeration Schedule.
2. Consider cluster sample sizes.
3. Consider statistical significance.
4. Evaluate differences in cluster means.
5. Evaluate interpretation & communication issues.

[Figure: plot of agglomeration error coefficients by number of clusters]


Step 4: Describe cluster characteristics . . .

1. Use ANOVA.
2. Remove the clustering variables from the Dependent List window.
3. Insert the demographic variables.
4. Change the Factor variable if necessary.


Step 4: Describe cluster characteristics . . .

1. Go to Variable View.
2. Under the Values column, click on None beside the variable for the number of cluster groups you will examine.
3. Assign value labels to each cluster.
4. Run ANOVA on the demographics.


Describe demographic characteristics

Conclusions for the 3-cluster solution:
Clusters are significantly different.
The more committed cluster (you must know the coding to interpret) is:
less likely to search (lower mean)
full-time employees (code = 0)
female (code = 1)
higher performers (higher mean)

Thank you