
Cluster Analysis

Segmenting the market

Cluster Analysis (also called classification analysis or numerical taxonomy):
a class of techniques used to classify objects or cases into relatively homogeneous groups, called clusters, based on the set of variables considered. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.
Objects: either variables or observations.
Likeness: calculated from the measurements for each object.

Applications:
1. Market segmentation: e.g., benefit segmentation, clustering consumers on the basis of benefits sought from the purchase of a product.
2. Understanding buyer behavior: e.g., by clustering consumers to identify homogeneous groups, a firm can examine the buying behavior or information-seeking behavior of each group.
3. Identifying new product opportunities: e.g., by clustering brands and products to identify competitive sets within the market, a firm can examine its current offerings compared to those of its competitors to identify potential new product opportunities.
4. Selecting test markets: e.g., by clustering cities into homogeneous clusters, a firm can select comparable cities to test various marketing strategies.

Distance measures for individual observations

To measure similarity between two observations, a distance measure is needed.
With a single variable, similarity is straightforward. Example: income. Two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases.
Multiple variables require an aggregate distance measure. With many characteristics (e.g., income, age, consumption habits, brand loyalty, purchase frequency, family composition, education level, ...), it becomes more difficult to define similarity with a single value.
The best-known measure of distance is the Euclidean distance, the concept we use in everyday life for spatial coordinates.

Model:
Data: each object is characterized by a set of numbers (measurements), e.g.,
object 1: (x11, x12, ..., x1n)
object 2: (x21, x22, ..., x2n)
...
object p: (xp1, xp2, ..., xpn)
Distance: Euclidean distance, dij:

d_ij = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2 )

Example

Household   Income   Size
A           50K      5
B           50K      4
C           20K      2
D           20K      1

With income expressed in units of 10K, the Euclidean distances include, for example:
d(A, C) = sqrt(3^2 + 3^2) = 4.24
d(B, C) = sqrt(3^2 + 2^2) = 3.61
Three Cluster Diagram Showing Between-Cluster and Within-Cluster Variation

Between-Cluster Variation = Maximize
Within-Cluster Variation = Minimize

Scatter Diagram for Cluster Observations

[Figure: scatter diagram of frequency of eating out (low to high) versus frequency of going to fast food restaurants (low to high)]


Comparison of Score Profiles for Factor Analysis and Hierarchical Cluster Analysis

[Figure: profile chart of scores (1 to 7) across the variables for Respondents A, B, C and D]

Clustering procedures
Hierarchical procedures
Agglomerative (start from n clusters to
get to 1 cluster)
Divisive (start from 1 cluster to get to n
clusters)
Non-hierarchical procedures
K-means clustering

Hierarchical clustering
Agglomerative:
Each of the n observations constitutes a separate cluster
The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
In the second step another cluster is formed (n-2 clusters), by nesting the two clusters that are most similar, and so on
There is a merging in each step until all observations end up in a single
cluster in the final step.

Divisive
All observations are initially assumed to belong to a single cluster
The most dissimilar observation(s) are extracted to form a separate cluster
In step 1 there will be 2 clusters, in the second step three clusters and so on, until the final step produces as many clusters as the number of observations. This technique is used mainly in medical research and is outside the scope of our course.

The number of clusters determines the stopping rule for the algorithms
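The agglomerative procedure can be sketched with SciPy (an assumption; the course itself uses SPSS): `linkage` performs the successive merges, and `fcluster` applies the stopping rule by cutting the hierarchy at a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up observations on two variables, forming two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Agglomerative clustering: start from 6 singleton clusters and merge
# the two closest clusters at each of the n-1 steps
Z = linkage(X, method="average", metric="euclidean")

# Stopping rule: cut the hierarchy to obtain exactly 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # first three observations share one label, last three the other
```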

Non-hierarchical clustering
These algorithms do not follow a hierarchy and produce a
single partition
Knowledge of the number of clusters (c) is required
In the first step, initial cluster centres (the seeds) are
determined for each of the c clusters, either by the
researcher or by the software.
Each iteration allocates observations to each of the c
clusters, based on their distance from the cluster centres
Cluster centres are computed again and observations may
be reallocated to the nearest cluster in the next iteration
When no observations can be reallocated or a stopping rule
is met, the process stops

Distance between clusters

Algorithms vary according to the way the distance between two clusters is defined. The most common algorithms for hierarchical methods include:
centroid method
single linkage method
complete linkage method
average linkage method
Ward algorithm

Linkage methods
Single linkage method (nearest neighbour):
distance between two clusters is the minimum
distance among all possible distances between
observations belonging to the two clusters.
Complete linkage method (furthest neighbour):
nests two clusters using as a basis the maximum
distance between observations belonging to
separate clusters.
Average linkage method: the distance between
two clusters is the average of all distances
between observations in the two clusters.
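The three linkage rules reduce to the min, max, and mean of the same matrix of pairwise distances. A minimal sketch with made-up data, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters of observations (second coordinate fixed at 0)
cluster_1 = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_2 = np.array([[4.0, 0.0], [6.0, 0.0]])

d = cdist(cluster_1, cluster_2)  # all four pairwise distances: 4, 6, 3, 5

single   = d.min()    # nearest neighbour          -> 3.0
complete = d.max()    # furthest neighbour         -> 6.0
average  = d.mean()   # mean of all four distances -> 4.5
print(single, complete, average)
```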

Ward algorithm
1. The sum of squared distances is computed within each cluster, considering all distances between observations within the same cluster.
2. The algorithm proceeds by choosing the aggregation of two clusters which generates the smallest increase in the total sum of squared distances.
It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.
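A minimal sketch of one Ward step in pure NumPy (the data are made up; the "sum of squared distances" is implemented here as squared distances to the cluster centroid, the usual formulation of Ward's criterion): try every pairwise merge and keep the one with the smallest increase in the total within-cluster sum of squares.

```python
import numpy as np
from itertools import combinations

def wss(cluster):
    """Within-cluster sum of squared distances to the centroid."""
    centroid = cluster.mean(axis=0)
    return float(((cluster - centroid) ** 2).sum())

# Three current clusters (made-up observations)
clusters = [np.array([[0.0, 0.0], [0.0, 1.0]]),
            np.array([[5.0, 5.0], [5.0, 6.0]]),
            np.array([[0.5, 0.5]])]

# One Ward step: evaluate every candidate merge
best_pair, best_increase = None, float("inf")
for i, j in combinations(range(len(clusters)), 2):
    merged = np.vstack([clusters[i], clusters[j]])
    increase = wss(merged) - wss(clusters[i]) - wss(clusters[j])
    if increase < best_increase:
        best_pair, best_increase = (i, j), increase

print(best_pair)  # (0, 2): the two nearby clusters are merged
```

The loop over all pairs at every step is exactly what makes the method computationally intensive.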

Non-hierarchical clustering: K-means method
1. The number k of clusters is fixed
2. An initial set of k seeds (aggregation centres) is provided, e.g., the first k elements
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Units can be reassigned in successive steps (optimising partitioning)
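The k-means steps above can be sketched in a few lines of NumPy (a sketch, not production code: it seeds with the first k elements as described above and ignores the empty-cluster case):

```python
import numpy as np

def k_means(X, k, max_iter=100):
    centres = X[:k].copy()                 # first k elements as seeds
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # assign each unit to the nearest cluster seed
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the seeds as the cluster means
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop when no reclassification changes the centres
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Made-up data; the first two rows come from different groups,
# so "first k elements" is a workable seeding here
X = np.array([[1.0, 1.0], [8.0, 8.0], [1.1, 0.9],
              [0.9, 1.0], [8.1, 7.9], [7.9, 8.1]])
labels, centres = k_means(X, k=2)
print(labels)  # observations 0, 2, 3 form one cluster; 1, 4, 5 the other
```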

Hierarchical vs. non-hierarchical methods

Hierarchical methods:
No decision about the number of clusters needed in advance
Problems when data contain a high level of error
Can be very slow; preferable with small data sets
At each step they require computation of the full proximity matrix

Non-hierarchical methods:
Faster, more reliable, work with large data sets
Need to specify the number of clusters
Need to set the initial seeds
Only cluster distances to seeds need to be computed in each iteration

How many clusters?

There are no hard and fast rules; consider:
a. theoretical, conceptual, or practical considerations;
b. the distances at which clusters are combined in a hierarchical clustering;
c. the relative size of the clusters should be meaningful, etc.
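Criterion (b) can be made concrete: in a hierarchical run, look for a large jump in the merging distances. A sketch with simulated data, assuming SciPy is available:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Simulate two well-separated groups of 10 observations each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
               rng.normal(5.0, 0.3, (10, 2))])

Z = linkage(X, method="ward")

# Column 2 of the linkage matrix holds the distance of each merge.
# A big jump between successive merges suggests stopping before it.
merge_dist = Z[:, 2]
jumps = np.diff(merge_dist)

# Here the largest jump is at the final merge (2 clusters -> 1),
# so a 2-cluster solution is indicated
print(int(np.argmax(jumps)) == len(Z) - 2)  # True
```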

Outliers
An outlier will affect your cluster solution if you don't remove it!
It will also affect your cluster solution if you do remove it (small sample size)!

Should we standardize clustering variables?
What is the effect of multi-collinearity in cluster analysis?
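On the first question, a small numeric sketch (made-up numbers) shows why standardization usually matters: with income in raw dollars it swamps household size, while z-scoring puts both variables on the same footing.

```python
import numpy as np

# Three respondents: (income in dollars, household size)
X = np.array([[50000.0, 5.0],
              [50000.0, 1.0],
              [20000.0, 5.0]])

def dist(i, j, data):
    return float(np.linalg.norm(data[i] - data[j]))

# Raw data: income dominates the distance completely
print(dist(0, 1, X), dist(0, 2, X))  # 4.0 vs 30000.0

# z-score standardization: subtract the mean, divide by the std dev
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(round(dist(0, 1, Z), 2), round(dist(0, 2, Z), 2))  # both 2.12
```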

Cluster Analysis Variable Selection

Variables are typically measured metrically, but the technique can be applied to non-metric variables with caution.
Variables are logically related to a single underlying concept or construct.

Variable  Description                                                          Type

Work Environment Measures
X1        I am paid fairly for the work I do.                                  Metric
X2        I am doing the kind of work I want.                                  Metric
X3        My supervisor gives credit and praise for work well done.            Metric
X4        There is a lot of cooperation among the members of my work group.    Metric
X5        My job allows me to learn new skills.                                Metric
X6        My supervisor recognizes my potential.                               Metric
X7        My work gives me a sense of accomplishment.                          Metric
X8        My immediate work group functions as a team.                         Metric
X9        My pay reflects the effort I put into doing my work.                 Metric
X10       My supervisor is friendly and helpful.                               Metric
X11       The members of my work group have the skills and/or training
          to do their job well.                                                Metric
X12       The benefits I receive are reasonable.                               Metric

Relationship Measures
X13       I have a sense of loyalty to McDonald's restaurant.                  Metric
X14       I am willing to put in a great deal of effort beyond that
          expected to help McDonald's restaurant to be successful.             Metric
X15       I am proud to tell others that I work for McDonald's restaurant.     Metric

Classification Variables
X16       Intention to Search                                                  Metric
X17       Length of Time an Employee                                           Nonmetric
X18       Work Type = Part-Time vs. Full-Time                                  Nonmetric
X19       Gender                                                               Nonmetric
X20       Age                                                                  Metric
X21       Performance                                                          Metric

Using SPSS to Identify Clusters

For this example we are looking for subgroups among all 63 employees of McDonald's restaurant using the organizational commitment variables. The SPSS click-through sequence is: Analyze > Classify > Hierarchical Cluster. This will take you to a dialog box where you select and move variables X13, X14 and X15 into the Variables box. Next go to the Statistics box; Agglomeration schedule is selected as the default option, and Cluster membership is set to None by default. We shall continue with the default options here. Next click on the Plots box, check Dendrogram, and in the Icicle window click on the None button, then Continue. Next click on the Method box and select Ward's under Cluster Method (it is the last option). Squared Euclidean distance is the default under Measure and we will use it; we do not need to standardize these data. We will not select anything in the Save option now. Now click OK to run the program.

Notice the change in the coefficients in the last two stages.

Identify the number of clusters from the dendrogram.

Using SPSS to Identify Clusters

The next step in the SPSS click-through sequence is: Analyze > Classify > K-Means Cluster. This will take you to a dialog box where you select and move variables X13, X14 and X15 into the Variables box. In the Number of Clusters box, put 3 in place of 2. Next go to the Save box and check Cluster membership. Next click on Options, uncheck the Initial cluster centers option and check ANOVA table. Now click OK to run the program.


Determine if clusters exist . . . Run ANOVA with the cluster IDs and the organizational commitment variables.


ANOVA

1. Move the three cluster variables into the Dependent List window.
2. Move the cluster ID variable into the Factor window.
3. Click on Options, check Descriptive, next Continue, and then OK.


Step 1: Determine if clusters exist? 2-Cluster ANOVA Results

Three issues to examine: (1) statistical significance, (2) cluster sample sizes, and (3) variable means.

Conclusion:
Cluster 1 = More Committed
Cluster 2 = Less Committed

Step 2: Determine if clusters exist? 3-Cluster ANOVA

Must run post-hoc tests:
1. Take the 2-cluster ID variable out and insert the 3-cluster ID variable.
2. Click on the Post Hoc button and check Scheffe.


Conclusions:
Cluster 1 = Least Committed
Cluster 2 = Moderately Committed
Cluster 3 = Most Committed
Individual cluster sample sizes are OK.
Clusters are significantly different, but we must examine the post hoc tests.


Step 2: Determine if clusters exist? 3-Cluster ANOVA


Step 3: Determine if clusters exist? 4-Cluster ANOVA

1. Remove the 3-cluster ID variable and insert the 4-cluster ID variable.
2. Click OK to run.


Determine if clusters exist? 4-Cluster ANOVA

Conclusions:
1. Group sample sizes are still OK.
2. Clusters are significantly different.
3. The means of the four clusters are more difficult to interpret; we may want to examine polar extremes. The most likely approach is to combine clusters 1 and 2 and do a three-cluster solution, or to remove groups 1 and 2 and compare the extreme groups (3 & 4).


Four-Cluster ANOVA Post Hoc Results

1. All clusters are significantly different.
2. The largest differences are consistently between clusters 3 and 4.


Error Reduction:
1 to 2 clusters = 58.4%
2 to 3 clusters = 25.5%
3 to 4 clusters = 22.8%
4 to 5 clusters = 22.2%
Conclusion: the benefit is similar or smaller after 3 clusters.

Decide number of clusters . . .

1. Examine the cluster analysis Agglomeration Schedule.
2. Consider cluster sample sizes.
3. Consider statistical significance.
4. Evaluate differences in cluster means.
5. Evaluate interpretation & communication issues.

[Figure: plot of agglomeration error coefficients by number of clusters]


Step 4: Describe cluster characteristics . . .

1. Use ANOVA.
2. Remove the clustering variables from the Dependent List window.
3. Insert the demographic variables.
4. Change the Factor variable if necessary.


Step 4: Describe cluster characteristics . . .

1. Go to Variable View.
2. Under the Values column, click on None beside the variable for the number of cluster groups you will examine.
3. Assign value labels to each cluster.
4. Run ANOVA on the demographics.


Describe demographic characteristics

Conclusions for the 3-cluster solution:
Clusters are significantly different.
The more committed cluster (you must know the coding to interpret) is:
less likely to search (lower mean)
full-time employees (code = 0)
female (code = 1)
higher performers (higher mean)

Thank you