Cluster Analysis


Cluster Analysis

(classification analysis, numerical taxonomy): a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters, based on the set of variables considered. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.

objects: either variables or observations;

likeness: calculated from the measurements for each object.

Applications:

1. Segmentation: clustering consumers on the basis of benefits sought from the purchase of a product.

2. By clustering consumers to identify homogeneous groups, a firm can examine the buying behavior or information-seeking behavior of each group.

3. By clustering brands and products to identify competitive sets within the market, a firm can examine its current offerings compared to those of its competitors to identify potential new product opportunities.

4. By grouping cities into homogeneous clusters, a firm can select comparable cities to test various marketing strategies.

Similarity between observations

To measure similarity between two observations, a distance measure is needed.

With a single variable, similarity is straightforward. Example (income): two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases.

With many characteristics (e.g. income, age, consumption habits, brand loyalty, purchase frequency, family composition, education level), it becomes more difficult to define similarity with a single value.

The usual measure is distance, which is the concept we use in everyday life for spatial coordinates.

Model:

Data: each object is characterized by a set of numbers (measurements);

e.g., object 1: (x11, x12, ..., x1n)

object 2: (x21, x22, ..., x2n)

:

object p: (xp1, xp2, ..., xpn)

Distance: Euclidean distance, d_ij, between objects i and j:

$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}$$

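Example

Household | Household Income | Household Size
A | 50K | 5
B | 50K | 4
C | 20K | 2
D | 20K | 1

[Figure: households A, B, C and D plotted by income ($, unit: 10K) and household size, with the distances d(A,B) = 1, d(A,C) = sqrt(3^2 + 3^2) = 4.24 and d(B,C) = sqrt(2^2 + 3^2) = 3.61]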
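The distances in this example can be checked with a few lines of Python. This is a minimal sketch: the dictionary `households` and the helper `euclidean` are names introduced here for illustration, and income is expressed in units of 10K as in the example.

```python
import math

# Household data from the example: (income in units of 10K, household size).
households = {
    "A": (5, 5),
    "B": (5, 4),
    "C": (2, 2),
    "D": (2, 1),
}

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples of measurements."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Print the distance for every pair of households.
names = sorted(households)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = euclidean(households[first], households[second])
        print(f"d({first},{second}) = {d:.2f}")
```

For the pairs shown in the example, this gives d(A,B) = 1.00, d(A,C) = 4.24 and d(B,C) = 3.61.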

Between-Cluster and Within-Cluster Variation

Between-Cluster Variation = Maximize

Within-Cluster Variation = Minimize
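One common way to make these two quantities concrete is the sum-of-squares decomposition below. The notation is introduced here and is not from the slides: $x_i$ is observation $i$, $\bar{x}$ the overall mean, $C_k$ the set of observations in cluster $k$, $n_k$ its size and $\bar{x}_k$ its mean.

```latex
\underbrace{\sum_{i=1}^{N} \lVert x_i - \bar{x} \rVert^2}_{\text{total variation}}
=
\underbrace{\sum_{k=1}^{K} n_k \,\lVert \bar{x}_k - \bar{x} \rVert^2}_{\text{between-cluster variation}}
+
\underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2}_{\text{within-cluster variation}}
```

Because the total is fixed for a given data set, a partition that increases the between-cluster term necessarily decreases the within-cluster term by the same amount.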

[Figure: two scatter plots of observations; axes: frequency of eating out (Low to High) and frequency of going to fast food restaurants (Low to High)]

Analysis and Hierarchical Cluster Analysis

[Figure: score profiles of Respondents A, B, C and D across the variables; vertical axis: Score (1 to 7)]

Clustering procedures

Hierarchical procedures

  Agglomerative (start from n clusters to get to 1 cluster)

  Divisive (start from 1 cluster to get to n clusters)

Non-hierarchical procedures

  K-means clustering

Hierarchical clustering

Agglomerative:

Each of the n observations initially constitutes a separate cluster.

The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters.

In the second step another cluster is formed (n-2 clusters) by merging the two clusters that are most similar, and so on.

There is a merge in each step until all observations end up in a single cluster in the final step.

Divisive:

All observations are initially assumed to belong to a single cluster.

The most dissimilar observation(s) are extracted to form a separate cluster.

In step 1 there will be 2 clusters, in the second step 3 clusters, and so on, until the final step produces as many clusters as there are observations. This technique is used in medical research and is outside the scope of our course.
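The agglomerative steps can be sketched in plain Python as follows. This is an illustration only: the function names are introduced here, the distance rule is assumed to be the smallest pairwise Euclidean distance between clusters, and the four points are illustrative (income in 10K, household size).

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_distance(c1, c2):
    # Assumed distance rule: smallest distance between members of the two clusters.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerate(points):
    """Merge the two closest clusters repeatedly; return every partition visited."""
    clusters = [[p] for p in points]            # start: n singleton clusters
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest distance between them.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

points = [(5, 5), (5, 4), (2, 2), (2, 1)]       # illustrative data
history = agglomerate(points)
for step, partition in enumerate(history):
    print(f"step {step}: {len(partition)} clusters -> {partition}")
```

With n = 4 points the partitions contain 4, 3, 2 and finally 1 cluster, matching the n, n-1, n-2, ... sequence described above.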

Non-hierarchical clustering

These algorithms do not follow a hierarchy and produce a single partition.

Knowledge of the number of clusters (c) is required.

In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software.

Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres.

Cluster centres are then computed again, and observations may be reallocated to the nearest cluster in the next iteration.

When no observations can be reallocated or a stopping rule is met, the process stops.

Hierarchical algorithms vary according to the way the distance between two clusters is defined.

The most common algorithms for hierarchical methods include:

  centroid method

  single linkage method

  complete linkage method

  average linkage method

  Ward algorithm

Linkage methods

Single linkage method (nearest neighbour): the distance between two clusters is the minimum distance among all possible distances between observations belonging to the two clusters.

Complete linkage method (furthest neighbour): merges two clusters using as a basis the maximum distance between observations belonging to separate clusters.

Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters.

Ward algorithm

1. The sum of squared distances is computed within each cluster, considering all distances between observations within the same cluster.

2. The algorithm proceeds by choosing the aggregation between two clusters which generates the smallest increase in the total sum of squared distances.

It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.
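One step of this selection rule can be sketched as below. Note the assumption: the sketch measures within-cluster variation as the sum of squared distances from each observation to its cluster centroid, which is a common formulation of Ward's criterion; the function names and the example clusters are introduced here for illustration.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def within_ss(cluster):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_increase(c1, c2):
    """Increase in total within-cluster sum of squares if c1 and c2 are merged."""
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

def best_ward_merge(clusters):
    """Indices of the pair of clusters whose merge increases the total the least."""
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]))

# Illustrative state part-way through the procedure: two singletons and one pair.
clusters = [[(5, 5)], [(5, 4)], [(2, 2), (2, 1)]]
i, j = best_ward_merge(clusters)
print("merge clusters", i, "and", j,
      "increase =", ward_increase(clusters[i], clusters[j]))
```

Every candidate pair is evaluated before a merge is chosen, which is the source of the computational cost mentioned above.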

Non-hierarchical clustering: K-means method

1. The number k of clusters is fixed.

2. An initial set of k seeds (aggregation centres) is provided, e.g. the first k elements.

3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed.

4. New seeds are computed.

5. Go back to step 3 until no reclassification is necessary.

Units can be reassigned in successive steps (optimising partitioning).
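The K-means loop can be sketched in plain Python as follows. This is a simplified illustration, not the procedure of any particular package: it takes the first k elements as seeds, omits the distance threshold, and uses names (`k_means`, `max_iter`) and data introduced here.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100):
    seeds = [tuple(p) for p in points[:k]]      # initial seeds: the first k elements
    assignment = None
    for _ in range(max_iter):
        # Assign every unit to its nearest seed.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, seeds[c])) for p in points
        ]
        if new_assignment == assignment:        # no reclassification: stop
            break
        assignment = new_assignment
        # Recompute each seed as the mean of the units assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                         # keep the old seed if a cluster is empty
                seeds[c] = mean_point(members)
    return assignment, seeds

points = [(5, 5), (5, 4), (2, 2), (2, 1)]       # illustrative data
assignment, seeds = k_means(points, k=2)
print(assignment, seeds)
```

On this small data set the second point is first assigned to its own seed and then reassigned in the next pass, which illustrates how units can move between clusters in successive steps.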

Hierarchical methods

  No need to decide the number of clusters in advance

  Problems when data contain a high level of error

  Can be very slow, preferable with small data sets

  At each step they require computation of the full proximity matrix

Non-hierarchical methods

  Work with large data sets

  Need to specify the number of clusters

  Need to set the initial seeds

  Only cluster distances to seeds need to be computed in each iteration

No hard and fast rules:

a. ... considerations;

b. the distances at which clusters are combined in a hierarchical clustering;

c. the relative size of the clusters should be meaningful, etc.

Outliers

It would affect your cluster solution if you don't remove it!

It would affect your cluster solution if you remove it! (small sample size)

variables?

What is the effect of multi-collinearity in cluster analysis?

measured metrically, but the technique can be applied to non-metric variables with caution.

to a single underlying concept or construct.

Variable | Description | Type

Work Environment Measures
X1 | I am paid fairly for the work I do. | Metric
X2 | I am doing the kind of work I want. | Metric
X3 | My supervisor gives credit and praise for work well done. | Metric
X4 | There is a lot of cooperation among the members of my work group. | Metric
X5 | My job allows me to learn new skills. | Metric
X6 | My supervisor recognizes my potential. | Metric
X7 | My work gives me a sense of accomplishment. | Metric
X8 | My immediate work group functions as a team. | Metric
X9 | My pay reflects the effort I put into doing my work. | Metric
X10 | My supervisor is friendly and helpful. | Metric
X11 | The members of my work group have the skills and/or training to do their job well. | Metric
X12 | The benefits I receive are reasonable. | Metric

Relationship Measures
X13 | I have a sense of loyalty to McDonald's restaurant. | Metric
X14 | I am willing to put in a great deal of effort beyond that expected to help McDonald's restaurant to be successful. | Metric
X15 | I am proud to tell others that I work for McDonald's restaurant. | Metric

Classification Variables
X16 | Intention to Search | Metric
X17 | Length of Time an Employee | Nonmetric
X18 | Work Type = Part-Time vs. Full-Time | Nonmetric
X19 | Gender | Nonmetric
X20 | Age | Metric
X21 | Performance | Metric

For this example we are looking for subgroups among all 63 employees of McDonald's restaurant using the organizational commitment variables. The SPSS click-through sequence is: Analyze > Classify > Hierarchical Cluster. This takes you to a dialog box where you select and move variables X13, X14 and X15 into the Variables box.

Next, go to the Statistics box, where Agglomeration schedule is selected as the default option and None is selected as the default under Cluster Membership. We shall continue with the default options here. Next, click on the Plots box. Check Dendrogram and, in the Icicle window, click on the None button. Then click Continue.

Next, click on the Method box and select Ward's under Cluster Method (it is the last option). Squared Euclidean Distances is the default under Measure and we will use it; we do not need to standardize this data.

We will not select anything under the Save option now. Click OK to run the program.

coefficients in last two stages

Identified the number of clusters from the dendrogram

The SPSS click-through sequence is: Analyze > Classify > K-Means Cluster. This takes you to a dialog box where you select and move variables X13, X14 and X15 into the Variables box. In the Number of Clusters box, put 3 in place of 2. Next, go to the Save box and check Cluster membership. Next, click on Options. Uncheck the initial cluster option and check ANOVA table. Click OK to run the program.


ANOVA with cluster IDs and organizational commitment variables.

ANOVA

1. Move the commitment variables into the window.

2. Move the cluster ID variable into the window.

3. Click on Options, check Descriptive, next Continue, and then OK.

2 Cluster ANOVA Results

Three issues to examine: (1) statistical significance, (2) cluster sample sizes, and (3) variable means.

Conclusion:

Cluster 1: More Committed

Cluster 2: Less Committed
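The quantity behind the significance check is the one-way ANOVA F statistic, which compares variation between cluster means with variation within clusters. The sketch below computes it by hand; the scores are hypothetical values made up for illustration (they are not the McDonald's data), and the function name is introduced here. It does not compute a p-value.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores on one commitment item for the two clusters.
cluster_1 = [6, 7, 6, 5]
cluster_2 = [3, 2, 4, 3]

f_stat, df_b, df_w = one_way_anova([cluster_1, cluster_2])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the F distribution with these degrees of freedom indicates that the cluster means differ; the cluster sample sizes and the means themselves still need to be inspected, as listed above.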

3 Cluster ANOVA

Must run post-hoc tests.

Take the 2 cluster ID variable out and insert the 3 cluster ID variable.

Click on the Post Hoc button and check Scheffe.

Conclusions:

Cluster 1: Least Committed

Cluster 2: Moderately Committed

Cluster 3: Most Committed

Individual cluster sample sizes OK.

Clusters significantly different, but must examine post hoc tests.

4 Cluster ANOVA

1. Remove the 3 cluster ID variable and insert the 4 cluster ID variable.

2. Click OK to run.

4 Cluster ANOVA

Conclusions:

1. Group sample sizes still OK.

2. Clusters are significantly different.

3. Means of four clusters are more difficult to interpret; may want to examine polar extremes. Most likely approach is to combine clusters 1 and 2 and do a three-cluster solution, or remove groups 1 and 2 and compare the extreme groups (3 & 4).

Post Hoc results

1. All clusters are significantly different.

2. Largest differences consistently between clusters 3 and 4.

Error Reduction:

1 to 2 clusters = 58.4%

2 to 3 clusters = 25.5%

3 to 4 clusters = 22.8%

4 to 5 clusters = 22.2%

Conclusion: benefit similar or less after 3 clusters.
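Percentages of this kind can be derived from the error (within-cluster sum of squares) of successive solutions. The sketch below assumes the reduction from k to k+1 clusters is defined as the drop in error relative to the k-cluster error; both that definition and the error values are assumptions made for illustration, and they do not reproduce the figures above.

```python
def error_reduction(errors):
    """Percentage drop in error when moving from k to k+1 clusters.

    errors[i] is the error of the solution with i + 1 clusters.
    """
    return [
        100 * (errors[i] - errors[i + 1]) / errors[i]
        for i in range(len(errors) - 1)
    ]

# Hypothetical error values for the 1-, 2-, 3-, 4- and 5-cluster solutions.
errors = [200.0, 80.0, 60.0, 45.0, 36.0]

reductions = error_reduction(errors)
for k, pct in enumerate(reductions, start=1):
    print(f"{k} to {k + 1} clusters = {pct:.1f}%")
```

A sharp drop followed by roughly constant reductions suggests that adding further clusters brings little extra benefit.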

1. Examine cluster analysis Agglomeration Schedule.

2. Consider cluster sample sizes.

3. Consider statistical significance.

4. Evaluate differences in cluster means.

5. Evaluate interpretation & communication issues.

[Figure label: Error Coefficients]

1. Use ANOVA.

2. Remove clustering variables from the Dependent List window.

3. Insert demographic variables.

4. Change the Factor variable if necessary.

1. Go to Variable View.

2. Click on None beside the variable for the number of cluster groups you will examine, under the Values column.

3. Assign value labels to each cluster.

4. Run ANOVA on demographics.

Conclusions, 3 cluster solution:

Clusters are significantly different.

More committed cluster (must know coding to interpret):

  Less likely to search (lower mean)

  Full-time employees (code = 0)

  Females (code = 1)

  High performers (higher mean)

Thank you

