
Cluster Analysis

An Introduction
This document is an introduction to Cluster Analysis. The objectives:
1. Understanding of the underlying concepts
2. Appreciation of when to use the technique
3. What Cluster Analysis can and can't do
4. The process of building a cluster solution
5. Evaluating a cluster solution
6. Implementing a cluster solution

Agenda

Introduction:
Why Segmentation?
Types of Segmentation:
a) Objective Segmentation (CHAID).
b) Subjective Segmentation (Cluster Analysis).
What is Cluster Analysis?
Basic Concepts.

Business Examples of Cluster Analysis:


Cluster Analysis for CCC Thailand (An Overview).

Cluster Analysis Process:


Data Cleaning and Preparing the data set for analysis.
Creating new relevant Variables.
Selection of Variables.
Tackling the Outliers.
Treatment of Missing Values.
Multicollinearity Check and hence reducing dimensions.
Standardization of the selected variables.
Getting Cluster Solution.
Checking the optimality of the solution.

SAS Procedures:
Proc Factor
Proc Standard
Proc Cluster / Proc Fastclus

Evaluating a Cluster Solution:


% of Population in each Cluster
Maximum Distance from Cluster Centroid
Distance from the nearest Cluster
Variable R-square
Overall R-square

Implementing a Cluster Solution:


Scoring Code and Minimum Euclidean Distance Method

Step A: Why Segmentation?
Step B: Types of Segmentation:
1. Objective Segmentation (CHAID)
2. Subjective Segmentation (Cluster Analysis)
Step C: What is Cluster Analysis?
Step D: Basic Concepts.

Step A: Why Segmentation ?

Each individual is so different that, ideally, we would want to reach out to each one of them in a different way.

Problem: The volume is too large for customization at the individual level.

Solution: Identify segments in which people share similar characteristics, and target each of these segments in a different way.

Segmentation is for better targeting

Step A: Why Segmentation ?

Segmentation provides a catalyst for creative insights and is often the first step in marketing strategy planning. Segmentation is a multipurpose technique.

Segmentation provides a common vocabulary for communicating marketing analysis. Segmentation can complement models.

Segmentation is a technique used by many of our competitors.

Step B: Types of Segmentation

Segmentation
  Objective: CHAID
  Subjective: Cluster Analysis

Step B.2: Subjective Segmentation

Subjective Segmentation (Cluster Analysis)

Step D: Basic Concepts

Cluster Analysis is a technique used for combining observations into groups such that:
1. Each group is homogeneous (similar) w.r.t. certain characteristics, and
2. Each group is different from the other groups w.r.t. the same characteristics.

Step C: What is Cluster Analysis


Cluster analysis is the process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group. In other words, cluster analysis means dividing the whole population into groups that are distinct from one another but internally similar.

Figure: the Total Population is divided into Group 1, Group 2, Group 3 and Group 4.

The objects in Group 1 should be as similar as possible, but there should be a large difference between an object in Group 1 and an object in Group 2.

The attributes of the objects are allowed to determine which objects should be grouped together.
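The idea of "similar within, different between" can be checked numerically. The following is a minimal Python sketch (not the SAS code used later in this deck); the points, the group labels and the helper names are made up for illustration. It compares the average distance between pairs of objects in the same group with the average distance between pairs in different groups.

```python
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_within_and_between(points, labels):
    """Average pairwise distance inside groups and across groups."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        (within if lp == lq else between).append(euclidean(p, q))
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical objects described by two attributes, in two groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [1, 1, 1, 2, 2, 2]

within, between = mean_within_and_between(points, labels)
print("average distance within groups: ", round(within, 2))
print("average distance between groups:", round(between, 2))
```

A grouping in the spirit of the definition above shows a within-group average that is much smaller than the between-group average.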

Step B.2: Subjective Segmentation


Consider a portfolio of 1000 customers holding credit. The business wants to apply different strategies to different groups of people. Objective segmentation won't help here, since we don't know who the responders are.

In this case we need some profiling, as below:

Total Population (1000)
  Avg. delinquency age = 0 days, Avg. age = 35 yrs., Avg. Utilization > 80%
  Avg. delinquency age = 15 days, Avg. age = 33 yrs., Avg. Utilization = 60%
  Avg. delinquency age = 12 days, Avg. age = 25 yrs., Avg. Utilization = 90%
  Avg. delinquency age = 75 days, Avg. age = 50 yrs., Avg. Utilization = 40%

We can exclude the group with avg. delinquency age = 75 days from mailing. This type of segmentation is known as Subjective Segmentation. It gives the salient characteristics of the best customers.

Step D: Basic Concepts

Basic Concepts of Cluster Analysis Using Two Variables

Figure: scatter plot of Current Balance (Low, Medium, High) against Gross Monthly Income (Low, Medium, High). Example Cluster 1: High Balance, Low Income. Example Cluster 2: High Income, Low Balance.

Cluster 1 and Cluster 2 are differentiated by Income and Current Balance. The objects in Cluster 1 share similar characteristics (High Balance and Low Income), while the objects in Cluster 2 share similar characteristics of their own (High Income and Low Balance).

But there is a large difference between an object in Cluster 1 and an object in Cluster 2.

Cluster Analysis for CCC Thailand

Central Credit Card - Thailand

Segmentation of the PLCC Portfolio (Logo 02)
In this initiative, the cluster analysis technique is applied to GCF Thailand's PLCC portfolio to generate segments, and thereby to provide a better understanding of the portfolio and avenues for developing marketing strategies tailored to each segment.

Cluster Sizes

The cluster analysis, performed on 75,134 Logo 02 accounts, resulted in 6 segments.

Segment Sizes (pie chart)

Segment 1: 20%
Segment 2: 29%
Segment 3: 23%
Segment 4: 5%
Segment 5: 15%
Segment 6: 8%

Significant Variables

The variables used for generating the segments are:
1. Account age (in days)
2. No. of times revolved in the last 3 months
3. Time (in mths.) elapsed between last transaction and end of window (Recency)
4. Sum of all transaction amounts over last 6 months

Variable                                              | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6
Account age (days)                                    | 436.18    | 282.06    | 265.66    | 342.17    | 434.42    | 397.7
Total amt. of sale transactions - last 6 months (Bht) | 6282.05   | 8975.72   | 7370.61   | 43385.06  | 8075.22   | 0.53
Recency (months)                                      | 1.07      | 0.85      | 1.09      | 0.11      | 0.91      | 10.41
No. of times revolved last 3 months                   | 2.92      | 2.9       | 0.15      | 0.72      | 0.16      | 0.55

These variables are different across segments.

Segment Descriptions: 1 to 4

Segment 1: Credit Hungry Poor (20%)
Low monthly income segment. High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers. Older accounts. Delinquency marginally higher than average. Low transactors.

Segment 2: New Revolvers (28%)
Newer accounts. Low monthly income segment. High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers. Delinquency marginally higher than average. Average transactors.

Segment 3: New Hopefuls (23%)
Newest accounts. Low Balance, Credit Utilisation, Fees and Fin. Charges. Low revolvers. Low transactors in number and total amount transacted, but sale per transaction is slightly higher than average; they therefore hold some hope for the future. Delinquency higher than average.

Segment 4: Affluent Spenders (5%)
Highest monthly income. High Balance, but low Fees and Fin. Charges. Heavy transactors.

Segment Descriptions: 5 & 6

Segment 5: Vintage Fuddy-Duddies (15%)
Older accounts. Low Balance, Credit Utilisation, Fin. Charges, Fees. Low revolvers. Average transactors.

Segment 6: Old Risky Inactives (8%)
Older customers. Low Balance, Credit Utilisation. Non transactors. High delinquency.

Segment               | No. of customers | %
Credit Hungry Poor    | 15356            | 20.44
New Revolvers         | 21161            | 28.16
New Hopefuls          | 17373            | 23.12
Affluent Spenders     | 3815             | 5.08
Vintage Fuddy-Duddies | 11248            | 14.97
Old Risky Inactives   | 6181             | 8.23

Note: Refer Behaviour Profile and Demographic Profile 1 in Appendix for Profile details.

Spend Characteristics of Segments


Affluent Spenders and New Revolvers form 33% of the portfolio but account for 53% of all spend. Affluent Spenders alone account for 25% of the total spend, though they are only 5% of the portfolio.

Figure: bubble chart of AVT (Value per txn. in Bht) against AFT (No. of txns. per month), one bubble per segment.

Segment               | % of portfolio | % of all txn. value
Affluent Spenders     | 5              | 25
New Revolvers         | 28             | 28
Vintage Fuddy-Duddies | 15             | 14
Old Risky Inactives   | 8              | 0
Credit Hungry Poor    | 20             | 14
New Hopefuls          | 23             | 19

Revolver Characteristics of Segments


Credit Hungry Poor and New Revolvers together account for 48% of the portfolio and 78% of revolvers.

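Figure: stacked bars comparing each segment's share of the portfolio with its share of revolvers.

Segment               | % of Portfolio | % of Revolvers
Credit Hungry Poor    | 20             | 33
New Revolvers         | 28             | 45
New Hopefuls          | 23             | 9
Affluent Spenders     | 5              | 4
Vintage Fuddy-Duddies | 15             | 6
Old Risky Inactives   | 8              | 3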

Step 1: Data Cleaning and Preparing the data set for analysis.
Step 2: Creating new relevant Variables.
Step 3: Selection of Variables.
Step 4: Tackling the Outliers.
Step 5: Treatment of Missing Values.
Step 6: Multicollinearity Check and hence reducing dimensions.
Step 7: Standardization of the selected variables.
Step 8: Getting Cluster Solution.
Step 9: Checking the optimality of the solution.

Clustering Process

Figure: flow diagram. Step 1: Data Cleaning and Preparing the data set for analysis -> Step 2: Creating New Relevant Variables -> Step 3: Selection of Variables -> Step 4: Tackling the Outliers -> Step 5: Treatment of Missing Values -> Step 6: Multicollinearity Check -> Step 7: Standardization -> Step 8: Getting Cluster Solution -> Step 9: Checking the Optimality of the Solution.

Process Flow for Cluster Analysis

Step 1: Preparation of Data

Figure: data preparation flow. Server (client data held in different tables) -> Data Merging (at Account Level or Customer Level) -> Merged Data -> Data Cleaning -> Final Data, ready for analysis.

Cleaning Process:
Identify the erroneous values.
Check for inconsistency in the values of variables.

Step 2: Creating new Variables

Variable Types:
Demographic
Socio-Economic
Product Related
Behavioral

Variable Creation: New relevant variables are, if necessary, to be created from the existing ones. For example, in an auto loan portfolio that has variables such as the deposit amount and the price of the vehicle under finance, a new variable Deposit Percent = (Deposit Amount / Price of the Vehicle) * 100 can be created.
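As an illustration of the Deposit Percent example, here is a minimal Python sketch (not SAS). The field names deposit_amount and vehicle_price and the sample accounts are assumptions made for the example, and returning None for a non-positive price is an added safeguard that the original formula does not mention.

```python
def deposit_percent(deposit_amount, vehicle_price):
    """Deposit Percent = (Deposit Amount / Price of the Vehicle) * 100."""
    if vehicle_price <= 0:
        # Assumption: treat a non-positive price as missing rather than divide by zero.
        return None
    return deposit_amount / vehicle_price * 100


# Hypothetical auto loan accounts.
accounts = [
    {"deposit_amount": 2000.0, "vehicle_price": 10000.0},
    {"deposit_amount": 4500.0, "vehicle_price": 18000.0},
]

# Add the derived variable to every account.
for account in accounts:
    account["deposit_percent"] = deposit_percent(
        account["deposit_amount"], account["vehicle_price"]
    )

print([account["deposit_percent"] for account in accounts])
```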

Step 3: Selection of Variables

There is no limit to the number of variables that can be selected for analysis.
The selection of variables depends on the purpose of the clustering.
Irrelevant variables are to be dropped.
Variables with a large % of missing values are to be dropped.

E.g. for an Auto Loan portfolio, variables such as Loan Amount, Month on Book, Term, Deposit Amount and Car Price should be considered. We should not look at APR (Annual Percentage Rate), because APR depends mainly on the yearly performance of the business overall rather than on the accounts.

E.g. suppose that, in a clustering process, we want to identify the highly delinquent people. In this case Maximum Delinquency reached will have more significance than the month-end delinquency variables.

Step 4: Tackling Outliers

What is an outlier? An observation is said to be an outlier w.r.t. a variable if it lies far away from the remaining observations.

Figure: scatter plot of Var 2 against Var 1, with one observation far from the rest marked as an outlier.

To identify them:
Univariate and Frequency analysis
Histogram and Box-Plot

To tackle them:
1. The outliers can be deleted from the analysis if they are very few in number.
2. The variables selected can be trimmed or capped.
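Capping can be sketched as follows in Python (not SAS). The 1st and 99th percentile cut-offs and the nearest-rank percentile rule are assumptions chosen for the example; the slide does not prescribe particular cut-offs.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct between 0 and 100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def cap(values, lower_pct=1, upper_pct=99):
    """Replace values outside the chosen percentiles with the percentile value."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical variable: 99 ordinary values and one extreme value.
values = list(range(1, 100)) + [1000]
capped = cap(values)
print(max(values), "->", max(capped))
```

Unlike deletion, capping keeps every observation in the data set; only the extreme values are pulled in to the chosen limits.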

Step 5: Treatment of Missing Values

Variables with many missing values (about 15%) should not be used for clustering unless "missing" has a special significance and can be replaced by some meaningful number. E.g. Insurance Variables.

Note: SAS does not include observations with missing values in the clustering process.

% of Missing  | Treatments
Less than 1%  | Delete those observations; Mean Imputation
1-5%          | Mean Imputation
5-10%         | Regression Imputation; Mean Imputation
More than 10% | Regression Imputation; try to use some proxy variable
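Mean imputation, the simplest of the treatments listed above, can be sketched in Python (not SAS) as follows. Representing a missing value as None and the sample data are assumptions made for the example.

```python
def missing_percent(values):
    """Percentage of entries that are missing (None)."""
    return 100 * sum(v is None for v in values) / len(values)


def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical variable with one missing value out of four.
term = [12, None, 36, 48]
print(missing_percent(term))
print(mean_impute(term))
```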

Step 6: Multi-collinearity Check

What is Multi-collinearity? A set of independent or explanatory variables is said to exhibit multi-collinearity if there is a linear (or nearly linear) relationship among them.

Devices to tackle Multi-collinearity:

Factor Analysis: Use Factor Analysis to select the factors that together explain about 90-95% of the total variation. Then select the variables that have high loadings on those factors.

VIF (Variance Inflation Factor): Variables with a VIF of more than 2 should be dropped.
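The VIF of a variable is 1 / (1 - R-square), where R-square comes from regressing that variable on the other candidate variables. With only two variables, that R-square is simply the squared correlation between them, which allows a short Python sketch (not SAS). The data and function names are made up for illustration, and the two-variable restriction is a simplification; with more variables a full regression is needed.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_variables(x, y):
    """VIF of either variable when the other is the only explanatory variable."""
    r_square = pearson_r(x, y) ** 2
    return 1 / (1 - r_square)


# Hypothetical variables: the first pair is correlated, the second pair is not.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
z = [1, -2, 2, -2, 1]

print(round(vif_two_variables(x, y), 2))  # above the threshold of 2
print(round(vif_two_variables(x, z), 2))  # no linear relation, so VIF is 1
```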

Step 7: Standardization

Why do we need Standardization? Since the units of measurement differ from one variable to another, standardization is a must.

E.g. consider two variables, Age and Income. The unit of Age is years and the unit of Income is, say, $. Hence they are not comparable, and without standardization there would be no common unit of measurement for the distance between two clusters.

Generally we standardize by making the mean = 0 and the variance = 1.
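A minimal Python sketch (not SAS) of this standardization: subtract the mean and divide by the standard deviation. The sample values are made up, and the use of the sample (n - 1) standard deviation is an assumption of the sketch.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (sample std, n - 1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical variables measured in different units.
age = [25, 35, 45]                 # years
income = [20000, 50000, 80000]     # $

print(standardize(age))
print(standardize(income))
```

After standardization both variables are on the same unit-free scale, so distances computed from them are not dominated by whichever variable has the larger raw numbers.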

Step 8: Getting the Cluster Solution

Cluster Process: In SAS, the two most commonly used procedures are Proc Cluster and Proc Fastclus.

Proc Cluster: hierarchical methods such as Single Linkage, Complete Linkage, Two Stage, etc.

Proc Fastclus: K-Means

What is K-Means: The process starts with K distinct observations that are as far apart from each other as possible. Each of the remaining observations is then considered one by one and assigned to the nearest cluster. The cluster centres are recomputed and the assignment is repeated until the clusters stop changing. Along the way, if two clusters come significantly close to each other, they are merged to form a new cluster.
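The assignment and update cycle can be sketched in plain Python. This is a textbook K-Means written for illustration, not the algorithm inside Proc Fastclus: the farthest-first choice of starting seeds is an assumption that mimics "K observations far from each other", and the merging of close clusters is not implemented. All names and data are made up.

```python
def euclidean(a, b):
    """Straight-line distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def farthest_first_seeds(points, k):
    """Start from the first point, then repeatedly add the point farthest from all seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(euclidean(p, s) for s in seeds)))
    return seeds


def k_means(points, k, max_iter=100):
    """Return (labels, centroids) after iterating assignment and centroid update."""
    centroids = farthest_first_seeds(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # no observation changed cluster
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical standardized observations forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```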

Step 8: Getting the Cluster Solution

Cluster Process: After the outliers have been removed from the data set, either of the above procedures can be used to build clusters. There is no hard and fast rule about the number of clusters or their sizes, but as a rule of thumb each cluster should hold at least 5% of the observations and the total number of clusters should be between 5 and 15. Some of the variables present in the data set are to be used for clustering. These variables must be numeric. They may be continuous or discrete, but if they are discrete the categories must be ordinal. The goodness of a particular set of clusters is measured by the extent to which the means of the clustering variables differ from one cluster to another.
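The rule of thumb can be turned into a quick check. The Python sketch below (not SAS) reads "5% observations in each cluster" as "at least 5%", which is an interpretation, and uses the six segment counts from the CCC Thailand example earlier in this deck; the function name is hypothetical.

```python
def check_rule_of_thumb(cluster_sizes, min_share=5.0, min_clusters=5, max_clusters=15):
    """Return (passes, shares): shares are each cluster's % of all observations."""
    total = sum(cluster_sizes)
    shares = [100 * size / total for size in cluster_sizes]
    count_ok = min_clusters <= len(cluster_sizes) <= max_clusters
    size_ok = all(share >= min_share for share in shares)
    return count_ok and size_ok, shares


# Segment counts from the CCC Thailand case study (75,134 accounts, 6 segments).
sizes = [15356, 21161, 17373, 3815, 11248, 6181]
passes, shares = check_rule_of_thumb(sizes)
print(passes, [round(s, 2) for s in shares])
```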

Profiling the Clusters: After the clusters have been built, they are to be profiled with respect to discrete and continuous variables to identify the distinguishing features of each cluster.

Step 9: Evaluating the Cluster Solution

Variable R Square
  Meaning: Between Variation / Total Variation
  Ideal value: >= 0.3

Overall R Square
  Meaning: Avg(Var R Square); 1 - Avg[WithinVariance(Var1), WithinVariance(Var2), ...]
  Ideal value: >= 0.6

Approximate Expected Overall R Square
  Meaning: Similar to above, different formula; calculated assuming variables are independent.
  Ideal value: Close to Overall R Square (diff <= 0.1)

RMS STD
  Meaning: For each cluster: Sqrt{Avg[Variance(Var1), Variance(Var2), ...]}
  Ideal value: <= 1.1

Distance Between Cluster Centroids
  Meaning: How close or how far apart are cluster centroids
  Ideal value: >= 1.5

Maximum distance from seed to observation
  Meaning: "Dispersion" within each cluster
  Ideal value: Relative; roughly uniform across clusters
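The two R Square measures can be illustrated with a short Python sketch (not SAS output). Variable R Square is computed as between variation over total variation, written here as 1 - within / total, and Overall R Square is taken as the average of the variable R Squares, following the table above. The data, labels and function names are made up for the example.

```python
def variable_r_square(values, labels):
    """Between variation / total variation for one clustering variable."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    within_ss = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        within_ss += sum((v - group_mean) ** 2 for v in members)
    return 1.0 - within_ss / total_ss


def overall_r_square(variables, labels):
    """Average of the variable R Squares, as described in the table."""
    r_squares = [variable_r_square(values, labels) for values in variables]
    return sum(r_squares) / len(r_squares)


# Hypothetical cluster labels and two clustering variables.
labels = [0, 0, 0, 1, 1, 1]
var1 = [1, 2, 3, 11, 12, 13]   # separates the two clusters well
var2 = [1, 2, 3, 1, 2, 3]      # does not separate them at all

print(round(variable_r_square(var1, labels), 3))
print(round(variable_r_square(var2, labels), 3))
print(round(overall_r_square([var1, var2], labels), 3))
```

A variable whose R Square falls below the 0.3 guideline, like var2 here, contributes little to telling the clusters apart.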

Proc Factor Proc Cluster Proc Fastclus

Typical SAS codes


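proc factor data=t5 method=prin nfactors=10 rotate=varimax out=final1;
run;

/* Standardize the selected variables to mean 0 and standard deviation 1. */
/* out1 is assumed to hold the variables retained after factor analysis.  */
proc standard data=out1 out=out2 mean=0 std=1;
  var amt_fin term dep_per age mon_book;
run;

/* Hierarchical clustering on the standardized data set. */
proc cluster data=out2 method=complete;
  var amt_fin term dep_per age mon_book;
run;

/* K-means clustering on the standardized data set; results go to out3. */
proc fastclus data=out2 out=out3 maxc=120 maxiter=100 delete=1200 short;
  var amt_fin term dep_per age mon_book;
run;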

Scoring: Minimum Euclidean Distance Method

A new observation is assigned to the cluster whose centroid is nearest to it in Euclidean distance.

Figure: scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2, Cluster 3 and a New Observation lying closest to Cluster 1.

The New Observation will be a member of Cluster 1.
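Scoring by minimum Euclidean distance can be sketched in Python (not the SAS scoring code). The centroid coordinates and the new observation are hypothetical values chosen to resemble the scatter plot, and the sketch assumes the new observation has already been put on the same scale as the centroids.

```python
def euclidean(a, b):
    """Straight-line distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_cluster(observation, centroids):
    """Name of the cluster whose centroid is closest to the observation."""
    return min(centroids, key=lambda name: euclidean(observation, centroids[name]))


# Hypothetical centroids as (Var 1, Var 2).
centroids = {
    "Cluster 1": (10.0, 60.0),
    "Cluster 2": (40.0, 20.0),
    "Cluster 3": (25.0, 45.0),
}

new_observation = (12.0, 55.0)
print(nearest_cluster(new_observation, centroids))
```

If the clustering variables were standardized before clustering, a new observation should be standardized with the same means and standard deviations before it is scored.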

Step 8: Getting the Cluster Solution

The SAS code for implementation

The Cluster Analysis Output

Thank You
