Cluster Analysis

An Introduction
This document is an introduction to Cluster Analysis. The objectives are:
1. Understanding the underlying concepts
2. Appreciating when to use the technique
3. Knowing what Cluster Analysis can and cannot do
4. The process of building a cluster solution
5. Evaluating a cluster solution
6. Implementing a cluster solution
Agenda
Introduction:
  Why Segmentation?
  Types of Segmentation:
    a) Objective Segmentation (CHAID)
    b) Subjective Segmentation (Cluster Analysis)
  What is Cluster Analysis?
  Basic Concepts
Business Examples of Cluster Analysis:
  Cluster Analysis for CCC Thailand (An Overview)
Cluster Analysis Process:
  Data Cleaning and Preparing the data set for analysis
  Creating new relevant Variables
  Selection of Variables
  Tackling the Outliers
  Treatment of Missing Values
  Multicollinearity Check and hence reducing dimensions
  Standardization of the selected variables
  Getting Cluster Solution
  Checking the optimality of the solution
SAS Procedures:
  Proc Factor
  Proc Standard
  Proc Cluster / Proc Fastclus
Evaluating a Cluster Solution:
  % of Population in each Cluster
  Maximum Distance from Cluster Centroid
  Distance from the nearest Cluster
  Variable R-square
  Overall R-square
Implementing a Cluster Solution:
  Scoring Code and Minimum Euclidean Distance Method
Step A: Why Segmentation?
Step B: Types of Segmentation:
  1. Objective Segmentation (CHAID)
  2. Subjective Segmentation (Cluster Analysis)
Step C: What is Cluster Analysis?
Step D: Basic Concepts
Step A: Why Segmentation ?
Each individual is so different that, ideally, we would want to reach out to each one of them in a different way.
Problem: The volume is too large for customization at the individual level.
Solution: Identify segments where people have similar characteristics, and target each of these segments in a different way.
Segmentation is for better targeting
Step A: Why Segmentation ?
Segmentation provides a catalyst for creative insights and is often the first step in marketing strategy planning.
Segmentation is a multipurpose technique.
Segmentation provides a common vocabulary for communicating marketing analysis.
Segmentation can complement models.
Segmentation is a technique used by many of our competitors.
Step B: Types of Segmentation
Segmentation
  Objective: CHAID
  Subjective: Cluster Analysis
Step B.2: Subjective Segmentation
Subjective Segmentation (Cluster Analysis)
Step D: Basic Concepts
Cluster Analysis is a technique used for combining observations into groups such that:
  1. Each group is homogeneous (similar) with respect to certain characteristics, and
  2. Each group is different from the other groups with respect to the same characteristics.
Step C: What is Cluster Analysis

Cluster Analysis is the process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group. In other words, Cluster Analysis means dividing the whole population into groups that are distinct from one another but internally similar.
[Figure: the Total Population divided into Group 1, Group 2, Group 3 and Group 4.]
The objects in Group 1 should be as similar as possible, but there should be a large difference between an object in Group 1 and an object in Group 2.
The attributes of the objects are allowed to determine which objects should be grouped together.
Step B.2: Subjective Segmentation

Consider a portfolio of 1000 customers having credit. The business wants different strategies for different groups of people. Objective segmentation will not help here, since we do not know who the responders are.
In this case we need some profiling, as below:
Total Population (1000)
  Avg. delinquency age = 0 days, Avg. age = 35 yrs., Avg. Utilization > 80%
  Avg. delinquency age = 15 days, Avg. age = 33 yrs., Avg. Utilization = 60%
We can exclude the group with avg. delinquency age = 75 days from mailing. This type of segmentation is known as Subjective Segmentation. It gives the salient characteristics of the best customers.
Step D: Basic Concepts
Basic Concepts of Cluster Analysis Using Two Variables
[Figure: plot of Current Balance (Low, Medium, High) against Gross Monthly Income (Low, Medium, High). Example Cluster 1: High Balance, Low Income. Example Cluster 2: High Income, Low Balance.]
Cluster 1 and Cluster 2 are differentiated by Income and Current Balance. The objects in Cluster 1 share the same characteristics (High Balance and Low Income), while the objects in Cluster 2 share the same characteristics (High Income and Low Balance).
However, there is a large difference between an object in Cluster 1 and an object in Cluster 2.
Cluster Analysis for CCC Thailand
Central Credit Card -Thailand

Segmentation of the PLCC Portfolio (Logo 02)
In this initiative, the Cluster Analysis technique is applied to GCF Thailand's PLCC portfolio to generate segments, and thereby provide a better understanding of the portfolio and avenues to develop marketing strategies tailored to each segment.
Cluster Sizes
The Cluster Analysis, performed on 75,134 Logo 02 accounts, resulted in 6 segments.
Segment Sizes
  Segment 1: 20%
  Segment 2: 29%
  Segment 3: 23%
  Segment 4: 5%
  Segment 5: 15%
  Segment 6: 8%
Significant Variables
The variables used for generating the segments are:
1. Account age (in days)
2. No. of times revolved in the last 3 months
3. Time (in mths.) elapsed between last transaction and end of window (Recency)
4. Sum of all transaction amounts over last 6 months
Variable | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6
Account age (days) | 436.18 | 282.06 | 265.66 | 342.17 | 434.42 | 397.7
Total amt. of sale transactions - last 6 months (Bht) | 6282.05 | 8975.72 | 7370.61 | 43385.06 | 8075.22 | 0.53
Recency (months) | 1.07 | 0.85 | 1.09 | 0.11 | 0.91 | 10.41
No. of times revolved last 3 months | 2.92 | 2.9 | 0.15 | 0.72 | 0.16 | 0.55
These variables are different across segments.
Segment Descriptions: 1 to 4

Segment 1: Credit Hungry Poor (20%)
Low monthly income segment. High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers. Older accounts. Delinquency marginally higher than average. Low transactors.
Segment 2: New Revolvers (28%)
Newer accounts. Low monthly income segment. High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers. Delinquency marginally higher than average. Average transactors.
Segment 3: New Hopefuls (23%)
Newest accounts. Low Balance, Credit Utilisation, Fees and Fin. Charges. Low revolvers. Low transactors in number and total amount transacted, but sale per transaction is slightly higher than average; they therefore hold some hope for the future. Delinquency higher than average.
Segment 4: Affluent Spenders (5%)
Highest monthly income. High Balance, but low Fees and Fin. Charges. Heavy transactors.
Segment Descriptions: 5 & 6

Segment 5: Vintage Fuddy-Duddies (15%)
Older accounts. Low Balance, Credit Utilisation, Fin. Charges, Fees. Low revolvers. Average transactors.
Segment 6: Old Risky Inactives (8%)
Older customers. Low Balance, Credit Utilisation. Non transactors. High delinquency.
Segment | No. of customers | %
Credit Hungry Poor | 15356 | 20.44
New Revolvers | 21161 | 28.16
New Hopefuls | 17373 | 23.12
Affluent Spenders | 3815 | 5.08
Vintage Fuddy-Duddies | 11248 | 14.97
Old Risky Inactives | 6181 | 8.23
Note: Refer Behaviour Profile and Demographic Profile 1 in Appendix for Profile details.
Spend Characteristics of Segments

Affluent Spenders and New Revolvers form 33% of the portfolio but account for 53% of all spend. Affluent Spenders alone account for 25% of the total spend, though they are only 5% of the portfolio.
[Figure: chart of AVT (Value per txn. in Bht) against AFT (No. of txns. per month) for the six segments: Affluent Spenders, New Revolvers, Vintage Fuddy-Duddies, Old Risky Inactives, Credit Hungry Poor and New Hopefuls. Annotation: Affluent Spenders, 5% of portfolio, 25% of all txn. value.]
Revolver Characteristics of Segments

Credit Hungry Poor and New Revolvers together account for 48% of the portfolio and 78% of revolvers.
[Figure: stacked bar chart (0 to 100%) comparing % of Portfolio with % of Revolvers for each segment: Credit Hungry Poor, New Revolvers, New Hopefuls, Affluent Spenders, Vintage Fuddy-Duddies and Old Risky Inactives.]
Step 1: Data Cleaning and Preparing the data set for analysis.
Step 2: Creating new relevant Variables.
Step 3: Selection of Variables.
Step 4: Tackling the Outliers.
Step 5: Treatment of Missing Values.
Step 6: Multicollinearity Check and hence reducing dimensions.
Step 7: Standardization of the selected variables.
Step 8: Getting Cluster Solution.
Step 9: Checking the optimality of the solution.
Process Flow for Cluster Analysis
[Figure: flow chart of the clustering process, from Step 1 (Data Cleaning and Preparing the data set for analysis) through Step 9 (Checking the Optimality of the Solution).]
Step 1: Preparation of Data
[Figure: data flow from the Server (Client Data, Different Tables) through Data Merging (Account Level or Customer Level) to Merged Data, then through Data Cleaning to the Final Data, Ready for Analysis.]
Cleaning Process:
  Identify the erroneous values.
  Check for inconsistency in the values of variables.
Step 2: Creating new Variables
Variable Types: Demographic, Socio-Economic, Product Related, Behavioral
Variable Creation: New relevant variables are created from the existing ones where necessary. For example, in an auto loan portfolio with variables such as deposit amount and price of the vehicle under finance, a new variable Deposit Percent = (Deposit Amount / Price of the Vehicle) * 100 can be created.
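The Deposit Percent formula can be sketched in a few lines of Python (the deck itself works in SAS; Python is used here only for illustration). The field names `deposit_amt` and `car_price` and the sample accounts are hypothetical, and returning a missing value for a non-positive price is an assumption, not something the deck specifies.

```python
def deposit_percent(deposit_amount, vehicle_price):
    """Deposit Percent = (Deposit Amount / Price of the Vehicle) * 100."""
    if deposit_amount is None or vehicle_price is None or vehicle_price <= 0:
        return None  # assumption: leave the derived value missing rather than guess
    return deposit_amount / vehicle_price * 100.0


# Hypothetical auto-loan accounts; field names are illustrative only.
accounts = [
    {"deposit_amt": 2000.0, "car_price": 10000.0},
    {"deposit_amt": 4500.0, "car_price": 18000.0},
    {"deposit_amt": 1000.0, "car_price": 0.0},  # invalid price, stays missing
]
for acct in accounts:
    acct["dep_per"] = deposit_percent(acct["deposit_amt"], acct["car_price"])
```

Keeping invalid inputs as missing means the new variable passes through the same missing-value treatment as every other variable later in the process.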
Step 3: Selection of Variables
There is no limit to the number of variables that can be selected for analysis.
The selection of variables depends on the purpose of the clustering.
Irrelevant variables are to be dropped.
Variables with a large % of missing values are to be dropped.
E.g. for an Auto Loan portfolio, variables like Loan Amount, Months on Book, Term, Deposit Amount and Car Price should be considered. We should not look at APR (Annual Percentage Rate), because APR depends mainly on the overall yearly performance of the business rather than on the individual accounts.
E.g. in a clustering process where we want to identify the highly delinquent people, Maximum Delinquency reached will have more significance than the month-end delinquency variables.
Step 4: Tackling Outliers
What is an outlier? An observation is said to be an outlier with respect to a variable if it is far away from the remaining observations.
[Figure: scatter plot of Var 2 against Var 1 with one point marked as an Outlier.]
To identify them:
  Univariate and Frequency analysis
  Histogram and Box-Plot
To tackle them:
  1. The outliers can be deleted from the analysis if they are very small in number.
  2. The variables selected can be trimmed or capped.
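Capping can be illustrated with a short Python sketch. The deck does not say where to cap; the percentile cut-offs used here (and the linear-interpolation percentile rule) are assumptions chosen for the example, and the data are made up.

```python
def percentile(sorted_values, pct):
    """Linear-interpolation percentile of an already sorted list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("no values")
    pos = (len(sorted_values) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def cap(values, lower_pct=1.0, upper_pct=99.0):
    """Pull values outside the chosen percentiles back to those percentiles."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values], (low, high)


# Hypothetical variable: 45 sits far away from the remaining observations.
var1 = [5, 7, 8, 9, 10, 11, 12, 13, 14, 45]
capped, (low, high) = cap(var1, lower_pct=10.0, upper_pct=90.0)
```

Capping keeps every observation in the data set, which matters when outliers are too numerous to delete.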
Step 5: Treatment of Missing Values
Variables with a large share of missing values (about 15%) should not be used for clustering unless Missing has a special significance and can be replaced by some meaningful number, e.g. Insurance Variables.
Note: SAS does not include observations with missing values in the clustering process.
% of Missing | Treatments
Less than 1% | Delete those observations; Mean imputation
1-5% | Mean imputation
5-10% | Regression imputation; Mean imputation
More than 10% | Regression imputation; Try to use some proxy variable
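The lookup in the table and the simplest treatment, mean imputation, can be sketched in Python as follows. The bands in the table overlap at exactly 5% and 10%; assigning those boundary values to the lower band is an assumption, and the `term` data are hypothetical.

```python
from statistics import mean


def missing_pct(values):
    """Percentage of entries that are missing (None)."""
    return 100.0 * sum(v is None for v in values) / len(values)


def suggested_treatments(pct_missing):
    """Treatments listed in the table for a given % of missing values."""
    if pct_missing < 1:
        return ["delete observations", "mean imputation"]
    if pct_missing <= 5:   # assumption: exactly 5% falls in the 1-5% band
        return ["mean imputation"]
    if pct_missing <= 10:  # assumption: exactly 10% falls in the 5-10% band
        return ["regression imputation", "mean imputation"]
    return ["regression imputation", "proxy variable"]


def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical loan terms in months: 1 missing value out of 50.
term = [36, 48, None, 60] + [48] * 46
pct = missing_pct(term)
treatments = suggested_treatments(pct)
imputed = mean_impute(term)
```

Regression imputation and proxy variables need a model or domain knowledge of the portfolio, so they are left out of this sketch.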
Step 6: Multi-collinearity Check
What is Multi-collinearity? A set of independent (explanatory) variables is said to have multicollinearity if there is a linear relation among them.
Devices to tackle Multi-collinearity:
Factor Analysis: Using Factor Analysis, select the factors that together explain about 90-95% of the total variation. Then select the variables that have high loadings on those factors.
VIF (Variance Inflation Factor): Variables with a VIF of more than 2 should be dropped.
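For readers who want to see what a VIF is numerically, here is a minimal Python sketch. It relies on the standard identity that the VIF of variable j, 1 / (1 - R_j^2), equals the j-th diagonal entry of the inverse of the correlation matrix. The variable names are borrowed from the deck's SAS code, but the values are hypothetical, and the small matrix inversion routine is only meant for a handful of variables.

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix for equal-length numeric columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)  # a constant column (s == 0) will fail here
        z.append([(v - m) / s for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    k = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(k)]
           for i, row in enumerate(matrix)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        if abs(aug[pivot][c]) < 1e-12:
            raise ValueError("variables are perfectly collinear")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [x / p for x in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Hypothetical values for three candidate clustering variables.
names = ["amt_fin", "term", "mon_book"]
columns = [
    [10.0, 12.0, 15.0, 18.0, 20.0, 25.0],
    [12.0, 24.0, 12.0, 36.0, 24.0, 48.0],
    [3.0, 10.0, 5.0, 2.0, 8.0, 6.0],
]
vifs = vif(columns)
flagged = [name for name, v in zip(names, vifs) if v > 2]  # cut-off of 2 from the text
```

In practice variables are usually dropped one at a time, recomputing the VIFs after each removal, since removing one variable changes the VIF of the others.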
Step 7: Standardization
Why do we need Standardization? The units of measurement differ from variable to variable, so standardization is a must.
E.g. consider two variables, Age and Income. The unit of Age is years and the unit of Income is, say, $. Hence they are not comparable, and without standardization there is no common unit of measurement for the distance between two clusters.
Generally we standardize by making the mean = 0 and variance = 1.
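A small Python sketch of this standardization, using the Age and Income example. The values are hypothetical, and using the sample standard deviation (n - 1 in the denominator) is an assumption about which variance is meant.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale a variable to mean 0 and (sample) variance 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, n - 1 in the denominator
    if s == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(v - m) / s for v in values]


age = [25, 35, 45, 55]                 # years
income = [20000, 40000, 60000, 80000]  # $
z_age = standardize(age)
z_income = standardize(income)
```

After standardization both variables are expressed in standard deviations from their own mean, so a one-unit difference in Age and a one-unit difference in Income contribute equally to a distance. In this example the two standardized columns come out identical, even though the raw scales differ by a factor of 2000.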
Step 8: Getting the Cluster Solution
Cluster Process: In SAS the two most commonly used procedures are Proc Cluster and Proc Fastclus.
  Proc Cluster: Single Linkage, Complete Linkage, Two Stage, etc.
  Proc Fastclus: K-Means
What is K-Means: The process starts with K distinct observations that are at the greatest distance from each other. Each of the remaining observations is then considered one by one and attached to the nearest cluster. If, in the course of this, two clusters come significantly close to each other, they are merged to form a new cluster.
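To make the idea concrete, here is a minimal Python sketch of the textbook batch k-means iteration with a greedy farthest-point choice of seeds. It is not the Proc Fastclus algorithm: it performs no merging of close clusters, and the seeding rule and the sample points are assumptions made for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_seeds(points, k):
    """Greedy seeding: start at the first point, then repeatedly add the
    point farthest from the seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


def k_means(points, k, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its members."""
    centroids = [tuple(s) for s in farthest_point_seeds(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Hypothetical standardized (income, balance) pairs forming two groups.
points = [(-1.0, 1.1), (-0.9, 0.9), (-1.1, 1.0),
          (1.0, -1.0), (0.9, -1.1), (1.1, -0.9)]
labels, centroids = k_means(points, k=2)
```

On these six points the first three observations end up in one cluster and the last three in the other, with centroids near (-1, 1) and (1, -1).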
Cluster Process: After cleaning the data set of outliers, any of the above procedures can be used to build clusters. There is no hard and fast rule on the number of clusters or on cluster sizes, but as a rule of thumb each cluster should hold at least 5% of the observations and the total number of clusters should be between 5 and 15.
Some of the variables present in the data set are to be used for clustering. These variables must be numeric. They may be continuous or discrete, but if discrete there must be an ordinality among the categories.
The goodness of a particular set of clusters is measured by the extent to which the means of the clustering variables differ from one cluster to another.
Profiling the Clusters: After building the clusters, they are to be profiled with respect to discrete and continuous variables to identify the distinguishing features of each cluster.
Step 9: Evaluating the Cluster Solution
Statistic/Measure | Meaning | Ideal value
Variable R Square | Between Variation / Total Variation | >= 0.3
Overall R Square | Avg(Var R Square); 1 - Avg[WithinVariance(Var1), WithinVariance(Var2), ...] | >= 0.6
Approximate Expected Overall R Square | Similar to above, different formula; calculated assuming variables are independent | Close to Overall R Square (diff <= 0.1)
RMS STD | For each cluster: Sqrt{Avg[Variance(Var1), Variance(Var2), ...]} | <= 1.1
Distance Between Cluster Centroids | How close or how far apart the cluster centroids are | >= 1.5
Maximum distance from seed to observation | "Dispersion" within each cluster | Relative; roughly uniform across clusters
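The R Square and RMS STD measures can be computed by hand, as in the Python sketch below. It uses R Square = 1 - within-cluster sum of squares / total sum of squares, which is the same as Between Variation / Total Variation. The pooled overall figure coincides with Avg(Var R Square) only when every variable has the same total variation, as is the case after standardization. The four data rows are hypothetical.

```python
def cluster_fit(rows, labels):
    """Per-variable R-square, overall R-square and per-cluster RMS Std."""
    n_vars = len(rows[0])
    grand = [sum(r[j] for r in rows) / len(rows) for j in range(n_vars)]
    total_ss = [sum((r[j] - grand[j]) ** 2 for r in rows) for j in range(n_vars)]
    within_ss = [0.0] * n_vars
    rms_std = {}
    for c in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        means = [sum(m[j] for m in members) / len(members) for j in range(n_vars)]
        ss = [sum((m[j] - means[j]) ** 2 for m in members) for j in range(n_vars)]
        within_ss = [w + s for w, s in zip(within_ss, ss)]
        # Square root of the average (sample) within-cluster variance.
        denom = max(len(members) - 1, 1) * n_vars
        rms_std[c] = (sum(ss) / denom) ** 0.5
    var_r2 = [1 - w / t for w, t in zip(within_ss, total_ss)]
    overall_r2 = 1 - sum(within_ss) / sum(total_ss)
    return var_r2, overall_r2, rms_std


# Hypothetical data: the first variable separates the two clusters, the second does not.
rows = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
var_r2, overall_r2, rms_std = cluster_fit(rows, labels)
```

Here the first variable gets an R Square of 1 and the second an R Square of 0, which is the pattern the Variable R Square threshold is meant to expose: a variable with a low value contributes little to telling the clusters apart.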
Proc Factor Proc Cluster Proc Fastclus
Typical SAS codes

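/* Factor analysis: principal components with varimax rotation */
proc factor data=t5 method=prin nfactors=10 rotate=varimax out=final1;
run;

/* Standardize the clustering variables to mean 0, std 1 */
proc standard data=out1 out=out2 mean=0 std=1;
  var amt_fin term dep_per age mon_book;
run;

/* Hierarchical clustering (complete linkage) on the standardized data out2 */
proc cluster data=out2 method=complete;
  var amt_fin term dep_per age mon_book;
run;

/* K-means on the standardized data; out3 is an assumed output name */
proc fastclus data=out2 out=out3 maxc=120 maxiter=100 delete=1200 short;
  var amt_fin term dep_per age mon_book;
run;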
Scoring
Minimum Euclidean Distance Method
[Figure: scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2, Cluster 3 and a New Observation.]
A new observation is assigned to the cluster whose centroid is at the minimum Euclidean distance from it. In this plot, the New Observation will be a member of Cluster 1.
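The scoring rule can be sketched in Python as a nearest-centroid lookup. The centroid coordinates and the new observation below are hypothetical and are not read off the deck's plot; in a real implementation the new observation would first be standardized with the same means and standard deviations used when the clusters were built.

```python
import math


def nearest_cluster(observation, centroids):
    """Return (cluster name, distance) for the centroid closest to the observation."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    name = min(centroids, key=lambda c: euclidean(observation, centroids[c]))
    return name, euclidean(observation, centroids[name])


# Hypothetical centroids in (Var 1, Var 2) space.
centroids = {
    "Cluster 1": (10.0, 60.0),
    "Cluster 2": (40.0, 20.0),
    "Cluster 3": (25.0, 45.0),
}
new_observation = (12.0, 55.0)
cluster, distance = nearest_cluster(new_observation, centroids)
```

With these assumed centroids the new observation is closest to Cluster 1, so it is scored into that cluster.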
The SAS code for implementation
The Cluster Analysis Output
Thank You
