
ISSS602 DATA ANALYTICS LAB

ASSIGNMENT 2: BE CUSTOMER WISE OR OTHERWISE


OBJECTIVE
To create homogeneous customer segments for Supersales Store, based on (1) the customers' shipping patterns,
(2) product mix, and (3) purchasing behavior (via the RFM model).

Prepared for: Instructor - Dr. Kam Tin Seong

Prepared by: Ong Han Ying (G1)
Hanying.ong.2015@mitb.smu.edu.sg

Date: 5 October 2015

Page 1 of 34

Table of Contents

SECTION 1: EXECUTIVE SUMMARY

SECTION 2: DATA PREPARATION PART 1 DATA STUDY & REVIEW
2.1 DATA ACCURACY OF SUPERSTORE SALES DATA
2.1.1 MODIFYING COLUMN INFORMATION OF SUPERSTORESALES DATA
2.1.2 CHECK ON UNIQUENESS OF DATA
2.2 DATA CONSISTENCY OF SUPERSTORESALES DATA
2.2.1 CHECKING AND FORMATTING THE NUMBER OF DATA FOR CHARACTER DATA IN SUPERSTORESALES DATA
2.2.2 CHECKING AND CORRECTING NAMING ERROR CHARACTER DATA IN SUPERSTORESALES DATA
2.3 DATA COMPLETENESS OF SUPERSTORESALES DATA
2.4 REORGANIZATION OF DATASET WITH CUSTOMER NAME AS UNIQUE KEY
2.5 CREATING DERIVED VARIABLES FOR ANALYSIS

SECTION 3: ANALYSIS PROCEDURE - PREPARE & SELECT DATA FOR CLUSTER ANALYSIS
3.1 SELECTING CLUSTERING VARIABLES
3.2 EXAMINE DISTRIBUTION & PERFORM VARIABLE STANDARDIZATION AND TRANSFORMATION
3.2.1 RFM VARIABLES
3.2.2 SHIPPING PATTERN ORDER PRIORITY
3.2.2 SHIPPING PATTERN SHIPPING MODE
3.2.3 PRODUCT MIX PRODUCT CONTAINERS & PRODUCT CATEGORY
3.3 CHECK FOR MULTI-COLLINEARITY AMONG VARIABLES
3.4 FINAL VARIABLES USED FOR CLUSTER ANALYSIS

SECTION 4: ANALYSIS PROCEDURE CONDUCT CLUSTERING EXERCISE
4.1 INTERPRETING CLUSTERS

SECTION 5: ANALYSIS PROCEDURE INTERPRETATION CLUSTER RESULTS
5.1 PROFILING CLUSTERING
5.1.1 ANALYSIS FROM CLUSTERING VARIABLES
5.1.2 ANALYSIS FROM NON- CLUSTERING (INDEPENDENT) VARIABLES

SECTION 6: ANALYSIS RESULTS AND RECOMMENDATION


SECTION 1: EXECUTIVE SUMMARY


1. Superstore Sales' customer profile
a. Overall, 5 distinct customer clusters are identified by analyzing the transactional data for
information on purchasing behavior, shipping pattern, and product mix.
2. Of all clusters, customers in cluster 5 generate the most revenue and orders; but the company needs to note
that
a. Cluster 5 ranks only 3rd in terms of recency. This may imply that the loyalty of this
valuable cluster is eroding over time.
b. It also contains the largest number of customers with negative profit.
3. Although clusters 2 & 3 rank next after cluster 5 in value, they generate
less revenue and buy less than the other clusters. Customers in these 2 clusters may be new
customers.
4. Product Mix
a. Office supplies are the top product category in every cluster.
b. All clusters prefer the use of small boxes.
5. Shipping Pattern
a. Shipping by regular air is preferred by all clusters.
b. Delivery Mode Truck
i. It is affected by the size of the containers. Hence, as long as jumbo drum and jumbo box
are available as product containers, the company needs to invest in delivery by
truck.
6. Demographic
a. Most customers bought items for corporate use and are found in the Central region. The distribution in
all 5 clusters is similar.
7. Data Collection
a. The representation of the data can be improved to include information such as brand, which is
currently embedded in the product name.
b. The company also needs a system that issues a customer account ID to each customer, so
that all customers, even those with the same name, are analysed uniquely.


SECTION 2: DATA PREPARATION PART 1 DATA STUDY & REVIEW


2.1 DATA ACCURACY OF SUPERSTORE SALES DATA

The overview of the change(s) made to the table is as below (X = column information edited);

Field Name              Edit Data Type   Edit Modeling Type   Edit Format
Row ID                  X                X
Order ID                X                X
Order Date
Order Priority
Order Quantity
Sales                                                         X
Discount                                                      X
Ship Mode
Profit                                                        X
Unit Price                                                    X
Shipping Cost                                                 X
Customer Name
City
Zip Code                X                X
State
Region
Customer Segment
Product Category
Product Sub-Category
Product Name
Product Container
Product Base Margin
Ship Date

2.1.1 Modifying Column Information of SuperstoreSales Data

                 Before Update                           After Update
Field Name       Data Type   Modeling Type   Format      Data Type   Modeling Type   Format
Row ID           Numeric     Continuous      Best        Character   Nominal
Order ID         Numeric     Continuous      Best        Character   Nominal
Sales            Numeric     Continuous      Best        Numeric     Continuous      Currency (SGD)
Discount         Numeric     Continuous      Best        Numeric     Continuous      Currency (SGD)
Profit           Numeric     Continuous      Best        Numeric     Continuous      Currency (SGD)
Unit Price       Numeric     Continuous      Best        Numeric     Continuous      Currency (SGD)
Shipping Cost    Numeric     Continuous      Best        Numeric     Continuous      Currency (SGD)
Zip Code         Numeric     Continuous      Best        Character   Nominal
2.1.2 Check on Uniqueness of Data

For transactional data, Row ID is unique, which indicates that no line transaction is repeated.
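
The report performs this check in JMP. As an illustration only, the same uniqueness test can be sketched in Python; the function name `duplicated_keys` and the sample rows are hypothetical and are not part of the original dataset.

```python
from collections import Counter


def duplicated_keys(rows, key="Row ID"):
    """Return the values of `key` that occur more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample rows standing in for the transaction table.
sample_rows = [
    {"Row ID": "1", "Order ID": "3"},
    {"Row ID": "2", "Order ID": "6"},
    {"Row ID": "3", "Order ID": "6"},
]

duplicates = duplicated_keys(sample_rows)
print("Row ID is unique" if not duplicates else f"Repeated Row IDs: {duplicates}")
```

An empty result means every line transaction appears once. The same helper applied to Order ID would list order IDs that span several lines, which is expected for transactional data.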

2.2 DATA CONSISTENCY OF SUPERSTORESALES DATA

The overview of the change(s) made to the table is as below;

*Fields with more than 100 categories are not checked or cleaned, as a manual check is subject to high human error.
Such information is treated as a description and is not used for categorical / distribution analysis.

Field Name             Data Type   Modeling Type   Actionable            Rationale
                       (after update)
Row ID                 Character   Nominal         No Action
Order ID               Character   Nominal         Check no. of digits   Any ID used as a business identifier should contain the same number of digits/characters.
Order Priority         Character   Nominal         Check naming error
Ship Mode              Character   Nominal         Check naming error
Customer Name          Character   Nominal         No Action             More than 100 categories
City                   Character   Nominal         No Action             More than 100 categories
Zip Code               Character   Nominal         Check no. of digits   All Zip Codes should contain the same number of digits.
State                  Character   Nominal         Check naming error
Region                 Character   Nominal         Check naming error
Customer Segment       Character   Nominal         Check naming error
Product Category       Character   Nominal         Check naming error
Product Sub-Category   Character   Nominal         Check naming error
Product Name           Character   Nominal         No Action             More than 100 categories
Product Container      Character   Nominal         Check naming error

2.2.1 Checking and formatting the Number of Data for Character Data in SuperstoreSales Data

Field Name: Order ID
Checking No. of Digits:
Step 1: Create a new column, No. of Char(Order ID).
Step 2: Identify the no. of digits in Order ID.
Results: There are order IDs that contain up to 5 digits. Reformat all of them to contain 5 characters.
Actionable: Create a new field, Order ID (Formatted), such that all order IDs contain 5 digits.
After Actionable: A new field, Order ID (Formatted), is created. Another new column is then created to check the number of characters.
[Figure: number of characters in Order ID (Formatted)]

Field Name: Zip Code
Checking No. of Digits:
Step 1: Create a new column, No. of Char(Zipcode), to count the digits in Zip Code.
Step 2: Identify the no. of digits in Zip Code.
Results: There are zip codes with 4 digits, which need to be reformatted into 5 digits.
Actionable: Create a new field, Zipcode (Formatted), such that all zip codes contain 5 digits.
After Actionable: A new field, Zipcode (Formatted), is created. Another new column is then created to check the number of characters.
[Figure: number of characters in Zipcode (Formatted)]
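
The formatting step above amounts to left-padding each identifier with zeros to a fixed width of 5. The report does this in JMP; the sketch below shows the same idea in Python. The helper `pad_identifier` and the sample values are hypothetical.

```python
def pad_identifier(value, width=5):
    """Left-pad a numeric identifier with zeros so it has exactly `width` characters."""
    text = str(value).strip()
    if not text.isdigit():
        raise ValueError(f"identifier is not purely numeric: {value!r}")
    if len(text) > width:
        raise ValueError(f"identifier longer than {width} characters: {value!r}")
    return text.zfill(width)


# Hypothetical examples: order IDs of mixed length and a 4-digit zip code.
order_ids = [3, 293, 59937]
zip_codes = ["2138", "55372"]

formatted_orders = [pad_identifier(v) for v in order_ids]
formatted_zips = [pad_identifier(v) for v in zip_codes]

# Equivalent of the follow-up character-count check on the formatted fields.
lengths = {len(v) for v in formatted_orders + formatted_zips}
print(formatted_orders, formatted_zips, lengths)
```

Storing the result as text rather than as a number is what preserves the leading zeros, which matches the earlier change of both fields to Character / Nominal.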

2.2.2 Checking and correcting Naming Error Character Data in SuperstoreSales Data
Using Tabulate, the categories of each field were identified. No naming errors or near-duplicate
categories were spotted; thus, no re-coding is required.


2.3 DATA COMPLETENESS OF SUPERSTORESALES DATA

To ensure that the data is complete, we conduct a check via a summary of the columns, as below;

1. Observation: Only Product Base Margin has 63 missing fields.
   Impact: If the column is used for analysis, it will affect the consistency of the analysis.
2. Observation: The MIN values of shipping date & order date are near each other. The data may contain errors where shipping date < order date.
   Impact: If shipping date < order date is true for any row, the erroneous data needs to be removed.
3. Observation: Negative profit is detected.
   Impact: Need to examine whether it is a data error or a real business loss.
4. Observation: The range of sales is wide.
5. Observation: The range of order quantity is wide.
6. Observation: The range of unit price is wide. This can be explained by the wide range of products (n of product name = 1263).
7. Observation: The range of shipping cost is wide. This can be explained by the different shipping modes (n = 3) and product containers (n = 7).
   Impact (4 to 7): If distinct outliers are observed, they will skew the overall analysis results. The distributions need to be examined in the analysis later.

The proposed actionables are as below;

1. Proposed Actionable: Keep the missing data, and observe the impact of the missing input on our analysis.
   Rationale & Results: To analyze the product mix, one factor that we will consider is products that generate a high profit margin (low product base margin). Technically, we should remove the rows with missing data. However,
   - transactional records are tied to the order and customer, so removing them may dilute future analysis on the total sum spent per order;
   - the missing data comes from one brand, SAFCO, only.

2. Proposed Actionable: An additional column, Shipping < Order, is created to check.
   Rationale & Results: A check is conducted, and no transaction shows a shipping date earlier than the order date. Hence, no data is removed.
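
The two completeness checks above (missing Product Base Margin and shipping date earlier than order date) can be sketched in Python as follows. This is only an illustration of the logic, not the JMP procedure used in the report; the function names and the sample rows are hypothetical.

```python
from datetime import date


def count_missing(rows, field):
    """Count rows where `field` is absent, None or an empty string."""
    return sum(1 for row in rows if row.get(field) in (None, ""))


def shipped_before_ordered(rows):
    """Return rows whose Ship Date is earlier than their Order Date."""
    return [row for row in rows if row["Ship Date"] < row["Order Date"]]


# Hypothetical sample rows; dates and margins are made up for illustration.
sample_rows = [
    {"Order Date": date(2012, 1, 5), "Ship Date": date(2012, 1, 7), "Product Base Margin": 0.56},
    {"Order Date": date(2012, 3, 1), "Ship Date": date(2012, 3, 1), "Product Base Margin": None},
]

missing_margin = count_missing(sample_rows, "Product Base Margin")
violations = shipped_before_ordered(sample_rows)
print(missing_margin, len(violations))
```

An empty `violations` list corresponds to the report's finding that no row needs to be removed.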

Next, we also examine whether any business logic is violated;

1. Observation: 1 order ID has multiple customer names.
   Rationale & Actionable: Create a unique identifier, Unique Identifier. It combines
   Order ID Formatted with Customer Name.

2. Observation: Multiple zip codes link to a state.
   Rationale & Actionable: Using the following link, it is noticed that multiple zip codes can
   be linked to different city names.
   https://tools.usps.com/go/ZipLookupAction!input.action
   As such, no further action is taken. However, City,
   instead of zip code, will be used for further analysis in this
   report, since it is the lowest level of detail for the shipping
   location.
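
For the first observation above, the check and the combined key can be sketched in Python. The separator `" | "`, the function names and the sample customers are assumptions made for illustration; the report does not state how the two fields are joined.

```python
from collections import defaultdict


def orders_with_multiple_customers(rows):
    """Map each order ID to its customer names, keeping only IDs shared by 2+ customers."""
    customers = defaultdict(set)
    for row in rows:
        customers[row["Order ID (Formatted)"]].add(row["Customer Name"])
    return {oid: sorted(names) for oid, names in customers.items() if len(names) > 1}


def unique_identifier(row, sep=" | "):
    """Combine the formatted order ID and customer name into one key (separator is assumed)."""
    return f'{row["Order ID (Formatted)"]}{sep}{row["Customer Name"]}'


# Hypothetical rows: order 00003 is shared by two customers.
rows = [
    {"Order ID (Formatted)": "00003", "Customer Name": "Customer A"},
    {"Order ID (Formatted)": "00003", "Customer Name": "Customer B"},
    {"Order ID (Formatted)": "00006", "Customer Name": "Customer A"},
]

shared = orders_with_multiple_customers(rows)
identifiers = [unique_identifier(row) for row in rows]
print(shared, identifiers)
```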

Reference
Working Copy - With Validation, Hidden & Exceptional Fields
File name: CLEAN DATA_WORKING_COPY 01 SuperstoreSalesData


2.4 REORGANIZATION OF DATASET WITH CUSTOMER NAME AS UNIQUE KEY

To prepare the data to meet the objective of this paper, we reorganize the dataset so that
customer name is the primary key, in the absence of a customer account number. Here, we assume
that each customer name refers to a distinct customer account. As such, Row ID is hidden and
excluded.

No.   Related Field Name(s)
1.    Order ID (Formatted)
2.    Order Date & Shipped Date
3.    Discount
4.    Profit & Sales
5.    Unit Price, Product Base Margin & Shipping Cost
6.    Product Name, Product Sub-Category, Product Category, and Product Container
7.    Order Quantity
8.    Shipping Mode, Order Priority & Customer Segment
9.    City, State, Region

[Figures: observation for each group of related fields]

For each of the observation(s) above, actionables are proposed & implemented;

1. Order ID (Formatted)
   Rationale: To count the number of orders per customer.
   Procedure & Outcome:
   Step 1: Create a tabulate, make it into a new data table, and save it with the file name CLEAN DATA_WORKING_COPY 02 SuperstoreSalesData.
   Step 2: From the new data table, create another tabulate, make it into another new data table, and rename the field n to No. of Orders. Save the file as CLEAN DATA_WORKING_COPY 03 SuperstoreSalesData.

2. Order Date & Ship Date
   Rationale: To calculate the duration between Shipped & Ordered (Lead Time), and to identify the MIN & MAX order date and the MIN & MAX lead time for each customer.
   Procedure & Outcome: A new field, Lead Time, is created. A tabulate is created and saved as Tabulate - Working01 by Cust-Name.

3. Discount
   Rationale: Discount is ambiguous.
   Procedure & Outcome: Data will be dropped from customer segmentation analysis.

4. Profit & Sales
   Rationale: Add MIN, MAX, Range & SUM of both Profit & Sales by customer.
   Procedure & Outcome:
   Step 1: Open Tabulate - Working01 by Cust-Name and add MIN, MAX, Range & Sum of Profit and Sales to the tabulate.
   Step 2: Save the tabulate as Tabulate - Working02 by Cust-Name.

5. Unit Price, Product Base Margin & Shipping Cost
   Rationale: These data seem to be independent of the customer, and dependent on the product name.
   Procedure & Outcome: All 3 fields will be dropped from customer segmentation analysis.

6. Product Name, Product Sub-Category, Product Category, and Product Container
   Rationale: Product Name and Product Sub-Category are subsets of Product Category, with high variety per customer; thus, it is difficult to draw meaningful data from them. Distinct categories are found in Product Category and Product Container.
   Procedure & Outcome:
   Step 1: Add Product Category & Product Container to Tabulate - Working02 by Cust-Name, with N & % Row.
   Step 2: Save the tabulate as Tabulate - Working03 by Cust-Name.

7. Order Quantity
   Rationale: Data is ambiguous. With the unit of measure missing, we are unable to determine whether an order is a bulk purchase or a single item.
   Procedure & Outcome: Data will be dropped from customer segmentation analysis.

8. Shipping Mode, Order Priority & Customer Segment
   Rationale: Distinct categories are found in Shipping Mode & Customer Segment.
   Procedure & Outcome:
   Step 1: Add Shipping Mode, Order Priority & Customer Segment to Tabulate - Working03 by Cust-Name.
   Step 2: Save the tabulate as Tabulate - Working04 by Cust-Name.

9. City, State, Region
   Rationale: City and State are subsets of Region, and distinct categories are found in Region.
   Procedure & Outcome: Drop City and State from customer segmentation analysis.
   Step 1: Add Region to Tabulate - Working04 by Cust-Name.
   Step 2: Save the tabulate as Tabulate FINAL by Customer Name.

Next, Tabulate FINAL by Customer Name is made into a data table, joined with CLEAN
DATA_WORKING_COPY 01 SuperstoreSalesData, and saved as CLEAN DATA_FINAL_COPY
SuperstoreSalesData.
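
The tabulate steps above roll the line transactions up to one row per customer. The report does this with JMP's Tabulate; the Python below is only a sketch of the same roll-up for a few of the fields (number of orders, sales statistics, and Row % by Ship Mode). The function name, the output keys and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def summarise_by_customer(rows):
    """Aggregate line transactions into one summary record per customer."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Customer Name"]].append(row)

    summary = {}
    for name, lines in grouped.items():
        sales = [line["Sales"] for line in lines]
        modes = Counter(line["Ship Mode"] for line in lines)
        summary[name] = {
            # An order can span several lines, so count distinct order IDs.
            "No. of Orders": len({line["Order ID (Formatted)"] for line in lines}),
            "Sum(Sales)": sum(sales),
            "Min(Sales)": min(sales),
            "Max(Sales)": max(sales),
            "Range(Sales)": max(sales) - min(sales),
            # Row %: share of this customer's lines shipped by each mode.
            "Row % Ship Mode": {mode: 100.0 * n / len(lines) for mode, n in modes.items()},
        }
    return summary


# Hypothetical line transactions for two customers.
rows = [
    {"Customer Name": "Customer A", "Order ID (Formatted)": "00001", "Sales": 100, "Ship Mode": "Regular Air"},
    {"Customer Name": "Customer A", "Order ID (Formatted)": "00001", "Sales": 50, "Ship Mode": "Express Air"},
    {"Customer Name": "Customer A", "Order ID (Formatted)": "00002", "Sales": 250, "Ship Mode": "Regular Air"},
    {"Customer Name": "Customer A", "Order ID (Formatted)": "00002", "Sales": 20, "Ship Mode": "Regular Air"},
    {"Customer Name": "Customer B", "Order ID (Formatted)": "00003", "Sales": 80, "Ship Mode": "Delivery Truck"},
]

summary = summarise_by_customer(rows)
print(summary["Customer A"])
```

The same pattern extends to Profit, Order Priority, Product Category and Product Container by adding further aggregates inside the loop.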
The summary of the actionables is as below;

Original Field Name          In Use?   Application for Customer Perspective
Row ID                       Dropped
Order ID (Formatted)         Yes       RFM (F)
Order Date                   Yes       RFM (R)
Order Priority               Yes       Shipping Pattern
Order Quantity               Dropped
Sales                        Yes       RFM (M)
Discount                     Dropped
Ship Mode                    Yes       Shipping Pattern
Profit                       Yes
Unit Price                   Dropped
Shipping Cost                Dropped
Customer Name (Unique)       Yes
City                         Dropped
Zip Code                     Dropped
State                        Dropped
Region                       Yes
Customer Segment (Sector)    Yes
Product Category             Yes       Product Mix
Product Sub-Category         Dropped
Product Name                 Dropped
Product Container            Yes       Product Mix
Product Base Margin          Dropped
Ship Date                    Yes

Profile
Analysis

X
X

2.5 CREATING DERIVED VARIABLES FOR ANALYSIS

No.   New Field        Objective & Formula
1.    R - Order Date   To define the Recency variable for the analysis.
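
The formula for the Recency variable is not reproduced in this copy of the report. One common way to define recency, shown here purely as an assumption, is the number of days between a chosen reference date and each customer's most recent order. The function name, the reference date and the sample orders below are all hypothetical.

```python
from datetime import date


def recency_by_customer(orders, reference_date):
    """Days between `reference_date` and each customer's latest order (assumed definition)."""
    latest = {}
    for customer, order_date in orders:
        if customer not in latest or order_date > latest[customer]:
            latest[customer] = order_date
    return {customer: (reference_date - d).days for customer, d in latest.items()}


# Hypothetical (customer, order date) pairs.
orders = [
    ("Customer A", date(2012, 11, 3)),
    ("Customer A", date(2012, 12, 22)),
    ("Customer B", date(2012, 6, 15)),
]

recency = recency_by_customer(orders, reference_date=date(2013, 1, 1))
print(recency)
```

Under this definition a smaller value means a more recent purchase, which is in line with the later remark in Section 5 that a smaller recency figure is better.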

Reference
Final Copy - With Customer Name as Unique Key
File name: CLEAN DATA_FINAL_COPY SuperstoreSalesData

SECTION 3: ANALYSIS PROCEDURE - PREPARE & SELECT DATA FOR CLUSTER ANALYSIS
3.1 SELECTING CLUSTERING VARIABLES

Only the fields below are used for cluster analysis; the other fields are used for profiling the clusters
after the clusters are formed.


3.2 EXAMINE DISTRIBUTION & PERFORM VARIABLE STANDARDIZATION AND TRANSFORMATION

To apply cluster analysis, the distributions of the variables used should not contain too many outliers and should
not be too skewed. Therefore, all the selected variables are examined and, if required, are
standardized or transformed.
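
The report examines each distribution visually in JMP. As a rough numerical counterpart, the sketch below computes moment skewness and flags outliers with the 1.5 x IQR rule. Both the choice of these two measures and the sample values are assumptions for illustration, not the report's own criteria.

```python
import statistics


def skewness(values):
    """Moment skewness m3 / m2**1.5; positive means a longer right tail.

    Raises ZeroDivisionError if all values are identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical variable with one extreme value.
values = [1, 2, 2, 3, 3, 3, 4, 4, 50]
print(round(skewness(values), 2), iqr_outliers(values))
```

A variable that shows a large absolute skewness or several flagged values would be a candidate for the transformations described in the following subsections.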

3.2.1 RFM Variables

Fields examined: R - Order Date, F - No. of Orders, M - Sum(Sales).

[Figures: current distribution and transformed distribution of each field]

Analysis (current distribution): Distinct outliers are detected in all 3 variables, and they are left-skewed. Thus, Johnson's Transformation is applied to help reduce skewness.

Analysis (outcome): The result of the transformation is good for R - Order Date and M - Sum(Sales), where the data is now normally distributed with no distinct outlier. F - No. of Orders does not fare as well: the data is slightly right-skewed, but with no distinct outlier. Hence, all 3 transformed variables are accepted as the new variables for R, F & M.

3.2.2 Shipping Pattern Order Priority

Fields examined: Row% (Critical) Order Priority, Row% (High) Order Priority, Row% (Low) Order Priority, Row% (Medium) Order Priority, Row% (Not Specified) Order Priority.

[Figures: current distribution, current distribution + 1, Transformation 1 and Transformation 2 of each field]

Analysis: The data is highly skewed to the right, with distinct outliers. As the log of 0% is undefined, a constant of 1 is added to all the variables.

Rationale: To normalize the data, we log the data. After the first transformation, the data is still right-skewed. Therefore, we transform it again by
taking the square root of the data. After this, outliers are reduced and the data shows less skewness.
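
The two-step transformation described above, a log of (Row % + 1) followed by a square root, can be sketched in Python. The use of the natural logarithm is an assumption, since the report does not state the base; the function name and sample percentages are hypothetical.

```python
import math


def log_sqrt_transform(row_percent):
    """Apply sqrt(log(x + 1)) to a Row % value; natural log is assumed."""
    if row_percent < 0:
        raise ValueError("Row % cannot be negative")
    # Adding 1 keeps the log defined at 0%, where log(0 + 1) = 0.
    return math.sqrt(math.log(row_percent + 1))


# Hypothetical Row % values for one field across several customers.
row_percents = [0.0, 5.0, 20.0, 100.0]
transformed = [log_sqrt_transform(p) for p in row_percents]
print([round(t, 3) for t in transformed])
```

Because both steps are increasing functions, the ordering of customers is preserved while large values are compressed, which is what reduces the right skew.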

3.2.2 Shipping Pattern Shipping Mode

Fields examined: Row % (Delivery Truck) Shipping Mode, Row % (Express Air) Shipping Mode, Row % (Regular Air) Shipping Mode.

[Figures: current distribution, current distribution + 1, Transformation 1 and Transformation 2 of each field]

Analysis: The data is highly skewed, with distinct outliers. As the log of 0% is undefined, a constant of 1 is added to all the variables.

Rationale: To normalize the data, we log the data. After the first transformation, the
data is still right-skewed. Therefore, we transform it
again by taking the square root of the data. After this, outliers are reduced and the
data shows less skewness.
Where the data still shows skewness and outliers after the log and square-root steps,
Johnson Sl Transformation is applied, and the
distribution's skewness and outliers are greatly reduced.

3.2.3 Product Mix Product Containers & Product Category

Fields examined (Product Container): Row % (Jumbo Box), Row % (Jumbo Drum), Row % (Large Box), Row % (Medium Box), Row % (Small Box), Row % (Wrap Bag), Row % (Small Pack).
Fields examined (Product Category): Row% (Furniture), Row% (Technology), Row% (Office Supplies).

[Figures: current distribution, current distribution + 1, Transformation 1 and Transformation 2 of each field]

Jumbo Box, Jumbo Drum, Large Box, Medium Box and Small Box
Analysis: Where the data is highly skewed, with distinct outliers, a constant of 1 is added to
all the variables, as the log of 0% is undefined. Where the data shows few outliers and is not too skewed, no transformation is done.
Rationale: To normalize the skewed data, we log the data. After the first transformation, the data is still right-skewed.
Therefore, we transform it again by taking the square root of the data. After this, outliers are reduced and the data shows
less skewness.

Wrap Bag, Small Pack, Furniture, Technology and Office Supplies
Analysis: The data is highly skewed, with distinct outliers. As the log of 0% is undefined, a constant of
1 is added to all the variables. The other 2 fields are highly skewed, with distinct outliers, and require transformation.
Rationale: Johnson Sl Transformation is applied, and the distribution's skewness and outliers are greatly
reduced. Where the data is quite normally distributed, with few outliers, no transformation is
required.
3.3 CHECK FOR MULTI-COLLINEARITY AMONG VARIABLES

Next, to ensure that the collinearity among the variables is low, multivariate analysis is conducted.
Overall, the collinearity among the variables is not very strong (all correlations fall below 0.71).
Refer to Multivariate - Cluster Variables in the JMP file.
Nevertheless, the following is observed;
(1) Square Root(Log(Row%(Delivery Truck) Shipping Mode + 1)) appears 5 times among the
variable pairs that showed high collinearity.

Thus, Square Root(Log(Row%(Delivery Truck) Shipping Mode + 1)) will be excluded from the clustering.
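
The screening logic above, counting how often each variable appears in a highly correlated pair, can be sketched in Python. The report uses JMP's Multivariate platform and does not state the cut-off it treats as "high"; the threshold of 0.7, the variable names and the sample columns below are therefore hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined for constant input)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def high_correlation_counts(columns, threshold):
    """Count, per variable, the pairs whose absolute correlation reaches `threshold`."""
    counts = Counter()
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        if abs(pearson(a, b)) >= threshold:
            counts[name_a] += 1
            counts[name_b] += 1
    return counts


# Hypothetical transformed variables for five customers.
columns = {
    "truck": [1, 2, 3, 4, 5],
    "jumbo_drum": [1, 3, 2, 5, 4],
    "jumbo_box": [2, 1, 4, 3, 5],
}

counts = high_correlation_counts(columns, threshold=0.7)
print(counts.most_common())
```

The variable with the highest count is the natural candidate to drop, since removing it resolves the most correlated pairs at once.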

3.4 FINAL VARIABLES USED FOR CLUSTER ANALYSIS

In all, 20 variables will be used for clustering, as below;

SECTION 4: ANALYSIS PROCEDURE CONDUCT CLUSTERING EXERCISE


4.1 INTERPRETING CLUSTERS
A. Hierarchical Clustering
This exercise helps to propose the number of clusters suitable for k-means. The
results of hierarchical clustering are as below;

[Figures: number of clusters formed under Ward, Centroid, Average, Single Linkage and Complete Linkage]

As shown by the results, the 5 linkage methods give different results, with Ward &
Complete Linkage producing more evenly distributed clusters. As the results conflict, we will use the suggested
number of clusters and apply K-means instead.
B. K-Means
Using K-means analysis, the results are as below;

On closer examination, the distribution of the clusters is as below;

Statistically, the recommended number of clusters is 3. However, when n=3, one cluster
(cluster 2 = 408) contains more than 50% of the customers. Therefore, it may not help us to
understand our customer segments better. As such, we consider the next recommended n value, the one with
the 2nd highest CCC value, n = 5.
With n=5, the customer distribution is more evenly spread, with cluster 5 at about 37%. Thus, n=5, or 5
customer clusters, should be formed.
With this, we updated the dataset with a new field, cluster.
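
The selection rule applied above can be written down as: take the candidate with the highest CCC value, unless its largest cluster holds more than half of the customers, in which case move to the next highest CCC. The sketch below illustrates that rule. All CCC values and cluster sizes in it are hypothetical, except that the k = 3 case includes a cluster of 408 above 50% and the k = 5 case has its largest cluster at 37%, as described in the text.

```python
def choose_cluster_count(candidates, max_share=0.5):
    """Pick the k with the highest CCC whose largest cluster share is at most `max_share`.

    `candidates` maps k -> (ccc_value, list_of_cluster_sizes).
    Returns None when no candidate satisfies the share limit.
    """
    by_ccc = sorted(candidates.items(), key=lambda item: item[1][0], reverse=True)
    for k, (_ccc, sizes) in by_ccc:
        if max(sizes) / sum(sizes) <= max_share:
            return k
    return None


# Hypothetical CCC values and cluster sizes (total of 800 customers assumed).
candidates = {
    3: (4.0, [200, 408, 192]),            # largest cluster = 51%
    4: (1.5, [150, 250, 220, 180]),
    5: (3.2, [120, 110, 130, 144, 296]),  # largest cluster = 37%
}

chosen_k = choose_cluster_count(candidates)
print(chosen_k)
```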


SECTION 5: ANALYSIS PROCEDURE INTERPRETATION CLUSTER RESULTS


5.1 PROFILING CLUSTERING
The overview of the clusters is as below;

With the clustering, the following features of the customers are analyzed;

5.1.1 ANALYSIS FROM CLUSTERING VARIABLES

(1) RFM Purchasing Pattern

Analysis:
Recency: the smaller the number, the more recent the purchases;
therefore, the smaller the %, the better.
In all, we can see that the most recent
customers come from cluster 2 (~18%),
but they account for the least no. of
orders (~8%) and revenue (5%).
Cluster 5 gives the company the most
orders (~55%) and revenue
(62%), but ranks 3rd in terms
of recency (~20%).

From the table, we can see that cluster 5
has the highest minimum sales revenue
by customer, and minimum order.

(2) Product Mix

(1) Product Category
Analysis: In all the product categories, we can see that they are mostly bought by cluster 5. For the rest,
- cluster 3 bought the 2nd highest in technology,
- cluster 2 bought the 2nd highest in office supplies,
- cluster 4 bought the 2nd highest in furniture.

(2) Product Category performance in each cluster
Analysis: From the cluster perspective,
- all the clusters bought mainly office supplies,
- cluster 5 bought relatively more office supplies than furniture and technology,
- all clusters except cluster 3 buy more furniture than technology.

(3) Product Container by each Cluster
Analysis: In all, cluster 5 is the biggest user of all container sizes, while cluster 2 is the smallest user of all.
This is in line with the frequency finding, where cluster 2 bought the least amount of items.
It is also observed that cluster 4 dominates most of the bigger containers (Jumbo & medium ~ 20%), after cluster 5.

(4) Product Container in each Cluster
Analysis: In all, all clusters used small boxes most.
This is followed by wrap bag for clusters 1, 2 & 5, small pack for cluster 3, and jumbo drum for cluster 4.

(3) Shipping Pattern

(1) Shipping Mode by Cluster
Analysis: In all, cluster 5 dominates both shipping modes, while cluster 1 is the next highest for regular
air, and cluster 4 is the next highest for express air.

(2) Shipping Mode in each Cluster
Analysis: All clusters show significantly higher usage of regular air as compared to express air.

(3) Order Priority by Cluster
Analysis: In all, the distribution among the 5 order priorities is similar across all
statuses, except for critical, where clusters 3 & 4 are similar in
proportion, with cluster 3 higher by only 0.1%.

(4) Order Priority in each Cluster
Analysis: In each cluster, the types of priority are quite evenly
distributed. Nevertheless, the most frequent order priority differs by cluster:
- Clusters 1 & 3: mostly not specified
- Cluster 2: high priority most frequently
- Cluster 4: medium priority most frequently
- Cluster 5: low priority most frequently.

5.1.2 ANALYSIS FROM NON-CLUSTERING (INDEPENDENT) VARIABLES

(1) Demographic

1. Region (End destination of the products)
Analysis: The distribution of clusters in each region is similar.

2. Region (End destination of the products) in each Cluster
Analysis: In all, most of the customers in all clusters are found in the Central Region, with
- West Region being the 2nd highest region for clusters 2 & 5,
- South Region being the 2nd highest region for clusters 3 & 4,
- East Region being the 2nd highest region for cluster 1.

(2) Customer Segment / Market Sector

1. Analysis: The distribution of clusters in each customer segment is similar.

2. Analysis: In all, most of the customers bought items for corporate use, with
- home office use being the 2nd highest segment for clusters 1, 3 & 5,
- consumer (end user) use being the 2nd highest segment for cluster 2,
- consumer & small business use for cluster 4.

(3) Profitability

1. Total Profit by Cluster
Analysis: In all, we can see that most of the profit came from cluster 5, with cluster 2
generating the least profit.

2. Cross table to examine behavior of profit
Analysis: On closer examination, the minimum profit,
- either by MIN profit by a customer in an order,
- or by MIN of the total profit of a customer,
is negative in all sectors.
Thus, there are customers who are not generating profit for the company.

3. With the above, a new field is added to find the proportion
of customers that generate negative profit in all sectors, as below;
Analysis: As we can observe, clusters 2 & 4 have proportionally more customers with
negative profit; but on further examination, most of the negative-profit
customers by count come from cluster 5.

SECTION 6: ANALYSIS RESULTS AND RECOMMENDATION


(1) Overview;
The overall performance and ranking from the clustering variables are as below;

(2) In the RFM analysis, if we score the clusters with weights, giving the highest weight to R, followed
by F and then M, we can see that segment 5 is the one the company should focus on,
followed by 2 & 3, and then 1.

(Score: 5 - 1st, 4 - 2nd, 3 - 3rd, 2 - 4th, 1 - 5th)

Cluster   R   F   M   Total Score (3R+2F+1M)
1         2   4   3   17
2         5   1   1   18
3         4   2   2   18
4         1   3   4   13
5         3   5   5   24
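
The weighted totals in the score table can be reproduced with a few lines of Python, using the R, F and M scores and the 3R + 2F + 1M weighting given in the report. The variable and function names are only illustrative.

```python
# Weights from the report: Total Score = 3R + 2F + 1M.
WEIGHTS = {"R": 3, "F": 2, "M": 1}

# R, F and M scores per cluster, as listed in the score table (5 = 1st ... 1 = 5th).
scores = {
    1: {"R": 2, "F": 4, "M": 3},
    2: {"R": 5, "F": 1, "M": 1},
    3: {"R": 4, "F": 2, "M": 2},
    4: {"R": 1, "F": 3, "M": 4},
    5: {"R": 3, "F": 5, "M": 5},
}


def total_score(cluster_scores):
    """Weighted sum of a cluster's R, F and M scores."""
    return sum(WEIGHTS[name] * cluster_scores[name] for name in WEIGHTS)


totals = {cluster: total_score(s) for cluster, s in scores.items()}
# Highest total first; ties keep their original cluster order.
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals, ranking)
```

Clusters 2 and 3 tie on 18 points, which is why the report lists them together after cluster 5.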

(3) Product Mix
a. While usage is quite well distributed across the product containers, we can see that all clusters prefer the
use of small boxes.
b. Office supplies top all clusters, with cluster 5 being the cluster that bought the most items of
all clusters.
(4) Shipping Pattern
a. Shipping by regular air is preferred by all clusters.
b. Delivery Mode Truck
i. It is a dependent variable, affected by the size of the containers. Hence, as long as
jumbo drum and jumbo box are available as product containers, the company needs
to invest in delivery by truck.
c. Order priority is well distributed across the clusters, but the most frequent priority differs by
cluster;
i. Clusters 1 & 3: mostly not specified
ii. Cluster 2: high priority most frequently
iii. Cluster 4: medium priority most frequently
iv. Cluster 5: low priority most frequently.
(5) Demographic
a. Most customers bought items for corporate use and are found in the Central region. The distribution in
all 5 clusters is similar.
(6) Customers' net profit
a. 28% of total customers have negative profit, and most of them are found in cluster 5; by
proportion, however, clusters 2 & 4 contain more negative-profit customers within the cluster
than the others.
(7) Representation of Data can be Improved
a. Product Name: it is noticed that the brand of the product is embedded in the product name. As
such, a new field should be created to hold the brand, so that we can
further analyse the performance of each brand in the product mix.
(8) Data Collection - Insufficient information
Area(s) that require more information: Missing customer account number.
Customer account number is a better unique identifier than
customer name, as different customers can share
the same name.
