Isss602 Data Analytics Lab: Assignment 2: Be Customer Wise or Otherwise

ISSS602 DATA ANALYTICS LAB
ASSIGNMENT 2: BE CUSTOMER WISE OR OTHERWISE

OBJECTIVE
To create homogeneous customer segments for Superstore Sales, based on (1) customers' shipping patterns,
(2) product mix, and (3) purchasing behavior (via the RFM model).
Prepared for: Instructor - Dr. Kam Tin Seong
Prepared by: Ong Han Ying, G1, Hanying.ong.2015@mitb.smu.edu.sg
Date: 5 October 2015
Table of Contents
SECTION 1: EXECUTIVE SUMMARY
SECTION 2: DATA PREPARATION PART 1 DATA STUDY & REVIEW
2.1 DATA ACCURACY OF SUPERSTORE SALES DATA
2.1.1 MODIFYING COLUMN INFORMATION OF SUPERSTORESALES DATA
2.1.2 CHECK ON UNIQUENESS OF DATA
2.2 DATA CONSISTENCY OF SUPERSTORESALES DATA
2.2.1 CHECKING AND FORMATTING THE NUMBER OF DATA FOR CHARACTER DATA IN SUPERSTORESALES DATA
2.2.2 CHECKING AND CORRECTING NAMING ERROR CHARACTER DATA IN SUPERSTORESALES DATA
2.3 DATA COMPLETENESS OF SUPERSTORESALES DATA
2.4 REORGANIZATION OF DATASET WITH CUSTOMER NAME AS UNIQUE KEY
2.5 CREATING DERIVED VARIABLES FOR ANALYSIS
SECTION 3: ANALYSIS PROCEDURE - PREPARE & SELECT DATA FOR CLUSTER ANALYSIS
3.1 SELECTING CLUSTERING VARIABLES
3.2 EXAMINE DISTRIBUTION & PERFORM VARIABLE STANDARDIZATION AND TRANSFORMATION
3.2.1 RFM VARIABLES
3.2.2 SHIPPING PATTERN ORDER PRIORITY
3.2.2 SHIPPING PATTERN SHIPPING MODE
3.2.3 PRODUCT MIX PRODUCT CONTAINERS & PRODUCT CATEGORY
3.3 CHECK FOR MULTI-COLLINEARITY AMONG VARIABLES
3.4 FINAL VARIABLES USED FOR CLUSTER ANALYSIS
SECTION 4: ANALYSIS PROCEDURE CONDUCT CLUSTERING EXERCISE
4.1 INTERPRETING CLUSTERS
SECTION 5: ANALYSIS PROCEDURE INTERPRETATION CLUSTER RESULTS
5.1 PROFILING CLUSTERING
5.1.1 ANALYSIS FROM CLUSTERING VARIABLES
5.1.2 ANALYSIS FROM NON- CLUSTERING (INDEPENDENT) VARIABLES
SECTION 6: ANALYSIS RESULTS AND RECOMMENDATION
SECTION 1: EXECUTIVE SUMMARY

1. Superstore Sales' customer profile
a. Overall, 5 distinct customer clusters are identified by analyzing the transactional data with
information related to their purchasing behavior, shipping pattern, and product mix.
2. Of all clusters, customers in cluster 5 generate the most revenue and orders, but the company needs to
take note that
a. Cluster 5 ranks only 3rd in terms of recency. This may imply that the loyalty of this
valuable cluster is being lost over time.
b. It also contains the largest number of customers with negative profit.
3. Though clusters 2 & 3 are the next most valuable after cluster 5, it must also be noted that clusters 2 & 3
generate less revenue and buy less as compared to the other clusters. Customers in these 2 clusters may be
new customers.
4. Product mix
a. Office supplies come out top across all clusters.
b. All clusters prefer the use of small boxes.
5. Shipping pattern
a. Shipping by regular air is preferred by all clusters.
b. Delivery mode: truck
i. It is affected by the size of the containers. Hence, as long as jumbo drum and jumbo box
are available as product containers, the company needs to invest in delivery by truck.
6. Demographic
a. Most customers bought items for corporate use and are found in the Central region. The
distribution is similar in all 5 clusters.
7. Data collection
a. The representation of data can be improved to include information such as brand, since it is
embedded in the product name.
b. Also, the company needs to create a system to issue a customer account ID to each customer, so
as to ensure that all customers, even those with the same name, are analysed uniquely.
SECTION 2: DATA PREPARATION PART 1 DATA STUDY & REVIEW

2.1 DATA ACCURACY OF SUPERSTORE SALES DATA

The overview of the change(s) made to the table is as below (X = edited, - = unchanged);

Column Information
Field Name | Edit Data Type | Edit Modeling Type | Format
Row ID | X | X | -
Order ID | X | X | -
Order Date | - | - | -
Order Priority | - | - | -
Order Quantity | - | - | -
Sales | - | - | X
Discount | - | - | X
Ship Mode | - | - | -
Profit | - | - | X
Unit Price | - | - | X
Shipping Cost | - | - | X
Customer Name | - | - | -
City | - | - | -
Zip Code | X | - | -
State | - | - | -
Region | - | - | -
Customer Segment | - | - | -
Product Category | - | - | -
Product Sub-Category | - | - | -
Product Name | - | - | -
Product Container | - | - | -
Product Base Margin | - | - | -
Ship Date | - | - | -

2.1.1 Modifying Column Information of SuperstoreSales Data

Field Name | Before: Data Type | Before: Modeling Type | Before: Format | After: Data Type | After: Modeling Type | After: Format
Row ID | Numeric | Continuous | Best | Character | Nominal | -
Order ID | Numeric | Continuous | Best | Character | Nominal | -
Sales | Numeric | Continuous | Best | Numeric | Continuous | Currency (SGD)
Discount | Numeric | Continuous | Best | Numeric | Continuous | Currency (SGD)
Profit | Numeric | Continuous | Best | Numeric | Continuous | Currency (SGD)
Unit Price | Numeric | Continuous | Best | Numeric | Continuous | Currency (SGD)
Shipping Cost | Numeric | Continuous | Best | Numeric | Continuous | Currency (SGD)
Zip Code | Numeric | Continuous | Best | Character | Nominal | -
2.1.2 Check on Uniqueness of Data

For transactional data, the Row ID is unique, which indicates that no line transaction is repeated.
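The uniqueness check can be sketched outside JMP as follows. This is a minimal illustration in plain Python, not the procedure used in the report; the function name `find_duplicate_ids` and the sample rows are assumptions made for the example.

```python
from collections import Counter

def find_duplicate_ids(rows, key="Row ID"):
    """Return the key values that occur more than once, with their counts."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical sample rows; the real table has one row per line transaction.
sample_rows = [
    {"Row ID": "1", "Order ID": "3"},
    {"Row ID": "2", "Order ID": "6"},
    {"Row ID": "3", "Order ID": "6"},
]

duplicates = find_duplicate_ids(sample_rows)
row_id_is_unique = not duplicates   # True when no Row ID is repeated
```

An empty result for `Row ID` confirms that no line transaction is repeated, while the same function applied to `Order ID` shows that an order can legitimately span several lines.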
2.2 DATA CONSISTENCY OF SUPERSTORESALES DATA

The overview of the change(s) made to the table is as below;
*Fields with more than 100 categories will not be checked or cleaned, as a manual check is subject to high human error.
Hence, such information is treated as a description, and will not be used for categorical / distribution analysis.

Field Name | Data Type (After Updated) | Modeling Type (After Updated) | Actionable | Rationale
Row ID | Character | Nominal | No Action | -
Order ID | Character | Nominal | Check no. of digits | Any ID used as a business identifier should contain the same number of digits/characters.
Order Priority | Character | Nominal | Check naming error | -
Ship Mode | Character | Nominal | Check naming error | -
Customer Name | Character | Nominal | No Action | More than 100 categories*
City | Character | Nominal | No Action | More than 100 categories*
Zip Code | Character | Nominal | Check no. of digits | All zip codes should contain the same number of digits.
State | Character | Nominal | Check naming error | -
Region | Character | Nominal | Check naming error | -
Customer Segment | Character | Nominal | Check naming error | -
Product Category | Character | Nominal | Check naming error | -
Product Sub-Category | Character | Nominal | Check naming error | -
Product Name | Character | Nominal | No Action | More than 100 categories*
Product Container | Character | Nominal | Check naming error | -
2.2.1 Checking and formatting the Number of Data for Character Data in SuperstoreSales Data

Field Name: Order ID
Checking No. of Digits:
Step 1: Create a new column, No. of Char(Order ID), to count the digits in Order ID.
Step 2: Identify the no. of digits in Order ID.
Results: There are order IDs that contain up to 5 digits. Reformat all to contain 5 digits.
Actionable: Create a new field, Order ID (Formatted), such that all order IDs will contain 5 digits.
After Actionable: A new field, Order ID (Formatted), is created. Another new column is then created to check
the number of characters of all order IDs.
[Figure: character count of Order ID (Formatted)]

Field Name: Zip Code
Checking No. of Digits:
Step 1: Create a new column, No. of Char(Zipcode), to count the digits in Zip Code.
Step 2: Identify the no. of digits in Zip Code.
Results: There are zip codes with 4 digits, which need to be reformatted into 5 digits.
Actionable: Create a new field, Zipcode (Formatted), such that all zip codes will contain 5 digits.
After Actionable: A new field, Zipcode (Formatted), is created. Another new column is then created to check
the number of characters of all zip codes.
[Figure: character count of Zipcode (Formatted)]
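The reformatting step amounts to left-padding each identifier with zeros. The sketch below shows one way to do this in plain Python; the helper `pad_to_width` and the sample values are hypothetical, and zero-padding on the left is an assumption about how the 5-digit format was reached.

```python
def pad_to_width(value, width=5):
    """Left-pad an identifier with zeros so every value has `width` characters."""
    text = str(value).strip()
    if len(text) > width:
        raise ValueError(f"{text!r} is longer than {width} characters")
    return text.zfill(width)

order_ids = [3, 293, 59973]      # hypothetical Order ID values
zip_codes = ["2138", "90210"]    # hypothetical; the first is assumed to have lost a leading zero

order_ids_formatted = [pad_to_width(v) for v in order_ids]
zip_codes_formatted = [pad_to_width(v) for v in zip_codes]

# Equivalent of the "No. of Char" check column: every value should have 5 characters.
char_counts = {len(v) for v in order_ids_formatted + zip_codes_formatted}
```

Storing the result as text (Character / Nominal) matters here, since a numeric column would drop the leading zeros again.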
2.2.2 Checking and correcting Naming Error Character Data in SuperstoreSales Data
Using Tabulate, the categories of each field were identified. No naming errors or near-duplicate
categories were spotted; thus, no re-coding is required.
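A comparable check can be sketched in plain Python: tabulate the categories, then flag values that differ only by letter case or spacing. The function names and sample values below are assumptions for illustration, not part of the JMP workflow.

```python
from collections import Counter, defaultdict

def tabulate(values):
    """Count how often each category appears, like a one-way tabulate."""
    return Counter(values)

def near_duplicates(values):
    """Group categories that differ only by case or by extra spaces."""
    groups = defaultdict(set)
    for value in values:
        normalised = " ".join(value.split()).lower()
        groups[normalised].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}

# Hypothetical Ship Mode values.
ship_modes = ["Regular Air", "Express Air", "Delivery Truck", "Regular Air"]

counts = tabulate(ship_modes)
suspects = near_duplicates(ship_modes)   # an empty dict means no re-coding is needed
```

This only catches case and spacing variants; spelling variants would still need a visual review of the tabulated categories.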
2.3 DATA COMPLETENESS OF SUPERSTORESALES DATA

To ensure that the data is complete, we conduct a check via a summary of the columns, as below;

No. | Observation | Impact
1. | Only Product Base Margin has 63 missing fields. | If the column is used for analysis, it will affect the consistency of the analysis.
2. | The MIN values of Ship Date & Order Date are near. The data may contain errors with ship date < order date. | If ship date < order date is true, the erroneous data needs to be removed.
3. | Negative Profit detected. | Need to examine whether it is a data error or a real business loss.
4. | Range of sales is wide. | If distinct outliers are observed, they will skew the overall analysis results. Need to examine the distribution in the analysis later.
5. | Range of order quantity is wide. | Same as 4.
6. | Range of unit price is wide. Can be explained by the wide range of products (n of product name = 1263). | Same as 4.
7. | Range of shipping cost is wide. Can be explained by the different shipping modes (n = 3) and product containers (n = 7). | Same as 4.

The proposed actionables are as below;

1. Proposed Actionable: To keep the missing data, and observe the impact of the missing input on our analysis.
Rationale & Results: To analyze the product mix, one factor that we will consider is products that generate
high profit margin (low product base margin). Technically, we should remove the rows with missing data. However,
- Transactional records are tied to the order and customer, so removing them may dilute future analysis on
total sum spent per order.
- The missing data come from one brand, SAFCO, only.

2. Proposed Actionable: An additional column, Shipping < Order, is created to check.
Rationale & Results: A check is conducted and no transaction shows a shipping date earlier than its order
date. Hence, no data is removed.
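The two completeness checks above (missing Product Base Margin values and the Shipping < Order flag) can be sketched as below. This is a plain-Python illustration under stated assumptions: the rows are hypothetical and `None` is assumed to mark a missing value.

```python
from datetime import date

# Hypothetical rows; None marks a missing Product Base Margin.
rows = [
    {"Order Date": date(2012, 1, 5), "Ship Date": date(2012, 1, 7), "Product Base Margin": 0.38},
    {"Order Date": date(2012, 2, 1), "Ship Date": date(2012, 2, 1), "Product Base Margin": None},
    {"Order Date": date(2012, 3, 9), "Ship Date": date(2012, 3, 12), "Product Base Margin": 0.55},
]

def count_missing(rows, field):
    """Number of rows where `field` has no value."""
    return sum(1 for row in rows if row.get(field) is None)

def shipped_before_ordered(rows):
    """Rows that violate the rule Ship Date >= Order Date."""
    return [row for row in rows if row["Ship Date"] < row["Order Date"]]

missing_margin = count_missing(rows, "Product Base Margin")
date_errors = shipped_before_ordered(rows)   # empty list: nothing to remove
```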
Next, we also examine whether any business logic is violated;

Observation 1: One order ID has multiple customer names.
Rationale & Actionable: Create a unique identifier, Unique Identifier, which combines Order ID (Formatted)
with Customer Name.

Observation 2: Multiple zip codes link to a state.
Rationale & Actionable: Using the following link, it is noticed that multiple zip codes can be linked to
different city names.
https://tools.usps.com/go/ZipLookupAction!input.action
As such, no further action is taken. However, City, instead of zip code, will be used for further analysis
in this report, since it is the lowest level of detail for shipping location.
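The first business-logic check and the resulting Unique Identifier can be sketched as follows. The sample pairs, the function names, and the underscore separator are assumptions made for the example; the report does not state how the two fields were concatenated.

```python
from collections import defaultdict

# Hypothetical (Order ID (Formatted), Customer Name) pairs.
orders = [
    ("00003", "Customer A"),
    ("00006", "Customer B"),
    ("00006", "Customer C"),
]

def orders_with_multiple_customers(pairs):
    """Map each order ID that is shared by several customer names to those names."""
    names_by_order = defaultdict(set)
    for order_id, name in pairs:
        names_by_order[order_id].add(name)
    return {oid: sorted(names) for oid, names in names_by_order.items() if len(names) > 1}

def unique_identifier(order_id, customer_name, sep="_"):
    """Combine order ID and customer name; the separator is an assumption."""
    return f"{order_id}{sep}{customer_name}"

shared = orders_with_multiple_customers(orders)
keys = [unique_identifier(oid, name) for oid, name in orders]
```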
Reference
Working Copy - With Validation, Hidden & Exceptional Fields
File name: CLEAN DATA_WORKING_COPY 01 SuperstoreSalesData
2.4 REORGANIZATION OF DATASET WITH CUSTOMER NAME AS UNIQUE KEY

To prepare our data to meet the objective of this paper, we have to reorganize the dataset such that
customer name is the primary key, in the absence of a customer account number. Here, we assume
that each customer name refers to a distinct customer account. As such, Row ID will be hidden and
excluded.

No. | Related Field Name(s)
1. | Order ID (Formatted)
2. | Order Date & Ship Date
3. | Discount
4. | Profit & Sales
5. | Unit Price, Product Base Margin & Shipping Cost
6. | Product Name, Product Sub-Category, Product Category, and Product Container
7. | Order Quantity
8. | Shipping Mode, Order Priority & Customer Segment
9. | City, State, Region
[Figure: observation for each group of fields]
For each of the observation(s) above, actionables are proposed & implemented;

1. Field Name(s): Order ID (Formatted)
Rationale: To count the number of orders per customer.
Procedure & Outcome:
Step 1: Create a tabulate, make it into a new data table, and save it with the file name CLEAN
DATA_WORKING_COPY 02 SuperstoreSalesData.
Step 2: From the new data table, create another tabulate and make it into another new data table. Rename
the field n to No. of Orders. Save the file as CLEAN DATA_WORKING_COPY 03 SuperstoreSalesData.

2. Field Name(s): Order Date & Ship Date
Rationale: To calculate the duration between Shipped & Ordered (lead time). To identify the MIN & MAX
order date and the MIN & MAX of lead time for each customer.
Procedure & Outcome: A new field, Lead Time, is created. Create a tabulate and save it as Tabulate -
Working01 by Cust-Name.

3. Field Name(s): Discount
Rationale: Discount is ambiguous.
Procedure & Outcome: Data will be dropped from customer segmentation analysis.

4. Field Name(s): Profit & Sales
Rationale: Add MIN, MAX, Range & SUM of both Profit & Sales by customer.
Procedure & Outcome:
Step 1: Open Tabulate - Working01 by Cust-Name and add MIN, MAX, Range & Sum of Profit and Sales to the
tabulate.
Step 2: Save the tabulate as Tabulate - Working02 by Cust-Name.

5. Field Name(s): Unit Price, Product Base Margin & Shipping Cost
Rationale: These data seem to be independent of the customer, and dependent on the product name.
Procedure & Outcome: All the 3 data will be dropped from customer segmentation analysis.

6. Field Name(s): Product Name, Product Sub-Category, Product Category, and Product Container
Rationale: Product Name and Product Sub-Category are subsets of Product Category, with high variety per
customer; thus, it is difficult to draw meaningful data from them. Distinct categories are found in Product
Category and Product Container.
Procedure & Outcome:
Step 1: Add Product Category & Product Container to Tabulate - Working02 by Cust-Name, with N & % Row.
Step 2: Save the tabulate as Tabulate - Working03 by Cust-Name.

7. Field Name(s): Order Quantity
Rationale: Data is ambiguous. With the unit of measure missing, we are unable to determine whether this is
a bulk purchase or a single item.
Procedure & Outcome: Data will be dropped from customer segmentation analysis.

8. Field Name(s): Shipping Mode, Order Priority & Customer Segment
Rationale: Distinct categories are found in Shipping Mode & Customer Segment.
Procedure & Outcome:
Step 1: Add Shipping Mode, Order Priority & Customer Segment to Tabulate - Working03 by Cust-Name.
Step 2: Save the tabulate as Tabulate - Working04 by Cust-Name.

9. Field Name(s): City, State, Region
Rationale: City and State are subsets of Region, and distinct categories are found in Region.
Procedure & Outcome: Drop City and State from customer segmentation analysis.
Step 1: Add Region to Tabulate - Working04 by Cust-Name.
Step 2: Save the tabulate as Tabulate FINAL by Customer Name.

Next, with Tabulate FINAL by Customer Name, make it into a data table, join it with CLEAN
DATA_WORKING_COPY 01 SuperstoreSalesData, and save the file as CLEAN DATA_FINAL_COPY
SuperstoreSalesData.
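The tabulate steps above roll line transactions up to one record per customer. A rough plain-Python equivalent is sketched below; the sample rows and the function `summarise_by_customer` are hypothetical, and Row % is assumed to be the share of a customer's transaction lines that fall in each category.

```python
from collections import defaultdict

# Hypothetical line transactions (only the fields needed here).
rows = [
    {"Customer Name": "Customer A", "Order ID": "00003", "Sales": 100.0, "Ship Mode": "Regular Air"},
    {"Customer Name": "Customer A", "Order ID": "00003", "Sales": 50.0, "Ship Mode": "Express Air"},
    {"Customer Name": "Customer A", "Order ID": "00010", "Sales": 25.0, "Ship Mode": "Regular Air"},
    {"Customer Name": "Customer B", "Order ID": "00006", "Sales": 80.0, "Ship Mode": "Regular Air"},
]

def summarise_by_customer(rows, category_field="Ship Mode"):
    """Build one summary record per customer name."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Customer Name"]].append(row)

    summary = {}
    for name, items in grouped.items():
        sales = [r["Sales"] for r in items]
        counts = defaultdict(int)
        for r in items:
            counts[r[category_field]] += 1
        summary[name] = {
            "No. of Orders": len({r["Order ID"] for r in items}),  # distinct orders
            "Sum(Sales)": sum(sales),
            "Min(Sales)": min(sales),
            "Max(Sales)": max(sales),
            # Assumption: Row % is computed over the customer's transaction lines.
            "Row %": {cat: 100.0 * n / len(items) for cat, n in counts.items()},
        }
    return summary

by_customer = summarise_by_customer(rows)
```

The same grouping can be repeated with `category_field` set to Order Priority, Product Category or Product Container to obtain the other Row % variables.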
The summary of the actionables is as below;

Original Field Name | In Use? | Application for Customer Perspective
Row ID | Dropped | -
Order ID (Formatted) | Yes | RFM (F)
Order Date | Yes | RFM (R)
Order Priority | Yes | Shipping Pattern
Order Quantity | Dropped | -
Sales | Yes | RFM (M)
Discount | Dropped | -
Ship Mode | Yes | Shipping Pattern
Profit | Yes | -
Unit Price | Dropped | -
Shipping Cost | Dropped | -
Customer Name (Unique) | Yes | -
City | Dropped | -
Zip Code | Dropped | -
State | Dropped | -
Region | Yes | -
Customer Segment (Sector) | Yes | -
Product Category | Yes | Product Mix
Product Sub-Category | Dropped | -
Product Name | Dropped | -
Product Container | Yes | Product Mix
Product Base Margin | Dropped | -
Ship Date | Yes | -
Profile Analysis: X X [rows not recoverable]
2.5 CREATING DERIVED VARIABLES FOR ANALYSIS

No. | New Fields | Objective & Formula
1. | R - Order Date | To define the Recency variable for the analysis.

Reference
Final Copy With Customer Name as Unique Key
File name: CLEAN DATA_FINAL_COPY SuperstoreSalesData
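The formula for the Recency variable is not reproduced in the text. One common definition, used here purely as an assumption, is the number of days between a reference date and each customer's latest order. The sketch below uses hypothetical dates and takes the latest order date in the dataset as the reference.

```python
from datetime import date

# Hypothetical order dates per customer.
order_dates = {
    "Customer A": [date(2012, 1, 5), date(2012, 11, 20)],
    "Customer B": [date(2012, 6, 1)],
}

def recency_in_days(order_dates, reference_date=None):
    """Days between the reference date and each customer's latest order."""
    if reference_date is None:
        # Assumption: measure against the most recent order in the whole dataset.
        reference_date = max(max(dates) for dates in order_dates.values())
    return {
        name: (reference_date - max(dates)).days
        for name, dates in order_dates.items()
    }

recency = recency_in_days(order_dates)
```

Under this definition a smaller value means a more recent customer, which matches the "smaller is better" reading of recency used later in the profiling.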
SECTION 3: ANALYSIS PROCEDURE - PREPARE & SELECT DATA FOR CLUSTER ANALYSIS
3.1 SELECTING CLUSTERING VARIABLES

Only the fields below are used for the clustering analysis; the other fields are used for profiling the
clusters after they are formed.
3.2 EXAMINE DISTRIBUTION & PERFORM VARIABLE STANDARDIZATION AND TRANSFORMATION

To apply cluster analysis, the distributions of the variables used should not contain too many outliers and
should not be too skewed. Therefore, all the selected variables are examined and, if required,
standardized or transformed.
3.2.1 RFM Variables

Field Names: R - Order Date; F - No. of Orders; M - Sum(Sales)
[Figure: current distribution of each field]
Analysis: Distinct outliers are detected in all 3 variables, and they are left-skewed. Thus, Johnson's
Transformation is applied to help reduce skewness.
[Figure: distribution of each field after transformation]
Outcome Analysis: The results of the transformation are good for R - Order Date and M - Sum(Sales), where
the data are now normally distributed with no distinct outliers. F - No. of Orders does not fare as well,
with the data slightly right-skewed, but with no distinct outliers. Hence, all 3 new variables are
accepted as the new variables for R, F & M.
3.2.2 Shipping Pattern Order Priority

Field Names: Row% (Critical) Order Priority; Row% (High) Order Priority; Row% (Low) Order Priority;
Row% (Medium) Order Priority; Row% (Not Specified) Order Priority
[Figure: Current Distribution, Current Distribution + 1, Transformation 1 and Transformation 2 for each field]
Analysis: The data is highly skewed to the right, with distinct outliers. As the logarithm of 0% is undefined,
we add a constant 1 to all the variables.
Rationale: To normalize the data, we take the log of the data. After the first transformation, the data is
still right-skewed. Therefore, we transform it again by taking the square root. After this, outliers are
reduced and the data shows less skewness.
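The add-one, log, then square-root sequence can be sketched as below, together with a simple skewness measure for comparing each stage. This is an illustration in plain Python with hypothetical Row % values; the skewness function is the population third standardised moment, which may differ slightly from the statistic JMP reports.

```python
import math

def skewness(values):
    """Population skewness (third standardised moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log_plus_one(values):
    """Transformation 1: log(x + 1), so that 0% stays defined."""
    return [math.log(v + 1) for v in values]

def sqrt_of_log_plus_one(values):
    """Transformation 2: square root of log(x + 1)."""
    return [math.sqrt(math.log(v + 1)) for v in values]

# Hypothetical Row % values: several zeros and a long right tail.
row_pct = [0, 0, 0, 5, 10, 10, 20, 25, 50, 100]

skew_raw = skewness(row_pct)
skew_log = skewness(log_plus_one(row_pct))
skew_sqrt_log = skewness(sqrt_of_log_plus_one(row_pct))
```

Whether the second step helps depends on the data, for example on how many customers sit at exactly 0%, so it is worth comparing the skewness at each stage rather than assuming each step improves it.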
3.2.2 Shipping Pattern Shipping Mode

Field Names: Row % (Delivery Truck) Shipping Mode; Row % (Express Air) Shipping Mode;
Row % (Regular Air) Shipping Mode
[Figure: Current Distribution, Current Distribution + 1, Transformation 1 and Transformation 2 for each field]
Analysis: The data is highly skewed, with distinct outliers. As the logarithm of 0% is undefined, we add a
constant 1 to all the variables.
Rationale (log and square root): To normalize the data, we take the log of the data. After the first
transformation, the data is still right-skewed. Therefore, we transform it again by taking the square
root. After this, outliers are reduced and the data shows less skewness.
Rationale (Johnson Sl): Where the data still shows skewness and outliers after the log and square root,
Johnson Sl Transformation is applied, and the distribution's skewness and outliers are greatly reduced.
3.2.3 Product Mix Product Containers & Product Category

Field Names (part 1): Row % (Jumbo Box) Product Container; Row % (Jumbo Drum) Product Container;
Row % (Large Box) Product Container; Row % (Medium Box) Product Container; Row % (Small Box) Product Container
[Figure: Current Distribution, Current Distribution + 1, Transformation 1 and Transformation 2 for each field]
Analysis:
- Highly skewed fields: The data is highly skewed, with distinct outliers. As the logarithm of 0% is
undefined, we add a constant 1 to all the variables.
- Remaining field: No transformation is done, as the data shows few outliers and is not too skewed.
Rationale:
- Highly skewed fields: To normalize the data, we take the log of the data. After the first transformation,
the data is still right-skewed. Therefore, we transform it again by taking the square root. After this,
outliers are reduced and the data shows less skewness.
- Remaining field: No transformation is done, as the data shows few outliers and is not too skewed.

Field Names (part 2): Row % (Wrap Bag) Product Container; Row % (Small Pack) Product Container;
Row% (Furniture) Product Category; Row% (Technology) Product Category; Row% (Office Supplies) Product Category
[Figure: Current Distribution, Current Distribution + 1, Transformation 1 and Transformation 2 for each field]
Analysis:
- The data is highly skewed, with distinct outliers. As the logarithm of 0% is undefined, we add a constant
1 to all the variables.
- The other 2 fields are highly skewed, with distinct outliers. Transformation is required.
- The data is quite normally distributed, with few outliers. No transformation is required.
Rationale: Johnson Sl Transformation is applied, and the distribution's skewness and outliers are greatly
reduced.
3.3 CHECK FOR MULTI-COLLINEARITY AMONG VARIABLES

Next, to ensure that the collinearity among the variables is low, multivariate analysis is conducted.
Overall, the collinearity among the variables is not strong (all correlations fall below 0.71).
Refer to Multivariate - Cluster Variables in the JMP file.
Nevertheless, the following is observed;
(1) Square Root(Log(Row%(Delivery Truck) Shipping Mode + 1)) appears 5 times among the variable pairs
that show high collinearity.
Thus, Square Root(Log(Row%(Delivery Truck) Shipping Mode + 1)) will be excluded from the clustering.
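The screening logic, listing the highly correlated pairs and counting how often each variable appears in them, can be sketched as follows. The data, the variable names, and the 0.5 cut-off are assumptions for illustration; the report does not state the threshold it used to call a pair "high".

```python
import math
from collections import Counter
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def high_correlation_pairs(columns, threshold):
    """All variable pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs

# Hypothetical values for eight customers; Delivery Truck depends on both jumbo containers.
columns = {
    "Delivery Truck": [2, 1, 1, 0, 2, 1, 1, 0],
    "Jumbo Drum": [1, 0, 1, 0, 1, 0, 1, 0],
    "Jumbo Box": [1, 1, 0, 0, 1, 1, 0, 0],
    "Small Box": [0, 0, 0, 0, 1, 1, 1, 1],
}

pairs = high_correlation_pairs(columns, threshold=0.5)   # threshold is an assumption
appearances = Counter(name for a, b, _ in pairs for name in (a, b))
most_repeated = appearances.most_common(1)[0][0]
```

A variable that recurs across many high-correlation pairs is the natural candidate to drop, since removing it resolves several pairs at once.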
3.4 FINAL VARIABLES USED FOR CLUSTER ANALYSIS
In all, 20 variables will be used for clustering, as below;
SECTION 4: ANALYSIS PROCEDURE CONDUCT CLUSTERING EXERCISE

4.1 INTERPRETING CLUSTERS
A. Hierarchical Clustering
This exercise helps to propose the number of clusters suitable for K-means. The results of
hierarchical clustering are as below;
[Figure: No. of clusters formed under Ward, Centroid, Average, Single Linkage and Complete Linkage]
As shown from the results, the 5 different methods give different results, with Ward &
Complete Linkage giving more evenly distributed clusters. As the results conflict, we will use the
suggested number of clusters and apply K-means instead.
B. K-Means
Using K-means analysis, the results are as below;
On closer examination, the distribution of the clusters is as below;
Statistically, the recommended number of clusters is 3. However, when n = 3, one cluster (cluster 2 = 408
customers) contains more than 50% of the customers. Therefore, it may not help us to understand our
customer segments better. As such, we consider the next recommended n value, the one with
the 2nd highest CCC value, n = 5.
With n = 5, the customer distribution is more evenly spread, with cluster 5 at about 37%. Thus, n = 5, i.e. 5
customer clusters, should be formed.
With this, we update the dataset with a new field, cluster.
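The size check used to choose between n = 3 and n = 5 can be sketched with a small K-means routine. This is a plain-Python illustration on hypothetical two-dimensional points, not JMP's implementation: it does not compute the CCC, and it uses a deterministic farthest-point start (an assumption) so that the example is repeatable.

```python
from collections import Counter

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def initial_centroids(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from those chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids

def k_means(points, k, max_iter=100):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    centroids = initial_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

def cluster_shares(labels):
    """Share of all customers that falls in each cluster."""
    counts = Counter(labels)
    return {cluster: n / len(labels) for cluster, n in counts.items()}

# Hypothetical standardised customers in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9),
          (9.0, 0.0), (9.1, 0.3), (8.9, 0.1), (9.2, 0.2)]

labels, centroids = k_means(points, k=3)
shares = cluster_shares(labels)
largest_share = max(shares.values())   # compare against the 50% concern raised above
```

Running the same share check for each candidate number of clusters makes the trade-off explicit: a statistically preferred solution may still be rejected if one cluster holds most of the customers.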
SECTION 5: ANALYSIS PROCEDURE INTERPRETATION CLUSTER RESULTS

5.1 PROFILING CLUSTERING
The overview of the clusters is as below;
With the clustering, the following features of the customers are analyzed;
5.1.1 ANALYSIS FROM CLUSTERING VARIABLES

(1) RFM Purchasing Pattern
Analysis:
Recency: the smaller the number, the more recent the purchases; therefore, the smaller the %, the better.
In all, we can see that the most recent customers come from cluster 2 (~18%), but they account for the
least no. of orders (~8%) and revenue (5%).
Cluster 5 gives the company the most orders (~55%) and revenue (62%), but ranks 3rd in terms of
recency (~20%).
From the table, we can see that cluster 5 has the highest minimum sales revenue by customer, and the
highest minimum no. of orders.
(2) Product Mix

(1) Product Category
Analysis: In all the product categories, we can see that they are mostly bought by cluster 5. For the rest,
- cluster 3 bought the 2nd highest in technology,
- cluster 2 bought the 2nd highest in office supplies,
- cluster 4 bought the 2nd highest in furniture.

(2) Product Category performance in each cluster
Analysis: From the cluster perspective,
- all the clusters bought mainly office supplies,
- cluster 5 bought relatively more office supplies than furniture and technology,
- all clusters except cluster 3 buy more furniture than technology.

(3) Product Container by each Cluster
Analysis: In all, cluster 5 is the biggest user of all container sizes, while cluster 2 is the smallest
user. This is in line with the frequency finding, where cluster 2 bought the least amount of items.
It is also observed that cluster 4 dominates most of the bigger containers (jumbo & medium, ~20%), after
cluster 5.

(4) Product Container in each Cluster
Analysis: In all, all clusters used small boxes most. This is followed by wrap bag for clusters 1, 2 & 5,
small pack for cluster 3, and jumbo drum for cluster 4.
(3) Shipping Pattern

(1) Shipping Mode by Cluster
Analysis: In all, cluster 5 dominates both shipping modes, while cluster 1 is the next highest for regular
air, and cluster 4 is the next highest for express air.

(2) Shipping Mode in each Cluster
Analysis: All clusters show significantly higher usage of regular air as compared to express air.

(3) Order Priority by Cluster
Analysis: In all, the distribution across clusters is similar for all 5 order priorities, except for
critical, where clusters 3 & 4 are similar in proportion, with cluster 3 higher by only 0.1%.

(4) Order Priority in each Cluster
Analysis: In each cluster, the order priorities are quite evenly distributed. Nevertheless, the most
frequent order priority differs by cluster, with
- Clusters 1 & 3: mostly not specified,
- Cluster 2: high priority most frequently,
- Cluster 4: medium priority most frequently,
- Cluster 5: low priority most frequently.
5.1.2 ANALYSIS FROM NON- CLUSTERING (INDEPENDENT) VARIABLES

(1) Demographic
1. Region (end destination of the products)
Analysis: The distribution of clusters in each region is similar.
2. Region (end destination of the products) in each Cluster
Analysis: In all, most of the customers in all clusters are found in the Central Region, with
- West Region being the 2nd highest region for clusters 2 & 5,
- South Region being the 2nd highest region for clusters [...] and 4,
- East Region being the 2nd highest region for [...].

(2) Customer Segment / Market Sector
1. Analysis: The distribution of clusters in each customer segment is similar.
2. Analysis: In all, most of the customers bought items for corporate use, with
- Home office use being the 2nd highest segment for clusters 1, 3 & 5,
- Consumer (end user) being the 2nd highest segment for cluster 2,
- Consumer & small business use being the 2nd highest segments for cluster 4.
(3) Profitability
1. Total Profit by Cluster
Analysis: In all, we can see that most of the profit came from cluster 5, with cluster 2 generating the
least amount of profit.
2. Cross table to examine behavior of profit
Analysis: On closer examination, the minimum profit,
- either by MIN profit by a customer in an order,
- or by MIN of the total profit of a customer,
is negative in all sectors. Thus, there are customers who are not generating profits for the company.
3. With the above, a new field is added to find out the proportion of customers that generate negative
profit in all sectors, as below;
Analysis: As we can observe, clusters 2 & 4 have proportionally more customers with negative profit, but
on further examination, most of the negative-profit customers by count come from cluster 5.
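The distinction between "most by proportion" and "most by count" can be made concrete with a short sketch. The records below are hypothetical and only mimic the pattern described; the function name `negative_profit_summary` is an assumption for the example.

```python
from collections import defaultdict

# Hypothetical (cluster, total profit per customer) records.
customers = [
    (2, -120.0), (2, 300.0),
    (4, -40.0), (4, 15.0), (4, -5.0),
    (5, -10.0), (5, -60.0), (5, -25.0), (5, 900.0),
    (5, 450.0), (5, 80.0), (5, 70.0), (5, 10.0),
]

def negative_profit_summary(customers):
    """Count and share of customers with negative total profit in each cluster."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for cluster, profit in customers:
        totals[cluster] += 1
        if profit < 0:
            negatives[cluster] += 1
    return {
        cluster: {"count": negatives[cluster], "share": negatives[cluster] / totals[cluster]}
        for cluster in totals
    }

summary = negative_profit_summary(customers)
largest_by_count = max(summary, key=lambda c: summary[c]["count"])
largest_by_share = max(summary, key=lambda c: summary[c]["share"])
```

A large cluster can hold the most loss-making customers in absolute terms while a small cluster has the higher within-cluster rate, so both views are worth reporting.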
SECTION 6: ANALYSIS RESULTS AND RECOMMENDATION

(1) Overview;
The overall performance and ranking from the clustering variables are as below;
(2) In RFM analysis, if we score according to weightage, giving the highest weightage to R, followed
by F and M respectively, we can see that cluster 5 is the segment that the company should focus on,
followed by 2 & 3, and then 1.

Score: 5 - 1st, 4 - 2nd, 3 - 3rd, 2 - 4th, 1 - 5th

Clustering | R | F | M | Total Score (3R+2F+1M)
1 | 2 | 4 | 3 | 17
2 | 5 | 1 | 1 | 18
3 | 4 | 2 | 2 | 18
4 | 1 | 3 | 4 | 13
5 | 3 | 5 | 5 | 24
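The weighted total (3R + 2F + 1M) can be reproduced from the rank scores in the table. The sketch below is plain Python; the dictionary layout and function name are choices made for the example, while the scores and weights are those stated in the report.

```python
# Rank scores per cluster from the scoring table (5 = 1st ... 1 = 5th).
scores = {
    1: {"R": 2, "F": 4, "M": 3},
    2: {"R": 5, "F": 1, "M": 1},
    3: {"R": 4, "F": 2, "M": 2},
    4: {"R": 1, "F": 3, "M": 4},
    5: {"R": 3, "F": 5, "M": 5},
}
weights = {"R": 3, "F": 2, "M": 1}   # highest weight on Recency

def total_score(score, weights):
    """Weighted sum of the R, F and M rank scores."""
    return sum(weights[k] * score[k] for k in weights)

totals = {cluster: total_score(s, weights) for cluster, s in scores.items()}
# Ties keep their original cluster order because Python's sort is stable.
ranking = sorted(totals, key=totals.get, reverse=True)
```

Changing the weights is a quick way to test how sensitive the ranking is to the emphasis on recency.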
(3) Product Mix

a. While usage is quite well distributed across the product containers, we can see that all clusters
prefer the use of small boxes.
b. Office supplies top all clusters, with cluster 5 being the cluster that bought the most items.
(4) Shipping Pattern
a. Shipping by regular air is preferred by all clusters.
b. Delivery mode: truck
i. It is a dependent variable, affected by the size of the containers. Hence, as long as
jumbo drum and jumbo box are available as product containers, the company needs
to invest in delivery by truck.
c. Order priority is well distributed across the clusters, but the most frequent priority differs by
cluster;
i. Clusters 1 & 3: mostly not specified
ii. Cluster 2: high priority most frequently
iii. Cluster 4: medium priority most frequently
iv. Cluster 5: low priority most frequently
(5) Demographic
a. Most customers bought items for corporate use and are found in the Central region. The
distribution is similar in all 5 clusters.
(6) Customers' net profit
a. 28% of total customers have negative profit, and most of them are found in cluster 5, while
clusters 2 & 4 contain proportionally more negative-profit customers within the cluster as
compared to the others.
(7) Representation of data can be improved
a. Product Name: it is noticed that the brand of the product is embedded in the product name. As
such, a new field should be created to hold the brand, so that we can
further analyse the performance of each brand in the product mix.
(8) Data collection - insufficient information
Area(s) that require more information: Missing customer account number
Customer account number is a better unique identifier as compared to customer name, as there can be
cases of different customers with the same name.
