Prepared for:
Prepared by:
Page 1 of 34
Table of Contents
SECTION 1: EXECUTIVE SUMMARY
2.1
2.1.1
2.1.2
2.2
2.2.1
2.2.2
2.3
2.4
2.5
SECTION 3: ANALYSIS PROCEDURE - PREPARE & SELECT DATA FOR CLUSTER ANALYSIS
3.1
3.2
3.2.1
3.2.2
3.2.2
3.2.3
3.3
3.4
4.1
INTERPRETING CLUSTERS
Field Name
Format
X
X
X
X
X
                 Before Updated                        After Updated
Field Name       Data Type  Modeling Type  Format      Data Type  Modeling Type  Format
Row ID           Numeric    Continuous     Best        Character  Nominal
Order ID         Numeric    Continuous     Best        Character  Nominal
Sales            Numeric    Continuous     Best        Numeric    Continuous     Currency (SGD)
Discount         Numeric    Continuous     Best        Numeric    Continuous     Currency (SGD)
Profit           Numeric    Continuous     Best        Numeric    Continuous     Currency (SGD)
Unit Price       Numeric    Continuous     Best        Numeric    Continuous     Currency (SGD)
Shipping Cost    Numeric    Continuous     Best        Numeric    Continuous     Currency (SGD)
Zip Code         Numeric    Continuous     Best        Character  Nominal
2.2
*Fields with more than 100 categories are not checked or cleaned, because manual checking at that scale is subject to high human error.
These fields are treated as descriptions and are not used for categorical or distribution analysis.
                        After Updated
Field Name              Data Type  Modeling Type  Actionable  Rationale
Row ID                  Character  Nominal        No Action
Order ID                Character  Nominal
Order Priority          Character  Nominal
Ship Mode               Character  Nominal
Customer Name           Character  Nominal
City                    Character  Nominal
Zip Code                Character  Nominal
State                   Character  Nominal
Region                  Character  Nominal
Customer Segment        Character  Nominal
Product Category        Character  Nominal
Product Sub-Category    Character  Nominal
Product Name            Character  Nominal
Product Container       Character  Nominal
2.2.1 Checking and formatting the Number of Data for Character Data in SuperstoreSales Data
Field Name: Order ID
Checking No. of Digits:
  Step 1: Create a new column, No. of Char(Order ID).
  Step 2: Identify the number of digits in Order ID.
Actionable:
After Actionable:
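The length check in Steps 1 and 2 can be sketched outside JMP as follows. This is a minimal Python illustration rather than the procedure used in this report; the function name and the sample Order ID values are hypothetical.

```python
from collections import Counter


def order_id_length_counts(order_ids):
    """Count how many Order IDs have each character length."""
    # Treat every ID as text, mirroring the Character data type used above.
    return Counter(len(str(order_id).strip()) for order_id in order_ids)


# Hypothetical Order ID values, for illustration only.
sample_ids = ["3", "293", "483", "51590"]
length_counts = order_id_length_counts(sample_ids)
print(dict(length_counts))
```

A tally with more than one distinct length suggests that the field may need padding or reformatting before it is used as a key.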
2.2.2 Checking and correcting Naming Error Character Data in SuperstoreSales Data
Using Tabulate, the categories of each field were identified. As shown below, no naming errors or
near-duplicate categories were spotted; thus, no re-coding is required.
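A check of this kind can also be approximated in code. The sketch below groups category labels that differ only in letter case or spacing; it is an illustration under that assumption, and the function name and sample labels are hypothetical, not taken from the dataset.

```python
from collections import defaultdict


def find_naming_variants(values):
    """Group labels that differ only by letter case or whitespace."""
    groups = defaultdict(set)
    for value in values:
        # Collapse repeated spaces and ignore case to build a comparison key.
        key = " ".join(value.split()).casefold()
        groups[key].add(value)
    # Keep only keys that map to more than one distinct spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical labels, for illustration only.
labels = ["Regular Air", "regular air ", "Express Air", "Delivery Truck"]
variants = find_naming_variants(labels)
print(variants)
```

An empty result corresponds to the finding above that no re-coding is required. Misspellings would need a fuzzier comparison than this sketch provides.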
2.3
No.  Observation / Impact
1.   Observation: Only Product Base Margin has missing values (63 fields).
     Impact: If the column is used for analysis, it will affect the consistency of the analysis.
2.   Observation: The MIN values of shipping date and order date are close. The data may contain errors where shipping date < order date.
     Impact: If shipping date < order date is true, the error data needs to be removed.
3.   Observation: Negative Profit detected.
     Impact: Need to examine whether it is a data error or a real business loss.
4.   Observation: Range of sales is wide.
5.   Observation: Range of order quantity is wide.
6.   Observation: Range of unit price is wide. This can be explained by the wide range of products (n of product name = 1263).
7.   Observation: Range of shipping cost is wide. This can be explained by the different shipping modes (n = 3) and product containers (n = 7).
     Impact (4 to 7): If distinct outliers are observed, they will skew the overall analysis results. The distributions need to be examined in the analysis later.
1.
Proposed Actionable
2.
An additional column, Shipping < Order, is created to check whether any shipping date is earlier than its order date.
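The validation column can be sketched in code as follows. This is a minimal illustration, assuming each row carries Order Date and Ship Date values that are already parsed as dates; the function name and the sample rows are hypothetical.

```python
from datetime import date


def add_shipping_before_order_flag(rows):
    """Return copies of the rows with a boolean 'Shipping < Order' field."""
    flagged = []
    for row in rows:
        new_row = dict(row)
        # True marks a likely data error: the item shipped before it was ordered.
        new_row["Shipping < Order"] = row["Ship Date"] < row["Order Date"]
        flagged.append(new_row)
    return flagged


# Hypothetical rows, for illustration only.
rows = [
    {"Order ID": "A", "Order Date": date(2012, 1, 5), "Ship Date": date(2012, 1, 7)},
    {"Order ID": "B", "Order Date": date(2012, 3, 9), "Ship Date": date(2012, 3, 8)},
]
flagged = add_shipping_before_order_flag(rows)
errors = [row["Order ID"] for row in flagged if row["Shipping < Order"]]
print(errors)
```

Rows where the flag is true are the candidates for removal described in the Impact column.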
Reference
Working Copy - With Validation, Hidden & Exceptional Fields
File name: CLEAN DATA_WORKING_COPY 01
SuperstoreSalesData
2.4
No.  Related Field Name(s)                                                        Observation
1.   Order ID (Formatted)
2.
3.   Discount
4.   Profit & Sales
5.   Unit Price, Product Base Margin & Shipping Cost
6.   Product Name, Product SubCategory, Product Category, and Product Container
7.   Order Quantity
8.   Shipping Mode, Order Priority & Customer Segment
9.   City, State, Region
For each of the observations above, an action is proposed and implemented:
No.  Field Name(s) / Rationale
1.   Order ID (Formatted)
     Step 2: From the new data table, create another Tabulate and make it into another new data table. Rename the field n to No. of Orders.
     Save the file as CLEAN DATA_WORKING_COPY 03 SuperstoreSalesData.
     Rationale: To count the number of orders per customer.
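The two Tabulate steps amount to counting distinct orders for each customer. A minimal Python sketch is given below, assuming that one order can span several rows and that Customer Name identifies a customer; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def orders_per_customer(rows):
    """Count distinct Order IDs for each Customer Name."""
    order_ids = defaultdict(set)
    for row in rows:
        # A set keeps each Order ID once, even if the order has several rows.
        order_ids[row["Customer Name"]].add(row["Order ID"])
    return {name: len(ids) for name, ids in order_ids.items()}


# Hypothetical rows, for illustration only.
rows = [
    {"Customer Name": "Customer A", "Order ID": "101"},
    {"Customer Name": "Customer A", "Order ID": "101"},
    {"Customer Name": "Customer A", "Order ID": "205"},
    {"Customer Name": "Customer B", "Order ID": "330"},
]
no_of_orders = orders_per_customer(rows)
print(no_of_orders)
```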
2.   Order Date & Ship Date
3.   Discount
     Rationale: Discount is ambiguous.
4.   Profit & Sales
5.   Unit Price, Product Base Margin & Shipping Cost
6.   Product Name, Product Subcategory, Product Category, and Product Container
     Rationale: Distinct categories found in Product Category and Product Container.
7.   Order Quantity
     Data will be dropped from customer segmentation analysis.
     Rationale: Data is ambiguous. With the unit of measure missing, we are unable to determine whether this is a bulk purchase or a single item.
8.   Shipping Mode, Order Priority & Customer Segment
     Step 1: Add Shipping Mode, Order Priority & Customer Segment to Tabulate - Working03 by Cust-Name.
     Rationale: Distinct categories found in Shipping Mode & Customer Segment.
9.   City, State, Region
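Tabulating a categorical field by customer produces, for each customer, the percentage of rows that fall in each category (the Row% fields used later in the cluster preparation). A minimal sketch of that calculation follows, assuming one row per order line; the function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def row_percentages(rows, key_field, category_field):
    """For each key, return the percentage of its rows in each category."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key_field]][row[category_field]] += 1
    result = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        # Percentages within one key always sum to 100.
        result[key] = {cat: 100.0 * n / total for cat, n in counter.items()}
    return result


# Hypothetical rows, for illustration only.
rows = [
    {"Customer Name": "Customer A", "Ship Mode": "Regular Air"},
    {"Customer Name": "Customer A", "Ship Mode": "Regular Air"},
    {"Customer Name": "Customer A", "Ship Mode": "Regular Air"},
    {"Customer Name": "Customer A", "Ship Mode": "Delivery Truck"},
    {"Customer Name": "Customer B", "Ship Mode": "Express Air"},
]
pct = row_percentages(rows, "Customer Name", "Ship Mode")
print(pct["Customer A"])
```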
Profile
Analysis
X
X
2.5
Reference
Final Copy With Customer Name as Unique Key
SECTION 3: ANALYSIS PROCEDURE - PREPARE & SELECT DATA FOR CLUSTER ANALYSIS
3.1
3.2
Field Name: R - Order Date; F - No. of Orders; M - Sum(Sales)
Current Distribution
Analysis: Distinct outliers are detected in all 3 variables, and they are left-skewed. Thus, a Johnson transformation is applied to help reduce the skewness.
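For reference, a Johnson transformation maps a skewed variable towards a normal shape through a fitted formula. The sketch below shows the unbounded Johnson Su form, z = gamma + delta * asinh((x - xi) / lambda). Both the choice of the Su family and the parameter values are assumptions made for illustration: in practice the statistics package selects the family and fits the parameters to the data, and this section does not state which family was chosen.

```python
import math


def johnson_su(x, gamma, delta, xi, lam):
    """Johnson Su transform: gamma + delta * asinh((x - xi) / lam)."""
    if delta <= 0 or lam <= 0:
        raise ValueError("delta and lam must be positive")
    return gamma + delta * math.asinh((x - xi) / lam)


# Hypothetical parameters and values, for illustration only.
params = {"gamma": 0.0, "delta": 1.0, "xi": 0.0, "lam": 1.0}
values = [0.0, 1.0, 10.0, 1000.0]
transformed = [johnson_su(v, **params) for v in values]
print([round(t, 3) for t in transformed])
```

The transform preserves the ordering of the values while compressing large ones, which is how it pulls in distinct outliers.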
Outcome
Analysis
Current Distribution
Current Distribution + 1
Analysis: The data is highly skewed to the right, with distinct outliers. Because the logarithm of 0% is undefined, a constant of 1 is added to every variable.
Transformation 1
Transformation 2
Rationale: To normalize the data, we take the log of the data. After this first transformation, the data is still right-skewed. Therefore, we transform it again by
taking the square root. After the second transformation, the outliers are reduced and the data shows less skewness.
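The two-step transformation described above is sqrt(log(x + 1)). A minimal Python sketch follows, using the natural log and a simple moment-based skewness measure; both are assumptions for illustration, and the sample percentages are hypothetical.

```python
import math


def sqrt_log_plus_one(values):
    """Apply sqrt(log(x + 1)) to each non-negative value."""
    # Adding 1 first keeps log() defined when a value is 0.
    return [math.sqrt(math.log(v + 1.0)) for v in values]


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical Row% values, for illustration only.
row_pct = [0.0, 0.0, 5.0, 10.0, 10.0, 20.0, 25.0, 100.0]
transformed = sqrt_log_plus_one(row_pct)
print(round(skewness(row_pct), 2), round(skewness(transformed), 2))
```

Comparing the skewness before and after gives a numeric counterpart to the visual check of the distributions.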
Current Distribution
Current Distribution + 1
Analysis: The data is highly skewed, with distinct outliers. Because the logarithm of 0% is undefined, a constant of 1 is added to every variable.
Transformation 1
Transformation 2
Rationale: To normalize the data, we take the log of the data. After this first transformation, the
data is still right-skewed. Therefore, we transform it again by taking the square root.
After the second transformation, the outliers are reduced and the data shows less skewness.
After the log and square-root transformations, the data still shows skewness and
outliers. Therefore, the Johnson SI Transformation is applied, after which the
skewness and outliers of the distribution are greatly reduced.
Current Distribution
Current Distribution + 1
Analysis: The data is highly skewed, with distinct outliers. Because the logarithm of 0% is undefined, a constant of 1 is added to
every variable.
Analysis: No transformation is done, as the data shows few outliers and is not too skewed.
Transformation 1
Transformation 2
Rationale: To normalize the data, we take the log of the data. After this first transformation, the data is still right-skewed.
Therefore, we transform it again by taking the square root. After the second transformation, the outliers are reduced and the data shows
less skewness.
Rationale: No transformation is done, as the data shows few outliers and is not too skewed.
Field Name: Row%(Furniture) Product Category; Row%(Technology) Product Category; Row%(Office Supplies) Product Category
Current Distribution
Current Distribution + 1
Analysis
Transformation 1
Transformation 2
Rationale
3.3
Thus, Square Root(Log(Row%(Delivery Truck) Shipping Mode + 1)) will be excluded from the clustering.
3.4
INTERPRETING CLUSTERS
A. Hierarchical Clustering
This exercise helps to propose a suitable number of clusters for k-means. The
results of hierarchical clustering are shown below:
Method: Ward; Centroid; Average; Single Linkage; Complete Linkage
No. of clusters formed
As the results show, the 5 linkage methods produce different results, with Ward and
Complete linkage giving more evenly distributed clusters. As the results conflict, we will use the suggested
number of clusters and apply k-means instead.
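To illustrate why different linkage methods can disagree on the same data, the sketch below runs a naive bottom-up clustering on a handful of one-dimensional points with single linkage (closest pair) and complete linkage (farthest pair). It is a toy illustration only: the points are hypothetical, Ward, Centroid and Average linkage are not covered, and it is not the algorithm implementation used for the results above.

```python
def agglomerate(values, n_clusters, linkage):
    """Merge 1-D points bottom-up until n_clusters remain.

    linkage is min for single linkage or max for complete linkage.
    Returns clusters as lists of point indices.
    """
    clusters = [[i] for i in range(len(values))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance is the min or max over all cross pairs.
                distance = linkage(
                    abs(values[i] - values[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or distance < best[0]:
                    best = (distance, a, b)
        _, a, b = best
        merged = clusters.pop(b)
        clusters[a].extend(merged)
    return clusters


# Hypothetical points, for illustration only.
values = [0.0, 1.0, 2.5, 4.5, 7.0]
single = agglomerate(values, 3, min)
complete = agglomerate(values, 3, max)
print(sorted(len(c) for c in single), sorted(len(c) for c in complete))
```

On these points single linkage chains neighbouring values into one larger cluster, while complete linkage yields more evenly sized clusters.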
B. K-Means
The results of the k-means analysis are shown below:
Statistically, the recommended number of clusters is 3. However, when n = 3, a single cluster
(cluster 2 = 408) contains more than 50% of the customers. Therefore, it may not help us to
understand our customer segments better. As such, we consider the next recommended n value, the one with
the 2nd highest CCC value, n = 5.
With n = 5, the customer distribution is more evenly spread, with cluster 5 holding about 37%. Thus, n = 5, that is, 5
customer clusters, should be formed.
With this, we updated the dataset by adding a new field, cluster, that holds the cluster assignment.
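The balance check used to choose between n = 3 and n = 5 can be expressed as a small calculation on the cluster field. The sketch below is an illustration only; the function names, the 50% threshold parameter and the sample labels are hypothetical and do not reproduce the actual cluster sizes.

```python
from collections import Counter


def cluster_shares(labels):
    """Return each cluster's share of all customers as a fraction."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: n / total for cluster, n in sorted(counts.items())}


def is_dominated(labels, threshold=0.5):
    """True if any single cluster holds more than the threshold share."""
    return max(cluster_shares(labels).values()) > threshold


# Hypothetical cluster labels, for illustration only.
labels_n3 = [1] * 2 + [2] * 6 + [3] * 2
labels_n5 = [1] * 2 + [2] * 2 + [3] * 1 + [4] * 1 + [5] * 4
print(cluster_shares(labels_n3), is_dominated(labels_n3))
print(cluster_shares(labels_n5), is_dominated(labels_n5))
```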
PROFILING CLUSTERING
An overview of the clusters is shown below:
With the clustering, the following features of the customers are analyzed:
Analysis Tools
Analysis
Recency: The smaller the number, the more recent the purchases;
therefore, the smaller the %, the better.
In all, we can see that the most recent
customers come from cluster 2 (~18%),
but they account for the fewest
orders (~8%) and the least revenue (5%).
Cluster 5 gives the company the most
orders (~55%) and revenue
(62%), but ranks 3rd in terms
of recency (~20%).
Analysis
Cluster 5 bought relatively more office supplies than furniture and technology.
Analysis
Analysis
Analysis
1.
2.
(3)
Profitability
Analysis Tools
1. Total Profit by Cluster
Analysis
3. With the above, a new field is added to find the proportion
of customers that generate negative sales in each sector, as below:
(2) In the RFM analysis, if we score each cluster with weights, giving the highest weight to R, followed
by F and then M, we can see that segment 5 is the sector that the company should focus on,
followed by 2 & 3, and then 1.
(Score: 5 - 1st, 4 - 2nd, 3 - 3rd, 2 - 4th, 1 - 5th)
Clustering                 1    2    3    4    5
R                          2    5    4    1    3
F                          4    1    2    3    5
M                          3    1    2    4    5
Total Score (3R+2F+1M)     17   18   18   13   24
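The weighted total can be reproduced from the rank scores in the table above. The sketch below is a small illustration of the 3R + 2F + 1M formula; the variable and function names are hypothetical.

```python
# Rank scores per cluster, read from the table above (5 = 1st, 1 = 5th).
R = {1: 2, 2: 5, 3: 4, 4: 1, 5: 3}
F = {1: 4, 2: 1, 3: 2, 4: 3, 5: 5}
M = {1: 3, 2: 1, 3: 2, 4: 4, 5: 5}


def total_score(r, f, m, weights=(3, 2, 1)):
    """Weighted RFM score: 3R + 2F + 1M by default."""
    w_r, w_f, w_m = weights
    return w_r * r + w_f * f + w_m * m


totals = {cluster: total_score(R[cluster], F[cluster], M[cluster]) for cluster in R}
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)
print(ranking)
```

Clusters 2 and 3 tie on the total, which is why they share a position in the ranking above.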
(5) Demographic
a. Most customers bought items for corporate use and are located in the Central region. The distribution
is similar across all 5 clusters.
(6) Customers net profit
a. 28% of all customers have a negative profit. Most of them are found in cluster 5, while, by
proportion, clusters 2 & 4 contain proportionally more negative-profit customers within the cluster
than the others.
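The distinction drawn here, between where most negative-profit customers are found and which clusters have the highest proportion of them, can be sketched as follows. The function name and the sample figures are hypothetical and are not the report's data.

```python
from collections import Counter


def negative_profit_breakdown(customers):
    """For each cluster, return (negative count, negative share within cluster).

    customers is a list of (cluster, net_profit) pairs.
    """
    totals = Counter(cluster for cluster, _ in customers)
    negatives = Counter(cluster for cluster, profit in customers if profit < 0)
    return {
        cluster: (negatives[cluster], negatives[cluster] / totals[cluster])
        for cluster in sorted(totals)
    }


# Hypothetical (cluster, net profit) pairs, for illustration only.
customers = [
    (5, -10.0), (5, -5.0), (5, 20.0), (5, 30.0), (5, 40.0), (5, 50.0),
    (2, -1.0), (2, 10.0),
]
breakdown = negative_profit_breakdown(customers)
most_by_count = max(breakdown, key=lambda c: breakdown[c][0])
most_by_share = max(breakdown, key=lambda c: breakdown[c][1])
print(breakdown, most_by_count, most_by_share)
```

In this toy data the larger cluster holds the most negative-profit customers by count, while the smaller cluster has the higher share within the cluster, which mirrors the pattern described above.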
(7) Representation of Data can be Improved
a. Product Name: the brand of the product is embedded in the product name. As
such, a new field should be created to hold the brand, so that we can
further analyse the performance of each brand in the product mix.
(8) Data Collection - Insufficient information
Missing customer
account number