For more information about SPSS software products, please visit our Web site at http://www.spss.com or contact:

Marketing Department
SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6307
Tel: (312) 651-3000
Fax: (312) 651-3668

SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6307.

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective companies.

This product includes software developed by the Apache Software Foundation (http://www.apache.org). Windows is a registered trademark of Microsoft Corporation. UNIX is a registered trademark of The Open Group. DataDirect, INTERSOLV, SequeLink, and DataDirect Connect are registered trademarks of MERANT Solutions Inc.

Clementine Application Template for Analytical CRM in Telecommunications 7.0
Copyright 2002 by Integral Solutions Limited
All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.
Contents

1  Introduction to Clementine Application Templates
     Overview
     Introduction to Analytical CRM
     Life Cycle Model for CRM
     Why Analytical CRM?

2  Introduction to the Telco CAT

3  Getting Started
     Overview
     CAT Data
     CAT Streams
     CAT Modules
     How to Use the CAT Streams
     Notes on Reusing CAT Streams

4  Working with CAT Streams
     Overview
     Module 1--Churn Application
       P1_aggregate.str--Aggregate Call Data and Merge with Customer Record
       P2_value.str--Customer Value and Tariff Appropriateness
       E1_explore.str--Visualize Customer Information and Value
       E2_ratios.str--Visualize Derived Usage Category Information
       P3_split.str--Derive Usage Category Information and Train/Test Split
       M1_churnclust.str--Customer Clustering and Value/Churn Analysis
       M2_churnpredict.str--Model Propensity to Churn
       D1_churnscore.str--Score Propensity to Churn
     Module 2--Cross-Sell Streams
       P4_basket.str--Produce Customer Product Basket Records
       P5_custbasket.str--Merge Customer, Usage, and Basket Data
       E3_products.str--Product Association Discovery
       M3_prodassoc.str--Customer Clustering and Product Analysis
       E4_prodvalue.str--Product Groupings Based on Customer Value
       M4_prodprofile.str--Propensity to Buy Grouped Products
       D2_recommend.str--Product Recommendations from Association Rules

Data Files
     Raw Data Files for Modules 1 and 2
     Intermediate Data Files for Module 1
     Intermediate Data Files for Module 2

Index
Chapter 1
Introduction to Clementine Application Templates
without modification. CAT data are supplied in flat files to avoid dependence on a database system. The data used in the CATs may be classified into two types: raw and intermediate. Raw data files are the starting point of each CAT. Intermediate files can be generated by the preprocessing streams supplied.
- A user's guide that explains the application, the approach and structure used in the stream library, the purpose and use of each stream, and how to apply the streams to new data.
Chapter 2
Introduction to the Telco CAT
Overview
The Telco CAT is a Clementine application template for analytical customer relationship management (CRM) in the telecommunications industry. It illustrates the data mining techniques applicable to churn management and cross-selling described below:
- Preprocessing. This phase of analytical CRM, or data mining, handles the merging and aggregation of customer and call data, the derivation of customer value and tariff-related fields, and the preprocessing steps for producing "basket-style" customer product data.

- Exploration. This phase uses a wide range of exploratory techniques, such as histograms and distribution charts, to understand the overall properties of the data, including the factors that influence customer churn and product purchase.

- Modeling and analysis. This phase illustrates the use of clustering and profiling to understand customer churn and assist targeted cross-selling. Additionally, predictive techniques are used in Module 1 to predict the occurrence of churn, and association discovery is used in Module 2 for cross-selling.

- Deployment. The final phase illustrates the use of the Clementine Solution Publisher to deploy churn prediction and cross-sell recommendation techniques.

The techniques used in these data mining phases help to answer the business questions typically encountered in the telecommunications industry. The Telco CAT will help you to see how this is accomplished.
customer relationship as a unit--for example, by making all of the information about interactions with a particular customer available at every customer touchpoint.
- Analytical CRM provides a greater understanding of customers, both individually and as a group, allowing the business to meet the needs of the customer at all levels, from individual transactions to overall strategy.

Data mining is at the heart of analytical CRM because it is used to uncover the hidden meaning in customer interactions, allowing businesses to understand their customers and predict what they will do.
10 Chapter 2
segments and value. Specific segments, sometimes of a particularly high value, can be targeted in campaigns.
- Improved cross-selling through a better understanding of customer segments and their relationship to product purchase. Knowledge of customers allows you to understand what they are likely to buy.

- Improved customer retention by understanding when and why customers are likely to leave the customer base, enabling you to take remedial action where appropriate.

The Telco CAT illustrates analytical CRM to achieve all of these benefits through specific applications of data mining.
Chapter 3
Getting Started
Overview
The Telco CAT is structured into two modules, or "virtual applications," that explore Clementine operations typical of the telecommunications industry.
- Module 1 is a churn application designed to increase customer retention.

- Module 2 is a cross-sell application that processes product information and merges it with customer information from Module 1 for more targeted cross-selling.

Each application consists of a number of streams that work either from raw data files or from intermediate files produced by the preprocessing streams.
CAT Data
The data provided with the CAT are based on a fictitious telecommunications company; they are entirely synthetic and bear no relation to any real company. The raw data files are:
custinfo.dat    Basic customer information
cdr.dat         Call data aggregated by month
tariff.dat      Details of the tariff scheme in use
products.dat    Table of products or services purchased by each customer
The Telco CAT also contains six intermediate data files produced by stream operations. In several cases, these intermediate data files are then used in other stream operations.
CAT Streams
The two modules of the Telco CAT consist of 15 streams. The streams are organized according to the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology and contain a prefix indicating the appropriate data mining phase. For example, P1_aggregate.str is the first stream used in the preprocessing phase to aggregate data. As illustrated in the table below, the prefix codes used for Telco CAT streams are: P - preprocessing, E - exploration and understanding, M - modeling, and D - deployment.
CRISP-DM Phase                 Prefix Code   Module 1 Streams   Module 2 Streams   Total Streams
Preprocessing                  P             3                  2                  5
Exploration and understanding  E             2                  2                  4
Modeling                       M             2                  2                  4
Deployment                     D             1                  1                  2
Total                                        8                  7                  15
CAT Modules
The Telco CAT is grouped into two modules that illustrate the types of applications in analytical CRM for telecommunications.
Module 1
Module 1 is a churn application specific to the telecommunications industry. It consists of eight streams used to explore several data sets, prepare them for modeling, and create and deploy churn prediction models. This module uses several steps to move from the exploration of data to modeling and deployment:
Preprocessing streams produce a merged and augmented data file, cust_call_plus.dat. The exploration and clustering streams then work from cust_call_plus.dat. Additional preprocessing occurs, and the data set is split into training and test data files used to build and evaluate the predictive churn models. Finally, a deployment stream runs directly from the raw data: it performs all the required preprocessing and then scores the customer base on propensity to churn. For more information, see "Module 1--Churn Application" in the chapter Working with CAT Streams.
Module 2
Module 2 is a cross-sell application specific to the telecommunications industry. It uses seven streams to perform product analysis and cross-sell recommendations. This module uses the sequence of explorations and operations listed below to produce a recommendation model:
A preprocessing stream runs from the raw "till-roll" data (products.dat) that lists the
separate purchases made by each customer. This stream then produces a "basket" form of the data with one record per customer (cust_prod.dat).
Further preprocessing merges the basket data with the customer/behavioral data from Module 1 (cust_call_plus.dat). Unlike those of the churn application, the cross-sell streams often make direct use of the raw data.
As a final step, a deployment stream illustrates the techniques of product
recommendation using an association model. For more information, see "Module 2--Cross-Sell Streams" in the chapter Working with CAT Streams.
own business-specific application. Simply load the streams into Clementine, execute them on the data provided with the CAT, and examine their composition using the stream information provided in this guide.
- Use the CAT streams as prepackaged components that you can attach to your existing data. With the minor modifications detailed in this guide, you can use the templates directly for your own data mining applications.

To determine which method is better for your business needs, address the questions below.
How well does the CAT match your technical situation?
Your data

It is not necessary for the match between the CAT data and your data to be exact before the streams can be reused. For example, additional customer information, such as postal code, may be used without any need to change the stream. On the other hand, the inclusion of new usage categories used in various preprocessing steps would require minor changes to some streams (for example, the addition, deletion, or modification of Derive nodes). A completely different organization of data in the database might require significant restructuring of the streams or the addition of new preprocessing streams to bring the data organization into line with that of the CAT.
How well does the CAT answer your business questions?
To answer the second question, you should determine whether the specific questions addressed by the CAT match your business questions, such as "What is the relationship between churn and dropped calls?" You may have business reasons for believing that this relationship is not relevant to your specific situation, in which case, you could omit
this exploration as you use the streams. On the other hand, you may want to address other business questions that are not considered in the Telco CAT. In this case, you would want to supplement the CAT with additional streams of your own. As a general rule, the greater the divergence of your business questions from those addressed in the CAT, the greater the likelihood that streams will need to be modified or used for illustration rather than reuse.
may need to specify a different path in the Source node dialog box.
- When publishing a stream, be sure to check that the output file path is not set to the $CLEO directory.
Chapter 4
Working with CAT Streams
Overview
Now that you have had an introduction to data mining in the telecommunications industry, you are ready to go into greater detail. In this chapter, you can examine in depth the streams of each module. You can see how the data are prepared and how the models are built. Read on for a closer look at how the Telco CAT works.
who have churned. The first approach is illustrated in M1_churnclust. The second and third approaches are shown in M2_churnpredict. Any of these approaches produces models that can be deployed in a churn prediction application. In this case, deployment is illustrated using a neural net scoring model (in the stream D1_churnscore).
The following diagram illustrates how the streams fit together to form the churn application.
Figure 4-1
Data files and streams in Module 1 (Telco CAT Module 1 - Churn Application)
[Diagram: the raw data files Cust_info.dat, Cdr.dat, and Tariff.dat feed P1_aggregate.str and P2_value.str, which produce Cust_calls.dat and Cust_call_plus.dat; E1_explore.str and E2_ratios.str explore these files; P3_split.str splits the data into train and test sets used by the modeling streams M1_churnclust.str and M2_churnpredict.str; D1_churnscore.str deploys the scoring model.]
P1_aggregate. The first preprocessing stream takes two raw data files (custinfo and cdr) and produces an intermediate file (cust_calls.dat). Three preprocessing steps are performed:
- Aggregate the monthly call data into six monthly totals.
- Produce averages and various combined fields from these totals.
- Merge the customer information with this aggregated call data.
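These three steps are performed by Clementine nodes, but their logic can be sketched in pandas. The column names, values, and three-month window below are hypothetical, not taken from the CAT data:

```python
import pandas as pd

# Hypothetical monthly CDR data: one row per customer per month.
cdr = pd.DataFrame({
    "cust_id":   [1, 1, 1, 2, 2, 2],
    "month":     [1, 2, 3, 1, 2, 3],
    "peak_mins": [10.0, 20.0, 30.0, 5.0, 5.0, 5.0],
})

# Hypothetical customer information.
custinfo = pd.DataFrame({
    "cust_id": [1, 2],
    "tariff":  ["CAT 50", "CAT 100"],
})

# Step 1: aggregate the monthly call data into per-customer totals.
totals = cdr.groupby("cust_id", as_index=False)["peak_mins"].sum()

# Step 2: derive averages (the real stream also derives other combined fields).
totals["ave_peak"] = totals["peak_mins"] / 3

# Step 3: merge the customer information with the aggregated call data.
cust_calls = custinfo.merge(totals, on="cust_id")
```

The real stream repeats this for every call type and writes the result to cust_calls.dat.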
P2_value. The second preprocessing stream merges the intermediate file cust_calls and the tariff details file to produce a new intermediate file, cust_call_plus.dat. This new file deals with the higher-level issues of customers' total spending and the appropriateness of their tariffs. The stream also compares what each customer spends with what they would have spent on the "next higher" tariff and flags those who would be "better off" on a higher tariff.
E1_explore. This stream performs a number of visualizations that examine the churn indicator against a number of attributes considered likely to be relevant to churn behavior. The goal is to get a picture of the "shape" of the data before more detailed analyses are undertaken.

E2_ratios. This stream performs explorations that require preprocessing. These explorations fall into five categories:
- Usage--How do usage bands, unused phones, and gender relate to churn?
- Ratios--How do the relations between different usage categories relate to churn?
- Handset--Do different types of handsets have different churn patterns?
- Dropped calls--How does the rate of dropped calls relate to churn?
- Tariff--Do tariff and tariff appropriateness have a relation to churn?
M1_churnclust. At this point in the module, business questions might focus on the relationship between certain churn and spend groups. This stream attempts to answer some of these questions as it produces clustering models and examines the relation of the discovered clusters to churn and customer value (total spending). Then, it produces rule-based profile models of the clusters. Characterizing the relevant clusters will allow churn reduction campaigns to be targeted accurately. Profiling the high-churn groups will also help you understand the reasons for churn. The derived fields from the explorations in E2_ratios are included in this analysis via the SuperNode called added_fields.

P3_split. This stream prepares the augmented customer data, cust_call_plus, for predictive modeling. The fields from the explorations in the stream E2_ratios are added by a SuperNode, and then the data is split randomly in half as training and test data sets.

M2_churnpredict. This is the main predictive modeling stream for the churn application. It builds a number of different predictive churn models using the training data and then compares their performance on the test data.

D1_churnscore. This stream illustrates the deployment of scoring models using a neural net scoring model as an example. It is important to note that, because this stream will be deployed to run outside the context of Clementine, it must perform all of the preprocessing from the raw data, independent of any intermediate files.
Stream Notes
Telco data is usually segmented into several different tables. For the purposes of data mining, these tables need to be combined into a single table. In this example, there are two types of data: CDR data (call data records) and customer information (length of service, tariff, age, handset, etc.). CDR data usually exists at several levels of aggregation:
- The lowest level is individual calls, which are usually too fine-grained for data mining purposes.

- The next level is monthly aggregate calls by type (peak, off-peak, weekend, and international). This type of data is often used for billing purposes.

The Telco CAT data set includes CDR data at the monthly level and contains call minutes and the number of calls for each call type. This data is aggregated further to give a six-month average that smooths out monthly fluctuations and is a more reliable indicator of usage.
The SuperNode Avgs & Counts derives additional fields for average call times (AvePeak, AveOffPeak, AveWeekend, etc.). From this analysis, you can see that clients who make longer calls and possess certain other attributes form a significant segment of the customer base.
Figure 4-3 Detailed view of SuperNode Avgs & Counts
Also derived in the SuperNode are All_calls_mins (sum of all minutes used for all call types) and a total and average length for national calls (all call types except international). These will be used to derive usage ratios in the exploration streams.
Figure 4-4 Deriving All_calls_mins
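These derivations are done in Clementine Derive nodes; a rough pandas sketch, with hypothetical column names and values, looks like this:

```python
import pandas as pd

# Hypothetical per-customer minute totals by call type.
df = pd.DataFrame({
    "peak":     [100.0, 40.0],
    "offpeak":  [50.0, 10.0],
    "weekend":  [30.0, 0.0],
    "internat": [20.0, 0.0],
})

# All_calls_mins: sum of minutes over all call types.
df["all_calls_mins"] = df[["peak", "offpeak", "weekend", "internat"]].sum(axis=1)

# National minutes: all call types except international.
df["national_mins"] = df["all_calls_mins"] - df["internat"]
```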
Stream Notes
In this stream, the previously merged customer/call information is merged again with the tariff details table so that each customer is tagged with the details of the tariff they are on. The SuperNode Tariff approp then calculates several factors associated with cost. Cost has two elements: the fixed cost of the tariff and the cost of the calls. All the tariffs include some free minutes of call time, so customers pay only for calls over and above their free minutes. International calls are not included in free minutes. Call_cost_per_min is a calculation of the cost of all national calls before free minutes divided by the total number of national minutes. The higher the proportion of off-peak and weekend calls, the smaller this number will be.
The SuperNode also calculates whether the user is on the correct tariff. In other words, you could ask, "Would their cost be less on the tariff above?" The answer is calculated by comparing the difference between the tariffs' fixed charges with the amount spent on (nonfree national) calls. The calculation assumes that the general type of tariff (Play or CAT) is always correct but that the customer might be on the wrong tariff within that type (for example, on CAT 50 when he or she should be on CAT 100). The SuperNode also creates "usage bands" so that usage can be categorized for certain types of analysis.
Figure 4-6 Customer usage bands for National mins
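The tariff-appropriateness logic can be sketched as follows. The tariff names echo the CAT 50/CAT 100 example above, but the fixed charges, free minutes, and per-minute rate are invented for illustration:

```python
# Each tariff has a fixed charge, some free national minutes, and
# (possibly) a "next higher" tariff of the same general type.
tariffs = {
    "CAT 50":  {"fixed": 15.0, "free_mins": 50,  "next": "CAT 100"},
    "CAT 100": {"fixed": 25.0, "free_mins": 100, "next": None},
}
RATE = 0.30  # assumed cost per chargeable national minute

def cost(tariff_name, national_mins):
    """Fixed charge plus the cost of calls over and above the free minutes."""
    t = tariffs[tariff_name]
    chargeable = max(0, national_mins - t["free_mins"])
    return t["fixed"] + chargeable * RATE

def better_off_higher(tariff_name, national_mins):
    """Would this customer's cost be less on the tariff above?"""
    nxt = tariffs[tariff_name]["next"]
    return nxt is not None and cost(nxt, national_mins) < cost(tariff_name, national_mins)

# A heavy user on CAT 50 (200 national minutes) is flagged as better off.
flag = better_off_higher("CAT 50", 200)
```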
By examining the graphs, you can learn that customers flagged as high for their tariff are more likely to churn.
Stream Notes
The upper part of the stream shows the analysis of attributes from the raw data. The lower part looks at aggregated or derived fields. Throughout the exploration phase, a range of indicator fields is examined. In each graph, the fields are overlaid with the churn flag to determine if there is a simple relationship between the variable and churn. Typically, what you look for here is an increase or decrease in churn behavior that might indicate different customer segments. For example, a histogram might reveal a trend or a specific band where churn is stronger. Similarly, in distribution charts, some values might be associated with a higher churn rate than others. Many of the churn segments in this module are characterized by a combination of several variables. Typically, single-variable graphs like those used in this stream do not show any obvious trends. In this particular data set, however, you can see some patterns emerge. The following list describes the conclusions drawn from several of the graphs in this stream:
Age. In general, younger people tend to churn at a higher rate, and this is reflected in the data set.

L_O_S (length of service). Churn often occurs just after a contract expires. The key period in this data set is 12-15 months.

Handset. The cost to the customer of a handset is often subsidized. In general, service providers want customers to keep their handset for at least a year to recoup this cost. People with older handsets tend to churn because churning is often a cheap way to upgrade their handsets. People with high-tech handsets may want to upgrade to the latest version as soon as it is available. For high-tech handsets, this period is normally less than six months. In this data set, the handsets have different product codes. The high-tech handsets are ASAD, with the larger version number being the newest.

Dropped calls. This graph indicates service quality (although it can also be related to handset problems). Clients with many dropped calls tend to churn.

Tariff. Some tariffs may be more vulnerable to churn than others because competitors offer a more attractive package for that usage segment. Low-cost users in the cheapest tariffs tend to churn more as they find that even their current tariff is too expensive.

Total_Cost (the total spending of the customer). Low-spending customers tend to churn more. Call cost without tariff (actual call cost) reveals this pattern even more strongly (this excludes international calls). Usage also reveals this trend, more clearly in the usage bands than in the histogram (All_call_mins).

Usage fields. These fields are worth examining in separate graphs in order to check for trends in usage segments related to churn. The presence of trends may depend on particular tariff structures compared to competitors and on the type of call (for example, international, peak, off-peak, weekend).
The fields examined in this stream are those one might examine in any churn application. As data exploration shows, not all of the fields have identifiable patterns, but some do. These relationships will help you determine the next step for your data mining project.
Stream Notes
One way of characterizing churn is to partition the customer base into usage segments and then analyze these for propensity to churn. Deriving higher-level attributes helps this process and enables simpler rules that describe the segments and predict churn. The upper right part of this stream derives a set of usage ratios that can be used to describe customers in terms such as "high off-peak" when a large proportion of their calls are off-peak. Four ratios are derived:
Peak ratio           Peak minutes to national minutes
OffPeak ratio        Off-peak minutes to national minutes
Weekend ratio        Weekend minutes to national minutes
Nat-Internat ratio   International minutes to national minutes
The ratios show the proportion of all calls by those in a particular category (rather than ratios between categories). This method avoids the distortion caused by very low numbers when "pure" ratios are used. Examining the graphs for these ratios, you can see:
- In general, the relation of these ratios to churn will depend on the tariff structure and the competitive environment. Some tariffs favor peak or off-peak calls.

- In this data set, the OffPeak ratio is related to tariff and to high-churn customer segments.
Figure 4-11 Graph of weekend ratio overlaid with churn
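A minimal sketch of the four ratio derivations, using invented minute totals for a single customer:

```python
# Hypothetical six-month minute totals for one customer.
peak, offpeak, weekend, internat = 60.0, 30.0, 10.0, 25.0

# National minutes exclude international calls.
national = peak + offpeak + weekend

# Each ratio is a proportion of national minutes, so the first three sum to 1.
peak_ratio         = peak / national
offpeak_ratio      = offpeak / national
weekend_ratio      = weekend / national
nat_internat_ratio = internat / national
```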
The upper left part of the stream, as viewed in the stream canvas, explores the phenomenon of no usage. A "no usage" flag identifies those customers who have not used their phones in the period covered by the data. This segment has a higher propensity to churn, with the exception of people who use their phones for emergency purposes only. The Select node called No Usage explores this segment in more detail. The lower left part of the stream examines the relation of tariff and the tariff appropriateness indicator to churn.
The middle right part of the stream examines the churn properties of different handsets. The distribution of handsets shows that some handsets are particularly associated with churn. The churn score and aggregate branch of the stream calculates a churn score for each handset (the average churn fraction) and ranks handsets in order of score.
Figure 4-12 Distribution of churn with handset type
The lower right part of the stream examines dropped calls. You can explore the Dropped_calls histogram and generate a Derive node to flag records with a high number of dropped calls. The resulting distribution graph, high Dropped calls, shows the increased propensity to churn in this group. The derived fields in this stream will be included in the next preprocessing stage.
Stream Notes
The SuperNode added fields adds the higher-level attributes derived in e2_ratios.
Figure 4-14 SuperNode stream segment
The Derive node called Split generates a random number (either 1 or 2). This field is used to partition the data set into training and test subsets that are written into separate files. These files are used for modeling in M1_churnclust.str.
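The split logic can be sketched like this; the record contents and random seed are arbitrary:

```python
import random

# Hypothetical customer records.
random.seed(7)
records = [{"cust_id": i} for i in range(100)]

# Assign each record a random 1 or 2, as the Split Derive node does.
for r in records:
    r["split"] = random.randint(1, 2)

# Partition into training and test subsets (written to separate
# files in the real stream).
train = [r for r in records if r["split"] == 1]
test  = [r for r in records if r["split"] == 2]
```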
The stream also contains a Type node that can be saved and reused in the predictive modeling stream. This step is necessary to ensure that the Type node in use for modeling has been instantiated with all the data (and not just the training subset).
Stream Notes
Clustering is an alternative to predictive analysis and can give you insight into the "natural" segments in the data. For example, if high-churn clusters can be identified, business actions such as special offers might be tailored to that segment. Clustering can also be used for value analyses, such as identifying high-spending clusters and cross-sell opportunities. The SuperNode added fields adds the attribute fields that were derived in e2_ratios. Two clustering techniques are then used: Kohonen and K-means. The upper part of the stream analyzes the Kohonen clustering, and the lower part analyzes the K-means clustering. The Kohonen network produces a two-dimensional grid of clusters. The Derive node Cluster No labels these fields, making it possible to analyze individual clusters. The top branch of the stream adds a churn score (either 0 or 1) and then aggregates to show the average churn per cluster. These clusters are then ranked and displayed in the table called churn.
Figure 4-17 Table illustrating clusters ranked by churn score
A related branch of the Kohonen network (ending in the table called value) calculates the average value per cluster and ranks the clusters in order to identify high-spending clusters. The relation of the clusters to churn and value is also examined by visualization. Another related branch uses C5.0 rule induction to create a ruleset that profiles the clusters. This model is called Cluster No. Comparison of the value, churn, and profile associated with particular clusters can give you detailed insight into the customer base. For example, cluster 11 is a high-churn, medium-value cluster associated with high-usage males with certain tariffs and handsets.
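The churn-score-and-aggregate branch amounts to a groupby over cluster labels; the labels and churn flags below are invented for illustration:

```python
import pandas as pd

# Hypothetical records: a cluster label and a 0/1 churn flag each.
df = pd.DataFrame({
    "cluster": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "churned": [1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Average churn fraction per cluster, ranked from highest to lowest.
churn_by_cluster = (
    df.groupby("cluster")["churned"]
      .mean()
      .sort_values(ascending=False)
)
```

The same pattern, with a spending field in place of the churn flag, yields the value ranking.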
The lower part of the stream performs a similar analysis for a K-means cluster model. The relationship of the clusters to value and churn is explored, the value and churn cluster rankings are calculated, and the clusters are profiled using C5.0.
Stream Notes
The C5.0, C&RT, and logistic regression models are categorical in that they make a yes/no prediction for churn. The neural network is a scoring model because the churn flag is replaced by a number (0.0 or 1.0, called the churn score) used as the prediction target. The neural network thus predicts churn on a continuum between 0 and 1. The C5.0 and C&RT predictions are also converted into scores between 0 and 1 using the confidence values (visualized in the histograms Score C and Score R). Logistic regression models produce probability fields that can be used directly for scoring. When using multiple models, you might ask, "How do I know which model to select for scoring?" To answer this question, you can compare the performance of the three categorical models using the Analysis node in the lower stream and the nearby evaluation chart that compares the gains curves of all three models. The performance of the neural network is analyzed separately in the lower part of the stream, but the performance of all the models can be compared in the lower evaluation chart.
Figure 4-20 Evaluation chart comparing all four models
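The conversion of a categorical prediction plus its confidence into a 0-1 score can be sketched as follows (the field values are hypothetical):

```python
def churn_score(prediction, confidence):
    """Confidence in a 'yes' is kept; confidence in a 'no' is inverted."""
    return confidence if prediction == "yes" else 1.0 - confidence

# Three hypothetical (prediction, confidence) pairs.
scores = [churn_score(p, c) for p, c in
          [("yes", 0.9), ("no", 0.8), ("yes", 0.55)]]
```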
A second issue to consider is the likely value of the clients that may churn. The plot of Total_Cost v. $N-Churnscore is useful for devising a campaign matrix. The first clients to contact would be those with high value and high score (most likely to churn) followed by high value and medium score, and medium value and high score. At this point, you have completed all the data preparation and modeling for Module 1 and should have enough information to facilitate decision making. However, if you would like to deploy these streams independently of the Clementine application, read on for more information.
Stream Notes
The deployment stream runs independently of Clementine and therefore has to combine all the operations needed to create the data in a single stream. Three input files are required: cdr.dat (call data), custinfo.dat (customer information), and tariff.dat. The lower left part of the stream duplicates the processing of p1_aggregate.str, the upper left part duplicates p2_value.str, and the second SuperNode duplicates the fields derived in p3_split.str. The data is then run through the neural net scoring model from m2_churnpredict.str. The final Publisher node is used to generate the stand-alone application.
In all of these manipulations, you should focus on discoveries that will allow you to predict purchasing patterns at a higher level than individual products. Again, any of these approaches can be deployed to make purchase recommendations (that is, to indicate likely purchases for individual customers). In this module, the deployment of association rules for recommendations is illustrated in D2_recommend.str.
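The recommendation logic itself can be sketched as follows; the rules, product names, and confidence values are invented, not drawn from the CAT data:

```python
# Each rule: (antecedent products, consequent product, confidence).
rules = [
    ({"voicemail", "caller_id"}, "call_waiting", 0.72),
    ({"internat_plan"}, "voicemail", 0.40),
]

def recommend(basket):
    """Fire every rule whose antecedents are all in the basket and whose
    consequent is not already owned; return highest confidence first."""
    recs = [(conseq, conf) for antecedent, conseq, conf in rules
            if antecedent <= basket and conseq not in basket]
    return sorted(recs, key=lambda rc: rc[1], reverse=True)

recs = recommend({"voicemail", "caller_id"})
```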
Figure 4-22 Data files and streams in Module 2
[Diagram: Telco CAT Module 2 (cross-sell application) data files and streams--P4_basket.str reads products.dat and produces cust_prod.dat; P5_custbasket.str produces cust_call_prod.dat; E3_products.str (association rules), E4_prodvalue.str (groupings), M3_prodassoc.str, and D2_recommend.str complete the module.]
P4_basket. This stream performs a simple set-to-flag or basket transformation on the raw till-roll style customer/product information. This process produces a basket data format with one record per customer and one flag field per product. The stream uses products.dat and produces cust_prod.dat.

E3_products. This stream explores the relationships between product purchases using a web display and association rule modeling (Apriori).

D2_recommend. This stream illustrates how association rules can be deployed to produce recommendations, or likely purchases for customers based on what they have already purchased.

P5_custbasket. This stream combines the basket-style data produced by P4_basket.str (cust_prod.dat) with the augmented customer/call information (cust_call_plus.dat) to produce a new file (cust_call_prod.dat).

M3_prodassoc. This stream builds a Kohonen clustering model based on customer and call information and then explores the relationships between the clusters and product purchases. The goal is to discover groups of customers with propensities for certain purchases. This stream is potentially useful for cross-selling recommendations.

E4_prodvalue. This stream explores the relationship between product purchase and total customer spending (or value). Value-related product groups are discovered.

M4_prodprofile. This stream profiles the value-related product groups discovered in E4_prodvalue in terms of customer and call information. The goal is to discover profiles of customers likely to buy the products in each group. The stream illustrates how customers predicted to buy the products in a group can be selected as targets for a cross-selling campaign.
Stream Notes
The raw product data is in the form of a till-roll, where each record links a customer ID to one product purchased. The Set-to-Flag node takes the set of all products and creates a flag field for each product and then aggregates by customer ID. The result is a "basket" record for each customer, containing T in the fields for products they have purchased and an F in the other (nonpurchased) product fields. Executing this stream will produce a new Source node called cust_prod.dat. This Source node is then merged with data from another source as shown in the next stream, p5_custbasket.str.
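The set-to-flag transformation described above can be sketched in plain Python. The till-roll records and T/F flag convention follow the text; the data values themselves are illustrative.

```python
# Minimal sketch of the Set-to-Flag ("basket") transformation: till-roll
# records (one row per customer/product purchase) become one record per
# customer, with a T/F flag per product.

till_roll = [
    ("C1", "Product_01"), ("C1", "Product_03"),
    ("C2", "Product_02"),
]

products = sorted({p for _, p in till_roll})          # the set of all products
baskets = {}
for cust, prod in till_roll:
    # Start each customer with "F" in every product field...
    flags = baskets.setdefault(cust, {p: "F" for p in products})
    flags[prod] = "T"                                  # ...and mark purchases "T"
```

Each entry in `baskets` corresponds to one record of the cust_prod.dat basket format.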
Stream Notes
The Merge node combines the basket data with the cust_call_plus data. This latter data set includes both call data records and customer information, such as demographics. The Derive node splits customer value into five bands to assist the exploration of relationships between products and customer value or spending.
Figure 4-25 Splitting customer value into bands using a Derive node
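The Derive node's banding step can be approximated with an equal-frequency split. The five-band count comes from the text; the rank-based method and toy costs below are assumptions (Clementine's Derive node may use a different banding rule).

```python
# Sketch of splitting customer value into five equal-frequency bands
# (quintiles) by rank, standing in for the Derive node's banding.

def value_band(costs, n_bands=5):
    """Assign each cost a band 1..n_bands by rank (equal-frequency)."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    bands = [0] * len(costs)
    for rank, i in enumerate(order):
        bands[i] = rank * n_bands // len(costs) + 1
    return bands

bands = value_band([10, 200, 35, 120, 60, 300, 80, 150, 20, 250])
```

With ten customers, each band holds two; band 1 is the lowest-spending fifth and band 5 the highest.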
Stream Notes
The stream explores the purchasing relationships (which products are purchased together) in the basket data. The Web node called Products looks at pairwise associations between products.
Figure 4-27 Web analysis of product associations
The Apriori node called Products performs a basket analysis and can confirm these binary patterns while discovering more complex (multiproduct) purchasing patterns. You can tune the Apriori node, changing the thresholds to control the number of relationships found. The Web node can be used to estimate appropriate thresholds for confidence and coverage.
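The pairwise measures that the Web node visualizes (and that Apriori thresholds are set against) can be computed directly. This is a pure-Python sketch over toy baskets; the support and confidence definitions are the standard ones, not a reproduction of Clementine's exact settings.

```python
# Pairwise association measures like those read off the Web node: for each
# ordered pair (A -> B), support = fraction of baskets containing both,
# confidence = P(B in basket | A in basket).

from itertools import permutations

baskets = [
    {"P1", "P2"}, {"P1", "P2"}, {"P1", "P3"}, {"P2"},
]

def pair_stats(baskets):
    n = len(baskets)
    items = set().union(*baskets)
    stats = {}
    for a, b in permutations(sorted(items), 2):
        with_a = [bk for bk in baskets if a in bk]
        both = sum(1 for bk in with_a if b in bk)
        if with_a:
            stats[(a, b)] = (both / n, both / len(with_a))  # (support, confidence)
    return stats

stats = pair_stats(baskets)
```

Inspecting `stats` for strong pairs suggests sensible confidence and coverage thresholds before running the Apriori node, just as the text recommends using the web display.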
Stream Notes
In this stream, the Kohonen network is used to cluster customers based on call usage data (behavioral data) and customer information. The rest of the stream analyzes the characteristics of these clusters.
- The lower left part of the stream calculates the total and average customer value for each cluster.
- The right side of the stream merges the clustered customer data with the individual product list. This step helps you to characterize each cluster in terms of products purchased. To identify cross-selling opportunities, you can select clusters that have a high proportion of purchases of the product of interest and then try to sell the product to the remaining clients in that cluster who have not purchased it.
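A Kohonen network can be sketched as a small self-organizing map loop. The grid size, learning rate, epochs, and toy behavioral profiles below are all assumptions; this stands in for the Kohonen node rather than reproducing it.

```python
# Minimal Kohonen (self-organizing map) training loop: a 1-D grid of units
# competes for each record; the winning unit and its grid neighbours move
# toward the record. Deterministic start so the sketch is reproducible.

def train_som(data, n_units=4, epochs=50, lr=0.3, radius=1):
    dim = len(data[0])
    # Spread units evenly along the diagonal as a deterministic start.
    units = [[(u + 0.5) / n_units] * dim for u in range(n_units)]
    for _ in range(epochs):
        for x in data:
            # Best-matching unit = nearest by squared Euclidean distance.
            bmu = min(range(n_units),
                      key=lambda u: sum((units[u][d] - x[d]) ** 2 for d in range(dim)))
            for u in range(n_units):
                if abs(u - bmu) <= radius:            # neighbourhood update
                    for d in range(dim):
                        units[u][d] += lr * (x[d] - units[u][d])
    return units

def cluster(units, x):
    return min(range(len(units)),
               key=lambda u: sum((units[u][d] - x[d]) ** 2 for d in range(len(x))))

# Two well-separated behavioural profiles (e.g. scaled call minutes):
data = [(0.1, 0.1), (0.12, 0.08), (0.9, 0.95), (0.88, 0.9)]
units = train_som(data)
```

After training, similar records map to nearby units, which is what lets the stream treat each unit (or group of units) as a customer cluster to profile.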
The Matrix node called cluster x Product breaks down the sales of each product by cluster, thus highlighting the high-selling clusters for each product.
- The upper right branch sorts product purchases by cluster. Products are sorted in terms of the number purchased in each cluster, showing the top-selling products for each cluster.
The distribution graphs at the bottom of the stream canvas help to clarify the analysis:
- Cluster is a distribution of clusters overlaid with products, indicating the relative importance of the different products in each cluster. For example, products 11 and 12 are relatively unimportant in cluster 02 but relatively important in cluster 32.
- ValueBand gives a distribution of value band overlaid by product. For example, products 11 and 12 are relatively important in certain areas, such as low-value bands.
- Finally, Product shows the converse relationship: the relative importance of the different value bands to each product. In the distribution graph, the product groupings are clearly visible. For example, products 1-4 are associated with high-value customers.
Figure 4-29 ValueBand distribution showing products purchased in each value band
Stream Notes
The upper part of the stream analyzes the distribution of products purchased in different value bands. The Directed Web node called ValueBand x Products shows these relationships. Four groups of product associations, revealed by this web and the previous stream (e3_products), have been coded into the Derive nodes as flags for groups 1-4. The relations between these groups and the value bands are then explored in the directed web called ValueBand x Groups.
The lower part of the stream uses the raw product records and merges them with the total cost and value band information in order to count products per customer. The upper branch of the stream calculates the average number of products per customer in each of the five value bands and ranks the bands in terms of this average. The lower branch calculates the total for each product purchased in each value band. The aggregated data is then sorted to give a ranked list of value bands for each product. This ranking helps answer questions such as "For a given product, in which value bands does it sell best?"
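The lower branch's aggregate-and-sort step can be sketched as follows; the band labels and quantities are illustrative.

```python
# Sketch of the lower branch: total units of each product sold per value
# band, then a ranked list of bands for each product, answering "for a
# given product, in which value bands does it sell best?"

sales = [  # (value band, product, quantity)
    ("band1", "P1", 3), ("band2", "P1", 7), ("band3", "P1", 5),
    ("band1", "P2", 9), ("band2", "P2", 1),
]

totals = {}
for band, product, qty in sales:
    row = totals.setdefault(product, {})
    row[band] = row.get(band, 0) + qty

# Bands ranked best-selling first, per product:
ranked = {p: sorted(bands, key=bands.get, reverse=True)
          for p, bands in totals.items()}
```

`ranked["P1"]`, for example, lists P1's value bands in descending order of units sold.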
Stream Notes
Previous streams (e3_products.str and e4_prodvalue.str) have been used to discover product groupings, or sets of products that customers purchase together. In this stream, these groups are flagged by derived indicators in the SuperNode called Groups. Such flagging is helpful for cross-selling: for example, when there is a three-product grouping and you have a number of customers with only two of the three products, you could use this information to identify these customers and to offer them the third product.
This stream builds profiles for three of the four identified product groupings using customer behavioral and descriptive information. Some groups (in this case, group 1) produce no useful profile. The three models built from this data have quite different characters: the model for group 2 is very simple, the model for group 4 is moderately complex, and the model for group 3 is very complex. This complexity appears to have an inverse relationship to the quality of the model. In other words, the simpler the model, the more accurate it is (as illustrated by the Analysis node results) and the better its gains chart (as shown by the evaluation chart).
Figure 4-33 Evaluation of different propensity models
The models, once built, can be used to predict which clients will buy a particular product grouping. Similarly, those who have not purchased all products in their group can be targeted for cross-selling. The lower right branch of the stream shows the selection of targets for group 2 products.
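The target-selection step can be sketched by finding customers who own all but one product of a grouping. The grouping contents and the "missing exactly one" rule here are toy stand-ins for the model-driven selection in the stream.

```python
# Selecting cross-sell targets: customers who fit a product grouping but
# are missing one of its products. Group membership below is a toy rule
# standing in for the propensity model's prediction.

GROUP2 = {"P5", "P6", "P7"}                      # hypothetical product grouping

baskets = {
    "C1": {"P5", "P6"},           # owns 2 of 3 -> target, offer P7
    "C2": {"P5", "P6", "P7"},     # owns all    -> not a target
    "C3": {"P5"},                 # owns 1      -> weak fit, skipped here
}

targets = {
    cust: GROUP2 - owned
    for cust, owned in baskets.items()
    if len(GROUP2 & owned) == len(GROUP2) - 1    # missing exactly one product
}
```

For each target, the set difference is exactly the product to offer in the cross-selling campaign.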
Stream Notes
The upper part of the stream produces an association rule model and converts it into a form in which it can be used for product recommendations. The lower stream uses the converted rules to recommend additional products for a user's "basket"; the basket of products already purchased can be provided using a User Input node. The format of the input basket is a single record for each product purchased, containing a user ID and the product. This input format allows you to make recommendations for multiple users simply by substituting a file containing user IDs and product purchases (the file products.dat provides such an example). There are actually three streams in this file. The top left stream generates the association rules using Apriori. The unrefined association rule model will appear in the Generated Models palette of the Managers window. To complete the first stage of rule preparation, you should browse the model, select Show criteria, and then export the model as a text file (in this case, assoc_rules.txt).
The second stream converts the ruleset into a form that can be used for recommendations. The association rules are saved in the form:
instances support confidence consequent antecedent1 ...
Components are separated by tabs. In this case, the rules are interpreted to mean that if the basket contains the antecedents, then purchasing the consequent is recommended (provided the customer hasn't already purchased it). The antecedents may contain one or more items. The stream assigns a rule number (Rule) to each rule and converts the rules into a record for each condition. It also adds a variable (Conds), which is the total number of conditions in each rule. The stream can process rulesets that have up to three conditions. (Further branches would have to be added for rules with more than three conditions.) The stream determines whether the conditions consist of one, two, or three items by examining the antecedent fields. All rules have at least one condition (Cond1), so every rule/record passes through the upper branch; the second and third branches are used only if there are second and third conditions (Cond2 and Cond3). The different branches select these cases and produce a separate record for every condition; the rule number (Rule) and the total number of conditions (Conds) are attached to each condition. These condition records are appended together into the file conditions.txt, which is used in the final stream for recommendations.
The lower stream is the recommendation engine. A user ID and basket are entered via the User Input node. The basket items are entered in single quotes ('01' '02', and so on), separated by spaces. This produces a record for each product purchased. The Derive node Condition converts each product into the same form as it appears in the conditions file from the association rules. The user basket is then merged with the conditions. All conditions that appear in the user basket will be matched. The resultant "matched conditions" data is then aggregated by user and rule number (Rule), and only those rules where all the conditions are matched are retained (Select node matched).
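The rule-conversion stage (one record per condition, tagged with Rule and Conds) can be sketched directly from the tab-separated layout described above. The rule contents below are toy values; the field names Rule and Conds follow the text.

```python
# Sketch of the rule-conversion stage: each exported association rule
# (instances, support, confidence, consequent, antecedents...) becomes one
# record per condition, tagged with a rule number (Rule) and the total
# condition count (Conds).

raw_rules = [
    "12\t0.30\t0.80\tP9\tP1",         # one antecedent
    "8\t0.20\t0.75\tP7\tP2\tP5",      # two antecedents
]

conditions = []
for rule_no, line in enumerate(raw_rules, start=1):
    parts = line.split("\t")
    consequent, antecedents = parts[3], parts[4:]
    for ante in antecedents:           # one record per condition
        conditions.append({"Rule": rule_no, "Conds": len(antecedents),
                           "Condition": ante, "Consequent": consequent,
                           "Confidence": float(parts[2])})
```

Because a loop handles any number of antecedents, this sketch has no three-condition limit; the stream's branch-per-condition design is what imposes that limit in Clementine.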
The Derive node AllProducts accumulates the customer basket; selecting the last record for a customer yields the total product basket. (Note: this will work for multiple customers.) The total basket (AllProducts) is merged with the matched rules, and those rules that recommend products already in the basket are discarded. The remaining rules are sorted on customer ID and rule confidence; the Distinct node then discards any products that have been recommended for the same customer more than once. The recommendations, already sorted in order of confidence, are displayed in the table. The Clementine Solution Publisher node can be used to deploy this stream as a stand-alone recommendation engine.
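The matching and filtering logic of the recommendation engine can be sketched end to end. This is an illustrative condensation of the stream's steps (match all conditions, drop already-owned products, keep the highest-confidence recommendation per product, sort by confidence); the rules themselves are toy data.

```python
# Sketch of the recommendation engine: rules whose conditions are all in
# the user's basket fire; rules recommending already-owned products are
# dropped; duplicates keep only the highest confidence; output is sorted
# by confidence, as in the final table.

rules = [  # (rule_no, conditions, consequent, confidence)
    (1, {"P1"}, "P9", 0.80),
    (2, {"P1", "P2"}, "P7", 0.75),
    (3, {"P3"}, "P9", 0.90),          # P3 not in basket -> does not fire
    (4, {"P1"}, "P2", 0.60),          # P2 already owned -> discarded
]

def recommend(basket, rules):
    best = {}
    for _, conds, consequent, conf in rules:
        if conds <= basket and consequent not in basket:
            best[consequent] = max(best.get(consequent, 0.0), conf)
    return sorted(best.items(), key=lambda kv: -kv[1])

recs = recommend({"P1", "P2"}, rules)
```

Running this per customer record reproduces the multi-user behavior described above: substitute a file of user ID/product pairs and call `recommend` once per accumulated basket.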
Appendix
A
Telco CAT Data Files and Field Names
Field name -- Explanation
Customer_ID -- Unique customer key
Gender -- Sex: male or female
Age -- Age in years
Connect_Date -- Date phone was "connected"; start of customer relationship
L_O_S -- Length of service in months (since connect date)
Dropped_Calls -- Number of dropped calls during 6-month period
Pay Method -- Method of payment: either pre- or post-paid
tariff -- Tariff type
Churn -- Flag: churned or active
Handset -- Name of handset type
Field name -- Explanation
Customer_ID -- Unique customer key
Peak_calls -- Number of peak-time calls in month indicated
Peak_mins -- Number of peak-time call minutes in month indicated
OffPeak_calls -- Number of off-peak calls in month indicated
OffPeak_mins_Sum -- Number of off-peak minutes in month indicated
Weekend_calls -- Number of weekend calls in month indicated
Weekend_mins -- Number of weekend minutes in month indicated
International_mins -- Number of international-call minutes in month indicated
Nat_call_cost_Sum -- Cost of national calls (peak + off-peak + weekend) in month indicated
month -- The month described by the record; 6 months supplied for each customer
Data file: tariff.dat
Field name -- Explanation
tariff -- Tariff type
fixed_cost -- Fixed monthly cost for this tariff type
Free_mins -- Number of free (national) call minutes for this tariff type
peak_rate -- Cost per minute for peak-time calls beyond free minutes for this tariff type
OffPeak_rate -- Cost per minute for off-peak calls beyond free minutes for this tariff type
Weekend_rate -- Cost per minute for weekend calls beyond free minutes for this tariff type
International_rate -- Cost per minute for international calls for this tariff type
Voicemail -- Cost of voicemail service (not used)
SMS -- Cost of SMS service (not used)
Field name -- Explanation
Customer_ID -- Unique customer key
Product -- One product bought by this customer (a customer may have several rows)
Field name -- Explanation
Customer_ID through Handset -- As in the customer information file above
Peak_calls_Sum -- Total number of peak-time calls in 6-month period
Peak_mins_Sum -- Total number of peak-time call minutes in 6-month period
OffPeak_calls_Sum -- Total number of off-peak calls in 6-month period
OffPeak_mins_Sum -- Total number of off-peak minutes in 6-month period
Weekend_calls_Sum -- Total number of weekend calls in 6-month period
Weekend_mins_Sum -- Total number of weekend minutes in 6-month period
International_mins_Sum -- Total number of international-call minutes in 6-month period
Nat_call_cost_Sum -- Total cost of national calls (peak + off-peak + weekend)
AvePeak -- Average duration of peak-time calls during 6-month period
AveOffPeak -- Average duration of off-peak calls during 6-month period
AveWeekend -- Average duration of weekend calls during 6-month period
National_calls -- Total number of national calls in 6-month period
National mins -- Total number of national minutes in 6-month period
AveNational -- Average duration of national calls during 6-month period
All_calls_mins -- Total number of call minutes in 6-month period (national + international)
Field name -- Explanation
Customer_ID through All_calls_mins -- As described above
Usage_Band -- A banding of national call minutes
Mins_charge -- Number of chargeable national call minutes in 6-month period (national minutes - free minutes)
call_cost_per_min -- Cost of national calls per minute, ignoring free minutes
actual call cost -- Cost of national calls after free minutes removed; indicates call mix
Total_call_cost -- actual call cost + cost of international calls
Total_Cost -- Total call cost + fixed cost of tariff
Tariff_OK -- Flag to indicate tariff appropriateness
average cost min -- Total cost / all call minutes (average call cost per minute including tariff cost and international calls)
Data files: train.dat and test.dat--new fields added by p3_split.str--also derived and explored in e2_ratios.str and m1_churnclust.str
Field name -- Explanation
Customer_ID through average cost min -- Inherited from cust_call_plus.dat
Peak ratio -- Ratio of peak-time minutes to national minutes
Offpeak ratio -- Ratio of off-peak minutes to national minutes
Weekend ratio -- Ratio of weekend call minutes to national minutes
Nat-InterNat Ratio -- Ratio of international call minutes to national minutes
High Dropped calls -- Number of dropped calls above threshold
No usage -- Client has made 0 calls in the 6-month period
Field name -- Explanation
Customer_ID -- Unique customer key
Product_01 through Product_12 -- For each product, a flag indicating whether the customer bought this product (one record per customer)
Field name -- Explanation
Customer_ID through average cost min -- As described above
Product_01, Product_02, Product_03, Product_04, Product_05, Product_06
Appendix
B
Using the Data Mapping Tool
In contrast to earlier versions of Clementine, data mapping is now tightly integrated into stream building, and if you try to connect to a node that already has a connection, you will be offered the option of replacing the connection or mapping to that node.
Step 2: Add the new data source to the stream canvas. Using one of Clementine's source nodes, bring in the new replacement data.
Step 3: Replace the template source node. Using the Data Mapping options on the context menu for the template source node, choose Select Replacement Node. Then select the source node for the replacement data.
Figure B-2 Selecting a replacement source node
Step 4: Check mapped fields. In the dialog box that opens, check that the software is mapping fields properly from the replacement data source to the stream. Any unmapped essential fields are displayed in red. These fields are used in stream operations and must be replaced with a similar field in the new data source in order for downstream operations to function properly. For more information, see "Examining Mapped Fields" below. Once all essential fields are properly mapped, the old data source is disconnected and the new data source is connected to the template stream using a Filter node called Map. This Filter node directs the actual mapping of fields in the stream. An Unmap Filter node is also included on the stream canvas. The Unmap Filter node can be used to reverse field name mapping by adding it to the stream. It will undo the mapped fields, but note that you will have to edit any downstream terminal nodes to reselect the fields and overlays.
Figure B-3 New data source successfully mapped to the template stream
Using the Field Chooser, you can add or remove fields from the list.
Original. Lists all fields in the template or existing stream--all of the fields that are present farther downstream. Fields from the new data source will be mapped to these fields.
Mapped. Lists the fields selected for mapping to template fields. These are the fields whose names may have to change to match the original fields used in stream operations. Click in the table cell for a field to activate a drop-down list of available fields.
If you are unsure of which fields to map, it may be useful to examine the source data closely before mapping. For example, you can use the Types tab in the source node to review a summary of the source data.
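The Map Filter node's job, renaming fields from a new source onto the names the template expects and flagging unmapped essential fields, can be approximated in a few lines. The essential-field set, field names, and error behavior below are illustrative assumptions, not Clementine's implementation.

```python
# Approximation of the Map Filter node: rename fields from a new data
# source onto the template's expected names, raising on any unmapped
# essential field (the ones the dialog box shows in red).

ESSENTIAL = {"Customer_ID", "tariff", "Churn"}       # hypothetical essential fields

def map_fields(record, mapping):
    """mapping: {new_source_field_name: template_field_name}."""
    mapped = {mapping.get(k, k): v for k, v in record.items()}
    missing = ESSENTIAL - mapped.keys()
    if missing:
        raise ValueError(f"unmapped essential fields: {sorted(missing)}")
    return mapped

row = {"cust_no": "C42", "plan": "CAT50", "Churn": "F"}
mapped = map_fields(row, {"cust_no": "Customer_ID", "plan": "tariff"})
```

Reversing the `mapping` dictionary gives the equivalent of the Unmap Filter node: the same records under their original source names.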
Index
acquisition, 10
Analysis node, 34
analytical CRM, 9
  benefits of, 10
Apriori, 41
association rules, 48
C&RT, 34
C5.0, 31, 34
CAT
  guidelines for use, 14
  reusing streams, 15
  Telco CAT data, 11
  Telco CAT modules, 12
  Telco CAT overview, 8
  Telco CAT streams, 12
  Telco CAT structure, 11
categorical models, 34
CDR data, 19
churn, 34
churn analysis, 23
churn application
  details, 16
  overview, 12
churn score, 26
Clementine Application Templates (CATs)
  data mapping tool, 60
clustering, 31
  in Module 1, 16
  in Module 2, 37
CRISP-DM, 12
CRM
  analytical, 9
  data mining, 9
  introduction, 9
  life cycle, 9
  operational, 9
cross-sell application
  details, 37
  overview, 12
cross-selling, 10
customer
  acquisition, 10
  retention, 10
d1_churnscore.str, 36
d2_recommend.str, 48
data
  connecting to streams, 15
  merging from both modules, 40
  overview, 11
  train/test sets, 29
data exploration
  in Module 1, 23, 26
data files
  intermediate data files, 53, 57
  list of, 51
data mapping tool, 60, 61
data mining, 9
  benefits of, 10
  stream phases, 12
data preparation
  in Module 1, 19, 21, 29
  in Module 2, 39, 40
data understanding, 41
deployment
  in Module 1, 36
  in Module 2, 48
e1_explore.str, 23
e2_ratios.str, 26
e3_products.str, 41
e4_prodvalue.str, 45
essential fields, 60, 64
evaluation charts, 34
exploration
  in Module 2, 41, 45
m1_churnclust.str, 31
m2_churnpredict.str, 34
m3_prodassoc.str, 43
m4_prodprofile.str, 47
mandatory fields, 65
mapping data, 15, 64
mapping fields, 60
Matrix node, 43
merging data, 40
modeling
  in Module 1, 31, 34
  in Module 2, 43, 47
  train/test sets, 29
models
  categorical, 34
  scoring, 34
scoring models, 34
  in Module 1, 16
  in Module 2, 37
segmentation, 10, 26
solutions template library, 60
source nodes
  data mapping, 61
streams
  details, 16
  guidelines for use, 14
  overview, 12
  reusing with your data, 15
tariff information, 21
telecommunications CAT
  overview, 8
  structure, 11
template fields, 65
templates, 60, 61
test data set, 29
till-roll, 39
train data set, 29
value bands, 45