Abstract. The customer churn problem hugely affects telecommunication services in particular, and businesses in general. In the majority of cases the number of potential churners is much smaller than the number of non-churners. Therefore, the imbalanced distribution of samples between churners and non-churners is a concern when building a churn prediction model. This paper presents a Local PCA approach to the imbalanced classification problem that generates new churn samples. The experiments were carried out on a large real-world telecommunication dataset and assessed on a churn prediction task. They showed that Local PCA, along with Smote, outperformed the Linear Regression and Standard PCA data generation techniques.
Keywords: PCA, Imbalanced Classification, Churn Prediction.
1 Introduction
Customer churn has become a serious problem for companies, mainly in the telecommunication industry. This is a result of recent changes in the telecommunications industry, such as new services and the liberalisation of the market. In recent years, Data Mining techniques have emerged as one of the methods to tackle the customer churn problem [1,8].
The study of customer churn can be seen as a classification problem (Churn and Non-Churn classes). The main goal is to build a robust classifier to predict potential churn customers. However, an imbalanced distribution of class samples is an issue in data mining, as it leads to poor classification results [4]. In this paper, we focus on overcoming this problem by increasing the number of churn samples through an over-sampling approach. The aim is to correct the sample distribution by adding minority class samples, so that an optimal classifier can be built. Various sampling approaches have been proposed to counter non-heuristic sampling problems.
Synthetic Minority Over-sampling Technique (Smote) [3] generates artificial data along the line segments between each minority class sample and its K minority class nearest
B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 668–673, 2010.
© Springer-Verlag Berlin Heidelberg 2010
neighbours. This causes the decision boundaries of the minority class space to spread further into the majority class space. An extension of Smote is Smote + Edited Nearest Neighbour (ENN) [9], which removes more unnecessary samples and provides more in-depth data cleaning.
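The interpolation step of Smote can be sketched as follows. This is a minimal illustration, not the reference implementation; the function name and parameters are our own:

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples by interpolating between each
    minority class sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(minority)
    # pairwise distances within the minority class
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    # indices of the k nearest neighbours (excluding the point itself)
    neighbours = np.argsort(dist, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                 # pick a minority sample
        j = neighbours[i, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                  # random point on the line segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two existing minority samples, the new data stays inside the minority class region while enlarging its decision space.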
The main idea of our approach is to form a new minority class space by generating minority class data using the K-means algorithm with PCA [7]. PCA reveals the internal structure of a dataset by extracting uncorrelated variables known as Principal Components (PCs). In this paper, we adopt Local PCA data regression to generate a new dataset and add it to the raw data to change the distribution of class samples.
This paper is organised as follows: the next section outlines the proposed approach to the churn prediction task. Section 3 explains the experiments and the evaluation criteria. We conclude and highlight some key remarks in Section 4.
2 Approaches
Our proposed approach combines the PCA technique, the Genetic Algorithm (GA) and the K-means algorithm to generate new data for the minority class. First, the minority class dataset d_churn is formed from the original raw dataset d_raw. GA K-means clustering is applied on d_churn to form K clusters. The next step is to apply PCA regression on each cluster to transform it back to the original feature space in terms of the selected principal components. We believe that applying the regression locally avoids the inclusion of redundant information in the principal components, because of the lower variance within the clusters. The transformed data are then added to d_raw to improve the distribution of minority class samples. Finally, d_raw is used to build a churn prediction model for classification purposes. Figure 1 shows the main steps of the approach.
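The steps above can be sketched as follows. This is a simplified illustration under our own naming: plain k-means stands in for the GA K-means step, and each cluster is reconstructed from its leading principal components to produce the new minority samples that would be added to d_raw:

```python
import numpy as np

def kmeans(X, k, iters=50, rng=None):
    """Plain k-means (a stand-in for the GA K-means step)."""
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(0)
    return labels

def local_pca_generate(X_churn, k=3, n_pc=1, rng=None):
    """Cluster the churn data, then reconstruct each cluster from its
    leading principal components to generate new minority samples."""
    labels = kmeans(X_churn, k, rng=rng)
    generated = []
    for j in range(k):
        C = X_churn[labels == j]
        if len(C) < 2:
            continue
        mu, sd = C.mean(0), C.std(0) + 1e-12
        Z = (C - mu) / sd                             # standardise the cluster
        vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
        phi = vecs[:, np.argsort(vals)[::-1][:n_pc]]  # top n_pc PCs
        Z_rec = Z @ phi @ phi.T                       # project, transform back
        generated.append(Z_rec * sd + mu)             # undo standardisation
    return np.vstack(generated)
```

Because each cluster is reconstructed from only its own leading components, the generated points follow the local structure of that cluster rather than the global variance of the whole churn set.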
T. Sato et al.

2.1 GA K-means Clustering
The standard K-means algorithm is sensitive to the initial centroids, and poor initial cluster centres lead to poor cluster formation. We employ a Genetic Algorithm (GA) [5] to avoid this sensitivity problem in centroid selection.
In the GA K-means algorithm, a gene represents a cluster centre and a chromosome of K genes represents a set of K cluster centres. The GA K-means algorithm steps are: 1) Initialization: randomly select K points as cluster centres (chromosomes) from the original dataset and apply k-means; 2) Selection: the chromosomes are selected according to a specific selection method; 3) Crossover: selected chromosomes are randomly paired with other parents for reproduction; 4) Mutation: apply the mutation operation to ensure diversity in the population; 5) Elitism: store the chromosome with the best fitness value in each generation; and 6) Iteration: go to step 2 until the variation of the fitness value among the best chromosomes is less than a specific threshold.
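A minimal sketch of these steps might look as follows. The selection scheme (truncation), mutation noise and stopping rule are illustrative choices of ours; in particular, the fitness-variation threshold is replaced here by a fixed number of generations:

```python
import numpy as np

def sse(X, centres):
    """Within-cluster sum of squared errors (lower = fitter chromosome)."""
    return ((X[:, None] - centres[None]) ** 2).sum(-1).min(axis=1).sum()

def ga_kmeans(X, k=3, pop_size=10, generations=30, mut_rate=0.1, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # 1) Initialization: each chromosome is a set of k centres drawn from X
    pop = [X[rng.choice(n, k, replace=False)].copy() for _ in range(pop_size)]
    best = min(pop, key=lambda c: sse(X, c))
    for _ in range(generations):
        fitness = np.array([sse(X, c) for c in pop])
        # 2) Selection: keep the fitter half as parents (truncation selection)
        parents = [pop[i] for i in np.argsort(fitness)[:pop_size // 2]]
        # 3) Crossover: randomly swap genes (centres) between parent pairs
        children = []
        while len(children) < pop_size:
            a, b = rng.choice(len(parents), 2, replace=False)
            mask = rng.random(k) < 0.5
            child = np.where(mask[:, None], parents[a], parents[b])
            # 4) Mutation: perturb one centre with small Gaussian noise
            if rng.random() < mut_rate:
                child[rng.integers(k)] += rng.normal(0, 0.1, d)
            children.append(child)
        pop = children
        # 5) Elitism: carry the best chromosome seen so far into the population
        cand = min(pop, key=lambda c: sse(X, c))
        if sse(X, cand) < sse(X, best):
            best = cand
        pop[0] = best.copy()
    # 6) Final assignment of samples to the best set of centres
    labels = ((X[:, None] - best[None]) ** 2).sum(-1).argmin(1)
    return best, labels
```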
2.2 PCA Data Regression
We apply the PCA regression technique on each cluster to generate a new dataset in the original feature space in terms of the selected principal components (PCs). PCA searches for the PCs that account for a large part of the total variance in the data and projects the data linearly onto new orthogonal bases defined by the PCs.
Consider a dataset X = {x_i, i = 1, 2, . . . , N}, x_i ∈ ℝ^d, with d attributes and N samples. The data is standardised so that the standard deviation and the mean of each column are 1 and 0, respectively. The PCs can be extracted by solving the following eigenvalue decomposition problem [7]:
λφ = Cφ, subject to ||φ||₂ = 1    (1)
where φ is an eigenvector and C is the covariance matrix. After solving equation (1), sort the eigenvalues in descending order, as larger eigenvalues give more significant PCs. Assume that the matrix Φ contains only a selected number of eigenvectors (PCs). The transformed data is computed by
X_tr = Φ^T X^T    (2)
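Equations (1) and (2) can be reproduced numerically with a standard eigendecomposition. The following small illustration uses synthetic data of our own:

```python
import numpy as np

# toy data: N = 100 samples with d = 3 attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0, 0], [0.5, 1, 0], [0, 0, 0.1]])

# standardise: zero mean and unit standard deviation per column
Z = (X - X.mean(0)) / X.std(0)

# eq. (1): eigenvalue decomposition of the covariance matrix, C phi = lambda phi
C = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)

# sort eigenvalues in descending order; larger eigenvalues give significant PCs
order = np.argsort(eigvals)[::-1]
Phi = eigvecs[:, order[:2]]   # keep only the top 2 principal components

# eq. (2): X_tr = Phi^T Z^T, each column is one projected sample
X_tr = Phi.T @ Z.T
```

Note that `np.linalg.eigh` already returns unit-norm eigenvectors, which satisfies the constraint ||φ||₂ = 1 in equation (1).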
3 Experiments

3.1 Dataset and Evaluation Criteria

3.2 Experimental Setup
The range of cluster sizes K, [36:72], produced better AUC results than smaller K. In addition, GA K-means generally performed better than standard K-means.
The FP and TP rates of the 3 data regression methods and Smote are compared in Figure 2. For Local PCA, we selected the 2 best cluster sizes from the range [36:72] for each classifier. Standard PCA operates similarly to Local PCA, but the clustering technique is not applied on the churn data. For all classifiers except C4.5, both types of PCA regression performed as well as Smote and better than the other methods; for C4.5 it is hard to conclude which method is better.
Fig. 2. ROC graph: Comparison of Local PCA, Standard PCA and Linear Regression

The overall results of the third experiment are illustrated in Figure 3. The figure presents the graphs of AUC against churn size for each classifier using Local PCA data generation, as it gave the best prediction results in experiment 2. From a churn size of 6000 onward, additional churn samples generated by PCA were added. SVM, NB and LR performed well with sizes from 6000 to 12000 but, as can be seen from Figure 3, they did not produce acceptable TP or FP rates beyond that.
In summary, the experiments showed that: 1) the clustering size K did produce different AUC results depending on the size; 2) Local PCA data regression performed better than Standard PCA and LiR; and finally 3) adding similar churn samples to the original data improved the TP rate for most of the classifiers. However, the FP rate exceeded 50% beyond a churn size of 12000. One reason for the high FP rate is the change in the decision boundaries: more non-churn samples fall inside the enlarged churn space, leading to a high number of incorrectly classified non-churn samples.
References
1. Au, W., Chan, C.C., Yao, X.: A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation 7, 532–545 (2003)
2. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 1145–1159 (1997)
3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. JAIR 16, 321–357 (2002)
4. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl. 6(1), 1–6 (2004)
5. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Kluwer Academic Publishers, Dordrecht (1989)
6. Huang, B.Q., Kechadi, M.-T., Buckley, B.: Customer churn prediction for broadband internet services. In: Pedersen, T.B., Mohania, M.K., Tjoa, A.M. (eds.) DaWaK 2009. LNCS, vol. 5691, pp. 229–243. Springer, Heidelberg (2009)
7. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (1986)
8. Wei, C., Chiu, I.: Turning telecommunications call details to churn prediction: a data mining approach. Expert Systems with Applications 23, 103–112 (2002)
9. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics 2(3), 408–421 (1972)