
Local PCA Regression for Missing Data Estimation in Telecommunication Dataset


T. Sato, B.Q. Huang, Y. Huang, and M.-T. Kechadi
School of Computer Science and Informatics, University College Dublin, Belfield,
Dublin 4, Ireland
bingquan.huang@ucd.ie, takeshi.sato@ucdconnect.ie

Abstract. The customer churn problem hugely affects telecommunication services in particular, and businesses in general. Note that in the majority of cases the number of potential churners is much smaller than the number of non-churners. Therefore, the imbalanced distribution of samples between churners and non-churners is a concern when building a churn prediction model. This paper presents a Local PCA approach to solve the imbalanced classification problem by generating new churn samples. The experiments were carried out on a large real-world telecommunication dataset and assessed on a churn prediction task. They showed that Local PCA, along with Smote, outperformed Linear Regression and Standard PCA as data generation techniques.

Keywords: PCA, Imbalanced Classification, Churn Prediction.

1 Introduction

Customer churn has become a serious problem for companies, mainly in the telecommunication industry. This is a result of recent changes in the industry, such as new services and the liberalisation of the market. In recent years, Data Mining techniques have emerged as one of the methods to tackle the customer churn problem [1,8].
The study of customer churn can be seen as a classification problem (churn and non-churn classes). The main goal is to build a robust classifier to predict potential churn customers. However, the imbalanced distribution of class samples is an issue in data mining, as it leads to poor classification results [4]. In this paper, we focus on overcoming this problem by increasing the number of churn samples with an over-sampling approach. The aim is to adjust the sample distribution by adding minority class samples, so that an optimal classifier can be built. Various sampling approaches have been proposed to counter non-heuristic sampling problems.
Synthetic Minority Over-sampling Technique (Smote) [3] generates artificial data along the lines between minority class samples and their K minority class nearest
neighbours. This causes the decision boundaries of the minority class space to spread further into the majority class space. An extension of Smote is the Smote + Edited Nearest Neighbour (ENN) approach [9], which removes more unnecessary samples and provides a more in-depth data cleaning.
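To make the interpolation concrete, the following is a minimal sketch of Smote-style generation, not the reference implementation of [3]; the function name and parameters are our own illustration. All new code examples in this paper are Python with NumPy.

```python
import numpy as np

def smote_sketch(X_min, k=5, n_new=100, seed=None):
    """Interpolate between minority samples and their k nearest
    minority-class neighbours to create synthetic samples."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-neighbours
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per sample
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                     # pick a minority sample
        j = nn[i, rng.integers(k)]              # pick one of its k neighbours
        gap = rng.random()                      # random point on the segment
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(new)
```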
The main idea of our approach is to form a new minority class space by generating minority class data using the K-means algorithm with PCA [7]. PCA reveals the internal structure of a dataset by extracting uncorrelated variables known as Principal Components (PCs). In this paper, we adopt local PCA data regression to generate a new dataset and add it to the raw data to change the distribution of class samples.
This paper is organised as follows: the next section outlines the proposed approach for the churn prediction task. Section 3 explains the experiments and the evaluation criteria. We conclude and highlight some key remarks in Section 4.

2 Approaches

Our proposed approach combines the PCA technique, the Genetic Algorithm (GA) and the K-means algorithm to generate new data for the minority class. First and foremost, the minority class dataset d_churn is formed from the original raw dataset d_raw. GA K-means clustering is applied on d_churn to form K clusters. The next step is to apply PCA regression on each cluster to transform it back to the original feature space in terms of the selected principal components. We believe that applying the regression locally avoids the inclusion of redundant information in the principal components, because of the lower variance within the clusters. The transformed data are then added to d_raw to improve the distribution of minority class samples. Finally, d_raw is used to build a churn prediction model for classification. Figure 1 shows the main steps of the approach, and a high-level sketch of this flow follows the figure caption.

Fig. 1. The description of the proposed approach
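As a reading aid, here is a compact sketch of the whole pipeline under our own naming; ga_kmeans and pca_regress are hypothetical helpers corresponding to the sketches in Sections 2.1 and 2.2.

```python
import numpy as np

def generate_minority_data(X_raw, y, K, pc_threshold):
    """Sketch of the proposed flow: cluster the churners with GA K-means,
    regress each cluster through local PCA, and append the reconstructed
    samples to the raw data (labels: 1 = churn, 0 = non-churn)."""
    X_churn = X_raw[y == 1]                         # minority class d_churn
    centres, labels = ga_kmeans(X_churn, K)         # Section 2.1
    synthetic = [pca_regress(X_churn[labels == k], pc_threshold)
                 for k in range(K)]                 # Section 2.2, per cluster
    X_new = np.vstack(synthetic)
    X_aug = np.vstack([X_raw, X_new])               # adjusted d_raw
    y_aug = np.concatenate([y, np.ones(len(X_new))])
    return X_aug, y_aug
```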

2.1 GA K-Means Clustering Algorithm

The standard K-means algorithm is sensitive to the initial centroids, and poor initial cluster centres lead to poor cluster formation. We employ a Genetic Algorithm (GA) [5] to avoid this sensitivity in centroid selection.

In the GA K-means algorithm, a gene represents a cluster centre and a chromosome of K genes represents a set of K cluster centres. The GA K-means steps (sketched in code below) are: 1) Initialization: randomly select K points as cluster centres (chromosomes) from the original dataset and apply K-means; 2) Selection: chromosomes are selected according to a specific selection method; 3) Crossover: selected chromosomes are randomly paired with other parents for reproduction; 4) Mutation: apply the mutation operation to ensure diversity in the population; 5) Elitism: store the chromosome with the best fitness value in each generation; and 6) Iteration: go to step 2 until the variation of the fitness value among the best chromosomes is less than a specific threshold.
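The following is a minimal sketch of these six steps. The paper does not fix the GA operators, so the negated within-cluster sum of squared errors as fitness, binary tournament selection, uniform gene crossover and Gaussian mutation are our assumptions.

```python
import numpy as np

def assign(X, C):
    """Index of the nearest centre for every sample."""
    return np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)

def ga_kmeans(X, K, pop=20, gens=50, tol=1e-4, seed=None):
    rng = np.random.default_rng(seed)
    # 1) Initialization: each chromosome is K random points drawn from X
    #    (the paper also refines chromosomes with K-means; omitted for brevity)
    P = [X[rng.choice(len(X), K, replace=False)].copy() for _ in range(pop)]
    fitness = lambda C: -((X - C[assign(X, C)]) ** 2).sum()  # negated SSE
    best, best_f, prev = None, -np.inf, None
    for _ in range(gens):
        f = np.array([fitness(C) for C in P])
        # 5) Elitism: keep the best chromosome seen so far
        i = int(f.argmax())
        if f[i] > best_f:
            best, best_f = P[i].copy(), f[i]
        # 6) Iteration: stop when the best fitness barely changes
        if prev is not None and abs(best_f - prev) < tol * abs(prev):
            break
        prev = best_f
        def pick():  # 2) Selection: binary tournament
            a, b = rng.integers(pop, size=2)
            return P[a] if f[a] > f[b] else P[b]
        children = []
        while len(children) < pop:
            p1, p2 = pick(), pick()
            # 3) Crossover: each gene (centre) comes from either parent
            mask = rng.random(K) < 0.5
            child = np.where(mask[:, None], p1, p2).copy()
            # 4) Mutation: occasionally perturb one centre
            if rng.random() < 0.2:
                child[rng.integers(K)] += rng.normal(0.0, 0.1, X.shape[1])
            children.append(child)
        P = children
    return best, assign(X, best)
```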
2.2 Linear PCA and Data Generation

We apply the PCA regression technique on each cluster to generate a new dataset in the original feature space in terms of the selected principal components (PCs). PCA searches for the PCs that account for a large part of the total variance in the data and projects the data linearly onto new orthogonal bases given by the PCs.
Consider a dataset $X = \{x_i,\ i = 1, 2, \ldots, N,\ x_i \in \mathbb{R}^d\}$ with attribute size $d$ and $N$ samples. The data is standardised so that the standard deviation and the mean of each column are 1 and 0, respectively. The PCs can be extracted by solving the following eigenvalue decomposition problem [7]:

$$\lambda \phi = C \phi, \quad \text{subject to } \|\phi\|_2 = 1 \tag{1}$$

where $\phi$ is an eigenvector and $C$ is the covariance matrix. After solving equation (1), the eigenvalues are sorted in descending order, as larger eigenvalues give more significant PCs. Assume that the matrix $\Gamma$ contains only a selected number of eigenvectors (PCs). The transformed data is computed by

$$X_{tr} = \Gamma^T X^T \tag{2}$$

From equation (2), the matrix $\hat{X}^T$ can be obtained by $\hat{X}^T = (\Gamma^T)^{-1} X_{tr}$. Finally, the matrix $\hat{X}^T$ is transposed again to get the matrix $X^{new}$. Since we standardised the data in the first step, the original standard deviation and the mean of each column must be restored in each $X^{new}_{ij}$. The newly generated data $X^{new}$ is then added to $d_{raw}$ to adjust the distribution of the samples. We continue this process until all clusters are transformed.
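A minimal sketch of this per-cluster regression follows, assuming a variance-ratio threshold selects the number of PCs and using $\Gamma$ itself for the back-projection (the pseudo-inverse of $\Gamma^T$, since the selected eigenvectors are orthonormal):

```python
import numpy as np

def pca_regress(cluster, var_threshold=0.9):
    """Project one cluster onto its leading PCs and map it back to the
    original feature space (equations (1)-(2)), restoring the original
    mean and standard deviation at the end."""
    mu, sigma = cluster.mean(axis=0), cluster.std(axis=0)
    sigma[sigma == 0] = 1.0                  # guard against constant columns
    Z = (cluster - mu) / sigma               # standardise: mean 0, std 1
    C = np.cov(Z, rowvar=False)              # covariance matrix
    lam, phi = np.linalg.eigh(C)             # eigen-decomposition, eq. (1)
    order = np.argsort(lam)[::-1]            # descending eigenvalues
    lam, phi = lam[order], phi[:, order]
    m = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_threshold)) + 1
    G = phi[:, :m]                           # Gamma: the selected PCs
    X_tr = G.T @ Z.T                         # eq. (2)
    Z_hat = (G @ X_tr).T                     # back-projection to feature space
    return Z_hat * sigma + mu                # restore original scale
```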
We run two data generation approaches based on PCA regression. The first approach applies PCA regression to all clusters (Local1). The second approach uses only the centroid of each cluster to form a dataset of centre points (Local2) and uses this dataset to extract the principal components, as sketched below.
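Under the same assumptions as the previous sketch, Local2 can be outlined as fitting a single basis on the centroid dataset and reusing it for every cluster; local2_regress is our hypothetical name.

```python
import numpy as np

def local2_regress(clusters, var_threshold=0.9):
    """Local2 sketch: extract the PCs from the dataset of cluster
    centroids, then reconstruct every cluster with that shared basis."""
    centres = np.vstack([c.mean(axis=0) for c in clusters])  # centre points
    C = np.cov(centres, rowvar=False)
    lam, phi = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]
    lam, phi = lam[order], phi[:, order]
    m = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_threshold)) + 1
    G = phi[:, :m]                                           # shared PCs
    # project each (centred) cluster onto G and back-project
    return [(c - c.mean(axis=0)) @ G @ G.T + c.mean(axis=0) for c in clusters]
```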

3 Experiments

3.1 Dataset and Evaluation Criteria

We randomly selected 139,000 customers from a real-world database provided by Eircom. The distribution of churn and non-churn is imbalanced: the training and testing data contain 6,000 (resp. 2,000) churners and 94,000 (resp. 37,000) non-churners. These datasets are described by 122 features, which are explained in [6].
We implement the Decision Tree C4.5 (DT), the SVM, Logistic Regression (LR) and Naive Bayes (NB) to build prediction models. We assessed these models with the following evaluation criteria: 1) the true churn rate (TP) is the ratio of churn that was classified correctly, and 2) the false churn rate (FP) is the ratio of non-churn that was incorrectly classified as churn. A solution is considered dominant when TP is high and FP is low. We use the Receiver Operating Characteristic (ROC) technique to evaluate the various learning algorithms; it shows how TP varies with FP. In addition, the Area Under the ROC Curve (AUC) [2] provides a single-number summary of the performance of a learning algorithm. We calculate the AUC with the FP threshold set to 0.5, as telecom companies are generally not interested in FP rates above 50%.
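As an illustration of this truncated AUC, here is a small sketch assuming scikit-learn's roc_curve; the exact truncation used in the paper is not specified, so the interpolation at FP = 0.5 is our assumption.

```python
import numpy as np
from sklearn.metrics import roc_curve

def auc_below_fp(y_true, y_score, max_fp=0.5):
    """Area under the ROC curve restricted to FP <= max_fp."""
    fp, tp, _ = roc_curve(y_true, y_score)
    keep = fp <= max_fp
    # close the region exactly at max_fp by linear interpolation
    fp_cut = np.append(fp[keep], max_fp)
    tp_cut = np.append(tp[keep], np.interp(max_fp, fp, tp))
    return np.trapz(tp_cut, fp_cut)
```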
3.2 Experimental Setup

The main objective of the experiments is to observe whether additional churner samples generated by PCA regression improve churn prediction results. We first examine the optimal cluster size of the GA K-means algorithm for local PCA regression by setting K in the range [4, 72]. The second experiment compares the prediction results of each classifier with the PCA regression from experiment 1 against Linear Regression (LiR), Standard PCA based data generation and Smote. The final experiment examines the main objective: the number of churners is increased from the original size of 6,000 up to 30,000 by setting the PC threshold to 0.9, 0.8, ..., 0.6. A new dataset is generated with both the Local1 and Local2 generation methods for all experiments.
3.3 Results and Discussion

The range of cluster sizes K in [36, 72] produced better AUC results than smaller K. In addition, GA K-means generally performed better than standard K-means. The FP and TP rates of the three data regression methods and Smote are compared in Figure 2. For local PCA, we selected the 2 best cluster sizes from the range [36, 72] for each classifier. Standard PCA operates similarly to local PCA, but the clustering technique is not applied to the churn data. For all classifiers, both types of PCA regression performed as well as Smote and better than the other methods, except for C4.5, for which it is hard to conclude which method is better.
Fig. 2. ROC graph: Comparison of Local PCA, Standard PCA and Linear Regression; panels: (a) Logistic Regression, (b) Decision Tree, (c) Naive Bayes, (d) Support Vector Machine

Fig. 3. AUC Plot

The overall results of the third experiment are illustrated in Figure 3, which presents the graphs of AUC against churn size for each classifier using Local1 data generation, as it gave the best prediction results in experiment 2. From the churn size of 6,000 onward, additional churn samples generated by PCA were added. The SVM, NB and LR performed well with sizes from 6,000 to 12,000, but they did not produce acceptable TP or FP rates afterwards, as can be seen from Figure 3.


In summary, the experiments showed that 1) the cluster size K did produce different AUC results depending on its value, 2) local PCA data regression performed better than Standard PCA and LiR, and finally 3) adding similar churn samples to the original data improved the TP rate for most of the classifiers. However, FP reached over 50% beyond 12,000 samples. One of the reasons for the high FP rate is the change in the decision boundaries: more non-churn samples inside the enlarged churn space lead to a high number of incorrectly classified non-churners.

4 Conclusion and Future Works

In this paper, we have designed a PCA regression method applied locally, in combination with the GA K-means algorithm, to generate churn class samples in order to solve the imbalanced classification problem.

The approach was tested on telecommunication data for a churn prediction task. The results showed that Local PCA, along with Smote, performs better than Standard PCA and LiR in general. Additional samples improve the TP rate for churn sizes in [6,000:12,000], but the FP rate then increases over 50%. Since we are more interested in identifying potential churners, as losing a client has a significant effect on a telecom company, the improvement in TP is a good result. Nevertheless, the FP rate must be limited, as a high FP rate can be expensive for future marketing campaigns. We are interested in understanding why additional churn samples give a high FP; one possibility is that the churn data generated with the various PC thresholds leads to poor classification in terms of FP.

References
1. Au, W., Chan, C.C., Yao, X.: A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation 7, 532–545 (2003)
2. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 1145–1159 (1997)
3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. JAIR 16, 321–357 (2002)
4. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl. 6(1), 1–6 (2004)
5. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Kluwer Academic Publishers, Dordrecht (1989)
6. Huang, B.Q., Kechadi, M.-T., Buckley, B.: Customer churn prediction for broadband internet services. In: Pedersen, T.B., Mohania, M.K., Tjoa, A.M. (eds.) DaWaK 2009. LNCS, vol. 5691, pp. 229–243. Springer, Heidelberg (2009)
7. Jolliffe, I.T.: Principal Components Analysis. Springer, Heidelberg (1986)
8. Wei, C., Chiu, I.: Turning telecommunications call details to churn prediction: a data mining approach. Expert Systems with Applications 23, 103–112 (2002)
9. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics 2(3), 408–421 (1972)
