
A data mining framework for fraud detection in telecom based on MapReduce

By
Mohammed Fahmi Kharma

May 31, 2011

Table of Contents

Introduction
Background
Related work
Contribution
General Objective
Specific Objectives
Scope of the work
The added value of our work
Methodology
Time table
References

Introduction
In recent years, the world has seen rapid growth in modern technology, especially in
telecommunications and the internet. In parallel with this development, fraud has increased
dramatically, causing losses estimated at billions of dollars worldwide every year. The Concise
Oxford Dictionary defines fraud as wrongful or criminal deception intended to result in
financial or personal gain.

MapReduce is a programming model and an associated implementation for processing and
generating large data sets. Users specify a map function that processes a key/value pair to
generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key [1]. The model uses two operations for
computation, map and reduce, a style common in functional programming languages; the map
operation is executed before the reduce operation. Each map operation applies a computation
to a key/value pair and produces one or more key/value pairs that are fed as input to the
reduce step. Each reduce operation receives the list of values that share the same key and
aggregates them into one or more values for that key.
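The map and reduce steps described above can be illustrated with a minimal, single-process sketch. It is not the distributed implementation from [1]; the record layout (caller, callee, duration) and the names `map_call`, `reduce_total`, and `run_mapreduce` are assumptions made only for this illustration.

```python
from collections import defaultdict


def map_call(key, record):
    """Map: turn one input pair into intermediate (caller, duration) pairs."""
    caller, _callee, duration = record
    yield caller, duration


def reduce_total(caller, durations):
    """Reduce: merge all durations that share the same caller key."""
    yield caller, sum(durations)


def run_mapreduce(inputs, mapper, reducer):
    """Run map, group intermediate values by key, then run reduce (in memory)."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Hypothetical input: (record offset, (caller, callee, duration in seconds)).
records = [
    (0, ("A1", "B7", 60)),
    (1, ("A2", "B3", 30)),
    (2, ("A1", "B9", 45)),
]
totals = run_mapreduce(records, map_call, reduce_total)
print(totals)
```

In a real MapReduce deployment the grouping step is performed by the framework across machines; here it is a dictionary so that the data flow stays visible.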

The MapReduce framework automatically parallelizes a program and executes it on a large cluster
of machines. The run-time system takes care of the details of partitioning the input data,
scheduling the program's execution across a set of machines, handling machine failures, and
managing the required inter-machine communication. This enables programmers with no experience
in parallel and distributed systems programming to easily utilize the resources of a large
distributed system [1].

Fig.1 - MapReduce overview, Jeffrey Dean and Sanjay Ghemawat [1]

Background
Telecommunications is a particularly interesting environment because it generates and stores a
huge amount of data, collected through its systems to record the company's operations and its
subscribers' activity. One such data source is the call detail record (CDR), which holds
information such as the A-number, B-number, duration, call path, and timestamps.
Mieke Jans et al. (2010) present an overview of how they see the different fraud classifications
and their relations to each other, shown in Figure 2. The most prominent classification is
internal versus external fraud, since all other classifications are situated within internal
fraud. They treat occupational fraud and abuse as equivalent to internal fraud. Figure 2 also
shows that all the remaining classifications apply only to corporate fraud. They further divide
internal fraud along three classifications: the first distinguishes statement fraud from
transaction fraud, the second is based on the occupational level of the fraudulent employee,
and the third distinguishes fraud for the company from fraud against the company [6].

Fig.2 - Fraud Classification Overview, Mert Sanver, Adem Karahoca [2]

Fraud can be defined as the dishonest or illegal use of services with the intention of avoiding
service charges. Fraud detection is the set of activities that identify unauthorized usage and
prevent losses for mobile network operators [2]. Telecommunication companies often lose revenue
because of customers' fraudulent behavior. There are different types of fraud in the
telecommunication business [3]. Shawe-Taylor et al. (2000) present six different fraud types:
subscription fraud, the manipulation of Private Branch Exchange (PBX) facilities or dial-through
fraud, free phone fraud, premium rate service fraud, handset theft, and roaming fraud [5].
In the fraud detection process, call detail records are processed to determine whether a fraud
attack has occurred and of which type, for example subscription fraud, premium rate service
fraud, or roaming fraud. In subscription fraud, a fraudster obtains a subscription using fake
personal information in order to register on the network and carry out fraudulent activity with
no intention of paying the bill or fees [2].
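As a rough illustration of how call detail records might be processed into per-subscriber features, the sketch below parses CDRs and aggregates them by A-number. The CSV layout, the column names, the sample numbers, and the three chosen features are assumptions for this example only; a real operator's CDR format will differ.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallDetailRecord:
    a_number: str          # calling party
    b_number: str          # called party
    duration_s: int        # call duration in seconds
    started_at: datetime   # call start timestamp


def parse_cdrs(text):
    """Parse CDRs from CSV text with a header row (assumed layout)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        CallDetailRecord(
            a_number=row["a_number"],
            b_number=row["b_number"],
            duration_s=int(row["duration_s"]),
            started_at=datetime.fromisoformat(row["started_at"]),
        )
        for row in reader
    ]


def subscriber_profile(cdrs):
    """Return {a_number: (call count, total seconds, distinct B-numbers)}."""
    acc = {}
    for cdr in cdrs:
        entry = acc.setdefault(
            cdr.a_number, {"calls": 0, "total_s": 0, "callees": set()}
        )
        entry["calls"] += 1
        entry["total_s"] += cdr.duration_s
        entry["callees"].add(cdr.b_number)
    return {
        a: (e["calls"], e["total_s"], len(e["callees"])) for a, e in acc.items()
    }


# Hypothetical sample data, not taken from any operator.
SAMPLE = """a_number,b_number,duration_s,started_at
0591000001,0591000002,120,2011-05-01T10:00:00
0591000001,0591000003,30,2011-05-01T10:05:00
0591000004,0591000002,45,2011-05-01T11:00:00
"""
profiles = subscriber_profile(parse_cdrs(SAMPLE))
print(profiles)
```

Profiles of this kind are one possible input to a detection algorithm; which features actually separate fraudulent from legitimate subscriptions is a question the research itself has to answer.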

Related work
Telecom fraud dates back to the early days of telecom companies. These companies spend a great
deal of money to reduce fraudsters' attacks and to stay competitive with other operators by
protecting themselves from significant losses that could weaken their ability to face their
competitors.
Many studies have addressed fraud detection and prevention; we review some of them here. Hamid
Farvaresh et al. (2011) aimed at identifying customers' subscription fraud in telecom by
combining SOM and K-means techniques in a hybrid approach consisting of preprocessing,
clustering, and classification phases, following the knowledge discovery process. Martin Häger
et al. (2011) applied general outlier detection and classification methods to the problem of
detecting fraudulent behavior in online advertisement metrics. Viaene et al. (2002) and Viaene
et al. (2004) addressed automobile insurance fraud detection by combining the advantages of
boosting with the explanatory power of the weight-of-evidence scoring framework in an AdaBoosted
naive Bayes model. A combination of neural networks and rules was used by Brause et al. (1999)
and Estévez et al. (2006). Mert Sanver et al. (2009) proposed the Adaptive Neuro-Fuzzy Inference
System (ANFIS) as a means of efficient fraud detection.

He et al. (1997) applied neural networks: a multi-layer perceptron network in the supervised
component of their study and Kohonen's self-organizing maps for the unsupervised part. Fawcett
et al. proposed an adaptive rule-based framework for fraud detection. Rosset et al. stated that
standard classification and rule generation were not appropriate for fraud detection. D.
Hawkins (1980) studied outliers in data, observations that are more likely to be suspicious than
regular, normally distributed data. Ramaswamy et al. (2000) extended outlier detection with a
method based on the distance of a point from its k-th nearest neighbor, building on the earlier
distance-based outlier work of Knorr et al. (2000).
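To make the k-th nearest neighbor idea concrete, the following brute-force sketch scores each point by the distance to its k-th nearest neighbor and reports the n highest-scoring points as outliers. It illustrates the ranking criterion only, not the efficient algorithms of the cited work; the function names and sample points are assumptions for this example.

```python
import math


def kth_neighbor_distance(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-th neighbor distance."""
    scores = kth_neighbor_distance(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Four points in a tight group and one far away (hypothetical data).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
outliers = top_outliers(points, k=2, n=1)
print(outliers)
```

The brute-force version compares every pair of points, so its cost grows quadratically with the number of points; that cost is the reason efficient variants matter for large data sets.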

Contribution
In the previous sections we mentioned various data mining techniques and how they can be used
for fraud detection. In our work, we focus on designing and implementing the first fraud
detection model for the telecom environment in a different domain, the MapReduce domain. We
will use commodity machines and networks to implement our model, which will be the first live
example of fraud detection using cloud computing. Our model will include an implementation of a
data mining algorithm; initially we have selected the K-means algorithm. We also expect our
model to operate in near-online mode to detect and classify fraud events, which will improve the
ability to detect subscription fraud events early and result in a major reduction in revenue
losses.

General Objective
The output of our research is the design and implementation of a model that uses data mining to
detect fraud cases in the telecom environment, where a huge volume of data has to be processed.
The model will run on a cloud computing infrastructure that we will build using MapReduce, the
most popular and powerful cloud computing framework. We will use data obtained from call detail
records (CDRs) in the billing repository, and the result is the subset of subscribers classified
as fraudulent subscriptions, produced in near-online mode. This will reduce the time needed to
detect fraud events and enhance the revenue assurance team's ability to identify fraudulent
cases efficiently.

Specific Objectives

- Collect the required data from a telecom operator.
- Identify the classification parameters required for the data mining process.
- Design a framework for fraud detection based on the MapReduce framework.
- Run the proposed framework on the data collected from the telecom operator and use the
  results to analyze and evaluate our work in terms of performance and classification of
  fraud events.

Scope of the work


Our research is concerned with telecommunication fraud, and we will take one telecommunication
operator as a case study. International Data Corporation has identified more than 200 forms of
telecommunication fraud [12]. Throughout our research we will focus on one specific category,
subscription fraud in telecom.

The added value of our work


Our framework is expected to add value in the following ways:
- It is the first design and implementation of a fraud detection model for the telecom
  environment based on the MapReduce framework.
- It will deliver near-online results, which will improve the ability to detect subscription
  fraud events early and result in a major reduction in revenue losses.
- It will increase trust in the telecom operator that uses our system by sparing the company
  many fraudulent attacks.

Methodology
We plan to build a fraud detection and data mining environment for the telecom sector on top of
the MapReduce framework. As mentioned earlier, MapReduce allows its users to parallelize
programs and execute them on a large cluster of machines by partitioning the input data and
scheduling the program's execution over a set of machines. Given the large volume of data
generated every day, we selected MapReduce to help us build the distributed environment for our
framework.

Our framework will use fraud detection and data mining algorithms whose implementations are
adapted to MapReduce, so that they run in parallel on a cluster of machines. We plan to use
SunGrid clusters to build our own distributed environment, since SunGrid is open source and
free to use; alternatively, we can use the infrastructure of a cloud computing vendor such as
Amazon or Google.
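As one illustration of what adapting an algorithm to MapReduce can look like, the sketch below expresses a single K-means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). It runs in one process and is only a sketch under assumed names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) and toy data; it is not the framework's actual design.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans_map(point, centroids):
    """Map: emit (cluster id, point) for the nearest centroid."""
    yield nearest(point, centroids), point


def kmeans_reduce(cluster_id, members):
    """Reduce: the new centroid is the coordinate-wise mean of the members."""
    n = len(members)
    dims = len(members[0])
    return cluster_id, tuple(sum(p[d] for p in members) / n for d in range(dims))


def kmeans_iteration(points, centroids):
    """One map/group/reduce pass; an empty cluster keeps its old centroid."""
    groups = defaultdict(list)
    for point in points:
        for cluster_id, p in kmeans_map(point, centroids):
            groups[cluster_id].append(p)
    updated = list(centroids)
    for cluster_id, members in groups.items():
        _, updated[cluster_id] = kmeans_reduce(cluster_id, members)
    return updated


def kmeans(points, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical two-feature subscriber profiles and initial centroids.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Each iteration would correspond to one MapReduce job on a cluster, with the current centroids distributed to every mapper; the outer loop stays in a driver program.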

Our framework will implement at least one classification algorithm. The algorithm or algorithms
will be used to build a model from a set of training data and to detect subscription fraud
cases; the model is subsequently used to classify new data entering the system. We will try to
implement more than one algorithm in order to compare their results and performance in the
MapReduce environment. The K-means algorithm has been selected as the starting point. Our
thesis work is organized as follows:

- Prepare all the research material that will be used in our thesis, taking advantage of
  related work not necessarily in telecom but possibly in other fields.
- Design our fraud detection model for the MapReduce domain.
- Identify and extract the top N factors on which we will build our fraud detection data mining
  model, along with any other parameter or rule that can help us detect fraud events.
- Prepare the dataset, clean it (missing values, etc.), and divide it into a training dataset
  and a testing dataset.
- Set up the cloud infrastructure, including the MapReduce framework and SunGrid clusters, to
  build our own distributed environment.
- Implement the K-means algorithm in our framework as the starting point of our work.
- Test and validate the data mining results.
- Stress test our framework against datasets of various volumes and monitor its behavior.
- Refine our framework and make the necessary adjustments.
- Final review: complete and submit the final report.
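The dataset preparation step in the plan above (cleaning missing values, then dividing the data into training and testing sets) could be sketched as follows. The field names, the 25% test fraction, the fixed seed, and the policy of dropping incomplete rows are assumptions for illustration; other cleaning policies, such as imputing missing values, are equally possible.

```python
import random


def clean(rows, required):
    """Keep only rows in which every required field is present and non-empty."""
    return [
        row for row in rows
        if all(row.get(field) not in (None, "") for field in required)
    ]


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle reproducibly, then split into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical subscriber profiles; S2 and S5 have missing values.
rows = [
    {"subscriber": "S1", "calls": 12, "total_s": 900},
    {"subscriber": "S2", "calls": None, "total_s": 40},
    {"subscriber": "S3", "calls": 3, "total_s": 120},
    {"subscriber": "S4", "calls": 45, "total_s": 5000},
    {"subscriber": "S5", "calls": 7, "total_s": ""},
    {"subscriber": "S6", "calls": 9, "total_s": 610},
]
cleaned = clean(rows, required=("subscriber", "calls", "total_s"))
training, testing = train_test_split(cleaned)
print(len(cleaned), len(training), len(testing))
```

Fixing the seed keeps the split reproducible between runs, which helps when comparing the results of more than one algorithm on the same data.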

Time table
Number | Tasks | Time Period
1 | Prepare all the research material that will be used in our thesis, taking advantage of related work not necessarily in telecom but possibly in other fields | 4 weeks
2 | Design the proposed framework for fraud detection based on the MapReduce framework, identifying the parameters required for the data mining process | 2 weeks
3 | Gather specific requirements, if any, especially those related to MapReduce setup, required data, programming language, and supporting technology | 2 weeks
4 | Set up the MapReduce and SunGrid environment as a distributed environment | 2 weeks
5 | Implementation of our detection framework, including coding | 4 weeks
6 | Phase one full testing and bug fixing | 3 weeks
7 | Optimization and stress test | 2 weeks
8 | Phase two testing | 2 weeks
9 | Final review: complete and submit | 3 weeks

References
1- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
Google, Inc.
2- Mert Sanver, Adem Karahoca. Fraud Detection Using an Adaptive Neuro-Fuzzy
Inference System in Mobile Telecommunication Networks.
3- Shawe-Taylor, J., Howker, K., Burge, P. Detection of Fraud in Mobile
Telecommunications. Information Security Technical Report 4 (1), 16-28.
4- Shawe-Taylor, J., Howker, K., Gosset, P., Hyland, M., Verrelst, H., Moreau, Y., et al.
Novel techniques for profiling and fraud in mobile telecommunication. In: Lisboa, P.J.G.,
Edisbury, B., Vellido, A. (Eds.), Business Applications of Neural Networks. The
State-of-the-Art of Real World Applications. World Scientific, Singapore, pp. 113-139.
5- Hamid Farvaresh, Mohammad Mehdi Sepehri, 2011. A data mining framework for
detecting subscription fraud in telecommunication.
6- Mieke Jans, Nadine Lybaert and Koen Vanhoof, 2011. Framework for Internal Fraud
Risk Reduction at IT Integrating Business Processes: The IFR Framework.
7- Martin Häger, Torsten Landergren, 2010. Implementing best practices for
fraud detection on an online advertising platform.
8- Viaene, S., Derrig, R., Baesens, B. & Dedene, G. (2002). A Comparison of
State-of-the-Art Classification Techniques for Expert Automobile Insurance Claim Fraud Detection.
9- Viaene, S., Derrig, R. & Dedene, G. (2004). A Case Study of Applying Boosting Naive
Bayes to Claim Fraud Diagnosis. IEEE Transactions on Knowledge and Data
Engineering.
10- Fawcett, T. and Provost, F. (1997). Adaptive fraud detection. Journal of Data Mining and
Knowledge Discovery 1(3).
11- Rosset, S., Murad, U., Neumann, E., Idan, Y. and Pinkas, G. (1999). Discovery of fraud
rules for telecommunications: challenges and solutions. Proceedings of the Fifth ACM
SIGKDD International.
12- Ojuka Nelson, 2009. Detection of Subscription Fraud in
Telecommunications Using Decision Tree Learning.
13- D. Hawkins, 1980. Identification of outliers. Chapman and Hall, Reading, London.
14- R. Ng, E. Knorr and T. Tucakov. Distance-based outliers: Algorithms and Applications,
vol. 8, no. 3, pp. 237-253, 2000.
15- S. Ramaswamy, R. Rastogi and K. Shim. Efficient algorithms for mining outliers
from large data sets. SIGMOD '00, 2000.
