on MapReduce
By
Mohammed Fahmi Kharma
Table of Contents
Introduction
Background
Related work
Contribution
General Objective
Specific Objectives
Scope of the work
The added value of our work
Methodology
Time table
References
Introduction
In recent years, the world has seen rapid growth and expansion in modern technology,
especially in telecommunications and the internet. In parallel with this development, fraud
events are increasing dramatically, causing major losses estimated at billions of dollars
worldwide every year. According to the Concise Oxford Dictionary, fraud is a wrongful or
criminal deception intended to result in financial or personal gain.
Background
Telecommunication is a particularly interesting environment because it generates and stores
a huge amount of data, collected through its systems to record and reflect the company's
operations and its subscribers' activity. One source of this data is the call detail record (CDR),
which holds information such as A-number, B-number, duration, call path, and timestamps.
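To make the CDR fields concrete, the following is a minimal sketch of how one record could be represented and parsed. The comma-separated layout, the field order, the timestamp format, the call path notation, and the sample numbers are all assumptions made for illustration; real CDR formats differ between vendors and operators.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallDetailRecord:
    a_number: str         # calling party
    b_number: str         # called party
    start_time: datetime  # timestamp of the call start
    duration_s: int       # call duration in seconds
    call_path: str        # route taken by the call (notation assumed)


def parse_cdr(line: str) -> CallDetailRecord:
    """Parse one comma-separated CDR line in the assumed field order."""
    a_number, b_number, start, duration, call_path = line.strip().split(",")
    return CallDetailRecord(
        a_number=a_number,
        b_number=b_number,
        start_time=datetime.strptime(start, "%Y-%m-%d %H:%M:%S"),
        duration_s=int(duration),
        call_path=call_path,
    )


# Made-up sample line, not real subscriber data.
record = parse_cdr("0591000001,0592000002,2012-03-01 08:15:00,125,MSC1>GMSC2")
print(record.a_number, record.duration_s)
```

Keeping the record immutable (`frozen=True`) is a design choice for this sketch: parsed records can then be passed safely between processing steps without being modified.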
Mieke Jans et al. (2010) presented an overview of how they see the different fraud
classifications and their relations to each other, shown in Figure 2. The most prominent
classification is internal versus external fraud, since all other classifications are situated
within internal fraud. As already pointed out, we see occupational fraud and abuse as an
equivalent of internal fraud. Figure 2 also shows that all remaining classifications apply only to
corporate fraud. They also classified internal fraud in three different ways. The first
differentiates between statement fraud and transaction fraud. The second is based upon the
occupation level of the fraudulent employee. The third is a classification for fraud against
the company [6].
Fraud activity can be defined as a dishonest or illegal use of services with the intention of
avoiding service charges. Fraud detection is the set of activities that identify unauthorized
usage and prevent losses for mobile network operators [2]. Telecommunication companies often
lose revenue to customers' fraudulent behavior. There are different types of fraud in
the telecommunication business [3]. Shawe-Taylor et al. (2000) present six different fraud types:
subscription fraud, the manipulation of Private Branch Exchange (PBX) facilities or dial-through
fraud, free phone fraud, premium rate service fraud, handset theft, and roaming fraud [5].
In the fraud detection process, call detail records are processed to determine the fraud attack
and its type, whether subscription fraud, premium rate service fraud, or roaming fraud. In
subscription fraud, a fraudster obtains a subscription using fake personal information in order
to register on the network and carry out fraudulent activity with no intention of paying the
bill or fees [2].
Related work
The history of telecom fraud extends back to the early days of telecom companies. These
companies spend a lot of money to reduce fraudsters' attacks and to stay competitive with
other operators, protecting themselves from significant losses caused by fraudsters that
may affect their ability to face their competitors.
Many studies have been carried out on fraud detection and prevention; we look at some of them
here. Hamid Farvaresh et al. (2011) aimed at identifying customers' subscription fraud in
telecom by employing combined SOM and K-means techniques through a hybrid approach
consisting of preprocessing, clustering, and classification phases, adopting the knowledge
discovery process. MARTIN HGER et al. (2011) applied general outlier detection and
classification methods to the problem of detecting fraudulent behavior in online advertisement
metrics. Viaene et al. (2004) and Viaene et al. (2002) used an AdaBoosted naive Bayes scoring
framework for automobile insurance fraud detection, combining the advantages of boosting with
the explanatory power of the weight of evidence. A combination of neural networks and rules
was used by Brause et al. (1999) and Estévez et al. (2006). Mert Sanver et al. (2009) offer
the Adaptive Neuro-Fuzzy Inference System (ANFIS) as a means of efficient fraud detection.
He et al. (1997) applied neural networks: a multi-layer perceptron network in the supervised
component of their study and Kohonen's self-organizing maps for the unsupervised part. Fawcett
et al. proposed an adaptive rule-based framework for fraud detection. Rosset et al. stated
that standard classification and rule generation were not appropriate for fraud detection. D.
Hawkins (1980) studied data outliers, which are likely to be more suspicious than regular,
normally distributed data. R. Rastogi, S. Ramaswamy et al. (2000) extended the outlier
method to one based on the distance of a point from its k-th nearest neighbor, building on the
earlier distance-based outlier work by R. Ng, E. Knorr et al. (2000).
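As an illustration of the distance-based idea, the sketch below scores each point by the distance to its k-th nearest neighbor and reports the highest-scoring points as outlier candidates. This is a brute-force version written for clarity under our own assumptions (Euclidean distance, the function names, the toy feature values); it is not the optimized algorithm from the cited work.

```python
import math


def kth_neighbor_distance(points, index, k):
    """Distance from points[index] to its k-th nearest other point."""
    target = points[index]
    distances = sorted(
        math.dist(target, other)
        for i, other in enumerate(points)
        if i != index
    )
    return distances[k - 1]


def top_outliers(points, k, n):
    """Indices of the n points with the largest k-th neighbor distance."""
    scores = [(kth_neighbor_distance(points, i, k), i) for i in range(len(points))]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]


# Toy data: four similar usage profiles and one that is far from the rest.
profiles = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 9.0)]
print(top_outliers(profiles, k=2, n=1))
```

The brute-force scan compares every pair of points, so it only suits small samples; on large CDR volumes the same scoring would need partitioning or pruning.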
Contribution
In the previous sections we mentioned various data mining techniques and how they can be
used to enable fraud detection. In our work, we focus on designing and implementing the first
fraud detection model for the telecom environment in a different domain, the MapReduce domain.
We will use commodity machines and networks to implement our model, which will be the first
live example of fraud detection using cloud computing. Our model will include an implementation
of a data mining algorithm; initially we selected the K-means algorithm. We also expect our
model to operate in near-online mode to detect and classify fraud events. This will enhance the
ability to detect subscription fraud events early and result in a major reduction in revenue
losses.
General Objective
The output of our research is the design and implementation of a model that uses data
mining to detect fraud cases targeting the telecom environment, where a huge volume of data
must be processed. The model will run on a cloud computing infrastructure that we will build
using the popular and powerful cloud computing framework MapReduce. We will use data obtained
from call detail records (CDR) in the billing repository, and the result is the subset of
subscribers classified as fraudulent subscriptions in near-online mode. This will help reduce
the time needed to detect fraud events and enhance the revenue assurance team's ability to
identify fraudulent cases efficiently.
Specific Objectives
Methodology
We plan to build a fraud detection and data mining environment for the telecom
sector. Our framework will be built on top of the MapReduce framework. As mentioned earlier,
MapReduce allows its users to parallelize and execute programs on a large cluster
of machines by partitioning the input data and scheduling the program's execution across a
set of machines. Given the large volume of data generated every day, we selected MapReduce
to help us build the distributed environment for our framework.
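The map, shuffle, and reduce phases can be sketched in a single process as follows. This is only an illustration of the programming model, not a distributed implementation: the three-field CDR layout, the function names, and the sample lines are assumptions, and on a real cluster the framework would partition the input and run the mapper and reducer on different machines.

```python
from collections import defaultdict


def map_cdr(line):
    """Map step: emit (A-number, duration in seconds) for one CDR line."""
    a_number, _b_number, duration = line.split(",")
    yield a_number, int(duration)


def reduce_durations(a_number, durations):
    """Reduce step: summarize all calls made by one subscriber."""
    return a_number, {"calls": len(durations), "total_seconds": sum(durations)}


def run_job(lines, mapper, reducer):
    """Run map, shuffle (group values by key), and reduce in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Made-up CDR lines: A-number, B-number, duration in seconds.
cdr_lines = ["A1,B7,60", "A2,B3,30", "A1,B9,240"]
usage = run_job(cdr_lines, map_cdr, reduce_durations)
print(usage)
```

Because each mapper call looks at one line and each reducer call looks at one key, both steps can be spread over many machines without changing the two functions.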
Our framework will use fraud detection and data mining algorithms with implementations adapted
to the MapReduce framework, so that they run in parallel on a cluster of machines. We plan to
use SunGrid clusters to build our own distributed environment, as SunGrid is open source and
free to use; alternatively, we can use the infrastructure of a cloud computing vendor such as
Amazon or Google.
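Taking K-means, the algorithm selected as the starting point for this work, as an example, one common way to adapt it to the MapReduce model is sketched below: the map step assigns each point to its nearest centroid and the reduce step averages the points of each cluster. This is a single-process sketch under our own assumptions (Euclidean distance, two-dimensional toy features, hypothetical function names), not the final design. Note that K-means is a clustering method; here a new record is labeled simply by its nearest centroid.

```python
import math
from collections import defaultdict


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans_map(point, centroids):
    """Map step: emit (cluster index, point)."""
    return nearest_centroid(point, centroids), point


def kmeans_reduce(points):
    """Reduce step: the new centroid is the mean of the assigned points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce round; empty clusters keep their old centroid."""
    groups = defaultdict(list)
    for point in points:
        cluster, value = kmeans_map(point, centroids)
        groups[cluster].append(value)
    return [
        kmeans_reduce(groups[c]) if groups[c] else centroids[c]
        for c in range(len(centroids))
    ]


# Toy subscriber features, e.g. (calls per day, average call minutes).
points = [(1.0, 2.0), (3.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = [(0.0, 0.0), (11.0, 11.0)]
for _ in range(10):  # repeat rounds until the centroids stop moving
    updated = kmeans_iteration(points, centroids)
    if updated == centroids:
        break
    centroids = updated

new_label = nearest_centroid((11.5, 9.0), centroids)
print(centroids, new_label)
```

Each round is one MapReduce job, so the number of rounds until convergence directly drives the running time on a cluster.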
Our framework will implement at least one classification algorithm. The algorithm(s) will be
used to detect subscription fraud cases and to build a model from a set of training data. This
model is subsequently used to classify new data entering the system. We will try to implement
more than one algorithm to compare their results and performance in the MapReduce environment.
Initially, the K-means algorithm has been selected as the starting point for our framework.
We are organizing the work in our thesis as follows:
1. Prepare all the research required for our thesis, taking advantage of related work, not
necessarily in telecom but possibly in other fields.
2. Design our fraud detection model based on the MapReduce domain.
3. Identify and extract the top N factors on which we will build our fraud detection data
mining model, along with any other parameter or rule that can help us detect fraud events.
4. Prepare the dataset, perform data cleaning (missing values, etc.), and divide the
main dataset into a training dataset and a testing dataset.
5. Set up the cloud infrastructure, including the MapReduce framework and SunGrid clusters, to
build our own distributed environment.
6. Adopt the K-means algorithm, selected as the starting point of our work, in our
framework.
7. Test the data mining results and validate them.
8. Perform stress tests of our framework against datasets of various volumes and
monitor its behavior.
9. Refine our framework and make any necessary changes.
10. Complete the final review and submit the final report.
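The dataset preparation step in the plan above (cleaning missing values, then dividing the data into training and testing sets) could look roughly like the following sketch. The field names, the rule of dropping incomplete rows, the 75/25 split, and the sample rows are assumptions made for illustration; other cleaning strategies, such as imputing missing values, are equally possible.

```python
import random


def drop_incomplete(rows, required):
    """Keep only rows where every required field is present and non-empty."""
    return [
        row for row in rows
        if all(row.get(field) not in (None, "") for field in required)
    ]


def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and split off the test fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Made-up per-subscriber rows; the third one is missing its duration.
rows = [
    {"a_number": "A1", "calls": 12, "total_seconds": 900},
    {"a_number": "A2", "calls": 3, "total_seconds": 120},
    {"a_number": "A3", "calls": 7, "total_seconds": None},
    {"a_number": "A4", "calls": 40, "total_seconds": 7200},
    {"a_number": "A5", "calls": 5, "total_seconds": 300},
]
clean = drop_incomplete(rows, required=("a_number", "calls", "total_seconds"))
train, test = train_test_split(clean, test_fraction=0.25, seed=7)
print(len(clean), len(train), len(test))
```

Fixing the seed keeps the split reproducible, which matters when results from different algorithms are compared on the same data.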
Time table
Number | Tasks | Time Period
Time periods, in the order listed: 4 weeks, 2 weeks, 2 weeks, 2 weeks, 4 weeks, 3 weeks,
2 weeks, 2 weeks, 3 weeks
References
1- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
Google, Inc.
2- Mert Sanver, Adem Karahoca. Fraud Detection Using an Adaptive Neuro-Fuzzy
Inference System in Mobile Telecommunication Networks.
3- Shawe-Taylor, J., Howker, K., Burge, P. Detection of Fraud in Mobile
Telecommunications. Information Security Technical Report 4 (1), 16-28.
4- Shawe-Taylor, J., Howker, K., Gosset, P., Hyland, M., Verrelst, H., Moreau, Y., et al.
Novel techniques for profiling and fraud in mobile telecommunication. In: Lisboa, P.J.G.,
Edisbury, B., Vellido, A. (Eds.), Business Applications of Neural Networks. The
State-of-the-Art of Real World Applications. World Scientific, Singapore, pp. 113-139.
5- Hamid Farvaresh, Mohammad Mehdi Sepehri, 2011. A data mining framework for
detecting subscription fraud in telecommunication.
6- Mieke Jans, Nadine Lybaert and Koen Vanhoof, 2011. Framework for Internal Fraud
Risk Reduction at IT Integrating Business Processes: The IFR Framework.