
Presentation On

Distributed Data Mining in Credit Card Fraud Detection

INTRODUCTION
Data: Data are any facts, numbers, or text that can be processed by a computer, e.g. sales, cost, inventory, and forecast figures.

Information: The patterns, associations, or relationships among all this data can provide information, e.g. analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

Knowledge: Information can be converted into knowledge about historical patterns and future trends. E.g. summary information on retail supermarket sales can be analysed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

What is Data Mining?

Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information that can be used to increase productivity. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

How Does Data Mining Work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyses relationships and patterns in stored transaction data based on open-ended user queries.

Generally, any of four types of relationships are sought:

- Classes: Stored data is used to locate data in predetermined groups. E.g. a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order.
- Clusters: Data items are grouped according to logical relationships or consumer preferences. E.g. data can be mined to identify market segments or consumer affinities.
- Associations: Data is mined to identify items or events that tend to occur together.
- Sequential Patterns: Data is mined to anticipate behaviour patterns and trends. E.g. an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data Mining Consists of Five Major Elements:

- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyse the data by application software.
- Present the data in a useful format, such as a graph or table.

Data Mining Process

- Understand the application domain
- Collect and create the target dataset
- Clean and transform the target dataset
- Select features, reduce dimensions
- Apply data mining algorithms
- Interpret, evaluate, and visualize patterns

DISTRIBUTED DATA MINING


Continuous developments in information and communication technology have recently led to distributed computing environments, which comprise several different sources of large volumes of data and several computing units. The most prominent example of a distributed environment is the Internet, where more and more databases and data streams appear, covering areas such as meteorology, oceanography, and the economy.

Distributed Data Mining (DDM) is concerned with the application of the classical Data Mining procedure in a distributed computing environment trying to make the best of the available resources (communication network, computing units and databases). Data Mining takes place both locally at each distributed site and at a global level where the local knowledge is fused in order to discover global knowledge.

The first phase normally involves the analysis of the local database at each distributed site. Then, the discovered knowledge is usually transmitted to a merger site, where the integration of the distributed local models is performed. The results are transmitted back to the distributed databases, so that all sites become updated with the global knowledge. Distributed databases may have homogeneous or heterogeneous schemata. In the former case every site describes its data with the same attributes; in the latter case the attributes differ among the distributed databases. In certain applications a key attribute might be present in the heterogeneous databases, which allows tuples to be associated across sites. In other applications the target attribute for prediction might be common across all distributed databases.
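
The local-then-global flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real DDM system: the function names `train_local_model` and `merge_models`, the single "amount" feature, and all sample data are hypothetical. Majority voting is just one simple way to fuse local models, and the sketch assumes fraudulent amounts are larger than legitimate ones.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A record is (transaction_amount, is_fraud). All data here is made up.
Record = Tuple[float, bool]
Model = Callable[[float], bool]


def train_local_model(records: List[Record]) -> Model:
    """Phase 1: learn an amount threshold using only one site's local data."""
    fraud = [amount for amount, label in records if label]
    legit = [amount for amount, label in records if not label]
    # Midpoint between the mean legitimate and the mean fraudulent amount.
    threshold = (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2
    return lambda amount: amount > threshold


def merge_models(models: List[Model]) -> Model:
    """Phase 2: the merger site fuses the local models by majority vote."""
    def global_model(amount: float) -> bool:
        votes = Counter(model(amount) for model in models)
        return votes[True] > votes[False]
    return global_model


# Three sites, each with its own local database; raw records are never pooled.
site_data = [
    [(20.0, False), (35.0, False), (900.0, True)],
    [(15.0, False), (60.0, False), (700.0, True)],
    [(50.0, False), (80.0, False), (1200.0, True)],
]

local_models = [train_local_model(records) for records in site_data]
global_model = merge_models(local_models)  # would be sent back to every site

print(global_model(25.0))    # a small purchase
print(global_model(1000.0))  # a large purchase
```

Only the local models travel to the merger site in this sketch, which is the point of the approach: communication cost stays low and the raw transaction data never leaves its site.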

One trend of recent years is the implementation of DDM systems using emerging distributed computing paradigms such as Web services, and the application of DDM algorithms in emerging distributed environments such as mobile networks, sensor networks, grids, and peer-to-peer networks.

ASPECTS OF DATA MINING



- Uncertainty handling
- Dealing with missing values
- Dealing with noisy data
- Efficiency of the algorithm used
- Constraining discovered knowledge to only useful or interesting knowledge
- Size and complexity of data
- Data selection
- Understandability of discovered knowledge
- Consistency between data and discovered knowledge

LOSSES DUE TO FRAUD

Bank-wise Cyber Fraud Data


ICICI Bank customers have been the biggest victims of cyber fraud. Over the four years from 2009 to 2012, ICICI Bank alone reported 34,918 cases amounting to 74.25 crore rupees. American Express ranked second by value, with cyber frauds over the same four years amounting to 26 crore rupees, roughly a third of ICICI Bank's total.

Citibank came third, reporting 24 crore rupees' worth of cyber frauds, followed by Axis (15.9 crore) and HSBC (13.8 crore).

Credit Card & Debit Card Fraud Statistics World Wide


- Between July 2005 and mid-January 2007, a breach of systems at TJX Companies exposed data from more than 45.6 million credit cards. Albert Gonzalez is accused of being the ringleader of the group responsible for the thefts.
- In August 2009, Gonzalez was also indicted for the biggest known credit card theft to date: information from more than 130 million credit and debit cards was stolen at Heartland Payment Systems, the retailers 7-Eleven and Hannaford Brothers, and two unidentified companies.
- In 2012, about 40 million sets of payment card information were compromised by a hack of Adobe Systems.
- In July 2013, press reports indicated that four Russians and a Ukrainian were indicted in New Jersey for what was called the largest hacking and data breach scheme ever prosecuted in the United States.
- Between Nov. 27, 2013 and Dec. 15, 2013, a breach of systems at Target Corporation exposed data from about 40 million credit cards. The information stolen included names, account numbers, expiry dates, and card security codes.
- From 16 July to 30 October 2013, a hacking attack compromised about a million sets of payment card data stored on computers at Neiman Marcus.

Largest Credit Card Data Breaches

Country                 Cardholders Affected (Overall)   Cardholders Affected (Last 5 Years)
United States           42%                              37%
Mexico                  44%                              37%
United Arab Emirates    36%                              33%
United Kingdom          34%                              31%
Brazil                  33%                              30%
Australia               31%                              30%
China                   36%                              27%
India                   37%                              27%
Singapore               26%                              23%
Italy                   24%                              22%
South Africa            25%                              20%
Canada                  25%                              19%
France                  20%                              18%
Indonesia               18%                              14%
Sweden                  12%                              11%
Germany                 13%                              10%

APPLICATION OF DATA MINING


- Financial Data Analysis: e.g. loan payment prediction, customer credit policy
- Retail Industry: e.g. sales, customer, product, region, effectiveness of sales campaigns, customer loyalty. It enables companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics.
- Telecom Industry
- Biological Industry: identifying co-occurring gene sequences and linking genes to different stages of disease development
- Scientific Application: accumulation of huge volumes of high-dimensional data, stream data, and heterogeneous data
- Intrusion Detection


Modeling Strategies
Data mining strategies fall into two broad categories: supervised learning and unsupervised learning.

Supervised learning
applies when there exists a target variable with known values, about which predictions will be made using the values of other variables as input.

Unsupervised learning
applies when no target variable with known values exists, but input variables do.

Modeling Objectives and Data Mining Techniques


Modeling Objective: Prediction
Supervised: Regression and Logistic Regression; Neural Networks; Decision Trees
Note: Targets can be binary, interval, nominal, or ordinal.
Unsupervised: Not feasible

Prediction algorithms determine models or rules to predict continuous or discrete target values for given input data. For example, a prediction problem could attempt to predict the value of the S&P 500 Index, given some input data such as a sudden change in a foreign exchange rate.
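
As a small illustration of prediction with regression, the following sketch fits a straight line by ordinary least squares and uses it to predict a continuous target. The data is invented for the example (a hypothetical history of exchange-rate changes against index changes), and `fit_line` is an illustrative helper, not part of any library.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: exchange-rate change (%) vs. index change (%).
fx_change = [-2.0, -1.0, 0.0, 1.0, 2.0]
index_change = [1.1, 0.4, 0.0, -0.6, -0.9]

slope, intercept = fit_line(fx_change, index_change)

# Predict the index change for a sudden 1.5% move in the exchange rate.
predicted = slope * 1.5 + intercept
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```

A real model would use many inputs and far more history; the single-input case just makes the fitting step easy to follow.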

Modeling Objectives and Data Mining Techniques


Modeling Objective: Classification
Supervised: Decision Trees; Neural Networks; Discriminant Analysis
Note: Targets can be binary, nominal, or ordinal.
Unsupervised: Clustering (K-means, etc.); Neural Networks; Self-Organizing Maps (Kohonen Networks)

Classification algorithms determine models to predict discrete values for given input data. A classification problem might involve trying to determine whether a transaction represents fraudulent behavior based on indicators such as the type of establishment at which the purchase was made, the time of day the purchase was made, and the amount of the purchase.
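
To make the fraud classification idea concrete, here is a minimal sketch of a decision stump, i.e. a decision tree with a single split. It is a deliberately simplified stand-in for the full decision trees listed above. The two features (amount and hour of day), the labelled transactions, and the helper names `fit_stump` and `predict` are all assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple[float, int]  # (amount, hour_of_day)

# Made-up labelled transactions; True means the transaction was fraudulent.
X: List[Row] = [(12.0, 14), (40.0, 10), (25.0, 19), (950.0, 3), (700.0, 2), (30.0, 3)]
y: List[bool] = [False, False, False, True, True, False]


def fit_stump(rows: List[Row], labels: List[bool]) -> Tuple[int, float]:
    """Pick the (feature, threshold) whose rule 'value > threshold means fraud'
    classifies the most training rows correctly."""
    best = (0, 0.0)
    best_correct = -1
    for feature in range(len(rows[0])):
        for candidate in rows:
            threshold = candidate[feature]
            correct = sum(
                (row[feature] > threshold) == label
                for row, label in zip(rows, labels)
            )
            if correct > best_correct:
                best_correct = correct
                best = (feature, threshold)
    return best


def predict(stump: Tuple[int, float], row: Row) -> bool:
    feature, threshold = stump
    return row[feature] > threshold


stump = fit_stump(X, y)
print(stump)                        # chosen (feature index, threshold)
print(predict(stump, (500.0, 12)))  # large daytime purchase
print(predict(stump, (20.0, 3)))    # small night-time purchase
```

On this toy data the hour of day alone cannot separate the classes (one legitimate purchase also happens at 3 a.m.), so the stump settles on the amount. A full decision tree would keep splitting on further features.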

Modeling Objectives and Data Mining Techniques


Modeling Objective: Exploration
Supervised: Decision Trees
Note: Targets can be binary, nominal, or ordinal.
Unsupervised: Principal Components; Clustering (K-means, etc.)

Exploration uncovers dimensionality in input data. For example, trying to uncover groups of similar customers based on spending habits for a large, targeted mailing is an exploration problem.
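
The customer-grouping example can be sketched with a tiny one-dimensional K-means. This is an illustrative simplification: the spending figures are invented, the starting centers are chosen by hand rather than at random, and `kmeans_1d` is a hypothetical helper, not a library function.

```python
from typing import List


def kmeans_1d(values: List[float], centers: List[float], iterations: int = 10) -> List[float]:
    """Alternate between assigning each value to its nearest center and
    moving each center to the mean of the values assigned to it."""
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Keep the old center if a group ends up empty.
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]
    return centers


# Hypothetical monthly spend per customer.
monthly_spend = [20.0, 25.0, 30.0, 200.0, 220.0, 240.0]
centers = kmeans_1d(monthly_spend, centers=[20.0, 240.0])
print(centers)
```

The two resulting centers describe a low-spending and a high-spending group, which is the kind of segment a targeted mailing could be built around.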

Modeling Objectives and Data Mining Techniques


Modeling Objective: Affinity
Supervised: Not applicable
Unsupervised: Associations; Sequences; Factor Analysis

Affinity analysis determines which events are likely to occur in conjunction with one another. Retailers use affinity analysis to analyze product purchase combinations.
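
Affinity between products is commonly measured with support (how often two items appear together) and confidence (how often the second item appears given the first). The sketch below computes both over a handful of invented shopping baskets; the basket contents and the helper names `support` and `confidence` are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets.
baskets = [
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "backpack"},
    {"hiking shoes", "water bottle"},
    {"sleeping bag", "hiking shoes", "backpack"},
]

# How many baskets contain each item, and each unordered pair of items.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    frozenset(pair)
    for basket in baskets
    for pair in combinations(sorted(basket), 2)
)


def support(a: str, b: str) -> float:
    """Fraction of all baskets that contain both a and b."""
    return pair_counts[frozenset((a, b))] / len(baskets)


def confidence(a: str, b: str) -> float:
    """Of the baskets that contain a, the fraction that also contain b."""
    return pair_counts[frozenset((a, b))] / item_counts[a]


print(support("sleeping bag", "backpack"))
print(confidence("sleeping bag", "backpack"))
print(confidence("hiking shoes", "backpack"))
```

Pairs with high support and high confidence are the candidates a retailer would examine for cross-selling or shelf placement.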

Techniques for fraud detection

If-Then rules (Expert rules)


The purpose is to use facts and rules, taken from the knowledge of many human experts, to help make decisions.

Examples of rules:
- More than 4 ATM transactions in one hour?
- More than 2 transactions in 5 minutes?
- Magnetic stripe transaction then internet transaction?
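
The three example rules can be expressed directly as code. The sketch below is one possible reading of them, under stated assumptions: the `Txn` record, the channel labels ("atm", "magstripe", "internet"), and the interpretation of the third rule as "the immediately preceding transaction was magnetic stripe and the new one is internet" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Txn:
    time: datetime
    channel: str  # hypothetical labels: "atm", "magstripe", "internet"


def count_in_window(txns: List[Txn], end: datetime, window: timedelta,
                    channel: Optional[str] = None) -> int:
    """Count transactions in the interval (end - window, end], optionally by channel."""
    return sum(
        1 for t in txns
        if end - window < t.time <= end and (channel is None or t.channel == channel)
    )


def triggered_rules(history: List[Txn], new: Txn) -> List[str]:
    """Return the expert rules that the new transaction triggers."""
    txns = sorted(history + [new], key=lambda t: t.time)
    alerts = []
    if count_in_window(txns, new.time, timedelta(hours=1), "atm") > 4:
        alerts.append("more than 4 ATM transactions in one hour")
    if count_in_window(txns, new.time, timedelta(minutes=5)) > 2:
        alerts.append("more than 2 transactions in 5 minutes")
    previous = [t for t in txns if t.time < new.time]
    if previous and previous[-1].channel == "magstripe" and new.channel == "internet":
        alerts.append("magnetic stripe transaction then internet transaction")
    return alerts


base = datetime(2014, 1, 1, 12, 0)
history = [
    Txn(base, "magstripe"),
    Txn(base + timedelta(minutes=1), "magstripe"),
]
new_txn = Txn(base + timedelta(minutes=2), "internet")
print(triggered_rules(history, new_txn))
```

Each rule is a fixed threshold written by hand, which is why such systems are easy to read but cannot pick up fraud patterns nobody has written a rule for.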

If-Then rules (Expert rules)


Problems with rules:
- New fraud patterns are not detected
- Only simple rules can be created

Advantages of rules:
- Easy to implement
- Very easy to interpret

Predictive modeling
Predictive modeling is the use of statistical and mathematical techniques to discover patterns in data in order to make predictions.

Need of Data Mining


In the field of information technology, a huge amount of data is available that needs to be turned into useful information. This information can then be used for applications such as market analysis, fraud detection, customer retention, production control, and science exploration. Typical uses include:

- Identify unexpected shopping patterns in supermarkets.
- Optimize website profitability by making appropriate offers to each visitor.
- Predict customer response rates in marketing campaigns.
- Define new customer groups for marketing purposes.
- Predict customer defections: which customers are likely to switch to an alternative supplier in the near future.
- Distinguish between profitable and unprofitable customers.
- Improve yields in complex production processes by finding unexpected relationships between process parameters and defect rates.
- Identify "wedge issues" and target political campaigns.
- Identify suspicious (unusual) behavior, as part of a fraud detection process.


Advantages of Data Mining


Data mining helps people answer questions that they might not even have thought to ask. The information extracted from raw data is usually hidden and could go unrevealed if proper data mining techniques are not used. It helps companies obtain information that they can use effectively to stand out from the competition.

Data mining has become popular because it gives quick and correct access to useful information, which lets companies concentrate on decision making and other important processes. Different industries and organizations use data mining to its full strength to obtain market trends, industry research, sales promotion insights, competitor analysis, medical research findings, and so on.

Retailers can learn useful and accurate trends about their customers and their purchasing behavior. This knowledge can be used to market products better, attract targeted customers, come up with products that customers will like, manage supermarket shelves or space more effectively, introduce coupons or discount offers on certain products, increase sales, set pricing strategies, and so on.

Retailers and organizations that concentrate on customer satisfaction tend to adopt data mining techniques. In law enforcement, data mining helps identify criminal suspects by analyzing the crime type, behaviour, and habits of criminals who are already on record. In healthcare, data mining techniques are used to identify certain diseases and to decide which treatment methods are effective.

Financial institutions such as banks and credit companies use data mining to identify fraudulent customers and fraudulent medical claims, to manage risk, and so on. Weather forecasting is an area where data mining is widely used. Data mining plays a major role in the identification, classification, and age determination of sky objects. In the development of new medicines, data mining is used to foresee the effectiveness of the developed medicines.

Limitation of Data Mining


Data quality is the most important challenge in data mining. Because everything is done on data, the outcome depends heavily on its quality. Completeness, reliability, and accuracy all contribute to data quality. Because thousands of records are usually analyzed and summarized for decision making, errors in the data badly affect every step of the knowledge discovery process.

Duplicate records, missing data values, unneeded data fields, a lack of proper data standards, a lack of timely data updates, and human errors can all degrade the quality of data and thus the data mining process. The data cleaning process therefore includes removing duplicate records, entering appropriate values for missing records (e.g. 0 rather than leaving an entry null), removing unneeded data fields, identifying and removing logically wrong values (e.g. 200 as an age, or 01/01/1100 as a birth date), standardizing data formats, and updating data fields in a timely manner.
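
A few of these cleaning steps can be sketched as follows. The record layout, the bounds `MAX_AGE` and `MIN_BIRTH_YEAR`, and the choice to drop a whole record when it contains a logically wrong value are assumptions made for this example; filling a missing age with 0 follows the convention mentioned above.

```python
from datetime import date
from typing import Dict, List

MAX_AGE = 120          # assumed upper bound for a plausible age
MIN_BIRTH_YEAR = 1900  # assumed lower bound for a plausible birth year

# Made-up customer records with typical quality problems.
records: List[Dict] = [
    {"id": 1, "age": 34, "birth_date": date(1980, 5, 1)},
    {"id": 1, "age": 34, "birth_date": date(1980, 5, 1)},    # duplicate record
    {"id": 2, "age": 200, "birth_date": date(1975, 3, 9)},   # logically wrong age
    {"id": 3, "age": None, "birth_date": date(1990, 7, 4)},  # missing age
    {"id": 4, "age": 41, "birth_date": date(1100, 1, 1)},    # logically wrong date
]


def clean(rows: List[Dict]) -> List[Dict]:
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["id"], row["age"], row["birth_date"])
        if key in seen:
            continue  # remove duplicate records
        seen.add(key)
        # Enter a value for missing data (0 rather than null).
        age = 0 if row["age"] is None else row["age"]
        # Remove records holding logically wrong values.
        if not 0 <= age <= MAX_AGE or row["birth_date"].year < MIN_BIRTH_YEAR:
            continue
        cleaned.append({**row, "age": age})
    return cleaned


cleaned = clean(records)
print([row["id"] for row in cleaned])
```

Whether a bad value should cause the whole record to be dropped, or only that field to be blanked, depends on the analysis that follows; the sketch takes the stricter option.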

Interoperability is another major data mining issue. Because data may be collected from heterogeneous sources, the types of data differ, and it is practically impossible to standardize all of them. Different databases and data mining software need to be interoperable so that data can be analysed and summarized correctly. Suppose, for example, that a government sets out to share information between its departments in order to improve inter-department collaboration. Because the existing databases of the different departments are of different types, the project would have to overcome these interoperability issues.

Because large amounts of private and sensitive information about companies or individuals must be stored and used for data mining activities, security and privacy have become a major issue to be addressed before data mining becomes completely mature. Data mining could lead to illegal access to confidential data and to the disclosure of implicit details that individuals or companies do not want revealed.

Selecting the correct data mining method is very important for getting correct results. Performance issues must also be resolved, since good performance is one of the main expectations of a data mining system.

THANK YOU
