
Data Mining

Nikita K Somaiya
MIM-14-08
Business Intelligence and Analysis
IES MCRC

Overview

Introduction
Explanation of Data Mining
Techniques
Advantages
Applications
Privacy

Data Mining

What is Data Mining?


The process of semi-automatically analyzing large databases to find useful patterns (Silberschatz)
KDD: Knowledge Discovery in Databases (3)
Attempts to discover rules and patterns from data
Discover rules, then use them to make predictions
Areas of Use

Internet: discover needs of customers
Economics: predict stock prices
Science: predict environmental change
Medicine: match patients with similar problems to find a cure

Example of Data Mining

A credit card company wants to discover information about its clients from its databases. It wants to find:

Clients who respond to promotions in junk mail
Clients who are likely to switch to a competitor
Clients who are likely not to pay
Services that clients use, in order to promote services affiliated with the credit card company
Anything else that may help the company provide or promote services that help its clients and ultimately make more money

Data Mining & Data Warehousing

Data Warehouse: a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site. (Silberschatz)

Collect data, then store it in a single repository
Allows for easier query development, as a single repository can be queried

Data Mining:

Analyzing databases or data warehouses to discover patterns in the data and gain knowledge.
Knowledge is power.

Discovery of Knowledge

Data Mining Techniques

Classification
Clustering
Regression
Association Rules

Classification

Classification: Given a set of items that each belong to one of several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item.
The goal is therefore to classify the new item, i.e. to identify the class to which it belongs.
Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes Responds Rarely, Responds Sometimes and Responds Frequently.
The bank will then attempt to find rules about the customers who respond Frequently and Sometimes.
The rules could be used to predict the needs of potential customers.

Technique for Classification

Decision-Tree Classifiers

[Figure: decision tree predicting the credit risk (Good or Bad) of a person. The root splits on Job (Engineer, Carpenter, Doctor); each branch then splits on Income. Income thresholds shown: Engineer <30K / >50K, Carpenter <40K / >90K, Doctor <50K / >100K.]
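The tree in the figure can be written out as nested tests. The following is only a minimal sketch: it assumes that, for every job, the lower income threshold leads to Bad and the upper one to Good (the extracted figure does not make this pairing certain), and it returns "Unknown" for incomes between the two thresholds, which the figure does not cover. The function name credit_risk is hypothetical.

```python
def credit_risk(job: str, income_k: float) -> str:
    """Hand-written decision tree: split on job first, then on income (in thousands)."""
    # (bad_below, good_above) per job. ASSUMPTION: the pairing of each
    # threshold with Bad/Good is inferred from the figure, not confirmed by it.
    thresholds = {
        "Engineer": (30, 50),
        "Carpenter": (40, 90),
        "Doctor": (50, 100),
    }
    if job not in thresholds:
        return "Unknown"  # job not present in the tree
    bad_below, good_above = thresholds[job]
    if income_k < bad_below:
        return "Bad"
    if income_k > good_above:
        return "Good"
    return "Unknown"  # income range not covered by the figure
```

For example, credit_risk("Engineer", 20) follows the Engineer branch and then the low-income branch.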

Clustering

Clustering algorithms find groups of items that are similar. They divide a data set so that records with similar content are in the same group, and groups are as different as possible from each other. (2)

Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased.

The categories are not specified in advance; this is referred to as unsupervised learning.

Clustering

Group data into clusters

Similar data is grouped in the same cluster
Dissimilar data is grouped in different clusters

How is this achieved?


K-Nearest Neighbor

A classification method that classifies a point by calculating the distances between the point and points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer). (2)
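The procedure described above can be sketched in a few lines. This is an illustrative sketch only: it assumes Euclidean distance and a simple majority vote, and the training points and labels "A" and "B" are made up for the example.

```python
from collections import Counter
from math import dist


def knn_classify(point, training, k=3):
    """Return the most common label among the k training points nearest to `point`.

    `training` is a list of (features, label) pairs.
    """
    # Sort training items by Euclidean distance to the query point, keep the k closest.
    nearest = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
```

Calling knn_classify((1.2, 1.4), training) looks at the three closest training points and returns their majority label.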

Hierarchical

Group data into trees (a hierarchy of nested clusters)
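One common way to build a hierarchy of clusters is bottom-up (agglomerative) merging: start with every record in its own cluster and repeatedly merge the two closest clusters. The sketch below assumes single linkage (cluster distance is the distance between the two closest members) and Euclidean distance; both choices, and the sample client data, are assumptions made for illustration.

```python
from math import dist


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain (single linkage)."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical insurance clients as (age, number of policies held).
clients = [(22, 1), (25, 1), (24, 2), (58, 4), (61, 5), (63, 4)]
groups = agglomerative(clients, 2)
```

Stopping at different values of n_clusters corresponds to cutting the tree at different levels.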

Regression

Regression deals with the prediction of a value, rather than a class. (1, p747)
Example: Find out whether there is a relationship between smoking and cancer-related illness in patients.
Given values: X1, X2, ..., Xn
Objective: predict variable Y
One way is to estimate coefficients a0, a1, a2, ..., an such that

Y = a0 + a1X1 + a2X2 + ... + anXn

Linear Regression
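For the simplest case of a single input variable (Y = a0 + a1X1), the coefficients can be estimated with ordinary least squares. The sketch below assumes that one-variable case and uses made-up data; the name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a0 + a1 * x with one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a0, a1 = fit_line(xs, ys)
```

A new value is then predicted as a0 + a1 * x. With several input variables the same idea applies, but the coefficients are usually found with a linear algebra routine rather than by hand.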

Regression

[Figure: example graph of curve fitting, showing a line of best fit]

Association Rules

An association algorithm creates rules that describe how often events have occurred together. (2)

Example: When a customer buys a hammer, then 90% of the time they will also buy nails.

Association Rules

Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. (1, p748)
Example:

Purchases that include both hotdog buns and hotdog sausages make up a large fraction of all purchases = high support
Purchases that include both hotdog buns and hangers make up 0.005% of all purchases = low support

Rules with high support are worth careful attention

E.g. hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.

Association Rules

Confidence: a measure of how often the consequent is true when the antecedent is true. (1, p748)
Example:

90% of hotdog bun purchases are accompanied by hotdog sausages.
High confidence is meaningful, as we can derive rules from it:

Hotdog bun → Hotdog sausage

Two rules may have different confidence levels and have the same support.
E.g. Hotdog sausage → Hotdog bun may have a much lower confidence than Hotdog bun → Hotdog sausage, yet both can have the same support.
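Both measures can be computed directly from a list of purchases. The sketch below uses a small made-up set of shopping baskets (the items and counts are assumptions for illustration) to show two rules with the same support but different confidence.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain both antecedent and consequent."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    has_antecedent = [t for t in transactions if antecedent <= t]
    if not has_antecedent:
        return 0.0  # rule never applies in this data
    return sum(consequent <= t for t in has_antecedent) / len(has_antecedent)


# Hypothetical baskets; each set is one purchase.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages", "mustard"},
    {"sausages"},
    {"sausages", "mustard"},
    {"hangers"},
]

buns, sausages = {"buns"}, {"sausages"}
same_support = support(baskets, buns, sausages) == support(baskets, sausages, buns)
conf_buns_to_sausages = confidence(baskets, buns, sausages)
conf_sausages_to_buns = confidence(baskets, sausages, buns)
```

In this data every basket with buns also has sausages, but only half of the baskets with sausages have buns, so the two rules share one support value while their confidence differs.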

Advantages of Data Mining

Provides new knowledge from existing data

Public databases
Government sources
Company Databases

Old data can be used to develop new knowledge

New knowledge can be used to improve services or products

Improvements lead to:

Bigger profits
More efficient service

Uses of Data Mining

Sales/Marketing: diversify target market; identify clients' needs to increase response rates

Risk Assessment: identify customers that pose high credit risk

Fraud Detection: identify people misusing the system, e.g. people who have two Social Security Numbers

Customer Care: identify customers likely to change providers; identify customer needs

Privacy Concerns

Effective data mining requires large sources of data
To achieve a wide spectrum of data, multiple data sources are linked
Linking sources can be problematic for privacy. For example, if the following histories of a customer were linked:

Shopping History
Credit History
Bank History
Employment History

the user's life story could be painted from the collected data
