Data Mining

Definitions
 • Business intelligence
 • DW & OLAP
 • Data mining
 • Data Warehousing and Data Mining Motivation
Data mining tasks
 • Classification
 • Clustering
 • Association, etc.
What is business intelligence?
1. The new technology for understanding the past and predicting the future
2. A broad category of technologies that allows for
 • Gathering, storing, accessing and analyzing data to help business users make better decisions
 • Analyzing business performance through data-driven insight
3. A broad category of applications, which includes the activities of
 • Decision support systems
 • Query and reporting
 • OLAP
 • Statistical analysis, forecasting and data mining
What is a data warehouse?
A data warehouse is simply a single, complete and consistent store of data obtained from a variety of
sources and made available to end users in a way they can understand and use in a business context.
Data in OLTP and OLAP
What is data mining?
 • Many definitions
o Search for valuable information (knowledge) from large volumes of data
o Exploration & analysis, by automatic or semi-automatic means, of large quantities of
data in order to discover meaningful patterns & rules
 • Alternative terms:
o Data analysis, pattern analysis, data dredging, data exploration, data understanding,
data summarization
o Data mining: a misnomer?
KDD process
1. Data cleaning: remove noise and inconsistent data

2. Data integration: from multiple sources -> data warehouse
3. Data selection and transformation: transform data into forms appropriate for data mining,
select relevant data
4. Data mining: extract patterns
5. Pattern evaluation/interpretation: using interestingness measures
6. Knowledge presentation: visualization and knowledge representation are used to present mined
knowledge to the user
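The six steps above can be read as a pipeline in which each stage consumes the output of the previous one. The following is a minimal sketch in Python, assuming toy in-memory records. The two sources, the field names (customer, amount) and the spending threshold are hypothetical and are not part of these notes.

```python
from collections import defaultdict

# Hypothetical raw data from two sources.
store_sales = [
    {"customer": "ann", "amount": 120.0},
    {"customer": "bob", "amount": None},   # noisy record: missing amount
    {"customer": "ann", "amount": 80.0},
]
web_sales = [
    {"customer": "BOB", "amount": 40.0},   # inconsistent casing
    {"customer": "cy", "amount": -5.0},    # inconsistent value: negative sale
    {"customer": "cy", "amount": 300.0},
]

def clean(records):
    """Step 1: drop records with missing or impossible amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Step 2: merge several sources into one store with a consistent key format."""
    return [{"customer": r["customer"].lower(), "amount": r["amount"]}
            for source in sources for r in source]

def select_and_transform(records):
    """Step 3: keep the relevant fields and aggregate to one total per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def mine(totals, threshold=150.0):
    """Step 4: extract a (deliberately trivial) pattern: customers above a threshold."""
    return {c: t for c, t in totals.items() if t >= threshold}

def evaluate(patterns, totals):
    """Step 5: score each pattern by its share of total revenue (a toy interestingness measure)."""
    grand_total = sum(totals.values())
    return {c: t / grand_total for c, t in patterns.items()}

def present(scored):
    """Step 6: render the result for a human reader, most interesting first."""
    ranked = sorted(scored.items(), key=lambda kv: -kv[1])
    return [f"{c}: {share:.0%} of revenue" for c, share in ranked]

warehouse = integrate(clean(store_sales), clean(web_sales))
totals = select_and_transform(warehouse)
patterns = mine(totals)
report = present(evaluate(patterns, totals))
print(report)
```

In a real setting each stage would be far more involved; the point here is only the order of the stages and what each one hands to the next.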
What is not Data Mining?
 • Look up a phone number in a phone directory
 • Query a Web search engine for information about “Amazon”
What is Data Mining?
 • Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston
area)
 • Group together similar documents returned by a search engine according to their context (e.g.
Amazon rainforest, Amazon.com)
Origins of Data Mining
a. Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
b. Traditional techniques may be unsuitable due to
 • Enormity of data
 • High dimensionality of data
 • Heterogeneous, distributed nature of data
Data mining in the BI context
The complete DSS from BI perspective
Data Warehousing and Data Mining Motivations

Motivation:
Data explosion problem:
 • Automated data collection tools and mature database technology lead to large amounts of data
stored in databases and data warehouses
We are drowning in data, but starving for knowledge!
Do not believe it?
See the following for proof!
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
a) Web data, e-commerce
b) Purchases at department/grocery stores
c) Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
a) Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
a) Data collected and stored at enormous speeds (GB/hour)
1) remote sensors on a satellite
2) telescopes scanning the skies
3) microarrays generating gene expression data
4) scientific simulations generating terabytes of data
What tools do we have?
1) Query processing
2) Reporting tool
3) Spreadsheet
4) Statistics
5) OLAP (On Line Analytical Processing)
What we need is
New technology that can intelligently and automatically assist humans in analyzing and transforming the
rapidly growing volume of digital data into useful information
Data Mining Tasks
1) Prediction Methods
 • Use some variables to predict unknown or future values of other variables.
2) Description Methods
 • Find human-interpretable patterns that describe the data.
3) Classification [Predictive]
4) Clustering [Descriptive]
5) Association Rule Discovery [Descriptive]
6) Sequential Pattern Discovery [Descriptive]
7) Regression [Predictive]
8) Deviation Detection [Predictive]
Classification: Definition
1) Given a collection of records (training set)
 • Each record contains a set of attributes; one of the attributes is the class.
 • Find a model for the class attribute as a function of the values of the other attributes.
2) Goal: previously unseen records should be assigned a class as accurately as possible.
 • A test set is used to determine the accuracy of the model.
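The definition above (learn a model from a training set, then measure accuracy on a separate test set) can be sketched as follows. This is an illustration only: it assumes two numeric attributes and uses a 1-nearest-neighbour rule as the model, which is just one of many possible classifiers. All records and class labels are made up.

```python
import math

# Hypothetical labelled records: (attribute_1, attribute_2, class_label).
training_set = [
    (1.0, 1.0, "A"), (1.5, 2.0, "A"), (2.0, 1.5, "A"),
    (6.0, 6.0, "B"), (6.5, 7.0, "B"), (7.0, 6.5, "B"),
]
# Held-out records whose true class is known but is not shown to the model.
test_set = [
    (1.2, 1.4, "A"), (6.8, 6.2, "B"), (2.2, 2.1, "A"), (5.5, 6.1, "B"),
]

def predict(record, training):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    nearest = min(training, key=lambda t: math.dist(record, (t[0], t[1])))
    return nearest[2]

def accuracy(test, training):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict((x, y), training) == label for x, y, label in test)
    return correct / len(test)

acc = accuracy(test_set, training_set)
print(f"accuracy on the test set: {acc:.2f}")
```

Keeping the test records out of the training set is what makes the accuracy figure an estimate of performance on previously unseen records.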
Application: Credit card application
1) Institution: a credit card company typically receives thousands of applications for new cards. Each
application contains information such as annual salary, outstanding debts and age.
2) The problem: a decision has to be made whether to accept or reject each application.
3) Data mining task: to categorize applicants into those who have good credit, those who have bad credit,
and those who fall into a gray area (thus requiring further human analysis).
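One way to obtain the three categories is to let several similar past applicants vote and to treat a split vote as the gray area. The sketch below assumes a k-nearest-neighbour vote with k = 3 over two attributes (salary and debt, in thousands). The historical records, the attribute choice and the unanimity rule are hypothetical illustrations, not a method prescribed by these notes.

```python
import math

# Hypothetical past applicants: (annual_salary_k, outstanding_debt_k, outcome).
history = [
    (90, 5, "good"), (85, 10, "good"), (70, 8, "good"),
    (30, 40, "bad"), (25, 35, "bad"), (35, 45, "bad"),
]

def categorize(applicant, history, k=3):
    """Classify by the k most similar past applicants; a split vote is a gray area."""
    nearest = sorted(history, key=lambda h: math.dist(applicant, h[:2]))[:k]
    good_votes = sum(label == "good" for _, _, label in nearest)
    if good_votes == k:
        return "good credit"
    if good_votes == 0:
        return "bad credit"
    return "gray area"  # needs further human analysis

decisions = {
    "high salary, low debt": categorize((80, 7), history),
    "low salary, high debt": categorize((28, 38), history),
    "in between": categorize((52, 25), history),
}
print(decisions)
```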
Clustering
a) Groups data into meaningful classes/clusters
b) Unsupervised learning
c) Motivation:
1) We do not know what to look for
2) The first step in identifying useful patterns is to group data by their similarity
3) Once data are grouped (clustered), properties of each cluster can be analyzed
d) High quality clusters:
1) the intra-class similarity is high
2) the inter-class similarity is low
Clustering: Basic concept
Given points in some space, group the points into a small number of clusters.
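As an illustration of grouping points by similarity, here is a minimal k-means sketch. The notes do not name a specific algorithm, so k-means is an assumption, as are the sample points and the choice of starting centroids. Similarity is measured as Euclidean distance, so "high similarity" means "small distance".

```python
import math

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]

def mean(cluster):
    """Centroid of a non-empty list of 2-D points."""
    xs, ys = zip(*cluster)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids

def avg_distance(group_a, group_b):
    """Average distance over all pairs of distinct points drawn from the two groups."""
    pairs = [(a, b) for a in group_a for b in group_b if a != b]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Starting centroids are picked by hand here to keep the run deterministic.
clusters, centroids = k_means(points, centroids=[points[0], points[3]])

intra = avg_distance(clusters[0], clusters[0])  # within one cluster: should be small
inter = avg_distance(clusters[0], clusters[1])  # between clusters: should be large
print(clusters, intra < inter)
```

A small intra-cluster distance together with a large inter-cluster distance corresponds to the "high quality clusters" criterion stated earlier in these notes.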
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection:
1) Produce dependency rules which will predict the occurrence of an item based on occurrences of other
items.
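A dependency rule such as {diapers} -> {beer} is commonly scored with two measures, support and confidence. These measures are not defined in the notes, so treat them as an assumption about how rule strength would be judged. The transactions below are invented for illustration.

```python
# Hypothetical market-basket records; each record is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Rule discovery then amounts to searching for rules whose support and confidence exceed chosen minimum values; the search strategy itself is outside this sketch.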
Sequential Pattern Discovery: Definition
Given a set of objects, with each object associated with its own timeline of events, find rules that
predict strong sequential dependencies among different events.
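To make "sequential dependency" concrete, the sketch below checks how often a sequence of event sets occurs in order within each object's timeline. It borrows the item names from the athletic apparel example in these notes, but the customer timelines are invented, and using confidence as the strength measure is an assumption.

```python
def contains(timeline, pattern):
    """True if the pattern's event sets occur, in order, within the timeline."""
    position = 0
    for wanted in pattern:
        # Advance to the first remaining event set that includes all wanted events.
        while position < len(timeline) and not wanted <= timeline[position]:
            position += 1
        if position == len(timeline):
            return False
        position += 1  # the next pattern element must match a strictly later event set
    return True

# Hypothetical purchase timelines, one per customer, oldest purchase first.
timelines = {
    "customer_1": [{"Shoes"}, {"Racket", "Racketball"}, {"Sports_Jacket"}],
    "customer_2": [{"Shoes"}, {"Sports_Jacket"}],
    "customer_3": [{"Shoes", "Socks"}, {"Racket", "Racketball", "Towel"}, {"Sports_Jacket"}],
    "customer_4": [{"Sports_Jacket"}, {"Shoes"}, {"Racket", "Racketball"}],
}

antecedent = [{"Shoes"}, {"Racket", "Racketball"}]
rule = antecedent + [{"Sports_Jacket"}]

with_antecedent = [t for t in timelines.values() if contains(t, antecedent)]
with_rule = [t for t in with_antecedent if contains(t, rule)]
rule_confidence = len(with_rule) / len(with_antecedent)
print(f"{len(with_rule)} of {len(with_antecedent)} matching timelines follow the rule")
```

Order matters: customer_4 bought the same items but the jacket came first, so that timeline supports the antecedent and not the rule.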
Sequential Pattern Discovery: Examples
Stock market
1) (IBM_UP SUN_UP) –> (Microsoft_UP)
In point-of-sale transaction sequences
1) Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) –> (Perl_for_dummies, Tcl_Tk)
2) Athletic Apparel Store:
(Shoes) (Racket, Racketball) –> (Sports_Jacket)
Medical field
1) If a patient underwent cardiac bypass surgery for blocked arteries (blood vessels) and later
developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within
the next 18 months.
Deviation/Anomaly Detection
a) Detect significant deviations from normal behavior
b) Applications:
1) Credit Card Fraud Detection
2) Network Intrusion Detection
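A very simple way to flag deviations from normal behavior is to measure how many standard deviations a value lies from the mean (a z-score). The sketch below assumes this approach, a threshold of 2, and made-up card transaction amounts. Production fraud or intrusion detection would use far richer models.

```python
import statistics

# Hypothetical card transaction amounts for one customer.
amounts = [25, 30, 28, 32, 27, 31, 29, 950]

def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

flagged = find_deviations(amounts)
print(flagged)
```

Because a large outlier also inflates the mean and the standard deviation, this method becomes less sensitive as outliers get more extreme or more numerous, which is one reason the threshold has to be chosen with care.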
