Data Mining
Data mining is the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable
patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.
[Figure: the knowledge discovery process, bottom to top: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Preprocessing:
Transformation:
Data Mining:
Interpretation/Evaluation:
Questions:
What is the profile of a reader of a car magazine?
Is there any correlation between an interest in cars and an interest in
comics?
Data Selection
Select the information about people who have
subscribed to a magazine
Cleaning
Pollution: typing errors, customers moving from one place to
another without notifying a change of address,
people giving incorrect information about
themselves
Pattern Recognition Algorithms
Cleaning
Lack of domain consistency
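The two cleaning problems above (polluted records and values outside a sensible domain) can be sketched as follows. This is a minimal illustration, not the method used in the slides: the records, the `normalize` and `deduplicate` helpers, and the 1910 birth-date cutoff are all assumptions made up for the example.

```python
import re
from datetime import date

# Hypothetical subscriber records; every name and value is made up.
records = [
    {"name": "J. Smith", "address": "12 Oak St.", "birth_date": date(1961, 4, 2)},
    {"name": "j smith",  "address": "12 Oak St",  "birth_date": date(1961, 4, 2)},
    {"name": "A. Jones", "address": "3 Elm Rd",   "birth_date": date(1901, 1, 1)},
]

def normalize(text):
    """Lower-case the text and drop punctuation and extra spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]", "", text.lower()).split())

def deduplicate(rows):
    """Keep the first record for each normalized (name, address) pair."""
    seen = {}
    for row in rows:
        key = (normalize(row["name"]), normalize(row["address"]))
        seen.setdefault(key, row)
    return list(seen.values())

def outside_domain(rows, earliest=date(1910, 1, 1)):
    """Flag records whose birth date is earlier than an assumed plausible limit."""
    return [row for row in rows if row["birth_date"] < earliest]

unique = deduplicate(records)
suspicious = outside_domain(unique)
```

Flagged records are only candidates for review; whether to correct or drop them is a decision for someone who knows the data.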
Enrichment
Need extra information about the clients consisting of
date of birth, income, amount of credit, and whether or
not an individual owns a car or a house
Enrichment
The new information needs to be easily joined to the
existing client records
Extract more knowledge
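As a rough sketch of the enrichment step, the extra information can be joined to the client records on a shared key. The `client_id` key, the field names, and the values below are assumptions chosen for illustration; the slides do not say how the records are keyed.

```python
# Hypothetical client records from the subscription database.
clients = [
    {"client_id": 1, "name": "Smith"},
    {"client_id": 2, "name": "Jones"},
]

# Hypothetical extra information, keyed by the same client_id.
extra = {
    1: {"birth_date": "1961-04-02", "income": 18500, "credit": 17000,
        "owns_car": True, "owns_house": False},
}

def enrich(clients, extra):
    """Left join: every client is kept, with extra fields added where available."""
    enriched = []
    for client in clients:
        merged = dict(client)
        merged.update(extra.get(client["client_id"], {}))
        enriched.append(merged)
    return enriched

enriched = enrich(clients, extra)
```

Clients with no match keep only their original fields, which is why a later step has to decide whether such records carry enough information to be useful.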
Coding
We select only those records that have enough
information to be of value (row)
Project the fields in which we are interested
(column)
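The row selection and column projection described above might look like this in code. The choice of required and projected fields is an assumption for the example; in practice it depends on the questions being asked.

```python
# Assumed for illustration: which fields a record must have, and which we keep.
REQUIRED = ("birth_date", "income", "credit")
FIELDS = ("client_id", "birth_date", "income", "credit", "owns_car")

def select_rows(rows, required=REQUIRED):
    """Selection: keep only records in which every required field is present."""
    return [r for r in rows if all(r.get(f) is not None for f in required)]

def project(rows, fields=FIELDS):
    """Projection: keep only the fields of interest."""
    return [{f: r.get(f) for f in fields} for r in rows]

# Hypothetical enriched records.
rows = [
    {"client_id": 1, "name": "Smith", "birth_date": "1961-04-02",
     "income": 18500, "credit": 17000, "owns_car": True},
    {"client_id": 2, "name": "Jones", "birth_date": None,
     "income": None, "credit": None, "owns_car": None},
]

table = project(select_rows(rows))
```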
Coding
Code the information which is too detailed
Address to region
Birth date to age
Divide income by 1000
Divide credit by 1000
Convert cars yes-no to 1-0
Convert purchase date to month numbers starting from
1990
The way in which we code the information will determine the
type of patterns we find
Coding has to be performed repeatedly in order to get the best
results
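The coding steps listed above can be sketched as a single function. Several details are assumptions: the `REGION_BY_ADDRESS` lookup is hypothetical, the reference date used to compute age is passed in explicitly, and month numbering is taken to start at 1 for January 1990.

```python
from datetime import date

# Hypothetical lookup from address to region number.
REGION_BY_ADDRESS = {"12 Oak St": 4, "3 Elm Rd": 2}

def code_record(rec, today):
    """Replace detailed fields with coarser, coded ones."""
    birth = rec["birth_date"]
    purchase = rec["purchase_date"]
    # Subtract one year if the birthday has not yet occurred this year.
    age = today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))
    return {
        "region": REGION_BY_ADDRESS.get(rec["address"]),
        "age": age,
        "income": rec["income"] / 1000,
        "credit": rec["credit"] / 1000,
        "car": 1 if rec["owns_car"] else 0,
        # Assumption: January 1990 is month 1.
        "purchase_month": (purchase.year - 1990) * 12 + purchase.month,
    }

record = {
    "address": "12 Oak St",
    "birth_date": date(1961, 4, 2),
    "income": 18500,
    "credit": 17000,
    "owns_car": True,
    "purchase_date": date(1992, 3, 15),
}

coded = code_record(record, today=date(1995, 1, 1))
```

Changing any of these choices, for example using age bands instead of exact age, changes which patterns a mining algorithm can find, which is why coding is usually revisited several times.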
Coding
We are interested in the relationships between
readers of different magazines
Perform flattening operation
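One common reading of the flattening operation is to turn a table with one row per subscription into a table with one row per customer and one 0/1 column per magazine. The sketch below assumes that reading; the magazine list and the subscription data are made up.

```python
# Hypothetical subscriptions: one (client_id, magazine) pair per row.
subscriptions = [
    (1, "car"), (1, "comics"),
    (2, "house"),
    (3, "comics"), (3, "sports"),
]

MAGAZINES = ("car", "house", "sports", "comics")

def flatten(subscriptions, magazines=MAGAZINES):
    """Return one row per client with a 0/1 indicator for each magazine."""
    flat = {}
    for client_id, magazine in subscriptions:
        row = flat.setdefault(client_id, {m: 0 for m in magazines})
        row[magazine] = 1
    return flat

flat = flatten(subscriptions)
```

With all of a customer's magazines on a single row, relationships between readers of different magazines can be examined record by record.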
Data mining
We may find the following rules
A customer with credit > 13000 and aged between 22 and 31 who
has subscribed to a comics magazine at time T will very likely subscribe to a
car magazine five years later
The number of house magazines sold to customers with credit
between 12000 and 31000 living in region 4 is increasing
A customer with credit between 5000 and 10000 who reads a
comics magazine will very likely become a customer with credit
between 12000 and 31000 who reads a sports and a house
magazine after 12 years
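To make "very likely" concrete, a rule such as the first one above can be checked against the data by counting how often its conclusion holds among the customers who match its conditions. The slides do not name a measure; the fraction computed here is usually called the confidence of the rule. The customer data and the `car_later` field are invented for the example.

```python
# Hypothetical flattened customer records. "car_later" marks whether the
# customer subscribed to a car magazine five years after time T.
customers = [
    {"credit": 15000, "age": 25, "comics": 1, "car_later": 1},
    {"credit": 14000, "age": 30, "comics": 1, "car_later": 1},
    {"credit": 16000, "age": 28, "comics": 1, "car_later": 0},
    {"credit": 9000,  "age": 24, "comics": 1, "car_later": 0},
    {"credit": 20000, "age": 45, "comics": 0, "car_later": 1},
]

def matches_conditions(c):
    """Conditions of the rule: credit > 13000, age 22 to 31, comics subscriber."""
    return c["credit"] > 13000 and 22 <= c["age"] <= 31 and c["comics"] == 1

def rule_confidence(rows):
    """Fraction of matching customers for whom the conclusion also holds."""
    matching = [c for c in rows if matches_conditions(c)]
    if not matching:
        return None  # The rule cannot be assessed on this data.
    return sum(c["car_later"] for c in matching) / len(matching)

confidence = rule_confidence(customers)
```

On this toy data three customers match the conditions and two of them later subscribe to a car magazine, so the confidence is 2/3. A real assessment would also look at how many customers the rule covers.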
Business-Question-Driven Process
[Figure: layered pyramid, bottom to top: Data Sources (Paper, Files, Information Providers, Database Systems, OLTP); Data Warehouses / Data Marts (OLAP, MDA); Data Exploration (Statistical Analysis, Querying and Reporting); Data Mining (Information Discovery); Data Presentation (Visualization Techniques); Making Decisions. Roles shown alongside the layers: DBA, Data Analyst, Business Analyst, End User.]
Text databases
The World-Wide Web
Big Data (NoSQL)