Data Mining: Concepts and Techniques
Chapter 1. Introduction
Evolution of Sciences
Over the last 50 years, most disciplines have grown a third, computational branch
(e.g., empirical, theoretical, and computational ecology, physics, or linguistics).
Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
The Internet and the computing Grid make all these archives universally accessible
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
1960s:
1970s:
1980s:
1990s:
2000s:
Alternative names
Data Warehouse
Data Cleaning
Data Integration
Databases
August 30, 2013
Input Data
Data Preprocessing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
Postprocessing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
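Normalization is one of the preprocessing steps listed above. The slides do not say which normalization is meant, so the following is only a minimal sketch that assumes min-max scaling onto [0, 1]. The function name `min_max_normalize` and the sample income values are illustrative assumptions, not taken from the source.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample attribute (annual income) used only for illustration.
incomes = [20_000, 35_000, 50_000, 80_000]
scaled = min_max_normalize(incomes)
print(scaled)
```

Scaling attributes to a common range keeps a large-valued attribute such as income from dominating distance-based steps like clustering.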
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining draws on: Applications, Algorithm, Pattern Recognition, Database Technology, Statistics, Visualization, High-Performance Computing
High-dimensionality of data
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia,
heterogeneous, legacy, WWW
Knowledge to be mined
Techniques utilized
Applications adapted
General functionality
Object-relational databases
Multimedia database
Text databases
Typical methods
Typical applications:
Cluster analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? One person's garbage could be another
person's treasure
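The definition of an outlier above can be illustrated with a simple z-score check. This is a minimal sketch, assuming that "general behavior" is summarized by the mean and standard deviation; the function name `find_outliers`, the threshold of 2.0, and the sample values are illustrative assumptions, and the slides do not prescribe this method.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # No spread at all, so nothing deviates from the general behavior.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up measurements: five similar readings and one that stands apart.
readings = [10, 12, 11, 13, 12, 95]
print(find_outliers(readings))
```

Whether a flagged value is noise to discard or an exception worth studying (for example, a fraudulent transaction) depends on the application.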
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could belong to multiple information networks: friends,
family, classmates, etc.
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, etc.
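The list above mentions PageRank as a way to analyze the Web as an information network. The following is a simplified power-iteration sketch, not the production algorithm; the tiny four-page graph, the damping factor of 0.85, and the iteration count are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each node to the list of nodes it points to; every
    target must also appear as a key.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                # Split this node's damped rank evenly over its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its damped rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: an edge means "page links to page".
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))
```

In this toy graph page "c" collects links from every other page and ends up with the highest score, while "d", which nothing links to, ends up with the lowest.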
Handling high-dimensionality
Other Applications
Where does the data come from? Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same characteristics: interest,
income level, spending habits, etc.
Determine customer purchasing patterns over time
Resource planning
Competition
Medical insurance
Retail industry
Anti-terrorism
Data Mining: Concepts and Techniques
Interestingness measures
Approaches
First generate all the patterns and then filter out the uninteresting
ones
Generate only the interesting patterns: mining query
optimization
Task-relevant data
Background knowledge
Schema hierarchy
Set-grouping hierarchy
Operation-derived hierarchy
Rule-based hierarchy
low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
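The rule-based hierarchy above can be read as a predicate over an item's price and cost. A minimal sketch, reading the rule's condition as price P1 minus cost P2 being under $50; the function signature, the "other" label, and the sample items are illustrative assumptions.

```python
def low_profit_margin(price, cost, threshold=50.0):
    """Return True when the margin (price - cost) is below the threshold."""
    return (price - cost) < threshold


# Made-up items as (price, cost) pairs, used only for illustration.
items = {"pen": (12.0, 9.5), "laptop": (900.0, 700.0)}

# Map each item to a higher-level concept using the rule.
labels = {
    name: "low_profit_margin" if low_profit_margin(price, cost) else "other"
    for name, (price, cost) in items.items()
}
print(labels)
```

Unlike a schema or set-grouping hierarchy, the higher-level concept here is computed from attribute values rather than looked up.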
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification
reliability or accuracy, certainty factor, rule strength, rule quality,
discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise threshold
(description)
Novelty
not previously known, surprising (used to remove redundant
rules, e.g., Illinois vs. Champaign rule implication support ratio)
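Two of the measures above, support (utility) and confidence (certainty), can be computed directly from counts. A minimal sketch that follows the formula given above, P(A|B) = #(A and B) / #(B); the function names and the market-basket data are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, a, b):
    """Estimate P(A|B) = #(A and B) / #(B) from transaction counts."""
    a, b = set(a), set(b)
    count_b = sum(1 for t in transactions if b <= set(t))
    count_ab = sum(1 for t in transactions if (a | b) <= set(t))
    # With no transaction containing B, the estimate is undefined; report 0.0.
    return count_ab / count_b if count_b else 0.0


# Made-up market baskets used only for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

print(support(baskets, {"diapers", "beer"}))
print(confidence(baskets, {"beer"}, {"diapers"}))
```

A pattern is usually kept only if both values clear user-chosen thresholds, which is one way to filter out uninteresting patterns.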
Motivation
Design
Loose coupling
Knowledge Base
Database or Data Warehouse Server
data cleaning, integration, and selection
Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories