State
Srinivasan Parthasarathy
DL 693
srini@cis.ohio-state.edu
Outline
Motivation
KDD process
Key techniques
Our Vision and Sample Projects
Motivation
Data mining (knowledge discovery in databases):
Extraction of interesting
[Figure: KDD process. Stage labels: Databases; Data Cleaning; Data Integration; Preprocessed Data; Selection; Data transformations; Task-relevant Data; Data Mining; Knowledge Interpretation]
Data Cleaning
Missing Data
Use existing data to extrapolate
Noisy Data
Compute the least noisy representation
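A minimal sketch of the two cleaning steps above, assuming a single numeric column. The slides do not name specific methods, so both choices here are assumptions: missing values are filled with the median of the observed values, and noise is reduced with a moving median. The function names and the sample readings are hypothetical.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to estimate from")
    fill = median(observed)
    return [fill if v is None else v for v in values]

def smooth(values, window=3):
    """Moving median: each point becomes the median of its neighbourhood."""
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]

# Hypothetical readings: one missing value and one obvious spike (90.0).
readings = [10.0, None, 12.0, 90.0, 11.0, 13.0]
filled = fill_missing(readings)
cleaned = smooth(filled)
```

The median is used in both steps because a single spike would distort a mean-based fill or a moving average.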
B. Method/Algorithm Selection
Wide range of methods available depending
on needs
Incorporating Background knowledge: concept
hierarchies, user's knowledge base.
Interestingness measurements: significance,
confidence, thresholds, abstraction levels, etc.
Performance
Copyright 2009, The Ohio State University
Association rule mining:
Finding associations or correlations among a set of
items or objects in transaction databases,
relational databases, and data warehouses.
[Table: People (Dan, Kathy, Chuck, Bob) by Items, with cells marked 1; item columns (one labeled Beer) and cell positions are not recoverable]
Applications: Basket
Example rule:
buys(x, diapers) => buys(x, beers) [50%, 100%]
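A small sketch of how the two bracketed numbers on the diapers/beers rule can be computed, assuming they follow the usual [support, confidence] convention (the slide does not label them). The baskets are hypothetical and were chosen so that the rule scores 50% support and 100% confidence.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by
    the support of the antecedent alone."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical transaction database (not the table from the slides).
baskets = [
    {"diapers", "beers"},
    {"diapers", "beers", "milk"},
    {"milk"},
    {"beers"},
]
s = support(baskets, {"diapers", "beers"})       # 2 of 4 baskets
c = confidence(baskets, {"diapers"}, {"beers"})  # every diapers basket has beers
```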
Step 2B: Algorithm Selection
Sampling-based methods: fast, approximate results
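To illustrate the trade-off, here is a sketch that estimates the support of an itemset from a random sample instead of scanning every transaction. The data is synthetic, and the function names, sample size, and seed are assumptions made for the example.

```python
import random

def support(transactions, itemset):
    """Exact fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def sampled_support(transactions, itemset, sample_size, seed=0):
    """Estimate support from a uniform random sample (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    k = min(sample_size, len(transactions))
    return support(rng.sample(transactions, k), itemset)

# Synthetic data: 3 of every 10 transactions contain both items.
transactions = [
    frozenset({"diapers", "beers"}) if i % 10 < 3 else frozenset({"milk"})
    for i in range(10_000)
]
exact = support(transactions, {"diapers", "beers"})
estimate = sampled_support(transactions, {"diapers", "beers"}, sample_size=1_000)
```

The estimate touches a tenth of the data, at the cost of a sampling error that shrinks as the sample grows.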
Step 3: Knowledge
Interpretation
A. Post Processing of mining results
When you have too many patterns, you need to:
Order them using some interestingness metric
Pass them to the visualization tool incrementally
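The two post-processing steps above can be sketched as a generator that sorts patterns by a score and hands them over in small batches. The rules, the use of confidence as the interestingness metric, and the batch size are all hypothetical.

```python
from itertools import islice

def ranked_batches(patterns, score, batch_size):
    """Yield patterns in batches, most interesting first."""
    ordered = iter(sorted(patterns, key=score, reverse=True))
    while True:
        batch = list(islice(ordered, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical mined rules as (rule, confidence) pairs.
rules = [
    ("A => B", 0.60),
    ("C => D", 0.95),
    ("E => F", 0.80),
    ("G => H", 0.40),
    ("I => J", 0.70),
]
batches = list(ranked_batches(rules, score=lambda r: r[1], batch_size=2))
```

A visualization tool would consume one batch at a time rather than materializing the full list as done here for inspection.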
B. Visualization
Render the patterns in an easy-to-use intuitive
manner
Highlight most relevant patterns
T2: Classification
Data categorization based on a set of
training objects.
Applications: credit approval, target
marketing, medical diagnosis, treatment
effectiveness analysis, automatic text
categorization etc.
Goal: Develop a description for each
class.
The description supports:
classification of future test data,
better understanding of each class, and
prediction of certain properties.
[Figure: Engine data example. Recoverable labels: horsepower (<110, <125), Miles/gallon (>21), #cylinders (4, 6)]
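As a rough illustration of learning a class description from training objects, the sketch below fits a single threshold test on horsepower, in the spirit of the threshold labels in the engine data example. It is far simpler than a full classifier and assumes two classes; the records, field names, and class labels are invented for the example.

```python
def learn_stump(records, feature, label):
    """Find the threshold on `feature` that best separates two classes.

    Returns (threshold, class_below, class_at_or_above, training_errors),
    or None if the feature has fewer than two distinct values.
    """
    values = sorted({r[feature] for r in records})
    labels = sorted({r[label] for r in records})
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for below, above in ((labels[0], labels[-1]), (labels[-1], labels[0])):
            errors = sum(
                1 for r in records
                if (below if r[feature] < t else above) != r[label]
            )
            if best is None or errors < best[3]:
                best = (t, below, above, errors)
    return best

def classify(stump, value):
    """Apply the learned threshold test to a new value."""
    t, below, above, _ = stump
    return below if value < t else above

# Hypothetical training objects (not taken from the slides).
engines = [
    {"horsepower": 70, "mpg_class": "high"},
    {"horsepower": 90, "mpg_class": "high"},
    {"horsepower": 105, "mpg_class": "high"},
    {"horsepower": 130, "mpg_class": "low"},
    {"horsepower": 150, "mpg_class": "low"},
    {"horsepower": 200, "mpg_class": "low"},
]
stump = learn_stump(engines, "horsepower", "mpg_class")
```

The learned test doubles as a readable class description ("horsepower below the threshold means high mileage") and as a predictor for unseen engines via `classify`.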
Clustering:
Other Techniques
Similarity Analysis
How close are two entities?
Temporal/Sequence Analysis
Trend/Deviation analysis, Event-based
analysis
Statistical techniques
Regression, Bayesian etc.
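For the similarity question ("How close are two entities?"), two common measures are sketched below: Jaccard similarity for sets and cosine similarity for numeric vectors. The slides do not say which measures are meant, so these are illustrative choices.

```python
import math

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

basket_sim = jaccard({"beer", "diapers"}, {"beer", "milk"})  # 1 shared of 3 total
orthogonal = cosine([1, 0], [0, 1])
parallel = cosine([1, 2], [2, 4])
```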
Vision
Our goal is to extract novel, interpretable and actionable
knowledge from data efficiently.
1. Algorithms to extract novel forms of knowledge
2. Interpretable and actionable: visual analytics, domain dependent
3. Efficiency: search space pruning, parallel and distributed algorithms,
architecture-conscious solutions
Is there a problem?
Why?
1. Data intensive: memory wall, high latency
2. Irregular nature: data and parameter driven, difficult to predict
3. Reliance on pointer-based data structures: poor ILP
4. Multi-core: even harder
5. Compilers/runtime systems have a hard time coping with 1-4.
Data intensive
Latency to main memory
Pressure on memory bandwidth
Must maintain small memory footprints
and working sets
[Figure: memory hierarchy with Core 0, Core 1, L2 Cache, System Bus, and System Memory]
Benefits:
Benefits:
Requirements
Requirements
Sensing the problem
Re-architecting algorithm
Task decomposition
State maintenance
[Figure: two plots with axis label Minimum support and power-of-ten tick labels; plotted values not recoverable]
Open Problems
What kind of services should we consider building?
Knowledge Caching? I/O services? Placement services?
Problem Domain(s)
[Figure: Protein-protein interactions in yeast (Jeong et al., 2001)]
Interaction Networks
Nodes represent entities
Edges represent
interactions among
entities
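One simple way to hold such a network in memory is an adjacency structure, sketched below. It assumes interactions are undirected and unweighted; the entity names and interactions are hypothetical.

```python
from collections import defaultdict

def build_network(interactions):
    """Map each node to the set of nodes it interacts with."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        adjacency[a].add(b)
        adjacency[b].add(a)  # undirected: record both directions
    return dict(adjacency)

# Hypothetical interactions between four entities.
interactions = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
network = build_network(interactions)
degrees = {node: len(neighbours) for node, neighbours in network.items()}
```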
Examples Abound:
Biological Networks
Collaboration/Friendship
networks
Challenges
Community Discovery
Scale
Dynamic Nature
Visualization
[Figure: Physicist collaboration network (Newman and Girvan, 2004)]
Scalability?
Generative Models?
Application Specific Challenges
E.g. citizen sensing in Twitter (providing context)
NAV Architecture
[Figure: diagram with node labels C1, C2, C 31, C22, C4, C5, C6; layout not recoverable]
[Figure labels: Partial clique (two occurrences)]
Peaks in the SMD CSV plot represent highly
cohesive stocks
[Figure panels: Original Graph G; Static Layout of G, for comparative purposes; Refined Graph G]
[Figure: client/server architecture. Labels: Technicians, Analysts, Clinician, CLIENT, SERVER, VIRTUAL DATASPACE, DAG, Array, List, MRI Data, Patient DB, GENE DATABASES, Expression data, Sequence data]
Biomedical Applications
Problem (joint with M. Twa et al.)
Classification of normal vs.
keratoconic patients
Patient data + examination data
Results
90-95% accuracy, easy to interpret,
visual representation possible!
Conclusions
KDD is an iterative and interactive process,
the goal of which is to extract interesting
and actionable information from
potentially large data stores efficiently
We have active projects in:
Architecture Conscious Data Management
and Mining
Visual Network Analytics and Management
Biomedical Informatics
RA positions available
Courses of interest
788G02/788J02/888J02/888Y11
674 Introduction to Data Mining (Spring 2010)
Contact Information
Office: DL693
Email: srini@cse.ohio-state.edu
Phone: 292-2568
Web: www.cse.ohio-state.edu/~srini
Data Mining Research Lab:
www.cse.ohio-state.edu/dmrl/
Feel free to stop by and talk
Questions?