
Data Mining Research at Ohio State
Srinivasan Parthasarathy
DL 693
srini@cis.ohio-state.edu

Copyright 2009, The Ohio State University

Outline
Motivation
KDD process
Key techniques
Our Vision and Sample Projects


Motivation
Data mining (knowledge discovery in databases): extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
The data explosion problem: computerized data collection tools and mature database technology lead to tremendous amounts of data stored in databases.
We are drowning in data, but starving for knowledge!

Why Data Mining?
Potential Applications
Business Applications
Stock analysis, consumer/competitor analysis -> business edge
Scientific Applications
Bioinformatics, mining scientific simulations -> enabling novel scientific discovery
Security Applications
Intrusion detection, privacy-preserving mining -> security and privacy of data
Sports and Entertainment
Professional baseball/basketball -> competitive edge

A MULTI-BILLION DOLLAR INDUSTRY

KDD: Major Issues
Diversity of data mining tasks: summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.
Diversity of data: relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.
Efficiency and scalability.
Expression and visualization of data mining results.
Social issues (security and privacy).

Knowledge Discovery Process
Data mining: the core of the knowledge discovery process.
[Figure: the KDD pipeline — Databases -> Data Integration -> Data Cleaning -> Selection -> Preprocessed Data -> Data Transformations -> Task-relevant Data -> Data Mining -> Knowledge Interpretation]

Step 1: Data Preprocessing
Data Selection
Select the relevant attributes for representation
Data Cleaning
Missing data: use existing data to extrapolate
Noisy data: compute the least noisy representation
Input-format related transformations
E.g., discretization of continuous attributes for association mining
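The discretization step above can be sketched with a few lines of code. This is only an illustration: the attribute (age), bin edges, and labels are made-up assumptions, not values from any dataset in this talk.

```python
# Hypothetical sketch: map a continuous attribute (age) to labeled bins
# so an association miner can treat it as a discrete item.

def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]          # value beyond the last edge

edges = [18, 35, 60]           # illustrative bin boundaries
labels = ["minor", "young", "middle", "senior"]

ages = [12, 25, 47, 70]
print([discretize(a, edges, labels) for a in ages])
# ['minor', 'young', 'middle', 'senior']
```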

Step 2: Data Mining
Specification
A. Mining task selection: identify the kinds of knowledge to be mined. Choose among techniques such as association mining, clustering, time-series analysis, and classification.
B. Method/algorithm selection
Wide range of methods available depending on needs
Incorporating background knowledge: concept hierarchies, the user's knowledge base
Interestingness measurements: significance, confidence, thresholds, abstraction levels, etc.
Performance

Step2A: Task Selection


Assocation

rule mining:

Finding

associations or
correlations among a set of
items or objects in
transaction databases,
relational databases, and data
warehouses.

Items
People
Dan

Cheese Diapers Eggs

Kathy

1
1

Chuck

Bob

1
1

1
1

Example

Applications:
Basket

Beer

data analysis, crossmarketing, catalog design,


clustering, etc.

form: LHS RHS


[support, confidence].

Rule

diapers) buys(x,
beers) [50%, 100%]

buys(x,

Copyright 2009, The Ohio State University
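The support/confidence notation above can be made concrete with a small sketch. The transactions below are made up to mirror the slide's diapers/beer example, not real basket data.

```python
# Hypothetical toy market-basket data (item names follow the slide).
transactions = [
    {"diapers", "beer", "eggs"},
    {"diapers", "beer", "cheese"},
    {"cheese", "eggs"},
    {"diapers", "cheese"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimated P(rhs | lhs): support of the whole rule over the LHS."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"diapers", "beer"}, transactions))      # 0.5
print(confidence({"diapers"}, {"beer"}, transactions)) # 2/3
```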

Step 2B: Algorithm Selection
Data-oriented: Boolean vs. quantitative associations
Association on discrete vs. continuous data
Result-oriented: single-level vs. multiple-level analysis
E.g., [Coors, Huggies] or [Beer, Diapers]
Result-oriented: simple vs. constraint-based
E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
Performance-oriented selection
Scalable parallel and sequential algorithms
Sampling-based methods for fast approximate results
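The sampling idea above can be sketched in a few lines. This is our illustration of the general approach, not any specific algorithm from the slides; the synthetic transactions and the 30% planted support are assumptions.

```python
import random

# Sketch: estimate itemset support from a random sample of transactions
# instead of a full database scan (fast, approximate).
random.seed(0)

def approx_support(itemset, transactions, sample_size):
    sample = random.sample(transactions, sample_size)
    return sum(1 for t in sample if itemset <= t) / sample_size

# 10,000 synthetic transactions; {a, b} appears in roughly 30% of them.
transactions = [({"a", "b"} if random.random() < 0.3 else {"a"}) | {"c"}
                for _ in range(10_000)]

est = approx_support({"a", "b"}, transactions, 500)
true = sum(1 for t in transactions if {"a", "b"} <= t) / len(transactions)
print(abs(est - true) < 0.1)   # close with high probability at n=500
```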

Step 3: Knowledge Interpretation
A. Post-processing of mining results
When you have too many patterns, you need to:
Order them using some interestingness metric
Pass them to the visualization tool incrementally
B. Visualization
Render the patterns in an easy-to-use, intuitive manner
Highlight the most relevant patterns
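The ordering step above might look as follows; lift is used here as one example interestingness metric, and the rules and their statistics are invented for illustration.

```python
# Hypothetical post-processing: rank mined rules by lift before handing
# them to a visualization tool in batches.
rules = [  # (lhs, rhs, support, confidence, rhs_support)
    ("diapers", "beer",   0.50, 0.66, 0.50),
    ("bread",   "butter", 0.40, 0.55, 0.60),
    ("milk",    "eggs",   0.30, 0.45, 0.50),
]

def lift(rule):
    _, _, _, conf, rhs_sup = rule
    return conf / rhs_sup      # > 1 means LHS and RHS are positively correlated

ranked = sorted(rules, key=lift, reverse=True)
print([r[0] for r in ranked])  # ['diapers', 'bread', 'milk']
```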

Step 3B: Visualization


T2: Classification
Data categorization based on a set of training objects.
Applications: credit approval, target marketing, medical diagnosis, treatment-effectiveness analysis, automatic text categorization, etc.
Goal: develop a description for each class, enabling:
classification of future test data,
better understanding of each class, and
prediction of certain properties.
[Figure: decision tree on an engine-data example, splitting miles/gallon (> 21) on horsepower (< 110, < 125) and #cylinders (4, 6)]
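The decision-tree idea in the engine example can be sketched with a one-split tree (a decision stump) learned from training objects. The horsepower values and labels below are invented for illustration, not the slide's actual dataset.

```python
# Minimal sketch of classification from training objects: learn a
# single-feature threshold (a depth-1 decision tree) predicting whether
# an engine gets > 21 mpg.

def learn_stump(xs, ys):
    """Find the threshold on one feature that minimizes training error."""
    best = None
    for t in sorted(set(xs)):
        preds = [1 if x < t else 0 for x in xs]
        err = sum(p != y for p, y in zip(preds, ys))
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

horsepower = [90, 100, 105, 120, 150, 200]   # hypothetical training data
over_21mpg = [1, 1, 1, 0, 0, 0]              # class label per object

threshold = learn_stump(horsepower, over_21mpg)
classify = lambda hp: 1 if hp < threshold else 0
print(threshold, classify(95), classify(180))  # 120 1 0
```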

T3: Data Clustering Analysis
Clustering: partitioning a set of data (or objects) into a set of classes, called clusters, such that members of each class share some interesting common properties.
Sample uses: gene clustering, customer segmentation, and image analysis
High-quality clusters:
the intra-class similarity is high
the inter-class similarity is low
Measuring data clustering quality: distance functions
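A tiny 1-D k-means loop illustrates the partitioning idea above (high intra-class similarity, low inter-class similarity). This is our own sketch under made-up data, not an algorithm from the slides.

```python
import statistics

# 1-D k-means sketch: alternate assigning points to the nearest center
# and recomputing each center as its cluster mean.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:  # assign each point to its nearest center
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        centers = [statistics.mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]      # two obvious groups
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
print(sorted(centers))   # approximately [1.0, 8.0]
```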

Other Techniques
Similarity analysis
How close are two entities?
Temporal/sequence analysis
Trend/deviation analysis, event-based analysis
Generalized frequent patterns
Structure discovery
Statistical techniques
Regression, Bayesian, etc.

Outline
Motivation
KDD process
Key techniques
Our Vision and Sample Projects


Vision
Our goal is to extract novel, interpretable, and actionable knowledge from data efficiently.
1. Algorithms to extract novel forms of knowledge
2. Interpretable and actionable: visual analytics, domain dependent
3. Efficiency: search-space pruning, parallel and distributed algorithms, architecture-conscious solutions

Architecture-Conscious Data Mining and Management
Sample Projects

Are We Utilizing Architectures Efficiently?
Is there a problem?
Many state-of-the-art mining algorithms grossly underutilize processor resources [Ghoting 2005]
Often below 10% utilization
Why?
1. Data intensive: memory wall, high latency
2. Irregular nature: data and parameter driven, difficult to predict
3. Reliance on pointer-based data structures: poor ILP
4. Multi-core is even harder
5. Compilers/runtime systems have a hard time coping with 1-4.

Cache-, Memory-, and I/O-Conscious Algorithms
The idea is to re-architect the algorithm to leverage features of the architecture
Prefetching
Simultaneous multithreading
Locality enhancements
Gains in performance can be staggering: up to a 400-fold improvement!
For networks of workstations: minimize communication and leverage remote memory
Enables mining of terabyte-scale distributed datasets efficiently

The Multicore Challenge
Challenges for data analysis applications
Irregular in nature
Hard to estimate the access pattern
Must maintain good data locality
Data intensive
Latency to main memory
Pressure on memory bandwidth
Must maintain small memory footprints and working sets
Workload estimation is difficult
Data- and parameter-dependent tasks
Must maintain load balance among cores
[Figure: dual-core chip — Core 0 and Core 1 sharing an L2 cache, connected over the system bus to system memory]

Key Idea: Adaptive Algorithms
Key idea: trading off memory for redundant computation
Benefits:
Reduced working-set sizes
Reduced bandwidth pressure
Reduced cache-miss rates (CoS)
Requirements:
Sensing the problem
Re-architecting the algorithm to reduce memory consumption

Key idea: moldable partitioning and adaptive scheduling of tasks
Benefits:
Better CPU utilization
Utilizing the strengths of the CMP
Requirements:
Sensing the problem
Re-architecting the algorithm
Task decomposition
State maintenance

Sample Benefits: Tree Mining
[Figure: memory footprint (MB) and run time (sec), log scale, vs. minimum support from 2% down to 1%]
Memory footprint of MCT is constant
660-fold reduction against TreeMiner, 2300-fold against iMB3, 300-fold w.r.t. TRIPS
Improved locality leads to smaller run times
7200-fold speedup against TreeMiner, 66-fold against iMB3, 5-fold w.r.t. TRIPS
Linear speedups on 8- and 16-core systems.

An Esoteric CMP: The Sony PlayStation
Streaming workloads are 12-75X faster on the Cell vs. single cores, and 3-6X faster than other CMPs
Naïve PageRank is 20X faster on the Quad-Core Xeon!
Cell 8-SPU times are from 6-SPU PlayStation execution times scaled to 8 cores based on simulator results
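A naive PageRank such as the one benchmarked above boils down to a power-iteration loop over an adjacency list. This sketch is our illustration of that workload on a tiny made-up graph, not the benchmarked code.

```python
# Naive PageRank by power iteration over an adjacency list.

def pagerank(links, damping=0.85, iters=50):
    n = len(links)
    rank = {node: 1.0 / n for node in links}
    for _ in range(iters):
        new = {node: (1.0 - damping) / n for node in links}
        for node, outs in links.items():
            share = damping * rank[node] / len(outs)
            for dst in outs:               # distribute rank along out-links
                new[dst] += share
        rank = new
    return rank

# Tiny hypothetical link graph (every node has out-links).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # "c": it receives links from both a and b
```

The irregular, pointer-chasing access pattern of the inner loop is exactly what makes this workload a poor fit for streaming-oriented cores, which is the slide's point.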

Data Mining on the Cloud
Targeted at reducing operational cost and improving productivity
Service-oriented architectures
Plug-and-play semantics
Open problems
What kinds of services should we consider building?
Knowledge caching? I/O services? Placement services?
Hadoop and MapReduce target productivity and availability, but what about efficiency of resource utilization? Can we do better?
Is MapReduce the right interface? Is it enough?
[Figure: cloud clients ranging from businesses (startups to enterprises) to 4+ billion phones by 2010 [Source: Nokia] and Web 2.0-enabled PCs, TVs, etc.]
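The MapReduce interface questioned above can be summarized in-process with the classic word-count example; real Hadoop jobs distribute the same map/shuffle/reduce phases across a cluster. This is a conceptual sketch, not Hadoop code.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Emit (key, value) pairs from one input record."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group all values by key (the framework's job in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key."""
    return key, sum(values)

docs = ["data mining on the cloud", "mining data at scale"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["mining"], counts["data"])  # 2 2
```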

Data Mining in a Flash
Flash drives have the potential to really change the landscape, especially for out-of-core algorithms
Almost an order of magnitude faster than traditional hard drives for random reads and sequential writes
Significantly lower energy costs
The technology is still in relatively early stages
Costs are currently prohibitive
Predicted to be less of an issue 3 to 4 years down the road
Algorithmic/systemic challenges for data mining algorithms
Wear-leveling problem
Random writes
How do we work around these to realize performance commensurate with this technology?

Energy-Conscious Data Centers
Energy is clearly a major economic and environmental problem
Huge costs are spent on managing this problem
E.g., US national laboratories, Amazon/Google data centers
Some open problems
Given mining algorithms A and B, which approach is more energy efficient?
Given the choice to implement an algorithm on the STI Cell or an Intel multicore, which do we choose?
Can we leverage recent ideas from the architecture community (e.g., reconfigurable caches)?
Given multiple tasks, how can we co-locate/schedule related tasks to lower energy costs?
How can we leverage architectural features such as underclocking and DVFS?

Active Mining on Streaming Data
Crucial issue: data influx rate exceeds processing rate
A problem for time-critical applications (e.g., network intrusion detection)
Data stratification: reduce rows (sampling), reduce columns (PCA)
Process incrementally, minimize access to original data
Systems support
Memory placement, compression, disk filters, data manipulation
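The "reduce rows" idea can be sketched with reservoir sampling, which keeps a fixed-size uniform sample while touching each stream record exactly once; the slide does not name a specific algorithm, so this is an illustrative choice.

```python
import random

# Reservoir sampling (Algorithm R): maintain a uniform sample of size k
# over a stream of unknown length, one pass, O(k) memory.
random.seed(1)

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # replace with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=100)
print(len(sample))  # 100, no matter how long the stream runs
```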

Towards Visual Network Analytics
Sample Projects

Problem Domain(s)
Interaction networks
Nodes represent entities
Edges represent interactions among entities
Examples abound:
Biological networks, e.g., protein-protein interactions in yeast (Jeong et al., 2001)
Collaboration/friendship networks, e.g., the physicist collaboration network (Newman and Girvan, 2004)
Challenges
Community discovery
Scale
Dynamic nature
Visualization

Questions & Challenges
How to extract modular structure?
Common functional proteins, stable collaboratories, etc.
What characterizes the stability of groups over time?
What are the behavioral characteristics of nodes and communities?
Which nodes are influential, which are bridging?
What are the relationships among communities?
How to visualize?
Mental map, handling dynamic updates, the pixel-wall challenge
Scalability?
Generative models?
Application-specific challenges
E.g., citizen sensing on Twitter (providing context)

NAV Architecture


Dynamic Analysis Framework
Community detection
MLR-MCL (KDD09)
Viewpoints (KDD09)
Graph partitioning (Metis)
CSV (SIGMOD08)
Event detection (KDD07, TKDD09)
Entity-driven events
Community-driven events
Composing behavioral measures
Stability, sociability, influence
Visual analysis and inference
Dynamic layout
Density plots
[Figure: communities C1-C6 evolving across timesteps]

Visualization: Overview First (Coarsened View)

Zoom and Filter


Event View (Importance of Ranking)


Split: Details on Demand (ironic example)

Merge (Philosophy + Logic)


Dynamic Details (Sociability + Influence)

Density (CSV) Plots
Computing density plots efficiently was identified by the SIGMOD keynote on Extreme Visualization as an important grand-challenge problem
Density plots can help quickly localize dense subgraphs hidden within a large graph
The challenge is to compute them efficiently

SMD: Stock Market Data
[Figure: CSV plot of stock market data, annotated with a bridging vertex and two partial cliques]
Peaks in the SMD CSV plot represent highly cohesive stocks

Dynamic Layout Strategy: Preliminary Ideas
[Figure: original graph G; deleting node 4 and propagating updates, housed within an R-tree, yields the refined graph G; a static layout of G is shown for comparative purposes]

Dynamic Graph Layout: Early Results
Enron dataset
The energy profile of the static (from-scratch) layout is very similar to our dynamic variant
The dynamic variant maintains a better mental map (not shown)
The dynamic variant is also more efficient (up to 40% more efficient)

Global Graphs: Managing Large Graphs
[Figure: client-server architecture — technicians, analysts, and clinicians access MRI data and a patient DB through a virtual dataspace (DAG, array, list structures) backed by servers and gene databases (expression data, sequence data)]

Biomedical Applications
Problem (joint with M. Twa et al.)
Classification of normal vs. keratoconic patients
Patient data + examination data
Solution [SIAM 2003]
Embedding the science
Need to model the shape and structure of the cornea
Zernike representation
Empirically determine the polynomial order
Apply an easy-to-interpret classification model
Decision-tree model
Results
90-95% accuracy, easy to interpret, visual representation possible!

Mining Protein Structure
Protein substructure detection
Protein network analysis
Embedding the science
Need to model the structure-activity relationship
Lots of self-repeating structures: functional role?
Goal: find self-repeating structures
Distance-based structure representation
Frequency-based pattern identification
Fuzzy hashing for handling noisy data
Key results so far
Can detect multi-level tertiary substructures
The same ideas are applicable to other scientific domains
MD simulation data
Structural similarity in drugs

Conclusions
KDD is an iterative and interactive process whose goal is to extract interesting and actionable information from potentially large data stores efficiently
We have active projects in:
Architecture-conscious data management and mining
Visual network analytics and management
Biomedical informatics

KDD and You
Prospects
Interesting problems spanning a broad range of topics
Statistics, databases, parallel and distributed systems, combinatorics, machine learning, etc.
Young, emerging field
Scope for giant strides, high-impact work
RA positions available
Courses of interest
788G02/788J02/888J02/888Y11
674 Introduction to Data Mining (Spring 2010)

Past DMRL Student Successes
10 PhD and 12 MS students graduated
Half in academia, half in industry
Winner of several competitive fellowships
2 Computing Innovation Fellowships in 2009
1 MSR Fellowship in 2007
1 IBM Fellowship in 2006
1 NSF Fellowship in 2005
4 straight departmental research awards
6 awards and 9 nominations for best paper from top conferences (all joint with students, each one with a different student as lead)
High-impact work in top forums.

Contact Information

Office: DL693
Email: srini@cse.ohio-state.edu
Phone: 292-2568
Web: www.cse.ohio-state.edu/~srini
Data Mining Research Lab:
www.cse.ohio-state.edu/dmrl/
Feel free to stop by and talk


Questions?

