
Data Mining Research at Ohio State
Srinivasan Parthasarathy
DL 693
srini@cis.ohio-state.edu

Copyright 2009, The Ohio State University

Outline
Motivation
KDD process
Key techniques
Our Vision and Sample Projects


Motivation
Data mining (knowledge discovery in databases): extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
The data explosion problem: computerized data collection tools and mature database technology lead to tremendous amounts of data stored in databases.
We are drowning in data, but starving for knowledge!

Why Data Mining?
Potential Applications
Business Applications
Stock analysis, consumer/competitor analysis -> business edge
Scientific Applications
Bioinformatics, mining scientific simulations -> enabling novel scientific discovery
Security Applications
Intrusion detection, privacy-preserving mining -> security and privacy of data
Sports and Entertainment
Professional baseball/basketball -> competitive edge

A MULTI-BILLION DOLLAR INDUSTRY

KDD: Major Issues
Diversity of data mining tasks: summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.
Diversity of data: relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.
Efficiency and scalability.
Expression and visualization of data mining results.
Social issues (security and privacy).

Knowledge Discovery Process
Data mining: the core of the knowledge discovery process.
[Figure: the KDD pipeline — Databases -> Data Integration -> Data Cleaning -> Selection -> Preprocessed Data -> Data Transformations -> Task-relevant Data -> Data Mining -> Knowledge Interpretation]

Step 1: Data Preprocessing
Data Selection
Select the relevant attributes for representation
Data Cleaning
Missing data: use existing data to extrapolate
Noisy data: compute the least noisy representation
Input-format related transformations
E.g., discretization of continuous attributes for association mining
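The discretization step above can be sketched with a few lines of code. This is only an illustration: the attribute (age), bin edges, and labels are made-up assumptions, not values from any dataset in this talk.

```python
# Hypothetical sketch: map a continuous attribute (age) to labeled bins
# so an association miner can treat it as a discrete item.

def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]          # value beyond the last edge

edges = [18, 35, 60]           # illustrative bin boundaries
labels = ["minor", "young", "middle", "senior"]

ages = [12, 25, 47, 70]
print([discretize(a, edges, labels) for a in ages])
# ['minor', 'young', 'middle', 'senior']
```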

Step 2: Data Mining
Specification
A. Mining task selection: identify the kinds of knowledge to be mined. Choose among techniques such as association mining, clustering, time-series analysis, and classification.
B. Method/algorithm selection
Wide range of methods available depending on needs
Incorporating background knowledge: concept hierarchies, the user's knowledge base
Interestingness measurements: significance, confidence, thresholds, abstraction levels, etc.
Performance

Step2A: Task Selection


Assocation

rule mining:

Finding

associations or
correlations among a set of
items or objects in
transaction databases,
relational databases, and data
warehouses.

Items
People
Dan

Cheese Diapers Eggs

Kathy

1
1

Chuck

Bob

1
1

1
1

Example

Applications:
Basket

Beer

data analysis, crossmarketing, catalog design,


clustering, etc.

form: LHS RHS


[support, confidence].

Rule

diapers) buys(x,
beers) [50%, 100%]

buys(x,

Copyright 2009, The Ohio State University
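The support/confidence notation above can be made concrete with a small sketch. The transactions below are made up to mirror the slide's diapers/beer example, not real basket data.

```python
# Hypothetical toy market-basket data (item names follow the slide).
transactions = [
    {"diapers", "beer", "eggs"},
    {"diapers", "beer", "cheese"},
    {"cheese", "eggs"},
    {"diapers", "cheese"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimated P(rhs | lhs): support of the whole rule over the LHS."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"diapers", "beer"}, transactions))      # 0.5
print(confidence({"diapers"}, {"beer"}, transactions)) # 2/3
```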

Step 2B: Algorithm Selection
Data-oriented: Boolean vs. quantitative associations
Association on discrete vs. continuous data
Result-oriented: single-level vs. multiple-level analysis
E.g., [Coors, Huggies] or [Beer, Diapers]
Result-oriented: simple vs. constraint-based
E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
Performance-oriented selection
Scalable parallel and sequential algorithms
Sampling-based methods for fast approximate results
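The sampling idea above can be sketched in a few lines. This is our illustration of the general approach, not any specific algorithm from the slides; the synthetic transactions and the 30% planted support are assumptions.

```python
import random

# Sketch: estimate itemset support from a random sample of transactions
# instead of a full database scan (fast, approximate).
random.seed(0)

def approx_support(itemset, transactions, sample_size):
    sample = random.sample(transactions, sample_size)
    return sum(1 for t in sample if itemset <= t) / sample_size

# 10,000 synthetic transactions; {a, b} appears in roughly 30% of them.
transactions = [({"a", "b"} if random.random() < 0.3 else {"a"}) | {"c"}
                for _ in range(10_000)]

est = approx_support({"a", "b"}, transactions, 500)
true = sum(1 for t in transactions if {"a", "b"} <= t) / len(transactions)
print(abs(est - true) < 0.1)   # close with high probability at n=500
```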

Step 3: Knowledge Interpretation
A. Post-processing of mining results
When you have too many patterns, you need to:
Order them using some interestingness metric
Pass them to the visualization tool incrementally
B. Visualization
Render the patterns in an easy-to-use, intuitive manner
Highlight the most relevant patterns
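The ordering step above might look as follows; lift is used here as one example interestingness metric, and the rules and their statistics are invented for illustration.

```python
# Hypothetical post-processing: rank mined rules by lift before handing
# them to a visualization tool in batches.
rules = [  # (lhs, rhs, support, confidence, rhs_support)
    ("diapers", "beer",   0.50, 0.66, 0.50),
    ("bread",   "butter", 0.40, 0.55, 0.60),
    ("milk",    "eggs",   0.30, 0.45, 0.50),
]

def lift(rule):
    _, _, _, conf, rhs_sup = rule
    return conf / rhs_sup      # > 1 means LHS and RHS are positively correlated

ranked = sorted(rules, key=lift, reverse=True)
print([r[0] for r in ranked])  # ['diapers', 'bread', 'milk']
```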

Step 3B: Visualization


T2: Classification
Data categorization based on a set of training objects.
Applications: credit approval, target marketing, medical diagnosis, treatment-effectiveness analysis, automatic text categorization, etc.
Goal: develop a description for each class, enabling:
classification of future test data,
better understanding of each class, and
prediction of certain properties.
[Figure: decision tree on an engine-data example, splitting miles/gallon (> 21) on horsepower (< 110, < 125) and #cylinders (4, 6)]
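The decision-tree idea in the engine example can be sketched with a one-split tree (a decision stump) learned from training objects. The horsepower values and labels below are invented for illustration, not the slide's actual dataset.

```python
# Minimal sketch of classification from training objects: learn a
# single-feature threshold (a depth-1 decision tree) predicting whether
# an engine gets > 21 mpg.

def learn_stump(xs, ys):
    """Find the threshold on one feature that minimizes training error."""
    best = None
    for t in sorted(set(xs)):
        preds = [1 if x < t else 0 for x in xs]
        err = sum(p != y for p, y in zip(preds, ys))
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

horsepower = [90, 100, 105, 120, 150, 200]   # hypothetical training data
over_21mpg = [1, 1, 1, 0, 0, 0]              # class label per object

threshold = learn_stump(horsepower, over_21mpg)
classify = lambda hp: 1 if hp < threshold else 0
print(threshold, classify(95), classify(180))  # 120 1 0
```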

T3: Data Clustering Analysis
Clustering: partitioning a set of data (or objects) into a set of classes, called clusters, such that members of each class share some interesting common properties.
Sample uses: gene clustering, customer segmentation, and image analysis
High-quality clusters:
the intra-class similarity is high
the inter-class similarity is low
Measuring data clustering quality: distance functions
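A tiny 1-D k-means loop illustrates the partitioning idea above (high intra-class similarity, low inter-class similarity). This is our own sketch under made-up data, not an algorithm from the slides.

```python
import statistics

# 1-D k-means sketch: alternate assigning points to the nearest center
# and recomputing each center as its cluster mean.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:  # assign each point to its nearest center
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        centers = [statistics.mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]      # two obvious groups
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
print(sorted(centers))   # approximately [1.0, 8.0]
```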

Other Techniques
Similarity analysis
How close are two entities?
Temporal/sequence analysis
Trend/deviation analysis, event-based analysis
Generalized frequent patterns
Structure discovery
Statistical techniques
Regression, Bayesian, etc.

Outline
Motivation
KDD process
Key techniques
Our Vision and Sample Projects


Vision
Our goal is to extract novel, interpretable, and actionable knowledge from data efficiently.
1. Algorithms to extract novel forms of knowledge
2. Interpretable and actionable: visual analytics, domain dependent
3. Efficiency: search-space pruning, parallel and distributed algorithms, architecture-conscious solutions

Architecture-Conscious Data Mining and Management
Sample Projects

Are We Utilizing Architectures Efficiently?
Is there a problem?
Many state-of-the-art mining algorithms grossly underutilize processor resources [Ghoting 2005]
Often below 10% utilization
Why?
1. Data intensive: memory wall, high latency
2. Irregular nature: data and parameter driven, difficult to predict
3. Reliance on pointer-based data structures: poor ILP
4. Multi-core is even harder
5. Compilers/runtime systems have a hard time coping with 1-4.

Cache-, Memory-, and I/O-Conscious Algorithms
The idea is to re-architect the algorithm to leverage features of the architecture
Prefetching
Simultaneous multithreading
Locality enhancements
Gains in performance can be staggering: up to a 400-fold improvement!
For networks of workstations: minimize communication and leverage remote memory
Enables mining of terabyte-scale distributed datasets efficiently

The Multicore Challenge
Challenges for data analysis applications
Irregular in nature
Hard to estimate the access pattern
Must maintain good data locality
Data intensive
Latency to main memory
Pressure on memory bandwidth
Must maintain small memory footprints and working sets
Workload estimation is difficult
Data- and parameter-dependent tasks
Must maintain load balance among cores
[Figure: dual-core chip — Core 0 and Core 1 sharing an L2 cache, connected over the system bus to system memory]

Key Idea: Adaptive Algorithms
Key idea: trading off memory for redundant computation
Benefits:
Reduced working-set sizes
Reduced bandwidth pressure
Reduced cache-miss rates (CoS)
Requirements:
Sensing the problem
Re-architecting the algorithm to reduce memory consumption

Key idea: moldable partitioning and adaptive scheduling of tasks
Benefits:
Better CPU utilization
Utilizing the strengths of the CMP
Requirements:
Sensing the problem
Re-architecting the algorithm
Task decomposition
State maintenance

Sample Benefits: Tree Mining
[Figure: memory footprint (MB) and run time (sec), log scale, vs. minimum support from 2% down to 1%]
Memory footprint of MCT is constant
660-fold reduction against TreeMiner, 2300-fold against iMB3, 300-fold w.r.t. TRIPS
Improved locality leads to smaller run times
7200-fold speedup against TreeMiner, 66-fold against iMB3, 5-fold w.r.t. TRIPS
Linear speedups on 8- and 16-core systems.

An Esoteric CMP: The Sony PlayStation
Streaming workloads are 12-75X faster on the Cell vs. single cores, and 3-6X faster than other CMPs
Naïve PageRank is 20X faster on the Quad-Core Xeon!
Cell 8-SPU times are from 6-SPU PlayStation execution times scaled to 8 cores based on simulator results
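A naive PageRank such as the one benchmarked above boils down to a power-iteration loop over an adjacency list. This sketch is our illustration of that workload on a tiny made-up graph, not the benchmarked code.

```python
# Naive PageRank by power iteration over an adjacency list.

def pagerank(links, damping=0.85, iters=50):
    n = len(links)
    rank = {node: 1.0 / n for node in links}
    for _ in range(iters):
        new = {node: (1.0 - damping) / n for node in links}
        for node, outs in links.items():
            share = damping * rank[node] / len(outs)
            for dst in outs:               # distribute rank along out-links
                new[dst] += share
        rank = new
    return rank

# Tiny hypothetical link graph (every node has out-links).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # "c": it receives links from both a and b
```

The irregular, pointer-chasing access pattern of the inner loop is exactly what makes this workload a poor fit for streaming-oriented cores, which is the slide's point.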

Data Mining on the Cloud
Targeted at reducing operational cost and improving productivity
Service-oriented architectures
Plug-and-play semantics
Open problems
What kinds of services should we consider building?
Knowledge caching? I/O services? Placement services?
Hadoop and MapReduce target productivity and availability, but what about efficiency of resource utilization? Can we do better?
Is MapReduce the right interface? Is it enough?
[Figure: cloud clients ranging from businesses (startups to enterprises) to 4+ billion phones by 2010 [Source: Nokia] and Web 2.0-enabled PCs, TVs, etc.]
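The MapReduce interface questioned above can be summarized in-process with the classic word-count example; real Hadoop jobs distribute the same map/shuffle/reduce phases across a cluster. This is a conceptual sketch, not Hadoop code.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Emit (key, value) pairs from one input record."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group all values by key (the framework's job in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key."""
    return key, sum(values)

docs = ["data mining on the cloud", "mining data at scale"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["mining"], counts["data"])  # 2 2
```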

Data Mining in a Flash
Flash drives have the potential to really change the landscape, especially for out-of-core algorithms
Almost an order of magnitude faster than traditional hard drives for random reads and sequential writes
Significantly lower energy costs
The technology is still in relatively early stages
Costs are currently prohibitive
Predicted to be less of an issue 3 to 4 years down the road
Algorithmic/systemic challenges for data mining algorithms
Wear-leveling problem
Random writes
How do we work around these to realize performance commensurate with this technology?

Energy-Conscious Data Centers
Energy is clearly a major economic and environmental problem
Huge costs are spent on managing this problem
E.g., US national laboratories, Amazon/Google data centers
Some open problems
Given mining algorithms A and B, which approach is more energy efficient?
Given the choice to implement an algorithm on the STI Cell or an Intel multicore, which do we choose?
Can we leverage recent ideas from the architecture community (e.g., reconfigurable caches)?
Given multiple tasks, how can we co-locate/schedule related tasks to lower energy costs?
How can we leverage architectural features such as underclocking and DVFS?

Active Mining on Streaming Data
Crucial issue: data influx rate exceeds processing rate
A problem for time-critical applications (e.g., network intrusion detection)
Data stratification: reduce rows (sampling), reduce columns (PCA)
Process incrementally, minimize access to original data
Systems support
Memory placement, compression, disk filters, data manipulation
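The "reduce rows" idea can be sketched with reservoir sampling, which keeps a fixed-size uniform sample while touching each stream record exactly once; the slide does not name a specific algorithm, so this is an illustrative choice.

```python
import random

# Reservoir sampling (Algorithm R): maintain a uniform sample of size k
# over a stream of unknown length, one pass, O(k) memory.
random.seed(1)

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # replace with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=100)
print(len(sample))  # 100, no matter how long the stream runs
```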

Towards Visual Network Analytics
Sample Projects

Problem Domain(s)
Interaction networks
Nodes represent entities
Edges represent interactions among entities
Examples abound:
Biological networks, e.g., protein-protein interactions in yeast (Jeong et al., 2001)
Collaboration/friendship networks, e.g., the physicist collaboration network (Newman and Girvan, 2004)
Challenges
Community discovery
Scale
Dynamic nature
Visualization

Questions & Challenges
How to extract modular structure?
Common functional proteins, stable collaboratories, etc.
What characterizes the stability of groups over time?
What are the behavioral characteristics of nodes and communities?
Which nodes are influential, which are bridging?
What are the relationships among communities?
How to visualize?
Mental map, handling dynamic updates, the pixel-wall challenge
Scalability?
Generative models?
Application-specific challenges
E.g., citizen sensing on Twitter (providing context)

NAV Architecture


Dynamic Analysis Framework
Community detection
MLR-MCL (KDD09)
Viewpoints (KDD09)
Graph partitioning (Metis)
CSV (SIGMOD08)
Event detection (KDD07, TKDD09)
Entity-driven events
Community-driven events
Composing behavioral measures
Stability, sociability, influence
Visual analysis and inference
Dynamic layout
Density plots
[Figure: communities C1-C6 evolving across timesteps]

Visualization: Overview First (Coarsened View)

Zoom and Filter


Event View (Importance of Ranking)


Split: Details on Demand (ironic example)

Merge (Philosophy + Logic)


Dynamic Details (Sociability + Influence)

Density (CSV) Plots
Computing density plots efficiently was identified by the SIGMOD keynote on Extreme Visualization as an important grand-challenge problem
Density plots can help quickly localize dense subgraphs hidden within a large graph
The challenge is to compute them efficiently

SMD: Stock Market Data
[Figure: CSV plot of stock market data, annotated with a bridging vertex and two partial cliques]
Peaks in the SMD CSV plot represent highly cohesive stocks

Dynamic Layout Strategy: Preliminary Ideas
[Figure: original graph G; deleting node 4 and propagating updates, housed within an R-tree, yields the refined graph G; a static layout of G is shown for comparative purposes]

Dynamic Graph Layout: Early Results
Enron dataset
The energy profile of the static (from-scratch) layout is very similar to our dynamic variant
The dynamic variant maintains a better mental map (not shown)
The dynamic variant is also more efficient (up to 40% more efficient)

Global Graphs: Managing Large Graphs
[Figure: client-server architecture — technicians, analysts, and clinicians access MRI data and a patient DB through a virtual dataspace (DAG, array, list structures) backed by servers and gene databases (expression data, sequence data)]

Biomedical Applications
Problem (joint with M. Twa et al.)
Classification of normal vs. keratoconic patients
Patient data + examination data
Solution [SIAM 2003]
Embedding the science
Need to model the shape and structure of the cornea
Zernike representation
Empirically determine the polynomial order
Apply an easy-to-interpret classification model
Decision-tree model
Results
90-95% accuracy, easy to interpret, visual representation possible!

Mining Protein Structure
Protein substructure detection
Protein network analysis
Embedding the science
Need to model the structure-activity relationship
Lots of self-repeating structures: functional role?
Goal: find self-repeating structures
Distance-based structure representation
Frequency-based pattern identification
Fuzzy hashing for handling noisy data
Key results so far
Can detect multi-level tertiary substructures
The same ideas are applicable to other scientific domains
MD simulation data
Structural similarity in drugs

Conclusions
KDD is an iterative and interactive process whose goal is to extract interesting and actionable information from potentially large data stores efficiently
We have active projects in:
Architecture-conscious data management and mining
Visual network analytics and management
Biomedical informatics

KDD and You
Prospects
Interesting problems spanning a broad range of topics
Statistics, databases, parallel and distributed systems, combinatorics, machine learning, etc.
Young, emerging field
Scope for giant strides, high-impact work
RA positions available
Courses of interest
788G02/788J02/888J02/888Y11
674 Introduction to Data Mining (Spring 2010)

Past DMRL Student Successes
10 PhD and 12 MS students graduated
Half in academia, half in industry
Winner of several competitive fellowships
2 Computing Innovation Fellowships in 2009
1 MSR Fellowship in 2007
1 IBM Fellowship in 2006
1 NSF Fellowship in 2005
4 straight departmental research awards
6 awards and 9 nominations for best paper from top conferences (all joint with students, each one with a different student as lead)
High-impact work in top forums.

Contact Information

Office: DL693
Email: srini@cse.ohio-state.edu
Phone: 292-2568
Web: www.cse.ohio-state.edu/~srini
Data Mining Research Lab:
www.cse.ohio-state.edu/dmrl/
Feel free to stop by and talk


Questions?

