Data Mining

Research Group Spring 09

Data & Information System (DAIS) Research Lab Department of Computer Science | University of Illinois at Urbana-Champaign

Preparing for the challenges in information access, retrieval, and management that lie ahead requires a coordinated and multifaceted approach. The Data Mining Research Group at the University of Illinois Department of Computer Science is proud of the successful research partnerships and research initiatives that are a part of our hallmark of excellence.

TABLE OF CONTENTS
About the Data Mining Research Group
Jiawei Han
Students, Alumni, and Visiting Scholars
Awards and Publications
Projects
Funding

The Data Mining Research Group in the Department of Computer Science, University of Illinois at Urbana-Champaign, conducts leading-edge research in data mining, data warehousing, database systems, and Web-based information systems. The group's work is pioneering new directions in the field and pushing the boundaries of data mining techniques. It aims to integrate and advance the knowledge produced in multiple disciplines, including database systems, statistics, machine learning, algorithms, information theory, spatial and multimedia databases, and Web technology. With more than 20 members, the group is characterized by its breadth and depth of excellence and its integrated approach to complex problem chains. The group is associated with the Data and Information System Laboratory. Its research projects include:

information network analysis
OLAP and mining of multidimensional text databases
graph mining
privacy and trust validation by data mining
mining moving objects, trajectories, RFID, and traffic data
image and video mining
multidimensional promotion and ranking analysis
transfer learning, dimensionality reduction, and pattern-based classification
stream data mining
data mining in biomedical, software engineering, and cyberphysical system applications

Jiawei Han
Professor Han is a world-recognized leader in the data mining field. His ground-breaking work includes pioneering techniques for frequent, sequential, and graph pattern mining; heterogeneous information network analysis; spatiotemporal data mining; stream data mining; and text cube, ranking cube, and data cube computation. His contributions and discoveries have been characterized by an integrative approach, advancing knowledge produced in multiple disciplines. Professor Han is one of the most cited authors in data mining, has written more than 400 papers for conferences and journals, has organized a number of international conferences, and is the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data. Working with government funding agencies and industry partners, Professor Han has extensive experience in managing large-scale, complex projects that take a multi-disciplinary approach.

Awards:
SIGKDD Innovations Award (2004)
ACM Fellow (2004)
IEEE CS Technical Achievement Award (2005)
IEEE Fellow (2009)

Research Collaborators and Funders:

students
Dustin Bortner
Network mining

Deng Cai
Machine learning, especially manifold learning and dimensionality reduction
Information retrieval

Chen Chen
Graph mining and related data management problems

Bolin Ding
Pattern mining algorithms
Theoretical aspects of data mining and database problems

Jing Gao
Ensemble learning, transfer learning
Data stream mining
Anomaly detection

Xin Jin
Image/video mining and retrieval

Sangkyum Kim
Image/video mining
High-dimensional indexing

Zhenhui Li
Mining moving objects
Spatiotemporal data mining

Cindy Xide Lin
Graph mining
Web mining
Multidimensional analysis

Chandrasekar Ramachandran
Video/image mining
Dimensionality reduction on sparse datasets
Indexing and search

Sebastian Seith
Moving object and traffic mining

Yizhou Sun
Link analysis and information network analysis
Graph mining and Web mining
Machine learning

LuAn Tang
Spatial data mining
Privacy-preserving data mining
Data mining with biomedical applications

Tianyi Wu
Ranking query processing
Association analysis
Information network analysis

Zhijun Yin
Web mining
Information retrieval
Machine learning

Xiao Yu
Anomaly detection
Web mining

Yintao Yu
Information network and social network analysis
Web mining

Bo Zhao
Multidimensional text database systems
Web mining, entity search and extraction
Information network analysis

Peixiang Zhao
Structural data mining
Algorithms on massive data sets

Feida Zhu
Structural pattern mining
Approximation and complexity analysis for data mining problems

visiting scholars
Min-Soo Kim
Graph/network data mining
Bioinformatics
Indexing & query processing
Information retrieval & search engines

Lu Liu
Web video analysis and mining
Topic modeling
Social-network analysis

Recent Visiting Scholars
R. Alves (Portugal)
R. Angryk (Montana State U.)
F. Berzal (Spain)
Jianlin Feng (China)
Jae-Gil Lee (IBM Research)
Cuiping Li (China)

alumni
Ph.D.
Hong Cheng, Ph.D. 2008, Chinese University of Hong Kong
Hector Gonzalez, Ph.D. 2008, Google Research
Xiaolei Li, Ph.D. 2008, Microsoft
Chao Liu, Ph.D. 2007, Microsoft Research
Dong Xin, Ph.D. 2007, Microsoft Research
Xiaoxin Yin, Ph.D. 2007, Microsoft Research
Xifeng Yan, Ph.D. 2006, University of California at Santa Barbara
Hwanjo Yu, Ph.D. 2004, POSTECH University, Korea

Recent Masters and Undergraduate Alumni
Luiz Mendes
Jacob Lee
Margaret Myslinska
Ricardo Redder
John Paul Sondag

distinguished honors: Jiawei Han

IEEE Fellow (2009)
IEEE Computer Society Technical Achievement Award (2005)
ACM SIGKDD Innovations Award (2004)
ACM Fellow (2004)
IBM Faculty Awards (2002, 2003, 2004)
The Outstanding Contribution Award (2002, IEEE Computer Society, International Conference on Data Mining)
UIUC Teachers Ranked as Excellent (2002-2007)

distinguished honors: students


Microsoft Research Graduate Women's Scholarship (2009): Cindy Xide Lin
ACM SIGKDD Dissertation Award (2008): Xiaoxin Yin
ACM SIGMOD Ph.D. Dissertation Runner-Up Award (2007): Xifeng Yan
IBM Scholarship (2007): Hong Cheng
Midwest Database Symp. Best Presentation Award (2007): Feida Zhu
Henry Ford II Award (2006): Deng Cai

conference awards
D. Zhang, C. Zhai, and J. Han, Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases, in Proc. 2009 SIAM Int. Conf. on Data Mining (SDM09) (One of Best of SDM09)

F. Zhu, X. Yan, J. Han, and P. S. Yu, gPrune: A Constraint Pushing Framework for Graph Pattern Mining, in Proc. 2007 Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD07) (Best Student Paper Award)

X. Li, J. Han, S. Kim, and H. Gonzalez, ROAM: Rule- and Motif-Based Anomaly Detection in Massive Moving Object Data Sets, in Proc. 2007 SIAM Int. Conf. on Data Mining (SDM07) (One of Best of SDM07)

F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, Mining Colossal Frequent Patterns by Core Pattern Fusion, in Proc. 2007 Int. Conf. on Data Engineering (ICDE07) (Best Student Paper Award)

Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai, Generating Semantic Annotations for Frequent Patterns with Context Analysis, in Proc. 2006 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD06) (Best Student Paper Runner-Up Award)

H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach, in Proc. 2006 SIAM Int. Conf. on Data Mining (SDM06) (One of Best of SDM06)

H. Gonzalez, J. Han, X. Li, and D. Klabjan, Warehousing and Analysis of Massive RFID Data Sets, in Proc. 2006 Int. Conf. on Data Engineering (ICDE06) (Best Student Paper Award)

X. Yan, H. Cheng, J. Han, and D. Xin, Summarizing Itemset Patterns: A Profile-Based Approach, in Proc. 2005 Int. Conf. on Knowledge Discovery and Data Mining (KDD05) (Best Student Paper Runner-Up Award)

conference tutorials
D. Cai, X. He, and J. Han, A Geometric Perspective on Dimensionality Reduction, SDM09, Sparks, NV, April 2009

J. Pei, Y. Tao, and J. Han, Preference Queries from OLAP and Data Mining Perspective, ICDE09, Shanghai, China, March 2009

J. Han, X. Yan, and P. S. Yu, Scalable OLAP and Mining of Information Networks, EDBT09, St. Petersburg, Russia, March 2009

H. Cheng, J. Han, X. Yan, and P. S. Yu, Integration of Classification and Pattern Mining: A Discriminative and Frequent Pattern-based Approach, ICDM08, Pisa, Italy, December 2008

J. Han, J.-G. Lee, H. Gonzalez, and X. Li, Mining Massive RFID, Trajectory, and Traffic Data Sets, ACM SIGKDD08, Las Vegas, NV, August 2008

J. Han, X. Yin, and P. S. Yu, Exploring the Power of Links in Data Mining, ICDE08, Cancun, Mexico, April 2008 (Also, ECML/PKDD07, Warsaw, Poland, Sept. 2007)

C. Liu, T. Xie, and J. Han, Mining for Software Reliability, ICDM07, Omaha, NE, Oct. 2007

J. Han, X. Yan, and P. S. Yu, Mining and Searching Graphs and Structures, KDD06, Philadelphia, PA, August 2006 (Also, ICDE06, Atlanta, GA, April 2006, and ICDM05, Houston, TX, Nov. 2005)

project list
Information Network Analysis
OLAP and Mining of Multidimensional Text Databases
Graph Mining
Privacy and Trust Validation by Data Mining
Mining Moving Objects, Trajectories, RFID, and Traffic Data
Image and Video Mining
Multidimensional Promotion and Ranking Analysis
Transfer Learning, Dimensionality Reduction, and Pattern-Based Classification
Stream Data Mining
Data Mining Applications

information network analysis


researchers: Yizhou Sun, Yintao Yu, Chen Chen, Cindy Xide Lin, Tianyi Wu, Bo Zhao, Dustin Bortner, and Jiawei Han

description:

Information network analysis investigates effective discovery of patterns and knowledge from large-scale networks that consist of interconnected physical, technological, conceptual, and human/societal components. The major themes in our study include: (1) ranking-based clustering on different types of objects in heterogeneous information networks; (2) hierarchical network structure analysis for OLAP, multidimensional text database analysis, and ranking promotion; (3) query-based information network extraction and analysis; and (4) link-based veracity analysis for bibliographic networks and news information networks.

selected publications:

Y. Sun, Y. Yu, and J. Han, Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema, KDD09

Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu, RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis, EDBT09

Y. Sun, T. Wu, H. Cheng, J. Han, X. Yin, and P. Zhao, BibNetMiner: Mining Bibliographic Information Networks (demo paper), SIGMOD08

X. Yin, J. Han, and P. S. Yu, LinkClus: Efficient Clustering via Heterogeneous Semantic Links, VLDB06

OLAP and mining of multidimensional text databases


researchers: Cindy Xide Lin, Bo Zhao, Bolin Ding, Duo Zhang, ChengXiang Zhai, and Jiawei Han

description:

A multidimensional text database is a database that consists of both multidimensional categorical attributes and narrative text attributes; examples include customer reviews, flight reports, job descriptions, and service feedback. We investigate how to construct text or topic data cubes; how to perform effective information retrieval, OLAP, and text mining on such data cubes; and how textual and structured multidimensional information can work together to enhance information retrieval and knowledge discovery.
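As a rough illustration of the idea, the sketch below groups a toy review table by a chosen set of categorical dimensions and aggregates term counts inside each cell. It is a minimal sketch and not the group's Text Cube or Topic Cube method: the records, the dimension names, the function `text_cuboid`, and the choice of raw term frequency as the cell measure are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: two categorical dimensions plus one narrative text attribute.
REVIEWS = [
    {"product": "laptop", "region": "US", "text": "battery life is great"},
    {"product": "laptop", "region": "EU", "text": "battery drains fast"},
    {"product": "phone", "region": "US", "text": "great screen great camera"},
]


def text_cuboid(records, dims):
    """Group records by the given dimensions and sum term counts per cell.

    Each cell is keyed by a tuple of dimension values; an empty `dims`
    list yields a single cell that covers the whole database.
    """
    cells = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[d] for d in dims)
        cells[key].update(rec["text"].split())
    return dict(cells)


by_product = text_cuboid(REVIEWS, ["product"])              # roll up over region
by_product_region = text_cuboid(REVIEWS, ["product", "region"])  # finest cells
apex = text_cuboid(REVIEWS, [])                             # roll up over everything
```

Rolling up from `by_product_region` to `by_product` to `apex` mirrors the usual OLAP roll-up, with a bag of words standing in for a numeric measure.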

selected publications:

C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao, Text Cube: Computing IR Measures for Multidimensional Text Database Analysis, ICDM08

D. Zhang, C. Zhai, and J. Han, Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases, SDM09 (Best of SDM09)

graph mining
researchers: Chen Chen, Feida Zhu, Cindy Xide Lin, Peixiang Zhao, Xifeng Yan (Univ. of California at Santa Barbara), Jiawei Han, and Philip S. Yu (Univ. of Illinois at Chicago)

description:

Graph mining aims to mine patterns, classification models, clusters, and other kinds of knowledge from massive graph data sets, and to develop indexing, similarity search, and OLAP tools for graph data. Applications include bioinformatics, computer system diagnostics, social network analysis, and Web search and mining.

selected publications:

X. Yan, H. Cheng, J. Han, and P. S. Yu, Mining Significant Graph Patterns by Scalable Leap Search, SIGMOD08

C. Chen, X. Yan, F. Zhu, J. Han, and P. S. Yu, Graph OLAP: Towards Online Analytical Processing on Graphs, ICDM08

C. Chen, C. X. Lin, X. Yan, and J. Han, On Effective Presentation of Graph Patterns: A Structural Representative Approach, CIKM08

C. Chen, X. Yan, P. S. Yu, J. Han, D. Zhang, and X. Gu, Towards Graph Containment Search and Indexing, VLDB07

X. Yan, F. Zhu, P. S. Yu, and J. Han, Feature-based Substructure Similarity Search, ACM Transactions on Database Systems (TODS), 31: 1418-1453, 2006

privacy and trust validation by data mining


researchers: Bolin Ding, Zhijun Yin, and Jiawei Han

description:

Can we trust pieces of information provided by other parties and other information providers, including newspapers, the Web, and TV? We investigate this issue and develop techniques that analyze the truthfulness of information from multiple information providers and automatically identify the trustworthy information. Conversely, can we develop data mining mechanisms that find interesting information and still preserve the privacy required by information providers? We study privacy-preserving data mining and have developed a constraint-based clustering approach for privacy-preserving data publishing. We are also working on a privacy-preserving data cube that may provide multidimensional aggregate information while preserving the privacy of sensitive data.

selected publications:

A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng, Privacy-Preserving Data Publishing: A Constraint-Based Clustering Approach, in S. Basu, et al. (eds.), Constrained Clustering: Advances in Algorithms, Theory, and Applications, Taylor and Francis, 2008

X. Yin, J. Han, and P. S. Yu, Truth Discovery with Multiple Conflicting Information Providers on the Web, TKDE08

X. Yin, J. Han, and P. S. Yu, Object Distinction: Distinguishing Objects with Identical Names by Link Analysis, ICDE07

mining moving objects, trajectories, RFID, and traffic data


researchers: Zhenhui Li, LuAn Tang, Sebastian Seith, and Jiawei Han

description:

The world is becoming increasingly mobile. We design and develop effective and scalable methods for mining massive moving-object data, trajectory data, RFID data, and traffic data to uncover clusters, classification models, frequent and sequential patterns, and outliers in large sets of moving objects, with applications in homeland security, law enforcement, traffic control, animal/bird migration analysis, and environmental studies.

selected publications:

X. Li, Z. Li, J. Han, and J.-G. Lee, Temporal Outlier Detection in Vehicle Traffic Data, ICDE09

J.-G. Lee, J. Han, X. Li, and H. Gonzalez, TraClass: Trajectory Classification Using Hierarchical Region-Based and Trajectory-Based Clustering, VLDB08

J.-G. Lee, J. Han, and X. Li, Trajectory Outlier Detection: A Partition-and-Detect Framework, ICDE08

H. Gonzalez, J. Han, X. Li, M. Myslinska, and J. P. Sondag, Adaptive Fastest Path Computation on a Road Network: A Traffic Mining Approach, VLDB07

J.-G. Lee, J. Han, and K.-Y. Whang, Trajectory Clustering: A Partition-and-Group Framework, SIGMOD07

image and video mining


researchers: Sangkyum Kim, Xin Jin, Chandrasekar Ramachandran, Liangliang Cao, and Klara Nahrstedt

description:

We investigate efficient image and video pattern mining, clustering, classification, and indexing methods. Our work includes SpIBag (Spatial Item Bag Mining), an algorithm for mining frequent spatial patterns in images; SpaRClus (Spatial Relationship Pattern-Based Hierarchical Clustering), an image clustering algorithm that remains stable under shifting, scaling, and rotation transformations; and a multi-layer ring-based index structure for both r-Range search and k-NN search.

selected publications:

X. Jin, S. Kim, J. Han, L. Cao, and Z. Yin, GAD: General Activity Detection for Fast Clustering on Large Data, SDM09

R. Malik, S. Kim, X. Jin, C. Ramachandran, J. Han, I. Gupta, and K. Nahrstedt, MLR-Index: An Index Structure for Fast and Scalable Similarity Search in High Dimensions, SSDBM09

S. Kim, X. Jin, and J. Han, SpaRClus: Spatial Relationship Pattern-Based Hierarchical Clustering, SDM08

multidimensional promotion and ranking analysis


researchers: Tianyi Wu, Dong Xin (Microsoft Research), and Jiawei Han

description:

As decision support and business intelligence applications become increasingly large-scale, it is critical to support effective and efficient search and knowledge discovery through online multidimensional analysis. Promotion and ranking are indispensable functions of such an analysis engine: ranking aims at enabling analysts to explore top-k interesting aggregate or nonaggregate answers at multiple resolutions; and promotion helps decision makers promote any given object of interest through discovering the best subspaces or data regions where the object becomes prominent, without manually navigating the data set. We have developed RankingCube and PromotionCube methods that are efficient and scalable at processing flexible queries in multidimensional space.
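To make the notion of promotion concrete, the sketch below enumerates every subspace in which a target object appears, ranks the object by total score inside that subspace, and lists the subspaces from best rank to worst. It is a brute-force illustration only and not the RankingCube or PromotionCube method: the table `ROWS`, the column names, and the functions `rank_in_subspace` and `best_subspaces` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy fact table: one row per (object, region) with a score.
ROWS = [
    {"obj": "A", "region": "US", "score": 10},
    {"obj": "B", "region": "US", "score": 30},
    {"obj": "A", "region": "EU", "score": 50},
    {"obj": "B", "region": "EU", "score": 20},
    {"obj": "C", "region": "EU", "score": 40},
]


def rank_in_subspace(rows, target, subspace):
    """Rank of `target` by total score among objects inside `subspace` (1 = best)."""
    totals = {}
    for row in rows:
        if all(row[d] == v for d, v in subspace.items()):
            totals[row["obj"]] = totals.get(row["obj"], 0) + row["score"]
    if target not in totals:
        return None
    return 1 + sum(1 for s in totals.values() if s > totals[target])


def best_subspaces(rows, target, dims):
    """Exhaustively rank `target` in every subspace where it occurs, best first."""
    results = []
    for k in range(len(dims) + 1):
        for chosen in combinations(dims, k):
            seen = set()
            for row in rows:
                if row["obj"] != target:
                    continue
                values = tuple(row[d] for d in chosen)
                if values in seen:
                    continue
                seen.add(values)
                subspace = dict(zip(chosen, values))
                results.append((rank_in_subspace(rows, target, subspace), subspace))
    results.sort(key=lambda pair: pair[0])
    return results


promotions_for_b = best_subspaces(ROWS, "B", ["region"])
```

In this toy table, object B is second overall but first within the US region, which is the kind of subspace a promotion query is meant to surface. Exhaustive enumeration grows exponentially with the number of dimensions, which is why scalable cube-based methods matter.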

selected publications:

T. Wu, D. Xin, and J. Han, ARCube: Supporting Ranking Aggregate Queries in Partially Materialized Data Cubes, SIGMOD08

D. Xin and J. Han, P-Cube: Answering Preference Queries in Multi-Dimensional Space, ICDE08

T. Wu, X. Li, D. Xin, J. Han, J. Lee, and R. Redder, DataScope: Viewing Database Contents in Google Maps Way, VLDB08 (demo)

transfer learning, dimensionality reduction, and pattern-based classification


researchers: Jing Gao, Deng Cai, Hong Cheng (Chinese Univ. of Hong Kong), and Jiawei Han

description:

Classification is a core problem widely studied in machine learning, statistical learning, and data mining. Real-world applications, such as text, image, and Web categorization, gene prediction, and system and network intrusion detection, can be cast as classification problems. Although many learning algorithms, such as Support Vector Machines, logistic regression, and decision tree induction, have been developed, there are still numerous challenges in effective classification. We investigate methods for improving classification accuracy by exploring knowledge embedded in data, and we develop novel methods to construct discriminative and compact feature sets for complex structured data, explore manifold structure for learning, and combine multiple sources or learning models for better predictions.

selected publications:

J. Gao, W. Fan, J. Jiang, and J. Han, Knowledge Transfer via Multiple Model Local Structure Mapping, KDD08

D. Cai, X. He, and J. Han, Training Linear Discriminant Analysis in Linear Time, ICDE08

H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for Effective Classification, ICDE08

D. Cai, X. He, and J. Han, SRDA: An Efficient Algorithm for Large Scale Discriminant Analysis, TKDE08

H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for Effective Classification, ICDE07

stream data mining


researchers: Jing Gao, Wei Fan (IBM Research), and Jiawei Han

description:

In many real-time applications, such as network traffic monitoring, credit card fraud detection, and Web click-stream analysis, data arrive continuously and in large volumes, forming data streams. We investigate stream data mining principles and algorithms and develop effective and scalable methods for mining the dynamics of data streams in multi-dimensional space, including discovering changes, trends, and evolution characteristics in data streams; constructing clusters and classification models; and exploring frequent patterns and similarities among data streams.

selected publications:

L. Mendes, B. Ding, and J. Han, Stream Sequential Pattern Mining with Precise Error Bounds, ICDM08

J. Gao, W. Fan, and J. Han, On Appropriate Assumptions to Mine Data Streams: Analysis and Practice, ICDM07

J. Gao, W. Fan, J. Han, and P. S. Yu, A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions, SDM07

data mining applications


sequential pattern mining (Bolin Ding and Jiawei Han):

Motivated by long sequences in text data, biological data, software engineering, and sensor networks, we study mining repetitive gapped subsequences to capture the occurrences of sequential patterns repeating within each sequence of a large database and use them as features for classification or prediction.

biological and medical data mining (Jing Gao, Xiao Yu, Min-Soo Kim, Zhijun Yin, Jiawei Han):

We investigate medical classification problems, including gene prediction based on micro-array data and cancer prediction based on medical images, and develop discriminative pattern-based methods that improve the accuracy of medical data classification and provide useful discriminative patterns to help medical experts with their decisions.

software engineering and sensor network mining (Xin Jin, Jiawei Han, and Tarek Abdelzaher):

We investigate statistical analysis and sequence/graph mining methods for software bug detection, failure indexing, troubleshooting and root-cause analysis in sensor networks and data streams.

cyberphysical systems (LuAn Tang and Jiawei Han):

A cyberphysical system consists of a large number of interacting physical and information components. For example, a patient-care system may link a patient monitoring system with a network of patients and associated medical information and an emergency handling system. We investigate data mining in cyberphysical networks, including real-time analysis of massive amounts of streaming data, reliable and trusted data analysis, and effective spatiotemporal data analysis.

selected publications:

D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun, Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach, KDD09

B. Ding, D. Lo, J. Han, and S.-C. Khoo, Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database, ICDE09

M. M. H. Khan, T. Abdelzaher, J. Han, and H. Ahmadi, Finding Symbolic Bug Patterns in Sensor Networks, DCOSS09

M. M. H. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han, DustMiner: Troubleshooting Interactive Complexity Bugs in Sensor Networks, Sensys08

F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, Mining Colossal Frequent Patterns by Core Pattern Fusion, ICDE07 (Best Student Paper Award)

research funding
NASA (with ChengXiang Zhai, et al.): Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events (2008-2010)

Air Force (MURI, with Tim Finin as PI, et al.): A Framework for Managing the Assured Information Sharing Lifecycle (2008-2012)

NSF: SGER: CS-BibCube: OLAPing and Mining of Computer Science Literature (2008-2010)

NSF (with Roland Kays et al.): BDI: Movebank: Integrated Database for Networked Organism Tracking (2007-2010)

NSF: SGER: DataScope: Viewing Database Contents in Multi-Resolution at Your Finger Tips (2006-2007)

NSF (with Jasmine Zhou): Collaborative Research: Endowing Biological Databases With Analytical Power: Indexing, Querying, and Mining of Complex Biological Structures (2005-2009)

NSF (with Ouri Wolfson): SEI(IIS): MotionEye: Querying and Mining Large Datasets of Moving Objects (2005-2008)

NSF (with Xiaosong Ma): Collaborative Research: Reusable, Observation-based Performance Prediction across Platforms (2004-2005)

DHS (with Dan Roth as PI, et al.): Multimodal Information Access and Synthesis Center (2007-2010)

ONR/NCASSR (with Michael Welge): Detection and Apprehension of Rare Events in Data Streams (2008-2009)

Boeing: On-Line Mining of Strange Moving Objects for Security Protection (2007-2010)

U.S. Air Force (with IAI Inc): Distributed High-Dimensional Mining Tool for Bioscience Data Analysis (2006-2009)

NSF (with Josep Torrellas as PI, et al.): ITR: Automatic On-the-fly Detection, Characterization, Recovery, and Correction of Software Bugs in Production Runs (2003-2008)

NSF: Mining Dynamics of Data Streams in Multi-Dimensional Space (2003-2006)

ONR (with Michael Welge): Mining Changes and Alarming Events in Streaming Data (2003-2006)

NSF: Mining Sequential and Structured Patterns: Scalability, Flexibility, Extensibility and Applicability (2002-2006)

Research gifts and grants from industry: Microsoft Research, Intel, IBM (Faculty Award, Innovation Award), Google, Yahoo!, NCSA (Faculty Fellowship), HP-Labs.

http://dm1.cs.uiuc.edu
Data Mining Research Group Department of Computer Science, UIUC
