Outline
Background
Content of human mind, Sample data mining problems, Why data mining?
Definition, KDD process, System architecture
Data Visualization
Data warehousing
Course outline
Overview of Data mining tasks
Top 10 data mining algorithms
Issues in data mining
Data mining applications
Summary
Data, Information, Knowledge, and Wisdom
by Gene Bellinger, Durval Castro, Anthony Mills
According to Russell Ackoff, the content of the human mind can be classified into five categories: data, information, knowledge, understanding, and wisdom.
Data: Symbols
Data represents a fact or statement of an event without relation to other things. Data is raw. It simply exists and has no significance beyond its existence (in and of itself). It can exist in any form, usable or not. It does not have meaning of itself. In computer parlance, a spreadsheet generally starts out by holding data.
Ex: It is raining.
Content of Human Mind
Information: Data that are processed to be useful; provides answers to who, what, where, and when questions.
Information is data that has been given meaning by way of relational connection. This "meaning" can be useful, but does not have to be. In computer parlance, a relational database makes information from the data stored within it. Information embodies the understanding of a relationship of some sort, possibly cause and effect.
Ex: The temperature dropped 15 degrees and then it started raining.
Ex: It rains because it rains. And this encompasses an understanding of all the interactions that happen between raining, evaporation, air currents, temperature gradients, changes, and raining.
Evolution of Database Technology
Before 1960s: Primitive file processing
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused:
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Traditional techniques are infeasible for raw data.
Data mining may help scientists in classifying and segmenting data and in hypothesis formation.
[Figure: chart of the number of analysts; surviving axis labels: 1998, 1999]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications
Evolution of Sciences
Before 1600: empirical science
1600-1950s: theoretical science
Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s: computational science
Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now: data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Look up a phone number in a phone directory
Query a Web search engine for information about "Amazon"
Certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly in the Boston area)
Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
[Figure: KDD process; surviving labels: Databases, Data Cleaning, Data Integration]
Database and data warehouse server: Responsible for fetching the relevant data, based on the user's data mining request.
Knowledge base: Domain knowledge used to guide the data mining process, such as attribute levels, semantics, user beliefs, pattern interestingness, thresholds, and metadata.
Data mining engine: Set of functional modules for tasks such as characterization, summarization, association, classification, clustering, and outlier extraction.
Pattern evaluation: Employs interestingness measures. Push pattern evaluation as deep as possible into the mining process so that the search can be optimized.
User interface: Communication between users and the data mining system.
Uses audio signals to indicate the patterns of data or the features of data mining results
Traversal Diagram
[Figure: layered diagram, top to bottom; roles shown alongside: End User, DBA]
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
[Figure: Data Mining shown together with related disciplines: Machine Learning, Visualization, Information Science, Other Disciplines]
Other disciplines: pattern recognition, image processing, signal processing, spatial or temporal data analysis.
Using the proposed techniques, interesting knowledge, regularities or high-level information can be extracted from the databases and viewed or browsed from different angles.
Efficiency: Without compromising quality
Scalability: Running time should grow approximately linearly in proportion to the size of data.
High dimensionality of data
Micro-arrays may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provide a simple and concise view around particular
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
relational databases, flat files, on-line transaction records
Data Warehouse: Nonvolatile
A physically separate store of data transformed from
data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Note: There are more and more systems which perform OLAP analysis directly on relational databases
Objectives
Data mining is the process of extracting interesting and useful information/knowledge from large databases or data warehouses. The course covers
the concepts and techniques of data mining, such as association rules, clustering, and classification;
the basic concepts, architecture and general implementations of data warehousing technology.
Course topics
Introduction (3 hrs): Definition, KDD framework, Issues in data mining.
Association Rules (9 hrs): Problem definition, Frequent item-set generation, Apriori and FP-growth algorithms, Evaluation of association patterns.
Clustering (9 hrs): Overview, Types of data, K-means, Agglomerative clustering, Clustering algorithms (DBSCAN, BIRCH, CURE, ROCK, CHAMELEON).
Classification (9 hrs): Overview, Decision tree induction, Over-fitting and under-fitting, Scalable decision tree algorithms, Bayesian classification, Regression-based prediction methods.
Data preprocessing (6 hrs): Data summarization, Data cleaning, Data integration and transformation, Data reduction, Data discretization and concept hierarchy.
Data warehousing (9 hrs): Multidimensional data model, Data warehousing architecture, Data cube computation and OLAP technology.
Text Books
Research Papers:
In this course, about 25 research papers will be covered. Students can refer to the following books for the details of some research papers and other background information.
Text books
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second edition, 2006, Elsevier Inc.
Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, 2006, Pearson Education.
Reference Books:
Papers from the proceedings of conferences and from journals related to data mining and data warehousing.
LAB WORK
Several data mining tasks related to data preprocessing, association rules, clustering and classification will be given.
Outcome
After completing the course, the students will
be able to appreciate the importance of extracting useful knowledge from large amounts of data to improve the performance of a business/organization;
get enough exposure to investigate new/improved data mining methods;
understand the basics of data warehousing technology and its links to data mining;
be able to play the role of a data miner in an organization.
GRADING
MidSem I: 15%; MidSem II: 15%; EndSem: 30%; Research Paper Quiz: 10%; Project/Lab: 30%
1995-1998: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining: PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
Reference Books
S. Chakrabarti, Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data, Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000
T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning, John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001
B. Liu, Web Data Mining, Springer, 2006
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in Databases, AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed., Morgan Kaufmann, 2005
Description Methods
Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
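One way to see why these two rules come out of the five transactions is to count how often the items occur together. The sketch below is a minimal illustration, not the course's implementation: it uses the standard support and confidence measures, which this slide does not name, and the function names are hypothetical.

```python
# Transactions copied from the market-basket example above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    lhs, rhs = set(lhs), set(rhs)
    lhs_hits = [t for t in transactions if lhs <= t]
    both_hits = [t for t in lhs_hits if rhs <= t]
    return len(both_hits) / len(lhs_hits)


# The two rules listed on the slide.
rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support=%.2f" % support(lhs | rhs, transactions),
          "confidence=%.2f" % confidence(lhs, rhs, transactions))
```

For {Milk} --> {Coke}, Milk appears in four transactions and Coke appears alongside it in three of them, so the rule holds in 3 of 5 transactions overall and in 3 of the 4 that contain Milk.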
(A B)  (C)  (D E)
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: the sequence (A B) (C) (D E) annotated with timing constraints: <= xg, > ng, <= ws, <= ms]
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous. Other Problem-specific Measures.
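The definition above can be made concrete with a small sketch. It assumes continuous attributes, uses Euclidean distance as the similarity measure, and picks k-means (listed in the course topics) as one possible clustering method; the slide itself does not prescribe an algorithm. The sample points and the choice of initial centroids are made up for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[i])  # leave an empty cluster's centroid in place
        centroids = updated
    return centroids, clusters


# Hypothetical 3-D data: two well-separated groups.
points = [(1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.8, 1.1, 0.9),
          (8.0, 8.0, 8.0), (8.2, 7.9, 8.1), (7.9, 8.1, 8.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
for c, members in zip(centroids, clusters):
    print("centroid", c, "members", members)
```

Points inside each resulting cluster are close to one another and far from the other cluster, which is the property the definition asks for.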
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach:
Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
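The approach describes a similarity measure built from term frequencies without naming one. The sketch below assumes cosine similarity over raw term counts, which is a common choice but only one option; the three short documents are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about the rainforest, one about the retailer.
docs = {
    "d1": "amazon rainforest river trees rainforest",
    "d2": "rainforest trees animals river",
    "d3": "amazon online store books orders",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
print("d1 vs d2:", round(cosine_similarity(tf["d1"], tf["d2"]), 3))
print("d1 vs d3:", round(cosine_similarity(tf["d1"], tf["d3"]), 3))
```

A clustering method would then group d1 with d2 rather than with d3, because d1 and d2 share more of their frequent terms.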
[Table: document clustering results; only the "Correctly Placed" column survives: 364, 260, 36, 746, 573, 278]
Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
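The split into training and test sets can be sketched as follows. This is an illustration under stated assumptions: the records are invented, the classifier is a 1-nearest-neighbour rule chosen only because it is short, and the helper names are hypothetical. Any model that maps attributes to a class could take its place.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(training_set, attributes):
    """Return the class of the training record closest to the given attributes."""
    def squared_distance(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], attributes))
    return min(training_set, key=squared_distance)[1]


def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, label in test_set
                  if predict_1nn(training_set, attrs) == label)
    return correct / len(test_set)


# Hypothetical records: (attribute values, class attribute).
records = [
    ((1.0, 1.2), "No"), ((0.9, 1.0), "No"), ((1.1, 0.8), "No"),
    ((1.3, 1.1), "No"), ((0.7, 0.9), "No"),
    ((10.0, 9.8), "Yes"), ((9.7, 10.2), "Yes"), ((10.3, 10.1), "Yes"),
    ((9.9, 9.6), "Yes"), ((10.1, 10.4), "Yes"),
]
training_set, test_set = train_test_split(records)
print("training records:", len(training_set), "test records:", len(test_set))
print("accuracy on test set:", accuracy(training_set, test_set))
```

The test records play no part in building the model, so the accuracy measured on them estimates how well previously unseen records would be classified.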
Classification Example
[Figure: a training set table (surviving column names: Tid, Refund, Marital Status; 10 records) is used to learn a classifier; the resulting model is applied to the test set]
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach:
Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions. Approach:
Use credit card transactions and the information on the account holder as attributes.
When does a customer buy, what does he buy, how often does he pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a competitor. Approach:
Use detailed record of transactions with each of the past and present customers, to find attributes.
How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.
Classification: Application 4
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
3000 images with 23,040 x 23,040 pixels per image.
Approach:
Segment the image.
Measure image attributes (features): 40 of them per object.
Model the class based on these features.
Success story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in the statistics and neural network fields.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
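For the linear case, the first example can be sketched with ordinary least squares on a single predictor. The advertising and sales figures below are hypothetical and chosen to lie exactly on a line; the function names are also assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted value of the continuous target for a new x."""
    return slope * x + intercept


# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
print("slope:", slope, "intercept:", intercept)
print("predicted sales at expenditure 6.0:", predict(slope, intercept, 6.0))
```

With real data the points would scatter around the fitted line, and a nonlinear model of dependency may be needed when a straight line fits poorly.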
Deviation/Anomaly Detection
Detect significant deviations from normal behavior.
Applications:
Credit card fraud detection
Typical network traffic at University level may reach over 100 million connections per day
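A simple way to flag deviations from normal behavior is to measure how far each value lies from the mean in units of standard deviation. The sketch below assumes this z-score rule and a threshold of 2.0, neither of which comes from the slides; the transaction amounts are invented.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values whose distance from the mean exceeds
    threshold population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical credit card transaction amounts with one unusual purchase.
amounts = [20, 25, 22, 19, 24, 21, 23, 500, 20, 22]
print("anomalies:", find_anomalies(amounts))
```

A single extreme value inflates both the mean and the standard deviation, so on small samples the threshold has to be chosen with care; more robust measures exist for production use.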
First Assignment
Assignment 1: Identify a problem from your own experience that you think would be amenable to data mining. Describe:
(i) What the data is.
(ii) What type of benefit you might hope to get from data mining.
(iii) What type of data mining (classification, clustering, etc.) you think would be relevant.
For each, illustrate with an example, e.g., if you think clustering is relevant, describe what you think a likely cluster might contain and what the real-world meaning would be.
Submit two pages of 11-point single-spaced typeset text (leave 0.5 inch margins). Write your roll number and name.
Last Date: 14-08-08 (5PM)
References: Introductory chapters of any data mining book or any data mining paper, and the PPTs of the first two classes.
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Resource planning:
summarize and compare the resources and spending
Competition:
monitor competitors and market directions
group customers into classes and a class-based pricing procedure
set pricing strategy in a highly competitive market
Summary
Data mining: Discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories