
Outline

Background
Content of human mind, Sample data mining problems, Why data mining ?

Definition, KDD process, System architecture Data Visualization Data warehousing Course outline Overview of Data mining tasks Top 10 data mining algorithms Issues in data mining Data mining applications Summary

Data, Information, Knowledge, and Wisdom
by Gene Bellinger, Durval Castro, Anthony Mills

According to Russell Ackoff, the content of the human mind can be classified into five categories: Data, Information, Knowledge, Understanding and Wisdom.

Data: Symbols
Data represents a fact or statement of event without relation to other things. Data is raw. It simply exists and has no significance beyond its existence (in and of itself). It can exist in any form, usable or not. It does not have meaning of itself. In computer parlance, a spreadsheet generally starts out by holding data.
Ex: It is raining.

Content of Human Mind Information: Data that are processed to be useful; provides answer to who, what, where, and when questions.
Information is data that has been given meaning by way of relational connection. This "meaning" can be useful, but does not have to be. In computer parlance, a relational database makes information from the data stored within it. Information embodies the understanding of a relationship of some sort, possibly cause and effect.
Ex: The temperature dropped 15 degrees and then it started raining.

Content of Human Mind


Knowledge: application of data and information; answers how questions.
Knowledge is the appropriate collection of information, such that its intent is to be useful. Knowledge is a deterministic process. When someone "memorizes" information (as less-aspiring test-bound students often do), then they have amassed knowledge.
Ex: If the humidity is very high and the temperature drops suddenly, the atmosphere is often unlikely to be able to hold the moisture, so it rains.

Content of Human Mind


Understanding: appreciation of why
It is the process by which one can take knowledge and synthesize new knowledge from the previously held knowledge. The difference between understanding and knowledge is the difference between "learning" and "memorizing". People who have understanding can undertake useful actions because they can synthesize new knowledge, or in some cases, at least new information, from what is previously known (and understood). That is, understanding can build upon currently held information, knowledge and understanding itself. In computer parlance, AI systems possess understanding in the sense that they are able to synthesize new knowledge from previously stored information and knowledge.

Content of human mind


Wisdom: evaluated understanding
It is the process by which we also discern, or judge, between right and wrong, good and bad. I personally believe that computers do not have, and will never have, the ability to possess wisdom.

Ex: It rains because it rains. And this encompasses an understanding of all the interactions that happen between raining, evaporation, air currents, temperature gradients, changes, and raining.

Sample data mining problem # 1


I manage a supermarket (restaurant, video store, book store) and my cash register (or web site) pumps transactions into my DB.
Can you help me visualize my sales? Can you profile my customers? Tell me something interesting. I do not know statistics, and I do not want to hire statisticians.

Sample data mining problem #2


I am an astronomer and I have a sky survey: 3 terabytes of data, 2 billion objects. Can you help to recognize the objects? Most of my data is beyond my reach. Can you find new/unusual items in my data? Can you help me with basic manipulation, so I can focus on basic science? I know my data and statistics, but that is not enough.

About Data mining


Look up a few records: SQL
Populate a standard report: SQL
Create a new report: OLAP/mining
Optimize a business process, locate a new problem, understand something new, answer a tough question: Data mining

Evolution of Database Technology
Before 1960s: Primitive file processing
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems

Why Data Mining ?


The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
  Automated data collection tools, database systems, Web, computerized society

Major sources of abundant data


Business: Web, e-commerce, transactions, stocks, ...
Science: Remote sensing, bioinformatics, scientific simulation, ...
Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

"Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/credit card transactions

Computers have become cheaper and more powerful

Competitive Pressure is Strong


Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint


Data collected and stored at enormous speeds (GB/hour)
  remote sensors on a satellite
  telescopes scanning the skies
  microarrays generating gene expression data
  scientific simulations generating terabytes of data

Traditional techniques infeasible for raw data
Data mining may help scientists
  in classifying and segmenting data
  in hypothesis formation

Mining Large Data Sets - Motivation


There is often information hidden in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995-1999; vertical axis from 0 to 4,000,000.]

From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications

Evolution of Sciences
Before 1600: empirical science
1600-1950s: theoretical science
  Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s: computational science
  Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
  Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now: data science
  The flood of data from new scientific instruments and simulations
  The ability to economically store and manage petabytes of data online
  The Internet and computing Grid that makes all these archives universally accessible

Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002


What Is Data Mining?


Data mining (knowledge discovery in databases):
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their "inside stories":
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
  (Deductive) query processing.
  Expert systems or small ML/statistical programs

What is (not) Data Mining?


What is not Data Mining?
  Look up a phone number in a phone directory
  Query a Web search engine for information about "Amazon"

What is Data Mining?
  Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
  Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Data Mining: A KDD Process


Data mining: the core of the knowledge discovery process.

[Figure: KDD process flow: Databases --> Data Cleaning and Data Integration --> Data Warehouse --> Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation]

Steps of a KDD Process


Learning the application domain: relevant prior knowledge and goals of application
Data cleaning: to remove noise and inconsistent data
Data integration: multiple data sources can be combined
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarization, association, classification, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
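
The KDD steps can be pictured as a chain of small functions. The following is only a toy sketch: the records, the function names and the frequency threshold are all hypothetical, and the "mining" step is plain counting rather than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records (transaction id, item) from two sources.
# None marks a missing value; case is inconsistent across sources.
source_a = [("t1", "milk"), ("t1", "bread"), ("t2", "milk"), ("t2", None)]
source_b = [("t3", "Milk"), ("t3", "beer"), ("t4", "bread"), ("t4", "milk")]

def clean(records):
    """Data cleaning: drop records with missing items, normalize case."""
    return [(tid, item.lower()) for tid, item in records if item is not None]

def integrate(*sources):
    """Data integration: combine several cleaned sources into one list."""
    return [rec for src in sources for rec in src]

def select(records, items_of_interest):
    """Data selection: keep only the task-relevant records."""
    return [rec for rec in records if rec[1] in items_of_interest]

def mine(records):
    """Data mining (here only summarization): count item occurrences."""
    return Counter(item for _, item in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a frequency threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

data = integrate(clean(source_a), clean(source_b))
target = select(data, {"milk", "bread", "beer"})
patterns = evaluate(mine(target), min_count=2)
print(patterns)
```

Each stage takes the output of the previous one, which mirrors the order of the steps; in practice the process is iterative and stages are revisited.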

Architecture: Typical Data Mining System


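[Figure: layered architecture of a typical data mining system. Layers, top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; Database, Data Warehouse, World-Wide Web, Other Info Repositories. A Knowledge Base sits alongside the layers.]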

Components of data mining system


Database, data warehouse, World Wide Web or other information repository
  Data cleaning and data integration techniques are performed on this data

Database and data warehouse server: Responsible for fetching the relevant data, based on the user's data mining request.
Knowledge base: Domain knowledge which is used to guide the data mining process.
  Attribute levels, semantics, user beliefs, pattern interestingness, thresholds, metadata

Data mining engine: Set of functional modules for tasks such as characterization, summarization, association, classification, clustering, outlier extraction
Pattern evaluation: Employs interestingness measures
  Push pattern evaluation as deep as possible into the mining process so that the search can be optimized.

User interface: communication between users and the data mining system.


Data Visualization
One Picture May Be Worth 1000 Words!


Visual Data Mining
Visualization of data Visualization of data mining results Visualization of data mining processes Interactive data mining: visual classification

One melody may be worth 1000 words too!


Audio data mining: turn data into music and melody!

Uses audio signals to indicate the patterns of data or the features of data mining results

Visualization of data mining results in SAS Enterprise Miner: scatter plots

Visualization of association rules in MineSet 3.0

Visualization of a decision tree in MineSet 3.0

Visualization of Data Mining Processes by Clementine

Interactive Visual Mining by Perception-Based Classification (PBC)

Visualization on NTT i-Townpage

Traversal Diagram

Visitor Success Path

Day/Night Success Path

Data Mining and Business Intelligence


[Figure: pyramid whose potential to support business decisions increases from bottom to top. Layers, top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA]

Data Mining: Confluence of Multiple Disciplines


[Figure: data mining at the confluence of Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

Other disciplines: pattern recognition, image processing, signal processing, spatial or temporal data analysis.

Regarding this course


Emphasis is on efficient and scalable data mining techniques.
Algorithms must be highly scalable to handle data volumes on the order of terabytes.
Scalability: Running time should grow approximately linearly in proportion to the size of data, given the available resources such as main memory and disk space.

Using the proposed techniques, interesting knowledge, regularities or high-level information can be extracted from the databases and viewed or browsed from different angles.
Efficiency: Without compromising quality

Why Not Traditional Data Analysis? (statistics, ...)


Tremendous amount of data
  Algorithms must be highly scalable to handle data volumes on the order of terabytes
  Scalability: Running time should grow approximately linearly in proportion to the size of data.
High-dimensionality of data
  Micro-array may have tens of thousands of dimensions
High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structure data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations
New and sophisticated applications

Multi-Dimensional View of Data Mining


Data to be mined
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels
Techniques utilized
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: On What Kinds of Data?


Database-oriented data sets and applications
  Relational database, data warehouse, transactional database

Advanced data sets and advanced applications
  Data streams and sensor data
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structure data, graphs, social networks and multi-linked data
  Object-relational databases
  Heterogeneous databases and legacy databases
  Spatial data and spatiotemporal data
  Multimedia database
  Text databases
  The World-Wide Web

Data Mining Functionalities


Multidimensional concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
  Diaper --> Beer [0.5%, 75%] (Correlation or causality?)
Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  Predict some unknown or missing numerical values

Data Mining Functionalities (2)


Cluster analysis
  Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
  Outlier: Data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
  Trend and deviation: e.g., regression analysis
  Sequential pattern mining: e.g., digital camera --> large SD memory
  Periodicity analysis
  Similarity-based analysis
Other pattern-directed or statistical analyses
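
As one possible illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). The z-score rule, the threshold of 2.0 and the sample amounts are assumptions made for this example; the slides do not prescribe a particular outlier test.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - m) / s > threshold]

# Hypothetical transaction amounts; one is far from the general behavior.
amounts = [20, 22, 19, 21, 23, 20, 250]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise or a genuine exception (for example, fraud) still has to be decided by domain knowledge.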


What is Data Warehouse?


Defined in many different ways, but not rigorously.
  A decision support database that is maintained separately from the organization's operational database
  Supports information processing by providing a solid platform of consolidated, historical data for analysis.

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

Data warehousing:
  The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
  relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  E.g., Hotel price: currency, tax, breakfast covered, etc.
  When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current value data
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
  Contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a time element

Data Warehouse: Nonvolatile
A physically separate store of data transformed from the operational environment
Operational update of data does not occur in the data warehouse environment
  Does not require transaction processing, recovery, and concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse vs. Heterogeneous DBMS


Traditional heterogeneous DB integration: a query-driven approach
  Build wrappers/mediators on top of heterogeneous databases
  When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
  Complex information filtering, compete for resources

Data warehouse: update-driven, high performance
  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS


OLTP (on-line transaction processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)
  Major task of data warehouse system
  Data analysis and decision making

Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP


Feature: OLTP / OLAP
users: clerk, IT professional / knowledge worker
function: day to day operations / decision support
DB design: application-oriented / subject-oriented
data: current, up-to-date, detailed, flat relational, isolated / historical, summarized, multidimensional, integrated, consolidated
usage: repetitive / ad-hoc
access: read/write, index/hash on prim. key / lots of scans
unit of work: short, simple transaction / complex query
# records accessed: tens / millions
# users: thousands / hundreds
DB size: 100MB-GB / 100GB-TB
metric: transaction throughput / query throughput, response
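
The difference in access patterns can be seen in miniature with Python's built-in sqlite3 module. This is only a sketch: the `sales` table, its columns and its rows are hypothetical, and a real warehouse would use a separate, historical, multidimensional store rather than the same table.

```python
import sqlite3

# In-memory database with a small hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
rows = [
    (1, "north", 2005, 100.0),
    (2, "north", 2006, 150.0),
    (3, "south", 2005, 80.0),
    (4, "south", 2006, 120.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# OLTP-style access: a short, simple transaction that touches one
# record through its primary key.
conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (110.0, 1))
conn.commit()

# OLAP-style access: a read-only query that scans many records and
# returns a consolidated (summarized) view.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
conn.close()
```

The update touches a single row, while the aggregate query reads every row, which is why the two workloads are tuned and measured differently.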

Why Separate Data Warehouse?


High performance for both systems
  DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions and different data:


missing data: Decision support requires historical data which operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Note: There are more and more systems which perform OLAP analysis directly on relational databases


Objectives
Data mining is the process of extracting interesting and useful information/knowledge from large databases or data warehouses. The course covers
  the concepts and techniques of data mining such as association rules, clustering, and classification.
  the basic concepts, architecture and general implementations of data warehousing technology.

Course topics
Introduction (3 hrs): Definition, KDD framework, Issues in data mining.
Association Rules (9 hrs): Problem definition, Frequent item-set generation, Apriori and FP-growth algorithms, Evaluation of association patterns.
Clustering (9 hrs): Overview, Types of data, K-means, Agglomerative clustering, Clustering algorithms (DBSCAN, BIRCH, CURE, ROCK, CHAMELEON).
Classification (9 hrs): Overview, Decision tree induction, Over-fitting and under-fitting, Scalable decision tree algorithms, Bayesian classification, Regression-based prediction methods.
Data preprocessing (6 hrs): Data summarization, Data cleaning, Data integration and transformation, Data reduction, Data discretization and concept hierarchy.
Data warehousing (9 hrs): Multidimensional data model, Data warehousing architecture, Data cube computation and OLAP technology.

Text Books
Research Papers:
In this course, about 25 research papers will be covered. Students can refer to the following books for the details of some research papers and other background information.

Text books
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second edition, 2006, Elsevier Inc.
Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, 2006, Pearson Education.

Reference Books:
Papers from the proceeding of the conferences and journals related to data mining and data warehousing.

LAB WORK
Several data mining tasks related to data preprocessing, association rules, clustering and classification will be given.

Outcome
After completing the course, the students will
  be able to appreciate the importance of extracting useful knowledge from large amounts of data to improve the performance of a business/organization.
  get enough exposure to investigate new/improved data mining methods.
  understand the basics of data warehousing technology and its links to data mining.
  be able to play the role of a data miner in an organization.

GRADING
MidSem I: 15%; MidSem II: 15%; EndSem: 30%; Research Paper Quiz: 10%; Project/Lab: 30%

A Brief History of Data Mining Society


1989 IJCAI Workshop on Knowledge Discovery in Databases
  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
  PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
ACM Transactions on KDD starting in 2007

Conferences and Journals on Data Mining


KDD Conferences
  ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  SIAM Data Mining Conf. (SDM)
  (IEEE) Int. Conf. on Data Mining (ICDM)
  Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

Other related conferences
  ACM SIGMOD
  VLDB
  (IEEE) ICDE
  WWW, SIGIR
  ICML, CVPR, NIPS

Journals
  Data Mining and Knowledge Discovery (DAMI or DMKD)
  IEEE Trans. on Knowledge and Data Eng. (TKDE)
  KDD Explorations
  ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google


Data mining and KDD (SIGKDD: CDROM)
  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD

Database systems (SIGMOD: ACM SIGMOD Anthology CD ROM)
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.

AI & Machine Learning
  Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.

Web and IR
  Conferences: SIGIR, WWW, CIKM, etc.
  Journals: WWW: Internet and Web Information Systems

Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of Statistics, etc.

Visualization
  Conference proceedings: CHI, ACM-SIGGraph, etc.
  Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001
B. Liu, Web Data Mining, Springer, 2006
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed., 2005


Data Mining Tasks


Prediction Methods
Use some variables to predict unknown or future values of other variables.

Description Methods
Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...


Association Rule Discovery [Descriptive]
Clustering [Descriptive]
Classification [Predictive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]

Association Rule Discovery: Definition


Given a set of records, each of which contains some number of items from a given collection:
Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
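
How strongly the five transactions back each discovered rule can be checked with the two usual measures, support and confidence. The sketch below uses the standard definitions (support: fraction of transactions containing all items of the rule; confidence: fraction of transactions containing the antecedent that also contain the consequent); the function names are chosen only for this illustration.

```python
# The five market-basket transactions from the slide (TID 1 to 5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(sorted(lhs), "-->", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, so the support is 3/5 and the confidence is 3/4.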

Association Rule Discovery: Application 1


Marketing and Sales Promotion:
Let the rule discovered be {Bagels, ...} --> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

Association Rule Discovery: Application 2


Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
A classic rule:
  If a customer buys diapers and milk, then he is very likely to buy beer.
  So, don't be surprised if you find six-packs stacked next to diapers!

Association Rule Discovery: Application 3


Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Sequential Pattern Discovery: Definition


Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

(A B) (C) --> (D E)

Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

[Figure: the pattern (A B) (C) (D E) annotated with the timing constraints <= xg, > ng, <= ws, and <= ms]

Sequential Pattern Discovery: Examples


In telecommunications alarm logs,
(Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)

In point-of-sale transaction sequences,


Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)
Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)

Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another.

Similarity Measures:
Euclidean Distance if attributes are continuous. Other Problem-specific Measures.
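
A small sketch of clustering with Euclidean distance as the similarity measure follows. It uses a bare-bones k-means loop (k-means is listed in the course topics); the sample points, the choice of initial centroids and the fixed iteration count are assumptions made only for this illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ended up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
for c in clusters:
    print(c)
```

Points that end up in the same cluster are close to each other and far from the points in the other cluster, which is the property the definition asks for.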

Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized

Intercluster distances are maximized

Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
  Collect different attributes of customers based on their geographical and lifestyle related information.
  Find clusters of similar customers.
  Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
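
One common way to turn term frequencies into a similarity measure is the cosine of the angle between the term-frequency vectors. The slides do not name a specific measure, so cosine similarity, the whitespace tokenizer and the three sample documents below are assumptions for this sketch.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors
    (1.0 for identical direction, 0.0 when no terms are shared)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents: two about finance, one about sports.
docs = {
    "d1": "stock market prices fall",
    "d2": "stock prices rise as market rallies",
    "d3": "team wins the final match",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(f"d1 vs d2: {sim_12:.3f}")
print(f"d1 vs d3: {sim_13:.3f}")
```

A clustering algorithm can then group documents whose pairwise similarity is high; in practice common words are usually filtered out first.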

Illustrating Document Clustering


Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).

Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278

Clustering of S&P 500 Stock Data


Observe Stock Movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
We used association rules to quantify a similarity measure.

Discovered Clusters and Industry Group

Cluster 1 (Industry Group: Technology1-DOWN)
Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Industry Group: Technology2-DOWN)
Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Industry Group: Financial-DOWN)
Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Industry Group: Oil-UP)
Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
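The train/test procedure described above can be sketched in a few lines. This is an illustration only: the records are made up, the function names are hypothetical, and the "model" is a majority-class baseline rather than a real learning algorithm, so that the evaluation step stays in focus.

```python
import random
from collections import Counter


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def learn_majority_class(training_set):
    """Baseline 'model': always predict the most common class in the training set."""
    most_common = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: most_common


def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(model(attributes) == label for attributes, label in test_set)
    return correct / len(test_set)


# Hypothetical labelled records: (attributes, class).
records = [({"income": 40 + 10 * i}, "Yes" if i % 3 == 0 else "No") for i in range(10)]

training_set, test_set = train_test_split(records)
model = learn_majority_class(training_set)
print(f"accuracy on held-out records: {accuracy(model, test_set):.2f}")
```

Any classifier with the same call shape (attributes in, class out) can be substituted for the baseline and compared on the same held-out records.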

Classification Example
Training Set

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set

Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

[Figure: Training Set --> Learn Classifier --> Model; the Model is applied to the Test Set.]

Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.

Use this information as input attributes to learn a classifier model.


From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes.
When does a customer buy, what does he buy, how often does he pay on time, etc.

Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed records of transactions with each of the past and present customers to find attributes.
How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.

Label the customers as loyal or disloyal.
Find a model for loyalty.


From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 4
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
3000 images with 23,040 x 23,040 pixels per image.

Approach:
Segment the image.
Measure image attributes (features) - 40 of them per object.
Model the class based on these features.
Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
Courtesy: http://aps.umn.edu

Class: Stages of Formation (Early, Intermediate, Late)
Attributes: Image features, characteristics of light waves received, etc.
Data Size: 72 million stars, 20 million galaxies; Object Catalog: 9 GB; Image Database: 150 GB

Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics and in the neural network field.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
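For the linear case, the first example above (sales as a function of advertising expenditure) reduces to fitting a straight line by ordinary least squares. The sketch below uses made-up figures and a hypothetical `fit_line` helper; it covers a single predictor only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and resulting sales, arbitrary units.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept  # predicted sales for a new expenditure of 6.0
print(f"sales = {slope:.2f} * advertising + {intercept:.2f}; prediction at 6.0: {predicted:.2f}")
```

With several predictors (temperature, humidity, air pressure) the same idea generalizes to multiple linear regression, usually solved with a linear algebra library.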

Deviation/Anomaly Detection
Detect significant deviations from normal behavior.
Applications:
Credit Card Fraud Detection
Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day.
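A very simple way to flag a deviation from normal behavior is to measure how many standard deviations a value lies from the mean (a z-score). This is only one of many possible detectors, chosen here for brevity; the connection counts, the threshold of 2.5, and the function name are hypothetical.

```python
import statistics


def find_anomalies(values, threshold=2.5):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


# Hypothetical connections per minute from one host; the last reading is unusual.
connections = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
print(find_anomalies(connections))
```

Because the outlier itself inflates the mean and standard deviation, robust variants based on the median are often preferred in practice, and the threshold has to be tuned to the sample size.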

First Assignment
Assignment 1: Identify a problem from your own experience that you think would be amenable to data mining. Describe:
(i) What the data is.
(ii) What type of benefit you might hope to get from data mining.
(iii) What type of data mining (classification, clustering, etc.) you think would be relevant.
For each, illustrate with an example, e.g., if you think clustering is relevant, describe what you think a likely cluster might contain and what the real-world meaning would be.
Submit two pages of 11-point single-spaced typeset text (leave 0.5-inch margins). Write your roll number and name.
Last Date: 14-08-08 (5 PM)
References: Introductory chapters of any data mining book or any data mining paper, and the PPTs of the first two classes.

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)


Classification
#1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
#2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
#3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
#4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.

Statistical Learning
#5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
#6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.

Association Analysis
#7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
#8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.

The 18 Identified Candidates (II)


Link Mining
#9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
#10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.

Clustering
#11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
#12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.

Bagging and Boosting
#13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.

The 18 Identified Candidates (III)


Sequential Patterns
#14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
#15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.

Integrated Mining
#16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.

Rough Sets
#17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992.

Graph Mining
#18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

Top-10 Algorithms Finally Selected at ICDM '06

#1: C4.5 (61 votes)
#2: K-Means (60 votes)
#3: SVM (58 votes)
#4: Apriori (52 votes)
#5: EM (48 votes)
#6: PageRank (46 votes)
#7: AdaBoost (45 votes)
#7: kNN (45 votes)
#7: Naive Bayes (45 votes)
#10: CART (34 votes)

Challenges of Data Mining


Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data

Major Issues in Data Mining


Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion

User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts
  Domain-specific data mining & invisible data mining
  Protection of data security, integrity, and privacy

DM applications: Market Analysis and Management


Where are the data sources for analysis?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
  Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
  Associations/correlations between product sales
  Prediction based on the association information
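Associations between product sales are commonly quantified with support (how often the items sell together) and confidence (how often the consequent sells given the antecedent). The sketch below assumes these standard definitions; the baskets and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    support_antecedent = support(transactions, antecedent)
    if support_antecedent == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / support_antecedent


# Hypothetical point-of-sale baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

print("support(diapers, beer):", support(baskets, {"diapers", "beer"}))
print("confidence(diapers -> beer):", confidence(baskets, {"diapers"}, {"beer"}))
```

A rule with high support and confidence can then drive a prediction, for example which product to recommend alongside another.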

DM applications: Market Analysis and Management.


Customer profiling
  data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
  identifying the best products for different customers
  use prediction to find what factors will attract new customers
Provides summary information
  various multidimensional summary reports
  statistical summary information (data central tendency and variation)

DM applications: Corporate Analysis and Risk Management

Finance planning and asset evaluation
  cash flow analysis and prediction
  contingent claim analysis to evaluate assets
  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning
  summarize and compare the resources and spending

Competition
  monitor competitors and market directions
  group customers into classes and a class-based pricing procedure
  set pricing strategy in a highly competitive market

DM applications: Fraud Detection and Management


Applications
  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
  auto insurance: detect a group of people who stage accidents to collect on insurance
  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  medical insurance: detect professional patients and ring of doctors and ring of references

DM applications: Fraud Detection and Management


Detecting inappropriate medical treatment
  Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
Detecting telephone fraud
  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail
  Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications of data mining


Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Summary
Data mining: Discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining systems and architectures
Data warehousing
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Major issues in data mining
