
Data Mining Concepts
TDS3301 LECTURE 1
Think about this…
A manager of a restaurant wants to identify the common
sets of preferences among customers.
A software developer wants to describe the frequent
patterns exhibited by users of a courseware package.
A fund manager wants to predict the stock market for
the next couple of days.

What Is Data Mining?
Data mining (knowledge discovery in databases):
◦ Extraction of interesting information or patterns (non-trivial,
implicit, previously unknown and potentially useful) from data
in large databases

Alternative names:
◦ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.

What Is Data Mining?

Data mining is inductive (bottom-up reasoning, from observations
to generalizations).
What is not data mining?
◦ (Deductive) query processing.
◦ Expert systems or small statistical programs

What Motivated Data Mining?

Data explosion problem
◦ Too much data. From database to data warehouse.
Limitations of SQL commands, statistical presentations
Limitations of experts’ knowledge in certain areas, e.g., prediction
of earthquakes

Data mining supports…
Database analysis and decision support
◦ Market analysis and management
◦ Risk analysis and management
◦ Fraud detection and management

Text mining (newsgroups, email, documents) and Web analysis.
Intelligent query answering.

TDS 3341 DATA MINING
Market Analysis and Management (1)
Data sources for analysis:
◦ Credit card, loyalty cards, discount coupons, customer complaint
calls, public lifestyle studies
Target marketing:
◦ Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
◦ Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
◦ Associations/correlations between product sales.
◦ Prediction based on the association information.
Market Analysis and Management (2)
Customer profiling
◦ What types of customers buy what products (clustering or
classification).
Identifying customer requirements
◦ Identifying the best products for different customers.
◦ Use prediction to find what factors will attract new customers.
Provides summary information
◦ Various multidimensional summary reports.
◦ Statistical summary information (data central tendency and
variation)
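As a minimal sketch of the statistical summary information mentioned above, the following Python snippet computes central tendency (mean, median) and variation (range, sample standard deviation). The purchase amounts and the function name `summarize` are illustrative assumptions, not part of the lecture.

```python
import statistics

def summarize(values):
    """Return basic central-tendency and variation measures for a numeric list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical purchase amounts for one customer segment.
purchases = [20, 25, 30, 35, 90]
summary = summarize(purchases)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Note how the single large purchase pulls the mean (40) well above the median (30); reporting both is one reason summaries list several measures.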

Corporate Analysis and Risk
Management
Finance planning and asset evaluation
◦ Cash flow analysis and prediction.
◦ Cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.).
Resource planning
◦ Summarize and compare the resources and spending.
Competition
◦ Monitor competitors and market directions.
◦ Group customers into classes and set up a class-based pricing
procedure.
◦ Set pricing strategy in a highly competitive market.
Fraud Detection and Management (1)
Applications
◦ Widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
◦ Use historical data to build models of fraudulent behavior and use
data mining to help identify similar instances.
Examples:
◦ auto insurance: detect a group of people who stage accidents to
collect on insurance.
◦ money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network).
◦ medical insurance: detect professional patients, rings of doctors,
and rings of references.
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
◦ The Australian Health Insurance Commission identified that in many
cases blanket screening tests were requested (saving A$1m/yr).
Detecting telephone fraud
◦ Telephone call model: destination of the call, duration, time of
day or week. Analyze patterns that deviate from an expected
norm.
◦ British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and broke
a multimillion dollar fraud.
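The idea of analyzing patterns that deviate from an expected norm can be sketched as follows. This is only one simple way to do it: it scores each new call duration against the caller's own history using a z-score. The cut-off of 3 standard deviations, the sample durations, and the name `flag_deviations` are assumptions for illustration; a real telephone call model would also use destination and time of day.

```python
import statistics

def flag_deviations(history, new_calls, threshold=3.0):
    """Return the new call durations whose z-score against the caller's
    history exceeds the threshold (an assumed cut-off, not from the lecture).
    Assumes the history has at least two values and is not constant."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    flagged = []
    for duration in new_calls:
        z = (duration - mean) / stdev
        if abs(z) > threshold:
            flagged.append(duration)
    return flagged

# Hypothetical call durations in minutes.
history = [3, 4, 5, 4, 6, 5, 4, 5]
new_calls = [5, 4, 55, 6]
suspicious = flag_deviations(history, new_calls)
print(suspicious)
```

Only the 55-minute call is flagged here; the other new calls lie within the caller's usual range.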

Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process pipeline. Databases and flat files → data cleaning and data integration → data warehouse → selection and transformation → task-relevant data → data mining → patterns → evaluation and presentation.]
Steps of a KDD Process
Learning the application domain:
◦ relevant prior knowledge and goals of application.
Creating a target data set: data selection.
Data cleaning and preprocessing (may take 60% of the effort!).
Data reduction and transformation:
◦ Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining.
◦ summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest.
Pattern evaluation and knowledge presentation:
◦ visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge.
Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Roles shown beside the layers are given in parentheses.]
◦ Making decisions (end user)
◦ Data presentation: visualization techniques (business analyst)
◦ Data mining: information discovery (data analyst)
◦ Data exploration: statistical analysis, querying and reporting
◦ Data warehouses / data marts: OLAP, MDA
◦ Data sources: paper, files, information providers, database systems, OLTP (DBA)

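Architecture: Typical Data Mining System
[Figure: layered architecture, listed here from top to bottom.]
◦ Graphical user interface
◦ Pattern evaluation
◦ Data mining engine
◦ Database or data warehouse server
◦ Data cleaning, data integration, and filtering
◦ Database and data warehouse
A knowledge base is shown alongside the data mining engine and pattern evaluation layers.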

Data Mining: On What Kind of Data?

Relational databases
Data warehouses (repository, from multiple sources)
Transactional databases
Advanced DB and information repositories
◦ Object-oriented and object-relational databases
◦ Spatial databases (relating to space: geographic DB)
◦ Time-series data (data collected hourly, daily, weekly), sequence
databases (sequences of ordered events) and temporal data
(relational data with a time attribute)
◦ Text databases and multimedia databases
◦ Heterogeneous and legacy databases
◦ The WWW
Data Mining Functionalities (1)
Concept/Class Description
◦ Data can be associated with classes or concepts
◦ Ex: Computer shop. Classes of items: printer, computer.
Concepts of consumers: bigSpender, budgetSpender.
Can be derived via
◦ Data characterization: refers to a summarization of the general
characteristics or features of a target class of data
◦ Data discrimination: refers to a comparison of the general
features of target class data objects with the general features
of objects from one or a set of contrasting classes

Data Mining Functionalities (1)
Association analysis (correlation vs. causality)
Single-dimensional association rule:
Diaper --> Beer (Simplified)
[support = 0.5%, confidence = 75%]
- A confidence, or certainty, of 75% means that if a customer buys a diaper, there is
a 75% chance that he/she will buy beer as well.
- A 0.5% support means that 0.5% of all of the transactions under analysis showed
that diaper and beer were purchased together.

Multi-dimensional association rule:
age(X, “20...29”) ^ income(X, “20K...29K”) --> buys(X, “CD player”)
[support = 2%, confidence = 60%]
- 2% of the customers are 20 to 29 years of age with an income of 20,000 to 29,000 and
have purchased a CD player. There is a 60% probability that a customer in this age and
income group will purchase a CD player.
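To make the two measures concrete, here is a small sketch that computes support and confidence for the rule Diaper --> Beer. The five transactions are invented for illustration (they are not the data behind the 0.5% / 75% figures above), and the helper names `support` and `confidence` are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market-basket transactions.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]
rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

In this toy data, diaper and beer appear together in 3 of 5 transactions (support 60%), and 3 of the 4 diaper transactions also contain beer (confidence 75%).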

Data Mining Functionalities (2)
Classification and Prediction
◦ Finding models (functions) that describe and distinguish classes
or concepts for future prediction.
◦ E.g., classify countries based on climate, or classify cars based
on gas mileage.
◦ Presentation: decision-tree, classification rule, neural network.
◦ Prediction: Predict some unknown or missing numerical values.
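As a rough illustration of learning a model that distinguishes classes, the sketch below trains a one-level decision tree (a "decision stump") that classifies cars by gas mileage. It is a deliberately tiny stand-in for a full decision-tree learner such as C4.5 or CART; the mileage values, the labels "low" and "high", and the function names are assumptions made for the example.

```python
def train_stump(samples):
    """Learn a one-level decision tree (a 'stump') on a single numeric feature.

    samples is a list of (value, label) pairs with exactly two distinct labels
    and at least two distinct values.
    Returns (threshold, label_below, label_above) minimizing training errors.
    """
    values = sorted({v for v, _ in samples})
    labels = sorted({lab for _, lab in samples})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        # Try both ways of assigning the two labels to the two sides.
        for below, above in (labels, labels[::-1]):
            errors = sum(
                1 for v, lab in samples
                if (below if v <= threshold else above) != lab
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best[1:]

def predict(stump, value):
    """Classify a new value with a trained stump."""
    threshold, below, above = stump
    return below if value <= threshold else above

# Hypothetical training data: (miles per gallon, class label).
cars = [(15, "low"), (18, "low"), (22, "low"),
        (30, "high"), (35, "high"), (41, "high")]
stump = train_stump(cars)
print(stump, predict(stump, 20), predict(stump, 33))
```

The learned rule is human-readable ("mileage at or below the threshold means low"), which is the property the lecture attributes to decision trees and classification rules.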

Classification
Find ways to separate data items into pre-defined groups
◦ We know X and Y belong together; find other things in the same group
Requires “training data”: data items where the group is known
Uses:
◦ Profiling
Technologies:
◦ Generate decision trees (results are human understandable)
◦ Neural networks (black box classifier approach)
Example: “Route documents to most likely interested parties”
◦ English or non-English?
◦ Domestic or foreign?
[Figure: training data separated into groups.]

Data Mining Functionalities (3)
Cluster analysis
◦ Class label is unknown: Group data to form new classes,
e.g., cluster houses to find distribution patterns
◦ Clustering based on the principle: maximizing the intra-class
similarity and minimizing the interclass similarity
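One common way to act on this principle is k-means, sketched below in plain Python. Here "similarity" is taken to be closeness in Euclidean distance, which is an assumption of this example, as are the 2-D house coordinates, the starting centroids, and the function name `kmeans`.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) on 2-D points with given start centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                xs, ys = zip(*members)
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters

# Hypothetical house locations forming two visible groups.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(houses, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No class labels are supplied: the two groups emerge only from the distances between points, which is what distinguishes clustering from classification.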

Clustering
Find groups of similar data items
Statistical techniques require some definition of “distance” (e.g. between
travel profiles) while conceptual techniques use background concepts
and logical descriptions
Uses:
◦ Demographic analysis (related to structure of populations)
Technologies:
◦ Self-Organizing Maps
◦ Probability Densities
◦ Conceptual Clustering
Example: “Group people with similar travel profiles”
◦ George, Patricia
◦ Jeff, Evelyn, Chris
◦ Rob
[Figure: data items grouped into clusters.]

Data Mining Functionalities (4)
Outlier Analysis
◦ Outlier: a data object that does not comply with the general
behavior of the data.
◦ It can be considered as noise or exception but is quite useful in
fraud detection, rare events analysis.
Trend and Evolution Analysis
◦ Trend and deviation: regression analysis.
◦ Sequential pattern mining, periodicity analysis
◦ Similarity-based analysis.
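Trend and deviation analysis by regression can be sketched as follows: fit a straight line to a time series by ordinary least squares, then look at how far each observation lies from the fitted trend. The monthly sales figures and the function name `fit_trend` are assumptions for illustration.

```python
def fit_trend(ys):
    """Fit y = intercept + slope * t by ordinary least squares, t = 0, 1, 2, ..."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / sum(
        (t - t_mean) ** 2 for t in ts
    )
    intercept = y_mean - slope * t_mean
    return intercept, slope

# Hypothetical monthly sales figures.
sales = [10, 12, 14, 16, 30, 20]
intercept, slope = fit_trend(sales)
# Deviation of each observation from the fitted trend line.
residuals = [y - (intercept + slope * t) for t, y in enumerate(sales)]
largest = max(range(len(sales)), key=lambda t: abs(residuals[t]))
print(f"slope = {slope:.2f}, largest deviation at t = {largest}")
```

The slope summarizes the trend, while the observation with the largest residual (the spike at t = 4 in this data) is a candidate outlier of the kind discussed under outlier analysis.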

Are all of the patterns interesting?
A pattern is interesting if:
◦ It is easily understood by humans.
◦ It is valid on new or test data with some degree of
certainty.
◦ It is potentially useful.
◦ It is novel.
◦ It validates a hypothesis that the user sought to confirm.
An interesting pattern represents knowledge.

Are All the “Discovered” Patterns
Interesting?
Objective vs. subjective interestingness measures:
◦ Objective:
◦ Based on statistics and structures of patterns, e.g., support
and confidence for association rules, etc.
◦ Generally, each interestingness measure is associated with a
threshold.
◦ Subjective:
◦ based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.

Can We Find All and Only Interesting
Patterns?
Find all the interesting patterns: Completeness
◦ Can a data mining system find all the interesting patterns?
◦ Association vs. classification vs. clustering.
Search for only interesting patterns: Optimization
◦ Can a data mining system generate only interesting patterns?
◦ Approaches
◦ First generate all the patterns and then filter out the
uninteresting ones.
◦ Generate only the interesting patterns—mining query
optimization.
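The contrast between the two approaches can be illustrated with an Apriori-style search, which pushes a minimum-support threshold into pattern generation instead of enumerating every itemset and filtering afterwards. This is a simplified sketch, not the exact algorithm from the Apriori paper; "interesting" is narrowed here to "meets minimum support", and the transactions, the 40% threshold, and the function name `frequent_itemsets` are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets meeting min_support.

    min_support is a fraction of transactions. Candidates of size k+1 are
    built only from frequent itemsets of size k, so supersets of infrequent
    itemsets are never counted.
    """
    n = len(transactions)

    def is_frequent(itemset):
        return sum(1 for t in transactions if itemset <= t) / n >= min_support

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    result = list(level)
    while level:
        k = len(level[0]) + 1  # size of the next candidates
        prev = set(level)
        # Join step: unions of two frequent itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every subset of size k-1 must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = [c for c in candidates if is_frequent(c)]
        result.extend(level)
    return result

# Hypothetical market-basket transactions.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]
result = frequent_itemsets(transactions, min_support=0.4)
print(sorted(sorted(s) for s in result))
```

With these transactions, the candidate {diaper, beer, bread} is discarded in the prune step without being counted, because its subset {beer, bread} is not frequent. A generate-then-filter approach would have counted it first.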

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on the surrounding disciplines.]
◦ Database technology
◦ Statistics
◦ Machine learning
◦ Visualization
◦ Information science
◦ Other disciplines

Data Mining Schemes
Databases to be mined
◦ Relational, transactional, object-oriented, object-relational, active, spatial,
time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
Knowledge to be mined
◦ Characterization, discrimination, association, classification, clustering, trend,
deviation and outlier analysis, etc.
◦ Multiple/integrated functions and mining at multiple levels.
Techniques utilized
◦ Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, neural network, etc.
Applications adapted
◦ Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.

Major Issues in Data Mining (1)
Mining methodology
◦ Mining different kinds of knowledge in databases
◦ Mining knowledge in multi-dimensional space
◦ Handling noise and incomplete data
◦ Pattern evaluation: the interestingness problem. Not all
patterns are interesting.
User interaction
◦ Interactive mining with user interfaces
◦ Incorporation of background knowledge, as data mining is an
inter-disciplinary effort
◦ Expression and visualization of data mining results

Major Issues in Data Mining (1)
Performance and scalability
◦ Efficiency and scalability of data mining algorithms
◦ Parallel, distributed and incremental mining methods
Issues relating to the diversity of data types
◦ Handling relational and complex types of data.
◦ Mining information from heterogeneous databases and global
information systems (WWW).

Major Issues in Data Mining (2)
Issues related to applications and social impacts
◦ Application of discovered knowledge
◦ Domain-specific data mining tools
◦ Intelligent query answering
◦ Process control and decision making
◦ Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem.
◦ Protection of data security, integrity, and privacy.

Top-10 Most Popular DM Algorithms:
18 Identified Candidates (I)
Classification
◦ #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
◦ #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression
Trees. Wadsworth, 1984.
◦ #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive
Nearest Neighbor Classification. TPAMI. 18(6)
◦ #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat.
Statist. Rev. 69, 385-398.

Statistical Learning
◦ #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
◦ #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
Association Analysis
◦ #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining
Association Rules. In VLDB '94.
◦ #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate
generation. In SIGMOD '00.

The 18 Identified Candidates (II)
Link Mining
◦ #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale
hypertextual Web search engine. In WWW-7, 1998.
◦ #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked
environment. SODA, 1998.
Clustering
◦ #11. K-Means: MacQueen, J. B., Some methods for classification and
analysis of multivariate observations, in Proc. 5th Berkeley Symp.
Mathematical Statistics and Probability, 1967.
◦ #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an
efficient data clustering method for very large databases. In SIGMOD '96.
Bagging and Boosting
◦ #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic
generalization of on-line learning and an application to boosting. J.
Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.

The 18 Identified Candidates (III)
Sequential Patterns
◦ #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and
Performance Improvements. In Proceedings of the 5th International Conference on
Extending Database Technology, 1996.
◦ #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu.
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In
ICDE '01.

Integrated Mining
◦ #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining.
KDD-98.

Rough Sets
◦ #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about
Data, Kluwer Academic Publishers, Norwell, MA, 1992

Graph Mining
◦ #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In
ICDM '02.

Top-10 Algorithms Finally Selected at ICDM’06
#1: C4.5 (61 votes)
#2: K-Means (60 votes)
#3: SVM (58 votes)
#4: Apriori (52 votes)
#5: EM (48 votes)
#6: PageRank (46 votes)
#7: AdaBoost (45 votes)
#7: kNN (45 votes)
#7: Naive Bayes (45 votes)
#10: CART (34 votes)

Summary
Data mining: discovering interesting patterns from large amounts of data.
A natural evolution of database technology, in great demand, with wide
applications.
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation.

Summary
Mining can be performed in a variety of information repositories.
Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems.
Major issues in data mining.

References
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, 2001 (ISBN:1-55860-489-8).
Introduction to Data Mining,
http://www.cs.purdue.edu/homes/clifton/cs490d/
Introduction to Data Mining, http://www.cs.rpi.edu/~zaki/dmcourse/fall00/
Mathematical background, http://www.jfsowa.com/logic/math.htm
