Week 1-2
nadeem.majeed@uettaxila.edu.pk
Course Objectives
This course covers statistical analysis and data mining. Topics include
statistical analysis, data mining applications, data
preparation, data reduction, and various data
mining techniques (such as association, clustering,
classification, and anomaly detection).
Outline
Course Logistics
Data Mining Introduction
Four Key Characteristics
Combination of Theory and Application
Engineering Process
Collection of Functionalities
Interdisciplinary field
How do we categorize data mining systems?
History of Data Mining
Research Issues
Curse of Dimensionality
Artificial Intelligence in Sci Fi
Intelligence
The ability to solve problems
Consider the following sequence:
1, 3, 7, 13, 21, __
What is the next number?
Intelligence is the ability to reason in a logical way to
reach a conclusion
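One plausible way to reason about the sequence above is to look at the gaps between consecutive terms. The sketch below builds a table of successive differences and assumes the pattern continues once a row of differences becomes constant. The function name `next_term` is illustrative, and other continuations of the sequence are possible; this is only one logical reading.

```python
def next_term(sequence):
    """Extrapolate one term, assuming the last row of differences stays constant."""
    rows = [list(sequence)]
    # Keep taking differences until a row is constant (or has a single entry).
    while len(rows[-1]) > 1 and any(v != rows[-1][0] for v in rows[-1]):
        previous = rows[-1]
        rows.append([b - a for a, b in zip(previous, previous[1:])])
    # The next term is the sum of the last entry of every row.
    return sum(row[-1] for row in rows)


# Differences of 1, 3, 7, 13, 21 are 2, 4, 6, 8; their differences are 2, 2, 2.
print(next_term([1, 3, 7, 13, 21]))  # 31
```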
Intelligence
Ability to solve problems
Ability to plan and schedule
Ability to memorize and process information
Ability to answer fuzzy questions
Ability to learn
Ability to recognize
Ability to understand
Ability to perceive
And many more
The exciting new effort to make computers think ... machines with minds, in
the full and literal sense (Haugeland, 1985)
The study of the computations that make it possible to perceive, reason, and act
(Winston, 1992)
The art of creating machines that perform functions that require intelligence
when performed by people (Kurzweil, 1990)
Artificial Intelligence VS. Human Intelligence
Can you identify the profitable routes from an airline
reservation system?
Can you detect fraud from transactional
data?
Vision-based biometrics
How the Afghan Girl was Identified by Her Iris Patterns
Google Car
AISIGHT
Where am I
SPAM
Massive volumes of data from sensors and networks of sensors
Large Synoptic Survey Telescope (LSST): 40 TB/day (an SDSS every two days)
Key Characteristics
Combination of Theory and Application
Engineering Process
Data Pre-processing and Post-processing, Interpretation
Collection of Functionalities
Different Tasks and Algorithms
Interdisciplinary Field
Real Example from NBA
AS (Advanced Scout) software from IBM Research
Coach can assess the effectiveness of certain coaching
decisions
Good/bad player matchups
Plays that work well against a given team
Raw Data: Play-by-play information recorded by
teams
Who is on court
Who took a shot, the type of shot, the outcome, any
rebounds
Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship
management (CRM), market basket analysis,
cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved
underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns
(outliers)
Other Applications
Text mining (news group, email, documents)
and Web mining
Stream data mining
System and Network Management
Multimedia Applications
Music, Image, Video
DNA and bio-data analysis
Example: Use in retailing
Goal: Improved business efficiency
Improve marketing (advertise to the most likely buyers)
Inventory reduction (stock only needed quantities)
Information source: Historical business data
Example: Supermarket sales records
Date  Time   Register  Fish  Turkey  Cranberries  Wine  ...
12/6  13:15  2         N     Y       Y            N     ...
12/6  13:16  3         Y     N       N            Y     ...
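As a rough illustration of how such sales records feed market basket analysis, the sketch below computes the support of an item set and the confidence of a rule. The first two transactions come from the records above; the other two are hypothetical and added only so the numbers are not trivial. The helper names `support` and `confidence` are illustrative.

```python
# Each transaction is the set of items bought together.
transactions = [
    {"turkey", "cranberries"},          # register 2, 12/6 13:15 (from the records above)
    {"fish", "wine"},                   # register 3, 12/6 13:16 (from the records above)
    {"turkey", "cranberries", "wine"},  # hypothetical extra record
    {"fish", "turkey"},                 # hypothetical extra record
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= record for record in records) / len(records)


def confidence(antecedent, consequent, records):
    """Estimate of P(consequent | antecedent) from the records."""
    both = set(antecedent) | set(consequent)
    return support(both, records) / support(antecedent, records)


print(support({"turkey", "cranberries"}, transactions))       # 0.5
print(confidence({"turkey"}, {"cranberries"}, transactions))  # about 0.67
```

A rule such as "turkey implies cranberries" with high support and confidence is the kind of pattern that could guide advertising or stocking decisions.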
Corporate Analysis and Risk Management
Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio,
trend analysis, etc.)
Resource planning:
summarize and compare the resources and
spending
Competition:
monitor competitors and market directions
group customers into classes and apply a class-based
pricing procedure
set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients, rings of
doctors, and rings of references
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
The Australian Health Insurance Commission identified that in
many cases blanket screening tests were requested (saving
A$1m/yr).
Detecting telephone fraud
Telephone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm.
British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and
broke a multimillion dollar fraud.
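A minimal sketch of "patterns that deviate from an expected norm", assuming the norm is summarized by the mean and standard deviation of past call durations. The data, the function names, and the threshold of three standard deviations are all hypothetical choices for illustration; a real telephone call model would also use destination and time of day or week.

```python
from statistics import mean, pstdev


def build_norm(history):
    """Summarize historical values as (mean, standard deviation)."""
    return mean(history), pstdev(history)


def deviates(value, norm, threshold=3.0):
    """True if value lies more than threshold standard deviations from the mean."""
    mu, sigma = norm
    return abs(value - mu) > threshold * sigma


# Hypothetical call durations in minutes for one customer.
past_durations = [3, 4, 5, 4, 3, 5, 4, 4]
norm = build_norm(past_durations)

new_calls = [4, 5, 45]
suspicious = [d for d in new_calls if deviates(d, norm)]
print(suspicious)  # [45]
```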
Retail
Analysts estimate that 38% of retail shrink is due to
dishonest employees.
Data Mining: An Engineering Process
Data mining: interactive and iterative process.
Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data ->
Mining Algorithms -> Patterns -> Interpretation/Evaluation -> Knowledge
adapted from:
U. Fayyad, et al. (1995), From Knowledge Discovery to Data
Mining: An Overview, Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Steps of a KDD Process
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Architecture of a Typical Data Mining System
[Figure labels: Pattern evaluation; Databases; Data Warehouse]
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
What Can Data Mining Do?
Cluster
Classify
Categorical, Regression
Semi-supervised
Summarize
Summary statistics, Summary rules
Link Analysis / Model Dependencies
Association rules
Sequence analysis
Time-series analysis, Sequential associations
Detect Deviations
Learning?
Definitions of learning from dictionary:
To get knowledge of by study,
experience, or being taught
To become aware by information or
from observation
To commit to memory
To be informed of, ascertain; to receive instruction
Machine Learning
Learning capabilities can improve the
performance of an intelligent system over
time.
A Generic System
Inputs x1, x2, ..., xN -> System (hidden variables h1, h2, ..., hK) -> Outputs y1, y2, ..., yM
When are ML algorithms
NOT needed?
When the relationships between all system
variables (input, output, and hidden) are
completely understood!
Machine Learning
Learning: Unsupervised Learning, Supervised Learning, Reinforcement Learning
Carpentry
Today!
Supervised Learning
What does Data Look Like?
Data
M observations:
For each observation (i) we have x(i) and y(i)
The Data and goal
Data: A set of data records (also called examples, instances or cases)
described by
k attributes: A1, A2, ..., Ak.
a class: Each example is labelled with a pre-defined class.
Goal: To learn a classification model from the data that can be used to
predict the classes of new (future, or test) cases/instances.
Supervised Learning
Training Set -> Learning Algorithm -> h
x -> h -> predicted y
The learning task
Learn a classification model from the data
Use the model to classify future loan applications into
Yes (approved) and
No (not approved)
What is the class for following case/instance?
Learning the Target Function
Like human learning from past experiences.
A computer does not have experiences.
A computer system learns from data, which represent
some past experiences of an application domain.
Our focus: learn a target function that can be used to
predict the values of a discrete class attribute, e.g.,
approved or not approved, and high-risk or low-risk.
The task is commonly called supervised learning,
classification, or inductive learning.
Formally, What is Learning?
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to perform the task T if, after
learning, the system's performance on T improves as measured by M.
In other words, the learned model helps the system to perform T better
as compared to no learning.
Supervised vs. Unsupervised
Supervised learning: classification is seen as supervised learning from
examples.
Supervision: The data (observations, measurements, etc.) are labeled with
predefined classes, as if a teacher had provided the classes (supervision).
Test data are classified into these classes too.
Unsupervised learning (clustering)
Class labels of the data are unknown
Given a set of data, the task is to establish the existence of classes or clusters in
the data
Classification Vs. Regression
Some Learning algorithms
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Classification: Neural Nets
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes the weights wi from the
data by minimizing the squared
error of the fit
Not flexible enough
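The following is a minimal sketch of how the weights in w0 + w1 x + w2 y >= 0 could be computed by least squares, using only the standard library. It assumes the two classes are encoded as targets -1 and +1 and solves the normal equations with a small Gauss-Jordan routine. The six training points and all function names are hypothetical.

```python
def solve(a, b):
    """Solve the square system a * w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(points, targets):
    """Return [w0, w1, w2] minimizing the squared error of w0 + w1*x + w2*y."""
    rows = [[1.0, x, y] for x, y in points]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtt = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(xtx, xtt)


def classify(w, x, y):
    """Positive class when w0 + w1*x + w2*y >= 0."""
    return w[0] + w[1] * x + w[2] * y >= 0


# Hypothetical training data: three points per class.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
targets = [-1, -1, -1, 1, 1, 1]
w = fit_least_squares(points, targets)
print(w)
print(classify(w, 6, 6), classify(w, 0, 0))  # True False
```

Because the boundary is a single straight line, this model cannot separate classes that need a curved or piecewise boundary, which is the sense in which it is "not flexible enough".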
Examples
Example: The weather problem
Given past data, can you come up with the rules for Play/Not Play?
Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     mild         normal    false  yes
rainy     mild         normal    true   no
overcast  mild         normal    true   yes
sunny     mild         high      false  no
sunny     mild         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
witten&eibe
The weather problem
Given this data, what are the rules for play/not play?
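One simple way to get a first set of rules is the 1R idea: for each attribute, predict the majority class for each of its values, then keep the attribute that makes the fewest errors. The slides do not name this method, so treat it as one option among many. The sketch below uses the 14 rows listed in the weather example; the function name `one_r` is illustrative.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Rows copied from the weather example; the last field is the class (play).
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "true", "no"),
    ("overcast", "mild", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "mild", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def one_r(rows, attribute_names):
    """Return (attribute, value -> class rules, training errors) for the best attribute."""
    best = None
    for index, name in enumerate(attribute_names):
        counts = defaultdict(Counter)  # attribute value -> class counts
        for row in rows:
            counts[row[index]][row[-1]] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:  # ties keep the earlier attribute
            best = (name, rules, errors)
    return best


attribute, rules, errors = one_r(DATA, ATTRIBUTES)
print(attribute, rules, errors)
```

On these rows, outlook and humidity tie with 4 training errors out of 14, and the sketch keeps outlook because it comes first.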
Weather data with mixed attributes
A complete and correct rule set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope
and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no
and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = pre-presbyopic
and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
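The rule set above can be written down directly as data and applied to a new case. This is a sketch only: the attribute keys (`age`, `prescription`, `astigmatic`, `tear`) are shortened names chosen here, and returning the first matching rule is an assumption about how the rules are meant to be applied.

```python
from itertools import product

# (conditions, recommendation) pairs, in the order given in the rule set.
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommend(case):
    """Return the recommendation of the first rule whose conditions all hold."""
    for conditions, recommendation in RULES:
        if all(case.get(key) == value for key, value in conditions.items()):
            return recommendation
    return None  # no rule fired


# Enumerate every combination of attribute values to check coverage.
all_cases = [
    {"age": a, "prescription": p, "astigmatic": s, "tear": t}
    for a, p, s, t in product(
        ["young", "pre-presbyopic", "presbyopic"],
        ["myope", "hypermetrope"],
        ["no", "yes"],
        ["reduced", "normal"],
    )
]
uncovered = [case for case in all_cases if recommend(case) is None]
print(len(all_cases), len(uncovered))  # 24 0
```

Every one of the 24 possible cases is matched by some rule, which is what "complete" means here.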
A decision tree for this problem
Classifying iris flowers
Discriminative Vs. Generative Learning approaches
Assumption in learning
Assumption: The distribution of training examples is identical to the
distribution of test examples (including future unseen examples).
Evaluation Methodologies
Cross Validation
N Fold Cross Validation
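In N-fold cross validation the data is commonly split into N parts; each part serves once as the test set while the remaining parts form the training set. The sketch below only generates the index splits and does not shuffle, which a real experiment usually would. The function name `k_fold_indices` is illustrative.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

The reported score is then typically the average of the metric over the N test folds.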
Steps in Supervised Learning
Learning (training): Learn a model using the training data
Testing: Test the model using unseen test data to assess the model accuracy
Metrics
Accuracy
F1
Precision
Recall
AUC (Area Under the Curve)
ROC (Receiver Operating Characteristic)
Efficiency (time, memory)
                 YES (Actual)  No (Actual)
YES (Predicted)  a             b
NO (Predicted)   c             d
Measurements
Precision p = a / (a+b)
Recall r = a / (a+c)
F1 value F1 = 2rp / (r+p)
Tradeoff between Precision and Recall
kNN tends to have higher precision than recall,
especially when k becomes larger.
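The measurements above translate directly into code. In this sketch, a, b, and c are the table cells (predicted-yes/actual-yes, predicted-yes/actual-no, predicted-no/actual-yes); the counts used in the example call are hypothetical, and the function does not guard against zero denominators.

```python
def precision_recall_f1(a, b, c):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = a / (a + b)  # share of predicted YES that are actually YES
    recall = a / (a + c)     # share of actual YES that were predicted YES
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1


# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(a=40, b=10, c=20)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```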
AUC
Overfitting?
[Figure labels: Training Set; Testing Set; AUC (how good); Model Complexity]
When your model has too many parameters relative
to the number of data points, you are prone to
overestimate the utility of your model.
Overfitting means that you are fitting your model to
the noise instead of the underlying signal.
An overfit model is one that is overly bound
to the training data.
This means that it does an excellent job of 'predicting' the training data and a
very poor job of predicting any other data (test data).
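A small, fully hypothetical illustration of this effect: the training labels below follow the rule "x > 5.5" except for two noisy points (x = 3 and x = 8). A model that memorizes the training set (here a 1-nearest-neighbour lookup) reproduces the noise, while a simple hand-picked threshold rule ignores it. Neither model nor the data comes from the slides.

```python
# (x, label) pairs; labels at x = 3 and x = 8 are deliberately noisy.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
# Clean test points that follow the underlying rule x > 5.5.
test = [(0.5, 0), (2.9, 0), (4.2, 0), (6.3, 1), (7.8, 1), (9.5, 1)]


def memorizer(x):
    """Overly flexible model: copy the label of the nearest training point."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def threshold_rule(x):
    """Simple model: a single hand-picked threshold."""
    return 1 if x > 5.5 else 0


def accuracy(model, data):
    return sum(model(x) == label for x, label in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name, accuracy(model, train), round(accuracy(model, test), 3))
```

The memorizer scores perfectly on the training set but worse on the test set, whereas the threshold rule misses the two noisy training points and still generalizes better.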
Which one is overfitted?
Curse of Dimensionality
An increase in the number of dimensions leads to a rapid increase in volume.
This means that, as the dimensions increase, we need to collect exponentially
larger quantities of data (to be statistically significant). This exponential
increase in data is the curse. It limits our ability to store, compute, and
make decisions quickly.
The classic inverse problem is just a linear equation, Ax = b. We seek
solutions like x = inverse(A) * b.
In this setting, the curse of dimensionality means that you have
way more features, or dimensions, than you have data
points, and consequently you cannot actually invert A (it is
singular) to obtain a unique solution. A standard solution is
to add some additional information (e.g., regularization or a
Bayesian prior).
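To make the "rapid increase in volume" concrete, suppose each feature is divided into 10 bins and we want at least one sample in every cell of the resulting grid. The bin count and the function name below are assumptions made for this sketch.

```python
def cells_needed(dimensions, bins_per_axis=10):
    """Number of grid cells when every axis is split into bins_per_axis bins."""
    return bins_per_axis ** dimensions


for d in (1, 2, 3, 6, 10):
    print(f"{d} dimensions -> {cells_needed(d):,} cells")
```

With 10 features the grid already has ten billion cells, so any realistic data set leaves almost all of the space empty.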
Features
Good Features
Detect Multiple Faces?
Sliding Window
No Face
Maybe Face
Face!
The End?