Data Mining Week 1 2

Advanced Statistics & Data Mining
Week 1-2
Dr. Muhammad Nadeem Majeed
nadeem.majeed@uettaxila.edu.pk
1
Course Objectives
This is a course for students on the topic of
Statistical Analysis and Data Mining. Topics include
statistical analysis, data mining applications, data
preparation, data reduction and various data
mining techniques (such as association, clustering,
classification, anomaly detection)
2
Outline
Course Logistics
Data Mining Introduction
Four Key Characteristics
Combination of Theory and Application
Engineering Process
Collection of Functionalities
Interdisciplinary field
How do we categorize data mining systems?
History of Data Mining
Research Issues
Curse of Dimensionality
3
Artificial Intelligence in Sci Fi
Intelligence
The ability to solve problems
Consider the following sequence:
1, 3, 7, 13, 21, __
What is the next number?
Intelligence is the ability to reason logically to
reach a conclusion
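One way a machine could reason about the sequence above is with a finite-difference table. The sketch below assumes the sequence is generated by a polynomial, which is an assumption and not something the slide states; the function name `next_term` is only illustrative.

```python
def next_term(seq):
    """Predict the next term by extending the finite-difference table."""
    rows = [list(seq)]
    # Keep differencing until a row is constant (or has a single entry).
    while len(rows[-1]) > 1 and len(set(rows[-1])) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    # Treat the last row as constant; the next term is then the sum of the
    # last entries of every row in the table.
    return sum(row[-1] for row in rows)


print(next_term([1, 3, 7, 13, 21]))
```

For the slide's sequence the differences are 2, 4, 6, 8, so the next difference under this assumption is 10.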
Intelligence
Ability to solve problems
Ability to plan and schedule
Ability to memorize and process information
Ability to answer fuzzy questions
Ability to learn
Ability to recognize
Ability to understand
Ability to perceive
And many more
Can only human beings and animals possess these qualities?

But what if
A machine searches through a mesh and finds a path?
A machine solves problems like the next number in
the sequence?
A machine develops plans?
A machine diagnoses and prescribes?
A machine answers ambiguous questions?
A machine recognizes fingerprints?
A machine understands?
A machine perceives?
A machine does MANY MORE SUCH THINGS
"[The automation of] activities that we associate with human thinking,
activities such as decision making, problem solving, learning" (Bellman,
1978)
"The exciting new effort to make computers think ... machines with minds, in
the full and literal sense" (Haugeland, 1985)
"The study of the computations that make it possible to perceive, reason, and act"
(Winston, 1992)
"The art of creating machines that perform functions that require intelligence
when performed by people" (Kurzweil, 1990)
"The branch of computer science that is concerned with the automation of
intelligent behavior" (Luger and Stubblefield, 1993)
9
Artificial Intelligence VS. Human Intelligence
14
Can you identify the profitable routes from an airline
reservation system?
Can you detect fraud from transactional
data?
Vision-based biometrics
How the Afghan Girl was identified by her iris patterns
Google Car
AISIGHT
Where am I
SPAM
Massive volumes of data from sensors and networks of sensors
Large Synoptic Survey Telescope (LSST):
40 TB/day (an SDSS every two days),
100+ PB in its 10-year lifetime
Machine Learning
- Grew out of work in AI
- New capability for computers
Examples:
- Database mining
Large datasets from growth of automation/web.
E.g., Web click data, medical records, biology,
engineering
- Applications we can't program by hand.
E.g., Autonomous helicopter, handwriting
recognition, most of Natural Language
Processing (NLP), Computer Vision.
Machine Learning definition
Machine Learning: Field of study that gives
computers the ability to learn without being
explicitly programmed.
Well-posed Learning Problem: A computer
program is said to learn from experience E with
respect to some task T and some performance
measure P, if its performance on T, as
measured by P, improves with experience E.
Why Data Mining?
Motivation: Necessity is the Mother of Invention
Data explosion problem
Applications generate huge amounts of data
WWW, computer systems/programs, biology experiments,
Business transactions, Scientific computation and simulation,
Medical and person data, Surveillance video and pictures,
Satellite sensing, Digital media,
Technologies are available to collect and store data
Bar codes, scanners, satellites, cameras etc.
Databases, data warehouses, variety of repositories
We are drowning in data, but starving for knowledge!
31
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
What is not data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs
Key Characteristics
Combination of Theory and Application
Engineering Process
Data Pre-processing and Post-processing, Interpretation
Collection of Functionalities
Different Tasks and Algorithms
Interdisciplinary Field
32
Real Example from NBA
AS (Advanced Scout) software from IBM Research
Coach can assess the effectiveness of certain coaching
decisions
Good/bad player matchups
Plays that work well against a given team
Raw Data: Play-by-play information recorded by
teams
Who is on court
Who took a shot, the type of shot, the outcome, any
rebounds
33
Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship
management (CRM), market basket analysis,
cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved
underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns
(outliers)
34
Potential Applications
Other Applications
Text mining (news group, email, documents)
and Web mining
Stream data mining
System and Network Management
Multimedia Applications
Music, Image, Video
DNA and bio-data analysis
35
Example: Use in retailing
Goal: Improved business efficiency
Improve marketing (advertise to the most likely buyers)
Inventory reduction (stock only needed quantities)
Information source: Historical business data
Example: Supermarket sales records
Date/Time    Register  Fish  Turkey  Cranberries  Wine  ...
12/6 13:15   2         N     Y       Y            N     ...
12/6 13:16   3         Y     N       N            Y     ...
Size ranges from 50k records (research studies) to terabytes (years of
data from chains)
Data is already being warehoused
Sample question: what products are generally
purchased together?
The answers are in the data, if only we could see
them
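The sample question above can be approached by counting how often each pair of products appears in the same sales record. The following is a minimal sketch: the first two records are the rows shown in the example, while the last two are hypothetical rows added only for illustration, and `pair_counts` is an illustrative name.

```python
from collections import Counter
from itertools import combinations

ITEMS = ["Fish", "Turkey", "Cranberries", "Wine"]

# First two baskets: the rows shown on the slide.
# Last two baskets: hypothetical rows, added only so the counts differ.
baskets = [
    ["N", "Y", "Y", "N"],
    ["Y", "N", "N", "Y"],
    ["N", "Y", "Y", "Y"],  # hypothetical
    ["Y", "N", "N", "Y"],  # hypothetical
]


def pair_counts(rows):
    """Count how many records contain each unordered pair of items."""
    counts = Counter()
    for row in rows:
        bought = sorted(item for item, flag in zip(ITEMS, row) if flag == "Y")
        counts.update(combinations(bought, 2))
    return counts


counts = pair_counts(baskets)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for "purchased together"; association-rule methods refine this idea with support and confidence thresholds.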
36
Other Applications
Network System management
Event Mining Research at IBM
Astronomy
JPL and the Palomar Observatory discovered
22 quasars with the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preferences and behavior,
analyze the effectiveness of Web marketing,
improve Web site organization, etc.
37
Market Analysis and Management (1)
Where are the data sources for analysis?

Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/co-relations between product sales
Prediction based on the association information
38
Market Analysis and Management (2)
Customer profiling
data mining can tell you what types of customers buy
what products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central tendency
and variation)
39
Corporate Analysis and Risk Management
Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-
ratio, trend analysis, etc.)
Resource planning:
summarize and compare the resources and
spending
Competition:
monitor competitors and market directions
group customers into classes and a class-based
pricing procedure
set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and rings of
doctors and rings of references
41
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
The Australian Health Insurance Commission identified that in
many cases blanket screening tests were requested (saving
A$1m/yr).
Detecting telephone fraud
Telephone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm.
British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and
broke a multimillion dollar fraud.
Retail
Analysts estimate that 38% of retail shrink is due to
dishonest employees.
42
Data Mining: An Engineering Process
Data mining: interactive and iterative process.
Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data
-> Mining Algorithms -> Patterns -> Interpretation/Evaluation -> Knowledge
adapted from:
U. Fayyad, et al. (1995), "From Knowledge Discovery to Data
Mining: An Overview," Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Steps of a KDD Process
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
44
Architecture of a Typical Data Mining
System
Layers, from top to bottom:
Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Data cleaning & data integration, filtering
Databases and data warehouse
Knowledge base (shown beside the layers)
45
Data Mining: On What Kind of
Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
46
What Can Data Mining Do?
Cluster
Classify
Categorical, Regression
Semi-supervised
Summarize
Summary statistics, Summary rules
Link Analysis / Model Dependencies
Association rules
Sequence analysis
Time-series analysis, Sequential associations
Detect Deviations
47
Learning?
Definitions of learning from dictionary:
To get knowledge of by study,
experience, or being taught
To become aware by information or
from observation
To commit to memory
To be informed of, ascertain; to receive instruction
48
Machine Learning
Machine learning involves adaptive
mechanisms that enable computers to
learn from experience,
learn by example,
learn by analogy.
Learning capabilities can improve the
performance of an intelligent system over
time.
49
A Generic System
Inputs x1, x2, ..., xN -> System (hidden variables h1, h2, ..., hK) -> Outputs y1, y2, ..., yM
Input Variables: x = (x1, x2, ..., xN)
Hidden Variables: h = (h1, h2, ..., hK)
Output Variables: y = (y1, y2, ..., yM)
50
Another definition
Machine Learning algorithms discover the
relationships between the variables of a system
(input, output and hidden) from direct samples
of the system
These algorithms originate from many fields:
Statistics, mathematics, theoretical computer
science, physics, neuroscience, etc
51
When are ML algorithms
NOT needed?
When the relationships between all system
variables (input, output, and hidden) are
completely understood!
This is NOT the case for almost any real system!
52
Machine Learning
[Figure, built up over several slides: types of learning: supervised learning, unsupervised learning, reinforcement learning]
Carpentry
[Figure: the same diagram with one type of learning marked "Today!"]
58
Supervised Learning
Given labeled data. Predict output.

(Learning with a teacher)
Carpentry of Supervised Learning
59
What does Data Look Like?
60
Data
M observations :
For each observation (i) we have x(i) and y(i)
61
The Data and goal
Data: A set of data records (also called examples, instances or cases)
described by
k attributes: A1, A2, ..., Ak.
a class: Each example is labelled with a pre-defined class.
Goal: To learn a classification model from the data that can be used to
predict the classes of new (future, or test) cases/instances.
62
Supervised Learning
[Figure, built up over three slides: Training Set -> Learning Algorithm -> h; x -> h -> predicted y]
65
The learning task
Learn a classification model from the data
Use the model to classify future loan applications into
Yes (approved) and
No (not approved)
What is the class for the following case/instance?
66
Learning the Target Function
Like human learning from past experiences.
A computer does not have experiences.
A computer system learns from data, which represent
some past experiences of an application domain.
Our focus: learn a target function that can be used to
predict the values of a discrete class attribute, e.g.,
approve or not-approved, and high-risk or low risk.
The task is commonly called: Supervised learning,
classification, or inductive learning.
67
Formally, What is Learning?
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to perform the task T if, after
learning, the system's performance on T improves as measured by M.
In other words, the learned model helps the system to perform T better
as compared to no learning.
68
Supervised vs. Unsupervised
Supervised learning: classification is seen as supervised learning from
examples.
Supervision: The data (observations, measurements, etc.) are labeled with
pre-defined classes, as if a teacher had provided the classes (supervision).
Test data are classified into these classes too.
Unsupervised learning (clustering)
Class labels of the data are unknown.
Given a set of data, the task is to establish the existence of classes or clusters in
the data.
69
Classification Vs. Regression
Supervised Learning Input:
A description of an instance, x ∈ X, where X is the space of input features and C is the set of classes
Training Set: {(x1, c1), (x2, c1), (x5, c3), ..., (x6, c2), ...}
Test Set: x
Supervised Learning Task:
The category of x: c(x) ∈ C, where c(x) is a
classification/regression function
Classification:
A fixed set of classes:
C = {c1, c2, ..., cn}
Regression:
C = a continuous variable
70
Some Learning algorithms
*Just introduction, we will cover them
71
Classification
Learn a method for predicting the instance class from pre-
labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Given a set of points from classes,
what is the class of a new point?
72
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: X-Y plane partitioned by the rules above; X axis marked at 2 and 5]
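The decision tree rules on this slide translate directly into nested threshold tests. A minimal sketch follows; the function name `classify` and the sample points are illustrative choices, not part of the slide.

```python
def classify(x, y):
    """Apply the slide's threshold tests from top to bottom."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


# Illustrative points, one per branch of the tree.
for point in [(6, 1), (1, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Each test splits the plane with an axis-parallel line, which is why decision tree regions are rectangles.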
73
Classification: Neural Nets
Can select more complex regions
Can be more accurate
Can also overfit the data:
find patterns in random noise
74
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes the weights wi from the
data by minimizing the squared error of the fit
Not flexible enough
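To make "computes wi from data to minimize squared error" concrete, here is a minimal sketch for the simplest case: one input variable and the model w0 + w1 x. This is a simplification of the two-variable boundary on the slide, the data points are hypothetical, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Return (w0, w1) minimizing the sum of (w0 + w1*x - y)**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single input variable.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # hypothetical points lying on y = 1 + 2x
w0, w1 = fit_line(xs, ys)
print(w0, w1)
```

With more input variables the same idea applies, but the weights come from solving a system of linear equations rather than from two closed-form expressions.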
75
Examples
76
Example: The weather problem
Given past data, can you come up with the rules for Play/Not Play?
Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     mild         normal    false  yes
rainy     mild         normal    true   no
overcast  mild         normal    true   yes
sunny     mild         high      false  no
sunny     mild         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
77
witten&eibe
The weather problem
Given this data, what are the rules for play/not play?

Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes

78
The weather problem
Conditions for playing

Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
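The rules above can be checked against the data mechanically. The sketch below treats them as an ordered list in which the first matching rule decides (an assumption about how the rules are meant to be read) and applies them to the 14 rows of the weather table from the earlier slide; `predict` is an illustrative name.

```python
# (outlook, temperature, humidity, windy, play), copied from the weather table.
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("rainy", "mild", "normal", True, "no"),
    ("overcast", "mild", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "mild", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def predict(outlook, temperature, humidity, windy):
    """Apply the slide's rules in order; the first rule that fires decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # "none of the above"


errors = [row for row in DATA if predict(*row[:4]) != row[4]]
print("misclassified rows:", len(errors))
```

Note that the order matters: read as independent rules, "humidity = normal then yes" would conflict with the rainy and windy rule on some rows.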
79
Weather data with mixed attributes
Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no
80
How will the rules change when some attributes have

numeric values?

Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes

81
Rules with mixed attributes

Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes

If outlook = sunny and humidity > 83 then play = no

If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
82
The contact lenses data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
83
A complete and correct rule set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope
and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no
and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = pre-presbyopic
and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
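Whether a rule set is "complete and correct" can be tested by running it over every row of the contact lenses table. The sketch below does that. It assumes the rules are tried in the listed order, and it assumes that the two rules whose conclusions are cut off in the extracted slide text (pre-presbyopic with astigmatic = no, and hypermetrope with astigmatic = no) end in "tear production rate = normal then soft", which is the reading consistent with the table.

```python
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEARS = ["reduced", "normal"]

# Recommendations copied from the table, in table order
# (tear rate varies fastest, then astigmatism, prescription, age).
LABELS = [
    "none", "soft", "none", "hard", "none", "soft", "none", "hard",  # young
    "none", "soft", "none", "hard", "none", "soft", "none", "none",  # pre-presbyopic
    "none", "none", "none", "hard", "none", "soft", "none", "none",  # presbyopic
]
TABLE = list(zip(product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEARS), LABELS))


def recommend(age, prescription, astigmatic, tears):
    """Try the rules in order; return None if no rule fires."""
    if tears == "reduced":
        return "none"
    if age == "young" and astigmatic == "no" and tears == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatic == "no" and tears == "normal":
        return "soft"  # assumed conclusion (truncated in the slide text)
    if age == "presbyopic" and prescription == "myope" and astigmatic == "no":
        return "none"
    if prescription == "hypermetrope" and astigmatic == "no" and tears == "normal":
        return "soft"  # assumed conclusion (truncated in the slide text)
    if prescription == "myope" and astigmatic == "yes" and tears == "normal":
        return "hard"
    if age == "young" and astigmatic == "yes" and tears == "normal":
        return "hard"
    if age == "pre-presbyopic" and prescription == "hypermetrope" and astigmatic == "yes":
        return "none"
    if age == "presbyopic" and prescription == "hypermetrope" and astigmatic == "yes":
        return "none"
    return None


mismatches = [(attrs, label) for attrs, label in TABLE if recommend(*attrs) != label]
print("rows not reproduced by the rules:", len(mismatches))
```

"Complete" here means every row triggers some rule, and "correct" means the triggered rule gives the recommendation in the table.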
84
A decision tree for this problem
85
Classifying iris flowers
      Sepal length  Sepal width  Petal length  Petal width  Type
1     5.1           3.5          1.4           0.2          Iris setosa
2     4.9           3.0          1.4           0.2          Iris setosa
...
51    7.0           3.2          4.7           1.4          Iris versicolor
52    6.4           3.2          4.5           1.5          Iris versicolor
...
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica
...

If petal length < 2.45 then Iris setosa

If sepal width < 2.10 then Iris versicolor
...
Predicting CPU performance
Example: 209 different computer configurations
      Cycle time (ns)  Main memory (Kb)  Cache (Kb)  Channels       Performance
      MYCT             MMIN    MMAX      CACH        CHMIN  CHMAX   PRP
1     125              256     6000      256         16     128     198
2     29               8000    32000     32          8      32      269
...
208   480              512     8000      32          0      0       67
209   480              1000    4000      0           0      0       45
Linear regression function
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
      + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
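The regression function is just a weighted sum of the attributes, so it can be evaluated for any configuration. The sketch below uses the coefficients from the slide and applies them to configurations 1 and 209 from the table; `predict_prp` is an illustrative name.

```python
# Coefficients and intercept copied from the slide's regression function.
INTERCEPT = -55.9
COEFFS = {
    "MYCT": 0.0489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}


def predict_prp(config):
    """Evaluate the linear regression function for one configuration."""
    return INTERCEPT + sum(COEFFS[name] * config[name] for name in COEFFS)


# Configurations 1 and 209 from the table.
row1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
row209 = {"MYCT": 480, "MMIN": 1000, "MMAX": 4000, "CACH": 0, "CHMIN": 0, "CHMAX": 0}

print(round(predict_prp(row1), 2), round(predict_prp(row209), 2))
```

Comparing the printed values with the PRP column shows how far an individual prediction can be from the observed performance, even when the function minimizes squared error over all 209 rows.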
87
Soybean classification
             Attribute                Number of values  Sample value
Environment  Time of occurrence       7                 July
             Precipitation            3                 Above normal
             ...
Seed         Condition                2                 Normal
             Mold growth              2                 Absent
             ...
Fruit        Condition of fruit pods  4                 Normal
             Fruit spots              5                 ?
Leaves       Condition                2                 Abnormal
             Leaf spot size           3                 ?
             ...
Stem         Condition                2                 Abnormal
             Stem lodging             2                 Yes
             ...
Roots        Condition                3                 Normal
Diagnosis                             19                Diaporthe stem canker
88
Discriminative Vs. Generative
Learning approaches
89
Assumption in learning
Assumption: The distribution of training examples is identical to the
distribution of test examples (including future unseen examples).
In practice, this assumption is often violated to a certain degree.
Strong violations will clearly result in poor classification accuracy.
To achieve good accuracy on the test data, training examples must be
sufficiently representative of the test data.
90
Evaluation Methodologies
91
Cross Validation
Train Set Test Set
92
N Fold Cross Validation
[Figure: the data split into five folds: Train Set, Train Set, Train Set, Test Set, Train Set]
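In N-fold cross-validation each fold takes one turn as the test set while the remaining folds form the training set. A minimal sketch of generating those splits follows; it does not shuffle the data, and `k_fold_splits` is an illustrative name rather than a library function.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)
```

A model would be trained and evaluated once per split, and the N scores averaged. In practice the data is usually shuffled first so that each fold is representative.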
93
Steps in Supervised Learning
Learning (training): Learn a model using the training data
Testing: Test the model using unseen test data to assess the model accuracy
Accuracy = (Number of correct classifications) / (Total number of test cases)
94
Metrics
Accuracy
F1
Precision
Recall
AUC (Area Under the Curve)
ROC (Receiver Operating Characteristic)
Efficiency (time, memory)
95
                 YES (Actual)  NO (Actual)
YES (Predicted)  a             b
NO (Predicted)   c             d
Measurements
Precision p = a / (a+b)
Recall r = a / (a+c)
F1 value F1 = 2rp / (r+p)
Tradeoff between Precision and Recall
kNN tends to have higher precision than recall,
especially when k becomes larger.
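The measurements above follow directly from the four cells a, b, c, d of the table. A minimal sketch, using hypothetical counts chosen only for illustration:

```python
def precision_recall_f1(a, b, c):
    """a = true positives, b = false positives, c = false negatives."""
    p = a / (a + b)
    r = a / (a + c)
    f1 = 2 * r * p / (r + p)
    return p, r, f1


def accuracy(a, b, c, d):
    """Fraction of all test cases that were classified correctly."""
    return (a + d) / (a + b + c + d)


# Hypothetical counts, not taken from the slides.
a, b, c, d = 8, 2, 4, 6
p, r, f1 = precision_recall_f1(a, b, c)
print(p, r, f1, accuracy(a, b, c, d))
```

With these counts precision is higher than recall: most predicted YES cases are right, but a third of the actual YES cases are missed.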
AUC
97
Overfitting?
98
Overfitting?
[Figure: AUC (how good) versus model complexity, with one curve for the training set and one for the testing set]
99
When your model has too many parameters relative
to the number of data points, you are prone to
overestimate the utility of your model.
Overfitting means that you are fitting your model to
the noise instead of the underlying signal.
An overfit model is one that is overly bound
to the training data.
This means that it does an excellent job of 'predicting' the training data and a
very poor job of predicting any other data (test data).
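A small deterministic sketch of this effect: a model that memorizes the training set (1-nearest-neighbour) is compared with a simple threshold rule. All data here is hypothetical; the true pattern is "label 1 when x > 5", and two training labels (at x = 3 and x = 8) are deliberately flipped to act as noise.

```python
# Hypothetical 1-D data: true label is 1 when x > 5; labels at x=3 and x=8 are noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
test = [(1.5, 0), (2.8, 0), (3.2, 0), (6.5, 1), (7.8, 1), (8.2, 1)]


def memorizer(x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def threshold_rule(x):
    """A much simpler model with a single decision boundary."""
    return 1 if x > 5 else 0


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizer is perfect on the training data because it reproduces the noisy labels, and that is exactly what hurts it on test points near the noise.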
100
Which one is over fitted?
101
Curse of Dimensionality
102
An increase in the number of dimensions leads to a rapid increase in volume.
This means that, as the dimensions increase, we need to collect exponentially
larger quantities of data (to be statistically significant). This exponential
increase in data is the curse. It limits our ability to store, compute, and
make decisions quickly.
The classic inverse problem is just a linear equation Ax = b. We seek
solutions like x = inverse(A)*b.
In this setting, the curse of dimensionality means that you have
way more features, or dimensions, than you have data
points, and consequently you cannot actually invert A (it is
singular) to obtain a unique solution. A standard solution is
to add some additional information (e.g., regularization or a
Bayesian prior).
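Both points can be illustrated with a few lines of arithmetic. The sketch below first counts how many grid cells are needed to cover the space at a fixed resolution (a stand-in for "how much data"), then shows the smallest case of more features than data points: one data point with two features, where the matrix A^T A has determinant zero. The function names and numbers are illustrative assumptions.

```python
def cells_needed(bins_per_axis, n_dims):
    """Number of grid cells when every axis is cut into the same number of bins."""
    return bins_per_axis ** n_dims


def gram_determinant_2x2(row):
    """det(A^T A) for a 1x2 matrix A = [row]; zero means A^T A is singular."""
    a, b = row
    # A^T A = [[a*a, a*b], [a*b, b*b]]
    return (a * a) * (b * b) - (a * b) * (a * b)


for d in (1, 2, 3, 10):
    print(d, "dimensions ->", cells_needed(10, d), "cells")

print("det(A^T A) with 1 data point, 2 features:", gram_determinant_2x2((1, 2)))
```

Regularization, as mentioned above, amounts to adding a positive term to the diagonal of A^T A, which makes the determinant nonzero.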
103
Features
104
Features
Features can be good or bad
Use the training set to find good ones
Features are domain dependent
Feature selection algorithms are used to find
good features (when you have many more
than expected)
E.g., Principal Component Analysis
(PCA)
105
Good Features
106
Good Features
107
Good Features
108
Detect Multiple Faces?
109
Sliding Window
110
Sliding Window
111
No Face
112
No Face
113
No Face
114
Maybe Face
115
Face!
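The sliding-window slides above suggest a simple loop: move a fixed-size window across the image and ask a classifier about each position. A minimal sketch follows; the window size, stride, and the `is_face` classifier are hypothetical stand-ins, since the slides do not specify them.

```python
def sliding_windows(width, height, win, stride):
    """Yield (x, y, win, win) boxes left to right, top to bottom."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, win, win)


def detect(image_size, win, stride, is_face):
    """Return every window for which the classifier says 'face'."""
    width, height = image_size
    return [box for box in sliding_windows(width, height, win, stride) if is_face(box)]


# Hypothetical classifier: pretends a face sits in the bottom-right window.
def fake_is_face(box):
    return box[0] == 2 and box[1] == 2


boxes = list(sliding_windows(4, 4, 2, 2))
hits = detect((4, 4), 2, 2, fake_is_face)
print(boxes)
print(hits)
```

Because every window is classified independently, the same loop finds several faces in one image; real detectors typically also repeat it at multiple window sizes.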
116
The End?
117
