Data Mining: Concepts and Techniques
Chapter 6
Jianlin Cheng
Department of Computer Science
University of Missouri
Slides adapted from © 2006 Jiawei Han and Micheline Kamber. All rights reserved.
Outline
What is classification? What is prediction? (SVM)
Issues regarding classification and prediction
Associative classification
Target marketing
Medical diagnosis
Fraud detection
The model is built from a training set and represented as classification rules, decision trees, or mathematical formulae
Process (1): Model Construction
[Figure: classification algorithms are applied to the training data to produce a classifier (model)]
IF rank = professor OR years >= 5 THEN tenured = yes
March 29, 2017 Data Mining: Concepts and Techniques 5
Process (2): Using the Model in
Prediction
[Figure: the classifier is applied to the testing data, then to unseen data]
Unseen data: (Jeff, Professor, 4) -> Tenured?
Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
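The second step above, using the learned model on unseen data, can be sketched in code; a minimal illustration (the function name and string encodings are my own, not from the slides):

```python
def tenured(rank, years):
    """Apply the learned rule: IF rank = professor OR years >= 5 THEN tenured = yes."""
    return "yes" if rank == "Professor" or years >= 5 else "no"

# Unseen tuple from the slide: (Jeff, Professor, 4)
print(tenured("Professor", 4))  # -> yes
```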
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the model built on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Data cleaning
Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data attributes
Evaluating classification methods
Speed: time to construct the model (training time)
[Figure: decision tree with root age?; the <=30 branch splits further (no / yes leaves), 31..40 is a pure yes leaf, and the >40 branch splits further (no / yes leaves)]
New entropy after using attribute A to split D into v partitions:

  Entropy_A(D) = sum over j = 1..v of (|Dj| / |D|) * E(Dj)

Class P: buys_computer = yes (9 tuples)
Class N: buys_computer = no (5 tuples)

  Entropy(D) = E(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

  Entropy_age(D) = (5/14) E(2,3) + (4/14) E(4,0) + (5/14) E(3,2) = 0.694

(5/14) E(2,3) means age <=30 has 5 out of 14 samples, with 2 yeses and 3 nos. Hence

  Gain(age) = Entropy(D) - Entropy_age(D) = 0.246
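The entropy and information-gain computation above can be reproduced numerically; a short sketch (function and variable names are illustrative):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution, e.g. E(9,5) for 9 yes / 5 no."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Whole dataset: 9 buys_computer=yes, 5 no
e_d = entropy([9, 5])                                       # ~ 0.940

# Split on age: <=30 has (2 yes, 3 no), 31..40 has (4, 0), >40 has (3, 2)
partitions = [[2, 3], [4, 0], [3, 2]]
e_age = sum(sum(p) / 14 * entropy(p) for p in partitions)   # ~ 0.694

gain_age = e_d - e_age                                      # ~ 0.246
```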
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Similarly,
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
Decision Boundary ?
[Figure, repeated across three slides: classes A, B, and C plotted in the X-Y feature plane, with a candidate decision boundary between the classes drawn on each slide]
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
(ai + ai+1)/2 is the midpoint between the values of ai and ai+1
The point yielding maximum information gain for A is
selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A <= split-point,
and D2 is the set of tuples in D satisfying A > split-point
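The midpoint-scan procedure above can be sketched as follows; a minimal illustration for one numeric attribute (the function name and toy data are my own):

```python
from math import log2

def best_split_point(values, labels):
    """Try the midpoint of each pair of adjacent sorted values of A and
    return the one with maximum information gain over the binary split
    D1: A <= split-point, D2: A > split-point."""
    def entropy(lbls):
        n = len(lbls)
        return -sum(lbls.count(c) / n * log2(lbls.count(c) / n)
                    for c in set(lbls))

    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best, best_gain = None, -1.0
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        split = (pairs[i][0] + pairs[i + 1][0]) / 2   # (a_i + a_{i+1}) / 2
        left = [l for v, l in pairs if v <= split]    # D1
        right = [l for v, l in pairs if v > split]    # D2
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best, best_gain = split, gain
    return best, best_gain
```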
Gain Ratio for Attribute Selection
(C4.5)
Information gain measure is biased towards
attributes with a large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome
the problem (normalization of information gain)

  SplitEntropy_A(D) = -sum over j = 1..v of (|Dj| / |D|) log2(|Dj| / |D|)

  GainRatio(A) = Gain(A) / SplitEntropy_A(D)

Ex. income splits D into partitions of sizes 4, 6, and 4:
  SplitEntropy_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
  gain_ratio(income) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is
selected as the splitting attribute
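The split-entropy normalization can be checked numerically; a small sketch using the 4/6/4 income partition sizes from the example above (function and variable names are mine):

```python
from math import log2

def split_entropy(partition_sizes):
    """SplitEntropy_A(D) = -sum_j |Dj|/|D| * log2(|Dj|/|D|)."""
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes if s)

# income splits D into partitions of size 4 (high), 6 (medium), 4 (low)
si = split_entropy([4, 6, 4])          # ~ 1.557
gain_ratio_income = 0.029 / si         # ~ 0.019
```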
Reduction in impurity: delta Gini(A) = Gini(D) - Gini_A(D)
Why decision tree induction in data mining?
Relatively faster learning speed (than other classification methods)
Convertible to simple and easy to understand classification rules
Can use SQL queries for accessing databases
Comparable classification accuracy with other methods
Scalable Decision Tree Induction
Methods
AVC-set on age                AVC-set on income
age     yes  no               income  yes  no
<=30    3    2                high    2    2
31..40  4    0                medium  4    2
>40     3    2                low     3    1

AVC-set on student            AVC-set on credit_rating
student  yes  no              credit_rating  yes  no
yes      6    1               fair           6    2
no       3    4               excellent      3    3

(Counts are of Buy_Computer = yes / no.)
Classification
Clustering
Regression
Feature Selection
Outlier detection
Review paper:
http://www.sciencedirect.com/science/article/pii/S0
031320310003973
An Example
Wikipedia
P(X|Ci) P(Ci) needs to be maximized
For a continuous-valued attribute, P(xk|Ci) is estimated with a Gaussian density: P(xk|Ci) = g(xk, μCi, σCi)
Naïve Bayesian Classifier: Training
Dataset
Class:
C1: buys_computer = yes
C2: buys_computer = no
Data sample
X = (age <= 30,
Income = medium,
Student = yes,
Credit_rating = Fair)
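Classifying X by hand comes down to multiplying class priors by per-attribute conditional probabilities counted from the 14-tuple training table (the counts below match the AVC-sets shown earlier); a minimal sketch with my own variable names:

```python
# Class priors from the 14-tuple training set
p_yes, p_no = 9 / 14, 5 / 14

# Conditional probabilities of X's attribute values, counted per class
# X = (age<=30, income=medium, student=yes, credit_rating=fair)
px_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # P(X|yes)
px_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)    # P(X|no)

score_yes = px_yes * p_yes    # ~ 0.028
score_no = px_no * p_no       # ~ 0.007
prediction = "yes" if score_yes > score_no else "no"
```

Since score_yes > score_no, the naïve Bayesian classifier predicts buys_computer = yes for X.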
The "corrected" probability estimates are close to their uncorrected counterparts
Naïve Bayesian Classifier:
Comments
Advantages
Easy to implement
Disadvantages
Assumption: class conditional independence, therefore
loss of accuracy
Practically, dependencies exist among variables
Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Vote Classification
http://people.csail.mit.edu/kersting/profile/PROFILE_nb.html
  FOIL_Prune(R) = (pos - neg) / (pos + neg)
pos/neg are the numbers of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
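A minimal sketch of this pruning test (the example coverage counts are invented for illustration):

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos/neg are the
    counts of positive/negative tuples covered by rule R."""
    return (pos - neg) / (pos + neg)

# Compare a rule against its pruned version (illustrative counts):
r_full = foil_prune(80, 20)      # full rule covers 80 pos, 20 neg
r_pruned = foil_prune(90, 25)    # pruned rule covers more, less precisely
keep_pruned = r_pruned > r_full  # prune only if the score improves
```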
x1: # of occurrences of the word "homepage"
x2: # of occurrences of the word "welcome"
Mathematically
  x ∈ X = R^n, y ∈ Y = {+1, -1}
We want a function f: X -> Y
As compared to Bayesian methods in general
Robust: works when training examples contain errors
Fast evaluation of the learned target function
(Bayesian networks are normally slow)
Criticism
Cannot simulate data generation
(Bayesian networks can be used easily for pattern discovery)
Not easy to incorporate domain knowledge
(Easy in Bayesian networks, in the form of priors on the data or distributions)
A Neuron (= a perceptron)
[Figure: inputs 1, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum, and an activation function f produces output y]
For example:
  y = sign( sum over i = 1..n of wi xi + w0 )
Input vector x, weight vector w, weighted sum, activation function
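The weighted-sum-plus-sign computation can be written directly; a minimal sketch (the weights and inputs are invented for illustration):

```python
def perceptron_output(x, w, w0):
    """y = sign(sum_i w_i * x_i + w0): weighted sum, then sign activation."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if s >= 0 else -1

# Example: two inputs, illustrative weights and bias
y = perceptron_output([1.0, 2.0], w=[0.5, -0.25], w0=-0.1)
```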
[Figure: a feed-forward network; the input vector X enters the input layer, weights wij connect it to the hidden layer, and the output layer produces the output vector]
Activation / Transfer Function
Sigmoid function: f(x) = 1 / (1 + e^(-x))
Tanh function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Let data D be (X1, y1), ..., (X|D|, y|D|), where each Xi is a training
tuple and yi its associated class label
There are infinite lines (hyperplanes) separating the two classes but
we want to find the best one (the one that minimizes classification
error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e.,
maximum marginal hyperplane (MMH)
Quadratic Optimization
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution
of class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
Given two rules, R1 and R2, if the antecedent of R1 is more
general than that of R2 and conf(R1) >= conf(R2), then R2 is pruned
Prunes rules for which the rule antecedent and class are not
positively correlated, based on a χ2 test of statistical significance
Classification based on generated/pruned rules
If only one rule satisfies tuple X, assign the class label of the rule
If a rule set S satisfies X, CMAR
divides S into groups according to class labels
uses a weighted χ2 measure to find the strongest group of
rules, based on the statistical correlation of rules within a
group
assigns X the class label of the strongest group
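The generality-based pruning rule can be sketched as follows; a simplified illustration in which antecedents are itemsets and rules sit in a plain list (these data structures are my own, not CMAR's actual FP-tree implementation):

```python
def prune_rules(rules):
    """Keep a rule only if no other rule with a strictly more general
    antecedent (a proper subset of its antecedent) has confidence at
    least as high. Each rule: (antecedent_itemset, class_label, conf)."""
    kept = []
    for ant2, cls2, conf2 in rules:
        dominated = any(
            ant1 < ant2 and conf1 >= conf2     # R1 strictly more general
            for ant1, cls1, conf1 in rules
        )
        if not dominated:
            kept.append((ant2, cls2, conf2))
    return kept

rules = [
    (frozenset({"a"}), "yes", 0.9),
    (frozenset({"a", "b"}), "yes", 0.8),   # pruned: {a} is more general, 0.9 >= 0.8
]
kept = prune_rules(rules)
```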
Associative Classification May Achieve
High Accuracy and Efficiency (Cong et al.
SIGMOD05)
Euclidean space.
Locally weighted regression
Lazy learners (e.g., KNN)
Other regression methods: generalized linear model, Poisson regression
X1     X2     Die / Live
0.01   0.004  1
0.001  0.02   0
0.003  0.005  1

Binomial distribution:
For one dose level xi, yi = P(die|xi), likelihood = yi^ti (1 - yi)^(1 - ti)
www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
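The per-observation Bernoulli likelihood above can be written directly; a minimal sketch (the function name is mine):

```python
def likelihood(y, t):
    """Bernoulli likelihood of one observation at dose x_i:
    y = P(die|x_i) predicted by the model, t = 1 if the subject died, else 0.
    Returns y^t * (1 - y)^(1 - t)."""
    return (y ** t) * ((1 - y) ** (1 - t))
```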
Algorithm
1. Start with a single node containing all points.
Calculate mc and S.
Relative absolute error:
  sum over i = 1..d of |yi - yi'|  /  sum over i = 1..d of |yi - ybar|
Relative squared error:
  sum over i = 1..d of (yi - yi')^2  /  sum over i = 1..d of (yi - ybar)^2
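Both measures normalize the predictor's error by that of a trivial predictor that always guesses the mean ybar; a minimal sketch (function names are mine):

```python
def relative_absolute_error(actual, predicted):
    """sum_i |y_i - y_i'| / sum_i |y_i - mean(y)|"""
    mean = sum(actual) / len(actual)
    return (sum(abs(a - p) for a, p in zip(actual, predicted))
            / sum(abs(a - mean) for a in actual))

def relative_squared_error(actual, predicted):
    """sum_i (y_i - y_i')^2 / sum_i (y_i - mean(y))^2"""
    mean = sum(actual) / len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted))
            / sum((a - mean) ** 2 for a in actual))
```

A perfect predictor scores 0; the mean predictor scores 1; anything above 1 is worse than guessing the mean.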
Holdout: reserve a test set (e.g., 1/3 of the data) for accuracy estimation
Random sampling: a variation of holdout; the holdout is repeated
several times and the average accuracy obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive
subsets of approximately equal size
Ensemble methods
Use a combination of models to increase accuracy
Bagging and boosting: combine a collection of classifiers
Ensemble: combining a set of heterogeneous
classifiers
Bagging: Bootstrap
Aggregation
Analogy: diagnosis based on multiple doctors' majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of
d tuples is sampled with replacement from D (i.e., bootstrap)
A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class
with the most votes to X
Prediction: can be applied to the prediction of continuous values by
taking the average value of each prediction for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
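The bagging loop above can be sketched as follows; a simplified illustration with a toy base learner (the function names and the stub learner are mine, not from the slides):

```python
import random
from collections import Counter

def bagging_predict(data, learn, x, k=10, rng=None):
    """Train k classifiers on bootstrap samples of `data` (drawn with
    replacement, same size d), then classify x by majority vote.
    `learn(sample)` must return a function mapping a tuple to a label."""
    rng = rng or random.Random(0)
    votes = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in range(len(data))]  # bootstrap D_i
        model = learn(sample)                                  # learn M_i on D_i
        votes.append(model(x))                                 # M_i's prediction
    return Counter(votes).most_common(1)[0][0]                 # majority vote

# Toy base learner: always predicts the majority class of its sample
def learn(sample):
    label = Counter(lbl for _, lbl in sample).most_common(1)[0][0]
    return lambda x: label
```

Replacing the majority vote with an average of the k predictions gives the continuous-valued (prediction) variant mentioned above.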