Introduction
Classification problem, evaluation of classifiers
Bayesian Classifiers
Optimal Bayes classifier, naive Bayes classifier, applications
We considered up to now
small data sets
New requirements
larger and larger commercial databases that no longer fit into main memory
Sampling
use a subset of the data as training set such that the sample fits into main memory
evaluate only a sample of all potential splits (for numerical attributes)
→ poor quality of the resulting decision trees
Attribute lists
values of an attribute in ascending order
sequential access
resident on secondary storage
Class list
contains, for each training object, its class label and a reference to its current leaf node
random access
resident in main memory
Histograms
class distribution for each leaf node of the decision tree
SLIQ: Algorithm
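A minimal sketch of how these structures interact during split evaluation (Python; the record layout and the helper evaluate_numeric_splits are illustrative assumptions, not the original SLIQ pseudocode):

# Illustrative SLIQ-style structures; layout and names are hypothetical.
training = [  # (age, income, class label)
    (23, 15, "B"),
    (40, 65, "G"),
    (61, 75, "G"),
]

# Attribute lists: (value, record id), sorted by value, scanned sequentially.
attribute_lists = {
    "age":    sorted((row[0], rid) for rid, row in enumerate(training)),
    "income": sorted((row[1], rid) for rid, row in enumerate(training)),
}

# Class list: record id -> [current leaf, class label], randomly accessed.
class_list = {rid: ["root", row[2]] for rid, row in enumerate(training)}

def evaluate_numeric_splits(attr):
    # One sequential pass over the sorted attribute list; for every record,
    # the memory-resident class list tells us its class and current leaf.
    below = {}  # per-leaf class histogram of records left of the split point
    for value, rid in attribute_lists[attr]:
        leaf, label = class_list[rid]
        hist = below.setdefault(leaf, {})
        hist[label] = hist.get(label, 0) + 1
        # candidate split "attr <= value": compute e.g. the gini index
        # from 'hist' and the leaf's total histogram here (omitted)

evaluate_numeric_splits("income")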
Shortcomings of SLIQ
The size of the class list grows linearly with the size of the database, i.e. with the number of training examples
SLIQ scales well only if sufficient main memory is available for the entire class list
Goals of SPRINT
Scalability for arbitrarily large databases
Class list
there is no class list any longer; the class label is stored directly in each attribute list entry
Attribute lists
no single attribute list for the entire training set; when a node is split, its attribute lists are partitioned among the child nodes
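A rough sketch of these two ideas (Python; the entry layout and the partition helper are illustrative assumptions):

# SPRINT-style attribute list: (value, class label, record id), sorted by
# value; the class label travels with every entry, so no class list exists.
age_list = [(23, "B", 0), (40, "G", 1), (61, "G", 2)]

def partition(attr_list, left_rids):
    # Distribute one attribute list to the two children of a split; left_rids
    # (record ids routed to the left child) plays the role of the hash table
    # SPRINT builds from the splitting attribute's list. Filtering a sorted
    # list keeps it sorted, so no re-sorting is needed.
    left  = [e for e in attr_list if e[2] in left_rids]
    right = [e for e in attr_list if e[2] not in left_rids]
    return left, right

left, right = partition(age_list, left_rids={0, 1})  # e.g. split at age <= 40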
[Figure: runtime of SLIQ vs. SPRINT as a function of the number of training objects (0 to 3 million)]
Shortcomings of SPRINT
Does not exploit the available main memory
Goals of RainForest
Exploits the available main memory to increase efficiency
[Figure: decision tree nodes N2 and N3 with their AVC groups]

AVC set „income“ for N2:
value   class   count
15      B       1
65      G       1
75      G       1

AVC set „age“ for N2:
value   class   count
young   B       1
young   G       2
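An AVC set counts, per (attribute value, class label) pair, the training objects of a node; all AVC sets of a node form its AVC group. A minimal sketch of how one sequential scan fills the AVC group (Python; the record layout is a hypothetical example matching the tables above):

from collections import defaultdict

# Objects routed to node N2: (age, income, class label).
node_records = [("young", 15, "B"), ("young", 65, "G"), ("young", 75, "G")]

avc_group = {"age": defaultdict(int), "income": defaultdict(int)}
for age, income, label in node_records:        # one sequential scan
    avc_group["age"][(age, label)] += 1        # AVC set "age"
    avc_group["income"][(income, label)] += 1  # AVC set "income"

# dict(avc_group["age"]) == {("young", "B"): 1, ("young", "G"): 2}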
RainForest: Algorithms
Assumption
The entire AVC group of the root node fits into main memory
Then the AVC groups of all other nodes also fit into main memory
Algorithm RF_Write
Construction of the AVC group of node k in main memory by a sequential scan over the training set
Determination of the optimal split for node k by using the AVC group
Reading the training set and distributing (writing) it to the partitions
Algorithm RF_Read
Avoids the explicit writing of the partitions to secondary storage where possible
The training database is read multiple times, once for each tree level
Algorithm RF_Hybrid
Uses RF_Read as long as the AVC groups of all nodes of the current level of the decision tree fit into main memory
Subsequently materializes the partitions by using RF_Write
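A self-contained sketch of RF_Write's control flow (Python; in-memory lists stand in for the partition files on secondary storage, and best_split and distribute are simplified placeholders I introduce here, the actual split selection is omitted):

from collections import defaultdict

def build_avc_group(records):
    # One sequential scan: count (attribute, value, class) triples.
    avc = defaultdict(int)
    for features, label in records:
        for attr, value in features.items():
            avc[(attr, value, label)] += 1
    return avc

def best_split(avc_group):
    # Placeholder: real RainForest evaluates e.g. gini or entropy on the
    # AVC group; returning None makes the node a leaf.
    return None

def distribute(records, split):
    # Placeholder: route each record to a child partition according to split.
    return []

def rf_write(records, node="root"):
    avc_group = build_avc_group(records)    # 1st scan over the partition
    split = best_split(avc_group)           # decided from the AVC group alone
    if split is None:                       # no useful split: node is a leaf
        return
    # 2nd scan: distribute (write) the records to one partition per child
    for child_node, child_records in distribute(records, split):
        rf_write(child_records, child_node)

rf_write([({"age": "young"}, "B"), ({"age": "young"}, "G")])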
[Figure: runtime of SPRINT vs. RainForest as a function of the number of training objects (in millions)]
Boosting: Algorithm
Algorithm
Assign every example an equal weight 1/N
For t = 1, 2, …, T do
train a classifier Ct on the weighted training set
increase the weights of the training examples that Ct misclassifies
Output the final classifier as a weighted vote of C1, …, CT
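One common instantiation of this scheme is AdaBoost; the sketch below is an assumption (the slide leaves the exact re-weighting rule open) and uses the standard AdaBoost weights:

import numpy as np

def boost(X, y, train_weak, T=10):
    # AdaBoost-style boosting. y must be in {-1, +1}; train_weak(X, y, w)
    # must return a predict function h with h(X) in {-1, +1}.
    n = len(y)
    w = np.full(n, 1.0 / n)                     # equal initial weights 1/N
    models, alphas = [], []
    for t in range(T):
        h = train_weak(X, y, w)
        miss = h(X) != y                        # misclassified examples
        err = float(np.dot(w, miss))            # weighted training error
        if err >= 0.5:                          # weaker than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        models.append(h)
        alphas.append(alpha)
        if err == 0.0:                          # perfect classifier: stop
            break
        w = w * np.exp(np.where(miss, alpha, -alpha))  # boost the weights
        w = w / w.sum()                         # of misclassified examples
    return lambda X_: np.sign(sum(a * h(X_) for a, h in zip(alphas, models)))

A weighted decision stump is the typical weak learner plugged in as train_weak.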
Introduction
Classification problem, evaluation of classifiers
Bayesian Classifiers
Optimal Bayes classifier, naive Bayes classifier, applications
Neural Networks
Advantages
prediction accuracy is generally high
Criticism
long training time
Network Training
Output vector: O_j
Error at an output node: Err_j = O_j (1 − O_j) (T_j − O_j)
Error at a hidden node: Err_j = O_j (1 − O_j) Σ_k Err_k w_jk
Weight update: w_ij = w_ij + (l) Err_j O_i
Bias update: θ_j = θ_j + (l) Err_j
Node output (sigmoid): O_j = 1 / (1 + e^(−I_j))
Node input: I_j = Σ_i w_ij O_i + θ_j
Input vector: x_i
where (l) is the learning rate and T_j the target output
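A minimal sketch of one backpropagation step for a single training example, implementing exactly the formulas above (Python/numpy; the network shape and helper names are illustrative):

import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))             # O_j = 1 / (1 + e^(-I_j))

def train_step(x, target, W1, b1, W2, b2, l=0.1):
    # W1/b1 feed the hidden layer, W2/b2 the output layer.
    O_h = sigmoid(W1 @ x + b1)                   # I_j = sum_i w_ij O_i + theta_j
    O_o = sigmoid(W2 @ O_h + b2)
    err_o = O_o * (1 - O_o) * (target - O_o)     # error at the output nodes
    err_h = O_h * (1 - O_h) * (W2.T @ err_o)     # error at the hidden nodes
    W2 += l * np.outer(err_o, O_h)               # w_ij += (l) Err_j O_i
    b2 += l * err_o                              # theta_j += (l) Err_j
    W1 += l * np.outer(err_h, x)
    b1 += l * err_h

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)   # 3 inputs, 2 hidden nodes
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # 1 output node
train_step(np.array([1.0, 0.0, 1.0]), np.array([1.0]), W1, b1, W2, b2)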
Network pruning
A fully connected network is hard to articulate, i.e. to interpret in terms of rules
N input nodes, h hidden nodes and m output nodes lead to h·(N+m) weights
Pruning: remove some of the links without affecting the classification accuracy of the network
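The slide does not fix a pruning criterion; one simple heuristic (an assumption on my part) removes the smallest-magnitude links and re-checks accuracy afterwards:

import numpy as np

def prune_smallest(W, fraction=0.2):
    # Zero out the given fraction of smallest-magnitude links; after each
    # pruning step the accuracy must be re-validated on held-out data.
    threshold = np.quantile(np.abs(W), fraction)
    return np.where(np.abs(W) <= threshold, 0.0, W)

W = np.array([[0.9, -0.05, 0.4], [0.01, -1.2, 0.3]])
print(prune_smallest(W))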
Extracting rules from a trained network
Discretize the activation values: replace each individual activation value by the average of its cluster, maintaining the network accuracy
Enumerate the outputs for the discretized activation values to find rules between activation values and output
Find the relationship between the inputs and the activation values
Combine the two rule sets above to obtain rules relating the output to the input
Genetic Algorithms
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created, consisting of randomly generated rules
e.g., “If A1 and Not A2 then C2” can be encoded as the bit string 100, where the two leftmost bits encode the attributes A1 and A2 and the rightmost bit the class
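A toy sketch of this encoding with one generation of selection, crossover and mutation (Python; the fitness function is a placeholder, a real GA would score each decoded rule by its classification accuracy):

import random

def crossover(a, b):                 # single-point crossover of two rules
    p = random.randrange(1, len(a))
    return a[:p] + b[p:], b[:p] + a[p:]

def mutate(rule, rate=0.1):          # flip each bit with a small probability
    return "".join(c if random.random() > rate else "10"[int(c)] for c in rule)

population = ["100", "011", "110", "001"]  # "100" = If A1 and Not A2 then C2
def fitness(rule):                         # placeholder fitness
    return rule.count("1")

# one generation: keep the fitter half, refill by crossover plus mutation
population.sort(key=fitness, reverse=True)
parents = population[: len(population) // 2]
children = []
while len(parents) + len(children) < 4:
    c1, c2 = crossover(*random.sample(parents, 2))
    children += [mutate(c1), mutate(c2)]
population = parents + children[: 4 - len(parents)]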
© and acknowledgements: Prof. Dr. Hans-Peter Kriegel and Matthias Schubert (LMU Munich)
and Dr. Thorsten Joachims (U Dortmund and Cornell U)
…
[Figure: two separating hyperplanes p1 and p2 with different margins]

Criteria
Stability at insertion:
∀i ∈ [1..n]: y_i · (⟨w, x_i⟩ + b) ≥ ξ
after rescaling:
∀i ∈ [1..n]: y_i · (⟨w, x_i⟩ + b) ≥ 1
Maximization of 1/⟨w, w⟩ corresponds to a minimization of ⟨w, w⟩
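A tiny numeric check of these criteria (Python/numpy; the hyperplane and the training points are made-up example values): verify y_i · (⟨w, x_i⟩ + b) ≥ 1 for all points and compute the geometric margin 1/‖w‖:

import numpy as np

w, b = np.array([1.0, -1.0]), 0.0          # candidate hyperplane (w, b)
X = np.array([[2.0, 0.0], [0.0, 2.0]])     # toy training points
y = np.array([+1.0, -1.0])                 # class labels y_i in {-1, +1}

margins = y * (X @ w + b)                  # y_i * (<w, x_i> + b)
print(bool(np.all(margins >= 1)))          # all constraints satisfied?
print(1.0 / np.linalg.norm(w))             # geometric margin 1 / ||w||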
Kernel Machines:
Extension of the Hypothesis Space
Principle
input space → (feature map φ) → extended feature space
Example
φ: (x, y, z) → (x, y, z, x², xy, xz, y², yz, z²)
[Figure: data that is not linearly separable over x1 becomes separable in the extended feature space (x1, x1²)]
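The point of kernel machines is that ⟨φ(u), φ(v)⟩ can be evaluated without ever materializing φ. The sketch below checks this for a quadratic map (Python; the √2 scaling is my assumption, chosen so that the identity with the degree-2 polynomial kernel holds exactly, a slight variant of the map on the slide):

import numpy as np

def phi(u):
    # Explicit quadratic feature map for 2-D input.
    x, y = u
    return np.array([x * x, y * y, np.sqrt(2) * x * y])

def k(u, v):
    return np.dot(u, v) ** 2     # polynomial kernel of degree 2

u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(u), phi(v)), k(u, v))   # both print 16.0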
What Is Prediction?
Locally weighted regression: fit a linear model to the neighborhood of the query point x_q
function: f̂(x) = w_0 + w_1 a_1(x) + … + w_n a_n(x)
minimize the squared error over the k nearest neighbors, weighted by a distance-decreasing kernel K:
E(x_q) ≡ ½ Σ_{x ∈ k nearest neighbors of x_q} (f(x) − f̂(x))² · K(d(x_q, x))
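A minimal sketch of this scheme (Python/numpy; as a simplifying assumption the local model is a constant, the weighted mean, which is the simplest f̂ minimizing the weighted squared error above):

import numpy as np

def kernel(d):                     # distance-decreasing weight K
    return np.exp(-d ** 2)

def predict(xq, X, f, k=3):
    # Locally weighted prediction at query point xq: average the k nearest
    # neighbors, each weighted by K(d(xq, x)).
    d = np.linalg.norm(X - xq, axis=1)
    nn = np.argsort(d)[:k]                    # k nearest neighbors of xq
    w = kernel(d[nn])
    return np.dot(w, f[nn]) / w.sum()

X = np.array([[0.0], [1.0], [2.0], [3.0]])
f = np.array([0.0, 1.0, 4.0, 9.0])
print(predict(np.array([1.5]), X, f))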
Chapter 7 – Conclusions
References (II)
J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118–159. Blackwell Business, Cambridge, Massachusetts, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345–389, 1998.
J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725–730, Portland, OR, August 1996.
R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), pages 404–415, New York, NY, August 1998.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pages 544–555, Bombay, India, September 1996.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.