
CSCE 478/878 Lecture 7: Bagging and Boosting

Stephen Scott
sscott@cse.unl.edu

(Adapted from Ethem Alpaydin and from Rob Schapire and Yoav Freund)

Introduction

Sometimes a single classifier (e.g., a neural network or decision tree) won't perform well, but a weighted combination of several classifiers will.

When asked to predict the label for a new example, each classifier (inferred by a base learner) makes its own prediction, and the master algorithm (or meta-learner) combines these predictions, using its weights, to form its own prediction.

If the classifiers themselves cannot learn (e.g., fixed heuristics), then the best we can do is learn a good set of weights (e.g., with Weighted Majority).

If we are using a learning algorithm (e.g., ANN, decision tree), then we can rerun the algorithm on different subsamples of the training set and set the classifiers' weights during training.
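
As a concrete illustration (not from the lecture), here is a minimal Python sketch of a meta-learner combining fixed ±1 classifiers with a set of weights; the classifiers and the weights below are hypothetical placeholders.

```python
import numpy as np

def weighted_vote(classifiers, weights, x):
    """Meta-learner: combine the +/-1 predictions of fixed classifiers using weights."""
    votes = np.array([c(x) for c in classifiers])  # each c(x) returns +1 or -1
    return int(np.sign(np.dot(weights, votes)))    # sign of the weighted sum

# Hypothetical heuristic classifiers over a 2-D input x = (x1, x2)
classifiers = [
    lambda x: +1 if x[0] > 0 else -1,
    lambda x: +1 if x[1] > 0 else -1,
    lambda x: +1 if x[0] + x[1] > 1 else -1,
]
weights = np.array([0.5, 0.3, 0.2])  # e.g., learned by Weighted Majority

print(weighted_vote(classifiers, weights, (0.4, 0.2)))  # -> 1 (first two classifiers outvote the third)
```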

Outline
Bagging
Boosting

Bagging
[Breiman, ML Journal, 1996]
Bagging = Bootstrap aggregating

Bootstrap sampling: given a set X containing N training examples, create Xj by drawing N examples uniformly at random with replacement from X. Expect Xj to omit about 37% of the examples in X, since each example is left out with probability (1 − 1/N)^N ≈ 1/e.

Bagging:
Create L bootstrap samples X1, . . . , XL
Train classifier dj on Xj
Classify a new instance x by majority vote of the learned classifiers (equal weights)

Result: an ensemble of classifiers
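
A minimal Python sketch of this procedure (not from the lecture); base_learner stands for any training routine that returns a classifier mapping an instance to a ±1 label, and X, y are assumed to be NumPy arrays.

```python
import numpy as np

def bag(base_learner, X, y, L=50, rng=None):
    """Train L classifiers, each on a bootstrap sample of (X, y)."""
    rng = rng or np.random.default_rng(0)
    N = len(X)
    ensemble = []
    for _ in range(L):
        idx = rng.integers(0, N, size=N)   # N draws with replacement; ~37% of X omitted
        ensemble.append(base_learner(X[idx], y[idx]))
    return ensemble

def bagged_predict(ensemble, x):
    """Classify x by an equal-weight majority vote over the ensemble's +/-1 predictions."""
    return int(np.sign(sum(d(x) for d in ensemble)))
```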

Bagging Experiment
[Breiman, ML Journal, 1996]
Given a sample X of labeled data, Breiman did the following 100 times and reported the average:

Divide X randomly into test set T (10%) and training set D (90%)
Learn a decision tree from D and let eS be its error rate on T
Do 50 times: create a bootstrap sample Xj of D and learn a decision tree (so ensemble size = 50); then let eB be the error of a majority vote of the 50 trees on T
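
The following sketch approximates this protocol in Python with scikit-learn decision trees (it is not Breiman's code; the dataset, seeds, and tie-breaking are illustrative choices).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for one of Breiman's datasets
rng = np.random.default_rng(0)
e_S, e_B = [], []
for trial in range(100):                     # 100 random 90/10 splits, then average
    D_X, T_X, D_y, T_y = train_test_split(X, y, test_size=0.1, random_state=trial)
    single = DecisionTreeClassifier(random_state=trial).fit(D_X, D_y)
    e_S.append(np.mean(single.predict(T_X) != T_y))         # error of one tree on T

    votes = np.zeros(len(T_y))
    for j in range(50):                      # ensemble of 50 bootstrap trees
        idx = rng.integers(0, len(D_X), size=len(D_X))
        tree = DecisionTreeClassifier(random_state=j).fit(D_X[idx], D_y[idx])
        votes += tree.predict(T_X)           # labels are 0/1, so the sum counts 1-votes
    e_B.append(np.mean((votes >= 25).astype(int) != T_y))   # majority of 50 votes (ties -> 1)

print(f"avg e_S = {np.mean(e_S):.3f}, avg e_B = {np.mean(e_B):.3f}")
```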

Bagging Experiment
Results
(eS = error of a single tree, eB = error of the bagged ensemble; both in percent)

Data Set        eS     eB     Decrease
waveform        29.0   19.4   33%
heart           10.0    5.3   47%
breast cancer    6.0    4.2   30%
ionosphere      11.2    8.6   23%
diabetes        23.4   18.8   20%
glass           32.0   24.9   27%
soybean         14.5   10.6   27%

Bagging Experiment
(cont'd)

Same experiment, but using a nearest-neighbor classifier, where the prediction for a new example x is the label of x's nearest neighbor in the training set under (e.g.) Euclidean distance.

Results (error rates in percent):

Data Set        eS     eB     Decrease
waveform        26.1   26.1   0%
heart            6.3    6.3   0%
breast cancer    4.9    4.9   0%
ionosphere      35.7   35.7   0%
diabetes        16.4   16.4   0%
glass           16.4   16.4   0%

What happened?

When Does Bagging Help?


When the learner is unstable, i.e., when a small change in the training set causes a large change in the hypothesis produced

Unstable: decision trees, neural networks
Stable: nearest neighbor

Experimentally, bagging can help substantially for unstable learners; it can somewhat degrade results for stable learners

Boosting
[Schapire & Freund Book]
Similar to bagging, but don't always sample uniformly; instead, adjust the resampling distribution pj over X to focus attention on previously misclassified examples

The final classifier is a weighted combination of the learned classifiers d1, . . . , dL, but the weights are not uniform: dj's weight depends on its error on X with respect to pj (i.e., on its performance on the data it was trained on)

Boosting
Algorithm Idea [pj ≡ Dj ; dj ≡ hj ]

Repeat for j = 1, . . . , L:

1 Run the learning algorithm on examples randomly drawn from the training set X according to distribution pj (p1 = uniform)
  Can sample X according to pj and train normally, or directly minimize error on X w.r.t. pj
  Output of the learner is a binary hypothesis dj
2 Compute error_pj(dj) = error of dj on examples from X drawn according to pj (can be computed exactly)
3 Create pj+1 from pj by decreasing the weight of instances that dj predicts correctly

Boosting
Algorithm Pseudocode (Fig 17.2)
(Alpaydin's figure 17.2 pseudocode is shown on this slide as an image.) On each round, the weights of incorrectly classified examples are increased so that, effectively, hard examples get successively higher weight, forcing the base learner to focus its attention on them.

Boosting

Algorithm Pseudocode (Schapire & Freund)


Algorithm 1.1
The boosting algorithm AdaBoost

Given: (x1, y1), . . . , (xm, ym) where xi ∈ X, yi ∈ {−1, +1}.
Initialize: D1(i) = 1/m for i = 1, . . . , m.
For t = 1, . . . , T:

  Train weak learner using distribution Dt.
  Get weak hypothesis ht : X → {−1, +1}.
  Aim: select ht to minimalize the weighted error
      εt = Pr_{i∼Dt}[ht(xi) ≠ yi].
  Choose αt = (1/2) ln((1 − εt)/εt).
  Update, for i = 1, . . . , m:
      Dt+1(i) = (Dt(i)/Zt) × { e^(−αt) if ht(xi) = yi ;  e^(αt) if ht(xi) ≠ yi }
              = Dt(i) exp(−αt yi ht(xi)) / Zt,
      where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).

Output the final hypothesis:
      H(x) = sign( Σ_{t=1}^{T} αt ht(x) ).
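
A minimal Python sketch of Algorithm 1.1 (a sketch under assumptions, not the book's code): the weak learner is a simple decision stump chosen to minimize the weighted error, labels are assumed to be ±1, and the toy data below is hypothetical.

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner: the single-feature threshold stump with minimum weighted error under D."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (+1, -1):
                pred = np.where(X[:, f] > thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, lambda Z, f=f, thr=thr, s=sign: np.where(Z[:, f] > thr, s, -s))
    return best

def adaboost(X, y, T=3):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    alphas, hyps = [], []
    for t in range(T):
        eps, h = best_stump(X, y, D)              # train weak learner using distribution D_t
        eps = max(eps, 1e-10)                     # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * h(X))         # up-weight mistakes, down-weight correct
        D /= D.sum()                              # normalize by Z_t
        alphas.append(alpha); hyps.append(h)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hyps)))

# Hypothetical 2-D data with +/-1 labels (no single stump classifies it perfectly)
X = np.array([[1, 1], [2, 2], [3, 1], [4, 2], [5, 1], [6, 2]], dtype=float)
y = np.array([+1, +1, -1, +1, -1, -1])
H = adaboost(X, y, T=3)
print(H(X))    # predictions of the boosted combination on the training points
```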

Boosting

Schapire & Freund Example [Dj = pj ; hj = dj ; j =


D1

CSCE
478/878
Lecture 7:
Bagging and
Boosting
Stephen Scott
Introduction
Outline

1
2



1 Introduction
1 and Ove

ln(1/j ) =

h1
2

10
9

Bagging
Boosting
Algorithm
Example
Experimental Results
Miscellany

13 / 19

D2

h2

1
2

ln

j

Boosting
Schapire & Freund Example [Dj = pj ; hj = dj ; j =

1
2

ln(1/j ) =

1
2

ln

1.2 Boosting
CSCE
478/878
Lecture 7:
Bagging and
Boosting

1j
j


]
9

Table 1.1
The numerical calculations corresponding to the toy example in figure 1.1
1

10

D1 (i)
e1 yi h1 (xi )
D1 (i) e1 yi h1 (xi )

0.10
1.53
0.15

0.10
1.53
0.15

0.10
1.53
0.15

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

D2 (i)
e2 yi h2 (xi )
D2 (i) e2 yi h2 (xi )

0.17
0.52
0.09

0.17
0.52
0.09

0.17
0.52
0.09

0.07
0.52
0.04

0.07
0.52
0.04

0.07
1.91
0.14

0.07
1.91
0.14

0.07
0.52
0.04

0.07
1.91
0.14

0.07
0.52
0.04

D3 (i)
e3 yi h3 (xi )
D3 (i) e3 yi h3 (xi )

0.11
0.40
0.04

0.11
0.40
0.04

0.11
0.40
0.04

0.05
2.52
0.11

0.05
2.52
0.11

0.17
0.40
0.07

0.17
0.40
0.07

0.05
2.52
0.11

0.17
0.40
0.07

0.05
0.40
0.02

Stephen Scott
Introduction
Outline
Bagging
Boosting
Algorithm
Example
Experimental Results
Miscellany

1 = 0.30, 1 0.42
Z1 0.92
2 0.21, 2 0.65
Z2 0.82
3 0.14, 3 0.92
Z3 0.69

Calculations are shown for the ten examples as numbered in the figure. Examples on which hypothesis ht makes
a mistake are indicated by underlined figures in the rows marked Dt .

14 / 19

On round 1, AdaBoost assigns equal weight to all of the examples, as is indicated in


the figure by drawing all examples in the box marked D1 to be of the same size. Given
examples with these weights, the base learner chooses the base hypothesis indicated by h1
in the figure, which classifies points as positive if and only if they lie to the left of this
line. This hypothesis incorrectly classifies three pointsnamely, the three circled positive
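
As a sanity check of the table's first block (using its rounded values), round 1's numbers follow directly from Algorithm 1.1:

```latex
\[
\begin{aligned}
\alpha_1 &= \tfrac{1}{2}\ln\tfrac{1-\epsilon_1}{\epsilon_1}
          = \tfrac{1}{2}\ln\tfrac{0.70}{0.30} \approx 0.42,
\qquad e^{-\alpha_1}\approx 0.65,\quad e^{\alpha_1}\approx 1.53,\\[2pt]
Z_1 &= \sum_i D_1(i)\,e^{-\alpha_1 y_i h_1(x_i)}
     \approx 7(0.10)(0.65) + 3(0.10)(1.53) \approx 0.92,\\[2pt]
D_2(i) &= \frac{D_1(i)\,e^{-\alpha_1 y_i h_1(x_i)}}{Z_1}
       \approx \begin{cases}
          0.15/0.92 \approx 0.17 & \text{if } h_1 \text{ misclassifies } x_i \text{ (3 examples)},\\
          0.07/0.92 \approx 0.07 & \text{otherwise (7 examples)}.
       \end{cases}
\end{aligned}
\]
```

Rounds 2 and 3 follow the same pattern, starting from ε2 ≈ 0.21 and ε3 ≈ 0.14.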

Boosting
Schapire & Freund Example [Dj = pj ; hj = dj ; j =

1
2

ln(1/j ) =

1
2

ln

1j
j


]

CSCE
478/878
Lecture 7:
Bagging and
Boosting
Stephen Scott
Introduction
Outline

D3
h3

Bagging
Boosting
Algorithm
Example
Experimental Results
Miscellany

Figure 1.1
An illustration of how AdaBoost behaves on a tiny toy problem with m = 10 examples. Each row depi
round, for t = 1, 2, 3. The left box in each row represents the distribution Dt , with the size of each example
in proportion to its weight under that distribution. Each box on the right shows the weak hypothesis ht ,
darker shading indicates the region of the domain predicted to be positive. Examples that are misclassifie
have been circled.

15 / 19

Boosting
Example (cont'd)

H_final(x) = sign( 0.42 h1(x) + 0.65 h2(x) + 0.92 h3(x) )

(Shown in the figure as the weighted combination of the three hypotheses.)

Note: H_final is not in the original hypothesis class!

In this case, we need at least two of the three hypotheses to predict +1 for the weighted sum to exceed 0.
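
A quick arithmetic check of this claim with the rounded weights α1 ≈ 0.42, α2 ≈ 0.65, α3 ≈ 0.92: any two positive votes outweigh the remaining negative one, while even the largest single positive vote does not.

```latex
\[
\begin{gathered}
0.42 + 0.65 - 0.92 = 0.15 > 0, \qquad
0.42 + 0.92 - 0.65 = 0.69 > 0, \qquad
0.65 + 0.92 - 0.42 = 1.15 > 0, \\
0.92 - 0.65 - 0.42 = -0.15 < 0.
\end{gathered}
\]
```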

Boosting
Experimental Results
Scatter plots: percent classification error of non-boosted vs. boosted base learners on 27 learning tasks

Figure 1.3: Comparison of two base learning algorithms, decision stumps and C4.5, with and without boosting. Each point in each scatterplot shows the test error rate of the two competing algorithms on one of 27 benchmark learning problems. The x-coordinate of each point gives the test error rate (in percent) using boosting, and the y-coordinate gives the error rate without boosting when using decision stumps (left plot) or C4.5 (right plot). All error rates have been averaged over multiple runs.

Boosting
Experimental Results (cont'd)

(Scatter plots comparing boosted decision stumps against unboosted C4.5 and against boosted C4.5.)

Figure 1.4: Comparison of boosting using decision stumps as the base learner versus unboosted C4.5 (left plot) and boosted C4.5 (right plot).

Boosting
Miscellany
If each εj < 1/2, the error of the ensemble on X drops exponentially in Σ_{j=1}^{L} γj², where γj = 1/2 − εj
Can also bound the generalization error of the ensemble
Very successful empirically
Generalization sometimes improves if training continues after the ensemble's error on X drops to 0
  Contrary to intuition: we would expect overfitting
  Related to increasing the combined classifier's margin
Useful even with very simple base learners, e.g., decision stumps
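
For reference (this derivation is from Schapire & Freund's book, not spelled out on the slide), the training-error bound behind the first point is, in the slide's notation with γj = 1/2 − εj:

```latex
\[
\frac{1}{N}\Bigl|\{\,i : H(x_i)\neq y_i\,\}\Bigr|
\;\le\; \prod_{j=1}^{L} Z_j
\;=\; \prod_{j=1}^{L} 2\sqrt{\epsilon_j(1-\epsilon_j)}
\;=\; \prod_{j=1}^{L} \sqrt{1-4\gamma_j^{2}}
\;\le\; \exp\Bigl(-2\sum_{j=1}^{L}\gamma_j^{2}\Bigr).
\]
```

So as long as each εj stays bounded below 1/2 (each γj bounded above 0), the ensemble's error on X is driven to 0 exponentially fast in L.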
