
CSCE 478/878 Lecture 7: Bagging and Boosting

Stephen Scott
sscott@cse.unl.edu

(Adapted from Ethem Alpaydin and from Rob Schapire and Yoav Freund)

Introduction

Sometimes a single classifier (e.g., a neural network or decision tree) won't perform well, but a weighted combination of several classifiers will.

When asked to predict the label for a new example, each classifier (inferred by a base learner) makes its own prediction, and the master algorithm (or meta-learner) combines these predictions, using its weights, to form its own prediction.

If the classifiers themselves cannot learn (e.g., fixed heuristics), then the best we can do is learn a good set of weights (e.g., with Weighted Majority).

If we are using a learning algorithm (e.g., ANN, decision tree), then we can rerun the algorithm on different subsamples of the training set and set the classifiers' weights during training.
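
As a concrete illustration (not from the lecture), here is a minimal Python sketch of a meta-learner combining fixed ±1 classifiers with a set of weights; the classifiers and the weights below are hypothetical placeholders.

```python
import numpy as np

def weighted_vote(classifiers, weights, x):
    """Meta-learner: combine the +/-1 predictions of fixed classifiers using weights."""
    votes = np.array([c(x) for c in classifiers])  # each c(x) returns +1 or -1
    return int(np.sign(np.dot(weights, votes)))    # sign of the weighted sum

# Hypothetical heuristic classifiers over a 2-D input x = (x1, x2)
classifiers = [
    lambda x: +1 if x[0] > 0 else -1,
    lambda x: +1 if x[1] > 0 else -1,
    lambda x: +1 if x[0] + x[1] > 1 else -1,
]
weights = np.array([0.5, 0.3, 0.2])  # e.g., learned by Weighted Majority

print(weighted_vote(classifiers, weights, (0.4, 0.2)))  # -> 1 (first two classifiers outvote the third)
```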

Outline
Bagging
Boosting

Bagging
[Breiman, ML Journal, 1996]
Bagging = Bootstrap aggregating

Bootstrap sampling: given a set X containing N training examples, create Xj by drawing N examples uniformly at random with replacement from X. Expect Xj to omit about 37% of the examples in X, since each example is left out with probability (1 − 1/N)^N ≈ 1/e.

Bagging:
Create L bootstrap samples X1, . . . , XL
Train classifier dj on Xj
Classify a new instance x by majority vote of the learned classifiers (equal weights)

Result: an ensemble of classifiers
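
A minimal Python sketch of this procedure (not from the lecture); base_learner stands for any training routine that returns a classifier mapping an instance to a ±1 label, and X, y are assumed to be NumPy arrays.

```python
import numpy as np

def bag(base_learner, X, y, L=50, rng=None):
    """Train L classifiers, each on a bootstrap sample of (X, y)."""
    rng = rng or np.random.default_rng(0)
    N = len(X)
    ensemble = []
    for _ in range(L):
        idx = rng.integers(0, N, size=N)   # N draws with replacement; ~37% of X omitted
        ensemble.append(base_learner(X[idx], y[idx]))
    return ensemble

def bagged_predict(ensemble, x):
    """Classify x by an equal-weight majority vote over the ensemble's +/-1 predictions."""
    return int(np.sign(sum(d(x) for d in ensemble)))
```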

Bagging Experiment
[Breiman, ML Journal, 1996]
Given a sample X of labeled data, Breiman did the following 100 times and reported the average:

Divide X randomly into test set T (10%) and training set D (90%)
Learn a decision tree from D and let eS be its error rate on T
Do 50 times: create a bootstrap sample Xj of D and learn a decision tree (so ensemble size = 50); then let eB be the error of a majority vote of the 50 trees on T
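
The following sketch approximates this protocol in Python with scikit-learn decision trees (it is not Breiman's code; the dataset, seeds, and tie-breaking are illustrative choices).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for one of Breiman's datasets
rng = np.random.default_rng(0)
e_S, e_B = [], []
for trial in range(100):                     # 100 random 90/10 splits, then average
    D_X, T_X, D_y, T_y = train_test_split(X, y, test_size=0.1, random_state=trial)
    single = DecisionTreeClassifier(random_state=trial).fit(D_X, D_y)
    e_S.append(np.mean(single.predict(T_X) != T_y))         # error of one tree on T

    votes = np.zeros(len(T_y))
    for j in range(50):                      # ensemble of 50 bootstrap trees
        idx = rng.integers(0, len(D_X), size=len(D_X))
        tree = DecisionTreeClassifier(random_state=j).fit(D_X[idx], D_y[idx])
        votes += tree.predict(T_X)           # labels are 0/1, so the sum counts 1-votes
    e_B.append(np.mean((votes >= 25).astype(int) != T_y))   # majority of 50 votes (ties -> 1)

print(f"avg e_S = {np.mean(e_S):.3f}, avg e_B = {np.mean(e_B):.3f}")
```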

Bagging Experiment
Results
(eS = error of a single tree, eB = error of the bagged ensemble; both in percent)

Data Set        eS     eB     Decrease
waveform        29.0   19.4   33%
heart           10.0    5.3   47%
breast cancer    6.0    4.2   30%
ionosphere      11.2    8.6   23%
diabetes        23.4   18.8   20%
glass           32.0   24.9   27%
soybean         14.5   10.6   27%

Bagging Experiment
(cont'd)

Same experiment, but using a nearest-neighbor classifier, where the prediction for a new example x is the label of x's nearest neighbor in the training set under (e.g.) Euclidean distance.

Results (error rates in percent):

Data Set        eS     eB     Decrease
waveform        26.1   26.1   0%
heart            6.3    6.3   0%
breast cancer    4.9    4.9   0%
ionosphere      35.7   35.7   0%
diabetes        16.4   16.4   0%
glass           16.4   16.4   0%

What happened?

When Does Bagging Help?


When the learner is unstable, i.e., when a small change in the training set causes a large change in the hypothesis produced

Unstable: decision trees, neural networks
Stable: nearest neighbor

Experimentally, bagging can help substantially for unstable learners; it can somewhat degrade results for stable learners

Boosting
[Schapire & Freund Book]
Similar to bagging, but don't always sample uniformly; instead, adjust the resampling distribution pj over X to focus attention on previously misclassified examples

The final classifier is a weighted combination of the learned classifiers d1, . . . , dL, but the weights are not uniform: dj's weight depends on its error on X with respect to pj (i.e., on its performance on the data it was trained on)

Boosting
Algorithm Idea [pj ≡ Dj ; dj ≡ hj ]

Repeat for j = 1, . . . , L:

1 Run the learning algorithm on examples randomly drawn from the training set X according to distribution pj (p1 = uniform)
  Can sample X according to pj and train normally, or directly minimize error on X w.r.t. pj
  Output of the learner is a binary hypothesis dj
2 Compute error_pj(dj) = error of dj on examples from X drawn according to pj (can be computed exactly)
3 Create pj+1 from pj by decreasing the weight of instances that dj predicts correctly

Boosting
Algorithm Pseudocode (Fig 17.2)
(Alpaydin's figure 17.2 pseudocode is shown on this slide as an image.) On each round, the weights of incorrectly classified examples are increased so that, effectively, hard examples get successively higher weight, forcing the base learner to focus its attention on them.

Boosting

Algorithm Pseudocode (Schapire & Freund)


Algorithm 1.1
The boosting algorithm AdaBoost

Given: (x1, y1), . . . , (xm, ym) where xi ∈ X, yi ∈ {−1, +1}.
Initialize: D1(i) = 1/m for i = 1, . . . , m.
For t = 1, . . . , T:

  Train weak learner using distribution Dt.
  Get weak hypothesis ht : X → {−1, +1}.
  Aim: select ht to minimalize the weighted error
      εt = Pr_{i∼Dt}[ht(xi) ≠ yi].
  Choose αt = (1/2) ln((1 − εt)/εt).
  Update, for i = 1, . . . , m:
      Dt+1(i) = (Dt(i)/Zt) × { e^(−αt) if ht(xi) = yi ;  e^(αt) if ht(xi) ≠ yi }
              = Dt(i) exp(−αt yi ht(xi)) / Zt,
      where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).

Output the final hypothesis:
      H(x) = sign( Σ_{t=1}^{T} αt ht(x) ).
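
A minimal Python sketch of Algorithm 1.1 (a sketch under assumptions, not the book's code): the weak learner is a simple decision stump chosen to minimize the weighted error, labels are assumed to be ±1, and the toy data below is hypothetical.

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner: the single-feature threshold stump with minimum weighted error under D."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (+1, -1):
                pred = np.where(X[:, f] > thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, lambda Z, f=f, thr=thr, s=sign: np.where(Z[:, f] > thr, s, -s))
    return best

def adaboost(X, y, T=3):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    alphas, hyps = [], []
    for t in range(T):
        eps, h = best_stump(X, y, D)              # train weak learner using distribution D_t
        eps = max(eps, 1e-10)                     # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * h(X))         # up-weight mistakes, down-weight correct
        D /= D.sum()                              # normalize by Z_t
        alphas.append(alpha); hyps.append(h)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hyps)))

# Hypothetical 2-D data with +/-1 labels (no single stump classifies it perfectly)
X = np.array([[1, 1], [2, 2], [3, 1], [4, 2], [5, 1], [6, 2]], dtype=float)
y = np.array([+1, +1, -1, +1, -1, -1])
H = adaboost(X, y, T=3)
print(H(X))    # predictions of the boosted combination on the training points
```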

Boosting

Schapire & Freund Example [Dj = pj ; hj = dj ; j =


D1

CSCE
478/878
Lecture 7:
Bagging and
Boosting
Stephen Scott
Introduction
Outline

1
2



1 Introduction
1 and Ove

ln(1/j ) =

h1
2

10
9

Bagging
Boosting
Algorithm
Example
Experimental Results
Miscellany

13 / 19

D2

h2

1
2

ln

j

Boosting
Schapire & Freund Example [Dj = pj ; hj = dj ; j =

1
2

ln(1/j ) =

1
2

ln

1.2 Boosting
CSCE
478/878
Lecture 7:
Bagging and
Boosting

1j
j


]
9

Table 1.1
The numerical calculations corresponding to the toy example in figure 1.1
1

10

D1 (i)
e1 yi h1 (xi )
D1 (i) e1 yi h1 (xi )

0.10
1.53
0.15

0.10
1.53
0.15

0.10
1.53
0.15

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

0.10
0.65
0.07

D2 (i)
e2 yi h2 (xi )
D2 (i) e2 yi h2 (xi )

0.17
0.52
0.09

0.17
0.52
0.09

0.17
0.52
0.09

0.07
0.52
0.04

0.07
0.52
0.04

0.07
1.91
0.14

0.07
1.91
0.14

0.07
0.52
0.04

0.07
1.91
0.14

0.07
0.52
0.04

D3 (i)
e3 yi h3 (xi )
D3 (i) e3 yi h3 (xi )

0.11
0.40
0.04

0.11
0.40
0.04

0.11
0.40
0.04

0.05
2.52
0.11

0.05
2.52
0.11

0.17
0.40
0.07

0.17
0.40
0.07

0.05
2.52
0.11

0.17
0.40
0.07

0.05
0.40
0.02

Stephen Scott
Introduction
Outline
Bagging
Boosting
Algorithm
Example
Experimental Results
Miscellany

1 = 0.30, 1 0.42
Z1 0.92
2 0.21, 2 0.65
Z2 0.82
3 0.14, 3 0.92
Z3 0.69

Calculations are shown for the ten examples as numbered in the figure. Examples on which hypothesis ht makes
a mistake are indicated by underlined figures in the rows marked Dt .

14 / 19

On round 1, AdaBoost assigns equal weight to all of the examples, as is indicated in


the figure by drawing all examples in the box marked D1 to be of the same size. Given
examples with these weights, the base learner chooses the base hypothesis indicated by h1
in the figure, which classifies points as positive if and only if they lie to the left of this
line. This hypothesis incorrectly classifies three pointsnamely, the three circled positive
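
As a sanity check of the table's first block (using its rounded values), round 1's numbers follow directly from Algorithm 1.1:

```latex
\[
\begin{aligned}
\alpha_1 &= \tfrac{1}{2}\ln\tfrac{1-\epsilon_1}{\epsilon_1}
          = \tfrac{1}{2}\ln\tfrac{0.70}{0.30} \approx 0.42,
\qquad e^{-\alpha_1}\approx 0.65,\quad e^{\alpha_1}\approx 1.53,\\[2pt]
Z_1 &= \sum_i D_1(i)\,e^{-\alpha_1 y_i h_1(x_i)}
     \approx 7(0.10)(0.65) + 3(0.10)(1.53) \approx 0.92,\\[2pt]
D_2(i) &= \frac{D_1(i)\,e^{-\alpha_1 y_i h_1(x_i)}}{Z_1}
       \approx \begin{cases}
          0.15/0.92 \approx 0.17 & \text{if } h_1 \text{ misclassifies } x_i \text{ (3 examples)},\\
          0.07/0.92 \approx 0.07 & \text{otherwise (7 examples)}.
       \end{cases}
\end{aligned}
\]
```

Rounds 2 and 3 follow the same pattern, starting from ε2 ≈ 0.21 and ε3 ≈ 0.14.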

Boosting
Schapire & Freund Example [Dj = pj ; hj = dj ; j =

1
2

ln(1/j ) =

1
2

ln

1j
j


]

CSCE
478/878
Lecture 7:
Bagging and
Boosting
Stephen Scott
Introduction
Outline

D3
h3

Bagging
Boosting
Algorithm
Example
Experimental Results
Miscellany

Figure 1.1
An illustration of how AdaBoost behaves on a tiny toy problem with m = 10 examples. Each row depi
round, for t = 1, 2, 3. The left box in each row represents the distribution Dt , with the size of each example
in proportion to its weight under that distribution. Each box on the right shows the weak hypothesis ht ,
darker shading indicates the region of the domain predicted to be positive. Examples that are misclassifie
have been circled.

15 / 19

Boosting
Example (cont'd)

H_final(x) = sign( 0.42 h1(x) + 0.65 h2(x) + 0.92 h3(x) )

(Shown in the figure as the weighted combination of the three hypotheses.)

Note: H_final is not in the original hypothesis class!

In this case, we need at least two of the three hypotheses to predict +1 for the weighted sum to exceed 0.
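
A quick arithmetic check of this claim with the rounded weights α1 ≈ 0.42, α2 ≈ 0.65, α3 ≈ 0.92: any two positive votes outweigh the remaining negative one, while even the largest single positive vote does not.

```latex
\[
\begin{gathered}
0.42 + 0.65 - 0.92 = 0.15 > 0, \qquad
0.42 + 0.92 - 0.65 = 0.69 > 0, \qquad
0.65 + 0.92 - 0.42 = 1.15 > 0, \\
0.92 - 0.65 - 0.42 = -0.15 < 0.
\end{gathered}
\]
```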

Boosting
Experimental Results
Scatter plots: percent classification error of non-boosted vs. boosted base learners on 27 learning tasks

Figure 1.3: Comparison of two base learning algorithms, decision stumps and C4.5, with and without boosting. Each point in each scatterplot shows the test error rate of the two competing algorithms on one of 27 benchmark learning problems. The x-coordinate of each point gives the test error rate (in percent) using boosting, and the y-coordinate gives the error rate without boosting when using decision stumps (left plot) or C4.5 (right plot). All error rates have been averaged over multiple runs.

Boosting
Experimental Results (cont'd)

(Scatter plots comparing boosted decision stumps against unboosted C4.5 and against boosted C4.5.)

Figure 1.4: Comparison of boosting using decision stumps as the base learner versus unboosted C4.5 (left plot) and boosted C4.5 (right plot).

Boosting
Miscellany
If each εj < 1/2, the error of the ensemble on X drops exponentially in Σ_{j=1}^{L} γj², where γj = 1/2 − εj
Can also bound the generalization error of the ensemble
Very successful empirically
Generalization sometimes improves if training continues after the ensemble's error on X drops to 0
  Contrary to intuition: we would expect overfitting
  Related to increasing the combined classifier's margin
Useful even with very simple base learners, e.g., decision stumps
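
For reference (this derivation is from Schapire & Freund's book, not spelled out on the slide), the training-error bound behind the first point is, in the slide's notation with γj = 1/2 − εj:

```latex
\[
\frac{1}{N}\Bigl|\{\,i : H(x_i)\neq y_i\,\}\Bigr|
\;\le\; \prod_{j=1}^{L} Z_j
\;=\; \prod_{j=1}^{L} 2\sqrt{\epsilon_j(1-\epsilon_j)}
\;=\; \prod_{j=1}^{L} \sqrt{1-4\gamma_j^{2}}
\;\le\; \exp\Bigl(-2\sum_{j=1}^{L}\gamma_j^{2}\Bigr).
\]
```

So as long as each εj stays bounded below 1/2 (each γj bounded above 0), the ensemble's error on X is driven to 0 exponentially fast in L.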
