
Random Forest

Applied Multivariate Statistics, Spring 2012

Overview

- Intuition of Random Forest
- The Random Forest Algorithm
- De-correlation gives better accuracy
- Out-of-bag error (OOB error)
- Variable importance

Intuition of Random Forest

[Figure: three decision trees grown on the same data. Tree 1 splits on age (young/old), Tree 2 on sex (male/female) and height (tall/short), Tree 3 on work status (retired/working) and height (tall/short); each leaf predicts healthy or diseased.]

New sample: old, retired, male, short

Tree predictions: diseased, healthy, diseased

Majority rule: diseased
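This per-tree voting can be made visible in R; a minimal sketch, assuming the randomForest package and using iris as stand-in data (the slides' health data set is not available):

```r
library(randomForest)

# Grow a deliberately tiny forest so the individual votes are easy to read
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 3)

# predict.all = TRUE returns every tree's vote next to the aggregate
new_sample <- iris[1, 1:4]
pred <- predict(rf, new_sample, predict.all = TRUE)
pred$individual  # one prediction per tree
pred$aggregate   # majority rule over the trees
```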

The Random Forest Algorithm

Differences to a standard tree:

- Train each tree on a bootstrap resample of the data.
  (Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set.)
- For each split, consider only m randomly selected variables.
- Don't prune.
- Fit B trees in this way and use averaging or majority voting to aggregate the results.
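The two tuning parameters map directly onto arguments of the randomForest() function; a minimal sketch, again with iris as stand-in data:

```r
library(randomForest)

set.seed(1)
rf <- randomForest(
  Species ~ ., data = iris,
  ntree = 500,  # B: number of bootstrap trees
  mtry  = 2     # m: variables considered at each split
)
print(rf)       # includes the OOB estimate of the error rate
```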

Why Random Forest works 1/2

- Mean squared error = variance + bias²
- If trees are sufficiently deep, they have very small bias.
- How could we improve the variance over that of a single tree?

Why Random Forest works 2/2

Variance of the average of B identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$:

$$\mathrm{Var} = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

- The first term decreases if $\rho$ decreases, i.e., if m decreases: de-correlation gives better accuracy.
- The second term decreases if the number of trees B increases (irrespective of $\rho$).
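A quick numeric sketch of this decomposition in plain R, with $\sigma^2 = 1$ for simplicity (avg_var is a hypothetical helper, not from the slides):

```r
# Variance of an average of B correlated tree predictions:
# Var = rho * sigma^2 + (1 - rho) / B * sigma^2, here with sigma^2 = 1
avg_var <- function(rho, B) rho + (1 - rho) / B

avg_var(0.5, c(1, 10, 100, 1000))  # second term vanishes; floor stays at rho
avg_var(c(0.9, 0.5, 0.1), 500)     # smaller rho (smaller m) lowers the floor
```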

Estimating generalization error: out-of-bag (OOB) error

- Similar to leave-one-out cross-validation, but almost without any additional computational burden.
- The OOB error is a random number, since it is based on random resamples of the data.
[Example: a small data set of (age, height, health status) samples is bootstrap-resampled, shown in three columns: "Data", "Resampled Data", "Out-of-bag samples". A tree grown on the resample (splitting on age, then on height) is then evaluated on the out-of-bag samples only.]

Out-of-bag (OOB) error rate: 0.25
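In R, the fitted forest reports this OOB estimate directly; a minimal sketch, again with iris as stand-in data:

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# err.rate tracks the running OOB error; the last row is the final
# OOB estimate of the classification error
rf$err.rate[rf$ntree, "OOB"]
```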

Variable Importance for variable i using Permutations

[Flowchart: Resampled Dataset 1, ..., Resampled Dataset m are used to grow Tree 1, ..., Tree m. Each Tree j is evaluated on its OOB Data j, giving OOB error $e_j$; then the values of variable i are permuted in the OOB data set and the tree is re-evaluated, giving OOB error $p_j$. Set $d_j = e_j - p_j$.]

$$\bar{d} = \frac{1}{m}\sum_{j=1}^{m} d_j, \qquad s_d^2 = \frac{1}{m-1}\sum_{j=1}^{m} \left(d_j - \bar{d}\right)^2$$

Variable importance of variable i:

$$v_i = \frac{\bar{d}}{s_d}$$
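The randomForest package computes a closely related quantity (the mean decrease in accuracy under permutation, scaled by its standard error); a minimal sketch:

```r
library(randomForest)

set.seed(1)
# importance = TRUE switches on the permutation-based measure
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(rf, type = 1)  # type 1: permutation (mean decrease in accuracy)
varImpPlot(rf)            # dot chart of the importance measures
```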

Trees vs. Random Forest

Trees:
+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions tend to have a high variance

Random Forest:
+ Smaller prediction variance and therefore usually better overall performance
+ Easy to tune parameters
- Rather slow
- Black box: rather hard to get insight into the decision rules

Comparing runtime (just for illustration)

- Handles up to thousands of variables.
- Problematic if there are categorical predictors with many levels (max: 32 levels).

[Figure: runtime of RF vs. a single tree; for the RF benchmark, the first predictor was cut into 15 levels.]
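A hedged timing sketch along these lines, with rpart as a stand-in single-tree fitter and simulated data (neither is from the slides):

```r
library(randomForest)
library(rpart)  # single classification tree, for comparison

set.seed(1)
# Simulated data: 1000 samples, 50 numeric predictors, binary response
x <- as.data.frame(matrix(rnorm(1000 * 50), ncol = 50))
y <- factor(rnorm(1000) + x[[1]] > 0)

system.time(randomForest(x, y, ntree = 500))        # forest of 500 trees
system.time(rpart(y ~ ., data = cbind(x, y = y)))   # one tree
```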

RF vs. LDA

Random Forest:
+ Can model nonlinear class boundaries
+ OOB error for free (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- Black box
- Slow

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only works on a categorical response
- Needs CV for estimating the prediction error

[Figure: scatter plots contrasting a nonlinear class boundary (RF) with a linear one (LDA).]
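One point of contrast is easy to show in R: RF's OOB error is a by-product of fitting, while LDA needs cross-validation. A minimal sketch, assuming MASS::lda with CV = TRUE (leave-one-out CV) and iris as stand-in data:

```r
library(randomForest)
library(MASS)

set.seed(1)
# RF: the OOB error comes for free with the fit
rf <- randomForest(Species ~ ., data = iris)
rf$err.rate[rf$ntree, "OOB"]

# LDA: estimate the prediction error by leave-one-out CV
ld <- lda(Species ~ ., data = iris, CV = TRUE)
mean(ld$class != iris$Species)
```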

Concepts to know

- Idea of Random Forest and how it reduces the prediction variance of trees
- OOB error
- Variable importance based on permutation

R functions to know

- Functions randomForest and varImpPlot from package randomForest
