Overview
Intuition of Random Forest
The Random Forest Algorithm
De-correlation gives better accuracy
[Figure: Intuition of Random Forest. Three trees (Tree 1, Tree 2, Tree 3) split on different variables (young/old, male/female, tall/short, retired/working) and each classifies a new sample as healthy or diseased; the forest's prediction for the new sample is obtained by majority rule.]
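A tiny R illustration of the majority rule (the tree predictions below are hypothetical, not read off the figure):

votes <- c("healthy", "healthy", "diseased")   # predictions of three trees for one new sample
names(which.max(table(votes)))                 # forest prediction by majority rule: "healthy"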
De-correlation gives better accuracy

If the m trees have individual variance \sigma^2 and pairwise correlation \rho, the variance of the averaged prediction is

Var\Big( \frac{1}{m} \sum_{i=1}^{m} T_i \Big) = \rho \sigma^2 + \frac{1-\rho}{m} \sigma^2

(the covariance terms with i \neq j contribute the \rho \sigma^2 part). The variance decreases if \rho decreases, i.e., if the trees are de-correlated; the second term also decreases as m grows.
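As a quick numerical illustration (a sketch, not taken from the slides), the formula can be evaluated in R for a few values of \rho and m:

avg_var <- function(m, rho, sigma2 = 1) rho * sigma2 + (1 - rho) * sigma2 / m
avg_var(m = 100, rho = 0.9)    # strongly correlated trees: 0.901, averaging helps little
avg_var(m = 100, rho = 0.1)    # de-correlated trees: 0.109, much smaller variance
avg_var(m = 1000, rho = 0.1)   # more trees shrink only the second term: 0.1009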
[Figure: a resampled dataset, drawn with replacement from the original training data (rows such as old/diseased, tall/healthy, short/diseased); each such resampled dataset is used to grow one tree.]
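A minimal sketch of this resampling step in R (the training data frame train is hypothetical):

idx  <- sample(nrow(train), replace = TRUE)  # draw row indices with replacement
boot <- train[idx, ]                         # resampled dataset used to grow one tree
oob  <- train[-unique(idx), ]                # out-of-bag (OOB) rows, not seen by that tree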
Variable importance based on permutation

Each resampled dataset (Resampled Dataset 1, ..., Resampled Dataset m) is used to grow one tree (Tree 1, ..., Tree m); the observations not drawn for tree i form its out-of-bag data (OOB Data 1, ..., OOB Data m). For a variable j:

Permute the values of variable j in the OOB data set of tree i and compute the resulting OOB error e_i; compare it with the OOB error p_i of tree i on the intact OOB data and set d_i = e_i - p_i.

With

\bar{d} = \frac{1}{m} \sum_{i=1}^{m} d_i,   s_d^2 = \frac{1}{m-1} \sum_{i=1}^{m} (d_i - \bar{d})^2,

the importance of variable j is v_j = \bar{d} / s_d.
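A sketch of the permutation step for a single variable (the names fit, heldout, response and variable are hypothetical, and a held-out set stands in for the per-tree OOB data):

perm_diff <- function(fit, heldout, response, variable) {
  err <- function(d) mean(predict(fit, newdata = d) != d[[response]])
  permuted <- heldout
  permuted[[variable]] <- sample(permuted[[variable]])  # destroy the association with the response
  err(permuted) - err(heldout)                          # d_i: error after permutation minus error before
}

Repeating this for each of the m trees on its own OOB data gives d_1, ..., d_m, from which \bar{d}, s_d and v_j are computed as above.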
Trees vs. Random Forest
+ RF has smaller prediction variance and therefore usually a better generalization performance
+ Easy to tune parameters
- Rather slow
- Black box: rather hard to get insights into the decision rules
[Figure: comparing runtime of a single tree vs. a random forest (just for illustration).]

Up to thousands of variables
Problematic if there are categorical predictors with many levels (max: 32 levels)
RF vs. LDA

RF:
- Black box
- Slow

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for a categorical response
- Needs CV for estimating the prediction error

[Figure: scatter plots illustrating the class boundaries of the two methods.]
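For concreteness, a small comparison sketch on the built-in iris data (not from the slides): randomForest reports an OOB error directly, while lda from package MASS is evaluated here by leave-one-out cross-validation.

library(randomForest)
library(MASS)

rf <- randomForest(Species ~ ., data = iris)
rf$err.rate[rf$ntree, "OOB"]                    # OOB estimate of the misclassification error

ld <- lda(Species ~ ., data = iris, CV = TRUE)  # CV = TRUE returns leave-one-out predictions
mean(ld$class != iris$Species)                  # cross-validated misclassification error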
Concepts to know
Idea of Random Forest and how it reduces the prediction variance of trees
OOB error
Variable Importance based on Permutation
R functions to know
Functions randomForest and varImpPlot from package randomForest
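A minimal usage example (a sketch using the built-in iris data):

library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)  # importance = TRUE enables permutation importance
fit                        # printing shows the OOB estimate of the error rate
varImpPlot(fit)            # plots the variable importance measures
importance(fit, type = 1)  # permutation-based mean decrease in accuracy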