Outline
- CART overview
- Gymtutor tutorial example
- Splitting criteria
- Handling missing values
- Pruning
- Finding the optimal tree
Focused on accurate assessment when data is noisy. Currently distributed by Salford Systems.
View data
CART Menu: View -> Data Info
Testing
Default: 10-fold cross-validation
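As a minimal sketch of what 10-fold cross-validation does with the sample (the function name and dict-free representation are our own, not CART's internals): the cases are shuffled into 10 folds, and each fold serves once as the test set while the other 9 form the learning sample.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Split case indices 0..n-1 into k roughly equal folds.

    Each fold is used once as the test set; the remaining k-1 folds
    form the learning sample for that round.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(100, k=10)
# Every case lands in exactly one test fold.
assert sorted(i for f in folds for i in f) == list(range(100))
```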
Sample Tree
Tree Details
Revised Tree
Pruning
- Trimming off the parts of the tree that don't work
- Ordering the nodes of a large tree by contribution to tree accuracy: which nodes come off first?
Optimal tree selection
- Deciding on the best tree after growing and pruning
- Balancing simplicity against accuracy
The parent gets two children, each child produces two grandchildren, and the four grandchildren produce eight great-grandchildren.
Standard split:
If the answer to the question is YES, a case goes left; otherwise it goes right. This is the form of all primary splits.
Example: Is AGE <= 62.5?
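The standard split rule above can be sketched directly (the dict-based case representation and function name are our own assumptions; the variable and threshold come from the slide's example):

```python
def goes_left(case, variable="AGE", threshold=62.5):
    """Standard CART split: if the answer to 'is value <= threshold?'
    is YES, the case goes left; otherwise it goes right."""
    return case[variable] <= threshold

assert goes_left({"AGE": 60})       # 60 <= 62.5, so the case goes left
assert not goes_left({"AGE": 70})   # 70 >  62.5, so the case goes right
```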
Split Tables
Q: Where do splits need to be evaluated?

Sorted by Age
AGE   BP   SINUST  SURVIVE
 40    91    0     SURVIVE
 40   110    0     SURVIVE
 40    83    1     DEAD
 43    99    0     SURVIVE
 43    78    1     DEAD
 43   135    0     SURVIVE
 45   120    0     SURVIVE
 48   119    1     DEAD
 48   122    0     SURVIVE
 49   150    0     DEAD
 49   110    1     SURVIVE

Sorted by Blood Pressure
AGE   BP   SINUST  SURVIVE
 43    78    1     DEAD
 40    83    1     DEAD
 40    91    0     SURVIVE
 43    99    0     SURVIVE
 40   110    0     SURVIVE
 49   110    1     SURVIVE
 48   119    1     DEAD
 45   120    0     SURVIVE
 48   122    0     SURVIVE
 43   135    0     SURVIVE
 49   150    0     DEAD
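The point of the sorted tables is that splits only need to be evaluated between distinct adjacent values of each variable. A small sketch of how those candidate thresholds can be enumerated (the function name is ours; the ages are from the table above):

```python
def candidate_splits(values):
    """Candidate thresholds for one variable: the midpoints between
    consecutive distinct values once the column is sorted."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

ages = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]
print(candidate_splits(ages))  # [41.5, 44.0, 46.5, 48.5]
```

Only 4 thresholds need to be tried for AGE, not one per case: ties collapse the search space.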
gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in T. gini(T) is minimized (reaching 0) when the class distribution in T is skewed toward a single class.
Advanced: CART also has other splitting criteria; twoing is recommended for multi-class problems.
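The Gini formula above is a one-liner in code. A minimal sketch (the function name and the example labels are ours):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency
    of class j among the labels; 0 means the node is pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

assert gini(["SURVIVE"] * 5) == 0.0                 # pure node: minimum
assert abs(gini(["SURVIVE", "DEAD"]) - 0.5) < 1e-9  # 50/50 mix: maximum for 2 classes
```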
- Drop cases with missing values: the method used in classical statistical modeling; unacceptable in a data mining context with many missings
- Freeze the case in the node where the missing splitter is encountered
- Treat missing as a separate category: used by the CHAID algorithm, and an option in Salford software; allows special handling for missings, but all missings are treated as indistinguishable from each other
Consider our example of INCOME: other variables like EDUCATION or OCCUPATION might work as good surrogates. Higher-education people usually have higher incomes, and people in high-income occupations will usually (though not always) have higher incomes.
Using surrogates means that cases missing the primary splitter are not all treated the same way: whether a case goes left or right depends on its surrogate value.
If the primary splitter is MISSING, CART uses a surrogate; if the top surrogate is also missing, CART uses the 2nd-best surrogate, and so on.
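The fallback chain can be sketched as follows (the function name, dict-based cases, and the default direction when every splitter is missing are our assumptions; CART's actual default rule may differ):

```python
def send_left(case, primary, surrogates):
    """Route a case through a node: try the primary splitter first,
    then the ranked surrogates; each splitter is (variable, threshold).

    A case whose splitter value is missing (None) falls through to the
    next surrogate, so missings are NOT all treated the same way.
    """
    for var, thr in [primary] + surrogates:
        value = case.get(var)
        if value is not None:
            return value <= thr
    return True  # all splitters missing: a default direction (assumption)

# INCOME is missing, so the EDUCATION surrogate decides: 16 > 12 -> right.
case = {"INCOME": None, "EDUCATION": 16}
print(send_left(case, ("INCOME", 40000), [("EDUCATION", 12)]))
```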
            Primary Split   Competitor Split   Surrogate Split
Left  (%)   90  80  15      80  25  14         78  74  21
Right (%)   10  20  85      20  75  86         22  26  79
This becomes the first stage in finding the best tree. Next we will have to get rid of the parts of the overgrown tree that don't work (are not supported by test data).
Tree Pruning
Take a very large tree (the maximal tree); it may be radically over-fit:
- Tracks all the idiosyncrasies of THIS data set
- Tracks patterns that may not be found in other data sets
- At the bottom of the tree, splits are based on very few cases
- Analogous to a regression with a very large number of variables
A node's contribution to the overall tree is a function of both the increase in accuracy and the node's size: the accuracy gain is weighted by the node's share of the sample, so small nodes tend to get removed before large ones.
If several nodes have the same contribution, they all prune away simultaneously; hence more than two terminal nodes can be cut off in one pruning step.
The sequence is determined all the way back to the root node: we need to allow for the possibility that the entire tree is bad. If the target variable is unpredictable, we will want to prune back to the root . . . the "no model" solution.
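One common way to formalize "contribution to tree accuracy" is the cost-complexity score from CART's pruning theory: a node's link strength is the error increase from collapsing its subtree, spread over the leaves removed. This is a sketch of that formula (function and argument names are ours), consistent with small, low-gain nodes pruning first:

```python
def weakest_link_alpha(node_errors, subtree_errors, n_leaves, n_total):
    """Cost-complexity link strength of an internal node:
        alpha = (R(node) - R(subtree)) / (n_leaves - 1)
    where R is the error count divided by the total sample size, so the
    accuracy gain is automatically weighted by the node's share of data.
    Nodes with the smallest alpha are pruned first; equal alphas prune
    away simultaneously.
    """
    r_node = node_errors / n_total
    r_subtree = subtree_errors / n_total
    return (r_node - r_subtree) / (n_leaves - 1)

# A subtree that fixes only 1 of 10 errors over 3 leaves has a tiny alpha
# and is an early candidate for pruning.
print(weakest_link_alpha(node_errors=10, subtree_errors=9, n_leaves=3, n_total=200))
```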
[Pruning sequence figures: 24 -> 21 -> 20 -> 18 terminal nodes]
Do the same for the 2nd-largest tree, 3rd-largest tree, etc. The performance of every tree in the sequence is measured, and the results are reported in table and graph formats. Note that this critical stage is impossible to complete without test data: the CART procedure requires test data to guide tree evaluation.
Learn-sample R(T) always decreases as the tree grows (Q: Why?). Test-sample R(T) first declines, then increases (Q: Why?). Overfitting is the result of too much reliance on the learn-sample R(T); it can lead to disasters when the tree is applied to new data.
Learn and test performance should broadly agree; if not, something is wrong. It is useful to compare not just the overall error rate but also within-node performance between training and test data.
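The learn/test divergence can be demonstrated without any tree library: a "maximal tree" that gives every training case its own leaf is just a lookup table. On a pure-noise target (an artificial setup of ours, not from the slides), the learning-sample error drops to zero while the test-sample error stays at chance level:

```python
import random

rng = random.Random(1)
# Pure-noise target: the class label is a coin flip, so no model can
# genuinely beat 50% error on fresh data.
train = [(i, rng.randint(0, 1)) for i in range(200)]
test = [(i, rng.randint(0, 1)) for i in range(200)]

# The "maximal tree": one leaf per training case, i.e. a lookup table
# that tracks every idiosyncrasy of THIS data set.
lookup = dict(train)
train_err = sum(lookup[x] != y for x, y in train) / len(train)
test_err = sum(lookup.get(x, 0) != y for x, y in test) / len(test)

print(train_err)  # 0.0: learn-sample R(T) hits the floor
print(test_err)   # roughly 0.5: no better than chance on new data
```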
CART Summary
CART Key Features
- Binary splits
- Gini index as the splitting criterion
- Grow, then prune
- Surrogates for missing values
- Optimal tree selection via the 1-SE rule
- Lots of nice graphics
Both C4.5 and CART are robust tools. No method is always superior: experiment!
Witten & Eibe