
Machine Learning in the Real World: CART

Outline
- CART Overview and Gymtutor Tutorial Example
- Splitting Criteria
- Handling Missing Values
- Pruning
- Finding the Optimal Tree

CART: Classification And Regression Trees
- Developed 1974-1984 by four statistics professors: Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), and Richard Olshen (Stanford)
- Focused on accurate assessment when data is noisy
- Currently distributed by Salford Systems

CART Tutorial Data: Gymtutor
(CART HELP, Sec. 3 in CARTManual.pdf)

Variable   Description
ANYRAQT    Racquetball usage (binary indicator coded 0, 1)
ONAER      Number of on-peak aerobics classes attended
NSUPPS     Number of supplements purchased
OFFAER     Number of off-peak aerobics classes attended
NFAMMEM    Number of family members
TANNING    Number of visits to tanning salon
ANYPOOL    Pool usage (binary indicator coded 0, 1)
SMALLBUS   Small business discount (binary indicator coded 0, 1)
FIT        Fitness score
HOME       Home ownership (binary indicator coded 0, 1)
PERSTRN    Personal trainer (binary indicator coded 0, 1)
CLASSES    Number of classes taken
SEGMENT    Member's market segment (1, 2, 3) -- the target
View data
- CART Menu: View -> Data Info

CART Example: Gymtutor

CART Model Setup
- Target: required (SEGMENT here)
- Predictors: default is all remaining fields
- Categorical predictors: ANYRAQT, ANYPOOL, SMALLBUS, HOME
  - a field is treated as categorical if its name ends in $, or it can be declared categorical from its values
- Testing: default is 10-fold cross-validation
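For readers following along outside the Salford GUI, here is a rough scikit-learn analogue of this setup. It is only an analogue: sklearn's CART implementation has no surrogate splits and expects numeric inputs (the four indicator fields are already 0/1). The file name gymtutor.csv is an assumption; column names follow the data dictionary above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("gymtutor.csv")      # hypothetical file name

target = "SEGMENT"                    # member's market segment (1, 2, 3)
predictors = [c for c in df.columns if c != target]   # default: all fields

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
scores = cross_val_score(tree, df[predictors], df[target], cv=10)  # 10-fold CV
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```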

Sample Tree
[Tree diagram; nodes are color-coded by class]

Decision Tree: Splitters


Tree Details


Tree Summary Reports


Pruning the tree


Keeping only important variables


Revised Tree


Automating CART: Command Log


Key CART features
- Automated field selection
  - handles any number of fields
  - automatically selects relevant fields
- No data preprocessing needed
  - does not require any kind of variable transforms
  - impervious to outliers
- Missing value tolerant
  - only moderate loss of accuracy due to missing values

CART: Key Parts of Tree-Structured Data Analysis
- Tree growing
  - splitting rules to generate the tree
  - stopping criteria: how far to grow?
  - missing values: using surrogates
- Tree pruning
  - trimming off parts of the tree that don't work
  - ordering the nodes of a large tree by contribution to tree accuracy: which nodes come off first?
- Optimal tree selection
  - deciding on the best tree after growing and pruning
  - balancing simplicity against accuracy

CART is a Form of Binary Recursive Partitioning
- The data is split into two partitions
  - Q: Does C4.5 always have binary partitions?
- Partitions can themselves be split into sub-partitions
  - hence the procedure is recursive
- A CART tree is generated by repeated partitioning of the data set (see the sketch below)
  - the parent gets two children
  - each child produces two grandchildren
  - four grandchildren produce eight great-grandchildren
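A minimal Python sketch of this recursive procedure. The stopping parameters are illustrative, and best_split is the exhaustive split search sketched under "Searching All Possible Splits" below:

```python
from collections import Counter

def grow(rows, labels, depth=0, max_depth=3, min_cases=5):
    """Binary recursive partitioning: split in two, recurse on each half."""
    # Stop when deep enough, too few cases, or the node is pure
    if depth == max_depth or len(rows) < min_cases or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    feat, thresh = best_split(rows, labels)           # sketched further below
    if feat is None:                                  # no useful split found
        return Counter(labels).most_common(1)[0][0]
    mask = [x[feat] <= thresh for x in rows]          # YES goes left
    left  = grow([x for x, m in zip(rows, mask) if m],
                 [y for y, m in zip(labels, mask) if m],
                 depth + 1, max_depth, min_cases)
    right = grow([x for x, m in zip(rows, mask) if not m],
                 [y for y, m in zip(labels, mask) if not m],
                 depth + 1, max_depth, min_cases)
    return (feat, thresh, left, right)                # internal node
```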

Splits Are Always Determined by Questions with YES/NO Answers
- Is continuous variable X ≤ c?
- Does categorical variable D take on levels i, j, or k?
  - e.g., is GENDER M or F?
- Standard split:
  - if the answer to the question is YES, a case goes left; otherwise it goes right
  - this is the form of all primary splits
  - example: Is AGE ≤ 62.5?
- More complex conditions are possible (expressed as code below):
  - Boolean combinations: AGE <= 62 OR BP <= 91
  - Linear combinations: 0.66*AGE - 0.75*BP < -40
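The three split forms can be read as YES/NO predicates on a case; a small Python illustration, using the field names and thresholds from the bullets above:

```python
# Each split is a YES/NO question: YES sends the case left.
standard = lambda case: case["AGE"] <= 62.5                           # single variable
boolean  = lambda case: case["AGE"] <= 62 or case["BP"] <= 91         # Boolean combination
linear   = lambda case: 0.66 * case["AGE"] - 0.75 * case["BP"] < -40  # linear combination

case = {"AGE": 60, "BP": 95}
print(standard(case), boolean(case), linear(case))  # True True False
```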

Searching All Possible Splits
- For any node, CART examines ALL possible splits
  - CART also allows the search to run over a random sample, if desired
- Look at the first variable in our data set, AGE, with minimum value 40
  - test the split "Is AGE ≤ 40?"
  - this separates out the youngest persons to the left
  - could be many cases, if many people share the same AGE
- Next, increase the AGE threshold to the next youngest person
  - "Is AGE ≤ 43?"
  - this directs additional cases to the left
- Continue increasing the splitting threshold, value by value (sketched below)
  - each value is tested for how good the split is, i.e., how effective it is in separating the classes from each other
- Q: Do we need to consider splits between values of the same class?
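A sketch of this exhaustive search, scoring each candidate threshold by the weighted Gini impurity of the two children (the gini function is defined under "CART Splitting Criteria: Gini Index" below):

```python
def best_split(rows, labels):
    """For every variable, test a threshold at every boundary between
    consecutive distinct values; keep the split with lowest weighted Gini."""
    best, best_score = (None, None), float("inf")
    for feat in range(len(rows[0])):
        values = sorted(set(x[feat] for x in rows))
        for lo, hi in zip(values, values[1:]):        # one candidate per boundary
            thresh = (lo + hi) / 2.0
            left  = [y for x, y in zip(rows, labels) if x[feat] <= thresh]
            right = [y for x, y in zip(rows, labels) if x[feat] > thresh]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_score, best = score, (feat, thresh)
    return best
```

As the question above hints, for impurity measures like Gini the optimum falls on a boundary between cases of different classes, so thresholds inside a run of same-class values never win.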

Split Tables
Q: Where do splits need to be evaluated?

Sorted by Age                        Sorted by Blood Pressure
AGE  BP   SINUST  SURVIVE           AGE  BP   SINUST  SURVIVE
40   91   0       SURVIVE           43   78   1       DEAD
40   110  0       SURVIVE           40   83   1       DEAD
40   83   1       DEAD              40   91   0       SURVIVE
43   99   0       SURVIVE           43   99   0       SURVIVE
43   78   1       DEAD              40   110  0       SURVIVE
43   135  0       SURVIVE           49   110  1       SURVIVE
45   120  0       SURVIVE           48   119  1       DEAD
48   119  1       DEAD              45   120  0       SURVIVE
48   122  0       SURVIVE           48   122  0       SURVIVE
49   150  0       DEAD              43   135  0       SURVIVE
49   110  1       SURVIVE           49   150  0       DEAD

CART Splitting Criteria: Gini Index
- If a data set T contains examples from n classes, the Gini index gini(T) is defined as

      gini(T) = 1 - sum_{j=1..n} p_j^2

  where p_j is the relative frequency of class j in T. gini(T) is minimized when the classes in T are maximally skewed (a pure node gives gini(T) = 0).
- Advanced: CART also has other splitting criteria
  - Twoing is recommended for multi-class problems
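A direct transcription of the formula, plus a quick check of the two extremes:

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, where p_j is class j's share of T."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["A"] * 10))           # 0.0   (pure node: the minimum)
print(gini(["A", "B", "C"] * 5))  # 0.667 (even 3-class mix: the maximum for n=3)
```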

Handling of Missing Splitter Values in Tree Growing
- If the splitter variable is missing, we don't know which way to send the case (left or right in a binary tree)
- Could delete cases that have missing values
  - method used in classical statistical modeling
  - unacceptable in a data mining context with many missings
- Could freeze the case in the node where the missing splitter is encountered
  - make do with what the tree has learned so far for this case
- Could allow cases with a missing split variable to follow the majority
  - assumes all missings are somehow typical
- Could allow missing to be a separate value of the variable
  - used by the CHAID algorithm; an option in Salford software
  - allows special handling for missing, but all missings are treated as indistinguishable from each other

Missing as a Distinct Splitter Value
- CHAID treats missing as a distinct categorical value
  - e.g., AGE is 25-44, 45-64, 65-95, or missing
  - method also adopted by C4.5
- If missing is a distinct value, then all cases with missing go the same way in the tree
  - assumption: whatever the unknown value is, it is the same for all cases with a missing value
- Problem: there can be more than one reason for a database field to be missing
  - e.g., INCOME as a splitter wants to separate high from low
  - which levels are most likely to be missing? High income AND low income!
  - we don't want to send both groups to the same side of the tree

CART Treatment of Missing Primary Splitters: Surrogates
- CART uses a more refined method: when the primary field is missing, a surrogate is used as a stand-in
  - the surrogate should be a valid replacement for the primary
- Consider our example of INCOME
  - other variables like EDUCATION or OCCUPATION might work as good surrogates
  - higher-education people usually have higher incomes
  - people in high-income occupations will usually (though not always) have higher incomes
- Using surrogates means that cases missing on the primary are not all treated the same way
  - whether a case goes left or right depends on its surrogate value
  - thus the routing is record-specific . . . some cases go left, others go right

Surrogates: Mimicking Alternatives to Primary Splitters
- A primary splitter is the best splitter of a node
- A surrogate is a splitter that splits in a fashion similar to the primary: a variable with near-equivalent information
- Why useful?
  - if the primary is expensive or difficult to gather and the surrogate is not, consider using the surrogate instead
  - the loss in predictive accuracy might be slight
- If the primary splitter is MISSING, then CART will use a surrogate (see the sketch below)
  - if the top surrogate is also missing, CART uses the 2nd-best surrogate, etc.
  - if all surrogates are missing too, CART uses the majority rule
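A simplified sketch of how a surrogate can be ranked. Real CART scores surrogates with a predictive-association measure relative to the default (majority) rule; here we just measure raw agreement with the primary splitter's left/right assignments, with None standing in for a missing value:

```python
def surrogate_agreement(primary_goes_left, candidate_goes_left):
    """Fraction of cases, observed on both variables, where the candidate
    sends a case the same way (left/right) as the primary splitter."""
    pairs = [(p, c) for p, c in zip(primary_goes_left, candidate_goes_left)
             if p is not None and c is not None]
    return sum(p == c for p, c in pairs) / len(pairs) if pairs else 0.0

# e.g. INCOME is the primary splitter; EDUCATION mimics it on 4 of 5 cases
income_goes_left    = [True, True, False, False, True]
education_goes_left = [True, True, False, True,  True]
print(surrogate_agreement(income_goes_left, education_goes_left))  # 0.8
```

When INCOME is missing on a case, the tree consults EDUCATION (then the 2nd-best surrogate, and so on), which is what makes the routing record-specific.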

Competitors vs. Surrogates

Node total: Class A = 100, Class B = 100, Class C = 100

                     Left (A / B / C)     Right (A / B / C)
Primary Split         90 / 80 / 15         10 / 20 / 85
Competitor Split      80 / 25 / 14         20 / 75 / 86
Surrogate Split       78 / 74 / 21         22 / 26 / 79

The surrogate mimics the primary's assignments (classes A and B mostly left, C right), while the competitor, though a strong splitter in its own right, sends class B the other way.

CART Pruning Method: Grow a Full Tree, Then Prune
- You will never know when to stop . . . so don't!
- Instead, grow trees that are obviously too big
  - the largest tree grown is called the maximal tree
  - the maximal tree could have hundreds or thousands of nodes
  - usually we instruct CART to grow only moderately too big
  - rule of thumb: grow trees about twice the size of the truly best tree
- This becomes the first stage in finding the best tree
- Next we have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)

Maximal Tree Example


Tree Pruning
- Take a very large tree (the maximal tree)
- The tree may be radically over-fit
  - it tracks all the idiosyncrasies of THIS data set
  - it tracks patterns that may not be found in other data sets
  - at the bottom of the tree, splits are based on very few cases
  - analogous to a regression with a very large number of variables
- PRUNE away branches from this large tree
  - but which branch to cut first?
- CART determines a pruning sequence:
  - the exact order in which each node should be removed
  - the pruning sequence is determined for EVERY node
  - the sequence is determined all the way back to the root node

Pruning: Which nodes come off next?


Order of Pruning: Weakest Link Goes First
- Prune away the "weakest link": the nodes that add least to the overall accuracy of the tree
  - a node's contribution to the overall tree is a function of both its increase in accuracy and the size of the node
  - the accuracy gain is weighted by the node's share of the sample
  - small nodes tend to get removed before large ones
  - (the CART monograph formalizes this as cost-complexity pruning: the weakest link is the internal node t with the smallest α(t) = (R(t) − R(T_t)) / (|T̃_t| − 1), the error increase per terminal node saved by collapsing the branch T_t below t)
- If several nodes have the same contribution, they all prune away simultaneously
  - hence more than two terminal nodes could be cut off in one pruning step
- The sequence is determined all the way back to the root node
  - we need to allow for the possibility that the entire tree is bad
  - if the target variable is unpredictable, we will want to prune back to the root . . . the "no model" solution

Pruning Sequence Example
[Four snapshots of the same tree along the pruning sequence: 24, 21, 20, and 18 terminal nodes]

Now We Test Every Tree in the Pruning Sequence
- Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy
  - how many cases are right, and how many wrong
  - measure accuracy overall and by class
- Do the same for the 2nd-largest tree, the 3rd-largest tree, etc. (see the sketch below)
- The performance of every tree in the sequence is measured
- Results are reported in table and graph formats
- Note that this critical stage is impossible to complete without test data
  - the CART procedure requires test data to guide tree evaluation
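A sketch of this stage with scikit-learn, whose minimal cost-complexity pruning path plays the role of CART's pruning sequence; the breast-cancer data set is only a stand-in for a real application:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, random_state=0)

# Grow an overly large tree, then recover its pruning sequence
big = DecisionTreeClassifier(random_state=0).fit(X_learn, y_learn)
path = big.cost_complexity_pruning_path(X_learn, y_learn)

for alpha in path.ccp_alphas:        # one alpha per tree in the sequence
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_learn, y_learn)
    print(t.get_n_leaves(),                         # tree size
          round(1 - t.score(X_learn, y_learn), 3),  # learn error R(T)
          round(1 - t.score(X_test, y_test), 3))    # test error Rts(T)
```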

Training Data vs. Test Data Error Rates
- Compare error rates measured on the learn data, R(T), and on a large test set, Rts(T):

  No. Terminal Nodes    R(T)    Rts(T)
          71            .00      .42
          63            .00      .40
          58            .03      .39
          40            .10      .32
          34            .12      .32
          19            .20      .31
        **10            .29      .30
           9            .32      .34
           7            .41      .47
           6            .46      .54
           5            .53      .61
           2            .75      .82
           1            .86      .91

  (** marks the tree with the lowest test error)

- Learn R(T) always decreases as the tree grows (Q: Why?)
- Test Rts(T) first declines, then increases (Q: Why?)
- Overfitting is the result of too much reliance on the learn R(T)
  - it can lead to disasters when applied to new data

Why Look at Training Data Error Rates (or Costs) at All?
- First, they provide a rough guide of how you are doing
  - the truth will typically be WORSE than the training data measure
  - if the tree performs poorly even on training data, you may not want to pursue it further
- The training data error rate is more accurate for smaller trees
  - so it is a reasonable guide for smaller trees
  - but a poor guide for larger trees


At the Optimal Tree, Training and Test Error Rates Should Be Similar
- if not, something is wrong
- it is useful to compare not just the overall error rate, but also within-node performance between training and test data

CART: Optimal Tree
- Within a single CART run, which tree is best?
- The process of pruning the maximal tree can yield many sub-trees
- A test data set (or cross-validation) measures the error rate of each tree
- Current wisdom: select the tree with the smallest error rate
- Only drawback: the minimum may not be precisely estimated
  - the typical error rate as a function of tree size has a flat region
  - the minimum could be anywhere in this region
The Best Pruned Subtree: An Estimation Problem
[Plot: estimated error rate R̂(T_k) against tree size |T̃_k| (number of terminal nodes, 0-50); note the flat region around the minimum]

One SE Rule (One Standard Error Rule)
- The original monograph recommends NOT choosing the minimum-error tree, because of possible instability of results from run to run
- Instead, it suggests the SMALLEST TREE within 1 SE of the minimum-error tree (sketched below)
  - tends to provide very stable results from run to run
  - possibly as accurate as the minimum-cost tree, yet simpler
- Current learning: the one SE rule is good for small data sets
  - for large data sets, one should pick the most accurate tree, known as the zero SE rule
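A sketch of the rule, given a pruning sequence with cross-validated error estimates and their standard errors (all numbers below are illustrative):

```python
import numpy as np

def one_se_choice(n_leaves, cv_error, cv_se):
    """1 SE rule: pick the smallest tree whose CV error is within one
    standard error of the minimum-error tree."""
    cv_error, cv_se = np.asarray(cv_error), np.asarray(cv_se)
    best = int(np.argmin(cv_error))                  # the zero SE (minimum) choice
    cutoff = cv_error[best] + cv_se[best]
    eligible = [i for i, e in enumerate(cv_error) if e <= cutoff]
    return min(eligible, key=lambda i: n_leaves[i])  # simplest eligible tree

leaves = [1, 2, 5, 10, 19, 34]
err    = [0.86, 0.75, 0.32, 0.30, 0.31, 0.33]
se     = [0.03, 0.03, 0.03, 0.03, 0.03, 0.03]
i = one_se_choice(leaves, err, se)
print(leaves[i], err[i])   # 5 0.32 -- the zero SE rule would pick the 10-leaf tree
```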

In What Sense Is the Optimal Tree "Best"?
- The optimal tree has the lowest (or near-lowest) cost, as determined by a test procedure
- The tree should exhibit very similar accuracy when applied to new data
- BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  - trees somewhat larger or smaller than the optimal one may be preferred
- There is room for user judgment
  - judgment not about split variables or values
  - judgment as to how much of the tree to keep
  - determined by the story the tree is telling
  - and by willingness to sacrifice a small amount of accuracy for simplicity

CART Summary
- CART key features:
  - binary splits
  - Gini index as the splitting criterion
  - grow, then prune
  - surrogates for missing values
  - optimal tree via the 1 SE rule
  - lots of nice graphics

Decision Tree Summary
- Decision trees
  - splits: binary or multi-way
  - split criteria: entropy, Gini, . . .
  - missing value treatment
  - pruning
  - rule extraction from trees
- Both C4.5 and CART are robust tools
- No method is always superior: experiment!

(witten & eibe)
