
Machine Learning in the Real World: CART

Outline
- CART Overview and Gymtutor Tutorial Example
- Splitting Criteria
- Handling Missing Values
- Pruning
- Finding the Optimal Tree

CART: Classification And Regression Trees
- Developed 1974-1984 by four statistics professors: Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), and Richard Olshen (Stanford)
- Focused on accurate assessment when data is noisy
- Currently distributed by Salford Systems

CART Tutorial Data: Gymtutor
(CART HELP, Sec. 3 in CARTManual.pdf)

Variable   Description
ANYRAQT    Racquetball usage (binary indicator coded 0, 1)
ONAER      Number of on-peak aerobics classes attended
NSUPPS     Number of supplements purchased
OFFAER     Number of off-peak aerobics classes attended
NFAMMEM    Number of family members
TANNING    Number of visits to tanning salon
ANYPOOL    Pool usage (binary indicator coded 0, 1)
SMALLBUS   Small business discount (binary indicator coded 0, 1)
FIT        Fitness score
HOME       Home ownership (binary indicator coded 0, 1)
PERSTRN    Personal trainer (binary indicator coded 0, 1)
CLASSES    Number of classes taken
SEGMENT    Member's market segment (1, 2, 3) -- the target
View data
- CART Menu: View -> Data Info

CART Example: Gymtutor

CART Model Setup
- Target: required (SEGMENT here)
- Predictors: default is all remaining fields
- Categorical predictors: ANYRAQT, ANYPOOL, SMALLBUS, HOME
  - a field is treated as categorical if its name ends in $, or it can be declared categorical from its values
- Testing: default is 10-fold cross-validation
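For readers following along outside the Salford GUI, here is a rough scikit-learn analogue of this setup. It is only an analogue: sklearn's CART implementation has no surrogate splits and expects numeric inputs (the four indicator fields are already 0/1). The file name gymtutor.csv is an assumption; column names follow the data dictionary above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("gymtutor.csv")      # hypothetical file name

target = "SEGMENT"                    # member's market segment (1, 2, 3)
predictors = [c for c in df.columns if c != target]   # default: all fields

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
scores = cross_val_score(tree, df[predictors], df[target], cv=10)  # 10-fold CV
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```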

Sample Tree
[Tree diagram; nodes are color-coded by class]

Decision Tree: Splitters


Tree Details


Tree Summary Reports


Pruning the tree


Keeping only important variables


Revised Tree


Automating CART: Command Log


Key CART features
- Automated field selection
  - handles any number of fields
  - automatically selects relevant fields
- No data preprocessing needed
  - does not require any kind of variable transforms
  - impervious to outliers
- Missing value tolerant
  - only moderate loss of accuracy due to missing values

CART: Key Parts of Tree-Structured Data Analysis
- Tree growing
  - splitting rules to generate the tree
  - stopping criteria: how far to grow?
  - missing values: using surrogates
- Tree pruning
  - trimming off parts of the tree that don't work
  - ordering the nodes of a large tree by contribution to tree accuracy: which nodes come off first?
- Optimal tree selection
  - deciding on the best tree after growing and pruning
  - balancing simplicity against accuracy

CART is a Form of Binary Recursive Partitioning
- The data is split into two partitions
  - Q: Does C4.5 always have binary partitions?
- Partitions can themselves be split into sub-partitions
  - hence the procedure is recursive
- A CART tree is generated by repeated partitioning of the data set (see the sketch below)
  - the parent gets two children
  - each child produces two grandchildren
  - four grandchildren produce eight great-grandchildren
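A minimal Python sketch of this recursive procedure. The stopping parameters are illustrative, and best_split is the exhaustive split search sketched under "Searching All Possible Splits" below:

```python
from collections import Counter

def grow(rows, labels, depth=0, max_depth=3, min_cases=5):
    """Binary recursive partitioning: split in two, recurse on each half."""
    # Stop when deep enough, too few cases, or the node is pure
    if depth == max_depth or len(rows) < min_cases or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    feat, thresh = best_split(rows, labels)           # sketched further below
    if feat is None:                                  # no useful split found
        return Counter(labels).most_common(1)[0][0]
    mask = [x[feat] <= thresh for x in rows]          # YES goes left
    left  = grow([x for x, m in zip(rows, mask) if m],
                 [y for y, m in zip(labels, mask) if m],
                 depth + 1, max_depth, min_cases)
    right = grow([x for x, m in zip(rows, mask) if not m],
                 [y for y, m in zip(labels, mask) if not m],
                 depth + 1, max_depth, min_cases)
    return (feat, thresh, left, right)                # internal node
```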

Splits Are Always Determined by Questions with YES/NO Answers
- Is continuous variable X ≤ c?
- Does categorical variable D take on levels i, j, or k?
  - e.g., is GENDER M or F?
- Standard split:
  - if the answer to the question is YES, a case goes left; otherwise it goes right
  - this is the form of all primary splits
  - example: Is AGE ≤ 62.5?
- More complex conditions are possible (expressed as code below):
  - Boolean combinations: AGE <= 62 OR BP <= 91
  - Linear combinations: 0.66*AGE - 0.75*BP < -40
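The three split forms can be read as YES/NO predicates on a case; a small Python illustration, using the field names and thresholds from the bullets above:

```python
# Each split is a YES/NO question: YES sends the case left.
standard = lambda case: case["AGE"] <= 62.5                           # single variable
boolean  = lambda case: case["AGE"] <= 62 or case["BP"] <= 91         # Boolean combination
linear   = lambda case: 0.66 * case["AGE"] - 0.75 * case["BP"] < -40  # linear combination

case = {"AGE": 60, "BP": 95}
print(standard(case), boolean(case), linear(case))  # True True False
```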

Searching All Possible Splits
- For any node, CART examines ALL possible splits
  - CART also allows the search to run over a random sample, if desired
- Look at the first variable in our data set, AGE, with minimum value 40
  - test the split "Is AGE ≤ 40?"
  - this separates out the youngest persons to the left
  - could be many cases, if many people share the same AGE
- Next, increase the AGE threshold to the next youngest person
  - "Is AGE ≤ 43?"
  - this directs additional cases to the left
- Continue increasing the splitting threshold, value by value (sketched below)
  - each value is tested for how good the split is, i.e., how effective it is in separating the classes from each other
- Q: Do we need to consider splits between values of the same class?
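A sketch of this exhaustive search, scoring each candidate threshold by the weighted Gini impurity of the two children (the gini function is defined under "CART Splitting Criteria: Gini Index" below):

```python
def best_split(rows, labels):
    """For every variable, test a threshold at every boundary between
    consecutive distinct values; keep the split with lowest weighted Gini."""
    best, best_score = (None, None), float("inf")
    for feat in range(len(rows[0])):
        values = sorted(set(x[feat] for x in rows))
        for lo, hi in zip(values, values[1:]):        # one candidate per boundary
            thresh = (lo + hi) / 2.0
            left  = [y for x, y in zip(rows, labels) if x[feat] <= thresh]
            right = [y for x, y in zip(rows, labels) if x[feat] > thresh]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_score, best = score, (feat, thresh)
    return best
```

As the question above hints, for impurity measures like Gini the optimum falls on a boundary between cases of different classes, so thresholds inside a run of same-class values never win.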

Split Tables
Q: Where do splits need to be evaluated?

Sorted by Age                        Sorted by Blood Pressure
AGE  BP   SINUST  SURVIVE           AGE  BP   SINUST  SURVIVE
40   91   0       SURVIVE           43   78   1       DEAD
40   110  0       SURVIVE           40   83   1       DEAD
40   83   1       DEAD              40   91   0       SURVIVE
43   99   0       SURVIVE           43   99   0       SURVIVE
43   78   1       DEAD              40   110  0       SURVIVE
43   135  0       SURVIVE           49   110  1       SURVIVE
45   120  0       SURVIVE           48   119  1       DEAD
48   119  1       DEAD              45   120  0       SURVIVE
48   122  0       SURVIVE           48   122  0       SURVIVE
49   150  0       DEAD              43   135  0       SURVIVE
49   110  1       SURVIVE           49   150  0       DEAD

CART Splitting Criteria: Gini Index
- If a data set T contains examples from n classes, the Gini index gini(T) is defined as

      gini(T) = 1 - sum_{j=1..n} p_j^2

  where p_j is the relative frequency of class j in T. gini(T) is minimized when the classes in T are maximally skewed (a pure node gives gini(T) = 0).
- Advanced: CART also has other splitting criteria
  - Twoing is recommended for multi-class problems
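A direct transcription of the formula, plus a quick check of the two extremes:

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, where p_j is class j's share of T."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["A"] * 10))           # 0.0   (pure node: the minimum)
print(gini(["A", "B", "C"] * 5))  # 0.667 (even 3-class mix: the maximum for n=3)
```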

Handling of Missing Splitter Values in Tree Growing
- If the splitter variable is missing, we don't know which way to send the case (left or right in a binary tree)
- Could delete cases that have missing values
  - method used in classical statistical modeling
  - unacceptable in a data mining context with many missings
- Could freeze the case in the node where the missing splitter is encountered
  - make do with what the tree has learned so far for this case
- Could allow cases with a missing split variable to follow the majority
  - assumes all missings are somehow typical
- Could allow missing to be a separate value of the variable
  - used by the CHAID algorithm; an option in Salford software
  - allows special handling for missing, but all missings are treated as indistinguishable from each other

Missing as a Distinct Splitter Value
- CHAID treats missing as a distinct categorical value
  - e.g., AGE is 25-44, 45-64, 65-95, or missing
  - method also adopted by C4.5
- If missing is a distinct value, then all cases with missing go the same way in the tree
  - assumption: whatever the unknown value is, it is the same for all cases with a missing value
- Problem: there can be more than one reason for a database field to be missing
  - e.g., INCOME as a splitter wants to separate high from low
  - which levels are most likely to be missing? High income AND low income!
  - we don't want to send both groups to the same side of the tree

CART Treatment of Missing Primary Splitters: Surrogates
- CART uses a more refined method: when the primary field is missing, a surrogate is used as a stand-in
  - the surrogate should be a valid replacement for the primary
- Consider our example of INCOME
  - other variables like EDUCATION or OCCUPATION might work as good surrogates
  - higher-education people usually have higher incomes
  - people in high-income occupations will usually (though not always) have higher incomes
- Using surrogates means that cases missing on the primary are not all treated the same way
  - whether a case goes left or right depends on its surrogate value
  - thus the routing is record-specific . . . some cases go left, others go right

Surrogates: Mimicking Alternatives to Primary Splitters
- A primary splitter is the best splitter of a node
- A surrogate is a splitter that splits in a fashion similar to the primary: a variable with near-equivalent information
- Why useful?
  - if the primary is expensive or difficult to gather and the surrogate is not, consider using the surrogate instead
  - the loss in predictive accuracy might be slight
- If the primary splitter is MISSING, then CART will use a surrogate (see the sketch below)
  - if the top surrogate is also missing, CART uses the 2nd-best surrogate, etc.
  - if all surrogates are missing too, CART uses the majority rule
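A simplified sketch of how a surrogate can be ranked. Real CART scores surrogates with a predictive-association measure relative to the default (majority) rule; here we just measure raw agreement with the primary splitter's left/right assignments, with None standing in for a missing value:

```python
def surrogate_agreement(primary_goes_left, candidate_goes_left):
    """Fraction of cases, observed on both variables, where the candidate
    sends a case the same way (left/right) as the primary splitter."""
    pairs = [(p, c) for p, c in zip(primary_goes_left, candidate_goes_left)
             if p is not None and c is not None]
    return sum(p == c for p, c in pairs) / len(pairs) if pairs else 0.0

# e.g. INCOME is the primary splitter; EDUCATION mimics it on 4 of 5 cases
income_goes_left    = [True, True, False, False, True]
education_goes_left = [True, True, False, True,  True]
print(surrogate_agreement(income_goes_left, education_goes_left))  # 0.8
```

When INCOME is missing on a case, the tree consults EDUCATION (then the 2nd-best surrogate, and so on), which is what makes the routing record-specific.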

Competitors vs. Surrogates

Node total: Class A = 100, Class B = 100, Class C = 100

                     Left (A / B / C)     Right (A / B / C)
Primary Split         90 / 80 / 15         10 / 20 / 85
Competitor Split      80 / 25 / 14         20 / 75 / 86
Surrogate Split       78 / 74 / 21         22 / 26 / 79

The surrogate mimics the primary's assignments (classes A and B mostly left, C right), while the competitor, though a strong splitter in its own right, sends class B the other way.

CART Pruning Method: Grow a Full Tree, Then Prune
- You will never know when to stop . . . so don't!
- Instead, grow trees that are obviously too big
  - the largest tree grown is called the maximal tree
  - the maximal tree could have hundreds or thousands of nodes
  - usually we instruct CART to grow only moderately too big
  - rule of thumb: grow trees about twice the size of the truly best tree
- This becomes the first stage in finding the best tree
- Next we have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)

Maximal Tree Example


Tree Pruning
- Take a very large tree (the maximal tree)
- The tree may be radically over-fit
  - it tracks all the idiosyncrasies of THIS data set
  - it tracks patterns that may not be found in other data sets
  - at the bottom of the tree, splits are based on very few cases
  - analogous to a regression with a very large number of variables
- PRUNE away branches from this large tree
  - but which branch to cut first?
- CART determines a pruning sequence:
  - the exact order in which each node should be removed
  - the pruning sequence is determined for EVERY node
  - the sequence is determined all the way back to the root node

Pruning: Which nodes come off next?


Order of Pruning: Weakest Link Goes First
- Prune away the "weakest link": the nodes that add least to the overall accuracy of the tree
  - a node's contribution to the overall tree is a function of both its increase in accuracy and the size of the node
  - the accuracy gain is weighted by the node's share of the sample
  - small nodes tend to get removed before large ones
  - (the CART monograph formalizes this as cost-complexity pruning: the weakest link is the internal node t with the smallest α(t) = (R(t) − R(T_t)) / (|T̃_t| − 1), the error increase per terminal node saved by collapsing the branch T_t below t)
- If several nodes have the same contribution, they all prune away simultaneously
  - hence more than two terminal nodes could be cut off in one pruning step
- The sequence is determined all the way back to the root node
  - we need to allow for the possibility that the entire tree is bad
  - if the target variable is unpredictable, we will want to prune back to the root . . . the "no model" solution

Pruning Sequence Example
[Four snapshots of the same tree along the pruning sequence: 24, 21, 20, and 18 terminal nodes]

Now We Test Every Tree in the Pruning Sequence
- Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy
  - how many cases are right, and how many wrong
  - measure accuracy overall and by class
- Do the same for the 2nd-largest tree, the 3rd-largest tree, etc. (see the sketch below)
- The performance of every tree in the sequence is measured
- Results are reported in table and graph formats
- Note that this critical stage is impossible to complete without test data
  - the CART procedure requires test data to guide tree evaluation
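A sketch of this stage with scikit-learn, whose minimal cost-complexity pruning path plays the role of CART's pruning sequence; the breast-cancer data set is only a stand-in for a real application:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, random_state=0)

# Grow an overly large tree, then recover its pruning sequence
big = DecisionTreeClassifier(random_state=0).fit(X_learn, y_learn)
path = big.cost_complexity_pruning_path(X_learn, y_learn)

for alpha in path.ccp_alphas:        # one alpha per tree in the sequence
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_learn, y_learn)
    print(t.get_n_leaves(),                         # tree size
          round(1 - t.score(X_learn, y_learn), 3),  # learn error R(T)
          round(1 - t.score(X_test, y_test), 3))    # test error Rts(T)
```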

Training Data vs. Test Data Error Rates
- Compare error rates measured on the learn data, R(T), and on a large test set, Rts(T):

  No. Terminal Nodes    R(T)    Rts(T)
          71            .00      .42
          63            .00      .40
          58            .03      .39
          40            .10      .32
          34            .12      .32
          19            .20      .31
        **10            .29      .30
           9            .32      .34
           7            .41      .47
           6            .46      .54
           5            .53      .61
           2            .75      .82
           1            .86      .91

  (** marks the tree with the lowest test error)

- Learn R(T) always decreases as the tree grows (Q: Why?)
- Test Rts(T) first declines, then increases (Q: Why?)
- Overfitting is the result of too much reliance on the learn R(T)
  - it can lead to disasters when applied to new data

Why Look at Training Data Error Rates (or Costs) at All?
- First, they provide a rough guide of how you are doing
  - the truth will typically be WORSE than the training data measure
  - if the tree performs poorly even on training data, you may not want to pursue it further
- The training data error rate is more accurate for smaller trees
  - so it is a reasonable guide for smaller trees
  - but a poor guide for larger trees


At the Optimal Tree, Training and Test Error Rates Should Be Similar
- if not, something is wrong
- it is useful to compare not just the overall error rate, but also within-node performance between training and test data

CART: Optimal Tree
- Within a single CART run, which tree is best?
- The process of pruning the maximal tree can yield many sub-trees
- A test data set (or cross-validation) measures the error rate of each tree
- Current wisdom: select the tree with the smallest error rate
- Only drawback: the minimum may not be precisely estimated
  - the typical error rate as a function of tree size has a flat region
  - the minimum could be anywhere in this region
The Best Pruned Subtree: An Estimation Problem
[Plot: estimated error rate R̂(T_k) against tree size |T̃_k| (number of terminal nodes, 0-50); note the flat region around the minimum]

One SE Rule (One Standard Error Rule)
- The original monograph recommends NOT choosing the minimum-error tree, because of possible instability of results from run to run
- Instead, it suggests the SMALLEST TREE within 1 SE of the minimum-error tree (sketched below)
  - tends to provide very stable results from run to run
  - possibly as accurate as the minimum-cost tree, yet simpler
- Current learning: the one SE rule is good for small data sets
  - for large data sets, one should pick the most accurate tree, known as the zero SE rule
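A sketch of the rule, given a pruning sequence with cross-validated error estimates and their standard errors (all numbers below are illustrative):

```python
import numpy as np

def one_se_choice(n_leaves, cv_error, cv_se):
    """1 SE rule: pick the smallest tree whose CV error is within one
    standard error of the minimum-error tree."""
    cv_error, cv_se = np.asarray(cv_error), np.asarray(cv_se)
    best = int(np.argmin(cv_error))                  # the zero SE (minimum) choice
    cutoff = cv_error[best] + cv_se[best]
    eligible = [i for i, e in enumerate(cv_error) if e <= cutoff]
    return min(eligible, key=lambda i: n_leaves[i])  # simplest eligible tree

leaves = [1, 2, 5, 10, 19, 34]
err    = [0.86, 0.75, 0.32, 0.30, 0.31, 0.33]
se     = [0.03, 0.03, 0.03, 0.03, 0.03, 0.03]
i = one_se_choice(leaves, err, se)
print(leaves[i], err[i])   # 5 0.32 -- the zero SE rule would pick the 10-leaf tree
```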

In What Sense Is the Optimal Tree "Best"?
- The optimal tree has the lowest (or near-lowest) cost, as determined by a test procedure
- The tree should exhibit very similar accuracy when applied to new data
- BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  - trees somewhat larger or smaller than the optimal one may be preferred
- There is room for user judgment
  - judgment not about split variables or values
  - judgment as to how much of the tree to keep
  - determined by the story the tree is telling
  - and by willingness to sacrifice a small amount of accuracy for simplicity

CART Summary
- CART key features:
  - binary splits
  - Gini index as the splitting criterion
  - grow, then prune
  - surrogates for missing values
  - optimal tree via the 1 SE rule
  - lots of nice graphics

Decision Tree Summary
- Decision trees
  - splits: binary or multi-way
  - split criteria: entropy, Gini, . . .
  - missing value treatment
  - pruning
  - rule extraction from trees
- Both C4.5 and CART are robust tools
- No method is always superior: experiment!

(witten & eibe)
