DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES - César Pérez López
The available weak learner types are:
'Discriminant' (recommended for Subspace ensemble)
'KNN' (only for Subspace ensemble)
'Tree' (for any ensemble except Subspace)
There are two ways to set the weak learner type in the ensemble.
To create an ensemble with default weak learner options, pass the character vector naming the weak learner. For example:
ens = fitensemble(X,Y,'AdaBoostM2',50,'Tree');
% or
ens = fitensemble(X,Y,'Subspace',50,'KNN');
To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate template method. For example, if you have missing data, and want to use trees with surrogate splits for better accuracy:
templ = templateTree('Surrogate','all');
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
To grow trees with leaves containing a number of observations that is at least 10% of the sample size:
templ = templateTree('MinLeafSize',size(X,1)/10);
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
Alternatively, choose the maximal number of splits per tree:
templ = templateTree('MaxNumSplits',4);
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
While you can give fitensemble a cell array of learner templates, the most common usage is to give just one weak learner template.
Decision trees can handle NaN values in X. Such values are called missing. If some values in a row of X are missing, a decision tree finds optimal splits using nonmissing values only. If an entire row consists of NaN, fitensemble ignores that row. If your data has a large fraction of missing values in X, use surrogate decision splits.
Common Settings for Tree Weak Learners
The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth with these parameters:
MaxNumSplits — The maximal number of branch node splits per tree. Set large values of MaxNumSplits to get deep trees. The default for bagging is size(X,1) - 1; the default for boosting is 1.
MinLeafSize — Each leaf has at least MinLeafSize observations. Set small values of MinLeafSize to get deep trees. The default is 1 for classification and 5 for regression.
MinParentSize — Each branch node in the tree has at least MinParentSize observations. Set small values of MinParentSize to get deep trees. The default is 2 for classification and 10 for regression.
If you supply both MinParentSize and MinLeafSize, the learner uses the setting that gives larger leaves (shallower trees):
MinParent = max(MinParent,2*MinLeaf)
If you additionally supply MaxNumSplits, then the software splits a tree until one of the three splitting criteria is satisfied.
Surrogate — Grow decision trees with surrogate splits when Surrogate is 'on'. Use surrogate splits when your data has missing values.
PredictorSelection — fitensemble and TreeBagger grow trees using the standard CART[1] algorithm by default. If the predictor variables are heterogeneous, or if there are predictors having many levels and others having few levels, then standard CART tends to select predictors having many levels as split predictors. For split-predictor selection that is robust to the number of levels that the predictors have, consider specifying 'curvature' or 'interaction-curvature'. These specifications conduct chi-square tests of association between each predictor and the response, or between each pair of predictors and the response, respectively. The predictor that yields the minimal p-value is the split predictor for a particular node.
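A brief sketch combining these settings (using the ionosphere data shipped with Statistics and Machine Learning Toolbox; the ensemble method and the number of learning cycles are arbitrary choices for illustration):

```matlab
load ionosphere   % X: 351x34 numeric predictors, Y: class labels 'b'/'g'

% Deep trees: allow many splits and small leaves.
deepT = templateTree('MaxNumSplits',20,'MinLeafSize',1);

% Robust split-predictor selection, plus surrogate splits for missing data.
robustT = templateTree('PredictorSelection','curvature','Surrogate','on');

ensDeep   = fitensemble(X,Y,'AdaBoostM1',100,deepT);
ensRobust = fitensemble(X,Y,'Bag',100,robustT,'Type','classification');
```

Note that bagged ensembles created with fitensemble require the 'Type' name-value pair, since 'Bag' applies to both classification and regression.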
The syntax of fitensemble is:
ens = fitensemble(X,Y,model,numberens,learners)
X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.
Y is the responses, with the same number of observations as rows in X.
model is a character vector, such as 'bag', naming the type of ensemble.
numberens is the number of weak learners in ens from each element of learners. The number of elements in ens is numberens times the number of elements in learners.
learners is a character vector, such as 'tree', naming a weak learner, a weak learner template, or a cell array of such character vectors and templates.
The result of fitensemble is an ensemble object, suitable for making predictions on new data.
Where to Set Name-Value Pairs. There are several name-value pairs you can pass to fitensemble, and several that apply to the weak learners (templateDiscriminant, templateKNN, and templateTree). To determine whether a name-value pair argument applies to the ensemble or to the weak learner:
Use template name-value pairs to control the characteristics of the weak learners.
Use fitensemble name-value pair arguments to control the ensemble as a whole, either for algorithms or for structure.
For example, for an ensemble of boosted classification trees with each tree deeper than the default, set the templateTree name-value pair arguments MinLeafSize and MinParentSize to smaller values than the defaults, or set MaxNumSplits to a larger value than the default. The trees are then leafier (deeper).
To name the predictors in the ensemble (part of the structure of the ensemble), use the PredictorNames name-value pair in fitensemble.
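A sketch of this division of labor (the ionosphere data is used for concreteness; the predictor names p1..p34 are made up for illustration):

```matlab
load ionosphere                                  % X: 351x34, Y: 'b'/'g'

% Weak-learner characteristics: template name-value pairs (deeper trees).
t = templateTree('MinLeafSize',1,'MinParentSize',2,'MaxNumSplits',50);

% Ensemble-wide structure: fitensemble name-value pairs (predictor names).
names = cellstr("p" + (1:size(X,2)));            % illustrative names p1..p34
ens = fitensemble(X,Y,'AdaBoostM1',50,t,'PredictorNames',names);
```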
This example shows how to create a classification tree ensemble for the ionosphere data set, and use it to predict the classification of a radar return with average measurements.
Load the ionosphere data set.
load ionosphere
Train a classification ensemble. By default, fitcensemble aggregates 100 classification trees using LogitBoost for binary classification problems.
Mdl = fitcensemble(X,Y)
Mdl =
classreg.learning.classif.ClassificationEnsemble
ResponseName: 'Y'
CategoricalPredictors: []
ClassNames: {'b' 'g'}
ScoreTransform: 'none'
NumObservations: 351
NumTrained: 100
Method: 'LogitBoost'
LearnerNames: {'Tree'}
ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'
FitInfo: [100×1 double]
FitInfoDescription: {2×1 cell}
Mdl is a ClassificationEnsemble model.
Plot a graph of the first trained classification tree in the ensemble.
view(Mdl.Trained{1}.CompactRegressionLearner,'Mode','graph');
[Figure: http://www.mathworks.com/help/examples/stats/win64/TrainAClassificationEnsembleExample_01.png]
By default, fitcensemble grows shallow trees for boosting algorithms. You can alter the tree depth by passing a tree template object to fitcensemble. For more details, see templateTree.
Predict the quality of a radar return with average predictor measurements.
label = predict(Mdl,mean(X))
label =
  1×1 cell array
    {'g'}
This example shows how to create a regression ensemble to predict mileage of cars based on their horsepower and weight, trained on the carsmall data.
Load the carsmall data set.
load carsmall
Prepare the predictor data.
X = [Horsepower Weight];
The response data is MPG. The only available boosted regression ensemble type is LSBoost. For this example, arbitrarily choose an ensemble of 100 trees, and use the default tree options.
Train an ensemble of regression trees.
Mdl = fitensemble(X,MPG,'LSBoost',100,'Tree')
Mdl =
classreg.learning.regr.RegressionEnsemble
ResponseName: 'Y'
CategoricalPredictors: []
ResponseTransform: 'none'
NumObservations: 94
NumTrained: 100
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'
FitInfo: [100×1 double]
FitInfoDescription: {2×1 cell}
Regularization: []
Plot a graph of the first trained regression tree in the ensemble.
view(Mdl.Trained{1},'Mode','graph');
[Figure: http://www.mathworks.com/help/examples/stats/win64/TrainARegressionEnsemble1Example_01.png]
By default, fitensemble grows stumps for boosted trees.
Predict the mileage of a car with 150 horsepower weighing 2750 lbs.
mileage = predict(Mdl,[150 2750])
mileage =
22.4236
This example shows how to choose the appropriate split predictor selection technique for your data set when growing a random forest of regression trees. This example also shows how to decide which predictors are most important to include in the training data.
Load and Preprocess Data
Load the carbig data set. Consider a model that predicts the fuel economy of a car given its number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Consider Cylinders, Model_Year, and Origin as categorical variables.
load carbig
Cylinders = categorical(Cylinders);
Model_Year = categorical(Model_Year);
Origin = categorical(cellstr(Origin));
X = table(Cylinders,Displacement,Horsepower,Weight,Acceleration,Model_Year,...
Origin,MPG);
Determine Levels in Predictors
The standard CART algorithm tends to split predictors with many unique values (levels), e.g., continuous variables, over those with fewer levels, e.g., categorical variables. If your data is heterogeneous, or your predictor variables vary greatly in their number of levels, then consider using the curvature or interaction tests for split-predictor selection instead of standard CART.
For each predictor, determine the number of levels in the data. One way to do this is to define an anonymous function that:
Converts all variables to the categorical data type using categorical
Determines all unique categories while ignoring missing values using categories
Counts the categories using numel
Then, apply the function to each variable using varfun.
countLevels = @(x)numel(categories(categorical(x)));
numLevels = varfun(countLevels,X(:,1:end-1),'OutputFormat','uniform');
Compare the number of levels among the predictor variables.
figure;
bar(numLevels);
title('Number of Levels Among Predictors');
xlabel('Predictor variable');
ylabel('Number of levels');
h = gca;
h.XTickLabel = X.Properties.VariableNames(1:end-1);
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
[Figure: http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_01.png]
The continuous variables have many more levels than the categorical variables. Because the number of levels among the predictors varies so much, using standard CART to select split predictors at each node of the trees in a random forest can yield inaccurate predictor importance estimates.
Grow Robust Random Forest
Grow a random forest of 200 regression trees. Specify sampling all variables at each node. Specify usage of the interaction test to select split predictors. Because there are missing values in the data, specify usage of surrogate splits to increase accuracy.
t = templateTree('NumPredictorsToSample','all',...
'PredictorSelection','interaction-curvature','Surrogate','on');
rng(1); % For reproducibility
Mdl = fitrensemble(X,'MPG','Method','bag','NumLearningCycles',200,...
'Learners',t);
Mdl is a RegressionBaggedEnsemble model.
Estimate the model R² using out-of-bag predictions.
yHat = oobPredict(Mdl);
R2 = corr(Mdl.Y,yHat)^2
R2 =
0.8739
Mdl explains 87.39% of the variability around the mean.
Predictor Importance Estimation
Estimate predictor importance values by permuting out-of-bag observations among the trees.
impOOB = oobPermutedPredictorImportance(Mdl);
impOOB is a 1-by-7 vector of predictor importance estimates corresponding to the predictors in Mdl.PredictorNames. The estimates are not biased toward predictors containing many levels.
Compare the predictor importance estimates.
figure;
bar(impOOB);
title('Unbiased Predictor Importance Estimates');
xlabel('Predictor variable');
ylabel('Importance');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
[Figure: http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_02.png]
Greater importance estimates indicate more important predictors. The bar graph suggests that Model_Year is the most important predictor, followed by Weight. Model_Year has only 13 distinct levels, whereas Weight has over 300.
Compare the predictor importance estimates obtained by permuting out-of-bag observations with those obtained by summing gains in the mean squared error due to splits on each predictor. Also, obtain predictor association measures estimated by surrogate splits.
[impGain,predAssociation] = predictorImportance(Mdl);
figure;
plot(1:numel(Mdl.PredictorNames),[impOOB' impGain']);
title('Predictor Importance Estimation Comparison')
xlabel('Predictor variable');
ylabel('Importance');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
legend('OOB permuted','MSE improvement')
grid on
[Figure: http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_03.png]
impGain is commensurate with impOOB. According to the values of impGain, Model_Year and Weight do not appear to be the most important predictors.
predAssociation is a 7-by-7 matrix of predictor association measures. Rows and columns correspond to the predictors in Mdl.PredictorNames. You can infer the strength of the relationship between pairs of predictors using the elements of predAssociation. Larger values indicate more highly correlated pairs of predictors.
figure;
imagesc(predAssociation);
title('Predictor Association Estimates');
colorbar;
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
h.YTickLabel = Mdl.PredictorNames;
predAssociation(1,2)
ans =
0.6830
[Figure: http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_04.png]
The largest association is between Cylinders and Displacement, but the value is not high enough to indicate a strong relationship between the two predictors.
Grow Random Forest Using Reduced Predictor Set
Because prediction time increases with the number of predictors in random forests, it is good practice to create a model using as few predictors as possible.
Grow a random forest of 200 regression trees using the best two predictors only.
MdlReduced = fitrensemble(X(:,{'Model_Year' 'Weight' 'MPG'}),'MPG','Method','bag',...
'NumLearningCycles',200,'Learners',t);
Compute the R² of the reduced model.
yHatReduced = oobPredict(MdlReduced);
r2Reduced = corr(Mdl.Y,yHatReduced)^2
r2Reduced =
0.8525
The R² for the reduced model is close to the R² of the full model. This result suggests that the reduced model is sufficient for prediction.
Usually you cannot evaluate the predictive quality of an ensemble based on its performance on training data. Ensembles tend to overtrain, meaning they produce overly optimistic estimates of their predictive power. This means the result of resubLoss (for either classification or regression) usually indicates lower error than you get on new data.
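To obtain a less optimistic estimate, cross-validate the ensemble instead of relying on resubstitution. A sketch on the ionosphere data (the boosting method and tree count are arbitrary choices for illustration):

```matlab
load ionosphere
ens = fitensemble(X,Y,'AdaBoostM1',100,'Tree');

resubErr = resubLoss(ens);         % error on the training data (optimistic)

cvens = crossval(ens,'KFold',10);  % 10-fold cross-validated ensemble
cvErr = kfoldLoss(cvens);          % typically larger, a more realistic
                                   % estimate of error on new data
```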