DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES. Examples with MATLAB

About this ebook

Data Mining and Machine Learning use two types of techniques: predictive (supervised) techniques, which train a model on known input and output data so that it can predict future outputs, and descriptive (unsupervised) techniques, which find hidden patterns or intrinsic structures in input data. The aim of predictive techniques is to build a model that makes predictions based on evidence in the presence of uncertainty. A predictive algorithm takes a known set of input data and known responses to the data (output) and trains a model to generate reasonable predictions for the response to new data. Predictive techniques use regression techniques to develop predictive models. This book develops ensemble methods, boosting, bagging, random forest, decision trees and regression trees. Exercises are solved with MATLAB software.
Language: English
Publisher: Lulu.com
Release date: Nov 11, 2021
ISBN: 9781794829053

    Book preview

    DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES - César Pérez López

    Currently the weak learner types are:

    'Discriminant' (recommended for Subspace ensemble)

    'KNN' (only for Subspace ensemble)

    'Tree' (for any ensemble except Subspace)

    There are two ways to set the weak learner type in the ensemble.

    To create an ensemble with default weak learner options, pass in the character vector as the weak learner. For example:

    ens = fitensemble(X,Y,'AdaBoostM2',50,'Tree');

    % or

    ens = fitensemble(X,Y,'Subspace',50,'KNN');

    To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate template method. For example, if you have missing data, and want to use trees with surrogate splits for better accuracy:

    templ = templateTree('Surrogate','all');

    ens = fitensemble(X,Y,'AdaBoostM2',50,templ);

    To grow trees with leaves containing a number of observations that is at least 10% of the sample size:

    templ = templateTree('MinLeafSize',size(X,1)/10);

    ens = fitensemble(X,Y,'AdaBoostM2',50,templ);

    Alternatively, choose the maximal number of splits per tree:

    templ = templateTree('MaxNumSplits',4);

    ens = fitensemble(X,Y,'AdaBoostM2',50,templ);

    While you can give fitensemble a cell array of learner templates, the most common usage is to give just one weak learner template.
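    For instance, here is a minimal sketch that mixes two learner templates in a single Subspace ensemble (X and Y are hypothetical predictors and class labels); each of the 50 learning cycles trains both templates, so the ensemble holds 100 learners:

    % Sketch: a cell array of weak learner templates.
    % X and Y are hypothetical training data.
    t1 = templateDiscriminant('DiscrimType','linear');
    t2 = templateKNN('NumNeighbors',5);
    ens = fitensemble(X,Y,'Subspace',50,{t1,t2});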

    Decision trees can handle NaN values in X. Such values are called missing. If you have some missing values in a row of X, a decision tree finds optimal splits using nonmissing values only. If an entire row consists of NaN, fitensemble ignores that row. If you have data with a large fraction of missing values in X, use surrogate decision splits.
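    As a minimal sketch using the ionosphere data set, with roughly 10% of the values blanked out purely for illustration:

    % Sketch: boosted trees with surrogate splits on data containing NaNs.
    load ionosphere                          % predictors X, labels Y ('b'/'g')
    Xmiss = X;
    Xmiss(rand(size(Xmiss)) < 0.1) = NaN;    % inject artificial missing values
    templ = templateTree('Surrogate','on');
    ens = fitensemble(Xmiss,Y,'AdaBoostM1',100,templ);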

    Common Settings for Tree Weak Learners  

    The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth using these parameters:

    MaxNumSplits — The maximal number of branch node splits is MaxNumSplits per tree. Set large values of MaxNumSplits to get deep trees. The default for bagging is size(X,1) - 1. The default for boosting is 1.

    MinLeafSize — Each leaf has at least MinLeafSize observations. Set small values of MinLeafSize to get deep trees. The default is 1 for classification and 5 for regression.

    MinParentSize — Each branch node in the tree has at least MinParentSize observations. Set small values of MinParentSize to get deep trees. The default is 2 for classification and 10 for regression.

    If you supply both MinParentSize and MinLeafSize, the learner uses the setting that gives larger leaves (shallower trees):

    MinParent = max(MinParent,2*MinLeaf)

    If you additionally supply MaxNumSplits, then the software splits a tree until one of the three splitting criteria is satisfied.

    Surrogate — Grow decision trees with surrogate splits when Surrogate is 'on'. Use surrogate splits when your data has missing values.

    PredictorSelection — fitensemble and TreeBagger grow trees using the standard CART[1] algorithm by default. If the predictor variables are heterogeneous, or there are predictors having many levels and others having few levels, then standard CART tends to select predictors having many levels as split predictors. For split-predictor selection that is robust to the number of levels that the predictors have, consider specifying 'curvature' or 'interaction-curvature'. These specifications conduct chi-square tests of association between each predictor and the response or each pair of predictors and the response, respectively. The predictor that yields the minimal p-value is the split predictor for a particular node.
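    A rough sketch combining several of these settings follows; the specific values, and the hypothetical X and Y, are arbitrary choices for illustration:

    % Sketch: a deep-tree template with level-robust split selection and
    % surrogate splits. X and Y are hypothetical training data.
    templDeep = templateTree('MaxNumSplits',100, ...
        'MinLeafSize',1,'MinParentSize',2, ...
        'PredictorSelection','curvature','Surrogate','on');
    ens = fitensemble(X,Y,'Bag',100,templDeep,'Type','classification');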

    The syntax of fitensemble is:

    ens = fitensemble(X,Y,model,numberens,learners)

    X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

    Y is the response data, with the same number of observations as there are rows in X.

    model is a character vector, such as 'bag', naming the type of ensemble.

    numberens is the number of weak learners in ens from each element of learners. The number of elements in ens is numberens times the number of elements in learners.

    learners is a character vector, such as 'tree', naming a weak learner, a weak learner template, or a cell array of such character vectors and templates.

    The result of fitensemble is an ensemble object, suitable for making predictions on new data.
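    As a self-contained sketch, the following trains a boosted tree ensemble on the fisheriris data set (an arbitrary choice for illustration) and predicts the class of a hypothetical observation:

    % Sketch: end-to-end fitensemble usage.
    load fisheriris                       % meas (150x4), species (labels)
    ens = fitensemble(meas,species,'AdaBoostM2',100,'Tree');
    newFlower = [5.0 3.4 1.5 0.2];        % hypothetical measurements
    predictedSpecies = predict(ens,newFlower)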

    Where to Set Name-Value Pairs.  There are several name-value pairs you can pass to fitensemble, and several that apply to the weak learners (templateDiscriminant, templateKNN, and templateTree). To determine whether a name-value pair argument belongs to the ensemble or to the weak learner, use these guidelines:

    Use template name-value pairs to control the characteristics of the weak learners.

    Use fitensemble name-value pair arguments to control the ensemble as a whole, either for algorithms or for structure.

    For example, for an ensemble of boosted classification trees with each tree deeper than the default, set the templateTree name-value pair arguments MinLeafSize and MinParentSize to smaller values than the defaults, or set MaxNumSplits to a larger value than the default. The trees are then leafier (deeper).

    To name the predictors in the ensemble (part of the structure of the ensemble), use the PredictorNames name-value pair in fitensemble.
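    Putting both kinds of name-value pairs together, a minimal sketch (X here is a hypothetical matrix with three predictor columns, Y a matching response):

    % Sketch: template pairs shape the weak learners; fitensemble pairs,
    % such as PredictorNames, shape the ensemble as a whole.
    templ = templateTree('MinLeafSize',1,'MinParentSize',2);   % deeper trees
    ens = fitensemble(X,Y,'AdaBoostM1',100,templ, ...
        'PredictorNames',{'x1','x2','x3'});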

    This example shows how to create a classification tree ensemble for the ionosphere data set, and use it to predict the classification of a radar return with average measurements.

    Load the ionosphere data set.

    load ionosphere

    Train a classification ensemble. For binary classification problems, fitcensemble aggregates 100 classification trees using LogitBoost.

    Mdl = fitcensemble(X,Y)

    Mdl =

      classreg.learning.classif.ClassificationEnsemble

                ResponseName: 'Y'

        CategoricalPredictors: []

                  ClassNames: {'b'  'g'}

              ScoreTransform: 'none'

              NumObservations: 351

                  NumTrained: 100

                      Method: 'LogitBoost'

                LearnerNames: {'Tree'}

        ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'

                      FitInfo: [100×1 double]

          FitInfoDescription: {2×1 cell}

    Mdl is a ClassificationEnsemble model.

    Plot a graph of the first trained classification tree in the ensemble.

    view(Mdl.Trained{1}.CompactRegressionLearner,'Mode','graph');

    [Figure: graph of the first trained classification tree (http://www.mathworks.com/help/examples/stats/win64/TrainAClassificationEnsembleExample_01.png)]

    By default, fitcensemble grows shallow trees for boosting algorithms. You can alter the tree depth by passing a tree template object to fitcensemble. For more details, see templateTree.
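    For instance, a sketch that allows deeper trees (the MaxNumSplits value is an arbitrary choice):

    % Sketch: pass a tree template to fitcensemble to deepen the trees.
    t = templateTree('MaxNumSplits',10);
    MdlDeep = fitcensemble(X,Y,'Learners',t);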

    Predict the quality of a radar return with average predictor measurements.

    label = predict(Mdl,mean(X))

    label =

      cell

        'g'

    This example shows how to create a regression ensemble to predict mileage of cars based on their horsepower and weight, trained on the carsmall data.

    Load the carsmall data set.

    load carsmall

    Prepare the predictor data.

    X = [Horsepower Weight];

    The response data is MPG. The only available boosted regression ensemble type is LSBoost. For this example, arbitrarily choose an ensemble of 100 trees, and use the default tree options.

    Train an ensemble of regression trees.

    Mdl = fitensemble(X,MPG,'LSBoost',100,'Tree')

    Mdl =

      classreg.learning.regr.RegressionEnsemble

                ResponseName: 'Y'

        CategoricalPredictors: []

            ResponseTransform: 'none'

              NumObservations: 94

                  NumTrained: 100

                      Method: 'LSBoost'

                LearnerNames: {'Tree'}

        ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'

                      FitInfo: [100×1 double]

          FitInfoDescription: {2×1 cell}

              Regularization: []

    Plot a graph of the first trained regression tree in the ensemble.

    view(Mdl.Trained{1},'Mode','graph');

    [Figure: graph of the first trained regression tree (http://www.mathworks.com/help/examples/stats/win64/TrainARegressionEnsemble1Example_01.png)]

    By default, fitensemble grows stumps for boosted trees.

    Predict the mileage of a car with 150 horsepower weighing 2750 lbs.

    mileage = predict(Mdl,[150 2750])

    mileage =

      22.4236
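    For a rough visual check of the fit on the training data (a sketch; keep in mind that training-set performance is optimistic):

    % Sketch: compare fitted values with observed responses.
    yFit = predict(Mdl,X);
    figure
    scatter(MPG,yFit,'filled')
    xlabel('Observed MPG')
    ylabel('Predicted MPG')
    refline(1,0)              % 45-degree line of perfect agreement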

    This example shows how to choose the appropriate split predictor selection technique for your data set when growing a random forest of regression trees. This example also shows how to decide which predictors are most important to include in the training data.

    Load and Preprocess Data

    Load the carbig data set. Consider a model that predicts the fuel economy of a car given its number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Consider Cylinders, Model_Year, and Origin as categorical variables.

    load carbig

    Cylinders = categorical(Cylinders);

    Model_Year = categorical(Model_Year);

    Origin = categorical(cellstr(Origin));

    X = table(Cylinders,Displacement,Horsepower,Weight,Acceleration,Model_Year,...

        Origin,MPG);

    Determine Levels in Predictors

    The standard CART algorithm tends to split predictors with many unique values (levels), e.g., continuous variables, over those with fewer levels, e.g., categorical variables. If your data is heterogeneous, or your predictor variables vary greatly in their number of levels, then consider using the curvature or interaction tests for split-predictor selection instead of standard CART.

    For each predictor, determine the number of levels in the data. One way to do this is to define an anonymous function that:

    Converts all variables to the categorical data type using categorical

    Determines all unique categories while ignoring missing values using categories

    Counts the categories using numel

    Then, apply the function to each variable using varfun.

    countLevels = @(x)numel(categories(categorical(x)));

    numLevels = varfun(countLevels,X(:,1:end-1),'OutputFormat','uniform');

    Compare the number of levels among the predictor variables.

    figure;

    bar(numLevels);

    title('Number of Levels Among Predictors');

    xlabel('Predictor variable');

    ylabel('Number of levels');

    h = gca;

    h.XTickLabel = X.Properties.VariableNames(1:end-1);

    h.XTickLabelRotation = 45;

    h.TickLabelInterpreter = 'none';

    [Figure: bar chart of the number of levels among the predictors (http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_01.png)]

    The continuous variables have many more levels than the categorical variables. Because the number of levels among the predictors varies so much, using standard CART to select split predictors at each node of the trees in a random forest can yield inaccurate predictor importance estimates.

    Grow Robust Random Forest

    Grow a random forest of 200 regression trees. Specify sampling all variables at each node. Specify usage of the interaction test to select split predictors. Because there are missing values in the data, specify usage of surrogate splits to increase accuracy.

    t = templateTree('NumVariablesToSample','all',...

        'PredictorSelection','interaction-curvature','Surrogate','on');

    rng(1); % For reproducibility

    Mdl = fitrensemble(X,'MPG','Method','bag','NumLearningCycles',200,...

        'Learners',t);

    Mdl is a RegressionBaggedEnsemble model.

    Estimate the model $R^2$ using out-of-bag predictions.

    yHat = oobPredict(Mdl);

    R2 = corr(Mdl.Y,yHat)^2

    R2 =

        0.8739

    Mdl explains 87.39% of the variability around the mean.

    Predictor Importance Estimation

    Estimate predictor importance values by permuting out-of-bag observations among the trees.

    impOOB = oobPermutedPredictorImportance(Mdl);

    impOOB is a 1-by-7 vector of predictor importance estimates corresponding to the predictors in Mdl.PredictorNames. The estimates are not biased toward predictors containing many levels.

    Compare the predictor importance estimates.

    figure;

    bar(impOOB);

    title('Unbiased Predictor Importance Estimates');

    xlabel('Predictor variable');

    ylabel('Importance');

    h = gca;

    h.XTickLabel = Mdl.PredictorNames;

    h.XTickLabelRotation = 45;

    h.TickLabelInterpreter = 'none';

    [Figure: bar chart of unbiased predictor importance estimates (http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_02.png)]

    Greater importance estimates indicate more important predictors. The bar graph suggests that Model_Year is the most important predictor, followed by Weight. Model_Year has only 13 distinct levels, whereas Weight has over 300.

    Compare the predictor importance estimates obtained by permuting out-of-bag observations with those obtained by summing gains in the mean squared error due to splits on each predictor. Also, obtain predictor association measures estimated by surrogate splits.

    [impGain,predAssociation] = predictorImportance(Mdl);

    figure;

    plot(1:numel(Mdl.PredictorNames),[impOOB' impGain']);

    title('Predictor Importance Estimation Comparison')

    xlabel('Predictor variable');

    ylabel('Importance');

    h = gca;

    h.XTickLabel = Mdl.PredictorNames;

    h.XTickLabelRotation = 45;

    h.TickLabelInterpreter = 'none';

    legend('OOB permuted','MSE improvement')

    grid on

    [Figure: comparison of OOB-permuted and MSE-improvement importance estimates (http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_03.png)]

    impGain is commensurate with impOOB. According to the values of impGain, Model_Year and Weight do not appear to be the most important predictors.

    predAssociation is a 7-by-7 matrix of predictor association measures. Rows and columns correspond to the predictors in Mdl.PredictorNames. You can infer the strength of the relationship between pairs of predictors using the elements of predAssociation. Larger values indicate more highly correlated pairs of predictors.

    figure;

    imagesc(predAssociation);

    title('Predictor Association Estimates');

    colorbar;

    h = gca;

    h.XTickLabel = Mdl.PredictorNames;

    h.XTickLabelRotation = 45;

    h.TickLabelInterpreter = 'none';

    h.YTickLabel = Mdl.PredictorNames;

    predAssociation(1,2)

    ans =

        0.6830

    [Figure: heat map of predictor association estimates (http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_04.png)]

    The largest association is between Cylinders and Displacement, but the value is not high enough to indicate a strong relationship between the two predictors.

    Grow Random Forest Using Reduced Predictor Set

    Because prediction time increases with the number of predictors in random forests, it is good practice to create a model using as few predictors as possible.

    Grow a random forest of 200 regression trees using the best two predictors only.

    MdlReduced = fitrensemble(X(:,{'Model_Year' 'Weight' 'MPG'}),'MPG','Method','bag',...

        'NumLearningCycles',200,'Learners',t);

    Compute the $R^2$ of the reduced model.

    yHatReduced = oobPredict(MdlReduced);

    r2Reduced = corr(Mdl.Y,yHatReduced)^2

    r2Reduced =

        0.8525

    The $R^2$ for the reduced model is close to the $R^2$ of the full model. This result suggests that the reduced model is sufficient for prediction.

    Usually you cannot evaluate the predictive quality of an ensemble based on its performance on training data. Ensembles tend to overtrain, meaning they produce overly optimistic estimates of their predictive power. This means the result of resubLoss for classification (resubLoss for regression) usually indicates lower error than you get on new data.
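    A more honest estimate of predictive quality comes from cross-validation. As a sketch, using the ionosphere data from the earlier classification example:

    % Sketch: estimate generalization error by 5-fold cross-validation
    % rather than by resubstitution.
    load ionosphere
    CVMdl = fitcensemble(X,Y,'KFold',5);
    genError = kfoldLoss(CVMdl)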
