Ensemble learning
From Wikipedia, the free encyclopedia

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better
predictive performance than could be obtained from any of the constituent learning algorithms alone.[1][2][3]
Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble
refers only to a concrete finite set of alternative models, but typically allows for much more flexible
structure to exist among those alternatives.

Contents
1 Overview
2 Ensemble theory
3 Ensemble Size
4 Common types of ensembles
4.1 Bayes optimal classifier
4.2 Bootstrap aggregating (bagging)
4.3 Boosting
4.4 Bayesian parameter averaging
4.5 Bayesian model combination
4.6 Bucket of models
4.7 Stacking
5 Implementations in statistics packages
6 See also
7 References
8 Further reading
9 External links

Overview
Supervised learning algorithms are most commonly described as performing the task of searching through a
hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem.
Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may
be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better
hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the
same base learner. The broader term multiple classifier systems also covers hybridization of hypotheses
that are not induced by the same base learner.

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction
of a single model, so ensembles may be thought of as a way to compensate for poor learning algorithms by
performing a lot of extra computation. Fast algorithms such as decision trees are commonly used in
ensemble methods (for example Random Forest), although slower algorithms can benefit from ensemble
techniques as well.

By analogy, ensemble techniques have also been used in unsupervised learning scenarios, for example in
consensus clustering or in anomaly detection.

Ensemble theory

An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make
predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not
necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can
be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable
them to over-fit the training data more than a single model would, but in practice, some ensemble techniques
(especially bagging) tend to reduce problems related to over-fitting of the training data.

Empirically, ensembles tend to yield better results when there is a significant diversity among the models.
[4][5] Many ensemble methods, therefore, seek to promote diversity among the models they combine.[6][7]
Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to
produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).[8] Using
a variety of strong learning algorithms, however, has been shown to be more effective than using techniques
that attempt to dumb-down the models in order to promote diversity.[9]

Ensemble Size
While the number of component classifiers of an ensemble has a great impact on the accuracy of prediction,
only a limited number of studies have addressed this problem. Determining the ensemble size a priori, together
with the volume and velocity of big data streams, makes this even more crucial for online ensemble classifiers.
Mostly, statistical tests have been used to determine the proper number of components. More recently, a
theoretical framework suggested that there is an ideal number of component classifiers for an ensemble, such
that having more or fewer classifiers than this number deteriorates accuracy. This is called "the law of
diminishing returns in ensemble construction." The framework shows that using the same number of independent
component classifiers as class labels gives the highest accuracy.[10]

Common types of ensembles


Bayes optimal classifier

The Bayes Optimal Classifier is a classification technique. It is an ensemble of all the hypotheses in the
hypothesis space. On average, no other ensemble can outperform it.[11] Each hypothesis is given a vote
proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis
were true. To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior
probability of that hypothesis. The Bayes Optimal Classifier can be expressed with the following equation:

    $y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)$

where $y$ is the predicted class, $C$ is the set of all possible classes, $H$ is the hypothesis space, $P$ refers
to a probability, and $T$ is the training data. As an ensemble, the Bayes Optimal Classifier represents a
hypothesis that is not necessarily in $H$. The hypothesis represented by the Bayes Optimal Classifier, however,
is the optimal hypothesis in ensemble space (the space of all possible ensembles consisting only of hypotheses
in $H$).

Unfortunately, the Bayes Optimal Classifier cannot be practically implemented for any but the most simple
of problems. There are several reasons why the Bayes Optimal Classifier cannot be practically implemented:

1. Most interesting hypothesis spaces are too large to iterate over, as required by the $\operatorname{argmax}$.
2. Many hypotheses yield only a predicted class, rather than a probability for each class as required by
the term $P(c_j \mid h_i)$.
3. Computing an unbiased estimate of the probability of the training set given a hypothesis ($P(T \mid h_i)$)
is non-trivial.
4. Estimating the prior probability for each hypothesis ($P(h_i)$) is rarely feasible.
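
To make the weighted vote concrete, the following Python sketch enumerates a toy hypothesis space. The
hypotheses, their priors, and the likelihood values are invented purely for illustration and stand in for
$P(h_i)$ and $P(T \mid h_i)$; they are not part of the original article.

import numpy as np

# Each hypothesis maps an input x to a probability distribution over two classes.
hypotheses = [
    lambda x: np.array([0.9, 0.1]) if x < 0 else np.array([0.2, 0.8]),
    lambda x: np.array([0.6, 0.4]),
    lambda x: np.array([0.1, 0.9]) if x > 1 else np.array([0.7, 0.3]),
]
priors = np.array([0.5, 0.3, 0.2])          # assumed P(h_i)
likelihoods = np.array([0.02, 0.05, 0.01])  # assumed, precomputed P(T | h_i)

def bayes_optimal_predict(x):
    # Weight each hypothesis by P(T | h_i) * P(h_i) and sum its class votes.
    weights = likelihoods * priors
    votes = sum(w * h(x) for w, h in zip(weights, hypotheses))
    return int(np.argmax(votes))  # argmax over classes c_j

print(bayes_optimal_predict(-1.0), bayes_optimal_predict(2.0))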

Bootstrap aggregating (bagging)

Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with
equal weight. In order to promote model variance, bagging trains each model in the ensemble using a
randomly drawn subset of the training set. As an example, the random forest algorithm combines random
decision trees with bagging to achieve very high classification accuracy.[12]
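
A minimal bagging sketch using scikit-learn; the synthetic dataset and parameter choices below are arbitrary
illustrations, not part of the original article.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample of the training set; predictions
# are combined with an equal-weight (majority) vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
print("bagging accuracy:", bagging.score(X_test, y_test))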

Boosting

Boosting involves incrementally building an ensemble by training each new model instance to emphasize the
training instances that previous models mis-classified. In some cases, boosting has been shown to yield better
accuracy than bagging, but it also tends to be more likely to over-fit the training data. By far, the most
common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve
better results.
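
A comparable AdaBoost sketch in scikit-learn, again with an arbitrary synthetic dataset chosen only for
illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each new weak learner (a decision stump by default) is fit with more emphasis
# on the training instances that earlier learners misclassified.
boost = AdaBoostClassifier(n_estimators=100, random_state=1)
boost.fit(X_train, y_train)
print("AdaBoost accuracy:", boost.score(X_test, y_test))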

Bayesian parameter averaging

Bayesian parameter averaging (BPA) is an ensemble technique that seeks to approximate the Bayes Optimal
Classifier by sampling hypotheses from the hypothesis space, and combining them using Bayes' law.[13]
Unlike the Bayes optimal classifier, Bayesian model averaging (BMA) can be practically implemented.
Hypotheses are typically sampled using a Monte Carlo sampling technique such as MCMC. For example,
Gibbs sampling may be used to draw hypotheses that are representative of the posterior distribution $P(h_i \mid T)$. It has
been shown that under certain circumstances, when hypotheses are drawn in this manner and averaged
according to Bayes' law, this technique has an expected error that is bounded to be at most twice the
expected error of the Bayes optimal classifier.[14] Despite the theoretical correctness of this technique, early
work showed experimental results suggesting that the method promoted over-fitting and performed worse
compared to simpler ensemble techniques such as bagging;[15] however, these conclusions appear to be
based on a misunderstanding of the purpose of Bayesian model averaging vs. model combination.[16]
Additionally, there have been considerable advances in theory and practice of BMA. Recent rigorous proofs
demonstrate the accuracy of BMA in variable selection and estimation in high-dimensional settings,[17] and
provide empirical evidence highlighting the role of sparsity-enforcing priors within the BMA in alleviating
overfitting.[18]
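
As a rough illustration of the averaging idea (not of the MCMC sampling itself), the following sketch weights
candidate linear models by a BIC approximation to $P(T \mid h_i)$ under equal priors. The data, the candidate
feature subsets, and the BIC approximation are all assumptions made for this example.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)  # feature 2 is irrelevant

def fit_bic(cols):
    # Least-squares fit on a feature subset; return its BIC score.
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    return n * np.log(rss / n) + A.shape[1] * np.log(n)

models = [cols for r in range(1, 4) for cols in itertools.combinations(range(3), r)]
bics = np.array([fit_bic(cols) for cols in models])

# Approximate posterior model weights: proportional to exp(-BIC/2) with equal priors.
w = np.exp(-0.5 * (bics - bics.min()))
w /= w.sum()
for cols, weight in zip(models, w):
    print(f"features {cols}: weight {weight:.3f}")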

Bayesian model combination

Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA).
Instead of sampling each model in the ensemble individually, it samples from the space of possible
ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters).
This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single
model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield
dramatically better results. The results from BMC have been shown to be better on average (with statistical
significance) than BMA, and bagging.[19]

The use of Bayes' law to compute model weights necessitates computing the probability of the data given
each model. Typically, none of the models in the ensemble are exactly the distribution from which the
training data were generated, so all of them correctly receive a value close to zero for this term. This would
work well if the ensemble were big enough to sample the entire model-space, but such is rarely possible.
Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in
the ensemble that is closest to the distribution of the training data. It essentially reduces to an unnecessarily
complex method for doing model selection.

The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the
simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that
is closest to the distribution of the training data. By contrast, BMC converges toward the point where this
distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the
generating distribution, it seeks the combination of models that is closest to the generating distribution.

The results from BMA can often be approximated by using cross-validation to select the best model from a
bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select
the best ensemble combination from a random sampling of possible weightings.
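
The approximation described in the previous paragraph can be sketched as follows: sample candidate weightings
from a uniform Dirichlet distribution and keep the one that scores best on held-out data. The choice of base
models, dataset, and number of sampled weightings are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=2)

models = [LogisticRegression(max_iter=1000), GaussianNB(),
          DecisionTreeClassifier(max_depth=5, random_state=2)]
probs = [m.fit(X_train, y_train).predict_proba(X_val) for m in models]

rng = np.random.default_rng(2)
best_w, best_acc = None, -1.0
for _ in range(500):
    w = rng.dirichlet(np.ones(len(models)))          # weighting from a uniform Dirichlet
    blended = sum(wi * p for wi, p in zip(w, probs))  # weighted average of class probabilities
    acc = np.mean(blended.argmax(axis=1) == y_val)
    if acc > best_acc:
        best_w, best_acc = w, acc
print("best weighting:", np.round(best_w, 3), "validation accuracy:", best_acc)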

Bucket of models

A "bucket of models" is an ensemble technique in which a model selection algorithm is used to choose the
best model for each problem. When tested with only one problem, a bucket of models can produce no better
results than the best model in the set, but when evaluated across many problems, it will typically produce
much better results, on average, than any model in the set.

The most common approach used for model-selection is cross-validation selection (sometimes called a
"bake-off contest"). It is described with the following pseudo-code:

For each model m in the bucket:
    Do c times (where 'c' is some constant):
        Randomly divide the training dataset into two datasets, A and B.
        Train m with A.
        Test m with B.
Select the model that obtains the highest average score.

Cross-Validation Selection can be summed up as: "try them all with the training set, and pick the one that
works best".[20]
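
A runnable Python version of this selection procedure, using scikit-learn utilities and an arbitrary bucket of
three models chosen only for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=3)

bucket = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=3),
    "k-nearest neighbours": KNeighborsClassifier(),
}

# 'c' random train/test splits per model, as in the pseudo-code above.
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=3)
scores = {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in bucket.items()}
print(scores)
print("selected model:", max(scores, key=scores.get))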

Gating is a generalization of Cross-Validation Selection. It involves training another learning model to decide
which of the models in the bucket is best-suited to solve the problem. Often, a perceptron is used for the
gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the
predictions from each model in the bucket.

When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of
the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve
this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the
performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do
best.[21]

Stacking

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the
predictions of several other learning algorithms. First, all of the other algorithms are trained using the
available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the
other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can
theoretically represent any of the ensemble techniques described in this article, although in practice, a
single-layer logistic regression model is often used as the combiner.

Stacking typically yields performance better than any single one of the trained models.[22] It has been
successfully used on both supervised learning tasks (regression,[23] classification and distance learning [24])
and unsupervised learning (density estimation).[25] It has also been used to estimate bagging's error rate.[3][26]
It has been reported to outperform Bayesian model averaging.[27] The two top performers in the Netflix
competition utilized blending, which may be considered a form of stacking.[28]
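
A minimal stacking sketch, assuming a scikit-learn version that provides StackingClassifier (0.22 or later);
the base learners and dataset are arbitrary, and a logistic regression serves as the single-layer combiner.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# Base model predictions become inputs to the combiner (final_estimator).
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=5, random_state=4)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))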

Implementations in statistics packages


R: at least three packages offer Bayesian model averaging tools,[29] including the BMS (an acronym for
Bayesian Model Selection) package,[30] the BAS (an acronym for Bayesian Adaptive Sampling)
package,[31] and the BMA package.[32]
Python: scikit-learn, a package for machine learning in Python, offers packages for ensemble learning,
including packages for bagging and averaging methods (a brief sketch follows this list).
MATLAB: classification ensembles are implemented in Statistics and Machine Learning Toolbox.[33]
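
For example, an equal-weight (soft-voting) averaging ensemble built with scikit-learn's ensemble module might
look like the following sketch; the base learners and dataset are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("tree", DecisionTreeClassifier(random_state=0))],
    voting="soft",  # average the predicted class probabilities
)
vote.fit(X, y)
print("training accuracy:", vote.score(X, y))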

See also
Ensemble averaging (machine learning)
Bayesian structural time series (BSTS)

References
1. Opitz, D.; Maclin, R. (1999). "Popular ensemble methods: An empirical study". Journal of Artificial
Intelligence Research. 11: 169–198. doi:10.1613/jair.614 (https://doi.org/10.1613%2Fjair.614).
2. Polikar, R. (2006). "Ensemble based systems in decision making". IEEE Circuits and Systems Magazine. 6 (3):
21–45. doi:10.1109/MCAS.2006.1688199 (https://doi.org/10.1109%2FMCAS.2006.1688199).
3. Rokach, L. (2010). "Ensemble-based classifiers". Artificial Intelligence Review. 33 (1–2): 1–39.
doi:10.1007/s10462-009-9124-7 (https://doi.org/10.1007%2Fs10462-009-9124-7).
4. Kuncheva, L. and Whitaker, C., Measures of diversity in classifier ensembles, Machine Learning, 51, pp.
181-207, 2003
5. Sollich, P. and Krogh, A., Learning with ensembles: How overfitting can be useful, Advances in Neural
Information Processing Systems, volume 8, pp. 190-196, 1996.
6. Brown, G. and Wyatt, J. and Harris, R. and Yao, X., Diversity creation methods: a survey and categorisation.,
Information Fusion, 6(1), pp.5-20, 2005.
7. Accuracy and Diversity in Ensembles of Text Categorisers (http://www.clei.cl/cleiej/papers/v8i2p1.pdf). J. J.
García Adeva, Ulises Cerviño, and R. Calvo, CLEI Journal, Vol. 8, No. 2, pp. 1–12, December 2005.
8. Ho, T., Random Decision Forests, Proceedings of the Third International Conference on Document Analysis
and Recognition, pp. 278-282, 1995.
9. Gashler, M. and Giraud-Carrier, C. and Martinez, T., Decision Tree Ensemble: Small Heterogeneous Is Better
Than Large Homogeneous (http://axon.cs.byu.edu/papers/gashler2008icmla.pdf), The Seventh International
Conference on Machine Learning and Applications, 2008, pp. 900-905., DOI 10.1109/ICMLA.2008.154
(http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=4796917)
10. R. Bonab, Hamed; Can, Fazli (2016). A Theoretical Framework on the Ideal Number of Classifiers for Online
Ensembles in Data Streams (http://dl.acm.org/citation.cfm?id=2983907). CIKM. USA: ACM. p. 2053.
11. Tom M. Mitchell, Machine Learning, 1997, pp. 175
12. Breiman, L., Bagging Predictors, Machine Learning, 24(2), pp.123-140, 1996.
13. Hoeting, J. A.; Madigan, D.; Raftery, A. E.; Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial".
Statistical Science. 14 (4): 382–401. JSTOR 2676803 (https://www.jstor.org/stable/2676803).
doi:10.2307/2676803 (https://doi.org/10.2307%2F2676803).
14. David Haussler, Michael Kearns, and Robert E. Schapire. Bounds on the sample complexity of Bayesian
learning using information theory and the VC dimension. Machine Learning, 14: 83–113, 1994
15. Domingos, Pedro (2000). Bayesian averaging of classifiers and the overfitting problem
(http://www.cs.washington.edu/homes/pedrod/papers/mlc00b.pdf) (PDF). Proceedings of the 17th International
Conference on Machine Learning (ICML). pp. 223–230.
16. Minka, Thomas (2002), Bayesian model averaging is not model combination (http://research.microsoft.com/en-
us/um/people/minka/papers/minka-bma-isnt-mc.pdf) (PDF)
17. Castillo, I.; Schmidt-Hieber, J.; van der Vaart, A. (2015). "Bayesian linear regression with sparse priors". Annals
of Statistics. 43 (5): 1986–2018. doi:10.1214/15-AOS1334 (https://doi.org/10.1214%2F15-AOS1334).
18. Hernández-Lobato, D.; Hernández-Lobato, J. M.; Dupont, P. (2013). "Generalized Spike-and-Slab Priors for
Bayesian Group Feature Selection Using Expectation Propagation" (http://www.jmlr.org/papers/volume14
/hernandez-lobato13a/hernandez-lobato13a.pdf) (PDF). Journal of Machine Learning Research. 14: 1891–1945.
19. Monteith, Kristine; Carroll, James; Seppi, Kevin; Martinez, Tony. (2011). Turning Bayesian Model Averaging
into Bayesian Model Combination (http://axon.cs.byu.edu/papers/Kristine.ijcnn2011.pdf) (PDF). Proceedings of
the International Joint Conference on Neural Networks IJCNN'11. pp. 2657–2663.
20. Sašo Džeroski, Bernard Ženko, Is Combining Classifiers Better than Selecting the Best One
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.6096), Machine Learning, 2004, pp. 255–273
21. Bensusan, Hilan and Giraud-Carrier, Christophe G., Discovering Task Neighbourhoods Through Landmark
Learning Performances, PKDD '00: Proceedings of the 4th European Conference on Principles of Data Mining
and Knowledge Discovery, Springer-Verlag, 2000, pages 325--330
22. Wolpert, D., Stacked Generalization., Neural Networks, 5(2), pp. 241-259., 1992
23. Breiman, L., Stacked Regression, Machine Learning, 24, 1996 doi:10.1007/BF00117832 (https://dx.doi.org
/10.1007%2FBF00117832)
24. Ozay, M.; Yarman Vural, F. T. (2013). "A New Fuzzy Stacked Generalization Technique and Analysis of its
Performance". arXiv:1204.0171 (https://arxiv.org/abs/1204.0171) .
25. Smyth, P. and Wolpert, D. H., Linearly Combining Density Estimators via Stacking, Machine Learning Journal,
36, 59-83, 1999
26. Wolpert, D.H., and Macready, W.G., An Efficient Method to Estimate Bagging's Generalization Error, Machine
Learning Journal, 35, 41-55, 1999
27. Clarke, B., Bayes model averaging and stacking when model approximation error cannot be ignored, Journal
of Machine Learning Research, pp 683-712, 2003
28. Sill, J.; Takacs, G.; Mackey, L.; Lin, D. (2009). "Feature-Weighted Linear Stacking". arXiv:0911.0460
(https://arxiv.org/abs/0911.0460) .
29. Amini, Shahram M.; Parmeter, Christopher F. (2011). "Bayesian model averaging in R" (https://core.ac.uk
/download/pdf/6494889.pdf) (PDF). Journal of Economic and Social Measurement. 36 (4): 253–287.
30. "BMS: Bayesian Model Averaging Library" (https://cran.r-project.org/web/packages/BMS/index.html). The
Comprehensive R Archive Network. Retrieved September 9, 2016.
31. "BAS: Bayesian Model Averaging using Bayesian Adaptive Sampling" (https://cran.r-project.org/web/packages
/BAS/index.html). The Comprehensive R Archive Network. Retrieved September 9, 2016.
32. "BMA: Bayesian Model Averaging" (https://cran.r-project.org/web/packages/BMA/index.html). The
Comprehensive R Archive Network. Retrieved September 9, 2016.
33. "Classification Ensembles" (https://uk.mathworks.com/help/stats/classification-ensembles.html). MATLAB &
Simulink. Retrieved June 8, 2017.

Further reading
Zhou Zhihua (2012). Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC.
ISBN 978-1-439-83003-1.
Robert Schapire; Yoav Freund (2012). Boosting: Foundations and Algorithms. MIT Press.
ISBN 978-0-262-01718-3.

External links
Robi Polikar (ed.). "Ensemble learning" (http://www.scholarpedia.org/article/Ensemble_learning).
Scholarpedia.
The Waffles (machine learning) toolkit contains implementations of Bagging, Boosting, Bayesian
Model Averaging, Bayesian Model Combination, Bucket-of-models, and other ensemble techniques
