
review articles

doi:10.1145/2347736.2347755

A Few Useful Things to Know About Machine Learning

Tapping into the “folk knowledge” needed to advance machine learning applications.

by Pedro Domingos

Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation.15 Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell16 and Witten et al.24). However, much of the “folk knowledge” that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.

key insights

• Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled.

• Machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is difficult to find in textbooks.

• This article summarizes 12 key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

78 Communications of the ACM | October 2012 | Vol. 55 | No. 10


Many different types of machine learning exist, but for illustration purposes I will focus on the most mature and widely used one: classification. Nevertheless, the issues I will discuss apply across all of machine learning. A classifier is a system that inputs (typically) a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. For example, a spam filter classifies email messages into “spam” or “not spam,” and its input may be a Boolean vector $x = (x_1, \ldots, x_j, \ldots, x_d)$, where $x_j = 1$ if the $j$th word in the dictionary appears in the email and $x_j = 0$ otherwise. A learner inputs a training set of examples $(x_i, y_i)$, where $x_i = (x_{i,1}, \ldots, x_{i,d})$ is an observed input and $y_i$ is the corresponding output, and outputs a classifier. The test of the learner is whether this classifier produces the correct output $y_t$ for future examples $x_t$ (for example, whether the spam filter correctly classifies previously unseen email messages as spam or not spam).

Learning = Representation + Evaluation + Optimization

Suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realize that it consists of combinations of just three components. The components are:

• Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, which I address later, is how to represent the input, in other words, what features to use.

• Evaluation. An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize, for ease of optimization and due to the issues I will discuss.

• Optimization. Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.

The accompanying table shows common examples of each of these three components. For example, k-nearest neighbor classifies a test example by finding the k most similar training examples and predicting the majority class among them. Hyperplane-based methods form a linear combination of the features per class and predict the class with the highest-valued combination.
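
As a rough illustration of the k-nearest neighbor rule just described, the following Python sketch classifies Boolean word vectors using Hamming distance as the similarity measure. The four-word vocabulary, the six training messages, and the function names `hamming` and `knn_predict` are assumptions made up for this example; they do not come from the article.

```python
from collections import Counter

# Hypothetical vocabulary: position j is 1 if word j appears in the message.
VOCAB = ("free", "winner", "meeting", "report")

# Hypothetical training set of (x_i, y_i) pairs.
TRAIN = [
    ((1, 1, 0, 0), "spam"),
    ((1, 0, 0, 0), "spam"),
    ((0, 1, 0, 0), "spam"),
    ((0, 0, 1, 1), "not spam"),
    ((0, 0, 1, 0), "not spam"),
    ((0, 0, 0, 1), "not spam"),
]


def hamming(a, b):
    """Number of positions where two Boolean vectors differ."""
    return sum(u != v for u, v in zip(a, b))


def knn_predict(train, x, k=3):
    """Predict the majority class among the k training examples closest to x."""
    nearest = sorted(train, key=lambda ex: hamming(ex[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# A message containing "free", "winner" and "report".
print(knn_predict(TRAIN, (1, 1, 0, 1)))  # prints "spam"
```

In terms of the three components, the representation here is the stored instances, and "learning" amounts to keeping the training set; all the work happens at prediction time.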

Table 1. The three components of learning algorithms.

Representation              Evaluation              Optimization
Instances                   Accuracy/Error rate     Combinatorial optimization
K-nearest neighbor          Precision and recall    Greedy search
Support vector machines     Squared error           Beam search
Hyperplanes                 Likelihood              Branch-and-bound
Naive Bayes                 Posterior probability   Continuous optimization
Logistic regression         Information gain        Unconstrained
Decision trees              K-L divergence          Gradient descent
Sets of rules               Cost/Utility            Conjugate gradient
Propositional rules         Margin                  Quasi-Newton methods
Logic programs                                      Constrained
Neural networks                                     Linear programming
Graphical models                                    Quadratic programming
Bayesian networks
Conditional random fields

Algorithm 1. Decision tree induction.

LearnDT(TrainSet)
    if all examples in TrainSet have the same class y* then
        return MakeLeaf(y*)
    if no feature xj has InfoGain(xj, y) > 0 then
        y* ← Most frequent class in TrainSet
        return MakeLeaf(y*)
    x* ← argmax_xj InfoGain(xj, y)
    TS0 ← Examples in TrainSet with x* = 0
    TS1 ← Examples in TrainSet with x* = 1
    return MakeNode(x*, LearnDT(TS0), LearnDT(TS1))

Decision trees test one feature at each internal node, with one branch for each feature value, and have class predictions at the leaves. Algorithm 1 (above) shows a bare-bones decision tree learner for Boolean domains, using information gain and greedy search.20 InfoGain(xj, y) is the mutual information between feature xj and the class y. MakeNode(x, c0, c1) returns a node that tests feature x and has c0 as the child for x = 0 and c1 as the child for x = 1.

Of course, not all combinations of one component from each column of the table make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner!

Most textbooks are organized by representation, and it is easy to overlook the fact that the other components are equally important. There is no simple recipe for choosing each component, but I will touch on some of the key issues here. As we will see, some choices in a machine learning project may be even more important than the choice of learner.

It’s Generalization that Counts

The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time. (Notice that, if there are 100,000 words in the dictionary, the spam filter described above has $2^{100{,}000}$ possible different inputs.) Doing well on the training set is easy (just memorize the examples). The most common mistake among machine learning beginners is to test on the training data and have the illusion of success. If the chosen classifier is then tested on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if you have been hired to build a classifier, set some of the data aside from the beginning, and only use it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data.

Contamination of your classifier by test data can occur in insidious ways, for example, if you use test data to tune parameters and do a lot of tuning. (Machine learning algorithms have lots of knobs, and success often comes from twiddling them a lot, so this is a real concern.) Of course, holding out data reduces the amount available for training. This can be mitigated by doing cross-validation: randomly dividing your training data into (say) 10 subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well the particular parameter setting does.

In the early days of machine learning, the need to keep training and test data separate was not widely appreciated. This was partly because, if the learner has a very limited representation (for example, hyperplanes), the difference between training and test error may not be large. But with very flexible classifiers (for example, decision trees), or even with linear classifiers with a lot of features, strict separation is mandatory.

Notice that generalization being the goal has an interesting consequence for machine learning. Unlike in most other optimization problems, we do not have access to the function we want to optimize! We have to use training error as a surrogate for test error, and this is fraught with danger. (How to deal with it is addressed later.) On the positive side, since the objective function is only a proxy for the true goal, we may not need to fully optimize it; in fact, a local optimum returned by simple greedy search may be better than the global optimum.
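
The gap between training accuracy and held-out accuracy, and the cross-validation procedure described in this section, can be sketched in a few lines of Python. Everything here is an illustrative assumption: the `Memorizer` learner, the helper names, and the synthetic data whose labels are pure coin flips (so there is nothing real to learn).

```python
import random


class Memorizer:
    """Toy learner: memorizes the training set, else predicts the majority class."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        self.default = max(set(ys), key=ys.count)
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, xs, ys):
    """Fraction of examples the model labels correctly."""
    return sum(model.predict(x) == y for x, y in zip(xs, ys)) / len(xs)


def cross_val_accuracy(make_learner, xs, ys, k=10, seed=0):
    """Hold out each of k random subsets in turn and average the held-out accuracy."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = make_learner().fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            accuracy(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return sum(scores) / k


rng = random.Random(42)
xs = list(range(200))                  # 200 distinct inputs
ys = [rng.randint(0, 1) for _ in xs]   # labels are random noise

train_acc = accuracy(Memorizer().fit(xs, ys), xs, ys)
cv_acc = cross_val_accuracy(Memorizer, xs, ys)
print(train_acc, cv_acc)
```

Testing on the training data reports perfect accuracy because every input was memorized, while the cross-validated estimate stays near chance, which is the honest answer for random labels.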

Data Alone Is Not Enough

Generalization being the goal has another major consequence: Data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are $2^{100} - 10^6$ examples whose classes you do not know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it. This notion was formalized by Wolpert in his famous “no free lunch” theorems, according to which no learner can beat random guessing over all possible functions to be learned.25

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions (like smoothness, similar examples having similar classes, limited dependences, or limited complexity) are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more we put in, the more we can get out.

A corollary of this is that one of the key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have a lot of knowledge about what makes examples similar in our domain, instance-based methods may be a good choice. If we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, “IF . . . THEN . . .” rules may be the best option. The most useful learners in this regard are those that do not just have assumptions hardwired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning (for example, using first-order logic21 or grammars6).

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it cannot get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.

Overfitting Has Many Faces

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

Figure 1. Bias and variance in dart-throwing. (Panels: low and high variance across the columns, high and low bias down the rows.)

Figure 2. Naïve Bayes can outperform a state-of-the-art rule learner (C4.5rules) even when the true classifier is a set of rules. (Axes: test-set accuracy (%) against number of examples, from 10 to 10,000; curves: Bayes and C4.5.)

Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance.9 Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees do not have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be the same.
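
Bias and variance can be estimated empirically by training the same learner on many independently drawn training sets and looking at how its predictions at a fixed test point are centered and spread. The sketch below is only an analogy to the classification setting discussed here: it uses a numeric prediction task because squared-error bias and variance are simplest to compute there, and the target function, noise level, and the two toy learners (`predict_mean`, `predict_1nn`) are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def true_f(x):
    """Assumed 'real signal' that generates the data."""
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=20, noise=0.3):
    """Draw n noisy examples of true_f on the interval [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """Rigid learner: ignores x0 and predicts the average training output."""
    return statistics.fmean(ys)


def predict_1nn(xs, ys, x0):
    """Flexible learner: copies the output of the nearest training input."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bias_and_variance(predict, x0, trials=2000, seed=0):
    """Estimate bias and variance of predictions at x0 over many training sets."""
    rng = random.Random(seed)
    preds = [predict(*sample_training_set(rng), x0) for _ in range(trials)]
    bias = statistics.fmean(preds) - true_f(x0)
    return bias, statistics.pvariance(preds)


bias_mean, var_mean = bias_and_variance(predict_mean, 0.25)
bias_nn, var_nn = bias_and_variance(predict_1nn, 0.25)
print("mean learner:", bias_mean, var_mean)
print("1-NN learner:", bias_nn, var_nn)
```

The rigid learner misses the target by a wide margin in the same way every time (high bias, low variance), while the flexible one is right on average but scatters with the noise in whichever example happens to be nearest (low bias, high variance).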

Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.

Figure 2 illustrates this.a Even though the true classifier is a set of rules, with up to 1,000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes’s false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

a Training examples consist of 64 Boolean features and a Boolean class computed from them according to a set of “IF . . . THEN . . .” rules. The curves are the average of 100 runs with different randomly generated sets of rules. Error bars are two standard deviations. See Domingos and Pazzani10 for details.

Cross-validation can help to combat overfitting, for example by using it to choose the best size of decision tree to learn. But it is no panacea, since if we use it to make too many parameter choices it can itself start to overfit.17

Besides cross-validation, there are many methods to combat overfitting. The most popular one is adding a regularization term to the evaluation function. This can, for example, penalize classifiers with more structure, thereby favoring smaller ones with less room to overfit. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure. These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims that a particular technique “solves” the overfitting problem. It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch).

A common misconception about overfitting is that it is caused by noise, like training examples labeled with the wrong class. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise. For instance, suppose we learn a Boolean classifier that is just the disjunction of the examples labeled “true” in the training set. (In other words, the classifier is a Boolean formula in disjunctive normal form, where each term is the conjunction of the feature values of one specific training example.) This classifier gets all the training examples right and every positive test example wrong, regardless of whether the training data is noisy or not.

The problem of multiple testing13 is closely related to overfitting. Standard statistical tests assume that only one hypothesis is being tested, but modern learners can easily test millions before they are done. As a result what looks significant may in fact not be. For example, a mutual fund that beats the market 10 years in a row looks very impressive, until you realize that, if there are 1,000 funds and each has a 50% chance of beating the market on any given year, it is quite likely that one will succeed all 10 times just by luck. This problem can be combated by correcting the significance tests to take the number of hypotheses into account, but this can also lead to underfitting. A better approach is to control the fraction of falsely accepted non-null hypotheses, known as the false discovery rate.3

Intuition Fails in High Dimensions

After overfitting, the biggest problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to refer to the fact that many algorithms that work fine in low dimensions become intractable when the input is high-dimensional. But in machine learning it refers to much more. Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about $10^{-18}$ of the input space. This is what makes machine learning both necessary and hard.

More seriously, the similarity-based reasoning that machine learning algorithms depend on (explicitly or implicitly) breaks down in high dimensions. Consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just $x_1 \wedge x_2$. If there are no other features, this is an easy problem. But if there are 98 irrelevant features $x_3, \ldots, x_{100}$, the noise from them completely swamps the signal in $x_1$ and $x_2$, and nearest neighbor effectively makes random predictions.

Even more disturbing is that nearest neighbor still has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example $x_t$. If the grid is $d$-dimensional, $x_t$’s $2d$ nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of $x_t$, until the choice of nearest neighbor (and therefore of class) is effectively random.

This is only one instance of a more general problem with high dimensions: our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant “shell” around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

Building a classifier in two or three dimensions is easy; we can find a reasonable frontier between examples of different classes just by visual inspection.
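
Two of the numerical claims in this section are easy to check directly. The following Python sketch, with function names chosen here purely for illustration, computes the fraction of a Boolean input space covered by a training set and the fraction of a unit hypercube occupied by its inscribed hypersphere, using the standard formula for the volume of a d-dimensional ball.

```python
import math


def coverage_fraction(d, n_examples):
    """Fraction of the 2**d possible Boolean inputs seen in n distinct examples."""
    return n_examples / 2 ** d


def inscribed_sphere_fraction(d):
    """Volume of a radius-1/2 hypersphere divided by the unit hypercube's volume (1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d


# A trillion examples in a 100-dimensional Boolean space.
print(coverage_fraction(100, 10 ** 12))

# How much of the hypercube lies inside the inscribed hypersphere.
for d in (2, 3, 10, 100):
    print(d, inscribed_sphere_fraction(d))
```

In two dimensions the inscribed circle fills about 79% of the square; by ten dimensions the hypersphere holds well under 1% of the hypercube's volume, and the fraction keeps collapsing from there.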

(It has even been said that if people could see in high dimensions machine learning would not be necessary.) But in high dimensions it is difficult to understand what is happening. This in turn makes it difficult to design a good classifier. Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class. But in fact their benefits may be outweighed by the curse of dimensionality.

Fortunately, there is an effect that partly counteracts the curse, which might be called the “blessing of non-uniformity.” In most applications examples are not spread uniformly throughout the instance space, but are concentrated on or near a lower-dimensional manifold. For example, k-nearest neighbor works quite well for handwritten digit recognition even though images of digits have one dimension per pixel, because the space of digit images is much smaller than the space of all possible images. Learners can implicitly take advantage of this lower effective dimension, or algorithms for explicitly reducing the dimensionality can be used (for example, Tenenbaum22).

Theoretical Guarantees Are Not What They Seem

Machine learning papers are full of theoretical guarantees. The most common type is a bound on the number of examples needed to ensure good generalization. What should you make of these guarantees? First of all, it is remarkable that they are even possible. Induction is traditionally contrasted with deduction: in deduction you can guarantee that the conclusions are correct; in induction all bets are off. Or such was the conventional wisdom for many centuries. One of the major developments of recent decades has been the realization that in fact we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.

The basic argument is remarkably simple.5 Let’s say a classifier is bad if its true error rate is greater than $\epsilon$. Then the probability that a bad classifier is consistent with $n$ random, independent training examples is less than $(1 - \epsilon)^n$. Let $b$ be the number of bad classifiers in the learner’s hypothesis space $H$. The probability that at least one of them is consistent is less than $b(1 - \epsilon)^n$, by the union bound. Assuming the learner always returns a consistent classifier, the probability that this classifier is bad is then less than $|H|(1 - \epsilon)^n$, where we have used the fact that $b \le |H|$. So if we want this probability to be less than $\delta$, it suffices to make $n > \ln(\delta/|H|)/\ln(1 - \epsilon)$, which is guaranteed whenever $n \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$, because $-\ln(1 - \epsilon) \ge \epsilon$.

Unfortunately, guarantees of this type have to be taken with a large grain of salt. This is because the bounds obtained in this way are usually extremely loose. The wonderful feature of the bound above is that the required number of examples only grows logarithmically with $|H|$ and $1/\delta$. Unfortunately, most interesting hypothesis spaces are doubly exponential in the number of features $d$, which still leaves us needing a number of examples exponential in $d$. For example, consider the space of Boolean functions of $d$ Boolean variables. If there are $e$ possible different examples, there are $2^e$ possible different functions, so since there are $2^d$ possible examples, the total number of functions is $2^{2^d}$. And even for hypothesis spaces that are “merely” exponential, the bound is still very loose, because the union bound is very pessimistic. For example, if there are 100 Boolean features and the hypothesis space is decision trees with up to 10 levels, to guarantee $\delta = \epsilon = 1\%$ in the bound above we need half a million examples. But in practice a small fraction of this suffices for accurate learning.

Further, we have to be careful about what a bound like this means. For instance, it does not say that, if your learner returned a hypothesis consistent with a particular training set, then this hypothesis probably generalizes well. What it says is that, given a large enough training set, with high probability your learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis. The bound also says nothing about how to select a good hypothesis space. It only tells us that, if the hypothesis space contains the true classifier, then the probability that the learner outputs a bad classifier decreases with training set size.
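
To get a feel for the sample-size bound discussed in this section, the sketch below evaluates it for two hypothesis spaces. It is a minimal illustration under stated assumptions: the function names are invented here, and the two example spaces (conjunctions of Boolean features, with $|H| = 3^d$, and all Boolean functions, with $|H| = 2^{2^d}$) are chosen for convenience. The code works with $\ln|H|$ rather than $|H|$ itself so that doubly exponential spaces do not overflow. Because $-\ln(1 - \epsilon) \ge \epsilon$, the closed form $\frac{1}{\epsilon}(\ln|H| + \ln\frac{1}{\delta})$ is never smaller than the exact threshold.

```python
import math


def exact_threshold(ln_h, eps, delta):
    """Value of n at which |H| * (1 - eps)**n equals delta, given ln|H|."""
    return (math.log(delta) - ln_h) / math.log(1.0 - eps)


def simple_bound(ln_h, eps, delta):
    """The looser closed form (1/eps) * (ln|H| + ln(1/delta))."""
    return (ln_h + math.log(1.0 / delta)) / eps


eps = delta = 0.1

# Assumed example: conjunctions over d Boolean features. Each feature is
# required true, required false, or ignored, so |H| = 3**d.
d = 10
ln_h = d * math.log(3)
print(math.ceil(exact_threshold(ln_h, eps, delta)),
      math.ceil(simple_bound(ln_h, eps, delta)))

# All Boolean functions of d variables: ln|H| = 2**d * ln 2, so the bound
# grows exponentially in d even though it is only logarithmic in |H|.
for d in (5, 10, 20):
    print(d, math.ceil(simple_bound(2 ** d * math.log(2), eps, delta)))
```

For the small conjunction space, roughly a hundred examples are enough to satisfy the bound, whereas for unrestricted Boolean functions the required number of examples roughly doubles with every added feature.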

If we shrink the hypothesis space, the bound improves, but the chances that it contains the true classifier shrink also. (There are bounds for the case where the true classifier is not in the hypothesis space, but similar considerations apply to them.)

Another common type of theoretical guarantee is asymptotic: given infinite data, the learner is guaranteed to output the correct classifier. This is reassuring, but it would be rash to choose one learner over another because of its asymptotic guarantees. In practice, we are seldom in the asymptotic regime (also known as “asymptopia”). And, because of the bias-variance trade-off I discussed earlier, if learner A is better than learner B given infinite data, B is often better than A given finite data.

The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design. In this capacity, they are quite useful; indeed, the close interplay of theory and practice is one of the main reasons machine learning has made so much progress over the years. But caveat emptor: learning is a complex phenomenon, and just because a learner has a theoretical justification and works in practice does not mean the former is the reason for the latter.

Feature Engineering Is The Key

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. Learning is easy if you have many independent features that each correlate well with the class. On the other hand, if the class is a very complex function of the features, you may not be able to learn it. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and “black art” are as important as the technical stuff.

First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and preprocess it, and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating. Learning is often the quickest part of this, but that is because we have already mastered it pretty well! Feature engineering is more difficult because it is domain-specific, while learners can be largely general purpose. However, there is no sharp frontier between the two, and this is another reason the most useful learners are those that facilitate incorporating knowledge.

Of course, one of the holy grails of machine learning is to automate more and more of the feature engineering process. One way this is often done today is by automatically generating large numbers of candidate features and selecting the best by (say) their information gain with respect to the class. But bear in mind that features that look irrelevant in isolation may be relevant in combination. For example, if the class is an XOR of k input features, each of them by itself carries no information about the class. (If you want to annoy machine learners, bring up XOR.) On the other hand, running a learner with a very large number of features to find out which ones are useful in combination may be too time-consuming, or cause overfitting. So there is ultimately no replacement for the smarts you put into feature engineering.

More Data Beats a Cleverer Algorithm

Suppose you have constructed the best set of features you can, but the classifiers you receive are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is
often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)

This does bring up another problem, however: scalability. In most of computer science, the two main limited resources are time and memory. In machine learning, there is a third one: training data. Which one is the bottleneck has changed from decade to decade. In the 1980s it tended to be data. Today it is often time. Enormous mountains of data are available, but there is not enough time to process it, so it goes unused. This leads to a paradox: even though in principle more data means that more complex classifiers can be learned, in practice simpler classifiers wind up being used, because complex ones take too long to learn. Part of the answer is to come up with fast ways to learn complex classifiers, and indeed there has been remarkable progress in this direction (for example, Hulten and Domingos [11]).

Part of the reason using cleverer algorithms has a smaller payoff than you might expect is that, to a first approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of “nearby.” With nonuniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter (those with a substantial number of training examples, and therefore also where most test examples are likely to appear). This also helps explain why powerful learners can be unstable but still accurate. Figure 3 illustrates this in 2D; the effect is much stronger in high dimensions.

Figure 3. Very different frontiers can yield similar predictions. (+ and – are training examples of two classes.) Frontiers shown: N. Bayes, kNN, SVM, D. Tree.

As a rule, it pays to try the simplest learners first (for example, naive Bayes before logistic regression, k-nearest neighbor before support vector machines). More sophisticated learners are seductive, but they are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque.

Learners can be divided into two major types: those whose representation has a fixed size, like linear classifiers, and those whose representation can grow with the data, like decision trees. (The latter are sometimes called nonparametric learners, but this is somewhat unfortunate, since they usually wind up learning many more parameters than parametric ones.) Fixed-size learners can only take advantage of so much data. (Notice how the accuracy of naive Bayes asymptotes at around 70% in Figure 2.) Variable-size learners can in principle learn any function given sufficient data, but in practice they may not, because of limitations of the algorithm (for example, greedy search falls into local optima) or computational cost. Also, because of the curse of dimensionality, no existing amount of data may be enough. For these reasons, clever algorithms (those that make the most of the data and computing resources available) often pay off in the end, provided you are willing to put in the effort. There is no sharp frontier between designing learners and learning classifiers; rather, any given piece of knowledge could be encoded in the learner or learned from data. So machine learning projects often wind up having a significant component of learner design, and practitioners need to have some expertise in it [12].

In the end, the biggest bottleneck is not data or CPU cycles, but human cycles. In research papers, learners are typically compared on measures of accuracy and computational cost. But human effort saved and insight gained, although harder to measure, are often more important. This favors learners that produce human-understandable output (for example, rule sets). And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources, and learning problems easy and efficient, and where there is a close collaboration between machine learning experts and application domain ones.

Learn Many Models, Not Just One

In the early days of machine learning, everyone had a favorite learner, together with some a priori reasons to believe in its superiority. Most effort went into trying many variations of it and selecting the best one. Then systematic empirical comparisons showed that the best learner varies from application to application, and systems containing many different learners started to appear. Effort now went into trying many variations of many learners, and still selecting just the best one. But then researchers noticed that, if instead of selecting the best variation found, we combine many variations, the results are better (often much better) and at little extra effort for the user.

Creating such model ensembles is now standard [1]. In the simplest technique, called bagging, we simply generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias. In boosting, training examples have weights, and these are varied so that each new classifier focuses on the examples the previous ones tended to get wrong. In stacking, the outputs of individual classifiers become the inputs of a “higher-level” learner that figures out how best to combine them.

Many other techniques exist, and the trend is toward larger and larger ensembles. In the Netflix prize, teams from all over the world competed to build the best video recommender


system (http://netflixprize.com). As the competition progressed, teams found they obtained the best results by combining their learners with other teams’, and merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results. Doubtless we will see even larger ones in the future.

Model ensembles should not be confused with Bayesian model averaging (BMA), the theoretically optimal approach to learning [4]. In BMA, predictions on new examples are made by averaging the individual predictions of all classifiers in the hypothesis space, weighted by how well the classifiers explain the training data and how much we believe in them a priori. Despite their superficial similarities, ensembles and BMA are very different. Ensembles change the hypothesis space (for example, from single decision trees to linear combinations of them), and can take a wide variety of forms. BMA assigns weights to the hypotheses in the original space according to a fixed formula. BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it [8]. A practical consequence of this is that, while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble.

Simplicity Does Not Imply Accuracy

Occam’s razor famously states that entities should not be multiplied beyond necessity. In machine learning, this is often taken to mean that, given two classifiers with the same training error, the simpler of the two will likely have the lowest test error. Purported proofs of this claim appear regularly in the literature, but in fact there are many counterexamples to it, and the “no free lunch” theorems imply it cannot be true.

We saw one counterexample previously: model ensembles. The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero. Another counterexample is support vector machines, which can effectively have an infinite number of parameters without overfitting. Conversely, the function sign(sin(ax)) can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis, even though it has only one parameter [23]. Thus, contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.

A more sophisticated view instead equates complexity with the size of the hypothesis space, on the basis that smaller spaces allow hypotheses to be represented by shorter codes. Bounds like the one in the section on theoretical guarantees might then be viewed as implying that shorter hypotheses generalize better. This can be further refined by assigning shorter codes to the hypotheses in the space we have some a priori preference for. But viewing this as “proof” of a trade-off between accuracy and simplicity is circular reasoning: we made the hypotheses we prefer simpler by design, and if they are accurate it is because our preferences are accurate, not because the hypotheses are “simple” in the representation we chose.

A further complication arises from the fact that few learners search their hypothesis space exhaustively. A learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space. As Pearl [18] points out, the size of the hypothesis space is only a rough guide to what really matters for relating training and test error: the procedure by which a hypothesis is chosen.

Domingos [7] surveys the main arguments and evidence on the issue of Occam’s razor in machine learning. The conclusion is that simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy. This is probably what Occam meant in the first place.

Representable Does Not Imply Learnable

Essentially all representations used in variable-size learners have associated


theorems of the form “Every function can be represented, or approximated arbitrarily closely, using this representation.” Reassured by this, fans of the representation often proceed to ignore all others. However, just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even simple functions using a fixed set of primitives often requires an infinite number of components. Further, if the hypothesis space has many local optima of the evaluation function, as is often the case, the learner may not find the true function even if it is representable. Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations. Therefore the key question is not “Can it be represented?” to which the answer is often trivial, but “Can it be learned?” And it pays to try different learners (and possibly combine them).

Some representations are exponentially more compact than others for some functions. As a result, they may also require exponentially less data to learn those functions. Many learners work by forming linear combinations of simple basis functions. For example, support vector machines form combinations of kernels centered at some of the training examples (the support vectors). Representing parity of n bits in this way requires 2^n basis functions. But using a representation with more layers (that is, more steps between input and output), parity can be encoded in a linear-size classifier. Finding methods to learn these deeper representations is one of the major research frontiers in machine learning [2].

Correlation Does Not Imply Causation

The point that correlation does not imply causation is made so often that it is perhaps not worth belaboring. But, even though learners of the kind we have been discussing can only learn correlations, their results are often treated as representing causal relations. Isn’t this wrong? If so, then why do people do it?

More often than not, the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, then perhaps putting beer next to the diaper section will increase sales. (This is a famous example in the world of data mining.) But short of actually doing the experiment it is difficult to tell. Machine learning is usually applied to observational data, where the predictive variables are not under the control of the learner, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted [19]. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation (for example, trying to understand what the causal chain might be).

Many researchers believe that causality is only a convenient fiction. For example, there is no notion of causality in physical laws. Whether or not causality really exists is a deep philosophical question with no definitive answer in sight, but there are two practical points for machine learners. First, whether or not we call them “causal,” we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of a Web site), then by all means do so [14].

Conclusion

Like any discipline, machine learning has a lot of “folk wisdom” that can be difficult to come by, but is crucial for success. This article summarized some of the most salient items. Of course, it is only a complement to the more conventional study of machine learning. Check out http://www.cs.washington.edu/homes/pedrod/class for a complete online machine learning course that combines formal and informal aspects. There is also a treasure trove of machine learning lectures at http://www.videolectures.net. A good open source machine learning toolkit is Weka [24]. Happy learning!

References

1. Bauer, E. and Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning 36 (1999), 105–142.
2. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
3. Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57 (1995), 289–300.
4. Bernardo, J.M. and Smith, A.F.M. Bayesian Theory. Wiley, NY, 1994.
5. Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.K. Occam’s razor. Information Processing Letters 24 (1987), 377–380.
6. Cohen, W.W. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence 68 (1994), 303–366.
7. Domingos, P. The role of Occam’s razor in knowledge discovery. Data Mining and Knowledge Discovery 3 (1999), 409–425.
8. Domingos, P. Bayesian averaging of classifiers and the overfitting problem. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 223–230.
9. Domingos, P. A unified bias-variance decomposition and its applications. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 231–238.
10. Domingos, P. and Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29 (1997), 103–130.
11. Hulten, G. and Domingos, P. Mining complex models from arbitrarily large databases in constant time. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Canada, 2002). ACM Press, NY, 525–531.
12. Kibler, D. and Langley, P. Machine learning as an experimental science. In Proceedings of the 3rd European Working Session on Learning (London, UK, 1988). Pitman.
13. Klockars, A.J. and Sax, G. Multiple Comparisons. Sage, Beverly Hills, CA, 1986.
14. Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
15. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A. Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute, 2011.
16. Mitchell, T.M. Machine Learning. McGraw-Hill, NY, 1997.
17. Ng, A.Y. Preventing “overfitting” of cross-validation data. In Proceedings of the 14th International Conference on Machine Learning (Nashville, TN, 1997). Morgan Kaufmann, San Mateo, CA, 245–253.
18. Pearl, J. On the connection between the complexity and credibility of inferred models. International Journal of General Systems 4 (1978), 255–264.
19. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK, 2000.
20. Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
21. Richardson, M. and Domingos, P. Markov logic networks. Machine Learning 62 (2006), 107–136.
22. Tenenbaum, J., Silva, V. and Langford, J. A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000), 2319–2323.
23. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer, NY, 1995.
24. Witten, I., Frank, E. and Hall, M. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition. Morgan Kaufmann, San Mateo, CA, 2011.
25. Wolpert, D. The lack of a priori distinctions between learning algorithms. Neural Computation 8 (1996), 1341–1390.

Pedro Domingos (pedrod@cs.washington.edu) is a professor in the Department of Computer Science and Engineering at the University of Washington, Seattle.

© 2012 ACM 0001-0782/12/10 $15.00
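
A supplementary sketch for the section “Learn Many Models, Not Just One”: the bagging procedure described there (resample the training set, learn a classifier on each resample, combine the results by voting) might look roughly like the following in Python. The base learner (a one-feature threshold “stump”), the toy data, and the names train_stump, bagging, and DATA are illustrative assumptions and do not come from the article.

```python
import random
from collections import Counter


def train_stump(data):
    """Fit a threshold classifier on (x, label) pairs with labels 0 or 1.

    Hypothetical base learner: tries every observed x as a threshold, in both
    orientations, and keeps the first one with the fewest training errors.
    """
    best = None
    for t in sorted({x for x, _ in data}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def bagging(data, n_models=25, seed=0):
    """Train n_models stumps on bootstrap resamples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) examples with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Toy one-dimensional data (assumed for illustration): class 0 below 5, class 1 above.
DATA = [(x, 0) for x in range(0, 5)] + [(x, 1) for x in range(6, 11)]
predict = bagging(DATA)
```

Swapping train_stump for any other learner leaves the resample-and-vote loop unchanged, which is one reason the technique costs the user so little extra effort.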

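
A supplementary sketch for the section “Simplicity Does Not Imply Accuracy”: the claim that sign(sin(ax)) can fit arbitrary labelings with a single parameter can be checked numerically on a small case. The choice of points x_i = 10^-i and the closed-form value of a used below follow a commonly cited construction and are assumptions of this sketch; the article gives only the claim and the citation [23]. Only five points are used so that floating-point error stays negligible.

```python
import math
from itertools import product


def sine_classifier(a):
    """Return the one-parameter classifier x -> sign(sin(a * x)), with values +1 or -1."""
    return lambda x: 1 if math.sin(a * x) > 0 else -1


def fit_a(labels):
    """Pick a so that the classifier outputs labels[i-1] at x = 10**-i.

    Assumed construction: a = pi * (1 + sum of 10**i over all i whose label is -1).
    """
    return math.pi * (1 + sum(((1 - y) // 2) * 10 ** i
                              for i, y in enumerate(labels, start=1)))


POINTS = [10.0 ** -i for i in range(1, 6)]  # x_i = 10^-i for i = 1..5


def fits_every_labeling():
    """Check that each of the 2^5 labelings of POINTS is reproduced exactly."""
    for labels in product((-1, 1), repeat=len(POINTS)):
        classify = sine_classifier(fit_a(labels))
        if [classify(x) for x in POINTS] != list(labels):
            return False
    return True


ALL_FIT = fits_every_labeling()
```

The single number a effectively stores the whole labeling in its decimal digits, which is why counting parameters says little about a model's capacity to overfit.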
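
A supplementary sketch for the section “Representable Does Not Imply Learnable”: the contrast between a flat and a layered encoding of parity can be made concrete. As a stand-in for the article's kernel combinations, the flat version below is an OR of AND-terms (an assumption of this sketch), which needs one term per odd-weight input, that is 2^(n-1) terms rather than the 2^n basis functions the article quotes for kernels; the growth is exponential in both cases. The layered version chains n - 1 two-input XOR steps. All function names are hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor


def flat_parity_terms(n):
    """Terms a flat OR-of-ANDs must store: every n-bit input with an odd number of 1s."""
    return {bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1}


def flat_parity(bits, terms):
    """Flat representation: output 1 iff the input matches one of the stored terms."""
    return int(tuple(bits) in terms)


def layered_parity(bits):
    """Layered representation: n - 1 two-input XOR steps applied in sequence."""
    return reduce(xor, bits)


N = 4
TERMS = flat_parity_terms(N)
# Both representations compute the same function on all 2^N inputs.
AGREE = all(flat_parity(b, TERMS) == layered_parity(b)
            for b in product((0, 1), repeat=N))
```

Doubling n doubles the number of XOR steps but squares the number of flat terms, which illustrates the exponential gap in compactness the section describes.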