
doi:10.1145/ 2347736.2347755

advance machine learning applications.

A Few Useful Things to Know About Machine Learning

by Pedro Domingos

Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation [15]. Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell [16] and Witten et al. [24]). However, much of the “folk knowledge” that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.

Key Insights

Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled.

Machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is difficult to find in textbooks.

This article summarizes 12 key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

Many different types of machine learning exist, but for illustration purposes I will focus on the most mature and widely used one: classification. Nevertheless, the issues I will discuss apply across all of machine learning. A classifier is a system that inputs (typically) a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. For example, a spam filter classifies email messages into “spam” or “not spam,” and its input may be a Boolean vector $x = (x_1, \ldots, x_j, \ldots, x_d)$, where $x_j = 1$ if the $j$th word in the dictionary appears in the email and $x_j = 0$ otherwise. A learner inputs a training set of examples $(x_i, y_i)$, where $x_i = (x_{i,1}, \ldots, x_{i,d})$ is an observed input and $y_i$ is the corresponding output, and outputs a classifier. The test of the learner is whether this classifier produces the correct output $y_t$ for future examples $x_t$ (for example, whether the spam filter correctly classifies previously unseen email messages as spam or not spam).

Learning = Representation + Evaluation + Optimization

Suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realize that it consists of combinations of just three components. The components are:

- Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, which I address later, is how to represent the input, in other words, what features to use.

- Evaluation. An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize, for ease of optimization and due to the issues I will discuss.

- Optimization. Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.

The accompanying table shows common examples of each of these three components. For example, k-nearest neighbor classifies a test example by finding the k most similar training examples and predicting the majority class among them. Hyperplane-based methods form a linear

review articles

combination of the features per class and predict the class with the highest-valued combination. Decision trees test one feature at each internal node, with one branch for each feature value, and have class predictions at the leaves. Algorithm 1 (below) shows a bare-bones decision tree learner for Boolean domains, using information gain and greedy search [20]. InfoGain($x_j$, $y$) is the mutual information between feature $x_j$ and the class $y$. MakeNode($x$, $c_0$, $c_1$) returns a node that tests feature $x$ and has $c_0$ as the child for $x = 0$ and $c_1$ as the child for $x = 1$.

Table 1. The three components of learning algorithms.

Representation                 Evaluation               Optimization
Instances                      Accuracy/Error rate      Combinatorial optimization
  K-nearest neighbor           Precision and recall       Greedy search
  Support vector machines      Squared error              Beam search
Hyperplanes                    Likelihood                 Branch-and-bound
  Naive Bayes                  Posterior probability    Continuous optimization
  Logistic regression          Information gain           Unconstrained
Decision trees                 K-L divergence               Gradient descent
Sets of rules                  Cost/Utility                 Conjugate gradient
  Propositional rules          Margin                       Quasi-Newton methods
  Logic programs                                          Constrained
Neural networks                                             Linear programming
Graphical models                                            Quadratic programming
  Bayesian networks
  Conditional random fields

Algorithm 1. Decision tree induction.

LearnDT(TrainSet)
  if all examples in TrainSet have the same class y* then
    return MakeLeaf(y*)
  if no feature x_j has InfoGain(x_j, y) > 0 then
    y* ← Most frequent class in TrainSet
    return MakeLeaf(y*)
  x* ← argmax_{x_j} InfoGain(x_j, y)
  TS_0 ← Examples in TrainSet with x* = 0
  TS_1 ← Examples in TrainSet with x* = 1
  return MakeNode(x*, LearnDT(TS_0), LearnDT(TS_1))

Of course, not all combinations of one component from each column of the table make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner!

Most textbooks are organized by representation, and it is easy to overlook the fact that the other components are equally important. There is no simple recipe for choosing each component, but I will touch on some of the key issues here. As we will see, some choices in a machine learning project may be even more important than the choice of learner.

It’s Generalization that Counts

The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time. (Notice that, if there are 100,000 words in the dictionary, the spam filter described above has $2^{100,000}$ possible different inputs.) Doing well on the training set is easy (just memorize the examples). The most common mistake among machine learning beginners is to test on the training data and have the illusion of success. If the chosen classifier is then tested on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if you have been hired to build a classifier, set some of the data aside from the beginning, and only use it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data.

Contamination of your classifier by test data can occur in insidious ways, for example, if you use test data to tune parameters and do a lot of tuning. (Machine learning algorithms have lots of knobs, and success often comes from twiddling them a lot, so this is a real concern.) Of course, holding out data reduces the amount available for training. This can be mitigated by doing cross-validation: randomly dividing your training data into (say) 10 subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well the particular parameter setting does.

In the early days of machine learning, the need to keep training and test data separate was not widely appreciated. This was partly because, if the learner has a very limited representation (for example, hyperplanes), the difference between training and test error may not be large. But with very flexible classifiers (for example, decision trees), or even with linear classifiers with a lot of features, strict separation is mandatory.

Notice that generalization being the goal has an interesting consequence for machine learning. Unlike in most other optimization problems, we do not have access to the function we want to optimize! We have to use training error as a surrogate for test error, and this is fraught with danger. (How to deal with it is addressed later.) On the positive side, since the objective function is only a proxy for the true goal, we may not need to fully


optimize it; in fact, a local optimum returned by simple greedy search may be better than the global optimum.

Data Alone Is Not Enough

Generalization being the goal has another major consequence: Data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are $2^{100} - 10^6$ examples whose classes you do not know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it. This notion was formalized by Wolpert in his famous “no free lunch” theorems, according to which no learner can beat random guessing over all possible functions to be learned [25].

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions (like smoothness, similar examples having similar classes, limited dependences, or limited complexity) are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more we put in, the more we can get out.

A corollary of this is that one of the key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have a lot of knowledge about what makes examples similar in our domain, instance-based methods may be a good choice. If we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, “IF . . . THEN . . .” rules may be the best option. The most useful learners in this regard are those that do not just have assumptions hardwired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning (for example, using first-order logic [21] or grammars [6]).

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it cannot get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.

Overfitting Has Many Faces

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

Figure 1. Bias and variance in dart-throwing. (Four dartboards: low and high variance across, high and low bias down.)

Figure 2. Naïve Bayes can outperform a state-of-the-art rule learner (C4.5rules) even when the true classifier is a set of rules. (Plot of test-set accuracy (%) against number of examples, with one curve for Bayes and one for C4.5.)

Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance [9]. Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees do not have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be

October 2012 | Vol. 55 | No. 10 | Communications of the ACM 81


the same. Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.

Figure 2 illustrates this (see footnote a). Even though the true classifier is a set of rules, with up to 1,000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes’s false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

Footnote a. Training examples consist of 64 Boolean features and a Boolean class computed from them according to a set of “IF . . . THEN . . .” rules. The curves are the average of 100 runs with different randomly generated sets of rules. Error bars are two standard deviations. See Domingos and Pazzani [10] for details.

Cross-validation can help to combat overfitting, for example by using it to choose the best size of decision tree to learn. But it is no panacea, since if we use it to make too many parameter choices it can itself start to overfit [17].

Besides cross-validation, there are many methods to combat overfitting. The most popular one is adding a regularization term to the evaluation function. This can, for example, penalize classifiers with more structure, thereby favoring smaller ones with less room to overfit. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure. These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims that a particular technique “solves” the overfitting problem. It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch).

A common misconception about overfitting is that it is caused by noise, like training examples labeled with the wrong class. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise. For instance, suppose we learn a Boolean classifier that is just the disjunction of the examples labeled “true” in the training set. (In other words, the classifier is a Boolean formula in disjunctive normal form, where each term is the conjunction of the feature values of one specific training example.) This classifier gets all the training examples right and every positive test example wrong, regardless of whether the training data is noisy or not.

The problem of multiple testing [13] is closely related to overfitting. Standard statistical tests assume that only one hypothesis is being tested, but modern learners can easily test millions before they are done. As a result what looks significant may in fact not be. For example, a mutual fund that beats the market 10 years in a row looks very impressive, until you realize that, if there are 1,000 funds and each has a 50% chance of beating the market on any given year, it is quite likely that one will succeed all 10 times just by luck. This problem can be combatted by correcting the significance tests to take the number of hypotheses into account, but this can also lead to underfitting. A better approach is to control the fraction of falsely accepted non-null hypotheses, known as the false discovery rate [3].

Intuition Fails in High Dimensions

After overfitting, the biggest problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to refer to the fact that many algorithms that work fine in low dimensions become intractable when the input is high-dimensional. But in machine learning it refers to much more. Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about $10^{-18}$ of the input space. This is what makes machine learning both necessary and hard.

More seriously, the similarity-based reasoning that machine learning algorithms depend on (explicitly or implicitly) breaks down in high dimensions. Consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just $x_1 \wedge x_2$. If there are no other features, this is an easy problem. But if there are 98 irrelevant features $x_3, \ldots, x_{100}$, the noise from them completely swamps the signal in $x_1$ and $x_2$, and nearest neighbor effectively makes random predictions.

Even more disturbing is that nearest neighbor still has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example $x_t$. If the grid is $d$-dimensional, $x_t$’s $2d$ nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of $x_t$, until the choice of nearest neighbor (and therefore of class) is effectively random.

This is only one instance of a more general problem with high dimensions: our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant “shell” around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

Building a classifier in two or three dimensions is easy; we can find a reasonable frontier between examples of different classes just by visual inspection. (It has even been said that if people could see in high dimensions machine learning would not be necessary.) But in high dimensions it is difficult to understand what is happening. This in turn makes it difficult to design a good classifier. Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class. But in fact their benefits may be outweighed by the curse of dimensionality.

Fortunately, there is an effect that partly counteracts the curse, which might be called the “blessing of non-uniformity.” In most applications examples are not spread uniformly throughout the instance space, but are concentrated on or near a lower-dimensional manifold. For example, k-nearest neighbor works quite well for handwritten digit recognition even though images of digits have one dimension per pixel, because the space of digit images is much smaller than the space of all possible images. Learners can implicitly take advantage of this lower effective dimension, or algorithms for explicitly reducing the dimensionality can be used (for example, Tenenbaum [22]).

Theoretical Guarantees Are Not What They Seem

Machine learning papers are full of theoretical guarantees. The most common type is a bound on the number of examples needed to ensure good generalization. What should you make of these guarantees? First of all, it is remarkable that they are even possible. Induction is traditionally contrasted with deduction: in deduction you can guarantee that the conclusions are correct; in induction all bets are off. Or such was the conventional wisdom for many centuries. One of the major developments of recent decades has been the realization that in fact we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.

The basic argument is remarkably simple [5]. Let’s say a classifier is bad if its true error rate is greater than ε. Then the probability that a bad classifier is consistent with n random, independent training examples is less than $(1 - \epsilon)^n$. Let b be the number of bad classifiers in the learner’s hypothesis space H. The probability that at least one of them is consistent is less than $b(1 - \epsilon)^n$, by the union bound. Assuming the learner always returns a consistent classifier, the probability that this classifier is bad is then less than $|H|(1 - \epsilon)^n$, where we have used the fact that $b \le |H|$. So if we want this probability to be less than δ, it suffices to make $n > \ln(\delta/|H|) / \ln(1 - \epsilon)$, which holds whenever $n \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$, because $-\ln(1 - \epsilon) > \epsilon$ for $0 < \epsilon < 1$.

Unfortunately, guarantees of this type have to be taken with a large grain of salt. This is because the bounds obtained in this way are usually extremely loose. The wonderful feature of the bound above is that the required number of examples only grows logarithmically with $|H|$ and $1/\delta$. Unfortunately, most interesting hypothesis spaces are doubly exponential in the number of features d, which still leaves us needing a number of examples exponential in d. For example, consider the space of Boolean functions of d Boolean variables. If there are e possible different examples, there are $2^e$ possible different functions, so since there are $2^d$ possible examples, the total number of functions is $2^{2^d}$. And even for hypothesis spaces that are “merely” exponential, the bound is still very loose, because the union bound is very pessimistic. For example, if there are 100 Boolean features and the hypothesis space is decision trees with up to 10 levels, to guarantee δ = ε = 1% in the bound above we need half a million examples. But in practice a small fraction of this suffices for accurate learning.

Further, we have to be careful about what a bound like this means. For instance, it does not say that, if your learner returned a hypothesis consistent with a particular training set, then this hypothesis probably generalizes well. What it says is that, given a large enough training set, with high probability your learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis. The bound also says nothing about how to select a good hypothesis space. It only tells us that, if the hypothesis space contains the true classifier, then the probability that the learner outputs a bad classifier decreases with training set size.



If we shrink the hypothesis space, the bound improves, but the chances that it contains the true classifier shrink also. (There are bounds for the case where the true classifier is not in the hypothesis space, but similar considerations apply to them.)

Another common type of theoretical guarantee is asymptotic: given infinite data, the learner is guaranteed to output the correct classifier. This is reassuring, but it would be rash to choose one learner over another because of its asymptotic guarantees. In practice, we are seldom in the asymptotic regime (also known as “asymptopia”). And, because of the bias-variance trade-off I discussed earlier, if learner A is better than learner B given infinite data, B is often better than A given finite data.

The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design. In this capacity, they are quite useful; indeed, the close interplay of theory and practice is one of the main reasons machine learning has made so much progress over the years. But caveat emptor: learning is a complex phenomenon, and just because a learner has a theoretical justification and works in practice does not mean the former is the reason for the latter.

Feature Engineering Is The Key

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. Learning is easy if you have many independent features that each correlate well with the class. On the other hand, if the class is a very complex function of the features, you may not be able to learn it. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and “black art” are as important as the technical stuff.

First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and preprocess it, and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating. Learning is often the quickest part of this, but that is because we have already mastered it pretty well! Feature engineering is more difficult because it is domain-specific, while learners can be largely general-purpose. However, there is no sharp frontier between the two, and this is another reason the most useful learners are those that facilitate incorporating knowledge.

Of course, one of the holy grails of machine learning is to automate more and more of the feature engineering process. One way this is often done today is by automatically generating large numbers of candidate features and selecting the best by (say) their information gain with respect to the class. But bear in mind that features that look irrelevant in isolation may be relevant in combination. For example, if the class is an XOR of k input features, each of them by itself carries no information about the class. (If you want to annoy machine learners, bring up XOR.) On the other hand, running a learner with a very large number of features to find out which ones are useful in combination may be too time-consuming, or cause overfitting. So there is ultimately no replacement for the smarts you put into feature engineering.

More Data Beats a Cleverer Algorithm

Suppose you have constructed the best set of features you can, but the classifiers you receive are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is


often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)

This does bring up another problem, however: scalability. In most of computer science, the two main limited resources are time and memory. In machine learning, there is a third one: training data. Which one is the bottleneck has changed from decade to decade. In the 1980s it tended to be data. Today it is often time. Enormous mountains of data are available, but there is not enough time to process it, so it goes unused. This leads to a paradox: even though in principle more data means that more

ers are seductive, but they are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque.

Learners can be divided into two major types: those whose representation has a fixed size, like linear classifiers, and those whose representation can grow with the data, like decision trees. (The latter are sometimes called nonparametric learners, but this is somewhat unfortunate, since they usually wind up learning many more parameters than parametric ones.) Fixed-size learners can only take advantage of so much data. (Notice how the accuracy of naive Bayes asymptotes at around 70% in Figure 2.) Variable-size learners can in principle learn any

cycles. In research papers, learners are typically compared on measures of accuracy and computational cost. But human effort saved and insight gained, although harder to measure, are often more important. This favors learners that produce human-understandable output (for example, rule sets). And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources, and learning problems easy and efficient, and where there is a close collaboration between machine learning experts and application domain ones.

Learn Many Models, Not Just One

In the early days of machine learn-

complex classifiers can be learned, in function given sufficient data, but in ing, everyone had a favorite learner,

practice simpler classifiers wind up practice they may not, because of limi- together with some a priori reasons

being used, because complex ones tations of the algorithm (for example, to believe in its superiority. Most ef-

take too long to learn. Part of the an- greedy search falls into local optima) fort went into trying many variations

swer is to come up with fast ways to or computational cost. Also, because of it and selecting the best one. Then

learn complex classifiers, and indeed of the curse of dimensionality, no ex- systematic empirical comparisons

there has been remarkable progress isting amount of data may be enough. showed that the best learner varies

in this direction (for example, Hulten For these reasons, clever algorithms— from application to application, and

and Domingos11). those that make the most of the data systems containing many different

Part of the reason using cleverer and computing resources available— learners started to appear. Effort now

algorithms has a smaller payoff than often pay off in the end, provided you went into trying many variations of

you might expect is that, to a first ap- are willing to put in the effort. There many learners, and still selecting just

proximation, they all do the same. is no sharp frontier between design- the best one. But then researchers

This is surprising when you consider ing learners and learning classifiers; noticed that, if instead of selecting

representations as different as, say, rather, any given piece of knowledge the best variation found, we combine

sets of rules and neural networks. But could be encoded in the learner or many variations, the results are bet-

in fact propositional rules are readily learned from data. So machine learn- ter—often much better—and at little

encoded as neural networks, and sim- ing projects often wind up having a extra effort for the user.

ilar relationships hold between other significant component of learner de- Creating such model ensembles is

representations. All learners essen- sign, and practitioners need to have now standard.1 In the simplest tech-

tially work by grouping nearby exam- some expertise in it.12 nique, called bagging, we simply gen-

ples into the same class; the key dif- In the end, the biggest bottleneck erate random variations of the train-

ference is in the meaning of “nearby.” is not data or CPU cycles, but human ing set by resampling, learn a classifier

With nonuniformly distributed data, on each, and combine the results by

learners can produce widely different Figure 3. Very different frontiers can yield voting. This works because it greatly

similar predictions. (+ and – are training

frontiers while still making the same examples of two classes.)

reduces variance while only slightly

predictions in the regions that matter increasing bias. In boosting, training

(those with a substantial number of examples have weights, and these are

training examples, and therefore also N. Bayes varied so that each new classifier fo-

where most test examples are likely to cuses on the examples the previous

appear). This also helps explain why kNN ones tended to get wrong. In stacking,

powerful learners can be unstable but SVM the outputs of individual classifiers

still accurate. Figure 3 illustrates this become the inputs of a “higher-level”

in 2D; the effect is much stronger in learner that figures out how best to

high dimensions. combine them.

As a rule, it pays to try the simplest Many other techniques exist, and

learners first (for example, naïve Bayes D. Tree the trend is toward larger and larger

before logistic regression, k-nearest ensembles. In the Netflix prize, teams

neighbor before support vector ma- from all over the world competed to

chines). More sophisticated learn- build the best video recommender
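The bagging recipe described above (resample the training set, learn a classifier on each resample, combine by voting) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the base learner is a one-feature threshold rule (a decision stump), the data set `TOY_DATA` is made up, and the names `train_stump`, `bagging`, and `predict_bagged` are hypothetical helpers, not anything taken from the article or from a particular library.

```python
import random
from collections import Counter

# Hypothetical toy data: (features, label). Feature 0 separates the two
# classes; feature 1 is noise. These values are invented for illustration.
TOY_DATA = [
    ((0.0, 3.0), 0), ((1.0, 7.0), 0), ((2.0, 1.0), 0), ((3.0, 9.0), 0), ((4.0, 5.0), 0),
    ((6.0, 2.0), 1), ((7.0, 8.0), 1), ((8.0, 4.0), 1), ((9.0, 6.0), 1), ((10.0, 0.0), 1),
]


def predict_stump(stump, x):
    """Predict with a stump: (feature index, threshold, label when above threshold)."""
    feature, threshold, label_above = stump
    return label_above if x[feature] > threshold else 1 - label_above


def train_stump(data):
    """Exhaustively pick the single-feature threshold rule with the fewest training errors."""
    best_errors, best_stump = None, None
    for feature in range(len(data[0][0])):
        for x, _ in data:  # candidate thresholds are the observed feature values
            for label_above in (0, 1):
                stump = (feature, x[feature], label_above)
                errors = sum(predict_stump(stump, xi) != yi for xi, yi in data)
                if best_errors is None or errors < best_errors:
                    best_errors, best_stump = errors, stump
    return best_stump


def bagging(data, n_models=25, seed=0):
    """Learn one stump per bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        resample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train_stump(resample))
    return models


def predict_bagged(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    ensemble = bagging(TOY_DATA)
    print(predict_bagged(ensemble, (-1.0, 4.0)), predict_bagged(ensemble, (12.0, 4.0)))
```

An odd number of models is used so that a two-class vote cannot tie. Any other base learner could be swapped in for the stump; only the resample-then-vote structure is what the text calls bagging.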

As the competition progressed, teams found they obtained the best results by combining their learners with other teams’, and merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results. Doubtless we will see even larger ones in the future.

Model ensembles should not be confused with Bayesian model averaging (BMA), the theoretically optimal approach to learning [4]. In BMA, predictions on new examples are made by averaging the individual predictions of all classifiers in the hypothesis space, weighted by how well the classifiers explain the training data and how much we believe in them a priori. Despite their superficial similarities, ensembles and BMA are very different. Ensembles change the hypothesis space (for example, from single decision trees to linear combinations of them), and can take a wide variety of forms. BMA assigns weights to the hypotheses in the original space according to a fixed formula. BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it [8]. A practical consequence of this is that, while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble.

Simplicity Does Not Imply Accuracy

Occam’s razor famously states that entities should not be multiplied beyond necessity. In machine learning, this is often taken to mean that, given two classifiers with the same training error, the simpler of the two will likely have the lower test error. Purported proofs of this claim appear regularly in the literature, but in fact there are many counterexamples to it, and the “no free lunch” theorems imply it cannot be true.

We saw one counterexample previously: model ensembles. The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero. Another counterexample is support vector machines, which can effectively have an infinite number of parameters without overfitting. Conversely, the function sign(sin(ax)) can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis, even though it has only one parameter [23]. Thus, contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.

A more sophisticated view instead equates complexity with the size of the hypothesis space, on the basis that smaller spaces allow hypotheses to be represented by shorter codes. Bounds like the one in the section on theoretical guarantees might then be viewed as implying that shorter hypotheses generalize better. This can be further refined by assigning shorter codes to the hypotheses in the space we have some a priori preference for. But viewing this as “proof” of a trade-off between accuracy and simplicity is circular reasoning: we made the hypotheses we prefer simpler by design, and if they are accurate it is because our preferences are accurate, not because the hypotheses are “simple” in the representation we chose.

A further complication arises from the fact that few learners search their hypothesis space exhaustively. A learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space. As Pearl [18] points out, the size of the hypothesis space is only a rough guide to what really matters for relating training and test error: the procedure by which a hypothesis is chosen.

Domingos [7] surveys the main arguments and evidence on the issue of Occam’s razor in machine learning. The conclusion is that simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy. This is probably what Occam meant in the first place.

Representable Does Not Imply Learnable

Essentially all representations used in variable-size learners have associated theorems of the form “Every function can be represented, or approximated arbitrarily closely, using this representation.” Reassured by this, fans of the representation often proceed to ignore all others. However, just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even simple functions using a fixed set of primitives often requires an infinite number of components. Further, if the hypothesis space has many local optima of the evaluation function, as is often the case, the learner may not find the true function even if it is representable. Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations. Therefore the key question is not “Can it be represented?” to which the answer is often trivial, but “Can it be learned?” And it pays to try different learners (and possibly combine them).

Some representations are exponentially more compact than others for some functions. As a result, they may also require exponentially less data to learn those functions. Many learners work by forming linear combinations of simple basis functions. For example, support vector machines form combinations of kernels centered at some of the training examples (the support vectors). Representing parity of n bits in this way requires 2^n basis functions. But using a representation with more layers (that is, more steps between input and output), parity can be encoded in a linear-size classifier. Finding methods to learn these deeper representations is one of the major research frontiers in machine learning [2].

Correlation Does Not Imply Causation

The point that correlation does not imply causation is made so often that it is perhaps not worth belaboring. But, even though learners of the kind we have been discussing can only learn correlations, their results are often treated as representing causal relations. Isn’t this wrong? If so, then why do people do it?

More often than not, the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, then perhaps putting beer next to the diaper section will increase sales. (This is a famous example in the world of data mining.) But short of actually doing the experiment it is difficult to tell. Machine learning is usually applied to observational data, where the predictive variables are not under the control of the learner, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted [19]. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation (for example, trying to understand what the causal chain might be).

Many researchers believe that causality is only a convenient fiction. For example, there is no notion of causality in physical laws. Whether or not causality really exists is a deep philosophical question with no definitive answer in sight, but there are two practical points for machine learners. First, whether or not we call them “causal,” we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example, by randomly assigning visitors to different versions of a Web site), then by all means do so [14].

Conclusion

Like any discipline, machine learning has a lot of “folk wisdom” that can be difficult to come by, but is crucial for success. This article summarized some of the most salient items. Of course, it is only a complement to the more conventional study of machine learning. Check out http://www.cs.washington.edu/homes/pedrod/class for a complete online machine learning course that combines formal and informal aspects. There is also a treasure trove of machine learning lectures at http://www.videolectures.net. A good open source machine learning toolkit is Weka [24].

Happy learning!

References

1. Bauer, E. and Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning 36 (1999), 105–142.
2. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
3. Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57 (1995), 289–300.
4. Bernardo, J.M. and Smith, A.F.M. Bayesian Theory. Wiley, NY, 1994.
5. Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.K. Occam’s razor. Information Processing Letters 24 (1987), 377–380.
6. Cohen, W.W. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence 68 (1994), 303–366.
7. Domingos, P. The role of Occam’s razor in knowledge discovery. Data Mining and Knowledge Discovery 3 (1999), 409–425.
8. Domingos, P. Bayesian averaging of classifiers and the overfitting problem. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 223–230.
9. Domingos, P. A unified bias-variance decomposition and its applications. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 231–238.
10. Domingos, P. and Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29 (1997), 103–130.
11. Hulten, G. and Domingos, P. Mining complex models from arbitrarily large databases in constant time. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Canada, 2002). ACM Press, NY, 525–531.
12. Kibler, D. and Langley, P. Machine learning as an experimental science. In Proceedings of the 3rd European Working Session on Learning (London, UK, 1988). Pitman.
13. Klockars, A.J. and Sax, G. Multiple Comparisons. Sage, Beverly Hills, CA, 1986.
14. Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
15. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A. Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute, 2011.
16. Mitchell, T.M. Machine Learning. McGraw-Hill, NY, 1997.
17. Ng, A.Y. Preventing “overfitting” of cross-validation data. In Proceedings of the 14th International Conference on Machine Learning (Nashville, TN, 1997). Morgan Kaufmann, San Mateo, CA, 245–253.
18. Pearl, J. On the connection between the complexity and credibility of inferred models. International Journal of General Systems 4 (1978), 255–264.
19. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK, 2000.
20. Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
21. Richardson, M. and Domingos, P. Markov logic networks. Machine Learning 62 (2006), 107–136.
22. Tenenbaum, J., Silva, V. and Langford, J. A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000), 2319–2323.
23. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer, NY, 1995.
24. Witten, I., Frank, E. and Hall, M. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition. Morgan Kaufmann, San Mateo, CA, 2011.
25. Wolpert, D. The lack of a priori distinctions between learning algorithms. Neural Computation 8 (1996), 1341–1390.

Pedro Domingos (pedrod@cs.washington.edu) is a professor in the Department of Computer Science and Engineering at the University of Washington, Seattle.

© 2012 ACM 0001-0782/12/10 $15.00
