The goal of this blog post is to descend from these lofty heights, from what we imagine some abstracted self-modifying agent could do, to where the field actually is today: its successes, its limitations, and how far we are from robust multi-task intelligence.
Why can humans do what we do?
Specifically: in many reinforcement learning tasks, it takes algorithms a strikingly long time to learn, relative to how long it would take humans; the current state of the art in playing Atari games takes about 83 hours (or 18 million frames) of gameplay to reach median human performance, a level most people reach after just a few hours of exposure to the game.
A figure from the recent Rainbow RL paper
1. Learned Priors: In this lens, humans can learn new tasks quickly because we can reuse information we already learned in past tasks, like the intuitive physics of how objects move around a space, or the meta-knowledge that losing a life in a video game leads to lowered reward.
2. Learned Strategies: This is the idea that, over our lives (and perhaps over evolutionary timescales), we don’t just gather object-level knowledge about the world, but also develop a neural structure that is more efficient at taking in input and turning it into outputs or strategies, even in very novel environments.
Now, obviously, these two ideas aren’t mutually exclusive, and there isn’t even a hard-and-fast boundary between them: some of our hard-coded strategies for interacting with the world may be based on deep priors about it, like the fact that (at least for all purposes relevant to this blog post) the world has a causal structure. That said, I find the ideas distinct enough that it’s worth separating them under these two labels, and thinking of them as poles of a relevant axis.
Not Throwing Away My (One) Shot
Before delving into meta learning proper, it’s useful to get some conceptual grounding in the related field of One Shot Learning. Where the problem of meta learning is “how can I build a model that learns new tasks quickly”, the problem of One Shot Learning is “how can I build a model that learns to classify a new class after seeing only one example of that class”.
Let’s think for a second about what makes the problem of One Shot Learning hard, on a conceptual level. If we try to train a vanilla model on only one example of a relevant class, it will almost certainly overfit. If a model only ever sees one drawing of, let’s say, the digit 3, it won’t understand what kinds of pixel variations an image can undergo and still remain essentially a 3. For example, if the model is only shown the first 3 in this lineup, how is it to know a priori that the second 3 is an example of the same species? Couldn’t it theoretically be possible that the class label we want the network to learn has to do with the thickness of the lines making up the character? That seems silly to us, but with only one example of three-ness, it’s not a trivial inference for the network to make.
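To make the setup concrete, here is a minimal sketch of a one-shot classification episode in PyTorch. The embed() network is a hypothetical pretrained feature extractor (this post doesn’t specify one); the point is the structure of the problem: one labeled support example per class, and a query to classify.

```python
# A minimal sketch of a one-shot classification episode. The embed()
# network is a hypothetical feature extractor, not something this post
# specifies; what matters is the episode structure.
import torch
import torch.nn.functional as F

def one_shot_predict(embed, support_x, support_y, query_x):
    """support_x holds exactly one example per class, labeled by support_y.
    Classify each query by cosine similarity to the support embeddings."""
    s = F.normalize(embed(support_x), dim=1)   # (N, D): one row per class
    q = F.normalize(embed(query_x), dim=1)     # (Q, D): rows to classify
    sims = q @ s.T                             # (Q, N) cosine similarities
    return support_y[sims.argmax(dim=1)]       # nearest support example wins
```

This is roughly the shape that embedding-based approaches formalize: the burden of knowing which pixel variations keep a 3 a 3 is pushed into the learned embedding, rather than being inferred from the single example.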
For the sake of making this post somewhat less of a behemoth, I’ll primarily focus on two approaches, MAML and RNN-based meta-learners, to illustrate the two conceptual ends of this spectrum.
A Task By Any Other Name
Another brief aside — the last one, I promise — that I hope will clear up what might otherwise be a confusing topic. Frequently, in discussions of meta learning, you’ll see mention of the idea of a “distribution of tasks”. You may notice that this is a poorly defined concept, and you’d be right. There doesn’t seem to be a clear standard of when a problem is one task, or a distribution over tasks. For example: should we think of ImageNet as one task — object recognition — or many — differentiating dogs is one task, differentiating cats is another? Why is playing a single Atari game one task, rather than the several tasks that make up individual levels of the game?
• For any given distribution of tasks, how different those tasks are from one another can vary dramatically (e.g. a distribution where each task is learning a sine wave of a different amplitude, vs one where each task is playing a different Atari game).
• So, it’s worth not just immediately saying “ah, this method can generalize over <this example distribution of tasks>, so that’s a good indicator that it can generally perform well on some arbitrary different distribution of tasks”. It’s certainly not bad evidence in the direction of the method being effective, but it does require critical thinking to consider how much flexibility the network really had to demonstrate in order to perform well at all the tasks.
Those Which Are Inexplicably Named After Animals
In early 2017, Chelsea Finn and a team from Berkeley released a
technique called MAML: Model Agnostic Meta Learning.
In case you didn’t think the joke was intentional, turn to the “Species of MAML” section of the paper
On the line between learned strategies and learned priors, this approach leans towards the latter. The goal is to train a model that, given a single gradient-step update on a new task, can generalize well on that task. The pseudocode for the algorithm goes like this:
1. Initialize your network’s parameters theta, and sample a task t from your distribution of tasks.
2. Starting from theta, take one (or a small number of) gradient steps on task t’s training set, giving you task-specific parameters theta-prime.
3. Evaluate theta-prime on task t’s held-out test set.
4. Then, take the gradient of the task-t test set performance with respect to your initial parameters theta. Then update those parameters based on this gradient. Circle back to step one, using your just-updated theta as the initial theta in that step.
What is this doing? On a very abstracted level, it’s finding the point in parameter space that is the closest in expectation to a good generalization point for many of the tasks in its distribution. You can think of this as forcing the model to maintain some level of uncertainty and caution in its exploration of parameter space. Simplistically: where a network that believes its gradients are fully representative of the population distribution might dive into a region of particularly low loss, MAML would be more incentivized to find a region near the cusp of multiple valleys that each contain reasonably low loss over all tasks in expectation. It’s this incentive for caution that helps keep MAML from overfitting the way a model typically might when given only a small number of examples from a new task.
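To make this concrete, here is a minimal sketch of that loop on the paper’s sine-wave regression task, written in PyTorch. The architecture, hyperparameters, and helper names are my own illustrative choices, not the paper’s exact setup.

```python
# A minimal MAML sketch on sine-wave regression. Architecture and
# hyperparameters here are illustrative choices, not the paper's.
import math
import torch

def net(x, params):
    # A small MLP written functionally, so we can run it with the
    # post-inner-step "fast" parameters and still differentiate
    # through the inner update.
    w1, b1, w2, b2, w3, b3 = params
    h = torch.relu(x @ w1 + b1)
    h = torch.relu(h @ w2 + b2)
    return h @ w3 + b3

def init_params():
    def layer(n_in, n_out):
        w = (torch.randn(n_in, n_out) * (2.0 / n_in) ** 0.5).requires_grad_()
        b = torch.zeros(n_out, requires_grad=True)
        return w, b
    w1, b1 = layer(1, 40)
    w2, b2 = layer(40, 40)
    w3, b3 = layer(40, 1)
    return [w1, b1, w2, b2, w3, b3]

def sample_task():
    # One "task" = a sine wave with a random amplitude and phase.
    amp = float(torch.empty(1).uniform_(0.1, 5.0))
    phase = float(torch.empty(1).uniform_(0.0, math.pi))
    def draw(k):
        x = torch.empty(k, 1).uniform_(-5.0, 5.0)
        return x, amp * torch.sin(x + phase)
    return draw

params = init_params()
meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr = 0.01

for step in range(10000):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                        # a batch of sampled tasks
        draw = sample_task()
        x_tr, y_tr = draw(10)                 # task training points
        x_te, y_te = draw(10)                 # held-out points, same task
        # Inner loop: one gradient step from theta. create_graph=True
        # keeps the graph so the meta-update sees second derivatives.
        loss = ((net(x_tr, params) - y_tr) ** 2).mean()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        fast = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer loss: how well the *adapted* parameters generalize.
        meta_loss = meta_loss + ((net(x_te, fast) - y_te) ** 2).mean()
    meta_loss.backward()                      # gradient w.r.t. initial theta
    meta_opt.step()
```

The important detail is create_graph=True: the meta-update differentiates through the inner gradient step itself, which is where the second-derivative influence discussed below comes from.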
g1 here represents the gradient update you get from only performing one gradient descent step per task
It’s a little intuitively strange that this works at all, since, naively, this seems to be no different than training on all the tasks mixed together as one. However, the authors make an argument that, due to taking multiple steps of SGD for each task, the second derivatives of each task’s loss function are given influence. To show this, they decompose their update into two parts:
1. A term pushing the result towards the “joint training loss”, i.e. the outcome you’d get if you just trained on a mixture of tasks, and
2. A term that pushes the gradients computed on different batches of the same task to point in similar directions, i.e. towards within-task generalization: a gradient step taken on one batch should also improve performance on another.
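To sketch why, here is a first-order expansion in the spirit of later analyses (such as the Reptile paper); the notation is mine, not anything from this post. With a one-step inner update at learning rate α:

$$\theta' = \theta - \alpha \nabla L_{\mathrm{train}}(\theta)$$

$$\nabla_{\theta} L_{\mathrm{test}}(\theta') \approx \underbrace{\nabla L_{\mathrm{test}}(\theta)}_{\text{joint-training term}} - \alpha \, \underbrace{\nabla_{\theta}\!\left(\nabla L_{\mathrm{train}}(\theta) \cdot \nabla L_{\mathrm{test}}(\theta)\right)}_{\text{gradient-alignment term}} + O(\alpha^{2})$$

Descending this meta-gradient both lowers the ordinary loss and increases the inner product between the gradients computed on different batches of the same task, which is exactly the second term in the decomposition above.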
To clarify this point, take a look at the figure above. It compares MAML against a network that’s just pretrained, when both have been trained on a set of regression tasks composed of sine waves of varying phase and amplitude. At this point, both are “fine-tuned” on a new specific task: the curve shown in red. The purple triangles represent the data points used in the small number of gradient steps. Compared to the pretrained network, MAML has learned, for example, that sine waves have a periodic structure: at K=5, it’s able to much more quickly move the lefthand peak to the correct place without actually having observed data from that region of space. While it’s hard to tell whether our (somewhat pat) explanations are a perfect mechanical match to what’s happening under the hood, we might infer that MAML has done a better job of figuring out the two relevant ways that sine waves differ from one another — phase and amplitude — and how to learn those representations from the data it’s given.
Networks All the Way Down
For some people, even the idea of learning global priors using a known, human-designed algorithm like gradient descent doesn’t go far enough. Who says the learning algorithms we’ve devised are the most effective? Isn’t it possible we could learn a better one?
The approach here is to train a recurrent network where each input sequence of experience corresponds to a different MDP (MDP = Markov Decision Process; for this explanation, you can just think of each MDP as defining a set of possible actions and the underlying rewards those actions generate in the environment). The RNN is then trained — as RNNs typically are — over many sequences, which in this case correspond to many different MDPs, and the parameters of the RNN are optimized to produce low regret over all the sequences/trials in aggregate. Regret captures how much total reward you gave up over a set of episodes, relative to an optimal policy, so in addition to incentivizing the network to reach a good policy by the end of the trial, minimizing it also incentivizes quicker learning, so that fewer of your exploratory actions are taken under a bad, and thus low-reward, policy.
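Concretely (my notation, not the post’s): over a trial of $T$ episodes, if an optimal policy for the current MDP would earn expected reward $r^{*}$ per episode and the network actually earns $r_i$ in episode $i$, the quantity being minimized is

$$\mathrm{Regret} = \sum_{i=1}^{T} \left( r^{*} - r_i \right),$$

which is only small if the network both ends up at a good policy and gets there quickly.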
A diagram showing the internal workings of the RNN over multiple trials, corresponding to multiple different MDPs.
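Below is a toy sketch of this idea in PyTorch, using multi-armed bandits as the simplest possible family of MDPs. The GRU policy, the bandit setup, and plain REINFORCE are all illustrative stand-ins of mine, not the exact machinery of any particular paper.

```python
# A toy sketch of an RNN meta-learner on multi-armed bandits.
# The policy receives its previous action and reward as input, so any
# within-trial "learning" has to happen inside the hidden state.
import torch
import torch.nn as nn

N_ARMS, TRIAL_LEN, HIDDEN = 5, 50, 64

class MetaRNNPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # Input = one-hot of previous action + previous reward.
        self.cell = nn.GRUCell(N_ARMS + 1, HIDDEN)
        self.head = nn.Linear(HIDDEN, N_ARMS)

    def forward(self, inp, h):
        h = self.cell(inp, h)
        return self.head(h), h

policy = MetaRNNPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

for trial in range(5000):
    # Each trial is a fresh bandit: reward probabilities are resampled,
    # so doing well requires exploring early and exploiting later.
    arm_probs = torch.rand(N_ARMS)
    h = torch.zeros(1, HIDDEN)
    inp = torch.zeros(1, N_ARMS + 1)      # no previous action/reward yet
    log_probs, rewards = [], []
    for t in range(TRIAL_LEN):
        logits, h = policy(inp, h)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        reward = torch.bernoulli(arm_probs[action])
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        # Feed back what just happened, for the next timestep.
        inp = torch.zeros(1, N_ARMS + 1)
        inp[0, action] = 1.0
        inp[0, -1] = float(reward)
    # REINFORCE on total trial reward: high total reward (low regret)
    # requires both a good final policy and getting there quickly.
    total_reward = torch.stack(rewards).sum()
    baseline = 0.5 * TRIAL_LEN            # crude constant baseline
    loss = -(total_reward - baseline) * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the trial’s total reward is what gets reinforced, the network is rewarded for squeezing exploration into as few steps as possible, which is the regret intuition from above in code form.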
Can We Scale This?
This is only a very compressed introduction to the field, and I’m sure that there are ideas I’ve missed, or concepts I’ve misstated. If you want an additional (and better-informed) perspective, I highly recommend this blog post by Chelsea Finn, the first author on the MAML paper.