
Learning About Algorithms That Learn to Learn
Cody Marie Wild
Mar 24, 2018 · 13 min read

The premise of meta learning was an intoxicating one to me when I first heard of it: the project of building machines that are not only able to learn, but are able to learn how to learn. The dreamed-of aspiration of meta learning is algorithms able to modify fundamental aspects of their architecture and parameter-space in response to signals of performance, algorithms able to leverage accumulated experience when they confront new environments. In short: when futurists weave us dreams of generally competent AI, components that fit this description are integral to those visions.

The goal of this blog post is to descend from these lofty heights, from what we imagine some abstracted self-modifying agent could do, to where the field actually is today: its successes, its limitations, and how far we are from robust multi-task intelligence.

Why can humans do what we do?
Specifically: in many reinforcement learning tasks, it takes algorithms a strikingly long time to learn tasks, relative to how long it would take humans; the current state of the art in playing Atari games takes about 83 hours (or 18 million frames) of gameplay time to reach median human performance, which most people can reach after being exposed to the game for just a few hours.


A figure from the recent Rainbow RL paper

This discrepancy leads machine learning researchers to frame the question as: what tools and capabilities does the human brain bring to such a task, and how can we conceive of those tools in statistical and information-theoretic ways? To be concrete, it seems like there are two main strategies pursued by meta learning researchers, that roughly correspond to two theories about what these tools are.

1. Learned Priors: In this lens, humans can learn new tasks quickly because we can reuse information we already learned in past tasks, like the intuitive physics of how objects move around a space, or the meta-knowledge that losing a life in a video game leads to lowered reward.

2. Learned Strategies: This is the idea that, over our lives (and perhaps over evolutionary timescales), we don't just gather object-level knowledge about the world, but also have developed a neural structure that is more efficient at taking in input and turning it into output or strategies, even in very novel environments.

Now, obviously, these two ideas aren't mutually exclusive, and there isn't even a hard and fast boundary between them: some of our hard-coded strategies for interacting with the world may be based on deep priors about the world, like the fact that (at least for all purposes relevant to this blog post) the world has a causal structure. That said, I find the ideas distinct enough that it's worth separating them under these two labels, and thinking of them as poles of a relevant axis.

Not Throwing Away My (One) Shot
Before delving into meta learning proper, it's useful to get some conceptual grounding in the related field of One Shot Learning. Where the problem of meta learning is "how can I build a model that learns new tasks quickly", the problem of One Shot Learning is "how can I build a model that can learn how to classify a class after only seeing one example of that class".

Let's think for a second about what makes the problem of One Shot Learning hard, on a conceptual level. If we try to train a vanilla model on only one example of a relevant class, it will almost certainly overfit. If a model only ever sees one drawing of, let's say, the digit 3, it won't understand what kinds of pixel variations an image can undergo and still remain essentially a 3. For example, if the model is only shown the first 3 in a lineup of differently drawn 3s, how is it to know a priori that the second 3 is an example of the same species? Couldn't it be theoretically possible that the class label we're interested in the network learning has to do with the thickness of the lines making up the digit? That seems silly to us, but with only one example of three-ness, it's not a trivial inference for the network to make.

Having more examples of 3s helps solve this problem because we can learn what features of an image define its essential three-ness — presence of two convex shapes, mostly vertical orientation — and what kinds of modifications are irrelevant — thickness of line, sharpness of angles. To succeed at one shot learning, we have to incentivize the network to learn what kinds of properties generally distinguish one number from another, without having examples of the specific allowed variances for each digit.

A common technique in one shot learning is to learn an embedding space in which calculating Euclidean similarity between the representations of two examples in that space is a good proxy for calculating whether these two examples are of the same class. Intuitively, this requires learning the internal dimensions along which class differentiation is strongest in general within this distribution (in my example: the distribution over digits), and learning how to compress and transform inputs into those most relevant dimensions.
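To make that concrete, here is a minimal sketch of what test time looks like under this scheme. The embed() function is a hypothetical stand-in for the learned embedding network (here just a fixed random projection, so the snippet runs end to end); in a real system it would be trained with something like a metric-learning or prototypical-network loss.

```python
import numpy as np

def embed(x: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a trained embedding network:
    # a fixed random projection into a 16-dimensional space.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((x.shape[-1], 16))
    return x @ W

def one_shot_classify(query: np.ndarray, support: dict) -> str:
    """Assign `query` to the class whose single support example
    is nearest in the embedding space (Euclidean distance)."""
    q = embed(query)
    dists = {label: np.linalg.norm(q - embed(x)) for label, x in support.items()}
    return min(dists, key=dists.get)

# One example per class ("one shot"), e.g. flattened 8x8 digit images.
support = {"3": np.random.rand(64), "8": np.random.rand(64)}
print(one_shot_classify(np.random.rand(64), support))
```

All of the hard work lives in how embed() was trained; the classification rule itself stays trivially simple.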

I find that having this problem in mind is a useful foundation, though instead of trying to learn how to summarize the shared information and patterns that exist in a distribution of classes, you're instead trying to learn the regularities that exist over a distribution of tasks, each with its own internal structure or objective.

If asked to construct a ranking of the meta-parameters a network can learn, from least to most abstract, it would go something like this:

1. A network learning representations useful across a full distribution of tasks by using hyperparameterized gradient descent. MAML and Reptile are good straightforward examples of this, while Meta Learning with Shared Hierarchies is an intriguing approach that learns representations as explicit subpolicies controlled by a master policy.

2. A network learning to optimize the parameters of its own gradient descent operation. These parameters are things like: learning rate, momentum, and weights on adaptive learning rate algorithms. Here, we're starting to go down the track of modifying the learning algorithm itself, but in limited, parametric ways (a small sketch of this idea follows the list). This is what Learning to Learn By Gradient Descent by Gradient Descent does. Yes, that is the real title of the paper.

3. A network that learns an inner-loop optimizer that is itself a network. That is to say: where gradient descent is used for updating the neural-optimizer-network parameters such that they perform well across tasks, but where the mapping from input data to output prediction within each single task is entirely conducted by a network, without any explicit calculations of loss or gradients. This is how both RL² and A Simple Neural Attentive Meta Learner work.
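To give a flavor of what category 2 means in practice, here is a minimal sketch (my own simplification, not the LSTM optimizer from the actual paper) that meta-learns a single inner-loop learning rate by backpropagating through one inner gradient step on a toy regression task, using PyTorch. Everything here — the toy task, the single meta-parameter, the hyperparameters — is illustrative.

```python
import torch

# Toy task family: fit y = a * x, with a different `a` per task.
def task_loss(w, a):
    x = torch.linspace(-1, 1, 20)
    return ((w * x - a * x) ** 2).mean()

log_lr = torch.zeros(1, requires_grad=True)      # meta-parameter: inner learning rate
meta_opt = torch.optim.Adam([log_lr], lr=1e-2)

for step in range(200):
    a = torch.rand(1) * 4 - 2                    # sample a task
    w = torch.zeros(1, requires_grad=True)       # fresh inner parameters
    inner_loss = task_loss(w, a)
    # One inner gradient step, kept differentiable w.r.t. the learning rate.
    g, = torch.autograd.grad(inner_loss, w, create_graph=True)
    w_adapted = w - torch.exp(log_lr) * g
    meta_loss = task_loss(w_adapted, a)          # how good was that step size?
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()

print("learned inner learning rate:", torch.exp(log_lr).item())
```

The same trick extends to momentum and per-parameter step sizes; the paper's move is to replace this hand-picked parametric form with a learned network that outputs the update itself.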

For the sake of making this post somewhat less of a behemoth, I’ll
primarily focus on 1 and 3, to illustrate the two conceptual ends of this
spectrum.

A Task By Any Other Name
Another brief aside — the last one, I promise — that I hope will clear up what might otherwise be a confusing topic. Frequently, in discussions of meta learning, you'll see mention of the idea of a "distribution of tasks". You may notice that this is a poorly defined concept, and you'd be right. There doesn't seem to be a clear standard of when a problem is one task, or a distribution over tasks. For example: should we think of ImageNet as one task — object recognition — or many — differentiating dogs is one task, differentiating cats is another? Why is playing a single Atari game one task, rather than the several tasks that make up individual levels of the game?

What I’ve been able to extract from all this is:

• The notion of a "task" is pretty convolved with which datasets happen to have been built, since it's natural to think about learning on one dataset as a single task

• For any given distribution of tasks, how different those tasks are from one another can vary dramatically (i.e. where each task is learning a sine wave of a different amplitude, vs where each task is playing a different Atari game)

• So, it's worth not just immediately saying "ah, this method can generalize over <this example distribution of tasks>, so that's a good indicator that it can generally perform well on some arbitrary different distribution of tasks". It's certainly not bad evidence in the direction of the method being effective, but it does require critical thinking to consider how much flexibility the network really had to demonstrate in order to perform well at all the tasks.

Those Which Are Inexplicably Named After Animals
In early 2017, Chelsea Finn and a team from Berkeley released a
technique called MAML: Model Agnostic Meta Learning.

In case you didn’t think the joke was intentional, turn to the “Species of MAML” section of the paper


On the line between learned strategies and learned priors, this approach leans towards the latter. The goal of this network is to train a model that, if given one gradient step update on a new task, can generalize well on that task. The pseudocode for the algorithm goes something like this (a minimal code sketch follows the list):

1. Initialize a network’s parameters, theta, randomly

2. Pick some task, t, from a distribution of tasks T. Using k examples (generally ~10) from the training set, perform one gradient step starting from the current parameters theta, giving you a final, task-adapted set of parameters.

3. Evaluate the performance of these task-adapted parameters on the task's test set.

4. Then, take the gradient of the task-t test set performance with respect to your initial parameters theta, and update those parameters based on this gradient. Circle back to step two, using your just-updated theta as the starting point in that step.
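Here's roughly what that loop looks like as a minimal second-order MAML sketch on the paper's sine-wave regression problem, in PyTorch. It uses a single task per meta-update and a single inner step; the tiny MLP, task ranges, and hyperparameters are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 40), nn.ReLU(), nn.Linear(40, 1))
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def sample_sine_task():
    amp, phase = torch.rand(1) * 4 + 0.1, torch.rand(1) * 3.14
    return lambda x: amp * torch.sin(x + phase)

def forward_with(params, x):
    # Functional forward pass so we can evaluate with "fast" (adapted) weights.
    h = torch.relu(x @ params[0].t() + params[1])
    return h @ params[2].t() + params[3]

for step in range(1000):
    task = sample_sine_task()
    params = list(net.parameters())

    # Inner loop: one gradient step on k=10 support points (theta -> theta').
    x_tr = torch.rand(10, 1) * 10 - 5
    loss_tr = ((forward_with(params, x_tr) - task(x_tr)) ** 2).mean()
    grads = torch.autograd.grad(loss_tr, params, create_graph=True)
    fast = [p - 0.01 * g for p, g in zip(params, grads)]

    # Outer loop: evaluate theta' on held-out points, backprop into theta.
    x_te = torch.rand(10, 1) * 10 - 5
    loss_te = ((forward_with(fast, x_te) - task(x_te)) ** 2).mean()
    meta_opt.zero_grad()
    loss_te.backward()
    meta_opt.step()
```

At test time you would take theta, run the same inner-loop step(s) on a handful of points from a brand new sine wave, and see how well the adapted network fits the rest of the curve.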

What is this doing? On a very abstracted level, it's finding the point in parameter space that is the closest in expectation to a good generalization point for many of the tasks in its distribution. You can think of this as forcing the model to maintain some level of uncertainty and caution in its exploration of parameter space. Simplistically: where a network that believes its gradients are fully representative of the population distribution might dive into a region of particularly low loss, MAML would be more incentivized to find a region near the cusp of multiple valleys that each contain reasonably low loss over all tasks in expectation. It's this incentive for caution that helps keep MAML from overfitting the way a model typically might when given only a small number of examples from a new task.

A more recent addition to the literature, called Reptile, came out in early 2018. As you could possibly guess from its name — a play on the earlier MAML — Reptile started from MAML's premises, but found a way of calculating its initial-parameter update loop that was more computationally efficient. Where MAML explicitly takes the gradient of the test set loss with respect to the initial parameters theta, Reptile instead just performs several steps of SGD updates on each task, and then uses the difference between the weights at the end of the updates and the initial weights as the "gradient" used to update the initial weights.
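In code, the difference from MAML is that the inner loop is just ordinary SGD, with no differentiation through it. A minimal sketch (same toy sine-wave task family as above; step sizes and step counts are illustrative, not the paper's):

```python
import copy
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 40), nn.ReLU(), nn.Linear(40, 1))
meta_step_size, inner_lr, inner_steps = 0.1, 0.01, 5

def sample_sine_task():
    amp, phase = torch.rand(1) * 4 + 0.1, torch.rand(1) * 3.14
    return lambda x: amp * torch.sin(x + phase)

for step in range(1000):
    task = sample_sine_task()
    inner_net = copy.deepcopy(net)                 # start from the current initialization
    opt = torch.optim.SGD(inner_net.parameters(), lr=inner_lr)

    # Several ordinary SGD steps on this task; no second-order terms anywhere.
    for _ in range(inner_steps):
        x = torch.rand(10, 1) * 10 - 5
        loss = ((inner_net(x) - task(x)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Move the initialization toward the adapted weights:
    # theta <- theta + eps * (theta_adapted - theta).
    with torch.no_grad():
        for p, p_adapted in zip(net.parameters(), inner_net.parameters()):
            p += meta_step_size * (p_adapted - p)
```

Because no gradients ever flow through the inner loop, each meta-update costs roughly the same as ordinary training on a single task.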


g1 here represents the gradient update you get from only performing one gradient descent step
per task

It's a little intuitively strange that this works at all, since, naively, this seems to be no different than training on all the tasks mixed together as one. However, the authors make an argument that, due to taking multiple steps of SGD for each task, the second derivatives of each task's loss function are given influence. To make this argument, they decompose their update into two parts:

1. A term pushing the result towards the “joint training loss”, i.e. the
outcome you’d get if you just trained on a mixture of tasks, and

2. A term pushing the initialization towards a point where the gradients of subsequent SGD minibatches are close to each other: i.e. the variance in gradients is low across minibatches (sketched below). The authors speculate that it's this term that causes quick learning time, because it incentivizes being in a more stable and low variance training region on each task.
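Glossing over the exact coefficients from the paper's Taylor-expansion analysis, my paraphrase of the result for a two-step inner loop is that the expected Reptile update direction looks roughly like the following, where $\bar g_1$ and $\bar g_2$ are gradients on two different minibatches of the same task, evaluated at the initialization, and $c_1, c_2 > 0$:

$$
\mathbb{E}\left[ g_{\text{Reptile}} \right]
\;\approx\;
c_1 \underbrace{\mathbb{E}\left[ \nabla_\theta \mathcal{L}(\theta) \right]}_{\text{joint-training term}}
\;-\;
c_2 \underbrace{\frac{\partial}{\partial \theta}\, \mathbb{E}\left[ \bar g_1 \cdot \bar g_2 \right]}_{\text{gradient-agreement term}}
$$

Since this direction is subtracted from theta, the second term pushes the initialization toward regions where gradients computed on different minibatches of a task agree with each other, which is another way of stating the low-variance intuition above.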

I chose the MAML/Reptile group as being representative of the "learned priors" end of things because the network, in theory, succeeds by learning internal representations that are either useful for classification across the full distribution of tasks, or representations that are close in parameter space to broadly useful representations.

To clarify this point, take a look at the figure above. It compares MAML against a network that's just pretrained, when both have been trained on a set of regression tasks comprised of sine waves of varying phase and amplitude. At this point, both are "fine tuned" on a new specific task: the curve shown in red. The purple triangles represent the data points used in the small number of gradient steps. Compared to the pretrained network, MAML has learned, for example, that sine waves have a periodic structure: at K=5, it's able to much more quickly move the lefthand peak to the correct place without actually having observed data from that region of space. While it's hard to tell whether our (somewhat pat) explanations are a perfect mechanical match to what's happening under the hood, we might infer that MAML has done a better job of figuring out the two relevant ways that sine waves differ from one another — phase and amplitude — and how to learn those representations from the data it's given.

Networks All the Way Down


For some people, even the idea of learning global priors using a known algorithm like gradient descent doesn't go far enough. Who says the learning algorithms we've devised are the most effective? Isn't it possible we could learn a better one?

This is the approach taken by RL² (Fast Reinforcement Learning via Slow Reinforcement Learning). The basic structure of this model is a Recurrent Neural Network (technically: an LSTM network). Because RNNs have the ability to store state information, and give different outputs as a function of that state, it is theoretically possible for them to learn arbitrary computable algorithms: in other words, they have the potential to be Turing complete. Using this as a foundation, the authors of RL² structured an RNN such that each "sequence" the RNN was trained on was in fact a series of episodes of experience with a given MDP (MDP = Markov Decision Process. For this explanation, you can just think of each MDP as defining a set of possible actions and the underlying rewards those actions generate in the environment). The RNN is then trained — as RNNs typically are — over many sequences, which in this case correspond to many different MDPs, and the parameters of the RNN are optimized to produce low regret over all the sequences/trials in aggregate. Regret captures how much total reward you gave up, relative to an optimal policy, over a whole set of episodes, so in addition to incentivizing the network to reach a good policy by the end of the trial, it also incentivizes quicker learning, so that fewer of your exploratory actions are taken under a bad and thus low reward policy.
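In the multi-armed bandit setting, for example, the standard definition of cumulative regret over a trial of $T$ pulls is

$$
\text{Regret}(T) \;=\; T\,\mu^{*} \;-\; \mathbb{E}\left[ \sum_{t=1}^{T} r_t \right],
$$

where $\mu^{*}$ is the expected reward of the best arm: minimizing regret means both ending up at a good policy and not wasting too many pulls getting there.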

A diagram showing the internal workings of the RNN over multiple trials, corresponding to multiple different MDPs.

At each point in a trial, the action taken by the network is a function parameterized by both a weight matrix learned over multiple tasks, and the content of the hidden state, which is updated as a function of the data and serves as a kind of dynamic parameter set. So, the RNN is learning the weights of how to update its hidden state, and the weights controlling how to leverage it, over multiple tasks. And then, in a given task, the hidden state can capture information about how certain the network is, whether it's time to explore vs exploit, etc, as a function of the data it has seen on that specific task. In that sense, the RNN is learning an algorithm that determines how to best explore the space and update its notion of a best policy, and learning that algorithm such that it performs well over a distribution of tasks. The authors compare the RL² architecture against algorithms shown to be asymptotically optimal for the tasks they're attempting, and RL² performs comparably.
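A minimal sketch of that architecture, as I read it (the RL2Policy class, dimensions, and the random stand-ins for the environment are all my own illustrative choices): the per-step input concatenates the observation with the previous action, previous reward, and an episode-done flag, and the LSTM hidden state is carried across episode boundaries within a trial and only reset when a new MDP is sampled. The outer-loop policy-gradient training that actually optimizes the weights across many trials is omitted.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class RL2Policy(nn.Module):
    """An LSTM policy whose per-step input is (observation, previous action,
    previous reward, episode-done flag); its hidden state persists across
    episodes within one trial and acts as a set of fast, task-specific
    'parameters'."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.lstm = nn.LSTMCell(obs_dim + n_actions + 2, hidden)
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, obs, prev_action, prev_reward, done, state=None):
        x = torch.cat([obs, prev_action, prev_reward, done], dim=-1)
        h, c = self.lstm(x, state)
        return Categorical(logits=self.policy_head(h)), (h, c)

# Rollout over one trial: several episodes of the *same* sampled MDP.
obs_dim, n_actions = 4, 2
policy = RL2Policy(obs_dim, n_actions)
state = None                            # reset only when a new MDP/trial begins
prev_a = torch.zeros(1, n_actions)
prev_r = torch.zeros(1, 1)
done = torch.zeros(1, 1)
for t in range(20):
    obs = torch.randn(1, obs_dim)       # stand-in for an environment observation
    dist, state = policy(obs, prev_a, prev_r, done, state)
    action = dist.sample()
    prev_a = nn.functional.one_hot(action, n_actions).float()
    prev_r = torch.rand(1, 1)           # stand-in for the reward the MDP returns
    done = torch.zeros(1, 1)            # set to 1 at episode boundaries
```

Within a trial, everything the network "learns" about this particular MDP lives in the hidden state; across trials, learning happens in the LSTM weights.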

Can We Scale This?
This is only a very compressed introduction to the field, and I'm sure that there are ideas I've missed, or concepts I've misstated. If you want an additional (and better-informed) perspective, I highly recommend this blog post by Chelsea Finn, the first author on the MAML paper.


The several-week process of trying to compress this distribution of papers together conceptually, and generate a broad understanding that applies to all of them, has left me with a set of general questions:

• How well do these methods scale to higher-diversity tasks? Most of these papers conducted their proof-of-concept tests on distributions of tasks with relatively low levels of diversity: sine curves with different parameters, multi-armed bandits with different parameters, character recognition from different languages. It's not obvious to me that performing well on these task distributions necessarily generalizes to, for example, tasks of different complexity levels and from different modalities, like image recognition combined with question answering combined with logical puzzles.
And yet, the human brain does form its priors from these highly diverse sets of tasks, transferring information about the world back and forth between them. My main question here is: will these methods work as advertised with these more diverse tasks, as long as you throw more units and compute at them? Or is there a non-linear effect at some point along the task diversity curve, such that methods that work at these low diversities just won't be effective at all at high diversities?

• How much are these approaches dependent on huge amounts of compute? Part of the reason most of these papers operated on small and simple datasets is that when each of your training runs involves an inner loop of (effectively) training a model just to produce one datapoint about the efficacy of your meta-parameters, testing can be very costly in time and compute. Given that Moore's Law seems to be slowing down recently, how possible will it be for anywhere outside of Google to do research into useful-scale versions of these algorithms, where each inner loop iteration on a hard problem might take hundreds of hours of GPU time?

• How do these methods compare to the idea of finding ways to explicitly code priors about the world? One incredibly valuable tool in the human arsenal is language. In machine learning terms, this is basically highly compressed information, embedded in a space we know how to conceptually manipulate, that we can transfer from person to person. No one human would ever be able to distill all of that knowledge from experience on their own, so I'm skeptical that we'll ever really crack the problem of models that can integrate knowledge about the world unless we figure out how to do something similar for learning algorithms.
