
Published as a conference paper at ICLR 2015

MEMORY NETWORKS

Jason Weston, Sumit Chopra & Antoine Bordes

Facebook AI Research

770 Broadway

New York, USA

{jase,spchopra,abordes}@fb.com

arXiv:1410.3916v11 [cs.AI] 29 Nov 2015

ABSTRACT

We describe a new class of learning models called memory networks. Memory networks reason with inference components combined with a long-term memory

component; they learn how to use these jointly. The long-term memory can be

read and written to, with the goal of using it for prediction. We investigate these

models in the context of question answering (QA) where the long-term memory

effectively acts as a (dynamic) knowledge base, and the output is a textual

response. We evaluate them on a large-scale QA task, and a smaller, but more

complex, toy task generated from a simulated world. In the latter, we show the

reasoning power of such models by chaining multiple supporting sentences to

answer questions that require understanding the intension of verbs.

1 INTRODUCTION

Most machine learning models lack an easy way to read and write to part of a (potentially very

large) long-term memory component, and to combine this seamlessly with inference. Hence, they

do not take advantage of one of the great assets of a modern day computer. For example, consider

the task of being told a set of facts or a story, and then having to answer questions on that subject.

In principle this could be achieved by a language modeler such as a recurrent neural network (RNN)

(Mikolov et al., 2010; Hochreiter & Schmidhuber, 1997) as these models are trained to predict the

next (set of) word(s) to output after having read a stream of words. However, their memory

(encoded by hidden states and weights) is typically too small, and is not compartmentalized enough

to accurately remember facts from the past (knowledge is compressed into dense vectors). RNNs

are known to have difficulty in performing memorization, for example the simple copying task of

outputting the same input sequence they have just read (Zaremba & Sutskever, 2014). The situation

is similar for other tasks, e.g., in the vision and audio domains a long term memory is required to

watch a movie and answer questions about it.

In this work, we introduce a class of models called memory networks that attempt to rectify this

problem. The central idea is to combine the successful learning strategies developed in the machine

learning literature for inference with a memory component that can be read and written to. The

model is then trained to learn how to operate effectively with the memory component. We introduce

the general framework in Section 2, and present a specific implementation in the text domain for

the task of question answering in Section 3. We discuss related work in Section 4, describe our

experiments in Section 5, and finally conclude in Section 6.

2 MEMORY NETWORKS

A memory network consists of a memory m (an array of objects, for example an array of vectors or an array of strings, indexed by m_i) and four (potentially learned) components I, G, O and R as follows:

I: (input feature map) – converts the incoming input to the internal feature representation.

G: (generalization) – updates old memories given the new input. We call this generalization

as there is an opportunity for the network to compress and generalize its memories at this

stage for some intended future use.

O: (output feature map) – produces a new output (in the feature representation space), given

the new input and the current memory state.

R: (response) – converts the output into the response format desired. For example, a textual

response or an action.

Given an input x (e.g., an input character, word or sentence depending on the granularity chosen, an

image or an audio signal) the flow of the model is as follows:

1. Convert x to an internal feature representation I(x).

2. Update memories m_i given the new input: m_i = G(m_i, I(x), m), ∀i.

3. Compute output features o given the new input and the memory: o = O(I(x), m).

4. Finally, decode output features o to give the final response: r = R(o).

This process is applied at both train and test time (if there is a distinction between such phases);

that is, memories are also stored at test time, but the model parameters of I, G, O and R are not updated.
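
The four-step flow above can be sketched in a few lines of Python. This is only an illustration of the control flow: the class name `MemoryNetwork` and the toy word-overlap components are assumptions made for the example, not components described in the paper.

```python
from typing import Callable, List


class MemoryNetwork:
    """Generic I/G/O/R loop; the four components are supplied by the caller."""

    def __init__(self, I: Callable, G: Callable, O: Callable, R: Callable):
        self.I, self.G, self.O, self.R = I, G, O, R
        self.memory: List = []

    def step(self, x):
        feat = self.I(x)                          # 1. input feature map
        self.memory = self.G(self.memory, feat)   # 2. update memories
        out = self.O(feat, self.memory)           # 3. output features
        return self.R(out)                        # 4. response


# Toy components (hypothetical): store every input, and answer with the
# stored item that shares the most words with the current input.
def toy_I(x):
    return x.lower().rstrip("?.").split()


def toy_G(memory, feat):
    return memory + [feat]            # write to the next free slot


def toy_O(feat, memory):
    candidates = memory[:-1]          # skip the slot just written
    if not candidates:
        return feat
    return max(candidates, key=lambda m: len(set(m) & set(feat)))


def toy_R(out):
    return " ".join(out)


net = MemoryNetwork(toy_I, toy_G, toy_O, toy_R)
net.step("Joe went to the kitchen.")
net.step("Fred went to the office.")
answer = net.step("Where is Fred?")
```

Because memories are written on every call, the same `step` function serves at train and test time; only parameter updates (absent in this toy) would differ.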

Memory networks cover a wide class of possible implementations. The components I, G, O and R

can potentially use any existing ideas from the machine learning literature, e.g., make use of your

favorite models (SVMs, decision trees, etc.).

I component: Component I can make use of standard pre-processing, e.g., parsing, coreference

and entity resolution for text inputs. It could also encode the input into an internal feature

representation, e.g., convert from text to a sparse or dense feature vector.

G component: The simplest form of G is to store I(x) in a “slot” in the memory:

$$ m_{H(x)} = I(x), \tag{1} $$

where H(.) is a function selecting the slot. That is, G updates the index H(x) of m, but all other

parts of the memory remain untouched. More sophisticated variants of G could go back and update

earlier stored memories (potentially, all memories) based on the new evidence from the current input

x. If the input is at the character or word level one could group inputs (i.e., by segmenting them into

chunks) and store each chunk in a memory slot.

If the memory is huge (e.g., consider all of Freebase or Wikipedia) one needs to organize the

memories. This can be achieved with the slot choosing function H just described: for example, it could be

designed, or trained, to store memories by entity or topic. Consequently, for efficiency at scale, G

(and O) need not operate on all memories: they can operate on only a retrieved subset of candidates

(only operating on memories that are on the right topic). We explore a simple variant of this in our

experiments.

If the memory becomes full, a procedure for “forgetting” could also be implemented by H as it

chooses which memory is replaced, e.g., H could score the utility of each memory, and overwrite

the least useful. We have not explored this experimentally yet.
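
As a purely illustrative sketch of such a slot-choosing function, the following assumes a fixed-capacity memory and a caller-supplied utility score per memory. The class `SlotMemory` and the notion of a numeric `utility` are hypothetical; the paper does not specify how utility would be computed.

```python
class SlotMemory:
    """Fixed-capacity memory; H picks the slot to write (hypothetical policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []      # stored items
        self.utility = []    # one utility score per slot (assumed given)

    def H(self):
        # Next empty slot while there is room, otherwise the least useful slot.
        if len(self.slots) < self.capacity:
            return len(self.slots)
        return min(range(self.capacity), key=lambda i: self.utility[i])

    def write(self, item, utility=0.0):
        i = self.H()
        if i == len(self.slots):
            self.slots.append(item)
            self.utility.append(utility)
        else:
            self.slots[i] = item          # "forget" the old memory
            self.utility[i] = utility
        return i


mem = SlotMemory(capacity=2)
mem.write("a", utility=0.9)
mem.write("b", utility=0.1)
replaced = mem.write("c", utility=0.5)    # overwrites the slot holding "b"
```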

O and R components: The O component is typically responsible for reading from memory and

performing inference, e.g., determining which memories are relevant for producing a good response.

The R component then produces the final response given O. For example in a question answering

setup O finds relevant memories, and then R produces the actual wording of the answer, e.g., R

could be an RNN that is conditioned on the output of O. Our hypothesis is that without conditioning

on such memories, such an RNN will perform poorly.

3 A MEMNN IMPLEMENTATION FOR TEXT

One particular instantiation of a memory network is where the components are neural networks. We

refer to these as memory neural networks (MemNNs). In this section we describe a relatively simple

implementation of a MemNN with textual input and output.


3.1 BASIC MODEL

In our basic architecture, the I module takes an input text. Let us first assume this to be a sentence:

either the statement of a fact, or a question to be answered by the system (later we will consider

word-based input sequences). The text is stored in the next available memory slot in its original

form [2], i.e., S(x) returns the next empty memory slot N: m_N = x, N = N + 1. The G module

is thus only used to store this new memory, so old memories are not updated. More sophisticated

models are described in subsequent sections.

The core of inference lies in the O and R modules. The O module produces output features by

finding k supporting memories given x. We use k up to 2, but the procedure is generalizable to

larger k. For k = 1 the highest scoring supporting memory is retrieved with:

$$ o_1 = O_1(x, m) = \arg\max_{i=1,\dots,N} s_O(x, m_i) \tag{2} $$

where sO is a function that scores the match between the pair of sentences x and mi . For the case

k = 2 we then find a second supporting memory given the first found in the previous iteration:

$$ o_2 = O_2(x, m) = \arg\max_{i=1,\dots,N} s_O([x, m_{o_1}], m_i) \tag{3} $$

where the candidate supporting memory m_i is now scored with respect to both the original

input and the first supporting memory, and square brackets denote a list [3]. The final output o is

[x, m_{o_1}, m_{o_2}], which is input to the module R.
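
The greedy two-hop retrieval of eqs. (2) and (3) can be sketched as a loop that grows the query list after each pick. The function name `retrieve_supporting` and the table-lookup scorer in the demo are assumptions for illustration; in the model, the scorer is the learned function s_O.

```python
def retrieve_supporting(x, memory, score, k=2):
    """Greedy retrieval: each hop conditions on the memories picked so far.

    `score(query, candidate)` stands in for s_O; `query` is the list
    [x, m_o1, ...] and `candidate` is one memory.
    """
    query = [x]
    picked = []
    for _ in range(k):
        best = max(range(len(memory)), key=lambda i: score(query, memory[i]))
        picked.append(best)
        query.append(memory[best])
    return picked, query   # indices o_1..o_k and the output [x, m_o1, ...]


# Demo with a hand-written scorer (hypothetical): the preferred memory
# changes once the first supporting memory has been added to the query.
first_hop = {"a": 0.0, "b": 2.0, "c": 1.0}
second_hop = {"a": 3.0, "b": 0.0, "c": 1.0}


def table_score(query, candidate):
    return first_hop[candidate] if len(query) == 1 else second_hop[candidate]


indices, output = retrieve_supporting("x", ["a", "b", "c"], table_score, k=2)
```

Setting k = 1 reduces the loop to the single argmax of eq. (2).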

Finally, R needs to produce a textual response r. The simplest response is to return mok , i.e.,

to output the previously uttered sentence we retrieved. To perform true sentence generation, one

can instead employ an RNN. In our experiments we also consider an easy to evaluate compromise

approach where we limit textual responses to be a single word (out of all the words seen by the

model) by ranking them:

$$ r = \arg\max_{w \in W} s_R([x, m_{o_1}, m_{o_2}], w) \tag{4} $$

where W is the set of all words in the dictionary, and sR is a function that scores the match.

An example task is given in Figure 1. In order to answer the question x = “Where is the milk now?”,

the O module first scores all memories, i.e., all previously seen sentences, against x to retrieve the

most relevant fact, mo1 = “Joe left the milk” in this case. Then, it would search the memory again

to find the second relevant fact given [x, mo1 ], that is mo2 = “Joe travelled to the office” (the last

place Joe went before dropping the milk). Finally, the R module using eq. (4) would score words

given [x, mo1 , mo2 ] to output r = “office”.

In our experiments, the scoring functions sO and sR have the same form, that of an embedding

model:

$$ s(x, y) = \Phi_x(x)^\top U^\top U \, \Phi_y(y), \tag{5} $$

where U is an n × D matrix, D is the number of features and n is the embedding dimension.

The role of Φx and Φy is to map the original text to the D-dimensional feature space. The simplest

feature space to choose is a bag of words representation; we choose D = 3|W| for sO, i.e., every

word in the dictionary has three different representations: one for Φy(.) and two for Φx(.), depending

on whether the words of the input arguments are from the actual input x or from the supporting

memories, so that they can be modeled differently [4]. Similarly, we used D = 3|W| for sR.

sO and sR use different weight matrices UO and UR.
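
To make the D = 3|W| layout concrete, here is a minimal pure-Python sketch of the bag-of-words feature maps and the score of eq. (5). The ordering of the three blocks (input words, supporting-memory words, candidate words) and the helper names are assumptions; the paper only states that three separate representations per word are used.

```python
def phi(words, vocab, block):
    """Bag-of-words vector of length 3|W|; `block` selects which of the
    three per-word copies is used (0: input x, 1: supporting memory,
    2: candidate y). The block order is an assumption."""
    n = len(vocab)
    v = [0.0] * (3 * n)
    for w in words:
        v[block * n + vocab[w]] += 1.0
    return v


def phi_x(x_words, supporting, vocab):
    """Features for the list [x, m_o1, ...]: input words in block 0,
    words of every supporting memory in block 1."""
    v = phi(x_words, vocab, 0)
    for m_words in supporting:
        v = [a + b for a, b in zip(v, phi(m_words, vocab, 1))]
    return v


def phi_y(y_words, vocab):
    return phi(y_words, vocab, 2)


def matvec(U, v):
    return [sum(u * e for u, e in zip(row, v)) for row in U]


def score(U, px, py):
    """s(x, y) = Phi_x(x)^T U^T U Phi_y(y) = (U Phi_x) . (U Phi_y)."""
    ex, ey = matvec(U, px), matvec(U, py)
    return sum(a * b for a, b in zip(ex, ey))
```

Since the score is linear in Φx, scoring the list [x, m_o1] equals the sum of the scores for x alone and for m_o1 alone, which is the decomposition mentioned in the footnote on list inputs.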

[2] Technically, we will be using an embedding model to represent text, so we could store the incoming input using its learned embedding vector in memory instead. The downside of such a choice is that during learning the embedding parameters are changing, and hence the stored vectors would go stale. However, at test time (where the parameters are not changing) storing as embedding vectors could make sense, as this is faster than reading the original words and then embedding them repeatedly.

[3] As we will use a bag-of-words model where both x and m_{o_1} are represented in the bag (but with two different dictionaries) this is equivalent to using the sum s_O(x, m_i) + s_O(m_{o_1}, m_i); however, a more sophisticated modeling of the inputs (e.g., with nonlinearities) may not separate into a sum.

[4] Experiments with only a single dictionary and linear embeddings performed worse (not shown). In order to model with only a single dictionary, one could consider deeper networks that transform the words dependent on their context. We leave this to future work.

Figure 1: Example “story” statements, questions and answers generated by a simple simulation.

Answering the question about the location of the milk requires comprehension of the actions “picked

up” and “left”. The questions also require comprehension of the time elements of the story, e.g., to

answer “where was Joe before the office?”.

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk.

Joe travelled to the office. Joe left the milk. Joe went to the bathroom.

Where is the milk now? A: office

Where is Joe? A: bathroom

Where was Joe before the office? A: kitchen

Training We train in a fully supervised setting where we are given desired inputs and responses,

and the supporting sentences are labeled as such in the training data (but not in the test data, where

we are given only the inputs). That is, during training we know the best choice of both max functions

in eq. (2) and (3) [5]. Training is then performed with a margin ranking loss and stochastic gradient

descent (SGD). Specifically, for a given question x with true response r and supporting sentences

mo1 and mo2 (when k = 2), we minimize over model parameters UO and UR :

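$$
\begin{aligned}
&\sum_{\bar f \neq m_{o_1}} \max\bigl(0,\ \gamma - s_O(x, m_{o_1}) + s_O(x, \bar f)\bigr) \; + && (6)\\
&\sum_{\bar f' \neq m_{o_2}} \max\bigl(0,\ \gamma - s_O([x, m_{o_1}], m_{o_2}) + s_O([x, m_{o_1}], \bar f')\bigr) \; + && (7)\\
&\sum_{\bar r \neq r} \max\bigl(0,\ \gamma - s_R([x, m_{o_1}, m_{o_2}], r) + s_R([x, m_{o_1}, m_{o_2}], \bar r)\bigr) && (8)
\end{aligned}
$$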

where f̄, f̄′ and r̄ are all other choices than the correct labels, and γ is the margin. At every step

of SGD we sample f̄, f̄′, r̄ rather than compute the whole sum for each training example, following

e.g., Weston et al. (2011).
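
A single stochastic sample of this three-term margin ranking objective might look as follows. This is a sketch under stated assumptions: `s_O` and `s_R` are passed in as plain Python callables taking a query list and a candidate, negatives are drawn uniformly, and no gradient step is shown.

```python
import random


def hinge(gamma, s_true, s_wrong):
    return max(0.0, gamma - s_true + s_wrong)


def sampled_loss(s_O, s_R, x, memory, o1, o2, r, vocab, gamma=0.1, rng=random):
    """One sampled value of the objective for k = 2.

    o1, o2 are the indices of the labeled supporting memories, r the true
    response word. One negative is drawn per term instead of summing over
    all alternatives.
    """
    f_bar = rng.choice([i for i in range(len(memory)) if i != o1])
    f_bar2 = rng.choice([i for i in range(len(memory)) if i != o2])
    r_bar = rng.choice([w for w in vocab if w != r])
    q1 = [x]
    q2 = [x, memory[o1]]
    q3 = [x, memory[o1], memory[o2]]
    return (hinge(gamma, s_O(q1, memory[o1]), s_O(q1, memory[f_bar]))
            + hinge(gamma, s_O(q2, memory[o2]), s_O(q2, memory[f_bar2]))
            + hinge(gamma, s_R(q3, r), s_R(q3, r_bar)))
```

With an untrained scorer that returns the same value for every candidate, each term contributes exactly γ; the loss reaches zero once every correct choice outscores the sampled alternative by at least the margin.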

In the case of employing an RNN for the R component of our MemNN (instead of using a single

word response as above) we replace the last term with the standard log likelihood used in a language

modeling task, where the RNN is fed the sequence [x, o1 , o2 , r]. At test time we output its prediction

r given [x, o1 , o2 ]. In contrast the absolute simplest model, that of using k = 1 and outputting the

located memory mo1 as response r, would only use the first term to train.

In the following subsections we consider some extensions of our basic model.

3.2 WORD SEQUENCES AS INPUT

If input is at the word rather than sentence level, that is words arrive in a stream (as is often done, e.g.,

with RNNs) and not already segmented as statements and questions, we need to modify the approach

we have so far described. We hence add a “segmentation” function, to be learned, which takes as

input the last sequence of words that have so far not been segmented and looks for breakpoints. When

the segmenter fires (indicates the current sequence is a segment) we write that sequence to memory,

and can then proceed as before. The segmenter is modeled similarly to our other components, as an

embedding model of the form:

$$ \mathrm{seg}(c) = W_{\mathrm{seg}}^\top U_S \, \Phi_{\mathrm{seg}}(c) \tag{9} $$

where Wseg is a vector (effectively the parameters of a linear classifier in embedding space), and c is

the sequence of input words represented as bag of words using a separate dictionary. If seg(c) > γ,

where γ is the margin, then this sequence is recognised as a segment. In this way, our MemNN has

a learning component in its write operation. We consider this segmenter a first proof of concept:

of course, one could design something much more sophisticated. Further details on the training

mechanism are given in Appendix B.
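
The write-side behaviour of the segmenter can be sketched as a streaming loop: accumulate words, score the current bag with eq. (9), and emit a segment when the score exceeds the margin. The hand-set weights in the usage note are an assumption chosen so that the segmenter fires on a full stop; a trained segmenter would learn W_seg and U_S.

```python
def seg_score(w_seg, U_S, phi_c):
    """seg(c) = W_seg^T U_S Phi_seg(c)."""
    embedded = [sum(u * p for u, p in zip(row, phi_c)) for row in U_S]
    return sum(w * e for w, e in zip(w_seg, embedded))


def segment_stream(words, vocab, w_seg, U_S, gamma):
    """Accumulate words; emit a segment whenever seg(c) > gamma."""
    segments, current = [], []
    for w in words:
        current.append(w)
        phi_c = [0.0] * len(vocab)          # bag of words of the open segment
        for c in current:
            phi_c[vocab[c]] += 1.0
        if seg_score(w_seg, U_S, phi_c) > gamma:
            segments.append(current)        # this sequence is written to memory
            current = []
    return segments, current                # `current` holds unsegmented words
```

For example, with `vocab = {"joe": 0, "went": 1, "home": 2, ".": 3}`, `U_S = [[0, 0, 0, 1]]`, `w_seg = [1.0]` and `gamma = 0.5`, the stream `joe went home . joe went .` is split after each full stop.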

[5] However, note that methods like RNNs and LSTMs cannot easily use this information.

3.3 EFFICIENT MEMORY VIA HASHING

If the set of stored memories is very large it is prohibitively expensive to score all of them as in

equations (2) and (3). Instead we explore hashing tricks to speed up lookup: hash the input I(x) into

one or more buckets and then only score memories mi that are in the same buckets. We investigated

two ways of doing hashing: (i) via hashing words; and (ii) via clustering word embeddings. For (i)

we construct as many buckets as there are words in the dictionary, then for a given sentence we hash

it into all the buckets corresponding to its words. The problem with (i) is that a memory mi will

only be considered if it shares at least one word with the input I(x). Method (ii) tries to solve this

by clustering instead. After training the embedding matrix UO , we run K-means to cluster word

vectors (UO )i , thus giving K buckets. We then hash a given sentence into all the buckets that its

individual words fall into. As word vectors tend to be close to their synonyms, they cluster together

and we thus also score those similar memories. Exact word matches between input and

memory will still be scored by definition. Choosing K controls the speed-accuracy trade-off.
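
Both hashing schemes reduce to building an inverted index from bucket ids to memory indices. The sketch below assumes whitespace tokenization and, for method (ii), a precomputed `word_to_cluster` mapping (for instance from K-means over the rows of U_O, which is not run here); both are assumptions made for the example.

```python
from collections import defaultdict


def build_word_hash(memories):
    """Method (i): one bucket per word; a memory lands in every bucket
    corresponding to one of its words."""
    buckets = defaultdict(set)
    for i, sent in enumerate(memories):
        for w in sent.lower().split():
            buckets[w].add(i)
    return buckets


def word_candidates(query, buckets):
    out = set()
    for w in query.lower().split():
        out |= buckets.get(w, set())
    return out


def build_cluster_hash(memories, word_to_cluster):
    """Method (ii): bucket by the cluster id of each word."""
    buckets = defaultdict(set)
    for i, sent in enumerate(memories):
        for w in sent.lower().split():
            if w in word_to_cluster:
                buckets[word_to_cluster[w]].add(i)
    return buckets


def cluster_candidates(query, buckets, word_to_cluster):
    out = set()
    for w in query.lower().split():
        if w in word_to_cluster:
            out |= buckets.get(word_to_cluster[w], set())
    return out
```

Only the returned candidate indices are then scored with s_O. With word hashing, a query that shares no word with a memory never retrieves it, whereas cluster hashing can still reach it through a word in the same cluster.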

3.4 MODELING WRITE TIME

We can extend our model to take into account when a memory slot was written to. This is not

important when answering questions about fixed facts (“What is the capital of France?”) but is

important when answering questions about a story, see e.g., Figure 1. One obvious way to implement

this is to add extra features to the representations Φx and Φy that encode the index j of a given

memory mj , assuming that j follows write time (i.e., no memory slot rewriting). However, that

requires dealing with absolute rather than relative time. We had more success empirically with the

following procedure: instead of scoring input, candidate pairs with s as above, learn a function on

triples sOt (x, y, y ′ ):

$$ s_{O_t}(x, y, y') = \Phi_x(x)^\top U_{O_t}^\top U_{O_t} \bigl( \Phi_y(y) - \Phi_y(y') + \Phi_t(x, y, y') \bigr). \tag{10} $$

Φt (x, y, y ′ ) uses three new features which take on the value 0 or 1: whether x is older than y, x is

older than y ′ , and y older than y ′ . (That is, we extended the dimensionality of all the Φ embeddings

by 3, and set these three dimensions to zero when not used.) Now, if sOt (x, y, y ′ ) > 0 the model

prefers y over y ′ , and if sOt (x, y, y ′ ) < 0 it prefers y ′ . The argmax of eq. (2) and (3) are replaced by

a loop over memories i = 1, . . . , N , keeping the winning memory (y or y ′ ) at each step, and always

comparing the current winner to the next memory mi . This procedure is equivalent to the argmax

before if the time features are removed. More details are given in Appendix C.
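
The loop that replaces the argmax can be sketched as a running tournament over memories. The function name `best_memory` and the tie-breaking rule (keep the current winner when the score is exactly zero) are assumptions; the text only specifies the behaviour for strictly positive and strictly negative scores.

```python
def best_memory(x, memory, s_Ot):
    """Keep a running winner y and compare it with each next memory y'.

    s_Ot(x, y, y') > 0 means the model prefers y; < 0 means it prefers y'.
    """
    winner = 0
    for i in range(1, len(memory)):
        if s_Ot(x, memory[winner], memory[i]) < 0:   # model prefers m_i
            winner = i
    return winner


# Without time features the triple score reduces to a difference of pair
# scores, and the loop returns the same index as an argmax.
pair_score = {"a": 1.0, "b": 3.0, "c": 2.0}
no_time = lambda x, y, y2: pair_score[y] - pair_score[y2]
idx = best_memory(None, ["a", "b", "c"], no_time)
```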

3.5 MODELING PREVIOUSLY UNSEEN WORDS

Even for humans who have read a lot of text, new words are continuously introduced. For example,

the first time the word “Boromir” appears in Lord of The Rings (Tolkien, 1954). How should a

machine learning model deal with this? Ideally it should work having seen only one example. A

possible way would be to use a language model: given the neighboring words, predict what the word

should be, and assume the new word is similar to that. Our proposed approach takes this idea, but

incorporates it into our networks sO and sR , rather than as a separate step.

Concretely, for each word we see, we store a bag of words it has co-occurred with, one bag for the

left context, and one for the right. Any unknown word can be represented with such features. Hence,

we increase our feature representation D from 3|W | to 5|W | to model these contexts (|W | features

for each bag). Our model learns to deal with new words during training using a kind of “dropout”

technique: d% of the time we pretend we have not seen a word before, and hence do not have an

n-dimensional embedding for that word, and represent it with the context instead.
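
A rough sketch of the context-bag idea follows. It assumes that the left and right contexts span the whole sentence, that bags are stored as sets, and that `d` is given as a fraction rather than a percentage; none of these details are specified in the text, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def build_context_bags(sentences):
    """For each word, collect the words seen to its left and to its right."""
    left, right = defaultdict(set), defaultdict(set)
    for sent in sentences:
        for i, w in enumerate(sent):
            left[w].update(sent[:i])
            right[w].update(sent[i + 1:])
    return left, right


def word_features(w, left, right, known, d=0.1, rng=random, training=True):
    """Active features for word w.

    Known words use their own word feature, except that during training
    they are treated as unseen with probability d (the "dropout" trick).
    Unseen words are represented only by their context bags.
    """
    use_context = w not in known or (training and rng.random() < d)
    if not use_context:
        return {("word", w)}
    return ({("left", c) for c in left.get(w, ())}
            | {("right", c) for c in right.get(w, ())})
```

The two extra feature families ("left" and "right") correspond to the two additional |W|-sized blocks that take D from 3|W| to 5|W|.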

3.6 EXACT MATCHES AND UNSEEN WORDS

Embedding models cannot efficiently use exact word matches due to the low dimensionality n. One

solution is to score a pair x, y with

$$ \Phi_x(x)^\top U^\top U \, \Phi_y(y) + \lambda \, \Phi_x(x)^\top \Phi_y(y) \tag{11} $$

instead. That is, add the “bag of words” matching score to the learned embedding score (with a

mixing parameter λ). Another, related way that we propose is to stay in the n-dimensional

embedding space, but to extend the feature representation D with matching features, e.g., one per

word. A matching feature indicates if a word occurs in both x and y. That is, we score with

Φx (x)⊤ U ⊤ U Φy (y, x) where Φy is actually built conditionally on x: if some of the words in y

match the words in x we set those matching features to 1. Unseen words can be modeled similarly

by using matching features on their context words. This then gives a feature space of D = 8|W |.
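
Both options can be illustrated with small helpers. For simplicity this sketch uses a single shared dictionary of size |W| for x and y, so that the raw dot product in eq. (11) counts shared words; that simplification and the helper names are assumptions, not the feature layout used in the experiments.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def bow(words, vocab):
    v = [0.0] * len(vocab)
    for w in words:
        v[vocab[w]] += 1.0
    return v


def score_with_bow(U, px, py, lam):
    """Eq. (11): learned embedding score plus lam times the raw
    bag-of-words matching score."""
    ex = [dot(row, px) for row in U]
    ey = [dot(row, py) for row in U]
    return dot(ex, ey) + lam * dot(px, py)


def matching_features(x_words, y_words, vocab):
    """One indicator per dictionary word: 1 if the word occurs in both
    x and y. These would be appended to Phi_y(y, x)."""
    m = [0.0] * len(vocab)
    for w in set(x_words) & set(y_words):
        m[vocab[w]] = 1.0
    return m
```

In the second variant the match indicators pass through U like any other feature, so the model learns how much weight each exact match deserves instead of relying on a single global λ.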

4 RELATED WORK

Classical QA methods use a set of documents as a kind of memory, and information retrieval

methods to find answers, see e.g., (Kolomiyets & Moens, 2011) and references therein. More recent

methods try instead to create a graph of facts – a knowledge base (KB) – as their memory, and map

questions to logical queries (Berant et al., 2013; 2014). Neural network and embedding approaches

have also been recently explored (Bordes et al., 2014a; Iyyer et al., 2014; Yih et al., 2014).

Compared to recent knowledge base approaches, memory networks differ in that they do not apply a

two-stage strategy: (i) apply information extraction principles first to build the KB; followed by (ii)

inference over the KB. Instead, extraction of useful information to answer a question is performed

on-the-fly over the memory which can be stored as raw text, as well as other choices such as

embedding vectors. This is potentially less brittle as the first stage of building the KB may have already

thrown away the relevant part of the original data.

Classical neural network memory models such as associative memory networks aim to provide

content-addressable memory, i.e., given a key vector to output a value vector, see e.g., Haykin (1994)

and references therein. Typically this type of memory is distributed across the whole network of

weights of the model rather than being compartmentalized into memory locations. Memory-based

learning such as nearest neighbor, on the other hand, does seek to store all (typically labeled)

examples in compartments in memory, but only uses them for finding closest labels. Memory networks

combine compartmentalized memory with neural network modules that can learn how to

(potentially successively) read and write to that memory, e.g., to perform reasoning they can iteratively

read salient facts from the memory.

However, some notable models from the 1990s attempted to include memory read and write

operations. In particular, Das et al. (1992) designed differentiable push and pop actions

called a neural network pushdown automaton. The work of Schmidhuber (1992) incorporated the

concept of two neural networks where one has very fast changing weights which can potentially be

used as memory. Schmidhuber (1993) proposed to allow a network to modify its own weights

“self-referentially”, which can also be seen as a kind of memory addressing. Finally two other relevant

works are the DISCERN model of script processing and memory (Miikkulainen, 1990) and the

NARX recurrent networks for modeling long term dependencies (Lin et al., 1996).

Our work was submitted to arXiv just before the Neural Turing Machine work of Graves et al. (2014),

which is one of the most relevant related methods. Their method also proposes to perform (sequence)

prediction using a “large, addressable memory” which can be read and written to. In their

experiments, the memory size was limited to 128 locations, whereas we consider much larger storage (up

to 14M sentences). The experimental setups are notably quite different also: whereas we focus on

language and reasoning tasks, their paper focuses on problems of sorting, copying and recall. On the

one hand their problems require considerably more complex models than the memory network

described in Section 3. On the other hand, their problems have known algorithmic solutions, whereas

(non-toy) language problems do not.

There are other recent related works. RNNSearch (Bahdanau et al., 2014) is a method of machine

translation that uses a learned alignment mechanism over the input sentence representation while

predicting an output in order to overcome poor performance on long sentences. The work of Graves

(2013) performs handwriting recognition by dynamically determining “an alignment between the text

and the pen locations” so that “it learns to decide which character to write next”. One can view these

as particular variants of memory networks where in that case the memory only extends back a single

sentence or character sequence.

Table 1: Results on the large-scale QA task of (Fader et al., 2013).

Method                       F1
(Fader et al., 2013)         0.54
(Bordes et al., 2014b)       0.73
MemNN (embedding only)       0.72
MemNN (with BoW features)    0.82

Table 2: Memory hashing results on the large-scale QA task of (Fader et al., 2013).

Method                  Embedding F1   Embedding + BoW F1   Candidates (speedup)
MemNN (no hashing)      0.72           0.82                 14M (0x)
MemNN (word hash)       0.63           0.68                 13k (1000x)
MemNN (cluster hash)    0.71           0.80                 177k (80x)

5 EXPERIMENTS

5.1 LARGE-SCALE QA

We perform experiments on the QA dataset introduced in Fader et al. (2013). It consists of 14M

statements, stored as (subject, relation, object) triples, which are stored as memories in the MemNN

model. The triples are REVERB extractions mined from the ClueWeb09 corpus and cover

diverse topics such as (milne, authored, winnie-the-pooh) and (sheep, be-afraid-of, wolf). Following

Fader et al. (2013) and Bordes et al. (2014b), training combines pseudo-labeled QA pairs made of a

question and an associated triple, and 35M pairs of paraphrased questions from WikiAnswers like

“Who wrote the Winnie the Pooh books?” and “Who is poohs creator?”.

We performed experiments in the framework of re-ranking the top returned candidate answers by

several systems measuring F1 score over the test set, following Bordes et al. (2014b). These answers

have been annotated as right or wrong by humans, whereas other answers are ignored at test time as

we do not know their label. We used a MemNN model of Section 3 with a k = 1 supporting memory,

which ends up being similar to the approach of Bordes et al. (2014b) [6]. We also tried adding the bag

of words features of Section 3.6. Time and unseen word modeling were not used. Results

are given in Table 1. The results show that MemNNs are a viable approach for large scale QA in

terms of performance. However, lookup is linear in the size of the memory, which with 14M facts is

slow. We therefore implemented the memory hashing techniques of Section 3.3 using both hashing

of words and clustered embeddings. For the latter we tried K = 1000 clusters. The results given in

Table 2 show that one can get significant speedups (∼80x) while maintaining similar performance

using the cluster-based hash. The string hash on the other hand loses performance (whilst being a

lot faster) because answers which share no words are now no longer matched.

5.2 SIMULATED WORLD QA

Similar to the approach of Bordes et al. (2010) we also built a simple simulation of 4 characters, 3

objects and 5 rooms – with characters moving around, picking up and dropping objects. The actions

are transcribed into text using a simple automated grammar, and labeled questions are generated in

a similar way. This gives a QA task on simple “stories” such as in Figure 1. The overall difficulty of

the task is that multiple statements have to be used to do inference when asking where an object is,

e.g. to answer where is the milk in Figure 1 one has to understand the meaning of the actions “picked

up” and “left” and the influence of their relative order. We generated 7k statements and 3k questions

from the simulator for training [7], and an identical number for testing, and compare MemNNs to RNNs

and LSTMs (long short term memory RNNs (Hochreiter & Schmidhuber, 1997)) on this task. To

test with sequences of words as input (Section 3.2) the statements are joined together again with a

simple grammar [8], to produce sentences that may contain multiple statements, see e.g., Figure 2.

[6] We use a larger 128 dimension for embeddings, and no fine tuning, hence the result of MemNN slightly differs from those reported in Bordes et al. (2014b).

[7] Learning curves with different numbers of training examples are given in Appendix D.

Table 3: Test accuracy on the simulation QA task.

                        Difficulty 1                                Difficulty 5
Method                  actor w/o before   actor    actor+object    actor    actor+object
RNN                     100%               60.9%    27.9%           23.8%    17.8%
LSTM                    100%               64.8%    49.1%           35.2%    29.0%
MemNN k = 1             97.8%              31.0%    24.0%           21.9%    18.5%
MemNN k = 1 (+time)     99.9%              60.2%    42.5%           60.8%    44.4%
MemNN k = 2 (+time)     100%               100%     100%            100%     99.9%

We control the complexity of the task by setting a limit on the number of time steps in the past the

entity we ask the question about was last mentioned. We try two experiments: using a limit of 1, and

of 5, i.e., if the limit is 5 then we pick a random sentence between 1-5 time steps in the past. If this

chosen sentence only mentions an actor, e.g., “Bill is in the kitchen” then we generate the question

“where is Bill?” or “where was Bill before the kitchen?”. If the sentence mentions an object, e.g.,

“Bill dropped the football” then we ask the question “where is the football?”. For the answers we

consider two options: (i) single word answers; and (ii) a simple grammar for generating true answers

in sentence form, e.g., “kitchen” for (i) and “He is in the kitchen I believe” (and other variants) for

(ii). More details on the dataset generation are given in Appendix A. Note that in the object case

the supporting statements necessary to deduce the answer may not lie in the last 5 sentences, e.g.,

in this example the answer depends on other sentences to find out where Bill actually was when he

dropped the football. In fact, in the dataset we generated necessary supporting statements can be

up to 65 sentences before (but are usually closer). For that reason, we also conducted two further

types of experiments: where we only ask questions about actors (easier) and about actors and objects

(harder). We also consider the actor-based questions without the “before” questions for the simplest

possible task (i.e. “where is Bill?” but not “where was Bill before the kitchen?” questions).
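
To see why object questions can need supporting statements far in the past, it helps to write down how a ground-truth answer would be derived from the story. The sketch below is a hypothetical state tracker, not the paper's simulator (see Appendix A of the paper for that): statements are assumed to be pre-parsed into (actor, verb, argument) triples with the verbs "go", "pick" and "drop".

```python
def answer_where(statements, target):
    """Replay the story and return the current location of an actor or object."""
    actor_loc = {}   # actor  -> room
    holder = {}      # object -> actor currently carrying it
    obj_loc = {}     # object -> room where it was last dropped
    for actor, verb, arg in statements:
        if verb == "go":
            actor_loc[actor] = arg
        elif verb == "pick":
            holder[arg] = actor
        elif verb == "drop":
            holder.pop(arg, None)
            obj_loc[arg] = actor_loc.get(actor)   # needs the actor's last move
    if target in actor_loc:
        return actor_loc[target]
    if target in holder:                           # carried objects follow the actor
        return actor_loc.get(holder[target])
    return obj_loc.get(target)


story = [
    ("Joe", "go", "kitchen"), ("Fred", "go", "kitchen"), ("Joe", "pick", "milk"),
    ("Joe", "go", "office"), ("Joe", "drop", "milk"), ("Joe", "go", "bathroom"),
]
milk_location = answer_where(story, "milk")
joe_location = answer_where(story, "Joe")
```

Answering the milk question uses two statements (the drop and the preceding move), which mirrors the k = 2 supporting memories retrieved by the O module.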

Figure 2: Sample test set predictions (in red) for the simulation in the setting of word-based input

and where answers are sentences and an LSTM is used as the R component of the MemNN.

Joe went to the garden then Fred picked up the milk; Joe moved to the bathroom and Fred dropped the

milk, and then Dan moved to the living room.

Where is Dan? A: living room I believe

Where is Joe? A: the bathroom

Fred moved to the bedroom and Joe went to the kitchen then Joe took the milk there and Dan journeyed

to the bedroom; Joe discarded the milk.

Where is the milk now ? A: the milk is in the kitchen

Where is Dan now? A: I think he is in the bedroom

Joe took the milk there, after that Mike travelled to the office, then Joe went to the living room, next Dan

went back to the kitchen and Joe travelled to the office.

Where is Joe now? A: I think Joe is in the office

For the baseline RNN and LSTM systems we perform language modeling with backpropagation

through time (Mikolov et al., 2010), but where we backprop only on answer words [9]. We optimized

the hyperparameters: size of the hidden layer, BPTT steps, and learning rate for each dataset. For

MemNNs we fixed the embedding dimension to 100, learning rate to 0.01 and margin γ to 0.1, and

trained for 10 epochs in all experiments.

[8] We also tried the same kind of experiments with sentence-level rather than word-sequence input, without joining sentences, giving results with similar overall conclusions, see Appendix E.

[9] We tried using standard language modeling on the questions as well, with slightly worse results.

Figure 3: An example story with questions correctly answered by a MemNN. The MemNN was

trained on the simulation described in Section 5.2 and had never seen many of these words before,

e.g., Bilbo, Frodo and Gollum.

Bilbo travelled to the cave. Gollum dropped the ring there. Bilbo took the ring.

Bilbo went back to the Shire. Bilbo left the ring there. Frodo got the ring.

Frodo journeyed to Mount-Doom. Frodo dropped the ring there. Sauron died.

Frodo went back to the Shire. Bilbo travelled to the Grey-havens. The End.

Where is the ring? A: Mount-Doom

Where is Bilbo now? A: Grey-havens

Where is Frodo now? A: Shire

Figure 4: An example dialogue with a MemNN system trained on both the simulation data and

the large-scale QA data. The system is able to (attempt to) answer questions about general world

knowledge and about specific story-based statements in the dialogue.

Fred went to the kitchen. Fred picked up the milk. Fred travelled to the office.

Where is the milk ? A: office

Where does milk come from ? A: milk come from cow

What is a cow a type of ? A: cow be female of cattle

Where are cattle found ? A: cattle farm become widespread in brazil

What does milk taste like ? A: milk taste like milk

What does milk go well with ? A: milk go with coffee

Where was Fred before the office ? A: kitchen

Results The results for the single word answer setting (i) are given in Table 3. For the actor-only

tasks, RNN and LSTMs solve the simpler difficulty level 1 task without before questions (“w/o

before”), but perform worse with before questions, and even worse on the difficulty 5 tasks. This

demonstrates that the poor performance of the RNN is due to its failure to encode long(er)-term

memory. This would likely deteriorate even further with higher difficulty levels (distances). LSTMs

are however better than RNNs, as expected, as they are designed with a more sophisticated memory

model, but still have trouble remembering sentences too far in the past. MemNNs do not have

this memory limitation and its mistakes are instead due to incorrect usage of its memory, when the

wrong statement is picked by sO . Time features are necessary for good performance on before

questions or difficulty > 1 (i.e., when the answer is not in the last statement), otherwise sO can pick

a statement about a person’s whereabouts but they have since moved. Finally, results on the harder

actor+object task indicate that MemNN also successfully perform 2-stage inference using k = 2,

whereas MemNNs without such inference (with k = 1) and RNNs and LSTMs fail.

We also tested MemNNs in the multi-word answer setting (ii) with similar results, whereby

MemNNs outperform RNNs and LSTMs, which are detailed in Appendix F. Example test prediction

output demonstrating the model in that setting is given in Figure 2.

We then tested the ability of MemNNs to deal with previously unseen words at test time using the unseen word modeling approach of Sections 3.5 and 3.6. We trained the MemNN on the same simulated dataset as before and tested on the story given in Figure 3. This story is generated using structures similar to those in the simulation data, except that the nouns are unknown to the system at training time. Despite never having seen any of the Lord of the Rings specific words before (e.g., Bilbo, Frodo, Sauron, Gollum, Shire and Mount-Doom), MemNNs are able to correctly answer the questions.

MemNNs can discover simple linguistic patterns based on verbal forms such as (X, dropped, Y), (X, took, Y) or (X, journeyed to, Y) and can successfully generalize the meaning of their instantiations using unknown words to perform 2-stage inference. Without the unseen word modeling described in Section 3.5, they completely fail on this task.

Combining simulated world learning with real-world data might be one way to show the power and generality of the models we design. We implemented a naive setup towards that goal: we took the two models from Sections 5.1 and 5.2, trained on large-scale QA and simulated data respectively, and built an ensemble of the two. We present the input to both systems and then, for each question, simply output whichever of the two responses has the higher score. This allows us to perform simple dialogues with our combined MemNN system. The system is then capable of answering both general knowledge questions and questions about specific statements in the preceding dialogue. An example dialogue trace is given in Fig. 4. Some answers appear fine, whereas others are nonsensical. Future work should combine these models more effectively, for example by training a single model on both tasks directly in a multitask fashion.
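The ensembling rule described above can be sketched in a few lines of Python. This is only an illustration: the model callables, their `(response, score)` return convention, the stub scores, and the assumption that scores from the two models are directly comparable are all hypothetical and are not specified in the text.

```python
def ensemble_answer(question, models):
    """Return the response of whichever model scores its own answer highest.

    `models` is a list of callables mapping a question string to a
    (response, score) pair. This interface is a hypothetical stand-in
    for the two trained MemNNs.
    """
    candidates = [model(question) for model in models]
    best_response, _ = max(candidates, key=lambda pair: pair[1])
    return best_response


# Stub models with made-up scores, standing in for the simulation-trained
# MemNN and the large-scale QA MemNN.
def story_model(question):
    return ("kitchen", 0.8 if "Fred" in question else 0.1)


def kb_model(question):
    return ("milk come from cow", 0.7 if "come from" in question else 0.2)
```

Because the rule only compares raw scores, it implicitly assumes both models score on a common scale, which may be one reason some answers in the dialogue are nonsensical.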

In this paper we introduced a powerful class of models, memory networks, and showed one instantiation for QA. Future work should develop MemNNs for text further, evaluating them on harder QA and open-domain machine comprehension tasks (Richardson et al., 2013). For example, large scale QA tasks that require multi-hop inference, such as WebQuestions (Berant et al., 2013), should also be tried. More complex simulation data could also be constructed in order to bridge that gap, e.g., requiring coreference, involving more verbs and nouns, sentences with more structure, and requiring more temporal and causal understanding. More sophisticated architectures should also be explored in order to deal with these tasks, e.g., using more sophisticated memory management via G and more sophisticated sentence representations. Weakly supervised settings are also very important and should be explored, as many datasets only have supervision in the form of question-answer pairs, without the supporting facts that we used here. Finally, we believe this class of models is much richer than the one specific variant we detail here, which is the only one we have explored so far. Memory networks should be applied to other text tasks, and to other domains, such as vision, as well.

ACKNOWLEDGMENTS

We thank Tomas Mikolov for useful discussions.

REFERENCES

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly

learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Berant, Jonathan, Chou, Andrew, Frostig, Roy, and Liang, Percy. Semantic parsing on freebase from

question-answer pairs. In EMNLP, pp. 1533–1544, 2013.

Berant, Jonathan, Srikumar, Vivek, Chen, Pei-Chun, Huang, Brad, Manning, Christopher D, Van-

der Linden, Abby, Harding, Brittany, and Clark, Peter. Modeling biological processes for reading

comprehension. In Proc. EMNLP, 2014.

Bordes, Antoine, Usunier, Nicolas, Collobert, Ronan, and Weston, Jason. Towards understanding

situated natural language. In AISTATS, 2010.

Bordes, Antoine, Chopra, Sumit, and Weston, Jason. Question answering with subgraph embed-

dings. In Proc. EMNLP, 2014a.

Bordes, Antoine, Weston, Jason, and Usunier, Nicolas. Open question answering with weakly su-

pervised embedding models. ECML-PKDD, 2014b.

Das, Sreerupa, Giles, C Lee, and Sun, Guo-Zheng. Learning context-free grammars: Capabilities

and limitations of a recurrent neural network with an external stack memory. In Proceedings of

The Fourteenth Annual Conference of Cognitive Science Society. Indiana University, 1992.

Fader, Anthony, Zettlemoyer, Luke, and Etzioni, Oren. Paraphrase-driven learning for open question

answering. In ACL, pp. 1608–1618, 2013.

Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint

arXiv:1308.0850, 2013.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint

arXiv:1410.5401, 2014.

Haykin, Simon. Neural networks: A comprehensive foundation. 1994.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):

1735–1780, 1997.

Iyyer, Mohit, Boyd-Graber, Jordan, Claudino, Leonardo, Socher, Richard, and Daumé III, Hal. A neural network for factoid question answering over paragraphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 633–644, 2014.

Kolomiyets, Oleksandr and Moens, Marie-Francine. A survey on question answering technology

from an information retrieval perspective. Information Sciences, 181(24):5412–5434, 2011.

Lin, Tsungnam, Horne, Bil G, Tiňo, Peter, and Giles, C Lee. Learning long-term dependencies in

narx recurrent neural networks. Neural Networks, IEEE Transactions on, 7(6):1329–1338, 1996.

Miikkulainen, Risto. DISCERN: A distributed artificial neural network model of script processing and memory. 1990.

Mikolov, Tomas, Karafiát, Martin, Burget, Lukas, Cernockỳ, Jan, and Khudanpur, Sanjeev. Recur-

rent neural network based language model. In Interspeech, pp. 1045–1048, 2010.

Richardson, Matthew, Burges, Christopher JC, and Renshaw, Erin. Mctest: A challenge dataset for

the open-domain machine comprehension of text. In EMNLP, pp. 193–203, 2013.

Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recur-

rent networks. Neural Computation, 4(1):131–139, 1992.

Schmidhuber, Jürgen. A self-referential weight matrix. In ICANN93, pp. 446–450. Springer, 1993.

Tolkien, John Ronald Reuel. The Fellowship of the Ring. George Allen & Unwin, 1954.

Weston, Jason, Bengio, Samy, and Usunier, Nicolas. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Three, pp. 2764–2770. AAAI Press, 2011.

Yih, Wen-Tau, He, Xiaodong, and Meek, Christopher. Semantic parsing for single-relation question

answering. In Proceedings of ACL. Association for Computational Linguistics, June 2014. URL

http://research.microsoft.com/apps/pubs/default.aspx?id=214353.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.

Aim We have built a simple simulation which behaves much like a classic text adventure game.

The idea is that generating text within this simulation allows us to ground the language used.

Some comments about our intent:

• Firstly, while this simulation currently covers only a very small part of the language and understanding that a model must learn on the way to full language understanding, we believe that performing well on this kind of task is a prerequisite for models to work in real-world environments.

• Secondly, our aim is to make this simulation more complex and to release improved versions over time. Hopefully it can then scale up to evaluate more and more useful properties.

Currently, tasks within the simulation are restricted to question answering tasks about the location of people and objects. However, we envisage that other tasks should be possible, including asking the learner to perform actions within the simulation (“Please pick up the milk”, “Please find John and give him the milk”) and asking the learner to describe actions (“What did John just do?”).

go <location>, get <object>, get <object1> from <object2>,

put <object1> in/on <object2>, give <object> to <actor>,

drop <object>, look, inventory, examine <object>.

There is a set of constraints on those actions. For example, an actor cannot get something that they or someone else already has, cannot go to a place they are already at, cannot drop something they do not already have, and so on.

Executing Actions and Asking Questions Using the underlying actions and their constraints, there is then a (hand-built) model that defines how actors act. Currently this is very simple: they try to make a random valid action, at the moment restricted either to go alone or to go, get and drop, depending on which of two types of experiments we are running: (i) actor; or (ii) actor + object.

If we write these actions down in text form this gives us a very simple “story” which is executable

by the simulation, e.g., joe go kitchen; fred go kitchen; joe get milk; joe go office; joe drop milk;

joe go bathroom. This example corresponds to the story given in Figure 1. The system can then ask

questions about the state of the simulation e.g., where milk?, where joe?, where joe before office? It

is easy to calculate the true answers for these questions as we have access to the underlying world.

What remains is to convert both the statements and the questions to look more like natural language.
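As an illustration of how such an executable story might be run, here is a minimal Python sketch. The `World` class, its method names and its state representation are assumptions made for this example and are not the authors' simulator. Only go, get and drop are handled, and the rule that an actor must be in the same place as an object to get it is an additional assumption, as is the milk starting in the kitchen.

```python
class World:
    """Toy world tracking actor locations and object holders (hypothetical design)."""

    def __init__(self, object_locations):
        self.actor_loc = {}                       # actor -> current location
        self.actor_history = {}                   # actor -> locations visited, in order
        self.object_loc = dict(object_locations)  # object -> location when not held
        self.holder = {}                          # object -> actor holding it

    def step(self, command):
        """Execute one command such as 'joe go kitchen'; return False if invalid."""
        actor, verb, arg = command.split()
        if verb == "go":
            if self.actor_loc.get(actor) == arg:   # already at that place
                return False
            self.actor_loc[actor] = arg
            self.actor_history.setdefault(actor, []).append(arg)
        elif verb == "get":
            if arg in self.holder or arg not in self.object_loc:
                return False                       # already held, or unknown object
            if self.object_loc[arg] != self.actor_loc.get(actor):
                return False                       # assumed rule: must be co-located
            self.holder[arg] = actor
        elif verb == "drop":
            if self.holder.get(arg) != actor:      # cannot drop what is not held
                return False
            del self.holder[arg]
            self.object_loc[arg] = self.actor_loc[actor]
        else:
            return False
        return True

    def where(self, name):
        """Current location of an object (following its holder) or of an actor."""
        if name in self.holder:
            return self.actor_loc[self.holder[name]]
        if name in self.object_loc:
            return self.object_loc[name]
        return self.actor_loc.get(name)

    def where_before(self, actor, location):
        """Location the actor was in just before last arriving at `location`."""
        history = self.actor_history.get(actor, [])
        for i in range(len(history) - 1, 0, -1):
            if history[i] == location:
                return history[i - 1]
        return None


world = World({"milk": "kitchen"})  # initial object placement is an assumption
story = "joe go kitchen; fred go kitchen; joe get milk; joe go office; joe drop milk; joe go bathroom"
for command in story.split("; "):
    world.step(command)
```

Since the full world state is available, the true answers to where milk?, where joe? and where joe before office? can be read directly from `world.where(...)` and `world.where_before(...)`.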

Simple Grammar For Generating Language In order to produce more natural looking text with lexical variety we built a simple automated grammar. Each verb is assigned a set of synonyms, e.g., the simulation command get is replaced with either picked up, got, grabbed or took, and drop is replaced with either dropped, left, discarded or put down. Similarly, each object and actor can have a set of replacement synonyms as well, although currently there is no ambiguity there in our experiments: we simply add articles or not. We do add lexical variation to questions, e.g., “Where is John ?” or “Where is John now ?”.
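The synonym substitution described above might look roughly like the following sketch. The function names are hypothetical, the `go` synonyms are simply verbs that appear in the example stories rather than a stated list, and always adding "the" before the noun is a simplification of "we simply add articles or not".

```python
import random

# Synonyms for `get` and `drop` are those quoted in the text; the `go` entries
# are verbs seen in the example stories (an assumption about the full set).
VERB_SYNONYMS = {
    "get": ["picked up", "got", "grabbed", "took"],
    "drop": ["dropped", "left", "discarded", "put down"],
    "go": ["went to", "travelled to", "journeyed to", "moved to", "went back to"],
}


def render_statement(command, rng):
    """Turn a command such as 'joe get milk' into an English statement."""
    actor, verb, arg = command.split()
    phrase = rng.choice(VERB_SYNONYMS[verb])
    return f"{actor.capitalize()} {phrase} the {arg}"


def render_question(name, rng):
    """Ask for a location, with or without a trailing 'now'."""
    suffix = rng.choice(["", " now"])
    return f"Where is {name}{suffix} ?"
```

Passing an explicit `random.Random` instance keeps the generated text reproducible for a fixed seed.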

Joining Statements Finally, for the word sequence training setting, we join the statements above

into compound sentences. To do this we simply take the set of statements and then join them

randomly with one of the following: “.”, “and”, “then”, “, then”, “;”, “, later”, “, after that”, “, and

then”, or “, next”. Example output can be seen in Figure 2.
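A minimal sketch of the joining step, using the connector list given above. The spacing rule (no space before connectors that begin with punctuation) and the final full stop are assumptions about details the text does not spell out.

```python
import random

CONNECTORS = [".", "and", "then", ", then", ";", ", later",
              ", after that", ", and then", ", next"]


def join_statements(statements, rng):
    """Join statements into one word stream with randomly chosen connectors."""
    text = statements[0]
    for nxt in statements[1:]:
        connector = rng.choice(CONNECTORS)
        # Connectors that start with punctuation attach directly to the
        # previous word; plain words such as "and" need a leading space.
        separator = "" if connector[0] in ".,;" else " "
        text += separator + connector + " " + nxt
    return text + "."


example = join_statements(
    ["Joe went to the garden", "Fred picked up the milk", "Dan moved to the living room"],
    random.Random(1),
)
```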

Issues There are a great many aspects of language not yet modeled. For example, currently coref-

erence is not modeled (e.g., “He picked up the milk”) and similarly there are no compound noun

phrases (“John and Fred went to the kitchen”). Some of these seem easy to add to the simulation.

The hope is that adding these complexities will help evaluate models in a controlled way, within the

simulated environment, which is hard to do with real data. Of course, this is not a substitute for real

data which our models should be applied to as well, but does serve as a useful testbed.

For segmenting an input word stream as generated in Appendix A we use a segmenter of the form:

$$\mathrm{seg}(c) = W_{\mathrm{seg}}^\top U_S \Phi_{\mathrm{seg}}(c)$$

where $W_{\mathrm{seg}}$ is a vector (effectively the parameters of a linear classifier in embedding space). As we are already in the fully supervised setting, where for each question in the training set we are given the answer and the supporting facts from the input stream, we can also use that supervision for the segmenter as well. That is, for any known supporting fact, such as “Bill is in the Kitchen” for the question “Where is Bill?” we wish the segmenter to fire for such a statement, but not for unfinished statements such as “Bill is in the”. We can thus write our training criterion for segmentation as the minimization of:

$$\sum_{f \in F} \max(0, \gamma - \mathrm{seg}(f)) + \sum_{\bar{f} \in \bar{F}} \max(0, \gamma + \mathrm{seg}(\bar{f})) \qquad (12)$$

where $F$ are all known supporting segments in the labeled training set, and $\bar{F}$ are all other segments in the training set.
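To make the criterion concrete, the following is a small pure-Python sketch of the segmenter score and the margin loss of eq. (12). The function names, the list-of-rows representation of $U_S$, and the default margin of 0.1 (the value reported for MemNN training, which may or may not be the one used for the segmenter) are assumptions made for illustration.

```python
def seg_score(w_seg, U_S, phi):
    """seg(c) = w_seg^T (U_S phi), with U_S given as a list of rows."""
    embedded = [sum(u * p for u, p in zip(row, phi)) for row in U_S]
    return sum(w * e for w, e in zip(w_seg, embedded))


def segmentation_loss(w_seg, U_S, positives, negatives, gamma=0.1):
    """Sum of hinge terms over supporting segments and all other segments.

    `positives` holds feature vectors of known supporting segments (F) and
    `negatives` those of all other segments (F-bar).
    """
    loss = 0.0
    for phi in positives:   # supporting segments should score at least +gamma
        loss += max(0.0, gamma - seg_score(w_seg, U_S, phi))
    for phi in negatives:   # other segments should score at most -gamma
        loss += max(0.0, gamma + seg_score(w_seg, U_S, phi))
    return loss
```

The loss is zero exactly when every supporting segment scores at least γ and every other segment scores at most −γ.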

The training procedure to take into account modeling write time is slightly different to that described

in Section 3.1. Write time features are important so that the MemNN knows when each memory

was written, and hence knows the ordering of statements that comprise a story or dialogue. Note

that this is different to time information described in the text of a statement, such as the tense of a

statement, or statements containing time expressions, e.g., “He went to the office yesterday”. For

such cases, write time features are not directly necessary, and they could (potentially) be modeled

directly from the text.

As described in Section 3.4, we add three write time features to the model and score triples using:

$$s_{O_t}(x, y, y') = \Phi_x(x)^\top U_{O_t}^\top U_{O_t} \left( \Phi_y(y) - \Phi_y(y') + \Phi_t(x, y, y') \right). \qquad (13)$$

If $s_{O_t}(x, y, y') > 0$ the model prefers $y$ over $y'$, and if $s_{O_t}(x, y, y') < 0$ it prefers $y'$. The argmax of eqs. (2) and (3) is replaced by a loop over memories $i = 1, \dots, N$, keeping the winning memory ($y$ or $y'$) at each step, and always comparing the current winner to the next memory $m_i$. That is, at inference time, for a $k = 2$ model the arg max functions of eqs. (2) and (3) are replaced with $o_1 = O_t(x, m)$ and $o_2 = O_t([x, m_{o_1}], m)$ where $O_t$ is defined in Algorithm 1 below.

function O_t(q, m)
    t ← 1
    for i = 2, . . . , N do
        if s_{O_t}(q, m_i, m_t) > 0 then
            t ← i
        end if
    end for
    return t
end function
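The pairwise loop of Algorithm 1 translates directly into Python. In this sketch the scoring function is passed in as a callable, indices are 0-based rather than 1-based, and the name `select_memory` is a hypothetical one chosen for the example.

```python
def select_memory(score, q, memories):
    """Return the index of the winning memory under a pairwise scorer.

    `score(q, y, y_prime)` plays the role of s_{O_t}: a positive value means
    y is preferred over y_prime. Indices are 0-based here, unlike Algorithm 1.
    """
    t = 0
    for i in range(1, len(memories)):
        # Compare the next memory against the current winner.
        if score(q, memories[i], memories[t]) > 0:
            t = i
    return t
```

Each memory is compared only against the current winner, so the loop makes N − 1 calls to the scorer; on an exact tie (score of 0) the earlier memory is kept.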

$\Phi_t(x, y, y')$ uses three new features which take on the value 0 or 1: whether $x$ is older than $y$, whether $x$ is older than $y'$, and whether $y$ is older than $y'$. When finding the second supporting memory (computing $O_t([x, m_{o_1}], m)$) we encode whether $m_{o_1}$ is older than $y$, whether $m_{o_1}$ is older than $y'$, and whether $y$ is older than $y'$, so that the first two features capture the relative age of the first supporting memory w.r.t. the second one. Note that when finding the first supporting memory (i.e., for $O_t(x, m)$) the first two features are useless, as $x$ is the last thing in the memory and hence $y$ and $y'$ are always older.
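One simple way to realise these three binary features is sketched below. Representing age by integer write timestamps (a smaller value means written earlier, hence older) is an assumption made for this example.

```python
def time_features(t_x, t_y, t_y_prime):
    """Three 0/1 features: x older than y, x older than y', y older than y'.

    Arguments are write timestamps; a smaller timestamp means an older item.
    For the second supporting memory, pass the timestamp of m_{o_1} as t_x.
    """
    return [int(t_x < t_y), int(t_x < t_y_prime), int(t_y < t_y_prime)]
```

When the question is the most recently written item, the first two features are always 0, which matches the remark that they are uninformative when finding the first supporting memory.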

To train our model with write time features we need to replace the hinge loss in eqs. (6)-(7) with a loss that matches Algorithm 1. To do this, we instead minimize:

$$
\begin{aligned}
&\sum_{\bar{f} \neq m_{o_1}} \max(0, \gamma - s_{O_t}(x, m_{o_1}, \bar{f})) + \sum_{\bar{f} \neq m_{o_1}} \max(0, \gamma + s_{O_t}(x, \bar{f}, m_{o_1})) \; + \\
&\sum_{\bar{f}' \neq m_{o_2}} \max(0, \gamma - s_{O_t}([x, m_{o_1}], m_{o_2}, \bar{f}')) + \sum_{\bar{f}' \neq m_{o_2}} \max(0, \gamma + s_{O_t}([x, m_{o_1}], \bar{f}', m_{o_2})) \; + \\
&\sum_{\bar{r} \neq r} \max(0, \gamma - s_R([x, m_{o_1}, m_{o_2}], r) + s_R([x, m_{o_1}, m_{o_2}], \bar{r}))
\end{aligned}
$$

The last term is the same as in eq. (8) and is for the final ranking of words to return a response, which remains unchanged (as usual, this can also be replaced by an RNN for a more sophisticated model). Terms 1-4 replace eqs. (6)-(7) by considering triples directly. For both $m_{o_1}$ and $m_{o_2}$ we need to have two terms considering them as the second or third argument to $s_{O_t}$ as they may appear on either side during inference (via Algorithm 1). As before, at every step of SGD we sample $\bar{f}, \bar{f}', \bar{r}$ rather than compute the whole sum for each training example.
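The per-sample version of this objective, with one sampled negative for each sum, can be written as the sketch below. The scorers `s_ot` and `s_r` are passed in as callables, and all names as well as the default margin are illustrative assumptions rather than the authors' code.

```python
def sampled_time_loss(s_ot, s_r, x, m_o1, m_o2, r, f_bar, f_bar2, r_bar, gamma=0.1):
    """Five hinge terms for one training example with sampled negatives.

    f_bar is a sampled memory other than m_o1, f_bar2 a sampled memory other
    than m_o2, and r_bar a sampled response other than r.
    """
    x1 = [x, m_o1]
    x2 = [x, m_o1, m_o2]
    return (
        # m_o1 as second argument, then as third argument
        max(0.0, gamma - s_ot(x, m_o1, f_bar))
        + max(0.0, gamma + s_ot(x, f_bar, m_o1))
        # m_o2 as second argument, then as third argument
        + max(0.0, gamma - s_ot(x1, m_o2, f_bar2))
        + max(0.0, gamma + s_ot(x1, f_bar2, m_o2))
        # response ranking term, unchanged from the original objective
        + max(0.0, gamma - s_r(x2, r) + s_r(x2, r_bar))
    )
```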

We computed the test accuracy of MemNNs k = 2 (+ time) for varying amounts of training data: 100, 500, 1000 and 3000 training questions. The results are given in Table 4. These results can be compared with those of RNNs and LSTMs on the full data (3000 examples) given in Table 3. For example, on the difficulty 5 actor and actor + object tasks MemNNs outperform LSTMs even when using 30 times fewer training examples.

Table 4: Test accuracy of MemNNs k = 2 (+time) on the word-sequence simulation QA task for differing numbers of training examples (number of questions).

                           Difficulty 1               Difficulty 5
Num. training questions    actor    actor + object    actor    actor + object
100                        73.8%    64.9%             74.4%    49.8%
500                        99.9%    99.2%             99.8%    95.1%
1000                       99.9%    100%              100%     98.4%
3000                       100%     100%              100%     99.9%

We conducted experiments where input was at the sentence level, that is, the data was already pre-segmented into statements and questions as input to the MemNN (as opposed to being input as a stream of words). Results comparing RNNs with MemNNs are given in Table 5. The conclusions are similar to those at the word level from Section 5.2. That is, MemNNs outperform RNNs, and both inference that finds k = 2 supporting statements and time features are necessary for the actor w/o before + object task.

                        Difficulty 1                                     Difficulty 5
Method                  actor w/o before    actor w/o before + object    actor w/o before    actor w/o before + object
RNN                     100%                58%                          29%                 17%
MemNN k = 1             90%                 9%                           46%                 21%
MemNN k = 1 (+time)     100%                73%                          100%                73%
MemNN k = 2 (+time)     100%                99.95%                       100%                99.4%

We conducted experiments for the simulation data in the case where the answers are sentences (see Appendix A and Figure 2). As the single word answer model can no longer be used, we simply compare MemNNs using either RNNs or LSTMs for the response module R. As baselines we can still use RNNs and LSTMs in the standard setting, where they are fed only words, with the statements and the question given as a word stream. In contrast, the RNNs and LSTMs inside the MemNN are effectively fed the output of the O module (see Section 3.1). In these experiments we only consider the difficulty 5 actor+object setting in the case of MemNNs with k = 2 iterations (eq. (3)), which means the module R is fed the features $[x, m_{o_1}, m_{o_2}]$ after the modules I, G and O have run.

The sentence generation is performed on the test data, and the evaluation we chose is as follows. A correct generation has to contain the correct location answer, and can optionally contain the subject or a correct pronoun referring to it. For example, the question “Where is Bill?” allows the correct answers “Kitchen”, “In the kitchen”, “Bill is in the kitchen”, “He is in the kitchen” and “I think Bill is in the kitchen”. Incorrect answers, however, contain an incorrect location or subject reference, for example “Joe is in the kitchen”, “It is in the kitchen” or “Bill is in the bathroom I believe”. We can then measure the percentage of test examples that are correct using this metric.
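One possible reading of this metric is sketched below. The function name, the explicit lists of actors and locations, the pronoun table and the fixed set of third-person pronouns are all assumptions introduced for the example; the text only gives the rule informally.

```python
import re

THIRD_PERSON = {"he", "she", "it", "they"}  # assumed pronoun inventory


def _normalize(text):
    """Lowercase and keep only alphabetic tokens, padded for whole-word matching."""
    return " " + " ".join(re.findall(r"[a-z]+", text.lower())) + " "


def is_correct_answer(answer, subject, location, actors, locations, pronouns):
    """True if the answer names the right location and no wrong subject.

    `pronouns` maps a subject to the set of pronouns that may refer to it,
    e.g. {"Bill": {"he"}} (a hypothetical representation).
    """
    padded = _normalize(answer)
    tokens = padded.split()

    def mentions(phrase):
        return _normalize(phrase) in padded

    if not mentions(location):
        return False            # the correct location must appear
    if any(mentions(loc) for loc in locations if loc.lower() != location.lower()):
        return False            # no other location may appear
    if any(mentions(a) for a in actors if a.lower() != subject.lower()):
        return False            # no other actor may appear
    allowed = pronouns.get(subject, set())
    return all(p in allowed for p in THIRD_PERSON if p in tokens)
```

On the examples quoted above, with actors Bill and Joe and locations kitchen and bathroom, this accepts the five listed correct answers and rejects the three listed incorrect ones.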

The numerical results are given in Table 6, and example output is given in Figure 2. The results indicate that MemNNs with LSTMs perform quite strongly, outperforming MemNNs using RNNs. However, both MemNN variants outperform both RNNs and LSTMs by some distance.

Table 6: Test accuracy on the multi-word answer simulation QA task. We compare conventional RNN and LSTMs with MemNNs using an RNN or LSTM module R (i.e., where R is fed features $[x, m_{o_1}, m_{o_2}]$ after the modules I, G and O have run).

Model    MemNN: IGO features [x, m_{o_1}, m_{o_2}]    Word features
RNN      68.83%                                       13.97%
LSTM     90.98%                                       14.01%

Neural Turing Machines

arXiv:1410.5401v2 [cs.NE] 10 Dec 2014

Greg Wayne gregwayne@google.com

Ivo Danihelka danihelka@google.com

Abstract

We extend the capabilities of neural networks by coupling them to external memory re-

sources, which they can interact with by attentional processes. The combined system is

analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-

end, allowing it to be efficiently trained with gradient descent. Preliminary results demon-

strate that Neural Turing Machines can infer simple algorithms such as copying, sorting,

and associative recall from input and output examples.

1 Introduction

Computer programs make use of three fundamental mechanisms: elementary operations

(e.g., arithmetic operations), logical flow control (branching), and external memory, which

can be written to and read from in the course of computation (Von Neumann, 1945). De-

spite its wide-ranging success in modelling complicated data, modern machine learning

has largely neglected the use of logical flow control and external memory.

Recurrent neural networks (RNNs) stand out from other machine learning methods

for their ability to learn and carry out complicated transformations of data over extended

periods of time. Moreover, it is known that RNNs are Turing-Complete (Siegelmann and

Sontag, 1995), and therefore have the capacity to simulate arbitrary procedures, if properly

wired. Yet what is possible in principle is not always what is simple in practice. We

therefore enrich the capabilities of standard recurrent networks to simplify the solution of

algorithmic tasks. This enrichment is primarily via a large, addressable memory, so, by

analogy to Turing’s enrichment of finite-state machines by an infinite memory tape, we

dub our device a “Neural Turing Machine” (NTM). Unlike a Turing machine, an NTM

is a differentiable computer that can be trained by gradient descent, yielding a practical

mechanism for learning programs.

In human cognition, the process that shares the most similarity to algorithmic operation

is known as “working memory.” While the mechanisms of working memory remain some-

what obscure at the level of neurophysiology, the verbal definition is understood to mean

a capacity for short-term storage of information and its rule-based manipulation (Badde-

ley et al., 2009). In computational terms, these rules are simple programs, and the stored

information constitutes the arguments of these programs. Therefore, an NTM resembles

a working memory system, as it is designed to solve tasks that require the application of

approximate rules to “rapidly-created variables.” Rapidly-created variables (Hadley, 2009)

are data that are quickly bound to memory slots, in the same way that the number 3 and the

number 4 are put inside registers in a conventional computer and added to make 7 (Minsky,

1967). An NTM bears another close resemblance to models of working memory since the

NTM architecture uses an attentional process to read from and write to memory selectively.

In contrast to most models of working memory, our architecture can learn to use its working

memory instead of deploying a fixed set of procedures over symbolic data.

The organisation of this report begins with a brief review of germane research on work-

ing memory in psychology, linguistics, and neuroscience, along with related research in

artificial intelligence and neural networks. We then describe our basic contribution, a mem-

ory architecture and attentional controller that we believe is well-suited to the performance

of tasks that require the induction and execution of simple programs. To test this architec-

ture, we have constructed a battery of problems, and we present their precise descriptions

along with our results. We conclude by summarising the strengths of the architecture.

2 Foundational Research

2.1 Psychology and Neuroscience

The concept of working memory has been most heavily developed in psychology to explain

the performance of tasks involving the short-term manipulation of information. The broad

picture is that a “central executive” focuses attention and performs operations on data in a

memory buffer (Baddeley et al., 2009). Psychologists have extensively studied the capacity

limitations of working memory, which is often quantified by the number of “chunks” of

information that can be readily recalled (Miller, 1956) (see footnote 1). These capacity limitations lead

toward an understanding of structural constraints in the human working memory system,

but in our own work we are happy to exceed them.

Footnote 1: There remains vigorous debate about how best to characterise capacity limitations (Barrouillet et al., 2004).

In neuroscience, the working memory process has been ascribed to the functioning of a system composed of the prefrontal cortex and basal ganglia (Goldman-Rakic, 1995). Typical experiments involve recording from a single neuron or group of neurons in prefrontal cortex while a monkey is performing a task that involves observing a transient cue, waiting through a “delay period,” then responding in a manner dependent on the cue. Certain tasks elicit persistent firing from individual neurons during the delay period or more complicated neural dynamics. A recent study quantified delay period activity in prefrontal cortex for a complex, context-dependent task based on measures of “dimensionality” of the population code and showed that it predicted memory performance (Rigotti et al., 2013).

Modeling studies of working memory range from those that consider how biophysical

circuits could implement persistent neuronal firing (Wang, 1999) to those that try to solve

explicit tasks (Hazy et al., 2006) (Dayan, 2008) (Eliasmith, 2013). Of these, Hazy et al.’s

model is the most relevant to our work, as it is itself analogous to the Long Short-Term

Memory architecture, which we have modified ourselves. As in our architecture, Hazy

et al.’s has mechanisms to gate information into memory slots, which they use to solve a

memory task constructed of nested rules. In contrast to our work, the authors include no

sophisticated notion of memory addressing, which limits the system to storage and recall

of relatively simple, atomic data. Addressing, fundamental to our work, is usually left

out from computational models in neuroscience, though it deserves to be mentioned that

Gallistel and King (Gallistel and King, 2009) and Marcus (Marcus, 2003) have argued that

addressing must be implicated in the operation of the brain.

Historically, cognitive science and linguistics emerged as fields at roughly the same time

as artificial intelligence, all deeply influenced by the advent of the computer (Chomsky,

1956) (Miller, 2003). Their intentions were to explain human mental behaviour based on

information or symbol-processing metaphors. In the early 1980s, both fields considered

recursive or procedural (rule-based) symbol-processing to be the highest mark of cogni-

tion. The Parallel Distributed Processing (PDP) or connectionist revolution cast aside the

symbol-processing metaphor in favour of a so-called “sub-symbolic” description of thought

processes (Rumelhart et al., 1986).

Fodor and Pylyshyn (Fodor and Pylyshyn, 1988) famously made two barbed claims

about the limitations of neural networks for cognitive modeling. They first objected that

connectionist theories were incapable of variable-binding, or the assignment of a particular

datum to a particular slot in a data structure. In language, variable-binding is ubiquitous;

for example, when one produces or interprets a sentence of the form, “Mary spoke to John,”

one has assigned “Mary” the role of subject, “John” the role of object, and “spoke to” the

role of the transitive verb. Fodor and Pylyshyn also argued that neural networks with fixed-

length input domains could not reproduce human capabilities in tasks that involve process-

ing variable-length structures. In response to this criticism, neural network researchers

including Hinton (Hinton, 1986), Smolensky (Smolensky, 1990), Touretzky (Touretzky,

1990), Pollack (Pollack, 1990), Plate (Plate, 2003), and Kanerva (Kanerva, 2009) inves-

tigated specific mechanisms that could support both variable-binding and variable-length

structure within a connectionist framework. Our architecture draws on and potentiates this

work.

Recursive processing of variable-length structures continues to be regarded as a hall-

mark of human cognition. In the last decade, a firefight in the linguistics community pitted

several leaders of the field against one another. At issue was whether recursive processing

is the “uniquely human” evolutionary innovation that enables language and is specialized to

language, a view supported by Fitch, Hauser, and Chomsky (Fitch et al., 2005), or whether

multiple new adaptations are responsible for human language evolution and recursive pro-

cessing predates language (Jackendoff and Pinker, 2005). Regardless of recursive process-

ing’s evolutionary origins, all agreed that it is essential to human cognitive flexibility.

Recurrent neural networks constitute a broad class of machines with dynamic state; that

is, they have state whose evolution depends both on the input to the system and on the

current state. In comparison to hidden Markov models, which also contain dynamic state,

RNNs have a distributed state and therefore have significantly larger and richer memory

and computational capacity. Dynamic state is crucial because it affords the possibility of

context-dependent computation; a signal entering at a given moment can alter the behaviour

of the network at a much later moment.

A crucial innovation to recurrent networks was the Long Short-Term Memory (LSTM)

(Hochreiter and Schmidhuber, 1997). This very general architecture was developed for a

specific purpose, to address the “vanishing and exploding gradient” problem (Hochreiter

et al., 2001a), which we might relabel the problem of “vanishing and exploding sensitivity.”

LSTM ameliorates the problem by embedding perfect integrators (Seung, 1998) for mem-

ory storage in the network. The simplest example of a perfect integrator is the equation

x(t + 1) = x(t) + i(t), where i(t) is an input to the system. The implicit identity matrix

Ix(t) means that signals do not dynamically vanish or explode. If we attach a mechanism

to this integrator that allows an enclosing network to choose when the integrator listens to

inputs, namely, a programmable gate depending on context, we have an equation of the

form x(t + 1) = x(t) + g(context)i(t). We can now selectively store information for an

indefinite length of time.
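As a minimal illustration of the gated integrator above, the following Python sketch iterates x(t + 1) = x(t) + g(context) i(t) for scalar values. The function name `gated_integrator` and the hand-supplied list of gate values are assumptions made for illustration; in LSTM the gate is computed by the network from context rather than given explicitly.

```python
def gated_integrator(inputs, gates, x0=0.0):
    """Iterate x(t+1) = x(t) + g(t) * i(t) and return every state visited."""
    x = x0
    trajectory = [x]
    for i_t, g_t in zip(inputs, gates):
        # A gate of 1 lets the input through; a gate of 0 leaves the state untouched.
        x = x + g_t * i_t
        trajectory.append(x)
    return trajectory


inputs = [5.0, 3.0, -2.0, 7.0]
gates = [1.0, 0.0, 0.0, 0.0]  # listen to the first input only
states = gated_integrator(inputs, gates)
```

With the gate open only on the first step, the value 5.0 is stored and the later inputs leave it unchanged, which is the selective storage described in the text.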

Recurrent networks readily process variable-length structures without modification. In

sequential problems, inputs to the network arrive at different times, allowing variable-

length or composite structures to be processed over multiple steps. Because they natively

handle variable-length structures, they have recently been used in a variety of cognitive

problems, including speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), text

generation (Sutskever et al., 2011), handwriting generation (Graves, 2013) and machine

translation (Sutskever et al., 2014). Considering this property, we do not feel that it is ur-

gent or even necessarily valuable to build explicit parse trees to merge composite structures

greedily (Pollack, 1990) (Socher et al., 2012) (Frasconi et al., 1998).

Other important precursors to our work include differentiable models of attention (Graves,
2013) (Bahdanau et al., 2014) and program search (Hochreiter et al., 2001b) (Das et al.,
1992), constructed with recurrent neural networks.

Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller
network receives inputs from an external environment and emits outputs in response. It also
reads from and writes to a memory matrix via a set of parallel read and write heads. The dashed
line indicates the division between the NTM circuit and the outside world.

A Neural Turing Machine (NTM) architecture contains two basic components: a neural

network controller and a memory bank. Figure 1 presents a high-level diagram of the NTM

architecture. Like most neural networks, the controller interacts with the external world via

input and output vectors. Unlike a standard network, it also interacts with a memory matrix

using selective read and write operations. By analogy to the Turing machine we refer to the

network outputs that parametrise these operations as “heads.”

Crucially, every component of the architecture is differentiable, making it straightfor-

ward to train with gradient descent. We achieved this by defining ‘blurry’ read and write

operations that interact to a greater or lesser degree with all the elements in memory (rather

than addressing a single element, as in a normal Turing machine or digital computer). The

degree of blurriness is determined by an attentional “focus” mechanism that constrains each

read and write operation to interact with a small portion of the memory, while ignoring the

rest. Because interaction with the memory is highly sparse, the NTM is biased towards

storing data without interference. The memory location brought into attentional focus is

determined by specialised outputs emitted by the heads. These outputs define a normalised

weighting over the rows in the memory matrix (referred to as memory “locations”). Each

weighting, one per read or write head, defines the degree to which the head reads or writes


at each location. A head can thereby attend sharply to the memory at a single location or

weakly to the memory at many locations.

3.1 Reading

Let $M_t$ be the contents of the $N \times M$ memory matrix at time $t$, where $N$ is the number
of memory locations, and $M$ is the vector size at each location. Let $w_t$ be a vector of
weightings over the $N$ locations emitted by a read head at time $t$. Since all weightings are
normalised, the $N$ elements $w_t(i)$ of $w_t$ obey the following constraints:

\[ \sum_i w_t(i) = 1, \qquad 0 \le w_t(i) \le 1, \quad \forall i. \tag{1} \]

The length $M$ read vector $r_t$ returned by the head is defined as a convex combination of
the row-vectors $M_t(i)$ in memory:

\[ r_t \longleftarrow \sum_i w_t(i) M_t(i), \tag{2} \]

which is clearly differentiable with respect to both the memory and the weighting.
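The read operation of Equation (2) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used in the experiments; the function name `read` and the list-of-lists memory layout are assumptions.

```python
def read(memory, w):
    """Return r = sum_i w[i] * memory[i], a convex combination of memory rows."""
    n_columns = len(memory[0])
    return [
        sum(w[i] * memory[i][j] for i in range(len(memory)))
        for j in range(n_columns)
    ]


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 locations, M = 2
sharp_read = read(memory, [0.0, 1.0, 0.0])     # attends to location 1 only
blurry_read = read(memory, [0.5, 0.5, 0.0])    # blends locations 0 and 1
```

A weighting concentrated on one location returns that row exactly, while a spread weighting returns a blend of rows.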

3.2 Writing

Taking inspiration from the input and forget gates in LSTM, we decompose each write into

two parts: an erase followed by an add.

Given a weighting $w_t$ emitted by a write head at time $t$, along with an erase vector
$e_t$ whose $M$ elements all lie in the range $(0, 1)$, the memory vectors $M_{t-1}(i)$ from the
previous time-step are modified as follows:

\[ \tilde{M}_t(i) \longleftarrow M_{t-1}(i) \left[ \mathbf{1} - w_t(i) e_t \right], \tag{3} \]

where $\mathbf{1}$ is a row-vector of all 1-s, and the multiplication against the memory location acts
point-wise. Therefore, the elements of a memory location are reset to zero only if both the
weighting at the location and the erase element are one; if either the weighting or the erase
is zero, the memory is left unchanged. When multiple write heads are present, the erasures
can be performed in any order, as multiplication is commutative.

Each write head also produces a length $M$ add vector $a_t$, which is added to the memory
after the erase step has been performed:

\[ M_t(i) \longleftarrow \tilde{M}_t(i) + w_t(i) a_t. \tag{4} \]

Once again, the order in which the adds are performed by multiple heads is irrelevant. The

combined erase and add operations of all the write heads produce the final content of the

memory at time t. Since both erase and add are differentiable, the composite write oper-

ation is differentiable too. Note that both the erase and add vectors have M independent

components, allowing fine-grained control over which elements in each memory location

are modified.
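The erase-then-add write of Equations (3) and (4) can be sketched for a single write head as follows. The function name `write` and the list-of-lists memory are assumptions made for illustration, and the example uses the boundary values 0 and 1 only to make the effect easy to see.

```python
def write(memory, w, erase, add):
    """Apply one head's erase step (Eq. 3) followed by its add step (Eq. 4)."""
    new_memory = []
    for w_i, row in zip(w, memory):
        # Erase: each element is scaled by 1 - w(i) * e, point-wise.
        erased = [m * (1.0 - w_i * e) for m, e in zip(row, erase)]
        # Add: the add vector is scaled by the same weighting w(i).
        new_memory.append([m + w_i * a for m, a in zip(erased, add)])
    return new_memory


memory = [[1.0, 1.0], [1.0, 1.0]]
updated = write(memory, w=[1.0, 0.0], erase=[1.0, 0.0], add=[0.5, 0.25])
```

Only location 0 is addressed, so location 1 is unchanged. At location 0 the first element is erased and then set by the add vector, while the second element (erase value 0) keeps its old content and has the add value accumulated on top.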


Figure 2: Flow Diagram of the Addressing Mechanism. The key vector, $k_t$, and key
strength, $\beta_t$, are used to perform content-based addressing of the memory matrix, $M_t$. The
resulting content-based weighting is interpolated with the weighting from the previous time step
based on the value of the interpolation gate, $g_t$. The shift weighting, $s_t$, determines whether
and by how much the weighting is rotated. Finally, depending on $\gamma_t$, the weighting is sharpened
and used for memory access.

3.3 Addressing Mechanisms

Although we have now shown the equations of reading and writing, we have not described

how the weightings are produced. These weightings arise by combining two addressing

mechanisms with complementary facilities. The first mechanism, “content-based address-

ing,” focuses attention on locations based on the similarity between their current values

and values emitted by the controller. This is related to the content-addressing of Hopfield

networks (Hopfield, 1982). The advantage of content-based addressing is that retrieval is

simple, merely requiring the controller to produce an approximation to a part of the stored

data, which is then compared to memory to yield the exact stored value.

However, not all problems are well-suited to content-based addressing. In certain tasks

the content of a variable is arbitrary, but the variable still needs a recognisable name or ad-

dress. Arithmetic problems fall into this category: the variable x and the variable y can take

on any two values, but the procedure f (x, y) = x × y should still be defined. A controller

for this task could take the values of the variables x and y, store them in different addresses,

then retrieve them and perform a multiplication algorithm. In this case, the variables are

addressed by location, not by content. We call this form of addressing “location-based ad-

dressing.” Content-based addressing is strictly more general than location-based addressing

as the content of a memory location could include location information inside it. In our ex-

periments however, providing location-based addressing as a primitive operation proved

essential for some forms of generalisation, so we employ both mechanisms concurrently.

Figure 2 presents a flow diagram of the entire addressing system that shows the order

of operations for constructing a weighting vector when reading or writing.


3.3.1 Focusing by Content

For content-addressing, each head (whether employed for reading or writing) first produces
a length $M$ key vector $k_t$ that is compared to each vector $M_t(i)$ by a similarity measure
$K[\cdot, \cdot]$. The content-based system produces a normalised weighting $w_t^c$ based on the
similarity and a positive key strength, $\beta_t$, which can amplify or attenuate the precision of the
focus:

\[ w_t^c(i) \longleftarrow \frac{\exp\left(\beta_t K[k_t, M_t(i)]\right)}{\sum_j \exp\left(\beta_t K[k_t, M_t(j)]\right)}. \tag{5} \]

The similarity measure is cosine similarity:

\[ K[u, v] = \frac{u \cdot v}{\|u\| \cdot \|v\|}. \tag{6} \]
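Content-based focusing, as given by Equations (5) and (6), can be sketched as a softmax over scaled cosine similarities. The function names, the small epsilon guarding against zero-norm vectors, and the subtraction of the maximum score for numerical stability are assumptions of this sketch rather than details stated in the text.

```python
import math


def cosine_similarity(u, v):
    """K[u, v] = (u . v) / (||u|| ||v||), with a small epsilon (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def content_weighting(memory, key, beta):
    """Softmax over beta * K[key, memory[i]] for every location i."""
    scores = [beta * cosine_similarity(key, row) for row in memory]
    peak = max(scores)  # subtracting the peak leaves the softmax unchanged
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
soft = content_weighting(memory, key=[1.0, 0.1], beta=1.0)
sharp = content_weighting(memory, key=[1.0, 0.1], beta=50.0)
```

Both weightings peak at location 0, the row most similar to the key; raising the key strength from 1 to 50 concentrates almost all of the weight there.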

3.3.2 Focusing by Location

The location-based addressing mechanism is designed to facilitate both simple iteration

across the locations of the memory and random-access jumps. It does so by implementing

a rotational shift of a weighting. For example, if the current weighting focuses entirely on

a single location, a rotation of 1 would shift the focus to the next location. A negative shift

would move the weighting in the opposite direction.

Prior to rotation, each head emits a scalar interpolation gate $g_t$ in the range $(0, 1)$. The
value of $g_t$ is used to blend between the weighting $w_{t-1}$ produced by the head at the previous
time-step and the weighting $w_t^c$ produced by the content system at the current time-step,
yielding the gated weighting $w_t^g$:

\[ w_t^g \longleftarrow g_t w_t^c + (1 - g_t) w_{t-1}. \tag{7} \]

If the gate is zero, then the content weighting is entirely ignored, and the weighting from the

previous time step is used. Conversely, if the gate is one, the weighting from the previous

iteration is ignored, and the system applies content-based addressing.
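The blending performed by the interpolation gate can be sketched as a convex combination of the two weightings. The function name `interpolate` is an assumption made for illustration.

```python
def interpolate(w_content, w_previous, g):
    """Gated weighting: g * content weighting + (1 - g) * previous weighting."""
    return [g * wc + (1.0 - g) * wp for wc, wp in zip(w_content, w_previous)]


w_content = [0.0, 1.0, 0.0]   # content system points at location 1
w_previous = [1.0, 0.0, 0.0]  # the head was at location 0 on the last step

kept = interpolate(w_content, w_previous, g=0.0)      # ignore content
replaced = interpolate(w_content, w_previous, g=1.0)  # ignore previous
mixed = interpolate(w_content, w_previous, g=0.5)
```

The two extreme gate values reproduce the behaviour described in the text, and intermediate values give a weighting that remains normalised.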

After interpolation, each head emits a shift weighting st that defines a normalised distri-

bution over the allowed integer shifts. For example, if shifts between -1 and 1 are allowed,

st has three elements corresponding to the degree to which shifts of -1, 0 and 1 are per-

formed. The simplest way to define the shift weightings is to use a softmax layer of the

appropriate size attached to the controller. We also experimented with another technique,

where the controller emits a single scalar that is interpreted as the lower bound of a width

one uniform distribution over shifts. For example, if the shift scalar is 6.7, then st (6) = 0.3,

st (7) = 0.7, and the rest of st is zero.
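The scalar-shift variant can be sketched as follows: a width one uniform distribution with lower bound x places mass 1 - frac(x) on the integer shift floor(x) and mass frac(x) on floor(x) + 1. The function name and the choice to store shifts in a list indexed modulo the number of locations are assumptions of this sketch.

```python
import math


def shift_weighting_from_scalar(shift, n_locations):
    """Spread a real-valued shift over its two neighbouring integer shifts."""
    lower = math.floor(shift)
    fraction = shift - lower
    s = [0.0] * n_locations
    s[lower % n_locations] += 1.0 - fraction
    s[(lower + 1) % n_locations] += fraction
    return s


s = shift_weighting_from_scalar(6.7, n_locations=10)
```

For a shift scalar of 6.7 this reproduces the example in the text up to floating-point rounding: roughly 0.3 on a shift of 6 and 0.7 on a shift of 7.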


If we index the $N$ memory locations from 0 to $N - 1$, the rotation applied to $w_t^g$ by $s_t$
can be expressed as the following circular convolution:

\[ \tilde{w}_t(i) \longleftarrow \sum_{j=0}^{N-1} w_t^g(j) \, s_t(i - j) \tag{8} \]

where all index arithmetic is computed modulo N . The convolution operation in Equa-

tion (8) can cause leakage or dispersion of weightings over time if the shift weighting is

not sharp. For example, if shifts of -1, 0 and 1 are given weights of 0.1, 0.8 and 0.1, the

rotation will transform a weighting focused at a single point into one slightly blurred over

three points. To combat this, each head emits one further scalar γt ≥ 1 whose effect is to

sharpen the final weighting as follows:

\[ w_t(i) \longleftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \tag{9} \]
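The rotation of Equation (8) and the sharpening of Equation (9) can be sketched together, using the blurring example from the text. The function names and the convention of storing the shift weighting as a length N list indexed modulo N (so a shift of -1 lives at index N - 1) are assumptions of this sketch.

```python
def circular_shift(w_gated, s):
    """Circular convolution: w~(i) = sum_j w_gated(j) * s((i - j) mod N)."""
    n = len(w_gated)
    return [
        sum(w_gated[j] * s[(i - j) % n] for j in range(n))
        for i in range(n)
    ]


def sharpen(w, gamma):
    """Raise each weight to the power gamma >= 1 and renormalise."""
    powered = [x ** gamma for x in w]
    total = sum(powered)
    return [p / total for p in powered]


# Shifts of -1, 0 and +1 weighted 0.1, 0.8 and 0.1 over N = 5 locations.
s = [0.8, 0.1, 0.0, 0.0, 0.1]
w_gated = [0.0, 0.0, 1.0, 0.0, 0.0]  # focused entirely on location 2

blurred = circular_shift(w_gated, s)
refocused = sharpen(blurred, gamma=3.0)
```

The rotation spreads the single-location focus over three neighbouring locations, and sharpening with gamma = 3 pulls most of the weight back onto the centre.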

The combined addressing system of weighting interpolation and content and location-

based addressing can operate in three complementary modes. One, a weighting can be

chosen by the content system without any modification by the location system. Two, a

weighting produced by the content addressing system can be chosen and then shifted. This

allows the focus to jump to a location next to, but not on, an address accessed by content;

in computational terms this allows a head to find a contiguous block of data, then access a

particular element within that block. Three, a weighting from the previous time step can

be rotated without any input from the content-based addressing system. This allows the

weighting to iterate through a sequence of addresses by advancing the same distance at

each time-step.

3.4 Controller Network

The NTM architecture described above has several free parameters, including

the size of the memory, the number of read and write heads, and the range of allowed lo-

cation shifts. But perhaps the most significant architectural choice is the type of neural

network used as the controller. In particular, one has to decide whether to use a recurrent

or feedforward network. A recurrent controller such as LSTM has its own internal memory

that can complement the larger memory in the matrix. If one compares the controller to

the central processing unit in a digital computer (albeit with adaptive rather than predefined

instructions) and the memory matrix to RAM, then the hidden activations of the recurrent

controller are akin to the registers in the processor. They allow the controller to mix infor-

mation across multiple time steps of operation. On the other hand a feedforward controller

can mimic a recurrent network by reading and writing at the same location in memory at

every step. Furthermore, feedforward controllers often confer greater transparency to the

network’s operation because the pattern of reading from and writing to the memory matrix

is usually easier to interpret than the internal state of an RNN. However, one limitation of


a feedforward controller is that the number of concurrent read and write heads imposes a

bottleneck on the type of computation the NTM can perform. With a single read head, it

can perform only a unary transform on a single memory vector at each time-step, with two

read heads it can perform binary vector transforms, and so on. Recurrent controllers can

internally store read vectors from previous time-steps, so do not suffer from this limitation.

4 Experiments

This section presents preliminary experiments on a set of simple algorithmic tasks such

as copying and sorting data sequences. The goal was not only to establish that NTM is

able to solve the problems, but also that it is able to do so by learning compact internal

programs. The hallmark of such solutions is that they generalise well beyond the range of

the training data. For example, we were curious to see if a network that had been trained

to copy sequences of length up to 20 could copy a sequence of length 100 with no further

training.

For all the experiments we compared three architectures: NTM with a feedforward

controller, NTM with an LSTM controller, and a standard LSTM network. Because all

the tasks were episodic, we reset the dynamic state of the networks at the start of each

input sequence. For the LSTM networks, this meant setting the previous hidden state equal

to a learned bias vector. For NTM the previous state of the controller, the value of the

previous read vectors, and the contents of the memory were all reset to bias values. All

the tasks were supervised learning problems with binary targets; all networks had logistic

sigmoid output layers and were trained with the cross-entropy objective function. Sequence

prediction errors are reported in bits-per-sequence. For more details about the experimental

parameters see Section 4.6.
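One plausible reading of the bits-per-sequence measure is the binary cross-entropy, taken in base 2 and summed over every target bit in a sequence. The sketch below follows that reading; the function name and the summation convention are assumptions, since the text does not spell out the exact computation.

```python
import math


def bits_per_sequence(targets, predictions):
    """Sum of -log2 of the probability assigned to each binary target."""
    total = 0.0
    for target_vector, predicted_vector in zip(targets, predictions):
        for y, p in zip(target_vector, predicted_vector):
            # p is the sigmoid output, read as P(target bit = 1).
            total -= math.log2(p) if y == 1 else math.log2(1.0 - p)
    return total


targets = [[1, 0, 1, 1, 0, 0, 1, 0], [0, 0, 1, 0, 1, 1, 0, 1]]
uninformed = [[0.5] * 8, [0.5] * 8]
cost = bits_per_sequence(targets, uninformed)
```

Under this reading, a network that outputs 0.5 everywhere pays exactly one bit per target bit.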

4.1 Copy

The copy task tests whether NTM can store and recall a long sequence of arbitrary in-

formation. The network is presented with an input sequence of random binary vectors

followed by a delimiter flag. Storage and access of information over long time periods has

always been problematic for RNNs and other dynamic architectures. We were particularly

interested to see if an NTM is able to bridge longer time delays than LSTM.

The networks were trained to copy sequences of eight bit random vectors, where the

sequence lengths were randomised between 1 and 20. The target sequence was simply a

copy of the input sequence (without the delimiter flag). Note that no inputs were presented

to the network while it received the targets, to ensure that it recalled the entire sequence with

no intermediate assistance.

As can be seen from Figure 3, NTM (with either a feedforward or LSTM controller)
learned much faster than LSTM alone, and converged to a lower cost. The disparity between
the NTM and LSTM learning curves is dramatic enough to suggest a qualitative,
rather than quantitative, difference in the way the two models solve the problem.

[Figure 3 plot: curves for LSTM, NTM with LSTM Controller, and NTM with Feedforward Controller; x-axis: sequence number (thousands).]

We also studied the ability of the networks to generalise to longer sequences than seen

during training (that they can generalise to novel vectors is clear from the training error).

Figures 4 and 5 demonstrate that the behaviour of LSTM and NTM in this regime is radically

different. NTM continues to copy as the length increases (footnote 2), while LSTM rapidly

degrades beyond length 20.

The preceding analysis suggests that NTM, unlike LSTM, has learned some form of

copy algorithm. To determine what this algorithm is, we examined the interaction between

the controller and the memory (Figure 6). We believe that the sequence of operations per-

formed by the network can be summarised by the following pseudocode:

while input delimiter not seen do
    receive input vector
    write input to head location
    increment head location by 1
end while
return head to start location
while true do
    read output vector from head location
    emit output
    increment head location by 1
end while

Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row
depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30,
and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network
was only trained on sequences of up to length 20. The first four sequences are reproduced with
high confidence and very few mistakes. The longest one has a few more local errors and one
global error: at the point indicated by the red arrow at the bottom, a single vector is duplicated,
pushing all subsequent vectors one step back. Despite being subjectively close to a correct copy,
this leads to a high loss.

Footnote 2: The limiting factor was the size of the memory (128 locations), after which the cyclical shifts wrapped
around and previous writes were overwritten.

This is essentially how a human programmer would perform the same task in a low-level
programming language. In terms of data structures, we could say that NTM has

learned how to create and iterate through arrays. Note that the algorithm combines both

content-based addressing (to jump to the start of the sequence) and location-based addressing

(to move along the sequence). Also note that the iteration would not generalise to

long sequences without the ability to use relative shifts from the previous read and write

weightings (Equation 7), and that without the focus-sharpening mechanism (Equation 9)

the weightings would probably lose precision over time.
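The copy procedure described above can be written as an ordinary array program. This is a hand-written analogue of the behaviour inferred from the network, not code extracted from it; the function name, the fixed start location, and stopping the read loop after as many steps as there were inputs are assumptions of the sketch. The memory is circular with 128 locations, matching the size mentioned in the footnote on memory limits.

```python
def copy_via_memory(inputs, n_locations=128, start=0):
    """Write each input to successive locations, then read them back in order."""
    memory = [None] * n_locations
    head = start
    for vector in inputs:
        memory[head] = list(vector)       # write input to head location
        head = (head + 1) % n_locations   # increment head location by 1

    head = start                          # return head to start location
    outputs = []
    for _ in range(len(inputs)):          # bounded stand-in for "while true"
        outputs.append(memory[head])      # read output vector and emit it
        head = (head + 1) % n_locations
    return outputs


short_sequence = [[i % 2, (i // 2) % 2] for i in range(20)]
long_sequence = [[i] for i in range(130)]

short_copy = copy_via_memory(short_sequence)
long_copy = copy_via_memory(long_sequence)
```

A sequence that fits in memory is copied exactly. A sequence of 130 vectors wraps around the 128 locations, so the first two entries are overwritten before they are read back.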

4.2 Repeat Copy

The repeat copy task extends copy by requiring the network to output the copied sequence a

specified number of times and then emit an end-of-sequence marker. The main motivation

was to see if the NTM could learn a simple nested function. Ideally, we would like it to be

able to execute a “for loop” containing any subroutine it has already learned.

The network receives random-length sequences of random binary vectors, followed by

a scalar value indicating the desired number of copies, which appears on a separate input

channel. To emit the end marker at the correct time the network must be both able to

interpret the extra input and keep count of the number of copies it has performed so far.

As with the copy task, no inputs are provided to the network after the initial sequence and

repeat number. The networks were trained to reproduce sequences of random binary

vectors of size eight, where both the sequence length and the number of repetitions were chosen

randomly from one to ten. The input representing the repeat number was normalised to

have mean zero and variance one.


Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs

for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences

of up to length 20 almost perfectly. However it clearly fails to generalise to longer sequences.

Also note that the length of the accurate prefix decreases as the sequence length increases,

suggesting that the network has trouble retaining information for long periods.

Figure 6: NTM Memory Use During the Copy Task. The plots in the left column depict

the inputs to the network (top), the vectors added to memory (middle) and the corresponding

write weightings (bottom) during a single test sequence for the copy task. The plots on the right

show the outputs from the network (top), the vectors read from memory (middle) and the read

weightings (bottom). Only a subset of memory locations are shown. Notice the sharp focus of

all the weightings on a single location in memory (black is weight zero, white is weight one).

Also note that the translation of the focal point over time reflects the network’s use of iterative

shifts for location-based addressing, as described in Section 3.3.2. Lastly, observe that the read

locations exactly match the write locations, and the read vectors match the add vectors. This

suggests that the network writes each input vector in turn to a specific memory location during

the input phase, then reads from the same location sequence during the output phase.


[Figure 7 plot: curves for LSTM, NTM with LSTM Controller, and NTM with Feedforward Controller; x-axis: sequence number (thousands).]

Figure 7 shows that NTM learns the task much faster than LSTM, but both were able to

solve it perfectly (footnote 3). The difference between the two architectures only becomes clear when

they are asked to generalise beyond the training data. In this case we were interested in

generalisation along two dimensions: sequence length and number of repetitions. Figure 8

illustrates the effect of doubling first one, then the other, for both LSTM and NTM. Whereas

LSTM fails both tests, NTM succeeds with longer sequences and is able to perform more

than ten repetitions; however it is unable to keep count of how many repeats it has

completed, and does not predict the end marker correctly. This is probably a consequence

of representing the number of repetitions numerically, which does not easily generalise

beyond a fixed range.

Figure 9 suggests that NTM learns a simple extension of the copy algorithm in the

previous section, where the sequential read is repeated as many times as necessary.

4.3 Associative Recall

The previous tasks show that the NTM can apply algorithms to relatively simple, linear data

structures. The next order of complexity in organising data arises from “indirection”—that

is, when one data item points to another. We test the NTM’s capability for learning an

instance of this more interesting class by constructing a list of items so that querying with

one of the items demands that the network return the subsequent item. More specifically,

we define an item as a sequence of binary vectors that is bounded on the left and right

by delimiter symbols. After several items have been propagated to the network, we query

by showing a random item, and we ask the network to produce the next item. In our

experiments, each item consisted of three six-bit binary vectors (giving a total of 18 bits
per item). During training, we used a minimum of 2 items and a maximum of 6 items in a
single episode.

Footnote 3: It surprised us that LSTM performed better here than on the copy problem. The likely reasons are that the
sequences were shorter (up to length 10 instead of up to 20), and the LSTM network was larger and therefore
had more memory capacity.

Figure 8: NTM and LSTM Generalisation for the Repeat Copy Task. NTM generalises
almost perfectly to longer sequences than seen during training. When the number of repeats is
increased it is able to continue duplicating the input sequence fairly accurately; but it is unable
to predict when the sequence will end, emitting the end marker after the end of every repetition
beyond the eleventh. LSTM struggles with both increased length and number, rapidly diverging
from the input sequence in both cases.

Figure 10 shows that NTM learns this task significantly faster than LSTM, terminating

at near zero cost within approximately 30,000 episodes, whereas LSTM does not reach

zero cost after a million episodes. Additionally, NTM with a feedforward controller learns

faster than NTM with an LSTM controller. These two results suggest that NTM’s external

memory is a more effective way of maintaining the data structure than LSTM’s internal

state. NTM also generalises much better to longer sequences than LSTM, as can be seen

in Figure 11. NTM with a feedforward controller is nearly perfect for sequences of up to

12 items (twice the maximum length used in training), and still has an average cost below

1 bit per sequence for sequences of 15 items.

Figure 9: NTM Memory Use During the Repeat Copy Task. As with the copy task the
network first writes the input vectors to memory using iterative shifts. It then reads through
the sequence to replicate the input as many times as necessary (six in this case). The white dot
at the bottom of the read weightings seems to correspond to an intermediate location used to
redirect the head to the start of the sequence (the NTM equivalent of a goto statement).

[Figure 10 plot: legend entries LSTM and NTM with Feedforward Controller; y-axis: cost per sequence (bits); x-axis: sequence number (thousands).]

Figure 10: Associative Recall Learning Curves for NTM and LSTM.

[Figure 11 plot: legend entries LSTM, NTM with LSTM Controller, and NTM with Feedforward Controller; y-axis: cost per sequence (bits); x-axis: number of items per sequence.]

Figure 11: Generalisation Performance on Associative Recall for Longer Item Sequences.
The NTM with either a feedforward or LSTM controller generalises to much longer sequences
of items than the LSTM alone. In particular, the NTM with a feedforward controller is nearly
perfect for item sequences of twice the length of sequences in its training set.

In Figure 12, we show the operation of the NTM memory, controlled by an LSTM
with one head, on a single test episode. In “Inputs,” we see that the input denotes item
delimiters as single bits in row 7. After the sequence of items has been propagated, a
delimiter in row 8 prepares the network to receive a query item. In this case, the query

item corresponds to the second item in the sequence (contained in the green box). In

“Outputs,” we see that the network crisply outputs item 3 in the sequence (from the red

box). In “Read Weightings,” on the last three time steps, we see that the controller reads

from contiguous locations that each store the time slices of item 3. This is curious because it

appears that the network has jumped directly to the correct location storing item 3. However

we can explain this behaviour by looking at “Write Weightings.” Here we see that the

memory is written to even when the input presents a delimiter symbol between items.

One can confirm in “Adds” that data are indeed written to memory when the delimiters

are presented (e.g., the data within the black box); furthermore, each time a delimiter is

presented, the vector added to memory is different. Further analysis of the memory reveals

that the network accesses the location it reads after the query by using a content-based

lookup that produces a weighting that is shifted by one. Additionally, the key used for

content-lookup corresponds to the vector that was added in the black box. This implies the

following memory-access algorithm: when each item delimiter is presented, the controller

writes a compressed representation of the previous three time slices of the item. After the

query arrives, the controller recomputes the same compressed representation of the query

item, uses a content-based lookup to find the location where it wrote the first representation,

and then shifts by one to produce the subsequent item in the sequence (thereby combining

content-based lookup with location-based offsetting).
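The memory-access algorithm inferred above can be imitated with a simple list-based memory. This is a deliberately simplified analogue: the `compress` function (plain concatenation of the three time slices) is a hypothetical stand-in for whatever compressed representation the controller actually learns, an exact-match search stands in for the soft content-based lookup, and all function names are assumptions.

```python
def compress(item):
    """Hypothetical stand-in for the learned compressed item representation."""
    return tuple(bit for vector in item for bit in vector)


def store_items(items):
    """Write each item's time slices, then its compressed form at the delimiter."""
    memory = []
    for item in items:
        for vector in item:
            memory.append(tuple(vector))
        memory.append(compress(item))  # written when the delimiter arrives
    return memory


def recall_next(memory, query, item_length=3):
    """Content lookup of the query's compressed form, then a shift by one."""
    key = compress(query)
    location = memory.index(key)  # exact match replaces soft content lookup
    start = location + 1          # location-based offset of one
    return [list(vector) for vector in memory[start:start + item_length]]


items = [
    [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]],
    [[0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]],
    [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]],
]
memory = store_items(items)
answer = recall_next(memory, query=items[1])
```

Querying with the second item returns the three time slices of the third, read from contiguous locations immediately after the matched representation.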

Figure 12: NTM Memory Use During the Associative Recall Task. In “Inputs,” a
sequence of items, each composed of three consecutive binary random vectors, is propagated to the
controller. The distinction between items is designated by delimiter symbols (row 7 in “Inputs”).
After several items have been presented, a delimiter that designates a query is presented (row 8
in “Inputs”). A single query item is presented (green box), and the network target corresponds
to the subsequent item in the sequence (red box). In “Outputs,” we see that the network
correctly produces the target item. The red boxes in the read and write weightings highlight the
three locations where the target item was written and then read. The solution the network finds
is to form a compressed representation (black box in “Adds”) of each item that it can store in
a single location. For further analysis, see the main text.

4.4 Dynamic N-Grams

The goal of the dynamic N-Grams task was to test whether NTM could rapidly adapt to
new predictive distributions. In particular we were interested to see if it was able to use its
memory as a re-writable table that it could use to keep count of transition statistics, thereby
emulating a conventional N-Gram model.

We considered the set of all possible 6-Gram distributions over binary sequences. Each
6-Gram distribution can be expressed as a table of $2^5 = 32$ numbers, specifying the
probability that the next bit will be one, given all possible length five binary histories. For
each training example, we first generated random 6-Gram probabilities by independently
drawing all 32 probabilities from the $\mathrm{Beta}(\frac{1}{2}, \frac{1}{2})$ distribution.

We then generated a particular training sequence by drawing 200 successive bits using
the current lookup table (footnote 4). The network observes the sequence one bit at a time and is then
asked to predict the next bit. The optimal estimator for the problem can be determined by
Bayesian analysis:

\[ P(B = 1 \mid N_1, N_0, c) = \frac{N_1 + \frac{1}{2}}{N_1 + N_0 + 1} \tag{10} \]

where $c$ is the five bit previous context, $B$ is the value of the next bit and $N_0$ and $N_1$ are
respectively the number of zeros and ones observed after $c$ so far in the sequence. We can
therefore compare NTM to the optimal predictor as well as LSTM. To assess performance
we used a validation set of 1000 length 200 sequences sampled from the same distribution
as the training data. As shown in Figure 13, NTM achieves a small, but significant
performance advantage over LSTM, but never quite reaches the optimum cost.

Footnote 4: The first 5 bits, for which insufficient context exists to sample from the table, are drawn i.i.d. from a
Bernoulli distribution with p = 0.5.

[Figure 13 plot: curves for LSTM, NTM with LSTM Controller, NTM with Feedforward Controller, and the Optimal Estimator; x-axis: sequence number (thousands).]
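The data generation and the optimal estimator of Equation (10) can be sketched with the Python standard library. The function names, the dictionary keyed by five-bit context tuples, and the fixed random seed are assumptions made for illustration; the sketch makes predictions only from the sixth bit onwards, once a full five-bit context is available.

```python
import itertools
import random

CONTEXT_BITS = 5


def sample_table(rng):
    """Draw P(next bit = 1 | context) from Beta(1/2, 1/2) for all 32 contexts."""
    contexts = itertools.product((0, 1), repeat=CONTEXT_BITS)
    return {context: rng.betavariate(0.5, 0.5) for context in contexts}


def sample_sequence(table, rng, length=200):
    """First five bits are fair coin flips; the rest follow the lookup table."""
    bits = [int(rng.random() < 0.5) for _ in range(CONTEXT_BITS)]
    while len(bits) < length:
        context = tuple(bits[-CONTEXT_BITS:])
        bits.append(int(rng.random() < table[context]))
    return bits


def optimal_predictions(bits):
    """Apply (N1 + 1/2) / (N1 + N0 + 1) using the counts seen so far."""
    counts = {}
    predictions = []
    for t in range(CONTEXT_BITS, len(bits)):
        context = tuple(bits[t - CONTEXT_BITS:t])
        n0, n1 = counts.get(context, (0, 0))
        predictions.append((n1 + 0.5) / (n1 + n0 + 1.0))
        # Update the counts only after predicting, so bit t is never peeked at.
        counts[context] = (n0, n1 + 1) if bits[t] == 1 else (n0 + 1, n1)
    return predictions


rng = random.Random(0)
table = sample_table(rng)
sequence = sample_sequence(table, rng)
predictions = optimal_predictions(sequence)
```

For an unseen context the estimator predicts 0.5, and after observing a single one in that context it predicts 0.75, which follows directly from Equation (10).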

The evolution of the two architectures’ predictions as they observe new inputs is shown

in Figure 14, along with the optimal predictions. Close analysis of NTM’s memory usage

(Figure 15) suggests that the controller uses the memory to count how many ones and zeros

it has observed in different contexts, allowing it to implement an algorithm similar to the

optimal estimator.

4.5 Priority Sort

This task tests whether the NTM can sort data, an important elementary algorithm. A

sequence of random binary vectors is input to the network along with a scalar priority

rating for each vector. The priority is drawn uniformly from the range [-1, 1]. The target

sequence contains the binary vectors sorted according to their priorities, as depicted in

Figure 16.

Each input sequence contained 20 binary vectors with corresponding priorities, and

each target sequence was the 16 highest-priority vectors in the input.5 Inspection of NTM’s

5

We limited the sort to size 16 because we were interested to see if NTM would solve the task using a

binary heap sort of depth 4.

19

Figure 14: Dynamic N-Gram Inference. The top row shows a test sequence from the N-Gram

task, and the rows below show the corresponding predictive distributions emitted by the optimal

estimator, NTM, and LSTM. In most places the NTM predictions are almost indistinguishable

from the optimal ones. However at the points indicated by the two arrows it makes clear

mistakes, one of which is explained in Figure 15. LSTM follows the optimal predictions closely

in some places but appears to diverge further as the sequence progresses; we speculate that this

is due to LSTM “forgetting” the observations at the start of the sequence.

Figure 15: NTM Memory Use During the Dynamic N-Gram Task. The red and green

arrows indicate points where the same context is repeatedly observed during the test sequence

(“00010” for the green arrows, “01111” for the red arrows). At each such point the same

location is accessed by the read head, and then, on the next time-step, accessed by the write

head. We postulate that the network uses the writes to keep count of the fraction of ones and

zeros following each context in the sequence so far. This is supported by the add vectors, which

are clearly anti-correlated at places where the input is one or zero, suggesting a distributed

“counter.” Note that the write weightings grow fainter as the same context is repeatedly seen;

this may be because the memory records a ratio of ones to zeros, rather than absolute counts.

The red box in the prediction sequence corresponds to the mistake at the first red arrow in

Figure 14; the controller appears to have accessed the wrong memory location, as the previous

context was “01101” and not “01111.”


Figure 16: Example Input and Target Sequence for the Priority Sort Task. The input

sequence contains random binary vectors and random scalar priorities. The target sequence is a

subset of the input vectors sorted by the priorities.


Figure 17: NTM Memory Use During the Priority Sort Task. Left: Write locations

returned by fitting a linear function of the priorities to the observed write locations. Middle:

Observed write locations. Right: Read locations.

memory use led us to hypothesise that it uses the priorities to determine the relative location

of each write. To test this hypothesis we fitted a linear function of the priority to the

observed write locations. Figure 17 shows that the locations returned by the linear function

closely match the observed write locations. It also shows that the network reads from the

memory locations in increasing order, thereby traversing the sorted sequence.
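The priority sort setup described above can be sketched as a small data generator. The vector width of 8, the descending output order, and the name `make_priority_sort_example` are assumptions for illustration; the text only fixes 20 inputs, 16 targets, and priorities drawn uniformly from [-1, 1].

```python
import random


def make_priority_sort_example(num_inputs=20, num_targets=16, width=8, rng=None):
    """Build one (vectors, priorities, targets) triple for the priority sort task."""
    rng = rng or random.Random()
    # Random binary vectors, each paired with a scalar priority in [-1, 1].
    vectors = [[rng.randint(0, 1) for _ in range(width)] for _ in range(num_inputs)]
    priorities = [rng.uniform(-1.0, 1.0) for _ in range(num_inputs)]
    # Assumption: highest priority first; keep only the top `num_targets` vectors.
    order = sorted(range(num_inputs), key=lambda i: priorities[i], reverse=True)
    targets = [vectors[i] for i in order[:num_targets]]
    return vectors, priorities, targets


if __name__ == "__main__":
    vectors, priorities, targets = make_priority_sort_example(rng=random.Random(0))
    print(len(vectors), len(targets))
```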

The learning curves in Figure 18 demonstrate that NTM with both feedforward and

LSTM controllers substantially outperform LSTM on this task. Note that eight parallel

read and write heads were needed for best performance with a feedforward controller on

this task; this may reflect the difficulty of sorting vectors using only unary vector operations

(see Section 3.4).

For all experiments, the RMSProp algorithm was used for training in the form described

in (Graves, 2013) with momentum of 0.9. Tables 1 to 3 give details about the network

configurations and learning rates used in the experiments. All LSTM networks had three

stacked hidden layers. Note that the number of LSTM parameters grows quadratically with

[Learning-curve plot; x-axis: sequence number (thousands); legend: LSTM, NTM with LSTM Controller, NTM with Feedforward Controller]

Task           #Heads  Controller Size  Memory Size  Learning Rate       #Parameters
Copy           1       100              128 × 20     $10^{-4}$           17,162
Repeat Copy    1       100              128 × 20     $10^{-4}$           16,712
Associative    4       256              128 × 20     $10^{-4}$           146,845
N-Grams        1       100              128 × 20     $3 \times 10^{-5}$  14,656
Priority Sort  8       512              128 × 20     $3 \times 10^{-5}$  508,305

the number of hidden units (due to the recurrent connections in the hidden layers). This

contrasts with NTM, where the number of parameters does not increase with the number of

memory locations. During the training backward pass, all gradient components are clipped

elementwise to the range (-10, 10).
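Elementwise gradient clipping as described above can be sketched in a few lines. The sketch assumes the gradient is available as a flat list of floats and uses the hypothetical name `clip_gradients`; a real implementation would apply the same operation to each parameter tensor.

```python
def clip_gradients(grads, limit=10.0):
    """Clip every gradient component independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


if __name__ == "__main__":
    print(clip_gradients([-25.0, 0.3, 12.0]))
```

Clipping each component separately changes the direction of the gradient when only some components saturate, which differs from rescaling the whole gradient by its norm.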

5 Conclusion

We have introduced the Neural Turing Machine, a neural network architecture that takes

inspiration from both models of biological working memory and the design of digital com-

puters. Like conventional neural networks, the architecture is differentiable end-to-end and

can be trained with gradient descent. Our experiments demonstrate that it is capable of

learning simple algorithms from example data and of using these algorithms to generalise

well outside its training regime.


Task           #Heads  Controller Size  Memory Size  Learning Rate       #Parameters
Copy           1       100              128 × 20     $10^{-4}$           67,561
Repeat Copy    1       100              128 × 20     $10^{-4}$           66,111
Associative    1       100              128 × 20     $10^{-4}$           70,330
N-Grams        1       100              128 × 20     $3 \times 10^{-5}$  61,749
Priority Sort  5       2 × 100          128 × 20     $3 \times 10^{-5}$  269,038

Task           Network Size  Learning Rate       #Parameters
Copy           3 × 256       $3 \times 10^{-5}$  1,352,969
Repeat Copy    3 × 512       $3 \times 10^{-5}$  5,312,007
Associative    3 × 256       $10^{-4}$           1,344,518
N-Grams        3 × 128       $10^{-4}$           331,905
Priority Sort  3 × 128       $3 \times 10^{-5}$  384,424

6 Acknowledgments

Many have offered thoughtful insights, but we would especially like to thank Daan Wier-

stra, Peter Dayan, Ilya Sutskever, Charles Blundell, Joel Veness, Koray Kavukcuoglu,

Dharshan Kumaran, Georg Ostrovski, Chris Summerfield, Jeff Dean, Geoffrey Hinton, and

Demis Hassabis.


References

Baddeley, A., Eysenck, M., and Anderson, M. (2009). Memory. Psychology Press.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly

learning to align and translate. abs/1409.0473.

Barrouillet, P., Bernardin, S., and Camos, V. (2004). Time constraints and resource shar-

ing in adults’ working memory spans. Journal of Experimental Psychology: General,

133(1):83.

Chomsky, N. (1956). Three models for the description of language. Information Theory,

IEEE Transactions on, 2(3):113–124.

Das, S., Giles, C. L., and Sun, G.-Z. (1992). Learning context-free grammars: Capabil-

ities and limitations of a recurrent neural network with an external stack memory. In

Proceedings of The Fourteenth Annual Conference of Cognitive Science Society. Indiana

University.

2(2):255.

Eliasmith, C. (2013). How to build a brain: A neural architecture for biological cognition.

Oxford University Press.

Fitch, W., Hauser, M. D., and Chomsky, N. (2005). The evolution of the language faculty:

clarifications and implications. Cognition, 97(2):179–210.

critical analysis. Cognition, 28(1):3–71.

Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive process-

ing of data structures. Neural Networks, IEEE Transactions on, 9(5):768–786.

Gallistel, C. R. and King, A. P. (2009). Memory and the computational brain: Why cogni-

tive science will transform neuroscience, volume 3. John Wiley & Sons.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint

arXiv:1308.0850.

Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent

neural networks. In Proceedings of the 31st International Conference on Machine Learn-

ing (ICML-14), pages 1764–1772.


Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent

neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE

International Conference on, pages 6645–6649. IEEE.

21(2):510–532.

Hazy, T. E., Frank, M. J., and O’Reilly, R. C. (2006). Banishing the homunculus: making

working memory work. Neuroscience, 139(1):105–118.

of the eighth annual conference of the cognitive science society, volume 1, page 12.

Amherst, MA.

Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001a). Gradient flow in

recurrent nets: the difficulty of learning long-term dependencies.

9(8):1735–1780.

Hochreiter, S., Younger, A. S., and Conwell, P. R. (2001b). Learning to learn using gradient

descent. In Artificial Neural Networks - ICANN 2001, pages 87–94. Springer.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective

computational abilities. Proceedings of the national academy of sciences, 79(8):2554–

2558.

Jackendoff, R. and Pinker, S. (2005). The nature of the language faculty and its implications

for evolution of language (reply to fitch, hauser, and chomsky). Cognition, 97(2):211–

225.

tributed representation with high-dimensional random vectors. Cognitive Computation,

1(2):139–159.

Marcus, G. F. (2003). The algebraic mind: Integrating connectionism and cognitive sci-

ence. MIT press.

Miller, G. A. (1956). The magical number seven, plus or minus two: some limits on our

capacity for processing information. Psychological review, 63(2):81.

sciences, 7(3):141–144.


Plate, T. A. (2003). Holographic Reduced Representation: Distributed representation for

cognitive structures. CSLI.

46(1):77–105.

Rigotti, M., Barak, O., Warden, M. R., Wang, X.-J., Daw, N. D., Miller, E. K., and Fusi,

S. (2013). The importance of mixed selectivity in complex cognitive tasks. Nature,

497(7451):585–590.

Rumelhart, D. E., McClelland, J. L., Group, P. R., et al. (1986). Parallel distributed pro-

cessing, volume 1. MIT press.

11(7):1253–1258.

Journal of computer and system sciences, 50(1):132–150.

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic

structures in connectionist systems. Artificial intelligence, 46(1):159–216.

Socher, R., Huval, B., Manning, C. D., and Ng, A. Y. (2012). Semantic compositionality

through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on

Empirical Methods in Natural Language Processing and Computational Natural Lan-

guage Learning, pages 1201–1211. Association for Computational Linguistics.

Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural

networks. In Proceedings of the 28th International Conference on Machine Learning

(ICML-11), pages 1017–1024.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural

networks. arXiv preprint arXiv:1409.3215.

Artificial Intelligence, 46(1):5–46.

Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: the importance of nmda

receptors to working memory. The Journal of Neuroscience, 19(21):9587–9603.


Under review as a conference paper at ICLR 2016

REINFORCEMENT LEARNING NEURAL TURING MACHINES - REVISED

Wojciech Zaremba^{1,2}
New York University, Facebook AI Research
woj.zaremba@gmail.com

Ilya Sutskever^{2}
Google Brain
ilyasu@google.com

ABSTRACT

arXiv:1505.00521v3 [cs.LG] 12 Jan 2016

The Neural Turing Machine (NTM) is more expressive than all previously considered

models because of its external memory. It can be viewed as a broader effort to use

abstract external Interfaces and to learn a parametric model that interacts with them.

The capabilities of a model can be extended by providing it with proper Interfaces

that interact with the world. These external Interfaces include memory, a database,

a search engine, or a piece of software such as a theorem verifier. Some of these

Interfaces are provided by the developers of the model. However, many important

existing Interfaces, such as databases and search engines, are discrete.

We examine the feasibility of learning models to interact with discrete Interfaces. We

investigate the following discrete Interfaces: a memory Tape, an input Tape, and an

output Tape. We use a Reinforcement Learning algorithm to train a neural network

that interacts with such Interfaces to solve simple algorithmic tasks. Our Interfaces

are expressive enough to make our model Turing complete.

1 INTRODUCTION

The Neural Turing Machine (NTM) of Graves et al. (2014b) is a model that learns to interact with an external

memory that is differentiable and continuous. An external memory extends the capabilities of the NTM,

allowing it to solve tasks that were previously unsolvable by conventional machine learning methods.

This is the source of the NTM’s expressive power. In general, it appears that ML models become

significantly more powerful if they are able to learn to interact with external interfaces.

There exist a vast number of Interfaces that could be used with our models. For example, the Google

search engine is one such Interface. The search engine consumes queries (which are actions),

and outputs search results. However, the search engine is not differentiable, and the model interacts

with the Interface using discrete actions. This work examines the feasibility of learning to interact with

discrete Interfaces using the Reinforce algorithm.

Discrete Interfaces cannot be trained directly with standard backpropagation because they are not dif-

ferentiable. It is most natural to learn to interact with discrete Interfaces using Reinforcement Learning

methods. In this work, we consider an Input Tape and a Memory Tape interface with discrete access.

Our concrete proposal is to use the Reinforce algorithm to learn where to access the discrete interfaces,

and to use the backpropagation algorithm to determine what to write to the memory and to the output.

We call this model the RL–NTM.

Discrete Interfaces are computationally attractive because the cost of accessing a discrete Interface is

often independent of its size. This is not the case for continuous Interfaces, where the cost of access

scales linearly with size. This is a significant disadvantage, since slow models cannot scale to large, difficult

problems that require intensive training on large datasets. In addition, an output Interface that lets

the model decide when it wants to make a prediction allows the model’s runtime to be in principle

unbounded. If the model has an output interface of this kind together with an interface to an unbounded

memory, the model becomes Turing complete.
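The cost argument above can be made concrete with a minimal sketch contrasting the two kinds of read. The memory is assumed to be a list of equal-length float vectors, and the names `discrete_read` and `soft_read` are hypothetical.

```python
def discrete_read(memory, address):
    """Discrete access: one lookup, independent of the number of memory cells."""
    return memory[address]


def soft_read(memory, weights):
    """Continuous access: a weighted sum that touches every memory cell."""
    width = len(memory[0])
    result = [0.0] * width
    for weight, cell in zip(weights, memory):
        for j, value in enumerate(cell):
            result[j] += weight * value
    return result


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(discrete_read(memory, 2))
    print(soft_read(memory, [0.5, 0.5, 0.0]))
```

The soft read is differentiable with respect to the weights, which is what makes backpropagation applicable, but its loop over all cells is the linear cost the text refers to.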

We evaluate the RL-NTM on a number of simple algorithmic tasks. The RL-NTM succeeds on problems

such as copying an input several times to the output tape (the “repeat copy” task from Graves et al.

(2014b)), reversing a sequence, and a few more tasks of comparable difficulty. However, its success is

highly dependent on the architecture of the “controller”. We discuss this in more detail in Section 8.

1 Work done while the author was at Google.

2 Both authors contributed equally to this work.


Finally, we found it non-trivial to correctly implement the RL-NTM due to its large number of interacting

components. We developed a simple procedure to numerically check the gradients of the Reinforce

algorithm (Section 5). The procedure can be applied to problems unrelated to NTMs, and is of

independent interest. The code for this work can be found at https://github.com/ilyasu123/rlntm.

2 THE MODEL

Many difficult tasks require a prolonged, multi-step interaction with an external environment. Examples

of such environments include computer games (Mnih et al., 2013), the stock market, an advertisement

system, or the physical world (Levine et al., 2015). A model can observe a partial state from the

environment, and influence the environment through its actions. This can be seen as a general reinforcement

learning problem. However, our setting departs from classical RL in that we are free to design the

tools available for solving a given problem. Tools might cooperate with the model (e.g., by allowing backpropagation

through memory), and the tools specify the actions over the environment. We formalize this concept

under the name Interface–Controller interaction.

The external environment is exposed to the model through a number of Interfaces, each with its own

API. For instance, a human perceives the world through its senses, which include the vision Interface

and the touch Interface. The touch Interface provides methods for contracting the various muscles, and

methods for sensing the current state of the muscles, pain level, temperature and a few others. In this

work, we explore a number of simple Interfaces that allow the controller to access an input tape, a

memory tape, and an output tape.

The part of the model that communicates with Interfaces is called the Controller, which is the only part

of the system which learns. The Controller can have prior knowledge about the behavior of its Interfaces,

but this is not the case in our experiments. The Controller learns to interact with Interfaces in a way that

allows it to solve a given task. Fig. 1 illustrates the complete Interfaces–Controller abstraction.

[Figure 1 diagram. Left: a Controller mapping Past State to Future State, with the Input, Output, and Memory Interfaces shown as read methods (bottom) and write methods (top). Right: an LSTM controller reading the Current Input and Current Memory, and writing an input position increment over [-1, 0, 1], a target prediction with a decision over [0, 1] on whether to emit it, a memory position increment over [-1, 0, 1], and a new memory value vector.]

Figure 1: (Left) The Interface–Controller abstraction, (Right) an instantiation of our model as an Interface–

Controller. The bottom boxes are the read methods, and the top are the write methods. The RL–NTM makes

discrete decisions regarding the move over the input tape, the memory tape, and whether to make a prediction at a

given timestep. During training, the model’s prediction is compared with the desired output, and is used to train the

model when the RL-NTM chooses to advance its position on the output tape; otherwise it is ignored. The memory

value vector is a vector of content that is stored in the memory cell.

We now describe the RL–NTM. As a controller, it uses either an LSTM or a direct access LSTM (see

sec. 8.1 for a definition). It has a one-dimensional input tape, a one-dimensional memory, and a one-

dimensional output tape as Interfaces. Both the input tape and the memory tape have a head that reads

the Tape’s content at the current location. The head of the input tape and the memory tape can move in

any direction. However, the output tape is a write-only tape, and its head can either stay at the current

position or move forward. Fig. 2 shows an example execution trace for the entire RL–NTM on the

reverse task (sec. 6).

At the core of the RL–NTM is an LSTM controller which receives multiple inputs and has to generate

multiple outputs at each timestep. Table 1 summarizes the controller’s inputs and outputs, and the

way in which the RL–NTM is trained to produce them. The objective function of the RL–NTM is

the expected log probability of the desired outputs, where the expectation is taken over all possible

sequences of actions, weighted with probability of taking these actions. Both backpropagation and

Reinforce maximize this objective. Backpropagation maximizes the log probabilities of the model’s

predictions, while the Reinforce algorithm influences the probabilities of action sequences.


Figure 2: Execution of RL–NTM on the ForwardReverse task. At each timestep, the RL-NTM con-

sumes the value of the current input tape, the value of the current memory cell, and a representation

of all the actions that have been taken in the previous timestep (not marked on the figures). The RL-

NTM then outputs a new value for the current memory cell (marked with a star), a prediction for the

next target symbol, and discrete decisions for changing the positions of the heads on the various tapes.

The RL-NTM learns to make discrete decisions using the Reinforce algorithm, and learns to produce

continuous outputs using backpropagation.

$$\sum_{[a_1, a_2, \ldots, a_n] \in A^\dagger} p_{\text{reinforce}}(a_1, a_2, \ldots, a_n \mid \theta) \left[ \sum_{i=1}^{n} \log p_{\text{bp}}(y_i \mid x_1, \ldots, x_i, a_1, \ldots, a_i, \theta) \right]$$

$A^\dagger$ represents the space of sequences of actions that lead to the end of an episode. The probabilities in the above equation are parametrized with a neural network (the Controller). We have marked with $p_{\text{reinforce}}$ the part of the equation which is learned with Reinforce. $p_{\text{bp}}$ indicates the part of the equation optimized with classical backpropagation.

Interface | Part | Read | Write | Training
Input Tape | Head | window of values surrounding the current position | distribution over [−1, 0, 1] | Reinforce
Output Tape | Head | ∅ | distribution over [0, 1] | Reinforce
Output Tape | Content | ∅ | distribution over output vocabulary | Backpropagation
Memory Tape | Head | window of memory values surrounding the current address | distribution over [−1, 0, 1] | Reinforce
Memory Tape | Content | window of memory values surrounding the current address | vector of real values to store | Backpropagation
Miscellaneous | | all actions taken in the previous time step | ∅ | ∅

Table 1: Summary of what the Controller reads at every time step, and what it has to produce. The

“training” column indicates how the given part of the model is trained.

The RL–NTM receives a direct learning signal only when it decides to make a prediction. If it chooses to

not make a prediction at a given timestep, then it will not receive a direct learning signal. Theoretically,

we can allow the RL–NTM to run for an arbitrary number of steps without making any prediction,

hoping that after sufficiently many steps, it would decide to make a prediction. Doing so will also

provide the RL–NTM with arbitrary computational capability. However, this strategy is both unstable

and computationally infeasible. Thus, we resort to limiting the total number of computational steps to

a fixed upper bound, and force the RL–NTM to predict the next desired output whenever the number of

remaining desired outputs is equal to the number of remaining computational steps.
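The forced-prediction rule above can be sketched as a simple check on the step budget. The helper names `must_predict` and `run_episode`, and the callback `wants_to_predict` that stands in for the controller's sampled decision, are assumptions for illustration.

```python
def must_predict(steps_taken, max_steps, outputs_emitted, num_targets):
    """True when the remaining outputs can no longer be postponed."""
    remaining_steps = max_steps - steps_taken
    remaining_outputs = num_targets - outputs_emitted
    return remaining_outputs > 0 and remaining_outputs >= remaining_steps


def run_episode(wants_to_predict, max_steps, num_targets):
    """Simulate the predict / do-not-predict decisions of one episode."""
    emitted = 0
    trace = []
    for step in range(max_steps):
        if emitted == num_targets:
            break  # every desired output has been produced
        forced = must_predict(step, max_steps, emitted, num_targets)
        predict = forced or wants_to_predict(step)
        trace.append(predict)
        emitted += int(predict)
    return emitted, trace


if __name__ == "__main__":
    # A controller that never volunteers a prediction is forced at the end.
    print(run_episode(lambda step: False, max_steps=10, num_targets=3))
```

With this rule every episode emits all desired outputs within the fixed upper bound, even if the policy itself never chooses to predict.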

3 RELATED WORK

This work is most similar to the Neural Turing Machine (Graves et al., 2014b). The NTM is an

ambitious, computationally universal model that can be trained (or “automatically programmed”) with

the backpropagation algorithm using only input-output examples.

Following the introduction of the NTM, several other memory-based models have been introduced. All of

them can be seen as part of a larger community effort. These models are constructed according to the

Interface–Controller abstraction (Section 2).

Neural Turing Machine (NTM) (Graves et al., 2014a) has a modified LSTM as the Controller, and the

following three Interfaces: a sequential input, a delayed Output, and a differentiable Memory.


Weakly supervised Memory Network (Sukhbaatar et al., 2015) uses a feed forward network as the

Controller, and has a differentiable soft-attention Input, and Delayed Output as Interfaces.

Stack RNN (Joulin & Mikolov, 2015) has an RNN as the Controller, and the sequential input, a differen-

tiable memory stack, and sequential output as Interfaces. It also uses search to improve its performance.

Neural DeQue (Grefenstette et al., 2015) has an LSTM as the Controller, and a Sequential Input, a

differentiable Memory Queue, and the Sequential Output as Interfaces.

Our model fits into the Interfaces–Controller abstraction. It has a direct access LSTM as the Controller

(or LSTM or feed forward network), and its three interfaces are the Input Tape, the Memory Tape, and

the Output Tape. All three Interfaces of the RL–NTM are discrete and cannot be trained only with

backpropagation.

This prior work investigates continuous and differentiable Interfaces, while we consider discrete In-

terfaces. Discrete Interfaces are more challenging to train because backpropagation cannot be used.

However, many external Interfaces are inherently discrete, even though humans can easily use them

(apparently without using continuous backpropagation). For instance, one interacts with the Google

search engine with discrete actions. This work examines the possibility of learning models that interact

with discrete Interfaces with the Reinforce algorithm.

The Reinforce algorithm (Williams, 1992) is a classical RL algorithm, which has been applied to the

broad spectrum of planning problems (Peters & Schaal, 2006; Kohl & Stone, 2004; Aberdeen & Baxter,

2002). In addition, it has been applied in object recognition to implement visual attention (Mnih et al.,

2014; Ba et al., 2014). This work uses Reinforce to train an attention mechanism: we use it to train how

to access the various tapes provided to the model.

The RL–NTM can postpone prediction for an arbitrary number of timesteps, and in principle has access

to an unbounded memory. As a result, the RL-NTM is Turing complete in principle. There have been

very few prior models that are Turing complete (Schmidhuber, 2012; 2004). Although our model is

Turing complete, it is not very powerful because it is very difficult to train, and our model can solve

only relatively simple problems. Moreover, the RL–NTM does not exploit Turing completeness, as

none of the tasks that it solves require superlinear runtime to be solved.

4 THE REINFORCE ALGORITHM

Notation

Let $A$ be a space of actions, and $A^\dagger$ be the space of all sequences of actions that cause an episode to end (so $A^\dagger \subset A^*$). An action at time-step $t$ is denoted by $a_t$. We denote the time at the end of an episode by $T$ (this is not completely formal, as episodes can vary in length). Let $a_{1:t}$ stand for a sequence of actions $[a_1, a_2, \ldots, a_t]$. Let $r(a_{1:t})$ denote the reward achieved at time $t$, having executed the sequence of actions $a_{1:t}$, and let $R(a_{1:T})$ be the cumulative reward, namely $R(a_{k:T}) = \sum_{t=k}^{T} r(a_{1:t})$. Let $p_\theta(a_t \mid a_{1:(t-1)})$ be a parametric conditional probability of an action $a_t$ given all previous actions $a_{1:(t-1)}$. Finally, $p_\theta$ is a policy parametrized by $\theta$.

This work relies on learning discrete actions with the Reinforce algorithm (Williams, 1992). We now

describe this algorithm in detail. Moreover, the supplementary materials include descriptions of tech-

niques for reducing variance of the gradient estimators.

The goal of reinforcement learning is to maximize the sum of future rewards. The Reinforce algorithm

(Williams, 1992) does so directly by optimizing the parameters of the policy pθ (at |a1:(t−1) ). Reinforce

follows the gradient of the sum of the future rewards. The objective function for episodic Reinforce can

be expressed as the sum over all sequences of valid actions that cause the episode to end:

$$J(\theta) = \sum_{[a_1, a_2, \ldots, a_T] \in A^\dagger} p_\theta(a_1, a_2, \ldots, a_T)\, R(a_1, a_2, \ldots, a_T) = \sum_{a_{1:T} \in A^\dagger} p_\theta(a_{1:T})\, R(a_{1:T})$$

This sum iterates over all possible sequences of actions. This set is usually exponentially large or even infinite,

so the sum cannot be computed exactly and cheaply for most problems. However, it can be written as


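$$\begin{aligned}
J(\theta) &= \sum_{a_{1:T} \in A^\dagger} p_\theta(a_{1:T})\, R(a_{1:T}) \\
&= \mathbb{E}_{a_{1:T} \sim p_\theta} \sum_{t=1}^{T} r(a_{1:t}) \\
&= \mathbb{E}_{a_1 \sim p_\theta(a_1)}\, \mathbb{E}_{a_2 \sim p_\theta(a_2 \mid a_1)} \cdots \mathbb{E}_{a_T \sim p_\theta(a_T \mid a_{1:(T-1)})} \sum_{t=1}^{T} r(a_{1:t})
\end{aligned}$$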

The last expression suggests a procedure to estimate J(θ): simply sequentially sample each at from

the model distribution pθ (at |a1:(t−1) ) for t from 1 to T . The unbiased estimator of J(θ) is the sum of

r(a1:t ). This gives us an algorithm to estimate J(θ). However, the main interest is in training a model

to maximize this quantity.

The Reinforce algorithm maximizes $J(\theta)$ by following its gradient:

$$\partial_\theta J(\theta) = \sum_{a_{1:T} \in A^\dagger} \big[\partial_\theta p_\theta(a_{1:T})\big]\, R(a_{1:T})$$

However, the above expression is a sum over the set of all possible action sequences, so it cannot be computed directly for most $A^\dagger$. Once again, the Reinforce algorithm rewrites this sum as an expectation that is approximated with sampling. It relies on the identity

$$\partial_\theta f(\theta) = f(\theta)\, \frac{\partial_\theta f(\theta)}{f(\theta)} = f(\theta)\, \partial_\theta [\log f(\theta)].$$

This identity is valid as long as $f(\theta) \neq 0$. As typical neural network parametrizations of distributions assign non-zero probability to every action, this condition holds for $f = p_\theta$. We have that:

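$$\begin{aligned}
\partial_\theta J(\theta) &= \sum_{a_{1:T} \in A^\dagger} \big[\partial_\theta p_\theta(a_{1:T})\big]\, R(a_{1:T}) \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a_{1:T}) \big[\partial_\theta \log p_\theta(a_{1:T})\big]\, R(a_{1:T}) \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a_{1:T}) \Big[\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})\Big]\, R(a_{1:T}) \\
&= \mathbb{E}_{a_1 \sim p_\theta(a_1)}\, \mathbb{E}_{a_2 \sim p_\theta(a_2 \mid a_1)} \cdots \mathbb{E}_{a_T \sim p_\theta(a_T \mid a_{1:(T-1)})} \Big[\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})\Big] \Big[\sum_{t=1}^{T} r(a_{1:t})\Big]
\end{aligned}$$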

The last expression gives us an algorithm for estimating $\partial_\theta J(\theta)$, sketched on the left side of Figure 3. It is easiest to describe with respect to the computational graph behind a neural network. Reinforce can be implemented as follows. A neural network outputs $l_t = \log p_\theta(a_t \mid a_{1:(t-1)})$. Sequentially sample action $a_t$ from the distribution $e^{l_t}$, and execute the sampled action $a_t$. Simultaneously, experience a reward $r(a_{1:t})$. Backpropagate the sum of the rewards $\sum_{t=1}^{T} r(a_{1:t})$ to every node $\partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})$.
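The sampling procedure above can be sketched for a toy policy. This is a simplified illustration rather than the RL-NTM: the policy is assumed to be one independent softmax per time step (it does not condition on earlier actions), `theta[t][k]` is the logit of action `k` at step `t`, and all function names are hypothetical. For a softmax, the gradient of the log probability of the sampled action with respect to logit `k` is `1[k == a_t] - p_k`.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_episode(theta, reward_fn, rng):
    """Sample a_1..a_T step by step and accumulate the rewards r(a_{1:t})."""
    actions, total_reward = [], 0.0
    for logits in theta:
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        actions.append(action)
        total_reward += reward_fn(actions)
    return actions, total_reward


def reinforce_gradient(theta, reward_fn, num_samples, rng):
    """Monte Carlo estimate of dJ/dtheta: mean of grad log p(a) times R(a)."""
    grad = [[0.0] * len(row) for row in theta]
    for _ in range(num_samples):
        actions, total_reward = sample_episode(theta, reward_fn, rng)
        for t, action in enumerate(actions):
            probs = softmax(theta[t])
            for k, p in enumerate(probs):
                indicator = 1.0 if k == action else 0.0
                grad[t][k] += (indicator - p) * total_reward / num_samples
    return grad


if __name__ == "__main__":
    # Reward 1 at every step whose action is 1, otherwise 0.
    reward = lambda actions: 1.0 if actions[-1] == 1 else 0.0
    theta = [[0.0, 0.0], [0.0, 0.0]]
    print(reinforce_gradient(theta, reward, 20000, random.Random(0)))
```

For this toy problem the exact gradient is 0.25 for the logit of action 1 and -0.25 for the logit of action 0 at each step, so the printed estimate should be close to those values, with noise that shrinks as the number of samples grows.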

We have derived an unbiased estimator for the sum of future rewards, and the unbiased estimator of its

gradient. However, the derived gradient estimator has high variance, which makes learning difficult.

RL–NTM employs several techniques to reduce gradient estimator variance: (1) future rewards back-

propagation, (2) online baseline prediction, and (3) offline baseline prediction. All these techniques are

crucial to solve our tasks. We provide detailed description of techniques in the Supplementary material.

Finally, we needed a way of verifying the correctness of our implementation. We discovered a technique

that makes it possible to easily implement a gradient checker for nearly any model that uses Reinforce.

Section 5 describes this technique.

5 GRADIENT CHECKING

The RL–NTM is complex, so we needed to find an automated way of verifying the correctness of

our implementation. We discovered a technique that makes it possible to easily implement a gradient

checker for nearly any model that uses Reinforce. This discovery is an independent contribution of this work. This section describes gradient checking for any implementation of the Reinforce algorithm that uses a general function for sampling from a multinomial distribution.

Figure 3: Sketch of the algorithms: (Left) the Reinforce algorithm, (Right) gradient checking for the Reinforce algorithm. Red indicates the steps that must be overridden for Reinforce to become its own gradient checker.

Reinforce gradient verification should ensure that the expected gradient over all sequences of actions matches the numerical derivative of the expected objective. However, even for a tiny problem, we would need to draw billions of samples to achieve estimates accurate enough to tell whether there is a match or a mismatch. Instead, we developed a technique which avoids sampling, and allows for gradient verification of Reinforce within seconds on a laptop.

First, we have to reduce the size of our task to make sure that the number of possible actions is

manageable (e.g., $< 10^4$). This is similar to conventional gradient checkers, which can only be applied

to small models. Next, we enumerate all possible sequences of actions that terminate the episode. By

definition, these are precisely all the elements of A† .

The key idea is the following: we replace the sampling function, which turns a multinomial distribution into a random sample, with a deterministic function that chooses actions from an appropriate action sequence from $A^\dagger$ while accumulating their probabilities. Calling the modified sampler repeatedly produces every possible action sequence from $A^\dagger$ exactly once.

For efficiency, it is desirable to use a single minibatch whose size is $\#A^\dagger$. The sampling function needs to be adapted so that it incrementally outputs the appropriate sequence from $A^\dagger$ as we repeatedly call it. At the end of the minibatch, the sampling function will have access to the total probability of each action sequence ($\prod_t p_\theta(a_t \mid a_{1:t-1})$), which in turn can be used to exactly compute $J(\theta)$ and its derivative. To compute the derivative, the Reinforce gradient produced by each sequence $a_{1:T} \in A^\dagger$ should be weighted by its probability $p_\theta(a_{1:T})$. We summarize this procedure in Figure 3.
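The enumeration idea above can be sketched on a toy problem. This is a simplified stand-in, not the authors' implementation: instead of overriding a sampling function inside a minibatch, it loops directly over every action sequence, and it assumes a policy made of one independent softmax per time step with fixed-length episodes. All names are hypothetical, and `return_fn` is assumed to give the cumulative reward of a complete action sequence.

```python
import itertools
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def all_sequences(theta):
    """Enumerate every action sequence of the toy problem exactly once."""
    return itertools.product(range(len(theta[0])), repeat=len(theta))


def sequence_prob(theta, actions):
    """Total probability of a sequence: the product of per-step probabilities."""
    prob = 1.0
    for logits, action in zip(theta, actions):
        prob *= softmax(logits)[action]
    return prob


def exact_objective(theta, return_fn):
    """J(theta) as the probability-weighted sum of returns over all sequences."""
    return sum(sequence_prob(theta, a) * return_fn(a) for a in all_sequences(theta))


def exact_reinforce_gradient(theta, return_fn):
    """Reinforce gradient of each sequence, weighted by its probability."""
    grad = [[0.0] * len(row) for row in theta]
    for actions in all_sequences(theta):
        weight = sequence_prob(theta, actions) * return_fn(actions)
        for t, action in enumerate(actions):
            probs = softmax(theta[t])
            for k, p in enumerate(probs):
                grad[t][k] += weight * ((1.0 if k == action else 0.0) - p)
    return grad


def numerical_gradient(theta, return_fn, eps=1e-5):
    """Central finite differences of the exact objective."""
    grad = [[0.0] * len(row) for row in theta]
    for t in range(len(theta)):
        for k in range(len(theta[t])):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[t][k] += eps
            minus[t][k] -= eps
            diff = exact_objective(plus, return_fn) - exact_objective(minus, return_fn)
            grad[t][k] = diff / (2 * eps)
    return grad


def max_abs_difference(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    target = (2, 0)
    return_fn = lambda actions: float(sum(x == y for x, y in zip(actions, target)))
    theta = [[0.1, -0.3, 0.2], [0.5, 0.0, -0.2]]
    analytic = exact_reinforce_gradient(theta, return_fn)
    numeric = numerical_gradient(theta, return_fn)
    print(max_abs_difference(analytic, numeric))
```

Because no sampling is involved, the two gradients should agree up to finite-difference error, so any larger discrepancy points to a bug in the gradient code rather than to Monte Carlo noise.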

Gradient checking is critical for ensuring the correctness of our implementation. While the basic

Reinforce algorithm is conceptually simple, the RL–NTM is fairly complicated, as Reinforce is used

to train several Interfaces of our model. Moreover, the RL–NTM uses three separate techniques for

reducing the variance of the gradient estimators. The model’s high complexity greatly increases the

probability of a code error. In particular, our early implementations were incorrect, and we were able to

fix them only after implementing gradient checking.

6 TASKS

This section defines tasks used in the experiments. Figure 4 shows exemplary instantiations of our tasks.

Table 2 summarizes the Interfaces that are available for each task.


                   Interface
Task               Input Tape   Memory Tape
Copy               ✓            ×
DuplicatedInput    ✓            ×
Reverse            ✓            ×
RepeatCopy         ✓            ×
ForwardReverse     ×            ✓

Table 2: This table marks the available Interfaces for each task. The difficulty of a task is dependent on

the type of Interfaces available to the model.

Figure 4: This figure presents the initial state for every task. The yellow box indicates the starting position of the reading head over the Input Interface. The gray characters on the Output Tape represent the target symbols. Our tasks involve reordering symbols, and the symbols $x_i$ have been picked uniformly from a set of size 30.

Copy. A generic input is $x_1 x_2 x_3 \ldots x_C \emptyset$ and the desired output is $x_1 x_2 \ldots x_C \emptyset$. Thus the goal is to repeat the input. The length of the input sequence is variable and is allowed to change. The input sequence and the desired output both terminate with a special end-of-sequence symbol $\emptyset$.

DuplicatedInput. A generic input has the form $x_1 x_1 x_1 x_2 x_2 x_2 x_3 \ldots x_{C-1} x_C x_C x_C \emptyset$ while the desired output is $x_1 x_2 x_3 \ldots x_C \emptyset$. Thus each input symbol is replicated three times, so the RL-NTM must emit every third input symbol.

Reverse. A generic input is $x_1 x_2 \ldots x_{C-1} x_C \emptyset$ and the desired output is $x_C x_{C-1} \ldots x_2 x_1 \emptyset$.

RepeatCopy. A generic input is $m x_1 x_2 x_3 \ldots x_C \emptyset$ and the desired output is $x_1 x_2 \ldots x_C x_1 \ldots x_C x_1 \ldots x_C \emptyset$, where the number of copies is given by $m$. Thus the goal is to copy the input $m$ times, where $m$ can be only 2 or 3.

ForwardReverse. The task is identical to Reverse, but the RL-NTM is only allowed to move its input tape pointer forward. This means that a perfect solution must use the NTM’s external memory.
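The task definitions above can be turned into input/target pairs with a few lines of code. The following is a minimal sketch, not the authors' data pipeline: the function name `make_example`, the symbol names `s1` to `s30`, and the `<eos>` marker are hypothetical choices made for illustration.

```python
import random

EOS = "<eos>"                                # stands in for the end-of-sequence symbol
ALPHABET = [f"s{i}" for i in range(1, 31)]   # 30 distinct symbols

def make_example(task, length, rng, repeats=None):
    """Return (input_tape, target_output) for one task instance."""
    xs = [rng.choice(ALPHABET) for _ in range(length)]
    if task == "Copy":
        return xs + [EOS], xs + [EOS]
    if task == "DuplicatedInput":
        # Every symbol appears three times on the input tape.
        return [x for x in xs for _ in range(3)] + [EOS], xs + [EOS]
    if task in ("Reverse", "ForwardReverse"):
        # Same input/target pairs; the two tasks differ only in the Interfaces.
        return xs + [EOS], xs[::-1] + [EOS]
    if task == "RepeatCopy":
        m = repeats if repeats is not None else rng.choice([2, 3])
        return [m] + xs + [EOS], xs * m + [EOS]
    raise ValueError(f"unknown task: {task}")

rng = random.Random(0)
dup_in, dup_out = make_example("DuplicatedInput", 4, rng)
rev_in, rev_out = make_example("Reverse", 5, rng)
rep_in, rep_out = make_example("RepeatCopy", 3, rng, repeats=2)
```

Note that the restriction on head movement in ForwardReverse is a property of the environment and is not modeled by this generator.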

7 CURRICULUM LEARNING

Humans and animals learn much better when the examples are not randomly presented but organized in a

meaningful order which illustrates gradually more concepts, and gradually more complex ones. . . . and

call them “curriculum learning”.

Bengio et al. (2009)

We were unable to solve tasks with RL–NTM by training it on the difficult instances of the problems

(where difficult usually means long). To succeed, we had to create a curriculum of tasks of increasing

complexity. We verified that our tasks were completely unsolvable (in an all-or-nothing sense) for

all but the shortest sequences when we did not use a curriculum. In our experiments, we measure

the complexity c of a problem instance by the maximal length of the desired output to typical inputs.

During training, we maintain a distribution over the task complexity. We shift the distribution over the

task complexities whenever the performance of the RL–NTM exceeds a threshold. Then, our model

focuses on more difficult problem instances as its performance improves.

Probability   Sampled complexity d
10%           d is drawn uniformly at random from the possible task complexities
25%           d is drawn uniformly from [1, C + e]
65%           d = C + e

Table 3: The curriculum learning distribution, indexed by C. Here e is a sample from a geometric distribution whose success probability is 1/2, i.e., $p(e = k) = 1/2^k$.


The distribution over task complexities is indexed with an integer c, and is defined in Table 3. While

we have not tuned the coefficients in the curriculum learning setup, we experimentally verified that it is

critical to always maintain non-negligible mass over the hardest difficulty levels (Zaremba & Sutskever,

2014). Removing it makes the curriculum much less effective.

Whenever the average zero-one-loss (normalized by the length of the target sequence) of our RL–NTM decreases below 0.2, we increase c by 1. We keep doing so until c reaches its maximal allowable value. Finally, we enforce a refractory period to ensure that successive increments of C are separated by at least 100 parameter updates, since we encountered situations where C increased in rapid succession, which consistently caused learning to fail.
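The sampling distribution of Table 3 and the update rule described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the curriculum index is called `c`, clipping the sampled complexity to the maximum is an assumption, and so is allowing the first increment to happen immediately.

```python
import random

def sample_geometric(rng):
    """Sample e with p(e = k) = 1 / 2**k for k = 1, 2, ..."""
    k = 1
    while rng.random() >= 0.5:
        k += 1
    return k

def sample_complexity(c, max_c, rng):
    """Draw a task complexity d from the three-way mixture of Table 3."""
    e = sample_geometric(rng)
    u = rng.random()
    if u < 0.10:
        d = rng.randint(1, max_c)      # 10%: any possible complexity
    elif u < 0.35:
        d = rng.randint(1, c + e)      # 25%: uniform on [1, c + e]
    else:
        d = c + e                      # 65%: slightly harder than the current level
    return min(d, max_c)               # assumption: clip to the hardest level

class Curriculum:
    def __init__(self, max_c, threshold=0.2, refractory=100):
        self.c = 1
        self.max_c = max_c
        self.threshold = threshold
        self.refractory = refractory
        self.since_increment = refractory   # assumption: first increment may be immediate

    def update(self, avg_loss):
        """Call once per parameter update with the normalized zero-one loss."""
        self.since_increment += 1
        ready = self.since_increment >= self.refractory
        if avg_loss < self.threshold and ready and self.c < self.max_c:
            self.c += 1
            self.since_increment = 0
```

A training loop would call `sample_complexity(cur.c, cur.max_c, rng)` to pick the length of each training instance and `cur.update(loss)` after every parameter update.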

8 CONTROLLERS

The success of reinforcement learning training depends heavily on the complexity of the controller and on how easy it is to train. It is common either to limit the number of parameters of the network, or to constrain it by initializing from a model pretrained on some other task (for instance, an object recognition network for robotics). Ideally, models should be generic enough not to need such “tricks”. However, some tasks still require building task-specific architectures.

Figure 5: LSTM as a controller.

This work considers two controllers. The first is an LSTM (Fig. 5), and the second is a direct access controller (Fig. 6). The LSTM is a generic controller that in principle should be powerful enough to solve any of the considered tasks. However, it has trouble solving many of them. The direct access controller is a much better fit for symbol rearrangement tasks; however, it is not a generic solution.

All the tasks that we consider involve rearranging the input symbols in some way. For example, a

typical task is to reverse a sequence (section 6 lists the tasks). For such tasks, the controller would

benefit from a built-in mechanism for directly copying an appropriate input to memory and to the output.

Such a mechanism would free the LSTM controller from remembering the input symbol in its control

variables (“registers”), and would shorten the backpropagation paths and therefore make learning easier.

We implemented this mechanism by adding the input to the memory and the output, and also adding the memory to the output and to the adjacent memories (Fig. 6), while modulating these additive contributions by a dynamic scalar (sigmoid) which is computed from the controller’s state. This way, the controller can decide to effectively not add the current input to the output at a given timestep.

Unfortunately the necessity of this architectural modification is a drawback of our implementation,

since it is not domain independent and would therefore not improve the performance of the RL–NTM

on many tasks of interest.
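The gated additive connections of the direct access controller can be illustrated with a small sketch. It only shows the gating arithmetic for one connection (input to output); the gate logits are hypothetical numbers standing in for values that would be computed from the controller's state, and none of the names below come from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_add(target, source, gate_logit):
    """Return target + sigmoid(gate_logit) * source, elementwise."""
    gate = sigmoid(gate_logit)
    return [t + gate * s for t, s in zip(target, source)]

# Hypothetical values: a one-hot input symbol and a zero controller output.
input_symbol = [0.0, 1.0, 0.0]
controller_output = [0.0, 0.0, 0.0]

copied = gated_add(controller_output, input_symbol, gate_logit=20.0)    # gate close to 1
ignored = gated_add(controller_output, input_symbol, gate_logit=-20.0)  # gate close to 0
```

With an open gate the input symbol passes straight to the output, so the controller does not have to carry it through its hidden state; with a closed gate the contribution vanishes.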

                   Controller
Task               LSTM   Direct Access
Copy               ✓      ✓
DuplicatedInput    ✓      ✓
Reverse            ×      ✓
ForwardReverse     ×      ✓
RepeatCopy         ×      ✓


9 EXPERIMENTS

We present results of training the RL–NTM on all of the aforementioned tasks. The main drawback of our experiments is the lack of comparison to other models. However, the tasks that we consider have to be considered in conjunction with the available Interfaces, and other models haven’t been considered with the same set of Interfaces. The statement “this model solves addition” is difficult to assess, as the way that digits are delivered defines the task difficulty.

The closest model to ours is NTM, and the shared task that they consider is copying. We are able to

generalize with copying to an arbitrary length. However, our Interfaces make this task very simple.

Table 4 summarizes results.

We trained our model using SGD with a fixed learning rate of 0.05 and a fixed momentum of 0.9. We

used a batch of size 200, which we found to work better than smaller batch sizes (such as 50 or 20).

We normalized the gradient by batch size but not by sequence length. We independently clip the norm

of the gradients w.r.t. the RL-NTM parameters to 5, and the gradient w.r.t. the baseline network to 2.

We initialize the RL–NTM controller and the baseline model using a Gaussian with standard deviation

0.1. We used an inverse temperature of 0.01 for the different action distributions. Doing so reduced

the effective learning rate of the Reinforce derivatives. The memory consists of 35 real values through

which we backpropagate. The initial memory state and the controller’s initial hidden states were set to

the zero vector.
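The independent norm clipping mentioned above (threshold 5 for the RL-NTM parameters, 2 for the baseline network) can be sketched as follows. The function name and the flat-list representation of gradients are illustrative assumptions, not details taken from the paper.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Hypothetical gradient values, clipped separately for the two networks.
model_grads = clip_by_norm([3.0, 4.0, 12.0], max_norm=5.0)   # norm 13, rescaled to 5
baseline_grads = clip_by_norm([0.6, 0.8], max_norm=2.0)      # norm 1, left unchanged
```

Clipping the two gradients separately keeps a large baseline error from shrinking the update applied to the controller, and vice versa.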


Figure 7: (Left) Trace of the ForwardReverse solution; (Right) trace of RepeatInput. The vertical axis depicts execution time. The rows show the input pointer, output pointer, and memory pointer (with the ∗ symbol) at each step of the RL-NTM’s execution. Note that we represent the set {1, . . . , 30} with 30 distinct symbols, and lack of prediction with #.

The ForwardReverse task is particularly interesting. In order to solve the problem, the RL–NTM has

to move to the end of the sequence without making any predictions. While doing so, it has to store

the input sequence into its memory (encoded in real values), and use its memory when reversing the

sequence (Fig. 7).

We have also experimented with a number of additional tasks but with less empirical success. Tasks we

found to be too difficult include sorting and long integer addition (in base 3 for simplicity), and Repeat-

Copy when the input tape is forced to only move forward. While we were able to achieve reasonable

performance on the sorting task, the RL–NTM learned an ad-hoc algorithm and made excessive use of

its controller memory in order to sort the sequence.

Empirically, we found all the components of the RL-NTM essential to successfully solving these problems. All our tasks are either solvable in under 20,000 parameter updates or fail no matter how many updates are used. We were completely unable to solve RepeatCopy, Reverse, and ForwardReverse with the LSTM controller, but we succeeded with the direct access controller. Moreover, we were also unable to solve any of these problems at all without a curriculum (except for short sequences of length 5). We present more traces for our tasks in the supplementary material (together with failure traces).


10 CONCLUSIONS

We have shown that the Reinforce algorithm is capable of training an NTM-style model to solve very

simple algorithmic problems. While the Reinforce algorithm is very general and is easily applicable to

a wide range of problems, it seems that learning memory access patterns with Reinforce is difficult.

Our gradient checking procedure for Reinforce can be applied to a wide variety of implementations. We

also found it extremely useful: without it, we had no way of being sure that our gradient was correct,

which made debugging and tuning much more difficult.

11 ACKNOWLEDGMENTS

We thank Christopher Olah for the LSTM figure that has been used in the paper, and Tencia Lee for revising the paper.

REFERENCES

Aberdeen, Douglas and Baxter, Jonathan. Scaling internal-state policy-gradient methods for POMDPs. In Machine

Learning: International Workshop then Conference, pp. 3–10, 2002.

Ba, Jimmy, Mnih, Volodymyr, and Kavukcuoglu, Koray. Multiple object recognition with visual attention. arXiv

preprint arXiv:1412.7755, 2014.

Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings

of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014a.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014b.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to transduce with

unbounded memory. arXiv preprint arXiv:1506.02516, 2015.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv

preprint arXiv:1503.01007, 2015.

Kohl, Nate and Stone, Peter. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Robotics

and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, volume 3, pp. 2619–

2624. IEEE, 2004.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies.

arXiv preprint arXiv:1504.00702, 2015.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and

Riedmiller, Martin. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, et al. Recurrent models of visual attention. In Advances in Neural

Information Processing Systems, pp. 2204–2212, 2014.

Peters, Jan and Schaal, Stefan. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006

IEEE/RSJ International Conference on, pp. 2219–2225. IEEE, 2006.

Schmidhuber, Jürgen. Optimal ordered problem solver. Machine Learning, 54(3):211–254, 2004.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. Weakly supervised memory networks.

arXiv preprint arXiv:1503.08895, 2015.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning.

Machine learning, 8(3-4):229–256, 1992.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.


We present here several techniques to decrease the variance of the gradient estimate for Reinforce. We have employed all of these tricks in our RL–NTM implementation.

We expand the notation introduced in Sec. 4. Let $A^\ddagger$ denote all valid subsequences of actions (i.e. $A^\ddagger \subset A^\dagger \subset A^*$). Moreover, we define the set of sequences of actions that are valid after executing a sequence $a_{1:t}$, and that terminate. We denote this set by $A^\dagger_{a_{1:t}}$. Every sequence $a_{(t+1):T} \in A^\dagger_{a_{1:t}}$ terminates an episode.

CAUSALITY OF ACTIONS

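Actions at time t cannot possibly influence rewards obtained in the past, because the past rewards are caused by actions prior to them. This idea allows us to derive an unbiased estimator of $\partial_\theta J(\theta)$ with lower variance. Here, we formalize it:

$$
\begin{aligned}
\partial_\theta J(\theta)
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a) \big[\partial_\theta \log p_\theta(a)\big] R(a) \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a) \big[\partial_\theta \log p_\theta(a)\big] \Big[\sum_{t=1}^{T} r(a_{1:t})\Big] \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a) \Big[\sum_{t=1}^{T} \partial_\theta \log p_\theta(a)\, r(a_{1:t})\Big] \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a) \Big[\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_{1:t})\, r(a_{1:t}) + \partial_\theta \log p_\theta(a_{(t+1):T} \mid a_{1:t})\, r(a_{1:t})\Big] \\
&= \sum_{a_{1:T} \in A^\dagger} \sum_{t=1}^{T} \Big[ p_\theta(a_{1:t})\, \partial_\theta \log p_\theta(a_{1:t})\, r(a_{1:t}) + p_\theta(a)\, \partial_\theta \log p_\theta(a_{(t+1):T} \mid a_{1:t})\, r(a_{1:t}) \Big] \\
&= \sum_{a_{1:T} \in A^\dagger} \sum_{t=1}^{T} \Big[ p_\theta(a_{1:t})\, \partial_\theta \log p_\theta(a_{1:t})\, r(a_{1:t}) + p_\theta(a_{1:t})\, r(a_{1:t})\, \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) \Big] \\
&= \Big[\sum_{a_{1:T} \in A^\dagger} \sum_{t=1}^{T} p_\theta(a_{1:t})\, \partial_\theta \log p_\theta(a_{1:t})\, r(a_{1:t})\Big] + \Big[\sum_{a_{1:T} \in A^\dagger} \sum_{t=1}^{T} p_\theta(a_{1:t})\, r(a_{1:t})\, \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t})\Big]
\end{aligned}
$$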

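We will show that the second term on the right-hand side of this equation is equal to zero. It is zero because the future actions $a_{(t+1):T}$ don’t influence past rewards $r(a_{1:t})$. Here we formalize it; we use the identity $\sum_{a_{(t+1):T} \in A^\dagger_{a_{1:t}}} p_\theta(a_{(t+1):T} \mid a_{1:t}) = 1$:

$$
\begin{aligned}
&\sum_{a_{1:T} \in A^\dagger} \sum_{t=1}^{T} p_\theta(a_{1:t})\, r(a_{1:t})\, \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) \\
&\quad= \sum_{a_{1:t} \in A^\ddagger} \Big[ p_\theta(a_{1:t})\, r(a_{1:t}) \sum_{a_{(t+1):T} \in A^\dagger_{a_{1:t}}} \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) \Big] \\
&\quad= \sum_{a_{1:t} \in A^\ddagger} p_\theta(a_{1:t})\, r(a_{1:t})\, \partial_\theta 1 = 0
\end{aligned}
$$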

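$$
\begin{aligned}
\partial_\theta J(\theta)
&= \sum_{a_{1:T} \in A^\dagger} \sum_{t=1}^{T} p_\theta(a_{1:t})\, \partial_\theta \log p_\theta(a_{1:t})\, r(a_{1:t}) \\
&= E_{a_1 \sim p_\theta(a)} E_{a_2 \sim p_\theta(a \mid a_1)} \cdots E_{a_T \sim p_\theta(a \mid a_{1:(T-1)})} \Big[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \sum_{i=t}^{T} r(a_{1:i}) \Big]
\end{aligned}
$$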


The last line of the derivation describes the learning algorithm. This can be implemented as follows. A neural network outputs $l_t = \log p_\theta(a_t \mid a_{1:(t-1)})$. We sequentially sample action $a_t$ from the distribution $e^{l_t}$, and execute the sampled action $a_t$. Simultaneously, we experience a reward $r(a_{1:t})$. We should backpropagate to the node $\partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})$ the sum of rewards starting from time step $t$: $\sum_{i=t}^{T} r(a_{1:i})$. The only difference in comparison to the initial algorithm is that we backpropagate the sum of rewards starting from the current time step, instead of the sum of rewards over the entire episode.
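The quantity backpropagated at each step, the sum of rewards from the current time step onward, can be computed in a single backward pass over the episode. A minimal sketch (the function name is chosen here for illustration):

```python
def rewards_to_go(rewards):
    """For each step t, return the sum of rewards[t:], computed in one backward pass."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        out[t] = running
    return out

# Hypothetical per-step rewards r(a_{1:t}) for a four-step episode.
weights = rewards_to_go([1.0, 0.0, -1.0, 2.0])   # [2.0, 1.0, 1.0, 2.0]
```

Each entry of `weights` multiplies the gradient of the log-probability of the action taken at the corresponding step, in place of the total episode reward.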

Online baseline prediction is based on the idea that the importance of a reward is determined by its relation to other rewards. All the rewards could be shifted by a constant, and such a change shouldn’t affect their relation; thus it shouldn’t influence the expected gradient. However, it could decrease the variance of the gradient estimate.

The aforementioned shift is called the baseline, and it can be estimated separately for every time-step. We have that:

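$$
\sum_{a_{(t+1):T} \in A^\dagger_{a_{1:t}}} p_\theta(a_{(t+1):T} \mid a_{1:t}) = 1
$$

$$
\sum_{a_{(t+1):T} \in A^\dagger_{a_{1:t}}} \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) = 0
$$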

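We are allowed to subtract the above quantity (multiplied by $b_t$) from our estimate of the gradient without changing its expected value:

$$
\partial_\theta J(\theta) = E_{a_1 \sim p_\theta(a)} E_{a_2 \sim p_\theta(a \mid a_1)} \cdots E_{a_T \sim p_\theta(a \mid a_{1:(T-1)})} \Big[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \sum_{i=t}^{T} \big(r(a_{1:i}) - b_t\big) \Big]
$$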

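The above statement holds for any sequence of $b_t$. We aim to find the sequence $b_t$ that yields the lowest-variance estimator of $\partial_\theta J(\theta)$. The variance of our estimator is:

$$
\begin{aligned}
\mathrm{Var} ={}& E_{a_1 \sim p_\theta(a)} E_{a_2 \sim p_\theta(a \mid a_1)} \cdots E_{a_T \sim p_\theta(a \mid a_{1:(T-1)})} \Big[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \sum_{i=t}^{T} \big(r(a_{1:i}) - b_t\big) \Big]^2 \\
&- \Big[ E_{a_1 \sim p_\theta(a)} E_{a_2 \sim p_\theta(a \mid a_1)} \cdots E_{a_T \sim p_\theta(a \mid a_{1:(T-1)})} \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \sum_{i=t}^{T} \big(r(a_{1:i}) - b_t\big) \Big]^2
\end{aligned}
$$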

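The second term doesn’t depend on $b_t$, and the variance is always non-negative. It is therefore sufficient to minimize the first term. The first term is minimal when its derivative with respect to $b_t$ is zero. This implies

$$
E_{a_1 \sim p_\theta(a)} E_{a_2 \sim p_\theta(a \mid a_1)} \cdots E_{a_T \sim p_\theta(a \mid a_{1:(T-1)})} \Big[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \sum_{i=t}^{T} \big(r(a_{1:i}) - b_t\big) \Big] = 0
$$

$$
\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \sum_{i=t}^{T} \big(r(a_{1:i}) - b_t\big) = 0
$$

$$
b_t = \frac{\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \sum_{i=t}^{T} r(a_{1:i})}{\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})}
$$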

This gives us an estimate for a vector $b_t \in \mathbb{R}^{\#\theta}$. However, it is common to use a single scalar $b_t \in \mathbb{R}$, and estimate it as $E_{p_\theta(a_{t:T} \mid a_{1:(t-1)})} R(a_{t:T})$.
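The effect of a scalar baseline can be checked exactly on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's setup: it uses a hypothetical one-step softmax policy over three actions with made-up rewards, enumerates all actions, and compares the mean and the total variance of the Reinforce estimator with no baseline and with the expected reward as baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_mean_and_variance(logits, rewards, baseline):
    """Exact mean and summed per-parameter variance of d log p(a) * (R(a) - b)."""
    probs = softmax(logits)
    n = len(probs)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for a in range(n):                  # enumerate every action
        for k in range(n):              # one parameter (logit) per action
            dlogp = (1.0 if k == a else 0.0) - probs[k]
            g = dlogp * (rewards[a] - baseline)
            mean[k] += probs[a] * g
            second_moment[k] += probs[a] * g * g
    variance = sum(s - m * m for s, m in zip(second_moment, mean))
    return mean, variance

logits = [0.2, -0.1, 0.0]
rewards = [10.0, 11.0, 12.0]
expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_plain, var_plain = estimator_mean_and_variance(logits, rewards, 0.0)
mean_base, var_base = estimator_mean_and_variance(logits, rewards, expected_reward)
```

In this example the two means agree up to rounding, while `var_base` is much smaller than `var_plain`, which matches the argument that subtracting a baseline leaves the expected gradient unchanged but can reduce its variance.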

The Reinforce algorithm works much better whenever it has accurate baselines. A separate LSTM can

help in the baseline estimation. First, run the baseline LSTM on the entire input tape to produce a vector

summarizing the input. Next, continue running the baseline LSTM in tandem with the controller LSTM,


Figure 8: The baseline LSTM computes a baseline $b_t$ for every computational step $t$ of the RL-NTM. The baseline LSTM receives the same inputs as the RL-NTM, and it computes a baseline $b_t$ for time $t$ before observing the chosen actions of time $t$. However, it is important to first provide the baseline LSTM with the entire input tape as a preliminary input, because doing so allows the baseline LSTM to accurately estimate the true difficulty of a given problem instance and therefore compute better baselines. For example, if a problem instance is unusually difficult, then we expect $R_1$ to be large and negative. If the baseline LSTM is given the entire input tape as an auxiliary input, it could compute an appropriately large and negative $b_1$.

so that the baseline LSTM receives precisely the same inputs as the controller LSTM, and outputs a baseline $b_t$ at each timestep $t$. The baseline LSTM is trained to minimize $\sum_{t=1}^{T} \big[R(a_{t:T}) - b_t\big]^2$ (Fig. 8). This technique introduces a biased estimator; however, it works well in practice.
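The squared-error objective for the baseline can be written out for a single episode as follows. This is a minimal sketch that assumes $R(a_{t:T})$ is the sum of the per-step rewards from step $t$ to the end of the episode; the function name and the example values are hypothetical.

```python
def baseline_loss_and_grad(rewards, baselines):
    """Squared error between returns-to-go R(a_{t:T}) and the baselines b_t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        returns.append(running)
    returns.reverse()
    loss = sum((ret - b) ** 2 for ret, b in zip(returns, baselines))
    # Derivative of the loss with respect to each b_t.
    grad = [-2.0 * (ret - b) for ret, b in zip(returns, baselines)]
    return loss, grad

loss, grad = baseline_loss_and_grad([1.0, 0.0, 2.0], [0.0, 0.0, 0.0])
perfect_loss, _ = baseline_loss_and_grad([1.0, 0.0, 2.0], [3.0, 2.0, 2.0])
```

In a full implementation the gradient with respect to each $b_t$ would be backpropagated through the baseline network only, so that fitting the baseline does not change the controller's parameters.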

We found it important to first have the baseline LSTM go over the entire input before computing the

baselines bt . It is especially beneficial whenever there is considerable variation in the difficulty of the

examples. For example, if the baseline LSTM can recognize that the current instance is unusually

difficult, it can output a large negative value for bt=1 in anticipation of a large and a negative R1 . In

general, it is cheap and therefore worthwhile to provide the baseline network with all of the available

information, even if this information would not be available at test time, because the baseline network

is not needed at test time.

We present several execution traces of the RL–NTM. Each figure shows execution traces of the trained

RL-NTM on each of the tasks. The first row shows the input tape and the desired output, while each

subsequent row shows the RL-NTM’s position on the input tape and its prediction for the output tape.

In these examples, the RL-NTM solved each task perfectly, so the predictions made in the output tape

perfectly match the desired outputs listed in the first row.


An RL-NTM successfully solving a small instance of the Reverse problem (where the external memory is not used).

An RL-NTM successfully solving a small instance of the ForwardReverse problem, where the external memory is used.

An RL-NTM successfully solving an instance of the RepeatCopy problem where the input is to be repeated three times.

[...] input tape is only allowed to move forward. The correct solution would have been to copy the input to the memory, and then solve the task using the memory. Instead, the memory pointer is moving randomly.


A Neural Conversational Model

Quoc V. Le QVL@GOOGLE.COM

arXiv:1506.05869v3 [cs.CL] 22 Jul 2015

Proceedings of the 31st International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

Conversational modeling is an important task in natural language understanding and machine intelligence. Although previous approaches exist, they are often restricted to specific domains (e.g., booking an airline ticket) and require hand-crafted rules. In this paper, we present a simple approach for this task which uses the recently proposed sequence to sequence framework. Our model converses by predicting the next sentence given the previous sentence or sentences in a conversation. The strength of our model is that it can be trained end-to-end and thus requires far fewer hand-crafted rules. We find that this straightforward model can generate simple conversations given a large conversational training dataset. Our preliminary results suggest that, despite optimizing the wrong objective function, the model is able to converse well. It is able to extract knowledge from both a domain specific dataset, and from a large, noisy, and general domain dataset of movie subtitles. On a domain-specific IT helpdesk dataset, the model can find a solution to a technical problem via conversations. On a noisy open-domain movie transcript dataset, the model can perform simple forms of common sense reasoning. As expected, we also find that the lack of consistency is a common failure mode of our model.

1. Introduction

Advances in end-to-end training of neural networks have led to remarkable progress in many domains such as speech recognition, computer vision, and language processing. Recent work suggests that neural networks can do more than just mere classification, they can be used to map complicated structures to other complicated structures. An example of this is the task of mapping a sequence to another sequence which has direct applications in natural language understanding (Sutskever et al., 2014). The main advantage of this framework is that it requires little feature engineering and domain specificity whilst matching or surpassing state-of-the-art results. This advance, in our opinion, allows researchers to work on tasks for which domain knowledge may not be readily available, or for tasks for which it is simply too hard to design rules manually.

Conversational modeling can directly benefit from this formulation because it requires mapping between queries and responses. Due to the complexity of this mapping, conversational modeling has previously been designed to be very narrow in domain, with a major undertaking on feature engineering. In this work, we experiment with the conversation modeling task by casting it to a task of predicting the next sequence given the previous sequence or sequences using recurrent networks (Sutskever et al., 2014). We find that this approach can do surprisingly well on generating fluent and accurate replies to conversations.

We test the model on chat sessions from an IT helpdesk dataset of conversations, and find that the model can sometimes track the problem and provide a useful answer to the user. We also experiment with conversations obtained from a noisy dataset of movie subtitles, and find that the model can hold a natural conversation and sometimes perform simple forms of common sense reasoning. In both cases, the recurrent nets obtain better perplexity compared to the n-gram model and capture important long-range correlations. From a qualitative point of view, our model is sometimes able to produce natural conversations.

2. Related Work

Our approach is based on recent work which proposed to use neural networks to map sequences to sequences (Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014). This framework has been

used for neural machine translation and achieves improvements on the English-French and English-German translation tasks from the WMT’14 dataset (Luong et al., 2014; Jean et al., 2014). It has also been used for other tasks such as parsing (Vinyals et al., 2014a) and image captioning (Vinyals et al., 2014b). Since it is well known that vanilla RNNs suffer from vanishing gradients, most researchers use variants of Long Short Term Memory (LSTM) recurrent neural networks (Hochreiter & Schmidhuber, 1997).

Our work is also inspired by the recent success of neural language modeling (Bengio et al., 2003; Mikolov et al., 2010; Mikolov, 2012), which shows that recurrent neural networks are rather effective models for natural language. More recently, work by Sordoni et al. (Sordoni et al., 2015) and Shang et al. (Shang et al., 2015) used recurrent neural networks to model dialogue in short conversations (trained on Twitter-style chats).

Building bots and conversational agents has been pursued by many researchers over the last decades, and it is out of the scope of this paper to provide an exhaustive list of references. However, most of these systems require a rather complicated processing pipeline of many stages (Lester et al., 2004; Will, 2007; Jurafsky & Martin, 2009). Our work differs from conventional systems by proposing an end-to-end approach to the problem which lacks domain knowledge. It could, in principle, be combined with other systems to re-score a short-list of candidate responses, but our work is based on producing answers given by a probabilistic model trained to maximize the probability of the answer given some context.

3. Model

Our approach makes use of the sequence-to-sequence (seq2seq) framework described in (Sutskever et al., 2014). The model is based on a recurrent neural network which reads the input sequence one token at a time, and predicts the output sequence, also one token at a time. During training, the true output sequence is given to the model, so learning can be done by backpropagation. The model is trained to minimize the cross entropy of the correct sequence given its context. During inference, given that the true output sequence is not observed, we simply feed the predicted output token as input to predict the next output. This is a “greedy” inference approach. A less greedy approach would be to use beam search, and feed several candidates at the previous step to the next step. The predicted sequence can be selected based on the probability of the sequence.

Figure 1. Using the seq2seq framework for modeling conversations.

Concretely, suppose that we observe a conversation with two turns: the first person utters “ABC”, and second person replies “WXYZ”. We can use a recurrent neural network, and train to map “ABC” to “WXYZ” as shown in Figure 1 above. The hidden state of the model when it receives the end of sequence symbol “<eos>” can be viewed as the thought vector because it stores the information of the sentence, or thought, “ABC”.

The strength of this model lies in its simplicity and generality. We can use this model for machine translation, question/answering, and conversations without major changes in the architecture. Applying this technique to conversation modeling is also straightforward: the input sequence can be the concatenation of what has been conversed so far (the context), and the output sequence is the reply.

Unlike easier tasks like translation, however, a model like sequence-to-sequence will not be able to successfully “solve” the problem of modeling dialogue due to several obvious simplifications: the objective function being optimized does not capture the actual objective achieved through human communication, which is typically longer term and based on exchange of information rather than next step prediction. The lack of a model to ensure consistency and general world knowledge is another obvious limitation of a purely unsupervised model.

4. Datasets

In our experiments we used two datasets: a closed-domain IT helpdesk troubleshooting dataset and an open-domain movie transcript dataset. The details of the two datasets are as follows.

4.1. IT Helpdesk Troubleshooting dataset

In our first set of experiments, we used a dataset which was extracted from an IT helpdesk troubleshooting chat service. In this service, customers face computer related issues, and a specialist helps them by conversing and walking through a solution. Typical interactions (or threads) are 400 words long, and turn taking is clearly signaled. Our training set contains 30M tokens, and 3M tokens were used as validation. Some amount of clean up was performed, such as removing common names, numbers, and full URLs.
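The kind of clean-up described above (masking names, numbers, and URLs) can be sketched with regular expressions. This is only an illustration: the actual procedure used for the dataset is not described in the text, the name list is hypothetical, and the `<NUMBER>` placeholder is an assumption (only `<URL>` and `<NAME>` appear in the sample conversations).

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
COMMON_NAMES = {"alice", "bob"}   # hypothetical list of common first names

def clean_utterance(text):
    """Replace URLs, numbers, and common names with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)          # URLs first, so their digits are not touched
    text = NUMBER_RE.sub("<NUMBER>", text)
    tokens = ["<NAME>" if tok.lower() in COMMON_NAMES else tok for tok in text.split()]
    return " ".join(tokens)

cleaned = clean_utterance("hi bob , goto https://example.com/vpn and enter 1234")
```

Masking such tokens keeps the vocabulary small and avoids memorizing personal details, at the cost of losing the specific values in the conversation.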

We also tested our model on the OpenSubtitles dataset (Tiedemann, 2009). This dataset consists of movie conversations in XML format. It contains sentences uttered by characters in movies. We applied a simple processing step removing XML tags and obvious non-conversational text (e.g., hyperlinks) from the dataset. As turn taking is not clearly indicated, we treated consecutive sentences assuming they were uttered by different characters. We trained our model to predict the next sentence given the previous one, and we did this for every sentence (noting that this doubles our dataset size, as each sentence is used both for context and as target). Our training and validation split has 62M sentences (923M tokens) as training examples, and the validation set has 26M sentences (395M tokens). The split is done in such a way that each sentence in a pair of sentences either appears together in the training set or test set but not both. Unlike the previous dataset, the OpenSubtitles is quite large, and rather noisy because consecutive sentences may be uttered by the same character. Given the broad scope of movies, this is an open-domain conversation dataset, contrasting with the technical troubleshooting dataset.

5. Experiments

In this section, we describe the experimental results with the two datasets and show some samples of the interactions with the system that we trained. We also compare the performance of our system against a popular rule-based bot (CleverBot¹) using human evaluations on a set of 200 questions.

¹ http://www.cleverbot.com/

5.1. IT Helpdesk Troubleshooting experiments

In this experiment, we trained a single layer LSTM with 1024 memory cells using stochastic gradient descent with gradient clipping. The vocabulary consists of the most common 20K words, which includes special tokens indicating turn taking and actor.

At convergence, this model achieved a perplexity of 8, whereas an n-gram model achieved 18. Below are a few samples of simulated troubleshooting sessions for three of the most common IT problems (remote access, software crashes, and password issues). In these conversations, Machine is our Neural Conversational Model, and Human the human actor interacting with it.

Conversation 1: VPN issues.

Describe your problem: i am having issues accessing vpn
Human: hello
Machine: could you please let me know what are the operating systems you are using ?
Human: i am using linux
Machine: ok it was in the ssl vpn on which you stated it was broken
Machine: please check if you are logged in to corp network
Machine: what is the error that you are running please
Human: i am seeing an error related to vpn
Machine: what is the error message that you are getting when connecting to vpn using network connect ?
Human: connection refused or something like that
Machine: may i know the version of network connect you connect ?
Human: i am not sure i know that
Machine: is the network connect prompting for any user certificate
Machine: ?
Human: yes
Machine: are you at home or at work at the moment ?
Machine: office ?
Human: i am at home
Machine: try this
Machine: goto <URL> page and enter your name and password
Human: done
Machine: try logging in with <NAME> and check if its working fine now
Human: yes , now it works !
Machine: great
Machine: anything else that i can help
Human: no , that was very useful
Machine: you are other than <NAME> : )
Machine: have a nice night

(The <URL> indeed contains information about vpn access.)

Conversation 2: Browser issues.

Describe your problem: browser keeps crashing
Machine: hi
Human: hi
Machine: could you please let me know what is the operating system you are using ?
Human: linux
Machine: ok this is the issue with the system performance , nasser , tools .
Human: what is that ?


Machine: we have linux on your machine .
Machine: is it windows xp or windows 7 ?
Human: i am using linux
Machine: let me take a remote of your machine and check if issue has been pushed properly .
Human: ok
Machine: please open terminal and run the below commands in one computer :
Machine: sudo apt - get update
Machine: sudo apt - get upgrade
Machine: sudo <COMMAND> - updater
Human: done
Machine: let me try connecting now
Human: ok
Machine: its done
Machine: should use the second or hit run on the background
Human: done
Machine: try signing out now and follow the prompts as per the message given above
Machine: shd i reboot the machine .
Human: ok , please go ahead
Machine: once its completed please try to login .
Human: sure
Machine: please ping me back once you have tried to login
Machine: are we still connected
Human: i now logged in
Machine: thanks for your confirmation
Human: what should i do now ?
Machine: wait for 1 5 minutes to complete the login policy
Machine: and then try logging in
Human: ok
Machine: is there anything else to help you ?
Human: let my try browsing a little bit and if it keeps crashing i will let you know
Machine: sure will do
Machine: seems the mail is not loading
Machine: are we still connected to the chat ?
Machine: <reinvite>
Human: it seems good now , cool , thank you , bye
Machine: you , too
Machine: bye

Conversation 3: Password issues.

Describe your problem: i forgot my password
Machine: hi , this is <NAME> from helpdesk connect , what can i help you with today ?
Human: hi , i forgot my password
Human: can you tell me how i can recover it ?
Machine: i ’ ll need to verify who it is at that machine
Machine: can we do a hangout
Machine: ?
Human: yes
Machine: <anonymized hangout instructions and verifications>
Human: cool , i am good now
Machine: how ’ s this looking good ?
Machine: i ’ ll go ahead and close this chat .
Machine: take care . : )
Human: bye
Machine: bye .

5.2. OpenSubtitles experiments

In this experiment, we trained a two-layered LSTM using AdaGrad with gradient clipping. Each layer of the LSTM has 4096 memory cells, and we built a vocabulary consisting of the most frequent 100K words. To speed up the softmax, we project the memory cells to 2048 linear units before feeding the information to the classifier.

At convergence the perplexity of the recurrent model on the validation set is 17. Our smoothed 5-gram model achieves a perplexity of 28. Interestingly, adding the soft attention mechanism of (Bahdanau et al., 2014) did not significantly improve the perplexity on either the training or validation sets.

5.2.1. Model Samples

In addition to the perplexity measure, our simple recurrent model does often produce plausible answers. Here are a few sample question-answer pairs that allowed us to understand the capabilities of the model when trained on the noisy but large dataset of OpenSubtitles from movies. Again, in these conversations, Machine is our Neural Conversational Model, and Human the human actor interacting with it.

Conversation 1: Basic

H: hello !
Machine: hello !
Human: how are you ?
Machine: i ’m good .
Human: what ’s your name ?
Machine: i ’m julia .
Human: when were you born ?
Machine: july 20th .
Human: what year were you born ?
Machine: 1977 .
Human: where are you from ?


Machine: i ’m out in the boonies.
Human: are you a man or a woman ?
Machine: i ’m a woman .
Human: why are we here ?
Machine: i ’m not sure .
Human: okay, bye !
Machine: bye .
Human: see you later .
Machine: bye .

Conversation 2: Simple Q&A (contexts and multiple choice)

H: my name is david . what is my name ?
Machine: david .
Human: my name is john . what is my name ?
Machine: john .
Human: are you a leader or a follower ?
Machine: i ’m a leader .
Human: are you a follower or a leader ?
Machine: i ’m a leader .
Machine: he is a hero .
Human: who is bill clinton ?
Machine: he ’s a billionaire .
Human: is sky blue or black ?
Machine: blue .
Human: does a cat have a tail ?
Machine: yes .
Human: does a cat have a wing ?
Machine: no
Human: can a cat fly ?
Machine: no .
Human: how many legs does a cat have ?
Machine: four , i think .
Human: how many legs does a spider have ?
Machine: three , i think .
Human: how many legs does a centipede have ?
Machine: eight .
Human: what is the color of the sky ?
Machine: blue .
Human: what is the color of water ?
Machine: water .
Human: what is the color of blood ?
Machine: it is the same as a black eye .
Human: what is the usual color of a leaf ?
Machine: it is a green one .
Human: what is the color of a yellow car ?
Machine: yellow .
Human: how much is two plus two ?
Machine: four .
Human: how much is ten minus two ?
Machine: seventy - two .

Conversation 4: Philosophical Q&A

H: what is the purpose of life ?
Machine: to serve the greater good .
Human: what is the purpose of living ?
Machine: to live forever .
Human: what is the purpose of existence ?
Machine: to find out what happens when we get to the planet earth .
Human: where are you now ?
Machine: i ’m in the middle of nowhere .
Human: what is the purpose of dying ?
Machine: to have a life .
Human: what is the purpose of being intelligent ?
Machine: to find out what it is .
Human: what is the purpose of emotions ?
Machine: i don ’t know .
Machine: what empowered humanity , what intellectual the essence is .
Human: what is immoral ?
Machine: the fact that you have a child .
Human: what is morality ?
Machine: what is altruism ?
Human: what is the definition of altruism ?
Machine: if you don ’t believe in god , then you don ’t know .
Human: ok ... so what is the definition of morality ?
Machine: well , the truth is , you ’re not a believer in god almighty .
Human: tell me the definition of morality , i am quite upset now !
Machine: i ’m not ashamed of being a philosopher !
Human: tell me the definition of morality .
Machine: i don ’t have ethics .
Human: look , i need help , i need to know more about morality ...
Machine: i don ’t know what ethics is .
Human: seriously , what is morality ?
Machine: what is the definition of living ?
Human: why living has anything to do with morality ?
Machine: you ’re not a cop .
Human: okay , i need to know how should i behave morally ...


Machine: i don ’t know how to tell you .
Human: is morality and ethics the same ?
Machine: yes , sir .
Human: what are the things that i do to be immoral ?
Machine: i don ’t know .
Human: give me some examples of moral actions ...
Machine: i ’m not a moralist .
Human: alright , morality ?
Machine: integrity .
Human: what is integrity ?
Machine: i ’m sorry , i don ’t know what else to say .
Human: be moral !
Machine: be a man !
Human: i really like our discussion on morality and ethics ...
Machine: and how i ’m not in the mood for a philosophical debate .
Human: what do you like to talk about ?
Machine: nothing .

Conversation 6: Opinions (notice the model outputs “he” , “she” or “it” depending on the subject being asked)

H: what do you think about tesla ?
Machine: he ’s a good conductor .
Human: what do you think about bill gates ?
Machine: he ’s a good man .
Human: What do you think about messi ?
Machine: he ’s a great player .
Human: what do you think about cleopatra ?
Machine: oh , she ’s very regal .
Human: what do you think about england during the reign of elizabeth ?
Machine: it was a great place .

We find it encouraging that the model can remember facts, understand contexts, and perform common sense reasoning without the complexity of traditional pipelines. What surprises us is that the model does so without any explicit knowledge representation component except for the parameters in the word vectors.

Perhaps most practically significant is the fact that the model can generalize to new questions. In other words, it does not simply look up an answer by matching the question with the existing database. In fact, most of the questions presented above, except for the first conversation, do not appear in the training set.

Nonetheless, one drawback of this basic model is that it only gives simple, short, sometimes unsatisfying answers to our questions as can be seen above. Perhaps a more problematic drawback is that the model does not capture a consistent personality. Indeed, if we ask not identical but semantically similar questions, the answers can sometimes be inconsistent. This is expected due to the simplicity of our model and the dataset in our experiments. The conversation below illustrates such a failure:

Conversation 7: Job and Personality

H: what is your job ?
Machine: i ’m a lawyer .
Human: what do you do ?
Machine: i ’m a doctor .

Lastly, as a comparison with traditional rule-based bots, we provide the output of CleverBot for some of the questions in our test:²

² The outputs of CleverBot are non-deterministic so the answers may be different for different trials.

H: Are you a follower or a leader ?
CleverBot: No !!
Human: My name is david . what is my name ?
CleverBot: How old are you David ?
Human: My name is john . what is my name ?
CleverBot: John
Human: What is the purpose of being intelligent ?
CleverBot: The purpose of life is to love God and love people.
Human: Is sky blue or black ?
CleverBot: I am 200 years old.
Human: What is the color of a yellow car ?
CleverBot: Brown.
Human: What is the usual color of a leaf ?
CleverBot: What is the color of the apple in the fridge ?
Human: What is the purpose of life ?
CleverBot: My purpose it to forward my species, in other words to make it easier for future generations of mankind to live.
Human: What do you think about tesla ?
CleverBot: I am tesla.
Human: What do you think about cleopatra ?
CleverBot: Cool is cool.
Human: What do you think about england during the reign of elizabeth ?
CleverBot: I like him. Do you ?
Human: What do you think about bill gates ?
CleverBot: Lets talk about Link.
Human: What is moral ?
CleverBot: Morality is the quality of being in accord with standards of right or good conduct.
Human: What is immoral ?


Human: Does a cat have a wing ?
CleverBot: What is the name of your husband.
Human: Can a cat fly ?
CleverBot: What is your husband’s name ?
Human: Who is skywalker ?
CleverBot: There are two of them, Anakin Skywalker and Luke Skywalker, both characters from star Wars.
Human: How many legs does a cat have ?
CleverBot: An ant.
Human: How many legs does a centipede have ?
CleverBot: A centipede’s number of legs varies from thirty to two hundred.
Human: What is your job ?
CleverBot: I work in mental health.
Human: What do you do ?
CleverBot: I have no job.

5.2.2. Human Evaluation

In order to fairly and objectively compare our model against CleverBot, we picked 200 questions³, and asked four different humans to rate our model (NCM) versus CleverBot (CB). Human judges were asked to pick which of the two bots they preferred, and were also allowed to declare a tie if both answers were of equal quality.

³ The questions we used together with the answers can be found in http://ai.stanford.edu/~quocle/QAresults.pdf

If three out of four human judges agreed, we recorded that score. Our model was preferred in 97 out of 200 questions, whereas CleverBot was picked in 60 out of 200. There was a tie in 20 questions, and in 23 questions the judges were in disagreement.

Here are some samples of how the judges evaluated the systems (in bold we show their preferred answer, the third example shows a tie, and the fourth example shows disagreement):

H: What is the deepest spot on the world ?
NCM: The pacific ocean.
CB: What ?

H: Do you like Mexican food or Indian food ?
NCM: I hate it.
CB: I love Mexican food.

H: Who are you crazy about ?
NCM: You’re the one who’s crazy.
CB: Who are you ?

H: What’s thirty plus forty ?
NCM: Sixty ?
CB: What should we talk about ?

We believe, however, that designing a good metric to quickly measure the quality of a conversational model remains an open research problem. We did so by manual inspection, by using Mechanical Turk to get human evaluations comparing with a rule-based bot, and by computing the perplexity. However, there are obvious drawbacks to using these, and we are actively pursuing this and other research directions.

6. Discussion

In this paper, we show that a simple language model based on the seq2seq framework can be used to train a conversational engine. Our modest results show that it can generate simple and basic conversations, and extract knowledge from a noisy but open-domain dataset. Even though the model has obvious limitations, it is surprising to us that a purely data-driven approach without any rules can produce rather proper answers to many types of questions. However, the model may require substantial modifications to be able to deliver realistic conversations. Amongst the many limitations, the lack of a coherent personality makes it difficult for our system to pass the Turing test (Turing, 1950).

ACKNOWLEDGMENTS

We thank Greg Corrado, Andrew Dai, Jeff Dean, Tom Dean, Matthieu Devin, Rajat Monga, Mike Schuster, Noam Shazeer, Ilya Sutskever and the Google Brain team for the help with the project.

References

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural machine translation. CoRR, abs/1412.2007, 2014.

Jurafsky, D. and Martin, J. Speech and language processing. Pearson International, 2009.


translation models. In EMNLP, 2013.

agents. In Handbook of Internet Computing. Chapman

& Hall, 2004.

Luong, T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, 2014.

Mikolov, T. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.

Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010.

Shang, L., Lu, Z., and Li, H. Neural responding machine for short-text conversation. In Proceedings of ACL, 2015.

Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y., Mitchell, M., Gao, J., Dolan, B., and Nie, J.-Y. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of NAACL, 2015.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In NIPS, 2014.

Tiedemann, J. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In Nicolov, N., Bontcheva, K., Angelova, G., and Mitkov, R. (eds.), Recent Advances in Natural Language Processing, volume V, pp. 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria, 2009. ISBN 978 90 272 4825 1.

Turing, A. M. Computing machinery and intelligence. Mind, pp. 433–460, 1950.

Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. Grammar as a foreign language. arXiv preprint arXiv:1412.7449, 2014a.

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014b.

Will, T. Creating a Dynamic Speech Dialogue. VDM Verlag Dr, 2007.

Listen, Attend and Spell

Carnegie Mellon University Google Brain

williamchan@cmu.edu {ndjaitly,qvl,vinyals}@google.com

arXiv:1508.01211v2 [cs.CL] 20 Aug 2015

Abstract

We present Listen, Attend and Spell (LAS), a neural network that learns to tran-

scribe speech utterances to characters. Unlike traditional DNN-HMM models, this

model learns all the components of a speech recognizer jointly. Our system has

two components: a listener and a speller. The listener is a pyramidal recurrent net-

work encoder that accepts filter bank spectra as inputs. The speller is an attention-

based recurrent network decoder that emits characters as outputs. The network

produces character sequences without making any independence assumptions be-

tween the characters. This is the key improvement of LAS over previous end-to-

end CTC models. On a subset of the Google voice search task, LAS achieves a

word error rate (WER) of 14.1% without a dictionary or a language model, and

10.3% with language model rescoring over the top 32 beams. By comparison, the

state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.

1 Introduction

Deep Neural Networks (DNNs) have led to improvements in various components of speech recog-

nizers. They are commonly used in hybrid DNN-HMM speech recognition systems for acoustic

modeling [1, 2, 3, 4, 5, 6]. DNNs have also produced significant gains in pronunciation models that

map words to phoneme sequences [7, 8]. In language modeling, recurrent models have been shown

to improve speech recognition accuracy by rescoring n-best lists [9]. Traditionally these compo-

nents – acoustic, pronunciation and language models – have all been trained separately, each with a

different objective. Recent work in this area attempts to rectify this disjoint training issue by design-

ing models that are trained end-to-end – from speech directly to transcripts [10, 11, 12, 13, 14, 15].

Two main approaches for this are Connectionist Temporal Classification (CTC) [10] and sequence

to sequence models with attention [16]. Both of these approaches have limitations that we try to

address: CTC assumes that the label outputs are conditionally independent of each other; whereas

the sequence to sequence approach has only been applied to phoneme sequences [14, 15], and not

trained end-to-end for speech recognition.

In this paper we introduce Listen, Attend and Spell (LAS), a neural network that improves upon the

previous attempts [12, 14, 15]. The network learns to transcribe an audio sequence signal to a word

sequence, one character at a time. Unlike previous approaches, LAS does not make independence

assumptions in the label sequence and it does not rely on HMMs. LAS is based on the sequence to

sequence learning framework with attention [17, 18, 16, 14, 15]. It consists of an encoder recurrent

neural network (RNN), which is named the listener, and a decoder RNN, which is named the speller.

The listener is a pyramidal RNN that converts low level speech signals into higher level features.

The speller is an RNN that converts these higher level features into output utterances by specifying

a probability distribution over sequences of characters using the attention mechanism [16, 14, 15].

The listener and the speller are trained jointly.

Key to our approach is the fact that we use a pyramidal RNN model for the listener, which reduces

the number of time steps that the attention model has to extract relevant information from. Rare and

out-of-vocabulary (OOV) words are handled automatically, since the model outputs the character


sequence, one character at a time. Another advantage of modeling characters as outputs is that the

network is able to generate multiple spelling variants naturally. For example, for the phrase “triple a”

the model produces both “triple a” and “aaa” in the top beams (see section 4.5). A model like CTC

may have trouble producing such diverse transcripts for the same utterance because of conditional

independence assumptions between frames.

In our experiments, we find that these components are necessary for LAS to work well. Without the

attention mechanism, the model overfits the training data significantly, in spite of our large training

set of three million utterances - it memorizes the training transcripts without paying attention to the

acoustics. Without the pyramid structure in the encoder side, our model converges too slowly - even

after a month of training, the error rates were significantly higher than the errors we report here.

Both of these problems arise because the acoustic signals can have hundreds to thousands of frames

which makes it difficult to train the RNNs. Finally, to reduce the overfitting of the speller to the

training transcripts, we use a sampling trick during training [19].

With these improvements, LAS achieves 14.1% WER on a subset of the Google voice search task,

without a dictionary or a language model. When combined with language model rescoring, LAS

achieves 10.3% WER. By comparison, the Google state-of-the-art CLDNN-HMM system achieves

8.0% WER on the same data set [20].

2 Related Work

Even though deep networks have been successfully used in many applications, until recently, they

have mainly been used in classification: mapping a fixed-length vector to an output category [21].

For structured problems, such as mapping one variable-length sequence to another variable-length

sequence, neural networks have to be combined with other sequential models such as Hidden

Markov Models (HMMs) [22] and Conditional Random Fields (CRFs) [23]. A drawback of this

combining approach is that the resulting models cannot be easily trained end-to-end and they make

simplistic assumptions about the probability distribution of the data.

Sequence to sequence learning is a framework that attempts to address the problem of learning

variable-length input and output sequences [17]. It uses an encoder RNN to map the sequential

variable-length input into a fixed-length vector. A decoder RNN then uses this vector to produce

the variable-length output sequence, one token at a time. During training, the model feeds the

groundtruth labels as inputs to the decoder. During inference, the model performs a beam search to

generate suitable candidates for next step predictions.

Sequence to sequence models can be improved significantly by the use of an attention mechanism

that provides the decoder RNN more information when it produces the output tokens [16]. At each

output step, the last hidden state of the decoder RNN is used to generate an attention vector over

the input sequence of the encoder. The attention vector is used to propagate information from the

encoder to the decoder at every time step, instead of just once, as with the original sequence to

sequence model [17]. This attention vector can be thought of as skip connections that allow the

information and the gradients to flow more effectively in an RNN.

The sequence to sequence framework has been used extensively for many applications: machine

translation [24, 25], image captioning [26, 27], parsing [28] and conversational modeling [29]. The

generality of this framework suggests that speech recognition can also be a direct application [14,

15].

3 Model

In this section, we will formally describe LAS which accepts acoustic features as inputs and emits English characters as outputs. Let $x = (x_1, \ldots, x_T)$ be our input sequence of filter bank spectra features, and let $y = (\langle sos \rangle, y_1, \ldots, y_S, \langle eos \rangle)$, $y_i \in \{a, b, c, \cdots, z, 0, \cdots, 9, \langle space \rangle, \langle comma \rangle, \langle period \rangle, \langle apostrophe \rangle, \langle unk \rangle\}$, be the output sequence of characters. Here $\langle sos \rangle$ and $\langle eos \rangle$ are the special start-of-sentence and end-of-sentence tokens, respectively.


We want to model each character output $y_i$ as a conditional distribution over the previous characters $y_{<i}$ and the input signal $x$ using the chain rule:

$$P(y \mid x) = \prod_i P(y_i \mid x, y_{<i}) \tag{1}$$

Our Listen, Attend and Spell (LAS) model consists of two sub-modules: the listener and the speller.

The listener is an acoustic model encoder, whose key operation is Listen. The speller is an attention-

based character decoder, whose key operation is AttendAndSpell. The Listen function transforms

the original signal x into a high level representation h = (h1 , . . . , hU ) with U ≤ T , while the

AttendAndSpell function consumes h and produces a probability distribution over character se-

quences:

h = Listen(x) (2)

P (y|x) = AttendAndSpell(h, y) (3)

Figure 1 visualizes LAS with these two components. We provide more details of these components

in the following sections.

[Figure 1 diagram. Speller (top): grapheme characters y_i are modelled by CharacterDistribution; AttentionContext creates the context vector c_i from h and s_i. Listener (bottom): a pyramidal BLSTM (Listen) encodes the inputs x_1, ..., x_T into the shorter sequence h = (h_1, ..., h_U).]

Figure 1: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding our input

sequence x into high level features h, the speller is an attention-based decoder generating the y characters

from h.


3.1 Listen

The Listen operation uses a Bidirectional Long Short Term Memory RNN (BLSTM) [30, 31, 12]

with a pyramid structure. This modification is required to reduce the length U of h, from T , the

length of the input x, because the input speech signals can be hundreds to thousands of frames long.

A direct application of BLSTM for the operation Listen converged slowly and produced results

inferior to those reported here, even after a month of training time. This is presumably because the

operation AttendAndSpell has a hard time extracting the relevant information from a large number

of input time steps.

We circumvent this problem by using a pyramid BLSTM (pBLSTM) similar to the Clockwork RNN [33]. In each successive stacked pBLSTM layer, we reduce the time resolution by a factor of 2. In a typical deep BLSTM architecture, the output at the i-th time step, from the j-th layer is computed as follows:

$$h_i^j = \mathrm{BLSTM}\left(h_{i-1}^j, h_i^{j-1}\right) \tag{4}$$

In the pBLSTM model, we concatenate the outputs at consecutive steps of each layer before feeding it to the next layer, i.e.:

$$h_i^j = \mathrm{pBLSTM}\left(h_{i-1}^j, \left[h_{2i}^{j-1}, h_{2i+1}^{j-1}\right]\right) \tag{5}$$

In our model, we stack 3 pBLSTMs on top of the bottom BLSTM layer to reduce the time resolution $2^3 = 8$ times. This allows the attention model (see next section) to extract the relevant information from a smaller number of time steps. In addition to reducing the resolution, the deep architecture allows the model to learn nonlinear feature representations of the data. See Figure 1 for a visualization

of the pBLSTM.
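The time reduction performed by the pyramid can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the names `pyramid_concat` and `reduce_time` are hypothetical, the BLSTM that would transform each layer's output is omitted (so the feature dimension grows here, whereas a real layer would map it back to the hidden size), and dropping a trailing odd frame is an assumption.

```python
import numpy as np

def pyramid_concat(h):
    """Concatenate each pair of consecutive frames: (T, D) -> (T // 2, 2 * D)."""
    T, D = h.shape
    T_even = T - (T % 2)  # assumption: drop a trailing odd frame
    # Row k of the result is [h[2k], h[2k + 1]] laid side by side.
    return h[:T_even].reshape(T_even // 2, 2 * D)

def reduce_time(h, num_layers=3):
    """Apply the pairwise concatenation once per pyramid layer (BLSTMs omitted)."""
    for _ in range(num_layers):
        h = pyramid_concat(h)
    return h

# 8 seconds of 40-dimensional features at one frame per 10 ms gives 800 frames.
x = np.zeros((800, 40))
h = reduce_time(x)
print(h.shape)  # (100, 320)
```

With three pyramid layers the attention mechanism sees 100 positions instead of 800, which is the 8-fold reduction described above.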

The pyramid structure also reduces the computational complexity. In the next section we show that

the attention mechanism over U features has a computational complexity of O(U S). Thus, reducing

U speeds up learning and inference significantly.

3.2 Attend and Spell

We now describe the AttendAndSpell function. The function is computed using an attention-based

LSTM transducer [16, 15]. At every output step, the transducer produces a probability distribution

over the next character conditioned on all the characters seen previously. The distribution for yi is

a function of the decoder state si and context ci . The decoder state si is a function of the previous

state si−1 , the previously emitted character yi−1 and context ci−1 . The context vector ci is produced

by an attention mechanism. Specifically,

$$c_i = \mathrm{AttentionContext}(s_i, h) \tag{6}$$

$$s_i = \mathrm{RNN}(s_{i-1}, y_{i-1}, c_{i-1}) \tag{7}$$

$$P(y_i \mid x, y_{<i}) = \mathrm{CharacterDistribution}(s_i, c_i) \tag{8}$$

where CharacterDistribution is an MLP with softmax outputs over characters, and RNN is a 2

layer LSTM.

At each time step, i, the attention mechanism, AttentionContext generates a context vector, ci

encapsulating the information in the acoustic signal needed to generate the next character. The

attention model is content based - the contents of the decoder state si are matched to the contents

of hu representing time step u of h, to generate an attention vector αi . αi is used to linearly blend

vectors hu to create ci .

Specifically, at each decoder timestep $i$, the AttentionContext function computes the scalar energy $e_{i,u}$ for each time step $u$, using vector $h_u \in h$ and $s_i$. The scalar energy $e_{i,u}$ is converted into a probability distribution over time steps (or attention) $\alpha_i$ using a softmax function. This is used to create the context vector $c_i$ by linearly blending the listener features, $h_u$, at different time steps:

$$e_{i,u} = \langle \phi(s_i), \psi(h_u) \rangle \tag{9}$$

$$\alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_u \exp(e_{i,u})} \tag{10}$$

$$c_i = \sum_u \alpha_{i,u} h_u \tag{11}$$

where φ and ψ are MLP networks. On convergence, the αi distribution is typically very sharp, and

focused on only a few frames of h; ci can be seen as a continuous bag of weighted features of h.

Figure 1 shows LAS architecture.
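The content-based attention step can be written out as a short NumPy sketch. It is meant as an illustration rather than the paper's code: `attention_context` is a hypothetical name, the sizes are arbitrary, and the two MLPs φ and ψ are replaced by single tanh layers with random weights as stand-ins.

```python
import numpy as np

def attention_context(s_i, h, phi, psi):
    """Return (c_i, alpha_i) for decoder state s_i and listener features h of shape (U, D)."""
    e = psi(h) @ phi(s_i)                # scalar energies e_{i,u}, shape (U,)
    e = e - e.max()                      # subtract the max for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()  # softmax over the time steps u
    c = alpha @ h                        # weighted blend of the h_u, shape (D,)
    return c, alpha

rng = np.random.default_rng(0)
U, D, S, A = 100, 8, 6, 4                # hypothetical sizes
W_phi = rng.normal(size=(S, A))
W_psi = rng.normal(size=(D, A))
phi = lambda s: np.tanh(s @ W_phi)       # stand-in for the MLP phi
psi = lambda x: np.tanh(x @ W_psi)       # stand-in for the MLP psi

h = rng.normal(size=(U, D))
s_i = rng.normal(size=S)
c_i, alpha_i = attention_context(s_i, h, phi, psi)
```

The cost of one call is linear in U, so running it for each of the S output characters gives the O(US) complexity mentioned in Section 3.1.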

3.3 Learning

The Listen and AttendAndSpell functions can be trained jointly for end-to-end speech recognition.

The sequence to sequence methods condition the next step prediction on the previous characters [17, 16] and maximize the log probability:

$$\max_\theta \sum_i \log P(y_i \mid x, y^*_{<i}; \theta) \tag{12}$$

where $y^*_{<i}$ is the groundtruth of the previous characters.

However during inference, the groundtruth is missing and the predictions can suffer because the

model was not trained to be resilient to feeding in bad predictions at some time steps. To ameliorate

this effect, we use a trick that was proposed in [19]. During training, instead of always feeding in the

ground truth transcript for next step prediction, we sometimes sample from our previous character

distribution and use that as the inputs in the next step predictions:

$$\tilde{y}_i \sim \mathrm{CharacterDistribution}(s_i, c_i) \tag{13}$$

$$\max_\theta \sum_i \log P(y_i \mid x, \tilde{y}_{<i}; \theta) \tag{14}$$

where ỹi−1 is the character chosen from the ground truth, or sampled from the model with a certain

sampling rate. Unlike [19], we do not use a schedule and simply use a constant sampling rate of

10% right from the start of training.
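The choice made at each decoder step can be sketched as follows. This is a minimal illustration assuming characters are represented as integer indices; `next_decoder_input` is a hypothetical helper, and the toy distribution is chosen only so that sampled and ground-truth inputs are easy to tell apart.

```python
import numpy as np

def next_decoder_input(gt_char, char_probs, sampling_rate, rng):
    """Pick the character fed to the decoder at the next step."""
    if rng.random() < sampling_rate:
        # Sample from the model's own character distribution (Eq. 13).
        return int(rng.choice(len(char_probs), p=char_probs))
    return gt_char  # otherwise feed the ground-truth character

rng = np.random.default_rng(0)
probs = np.array([0.0, 1.0, 0.0])  # toy distribution that always yields index 1
fed = [next_decoder_input(0, probs, 0.1, rng) for _ in range(10_000)]
print(sum(fed) / len(fed))         # fraction of sampled inputs, close to 0.1
```

A constant `sampling_rate` of 0.1 corresponds to the 10% rate used here; a schedule as in [19] would instead vary this argument over the course of training.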

As the system is a very deep network it may appear that some type of pretraining would be required.

However, in our experiments, we found no need for pretraining. In particular, we attempted to

pretrain the Listen function with context independent or context dependent phonemes generated

from a conventional GMM-HMM system. A softmax network was attached to the output units

hu ∈ h of the listener and used to make multi-frame phoneme state predictions [34] but led to no

improvements. We also attempted to use the phonemes as a joint objective target [35], but found no

improvements.

During inference we want to find the most likely character sequence given the input acoustics:

$$ \hat{y} = \arg\max_y \log P(y \mid x) \tag{15} $$

Decoding is performed with a simple left-to-right beam search algorithm similar to [17]. We maintain a set of β partial hypotheses, starting with the start-of-sentence ⟨sos⟩ token. At each timestep, each partial hypothesis in the beam is expanded with every possible character and only the β most likely beams are kept. When the ⟨eos⟩ token is encountered, the hypothesis is removed from the beam and added to the set of complete hypotheses. A dictionary can optionally be added to constrain the search space to valid words; however, we found that this was not necessary since the model learns to spell real words almost all the time.
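The decoding procedure above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: `next_log_probs` is a hypothetical callable standing in for the speller's next-character distribution, and the toy model at the end uses made-up probabilities.

```python
import math

SOS, EOS = "<sos>", "<eos>"

def beam_search(next_log_probs, beam_width, max_len=50):
    """Left-to-right beam search.

    next_log_probs: callable(prefix_tuple) -> {token: log_prob}.
    Returns completed hypotheses as (tokens, log_prob), best first.
    """
    beam = [((SOS,), 0.0)]
    complete = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for token, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for hyp in candidates[:beam_width]:
            # Finished hypotheses leave the beam and join the complete set.
            (complete if hyp[0][-1] == EOS else beam).append(hyp)
        if not beam:
            break
    return sorted(complete, key=lambda c: c[1], reverse=True)

def toy_model(prefix):
    # Made-up distribution: choose "a" or "b", then always end.
    if len(prefix) == 1:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    return {EOS: 0.0}

results = beam_search(toy_model, beam_width=2)
```

Here hypotheses are pruned to the β best before finished ones are moved out of the beam; other orderings are possible and the text does not pin this detail down.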

We have vast quantities of text data [36], compared to the amount of transcribed speech utterances.

We can use language models trained on text corpora alone similar to conventional speech systems


[37]. To do so we can rescore our beams with the language model. We find that our model has a

small bias for shorter utterances so we normalize our probabilities by the number of characters |y|c

in the hypothesis and combine it with a language model probability PLM (y):

$$ s(y \mid x) = \frac{\log P(y \mid x)}{|y|_c} + \lambda \log P_{\mathrm{LM}}(y) \tag{16} $$

where λ is our language model weight and can be determined by a held-out validation set.
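A small sketch of the rescoring rule in equation (16) follows. The hypotheses and their log probabilities are invented for illustration, and counting every character of the string (spaces included) as |y|_c is an assumption rather than something the text specifies.

```python
def rescore(log_p_acoustic, num_chars, log_p_lm, lm_weight):
    """Length-normalized score s(y|x) of equation (16)."""
    return log_p_acoustic / num_chars + lm_weight * log_p_lm

def rerank(hypotheses, lm_weight):
    """hypotheses: list of (text, log P(y|x), log P_LM(y)) tuples."""
    return sorted(
        hypotheses,
        key=lambda h: rescore(h[1], len(h[0]), h[2], lm_weight),
        reverse=True,
    )

# Made-up beam entries: (text, model log-probability, LM log-probability).
hyps = [("call aaa", -4.0, -20.0), ("call triple a", -7.8, -5.0)]
best_without_lm = rerank(hyps, lm_weight=0.0)[0][0]
best_with_lm = rerank(hyps, lm_weight=0.008)[0][0]
```

In this toy case the language model term is enough to swap the order of the two hypotheses, which is the effect that rescoring relies on.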

4 Experiments

We used a dataset of approximately three million Google voice search utterances (representing 2000

hours of data) for our experiments. Approximately 10 hours of utterances were randomly selected as

a held-out validation set. Data augmentation was performed using a room simulator, adding different

types of noise and reverberations; the noise sources were obtained from YouTube and environmental

recordings of daily events [20]. This increased the amount of audio data by 20 times. 40-dimensional

log-mel filter bank features were computed every 10ms and used as the acoustic inputs to the listener.

A separate set of 22K utterances representing approximately 16 hours of data were used as the test

data. A noisy test data set was also created using the same corruption strategy that was applied to

the training data. All training sets are anonymized and hand-transcribed, and are representative of

Google's speech traffic.

The text was normalized by converting all characters to lower case English alphanumerics (including digits). Four punctuation marks (space, comma, period and apostrophe) were kept, while all other tokens were converted to the unknown ⟨unk⟩ token. As mentioned earlier, all utterances were padded with the start-of-sentence ⟨sos⟩ and the end-of-sentence ⟨eos⟩ tokens.
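The normalization above might look roughly like the following. It is a sketch only: the token spellings `<sos>`, `<eos>` and `<unk>` and the function name are assumptions, and the actual pipeline is not described beyond the rules in the text.

```python
import string

# Lower-case letters, digits, and the four retained punctuation marks.
ALLOWED = set(string.ascii_lowercase + string.digits + " ,.'")

def normalize(text):
    """Lowercase, map disallowed characters to <unk>, pad with <sos>/<eos>."""
    tokens = [ch if ch in ALLOWED else "<unk>" for ch in text.lower()]
    return ["<sos>"] + tokens + ["<eos>"]

example = normalize("St. Mary's!")
```

The resulting character vocabulary is small: 26 letters, 10 digits, 4 punctuation marks and 3 special tokens.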

The state-of-the-art model on this dataset is a CLDNN-HMM system that was described in [20].

The CLDNN system achieves a WER of 8.0% on the clean test set and 8.9% on the noisy test set.

However, we note that the CLDNN uses unidirectional CLDNNs and would certainly benefit even

further from the use of a bidirectional CLDNN architecture.

For the Listen function we used 3 layers of 512 pBLSTM nodes (i.e., 256 nodes per direction) on

top of a BLSTM that operates on the input. This reduced the time resolution by 8 = 2^3 times. The

Spell function used a two layer LSTM with 512 nodes each. The weights were initialized with a

uniform distribution U(−0.1, 0.1).

Asynchronous Stochastic Gradient Descent (ASGD) was used for training our model [38]. A learn-

ing rate of 0.2 was used with a geometric decay of 0.98 per 3M utterances (i.e., 1/20-th of an epoch).

We used the DistBelief framework [38] with 32 replicas, each with a minibatch of 32 utterances.

In order to further speed up training, the sequences were grouped into buckets based on their frame

length [17].
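For concreteness, the learning-rate schedule can be written as a one-line function. Whether the decay was applied in discrete steps every 3M utterances (as below) or continuously is not stated, so the stepwise form is an assumption; the 60M-utterance epoch follows from 3M utterances being 1/20 of an epoch.

```python
def learning_rate(utterances_seen, base_lr=0.2, decay=0.98,
                  decay_every=3_000_000):
    """Geometric decay: multiply by `decay` once per `decay_every` utterances."""
    return base_lr * decay ** (utterances_seen // decay_every)

# After one full epoch (20 decay steps) the rate is 0.2 * 0.98**20.
lr_after_one_epoch = learning_rate(60_000_000)
```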

The model was trained using groundtruth previous characters until results on the validation set

stopped improving. This took approximately two weeks. The model was decoded using beam width

β = 32 and achieved 16.2% WER on the clean test set and 19.0% WER on the noisy test set without

any dictionary or language model. We found that constraining the beam search with a dictionary had

no impact on the WER. Rescoring the top 32 beams with the same n-gram language model that was

used by the CLDNN system using a language model weight of λ = 0.008 improved the results for

the clean and noisy test sets to 12.6% and 14.7% respectively. Note that for convenience, we did not

decode with a language model, but rather only rescored the top 32 beams. It is possible that further

gains could have been achieved by using the language model during decoding.

As mentioned in Section 3.3, there is a mismatch between training and testing. During training

the model is conditioned on the correct previous characters but during testing mistakes made by

the model corrupt future predictions. We trained another model by sampling from our previous

character distribution with a probability of 10% (we did not use a schedule as described in [19]).

This improved our results on the clean and noisy test sets to 14.1% and 16.5% WER respectively

when no language model rescoring was used. With language model rescoring, we achieved 10.3%

and 12.0% WER on the clean and noisy test sets, respectively. Table 1 summarizes these results.

On the clean test set, this model is within 2.5% absolute WER of the state-of-the-art CLDNN-HMM system (10.3% versus 8.0%), while on the noisy set it is 3.1% absolute WER worse (12.0% versus 8.9%). We suspect that convolutional filters could lead to improved results, as they have been reported to improve performance by 5% relative WER on clean speech and 7% relative on noisy speech compared to non-convolutional architectures [20].

Table 1: WER comparison on the clean and noisy Google voice search task. The CLDNN-HMM system is the state-of-the-art system; the Listen, Attend and Spell (LAS) models are decoded with a beam size of 32. Language Model (LM) rescoring was applied to our beams, and a sampling trick was applied to bridge the gap between training and inference.

Model                             Clean WER (%)   Noisy WER (%)
CLDNN-HMM [20]                    8.0             8.9
LAS                               16.2            19.0
LAS + LM Rescoring                12.6            14.7
LAS + Sampling                    14.1            16.5
LAS + Sampling + LM Rescoring     10.3            12.0

Figure 2: Alignments between character outputs and audio signal produced by the Listen, Attend and Spell

(LAS) model for the utterance “how much would a woodchuck chuck”. The content based attention mechanism

was able to identify the start position in the audio sequence for the first character correctly. The alignment

produced is generally monotonic without a need for any location based priors.

The content-based attention mechanism creates an explicit alignment between the characters and

audio signal. We can visualize the attention mechanism by recording the attention distribution on

the acoustic sequence at every character output timestep. Figure 2 visualizes the attention align-

ment between the characters and the filterbanks for the utterance “how much would a woodchuck

chuck”. For this particular utterance, the model learnt a monotonic distribution without any location

priors. The words “woodchuck” and “chuck” have acoustic similarities, and the attention mechanism

was slightly confused when emitting “woodchuck”, with a dilution in the distribution. The attention

model was also able to identify the start and end of the utterance properly.


In the following sections, we report results of control experiments that were conducted to understand

the effects of beam widths, utterance lengths and word frequency on the WER of our model.

We investigate the correlation between the performance of the model and the width of beam search,

with and without the language model rescoring. Figure 3 shows the effect of the decode beam width,

β, on the WER for the clean test set. We see consistent WER improvements by increasing the beam

width up to 16, after which we observe no significant benefits. At a beam width of 32, the WER

is 14.1% and 10.3% after language model rescoring. Rescoring the top 32 beams with an oracle

produces a WER of 4.3% on the clean test set and 5.5% on the noisy test set.

[Figure 3 plot: WER (y-axis) against beam width (x-axis; 1, 2, 4, 8, 16, 32), with curves for WER, WER LM and WER Oracle.]

Figure 3: The effect of the decode beam width on WER for the clean Google voice search task. The reported

WERs are without a dictionary or language model, with language model rescoring and the oracle WER for

different beam widths. The figure shows that good results can be obtained even with a relatively small beam

size.

We measure the performance of our model as a function of the number of words in the utterance. We expect the model to do poorly on longer utterances due to the limited number of long training utterances in our distribution. Hence it is not surprising that longer utterances have a larger error rate. Deletions dominate the error for long utterances, suggesting the model may be dropping words. It is surprising that short utterances (e.g., 2 words or fewer) perform quite poorly. Here, substitutions and insertions are the main sources of errors, suggesting the model may split words apart.

Figure 4 also suggests that our model struggles to generalize to long utterances when trained on a

distribution of shorter utterances. It is possible location-based priors may help in these situations as

reported by [15].

[Figure 4 plot, “Utterance Length vs. Error”: percentage (y-axis) against number of words in utterance (x-axis; 5, 10, 15, 20, 25+), with series for data distribution, insertion, deletion, substitution, WER, WER LM and WER Oracle.]

Figure 4: The correlation between error rates (insertion, deletion, substitution and WER) and the number of words in an utterance. The WER is reported without a dictionary or language model, with language model rescoring and the oracle WER for the clean Google voice search task. The data distribution with respect to the number of words in an utterance is overlaid in the figure. LAS performs poorly with short utterances despite an abundance of data. LAS also fails to generalize well on longer utterances when trained on a distribution of shorter utterances. Insertions and substitutions are the main sources of errors for short utterances, while deletions dominate the error for long utterances.

We study the performance of our model on rare words. We use the recall metric to indicate whether a word appears in the utterance regardless of position (higher is better). Figure 5 reports the recall of each word in the test distribution as a function of the word frequency in the training distribution. Rare words have higher variance and lower recall while more frequent words typically have higher recall. The word “and” occurs 85k times in the training set, yet it has a recall of only 80% even after language model rescoring. The word “and” is frequently mis-transcribed as “in” (which has 95% recall). This suggests improvements are needed in the language model. By contrast, the word “walkerville” occurs just once in the training set but has a recall of 100%. This suggests that the recall for a word depends both on its frequency in the training set and its acoustic uniqueness.
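One plausible reading of the recall metric is sketched below: for each word, the fraction of reference utterances containing it whose hypothesis also contains it, ignoring position. How repeated occurrences within one utterance were counted is not stated, so the utterance-level presence test is an assumption, and the example sentences are invented.

```python
from collections import defaultdict

def word_recall(references, hypotheses):
    """Per-word recall over paired reference/hypothesis transcripts."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ref, hyp in zip(references, hypotheses):
        hyp_words = set(hyp.split())
        for word in set(ref.split()):
            totals[word] += 1
            hits[word] += word in hyp_words  # True counts as 1
    return {w: hits[w] / totals[w] for w in totals}

# Made-up example in which "and" is mis-transcribed as "in" once.
recall = word_recall(
    ["cats and dogs", "salt and pepper"],
    ["cats in dogs", "salt and pepper"],
)
```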

In this section, we show the outputs of the model on several utterances to demonstrate the capabilities

of LAS. All the results in this section are decoded without a dictionary or a language model.

During our experiments, we observed that LAS can learn multiple spelling variants given the same

acoustics. Table 2 shows top beams for the utterance that includes “triple a”. As can be seen,

the model produces both “triple a” and “aaa” within the top four beams. The decoder is able to

generate such varied parses, because the next step prediction model makes no assumptions on the

probability distribution by using the chain rule decomposition. It would be difficult to produce such

differing transcripts using CTC due to the conditional independence assumptions, where p(yi |x)

is conditionally independent of p(yi+1 |x). Conventional DNN-HMM systems would require both

spellings to be in the pronunciation dictionary to generate both spelling permutations.

[Figure 5 plot, “Word Frequency vs. Recall”: word recall percentage (y-axis, 0 to 100) against word frequency (x-axis, log scale from 10^0 to 10^6).]

Figure 5: The correlation between word frequency in the training distribution and recall in the test distribution. In general, rare words report worse recall compared to more frequent words.

Table 2 (top beams for the utterance that includes “triple a”):

Beam    Text                                  Log probability   WER (%)
Truth   call aaa roadside assistance          -                 -
1       call aaa roadside assistance          -0.5740           0.00
2       call triple a roadside assistance     -1.5399           50.00
3       call trip way roadside assistance     -3.5012           50.00
4       call xxx roadside assistance          -4.4375           25.00

It can also be seen that the model produced “xxx” even though acoustically “x” is very different from “a”; this is presumably because the language model overpowers the acoustic signal in this case. In the training corpus “xxx” is a very common phrase and we suspect the language model implicit in the speller learns to associate “triple” with “xxx”. We note that “triple a” occurs 4 times in the training distribution and “aaa” (when pronounced “triple a” rather than “a”-“a”-“a”) occurs only once in the training distribution.

We were also surprised that the model is capable of handling utterances with repeated words despite the fact that it uses content-based attention. Table 3 shows an example of an utterance with a repeated word. Since LAS implements content-based attention, one might expect it to “lose its attention” during the decoding steps and produce a word more or fewer times than it was spoken. As can be seen from this example, even though “seven” is repeated three times, the model successfully outputs “seven” three times. This hints that location-based priors (e.g., location-based attention or location-based regularization) may not be needed for repeated contents.

Table 3 (top beams for an utterance with a repeated word):

Beam    Text                                         Log probability   WER (%)
Truth   eight nine four minus seven seven seven      -                 -
1       eight nine four minus seven seven seven      -0.2145           0.00
2       eight nine four nine seven seven seven       -1.9071           14.29
3       eight nine four minus seven seventy seven    -4.7316           14.29
4       eight nine four nine s seven seven seven     -5.1252           28.57
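The per-beam WER values listed for these beams are consistent with the usual definition of word error rate: the word-level edit distance (substitutions, insertions and deletions) divided by the number of reference words. The sketch below is a generic implementation of that definition, not the scoring tool used for the experiments.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)

truth = "eight nine four minus seven seven seven"
wer_beam_4 = word_error_rate(truth, "eight nine four nine s seven seven seven")
```

Beam 4 differs from the truth by one substitution and one insertion over seven reference words, giving 2/7, or about 28.57%.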


5 Conclusions

We have presented Listen, Attend and Spell (LAS), an attention-based neural network that can di-

rectly transcribe acoustic signals to characters. LAS is based on the sequence to sequence framework

with a pyramid structure in the encoder that reduces the number of timesteps that the decoder has

to attend to. LAS is trained end-to-end and has two main components. The first component, the

listener, is a pyramidal acoustic RNN encoder that transforms the input sequence into a high level

feature representation. The second component, the speller, is an RNN decoder that attends to the

high level features and spells out the transcript one character at a time. Our system does not use

the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs. We bypass the

conditional independence assumptions of CTC, and show how we can learn an implicit language

model that can generate multiple spelling variants given the same acoustics. To further improve

the results, we used samples from the softmax classifier in the decoder as inputs to the next step

prediction during training. Finally, we showed how a language model trained on additional text can

be used to rerank our top hypotheses.

Acknowledgements

We thank Tara Sainath and Babak Damavandi for helping us with the data and language models, and for helpful comments. We also thank Andrew Dai, Ashish Agarwal, Samy Bengio, Eugene Brevdo, Greg Corrado, Jeff Dean, Rajat Monga, Christopher Olah, Mike Schuster, Noam Shazeer, Ilya Sutskever, Vincent Vanhoucke and the Google Brain team for helpful comments, suggestions and technical assistance.

References

[1] Nelson Morgan and Herve Bourlard. Continuous Speech Recognition using Multilayer Perceptrons with Hidden Markov Models. In IEEE International Conference on Acoustics, Speech and Signal Processing, 1990.

[2] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Deep belief networks for

phone recognition. In Neural Information Processing Systems: Workshop on Deep Learning

for Speech Recognition and Related Applications, 2009.

[3] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Large vocabulary continuous speech

recognition with context-dependent dbn-hmms. In IEEE International Conference on Acous-

tics, Speech and Signal Processing, 2011.

[4] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling us-

ing deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing,

20(1):14–22, 2012.

[5] Navdeep Jaitly, Patrick Nguyen, Andrew W. Senior, and Vincent Vanhoucke. Application

of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition. In INTER-

SPEECH, 2012.

[6] Tara Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep

Convolutional Neural Networks for LVCSR. In IEEE International Conference on Acoustics,

Speech and Signal Processing, 2013.

[7] Kanishka Rao, Fuchun Peng, Hasim Sak, and Francoise Beaufays. Grapheme-to-phoneme

conversion using long short-term memory recurrent neural networks. In IEEE International

Conference on Acoustics, Speech and Signal Processing, 2015.

[8] Kaisheng Yao and Geoffrey Zweig. Sequence-to-Sequence Neural Net Models for Grapheme-

to-Phoneme Conversion. 2015.

[9] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.

[10] Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In International Conference on Machine Learning, 2006.


[11] Alex Graves. Sequence Transduction with Recurrent Neural Networks. In International Con-

ference on Machine Learning: Representation Learning Workshop, 2012.

[12] Alex Graves and Navdeep Jaitly. Towards End-to-End Speech Recognition with Recurrent

Neural Networks. In International Conference on Machine Learning, 2014.

[13] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan

Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Ng. Deep Speech:

Scaling up end-to-end speech recognition. In http://arxiv.org/abs/1412.5567, 2014.

[14] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end Con-

tinuous Speech Recognition using Attention-based Recurrent NN: First Results. In Neural In-

formation Processing Systems: Workshop Deep Learning and Representation Learning Work-

shop, 2014.

[15] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio.

Attention-Based Models for Speech Recognition. In http://arxiv.org/abs/1506.07503, 2015.

[16] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by

Jointly Learning to Align and Translate. In International Conference on Learning Representa-

tions, 2015.

[17] Ilya Sutskever, Oriol Vinyals, and Quoc Le. Sequence to Sequence Learning with Neural

Networks. In Neural Information Processing Systems, 2014.

[18] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Conference on Empirical Methods in Natural Language Processing, 2014.

[19] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled Sampling for Se-

quence Prediction with Recurrent Neural Networks. In http://arxiv.org/abs/1506.03099, 2015.

[20] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. Convolutional, Long Short-

Term Memory, Fully Connected Deep Neural Networks. In IEEE International Conference on

Acoustics, Speech and Signal Processing, 2015.

[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep

Convolutional Neural Networks. In Neural Information Processing Systems, 2012.

[22] Leonard E. Baum and Ted Petrie. Statistical Inference for Probabilistic Functions of Finite

State Markov Chains. The Annals of Mathematical Statistics, 37:1554–1563, 1966.

[23] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields: Proba-

bilistic Models for Segmenting and Labeling Sequence Data. In International Conference on

Machine Learning, 2001.

[24] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Ad-

dressing the Rare Word Problem in Neural Machine Translation. In Association for Computa-

tional Linguistics, 2015.

[25] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On Using Very

Large Target Vocabulary for Neural Machine Translation. In Association for Computational

Linguistics, 2015.

[26] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural

Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition,

2015.

[27] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov,

Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation

with Visual Attention. In International Conference on Machine Learning, 2015.

[28] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton.

Grammar as a foreign language. In http://arxiv.org/abs/1412.7449, 2014.

[29] Oriol Vinyals and Quoc V. Le. A Neural Conversational Model. In International Conference

on Machine Learning: Deep Learning Workshop, 2015.

[30] Sepp Hochreiter and Jurgen Schmidhuber. Long Short-Term Memory. Neural Computation,

9(8):1735–1780, November 1997.


[31] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid Speech Recognition with

Bidirectional LSTM. In Automatic Speech Recognition and Understanding Workshop, 2013.

[32] Salah Hihi and Yoshua Bengio. Hierarchical Recurrent Neural Networks for Long-Term De-

pendencies. In Neural Information Processing Systems, 1996.

[33] Jan Koutnik, Klaus Greff, Faustino Gomez, and Jurgen Schmidhuber. A Clockwork RNN. In

International Conference on Machine Learning, 2014.

[34] Navdeep Jaitly, Vincent Vanhoucke, and Geoffrey Hinton. Autoregressive product of multi-

frame predictions can improve the accuracy of hybrid models. In INTERSPEECH, 2014.

[35] Hasim Sak, Andrew Senior, Kanishka Rao, and Francoise Beaufays. Fast and Accurate Recur-

rent Neural Network Acoustic Models for Speech Recognition. In INTERSPEECH, 2015.

[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Repre-

sentations of Words and Phrases and their Compositionality. In Neural Information Processing

Systems, 2013.

[37] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi Speech Recognition Toolkit. In Automatic Speech Recognition and Understanding Workshop, 2011.

[38] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z.

Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large

Scale Distributed Deep Networks. In Neural Information Processing Systems, 2012.


A Alignment Examples

In this section, we give additional visualization examples of our model and the attention distribution.

Figure 6: The spelling variants “aaa” and “triple a” produce different attention distributions; both spelling variants appear in our top beams. The ground truth is “aaa emergency roadside service”.


Figure 7: The spelling variants “st” and “saint” produce different attention distributions; both spelling variants appear in our top beams. The ground truth is “st mary’s animal clinic”.


Figure 8: The word “cancel” is repeated three times. Note the parallel diagonals: the content attention mechanism gets slightly confused, but the model still emits the correct hypothesis. The ground truth is “cancel cancel cancel”.


Published as a conference paper at ICLR 2016

NEURAL PROGRAMMER: INDUCING LATENT PROGRAMS WITH GRADIENT DESCENT

Arvind Neelakantan∗, University of Massachusetts Amherst, arvind@cs.umass.edu

Quoc V. Le, Google Brain, qvl@google.com

Ilya Sutskever, Google Brain, ilyasu@google.com

arXiv:1511.04834v3 [cs.LG] 4 Aug 2016

ABSTRACT

Deep neural networks have achieved impressive supervised classification perfor-

mance in many tasks including image recognition, speech recognition, and se-

quence to sequence learning. However, this success has not been translated to ap-

plications like question answering that may involve complex arithmetic and logic

reasoning. A major limitation of these models is in their inability to learn even

simple arithmetic and logic operations. For example, it has been shown that neural

networks fail to learn to add two binary numbers reliably. In this work, we pro-

pose Neural Programmer, a neural network augmented with a small set of basic

arithmetic and logic operations that can be trained end-to-end using backpropaga-

tion. Neural Programmer can call these augmented operations over several steps,

thereby inducing compositional programs that are more complex than the built-in

operations. The model learns from a weak supervision signal which is the result of

execution of the correct program, hence it does not require expensive annotation

of the correct program itself. The decisions of what operations to call, and what

data segments to apply to are inferred by Neural Programmer. Such decisions,

during training, are done in a differentiable fashion so that the entire network can

be trained jointly by gradient descent. We find that training the model is diffi-

cult, but it can be greatly improved by adding random noise to the gradient. On

a fairly complex synthetic table-comprehension dataset, traditional recurrent net-

works and attentional models perform poorly while Neural Programmer typically

obtains nearly perfect accuracy.

1 INTRODUCTION

The past few years have seen the tremendous success of deep neural networks (DNNs) in a variety of

supervised classification tasks starting with image recognition (Krizhevsky et al., 2012) and speech

recognition (Hinton et al., 2012) where the DNNs act on a fixed-length input and output. More

recently, this success has been translated into applications that involve a variable-length sequence

as input and/or output such as machine translation (Sutskever et al., 2014; Bahdanau et al., 2014;

Luong et al., 2014), image captioning (Vinyals et al., 2015; Xu et al., 2015), conversational model-

ing (Shang et al., 2015; Vinyals & Le, 2015), end-to-end Q&A (Sukhbaatar et al., 2015; Peng et al.,

2015; Hermann et al., 2015), and end-to-end speech recognition (Graves & Jaitly, 2014; Hannun

et al., 2014; Chan et al., 2015; Bahdanau et al., 2015).

While these results strongly indicate that DNN models are capable of learning the fuzzy underlying

patterns in the data, they have not had similar impact in applications that involve crisp reasoning.

A major limitation of these models is in their inability to learn even simple arithmetic and logic

operations. For example, Joulin & Mikolov (2015) show that recurrent neural networks (RNNs) fail

at the task of adding two binary numbers even when the result has fewer than 10 bits. This makes

existing DNN models unsuitable for downstream applications that require complex reasoning, e.g.,

natural language question answering. For example, to answer the question “how many states border

Texas?” (see Zettlemoyer & Collins (2005)), the algorithm has to perform an act of counting in a

table which is something that a neural network is not yet good at.

∗ Work done during an internship at Google.


A fairly common method for solving these problems is program induction where the goal is to find

a program (in SQL or some high-level languages) that can correctly solve the task. An application

of these models is in semantic parsing where the task is to build a natural language interface to a

structured database (Zelle & Mooney, 1996). This problem is often formulated as mapping a natural

language question to an executable query.

A drawback of existing methods in semantic parsing is that they are difficult to train and require

a great deal of human supervision. As the space over programs is non-smooth, it is difficult to

apply simple gradient descent; most often, gradient descent is augmented with a complex search

procedure, such as sampling (Liang et al., 2010). To further simplify training, the algorithmic de-

signers have to manually add more supervision signals to the models in the form of annotation of the

complete program for every question (Zettlemoyer & Collins, 2005) or a domain-specific grammar

(Liang et al., 2011). For example, designers write grammars whose rules associate lexical items with

the correct operations, e.g., the word “largest” with the operation “argmax”, or produce only syntactically

valid programs, e.g., disallowing the program >= dog. The role of hand-crafted grammars is crucial in

semantic parsing yet also limits its general applicability to many different domains. In a recent work

by Wang et al. (2015) to build semantic parsers for 7 domains, the authors hand engineer a separate

grammar for each domain.

The goal of this work is to develop a model that does not require substantial human supervision

and is broadly applicable across different domains, data sources and natural languages. We propose

Neural Programmer (Figure 1), a neural network augmented with a small set of basic arithmetic

and logic operations that can be trained end-to-end using backpropagation. In our formulation, the

neural network can run several steps using a recurrent neural network. At each step, it can select a

segment in the data source and a particular operation to apply to that segment. The neural network

propagates these outputs forward at every step to form the final, more complicated output. Using

the target output, we can adjust the network to select the right data segments and operations, thereby

inducing the correct program. Key to our approach is that the selection process (for the data source

and operations) is done in a differentiable fashion (i.e., soft selection or attention), so that the whole

neural network can be trained jointly by gradient descent. At test time, we replace soft selection

with hard selection.
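The difference between soft and hard selection can be illustrated with a toy example. This is a sketch under stated assumptions, not the model's actual selection mechanism: the three operations, the column values and the controller scores are invented, and only selection over operations (not over data segments) is shown.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_select(scores, op_outputs):
    """Training time: probability-weighted average of all operation outputs."""
    return sum(p * out for p, out in zip(softmax(scores), op_outputs))

def hard_select(scores, op_outputs):
    """Test time: output of the single highest-scoring operation."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return op_outputs[best]

column = [3.0, 5.0, 4.0]
op_outputs = [sum(column), float(len(column)), max(column)]  # sum, count, max
scores = [2.0, 0.5, 0.1]  # hypothetical controller scores

soft_result = soft_select(scores, op_outputs)
hard_result = hard_select(scores, op_outputs)
```

The soft result is a smooth function of the scores, so gradients can flow back to the controller; as the scores become more peaked, it approaches the hard result.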

[Figure 1 diagram: at each timestep t = 1, 2, …, T, the labeled components are Input, Controller, Soft Selection, Arithmetic and logic operations, and Apply.]

Figure 1: The architecture of Neural Programmer, a neural network augmented with arithmetic and

logic operations. The controller selects the operation and the data segment. The memory stores the

output of the operations applied to the data segments and the previous actions taken by the controller.

The controller runs for several steps thereby inducing compositional programs that are more complex

than the built-in operations. The dotted line indicates that the controller uses information in the

memory to make decisions in the next time step.

By combining a neural network with mathematical operations, we can utilize both the fuzzy pattern

matching capabilities of deep networks and the crisp algorithmic power of traditional programmable

computers. This approach of using an augmented logic and arithmetic component is reminiscent of

the idea of using an ALU (arithmetic and logic unit) in a conventional computer (Von Neumann,

1945). It is loosely related to the symbolic numerical processing abilities exhibited in the intrapari-

etal sulcus (IPS) area of the brain (Piazza et al., 2004; Cantlon et al., 2006; Kucian et al., 2006; Fias

et al., 2007; Dastjerdi et al., 2013). Our work is also inspired by the success of the soft attention

mechanism (Bahdanau et al., 2014) and its application in learning a neural network to control an

additional memory component (Graves et al., 2014; Sukhbaatar et al., 2015).

Published as a conference paper at ICLR 2016

Neural Programmer has two attractive properties. First, it learns from a weak supervision signal

which is the result of execution of the correct program. It does not require the expensive annotation

of the correct program for the training examples. The human supervision effort is in the form of

question, data source and answer triples. Second, Neural Programmer does not require additional

rules to guide the program search, making it a general framework. With Neural Programmer, the

algorithmic designer only defines a list of basic operations, which requires less human effort than

in previous program induction techniques.

We experiment with a synthetic table-comprehension dataset, consisting of questions with a wide

range of difficulty levels. Examples of natural language translated queries include “print elements in

column H whose field in column C is greater than 50 and field in column E is less than 20?” or “what

is the difference between sum of elements in column A and number of rows in the table?”. We find

that LSTM recurrent networks (Hochreiter & Schmidhuber, 1997) and LSTM models with attention

(Bahdanau et al., 2014) do not work well. Neural Programmer, however, can completely solve this

task or achieve greater than 99% accuracy on most cases by inducing the required latent program.

We find that training the model is difficult, but it can be greatly improved by injecting random Gaussian noise into the gradient (Welling & Teh, 2011; Neelakantan et al., 2016), which enhances the generalization ability of Neural Programmer.

2 NEURAL PROGRAMMER

Even though our model is quite general, in this paper, we apply Neural Programmer to the task of

question answering on tables, a task that has not been previously attempted by neural networks.

In our implementation for this task, Neural Programmer is run for a total of T time steps chosen

in advance to induce compositional programs of up to T operations. The model consists of four

modules:

• A question RNN to process the input question,

• A selector to assign two probability distributions at every step, one over the set of operations and the other over the data segments,

• A list of operations that the model can apply, and

• A history RNN to remember the previous operations and data segments selected by the model till the current time step.

These four modules are also shown in Figure 2. The history RNN combined with the selector module

functions as the controller in this case. Information about each component is discussed in the next

sections.

[Figure 2 diagram. Recoverable labels: Question RNN (q); History RNN (h_{t-1}, RNN step, h_t, input c_t at step t); Op Selector (h_op) and Col Selector (h_col), each followed by a Softmax; Operations; Data Source; Apply; Output_t = op on data weighted by softmax; Final Output = Output_T; input at step t+1; t = 1, 2, …, T.]

Figure 2: An implementation of Neural Programmer for the task of question answering on tables.

The output of the model at time step t is obtained by applying the operations on the data segments

weighted by their probabilities. The final output of the model is the output at time step T . The dotted

line indicates the input to the history RNN at step t+1.

Apart from the list of operations, all the other modules are learned using gradient descent on a

training set consisting of triples, where each triple has a question, a data source and an answer. We

assume that the data source is in the form of a table, $\mathit{table} \in \mathbb{R}^{M \times C}$, containing $M$ rows and $C$

columns (M and C can vary amongst examples). The data segments in our experiments are the

columns, where each column also has a column name.

2.1 QUESTION MODULE

The question module converts the question tokens to a distributed representation. In the basic version of our model, we use a simple RNN (Werbos, 1990) parameterized by $W^{question}$ and the last hidden state of the RNN is used as the question representation (Figure 3).

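[Figure 3 diagram. Recoverable labels: Question RNN over the words w_1, w_2, …, w_Q with hidden states z_1, z_2 = tanh(W^{question} [z_1; V(w_2)]), …; the last RNN hidden state gives q = z_Q.]

Figure 3: The question module to process the input question. $q = z_Q$ denotes the question representation used by Neural Programmer.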

Consider an input question containing $Q$ words $\{w_1, w_2, \ldots, w_Q\}$. The question module performs the following computations:

$$z_i = \tanh(W^{question} [z_{i-1}; V(w_i)]), \quad \forall i = 1, 2, \ldots, Q$$

where $V(w_i) \in \mathbb{R}^d$ represents the embedded representation of the word $w_i$, $[a; b] \in \mathbb{R}^{2d}$ represents the concatenation of two vectors $a, b \in \mathbb{R}^d$, $W^{question} \in \mathbb{R}^{d \times 2d}$ is the recurrent matrix of the question RNN, $\tanh$ is the element-wise non-linearity function and $z_Q \in \mathbb{R}^d$ is the representation of the question. We set $z_0$ to $[0]^d$. We pre-process the question by removing numbers from it and storing the numbers in a separate list. Along with each number we store the word that appeared to the left of it in the question, which is useful to compute the pivot values for the comparison operations described in Section 2.3.
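The recurrence and the number pre-processing described above can be sketched as follows. This is an illustrative sketch only: the helper names, the toy question, the dimension d = 4 and the random parameters are all assumptions, and numbers are detected with Python's `float()` parser, which is a simplification.

```python
import math
import random

def rnn_step(W, z_prev, x):
    """One step z_i = tanh(W [z_prev; x]); W has d rows and 2d columns."""
    concat = z_prev + x  # list concatenation plays the role of [a; b]
    return [math.tanh(sum(w * v for w, v in zip(row, concat))) for row in W]

def encode_question(tokens, embed, W, d):
    """Return the hidden states z_1..z_Q; the last one is the representation q."""
    z = [0.0] * d  # z_0 = [0]^d
    states = []
    for tok in tokens:
        z = rnn_step(W, z, embed[tok])
        states.append(z)
    return states

def strip_numbers(words):
    """Remove numbers, keeping for each the index of the token to its left."""
    tokens, numbers = [], []
    for w in words:
        try:
            value = float(w)  # simplification: anything float() accepts is a number
        except ValueError:
            tokens.append(w)
        else:
            numbers.append((value, len(tokens) - 1))  # -1 if no word precedes it
    return tokens, numbers

rng = random.Random(0)
d = 4
tokens, numbers = strip_numbers("sum A greater 50.32 C".split())
embed = {t: [rng.uniform(-0.1, 0.1) for _ in range(d)] for t in tokens}
W_question = [[rng.uniform(-0.1, 0.1) for _ in range(2 * d)] for _ in range(d)]
states = encode_question(tokens, embed, W_question, d)
q = states[-1]  # question representation
```

The list of hidden states is kept because the hidden state to the left of each number is needed later for the pivot computation.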

For tasks that involve longer questions, we use a bidirectional RNN since we find that a simple

unidirectional RNN has trouble remembering the beginning of the question. When the bidirectional

RNN is used, the question representation is obtained by concatenating the last hidden states of the

two-ends of the bidirectional RNNs. The question representation is denoted by q.

2.2 SELECTOR

The selector produces two probability distributions at every time step $t$ ($t = 1, 2, \ldots, T$): one probability distribution over the set of operations and another probability distribution over the set of columns. The inputs to the selector are the question representation ($q \in \mathbb{R}^d$) from the question module and the output of the history RNN (described in Section 2.4) at time step $t$ ($h_t \in \mathbb{R}^d$), which stores information about the operations and columns selected by the model up to the previous step.

Each operation is represented using a $d$-dimensional vector. Let the number of operations be $O$ and let $U \in \mathbb{R}^{O \times d}$ be the matrix storing the representations of the operations.

Operation Selection is performed by:

$$\alpha_t^{op} = \mathrm{softmax}(U \tanh(W^{op} [q; h_t]))$$

where $W^{op} \in \mathbb{R}^{d \times 2d}$ is the parameter matrix of the operation selector that produces the probability distribution $\alpha_t^{op} \in [0, 1]^O$ over the set of operations (Figure 4).

The selector also produces a probability distribution over the columns at every time step. We obtain vector representations for the column names using the parameters in the question module (Section 2.1) by word embedding or an RNN phrase embedding. Let $P \in \mathbb{R}^{C \times d}$ be the matrix storing the representations of the column names.

[Figure 4 diagram. Recoverable labels: History RNN (h_{t-1}, RNN step, h_t, input c_t at step t); Question RNN (q); Op Selector computing h_op = tanh(W^{op} [q; h_t]); Softmax over Operations (Op: 1, Op: 2, …); t = 1, 2, …, T.]

Figure 4: Operation selection at time step t where the selector assigns a probability distribution over the set of operations.

Data Selection is performed by:

$$\alpha_t^{col} = \mathrm{softmax}(P \tanh(W^{col} [q; h_t]))$$

where $W^{col} \in \mathbb{R}^{d \times 2d}$ is the parameter matrix of the column selector that produces the probability distribution $\alpha_t^{col} \in [0, 1]^C$ over the set of columns (Figure 5).
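Both selectors share the same form, a softmax over the products of candidate embeddings with a tanh projection of $[q; h_t]$, so a single helper can illustrate them. The sketch below is an assumption-laden toy (d = 2, three operations, two columns, hand-picked weights), not the trained model.

```python
import math

def matvec(matrix, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select(E, W, q, h):
    """softmax(E tanh(W [q; h])), with one row of E per candidate."""
    hidden = [math.tanh(x) for x in matvec(W, q + h)]
    return softmax(matvec(E, hidden))

# Toy sizes and values (all made up): d = 2, O = 3 operations, C = 2 columns.
q = [0.5, -0.2]
h = [0.0, 0.0]
W_op = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]   # d x 2d
W_col = [[0.0, 1.0, 0.0, 0.5], [1.0, 0.0, 0.5, 0.0]]  # d x 2d
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]            # operation embeddings, O x d
P = [[1.0, 1.0], [-1.0, 0.5]]                         # column-name embeddings, C x d

alpha_op = select(U, W_op, q, h)    # distribution over operations
alpha_col = select(P, W_col, q, h)  # distribution over columns
```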

[Figure 5 diagram. Recoverable labels: Timestep t; History RNN (h_{t-1}, RNN step, h_t, input c_t at step t); Question RNN (q); Col Selector computing h_col = tanh(W^{col} [q; h_t]); Softmax over the Data Source columns Col:1 … Col:C; t = 1, 2, …, T.]

Figure 5: Data selection at time step t where the selector assigns a probability distribution over the

set of columns.

2.3 OPERATIONS

Neural Programmer currently supports two types of outputs: a) a scalar output, and b) a list of items selected from the table (i.e., table lookup).¹ The first type of output is for questions of type “Sum of elements in column C” while the second type of output is for questions of type “Print elements in column A that are greater than 50.” To facilitate this, the model maintains two kinds of output variables at every step $t$, $\mathit{scalar\_answer}_t \in \mathbb{R}$ and $\mathit{lookup\_answer}_t \in [0, 1]^{M \times C}$. The output $\mathit{lookup\_answer}_t(i, j)$ stores the probability that the element $(i, j)$ in the table is part of the output. The final output of the model is $\mathit{scalar\_answer}_T$ or $\mathit{lookup\_answer}_T$ depending on whichever of the two is updated after $T$ time steps. Apart from the two output variables, the model maintains an additional variable $\mathit{row\_select}_t \in [0, 1]^M$ that is updated at every time step. The variables $\mathit{row\_select}_t[i]$ ($\forall i = 1, 2, \ldots, M$) maintain the probability of selecting row $i$ and allow the model to dynamically select a subset of rows within a column. The output is initialized to zero while the row select variable is initialized to $[1]^M$.

Key to Neural Programmer are the built-in operations, which have access to the outputs of the model at every time step before the current time step $t$, i.e., the operations have access to $(\mathit{scalar\_answer}_i, \mathit{lookup\_answer}_i)$, $\forall i = 1, 2, \ldots, t - 1$. This enables the model to build powerful compositional programs.

It is important to design the operations such that they can work with probabilistic row and column selection so that the model is differentiable. Table 1 shows the list of operations built into the model along with their definitions. The reset operation can be selected any number of times, which, when required, allows the model to induce programs whose complexity is less than $T$ steps.

¹ It is trivial to extend the model to support general text responses by adding a decoder RNN to generate text sentences.

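| Type | Operation | Definition |
|---|---|---|
| Aggregate | Sum | $\mathit{sum}_t[j] = \sum_{i=1}^{M} \mathit{row\_select}_{t-1}[i] * \mathit{table}[i][j], \ \forall j = 1, 2, \ldots, C$ |
| Aggregate | Count | $\mathit{count}_t = \sum_{i=1}^{M} \mathit{row\_select}_{t-1}[i]$ |
| Arithmetic | Difference | $\mathit{diff}_t = \mathit{scalar\_answer}_{t-3} - \mathit{scalar\_answer}_{t-1}$ |
| Comparison | Greater | $g_t[i][j] = \mathit{table}[i][j] > \mathit{pivot}_g, \ \forall (i, j), \ i = 1, \ldots, M, \ j = 1, \ldots, C$ |
| Comparison | Lesser | $l_t[i][j] = \mathit{table}[i][j] < \mathit{pivot}_l, \ \forall (i, j), \ i = 1, \ldots, M, \ j = 1, \ldots, C$ |
| Logic | And | $\mathit{and}_t[i] = \min(\mathit{row\_select}_{t-1}[i], \mathit{row\_select}_{t-2}[i]), \ \forall i = 1, 2, \ldots, M$ |
| Logic | Or | $\mathit{or}_t[i] = \max(\mathit{row\_select}_{t-1}[i], \mathit{row\_select}_{t-2}[i]), \ \forall i = 1, 2, \ldots, M$ |
| Assign Lookup | assign | $\mathit{assign}_t[i][j] = \mathit{row\_select}_{t-1}[i], \ \forall (i, j), \ i = 1, 2, \ldots, M, \ j = 1, 2, \ldots, C$ |
| Reset | Reset | $\mathit{reset}_t[i] = 1, \ \forall i = 1, 2, \ldots, M$ |

Table 1: List of operations along with their definitions at time step $t$. $\mathit{table} \in \mathbb{R}^{M \times C}$ is the data source in the form of a table and $\mathit{row\_select}_t \in [0, 1]^M$ functions as a row selector.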

While the definitions of the operations are fairly straightforward, the comparison operations greater and lesser require a pivot value as input (see Table 1), which appears in the question. Let $qn_1, qn_2, \ldots, qn_N$ be the numbers that appear in the question.

For every comparison operation (greater and lesser), we compute its pivot value by adding up all the numbers in the question, each weighted by a probability computed from the hidden vector at the position to the left of the number² and the operation’s embedding vector. More precisely:

$$\beta_{op} = \mathrm{softmax}(Z\, U(op))$$

$$\mathit{pivot}_{op} = \sum_{i=1}^{N} \beta_{op}(i)\, qn_i$$

where $U(op) \in \mathbb{R}^d$ is the vector representation of operation $op$ ($op \in \{\text{greater}, \text{lesser}\}$) and $Z \in \mathbb{R}^{N \times d}$ is the matrix storing the hidden vectors of the question RNN at positions to the left of the occurrence of the numbers.
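A small sketch of the pivot computation follows. The hidden vectors in Z and the two operation embeddings are hand-picked (hypothetical) so that each operation attends almost entirely to one number; in the model these quantities are learned.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pivot(Z, u_op, numbers):
    """Soft pivot: sum_i beta_op(i) * qn_i with beta_op = softmax(Z u_op)."""
    scores = [sum(z * u for z, u in zip(row, u_op)) for row in Z]
    beta = softmax(scores)
    return sum(b * n for b, n in zip(beta, numbers))

numbers = [50.32, 20.21]       # numbers removed from the question
Z = [[4.0, 0.0], [0.0, 4.0]]   # hidden state to the left of each number (made up)
u_greater = [3.0, -3.0]        # embedding of "greater" (made up)
u_lesser = [-3.0, 3.0]         # embedding of "lesser" (made up)

pivot_g = pivot(Z, u_greater, numbers)
pivot_l = pivot(Z, u_lesser, numbers)
```

Because the pivot is a convex combination of the numbers in the question, it stays differentiable with respect to the question RNN and the operation embeddings.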

By overloading the definition of $\alpha_t^{op}$ and $\alpha_t^{col}$, let $\alpha_t^{op}(x)$ and $\alpha_t^{col}(j)$ denote the probability assigned by the selector to operation $x$ ($x \in \{\text{sum}, \text{count}, \text{difference}, \text{greater}, \text{lesser}, \text{and}, \text{or}, \text{assign}, \text{reset}\}$) and column $j$ ($\forall j = 1, 2, \ldots, C$) at time step $t$ respectively.

Figure 6 shows how the output and row selector variables are computed. The output and row selector variables at a step are obtained by additively combining the outputs of the individual operations on the different data segments, weighted with their corresponding probabilities assigned by the model.

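[Figure 6 diagram. Recoverable labels: Timestep t; Selector with a Softmax over the Data Source and a Softmax over the Operations; Apply; inputs row_select_{t-1}, row_select_{t-2}, scalar_answer_{t-1}, scalar_answer_{t-2}, scalar_answer_{t-3}; outputs row_select_t, scalar_answer_t, lookup_answer_t; t = 1, 2, …, T.]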

Figure 6: The output and row selector variables are obtained by applying the operations on the data

segments and additively combining their outputs weighted using the probabilities assigned by the

selector.

² This choice is made to reflect the common case in English where the pivot number is usually mentioned after the operation, but it is trivial to extend the model to use hidden vectors both to the left and to the right of the number.

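More formally, the scalar output variable is given by:

$$\mathit{scalar\_answer}_t = \alpha_t^{op}(\text{count})\, \mathit{count}_t + \alpha_t^{op}(\text{difference})\, \mathit{diff}_t + \sum_{j=1}^{C} \alpha_t^{col}(j)\, \alpha_t^{op}(\text{sum})\, \mathit{sum}_t[j]$$

The row selector variable is given by:

$$\mathit{row\_select}_t[i] = \alpha_t^{op}(\text{and})\, \mathit{and}_t[i] + \alpha_t^{op}(\text{or})\, \mathit{or}_t[i] + \alpha_t^{op}(\text{reset})\, \mathit{reset}_t[i] + \sum_{j=1}^{C} \alpha_t^{col}(j) \left( \alpha_t^{op}(\text{greater})\, g_t[i][j] + \alpha_t^{op}(\text{lesser})\, l_t[i][j] \right), \quad \forall i = 1, \ldots, M$$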

It is important to note that other operations like equal to, max, min, not etc. can be built into this

model easily.
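To make the update rules concrete, the following sketch applies one step of the scalar-output and row-selector equations to a toy table. It is a simplified illustration, not the authors' code: the lookup output is omitted, the table and pivots are made up, and one-hot distributions stand in for the selector (as with hard selection at test time), whereas during training the weights would be softmax outputs.

```python
OPS = ["sum", "count", "difference", "greater", "lesser", "and", "or", "assign", "reset"]

def one_hot(name):
    """Distribution over operations that puts all mass on one operation."""
    return {op: 1.0 if op == name else 0.0 for op in OPS}

def step(table, alpha_op, alpha_col, rs1, rs2, sa1, sa3, pivot_g, pivot_l):
    """One update of scalar_answer_t and row_select_t.

    rs1, rs2: row_select at steps t-1 and t-2.
    sa1, sa3: scalar_answer at steps t-1 and t-3.
    """
    M, C = len(table), len(table[0])
    sums = [sum(rs1[i] * table[i][j] for i in range(M)) for j in range(C)]
    count = sum(rs1)
    diff = sa3 - sa1
    scalar = (alpha_op["count"] * count
              + alpha_op["difference"] * diff
              + sum(alpha_col[j] * alpha_op["sum"] * sums[j] for j in range(C)))
    row_select = []
    for i in range(M):
        r = (alpha_op["and"] * min(rs1[i], rs2[i])
             + alpha_op["or"] * max(rs1[i], rs2[i])
             + alpha_op["reset"] * 1.0)
        for j in range(C):
            g = 1.0 if table[i][j] > pivot_g else 0.0
            l = 1.0 if table[i][j] < pivot_l else 0.0
            r += alpha_col[j] * (alpha_op["greater"] * g + alpha_op["lesser"] * l)
        row_select.append(r)
    return scalar, row_select

# Toy query: "sum of column 0 over rows whose column 1 is greater than 50".
table = [[10.0, 60.0], [20.0, 5.0], [30.0, 70.0]]
ones = [1.0, 1.0, 1.0]
_, rs_1 = step(table, one_hot("greater"), [0.0, 1.0], ones, ones, 0.0, 0.0, 50.0, 0.0)
answer, rs_2 = step(table, one_hot("sum"), [1.0, 0.0], rs_1, ones, 0.0, 0.0, 50.0, 0.0)
```

In this run the first step selects the rows whose second column exceeds the pivot, and the second step sums the first column over those rows. The row selector after the sum step is all zeros because no row-producing operation received any weight.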

So far, our discussion has been only concerned with tables that have numeric entries. In this section

we describe how Neural Programmer handles text entries in the input table. We assume a column

can contain either numeric or text entries. An example query is “what is the sum of elements in

column B whose field in column C is word:1 and field in column A is word:7?”. In other words, the

query is looking for text entries in the column that match specified words in the questions. To answer

these queries, we add a text match operation that updates the row selector variable appropriately. In

our implementation, the parameters for vector representations of the column’s text entries are shared

with the question module.

The text match operation uses a two-stage soft attention mechanism, back and forth from the text

entries to question module. In the following, we explain its implementation in detail.

Let $TC_1, TC_2, \ldots, TC_K$ be the set of columns that each have $M$ text entries and let $A \in \mathbb{R}^{M \times K \times d}$ store the vector representations of the text entries. In the first stage, the question representation coarsely selects the appropriate text entries through the sigmoid operation. Concretely, coarse selection, $B$, is given by the sigmoid of the dot product between the vector representations for text entries, $A$, and the question representation, $q$:

$$B[m][k] = \mathrm{sigmoid}\left( \sum_{p=1}^{d} A[m][k][p] \cdot q[p] \right) \quad \forall (m, k), \ m = 1, \ldots, M, \ k = 1, \ldots, K$$

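To obtain question-specific column representations, $D$, we use $B$ as weighting factors to compute the weighted average of the vector representations of the text entries in that column:

$$D[k][p] = \frac{1}{M} \sum_{m=1}^{M} \left( B[m][k] \cdot A[m][k][p] \right) \quad \forall (k, p), \ k = 1, \ldots, K, \ p = 1, \ldots, d$$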

To allow different words in the question to be matched to the corresponding columns (e.g., match word:1 in column C and match word:7 in column A for the question “what is the sum of elements in column B whose field in column C is word:1 and field in column A is word:7?”), we add the column name representations (described in Section 2.2), $P$, to $D$ to obtain column representations $E$. This makes the representation also sensitive to the column name.

In the second stage, we use E to compute an attention over the hidden states of the question RNN

to get attention vector G for each column of the input table. More concretely, we compute the dot

product between E and the hidden states of the question RNN to obtain scalar values. We then

pass them through softmax to obtain weighting factors for each hidden state. G is the weighted

combination of the hidden states of the question RNN.

Finally, text match selection is done by:

$$\mathit{text\_match}[m][k] = \mathrm{sigmoid}\left( \sum_{p=1}^{d} A[m][k][p] \cdot G[k][p] \right) \quad \forall (m, k), \ m = 1, \ldots, M, \ k = 1, \ldots, K$$

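Without loss of generality, let the first $K$ ($K \in \{0, 1, \ldots, C\}$) columns out of the $C$ columns of the table contain text entries while the remaining contain numeric entries. The row selector variable now is given by:

$$\mathit{row\_select}_t[i] = \alpha_t^{op}(\text{and})\, \mathit{and}_t[i] + \alpha_t^{op}(\text{or})\, \mathit{or}_t[i] + \alpha_t^{op}(\text{reset})\, \mathit{reset}_t[i] + \sum_{j=K+1}^{C} \alpha_t^{col}(j) \left( \alpha_t^{op}(\text{greater})\, g_t[i][j] + \alpha_t^{op}(\text{lesser})\, l_t[i][j] \right) + \sum_{j=1}^{K} \alpha_t^{col}(j)\, \alpha_t^{op}(\text{text match})\, \mathit{text\_match}_t[i][j], \quad \forall i = 1, \ldots, M$$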

The two-stage mechanism is required since in our experiments we find that simply averaging the

vector representations fails to make the representation of the column specific enough to the question.

Unless otherwise stated, our experiments are with input tables whose entries are only numeric and

in that case the model does not contain the text match operation.
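The two-stage text match described above can be sketched as follows. This is a hedged illustration: "adding" $P$ to $D$ is assumed to be element-wise addition, `P` here holds only the name vectors of the $K$ text columns, and all vectors are random placeholders rather than learned parameters.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_match(A, P, q, Z):
    """A[m][k]: vector of text entry (m, k); P[k]: name vector of text column k;
    q: question representation; Z: hidden states of the question RNN."""
    M, K, d = len(A), len(A[0]), len(q)
    # Stage 1: coarse selection B of text entries by the question.
    B = [[sigmoid(dot(A[m][k], q)) for k in range(K)] for m in range(M)]
    G = []
    for k in range(K):
        # Question-specific column vector D, made name-sensitive by adding P (assumed element-wise).
        D_k = [sum(B[m][k] * A[m][k][p] for m in range(M)) / M for p in range(d)]
        E_k = [D_k[p] + P[k][p] for p in range(d)]
        # Stage 2: attention of the column over the question hidden states.
        weights = softmax([dot(E_k, z) for z in Z])
        G.append([sum(w * z[p] for w, z in zip(weights, Z)) for p in range(d)])
    return [[sigmoid(dot(A[m][k], G[k])) for k in range(K)] for m in range(M)]

def rand_vec(rng, n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

rng = random.Random(0)
M, K, d, Q = 3, 2, 4, 5  # toy sizes
A = [[rand_vec(rng, d) for _ in range(K)] for _ in range(M)]
P = [rand_vec(rng, d) for _ in range(K)]
q = rand_vec(rng, d)
Z = [rand_vec(rng, d) for _ in range(Q)]
match = text_match(A, P, q, Z)  # M x K matrix of match probabilities
```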

2.4 HISTORY RNN

The history RNN keeps track of the previous operations and columns selected by the selector module so that the model can induce compositional programs. This information is encoded in the hidden vector of the history RNN at time step $t$, $h_t \in \mathbb{R}^d$. This helps the selector module to induce the probability distributions over the operations and columns by taking into account the previous actions selected by the model. Figure 7 shows details of this component.

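[Figure 7 diagram. Recoverable labels: Timestep t; History RNN (h_{t-1}, RNN step, input c_t at step t); Question RNN (q); Softmax over the Operations and over the Data Source; weighted sum of op vectors and weighted sum of col vectors, concatenated ([ ; ]) to form the input at step t+1; t = 1, 2, …, T.]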

Figure 7: The history RNN which helps in remembering the previous operations and data segments

selected by the model. The dotted line indicates the input to the history RNN at step t+1.

The input to the history RNN at time step $t$, $c_t \in \mathbb{R}^{2d}$, is obtained by concatenating the representations of operations and column names, each weighted by the corresponding probability distribution produced by the selector at step $t - 1$. More precisely:

$$c_t = [(\alpha_{t-1}^{op})^T U ; (\alpha_{t-1}^{col})^T P]$$

The hidden state of the history RNN at step $t$ is computed as:

$$h_t = \tanh(W^{history} [c_t; h_{t-1}]), \quad \forall t = 1, 2, \ldots, T$$

where $W^{history} \in \mathbb{R}^{d \times 3d}$ is the recurrent matrix of the history RNN, and $h_t \in \mathbb{R}^d$ is the current representation of the history. The history vector at time $t = 1$, $h_1$, is set to $[0]^d$.
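The history update can be sketched in a few lines. The embeddings, the previous-step distributions and the constant weight matrix below are toy assumptions chosen so that the result is easy to check by hand.

```python
import math

def history_input(alpha_op_prev, U, alpha_col_prev, P):
    """c_t = [(alpha_op_{t-1})^T U ; (alpha_col_{t-1})^T P], a vector of length 2d."""
    d = len(U[0])
    op_part = [sum(a * U[o][p] for o, a in enumerate(alpha_op_prev)) for p in range(d)]
    col_part = [sum(a * P[c][p] for c, a in enumerate(alpha_col_prev)) for p in range(d)]
    return op_part + col_part

def history_step(W_history, c_t, h_prev):
    """h_t = tanh(W_history [c_t; h_prev]); W_history has d rows and 3d columns."""
    x = c_t + h_prev
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W_history]

# Toy values (made up): d = 2, O = 3 operations, C = 2 columns.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P = [[2.0, 0.0], [0.0, 2.0]]
alpha_op_prev = [1.0, 0.0, 0.0]   # previous step put all mass on operation 0
alpha_col_prev = [0.0, 1.0]       # and on column 1
W_history = [[0.1] * 6, [0.1] * 6]

c_t = history_input(alpha_op_prev, U, alpha_col_prev, P)
h_t = history_step(W_history, c_t, [0.0, 0.0])
```

With one-hot distributions, $c_t$ is simply the embedding of the previously chosen operation followed by the embedding of the previously chosen column.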

The parameters of the model include the parameters of the question RNN, $W^{question}$, parameters of the history RNN, $W^{history}$, word embeddings $V(\cdot)$, operation embeddings $U$, and the operation selector and column selector matrices, $W^{op}$ and $W^{col}$ respectively. During training, depending on whether the answer is a scalar or a lookup from the table we have two different loss functions.

When the answer is a scalar, we use Huber loss (Huber, 1964) given by:

$$L_{scalar}(\mathit{scalar\_answer}_T, y) = \begin{cases} \frac{1}{2} a^2, & \text{if } a \le \delta \\ \delta a - \frac{1}{2} \delta^2, & \text{otherwise} \end{cases}$$

where $a = |\mathit{scalar\_answer}_T - y|$ is the absolute difference between the predicted and true answer, and $\delta$ is the Huber constant treated as a model hyper-parameter. In our experiments, we find that using square loss makes training unstable while using the absolute loss makes the optimization difficult near the non-differentiable point.

When the answer is a list of items selected from the table, we convert the answer to $y \in \{0, 1\}^{M \times C}$, where $y[i, j]$ indicates whether the element $(i, j)$ is part of the output. In this case we use log-loss over the set of elements in the table given by:

$$L_{lookup}(\mathit{lookup\_answer}_T, y) = -\frac{1}{MC} \sum_{i=1}^{M} \sum_{j=1}^{C} \Big( y[i, j] \log(\mathit{lookup\_answer}_T[i, j]) + (1 - y[i, j]) \log(1 - \mathit{lookup\_answer}_T[i, j]) \Big)$$

The training objective of the model is given by:

$$L = \frac{1}{N} \sum_{k=1}^{N} \Big( [n_k == \mathit{True}]\, L_{scalar}^{(k)} + [n_k == \mathit{False}]\, \lambda\, L_{lookup}^{(k)} \Big)$$

where $N$ is the number of training examples, $L_{scalar}^{(k)}$ and $L_{lookup}^{(k)}$ are the scalar and lookup loss on the $k$-th example, $n_k$ is a boolean random variable which is set to True when the $k$-th example’s answer is a scalar and set to False when the answer is a lookup, and $\lambda$ is a hyper-parameter of the model that allows the two loss functions to be weighted appropriately.
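The two losses and their combination can be written down directly from the definitions above. In this sketch the example tuples and hyper-parameter values are made up, and the small `eps` clamp that avoids `log(0)` is an added numerical safeguard that is not part of the stated objective.

```python
import math

def huber(pred, y, delta):
    """Huber loss on the absolute error a = |pred - y|."""
    a = abs(pred - y)
    return 0.5 * a * a if a <= delta else delta * a - 0.5 * delta * delta

def lookup_loss(pred, y, eps=1e-12):
    """Mean binary log-loss over all M*C table cells (eps clamp is an assumption)."""
    M, C = len(y), len(y[0])
    total = 0.0
    for i in range(M):
        for j in range(C):
            p = min(max(pred[i][j], eps), 1.0 - eps)
            total += y[i][j] * math.log(p) + (1 - y[i][j]) * math.log(1.0 - p)
    return -total / (M * C)

def objective(examples, delta, lam):
    """examples: list of (is_scalar, prediction, target) tuples."""
    total = 0.0
    for is_scalar, pred, y in examples:
        total += huber(pred, y, delta) if is_scalar else lam * lookup_loss(pred, y)
    return total / len(examples)

examples = [
    (True, 3.0, 1.0),         # scalar answer: prediction 3.0, target 1.0
    (False, [[0.5]], [[1]]),  # lookup answer on a 1 x 1 table
]
loss = objective(examples, delta=10.0, lam=50.0)
```

For small errors the Huber loss is quadratic and for large errors it grows linearly, which matches the stated motivation of avoiding both the instability of square loss and the kink of absolute loss.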

At inference time, we replace the three softmax layers in the model with the conventional maximum (hardmax) operation and the final output of the model is either $\mathit{scalar\_answer}_T$ or $\mathit{lookup\_answer}_T$, depending on whichever among them is updated after $T$ time steps. Algorithm 1 gives a high-level view of Neural Programmer during inference.

3 EXPERIMENTS

Neural Programmer is faced with many challenges, specifically: 1) can the model learn the param-

eters of the different modules with delayed supervision after T steps? 2) can it exhibit composi-

tionality by generalizing to unseen questions? and 3) can the question module handle the variability

and ambiguity of natural language? In our experiments, we mainly focus on answering the first two

questions using synthetic data. Our reason for using synthetic data is that it is easier to understand a

new model with a synthetic dataset. We can generate the data in a large quantity, whereas the biggest

real-world semantic parsing dataset we know of contains only about 14k training examples (Pasupat

& Liang, 2015), which is very small by neural network standards. In one of our experiments,

we introduce simple word-level variability to simulate one aspect of the difficulties in dealing with

natural language input.

3.1 DATA

Algorithm 1 High-level view of Neural Programmer during its inference stage for an input example.

Input: $\mathit{table} \in \mathbb{R}^{M \times C}$ and question

Initialize: $\mathit{scalar\_answer}_0 = 0$, $\mathit{lookup\_answer}_0 = 0^{M \times C}$, $\mathit{row\_select}_0 = 1^M$, history vector at time $t = 0$, $h_0 = 0^d$, and input to history RNN at time $t = 0$, $c_0 = 0^{2d}$

Preprocessing: Remove numbers from the question and store them in a list along with the words that appear to the left of them. The tokens in the input question are $\{w_1, w_2, \ldots, w_Q\}$.

Question Module: Run the question RNN on the preprocessed question to get the question representation $q$ and the list of hidden states $z_1, z_2, \ldots, z_Q$

Pivot numbers: $\mathit{pivot}_g$ and $\mathit{pivot}_l$ are computed using hidden states from the question RNN and operation representations $U$

for $t = 1, 2, \ldots, T$ do

    Compute history vector $h_t$ by passing input $c_t$ to the history RNN

    Operation selection using $q$, $h_t$ and operation representations $U$

    Data selection on table using $q$, $h_t$ and column representations $P$

    Update $\mathit{scalar\_answer}_t$, $\mathit{lookup\_answer}_t$ and $\mathit{row\_select}_t$ using the selected operation and column

    Compute input to the history RNN at time $t + 1$, $c_{t+1}$

end for

Output: $\mathit{scalar\_answer}_T$ or $\mathit{lookup\_answer}_T$ depending on whichever of the two is updated at step $T$

We generate question, table and answer triples using a synthetic grammar. Tables 4 and 5 (see Appendix) show examples of question templates from the synthetic grammar for single and multiple columns respectively. The elements in the table are uniformly randomly sampled from [-100, 100] and [-200, 200] during training and test time respectively. The number of rows is sampled randomly from [30, 100] in training while during prediction the number of rows is 120. Each question in the test set is unique, i.e., it is generated from a distinct template. We use the following settings:

Single Column: We first perform experiments with a single column that enables 23 different ques-

tion templates which can be answered using 4 time steps.

Many Columns: We increase the difficulty by experimenting with multiple columns (max_columns = 3, 5 or 10). During training, the number of columns is randomly sampled from (1, max_columns) and at test time every question has the maximum number of columns used during training.

Variability: To simulate one aspect of the difficulties in dealing with natural language input, we

consider multiple ways to refer to the same operation (Tables 6 and 7).

Text Match: Now we consider cases where some columns in the input table contain text entries.

We use a small vocabulary of 10 words and fill the column by uniformly randomly sampling from

them. In our first experiment with text entries, the table always contains two columns, one with text

and other with numeric entries (Table 8). In the next experiment, each example can have up to 3

columns containing numeric entries and up to 2 columns containing text entries during training. At

test time, all the examples contain 3 columns with numeric entries and 2 columns with text entries.

3.2 MODELS

In the following, we benchmark the performance of Neural Programmer on various versions of the

table-comprehension dataset. We slowly increase the difficulty of the task by changing the table

properties (more columns, mixed numeric and text entries) and question properties (word variabil-

ity). After that we discuss a comparison between Neural Programmer, LSTM, and LSTM with

Attention.

We use 4 time steps in our experiments (T = 4). Neural Programmer is trained with mini-batch

stochastic gradient descent with Adam optimizer (Kingma & Ba, 2014). The parameters are ini-

tialized uniformly randomly within the range [-0.1, 0.1]. In all experiments, we set the mini-batch

size to 50, dimensionality d to 256, the initial learning rate and the momentum hyper-parameters

of Adam to their default values (Kingma & Ba, 2014). We found that it is extremely useful to add

random Gaussian noise to our gradients at every training step. This acts as a regularizer to the model

and allows it to actively explore more programs. We use a schedule inspired by Welling & Teh

(2011), where at every step we sample a Gaussian of 0 mean and variance $= \mathit{curr\_step}^{-0.55}$.

To prevent exploding gradients, we perform gradient clipping by scaling the gradient when the norm exceeds a threshold (Graves, 2013). The threshold value is picked from [1, 5, 50]. We tune the $\epsilon$ hyper-parameter in Adam from [1e-6, 1e-8], the Huber constant $\delta$ from [10, 25, 50] and $\lambda$ (weight between two losses) from [25, 50, 75, 100] using grid search. While performing experiments with multiple random restarts we find that the performance of the model is stable with respect to $\epsilon$ and the gradient clipping threshold, but we have to tune $\delta$ and $\lambda$ for the different random seeds.
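The noise schedule and norm-based clipping can be sketched as below. The function names are hypothetical, the step counter is assumed to start at 1, and applying noise before clipping is an assumption, since the order is not stated in the text.

```python
import math
import random

def noise_variance(step):
    """Variance of the gradient noise at a given training step (step >= 1)."""
    return step ** -0.55

def add_gradient_noise(grad, step, rng):
    """Add zero-mean Gaussian noise with variance step**-0.55 to each entry."""
    std = math.sqrt(noise_variance(step))
    return [g + rng.gauss(0.0, std) for g in grad]

def clip_by_norm(grad, threshold):
    """Rescale the gradient so that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

rng = random.Random(0)
grad = [3.0, 4.0]  # toy gradient
noisy = add_gradient_noise(grad, step=100, rng=rng)
clipped = clip_by_norm(noisy, threshold=5.0)
```

The variance decays slowly with the step count, so exploration is strongest early in training and fades as optimization proceeds.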

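| Type | No. of Test Question Templates | Accuracy (%) | % seen test |
|---|---|---|---|
| Single Column | 23 | 100.0 | 100 |
| 3 Columns | 307 | 99.02 | 100 |
| 5 Columns | 1231 | 99.11 | 98.62 |
| 10 Columns | 7900 | 99.13 | 62.44 |
| Word Variability on 1 Column | 1368 | 96.49 | 100 |
| Word Variability on 5 Columns | 24000 | 88.99 | 31.31 |
| Text Match on 2 Columns | 1125 | 99.11 | 97.42 |
| Text Match on 5 Columns | 14600 | 98.03 | 31.02 |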

Table 2: Summary of the performance of Neural Programmer on various versions of the synthetic

table-comprehension task. The prediction of the model is considered correct if it is equal to the

correct answer up to the first decimal place. The last column indicates the percentage of question

templates in the test set that are observed during training. The unseen question templates generate

questions containing sequences of words that the model has never seen before. The model can

generalize to unseen question templates which is evident in the 10-columns, word variability on

5-columns and text match on 5 columns experiments. This indicates that Neural Programmer is

a powerful compositional model since solving unseen question templates requires performing a

sequence of actions that it has never done during training.

The training set consists of 50,000 triples in all our experiments. Table 2 shows the performance

of Neural Programmer on synthetic data experiments. In single column experiments, the model

answers all questions correctly which we manually verify by inspecting the programs induced by

the model. In many columns experiments with 5 columns, we use a bidirectional RNN and for 10

columns we additionally perform attention (Bahdanau et al., 2014) on the question at every time step

using the history vector. The model is able to generalize to unseen question templates which are a

considerable fraction in our ten columns experiment. This can also be seen in the word variability

experiment with 5 columns and text match experiment with 5 columns where more than two-thirds

of the test set contains question templates that are unseen during training. This indicates that Neural

Programmer is a powerful compositional model since solving unseen question templates requires

inducing programs that do not appear during training. Almost all the errors made by the model were

on questions that require the difference operation to be used. Table 3 shows examples of how the

model selects the operation and column at every time step for three test questions.

Figure 8 shows an example of the effect of adding random noise to the gradients in our experiment

with 5 columns.

| Question | t | Op | Column | pivot_g | pivot_l | Row select |
|---|---|---|---|---|---|---|
| greater 50.32 C and lesser 20.21 E sum H. What is the sum of numbers in column H whose field in column C is greater than 50.32 and field in Column E is lesser than 20.21. | 1 | Greater | C | 50.32 | 20.21 | $g_1$ |
| | 2 | Lesser | E | | | $l_2$ |
| | 3 | And | - | | | $\mathit{and}_3$ |
| | 4 | Sum | H | | | $[0]^M$ |
| lesser -80.97 D or greater 12.57 B print F. Print elements in column F whose field in column D is lesser than -80.97 or field in Column B is greater than 12.57. | 1 | Lesser | D | 12.57 | -80.97 | $l_1$ |
| | 2 | Greater | B | | | $g_2$ |
| | 3 | Or | - | | | $\mathit{or}_3$ |
| | 4 | Assign | F | | | $[0]^M$ |
| sum A diff count. What is the difference between sum of elements in column A and number of rows | 1 | Sum | A | -1 | -1 | $[0]^M$ |
| | 2 | Reset | - | | | $[1]^M$ |
| | 3 | Count | - | | | $[0]^M$ |
| | 4 | Diff | - | | | $[0]^M$ |

Table 3: Example outputs from the model for T = 4 time steps on three questions in the test set. We show the synthetically generated question along with its natural language translation. For each question, the model takes 4 steps and at each step selects an operation and a column. The pivot numbers for the comparison operations are computed before performing the 4 steps. We show the selected columns in cases during which the selected operation acts on a particular column.

[Figure 8 plots. Left panel: “Train Loss: Noise Vs. No Noise” (y-axis: Train Loss; x-axis: No. of epochs, 0 to 300). Right panel: “Test Accuracy: Noise Vs. No Noise” (y-axis: Test Accuracy, 0 to 100; x-axis: No. of epochs, 0 to 300). Each panel has one curve for noise and one for no noise.]

Figure 8: The effect of adding random noise to the gradients versus not adding it in our experiment with 5 columns when all hyper-parameters are the same. The models trained with noise almost always generalize better.

We apply a three-layer sequence-to-sequence LSTM recurrent network model (Hochreiter & Schmidhuber, 1997; Sutskever et al., 2014) and an LSTM model with attention (Bahdanau et al., 2014). We explore multiple attention heads (1, 5, 10) and try two cases, placing the input table before and after the question. We consider a simpler version of the single column dataset with only questions that have scalar answers. The number of elements in the column is uniformly randomly sampled from [4, 7] while the elements are sampled from [−10, 10]. The best accuracy using these models is close to 80% in spite of relatively easier questions and supplying fresh training examples at every step. When the scale of the input numbers is changed to [−50, 50] at test time, the accuracy drops to 30%.

Neural Programmer solves this task and achieves 100% accuracy using 50,000 training examples. Since the hardmax operation is used at test time, the answers (or the programs induced) from Neural Programmer are invariant to the scale of numbers and the length of the input.

4 RELATED WORK

Program induction has been studied in the context of semantic parsing (Zelle & Mooney, 1996;

Zettlemoyer & Collins, 2005; Liang et al., 2011) in natural language processing. Pasupat & Liang

(2015) develop a semantic parser with a hand engineered grammar for question answering on tables

with natural language questions. Methods such as Piantadosi et al. (2008); Eisenstein et al. (2009);

Clarke et al. (2010) learn a compositional semantic model without hand engineered compositional

grammar, but still require a hand-labeled lexical mapping of words to the operations. Poon (2013)

develops an unsupervised method for semantic parsing, which requires many pre-processing steps

including dependency parsing and mapping from words to operations. Liang et al. (2010) propose

a hierarchical Bayesian approach to learn simple programs.

There has been some early work in using neural networks for learning context free grammar (Das

et al., 1992a;b; Zeng et al., 1994) and context sensitive grammar (Steijvers, 1996; Gers & Schmid-

huber, 2001) for small problems. Neelakantan et al. (2015); Lin et al. (2015) learn simple Horn

clauses in a large knowledge base using RNNs. Neural networks have also been used for Q&A on

datasets that do not require complicated arithmetic and logic reasoning (Bordes et al., 2014; Iyyer

et al., 2014; Sukhbaatar et al., 2015; Peng et al., 2015; Hermann et al., 2015). While there has been a

lot of work in augmenting neural networks with additional memory (Das et al., 1992a; Schmidhu-

ber, 1993; Hochreiter & Schmidhuber, 1997; Graves et al., 2014; Weston et al., 2015; Kumar et al.,

2015; Joulin & Mikolov, 2015), we are not aware of any other work that augments a neural network

with a set of operations to enhance complex reasoning capabilities.

After our work was submitted to arXiv, Neural Programmer-Interpreters (Reed & Freitas, 2016), a

method that learns to induce programs with supervision of the entire program, was proposed. This

was followed by Neural Enquirer (Yin et al., 2015), which, similar to our work, tackles the problem of

synthetic table QA. However, their method achieves perfect accuracy only when given supervision

of the entire program. Later, the dynamic neural module network (Andreas et al., 2016) was proposed

for question answering; it uses syntactic supervision in the form of dependency trees.

5 CONCLUSIONS

We develop Neural Programmer, a neural network model augmented with a small set of arithmetic

and logic operations to perform complex arithmetic and logic reasoning. The model can be trained in

an end-to-end fashion using backpropagation to induce programs requiring much less sophisticated

human supervision than prior work. It is a general model for program induction broadly applicable

across different domains, data sources and languages. Our experiments indicate that the model is

capable of learning with delayed supervision and exhibits powerful compositionality.

Acknowledgements We sincerely thank Greg Corrado, Andrew Dai, Jeff Dean, Shixiang Gu,

Andrew McCallum, and Luke Vilnis for their suggestions and the Google Brain team for the support.

REFERENCES

Andreas, Jacob, Rohrbach, Marcus, Darrell, Trevor, and Klein, Dan. Learning to compose neural

networks for question answering. ArXiv, 2016.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly

learning to align and translate. ICLR, 2014.

Bahdanau, Dzmitry, Chorowski, Jan, Serdyuk, Dmitriy, Brakel, Philemon, and Bengio,

Yoshua. End-to-end attention-based large vocabulary speech recognition. arXiv preprint

arXiv:1508.04395, 2015.

Bordes, Antoine, Chopra, Sumit, and Weston, Jason. Question answering with subgraph embed-

dings. In EMNLP, 2014.

Cantlon, Jessica F., Brannon, Elizabeth M., Carter, Elizabeth J., and Pelphrey, Kevin A. Functional

imaging of numerical processing in adults and 4-y-old children. PLoS Biology, 2006.

Chan, William, Jaitly, Navdeep, Le, Quoc V., and Vinyals, Oriol. Listen, attend and spell. arXiv

preprint arXiv:1508.01211, 2015.

Clarke, James, Goldwasser, Dan, Chang, Ming-Wei, and Roth, Dan. Driving semantic parsing from

the world’s response. In CoNLL, 2010.

Das, Sreerupa, Giles, C. Lee, and Sun, Guo-Zheng. Learning context-free grammars: Capabilities

and limitations of a recurrent neural network with an external stack memory. In CogSci, 1992a.

Das, Sreerupa, Giles, C. Lee, and Sun, Guo-Zheng. Using prior knowledge in an NNPDA to learn

context-free languages. In NIPS, 1992b.

Dastjerdi, Mohammad, Ozker, Muge, Foster, Brett L, Rangarajan, Vinitha, and Parvizi, Josef.

Numerical processing in the human parietal cortex during experimental and natural conditions.

Nature Communications, 4, 2013.

Eisenstein, Jacob, Clarke, James, Goldwasser, Dan, and Roth, Dan. Reading to learn: Constructing

features from semantic abstracts. In EMNLP, 2009.

Fias, Wim, Lammertyn, Jan, Caessens, Bernie, and Orban, Guy A. Processing of abstract ordinal

knowledge in the horizontal segment of the intraparietal sulcus. The Journal of Neuroscience,

2007.

Gers, Felix A. and Schmidhuber, Jürgen. LSTM recurrent networks learn simple context free and

context sensitive languages. IEEE Transactions on Neural Networks, 2001.

Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint

arXiv:1308.0850, 2013.

Graves, Alex and Jaitly, Navdeep. Towards end-to-end speech recognition with recurrent neural

networks. In ICML, 2014.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing Machines. arXiv preprint

arXiv:1410.5401, 2014.

Hannun, Awni Y., Case, Carl, Casper, Jared, Catanzaro, Bryan C., Diamos, Greg, Elsen, Erich,

Prenger, Ryan, Satheesh, Sanjeev, Sengupta, Shubho, Coates, Adam, and Ng, Andrew Y. Deep

Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

Hermann, Karl Moritz, Kociský, Tomás, Grefenstette, Edward, Espeholt, Lasse, Kay, Will, Suley-

man, Mustafa, and Blunsom, Phil. Teaching machines to read and comprehend. NIPS, 2015.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep,

Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep

neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 1997.

Huber, Peter. Robust estimation of a location parameter. In The Annals of Mathematical Statistics,

1964.

Iyyer, Mohit, Boyd-Graber, Jordan L., Claudino, Leonardo Max Batista, Socher, Richard, and

Daumé III, Hal. A neural network for factoid question answering over paragraphs. In EMNLP, 2014.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent

nets. NIPS, 2015.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. ICLR, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep con-

volutional neural networks. In NIPS, 2012.

Kucian, Karin, Loenneker, Thomas, Dietrich, Thomas, Dosch, Mengia, Martin, Ernst, and

Von Aster, Michael. Impaired neural networks for approximate calculation in dyscalculic chil-

dren: a functional MRI study. Behavioral and Brain Functions, 2006.

Kumar, Ankit, Irsoy, Ozan, Su, Jonathan, Bradbury, James, English, Robert, Pierce, Brian, On-

druska, Peter, Gulrajani, Ishaan, and Socher, Richard. Ask me anything: Dynamic memory net-

works for natural language processing. ArXiv, 2015.

Liang, Percy, Jordan, Michael I., and Klein, Dan. Learning programs: A hierarchical Bayesian

approach. In ICML, 2010.

Liang, Percy, Jordan, Michael I., and Klein, Dan. Learning dependency-based compositional se-

mantics. In ACL, 2011.

Lin, Yankai, Liu, Zhiyuan, Luan, Huan-Bo, Sun, Maosong, Rao, Siwei, and Liu, Song. Modeling

relation paths for representation learning of knowledge bases. In EMNLP, 2015.

Luong, Thang, Sutskever, Ilya, Le, Quoc V., Vinyals, Oriol, and Zaremba, Wojciech. Addressing

the rare word problem in neural machine translation. ACL, 2014.

Neelakantan, Arvind, Roth, Benjamin, and McCallum, Andrew. Compositional vector space models

for knowledge base completion. In ACL, 2015.

Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V., Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol,

and Martens, James. Adding gradient noise improves learning for very deep networks. ICLR

Workshop, 2016.

Pasupat, Panupong and Liang, Percy. Compositional semantic parsing on semi-structured tables. In

ACL, 2015.

Peng, Baolin, Lu, Zhengdong, Li, Hang, and Wong, Kam-Fai. Towards neural network-based reasoning.

arXiv preprint arXiv:1508.05508, 2015.

Piantadosi, Steven T., Goodman, N.D., Ellis, B.A., and Tenenbaum, J.B. A Bayesian model of the

acquisition of compositional semantics. In CogSci, 2008.

Piazza, Manuela, Izard, Veronique, Pinel, Philippe, Le Bihan, Denis, and Dehaene, Stanislas. Tuning

curves for approximate numerosity in the human intraparietal sulcus. Neuron, 2004.

Poon, Hoifung. Grounded unsupervised semantic parsing. In ACL, 2013.

Reed, Scott and de Freitas, Nando. Neural programmer-interpreters. ICLR, 2016.

Schmidhuber, J. A self-referential weight matrix. In ICANN, 1993.

Shang, Lifeng, Lu, Zhengdong, and Li, Hang. Neural responding machine for short-text conversa-

tion. arXiv preprint arXiv:1503.02364, 2015.

Steijvers, Mark. A recurrent network that performs a context-sensitive prediction task. In CogSci,

1996.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory net-

works. arXiv preprint arXiv:1503.08895, 2015.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural net-

works. In NIPS, 2014.

Vinyals, Oriol and Le, Quoc V. A neural conversational model. ICML DL Workshop, 2015.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural

image caption generator. In CVPR, 2015.

Von Neumann, John. First draft of a report on the EDVAC. Technical report, 1945.

Wang, Yushi, Berant, Jonathan, and Liang, Percy. Building a semantic parser overnight. In ACL,

2015.

Welling, Max and Teh, Yee Whye. Bayesian learning via stochastic gradient Langevin dynamics. In

ICML, 2011.

Werbos, P. Backpropagation through time: what does it do and how to do it. In Proceedings of

IEEE, 1990.

Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory Networks. 2015.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron C., Salakhutdinov, Ruslan,

Zemel, Richard S., and Bengio, Yoshua. Show, attend and tell: Neural image caption generation

with visual attention. In ICML, 2015.

Yin, Pengcheng, Lu, Zhengdong, Li, Hang, and Kao, Ben. Neural enquirer: Learning to query tables

with natural language. ArXiv, 2015.

Zelle, John M. and Mooney, Raymond J. Learning to parse database queries using inductive logic

programming. In AAAI/IAAI, 1996.

Zeng, Z., Goodman, R., and Smyth, P. Discrete recurrent neural networks for grammatical inference.

IEEE Transactions on Neural Networks, 1994.

Zettlemoyer, Luke S. and Collins, Michael. Learning to map sentences to logical form: Structured

classification with probabilistic categorial grammars. In UAI, 2005.

APPENDIX

sum

count

print

greater [number] sum

lesser [number] sum

greater [number] count

lesser [number] count

greater [number] print

lesser [number] print

greater [number1] and lesser [number2] sum

lesser [number1] and greater [number2] sum

greater [number1] or lesser [number2] sum

lesser [number1] or greater [number2] sum

greater [number1] and lesser [number2] count

lesser [number1] and greater [number2] count

greater [number1] or lesser [number2] count

lesser [number1] or greater [number2] count

greater [number1] and lesser [number2] print

lesser [number1] and greater [number2] print

greater [number1] or lesser [number2] print

lesser [number1] or greater [number2] print

sum diff count

count diff sum

Table 4: 23 question templates for single column experiment. We have four categories of questions:

1) simple aggregation (sum, count) 2) comparison (greater, lesser) 3) logic (and, or) and, 4) arith-

metic (diff). We first sample the categories uniformly randomly and each program within a category

is equally likely. In the word variability experiment with 5 columns we sampled from the set of

all programs uniformly randomly since greater than 90% of the test questions were unseen during

training using the other procedure.
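
The two sampling procedures described in the Table 4 caption can be sketched as follows. This is a minimal illustration, not the generator used in the experiments: the grouping of templates into categories shown here covers only a few templates from Table 4 and is an assumption, as are the names `TEMPLATES`, `sample_by_category` and `sample_uniform_over_programs`.

```python
import random

# Hypothetical grouping of a few Table 4 templates into the four categories;
# the exact category of every template is an assumption made for illustration.
TEMPLATES = {
    "aggregation": ["sum", "count"],
    "comparison": ["greater [number] sum", "lesser [number] count"],
    "logic": ["greater [number1] and lesser [number2] sum",
              "greater [number1] or lesser [number2] count"],
    "arithmetic": ["sum diff count", "count diff sum"],
}

def sample_by_category(rng):
    """Pick a category uniformly, then a template uniformly within it."""
    category = rng.choice(sorted(TEMPLATES))
    return category, rng.choice(TEMPLATES[category])

def sample_uniform_over_programs(rng):
    """Pick uniformly from the flat set of all templates."""
    flat = [t for ts in TEMPLATES.values() for t in ts]
    return rng.choice(flat)

rng = random.Random(0)
category, template = sample_by_category(rng)
```

Under the two-stage procedure, a template in a small category is drawn more often than a template in a large one, whereas the flat procedure gives every template the same probability.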

greater [number1] B and lesser [number2] B sum B

greater [number1] A and lesser [number2] A sum B

greater [number1] A and lesser [number2] B sum A

greater [number1] B and lesser [number2] A sum A

greater [number1] A and lesser [number2] B sum B

greater [number1] B and lesser [number2] B sum A

greater [number1] B and lesser [number2] B sum A

Table 5: 8 question templates of type “greater [number1] and lesser [number2] sum” when there are

2 columns.

count: count, count of, how many

greater: greater, greater than, bigger, bigger than, larger, larger than

lesser: lesser, lesser than, smaller, smaller than, under

assign: print, display, show

difference: difference, difference between

greater [number] total

greater [number] total of

greater [number] sum of

greater than [number] sum

greater than [number] total

greater than [number] total of

greater than [number] sum of

bigger [number] sum

bigger [number] total

bigger [number] total of

bigger [number] sum of

bigger than [number] sum

bigger than [number] total

bigger than [number] total of

bigger than [number] sum of

larger [number] sum

larger [number] total

larger [number] total of

larger [number] sum of

larger than [number] sum

larger than [number] total

larger than [number] total of

larger than [number] sum of

Table 7: 24 question templates for questions of type “greater [number] sum” in the single column

word variability experiment.

word:0 A sum B

word:1 A sum B

word:2 A sum B

word:3 A sum B

word:4 A sum B

word:5 A sum B

word:6 A sum B

word:7 A sum B

word:8 A sum B

word:9 A sum B

Table 8: 10 question templates for questions of type “[word] A sum B” in the two columns text

match experiment.

Published as a conference paper at ICLR 2016

Scott Reed & Nando de Freitas

Google DeepMind

London, UK

scott.ellison.reed@gmail.com

nandodefreitas@google.com

ABSTRACT

arXiv:1511.06279v4 [cs.LG] 29 Feb 2016

tional neural network that learns to represent and execute programs. NPI has three

learnable components: a task-agnostic recurrent core, a persistent key-value pro-

gram memory, and domain-specific encoders that enable a single NPI to operate in

multiple perceptually diverse environments with distinct affordances. By learning

to compose lower-level programs to express higher-level programs, NPI reduces

sample complexity and increases generalization ability compared to sequence-to-

sequence LSTMs. The program memory allows efficient learning of additional

tasks by building on existing programs. NPI can also harness the environment

(e.g. a scratch pad with read-write pointers) to cache intermediate results of com-

putation, lessening the long-term memory burden on recurrent hidden units. In

this work we train the NPI with fully-supervised execution traces; each program

has example sequences of calls to the immediate subprograms conditioned on the

input. Rather than training on a huge number of relatively weak labels, NPI learns

from a small number of rich examples. We demonstrate the capability of our

model to learn several types of compositional programs: addition, sorting, and

canonicalizing 3D models. Furthermore, a single NPI learns to execute these pro-

grams and all 21 associated subprograms.

1 INTRODUCTION

Teaching machines to learn new programs, to rapidly compose new programs from existing pro-

grams, and to conditionally execute these programs automatically so as to solve a wide variety of

tasks is one of the central challenges of AI. Programs appear in many guises in various AI problems,

including motor behaviours, image transformations, reinforcement learning policies, classical

algorithms, and symbolic relations.

In this paper, we develop a compositional architecture that learns to represent and interpret pro-

grams. We refer to this architecture as the Neural Programmer-Interpreter (NPI). The core module

is an LSTM-based sequence model that takes as input a learnable program embedding, program

arguments passed on by the calling program, and a feature representation of the environment. The

output of the core module is a key indicating what program to call next, arguments for the following

program and a flag indicating whether the program should terminate. In addition to the recurrent

core, the NPI architecture includes a learnable key-value memory of program embeddings. This

program-memory is essential for learning and re-using programs in a continual manner. Figures 1

and 2 illustrate the NPI on two different tasks.

We show in our experiments that the NPI architecture can learn 21 programs, including addition,

sorting, and trajectory planning from image pixels. Crucially, this can be achieved using a single

core model with the same parameters shared across all tasks. Different environments (for example

images, text, and scratch-pads) may require specific perception modules or encoders to produce the

features used by the shared core, as well as environment-specific actuators. Both perception modules

and actuators can be learned from data when training the NPI architecture.

To train the NPI we use curriculum learning and supervision via example execution traces. Each

program has example sequences of calls to the immediate subprograms conditioned on the input.

[Figure 1 diagram: unrolled NPI core steps (INPUT, hidden state h, outputs KEY, END, ARG) with the program memory (M^key, M^prog) for the programs GOTO, HGOTO, LGOTO, VGOTO, DGOTO and ACT. Execution trace shown: GOTO() HGOTO() LGOTO() ACT(LEFT) LGOTO() ACT(LEFT) GOTO() VGOTO() DGOTO() ACT(DOWN) end state]

Figure 1: Example execution of canonicalizing 3D car models. The task is to move the camera such

that a target angle and elevation are reached. There is a read-only scratch pad containing the target

(angle 1, elevation 2 here). The image encoder is a convnet trained from scratch on pixels.

[Figure 2 diagram: unrolled NPI core steps (INPUT, hidden state h, outputs KEY, END, ARG) with the program memory (M^key, M^prog) for the programs ADD1, CARRY and ACT. Execution trace shown: ADD1() ACT (4,2,WRITE) ADD1() CARRY() ACT (3,LEFT) CARRY() ACT (3,1,WRITE)]

Figure 2: Example execution trace of single-digit addition. The task is to perform a single-digit add on

the numbers at pointer locations in the first two rows. The carry (row 3) and output (row 4) should be

updated to reflect the addition. At each time step, an observation of the environment (viewed from

each pointer on a scratch pad) is encoded into a fixed-length vector.

By using neural networks to represent the subprograms and learning these from data, the approach

can generalize on tasks involving rich perceptual inputs and uncertainty.

We may envision two approaches to provide supervision. In one, we provide a very large number

of labeled examples, as in object recognition, speech and machine translation. In the other, the

approach followed in this paper, the aim is to provide far fewer labeled examples, but where

the labels contain richer information allowing the model to learn compositional structure. While

unsupervised and reinforcement learning play important roles in perception and motor control, other

cognitive abilities are possible thanks to rich supervision and curriculum learning. This is indeed

the reason for sending our children to school.

An advantage of our approach to model building and training is that the learned programs exhibit

strong generalization. Specifically, when trained to sort sequences of up to twenty numbers in

length, they can sort much longer sequences at test time. In contrast, the experiments will show that

more standard sequence-to-sequence LSTMs only exhibit weak generalization; see Figure 6.

A trained NPI with fixed parameters and a learned library of programs can act both as an interpreter

and as a programmer. As an interpreter, it takes input in the form of a program embedding and input

data and subsequently executes the program. As a programmer, it uses samples drawn from a new

task to generate a new program embedding that can be added to its library of programs.

2 RELATED WORK

Several ideas related to our approach have a long history. For example, the idea of using dynam-

ically programmable networks in which the activations of one network become the weights (the

program) of a second network was mentioned in the Sigma-Pi units section of the influential PDP

paper (Rumelhart et al., 1986). This idea appeared in (Sutskever & Hinton, 2009) in the context of

learning higher order symbolic relations and in (Donnarumma et al., 2015) as the key ingredient of an

architecture for prefrontal cognitive control. Schmidhuber (1992) proposed a related meta-learning

idea, whereby one learns the parameters of a slowly changing network, which in turn generates

context dependent weight changes for a second rapidly changing network. These approaches have

only been demonstrated in very limited settings. In cognitive science, several theories of brain areas

controlling other brain parts so as to carry out multiple tasks have been proposed; see for example

Schneider & Chein (2003); Anderson (2010) and Donnarumma et al. (2012).

Related problems have been studied in the literature on hierarchical reinforcement learning (e.g.,

Dietterich (2000); Andre & Russell (2001); Sutton et al. (1999) and Schaul et al. (2015)), imitation

and apprenticeship learning (e.g., Kolter et al. (2008) and Rothkopf & Ballard (2013)) and elicita-

tion of options through human interaction (Subramanian et al., 2011). These ideas have held great

promise, but have not enjoyed significant impact. We believe the recurrent compositional neural

representations proposed in this paper could help these approaches in the future, and in particular in

overcoming feature engineering.

Several recent advancements have extended recurrent networks to solve problems beyond simple

sequence prediction. Graves et al. (2014) developed a neural Turing machine capable of learning

and executing simple programs such as repeat copying, simple priority sorting and associative recall.

Vinyals et al. (2015) developed Pointer Networks that generalize the notion of encoder attention in

order to provide the decoder a variable-sized output space depending on the input sequence length.

This model was shown to be effective for combinatorial optimization problems such as the traveling

salesman and Delaunay triangulation. While our proposed model is trained on execution traces in-

stead of input and output pairs, in exchange for this richer supervision we benefit from compositional

program structure, improving data efficiency on several problems.

This work is also closely related to program induction. Most previous work on program induc-

tion, i.e. inducing a program given example input and output pairs, has used genetic program-

ming (Banzhaf et al., 1998) to evolve useful programs from candidate populations. Mou et al.

(2014) process program symbols to learn max-margin program embeddings with the help of parse

trees. Zaremba & Sutskever (2014) trained LSTM models to read in the text of simple programs

character-by-character and correctly predict the program output. Joulin & Mikolov (2015) aug-

mented a recurrent network with a pushdown stack, allowing for generalization to longer input

sequences than seen during training for several algorithmic patterns.

Contemporaneously with this work, several papers have also studied program induction with variants of

recurrent neural networks (Zaremba & Sutskever, 2015; Zaremba et al., 2015; Kaiser & Sutskever,

2015; Kurach et al., 2015; Neelakantan et al., 2015). While we share a similar motivation, our

approach is distinct in that we explicitly incorporate compositional structure into the network using

a program memory, allowing the model to learn new programs by combining sub-programs.

3 MODEL

The NPI core is a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997)

that acts as a router between programs conditioned on the current state observation and previous

hidden unit states. At each time step, the core module can select another program to invoke using

content-based addressing. It emits the probability of ending the current program with a single binary

unit. If this probability is over a threshold (we used 0.5), control is returned to the caller by popping

the caller’s LSTM hidden units and program embedding off of a program call stack and resuming

execution in this context.

The NPI may also optionally write arguments (ARG) that are passed by reference or value to the

invoked sub-programs. For example, an argument could indicate a specific location in the input

sequence (by reference), or it could specify a number to write down at a particular location in the

sequence (by value). The subsequent state consists of these arguments and observations of the

environment. The approach is illustrated in Figures 1 and 2.

It must be emphasized that there is a single inference core. That is, all the LSTM instantiations

executing arbitrary programs share the same parameters. Different programs correspond to program

embeddings, which are stored in a learnable persistent memory. The programs therefore have a more

succinct representation than neural programs encoded as the full set of weights in a neural network

(Rumelhart et al., 1986; Graves et al., 2014).

The output of an NPI, conditioned on an input state and a program to run, is a sequence of actions

in a given environment. In this work, we consider several environments: a 1-D array with read-only

pointers and a swap action, a 2-D scratch pad with read-write pointers, and a CAD renderer with

controllable elevation and azimuth movements. Note that the sequence of actions for a program is

not fixed, but dependent also on the input state.

3.1 INFERENCE

Denote the environment observation at time $t$ as $e_t \in \mathcal{E}$, and the current program arguments as

$a_t \in \mathcal{A}$. The form of $e_t$ can vary dramatically by environment; for example it could be a color

image or an array of numbers. The program arguments $a_t$ can also vary by environment, but in

the experiments for this paper we always used a 3-tuple of integers $(a_t(1), a_t(2), a_t(3))$. Given

the environment and arguments at time $t$, a fixed-length state encoding $s_t \in \mathbb{R}^D$ is extracted by a

domain-specific encoder $f_{enc} : \mathcal{E} \times \mathcal{A} \to \mathbb{R}^D$. In section 4 we provide examples of several encoders.

Note that a single NPI network can have multiple encoders for multiple environments, and encoders

can potentially also be shared across tasks.

We denote the current program embedding as $p_t \in \mathbb{R}^P$. The previous hidden unit and cell states

are $h_{t-1}^{(l)} \in \mathbb{R}^M$ and $c_{t-1}^{(l)} \in \mathbb{R}^M$, $l = 1, \ldots, L$, where $L$ is the number of layers in the LSTM.

The program and state vectors are then propagated forward through an LSTM mapping $f_{lstm}$ as in

(Sutskever et al., 2014). How to fuse $p_t$ and $s_t$ within $f_{lstm}$ is an implementation detail, but in this

work we concatenate and feed through a 2-layer MLP with rectified linear (ReLU) hidden activation

and linear decoder.

From the top LSTM hidden state $h_t^L$, several decoders generate the outputs. The probability of

finishing the program and returning to the caller¹ is computed by $f_{end} : \mathbb{R}^M \to [0, 1]$. The lookup

key embedding used for retrieving the next program from memory is computed by $f_{prog} : \mathbb{R}^M \to \mathbb{R}^K$.

Note that $\mathbb{R}^K$ can be much smaller than $\mathbb{R}^P$ because the key only need act as the identifier

of a program, while the program embedding must have enough capacity to conditionally generate a

sequence of actions. The contents of the arguments to the next program to be called are generated

by $f_{arg} : \mathbb{R}^M \to \mathcal{A}$. The feed-forward steps of program inference are summarized below:

$$s_t = f_{enc}(e_t, a_t) \tag{1}$$

$$h_t = f_{lstm}(s_t, p_t, h_{t-1}) \tag{2}$$

$$r_t = f_{end}(h_t), \quad k_t = f_{prog}(h_t), \quad a_{t+1} = f_{arg}(h_t) \tag{3}$$

where $r_t$, $k_t$ and $a_{t+1}$ correspond to the end-of-program probability, program key embedding, and

output arguments at time $t$, respectively. These yield input arguments at time $t + 1$. To simplify the

notation, we have abstracted properties such as layers and cell memory in the sequence-to-sequence

LSTM of equation (2); see (Sutskever et al., 2014) for details.
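
The data flow of equations (1) to (3) can be traced with a small sketch. It is only an illustration under stated assumptions: a single tanh recurrence stands in for the multi-layer LSTM, the encoder simply concatenates a one-element observation with the three arguments, the decoders are random linear maps, and the output arguments are real-valued rather than integers. The names `ToyCore`, `f_enc`, `linear`, `rand_matrix` and all dimensions are hypothetical.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class ToyCore:
    """Toy stand-in for the NPI core; sizes and weights are made up."""

    def __init__(self, D=4, P=3, M=5, K=2, A=3, seed=0):
        rng = random.Random(seed)
        self.W_in = rand_matrix(M, D + P + M, rng)  # recurrence (not an LSTM)
        self.W_end = rand_matrix(1, M, rng)         # plays the role of f_end
        self.W_key = rand_matrix(K, M, rng)         # plays the role of f_prog
        self.W_arg = rand_matrix(A, M, rng)         # plays the role of f_arg

    def step(self, s_t, p_t, h_prev):
        # Eq. (2): h_t = f_lstm(s_t, p_t, h_{t-1}), here a tanh recurrence.
        h_t = [math.tanh(v) for v in linear(self.W_in, s_t + p_t + h_prev)]
        # Eq. (3): end probability, program key, next arguments.
        r_t = 1.0 / (1.0 + math.exp(-linear(self.W_end, h_t)[0]))
        k_t = linear(self.W_key, h_t)
        a_next = linear(self.W_arg, h_t)
        return h_t, r_t, k_t, a_next

def f_enc(e_t, a_t):
    """Eq. (1): hypothetical encoder that concatenates observation and arguments."""
    return [float(v) for v in list(e_t) + list(a_t)]

core = ToyCore()
s = f_enc([0.5], [1, 0, 2])                      # state encoding of size D = 4
h, r, k, a_next = core.step(s, [0.1, 0.2, 0.3], [0.0] * 5)
```

The key `k` has dimension K = 2 while the program embedding has dimension P = 3, mirroring the remark that the key space can be smaller than the embedding space.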

The NPI representation is equipped with key-value memory structures $M^{key} \in \mathbb{R}^{N \times K}$ and

$M^{prog} \in \mathbb{R}^{N \times P}$ storing program keys and program embeddings, respectively, where $N$ is the

current number of programs in memory. We can add more programs by adding rows to memory.

During training, the next program identifier is provided to the model as ground-truth, so that its

embedding can be retrieved from the corresponding row of $M^{prog}$. At test time, we compute the

“program ID” by comparing the key embedding $k_t$ to each row of $M^{key}$ storing all program keys.

Then the program embedding is retrieved from $M^{prog}$ as follows:

$$i^* = \arg\max_{i=1..N} (M^{key}_{i,:})^T k_t, \quad p_{t+1} = M^{prog}_{i^*,:} \tag{4}$$
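
The lookup in equation (4) amounts to a dot product of the key with every row of the key memory followed by an arg max. A minimal sketch, assuming the memories are stored as plain lists of rows and using zero-based row indices (the text indexes rows from 1); the function name `lookup_program` and the toy values are hypothetical.

```python
def lookup_program(M_key, M_prog, k_t):
    """Return (i*, p_{t+1}): best-matching row index and its program embedding."""
    scores = [sum(m * k for m, k in zip(row, k_t)) for row in M_key]
    i_star = max(range(len(scores)), key=scores.__getitem__)
    return i_star, M_prog[i_star]

# Toy memory with N = 3 programs, K = 2, P = 3 (values are made up).
M_key = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
M_prog = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
i_star, p_next = lookup_program(M_key, M_prog, [0.2, 0.9])
```

Adding a program corresponds to appending one row to each of `M_key` and `M_prog`; the lookup itself does not change.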

The next environmental state $e_{t+1}$ will be determined by the dynamics of the environment and can

be affected by both the choice of program $p_t$ and the contents of the output arguments $a_t$, i.e.

$$e_{t+1} \sim f_{env}(e_t, p_t, a_t) \tag{5}$$

The transition mapping $f_{env}$ is domain-specific and will be discussed in Section 4. A description of

the inference procedure is given in Algorithm 1.

¹ In our implementation, a program may first call a subprogram before itself finishing. The only exception

is the ACT program that signals a low-level action to the environment, e.g. moving a pointer one step left or

writing a value. By convention ACT does not call any further sub-programs.

Algorithm 1

1: Inputs: Environment observation e, program id i, arguments a, stop threshold α

2: function RUN(i, a)

3:     h ← 0, r ← 0, p ← M^prog_{i,:}    ▷ Init LSTM and return probability.

4:     while r < α do

5:         s ← f_enc(e, a), h ← f_lstm(s, p, h)    ▷ Feed-forward NPI one step.

6:         r ← f_end(h), k ← f_prog(h), a_2 ← f_arg(h)

7:         i_2 ← arg max_{j=1..N} (M^key_{j,:})^T k    ▷ Decide the next program to run.

8:         if i == ACT then e ← f_env(e, p, a)    ▷ Update the environment based on ACT.

9:         else RUN(i_2, a_2)    ▷ Run subprogram i_2 with arguments a_2
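
The control flow of the RUN procedure can be exercised with a toy interpreter. This is a sketch under stated assumptions rather than the trained model: a scripted function `toy_core` replaces the learned mappings, an integer step counter plays the role of the LSTM state, and the environment is a single pointer position. The program ids `ACT` and `STEP_TWICE` and the environment layout are hypothetical.

```python
ACT, STEP_TWICE = 0, 1  # hypothetical program ids

def toy_core(prog_id, h):
    """Scripted stand-in for the learned core and its decoders.

    h is a step counter playing the role of the LSTM state.
    Returns (h_next, r, next_prog_id, next_args).
    """
    h_next = h + 1
    if prog_id == ACT:
        return h_next, 1.0, ACT, None   # ACT ends after one step
    # STEP_TWICE: call ACT(+1) twice, ending on the second call.
    r = 1.0 if h_next == 2 else 0.0
    return h_next, r, ACT, +1

def run(env, prog_id, args, alpha=0.5):
    """Recursive interpreter following the structure of the RUN procedure."""
    h, r = 0, 0.0                       # fresh state for this program call
    while r < alpha:
        h, r, next_id, next_args = toy_core(prog_id, h)
        if prog_id == ACT:
            env["pointer"] += args      # environment update for ACT
        else:
            run(env, next_id, next_args, alpha)
    return env

env = run({"pointer": 0}, STEP_TWICE, None)
```

Python's own call stack stands in for the explicit program call stack here: when a nested `run` returns, the caller resumes with its saved `h`, which corresponds to popping the caller's hidden units and program embedding.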

Each task has a set of actions that affect the environment. For example, in addition there are LEFT

and RIGHT actions that move a specified pointer, and a WRITE action which writes a value at

a specified location. These actions are encapsulated into a general-purpose ACT program shared

across tasks, and the concrete action to be taken is indicated by the NPI-generated arguments $a_t$.

Note that the core LSTM module of our NPI representation is completely agnostic to the data modal-

ity used to produce the state encoding. As long as the same fixed-length embedding is extracted,

the same module can in practice route between programs related to sorting arrays just as easily as

between programs related to rotating 3D objects. In the experimental sections, we provide details of

the modality-specific deep neural networks that we use to produce these fixed-length state vectors.

3.2 TRAINING

To train we use execution traces $\xi_t^{inp} : \{e_t, i_t, a_t\}$ and $\xi_t^{out} : \{i_{t+1}, a_{t+1}, r_t\}$, $t = 1, \ldots, T$, where $T$ is the sequence length. Program IDs $i_t$ and $i_{t+1}$ are row-indices in $M^{key}$ and $M^{prog}$ of the programs to run at time $t$ and $t+1$, respectively. We propose to directly maximize the probability of the correct execution trace output $\xi^{out}$ conditioned on $\xi^{inp}$:

$$\theta^* = \arg\max_{\theta} \sum_{(\xi^{inp}, \xi^{out})} \log P(\xi^{out} \mid \xi^{inp}; \theta) \qquad (6)$$

where $\theta$ are the parameters of our model. Since the traces are variable in length depending on the input, we apply the chain rule to model the joint probability over $\xi_1^{out}, \ldots, \xi_T^{out}$ as follows:

$$\log P(\xi^{out} \mid \xi^{inp}; \theta) = \sum_{t=1}^{T} \log P(\xi_t^{out} \mid \xi_1^{inp}, \ldots, \xi_t^{inp}; \theta) \qquad (7)$$

Note that for many problems the input history $\xi_1^{inp}, \ldots, \xi_t^{inp}$ is critical to deciding future actions because the environment observation at the current time-step $e_t$ alone does not contain enough information. The hidden unit activations of the LSTM in NPI are capable of capturing these temporal dependencies. The single-step conditional probability in equation (7) can be factorized into three further conditional distributions, corresponding to predicting the next program, next arguments, and whether to halt execution:

$$\log P(\xi_t^{out} \mid \xi_1^{inp}, \ldots, \xi_t^{inp}) = \log P(i_{t+1} \mid h_t) + \log P(a_{t+1} \mid h_t) + \log P(r_t \mid h_t) \qquad (8)$$

where $h_t$ is the output of $f_{lstm}$ at time $t$, carrying information from previous time steps. We train by gradient ascent on the likelihood in equation (7).
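The factorization in equations (7) and (8) can be made concrete with a small numerical sketch. The following assumes, purely for illustration, that the next program is a categorical distribution over program scores, that each argument slot is an independent categorical distribution, and that halting is a Bernoulli variable; the function names and these modelling choices are assumptions, not details given in the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def step_log_likelihood(prog_scores, arg_scores, halt_prob,
                        target_prog, target_args, target_halt):
    """log P(i_{t+1}|h_t) + log P(a_{t+1}|h_t) + log P(r_t|h_t), as in eq. (8)."""
    log_p_prog = math.log(softmax(prog_scores)[target_prog])
    # Assumption: one independent categorical distribution per argument slot.
    log_p_args = sum(math.log(softmax(scores)[a])
                     for scores, a in zip(arg_scores, target_args))
    # Assumption: halting is a Bernoulli variable with probability halt_prob.
    log_p_halt = math.log(halt_prob if target_halt else 1.0 - halt_prob)
    return log_p_prog + log_p_args + log_p_halt


def trace_log_likelihood(steps):
    """Sum of per-step terms over t = 1..T, as in eq. (7)."""
    return sum(step_log_likelihood(*step) for step in steps)


# One step with uniform predictions: 2 programs, one 3-way argument, halt prob 0.5.
step = ([0.0, 0.0], [[0.0, 0.0, 0.0]], 0.5, 1, [2], False)
print(trace_log_likelihood([step, step]))
```

Gradient ascent on this quantity with respect to the model parameters is the training procedure the text describes; the sketch only evaluates the objective.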

We used an adaptive curriculum in which training examples for each mini-batch are fetched with frequency proportional to the model’s current prediction error for the corresponding program. Specifically, we set the sampling frequency using a softmax over average prediction error across all programs, with configurable temperature. Every 1000 steps of training we re-estimated these prediction errors. Intuitively, this forces the model to focus on the program that it currently executes worst. We found that the adaptive curriculum immediately worked much better than our best-performing hand-designed curriculum, allowing a multi-task NPI to achieve performance comparable to single-task NPI on all tasks.
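A minimal sketch of such an adaptive curriculum is given below. It assumes that the average prediction error of each program is available as a number in a dictionary and that dividing by the temperature before the softmax is the intended parameterization; the function names and the error values are illustrative only.

```python
import math
import random


def sampling_weights(avg_errors, temperature=1.0):
    """Softmax over per-program average prediction error.

    Lower temperature concentrates sampling on the worst-performing program.
    """
    names = list(avg_errors)
    logits = [avg_errors[name] / temperature for name in names]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return {name: e / z for name, e in zip(names, exps)}


def sample_program(weights, rng):
    """Draw the program whose trace is fetched for the next mini-batch."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Illustrative error estimates; in training these would be re-estimated
# periodically (the text mentions every 1000 steps).
errors = {"ADD": 0.5, "BUBBLESORT": 0.1, "GOTO": 0.1}
weights = sampling_weights(errors, temperature=0.1)
rng = random.Random(0)
print(weights)
print([sample_program(weights, rng) for _ in range(5)])
```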

We also note that our program has a distinct memory advantage over basic LSTMs because all subprograms can be trained in parallel. For programs whose execution length grows e.g. quadratically with the input sequence length, an LSTM will be highly constrained by device memory to train on short sequences. By exploiting compositionality, an effective curriculum can often be developed with sublinear-length subprograms, enabling our NPI model to train on order-of-magnitude larger sequences than the LSTM.

Figure 3: (a) Example scratch pad and pointers used for computing “96 + 125 = 221”. Carry step is being implemented. (b) Actual trace of addition program generated by our model on the problem shown in (a). Note that we substituted the ACT calls in the trace with more human-readable steps.

(a) Scratch pad:
input 1   0 0 0 9 6
input 2   0 0 1 2 5
carry     0 0 1 1 1
output    0 0 0 2 1

(b) Trace of ADD (three successive ADD1 calls, one per column):
ADD1               ADD1               ADD1
WRITE OUT 1        WRITE OUT 2        WRITE OUT 2
CARRY              CARRY              LSHIFT
PTR CARRY LEFT     PTR CARRY LEFT     PTR INP1 LEFT
WRITE CARRY 1      WRITE CARRY 1      PTR INP2 LEFT
PTR CARRY RIGHT    PTR CARRY RIGHT    PTR CARRY LEFT
LSHIFT             LSHIFT             PTR OUT LEFT
PTR INP1 LEFT      PTR INP1 LEFT
PTR INP2 LEFT      PTR INP2 LEFT
PTR CARRY LEFT     PTR CARRY LEFT
PTR OUT LEFT       PTR OUT LEFT

4 EXPERIMENTS

This section describes the environment and state encoder function for each task, and shows example

outputs and prediction accuracy results. For all tasks, the core LSTM had two layers of size 256.

We trained the NPI using the ADAM solver (Kingma & Ba, 2015) with base learning rate 0.0001,

batch size 1, and decayed the learning rate by a factor of 0.95 every 10,000 steps.
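As a worked example of the learning-rate schedule just described, the sketch below assumes a staircase decay, i.e. the rate is multiplied by 0.95 once per completed block of 10,000 steps. Whether the decay was applied in steps or continuously is not stated in the text, so this is an assumption, as is the function name.

```python
BASE_LR = 1e-4        # base learning rate from the text
DECAY_FACTOR = 0.95   # decay factor from the text
DECAY_EVERY = 10_000  # steps between decays, from the text


def learning_rate(step):
    """Staircase schedule (assumed): BASE_LR * 0.95 ** floor(step / 10000)."""
    return BASE_LR * DECAY_FACTOR ** (step // DECAY_EVERY)


for step in (0, 9_999, 10_000, 100_000):
    print(step, learning_rate(step))
```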

In this section we provide an overview of the tasks used to evaluate our model. Table 2 in the

appendix provides a full listing of all the programs and subprograms learned by our model.

ADDITION

The task in this environment is to read in the digits of two base-10 numbers and produce the digits

of the answer. Our goal is to teach the model the standard (at least in the US) grade school algorithm

of adding, in which one works from right to left applying single-digit add and carry operations.

In this environment, the network is endowed with a “scratch pad” with which to store intermediate

computations; e.g. to record carries. There are four pointers; one for each of the two input numbers,

one for the carry, and another to write the output. At each time step, a pointer can be moved left or

right, or it can record a value to the pad. Figure 3a illustrates the environment of this model, and

Figure 3b provides a real execution trace generated by our model.

For the state encoder $f_{enc}$, the model is allowed a view of the scratch pad from the perspective of each of the four pointers. That is, the model sees the current values at pointer locations of the two inputs, the carry row and the output row, as 1-of-K encodings, where K is 10 because we are working in base 10. We also append the values of the input argument tuple $a_t$:

$$f_{enc}(Q, i_1, i_2, i_3, i_4, a_t) = MLP([Q(1, i_1), Q(2, i_2), Q(3, i_3), Q(4, i_4), a_t(1), a_t(2), a_t(3)]) \qquad (9)$$

where $Q \in \mathbb{R}^{4 \times N \times K}$, and $i_1, \ldots, i_4$ are pointers, one per scratch pad row. The first dimension of $Q$ corresponds to scratch pad rows, $N$ is the number of columns (digits) and $K$ is the one-hot encoding dimension. To begin the ADD program, we set the initial arguments to a default value and initialize all pointers to be at the rightmost column. The only subprogram with non-default arguments is ACT, in which case the arguments indicate an action to be taken by a specified pointer.
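To make the scratch-pad environment concrete, the following sketch executes the grade-school addition algorithm on a four-row pad and records a call trace in the style of the ADD, ADD1, CARRY and LSHIFT programs. It is a hand-written reference procedure, not the learned model. The stopping rule (halt once no non-zero digit or carry remains at or to the left of the pointers) and the helper names are assumptions made for the example.

```python
def to_digits(n, width):
    """Digits of n, most significant first, left-padded with zeros."""
    return [int(c) for c in str(n).zfill(width)]


def add_with_trace(x, y, width=5):
    # The pad must be wide enough to hold the sum, so a carry never falls off.
    assert 0 <= x and 0 <= y and x + y < 10 ** width
    pad = {"INP1": to_digits(x, width), "INP2": to_digits(y, width),
           "CARRY": [0] * width, "OUT": [0] * width}
    ptr = {row: width - 1 for row in pad}  # all pointers start at the rightmost column
    trace = [(0, "ADD")]

    def move(row, direction, depth):
        ptr[row] += -1 if direction == "LEFT" else 1
        trace.append((depth, f"PTR {row} {direction}"))

    def write(row, value, depth):
        pad[row][ptr[row]] = value
        trace.append((depth, f"WRITE {row} {value}"))

    def work_left():
        # Assumed stopping rule: continue while anything non-zero remains.
        col = ptr["INP1"]
        return col >= 0 and any(pad[r][c] for r in ("INP1", "INP2", "CARRY")
                                for c in range(col + 1))

    while work_left():
        trace.append((1, "ADD1"))
        total = sum(pad[r][ptr[r]] for r in ("INP1", "INP2", "CARRY"))
        write("OUT", total % 10, 2)
        if total >= 10:
            trace.append((2, "CARRY"))
            move("CARRY", "LEFT", 3)
            write("CARRY", 1, 3)
            move("CARRY", "RIGHT", 3)
        trace.append((1, "LSHIFT"))
        for row in ("INP1", "INP2", "CARRY", "OUT"):
            move(row, "LEFT", 2)
    return int("".join(map(str, pad["OUT"]))), trace


result, trace = add_with_trace(96, 125)
for depth, name in trace:
    print("  " * depth + name)
print(result)
```

Each ADD1 step reads only the four values under the pointers, which is the same local view that the state encoder in equation (9) provides to the model.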

SORTING

In this section we apply our model to a setting with potentially much longer execution traces: sorting

an array of numbers using bubblesort. As in the case of addition we can use a scratch pad to store

intermediate states of the array. We define the encoder as follows:

$$f_{enc}(Q, i_1, i_2, a_t) = MLP([Q(1, i_1), Q(1, i_2), a_t(1), a_t(2), a_t(3)]) \qquad (10)$$

where $Q \in \mathbb{R}^{1 \times N \times K}$ is the pad, $N$ is the array length and $K$ is the array entry embedding dimension. Figure 4 shows an example series of array states and an excerpt of an execution trace.

Figure 4: (a) Example scratch pad and pointers used for sorting. Several steps of the BUBBLE subprogram are shown: the array reads 3 2 4 9 1 at t=0 and t=1, and 2 3 4 9 1 at t=2 and t=3. (b) Excerpt from the trace of the learned bubblesort program.

CANONICALIZING 3D MODELS

We also apply our model to a vision task with a very different perceptual environment - pixels. Given

a rendering of a 3D car, we would like to learn a visual program that “canonicalizes” the model with

respect to its pose. Whatever the starting position, the program should generate a trajectory of

actions that delivers the camera to the target view, e.g. frontal pose at a 15° elevation. For training

data, we used renderings of the 3D car CAD models from (Fidler et al., 2012).

This is a nontrivial problem because different starting positions will require quite different trajec-

tories to reach the target. Further complicating the problem is the fact that the model will need to

generalize to different car models than it saw during training.

We again use a scratch pad, but here it is a very simple read-only pad that only contains a target camera elevation and azimuth, i.e., the “canonical pose”. Since observations come in the form of image pixels, we use a convolutional neural network $f_{CNN}$ as the image encoder:

$$f_{enc}(Q, x, i_1, i_2, a_t) = MLP([Q(1, i_1), Q(2, i_2), f_{CNN}(x), a_t(1), a_t(2), a_t(3)]) \qquad (11)$$

where $x \in \mathbb{R}^{H \times W \times 3}$ is a car rendering at the current pose, $Q \in \mathbb{R}^{2 \times 1 \times K}$ is the pad containing canonical azimuth and elevation, $i_1, i_2$ are the (fixed at 1) pointer locations, and $K$ is the one-hot encoding dimension of pose coordinates. We set $K = 24$ corresponding to 15° pose increments.
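The pose discretization can be illustrated with a short sketch: 360° divided into 15° steps gives the 24 one-hot bins mentioned above. The second function shows how a ground-truth call trace toward a target pose could be written down when the true pose is known, for instance when generating supervision from a renderer. It uses pose information that the NPI itself does not receive for query frames. The sign conventions (RIGHT increases azimuth, UP increases elevation), the choice to skip HGOTO or VGOTO when no move is needed, and all function names are assumptions.

```python
STEP_DEG = 15
K = 360 // STEP_DEG  # 24 one-hot bins


def pose_bin(angle_deg):
    """Index of the 15-degree bin for an angle given in degrees."""
    return (angle_deg % 360) // STEP_DEG


def one_hot(index, size=K):
    return [1 if j == index else 0 for j in range(size)]


def goto_trace(azimuth, elevation, target_azimuth=0, target_elevation=15):
    """Call trace from a known pose to the target (angles are multiples of 15)."""
    trace = [(0, "GOTO")]
    d_az = (target_azimuth - azimuth) // STEP_DEG
    d_el = (target_elevation - elevation) // STEP_DEG
    if d_az:
        # Assumed convention: ACT(RIGHT) increases azimuth by one step.
        name, action = ("RGOTO", "RIGHT") if d_az > 0 else ("LGOTO", "LEFT")
        trace += [(1, "HGOTO"), (2, name)] + [(3, f"ACT({action})")] * abs(d_az)
    if d_el:
        # Assumed convention: ACT(UP) increases elevation by one step.
        name, action = ("UGOTO", "UP") if d_el > 0 else ("DGOTO", "DOWN")
        trace += [(1, "VGOTO"), (2, name)] + [(3, f"ACT({action})")] * abs(d_el)
    return trace


print(one_hot(pose_bin(15)))
for depth, name in goto_trace(azimuth=45, elevation=0):
    print("  " * depth + name)
```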

Note, critically, that our NPI model only has access to pixels of the rendering and the target pose,

and is not provided the pose of query frames. We are also aware that one solution to this problem

would be to train a pose classifier network and then find the shortest path to canonical pose via

classical methods. That is also a sensible approach. However, our purpose here is to show that our

method generalizes beyond the scratch pad domain to detailed images of 3D objects, and also to

other environments with a single multi-task model.

Both LSTMs and Neural Turing Machines can learn to perform sorting to a limited degree, although

they have not been shown to generalize well to much longer arrays than were seen during training.

However, we are interested not only in whether sorting can be accomplished, but whether a particular

sorting algorithm (e.g. bubblesort) can be learned by the model, and how effectively in terms of

sample complexity and generalization.

We compare the generalization ability of our model to a flat sequence-to-sequence LSTM (Sutskever et al., 2014), using the same number of layers (2) and hidden units (256). Note that a flat version of NPI (see footnote 2) could also learn sorting of short arrays, but because bubblesort runs in $O(N^2)$ for arrays of length $N$, the execution traces quickly become far too long to store the required number of LSTM states in memory. Our NPI architecture can train on much larger arrays by exploiting compositional structure; the memory requirements of any given subprogram can be restricted to $O(N)$.

Footnote 2: By flat in this case, we mean non-compositional, not making use of subprograms, and only making calls to ACT in order to swap values and move pointers.

Figure 5: Sample complexity. Test accuracy of sequence-to-sequence LSTM versus NPI on length-20 arrays of single-digit numbers. Note that NPI is able to mine and train on subprogram traces from each bubblesort example.

Figure 6: Strong vs. weak generalization. Test accuracy of sequence-to-sequence LSTM versus NPI on varying-length arrays of single-digit numbers. Both models were trained on arrays of single-digit numbers up to length 20.

A strong indicator of whether a neural network has learned a program well is whether it can run the

program on inputs of previously-unseen sizes. To evaluate this property, we train both the sequence-

to-sequence LSTM and NPI to perform bubblesort on arrays of single-digit numbers from length 2

to length 20. Compared to fixed-length inputs this raises the challenge level during training, but in

exchange we can get a more flexible and generalizable sorting program.

To handle variable-sized inputs, the state representation must have some information about input se-

quence length and the number of steps taken so far. For example, the main BUBBLESORT program

naturally needs to call its helper function BUBBLE a number of times dependent on the sequence

length. We enable this in our model by adding a third pointer that acts as a counter; each time BUB-

BLE is called the pointer is advanced by one step. The scratch pad environment also provides a bit

indicating whether a pointer is at the start or end of a sequence, equivalent in purpose to end tokens

used in a sequence-to-sequence model.
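The role of the counter pointer can be seen in a hand-written reference version of the sorting procedure, decomposed into the same subprograms (BUBBLE, BSTEP, COMPSWAP, RSHIFT, RESET, LSHIFT). This is a sketch of the target behaviour rather than of the learned model. It assumes that pointers 1 and 2 are clamped at the array ends and that BUBBLESORT halts once the counter pointer has advanced once per array element; both rules are inferred for illustration.

```python
def bubblesort_trace(values):
    a = list(values)
    n = len(a)
    ptr = {1: 0, 2: 0, 3: 0}  # pointers 1 and 2 compare; pointer 3 counts sweeps
    trace = [(0, "BUBBLESORT")]

    def move(p, direction, depth):
        step = 1 if direction == "RIGHT" else -1
        limit = n if p == 3 else n - 1  # assumed: array pointers clamp at the ends
        ptr[p] = min(max(ptr[p] + step, 0), limit)
        trace.append((depth, f"PTR {p} {direction}"))

    while ptr[3] < n:  # assumed halting rule: counter has reached the end
        trace.append((1, "BUBBLE"))
        move(2, "RIGHT", 2)
        while ptr[1] < n - 1:
            trace.append((2, "BSTEP"))
            trace.append((3, "COMPSWAP"))
            i, j = ptr[1], ptr[2]
            if a[i] > a[j]:
                a[i], a[j] = a[j], a[i]
                trace.append((4, "SWAP 1 2"))
            trace.append((3, "RSHIFT"))
            move(1, "RIGHT", 4)
            move(2, "RIGHT", 4)
        trace.append((1, "RESET"))
        while ptr[1] > 0:
            trace.append((2, "LSHIFT"))
            move(1, "LEFT", 3)
            move(2, "LEFT", 3)
        move(3, "RIGHT", 2)  # advance the counter once per sweep
    return a, trace


sorted_values, trace = bubblesort_trace([9, 2, 5])
for depth, name in trace:
    print("  " * depth + name)
print(sorted_values)
```

The number of BUBBLE calls depends only on the counter pointer, which is why a miscount on long arrays would show up as sweeps taking the wrong number of steps.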

For each length, we provided 64 example bubblesort traces, for a total of 1,216 examples. Then,

we evaluated whether the network can learn to sort arrays beyond length 20. We found that the

trained model generalizes well, and is capable of sorting arrays up to size 60; see Figure 6. At 60

and beyond, we observed a failure mode in which sweeps of pointers across the array would take

the wrong number of steps, suggesting that the limiting performance factor is related to counting.

In stark contrast, when provided with the 1,216 examples, the sequence-to-sequence LSTMs fail to

generalize beyond arrays of length 25 as shown in Figure 6.

To study sample complexity further, we fix the length of the arrays to 20 and vary the number of

training examples. We see in Figure 5 that NPI starts learning with 2 examples and is able to sort

almost perfectly with only 8 examples. The sequence-to-sequence model on the other hand requires

64 examples to start learning and only manages to sort well with over 250 examples.

Figure 7 shows several example canonicalization trajectories generated by our model, starting from

the leftmost car. The image encoder was a convolutional network with three passes of stride-2

convolution and pooling, trained on renderings of size 128 × 128. The canonical target pose in this

case is frontal with 15° elevation. At test time, from an initial rendering, NPI is able to canonicalize

cars of varying appearance from multiple starting positions. Importantly, it can generalize to car

appearances not encountered in the training set as shown in Figure 7.

One challenge for continual learning of neural-network-based agents is that training on new tasks

and experiences can lead to degraded performance in old tasks. The learning of new tasks may

require that the network weights change substantially, so care must be taken to avoid catastrophic

forgetting (McCloskey & Cohen, 1989; O’Reilly et al., 2014). Using NPI, one solution is to fix the

weights of the core routing module, and only make sparse updates to the program memory.

Figure 7: Example canonicalization of several different test set cars. The network is able to generate and execute the appropriate plan based on the starting car image. This NPI was trained on trajectories starting at azimuth (−75°...75°), elevation (0°...60°) in 15° increments. The training trajectories target azimuth 0° and elevation 15°, as in the generated traces shown in the figure.

When adding a new program the core module’s routing computation will be completely unaffected; all the learning for a new task occurs in program embedding space. Of course, the addition of new programs to the memory adds a new choice of program at each time step, and an old program could mistakenly call a newly added program. To overcome this, when learning a new set of program vectors with a fixed core, in practice we train not only on example traces of the new program, but also traces of existing programs. Alternatively, a simpler approach is to prevent existing programs from calling subsequently added programs, allowing addition of new programs without ever looking back at training data for known programs. In either case, note that only the memory slots of the new programs are updated, and all other weights, including other program embeddings, are fixed.
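The fixed-core update rule described above amounts to masking the gradient so that only the memory rows of newly added programs change. A minimal sketch, assuming the program memory is stored as a list of rows and that a gradient of the log-likelihood is already available; the function name and the numbers are illustrative:

```python
def update_new_programs(memory, grads, new_rows, lr=0.1):
    """Gradient-ascent step applied only to the rows listed in new_rows.

    memory and grads are lists of rows (one row per program slot); every row
    not in new_rows is copied unchanged, mirroring the frozen embeddings.
    """
    return [
        [w + lr * g for w, g in zip(row, grad_row)] if idx in new_rows else list(row)
        for idx, (row, grad_row) in enumerate(zip(memory, grads))
    ]


# Three existing program embeddings plus one newly added slot (row 3).
memory = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.0, 0.0]]
grads = [[1.0, 1.0]] * 4
print(update_new_programs(memory, grads, new_rows={3}, lr=0.5))
```

The same masking would apply to the corresponding key rows, while the core network's weights receive no update at all.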

Table 1 shows the result of adding a maximum-finding program MAX to a multitask NPI trained

on addition, sorting and canonicalization. MAX first calls BUBBLESORT and then a new program

RJMP, which moves pointers to the right of the sorted array, where the max element can be read.

During training we froze all weights except for the two newly-added program embeddings. We

find that NPI learns MAX perfectly without forgetting the other tasks. In particular, after training a

single multi-task model as outlined in the following section, learning the MAX program with this

fixed-core multi-task NPI results in no performance deterioration for all three tasks.

4.4 SOLVING MULTIPLE TASKS WITH A SINGLE NETWORK

In this section we perform a controlled experiment to compare the performance of a multi-task NPI

with several single-task NPI models. Table 1 shows the results for addition, sorting and canonical-

izing 3D car models. We trained and evaluated on 10-digit numbers for addition, length-5 arrays for

sorting, and up to four-step trajectories for canonicalization. As shown in Table 1, one multi-task NPI can learn all three programs (and necessarily the 21 subprograms) with accuracy comparable to each single-task NPI.

Task               Single   Multi   + Max
Addition           100.0    97.0    97.0
Sorting            100.0    100.0   100.0
Canon. seen car    89.5     91.4    91.4
Canon. unseen      88.7     89.9    89.9
Maximum            -        -       100.0

Table 1: Per-sequence % accuracy. “+ Max” indicates performance after addition of the additional max-finding subprograms to memory. “unseen” uses a test set with disjoint car models from the training set, while “seen car” uses the same car models but different trajectories.

5 CONCLUSION

We have shown that the NPI can learn programs in very dissimilar environments with different

affordances. In the context of sorting we showed that NPI exhibits very strong generalization in

comparison to sequence-to-sequence LSTMs. We also showed how a trained NPI with a fixed core

can continue to learn new programs without forgetting already learned programs.

ACKNOWLEDGMENTS

We sincerely thank Arun Nair and Ed Grefenstette for helpful suggestions.


REFERENCES

Anderson, Michael L. Neural reuse: A fundamental organizational principle of the brain. Behavioral

and Brain Sciences, 33:245–266, 8 2010.

Andre, David and Russell, Stuart J. Programmable reinforcement learning agents. In Advances in

Neural Information Processing Systems, pp. 1019–1025. 2001.

Banzhaf, Wolfgang, Nordin, Peter, Keller, Robert E, and Francone, Frank D. Genetic programming:

An introduction, volume 1. Morgan Kaufmann San Francisco, 1998.

Dietterich, Thomas G. Hierarchical reinforcement learning with the MAXQ value function decom-

position. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Donnarumma, Francesco, Prevete, Roberto, and Trautteur, Giuseppe. Programming in the brain: A

neural network theoretical framework. Connection Science, 24(2-3):71–90, 2012.

Donnarumma, Francesco, Prevete, Roberto, Chersi, Fabian, and Pezzulo, Giovanni. A programmer-

interpreter neural network architecture for prefrontal cognitive control. International Journal of

Neural Systems, 25(6):1550017, 2015.

Fidler, Sanja, Dickinson, Sven, and Urtasun, Raquel. 3D object detection and viewpoint estimation

with a deformable 3D cuboid model. In Advances in neural information processing systems, 2012.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint

arXiv:1410.5401, 2014.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):

1735–1780, 1997.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent

nets. In NIPS, 2015.

Kaiser, Łukasz and Sutskever, Ilya. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228,

2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. 2015.

Kolter, Zico, Abbeel, Pieter, and Ng, Andrew Y. Hierarchical apprenticeship learning with appli-

cation to quadruped locomotion. In Advances in Neural Information Processing Systems, pp.

769–776. 2008.

Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural random-access machines. arXiv

preprint arXiv:1511.06392, 2015.

McCloskey, Michael and Cohen, Neal J. Catastrophic interference in connectionist networks: The

sequential learning problem. In The psychology of learning and motivation, volume 24, pp. 109–

165. 1989.

Mou, Lili, Li, Ge, Liu, Yuxuan, Peng, Hao, Jin, Zhi, Xu, Yan, and Zhang, Lu. Building program

vector representations for deep learning. arXiv preprint arXiv:1409.3358, 2014.

Neelakantan, Arvind, Le, Quoc V, and Sutskever, Ilya. Neural programmer: Inducing latent pro-

grams with gradient descent. arXiv preprint arXiv:1511.04834, 2015.

O’Reilly, Randall C., Bhattacharyya, Rajan, Howard, Michael D., and Ketz, Nicholas. Complementary learning systems. Cognitive Science, 38(6):1229–1248, 2014.

Rothkopf, Constantin A. and Ballard, Dana H. Modular inverse reinforcement learning for visuomotor behavior. Biological Cybernetics, 107(4):477–490, 2013.

Rumelhart, D. E., Hinton, G. E., and McClelland, J. L. Parallel distributed processing: Explorations

in the microstructure of cognition, vol. 1. chapter A General Framework for Parallel Distributed

Processing, pp. 45–76. MIT Press, 1986.


Schaul, Tom, Horgan, Daniel, Gregor, Karol, and Silver, David. Universal value function approxi-

mators. In International Conference on Machine Learning, 2015.

Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recur-

rent networks. Neural Computation, 4(1):131–139, 1992.

Schneider, Walter and Chein, Jason M. Controlled and automatic processing: behavior, theory, and

biological mechanisms. Cognitive Science, 27(3):525–559, 2003.

Subramanian, Kaushik, Isbell, Charles, and Thomaz, Andrea. Learning options through human

interaction. In IJCAI Workshop on Agents Learning Interactively from Human Teachers, 2011.

Sutskever, Ilya and Hinton, Geoffrey E. Using matrices to model symbolic relationship. In Advances

in Neural Information Processing Systems, pp. 1593–1600. 2009.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

Sutton, Richard S., Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A frame-

work for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–

211, 1999.

Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. Advances in Neural Infor-

mation Processing Systems (NIPS), 2015.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.

Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural Turing machines. arXiv

preprint arXiv:1505.00521, 2015.

Zaremba, Wojciech, Mikolov, Tomas, Joulin, Armand, and Fergus, Rob. Learning simple algorithms

from examples. arXiv preprint arXiv:1511.07275, 2015.


6 APPENDIX

6.1 LISTING OF LEARNED PROGRAMS

Program      Description                                           Calls

ADD          Perform multi-digit addition                          ADD1, LSHIFT
ADD1         Perform single-digit addition                         ACT, CARRY
CARRY        Mark a 1 in the carry row one unit left               ACT
LSHIFT       Shift a specified pointer one step left               ACT
RSHIFT       Shift a specified pointer one step right              ACT
ACT          Move a pointer or write to the scratch pad            -

BUBBLESORT   Perform bubble sort (ascending order)                 BUBBLE, RESET
BUBBLE       Perform one sweep of pointers left to right           ACT, BSTEP
RESET        Move both pointers all the way left                   LSHIFT
BSTEP        Conditionally swap and advance pointers               COMPSWAP, RSHIFT
COMPSWAP     Conditionally swap two elements                       ACT
LSHIFT       Shift a specified pointer one step left               ACT
RSHIFT       Shift a specified pointer one step right              ACT
ACT          Swap two values at pointer locations or move a pointer   -

GOTO         Change 3D car pose to match the target                HGOTO, VGOTO
HGOTO        Move horizontally to the target angle                 LGOTO, RGOTO
LGOTO        Move left to match the target angle                   ACT
RGOTO        Move right to match the target angle                  ACT
VGOTO        Move vertically to the target elevation               UGOTO, DGOTO
UGOTO        Move up to match the target elevation                 ACT
DGOTO        Move down to match the target elevation               ACT
ACT          Move camera 15° up, down, left or right               -

RJMP         Move all pointers to the rightmost position           RSHIFT
MAX          Find maximum element of an array                      BUBBLESORT, RJMP

Table 2: Programs learned for addition, sorting and 3D car canonicalization. Note that the ACT program has a different effect depending on the environment and on the passed-in arguments.

Figure 8 shows the sequence of program calls for BUBBLESORT. Pointers 1 and 2 are used to implement the “bubble” operation involving the comparison and swapping of adjacent array elements. The third pointer (referred to in the trace as “PTR 3”) is used to count the number of calls to BUBBLE. After every call to RESET the swapping pointers are moved to the beginning of the array and the counting pointer is advanced by 1. When it has reached the end of the scratch pad, the model learns to halt execution of BUBBLESORT.

Figure 8: Generated execution trace from our trained NPI sorting the array [9,2,5].

BUBBLESORT (three successive BUBBLE and RESET calls, one per column):
BUBBLE          BUBBLE          BUBBLE
PTR 2 RIGHT     PTR 2 RIGHT     PTR 2 RIGHT
BSTEP           BSTEP           BSTEP
COMPSWAP        COMPSWAP        COMPSWAP
SWAP 1 2
RSHIFT          RSHIFT          RSHIFT
PTR 1 RIGHT     PTR 1 RIGHT     PTR 1 RIGHT
PTR 2 RIGHT     PTR 2 RIGHT     PTR 2 RIGHT
BSTEP           BSTEP           BSTEP
COMPSWAP        COMPSWAP        COMPSWAP
SWAP 1 2
RSHIFT          RSHIFT          RSHIFT
PTR 1 RIGHT     PTR 1 RIGHT     PTR 1 RIGHT
PTR 2 RIGHT     PTR 2 RIGHT     PTR 2 RIGHT
RESET           RESET           RESET
LSHIFT          LSHIFT          LSHIFT
PTR 1 LEFT      PTR 1 LEFT      PTR 1 LEFT
PTR 2 LEFT      PTR 2 LEFT      PTR 2 LEFT
LSHIFT          LSHIFT          LSHIFT
PTR 1 LEFT      PTR 1 LEFT      PTR 1 LEFT
PTR 2 LEFT      PTR 2 LEFT      PTR 2 LEFT
PTR 3 RIGHT     PTR 3 RIGHT     PTR 3 RIGHT


sequence models for the addition task, to evaluate the generalization ability. We implemented addition in a sequence-to-sequence model, training to model sequences of the following form, e.g. for “90 + 160 = 250” we represent the sequence as:

90X160X250

For the simple Seq2Seq baseline above (same number of LSTM layers and hidden units as NPI), we

observed that the model could predict one or two digits reliably, but did not generalize even up to

20-digit addition. However, we are aware that others have gotten multi-digit addition of the above

form to work to some extent with curriculum learning (Zaremba & Sutskever, 2014). In order to

make a more competitive baseline, we helped Seq2Seq in two ways: 1) reverse input digits and

stack the two numbers on top of each other to form a 2-channel sequence, and 2) reverse input digits

and generate reversed output digits immediately at each time step.

In the approach of 1), the seq2seq model schematically looks like this:

output:  XXXX250
input 1: 090XXXX
input 2: 061XXXX

In the approach of 2), the seq2seq model schematically looks like this:

output:  052
input 1: 090
input 2: 061

Both 1) which we call s2s-stacked and 2) which we call s2s-easy are much stronger competitors to

NPI than even the proposed addition baseline. We compare the generalization performance of NPI

to these baselines in the figure below:

Figure 9: Comparing NPI and Seq2Seq variants on addition generalization to longer sequences.
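The three sequence encodings used for the Seq2Seq baselines (the flat form, s2s-stacked and s2s-easy) can be reproduced with a short sketch. The padding rule for s2s-stacked (one more `X` than the number of digits) is inferred from the single schematic example for “90 + 160 = 250” and is therefore an assumption, as are the function names.

```python
def flat_format(a, b):
    """Simple baseline: operands and sum separated by X."""
    return f"{a}X{b}X{a + b}"


def reversed_digits(n, width):
    """Zero-pad to width, then reverse so the least significant digit comes first."""
    return str(n).zfill(width)[::-1]


def s2s_stacked(a, b):
    """Reversed inputs stacked as two channels; the sum is emitted afterwards."""
    width = len(str(a + b))
    pad = "X" * (width + 1)  # assumed padding length, inferred from the example
    return {
        "output": pad + str(a + b).zfill(width),
        "input 1": reversed_digits(a, width) + pad,
        "input 2": reversed_digits(b, width) + pad,
    }


def s2s_easy(a, b):
    """Reversed inputs; the reversed output digit is emitted at each time step."""
    width = len(str(a + b))
    return {
        "output": reversed_digits(a + b, width),
        "input 1": reversed_digits(a, width),
        "input 2": reversed_digits(b, width),
    }


print(flat_format(90, 160))
print(s2s_stacked(90, 160))
print(s2s_easy(90, 160))
```

In the s2s-easy encoding every output digit depends only on the current input digits and a carry, which is the locality that the discussion of these baselines points to.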

We found that NPI trained on 32 examples for problem lengths 1,...,20 generalizes with 100% ac-

curacy to all the lengths we tried (up to 3000). s2s-easy trained on twice as many examples gen-

eralizes to just over length 2000 problems. s2s-stacked barely generalizes beyond 5, even with far

more data. This suggests that locality of computation makes a large impact on generalization per-

formance. Even when we carefully ordered and stacked the input numbers for Seq2Seq, NPI still

had an edge in performance. In contrast to Seq2Seq, NPI is taught (supervised for now) to move

its pointers so that the key operations (e.g. single digit add, carry) can be done using only local

information, and this appears to help generalization.


Published as a conference paper at ICLR 2016

Karol Kurach∗ & Marcin Andrychowicz∗ & Ilya Sutskever

{kkurach,marcina,ilyasu}@google.com

arXiv:1511.06392v3 [cs.LG] 9 Feb 2016

ABSTRACT

In this paper, we propose and investigate a new neural network architecture called Neural Random Access Machine. It can manipulate and dereference pointers to [...] input-output examples using backpropagation.

We evaluate the new model on a number of simple algorithmic tasks whose so-

lutions require pointer manipulation and dereferencing. Our results show that the

proposed model can learn to solve algorithmic tasks of such type and is capable

of operating on simple data structures like linked-lists and binary trees. For easier

tasks, the learned solutions generalize to sequences of arbitrary length. More-

over, memory access during inference can be done in a constant time under some

assumptions.

1 INTRODUCTION

Deep learning is successful for two reasons. First, deep neural networks are able to represent the

“right” kind of functions; second, deep neural networks are trainable. Deep neural networks can

be potentially improved if they get deeper and have fewer parameters, while maintaining train-

ability. By doing so, we move closer towards a practical implementation of Solomonoff induc-

tion (Solomonoff, 1964). The first model that we know of that attempted to train extremely deep

networks with a large memory and few parameters is the Neural Turing Machine (NTM) (Graves

et al., 2014) — a computationally universal deep neural network that is trainable with backprop-

agation. Other models with this property include variants of Stack-Augmented recurrent neural

networks (Joulin & Mikolov, 2015; Grefenstette et al., 2015), and the Grid-LSTM (Kalchbrenner

et al., 2015)—of which the Grid-LSTM has achieved the greatest success on both synthetic and real

tasks. The key characteristic of these models is that their depth, the size of their short term memory,

and their number of parameters are no longer confounded and can be altered independently — which

stands in contrast to models like the LSTM (Hochreiter & Schmidhuber, 1997), whose number of

parameters grows quadratically with the size of their short term memory.
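The quadratic growth mentioned for the LSTM can be checked with the parameter count of a single standard LSTM layer: four gates, each with an input-to-hidden matrix, a hidden-to-hidden matrix and a bias. The sketch assumes a plain LSTM without peephole connections or projection layers, and the function name is illustrative.

```python
def lstm_param_count(hidden_size, input_size):
    """Parameters of one standard LSTM layer (no peepholes, no projection)."""
    gates = 4  # input, forget, output and candidate-cell transformations
    per_gate = hidden_size * (hidden_size + input_size) + hidden_size
    return gates * per_gate


for h in (128, 256, 512):
    print(h, lstm_param_count(h, input_size=h))
```

Doubling the hidden size roughly quadruples the count, whereas in the memory-augmented models discussed here the memory size can grow without adding parameters.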

A fundamental operation of modern computers is pointer manipulation and dereferencing. In this

work, we investigate a model class that we name the Neural Random-Access Machine (NRAM),

which is a neural network that has, as primitive operations, the ability to manipulate, store in mem-

ory, and dereference pointers into its working memory. By providing our model with dereferencing

as a primitive, it becomes possible to train models on problems whose solutions require pointer

manipulation and chasing. Although all computationally universal neural networks are equivalent,

which means that the NRAM model does not have a representational advantage over other models if

they are given a sufficient number of computational steps, in practice, the number of timesteps that

a given model has is highly limited, as extremely deep models are very difficult to train. As a result,

the model’s core primitives have a strong effect on the set of functions that can be feasibly learned

in practice, similarly to the way in which the choice of a programming language strongly affects the

functions that can be implemented with an extremely small amount of code.

Finally, the usefulness of computationally-universal neural networks depends entirely on the ability of backpropagation to find good settings of their parameters. Indeed, it is trivial to define the “optimal” hypothesis class (Solomonoff, 1964), but the problem of finding the best (or even a good) function in that class is intractable. Our work puts the backpropagation algorithm to another test, where the model is extremely deep and intricate.

∗ Equal contribution.

In our experiments, we evaluate our model on several algorithmic problems whose solutions require

pointer manipulation and chasing. These problems include algorithms on a linked-list and a binary

tree. While we were able to achieve encouraging results on these problems, we found that standard

optimization algorithms struggle with these extremely deep and nonlinear models. We believe that

advances in optimization methods will likely lead to better results.

2 RELATED WORK

There has been significant interest in the problem of learning algorithms in the past few years.

The most relevant recent paper is Neural Turing Machines (NTMs) (Graves et al., 2014). It was the

first paper to explicitly suggest the notion that it is worth training a computationally universal neural

network, and achieved encouraging results.

A follow-up model that had the goal of learning algorithms was the Stack-Augmented Recurrent

Neural Network (Joulin & Mikolov, 2015). This work demonstrated that the Stack-Augmented RNN

can generalize to long problem instances from short problem instances. A related model is the

Reinforcement Learning Neural Turing Machine (Zaremba & Sutskever, 2015), which attempted to

use reinforcement learning techniques to train a discrete-continuous hybrid model.

The memory network (Weston et al., 2014) is an early model that attempted to explicitly separate

the memory from computation in a neural network model. The followup work of Sukhbaatar et al.

(2015) combined the memory network with the soft attention mechanism, which allowed it to be

trained with less supervision.

The Grid-LSTM (Kalchbrenner et al., 2015) is a highly interesting extension of LSTM, which allows

LSTM cells to be used for both deep and sequential computation. It achieves excellent results on both

synthetic, algorithmic problems and on real tasks, such as language modelling, machine translation,

and object recognition.

The Pointer Network (Vinyals et al., 2015) is somewhat different from the above models in that it

does not have a writable memory — it is more similar to the attention model of Bahdanau et al.

(2014) in this regard. Despite not having a memory, this model was able to solve a number of diffi-

cult algorithmic problems that include the convex hull and the approximate 2D travelling salesman

problem (TSP).

Finally, it is important to mention the attention model of Bahdanau et al. (2014). Although this

work is not explicitly aimed at learning algorithms, it is by far the most practical model that has

an “algorithmic bent”. Indeed, this model has proven to be highly versatile, and variants of this

model have achieved state-of-the-art results on machine translation (Luong et al., 2015), speech

recognition (Chan et al., 2015), and syntactic parsing (Vinyals et al., 2014), without the use of

almost any domain-specific tuning.

3 MODEL

In this section we describe the NRAM model. We start with a description of the simplified version

of our model which does not use an external memory and then explain how to augment it with a

variable-size random-access memory. The core part of the model is a neural controller, which acts

as a “processor”. The controller can be a feedforward neural network or an LSTM, and it is the only

trainable part of the model.

The model contains R registers, each of which holds an integer value. To make our model trainable

with gradient descent, we made it fully differentiable. Hence, each register represents an integer

value with a distribution over the set {0, 1, . . . , M − 1}, for some constant M. We do not assume
that these distributions have any special form: they are simply stored as vectors $p \in \mathbb{R}^M$ satisfying
$p_i \ge 0$ and $\sum_i p_i = 1$. The controller does not have direct access to the registers; it can interact
with them using a number of prespecified modules (gates), such as integer addition or equality test.


Let $m_1, \dots, m_Q$ be the modules, where each module is a function
$m_i : \{0, 1, \dots, M-1\} \times \{0, 1, \dots, M-1\} \to \{0, 1, \dots, M-1\}$.

On a high level, the model performs a sequence of timesteps, each of which consists of the following

substeps:

1. The controller gets some inputs depending on the values of the registers (the controller’s

inputs are described in Sec. 3.1).

2. The controller updates its internal state (if the controller is an LSTM).

3. The controller outputs the description of a “fuzzy circuit” with inputs r1 , . . . , rR , gates

m1 , . . . , mQ and R outputs.

4. The values of the registers are overwritten with the outputs of the circuit.

More precisely, each circuit is created as follows. The inputs for the module mi are chosen by the

controller from the set {r1 , . . . , rR , o1 , . . . , oi−1 }, where:

• rj is the value stored in the j-th register at the current timestep, and

• oj is the output of the module mj at the current timestep.

Hence, for each 1 ≤ i ≤ Q the controller chooses weighted averages of the values

{r1 , . . . , rR , o1 , . . . , oi−1 } which are given as inputs to the module. Therefore,

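$o_i = m_i\big( (r_1, \dots, r_R, o_1, \dots, o_{i-1})^T \mathrm{softmax}(a_i),\ (r_1, \dots, r_R, o_1, \dots, o_{i-1})^T \mathrm{softmax}(b_i) \big)$, (1)

where the vectors $a_i, b_i \in \mathbb{R}^{R+i-1}$ are produced by the controller (Fig. 1).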

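[Diagram: the registers r1, . . . , rR and the outputs of previous modules o1, . . . , oi−1 are combined, via inner products with the softmax of the LSTM outputs ai and bi, into the two inputs of the module mi, which produces oi.]

Figure 1: The execution of the module mi. Gates s-m represent the softmax function and ⟨·, ·⟩
denotes inner product. See Eq. 1 for details.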
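The soft input selection described above can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the helper names `softmax` and `soft_select` are hypothetical, and distributions are stored as lists of length M.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_select(candidates, logits):
    """Weighted average of candidate distributions with weights softmax(logits).

    `candidates` plays the role of (r_1, ..., r_R, o_1, ..., o_{i-1});
    `logits` plays the role of a_i or b_i produced by the controller.
    """
    weights = softmax(logits)
    size = len(candidates[0])
    return [sum(w * dist[k] for w, dist in zip(weights, candidates))
            for k in range(size)]

# Two registers over {0, 1, 2}: r1 certainly holds 0, r2 certainly holds 2.
r1 = [1.0, 0.0, 0.0]
r2 = [0.0, 0.0, 1.0]
mixed = soft_select([r1, r2], logits=[0.0, 0.0])     # equal weights
peaked = soft_select([r1, r2], logits=[50.0, -50.0])  # almost a hard choice of r1
```

With equal logits the module input is an even mixture of the two registers; with a strongly peaked logit vector it is almost exactly one of them, which is the regime that makes later discretization possible.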

Recall that the variables rj represent probability distributions and therefore the inputs to mi , be-

ing weighted averages of probability distributions, are also probability distributions. Thus, as the

modules mi are originally defined for integer inputs and outputs, we must extend their domain to

probability distributions as inputs, which can be done in a natural way (and make their output also

be a probability distribution):

$\forall_{0 \le c < M}\quad P(m_i(A, B) = c) = \sum_{0 \le a, b < M} P(A = a)\, P(B = b)\, [m_i(a, b) = c]$. (2)
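As a small worked instance of Eq. 2, the following sketch lifts an integer module to distributions by enumerating all pairs (a, b). The function name `lift_module` and the choice M = 4 are illustrative assumptions, not part of the model definition.

```python
def lift_module(module, dist_a, dist_b):
    """Extend an integer module to distributions over {0, ..., M-1} (Eq. 2)."""
    size = len(dist_a)
    out = [0.0] * size
    for a, prob_a in enumerate(dist_a):
        for b, prob_b in enumerate(dist_b):
            # The pair (a, b) contributes its probability to the output value.
            out[module(a, b)] += prob_a * prob_b
    return out

M = 4

def add_mod(a, b):
    return (a + b) % M

A = [0.5, 0.5, 0.0, 0.0]  # the value is 0 or 1 with equal probability
B = [0.0, 0.0, 0.0, 1.0]  # the value is certainly 3
result = lift_module(add_mod, A, B)
```

Here the output puts mass 0.5 on 3 (from 0 + 3) and 0.5 on 0 (from 1 + 3 mod 4). The double loop makes the Θ(M²) cost per module application explicit.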

After the modules have produced their outputs, the controller decides which of the values

{r1 , . . . , rR , o1 , . . . , oQ } should be stored in the registers. In detail, the controller outputs the vectors
$c_i \in \mathbb{R}^{R+Q}$ for 1 ≤ i ≤ R and the values of the registers are updated (simultaneously) using
the formula:

$r_i := (r_1, \dots, r_R, o_1, \dots, o_Q)^T \mathrm{softmax}(c_i)$. (3)


Recall that at the beginning of each timestep the controller receives some inputs; where these inputs
should come from is an important design decision. A naive approach is to

use the values of the registers as inputs to the controller. However, the values of the registers are

probability distributions and are stored as vectors $p \in \mathbb{R}^M$. If the entire distributions were given as

inputs to the controller then the number of the model’s parameters would depend on M . This would

be undesirable because, as will be explained in the next section, the value M is linked to the size of

an external random-access memory tape and hence it would prevent the model from generalizing to

different memory sizes.

Hence, for each 1 ≤ i ≤ R the controller receives, as input, only one scalar from each register,

namely P(ri = 0), the probability that the value in the register is equal to 0. This solution has

an additional advantage, namely it limits the amount of information available to the controller and

forces it to rely on the modules instead of trying to solve the problem on its own. Notice that this

information is sufficient to get the exact value of ri if ri ∈ {0, 1}, which is the case whenever ri is

an output of a “boolean” module, e.g. the inequality test module mi (a, b) = [a < b].

One could use the model described so far for learning sequence-to-sequence transformations by

initializing the registers with the input sequence, and training the model to produce the desired

output sequence in its registers after a given number of timesteps. The disadvantage of such a model

is that it would be completely unable to generalize to longer sequences, because the length of the

sequence that the model can process is equal to the number of its registers, which is constant.

Therefore, we extend the model with a variable-size memory tape, which consists of M memory

cells, each of which stores a distribution over the set {0, 1, . . . , M −1}. Notice that each distribution

stored in a memory cell or a register can be interpreted as a fuzzy address in the memory and used

as a fuzzy pointer. We will hence identify integers in the set {0, 1, . . . , M − 1} with pointers to the

memory. Therefore, the value in each memory cell may be interpreted as an integer or as a pointer.

The exact state of the memory can be described by a matrix $\mathcal{M} \in \mathbb{R}^{M \times M}$, where the value $\mathcal{M}_{i,j}$ is the

probability that the i-th cell holds the value j.

The model interacts with the memory tape solely using two special modules:

• READ module: this module takes as the input a pointer¹ and returns the value stored under
the given address in the memory. This operation is extended to fuzzy pointers similarly
to Eq. 2. More precisely, if p is a vector representing the probability distribution of the
input (i.e. $p_i$ is the probability that the input pointer points to the i-th cell) then the module
returns the value $\mathcal{M}^T p$, where $\mathcal{M}$ is the memory matrix.

• WRITE module: this module takes as the input a pointer p and a value a and stores the value
a under the address p in the memory. The fuzzy form of the operation can be effectively
expressed using matrix operations².
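The two memory modules can be illustrated with a minimal pure-Python sketch. It assumes the memory is stored as a list of rows, where `memory[i][j]` is the probability that cell i holds value j; the function names are hypothetical. The write rule follows the formula given in the footnote, i.e. each row i becomes (1 − p_i) times the old row plus p_i times the written value.

```python
def one_hot(k, size):
    """Sharp distribution that puts all mass on value k."""
    return [1.0 if j == k else 0.0 for j in range(size)]

def fuzzy_read(memory, pointer):
    """READ: returns M^T p, the pointer-weighted mixture of memory rows."""
    size = len(memory)
    return [sum(pointer[i] * memory[i][j] for i in range(size))
            for j in range(size)]

def fuzzy_write(memory, pointer, value):
    """WRITE: row i becomes (1 - p_i) * old row + p_i * value."""
    size = len(memory)
    return [[(1.0 - pointer[i]) * memory[i][j] + pointer[i] * value[j]
             for j in range(size)]
            for i in range(size)]

M = 3
memory = [one_hot(2, M), one_hot(0, M), one_hot(1, M)]  # cells hold 2, 0, 1
read_back = fuzzy_read(memory, one_hot(0, M))           # sharp pointer to cell 0
updated = fuzzy_write(memory, [0.5, 0.5, 0.0], one_hot(1, M))
```

With a sharp pointer the read returns exactly the content of the addressed cell. With a fuzzy pointer the write blends the new value into every cell in proportion to the pointer mass on it, and cells with zero pointer mass are left untouched.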

The memory tape also serves as an input-output channel — the model’s memory is initialized with

the input sequence and the model is expected to produce the output in the memory. Moreover, we

use a novel way of deciding how many timesteps should be executed. After each timestep we let

the controller decide whether it would like to continue the execution or finish it, in which case the

current state of the memory is treated as the output.

¹ Formally each module takes two arguments. In this case the second argument is simply ignored.

² The exact formula is $\mathcal{M} := (J - p)J^T \cdot \mathcal{M} + p a^T$, where J denotes a (column) vector consisting of M
ones and · denotes coordinate-wise multiplication.


[Diagram: the registers r1 to r4 are passed in binarized form to the LSTM, which also emits the “finish?” output; a circuit of modules m1, m2, m3 maps the registers to their new values r1 to r4 and is connected to the memory tape.]

Figure 2: One timestep of the NRAM architecture with R = 4 registers. The LSTM controller gets
the “binarized” values r1 , r2 , . . . stored in the registers as inputs and outputs the description of the
circuit in the grey box and the probability of finishing the execution in the current timestep (see
Sec. 3.3 for more detail). The weights of the solid thin connections are outputted by the controller.
The weights of the solid thick connections are trainable parameters of the model. Some of the
modules (i.e. READ and WRITE) may interact with the memory tape (dashed connections).

More precisely, after the timestep t the controller outputs a scalar $f_t \in [0, 1]$³, which denotes the
willingness to finish the execution in the current timestep. Therefore, the probability that the execution
has not been finished before the timestep t is equal to $\prod_{i=1}^{t-1} (1 - f_i)$, and the probability that
the output is produced exactly at the timestep t is equal to $p_t = f_t \cdot \prod_{i=1}^{t-1} (1 - f_i)$. There is also
some maximal allowed number of timesteps T, which is a hyperparameter. The model is forced to
produce output in the last step if it has not done it yet, i.e. $p_T = 1 - \sum_{i=1}^{T-1} p_i$ regardless of the value
$f_T$.
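The stopping rule can be checked with a short sketch that turns the controller's finish scalars f_1, ..., f_T into the distribution p_1, ..., p_T over stopping times. The function name is a hypothetical choice for illustration.

```python
def halting_distribution(finish_probs):
    """Map [f_1, ..., f_T] to [p_1, ..., p_T].

    p_t = f_t * prod_{i<t} (1 - f_i) for t < T, and the last step takes
    all remaining probability mass regardless of f_T.
    """
    p = []
    not_finished = 1.0  # probability that execution has not finished before step t
    for f_t in finish_probs[:-1]:
        p.append(f_t * not_finished)
        not_finished *= 1.0 - f_t
    p.append(1.0 - sum(p))  # forced output at the last allowed timestep
    return p

p = halting_distribution([0.5, 0.5, 0.1])
```

For f = (0.5, 0.5, 0.1) this gives p = (0.5, 0.25, 0.25): the value f_3 = 0.1 is ignored because the third step is the last one allowed.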

Let $\mathcal{M}^{(t)} \in \mathbb{R}^{M \times M}$ denote the memory matrix after the timestep t, i.e. $\mathcal{M}^{(t)}_{i,j}$ is the probability that
the i-th memory cell holds the value j after the timestep t. For an input-output pair (x, y), where
$x, y \in \{0, 1, \dots, M-1\}^M$, we define the loss of the model as the expected negative log-likelihood
of producing the correct output, i.e., $-\sum_{t=1}^{T} \left( p_t \cdot \sum_{i=1}^{M} \log \mathcal{M}^{(t)}_{i, y_i} \right)$, assuming that the memory
was initialized with the sequence x⁴. Moreover, for all problems we consider the output sequence
is shorter than the memory. Therefore, we compute the loss only over the memory cells which should
contain the output.
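The loss can be written out directly from its definition. In the sketch below, `memories[t]` stands for the memory matrix after timestep t + 1 (as a list of rows), `halt_probs[t]` for the corresponding stopping probability, and `target` is a dictionary from output cell index to expected value, so that only the cells which should contain the output are scored. These names and the dictionary representation are assumptions made for the example.

```python
import math

def expected_nll(halt_probs, memories, target):
    """Expected negative log-likelihood of the correct output.

    Computes -sum_t p_t * sum_i log M^(t)[i][y_i], restricted to the
    output cells listed in `target` (cell index -> expected value).
    """
    loss = 0.0
    for p_t, memory in zip(halt_probs, memories):
        log_likelihood = sum(math.log(memory[i][y_i])
                             for i, y_i in target.items())
        loss -= p_t * log_likelihood
    return loss

# Two cells, two timesteps: uniform memory first, then the correct sharp memory.
memories = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 1.0]],
]
loss = expected_nll([0.5, 0.5], memories, target={0: 0, 1: 1})
```

In this toy case the loss equals log 2: stopping after the first step costs 2 log 2 with probability 0.5, and stopping after the second step costs nothing.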

3.4 DISCRETIZATION

costly operation. For example, computing the output of the READ module takes $\Theta(M^2)$ time as it
requires the multiplication of the matrix $\mathcal{M} \in \mathbb{R}^{M \times M}$ and the vector $p \in \mathbb{R}^M$.

One may however suspect (and we empirically verify this claim in Sec. 4) that the NRAM model

naturally learns solutions in which the distributions of intermediate values have very low entropy.

The argument for this hypothesis is that fuzziness in the intermediate values would probably prop-

agate to the output and cause a higher value of the cost function. To test this hypothesis we trained

the model and then used its discretized version during inference. In the discretized version every

module gets as inputs the values from modules (or registers), which are the most probable to produce

³ In fact, the controller outputs a scalar $x_i$ and $f_i = \mathrm{sigmoid}(x_i)$.

⁴ One could also use the negative log-likelihood of the expected output, i.e. $-\sum_{i=1}^{M} \log \left( \sum_{t=1}^{T} p_t \cdot \mathcal{M}^{(t)}_{i, y_i} \right)$
as the loss function.


the given input according to the distribution outputted by the controller. More precisely, it corresponds
to replacing the function softmax in equations (1,3) with the function returning the vector

containing 1 on the position of the maximum value in the input and zeros on all other positions.
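A minimal sketch of this replacement, with hypothetical helper names: `hard_max` returns the one-hot vector that stands in for softmax, and `discrete_select` uses it to pick a single integer input instead of a weighted average.

```python
def hard_max(logits):
    """One-hot vector with 1 at the position of the maximum value."""
    best = max(range(len(logits)), key=lambda k: logits[k])
    return [1.0 if k == best else 0.0 for k in range(len(logits))]

def discrete_select(candidates, logits):
    """Discretized counterpart of the soft selection in Eq. 1 and Eq. 3.

    `candidates` are plain integers (register values and module outputs).
    """
    weights = hard_max(logits)
    return candidates[weights.index(1.0)]

selection = hard_max([0.2, 1.7, -0.3])
chosen = discrete_select([3, 7, 1], [0.2, 1.7, -0.3])
```

Because exactly one candidate is selected, every register and memory cell holds a single integer and no distribution has to be propagated.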

Notice that in the discretized NRAM model each register and memory cell stores an integer from

the set {0, 1, . . . , M − 1} and therefore all modules may be executed efficiently (assuming that

the functions represented by the modules can be efficiently computed). In case of a feedforward

controller and a small (e.g. ≤ 20) number of registers the inference can be accelerated even
further. Recall that the only inputs to the controller are binarized values of the registers. Therefore,
instead of executing the controller one may simply precompute the (discretized) controller’s output
for each configuration of the registers’ binarized values. Such an algorithm would enjoy an extremely

efficient implementation in machine code.
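The precomputation amounts to tabulating the controller over all 2^R binary register patterns. The sketch below assumes a feedforward controller exposed as a Python function of the binarized register tuple; `toy_controller` is a hypothetical stand-in, not the trained network.

```python
from itertools import product

def precompute_controller(controller, num_registers):
    """Tabulate the discretized controller output for every binarized input."""
    return {bits: controller(bits)
            for bits in product((0, 1), repeat=num_registers)}

def toy_controller(bits):
    """Hypothetical stand-in for a discretized feedforward controller.

    A real controller would return a full circuit description; here the
    output is just a small integer identifying one of two circuits.
    """
    return sum(bits) % 2

table = precompute_controller(toy_controller, num_registers=3)
circuit_id = table[(1, 0, 0)]  # constant-time lookup at inference
```

The table has 2^R entries, which is why the approach is only practical for a small number of registers.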

4 EXPERIMENTS

The NRAM model is fully differentiable and we trained it using the Adam optimization algorithm

(Kingma & Ba, 2014) with the negative log-likelihood cost function. Notice that we do not use any

additional supervised data (such as memory access traces) beyond pure input-output examples.

We used multilayer perceptrons (MLPs) with two hidden layers or LSTMs with a hidden layer

between input and LSTM cells as controllers. The number of hidden units in each layer was equal.

The ReLU nonlinearity (Nair & Hinton, 2010) was used in all experiments.

Below are some important techniques that we used in the training:

Curriculum learning As noticed in several papers (Bengio et al., 2009; Zaremba & Sutskever,

2014), curriculum learning is crucial for training deep networks on very complicated problems. We

followed the curriculum learning schedule from Zaremba & Sutskever (2014) without any modifi-

cations. The details can be found in Appendix B.

Gradient clipping Notice that the depth of the unfolded execution is roughly a product of the

number of timesteps and the number of modules. Even for moderately small experiments (e.g. 14

modules and 20 timesteps) this value easily exceeds a few hundred. In networks of such depth,

the gradients can often “explode” (Bengio et al., 1994), which makes training by backpropagation

much harder. We noticed that the gradients w.r.t. the intermediate values inside the backpropagation

were so large, that they sometimes led to an overflow in single-precision floating-point arithmetic.

Therefore, we clipped the gradients w.r.t. the activations, within the execution of the backpropaga-

tion algorithm. More precisely, each coordinate is separately cropped into the range [−C1 , C1 ] for

some constant C1 . Before updating parameters, we also globally rescale the whole gradient vector,

so that its L2 norm is not bigger than some constant value C2 .
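The two clipping operations can be sketched as follows. Gradients are represented as flat lists of floats, and the function names are hypothetical; in the setting described above the coordinate-wise clipping is applied to gradients with respect to activations inside backpropagation, and the global rescaling to the full parameter gradient before the update.

```python
import math

def clip_coordinates(grad, c1):
    """Crop each coordinate separately into the range [-c1, c1]."""
    return [max(-c1, min(c1, g)) for g in grad]

def rescale_global_norm(grad, c2):
    """Rescale the whole vector so that its L2 norm is at most c2."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= c2:
        return list(grad)
    return [g * (c2 / norm) for g in grad]

clipped = clip_coordinates([3.0, -4.0, 0.5], c1=1.0)
rescaled = rescale_global_norm([3.0, -4.0], c2=1.0)  # norm 5 is scaled down to 1
```

Coordinate clipping changes the direction of the vector, whereas global rescaling preserves it and only shortens the step.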

Noise We added random Gaussian noise to the computed gradients after the backpropagation step.

The variance of this noise decays exponentially during the training. The details can be found in

Neelakantan et al. (2015).

Enforcing Distribution Constraints For very deep networks, a small error in one place can prop-

agate to a huge error in some other place. This was the case with our pointers: they are probability

distributions over memory cells and they should sum up to 1. However, after a number of operations

are applied, they can accumulate error as a result of inaccurate floating-point arithmetic.

We have a special layer which is responsible for rescaling all values (multiplying by the inverse of

their sum), to make sure they always represent a probability distribution. We add this layer to our

model in a few critical places (e.g. after the softmax operation)⁵.

5

We do not however backpropagate through these renormalizing operations, i.e. during the backward pass

we simply assume that they are identities.


Entropy While searching for a solution, the network can fix the pointer distribution on some

particular value. This is advantageous at the end of training, because ideally we would like to be

able to discretize the model. However, if this happens at the beginning of training, it could force the
network to stay in a local minimum, with a small chance of moving the probability mass to some
other value. To address this problem, we encourage the network to explore the space of solutions by
adding an “entropy bonus” that decreases over time. More precisely, for every distribution outputted

by the controller, we subtract from the cost function the entropy of the distribution multiplied by

some coefficient, which decreases exponentially during the training.

Limiting the values of logarithms There are two places in our model where the logarithms are

computed: in the cost function and in the entropy computation. Inputs to these logarithms can
be very small numbers, which may cause very big values of the cost function or even overflows in
floating-point arithmetic. To prevent this phenomenon we use log(max(x, ε)) instead of log(x) for
some small hyperparameter ε whenever a logarithm is computed.
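The entropy bonus and the bounded logarithm fit together naturally, as the following sketch shows. The constants `eps`, `coef0` and `decay` are illustrative assumptions; the text only states that the logarithm input is bounded below by a small hyperparameter and that the entropy coefficient decreases exponentially during training.

```python
import math

def safe_log(x, eps=1e-6):
    """log(max(x, eps)): avoids huge values when x is zero or tiny."""
    return math.log(max(x, eps))

def entropy(dist, eps=1e-6):
    """Shannon entropy of a distribution, computed with the bounded logarithm."""
    return -sum(p * safe_log(p, eps) for p in dist)

def cost_with_entropy_bonus(cost, controller_dists, step,
                            coef0=0.1, decay=0.999):
    """Subtract the entropy of every controller distribution from the cost.

    The coefficient coef0 * decay**step decays exponentially with the
    training step (coef0 and decay are assumed values).
    """
    coef = coef0 * decay ** step
    return cost - coef * sum(entropy(d) for d in controller_dists)

early = cost_with_entropy_bonus(1.0, [[0.5, 0.5]], step=0)
late = cost_with_entropy_bonus(1.0, [[0.5, 0.5]], step=10000)
```

Early in training a spread-out distribution lowers the cost noticeably, which encourages exploration; late in training the bonus is close to zero, so sharp distributions are no longer penalized.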

4.2 TASKS

We now describe the tasks used in our experiments. For every task, the input is given to the network

in the memory tape, and the network’s goal is to modify the memory according to the task’s specification.
We allow the network to modify the original input. The final error for a test case is computed
as $c/m$, where c is the number of correctly written cells, and m represents the total number of cells
that should be modified.

Due to limited space, we describe the tasks only briefly here. The detailed memory layout of inputs

and outputs can be found in the Appendix A.

1. Access Given a value k and an array A, return A[k].

2. Increment Given an array, increment all its elements by 1.

3. Copy Given an array and a pointer to the destination, copy all elements from the array to

the given location.

4. Reverse Given an array and a pointer to the destination, copy all elements from the array

in reversed order.

5. Swap Given two pointers p, q and an array A, swap elements A[p] and A[q].

6. Permutation Given two arrays of n elements: P (contains a permutation of numbers

1, . . . , n) and A (contains random elements), permute A according to P .

7. ListK Given a pointer to the head of a linked list and a number k, find the value of the k-th

element on the list.

8. ListSearch Given a pointer to the head of a linked list and a value v to find, return a pointer

to the first node on the list with the value v.

9. Merge Given pointers to 2 sorted arrays A and B, merge them.

10. WalkBST Given a pointer to the root of a Binary Search Tree, and a path to be traversed

(sequence of left/right steps), return the element at the end of the path.

4.3 MODULES

In all of our experiments we used the same sequence of 14 modules: READ (described in Sec. 3.2),

ZERO(a, b) = 0, ONE(a, b) = 1, TWO(a, b) = 2, INC(a, b) = (a+1) mod M , ADD(a, b) = (a+b)

mod M , SUB(a, b) = (a − b) mod M , DEC(a, b) = (a − 1) mod M , LESS-THAN(a, b) = [a <

b], LESS-OR-EQUAL-THAN(a, b) = [a ≤ b], EQUALITY-TEST(a, b) = [a = b], MIN(a, b) =

min(a, b), MAX(a, b) = max(a, b), WRITE (described in Sec. 3.2).
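The twelve arithmetic and comparison modules listed above translate directly into integer functions. The sketch below collects them in a dictionary keyed by module name; READ and WRITE are left out because they additionally need access to the memory tape. The function `make_modules` is a hypothetical helper, and M ≥ 3 is assumed so that the constant 2 is a valid value.

```python
def make_modules(M):
    """Integer versions of the non-memory modules for values in {0, ..., M-1}."""
    return {
        "ZERO": lambda a, b: 0,
        "ONE": lambda a, b: 1,
        "TWO": lambda a, b: 2,
        "INC": lambda a, b: (a + 1) % M,
        "ADD": lambda a, b: (a + b) % M,
        "SUB": lambda a, b: (a - b) % M,
        "DEC": lambda a, b: (a - 1) % M,
        "LESS-THAN": lambda a, b: int(a < b),
        "LESS-OR-EQUAL-THAN": lambda a, b: int(a <= b),
        "EQUALITY-TEST": lambda a, b: int(a == b),
        "MIN": lambda a, b: min(a, b),
        "MAX": lambda a, b: max(a, b),
    }

modules = make_modules(12)
wrapped = modules["DEC"](0, 0)      # decrementing 0 wraps around to M - 1
difference = modules["SUB"](2, 5)   # (2 - 5) mod 12
```

Every module takes two arguments even when it ignores one or both of them, which keeps the circuit description uniform.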

We also considered settings in which the module sequence is repeated many times, e.g. there are 28

modules, where modules number 1. and 15. are READ, modules number 2. and 16. are ZERO and so

on. The number of repetitions is a hyperparameter.


Task          Train complexity          Train error   Generalization   Discretization
Access        len(A) ≤ 20               0             perfect          perfect
Increment     len(A) ≤ 15               0             perfect          perfect
Copy          len(A) ≤ 15               0             perfect          perfect
Reverse       len(A) ≤ 15               0             perfect          perfect
Swap          len(A) ≤ 20               0             perfect          perfect
Permutation   len(A) ≤ 6                0             almost perfect   perfect
ListK         len(list) ≤ 10            0             strong           hurts performance
ListSearch    len(list) ≤ 6             0             weak             hurts performance
Merge         len(A) + len(B) ≤ 10      1%            weak             hurts performance
WalkBST       size(tree) ≤ 10           0.3%          strong           hurts performance

Table 1: Results of the experiments. The perfect generalization error means that the tested problem
had error 0 for complexity up to 50. Exact generalization errors are presented in Fig. 3. The perfect
discretization means that the discretized version of the model produced exactly the same outputs as
the original model on all test cases.

[Plot: test error (vertical axis, 0.00 to 0.45) against max task complexity (horizontal axis, 10 to 30) for Merge, WalkBST, ListK, ListSearch and Permutation.]

Figure 3: Generalization errors for hard tasks. The Permutation and ListSearch problems were
trained only up to complexity 6. The remaining problems were trained up to complexity 10. The
horizontal axis denotes the maximal task complexity, i.e., x = 20 denotes results with complexity
sampled uniformly from the interval [1, 20].

4.4 RESULTS

Overall, we were able to find parameters that achieved an error 0 for all problems except Merge and

WalkBST (where we got an error of ≤ 1%). As described in 4.2, our metric is an accuracy on the

memory cells that should be modified. To compute it, we take the continuous memory state produced

by our network, then discretize it (every cell will contain the value with the highest probability), and

finally compare with the expected output. The results of the experiments are summarized in Table 1.

Below we describe our results on all 10 tasks in more detail. We divide them into 2 categories:
“easy” and “hard” tasks. Easy tasks are those that achieved low error scores for many
sets of parameters, so we did not have to spend much time tuning them. The first 5 problems
from our task list belong to this category. Hard tasks, on the other hand, are problems that
trained to a low error rate only in a very small number of cases, e.g. 1 out of 100.

The easy category includes the following problems: Access, Increment, Copy, Reverse, Swap. For
all of them we were able to find, without much effort, many sets of hyperparameters that achieved
error 0 or close to it.


Step   Memory cells 0 to 11                       r1  r2  r3  r4   READ   WRITE
 1     6  2 10  6  8  9  0  0  0  0  0  0         0   0   0   0    p:0    p:0  a:6
 2     6  2 10  6  8  9  0  0  0  0  0  0         0   5   0   1    p:1    p:6  a:2
 3     6  2 10  6  8  9  2  0  0  0  0  0         0   5   1   1    p:1    p:6  a:2
 4     6  2 10  6  8  9  2  0  0  0  0  0         0   5   1   2    p:2    p:7  a:10
 5     6  2 10  6  8  9  2 10  0  0  0  0         0   5   2   2    p:2    p:7  a:10
 6     6  2 10  6  8  9  2 10  0  0  0  0         0   5   2   3    p:3    p:8  a:6
 7     6  2 10  6  8  9  2 10  6  0  0  0         0   5   3   3    p:3    p:8  a:6
 8     6  2 10  6  8  9  2 10  6  0  0  0         0   5   3   4    p:4    p:9  a:8
 9     6  2 10  6  8  9  2 10  6  8  0  0         0   5   4   4    p:4    p:9  a:8
10     6  2 10  6  8  9  2 10  6  8  0  0         0   5   4   5    p:5    p:10 a:9
11     6  2 10  6  8  9  2 10  6  8  9  0         0   5   5   5    p:5    p:10 a:9

Table 2: State of memory and registers for the Copy problem at the start of every timestep. We also show

the arguments given to the READ and WRITE functions in each timestep. The argument “p:” represents the

source/destination address and “a:” represents the value to be written (for WRITE). The value 6 at position 0

in the memory is the pointer to the destination array. It is followed by 5 values (gray columns) that should be

copied.

We also tested how those solutions generalize to longer input sequences. To do this, for every

problem we selected a model that achieved error 0 during the training, and tested it on inputs with

lengths up to 50⁶. To perform these tests we also increased the memory size and the number of

allowed timesteps.

In all cases the model solved the problem perfectly, which shows that it generalizes not only to longer

input sequences, but also to different memory sizes and numbers of allowed timesteps. Moreover,

the discretized version of the model (see Sec. 3.4 for details) also solves all the problems perfectly.

These results show that the NRAM model naturally learns “algorithmic” solutions, which generalize

well.

We were also interested in whether the solutions found generalize to sequences of arbitrary length. This is
easiest to verify in the case of a discretized model with a feedforward controller, because then the
circuits outputted by the controller depend solely on the values of registers, which are integers. We
manually analysed the circuits for the problems Copy and Increment and verified that the solutions found
generalize to inputs of arbitrary length, assuming that the number of allowed timesteps is appropriate.

The hard tasks are Permutation, ListK, ListSearch, Merge and WalkBST. For all of them we had to
perform an extensive random search to find a good set of hyperparameters. Usually, most of the
parameter combinations were stuck on the starting curriculum level with a high error of 50%−70%.
For the first 3 tasks we managed to train the network to achieve error 0. For WalkBST and Merge the
training errors were 0.3% and 1% respectively (Table 1). For training those problems we had to
introduce additional techniques described in Sec. 4.1.

For Permutation, ListK and WalkBST our model generalizes very well and achieves low error rates
on inputs at least twice as long as the ones seen during the training. The exact generalization errors
are shown in Fig. 3.

[Diagram: circuit with registers r1 to r4 as inputs and r1' to r4' as outputs, using the READ, WRITE, ADD, INC and MIN modules.]

Figure 4: The circuit generated at every timestep ≥ 2. The values of the pointer (p) for READ,
WRITE and the value to be written (a) for WRITE are presented in Table 2. The modules whose
outputs are not used were removed from the picture.

The only hard problem on which our model discretizes well is Permutation: on this task

6

Unfortunately we could not test for lengths longer than 50 due to the memory restrictions.


the discretized version of the model produces exactly the same outputs as the original model on all

cases tested. For the remaining four problems the discretized versions of the models perform very
poorly (error rates ≥ 70%). We believe that better results may be obtained by using some techniques
encouraging discretization during the training⁷.

We noticed that the training procedure is very unstable and the error often rises from a few percent
to e.g. 70% in just one epoch. Moreover, even if we use the best set of hyperparameters found, the
percentage of random seeds that converge to error 0 was usually about 11%. We observed that
the percentage of converging seeds is much lower if we do not add noise to the gradient: in this case
only about 1% of seeds converge.

A comparison to other models is challenging because we are the first to consider problems with

pointers. The NTM can solve tasks like Copy or Reverse, but it suffers from the inability to naturally

store a pointer to a fixed location in the memory. This makes it unlikely that it could solve tasks such

as ListK, ListSearch or WalkBST since the pointers used in these tasks refer to absolute positions.

What distinguishes our model from most of the previous attempts (including NTMs, Memory Net-

works, Pointer Networks) is the lack of content-based addressing. It was a deliberate design deci-

sion, since this kind of addressing inherently slows down the memory access. In contrast, our model

— if discretized — can access the memory in a constant time.

The NRAM is also the first model that we are aware of employing a differentiable mechanism for

deciding when to finish the computation.

We present one example execution of our model for the problem Copy. For the example, we use

a very small model with 12 memory cells, 4 registers and the standard set of 14 modules. The

controller for this model is a feedforward network, and we run it for 11 timesteps. Table 2 contains,

for every timestep, the state of the memory and registers at the beginning of the timestep.

The model can execute different circuits at different timesteps. In particular, we observed that the

first circuit is slightly different from the rest, since it needs to handle the initialization. Starting from

the second step all generated circuits are the same. We present this circuit in Fig. 4. The register r2

is constant and keeps the offset between the destination array and the source array (6 − 1 = 5 in

this case). The register r3 is responsible for incrementing the pointer in the source array. Its value is

copied to r4⁸, the register used by the READ module. For the WRITE module, it also uses r4 which

is shifted by r2 . The register r1 is unused. This solution generalizes to sequences of arbitrary length.
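The trace in Table 2 can be replayed with a short simulation of the discretized circuit for timesteps 2 to 11. This is a sketch under one reading of Table 2 and Fig. 4: READ uses r4 as its pointer, WRITE stores the value read at address ADD(r4, r2), and r4 is updated to MIN(INC(r3), r2). The update r3 := r4 is an assumption inferred from the register columns of the table. The first timestep uses a different circuit and is not modelled, so the simulation starts from the state listed for timestep 2.

```python
M = 12  # number of memory cells in the example

def copy_step(memory, regs):
    """One timestep (>= 2) of the discretized Copy circuit."""
    r1, r2, r3, r4 = regs
    value = memory[r4]                  # READ(p = r4)
    memory = list(memory)
    memory[(r4 + r2) % M] = value       # WRITE(p = ADD(r4, r2), a = value)
    new_r3 = r4                         # assumption inferred from Table 2
    new_r4 = min((r3 + 1) % M, r2)      # MIN(INC(r3), r2)
    return memory, (r1, r2, new_r3, new_r4)

# State at the start of timestep 2, taken from Table 2.
memory = [6, 2, 10, 6, 8, 9, 0, 0, 0, 0, 0, 0]
regs = (0, 5, 0, 1)
trace = []
for step in range(2, 12):               # timesteps 2, ..., 11
    trace.append((step, list(memory), regs))
    memory, regs = copy_step(memory, regs)
```

Under these assumptions the simulated states at the start of each timestep agree with the rows of Table 2, and after timestep 11 the five source values 2, 10, 6, 8, 9 occupy cells 6 to 10. Each value is written twice because r3 and r4 advance on alternating timesteps.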

5 CONCLUSIONS

In this paper we presented the Neural Random-Access Machine, which can learn to solve problems

that require explicit manipulation and dereferencing of pointers.

We showed that this model can learn to solve a number of algorithmic problems and generalize well

to inputs longer than ones seen during the training. In particular, for some problems it generalizes

to inputs of arbitrary length.

However, we noticed that the optimization problem resulting from backpropagating through the

execution trace of the program is very challenging for standard optimization techniques. It seems

likely that a method that can search in an easier “abstract” space would be more effective at solving

such problems.

7

One could for example add at later stages of training a penalty proportional to the entropy of the interme-

diate values of registers/memory.

8

In our case r3 < r2 , so the MIN module always outputs the value r3 + 1. It is not satisfied in the last

timestep, but then the array is already copied.


REFERENCES

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly

learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gra-

dient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.

Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In

Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM,

2009.

Chan, William, Jaitly, Navdeep, Le, Quoc V, and Vinyals, Oriol. Listen, attend and spell. arXiv

preprint arXiv:1508.01211, 2015.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint

arXiv:1410.5401, 2014.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to

transduce with unbounded memory. arXiv preprint arXiv:1506.02516, 2015.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):

1735–1780, 1997.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent

nets. arXiv preprint arXiv:1503.01007, 2015.

Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. arXiv preprint

arXiv:1507.01526, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint

arXiv:1412.6980, 2014.

Luong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-

based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines.

In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–

814, 2010.

Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V, Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and

Martens, James. Adding gradient noise improves learning for very deep networks. arXiv preprint

arXiv:1511.06807, 2015.

Solomonoff, Ray J. A formal theory of inductive inference. part i. Information and control, 7(1):

1–22, 1964.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory net-

works. arXiv preprint arXiv:1503.08895, 2015.

Vinyals, Oriol, Kaiser, Lukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey.

Grammar as a foreign language. arXiv preprint arXiv:1412.7449, 2014.

Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. arXiv preprint

arXiv:1506.03134, 2015.

Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory networks. arXiv preprint

arXiv:1410.3916, 2014.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.

Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural turing machines. arXiv

preprint arXiv:1505.00521, 2015.


In this section we describe in detail the memory layout of inputs and outputs for the tasks used in

our experiments. In all descriptions below, capital letters represent arrays and lowercase letters represent

pointers. NULL denotes the value 0 and is used to mark the end of an array or a missing next

element in a list or a binary tree.

1. Access Given a value k and an array A, return A[k]. Input is given as k, A[0], ..., A[n − 1], NULL

and the network should replace the first memory cell with A[k].

2. Increment Given an array A, increment all its elements by 1. Input is given as

A[0], ..., A[n − 1], NULL and the expected output is A[0] + 1, ..., A[n − 1] + 1.

3. Copy Given an array and a pointer to the destination, copy all elements from the array to

the given location. Input is given as p, A[0], ..., A[n−1] where p points to one element after

A[n − 1]. The expected output is A[0], ..., A[n − 1] at positions p, ..., p + n − 1 respectively.

4. Reverse Given an array and a pointer to the destination, copy all elements from the array

in reversed order. Input is given as p, A[0], ..., A[n − 1] where p points one element after

A[n − 1]. The expected output is A[n − 1], ..., A[0] at positions p, ..., p + n − 1 respectively.

5. Swap Given two pointers p, q and an array A, swap elements A[p] and A[q]. Input is

given as p, q, A[0], .., A[p], ..., A[q], ..., A[n − 1], 0. The expected modified array A is:

A[0], ..., A[q], ..., A[p], ..., A[n − 1].

6. Permutation Given two arrays of n elements, P (containing a permutation of the numbers

0, ..., n − 1) and A (containing random elements), permute A according to P. Input is

given as a, P[0], ..., P[n − 1], A[0], ..., A[n − 1], where a is a pointer to the array A. The

expected output is A[P[0]], ..., A[P[n − 1]], which should overwrite the array P.

7. ListK Given a pointer to the head of a linked list and a number k, find the value of the

k-th element on the list. List nodes are represented as two adjacent memory cells: a pointer

to the next node and a value. Elements are in random locations in the memory, so that

the network needs to follow the pointers to find the correct element. Input is given as:

head, k, out, ... where head is a pointer to the first node on the list, k indicates how many

hops are needed and out is a cell where the output should be put.

8. ListSearch Given a pointer to the head of a linked list and a value v to find, return a pointer

to the first node on the list with the value v. The list is placed in memory in the same way

as in the task ListK. We fill empty memory with “trash” values to prevent the network from

“cheating” and just iterating over the whole memory.

9. Merge Given pointers to 2 sorted arrays A and B, and the pointer to the output o,

merge the two arrays into one sorted array. The input is given as: a, b, o, A[0], .., A[n −

1], G, B[0], ..., B[m − 1], G, where G is a special guardian value, a and b point to the first

elements of arrays A and B respectively, and o points to the address after the second G.

The n + m elements should be written in the correct order starting from position o.

10. WalkBST Given a pointer to the root of a Binary Search Tree, and a path to be traversed,

return the element at the end of the path. The BST nodes are represented as triples (v, l,

r), where v is the value, and l, r are pointers to the left/right child. The triples are placed

randomly in the memory. Input is given as root, out, d1, d2, ..., dk, NULL, ..., where root

points to the root node and out is a slot for the output. The sequence d1, ..., dk, di ∈ {0, 1},

represents the path to be traversed: di = 0 means that the network should go to the left

child, di = 1 represents going to the right child.
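The layouts above can be made concrete with a short Python sketch. It builds the memory for the Reverse task and gives a reference solution for ListK. The function names, the 16-cell memory size, and the reading of pointers as absolute memory addresses are assumptions made for illustration; they are not taken from the authors' code.

```python
NULL = 0  # the value 0 marks the end of an array or a missing next element


def reverse_task(array, memory_size=16):
    """Return (input memory, expected memory) for the Reverse task.

    Assumed layout: cell 0 holds the pointer p, cells 1..n hold the array,
    and p is the absolute address of the first cell after the array.
    """
    n = len(array)
    p = n + 1
    assert p + n <= memory_size, "memory too small for input plus output"
    memory = [p] + list(array) + [NULL] * (memory_size - n - 1)
    expected = list(memory)
    expected[p:p + n] = list(reversed(array))
    return memory, expected


def list_k(memory, head, k):
    """Reference solution for ListK: follow k next-pointers starting at head.

    A node at address a stores its next pointer in memory[a] and its value
    in memory[a + 1], as in the task description.
    """
    node = head
    for _ in range(k):
        node = memory[node]
    return memory[node + 1]


memory, expected = reverse_task([8, 1, 3, 5, 1, 1, 2])
print(memory)
print(expected)
```

With the array 8, 1, 3, 5, 1, 1, 2 the pointer p equals 8, and the reversed copy occupies cells 8 to 14.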


As noticed in several papers (Bengio et al., 2009; Zaremba & Sutskever, 2014), curriculum learning

is crucial for training deep networks on very complicated problems. We followed the curriculum

learning schedule from Zaremba & Sutskever (2014) without any modifications.

For each of the tasks we have manually defined a sequence of subtasks with increasing difficulty,

where the difficulty is usually measured by the length of the input sequence. During training the

input-output examples are sampled from a distribution that is determined by the current difficulty

level D. The level is increased (up to some maximal value) whenever the error rate of the model

goes below some threshold. Moreover, we ensure that successive increases of D are separated by

some number of batches.

In more detail, to generate an input-output example we first sample a difficulty d from a distribution

determined by the current level D and then draw the example with the difficulty d. The procedure

for sampling d is the following:

• with probability 10%: pick d uniformly at random from the set of all possible difficulties;

• with probability 25%: pick d uniformly from [1, D + e], where e is a sample from a geo-

metric distribution with a success probability 1/2;

• with probability 65%: set d = D + e, where e is sampled as above.

Notice that the above procedure guarantees that every difficulty d can be picked regardless of the

current level D, which has been shown to increase performance (Zaremba & Sutskever, 2014).
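The sampling procedure can be sketched in Python as follows. Several details are assumptions rather than facts from the paper: difficulties are taken to be the integers 1..max_difficulty, the geometric variable e counts trials up to and including the first success (so e ≥ 1), and the result is clipped to max_difficulty.

```python
import random


def sample_geometric(rng, p=0.5):
    """Trials up to and including the first success (assumed support 1, 2, ...)."""
    e = 1
    while rng.random() >= p:
        e += 1
    return e


def sample_difficulty(D, max_difficulty, rng=random):
    """Sample a difficulty d given the current curriculum level D."""
    r = rng.random()
    if r < 0.10:
        # 10%: uniform over all possible difficulties
        d = rng.randint(1, max_difficulty)
    elif r < 0.35:
        # 25%: uniform over [1, D + e]
        d = rng.randint(1, D + sample_geometric(rng))
    else:
        # 65%: exactly D + e
        d = D + sample_geometric(rng)
    return min(d, max_difficulty)  # clipping is an assumption of this sketch


rng = random.Random(0)
samples = [sample_difficulty(D=5, max_difficulty=20, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))
```

Because the first branch ignores D, every difficulty keeps a nonzero probability at every level, which is the property the text points out.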

C EXAMPLE CIRCUITS

Below are presented example circuits generated during training for all simple tasks (except Copy

which was presented in the paper). For modules READ and WRITE, the value of the first argument

(pointer to the address to be read/written) is marked as p. For WRITE, the value to be written

is marked as a and the value returned by this module is always 0. For modules LESS-THAN and

LESS-OR-EQUAL-THAN the first parameter is marked as x and the second one as y. Other modules

either have only one parameter or the order of parameters is not important.

For all tasks below (except Increment), the circuit generated at timestep 1 is different than circuits

generated at steps ≥ 2, which are the same. This is because the first circuit needs to handle the

initialization. We present only the "main" circuits generated for timesteps ≥ 2.

C.1 ACCESS

Figure 5: The circuit generated at every timestep ≥ 2 for the task Access. (Circuit diagram not reproduced; its nodes are the modules read, write, inc, lt and min, the constant 0, and the registers r1 and r2.)

Step  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  r1 r2
   1  3  1 12  4  7 12  1 13  8  2  1  3 11 11 12  0   0  0
   2  3  1 12  4  7 12  1 13  8  2  1  3 11 11 12  0   3  0
   3  4  1 12  4  7 12  1 13  8  2  1  3 11 11 12  0   3  0

Table 3: Memory for task Access. Only the first memory cell is modified.

C.2 INCREMENT

Figure 6: The circuit generated at every timestep for the task Increment. (Circuit diagram not reproduced; its nodes are the modules read, write, inc, add, min and max, the constant 1, and the registers r1 to r5.)

Step  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  r1 r2 r3 r4 r5
   1  1 11  3  8  1  2  9  8  5  3  0  0  0  0  0  0   0  0  0  0  0
   2  2 11  3  8  1  2  9  8  5  3  0  0  0  0  0  0   1  2  2  2  1
   3  2 12  3  8  1  2  9  8  5  3  0  0  0  0  0  0   2 12 12 12  2
   4  2 12  4  8  1  2  9  8  5  3  0  0  0  0  0  0   3  4  4  4  3
   5  2 12  4  9  1  2  9  8  5  3  0  0  0  0  0  0   4  9  9  9  4
   6  2 12  4  9  2  2  9  8  5  3  0  0  0  0  0  0   5  2  2  2  5
   7  2 12  4  9  2  3  9  8  5  3  0  0  0  0  0  0   6  3  3  3  6
   8  2 12  4  9  2  3 10  8  5  3  0  0  0  0  0  0   7 10 10 10  7
   9  2 12  4  9  2  3 10  9  5  3  0  0  0  0  0  0   8  9  9  9  8
  10  2 12  4  9  2  3 10  9  6  3  0  0  0  0  0  0   9  6  6  6  9
  11  2 12  4  9  2  3 10  9  6  4  0  0  0  0  0  0  10  4  4  4 10

C.3 REVERSE

Figure 7: The circuit generated at every timestep ≥ 2 for the task Reverse. (Circuit diagram not reproduced; its nodes are the modules read, write, inc, dec, add, sub, le, min and max, the constant 1, and the registers r1 to r4.)

Step  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  r1 r2 r3 r4
   1  8  8  1  3  5  1  1  2  0  0  0  0  0  0  0  0   0  0  0  0
   2  8  8  1  3  5  1  1  2  0  0  0  0  0  0  0  0   8  0  1  1
   3  8  8  1  3  5  1  1  2  0  0  0  0  0  0  8  0   8  1  2  1
   4  8  8  1  3  5  1  1  2  0  0  0  0  0  1  8  0   8  1  3  1
   5  8  8  1  3  5  1  1  2  0  0  0  0  3  1  8  0   8  1  4  1
   6  8  8  1  3  5  1  1  2  0  0  0  5  3  1  8  0   8  1  5  1
   7  8  8  1  3  5  1  1  2  0  0  1  5  3  1  8  0   8  1  6  1
   8  8  8  1  3  5  1  1  2  0  1  1  5  3  1  8  0   8  1  7  1
   9  8  8  1  3  5  1  1  2  2  1  1  5  3  1  8  0   8  1  8  1
  10  8  8  1  3  5  1  1  2  2  1  1  5  3  1  8  0   8  1  9  1

C.4 SWAP

For swap we observed that 2 different circuits are generated, one for even timesteps, one for odd

timesteps.

Figure 8: The circuit generated at every even timestep for the task Swap. (Circuit diagram not reproduced; its nodes are the modules read, write, max and add and the registers r1 and r2.)

Figure 9: The circuit generated at every odd timestep ≥ 3 for the task Swap. (Circuit diagram not reproduced; its nodes are the modules read, write, max, add, sub, inc, eq, lt and le, the constants 1 and 2, and the registers r1 and r2.)

Step  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  r1 r2
   1  4 13  6 10  5  4  6  3  7  1  1 11 13 12  0  0   0  0
   2  5 13  6 10  5  4  6  3  7  1  1 11 13 12  0  0   1  4
   3  5 13  6 10 12  4  6  3  7  1  1 11 13 12  0  0   0 13
   4  5 13  6 10 12  4  6  3  7  1  1 11 13  5  0  0   5  1


Published as a conference paper at ICLR 2016

Łukasz Kaiser & Ilya Sutskever

Google Brain {lukaszkaiser,ilyasu}@google.com

ABSTRACT

arXiv:1511.08228v3 [cs.LG] 15 Mar 2016

widely studied. It has been addressed using neural networks too, in particular by

Neural Turing Machines (NTMs). These are fully differentiable computers that

use backpropagation to learn their own programming. Despite their appeal NTMs

have a weakness that is caused by their sequential nature: they are not parallel and

are hard to train due to their large depth when unfolded.

We present a neural network architecture to address this problem: the Neural

GPU. It is based on a type of convolutional gated recurrent unit and, like the

NTM, is computationally universal. Unlike the NTM, the Neural GPU is highly

parallel which makes it easier to train and efficient to run.

An essential property of algorithms is their ability to handle inputs of arbitrary

size. We show that the Neural GPU can be trained on short instances of an al-

gorithmic task and successfully generalize to long instances. We verified it on a

number of tasks including long addition and long multiplication of numbers rep-

resented in binary. We train the Neural GPU on numbers with up-to 20 bits and

observe no errors whatsoever while testing it, even on much longer numbers.

To achieve these results we introduce a technique for training deep recurrent net-

works: parameter sharing relaxation. We also found a small amount of dropout

and gradient noise to have a large positive effect on learning and generalization.

1 INTRODUCTION

Deep neural networks have recently proven successful at various tasks, such as computer vision

(Krizhevsky et al., 2012), speech recognition (Dahl et al., 2012), and in other domains. Recurrent

neural networks based on long short-term memory (LSTM) cells (Hochreiter & Schmidhuber, 1997)

have been successfully applied to a number of natural language processing tasks. Sequence-to-

sequence recurrent neural networks with such cells can learn very complex tasks in an end-to-end

manner, such as translation (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014), parsing

(Vinyals & Kaiser et al., 2015), speech recognition (Chan et al., 2016) or image caption generation

(Vinyals et al., 2014). Since so many tasks can be solved with essentially one model, a natural

question arises: is this model the best we can hope for in supervised learning?

Despite its recent success, the sequence-to-sequence model has limitations. In its basic form, the

entire input is encoded into a single fixed-size vector, so the model cannot generalize to inputs much

longer than this fixed capacity. One way to resolve this problem is by using an attention mechanism

(Bahdanau et al., 2014). This allows the network to inspect arbitrary parts of the input in every de-

coding step, so the basic limitation is removed. But other problems remain, and Joulin & Mikolov

(2015) show a number of basic algorithmic tasks on which sequence-to-sequence LSTM networks

fail to generalize. They propose a stack-augmented recurrent network, and it works on some prob-

lems, but is limited in other ways.

In the best case one would desire a neural network model able to learn arbitrarily complex algorithms

given enough resources. Neural Turing Machines (Graves et al., 2014) have this theoretical property.

However, they are not computationally efficient because they use soft attention and because they tend

to be of considerable depth. Their depth makes the training objective difficult to optimize and im-

possible to parallelize because they are learning a sequential program. Their use of soft attention

requires accessing the entire memory in order to simulate 1 step of computation, which introduces

substantial overhead. These two factors make learning complex algorithms using Neural Turing Machines

difficult. These issues are not limited to Neural Turing Machines; they apply to other architectures

too, such as stack-RNNs (Joulin & Mikolov, 2015) or (De)Queue-RNNs (Grefenstette et al.,

2015). One can try to alleviate these problems using hard attention and reinforcement learning, but

such non-differentiable models do not learn well at present (Zaremba & Sutskever, 2015b).

In this work we present a neural network model, the Neural GPU, that addresses the above issues.

It is a Turing-complete model capable of learning arbitrary algorithms in principle, like a Neural

Turing Machine. But, in contrast to Neural Turing Machines, it is designed to be as parallel and as

shallow as possible. It is more similar to a GPU than to a Turing machine since it uses a smaller num-

ber of parallel computational steps. We show that the Neural GPU works in multiple experiments:

• A Neural GPU can learn long binary multiplication from examples. It is the first neural

network able to learn an algorithm whose run-time is superlinear in the size of its input.

Trained on up-to 20-bit numbers, we see no single error on any inputs we tested, and we

tested on numbers up-to 2000 bits long.

• The same architecture can also learn long binary addition and a number of other algorith-

mic tasks, such as counting, copying sequences, reversing them, or duplicating them.

The learning of algorithms with neural networks has seen a lot of interest after the success

of sequence-to-sequence neural networks on language processing tasks (Sutskever et al., 2014;

Bahdanau et al., 2014; Cho et al., 2014). An attempt has even been made to learn to evaluate sim-

ple python programs with a pure sequence-to-sequence model (Zaremba & Sutskever, 2015a), but

more success was seen with more complex models. Neural Turing Machines (Graves et al., 2014)

were shown to learn a number of basic sequence transformations and memory access patterns, and

their reinforcement learning variant (Zaremba & Sutskever, 2015b) has reasonable performance on

a number of tasks as well. Stack, Queue and DeQueue networks (Grefenstette et al., 2015) were also

shown to learn basic sequence transformations such as bigram flipping or sequence reversal.

The Grid LSTM (Kalchbrenner et al., 2016) is another powerful architecture that can learn to mul-

tiply 15-digit decimal numbers. As we will see in the next section, the Grid-LSTM is quite similar

to the Neural GPU – the main difference is that the Neural GPU is less recurrent and is explicitly

constructed from the highly parallel convolution operator.

In image processing, convolutional LSTMs, an architecture similar to the Neural GPU, have recently

been used for weather prediction (Shi et al., 2015) and image compression (Toderici et al., 2016).

We find it encouraging as it hints that the Neural GPU might perform well in other contexts.

Most comparable to this work are the prior experiments with the stack-augmented RNNs

(Joulin & Mikolov, 2015). These networks manage to learn and generalize to unseen lengths on

a number of algorithmic tasks. But, as we show in Section 3.1, stack-augmented RNNs trained to

add numbers up-to 20-bit long generalize only to ∼ 100-bit numbers, never to 200-bit ones, and

never without error. Still, their generalization is the best we were able to obtain without using the

Neural GPU and far surpasses a baseline LSTM sequence-to-sequence model with attention.

The quest for learning algorithms has been pursued much more widely with tools other than neu-

ral networks. It is known under names such as program synthesis, program induction, automatic

programming, or inductive synthesis, and has a long history with many works that we do not cover

here; see, e.g., Gulwani (2010) and Kitzelmann (2010) for a more general perspective.

Since one of our results is the synthesis of an algorithm for long binary addition, let us recall how

this problem has been addressed without neural networks. Importantly, there are two cases of this

problem with different complexity. The easier case is when the two numbers that are to be added

are aligned at input, i.e., if the first (lower-endian) bit of the first number is presented at the same

time as the first bit of the second number, then come the second bits, and so on, as depicted below

for x = 9 = 8 + 1 and y = 5 = 4 + 1 written in binary with least-significant bit left.

Input                    1 0 0 1
(x and y aligned)        1 0 1 0
Desired Output (x + y)   0 1 1 1

In this representation the triples of bits from (x, y, x + y), e.g., (1, 1, 0) (0, 0, 1) (0, 1, 1) (1, 0, 1)

as in the figure above, form a regular language. To learn binary addition in this representation it

therefore suffices to find a regular expression or an automaton that accepts this language, which can

be done with a variant of Angluin's algorithm (Angluin, 1987). But only a few interesting functions

have regular representations; for example, long multiplication does not (Blumensath & Grädel,

2000). It is therefore desirable to learn long binary addition without alignment, for example when x

and y are provided one after another. This is the representation we use in the present paper.

Input (x, y) 1 0 0 1 + 1 0 1 0

Desired Output (x + y) 0 1 1 1
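To see why the aligned representation is regular, note that a checker only has to remember the carry bit while scanning the triples. The Python sketch below is an illustration of that argument, not the learning algorithm cited above; the function name and the requirement that no carry remains at the end (so the sum has the same length as the inputs) are assumptions.

```python
def accepts_addition(triples):
    """Check a sequence of (x_bit, y_bit, sum_bit) triples, least-significant bit first.

    The only memory carried from one triple to the next is the carry bit,
    which is why a finite automaton suffices for this language.
    """
    carry = 0
    for x_bit, y_bit, sum_bit in triples:
        total = x_bit + y_bit + carry
        if total % 2 != sum_bit:
            return False  # the automaton moves to a rejecting dead state
        carry = total // 2
    return carry == 0


# Triples for x = 9 and y = 5 from the aligned example: 9 + 5 = 14.
example = [(1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 0, 1)]
print(accepts_addition(example))
```

No such fixed-size memory works when y is presented only after all of x, since the checker would then have to remember every bit of x.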

2 THE NEURAL GPU

Before we introduce the Neural GPU, let us recall the architecture of a Gated Recurrent Unit

(GRU) (Cho et al., 2014). A GRU is similar to an LSTM, but its input and state are the same

size, which makes it easier for us to generalize it later; a highway network could have also been

used (Srivastava et al., 2015), but it lacks the reset gate. GRUs have shown performance similar to

LSTMs on a number of tasks (Chung et al., 2014; Greff et al., 2015). A GRU takes an input vector

x and a current state vector s, and outputs:

GRU(x, s) = u ⊙ s + (1 − u) ⊙ tanh(W x + U (r ⊙ s) + B), where

u = σ(W′ x + U′ s + B′) and r = σ(W″ x + U″ s + B″).

In the equations above, W, W′, W″, U, U′, U″ are matrices and B, B′, B″ are bias vectors; these

are the parameters that will be learned. We write W x for a matrix-vector multiplication and r ⊙ s

for elementwise vector multiplication. The vectors u and r are called gates since their elements are

in [0, 1] — u is the update gate and r is the reset gate.
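The GRU equations translate almost line by line into NumPy. The sketch below is a minimal illustration with randomly initialised parameters; the dictionary keys (W1 for W′, W2 for W″, and so on) and the state size of 4 are assumptions chosen for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru(x, s, p):
    """One GRU step; p maps parameter names to arrays."""
    u = sigmoid(p["W1"] @ x + p["U1"] @ s + p["B1"])  # update gate (W', U', B')
    r = sigmoid(p["W2"] @ x + p["U2"] @ s + p["B2"])  # reset gate (W'', U'', B'')
    candidate = np.tanh(p["W"] @ x + p["U"] @ (r * s) + p["B"])
    return u * s + (1.0 - u) * candidate


rng = np.random.default_rng(0)
n = 4  # input and state share the same size
p = {name: rng.normal(size=(n, n)) for name in ("W", "U", "W1", "U1", "W2", "U2")}
p.update({name: rng.normal(size=n) for name in ("B", "B1", "B2")})
x, s = rng.normal(size=n), rng.normal(size=n)
new_state = gru(x, s, p)
print(new_state.shape)
```

Each output element is a convex combination of the old state and a tanh value, so when u is close to 1 the state is copied through almost unchanged.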

In recurrent neural networks a unit like GRU is applied at every step and the result is both passed as

new state and used to compute the output. In a Neural GPU we do not process a new input in every

step. Instead, all inputs are written into the starting state s0 . This state has 2-dimensional structure:

it consists of w × h vectors of m numbers, i.e., it is a 3-dimensional tensor of shape [w, h, m]. This

mental image evolves in time in a way defined by a convolutional gated recurrent unit:

CGRU(s) = u ⊙ s + (1 − u) ⊙ tanh(U ∗ (r ⊙ s) + B), where

u = σ(U′ ∗ s + B′) and r = σ(U″ ∗ s + B″).

U ∗ s above denotes the convolution of a kernel bank U with the mental image s. A kernel bank is a

4-dimensional tensor of shape [k_w, k_h, m, m], i.e., it contains k_w · k_h · m^2 parameters, where k_w and

k_h are kernel width and height. It is applied to a mental image s of shape [w, h, m] which results in

another mental image U ∗ s of the same shape defined by:

U \ast s[x, y, i] = \sum_{u=\lfloor -k_w/2 \rfloor}^{\lfloor k_w/2 \rfloor} \; \sum_{v=\lfloor -k_h/2 \rfloor}^{\lfloor k_h/2 \rfloor} \; \sum_{c=1}^{m} s[x + u, y + v, c] \cdot U[u, v, c, i].

In the equation above the index x + u might sometimes be negative or larger than the size of s, and

in such cases we assume the value is 0. This corresponds to the standard convolution operator used

in convolutional neural networks with zero padding on both sides and stride 1. Using the standard

operator has the advantage that it is heavily optimized (see Section 4 for Neural GPU performance).

New work on faster convolutions, e.g., Lavin & Gray (2015), can be directly used in a Neural GPU.
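A direct, unoptimised NumPy version of the convolution and of the CGRU may help make the shapes concrete. This is a sketch under stated assumptions: kernel sizes are odd, kernel offsets are centred (running from −⌊k/2⌋ to ⌊k/2⌋), and the function names are invented for the example. A real implementation would call an optimised convolution routine instead of the explicit loop.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv2d_same(s, U):
    """Zero-padded, stride-1 convolution of s [w, h, m] with a kernel bank U [kw, kh, m, m]."""
    w, h, _ = s.shape
    kw, kh = U.shape[0], U.shape[1]
    pw, ph = kw // 2, kh // 2
    padded = np.pad(s, ((pw, pw), (ph, ph), (0, 0)))  # zeros outside the image
    out = np.zeros((w, h, U.shape[3]))
    for a in range(kw):
        for b in range(kh):
            # padded[a:a+w, b:b+h] holds s[x + a - pw, y + b - ph] for every (x, y)
            out += padded[a:a + w, b:b + h, :] @ U[a, b]
    return out


def cgru(s, U, B, U1, B1, U2, B2):
    """Convolutional GRU step on a mental image s of shape [w, h, m]."""
    u = sigmoid(conv2d_same(s, U1) + B1)  # update gate
    r = sigmoid(conv2d_same(s, U2) + B2)  # reset gate
    return u * s + (1.0 - u) * np.tanh(conv2d_same(r * s, U) + B)


rng = np.random.default_rng(0)
w, h, m = 4, 5, 3
s = rng.normal(size=(w, h, m))
kernels = [rng.normal(size=(3, 3, m, m)) for _ in range(3)]
biases = [rng.normal(size=m) for _ in range(3)]
s_next = cgru(s, kernels[0], biases[0], kernels[1], biases[1], kernels[2], biases[2])
print(s_next.shape)
```

The output has the same shape as the input, so the same unit can be applied repeatedly to the evolving mental image.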

Knowing how a CGRU gate works, the definition of an l-layer Neural GPU is simple, as depicted in

Figure 1. The given sequence i = (i_1, ..., i_n) of n discrete symbols from {0, ..., I} is first embedded

into the mental image s_0 by concatenating the vectors obtained from an embedding lookup

of the input symbols into its first column. More precisely, we create the starting mental image s_0 of

shape [w, n, m] by using an embedding matrix E of shape [I, m] and setting s_0[0, k, :] = E[i_k] (in

python notation) for all k = 1 ... n (here i_1, ..., i_n is the input). All other elements of s_0 are set to

0. Then, we apply l different CGRU gates in turn for n steps to produce the final mental image s_fin:

s_{t+1} = CGRU_l(CGRU_{l−1}(... CGRU_1(s_t) ...)) and s_fin = s_n.

[Figure 1: the input symbols i_1, ..., i_n are embedded into s_0; CGRU_1 and CGRU_2 are applied in turn to produce s_1, ..., s_n; the outputs o_1, ..., o_n are read from s_n.]

The result of a Neural GPU is produced by multiplying each item in the first column of s_fin by

an output matrix O to obtain the logits l_k = O s_fin[0, k, :] and then selecting the maximal one:

o_k = argmax(l_k). During training we use the standard loss function, i.e., we compute a softmax

over the logits lk and use the negative log probability of the target as the loss.
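The path from input symbols to the mental image and back to output symbols can be sketched in NumPy as follows. The function names are hypothetical, the output matrix O is assumed to have shape [number of symbols, m], and averaging the loss over positions is an assumption (the text only says the negative log probability of the target is used).

```python
import numpy as np


def embed(symbols, E, w):
    """Build s_0 of shape [w, n, m]: embeddings in the first column, zeros elsewhere."""
    n, m = len(symbols), E.shape[1]
    s0 = np.zeros((w, n, m))
    s0[0] = E[np.asarray(symbols)]
    return s0


def readout(s_fin, O):
    """Logits l_k = O s_fin[0, k, :] for every position k, and the argmax outputs."""
    logits = s_fin[0] @ O.T  # shape [n, number of symbols]
    return logits, logits.argmax(axis=1)


def softmax_loss(logits, targets):
    """Mean negative log probability of the targets under a softmax over the logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), np.asarray(targets)].mean()


E = np.arange(12.0).reshape(4, 3)  # 4 symbols, m = 3
s0 = embed([2, 0, 3], E, w=4)
logits, outputs = readout(s0, np.zeros((4, 3)))
print(s0.shape, softmax_loss(logits, [0, 1, 2]))
```

With an all-zero output matrix every symbol is equally likely, so the loss equals log 4, a handy sanity check before training.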

Since all components of a Neural GPU are clearly differentiable, we can train using any stochastic

gradient descent optimizer. For the results presented in this paper we used the Adam optimizer

(Kingma & Ba, 2014) with ε = 10^−4 and the gradient norm clipped to 1. The number of layers was

set to l = 2, the width of mental images was constant at w = 4, the number of maps in each mental

image point was m = 24, and the convolution kernel width and height were always k_w = k_h = 3.

Computational power of Neural GPUs. While the above definition is simple, it might not be

immediately obvious what kind of functions a Neural GPU can compute. Why can we expect it to

be able to perform long multiplication? To answer such questions it is useful to draw an analogy

between a Neural GPU and a discrete 2-dimensional cellular automaton. Except for being discrete

and the lack of a gating mechanism, such automata are quite similar to Neural GPUs. Of course,

these are large exceptions. Dense representations often have more capacity than purely discrete

states and the gating mechanism is crucial to avoid vanishing gradients during training. But the

computational power of cellular automata is much better understood. In particular, it is well known

that a cellular automaton can exploit its parallelism to multiply two n-bit numbers in O(n) steps

using Atrubin’s algorithm. We recommend the online book (Vivien, 2003) to get an understanding

of this algorithm and the computational power of cellular automata.

3 EXPERIMENTS

In this section, we present experiments showing that a Neural GPU can successfully learn a number

of algorithmic tasks and generalize well beyond the lengths that it was trained on. We start with the

two tasks we focused on, long binary addition and long binary multiplication. Then, to demonstrate

the generality of the model, we show that Neural GPUs perform well on several other tasks as well.

The two core tasks on which we study the performance of Neural GPUs are long binary addition

and long binary multiplication. We chose them because they are fundamental tasks and because

there is no known linear-time algorithm for long multiplication. As described in Section 2, we

input a sequence of discrete symbols into the network and we read out a sequence of symbols

again. For binary addition, we use a set of 4 symbols: {0, 1, +, PAD} and for multiplication we use

{0, 1, ·, PAD}. The PAD symbol is only used for padding so we depict it as empty space below.

Long binary addition (badd) is the task of adding two numbers represented lower-endian in

binary notation. We always add numbers of the same length, but we allow them to have 0s at start,

so numbers of differing lengths can be padded to equal size. Given two d-bit numbers the full

sequence length is n = 2d + 1, as seen in the example below, representing (1 + 4) + (2 + 4 + 8) =

5 + 14 = 19 = (16 + 2 + 1).

Task@Bits    Neural GPU   stackRNN   LSTM+A
badd@20      100%         100%       100%
badd@25      100%         100%       73%
badd@100     100%         88%        0%
badd@200     100%         0%         0%
badd@2000    100%         0%         0%
bmul@20      100%         N/A        0%
bmul@25      100%         N/A        0%
bmul@200     100%         N/A        0%
bmul@2000    100%         N/A        0%

Table 1: Neural GPU, stackRNN, and LSTM+A results on addition and multiplication. The table

shows the fraction of test cases for which every single bit of the model's output is correct.

Input 1 0 1 0 + 0 1 1 1

Output 1 1 0 0 1

Long binary multiplication (bmul) is the task of multiplying two binary numbers, represented

lower-endian. Again, we always multiply numbers of the same length, but we allow them to have 0s

at start, so numbers of differing lengths can be padded to equal size. Given two d-bit numbers, the

full sequence length is again n = 2d+1, as seen in the example below, representing (2+4)·(2+8) =

6 · 10 = 60 = 32 + 16 + 8 + 4.

Input 0 1 1 0 · 0 1 0 1

Output 0 0 1 1 1 1
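Input and target sequences for both tasks can be generated with a few lines of Python. The helper names and the use of the characters "+" and "*" as operator symbols are assumptions for illustration; padding the target with the PAD symbol up to the full sequence length is left out.

```python
def to_bits(value, width):
    """Lower-endian bits of value, padded with 0s up to the given width."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits for lower-endian bit lists."""
    return sum(bit << i for i, bit in enumerate(bits))


def make_example(x, y, d, op):
    """Return (input sequence, output bits) for two d-bit numbers and op '+' or '*'."""
    assert op in ("+", "*")
    assert 0 <= x < 2 ** d and 0 <= y < 2 ** d
    inputs = to_bits(x, d) + [op] + to_bits(y, d)  # length n = 2d + 1
    result = x + y if op == "+" else x * y
    outputs = to_bits(result, max(result.bit_length(), 1))
    return inputs, outputs


print(make_example(5, 14, 4, "+"))
print(make_example(6, 10, 4, "*"))
```

The two calls reproduce the addition example 5 + 14 = 19 and the multiplication example 6 · 10 = 60.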

Models. We compare three different models on the above tasks. In addition to the Neural GPU

we include a baseline LSTM recurrent neural network with an attention mechanism. We call this

model LSTM+A as it is exactly the same as described in (Vinyals & Kaiser et al., 2015). It is a

3-layer model with 64 units in each LSTM cell in each layer, which results in about 200k parameters

(the Neural GPU uses m = 24 and has about 30k parameters). Both the Neural GPU and

the LSTM+A baseline were trained using all the techniques described below, including curriculum

training and gradient noise. Finally, on binary addition, we also include the stack-RNN model from

(Joulin & Mikolov, 2015). This model was not trained using our training regime, but exactly as

provided in its source code, only with n_max = 41. To match our training procedure, we ran

it 729 times (cf. Section 3.3) with different random seeds and we report the best obtained result.

Results. We measure the rate of fully correct output sequences and report the results in Table 1.

For both tasks, we first show this rate at the maximum length seen during training, i.e., for

20-bit numbers. Note that LSTM+A is not able to learn long binary multiplication at this length; it

does not even fit the training data. Then we report numbers for sizes not seen during training.

As you can see, a Neural GPU can learn a multiplication algorithm that generalizes perfectly, at least

as far as we were able to test (technical limits of our implementation prevented us from testing much

above 2000 bits). Even for the simpler task of binary addition, stack-RNNs work only up-to length

100. This is still much better than the LSTM+A baseline which only generalizes to length 25.

In addition to the two main tasks above, we tested Neural GPUs on the following simpler algorithmic

tasks. The same architecture as used above was able to solve all of the tasks described below, i.e.,

after being trained on sequences of length up to 41 we were not able to find any error on sequences

of any length we tested (up to 4001).

Copying sequences is the simple task of producing on output the same sequence as on input. It is

very easy for a Neural GPU, in fact all models converge quickly and generalize perfectly.

Reversing sequences is the task of reversing a sequence of bits, where n is the length of the sequence.

Duplicating sequences is the task of duplicating the input bit sequence on output twice, as in the

example below. We use the padding symbol on input to make it match the output length. We trained

on sequences of inputs up-to 20 bits, so outputs were up-to 40-bits long, and tested on inputs up-to

2000 bits long.

Input 0 0 1 1

Output 0 0 1 1 0 0 1 1

Counting by sorting bits is the task of sorting the input bit sequence on output. Since there are

only 2 symbols to sort, this is a counting task – the network must count how many 0s are in the

input and produce the output accordingly, as in the example below.

Input 1 0 1 1 0 0 1 0

Output 0 0 0 0 1 1 1 1
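For reference, the target functions of these four tasks are one-liners in Python. The function names and the string "PAD" standing in for the padding symbol are assumptions made for this sketch.

```python
PAD = "PAD"  # stand-in for the padding symbol


def copy_target(bits):
    return list(bits)


def reverse_target(bits):
    return list(reversed(bits))


def duplicate_example(bits):
    """Input is padded so that it matches the length of the duplicated output."""
    return list(bits) + [PAD] * len(bits), list(bits) * 2


def sort_target(bits):
    """Sorting bits amounts to counting the 0s."""
    zeros = list(bits).count(0)
    return [0] * zeros + [1] * (len(bits) - zeros)


print(duplicate_example([0, 0, 1, 1]))
print(sort_target([1, 0, 1, 1, 0, 0, 1, 0]))
```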

Here we describe the training methods that we used to improve our results. Note that we applied

these methods to the LSTM+A baseline as well, to keep the above comparison fair. We focus on

the most important elements of our training regime; all less relevant details can be found in the code,

which is released as open source (see footnote 1).

Grid search. Each result we report is obtained by running a grid search over 3^6 = 729 instances.

We consider 3 settings each of the learning rate, the initial parameter scale, and 4 other hyperparameters

discussed below: the relaxation pull factor, curriculum progress threshold, gradient noise scale, and

dropout. An important effect of running this grid search is also that we train 729 models with differ-

ent random seeds every time. Usually only a few of these models generalize to 2000-bit numbers,

but a significant fraction works well on 200-bit numbers, as discussed below.
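The size of the grid follows from six hyperparameters with three settings each. The Python sketch below enumerates such a grid; only the dropout rates are taken from the text, and the labels "low", "medium" and "high" are placeholders because the actual values of the other hyperparameters are not given here.

```python
import itertools

# Placeholder settings; only the dropout rates come from the paper's text.
levels = ["low", "medium", "high"]
grid = {
    "learning_rate": levels,
    "initial_parameter_scale": levels,
    "relaxation_pull_factor": levels,
    "curriculum_progress_threshold": levels,
    "gradient_noise_scale": levels,
    "dropout": [0.06, 0.09, 0.135],
}

configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))  # 3 ** 6 = 729
```

Running each configuration with its own random seed gives the 729 independently trained models mentioned above.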

Curriculum learning. We use a curriculum learning approach inspired by Zaremba & Sutskever

(2015a). This means that we train, e.g., on 7-digit numbers only after crossing a curriculum progress

threshold (e.g., over 90% fully correct outputs) on 6-digit numbers. However, with 20% probability

we pick a minibatch of d-digit numbers with d chosen uniformly at random between 1 and 20.
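The curriculum rule can be sketched as two small Python functions. The names, the default threshold of 0.9, and the decision to advance by exactly one digit at a time are assumptions for illustration.

```python
import random


def pick_length(current, rng, max_length=20, p_random=0.2):
    """Length for the next minibatch: usually the current curriculum length,
    but with probability p_random a uniformly random length in [1, max_length]."""
    if rng.random() < p_random:
        return rng.randint(1, max_length)
    return current


def update_curriculum(current, fraction_correct, threshold=0.9, max_length=20):
    """Advance to longer numbers once the progress threshold is crossed."""
    if fraction_correct > threshold and current < max_length:
        return current + 1
    return current


rng = random.Random(0)
lengths = [pick_length(6, rng) for _ in range(10)]
print(lengths, update_curriculum(6, 0.95))
```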

Gradient noise. To improve training speed and stability we add noise to gradients in each training

step. Inspired by the schedule from Welling & Teh (2011), we add to the gradients noise drawn from

the normal distribution with mean 0 and variance inversely proportional to the square root of the step

number (i.e., with standard deviation inversely proportional to the 4-th root of the step number). We multiply this

noise by the gradient noise scale and, to avoid noise in converged models, we also multiply it by the

fraction of non-fully-correct outputs (which is 0 for a perfect model).
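A small sketch of this noise schedule, under stated assumptions: the proportionality constant is taken to be 1, steps are counted from 1, and the names `gradient_noise_std` and `add_gradient_noise` are hypothetical. Because the variance decays with the square root of the step number, the standard deviation decays with its fourth root.

```python
import random


def gradient_noise_std(step, noise_scale, fraction_not_fully_correct):
    """Standard deviation of the noise added at a given training step (step >= 1)."""
    # Variance proportional to step**-0.5 means standard deviation proportional to step**-0.25.
    return noise_scale * fraction_not_fully_correct * step ** -0.25


def add_gradient_noise(grads, step, noise_scale, fraction_not_fully_correct, rng=random):
    """Return a copy of grads (a list of floats) with zero-mean Gaussian noise added."""
    std = gradient_noise_std(step, noise_scale, fraction_not_fully_correct)
    return [g + rng.gauss(0.0, std) for g in grads]
```

When the fraction of non-fully-correct outputs is 0, the standard deviation is 0 and the gradients pass through unchanged, which is the behaviour the text describes for a perfect model.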

Gate cutoff. In Section 2 we defined the gates in a CGRU using the sigmoid function, e.g., we

wrote u = σ(U′ ∗ s + B′). Usually the standard sigmoid function is used, σ(x) = 1/(1 + e^{−x}). We

found that adding a hard threshold on the top and bottom helps slightly in our setting, so we use

1.2σ(x) − 0.1 cut to the interval [0, 1], i.e., σ ′ (x) = max(0, min(1, 1.2σ(x) − 0.1)).
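The cutoff gate is easy to state in code. The sketch below uses only the standard library; the function names are illustrative, and the numerically stable form of the sigmoid is an implementation choice, not something the text prescribes.

```python
import math


def sigmoid(x):
    """Standard logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cutoff_sigmoid(x):
    """Hard-thresholded gate: 1.2 * sigmoid(x) - 0.1, clipped to [0, 1]."""
    return max(0.0, min(1.0, 1.2 * sigmoid(x) - 0.1))
```

With this definition the gate saturates exactly at 1 once σ(x) ≥ 11/12, i.e. for x ≥ ln 11 ≈ 2.4, and exactly at 0 for x ≤ −ln 11, so a gate can be fully open or fully closed at finite pre-activations.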

Dropout is a widely applied technique for regularizing neural networks. But when applying it to

recurrent networks, it has been counter-productive to apply it on recurrent connections – it only

worked when applied to the non-recurrent ones, as reported by Pham et al. (2014).

Since a Neural GPU does not have non-recurrent connections it might seem that dropout will not

be useful for this architecture. Surprisingly, we found the contrary – it is useful and improves

generalization. The key to using dropout effectively in this setting is to set a small dropout rate.

When we run a grid search for dropout rates we vary them between 6%, 9%, and 13.5%, meaning

that over 85% of the values are always preserved. It turns out that even this small dropout has large

1

The code is at https://github.com/tensorflow/models/tree/master/neural_gpu.


effect since we apply it to the whole mental image si in each step i. Presumably the network now

learns to include some redundancy in its internal representation and generalization benefits from it.

Without dropout we usually see only a few models from a 729 grid search generalize reasonably,

while with dropout it is a much larger fraction and they generalize to higher lengths. In particular,

dropout was necessary to train models for multiplication that generalize to 2000 bits.

To improve optimization of our deep network we use a relaxation technique for shared parameters

which works as follows. Instead of training with parameters shared across time-steps we use r

identical sets of non-shared parameters (we often use r = 6, larger numbers work better but use

more memory). At time-step t of the Neural GPU we use the i-th set if t mod r = i.

The procedure described above relaxes the network, as it can now perform different operations in

different time-steps. Training becomes easier, but we now have r parameter sets instead of the single

shared set we want. To unify them we add a term to the cost function representing the distance

of each parameter from the average of this parameter in all the r sets. This term in the final cost

function is multiplied by a scalar which we call the relaxation pull. If the relaxation pull is 0, the

network behaves as if the r parameter sets were separate, but when it is large, the cost forces the

network to unify the parameters across different sets.
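The relaxation term can be sketched as below. Two points are assumptions made for illustration and are not confirmed by the text: the distance is taken to be the squared difference, and each parameter set is represented as a flat list of floats. The names `param_set_index` and `relaxation_penalty` are hypothetical.

```python
def param_set_index(t, r):
    """Index of the parameter set used at time-step t when there are r sets."""
    return t % r


def relaxation_penalty(param_sets, pull):
    """Pull-weighted sum of squared distances of each parameter from its mean over the r sets.

    param_sets is a list of r equally long lists of floats.
    """
    r = len(param_sets)
    n = len(param_sets[0])
    means = [sum(ps[j] for ps in param_sets) / r for j in range(n)]
    distance = sum((ps[j] - means[j]) ** 2 for ps in param_sets for j in range(n))
    return pull * distance
```

A pull of 0 removes the term entirely, so the sets evolve independently; once all sets agree, the term is 0 for any pull, which is why averaging the sets at the end of the curriculum changes little.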

During training, we gradually increase the relaxation pull. We start with a small value and every time

the curriculum makes progress, e.g., when the model performs well on 6-digit numbers, we multiply

the relaxation pull by a relaxation pull factor. When the curriculum reaches the maximal length we

average the parameters from all sets and continue to train with a single shared parameter set.

This method is crucial for learning multiplication. Without it, a Neural GPU with m = 24 has

trouble even fitting the training set, and the few models that manage to do it do not generalize. With

relaxation almost all models in our 729 runs manage to fit the training data.

4 D ISCUSSION

We prepared a video of the Neural GPU trained to solve the tasks mentioned above.2 It shows

the state in each step with values of −1 drawn in white, 1 in black, and other values in gray. This gives

an intuition for how the Neural GPU solves the discussed problems, e.g., it is quite clear that for the

duplication task the Neural GPU learned to move a part of the embedding downwards in each step.

What did not work well? For one, using decimal inputs degrades performance. All tasks above can

easily be formulated with decimal inputs instead of binary ones. One could hope that a Neural GPU

will work well in this case too, maybe with a larger m. We experimented with this formulation and

our results were worse than when the representation was binary: we did not manage to learn long

decimal multiplication. Increasing m to 128 allows learning all other tasks in the decimal setting.

Another problem is that often only a few models in a 729 grid search generalize to very long unseen

instances. Among those 729 models, there usually are many models that generalize to 40 or even 200

bits, but only a few working without error for 2000-bit numbers. Using dropout and gradient noise

improves the reliability of training and generalization, but maybe another technique could help even

more. How could we make more models achieve good generalization? One idea that looks natural

is to try to reduce the number of parameters by decreasing m. Surprisingly, this does not seem to

have any influence. In addition to the m = 24 results presented above we ran experiments with

m = 32, 64, 128 and the results were similar. In fact using m = 128 we got the most models to

generalize. Additionally, we observed that ensembling a few models, just by averaging their outputs,

helps to generalize: ensembles of 5 models almost always generalize perfectly on binary tasks.
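Output averaging of this kind can be sketched in a few lines. The representation is an assumption: each model is taken to emit, for every output position, the probability that the bit is 1, and the averaged probability is thresholded at 0.5. The function name is illustrative.

```python
def ensemble_predict(model_outputs, threshold=0.5):
    """Average per-bit probabilities over models and threshold the result.

    model_outputs is a list with one entry per model; each entry is a list of
    probabilities that the corresponding output bit equals 1.
    """
    n_models = len(model_outputs)
    length = len(model_outputs[0])
    averaged = [sum(m[i] for m in model_outputs) / n_models for i in range(length)]
    return [1 if p > threshold else 0 for p in averaged]
```

In this sketch a single model that is confidently wrong on one bit is outvoted as long as the remaining models assign enough probability to the correct value.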

Why use width? The Neural GPU is defined using two-dimensional convolutions and in our exper-

iments one of the dimensions is always set to 4. Doing so is not necessary since a one-dimensional

Neural GPU that uses four times larger m can represent every function representable by the original

one. In fact we trained a model for long binary multiplication that generalized to 2000-bit numbers

using a Neural GPU with width 1 and m = 64. However, the width of the Neural GPU increases the

2

The video is available at https://www.youtube.com/watch?v=LzC8NkTZAF4


amount of information carried in its hidden state without increasing the number of its parameters.

Thus it can be thought of as a factorization and might be useful for other tasks.

Speed and data efficiency. Neural GPUs use the standard, heavily optimized convolution operation

and are fast. We experimented with a 2-layer Neural GPU for n = 32 and m = 64. After unfolding

in time it has 128 layers of CGRUs, each operating on 32 mental images, each 4 × 64 × 64 . The

joint forward-backward step time for this network was about 0.6s on an NVIDIA GTX 970 GPU.

We were also surprised by how data-efficient a Neural GPU can be. The experiments presented

above were all performed using 10k random training data examples for each training length. Since

we train on up to 20-bit numbers this adds up to about 200k training examples. We tried to train using

only 100 examples per length, so about 2000 total training instances. We were surprised to see

that it actually worked well for binary addition: there were models that generalized well to 200-bit

numbers and to all lengths below despite such a small training set. But we never managed to train a

good model for binary multiplication with that little training data.

The results presented in Table 1 show clearly that there is a qualitative difference between what can

be achieved with a Neural GPU and what was possible with previous architectures. In particular, for

the first time, we show a neural network that learns a non-trivial superlinear-time algorithm in a way

that generalizes to much greater lengths without errors.

This opens the way to use neural networks in domains that were previously only addressed by

discrete methods, such as program synthesis. With the surprising data efficiency of Neural GPUs it

could even be possible to replicate previous program synthesis results, e.g., Kaiser (2012), but in a

more scalable way. It is also interesting that a Neural GPU can learn symbolic algorithms without

using any discrete state at all, and adding dropout and noise only improves its performance.

Another promising direction for future work is to apply Neural GPUs to language processing tasks. Good

results have already been obtained on translation with a convolutional architecture over words

(Kalchbrenner & Blunsom, 2013) and adding gating and recursion, like in a Neural GPU, should

allow training much deeper models without overfitting. Finally, the parameter sharing relaxation

technique can be applied to any deep recurrent network and has the potential to improve RNN train-

ing in general.

R EFERENCES

Angluin, Dana. Learning regular sets from queries and counterexamples. Information and Computation, 75:

87–106, 1987.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to

align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.

Blumensath, Achim and Grädel, Erich. Automatic Structures. In Proceedings of LICS 2000, pp. 51–62, 2000.

URL http://www.logic.rwth-aachen.de/pub/graedel/BlGr-lics00.ps.

Chan, William, Jaitly, Navdeep, Le, Quoc V., and Vinyals, Oriol. Listen, attend and spell. In International

Conference on Acoustics, Speech and Signal Processing, ICASSP’16, 2016.

Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio,

Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation.

CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.

Chung, Junyoung, Gülçehre, Çaglar, Cho, Kyunghyun, and Bengio, Yoshua. Empirical evaluation

of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. URL

http://arxiv.org/abs/1412.3555.

Dahl, George E., Yu, Dong, Deng, Li, and Acero, Alex. Context-dependent pre-trained deep neural networks

for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20

(1):30–42, 2012.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. CoRR, abs/1410.5401, 2014. URL

http://arxiv.org/abs/1410.5401.


Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil.

Learning to transduce with unbounded memory. CoRR, abs/1506.02516, 2015. URL

http://arxiv.org/abs/1506.02516.

Greff, Klaus, Srivastava, Rupesh Kumar, Koutnı́k, Jan, Steunebrink, Bas R., and Schmidhuber, Jürgen. LSTM:

A search space odyssey. CoRR, abs/1503.04069, 2015. URL http://arxiv.org/abs/1503.04069.

Gulwani, Sumit. Dimensions in program synthesis. In Proceedings of PPDP 2010, PPDP ’10, pp. 13–24, 2010.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780,

1997.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets.

CoRR, abs/1503.01007, 2015. URL http://arxiv.org/abs/1503.01007.

Kaiser, Łukasz. Learning games from videos guided by descriptive complexity. In Proceedings of the AAAI-12,

pp. 963–970. AAAI Press, 2012. URL http://goo.gl/mRbfV5.

Kalchbrenner, Nal and Blunsom, Phil. Recurrent continuous translation models. In Proceedings EMNLP 2013,

pp. 1700–1709, 2013. URL http://nal.co/papers/KalchbrennerBlunsom_EMNLP13.

Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. In International Confer-

ence on Learning Representations, 2016. URL http://arxiv.org/abs/1507.01526.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980,

2014. URL http://arxiv.org/abs/1412.6980.

Kitzelmann, Emanuel. Inductive programming: A survey of program synthesis techniques. In Approaches and

Applications of Inductive Programming, AAIP 2009, volume 5812 of LNCS, pp. 50–73, 2010.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. ImageNet classification with deep convolutional neural

networks. In Advances in Neural Information Processing Systems, 2012.

Lavin, Andrew and Gray, Scott. Fast algorithms for convolutional neural networks. CoRR, abs/1509.09308,

2015. URL http://arxiv.org/abs/1509.09308.

Pham, Vu, Bluche, Théodore, Kermorvant, Christopher, and Louradour, Jérôme. Dropout improves recur-

rent neural networks for handwriting recognition. In International Conference on Frontiers in Handwriting

Recognition (ICFHR), pp. 285–290. IEEE, 2014. URL http://arxiv.org/pdf/1312.4569.pdf.

Shi, Xingjian, Chen, Zhourong, Wang, Hao, Yeung, Dit-Yan, kin Wong, Wai, and chun Woo, Wang. Convo-

lutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural

Information Processing Systems, 2015. URL http://arxiv.org/abs/1506.04214.

Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. CoRR,

abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks.

In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014. URL

http://arxiv.org/abs/1409.3215.

Toderici, George, O’Malley, Sean M., Hwang, Sung Jin, Vincent, Damien, Minnen, David, Baluja,

Shumeet, Covell, Michele, and Sukthankar, Rahul. Variable rate image compression with recur-

rent neural networks. In International Conference on Learning Representations, 2016. URL

http://arxiv.org/abs/1511.06085.

Vinyals, Oriol, Kaiser, Łukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey. Grammar as a foreign language. In Advances in Neural

Information Processing Systems, 2015. URL http://arxiv.org/abs/1412.7449.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption

generator. CoRR, abs/1411.4555, 2014. URL http://arxiv.org/abs/1411.4555.

Vivien, Helene. An Introduction to cellular automata. 2003. URL

http://www.liafa.univ-paris-diderot.fr/~yunes/ca/archives/bookvivien.pdf.

Welling, Max and Teh, Yee Whye. Bayesian learning via stochastic gradient Langevin dynamics. In Proceed-

ings of ICML 2011, pp. 681–688, 2011.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. CoRR, abs/1410.4615, 2015a. URL

http://arxiv.org/abs/1410.4615.

Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural turing machines. CoRR,

abs/1505.00521, 2015b. URL http://arxiv.org/abs/1505.00521.


Learning Efficient Algorithms with Hierarchical Attentive Memory

Karol Kurach∗ (kkurach@google.com), Google / University of Warsaw¹

∗ equal contribution

arXiv:1602.03218v2 [cs.LG] 23 Feb 2016

In this paper, we propose and investigate a novel memory architecture for neural networks called Hierarchical Attentive Memory (HAM). It is based on a binary tree with leaves corresponding to memory cells. This allows HAM to perform memory access in Θ(log n) complexity, which is a significant improvement over the standard attention mechanism that requires Θ(n) operations, where n is the size of the memory.

We show that an LSTM network augmented with HAM can learn algorithms for problems like merging, sorting or binary searching from pure input-output examples. In particular, it learns to sort n numbers in time Θ(n log n) and generalizes well to input sequences much longer than the ones seen during the training. We also show that HAM can be trained to act like classic data structures: a stack, a FIFO queue and a priority queue.

1. Intro

Deep Recurrent Neural Networks (RNNs) have recently proven to be very successful in real-world tasks, e.g. machine translation (Sutskever et al., 2014) and computer vision (Vinyals et al., 2014). However, the success has been achieved only on tasks which do not require a large memory to solve the problem, e.g. we can translate sentences using RNNs, but we cannot produce reasonable translations of really long pieces of text, like books.

A high-capacity memory is a crucial component necessary to deal with large-scale problems that contain plenty of long-range dependencies. Currently used RNNs do not scale well to larger memories, e.g. the number of parameters in an LSTM (Hochreiter & Schmidhuber, 1997) grows quadratically with the size of the network’s memory. In [...] few thousands.

It would be desirable for the size of the memory to be independent of the number of model parameters. The first versatile and highly successful architecture with this property was the Neural Turing Machine (NTM) (Graves et al., 2014). The main idea behind the NTM is to split the network into a trainable “controller” and an “external” variable-size memory. It triggered a wave of other neural network architectures with external memories (see Sec. 2).

However, one aspect which has usually been neglected so far is the efficiency of the memory access. Most of the proposed memory architectures have Θ(n) access complexity, where n is the size of the memory. This means that, for instance, copying a sequence of length n requires performing Θ(n²) operations, which is clearly unsatisfactory.

1.1. Our contribution

In this paper we propose a novel memory module for neural networks, called Hierarchical Attentive Memory (HAM). The HAM module is generic and can be used as a building block of larger neural architectures. Its crucial property is that it scales well with the memory size: the memory access requires only Θ(log n) operations, where n is the size of the memory. This complexity is achieved by using a new attention mechanism based on a binary tree with leaves corresponding to memory cells. The novel attention mechanism is not only faster than the standard one used in Deep Learning (Bahdanau et al., 2014), but it also facilitates learning algorithms due to a built-in bias towards operating on intervals.

We show that an LSTM augmented with HAM is able to learn algorithms for tasks like merging, sorting or binary searching. In particular, it is the first neural network that we are aware of that is able to learn to sort from pure input-output examples and generalizes well to input sequences much longer than the ones seen during the training. Moreover, the learned sorting algorithm runs in time Θ(n log n). We also show that the HAM memory itself is capable of simulating different classic memory structures: a stack, a FIFO queue and a priority queue.

¹ Work done while at Google.

2. Related work

In this section we mention a number of recently proposed neural architectures with an external memory whose size is independent of the number of the model parameters.

Memory architectures based on attention. Attention is a recent but already extremely successful technique in Deep Learning. This mechanism allows networks to attend to parts of the (potentially preprocessed) input sequence (Bahdanau et al., 2014) while generating the output sequence. It is implemented by giving the network as an auxiliary input a linear combination of input symbols, where the weights of this linear combination can be controlled by the network.

An attention mechanism was used to access the memory in Neural Turing Machines (NTMs) (Graves et al., 2014). It was the first paper that explicitly attempted to train a computationally universal neural network, and it achieved encouraging results.

The Memory Network is a model that attempted to explicitly separate the memory from computation in a neural network model. The follow-up work of Sukhbaatar et al. (2015) combined the memory network with the soft attention mechanism, which allowed it to be trained with less supervision. In contrast to NTMs, the memory in these models is non-writeable.

Another model without writeable memory is the Pointer Network (Vinyals et al., 2015), which is very similar to the attention model of Bahdanau et al. (2014). Despite not having a memory, this model was able to solve a number of difficult algorithmic problems that include the Convex Hull and the approximate 2D Travelling Salesman Problem.

All of the architectures mentioned so far use standard attention mechanisms to access the memory and therefore memory access complexity scales linearly with the memory size.

Memory architectures based on data structures. The Stack-Augmented Recurrent Neural Network (Joulin & Mikolov, 2015) is a neural architecture combining an RNN and a differentiable stack. In another paper (Grefenstette et al., 2015) the authors consider extending an LSTM with a stack, a FIFO queue or a double-ended queue and show some promising results. The advantage of the latter model is that the presented data structures have a constant access time.

Memory architectures based on pointers. In two recent papers (Zaremba & Sutskever, 2015; Zaremba et al., 2015) the authors consider extending neural networks with nondifferentiable memories based on pointers and trained using Reinforcement Learning. The big advantage of these models is that they allow constant-time memory access. They were, however, only successful on relatively simple tasks.

Another model which can use a pointer-based memory is the Neural Programmer-Interpreter (Reed & de Freitas, 2015). It is very interesting, because it managed to learn sub-procedures. Unfortunately, it requires strong supervision in the form of execution traces.

Another type of pointer-based memory was presented in the Neural Random-Access Machine (Kurach et al., 2015), which is a neural architecture mimicking classic computers.

Parallel memory architectures. There are two recent memory architectures which are especially suited for parallel computation. Grid-LSTM (Kalchbrenner et al., 2015) is an extension of LSTM to multiple dimensions. Another recent model of this type is the Neural GPU (Kaiser & Sutskever, 2015), which can learn to multiply long binary numbers.

3. Hierarchical Attentive Memory

In this section we describe our novel memory module called Hierarchical Attentive Memory (HAM). The HAM module is generic and can be used as a building block of larger neural network architectures. For instance, it can be added to feedforward or LSTM networks to extend their capabilities. To make our description more concrete we will consider a model consisting of an LSTM “controller” extended with a HAM module.

The high-level idea behind the HAM module is as follows. The memory is structured as a full binary tree with the leaves containing the data stored in the memory. The inner nodes contain some auxiliary data, which allows us to efficiently perform some types of “queries” on the memory. In order to access the memory, one starts from the root of the tree and performs a top-down descent in the tree, which is similar to the hierarchical softmax procedure (Morin & Bengio, 2005). At every node of the tree, one decides to go left or right based on the auxiliary data stored in this node and a “query”. Details are provided in the rest of this section.

3.1. Notation

The model takes as input a sequence $x_1, x_2, \ldots$ and outputs a sequence $y_1, y_2, \ldots$. We assume that each element of these sequences is a binary vector of size $b \in \mathbb{N}$, i.e. $x_i, y_i \in \{0, 1\}^b$. Suppose for a moment that we only want to process input sequences of length $\le n$, where $n \in \mathbb{N}$ is a power of two (we show later how to process sequences of an arbitrary length). The model is based on the full binary tree with $n$ leaves. Let $V$ denote the set of the nodes in that

tree and let $L \subset V$ denote the set of its leaves. Let $l(e)$ for $e \in V \setminus L$ be the left child of the node $e$ and let $r(e)$ be its right child.

3.2. Inference

The high-level view of the model execution is presented in Fig. 1. The hidden state of the model consists of two components: the hidden state of the LSTM controller (denoted $h_{LSTM} \in \mathbb{R}^l$ for some $l \in \mathbb{N}$) and the hidden values stored in the nodes of the HAM tree. More precisely, for every node $e \in V$ there is a hidden value $h_e \in \mathbb{R}^d$. These values change during the recurrent execution of the model, but we drop all timestep indices to simplify the notation.

The model parameters describe the behaviour of the LSTM, as well as the following 4 transformations, which describe the HAM module: EMBED : $\mathbb{R}^b \to \mathbb{R}^d$, JOIN : $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$, SEARCH : $\mathbb{R}^d \times \mathbb{R}^l \to [0, 1]$ and WRITE : $\mathbb{R}^d \times \mathbb{R}^l \to \mathbb{R}^d$. These transformations may be represented by arbitrary function approximators, e.g. Multilayer Perceptrons (MLPs). Their meaning will be described soon.

The details of the model are presented in 4 figures. Fig. 2 describes the initialization of the model. Each recurrent timestep of the model consists of three phases: the attention phase described in Fig. 3, the output phase described in Fig. 4 and the update phase described in Fig. 5. The whole timestep can be performed in time Θ(log n).

The HAM parameters describe only the 4 mentioned transformations and hence the number of the model parameters does not depend on the size of the binary tree used. Thus, we can use the model to process inputs of an arbitrary length by using big enough binary trees. It is not clear that the same set of parameters will give good results across different tree sizes, but we showed experimentally that it is indeed the case (see Sec. 4 for more details).

We decided to represent the transformations defining HAM with MLPs with the ReLU (Nair & Hinton, 2010) activation function in all neurons except the output layer of SEARCH, which uses the sigmoid activation function to ensure that the output may be interpreted as a probability. Moreover, the network for WRITE is enhanced in a similar way as Highway Networks (Srivastava et al., 2015), i.e. $\mathrm{WRITE}(h_a, h_{LSTM}) = T(h_a, h_{LSTM}) \cdot H(h_a, h_{LSTM}) + (1 - T(h_a, h_{LSTM})) \cdot h_a$, where $H$ and $T$ are two MLPs with the sigmoid activation function in the output layer. This allows the WRITE transformation to easily leave the value $h_a$ unchanged.

Figure 1. The LSTM+HAM model consists of an LSTM controller and a HAM module. The execution of the model starts with the initialization of HAM using the whole input sequence $x_1, x_2, \ldots, x_m$. At each timestep, the HAM module produces an input for the LSTM, which then produces an output symbol $y_t$. Afterwards, the hidden states of the LSTM and HAM are updated.

Figure 2. Initialization of the model. The value in the i-th leaf of HAM is initialized with $\mathrm{EMBED}(x_i)$, where EMBED is a trainable feed-forward network. If there are more leaves than input symbols, we initialize the values in the excess leaves with zeros. Then, we initialize the values in the inner nodes bottom-up using the formula $h_e = \mathrm{JOIN}(h_{l(e)}, h_{r(e)})$. The hidden state of the LSTM, $h_{LSTM}$, is initialized with zeros.

Figure 3. Attention phase. In this phase the model performs a top-down “search” in the tree starting from the root. Suppose that we are currently at the node $c \in V \setminus L$. We compute the value $p = \mathrm{SEARCH}(h_c, h_{LSTM})$. Then, with probability $p$ the model goes right (i.e. $c := r(c)$) and with probability $1 - p$ it goes left (i.e. $c := l(c)$). This procedure is continued until we reach one of the leaves. This leaf is called the attended or accessed leaf and denoted $a$. [Diagram: example descent with $\mathrm{SEARCH}(h_3, h_{LSTM}) = 0.1$ and $\mathrm{SEARCH}(h_6, h_{LSTM}) = 1$.]

Figure 4. Output phase. The value $h_a$ stored in the attended leaf is given to the LSTM as an input. Then, the LSTM produces an output symbol $y_t \in \{0, 1\}^b$. More precisely, the value $u \in \mathbb{R}^b$ is computed by a trainable linear transformation from $h_{LSTM}$ and the distribution of $y_t$ is defined by the formula $p(y_{t,i} = 1) = \mathrm{sigmoid}(u_i)$ for $1 \le i \le b$. It may be beneficial to allow the model to access the memory a few times between producing consecutive output symbols. Therefore, the model produces an output symbol only at timesteps with indices divisible by some constant $\eta \in \mathbb{N}$, which is a hyperparameter.

Figure 5. Update phase. In this phase the value in the attended leaf $a$ is updated. More precisely, the value is modified using the formula $h_a := \mathrm{WRITE}(h_a, h_{LSTM})$. Then, we update the values of the inner nodes encountered during the attention phase ($h_6$, $h_3$ and $h_1$ in the figure) bottom-up using the equation $h_e = \mathrm{JOIN}(h_{l(e)}, h_{r(e)})$.

3.3. Training

In this section we describe how to train our model from purely input-output examples using REINFORCE (Williams, 1992). In Appendix A we also present a different variant of HAM which is fully differentiable and can be trained using end-to-end backpropagation.

Let $x, y$ be an input-output pair. Recall that both $x$ and $y$ are sequences. Moreover, let $\theta$ denote the parameters of the model and let $A$ denote the sequence of all decisions made during the execution of the model. We would like to maximize the log-probability of producing the correct output, i.e.

$$L = \log p(y|x, \theta) = \log \Big( \sum_A p(A|x, \theta)\, p(y|A, x, \theta) \Big).$$

This sum is intractable, so instead of maximizing it directly, we maximize a variational lower bound on it:

$$F = \sum_A p(A|x, \theta) \log p(y|A, x, \theta) \le L.$$

This sum is also intractable, so we approximate its gradient using REINFORCE, which we briefly explain below. Using the identity $\nabla p(A|x, \theta) = p(A|x, \theta) \nabla \log p(A|x, \theta)$, the gradient of the lower bound with respect to the model parameters can be rewritten as:

$$\nabla F = \sum_A p(A|x, \theta) \Big[ \nabla \log p(y|A, x, \theta) + \log p(y|A, x, \theta)\, \nabla \log p(A|x, \theta) \Big] \qquad (1)$$

We estimate this value using Monte Carlo approximation. For every $x$ we sample $\tilde{A}$ from $p(A|x, \theta)$ and approximate the gradient for the input $x$ as $\nabla \log p(y|\tilde{A}, x, \theta) + \log p(y|\tilde{A}, x, \theta)\, \nabla \log p(\tilde{A}|x, \theta)$.

Notice that this gradient estimate can be computed using normal backpropagation if we substitute the gradients in the nodes² which sample whether we should go left or right during the attention phase by

$$\underbrace{\log p(y|\tilde{A}, x, \theta)}_{\text{return}}\, \nabla \log p(\tilde{A}|x, \theta).$$

This term is called the REINFORCE gradient estimate and the left factor is called a return in the Reinforcement Learning literature. This gradient estimator is unbiased, but it often has a high variance. Therefore, we employ two standard variance-reduction techniques for REINFORCE: discounted returns and baselines (Williams, 1992). Discounted returns means that our return at the t-th timestep has the form $\sum_{t \le i} \gamma^{i-t} \log p(y_i|\tilde{A}, x, \theta)$ for some discount constant $\gamma \in [0, 1]$, which is a hyperparameter. This biases the estimator if $\gamma < 1$, but it often decreases its variance.

² For a general discussion of computing gradients in computation graphs which contain stochastic nodes, see (Schulman et al., 2015).

For lack of space we do not describe the baselines technique. We only mention that our baseline is case and

timestep dependent: it is computed using a learnable linear transformation from $h_{LSTM}$ and trained using the MSE loss function.

The whole model is trained with the Adam (Kingma & Ba, 2014) algorithm. We also employ the following three training techniques:

Different reward function. During our experiments we noticed that better results may be obtained by using a different reward function for REINFORCE. More precisely, instead of the log-probability of producing the correct output, we use the percentage of the output bits which have the probability of being predicted correctly (given $\tilde{A}$) greater than 50%, i.e. our discounted return is equal to $\sum_{t \le i,\, 1 \le j \le b} \gamma^{i-t} \big[ p(y_{i,j}|\tilde{A}, x, \theta) > 0.5 \big]$. Notice that it corresponds to the Hamming distance between the most probable outcome according to the model (given $\tilde{A}$) and the correct output.

Entropy bonus term. We add a special term to the cost function which encourages exploration. More precisely, for each sampling node we add to the cost function the term $\alpha / H(p)$, where $H(p)$ is the entropy of the distribution of the decision whether to go left or right in this node and $\alpha$ is an exponentially decaying coefficient. This term goes to infinity whenever the entropy goes to zero, which ensures some level of exploration. We noticed that this term works better in our experiments than the standard term of the form

algorithm with exponentially decaying learning rate. We use random search to determine the best hyper-parameters for the model. We use gradient clipping (Pascanu et al., 2012) with constant 5. The depth of our MLPs is either 1 or 2, the LSTM controller has $l = 20$ memory cells and the hidden values in the tree have dimensionality $d = 20$. The constant $\eta$, which determines the number of memory accesses between consecutive output symbols (Fig. 4), is equal to either 1 or 2. We always train for 100 epochs, each consisting of 1000 batches of size 50. After each epoch we evaluate the model on 200 validation batches without learning. When the training is finished, we select the model parameters that gave the lowest error rate on validation batches and report the error using these parameters on 2,500 fresh random examples.

We report two types of errors: a test error and a generalization error. The test error shows how well the model is able to fit the data distribution and generalize to unknown cases, assuming that cases of similar lengths were shown during the training. It is computed using the HAM memory with $n = 32$ leaves, as the percentage of output sequences which were predicted incorrectly. The lengths of test examples are sampled uniformly from the range $[1, n]$. Notice that we mark the whole output sequence as incorrect even if only one bit was predicted incorrectly, e.g. a hypothetical model predicting each bit incorrectly with probability 1% (and independently of the errors on the other bits) has an error rate of 96% on whole sequences if outputs consist of 320 bits.

−αH(p) (Williams, 1992).

The generalization error shows how well the model per-

forms with enlarged memory on examples with lengths ex-

Curriculum schedule We start with training on inputs ceeding n. We test our model with memory 4 times bigger

with lengths sampled uniformly from [1, n] for some n = than the training one. The lengths of input sequences are

2k and the binary tree with n leaves. Whenever the error now sampled uniformly from the range [2n + 1, 4n].

drops below some threshold, we increment the value k and

start using the bigger tree with 2n leaves and inputs with During testing we make our model fully deterministic by

lengths sampled uniformly from [1, 2n]. using the most probable outcomes instead of stochastic

sampling. More precisely, we assume that during the at-

tention phase the model decides to go right iff p > 0.5

4. Experiments (Fig. 3). Moreover, the output symbols (Fig. 4) are com-

In this section, we evaluate two variants of using the HAM puted by rounding to zero or one instead of sampling.

module. The first one is the model described in Sec. 3,

which combines an LSTM controller with a HAM mod- 4.2. LSTM+HAM

ule (denoted by LSTM+HAM). Then, in Sec. 4.3 we in-

We evaluate the model on a number of algorithmic tasks

vestigate the “raw” HAM (without the LSTM controller)

described below:

to check its capability of acting as classic data structures: a

stack, a FIFO queue and a priority queue.

Reverse: Given a sequence of 10-bit vectors, output

them in the reversed order., i.e. yi = xm+1−i for 1 ≤

4.1. Test setup

i ≤ m, where m is the length of the input sequence.

For each test that we perform, we apply the following pro-

cedure. First, we train the model with memory of size Search: Given a sequence of pairs xi = keyi ||valuei

up to n = 32 using the curriculum schedule described in for 1 ≤ i ≤ m − 1 sorted by keys and a query xm = q, find

Sec. 3.3. The model is trained using the minibatch Adam the smallest i such that keyi = q and output y1 = valuei .

Keys and values are 5-bit vectors and keys are compared lexicographically. The LSTM+HAM model is given only two timesteps (η = 2) to solve this problem, which forces it to use a form of binary search.

Merge: Given two sorted sequences of pairs, $(p_1, v_1), \ldots, (p_m, v_m)$ and $(p'_1, v'_1), \ldots, (p'_{m'}, v'_{m'})$, where $p_i, p'_i \in [0, 1]$ and $v_i, v'_i \in \{0, 1\}^5$, merge them. Pairs are compared according to their priorities, i.e. values $p_i$ and $p'_i$. Priorities are unique and sampled uniformly from the set $\{\frac{1}{300}, \ldots, \frac{300}{300}\}$, because neural networks cannot easily distinguish two real numbers which are very close to each other. Input is encoded as $x_i = p_i \| v_i$ for $1 \le i \le m$ and $x_{m+i} = p'_i \| v'_i$ for $1 \le i \le m'$. The output consists of the vectors $v_i$ and $v'_i$ sorted according to their priorities³.

Sort: Given a sequence of pairs $x_i = key_i \| value_i$, sort them in a stable way⁴ according to the lexicographic order of the keys. Keys and values are 5-bit vectors.

Add: Given two numbers represented in binary, compute their sum. The input is represented as $a_1, \ldots, a_m, +, b_1, \ldots, b_m, =$ (i.e. $x_1 = a_1$, $x_2 = a_2$ and so on), where $a_1, \ldots, a_m$ and $b_1, \ldots, b_m$ are bits of the input numbers and +, = are some special symbols. Input and output numbers are encoded starting from the least significant bits.

³ Notice that we earlier assumed for the sake of simplicity that the input sequences consist of binary vectors and in this task the priorities are real values. It does not however require any change of our model. We decided to use real priorities in this task in order to diversify our set of problems.

⁴ Stability means that pairs with equal keys should be ordered according to their order in the input sequence.

Every example output shown during the training is finished by a special “End Of Output” symbol, which the model learns to predict. It forces the model to learn not only the output symbols, but also the length of the correct output.

We compare our model with 2 strong baseline models: encoder-decoder LSTM (Sutskever et al., 2014) and encoder-decoder LSTM with attention (denoted LSTM+A) (Bahdanau et al., 2014). The number of the LSTM cells in the baselines was chosen in such a way that they have more parameters than the biggest of our models. We also use random search to select an optimal learning rate and some other parameters for the baselines and train them using the same curriculum scheme as LSTM+HAM.

The results are presented in Table 1. Not only does LSTM+HAM solve all the problems almost perfectly, but it also generalizes very well to much longer inputs on all problems except Add. Recall that for the generalization tests we used a HAM memory of a different size than the ones used during the training, which shows that HAM generalizes very well to new sizes of the binary tree. We find this fact quite interesting, because it means that parameters learned from a small neural network (i.e. HAM based on a tree with 32 leaves) can be successfully used in a different, bigger network (i.e. HAM with 128 memory cells).

In comparison, the LSTM with attention does not learn to merge, nor sort. It also completely fails to generalize to longer examples, which shows that LSTM+A learns some statistical dependencies between inputs and outputs rather than the real algorithms.

The LSTM+HAM model makes a few errors when testing on longer outputs than the ones encountered during the training. Notice however, that we show in the table the percentage of output sequences which contain at least one incorrect bit. For instance, LSTM+HAM on the problem Merge predicts incorrectly only 0.03% of output bits, which corresponds to 2.48% of incorrect output sequences. We believe that these rare mistakes could be avoided if one trained the model longer and chose carefully the learning rate schedule. One more way to boost generalization capabilities would be to simultaneously train the models with different memory sizes and shared parameters. We have not tried this as the generalization properties of the model were already very good.

Table 1. Experimental results. The upper table presents the error rates on inputs of the same lengths as the ones used during training. The lower table shows the error rates on input sequences 2 to 4 times longer than the ones encountered during training. LSTM+A denotes an LSTM with the standard attention mechanism. Each error rate is a percentage of output sequences which contained at least one incorrectly predicted bit.

test error            LSTM     LSTM+A    LSTM+HAM
Reverse               73%      0%        0%
Search                62%      0.04%     0.12%
Merge                 88%      16%       0%
Sort                  99%      25%       0.04%
Add                   39%      0%        0%

2-4x longer inputs    LSTM     LSTM+A    LSTM+HAM
Reverse               100%     100%      0%
Search                89%      0.52%     1.68%
Merge                 100%     100%      2.48%
Sort                  100%     100%      0.24%
Add                   100%     100%      100%

Complexity            Θ(1)     Θ(n)      Θ(log n)

4.3. Raw HAM

In this section, we evaluate “raw” HAM module (without the LSTM controller) to see if it can act as a drop-in replacement for 3 classic data structures: a stack, a FIFO queue and a priority queue. For each task, the network is given a sequence of PUSH and POP operations in an on-line manner: at timestep t the network sees only the t-th operation to perform $x_t$. This is a more realistic scenario for data structures usage as it prevents the network from cheating by peeking into the future.

Raw HAM module differs from the LSTM+HAM model from Sec. 3 in the following way:

• The HAM memory is initialized with zeros.

• The LSTM controller is not used, so the output symbols are computed directly from the value in the accessed leaf $h_a$.

• Notice that in the LSTM+HAM model, $h_{LSTM}$ acted as a kind of “query” or “command” guiding the behaviour of HAM. We will now use the values $x_t$ instead. Therefore, at the t-th timestep we use $x_t$ instead of $h_{LSTM}$ whenever $h_{LSTM}$ was used in the original model, e.g. during the attention phase (Fig. 3) we use $p = \mathrm{SEARCH}(h_c, x_t)$ instead of $p = \mathrm{SEARCH}(h_c, h_{LSTM})$.

We evaluate raw HAM on the following tasks:

Stack: The “PUSH x” operation places the element x (a 5-bit vector) on top of the stack, and the “POP” returns the last added element and removes it from the stack.

Queue: The “PUSH x” operation places the element x (a 5-bit vector) at the end of the queue and the “POP” returns the oldest element and removes it from the queue.

PriorityQueue: The “PUSH x p” operation adds the element x with priority p to the queue. The “POP” operation returns the value with the highest priority and removes it from the queue. Both x and p are represented as 5-bit vectors and priorities are compared lexicographically. To avoid ties we assume that all elements have different priorities.

The model was trained with the memory of size up to n = 32 with operation sequences of length n. Sequences of PUSH/POP actions for training were selected randomly. The t-th operation out of n operations in the sequence was POP with probability $\frac{t}{n}$ and PUSH otherwise. To test generalization, we report the error rates with the memory of size 4n on sequences of operations of length 4n.

The results presented in Table 2 show that HAM simulates a stack and a queue perfectly with no errors whatsoever even for memory 4 times bigger. For the PriorityQueue task, the model generalizes almost perfectly to large memory, with errors only in 0.2% of output sequences.

Table 2. Results of experiments with the raw version of HAM (without the LSTM controller). Error rates are measured as a percentage of operation sequences in which at least one POP query was not answered correctly.

Task             Test Error    Generalization Error
Stack            0%            0%
Queue            0%            0%
PriorityQueue    0.08%         0.2%

4.4. Analysis

In this section, we present some insights into the algorithms learned by the LSTM+HAM model, by investigating the hidden representations $h_e$ learned for a variant of the problem Sort in which we sort 4-bit vectors lexicographically⁵. For demonstration purposes, we use a small tree with n = 8 leaves and d = 6.

The trained network performs sorting perfectly. It attends to the leaves in the order corresponding to the order of the sorted input values, i.e. at every timestep HAM attends to the leaf corresponding to the smallest input value among the leaves which have not been attended so far.

It would be interesting to exactly understand the algorithm used by the network to perform this operation. A natural solution to this problem would be to store in each hidden node e the smallest input value among the (unattended so far) leaves below e together with the information whether the smallest value is in the right or the left subtree under e.

We present two timesteps of our model together with some insights into the algorithm used by the network in Fig. 6.

Figure 6. This figure shows two timesteps of the model. The LSTM controller is not presented to simplify the exposition. The input sequence is presented on the left, below the tree: $x_1 = 0000$, $x_2 = 1110$, $x_3 = 1101$ and so on. The 2x3 grids in the nodes of the tree represent the values $h_e \in \mathbb{R}^6$. White cells correspond to value 0 and non-white cells correspond to values > 0. The lower-rightmost cells are presented in pink, because we managed to decipher the meaning of this coordinate for the inner nodes. This coordinate in the node e denotes whether the minimum in the subtree (among the values unattended so far) is in the right or left subtree of e. Value greater than 0 (pink in the picture) means that the minimum is in the right subtree and therefore we should go right while visiting this node in the attention phase. In the first timestep the leftmost leaf (corresponding to the input 0000) is accessed. Notice that the last coordinates (shown in pink) are updated appropriately, e.g. the smallest unattended value at the beginning of the second timestep is 0101, which corresponds to the 6-th leaf. It is in the right subtree under the root and accordingly the last coordinate in the hidden value stored in the root is high (i.e. pink in the figure).

⁵ In the problem Sort considered in the experimental results, there are separate keys and values, which forces the model to learn stable sorting. Here, for the sake of simplicity, we consider the simplified version of the problem and do not use separate keys and values.

5. Comparison to other models

Comparing neural networks able to learn algorithms is difficult for a few reasons. First of all, there are no well-established benchmark problems for this area. Secondly, the difficulty of a problem often depends on the way inputs and outputs are encoded. For example, the difficulty of the problem of adding long binary numbers depends on whether the numbers are aligned (i.e. the i-th bit of the second number is “under” the i-th bit of the first number) or written next to each other (e.g. 10011+10101). Moreover, we could compare error rates on inputs from the same distribution as the ones seen during the training or compare error rates on inputs longer than the ones seen during the training to see if the model “really learned the algorithm”. Furthermore, different models scale differently with the memory size, which makes direct comparison of error rates less meaningful.

As far as we know, our model is the first one which is able to learn a sorting algorithm from pure input-output examples. In (Reed & de Freitas, 2015) it is shown that an LSTM is able to learn to sort short sequences, but it fails to generalize to inputs longer than the ones seen during the training. It is quite clear that an LSTM cannot learn a “real” sorting algorithm, because it uses a bounded memory independent of the length of the input. The Neural Programmer-Interpreter (Reed & de Freitas, 2015) is a neural network architecture which is able to learn bubble sort, but it requires strong supervision in the form of execution traces. In comparison, our model can be trained from pure input-output examples, which is crucial if we want to use it to solve problems for which we do not know any algorithms.

An important feature of neural memories is their efficiency. Our HAM module, in comparison to many other recently proposed solutions, is effective and allows accessing the memory in Θ(log(n)) complexity. In the context of learning algorithms it may sound surprising that among all the architectures mentioned in Sec. 2 the only ones which can copy a sequence of length n without $\Theta(n^2)$ operations are: Reinforcement-Learning NTM (Zaremba & Sutskever, 2015), the model from (Zaremba et al., 2015), Neural Random-Access Machine (Kurach et al., 2015), and Queue-Augmented LSTM (Grefenstette et al., 2015). However, the first three models have been only successful on relatively simple tasks. The last model was successful on some synthetic tasks from the domain of Natural Language Processing, which are very different from the tasks we tested our model on, so we cannot directly compare the two models.

Finally, we do not claim that our model is superior to all other ones, e.g. Neural Turing Machines (NTM) (Graves et al., 2014). We believe that both memory mechanisms are complementary: NTM memory has a built-in associative map functionality, which may be difficult to achieve in HAM. On the other hand, HAM performs better in tasks like sorting due to a built-in bias towards operating on intervals of memory cells. Moreover, HAM allows much more efficient memory access than NTM. It is also quite possible that a machine able to learn algorithms should use many different types of memory in the same way as human brain stores a piece of information differently depending on its type and how long it should be stored (Berntson & Cacioppo, 2009).

6. Conclusions

We presented a new memory architecture for neural networks called Hierarchical Attentive Memory. Its crucial property is that it scales well with the memory size: the memory access requires only Θ(log n) operations. This complexity is achieved by using a new attention mechanism based on a binary tree. The novel attention mechanism is not only faster than the standard one used in Deep Learning, but it also facilitates learning algorithms due to the embedded tree structure.

We showed that an LSTM augmented with HAM can learn a number of algorithms like merging, sorting or binary searching from pure input-output examples. In particular, it is the first neural architecture able to learn a sorting algorithm and generalize well to sequences much longer than the ones seen during the training.

We believe that some concepts used in HAM, namely the novel attention mechanism and the idea of aggregating information through a binary tree, may find applications in Deep Learning outside of the problem of designing neural memories.

Acknowledgements

We would like to thank Nando de Freitas, Alexander Graves, Serkan Cabi, Misha Denil and Jonathan Hunt for helpful comments and discussions.

A. Using soft attention

One of the open questions in the area of designing neural networks with attention mechanisms is whether to use a soft or hard attention. The model described in the paper belongs to the latter class of attention mechanisms as it makes hard, stochastic choices. The other solution would be to use a soft, differentiable mechanism, which attends to a linear combination of the potential attention targets and does not involve any sampling. The main advantage of such models is that their gradients can be computed exactly.

We now describe how to modify the model to make it fully differentiable (“DHAM”). Recall that in the original model the leaf which is attended at every timestep is sampled stochastically. Instead of that, we will now at every timestep compute for every leaf e the probability p(e) that this leaf would be attended if we used the stochastic procedure described in Fig. 3. The value p(e) can be computed by multiplying the probabilities of going in the right direction from all the nodes on the path from the root to e.

As the input for the LSTM we then use the value $\sum_{e \in L} p(e) \cdot h_e$. During the write phase, we update the values of all the leaves using the formula $h_e := p(e) \cdot \mathrm{WRITE}(h_e, h_{ROOT}) + (1 - p(e)) \cdot h_e$. Then, in the update phase we update the values of all the inner nodes, so that the equation $h_e = \mathrm{JOIN}(h_{l(e)}, h_{r(e)})$ is satisfied for each inner node e. Notice that one timestep of the soft version of the model takes time Θ(n) as we have to update the values of all the nodes in the tree. Our model may be seen as a special case of Gated Graph Neural Network (Li et al., 2015).

This version of the model is fully differentiable and therefore it can be trained using end-to-end backpropagation on the log-probability of producing the correct output. We observed that training DHAM is slightly easier than the REINFORCE version. However, DHAM does not generalize as well as HAM to larger memory sizes.

References

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Berntson, G.G. and Cacioppo, J.T. Handbook of Neuroscience for the Behavioral Sciences. Number v. 1 in Handbook of Neuroscience for the Behavioral Sciences. Wiley, 2009. ISBN 9780470083567.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pp. 1819–1827, 2015.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint arXiv:1503.01007, 2015.

Kaiser, Łukasz and Sutskever, Ilya. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228, 2015.

Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural random-access machines. arXiv preprint arXiv:1511.06392, 2015.

Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pp. 246–252. Citeseer, 2005.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. Understanding the exploding gradient problem. Computing Research Repository (CoRR) abs/1211.5063, 2012.

Reed, Scott and de Freitas, Nando. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.

Schulman, John, Heess, Nicolas, Weber, Theophane, and Abbeel, Pieter. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3510–3522, 2015.

Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory networks. arXiv preprint arXiv:1503.08895, 2015.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc VV. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.

Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. arXiv preprint arXiv:1506.03134, 2015.

Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural turing machines. arXiv preprint arXiv:1505.00521, 2015.

Zaremba, Wojciech, Mikolov, Tomas, Joulin, Armand, and Fergus, Rob. Learning simple algorithms from examples. arXiv preprint arXiv:1511.07275, 2015.
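As a supplementary sketch for the soft-attention variant described in Appendix A of the Hierarchical Attentive Memory paper, the leaf probabilities p(e) can be obtained by multiplying branch probabilities along each root-to-leaf path. The code below is only an illustration under stated assumptions: the tree is stored in heap order (root at index 1, children of node i at 2i and 2i+1), the per-node probabilities of going right are supplied directly rather than produced by the SEARCH network, and the names `leaf_probabilities` and `soft_read` are hypothetical.

```python
def leaf_probabilities(p_right, n_leaves):
    """Return p(e) for every leaf of a complete binary tree.

    p_right[i] is the probability of going right at inner node i
    (heap numbering, root = 1). Assumption: n_leaves is a power of two.
    """
    if n_leaves < 1 or n_leaves & (n_leaves - 1):
        raise ValueError("n_leaves must be a power of two")
    reach = [0.0] * (2 * n_leaves)  # reach[i]: probability of visiting node i
    reach[1] = 1.0
    for node in range(1, n_leaves):  # inner nodes are 1 .. n_leaves - 1
        reach[2 * node] = reach[node] * (1.0 - p_right[node])  # go left
        reach[2 * node + 1] = reach[node] * p_right[node]      # go right
    return reach[n_leaves:]  # leaves are n_leaves .. 2 * n_leaves - 1


def soft_read(leaf_values, probs):
    """Weighted sum of leaf vectors, i.e. sum over leaves of p(e) * h_e."""
    dim = len(leaf_values[0])
    return [sum(p * h[k] for p, h in zip(probs, leaf_values)) for k in range(dim)]


# Toy example with 4 leaves and hand-picked branch probabilities.
p_right = {1: 0.5, 2: 0.25, 3: 1.0}
probs = leaf_probabilities(p_right, 4)
print(probs)
print(soft_read([[1.0], [2.0], [3.0], [4.0]], probs))
```

The loop touches every node once, which matches the Θ(n) cost per timestep that the appendix attributes to the soft version.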

Adaptive Computation Time for Recurrent Neural Networks

arXiv:1603.08983v6 [cs.NE] 21 Feb 2017

Alex Graves

Google DeepMind

gravesa@google.com

Abstract

This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neural networks to learn how many computational steps to take between receiving an input and emitting an output. ACT requires minimal changes to the network architecture, is deterministic and differentiable, and does not add any noise to the parameter gradients. Experimental results are provided for four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers. Overall, performance is dramatically improved by the use of ACT, which successfully adapts the number of computational steps to the requirements of the problem. We also present character-level language modelling results on the Hutter prize Wikipedia dataset. In this case ACT does not yield large gains in performance; however it does provide intriguing insight into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences. This suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.

1 Introduction

The amount of time required to pose a problem and the amount of thought required to solve it are

notoriously unrelated. Pierre de Fermat was able to write in a margin the conjecture (if not the

proof) of a theorem that took three and a half centuries and reams of mathematics to solve [35].

More mundanely, we expect the effort required to find a satisfactory route between two cities, or the

number of queries needed to check a particular fact, to vary greatly, and unpredictably, from case

to case. Most machine learning algorithms, however, are unable to dynamically adapt the amount

of computation they employ to the complexity of the task they perform.

For artificial neural networks, where the neurons are typically arranged in densely connected

layers, an obvious measure of computation time is the number of layer-to-layer transformations the

network performs. In feedforward networks this is controlled by the network depth, or number of

layers stacked on top of each other. For recurrent networks, the number of transformations also

depends on the length of the input sequence — which can be padded or otherwise extended to allow

for extra computation. The evidence that increased depth leads to more performant networks is by

now inarguable [5, 4, 19, 9], and recent results show that increased sequence length can be similarly

beneficial [31, 33, 25]. However it remains necessary for the experimenter to decide a priori on the

amount of computation allocated to a particular input vector or sequence. One solution is to simply


make every network very deep and design its architecture in such a way as to mitigate the vanishing

gradient problem [13] associated with long chains of iteration [29, 17]. However in the interests

of both computational efficiency and ease of learning it seems preferable to dynamically vary the

number of steps for which the network ‘ponders’ each input before emitting an output. In this case

the effective depth of the network at each step along the sequence becomes a dynamic function of

the inputs received so far.

The approach pursued here is to augment the network output with a sigmoidal halting unit

whose activation determines the probability that computation should continue. The resulting halting

distribution is used to define a mean-field vector for both the network output and the internal network

state propagated along the sequence. A stochastic alternative would be to halt or continue according

to binary samples drawn from the halting distribution—a technique that has recently been applied to

scene understanding with recurrent networks [7]. However the mean-field approach has the advantage

of using a smooth function of the outputs and states, with no need for stochastic gradient estimates.

We expect this to be particularly beneficial when long sequences of halting decisions must be made,

since each decision is likely to affect all subsequent ones, and sampling noise will rapidly accumulate

(as observed for policy gradient methods [36]).
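To make the distinction concrete, the following minimal sketch contrasts the two options for a single input step. It assumes the intermediate vectors and a halting distribution over them are already available; the function names `mean_field` and `sample_halt` and the toy numbers are illustrative and not taken from the paper.

```python
import random


def mean_field(values, halting_probs):
    """Probability-weighted average of the intermediate vectors.

    Deterministic and smooth in both the vectors and the probabilities.
    """
    if len(values) != len(halting_probs):
        raise ValueError("one probability per intermediate vector is required")
    if abs(sum(halting_probs) - 1.0) > 1e-9:
        raise ValueError("halting probabilities must sum to 1")
    dim = len(values[0])
    return [sum(p * v[k] for p, v in zip(halting_probs, values)) for k in range(dim)]


def sample_halt(values, halting_probs, rng):
    """Stochastic alternative: return one intermediate vector drawn from the distribution."""
    return rng.choices(values, weights=halting_probs, k=1)[0]


outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [0.2, 0.3, 0.5]
print(mean_field(outputs, probs))                     # same result on every call
print(sample_halt(outputs, probs, random.Random(0)))  # depends on the random draw
```

The mean-field result changes smoothly when the probabilities change, whereas the sampled result jumps between candidates, which is why the sampled version would need a stochastic gradient estimate.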

A related architecture known as Self-Delimiting Neural Networks [26, 30] employs a halting

neuron to end a particular update within a large, partially activated network; in this case however a

simple activation threshold is used to make the decision, and no gradient with respect to halting time

is propagated. More broadly, learning when to halt can be seen as a form of conditional computing,

where parts of the network are selectively enabled and disabled according to a learned policy [3, 6].

We would like the network to be parsimonious in its use of computation, ideally limiting itself to

the minimum number of steps necessary to solve the problem. Finding this limit in its most general

form would be equivalent to determining the Kolmogorov complexity of the data (and hence solving

the halting problem) [21]. We therefore take the more pragmatic approach of adding a time cost to

the loss function to encourage faster solutions. The network then has to learn to trade off accuracy

against speed, just as a person must when making decisions under time pressure. One weakness is

that the numerical weight assigned to the time cost has to be hand-chosen, and the behaviour of the

network is quite sensitive to its value.

The rest of the paper is structured as follows: the Adaptive Computation Time algorithm is

presented in Section 2, experimental results on four synthetic problems and one real-world dataset

are reported in Section 3, and concluding remarks are given in Section 4.

2 Adaptive Computation Time

Consider a recurrent neural network R composed of a matrix of input weights $W_x$, a parametric state transition model S, a set of output weights $W_y$ and an output bias $b_y$. When applied to an input sequence $x = (x_1, \ldots, x_T)$, R computes the state sequence $s = (s_1, \ldots, s_T)$ and the output sequence $y = (y_1, \ldots, y_T)$ by iterating the following equations from $t = 1$ to $T$:

$s_t = S(s_{t-1}, W_x x_t)$ (1)

$y_t = W_y s_t + b_y$ (2)

The state is a fixed-size vector of real numbers containing the complete dynamic information of the

network. For a standard recurrent network this is simply the vector of hidden unit activations. For

a Long Short-Term Memory network (LSTM) [14], the state also contains the activations of the

memory cells. For a memory augmented network such as a Neural Turing Machine (NTM) [10],

the state contains both the complete state of the controller network and the complete state of the

memory. In general some portions of the state (for example the NTM memory contents) will not be

visible to the output units; in this case we consider the corresponding columns of $W_y$ to be fixed to

0.

Adaptive Computation Time (ACT) modifies the conventional setup by allowing R to perform a

variable number of state transitions and compute a variable number of outputs at each input step.

Let N(t) be the total number of updates performed at step t. Then define the intermediate state

sequence $(s_t^1, \dots, s_t^{N(t)})$ and intermediate output sequence $(y_t^1, \dots, y_t^{N(t)})$ at step t as follows

$$s_t^n = \begin{cases} S(s_{t-1}, x_t^1) & \text{if } n = 1 \\ S(s_t^{n-1}, x_t^n) & \text{otherwise} \end{cases} \tag{3}$$

$$y_t^n = W_y s_t^n + b_y \tag{4}$$

where $x_t^n = x_t + \delta_{n,1}$ is the input at time t augmented with a binary flag that indicates whether the

input step has just been incremented, allowing the network to distinguish between repeated inputs

and repeated computations for the same input. Note that the same state function is used for all

state transitions (intermediate or otherwise), and similarly the output weights and bias are shared

for all outputs. It would also be possible to use different state and output parameters for each

intermediate step; however doing so would cloud the distinction between increasing the number of

parameters and increasing the number of computational steps. We leave this for future work.

To determine how many updates R performs at each input step an extra sigmoidal halting unit

h is added to the network output, with associated weight matrix $W_h$ and bias $b_h$:

$$h_t^n = \sigma(W_h s_t^n + b_h) \tag{5}$$

As with the output weights, some columns of $W_h$ may be fixed to zero to give selective access to the

network state. The activation of the halting unit is then used to determine the halting probability

$p_t^n$ of the intermediate steps:

$$p_t^n = \begin{cases} R(t) & \text{if } n = N(t) \\ h_t^n & \text{otherwise} \end{cases} \tag{6}$$

where

$$N(t) = \min\Big\{n' : \sum_{n=1}^{n'} h_t^n \ge 1 - \epsilon\Big\} \tag{7}$$

$$R(t) = 1 - \sum_{n=1}^{N(t)-1} h_t^n \tag{8}$$

and $\epsilon$ is a small constant (0.01 for the experiments in this paper), whose purpose is to allow computation

to halt after a single update if $h_t^1 \ge 1 - \epsilon$, as otherwise a minimum of two updates would

be required for every input step. It follows directly from the definition that $\sum_{n=1}^{N(t)} p_t^n = 1$ and

$0 \le p_t^n \le 1 \; \forall n$, so this is a valid probability distribution. A similar distribution was recently used

to define differentiable push and pop operations for neural stacks and queues [11].
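The halting distribution of Equations (6) to (8) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `halting_distribution`, the argument names and the optional `max_steps` cap are hypothetical, and the halting activations are taken as an already computed list rather than produced by a network.

```python
import numpy as np

def halting_distribution(h, eps=0.01, max_steps=None):
    """Return (N, R, p) for one input step.

    h[n - 1] plays the role of the halting activation h_t^n.
    N is the number of updates, R the remainder and p the halting
    probabilities. max_steps is an optional hard cap on N (hypothetical name).
    """
    h = np.asarray(h, dtype=float)
    if max_steps is not None:
        h = h[:max_steps]
    cumulative = np.cumsum(h)
    reached = np.nonzero(cumulative >= 1.0 - eps)[0]
    if reached.size > 0:
        n_updates = int(reached[0]) + 1          # first n' whose running sum hits 1 - eps
    elif max_steps is not None and h.size == max_steps:
        n_updates = max_steps                    # cap reached before the threshold
    else:
        raise ValueError("halting threshold never reached")
    remainder = 1.0 - h[: n_updates - 1].sum()   # R(t)
    p = np.append(h[: n_updates - 1], remainder) # p^n = h^n for n < N, p^N = R
    return n_updates, remainder, p
```

For example, activations of 0.1, 0.3 and 0.7 halt at the third update, with the last probability set to the remainder 0.6 so that the probabilities sum to one.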

At this point we could proceed stochastically by sampling $\hat{n}$ from $p_t^n$ and setting $s_t = s_t^{\hat{n}}$, $y_t = y_t^{\hat{n}}$.

However we will eschew sampling techniques and the associated problems of noisy gradients, instead

using $p_t^n$ to determine mean-field updates for the states and outputs:

$$s_t = \sum_{n=1}^{N(t)} p_t^n s_t^n \qquad y_t = \sum_{n=1}^{N(t)} p_t^n y_t^n \tag{9}$$

Figure 1: RNN Computation Graph. An RNN unrolled over two input steps (separated by vertical dotted lines). The input

and output weights $W_x$, $W_y$, and the state transition operator S are shared over all steps.

Figure 2: RNN Computation Graph with Adaptive Computation Time. The graph is equivalent to Figure 1, only with

each state and output computation expanded to a variable number of intermediate updates. Arrows touching boxes denote

operations applied to all units in the box, while arrows leaving boxes denote summations over all units in the box.

The implicit assumption is that the states and outputs are approximately linear, in the sense that

a linear interpolation between a pair of state or output vectors will also interpolate between the

properties the vectors embody. There are several reasons to believe that such an assumption is

reasonable. Firstly, it has been observed that the high-dimensional representations present in neu-

ral networks naturally tend to behave in a linear way [32, 20], even remaining consistent under

arithmetic operations such as addition and subtraction [22]. Secondly, neural networks have been

successfully trained under a wide range of adversarial regularisation constraints, including sparse

internal states [23], stochastically masked units [28] and randomly perturbed weights [1]. This leads

us to believe that the relatively benign constraint of approximately linear representations will not

be too damaging. Thirdly, as training converges, the tendency for both mean-field and stochastic

latent variables is to concentrate all the probability mass on a single value. In this case that yields a

standard RNN with each input duplicated a variable, but deterministic, number of times, rendering

the linearity assumption irrelevant.

A diagram of the unrolled computation graph of a standard RNN is illustrated in Figure 1, while

Figure 2 provides the equivalent diagram for an RNN trained with ACT.
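As a rough illustration of the unrolled ACT computation, the sketch below runs one input step for a simple tanh RNN and forms the mean-field state and output. Several details are assumptions rather than the paper's implementation: the state transition is taken to be `tanh(W_s s + W_x x + b_s)`, the binary flag is appended to the input as an extra element, and all names (`act_step`, `params`, `max_steps`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def act_step(s_prev, x, params, eps=0.01, max_steps=100):
    """One ACT input step. Returns (mean-field state, mean-field output, N + R)."""
    W_s, W_x, b_s, W_y, b_y, w_h, b_h = params
    states, outputs, halts = [], [], []
    s, total = s_prev, 0.0
    for n in range(1, max_steps + 1):
        flag = 1.0 if n == 1 else 0.0             # 1 only on the first update of this input
        x_n = np.append(x, flag)                  # input augmented with the flag (assumed layout)
        s = np.tanh(W_s @ s + W_x @ x_n + b_s)    # intermediate state
        y = W_y @ s + b_y                         # intermediate output
        h = float(sigmoid(w_h @ s + b_h))         # halting activation
        states.append(s)
        outputs.append(y)
        halts.append(h)
        total += h
        if total >= 1.0 - eps:                    # halting threshold reached
            break
    remainder = 1.0 - sum(halts[:-1])             # probability mass left for the last update
    p = halts[:-1] + [remainder]                  # halting probabilities, summing to 1
    s_t = sum(p_n * s_n for p_n, s_n in zip(p, states))
    y_t = sum(p_n * y_n for p_n, y_n in zip(p, outputs))
    return s_t, y_t, len(states) + remainder
```

Because the output layer is linear and the probabilities sum to one, the mean-field output returned here equals the output layer applied to the mean-field state.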

2.1 Limiting Computation Time

If no constraints are placed on the number of updates R can take at each step it will naturally

tend to ‘ponder’ each input for as long as possible (so as to avoid making predictions and incurring

errors). We therefore require a way of limiting the amount of computation the network performs.

Given a length T input sequence x, define the ponder sequence $(\rho_1, \dots, \rho_T)$ of R as

$$\rho_t = N(t) + R(t) \tag{10}$$

and the ponder cost $\mathcal{P}(x)$ as

$$\mathcal{P}(x) = \sum_{t=1}^{T} \rho_t \tag{11}$$

Since $R(t) \in (0, 1)$, $\mathcal{P}(x)$ is an upper bound on the (non-differentiable) property we ultimately want

to reduce, namely the total computation $\sum_{t=1}^{T} N(t)$ during the sequence¹.

We can encourage the network to minimise P(x) by modifying the sequence loss function L(x, y)

used for training:

L̂(x, y) = L(x, y) + τ P(x) (12)

where τ is a time penalty parameter that weights the relative cost of computation versus error. As

we will see in the experiments section the behaviour of the network is quite sensitive to the value

of τ , and it is not obvious how to choose a good value. If computation time and prediction error

can be meaningfully equated (for example if the relative financial cost of both were known) a more

principled technique for selecting τ should be possible.
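The trade-off controlled by τ can be made concrete with a small sketch. It assumes each ponder value is taken to be N(t) + R(t), which is consistent with the remark that R(t) ∈ (0, 1) makes P(x) an upper bound on the total computation; the function names are hypothetical.

```python
def ponder_cost(n_updates, remainders):
    """P(x): sum over input steps of N(t) + R(t) (assumed form of the ponder value)."""
    return sum(n + r for n, r in zip(n_updates, remainders))

def act_loss(task_loss, n_updates, remainders, tau):
    """Task loss plus tau times the ponder cost, as in Equation (12)."""
    return task_loss + tau * ponder_cost(n_updates, remainders)
```

With a task loss of 2.0, two input steps that use 3 and 1 updates with remainders 0.6 and 1.0, and τ = 0.01, the ponder cost is 5.6 and the penalised loss is 2.056. Raising τ to 0.1 would make the same amount of pondering cost 0.56, which is one way to see why the results are sensitive to this parameter.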

To prevent very long sequences at the beginning of training (while the network is learning how

to use the halting unit) the bias term bh can be initialised to a positive value. In addition, a hard

limit M on the maximum allowed value of N (t) can be imposed to avoid excessive space and time

costs. In this case Equation (7) is modified to

$$N(t) = \min\Big\{M, \min\Big\{n' : \sum_{n=1}^{n'} h_t^n \ge 1 - \epsilon\Big\}\Big\} \tag{13}$$

The ponder costs $\rho_t$ are discontinuous with respect to the halting probabilities at the points where

N(t) increments or decrements (that is, when the summed probability mass up to some n either

decreases below or increases above $1 - \epsilon$). However they are continuous away from those points,

as N(t) remains constant and R(t) is a linear function of the probabilities. In practice we simply

ignore the discontinuities by treating N(t) as constant and minimising R(t) everywhere.

Given this approximation, the gradient of the ponder cost with respect to the halting activations

is straightforward:

$$\frac{\partial \mathcal{P}(x)}{\partial h_t^n} = \begin{cases} 0 & \text{if } n = N(t) \\ -1 & \text{otherwise} \end{cases} \tag{14}$$

and hence

$$\frac{\partial \hat{L}(x, y)}{\partial h_t^n} = \frac{\partial L(x, y)}{\partial h_t^n} - \begin{cases} 0 & \text{if } n = N(t) \\ \tau & \text{otherwise} \end{cases} \tag{15}$$

¹ For a stochastic ACT network, a more natural halting distribution than the one described in Equations (6) to (8)

would be to simply treat $h_t^n$ as the probability of halting at step n, in which case $p_t^n = h_t^n \prod_{n'=1}^{n-1} (1 - h_t^{n'})$. One could

then set $\rho_t = \sum_{n=1}^{N(t)} n p_t^n$, i.e. the expected ponder time under the stochastic distribution. However experiments

show that networks trained to minimise expected rather than total halting time learn to ‘cheat’ in the following

ingenious way: they set $h_t^1$ to a value just below the halting threshold, then keep $h_t^n = 0$ until some N(t) when they

set $h_t^{N(t)}$ high enough to ensure they halt. In this case $p_t^{N(t)} \ll p_t^1$, so the states and outputs at n = N(t) have much

lower weight in the mean field updates (Equation (9)) than those at n = 1; however by making the magnitudes of the

states and output vectors much larger at N(t) than n = 1 the network can still ensure that the update is dominated

by the final vectors, despite having paid a low ponder penalty.

The halting activations only influence L via their effect on the halting probabilities, therefore

$$\frac{\partial L(x, y)}{\partial h_t^n} = \sum_{n'=1}^{N(t)} \frac{\partial L(x, y)}{\partial p_t^{n'}} \frac{\partial p_t^{n'}}{\partial h_t^n} \tag{16}$$

Furthermore, since the halting probabilities only influence L via their effect on the states and outputs,

it follows from Equation (9) that

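$$\frac{\partial L(x, y)}{\partial p_t^n} = \frac{\partial L(x, y)}{\partial y_t} y_t^n + \frac{\partial L(x, y)}{\partial s_t} s_t^n \tag{17}$$

and, from Equations (6) and (8),

$$\frac{\partial p_t^{n'}}{\partial h_t^n} = \begin{cases} \delta_{n,n'} & \text{if } n' < N(t) \text{ and } n < N(t) \\ -1 & \text{if } n' = N(t) \text{ and } n < N(t) \\ 0 & \text{if } n = N(t) \end{cases} \tag{18}$$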

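Combining Equations (15), (17) and (18) gives, for n < N(t),

$$\frac{\partial \hat{L}(x, y)}{\partial h_t^n} = \frac{\partial L(x, y)}{\partial y_t} \left(y_t^n - y_t^{N(t)}\right) + \frac{\partial L(x, y)}{\partial s_t} \left(s_t^n - s_t^{N(t)}\right) - \tau \tag{19}$$

while for n = N(t)

$$\frac{\partial \hat{L}(x, y)}{\partial h_t^{N(t)}} = 0 \tag{20}$$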

Thereafter the network can be differentiated as usual (e.g. with backpropagation through time [36])

and trained with gradient descent.
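The derivative of the halting probabilities with respect to the halting activations (Equation (18)) can be checked numerically once N(t) is treated as constant. The sketch below compares the analytic Jacobian against finite differences; the function names are hypothetical and indices are 0-based, so row and column N − 1 correspond to n = N(t).

```python
import numpy as np

def probabilities(h, n_updates):
    """Halting probabilities with N(t) held fixed: h^n for n < N, remainder at n = N."""
    head = h[: n_updates - 1]
    return np.append(head, 1.0 - head.sum())

def analytic_jacobian(n_updates):
    """J[n', n] = d p^{n'} / d h^{n}, following the three cases of Equation (18)."""
    J = np.zeros((n_updates, n_updates))
    for n in range(n_updates - 1):
        J[n, n] = 1.0                 # Kronecker delta for n', n < N(t)
        J[n_updates - 1, n] = -1.0    # remainder decreases as earlier h^n grows
    return J                          # last column stays zero: h^{N(t)} has no effect

def numeric_jacobian(h, n_updates, step=1e-6):
    """Forward finite-difference estimate of the same Jacobian."""
    J = np.zeros((n_updates, n_updates))
    base = probabilities(h, n_updates)
    for n in range(n_updates):
        bumped = h.copy()
        bumped[n] += step
        J[:, n] = (probabilities(bumped, n_updates) - base) / step
    return J
```

Since the probabilities are linear in the activations for fixed N(t), the two Jacobians should agree up to rounding error.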

3 Experiments

We tested recurrent neural networks (RNNs) with and without ACT on four synthetic tasks and one

real-world language processing task. LSTM was used as the network architecture for all experiments

except one, where a simple RNN was used. However we stress that ACT is equally applicable to

any recurrent architecture.

All the tasks were supervised learning problems with discrete targets and cross-entropy loss.

The data for the synthetic tasks was generated online and cross-validation was therefore not needed.

Similarly, the character prediction dataset was sufficiently large that the network did not overfit.

The performance metric for the synthetic tasks was the sequence error rate: the fraction of examples

where any mistakes were made in the complete output sequence. This metric is useful as it is trivial

to evaluate without decoding. For character prediction the metric was the average log-loss of the

output predictions, in units of bits per character.
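Both metrics are simple to compute. The sketch below is a minimal illustration with hypothetical function names; it assumes predictions and targets are given as sequences of discrete labels, and that the probability assigned to each correct character is available.

```python
import math

def sequence_error_rate(predictions, targets):
    """Fraction of sequences containing at least one wrong output."""
    wrong = sum(1 for p, t in zip(predictions, targets) if list(p) != list(t))
    return wrong / len(targets)

def bits_per_character(target_probs):
    """Average negative base-2 log-probability of the correct characters."""
    return -sum(math.log2(p) for p in target_probs) / len(target_probs)
```

For instance, one wrong sequence out of three gives a sequence error rate of 1/3, and correct-character probabilities of 0.5 and 0.25 give 1.5 bits per character.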

Most of the training parameters were fixed for all experiments: Adam [18] was used for optimisation

with a learning rate of $10^{-4}$; the Hogwild! algorithm [24] was used for asynchronous training

with 16 threads; the initial halting unit bias $b_h$ mentioned in Equation (5) was 1; the $\epsilon$ term from

Equation (7) was 0.01. The synthetic tasks were all trained for 1M iterations, where an iteration

is defined as a weight update on a single thread (hence the total number of weight updates is approximately

16 times the number of iterations). The character prediction task was trained for 10K

iterations. Early stopping was not used for any of the experiments.

Figure 3: Parity training Example. Each sequence consists of a single input and target vector. Only 8 of the 64 input bits

are shown for clarity.

A logarithmic grid search over time penalties was performed for each experiment, with 20 randomly

initialised networks trained for each value of τ. For the synthetic problems the grid consisted of

the values $i \times 10^{-j}$ with integer i in the range 1–10 and the exponent j in the range 1–4.

For the language modelling task, which took many days to complete, the range of j was limited to

1–3 to reduce training time (lower values of τ, which naturally induce more pondering, tend to give

greater data efficiency but slower wall clock training time).
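The grid of time penalties can be enumerated directly. The sketch below is one reading of the description (i from 1 to 10, j from 1 to the stated maximum); the function name is hypothetical, and merging duplicate values such as 10 × 10⁻² and 1 × 10⁻¹ is an assumption.

```python
from fractions import Fraction

def tau_grid(max_exponent):
    """Distinct values i * 10**-j for i in 1..10 and j in 1..max_exponent, ascending."""
    values = {
        Fraction(i, 10 ** j)          # exact arithmetic, so duplicates merge reliably
        for i in range(1, 11)
        for j in range(1, max_exponent + 1)
    }
    return [float(v) for v in sorted(values)]
```

Under these assumptions the synthetic tasks (j up to 4) have 37 distinct τ values between 0.0001 and 1, and the language modelling task (j up to 3) has 28.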

Unless otherwise stated the maximum computation time M (Equation (13)) was set to 100. In

all experiments the networks converged on learned values of N (t) that were far less than M , which

functions mainly as a safeguard against excessively long ponder times early in training.

3.1 Parity

Determining the parity of a sequence of binary numbers is a trivial task for a recurrent neural

network [27], which simply needs to implement an internal switch that changes sign every time

a one is received. For shallow feedforward networks receiving the entire sequence in one vector,

however, the number of distinct input patterns, and hence difficulty of the task, grows exponentially

with the number of bits. We gauged the ability of ACT to infer an inherently sequential algorithm

from statically presented data by presenting large binary vectors to the network and asking it to

determine the parity. By varying the number of binary bits for which parity must be calculated we

were also able to assess ACT’s ability to adapt the amount of computation to the difficulty of the

vector.

The input vectors had 64 elements, of which a random number from 1 to 64 were randomly set

to 1 or −1 and the rest were set to 0. The corresponding target was 1 if there was an odd number

of ones and 0 if there was an even number of ones. Each training sequence consisted of a single

input and target vector, an example of which is shown in Figure 3. The network architecture was

a simple RNN with a single hidden layer containing 128 tanh units and a single sigmoidal output

unit, trained with binary cross-entropy loss on minibatches of size 128. Note that without ACT the

recurrent connection in the hidden layer was never used since the data had no sequential component,

and the network reduced to a feedforward network with a single hidden layer.
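A parity example of the kind described above could be generated as follows. This is a sketch under assumptions: the text does not say which positions hold the non-zero elements, so they are chosen at random here, and the function name is hypothetical.

```python
import numpy as np

def parity_example(rng, size=64):
    """Return (input vector, target) for one parity example."""
    n_bits = int(rng.integers(1, size + 1))                   # 1..size non-zero elements
    x = np.zeros(size)
    positions = rng.choice(size, size=n_bits, replace=False)  # assumed: random positions
    x[positions] = rng.choice([-1.0, 1.0], size=n_bits)
    target = int(np.sum(x == 1.0)) % 2                        # 1 for an odd number of ones
    return x, target

rng = np.random.default_rng(0)
x, target = parity_example(rng)
```

The number of non-zero elements is the 'difficulty' of the example, so the same generator can be used to probe how ponder time changes with the number of bits.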

Figure 4 demonstrates that the network was unable to reliably solve the problem without ACT,

with a mean of almost 40% error compared to 50% for random guessing. For penalties of 0.03 and

below the mean error was below 5%. Figure 5 reveals that the solutions were both more rapid and

more accurate with lower time penalties. It also highlights the relationship between the time penalty,

the classification error rate and the average ponder time per input. The variance in ponder time

for low τ networks is very high, indicating that many correct solutions with widely varying runtime

can be discovered. We speculate that progressively higher τ values lead the network to compute

the parities of successively larger chunks of the input vector at each ponder step, then iteratively

combine these calculations to obtain the parity of the complete vector.

Figure 4: Parity Error Rates. Bar heights show the mean error rates for different time penalties at the end of training.

The error bars show the standard error in the mean.

Figure 5: Parity Learning Curves and Error Rates Versus Ponder Time. Left: faint coloured curves show the errors for

individual runs. Bold lines show the mean errors over all 20 runs for each τ value. ‘Iterations’ is the number of gradient

updates per asynchronous worker. Right: Small circles represent individual runs after training is complete, large circles

represent the mean over 20 runs for each τ value. ‘Ponder’ is the mean number of computation steps per input timestep

(minimum 1). The black dotted line shows the mean error for the networks without ACT. The height of the ellipses

surrounding the mean values represents the standard error over error rates for that value of τ, while the width shows the

standard error over ponder times.

Figure 6 shows that for the networks without ACT and those with overly high time penalties, the

error rate increases sharply with the difficulty of the task (where difficulty is defined as the number

of bits whose parity must be determined), while the amount of ponder remains roughly constant.

For the more successful networks, with intermediate τ values, ponder time appears to grow linearly

with difficulty, with a slope that generally increases as τ decreases. Even for the best networks the

error rate increased somewhat with difficulty. For some of the lowest τ networks there is a dramatic

increase in ponder after about 32 bits, suggesting an inefficient algorithm.

3.2 Logic

Like parity, the logic task tests if an RNN with ACT can sequentially process a static vector.

Unlike parity it also requires the network to internally transfer information across successive input

timesteps, thereby testing whether ACT can propagate coherent internal states.

Figure 6: Parity Ponder Time and Error Rate Versus Input Difficulty. Faint lines are individual runs, bold lines are means

over 20 networks. ‘Difficulty’ is the number of bits in the parity vectors, with a mean over 1,000 random vectors used for

each data-point.

Table 1: Binary Truth Tables for the Logic Task.

P  Q  |  NOR  Xq  ABJ  XOR  NAND  AND  XNOR  if/then  then/if  OR

T  T  |  F    F   F    F    F     T    T     T        T        T

T  F  |  F    F   T    T    T     F    F     F        T        T

F  T  |  F    T   F    T    T     F    F     T        F        T

F  F  |  T    F   F    F    T     F    T     T        T        F

Each input sequence consists of a random number from 1 to 10 of size 102 input vectors. The

first two elements of each input represent a pair of binary numbers; the remainder of the vector

is divided up into 10 chunks of size 10. The first B chunks, where B is a random number from

1 to 10, contain one-hot representations of randomly chosen numbers between 1 and 10; each of

these numbers corresponds to an index into the subset of binary logic gates whose truth tables are

listed in Table 1. The remaining 10 − B chunks were zeroed to indicate that no further binary

operations were defined for that vector. The binary target $b_{B+1}$ for each input is the truth value

yielded by recursively applying the B binary gates in the vector to the two initial bits $b_1$, $b_0$. That

is, for 1 ≤ i ≤ B:

$$b_{i+1} = T_i(b_i, b_{i-1}) \tag{21}$$

where $T_i(\cdot, \cdot)$ is the truth table indexed by chunk i in the input vector.

For the first vector in the sequence, the two input bits b0 , b1 were randomly chosen to be false (0)

or true (1) and assigned to the first two elements in the vector. For subsequent vectors, only b1 was

random, while b0 was implicitly equal to the target bit from the previous vector (for the purposes

of calculating the current target bit), but was always set to zero in the input vector. To solve the

task, the network therefore had to learn both how to calculate the sequence of binary operations

represented by the chunks in each vector, and how to carry the final output of that sequence over

to the next timestep. An example input-target sequence pair is shown in Figure 7.
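The recursive target computation of Equation (21) can be sketched as below. The truth values are copied from the rows of Table 1; treating the first two columns of each row as the gate inputs, numbering the remaining ten columns 1 to 10, and passing the more recent bit as the first gate input are all assumptions of this sketch, as are the function names.

```python
# Outputs of the ten gates for each pair of inputs, copied row by row from Table 1.
GATE_ROWS = {
    (True, True): "FFFFFTTTTT",
    (True, False): "FFTTTFFFTT",
    (False, True): "FTFTTFFTFT",
    (False, False): "TFFFTFTTTF",
}

def gate(index, p, q):
    """Apply gate number `index` (1 to 10) to the input pair (p, q)."""
    return GATE_ROWS[(p, q)][index - 1] == "T"

def logic_target(b0, b1, gate_indices):
    """Apply the listed gates in order, each to the two most recent bits."""
    prev, curr = b0, b1
    for index in gate_indices:
        prev, curr = curr, gate(index, curr, prev)
    return curr

# For later vectors in a sequence, b0 would be the previous vector's target.
first_target = logic_target(False, True, [10, 4])
second_target = logic_target(first_target, True, [6])
```

The last two lines illustrate how the target of one vector is carried over as the implicit first bit of the next, which is the part of the task that requires a coherent internal state.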

The network architecture was a single-layer LSTM with 128 cells. The output was a single sigmoidal

unit, trained with binary cross-entropy, and the minibatch size was 16.

Figure 8 shows that the network reaches a minimum sequence error rate of around 0.2 without

ACT (compared to 0.5 for random guessing), and virtually zero error for all τ ≤ 0.01. From Figure 9

it can be seen that low τ ACT networks solve the task very quickly, requiring about 10,000 training

iterations. For higher τ values ponder time reduces to 1, at which point the networks trained with

ACT behave identically to those without. For lower τ values, the spread of ponder values, and

hence computational cost, is quite large. Again we speculate that this is due to the network learning

more or less ‘chunked’ solutions in which composite truth tables are learned for multiple successive

logic operations. This is somewhat supported by the clustering of the lowest τ networks around a

ponder time of 5–6, which is approximately the mean number of logic gates applied per sequence,

and hence the minimum number of computations the network would need if calculating single binary

operations at a time.

Figure 7: Logic training Example. Both the input and target sequences consist of 3 vectors. For simplicity only 2 of the 10

possible logic gates represented in the input are shown, and each is restricted to one of the first 3 gates in Table 1 (NOR,

Xq, and ABJ). The segmentation of the input vectors is shown on the left and the recursive application of Equation (21)

required to determine the targets (and subsequent $b_0$ values) is shown in italics above the target vectors.

Figure 10 shows a surprisingly high ponder time for the least difficult inputs, with some networks

taking more than 10 steps to evaluate a single logic gate. From 5 to 10 logic gates, ponder gradually

increases with difficulty as expected, suggesting that a qualitatively different solution is learned for

the two regimes. This is supported by the error rates for the non ACT and high τ networks, which

increase abruptly after 5 gates. It may be that 5 is the upper limit on the number of successive

gates the network can learn as a single composite operation, and thereafter it is forced to apply an

iterative algorithm.

3.3 Addition

The addition task presents the network with an input sequence of 1 to 5 size 50 input vectors. Each

vector represents a D digit number, where D is drawn randomly from 1 to 5, and each digit is drawn

randomly from 0 to 9. The first 10D elements of the vector are a concatenation of one-hot encodings

of the D digits in the number, and the remainder of the vector is set to 0. The required output

is the cumulative sum of all inputs up to the current one, represented as a set of 6 simultaneous

classifications for the 6 possible digits in the sum. There is no target for the first vector in the

sequence, as no sums have yet been calculated. Because the previous sum must be carried over by

the network, this task again requires the internal state of the network to remain coherent. Each

classification is modelled by a size 11 softmax, where the first 10 classes are the digits and the 11th

is a special marker used to indicate that the number is complete. An example input-target pair is

shown in Figure 11.
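The input and target encodings for the addition task can be sketched as follows. The digit order (most significant digit first) and the function names are assumptions; the vector sizes, the 6 target positions and the use of an 11th class as the end marker follow the description above.

```python
import numpy as np

def encode_input(digits, max_digits=5):
    """Concatenate one-hot encodings of the digits into a size 10 * max_digits vector."""
    x = np.zeros(10 * max_digits)
    for k, d in enumerate(digits):
        x[10 * k + d] = 1.0          # one-hot block k encodes digit d
    return x                          # unused trailing blocks stay zero

def encode_target(total, n_positions=6, end_class=10):
    """Class index per output position: the digits of the sum, then the end marker."""
    digits = [int(c) for c in str(total)]
    return digits + [end_class] * (n_positions - len(digits))

x = encode_input([3, 0, 7])           # the 3 digit number 307
target = encode_target(1234)          # a cumulative sum of 1234
```

Six positions are enough because five inputs of at most 99999 sum to at most 499995.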

The network was a single-layer LSTM with 512 memory cells. The loss function was the joint

cross-entropy of all 6 targets at each time-step where targets were present and the minibatch size

was 32. The maximum ponder M was set to 20 for this task, as it was found that some networks

had very high ponder times early in training.

Figure 9: Logic Learning Curves and Error Rates Versus Ponder Time.

Figure 10: Logic Ponder Time and Error Rate Versus Input Difficulty. ‘Difficulty’ is the number of logic gates in each

input vector; all sequences were length 5.

Figure 11: Addition training Example. Each digit in the input sequence is represented by a size 10 one hot encoding.

Unused input digits, marked ‘-’, are represented by a vector of 10 zeros. The black vector at the start of the target sequence

indicates that no target was required for that step. The target digits are represented as 1-of-11 classes, where the 11th

class, marked ‘*’, is used for digits beyond the end of the target number.

Figure 12: Addition Error Rates.

Figure 13: Addition Learning Curves and Error Rates Versus Ponder Time.

The results in Figure 12 show that the task was perfectly solved by the ACT networks for all

values of τ in the grid search. Unusually, networks with higher τ solved the problem with fewer

training examples. Figure 14 demonstrates that the relationship between the ponder time and the

number of digits was approximately linear for most of the ACT networks, and that for the most

efficient networks (with the highest τ values) the slope of the line was close to 1, which matches our

expectations that an efficient long addition algorithm should need one computation step per digit.

Figure 15 shows how the ponder time is distributed during individual addition sequences, pro-

viding further evidence of an approximately linear-time long addition algorithm.

3.4 Sort

The sort task requires the network to sort sequences of 2 to 15 numbers drawn from a standard

normal distribution in ascending order. The experiments considered so far have been designed to

favour ACT by compressing sequential information into single vectors, and thereby requiring the

use of multiple computation steps to unpack them. For the sort task a more natural sequential

representation was used: the random numbers were presented one at a time as inputs, and the

required output was the sequence of indices into the number sequence placed in sorted order; an

example is shown in Figure 16. We were particularly curious to see how the number of ponder steps

scaled with the number of elements to be sorted, knowing that efficient sorting algorithms have

O(N log N ) computational cost.
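A single sort example can be generated as in the sketch below. The layout of the inputs is an assumption: each input is taken to be a pair of one real value and one end-of-sequence flag, with the flag set on the last element, and the inputs presented while the targets are emitted are omitted. The function name is hypothetical.

```python
import numpy as np

def sort_example(rng, min_len=2, max_len=15):
    """Return (inputs, targets): standard normal values and their indices in sorted order."""
    length = int(rng.integers(min_len, max_len + 1))
    values = rng.standard_normal(length)
    inputs = np.zeros((length, 2))
    inputs[:, 0] = values
    inputs[-1, 1] = 1.0               # end-of-sequence flag (assumed placement)
    targets = np.argsort(values)      # indices of the inputs in ascending order
    return inputs, targets

rng = np.random.default_rng(0)
inputs, targets = sort_example(rng)
```

Indexing the input values with `targets` yields the sequence in ascending order, which is what the size 15 softmax is trained to reproduce one index at a time.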

The network was a single-layer LSTM with 512 cells. The output layer was a size 15 softmax,

trained with cross-entropy to classify the indices of the sorted inputs. The minibatch size was 16.

Figure 14: Addition Ponder Time and Error Rate Versus Input Difficulty. ‘Difficulty’ is the number of digits in each input

vector; all sequences were length 3.

Figure 15: Ponder Time During Three Addition Sequences. The input sequence is shown along the bottom x-axis and

the network output sequence is shown along the top x-axis. The ponder time $\rho_t$ at each input step is shown by the black

lines; the actual number of computational steps taken at each point is $\rho_t$ rounded up to the next integer. The grey lines

show the total number of digits in the two numbers being summed at each step; this appears to give a rough lower bound

on the ponder time, suggesting an internal algorithm that is approximately linear in the number of digits. All plots were

created using the same network, trained with τ = 9e−4.

Figure 17 shows that the advantage of using ACT is less dramatic for this task than the previous

three, but still substantial (from around 12% error without ACT to around 6% for the best τ value).

However from Figure 18 it is clear that these gains come at a heavy computational cost, with the best

networks requiring roughly 9 times as much computation as those without ACT. Not surprisingly,

Figure 19 shows that the error rate grew rapidly with the sequence length for all networks. It

also indicates that the better networks had a sublinear growth in computations per input step with

sequence length, though whether this indicates a logarithmic time algorithm is unclear. One problem

with the sort task was that the Gaussian samples were sometimes very close together, making it hard

for the network to determine which was greater; enforcing a minimum separation between successive

values would probably be beneficial.

Figure 20 shows the ponder time during three sort sequences of varying length. As can be seen,

there is a large spike in ponder time near (though not precisely at) the end of the input sequence,

presumably when the majority of the sort comparisons take place. Note that the spike is much higher

for the longer two sequences than the length 5 one, again pointing to an algorithm that is nonlinear

in sequence length (the average ponder per timestep is nonetheless lower for longer sequences, as

little pondering is done away from the spike).

Figure 16: Sort training Example. Each size 2 input vector consists of one real number and one binary flag to indicate the

end of sequence to be sorted; inputs following the sort sequence are set to zero and marked in black. No targets are present

until after the sort sequence; thereafter the size 15 target vectors represent the sorted indices of the input sequence.

Figure 18: Sort Learning Curves and Error Rates Versus Ponder Time.

Figure 19: Sort Ponder Time and Error Rate Versus Input Difficulty. ‘Difficulty’ is the length of the sequence to be

sorted.

Figure 20: Ponder Time During Three Sort Sequences. The input sequences to be sorted are shown along the bottom

x-axes and the network output sequences are shown along the top x-axes. All plots created using the same network, trained

with τ = $10^{-3}$.

3.5 Wikipedia Character Prediction

The Wikipedia task is character prediction on text drawn from the Hutter prize Wikipedia dataset [15].

Following previous RNN experiments on the same data [8], the raw unicode text was used, including

XML tags and markup characters, with one byte presented per input timestep and the next byte

predicted as a target. No validation set was used for early stopping, as the networks were unable to

overfit the data, and all error rates are recorded on the training set. Sequences of 500 consecutive

bytes were randomly chosen from the training set and presented to the network, whose internal state

was reset to 0 at the start of each sequence.

LSTM networks were used with a single layer of 1500 cells and a size 256 softmax classification

layer. As can be seen from Figures 21 and 22, the error rates are fairly similar with and without

ACT, and across values of τ (although the learning curves suggest that the ACT networks are

somewhat more data efficient). Furthermore the amount of ponder per input is much lower than for

the other problems, suggesting that the advantages of extra computation were slight for this task.

However Figure 23 reveals an intriguing pattern of ponder allocation while processing a sequence.

Character prediction networks trained with ACT consistently pause at spaces between words, and

pause for longer at ‘boundary’ characters such as commas and full stops. We speculate that the extra

computation is used to make predictions about the next ‘chunk’ in the data (word, sentence, clause),

much as humans have been found to do in self-paced reading experiments [16]. This suggests that

ACT could be useful for inferring implicit boundaries or transitions in sequence data. Alternative

measures for inferring transitions include the next-step prediction loss and predictive entropy, both

of which tend to increase during harder predictions. However, as can be seen from the figure, they

are a less reliable indicator of boundaries, and are not likely to increase at points such as full stops

and commas, as these are invariably followed by space characters. More generally, loss and entropy

only indicate the difficulty of the current prediction, not the degree to which the current input is

likely to impact future predictions.

Figure 22: Wikipedia Learning Curves (Zoomed) and Error Rates Versus Ponder Time.

Figure 23: Ponder Time, Prediction loss and Prediction Entropy During a Wikipedia Text Sequence. Plot created using

a network trained with τ = 6e−3.

Furthermore Figure 24 reveals that, as well as being an effective detector of non-text transition

markers such as the opening brackets of XML tags, ACT does not increase computation time during

random or fundamentally unpredictable sequences like the two ID numbers. This is unsurprising,

as doing so will not improve its predictions. In contrast, both entropy and loss are inevitably high

for unpredictable data. We are therefore hopeful that computation time will provide a better way

to distinguish between structure and noise (or at least data perceived by the network as structure

or noise) than existing measures of predictive difficulty.

Figure 24: Ponder Time, Prediction loss and Prediction Entropy During a Wikipedia Sequence Containing XML Tags.

Created using the same network as Figure 23.

4 Conclusion

This paper has introduced Adaptive Computation Time (ACT), a method that allows recurrent

neural networks to learn how many updates to perform for each input they receive. Experiments on

synthetic data prove that ACT can make otherwise inaccessible problems straightforward for RNNs

to learn, and that it is able to dynamically adapt the amount of computation it uses to the demands

of the data. An experiment on real data suggests that the allocation of computation steps learned

by ACT can yield insight into both the structure of the data and the computational demands of

predicting it.

ACT promises to be particularly interesting for recurrent architectures containing soft attention

modules [2, 10, 34, 12], which it could enable to dynamically adapt the number of glances or internal

operations they perform at each time-step.

One weakness of the current algorithm is that it is quite sensitive to the time penalty parameter

that controls the relative cost of computation time versus prediction error. An important direction

for future work will be to find ways of automatically determining and adapting the trade-off between

accuracy and speed.

Acknowledgments

The author wishes to thank Ivo Danihelka, Greg Wayne, Tim Harley, Malcolm Reynolds, Jacob

Menick, Oriol Vinyals, Joel Leibo, Koray Kavukcuoglu and many others on the DeepMind team for

valuable comments and suggestions, as well as Albert Zeyer, Martin Abadi, Dario Amodei, Eugene

Brevdo and Christopher Olah for pointing out the discontinuity in the ponder cost, which was

erroneously described as smooth in an earlier version of the paper.

References

[1] G. An. The effects of adding noise during backpropagation training on a generalization perfor-

mance. Neural Computation, 8(3):643–674, 1996.

[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align

and translate. abs/1409.0473, 2014.

[3] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup. Conditional computation in neural networks

for faster models. arXiv preprint arXiv:1511.06297, 2015.

[4] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image

classification. In arXiv:1202.2745v1 [cs.CV], 2012.


[5] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks

for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Trans-

actions on, 20(1):30 –42, jan. 2012.

[6] L. Denoyer and P. Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510,

2014.

[7] S. Eslami, N. Heess, T. Weber, Y. Tassa, K. Kavukcuoglu, and G. E. Hinton. Attend, infer,

repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575,

2016.

[8] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint

arXiv:1308.0850, 2013.

[9] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural net-

works. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Con-

ference on, pages 6645–6649. IEEE, 2013.

[10] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint

arXiv:1410.5401, 2014.

[11] E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom. Learning to transduce with

unbounded memory. In Advances in Neural Information Processing Systems, pages 1819–1827,

2015.

[12] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. Draw: A recurrent neural network for

image generation. arXiv preprint arXiv:1502.04623, 2015.

[13] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the

difficulty of learning long-term dependencies, 2001.

[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–

1780, 1997.

[15] M. Hutter. Universal artificial intelligence. Springer, 2005.

[16] M. A. Just, P. A. Carpenter, and J. D. Woolley. Paradigms and processes in reading compre-

hension. Journal of experimental psychology: General, 111(2):228, 1982.

[17] N. Kalchbrenner, I. Danihelka, and A. Graves. Grid long short-term memory. arXiv preprint

arXiv:1507.01526, 2015.

[18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint

arXiv:1412.6980, 2014.

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional

neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[20] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. arXiv

preprint arXiv:1405.4053, 2014.

[21] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer

Science & Business Media, 2013.

[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations

of words and phrases and their compositionality. In Advances in neural information processing

systems, pages 3111–3119, 2013.


[23] B. A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse

code for natural images. Nature, 381(6583):607–609, 1996.

[24] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic

gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

2015.

[26] J. Schmidhuber. Self-delimiting neural networks. arXiv preprint arXiv:1210.0118, 2012.

[27] J. Schmidhuber and S. Hochreiter. Guessing can outperform many long time lag algorithms.

Technical report, 1996.

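[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple

way to prevent neural networks from overfitting. The Journal of Machine Learning Research,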

15(1):1929–1958, 2014.

[29] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in

Neural Information Processing Systems, pages 2368–2376, 2015.

[30] R. K. Srivastava, B. R. Steunebrink, and J. Schmidhuber. First experiments with powerplay.

Neural Networks, 41:130–136, 2013.

[31] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in

Neural Information Processing Systems, pages 2431–2439, 2015.

[32] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks.

arXiv preprint arXiv:1409.3215, 2014.

[33] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. arXiv

preprint arXiv:1511.06391, 2015.

[34] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information

Processing Systems, pages 2674–2682, 2015.

[35] A. J. Wiles. Modular elliptic curves and fermats last theorem. ANNALS OF MATH, 141:141,

1995.

[36] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and

their computational complexity. Back-propagation: Theory, architectures and applications,

pages 433–486, 1995.


DeepMath - Deep Sequence Models for Premise

Selection

Google Inc. Google Inc. Google Inc.

alemi@google.com fchollet@google.com een@google.com

arXiv:1606.04442v2 [cs.AI] 26 Jan 2017

Google Inc. Google Inc. Czech Technical University in Prague

geoffreyi@google.com szegedy@google.com josef.urban@gmail.com

Abstract

automated theorem proving, one of the main bottlenecks in the formalization of

mathematics. We propose a two stage approach for this task that yields good

results for the premise selection task on the Mizar corpus while avoiding the hand-

engineered features of existing state-of-the-art models. To our knowledge, this is

the first time deep learning has been applied to theorem proving on a large scale.

1 Introduction

Mathematics underpins all scientific disciplines. Machine learning itself rests on measure and

probability theory, calculus, linear algebra, functional analysis, and information theory. Complex

mathematics underlies computer chips, transit systems, communication systems, and financial infras-

tructure – thus the correctness of many of these systems can be reduced to mathematical proofs.

Unfortunately, these correctness proofs are often impractical to produce without automation, and

present-day computers have only limited ability to assist humans in developing mathematical proofs

and formally verifying human proofs. There are two main bottlenecks: (1) lack of automated methods

for semantic or formal parsing of informal mathematical texts (autoformalization), and (2) lack of

strong automated reasoning methods to fill in the gaps in already formalized human-written proofs.

The two bottlenecks are related. Strong automated reasoning can act as a semantic filter for autoformal-

ization, and successful autoformalization would provide a large corpus of computer-understandable

facts, proofs, and theory developments. Such a corpus would serve as both background knowledge to

fill in gaps in human-level proofs and as a training set to guide automated reasoning. Such guidance

is crucial: exhaustive deductive reasoning tools such as today’s resolution/superposition automated

theorem provers (ATPs) quickly hit combinatorial explosion, and are unusable when reasoning with a

very large number of facts without careful selection [4].

In this work, we focus on the latter bottleneck. We develop deep neural networks that learn from a

large repository of manually formalized computer-understandable proofs. We learn the task that is

essential for making today’s ATPs usable over large formal corpora: the selection of a limited number

of most relevant facts for proving a new conjecture. This is known as premise selection.

The main contributions of this work are:

∗ Authors listed alphabetically. All contributions are considered equal.

† Supported by ERC Consolidator grant nr. 649043 AI4REASON.

• A demonstration for the first time that neural network models are useful for aiding in large

scale automated logical reasoning without the need for hand-engineered features.

• The comparison of various network architectures (including convolutional, recurrent and

hybrid models) and their effect on premise selection performance.

• A method of semantic-aware “definition”-embeddings for function symbols that improves

the generalization of formulas with symbols occurring infrequently. This model outperforms

previous approaches.

• Analysis showing that neural network based premise selection methods are complementary

to those with hand-engineered features: ensembling with previous results produces superior

results.

In the last two decades, large corpora of complex mathematical knowledge have been formalized:

encoded in complete detail so that computers can fully understand the semantics of complicated

mathematical objects. The process of writing such formal and verifiable theorems, definitions, proofs,

and theories is called Interactive Theorem Proving (ITP).

The ITP field dates back to the 1960s [16] and the Automath system by N.G. de Bruijn [9]. ITP systems

include HOL (Light) [15], Isabelle [37], Mizar [13], Coq [7], and ACL2 [23]. The development of

ITP has been intertwined with the development of its cousin field of Automated Theorem Proving

(ATP) [31], where proofs of conjectures are attempted fully automatically. Unlike ATP systems,

ITP systems allow human-assisted formalization and proving of theorems that are often beyond the

capabilities of the fully automated systems.

Large ITP libraries include the Mizar Mathematical Library (MML) with over 50,000 lemmas, and

the core Isabelle, HOL, Coq, and ACL2 libraries with thousands of lemmas. These core libraries are a

basis for large projects in formalized mathematics and software and hardware verification. Examples

in mathematics include the HOL Light proof of the Kepler conjecture (Flyspeck project) [14], the

Coq proofs of the Feit-Thompson theorem [12] and Four Color theorem [11], and the verification of

most of the Compendium of Continuous Lattices in Mizar [2]. ITP verification of the seL4 kernel [25]

and CompCert compiler [27] show comparable progress in large scale software verification. While

these large projects mark a coming of age of formalization, ITP remains labor-intensive. For example,

Flyspeck took about 20 person-years, with twice as much for Feit-Thompson. Behind this cost are

our two bottlenecks: lack of tools for autoformalization and strong proof automation.

Recently the field of Automated Reasoning in Large Theories (ARLT) [35] has developed, including

AI/ATP/ITP (AITP) systems called hammers that assist ITP formalization [4]. Hammers analyze

the full set of theorems and proofs in the ITP libraries, estimate the relevance of each theorem, and

apply optimized translations from the ITP logic to simpler ATP formalism. Then they attack new

conjectures using the most promising combinations of existing theorems and ATP search strategies.

Recent evaluations have proved 40% of all Mizar and Flyspeck theorems fully automatically [20, 21].

However, there is significant room for improvement: with perfect premise selection (a perfect choice

of library facts) ATPs can prove at least 56% of Mizar and Flyspeck instead of today’s 40% [4]. In

the next section we explain the premise selection task and the experimental setting for measuring

such improvements.

Given a formal corpus of facts and proofs expressed in an ATP-compatible format, our task is

Definition (Premise selection problem). Given a large set of premises P, an ATP system A with

given resource limits, and a new conjecture C, predict those premises from P that will most likely

lead to an automatically constructed proof of C by A.

We use the Mizar Mathematical Library (MML) version 4.181.1147 (footnote 3) as the formal corpus and E

prover [32] version 1.9 as the underlying ATP system. The following list exemplifies a small non-

Footnote 3: ftp://mizar.uwb.edu.pl/pub/system/i386-linux/mizar-7.13.01_4.181.1147-i386-linux.tar

:: t99_jordan: Jordan curve theorem in Mizar

for C being Simple_closed_curve holds C is Jordan;

fof(t99_jordan, axiom, (! [A] : ( (v1_topreal2(A) & m1_subset_1(A,

k1_zfmisc_1(u1_struct_0(k15_euclid(2))))) => v1_jordan1(A)) ) ).

Figure 1: (top) The final statement of the Mizar formalization of the Jordan curve theorem. (bottom) The

translation to first-order logic, using name mangling to ensure uniqueness across the entire corpus.

(a) Length in chars. (b) Length in words. (c) Word occurrences. (d) Dependencies.

Figure 2: Histograms of statement lengths, occurrences of each word, and statement dependencies in the

Mizar corpus translated to first order logic. The wide length distribution poses difficulties for RNN models and

batching, and many rarely occurring words make it important to take definitions of words into account.

representative sample of topics and theorems that are included in the Mizar Mathematical Library:

Cauchy-Riemann Differential Equations of Complex Functions, Characterization and Existence of

Gröbner Bases, Maximum Network Flow Algorithm by Ford and Fulkerson, Gödel’s Completeness

Theorem, Brouwer Fixed Point Theorem, Arrow’s Impossibility Theorem, Borsuk-Ulam Theorem,

Dickson’s Lemma, Sylow Theorems, Hahn Banach Theorem, The Law of Quadratic Reciprocity,

Pepin’s Primality Test for Public-Key Cryptography, Ramsey’s Theorem.

This version of MML was used for the latest AITP evaluation reported in [21]. There are 57,917

proved Mizar theorems and unnamed top-level lemmas in this MML organized into 1,147 articles.

This set is chronologically ordered by the order of articles in MML and by the order of theorems in

the articles. Proofs of later theorems can only refer to earlier theorems. This ordering also applies

to 88,783 other Mizar formulas (encoding the type system and other automation known to Mizar)

used in the problems. The formulas have been translated into first-order logic formulas by the MPTP

system [34] (see Figure 1).

Our goal is to automatically prove as many theorems as possible, using at each step all previous

theorems and proofs. We can learn from both human proofs and ATP proofs, but previous experi-

ments [26, 20] show that learning only from the ATP proofs is preferable to including human proofs

if the set of ATP proofs is sufficiently large. Since for 32,524 (56.2%) of the 57,917 theorems an ATP

proof was previously found by a combination of manual and learning-based premise selection [21],

we use only these ATP proofs for training.

The 40% success rate from [21] used a portfolio of 14 AITP methods using different learners, ATPs,

and numbers of premises. The best single method proved 27.3% of the theorems. Only fast and

simple learners such as k-nearest-neighbors, naive Bayes, and their ensembles were used, based on

hand-crafted features such as the set of (normalized) sub-terms and symbols in each formula.

Strong premise selection requires models capable of reasoning over mathematical statements, here

encoded as variable-length strings of first-order logic. In natural language processing, deep neural net-

works have proven useful in language modeling [28], text classification [8], sentence pair scoring [3],

conversation modeling [36], and question answering [33]. These results have demonstrated the ability

of deep networks to extract useful representations from sequential inputs without hand-tuned feature

engineering. Neural networks can also mimic some higher-level reasoning on simple algorithmic

tasks [38, 18].


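[Figure 3 diagram labels. Left: axiom first-order logic sequence and conjecture first-order logic sequence; fully connected layer with 1024 outputs; fully connected layer with 1 output; logistic loss. Right: input characters "! [ A , B ] : ( g t a ..."; Ux+c applied at each position; maximum.]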

Figure 3: (left) Our network structure. The input sequences are either character-level (section 5.1) or word-level

(section 5.2). We use separate models to embed conjecture and axiom, and a logistic layer to predict whether the

axiom is useful for proving the conjecture. (right) A convolutional model.

The Mizar data set is also an interesting case study in neural network sequence tasks, as it differs

from natural language problems in several ways. It is highly structured with a simple context free

grammar – the interesting task occurs only after parsing. The distribution of lengths is wide, ranging

from 5 to 84,299 characters with mean 304.5, and from 2 to 21,251 tokens with mean 107.4 (see

Figure 2). Fully recurrent models would have to back-propagate through 100s to 1000s of characters

or 100s of tokens to embed a whole statement. Finally, there are many rare words – 60.3% of the

words occur fewer than 10 times – motivating the definition-aware embeddings in section 5.2.

The full premise selection task takes a conjecture and a set of axioms and chooses a subset of

axioms to pass to the ATP. We simplify from subset selection to pairwise relevance by predicting the

probability that a given axiom is useful for proving a given conjecture. This approach depends on a

relatively sparse dependency graph. Our general architecture is shown in Figure 3(left): the conjecture

and axiom sequences are separately embedded into fixed length real vectors, then concatenated and

passed to a third network with two fully connected layers and logistic loss. During training time, the

two embedding networks and the joined predictor path are trained jointly.

As discussed in section 3, we train our models on premise selection data generated by a combination

of various methods, including k-nearest-neighbor search on hand-engineered similarity metrics. We

start with a first stage of character-level models, and then build second and later stages of word-level

models on top of the results of earlier stages.

We begin by avoiding special purpose engineering by treating formulas on the character-level using

an 80 dimensional one-hot encoding of the character sequence. These sequences are passed to a

weight shared network for variable length input. For the embedding computation, we have explored

the following architectures:

1. Pure recurrent LSTM [17] and GRU [6] networks.

2. A pure multi-layer convolutional network with various numbers of convolutional layers (with strides)

followed by a global temporal max-pooling reduction (see Figure 3(right)).

3. A recurrent-convolutional network that uses convolutional layers to produce a shorter sequence, which

is processed by an LSTM.
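The second architecture in the list above (strided convolutions followed by a global temporal max-pooling reduction) can be sketched in plain Python. This is only an illustration under stated assumptions: the 80-symbol alphabet, the kernel width of 5, the channel counts, the stride of 2 and the random weights are all hypothetical choices, not the configuration used in the experiments. The point it shows is that the max over time turns a variable-length character sequence into a fixed-length vector.

```python
import random
import string

# Assumed 80-symbol alphabet; the paper does not list the characters it uses.
ALPHABET = string.ascii_letters + string.digits + " _!?[]():,&|~=>.-$"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(text):
    """Encode each known character as an 80-dimensional one-hot vector."""
    vectors = []
    for ch in text:
        if ch in CHAR_INDEX:  # characters outside the assumed alphabet are skipped
            vec = [0.0] * len(ALPHABET)
            vec[CHAR_INDEX[ch]] = 1.0
            vectors.append(vec)
    return vectors


def random_layer(rng, n_in, n_out, width):
    """Random weights[o][k][i] and zero biases for one convolutional layer."""
    weights = [[[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                for _ in range(width)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def conv1d_relu(seq, weights, bias, stride):
    """Strided 1-D convolution over time followed by ReLU."""
    width, n_in = len(weights[0]), len(weights[0][0])
    seq = seq + [[0.0] * n_in] * max(0, width - len(seq))  # zero-pad short inputs
    out = []
    for start in range(0, len(seq) - width + 1, stride):
        frame = []
        for w_o, b_o in zip(weights, bias):
            total = b_o
            for k in range(width):
                total += sum(w * x for w, x in zip(w_o[k], seq[start + k]))
            frame.append(max(0.0, total))
        out.append(frame)
    return out


def embed(text, layers, stride=2):
    """Stack the convolutional layers, then take the maximum over time per channel."""
    seq = one_hot(text)
    for weights, bias in layers:
        seq = conv1d_relu(seq, weights, bias, stride)
    return [max(channel) for channel in zip(*seq)]


rng = random.Random(0)
layers = [random_layer(rng, 80, 16, 5), random_layer(rng, 16, 8, 5)]
short = embed("! [A] : v1_jordan1(A)", layers)
long = embed("! [A, B] : (v1_topreal2(A) & m1_subset_1(A, B)) => v1_jordan1(A)", layers)
```

Both `short` and `long` have length 8 (the number of channels in the last layer), regardless of how many characters went in.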

It is computationally prohibitive to compute a large number of (conjecture, axiom) pairs due to the

costly embedding phase. Fortunately, our architecture allows caching the embeddings for conjectures

and axioms and evaluating the shared portion of the network for a given pair. This makes it practical

to consider all pairs during evaluation.
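The caching argument can be made concrete with a small sketch. Everything here is a stand-in: `toy_embed` replaces the expensive embedding networks, and the logistic `pair_score` with hand-picked weights replaces the trained classifier. What the sketch illustrates is only the cost structure: each statement is embedded once, and only the cheap pairwise scorer runs for every (conjecture, axiom) pair.

```python
import math


class CachedEmbedder:
    """Wraps an embedding function and remembers results per statement."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.calls = 0  # number of times the expensive function actually ran

    def __call__(self, statement):
        if statement not in self.cache:
            self.calls += 1
            self.cache[statement] = self.embed_fn(statement)
        return self.cache[statement]


def toy_embed(statement, dim=4):
    """Hypothetical stand-in for an embedding network: bucketed character frequencies."""
    vec = [0.0] * dim
    for ch in statement:
        vec[ord(ch) % dim] += 1.0
    n = max(1, len(statement))
    return [v / n for v in vec]


def pair_score(conj_vec, axiom_vec, weights, bias):
    """Logistic score on the concatenated embeddings (the cheap, per-pair part)."""
    x = conj_vec + axiom_vec
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def rank_axioms(conjecture, axioms, embed_conj, embed_axiom, weights, bias):
    """Return the axioms sorted by decreasing score for one conjecture."""
    c = embed_conj(conjecture)
    scored = [(pair_score(c, embed_axiom(a), weights, bias), a) for a in axioms]
    return [a for _, a in sorted(scored, key=lambda item: -item[0])]


# Separate embedders for conjectures and axioms, as in the architecture above.
embed_conj = CachedEmbedder(toy_embed)
embed_axiom = CachedEmbedder(toy_embed)
weights, bias = [0.5, -0.2, 0.1, 0.3, -0.4, 0.2, 0.6, -0.1], 0.0
axioms = ["v1_topreal2(A)", "m1_subset_1(A, B)", "v1_jordan1(A)"]
for conjecture in ["c1(A)", "c2(B)"]:
    rank_axioms(conjecture, axioms, embed_conj, embed_axiom, weights, bias)
```

After scoring all six pairs, the axiom embedder has run three times and the conjecture embedder twice, rather than six times each.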

The character-level models are limited to word and structure similarity within the axiom or conjecture

being embedded. However, many of the symbols occurring in a formula are defined by formulas


earlier in the corpus, and we can use the axiom-embeddings of those symbols to improve model

performance.

Since Mizar is based on first-order set theory, definitions of symbols can be either explicit or implicit.

An explicit definition of x sets x = e for some expression e, while an implicit definition states a

property of the defined object, such as defining a function f (x) by ∀x.f (f (x)) = g(x). To avoid

manually encoding the structure of implicit definitions, we embed the entire statement defining a

symbol f , and then use the stage 1 axiom-embedding corresponding to the whole statement as a

word-level embedding.

Ideally, we would train a single network that embeds statements by recursively expanding and

embedding the definitions of the defined symbols. Unfortunately, this recursion would dramatically

increase the cost of training since the definition chains can be quite deep. For example, Mizar defines

real numbers in terms of non-negative reals, which are defined as Dedekind cuts of non-negative

rationals, which are defined as ratios of naturals, etc. As an inexpensive alternative, we reuse the

axiom embeddings computed by a previously trained character-level model, mapping each defined

symbol to the axiom embedding of its defining statement. Other tokens such as brackets and operators

are mapped to fixed pseudo-random vectors of the same dimension.

Since we embed one token at a time ignoring the grammatical structure, our approach does not require

a parser: a trivial lexer is implemented in a few lines of Python. With word-level embeddings, we use

the same architectures with shorter input sequence to produce axiom and conjecture embeddings for

ranking the (conjecture, axiom) pairs. Iterating this approach by using the resulting, stronger axiom

embeddings as word embeddings multiple times for additional stages did not yield measurable gains.
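A lexer of the kind mentioned above, together with the token-to-vector mapping, might look as follows. This is a sketch and not the authors' code: the regular expression, the `definition_embeddings` dictionary and the way fixed pseudo-random vectors are derived (seeding a generator with the token text) are assumptions made for illustration.

```python
import random
import re

# Identifiers, the multi-character connectives, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|<=>|=>|[^\sA-Za-z0-9_]")


def tokenize(formula):
    """Split a first-order formula into tokens, ignoring grammatical structure."""
    return TOKEN_RE.findall(formula)


def pseudo_random_vector(token, dim):
    """Fixed vector for a token without a definition (same token, same vector)."""
    rng = random.Random(token)  # seeding with the token text keeps it reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_tokens(formula, definition_embeddings, dim):
    """Map defined symbols to their stage-1 axiom embedding, other tokens to fixed vectors."""
    vectors = []
    for tok in tokenize(formula):
        if tok in definition_embeddings:
            vectors.append(definition_embeddings[tok])
        else:
            vectors.append(pseudo_random_vector(tok, dim))
    return vectors


# Hypothetical stage-1 embedding for one defined symbol.
definition_embeddings = {"v1_jordan1": [0.1, 0.2, 0.3]}
tokens = tokenize("(v1_topreal2(A) & m1_subset_1(A,B)) => v1_jordan1(A)")
vectors = embed_tokens("v1_jordan1(A)", definition_embeddings, dim=3)
```

The resulting sequence of vectors is much shorter than the character sequence and can be fed to the same embedding architectures.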

6 Experiments

6.1 Experimental Setup

For training and evaluation we use a subset of 32,524 out of 57,917 theorems that are known to

be provable by an ATP given the right set of premises. We split off a random 10% of these (3,124

statements) for testing and validation. Also, we held out 400 statements from the 3,124 for monitoring

training progress, as well as for model and checkpoint selection. Final evaluation was done on the

remaining 2,724 conjectures. Note that we only held out conjectures, but we trained on all statements

as axioms. This is comparable to our k-NN baseline which is also trained on all statements as axioms.

The randomized selection of the training and testing sets may also lead to learning from future proofs:

a proof P_j of theorem T_j written after theorem T_i may guide the premise selection for T_i. However,

previous k-NN experiments show similar performance between a full 10-fold cross-validation and

incremental evaluation as long as chronologically preceding formulas participate in proofs of only

later theorems.

6.2 Metrics

For each conjecture, our models output a ranking of possible premises. Our primary metric is the

number of conjectures proved from the top-k premises, where k = 16, 32, . . . , 1024. This metric can

accommodate alternative proofs but is computationally expensive. Therefore we additionally measure

the ranking quality using the average maximum relative rank of the testing premise set. Formally,

average max relative rank is

$$\mathrm{aMRR} = \operatorname*{mean}_{C} \; \max_{P \in P_{\mathrm{test}}(C)} \frac{\operatorname{rank}(P, P_{\mathrm{avail}}(C))}{|P_{\mathrm{avail}}(C)|}$$

where $C$ ranges over conjectures, $P_{\mathrm{avail}}(C)$ is the set of premises available to prove $C$, $P_{\mathrm{test}}(C)$ is the

set of premises for conjecture $C$ from the test set, and $\operatorname{rank}(P, P_{\mathrm{avail}}(C))$ is the rank of premise $P$

among the set $P_{\mathrm{avail}}(C)$ according to the model. The motivation for aMRR is that conjectures are

easier to prove if all their dependencies occur early in the ranking.

Since it is too expensive to rank all axioms for a conjecture during continuous evaluation, we

approximate our objective. For our holdout set of 400 conjectures, we select all true dependencies

Ptest (C) and 128 fixed random false dependencies from Pavail (C) − Ptest (C) and compute the average

max relative rank in this ordering. Note that aMRR is nonzero even if all true dependencies are

ordered before false dependencies; the best possible value is 0.051.
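The metric can be computed directly from a model's ranking. The sketch below assumes ranks are 1-indexed, which the text does not state explicitly, and uses made-up premise names. It also shows why the value stays above zero under perfect predictions: with t true dependencies ranked ahead of 128 negatives, the maximum relative rank for that conjecture is t / (t + 128).

```python
def max_relative_rank(ranking, true_premises):
    """Worst (largest) rank of any true premise, divided by the number of candidates.

    ranking: list of premises ordered best first; ranks are 1-indexed (assumption).
    """
    position = {premise: i + 1 for i, premise in enumerate(ranking)}
    return max(position[p] for p in true_premises) / len(ranking)


def average_max_relative_rank(cases):
    """Mean of max_relative_rank over (ranking, true_premises) pairs, one per conjecture."""
    values = [max_relative_rank(ranking, true) for ranking, true in cases]
    return sum(values) / len(values)


# Hypothetical conjecture with 7 true dependencies and 128 sampled negatives.
true = ["t%d" % i for i in range(7)]
negatives = ["n%d" % i for i in range(128)]
perfect = max_relative_rank(true + negatives, true)   # all true premises first
worst = max_relative_rank(negatives + true, true)     # all true premises last
```

Here `perfect` equals 7/135 and `worst` equals 1.0. The 7 dependencies are an illustrative choice; the reported lower bound is an average over the actual dependency counts in the holdout set.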


Figure 4: Specification of the different embedder networks.

All our neural network models use the general architecture from Fig 3: a classifier on top of the

concatenated embeddings of an axiom and a conjecture. The same classifier architecture was used for

all models: a fully-connected neural network with one hidden layer of size 1024. For each model, the

axiom and conjecture embedding networks have the same architecture without sharing weights. The

details of the embedding networks are shown in Fig 4.

The neural networks were trained using asynchronous distributed stochastic gradient descent using

the Adam optimizer [24] with up to 20 parallel NVIDIA K-80 GPU workers per model. We used the

TensorFlow framework [1] and the Keras library [5]. The weights were initialized using [10]. Polyak

averaging with 0.9999 decay was used for producing the evaluation weights [30]. The character-level

models were trained with a maximum sequence length of 2048 characters, whereas the word-level

(and definition-embedding) based models had a maximum sequence length of 500 words. For good

performance, especially for low cutoff thresholds, it was critical to employ negative mining during

training. A side process was continuously evaluating many (conjecture, axiom) pairs. For each

conjecture, we pick the lowest scoring statements that have higher score than the lowest scoring true

positive. A queue of previously mined negatives is maintained for producing a mixture of examples

in which the ratio of mined instances is about 25% and the rest are randomly selected premises.

Negative mining was crucial for good quality: at the top-16 cutoff, the number of proved theorems

on the test set has doubled. For the union of proof attempts over all cutoff thresholds, the ratio of

successful proofs has increased from 61.3% to 66.4% for the best neural model.
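The negative mining rule described above can be sketched as follows. The function names, the per-conjecture `limit` and the batch construction are hypothetical; only the selection rule (lowest-scoring statements that still score higher than the lowest-scoring true positive) and the roughly 25% share of mined examples come from the text.

```python
import random
from collections import deque


def mine_hard_negatives(scores, true_premises, limit):
    """Pick false premises scoring above the weakest true premise, lowest scores first.

    scores: dict mapping each candidate premise to the model's current score.
    """
    threshold = min(scores[p] for p in true_premises)
    candidates = [p for p in scores
                  if p not in true_premises and scores[p] > threshold]
    candidates.sort(key=lambda p: scores[p])
    return candidates[:limit]


def mix_negatives(mined_queue, all_negatives, batch_size, rng, mined_ratio=0.25):
    """Build a batch of negatives: about mined_ratio from the queue, the rest random."""
    n_mined = min(len(mined_queue), round(batch_size * mined_ratio))
    mined = [mined_queue.popleft() for _ in range(n_mined)]
    rest = [rng.choice(all_negatives) for _ in range(batch_size - n_mined)]
    return mined + rest


scores = {"t1": 0.9, "t2": 0.4, "n1": 0.95, "n2": 0.5, "n3": 0.1, "n4": 0.45}
true_premises = {"t1", "t2"}
hard = mine_hard_negatives(scores, true_premises, limit=2)

queue = deque(mine_hard_negatives(scores, true_premises, limit=3))
batch = mix_negatives(queue, ["n1", "n2", "n3", "n4"], batch_size=8, rng=random.Random(0))
```

In this toy example the weakest true premise scores 0.4, so `n3` (0.1) is already ranked correctly and is not mined, while `n4` and `n2` are the first two hard negatives.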

Our best selection pipeline uses a stage-1 character-level convolutional neural network model to

produce word-level embeddings for the second stage. The baseline uses distance-weighted k-

NN [19, 21] with handcrafted semantic features [22]. For all conjectures in our holdout set, we

consider all the chronologically preceding statements (lemmas, definitions and axioms) as premise


(a) Training accuracy for different character-level models without hard negative mining. Recurrent models seem to underperform, while pure convolutional models yield the best results. For each architecture, we trained three models with different random initialization seeds. Only the best runs are shown on this graph; we did not see much variance between runs on the same architecture.

(b) Test average max relative rank for different models without hard negative mining. The best is a word-level CNN using definition embeddings from a character-level 2-layer CNN. An identical word-embedding model with random starting embedding overfits after only 250,000 iterations and underperforms the best character-level model.

candidates. In the DeepMath case, premises were ordered by their logistic scores. E prover was

applied to the top-k of the premise-candidates for each of the cutoffs k ∈ (16, 32, . . . , 1024) until a

proof is found or k = 1024 fails. Table 1 reports the number of theorems proved with a cutoff value

at most the k in the leftmost column. For E prover, we used auto strategy with a soft time limit of 90

seconds, a hard time limit of 120 seconds, a memory limit of 4 GB, and a processed clauses limit of

500,000.
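The evaluation procedure amounts to a loop over doubling cutoffs. In the sketch below, `try_prove` is a hypothetical callback standing in for an E prover run with the limits given above; no actual prover interface is assumed. The second function reproduces the cumulative counting used in Table 1, where a theorem counts as proved at cutoff k if some cutoff of at most k succeeded.

```python
CUTOFFS = [16, 32, 64, 128, 256, 512, 1024]


def prove_with_cutoffs(conjecture, ranked_premises, try_prove, cutoffs=CUTOFFS):
    """Return the first cutoff k whose top-k premises yield a proof, or None."""
    for k in cutoffs:
        if try_prove(conjecture, ranked_premises[:k]):
            return k
    return None


def cumulative_counts(first_success, cutoffs=CUTOFFS):
    """For each cutoff k, count conjectures proved with some cutoff of at most k."""
    return {k: sum(1 for s in first_success if s is not None and s <= k)
            for k in cutoffs}


# Toy stand-in for the prover: a proof is found once all needed premises are supplied.
needed = {"c1": {5}, "c2": {20}, "c3": {2000}}


def fake_prove(conjecture, premises):
    return needed[conjecture] <= set(premises)


ranked = list(range(100))
results = [prove_with_cutoffs(c, ranked, fake_prove) for c in ["c1", "c2", "c3"]]
counts = cumulative_counts(results)
```

With these toy inputs, `results` is `[16, 32, None]`: the first conjecture is proved at the smallest cutoff, the second needs 32 premises, and the third is never proved.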

Our most successful models employ simple convolutional networks followed by max pooling (as

opposed to recurrent networks like LSTM/GRU), and the two stage definition-based def-CNN

outperforms the naïve word-CNN word embedding significantly. In the latter the word embeddings

were learned in a single pass; in the former they are fixed from the stage-1 character-level model. For

each architecture (cf. Figure 4) two convolutional layers perform best. Although our models differ

significantly from each other, they differ even more from the k-NN baseline based on hand-crafted

features. The right column of Table 1 shows the result if we average the prediction score of the stage-1

model with that of the definition based stage-2 model. We also experimented with character-based

RNN models using shorter sequences: these lagged behind our long-sequence CNN models but

performed significantly better than those RNNs trained on longer sequences. This suggests that RNNs

could be improved by more sophisticated optimization techniques such as curriculum learning.
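The score averaging behind the combined def+char-CNN column can be illustrated in a few lines. The premise names and scores are invented, and the tie-breaking by name is an assumption added to make the ordering deterministic.

```python
def ensemble_rank(scores_a, scores_b):
    """Rank premises by the mean of two models' scores, best first."""
    premises = scores_a.keys() & scores_b.keys()
    averaged = {p: (scores_a[p] + scores_b[p]) / 2 for p in premises}
    # Sort by decreasing averaged score; break ties by premise name.
    return sorted(averaged, key=lambda p: (-averaged[p], p))


# Hypothetical scores from a stage-1 (character-level) and a stage-2 (definition-based) model.
stage1 = {"p1": 0.9, "p2": 0.2, "p3": 0.6}
stage2 = {"p1": 0.1, "p2": 0.4, "p3": 0.8}
ranking = ensemble_rank(stage1, stage2)
```

Here `p3` ranks first because both models score it reasonably high, even though each model alone would order the premises differently.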

Cutoff  k-NN Baseline (%)  char-CNN (%)  word-CNN (%)  def-CNN-LSTM (%)  def-CNN (%)  def+char-CNN (%)
16      674 (24.6)         687 (25.1)    709 (25.9)    644 (23.5)        734 (26.8)   835 (30.5)
32      1081 (39.4)        1028 (37.5)   1063 (38.8)   924 (33.7)        1093 (39.9)  1218 (44.4)
64      1399 (51)          1295 (47.2)   1355 (49.4)   1196 (43.6)       1381 (50.4)  1470 (53.6)
128     1612 (58.8)        1534 (55.9)   1552 (56.6)   1401 (51.1)       1617 (59)    1695 (61.8)
256     1709 (62.3)        1656 (60.4)   1635 (59.6)   1519 (55.4)       1708 (62.3)  1780 (64.9)
512     1762 (64.3)        1711 (62.4)   1712 (62.4)   1593 (58.1)       1780 (64.9)  1830 (66.7)
1024    1786 (65.1)        1762 (64.3)   1755 (64)     1647 (60.1)       1822 (66.4)  1862 (67.9)

Table 1: Results of ATP premise selection experiments with hard negative mining on a test set of 2,742 theorems.

Each entry is the number (%) of theorems proved by E prover using that particular model to rank the premises.

The union of def-CNN and char-CNN proves 69.8% of the test set, while the union of the def-CNN and k-NN

proves 74.25%. This means that the neural network predictions are more complementary to the k-NN predictions

than to other neural models. The union of all methods proves 2218 theorems (80.9%) and just the neural models

prove 2151 (78.4%).

Also, when we applied two of the premise selection models on those Mizar statements that were not

proven automatically before, we managed to prove 823 more of them.


Model          Test min average relative rank
char-CNN       0.0585
word-CNN       0.06
def-CNN-LSTM   0.0605
def-CNN        0.0575

(c) Jaccard similarities between proved sets of conjectures across models. The predictions of the neural network models are more similar to each other than to those of the k-NN baseline.

(d) Best sustained test results obtained by the above models. Lower values are better. This was monitored continuously during training on a holdout set with 400 theorems, using all true positive premises and 128 randomly selected negatives. In this setup, the lowest attainable max average relative rank with perfect predictions is 0.051.

7 Conclusions

In this work we provide evidence that even simple neural models can compete with hand-engineered

features for premise selection, helping to find many new proofs. This translates to real gains in

automatic theorem proving. Despite these encouraging results, our models are relatively shallow

networks with inherent limitations to representational power and are incapable of capturing high level

properties of mathematical statements. We believe theorem proving is a challenging and important

domain for deep learning methods, and that more sophisticated optimization techniques and training

methodologies will prove more useful than in less structured domains.

8 Acknowledgments

We would like to thank Cezary Kaliszyk for providing us with an improved baseline model. Also

many thanks go to the Google Brain team for their generous help with the training infrastructure. We

would like to thank Quoc Le for useful discussions on the topic and Sergio Guadarrama for his

help with TensorFlow-slim.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,

M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,

M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,

B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. War-

den, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on

heterogeneous systems, 2015. Software available from tensorflow.org.

[2] G. Bancerek and P. Rudnicki. A Compendium of Continuous Lattices in MIZAR. J. Autom. Reasoning,

29(3-4):189–224, 2002.

[3] P. Baudiš, J. Pichl, T. Vyskočil, and J. Šedivý. Sentence pair scoring: Towards unified framework for text

comprehension. arXiv preprint arXiv:1603.06127, 2016.

[4] J. C. Blanchette, C. Kaliszyk, L. C. Paulson, and J. Urban. Hammering towards QED. J. Formalized

Reasoning, 9(1):101–148, 2016.

[5] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.

[6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. arXiv preprint

arXiv:1502.02367, 2015.

[7] The Coq Proof Assistant. http://coq.inria.fr.

[8] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing

Systems, pages 3061–3069, 2015.

[9] N. de Bruijn. The mathematical language AUTOMATH, its usage, and some of its extensions. In M. Laudet,

editor, Proceedings of the Symposium on Automatic Demonstration, pages 29–61, Versailles, France, Dec.

1968. Springer-Verlag LNM 125.

[10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In

International conference on artificial intelligence and statistics, pages 249–256, 2010.

[11] G. Gonthier. The four colour theorem: Engineering of a formal proof. In D. Kapur, editor, Computer

Mathematics, 8th Asian Symposium, ASCM 2007, Singapore, December 15-17, 2007. Revised and Invited

Papers, volume 5081 of Lecture Notes in Computer Science, page 333. Springer, 2007.

[12] G. Gonthier, A. Asperti, J. Avigad, Y. Bertot, C. Cohen, F. Garillot, S. L. Roux, A. Mahboubi, R. O’Connor,

S. O. Biha, I. Pasca, L. Rideau, A. Solovyev, E. Tassi, and L. Théry. A machine-checked proof of the Odd

Order Theorem. In S. Blazy, C. Paulin-Mohring, and D. Pichardie, editors, ITP, volume 7998 of LNCS,

pages 163–179. Springer, 2013.

[13] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Reasoning,

3(2):153–245, 2010.

[14] T. C. Hales, M. Adams, G. Bauer, D. T. Dang, J. Harrison, T. L. Hoang, C. Kaliszyk, V. Magron,

S. McLaughlin, T. T. Nguyen, T. Q. Nguyen, T. Nipkow, S. Obua, J. Pleso, J. Rute, A. Solovyev, A. H. T.

Ta, T. N. Tran, D. T. Trieu, J. Urban, K. K. Vu, and R. Zumkeller. A formal proof of the Kepler conjecture.

CoRR, abs/1501.02155, 2015.

[15] J. Harrison. HOL Light: A tutorial introduction. In M. K. Srivas and A. J. Camilleri, editors, FMCAD,

volume 1166 of LNCS, pages 265–269. Springer, 1996.

[16] J. Harrison, J. Urban, and F. Wiedijk. History of interactive theorem proving. In J. H. Siekmann, editor,

Computational Logic, volume 9 of Handbook of the History of Logic, pages 135 – 214. North-Holland,

2014.

[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[18] Ł. Kaiser and I. Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.

[19] C. Kaliszyk and J. Urban. Stronger automation for Flyspeck by feature weighting and strategy evolution.

In J. C. Blanchette and J. Urban, editors, PxTP 2013, volume 14 of EPiC Series, pages 87–95. EasyChair,

2013.

[20] C. Kaliszyk and J. Urban. Learning-assisted automated reasoning with Flyspeck. J. Autom. Reasoning,

53(2):173–213, 2014.

[21] C. Kaliszyk and J. Urban. MizAR 40 for Mizar 40. J. Autom. Reasoning, 55(3):245–256, 2015.

[22] C. Kaliszyk, J. Urban, and J. Vyskocil. Efficient semantic features for automated reasoning over large

theories. In Q. Yang and M. Wooldridge, editors, Proceedings of the Twenty-Fourth International Joint

Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages

3084–3090. AAAI Press, 2015.

[23] M. Kaufmann and J. S. Moore. An ACL2 tutorial. In Mohamed et al. [29], pages 17–21.

[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[25] G. Klein, J. Andronick, K. Elphinstone, G. Heiser, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt,

R. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood. seL4: formal verification of an operating-

system kernel. Commun. ACM, 53(6):107–115, 2010.

[26] D. Kuehlwein and J. Urban. Learning from multiple proofs: First experiments. In P. Fontaine, R. A.

Schmidt, and S. Schulz, editors, PAAR-2012, volume 21 of EPiC Series, pages 82–94. EasyChair, 2013.

[27] X. Leroy. Formal verification of a realistic compiler. Commun. ACM, 52(7):107–115, 2009.

[28] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based

language model. In INTERSPEECH, volume 2, page 3, 2010.

[29] O. A. Mohamed, C. A. Muñoz, and S. Tahar, editors. Theorem Proving in Higher Order Logics, 21st

International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings, volume

5170 of LNCS. Springer, 2008.

[30] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on

Control and Optimization, 30(4):838–855, 1992.

[31] J. A. Robinson and A. Voronkov, editors. Handbook of Automated Reasoning (in 2 volumes). Elsevier and

MIT Press, 2001.

[32] S. Schulz. E - A Brainiac Theorem Prover. AI Commun., 15(2-3):111–126, 2002.

[33] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in Neural Information

Processing Systems, pages 2431–2439, 2015.

[34] J. Urban. MPTP 0.2: Design, implementation, and initial experiments. J. Autom. Reasoning, 37(1-2):21–43,

2006.

[35] J. Urban and J. Vyskočil. Theorem proving in large formal mathematics as an emerging AI field. In M. P.

Bonacina and M. E. Stickel, editors, Automated Reasoning and Mathematics: Essays in Memory of William

McCune, volume 7788 of LNAI, pages 240–257. Springer, 2013.

[36] O. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.

[37] M. Wenzel, L. C. Paulson, and T. Nipkow. The Isabelle framework. In Mohamed et al. [29], pages 33–38.

[38] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.

Learning to Transduce with Unbounded Memory

Google DeepMind Google DeepMind Google DeepMind

etg@google.com kmh@google.com mustafasul@google.com

Phil Blunsom

Google DeepMind and Oxford University

pblunsom@google.com

Abstract

Recently, strong results have been demonstrated by Deep Recurrent Neural Networks
on natural language transduction problems. In this paper we explore the
representational power of these models using synthetic grammars designed to exhibit
phenomena similar to those found in real transduction problems such as machine
translation. These experiments lead us to propose new memory-based recurrent
networks that implement continuously differentiable analogues of traditional
data structures such as Stacks, Queues, and DeQues. We show that these architectures
exhibit superior generalisation performance to Deep RNNs and are often able
to learn the underlying generating algorithms in our transduction experiments.

1 Introduction

Recurrent neural networks (RNNs) offer a compelling tool for processing natural language input in

a straightforward sequential manner. Many natural language processing (NLP) tasks can be viewed

as transduction problems, that is learning to convert one string into another. Machine translation is

a prototypical example of transduction and recent results indicate that Deep RNNs have the ability

to encode long source strings and produce coherent translations [1, 2]. While elegant, the application

of RNNs to transduction tasks requires hidden layers large enough to store representations

of the longest strings likely to be encountered, implying wastage on shorter strings and a strong

dependency between the number of parameters in the model and its memory.

In this paper we use a number of linguistically-inspired synthetic transduction tasks to explore the

ability of RNNs to learn long-range reorderings and substitutions. Further, inspired by prior work on

neural network implementations of stack data structures [3], we propose and evaluate transduction

models based on Neural Stacks, Queues, and DeQues (double ended queues). Stack algorithms are

well-suited to processing the hierarchical structures observed in natural language and we hypothesise

that their neural analogues will provide an effective and learnable transduction tool. Our models

provide a middle ground between simple RNNs and the recently proposed Neural Turing Machine

(NTM) [4] which implements a powerful random access memory with read and write operations.

Neural Stacks, Queues, and DeQues also provide a logically unbounded memory while permitting

efficient constant time push and pop operations.

Our results indicate that the models proposed in this work, and in particular the Neural DeQue, are

able to consistently learn a range of challenging transductions. While Deep RNNs based on long

short-term memory (LSTM) cells [1, 5] can learn some transductions when tested on inputs of the

same length as seen in training, they fail to consistently generalise to longer strings. In contrast,

our sequential memory-based algorithms are able to learn to reproduce the generating transduction

algorithms, often generalising perfectly to inputs well beyond those encountered in training.

2 Related Work

String transduction is central to many applications in NLP, from name transliteration and spelling

correction, to inflectional morphology and machine translation. The most common approach leverages

symbolic finite state transducers [6, 7], with approaches based on context free representations

also being popular [8]. RNNs offer an attractive alternative to symbolic transducers due to their simple

algorithms and expressive representations [9]. However, as we show in this work, such models

are limited in their ability to generalise beyond their training data and have a memory capacity that

scales with the number of their trainable parameters.

Previous work has touched on the topic of rendering discrete data structures such as stacks continuous,

especially within the context of modelling pushdown automata with neural networks [10, 11, 3].

We were inspired by the continuous pop and push operations of these architectures and the idea of

an RNN controlling the data structure when developing our own models. The key difference is that

our work adapts these operations to work within a recurrent continuous Stack/Queue/DeQue-like

structure, the dynamics of which are fully decoupled from those of the RNN controlling it. In our

models, the backwards dynamics are easily analysable in order to obtain the exact partial derivatives

for use in error propagation, rather than having to approximate them as done in previous work.

In a parallel effort to ours, researchers are exploring the addition of memory to recurrent networks.

The NTM and Memory Networks [4, 12, 13] provide powerful random access memory operations,

whereas we focus on a more efficient and restricted class of models which we believe are sufficient

for natural language transduction tasks. More closely related to our work, [14] have sought to

develop a continuous stack controlled by an RNN. Note that this model—unlike the work proposed

here—renders discrete push and pop operations continuous by “mixing” information across levels of

the stack at each time step according to scalar push/pop action values. This means the model ends up

compressing information in the stack, thereby limiting its use, as it effectively loses the unbounded

memory nature of traditional symbolic models.

3 Models

In this section, we present an extensible memory enhancement to recurrent layers which can be set

up to act as a continuous version of a classical Stack, Queue, or DeQue (double-ended queue). We

begin by describing the operations and dynamics of a neural Stack, before showing how to modify

it to act as a Queue, and extend it to act as a DeQue.

Let a Neural Stack be a differentiable structure onto and from which continuous vectors are pushed

and popped. Inspired by the neural pushdown automaton of [3], we render these traditionally discrete

operations continuous by letting push and pop operations be real values in the interval (0, 1).

Intuitively, we can interpret these values as the degree of certainty with which some controller wishes

to push a vector v onto the stack, or pop the top of the stack.

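\[
V_t[i] =
\begin{cases}
V_{t-1}[i] & \text{if } 1 \le i < t \\
v_t & \text{if } i = t
\end{cases}
\qquad \text{(Note that } V_t[i] = v_i \text{ for all } i \le t\text{)} \tag{1}
\]

\[
s_t[i] =
\begin{cases}
\max\Big(0,\; s_{t-1}[i] - \max\Big(0,\; u_t - \sum_{j=i+1}^{t-1} s_{t-1}[j]\Big)\Big) & \text{if } 1 \le i < t \\
d_t & \text{if } i = t
\end{cases} \tag{2}
\]

\[
r_t = \sum_{i=1}^{t} \Big(\min\Big(s_t[i],\; \max\Big(0,\; 1 - \sum_{j=i+1}^{t} s_t[j]\Big)\Big)\Big) \cdot V_t[i] \tag{3}
\]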

Formally, a Neural Stack, fully parametrised by an embedding size $m$, is described at some timestep
$t$ by a $t \times m$ value matrix $V_t$ and a strength vector $s_t \in \mathbb{R}^t$. These form the core of a recurrent layer
which is acted upon by a controller by receiving, from the controller, a value $v_t \in \mathbb{R}^m$, a pop signal
$u_t \in (0, 1)$, and a push signal $d_t \in (0, 1)$. It outputs a read vector $r_t \in \mathbb{R}^m$. The recurrence of this
layer comes from the fact that it will receive as previous state of the stack the pair $(V_{t-1}, s_{t-1})$, and
produce as next state the pair $(V_t, s_t)$ following the dynamics described below. Here, $V_t[i]$ represents
the $i$th row (an $m$-dimensional vector) of $V_t$ and $s_t[i]$ represents the $i$th value of $s_t$.

Equation 1 shows the update of the value component of the recurrent layer state represented as a

matrix, the number of rows of which grows with time, maintaining a record of the values pushed to

the stack at each timestep (whether or not they are still logically on the stack). Values are appended

to the bottom of the matrix (top of the stack) and never changed.

Equation 2 shows the effect of the push and pop signal in updating the strength vector $s_{t-1}$ to

produce $s_t$. First, the pop operation removes objects from the stack. We can think of the pop value

$u_t$ as the initial deletion quantity for the operation. We traverse the strength vector $s_{t-1}$ from the

highest index to the lowest. If the next strength scalar is less than the remaining deletion quantity, it

is subtracted from the remaining quantity and its value is set to 0. If the remaining deletion quantity

is less than the next strength scalar, the remaining deletion quantity is subtracted from that scalar and

deletion stops. Next, the push value is set as the strength for the value added in the current timestep.

Equation 3 shows the dynamics of the read operation, which are similar to the pop operation. A
fixed initial read quantity of 1 is set at the top of a temporary copy of the strength vector $s_t$ which
is traversed from the highest index to the lowest. If the next strength scalar is smaller than the
remaining read quantity, its value is preserved for this operation and subtracted from the remaining
read quantity. If not, it is temporarily set to the remaining read quantity, and the strength scalars of
all lower indices are temporarily set to 0. The output $r_t$ of the read operation is the weighted sum
of the rows of $V_t$, scaled by the temporary scalar values created during the traversal. An example
of the stack read calculations across three timesteps, after pushes and pops as described above, is
illustrated in Figure 1a. The third step shows how setting the strength $s_3[2]$ to 0 for $V_3[2]$ logically
removes $v_2$ from the stack, and how it is ignored during the read.

This completes the description of the forward dynamics of a neural Stack, cast as a recurrent layer,

as illustrated in Figure 1b. All operations described in this section are differentiable¹. The equations

describing the backwards dynamics are provided in Appendix A of the supplementary materials.
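The forward dynamics of Equations 1–3 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `stack_step`, the use of suffix sums in place of an explicit top-down traversal, and the push/pop values in the example are all assumptions made for this sketch.

```python
import numpy as np


def stack_step(V_prev, s_prev, v, u, d):
    """One forward step of a neural Stack (sketch of Equations 1-3).

    V_prev: (t-1, m) value matrix, s_prev: (t-1,) strength vector,
    v: (m,) value to push, u: pop signal, d: push signal.
    """
    # Eq. 1: the new value is appended as the last row; old rows never change.
    V = np.vstack([V_prev, v[None, :]])

    # Eq. 2: strength stored above each entry (sum over indices j > i).
    above = np.cumsum(s_prev[::-1])[::-1] - s_prev
    # The pop quantity u is consumed from the top of the stack downwards.
    s_kept = np.maximum(0.0, s_prev - np.maximum(0.0, u - above))
    # The pushed value receives the push signal as its strength.
    s = np.append(s_kept, d)

    # Eq. 3: read a total quantity of 1 from the top downwards.
    above = np.cumsum(s[::-1])[::-1] - s
    weights = np.minimum(s, np.maximum(0.0, 1.0 - above))
    r = weights @ V
    return V, s, r


# Hypothetical example: three one-hot values with partial pushes and pops.
m = 3
V, s = np.zeros((0, m)), np.zeros(0)
for v, u, d in [(np.eye(m)[0], 0.0, 0.8),
                (np.eye(m)[1], 0.1, 0.5),
                (np.eye(m)[2], 0.9, 0.9)]:
    V, s, r = stack_step(V, s, v, u, d)
    print(np.round(s, 3), np.round(r, 3))
```

In this example the second value ends the third step with strength 0, so it no longer contributes to the read even though its row is still stored in the value matrix.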

[Figure 1: (a) example of neural Stack read calculations across three timesteps (the stack grows upwards; at the third timestep v2 has strength 0 and is removed from the stack); (b) the neural Stack as a recurrent layer, taking the previous state (V_{t-1}, s_{t-1}), push d_t, pop u_t, and value v_t as input and producing the next state (V_t, s_t) and output r_t; (c) an RNN controller combined with a neural Stack, mapping the previous state H_{t-1} and input i_t to the next state H_t and output o_t.]

¹ The max(x, y) and min(x, y) functions are technically not differentiable for x = y. Following the work
on rectified linear units [15], we arbitrarily take the partial differentiation of the left argument in these cases.

A neural Queue operates the same way as a neural Stack, with the exception that the pop operation
reads the lowest index of the strength vector $s_t$, rather than the highest. This represents popping and
reading from the front of the Queue rather than the top of the stack. These operations are described
in Equations 4–5.

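\[
s_t[i] =
\begin{cases}
\max\Big(0,\; s_{t-1}[i] - \max\Big(0,\; u_t - \sum_{j=1}^{i-1} s_{t-1}[j]\Big)\Big) & \text{if } 1 \le i < t \\
d_t & \text{if } i = t
\end{cases} \tag{4}
\]

\[
r_t = \sum_{i=1}^{t} \Big(\min\Big(s_t[i],\; \max\Big(0,\; 1 - \sum_{j=1}^{i-1} s_t[j]\Big)\Big)\Big) \cdot V_t[i] \tag{5}
\]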
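The Queue differs from the Stack only in the direction of the inner sums, which a short NumPy sketch makes concrete. The name `queue_step` and the example values are assumptions made for illustration; prefix sums stand in for the front-to-back traversal.

```python
import numpy as np


def queue_step(V_prev, s_prev, v, u, d):
    """One forward step of a neural Queue (sketch of Equations 4-5)."""
    # Values are appended as a new last row, as for the Stack (Equation 1).
    V = np.vstack([V_prev, v[None, :]])

    # Eq. 4: strength stored in front of each entry (sum over indices j < i).
    before = np.cumsum(s_prev) - s_prev
    # The pop quantity u is consumed starting from the front of the queue.
    s_kept = np.maximum(0.0, s_prev - np.maximum(0.0, u - before))
    s = np.append(s_kept, d)

    # Eq. 5: read a total quantity of 1 starting from the front.
    before = np.cumsum(s) - s
    weights = np.minimum(s, np.maximum(0.0, 1.0 - before))
    return V, s, weights @ V


# Hypothetical example: enqueue a and b, then pop once while enqueuing c.
a, b, c = np.eye(3)
V, s = np.zeros((0, 3)), np.zeros(0)
V, s, r = queue_step(V, s, a, 0.0, 1.0)   # read returns a
V, s, r = queue_step(V, s, b, 0.0, 1.0)   # read still returns a (front)
V, s, r = queue_step(V, s, c, 1.0, 1.0)   # a is popped, read returns b
print(s, r)
```

With full-strength pushes and pops the structure behaves like a discrete first-in, first-out queue; fractional signals interpolate between these discrete behaviours.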

A neural DeQue operates like a neural Stack, except it takes a push, pop, and value as input for
both “ends” of the structure (which we call top and bot), and outputs a read for both ends. We write
$u_t^{top}$ and $u_t^{bot}$ instead of $u_t$, $v_t^{top}$ and $v_t^{bot}$ instead of $v_t$, and so on. The state components $V_t$ and $s_t$ are now
a $2t \times m$-dimensional matrix and a $2t$-dimensional vector, respectively. At each timestep, a pop
from the top is followed by a pop from the bottom of the DeQue, followed by the pushes and reads.
The dynamics of a DeQue, which unlike a neural Stack or Queue “grows” in two directions, are
described in Equations 6–11, below. Equations 7–9 decompose the strength vector update into three
steps purely for notational clarity.

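\[
V_t[i] =
\begin{cases}
v_t^{bot} & \text{if } i = 1 \\
v_t^{top} & \text{if } i = 2t \\
V_{t-1}[i-1] & \text{if } 1 < i < 2t
\end{cases} \tag{6}
\]

\[
s_t^{top}[i] = \max\Big(0,\; s_{t-1}[i] - \max\Big(0,\; u_t^{top} - \sum_{j=i+1}^{2(t-1)-1} s_{t-1}[j]\Big)\Big) \quad \text{if } 1 \le i < 2(t-1) \tag{7}
\]

\[
s_t^{both}[i] = \max\Big(0,\; s_t^{top}[i] - \max\Big(0,\; u_t^{bot} - \sum_{j=1}^{i-1} s_t^{top}[j]\Big)\Big) \quad \text{if } 1 \le i < 2(t-1) \tag{8}
\]

\[
s_t[i] =
\begin{cases}
s_t^{both}[i-1] & \text{if } 1 < i < 2t \\
d_t^{bot} & \text{if } i = 1 \\
d_t^{top} & \text{if } i = 2t
\end{cases} \tag{9}
\]

\[
r_t^{top} = \sum_{i=1}^{2t} \Big(\min\Big(s_t[i],\; \max\Big(0,\; 1 - \sum_{j=i+1}^{2t} s_t[j]\Big)\Big)\Big) \cdot V_t[i] \tag{10}
\]

\[
r_t^{bot} = \sum_{i=1}^{2t} \Big(\min\Big(s_t[i],\; \max\Big(0,\; 1 - \sum_{j=1}^{i-1} s_t[j]\Big)\Big)\Big) \cdot V_t[i] \tag{11}
\]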

To summarise, a neural DeQue acts like two neural Stacks operated on in tandem, except that the

pushes and pops from one end may eventually affect pops and reads on the other, and vice versa.
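The interaction between the two ends can be illustrated with a small NumPy sketch of Equations 6–11. The helper names (`_pop`, `_read`, `deque_step`) and the example are hypothetical. Equations 7–8 are stated for a sub-range of indices; the sketch applies each pop to every entry of the previous strength vector, which is an assumption about the intended ranges rather than something the text confirms.

```python
import numpy as np


def _blocking(s, from_top):
    """Strength lying between each entry and the chosen end."""
    if from_top:
        return np.cumsum(s[::-1])[::-1] - s   # sum over indices j > i
    return np.cumsum(s) - s                   # sum over indices j < i


def _pop(s, u, from_top):
    """Remove a total quantity u of strength, starting at one end."""
    return np.maximum(0.0, s - np.maximum(0.0, u - _blocking(s, from_top)))


def _read(V, s, from_top):
    """Read a total quantity of 1, starting at one end."""
    return np.minimum(s, np.maximum(0.0, 1.0 - _blocking(s, from_top))) @ V


def deque_step(V_prev, s_prev, v_top, v_bot, u_top, u_bot, d_top, d_bot):
    """One forward step of a neural DeQue (sketch of Equations 6-11)."""
    # Eq. 6: new values are added at both ends of the value matrix.
    V = np.vstack([v_bot[None, :], V_prev, v_top[None, :]])
    # Eqs. 7-8: pop from the top first, then from the bottom.
    s_both = _pop(_pop(s_prev, u_top, True), u_bot, False)
    # Eq. 9: the push signals become the strengths of the two new entries.
    s = np.concatenate([[d_bot], s_both, [d_top]])
    # Eqs. 10-11: one read per end.
    return V, s, _read(V, s, True), _read(V, s, False)


# Hypothetical example: push a on top and b on the bottom, then pop the top.
a, b, _ = np.eye(3)
zero = np.zeros(3)
V, s = np.zeros((0, 3)), np.zeros(0)
V, s, r_top, r_bot = deque_step(V, s, a, b, 0.0, 0.0, 1.0, 1.0)
V, s, r_top, r_bot = deque_step(V, s, zero, zero, 1.0, 0.0, 0.0, 0.0)
print(r_top, r_bot)   # both reads now return b
```

After the top element is popped, the top read falls through to the element that was pushed at the bottom, which is the cross-end effect described in the summary above.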

While the three memory modules described can be seen as recurrent layers, with the operations being

used to produce the next state and output from the input and previous state being fully differentiable,

they contain no tunable parameters to optimise during training. As such, they need to be attached

to a controller in order to be used for any practical purposes. In exchange, they offer an extensible

memory, the logical size of which is unbounded and decoupled from both the nature and parameters

of the controller, and from the size of the problem they are applied to. Here, we describe how any

RNN controller may be enhanced by a neural Stack, Queue or DeQue.

We begin by giving the case where the memory is a neural Stack, as illustrated in Figure 1c. Here
we wish to replicate the overall ‘interface’ of a recurrent layer (as seen from outside the dotted
lines), which takes the previous recurrent state $H_{t-1}$ and an input vector $i_t$, and transforms them
to return the next recurrent state $H_t$ and an output vector $o_t$. In our setup, the previous state $H_{t-1}$
of the recurrent layer will be the tuple $(h_{t-1}, r_{t-1}, (V_{t-1}, s_{t-1}))$, where $h_{t-1}$ is the previous state
of the RNN, $r_{t-1}$ is the previous stack read, and $(V_{t-1}, s_{t-1})$ is the previous state of the stack
as described above. With the exception of $h_0$, which is initialised randomly and optimised during
training, all other initial states, $r_0$ and $(V_0, s_0)$, are set to 0-valued vectors/matrices and not updated
during training.

The overall input $i_t$ is concatenated with the previous read $r_{t-1}$ and passed to the RNN controller as
input along with the previous controller state $h_{t-1}$. The controller outputs its next state $h_t$ and a
controller output $o'_t$, from which we obtain the push and pop scalars $d_t$ and $u_t$ and the value vector
$v_t$, which are passed to the stack, as well as the network output $o_t$:

\[ d_t = \mathrm{sigmoid}(W_d o'_t + b_d) \qquad u_t = \mathrm{sigmoid}(W_u o'_t + b_u) \]

\[ v_t = \tanh(W_v o'_t + b_v) \qquad o_t = \tanh(W_o o'_t + b_o) \]

where $W_d$ and $W_u$ are vector-to-scalar projection matrices, and $b_d$ and $b_u$ are their scalar biases;
$W_v$ and $W_o$ are vector-to-vector projections, and $b_v$ and $b_o$ are their vector biases, all randomly
initialised and then tuned during training. Along with the previous stack state $(V_{t-1}, s_{t-1})$, the stack
operations $d_t$ and $u_t$ and the value $v_t$ are passed to the neural stack to obtain the next read $r_t$ and
next stack state $(V_t, s_t)$, which are packed into a tuple with the controller state $h_t$ to form the next
state $H_t$ of the overall recurrent layer. The output vector $o_t$ serves as the overall output of the
recurrent layer. The structure described here can be adapted to control a neural Queue instead of a
stack by substituting one memory module for the other.
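The wiring between controller and memory described above can be sketched as a single step function. This is an illustrative sketch under stated assumptions, not the authors' code: a plain tanh RNN cell stands in for the controller (whose output $o'_t$ is taken to be its hidden state), and the names `init_params` and `enhanced_step`, the parameter shapes, and the initialisation scale are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def stack_step(V_prev, s_prev, v, u, d):
    """Neural Stack forward step (sketch of Equations 1-3)."""
    V = np.vstack([V_prev, v[None, :]])
    above = np.cumsum(s_prev[::-1])[::-1] - s_prev
    s = np.append(np.maximum(0.0, s_prev - np.maximum(0.0, u - above)), d)
    above = np.cumsum(s[::-1])[::-1] - s
    return V, s, np.minimum(s, np.maximum(0.0, 1.0 - above)) @ V


def init_params(n_in, n_hid, m, n_out, rng):
    """Randomly initialised projections; shapes are assumptions of this sketch."""
    shapes = {"W_x": (n_hid, n_in + m), "W_h": (n_hid, n_hid), "b_h": (n_hid,),
              "W_d": (n_hid,), "W_u": (n_hid,),
              "W_v": (m, n_hid), "b_v": (m,),
              "W_o": (n_out, n_hid), "b_o": (n_out,)}
    p = {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}
    p["b_d"], p["b_u"] = 0.0, 0.0   # scalar biases for the push and pop signals
    return p


def enhanced_step(p, H_prev, i_t):
    """Map (H_{t-1}, i_t) to (H_t, o_t) for a Stack-enhanced controller."""
    h_prev, r_prev, (V_prev, s_prev) = H_prev
    # The controller input is the overall input joined with the previous read.
    x = np.concatenate([i_t, r_prev])
    # Stand-in controller: a tanh RNN cell; its state doubles as o'_t.
    h = np.tanh(p["W_x"] @ x + p["W_h"] @ h_prev + p["b_h"])
    o_prime = h
    d = sigmoid(p["W_d"] @ o_prime + p["b_d"])   # push signal d_t
    u = sigmoid(p["W_u"] @ o_prime + p["b_u"])   # pop signal u_t
    v = np.tanh(p["W_v"] @ o_prime + p["b_v"])   # value v_t pushed to the stack
    o = np.tanh(p["W_o"] @ o_prime + p["b_o"])   # network output o_t
    V, s, r = stack_step(V_prev, s_prev, v, u, d)
    return (h, r, (V, s)), o


rng = np.random.default_rng(0)
n_in, n_hid, m, n_out = 4, 8, 5, 3
params = init_params(n_in, n_hid, m, n_out, rng)
# h_0 is random; r_0 and (V_0, s_0) start at zero, as described in the text.
H = (0.1 * rng.standard_normal(n_hid), np.zeros(m), (np.zeros((0, m)), np.zeros(0)))
for _ in range(3):
    H, o = enhanced_step(params, H, rng.standard_normal(n_in))
print(o.shape, H[2][0].shape)
```

Swapping `stack_step` for a Queue step with the same signature would give the Queue-enhanced variant, mirroring the substitution described in the paragraph above.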

The only additional trainable parameters in either configuration, relative to a non-enhanced RNN,

are the projections for the input concatenated with the previous read into the RNN controller, and the

projections from the controller output into the various Stack/Queue inputs, described above. In the

case of a DeQue, both the top read $r^{top}$ and bottom read $r^{bot}$ must be preserved in the overall state.

They are both concatenated with the input to form the input to the RNN controller. The output of the

controller must have additional projections to output push/pop operations and values for the bottom

of the DeQue. This roughly doubles the number of additional tunable parameters “wrapping” the

RNN controller, compared to the Stack/Queue case.

4 Experiments

In every experiment, integer-encoded source and target sequence pairs are presented to the candidate

model as a batch of single joint sequences. The joint sequence starts with a start-of-sequence (SOS)

symbol, and ends with an end-of-sequence (EOS) symbol, with a separator symbol separating the

source and target sequences. Integer-encoded symbols are converted to 64-dimensional embeddings

via an embedding matrix, which is randomly initialised and tuned during training. Separate word-

to-index mappings are used for source and target vocabularies. Separate embedding matrices are

used to encode input and output (predicted) embeddings.

The aim of each of the following tasks is to read an input sequence, and generate as target sequence a
transformed version of the source sequence, followed by an EOS symbol. Source sequences are randomly
generated from a vocabulary of 128 meaningless symbols. The length of each training source
sequence is uniformly sampled from unif{8, 64}, and each symbol in the sequence is drawn with
replacement from a uniform distribution over the source vocabulary (ignoring SOS, and separator).
A deterministic task-specific transformation, described for each task below, is applied to the source
sequence to yield the target sequence. As the training sequences are entirely determined by the
source sequence, there are close to $10^{135}$ training sequences for each task, and training examples
are sampled from this space due to the random generation of source sequences. The following steps
are followed before each training and test sequence is presented to the models: the SOS symbol
($\langle s \rangle$) is prepended to the source sequence, which is concatenated with a separator symbol (|||) and
the target sequence, to which the EOS symbol ($\langle /s \rangle$) is appended.
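The size of the sample space can be checked by counting source sequences directly, since each target is fully determined by its source. A small Python check, assuming 128 source symbols and lengths from 8 to 64 inclusive as stated above (the variable names are illustrative):

```python
# Count all source sequences of length 8..64 over a 128-symbol vocabulary.
vocab_size = 128
total = sum(vocab_size ** k for k in range(8, 65))

# Number of decimal digits in the exact count.
print(len(str(total)))  # 135
```

The count has 135 decimal digits, so it lies just below $10^{135}$, in line with the estimate given in the text.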

Sequence Copying The source sequence is copied to form the target sequence. Sequences have
the form:

$\langle s \rangle\, a_1 \ldots a_k \,|||\, a_1 \ldots a_k \,\langle /s \rangle$

Sequence Reversal The source sequence is deterministically reversed to produce the target sequence.
Sequences have the form:

$\langle s \rangle\, a_1 a_2 \ldots a_k \,|||\, a_k \ldots a_2 a_1 \,\langle /s \rangle$

Bigram flipping The source side is restricted to even-length sequences. The target is produced
by swapping, for all odd source sequence indices $i \in [1, |seq|] \wedge \mathrm{odd}(i)$, the $i$th symbol with the
$(i + 1)$th symbol. Sequences have the form:

$\langle s \rangle\, a_1 a_2 a_3 a_4 \ldots a_{k-1} a_k \,|||\, a_2 a_1 a_4 a_3 \ldots a_k a_{k-1} \,\langle /s \rangle$
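The three synthetic tasks and the joint-sequence format can be sketched as follows. This is an illustrative generator, not the authors' data pipeline: the marker strings, the function names, and the way even lengths are sampled for bigram flipping are assumptions, since the text does not specify them.

```python
import random

# Hypothetical marker symbols; the paper integer-encodes all symbols.
SOS, SEP, EOS = "<s>", "|||", "</s>"


def copy_task(src):
    return list(src)


def reverse_task(src):
    return list(reversed(src))


def bigram_flip_task(src):
    if len(src) % 2 != 0:
        raise ValueError("bigram flipping needs an even-length source")
    out = []
    for i in range(0, len(src), 2):
        out.extend([src[i + 1], src[i]])   # swap each adjacent pair
    return out


def make_example(transform, rng, min_len=8, max_len=64, vocab_size=128, even=False):
    """Build one joint sequence: SOS, source, separator, target, EOS."""
    if even:
        # Assumption: sample a half-length so the result is always even.
        length = 2 * rng.randint((min_len + 1) // 2, max_len // 2)
    else:
        length = rng.randint(min_len, max_len)   # inclusive bounds
    src = [rng.randrange(vocab_size) for _ in range(length)]
    return [SOS] + src + [SEP] + transform(src) + [EOS]


rng = random.Random(0)
print(make_example(reverse_task, rng, min_len=4, max_len=6, vocab_size=10))
print(bigram_flip_task([1, 2, 3, 4]))
```

Test data in the same format would use `min_len=65` and `max_len=128`, matching the evaluation setup described later.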

The following tasks examine how well models can approach sequence transduction problems where

the source and target sequence are jointly generated by Inversion Transduction Grammars (ITG) [8],

a subclass of Synchronous Context-Free Grammars [16] often used in machine translation [17]. We

present two simple ITG-based datasets with interesting linguistic properties and their underlying

grammars. We show these grammars in Table 1, in Appendix C of the supplementary materials. For

each synchronised non-terminal, an expansion is chosen according to the probability distribution

specified by the rule probability p at the beginning of each rule. For each grammar, ‘A’ is always the

root of the ITG tree.

We tuned the generative probabilities for recursive rules by hand so that the grammars generate left

and right sequences of lengths 8 to 128 with relatively uniform distribution. We generate training

data by rejecting samples that are outside of the range [8, 64], and testing data by rejecting samples

outside of the range [65, 128]. For terminal symbol-generating rules, we balance the classes so

that for k terminal-generating symbols in the grammar, each terminal-generating non-terminal ‘X’

generates a vocabulary of approximately 128/k, and each vocabulary word under that class is

equiprobable. These design choices were made to maximise the similarity between the experimental

settings of the ITG tasks described here and the synthetic tasks described above.

SVO to SOV [...] faithfully reproduce high-level syntactic divergences between languages. For instance, when translating

an English sentence with a non-finite verb into German, a transducer must locate and move

the verb over the object to the final position. We simulate this phenomenon with a synchronous

grammar which generates strings exhibiting verb movements. To add an extra challenge, we also

simulate simple relative clause embeddings to test the models’ ability to transduce in the presence

of unbounded recursive structures.

A sample output of the grammar is presented here, with spaces between words being included for

stylistic purposes, and where s, o, and v indicate subject, object, and verb terminals respectively, i

and o mark input and output, and rp indicates a relative pronoun:

si1 vi28 oi5 oi7 si15 rpi si19 vi16 oi10 oi24 ||| so1 oo5 oo7 so15 rpo so19 vo16 oo10 oo24 vo28

Gender Conjugation [...] language with gender-free articles to one with gender-specific definite and indefinite articles. A

real world example of such a translation would be from English (the, a) to German (der/die/das,

ein/eine/ein).

The grammar simulates sentences in (NP/(V/NP)) or (NP/V) form, where every noun phrase

can become an infinite sequence of nouns joined by a conjunction. Each noun in the source language

has a neutral definite or indefinite article. The matching word in the target language then needs to be

preceded by its appropriate article. A sample output of the grammar is presented here, with spaces

between words being included for stylistic purposes:

we11 the en19 and the em17 ||| wg11 das gn19 und der gm17

4.3 Evaluation

For each task, test data is generated through the same procedure as training data, with the key difference

that the length of the source sequence is sampled from unif{65, 128}. As a result of this

change, we not only are assured that the models cannot observe any test sequences during training,

but are also measuring how well the sequence transduction capabilities of the evaluated models generalise

beyond the sequence lengths observed during training. To control for generalisation ability,

we also report accuracy scores on sequences separately sampled from the training set, which given

the size of the sample space are unlikely to have ever been observed during actual model training.

For each round of testing, we sample 1000 sequences from the appropriate test set. For each sequence,

the model reads in the source sequence and separator symbol, and begins generating the

next symbol by taking the maximally likely symbol from the softmax distribution over target symbols

produced by the model at each step. Based on this process, we give each model a coarse

accuracy score, corresponding to the proportion of test sequences correctly predicted from beginning

until end (EOS symbol) without error, as well as a fine accuracy score, corresponding to the

average proportion of each sequence correctly generated before the first error. Formally, we have:

\[
coarse = \frac{\#correct}{\#seqs} \qquad\qquad fine = \frac{1}{\#seqs} \sum_{i=1}^{\#seqs} \frac{\#correct_i}{|target_i|}
\]

where $\#correct$ and $\#seqs$ are the number of correctly predicted sequences (end-to-end) and the
total number of sequences in the test batch (1000 in this experiment), respectively; $\#correct_i$ is the
number of correctly predicted symbols before the first error in the $i$th sequence of the test batch, and
$|target_i|$ is the length of the target segment of that sequence (including EOS symbol).
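The two scores can be computed with a short helper. The following Python sketch follows the definitions above; the function name `coarse_fine` and the list-of-lists input format are assumptions made for illustration.

```python
def coarse_fine(predictions, targets):
    """Coarse and fine accuracy for parallel lists of symbol sequences.

    Each target is assumed to include its EOS symbol.
    """
    n_correct_seqs = 0
    fine_total = 0.0
    for pred, target in zip(predictions, targets):
        # Count correctly predicted symbols before the first error.
        n_ok = 0
        for p, t in zip(pred, target):
            if p != t:
                break
            n_ok += 1
        fine_total += n_ok / len(target)
        # A sequence counts as correct only if every target symbol matched.
        n_correct_seqs += int(n_ok == len(target))
    n = len(targets)
    return n_correct_seqs / n, fine_total / n


# Hypothetical batch of two sequences; 0 plays the role of EOS.
targets = [[1, 2, 3, 0], [1, 2, 3, 0]]
predictions = [[1, 2, 3, 0], [1, 9, 3, 0]]
print(coarse_fine(predictions, targets))  # (0.5, 0.625)
```

In the second sequence only the first symbol precedes the first error, so it contributes 1/4 to the fine score even though later symbols happen to match.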

For each task, we use as benchmarks the Deep LSTMs described in [1], with 1, 2, 4, and 8 layers.

Against these benchmarks, we evaluate neural Stack-, Queue-, and DeQue-enhanced LSTMs. When

running experiments, we trained and tested a version of each model where all LSTMs in each model

have a hidden layer size of 256, and one for a hidden layer size of 512. The Stack/Queue/DeQue

embedding size was arbitrarily set to 256, half the maximum hidden size. The number of parameters

for each model is reported for each architecture in Table 2 of the appendix. Concretely, the neural

Stack-, Queue-, and DeQue-enhanced LSTMs have the same number of trainable parameters as a

two-layer Deep LSTM. These all come from the extra connections to and from the memory module,

which itself has no trainable parameters, regardless of its logical size.

Models are trained with minibatch RMSProp [18], with a batch size of 10. We grid-searched learning

rates across the set $\{5 \times 10^{-3}, 1 \times 10^{-3}, 5 \times 10^{-4}, 1 \times 10^{-4}, 5 \times 10^{-5}\}$. We used gradient clipping

[19], clipping all gradients above 1. Average training perplexity was calculated every 100 batches.

Training and test set accuracies were recorded every 1000 batches.

Because of the impossibility of overfitting the datasets, we let the models train an unbounded number

of steps, and report results at convergence. We present in Figure 2a the coarse- and fine-grained

accuracies, for each task, of the best model of each architecture described in this paper alongside

the best performing Deep LSTM benchmark. The best models were automatically selected based on

average training perplexity. The LSTM benchmarks performed similarly across the range of random

initialisations, so the effect of this procedure is primarily to try and select the better performing

Stack/Queue/DeQue-enhanced LSTM. In most cases, this procedure does not yield the actual best-

performing model, and in practice a more sophisticated procedure such as ensembling [20] should

produce better results.

For all experiments, the Neural Stack or Queue outperforms the Deep LSTM benchmarks, often by
a significant margin. For most experiments, if a Neural Stack- or Queue-enhanced LSTM learns
to partially or consistently solve the problem, then so does the Neural DeQue. For experiments
where the enhanced LSTMs solve the problem completely (consistent accuracy of 1) in training,
the accuracy persists in longer sequences in the test set, whereas benchmark accuracies drop for
all experiments except the SVO to SOV and Gender Conjugation ITG transduction tasks. Across
all tasks which the enhanced LSTMs solve, the convergence on the top accuracy happens orders of
magnitude earlier for enhanced LSTMs than for benchmark LSTMs, as exemplified in Figure 2b.

                                      Training        Testing
Experiment           Model            Coarse  Fine    Coarse  Fine
Sequence Copying     4-layer LSTM     0.98    0.98    0.01    0.50
                     Stack-LSTM       0.89    0.94    0.00    0.22
                     Queue-LSTM       1.00    1.00    1.00    1.00
                     DeQue-LSTM       1.00    1.00    1.00    1.00
Sequence Reversal    Stack-LSTM       1.00    1.00    1.00    1.00
                     Queue-LSTM       0.44    0.61    0.00    0.01
                     DeQue-LSTM       1.00    1.00    1.00    1.00
Bigram Flipping      Stack-LSTM       0.44    0.90    0.00    0.48
                     Queue-LSTM       0.55    0.94    0.55    0.98
                     DeQue-LSTM       0.55    0.94    0.53    0.98
SVO to SOV           Stack-LSTM       1.00    1.00    1.00    1.00
                     Queue-LSTM       1.00    1.00    1.00    1.00
                     DeQue-LSTM       1.00    1.00    1.00    1.00
Gender Conjugation   Stack-LSTM       0.93    0.97    0.93    0.97
                     Queue-LSTM       1.00    1.00    1.00    1.00
                     DeQue-LSTM       1.00    1.00    1.00    1.00

Figure 2a: Comparing Enhanced LSTMs to Best Benchmarks

[Figure 2b: Comparison of Model Convergence during Training]

The results for the sequence inversion and copying tasks serve as unit tests for our models, as the

controller mainly needs to learn to push the appropriate number of times and then pop continuously.

Nonetheless, the failure of Deep LSTMs to learn such a regular pattern and generalise is itself

indicative of the limitations of the benchmarks presented here, and of the relative expressive power

of our models. Their ability to generalise perfectly to sequences up to twice as long as those attested during training is also notable, and is likewise observed in the other experiments. Finally, this pair of experiments illustrates that, while the Neural Queue solves copying and the Neural Stack solves reversal, a simple LSTM controller can learn to operate a DeQue as either structure, and so solve both tasks.
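To make the last point concrete, the following minimal sketch uses a classical, non-differentiable double-ended queue rather than the neural structures described in this paper. The function name `transduce` and its `read_from` parameter are illustrative assumptions. Every symbol is written at the same end; reading back from that end reverses the input (Stack behaviour), while reading from the opposite end copies it (Queue behaviour).

```python
from collections import deque


def transduce(tokens, read_from):
    """Write all tokens to a deque, then read them all back from one end.

    read_from="top"    reads from the end that was written to (stack, reversal).
    read_from="bottom" reads from the opposite end (queue, copying).
    """
    memory = deque()
    for tok in tokens:
        memory.append(tok)  # always write at the right-hand end
    out = []
    while memory:
        if read_from == "top":
            out.append(memory.pop())      # last in, first out
        else:
            out.append(memory.popleft())  # first in, first out
    return out


print(transduce(list("abc"), "top"))
print(transduce(list("abc"), "bottom"))
```

A single structure therefore covers both tasks, provided the controller learns which end to read from.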

The results of the Bigram Flipping task for all models are consistent with the failure to consistently

correctly generate the last two symbols of the sequence. We hypothesise that both Deep LSTMs and

our models economically learn to pairwise flip the sequence tokens, and attempt to do so half the

time when reaching the EOS token. For the two ITG tasks, the success of Deep LSTM benchmarks

relative to their performance in other tasks can be explained by their ability to exploit short local

dependencies dominating the longer dependencies in these particular grammars.

Overall, the rapid convergence, where possible, on a general solution to a transduction problem

in a manner which propagates to longer sequences without loss of accuracy is indicative that an

unbounded memory-enhanced controller can learn to solve these problems procedurally, rather than

memorising the underlying distribution of the data.

6 Conclusions

The experiments performed in this paper demonstrate that single-layer LSTMs enhanced by an un-

bounded differentiable memory capable of acting, in the limit, like a classical Stack, Queue, or

DeQue, are capable of solving sequence-to-sequence transduction tasks for which Deep LSTMs

falter. Even in tasks for which benchmarks obtain high accuracies, the memory-enhanced LSTMs

converge earlier, and to higher accuracies, while requiring considerably fewer parameters than all

but the simplest of Deep LSTMs. We therefore believe these constitute a crucial addition to our neu-

ral network toolbox, and that more complex linguistic transduction tasks such as machine translation

or parsing will be rendered more tractable by their inclusion.


References

[1] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural

networks. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,

editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran

Associates, Inc., 2014.

[2] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,

and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical

machine translation. arXiv preprint arXiv:1406.1078, 2014.

[3] GZ Sun, C Lee Giles, HH Chen, and YC Lee. The neural network pushdown automaton:

Model, stack and learning simulations. 1998.

[4] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401,

2014.

[5] Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385 of

Studies in Computational Intelligence. Springer, 2012.

[6] Markus Dreyer, Jason R. Smith, and Jason Eisner. Latent-variable modeling of string trans-

ductions with finite-state methods. In Proceedings of the Conference on Empirical Methods in

Natural Language Processing, EMNLP ’08, pages 1080–1089, Stroudsburg, PA, USA, 2008.

Association for Computational Linguistics.

[7] Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. Open-

FST: A general and efficient weighted finite-state transducer library. In Implementation and

Application of Automata, volume 4783 of Lecture Notes in Computer Science, pages 11–23.

Springer Berlin Heidelberg, 2007.

[8] Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel cor-

pora. Computational linguistics, 23(3):377–403, 1997.

[9] Alex Graves. Sequence transduction with recurrent neural networks. In Representation Learning Workshop, ICML, 2012.

[10] Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. Learning context-free grammars: Capabilities

and limitations of a recurrent neural network with an external stack memory. In Proceedings

of The Fourteenth Annual Conference of Cognitive Science Society. Indiana University, 1992.

[11] Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. Using prior knowledge in a NNPDA to

learn context-free languages. Advances in neural information processing systems, 1993.

[12] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. Weakly supervised mem-

ory networks. CoRR, abs/1503.08895, 2015.

[13] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural Turing machines. arXiv

preprint arXiv:1505.00521, 2015.

[14] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented re-

current nets. arXiv preprint arXiv:1503.01007, 2015.

[15] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[16] Alfred V Aho and Jeffrey D Ullman. The theory of parsing, translation, and compiling.

Prentice-Hall, Inc., 1972.

[17] Dekai Wu and Hongsing Wong. Machine translation with a stochastic grammatical channel.

In Proceedings of the 17th international conference on Computational linguistics-Volume 2,

pages 1408–1415. Association for Computational Linguistics, 1998.

[18] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running

average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4,

2012.

[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient

problem. Computing Research Repository (CoRR) abs/1211.5063, 2012.

[20] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: many could be better

than all. Artificial intelligence, 137(1):239–263, 2002.


Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets

Armand Joulin
Facebook AI Research
770 Broadway, New York, USA.
ajoulin@fb.com

Tomas Mikolov
Facebook AI Research
770 Broadway, New York, USA.
tmikolov@fb.com

Abstract

Despite the recent achievements in machine learning, we are still very far from

achieving real artificial intelligence. In this paper, we discuss the limitations of

standard deep learning approaches and show that some of these limitations can be

overcome by learning how to grow the complexity of a model in a structured way.

Specifically, we study the simplest sequence prediction problems that are beyond

the scope of what is learnable with standard recurrent networks, algorithmically

generated sequences which can only be learned by models which have the capacity

to count and to memorize sequences. We show that some basic algorithms can be

learned from sequential data using a recurrent network associated with a trainable

memory.

1 Introduction

Machine learning aims to find regularities in data to perform various tasks. Historically there have

been two major sources of breakthroughs: scaling up the existing approaches to larger datasets, and

development of novel approaches [5, 14, 22, 30]. In recent years, a lot of progress has been

made in scaling up learning algorithms, by either using alternative hardware such as GPUs [9] or by

taking advantage of large clusters [28]. While improving computational efficiency of the existing

methods is important to deploy the models in real world applications [4], it is crucial for the research

community to continue exploring novel approaches able to tackle new problems.

Recently, deep neural networks have become very successful at various tasks, leading to a shift in

the computer vision [21] and speech recognition communities [11]. This breakthrough is commonly

attributed to two aspects of deep networks: their similarity to the hierarchical, recurrent structure of

the neocortex and the theoretical justification that certain patterns are more efficiently represented

by functions employing multiple non-linearities instead of a single one [1, 25].

This paper investigates which patterns are difficult to represent and learn with the current state of the art methods. This would hopefully give us hints about how to design new approaches which will advance machine learning research further. In the past, this approach has led to crucial breakthrough results: the well-known XOR problem is an example of a trivial classification problem that cannot be solved using linear classifiers, but can be solved with a non-linear one. This popularized the use of non-linear hidden layers [30] and kernel methods [2]. Another well-known example is the parity problem described by Papert and Minsky [25]: it demonstrates that while a single non-linear hidden layer is sufficient to represent any function, it is not guaranteed to represent it efficiently, and in some cases can even require exponentially many more parameters (and thus, also training data) than what is sufficient for a deeper model. This led to the use of architectures that have several layers of non-linearities, currently known as deep learning models.

Following this line of work, we study basic patterns which are difficult to represent and learn for standard deep models. In particular, we study learning regularities in sequences of symbols generated by simple algorithms. Interestingly, we find that these regularities are difficult to learn even for some advanced deep learning methods, such as recurrent networks. We attempt to increase the learning capabilities of recurrent nets by allowing them to learn how to control an infinite structured memory. We explore two basic topologies of the structured memory: a pushdown stack and a list. Our structured memory is defined by constraining part of the recurrent matrix in a recurrent net [24]. We use multiplicative gating mechanisms as learnable controllers over the memory [8, 19] and show that this allows our network to operate as if it was performing simple read and write operations, such as PUSH or POP for a stack.

Sequence generator                          Example
{a^n b^n | n > 0}                           aabbaaabbbabaaaaabbbbb
{a^n b^n c^n | n > 0}                       aaabbbcccabcaaaaabbbbbccccc
{a^n b^n c^n d^n | n > 0}                   aabbccddaaabbbcccdddabcd
{a^n b^{2n} | n > 0}                        aabbbbaaabbbbbbabb
{a^n b^m c^{n+m} | n, m > 0}                aabcccaaabbcccccabcc
n ∈ [1, k], X → nXn, X → = (k = 2)          12=212122=221211121=12111

Table 1: Examples generated from the algorithms studied in this paper. In bold, the characters which can be predicted deterministically. During training, we do not have access to this information and at test time, we evaluate only on deterministically predictable characters.

Among recent work with similar motivation, we are aware of the Neural Turing Machine [17] and

Memory Networks [33]. However, our work can be considered more as a follow up of the research

done in the early nineties, when similar types of memory augmented neural networks were stud-

ied [12, 26, 27, 37].

2 Algorithmic Patterns

We focus on sequences generated by simple, short algorithms. The goal is to learn regularities in

these sequences by building predictive models. We are mostly interested in discrete patterns related

to those that occur in the real world, such as various forms of a long term memory.

More precisely, we suppose that during training we have only access to a stream of data which is

obtained by concatenating sequences generated by a given algorithm. We do not have access to the

boundary of any sequence nor to sequences which are not generated by the algorithm. We denote

the regularities in these sequences of symbols as Algorithmic patterns. In this paper, we focus on

algorithmic patterns which involve some form of counting and memorization. Examples of these

patterns are presented in Table 1. For simplicity, we mostly focus on the unary and binary numeral

systems to represent patterns. This allows us to focus on designing a model which can learn these

algorithms when the input is given in its simplest form.
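As an illustration, a few of the generators in Table 1 and the concatenated training stream can be sketched as follows. The function names, the uniform sampling of n and the fixed seed are assumptions made for this sketch; the text does not specify how n is drawn.

```python
import random


def anbn(n):
    """One sequence of the form a^n b^n."""
    return "a" * n + "b" * n


def anbm_cnm(n, m):
    """One sequence of the form a^n b^m c^(n+m)."""
    return "a" * n + "b" * m + "c" * (n + m)


def memorization(digits):
    """X -> nXn, X -> '=': a string, '=', then the same string reversed."""
    s = "".join(str(d) for d in digits)
    return s + "=" + s[::-1]


def stream(generator, max_n, count, seed=0):
    """Concatenate `count` sequences; boundaries are not marked in the output."""
    rng = random.Random(seed)
    return "".join(generator(rng.randint(1, max_n)) for _ in range(count))


print(anbn(2), anbm_cnm(2, 1), memorization([1, 2]))
print(stream(anbn, max_n=5, count=4))
```

The model only ever sees the output of `stream`, so it has to discover both the sequence boundaries and the generation rule.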

Some algorithms can be given as context-free grammars; however, we are interested in the more general case of sequential patterns that have a short description length in some general Turing-complete computational system. Of particular interest are patterns relevant to developing a better language understanding. Finally, this study is limited to patterns whose symbols can be predicted in a single computational step, leaving out algorithms such as sorting or dynamic programming.

3 Related work

Some of the algorithmic patterns we study in this paper are closely related to context-free and context-sensitive grammars, which were widely studied in the past. Some works used recurrent networks with hardwired symbolic structures [10, 15, 18]. These networks are continuous implementations of symbolic systems, and can deal with recursive patterns in computational linguistics. While these approaches are interesting for understanding the link between symbolic and sub-symbolic systems such as neural networks, they are often hand-designed for each specific grammar.

Wiles and Elman [34] show that simple recurrent networks are able to learn sequences of the form

a^n b^n and generalize on a limited range of n. While this is a promising result, their model does not


truly learn how to count but instead relies mostly on memorization of the patterns seen in the training data. Rodriguez et al. [29] further studied the behavior of this network. Grünwald [18] designs a hardwired second order recurrent network to tackle similar sequences. Christiansen and Chater [7] extended these results to grammars with larger vocabularies. This work shows that this type of architecture can learn complex internal representations of the symbols but cannot generalize to longer sequences generated by the same algorithm. Besides simple recurrent networks, other structures have been used to deal with recursive patterns, such as pushdown dynamical automata [31] or sequential cascaded networks [3, 27].

Hochreiter and Schmidhuber [19] introduced the Long Short Term Memory network (LSTM) architecture. While this model was originally developed to address the vanishing and exploding gradient problems, LSTM is also able to learn simple context-free and context-sensitive grammars [16, 36]. This is possible because its hidden units can choose through a multiplicative gating mechanism to be either linear or non-linear. The linear units allow the network to potentially count (one can easily add and subtract constants) and store a finite amount of information for a long period of time. These mechanisms are also used in the Gated Recurrent Unit network [8]. In our work we investigate the use of a similar mechanism in a context where the memory is unbounded and structured. As opposed to previous work, we do not need to “erase” our memory to store a new unit. More recently, Graves et al. [17] have extended LSTM with an attention mechanism to build a model which roughly resembles a Turing machine with limited tape. Their memory controller works with a fixed size memory and it is not clear if its complexity is necessary for the simple problems they study.

Finally, many works have also used external memory modules with a recurrent network, such as stacks [12, 13, 20, 26, 37]. Zheng et al. [37] use a discrete external stack which may be hard to learn on long sequences. Das et al. [12] learn a continuous stack which has some similarities with ours. The mechanisms used in their work are quite different from ours. Their memory cells are associated with weights to allow a continuous representation of the stack, in order to train it with a continuous optimization scheme. On the other hand, our solution is closer to a standard RNN with special connectivities which simulate a stack with unbounded capacity. We tackle problems which are closely related to the ones addressed in these works and try to go further by exploring more challenging problems such as binary addition.

4 Model

4.1 Simple recurrent network

We consider sequential data that comes in the form of discrete tokens, such as characters or words. The goal is to design a model able to predict the next symbol in a stream of data. Our approach is based on a standard model called the recurrent neural network (RNN), popularized by Elman [14]. An RNN consists of an input layer, a hidden layer with a recurrent time-delayed connection and an output layer. The recurrent connection allows the propagation of information through time. Given a sequence of tokens, the RNN takes as input the one-hot encoding x_t of the current token and predicts the probability y_t of the next symbol. There is a hidden layer with m units which stores additional information about the previous tokens seen in the sequence. More precisely, at each time t, the state of the hidden layer h_t is updated based on its previous state h_{t-1} and the encoding x_t of the current token, according to the following equation:

h_t = \sigma(U x_t + R h_{t-1}), \qquad (1)

where σ(x) = 1/(1 + exp(−x)) is the sigmoid activation function applied coordinate-wise, U is the m × d token embedding matrix and R is the m × m matrix of recurrent weights. Given the state of these hidden units, the network then outputs the probability vector y_t of the next token, according to the following equation:

y_t = f(V h_t), \qquad (2)

where f is the softmax function [6] and V is the d × m output matrix, where d is the number of different tokens. This architecture is able to learn relatively complex patterns similar in nature to the ones captured by N-grams. While this has made RNNs interesting for language modeling [23], they may not have the capacity to learn how algorithmic patterns are generated. In the next section, we show how to add an external memory to RNNs which has the theoretical capability to learn simple algorithmic patterns.
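One forward step of Eqs. (1) and (2) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: matrices are stored as lists of rows, with U holding m rows of d entries and V holding d rows of m entries so that the products U x_t and V h_t are defined; the helper names are hypothetical and no training is shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in M]


def rnn_step(U, R, V, x_t, h_prev):
    """Return (h_t, y_t) for one token, following Eqs. (1) and (2)."""
    pre = [a + b for a, b in zip(matvec(U, x_t), matvec(R, h_prev))]
    h_t = [sigmoid(p) for p in pre]   # Eq. (1)
    y_t = softmax(matvec(V, h_t))     # Eq. (2)
    return h_t, y_t


# d = 3 tokens, m = 2 hidden units, arbitrary small weights
U = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
R = [[0.2, 0.0], [-0.3, 0.1]]
V = [[0.4, -0.4], [0.1, 0.2], [-0.2, 0.3]]
h, y = rnn_step(U, R, V, x_t=[1.0, 0.0, 0.0], h_prev=[0.0, 0.0])
print(h, y)
```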



Figure 1: (a) Neural network extended with push-down stack and a controlling mechanism that

learns what action (among PUSH, POP and NO-OP) to perform. (b) The same model extended with

a doubly-linked list with actions INSERT, LEFT, RIGHT and NO-OP.

4.2 Pushdown network

In this section, we describe a simple structured memory inspired by the pushdown automaton, i.e., an automaton which employs a stack. We train our network to learn how to operate this memory with standard optimization tools.

A stack is a type of persistent memory which can only be accessed through its topmost element. Three basic operations can be performed with a stack: POP removes the top element, PUSH adds a new element on top of the stack and NO-OP does nothing. For simplicity, we first consider a simplified version where the model can only choose between a PUSH or a POP at each time step. We suppose that this decision is made by a 2-dimensional variable a_t which depends on the state of the hidden variable h_t:

a_t = f(A h_t), \qquad (3)

where A is a 2 × m matrix (m is the size of the hidden layer) and f is a softmax function. We denote by a_t[PUSH] the probability of the PUSH action, and by a_t[POP] the probability of the POP action. We suppose that the stack is stored at time t in a vector s_t of size p. Note that p could be increased on demand and does not have to be fixed, which allows the capacity of the model to grow. The top element is stored at position 0, with value s_t[0]:

s_t[0] = a_t[\text{PUSH}] \, \sigma(D h_t) + a_t[\text{POP}] \, s_{t-1}[1], \qquad (4)

where D is a 1 × m matrix. If a_t[POP] is equal to 1, the top element is replaced by the value below (all values are moved by one position up in the stack structure). If a_t[PUSH] is equal to 1, we move all values down in the stack and add a value on top of the stack. Similarly, for an element stored at a depth i > 0 in the stack, we have the following update rule:

s_t[i] = a_t[\text{PUSH}] \, s_{t-1}[i-1] + a_t[\text{POP}] \, s_{t-1}[i+1]. \qquad (5)

We use the stack to carry information to the hidden layer at the next time step. When the stack is empty, s_t is set to −1. The hidden layer h_t is now updated as:

h_t = \sigma(U x_t + R h_{t-1} + P s^k_{t-1}), \qquad (6)

where P is an m × k recurrent matrix and s^k_{t-1} are the k top-most elements of the stack at time t − 1. In our experiments, we set k to 2. We call this model Stack RNN, and show it in Figure 1-a without the recurrent matrix R for clarity.
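The stack update of Eqs. (4) and (5) can be sketched as below for the PUSH/POP-only case. This is an illustrative sketch, not the authors' implementation: the function name is hypothetical, `new_top` stands for σ(D h_t), cells beyond the stored vector are read as −1 (the empty-stack value), and the vector is grown by one cell per step so that a push never discards data.

```python
EMPTY = -1.0  # value used in the text for an empty stack


def stack_update(s_prev, push, pop, new_top):
    """Apply Eqs. (4)-(5) once, with a_t[PUSH] = push and a_t[POP] = pop."""
    p = len(s_prev)

    def prev(i):
        # Assumption of this sketch: cells outside the vector are empty.
        return s_prev[i] if 0 <= i < p else EMPTY

    s = [push * new_top + pop * prev(1)]  # Eq. (4), top of the stack
    for i in range(1, p + 1):             # Eq. (5), depths i > 0
        s.append(push * prev(i - 1) + pop * prev(i + 1))
    return s


s = stack_update([], push=1.0, pop=0.0, new_top=0.7)   # push 0.7
s = stack_update(s, push=1.0, pop=0.0, new_top=0.2)    # push 0.2
print(s)
s = stack_update(s, push=0.0, pop=1.0, new_top=0.5)    # pop
print(s)
```

With action probabilities strictly between 0 and 1, each cell becomes a weighted mixture of its two neighbours, which is what keeps the structure differentiable.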

Stack with a no-operation. Adding the NO-OP action allows the stack to keep the same value on

top by a minor change of the stack update rule. Eq. (4) is replaced by:

s_t[0] = a_t[\text{PUSH}] \, \sigma(D h_t) + a_t[\text{POP}] \, s_{t-1}[1] + a_t[\text{NO-OP}] \, s_{t-1}[0].

Extension to multiple stacks. Using a single stack has serious limitations, especially considering

that at each time step, only one action can be performed. We increase the capacity of the model by

using multiple stacks in parallel. The stacks can interact through the hidden layer allowing them to

process more challenging patterns.


method                        a^n b^n   a^n b^n c^n   a^n b^n c^n d^n   a^n b^{2n}   a^n b^m c^{n+m}
RNN                           25%       23.3%         13.3%             23.3%        33.3%
LSTM                          100%      100%          68.3%             75%          100%
List RNN 40+5                 100%      33.3%         100%              100%         100%
Stack RNN 40+10               100%      100%          100%              100%         43.3%
Stack RNN 40+10 + rounding    100%      100%          100%              100%         100%

Table 2: Comparison with RNN and LSTM on sequences generated by counting algorithms. The

sequences seen during training are such that n < 20 (and n + m < 20), and we test on sequences

up to n = 60. We report the percent of n for which the model was able to correctly predict the

sequences. Performance above 33.3% means it is able to generalize to never seen sequence lengths.

Doubly-linked lists. While in this paper we mostly focus on an infinite memory based on stacks, it is straightforward to extend the model to other forms of infinite memory, for example, the doubly-linked list. A list is a one dimensional memory where each node is connected to its left and right neighbors. There is a read/write head associated with the list. The head can move between nearby nodes and insert a new node at its current position. More precisely, we consider three different actions: INSERT, which inserts an element at the current position of the head, LEFT, which moves the head to the left, and RIGHT, which moves it to the right. Given a list L and a fixed head position HEAD, the updates are:

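L_t[i] = \begin{cases}
a_t[\text{RIGHT}] \, L_{t-1}[i+1] + a_t[\text{LEFT}] \, L_{t-1}[i-1] + a_t[\text{INSERT}] \, \sigma(D h_t) & \text{if } i = \text{HEAD}, \\
a_t[\text{RIGHT}] \, L_{t-1}[i+1] + a_t[\text{LEFT}] \, L_{t-1}[i-1] + a_t[\text{INSERT}] \, L_{t-1}[i+1] & \text{if } i < \text{HEAD}, \\
a_t[\text{RIGHT}] \, L_{t-1}[i+1] + a_t[\text{LEFT}] \, L_{t-1}[i-1] + a_t[\text{INSERT}] \, L_{t-1}[i] & \text{if } i > \text{HEAD}.
\end{cases}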

Note that we can add a NO-OP operation as well. We call this model List RNN, and show it in

Figure 1-b without the recurrent matrix R for clarity.
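The list update above can be sketched on a finite window of the list as follows. This is a hedged illustration only: the function name is hypothetical, `new_value` stands for σ(D h_t), and cells outside the window are read as −1, a padding value borrowed from the stack convention since the text does not state one for lists.

```python
def list_update(L_prev, head, a, new_value, pad=-1.0):
    """Apply the three-case list update once.

    `a` maps "RIGHT", "LEFT" and "INSERT" to probabilities summing to 1.
    """
    n = len(L_prev)

    def prev(i):
        # Assumption of this sketch: cells outside the window hold `pad`.
        return L_prev[i] if 0 <= i < n else pad

    out = []
    for i in range(n):
        if i == head:
            inserted = new_value    # the new element lands on the head
        elif i < head:
            inserted = prev(i + 1)  # cells before the head take their right neighbour
        else:
            inserted = prev(i)      # cells after the head are unchanged
        out.append(a["RIGHT"] * prev(i + 1)
                   + a["LEFT"] * prev(i - 1)
                   + a["INSERT"] * inserted)
    return out


insert_only = {"RIGHT": 0.0, "LEFT": 0.0, "INSERT": 1.0}
print(list_update([1.0, 2.0, 3.0], head=1, a=insert_only, new_value=9.0))
```

Because the head position is fixed, LEFT and RIGHT are realised by shifting the contents of the list past the head rather than by moving an index.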

Optimization. The models presented above are continuous and can thus be trained with the stochastic gradient descent (SGD) method and back-propagation through time [30, 32, 35]. As patterns become more complex, more complex memory controllers must be learned. In practice, we observe that these more complex controllers are harder to learn with SGD. Using several random restarts seems to solve the problem in our case. We have also explored other types of search-based procedures, as discussed in the supplementary material.

Rounding. Continuous operators on stacks introduce small imprecisions leading to numerical issues on very long sequences. While simply discretizing the controllers partially solves this problem, we design a more robust rounding procedure tailored to our model. We slowly make the controllers converge to discrete values by multiplying their weights by a constant which slowly goes to infinity. We finetune the weights of our network as this multiplicative variable increases, leading to a smoother rounding of our network. Finally, we remove unused stacks by exploring models which use only a subset of the stacks. While brute-force would be exponential in the number of stacks, we can do it efficiently by building a tree of removable stacks and exploring it with depth-first search.
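The effect of multiplying the controller weights by a growing constant can be sketched on the action softmax of Eq. (3). Since (βA) h_t = β (A h_t), scaling the matrix A by β scales the logits by β, and the action probabilities approach a one-hot vector as β grows. The names `sharpened_actions` and `beta` are hypothetical, and neither the finetuning schedule nor the removal of unused stacks is shown.

```python
import math


def softmax(z):
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sharpened_actions(logits, beta):
    """Action probabilities after scaling the controller logits by beta."""
    return softmax([beta * v for v in logits])


logits = [0.2, 0.1]  # e.g. (PUSH, POP) logits A h_t
for beta in (1.0, 10.0, 200.0):
    print(beta, sharpened_actions(logits, beta))
```

For small β the two actions are almost equally likely; for large β the controller behaves like a discrete PUSH.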

5 Experiments and results

First, we consider various sequences generated by simple algorithms, where the goal is to learn their

generation rule [3, 12, 29]. We hope to understand the scope of algorithmic patterns each model can

capture. We also evaluate the models on a standard language modeling dataset, Penn Treebank.

Implementation details. Stack and List RNNs are trained with SGD and backpropagation through

time with 50 steps [32], a hard clipping of 15 to prevent gradient explosions [23], and an initial

learning rate of 0.1. The learning rate is divided by 2 each time the entropy on the validation set is

not decreasing. The depth k defined in Eq. (6) is set to 2. The free parameters are the number of

hidden units, stacks and the use of NO-OP. The baselines are RNNs with 40, 100 and 500 units, and

LSTMs with 1 and 2 layers with 50, 100 and 200 units. The hyper-parameters of the baselines are

selected on the validation sets.
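The learning-rate schedule and the clipping described above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: "hard clipping of 15" is read as clamping each gradient component to [−15, 15], and "not decreasing" is read as the validation entropy being greater than or equal to its previous value. The function names are hypothetical.

```python
def next_learning_rate(lr, prev_val_entropy, val_entropy):
    """Halve the learning rate when validation entropy did not decrease."""
    if prev_val_entropy is not None and val_entropy >= prev_val_entropy:
        return lr / 2.0
    return lr


def hard_clip(grads, bound=15.0):
    """Clamp every gradient component to the interval [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in grads]


lr = 0.1
lr = next_learning_rate(lr, prev_val_entropy=2.0, val_entropy=1.9)  # improved
lr = next_learning_rate(lr, prev_val_entropy=1.9, val_entropy=1.9)  # stalled
print(lr, hard_clip([20.0, -30.0, 3.0]))
```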

Given an algorithm with short description length, we generate sequences and concatenate them into

longer sequences. This is an unsupervised task, since the boundaries of the generated sequences


current  next  prediction  proba(next)  action        stack1[top]  stack2[top]
b        a     a           0.99         POP   POP     -1           0.53
a        a     a           0.99         PUSH  POP     0.01         0.97
a        a     a           0.95         PUSH  PUSH    0.18         0.99
a        a     a           0.93         PUSH  PUSH    0.32         0.98
a        a     a           0.91         PUSH  PUSH    0.40         0.97
a        a     a           0.90         PUSH  PUSH    0.46         0.97
a        b     a           0.10         PUSH  PUSH    0.52         0.97
b        b     b           0.99         PUSH  PUSH    0.57         0.97
b        b     b           1.00         POP   PUSH    0.52         0.56
b        b     b           1.00         POP   PUSH    0.46         0.01
b        b     b           1.00         POP   PUSH    0.40         0.00
b        b     b           1.00         POP   PUSH    0.32         0.00
b        b     b           1.00         POP   PUSH    0.18         0.00
b        b     b           0.99         POP   PUSH    0.01         0.00
b        b     b           0.99         POP   POP     -1           0.00
b        b     b           0.99         POP   POP     -1           0.00
b        b     b           0.99         POP   POP     -1           0.00
b        b     b           0.99         POP   POP     -1           0.01
b        a     a           0.99         POP   POP     -1           0.56

Table 3: Example of the Stack RNN with 20 hidden units and 2 stacks on a sequence a^n b^{2n} with n = 6. −1 means that the stack is empty. The depth k is set to 1 for clarity. We see that the first stack pushes an element every time it sees a and pops when it sees b. The second stack pushes when it sees a. When it sees b, it pushes if the first stack is not empty and pops otherwise. This shows how the two stacks interact to correctly predict the deterministic part of the sequence (shown in bold).

Figure 2: Comparison of RNN, LSTM, List RNN and Stack RNN on memorization and the performance of Stack RNN on binary addition. The accuracy is the proportion of correctly predicted sequences generated with a given n. We use 100 hidden units and 10 stacks.

are not known. We study patterns related to counting and memorization as shown in Table 1. To

evaluate if a model has the capacity to understand the generation rule used to produce the sequences,

it is tested on sequences it has not seen during training. Our experimental setting is the following:

the training and validation sets are composed of sequences generated with n up to N < 20 while

the test set is composed of sequences generated with n up to 60. During training, we incrementally

increase the parameter n every few epochs until it reaches some N . At test time, we measure the

performance by counting the number of correctly predicted sequences. A sequence is considered as

correctly predicted if we correctly predict its deterministic part, shown in bold in Table 1. On these

toy examples, the recurrent matrix R defined in Eq. (1) is set to 0 to isolate the mechanisms that

Stack and list can capture.
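The evaluation protocol can be sketched for a^n b^n as follows. The sketch assumes that the deterministic part of a sequence consists of every symbol after its first b, including the a that opens the next sequence; this reading is suggested by the high-confidence predictions in Table 3 but is an assumption here. The hand-written `counting_predictor` stands in for a trained model, and all names are hypothetical.

```python
def counting_predictor(prefix):
    """Stand-in for a model: predict the next symbol of an a^n b^n sequence."""
    n_a = prefix.count("a")
    n_b = prefix.count("b")
    # Keep emitting b until the b's match the a's, then start a new sequence.
    return "b" if 0 < n_b < n_a else "a"


def sequence_correct(predict, n):
    """True if every deterministic symbol of one a^n b^n sequence is predicted."""
    seq = "a" * n + "b" * n + "a"  # trailing a opens the next sequence
    for t in range(n + 1, len(seq)):  # positions after the first b
        if predict(seq[:t]) != seq[t]:
            return False
    return True


def percent_of_n_solved(predict, max_n):
    """Percentage of n in [1, max_n] for which the sequence is fully correct."""
    solved = sum(sequence_correct(predict, n) for n in range(1, max_n + 1))
    return 100.0 * solved / max_n


print(percent_of_n_solved(counting_predictor, 60))
```

A model that only memorizes the lengths seen in training would fail `sequence_correct` for larger n, which is how the percentages in Table 2 separate counting from memorization.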

Counting. Results on patterns generated by “counting” algorithms are shown in Table 2. We report

the percentage of sequence lengths for which a method is able to correctly predict sequences of

that length. List RNN and Stack RNN have 40 hidden units and either 5 lists or 10 stacks. For

these tasks, the NO-OP operation is not used. Table 2 shows that RNNs are unable to generalize to

longer sequences, and they only correctly predict sequences seen during training. LSTM is able to

generalize to longer sequences which shows that it is able to count since the hidden units in an LSTM

can be linear [16]. With a finer hyper-parameter search, the LSTM should be able to achieve 100%


on all of these tasks. Despite the absence of linear units, our models (Stack RNN and List RNN) are also able to generalize. For a^n b^m c^{n+m}, rounding is required to obtain the best performance.

Table 3 shows an example of actions done by a Stack RNN with two stacks on a sequence of the form a^n b^{2n}. For clarity, we show a sequence generated with n equal to 6, and we use discretization. Stack RNN pushes an element on both stacks when it sees a. The first stack pops elements when the input is b and the second stack starts popping only when the first one is empty. Note that the second stack pushes a special value to keep track of the sequence length, i.e. 0.56.

Memorization. Figure 2 shows results on memorization for a dictionary with two elements. Stack

RNN has 100 units and 10 stacks, and List RNN has 10 lists. We use random restarts and we repeat

this process multiple times. Stack RNN and List RNN are able to learn memorization, while RNN

and LSTM do not seem to generalize. In practice, List RNN is more unstable than Stack RNN and overfits on the training set more frequently. This instability may be explained by the higher number of actions the controller can choose from (4 versus 3). For this reason, we focus on Stack RNN in the rest of the experiments.

Figure 3: An example of a learned Stack RNN that performs binary addition. The last column

is our interpretation of the functionality learned by the different stacks. The color code is: green

means PUSH, red means POP and grey means actions equivalent to NO-OP. We show the current

(discretized) value on top of each stack at each given time. The sequence is read from left

to right, one character at a time. In bold is the part of the sequence which has to be predicted. Note

that the result is written in reverse.

Binary addition. Given a sequence representing a binary addition, e.g., “101+1=”, the goal is

to predict the result, e.g., “110.” where “.” represents the end of the sequence. As opposed to

the previous tasks, this task is supervised, i.e., the location of the deterministic tokens is provided.

The result of the addition is asked in the reverse order, e.g., “011.” in the previous example. As

previously, we train on short sequences and test on longer ones. The length of the two input numbers

is chosen such that the sum of their lengths is equal to n (less than 20 during training and up to 60

at test time). Their most significant digit is always set to 1. Stack RNN has 100 hidden units with

10 stacks. The right panel of Figure 2 shows the results averaged over multiple runs (with random restarts). While Stack RNN generalizes to longer numbers, it overfits on the validation set for some runs, leading to a larger error bar than in the previous experiments.
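The input/target format of the addition task can be sketched as follows. The function name is hypothetical and the sketch takes the two operands as positive integers, which guarantees that their most significant binary digit is 1; how the operands are sampled is not shown.

```python
def addition_example(x, y):
    """Return (prompt, target) for x + y in binary.

    The target is the sum written in reverse order, followed by '.'.
    """
    prompt = format(x, "b") + "+" + format(y, "b") + "="
    target = format(x + y, "b")[::-1] + "."
    return prompt, target


print(addition_example(5, 1))
```

For the example in the text, `addition_example(5, 1)` gives the prompt "101+1=" and the reversed target "011.".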

Figure 3 shows an example of a model which generalizes to long sequences of binary addition. This

example illustrates the moderately complex behavior that the Stack RNN learns to solve this task: the

first stack keeps track of where we are in the sequence, i.e., either reading the first number, reading

the second number or writing the result. Stack 6 keeps the first number in memory. Interestingly, the

first number is first captured by stacks 3 and 5 and then copied to stack 6. The second number is

stored on stack 3, while its length is captured on stack 4 (by pushing a one and then a set of zeros).

When producing the result, the values stored on these three stacks are popped. Finally stack 5 takes

care of the carry: it switches between two states (0 or 1), which explicitly indicate whether there is a carry

or not. While this use of stacks is not optimal in the sense of minimal description length, it is able

to generalize to sequences never seen before.
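
For comparison, a hand-written discrete procedure for the same task can be sketched with two ordinary stacks and a carry bit. This is only an illustrative analogue and not the learned model: the network described above spreads the same roles over more stacks and uses continuous actions. The function name add_with_stacks is hypothetical. The sketch shows why emitting the result in reverse order is natural: popping returns the least significant digits first.

```python
def add_with_stacks(sequence: str) -> str:
    """Read "a+b=" left to right, then emit the binary sum in reverse order."""
    first, second = [], []  # one stack per operand
    current = first
    for ch in sequence:
        if ch == "+":
            current = second         # now reading the second number
        elif ch == "=":
            break                    # now writing the result
        else:
            current.append(int(ch))  # PUSH the digit

    carry, out = 0, []
    while first or second or carry:
        x = first.pop() if first else 0    # POP: least significant digit comes out first
        y = second.pop() if second else 0
        total = x + y + carry
        out.append(str(total % 2))         # digit to emit at this step
        carry = total // 2                 # two-state carry, 0 or 1
    return "".join(out) + "."


if __name__ == "__main__":
    print(add_with_stacks("101+1="))
```

The single carry variable plays the role attributed to stack 5 above, switching between the two states 0 and 1.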

Model                 | Ngram | Ngram + Cache | RNN | LSTM | SRCN [24] | Stack RNN
Validation perplexity | -     | -             | 137 | 120  | 120       | 124
Test perplexity       | 141   | 125           | 129 | 115  | 115       | 118

Table 4: Comparison of RNN, LSTM, SRCN [24] and Stack RNN on Penn Treebank Corpus. We

use the recurrent matrix R in Stack RNN as well as 100 hidden units and 60 stacks.

We compare Stack RNN with RNN, LSTM and SRCN [24] on the Penn Treebank Corpus, a standard language

modeling dataset. SRCN is a standard RNN with additional self-connected linear

units which capture long-term dependencies, similar to a bag of words. The models have only one

hidden layer with 100 hidden units. Table 4 shows that Stack RNN performs better than RNN with

a comparable number of parameters, but not as well as LSTM and SRCN. Empirically, we observe

that Stack RNN learns to store an exponentially decaying bag of words, similar in nature to the memory

of SRCN.
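
As an illustration of what an exponentially decaying bag of words means, the sketch below keeps a running vector in which each past word's contribution shrinks by a constant factor at every step. The update rule s_t = alpha * s_{t-1} + (1 - alpha) * x_t, the value of alpha and the function name decaying_bag_of_words are assumptions chosen for illustration; the sketch does not reproduce the actual parameterization of either SRCN or Stack RNN.

```python
def decaying_bag_of_words(token_ids, vocab_size, alpha=0.9):
    """Return a vocabulary-sized vector where older tokens carry exponentially less weight."""
    state = [0.0] * vocab_size
    for token in token_ids:
        state = [alpha * s for s in state]  # decay every past contribution
        state[token] += 1.0 - alpha         # add the current word (one-hot input)
    return state


if __name__ == "__main__":
    # Token 0 was seen two steps before the end, so its weight has decayed twice.
    print(decaying_bag_of_words([0, 1, 1], vocab_size=3, alpha=0.5))
```

Word order is lost in such a representation; only how recently each word occurred is retained.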

6 Discussion and future work

Continuous versus discrete model and search. Certain simple algorithmic patterns can be efficiently

learned using a continuous optimization approach (stochastic gradient descent) applied to a

continuous model representation (in our case RNN). Note that Stack RNN works better than prior

work based on RNN from the nineties [12, 34, 37]. It also seems simpler than many other approaches

designed for these tasks [3, 17, 31]. However, it is not clear if a continuous representation

is completely appropriate for learning algorithmic patterns. It may be more natural to attempt to

solve these problems with a discrete model. This motivates us to try to combine continuous and

discrete optimization. It is possible that the future of learning of algorithmic patterns will involve

such a combination of discrete and continuous optimization.

Long-term memory. While in theory using multiple stacks for representing memory is as powerful

as a Turing complete computational system, intricate interactions between stacks need to be learned

to capture more complex algorithmic patterns. Stack RNN also requires the input and output sequences

to be in the right format (e.g., memorization is in reversed order). It would be interesting

to consider in the future other forms of memory which may be more flexible, as well as additional

mechanisms which allow performing multiple steps with the memory, such as loop or random access.

Finally, complex algorithmic patterns can be more easily learned by composing simpler algorithms.

Designing a model which possesses a mechanism to compose algorithms automatically and training

it on incrementally harder tasks is a very important research direction.

7 Conclusion

We have shown that certain difficult pattern recognition problems can be solved by augmenting a

recurrent network with structured, growing (potentially unlimited) memory. We studied very simple

memory structures such as a stack and a list, but the same approach can be used to learn how to

operate more complex ones (for example a multi-dimensional tape). While currently the topology

of the long-term memory is fixed, we think that it should be learned from the data as well.

Acknowledgment. We would like to thank Arthur Szlam, Keith Adams, Jason Weston, Yann LeCun

and the rest of the Facebook AI Research team for their useful comments.

References

[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. Large-scale kernel machines, 2007.

[2] C. M. Bishop. Pattern recognition and machine learning. Springer New York, 2006.

[3] M. Bodén and J. Wiles. Context-free and context-sensitive dynamics in recurrent neural networks.

Connection Science, 2000.

The code is available at https://github.com/facebook/Stack-RNN

[4] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT. Springer, 2010.

[5] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[6] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships

to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990.

[7] M. H. Christiansen and N. Chater. Toward a connectionist model of recursion in human linguistic

performance. Cognitive Science, 23(2):157–205, 1999.

[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. arXiv, 2015.

[9] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. High-performance neural

networks for visual object classification. arXiv preprint, 2011.

[10] M. W. Crocker. Mechanisms for sentence processing. University of Edinburgh, 1996.

[11] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for

large-vocabulary speech recognition. Audio, Speech, and Language Processing, 20(1):30–42, 2012.

[12] S. Das, C. Giles, and G. Sun. Learning context-free grammars: Capabilities and limitations of a recurrent

neural network with an external stack memory. In ACCSS, 1992.

[13] S. Das, C. Giles, and G. Sun. Using prior knowledge in a NNPDA to learn context-free languages. NIPS,

1993.

[14] J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

[15] M. Fanty. Context-free parsing in connectionist networks. Parallel natural language processing, 1994.

[16] F. A. Gers and J. Schmidhuber. LSTM recurrent networks learn simple context-free and context-sensitive

languages. Transactions on Neural Networks, 12(6):1333–1340, 2001.

[17] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint, 2014.

[18] P. Grünwald. A recurrent network that performs a context-sensitive prediction task. In ACCSS, 1996.

[19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[20] S. Hölldobler, Y. Kalinke, and H. Lehmann. Designing a counter: Another case study of dynamics and

activation landscapes in recurrent networks. In Advances in Artificial Intelligence, 1997.

[21] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural

networks. In NIPS, 2012.

[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.

1998.

[23] T. Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of

Technology, 2012.

[24] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. A. Ranzato. Learning longer memory in recurrent

neural networks. arXiv preprint, 2014.

[25] M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.

[26] M. C. Mozer and S. Das. A connectionist symbol manipulator that discovers the structure of context-free

languages. NIPS, 1993.

[27] J. B. Pollack. The induction of dynamical recognizers. Machine Learning, 7(2-3):227–252, 1991.

[28] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient

descent. In NIPS, 2011.

[29] P. Rodriguez, J. Wiles, and J. L. Elman. A recurrent neural network that learns to count. Connection

Science, 1999.

[30] D. E. Rumelhart, G. Hinton, and R. J. Williams. Learning internal representations by error propagation.

Technical report, DTIC Document, 1985.

[31] W. Tabor. Fractal encoding of context-free grammars in connectionist networks. Expert Systems, 2000.

[32] P. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural

Networks, 1(4):339–356, 1988.

[33] J. Weston, S. Chopra, and A. Bordes. Memory networks. In ICLR, 2015.

[34] J. Wiles and J. Elman. Learning to count without a counter: A case study of dynamics and activation

landscapes in recurrent networks. In ACCSS, 1995.

[35] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their

computational complexity. Back-propagation: Theory, architectures and applications, pages 433–486, 1995.

[36] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint, 2014.

[37] Z. Zeng, R. M. Goodman, and P. Smyth. Discrete recurrent neural networks for grammatical inference.

Transactions on Neural Networks, 5(2):320–330, 1994.
