INTEGER SEQUENCE LEARNING USING
RECTIFIED LINEAR UNITS
IN RECURRENT NEURAL
NETWORKS
Submitted by
SHRIRAMVISHWANATH S. (13BCS022)
SARAVANAN V.A. (13BCS084)
VENKATAKRISHNAN N. (13BCS102)
NAGARAJ P. (14BCS306)
in partial fulfillment of the requirements for the award of the degree
of
Bachelor of Engineering
in
Computer Science and Engineering

OCTOBER 2018

Dr. Mahalingam College of Engineering and Technology
Pollachi – 642003
An Autonomous Institution
BONAFIDE CERTIFICATE
SHRIRAMVISHWANATH S. (13BCS022)
SARAVANAN V.A. (13BCS084)
VENKATAKRISHNAN N. (13BCS102)
NAGARAJ P. (14BCS306)
SUPERVISOR
Dr. G. Anupriya
Associate Professor
Dept. of Computer Science and Engineering
Dr. Mahalingam College of Engineering and Technology, Pollachi – 642003

HEAD OF THE DEPARTMENT
Dr. G. Anupriya
Dept. of Computer Science and Engineering
Dr. Mahalingam College of Engineering and Technology, Pollachi – 642003
Submitted for the Autonomous End Semester Examination Innovative and Creative
Project Viva-voce held on _______________________
ABSTRACT
Integer sequence prediction is very popular in aptitude tests which measure a
person's numerical reasoning ability, and it also arises in applications such as
time series prediction, DNA sequencing etc. Given a sequence of numbers, the subject
must understand the underlying function and predict the next number in the sequence.
Since this task requires logical thinking, it serves as a test bed for Artificial General
Intelligence systems. Inductive reasoning systems have been attempted earlier, but they
require manual programming. Since artificial neural networks can approximate almost
any function, they can be used to solve this problem. Existing research has explored
artificial neural networks and simple recurrent networks with regard to learning integer
sequences. This project explores using rectified linear units as the activation function,
which has been known to produce better results in many other applications such as
computer vision and speech recognition. The Online Encyclopedia of Integer Sequences
(OEIS) consists of a vast collection of number sequences and has been used as a
benchmark to test the performance of our system.
ACKNOWLEDGEMENT
First and foremost, we wish to express our deep gratitude to our institution
and our department for providing us a chance to fulfill our long-cherished dream of
becoming Computer Science engineers.
Our hearty thanks to our guide Dr. G. Anupriya, Associate Professor, for her
constant support and guidance offered to us during the course of our project by being
one among us, and to all the noble hearts that gave us immense encouragement towards
the completion of our project.
We are deeply grateful to our project coordinator, Ms. A. Brunda, for her
guidance, patience and support. We also thank our review panel members
_____________________, _____________________ and _____________________ for
their continuous support and guidance.
TABLE OF CONTENTS
1. INTRODUCTION
   1.1 Objective
   1.2 Overview
2. LITERATURE SURVEY
   2.1 An AI Approach to Predict Number Sequences
   2.2 Comparing Computer Models Solving Number Series Problems
   2.3 Recurrent Neural Networks
   2.4 Existing System - Solving Number Series with Simple Recurrent Networks
   2.5 Initializing Recurrent Networks of Rectified Linear Units
   2.6 Summary
3. METHODOLOGY
   3.1 Preprocessing of input data
   3.2 Autoencoder
   3.3 Architecture of Simple Recurrent Network
   3.4 Gradient Descent Algorithm
   3.5 Activation functions
4. RESULTS
   4.1 Dataset
   4.2 Evaluation Metric
   4.3 Experiments & Results
   4.4 Summary of Results
5. CONCLUSION
REFERENCES
APPENDIX A: SAMPLE CODE
APPENDIX B: SCREENSHOTS
1. INTRODUCTION
Integer sequence prediction refers to the process of predicting the next number of
an integer sequence. This is very popular in aptitude tests which are used to measure a
person’s numerical reasoning ability. Artificial General Intelligence (AGI) refers to
Artificial Intelligence (AI) systems that can successfully perform any intellectual task
that a human can. Since humans have the logical ability to deduce number sequences,
integer sequence prediction serves as a good test-bed for such AGI systems. Neural
networks have shown promising results in the quest for AGI, as they can be applied to
several different problems like speech recognition and computer vision without much
change in their architecture. If they can also be applied to this task successfully, it will
be another step towards reaching AGI.
The subject will be given a sequence of numbers, and is asked to predict the next
number in the sequence. These sequences will have some underlying function, which
has to be identified by the subject in order to successfully predict the next number. This
function may be of several types, ranging from simple arithmetic calculations such as
addition, subtraction, multiplication or division to complex recursive functions that may
require inputs given several time steps ago. Earlier research has pointed towards
inductive reasoning and artificial neural networks for solving this problem. Inductive
reasoning systems such as IGOR2[1] require manual programming which might not be
feasible for large datasets. In contrast, artificial neural networks are universal
approximators, which can, in theory, be used to approximate any function. Hidden units
in these networks use non-linearities such as hyperbolic tangent or sigmoid function.
Thus, they can model even non-linear functions. From earlier research, it can be seen
that the performance of neural networks differs drastically from that of humans. Some sequences
that could not be solved by any neural networks have been solved by some human
subjects, and vice versa[2]. However, further research is required in this area, as only
simple architectures have been tested on this problem.
Ragni and Klein investigated the use of simple neural networks using
backpropagation of errors to predict integer sequences[3]. Wendemuth et al. used
simple recurrent networks for this problem, and reported better performance on a test
dataset of 20 sequences[4]. These networks use hyperbolic tangent or linear activation
functions which cause the vanishing/exploding gradient problem. Rectified linear units,
on the other hand, have become very popular recently, and have been known to improve
the performance of neural networks by eliminating the vanishing/exploding gradient
problem[5]. They are used extensively in computer vision and speech recognition.
1.1 Objective
The objective of this project is to improve the accuracy of integer sequence
prediction by using rectified linear units as the activation function in simple recurrent
networks, and to study the effect of weight initialization schemes such as identity
initialization and autoencoder-based unsupervised pre-training.
1.2 Overview
Chapter 2 presents a brief overview of the existing literature on this topic. ANN
approaches to solving number series problems, the architecture of Recurrent Neural
Networks (RNNs), and the initialization of rectified linear units in RNNs are
discussed. Chapter 3 describes the methodology used to solve the integer sequence
learning problem. This includes preprocessing the input data into a form more suitable
for training neural networks, and ways to initialize weights effectively using
unsupervised learning methods such as autoencoders. An overview of the experiments
conducted and the hyperparameters used is also given. Chapter 4 presents the results of
the experiments. The performance of the networks using various architectures and
weight initialization methods is tabulated. Chapter 5 interprets the results and gives a
formal conclusion to the project.
2. LITERATURE SURVEY
Number series problems are an interesting test bed for Artificial Intelligence
systems because the underlying function includes but is not limited to addition,
subtraction, multiplication and division. Artificial Neural Networks can be used to solve
such sequences and predict the next number in the sequence. The Online Encyclopedia
of Integer Sequences (OEIS), which consists of a vast collection of number sequences,
is used for benchmarking[6].
2.1 An AI Approach to Predict Number Sequences
An ANN with a single hidden layer and error back-propagation has been used,
with hyperbolic tangent as the activation function. The numbers in each sequence are
normalized to the range of 0 to 1 for efficient network optimization. The learning rate,
the number of units in the input and hidden layers, and the number of iterations are
systematically varied to identify the best configuration.
Around 57,000 sequences from the OEIS database, with values within ±1000,
are chosen as the dataset. Another dataset of 20 sequences is also used for testing. The
input data, which is in the form of a sequence of integers, is used to generate patterns
based on the number of input units. These are used as training data, and the last pattern
is used as test data. Out of the 20 sequences in one experiment, 17 could be solved, i.e.,
the last number of the sequence could be predicted correctly even though it was not
used during training. In the OEIS dataset, 26,951 out of 57,524 sequences could be
solved. The best single architecture was one with 4 input and 2 hidden nodes, which
could solve 12,764 sequences.
Thus, it can be inferred that the architecture of the ANN has a significant role in
its prediction accuracy. The best configurations use about 2-4 input nodes and 5-6
hidden nodes. The maximum accuracy obtained by a single architecture is 12,764 out of
57,524, which is about 22%[7].
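The systematic variation of the learning rate and layer sizes described above
amounts to a grid search over hyperparameters. A minimal sketch of such a search (the
parameter ranges and the helper train_and_count are illustrative, not the exact ones
used in [7]):

best = None
for n_input in [2, 3, 4]:              # candidate input layer sizes
    for n_hidden in [2, 4, 6]:         # candidate hidden layer sizes
        for lr in [0.1, 0.3, 0.9]:     # candidate learning rates
            # hypothetical helper: trains one ANN per sequence and
            # returns how many sequences were solved
            solved = train_and_count(n_input, n_hidden, lr)
            if best is None or solved > best[0]:
                best = (solved, n_input, n_hidden, lr)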
2.2 Comparing Computer Models Solving Number Series Problems
Inductive reasoning is the process of finding a general rule which fits the given
instance. Many IQ tests have number series prediction as a component. Given a
sequence of numbers, the subject must learn the underlying function and use it to
predict the next number. Since this is a measure of human intelligence, it can also be
used as a test bed for Artificial General Intelligence. Number series in intelligence tests
are usually restricted to the four basic mathematical operations and use small values for
easy calculations. Each number in the sequence may depend on one or more numbers
which occurred before. Thus, the underlying function may range from simple arithmetic
to complex recursive ones. Number series may be characterized according to features
such as necessary background knowledge, numerical values, structural complexity and
existence of a closed formula.
Approaches to this task include the Seek-whence model of pattern perception[8],
a computer program designed to pass I.Q. tests[9], and the inductive programming
system IGOR2[1]. All of these approaches are based on symbolic computation. They are not only
able to produce the next number, but also give the underlying function. However, the set
of functions which can be learned is restricted. In contrast, artificial neural networks can
approximate any arbitrary function. Patterns are generated from the sequences and used
as input for network optimization. The last pattern is used to predict the next number.
The architecture of the network, learning rate and number of iterations are
systematically varied to find the optimum hyperparameters. It is observed that out of 20
sequences, 6 could not be solved by IGOR2, while only 3 could not be solved by ANNs.
One series could not be solved by either approach. Inductive reasoning systems require
manual programming but can be applied to several sequences without modification.
ANNs do not require manual programming, but individual networks have to be trained
for each sequence.
2.3 Recurrent Neural Networks
Two fundamental ways can be used to add feedback into feedforward multilayer
neural networks. Elman[10] introduced feedback from the hidden layer to the context
portion of the input layer. This approach pays more attention to the sequence of input
values. Jordan recurrent neural networks [11] use feedback from the output layer to the
context nodes of the input layer and give more emphasis to the sequence of output
values. Gradient descent is a key concept in neural network optimization. Error
backpropagation in neural networks is based on this technique. While backpropagation
is relatively simple to implement, several problems can occur in its practical use,
including the difficulty of the network getting trapped in a local minimum. Researchers
have developed a variety of schemes by which gradient methods, and in particular
backpropagation learning, can be extended to recurrent neural networks[12]. The
backpropagation through time approach approximates a recurrent neural network as a
sequence of static networks, to which gradient methods can be applied. In another
approach, a master neural network is used to identify suitable dynamical slave networks
for processing the given data.
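As an illustration of the two feedback schemes, the sketch below (with our own
variable names, not code from the cited papers) shows one time step of each network;
the only difference is what is stored in the context layer:

import numpy as np

def elman_step(x, context, W_in, W_ctx, W_out):
    # Elman: the context layer stores the previous hidden state
    hidden = np.tanh(W_in @ x + W_ctx @ context)
    output = W_out @ hidden
    return output, hidden    # hidden becomes the next context

def jordan_step(x, context, W_in, W_ctx, W_out):
    # Jordan: the context layer stores the previous output
    hidden = np.tanh(W_in @ x + W_ctx @ context)
    output = W_out @ hidden
    return output, output    # output becomes the next context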
Thus, recurrent neural networks are suitable for working with data which has
long-term dependencies, as they can “remember” previous inputs for longer periods of
time. The backpropagation technique can also be applied to RNNs for network optimization.
2.4 Existing System - Solving Number Series with Simple Recurrent Networks
Since neural networks converge much more effectively when using normalized
inputs, the numerical values are normalized to the range [0,1] before processing. The
linear activation function is used in this approach. Instead of using random initial
weights, unsupervised pre-training with an autoencoder is used for weight initialization.
That is, the network is first trained to reproduce the input (the numbers of the series) at
its output. Such a pre-training procedure can help to guide the parameters of the layers
towards regions in parameter space where solutions are allowed; that is, near a solution
that captures the statistical structure of the input[4].
The network was trained for a maximum of 1000 iterations, omitting the last
element. After every 10 training cycles the network was tested on the complete series. If
it could predict the final element of the series, it was considered to have successfully
learned the rule underlying the series. For training, as for pre-training, the scaled
conjugate gradient backpropagation algorithm was used.
In this experiment, 100 SRNs were trained on each of the 20 series. Thus, the
chance of starting from some unfavorable initial weights is minimized, and a measure of
the general difficulty of the task is obtained. It is seen that a simple network with one
input and one hidden unit could solve 18 of the 20 series, and with three input units, all
20 series could be solved. Thus, it can be concluded that recurrence in neural networks
is important for cognitive modeling, because recurrence is a fundamental concept in
human cognition.
2.5 Initializing Recurrent Networks of Rectified Linear Units
Recurrent Neural Networks are used in several areas such as speech recognition,
machine translation and sequence prediction tasks like language modeling. However,
training RNNs using back-propagation can be difficult because vanishing and exploding
gradients cause great difficulty in learning long-term dependencies [12]. Hessian-Free
optimization and stochastic gradient descent with momentum are some of the
approaches proposed for overcoming this difficulty. However, the most successful of
them all is the LSTM recurrent neural network, which uses stochastic gradient descent
but changes the hidden units in such a way that the backpropagated gradients are much
better behaved.
LSTM replaces logistic or tanh hidden units with “memory cells” that can store an
analog value. Each memory cell has its own input and output gates that control when
inputs are allowed to add to the stored analog value and when this value is allowed to
influence the output. These gates are logistic units with their own learned weights on
connections coming from the input and also from the memory cells at the previous
time-step. There is also a forget gate with learned weights that controls the rate at which
the analog value stored in the memory cell decays. For periods when the input and
output gates are off and the forget gate is not causing decay, a memory cell simply
holds its value over time, so the gradient of the error with respect to its stored value
stays constant when backpropagated over those periods.
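The gating mechanism described above can be summarized in code. The
following single-step LSTM cell is a standard textbook formulation (weight names are
our own; the connections from the memory cells to the gates at the previous time-step
are omitted for brevity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # one weight set per gate: input (i), forget (f), output (o),
    # plus the candidate value (g) to be written to the cell
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate value
    c = f * c_prev + i * g   # forget gate controls decay of the stored value
    h = o * np.tanh(c)       # output gate controls influence on the output
    return h, c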
With the right initialization of weights, RNNs composed of rectified linear units
are relatively easy to train. Their performance on test data is comparable with that of
LSTMs on certain tasks, such as predicting the next word in a very large corpus of text.
The recurrent weight matrix is initialized to the identity matrix and the biases to zero.
Identity initialization has the very desirable property that when the error derivatives for
the hidden units are backpropagated through time they remain constant, provided no
extra error derivatives are added. This is the same behavior as LSTMs whose forget
gates are set so that there is no decay, and it makes it easy to learn very long-range
temporal dependencies.
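This initialization scheme is straightforward to express in code. A minimal
sketch, assuming a ReLU RNN with n_hidden hidden units (the sizes below are
examples, not the values used in [5]):

import numpy as np

n_input, n_hidden = 4, 8
W_in  = np.random.normal(0.0, 0.001, (n_hidden, n_input))  # small random input weights
W_rec = np.eye(n_hidden)      # recurrent weight matrix = identity
b     = np.zeros(n_hidden)    # biases = zero

def relu(z):
    return np.maximum(0.0, z)

def step(x, h_prev):
    # with W_rec = I, positive hidden activations are carried forward
    # unchanged, so backpropagated error derivatives stay constant
    return relu(W_in @ x + W_rec @ h_prev + b)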
2.6 Summary
Thus, it can be inferred that neural networks can be used to predict integer
sequences. Simple recurrent networks are able to solve a considerable number of
sequences even with a very basic architecture. By using rectified linear units with
identity initialization of the recurrent weight matrix, better accuracy may be achieved in
integer sequence learning.
3. METHODOLOGY
Figure 1 shows the overall block diagram of the process. Training data is
generated from the input sequences. Weight matrices are initialized randomly or using
an autoencoder. The activation function is used to introduce non-linearity, so that the
network can also learn non-linear functions. The cost function calculates the difference
between the predicted and actual outputs. The gradient descent algorithm is used for
weight optimization. Finally, the optimized weights are used to predict the last number
of the series. The implementation was done in an Ubuntu 16.04 environment using
Python with the TensorFlow library.
3.1 Preprocessing of input data
Each sequence in the dataset consists of several integers in order. Since neural
networks work better when the input range is fixed, the numbers were scaled to the
range (0, 1). This was done using the function f(i) = i / 10^len(n), where n is the largest
number in the sequence and len(n) is the number of digits in n. The output of the
network was scaled back using the inverse of the same function. Patterns are generated
from the sequences and used as training data for the neural network, as illustrated in
Table 1 below.
Table 1: Pattern generation with three input units. V1-V7 are the scaled values of a
sequence of eight numbers (positions N1-N8), T marks the training target, and ? is the
number to be predicted.

Pattern   N1   N2   N3   N4   N5   N6   N7   N8
P1        V1   V2   V3   T
P2             V2   V3   V4   T
P3                  V3   V4   V5   T
P4                       V4   V5   V6   T
P5                            V5   V6   V7   ?
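The preprocessing and pattern generation can be sketched as follows; this
mirrors the inputsplit routine in Appendix A (a window size of 3 matches Table 1,
while the experiments in Chapter 4 use 4 input units; the sketch assumes the largest
value in the sequence is greater than 1):

import math
import numpy as np

def make_patterns(seq, n_inputs):
    # scale to (0, 1) by dividing by 10^count, where count is the
    # number of digits of the largest value in the sequence
    count = math.ceil(math.log10(abs(max(seq))))
    scaled = [v / 10 ** count for v in seq]
    X = [scaled[i:i + n_inputs] for i in range(len(scaled) - n_inputs)]
    y = scaled[n_inputs:]
    return np.array(X), np.array(y), count

# e.g. for the sequence 1, 2, 4, 8, 16, 32, 64, 128 with 3 input units,
# the last pattern (V5, V6, V7) is held back to predict the final number
X, y, count = make_patterns([1, 2, 4, 8, 16, 32, 64, 128], 3)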
3.2 Autoencoder
Figure 2: Structure of the autoencoder
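As a rough illustration of the pre-training idea (a minimal NumPy sketch with
our own variable names, not the project's exact TensorFlow implementation shown in
Appendix A), an autoencoder with linear units is trained to reconstruct its input, and the
learned encoder weights are then reused to initialize the input-to-hidden weights of the
SRN:

import numpy as np

def pretrain_autoencoder(X, n_hidden, lr=0.3, iters=1000):
    n_in = X.shape[1]
    rng = np.random.default_rng(0)
    W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))   # encoder weights
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # decoder weights
    for _ in range(iters):
        H = X @ W1           # hidden code (linear units)
        Z = H @ W2           # reconstruction of the input
        E = Z - X            # reconstruction error
        # gradient descent on the mean squared reconstruction error
        grad_W2 = H.T @ E / len(X)
        grad_W1 = X.T @ (E @ W2.T) / len(X)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1   # reused to initialize the SRN's input-to-hidden weights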
3.3 Architecture of Simple Recurrent Network
A simple recurrent network is very similar to a normal neural network, but it has
an extra context layer connected to the hidden layer. This context layer stores the output
of the hidden layer from one time step (t) and feeds it to the hidden layer during the next
time step (t+1). The difference between their architectures is shown in Figure 3 below.
3.4 Gradient Descent Algorithm
In this notation, W(l) denotes the weight matrix from layer l to layer l+1, g is the
activation function, and z(l) denotes the weighted input to the units of layer l. Gradient
descent repeatedly adjusts each weight in the direction that reduces the cost,
w := w − α·∂J(w)/∂w, where α is the learning rate. Figure 4 shows a visualization of the
cost surface, where the global minimum is at the centre. J(w) denotes the cost for the
weights w. The steps taken towards reaching the minimum are highlighted in black.
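For instance, a few steps of gradient descent on the simple cost J(w) = w^2,
whose gradient is 2w (an illustrative example, not one of the project's cost functions),
show the weight moving towards the minimum at w = 0:

w, lr = 4.0, 0.3
for step in range(5):
    w -= lr * 2 * w        # w := w - lr * dJ/dw
    print(step, w)         # 1.6, 0.64, 0.256, 0.1024, 0.04096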
3.5 Activation functions
The hyperbolic tangent activation function is given by
tanh(z) = 2 / (1 + e^(−2z)) − 1        (3)
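In code, the two activation functions most relevant to this report are tanh, as in
equation (3), and the standard rectified linear unit:

import numpy as np

def tanh(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0   # equation (3)

def relu(z):
    return np.maximum(0.0, z)    # rectified linear unit, max(0, z)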
4. RESULTS
4.1 Dataset
The neural network approach is tested on two different datasets. One is the
benchmark set of 20 sequences[7] presented in Figure 5 below, and the other is a set of
5000 random sequences taken from the OEIS database[6], such that the minimum
length is 8 and all values are within ±1000.
4.2 Evaluation Metric
The networks predict floating point numbers, which were rounded off to the
nearest integer. For each network architecture, the number of sequences solved is used
as the evaluation metric. In the OEIS dataset, to get a better picture of the accuracy, the
count of sequences whose predictions differed by up to ±10 from the actual numbers
was also recorded.
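Stated in code, the metric mirrors the correct/near5/near10 counters in the
Appendix A listing (the helper below is our own illustrative wording of it):

def update_counts(predicted, actual, counts):
    # counts holds running totals: 'exact', 'near5' and 'near10'
    if round(predicted) == actual:
        counts['exact'] += 1     # the sequence is considered solved
    if abs(predicted - actual) < 5:
        counts['near5'] += 1     # within +/- 5 of the true value
    if abs(predicted - actual) < 10:
        counts['near10'] += 1    # within +/- 10 of the true value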
4.3 Experiments & Results
In our experiment, the number of input units was fixed at 4, as it yielded the best
results. The number of hidden units was varied from 1 to 5 and the results were
recorded. For one part, the weights were initialized randomly with mean 0 and standard
deviation of 0.1. In another part, an autoencoder was used for weight initialization. The
autoencoder was trained for 1000 iterations with a learning rate of 0.3. The SRN was
then trained for a maximum of 1000 iterations with a learning rate of 0.9. After every 10
iterations, the network was tested on the last pattern. If it could predict the last number
correctly, it was considered to have successfully solved the sequence.
Table 2: Percentage of the 20 benchmark sequences solved by an SRN with 4 input
units and randomly initialized weights, for 1 to 5 hidden units

Num. hidden    1     2     3     4     5
Trial 1       30%   35%   45%   40%   45%
Trial 2       35%   45%   45%   35%   50%
Trial 3       30%   35%   40%   50%   50%
Trial 4       35%   35%   50%   40%   40%
Trial 5       35%   35%   45%   35%   45%
Average       33%   37%   45%   40%   46%
Table 2 shows the percentage of the 20 benchmark sequences solved by an
SRN with 4 input units and a varying number of hidden units. It is seen that the network
with 5 hidden units solves 46% of the sequences on average.
Table 3: Percentage of the 20 benchmark sequences solved by the same SRN with
autoencoder pre-training, for 1 to 5 hidden units

Num. hidden    1     2     3     4     5
Trial 1       55%   40%   55%   35%   45%
Trial 2       55%   35%   45%   55%   50%
Trial 3       35%   55%   40%   45%   40%
Trial 4       35%   45%   55%   50%   50%
Trial 5       30%   45%   50%   50%   50%
Average       42%   44%   49%   47%   47%
Table 3 shows the percentage of sequences solved by the same SRN as above,
but with unsupervised pre-training using an autoencoder. It is seen that the network
with 3 hidden units has the best performance, with an average of 49%.
Table 4 shows the performance of the SRN without the autoencoder on the
OEIS dataset of 5000 sequences, with the number of hidden units varied from 1 to 7. It
is seen that the networks with 5-7 hidden units perform the best, solving 42% of
sequences.
Table 5 shows the number of sequences solved by the SRN with the
autoencoder, out of 5000 sequences from the OEIS dataset, with hidden units varied
from 1 to 7. The network with 1 hidden unit performs best, solving 32% of sequences.
Increasing the number of hidden units decreases the performance of the network.
4.4 Summary of Results
From the tables above, it is seen that an SRN with 3 hidden units works best for
the benchmark dataset, solving about 45% of sequences on average; with autoencoder
pre-training it solves about 50% of sequences on average. However, for the OEIS
dataset with the autoencoder, 1 hidden unit works best, solving 32% of sequences, with
additional units gradually decreasing the performance of the network. When the
tolerance is relaxed to ±10, the number solved increases to 43%. Without the
autoencoder, the network solves up to 42% of sequences when the hidden layer size
is 6; with the ±10 tolerance, this increases to 51%. On the OEIS dataset, therefore,
random weight initialization works better than using an autoencoder.
5. CONCLUSION
In this project, simple recurrent networks were applied to the integer sequence
prediction problem. On the benchmark set of 20 sequences, an SRN with 4 input units
and 3 hidden units solved about 45% of the sequences on average, and autoencoder
pre-training raised this to about 50%. On the larger OEIS dataset of 5000 sequences,
however, random weight initialization outperformed autoencoder pre-training, solving
up to 42% of the sequences exactly and 51% within a tolerance of ±10. These results
indicate that recurrent networks can learn a substantial fraction of integer sequences,
and that the benefit of unsupervised pre-training depends on the dataset.
REFERENCES
[1] U. Schmid and E. Kitzelmann, "Inductive rule learning on the knowledge level,"
Cognitive Systems Research, pp. 237-248, 2011.
[2] U. Schmid and M. Ragni, "Comparing Computer Models Solving Number Series
Problems," Lecture Notes in Computer Science, vol. 9205, pp. 352-361, 2015.
[3] M. Ragni and A. Klein, "Predicting Numbers: An AI Approach to Solving Number
Series," Lecture Notes in Artificial Intelligence, vol. 7006, pp. 255-259, 2011.
[4] S. Glüge and A. Wendemuth, "Solving Number Series with Simple Recurrent
Networks," Lecture Notes in Computer Science, vol. 7930, pp. 412-420, 2013.
[5] Q. V. Le, N. Jaitly and G. E. Hinton, "A Simple Way to Initialize Recurrent
Networks of Rectified Linear Units," arXiv preprint arXiv:1504.00941, 2015.
[6] "The Online Encyclopedia of Integer Sequences," [Online]. Available:
http://oeis.org. [Accessed June 2016].
[7] M. Ragni and A. Klein, "Solving number series - Architectural Properties of
Successful Artificial Neural Networks," Neural Computation Theory &
Applications, pp. 224-229, 2011.
[8] M. Meredith, "Seek-whence: a model of pattern perception," Technical report,
Indiana Univ., Bloomington (USA), 1986.
[9] P. Sanghi and D. Dowe, "A computer program capable of passing I.Q. tests," in 7th
Conf. of the Australasian Society for Cognitive Science, Sydney, Australia, 2003.
[10] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, pp. 179-211,
1990.
[11] M. Jordan, "Generic constraints on underspecified target trajectories," in
International Joint Conference on Neural Networks, 1989.
[12] L. R. Medsker and L. C. Jain, Eds., Recurrent Neural Networks: Design and
Applications, CRC Press, 2001, pp. 12-15.
[13] "Activation Function," Wikipedia, [Online]. Available:
https://en.wikipedia.org/wiki/Activation_function. [Accessed August 2016].
APPENDIX A: SAMPLE CODE
import math
import numpy as np

def inputsplit(Xin, input_layer_size):
    # scale the sequence to (0, 1) by dividing by 10^count,
    # where count is the number of digits of the largest value
    m = abs(max(Xin))
    count = 1
    if not (m == 0 or m == 1):
        count = math.ceil(math.log10(abs(max(Xin))))
    Xin = [a / pow(10, count) for a in Xin]
    # sliding window: each pattern holds input_layer_size consecutive
    # values, and the next value is the training target
    X = []
    for i in range(0, len(Xin) - input_layer_size):
        X.append(Xin[i:i + input_layer_size])
    X = np.matrix(X)
    y = np.matrix(Xin[input_layer_size:]).T
    return X, y, count
import tensorflow as tf
import numpy as np
from inputsplit import inputsplit

input_layer_size = 4
hidden_layer_size = 3

def init_weights(shape):
    # random initialization with mean 0 and standard deviation 0.1
    return tf.Variable(tf.random_normal(shape, stddev=0.1))

# ... (placeholder and weight definitions omitted in the original listing) ...

f = open('oeis_sample.csv')
correct, nseq, near5, near10 = 0, 0, 0, 0
while 1:
    sess = tf.InteractiveSession()
    x = f.readline()
    if x == "":
        break
    sequence = [int(y) for y in x.rstrip('\r\n').split(',')]
    nseq += 1
    if nseq > 5000:
        break
    Xin, yout, count = inputsplit(sequence, input_layer_size)

    # ... (training/test split and network construction omitted in the
    # original listing) ...

    sess.run(tf.initialize_all_variables())
    # autoencoder pre-training: train the network to reproduce its input
    Z = autoencode(X, W1, b1, Wprime, bprime)
    aecost = tf.reduce_mean(tf.square(Z - X))
    aetrain_op = tf.train.GradientDescentOptimizer(0.1).minimize(aecost)
    for i in range(1000):
        sess.run(aetrain_op, feed_dict={X: trX})

    # ... (SRN training loop omitted in the original listing) ...

    predicted_num = float(sess.run(predict_op, feed_dict={X: teX}))
    if abs(float(sequence[-1]) - predicted_num) < 5:
        near5 += 1
    if abs(float(sequence[-1]) - predicted_num) < 10:
        near10 += 1
    if nseq % 10 == 0:
        print nseq, correct, near5, near10
    tf.reset_default_graph()
    sess.close()
APPENDIX B: SCREENSHOTS