
Neural Comput & Applic (1999) 8: 33-39
© 1999 Springer-Verlag London Limited

A Note on the Equivalence of NARX and RNN

J.P.F. Sum (1), W.-K. Kan (2) and G.H. Young (3)

(1) Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong; (2) Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, Hong Kong; (3) Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong

This paper presents several aspects with regard to the application of the NARX model and the Recurrent Neural Network (RNN) model in system identification and control. We show that every RNN can be transformed to a first order NARX model, and vice versa, under the condition that the neuron transfer function is similar to the NARX transfer function. If the neuron transfer function is piecewise linear, that is f(x) := x if |x| \le 1 and f(x) := sign(x) otherwise, we further show that every NARX model of order larger than one can be transformed into a RNN. According to these equivalence results, there are three advantages from which we can benefit: (i) if the output dimension of a NARX model is larger than the number of its hidden units, training an equivalent RNN will be faster, i.e. the equivalent RNN is trained instead of the NARX model. Once the training is finished, the RNN is transformed back to an equivalent NARX model. On the other hand, (ii) if the output dimension of a RNN model is less than the number of its hidden units, the training of a RNN can be speeded up by using a similar method; (iii) the RNN pruning can be accomplished in a much simpler way, i.e. the equivalent NARX model is pruned instead of the RNN. After pruning, the NARX model is transformed back to the equivalent RNN.

Keywords: Extended Kalman filter; Model equivalence; NARX model; Output feedback control; Pruning; Recurrent neural networks; Recursive least square; Training complexity

Correspondence and offprint requests to: John P.F. Sum, Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, KLN, Hong Kong.

1. Introduction

The Nonlinear Autoregressive models with exogenous input (NARX model) and the Recurrent Neural Network (RNN) are two models commonly used in system identification, time series prediction and system control. Formally, a NARX model [1-5] is defined as follows:

y(t) = g(y(t-1), \ldots, y(t-n_y), u(t), \ldots, u(t-n_u))   (1)

where u(t) and y(t) correspond to the input and output of the network at time t; n_y and n_u are the output order and input order, respectively. A simple example is illustrated in Fig. 1 with n_y = 1 and n_u = 0.

Fig. 1. A NARX model.

This model is expressed as follows:
y(t) = \sum_{i=1}^{3} \gamma_i \tanh(\alpha_i y(t-1) + \beta_i u(t) + \theta_i)

where

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Parameters \alpha_i and \beta_i are usually called the input weights, and the parameter \theta_i is called the bias. The parameter \gamma_i is called the output weight. This model is basically a multilayer perceptron [6], except that the output is fed back to the input.

On the other hand, a RNN [7-13] is defined in a state-space form

s(t) = g(s(t-1), u(t), u(t-1), \ldots, u(t-n_u))   (2)

y(t) = c^T s(t)   (3)

where s(t) is the output of the hidden units at time t and c is the output weight vector. A simple example which consists of three hidden units is illustrated in Fig. 2. This model is expressed as follows:

s_1(t) = \tanh(\sum_{k=1}^{3} \alpha_{1k} s_k(t-1) + \beta_1 u(t) + \theta_1)   (4)

s_2(t) = \tanh(\sum_{k=1}^{3} \alpha_{2k} s_k(t-1) + \beta_2 u(t) + \theta_2)   (5)

s_3(t) = \tanh(\sum_{k=1}^{3} \alpha_{3k} s_k(t-1) + \beta_3 u(t) + \theta_3)   (6)

y(t) = \sum_{k=1}^{3} \gamma_k s_k(t)

s(t) = (s_1(t), s_2(t), s_3(t))^T   (7)

and

c = (\gamma_1, \gamma_2, \gamma_3)^T

Fig. 2. Recurrent neural network model.

Parameters \beta_i, \alpha_{ij}, \theta_i and \gamma_i are called the input weight, the recurrent weight, the bias and the output weight, respectively.

In contrast to the NARX model, the RNN does not have feedback connections from the output to the input. The feedback connections exist only amongst the neurons in the hidden layer.

Owing to this structural difference, the NARX model and the RNN have basically been studied independently. Only a few papers have presented results concerning their similarity [14,15]. Olurotimi [15] recently presented a model equivalence result for a RNN and a feedforward neural network. He showed that every RNN can be transformed into a NARX model. Thus, he derived an algorithm for RNN training with feedforward complexity.

Inspired by Olurotimi's work, in the rest of the paper we present some other aspects regarding the equivalence between NARX and RNN. Section 2 presents the main result, the model equivalence between NARX and RNN. Three issues concerning the use of these equivalence results are studied in Section 3. Finally, we conclude the paper in Section 4.

2. Model Equivalence

Assume that the system being identified is deterministic, and given by

x(t+1) = g(x(t), u(t+1))   (8)

y(t+1) = c x(t+1)   (9)

where x(t) and y(t) are the system state and output. Due to the universal approximation property of a feedforward neural network [12,16-19], the nonlinear function g can be approximated by a feedforward neural network. Hence, the above system can be rewritten as follows:

x(t+1) = \sum_{i=1}^{n} d_i \tanh(a_i x(t) + b_i u(t+1) + e_i)   (10)

y(t+1) = c x(t+1)   (11)

where \{a_i, b_i, d_i, e_i\}_{i=1}^{n} and c are the system parameters.
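To make Eqs (10) and (11) concrete, the following minimal simulation sketch (added here for illustration, not taken from the paper) runs the feedforward approximation of the system; the network size, parameter values and input sequence are arbitrary illustrative choices.

```python
import numpy as np

# Sketch of the approximating system in Eqs (10)-(11):
#   x(t+1) = sum_i d_i * tanh(a_i * x(t) + b_i * u(t+1) + e_i)
#   y(t+1) = c * x(t+1)
# All parameter values below are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n = 3                                   # number of hidden units
a, b, d, e = (rng.normal(scale=0.5, size=n) for _ in range(4))
c = 1.3                                 # output weight

def step(x, u_next):
    """One step of the system: returns (x(t+1), y(t+1))."""
    x_next = np.sum(d * np.tanh(a * x + b * u_next + e))
    return x_next, c * x_next

x, outputs = 0.0, []
for t in range(20):
    x, y = step(x, np.sin(0.3 * t))     # an arbitrary input sequence
    outputs.append(y)
print(np.round(outputs, 4))
```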
2.1. tanh Neuron

2.1.1. If x, y, u are scalars. Obviously, the system (10) and (11) is equivalent to a NARX model if we substitute x(t) in Eq. (10) by c^{-1} y(t), i.e.

y(t+1) = \sum_{i=1}^{n} \gamma_i \tanh(\alpha_i y(t) + \beta_i u(t+1) + \theta_i)   (12)

with \gamma_i = c d_i, \alpha_i = a_i c^{-1}, \beta_i = b_i and \theta_i = e_i.

We start by writing down the state equation of a RNN with a nonlinear function similar to that of the NARX model (the RNN parameters are written with tildes to distinguish them from the NARX parameters):

s_i(t+1) = \tanh(\sum_{k=1}^{n} \tilde{\alpha}_{ik} s_k(t) + \tilde{\beta}_i u(t+1) + \tilde{\theta}_i)   (13)

y(t) = \sum_{k=1}^{n} \tilde{\gamma}_k s_k(t)   (14)

Searching for the definition of s_i(t+1) that can link Eq. (10) to Eq. (13), we can come up with

s_i(t+1) = \tanh(a_i x(t) + b_i u(t+1) + e_i)

and thus \tilde{\alpha}_{ik} = a_i d_k, \tilde{\beta}_i = b_i, \tilde{\theta}_i = e_i and \tilde{\gamma}_i = c d_i. By comparing the coefficients amongst Eqs (10), (11), (12), (13) and (14), we can define the following transformations:

\tilde{\alpha}_{ik} = \alpha_i \gamma_k   (15)

\tilde{\beta}_i = \beta_i   (16)

\tilde{\theta}_i = \theta_i   (17)

\tilde{\gamma}_i = \gamma_i   (18)

Let [\tilde{\alpha}] be the matrix (\tilde{\alpha}_{ik})_{n \times n}; the vector form is given by

[\tilde{\alpha}] = \alpha \gamma^T, \quad \tilde{\beta} = \beta, \quad \tilde{\theta} = \theta, \quad \tilde{\gamma} = \gamma   (19)

This establishes a way to transform a NARX model into a recurrent neural network. The inverse transformation of a RNN into a NARX model can be accomplished via the following equations:

\alpha = (\tilde{\gamma}^T \tilde{\gamma})^{-1} [\tilde{\alpha}] \tilde{\gamma}   (20)

\beta = \tilde{\beta}   (21)

\theta = \tilde{\theta}   (22)

\gamma = \tilde{\gamma}   (23)

as long as \gamma (or \tilde{\gamma}) is a non-zero vector.
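As a numerical sanity check on Eqs (15)-(23), the sketch below (added here, not part of the original paper) draws arbitrary scalar-case NARX parameters \alpha, \beta, \theta, \gamma, builds the RNN recurrent matrix \alpha\gamma^T of Eq. (19), verifies that the two models generate the same output sequence, and recovers \alpha through the inverse map of Eq. (20).

```python
import numpy as np

# Numerical check of the scalar-case equivalence, Eqs (12)-(20).
# alpha, beta, theta, gamma are arbitrary NARX parameters; the RNN
# recurrent matrix is alpha * gamma^T as in Eq. (19).
rng = np.random.default_rng(1)
n = 4
alpha, beta, theta, gamma = (rng.normal(size=n) for _ in range(4))
A_rnn = np.outer(alpha, gamma)            # [alpha~] = alpha gamma^T

u = rng.normal(size=30)                   # arbitrary input sequence
y0 = 0.2                                  # common initial output

# NARX model, Eq. (12)
y_narx, y = [], y0
for t in range(len(u)):
    y = np.sum(gamma * np.tanh(alpha * y + beta * u[t] + theta))
    y_narx.append(y)

# Equivalent RNN, Eqs (13)-(14); choose s(0) so that gamma^T s(0) = y(0)
s, y_rnn = y0 * gamma / (gamma @ gamma), []
for t in range(len(u)):
    s = np.tanh(A_rnn @ s + beta * u[t] + theta)
    y_rnn.append(gamma @ s)

print(max(abs(p - q) for p, q in zip(y_narx, y_rnn)))   # ~1e-16: same outputs

# Inverse map, Eq. (20): recover alpha from the recurrent matrix
print(np.max(np.abs((A_rnn @ gamma) / (gamma @ gamma) - alpha)))  # ~1e-16
```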


2.1.2. If x, u, y are vectors. If u \in R^M and x, y \in R^N, the vector NARX model is given by

y(t+1) = W_1 \tanh(W_2 y(t) + W_3 u(t+1) + W_4)   (24)

where W_1 \in R^{N \times n}, W_2 \in R^{n \times N}, W_3 \in R^{n \times M} and W_4 \in R^n. Similarly, an equivalent transformation can be established for the RNN

s(t+1) = \tanh(\tilde{W}_2 s(t) + \tilde{W}_3 u(t+1) + \tilde{W}_4)   (25)

y(t+1) = \tilde{W}_1 s(t+1)   (26)

via the following equations:

\tilde{W}_1 = W_1   (27)

\tilde{W}_2 = W_2 W_1   (28)

\tilde{W}_3 = W_3   (29)

\tilde{W}_4 = W_4   (30)

For getting back the NARX model, we can perform the inverse transformation defined as follows:

W_1 = \tilde{W}_1   (31)

W_2 = \tilde{W}_2 \tilde{W}_1^T (\tilde{W}_1 \tilde{W}_1^T)^{-1}   (32)

W_3 = \tilde{W}_3   (33)

W_4 = \tilde{W}_4   (34)

provided that \tilde{W}_1 \tilde{W}_1^T is nonsingular.
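The vector-case maps can be checked in the same spirit. The following sketch (arbitrary illustrative sizes and weights, not from the paper) confirms that the NARX model of Eq. (24) and the RNN of Eqs (25)-(26) built via \tilde{W}_2 = W_2 W_1 produce identical outputs, and that Eq. (32) recovers W_2 when W_1 W_1^T is nonsingular.

```python
import numpy as np

# Check of the vector-case maps, Eqs (27)-(30) and (31)-(34).
# Sizes and weights are arbitrary illustrative choices.
rng = np.random.default_rng(2)
N, n, M = 2, 5, 3                       # outputs, hidden units, inputs
W1 = rng.normal(size=(N, n))            # NARX weights of Eq. (24)
W2 = rng.normal(size=(n, N))
W3 = rng.normal(size=(n, M))
W4 = rng.normal(size=n)

W2t = W2 @ W1                           # Eq. (28); Eqs (27), (29), (30) are identities

# Run both models from consistent initial conditions: y(0) = W1 s(0)
s = rng.normal(size=n)
y = W1 @ s
for u in rng.normal(size=(20, M)):
    s = np.tanh(W2t @ s + W3 @ u + W4)          # RNN, Eq. (25)
    y = W1 @ np.tanh(W2 @ y + W3 @ u + W4)      # NARX, Eq. (24)
print(np.max(np.abs(y - W1 @ s)))               # ~1e-15: same outputs

# Inverse map, Eq. (32), valid when W1 W1^T is nonsingular
W2_back = W2t @ W1.T @ np.linalg.inv(W1 @ W1.T)
print(np.max(np.abs(W2_back - W2)))             # ~1e-15
```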


2.2. Piecewise Linear Neuron

Once the neuron's transfer function is piecewise linear, i.e.

y(t+1) = W_1 f(W_2 y(t) + W_3 u(t+1) + W_4)   (35)

where

f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } -1 \le x \le 1 \\ -1 & \text{if } x < -1 \end{cases}   (36)

this result can be extended to a higher order NARX model. Without loss of generality, we consider a second order NARX:

y(t+1) = W_1 f(W_{20} y(t) + W_{21} y(t-1) + W_3 u(t+1) + W_4)   (37)

Now we define a state-vector

z(t) = \begin{pmatrix} s(t) \\ s(t-1) \end{pmatrix}

where s(t+1) = f(W_{20} y(t) + W_{21} y(t-1) + W_3 u(t+1) + W_4). Since f(s) = s if s is bounded by 1, we can rewrite the model as follows:

z(t+1) = f(\tilde{W}_2 z(t) + \tilde{W}_3 u(t+1) + \tilde{W}_4)   (38)

y(t+1) = \tilde{W}_1 z(t+1)   (39)

where
\tilde{W}_1 = [W_1 \quad O_{N \times n}]   (40)

\tilde{W}_2 = \begin{pmatrix} W_{20} W_1 & W_{21} W_1 \\ I_{n \times n} & O_{n \times n} \end{pmatrix}   (41)

\tilde{W}_3 = \begin{pmatrix} W_3 \\ O_{n \times M} \end{pmatrix}   (42)

\tilde{W}_4 = \begin{pmatrix} W_4 \\ O_{n \times 1} \end{pmatrix}   (43)

Note that \tilde{W}_1 \in R^{N \times 2n}, \tilde{W}_2 \in R^{2n \times 2n}, \tilde{W}_3 \in R^{2n \times M} and \tilde{W}_4 \in R^{2n \times 1}.
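The block construction of Eqs (40)-(43) can likewise be verified numerically. The sketch below (illustrative sizes only, not from the paper) builds the first order form of Eqs (38)-(39) from a second order piecewise-linear NARX and checks that the two give the same output sequence.

```python
import numpy as np

# Build the first order form, Eqs (38)-(43), from a second order
# piecewise-linear NARX, Eq. (37), and check the equivalence.
def f(x):                               # piecewise linear neuron, Eq. (36)
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(3)
N, n, M = 2, 4, 1
W1  = rng.normal(size=(N, n))
W20 = rng.normal(size=(n, N))
W21 = rng.normal(size=(n, N))
W3  = rng.normal(size=(n, M))
W4  = rng.normal(size=n)

# Block matrices of Eqs (40)-(43)
W1t = np.hstack([W1, np.zeros((N, n))])
W2t = np.block([[W20 @ W1, W21 @ W1],
                [np.eye(n), np.zeros((n, n))]])
W3t = np.vstack([W3, np.zeros((n, M))])
W4t = np.concatenate([W4, np.zeros(n)])

# Consistent initial conditions: y(0) = W1 s(0), y(-1) = W1 s(-1)
s0, s1 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
z = np.concatenate([s0, s1])            # z(0) = [s(0); s(-1)]
y_prev, y_curr = W1 @ s1, W1 @ s0       # y(-1), y(0)

for u in rng.normal(size=(15, M)):
    # second order NARX, Eq. (37)
    y_next = W1 @ f(W20 @ y_curr + W21 @ y_prev + W3 @ u + W4)
    y_prev, y_curr = y_curr, y_next
    # equivalent first order form, Eqs (38)-(39)
    z = f(W2t @ z + W3t @ u + W4t)

print(np.max(np.abs(y_curr - W1t @ z)))  # ~0: the two models agree
```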

3. Implications from the Equivalence Property

3.1. On Training

One should note from the equation \tilde{W}_2 = W_2 W_1 that the number of parameters in the RNN is fewer than in the NARX model if n < N (since \tilde{W}_2 \in R^{n \times n} while W_2 \in R^{n \times N}), where n is the number of hidden units and N is the number of output units.

In most of the applications of NARX in dynamic system modelling, the output dimension is small. Suppose we ignore the computational complexity in each training step and look purely at the number of parameters being updated; training a NARX should then be faster than training a RNN. This advantage becomes questionable if the dimension of y is larger than the number of hidden units, that is N > n.

Suppose a NARX is being trained by using the Forgetting Recursive Least Squares (FRLS) method.(1) Let \theta be the augmented vector including all the parameters \{W_1, W_2, W_3, W_4\}, x_t = (y^T(t-1), u^T(t))^T and \Phi(t) = \partial \hat{y}(t)/\partial \theta; the training can be accomplished via the following recursive equations:

S(t) = \Phi^T(t) P(t-1) \Phi(t) + (1-\lambda) I_{N \times N}   (44)

L(t) = P(t-1) \Phi(t) S^{-1}(t)   (45)

P(t) = \frac{1}{\lambda} (I_{\dim\theta \times \dim\theta} - L(t) \Phi^T(x_t)) P(t-1)   (46)

\theta(t) = \theta(t-1) + L(t) (y(t) - \hat{y}(x_t, \theta(t-1)))   (47)

with the initial conditions \theta(0) = 0 and P(0) = \delta^{-1} I_{\dim\theta \times \dim\theta}, where 0 < \lambda \le 1 and \delta is a small positive number. \hat{y}(x_t, \theta(t-1)) is the output of the NARX model at the t-th step. The computational burden is on Eqs (45) and (46), which require O(\dim^3 \theta) multiplications. Even though some decomposition on P(t) can speed up the training [11,21], the complexity is still of the same order if N \ge n.

(1) We pick FRLS for discussion simply because it is a fast training method for feedforward neural networks [2,20,21,34].
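For reference, a single FRLS step can be written as below. This is a sketch that simply transcribes Eqs (44)-(47) as given above; the Jacobian \Phi is assumed to be supplied by the caller (for instance by backpropagation through the NARX model), and the function name is ours, not the paper's.

```python
import numpy as np

# One FRLS update, transcribing Eqs (44)-(47).  Phi is the Jacobian of
# the NARX output with respect to theta (dim(theta) x N); it is assumed
# to be supplied by the caller, e.g. via backpropagation.
def frls_step(theta, P, Phi, y, y_hat, lam=0.99):
    N = y.shape[0]
    S = Phi.T @ P @ Phi + (1.0 - lam) * np.eye(N)        # Eq. (44)
    L = P @ Phi @ np.linalg.inv(S)                       # Eq. (45)
    P_new = (np.eye(P.shape[0]) - L @ Phi.T) @ P / lam   # Eq. (46)
    theta_new = theta + L @ (y - y_hat)                  # Eq. (47)
    return theta_new, P_new

# Initialisation as stated in the text: theta(0) = 0, P(0) = (1/delta) I
dim_theta, N, delta = 12, 2, 1e-3
theta, P = np.zeros(dim_theta), np.eye(dim_theta) / delta
```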
Next, if the same dynamic system is identified by a RNN, the extended Kalman filter approach [13,22,23] is one fast method which can simultaneously estimate the state vector s(t) and identify the parametric vector \tilde{\theta}(t) [24]:

\hat{s}(t|t-1) = g(\hat{s}(t-1|t-1), u(t), \tilde{\theta}(t-1))   (48)

P(t|t-1) = F(t-1) P(t-1|t-1) F^T(t-1)   (49)

\begin{pmatrix} \hat{s}(t|t) \\ \tilde{\theta}(t) \end{pmatrix} = \begin{pmatrix} \hat{s}(t|t-1) \\ \tilde{\theta}(t-1) \end{pmatrix} + L(t) e(t)   (50)

e(t) = y(t) - \hat{y}(\hat{s}(t|t-1), \tilde{\theta}(t-1))

P(t|t) = P(t|t-1) - L(t) H^T(t) P(t|t-1)   (51)

where

F(t-1) = \begin{pmatrix} F_{11}(t-1) & F_{12}(t-1) \\ 0_{\dim\tilde{\theta} \times n} & I_{\dim\tilde{\theta} \times \dim\tilde{\theta}} \end{pmatrix}   (52)

F_{11}(t-1) = \nabla_s g(\hat{s}(t-1|t-1), u(t), \tilde{\theta}(t-1))

F_{12}(t-1) = \nabla_{\tilde{\theta}} g(\hat{s}(t-1|t-1), u(t), \tilde{\theta}(t-1))

H^T(t) = [\nabla_s^T \hat{y}(t) \quad \nabla_{\tilde{\theta}}^T \hat{y}(t)]   (53)

L(t) = P(t|t-1) H(t) S^{-1}(t)   (54)

S(t) = H^T(t) P(t|t-1) H(t) + r I_{N \times N}

The initial P^{-1}(0|0) is set to be a zero matrix and \tilde{\theta}(0) is a small random vector. We have

g(\hat{s}(t|t), u(t+1), \tilde{\theta}(t)) = \tanh(\tilde{W}_2(t) \hat{s}(t|t) + \tilde{W}_3(t) u(t+1) + \tilde{W}_4(t))   (55)

The computational burden is again on P(t|t), which requires O(\dim^3 \tilde{\theta}) multiplications.

Since \theta is the augmented vector including all the parameters \{W_1, W_2, W_3, W_4\} and \tilde{\theta} is the augmented vector including all the parameters \{\tilde{W}_1, \tilde{W}_2, \tilde{W}_3, \tilde{W}_4\}, the dimension of \theta will be the total number of elements in W_1, W_2, W_3 and W_4. That is

\dim \theta = n(2N + M + 1)

Similarly, the dimension of \tilde{\theta} is given by

\dim \tilde{\theta} = n(n + M + N + 1)

By comparing their computational complexities on P(t) and P(t|t), respectively, it is observed that training the RNN may not be more time consuming than training a NARX model.
So, we suggest the following indirect method for training NARX and RNN; a simple parameter-count comparison along these lines is sketched after the two cases below.

(a) If N > n and a NARX has to be trained, we can first initialise a random NARX model and transform it to a RNN model. We then train the RNN using the extended Kalman filter method. Once the training is finished, we inversely transform the RNN into the NARX model.

(b) If N < n and a RNN has to be trained, we can first initialise a random RNN and transform it into a NARX. We then train the NARX using the forgetting recursive least square method. Once the training is finished, we inversely transform the NARX into the RNN model.
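The parameter counts \dim\theta = n(2N + M + 1) and \dim\tilde{\theta} = n(n + N + M + 1) make the choice between (a) and (b) mechanical. A minimal sketch (the helper name is ours, not the paper's):

```python
# Parameter counts of the two equivalent forms and the resulting choice.
def preferred_form(n, N, M):
    dim_narx = n * (2 * N + M + 1)      # dim(theta), the NARX form
    dim_rnn = n * (n + N + M + 1)       # dim(theta~), the RNN form
    return ("RNN" if dim_rnn < dim_narx else "NARX"), dim_narx, dim_rnn

print(preferred_form(n=10, N=3, M=2))   # few outputs: train the NARX directly
print(preferred_form(n=3, N=10, M=2))   # wide output (N > n): train the equivalent RNN
```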

3.2. On Pruning

One should also realise that this equivalence result sheds light on the design of a more effective RNN pruning algorithm. As indicated in some papers [3,10,25], pruning a RNN is basically more difficult than pruning a feedforward neural network. One reason is the evaluation of the second order derivative of the error function. Therefore, it will be interesting to see whether we can reformulate RNN pruning in a way that is similar to feedforward network pruning.

The idea is simple. After the RNN has been trained, it is transformed into an equivalent NARX model. Then we can apply optimal brain damage [26,27] or other techniques [25,28,29] to prune the NARX model. Empirically, pruning a feedforward neural network is usually easier than pruning a recurrent neural network [20,22]. Once the pruning procedure is finished, we transform the pruned model back to a RNN. Of course, this kind of indirect pruning procedure for a RNN does not ensure that the number of weights will be reduced.

Note that not all pruning techniques for feedforward networks can be applied. Two examples are the Statistical Stepwise Method (SSM) [30] and RLS-based pruning [20], as their pruning methods require information which can only be obtained during training. To use them, we have to transform the RNN into an equivalent NARX model at the very beginning: once the RNN is initialised, it is transformed into an equivalent NARX model. Once training of this equivalent NARX model is finished, we can apply methods such as the statistical stepwise method, RLS-based pruning and the nonconvergent method [31] to prune the NARX model. After pruning is finished, the pruned NARX model is transformed back into a RNN. For clarity, we summarise all these training and pruning ideas graphically in Fig. 3.

Fig. 3. Summary of the training and pruning ideas implied from the model equivalence.

3.3. On Stability Analysis of the NARX Model

Stability is one concern that researchers would like to consider once a dynamic system has been identified. For the RNN, some results on this issue have recently been derived [9,12,18]. In accordance with the equivalence of NARX and RNN, we can readily use those theorems to analyse the system stability.

Theorem 1. A NARX model defined as in Eq. (24) is stable if the magnitudes of all the eigenvalues of W_2 W_1 are smaller than one.

Proof. Using the equivalence relation, a NARX model

y(t+1) = W_1 \tanh(W_2 y(t) + W_3 u(t+1) + W_4)

can be transformed into

s(t+1) = \tanh(\tilde{W}_2 s(t) + \tilde{W}_3 u(t+1) + \tilde{W}_4)

y(t+1) = \tilde{W}_1 s(t+1)

When no input is fed into the system, the difference between s(t+1) and s(t) is given by

s(t+1) - s(t) = \tanh(W_2 W_1 s(t) + W_4) - \tanh(W_2 W_1 s(t-1) + W_4) \approx W_2 W_1 (s(t) - s(t-1))

Therefore, if all the eigenvalues of W_2 W_1 are smaller than one in magnitude, \lim_{t \to \infty} (s(t+1) - s(t)) = 0, which implies that s(t) will converge to a constant vector s_0. Hence \lim_{t \to \infty} y(t) = W_1 s_0, and the proof is complete.
One consequence of Theorem 1 is applicable to the design of a feedback controller for a dynamic system. Assume that an unknown system has already been identified by a NARX model with y_0 being an equilibrium, and that the system is unstable. Due to a disturbance, the output of the system shifts from y_0 to y_0 + \Delta y. To make the output of the system go back to y_0, one can design an output feedback controller, as shown in Fig. 4(b),

u(t+1) = W_0 y(t)

with W_0 satisfying the condition that all the eigenvalues of (W_2 + W_3 W_0) W_1 are smaller than one in magnitude. Usually, researchers have proposed to use two neural networks, one for identification and the other for control [11,21,32,33]. To make such a controller work, two-phase training is needed. In the first phase, a neural network identifier is trained to identify the dynamical behaviour of the system. Then, in the second phase, the weight values of the identifier are fixed and the controller network is trained. This is time consuming and difficult to implement as an online control method.

Fig. 4. (a) Training of a NARX model to identify an unknown dynamic system; (b) once the system is identified, an output feedback controller, u(t+1) = W_0 y(t), can be designed.
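The closed-loop condition above is again a simple eigenvalue test. A minimal sketch (illustrative sizes, a hypothetical candidate gain W_0, and a helper name of our own):

```python
import numpy as np

# Closed-loop test for the output feedback u(t+1) = W0 y(t): every
# eigenvalue of (W2 + W3 W0) W1 must have magnitude smaller than one.
def feedback_is_stabilising(W1, W2, W3, W0):
    return bool(np.all(np.abs(np.linalg.eigvals((W2 + W3 @ W0) @ W1)) < 1.0))

rng = np.random.default_rng(5)
N, n, M = 2, 4, 2
W1 = rng.normal(size=(N, n))            # identified NARX weights (illustrative)
W2 = rng.normal(size=(n, N))
W3 = rng.normal(size=(n, M))
W0 = -0.1 * rng.normal(size=(M, N))     # a hypothetical candidate gain
print(feedback_is_stabilising(W1, W2, W3, W0))
```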
4. Conclusion

In this paper, we have presented several results on the equivalence of the NARX model and the RNN. First, we have shown that if the neuron transfer function is tanh, every first order NARX model can be transformed into a RNN, and vice versa. Second, we have also shown that if the neuron transfer function is a piecewise linear function, every NARX model (irrespective of its order) can also be transformed into a RNN, and vice versa. In accordance with this equivalence relationship, we are able to:

- speed up the training of a NARX or a RNN by an indirect method,
- simplify the pruning procedure of a RNN,
- analyse the stability behaviour of a NARX,
- design an output feedback controller for the unknown dynamic system.

References

1. Chen S, Billings SA, Grant PM. Non-linear system identification using neural networks. Int J Control 1990; 51(6): 1191-1214
2. Chen S, Cowan C, Billings SA, Grant PM. Parallel recursive prediction error algorithm for training layered neural networks. Int J Control 1990; 51(6): 1215-1228
3. Lin T, Lee Giles C, Horne BG, Kung SY. A delay damage model selection algorithm for NARX neural networks. IEEE Trans Signal Processing, Special Issue on Neural Networks for Signal Processing 1997; 45(11): 2719
4. Narendra KS, Parthasarathy K. Neural networks and dynamical systems. Int J Approximate Reasoning 1992; 6: 109-131
5. Siegelmann HT, Horne BG, Lee Giles C. Computational capabilities of recurrent NARX neural networks. IEEE Trans Systems, Man and Cybernetics - Part B: Cybernetics 1997; 27(2): 208
6. Rumelhart DE et al. Learning internal representations by error propagation. In: Parallel Distributed Processing, Volume 1: Foundations, DE Rumelhart et al. (eds), MIT Press, 1986, pp 318-362
7. Jin L et al. Absolute stability conditions for discrete-time recurrent neural networks. IEEE Trans Neural Networks 1994; 5(6): 954-964
8. Jin L et al. Approximation of discrete-time state-space trajectories using dynamic recurrent neural networks. IEEE Trans Automatic Control 1995; 40(7): 1266-1270
9. Jin L, Gupta MM. Globally asymptotical stability of discrete-time analog neural networks. IEEE Trans Neural Networks 1996; 7(4): 1024-1031
10. With Pedersen M, Hansen LK. Recurrent networks: Second order properties and pruning. In: Advances in Neural Information Processing Systems 7, G Tesauro et al. (eds), MIT Press, 1995, pp 673-680
11. Puskorius GV, Feldkamp LA. Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Trans Neural Networks 1994; 5(2): 279-297
12. Sontag ED. Recurrent neural networks: Some systems-theoretic aspects. In: Dealing with Complexity: A Neural Network Approach, M Karny, K Warwick, V Kurkova (eds), Springer-Verlag, London, 1998, pp 1-11
13. Williams RJ. Training recurrent networks using the extended Kalman filter. Proc IJCNN'92, Baltimore, Vol IV, 1992, pp 241-246
14. Connor JT, Martin D, Atlas LE. Recurrent neural networks and robust time series prediction. IEEE Trans Neural Networks 1994; 5(2): 240-254
15. Olurotimi O. Recurrent neural network training with feedforward complexity. IEEE Trans Neural Networks 1994; 5(2): 185-197
16. Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 1989; 2: 303-314
17. Funahashi K. On the approximate realization of continuous mappings by neural networks. Neural Networks 1989; 2(3): 183-192
18. Sontag ED. Neural networks for control. In: Essays on Control: Perspectives in the Theory and its Applications, HL Trentelman, JC Willems (eds), Birkhauser, Boston, 1993, pp 339-380
19. Sum J, Chan LW. On the approximation property of recurrent neural network. Proc World Multiconference on Systemics, Cybernetics and Informatics, Caracas, Venezuela, July 7-11, 1997
20. Leung CS et al. On-line training and pruning for RLS algorithms. Elec Lett 1996; 7: 2152-2153
21. Puskorius GV, Feldkamp LA. Decoupled extended Kalman filter training of feedforward layered networks. Proc IJCNN'91, Vol I, 1991, pp 771-777
22. Sum JPF et al. Extended Kalman filter in recurrent neural network training and pruning. Technical report CS-TR-96-05, Department of Computer Science and Engineering, CUHK, June 1996
23. Suykens J, De Moor B, Vandewalle J. Nonlinear system identification using neural state space models, applicable to robust control design. Int J Control 1995; 82(1): 129-152
24. Sum J et al. Extended Kalman filter-based pruning method for recurrent networks. Neural Computation 1998; 10(6): 1481-1505
25. With Pedersen M et al. Pruning with generalization based weight saliencies: OBD and OBS. In: Advances in Neural Information Processing Systems 8, DS Touretzky et al. (eds), MIT Press, 1996
26. LeCun Y et al. Optimal brain damage. In: Advances in Neural Information Processing Systems 2, DS Touretzky (ed), 1990, pp 396-404
27. Reed R. Pruning algorithms - A survey. IEEE Trans Neural Networks 1993; 4(5): 740-747
28. Hassibi B, Stork DG. Second order derivatives for network pruning: Optimal brain surgeon. In: Hanson et al. (eds), Advances in Neural Information Processing Systems, 1993, pp 164-171
29. Moody J. Prediction risk and architecture selection for neural networks. In: From Statistics to Neural Networks: Theory and Pattern Recognition Application, V Cherkassky et al. (eds), Springer-Verlag, 1994
30. Cottrell M et al. Neural modeling for time series: A statistical stepwise method for weight elimination. IEEE Trans Neural Networks 1995; 6(6): 1355-1362
31. Finnoff W, Hergert F, Zimmermann HG. Improving model selection by nonconvergent methods. Neural Networks 1993; 6: 771-783
32. Ku C, Lee K. Diagonal recurrent neural networks for dynamic systems control. IEEE Trans Neural Networks 1995; 6(1): 144-156
33. Narendra KS, Parthasarathy K. Identification and control of dynamical systems using neural networks. IEEE Trans Neural Networks 1990; 1(1): 4-27
34. Shah S et al. Optimal filtering algorithms for fast learning in feedforward neural networks. Neural Networks 1992; 5: 779-787
