David Servan-Schreiber
Department of Computer Science, Carnegie-Mellon University,
Pittsburgh, PA 15213 USA
James L. McClelland
Department of Psychology, Carnegie-Mellon University,
Pittsburgh, PA 15213 USA
We explore a network architecture introduced by Elman (1988) for predicting successive elements of a sequence. The network uses the pattern of activation over a set of hidden units from time step t - 1, together
with element t, to predict element t + 1. When the network is trained
with strings from a particular finite-state grammar, it can learn to be a
perfect finite-state recognizer for the grammar. When the network has a
minimal number of hidden units, patterns on the hidden units come to
correspond to the nodes of the grammar, although this correspondence
is not necessary for the network to act as a perfect finite-state recognizer. We explore the conditions under which the network can carry
information about distant sequential contingencies across intervening
elements. Such information is maintained with relative ease if it is
relevant at each intermediate step; it tends to be lost when intervening elements do not depend on it. At first glance this may suggest that
such networks are not relevant to natural language, in which dependencies may span indefinite distances. However, embeddings in natural
language are not completely independent of earlier information. The
final simulation shows that long distance sequential contingencies can
be encoded by the network even if only subtle statistical properties of
embedded strings depend on the early information.
1 Introduction
Several connectionist architectures that are explicitly constrained to capture sequential information have been proposed. Examples are Time Delay Networks (for example, Sejnowski and Rosenberg 1986) - also called "moving window" paradigms - or algorithms such as backpropagation in time (Rumelhart et al. 1986; Williams and Zipser 1988). Such architectures use explicit representations of several consecutive events, if not of the entire history of past inputs. Recently, Elman (1988) has introduced a simple recurrent network (SRN) that has the potential to master an infinite corpus of sequences with the limited means of a learning procedure that is completely local in time (Fig. 1).

Figure 1: The simple recurrent network (Elman 1988). In the SRN, the pattern of activation on the hidden units at time step t - 1, together with the new input pattern, is allowed to influence the pattern of activation at time step t. This is achieved by copying the pattern of activation on the hidden layer at time step t - 1 to a set of input units - called the "context units" - at time step t. All the forward connections in the network are subject to training via backpropagation.
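For concreteness, the following sketch (in Python with NumPy) shows how the context units make the previous hidden pattern available as additional input at each time step. The layer sizes match the first simulation below, but the weight names, initialization ranges, and all other details are illustrative assumptions rather than specifics taken from the paper.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Sizes as in the first simulation: 6 input units (5 letters + "B"),
    # 3 hidden units, 6 output units (5 letters + "E").
    n_in, n_hid, n_out = 6, 3, 6
    rng = np.random.default_rng(0)
    W_ih = rng.uniform(-0.5, 0.5, (n_hid, n_in))   # input units   -> hidden units
    W_ch = rng.uniform(-0.5, 0.5, (n_hid, n_hid))  # context units -> hidden units
    W_ho = rng.uniform(-0.5, 0.5, (n_out, n_hid))  # hidden units  -> output units

    def srn_step(x_t, context):
        # The context is a copy of the hidden pattern from the previous time step.
        hidden = sigmoid(W_ih @ x_t + W_ch @ context)
        output = sigmoid(W_ho @ hidden)            # predicted successors of the current element
        return output, hidden                      # the hidden pattern becomes the next context

    context = np.full(n_hid, 0.5)                  # context units reset at the start of a string
    x = np.zeros(n_in); x[0] = 1.0                 # one-hot code for the begin symbol "B"
    prediction, context = srn_step(x, context)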
In this paper, we show that the SRN can learn to mimic closely a
finite-state automaton (FSA), both in its behavior and in its state representations. In particular, we show that it can learn to process an infinite
corpus of strings based on experience with a finite set of training exemplars. We then explore the capacity of this architecture to recognize and
use nonlocal contingencies between elements of a sequence.
2 Discovering a Finite-State Grammar
In our first experiment, we asked whether the network could learn the
contingencies implied by a small finite-state grammar (Fig. 2). The network was presented with strings derived from this grammar, and required to try to predict the next letter at every step. These predictions
are context dependent since each letter appears twice in the grammar
and is followed in each case by different successors.
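For illustration, a string generator of the kind used here might look as follows. This is a minimal sketch: the transition table is an assumption standing in for the grammar of Figure 2, which is not reproduced in the text, so the exact arcs should be checked against the figure.

    import random

    # Assumed transition table for the grammar of Figure 2: each node maps to its
    # two outgoing (letter, next node) arcs; node 5 is the terminal node.
    GRAMMAR = {
        0: [("T", 1), ("P", 2)],
        1: [("S", 1), ("X", 3)],
        2: [("T", 2), ("V", 4)],
        3: [("X", 2), ("S", 5)],
        4: [("P", 3), ("V", 5)],
    }
    END_NODE = 5

    def generate_string():
        # Start with "B", choose each arc with probability 0.5, append "E" at the end node.
        node, symbols = 0, ["B"]
        while node != END_NODE:
            letter, node = random.choice(GRAMMAR[node])
            symbols.append(letter)
        symbols.append("E")
        return "".join(symbols)

    print(generate_string())   # e.g. "BTXSE" or "BPTVVE"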
A single unit on the input layer represented a given letter (six input
units in total; five for the letters and one for a begin symbol "B"). Similar
local representations were used on the output layer (with the "begin"
symbol being replaced by an end symbol "E"). There were three hidden
units.
2.1 Training. On each of 60,000 training trials, a string was generated
from the grammar, starting with the "B." Successive arcs were selected
randomly from the two possible continuations with a probability of 0.5.
Each letter was then presented sequentially to the network. The activations of the context units were reset to 0.5 at the beginning of each string.
After each letter, the error between the network's prediction and the actual successor specified by the string was computed and backpropagated.
The 60,000 randomly generated strings ranged from 3 to 30 letters (mean,
7; SD, 3.3).
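The training regime just described can be sketched as follows. This is an illustrative implementation only: sum-squared error, a plain gradient step after every letter, and a learning rate of 0.1 are assumptions, since those details are not given above; generate_string refers to the grammar sketch shown earlier.

    import numpy as np

    LETTERS = ["B", "T", "S", "X", "V", "P"]      # input alphabet (letters + begin symbol)
    TARGETS = ["T", "S", "X", "V", "P", "E"]      # output alphabet (letters + end symbol)

    def one_hot(symbol, alphabet):
        v = np.zeros(len(alphabet)); v[alphabet.index(symbol)] = 1.0
        return v

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out = 6, 3, 6
    W_ih = rng.uniform(-0.5, 0.5, (n_hid, n_in))
    W_ch = rng.uniform(-0.5, 0.5, (n_hid, n_hid))
    W_ho = rng.uniform(-0.5, 0.5, (n_out, n_hid))
    lr = 0.1                                      # assumed learning rate; not specified above

    def train_on_string(string):
        global W_ih, W_ch, W_ho
        context = np.full(n_hid, 0.5)             # context units reset at the start of each string
        for t in range(len(string) - 1):
            x = one_hot(string[t], LETTERS)
            target = one_hot(string[t + 1], TARGETS)   # actual successor specified by the string
            hidden = sigmoid(W_ih @ x + W_ch @ context)
            output = sigmoid(W_ho @ hidden)
            # Backpropagate the error through the forward connections only; the copy-back
            # to the context units is not differentiated, so learning stays local in time.
            d_out = (output - target) * output * (1.0 - output)
            d_hid = (W_ho.T @ d_out) * hidden * (1.0 - hidden)
            W_ho -= lr * np.outer(d_out, hidden)
            W_ih -= lr * np.outer(d_hid, x)
            W_ch -= lr * np.outer(d_hid, context)
            context = hidden                      # copy the hidden pattern for the next step

    # e.g., for _ in range(60000): train_on_string(generate_string())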
2.2 Performance. Three tests were conducted. First, we examined
the network's predictions on a set of 70,000 random strings. During this
test, the network is first presented with the "B," and one of the five letters
or "E" is then selected at random as a successor. If that letter is predicted
[Figure 2: the finite-state grammar used to generate the training strings, with labeled Start and End nodes.]
by the network as a legal successor (that is, activation is above 0.3 for
the corresponding unit), it is then presented to the input layer on the
next time step, and another letter is drawn at random as its successor.
This procedure is repeated as long as each letter is predicted as a legal
successor until "E" is selected as the next letter. The procedure is interrupted as soon as the actual successor generated by the random procedure
is not predicted by the network, and the string of letters is then considered "rejected." A string is considered "accepted" if all its letters have
been predicted as possible continuations up to "E." Of the 70,000 random
strings, 0.3% happened to be grammatical and 99.7% were ungrammatical. The network performed flawlessly, accepting all the grammatical
strings and rejecting all the others.
In a second test, we presented the network with 20,000 strings generated at random from the grammar, that is, all these strings were grammatical. Using the same criterion as above, all of these strings were correctly
"accepted."
Finally, we constructed a set of very long grammatical strings - more
than 100 letters long - and verified that at each step the network correctly predicted all the possible successors (activations above 0.3) and
none of the other letters in the grammar.
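A sketch of this string-acceptance test is given below. It assumes a hypothetical wrapper srn_step(symbol, context) around the trained network that returns the output activations (ordered as the five letters followed by "E") together with the new context; that wrapper is not part of the original description.

    import random

    THRESHOLD = 0.3   # a successor counts as predicted if its output activation exceeds 0.3
    CANDIDATES = ["T", "S", "X", "V", "P", "E"]

    def random_string_test(srn_step, start_context):
        # srn_step(symbol, context) -> (output_activations, new_context) is assumed to wrap
        # the trained network, with output activations ordered as in CANDIDATES.
        output, context = srn_step("B", start_context)
        while True:
            successor = random.choice(CANDIDATES)              # draw the next letter at random
            if output[CANDIDATES.index(successor)] <= THRESHOLD:
                return False                                   # successor not predicted: rejected
            if successor == "E":
                return True                                    # reached "E" legally: accepted
            output, context = srn_step(successor, context)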
2.3 Analysis of Internal Representations. What kinds of internal
representations have developed over the set of hidden units that allow
the network to associate the proper predictions to intrinsically ambiguous letters? We recorded the hidden units' activation patterns generated
in response to the presentation of individual letters in different contexts.
These activation vectors were then used as input to a cluster analysis
program that groups them according to their similarity. Figure 3 shows
the results of such an analysis conducted on a small random set of grammatical strings. The patterns of activation are grouped according to the
nodes of the grammar: all the patterns that are used to predict the successors of a given node are grouped together independently of the current
letter. This observation sheds some light on the behavior of the network:
at each point in a sequence, the pattern of activation stored over the
context units provides information about the current node in the grammar. Together with information about the current letter (represented on
the input layer), this contextual information is used to produce a new
pattern of activation over the hidden layer, that uniquely specifies the
next node. In that sense, the network closely approximates the FSA that
would encode the grammar from which the training exemplars were derived. However, a closer look at the cluster analysis reveals that within a
cluster corresponding to a particular node, patterns are further divided
according to the path traversed before the node is reached. For example,
looking at the bottom cluster - node #5 - patterns produced by a "VV,"
"PS," "XS," or "SXS ending are grouped separately by the analysis: they
are more similar to each other than to other examples of paths leading
376
to node #5. This tendency to preserve information about the path is not
a characteristic of traditional finite-state automata.
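A cluster analysis of this kind can be approximated as sketched below. The recorded activation vectors and their path labels are placeholders, and the particular linkage method and distance metric are assumptions, since the clustering program used in the paper is not specified.

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage
    import matplotlib.pyplot as plt

    # Placeholders: hidden-unit activation vectors recorded after each letter of a few
    # grammatical strings, labeled by the path traversed so far.
    labels = ["BT", "BTX", "BTXS", "BP", "BPV", "BPVV"]
    patterns = np.random.default_rng(0).random((len(labels), 3))

    # Group the activation patterns by similarity (Euclidean distance, average linkage).
    Z = linkage(patterns, method="average", metric="euclidean")
    dendrogram(Z, labels=labels, orientation="right")
    plt.xlabel("distance between hidden-unit activation patterns")
    plt.tight_layout()
    plt.show()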
It must be noted that an SRN can perform the string acceptance tests
described above and still fail to produce representations that clearly delineate the nodes of the grammar as in the case shown in Figure 3.
This tendency to approximate the behavior but not the representation of
the grammar is exhibited when there are more hidden units than are
absolutely necessary to perform the task. Thus, using representations
that correspond to the nodes in the FSA shown in Figure 2 is only one
way to act in accordance with the grammar encoded by that machine:
representations correspond to the nodes in the FSA only when resources
are severely constrained. In other cases, the network tends to preserve
the separate identities of the arcs, and may preserve other information
about the path leading to a node, even when this is not task relevant.
[Figure 3: hierarchical cluster analysis of the hidden-unit activation patterns recorded over a small random set of grammatical strings.]
In this manner, the context layer patterns can allow the network to maintain
prediction-relevant features of an entire sequence.
Learning progresses through three phases. During the first phase, the
context information tends to be ignored because the patterns of activation
on the hidden layer - of which the former are a copy - are changing
continually as a result of the learning algorithm. In contrast, the network
is able to pick up the stable association between each letter and all its possible successors. At the end of this phase, the network thus predicts all
the successors of each letter in the grammar, independently of the arc
to which each letter corresponds. In the second phase, patterns copied
on the context layer are now represented by a unique code designating
which letter preceded the current letter, and the network can exploit this
stability of the context information to start distinguishing between different occurrences of the same letter - different arcs in the grammar.
Finally, in a third phase, small differences in the context information
that reflect the occurrence of previous elements can be used to differentiate position-dependent predictions resulting from length constraints (see
Servan-Schreiber et al. 1988, for more details).
It is important to note that information about the path that is not locally
relevant tends not to be encoded in the next hidden layer pattern. It may
then be lost for subsequent processing. This tendency is decreased when
the network has extra degrees of freedom (that is, more hidden units)
so as to allow small and locally useless differences to survive for several
processing steps.
4 Processing Embedded Sequences
5 Conclusion
We have presented a network architecture first explored by Elman (1988)
that is capable of mastering an infinite corpus of strings generated from
a finite-state grammar after training on a finite set of exemplars with a
learning algorithm that is local in time. When it contains only a minimal
set of hidden units, the network can develop internal representations
that correspond to the nodes of the grammar, and closely approximates
the corresponding minimal finite-state recognizer. We have also shown
that the simple recurrent network is able to encode information about
long-distance contingencies as long as information about critical past
events is relevant at each time step for generating predictions about
potential alternatives. Finally, we demonstrated that this architecture can
be used to maintain information across embedded sequences, even when
the probability structure of ensembles of embedded sequences depends
only subtly on the head of the sequence. It appears that the SRN possesses
characteristics that recommend it for further study as one useful
class of idealizations of the mechanism of natural language processing.

Notes

1. We considered that the final letter was correctly or mistakenly predicted
if the Luce ratio of the activation of the corresponding unit to the sum of
all activations on the output layer was above 0.6. Any Luce ratio below 0.6
was considered as a failure to respond (that is, a miss).

2. Other experiments showed that when the embedding is totally independent
of the head, the network's ability to maintain information about a nonlocal
context depends on the number of hidden units. For a given number of hidden
units, learning time increases exponentially with the length of the embedding.
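As a small illustration of the criterion described in note 1 (a sketch; the activation values below are placeholders):

    import numpy as np

    def luce_ratio_predicted(output_activations, unit_index, criterion=0.6):
        # The Luce ratio of a unit is its activation divided by the sum of all output
        # activations; the prediction counts as correct only if the ratio exceeds the criterion.
        ratio = output_activations[unit_index] / np.sum(output_activations)
        return ratio > criterion

    print(luce_ratio_predicted(np.array([0.05, 0.02, 0.90, 0.10, 0.03, 0.01]), 2))   # True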
Acknowledgments
Axel Cleeremans was supported in part by a fellowship from the Belgian American Educational Foundation, and in part by a grant from the
National Fund for Scientific Research (Belgium). David Servan-Schreiber
was supported by an NIMH Individual Fellow Award MH-09696-01. James L.
McClelland was supported by an NIMH Research Scientist Career Development
Award MH-00385. Support for computational resources was provided by NSF
(BNS-86-09729) and ONR (N00014-86-G-0146).
References
Elman, J.L. 1988. Finding Structure in Time. CRL Tech. Rep. 8801. Center for
Research in Language, University of California, San Diego, CA.
McClelland, J.L. 1988. The case for interactionism in language processing. In
Attention and Performance XII, M. Coltheart, ed. Erlbaum, London.
Reber, A.S. 1967. Implicit learning of artificial grammars. J. Verbal Learn.
Verbal Behav. 6, 855-863.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. 1986. Learning representations
by back-propagating errors. Nature (London) 323, 533-536.
Sejnowski, T.J., and Rosenberg, C. 1986. NETtalk: A Parallel Network That Learns
to Read Aloud. Tech. Rep. JHU-EECS-86-01, Johns Hopkins University.
Servan-Schreiber, D., Cleeremans, A., and McClelland, J.L. 1988. Learning Sequential
Structure in Simple Recurrent Networks. Tech. Rep. CMU-CS-88-183,
Carnegie-Mellon University.
Williams, R.J., and Zipser, D. 1988. A Learning Algorithm for Continually Running
Fully Recurrent Neural Networks. ICS Tech. Rep. 8805. Institute for Cognitive
Science, University of California, San Diego, CA.