
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. SMC-4, NO. 4, JULY 1974

Learning Automata - A Survey


KUMPATI S. NARENDRA, SENIOR MEMBER, IEEE, AND M. A. L. THATHACHAR

Abstract-Stochastic automata operating in an unknown random environment have been proposed earlier as models of learning. These automata update their action probabilities in accordance with the inputs received from the environment and can improve their own performance during operation. In this context they are referred to as learning automata. A survey of the available results in the area of learning automata has been attempted in this paper. Attention has been focused on the norms of behavior of learning automata, issues in the design of updating schemes, convergence of the action probabilities, and interaction of several automata. Utilization of learning automata in parameter optimization and hypothesis testing is discussed, and potential areas of application are suggested.

Manuscript received January 15, 1974; revised February 13, 1974. This work was supported by the National Science Foundation under Grant GK-20580.
K. S. Narendra is with the Becton Center, Yale University, New Haven, Conn.
M. A. L. Thathachar is with the Becton Center, Yale University, New Haven, Conn., on leave from the Indian Institute of Science, Bangalore, India.

I. INTRODUCTION

IN CLASSICAL deterministic control theory, the control of a process is always preceded by complete knowledge of the process; the mathematical description of the process is assumed to be known, and the inputs to the process are deterministic functions of time. Later developments in stochastic control theory took into account uncertainties that might be present in the process; stochastic control was effected by assuming that the probabilistic characteristics of the uncertainties are known. Frequently, the uncertainties are of a higher order, and even the probabilistic characteristics such as the distribution functions may not be completely known. It is then necessary to make observations on the process as it is in operation and gain further knowledge of the process. In other words, a distinctive feature of such problems is that there is little a priori information, and additional information is to be acquired on line. One viewpoint is to regard these as problems in learning.

Learning is defined as any relatively permanent change in behavior resulting from past experience, and a learning system is characterized by its ability to improve its behavior with time, in some sense tending towards an ultimate goal. In mathematical psychology, models of learning systems [GB1], [GL1] have been developed to explain behavior patterns among living organisms. These models in turn have lately been adapted to synthesize engineering systems, which can be considered to show learning behavior. Tsypkin [GT1] has recently argued that seemingly diverse problems in pattern recognition, identification, and related areas can be treated in a unified manner as problems in learning using probabilistic iterative methods.

Viewed in a purely mathematical context, the goal of a learning system is the optimization of a functional not known explicitly, as for example the mathematical expectation of a random functional with a probability distribution function not known in advance. An approach that has been used in the past is to reduce the problem to the determination of an optimal set of parameters and then apply stochastic hillclimbing techniques [GT1]. An alternative approach gaining attention recently is to regard the problem as one of finding an optimal action out of a set of allowable actions and achieving this using stochastic automata [LN2]. The following example of the learning process of a student with a probabilistic teacher illustrates the automaton approach.

Consider a situation in which the student is presented with a set of alternative answers, one of which is to be selected at each trial, following which the teacher responds in a binary manner indicating whether the selected answer is right or wrong. The teacher is, however, probabilistic; there is a nonzero probability of either response for each of the answers selected by the student. The saving feature of the situation is that the teacher's negative responses have the least probability for the correct answer.

Under these circumstances the interest is in finding the manner in which the student should plan the choice of a sequence of alternatives and process the information obtained from the teacher so that he learns the correct answer.

In stochastic automata models the stochastic automaton corresponds to the student, and the random environment in which it operates represents the probabilistic teacher. The actions (or states) of the stochastic automaton are the various alternative answers that are provided. The responses of the environment for a particular action of the stochastic automaton are the teacher's probabilistic responses. The problem is to obtain the optimal action that corresponds to the correct answer.

The stochastic automaton attempts a solution of this problem as follows. To start with, no information as to which one is the optimal action is assumed, and equal
probabilities are attached to all the actions. One action is selected at random, the response of the environment to this action is observed, and based on this response the action probabilities are changed. Now a new action is selected according to the updated action probabilities, and the procedure is repeated. A stochastic automaton acting in this manner to improve its performance is referred to as a learning automaton in this paper.

Stochastic hillclimbing methods (such as stochastic approximation) and stochastic automata methods represent two distinct approaches to the learning problem. Though both approaches involve iterative procedures, updating at every stage is done in the parameter space in the first method and in the probability space in the second. It is, of course, possible that they lead to equivalent descriptions in some examples. The automata methods have two distinct advantages over stochastic hillclimbing methods in that the action space need not be a metric space (i.e., no concept of neighborhood is needed), and since at every stage any element of the action set can be chosen, a global rather than a local optimum can be obtained.

Experimental simulation of automata methods carried out during the last few years has indicated the feasibility of the automaton approach in the solution of interesting examples in parameter optimization, hypothesis testing, and game theory. The automaton approach also appears appropriate in the study of hierarchical systems and in tackling certain nonstationary optimization problems. Furthermore, several other avenues to learning can be interpreted as iterative procedures in the probability space, and the learning automaton provides a natural mathematical model for such situations and serves as a unifying theme among diverse techniques [GM3].

Previous studies on learning automata have led to a certain understanding of the basic issues involved and have provided guidelines for the design of algorithms. An appreciation of the fundamental problems in the field has also taken place. It appears that research in this area has reached a stage where the power and applicability of the approach needs to be made widely known in order that it can be fully exploited in solving problems in relevant areas. In this paper we review recent results in the area of learning automata, reexamine some of the theoretical questions that arise, and suggest potential areas where the available results may find application.

Historically, the first learning automata models were developed in mathematical psychology. Early work in this area has been well documented in the book by Bush and Mosteller [GB1]. More recent results can be found in Atkinson et al. [GA1]. A rigorous mathematical framework has been developed for the study of learning problems by Iosifescu and Theodorescu [GI1] as well as by Norman [GN1].

Tsetlin [DT1] introduced the concept of using deterministic automata operating in random environments as models of learning. A great deal of work in the Soviet Union and elsewhere has followed the trend set by this source paper. No attempt, however, has been made in this paper to review all these studies.

Varshavskii and Vorontsova [LV1] observed that the use of stochastic automata with updating of action probabilities could reduce the number of states in comparison with deterministic automata. This idea has proved to be very fruitful and has been exploited in a series of investigations, the results of which form the subject of this paper.

Fu and his associates [LF1]-[LF6] were among the first to introduce stochastic automata into the control literature. A variety of applications to parameter optimization, pattern recognition, and game theory were considered by this school. McLaren [LM1] explored the properties of linear updating schemes and suggested the concept of a "growing" automaton [LM2]. Chandrasekaran and Shen [LC1]-[LC3] made useful studies of nonlinear updating schemes, nonstationary environments, and games of automata. Tsypkin and Poznyak [LT1] attempted to unify the updating schemes by focusing attention on an inverse optimization problem. The present authors and their associates [LS1], [LS2], [LV3]-[LV10], [LN1], [LN2], [LL1]-[LL5] have studied the theory and applications of learning automata and also carried out simulation studies in the area.

The survey papers on learning control systems by Sklansky [GS1] and Fu [GF1] have devoted part of their attention to learning automata. The topic also finds a place in some books and collections of articles on learning systems [GM2], [GF2], [LF6]. The literature on the two-armed bandit problem is relevant in the present context but is not referred to in detail as the approach taken is rather different [LC5], [LW2]. References to other contributions will be made at appropriate points in the body of the paper.

Organization

This paper has been divided into nine sections. Following the introduction, the basic concepts and definitions of stochastic automata and random environments are given in Section II. The possible ways in which the behavior of learning automata can be judged are defined in Section III. Section IV deals with reinforcement schemes (or updating algorithms) and their properties and includes a discussion of convergence. Section V describes collective behavior of automata in terms of games between automata and multilevel structures of automata. Nonstationary environments are briefly considered in Section VI. Possible uses of learning automata in optimization and hypothesis testing form the subject matter of Section VII. A short description of the fields of application of learning automata is given in Section VIII. A comprehensive bibliography is provided in the Reference section and is divided into three subsections dealing with 1) general references in the literature pertinent to the topic considered, 2) some important papers on deterministic automata that provided the impetus for stochastic automata models, and 3) publications wholly devoted to learning automata.
II. STOCHASTIC AUTOMATA AND RANDOM ENVIRONMENTS

Stochastic Automaton

A stochastic automaton is a sextuple {x, φ, α, p, A, G} where x is the input set, φ = {φ1, φ2, ..., φs} is the set of internal states, α = {α1, α2, ..., αr} with r ≤ s is the output or action set, p is the state probability vector governing the choice of the state at each stage (i.e., at each stage n, p(n) = (p1(n), p2(n), ..., ps(n))), A is an algorithm (also called an updating scheme or reinforcement scheme) which generates p(n + 1) from p(n), and G: φ → α is the output function. G could be a stochastic function, but there is no loss of generality in assuming it to be deterministic [GP1]. In this paper G is taken to be deterministic and one-to-one (i.e., r = s, and states and actions are regarded as synonymous) and s < ∞. Fig. 1 shows a stochastic automaton with its inputs and actions.

Fig. 1. Stochastic automaton.

It may be noted that the states of a stochastic automaton correspond to the states of a discrete-state discrete-parameter Markov process. Occasionally, it may be convenient to regard the pi(n) themselves as states of a continuous-state Markov process.

Environment

Only an environment (also called a medium) with random response characteristics is of interest in the problems considered. The environment (shown in Fig. 2) has inputs α(n) ∈ {α1, ..., αr} and outputs (responses) belonging to a set x. Frequently the responses are binary {0, 1}, with zero being called the nonpenalty response and one the penalty response. The probability of emitting a particular output symbol (say, 1) depends on the input and is denoted by ci (i = 1, ..., r). The ci are called the penalty probabilities. If the ci do not depend on n, the environment is said to be stationary; otherwise it is nonstationary. It is assumed that the ci are unknown initially; the problem would be trivial if they were known a priori.

Fig. 2. Environment.

Learning Automaton (Stochastic Automaton in a Random Environment)

Fig. 3 represents a feedback connection of a stochastic automaton and an environment. The actions of the automaton in this case form the inputs to the environment. The responses of the environment in turn are the inputs to the automaton and influence the updating of the action probabilities. As these responses are random, the action probability vector p(n) is also random.

Fig. 3. Learning automaton.

In psychological learning experiments the organism under study is said to learn when it improves the probability of correct response as a result of interaction with its environment. Since the stochastic automaton being considered in this paper behaves in a similar fashion, it appears proper to refer to it as a learning automaton. Thus a learning automaton is a stochastic automaton that operates in a random environment and updates its action probabilities in accordance with the inputs received from the environment so as to improve its performance in some specified sense.

In the context of psychology, a learning automaton may be regarded as a model of the learning behavior of the organism under study and the environment as controlled by the experimenter. In an engineering application such as the control of a process, the controller corresponds to the learning automaton, while the rest of the system with all its uncertainties constitutes the environment.

It is useful to note the distinction between several models based on the nature of the input to the learning automaton. If the input set is binary, e.g., {0, 1}, the model is known as a P-model. On the other hand it is called a Q-model if the input set is a finite collection of distinct symbols as, for example, obtained by quantization, and an S-model if the input set is an interval [0, 1]. Each of these models appears appropriate in certain situations.

A remark on the terminology is relevant here. Following Tsetlin [DT1], deterministic automata operating in random environments have been proposed as models of learning behavior. Thus they are also contenders for the term "learning automata." However, in the view of the present authors the stochastic automaton with updating of action probabilities is a general model from which the deterministic automaton can be obtained as a special case having a 0-1 state-transition matrix, and it appears reasonable to apply the term learning automaton to the more general model. In cases where it is felt necessary to emphasize the learning properties of a deterministic automaton one can use a qualifying term such as "deterministic learning automaton." It may also be noted that the learning automata of this paper have been referred to as "variable-structure stochastic automata" in earlier literature [LV1].
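The feedback connection of Fig. 3 is straightforward to simulate. The short sketch below is not from the original paper; the class and function names are our own, the penalty probabilities are arbitrary illustrative values, and the reinforcement scheme is deliberately left as a placeholder to be filled in with the schemes of Section IV.

```python
import random

class PModelEnvironment:
    """Stationary P-model environment: action i draws a penalty (1) with probability c[i]."""
    def __init__(self, penalty_probs):
        self.c = penalty_probs                      # unknown to the automaton in practice

    def respond(self, action):
        return 1 if random.random() < self.c[action] else 0   # 1 = penalty, 0 = nonpenalty

class LearningAutomaton:
    """Stochastic automaton with r actions and action-probability vector p(n)."""
    def __init__(self, r):
        self.p = [1.0 / r] * r                      # start with equal probabilities

    def choose(self):
        u, acc = random.random(), 0.0
        for i, pi in enumerate(self.p):             # sample an action according to p(n)
            acc += pi
            if u < acc:
                return i
        return len(self.p) - 1

    def update(self, action, response):
        pass                                        # reinforcement scheme goes here (Section IV)

env = PModelEnvironment([0.7, 0.4, 0.1])            # illustrative penalty probabilities
aut = LearningAutomaton(r=3)
for n in range(100):                                # feedback loop: action -> response -> update
    a = aut.choose()
    x = env.respond(a)
    aut.update(a, x)
```

With the placeholder update the automaton does not yet learn; the norms of behavior defined next in Section III are what a genuine update rule is measured against.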
III. NORMS OF BEHAVIOR OF LEARNING AUTOMATA

The basic operation carried out by a learning automaton is the updating of the action probabilities on the basis of the responses of the environment. A natural question here is to examine whether the updating is done in such a manner as to result in a performance compatible with intuitive notions of learning.
One quantity useful in judging the behavior of a learning automaton is the average penalty received by the automaton. At a certain stage n, if the action αi is selected with probability pi(n), the average penalty conditioned on p(n) is

M(n) = E{x(n) | p(n)} = Σ_{i=1}^{r} pi(n) ci.    (1)

If no a priori information is available, and the actions are chosen with equal probability (i.e., at random), the value of the average penalty is denoted by M0 and is given by

M0 = (c1 + c2 + ... + cr)/r.    (2)

A learning automaton should do better than this pure-chance value, i.e., the average penalty should be made less than M0, at least asymptotically. Such a behavior is called expediency and is defined as follows [DT1], [LC1].

Definition 1: A learning automaton is called expedient¹ if

lim_{n→∞} E[M(n)] < M0.    (3)

¹ Since pi(n), lim_{n→∞} pi(n), and consequently M(n) are, in general, random variables, the expectation operator is needed in the definition to represent the average penalty.

When a learning automaton is expedient it only does better than one which chooses actions in a purely random manner. It would be desirable if the average penalty could be minimized by a proper selection of the actions. In such a case the learning automaton is called optimal. From (1) it can be seen that the minimum value of M(n) is min_i {ci}.

Definition 2: A learning automaton is called optimal if

lim_{n→∞} E[M(n)] = cl    (4)

where

cl = min_i {ci}.

Optimality implies that asymptotically the action associated with the minimum penalty probability is chosen with probability one. While optimality appears a very desirable property, certain conditions in a given situation may preclude its achievement. In such a case one would aim at a suboptimal performance. One such property is given by ε-optimality [LV4].

Definition 3: A learning automaton is called ε-optimal if

lim_{n→∞} E[M(n)] < cl + ε    (5)

can be obtained for any arbitrary ε > 0 by a suitable choice of the parameters of the reinforcement scheme. ε-optimality implies that the performance of the automaton can be made as close to the optimal as desired.

It is possible that the preceding properties hold only when the penalty probabilities ci satisfy certain restrictions, for example, that they should lie in certain intervals. In such cases the properties are said to be conditional.
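For concreteness, the quantities in (1) and (2) are simple to evaluate once the penalty probabilities are fixed. The fragment below is our own illustration with arbitrary numbers (not taken from the paper); it computes M(n) for a given action probability vector and compares it with M0 and with min_i {ci}, the values Definitions 1-3 refer to.

```python
# Illustrative values only: three actions with penalty probabilities c_i.
c = [0.7, 0.4, 0.1]
r = len(c)

def average_penalty(p, c):
    """M(n) = sum_i p_i(n) c_i, the penalty averaged over the action choice (eq. (1))."""
    return sum(pi * ci for pi, ci in zip(p, c))

M0 = sum(c) / r                        # pure-chance value, eq. (2)
p = [0.1, 0.2, 0.7]                    # some action probability vector p(n)

print("M(n) =", average_penalty(p, c))        # 0.22 here
print("M0   =", M0)                           # 0.4: expediency asks lim E[M(n)] < M0
print("c_l  =", min(c))                       # 0.1: optimality asks lim E[M(n)] = min_i c_i
```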
In practice, the penalty probabilities are often completely unknown, and it would be necessary to have desirable performance whatever be the values of ci, that is, in all stationary random media. The performance would also be superior if the decrease of E[M(n)] is monotonic. Both these requirements are considered in the following definition [LL3].

Definition 4: A learning automaton is said to be absolutely expedient if

E[M(n + 1) | p(n)] < M(n)    (6)

for all n, all pk(n) ∈ (0, 1) (k = 1, ..., r), and all possible values² of ci (i = 1, ..., r).

² It is usually assumed that the set {ci} has unique maximum and minimum elements.

Absolute expediency implies that M(n) is a supermartingale and that E[M(n)] is strictly monotonically decreasing with n in all stationary random environments. If M(n) < M0 initially, absolute expediency implies expediency. It is thus a stronger requirement on the learning automaton. Furthermore, it can be shown that absolute expediency implies ε-optimality in all stationary random environments [LL4]. It is not at present known whether the reverse implication is true. However, every learning automaton presently known to be ε-optimal in all stationary media is also absolutely expedient. Hence ε-optimality and absolute expediency will be treated as synonymous in the sequel.

The definitions in this section have been given with reference to a P-model but can be applied with minor changes to Q- and S-models [LV3], [LV8], [LC1].

IV. REINFORCEMENT SCHEMES

Having decided on the norms of behavior of learning automata we can now focus attention on the means of achieving the desired performance. It is evident from the description of the learning automaton that the crucial factor that affects the performance is the reinforcement scheme for the updating of the action probabilities. It thus becomes necessary to relate the structure of a reinforcement scheme and the performance of the automaton using the scheme. In general a reinforcement scheme can be represented by

p(n + 1) = T[p(n), α(n), x(n)]    (7)

where T is an operator, and α(n) and x(n) represent the action of the automaton and the input to the automaton at instant n, respectively. One can classify the reinforcement schemes either on the basis of the property exhibited by a learning automaton using the scheme (as, for example, the automaton being expedient or optimal) or on the basis of the nature of the functions appearing in the scheme (as, for example, linear, nonlinear, or hybrid). If p(n + 1) is a linear function of the components of p(n), the reinforcement scheme is said to be linear; otherwise it is nonlinear. Sometimes it is advantageous to update p(n) according to different schemes depending on the intervals in which the value of p(n) lies.
In such a case the combined reinforcement scheme is called a hybrid scheme.

The basic idea behind any reinforcement scheme is rather simple. If the learning automaton selects an action αi at instant n and a nonpenalty input occurs, the action probability pi(n) is increased, and all the other components of p(n) are decreased. For a penalty input, pi(n) is decreased, and the other components are increased. These changes in pi(n) are known as reward and penalty, respectively. Occasionally the action probabilities may be retained at the previous values, in which case the status quo is known as "inaction."

In general, when the action at n is αi,

pj(n + 1) = pj(n) - fj(p(n)),    for x(n) = 0
pj(n + 1) = pj(n) + gj(p(n)),    for x(n) = 1
    (j ≠ i).    (8a)

The algorithm for pi(n + 1) is to be fixed so that the pk(n + 1) (k = 1, ..., r) add to unity. Thus

pi(n + 1) = pi(n) + Σ_{j≠i} fj(p(n)),    for x(n) = 0
pi(n + 1) = pi(n) - Σ_{j≠i} gj(p(n)),    for x(n) = 1    (8b)

where the nonnegative³ continuous functions fj(·) and gj(·) are such that pk(n + 1) ∈ (0, 1) for all k = 1, ..., r whenever every pk(n) ∈ (0, 1). The latter requirement is necessary to prevent the automaton from getting trapped prematurely in an absorbing barrier.

³ The nonnegativity condition need be imposed only if the "reward" character of fj(·) and the "penalty" character of gj(·) are to be preserved.
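A direct transcription of (8a) and (8b) into code may help fix the structure of the update. The function below is our own sketch (the names and the small numerical check are illustrative); fj and gj are passed in as functions of the whole probability vector, exactly as in (8).

```python
def reinforce(p, i, x, f, g):
    """One step of the general scheme (8): p is the action probability vector,
    i the action selected at instant n, x the environment response
    (0 = nonpenalty, 1 = penalty), f[j] and g[j] the functions f_j(.) and g_j(.)."""
    q = list(p)
    if x == 0:                                   # nonpenalty branch of (8a)/(8b)
        for j in range(len(p)):
            if j != i:
                q[j] = p[j] - f[j](p)
        q[i] = p[i] + sum(f[j](p) for j in range(len(p)) if j != i)
    else:                                        # penalty branch
        for j in range(len(p)):
            if j != i:
                q[j] = p[j] + g[j](p)
        q[i] = p[i] - sum(g[j](p) for j in range(len(p)) if j != i)
    return q

# quick check that the components still add to unity
p = [0.2, 0.3, 0.5]
f = [lambda p, j=j: 0.1 * p[j] for j in range(3)]     # f_j(p) = a p_j with a = 0.1
g = [lambda p, j=j: 0.0 for j in range(3)]            # inaction on penalty
print(sum(reinforce(p, 0, 0, f, g)))                  # -> 1.0 (up to rounding)
```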
Varshavskii and Vorontsova [LV1] were the first to suggest such reinforcement schemes for two-state automata and thus set the trend for later developments. They considered two schemes, one linear and the other nonlinear, in terms of updating of the state-transition probabilities. Fu, McLaren, and McMurtry [LF1], [LF2] simplified the procedure by considering updating of the total action probabilities as dealt with here.

Linear Schemes

The earliest known scheme can be obtained by setting

fj(p) = a pj,    gj(p) = -b pj + b/(r - 1),    for all j = 1, ..., r    (9)

where 0 < a, b < 1.⁴ This is known as a linear reward-penalty (denoted LR-P) scheme. Early studies of the scheme, principally dealing with the two-state case, were made by Bush and Mosteller [GB1] and Varshavskii and Vorontsova [LV1]. McLaren [LM1] made a detailed investigation of the multistate case, and this work was continued by Chandrasekaran and Shen [LC1] as well as by Viswanathan and Narendra [LV9]. Norman [LN4] established several results pertaining to the ergodic character of the scheme.

⁴ gj(·) for this scheme is not nonnegative for all values of pj.

It is known that an automaton using the LR-P scheme is expedient in all stationary random environments. Expressions for the rate of learning and the variance of the action probabilities are also available.

By setting

fj(p) = a pj,    gj(p) ≡ 0,    for all j    (10)

we get the linear reward-inaction (LR-I) scheme. This scheme was considered first in mathematical psychology [GB1] but was later independently conceived and introduced into the engineering literature by Shapiro and Narendra [LS1], [LS2]. The characteristic of the scheme is that it ignores penalty inputs from the environment so that the action probabilities remain unchanged under these inputs. Because of this property a learning automaton using the scheme has been called a "benevolent automaton" by Tsypkin and Poznyak [LT1].

The LR-I scheme was originally reported to be optimal in all stationary random environments, but it is now known that it is only ε-optimal [LV4], [LL4]. It is significant, however, that replacing the penalty by inaction in the LR-P scheme totally changes the performance from expediency to ε-optimality.
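The two linear schemes differ only in the choice of gj. As a minimal sketch (our own; the parameter values a = b = 0.1 are arbitrary), substituting (9) and (10) into (8) gives the closed-form updates below.

```python
def lr_p_step(p, i, x, a=0.1, b=0.1):
    """Linear reward-penalty update, i.e., the choice (9): f_j = a p_j, g_j = b/(r-1) - b p_j."""
    r = len(p)
    if x == 0:   # nonpenalty: reward the selected action i
        return [p[i] + a * (1 - p[i]) if j == i else (1 - a) * p[j] for j in range(r)]
    else:        # penalty: move probability away from i toward the other actions
        return [(1 - b) * p[i] if j == i else (1 - b) * p[j] + b / (r - 1) for j in range(r)]

def lr_i_step(p, i, x, a=0.1):
    """Linear reward-inaction, i.e., (10): same reward branch, but penalty inputs are ignored."""
    return lr_p_step(p, i, 0, a=a) if x == 0 else list(p)

p = [0.25, 0.25, 0.25, 0.25]
print(lr_p_step(p, 0, 0))   # probability of action 0 increases on a nonpenalty response
print(lr_p_step(p, 0, 1))   # and decreases on a penalty response
print(lr_i_step(p, 0, 1))   # LR-I leaves p unchanged on a penalty
```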
Other possible combinations such as the linear reward-reward, penalty-penalty, and inaction-penalty schemes have been considered in [LV9], but these are, in general, inferior to the LR-I and LR-P schemes. The effect of varying the parameters a and b with n has also been studied in [LV9].

Nonlinear Schemes

As mentioned earlier, the first nonlinear scheme for a two-state automaton was proposed by Varshavskii and Vorontsova [LV1] in terms of transition probabilities. The total-probability version of the scheme corresponds to the choice

gj(p) = fj(p) = a pj(1 - pj),    j = 1, 2.    (11)

This scheme is ε-optimal in a restricted random environment satisfying either c1 < 1/2 < c2 or c2 < 1/2 < c1. Chandrasekaran and Shen [LC1] have studied nonlinear schemes with power-law nonlinearities. Several nonlinear schemes, which are ε-optimal in all stationary random environments, have been suggested by Viswanathan and Narendra [LV9] as well as by Lakshmivarahan and Thathachar [LL1], [LL3]. A simple scheme of this type for the two-state case is

fj(p) = a pj²(1 - pj),    gj(p) = b pj(1 - pj),    j = 1, 2    (12)

where 0 < a < 1 and 0 < b < 1.

A combination of linear and nonlinear terms often appears advantageous [LL3]. Extensive simulation results on a variety of schemes utilizing several possible combinations of reward, penalty, and inaction are available in [LV10]. A result that unifies most of the preceding reinforcement schemes has been reported in [LL3] and is given by the following.
Theorem: A necessary and sufficient condition for the learning automaton using (8) to be absolutely expedient is

f1(p)/p1 = f2(p)/p2 = ... = fr(p)/pr = λ(p)
g1(p)/p1 = g2(p)/p2 = ... = gr(p)/pr = μ(p)    (13)

where λ(·) and μ(·) are arbitrary continuous functions satisfying⁵

0 < λ(p) < 1
0 < μ(p) < min_j [pj/(1 - pj)]    (14)

for all j = 1, ..., r and all pj ∈ (0, 1).

⁵ These conditions broadly arise because the pj(n) are probabilities and are required to lie in (0, 1) and sum to unity. For a more precise statement see [LL3].

In simple terms, the theorem suggests that to obtain absolute expediency one type of updating should be made for the probability of the action selected by the automaton at the instant considered and a different type of updating for all the other action probabilities. No distinction should be made among the actions not selected by the automaton, in the sense that the ratio pj(n + 1)/pj(n) should be the same for all these actions.

All the schemes known so far, which are ε-optimal in all stationary random environments, are also absolutely expedient; hence the functions appearing in these schemes satisfy the conditions of the theorem. The theorem also provides guidelines for the choice of new schemes.
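Condition (13) is easy to test numerically for a candidate scheme: the ratios fj(p)/pj must not depend on j, and likewise for gj(p)/pj. The fragment below is our own quick check (not from the paper), applied to the linear choices (9) and (10); it finds that the gj of LR-P violate the symmetry in (13), which is in line with LR-P being only expedient, while the LR-I functions satisfy it.

```python
import random

def ratios_equal(fs, p, tol=1e-12):
    """True if f_j(p)/p_j is the same for every j (the symmetry required by (13))."""
    vals = [f(p) / pj for f, pj in zip(fs, p)]
    return max(vals) - min(vals) < tol

r, a, b = 4, 0.1, 0.1
f_lin = [lambda p, j=j: a * p[j] for j in range(r)]                  # f_j = a p_j (LR-P and LR-I)
g_lrp = [lambda p, j=j: b / (r - 1) - b * p[j] for j in range(r)]    # g_j of LR-P, eq. (9)
g_lri = [lambda p, j=j: 0.0 for j in range(r)]                       # g_j = 0 for LR-I, eq. (10)

for _ in range(5):                               # test at a few random interior points p
    w = [random.random() + 0.01 for _ in range(r)]
    p = [wi / sum(w) for wi in w]
    assert ratios_equal(f_lin, p)                # holds for both linear schemes
    assert ratios_equal(g_lri, p)                # holds for LR-I
    assert not ratios_equal(g_lrp, p)            # fails for LR-P: g_j/p_j depends on p_j
print("LR-I satisfies (13); the g_j of LR-P do not.")
```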
Convergence of Action Probabilities

So far only the gross behavior of the automaton based on changes in the average penalty has been considered. It is of interest to probe deeper into the nature of the asymptotic behavior of the action probability vector p(n).

There are two distinct types of convergence associated with the reinforcement schemes. In one case the distribution functions of the sequence of action probabilities converge to a distribution function at all points of continuity of the latter function. This mode of convergence occurs typically in the case of expedient schemes such as the LR-P scheme. A different mode of convergence occurs in the case of ε-optimal schemes (such as the LR-I scheme). Here it can be proved using the martingale convergence theorem that the sequence of action probabilities converges to a limiting random variable with probability one. Thus a stronger mode of convergence is exhibited by ε-optimal schemes [LN4].

The difference between the two modes of convergence can be understood from the fact that expedient schemes (such as LR-P) generate Markov processes that are ergodic but have no absorbing barrier, whereas ε-optimal schemes result in Markov processes with more than one absorbing barrier. Initial values of action probabilities do not affect the asymptotic behavior in the case of expedient schemes, whereas they play a crucial role in the case of ε-optimal schemes.

It has been shown that when the LR-P scheme is used, p(n) converges to a random variable with a continuous distribution whose mean and variance can be computed. The variance can be made as small as desired by a proper choice of the parameters of the scheme [LM1], [LC1].

In ε-optimal schemes the action probability vector p(n) converges to the set of absorbing states with probability one. As there are at least two such states and only one of the states is the desired one (i.e., the state associated with the minimum penalty probability), one can only say that p(n) converges to the desired state with a positive probability. It is important to quantify this probability.

For simplicity, consider a two-state case. If c1 < c2, we require p1(n) → 1, and if c1 > c2, p1(n) → 0. When an ε-optimal scheme is used the only conclusion that can be drawn is that p1(n) → {0, 1} with probability one; hence, the desirable event happens with a specific probability. Furthermore, this probability depends on the initial value p1(0). In order to gain confidence in the use of the schemes, the probability of convergence to the desired state has to be determined as a function of the initial probability.

For fixing ideas, let us assume c1 < c2. It is necessary to find a function⁶ γ(p) defined by

γ(p) = Pr [p1(∞) = 1 | p1(0) = p].

⁶ p represents a scalar in the remainder of this subsection.

The available results in the theory of stochastic stability are not of much use here as they treat, for the most part, the case of one absorbing state, whereas there are two such states here. Pioneering work on the present problem has been done by Norman [LN3], who has shown that γ(p) satisfies a functional equation

Uγ(p) = γ(p)

with suitable boundary conditions at p = 0 and p = 1, where the operator U is defined by

Uγ(p) = E[γ(p1(n + 1)) | p1(n) = p].

However, this functional equation is extremely difficult to solve. Hence the next best thing that can be done is to establish upper and lower bounds on γ(p). These can be computed by finding two functions ψ1(p) and ψ2(p) such that

Uψ1(p) ≥ ψ1(p)
and
Uψ2(p) ≤ ψ2(p)

for all p ∈ [0, 1] with appropriate boundary conditions. Satisfaction of these inequalities yields

ψ1(p) ≤ γ(p) ≤ ψ2(p).

The functions ψ1(·) and ψ2(·) are called subregular and superregular, respectively, and are comparatively easy to find as it involves only the establishment of inequalities.
Use of exponential functions appears particularly appropriate here. With

ψi(p) = [exp(xi p) - 1]/[exp(xi) - 1],    i = 1, 2,

Norman [LN3] has established tight bounds for γ(p) in the case of the LR-I scheme with two states. The technique has been extended to cover nonlinear schemes and multistate cases [LL4].
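When closed-form bounds of this kind are not required, γ(p) can also be estimated directly by simulation. The sketch below is our own illustration (the penalty probabilities, the parameter a, and the stopping rule are arbitrary): it repeatedly runs a two-action LR-I automaton from p1(0) = p and records how often p1 is absorbed near 1.

```python
import random

def gamma_estimate(p0, c1=0.2, c2=0.6, a=0.05, trials=1000, horizon=2000):
    """Monte Carlo estimate of gamma(p) = Pr[p1(inf) = 1 | p1(0) = p] for two-action LR-I."""
    wins = 0
    for _ in range(trials):
        p1 = p0
        for _ in range(horizon):
            i = 0 if random.random() < p1 else 1            # choose an action according to p(n)
            c = c1 if i == 0 else c2
            if random.random() >= c:                        # nonpenalty: reward the chosen action
                p1 = p1 + a * (1 - p1) if i == 0 else (1 - a) * p1
            # penalty: LR-I leaves p unchanged
            if p1 > 0.999 or p1 < 0.001:                    # close enough to an absorbing state
                break
        wins += p1 > 0.5
    return wins / trials

print(gamma_estimate(0.5))   # with c1 < c2 this should be well above 0.5 for small a
```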
Comments on Convergence Results: Recognizing the importance of the study of convergence of the action probabilities, several researchers have attempted to simplify the procedures involved. Some intuitively attractive ideas were employed extensively in this context, but on closer scrutiny they have been found to be incorrect. The mechanism of convergence now appears to be more complex than what was thought earlier. Attention will be drawn to some of these fallacious arguments in the following.

1) In the case of a two-state automaton, suppose p1(n) satisfies

Δp1(n) = E[p1(n + 1) - p1(n) | p1(n)] > 0,    for all p1(n) ∈ (0, 1)
        = 0,    for p1(n) = 0 and 1.

It follows immediately that lim_{n→∞} p1(n) ∈ {0, 1} and that E[p1(n)] is strictly monotonically increasing. It has further been contended [LS1], [LL1] that since E[p1(n)] is bounded above by unity, E[p1(n)] → 1 as n → ∞. This in turn implies that p1(n) → 1 with probability one. However, the conclusion that E[p1(n)] → 1 is not necessarily true even though E[p1(n)] is strictly monotonically increasing. Neither is the conclusion about convergence of p1(n) to unity with probability one.

2) A stability argument to be described has been widely used [LV1], [LC1] in gaining insight into the nature of convergence. Let p1i (i = 1, 2, ...) be the roots (in the unit interval) of the martingale equation

Δp1(n) = E[p1(n + 1) - p1(n) | p1(n)] = 0.

The roots satisfying

dΔp1/dp1 |_{p1 = p1i} < 0

are called stable roots and the rest unstable. It is then argued, on the basis of regarding Δp1 as an increment in p1(n), that p1(n) converges to the set of stable roots with probability one. When Δp1(n) is sign definite the argument reduces to that in the previous comment 1).

On deeper probing, no justification of this argument has been found. It does not generally appear possible to prove convergence with probability one when there are roots of the martingale equation other than those corresponding to the absorbing states. Indeed, what conclusions can be drawn in such a situation is at present unclear. The only situation which can be handled presently (following Norman's approach outlined earlier), when absorbing states are present, is the case when all the roots of the martingale equation correspond to these absorbing states.

In view of the preceding clarifications, some of the conclusions drawn in the earlier literature have to be modified. The nonlinear schemes regarded as optimal in restricted environments by Varshavskii and Vorontsova [LV1] as well as by Chandrasekaran and Shen [LC1] are now seen to be only ε-optimal. Similarly the LR-I scheme of Shapiro and Narendra [LS1] and the nonlinear schemes given by Viswanathan and Narendra [LV9] and Lakshmivarahan and Thathachar [LL1] are only ε-optimal in all stationary random environments. The nonlinear schemes in [LC1] where there are roots of the martingale equation other than those corresponding to the absorbing states have to be studied in greater detail in order to make definitive statements about their convergence. In connection with the LR-P scheme, the asymptotic values of the state probabilities given in [LC1] actually refer to their expectations.

Some General Remarks on Reinforcement Schemes

1) All the schemes available in the literature so far are either expedient or ε-optimal. Vorontsova [LV2] stated sufficient conditions for a scheme to be optimal for a two-state automaton operating in continuous time. It is not known at present whether these conditions ensure optimality for the discrete-time automaton of our interest.

2) The question whether expedient or ε-optimal schemes are to be used in a particular situation is of considerable interest. While ε-optimal schemes show definite advantages in stationary random environments, the situation is not clear in the case of more general environments. A detailed comparison of various schemes has been made by Viswanathan and Narendra [LV6].

3) Only schemes suitable for the P-model have been discussed here. With appropriate modifications each of these schemes can be made suitable for Q- and S-models [LC1], [LV3], [LV8].

4) For the sake of simplicity, examples mostly of two-state schemes have been given. These can, however, be extended in a simple manner for use with multistate automata [LV9], [LL4].

5) Several linear and nonlinear schemes have been discussed in this section. Often improved rates of convergence can be obtained by a combination of linear and nonlinear terms [LL3], [LL4]. ε-optimal schemes are usually slow when the action probabilities are close to their final values. The convergence can be speeded up by using a hybrid scheme constructed from a judicious combination of an ε-optimal scheme and an expedient scheme [LV10].

6) In any particular scheme, the speed of convergence can be increased by changing the values of the parameters in the scheme. However, this is almost invariably accompanied by an increase in the probability of convergence to the undesired action(s) [LL4]. We meet the classical problem of speed versus accuracy. To score on both counts, it appears that a careful selection of the nonlinear terms is necessary. Development of an analytical measure of the speed of convergence is necessary for further progress in this regard.
7) While the reinforcement scheme updates the action probabilities, a simultaneous estimation of the penalty probabilities of the environment using the Bayesian technique is proposed by Lakshmivarahan and Thathachar [LL2]. This leads to a fast estimate of the optimal action and also provides a confidence level on this estimate.

V. INTERACTION OF AUTOMATA

The discussions made thus far have been centered around a single stochastic automaton interacting with a random environment. We shall now consider interactions between several automata. In particular, two types of interactions are of interest. In one case several automata are operating together in an environment either in a competitive or a cooperative manner so that we have a game situation. In the other case we consider a hierarchical system where there are various levels of automata, and there is interaction between different levels. Many interesting aspects of interaction, mostly with reference to deterministic automata, have been brought out in a series of papers by Varshavskii [DV5]-[DV8].

Games of Automata

Consider two automata operating in the same environment (Fig. 4). Each automaton selects its actions without any knowledge of the operation of the other automaton. For each pair of actions selected by the automata, the environment responds in a random manner with outcomes that form zero-sum inputs to the two automata. The action probabilities of the two automata are now independently updated according to suitable reinforcement schemes, and the procedure is repeated. The interest here is in the asymptotic behavior of the action probabilities and inputs.

Fig. 4. Game between learning automata.

This problem may be regarded as a game between the two automata. In this context several quantities can be redefined as follows. The event corresponding to the selection of a pair of actions by the automata is called a play. A game consists of a sequence of plays. Each action chosen by an automaton in a play is called a strategy. The environment is known as a referee, and the input received by each automaton corresponds to the payoff. Since the sum of the payoffs is zero, the game corresponds to a zero-sum two-person game. The asymptotic value of the expectation of the payoff to one of the players is called the value of the game.

In the classical theory of games developed by Von Neumann [GM1], [GL2], [GV1] it is assumed that the players are provided with complete information regarding the game, such as the number of players, their possible strategies, and the payoff matrix. The players can, therefore, compute their optimal strategies using this prior information. In the automata games being considered here such prior knowledge is not available, and the automata have to choose their strategies on-line. The payoff function is a random variable with an unknown distribution.

The concept of automata games was originally suggested by Krylov and Tsetlin [DK1], who considered competitive games of deterministic automata. These results were extended by the introduction of learning automata by Chandrasekaran and Shen [LC3], who used the LR-P and nonlinear schemes. More recently Viswanathan and Narendra [LV7] discussed such games using ε-optimal schemes. It was demonstrated in [LV7] that when the payoff matrix has a unique saddle point, the value of the game coincides with that obtained in the deterministic situation, even though the characteristics of the game have to be learned as the game evolves. Since the automata operate entirely on the basis of their own strategies and the corresponding response of the referee, without any knowledge of the other automata participating in the game, this result also has implications in the context of decentralized control systems.

The preceding discussion pertains to competitive game problems, because each player, in trying to maximize his gain, is also attempting to minimize the gain of the other player. When the participants of a game cooperate with each other, we have a cooperative game. Stefanyuk and Vaisboard [DS1], [DV1], [DV2] considered cooperative games using deterministic automata. Viswanathan and Narendra [LV7] have shown that the value of the game can be made as close to the Von Neumann value as desired by using ε-optimal schemes when the payoff matrix has a unique saddle point.

The results available so far on learning automata games are few and limited. It appears that there is a broad scope for further study.
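A minimal simulation of such a game is easy to set up. In the sketch below (our own illustration; the payoff probabilities form an arbitrary 2x2 matrix with a pure saddle point, and both players use the LR-I update), player A receives the nonpenalty response with probability d[i][j] when the plays are i and j, and player B receives the complementary response, so the inputs are zero-sum in the sense described above.

```python
import random

d = [[0.3, 0.6],
     [0.2, 0.4]]          # Pr[A gets the nonpenalty response | plays (i, j)]; saddle point at (0, 0)

def choose(p):
    return 0 if random.random() < p[0] else 1

def lri(p, i, reward, a=0.02):
    """Two-action LR-I step: reward the chosen action, ignore penalties."""
    if not reward:
        return p
    q = [(1 - a) * pj for pj in p]
    q[i] += a
    return q

pA, pB = [0.5, 0.5], [0.5, 0.5]
for n in range(20000):                 # a sequence of plays
    i, j = choose(pA), choose(pB)      # each automaton picks its strategy independently
    a_wins = random.random() < d[i][j] # the referee's random outcome
    pA = lri(pA, i, a_wins)            # A is rewarded when it wins ...
    pB = lri(pB, j, not a_wins)        # ... and B is rewarded otherwise (zero-sum inputs)
print(pA, pB)                          # both tend toward the pure strategies of the saddle point
```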
Multilevel Automata

A multilevel system of automata consists of several levels, each comprising many automata. Each action of an automaton at a certain level triggers automata at the next lower level, and thus the complete system has a tree structure. At each stage, decisions are made at various levels and communicated to lower levels in the hierarchy. The purpose of organizing such a multilevel system may be to achieve a performance that cannot be obtained using a single automaton or to overcome the high dimensionality of the decision space. The main problem in such multilevel systems is the coordination of activities at different levels or, in other words, to ensure convergence of the entire structure towards the desired behavior.
A two-level system has been proposed by Narendra and Viswanathan [LN1] for the optimization of performance of automata operating in a periodic random environment. It is assumed that the penalty probabilities ci(n) are periodic functions of n with an unknown period. The upper bound on the period is known.

The first level consists of one automaton in which the actions correspond to the various possible values of the period. The second level consists of a number of automata, one automaton corresponding to each of these first-level actions. Following the selection of a period T(n) by the first-level automaton, the first T(n) automata in the second level are initiated to operate in the environment for one cycle. The average output of the environment in this cycle is used as the input to the first level to make the next selection of the value of the period. It has been shown through simulations [LN1] that the use of ε-optimal schemes leads to the convergence of the period to the true value.

VI. NONSTATIONARY ENVIRONMENTS

Most of the available work relates to the behavior of learning automata in stationary environments. The problem of behavior in nonstationary environments appears difficult, and only a few specialized results are known [LL6], [LC4]. The interaction of automata in game situations and in hierarchical systems is one such special case. Some of the other known results will be described in this section.

Tsetlin [DT1] considered a composite environment, which switches between a number of stationary environments in accordance with a Markov chain, and investigated the behavior of deterministic automata [DT2]. Varshavskii and Vorontsova [LV1] applied the learning automaton to the same problem and showed that expediency can be obtained for the entire range of possible values of the parameter of the switching transition probability matrix. In a very limited situation where the Markov chain governing switching of the environment reaches its stationarity quickly in comparison with the time taken for the convergence of the reinforcement scheme, the problem essentially reduces to operation in a stationary environment; hence, ε-optimal schemes would perform well.

When the penalty probabilities of the environment vary "slowly" in time, an ε-optimal scheme tends to lock on to a certain action, thereby losing its ability to change. As studied by Chandrasekaran and Shen [LC2], if the frequency of variation is sufficiently small, then the LR-P scheme, which is expedient, seems to function satisfactorily. If prior information about the rate of variation is available, the performance of ε-optimal schemes can be improved by introducing reinitialization of the action probabilities (that is, resetting them to be equal) at regular intervals of time. Another possible approach appears to be to bound the action probabilities so that they fall short of attaining the absorbing barriers and are thus free to change according to changes in the environment. A comparison of the performances of the expedient linear scheme and ε-optimal schemes with the suggested modification is not available at the present time.

Periodic Environments

If it is known a priori that the penalty probabilities of the environment vary periodically in time with a common period T, the period T may be divided into N intervals. A system of N automata with a suitable arrangement can then be used so that one automaton is in operation at any instant of time and each automaton operates only once in every period. This is equivalent to each automaton operating in a stationary environment, so that over many cycles of operation the automata converge to the desired actions. In the case in which the environment is known to be periodic but the value of the period is unknown, a two-level automaton can be used as already described in Section V.
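The arrangement just described amounts to time-multiplexing N automata, one per interval of the period. A minimal sketch (our own; the periodic penalty profile and the parameters are arbitrary) is given below: at instant n only automaton n mod N is consulted and updated, so each one effectively sees a stationary environment.

```python
import random

N, a = 4, 0.05
# Illustrative periodic penalty probabilities with period N: c[k][i] applies at instants n = k mod N.
c = [[0.2, 0.8], [0.8, 0.2], [0.3, 0.7], [0.7, 0.3]]
p = [[0.5, 0.5] for _ in range(N)]      # one two-action LR-I automaton per interval

for n in range(40000):
    k = n % N                           # only automaton k operates at instant n
    i = 0 if random.random() < p[k][0] else 1
    if random.random() >= c[k][i]:      # nonpenalty: reward the chosen action (LR-I)
        p[k] = [(1 - a) * x for x in p[k]]
        p[k][i] += a
print([round(pk[0], 2) for pk in p])    # each automaton converges toward its own best action
```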
VII. UTILIZATION OF AUTOMATA

The learning automata discussed in earlier sections can be utilized to perform certain specific functions in a systems context. In particular, they may be used as optimizers or as decision-making devices and consequently can prove useful in a large class of practical problems.

Parameter Optimization

As remarked earlier, many problems of adaptive control, pattern recognition, filtering, and identification can, under proper assumptions, be regarded as parameter optimization problems. It appears that a learning automaton can be fruitfully applied to solve such problems, especially under noisy conditions when the a priori information is small. In fact, the possibility of using a stochastic automaton as a model for a learning controller provided the first motivation for studies in this area.

Given a system with only noisy measurements g(α, ω) of the performance function I(α) = E{g(α, ω)}, where α is an m-vector of parameters and ω is the measurement noise, the parameter optimization problem is to determine the optimal vector of parameters α_opt such that the system performance function is extremized. It is assumed that an analytical solution is not possible because of lack of sufficient information concerning the structure of the system and its performance function or because of mathematical intractability. The performance function I(α) may also be multimodal.

When this problem is tackled by gradient methods, both deterministic and stochastic, a search in the parameter space is carried out, resulting in convergence to a local optimum. In the automaton approach the adaptive controller is the automaton in which the actions correspond to different values of α. The automaton updates the probabilities of the various parameter values based on the measurements of the performance function.

As observed earlier, the gradient methods are in a sense inhibited by the fact that at each instant a new value of the parameter is to be chosen close to the previous value.
There is no similar restriction in the automaton approach, for each parameter value has a certain probability of being selected, and only these probabilities are updated. Thus the learning automaton has the desired flexibility not to get locked on to a local optimum, and this difference makes automata methods particularly attractive for use in systems having multimodal performance criteria.
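As a concrete and entirely illustrative sketch of this use, the fragment below discretizes a scalar parameter into a few candidate values, treats each value as an action, converts the noisy performance measurement into a binary environment response by comparing it with a running threshold, and applies the LR-I update. The performance function, the noise level, and the thresholding rule are our own choices for illustration, not those of the references discussed in this subsection.

```python
import math
import random

alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]            # discretized parameter values (the actions)

def measure(alpha):
    """Noisy measurement of a multimodal performance function (to be minimized)."""
    return math.sin(3 * alpha) + 0.25 * alpha ** 2 + random.gauss(0, 0.2)

p = [1.0 / len(alphas)] * len(alphas)
a, threshold = 0.05, 0.0
for n in range(5000):
    i = random.choices(range(len(alphas)), weights=p)[0]
    y = measure(alphas[i])
    penalty = y > threshold                     # crude P-model conversion of the measurement
    threshold += 0.01 * (y - threshold)         # running estimate of the average performance
    if not penalty:                             # LR-I: reward only on nonpenalty responses
        p = [(1 - a) * pj for pj in p]
        p[i] += a
print(max(zip(p, alphas)))                      # the most probable action approximates the minimizer
```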
Several studies have been made on applying automata methods to parameter optimization. When a P-model automaton is used, one problem is to relate the performance measurement, which is usually a continuous real variable, to a binary-valued response of the environment. Fu and McLaren [LF2] avoided this by defining the penalty-strength (S-) model where the environmental response can take any value in [0, 1]. McMurtry and Fu [LM3] applied a learning automaton using the LR-P scheme to find the global minimum of a multimodal surface and showed that the automaton chooses the global minimum with the highest probability. The use of ε-optimal schemes such as the LR-I scheme was advocated by Shapiro and Narendra [LS1], [LW1], who reported success on the difficult problem of handling a relatively flat performance surface along with the superposition of a high-variance noise.

When the bounds on the performance measurements are known it is simple to normalize them to [0, 1] and then use an S-model scheme. However, if these bounds are unknown, it is suggested in [LV3] that the current estimates of the bounds can be used for normalization, and it is further shown experimentally that ε-optimal convergence can still be achieved.

A problem of higher complexity was considered by Jarvis [LJ1], [LJ2], who studied, through simulation, the operation of a learning automaton using the LR-P scheme as a global optimizer in a nonstationary environment. A pattern recognizer was used for sensing the changes in the environment.

The restriction that the set of parameter values considered must be finite is sometimes undesirable. To overcome this, McLaren [LM2] has proposed a "growing automaton" where the number of actions of the automaton can grow countably to infinity. A comparison of several on-line learning algorithms, which include the growing automaton algorithm, was recently made by Saridis [GS2].

Problem of High Dimensionality: A basic problem associated with the use of automata methods in parameter optimization is that of high dimensionality. The problem is caused by the fact that the number of control actions of the automaton rapidly increases with the number of parameters and the "fineness" of the grid employed in the search. The speed of convergence of the updating schemes is to a large extent dependent upon the number of control actions, and thus when the parameter space contains a large number of points the convergence is too slow to be of any practical use. In case the parameter space is continuous, the automata method cannot be applied directly.

There are two methods of overcoming this problem of high dimensionality, as presented in the following discussion. Both methods employ several interacting automata.

1) Two-Level Controller: The controller has a two-level structure here. The parameter space is divided into a finite number (say, r) of regions v1, v2, ..., vr. There is one automaton at the first level having r actions, each corresponding to one region, and this automaton acts as a supervisor governing the choice of one of the regions. When discretization of the parameter space is permissible, the second level has r automata, each automaton having control over one region. If, on the contrary, the parameter space is to be treated as continuous, the second level consists of a local search such as stochastic approximation. In this case it is necessary that the number of regions be large enough so that each region has at most one local extremum if the two-level system is to act as a global optimizer. Narendra and Viswanathan [LV5] demonstrated through simulation that the two-level system exhibits a faster convergence rate than a single automaton.

2) Cooperative Games of Automata: The results of cooperative games of automata can be naturally applied to the parameter optimization problem as mentioned in Section V. Consider the problem where there are m parameters, each of which is discretized into r values. The parameter space now has r^m points. If r and m are large, the use of a single-automaton controller would lead to a slow rate of convergence. Instead, if the controller consists of m automata, each of which chooses the value of one parameter, then the number of probabilities to be updated at each stage would only be rm. The m automata used in this manner could be regarded as playing a cooperative game with the common object of extremizing the performance function. It follows from the results of Viswanathan and Narendra [LV5], [LV7] that if the performance function is unimodal, the use of ε-optimal schemes by the m-automaton controller leads to convergence of the parameter values to the optimum with as high a probability as desired. Further research is needed to extend this approach to multimodal search problems.

Statistical Decision-Making

As the learning automaton selects the desirable action from the set of all actions on the basis of probabilistic inputs from the environment, one can regard the automaton as a device for making decisions under uncertainty. It can thus be expected to be used as a tool for solving statistical decision problems.

Many problems in control and communication can be posed as the fundamental problem of deciding between two hypotheses H1 and H2 on the basis of noisy observations x(n). The conditional densities p(x | H1) and p(x | H2) are given, and the problem is to decide whether H1 or H2 is true so that the probabilities of making errors of the two kinds are less than prespecified values.

In order to apply the learning automaton approach to this situation it is necessary to make certain identifications. The two hypotheses H1 and H2 are made to correspond to the two actions of an automaton, and the observations x(n) are regarded as the responses from the environment. Binary responses required for a P-model are obtained by using a threshold.
As no a priori information on the true hypothesis is available, the initial action probabilities are set at 0.5 each, and the automaton is allowed to operate according to an ε-optimal scheme. The hypothesis corresponding to the action whose probability attains unity is taken as the true one. A design procedure for choosing the threshold and the parameters of the reinforcement scheme so as to satisfy any prespecified bounds on the error probabilities has been worked out by Lakshmivarahan and Thathachar [LL5]. Extension to multiple-hypothesis testing is also straightforward [LL4].
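A toy version of this identification is sketched below. It is our own illustration: Gaussian observations, an arbitrary threshold, and the LR-I update; the actual design of the threshold and of the scheme parameters to meet error-probability specifications is the procedure of [LL5] and is not reproduced here. Action 0 asserts H1 (mean 0) and action 1 asserts H2 (mean 1); an observation is converted into a penalty for the currently asserted hypothesis when it falls on the far side of the threshold.

```python
import random

def observe(true_mean):
    return random.gauss(true_mean, 1.0)          # noisy observation x(n)

def decide(true_mean, threshold=0.5, a=0.02, steps=20000):
    """Two-action automaton: action 0 asserts H1 (mean 0), action 1 asserts H2 (mean 1)."""
    p = [0.5, 0.5]                               # no prior information on the true hypothesis
    for n in range(steps):
        i = 0 if random.random() < p[0] else 1
        x = observe(true_mean)
        penalty = (x > threshold) if i == 0 else (x <= threshold)   # thresholded binary response
        if not penalty:                          # LR-I update: act only on nonpenalty responses
            p = [(1 - a) * pj for pj in p]
            p[i] += a
        if max(p) > 0.999:                       # an action probability has (nearly) attained unity
            break
    return "H1" if p[0] > p[1] else "H2"

print(decide(true_mean=0.0))                     # usually H1
print(decide(true_mean=1.0))                     # usually H2
```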
ACKNOWLEDGMENT
Testing Reinforcement Schemes

Attempts to develop new learning models in recent years have resulted in a variety of linear and nonlinear reinforcement schemes with adjustable parameters. It has consequently become increasingly difficult to compare two schemes for any given application. A novel approach suggested by Viswanathan and Narendra [LV6] is to set up a competitive game between two automata using the given schemes. A suitable game matrix with a unique saddle point is chosen, and the automata are allowed to play the game. The behavior of the two automata may then be used to determine the effects of parameters as well as the superiority of one of the schemes. Several such comparisons have been made in [LV6], and the superiority of ε-optimal schemes over expedient schemes in stationary environments has been demonstrated.
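The following sketch indicates how such a tournament might be set up. The 2 x 2 game matrix (whose entries are the probabilities that the first automaton is rewarded, with a unique saddle point at the first row and first column), the pairing of a linear reward-inaction scheme against a linear reward-penalty scheme, and all parameter values are illustrative assumptions, not the particular constructions used in [LV6].

```python
import random

# Probability that automaton A is rewarded when A plays row i and B plays column j.
# This illustrative matrix has a unique saddle point at (0, 0) with value 0.4.
D = [[0.4, 0.6],
     [0.3, 0.2]]

def update(p, i, rewarded, scheme, a=0.02, b=0.02):
    """Two-action linear schemes: 'LRI' (reward-inaction) or 'LRP' (reward-penalty)."""
    if rewarded:
        p[i] += a * (1.0 - p[i])
    elif scheme == "LRP":                 # L_R-I leaves the probabilities unchanged on penalty
        p[i] -= b * p[i]
    p[1 - i] = 1.0 - p[i]

def play(scheme_a, scheme_b, steps=50000):
    pa, pb = [0.5, 0.5], [0.5, 0.5]
    for _ in range(steps):
        i = 0 if random.random() < pa[0] else 1
        j = 0 if random.random() < pb[0] else 1
        a_rewarded = random.random() < D[i][j]        # zero-sum: B is rewarded iff A is not
        update(pa, i, a_rewarded, scheme_a)
        update(pb, j, not a_rewarded, scheme_b)
    return pa, pb

pa, pb = play("LRI", "LRP")
print("automaton A (L_R-I): P(saddle-point row)    =", round(pa[0], 3))
print("automaton B (L_R-P): P(saddle-point column) =", round(pb[0], 3))
```

Repeating such plays and recording how often each automaton ends up concentrated on its saddle-point action gives a quantitative basis for comparing the two schemes, which is the spirit of the comparisons reported in [LV6].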
VIII. APPLICATIONS

As mentioned in earlier sections, the theory of learning automata is still in its infancy, and the few significant results that are available were obtained only in recent years. Naturally, the application of these newly developed concepts to real-world problems has lagged somewhat, and only a few attempts have so far been reported.

An early application of the automaton approach was studied by Riordon [LR1], [LR2] in connection with the control of heat treatment regarded as a Markov process with unknown dynamics. The automaton acts as the adaptive controller in modeling the process as well as generating the appropriate control signals. Glorioso et al. [LG1], [LG2] have discussed adaptive routing in large communication networks using heuristic ideas of probabilistic iteration. Recently these ideas have been extended by Narendra, Tripathi, and Mason [LN5] to the telephone traffic routing problem for a real large-scale telephone network in British Columbia. Li and Fu [LL8] have applied a stochastic automaton to select features in agricultural data obtained by remote sensing. Several interesting applications to numerical analysis, random counters, and modeling appear to be contained in a recent Russian publication edited by Lorentz and Levin [LL7].
Other attempts have been made in the Soviet Union to apply automaton methods to a variety of situations, such as reliability [DG1], power regulation [DS2], switching circuits [DS3], queuing systems [DV4], and detection [DR1]. While most of these studies are concerned with deterministic automata models, there does not appear to be any reason why learning automata cannot be used in similar situations.

IX. CONCLUSION

Learning automata provide a novel and computationally attractive mode of attacking a large class of problems involving uncertainties of a high order. As such they constitute an alternative approach to the well-known parameter optimization method using stochastic approximation. It is the opinion of the authors that a judicious combination of the two approaches will find increasing application in many practical problems in the future.

ACKNOWLEDGMENT

The authors would like to thank R. Viswanathan and S. Lakshmivarahan, who worked with them on several aspects of learning automata.
REFERENCES

General

[GA1] R. C. Atkinson, G. H. Bower, and E. J. Crothers, An Introduction to Mathematical Learning Theory. New York: Wiley, 1965.
[GB1] R. R. Bush and F. Mosteller, Stochastic Models for Learning. New York: Wiley, 1958.
[GF1] K. S. Fu, "Learning control systems-Review and outlook," IEEE Trans. Automat. Contr., vol. AC-15, pp. 210-221, Apr. 1970.
[GF2] K. S. Fu, Ed., Pattern Recognition and Machine Learning. New York: Plenum, 1971.
[GI1] M. Iosifescu and R. Theodorescu, Random Processes and Learning. New York: Springer, 1969.
[GL1] R. D. Luce, Individual Choice Behavior. New York: Wiley, 1959.
[GL2] R. D. Luce and H. Raiffa, Games and Decisions. New York: Wiley, 1957.
[GM1] J. McKinsey, Introduction to the Theory of Games. New York: McGraw-Hill, 1952.
[GM2] J. M. Mendel and K. S. Fu, Eds., Adaptive, Learning and Pattern Recognition Systems. New York: Academic, 1970.
[GM3] J. M. Mendel, "Reinforcement learning models and their applications to control problems," Learning Systems-A Symposium of the 1973 Joint Automatic Control Conf. (ASME publication), pp. 3-18, 1973.
[GN1] M. F. Norman, Markov Processes and Learning Models. New York: Academic, 1972.
[GP1] A. Paz, Introduction to Probabilistic Automata. New York: Academic, 1971.
[GS1] J. Sklansky, "Learning systems for automatic control," IEEE Trans. Automat. Contr., vol. AC-11, pp. 6-19, Jan. 1966.
[GS2] G. N. Saridis, "On-line learning control algorithms," Learning Systems-A Symposium of the 1973 Joint Automatic Control Conf. (ASME publication), pp. 19-52, 1973.
[GT1] Ya. Z. Tsypkin, Adaptation and Learning in Automatic Systems. New York: Academic, 1971.
[GV1] J. Von Neumann and O. Morgenstern, Theory of Games and Economic Behavior. Princeton, N.J.: Princeton Univ., 1947.

Deterministic Automata

[DG1] S. L. Ginzburg and M. L. Tsetlin, "Some examples of simulation of the collective behavior of automata," Probl. Peredachi Informatsii, vol. 1, no. 2, pp. 54-62, 1965.
[DK1] V. Yu. Krylov and M. L. Tsetlin, "Games between automata," Avtomat. Telemekh., vol. 24, pp. 975-987, July 1963.
[DR1] L. E. Radyuk and A. F. Terpugov, "Effectiveness of applying automata with linear tactic in signal detection systems," Avtomat. Telemekh., vol. 32, pp. 99-107, Apr. 1971.
[DS1] V. L. Stefanyuk, "Example of a problem in the joint behavior of two automata," Avtomat. Telemekh., vol. 24, pp. 781-784, June 1963.
[DS2] V. L. Stefanyuk and M. L. Tsetlin, "Power regulation in a group of radio stations," Probl. Peredachi Informatsii, vol. 3, no. 4, pp. 49-57, 1967.
[DT1] M. L. Tsetlin, "On the behavior of finite automata in random media," Avtomat. Telemekh., vol. 22, pp. 1345-1354, Oct. 1961.
[DT2] H. Tsuji, M. Mizumoto, J. Toyoda, and K. Tanaka, "An automaton in the non-stationary random environment," Inform. Sci., vol. 6, no. 2, pp. 123-142, Apr. 1973.
[DV1] E. M. Vaisbord, "Game of two automata with differing memory depths," Avtomat. Telemekh., vol. 29, pp. 91-102, 1968.
[DV2] ---, "Game of many automata with various depths of memory," Avtomat. Telemekh., vol. 29, pp. 52-58, Dec. 1968.
[DV3] V. I. Varshavskii, M. V. Meleshina, and M. L. Tsetlin, "Behavior of automata in periodic random media and the problem of synchronization in the presence of noise," Probl. Peredachi Informatsii, vol. 1, no. 1, pp. 65-71, 1965.
[DV4] ---, "Priority organization in queuing systems using a model of collective behavior," Probl. Peredachi Informatsii, vol. 4, no. 1, pp. 73-76, 1968.
[DV5] V. I. Varshavskii, "Collective behavior and control problems," in Machine Intelligence 3, D. Michie, Ed. Edinburgh: Edinburgh Univ., 1968.
[DV6] ---, "The organization of interaction in collectives of automata," in Machine Intelligence 4, B. Meltzer and D. Michie, Eds. Edinburgh: Edinburgh Univ., 1969.
[DV7] ---, "Some effects in the collective behavior of automata," in Machine Intelligence 7, B. Meltzer and D. Michie, Eds. Edinburgh: Edinburgh Univ., 1972.
[DV8] ---, "Automata games and control problems," presented at the 5th World Congr. IFAC, Paris, France, June 12-17, 1972.

Learning Automata

[LC1] B. Chandrasekaran and D. W. C. Shen, "On expediency and convergence in variable-structure automata," IEEE Trans. Syst. Sci. Cybern., vol. SSC-4, pp. 52-60, Mar. 1968.
[LC2] ---, "Adaptation of stochastic automata in nonstationary environments," Proc. NEC, vol. 23, pp. 39-44, 1967.
[LC3] ---, "Stochastic automata games," IEEE Trans. Syst. Sci. Cybern., vol. SSC-5, pp. 145-149, Apr. 1969.
[LC4] L. D. Cockrell and K. S. Fu, "On search techniques in switching environment," Proc. 9th Symposium Adaptive Processes, Austin, Tex., Dec. 1970.
[LC5] T. M. Cover and M. E. Hellman, "The two-armed bandit problem with time-invariant finite memory," IEEE Trans. Inform. Theory, vol. IT-16, pp. 185-195, Mar. 1970.
[LF1] K. S. Fu and G. J. McMurtry, "A study of stochastic automata as models of adaptive and learning controllers," Purdue Univ., Lafayette, Ind., Tech. Rep. TR-EE 65-8, 1965.
[LF2] K. S. Fu and R. W. McLaren, "An application of stochastic automata to the synthesis of learning systems," Purdue Univ., Lafayette, Ind., Tech. Rep. TR-EE 65-17, 1965.
[LF3] K. S. Fu, "Stochastic automata, stochastic languages and pattern recognition," J. Cybern., vol. 1, no. 3, pp. 31-49, July-Sept. 1971.
[LF4] K. S. Fu and T. J. Li, "Formulation of learning automata and automata games," Inform. Sci., vol. 1, no. 3, pp. 237-256, July 1969.
[LF5] ---, "On stochastic automata and languages," Inform. Sci., vol. 1, no. 4, pp. 403-419, Oct. 1969.
[LF6] K. S. Fu, "Stochastic automata as models of learning systems," in Computer and Information Sciences II, J. T. Tou, Ed. New York: Academic, 1967.
[LG1] R. M. Glorioso, G. R. Grueneich, and J. C. Dunn, "Self organization and adaptive routing for communication networks," 1969 EASCON Rec., pp. 243-250.
[LG2] R. M. Glorioso and G. R. Grueneich, "A training algorithm for systems described by stochastic transition matrices," IEEE Trans. Syst., Man, Cybern., vol. SMC-1, pp. 86-87, Jan. 1971.
[LG3] B. R. Gaines, "Stochastic computing systems," in Advances in Information Sciences 2. New York: Plenum, 1969, pp. 37-172.
[LJ1] R. A. Jarvis, "Adaptive global search in a time-invariant environment using a probabilistic automaton," Proc. IREE, Australia, pp. 210-226, 1969.
[LJ2] ---, "Adaptive global search in a time-variant environment using a probabilistic automaton with pattern recognition supervision," IEEE Trans. Syst. Sci. Cybern., vol. SSC-6, pp. 209-217, July 1970.
[LL1] S. Lakshmivarahan and M. A. L. Thathachar, "Optimal nonlinear reinforcement schemes for stochastic automata," Inform. Sci., vol. 4, pp. 121-128, 1972.
[LL2] ---, "Bayesian learning and reinforcement schemes for stochastic automata," Proc. Int. Conf. Cybernetics and Society, Washington, D.C., Oct. 1972.
[LL3] ---, "Absolutely expedient learning algorithms for stochastic automata," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 281-286, May 1973.
[LL4] S. Lakshmivarahan, "Learning algorithms for stochastic automata," Ph.D. dissertation, Dep. Elec. Eng., Indian Institute of Science, Bangalore, India, Jan. 1973.
[LL5] S. Lakshmivarahan and M. A. L. Thathachar, "Hypothesis testing using variable-structure stochastic automata," in 1973 IEEE Decision and Control Conf., Preprints, San Diego, Calif.
[LL6] G. Langholtz, "Behavior of automata in a nonstationary environment," Electron. Lett., vol. 7, no. 12, pp. 348-349, June 17, 1971.
[LL7] A. A. Lorentz and V. I. Levin, Eds., Probabilistic Automata and Their Applications (in Russian). Riga, Latvia, USSR: Zinatne, 1971.
[LL8] T. J. Li and K. S. Fu, "Automata games, stochastic automata and formal languages," Purdue Univ., Lafayette, Ind., Tech. Rep. TR-EE 69-1, Jan. 1969.
[LM1] R. W. McLaren, "A stochastic automaton model for synthesis of learning systems," IEEE Trans. Syst. Sci. Cybern., vol. SSC-2, pp. 109-114, Dec. 1966.
[LM2] ---, "A stochastic automaton model for a class of learning controllers," in 1967 Joint Automatic Control Conf., Preprints.
[LM3] G. J. McMurtry and K. S. Fu, "A variable-structure automaton used as a multimodal search technique," IEEE Trans. Automat. Contr., vol. AC-11, pp. 379-387, July 1966.
[LM4] L. G. Mason, "Self optimizing allocation systems," Ph.D. dissertation, University of Saskatchewan, Saskatoon, Sask., Canada, 1972.
[LM5] ---, "An optimal learning algorithm for S-model environments," IEEE Trans. Automat. Contr., vol. AC-18, pp. 493-496, Oct. 1973.
[LN1] K. S. Narendra and R. Viswanathan, "A two-level system of stochastic automata for periodic random environments," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, pp. 285-289, Apr. 1972.
[LN2] ---, "Learning models using stochastic automata," in Proc. 1972 Int. Conf. Cybernetics and Society, Washington, D.C., Oct. 9-12.
[LN3] M. F. Norman, "On linear models with two absorbing barriers," J. Math. Psychol., vol. 5, pp. 225-241, 1968.
[LN4] ---, "Some convergence theorems for stochastic learning models with distance diminishing operators," J. Math. Psychol., vol. 5, pp. 61-101, 1968.
[LN5] K. S. Narendra, S. S. Tripathi, and L. G. Mason, "Application of learning automata to telephone traffic routing problems," Becton Center, Yale Univ., New Haven, Conn., Tech. Rep. CT-60, Jan. 1974.
[LR1] J. S. Riordon, "Optimal feedback characteristics from stochastic automaton models," IEEE Trans. Automat. Contr., vol. AC-14, pp. 89-92, Feb. 1969.
[LR2] ---, "An adaptive automaton controller for discrete-time Markov processes," Automatica (IFAC Journal), vol. 5, no. 6, pp. 721-730, Nov. 1969.
[LS1] I. J. Shapiro and K. S. Narendra, "Use of stochastic automata for parameter self-optimization with multimodal performance criteria," IEEE Trans. Syst. Sci. Cybern., vol. SSC-5, pp. 352-360, Oct. 1969.
[LS2] I. J. Shapiro, "The use of stochastic automata in adaptive control," Ph.D. dissertation, Dep. Eng. and Applied Sci., Yale Univ., New Haven, Conn., 1969.
[LS3] Y. Sawaragi and N. Baba, "A consideration on the learning behavior of variable structure stochastic automata," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 644-647, Nov. 1973.
[LT1] Ya. Z. Tsypkin and A. S. Poznyak, "Finite learning automata," Eng. Cybern., vol. 10, pp. 478-490, May-June 1972.
[LV1] V. I. Varshavskii and I. P. Vorontsova, "On the behavior of stochastic automata with variable structure," Avtomat. Telemekh., vol. 24, pp. 353-360, Mar. 1963.
[LV2] I. P. Vorontsova, "Algorithms for changing stochastic automata transition probabilities," Probl. Peredachi Informatsii, vol. 1, no. 3, pp. 122-126, 1965.
[LV3] R. Viswanathan and K. S. Narendra, "Stochastic automata models with applications to learning systems," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, Jan. 1973.
[LV4] ---, "A note on the linear reinforcement scheme for variable-structure stochastic automata," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, pp. 292-294, Apr. 1972.
[LV5] R. Viswanathan, "Learning automata: models and applications," Ph.D. dissertation, Dep. Eng. and Applied Sci., Yale Univ., New Haven, Conn., 1972.
[LV6] R. Viswanathan and K. S. Narendra, "Comparison of expedient and optimal reinforcement schemes for learning systems," J. Cybern., vol. 2, no. 1, pp. 21-37, 1972.
[LV7] ---, "Competitive and cooperative games of variable-structure stochastic automata," in Joint Automatic Control Conf., Preprints, Aug. 1972; also available as Becton Center, Yale Univ., New Haven, Conn., Tech. Rep. CT-44, Nov. 1971.
[LV8] ---, "Application of stochastic automata models to learning systems with multimodal performance criteria," Becton Center, Yale Univ., New Haven, Conn., Tech. Rep. CT-40, June 1971.
[LV9] ---, "Expedient and optimal variable-structure stochastic automata," Dunham Lab., Yale Univ., New Haven, Conn., Tech. Rep. CT-31, Apr. 1970.
[LV10] ---, "Simulation studies of stochastic automata models, Part I-Reinforcement schemes," Becton Center, Yale Univ., New Haven, Conn., Tech. Rep. CT-45, Dec. 1971.
[LW1] I. H. Witten, "Comments on 'Use of stochastic automata for parameter self-optimization with multimodal performance criteria,'" IEEE Trans. Syst., Man, Cybern., vol. SMC-2, Apr. 1972.
[LW2] ---, "Finite-time performance of some two-armed bandit controllers," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 194-197, Mar. 1973.
