meta-heuristic unifying the existing ant algorithms has been defined. A good overview of the state of the art in the field is given
in [8].
The main observation on which ACO is based is that real
ants are capable of finding shortest paths from their nest to food
sources and back. They can perform this behavior thanks to a
simple pheromone-laying mechanism: while walking, ants deposit small amounts of pheromone on the ground. When ants move from their nest to the food source, they move mostly at random, but their random movements are biased by the pheromone trails left on the ground by preceding ants. Because the ants that initially chose the shortest path to the food arrive first, this path will be seen as more desirable by the same ants during their journey back to the nest. This, in turn, increases the amount of pheromone deposited on the shortest path. Eventually, this auto-catalytic process causes all the ants to take the shortest path.
Artificial ants exploit both the differential path length and the auto-catalytic aspects of real ant behavior to solve discrete optimization problems. The problem description is represented by a graph. Artificial ants are software agents in this graph that modify some variables so as to favor the emergence of good solutions. In practice, a variable is associated with each edge of the graph; it is called a pheromone trail in analogy with the real ants. Ants add pheromone to the edges they traverse and thereby increase the probability with which future ants will take these edges. Artificial ants, like real ones, move according to a probabilistic decision policy biased by the amount of pheromone trail they smell on the graph edges.
This makes an ant fit the definition of an agent, and thus ASs are examples of MASs. Since some ant algorithms have already been tested extensively and proved to perform well (see, for example, [9]), they should be studied in more theoretical depth. Although these algorithms achieve many good results, many open questions remain. How and why do these algorithms work? What are the principles of ACO algorithms? What controls them?
It turns out that the way ant algorithms work is not that different from the way interconnected LAs work. In this paper, we wish to point out that although both fields started from different perspectives and motivations (human behavior as opposed to ants), they arrived at the same kind of algorithms for the same applications, cf. routing in telecommunication networks [9], [5]. As far as the aforementioned questions are concerned, the field of LAs may provide a theoretical basis for ant algorithms and MASs in general. The potential of using LAs for learning in MASs was also pointed out by others [10], [11]. At the end of this paper, we use the analogy with the ACO algorithms to construct an interconnected model of LAs. This model is able to handle problems of current interest in MAS research (i.e., Markov games) directly [12], [13].
In the next section, ACO will be discussed. We give two ant
algorithms, which are representatives of the two problem types
ant algorithms handle: 1) static optimization problems and 2)
dynamic optimization problems. Next, we summarize some basics from LA theory. Since a graph can be modeled as an MDP,
In the ant system (AS) applied to the traveling salesman problem (TSP), the probability that the $k$th ant in town $i$ moves to town $j$ at time $t$ is

$p^{k}_{ij}(t) = \dfrac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in \mathrm{allowed}_k} [\tau_{il}(t)]^{\alpha}\,[\eta_{il}]^{\beta}}$ if $j \in \mathrm{allowed}_k$, and $0$ otherwise   (1)

where $\eta_{ij} = 1/d_{ij}$, $d_{ij}$ is the distance between town $i$ and town $j$, and $\tau_{ij}(t)$ is the intensity of the trail on edge $(i, j)$. After all the ants in the system have ended their tours, the trail values $\tau_{ij}$ are updated for every edge $(i, j)$:

$\tau_{ij}(t + n) = \rho\,\tau_{ij}(t) + \Delta\tau_{ij}, \qquad \Delta\tau_{ij} = \sum_{k} \Delta\tau^{k}_{ij}$   (2)

where $\rho$ is a trail decay coefficient such that $\rho < 1$, and $\Delta\tau^{k}_{ij} = Q/L_k$ if the $k$th ant used edge $(i, j)$ in its tour and $0$ otherwise, where $L_k$ is the length of the tour done by the $k$th ant and $Q$ is a constant.
Ants are fully cooperative, as they have a common goal, i.e., to find the shortest path. The use of a link by one ant does not influence the usefulness of that link for another agent. This is also reflected by the fact that, when more ants are used in this case, good solutions evolve more quickly. Some heuristic information is added to the action-selection rule (1) via the term $\eta_{ij}$, whose importance can be tuned by the parameter $\beta$.
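To make the interplay between rules (1) and (2) concrete, the following Python sketch runs one AS iteration on a symmetric TSP instance. It follows the equations as reconstructed above; all names (ant_system_step, tour_length) and the parameter values are our own illustration, not taken from [7].

```python
import numpy as np

def tour_length(tour, dist):
    """Length of the closed tour (dist is a symmetric numpy distance matrix)."""
    return sum(dist[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def ant_system_step(dist, tau, alpha=1.0, beta=2.0, rho=0.5, Q=100.0, n_ants=10, rng=None):
    """One AS iteration: every ant builds a tour with rule (1), then the trails are updated with (2)."""
    rng = rng or np.random.default_rng()
    n = len(dist)
    eta = 1.0 / (dist + np.eye(n))            # visibility eta_ij = 1/d_ij (diagonal padded, never used)
    delta_tau = np.zeros_like(tau)
    best_tour, best_len = None, np.inf
    for _ in range(n_ants):
        start = int(rng.integers(n))
        tour, allowed = [start], set(range(n)) - {start}
        while allowed:
            i = tour[-1]
            cand = np.array(sorted(allowed))
            weights = (tau[i, cand] ** alpha) * (eta[i, cand] ** beta)   # numerator of (1)
            j = int(rng.choice(cand, p=weights / weights.sum()))         # normalized rule (1)
            tour.append(j)
            allowed.remove(j)
        L = tour_length(tour, dist)
        if L < best_len:
            best_tour, best_len = tour, L
        for e in range(n):                     # this ant deposits Q / L_k on every edge of its tour
            a, b = tour[e], tour[(e + 1) % n]
            delta_tau[a, b] += Q / L
            delta_tau[b, a] += Q / L
    tau = rho * tau + delta_tau                # trail decay and reinforcement, rule (2)
    return tau, best_tour, best_len
```

Starting from a small uniform trail matrix, e.g., tau = np.full((n, n), 1e-3), repeated calls of ant_system_step concentrate pheromone on the edges of short tours, which is the auto-catalytic effect described above.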
AS was compared with other general-purpose heuristics [15]. For small TSP problems, the results were very interesting: AS was able to find and improve the best solution found so far for a known 30-city problem. For problems of growing dimensions, AS quickly converged to good solutions; however, it did not reach the best known solutions within the allowed number of iterations. Later on, AS was extended to the ant colony system (ACS) [8]. The performance of ACS turned out to be the best, both in terms of the quality of the solutions generated and in terms of CPU time, on standard problems of various sizes.
ACO algorithms were also introduced for other static optimization problems, e.g., the quadratic assignment problem, job-shop scheduling, graph coloring, and sequential ordering [8]. They all proved to be competitive with the best known methods in the literature.
B. Distributed Routing in Communication Networks
The AntNet system was introduced in [9] as a distributed, adaptive, mobile-agent-based algorithm for load-based shortest-path routing in connectionless communication networks. Routing is the distributed activity of building and using routing tables, one for each node in the network, which tell incoming data packets which outgoing link to use to continue their travel toward their destination node.
At regular time intervals, forward ants $F_{s \to d}$ are launched from every node $s$, concurrently with the data traffic, toward a randomly selected destination node $d$. The ants' goal is to find a feasible low-cost path to the destination and to check the load status of the network. To accomplish this, they have to use the same network queues as normal data packets. While traveling, ants keep track of the nodes visited and of the time elapsed since their launching time.

Ants' decisions to move forward are taken on the basis of a combination of a long-term learning process and an instantaneous heuristic prediction. Neighbor $j$ of node $i$ is selected with probability

$P'_{jd} = \dfrac{P_{jd} + \alpha\,\ell_j}{1 + \alpha\,(|N_i| - 1)}$   (3)

where $N_i$ is the set of neighbors of node $i$ and $P_{jd}$ is the routing-table entry of node $i$, i.e., the long-term probability of choosing neighbor $j$ when the destination is $d$, normalized so that

$\sum_{j \in N_i} P_{jd} = 1.$   (4)

The correction $\ell_j$ is an instantaneous heuristic prediction based on the amount of traffic $q_j$ queued on the link toward neighbor $j$:

$\ell_j = 1 - \dfrac{q_j}{\sum_{l \in N_i} q_l}.$   (5)
The value of $\alpha$ weighs the importance of the heuristic correction with respect to the routing-table information. The action probability $P'_{jd}$ therefore depends both on a long-term learning process and on an instantaneous heuristic prediction.7
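As an illustration of how a forward ant could combine the two information sources of (3)-(5), the sketch below selects a next hop from routing-table probabilities and instantaneous queue lengths. The data structures and names (routing_probs, queue_bits) are hypothetical; AntNet's actual implementation in [9] differs in detail.

```python
import random

def next_hop(routing_probs, queue_bits, alpha=0.3, visited=()):
    """Choose the next neighbor for a forward ant.

    routing_probs: dict neighbor -> routing-table probability P_jd for the ant's destination d.
    queue_bits:    dict neighbor -> bits currently queued on the link toward that neighbor.
    alpha:         weight of the instantaneous heuristic correction.
    visited:       nodes already visited by this ant (discouraged to limit trivial cycles).
    """
    neighbors = [j for j in routing_probs if j not in visited] or list(routing_probs)
    total_q = sum(queue_bits[j] for j in neighbors) or 1
    l = {j: 1.0 - queue_bits[j] / total_q for j in neighbors}       # heuristic correction (5)
    scores = {j: (routing_probs[j] + alpha * l[j]) / (1 + alpha * (len(neighbors) - 1))
              for j in neighbors}                                   # combination rule (3)
    threshold, acc = random.random() * sum(scores.values()), 0.0
    for j, s in scores.items():
        acc += s
        if threshold <= acc:
            return j
    return j
```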
Furthermore, the ant sees to it that cycles are detected, so that irrelevant information collected on a cycle can be erased, or so that the ant destroys itself when the cycle is too long. When the destination node is eventually reached, a backward ant $B_{d \to s}$ is created. The forward ant transfers its memory to it and dies. The backward ant then moves in the opposite direction, and at each node along the path it updates the statistical model of the node as well as its routing table. Backward ants use priority queues to continue their travel, so the information is propagated quickly.
Every node $k$ keeps a statistical model $M_k$ of the traffic distribution by computing sample means and variances over the trip times experienced by the mobile ants. A moving observation window $W$ is used to compute the value $W_{\mathrm{best}}$ of the best trip time seen in that window. The model is updated as follows:

$\mu_d \leftarrow \mu_d + \eta\,(o_{k \to d} - \mu_d)$   (6)

$\sigma^{2}_d \leftarrow \sigma^{2}_d + \eta\,\big((o_{k \to d} - \mu_d)^2 - \sigma^{2}_d\big)$   (7)

where $o_{k \to d}$ is the new observed trip time from node $k$ to destination $d$. The statistical model is used in the routing-table updating process by assigning a goodness to the trip time. This goodness value $r$, with $0 < r \le 1$, is seen in the current node $k$ as a positive reinforcement signal for the neighbor $f$ from which the backward ant returns:

$P_{fd} \leftarrow P_{fd} + r\,(1 - P_{fd}).$   (8)

The probabilities $P_{nd}$ for destination $d$ associated with the other neighboring nodes $n$ of $k$ implicitly receive a negative reinforcement by normalization:

$P_{nd} \leftarrow P_{nd} - r\,P_{nd}, \qquad n \in N_k,\ n \neq f.$   (9)
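A minimal sketch of the updates a node could apply when a backward ant arrives, following (6)-(9) as reconstructed above. The class and attribute names are ours, the exponential factor eta is an assumed model parameter, and the reinforcement r is simply passed in by the caller rather than computed by the squashed function of [9].

```python
class NodeModel:
    """Per-destination trip statistics and routing table kept at one node (illustrative names)."""

    def __init__(self, neighbors, eta=0.1):
        self.mu, self.var, self.w_best = {}, {}, {}      # trip-time mean, variance, window best per destination
        self.P = {}                                      # routing table: P[d][neighbor]
        self.neighbors = list(neighbors)
        self.eta = eta                                   # assumed exponential-averaging factor

    def update_statistics(self, d, trip_time):
        """Updates (6) and (7): running model of trip times toward destination d."""
        mu = self.mu.get(d, trip_time)
        var = self.var.get(d, 0.0)
        self.mu[d] = mu + self.eta * (trip_time - mu)
        self.var[d] = var + self.eta * ((trip_time - mu) ** 2 - var)
        self.w_best[d] = min(self.w_best.get(d, trip_time), trip_time)

    def reinforce(self, d, f, r):
        """Updates (8) and (9): reward neighbor f for destination d, renormalize the others."""
        P = self.P.setdefault(d, {j: 1.0 / len(self.neighbors) for j in self.neighbors})
        P[f] += r * (1.0 - P[f])                 # positive reinforcement (8)
        for j in self.neighbors:
            if j != f:
                P[j] -= r * P[j]                 # implicit negative reinforcement (9)
```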
Di Caro and Dorigo [9] recognized the importance of the reinforcement $r$. It is carefully chosen to be a squashed function of the sum of two terms. The first term, which is the most important one, evaluates the ratio between the current trip time and the best trip time observed over the current window. The second term considers the stability of the last trip times, favoring paths with low variation. If the trip time of a subpath is statistically good, then the statistics and routing-table entries corresponding to every node on that subpath are also updated.
Routing tables are used in a probabilistic way, not only by the
ants, but also by the data packets.
Versions of the AntNet algorithm were tested in [16] against some state-of-the-art algorithms, using the NTT Japanese backbone network, randomly generated networks of 100 and 150 nodes, and benchmark problems. The performance of the AntNet algorithms was among the best in terms of packet delays and throughput.
7 Experimentally, Di Caro and Dorigo [9] found that the best value of the weight $\alpha$ can vary between 0.2 and 0.5, depending on the problem characteristics. For lower values, the reactive effect vanishes, while for higher values, oscillations of the resulting routing tables appear.
reward-penalty. The philosophy of these schemes is essentially to increase the probability of an action when it results in a success and to decrease it when the response is a failure. With $\beta(t) \in [0, 1]$ the environment response at time $t$ (larger values being more favorable) and $r$ the number of actions, the general reward-penalty algorithm is given by

$p_i(t + 1) = p_i(t) + a\,\beta(t)\,(1 - p_i(t)) - b\,(1 - \beta(t))\,p_i(t)$ if $a_i$ is chosen at time $t$   (10)

$p_j(t + 1) = p_j(t) - a\,\beta(t)\,p_j(t) + b\,(1 - \beta(t))\left(\dfrac{1}{r - 1} - p_j(t)\right)$ if $a_j \neq a_i$.   (11)

The constants $a$ and $b$ are the reward and penalty parameters, respectively. When $a = b$, the algorithm is referred to as linear reward-penalty ($L_{R-P}$); when $b = 0$, it is referred to as linear reward-inaction ($L_{R-I}$); and when $b$ is small compared to $a$, it is called linear reward-$\epsilon$-penalty ($L_{R-\epsilon P}$).
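The update rules (10) and (11) translate directly into code. The sketch below assumes, as in the remainder of this section, that the environment response beta is a normalized reward in [0, 1]; the class name and the default parameter values are illustrative only.

```python
import random

class LinearLearningAutomaton:
    """Action probabilities updated with the linear scheme (10)-(11).

    a is the reward parameter, b the penalty parameter:
    a == b gives L_{R-P}, b == 0 gives L_{R-I}, and 0 < b << a gives L_{R-eps P}.
    """

    def __init__(self, n_actions, a=0.05, b=0.0):
        self.p = [1.0 / n_actions] * n_actions
        self.a, self.b = a, b

    def choose(self):
        return random.choices(range(len(self.p)), weights=self.p)[0]

    def update(self, chosen, beta):
        """beta in [0, 1]: 1 is a fully favorable response, 0 a fully unfavorable one."""
        r = len(self.p)
        for i in range(r):
            if i == chosen:
                self.p[i] += self.a * beta * (1 - self.p[i]) - self.b * (1 - beta) * self.p[i]
            else:
                self.p[i] += -self.a * beta * self.p[i] + self.b * (1 - beta) * (1.0 / (r - 1) - self.p[i])
```

Setting b = 0 gives the $L_{R-I}$ behavior used later in the grid-world experiments, while a = b gives $L_{R-P}$.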
Only one LA is active at each time step, and the transition to the next state triggers the LA of that state to become active and take some action. The automaton $LA_i$ active in state $s_i$ is not immediately informed of the one-step reward resulting from its action $a_i$, which leads to state $s_j$. When state $s_i$ is visited again, $LA_i$ receives two pieces of data: 1) the cumulative reward generated by the process up to the current time step and 2) the current global time. From these, $LA_i$ computes the incremental reward generated since its last activation and the corresponding elapsed global time. The environment response, i.e., the input to $LA_i$, is then taken to be

$\beta_i(t) = \dfrac{\rho_i(t)}{\eta_i(t)}$   (12)

where $\rho_i(t)$ is the cumulative total reward generated for action $a_i$ in state $s_i$ and $\eta_i(t)$ is the cumulative total time elapsed.10 Wheeler and Narendra [6] take the updating rules (10) and (11), together with the environment response of (12), as the learning scheme of each automaton. They also prove that this interconnected LA model is capable of solving the MDP.
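The control loop below sketches this interconnected model: one automaton per state (reusing the LinearLearningAutomaton class from the earlier sketch with b = 0, i.e., $L_{R-I}$), activated when its state is entered, and updated with the delayed response (12) computed from cumulative rewards and elapsed times. The function mdp_step stands for an unknown environment and, like all other names, is a placeholder for illustration.

```python
# Decentralized control of an unknown MDP with one automaton per state.
# `mdp_step(state, action)` is a placeholder environment returning (next_state, reward),
# with one-step rewards assumed normalized to [0, 1] (cf. footnote 10).

def control_mdp(mdp_step, n_states, n_actions, start_state, horizon=100_000, a=0.01):
    automata = [LinearLearningAutomaton(n_actions, a=a, b=0.0) for _ in range(n_states)]  # L_{R-I}
    rho = [[0.0] * n_actions for _ in range(n_states)]    # cumulative reward per (state, action)
    eta = [[0.0] * n_actions for _ in range(n_states)]    # cumulative elapsed time per (state, action)
    last_visit = {}   # state -> (global cumulative reward, global time, action) at its previous activation
    cum_reward, t, state = 0.0, 0, start_state
    while t < horizon:
        la = automata[state]
        if state in last_visit:
            prev_reward, prev_t, prev_action = last_visit[state]
            rho[state][prev_action] += cum_reward - prev_reward          # incremental reward since last visit
            eta[state][prev_action] += t - prev_t                        # elapsed global time since last visit
            beta = rho[state][prev_action] / eta[state][prev_action]     # environment response (12)
            la.update(prev_action, beta)
        action = la.choose()
        last_visit[state] = (cum_reward, t, action)
        state, reward = mdp_step(state, action)
        cum_reward += reward
        t += 1
    return [la.p.index(max(la.p)) for la in automata]      # greedy policy read-out
```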
Theorem 1 (Wheeler and Narendra [6]): With every action state of an $N$-state Markov chain, associate an automaton that uses the learning scheme above over the actions available in that state. Assume that the Markov chain corresponding to each policy $\alpha$ is ergodic.11 Then, for every $\epsilon > 0$, there exists an $a^* > 0$ such that, for all reward parameters $a < a^*$, the group of automata converges to a policy whose expected reward per step lies within $\epsilon$ of its maximal value. Here the expected reward per step under policy $\alpha$ can be written in terms of the limiting stationary probabilities $\pi_i(\alpha)$ as

$J(\alpha) = \sum_{i=1}^{N} \sum_{j=1}^{N} \pi_i(\alpha)\,p_{ij}(\alpha)\,r_{ij}$

where $r_{ij}$ and $p_{ij}(\alpha)$ are the rewards and transition probabilities, respectively, depending on the starting state $s_i$ and the ending state $s_j$.
Proof: Wheeler and Narendra [6] prove that the Markov chain control problem under the above assumptions can be asymptotically approximated by an identical-payoff game12 of automata. This game is shown to have a unique equilibrium. For the corresponding automata game, with every automaton using an $L_{R-I}$ updating scheme, the above result is proved. As long as the ratio of the updating frequencies of any two controllers does not tend to zero, the result also holds for the asynchronous updating scheme.
The principal result derived is that, without prior knowledge
of transition probabilities or rewards, the network of independent decentralized LA controllers is able to converge to the set
of actions that maximizes the long-term expected reward [5].
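As a small numerical illustration of the quantity being maximized, the sketch below computes the expected reward per step J(alpha) of a fixed policy from the limiting stationary distribution of its (ergodic) Markov chain. The two-state transition and reward matrices are invented for the example.

```python
import numpy as np

def expected_reward_per_step(P, R):
    """J(alpha) = sum_i sum_j pi_i(alpha) p_ij(alpha) r_ij for an ergodic chain with
    transition matrix P and reward matrix R induced by a fixed policy alpha."""
    n = P.shape[0]
    # stationary distribution: solve pi P = pi together with sum(pi) = 1
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(np.sum(pi[:, None] * P * R))

# Invented two-state example; under this policy the chain is ergodic.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([[1.0, 0.0],
              [0.2, 0.8]])
print(expected_reward_per_step(P, R))
```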
10 The one-step reward is normalized so that $\beta$ stays in [0, 1].
11 A Markov chain $x(n)$ is said to be ergodic when the distribution of the chain converges to a limiting distribution as $n \to \infty$.
12 Games are a formalization of interactions between players. In an identical-payoff game, every player receives the same payoff for the joint action taken by the players. For an overview of game theory, see [18]. For an introduction to automata games, see [5].
Fig. 2. Grid-world game with two mobile agents in their initial state and
nonmobile LAs in every nongoal state of the grid.
When we use $L_{R-I}$ schemes for our nonmobile state automata, the automata find two optimal nonconflicting paths for the mobile agents: they both take the action up in their initial states. This solution is the Nash equilibrium of the game, consisting of the mobile agents' dominating strategies (see Figs. 3 and 4).
The game can be made more interesting by letting the action up in the two starting states be executed only with probability 0.5. Figs. 5 and 6 show that the automata then find a solution in which one player takes the lateral move and the other tries to move north. Again, this is an equilibrium of the game.14
Therefore, in this example, our LA model is able to find the equilibrium solutions, without explicit communication and without the need for an agent to model its opponents.
14 The other equilibrium is the same, only with the roles of the agents switched.
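A stripped-down illustration of this coordination effect (not the full grid-world game of Fig. 2): two independent $L_{R-I}$ automata, reusing the LinearLearningAutomaton class from the earlier sketch, repeatedly play a 2x2 identical-payoff game and settle on one of its pure Nash equilibria without communicating or modeling each other. The payoff values are invented for the example.

```python
import random

# Identical-payoff game: both players receive payoff[i][j] for the joint action (i, j).
# The pure Nash equilibria are (0, 1) and (1, 0); the "conflicting" joint actions pay less.
payoff = [[0.1, 1.0],
          [1.0, 0.1]]

agents = [LinearLearningAutomaton(2, a=0.05, b=0.0) for _ in range(2)]   # two independent L_{R-I} players
for _ in range(20_000):
    i, j = agents[0].choose(), agents[1].choose()
    reward = payoff[i][j]            # the same response is fed to both automata
    agents[0].update(i, reward)
    agents[1].update(j, reward)

print([a.p for a in agents])         # each probability vector ends up close to a pure strategy
```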
VI. DISCUSSION
In this paper, we compared the field of ACO with an interconnected LA model that is capable of controlling an MDP. We extended this LA model to allow multiple LAs to be active simultaneously, and we found that this model matches the ant colony paradigm very closely.
How can we use these results? From the ACO point of view, it means that the theory of LAs can serve as a theoretical analysis tool for ACO algorithms. The problem in the field of ACO is that, although the algorithms achieve good results both in terms of quality and in terms of convergence time, no underlying formalism or convergence results exist for them. Why do these algorithms work? The convergence proof of the LA model in Section V is a justification for the use of ant algorithms in the case of static optimization problems. For dynamic optimization problems, ant algorithms still work fine; however, in this case, no proof of convergence is given for the interconnected model of LAs. This underlines the importance of the heuristic information used in ant algorithms. From the LA point of view, this may be a suggestion coming from the ACO field: the use of heuristic information can guide learning and improve convergence results.
Apart from the practical benefit of using a heuristic, learning in single-agent problems or static environments seems to be theoretically well understood within the broad RL framework [2]. For learning in nonstationary environments, however, such as in MASs, a theoretical foundation is missing. The ant algorithms, and therefore also the interconnected model of LAs, give several useful insights in this case. Currently, much attention is devoted to the Markov game model for learning in MASs. The Markov game model [13], [19] is a direct extension of the MDP model and of the game-theoretic model for MASs. This model augments the MDP model with actions that are distributed over the different agents, as in the game-theoretic model. At every step of the process, the system is in a certain state and a corresponding game has to be played. Although this model gives a natural mapping of the problem, learning in it is not trivial, because the Markovian property no longer holds due to the other agents in the environment. RL, or more precisely the technique of Q-learning [20], has already been used for learning in stochastic games [4], [12], [13], [19]; however, the proposed solutions are limited by certain conditions. Oscillating behavior may arise, and stabilizing features must be added [4]. Moreover, in MDPs, agents learn a value for an action in a certain state, while in stochastic games, values are learned for combinations of actions, and learning is thus done in a product space. ACO methods seem to work around this problem. We showed how the similar model of interconnected LAs can be used directly on a Markov game problem of the kind currently studied in the MAS community. For now, the LA model gives the same results, however without modeling the other agents, and thus without the
need for the product space, and without explicit communication. Moreover, by analogy with ACO applications and results,
scaling this technique to larger problems should be possible.
The use of LAs for learning in MASs was also proposed by others. Schmidhuber and Zhao designed an MAS model [21] that augments the action vector with an additional action, namely a change of strategy. The key idea there is the reward the agents receive by continually testing the reinforcement acceleration for every modification made. In this model, there is also no explicit communication or modeling of other agents.
In [10], distributed game automata and the effect of delayed communication are studied. Explicit communication is used, but it is limited so that overhead costs are reduced while good decisions still result. Simulation and analytic results reported in [10] show that there exists a maximum communication delay beyond which decision quality begins to suffer. However, with sufficient communication, the agents adapt to a coordinated policy.
In [22], LAs are used for playing stochastic games at multiple
levels.
To conclude, we believe that the similarities between ACO and LAs mean that the theory of LAs can serve as a good theoretical tool for analyzing ACO algorithms and learning in MASs in general.
REFERENCES
[1] G. Weiss, Ed., Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. Cambridge, MA: MIT Press, 1999.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[3] J. Boyan and M. Littman, "Packet routing in dynamically changing networks: A reinforcement learning approach," Adv. Neural Inf. Process. Syst., vol. 6, pp. 671–678, 1994.
[4] A. Nowé and K. Verbeeck, "Distributed reinforcement learning, load-based routing: A case study," in Notes of the Neural, Symbolic, and Reinforcement Methods for Sequence Learning Workshop at IJCAI, Stockholm, Sweden, 1999, pp. 85–91.
[5] K. Narendra and M. Thathachar, Learning Automata: An Introduction. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[6] R. M. Wheeler and K. S. Narendra, "Decentralized learning in finite Markov chains," IEEE Trans. Automat. Contr., vol. AC-31, pp. 519–526, June 1986.
[7] M. Dorigo, V. Maniezzo, and A. Colorni, "The ant system: Optimization by a colony of cooperating agents," IEEE Trans. Syst., Man, Cybern. B, vol. 26, pp. 29–41, Feb. 1996.
[8] M. Dorigo, G. Di Caro, and L. M. Gambardella, "Ant algorithms for discrete optimization," Artif. Life, vol. 5, no. 2, pp. 137–172, 1999.
[9] G. Di Caro and M. Dorigo, "AntNet: Stigmergetic control for communications networks," J. Artif. Intell. Res., vol. 9, pp. 317–365, 1998.
[10] E. A. Billard and J. C. Pasquale, "Adaptive coordination in distributed systems with delayed communication," IEEE Trans. Syst., Man, Cybern., vol. 25, pp. 546–554, Apr. 1995.
[11] A. Glockner and J. C. Pasquale, "Coadaptive behavior in a simple distributed job scheduling system," IEEE Trans. Syst., Man, Cybern., vol. 23, pp. 902–907, May/June 1993.
[12] C. Boutilier, "Sequential optimality and coordination in multi-agent systems," in Proc. IJCAI, Stockholm, Sweden, 1999, pp. 478–485.
[13] J. Hu and M. P. Wellman, "Multi-agent reinforcement learning: Theoretical framework and an algorithm," in Proc. 15th Int. Conf. Machine Learning, 1998, pp. 242–250.
[14] M. Guntsch, J. Branke, M. Middendorf, and H. Schmeck, "ACO strategies for dynamic TSP," in Proc. 2nd Int. Workshop Ant Algorithms, 2000, pp. 59–62.
[15] M. Dorigo and L. M. Gambardella, "Ant colony system: A cooperative learning approach to the traveling salesman problem," IEEE Trans. Evol. Comput., vol. 1, pp. 53–66, Jan. 1997.
[16] G. Di Caro and M. Dorigo, "Two ant colony algorithms for best-effort routing in datagram networks," in Proc. 10th Int. Conf. Parallel and Distributed Computing and Systems, 1998, pp. 541–546.
[17] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Mach. Learn., vol. 16, pp. 185–202, 1994.
[18] M. J. Osborne and A. Rubinstein, A Course in Game Theory. Cambridge, MA: MIT Press, 1994.
[19] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proc. 11th Int. Conf. Machine Learning, 1994, pp. 157–163.
[20] C. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3, pp. 279–292, 1992.
[21] J. Schmidhuber and J. Zhao, "Direct policy search and uncertain policy evaluation," in Proc. AAAI Spring Symp. Search Under Uncertain and Incomplete Information, Stanford, CA, 1999, pp. 119–124.
[22] E. A. Billard and S. Lakshmivarahan, "Learning in multilevel games with incomplete information, Part I," IEEE Trans. Syst., Man, Cybern. B, vol. 29, pp. 329–339, June 1999.
Katja Verbeeck received the M.S. degree in mathematics in 1995, and the M.S. degree in computer
science in 1997, both from Vrije Universiteit
Brussels (VUB), Brussels, Belgium, where she is
currently pursuing the Ph.D. degree.
She is also currently a Teaching Assistant in the
Computational Modeling Lab, COMO, at VUB.
Her research interests are reinforcement learning,
learning automata, and learning in multiagent
systems.
Ann Nowé received the M.S. degree from Universiteit Gent, Gent, Belgium, in 1987, where she studied mathematics with optional courses in computer science, and the Ph.D. degree from Vrije Universiteit Brussels (VUB), Brussels, Belgium, in collaboration with Queen Mary and Westfield College, University of London, London, U.K., in 1994. The subject of her dissertation lies at the intersection of computer science (AI), control theory (fuzzy control), and mathematics (numerical analysis, stochastic approximation).
She was a Teaching Assistant and is now a Professor at VUB. Her major areas
of interest are AI-learning techniques, in particular, reinforcement learning and
learning in multiagent systems. She is a member of the Computational Modeling
Lab, COMO, at VUB.