AI Magazine Volume 14, Number 4 (1993) (© AAAI)

Benchmarks, Test Beds, Controlled Experimentation, and the Design of Agent Architectures
Steve Hanks, Martha E. Pollack, and Paul R. Cohen

The methodological underpinnings of AI are slowly changing. Benchmarks, test beds, and controlled experimentation are becoming more common. Although we are optimistic that this change can solidify the science of AI, we also recognize a set of difficult issues concerning the appropriate use of this methodology. We discuss these issues as they relate to research on agent design. We survey existing test beds for agents and argue for appropriate caution in their use. We end with a debate on the proper role of experimental methodology in the design and validation of planning agents.

In recent years, increasing numbers of AI research projects have involved controlled experimentation, in which a researcher varies the features of a system or the environment in which it is embedded and measures the effects of these variations on aspects of system performance. At the same time, two research tools have gained currency: benchmarks, precisely defined, standardized tasks, and test beds, challenging environments in which AI programs can be studied. In our view, the move toward more principled experimental methods is uncontroversially a good thing; indeed, we are optimistic that it will solidify the science of AI. However, we also recognize some issues concerning the appropriate use of these methods. First, benchmarks and test beds no more guarantee important results than, say, microscopes and Bunsen burners. They are simply part of the apparatus of empirical AI. It is up to the researcher to discriminate between uninteresting and important phenomena and to follow up reports of experiments with thorough explanations of their results. Second, there is little agreement about what a representative benchmark or test-bed problem is. A third and related concern is that results obtained with benchmarks and test beds are often not general. Fourth, because benchmarks and test beds are attractive to program managers and others who provide funding, there is a real danger that researchers will aim for the prescribed benchmark target when funding is perceived to be the reward. In sum, we are concerned that benchmarks and test beds, if not carefully used, will provide only a comfortable illusion of scientific progress (controlled experimentation with reproducible problems and environments and objective performance measures) but no generalizable, significant results.

Benchmarks and test beds serve at least two different purposes. One is to provide metrics for comparing competing systems. Comparison metrics are valuable for some purposes, but performance comparisons do not constitute scientific progress unless they suggest or provide evidence for explanatory theories of performance differences. The scientific value of well-crafted benchmarks and test beds is their power to highlight interesting aspects of system performance, but this value is realized only if the researcher can adequately explain why his or her system behaves the way it does.


The experimental control that can be achieved with test beds can help us explain why systems behave as they do. AI systems are intended to be deployed in large, extremely complex environments, and test beds serve as simplified, simulated versions of these environments, in which the experimenter has access to particular aspects of the environment, and other aspects are allowed to vary randomly. The experimental process consists in the researcher varying the features of the test-bed environment, the benchmark task, or the embedded system and measuring the resulting effects on system performance. A fundamental question exists, however, about the viability of this approach. The concern grows out of the tension between realism and the possibility of experimental control. On the one hand, controlled experiments seem, at least currently, to be feasible only for simplified systems operating in highly idealized environments. On the other hand, our ultimate interest is not simplified systems and environments but, rather, real-world systems deployed in complex environments. It is not always obvious whether the lessons learned from the simplified systems are generally applicable, but neither is it obvious how to perform systematic experiments without the simplifications.

Researchers disagree about how best to proceed in light of this tension. One approach is to maintain systematicity in experiments and look for ways to translate the results of the experiments into general principles that apply to more complex systems and environments. The alternative is to focus on more realistic systems and environments and to try to conduct systematic experiments on them directly. Much of this article focuses on a comparison of these approaches.

Although benchmarks, test beds, and controlled experimentation are increasingly important in a number of subareas of AI, including natural language understanding and machine learning, we focus our discussion on their role in agent design. We begin, in Benchmarks and Test Beds, by describing some of the criteria for good benchmarks and test beds and discussing some of the potential difficulties encountered in their design. In Current Issues in Agent Design, we discuss the range of features that a test bed for agent design might have. In Test-Bed Implementations, we survey existing test beds for agent design with these features in mind. Finally, in Discussion, we return to the general issue of experimental methodology in agent design and discuss some unresolved questions concerning its use. Our points will become increasingly controversial as the article proceeds, and indeed, by the end of the article, we will no longer speak with one voice.

Benchmarks and Test Beds

Benchmarks are a common tool in computer science. In the design of central processing units (CPUs), for example, matrix multiplication is a good benchmark task because it is representative of an important class of numeric processing problems, which, in turn, is representative of a wider class of computational problems: those that do not involve significant amounts of input-output. The matrix multiplication problem can be described precisely and rigorously. Moreover, matrix multiplication is illuminating: It tells the CPU designer something interesting about the CPU, namely, its processing speed. In other words, if we are interested in processing speed as a measure of performance, then matrix multiplication is a good benchmark: Good performance on matrix multiplication problems predicts good performance on the large class of numeric tasks for which the processor is being designed.

An early benchmark task for AI planning programs was the Sussman anomaly (the three-block problem) (Sussman 1975). The Sussman anomaly helped many researchers elucidate how their planners worked. It was popular because, like matrix multiplication, it was representative of an important class of problems, those involving interactions among conjunctive subgoals, and it was easy to describe.

A benchmark is illuminating to the degree that it tells us something we want to know about the behavior of a program. Our goals as scientists, engineers, and consumers dictate what we want to know. Sometimes we are most interested in the system's raw performance. In buying a workstation, we might be impressed with the rate at which a particular machine performs matrix multiplication. Likewise, as the potential user of an AI search algorithm, we might be impressed with the performance of the min-conflicts heuristic algorithm on the million-queens problem (Minton et al. 1990). As scientists and engineers, however, our interests are different. In these roles, we want to understand why a system behaves the way it does. What is it about the Cray architecture that allows high-performance matrix multiplication? Why does the min-conflicts heuristic algorithm solve increasingly difficult n-queens problems in roughly constant time?


Understanding a system's behavior on a benchmark task requires a model of the task, so our goals as scientists and engineers will often be served only by benchmark tasks that we understand well enough to model precisely, especially for cases in which we expect a program to pass the benchmark test. Without a model of the task, it is difficult to see what has been accomplished: We risk finding ourselves in the position of knowing simply that our system produced the successful behavior (passing the benchmark).

Models are also important when we design benchmarks to be failed, but in this case, we need a model of the factors that make the benchmark difficult. For example, we learn more about articulation by asking a human to say "black back brake block" repeatedly than we do from having the person say the equally unpronounceable sentence "alckb bcak raebk lbcko." Both sentences are extremely difficult, but the former is more illuminating because we have models of phonetics that explain why it is difficult. Experiments can tell us which design choices lead to good performance on benchmark tasks, but we need good models of these tasks to explain why it is so. However, building a good model tends to require a simple problem, and there is always the danger that a simple problem will not be especially illuminating.

Benchmarks ideally are problems that are both amenable to precise analysis and representative of a more complex and sophisticated reality. Unfortunately, the current state of the field often elevates these problems to a new status: They become interesting for their own sake rather than for their help in understanding a system's behavior on larger, more interesting tasks. Cohen's (1991) survey of papers from the 1990 National Conference on Artificial Intelligence found that 63 percent of the papers focused on benchmark problems such as n queens, the Yale shooting problem, and Sussman's anomaly. However, few of these papers made explicit the connection between the benchmark problems and any other task. Without this additional analysis, it is difficult to say whether these problems are representative of others we presumably care about and, therefore, exactly why the reported solutions are themselves interesting.

As AI begins to focus less on component technologies and more on complete, integrated systems, these traditional benchmarks might reveal their limitations. For example, although we might use n queens to test the capability and speed of a constraint-satisfaction algorithm embedded in, say, a factory scheduler, this benchmark will not tell us whether the quality of a schedule is appropriate given time constraints and other goals of the program. However, it is far from obvious that any benchmark can be devised for such a case. Benchmarks are problems that everyone can try to solve with his/her own system, so the definition of a benchmark cannot depend on any system-specific details, nor can the scoring criteria. What a researcher learns about a system from performance on a benchmark is liable to be inversely proportional to the size, complexity, and specificity of the system.

Thus, the conscientious researcher, intent on evaluating a system, faces an uncomfortable choice. The behaviors of the system's components can be evaluated individually on benchmark tasks, or the system's behaviors (not necessarily those of individual components) can be evaluated by task-specific criteria. On the one hand, the researcher learns, say, that the embedded constraint-satisfaction algorithm is extremely slow and won't scale up; on the other, he/she learns that the system nonetheless produces robust, timely schedules for the particular job shop modeled. Neither result is likely to evoke interest outside the researcher's own laboratory. Why should the rest of us care that an inefficient algorithm suffices to solve an applied problem that doesn't concern us? The difficulty is that as our attention turns to integrated programs, benchmark scores for component processes might be at variance with or poorly predict task-specific measures.

The potential mismatch between benchmark scores and performance on real tasks is also a concern for researchers who are developing test beds. Although some test beds are no more than an interface to specify parameters of a benchmark problem and instrumentation to measure performance, those described in this article provide rich environments that present a wide range of challenges to planners and related AI programs. You can design a lot of tasks for your planning system in TILEWORLD, PHOENIX, and the other test beds discussed here. You can study a lot of phenomena: real-time satisficing, graceful degradation under resource restrictions, path planning and navigation, sensor fusion, various kinds of learning, and so on. However, each of these general behaviors will be implemented in a particular way depending on the specific test bed and system being developed.


Graceful degradation in a simplified TILEWORLD agent might have little in common with what we call graceful degradation in a complex system deployed to perform a real task, just as aggressive behavior in seagulls has little in common with aggressive behavior in teenage boys. McDermott's (1981) wishful mnemonic problem has not gone away: Two test-bed researchers might each claim to have achieved graceful degradation under resource restrictions, but it is more accurate to say that each has achieved something that he or she calls graceful degradation. Test beds make it easier to build programs that exhibit diverse behaviors, but researchers have to face the problem of understanding what like-named behaviors have in common.

Benchmarks and test beds do not currently bridge the gap between general and specific problems and solutions. A gap exists between the benchmark n-queens problem and another, domain-specific problem that you care about. A gap exists between the test-bed problem of having too few bulldozers to fight fires in the PHOENIX simulation and a general resource-limited planning problem. Those of us who build and work with test beds appreciate the opportunities they provide to study many phenomena, but we also recognize the difficulties involved in finding test-bed-specific problems that satisfy the criteria of benchmarks: They are simultaneously representative of larger, more interesting problems; easy to describe; and illuminating.

Current Issues in Agent Design

Despite the difficulties in designing test beds and perhaps because of the promise associated with test-bed-based experimentation, a number of test-bed systems for studying agent design have been developed to date. In Test-Bed Implementations, we survey some of them. This section motivates the survey by describing some significant research issues in agent design and noting corresponding features that test beds should exhibit. Much current research in agent design builds on the classical planning paradigm that characterized the field for several years, so our section begins with a short explanation of this paradigm.

The classical planning paradigm assumes an environment that is both controlled and simple. The planning agent is generally assumed to have complete control over the environment, which means that its intended actions are the only events that can change the world's state and, furthermore, that the effects of its actions are fully known, both to the agent and to the system designer. The agent is usually assumed to possess complete and error-free information about the state of the world when it begins planning. Because it knows what the initial state of the world is, what actions it intends to carry out, and what the effects of those actions will be, it can, at least in principle, predict exactly what the state of the world will be when it finishes acting. In other words, it knows ahead of time whether a particular plan will or will not achieve its goal.

Classical planners embody strong simplifying assumptions both in the sense that their capabilities (the class of problems they can solve) tend to be limited and in the sense that the worlds in which they operate tend to be small, exhibiting few features and a limited physics. Planners are generally tested in domains with few planning operators, on goals with few conjuncts, and on models of the world in which few features are explicitly modeled. Performance tends to degrade when the number of operators, goal conjuncts, or environmental features increases. Just as control means that the planner can, in principle, prove that its plan will work, the simplifying assumptions mean that the planner can as a practical matter generate the proof. Control and simplifying assumptions, therefore, allow the planner the luxury of generating provably correct plans prior to execution time.

Most current work on agent architectures aims toward relaxing these assumptions. Reactive systems, for example, deal with the problem that the world can change unpredictably between plan time and execution time by deciding what to do at execution time instead of generating a plan prior to execution. Case-based planners confront the simplicity problem by storing only the essential details of a solution, allowing the planner to concentrate on the relevant features of a new problem.

Next we describe some specific issues that have recently attracted the attention of planning researchers and, therefore, guide decisions about what features a planning test bed might exhibit.

Exogenous events: Perhaps the most limiting assumption of the classical planning worlds (most notably, the blocks world) is that no exogenous, or unplanned, events can occur. Relaxing this assumption makes the process of predicting the effects of plans more difficult (Hanks 1990b) and also introduces the need to react to unplanned events as they occur at execution time (Agre and Chapman 1987; Firby 1989).


The time cost of planning becomes important in a world that allows unplanned changes: The longer the agent takes to plan, the more likely it is that the world has changed significantly between the time the plan was generated and the time it is executed (Bratman, Israel, and Pollack 1988; Russell and Wefald 1991; Dean and Boddy 1988).

Complexity of the world: A realistic world has many features. Even a simple block has color, mass, texture, smell, and so on, although many of these features will be irrelevant to many tasks. A realistic world also has a complex causal structure: Changes in one aspect of the world can change many other aspects, even though most of those changes might again be irrelevant to any particular problem. Reasoning about more realistic models of the world requires the ability to represent and make predictions about complex mechanisms (Weld and deKleer 1989) as well as the ability to recognize and focus attention on those aspects of the world relevant to the problem at hand (Hanks 1990a). A test bed for exploring realistically complex planning problems should itself provide a complexity and diversity of features.

Quality and cost of sensing and effecting: Sensing and effecting, generally ignored by the classical planners, are neither perfect nor cost free. An agent must therefore incorporate incorrect and noisy sensor reports into its predictive model of the world (Hanks and McDermott 1994) and must plan sensing actions to improve its state of information, taking into account both the benefit of the information and the cost of acquiring it (Chrisman and Simmons 1991). Thus, a test bed for studying agent design might be populated with agents having imperfect sensors and effectors. The test bed needs to make a clean distinction between the agent and the simulated world, the agent's sensing and effecting capabilities defining the interface.

Measures of plan quality: Classical planners are provided with a goal state to achieve, and they stop when their plans can achieve this state. However, simple achievement of a goal state is an inadequate measure of success; it does not take into account the cost of achieving the goal, and it also does not admit the possibility of partial goal satisfaction. Haddawy and Hanks (1993) and Wellman and Doyle (1991) explore the relationship between goal expressions and utility functions. A test bed for exploring richer notions of success and failure should allow the designer to pose problems involving partial satisfaction of desired states, forcing the planner to trade the benefits of achieving the goal against the cost of achieving it. The problem of balancing cost against solution quality becomes more difficult when the agent is actually planning for a sequence of problems over time, some of which might not even have been made explicit when it begins to plan.
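The trade-off can be made concrete with a toy net-utility calculation (the function and its numbers are purely illustrative and are not drawn from the cited work):

```python
def plan_value(goal_worth: float, degree_satisfied: float, cost: float) -> float:
    """Toy measure: benefit scaled by partial satisfaction, minus execution cost."""
    return goal_worth * degree_satisfied - cost

# A plan that fully achieves a goal worth 100 but costs 80 to execute ...
full_plan = plan_value(100, 1.0, 80)      # net value 20
# ... scores worse than a cheap plan that only half-achieves the same goal.
partial_plan = plan_value(100, 0.5, 5)    # net value 45
assert partial_plan > full_plan
```

A planner judged only by goal achievement would always prefer the first plan; a richer measure of success can prefer the second.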
Multiple agents: Allowing multiple agents to act in the world introduces new problems: how behaviors are coordinated, how the agents should communicate, how the effects of simultaneous actions differ from the effects of those actions performed serially. Multiple-agent planning is an active research area (Bond and Gasser 1988), and a test bed for exploring these research issues must allow coordinated behavior and communication among the agents that inhabit it.

In addition to the functions required to make the test bed challenging, we also identify some design issues that tend to make a test bed more useful to prospective users:

A clean interface: It is important to maintain a clear distinction between the agent and the world in which the agent is operating. The natural separation is through the agent's sensors and effectors, so the interface should be clean, well defined, and well documented. A designer must be able to determine easily what actions are available to the agent, how the actions are executed by the test bed, and how information about the world is communicated back to the agent.

A well-defined model of time: Test beds must present a reasonable model of passing time to simulate exogenous events and simultaneous action and to define clearly the time cost of reasoning and acting. (This problem is a general one in simulation and modeling. See Law and Kelton [1981], for example.) However, the test bed must somehow be able to communicate how much simulated time has elapsed. Making sense of experimental results requires a way to reconcile the test bed's measure of time with that used by the agent.

Supporting experimentation: Testing an agent architecture amounts to assessing its performance over a variety of sample problems and conditions. Controlled experiments require that problems and environmental conditions be varied in a controlled fashion. A test bed should therefore provide a convenient way for the experimenter to vary the behavior of the worlds in which the agent is to be tested. The experimenter must also be able to monitor the agent's behavior in the test-bed world (Langley and Drummond 1990).


Although it is far from clear at this point what statistics should be used in such an assessment, the test bed must allow performance statistics to be gathered. It is also useful for the data to be formatted automatically for analysis using statistical software packages.

Test-Bed Implementations

Previous sections provided the motivations for simulated test-bed worlds and discussed some of the problems that might be explored in them. This section surveys several of the simulated worlds available to the community. Our survey is not exhaustive, nor is our selection of test beds meant to imply that they are the best available. For each test bed, we describe the sort of world the test bed is supposed to simulate and the research problems it was designed to test, we discuss the interface between the agent and the world and that between the researcher and the system (agent plus world), and we summarize the main methodological commitments associated with the test bed.

Grid Worlds

Several test-bed worlds have been organized around the theme that the agent is situated in a rectangular two-dimensional grid, and its main task is to push tiles around the grid. We first discuss the TILEWORLD of Pollack and Ringuette (1990), then the independently developed NASA (National Aeronautics and Space Administration) TILEWORLD (NTW) (Philips and Bresina 1991) and the MICE simulator (Montgomery et al. 1992).

Pollack and Ringuette (1990) report on the TILEWORLD test bed, a system designed to support controlled experiments with agent architectures situated in dynamic and unpredictable environments. The world consists of a rectangular grid on which can be placed the agent, some tiles, some obstacles, and some holes. Each object occupies one cell of the grid. The agent can move up, down, left, and right unless doing so would cause it to run into the world's boundaries or an obstacle. When a tile is in a cell adjacent to the agent, the agent can push the tile by moving in its direction. The agent's goal is to fill holes with tiles. Each hole has a capacity C and a score S. When the agent pushes C tiles into a hole, the hole disappears, and the trial's score increases by S. Each trial has a time limit, and the agent's performance is measured by the trial's score at its completion.1
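The scoring mechanics can be summarized in a short sketch (Python is used here purely for illustration; the distributed system is a Lisp program, and the class and field names are ours):

```python
from dataclasses import dataclass

@dataclass
class Hole:
    capacity: int       # C: number of tiles needed to fill the hole
    score: int          # S: points awarded when the hole is filled
    tiles_in: int = 0

@dataclass
class Trial:
    time_limit: int
    score: int = 0      # the agent's performance measure at the time limit

    def push_tile_into(self, hole: Hole) -> bool:
        """Push one tile in; when the Cth tile arrives, the hole disappears
        and the trial score increases by S. Returns True if the hole filled."""
        hole.tiles_in += 1
        if hole.tiles_in == hole.capacity:
            self.score += hole.score
            return True
        return False
```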
The TILEWORLD environment includes exogenous events: Objects in the world can appear and disappear during a simulation. The experimenter can control the rate at which these objects appear and disappear as well as certain characteristics (capacity and score) of the newly created objects. The ability to control these parameters is an important feature of TILEWORLD because it allows systematic exploration of worlds with various characteristics (for example, worlds that change relatively quickly or slowly). The goal of such exploration is to find systematic relationships between world characteristics and corresponding characteristics of the embedded agent. The TILEWORLD system is distributed with a basic agent design, which is also parameterized to allow manipulation by the experimenter (see the following discussion).

The interface between the agent and the world allows the agent to take one of four primitive actions at any time: move left, move right, move up, and move down. Some or all of the primitive actions might be infeasible at a given time, for example, if an obstacle is blocking the way. The effects of each action are predetermined and deterministic: The agent always moves to the appropriate adjacent cell if it chooses to do so and if the move is feasible. It never ends up in a different cell by accident. Tiles and obstacles are characterized by their types and their location on the grid. Each takes up exactly one cell. Holes, which can occupy one or more cells, are characterized by location, capacity, and score.

Holes, obstacles, and tiles appear and disappear probabilistically, according to parameter settings established by the researcher prior to any trial. The probabilities are independent of one another; a single probability governs the appearance of tiles, and it is the same regardless of the time, the location, or any other parameter in the game.

TILEWORLD has no explicit sensing operators. The agent is provided with a data structure that describes the world's state in complete detail and with complete accuracy. The use of this information is left to the designer of the embedded agent; for example, he or she can design mechanisms that distort the information to introduce inaccuracy.

The researcher describes a world by specifying the size of the grid; the duration of the game; and the probability parameters governing the appearance and disappearance rates of tiles, obstacles, and holes and the distribution of hole scores and capacities. The experimenter can control additional environmental characteristics; for example, the experimenter can decide whether hole scores remain constant until the hole disappears or whether the score decreases over time.
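A complete world description is thus just a small parameter set, along these lines (parameter names invented for illustration):

```python
# Hypothetical parameter set in the spirit of a TILEWORLD world description.
world_spec = {
    "grid_size": (20, 20),
    "game_duration": 500,             # simulated time units
    "tile_appearance_prob": 0.05,     # per time unit, independent of the others
    "obstacle_appearance_prob": 0.02,
    "hole_appearance_prob": 0.03,
    "hole_score_range": (5, 50),
    "hole_capacity_range": (1, 4),
    "hole_scores_decay": False,       # scores constant until the hole disappears
}
```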


To facilitate experimentation, the system provides mechanisms for specifying suites of experiments, which can then be run without intervention, and recording performance data.

Three related qualities characterize TILEWORLD: its abstract nature, its simplicity, and its parameterized nature. TILEWORLD is not an attempt to model any particular planning domain; instead, the world might be used to pose paradigmatic planning problems in the abstract. It is a simple world that presents the agent with only a few possibilities for action; objects have few attributes, and the occurrence and effects of exogenous events are not complex. The world's simplicity means that a few parameters define a world instance completely, and these parameters can be varied as experiments are performed.

TILEWORLD was originally developed to investigate a particular agent architecture, IRMA (intelligent resource-limited machine architecture) (Bratman, Israel, and Pollack 1988), and, in fact, is distributed to the research community with an embedded IRMA agent. IRMA actually specifies a space of agent architectures; in other words, there is a range of agent architectures within the IRMA framework. The embedded TILEWORLD agent is parameterized to allow exploration of the design choices consistent with the IRMA specifications.

The interface between a TILEWORLD agent and its environment works as follows: When the agent wants to perform some action, it calls the simulator as a subroutine, specifying the action it wants to perform along with an indication of the amount of time that has elapsed since its last call (representing the amount of time it spent reasoning about what to do). The simulator then updates the world, both to reflect exogenous events that took place during that period and to reflect the agent's new actions. The resulting world is then passed back to the agent (in a data structure called the world).
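In outline, the calling convention is a synchronous loop like the following sketch (all names are hypothetical; the actual interface is a Lisp subroutine call operating on the shared world structure):

```python
import random

class ToyTileworldSimulator:
    """Minimal stand-in for the TILEWORLD simulator's subroutine interface."""
    def __init__(self):
        self.world = {"time": 0, "holes": [], "tiles": [], "agent": (0, 0)}

    def step(self, action: str, elapsed: int) -> dict:
        # Advance simulated time by the agent's reported reasoning time,
        # firing exogenous appearances for that period ...
        self.world["time"] += elapsed
        for _ in range(elapsed):
            if random.random() < 0.05:          # appearance probability
                self.world["holes"].append({"capacity": 1, "score": 10})
        # ... then apply the agent's action (movement logic elided) and
        # hand the complete, accurate world structure back to the agent.
        return self.world

sim = ToyTileworldSimulator()
world = sim.step("move-left", elapsed=3)  # "I reasoned for 3 ticks; now move left"
```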
This approach to the agent-environment interface places the responsibility for specifying sensing and effecting conditions on the agent designer. If the agent uses the world data structure directly, it will always have a complete and correct model. Incomplete or noisy sensing can be achieved by manipulating this data structure before the agent is allowed to use it. Similarly, imprecision in effecting change has to be specified within the agent itself.

NTW (Philips and Bresina 1991; Philips et al. 1991) is an independently developed test bed that is also organized around the theme of a two-dimensional grid with tiles. Exogenous events in NTW consist of winds that can blow tiles across the grid. NTW has no obstacles or holes.

Two features distinguish the two simulators. First, the NTW simulator has no built-in measure of success that is analogous to the notion of a score. What the agent is supposed to do and what constitutes success is left entirely to the experimenter. The second is the nature of the interface between the agent and its environment. The TILEWORLD agent calls the simulator as a subroutine and passes information back and forth using a shared data structure. The NTW agent and the world simulator run asynchronously: The agent posts commands to the world, which are put in a queue and eventually executed. Operators can be programmed to fail probabilistically: A grasp operation might not result in the agent holding the tile, and a move might result in the agent being displaced to an adjacent location other than the one intended. The agent is given no indication of whether an operator has succeeded or failed and must explicitly sense the world to ascertain the effects of its actions.
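The asynchronous, fallible-operator style can be sketched as follows (again an illustrative toy, not NTW's actual code, which runs the agent and world as separate asynchronous processes):

```python
import random
from collections import deque

command_queue = deque()                     # the agent posts; the world drains
FAILURE_PROB = {"grasp": 0.2, "move": 0.1}  # experimenter-set failure rates

def post(command: str) -> None:
    """Agent side: enqueue a command and keep reasoning; no reply comes back."""
    command_queue.append(command)

def world_tick(state: dict) -> None:
    """World side: execute one queued command, possibly botching it silently.
    The agent must sense the world later to learn what actually happened."""
    if not command_queue:
        return
    command = command_queue.popleft()
    if random.random() < FAILURE_PROB.get(command, 0.0):
        return                              # e.g., a grasp that misses the tile
    state["executed"].append(command)

state = {"executed": []}
post("grasp")
post("move")
world_tick(state)
world_tick(state)
```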
MICE (Montgomery and Durfee 1990; Montgomery et al. 1992) is another grid-oriented simulator, designed to support research into coordinating the problem-solving behavior of multiple autonomous agents. The basic layout of MICE consists only of a grid and various agents, although agents can be used to simulate objects, such as tiles and forest fires.

The basic MICE operator is the move command, moving the agent from one grid cell to an adjacent cell. The link command is an abstract version of a grasp operator; the agent uses it to pick up objects. The world is populated only with agents, but they can be diverse. MICE has no explicit provision for exogenous events, although they can be simulated to some extent by implementing agents that have the desired effects on the world (making a grid cell wet and slippery to simulate rain, for example).

The main difference between the MICE simulator and the NTW and TILEWORLD simulators is that MICE makes even less of a commitment to a world physics; the experimenter defines an agent's sensing and effecting capabilities and also the effect of actions taken simultaneously by the agents. MICE might be viewed more as a framework for building test beds than as a simulator in and of itself. (The MICE designers have built versions of TILEWORLD and PHOENIX using this platform. See Montgomery and Durfee [1990], for example.)


The PHOENIX Test Bed

PHOENIX (Hart and Cohen 1990; Greenberg and Westbrook 1990) is a framework for implementing and testing multiple autonomous agents in a complex environment. The scenario is fire fighting; the world consists of a map with varying terrain, elevations, and weather. Fires can start at any location and spread depending on the surrounding terrain. Agents are fire-fighting units (commonly bulldozers) that change the terrain to control the fires.

It is helpful to distinguish the PHOENIX simulator from the PHOENIX environment and PHOENIX agents. The simulator has three main functions: (1) to maintain and update the map; (2) to synchronize the activities of the environment and the agents, which are implemented as independent tasks; and (3) to gather data. The PHOENIX environment includes a representation of Yellowstone National Park (from Defense Mapping Agency data) and the tasks that implement fires. PHOENIX agents generate tasks that simulate a fire boss, several bulldozers, watchtowers, helicopters, fuel tankers, and so on. Agent tasks include moving across the map, cutting a fire line, predicting the course of fires, planning the attack on the fire by several bulldozers, monitoring progress and detecting failures in expectations, and recovering from failure. Tasks insert themselves (by sending messages) onto a timeline maintained by the PHOENIX simulation. Tasks run intermittently and sometimes periodically.
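The timeline mechanism might be pictured as a priority queue of wake-up messages, as in this toy sketch (PHOENIX itself is a Lisp system with a much richer message protocol):

```python
import heapq

timeline = []   # (wake_time, task_name, period) entries

def post_task(wake_time, name, period=None):
    """A task inserts itself onto the timeline by sending a message."""
    heapq.heappush(timeline, (wake_time, name, period))

def run_until(end_time):
    while timeline and timeline[0][0] <= end_time:
        now, name, period = heapq.heappop(timeline)
        print(f"t={now}: running task {name}")
        if period is not None:          # periodic tasks re-post themselves
            post_task(now + period, name, period)

post_task(0, "spread-fire", period=10)  # an environment task
post_task(5, "bulldozer-cut-fireline")  # an agent task that runs once
run_until(20)
```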
PHOENIX agents sense and change the PHOENIX environment by sending messages to the object managing the map, but the simulator makes no attempt to control the form of the messages. Thus, PHOENIX agents have no predetermined set of operators. The PHOENIX environment contains only two kinds of objects: agents and fires. However, each cell of the map of the environment contains information that agents and fires use to determine their behavior. For example, bulldozers travel quicker on cells that are designated blacktop road, and fires burn faster in the direction designated uphill. Exogenous events are also implemented as tasks and influence other tasks indirectly. For example, wind causes fires to burn faster.

Tasks make their effects known by sending messages to the simulator. The form of these messages is not restricted; any task can, in principle, find out anything about the world and effect any change. The simulator enforces no model of sensing. It provides information about the world (the characteristics of a cell in the map, for example) by responding to messages but does not restrict its answers. However, the PHOENIX agents have limited sensory and physical abilities; for example, bulldozers have a 200-meter radius of view (although the view is not affected by elevation), and they move and cut fire lines at rates codified by the U.S. Forestry Service.

Defining an environment consists of defining a map (the topographic features for a land area, including ground cover, elevation, roads, rivers, and buildings) and processes within the environment, such as fires and wind. Defining an agent is generally more complicated because it involves designing sensors, effectors, a planner, a reactive component, internal maps of the environment, and so on.

PHOENIX includes an experiment-running facility that includes a language for specifying scripts for changes in weather, fires starting, and other events. It also allows for agents' behavior to be monitored, producing data files that can be read by data-manipulation and statistical packages. The design of the PHOENIX system is modular, and other test beds have been developed rapidly by swapping out the Yellowstone map and the PHOENIX agent definitions and swapping in, for example, a world of shipping lanes, ports, docks, ships, and roads.
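To give the flavor of the experiment-running facility, a script is essentially a schedule of environmental events, something like the following (the event names and structure are invented for illustration; this is not PHOENIX's actual script language):

```python
# An experiment script, in the spirit of PHOENIX's scripting facility.
experiment_script = [
    (0,   "set-wind",   {"speed_kph": 10, "direction": "NE"}),
    (120, "start-fire", {"location": (430, 210)}),
    (300, "set-wind",   {"speed_kph": 35, "direction": "E"}),
]
```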
PHOENIX differs from the previous simulators in that it tries to provide a realistic simulation of a single domain rather than implement an abstract domain-independent task environment. Apart from this difference, however, it is similar to the MICE simulator in that it enforces few constraints on how agents and exogenous events can sense or change the world. The simulator maintains the map and schedules activities, but, like MICE, much of the domain's physics lies in definitions of the individual tasks.

TRUCKWORLD

TRUCKWORLD (Firby and Hanks 1987; Nguyen, Hanks, and Thomas 1993) is a multiagent test bed designed to test theories of reactive execution (Firby 1989) and provide motivating examples for a theory of reasoning about dynamic and uncertain worlds (Hanks 1993; Hanks and McDermott 1994). The main commitment is to provide a realistic world for its agents but without physical sensors or effectors.2


An agent is a truck consisting of two arms; two cargo bays; several sensors; and various other components, such as a fuel tank, a set of tires, and direction and speed controllers. It operates in a world consisting of roads and locations. Roads connect the locations, which are populated with objects. The simulator itself places few restrictions on the behavior of objects, which can be complex. TRUCKWORLD can model objects such as fuel drums, which the truck can use to increase its fuel level; tire chains, which help it drive safely down slippery roads; vending machines, which require money and produce a product; and bombs, which tend to break unprotected objects in their immediate vicinity.

Exogenous events such as rainstorms occur periodically in the world. A rainstorm makes all roads in its vicinity wet, and dirt roads become muddy for a while. The truck runs the risk of getting stuck in the mud if it travels on a muddy road without proper tires. Objects in the vicinity of a rainstorm get wet, too, which might affect their behavior (a match might not ignite anymore, a plant might start growing). The occurrence of events can depend both on random chance and on characteristics of the world (rainstorms might be more likely at certain locations or at certain times of day).

TRUCKWORLD provides a wide variety of (simulated) sensors: Cameras report visual features of objects, sonars report whether there is an object at a location, scales report an object's weight, and X-ray machines report on objects within a closed container. Sensors typically have noise parameters: A camera sometimes reports an incorrect but close color for an object, and such a report is more likely at night than during the day. A scale reports the object's true weight distorted according to a user-supplied noise distribution; a sonar occasionally incorrectly reports that an object is present.
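The flavor of these noise models is easy to capture in a sketch (illustrative only; TRUCKWORLD's sensors are experimenter-defined objects in the simulator):

```python
import random

def camera_report(true_color: str, is_night: bool) -> str:
    """Report a close-but-wrong color, with higher probability at night."""
    error_prob = 0.2 if is_night else 0.05
    if random.random() < error_prob:
        return {"red": "orange", "blue": "teal"}.get(true_color, "gray")
    return true_color

def scale_report(true_weight: float, noise_stddev: float) -> float:
    """Report the true weight distorted by a user-supplied noise distribution."""
    return random.gauss(true_weight, noise_stddev)

def sonar_report(object_present: bool, false_positive_prob: float = 0.05) -> bool:
    """Occasionally report an object that is not there."""
    return object_present or random.random() < false_positive_prob
```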
A variety of communication devices are available: Radios allow connection among agents; loudspeakers produce sounds that can be detected by microphones in the vicinity. Motion detectors notice when objects appear or disappear from their immediate vicinity. Tape recorders are activated when a sound is produced, and an agent can retrieve the recorded message later.

Communication between an agent and the simulator is tightly controlled: Each agent and the simulator itself run as separate processes, communicating over two channels. The agent performs actions and gets sensor reports over the command channel and uses the control channel to manipulate the simulator's internal state (for example, to connect or disconnect from the simulator, to advance the simulator's clock, or to collect statistics about the world). Multiple agents communicate using only the communication devices the world provides for them.
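The two-channel protocol might be pictured as follows (a single-process toy; the real system runs the agent and simulator as separate communicating processes):

```python
from queue import Empty, Queue

command_channel: Queue = Queue()   # actions out, sensor reports back
control_channel: Queue = Queue()   # requests about the simulation itself

class ToySimulator:
    def __init__(self):
        self.clock = 0

    def service(self):
        try:   # control requests touch simulator state, not the world
            request, arg = control_channel.get_nowait()
            if request == "advance-clock":
                self.clock += arg
        except Empty:
            pass
        try:   # command requests act in the world and yield a sensor report
            action, arg = command_channel.get_nowait()
            return {"report": f"{action} {arg} completed", "clock": self.clock}
        except Empty:
            return None

sim = ToySimulator()
control_channel.put(("advance-clock", 5))
command_channel.put(("drive", "route-17"))
print(sim.service())   # {'report': 'drive route-17 completed', 'clock': 5}
```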
There were two main goals in designing TRUCKWORLD: (1) to provide a test bed that generates interesting problems both in deliberative and in reactive reasoning without committing to a particular problem domain and (2) to provide significant constraints on the agent's effecting and sensing capabilities and on the causal structure of the world but still allow the system to be extended to meet the designer's needs.

TRUCKWORLD occupies a position between simple abstract simulators such as TILEWORLD and NTW, a domain-specific simulator such as PHOENIX, and a test-bed-building platform such as MICE. TRUCKWORLD implements a specific set of operators for the agent (unlike MICE) but provides fewer constraints than do TILEWORLD or PHOENIX on the nature of the other objects in the world and on the interaction between the agent and these objects.

Summary

We looked at five systems for implementing planning test beds: the parameterizable TILEWORLD and NTW, the multiagent MICE platform, the PHOENIX fire-fighting simulation, and the TRUCKWORLD simulator. Although there are many differences in what features each system offers and what design decisions each makes, we can identify three main areas in which the systems differ:

Domain dependence: PHOENIX strives for a realistic depiction of a single domain, and TILEWORLD, NTW, and MICE try to describe abstract worlds and operators that affect the world. There is an obvious trade-off in this decision: A researcher using a domain-dependent simulator might be able to demonstrate that a program is an effective problem solver in the domain but might have difficulty going on to conclude that the architecture is effective for dealing with other domains. A researcher using an abstract simulator might be able to build a system based on what he or she judges to be general problem-solving principles, but then the difficulty is in establishing that these principles apply to any realistic domain.

Definition of sensors and effectors: The question arises about whether or to what extent the simulator should define the agent's sensing and effecting capabilities.


At one extreme, we have the PHOENIX simulation, which does not itself impose any constraints on environment dynamics or the information agents can find out about their environment. All such constraints are specified in the agent definitions and are merely enforced by the simulator. MICE and NTW represent the other extreme: The simulator defines an agent, as well as a world physics, supplying a set of sensing and effecting operations as part of the world. TRUCKWORLD partially defines the truck's effecting capabilities; it defines a set of primitive commands, but the exact effect of these commands depends on the objects being manipulated. Objects and their interactions are defined by the experimenter. TRUCKWORLD does not, however, define a set of sensing operations. Sensors are objects defined by the experimenter that happen to send sensory information back over the command channel.

Parameterizability: TILEWORLD and NTW have a built-in set of parameters that characterize the behavior of a world. These parameters facilitate experimentation; by varying the world's parameters systematically and matching them against various agent designs, one might be able to come up with agent types that perform well for particular world types. The price one pays for this ability to perform experiments is in simplicity and control. A world that is fully characterized by a small number of parameters must be simple, and furthermore, the parameters must characterize completely the nature of the agent's behavior in this world. PHOENIX allows the experimenter to specify values for parameters such as wind speed and also to write scripts for how parameters change over time during an experiment. PHOENIX also provides a mechanism called "alligator clips" for recording the values of parameters during experiments.

We are again faced with a trade-off: In TILEWORLD, NTW, and PHOENIX, one might be able to demonstrate a systematic relationship between a world's characteristics and an agent's performance. Such demonstrations, however, must be supplemented with convincing arguments that these relationships will be mirrored in a more realistic world, and it is far from easy to make such arguments. In TRUCKWORLD, one can demonstrate that the agent performs well on more complex problems, but it might be difficult to demonstrate precisely the reasons for this success and to apply these reasons to other domains.

Discussion

The discussion to this point has touched on mainly uncontroversial points: the need for introducing more rigorous empirical methods into planning research and the roles that test-bed environments and benchmark tasks might play. The question of what the ultimate goal of these research efforts is, as well as how the goal might best be pursued, is the subject of some disagreement among the authors. The following three subsections reflect this disagreement and represent the authors' personal opinions. In the first subsection, Hanks argues against a program of controlled experimentation in small, artificially simple worlds. Pollack defends such a program in the second subsection. In the third, Cohen addresses the problem of generalizing results from test-bed experiments.

The Danger of Experimentation in the Small (Steve Hanks)

The planning community has been pushed (or has pushed itself) in two directions recently, and these directions seem at odds. We see pressure to apply our representations and algorithms to more realistic domains, and at the same time, we feel the need to evaluate our systems more rigorously than by announcing a system's ability to solve a few small, carefully chosen problems. The problem is that programs that operate in more realistic domains tend to be bigger and more complicated, and big complicated programs are more difficult to understand and evaluate.

In writing this article, we agreed on the following two objectives as researchers in planning: (1) to build systems that extend the functions of existing systems (that solve larger or more complicated problems or solve existing problems better) and (2) to understand how and why these systems work. Further, running experiments is a good way (although not the only way) to accomplish the goal of understanding the systems we build. We tended to disagree, however, on the best way to achieve these objectives, in particular on the issues of what form an experimental methodology should take and what role it should play in the system-building process.

Here I discuss a particular methodological approach, which I call experimentation in the small. Langley and Drummond (1990) advocate this position in the abstract. Pollack and Ringuette (1990) and Kinny and Georgeff (1991) explore it concretely using an implemented test bed and a suite of experiments. I take the methodological commitments of this approach to be the following:


First, the researcher conducts experiments in a test-bed world that is significantly simpler than the world in which the agent is ultimately to be deployed. In particular, the world is supposed to exhibit particular interesting characteristics but will be artificially simple in other aspects.

Second, the test bed provides a set of parameters that govern the world's behavior. Experimentation is a process of matching characteristics of the agent's problem-solving methods with the world's parameter values; the goal of experimentation is to discover relationships between these two sets of characteristics that predict good (or bad) performance.

The main point of my discussion is that experimentation in small, controlled worlds is not, in and of itself, an effective way to establish meaningful relationships between agents and their environments. I show that the nature of the relationships established by these experiments is inherently connected with the implementation details both of the agent and of the test-bed worlds. The hard part remains: generalizing beyond the particulars of the world or even arguing that a particular test-bed world is appropriate for studying a particular agent architecture. The experiments themselves do not provide guidance in this task and might even tend to hinder it.

I use the TILEWORLD test bed and experiments from Pollack and Ringuette (1990) and Kinny and Georgeff (1991) to make these points. My goal in doing so is not to single out this particular work for criticism. I do so, first, because it's important to discuss the concrete results that can be expected from these experimental endeavors, and second, because these two pieces of work are rare examples of systematic experimentation with agent architectures in small, controlled worlds.

The Original TILEWORLD Experiments

The planning agent studied in Pollack and Ringuette (1990) is an implementation of the IRMA architecture (Bratman, Israel, and Pollack 1988). One of the key ideas advanced in their paper is that one way for agents to cope with a changing environment is to filter out (avoid considering) options that conflict with their current intentions (the filtering mechanism) unless the option is especially promising (the filter-override mechanism). IRMA also suggests that the agent separate what to do (deliberation) from how to do it (planning). The TILEWORLD agent thus chooses its actions in two phases: The deliberation phase chooses a hole to fill (we call it the current hole), then the planning phase plans a sequence of moves that can fill the current hole. The agent's filter-override mechanism governs when the agent abandons the current hole in favor of a new alternative.

The TILEWORLD agent has three components: First is the filter-override mechanism, a test applied to a newly appeared hole that determines whether the task of filling the current hole should be reconsidered in light of the new option(s).3 Only one filter-override mechanism was implemented: a threshold v such that a new hole would be considered as an alternative to the current hole just in case its score exceeded the score of the current hole by at least v points.

Second is the deliberator, a procedure that chooses the next hole to work on. Two alternatives were implemented. The simpler (highest score, or HS) deliberator always chooses the hole with the highest score. The more complicated (likely value, or LV) deliberator divides the hole's score by an estimate of the cost of filling it: the sum of the distances of the n closest tiles, where n is the hole's capacity.

Third is the path planner. Given a hole to fill, the path planner uses breadth-first search to generate the optimal sequence of moves to fill it with tiles. The choice of a path planner was not among the agent parameters varied experimentally; only the optimal path planner was implemented.
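In outline, the three components look something like this sketch (a Python paraphrase with invented names; the actual agent was implemented in Lisp, with the path planner in C):

```python
def filter_override(new_hole, current_hole, v: float) -> bool:
    """Reconsider the current hole only if the newcomer beats it by >= v points."""
    return new_hole["score"] >= current_hole["score"] + v

def hs_deliberator(holes):
    """HS: always pick the highest-scoring hole."""
    return max(holes, key=lambda h: h["score"])

def lv_deliberator(holes, tile_distances):
    """LV: score divided by an estimated fill cost, taken to be the summed
    distances of the n closest tiles, where n is the hole's capacity."""
    def likely_value(h):
        cost = sum(sorted(tile_distances)[: h["capacity"]])
        return h["score"] / cost if cost else float("inf")
    return max(holes, key=likely_value)

# The path planner (breadth-first search for the optimal tile-pushing moves)
# is not sketched here; it was fixed, optimal, and not varied experimentally.
```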
The experiments show the following results: (1) An agent that acts in parallel with reasoning performs slightly better than an agent that acts and reasons serially. (2) The more sophisticated LV deliberator performs somewhat better than the simpler HS deliberator. (3) The filter-override mechanism at best has no effect on the agent's performance and, in some cases, makes it perform worse.

Hanks and Badr (1991) analyze these experiments in detail. Here, I want to discuss some issues relevant to the question of what this experimental paradigm can be expected to accomplish. In particular, I want to stress the need for caution in interpreting these results. There is a large gap between the effort's larger goal of establishing general relationships between agent designs and environmental conditions and the information that is actually presented in the paper. I don't see this gap as a fault of the paper (which presents preliminary work), but it is important to keep the results in perspective.

The connection between a general architecture for problem solving (in this case, IRMA) and the particular results reported must be interpreted taking into account many design and implementation decisions: (1) the way in which the IRMA architecture was realized in the TILEWORLD agent (for example, equating deliberation with choosing which hole to fill and planning with generating a sequence of moves to fill the hole), (2) the implementation of these modules in the TILEWORLD agent (for example, what the specific algorithms for deliberation and path planning are and how they interact), and (3) the implementation of the TILEWORLD simulator (for example, the choice of what environmental parameters can be varied, the interaction among the different parameters and between the agent and the simulator, and the simplifying assumptions built into the world itself).

Consider the first result, for example, and the broader conclusions we might be able to draw from it. TILEWORLD uses a simulated notion of serial and parallel reasoning. In fact, the act cycle and reasoning cycle run sequentially, but they are constrained to take the same amount of time. Is this implementation detail important in assessing the benefit of acting in parallel with reasoning? I'm not sure. In the current implementation, the agent cannot be interrupted during its reasoning cycle by changes to the world that occur during the concurrent act cycle. This deviation from truly parallel reasoning and acting strikes me as significant. In any event, the speedup result must be interpreted with an understanding of the particular implementation and cannot be interpreted more broadly without further analysis.

The second result, suggesting that the LV deliberator performs better than the HS deliberator, must also be interpreted in the context of the particular implementation. Hanks and Badr (1991) note that one part of the TILEWORLD agent is the path-planning algorithm, which (1) solves the problem optimally; (2) is not subject to experimental variation; and (3) is written in C, presumably for efficiency reasons. To what extent do the experimental results depend on the ability to solve the path-planning subproblem quickly and optimally? Hanks and Badr (1991) show that the fast path planner has a greater effect on the system's performance than does variation in the deliberator (which was one of the parameters varied experimentally). Given this fact, we should be cautious about interpreting the experimental result too broadly. Would an agent actually benefit from a more sophisticated deliberator if it were unable to solve the path-planning subproblem quickly and optimally? This question would have to be answered to apply the result beyond the specific implementation and experimental setting examined.

The final result, that the filter-override mechanism does not generally improve the agent's performance, strikes me as the one most closely related to the specific agent and environment implementations. Hanks and Badr recognize the problem that the environment did not challenge the deliberator, thus rendering a fast preliminary filtering mechanism unnecessary. They propose making the environment more challenging, specifically by making the world change more quickly (that is, by changing the parameters that govern the world's behavior).

Another interpretation of the same result is that TILEWORLD is inherently not a good test of an IRMA-like filtering mechanism. The justification for a filtering and override mechanism is that the filter override benefits the problem solver when deliberation is complex and difficult but, at the same time, when deliberation at least potentially benefits the planner significantly.

Put another way, deliberation is really a matter of predicting the future state of the world and choosing one's actions to maximize utility given the predicted future. The problem with TILEWORLD is that there is little to predict. Tiles appear and disappear at random and with no pattern. The effects of the agent's actions are localized. On balance, there is little to be gained from thinking hard about the world, which Hanks and Badr (1991) show by demonstrating that there is little benefit to be had even by implementing a deliberator that computes the agent's optimal course of action given current information. If deliberation is either easy to do or doesn't benefit the agent significantly, then there is no need for a surrogate for deliberation such as the filter override. Hanks and Badr mention the possibility of making the deliberation process more expensive but not the possibility of changing the world (for example, giving it more causal structure or making the agent's reward structure more complex) to give more potential payoff to the deliberation process.

The point of this discussion is to demonstrate the difficulty of interpreting experimental results such as those reported in Pollack and Ringuette (1990) or, more specifically, the difficulty associated with applying the results to any circumstances other than those under which the experiments were conducted.
28 AI MAGAZINE
Articles

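The simulated parallelism discussed above can be made concrete with a minimal sketch. The code below is an illustration on assumptions of ours, not the actual TILEWORLD code: the reasoning and act cycles run sequentially but are each charged the same amount of simulated time, and the world is frozen (the agent is uninterruptible) while it reasons. All names (reason_step, act_step, and so on) are hypothetical, and the grid is abstracted to a line.

    import random

    def reason_step(world):
        # Hypothetical deliberation: choose the hole closest to the agent.
        holes = world["holes"]
        return min(holes, key=lambda h: abs(h - world["agent"])) if holes else None

    def act_step(world, target):
        # Hypothetical action: move one square toward the chosen hole.
        if target is not None and target != world["agent"]:
            world["agent"] += 1 if target > world["agent"] else -1

    def exogenous_change(world):
        # Holes appear and disappear at random, with no causal structure.
        if random.random() < 0.2:
            world["holes"].append(random.randint(-10, 10))
        if world["holes"] and random.random() < 0.1:
            world["holes"].pop(random.randrange(len(world["holes"])))

    world = {"agent": 0, "holes": [3, -5]}
    clock = 0
    while clock < 100:
        target = reason_step(world)  # reasoning cycle: world is frozen here
        clock += 1                   # ...but it is charged one full tick
        act_step(world, target)      # act cycle: same simulated cost
        exogenous_change(world)      # the world changes only between cycles
        clock += 1

In a sketch of this form, the world never changes during reason_step, which is exactly the deviation from truly parallel reasoning and acting noted above.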
The point of this discussion is to demonstrate the difficulty of interpreting experimental results such as those reported in Pollack and Ringuette (1990) or, more specifically, the difficulty associated with applying the results to any circumstances other than those under which the experiments were conducted. In the next subsection, I discuss the implications of the general paradigm of experimentation in the small, but first, I want to discuss some follow-up experiments in the TILEWORLD environment.

Subsequent TILEWORLD Experiments

The experiments in Pollack and Ringuette (1990) tried to establish a relationship between the agent's commitment to its current plan (its willingness to abandon its current goal to consider a new option, or boldness as it was called) and the rate at which the world changes. Kinny and Georgeff (1991) try to make this relationship precise and provide additional empirical support. They begin their experimental inquiry by further simplifying the test-bed world: "[The TILEWORLD] was considered too rich for the investigative experiments we had planned. Therefore, to reduce the complexity of the object-level reasoning required of our agent, we employed a simplified TILEWORLD with no tiles" (Kinny and Georgeff [1991], p. 83). The agent's task in this simplified TILEWORLD is to move itself to a hole on the board, at which point it is awarded the hole's score. The agent is provided with perfect, immediate, and cost-free information about the world's current state.

Once again, the planning agent is configured around the tasks of deciding which hole to pursue, deciding which path to take to the chosen hole, and deciding whether to pursue a new hole that appears during execution. The agent always chooses the hole with the highest ratio of score to distance. It adopts a new hole according to its filter-override policy, also called its degree of commitment or degree of boldness. Degree of boldness is a number b: the agent automatically reconsiders its choice of hole after executing b steps of its path toward the hole it is currently pursuing. A bold agent, therefore, tends to make a choice and stick with it. A cautious agent tends to reconsider more often and is more likely to abandon its old choice of hole in favor of a new one.

Another agent parameter is its planning time, a number p set by the experimenter. The path planner produces an optimal path to the current hole, and the planning-time parameter dictates that it took p time units to do so. It is important to point out two things. First, the number p bears no necessary relationship to the amount of time that it actually takes to generate the plan. Planning time is a constant set by the experimenter and does not depend on the time it takes to build a plan for the current path. Second, increasing or decreasing p has no effect on solution quality. Kinny and Georgeff are not exploring the trade-off between planning time and plan quality. The path planner always returns an optimal path; the planning-time parameter makes it seem like it took p time units to do so.

A single parameter causes variation in the world: γ, the ratio of the rate at which the world changes to the agent's clock rate. Large values of γ indicate that the world changes frequently relative to the amount of time it takes the agent to act. The agent's effectiveness is measured by dividing the number of points the agent actually scores by the sum of the scores for all the holes that appear during the game.

The experiments showed various relationships between effectiveness, rate of world change, commitment, and planning time: (1) Effectiveness decreases as the rate of world change (γ) increases. (2) As planning time approaches 0, an agent that reconsidered its options (choice of hole) after every step performs better than an agent that never reconsidered. (3) As γ increases, an agent that reconsidered often tends to perform better than an agent that reconsiders infrequently, planning time held constant. (4) When the cost of planning is high, an agent that reconsiders infrequently tends to perform better than one that did so frequently, rate of world change held constant.

These experiments used an agent that reconsidered its current hole after a fixed number of steps b. If the agent instead reconsidered its choice of path either after b steps or at the time the target hole disappeared, then the bold agent outperformed the cautious agent, regardless of the values of p and γ. Performance was improved further by reconsidering the choice of target when a hole appeared closer to the agent than the current target.4
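The interaction of the quantities just defined (b, p, γ, and effectiveness) can be seen in a compressed simulation sketch. This is our own illustration, not the Kinny and Georgeff simulator: the board is abstracted to a line, γ is treated as a per-tick probability of change, and the "optimal" path is free, with planning merely charged p ticks, as described above.

    import random

    class Hole:
        def __init__(self):
            self.pos = random.randint(-20, 20)
            self.score = random.randint(1, 10)

    def run_trial(b, p, gamma, horizon=500):
        agent, clock, scored = 0, 0.0, 0
        holes = [Hole()]
        appeared = list(holes)           # every hole that ever appeared
        target, steps_since_plan = None, 0
        while clock < horizon:
            if random.random() < gamma:  # world changes at rate gamma
                h = Hole(); holes.append(h); appeared.append(h)
            if holes and random.random() < gamma / 2:
                holes.remove(random.choice(holes))
            # Reconsider after b steps (degree of boldness) or if the
            # target hole has disappeared.
            if target not in holes or steps_since_plan >= b:
                if holes:
                    target = max(holes,  # highest ratio of score to distance
                                 key=lambda h: h.score / (1 + abs(h.pos - agent)))
                    clock += p           # planning charges exactly p ticks
                    steps_since_plan = 0
                else:
                    target = None
            if target is not None:
                agent += (target.pos > agent) - (target.pos < agent)
                steps_since_plan += 1
                if agent == target.pos:
                    scored += target.score
                    holes.remove(target)
                    target = None
            clock += 1
        total = sum(h.score for h in appeared)
        return scored / total if total else 0.0   # effectiveness

    print(run_trial(b=4, p=2.0, gamma=0.1))

Under this reading, the relationships reported above correspond to how the returned effectiveness varies as b, p, and γ are swept.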
Once again, I want to point out the difficulty in applying these results to situations other than the specific experimental environment. Doing so requires evaluating what the simplifications are to TILEWORLD and how they affect the complexity of the deliberation task, evaluating how well the definitions of boldness and planning time apply to different domains, and so on. To what extent does the last result, for example, depend on the fact that the agent was provided with complete, instantaneous, correct, and cost-free information about changes to the world?

Analysis

How do these experiments advance the cause of building intelligent agents? I think it's clear that the agents presented in these papers do not, in and of themselves, constitute significant progress. Both operate in extremely simple domains, and the actual planning algorithm consists of using a shallow estimate of a hole's value to focus the agent's attention, then applying an optimal algorithm to plan a path to the chosen hole. This strategy is feasible only because the test bed is so simple: The agent has, at most, four possible primitive actions; it doesn't have to reason about the indirect effects of its actions; it has complete, perfect, and cost-free information about the world; its goals are all of the same form and do not interact strongly; and so on.

The argument must therefore be advanced that these experimental results will somehow inform or constrain the design of a more interesting agent. Such an argument ultimately requires translating these results into general relationships that apply to significantly different domains and agents, and I pointed out how tricky it will be to establish any applicability beyond the experimental test bed itself. A crucial part of this extensibility argument will be that certain aspects of the world (those that the test bed was designed to simulate more or less realistically) can be considered in isolation; that is, studying certain aspects of the world in isolation can lead to constraints and principles that still apply when the architecture is deployed in a world in which the test bed's simplifying assumptions are relaxed.

Finding a general and useful interpretation for experimental results is a crucial part of the process of controlled experimentation. One immediately faces the trade-off between stating the relationships in such a way that they are not so general as to be uninformative and stating them so that they are not so specific that they don't generalize outside the particular agent and world in which the experiments were conducted.

Both TILEWORLD papers discuss the difference between a bold and a cautious agent, for example. These general terms are supposed to suggest an agent's willingness to reassess its plan commitments as it executes its plans: A bold agent rarely reconsiders its plans; a cautious agent does so frequently.5

The two main results from Kinny and Georgeff (1991) can be stated as follows: First, it's a good policy for an agent to be more cautious as the world changes more rapidly. In other words, planning ahead doesn't do much good when the world changes a lot before the plan is executed or while the plan is being executed. Second, it's a good policy for an agent to rethink its commitment to a goal when the goal disappears or when a goal appears that superficially looks more promising than its current goal. Both results turn out to be robust, holding as various other parameters both of the agent and of the world are varied. Stated this way, the relationships seem pretty straightforward; I would be surprised to hear about an agent that did not adopt these policies either explicitly or implicitly. The question is, therefore, whether the relationships stated in these general terms provide significant guidance to those that build other agents.

Of course, the first relationship can be restated much more specifically, mentioning the agent's goals (to move to holes), its problem-solving strategy (to choose a hole using a heuristic, then plan a path optimally to the hole, using exactly p units of time to do so), its definition of boldness (the number of operators it executes before replanning), and the nature of its world-change parameter (the rate at which holes randomly appear). Interpreted in this light, the result is much less obvious. It does provide significant guidance to somebody who wants to design an agent using the architecture so described, to act effectively in a world so described, given problems of the sort so described, but nobody really wants to do it. Thus, the problem is how to interpret this more specific relationship in a broader context. What if the agent doesn't have immediate, perfect, and cost-free information about the appearance of holes? What if the designer does not have an optimal and efficient path planner at his/her disposal? What if the appearance of holes is not truly random but operates according to some richer causal structure? Do the same relationships still hold? For that matter, are the same relationships even meaningful?
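To see how much test-bed specificity the operational reading of such a policy carries, it can be written down directly. The rendering below is a hypothetical one of ours (hole objects with pos and score fields and a shallow "promise" estimate), not code from either paper:

    def should_reconsider(current_hole, holes, steps_since_plan, b, agent_pos):
        # Goal disappeared: always rethink the commitment.
        if current_hole is None or current_hole not in holes:
            return True
        # Boldness limit: b steps executed since the last deliberation.
        if steps_since_plan >= b:
            return True
        # A superficially more promising goal has appeared.
        def promise(h):
            return h.score / (1 + abs(h.pos - agent_pos))
        return any(promise(h) > promise(current_hole) for h in holes)

A bold agent corresponds to a large b and a cautious agent to a b near 1; the variants Kinny and Georgeff studied differ in which of these triggers are enabled.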
The main point here is that experimentation does not provide us automatically with meaningful relationships between agents and environments. Claiming that a specific experimental relationship establishes a connection between boldness and the rate of world change constitutes a form of wishful thinking.6 It translates a specific relationship between a particular implemented agent and a particular simulated world into terms that are intuitive, broad, and imprecise. Giving intuitive names to these characteristics and to their relationship does not make them meaningful or broadly applicable. The real contribution of such an analysis would be to come up with the right way of characterizing the agent, the world, and their relationship in terms that are not so specific as to be applicable only to the experimental domain but not so vague as to be vacuously true. Thus far, the experimental work has not focused on this question; in fact, it's worth asking whether running experiments in artificially small, simple worlds is the right place to start looking for these relationships at all.

Examining Environmental Features in Isolation

I turn now to the second assumption underlying experimentation in the small: A particular characteristic of a realistic world can be studied in isolation, and good solutions to the restricted problem lead to good solutions in the realistic world. TILEWORLD, for example, focuses on unplanned change in the form of the random appearance and disappearance of tiles and holes (or just holes in the case of the simplified TILEWORLD) but simplifies away other aspects of the world.

This scaling assumption is absolutely crucial to the whole experimental paradigm, and I have not seen it defended in the literature. In fact, the only explicit mention of the assumption I have found appears in Philips et al. (1991, p. 1):

    We are not suggesting that studies of these attributes in isolation are sufficient to guarantee the obvious goals of good methodology, brilliant architectures, or first-class results; however, we are suggesting that such isolation facilitates the achievement of such goals. Working on a real-world problem has obvious benefits, but to understand the systems that we build we must isolate attributes and carry out systematic experimentation.

My own work leads me to believe that it will be difficult to isolate particular aspects of a large planning problem. In Hanks (1990b), for example, I confront the problem of reasoning about plans in an uncertain world. Unplanned, random change (such as tiles and holes appearing and disappearing) is one source of uncertainty, but there are others: The agent can have incomplete or incorrect information about the world's initial state, have an incomplete model of its own actions, and might not have enough time to consider explicitly every outcome of its plan. I see no way to separate one of these factors from the others in any principled way; therefore, I see no way that studying the simplified problem of a world in which all uncertainty is the result of unplanned, random change can shed light on the larger problem of reasoning about plans in an uncertain world.

It's not even clear whether the problem that the TILEWORLD papers claim to be investigating (the decision of when it is advantageous to act as opposed to deliberate) can be considered in a context in which all exogenous change is random. The decision about whether to plan or act depends both on the world and on the agent's ability to predict the world; the better it is at reasoning about the effects of its actions, the more benefit can be derived from thinking ahead.

TILEWORLD trivializes the prediction process by making the world essentially unpredictable: Tiles and holes appear and disappear at random. The agent, therefore, has no incentive to reason about what tiles might appear or disappear or where they might appear, which greatly simplifies the question of whether it should deliberate or act. Can we therefore apply the experimental results established in TILEWORLD to worlds in which prediction is a difficult problem?7

Experimentation in the small depends on the ability to study particular aspects of a realistic world in isolation and to apply solutions to the small problems to a more realistic world. I have seen no indication that such studies can in fact be performed; in fact, neither TILEWORLD paper argues that random, unplanned change is a reasonable feature for isolated study. An experimenter using these worlds, therefore, runs the risk of solving problems in a way that cannot be extended to more realistic worlds and, at the same time, of making his/her job artificially difficult for having studied the problem in isolation. Kinny and Georgeff (1991) state that "[simulated worlds] should ideally capture the essential features of real-world domains while permitting flexible, accurate, and reproducible control of the world's characteristics" (p. 82).
Their proposition is appealing, but the fact is we don't know what it means to capture the essential features of real-world domains, much less whether it is possible to do so in a system that allows reproducible control of the world's characteristics. Conducting experiments in small, controlled worlds carries with it the responsibility of considering the implications of the simplifications that were made to allow the experimentation in the first place.

However, at this point, we must remind ourselves of our ultimate goals: to build systems that solve interesting problems and to understand why they do so. Research decisions must be oriented toward solving problems, not toward satisfying methodological goals. The ultimate danger of experimentation in the small is that it entices us into solving problems that we understand rather than problems that are interesting. At best, it gives the mistaken impression that we are making progress toward our real goal. At worst, over time it confounds us to the point that we believe that our real goal is the solution of the small, controlled problems.

Conclusion

In no way should this section be taken as an argument against using experimental methods to validate theories or programs. In fact, I think the need for experimentation is manifest; we need to understand why and how well our ideas and our architectures work, and we will not always be able to do so using analytic methods. Neither am I opposed to conducting these experiments in controlled, overly simplified worlds. I can imagine, for example, a researcher implementing some idea in a system, then building a small world that isolates the essence of this idea, then using the small world to explore the idea further. I object, however, when attention turns to the experimentation process itself instead of the ideas that are to be tested and when the assumptions inherent in the small world are adopted without regard to the relationships the world is supposed to demonstrate.

The ultimate value (arguably the only value) of experimentation is to constrain or otherwise inform the designer of a system that solves interesting problems. To do so, the experimenter must demonstrate three things: (1) his/her results (the relationships he/she demonstrates between agent characteristics and world characteristics) extend beyond the particular agent, world, and problem specification studied; (2) the solution to the problem area studied in isolation will be applicable when the same problem area is encountered in a larger, more complex world; and (3) the relationship demonstrated experimentally actually constrains or somehow guides the design of a larger, more realistic agent. The experimental work I have seen has addressed none of these questions.

I originally stated that our two objectives as researchers are (1) building interesting systems and (2) understanding why they work. It seems to me that experimentation in the small adopts the position that these goals should be tackled in reverse order, that you can understand how an interesting system must be built without actually building one. I don't believe this case to be so; rather, we should be building systems and then applying analytic and experimental tools to understand why the systems did (or did not) work.

The Promise of Experimentation (Martha E. Pollack)

Steve Hanks believes that experimentation in the small is a dangerous enterprise. I believe that to the contrary, controlled experimentation (small, medium, or large) promises to help AI achieve the scientific maturity it has so long sought. In these comments, I try to defend this belief.

In his section entitled The Danger of Experimentation in the Small, Hanks begins by stating that the primary objectives of those studying agent design are "(1) to build systems that extend the functionality of existing systems . . . and (2) to understand how and why these systems work" (emphasis mine). I would put the second point somewhat differently and claim that we aim to understand how and why such systems can work. This change is not a minor matter of wording but, rather, a fundamental disagreement about research methodology. Hanks believes that complex-system building must precede experimental analysis, but I believe that these two activities can and should proceed in parallel. Hanks does not object to all experimentation, only to experimentation in the small, that is, experimentation using simplified systems and environments. I claim not only that such experimentation can be informative, but that given our current state of knowledge about system design, controlled experimentation often requires such simplifications. Thus, in my view, Hanks's position is tantamount to an injunction against all experimentation in AI; in other words, it is a call for the maintenance of the status quo in AI methodology.
It is important to be clear about what constitutes an understanding of how and why certain autonomous agents work. In my view, this understanding consists of a theory that explains how alternative design choices affect agent behavior in alternative environments; that is, it will largely consist of claims having the form, "A system with some identifiable properties S, when situated in an environment with identifiable properties E, will exhibit behavior with identifiable properties B."8

The goal of experimentation in AI (and arguably, a primary goal of the science of AI taken as a whole) is to elucidate the relationships between sets of properties S, E, and B, as previously defined. I argue that for experimentation to succeed in meeting this goal, two types of simplification must be made. The first type is inherent in the notion of experimental design. Experimentation necessarily involves selective attention to, and manipulation of, certain characteristics of the phenomena being investigated. Such selectivity and control constitute a type of simplification of the phenomena. The second type of simplification that is currently needed arises from our existing abilities to build complex AI systems. Large, complex systems that tackle interesting problems are generally not principled enough to allow the experimenter to meaningfully probe the design choices underlying them. Moreover, they are designed for environments in which it might be difficult or impossible to isolate, manipulate, and measure particular characteristics. Finally, these systems do not generally include instrumentation to measure their performance, although it is conceivable that in many cases, this instrumentation could be added in a fairly straightforward way. Thus, these systems do not allow the experimenter sufficient access at least to S and E and, possibly, also to B; they are, in short, ill suited for controlled experimentation. In contrast, the kinds of simplified systems we described in Test-Bed Implementations, that is, test beds such as TILEWORLD, NTW, TRUCKWORLD, and PHOENIX and their embedded agents, are designed specifically to provide the control needed by the experimenter.

Hanks correctly notes that the simplifications required for experimentation introduce methodological challenges. In particular, he points out the issue of generalizability: How can a researcher guarantee that the simplifications made in the design of an experiment do not invalidate the generality of the results obtained? I believe that this issue is a serious one that poses a significant challenge to AI researchers. Moreover, I agree with Hanks that by and large, the controlled experimentation that has been performed to date in agent design (including my own work) has not adequately met this challenge. This inadequacy, however, is a result of the fact that so far painfully little controlled experimentation has been conducted in AI; as Hanks notes, the TILEWORLD experiments represent "relatively rare examples of systematic experimentation with agent architectures."9 It is extremely difficult and often impossible to have confidence in the generality of the results obtained from a few experiments. The desire for robust, generalizable results should lead us to do more, not less, experimentation.

The problem of generalizability is not unique to AI; it is inherent in the experimental methodology, a methodology that has been tremendously successful in, and indeed is the cornerstone of, many other sciences. I see nothing in AI's research agenda that would preclude its also benefiting from controlled experimentation. Of course, adopting the experimental method entails adapting it to the particulars of the AI research program. In my comments to follow, I give some necessarily sketchy suggestions about how we might adapt the methodology and, in particular, how the challenge of generalizability can be met in AI. Following Hanks, I also use TILEWORLD as an example.

Simplification in Experimentation

"Simplification, paring back the variables, far from invalidating results, is indeed required by the foundations of empirical design. The success of reductionism depends on measuring and reporting only that bit of cloth that can be understood and tested piecemeal" (Powers [1991], p. 355).
Experimentation mandates simplification. In investigating a complex phenomenon, the experimenter selectively attends to some aspects of it, namely, those that he/she believes are relevant to his/her hypotheses. He/she exerts control over those aspects of the phenomenon, manipulating them as necessary to test his/her hypotheses. At the same time, he/she holds constant those influences that he/she believes are extraneous to his/her hypotheses and allows or even forces random variation in those influences that he/she believes are noise. This selective attention to, and intentional manipulation of, certain aspects of the phenomenon is the "paring back [of] the variables" noted in the previous quotation.

Does Hanks object to simplification as such; that is, does he believe that to be useful, a hypothesis about agent design cannot make reference only to some aspects of an agent's architecture or environment? Although he appears to be inclined toward this conclusion when he asserts his belief that it will be quite difficult to isolate particular aspects of a large planning problem, this objection is not his primary one. Rather, what he views as dangerous is a particular way of achieving simplification in research on agent design, namely, by conducting experiments using highly simplified agents operating in highly simplified environments. This reliance on simplification defines what he terms experimentation in the small. Hanks's introductory comments mention only objections to the use of simplified environments, but his criticisms of the TILEWORLD experiments show that he also objects to the use of highly simplified agents.

I alluded earlier to my belief that it is necessary to make significant simplifications in the agents and environments we use in conducting experimentation. Large, realistic systems have generally been built without the benefit of a principled understanding of agent design (precisely what experimentation, supplemented with theorizing, aims at). As a result, it is extraordinarily difficult to determine which mechanisms of these complex systems are responsible for which aspects of their behavior, in other words, to isolate the key properties of S and B. It is difficult to determine what in the system is essential to the observed behavior and what is instead an artifact of the way the system happened to be implemented. In addition, when these systems are deployed in real environments, there is no ready way to isolate and control key features of these environments, that is, to get a handle on E.

The test beds that we surveyed in Test-Bed Implementations are designed specifically to provide the researcher with the control needed to conduct experimentation: to enable him/her to control and monitor the conditions of the environment and to measure the behavior of a system embedded in the environment. In other words, a useful test bed will give the researcher a handle on B and on E. To give the researcher a way to measure B, the test-bed designer specifies what counts as successful behavior in the test-bed environment and provides instrumentation that measures success. To give the researcher a way to control and monitor E, the test-bed designer selects some set of environmental features and provides instrumentation that allows the researcher to control these. One potential objection is that the test-bed designer thereby influences the experiments that can be conducted using the test bed; researchers might want to study other characteristics of B and E than those identified by the test-bed designer. However, this problem only exists if researchers are mandated to use particular test beds. The problem disappears if we leave the decision about which test bed to use to individual researchers. A test bed is just a tool, and it is up to the researcher to determine the best tool for his/her current task. Indeed, in some cases, researchers might need to build their own tools to pursue the questions of interest to them. It is worth noting, though, that some test beds might be more flexible than others, that is, might more readily suggest ways to model a variety of environmental features and/or behavioral aspects and, thus, be more amenable to modification by the test-bed users. Later, I suggest that flexibility is one of the strengths of the TILEWORLD system.

To this point, I have focused on how a test bed allows control of B and E. It is, of course, also necessary for the researcher to have control of the system features, S. One way to achieve this control is to use the same kind of parameterization in an agent embedded in a test-bed environment as is used in the environment itself. One of the more useful features of the TILEWORLD system is precisely that it provides the experimenter with control over the embedded system as well as over the environment.
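One way to picture this three-way parameterization is as a small experiment harness. The sketch below is ours, with hypothetical constructor and trial functions, and is meant only to show where S, E, and B enter:

    import statistics

    def run_experiment(make_agent, make_world, run_trial, S_grid, E_grid, trials=30):
        # S_grid and E_grid are lists of parameter dictionaries, e.g.,
        # S = {"boldness": 4} for the agent and E = {"gamma": 0.1} for the world.
        results = {}
        for S in S_grid:
            for E in E_grid:
                scores = [run_trial(make_agent(**S), make_world(**E))
                          for _ in range(trials)]          # instrumented B
                results[(tuple(sorted(S.items())),
                         tuple(sorted(E.items())))] = statistics.mean(scores)
        return results

The point of such a harness is that both the agent (S) and the environment (E) are swept under the experimenter's control, while the test bed's instrumentation supplies the behavioral measure (B) for each run.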
Hanks does not dispute the claim that simplification of the kind provided by test-bed environments and agents provides experimental control. What worries him is that the price we might pay for this control is too high. His main argument is that the simplifications that provide the needed control also make it impossible to produce results that are in any sense real or generalizable, that is, can be shown to be applicable to larger AI applications. Cohen, Hanks, and I all agree that this problem, often called realism, is the most difficult challenge facing researchers on agent design who adopt the experimental methodology we discuss in this article. However, we disagree about whether this difficulty is insurmountable.

Toward Realism

The problem of realism is a challenge for experimentalists, all experimentalists, not just those in AI. To achieve the experimental control they need, scientists in many disciplines have made use of simplified systems and have thus had to address the question of how the lessons they learn using these systems can be applied to more complex phenomena. However, the history of science is full of examples in which this challenge has been met successfully. For example, biologists have used the simple organisms Drosophila and Escherichia coli in numerous experiments aimed at understanding the fundamental mechanisms of genetics. The results of these experiments have had tremendous significance for the theory of inheritance in all organisms, including humans. Neurobiologists have used aplysia, animals with only a few neurons, to conduct experiments investigating neuroplasticity. Again, the results have been generalized to theories about the ways in which human brains function. As another example, engineers have built systems to simulate natural phenomena (wind tunnels and wave machines, for instance). These simulations abstract away from much of the complexity of real environments. Nonetheless, experiments conducted using them have provided many valuable lessons about the effects of the modeled phenomena on engineered artifacts such as airplanes.

Of course, merely pointing out that many other sciences have been able to meet the challenge of realism is not, in and of itself, enough to demonstrate that AI researchers concerned with agent design will be able to do so. What is needed is a closer look at how this challenge has been met. A widely used introductory textbook on statistics describes the process of achieving realism as follows:

    Most experimenters want to generalize their conclusions to some setting wider than that of the actual experiment. Statistical analysis of the original experiment cannot tell us how far the results will generalize. Rather the experimenter must argue based on an understanding of psychology or chemical engineering or education that the experimental results do describe the wider world. Other psychologists or engineers or educators may disagree. This is one reason why a single experiment is rarely completely convincing, despite the compelling logic of experimental design. The true scope of a new finding must usually be explored by a number of experiments in various settings.

    A convincing case that an experiment is sufficiently realistic to produce useful information is based not on statistics, but on the experimenter's knowledge of the subject-matter of the experiment (Moore and McCabe 1989, p. 270).

The key to achieving realism lies in the researcher's knowledge of the subject matter; the researcher must provide an argument, based on his/her understanding of the subject matter, that, in fact, the experimental results do describe the wider world. For such arguments to be satisfying, they must be informed by a rich theory of the phenomena in question. For the experimental program to succeed in AI, AI researchers will need to be more scrupulous about careful theory development; as I have claimed elsewhere (Pollack 1992), our field has not always valued theory development as an integral part of our work.

Research into agent design begins with a theory. Of course, the theory, in whole or in part, can be informed by the theorist's previous experiences with building large, interesting systems. An experimental research program on agent design includes the following components (see Cohen's [1991] MAD [modeling, analysis, and design] methodology): (1) a theory describing some aspect(s) of agent design (particularly, the agent's architecture, the environment, and the agent's behavior) and the purported effect of these design aspects on agent behavior in certain environments;10 (2) an implemented test-bed environment and a description of the characteristics of the environment; (3) an implemented agent who will operate in the test bed; and (4) mappings describing the relationship between the real phenomena described by the theory and their intended analogs in the test-bed environment, the relationship between the agent architecture described in the theory and its realization in the implemented agent, and the relationship between the agent's design and its performance in the test-bed world.
A typical set of experiments will then evolve from some hypothesis, typically asserting that under the conditions of some given environment, some specified behavior will be observed in agents having some given architectural characteristics. Experiments can then be designed using the implemented (or operationalized) analog of this hypothesis, relating conditions in the test bed to observed behavior in the implemented agent. Such experiments can have several different types of result. They can confirm, deny, or suggest modifications to the hypotheses in the underlying theory. They can suggest needed changes to the test-bed system or to the mappings between the actual and the simulated environments. They can reveal flaws in the way the environment was modeled. They can suggest needed changes to the simplified agent or to the mappings between the actual and the simulated agents. They can reveal flaws in the way the agent was modeled or the way its behavior was measured. Perhaps most importantly, they can suggest additional experiments that should be performed, either using the same or another test bed and agent.

This last type of result is critical. Experimentation is an iterative process. Part of the experimental program is to refine the mapping between a theory and its realization in implemented systems. Part of the experimental program is to iteratively refine the experiments themselves. As Moore and McCabe (1989, p. 270) put it, "a single experiment is rarely completely convincing. The true scope of a new finding must usually be explored by a number of experiments in various settings." To facilitate related experiments, great care must be given to the way in which theories are stated and to the way in which these theories are operationalized in experimental settings. Test beds and simplified agents make it possible to meet the latter requirement.

The TILEWORLD Experience

To make this discussion more concrete, I want to describe briefly some of the experiences we have had in conducting experiments using the TILEWORLD system. I focus on TILEWORLD because it is the experimental work with which I am most familiar and because Hanks addresses it in his comments. I do not mean to suggest that TILEWORLD is the ultimate test bed or one that all researchers should use in their work. On the contrary, for reasons I have already discussed, it is essential that AI researchers use a variety of test-bed systems in their experimentation. Moreover, TILEWORLD is an early, prototype test-bed system, and in using it, my research group and I have not only learned about agent design but also a great deal about test-bed design. These lessons have led to a number of changes and extensions to the original system, which was reported on in Pollack and Ringuette (1990). Here I mention some of these changes; see also Pollack et al. (1993).

The initial goal in building TILEWORLD was to study a particular, well-developed theory of resource-limited reasoning, called IRMA, that we had previously developed (Bratman, Israel, and Pollack 1988; Bratman 1987; Pollack 1991). This theory was built on a detailed philosophical analysis of the role of intention in managing reasoning; the aim was to investigate certain underspecified aspects of this model. In particular, we began with a theoretically motivated strategy for coping with changing environments: the strategy of commitment-based filtering. Roughly speaking, this strategy involves committing to certain plans and tending to ignore options for action that are deemed incompatible with these plans. Filtering can be more or less strict, and we wanted to determine the environmental conditions under which stricter filtering was more advantageous. In addition, there are various ways to realize the notion of strictness, and we wanted to explore the effects of these alternatives on agent behavior in different environmental conditions. The environmental condition that we suspected to be most important was the average rate of change in the environment. Details are in Pollack and Ringuette (1990); this brief sketch is meant to highlight the fact that underlying our attempt to relate S (in this case, conditions on filtering), E (average rate of change), and B (the agent's overall performance) was a larger theory about the role of intentions in resource-limited reasoning.
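A minimal rendering of commitment-based filtering with an override might look like the sketch below. The threshold form of the override is one possible realization of "strictness" and is our assumption, not IRMA itself (which is an architecture, not this code); in the original TILEWORLD agent the filter was deliberately trivial, rejecting all holes other than the one being filled (see note 3).

    def filter_options(commitment, new_options, estimate_value, override_threshold):
        # No current commitment: every option goes through to deliberation.
        if commitment is None:
            return list(new_options)
        passed = []
        for option in new_options:
            if option is commitment:
                passed.append(option)                  # compatible with the plan
            elif estimate_value(option) >= override_threshold:
                passed.append(option)                  # filter override fires
            # Otherwise the option is ignored with no deliberation at all;
            # raising the threshold makes the filtering stricter.
        return passed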
The experiments that we conducted, as well as those performed by others using TILEWORLD (Kinny 1990; Kinny and Georgeff 1991; Kinny, Georgeff, and Hendler 1992), led to each of the kinds of results I described earlier.

First, they provided preliminary confirmation of some parts of the theory. Experimentation showed that strict filtering of incompatible options, coupled with an appropriate overriding mechanism, is viable at least under some circumstances (Kinny and Georgeff 1991; Kinny 1990). In other words, commitment to one's plans can be a valuable strategy for managing a changing environment. Experimentation also suggested needed modifications to the theory. For example, one TILEWORLD user, John Oh, pointed out to us that the agent's performance is hindered by its inability to immediately adopt certain extremely promising options without deliberation. The original theory included a mechanism for short circuiting deliberation to eliminate a new option, but it lacked a mechanism for short circuiting deliberation to immediately adopt a new option. Thus, the theory needed to be modified to include a new mechanism of the latter type.

Second, the experiments suggested needed changes to the test-bed environment. As Hanks correctly points out, the original TILEWORLD test bed was extremely homogeneous: essentially, the world only presented one type of top-level goal (hole filling). This fact limited the range of experiments that could be conducted; there was no way to explore the behavior of agents who had to perform complex (and, thus, computationally costly) plan generation. Since the publication of Pollack and Ringuette (1990), researchers have increased the complexity of the TILEWORLD environment, so that they can study situations in which a wider range of options are presented to the agent (Pollack et al. 1993).

Third, the experiments also suggested needed changes to the agent embedded in the TILEWORLD environment. Early experiments showed that the simplifications researchers made in the deliberation and plan-generation component of the system were too extreme. Both processes were uniformly inexpensive, and we were thus unable adequately to explore the advantages of the filtering process, whose intent is to reduce the amount of deliberation and planning needed (Pollack and Ringuette 1990). This limitation subsequently led us to increase the complexity of the deliberation process. Note the interaction between this change and the previous one described; the added complexity in the agent depended on the added complexity in the environment.

Finally, the experiments suggested a large number of additional experiments that need to be conducted to expand and strengthen the original theory. Hanks, in fact, gives many examples of such experiments. He wonders about the significance of the agent's ability to perform some planning problems optimally. He suggests that the degree of (un)predictability in an environment might be an important influence on the value of committing to one's plans. He asks, "What if the agent doesn't have immediate, perfect, cost-free information about the appearance of holes? What if the designer does not have an optimal and efficient planner at his/her disposal?" Questions such as these are precisely what a theory of agent design should answer and directly suggest experiments that could be performed using TILEWORLD or other test-bed systems. We count as a success of our experience with TILEWORLD that it has led a number of researchers to ask just such questions. Moreover, TILEWORLD has proven to be flexible in the sense that it can readily be modified to support experiments investigating environmental and agent-design issues other than those for which it was originally designed.

One error that we made in the initial TILEWORLD experiments was a failure to be precise enough in the terminology we used to describe the theory and its realization in the test bed and simplified agent.11 Instead of using qualitative terms, we should perhaps have developed quantitative analyses. For example, instead of describing environments as fast or slow relative to some arbitrary baseline, we might have defined the rate of environmental change as the ratio between the average period of time between changes in the environment and the average amount of time it takes an agent to form an arbitrary plan. Quantitative definitions such as this one would certainly have facilitated the specification of the mapping functions between real phenomena and the TILEWORLD operationalization of them.
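Written out explicitly (the symbol ρ and the bar notation are ours, not from the original paper), the suggested definition is

    \[
        \rho \;=\; \frac{\overline{T}_{\text{change}}}{\overline{T}_{\text{plan}}}
    \]

where T̄_change is the average period of time between changes in the environment and T̄_plan is the average amount of time the agent needs to form an arbitrary plan. On this reading, ρ < 1 would characterize an environment that is fast relative to the agent, with no appeal to an arbitrary baseline.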
It is clear that significant effort must be put into the development of vocabularies for describing agents and environments and their realizations in implemented systems. I agree completely with Hanks that the real contribution of this line of research will be "to come up with the right way of characterizing the agent, the world, and their relationship." This goal is the primary purpose of our ongoing work. However, I disagree strongly with Hanks when he goes on to claim that to date the terms used in the TILEWORLD studies (and in all other experimentation in the small) are "so specific as to be applicable only to the experimental domain [or] so vague as to be vacuously true."

Consider the TILEWORLD results that he describes as vacuously true. He states these in terms of the circumstances under which it is advantageous to reconsider the plans to which one has already committed (for example, be more inclined to reconsider when the world is changing more rapidly; reconsider when your goal becomes impossible). However, what is most important about the early TILEWORLD results is that they support the idea that commitment is a good idea in the first place; the results, as described by Hanks, have to do with refinements to this basic idea. Kinny and Georgeff found that commitment led to the most effective behavior under all the conditions they studied, provided the agent was given a minimal override policy that allows for reconsideration of goals that have become unachievable.

The key idea of the IRMA theory is that it pays for an agent in a dynamic environment to commit to certain courses of action, even though the environment might change so that some of these courses of action cease to be optimal. Local optimality (always doing what is best at a given time) must be sacrificed in the interest of doing well enough overall; commitment to one's plans generally rules out local optimality but can help lead to overall satisficing, that is, "good enough," behavior. Although I cannot restate the entire argument here (again, see Bratman, Israel, and Pollack [1988]; Bratman [1987]; Pollack [1991]), it should be said that this claim is far from being so obvious that all reasonable people would assent to it.12 Hanks says that he would be surprised to hear about an agent that did not adopt these policies, but in fact, the recent literature in agent design has been filled with examples of agents, specifically, the so-called reactive agents, that are notable precisely because they do not commit to any plans; instead, they decide at each point in time what action is appropriate (Agre and Chapman 1987; Brooks 1991; Schoppers 1987). A standard attempt to resolve the debate between those advocating reactiveness and those advocating deliberativeness has been to suggest a middle road: Rational agents sometimes should deliberate about, and commit to, plans, and other times, they should react more immediately to their environment. The TILEWORLD experiments conducted to date can be seen, at least in part, as an attempt to clarify the conditions under which each alternative is desirable.

Conclusion

In these comments, I distinguished between two kinds of simplification in experimentation: (1) investigating hypotheses that focus on particular characteristics of a system, its behavior, and its environment and (2) using simplified systems, operating in simplified environments, to conduct the experiments. I claimed that the former is essential to all experimentation, and that although in principle the latter is not necessary, de facto it is, given the current state of our science.

Although in his comments Hanks focuses on the difficulties involved in using test beds and simplified agents in experimentation, in his conclusion, he supports their use, provided that the hypotheses toward which they are directed were inspired by experiences with particular large-scale systems. Thus, he says that he is "not opposed to conducting experiments in controlled, overly simplified worlds [and] can imagine, for example, a researcher implementing some idea in a system, then building a small world that isolates the essence of this idea, then using the small world to explore the idea further." Apparently, Hanks feels that the problem is not in the use of simplified systems and agents per se but, rather, in the fact that researchers who have to date used simplified systems and agents have been willing to investigate hypotheses that have been developed apart from the implementation of any particular system. Thus, it appears that the primary dispute between Hanks and myself has little to do with the use of test beds and simplified systems. We both agree that unprincipled fiddling with any systems (large or small) is just that. Experimentation must build on theorizing.13 However, Hanks demands that any theory worth investigating must derive directly from a large, implemented system, but I see no need for this restriction. Sometimes, hypotheses about agent design can result from other avenues of inquiry (such as the philosophical theorizing that led to IRMA), and it might be more effective to explore these theories experimentally before investing in large, complex systems that embody them.

Generalization of Test-Bed Results (Paul R. Cohen)

Much of the preceding discussion touches on the problem of generalizing results from research with test beds. I do the reader no service by recounting my coauthors' arguments. Instead, I try to clarify what test beds are for, focusing on their role in the search for general rules of behavior.14 I was struck by Steve Hanks's repeated assertion that results from the TILEWORLD studies are difficult to interpret, so this assertion serves as the launching point for my own comments.
All empirical results are open to interpretation. Interpretation is our job. When we read the results of a study we have to ask ourselves, What do they mean? We can answer this question in several ways. First, we might say, "Goodness gracious, this result deals a deadly blow to the prevailing theory of, say, agent curiosity." Let's agree that this response is unlikely for two reasons: First, we don't have a theory of agent curiosity (or a theory of any other agent behavior), and death-dealing empirical results are, in any case, rare. Second, we might interpret a study as a chink in the armor of a prevailing theory; for example, results from astronomy sometimes are interpreted as troublesome for the big bang theory. This response, too, is unlikely because we don't have any theories that make predictions for results to contradict. Third, a study might be interpreted as supporting a prevailing theory, if we had any theories to support. Fourth, a result might suggest a theory or just a tentative explanation of an aspect of agent behavior. I interpret Kinny and Georgeff's paper in this way, as weak evidence for the theory that agents sometimes do better in unpredictable domains if they are bold. In addition, I have no sympathy for the complaint that the paper is difficult to interpret. Interpretation is our job, especially now when we have no theories to do the job for us. In short, we ought to ask what our few empirical results mean (what theories they suggest, because we currently have no theories to provide interpretations) instead of assert strenuously that they mean nothing.

Let us recognize that empirical results are rarely general. Interpretations of results might be general, but results are invariably tied to an experimental setup. It is wrong to assert that because Kinny and Georgeff worked with a trivial test bed, their results have no general interpretation. I have already recounted one general interpretation: Bold agents sometimes do better in unpredictable domains. Moreover, every substantive word in this interpretation has a precise meaning in TILEWORLD. Thus, Kinny and Georgeff could say, "Bold agents sometimes do better in an unpredictable environment, and here is what we mean by bold, agent, sometimes, better, and unpredictable. If you are interested in our theory, tell us what you mean by these terms, and let us see if the theory generalizes."

Nothing prevents us from inventing general theories as interpretations of results of test-bed studies, and nothing prevents us from designing additional studies to test predictions of these theories in several test beds. For example, two students in my research group explored whether bold PHOENIX agents do better as the PHOENIX environment becomes more unpredictable. The experiment proved technically difficult because PHOENIX agents rely heavily on failure-recovery strategies; so, it is difficult to get them to commit unswervingly to any plan for long. Their natural state is bold, and they fail catastrophically when we make them less so; so, the results were inconclusive. However, imagine the experiments had succeeded, and evidence was accrued that boldness really does help PHOENIX agents when the environment becomes more unpredictable. Then two research groups (mine and that of Kinny and Georgeff) would have demonstrated the same result, right? Whether you agree depends on what you mean by the same result. I mean the following: Kinny and Georgeff offered a mapping from terms in their theory (bold, agent, better, sometimes, unpredictable) to mechanisms in TILEWORLD, and I offered a mapping from the same terms to mechanisms in PHOENIX. We both found that a sentence composed from these terms (bold agents sometimes do better in an unpredictable environment) was empirically true. In reality, as I noted, we were unable to replicate Kinny and Georgeff's result. We failed for technical reasons; there was no easy way to create a PHOENIX agent that was not bold. Differences in experimental apparatus always make replication difficult. For example, TILEWORLD has just one agent and a limited provision for exogenous events; so, it would be difficult to use TILEWORLD to replicate results from PHOENIX. Still, these problems are only technical and do not provide a strong argument against the possibility of generalizing results from test-bed research.

Test beds have a role in three phases of research. In an exploratory phase, they provide the environments in which agents will behave in interesting ways. During exploration, we characterize these behaviors loosely; for example, we observe behaviors that appear bold or inquisitive. In exploratory research, the principal requirement of test beds is that they support the manifestation and observation of interesting behaviors, which is why I favor complex agents and test beds over simple ones. In a confirmatory phase, we tighten up the characterizations of behaviors and test specific hypotheses. In particular, we provide an operational definition of, say, boldness so that a data-collecting computer program can observe the agent's behavior and decide whether it is bold. We test hypotheses about the conditions in which boldness is a virtue, and when we are done, we have a set of results that describe precise, test-bed-specific conditions in which a precise, agent-specific behavior is good or bad.
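As an illustration, an operational definition of this sort could be as simple as the following; the trace format and threshold are hypothetical choices of ours, not PHOENIX or TILEWORLD instrumentation:

    def is_bold(trace, threshold=0.1):
        # trace: a recorded sequence of agent events, e.g.,
        # ["adopt", "step", "step", "abandon", "adopt", "achieve", ...]
        adoptions = trace.count("adopt")
        abandonments = trace.count("abandon")
        if adoptions == 0:
            return True      # nothing was ever adopted, hence never abandoned
        return abandonments / adoptions <= threshold

The substance of confirmatory work lies in choosing and defending such definitions, since the hypothesis tested is only as meaningful as the operationalization behind it.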

In the third phase, generalization, we attempt to replicate our results. As I described earlier, several research groups might attempt to replicate bold behavior under conditions comparable to those in the original experiment. Each group will have to design its own agent-specific, test-bed-specific definitions of bold and comparable conditions. For example, uncertainty about the environment might be induced in agents by rapidly changing wind speed in PHOENIX and erratically moving holes in TILEWORLD. To achieve this goal, test beds would have to be parameterizable, and researchers would have to work closely during the generalization phase.
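One way to read this parameterizability requirement is as a shared interface that each test bed realizes in its own terms. The sketch below is hypothetical; neither PHOENIX nor TILEWORLD exposes such an interface, and the particular mappings from an abstract unpredictability level to wind changes or hole motion are invented for illustration.

    # Hypothetical sketch: a common parameter interface for two test beds.
    # "Unpredictability" is abstract; each test bed realizes it concretely.

    from abc import ABC, abstractmethod

    class ParameterizableTestBed(ABC):
        @abstractmethod
        def set_unpredictability(self, level: float) -> None:
            """Map an abstract level in [0, 1] onto a concrete mechanism."""

    class PhoenixLike(ParameterizableTestBed):
        def set_unpredictability(self, level: float) -> None:
            # Assumed mechanism: higher levels mean larger and more
            # frequent wind-speed changes during the simulated fire.
            self.wind_change_interval = max(1.0, 60.0 * (1.0 - level))
            self.wind_change_magnitude = 10.0 * level

    class TileworldLike(ParameterizableTestBed):
        def set_unpredictability(self, level: float) -> None:
            # Assumed mechanism: higher levels mean holes appear,
            # disappear, and move more erratically.
            self.hole_lifetime_variance = 100.0 * level
            self.hole_motion_probability = level

    # A replication study could then sweep the same abstract levels in both:
    for testbed in (PhoenixLike(), TileworldLike()):
        for level in (0.0, 0.25, 0.5, 0.75, 1.0):
            testbed.set_unpredictability(level)
            # ... run trials, collect traces, compare boldness results ...

Whether two realizations of the same abstract level are genuinely comparable is, of course, exactly the judgment that must be made by the cooperating researchers, not by the apparatus.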
The boldness theory is general to the extent that boldness and unpredictability in TILEWORLD are phenomena similar to boldness and unpredictability in PHOENIX and other test beds. Similar agents in similar test beds are apt to manifest similar behaviors, but this similarity does not convince us that the behaviors are general. Generality is achieved when different agents in different test beds exhibit common behaviors in common conditions. The more the agents and test beds differ, the more difficult it is to show that behaviors and conditions are common. If we had theories of behavior, we could show how conditions and behaviors in different test beds are specializations of terms in our theories. However, we do not have theories; we must bootstrap theories from empirical studies. Our only hope is to rely on our imaginations and abilities to interpret behaviors and conditions in different test-bed studies as similar.

In conclusion, I believe results of test-bed research can be generalized. Some features of test beds will make it easier to observe, explain, and test hypotheses about agents' behaviors. Generalization is done by scientists, not apparatus, so I strongly disagree with any implication that particular kinds of test beds preclude generalization. Test beds offer researchers the opportunity to tell each other what they observed in particular conditions. When a researcher publishes an observation, other researchers are responsible for the hard work required to say, "I observed the same thing!"

Acknowledgments

Steve Hanks was supported in part by National Science Foundation (NSF) grants IRI-9008670 and IRI-9206733. Martha E. Pollack was supported by United States Air Force Office of Scientific Research contracts F49620-91-C-0005 and F49620-92-J-0422, Rome Laboratory, the Advanced Research Projects Agency (ARPA) contract F30602-93-C-0038, and NSF Young Investigators Award IRI-9258392. Paul Cohen was supported in part by ARPA contract F30602-91-C-0076.

Notes

1. The scoring metric in TILEWORLD was later revised to make it easier to compare trials of varying length: raw score was replaced with a normalized value called efficiency (Kinny and Georgeff 1991). A number of changes have been made to the TILEWORLD system since 1990, some of which are discussed in "The Promise of Experimentation"; see also Pollack et al. (1993). Code and documentation for TILEWORLD are available by sending mail to tileworld-request@cs.pitt.edu.

2. TRUCKWORLD code and documentation are available by sending mail to truckworld-users-request@cs.washington.edu.

3. The filtering mechanism itself in the original TILEWORLD agent is trivial: When the agent is working on filling a hole, the filter rejects all other holes; when the agent does not have a current hole, the filter accepts all holes. (A code sketch of this rule appears after these notes.)
4. In both experiments, the agent was automatically and immediately notified of the appearance and disappearance of holes.

5. The terms can be defined precisely within the IRMA framework (they describe the sensitivity of the agent's filter-override mechanism), but presumably the terms and the associated relationships are intended to be applied to agents other than implementations of IRMA.

6. Compare McDermott (1981).

7. Chapman (1990) advances an even stronger view: that randomness without structure actually makes planning more difficult.

8. For similar statements of this research paradigm, see Cohen, Howe, and Hart (1990); Rosenschein, Hayes-Roth, and Erman (1990); Pollack and Ringuette (1990); and Langley and Drummond (1990). Also see the paper by L. Chrisman, R. Caruana, and K. Carriker, "Intelligent Agent Design Issues: Internal Agent State and Incomplete Perception," from the 1991 AAAI Fall Symposium Series on Sensory Aspects of Robotic Intelligence. Some researchers also split out the properties of the agent's task; in these comments, I consider the task specification to be part of the environment, but my argument does not depend on this consideration.

9. Although I believe the situation is changing: recent conference proceedings appear to include an increasing number of experimental papers on agent design, and in some other subfields of AI, notably machine learning and text understanding, there are many such papers.
10. A general question exists about the appropriate language for the researcher to use in articulating his or her theory. Sometimes, it will be the language of mathematics; other times, a natural language, clearly used, can suffice.

11. Another error was the failure to provide a clean enough interface between the agent and the environment; it is more difficult than originally hoped to excise the IRMA-based embedded agent and replace it with an alternative. Also, as Hanks points out, we used an awkward mechanism, which has since been modified, for simulating concurrent acting and reasoning on a sequential machine.

12. If you don't believe me, I invite you to listen to the objections that are raised when I give talks describing IRMA.

13. An exploratory phase of experimentation can occur after initial attempts at verifying a particular theory and can sometimes look like fiddling, but that is another matter.

14. Much of what I say arises from conversations with Bruce Porter of the University of Texas. Although I owe my current understanding of the issues to our discussions, I do not mean to imply that he agrees with everything here.
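As promised in note 3, here is a minimal sketch of the filter rule described there. The function name and arguments are hypothetical; only the two-case rule itself comes from the text.

    # Sketch of the original TILEWORLD agent's filter (note 3).
    # While the agent is filling a hole, reject all other holes;
    # with no current hole, accept every newly perceived hole.

    def filter_accepts(current_hole, new_hole):
        """Return True if the newly perceived hole passes the filter."""
        if current_hole is None:
            return True                   # uncommitted: consider every option
        return new_hole == current_hole   # committed: reject all other holes

    assert filter_accepts(current_hole=None, new_hole="h7")
    assert not filter_accepts(current_hole="h3", new_hole="h7")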
References

Agre, P., and Chapman, D. 1987. PENGI: An Implementation of a Theory of Activity. In Proceedings of the Sixth National Conference on Artificial Intelligence, 268-272. Menlo Park, Calif.: American Association for Artificial Intelligence.

Bond, A., and Gasser, L. 1988. Readings in Distributed Artificial Intelligence. Los Altos, Calif.: Morgan Kaufmann Publishers.

Bratman, M. 1987. Intention, Plans, and Practical Reason. Cambridge, Mass.: Harvard University Press.

Bratman, M.; Israel, D.; and Pollack, M. 1988. Plans and Resource-Bounded Practical Reasoning. Computational Intelligence 4:349-355.

Brooks, R. 1991. Intelligence without Reasoning. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, 569-595. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Chapman, D. 1990. On Choosing Domains for Agents. Position paper presented at the NASA Ames Workshop on Benchmarks and Metrics, Moffett Field, California, June.

Chrisman, L., and Simmons, R. 1991. Senseful Planning: Focusing Perceptual Attention. In Proceedings of the Ninth National Conference on Artificial Intelligence, 756-761. Menlo Park, Calif.: American Association for Artificial Intelligence.

Cohen, P. 1991. A Survey of the Eighth National Conference on Artificial Intelligence: Pulling Together or Pulling Apart? AI Magazine 12:16-41.

Cohen, P.; Howe, A.; and Hart, D. 1990. Intelligent Real-Time Problem Solving: Issues and Examples. In Intelligent Real-Time Problem Solving: Workshop Report, ed. L. Erman, IX-1-IX-33. Palo Alto, Calif.: Cimflex Teknowledge Corp.

Dean, T., and Boddy, M. 1988. An Analysis of Time-Dependent Planning. In Proceedings of the Seventh National Conference on Artificial Intelligence, 49-52. Menlo Park, Calif.: American Association for Artificial Intelligence.

Firby, R. J. 1989. Adaptive Execution in Complex Dynamic Worlds. Ph.D. diss., Dept. of Computer Science, Yale Univ.

Firby, R. J., and Hanks, S. 1987. A Simulator for Mobile Robot Planning. In Proceedings of the DARPA Knowledge-Based Planning Workshop, 23-1-23-7. Washington, D.C.: Defense Advanced Research Projects Agency.

Greenberg, M., and Westbrook, L. 1990. The PHOENIX Test Bed, Technical Report, COINS TR 90-19, Dept. of Computer and Information Science, Univ. of Massachusetts.

Haddawy, P., and Hanks, S. 1993. Utility Models for Goal-Directed Decision-Theoretic Planners, Technical Report 93-06-04, Dept. of Computer Science and Engineering, Univ. of Washington.

Hanks, S. 1993. Modeling a Dynamic and Uncertain World II: Action Representation and Plan Evolution, Technical Report 93-09-07, Dept. of Computer Science and Engineering, Univ. of Washington.

Hanks, S. 1990a. Practical Temporal Projection. In Proceedings of the Eighth National Conference on Artificial Intelligence, 158-163. Menlo Park, Calif.: American Association for Artificial Intelligence.

Hanks, S. 1990b. Projecting Plans for Uncertain Worlds. Ph.D. diss., Dept. of Computer Science, Yale Univ.

Hanks, S., and Badr, A. 1991. Critiquing the TILEWORLD: Agent Architectures, Planning Benchmarks, and Experimental Methodology, Technical Report 91-10-31, Dept. of Computer Science and Engineering, Univ. of Washington.

Hanks, S., and McDermott, D. 1994. Modeling a Dynamic and Uncertain World I: Symbolic and Probabilistic Reasoning about Change. Artificial Intelligence 65(2). Forthcoming.

Hart, D., and Cohen, P. 1990. PHOENIX: A Test Bed for Shared Planning Research. In Proceedings of the NASA Ames Workshop on Benchmarks and Metrics, Moffett Field, California, June.

Kinny, D. 1990. Measuring the Effectiveness of Situated Agents, Technical Report 11, Australian AI Institute, Carlton, Australia.

Kinny, D., and Georgeff, M. 1991. Commitment and Effectiveness of Situated Agents. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, 82-88. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Kinny, D.; Georgeff, M.; and Hendler, J. 1992. Experiments in Optimal Sensing for Situated Agents. In Proceedings of the Second Pacific Rim International Conference on Artificial Intelligence, 1176-1182. Seoul, South Korea: Korean Information Science Society.

Langley, P., and Drummond, M. 1990. Toward an Experimental Science of Planning. In Proceedings of the DARPA Workshop on Innovative Approaches to Planning, Scheduling, and Control, 109-114. San Mateo, Calif.: Morgan Kaufmann Publishers.
Law, A., and Kelton, W. 1981. Simulation Modeling and Analysis. New York: McGraw-Hill.

McDermott, D. 1981. Artificial Intelligence Meets Natural Stupidity. In Mind Design: Essays in Philosophy, Psychology, and Artificial Intelligence, ed. J. Haugeland, 143-160. Cambridge, Mass.: The MIT Press.

Minton, S.; Johnston, M.; Philips, A.; and Laird, P. 1990. Solving Large-Scale Constraint Satisfaction and Scheduling Problems Using a Heuristic Repair Method. In Proceedings of the Eighth National Conference on Artificial Intelligence, 17-24. Menlo Park, Calif.: American Association for Artificial Intelligence.

Montgomery, T., and Durfee, E. 1990. Using MICE to Study Intelligent Dynamic Coordination. In Proceedings of the Second International Conference on Tools for Artificial Intelligence, 438-444. Washington, D.C.: Institute of Electrical and Electronics Engineers.

Montgomery, T.; Lee, J.; Musliner, D.; Durfee, E.; Darmouth, D.; and So, Y. 1992. MICE User's Guide, Technical Report CSE-TR-64-90, Dept. of Electrical Engineering and Computer Science, Univ. of Michigan.

Moore, D., and McCabe, G. 1989. Introduction to the Practice of Statistics. New York: W. H. Freeman and Company.

Nguyen, D.; Hanks, S.; and Thomas, C. 1993. The TRUCKWORLD Manual, Technical Report 93-09-08, Dept. of Computer Science and Engineering, Univ. of Washington.

Philips, A., and Bresina, J. 1991. NASA TILEWORLD Manual, Technical Report TR-FIA-91-04, NASA Ames Research Center, Mountain View, California.

Philips, A.; Swanson, K.; Drummond, M.; and Bresina, J. 1991. A Design Rationale for NASA TILEWORLD, Technical Report FIA-91-04, AI Research Branch, NASA Ames, Moffett Field, California.

Pollack, M. 1992. The Uses of Plans. Artificial Intelligence 57(1): 43-69.

Pollack, M. 1991. Overloading Intentions for Efficient Practical Reasoning. Nous 25(4): 513-536.

Pollack, M., and Ringuette, M. 1990. Introducing the TILEWORLD: Experimentally Evaluating Agent Architectures. In Proceedings of the Eighth National Conference on Artificial Intelligence, 183-189. Menlo Park, Calif.: American Association for Artificial Intelligence.

Pollack, M.; Joslin, D.; Nunes, A.; and Ur, S. 1993. Experimental Investigation of an Agent Design Strategy, Technical Report, Dept. of Computer Science, Univ. of Pittsburgh. Forthcoming.

Powers, R. 1991. The Gold Bug Variations. New York: William Morrow and Company.

Rosenschein, S.; Hayes-Roth, B.; and Erman, L. 1990. Notes on Methodologies for Evaluating IRTPS Systems. In Intelligent Real-Time Problem Solving: Workshop Report, ed. L. Erman, II-1-II-12. Palo Alto, Calif.: Cimflex Teknowledge Corp.

Russell, S., and Wefald, E. 1991. Do the Right Thing: Studies in Limited Rationality. Cambridge, Mass.: The MIT Press.

Schoppers, M. 1987. Universal Plans for Reactive Robots in Unpredictable Environments. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, 1039-1046. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Sussman, G. 1975. A Computer Model of Skill Acquisition. New York: Elsevier Science Publishing Company.

Weld, D., and de Kleer, J. 1989. Readings in Qualitative Reasoning about Physical Systems. Los Altos, Calif.: Morgan Kaufmann Publishers.

Wellman, M., and Doyle, J. 1991. Preferential Semantics for Goals. In Proceedings of the Ninth National Conference on Artificial Intelligence, 698-703. Menlo Park, Calif.: American Association for Artificial Intelligence.

Steve Hanks is an assistant professor of computer science and engineering at the University of Washington. He received a Ph.D. from Yale University in 1990 and an M.B.A. from the Wharton School in 1983. His research interests include automated planning and decision making, reasoning and decision making under uncertainty and with incomplete information, and decision support systems for complex and subjective domains.

Martha E. Pollack is an associate professor of computer science and intelligent systems at the University of Pittsburgh. She received her Ph.D. from the University of Pennsylvania in 1986 and was employed at the AI Center, SRI International, from 1985 to 1991. A recipient of the Computers and Thought Award (1991) and a National Science Foundation Young Investigators Award (1992), Pollack has research interests in resource-limited reasoning, plan generation and recognition, natural language processing, and AI methodology.

Paul R. Cohen is an associate professor of computer science at the University of Massachusetts at Amherst and director of the Experimental Knowledge Systems Laboratory. He received his Ph.D. from Stanford University in computer science and psychology in 1983. At Stanford, Cohen edited the Handbook of Artificial Intelligence, Volume III, with Edward Feigenbaum and Avron Barr and recently finished editing Volume IV with them. He is currently completing a book entitled Empirical Methods for Artificial Intelligence.