This volume resulted from a call for papers to “. . . explore the state of the art of software
quality assurance, with particular emphasis on testing to measure quality.” It is my belief
that software testing as a discipline is ripe for theoretical breakthroughs. Researchers are
considering the right questions, and there are promising new approaches and exciting new
results. It seems that new understanding of the testing process can lead to practical tools
and techniques that revolutionize software development. I don’t believe that testing will
become easier or cheaper; rather, it will be more rational in the sense that expending effort
will more dependably lead to better software. In this introductory essay I provide a personal
view of testing, testing research, and their roles in software quality assurance.
of the researchers’ science. Both of these activities have immediate value in their own
right. Research spins off and achieves a number of short-term goals while looking for
fundamental understanding. For example, software testing research suggests ways to
find defects, although it is assessing the absence of defects that is really the ultimate
goal. Practice, in making even elementary improvements in the repeatability and doc-
umentation of development procedures, makes an immediate qualitative improvement
in control and management of projects. About the only thing that no one should be
doing is what I call “process research”, studying practical procedures as if they were
fundamental science, and circularly “validating” surrogate measures in terms of each
other.
So software-development professionals should be organizing the development
process, and getting their organizations ISO-certified and assessed against criteria like
the CMM. Engineers build things; they use, rather than invent, theories. As a college
freshman (in engineering) I worked one summer for an experienced engineer, who
understood the situation very well. When I asked him to “derive” a formula I was to
use, he said, “Just use it. I know it’s right. ‘Derive’ is for when I don’t know what’s
right.” Working software engineers should be using what they know is right.
On the other hand, software-engineering researchers should be finding and un-
derstanding the properties of software that matter, and the technical ways to control
them. They can safely leave the development of practical process infrastructure to the
developers themselves. Common sense, codified and refined into academic formulas,
is a study in futility, rightly ridiculed by the people who need common sense daily.
Let me end this section with an analogy to the Professional Engineer (PE) on
a construction project. When a PE signs off on a project, it’s important that proper
procedures have been followed by people with demonstrated capabilities. The PE
relies on these people to have done the job right. But underlying all of their work are
the scientific principles that define a job properly done. The steel was up to spec, the
design put enough rivets in the right places, the concrete had the right proportion of
cement and aggregate, etc. Without these technical, product measures, the PE’s job
would be impossible. If PEs relied on process alone, because the scientific foundations
didn’t exist, most PE licenses would soon be revoked. That there are no software PEs
is a reflection of the scientific weakness of software engineering, and it’s the job of
academic and industrial researchers to find a remedy.
In the software development cycle, only a few formal specification methods and
the theory of compilation rival software testing in technical sophistication. Thus the
subject of testing is a good candidate for the kind of scientific investigation proposed
in the previous section. The scientific question that begs for an answer is:
How can a piece of software practically be tested so that its behavior can be guar-
anteed?
4 D. Hamlet / Editor’s introduction
In principle, this question contains the seeds of contradiction in its use of ‘practically’
and ‘guaranteed’. Neither exhaustive testing nor full-scale field trials are practical,
but in principle only such methods provide a guarantee. A long history of testing
research has sought to find a “magic method” to answer the question, but so far none
has been found. The intuition that “coverage” of the program’s functional or structural
space might do the trick has not been validated, although research continues and that
door is not closed. (See the section on Research directions, below.)
Testing research is slowly coming to accept a statistical version of the fundamental
question:
How can a piece of software practically be tested so that confidence in its behavior
can be quantified?
This statistical version of the question would appear to eliminate the search for magic
from testing research, and reduce testing to a routine random sampling procedure.
Unfortunately, giving up certainty does not ensure practicality. The statistical question
has a straightforward answer, but not one that a software developer wants to hear:
statistical confidence can indeed be established by random testing, but useful values
cannot be obtained in reasonable time. As a rule of thumb, to establish high confidence
in any software behavior requires a random sample roughly the same size as the actual
usage of interest. Thus confidence that flight-control software will operate properly
for 10⁹ hours requires about 10⁹ hours of testing. Confidence that a PC product will
not malfunction in the first week following release, on a customer base of 20 million
copies, requires about 20 million weeks of testing, and so on. For some situations,
statistical testing is practical, but not in these examples.
As described in the section on Research directions below, conversion to the
statistical view has launched research investigations seeking a new kind of magic:
some way to learn more from small samples than straightforward statistics predicts is
possible. Some initial results are almost miraculous, but the methods appear to apply
only to isolated examples.
The new interest in statistical properties of testing has brought an old question
into sharp focus: what is the purpose of testing in software development? In the pre-
vious section, it was assumed that this purpose is assessment – to establish properties
of a piece of software. That is not, for the most part, how testing is used in development today.
Beginning with Glenford Myers’ insight that the tester who makes software fail has
learned much more than the tester who has seen only success, testing has been used
as a defect-removal procedure, and that is its primary role today. A number of studies
have suggested that techniques like inspection and review are more cost effective at
uncovering defects than is testing. However, it remains true that the weakest development
organization does nothing about defects; when the organization improves a little,
testing is added; and it is only a relatively mature development organization that uses
inspections.
Two related lines of research apply to testing for defect removal. One compares
statistical methods (“random” testing) to methods that systematically divide the pro-
gram input space into subdomains (“partition” testing, although “subdomain” testing is
a name causing less confusion). The practical outcome of this research is quite clear.
It shows that random testing is surprisingly good at making defects show themselves,
but that it cannot compete with clever subdomain testing in making software fail un-
der test. Insofar as the tester can isolate failure-prone classes of inputs, and confine
these classes to a manageable size, using them for subdomain testing is much better at
finding failures than the more scattered random testing. This ideal form of subdomain
testing is sometimes called “fault-based”, because the subdomains are devised to force
failures from imagined defects in the code.
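The tradeoff can be illustrated with back-of-the-envelope detection probabilities (the failure rates below are hypothetical, chosen only to show the shape of the comparison):

```python
def detection_probability(failure_rate, n_tests):
    """Chance that at least one of n independent test points fails,
    when each test fails with probability failure_rate."""
    return 1 - (1 - failure_rate) ** n_tests

# Hypothetical rates: failure-causing inputs are 0.1% of the whole input
# space, but 5% of a suspect subdomain a clever tester has isolated.
p_random = detection_probability(0.001, 100)     # scattered random tests
p_subdomain = detection_probability(0.05, 100)   # tests confined to subdomain
# With equal budgets the subdomain tester wins decisively (~0.10 vs ~0.99);
# random testing needs thousands of points to catch up.
p_random_large = detection_probability(0.001, 5000)
```

The catch-up is tolerable only because machine-generated test points are cheap, which is the basis of the argument in the following paragraph.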
However, since the performance of random testing is better than expected, the
future may lie there. Clever subdomain testing is labor intensive, and requires scarce
human talent to perform successfully. Random testing, on the other hand, is merely
wasteful of machine time for executing automatically generated tests. Although the
research is less clear on this, it appears that random testing can usually better find
failures if it is allowed to use more test points than subdomain testing, and those
additional test points cost little extra. However, one show-stopping obstacle prevents
random testing from coming into practical use: additional tests cost no human time
only if an automatic oracle exists. “Oracle” is the technical term for
a judgment of test outcome – has the software failed? Without precise, mathematically
formal software specifications, the only oracle is a person, and for a person it is
crucial that practical testing involve only a few test points. Hence in today’s world
of informal specifications, subdomain testing is the only practical technique. Research
into automatic oracles constructed from formal specifications is very promising, and
as formal methods are employed for their own sake, the possibility of cheap, easy
random testing opens up as a byproduct.
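To show what an automatic oracle buys, here is a sketch in which the “formal specification” is a pair of checkable properties for a sorting routine; the routines and property names are invented for illustration:

```python
import random
from collections import Counter

def sort_oracle(inputs, outputs):
    """Automatic oracle built from a formal property: the output must be
    ordered and must be a permutation of the input. No human judges it."""
    ordered = all(a <= b for a, b in zip(outputs, outputs[1:]))
    permutation = Counter(inputs) == Counter(outputs)
    return ordered and permutation

def random_test(routine, oracle, n_tests=1000, seed=0):
    """Cheap random testing: generate inputs mechanically, let the
    oracle judge outcomes, and collect the failing inputs."""
    rng = random.Random(seed)
    failures = []
    for _ in range(n_tests):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]
        if not oracle(xs, routine(list(xs))):
            failures.append(xs)
    return failures

# A deliberately buggy routine that silently drops duplicate elements:
buggy_sort = lambda xs: sorted(set(xs))
assert len(random_test(buggy_sort, sort_oracle)) > 0  # failures found
assert random_test(sorted, sort_oracle) == []         # correct routine passes
```

With the oracle in place, the thousand executions cost no human attention at all; without it, a person would have to inspect each outcome.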
The other research line that bears on testing for defect removal compares testing
methods by their subdomain decomposition. The “subsumes” relationship, which may
be the source of many structural test ideas, began this research. Subsumes is an absolute
comparison between systematic test methods. Method X strictly subsumes method Y if
it is impossible to do X without also doing Y, but not the other way around. Intuitively,
X adds some requirements to Y, but omits none of Y’s requirements. A practical example
is the comparison between loop-branch testing and statement testing. The latter is the
oldest structural coverage method; it requires that testing cause every statement to be
executed. Loop-branch testing requires that testing force every conditional to take
both of its branches, and that test points force every loop body to be (1) skipped
(executed zero times), (2) executed exactly once, and (3) executed more than once.
It is obvious that loop-branch testing strictly subsumes statement testing. Indeed, it
is probably perceived
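The subsumption can be seen on a toy example (the function below is invented here purely to contrast the two coverage criteria):

```python
def count_positives(xs):
    """Toy function for contrasting statement and loop-branch coverage."""
    count = 0
    for x in xs:       # loop body must run 0, 1, and >1 times (loop-branch)
        if x > 0:      # conditional must take both branches
            count += 1
    return count

# A single test such as [1, -2] already executes every statement,
# satisfying statement coverage while ignoring the loop's iteration cases.
statement_suite = [[1, -2]]

# Loop-branch coverage demands the zero-, one-, and many-iteration cases;
# any suite meeting it necessarily executes every statement, never vice versa.
loop_branch_suite = [[], [3], [1, -2, 5]]
assert [count_positives(t) for t in loop_branch_suite] == [0, 1, 2]
```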
If this prediction is correct, will software-testing research have put itself out of busi-
ness? I do not believe so. Rather, the business of testing will shift from defect detection
to quality assessment. There will be fewer defects to find, and tests will infrequently
uncover any; the purpose of testing will be to predict the quality of behavior to be
expected in the field.
One important form of quality behavior is reliability. In manufacturing other
than software, quantitative measurement of reliability is a standard feature of product
engineering, and one basis of judging whether a product meets the requirements laid
down for it. Technically, reliability is defined as the probability that a product will
perform adequately over time. The time dimension reflects an assumption that failure to
perform is a consequence of wear and tear. To make the best analogy we can, consider
a mechanical object (say a hinge) that is used repeatedly. If there is a fixed chance p
that the hinge will break when it is flexed, and it is flexed N times independently, then
its reliability is R(N) = (1 − p)^N. The analogy would be even better if we imagined
that for the time of interest the hinge does not wear at all, and the failure probability p
is the chance that each flex happens to excite a fatal flaw in the design or manufacture.
Notice that in this formulation, p > 0 means that the hinge will eventually fail – it is only a
matter of time until R(N) gets arbitrarily close to 0. Sometimes reliability is taken to
be 1 − p itself, although this obscures the inevitability of failure.
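The formula’s consequence is easy to check numerically (the per-flex probability below is invented for illustration):

```python
def reliability(p, n):
    """R(N) = (1 - p)**N: probability that an object with fixed per-use
    failure chance p survives n independent uses."""
    return (1 - p) ** n

p = 1e-4  # hypothetical per-flex failure probability
# Any p > 0 drives R(N) toward zero: failure is only a matter of time.
print(reliability(p, 1_000))    # roughly 0.90
print(reliability(p, 100_000))  # roughly 4.5e-5
```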
For a hinge it is not difficult to imagine testing in a simple “worst” operational
environment, and the complement of that environment defines misuse of the hinge.
The test, even if it does not cause the hinge to fail, can be used to predict statistical
parameters like confidence in an upper bound on p. For software, on the other hand,
the environmental description is seldom simple. If there is to be a sensible failure
probability p for a piece of software as there was for the hinge, it will arise from
a usage pattern in which some executions excite design flaws and always fail, while
other executions always succeed. Since these failure and success points are unknown
in a piece of software that has not yet failed, it is not unreasonable to take 1 − p as
a parameter of software quality. However, the lack of a worst-case test environment
imposes a stringent additional requirement: software must be tested using an input
distribution that duplicates the distribution in which reliability is to be predicted. If
test inputs are given different relative weights than the weights of actual usage, the
test predictions may bear no relation to reality.
A distribution describing software usage is usually called an operational profile,
and determining a profile is the second potential show-stopper for the use of random
testing. The absence of an oracle is a problem for both failure finding and assessing
reliability; a profile is needed only for assessing reliability. (Of course, any random
testing must use some distribution, but for failure finding, a uniform distribution or a
special distribution related to a fault model eases the requirement that ultimate usage
be correctly modeled.)
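A minimal sketch of profile-driven test generation follows; the profile, its class names, and its weights are all hypothetical:

```python
import random

# Hypothetical operational profile: input classes and field-usage frequencies.
profile = {"small_file": 0.70, "large_file": 0.25, "empty_file": 0.05}

def draw_test_classes(profile, n, seed=0):
    """Sample n test-input classes with the same relative weights as usage.
    Reliability predicted from such tests applies only to this profile."""
    rng = random.Random(seed)
    classes = list(profile)
    weights = [profile[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n)

sample = draw_test_classes(profile, 10_000)
# The sample's class frequencies track the profile; skew the weights and
# any reliability estimate from the tests no longer describes field use.
print(sample.count("small_file") / len(sample))  # near 0.70
```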
Thus, in addition to the general difficulty that too many test points are needed
to assess any behavioral aspect of software (such as reasonable confidence in a value
of 1 − p near 1), there are the additional problems of finding an accurate operational
profile for testing, and having an effective oracle to judge test results.
On the positive side, testing for reliability assessment is unlikely to have any
competition from inspection technology. It is impossible to imagine people studying a
piece of software using checklists and experience to estimate (say) confidence bounds
for p < 10⁻⁹. Although not all software has such stringent reliability requirements,
the special case of “ultra-reliable” systems is an important touchstone for any theory.
Papers in this volume will be described in the order they appear, and referred to
by the names of their author(s).
In the long term, a solid theory of software testing is needed. Such a theory must
explain both testing-to-find-failures, and testing-to-assess-reliability, and thus connect
these two goals. Researchers trying to make the connection are attempting to un-
derstand the relationship of test “coverage” to random sampling, because the former
seems to be aimed at finding failures, while the latter is the basis for reliability. Brian
Mitchell and Steven J. Zeil attempt to make the connection using order statistics.
They seek to predict reliability from failures observed in non-random, systematic test-
ing. Christoph Michael and Jeffrey Voas consider the same question in the broad
theoretical context of software quality prediction. Paul Ammann, Dahlard L. Lukes,
and John C. Knight apply “data diversity”, one of the most promising techniques for
developing reliable software, to the specific example of a differential equation solver.
They critically analyze the question of independence of redundant computations for
self-checking programs. My view is that a successful theory will have a statistical ba-
sis, as these three papers also assume. However, it is too soon to predict the outcome.
Janusz Laski takes the view that a theory of software incorrectness can be devised by
studying faults and errors deterministically.
The mainstay of today’s testing is functional, based on specifications. Each of
four papers on functional testing uses a particular example to illustrate and support
its approach. Technically, functional testing is a directed, coverage-like method, but
it is much less precise than the similar structural techniques based on code, because
specifications are usually less formal than code. James A. Whittaker proposes to
attack this problem by creating a stochastic Markov model of what a piece of software
does, then basing specification coverage on that model. Amit Paradkar, K.C. Tai, and
M.A. Vouk convert an informal specification to a more formal cause-effect graph, to
which a formal Boolean-operator coverage technique can be applied. Tomas Vagoun
and Alan Hevner utilize the state design of a piece of software to create an input
partition for testing. Valerie Barr’s paper addresses functional testing in a unique
domain: rule-based programs. The rule base provides an entity for coverage analysis
that lies between specification and program.
Special application domains require special testing approaches. Herbert Hecht
and Myron Hecht identify the requirements for quality assurance of safety systems,
and argue that existing practices do not often meet them. Jeff Tian and Joe Palma
look at large commercial software systems, and recommend test workload measures
for reliability modeling.
And finally, two papers look at software quality prior to the testing phase. David
W. Binkley analyzes use of C++ in safety-critical systems, pointing out not only what
language features are dangerous, but also providing idioms for safer code. James Ken-
neth Blundell, Mary Lou Hines, and Jerrold Stach survey quality metrics for software
design.
This volume resulted from a generally circulated call for papers to “. . .explore
the state of the art of software quality assurance, with particular emphasis on testing
to measure quality.” The refereeing was arranged by exchanging papers among the
authors who had written on the same topic, and was double-blind: author names and
obvious self-citations were removed before the papers were exchanged. (With the
appearance of this volume, part of the referee pool will now be known to the authors,
and referees will learn whose papers they handled.) On balance, this unorthodox
refereeing scheme worked reasonably well. Its weakness is that for a group of weak
papers on a topic, the reviews for the group are also likely to be weak. I suspected
that reviewer bias, if any, would be in the direction of being too hard on the paper
reviewed, and this proved to be the case. When referees suggested major revisions,
I did the follow-up reviewing myself. In most cases I was able to carefully read the
revised paper and not only to judge the response to original reviewer comments, but
also to write an additional independent review suggesting further improvements. In a few
cases, I could only check that the response was adequate.
My own bias in software engineering is clearly against the currently popular
“process” view, for reasons given earlier in this introduction. Perhaps because prospec-
tive authors were aware of my bias, only a few of the papers submitted took the process
view, and none of them survived the refereeing process. Part of the reason may have
been that the reviewer pool shared my bias. However, as noted above, the “process”
authors/reviewers were hardest on their own subject.
This Introduction was not refereed, and represents only the volume-editor’s per-
sonal opinions.