
Annals of Software Engineering 4 (1997) 1–9

An essay on software testing for quality assurance – Editor’s introduction

Dick Hamlet
Center for Software Quality Research, Department of Computer Science, Portland State University,
Portland, OR 97207, USA

This volume resulted from a call for papers to “. . . explore the state of the art of software
quality assurance, with particular emphasis on testing to measure quality.” It is my belief
that software testing as a discipline is ripe for theoretical breakthroughs. Researchers are
considering the right questions, and there are promising new approaches and exciting new
results. It seems that new understanding of the testing process can lead to practical tools
and techniques that revolutionize software development. I don’t believe that testing will
become easier or cheaper; rather, it will be more rational in the sense that expending effort
will more dependably lead to better software. In this introductory essay I provide a personal
view of testing, testing research, and their roles in software quality assurance.

“Product” vs. “process” in quality assurance

It is fashionable to take a “process” view of software development, and two strong
forces, the Capability Maturity Model (CMM) from the US Software Engineering
Institute and the ISO-9000 standards from the Geneva-based International Organization
for Standardization, support this view. In a somewhat unfair nutshell, to believe in “process”
is to think that by controlling the way in which software is developed, better software
can be produced.
The opposite “product” view is traditional in any engineering discipline. In brief,
it holds that the quality of software is a property of the software product itself, and
may be unrelated to the way in which that product was produced.
On the surface, both of these views are absolutely correct. Of course software
is good or bad in itself. If good, then it is not worsened because of how it was done;
if bad, then no pedigree can improve it. Of course the development process matters.
A good product can result from chaotic development, and an excellent process can
go wrong; still, it is clear, as Damon Runyon said, how to bet. Furthermore, the
two views appear to fit together: it is measured product characteristics that should
guide adjustments to the development process for improving quality. The paradigm
of controlled production followed by quality measurement feeding back to correct
production procedures is the basis of mass manufacturing, and it has been wildly
successful in producing a variety of quality goods.



However, as applied to software development, the “process” paradigm copies the
form of engineering successes in other areas, but lacks their substance. The difficulty
is a simple but profound one: the feedback measures are wrong. Our scientific
understanding of software quality is just beginning, so when we attempt to assess a
particular piece of software, we fail. It isn’t that no one knows what characteristics
of the product are important, or that there is no way in principle to measure them.
Rather, our lack of understanding shows itself in an inability to find and measure
practical indicators of the desired characteristics. Another way to say the same thing
is that surrogate quality measures are needed, and they must be validated against the
real properties we care about – actual quality – before they can be used. Our lack of
understanding makes this validation impossible today.
It would be an apparently acceptable development paradigm to subject released
software to field use, measure its actual quality, and use that direct measure to adjust
the development process. Two important properties are reliability and maintainability.
Both can be simply and accurately defined and measured, respectively by counting
failures in the field, and by tracking costs of making changes that arise in practice.
But these real measures are useless in correcting development, because they take far
too long to come in. A representative segment of the maintenance history of a product,
for example, may not be known for five or more years after product release. So we fall
back on the surrogates, properties that we hope are indicative of real software char-
acteristics, but ones that can be quickly measured. For example, a common surrogate
measure of maintainability is code complexity, which can be defined in a number of
appealing ways, for example by counting decision nodes in a flowgraph. All that is
required then is to validate such a surrogate, to demonstrate that when (say) complexity
is low, maintainability is high.
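To make the surrogate concrete, here is a minimal sketch (my own illustration, not a
measure advocated in this volume) of a decision-node count over a flowgraph, roughly
in the spirit of McCabe’s cyclomatic number: count the nodes with more than one
outgoing edge and add one.

    # Hypothetical illustration: a decision-node count over a flowgraph
    # given as an adjacency list mapping each node to its successors.
    def decision_complexity(flowgraph):
        decisions = sum(1 for succs in flowgraph.values() if len(succs) > 1)
        return decisions + 1  # one linear path, plus one per decision node

    # Example: an if-then-else followed by a loop.
    g = {
        "entry": ["if"],
        "if": ["then", "else"],    # decision node
        "then": ["loop"],
        "else": ["loop"],
        "loop": ["body", "exit"],  # decision node
        "body": ["loop"],
        "exit": [],
    }
    print(decision_complexity(g))  # prints 3

Whether such a count actually tracks maintainability is precisely the validation question
raised above; the count itself is cheap, the validation is not.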
The experiments that can establish the connection between a practical surrogate
measure and actual quality are of course as difficult and time-consuming as direct
measurement of quality, since they necessarily include that direct measurement. Ob-
viously, it will not do to make the validation for a particular piece of software – in
that case, we might as well forget the surrogate and simply use the actual measure,
which can’t be done as argued in the previous paragraph. What must be done is to
make the difficult, time-consuming validation experiment on representative examples
of software, and then apply the results to other software on which direct validation is
not attempted. But here our lack of understanding thwarts us: what are representative
examples? There are too many variables, and no sensible person would believe that re-
sults from (say) flight-control development for the military, would apply to developing
data-processing software for a small business.
The facts of life about software development suggest very different roles for
researchers and for practitioners in the field. It is the business of the researcher to
investigate the scientific foundations of product measures, to supply the understanding
whose lack today makes it impossible to validate and use easily measured charac-
teristics of software. It is the business of the practitioner to establish and control the
development processes that exist, perhaps optimistically in preparation for the insertion
of the researchers’ science. Both of these activities have immediate value in their own
right. Research spins off and achieves a number of short term goals while looking for
fundamental understanding. For example, software testing research suggests ways to
find defects, although it is assessing the absence of defects that is really the ultimate
goal. Practice, in making even elementary improvements in the repeatability and doc-
umentation of development procedures, makes an immediate qualitative improvement
in control and management of projects. About the only thing that no one should be
doing is what I call “process research”, studying practical procedures as if they were
fundamental science, and circularly “validating” surrogate measures in terms of each
other.
So software-development professionals should be organizing the development
process, and getting their organizations certified under ISO standards and assessed
according to criteria like the CMM. Engineers build things; they use, rather than invent,
theories. As a college
freshman (in engineering) I worked one summer for an experienced engineer, who
understood the situation very well. When I asked him to “derive” a formula I was to
use, he said, “Just use it. I know it’s right. ‘Derive’ is for when I don’t know what’s
right.” Working software engineers should be using what they know is right.
On the other hand, software-engineering researchers should be finding and un-
derstanding the properties of software that matter, and the technical ways to control
them. They can safely leave the development of practical process infrastructure to the
developers themselves. Common sense, codified and refined into academic formulas,
is a study in futility, rightly ridiculed by the people who need common sense daily.
Let me end this section with an analogy to the Professional Engineer (PE) on
a construction project. When a PE signs off on a project, it’s important that proper
procedures have been followed by people with demonstrated capabilities. The PE
relies on these people to have done the job right. But underlying all of their work are
the scientific principles that define a job properly done. The steel was up to spec, the
design put enough rivets in the right places, the concrete had the right proportion of
cement and aggregate, etc. Without these technical, product measures, the PE’s job
would be impossible. If PEs relied on process alone, because the scientific foundations
didn’t exist, most PE licenses would soon be revoked. That there are no software PEs
is a reflection of the scientific weakness of software engineering, and it’s the job of
academic and industrial researchers to find a remedy.

Testing for quality assurance

In the software development cycle, only a few formal specification methods, and
the theory of compilation, rival software testing in technical sophistication. Thus the
subject of testing is a good candidate for the kind of scientific investigation proposed
in the previous section. The scientific question that begs for an answer is:
How can a piece of software practically be tested so that its behavior can be guaranteed?

In principle, this question contains the seeds of contradiction in its use of ‘practically’
and ‘guaranteed’. Neither exhaustive testing nor the use of full-scale field trials is
practical, but in principle only such methods provide a guarantee. A long history of testing
research has sought to find a “magic method” to answer the question, but so far none
has been found. The intuition that “coverage” of the program’s functional or structural
space might do the trick has not been validated, although research continues and that
door is not closed. (See the section on Research directions, below.)
Testing research is slowly coming to accept a statistical version of the fundamental
question:
How can a piece of software practically be tested so that confidence in its behavior
can be quantified?
This statistical version of the question would appear to eliminate the search for magic
from testing research, and reduce testing to a routine random sampling procedure.
Unfortunately, giving up certainty does not ensure practicality. The statistical question
has a straightforward answer, but not one that a software developer wants to hear:
statistical confidence can indeed be established by random testing, but useful values
cannot be obtained in reasonable time. As a rule of thumb, to establish high confidence
in any software behavior requires a random sample roughly the same size as the actual
usage of interest. Thus confidence that flight-control software will operate properly
for 10^9 hours requires about 10^9 hours of testing. Confidence that a PC product will
not malfunction in the first week following release, on a customer base of 20 million
copies, requires about 20 million weeks of testing, and so on. For some situations,
statistical testing is practical, but not in these examples.
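As a back-of-the-envelope check on that rule of thumb (my own arithmetic, not a result
from this volume): under the usual assumption of independent, failure-free random tests
drawn from the usage distribution, confidence C that the failure probability lies below
a bound p requires roughly N >= ln(1 − C)/ln(1 − p), which is about −ln(1 − C)/p for
small p.

    # Hypothetical back-of-the-envelope calculation, assuming independent,
    # failure-free random tests drawn from the usage distribution.
    import math

    def tests_needed(p_bound, confidence):
        # Smallest N with 1 - (1 - p_bound)**N >= confidence.
        return math.ceil(math.log(1 - confidence) / math.log(1 - p_bound))

    # 99% confidence that the per-hour failure probability is below 1e-9:
    print(tests_needed(1e-9, 0.99))  # about 4.6 billion failure-free test hours

The arithmetic is what makes the flight-control and PC examples above so discouraging:
the required sample is about as large as the usage one wants to speak for.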
As described in the section on Research directions below, conversion to the
statistical view has launched research investigations seeking a new kind of magic:
some way to learn more from small samples than straightforward statistics predicts is
possible. Some initial results are almost miraculous, but the methods appear to apply
only to isolated examples.

Testing for defect-removal

The new interest in statistical properties of testing has brought an old question
into sharp focus: what is the purpose of testing in software development? In the pre-
vious section, it was assumed that this purpose is assessment – to establish properties
of a piece of software. That is not how testing is largely used in development today.
Beginning with Glenford Myers’ insight that the tester who makes software fail has
learned much more than the tester who has seen only success, testing has been used
as a defect-removal procedure, and that is its primary role today. A number of studies
have suggested that techniques like inspection and review are more cost effective at
uncovering defects than is testing. However, it remains true that the weakest develop-
ment organization does nothing about defects; when the organization improves a little,
testing is added; and it is only a relatively mature development organization that uses
inspections.
Two related lines of research apply to testing for defect removal. One compares
statistical methods (“random” testing) to methods that systematically divide the pro-
gram input space into subdomains (“partition” testing, although “subdomain” testing is
a name causing less confusion). The practical outcome of this research is quite clear.
It shows that random testing is surprisingly good at making defects show themselves,
but that it cannot compete with clever subdomain testing in making software fail un-
der test. Insofar as the tester can isolate failure-prone classes of inputs, and confine
these classes to a manageable size, using them for subdomain testing is much better at
finding failures than the more scattered random testing. This ideal form of subdomain
testing is sometimes called “fault-based”, because the subdomains are devised to force
failures from imagined defects in the code.
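The comparison can be made concrete with a deliberately simple model (my own sketch,
with invented numbers): suppose failure-causing inputs occupy a fraction theta of the
input domain, and a clever tester has confined them to a subdomain that is a fraction s
of the domain. A test drawn from that subdomain then fails with probability theta/s,
while a random test over the whole domain fails with probability theta.

    # Hypothetical model: failure region is a fraction theta of the domain,
    # confined by the tester to a subdomain that is a fraction s of the domain.
    def p_find_random(theta, n):
        # Chance that at least one of n uniform random tests fails.
        return 1 - (1 - theta) ** n

    def p_find_subdomain(theta, s, k):
        # Chance that at least one of k tests from the suspect subdomain fails.
        return 1 - (1 - theta / s) ** k

    theta, s = 1e-4, 1e-2
    print(p_find_subdomain(theta, s, 10))  # ~0.10 from 10 targeted tests
    print(p_find_random(theta, 10))        # ~0.001 from 10 random tests
    print(p_find_random(theta, 10_000))    # ~0.63 from 10,000 cheap random tests

The last line is the sense in which random testing can catch up: it wins only by using
vastly more test points, which is affordable only if judging those points is also cheap;
that is the oracle problem taken up next.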
However, since the performance of random testing is better than expected, the
future may lie there. Clever subdomain testing is labor intensive, and requires scarce
human talent to perform successfully. Random testing on the other hand is merely
wasteful of machine time for executing tests generated automatically. Although the
research is less clear on this, it appears that random testing can usually better find
failures if it is allowed to use more test points than subdomain testing, and those
additional test points cost little extra. However, there is one show-stopping obstacle that
prevents random testing from coming into practical use: additional tests are free of
human effort only if an automatic oracle exists. “Oracle” is the technical term for
a judgment of test outcome – has the software failed? Without precise, mathematically
formal software specifications, the only oracle is a person, and for a person it is
crucial that practical testing involve only a few test points. Hence in today’s world
of informal specifications, subdomain testing is the only practical technique. Research
into automatic oracles constructed from formal specifications is very promising, and
as formal methods are employed for their own sake, the possibility of cheap, easy
random testing opens up as a byproduct.
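To illustrate the idea of an automatic oracle (a minimal sketch assuming a hypothetical
sorting routine, not an example drawn from this volume), a formal postcondition can be
turned by hand into a checking function and run against automatically generated random
inputs:

    # Hypothetical oracle derived from a formal postcondition for sorting:
    # the output must be ordered and be a permutation of the input.
    import random

    def oracle_ok(inputs, outputs):
        ordered = all(a <= b for a, b in zip(outputs, outputs[1:]))
        permutation = sorted(inputs) == sorted(outputs)
        return ordered and permutation

    def software_under_test(xs):
        return sorted(xs)  # stands in for a real implementation being tested

    # Random testing becomes cheap because no person judges the outcomes.
    for _ in range(10_000):
        test = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        if not oracle_ok(test, software_under_test(test)):
            print("failure on", test)
            break

This is the sense in which formal specification makes random testing nearly free: the
machine both generates the tests and judges their outcomes.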
The other research line that bears on testing for defect removal compares testing
methods by their subdomain decomposition. The “subsumes” relationship, which may
be the source of many structural test ideas, began this research. Subsumes is an absolute
comparison between systematic test methods. Method X strictly subsumes method Y if
it is impossible to do X without also doing Y, but not the other way around. Intuitively,
X adds some requirements to Y, but omits none of Y’s requirements. A practical example
is the comparison between loop-branch testing and statement testing. The latter is the
oldest structural coverage method; it requires that testing cause every statement to be
executed. Loop-branch testing requires that testing force every conditional to take
both of its branches, and that some test point force every loop body to be (1) skipped
(executed zero times), some point force the body (2) to be executed exactly once, and
some point force the body (3) to be executed more than once. It is obvious that loop-
branch testing strictly subsumes statement testing. Indeed, it is probably perceived
6 D. Hamlet / Editor’s introduction

deficiencies in statement testing that led to the additional requirements of loop-branch


testing.
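As a concrete (and entirely hypothetical) illustration of the difference, consider a
routine that sums the positive elements of a list. The single statement-coverage test
below executes every statement, but loop-branch testing additionally demands tests in
which the loop body runs zero times, exactly once, and more than once.

    # Hypothetical routine used to contrast statement and loop-branch coverage.
    def sum_positive(xs):
        total = 0
        for x in xs:       # loop
            if x > 0:      # conditional
                total += x
        return total

    # Statement coverage: one test suffices; every statement is executed.
    statement_tests = [[3, -1]]

    # Loop-branch coverage adds requirements on the loop itself:
    loop_branch_tests = [
        [],           # loop body skipped (zero iterations)
        [5],          # loop body executed exactly once
        [3, -1, 7],   # loop body executed more than once, both branches taken
    ]

Any collection of tests meeting the loop-branch requirements also executes every
statement here, which is the subsumption; the next paragraphs explain why that guarantee
can nonetheless mislead.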
At first it seemed obvious that if method X subsumes method Y, then X is the
better method, no matter what criteria are used to define “better”. But it is not so.
The reason is that subsumption is a property of test methods, but what is used in
practice are the tests themselves, particular examples of methods. For collections of
test points T_X and T_Y respectively satisfying methods X and Y, subsumption forces
T_X to satisfy Y as well, while T_Y might fail to satisfy X. But there may be a large
number of different tests such as T_X and T_Y exemplifying the methods. Of these, the
ability to excite failure may be independent of the method. That is, a larger fraction
of the possible T_Y may find defects than the fraction of T_X that find defects. Then
the best testing strategy is to use method Y, and subsumption is misleading.
The easiest way to gain an intuition about this subtle point is to look at some
extreme examples of subsumption. Consider “fractional coverage”, for say statement
testing. Let method X be “80% of statements covered”, and method Y be “75% of
statements covered”. Obviously X subsumes Y. But it can happen that most of the X
tests, in order to gain the higher fractional coverage, must neglect some defect-prone
portion of the code, while the Y tests do not have this bias. Then it is better to look
for a Y test. The effect has been experimentally observed.
Real understanding of when subsumption is misleading requires detailed analysis
of the actual properties of test methods that are of interest, not just the set-theoretic
inclusions. For defect detection, some nice theoretical work has been done. Frankl and
Weyuker were able to define a “covering” relationship more complex than “subsumes”
between the subdomains of two testing methods, a relationship that does guarantee that
tests chosen according to one method will be more likely to uncover a failure. It is
interesting that statistical analysis was required to attack this long-outstanding problem
in absolute comparisons.

Testing for reliability

Although it is extrapolating well beyond the solid implications of existing research,
I predict the following future of software testing in the development cycle:
Formal specification, with its byproduct of automatic test oracles, will grow in im-
portance. Random testing will become routine. The combination of “free” random
testing with the cost effectiveness of inspections will push people-intensive, clever
subdomain testing out of the business of defect detection. For development in which
formal specification is not applied, testing for defect detection will standardize on
a few subdomain methods that are rigorously, theoretically shown to be the most
effective. These methods will require considerable skill to use, and the people who
can use them will be in demand. When a development project cannot find or afford
such talent, it will fall back on inspections. Thus testing as we know it today will
be a fringe activity.

If this prediction is correct, will software-testing research have put itself out of busi-
ness? I do not believe so. Rather, the business of testing will shift from defect detection
to quality assessment. There will be fewer defects to find, and tests will infrequently
uncover any; the purpose of testing will be to predict the quality of behavior to be
expected in the field.
One important form of quality behavior is reliability. In manufacturing other
than software, quantitative measurement of reliability is a standard feature of product
engineering, and one basis of judging whether a product meets the requirements laid
down for it. Technically, reliability is defined as the probability that a product will
perform adequately over time. The time dimension reflects an assumption that failure to
perform is a consequence of wear and tear. To make the best analogy we can, consider
a mechanical object (say a hinge) that is used repeatedly. If there is a fixed chance p
that the hinge will break when it is flexed, and it is flexed N times independently, then
its reliability is R(N) = (1 − p)^N. The analogy would be even better if we imagined
that for the time of interest the hinge does not wear at all, and the failure probability p
is the chance that each flex happens to excite a fatal flaw in the design or manufacture.
Notice that in this formulation, p > 0 means that the hinge will fail – it is only a
matter of time until R(N) gets arbitrarily close to 0. Sometimes reliability is taken to
be 1 − p itself, although that form obscures the inevitability of failure.
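A quick worked example with invented numbers: if the per-flex failure probability is
p = 0.001, then R(1000) = 0.999^1000, about 0.37, and R keeps falling toward zero as
N grows.

    # Illustrative numbers only: reliability under the fixed-chance-per-use model.
    p = 0.001                     # assumed per-use failure probability
    for n in (1, 100, 1000, 10_000):
        print(n, (1 - p) ** n)    # R(1) = 0.999, R(1000) ~ 0.37, R(10000) ~ 4.5e-5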
For a hinge it is not difficult to imagine testing in a simple “worst” operational
environment, and the complement of that environment defines misuse of the hinge.
The test, even if it does not cause the hinge to fail, can be used to predict statistical
parameters like confidence in an upper bound on p. For software, on the other hand,
the environmental description is seldom simple. If there is to be a sensible failure
probability p for a piece of software as there was for the hinge, it will arise from
a usage pattern in which some executions excite design flaws and always fail, while
other executions always succeed. Since these failure and success points are unknown
in a piece of software that has not yet failed, it is not unreasonable to take 1 − p as
a parameter of software quality. However, lacking a worst-case environment for test
imposes a stringent additional requirement: software must be tested using an input
distribution that duplicates the distribution in which reliability is to be predicted. If
test inputs are given different relative weights than the weights of actual usage, the
test predictions may bear no relation to reality.
A distribution describing software usage is usually called an operational profile,
and determining a profile is the second potential show-stopper for the use of random
testing. The absence of an oracle is a problem for both failure finding and assessing
reliability; a profile is needed only for assessing reliability. (Of course, any random
testing must use some distribution, but for failure finding, a uniform distribution or a
special distribution related to a fault model eases the requirement that ultimate usage
be correctly modeled.)
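A minimal sketch of what testing to a profile means in practice (the input classes and
weights below are invented purely for illustration): an operational profile assigns a
probability to each class of use, and random test inputs are drawn with those weights.

    # Hypothetical operational profile for an imagined transaction system.
    import random

    profile = {
        "query":  0.70,   # assumed relative frequencies of field use
        "update": 0.25,
        "admin":  0.05,
    }

    def draw_test_classes(n):
        classes = list(profile)
        weights = [profile[c] for c in classes]
        return random.choices(classes, weights=weights, k=n)

    print(draw_test_classes(10))  # e.g., mostly "query", occasionally "admin"

If the assumed weights differ from actual usage, the reliability estimated from such
tests describes a usage pattern nobody has, which is exactly the danger noted above.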
Thus, in addition to the general difficulty that too many test points are needed
to assess any behavioral aspect of software (such as reasonable confidence in a value
of 1 − p near 1), there are the additional problems of finding an accurate operational
profile for testing, and having an effective oracle to judge test results.
On the positive side, testing for reliability assessment is unlikely to have any
competition from inspection technology. It is impossible to imagine people studying a
piece of software using checklists and experience to estimate (say) confidence bounds
for p < 10^-9. Although not all software has such stringent reliability requirements,
the special case of “ultra-reliable” systems is an important touchstone for any theory.

Research directions; papers in this volume

Papers in this volume will be described in the order they appear, and referred to
by the names of their author(s).
In the long term, a solid theory of software testing is needed. Such a theory must
explain both testing-to-find-failures, and testing-to-assess-reliability, and thus connect
these two goals. Researchers trying to make the connection are attempting to un-
derstand the relationship of test “coverage” to random sampling, because the former
seems to be aimed at finding failures, while the latter is the basis for reliability. Brian
Mitchell and Steven J. Zeil attempt to make the connection using order statistics.
They seek to predict reliability from failures observed in non-random, systematic test-
ing. Christoph Michael and Jeffrey Voas consider the same question in the broad
theoretical context of software quality prediction. Paul Ammann, Dahlard L. Lukes,
and John C. Knight apply “data diversity”, one of the most promising techniques for
developing reliable software, to the specific example of a differential equation solver.
They critically analyze the question of independence of redundant computations for
self-checking programs. My view is that a successful theory will have a statistical ba-
sis, as these three papers also assume. However, it is too soon to predict the outcome.
Janusz Laski takes the view that a theory of software incorrectness can be devised by
studying faults and errors deterministically.
The mainstay of today’s testing is functional, based on specifications. Each of
four papers on functional testing uses a particular example to illustrate and support
its approach. Technically, functional testing is a directed, coverage-like method, but
it is much less precise than the similar structural techniques based on code, because
specifications are usually less formal than code. James A. Whittaker proposes to
attack this problem by creating a stochastic Markov model of what a piece of software
does, then basing specification coverage on that model. Amit Paradkar, K.C. Tai, and
M.A. Vouk convert an informal specification to a more formal cause-effect graph, to
which a formal Boolean-operator coverage technique can be applied. Tomas Vagoun
and Alan Hevner utilize the state design of a piece of software to create an input
partition for testing. Valerie Barr’s paper addresses functional testing in a unique
domain: rule-based programs. The rule base provides an entity for coverage analysis
that lies between specification and program.
Special application domains require special testing approaches. Herbert Hecht
and Myron Hecht identify the requirements for quality assurance of safety systems,
and argue that existing practices do not often meet them. Jeff Tian and Joe Palma
look at large commercial software systems, and recommend test workload measures
for reliability modeling.
And finally, two papers look at software quality prior to the testing phase. David
W. Binkley analyzes use of C++ in safety-critical systems, pointing out not only what
language features are dangerous, but also providing idioms for safer code. James Kenneth
Blundell, Mary Lou Hines, and Jerrold Stach survey quality metrics for software
design.

Preparation of this volume

This volume resulted from a generally circulated call for papers to “. . . explore
the state of the art of software quality assurance, with particular emphasis on testing
to measure quality.” The refereeing was arranged by exchanging papers among the
authors who had written on the same topic, and was double-blind: author names and
obvious self-citations were removed before the papers were exchanged. (With the
appearance of this volume, part of the referee pool will now be known to the authors,
and referees will learn whose papers they handled.) On balance, this unorthodox
refereeing scheme worked reasonably well. Its weakness is that for a group of weak
papers on a topic, the reviews for the group are also likely to be weak. I suspected
that reviewer bias, if any, would be in the direction of being too hard on the paper
reviewed, and this proved to be the case. When referees suggested major revisions,
I did the follow-up reviewing myself. In most cases I was able to carefully read the
revised paper and not only to judge the response to original reviewer comments, but
to write an additional independent review suggesting further improvements. In a few
cases, I could only check that the response was adequate.
My own bias in software engineering is clearly against the currently popular
“process” view, for reasons given earlier in this introduction. Perhaps because prospec-
tive authors were aware of my bias, only a few of the papers submitted took the process
view, and none of them survived the refereeing process. Part of the reason may have
been that the reviewer pool shared my bias. However, as noted above, the “process”
authors/reviewers were hardest on their own subject.
This Introduction was not refereed, and represents only the volume-editor’s per-
sonal opinions.
