Developing Criteria for Evidence-Based Assessment: An Introduction to Assessments That Work

John Hunsley
Eric J. Mash
For many professional psychologists, assessment is viewed as a unique and defining feature of their expertise (Krishnamurthy et al., 2004). Historically, careful attention to both conceptual and pragmatic issues related to measurement has served as the cornerstone of psychological science. Within the realm of professional psychology, the ability to provide assessment and evaluation services is typically seen as a required core competency. Indeed, assessment services are such an integral component of psychological practice that their value is rarely questioned but, rather, is typically assumed. However, solid evidence to support the usefulness of psychological assessment is lacking, and many commonly used clinical assessment methods and instruments are not supported by scientific evidence (e.g., Hunsley, Lee, & Wood, 2003; Hunsley & Mash, 2007; Neisworth & Bagnato, 2000; Norcross, Koocher, & Garofalo, 2006). Indeed, as Peterson (2004) recently commented, "For many of the most important inferences professional psychologists have to make, practitioners appear to be forever dependent on incorrigibly fallible interviews and unavoidably selective, reactive observations as primary sources of data" (p. 202).
In this era of evidence-based health-care practices, the need for scientifically sound assessment methods and instruments is greater than ever (Barlow, 2005). Assessment is the key to the accurate identification of patients' problems and strengths. Whether construed as individual patient monitoring, ongoing quality assurance efforts, or program evaluation, assessment is central to efforts to gauge the impact of health-care services provided to ameliorate these problems (Hermann, Chan, Zazzali, & Lerner, 2006). Furthermore, the increasing availability of research-derived treatment benchmarks holds out great promise for providing clinicians with meaningful and attainable targets for their intervention services (Hunsley & Lee, 2007; Weersing, 2005). Unfortunately, even in psychology, statements about evidence-based practice and best-practice guidelines rarely pay more than nominal attention to how critical assessment is to the provision of evidence-based services (e.g., American Psychological Association Presidential Task Force on Evidence-Based Practice, 2006). Without drawing upon a scientifically supported assessment literature, the prominence accorded to evidence-based treatment has been likened to constructing a magnificent house without bothering to build a solid foundation (Achenbach, 2005). Indeed, as the identification of evidence-based treatments rests entirely on the data provided by assessment tools, ignoring the quality of these tools places the whole evidence-based enterprise in jeopardy.
DEFINING EVIDENCE-BASED
ASSESSMENT (EBA)
As we have described previously, there are three
critical aspects that should define EBA (Hunsley &
Mash, 2005, 2007; Mash & Hunsley, 2005). First,
research findings and scientifically supported theories on both psychopathology and normal human development should be used to guide the selection of constructs to be assessed and the assessment process. As Barlow (2005) suggested, EBA measures and strategies should also be designed to be integrated into interventions that have been shown to work with the disorders or conditions that are targeted in the assessment. Therefore, while recognizing that most disorders do not come in clearly delineated neat packages, and that comorbidity is often the rule rather than the exception, we see EBAs as being disorder- or problem-specific. A problem-specific approach is consistent with how most assessment and treatment research is conducted and would facilitate the integration of EBA into evidence-based treatments (cf. Kazdin & Weisz, 2003; Mash & Barkley, 2006, 2007; Mash & Hunsley, 2007). Although formal diagnostic systems provide a frequently used alternative for framing the range of disorders and problems to be considered, commonly experienced emotional and relational problems, such as excessive anger, loneliness, conflictual relationships, and other specific impairments that may occur in the absence of a diagnosable disorder, may also be the focus of EBAs. Even when diagnostic systems are used as the framework for the assessment, a narrow focus on assessing symptoms and symptom reduction is insufficient for both treatment planning and treatment evaluation purposes (cf. Kazdin, 2003). Many assessments are conducted to identify the precise nature of the person's problem(s). It is, therefore, necessary to conceptualize multiple, interdependent stages in the assessment process, with each iteration of the process becoming less general in nature and increasingly problem-specific with further assessment (Mash & Terdal, 1997). In addition, for some generic assessment strategies, there may be research to indicate that the strategy is evidence based without being problem-specific. Examples of this include functional analytic assessments (Haynes, Leisen, & Blaine, 1997) and some recently developed patient monitoring systems (e.g., Lambert, 2001).
A second requirement is that, whenever possible,
psychometrically strong measures should be used to
assess the constructs targeted in the assessment. The
measures should have evidence of reliability, validity,
and clinical utility. They should also possess appro-
priate norms for norm-referenced interpretation and/
or replicated supporting evidence for the accuracy
(e.g., sensitivity, specificity, predictive power, etc.) of
cut-scores for criterion-referenced interpretation (cf.
Achenbach, 2005). Furthermore, there should be
supporting evidence to indicate that the EBAs are
sensitive to key characteristics of the individual(s)
being assessed, including characteristics such as age,
gender, race, ethnicity, and culture (Bell, Foster, &
Mash, 2005; Ramirez, Ford, Stewart, & Teresi, 2005;
Sonderegger & Barrett, 2004). Given the range of
purposes for which assessment instruments can be
used (i.e., screening, diagnosis, prognosis, case con-
ceptualization, treatment formulation, treatment
monitoring, treatment evaluation) and the fact that
psychometric evidence is always conditional (based
on sample characteristics and assessment purpose),
supporting psychometric evidence must be consid-
ered for each purpose for which an instrument or
assessment strategy is used. Thus, general discus-
sions concerning the relative merits of information
obtained via different assessment methods have
little meaning outside of the assessment purpose
and context. For example, as suggested in many
chapters in this volume, semistructured diagnostic
interviews are usually the best option for obtaining
diagnostic information; however, in some instances,
such interviews may not have incremental validity
or utility once data from brief symptom rating scales
are considered (Pelham, Fabiano, & Massetti, 2005).
Similarly, not all psychometric elements are relevant
to all assessment purposes. The group of validity sta-
tistics that includes specificity, sensitivity, positive
predictive power, and negative predictive power is
particularly relevant for diagnostic and prognostic
assessment purposes and contains essential informa-
tion for any measure that is intended to be used for
screening purposes (Hsu, 2002). Such validity sta-
tistics may have little relevance, however, for many
methods intended to be used for treatment moni-
toring and/or evaluation purposes; for these pur-
poses, sensitivity to change is a much more salient
psychometric feature (e.g., Vermeersch, Lambert, &
Burlingame, 2000).
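To make these statistics concrete, the following sketch (Python, with invented counts) shows how sensitivity, specificity, and positive and negative predictive power are computed from a 2 x 2 table crossing test decisions with a criterion diagnosis; the function name and numbers are illustrative assumptions, not material from the chapter.

```python
# Minimal sketch: diagnostic validity statistics from a 2x2 classification table.
# The counts below are hypothetical and only illustrate the calculations.

def diagnostic_stats(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return sensitivity, specificity, and predictive power for a cut-score."""
    return {
        "sensitivity": tp / (tp + fn),               # proportion of true cases detected
        "specificity": tn / (tn + fp),               # proportion of non-cases correctly ruled out
        "positive_predictive_power": tp / (tp + fp), # P(disorder | positive test)
        "negative_predictive_power": tn / (tn + fn), # P(no disorder | negative test)
    }

# Example: 40 true positives, 10 false positives, 15 false negatives, 135 true negatives
print(diagnostic_stats(tp=40, fp=10, fn=15, tn=135))
```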
Finally, even with data from psychometrically
strong measures, the assessment process is inherently
a decision-making task in which the clinician must
iteratively formulate and test hypotheses by integrating
data that are often incomplete or inconsistent. Thus,
a truly evidence-based approach to assessment would
involve an evaluation of the accuracy and useful-
ness of this complex decision-making task in light of
potential errors in data synthesis and interpretation,
the costs associated with the assessment process, and,
ultimately, the impact the assessment had on clini-
cal outcomes. There are an increasing number of
illustrations of how assessments can be conducted in an
evidence-based manner (e.g., Doss, 2005; Frazier
& Youngstrom, 2006; Youngstrom & Duax, 2005).
These provide invaluable guides for clinicians and
provide a preliminary framework that could lead to
the eventual empirical evaluation of EBA processes
themselves.
FROM RESEARCH TO PRACTICE: USING A
GOOD-ENOUGH PRINCIPLE
Perhaps the greatest single challenge facing efforts
to develop and implement EBAs is determining how
to start the process of operationalizing the criteria
we just outlined. The assessment literature provides
a veritable wealth of information that is potentially
relevant to EBA; this very strength, though, is also
a considerable liability, for the size of the literature
is beyond voluminous. Not only is the literature vast in scope, but the scientific evaluation of assessment methods and instruments can also be without end because there is no finite set of studies that can establish, once and for all, the psychometric properties of an instrument (Kazdin, 2005; Sechrest, 2005). On the other hand, every single day, clinicians must make decisions about what assessment tools to use in their practices, how best to use and combine the various forms of information they obtain in their assessment, and how to integrate assessment activities into other necessary aspects of clinical service. Moreover, the limited time available for service provision in clinical settings places an onus on using assessment options that are maximally accurate, efficient, and cost-effective. Thus, above and beyond the scientific support that has been amassed for an instrument, clinicians require tools that are brief, clear, clinically feasible, and user-friendly. In other words, they need instruments that have clinical utility and that are good enough to get the job done (Barlow, 2005; Lambert & Hawkins, 2004).
As has been noted in the assessment literature, there are no clear, commonly accepted guidelines to aid clinicians or researchers in determining when an instrument has sufficient scientific evidence to warrant its use (Kazdin, 2005; Sechrest, 2005). The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) set out generic standards to be followed in developing and using psychological instruments, but are silent on the question of specific psychometric values that an instrument should have. The basic reason for this is that psychometric characteristics are not properties of an instrument per se but, rather, are properties of an instrument when used for a specific purpose with a specific sample. Quite understandably, therefore, assessment scholars, psychometricians, and test developers have been reluctant to explicitly indicate the minimum psychometric values or evidence necessary to indicate that an instrument is scientifically sound (cf. Streiner & Norman, 2003). Unfortunately, this is of little aid to the clinicians and researchers who are constantly faced with the decision of whether an instrument is good enough, scientifically speaking, for the assessment task at hand.
There have been some isolated attempts to establish criteria for the selection and use of measures for research purposes. Robinson, Shaver, and Wrightsman (1991), for example, developed evaluative criteria for the adequacy of attitude and personality measures, covering the domains of theoretical development, item development, norms, inter-item correlations, internal consistency, test–retest reliability, factor analytic results, known-groups validity, convergent validity, discriminant validity, and freedom from response sets. Robinson and colleagues also used specific psychometric criteria for many of these domains, such as describing a coefficient of .80 as "exemplary." More recently, there have been efforts to establish general psychometric criteria for determining the suitability of measures for clinical use in measuring disability in speech/language disorders (Agency for Healthcare Research and Quality, 2002). A different approach was taken by the Measurement and Treatment Research to Improve Cognition in Schizophrenia Group to develop a consensus battery of cognitive tests to be used in clinical trials in schizophrenia (MATRICS, 2006). Rather than setting precise psychometric criteria for use in rating potential instruments, expert panelists were asked to rate, on a nine-point scale, each proposed tool's characteristics, including test–retest reliability, utility as a repeated measure, relation to functional outcome, responsiveness to treatment change, and practicality/tolerability.
Clearly, any attempt to develop a method for determining the scientific adequacy of assessment
instruments is fraught with the potential for error. The application of criteria that are too stringent could result in a solid set of assessment options, but one that is so limited in number or scope as to render the whole effort clinically worthless. Alternatively, using excessively lenient criteria could undermine the whole notion of an instrument or process being evidence based. So, with a clear awareness of this assessment equivalent of Scylla and Charybdis, we sought to construct a framework for the chapters included in this volume that would employ good-enough criteria for rating psychological instruments. In other words, rather than focusing on standards that define ideal criteria for a measure, our intent was to provide criteria that would indicate the minimum evidence that would be sufficient to warrant the use of a measure for specific clinical purposes. We assume, from the outset, that although our framework is intended to be scientifically sound and defensible, it is a first step, rather than the definitive effort, in designing a rating system for evaluating psychometric adequacy.
In brief, to operationalize the good-enough principle, we developed specific rating criteria to be used across categories of psychometric properties that have clear clinical relevance; each category has rating options of adequate, good, and excellent. In the following sections, we describe the assessment purposes covered by our rating system, the psychometric properties included in the system, and the rationales for the rating options. The actual rating system, used by authors in this volume to construct their summary tables of instruments, is presented in two tables later in the chapter.
ASSESSMENT PURPOSES
Although psychological assessments are conducted for many reasons, it is possible to identify a small set of interrelated purposes which form the basis for most assessments. These include (a) diagnosis (i.e., determining the nature and/or cause[s] of the presenting problems, which may or may not involve the use of a formal diagnostic or categorization system), (b) screening (i.e., identifying those who have or who are at risk for a particular problem and who might be helped by further assessment or intervention), (c) prognosis and other predictions (i.e., generating predictions about the course of the problems if left untreated, recommendations for possible courses of action to be considered, and their likely impact on the course of the problems), (d) case conceptualization/formulation (i.e., developing a comprehensive and clinically relevant understanding of the patient, generating hypotheses regarding critical aspects of the patient's psychosocial functioning and context that are likely to influence the patient's adjustment), (e) treatment design/planning (i.e., selecting/developing and implementing interventions designed to address the patient's problems by focusing on elements identified in the diagnostic evaluation and the case conceptualization), (f) treatment monitoring (i.e., tracking changes in symptoms, functioning, psychological characteristics, intermediate treatment goals, and/or variables determined to cause or maintain the problems), and (g) treatment evaluation (i.e., determining the effectiveness, social validity, consumer satisfaction, and/or cost-effectiveness of the intervention).
Our intent in conceptualizing this volume is to
provide a summary of the best assessment methods
and instruments for commonly encountered clinical
assessment purposes. Therefore, although recogniz-
ing the importance of other possible assessment pur-
poses, chapters in this volume focus on (a) diagnosis,
(b) case conceptualization and treatment planning,
and (c) treatment monitoring and treatment evalua-
tion. Although separable in principle, we combined
the purposes of case conceptualization and treatment
planning because they tend to rely on the same assess-
ment data. Similarly, we combined the purposes of
treatment monitoring and evaluation because they
often, but not exclusively, use the same assessment
methods and instruments. Clearly, there are some
overlapping elements, even in this set of purposes;
for example, it is relatively common for the question
of diagnosis to be revisited as part of evaluating the
outcome of treatment. In the instrument summary
tables that accompany each chapter, the psychomet-
ric strength of instruments used for these three main
purposes is presented and rated. Within a chapter,
the same instrument may be rated for more than one
assessment purpose and thus appear in more than one
table. As an instrument may possess more empirical
support for some purposes than for others, the ratings
given for the instrument may not be the same in each
of the tables.
The chapters in this volume present information on
the best available instruments for diagnosis, case con-
ceptualization and treatment planning, and treatment
monitoring and evaluation. They also provide details
on clinically appropriate options for the range of data to
collect, suggestions on how to address some of the chal-
lenges commonly encountered in conducting assess-
ment, and suggestions for the assessment process itself.
Consistent with the problem-specific focus within EBA outlined above, each chapter in this volume focuses on one or more specific disorders or conditions. However,
many patients present with multiple problems and,
therefore, there are frequent references within a given
chapter to the assessment of common co-occurring prob-
lems that are addressed in other chapters in the volume.
To be optimally useful to potential readers, the chapters
are focused on the most commonly encountered disor-
ders or conditions among children, adolescents, adults,
older adults, and couples. With the specific focus on the three critical assessment purposes of diagnosis, case conceptualization and treatment planning, and treatment monitoring and treatment evaluation, within each disorder or condition, the chapters in this volume provide readers
with essential information for conducting the best EBAs
currently possible.
PSYCHOMETRIC PROPERTIES AND
RATING CRITERIA
Clinical assessment typically entails the use of both idiographic and nomothetic instruments. Idiographic measures are designed to assess unique aspects of a person's experience and, therefore, to be useful in evaluating changes in these individually defined and constructed variables. In contrast, nomothetic measures are designed to assess constructs assumed to be relevant to all individuals and to facilitate comparisons, on these constructs, across people. Most chapters include information on idiographic measures such as self-monitoring forms and individualized scales for measuring treatment goals (e.g., goal attainment scaling). For such idiographic measures, psychometric characteristics such as reliability and validity may, at times, not be easily evaluated or even relevant. It is crucial, however, that the same items and instructions are used across assessment occasions; without this level of standardization, it is impossible to accurately determine changes that may be due to treatment (Kazdin, 1993).
Deciding on the psychometric categories to be rated for the nomothetic instruments was not a simple task, nor was developing concrete rating options for each of the categories. In the end, we focused on nine categories: norms, internal consistency, inter-rater reliability, test–retest reliability, content validity, construct validity, validity generalization, sensitivity to treatment change, and clinical utility. Each of these categories is applied in relation to a specific assessment purpose (e.g., case conceptualization and treatment planning) in the context of a specific disorder or clinical condition (e.g., eating disorders, self-injurious behavior, relationship conflict). Consistent with our previous comments, factors such as gender, ethnicity, and age must be considered in making ratings within these categories. For each category, a rating of less than adequate, adequate, good, excellent, unavailable, or not applicable was possible. The precise nature of what constituted adequate, good, and excellent varied, of course, from category to category. In general, though, a rating of adequate indicated that the instrument meets a minimal level of scientific rigor, good indicated that the instrument would generally be seen as possessing solid scientific support, and excellent indicated there was extensive, high-quality supporting evidence. Accordingly, a rating of less than adequate indicated that the instrument did not meet the minimum level set out in the criteria. A rating of unavailable indicated that research on the psychometric property under consideration had not yet been conducted or published. A rating of not applicable indicated that the psychometric property under consideration was not relevant to the instrument (e.g., inter-rater reliability for a self-report symptom rating scale).
When considering the clinical use of a measure,
it would be desirable to only use those measures that
would meet, at a minimum, the criteria for good.
However, as measure development is an ongoing
process, we thought it was important to provide the
option of the adequate rating in order to fairly evalu-
ate (a) relatively newly developed measures and (b)
measures for which comparable levels of research
evidence are not available across all psychometric
categories in the rating system. That being said, the
only instruments included in chapter summary tables
were those that had adequate or better ratings on the
majority of the psychometric dimensions. Thus, the
instruments presented in these tables represent only a
subset of available assessment tools.
Despite the difficulty inherent in promulgating scientific criteria for psychometric properties, we believe that the potential benefits of fair and attainable criteria far outweigh the potential drawbacks (cf. Sechrest, 2005). Accordingly, we used both reasoned
arguments from respected psychometricians, assessment scholars, and, whenever possible, summaries of various assessment literatures to guide our selection of criteria for rating the psychometric properties associated with an instrument. Table 1.1 presents the criteria used in rating norms and reliability indices; Table 1.2 presents the criteria used in rating validity indices and clinical utility.
Norms
When using a standardized, nomothetically based instrument, it is essential that norms, specific criterion-related cutoff scores, or both are available to aid in the accurate interpretation of a client's test score (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). For example, norms can be used to determine the client's pre- and post-treatment levels of functioning and to evaluate whether any change in functioning is clinically meaningful (Achenbach, 2001; Kendall, Marrs-Garcia, Nath, & Sheldrick, 1999). Selecting the target population(s) for the norms and then ensuring that the norms are adequate can be difficult tasks, and several sets of norms may be required for a measure. One set of norms may be needed to determine the meaning of the obtained score relative to the general population, whereas a different set of norms could be used to compare the score to specific subgroups within the population (Cicchetti, 1994). Regardless of the population to which comparisons are to be made, a normative sample must be truly representative of the population with respect to demographics and other important characteristics (Achenbach, 2001). Ideally, whether conducted at the national level or the local level, this would involve probability-sampling efforts in which data are obtained from the majority of contacted respondents. As those familiar with psychological instruments are aware, such a sampling strategy is rarely used for the development of test norms. The reliance on data collected from convenience samples with unknown response rates reduces the accuracy of the resultant norms. Therefore, at a minimum, clinicians need to be provided with an indication of the quality and likely accuracy of the norms for a measure. Accordingly, the ratings for norms
required, at a minimum for a rating of adequate, data from a single, large clinical sample. For a rating of good, normative data from multiple samples, including nonclinical samples, were required; when normative data from large, representative samples were available, a rating of excellent was applied.

TABLE 1.1 Criteria at a Glance: Norms and Reliability

Norms
Adequate = Measures of central tendency and distribution for the total score (and subscores if relevant) based on a large, relevant, clinical sample are available
Good = Measures of central tendency and distribution for the total score (and subscores if relevant) based on several large, relevant samples (must include data from both clinical and nonclinical samples) are available
Excellent = Measures of central tendency and distribution for the total score (and subscores if relevant) based on one or more large, representative samples (must include data from both clinical and nonclinical samples) are available

Internal consistency
Adequate = Preponderance of evidence indicates α values of .70–.79
Good = Preponderance of evidence indicates α values of .80–.89
Excellent = Preponderance of evidence indicates α values ≥ .90

Inter-rater reliability
Adequate = Preponderance of evidence indicates κ values of .60–.74; the preponderance of evidence indicates Pearson correlation or intraclass correlation values of .70–.79
Good = Preponderance of evidence indicates κ values of .75–.84; the preponderance of evidence indicates Pearson correlation or intraclass correlation values of .80–.89
Excellent = Preponderance of evidence indicates κ values ≥ .85; the preponderance of evidence indicates Pearson correlation or intraclass correlation values ≥ .90

Test–retest reliability
Adequate = Preponderance of evidence indicates test–retest correlations of at least .70 over a period of several days to several weeks
Good = Preponderance of evidence indicates test–retest correlations of at least .70 over a period of several months
Excellent = Preponderance of evidence indicates test–retest correlations of at least .70 over a period of a year or longer
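As a simple illustration of norm-referenced interpretation of the kind described above (e.g., gauging whether pre- to post-treatment change is clinically meaningful), the sketch below converts a raw score to a z-score, T-score, and percentile. The normative mean and standard deviation, the scores, and the assumption of approximately normal norms are all hypothetical.

```python
# Minimal sketch: norm-referenced interpretation of a raw score.
# The normative mean/SD and the raw scores are hypothetical illustrations only.
from statistics import NormalDist

def norm_referenced(raw: float, norm_mean: float, norm_sd: float) -> dict:
    """Express a raw score relative to a normative sample (z, T, and percentile)."""
    z = (raw - norm_mean) / norm_sd
    return {
        "z_score": z,
        "t_score": 50 + 10 * z,                  # T-score metric used by many clinical scales
        "percentile": NormalDist().cdf(z) * 100, # assumes approximately normal norms
    }

# Example: pre- and post-treatment scores judged against hypothetical nonclinical norms
print(norm_referenced(raw=72, norm_mean=50, norm_sd=10))  # pre-treatment
print(norm_referenced(raw=58, norm_mean=50, norm_sd=10))  # post-treatment
```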
Reliability
Reliability is a key psychometric element to be considered in evaluating an instrument. It refers to the consistency of a person's score on a measure (Anastasi, 1988), including whether (a) all elements of a measure contribute in a consistent way to the data obtained (internal consistency), (b) similar results would be obtained if the measure was used or scored by another clinician (inter-rater reliability),¹ or (c) similar results would be obtained if the person completed the measure a second time (test–retest reliability or test stability). Not all reliability indices are relevant to all assessment methods and measures, and the size of the indices may vary on the basis of the samples used.
TABLE 1.2 Criteria at a Glance: Validity and Utility

Content validity
Adequate = The test developers clearly defined the domain of the construct being assessed and ensured that selected items were representative of the entire set of facets included in the domain
Good = In addition to the criteria used for an adequate rating, all elements of the instrument (e.g., instructions, items) were evaluated by judges (e.g., by experts or by pilot research participants)
Excellent = In addition to the criteria used for a good rating, multiple groups of judges were employed and quantitative ratings were used by the judges

Construct validity
Adequate = Some independently replicated evidence of construct validity (e.g., predictive validity, concurrent validity, and convergent and discriminant validity)
Good = Preponderance of independently replicated evidence, across multiple types of validity (e.g., predictive validity, concurrent validity, and convergent and discriminant validity), is indicative of construct validity
Excellent = In addition to the criteria used for a good rating, evidence of incremental validity with respect to other clinical data

Validity generalization
Adequate = Some evidence supports the use of this instrument with either (a) more than one specific group (based on sociodemographic characteristics such as age, gender, and ethnicity) or (b) in multiple contexts (e.g., home, school, primary care setting, inpatient setting)
Good = Preponderance of evidence supports the use of this instrument with either (a) more than one specific group (based on sociodemographic characteristics such as age, gender, and ethnicity) or (b) in multiple settings (e.g., home, school, primary care setting, inpatient setting)
Excellent = Preponderance of evidence supports the use of this instrument with more than one specific group (based on sociodemographic characteristics such as age, gender, and ethnicity) and across multiple contexts (e.g., home, school, primary care setting, inpatient setting)

Treatment sensitivity
Adequate = Some evidence of sensitivity to change over the course of treatment
Good = Preponderance of independently replicated evidence indicates sensitivity to change over the course of treatment
Excellent = In addition to the criteria used for a good rating, evidence of sensitivity to change across different types of treatments

Clinical utility
Adequate = Taking into account practical considerations (e.g., costs, ease of administration, availability of administration and scoring instructions, duration of assessment, availability of relevant cutoff scores, acceptability to patients), the resulting assessment data are likely to be clinically useful
Good = In addition to the criteria used for an adequate rating, there is some published evidence that the use of the resulting assessment data confers a demonstrable clinical benefit (e.g., better treatment outcome, lower treatment attrition rates, greater patient satisfaction with services)
Excellent = In addition to the criteria used for an adequate rating, there is independently replicated published evidence that the use of the resulting assessment data confers a demonstrable clinical benefit
With respect to internal consistency, we focused on coefficient α (alpha), which is the most widely used index (Streiner, 2003). Recommendations in the literature for what constitutes adequate internal consistency vary, but most authorities seem to view .70 as the minimum acceptable value (e.g., Cicchetti, 1994), and Charter (2003) reported that the mean internal consistency value among commonly used clinical instruments was .81. Accordingly, a rating of adequate was given to α values of .70–.79, a rating of good required values of .80–.89, and, finally, because of cogent arguments that an α value of at least .90 is highly desirable in clinical assessment contexts (Nunnally & Bernstein, 1994), we required values ≥ .90 for an instrument to be rated as having excellent internal consistency. It should be noted that it is possible for α to be too (artificially) high, as a value close to unity typically indicates substantial redundancy among items (cf. Streiner, 2003).
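For readers who want the computation spelled out, here is a minimal sketch of coefficient alpha for an item-by-respondent score matrix, followed by a mapping onto the rating bands just described. The data are simulated, and a real rating would rest on the preponderance of published evidence rather than a single sample.

```python
# Minimal sketch: coefficient alpha for an item-by-respondent matrix, plus the
# chapter's good-enough bands. Data are simulated and purely illustrative.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

def rate_internal_consistency(alpha: float) -> str:
    if alpha >= 0.90:
        return "excellent"
    if alpha >= 0.80:
        return "good"
    if alpha >= 0.70:
        return "adequate"
    return "less than adequate"

rng = np.random.default_rng(0)
scores = rng.integers(1, 5, size=(100, 10)).astype(float)  # hypothetical 10-item scale
a = cronbach_alpha(scores)
print(round(a, 2), rate_internal_consistency(a))  # random data will rate poorly, as expected
```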
These value ranges were also used in rating evidence for inter-rater reliability when assessed with Pearson correlations or intraclass correlations. Appropriate adjustments were made to the value ranges when κ (kappa) statistics were used, in line with the recommendations discussed by Cicchetti (1994; see also Charter, 2003). Importantly, evidence for inter-rater reliability could only come from data generated among clinicians or clinical raters; estimates of cross-informant agreement, such as between parent and teacher ratings, are not indicators of reliability.
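A minimal sketch of Cohen's kappa for two clinical raters is given below; the ratings are invented, and the resulting value would then be judged against the kappa bands in Table 1.1.

```python
# Minimal sketch: Cohen's kappa for two clinical raters assigning the same
# categorical codes. The ratings below are hypothetical.
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n      # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)   # chance agreement
    return (observed - expected) / (1 - expected)

a = ["dx", "dx", "no_dx", "dx", "no_dx", "no_dx", "dx", "no_dx"]
b = ["dx", "dx", "no_dx", "no_dx", "no_dx", "no_dx", "dx", "dx"]
print(round(cohens_kappa(a, b), 2))  # 0.5 here, which falls below the adequate (.60-.74) band
```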
In establishing ratings for test–retest reliability values, our requirement for a minimum correlation of .70 was influenced by summary data reported on typical test–retest reliability results found with clinical instruments (Charter, 2003) and trait-like psychological measures (Watson, 2004). Of course, not all constructs or measures are expected to show temporal stability (e.g., measures of state-like variables, life stress inventories), so test–retest reliability was only rated if it was relevant. A rating of adequate required evidence of correlation values of .70 or greater, when reliability was assessed over a period of several days to several weeks. We then faced a challenge in determining appropriate criteria for good and excellent ratings. In order to enhance its likely usefulness, we wanted a rating system that was relatively simple. However, test–retest reliability is a complex phenomenon that is influenced by (a) the nature of the construct being assessed (i.e., it can be state-like, trait-like, or influenced by situational variables), (b) the time frame covering the reporting period instructions (i.e., whether respondents are asked to report their current functioning, functioning over the past few days, or functioning over an extended period, such as general functioning in the past year), and (c) the duration of the retest period (i.e., whether the time between two administrations of the instrument involved days, weeks, months, or years). In the end, rather than emphasize the value of increasingly large test–retest correlations, we decided to maintain the requirement for .70 or greater correlation values, but require increasing retest period durations of (a) several months and (b) at least a year for ratings of good and excellent, respectively.
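The resulting decision rule can be summarized compactly, as in the sketch below. Translating "several months" and "a year or longer" into specific day counts is our assumption, not part of the chapter's criteria.

```python
# Minimal sketch of the test-retest rating rule described above: a correlation of
# at least .70 is required, and the retest interval determines the rating level.
# Expressing the interval cut-offs in days is an assumption made for illustration.

def rate_test_retest(r: float, retest_interval_days: int) -> str:
    if r < 0.70:
        return "less than adequate"
    if retest_interval_days >= 365:
        return "excellent"
    if retest_interval_days >= 90:   # roughly "several months"
        return "good"
    return "adequate"                # several days to several weeks

print(rate_test_retest(r=0.78, retest_interval_days=14))   # adequate
print(rate_test_retest(r=0.74, retest_interval_days=180))  # good
print(rate_test_retest(r=0.71, retest_interval_days=400))  # excellent
```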
Validity
Validity is another central aspect to be considered when evaluating psychometric properties. Foster and Cone (1995) drew an important distinction between representational validity (i.e., whether a measure really assesses what it purports to measure) and elaborative validity (i.e., whether the measure has any utility for measuring the construct). Attending to the content validity of a measure is a basic, but frequently overlooked, step in evaluating representational validity (Haynes, Richard, & Kubany, 1995). As discussed by Smith, Fischer, and Fister (2003), the overall reliability and validity of an instrument is directly affected by the extent to which items in the instrument adequately represent the various aspects or facets of the construct the instrument is designed to measure. Assuming that representational validity has been established, it is elaborative validity that is central to clinicians' use of a measure. Accordingly, replicated evidence for a measure's concurrent, predictive, discriminative, and, ideally, incremental validity (Hunsley & Meyer, 2003) should be available to qualify a measure for consideration as evidence based. We have indicated already that validation is a context-sensitive concept; inattention to this fact can lead to inappropriate generalizations being made about a measure's validity. There should be, therefore, replicated elaborative validity evidence for each purpose of the measure and for each population or group for which the measure is intended to be used. This latter point is especially relevant when considering an instrument for clinical use, and thus it is essential to consider evidence for validity generalization, that is, the extent to which there is evidence for validity
across a range of samples and settings (cf. Messick,
1995; Schmidt & Hunter, 1977).
For ratings of content validity evidence, we followed Haynes et al.'s (1995) suggestions, requiring explicit consideration of the construct facets to be included in the measure and, as the ratings increased, involvement of content validity judges to assess the measure. Unlike the situation for reliability, there are no commonly accepted summary statistics to evaluate either construct validity or incremental validity (but see, respectively, Westen & Rosenthal [2003] and Hunsley & Meyer [2003]). As a result, our ratings were based on the requirement of increasing amounts of replicated evidence of predictive validity, concurrent validity, convergent validity, and discriminant validity; in addition, for a rating of excellent, evidence of incremental validity was also required. We were unable to find any clearly applicable standards in the literature to guide us in developing criteria for validity generalization or treatment sensitivity (a dimension rated only for instruments used for the purposes of treatment monitoring and treatment evaluation). Therefore, adequate ratings for these dimensions required some evidence of, respectively, the use of the instrument with either more than one specific group or in multiple contexts and evidence of sensitivity to change over the course of treatment. Consistent with ratings for other dimensions, good and excellent ratings required increasingly demanding levels of evidence in these areas.
Utility
It is also essential to know the utility of an instrument for a specific clinical purpose. The concept of clinical utility, applied to both diagnostic systems (e.g., Kendell & Jablensky, 2003) and assessment tools (e.g., Hunsley & Bailey, 1999; Yates & Taub, 2003), has received a great deal of attention in recent years. Although definitions vary, they have in common an emphasis on garnering evidence regarding actual improvements in both decisions made by clinicians and service outcomes experienced by patients. Unfortunately, despite thousands of studies on the reliability and validity of psychological instruments, there is only scant attention paid to matters of utility in most assessment research studies (McGrath, 2001). This has directly contributed to the present state of affairs in which there is very little replicated evidence that psychological assessment data have a direct impact on improved provision and outcome of clinical services. At present, therefore, for the majority of psychological instruments, a determination of clinical utility must often be made on the basis of likely clinical value, rather than on empirical evidence.
Compared to the criteria for the psychometric dimensions presented thus far, our standards for evidence of clinical utility were noticeably less demanding. This was necessary because of the paucity of information on the extent to which assessment instruments are acceptable to patients, enhance the quality and outcome of clinical services, and/or are worth the costs associated with their use. Therefore, we relied on authors' expert opinions to classify an instrument as having adequate clinical utility. The availability of any supporting evidence of utility was sufficient for a rating of good, and replicated evidence of utility was necessary for a rating of excellent.
The instrument summary tables also contain one final column, used to indicate instruments that are the best measures currently available to clinicians for specific purposes and disorders and, thus, are highly recommended for clinical use. Given the considerable differences in the state of the assessment literature for different disorders/conditions, chapter authors had some flexibility in determining their own precise requirements for an instrument to be rated, or not rated, as highly recommended. However, to ensure a moderate level of consistency in these ratings, a highly recommended rating could only be considered for those instruments having achieved ratings of good or excellent in the majority of their rated psychometric categories.
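As a compact restatement of this consistency check, the sketch below tests whether the majority of an instrument's rated categories reach good or excellent; treating unavailable and not applicable ratings as unrated, and the example ratings themselves, are our assumptions.

```python
# Minimal sketch of the eligibility check described above: an instrument can only be
# considered "highly recommended" if the majority of its rated psychometric categories
# are good or excellent. Example ratings are hypothetical.

def eligible_for_highly_recommended(ratings: dict) -> bool:
    # Assumption: unavailable/not applicable categories are excluded from the count.
    rated = [r for r in ratings.values() if r not in ("unavailable", "not applicable")]
    strong = sum(r in ("good", "excellent") for r in rated)
    return strong > len(rated) / 2

example = {
    "norms": "good",
    "internal consistency": "excellent",
    "inter-rater reliability": "not applicable",
    "test-retest reliability": "adequate",
    "content validity": "good",
    "construct validity": "excellent",
    "validity generalization": "good",
    "clinical utility": "adequate",
}
print(eligible_for_highly_recommended(example))  # True: 5 of 7 rated categories are good/excellent
```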
SOME FINAL THOUGHTS
We are hopeful that the rating system described in this chapter, and applied in each of the chapters of this book, will serve to advance the state of evidence-based psychological assessment. We also hope that it will serve as a stimulus for others to refine and improve upon our efforts. Whatever the possible merits of the rating system, we wish to close this chapter by drawing attention to three critical issues related to its use.
First, although the rating system used for this volume is relatively simple, the task of rating psychometric properties is not. Results from many studies must be considered in making such ratings, and precise quantitative standards were not set for how to
weight the results from studies. Furthermore, in the
spirit of evidence-based practice, it is also important
to note that we do not know whether these ratings are,
themselves, reliable. Reliance on individual expert
judgment, no matter how extensive and current the
knowledge of the experts, is not as desirable as basing
evidence-based conclusions and guidance on system-
atic reviews of the literature conducted according to a
consensually agreed upon rating system (cf. GRADE
Working Group, 2004). However, for all the poten-
tial limitations and biases inherent in our approach,
reliance on expert review of the scientifc literature is
the current standard in psychology and, thus, was the
only feasible option for the volume at this time.
The second issue has to do with the responsible clinical use of the guidance provided by the rating system. Consistent with evaluation and grading strategies used throughout evidence-based medicine and evidence-based psychology initiatives, many of our rating criteria relied upon the consideration of the preponderance of data relevant to each dimension. Such a strategy recognizes both the importance of replication in science and the fact that variability across studies in research design elements (including sample composition and research setting) will influence estimates of these psychometric dimensions. However, we hasten to emphasize that reliance on the preponderance of evidence for these ratings does not imply or guarantee that an instrument is applicable for all patients or clinical settings. Our intention is to have these ratings provide indications about scientifically strong measures that warrant consideration for clinical and research use. As with all evidence-based efforts, the responsibility rests with the individual professional to determine the suitability of an instrument for the specific setting, purpose, and individuals to be assessed.
Third, as emphasized throughout this volume, focusing on the scientific evidence for specific assessment tools should not overshadow the fact that the process of clinical assessment involves much more than simply selecting and administering the best available instruments. Choosing the best, most relevant, instruments is unquestionably an important step. Subsequent steps must ensure that the instruments are administered in an appropriate manner, accurately scored, and then individually interpreted in accordance with the relevant body of scientific research. However, to ensure a truly evidence-based approach to assessment, the major challenge is to then integrate all of the data within a process that is, itself, evidence based. Much of our focus in this chapter has been on evidence-based methods and instruments, in large part because (a) methods and specific measures are more easily identified than are processes and (b) the main emphasis in the assessment literature has been on psychometric properties of methods and instruments. As we indicated early in the chapter, an evidence-based approach to assessment should be developed in light of evidence on the accuracy and usefulness of this complex, iterative decision-making task. Although the chapters in this volume provide considerable assistance for having the assessment process be informed by scientific evidence, the future challenge will be to ensure that the entire process of assessment is evidence based.
Note
1. Although we chose to use the term "inter-rater reliability," there is some discussion in the assessment literature about whether the term should be "inter-rater agreement." Heyman et al. (2001), for example, suggested that, as indices of inter-rater reliability do not contain information about individual differences among participants and only contain information about one source of error (i.e., differences among raters), they should be considered to be indices of agreement, not reliability.
References
Achenbach, T. M. (2001). What are norms and why do we need valid ones? Clinical Psychology: Science and Practice, 8, 446–450.
Achenbach, T. M. (2005). Advancing assessment of children and adolescents: Commentary on evidence-based assessment of child and adolescent disorders. Journal of Clinical Child and Adolescent Psychology, 34, 541–547.
Agency for Healthcare Research and Quality. (2002). Criteria for determining disability in speech-language disorders. AHRQ Publication No. 02-E009.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
American Psychological Association Presidential Task Force on Evidence-Based Practice. (2006). Evidence-based practice in psychology. American Psychologist, 61, 271–285.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Barlow, D. H. (2005). What's new about evidence-based assessment? Psychological Assessment, 17, 308–311.
Bell, D., Foster, S. L., & Mash, E. J. (Eds.). (2005). Handbook of behavioral and emotional problems in girls. New York: Kluwer/Academic.
Charter, R. A. (2003). A breakdown of reliability coefficients by test type and reliability method, and the clinical implications of low reliability. Journal of General Psychology, 130, 290–304.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6, 284–290.
Doss, A. J. (2005). Evidence-based diagnosis: Incorporating diagnostic instruments into clinical practice. Journal of the American Academy of Child & Adolescent Psychiatry, 44, 947–952.
Foster, S. L., & Cone, J. D. (1995). Validity issues in clinical assessment. Psychological Assessment, 7, 248–260.
Frazier, T. W., & Youngstrom, E. A. (2006). Evidence-based assessment of attention-deficit/hyperactivity disorder: Using multiple sources of information. Journal of the American Academy of Child & Adolescent Psychiatry, 45, 614–620.
GRADE Working Group. (2004). Grading quality of evidence and strength of recommendations. British Medical Journal, 328, 1490–1497.
Haynes, S. N., Leisen, M. B., & Blaine, D. D. (1997). Design of individualized behavioral treatment programs using functional analytic clinical case methods. Psychological Assessment, 9, 334–348.
Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238–247.
Hermann, R. C., Chan, J. A., Zazzali, J. L., & Lerner, D. (2006). Aligning measure-based quality improvement with implementation of evidence-based practices. Administration and Policy in Mental Health and Mental Health Services Research, 33, 636–645.
Heyman, R. E., Chaudhry, B. R., Treboux, D., Crowell, J., Lord, C., Vivian, D., et al. (2001). How much observational data is enough? An empirical test using marital interaction coding. Behavior Therapy, 32, 107–123.
Hsu, L. M. (2002). Diagnostic validity statistics and the MCMI-III. Psychological Assessment, 14, 410–422.
Hunsley, J., & Bailey, J. M. (1999). The clinical utility of the Rorschach: Unfulfilled promises and an uncertain future. Psychological Assessment, 11, 266–277.
Hunsley, J., & Lee, C. M. (2007). Research-informed benchmarks for psychological treatments: Efficacy studies, effectiveness studies, and beyond. Professional Psychology: Research and Practice, 38, 21–33.
Hunsley, J., Lee, C. M., & Wood, J. M. (2003). Controversial and questionable assessment techniques. In S. O. Lilienfeld, S. J. Lynn, & J. M. Lohr (Eds.), Science and pseudoscience in clinical psychology (pp. 39–76). New York: Guilford.
Hunsley, J., & Mash, E. J. (2005). Introduction to the special section on developing guidelines for the evidence-based assessment (EBA) of adult disorders. Psychological Assessment, 17, 251–255.
Hunsley, J., & Mash, E. J. (2007). Evidence-based assessment. Annual Review of Clinical Psychology, 3, 57–79.
Hunsley, J., & Meyer, G. J. (2003). The incremental validity of psychological testing and assessment: Conceptual, methodological, and statistical issues. Psychological Assessment, 15, 446–455.
Kazdin, A. E. (1993). Evaluation in clinical practice: Clinically sensitive and systematic methods of treatment delivery. Behavior Therapy, 24, 11–45.
Kazdin, A. E. (2003). Psychotherapy for children and adolescents. Annual Review of Psychology, 54, 253–276.
Kazdin, A. E. (2005). Evidence-based assessment of child and adolescent disorders: Issues in measurement development and clinical application. Journal of Clinical Child and Adolescent Psychology, 34, 548–558.
Kazdin, A. E., & Weisz, J. R. (Eds.). (2003). Evidence-based psychotherapies for children and adolescents. New York: Guilford.
Kendall, P. C., Marrs-Garcia, A., Nath, S. R., & Sheldrick, R. C. (1999). Normative comparisons for the evaluation of clinical significance. Journal of Consulting and Clinical Psychology, 67, 285–299.
Kendell, R., & Jablensky, A. (2003). Distinguishing between the validity and utility of psychiatric diagnoses. American Journal of Psychiatry, 160, 4–12.
Krishnamurthy, R., VandeCreek, L., Kaslow, N. J., Tazeau, Y. N., Miville, M. L., Kerns, R., et al. (2004). Achieving competency in psychological assessment: Directions for education and training. Journal of Clinical Psychology, 60, 725–739.
Lambert, M. J. (Ed.). (2001). Patient-focused research [Special section]. Journal of Consulting and Clinical Psychology, 69, 147–204.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and using brief outcome instruments. Professional Psychology: Research and Practice, 35, 492–499.
Mash, E. J., & Barkley, R. A. (Eds.). (2006). Treatment of childhood disorders (3rd ed.). New York: Guilford.
Mash, E. J., & Barkley, R. A. (Eds.). (2007). Assessment of childhood disorders (4th ed.). New York: Guilford.
Mash, E. J., & Hunsley, J. (2005). Evidence-based assessment of child and adolescent disorders: Issues and challenges. Journal of Clinical Child and Adolescent Psychology, 34, 362–379.
Mash, E. J., & Hunsley, J. (2007). Assessment of child and family disturbance: A developmental systems approach. In E. Mash & R. A. Barkley (Eds.), Assessment of childhood disorders (pp. 3–50). New York: Guilford.
Mash, E. J., & Terdal, L. G. (1997). Assessment of child and family disturbance: A behavioral-systems approach. In E. J. Mash & L. G. Terdal (Eds.), Assessment of childhood disorders (3rd ed., pp. 3–68). New York: Guilford.
MATRICS. (2006). Results of the MATRICS RAND Panel Meeting: Average medians for the categories of each candidate test. Retrieved August 23, 2007, from http://www.matrics.ucla.edu/matrics-psychometrics-frame.htm
McGrath, R. E. (2001). Toward more clinically relevant assessment research. Journal of Personality Assessment, 77, 307–332.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Neisworth, J. T., & Bagnato, S. J. (2000). Recommended practices in assessment. In S. Sandall, M. E. McLean, & B. J. Smith (Eds.), DEC recommended practices in early intervention/early child special education (pp. 17–27). Longmont, CO: Sopris West.
Norcross, J. C., Koocher, G. P., & Garofalo, A. (2006). Discredited psychological treatments and tests: A Delphi poll. Professional Psychology: Research and Practice, 37, 515–522.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Pelham, W. E., Fabiano, G. A., & Massetti, G. M. (2005). Evidence-based assessment of attention deficit hyperactivity disorder in children and adolescents. Journal of Clinical Child and Adolescent Psychology, 34, 449–476.
Peterson, D. R. (2004). Science, scientism, and professional responsibility. Clinical Psychology: Science and Practice, 11, 196–210.
Ramirez, M., Ford, M. E., Stewart, A. L., & Teresi, J. A. (2005). Measurement issues in health disparities research. Health Services Research, 40, 1640–1657.
Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (1991). Criteria for scale selection and evaluation. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 1–16). New York: Academic Press.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540.
Sechrest, L. (2005). Validity of measures is no simple matter. Health Services Research, 40, 1584–1604.
Smith, G. T., Fischer, S., & Fister, S. M. (2003). Incremental validity principles in test construction. Psychological Assessment, 15, 467–477.
Sonderegger, R., & Barrett, P. M. (2004). Assessment and treatment of ethnically diverse children and adolescents. In P. M. Barrett & T. H. Ollendick (Eds.), Handbook of interventions that work with children and adolescents: Prevention and treatment (pp. 89–111). New York: John Wiley.
Streiner, D. L. (2003). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80, 99–103.
Streiner, D. L., & Norman, G. R. (2003). Health measurement scales: A practical guide to their development and use (3rd ed.). New York: Oxford University Press.
Vermeersch, D. A., Lambert, M. J., & Burlingame, G. M. (2000). Outcome questionnaire: Item sensitivity to change. Journal of Personality Assessment, 74, 242–261.
Watson, D. (2004). Stability versus change, dependability versus error: Issues in the assessment of personality over time. Journal of Research in Personality, 38, 319–350.
Weersing, V. R. (2005). Benchmarking the effectiveness of psychotherapy: Program evaluation as a component of evidence-based practice. Journal of the American Academy of Child & Adolescent Psychiatry, 44, 1058–1062.
Westen, D., & Rosenthal, R. (2003). Quantifying construct validity: Two simple measures. Journal of Personality and Social Psychology, 84, 608–618.
Yates, B. T., & Taub, J. (2003). Assessing the costs, benefits, cost-effectiveness, and cost-benefit of psychological assessment: We should, we can, and here's how. Psychological Assessment, 15, 478–495.
Youngstrom, E. A., & Duax, J. (2005). Evidence-based assessment of pediatric bipolar disorder, Part I: Base rate and family history. Journal of the American Academy of Child & Adolescent Psychiatry, 44, 712–717.
