is viewed as a unique and defining feature of their expertise (Krishnamurthy et al., 2004). Historically, careful attention to both conceptual and pragmatic issues related to measurement has served as the cornerstone of psychological science. Within the realm of professional psychology, the ability to provide assessment and evaluation services is typically seen as a required core competency. Indeed, assessment services are such an integral component of psychological practice that their value is rarely questioned but, rather, is typically assumed. However, solid evidence to support the usefulness of psychological assessment is lacking, and many commonly used clinical assessment methods and instruments are not supported by scientific evidence (e.g., Hunsley, Lee, & Wood, 2003; Hunsley & Mash, 2007; Neisworth & Bagnato, 2000; Norcross, Koocher, & Garofalo, 2006). Indeed, as Peterson (2004) recently commented, "For many of the most important inferences professional psychologists have to make, practitioners appear to be forever dependent on incorrigibly fallible interviews and unavoidably selective, reactive observations as primary sources of data" (p. 202).

In this era of evidence-based health-care practices, the need for scientifically sound assessment methods and instruments is greater than ever (Barlow, 2005). Assessment is the key to the accurate identification of patients' problems and strengths. Whether construed as individual patient monitoring, ongoing quality assurance efforts, or program evaluation, assessment is central to efforts to gauge the impact of health-care services provided to ameliorate these problems (Hermann, Chan, Zazzali, & Lerner, 2006). Furthermore, the increasing availability of research-derived treatment benchmarks holds out great promise for providing clinicians with meaningful and attainable targets for their intervention services (Hunsley & Lee, 2007; Weersing, 2005). Unfortunately, even in psychology, statements about evidence-based practice and best-practice guidelines rarely pay more than nominal attention to how critical assessment is to the provision of evidence-based services (e.g., American Psychological Association Presidential Task Force on Evidence-Based Practice, 2006). Without drawing upon a scientifically supported assessment literature, the prominence accorded to evidence-based treatment has been likened to constructing a magnificent house without bothering to build a solid foundation (Achenbach, 2005). Indeed, as the identification of evidence-based treatments rests entirely on the data provided by assessment tools, ignoring the quality of these tools places the whole evidence-based enterprise in jeopardy.

DEFINING EVIDENCE-BASED ASSESSMENT (EBA)

As we have described previously, there are three critical aspects that should define EBA (Hunsley & Mash, 2005, 2007; Mash & Hunsley, 2005). First, research findings and scientifically supported theories on both psychopathology and normal human development should be used to guide the selection of constructs to be assessed and the assessment process. As Barlow (2005) suggested, EBA measures and strategies should also be designed to be integrated into interventions that have been shown to work with the disorders or conditions that are targeted in the assessment.
Therefore, while recognizing that most disorders do not come in clearly delineated neat packages, and that comorbidity is often the rule rather than the exception, we see EBAs as being disorder- or problem-specific. A problem-specific approach is consistent with how most assessment and treatment research is conducted and would facilitate the integration of EBA into evidence-based treatments (cf. Kazdin & Weisz, 2003; Mash & Barkley, 2006, 2007; Mash & Hunsley, 2007). Although formal diagnostic systems provide a frequently used alternative for framing the range of disorders and problems to be considered, commonly experienced emotional and relational problems, such as excessive anger, loneliness, conflictual relationships, and other specific impairments that may occur in the absence of a diagnosable disorder, may also be the focus of EBAs. Even when diagnostic systems are used as the framework for the assessment, a narrow focus on assessing symptoms and symptom reduction is insufficient for both treatment planning and treatment evaluation purposes (cf. Kazdin, 2003). Many assessments are conducted to identify the precise nature of the person's problem(s). It is, therefore, necessary to conceptualize multiple, interdependent stages in the assessment process, with each iteration of the process becoming less general in nature and increasingly problem-specific with further assessment (Mash & Terdal, 1997). In addition, for some generic assessment strategies, there may be research to indicate that the strategy is evidence based without being problem-specific. Examples of this include functional analytic assessments (Haynes, Leisen, & Blaine, 1997) and some recently developed patient monitoring systems (e.g., Lambert, 2001).

A second requirement is that, whenever possible, psychometrically strong measures should be used to assess the constructs targeted in the assessment. The measures should have evidence of reliability, validity, and clinical utility. They should also possess appropriate norms for norm-referenced interpretation and/or replicated supporting evidence for the accuracy (e.g., sensitivity, specificity, predictive power) of cut-scores for criterion-referenced interpretation (cf. Achenbach, 2005). Furthermore, there should be supporting evidence to indicate that the EBAs are sensitive to key characteristics of the individual(s) being assessed, including characteristics such as age, gender, race, ethnicity, and culture (Bell, Foster, & Mash, 2005; Ramirez, Ford, Stewart, & Teresi, 2005; Sonderegger & Barrett, 2004). Given the range of purposes for which assessment instruments can be used (i.e., screening, diagnosis, prognosis, case conceptualization, treatment formulation, treatment monitoring, treatment evaluation) and the fact that psychometric evidence is always conditional (based on sample characteristics and assessment purpose), supporting psychometric evidence must be considered for each purpose for which an instrument or assessment strategy is used. Thus, general discussions concerning the relative merits of information obtained via different assessment methods have little meaning outside of the assessment purpose and context.
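To make the cut-score accuracy statistics mentioned above concrete, each of them can be computed from a two-by-two table that crosses the decision implied by a cut-score (score at or above the cut versus below it) with criterion status (e.g., diagnosis present versus absent). The sketch below is purely illustrative; the counts and the function name are invented and are not tied to any instrument discussed in this volume.

```python
# Illustrative only: accuracy statistics for a hypothetical screening cut-score.
# Counts come from crossing cut-score decisions with criterion (e.g., diagnostic) status.

def cut_score_accuracy(true_pos, false_pos, false_neg, true_neg):
    """Return the accuracy statistics commonly reported for a cut-score."""
    sensitivity = true_pos / (true_pos + false_neg)   # P(test positive | condition present)
    specificity = true_neg / (true_neg + false_pos)   # P(test negative | condition absent)
    ppv = true_pos / (true_pos + false_pos)           # positive predictive power
    npv = true_neg / (true_neg + false_neg)           # negative predictive power
    return {"sensitivity": sensitivity, "specificity": specificity,
            "positive predictive power": ppv, "negative predictive power": npv}

# Hypothetical counts for 200 screened patients, 50 of whom meet criterion status.
print(cut_score_accuracy(true_pos=42, false_pos=30, false_neg=8, true_neg=120))
```

Because positive and negative predictive power depend on the base rate of the condition in the sample, the same cut-score can yield quite different values in different settings, which is one reason psychometric evidence is always conditional on the assessment purpose and population.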
For example, as suggested in many chapters in this volume, semistructured diagnostic interviews are usually the best option for obtaining diagnostic information; however, in some instances, such interviews may not have incremental validity or utility once data from brief symptom rating scales are considered (Pelham, Fabiano, & Massetti, 2005). Similarly, not all psychometric elements are relevant to all assessment purposes. The group of validity statistics that includes specificity, sensitivity, positive predictive power, and negative predictive power is particularly relevant for diagnostic and prognostic assessment purposes and contains essential information for any measure that is intended to be used for screening purposes (Hsu, 2002). Such validity statistics may have little relevance, however, for many methods intended to be used for treatment monitoring and/or evaluation purposes; for these purposes, sensitivity to change is a much more salient psychometric feature (e.g., Vermeersch, Lambert, & Burlingame, 2000).

Finally, even with data from psychometrically strong measures, the assessment process is inherently a decision-making task in which the clinician must iteratively formulate and test hypotheses by integrating data that are often incomplete or inconsistent. Thus, a truly evidence-based approach to assessment would involve an evaluation of the accuracy and usefulness of this complex decision-making task in light of potential errors in data synthesis and interpretation, the costs associated with the assessment process, and, ultimately, the impact the assessment had on clinical outcomes. There are an increasing number of illustrations of how assessments can be conducted in an evidence-based manner (e.g., Doss, 2005; Frazier & Youngstrom, 2006; Youngstrom & Duax, 2005). These provide invaluable guides for clinicians and provide a preliminary framework that could lead to the eventual empirical evaluation of EBA processes themselves.

FROM RESEARCH TO PRACTICE: USING A GOOD-ENOUGH PRINCIPLE

Perhaps the greatest single challenge facing efforts to develop and implement EBAs is determining how to start the process of operationalizing the criteria we just outlined. The assessment literature provides a veritable wealth of information that is potentially relevant to EBA; this very strength, though, is also a considerable liability, for the size of the literature is beyond voluminous. Not only is the literature vast in scope, but the scientific evaluation of assessment methods and instruments can also be without end because there is no finite set of studies that can establish, once and for all, the psychometric properties of an instrument (Kazdin, 2005; Sechrest, 2005). On the other hand, every single day, clinicians must make decisions about what assessment tools to use in their practices, how best to use and combine the various forms of information they obtain in their assessment, and how to integrate assessment activities into other necessary aspects of clinical service. Moreover, the limited time available for service provision in clinical settings places an onus on using assessment options that are maximally accurate, efficient, and cost-effective. Thus, above and beyond the scientific support that has been amassed for an instrument, clinicians require tools that are brief, clear, clinically feasible, and user-friendly.
In other words, they need instruments that have clinical utility and that are "good enough" to get the job done (Barlow, 2005; Lambert & Hawkins, 2004).

As has been noted in the assessment literature, there are no clear, commonly accepted guidelines to aid clinicians or researchers in determining when an instrument has sufficient scientific evidence to warrant its use (Kazdin, 2005; Sechrest, 2005). The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) set out generic standards to be followed in developing and using psychological instruments, but are silent on the question of specific psychometric values that an instrument should have. The basic reason for this is that psychometric characteristics are not properties of an instrument per se but, rather, are properties of an instrument when used for a specific purpose with a specific sample. Quite understandably, therefore, assessment scholars, psychometricians, and test developers have been reluctant to explicitly indicate the minimum psychometric values or evidence necessary to indicate that an instrument is scientifically sound (cf. Streiner & Norman, 2003). Unfortunately, this is of little aid to the clinicians and researchers who are constantly faced with the decision of whether an instrument is good enough, scientifically speaking, for the assessment task at hand.

There have been some isolated attempts to establish criteria for the selection and use of measures for research purposes. Robinson, Shaver, and Wrightsman (1991), for example, developed evaluative criteria for the adequacy of attitude and personality measures, covering the domains of theoretical development, item development, norms, inter-item correlations, internal consistency, test–retest reliability, factor analytic results, known-groups validity, convergent validity, discriminant validity, and freedom from response sets. Robinson and colleagues also used specific psychometric criteria for many of these domains, such as describing a coefficient α of .80 as exemplary. More recently, there have been efforts to establish general psychometric criteria for determining the suitability of measures for clinical use in measuring disability in speech/language disorders (Agency for Healthcare Research and Quality, 2002). A different approach was taken by the Measurement and Treatment Research to Improve Cognition in Schizophrenia Group to develop a consensus battery of cognitive tests to be used in clinical trials in schizophrenia (MATRICS, 2006). Rather than setting precise psychometric criteria for use in rating potential instruments, expert panelists were asked to rate, on a nine-point scale, each proposed tool's characteristics, including test–retest reliability, utility as a repeated measure, relation to functional outcome, responsiveness to treatment change, and practicality/tolerability.

Clearly, any attempt to develop a method for determining the scientific adequacy of assessment instruments is fraught with the potential for error. The application of criteria that are too stringent could result in a solid set of assessment options, but one that is so limited in number or scope as to render the whole effort clinically worthless. Alternatively, using excessively lenient criteria could undermine the whole notion of an instrument or process being evidence based.
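As a rough illustration of the MATRICS-style alternative to fixed psychometric cutoffs, expert panel ratings of a candidate test can be summarized characteristic by characteristic. The sketch below simply takes the median of hypothetical nine-point ratings; the panelists, the numbers, and the use of the median here are illustrative assumptions, and the actual MATRICS procedures are those described in the cited source.

```python
from statistics import median

# Hypothetical nine-point ratings (1 = poor, 9 = excellent) from three panelists
# for one candidate test, keyed by the characteristics rated in the MATRICS process.
panel_ratings = {
    "test-retest reliability":            [7, 8, 6],
    "utility as a repeated measure":      [8, 8, 9],
    "relation to functional outcome":     [5, 6, 6],
    "responsiveness to treatment change": [6, 7, 5],
    "practicality/tolerability":          [9, 8, 8],
}

# Summarize each characteristic by its median rating across panelists.
summary = {characteristic: median(ratings)
           for characteristic, ratings in panel_ratings.items()}

for characteristic, med in summary.items():
    print(f"{characteristic}: median rating = {med}")
```

Candidate tests can then be compared on these summaries rather than against fixed cutoffs.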
So, with a clear awareness of this assessment equivalent of Scylla and Charybdis, we sought to construct a framework for the chapters included in this volume that would employ good-enough criteria for rating psychological instruments. In other words, rather than focusing on standards that define ideal criteria for a measure, our intent was to provide criteria that would indicate the minimum evidence that would be sufficient to warrant the use of a measure for specific clinical purposes. We assume, from the outset, that although our framework is intended to be scientifically sound and defensible, it is a first step, rather than the definitive effort, in designing a rating system for evaluating psychometric adequacy. In brief, to operationalize the good-enough principle, we developed specific rating criteria to be used across categories of psychometric properties that have clear clinical relevance; each category has rating options of adequate, good, and excellent. In the following sections, we describe the assessment purposes covered by our rating system, the psychometric properties included in the system, and the rationales for the rating options. The actual rating system, used by authors in this volume to construct their summary tables of instruments, is presented in two tables later in the chapter.

ASSESSMENT PURPOSES

Although psychological assessments are conducted for many reasons, it is possible to identify a small set of interrelated purposes which form the basis for most assessments. These include (a) diagnosis (i.e., determining the nature and/or cause[s] of the presenting problems, which may or may not involve the use of a formal diagnostic or categorization system), (b) screening (i.e., identifying those who have or who are at risk for a particular problem and who might be helped by further assessment or intervention), (c) prognosis and other predictions (i.e., generating predictions about the course of the problems if left untreated, recommendations for possible courses of action to be considered, and their likely impact on the course of the problems), (d) case conceptualization/formulation (i.e., developing a comprehensive and clinically relevant understanding of the patient, generating hypotheses regarding critical aspects of the patient's psychosocial functioning and context that are likely to influence the patient's adjustment), (e) treatment design/planning (i.e., selecting/developing and implementing interventions designed to address the patient's problems by focusing on elements identified in the diagnostic evaluation and the case conceptualization), (f) treatment monitoring (i.e., tracking changes in symptoms, functioning, psychological characteristics, intermediate treatment goals, and/or variables determined to cause or maintain the problems), and (g) treatment evaluation (i.e., determining the effectiveness, social validity, consumer satisfaction, and/or cost-effectiveness of the intervention).

Our intent in conceptualizing this volume is to provide a summary of the best assessment methods and instruments for commonly encountered clinical assessment purposes. Therefore, although recognizing the importance of other possible assessment purposes, chapters in this volume focus on (a) diagnosis, (b) case conceptualization and treatment planning, and (c) treatment monitoring and treatment evaluation. Although separable in principle, we combined the purposes of case conceptualization and treatment planning because they tend to rely on the same assessment data.
Similarly, we combined the purposes of treatment monitoring and treatment evaluation because they often, but not exclusively, use the same assessment methods and instruments. Clearly, there are some overlapping elements, even in this set of purposes; for example, it is relatively common for the question of diagnosis to be revisited as part of evaluating the outcome of treatment. In the instrument summary tables that accompany each chapter, the psychometric strength of instruments used for these three main purposes is presented and rated. Within a chapter, the same instrument may be rated for more than one assessment purpose and thus appear in more than one table. As an instrument may possess more empirical support for some purposes than for others, the ratings given for the instrument may not be the same in each of the tables.

The chapters in this volume present information on the best available instruments for diagnosis, case conceptualization and treatment planning, and treatment monitoring and evaluation. They also provide details on clinically appropriate options for the range of data to collect, suggestions on how to address some of the challenges commonly encountered in conducting assessment, and suggestions for the assessment process itself. Consistent with the problem-specific focus within EBA outlined above, each chapter in this volume focuses on one or more specific disorders or conditions. However, many patients present with multiple problems and, therefore, there are frequent references within a given chapter to the assessment of common co-occurring problems that are addressed in other chapters in the volume. To be optimally useful to potential readers, the chapters are focused on the most commonly encountered disorders or conditions among children, adolescents, adults, older adults, and couples. With the specific focus on the three critical assessment purposes of diagnosis, case conceptualization and treatment planning, and treatment monitoring and treatment evaluation within each disorder or condition, the chapters in this volume provide readers with essential information for conducting the best EBAs currently possible.

PSYCHOMETRIC PROPERTIES AND RATING CRITERIA

Clinical assessment typically entails the use of both idiographic and nomothetic instruments. Idiographic measures are designed to assess unique aspects of a person's experience and, therefore, to be useful in evaluating changes in these individually defined and constructed variables. In contrast, nomothetic measures are designed to assess constructs assumed to be relevant to all individuals and to facilitate comparisons, on these constructs, across people. Most chapters include information on idiographic measures such as self-monitoring forms and individualized scales for measuring treatment goals (e.g., goal attainment scaling). For such idiographic measures, psychometric characteristics such as reliability and validity may, at times, not be easily evaluated or even relevant. It is crucial, however, that the same items and instructions are used across assessment occasions; without this level of standardization, it is impossible to accurately determine changes that may be due to treatment (Kazdin, 1993). Deciding on the psychometric categories to be rated for the nomothetic instruments was not a simple task, nor was developing concrete rating options for each of the categories.
In the end, we focused on nine categories: norms, internal consistency, inter-rater reliability, test–retest reliability, content validity, construct validity, validity generalization, sensitivity to treatment change, and clinical utility. Each of these categories is applied in relation to a specific assessment purpose (e.g., case conceptualization and treatment planning) in the context of a specific disorder or clinical condition (e.g., eating disorders, self-injurious behavior, relationship conflict). Consistent with our previous comments, factors such as gender, ethnicity, and age must be considered in making ratings within these categories. For each category, a rating of less than adequate, adequate, good, excellent, unavailable, or not applicable was possible. The precise nature of what constituted adequate, good, and excellent varied, of course, from category to category. In general, though, a rating of adequate indicated that the instrument meets a minimal level of scientific rigor, good indicated that the instrument would generally be seen as possessing solid scientific support, and excellent indicated there was extensive, high-quality supporting evidence. Accordingly, a rating of less than adequate indicated that the instrument did not meet the minimum level set out in the criteria. A rating of unavailable indicated that research on the psychometric property under consideration had not yet been conducted or published. A rating of not applicable indicated that the psychometric property under consideration was not relevant to the instrument (e.g., inter-rater reliability for a self-report symptom rating scale).

When considering the clinical use of a measure, it would be desirable to only use those measures that would meet, at a minimum, the criteria for good. However, as measure development is an ongoing process, we thought it was important to provide the option of the adequate rating in order to fairly evaluate (a) relatively newly developed measures and (b) measures for which comparable levels of research evidence are not available across all psychometric categories in the rating system. That being said, the only instruments included in chapter summary tables were those that had adequate or better ratings on the majority of the psychometric dimensions. Thus, the instruments presented in these tables represent only a subset of available assessment tools.

Despite the difficulty inherent in promulgating scientific criteria for psychometric properties, we believe that the potential benefits of fair and attainable criteria far outweigh the potential drawbacks (cf. Sechrest, 2005). Accordingly, we used both reasoned arguments from respected psychometricians and assessment scholars and, whenever possible, summaries of various assessment literatures to guide our selection of criteria for rating the psychometric properties associated with an instrument. Table 1.1 presents the criteria used in rating norms and reliability indices; Table 1.2 presents the criteria used in rating validity indices and clinical utility.

Norms

When using a standardized, nomothetically based instrument, it is essential that norms, specific criterion-related cutoff scores, or both are available to aid in the accurate interpretation of a client's test score (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999).
For example, norms can be used to determine the client's pre- and post-treatment levels of functioning and to evaluate whether any change in functioning is clinically meaningful (Achenbach, 2001; Kendall, Marrs-Garcia, Nath, & Sheldrick, 1999). Selecting the target population(s) for the norms and then ensuring that the norms are adequate can be difficult tasks, and several sets of norms may be required for a measure. One set of norms may be needed to determine the meaning of the obtained score relative to the general population, whereas a different set of norms could be used to compare the score to specific subgroups within the population (Cicchetti, 1994). Regardless of the population to which comparisons are to be made, a normative sample must be truly representative of the population with respect to demographics and other important characteristics (Achenbach, 2001). Ideally, whether conducted at the national level or the local level, this would involve probability-sampling efforts in which data are obtained from the majority of contacted respondents. As those familiar with psychological instruments are aware, such a sampling strategy is rarely used for the development of test norms. The reliance on data collected from convenience samples with unknown response rates reduces the accuracy of the resultant norms. Therefore, at a minimum, clinicians need to be provided with an indication of the quality and likely accuracy of the norms for a measure. Accordingly, the ratings for norms required, at a minimum for a rating of adequate, data from a single, large clinical sample. For a rating of good, normative data from multiple samples, including nonclinical samples, were required; when normative data from large, representative samples were available, a rating of excellent was applied.

TABLE 1.1 Criteria at a Glance: Norms and Reliability

Norms
Adequate = Measures of central tendency and distribution for the total score (and subscores, if relevant) based on a large, relevant, clinical sample are available
Good = Measures of central tendency and distribution for the total score (and subscores, if relevant) based on several large, relevant samples (must include data from both clinical and nonclinical samples) are available
Excellent = Measures of central tendency and distribution for the total score (and subscores, if relevant) based on one or more large, representative samples (must include data from both clinical and nonclinical samples) are available

Internal consistency
Adequate = Preponderance of evidence indicates α values of .70–.79
Good = Preponderance of evidence indicates α values of .80–.89
Excellent = Preponderance of evidence indicates α values ≥ .90

Inter-rater reliability
Adequate = Preponderance of evidence indicates κ values of .60–.74; the preponderance of evidence indicates Pearson correlation or intraclass correlation values of .70–.79
Good = Preponderance of evidence indicates κ values of .75–.84; the preponderance of evidence indicates Pearson correlation or intraclass correlation values of .80–.89
Excellent = Preponderance of evidence indicates κ values ≥ .85; the preponderance of evidence indicates Pearson correlation or intraclass correlation values ≥ .90

Test–retest reliability
Adequate = Preponderance of evidence indicates test–retest correlations of at least .70 over a period of several days to several weeks
Good = Preponderance of evidence indicates test–retest correlations of at least .70 over a period of several months
Excellent = Preponderance of evidence indicates test–retest correlations of at least .70 over a period of a year or longer

Reliability

Reliability is a key psychometric element to be considered in evaluating an instrument. It refers to the consistency of a person's score on a measure (Anastasi, 1988), including whether (a) all elements of a measure contribute in a consistent way to the data obtained (internal consistency), (b) similar results would be obtained if the measure was used or scored by another clinician (inter-rater reliability),¹ or (c) similar results would be obtained if the person completed the measure a second time (test–retest reliability or test stability). Not all reliability indices are relevant to all assessment methods and measures, and the size of the indices may vary on the basis of the samples used.

TABLE 1.2 Criteria at a Glance: Validity and Utility

Content validity
Adequate = The test developers clearly defined the domain of the construct being assessed and ensured that selected items were representative of the entire set of facets included in the domain
Good = In addition to the criteria used for an adequate rating, all elements of the instrument (e.g., instructions, items) were evaluated by judges (e.g., by experts or by pilot research participants)
Excellent = In addition to the criteria used for a good rating, multiple groups of judges were employed and quantitative ratings were used by the judges

Construct validity
Adequate = Some independently replicated evidence of construct validity (e.g., predictive validity, concurrent validity, and convergent and discriminant validity)
Good = Preponderance of independently replicated evidence, across multiple types of validity (e.g., predictive validity, concurrent validity, and convergent and discriminant validity), is indicative of construct validity
Excellent = In addition to the criteria used for a good rating, evidence of incremental validity with respect to other clinical data

Validity generalization
Adequate = Some evidence supports the use of this instrument with either (a) more than one specific group (based on sociodemographic characteristics such as age, gender, and ethnicity) or (b) in multiple contexts (e.g., home, school, primary care setting, inpatient setting)
Good = Preponderance of evidence supports the use of this instrument with either (a) more than one specific group (based on sociodemographic characteristics such as age, gender, and ethnicity) or (b) in multiple settings (e.g., home, school, primary care setting, inpatient setting)
Excellent = Preponderance of evidence supports the use of this instrument with more than one specific group (based on sociodemographic characteristics such as age, gender, and ethnicity) and across multiple contexts (e.g., home, school, primary care setting, inpatient setting)

Treatment sensitivity
Adequate = Some evidence of sensitivity to change over the course of treatment
Good = Preponderance of independently replicated evidence indicates sensitivity to change over the course of treatment
Excellent = In addition to the criteria used for a good rating, evidence of sensitivity to change across different types of treatments

Clinical utility
Adequate = Taking into account practical considerations (e.g., costs, ease of administration, availability of administration and scoring instructions, duration of assessment, availability of relevant cutoff scores, acceptability to patients), the resulting assessment data are likely to be clinically useful
Good = In addition to the criteria used for an adequate rating, there is some published evidence that the use of the resulting assessment data confers a demonstrable clinical benefit (e.g., better treatment outcome, lower treatment attrition rates, greater patient satisfaction with services)
Excellent = In addition to the criteria used for an adequate rating, there is independently replicated published evidence that the use of the resulting assessment data confers a demonstrable clinical benefit

With respect to internal consistency, we focused on coefficient α, which is the most widely used index (Streiner, 2003). Recommendations in the literature for what constitutes adequate internal consistency vary, but most authorities seem to view .70 as the minimum acceptable value (e.g., Cicchetti, 1994), and Charter (2003) reported that the mean internal consistency value among commonly used clinical instruments was .81. Accordingly, a rating of adequate was given to α values of .70–.79, a rating of good required values of .80–.89, and, finally, because of cogent arguments that an α value of at least .90 is highly desirable in clinical assessment contexts (Nunnally & Bernstein, 1994), we required values of .90 or greater for an instrument to be rated as having excellent internal consistency. It should be noted that it is possible for α to be too (artificially) high, as a value close to unity typically indicates substantial redundancy among items (cf. Streiner, 2003).

These value ranges were also used in rating evidence for inter-rater reliability when assessed with Pearson correlations or intraclass correlations. Appropriate adjustments were made to the value ranges when κ statistics were used, in line with the recommendations discussed by Cicchetti (1994; see also Charter, 2003). Importantly, evidence for inter-rater reliability could only come from data generated among clinicians or clinical raters; estimates of cross-informant agreement, such as between parent and teacher ratings, are not indicators of reliability.

In establishing ratings for test–retest reliability values, our requirement for a minimum correlation of .70 was influenced by summary data reported on typical test–retest reliability results found with clinical instruments (Charter, 2003) and trait-like psychological measures (Watson, 2004). Of course, not all constructs or measures are expected to show temporal stability (e.g., measures of state-like variables, life stress inventories), so test–retest reliability was only rated if it was relevant. A rating of adequate required evidence of correlation values of .70 or greater when reliability was assessed over a period of several days to several weeks. We then faced a challenge in determining appropriate criteria for good and excellent ratings. In order to enhance its likely usefulness, we wanted a rating system that was relatively simple. However, test–retest reliability is a complex phenomenon that is influenced by (a) the nature of the construct being assessed (i.e., it can be state-like, trait-like, or influenced by situational variables), (b) the time frame covered by the reporting period instructions (i.e., whether respondents are asked to report their current functioning, functioning over the past few days, or functioning over an extended period, such as general functioning in the past year), and (c) the duration of the retest period (i.e., whether the time between two administrations of the instrument involved days, weeks, months, or years).
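As a concrete illustration of how the internal consistency bands in Table 1.1 operate, the sketch below computes coefficient α for a small, invented set of item responses and maps the value onto the rating options; analogous mappings apply to the inter-rater and test–retest criteria. The data and function names are hypothetical, and a single computation of this kind would not, by itself, determine a rating, which rests on the preponderance of published evidence.

```python
# Illustrative only: coefficient alpha for a hypothetical 4-item scale,
# mapped onto the internal consistency bands from Table 1.1.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_variance = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def rate_internal_consistency(alpha: float) -> str:
    """Map an alpha value onto the Table 1.1 rating bands."""
    if alpha >= 0.90:
        return "excellent"
    if alpha >= 0.80:
        return "good"
    if alpha >= 0.70:
        return "adequate"
    return "less than adequate"

# Hypothetical responses from six respondents to four items.
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 4],
])

alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.2f}, rating = {rate_internal_consistency(alpha)}")
```

Note that an α approaching 1.0 would not strengthen the case further; as discussed above, values close to unity usually signal item redundancy.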
In the end, rather than emphasize the value of increasingly large test–retest correlations, we decided to maintain the requirement for correlation values of .70 or greater, but to require increasing retest period durations of (a) several months and (b) at least a year for ratings of good and excellent, respectively.

Validity

Validity is another central aspect to be considered when evaluating psychometric properties. Foster and Cone (1995) drew an important distinction between representational validity (i.e., whether a measure really assesses what it purports to measure) and elaborative validity (i.e., whether the measure has any utility for measuring the construct). Attending to the content validity of a measure is a basic, but frequently overlooked, step in evaluating representational validity (Haynes, Richard, & Kubany, 1995). As discussed by Smith, Fischer, and Fister (2003), the overall reliability and validity of an instrument is directly affected by the extent to which items in the instrument adequately represent the various aspects or facets of the construct the instrument is designed to measure. Assuming that representational validity has been established, it is elaborative validity that is central to clinicians' use of a measure. Accordingly, replicated evidence for a measure's concurrent, predictive, discriminative, and, ideally, incremental validity (Hunsley & Meyer, 2003) should be available to qualify a measure for consideration as evidence based. We have indicated already that validation is a context-sensitive concept; inattention to this fact can lead to inappropriate generalizations being made about a measure's validity. There should be, therefore, replicated elaborative validity evidence for each purpose of the measure and for each population or group for which the measure is intended to be used. This latter point is especially relevant when considering an instrument for clinical use, and thus it is essential to consider evidence for validity generalization, that is, the extent to which there is evidence for validity across a range of samples and settings (cf. Messick, 1995; Schmidt & Hunter, 1977).

For ratings of content validity evidence, we followed Haynes et al.'s (1995) suggestions, requiring explicit consideration of the construct facets to be included in the measure and, as the ratings increased, involvement of content validity judges to assess the measure. Unlike the situation for reliability, there are no commonly accepted summary statistics to evaluate either construct validity or incremental validity (but see, respectively, Westen & Rosenthal [2003] and Hunsley & Meyer [2003]). As a result, our ratings were based on the requirement of increasing amounts of replicated evidence of predictive validity, concurrent validity, convergent validity, and discriminant validity; in addition, for a rating of excellent, evidence of incremental validity was also required. We were unable to find any clearly applicable standards in the literature to guide us in developing criteria for validity generalization or treatment sensitivity (a dimension rated only for instruments used for the purposes of treatment monitoring and treatment evaluation). Therefore, adequate ratings for these dimensions required some evidence of, respectively, the use of the instrument with either more than one specific group or in multiple contexts, and evidence of sensitivity to change over the course of treatment.
Consistent with ratings for other dimensions, good and excellent ratings required increasingly demanding levels of evidence in these areas.

Utility

It is also essential to know the utility of an instrument for a specific clinical purpose. The concept of clinical utility, applied to both diagnostic systems (e.g., Kendell & Jablensky, 2003) and assessment tools (e.g., Hunsley & Bailey, 1999; Yates & Taub, 2003), has received a great deal of attention in recent years. Although definitions vary, they have in common an emphasis on garnering evidence regarding actual improvements in both decisions made by clinicians and service outcomes experienced by patients. Unfortunately, despite thousands of studies on the reliability and validity of psychological instruments, only scant attention is paid to matters of utility in most assessment research studies (McGrath, 2001). This has directly contributed to the present state of affairs in which there is very little replicated evidence that psychological assessment data have a direct impact on improved provision and outcome of clinical services. At present, therefore, for the majority of psychological instruments, a determination of clinical utility must often be made on the basis of likely clinical value, rather than on empirical evidence.

Compared to the criteria for the psychometric dimensions presented thus far, our standards for evidence of clinical utility were noticeably less demanding. This was necessary because of the paucity of information on the extent to which assessment instruments are acceptable to patients, enhance the quality and outcome of clinical services, and/or are worth the costs associated with their use. Therefore, we relied on authors' expert opinions to classify an instrument as having adequate clinical utility. The availability of any supporting evidence of utility was sufficient for a rating of good, and replicated evidence of utility was necessary for a rating of excellent.

The instrument summary tables also contain one final column, used to indicate instruments that are the best measures currently available to clinicians for specific purposes and disorders and, thus, are highly recommended for clinical use. Given the considerable differences in the state of the assessment literature for different disorders/conditions, chapter authors had some flexibility in determining their own precise requirements for an instrument to be rated, or not rated, as highly recommended. However, to ensure a moderate level of consistency in these ratings, a highly recommended rating could only be considered for those instruments having achieved ratings of good or excellent in the majority of their rated psychometric categories.

SOME FINAL THOUGHTS

We are hopeful that the rating system described in this chapter, and applied in each of the chapters of this book, will serve to advance the state of evidence-based psychological assessment. We also hope that it will serve as a stimulus for others to refine and improve upon our efforts. Whatever the possible merits of the rating system, we wish to close this chapter by drawing attention to three critical issues related to its use.

First, although the rating system used for this volume is relatively simple, the task of rating psychometric properties is not. Results from many studies must be considered in making such ratings, and precise quantitative standards were not set for how to weight the results from studies.
Furthermore, in the spirit of evidence-based practice, it is also important to note that we do not know whether these ratings are, themselves, reliable. Reliance on individual expert judgment, no matter how extensive and current the knowledge of the experts, is not as desirable as basing evidence-based conclusions and guidance on systematic reviews of the literature conducted according to a consensually agreed upon rating system (cf. GRADE Working Group, 2004). However, for all the potential limitations and biases inherent in our approach, reliance on expert review of the scientific literature is the current standard in psychology and, thus, was the only feasible option for the volume at this time.

The second issue has to do with the responsible clinical use of the guidance provided by the rating system. Consistent with evaluation and grading strategies used throughout evidence-based medicine and evidence-based psychology initiatives, many of our rating criteria relied upon the consideration of the preponderance of data relevant to each dimension. Such a strategy recognizes both the importance of replication in science and the fact that variability across studies in research design elements (including sample composition and research setting) will influence estimates of these psychometric dimensions. However, we hasten to emphasize that reliance on the preponderance of evidence for these ratings does not imply or guarantee that an instrument is applicable for all patients or clinical settings. Our intention is to have these ratings provide indications about scientifically strong measures that warrant consideration for clinical and research use. As with all evidence-based efforts, the responsibility rests with the individual professional to determine the suitability of an instrument for the specific setting, purpose, and individuals to be assessed.

Third, as emphasized throughout this volume, focusing on the scientific evidence for specific assessment tools should not overshadow the fact that the process of clinical assessment involves much more than simply selecting and administering the best available instruments. Choosing the best, most relevant instruments is unquestionably an important step. Subsequent steps must ensure that the instruments are administered in an appropriate manner, accurately scored, and then individually interpreted in accordance with the relevant body of scientific research. However, to ensure a truly evidence-based approach to assessment, the major challenge is to then integrate all of the data within a process that is, itself, evidence based. Much of our focus in this chapter has been on evidence-based methods and instruments, in large part because (a) methods and specific measures are more easily identified than are processes and (b) the main emphasis in the assessment literature has been on psychometric properties of methods and instruments. As we indicated early in the chapter, an evidence-based approach to assessment should be developed in light of evidence on the accuracy and usefulness of this complex, iterative decision-making task. Although the chapters in this volume provide considerable assistance for having the assessment process be informed by scientific evidence, the future challenge will be to ensure that the entire process of assessment is evidence based.

Note

1. Although we chose to use the term inter-rater reliability, there is some discussion in the assessment literature about whether the term should be inter-rater agreement. Heyman et al. (2001), for example, suggested that, as indices of inter-rater reliability do not contain information about individual differences among participants and only contain information about one source of error (i.e., differences among raters), they should be considered to be indices of agreement, not reliability.

References

Achenbach, T. M. (2001). What are norms and why do we need valid ones? Clinical Psychology: Science and Practice, 8, 446–450.
Achenbach, T. M. (2005). Advancing assessment of children and adolescents: Commentary on evidence-based assessment of child and adolescent disorders. Journal of Clinical Child and Adolescent Psychology, 34, 541–547.
Agency for Healthcare Research and Quality. (2002). Criteria for determining disability in speech-language disorders. AHRQ Publication No. 02-E009.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
American Psychological Association Presidential Task Force on Evidence-Based Practice. (2006). Evidence-based practice in psychology. American Psychologist, 61, 271–285.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Barlow, D. H. (2005). What's new about evidence-based assessment? Psychological Assessment, 17, 308–311.
Bell, D., Foster, S. L., & Mash, E. J. (Eds.). (2005). Handbook of behavioral and emotional problems in girls. New York: Kluwer/Academic.
Charter, R. A. (2003). A breakdown of reliability coefficients by test type and reliability method, and the clinical implications of low reliability. Journal of General Psychology, 130, 290–304.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6, 284–290.
Doss, A. J. (2005). Evidence-based diagnosis: Incorporating diagnostic instruments into clinical practice. Journal of the American Academy of Child & Adolescent Psychiatry, 44, 947–952.
Foster, S. L., & Cone, J. D. (1995). Validity issues in clinical assessment. Psychological Assessment, 7, 248–260.
Frazier, T. W., & Youngstrom, E. A. (2006). Evidence-based assessment of attention-deficit/hyperactivity disorder: Using multiple sources of information. Journal of the American Academy of Child & Adolescent Psychiatry, 45, 614–620.
GRADE Working Group. (2004). Grading quality of evidence and strength of recommendations. British Medical Journal, 328, 1490–1497.
Haynes, S. N., Leisen, M. B., & Blaine, D. D. (1997). Design of individualized behavioral treatment programs using functional analytic clinical case methods. Psychological Assessment, 9, 334–348.
Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238–247.
Hermann, R. C., Chan, J. A., Zazzali, J. L., & Lerner, D. (2006). Aligning measurement-based quality improvement with implementation of evidence-based practices. Administration and Policy in Mental Health and Mental Health Services Research, 33, 636–645.
Heyman, R. E., Chaudhry, B. R., Treboux, D., Crowell, J., Lord, C., Vivian, D., et al. (2001). How much observational data is enough? An empirical test using marital interaction coding. Behavior Therapy, 32, 107–123.
Hsu, L. M. (2002). Diagnostic validity statistics and the MCMI-III. Psychological Assessment, 14, 410–422.
Hunsley, J., & Bailey, J. M. (1999). The clinical utility of the Rorschach: Unfulfilled promises and an uncertain future. Psychological Assessment, 11, 266–277.
Hunsley, J., & Lee, C. M. (2007). Research-informed benchmarks for psychological treatments: Efficacy studies, effectiveness studies, and beyond. Professional Psychology: Research and Practice, 38, 21–33.
Hunsley, J., Lee, C. M., & Wood, J. M. (2003). Controversial and questionable assessment techniques. In S. O. Lilienfeld, S. J. Lynn, & J. M. Lohr (Eds.), Science and pseudoscience in clinical psychology (pp. 39–76). New York: Guilford.
Hunsley, J., & Mash, E. J. (2005). Introduction to the special section on developing guidelines for the evidence-based assessment (EBA) of adult disorders. Psychological Assessment, 17, 251–255.
Hunsley, J., & Mash, E. J. (2007). Evidence-based assessment. Annual Review of Clinical Psychology, 3, 57–79.
Hunsley, J., & Meyer, G. J. (2003). The incremental validity of psychological testing and assessment: Conceptual, methodological, and statistical issues. Psychological Assessment, 15, 446–455.
Kazdin, A. E. (1993). Evaluation in clinical practice: Clinically sensitive and systematic methods of treatment delivery. Behavior Therapy, 24, 11–45.
Kazdin, A. E. (2003). Psychotherapy for children and adolescents. Annual Review of Psychology, 54, 253–276.
Kazdin, A. E. (2005). Evidence-based assessment of child and adolescent disorders: Issues in measurement development and clinical application. Journal of Clinical Child and Adolescent Psychology, 34, 548–558.
Kazdin, A. E., & Weisz, J. R. (Eds.). (2003). Evidence-based psychotherapies for children and adolescents. New York: Guilford.
Kendall, P. C., Marrs-Garcia, A., Nath, S. R., & Sheldrick, R. C. (1999). Normative comparisons for the evaluation of clinical significance. Journal of Consulting and Clinical Psychology, 67, 285–299.
Kendell, R., & Jablensky, A. (2003). Distinguishing between the validity and utility of psychiatric diagnoses. American Journal of Psychiatry, 160, 4–12.
Krishnamurthy, R., VandeCreek, L., Kaslow, N. J., Tazeau, Y. N., Miville, M. L., Kerns, R., et al. (2004). Achieving competency in psychological assessment: Directions for education and training. Journal of Clinical Psychology, 60, 725–739.
Lambert, M. J. (Ed.). (2001). Patient-focused research [Special section]. Journal of Consulting and Clinical Psychology, 69, 147–204.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and using brief outcome instruments. Professional Psychology: Research and Practice, 35, 492–499.
Mash, E. J., & Barkley, R. A. (Eds.). (2006). Treatment of childhood disorders (3rd ed.). New York: Guilford.
Mash, E. J., & Barkley, R. A. (Eds.). (2007). Assessment of childhood disorders (4th ed.). New York: Guilford.
Mash, E. J., & Hunsley, J. (2005). Evidence-based assessment of child and adolescent disorders: Issues and challenges. Journal of Clinical Child and Adolescent Psychology, 34, 362–379.
Mash, E. J., & Hunsley, J. (2007). Assessment of child and family disturbance: A developmental systems approach. In E. J. Mash & R. A. Barkley (Eds.), Assessment of childhood disorders (pp. 3–50). New York: Guilford.
Mash, E. J., & Terdal, L. G. (1997). Assessment of child and family disturbance: A behavioral-systems approach. In E. J. Mash & L. G. Terdal (Eds.), Assessment of childhood disorders (3rd ed., pp. 3–68). New York: Guilford.
MATRICS. (2006). Results of the MATRICS RAND Panel Meeting: Average medians for the categories of each candidate test. Retrieved August 23, 2007, from http://www.matrics.ucla.edu/matrics-psychometrics-frame.htm
McGrath, R. E. (2001). Toward more clinically relevant assessment research. Journal of Personality Assessment, 77, 307–332.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Neisworth, J. T., & Bagnato, S. J. (2000). Recommended practices in assessment. In S. Sandall, M. E. McLean, & B. J. Smith (Eds.), DEC recommended practices in early intervention/early childhood special education (pp. 17–27). Longmont, CO: Sopris West.
Norcross, J. C., Koocher, G. P., & Garofalo, A. (2006). Discredited psychological treatments and tests: A Delphi poll. Professional Psychology: Research and Practice, 37, 515–522.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Pelham, W. E., Fabiano, G. A., & Massetti, G. M. (2005). Evidence-based assessment of attention deficit hyperactivity disorder in children and adolescents. Journal of Clinical Child and Adolescent Psychology, 34, 449–476.
Peterson, D. R. (2004). Science, scientism, and professional responsibility. Clinical Psychology: Science and Practice, 11, 196–210.
Ramirez, M., Ford, M. E., Stewart, A. L., & Teresi, J. A. (2005). Measurement issues in health disparities research. Health Services Research, 40, 1640–1657.
Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (1991). Criteria for scale selection and evaluation. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 1–16). New York: Academic Press.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540.
Sechrest, L. (2005). Validity of measures is no simple matter. Health Services Research, 40, 1584–1604.
Smith, G. T., Fischer, S., & Fister, S. M. (2003). Incremental validity principles in test construction. Psychological Assessment, 15, 467–477.
Sonderegger, R., & Barrett, P. M. (2004). Assessment and treatment of ethnically diverse children and adolescents. In P. M. Barrett & T. H. Ollendick (Eds.), Handbook of interventions that work with children and adolescents: Prevention and treatment (pp. 89–111). New York: John Wiley.
Streiner, D. L. (2003). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80, 99–103.
Streiner, D. L., & Norman, G. R. (2003). Health measurement scales: A practical guide to their development and use (3rd ed.). New York: Oxford University Press.
Vermeersch, D. A., Lambert, M. J., & Burlingame, G. M. (2000). Outcome Questionnaire: Item sensitivity to change. Journal of Personality Assessment, 74, 242–261.
Watson, D. (2004). Stability versus change, dependability versus error: Issues in the assessment of personality over time. Journal of Research in Personality, 38, 319–350.
Weersing, V. R. (2005). Benchmarking the effectiveness of psychotherapy: Program evaluation as a component of evidence-based practice. Journal of the American Academy of Child & Adolescent Psychiatry, 44, 1058–1062.
Westen, D., & Rosenthal, R. (2003). Quantifying construct validity: Two simple measures. Journal of Personality and Social Psychology, 84, 608–618.
Yates, B. T., & Taub, J. (2003). Assessing the costs, benefits, cost-effectiveness, and cost-benefit of psychological assessment: We should, we can, and here's how. Psychological Assessment, 15, 478–495.
Youngstrom, E. A., & Duax, J. (2005). Evidence-based assessment of pediatric bipolar disorder, Part I: Base rate and family history. Journal of the American Academy of Child & Adolescent Psychiatry, 44, 712–717.