
u03d2 Bias

Chapter 19, "Test Bias," pages 511-544.

Imagine that you are a professional who wants to evaluate the progress of a given client through psychological testing.

Identify the different sources of bias you might encounter in the assessment process. Discuss the specific steps you would take to reduce test bias effects and what you would look for in an assessment instrument for purposes of bias reduction. Describe how the standard deviation of a test score affects the interpretation of test scores.

Response Guidelines

Critically respond to at least one other learner. Are their strategies for reducing bias adequate? Provide feedback to improve their strategy or strengthen their post.

The issue of systemic bias in psychological testing has been the subject of debate for many years. Standard 7.3 of the Standards text indicates that when credible research reports that differential item functioning exists across age, gender, racial/ethnic, cultural, disability, and/or linguistic groups in the population of test takers in the content domain measured by the test, test developers should conduct appropriate studies when feasible. Such research should seek to detect and eliminate aspects of test design, content, and format that might bias test scores for particular groups (AERA, APA, & NCME, 1999). It follows from this statement that even if a test seems reliable and valid, it can be biased if it systematically distorts the true scores of certain individuals or groups.

Although test bias has been difficult to define, Hunter and Schmidt (1976) identify three ethical positions at work in testing bias: unqualified individualism, the use of quotas, and qualified individualism (as cited in Kaplan & Saccuzzo, 2009). Unqualified individualism uses tests to select the most qualified individuals and is otherwise indifferent to race and gender, although it permits using such information when it improves prediction. The use of quotas explicitly recognizes race and gender in the selection process. Qualified individualism shares the goal of selecting the most qualified individuals but excludes race, gender, and religion as predictors, even when they would improve prediction.

Test bias can appear in many forms. Some of the common ones are cultural, gender, ethnic/racial, age, measurement, and prediction bias. Measurement bias occurs when a test makes systematic errors in measuring a particular characteristic or attribute. Prediction bias occurs when a test makes systematic errors in predicting some outcome.

To judge whether a test is biased and will yield scores that reflect content-irrelevant factors, one must first ask two questions: (1) Does the test fulfill the intent of the developer? and (2) Does the test contain any content-irrelevant information that reveals a correct answer? To reduce bias, one must check the test for the following: language that may have different meanings for different groups; stereotypes; language that is offensive or demeaning; religious language or references that are characteristic of particular groups; bias toward a specific geographical region; and assumptions about a particular socioeconomic status. If testing bias does exist, an investigation should follow. The test should be removed from use until further validation is provided, and test developers should review the test and make appropriate adjustments.

The standard deviation is, roughly, the average deviation of scores around the mean. In practice, the standard deviation of observed scores, combined with the test's reliability, is used to estimate the standard error of measurement. Classical test theory tells us that errors of measurement are random because testing instruments are imperfect in measuring a person's true score (Kaplan & Saccuzzo, 2009). Repeated applications of the same test can yield different scores. The standard deviation of those scores tells us something about the error of measurement around the true score, or mean.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Kaplan, R. M., & Saccuzzo, D. P. (2009). Psychological testing: Principles, applications, and issues (7th ed.). Belmont, CA: Cengage.

Even if a test is both reliable and valid, that is no guarantee that it will not be biased for, or against, certain groups. A test is biased if it systematically underestimates or overestimates the true scores of certain individuals. For instance, knowledge-based tests are always biased if given to people who have no realistic way of knowing the answers. Another example would be an English-based intelligence test given to a non-English speaker. It is very important for tests in selection procedures to be unbiased so that certain groups are not underrepresented in the workforce. Bias can be due to external factors, such as group differences, and/or to internal factors, such as some questions being significantly harder for a particular group. But it is important to remember that individual differences within groups tend to be much greater than differences between groups, and it can be very difficult to decide upon a criterion for forming groups in the first place. Also, a very large sample is required before subtle degrees of bias can be detected, but if the sample is too large then almost every item will show a small, but statistically significant, degree of bias.
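
The point about sample size can be illustrated with a small sketch. It assumes a hypothetical item answered correctly by 62% of one group and 60% of another, and uses an ordinary two-proportion z test with equal group sizes; the function name and the numbers are made up for illustration. A raw difference in proportion correct is not by itself evidence of bias, since the groups may differ on the construct, but the sketch shows how the same small gap moves from "not significant" to "significant" purely because the sample grows.

```python
import math


def two_proportion_z(p_ref, p_focal, n_per_group):
    """z statistic for the difference between two proportions correct,
    assuming equal group sizes and a pooled standard error."""
    pooled = (p_ref + p_focal) / 2
    se = math.sqrt(pooled * (1 - pooled) * (2 / n_per_group))
    return (p_ref - p_focal) / se


# Hypothetical item: 62% correct in one group, 60% in the other.
for n in (200, 2_000, 20_000):
    z = two_proportion_z(0.62, 0.60, n)
    flag = "significant" if abs(z) > 1.96 else "not significant"
    print(f"n per group = {n:>6}: z = {z:.2f} ({flag} at the .05 level)")
```

The two-point gap is identical in every row; only the sample size changes. This is one reason the size of a difference is worth judging alongside its significance.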

Response sets = psychological orientation or bias towards answering in a particular way:
o Acquiescence: tendency to agree, i.e. say "Yes". Hence the use of half negatively and half positively worded items (but there can be semantic difficulties with negative wording).
o Social desirability: tendency to portray self in a positive light. Try to design questions so that social desirability isn't salient.
o Faking bad: purposely saying "no" or looking bad if there's a "reward" (e.g. attention, compensation, social welfare, etc.).

Bias
o Cultural bias: Does the psychological construct have the same meaning from one culture to another? How are the different items interpreted by people from different cultures? Actual content (face) validity may be different for different cultures.
o Gender bias may also be possible.

Test Bias
o Bias in measurement occurs when the test makes systematic errors in measuring a particular characteristic or attribute, e.g. many say that most IQ tests may well be valid for middle-class whites but not for blacks or other minorities. In interviews, which are a type of test, research shows that there is a bias in favour of good-looking applicants.
o Bias in prediction occurs when the test makes systematic errors in predicting some outcome (or criterion). It is often suggested that tests used in academic admissions and in personnel selection under-predict the performance of minority applicants. Also, a test may be useful for predicting the performance of one group, e.g. males, but be less accurate in predicting the performance of another, e.g. females.
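
Bias in prediction can be sketched as follows. A single regression line is fitted to everyone, and the average prediction error is then checked within each group. The data, group labels, and function names are hypothetical and chosen only to make the pattern visible; this is a minimal illustration, not a full differential-prediction analysis.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_residual(x, y, slope, intercept):
    """Average of (actual - predicted); positive means under-prediction."""
    return sum(yi - (slope * xi + intercept) for xi, yi in zip(x, y)) / len(x)


# Hypothetical test scores and later criterion outcomes for two groups.
scores_a, outcome_a = [40, 50, 60, 70, 80], [2.0, 2.5, 3.0, 3.5, 4.0]
scores_b, outcome_b = [40, 50, 60, 70, 80], [2.5, 3.0, 3.5, 4.0, 4.5]

# One common line fitted to both groups together.
slope, intercept = fit_line(scores_a + scores_b, outcome_a + outcome_b)

resid_a = mean_residual(scores_a, outcome_a, slope, intercept)
resid_b = mean_residual(scores_b, outcome_b, slope, intercept)
print(f"group A mean residual: {resid_a:+.2f}")
print(f"group B mean residual: {resid_b:+.2f}")
```

With these made-up numbers the common line over-predicts the outcome for group A and under-predicts it for group B by the same amount, which is the kind of systematic prediction error described above.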


Test Bias and How to Identify It


An assessment item is considered biased if it favors examinees, not because of their knowledge or skill relative to a learning objective, but because of their membership in a particular group. Group membership may be related to such factors as socioeconomic status, race, religion, gender, cultural background, region, or age, among others. To be free of bias, an item should measure content in a way that neither advantages nor disadvantages examinees because of group membership. According to the Standards for Educational and Psychological Testing, the term bias in tests and testing refers to construct-irrelevant components that result in systematically lower or higher scores for identifiable groups of examinees.

Classical reliability theory provides us with a way to think about this problem. It shows us that an obtained score X for student s (i.e., $X_s$) can be thought of as being made up of two components: $T_s$ (a true score) and $E_s$ (an error score) for that student, on that particular test.

$$X_s = T_s + E_s$$

In this equation, the error term can also be partitioned into two parts, random error and systematic error. If we could eliminate test bias, we would therefore also reduce measurement error, and the observed scores for students would more closely reflect what they actually know about the construct being tested (i.e., the true score).

Construct-irrelevant score components may be introduced into tests due to inappropriate sampling of test content or lack of clarity in test instructions. They may also arise if scoring criteria fail to credit fully some correct problem approaches or solutions that are more typical of one group than another.

Notice that, according to this interpretation of the equation above, bias can also work in favor of students by making it appear that they know more about what is being measured than they actually do. Test-wise students can often use clues within the test or individual assessment items to boost their observed score even though they know less about the content being assessed than their lower-scoring classmates.

Another kind of item bias can occur because of stereotyping of groups that individual students may associate with. In general, assessment writers should remove any elements that are offensive or questionable and would therefore draw attention away from the purpose of the assessment. The characterization of any group within test items should not be at the expense of that group. That is, jargon, slang, and demeaning characterizations should not be used. References to color, marital status, or gender should only be made when they are relevant to the context (e.g., gender-neutral terms should be used whenever possible).
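
The split of the error term into random and systematic parts, described at the start of this section, can be illustrated with a small simulation. It is only a sketch under stated assumptions: every simulated student has the same true score of 75, random error is normally distributed with a standard deviation of 5, and the systematic error for the disadvantaged group is a constant -4 points. All of these values and names are hypothetical.

```python
import random


def observed_score(true_score, systematic_error, random_sd, rng):
    """X = T + E, with E split into a systematic part and a random part."""
    random_error = rng.gauss(0, random_sd)
    return true_score + systematic_error + random_error


rng = random.Random(0)  # fixed seed so the run is repeatable
true_score = 75         # hypothetical true score shared by every student
n = 10_000

# Group with random error only, and group with an added constant bias.
unbiased = [observed_score(true_score, 0, 5, rng) for _ in range(n)]
biased = [observed_score(true_score, -4, 5, rng) for _ in range(n)]

mean_unbiased = sum(unbiased) / n
mean_biased = sum(biased) / n
print(f"mean observed score, random error only: {mean_unbiased:.2f}")
print(f"mean observed score, with systematic error: {mean_biased:.2f}")
```

Random error tends to cancel out across many students, so the first mean stays close to the true score. The systematic component does not cancel, so the second mean stays shifted by roughly the size of the bias no matter how many students are tested.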

Test authors should not use any elements that they think might malign or give unfair advantage to any subgroup of examinees. When writing or reviewing assessment materials for bias, consider the following:

1. Does the item contain language that is not commonly used or has different connotations for different groups (e.g., depending on where you live)?
2. Does the item portray group members in a stereotypical manner? These could include activities, occupations, or emotions.
3. Does the item contain wording demeaning or offensive to a particular group?
4. Does the item include religious references some students may not know?
5. Does the item assume that all students come from the same socioeconomic background (e.g., a suburban home with a two-car garage)?
6. Is the item appropriate for the geographical region?

The following guidelines are provided to help reduce bias and distortion in multiple-choice items. When creating a test, you must constantly ask two questions of each item: (a) Did I communicate my intent clearly? and (b) Did I give any content-irrelevant clues to the correct answer?

1. All assessment items must be clearly aligned with their learning targets.
2. An item should clearly measure its learning target by conforming to the assessment characteristics for that learning target.
3. An item should only measure one objective.
4. The reading level, and/or complexity of stimulus material, should be appropriate to the examinees.
5. Make directions clear so that all students know what is expected of them.
6. The item stem should clearly formulate a problem or question (i.e., use a direct question instead of an incomplete statement whenever possible).
7. Avoid items that present an unnecessary instructional aside.
8. Item stems should not contain irrelevant material unless this somehow serves the purpose of the question.
9. Distractors should be plausible to uninformed examinees, and contain common misconceptions.

The Anatomy of a Multiple-Choice Item

This example will help you understand how four terms, shown in all CAPS below, are used to label different parts of a test item.

What fruit carries its seeds on the outside?    STEM
a) apple         DISTRACTOR or FOIL OPTION
b) grape         DISTRACTOR or FOIL OPTION
c) strawberry    KEY OPTION
d) tomato        DISTRACTOR or FOIL OPTION

10. All distractors should be equally attractive.
11. Items should not ask students to make value judgments.
12. Items should not contain clang associations (i.e., words or phrases in the stem and the correct response that sound alike).
13. Foils should be grammatically correct completions of the item stem (e.g., use of a or an, number or tense of a verb, or plurals give cues to correct answers).
14. Each distractor should be a logical response to the item stem.
15. The correct response should parallel the distractors in terms of length and complexity (i.e., options should be homogeneous).
16. Avoid using foils where a pair of opposites is presented and one of the pair is the correct answer (key).
17. Avoid the use of absolutes like always or never.
18. The technical level of the distractors should match the technical level of the item stem. Item difficulty should not be a function of the vocabulary used in the stem or item options.
19. Foils should not overlap or include one another (e.g., one foil should not be a subset of another). Foils should be mutually exclusive.
20. Whenever possible, use a logical order when listing foils (e.g., alphabetical if one word, by length of foil if more than one word, or from smallest to largest if numerical, etc.).
21. Avoid using "All of the above," "None of the above," or "I don't know" as options.
22. Avoid using qualifiers such as LEAST or EXCEPT in the item stem.
23. Avoid the use of negative words (i.e., NOT, etc.) in stems or foils. When these terms are used, they should be highlighted in some way (e.g., bold or all CAPS).
24. There should be one and only one response that content experts can agree on as correct.
25. Foils in one item should not give clues that will help answer another item on the test, or eliminate distractors in another item.
26. The key should appear in all option positions (i.e., A, B, C, and D) with approximately equal frequency, and be randomly distributed throughout the test.

References

Mehrens, W., & Lehmann, I. (1978). Measurement and evaluation in education and psychology. Holt, Rinehart and Winston.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing.
