
Reliability and Validity

Chong Ho Yu, Ph.D.


Conventional views of reliability (AERA et al., 1985)
• Temporal stability: administer the same form of a test on two or more separate occasions to the same
group of examinees (test-retest). On many occasions this approach is not practical because
repeated measurements are likely to change the examinees. For example, the examinees will
adapt to the test format and thus tend to score higher on later tests. Hence, careful implementation
of the test-retest approach is strongly recommended (Yu, 2005).

• Form equivalence: administer two different forms of a test, based on the same content, on one occasion to
the same examinees (alternate form). After alternate forms have been developed, they can be used
with different examinees. This is very common in high-stakes examinations as a way of pre-empting cheating:
an examinee who took Form A earlier cannot share the test items with another student who
might take Form B later, because the two forms have different items.

• Internal consistency: the coefficient of consistency among test scores obtained from a single test or survey
(Cronbach's Alpha, KR20, split-half). For instance, let's say respondents are asked to rate
statements in an attitude survey about computer anxiety. One statement is "I feel very negative
about computers in general." Another statement is "I enjoy using computers." People who
strongly agree with the first statement should strongly disagree with the second statement,
and vice versa. If a respondent rates both statements high, or both low, the responses are
inconsistent and patternless. The same principle can be applied to a test: when no pattern is found in the
students' responses, the test is probably too difficult and the students are just guessing the answers randomly.
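
As a rough illustration, the sketch below (hypothetical ratings and a hypothetical cronbach_alpha helper, not part of the original text) estimates internal consistency for the two computer-anxiety statements above, reverse-coding the positively worded item first so that both items point in the same direction:

import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items matrix of ratings."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 ratings from six respondents.
negative = np.array([5, 4, 5, 2, 1, 4])  # "I feel very negative about computers in general."
enjoy    = np.array([1, 2, 1, 4, 5, 2])  # "I enjoy using computers."
enjoy_reversed = 6 - enjoy               # reverse-code so both items point the same direction

ratings = np.column_stack([negative, enjoy_reversed])
print(round(cronbach_alpha(ratings), 2))  # values near 1 indicate consistent responses

A value near 1 indicates that respondents answered the two statements consistently; a value near 0 would suggest patternless responding.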


Reliability is a necessary but not sufficient condition for validity.
For instance, if the needle of a scale rests five pounds above
zero, it always over-reports my weight by five pounds. Is the
measurement consistent? Yes, but it is consistently wrong! Is the measurement valid? No! (Although
if it under-reported my weight by five pounds, I would be happy to consider it a valid measurement.)

• Performance, portfolio, and responsive evaluations, where the tasks vary substantially from
student to student and where multiple tasks may be evaluated simultaneously, are attacked for
lacking reliability. One of the difficulties is that there is more than one source of measurement
error in performance assessment. For example, the reliability of a writing skill test score is
affected by the raters, the mode of discourse, and several other factors (Parkes, 2000).

• Replications as unification: Users may be confused by the diversity of reliability indices.


Nevertheless, different types of reliability measures share a common thread: What constitutes a
replication of a measurement procedure? (Brennan, 2001) Take internal consistency as an
example. This measure is used because it is convenient to compute the reliability index based
upon data collected from one occasion. However, the ultimate inference should go beyond one
single testing occasion to others (Yu, 2005). In other words, any procedures for estimating
reliability should attempt to mirror a result based upon full-length replications.

Conventional views of validity (Cronbach, 1971)


• Face validity: Face validity simply means validity at face value. As a check on face validity,
test/survey items are sent to teachers to obtain suggestions for modification. Because of its
vagueness and subjectivity, psychometricians abandoned this concept long ago.
However, outside the measurement arena, face validity has come back in another form. In
discussing the validity of a theory, Lacity and Jansen (1994) define validity as making
common sense, and being persuasive and seeming right to the reader. For Polkinghorne
(1988), the validity of a theory refers to results that have the appearance of truth or reality.

However, the internal structure of things may not concur with their appearance, and professional
knowledge often runs counter to common sense. The criteria of validity in research should go beyond
"face," "appearance," and "common sense."

• Content validity: draw an inference from test scores to a large domain of items similar to those
on the test. Content validity is concerned with sample-population representativeness, i.e., the
knowledge and skills covered by the test items should be representative of the larger domain of
knowledge and skills.

For example, computer literacy includes skills in operating systems, word processing,
spreadsheets, databases, graphics, the internet, and many others.
However, it is difficult, if not impossible, to
administer a test covering all aspects of computing. Therefore, only a handful of tasks are sampled
from the population of computer skills.

Content validity is usually established by content experts. Take computer literacy as an
example again. A test of computer literacy should be written or reviewed by computer science
professors because it is assumed that computer scientists know what is important in their
discipline. At first glance, this approach looks similar to the validation process of face
validity, yet there is a difference. In content validity, evidence is obtained by looking for
agreement in judgments across judges. In short, face validity can be established by one person, but
content validity should be checked by a panel.

However, this approach has some drawbacks. First, experts tend to take their knowledge for
granted and forget how little other people know; it is not uncommon for tests written by
content experts to be extremely difficult.

Second, very often content experts fail to identify the learning objectives of a subject. Take the
following question in a philosophy test as an example:


What is the time period of the philosopher Epicurus?


a. 341-270 BC

b. 331-232 BC

c. 280-207 BC

d. None of the above



This type of question tests the ability of memorizing historical facts, but not philosophizing. The
content expert may argue that "historical facts" are important for a student to further understand
philosophy. Let's change the subject to computer science and statistics. Look at the following
two questions:

Top of Form

When was William Gates III, the founder and CEO of Microsoft, born?

a. 1949

b. 1953

c. 1957

d. None of the above



Which of the following statements is true about ANOVA?

a. It was invented by R. A. Fisher in 1914


b. It was invented by R. A. Fisher in 1920

c. It was invented by Karl Pearson in 1920

d. None of the above



Any computer scientist or statistician would be hard pressed to accept that the above
questions fulfill content validity. As a matter of fact, this memorization approach is a common
practice among instructors.

Further, sampling knowledge from a larger domain of knowledge involves subjective values.
For example, a test on art history may include many questions about oil paintings but fewer
questions about watercolor paintings and photography, because of the perceived importance of oil
paintings in art history.

Content validity is sample-oriented rather than sign-oriented. A behavior is viewed as a
sample when it is a subgroup of the same kind of behaviors. On the other hand, a behavior is
considered a sign when it is an indicator or a proxy of a construct (Goodenough, 1949).
Construct validity and criterion validity, which will be discussed later, are sign-oriented because
both of them indicate behaviors different from those of the test.

• Criterion: draw an inference from test scores to performance. A high score on a valid test
indicates that the test taker has met the performance criteria.

Regression analysis can be applied to establish criterion validity: an independent variable
serves as the predictor variable and the dependent variable serves as the criterion variable. The
correlation coefficient between them is called the validity coefficient.
For instance, scores on a simulated driving test are the predictor variable while scores on the
road test are the criterion variable. It is hypothesized that if the test taker passes the simulation test,
he/she should meet the criterion of being a safe driver. In other words, if the simulation test
scores can predict the road test scores in a regression model, the simulation test is claimed to
have a high degree of criterion validity.
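
As a rough sketch (with made-up scores, not data from the text), the validity coefficient is simply the correlation between predictor and criterion, and a least-squares line lets the predictor forecast the criterion:

import numpy as np

simulation = np.array([62, 70, 75, 80, 88, 93])  # predictor: simulated driving test scores
road_test  = np.array([58, 66, 72, 78, 85, 90])  # criterion: road test scores

validity_coefficient = np.corrcoef(simulation, road_test)[0, 1]
slope, intercept = np.polyfit(simulation, road_test, 1)  # least-squares regression line

print(f"validity coefficient: {validity_coefficient:.2f}")
print(f"predicted road test score for a simulation score of 85: {slope * 85 + intercept:.1f}")

A validity coefficient close to 1 would support using the simulation test in place of the road test for screening purposes.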

In short, criterion validity is about prediction rather than explanation. Prediction is concerned
with non-causal or mathematical dependence, whereas explanation pertains to causal or
logical dependence. For example, one can predict the weather based on the height of the mercury
inside a thermometer; thus, the height of the mercury could satisfy criterion validity as a
predictor. However, one cannot explain why the weather changes by pointing to the change in the
mercury's height. Because of this limitation of criterion validity, an evaluator also has to conduct
construct validation.

• Construct: draw an inference from test scores to a psychological construct. Because it is
concerned with abstract and theoretical constructs, this type of validity is also known as
theoretical construct validity.

According to Hunter and Schmidt (1990), construct validity is a quantitative question rather
than a qualitative distinction such as "valid" or "invalid"; it is a matter of degree. Construct
validity can be measured by the correlation between the intended independent variable
(the construct) and the proxy independent variable (the indicator or sign) that is actually used.

For example, an evaluator wants to study the relationship between general cognitive ability and
job performance. However, the evaluator may not be able to administer a cognitive test to every
subject. In this case, he can use a proxy variable, such as "amount of education," as an indirect
indicator of cognitive ability. After he administers a cognitive test to a portion of the subjects
and finds a strong correlation between general cognitive ability and amount of education, the
latter can be used for the larger group because its construct validity has been established.
Other authors (e.g., Angoff, 1988; Cronbach & Quirk, 1976) argue that construct validity cannot
be expressed in a single coefficient; there is no mathematical index of construct validity. Rather,
the nature of construct validity is qualitative.

There are two types of indicators:

○ Reflective indicator: the effect of the construct.

○ Formative indicator: the cause of the construct.

When an indicator is expressed in terms of multiple items of an instrument, factor analysis is
used for construct validation.
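
A minimal sketch of this idea, using simulated ratings and scikit-learn's FactorAnalysis (one plausible tool among several; the data and the single-factor setup are purely illustrative):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
construct = rng.normal(size=200)  # latent construct, e.g., computer anxiety
# Four items all driven by the same construct plus random noise.
items = np.column_stack([construct + rng.normal(scale=0.5, size=200) for _ in range(4)])

fa = FactorAnalysis(n_components=1).fit(items)
print(fa.components_)  # similar, sizable loadings suggest the items tap one common factor

If the items instead split across several unrelated factors, that would cast doubt on the claim that the instrument measures a single construct.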

Test bias is a major threat against construct validity, and therefore test bias analyses should be
employed to examine the test items (Osterlind, 1983).

The presence of test bias definitely affects the measurement of the psychological construct.
However, the absence of test bias does not guarantee that the test possesses construct validity. In
other words, the absence of test bias is a necessary, but not a sufficient, condition.

• Construct validation as unification: The criterion and content models tend to be
empirically oriented, while the construct model is inclined to be theoretical. Nevertheless, all
models of validity require some form of interpretation: What is the test measuring? Can it
measure what it intends to measure? In standard scientific inquiry, it is important to formulate
an interpretative (theoretical) framework clearly and then subject it to empirical challenges. In
this sense, theoretical construct validation can be considered a unifying framework for
validity (Kane, 2001).

A modified view of reliability (Moss, 1994)


• There can be validity without reliability if reliability is defined as consistency among
independent measures.

• Reliability is an aspect of construct validity. As assessment becomes less standardized,
distinctions between reliability and validity blur.

• In many situations, such as searching for faculty candidates and conferring graduate degrees,
committee members are not trained to agree on a common set of criteria and standards.

• Inconsistency in students' performance across tasks does not invalidate the assessment. Rather,
it becomes an empirical puzzle to be solved by searching for a more comprehensive
interpretation.

• Initial disagreement (e.g., among students, teachers, and parents in responsive evaluation) would
not invalidate the assessment. Rather, it would provide an impetus for dialogue.

Li (2003) argued that the preceding view is incorrect:

• Reliability should be defined in terms of classical test theory: the squared correlation between
observed and true scores, or the proportion of true-score variance in obtained test scores (see the
formulas after this list).

• Reliability is a unitless measure and thus it is already model-free or standard-free.

• It has been a tradition to introduce multiple factors into a test to improve validity, even though
doing so decreases internal-consistency reliability.
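
In standard classical test theory notation (a textbook formulation, not taken verbatim from Li's article), the definition in the first bullet can be written as:

X = T + E, \qquad \operatorname{Cov}(T, E) = 0

\rho_{XX'} = \rho_{XT}^{2} = \frac{\sigma_{T}^{2}}{\sigma_{X}^{2}} = \frac{\sigma_{T}^{2}}{\sigma_{T}^{2} + \sigma_{E}^{2}}

where X is the observed score, T the true score, and E the random error; reliability is thus the proportion of observed-score variance attributable to true scores.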

An extended view of Moss's reliability (Mislevy, 2004)


• Inspired by Moss, Mislevy went further to ask whether there can be reliability without
reliability (indices).

• By blending psychometrics and hermeneutics, in which a holistic and integrative approach to
understanding the whole in light of its parts is used, Mislevy urged psychometricians to think
about what they intend to make inferences about.

• In many cases we don't present just one argument; rather problem solving involves arguments or
chains of reasoning with massive evidence.

• Off-the-shelf inferential machinery (e.g., computing reliability indices) may fail if we quantify
things or tasks that we don't know much about.

• Probability-based reasoning needs to be extended to more complex assessments grounded in
cognitive psychology.

A radical view of reliability (Thompson et al, 2003)


• Reliability is not a property of the test; rather, it is a property of the data. Thus,
psychometrics is datametrics.
• Tests themselves are not reliable; it is therefore important to explore score reliability in virtually all studies.

• Reliability generalization, which can be used in a meta-analysis application similar to validity
generalization, should be implemented to assess variance in measurement error across studies.

An updated perspective of reliability (Cronbach, 2004)


In a 2004 article, Lee Cronbach, the inventor of Cronbach's Alpha as a way of measuring reliability,
reviewed the historical development of Cronbach's Alpha. He asserted, "I no longer regard the formula
(of Cronbach Alpha) as the most appropriate way to examine most data. Over the years, my associates
and I developed the complex generalizability (G) theory" (p. 403). Discussion of G theory is beyond
the scope of this document. Nevertheless, Cronbach did not object to the use of Cronbach's Alpha, but he
recommended that researchers take the following into consideration when employing this
approach:

• Standard error of measurement: This is the most important piece of information to report
regarding the instrument, not a coefficient (a standard formula is sketched after this list).

• Independence of sampling

• Heterogeneity of content

• How the measurement will be used: Decide whether future uses of the instrument are likely to
be exclusively for absolute decisions, for differential decisions, or both.

• Number of conditions for the test
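
For reference, the standard error of measurement mentioned in the first bullet is commonly computed from the score standard deviation and a reliability estimate (a standard textbook formula, not a prescription from Cronbach's article):

SEM = \sigma_{X}\sqrt{1 - r_{XX'}}

For example, if the scores have a standard deviation of 10 and the reliability estimate is .84, the SEM is 10 * sqrt(.16) = 4 score points.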

A critical view of validity (Pedhazur & Schmelkin,1991)


• Content validity is not a type of validity at all because validity refers to inferences made about
scores, not to an assessment of the content of an instrument.

• The very definition of a construct implies a domain of content. There is no sharp distinction
between test content and test construct.

A modified view of validity (Messick, 1995)


The conventional view (content, criterion, construct) is fragmented and incomplete, especially because
it fails to take into account both evidence of the value implications of score meaning as a basis for
action and the social consequences of score use.

Validity is not a property of the test or assessment, but rather of the meaning of the test scores.

• Content: evidence of content relevance, representativeness, and technical quality

• Substantive: theoretical rationale

• Structural: the fidelity of the scoring structure

• Generalizability: generalization to the population and across populations

• External: applications to multitrait-multimethod comparison

• Consequential: bias, fairness, and justice; the social consequences of the assessment for
society

Critics argued that consequences should not be a component of validity because test developers should
not be held responsible for the consequences of misuse; accountability should lie with the misuser.
Messick (1998) counter-argued that the social consequences of score interpretation include the value
implications of the construct label, which may or may not be commensurate with the construct's trait
implications and which need to be addressed in appraising score meaning. While test developers should not be
held accountable for the misuse of tests, they should still pay attention to the unanticipated consequences of
legitimate score interpretation.

A different view of reliability and validity (Salvucci, Walter, Conley, Fink, & Saba, 1997)
Some scholars argue that the traditional view that "reliability is a necessary but not a sufficient
condition of validity" is incorrect. This school of thought conceptualizes reliability as invariance and
validity as unbiasedness. A sample statistic may have an expected value over samples equal to the
population parameter (unbiasedness), but have very high variance from a small sample size.
Conversely, a sample statistic can have very low sampling variance but an expected value far
removed from the population parameter (high bias). In this view, a measure can be unreliable (high
variance) but still valid (unbiased).
[Figure: two panels. Left: the population parameter (red line) equals the expected value of the sample
statistic (yellow line), but the sampling variance (green line) is high: unreliable but valid. Right: the
sample statistic's expected value departs from the population parameter, but the sampling variance is
low: invalid but reliable.]
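
The contrast in the figure can be reproduced with a small simulation (the numbers below are arbitrary and purely illustrative):

import numpy as np

rng = np.random.default_rng(1)
population_mean = 100.0
n_replications = 10_000

# "Unreliable but valid": small samples give a high-variance estimate centered on the parameter.
unbiased = rng.normal(population_mean, 15, size=(n_replications, 5)).mean(axis=1)

# "Invalid but reliable": a mis-calibrated measure with large samples gives a low-variance,
# but systematically shifted, estimate.
biased = rng.normal(population_mean + 5, 15, size=(n_replications, 500)).mean(axis=1)

for label, estimates in [("unbiased, high variance", unbiased),
                         ("biased, low variance", biased)]:
    print(f"{label}: mean = {estimates.mean():.1f}, SD = {estimates.std():.2f}")

The first estimator is centered on the parameter but bounces around it; the second is tightly clustered but consistently off target.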

Caution and advice


There is a common misconception that if someone adopts a validated instrument, he/she does not need
to check the reliability and validity with his/her own data. Imagine this: when I buy a drug that has
been approved by the FDA and my friend asks me whether it heals me, I tell him, "I am taking a drug
approved by the FDA and therefore I don't need to know whether it works for me or not!" A responsible
evaluator should still check the instrument's reliability and validity with his/her own subjects and make
any necessary modifications.

Low reliability is less detrimental in a performance pretest. In a pretest, subjects have not been
exposed to the treatment and are thus unfamiliar with the subject matter, so low reliability caused by
random guessing is expected. One easy way to overcome this problem is to include "I don't know" among the
multiple choices. In an experimental setting where students' responses would not affect their final
grades, the experimenter should explicitly instruct students to choose "I don't know" instead of making
a guess if they really don't know the answer. Low reliability is a signal of high measurement error,
which reflects a gap between what students actually know and what scores they receive. The choice "I
don't know" can help close this gap.

Last Updated: 2008


http://www.creative-wisdom.com/teaching/assessment/reliability.html

Reliability - overview
Reliability is the extent to which a test is repeatable and yields consistent scores.
Note: In order to be valid, a test must be reliable; but reliability does not guarantee
validity.
All measurement procedures have the potential for error, so the aim is to minimize it. An
observed test score is made up of the true score plus measurement error.
The goal of estimating reliability (consistency) is to determine how much of the variability in test
scores is due to measurement error and how much is due to variability in true scores.
Measurement errors are essentially random: a person's test score might not reflect the true score
because they were sick, hungover, anxious, in a noisy room, etc.
Reliability can be improved by:
• getting repeated measurements using the same test and
• getting many different measures using slightly different techniques and methods.
- e.g., consider how university assessment for grades involves several sources. You would not
consider one multiple-choice exam question to be a reliable basis for testing your
knowledge of "individual differences". Many questions are asked in many different
formats (e.g., exam, essay, presentation) to help provide a more reliable score.
Types of reliability
There are several types of reliability:
There are a number of ways to ensure that a test is reliable. I'll mention a few of them now:

1. Test-retest reliability
The test-retest method of estimating a test's reliability involves administering the test to
the same group of people at least twice. Then the first set of scores is correlated with the
second set of scores. Correlations range between 0 (low reliability) and 1 (high reliability)
(highly unlikely they will be negative!)
Remember that change might be due to measurement error. E.g., if you use a tape measure
to measure a room on two different days, any difference in the results is likely due to
measurement error rather than a change in the room size. However, if you measure
children's reading ability in February and then again in June, the change is likely due to
changes in the children's reading ability. Also, the actual experience of taking the test can
have an impact (called reactivity): after a history quiz you might look up the answers and do
better next time, or you might simply remember your original answers.

2. Alternate Forms
Administer Test A to a group and then administer Test B to the same group. The correlation
between the two sets of scores is the estimate of the test's reliability.

3. Split Half reliability


The relationship between scores on one half of the items and scores on the other half.

4. Inter-rater Reliability
Compare scores given by different raters. E.g., for important work in higher education
(e.g., theses), there are multiple markers to help ensure accurate assessment by checking
inter-rater reliability.

5. Internal consistency
Internal consistency is commonly measured as Cronbach's Alpha (based on inter-item
correlations), which ranges between 0 (low) and 1 (high). The greater the number of similar items, the
greater the internal consistency. That's why you sometimes get very long scales asking a
question in a myriad of different ways: if you add more items, you get a higher Cronbach's alpha
(see the formula below). Generally, an alpha of .80 is considered a reasonable benchmark.
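
The usual formula (a standard textbook expression, not from the notes above) makes the item-count effect explicit:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right)

where k is the number of items, \sigma_{i}^{2} is the variance of item i, and \sigma_{X}^{2} is the variance of the total scores. Adding items that correlate with the rest increases the total-score variance faster than the sum of the item variances, which pushes alpha upward.
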
How reliable should tests be? Some reliability guidelines
.90 = high reliability
.80 = moderate reliability
.70 = low reliability
High reliability is required when (note: most standardized tests of intelligence report
reliability estimates around .90, i.e., high):
• tests are used to make important decisions
• individuals are sorted into many different categories based upon relatively small individual
differences e.g. intelligence
Lower reliability is acceptable when (note: for most testing applications, reliability
estimates around .70 are usually regarded as low, i.e., 49% consistent variation, since .7
squared is .49):
• tests are used for preliminary rather than final decisions
• tests are used to sort people into a small number of groups based on gross
individual differences, e.g., height or sociability/extraversion
Reliability estimates of .80 or higher are typically regarded as moderate to high (approx. 16% of
the variability in test scores is attributable to error)
Reliability estimates below .60 are usually regarded as unacceptably low.
Levels of reliability typically reported for different types of tests and measurement devices
are shown in Table 7-6 of Murphy and Davidshofer (2001, p. 142).

 Reliability refers to whether a test is consistent, over time, in its results. A good test must also be
internally consistent. Validity refers to whether test results describe a person's actual behavior. For
most psychological tests, reliability is higher than validity, which underscores the need for additional
information, such as interviews.

Importance
 Reliability and validity are crucial to quality psychological testing. If reliability isn't present, then the
test can't be trusted to make valid assessments in a clinical setting. If validity isn't present, then the test
can't be trusted to make predictions about behavior in a real-world setting.

Read more: The Reliability & Validity of Psychological Tests | eHow.com


http://www.ehow.com/facts_7282618_reliability-validity-psychological-tests.html#ixzz17cE1VCcH
