
https://www.uni.edu/chfasoa/reliabilityandvalidity.htm
http://www.measuringu.com/blog/measure-reliability.php
https://explorable.com/definition-of-reliability?gid=1579
https://explorable.com/internal-consistency-reliability?gid=1579

SOURCE 1

Reliability is the degree to which an assessment tool produces stable and consistent
results.
Reliability is a measure of the consistency of a metric or a method.
Every metric or method we use, including things like methods for uncovering usability
problems in an interface and expert judgment, must be assessed for reliability.
Imagine that a researcher discovers a new drug that she believes helps people to
become more intelligent, a change measured by a series of mental exercises. After
analyzing the results, she finds that the group given the drug performed much better
on the mental tests than the control group.
For her results to be reliable, another researcher must be able to perform exactly the
same experiment on another group of people and generate results with the same
statistical significance. If repeat experiments fail, then there may be something wrong
with the original research.

Types of Reliability

1. Test-retest reliability is a measure of reliability obtained by administering the
same test twice over a period of time to a group of individuals. The scores from
Time 1 and Time 2 can then be correlated in order to evaluate the test for stability
over time.

Example: A test designed to assess student learning in psychology could be given
to a group of students twice, with the second administration perhaps coming a week
after the first. The obtained correlation coefficient would indicate the stability of the
scores.
2. Parallel forms reliability is a measure of reliability obtained by administering
different versions of an assessment tool (both versions must contain items that
probe the same construct, skill, knowledge base, etc.) to the same group of
individuals. The scores from the two versions can then be correlated in order to
evaluate the consistency of results across alternate versions.

Example: If you wanted to evaluate the reliability of a critical thinking assessment,
you might create a large set of items that all pertain to critical thinking and then
randomly split the questions up into two sets, which would represent the parallel
forms.

3. Inter-rater reliability is a measure of reliability used to assess the degree to
which different judges or raters agree in their assessment decisions. Inter-rater
reliability is useful because human observers will not necessarily interpret
answers the same way; raters may disagree as to how well certain responses or
material demonstrate knowledge of the construct or skill being assessed.

Example: Inter-rater reliability might be employed when different judges are
evaluating the degree to which art portfolios meet certain standards. Inter-rater
reliability is especially useful when judgments can be considered relatively
subjective. Thus, the use of this type of reliability would probably be more likely
when evaluating artwork as opposed to math problems.

4. Internal consistency reliability is a measure of reliability used to evaluate the
degree to which different test items that probe the same construct produce
similar results.

A. Average inter-item correlation is a subtype of internal consistency
reliability. It is obtained by taking all of the items on a test that probe the
same construct (e.g., reading comprehension), determining the correlation
coefficient for each pair of items, and finally taking the average of all of
these correlation coefficients. This final step yields the average inter-item
correlation (see the short computational sketch after this list).

B. Split-half reliability is another subtype of internal consistency reliability.


The process of obtaining split-half reliability is begun by splitting in half
all items of a test that are intended to probe the same area of knowledge
(e.g., World War II) in order to form two sets of items. The entire test is
administered to a group of individuals, the total score for each set is
computed, and finally the split-half reliability is obtained by determining the
correlation between the two total set scores.
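To make item 4A concrete, here is a minimal sketch of the average inter-item correlation. Python and NumPy are assumed here purely for illustration, and the item scores are invented; the sources above do not prescribe any particular tool.

import numpy as np

# Invented scores for eight students on four reading-comprehension items.
items = np.array([
    [3, 4, 3, 4],
    [1, 2, 1, 2],
    [5, 4, 5, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [2, 1, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
])

# Correlation matrix between items; the average inter-item correlation is the
# mean of the entries above the diagonal (each item pair counted once).
corr = np.corrcoef(items, rowvar=False)
upper = corr[np.triu_indices_from(corr, k=1)]
print(f"Average inter-item correlation: {upper.mean():.2f}")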

SOURCE 2

The test-retest reliability method is one of the simplest ways of testing the
stability and reliability of an instrument over time.
For example, if a group of students takes a test, you would expect them to show very similar results
if they take the same test a few months later. This definition relies upon there being no confounding
factor during the intervening time interval.

Instruments such as IQ tests and surveys are prime candidates for test-retest methodology, because
there is little chance of people experiencing a sudden jump in IQ or suddenly changing their
opinions.
On the other hand, educational tests are often not suitable, because students will learn much more
information over the intervening period and show better results in the second test.

Test-Retest Reliability and the Ravages of Time


For example, if a group of students takes a geography test just before the end of the semester and
another when they return to school at the beginning of the next, the two tests should produce
broadly the same results.
If, on the other hand, the test and retest are taken at the beginning and at the end of the semester, it
can be assumed that the intervening lessons will have improved the ability of the students. Thus,
test-retest reliability will be compromised and other methods, such as split testing, are better.
Even if a test-retest reliability process is applied with no sign of intervening factors, there will always
be some degree of error. There is a strong chance that subjects will remember some of the
questions from the previous test and perform better.
Some subjects might just have had a bad day the first time around or they may not have taken the
test seriously. For these reasons, students facing retakes of exams can expect to face different
questions and a slightly tougher standard of marking to compensate.

Even in surveys, it is quite conceivable that there may be a big change in opinion. People may have
been asked about their favourite type of bread. In the intervening period, if a bread company mounts
a long and expansive advertising campaign, this is likely to influence opinion in favour of that brand.
This will jeopardise the test-retest reliability, and so the analysis must be handled with caution.

Test-Retest Reliability and Confounding Factors


To give an element of quantification to the test-retest reliability, statistical tests factor this into the
analysis and generate a number between zero and one, with 1 being a perfect correlation between
the test and the retest.
Perfection is impossible and most researchers accept a lower level, either 0.7, 0.8 or 0.9, depending
upon the particular field of research.
However, this cannot remove confounding factors completely, and a researcher must anticipate and
address these during the research design to maintain test-retest reliability.
To reduce the chance of a few subjects skewing the results, for whatever reason, the test for
correlation is much more accurate with large subject groups, which drown out the extremes and
provide a more accurate result.
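As a rough illustration of that coefficient, test-retest reliability can be computed as a Pearson correlation between the two administrations. The sketch below uses Python with NumPy and invented scores; the language, library and numbers are assumptions for illustration, not part of the original sources.

import numpy as np

# Hypothetical scores for ten subjects who took the same test twice.
time1 = np.array([68, 75, 82, 59, 90, 71, 64, 88, 77, 80])
time2 = np.array([70, 73, 85, 62, 88, 69, 66, 90, 75, 78])

# Test-retest reliability is the Pearson correlation between the two
# administrations: a value of 1 means perfect stability over time.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {r:.2f}")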

For any research program that requires qualitative rating by different
researchers, it is important to establish a good level of interrater reliability,
also known as interobserver reliability.
This ensures that the generated results meet the accepted criteria defining reliability, by
quantitatively defining the degree of agreement between two or more observers.
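The source does not name a particular statistic, but one common way to quantify that agreement for categorical ratings is Cohen's kappa, which corrects the raw percentage of agreement for chance. The following Python/NumPy sketch, with invented grades from two hypothetical examiners, shows only one possible approach.

import numpy as np

def cohens_kappa(rater_a: np.ndarray, rater_b: np.ndarray) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters' labels."""
    categories = np.union1d(rater_a, rater_b)
    observed = np.mean(rater_a == rater_b)              # raw agreement
    expected = sum(                                      # agreement expected by chance
        np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Invented pass/borderline/fail grades from two examiners on ten portfolios.
rater_a = np.array(["pass", "pass", "fail", "borderline", "pass",
                    "fail", "pass", "borderline", "pass", "fail"])
rater_b = np.array(["pass", "borderline", "fail", "borderline", "pass",
                    "fail", "pass", "pass", "pass", "fail"])
print(f"Cohen's kappa: {cohens_kappa(rater_a, rater_b):.2f}")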

Interrater Reliability and the Olympics


Interrater reliability is the most easily understood form of reliability, because everybody has
encountered it.
For example, watching any sport that uses judges, such as Olympic ice skating or a dog show, relies
upon the human observers maintaining a great degree of consistency with one another. If even one of
the judges is erratic in their scoring, this can jeopardize the entire system and deny a
participant their rightful prize.

Outside the world of sport and hobbies, inter-rater reliability has some far more important
connotations and can directly influence your life.
Examiners marking school and university exams are assessed on a regular basis, to ensure that
they all adhere to the same standards. This is the most important example of interobserver reliability
- it would be extremely unfair to fail an exam because the observer was having a bad day.
For most examination boards, appeals are usually rare, showing that the interrater reliability process
is fairly robust.

An Example From Experience


I used to work for a bird protection charity and, every morning, we went down to the seashore and
estimated the number of individuals for each bird species.
Obviously, you cannot count thousands of birds individually; apart from the huge numbers, they
constantly move, leaving and rejoining the group. Using experience, we estimated the numbers and
then compared our estimates.
If one person estimated 1000 dunlin, one 4000 and the other 12000, then there was something
wrong with our estimation and it was highly unreliable.
If, however, we independently came up with figures of 4000, 5000 and 6000, then that was accurate
enough for our purposes, and we knew that we could use the average with a good degree of
confidence.

Qualitative Assessments and Interrater Reliability


Any qualitative assessment using two or more researchers must establish interrater reliability to
ensure that the results generated will be useful.
One good example is Bandura's Bobo Doll experiment, which used a scale to rate the levels of
displayed aggression in young children. Apart from extensive pre-testing, the observers constantly
compared and calibrated their ratings, adjusting their scales to ensure that they were as similar as
possible.

Guidelines and Experience

Interobserver reliability is strengthened by establishing clear guidelines and thorough experience. If
the observers are given clear and concise instructions about how to rate or estimate behavior, this
increases the interobserver reliability.
Experience is also a great teacher; researchers who have worked together for a long time will be
fully aware of each other's strengths, and will be surprisingly similar in their observations.

Internal consistency reliability defines the consistency of the results
delivered in a test, ensuring that the various items measuring the different
constructs deliver consistent scores.
For example, an English test is divided into vocabulary, spelling, punctuation and grammar. The
internal consistency reliability test provides a measure of whether each of these particular aptitudes
is being measured correctly and reliably.
One way of testing this is by using a test-retest method, where the same test is administered some
time after the initial test and the results compared.
However, this creates some problems and so many researchers prefer to measure internal
consistency by including two versions of the same instrument within the same test. Our example of
the English test might include two very similar questions about comma use, two about spelling and
so on.
The basic principle is that the student should give the same answer to both - if they do not know how
to use commas, they will get both questions wrong. A few nifty statistical manipulations will give the
internal consistency reliability and allow the researcher to evaluate the reliability of the test.
There are three main techniques for measuring the internal consistency reliability, depending upon
the degree, complexity and scope of the test.
They all check that the results and constructs measured by a test are correct, and the exact type
used is dictated by subject, size of the data set and resources.

Split-Halves Test
The split halves test for internal consistency reliability is the easiest type, and involves dividing a test
into two halves.
For example, a questionnaire to measure extroversion could be divided into odd and even questions.
The results from both halves are statistically analysed, and if there is weak correlation between the
two, then there is a reliability problem with the test.

The split halves test gives a measurement of between zero and one, with one
meaning a perfect correlation.
The division of the questions into two sets must be random. Split halves testing was a popular way to
measure reliability, because of its simplicity and speed.
However, in an age where computers can take over the laborious number crunching, scientists tend
to use much more powerful tests.
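To illustrate, here is a minimal Python/NumPy sketch of a split-halves calculation on invented questionnaire data. The Spearman-Brown correction at the end is a standard step for scaling the half-test correlation up to the full test length, although the text above does not mention it by name; the data and tools are assumptions for illustration only.

import numpy as np

# Invented data: 30 respondents answering 10 items that all tap one trait.
rng = np.random.default_rng(0)
ability = rng.normal(0, 1, size=30)                          # latent trait per respondent
scores = ability[:, None] + rng.normal(0, 0.5, size=(30, 10))

# Split the items into two halves (odd vs. even positions) and total each half.
odd_total = scores[:, 0::2].sum(axis=1)
even_total = scores[:, 1::2].sum(axis=1)

# Correlate the half-scores, then apply the Spearman-Brown correction
# because each half is only half as long as the full test.
r_half = np.corrcoef(odd_total, even_total)[0, 1]
reliability = 2 * r_half / (1 + r_half)
print(f"Split-half reliability: {reliability:.2f}")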

Kuder-Richardson Test
The Kuder-Richardson test for internal consistency reliability is a more advanced, and slightly more
complex, version of the split halves test.
In this version, the test works out the average correlation for all the possible split half combinations
in a test. The Kuder-Richardson test also generates a correlation of between zero and one, with a
more accurate result than the split halves test. The weakness of this approach, as with split-halves,
is that the answer for each question must be a simple right or wrong answer, zero or one.
For multi-scale responses, sophisticated techniques are needed to measure internal consistency
reliability.
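In practice the Kuder-Richardson test is usually computed with the KR-20 formula, which captures the same idea as averaging over all possible splits without enumerating them. The Python/NumPy sketch below shows that standard formula on invented right/wrong answers; the function name and data are illustrative assumptions.

import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder-Richardson 20 for a matrix of 0/1 item scores (rows = respondents)."""
    k = items.shape[1]                          # number of items
    p = items.mean(axis=0)                      # proportion answering each item correctly
    q = 1 - p
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Invented right/wrong answers for six students on five items.
answers = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
])
print(f"KR-20: {kr20(answers):.2f}")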

Cronbach's Alpha Test


The Cronbach's Alpha test not only averages the correlation between every possible combination of
split halves, but also allows multi-level responses.

For example, a series of questions might ask the subjects to rate their response between one and
five. Cronbach's Alpha gives a score of between zero and one, with 0.7 generally accepted as a sign
of acceptable reliability.
The test also takes into account both the size of the sample and the number of potential responses.
A 40-question test with possible ratings of 1 - 5 is seen as having more accuracy than a ten-question
test with three possible levels of response.
Of course, even with Cronbach's clever methodology, which makes calculation much simpler than
crunching through every possible permutation, this is still a test best left to computers and statistics
spreadsheet programmes.
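For completeness, here is a minimal sketch of the standard Cronbach's alpha formula on invented 1-5 ratings. As with the earlier sketches, Python, NumPy, the function name and the data are all assumptions made for illustration.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a matrix of item scores (rows = respondents)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented 1-5 ratings for eight respondents on four items of one construct.
ratings = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
])
alpha = cronbach_alpha(ratings)
print(f"Cronbach's alpha: {alpha:.2f}")  # 0.7 or above is generally accepted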

Summary
Internal consistency reliability is a measure of how well a test addresses different constructs and
delivers reliable scores. The test-retest method involves administering the same test, after a period
of time, and comparing the results.
By contrast, measuring the internal consistency reliability involves measuring two different versions
of the same item within the same test.
