PSYCHOLOGICAL ASSESSMENT
Conceptual Paradigm for Measurement and Evaluation
Measurement
- Samples of behavior: mental abilities, personality
- Scales (IRON): Interval, Ratio, Ordinal, Nominal

Test
- A single measure
- Battery of tests: a series of tests

Assessment
- Various techniques (DITO): Documents, Interview, Tests, Observation

Evaluation (RAP)
- Recommendation, Action Plan, Program development

Diagnosis
- Psychopathology: classification, severity, types
- Prognosis: predicting the development of the disorder

Personality
- Traits, states, types (MBTI), interests, values

Mental Abilities
- General intelligence (g): IQ
- Specific intelligence (s): non-verbal IQ
- Aptitude
- Multiple intelligences

Test Development
Measurement (IRON)
- Parametric statistics: normal distribution of scores (Pearson's r)
- Non-parametric statistics: non-normal distribution of scores (Spearman's rho; chi-square for nominal data)
- Interval: temperature, time, IQ; has no absolute zero
- Ordinal: rank, positions, Likert scale, birth order
- Ratio: weight, height; has absolute zero
- Nominal: sex, civil status; used for classifying
*Has absolute zero: weight, because there can be zero (no) weight. Has no absolute zero: temperature, because 0 degrees does not mean there is no temperature.
A distribution of scores is normal if the mean, median, and mode (the measures of central tendency) are all the same; a non-normal distribution is skewed.
Psychological Tests

Objective Tests
- Standardized test administration, scoring, and interpretation of test scores
- Limited number of responses (multiple choice; true or false)
- Group tests
- Norms: norm-referenced test (NRT), criterion-referenced test (CRT)

Projective Tests (WIDU)
- Wishes
- Intrapsychic conflict: conflict between desires and morals
- Desires
- Unconscious motives
- Subjectivity in test interpretation / clinical judgment
- Self-administered / individual tests
- Unlimited number of responses

Norms: where we base the scores of the test takers; they transform scores into a meaningful scale
> NRT: e.g., age norms
> CRT: e.g., how would we know if a basketball player is skillful? A sharpshooter; there are certain criteria to be met.

Medium of Psychological Tests
(Battery of tests: sets of tests)
- Paper and pencil
- Objects: wooden blocks, puzzles
- Machine: galvanic skin response (ex. EEG, CT scan)
- Computer
Psychological Tests
- Ability Tests: intelligence tests, achievement tests, aptitude tests
- Personality Tests

Intelligence Tests
- Verbal intelligence
- Non-verbal intelligence
- Ex. Wechsler Adult Intelligence Scale, Stanford-Binet Intelligence Scale, Culture Fair Intelligence Test

Achievement Tests
- Measure the extent of one's knowledge of various academic subjects (what has been learned?)
- Ex. Stanford Achievement Test in Reading

Aptitude Tests (predicting)
- Various skills / competencies
- Ex. Differential Aptitude Test

Personality Tests
- Objective
- Traits / domains or factors
- Ex. Myers-Briggs Type Indicator (MBTI)
- *Usually, no right or wrong answers

Results are integrated into a single score interpretation.

Assessment Techniques (DITO)
- Documents: records, protocols, collateral reports
- Interviews: interview responses, screening; initial assessment > verification; forms: written, verbal, visual
- Tests
- Observation: behavioral observation, observation checklist

Evaluation: Recommendation, Action Plan, Program Development
- Summarizing the results of assessment

Tests are used to SSCCRREEN:
- Screen applicants
- Self-understanding
- Classify people
- Counsel individuals
- Retain, dismiss, or promote employees
- Research for programs, test construction
- Evaluate performance for decision-making
- Examine and gauge abilities
- Need for diagnosis and intervention

VALIDITY: the test measures what it purports to measure

Content Validity
- Degree to which the test represents the essence, the topics, and the areas that the test is designed to measure (the appropriate domain)
- Primary concern of test developers, because it is the content of the items that really reflects the "whatness" of the property intended to be measured
- Ex. achievement, aptitude, personality tests
- Table of specification (blueprint) (TOS) (under analysis)
- TOS: generate items, then have them checked/validated by (at least 3) experts, a.k.a. raters
- Example domains for a Depression measure: Suicidal Ideation, Self-harm
Procedures on how to achieve a high degree of content validity
1. Pre-survey or Review of Related Literature
- Focus on the theoretical constructs related to the test you are planning to make, the tests already in use, the purpose of the said test, the areas covered, format, scaling techniques, etc.
- This may start the development phase of the instrument you are to construct.
- Empirical research

Note: Item analysis focuses on the items themselves (ability and aptitude tests, i.e., tests that have right and wrong answers). Factor analysis focuses on the domains (whether a factor really is a factor); it is used for personality tests and uses Cronbach's alpha.

2. Development of the Table of Specification (TOS)
- Determining the areas or concepts that will represent the nature of the variable being measured, and the relative emphasis of each area, is essentially judgmental.
- A detailed TOS includes areas/concepts, objectives, and the number of items in each area.
3. Consultation with Experts (raters)
- After making your own judgments, you need to consult your thesis adviser or someone who has the expertise to judge the representativeness/relevance of the entries made in your TOS.
4. Item Writing
- At this stage, you should know what type of items you are supposed to construct: the type of instrument, format, scaling, and scoring techniques.
- Every test item is based on the creative talent of the item writer and on their background in the test content.
Construct Validity
- Theoretical domains, factors / components
- Ex. personality

Constructs (example):
X = Optimism, Y = Optimism (convergent)
X = Optimism, Y = Pessimism (divergent)

1. Convergent validity: direct correlation between the variables (X and Y); the measure correlates well with other tests believed to measure the same construct.
2. Divergent (discriminant) validity: demonstrates that a test measures something different from what other available tests measure; the test should have low correlations with them, providing evidence for what the test does not measure.
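
A minimal sketch of how convergent and discriminant evidence is often checked in practice, assuming you already have total scores on a new optimism scale and on existing optimism and pessimism measures (all variable names and values below are hypothetical):

    import numpy as np

    # Hypothetical total scores for the same 10 respondents on three scales.
    new_optimism = np.array([28, 31, 25, 35, 30, 22, 27, 33, 29, 24])
    old_optimism = np.array([30, 33, 24, 36, 31, 21, 26, 35, 28, 25])
    pessimism    = np.array([12, 10, 18, 8, 11, 20, 15, 9, 13, 17])

    # Convergent evidence: the new scale should correlate highly with an
    # established measure of the same construct.
    r_convergent = np.corrcoef(new_optimism, old_optimism)[0, 1]

    # Discriminant (divergent) evidence: the correlation with a different
    # construct should be low (or negative, for an opposite construct).
    r_discriminant = np.corrcoef(new_optimism, pessimism)[0, 1]

    print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
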

Criterion-related Validity is estimated by correlating a subject's score on a test with an analysis of their behavior on an independent, real-life criterion. If the criterion you need to assess and correlate is occurring now, you are assessing concurrent validity. If the criterion is to occur in the future, you are assessing predictive validity.
Construct Validity (a.k.a. "true validity") is the extent to which there is evidence that a test measures a particular hypothetical construct. For example, are we really measuring intelligence with an IQ test when there are so many competing theories regarding what intelligence actually is?
Coefficient value: an estimated value
Variability: margin of error (because we are human beings)
Unsystematic error can result from varied assessment implementation, e.g., scoring by different raters.
RELIABILITY: consistency
This suggests that the scores you gather on psychological tests are not, in fact, true or real scores. Rather, those scores represent a combination of many factors.

Observed Test Score = True Score + Measurement Error
X = T + e

In theory, the reliability coefficient (rxx) gives us an index of the influence of true scores and error scores on any given test. It is the ratio of the true score variance to the total variance of the test. In actuality, rxx is very similar to a correlation (r). The two identical subscripts tell us that this r represents an rxx.
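
The ratio interpretation of rxx can be made concrete with a small simulation, a sketch only, assuming normally distributed true scores and error (the standard deviations of 15 and 5 are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    true_scores = rng.normal(100, 15, size=10_000)   # T
    error       = rng.normal(0, 5, size=10_000)      # e
    observed    = true_scores + error                # X = T + e

    # Reliability as the ratio of true-score variance to total (observed) variance.
    r_xx = true_scores.var() / observed.var()
    print(f"r_xx is approximately {r_xx:.2f}")   # close to 15^2 / (15^2 + 5^2) = 0.90
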

Models / Types of Reliability (the type depends on what test you are going to measure)
1. Test-Retest Reliability (Pearson's r)
- Give the same test to the same group of test takers on 2 different occasions
- Scores on the 1st administration are compared with scores on the 2nd administration using r
- Interval: 15 days to a month
- Too short an interval: familiarity (carryover effect); too long: maturation
- Often researchers consider this to be a better measure of temporal stability (consistency of test scores over time)
- Assumption: people don't change between the 2 administrations
- PROBLEM: practice or carryover effects (beneficial to the test takers)
2. Alternate Forms Reliability (r)
- To eliminate practice effects and other problems with the test-retest method (i.e., reactivity), test developers often give 2 highly similar forms of the test to the same people at different times.
- Reliability, in this case, is again assessed by correlating the scores from the two administrations.
- The goal is to develop an alternate form that is equivalent in terms of content, response format, and statistical characteristics.
- PROBLEM: the difficulty of developing another (equivalent; same difficulty) form of the test

3. Split-half Reliability (Spearman-Brown prophecy)
- Measures the internal consistency of the test
- Spearman-Brown formula: rxx = kr / (1 + (k - 1)r), where rxx = reliability coefficient of the full-length test, r = correlation between the two halves, and k = the factor by which the test length is changed (k = 2 when correcting a half-test correlation up to full length)
- Eliminates / reduces the following problems:
  1. The need for 2 administrations of a test
  2. The difficulty of developing another form
  3. Carryover or reactivity effects
- Internal-consistency coefficients (every item is compared with every other item):
  1. KR-20 (Kuder & Richardson, 1937, 1939): for tests whose questions can be scored either 0 or 1 (binary; dichotomous)
  2. Coefficient alpha (Cronbach, 1951): for rating scales that have 2 or more possible answers
- Problem: whether the test being split is homogeneous (i.e., measuring one characteristic) or heterogeneous (i.e., measuring many characteristics)
- Split-half reliability is very similar to internal consistency: the two halves of the test are correlated with each other.
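
A short sketch of the split-half estimate with the Spearman-Brown correction (k = 2) and of coefficient alpha (which is essentially KR-20 when items are dichotomous), assuming a small made-up matrix of 0/1 item scores (rows = examinees, columns = items):

    import numpy as np

    # Hypothetical dichotomous item scores: 6 examinees x 6 items.
    scores = np.array([
        [1, 1, 1, 0, 1, 1],
        [1, 0, 1, 1, 0, 1],
        [0, 1, 0, 1, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0, 1],
        [1, 1, 0, 1, 1, 0],
    ])

    # Split-half: odd vs. even items, then correct the half-test correlation
    # to full length with the Spearman-Brown formula rxx = k*r / (1 + (k - 1)*r).
    odd_half  = scores[:, 0::2].sum(axis=1)
    even_half = scores[:, 1::2].sum(axis=1)
    r_halves  = np.corrcoef(odd_half, even_half)[0, 1]
    k = 2
    r_split_half = (k * r_halves) / (1 + (k - 1) * r_halves)

    # Coefficient alpha:
    # alpha = (n / (n - 1)) * (1 - sum of item variances / variance of total scores)
    n_items   = scores.shape[1]
    item_var  = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_var / total_var)

    print(f"split-half (corrected) = {r_split_half:.2f}, alpha = {alpha:.2f}")
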

4. Scorer Reliability (inter-rater reliability)
- Judgments or ratings made by different scorers are often compared using correlation to see how much they agree.

If tests are being used to make important final decisions about people, then the reliability of the test should be high (around 0.95).
Lower reliability levels may be acceptable when:
- Making preliminary decisions,
- Sorting people into groups,
- Conducting research, etc.

Standard Error of Measurement (SEM)
- Index of measurement inconsistency, or the amount of expected error in an individual score (i.e., how much the score is likely to differ)
- Expressed in standard deviation units

Standard deviation (in terms of scores):
- High: heterogeneous (more spread)
- Low: homogeneous (less spread)
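
The notes define SEM conceptually; the usual computing formula (not spelled out above) is SEM = SD x sqrt(1 - rxx). A minimal sketch with made-up values:

    import math

    sd   = 15.0    # standard deviation of the test scores (hypothetical)
    r_xx = 0.91    # reliability coefficient (hypothetical)

    sem = sd * math.sqrt(1 - r_xx)   # expected error in an individual score
    observed = 108
    # Roughly 68% of the time, the true score falls within +/- 1 SEM of the observed score.
    print(f"SEM = {sem:.1f}; band: {observed - sem:.1f} to {observed + sem:.1f}")
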

Factors that can affect reliability
1. Errors that can increase or decrease an individual score:
- the test itself
- the test administrator
- the test scoring
- the test taker
2. Test length: as a rule, adding more homogeneous items will increase the reliability of the test.
3. Method used to estimate reliability: split-half methods yield higher reliability estimates than test-retest or alternate-forms methods.

Psychometric properties:
- reliability (consistency)
- validity (measures what it intends to measure)
- norming
- standardization
The goal is to increase the probability of getting the true score and to minimize the standard error of measurement.
A test score is composed of the observed score (actual score), the true score (a reflection of what you really know), and the error score (the difference between the true score and the actual score).

Observed Score = True Score + Error Score
- Trait score: sources of error that reside within the individual taking the test (excuses: hungry, headache, unprepared, etc.)
- Method score: sources of error that reside in the testing situation (lousy instructions, room too warm or too cold, missing pages, etc.)

Reliability = True Score Variance / (True Score Variance + Error Score Variance)

Inter-rater reliability = Number of agreements / (Number of agreements + Number of disagreements)

As error increases, reliability decreases.
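
A small sketch of the two ratios above, using hypothetical variance components and rater tallies:

    # Reliability as a variance ratio (hypothetical variance components).
    true_var, error_var = 45.0, 5.0
    reliability = true_var / (true_var + error_var)   # 0.90

    # Percent agreement between two raters (hypothetical tallies).
    agreements, disagreements = 27, 3
    interrater = agreements / (agreements + disagreements)   # 0.90

    print(f"reliability = {reliability:.2f}, inter-rater agreement = {interrater:.2f}")
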

Stability: the same results are obtained over repeated administrations of the instrument.
- test-retest reliability
- parallel, equivalent, or alternate forms
Homogeneity: internal consistency (unidimensionality)
- item-total correlations; split-half reliability; Kuder-Richardson coefficient; Cronbach's alpha
Item-total correlations: each item on an instrument is correlated with the total score; an item with a low correlation may be deleted. The highest and lowest correlations are usually reported.
- only important if homogeneity of items is desired
Kuder-Richardson coefficient: when items have a dichotomous response, e.g., yes/no (binary)
Cronbach's alpha: for Likert-scale or linear graphic response formats
- compares the consistency of responses across all items on the scale (may need to be computed for each sample)
Equivalence: consistency of agreement of observers using the same measure, or among alternative forms of a tool
- parallel or alternate forms (described under stability)
- inter-rater reliability
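
A sketch of the item-total correlation check described above, assuming a small made-up matrix of Likert responses; items with low (or negative) correlations would be candidates for deletion:

    import numpy as np

    # Hypothetical Likert responses: 5 respondents x 4 items.
    scores = np.array([
        [4, 5, 2, 4],
        [3, 4, 3, 3],
        [5, 5, 1, 5],
        [2, 2, 4, 2],
        [4, 3, 2, 4],
    ])

    total = scores.sum(axis=1)
    for i in range(scores.shape[1]):
        # Correlate each item with the total score (a corrected version would
        # remove the item from the total before correlating).
        r = np.corrcoef(scores[:, i], total)[0, 1]
        print(f"item {i + 1}: item-total r = {r:.2f}")
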
TEST CONSTRUCTION (has rudiments, a process)
Test Planning
Decision to develop a standardized test:
(1) No test exists for a particular purpose, or (2) the tests existing for a certain purpose are not adequate for one reason or another.
Wechsler's idea for the WAIS originated from the Army Alpha (for literate soldiers) and the Army Beta (for illiterate soldiers); that is why there are verbal and performance tests.
- Wechsler covers both fluid and crystallized intelligence.
- The Culture Fair Intelligence Test looks into specific intelligence.
- This is the difference between the two in terms of defining intelligence.
Subject Matter Experts: the test developer must seek the help of experts in evaluating the test items and even the identified constructs or components of the test.
Writing Items: depends on whether the scale is to assess an attitude, content knowledge, ability, or personality traits; stick to one pattern (e.g., don't shift from declarative to interrogative statements).
Guidelines
1. Deal only with one central thought; an item with more than one is called double-barreled.
Poor item: My instructor grades fairly and quickly.
Better item: My instructor grades fairly.

2. Be precise.
Poor item: I received good customer service from Y Company.
Better item: A member of the sales staff at Y Company asked me if he could assist me within a minute of my entering the store.

3. Be brief.

4. Avoid awkward wording or dangling constructs.
Poor item: Being clear is the overall guiding principle in writing items.
Better item: The overall guiding principle in writing items is to be clear.
* Active voice is preferred over passive voice.

5. Avoid irrelevant information.

6. Present items in positive language.
* If using "not" is unavoidable, italicize or CAPITALIZE it.

7. Avoid double negatives.

8. Avoid terms like "all" and "none."
Poor item: Which of the following never occurs?
Better item: Which of the following is extremely unlikely to occur?

9. Avoid indeterminate terms like "frequently" or "sometimes."

10. Have someone else review your items.

Table of Specifications (blueprint)
- Cognitive domain: factual knowledge, ideas, and intellectual abilities
- Affective domain: deals with the values of a learner, including his interests, appreciation, and attitudes
- Psychomotor domain: readiness for a particular action that may be mental, physical, or emotional
Item Analysis
- A way of measuring the quality of questions: seeing how appropriate they were for the respondents and how well they measured their ability / trait
- A way of measuring items over and over again in different tests, with prior knowledge of how they are going to perform, creating a population of questions with known properties (e.g., a test bank)
- Administered at least 3 or 4 more times

Two families of item analysis:
- Classical Test Theory (CTT)
- Latent Trait Models: Item Response Theory (IRT; 1P, 2P, 3P, and 4P models) and Rasch models, which are similar

CTT: the true score model (X = T + e)
- the easiest and most widely used form of analysis
- performed on the test as a whole rather than on the item; although item statistics can be generated, they apply only to that group of students on that collection of items
- a set of psychometric procedures used to test items and scales: reliability, difficulty, discrimination, etc.
- assumes that every person has a true score on an item or a scale

Level of Difficulty: the proportion or percent of examinees that answered the item correctly. To determine the difficulty level, tally the number of examinees with the correct answer to the item and then apply the formula:

P = (Nu / N) x 100

where: P = % of students who answered the item correctly
Nu = number of examinees who answered the item correctly
N = total number of examinees (the 2 groups combined)

Table of % in level of difficulty
91% and above: Very easy - Unacceptable
79% - 90%: Easy - Acceptable
26% - 78%: Moderate (optimum difficulty) - Highly acceptable
11% - 25%: Difficult - Acceptable
10% and below: Very difficult - Unacceptable
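
A sketch of the difficulty computation and the classification in the table above, with hypothetical counts (the function name and example numbers are made up for illustration):

    def difficulty_level(n_correct: int, n_total: int) -> tuple[float, str]:
        """P = (Nu / N) x 100, classified per the table above."""
        p = (n_correct / n_total) * 100
        if p >= 91:
            label = "very easy (unacceptable)"
        elif p >= 79:
            label = "easy (acceptable)"
        elif p >= 26:
            label = "moderate / optimum difficulty (highly acceptable)"
        elif p >= 11:
            label = "difficult (acceptable)"
        else:
            label = "very difficult (unacceptable)"
        return p, label

    # Hypothetical item: 33 of 54 examinees (upper + lower groups) answered correctly.
    print(difficulty_level(33, 54))
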

Level of Difficulty Using Upper and Lower Groups
1. Score the papers after checking.
2. Arrange the papers from highest to lowest score.
3. Determine the size of the upper and lower groups by multiplying the total number of examinees by 27%.
4. The top 27% of the examinees is the upper group, while the bottom 27% of the total examinees comprises the lower group.
5. Get the frequencies of examinees who answered the item correctly in each of the 2 groups.
6. Determine the difficulty level and the discriminating power.
Discriminating Power: determines the difference between examinees who did well and those who did poorly on a particular item. To determine the discrimination level, perform the steps for the difficulty level, then determine the difference between the 2 groups and divide the difference by half of the total examinees. (? not finished in the original notes)
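
The note on discriminating power trails off; the following is a sketch of the usual upper-lower group computation consistent with the steps described (difference between the groups divided by the size of one group, i.e., half of the examinees analyzed); the counts are hypothetical:

    def discrimination_index(correct_upper: int, correct_lower: int, n_per_group: int) -> float:
        """D = (Ru - Rl) / n, where n is the size of one (27%) group."""
        return (correct_upper - correct_lower) / n_per_group

    # Hypothetical item: 22 of the upper group and 11 of the lower group
    # (27 examinees each) answered correctly.
    d = discrimination_index(22, 11, 27)
    print(f"D = {d:.2f}")   # higher D = better discrimination between the groups
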
Discriminability
- Item/total correlation: every item is correlated with the total score; the point-biserial method is best used.
Point-Biserial Method
- For dichotomously scored items / items with a correct answer
- One dichotomous variable (correct/incorrect) correlated with one continuous variable (total score) is a point-biserial correlation.
- Correlate whether people got each item correct with their total test score.
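
A sketch of a point-biserial item-total check for one dichotomous item, using made-up data; since one variable is 0/1 and the other continuous, the ordinary Pearson correlation gives the point-biserial value:

    import numpy as np

    item_correct = np.array([1, 0, 1, 1, 0, 1, 0, 1])          # 0/1 scoring on one item
    total_score  = np.array([38, 22, 41, 35, 25, 44, 30, 36])  # total test scores

    r_pb = np.corrcoef(item_correct, total_score)[0, 1]
    print(f"point-biserial r = {r_pb:.2f}")
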
CTT vs. LTM
- CTT: gauges the performance itself but not the trait that drives it; has the test as its basis; its statistics are often generalized to similar students taking a similar test, but strictly they apply only to those students taking that test.
- LTM: aims to look beyond the performance at the underlying traits which produce the test performance; measured at the item level and provides sample-free measurement.

Latent Trait Models (LTM): developed in the 1940s but widely used only from the 1960s
- practically unfeasible to use without specialized software
Item Response Theory (IRT): a family of latent trait models used to establish the psychometric properties of items and scales
- sometimes referred to as "modern psychometrics" because it has largely replaced CTT
- can predict whether someone has guessed on an item

3 Basic Components
1. Item Response Function (IRF): a mathematical function that relates the latent trait (e.g., individual differences on a construct) to the probability of endorsing an item.
2. Item Information Function: an indication of item quality; an item's ability to differentiate among respondents.
3. Invariance: item characteristics

Item Response Theory (IRT) describes the relationship between examinee trait level, item properties, and the probability of endorsing the item.
- These relationships can be plotted as Item Characteristic Curves (ICCs), graphical functions that show the probability of endorsement across levels of the respondent's ability.
Item Parameters - Location (b): an item's location b is defined as the amount of the latent trait needed to have a 0.5 probability of endorsing the item.
Item Parameters - Discrimination (a):
- indicates the steepness of the IRF at the item's location
- shows how strongly the item is related to the latent trait
- analogous to loadings in a factor analysis
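
A sketch of a two-parameter logistic (2PL) item response function using the a (discrimination) and b (location) parameters just described; the specific parameter values are made up:

    import math

    def irf_2pl(theta: float, a: float, b: float) -> float:
        """Probability of endorsing the item at trait level theta (2PL model)."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    a, b = 1.5, 0.0   # hypothetical discrimination and location
    for theta in (-2, -1, 0, 1, 2):
        print(f"theta = {theta:+d}: P = {irf_2pl(theta, a, b):.2f}")
    # At theta = b the probability is exactly 0.5, matching the definition of b above.
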
