Carlo Magno
Jerome Ouano
- Copyright page -
Acknowledgement
Special thanks to those who exerted their efforts in finishing this book: Ms. MR Aplaon for gathering the materials needed to complete each chapter; Mr. Paul Ong for his contribution to the chapter on “grading students”; Ms. Ma. Theresa Carmela Kanlapan for editing the grammar of the manuscript; Mr. Robert Chu for drafting the figures; and Ms. Sheena Morales for her contribution in testing the program and the guide in using the program.
Carlo Magno
- Editor's Note -
Table of Contents
Copyright …………………………………………………………………………………………II
Acknowledgement ………………………………………………………………………………III
Chapter 1
Lesson 1
Lesson 2
Lesson 3
Chapter 2
Lesson 1
Lesson 2
Lesson 3
Lesson 4
Chapter 3
Lesson 1
Reliability ……………………………………………………………………………………… 57
Kuder Richardson……………………………………………………………… 63
Cronbach’s Alpha ……………………………………………………………… 65
Lesson 2
Validity ……………………………………………………………………………………….. 73
Criterion-Prediction ………………………………………………………………….. 73
Construct ……………………………………………………………………………… 74
Lesson 3
Empirical Report: Construction and Development of a Test Instrument for Grade 3 Social
Studies ………………………………………………………………………………………… 96
Item Response Theory: Obtaining Item Difficulty Using the Rasch Model ………… 103
Procedure for the Calibration of Item and Person Ability …………………………… 106
Lesson 4
Chapter 4
Lesson 1
Lesson 2
Lesson 3
Lesson 4
Chapter 5
Lesson 1
Lesson 2
Lesson 3
Chapter 6
Lesson 1
Lesson 2
Lesson 3
Lesson 4
Chapter 7
Lesson 1
Lesson 2
Lesson 3
Chapter 8
Lesson 1
Lesson 2
Interpreting Test Scores through Norm and Criterion Reference ………………………….. 276
Lesson 3
Chapter 9
Lesson 1
Research, Evaluation, and Guidance Division of the Bureau of Public Schools ….. 311
Lesson 2
Building Future Leaders and Scientific Experts in Assessment and Evaluation in the
Philippines …………………………………………………………………………………. 316
List of Appendices
Appendix A
Appendix B
Chapter 1
Assessment, Measurement, and Evaluation
Chapter Objectives
Lessons
Lesson 1
Assessment in the Classroom Context
Assessment is integrated in all parts of the teaching and the learning process. This means
that assessment can take place before instruction, during instruction, and after instruction. Before
instruction, teachers can use assessment results as basis for the objectives and instructions for
their plans. These assessment results come from students' achievement tests from the previous year, their grades from the previous year, assessment results from the previous lesson, or pretest results gathered before instruction takes place. Knowing the assessment results from
different sources prior to planning the lesson helps teachers decide on a better instruction that is
more fit to the kind of learners they will handle, set objectives appropriate for their
developmental level, and think of better ways of assessing students to effectively measure the
skills acquired. During instruction, there are many ways of assessing student performance. While a class discussion is conducted, teachers can ask questions that students answer orally to assess whether students can recall, understand, apply, analyze, evaluate, and synthesize the facts presented. During instruction teachers can also provide seatwork and worksheets on every unit of the lesson to determine whether students have mastered the skills needed before moving to the next lesson. Assignments are also provided to reinforce what students learned inside the classroom.
Assessment done during instruction serves as formative assessment, meant to prepare students before they are finally assessed in major exams and tests. When students are ready to be assessed after instruction has taken place, they are assessed on the variety of skills they were trained in; this serves as a summative form of assessment. Final assessments come in the form of final exams, long tests, and final performance assessments, which cover a larger scope of the lesson and require more complex skills to be demonstrated. Assessments conducted at the end of instruction are more structured and are announced in advance, so students have time to prepare.
Review Questions:
Lesson 2
The Role of Measurement and Evaluation in Assessment
The concept of assessment is so broad that it involves other processes such as measurement and evaluation. Assessment involves several measurement processes in order to arrive at quantified results. When assessment results are used to make decisions and come up with judgments, evaluation takes place.
Figure
Measurement and evaluation as processes within assessment
3. Quantification allows objective comparison of groups. Suppose that male and female students were tested on their math ability using the same test for both groups. The mean of the males' math scores is 92.3 and the mean of the females' math scores is 81.4. It can be said that males performed better on the math test than females when the difference is tested for significance (see the sketch after this list).
5. Quantification makes the data amenable to further analysis. When data are quantified, teachers, guidance counselors, researchers, administrators, and other personnel can obtain different results to summarize and make inferences about the data. The data may be presented in charts, graphs, and tables showing means and percentages. The quantified data can be further analyzed using inferential statistics, such as when comparing groups, benchmarking, or assessing the effectiveness of an instructional program.
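To make the group comparison in point 3 concrete, here is a minimal sketch in Python using an independent-samples t-test. The individual scores below are hypothetical stand-ins; only the scenario of comparing males' and females' math scores comes from the list above.

# A minimal sketch: testing whether two groups' mean math scores differ
# significantly (independent-samples t-test). The scores are hypothetical.
from scipy.stats import ttest_ind

male_scores = [95, 90, 88, 94, 93, 91]    # hypothetical male math scores
female_scores = [82, 79, 85, 80, 81, 84]  # hypothetical female math scores

t_stat, p_value = ttest_ind(male_scores, female_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A p-value below .05 supports the claim that the two groups differ in math ability.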
be at least consistent. Repeating the measurement process several times and obtaining consistent results would indicate the objectivity of the procedure undertaken.
The process of measurement involves abstraction. Before a variable is measured using an
instrument, the variable’s nature needs to be clarified and studied well. The variable needs to be
defined conceptually and operationally to identify ways on how it is going to be measured.
Knowing the conceptual definition based on several references will show the theory or
conceptual framework that fully explains the variable. The framework reveals whether the
variable is composed of components or specific factors. Then these specific factors that comprise the variable need to be measured. A characteristic that is composed of several factors or components is called a latent variable. The components are usually called factors, subscales, or manifest variables. An example of a latent variable is “achievement.” Achievement is
composed of factors that include different subject areas in school such as math, general science,
English, and social studies. Once the variable is defined and its underlying factors are identified,
then the appropriate instrument that can measure the achievement can now be selected. When the
instrument or measure for achievement is selected, it will now be easy to operationally define the
variable. Operational definition includes the procedures on how a variable will be measured or
made to occur. For example, ‘achievement’ can be operationally defined as measured by the
Graduate Record Examination (GRE) that is composed of verbal, quantitative, analytical,
biology, mathematics, music, political science, and psychology.
When a variable is composed of several factors, it is said to be multidimensional. This means that a multidimensional variable requires an instrument with several subtests in order to directly measure the underlying factors. A variable that does not have underlying factors is said to be unidimensional. A unidimensional variable measures an isolated, unitary attribute. Examples of unidimensional measures are the Rosenberg Self-Esteem Scale and the Penn State Worry Questionnaire (PSWQ). Examples of multidimensional measures are various ability tests and personality tests that are composed of several factors. The 16 PF is a personality test that is composed of 16 components (reserved, more intelligent, affected by feelings, assertive, sober, conscientious, venturesome, tough-minded, suspicious, practical, shrewd, placid, experimenting, self-sufficient, controlled, and relaxed).
The common tools used to measure variables in the educational setting are tests,
questionnaires, inventories, rubrics, checklists, surveys and others. Tests are usually used to
determine student achievement and aptitude that serve a variety of purposes such as entrance
exam, placement tests, and diagnostic tests. Rubrics are used to assess performance of students in
their presentations such as speech, essays, songs, and dances. Questionnaires, inventories, and
checklists are used to identify certain attributes of students such as their attitude in studying,
attitude in math, feedback on the quality of food in the canteen, feedback on the quality of
service during enrollment, and other aspects.
Evaluation is arrived at when the necessary measurement and assessment have taken place. In order to evaluate whether a student will be retained or promoted to the next level, different aspects of the student's performance, such as grades and conduct, are carefully assessed and measured. To evaluate whether a remedial program in math is effective, the students' improvement in math, the teachers' teaching performance, and whether students' attitude toward math changed should be carefully assessed. Different measures are used to assess different aspects of the remedial program to come up with an evaluation. According to Scriven (1967), evaluation is "judging the worth or merit" of a case (e.g., a student), program, policy, process, event, or activity. These objective judgments derived from evaluation enable stakeholders (persons or groups with a direct interest, involvement, or investment in the program) to make further decisions about the case, programs, policies, processes, events, and activities.
In order to come up with a good evaluation, Fitzpatrick, Sanders, and Worthen (2004)
indicated that there should be standards for judging quality and deciding whether those standards
should be relative or absolute. The standards are applied to determine the value, quality, utility,
effectiveness, or significance of the case evaluated. In evaluating whether a university has a good reputation and offers quality education, it should be compared to a standard university that topped the world university rankings. The features of the university being evaluated should be similar to those of the standard university selected. A standard can also be in the form of ideal objectives such as the ones set by the Philippine Accreditation of Schools, Colleges, and Universities (PAASCU). A university is evaluated on whether it can meet the necessary standards set by the external evaluators.
Fitzpatrick, Sanders, and Worthen (2004) clarified the aims of evaluation in terms of its
purpose, outcome, implication, setting of agenda, generalizability, and standards. The purpose of
evaluation is to help those who hold a stake in whatever is being evaluated. Stakeholders consist
of many groups such as students, teachers, administrators, and staff. The outcome of evaluation
leads to judgment whether a program is effective or not, whether to continue or stop a program,
whether to accept or reject a student in the school. The implication that evaluation gives is to
describe the program, policies, organization, product, and individuals. In setting the agenda for
evaluation, the questions for evaluation come from many sources, including the stakeholders. In
making generalizations, a good evaluation is specific to the context in which the evaluation
object rests. The standards of a good evaluation are assessed in terms of its accuracy, utility,
feasibility, and propriety.
A good evaluation adheres to the four standards of accuracy, utility, feasibility, and
propriety set by the ‘Joint Committee on Standards for Educational Evaluation’ headed by
Daniel Stufflebeam in 1975 at Western Michigan University’s Evaluation Center. These four
standards set are now referred to as ‘Standards for Evaluation of Educational Programs, Projects,
and Materials.’ Table 1 presents the description of the four standards.
Table 1
Standards for Evaluation of Educational Programs, Projects, and Materials
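Following the Joint Committee (1994; see the References), the four standards can be summarized as follows:

Standard     Description
Utility      The evaluation serves the practical information needs of its intended users.
Feasibility  The evaluation is realistic, prudent, diplomatic, and frugal.
Propriety    The evaluation is conducted legally, ethically, and with due regard for the welfare of those involved in and affected by it.
Accuracy     The evaluation reveals and conveys technically adequate information about the features that determine the worth or merit of the program being evaluated.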
Forms of Evaluation
Owen (1999) classified evaluation according to its form. He said that evaluation can be
proactive, clarificative, interactive, monitoring, and impact.
1. Proactive. Ensure that all critical areas are addressed in an evaluation process.
Proactive evaluation is conducted before a program begins. It assists stakeholders in making
decisions on determining the type of program needed. It usually starts with needs assessment to
identify the needs of stakeholders that will be implemented in the program. A review of literature
is conducted to determine the best practices and creation of benchmarks for the program.
4. Monitoring. This evaluation is conducted when the program has settled. It aims to justify and fine-tune the program. It focuses on whether the outcomes of the program have been delivered to its intended stakeholders. It determines whether the target population is reached, whether the implementation meets the benchmarks, and what needs to be changed in the program to make it more efficient.
These forms of evaluation are appropriate at certain time frames and stage of a program.
The illustration below shows when each evaluation is appropriate.
Program Duration

Planning and Development Phase: Proactive, Clarificative
Implementation: Interactive and Monitoring
Settled: Impact
Models of Evaluation
Evaluation is also classified according to the models and framework used. The
classifications of the models of evaluation are objectives-oriented, management oriented,
consumer-oriented, expertise-oriented, participant-oriented, and theory driven.
5. Participant-oriented. The primary concern of this model is to serve the needs of those
who participate in the program such as students and teachers in the case of evaluating a course.
This model depends on the values and perspectives of the recipients of an educational program.
The specific models for this evaluation are Stake’s Responsive evaluation, Patton’s Utilization-
focused evaluation, Rappaport’s Empowerment Evaluation (see Fitzpatrick, Sanders, &
Worthen, 2004).
Figure 1
Implicit Theory for Proper Waste Disposal

Intervention (Quality of Instruction and Training) → Determinants (Adaptability, learning strategies, patience, and self-determination) → Outcome (Reduction of wastes, improvement of waste disposal practices, attitude change, and rating of environmental sanitation)
Table 2
Integration of the Forms and Models of Evaluation
Table 3
Implementing procedures of the Different Models of Evaluation
EMPIRICAL REPORTS
Examples of Evaluation Studies
Program Evaluation of the Civic Welfare Training Services
By Carlo Magno

…being related to the objectives and context, input, and process information. The information will be used to decide on whether to terminate, modify, or refocus a program.

There were a total of 250 participants in the study, composed of students, beneficiaries, program staff members, and selected clients. The instruments used were three sets of evaluation questionnaires for the students, program implementers, and beneficiaries, and one interview guide used for the recipients of the CSP. Data analysis was both quantitative and qualitative in nature.

For the context evaluation, the evaluators looked into the objectives of the CSP, the mission-vision of CSB, the objectives of the Social Action Office (SAO), and their congruence. The DLS-CSB mission-vision is realized in the six core Benildean values, and to realize the mission-vision, SAO created the CSP to enhance the social awareness of the students and instill social responsibility. Likewise, the objectives of the CSP are also aligned to the CSB mission and vision. 75% of the respondents said that the CSP objectives are in line with the CSB mission-vision. This was supported with actual experiences. A moderate rating was given by the students and beneficiaries as to the extent the community service program has met its objectives.

For the input evaluation, the profile of the students, program recipients, and implementers was reported. Most of the students were males, with an average age of 21, and from Manila. The recipients were mostly centers in the metropolis run by religious groups. Program implementers, on the other hand, are staff members responsible for the implementation of the program and have been with the college for 1-5 years.

The process evaluation of the program focused on the policies and procedures of the CSP, the role of the community service adviser, the strengths and weaknesses of the CSP, recommendations for improvement, and the insights of the program beneficiaries. In terms of policies, the CSP is a requirement for CSB students written in the Handbook. The program has 10 procedures, including application, general assembly, group meetings, leadership training, orientation seminar, initial area visit, immersion, group processing, and submission of documents. The students rated these as moderate as well; seven out of 10 of the procedures need improvement. On the role of the community service adviser, 68 of the students considered the role of advisers helpful. However, the effectiveness of the performance was rated only moderately satisfactory. Three strong points given to the CSP are the provision of opportunities to gain social awareness, the actualization of social responsibility, and the personal growth of the students. The weaknesses include the difficulty of program procedures, processes, and locations, and the negative attitude of some students. Some of the recommendations focus on program preparation, program staff, and community service locations. For the insights of the beneficiaries, some problems such as the attendance and seriousness of the students are taken into account and resolved through dialogue, feedback, and meetings. They also suggested to the CSP more intensive orientation and preparation as well as closer coordination and program continuity.

Lastly, for the product evaluation, the internalization and personification of the core Benildean values and the benefits gained by the students and beneficiaries were taken into account. For the internalization and personification, it appears that four out of the 6 core values are manifested by the students: deeply rooted faith, appreciation of individual uniqueness, professional competency, and creativity. Students also gained personal benefits such as increased social awareness, actualization of social responsibility, positive values, and realization of their blessings. On the other hand, the beneficiaries' benefits include long-term and short-term benefits. The short-term benefits are the socialization activities, interaction between the students and clients, material help, manpower assistance, and tutorial classes, while the long-term benefits are the values inculcated in the children, interpersonal relationships, knowledge imparted to them, and contribution to physical growth. The program beneficiaries also identified the solidarity and collaboration with the immersion centers.
…other; (3) which factors carry the most weight and which actions are likely to produce the greatest result; and (4) where do the greatest risks and constraints lie. The World Bank divided the countries according to different regions, such as Sub-Saharan Africa, East Asia and the Pacific, Europe and Central Asia, Latin America and the Caribbean, and the Middle East and North Africa.

Areas of Investigation

There are 28 studies done on educational policy with a manifested evaluation component. Education studies with no evaluation aspect were not included. A synopsis of each study with the corresponding methodology and recommendations is found in the World Bank reports. Most of the studies were conducted at the start of the 21st century. This can be explained by the growing trend in globalization, where communication across countries is more accessible. It can also be noted that no studies on educational policy with evaluation were completed for the years 1993, 1995, 1997, and 2005. The trend in the number of studies shows that, after a year, a study gives more generalized findings, since these studies covered a larger and wider array of sampling and took a long period of time to finish. More results are expected before the end of 2005. The trend of studies across the years is significantly different from the expected number of studies, as revealed using a one-way chi-square where the computed value (χ² = 28.73, df = 14) exceeds the critical value of χ² = 23.58 at a 5% probability of error.

Table 1
Counts of Area of Investigation From 1990 - 2006

Year   Country                       Area of Investigation              No. of Studies   Total No. of Studies per Year
2006   Vanuatu                       Language learning                  1                1
2005   None                                                             0                0
2004   Indonesia, Thailand           Undergraduate/Tertiary Education   2                5
       Senegal                       Adult Literacy                     1
       Different Regions, Columbia   Early Child Development            2
2003   Thailand                      Undergraduate/Tertiary Education   1                2
       Different Regions             AIDS/HIV Prevention                1
2002   Different regions             Textbook/Reading materials         1                2
2001   Africa                        Secondary Education                1                2
       Brazil                        Early Child Development            1
       China                         Secondary Education                1
2000   Different Regions             School Self-evaluation             1                7
       Different Regions             Early Child Development            1

Table 2
Counts of Area of Investigation From 1990 - 2006

Area of Investigation              Number of Studies
Language learning                  1
Undergraduate/Tertiary Education   4
Adult literacy                     2
Early Child Development            5
AIDS/HIV Prevention                1
Textbook/Reading material          1
Secondary education                3
School Self-evaluation             1
Basic education                    4
Test Evaluation                    1
Infant Care                        1
ICT                                2
Teacher Development                1
Vocational Education               1

Table 2 shows the number of studies conducted for every area in line with educational policy with evaluation. Most of the studies completed and funded are in the area of early child development, followed by tertiary education and basic education. This can be explained by the increasing number of early child care programs around the world, which are continuing and need to be evaluated in terms of their effectiveness at a certain period of time. Much of the concern is on early child development since it is a critical stage in life, and the development of an individual is evidently hampered if the child is not cared for at an early age. This also shows the increasing number of children whose needs are undermined and for whom intervention has to take place. These programs sought the assistance of the World Bank because they need further funding for the programs to exist. Having an evaluation of the child program likely supports the approval of a further grant.

There is also a large number of studies on basic and tertiary education where their effectiveness is evaluated. Almost all countries offer the same structure of education worldwide in terms of the levels from basic education to tertiary education. These deeply need attention since they are a basic key for developing nations to improve the quality of their education, because the quality of their people's skills depends on the country's overall labor force.

When the observed counts of studies for each area of interest are tested for goodness of fit, the computed chi-square value (χ² = 13, df = 13) did not reach significance at the 5% level of significance. This means that the observed counts per area do not significantly differ from what is expected to be produced.

Table 3
Study Grants by Country

Country             No. of studies
Vanuatu             1
Indonesia           1
Thailand            1
Senegal             1
Different Regions   10
Brazil              1
China               1
Pakistan            1
Cuba                1
Africa              2
USA                 1
Chile               1
Philippines         1

The studies done for each country are almost equally distributed, except for Africa with two studies from 1990 until the present period.
There is a bulk of studies done worldwide which covers a wider array of sampling across different countries. The worldwide studies usually evaluate common programs across different countries, such as teacher effectiveness and child development programs. However, there is great difficulty in coming up with an efficient judgment of the overall standards of each program. The advantage of having a worldwide study on educational programs for different regions is to have a simultaneous description of the common programs that are running, where the funding is most likely concentrated in one team of investigators rather than separate studies with different fund allocations. Another is the efficiency of maintaining consistency of procedures across different settings, unlike different researchers setting different standards for each country.

In the case of Africa, two studies were granted concentrating on adult literacy and distance education, because these educational programs are critical in their country as compared to others. As shown in the demographics of the African region, their programs (adult literacy, distance education) are increasingly gaining benefits for their stakeholders. There is a report of remarkable improvement in their adult education, and more tertiary students are benefiting from the distance education. Since they are showing effectiveness, much funding is needed to continue the programs.

When the numbers of studies are tested for significance across countries, the computed chi-square (χ² = 35.44, df = 12) reached significance against a critical value of χ² = 21.03 at a 5% probability of error. This means that the number of studies for each country differs significantly from what is expected to be produced. This is also due to having a large concentration of studies for different regions as compared to minimal studies for each country, which made the difference.

Method of Studies

Various methodologies are used to investigate the effectiveness of educational programs across different countries, although it can be seen in the reports that there is not much concentration and elaboration on the use and implementation of the procedures done to evaluate the programs. Most only mentioned the questionnaires and assessment techniques they used. Some mentioned a broad range of methodologies, such as quasi-experiments and case studies, but the specific designs are not indicated. It can also be noted that reports written by researchers/professors from universities are very clear in their method, which is academic in nature, but World Bank personnel writing the reports tend to focus on the justification of the funding rather than the clarity of the research procedure undertaken. It can also be noted that some reports did not show any part on the methodology. Most presented the introduction and some justifications of the program and, toward the end, the recommendations. The methodologies are just mentioned and not elaborated within the report, appearing only in some parts of the justification of the program.

Table 4
Counts of Methods Used

Method                                            Counts
Questionnaires/Inventories/Tests                  4
Quasi Experimental                                5
True Experimental                                 1
Archival Data (analyzed available demographics)   6
Observations                                      1
Case Studies                                      1
Surveys                                           1
Multimethod                                       9

It can be noted in Table 4 that most studies employ a multimethod approach, where different methods are employed in a study.
The multimethod approach creates an efficient way of cross-validating results for every methodology undertaken. One result from one method can be referenced against another result from another method, which makes it more powerful than using a single method. Since evaluation of the program is being done in most studies, it is indeed better to consider using a multimethod approach, since it can generate findings from which the researcher can arrive at a better judgment and description of the program.

It can also be noted that most studies are also using archival data to make justifications for the program. Most of these researchers, in reference to the archival data, are coming up with inferences from enrollment percentages, dropout rates, achievement levels, and statistics on physical conditions such as weight and height, which can be valid but do not directly assess the effectiveness of the program. The difficulty of using these statistics is that they do not provide a post-measurement of the program evaluated. This may be due to the difficulty of arriving at national surveys on achievement levels and enrollment profiles of different educational institutions, which are done annually but may not be in concordance with the timetable of the researchers. It is also commendable that a number of studies are considering quasi-experimental designs to directly assess the effectiveness of educational programs.

When the counts of the methodologies used are tested for significance, the computed chi-square value (χ² = 18.29, df = 7) reached significance over the critical chi-square value of χ² = 14.07 with a 5% probability of error. This shows that the methodologies used vary significantly from what is expected.

The Use of Evaluation Models

The evaluation model used by each of the studies was counted. There was difficulty in identifying the models used, since the researchers did not specifically elaborate the evaluation model or framework that they were using. It can also be noted that the researchers are not really after the model but after establishing the program or the continuity of the program. There is a marked difference between university academicians and World Bank personnel doing the study: the latter are misplaced in their assessment due to the lack of guidance from a model, while academicians would specifically state the context but somehow failed to elaborate on the process of adopting a CIPP model. Most studies are clear in their program objectives but failed to provide accurate measures of the program directly. The worst is that most studies are actually not guided by the use of a model in evaluating the educational programs proposed.

Table 5
Counts of Models/Frameworks Used

Model/Framework                   Counts
Objectives-Oriented Evaluation    10
Management-Oriented Evaluation    9
Consumer-Oriented Evaluation      0
Expertise-Oriented Evaluation     7
Participant-Oriented Evaluation   1
No model specified                3

As shown in Table 5, the majority of the evaluations used the objectives-oriented model, where they specified the program objectives and evaluated accordingly. A large number also used the management-oriented model and specifically made use of the CIPP by Stufflebeam (1968). A number of studies also used experts as external evaluators of the program implementation. Most of the studies actually did not mention the model used, and the models were just identified as described by the procedure in conducting the evaluation.

Most studies used the objectives-oriented model since the thrust is on educational policy, and most educational programs start with a means of stating objectives.
These objectives are also treated as ends, where the evaluation is basically used as the basis. The other studies, which used the management-oriented evaluation, are the ones that typically describe the context of the educational setting through the available archival data provided by national and countrywide surveys. The inputs and outputs are also described, but most are weak in elaborating the process undertaken. The counts on the use of evaluation models (χ² = 18, df = 5) reached significance at 5% error. This means that the counts are significantly different from what is expected. This shows a need to use other models of evaluation as appropriate to the study being conducted.

…program, since the judgment on how the program is taking place is concentrated on, and not other matters which undermine the results of the program. A good alternative is for the research grantee to allocate another budget for a follow-up program evaluation after establishing the program.

4. It is recommended that when screening studies, a criterion on the use of an evaluation model should be included. The researchers making an evaluation study can be better guided with the use of an evaluation model.

References

O'Gara, C., Lusk, D., Canahuati, J., Yablick, G., & Huffman, S. L. (1999). Good practices in infant and toddler group care. World Bank Reports.

Operational guidelines for textbooks and reading materials. (2002). World Bank Reports.

Orazem, P. F. (2000). The urban and rural fellowship school experiments in Pakistan: Design, evaluation, and sustainability. World Bank Reports.

Ware, S. A. (1992). Secondary school science in developing countries: Status and issues. World Bank Reports.

Xie, O., & Young, M. E. (1999). Integrated child development in rural China. World Bank Reports.

Young, E. M. (2000). From early child development to human development: Investing in our children's future. World Bank Reports.
Lesson 3
The Process of Assessment
The previous lesson clarified the distinction between measurement and evaluation. After learning the process of assessment in this lesson, you should know how measurement and evaluation are used in assessment.
Assessment goes beyond measurement. Evaluation can be involved in the process of
assessment. Some definitions from assessment references show the overlap between assessment
and evaluation. But Popham (1998), Gronlund (1993), and Huba and Freed (2000) defined
assessment without overlap with evaluation. Take note of the following definitions:
Cronbach (1960) identified three important features of assessment that make it distinct from evaluation: (1) the use of a variety of techniques, (2) reliance on observation in structured and unstructured situations, and (3) the integration of information. These three features emphasize that assessment is not based on a single measure but on a variety of measures.
In the classroom, a student’s grade is composed of the quizzes, assignments, recitations, long
tests, projects, and final exams. These sources were assessed through formal and informal
structures and integrated to come up with an overall assessment as represented by a student’s
final grade. In Lesson 1, assessment was defined as “the process of collecting various information needed to come up with overall information that reflects the attainment of goals and purposes.” There are three critical characteristics of this definition:
battery of intelligence tests should yield the same results in order to determine the overall ability of a case. In cases where some results are inconsistent, there should be a synthesis of the overall assessment indicating that in some measures the results do not support the overall assessment.
Assessment Procedures
The process of assessment was summarized by Bloom (1970). He indicated that there are
two processes involved in assessment:
2. It proceeds to the determination of the kind of evidence that is appropriate about the
individuals who are placed in the learning environment such as their relevant strengths and
weaknesses, skills, and abilities.
In the classroom context, it was explained in Lesson 1 that assessment takes place before,
during and after instruction. This process emphasizes that assessment is embedded in the
teaching and the learning process. Assessment generally starts in the planning of learning
processes when learning objectives are stated. A learning objective is defined in measurable terms to have an empirical way of testing it. Specific behaviors are stated in the objectives so that they correspond with some form of assessment. During the implementation of the lesson, assessment can occur. A teacher may provide feedback based on student recitation exercises, short quizzes, and classroom activities that allow students to demonstrate the skill intended in the
objectives. The assessment done during instruction should be consistent with the skills required
in the objectives of the lesson. The final assessment is then conducted after enough assessment
can demonstrate student mastery of the lesson and their skills. The final assessment conducted
can be the basis for the succeeding objectives for the next lesson. The figure below illustrates the
process of assessment.
Figure 1
The Process of Assessment in the Teaching and Learning Context
Assessment
Forms of Assessment
Tests. Tests are basically tools that measure a sample of behavior. Generally, a variety of tests is provided inside the classroom. They can take the form of a quiz, a long test (usually covering smaller units or chapters of a lesson), or a final exam. The majority of tests for students are teacher-made tests. These tests are tailored for students depending on the lessons covered by the syllabus. The tests are usually checked by colleagues to ensure that items are properly constructed.
Teacher-made tests vary in the form of a unit, chapter, or long test. These generally assess how much a student learned within a unit or chapter. It is a summative test in the sense that it is given after instruction. The coverage is only what has been taught in a given chapter or tackled within a given unit.
Tests also come in the form of a quiz. A quiz is a short form of assessment. It usually measures how much the student acquired within a given period or class. The questions are usually drawn from what has been taught within the lesson for the day or a topic tackled in a short period of time, say a week. A quiz can be summative or formative: summative if it aims to measure the learning from instruction, or formative if it aims to test how much the students already know prior to instruction. The results of a quiz can be used by the teacher to know where to start the lesson (for example, if the students already know how to add single digits, then she can proceed to adding double digits). It can also determine whether the objectives for the day were met.
Should the teacher call more on the students who are silent most of the time in class?
Should the teacher ask students who could not comprehend the lesson easily more often?
Should recitation be a surprise?
Are the difficult questions addressed to disruptive students?
Are easy questions only for students who are not performing well in class?
Projects. Projects can come in a variety of forms depending on the objectives of the lesson; a reaction paper, a drawing, or a class demonstration can all be considered projects depending on the purpose. The features of a project should include: (1) tasks that are relevant in a real-life setting, (2) activities that require higher order cognitive skills, (3) assignments that can assess and demonstrate affective and psychomotor skills which supplement instruction, and (4) activities that require application of the theories taught in class.
and creating a script for a play, painting a vase. These tasks are usually extended as an
assignment if the time in school is not sufficient. Portfolios are collections of students’ works.
For an art class the students will compile all paintings made, for a music class all compositions
are collected, for a drafting class all drawings are compiled. Table 4 shows the different tasks
using performance assessment.
Table 4
Outcomes Requiring Performance Assessment
Outcome Behavior
Skills Speaking, writing, listening, oral reading, performing experiments, drawing,
playing a musical instrument, gymnastics, work skills, study skills, and social
skills
Work habits Effectiveness in planning, use of time, use of equipment resources, the
demonstration of such traits as initiative, creativity, persistence, dependability
Social Concern for the welfare of others, respect for laws, respect for the property of
attitudes others, sensitivity to social issues, concern for social institutions, desire to work
toward social improvement
Scientific Open-mindedness, willingness to suspend judgment, cause-effect relations, an
attitudes inquiring mind
Interests Expressing feelings toward various educational, mechanical, aesthetic, scientific,
social, recreational, vocational activities
Appreciations Feeling of satisfaction and enjoyment expressed toward music, art, literature,
physical skill, outstanding social contributions
Adjustments Relationship to peers, reaction to praise and criticism, reaction to authority,
emotional stability, social adaptability
Over the years, the practice of assessment has changed due to improvements in teaching and learning principles. These principles are a result of research that called for more information on how learning takes place. The shift from old practices to what is ideal in the classroom is shown below.

From To
Summative Formative
The old practice of assessment focuses on traditional forms of assessment, such as paper-and-pencil tests with a single correct answer, usually conducted at the end of the lesson. In the contemporary perspective, assessment is not necessarily in the form of paper-and-pencil tests, because there are skills that are better captured through performance assessment such as presentations, psychomotor tasks, and demonstrations. Contemporary practice welcomes a variety of answers from students, who are allowed to make interpretations of their own learning. It is now accepted that assessment is conducted concurrently with instruction and does not serve only a summative function. There is also a shift toward assessment items that are contextualized and have more utility. Rather than asking for the definitions of verbs, nouns, and pronouns, students are required to make an oral or written communication about their favorite book. It is also important that students assess their own performance to facilitate self-monitoring and self-evaluation.
Uses of Assessment
Assessment results have a variety of applications, from selection to appraisal to aiding stakeholders in the decision-making process. These functions of assessment vary within the educational setting, whether assessment is conducted for human resources, counseling, instruction, research, or learning.
1. Appraising. Assessment is used for appraisal. Forms of appraisal are grades, scores, ratings, and feedback. Appraisals are used to provide feedback on an individual's performance to determine how much improvement can be made. A low appraisal or negative feedback indicates that performance still has room for improvement, while a high appraisal or positive feedback means that performance needs to be maintained.
5. Accountability and program evaluation. Assessment results are used for evaluation and accountability. In making judgments about individuals or educational programs, multiple sources of assessment information are used. Results of evaluations make the administrators or the ones who implemented the program accountable to the stakeholders and other recipients of the program. This accountability ensures that the program implementation is improved depending on the recommendations from the evaluations conducted. Improvement takes place if assessment coincides with accountability.
6. Counseling. Counseling also uses a variety of assessment results. Variables such as study habits, attention, personality, and dispositions are assessed in order to help students improve them. Students who are assessed to be easily distracted inside the classroom can be helped by the school counselor by focusing the counseling session on devising ways to improve the student's attention. A student who is assessed to have difficulties in classroom tasks is taught to self-regulate during the counseling session. Students' personality and vocational interests are also assessed to guide them toward future courses suitable for them to take.
Guide Questions:
References
Fitzpatrick, J. L., Sanders, J. R., & Worthen, B. R. (2004). Program evaluation: Alternative
approaches and practical guidelines (3rd ed.). New York: Pearson.
Gronlund, N. E. (1993). How to write achievement tests and assessment (5th ed.). Needham
Heights: Allyn & Bacon.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation
standards (2nd ed.). Thousand Oaks, CA: Sage.
Magno, C. (2007). Program evaluation of the civic welfare training services (Tech Rep. No. 3).
Manila, Philippines: De La Salle-College of Saint Benilde, Center for Learning and
Performance Assessment.
McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction.
Boston: Allyn & Bacon.
Popham, W. J. (1998). Classroom assessment: What teachers need to know (2nd ed.). Needham
Heights, MA: Allyn & Bacon.
Chapter 2
The Learning Intents
Chapter Objectives
Lessons
Lesson 1
Stating Learning Intents
Having learned about measurement, assessment, and evaluation, this chapter discusses learning intents, which refer to the objectives or targets the teacher sets as the competencies to build in students. These are the target skills or capacities that students need to develop as they engage in the learning episodes. The same competencies are assessed using relevant tools to generate quantitative and qualitative information about your students' learning behavior.
Prior to designing the learning activities and assessment tasks, you first have to formulate
your learning intents. These intents exemplify the competency you wish students will develop in
themselves. At this point, your deep understanding on how learning intents should be formulated
is very useful. As you go through this chapter, your knowledge about the guidelines in
formulating these learning intents will help you understand how assessment tasks should be
defined.
One of the important skills of teachers is determining the appropriate learning intents for their students. Learning intents are the targets of the instruction and assessment that take place. Learning intents come in the form of objectives, goals, standards, criteria, and expectations. Usually teachers state their objectives at the start of instruction and assessment.
The holistic formation of the student is the primary concern in the teaching and learning process, and objectives are aimed at developing students' cognitive, affective, and psychomotor skills. The common sources of objectives are:
Existing lists of objectives. Teachers make use of existing lists of objectives in writing their objectives for a specific lesson. These existing objectives are found in the syllabus and school goals. They are made specific in the lesson plans.
National standards. There are available national standards that can be used as a basis for identifying appropriate objectives. One example is the minimum learning competencies provided by the Department of Education.
Needs of students and society. The needs of the students are a basic source for identifying objectives. These needs can be drawn from the results of previous achievement tests, needs assessments, and findings from previous school research involving the students.
Mission and vision of the school. The mission and vision of the school provide the major guide in selecting objectives for students. The mission and vision are further broken down into specific goals to be translated into unit lessons.
Writing an objective includes a criterion in order to make it more specific and measurable. It is important to make an objective concrete in order to directly observe whether it is attained during instruction. Consider the following objectives:
“From a standing still position on a level, hard surface (condition), male students (audience) will
jump (behavior) at least two feet (criterion).”
“Given two hours in the library without notes (condition), students in the high reading group (audience) will identify (behavior) five sources on the topic ‘national health insurance’ (criterion).”
1. Objectives are always intended for the learners, audience, students, or participants in a training program.
2. Objectives are specific and measurable. Each objective should have a corresponding
assessment to test whether it was met. The following table shows how each objective will be
assessed.
Objective: Given a microscope with glass slides, students in the biology class will mount 5 specimens found in the school garden.
Assessment: Performance assessment in the proper mounting of at least 5 specimens.

Objective: Given the constructed anemometer, the grade 4 pupils will record the wind speed every 5 hours.
Assessment: Listing of the wind speed every 5 hours during school time.

Objective: Given a 1-inch paper clip, the grade 5 students will measure the length, width, and area of the gym floor.
Assessment: Measurement of the gym floor.
3. Objectives should be attainable given the parameters of instruction and learning. For a 40-minute class, objectives should be realistically accomplished within the time frame.
Objectives can be classified as general and specific. Usually a general objective is stated and then broken down into specific objectives in order to attain it. Consider the following example:
General Objective:
Specific Objectives:
Lesson 2
The Conventional Taxonomic Tools
Psychomotor Domain
Imitation Observes a skill and attempts to repeat it
Manipulation Performs skill according to instruction rather than observation
Articulation Combines more than one skill in sequence with harmony and
consistency
Naturalization Completes one or more skills with ease and becomes automatic with
limited physical or mental exertion
Figure 1
Bloom’s Taxonomy
EVALUATION
SYNTHESIS
ANALYSIS
APPLICATION
COMPREHENSION
KNOWLEDGE
Figure 1 shows a guide for teachers in stating learning intents based on six dimensions of cognitive process. Knowledge, the dimension lowest in complexity, includes simple cognitive activity such as recall or recognition of information. The cognitive activity in comprehension includes understanding of the information and concepts, translating them into
comprehension includes understanding of the information and concepts, translating them into
other forms of communication without altering the original sense, interpreting, and drawing
conclusions from them. For application, emphasis is on students’ ability to use previously
acquired information and understanding, and other prior knowledge in new settings and applied
contexts that are different from those in which it was learned. For learning intents stated at the
Analysis level, tasks require identification and connection of logic, and differentiation of
concepts based on logical sequence and contradictions. Learning intents written at this level
indicate behaviors that indicate ability to differentiate among information, opinions, and
inferences. Learning intents at the synthesis level are stated in ways that indicate students’ ability
to produce a meaningful and original whole out of the available information, understanding,
contexts, and logical connections. Evaluation includes students’ ability to make judgments and
sound decisions based on defensible criteria. Judgments include the worth, relevance, and value
of some information, ideas, concepts, theories, rules, methods, opinions, or products.
Knowledge of cognition and awareness. The subtypes are strategic knowledge (e.g., the use of heuristics), knowledge of cognitive tasks (e.g., knowledge of the cognitive demands of different tasks), and self-knowledge (awareness of one's own knowledge level).
The Cognitive Process Dimension is where specific behaviors are pegged using active verbs. So that there is consistency in the description of specific learning behaviors, the categories in the original taxonomy, which were labeled in noun forms, are now replaced with their verb counterparts. Synthesis changed places with Evaluation, and both are now stated in verb forms.
Remembering. This includes recalling and recognizing relevant knowledge from long-
term memory.
Understanding. This is the determination of the meanings of messages from oral, written
or graphic sources.
Applying. This involves carrying out procedural tasks, executing or implementing them in
particular realistic contexts.
Analyzing. This includes deducing concepts into clusters or chunks of ideas and
meaningfully relating them together with other dimensions.
Evaluating. This is making judgments relative to clear standards or defensible criteria to
critically check for depth, consistency, relevance, acceptability, and other areas.
Creating. This includes putting together some ideas, concepts, information, and other
elements to produce complex and original, but meaningful whole as an outcome.
The use of the revised taxonomy in different programs has benefited both teachers and
students in many ways (Ferguson, 2002; Byrd, 2002). The benefits generally come from the fact
that the revised taxonomy provides clear dimensions of knowledge and cognitive processes in
which to focus in the instructional plan. It also allows teachers to set targets for metacognition
concurrently with other knowledge dimensions, which is difficult to do with the old taxonomy.
Figure 3
Lesson 3
Other Learning Taxonomies
Bloom's taxonomy and the revised taxonomy are not the only existing taxonomic tools for setting our instructional targets. There are other equally useful taxonomies.
Gagne’s Taxonomy
One of these was developed by Robert M. Gagne. In his theory of instruction, Gagne sought to help teachers make sound educational decisions so that the probability of achieving the desired learning results is high. These decisions necessitate the setting of intentional goals that assure learning.
In stating learning intents using Gagne’s taxonomy, we can focus on three domains. The
cognitive domain includes Declarative (verbal information), Procedural (intellectual skills), and
Conditional (cognitive strategies) knowledge. The psychological domain includes affective
knowledge (attitudes). The psychomotor domain involves the use of physical movement (motor
skills).
Verbal Information. Verbal information includes a vast body of organized knowledge that
students acquire through formal instructional processes, and other media, such as television, and
others. Students understand the meaning of concepts rather than just memorizing them. This
condition of learning lumps together the first two cognitive categories of Bloom’s taxonomy.
Learning intents must focus on differentiation of contents in texts and other modes of
communication; chunking the information according to meaningful subsets; remembering and
organizing information.
Intellectual skills. Intellectual skills include procedural knowledge that ranges from
Discrimination, to Concrete Concepts, to Defined Concepts, to Rules, and to Higher Order
Rules.
Discrimination involves the ability to distinguish objects, features, or symbols. Detection
of difference does not require naming or explanation.
Concrete Concepts involve the identification of classes of objects, features, or events,
such as differentiating objects according to concrete features, such as shape.
Defined Concepts include classifying new and contextual examples of ideas, concepts, or events by their definitions. Here, students make use of labels of terms denoting defined concepts for certain events or conditions.
Rules apply a single relationship to solve a group of problems. The problem to be solved
is simple, requiring conformance to only one simple rule.
Higher order rules include the application of a combination of rules to solve a complex
problem. The problem to be solved requires the use of complex formula or rules so that
meaningful answers are arrived at.
Learning intents stated at this level of the cognitive domain must give attention to abilities to spot distinctive features, use information from memory to respond to intellectual tasks in various contexts, and make connections between concepts and relate them to appropriate situations.
Cognitive strategies. Cognitive strategies consist of a number of ways to make students develop skills in guiding and directing their own thinking, actions, feelings, and their learning process as a whole. Students create and hone their metacognitive strategies. These processes help them regulate and oversee their own learning, and consist of planning and monitoring their cognitive activities, as well as checking the outcomes of those activities. Learning intents should emphasize abilities to describe and demonstrate original and creative strategies that students have tried out in various conditions.
Attitudes. Attitudes are internal states of being that are acquired through earlier
experience of task engagement. These states influence the choice of personal response to things,
events, persons, opinions, concepts, and theories. Statements of learning intents must establish a
degree of success associated with desired attitude, call for demonstration of personal choice for
actions and resources, and allow observation of real-world and human contexts.
Motor skills. Motor Skills are well defined, precise, smooth and accurately timed
execution of performances involving the use of the body parts. Some cognitive skills are required
for the proper execution of motor activities. Learning intents drawn at this domain should focus
on the execution of fine and well-coordinated movements and actions relative to the use of
known information, with acceptable degree of mastery and accuracy of performance.
Stiggins and Conklin’s Taxonomy
Another taxonomic tool is the one developed by Stiggins and Conklin (1992), which uses
categories of learning as bases for stating learning intents.
Knowledge. This includes simple understanding and mastery of a great deal of subject
matter, processes, and procedures. Very fundamental to the succeeding stages of learning is the
knowledge and simple understanding of the subject matter. This learning may take the form of
remembering facts, figures, events, and other pertinent information, or of describing, explaining,
and summarizing concepts and citing examples. Learning intents must endeavor to develop
mastery of facts and information as well as simple understanding and comprehension of them.
Reasoning. This indicates the ability to use deep knowledge of subject matter and
procedures to reason defensibly and solve problems with efficiency. Tasks under this
category include critical and creative thinking, problem solving, making judgments and
decisions, and other higher order thinking skills. Learning intents must, therefore, focus on the
use of knowledge and simple understanding of information and concepts to reason and solve
problems in contexts.
Skills. This highlights the ability to demonstrate skills to perform tasks with an acceptable
degree of mastery and adeptness. Skills involve overt behaviors that show knowledge and deep
understanding. For this category, learning intents have to take particular interest in the
demonstration of overt behaviors or skills in actual performance that requires procedural
knowledge and reasoning.
Product. In this area, the ability to create and produce outputs for submission or oral
presentations is given importance. Because outputs generally represent mastery of knowledge,
deep understanding, and skills, they must be considered as products that demonstrate the ability
to use that knowledge and deep understanding, and to employ skills in a strategic manner so that
tangible products are created. For the statement of learning intents, teachers must state expected
outcomes, either process- or product-oriented.
Affect. Focus is on the development of values, interests, motivation, attitudes, self-
regulation, and other affective states. In stating learning intents on this category, it is important
that clear indicators of affective behavior can easily be drawn from the expected learning tasks.
Although many teachers find it difficult to determine indicators of affective learning, it is
inspiring to realize that it is not impossible to assess it.
These categories of learning by Stiggins and Conklin are helpful especially if your intents
focus on complex intellectual skills and the use of these skills in producing outcomes to increase
self-efficacy among students. In attempting to formulate statements of learning outcome at any
category, you can be clear about what performance you want to see at the end of the instruction.
In terms of assessment, you would know exactly what to do and what tools to use in assessing
learning behaviors based on the expected performance. Although stating learning outcomes in
the affective category is not as easy as in the knowledge and skill categories, trying it
can help you approximate the degree of engagement and motivation required to perform what is
expected. If you would also like to give prominence to this category without stating another
learning intent that particularly focuses on the affective states, you might just look for some
indicators in the cognitive intents. This is possible because knowledge, skills, and attitudes are
embedded in every single statement of learning intent.
Marzano’s Dimension of Learning
Another alternative guide for setting learning targets is the one introduced by Robert J.
Marzano in his Dimensions of Learning (DOL). As a taxonomic tool, the DOL provides a
framework for assessing various types of knowledge as well as different aspects of processing,
which comprise six levels of learning in a taxonomic model called the new taxonomy
(Marzano & Kendall, 2007). These levels of learning are categorized into different systems.
The Cognitive System. The cognitive system includes those cognitive processes that
effectively use or manipulate information, mental procedures, and psychomotor procedures in
order to successfully complete a task. It covers the first four levels of learning.
Level 1: Retrieval. At this level of the cognitive system, students engage in mental
operations for the recognition and retrieval of information, mental procedures, or psychomotor
procedures. Students engage in recognizing, where they identify the characteristics, attributes,
qualities, aspects, or elements of information, a mental procedure, or a psychomotor procedure;
recalling, where they remember relevant features of information, a mental procedure, or a
psychomotor procedure; or executing, where they carry out a specific mental or psychomotor
procedure. Neither an understanding of the structure and value of the information nor of the
hows and whys of the mental or psychomotor procedure is necessary.
In each system, three dimensions of knowledge are involved: information, mental
procedures, and psychomotor procedures.
Information
The domain of informational knowledge involves various types of declarative knowledge
that are ordered according to levels of complexity. From its most basic to more complex levels, it
includes vocabulary knowledge, in which the meanings of words are understood; factual knowledge,
in which information constituting the characteristics of specific facts are understood; knowledge
of time sequences, where understanding of important events between certain time points is
obtained; knowledge of generalizations of information, where pieces of information are
understood in terms of their warranted abstractions; and knowledge of principles, in which causal
or correlational relationships of information are understood. The first three types of
informational knowledge focus on knowledge of informational details, while the next two types
focus on informational organization.
Mental Procedures
The domain of mental procedures involves those types of procedural knowledge that
make use of the cognitive processes in a special way. In its hierarchic structure, mental
procedures could be as simple as the use of a single rule, in which production is guided by a small
set of rules requiring a single action. If single rules are combined into general rules and are
used in order to carry out an action, the mental procedures are of the tactical type, or an
algorithm, especially if specific steps are set for specific outcomes. Macroprocedures are at the
top of the hierarchy of mental procedures, involving the execution of multiple interrelated
processes and procedures.
Psychomotor Procedures
The domain of psychomotor procedures involves those physical procedures for
completing a task. In the new taxonomy, psychomotor procedures are considered a dimension of
knowledge because, very similar to mental procedures, they are regulated by the memory system
and develop in a sequence from information to practice, then to automaticity (Marzano &
Kendall, 2007).
In summary, the new taxonomy of Marzano and Kendall (2007) provides us with a
multidimensional taxonomy where each system of thinking comprises three dimensions of
knowledge that will guide us in setting learning targets for our classrooms. Table 2a shows the
matrix of the thinking systems and dimensions of knowledge.
The six levels of learning in the new taxonomy, from highest to lowest, are:
Level 6: Self System
Level 5: Metacognitive System
Level 4: Knowledge Utilization (Cognitive System)
Level 3: Analysis (Cognitive System)
Level 2: Comprehension (Cognitive System)
Level 1: Retrieval (Cognitive System)
Figure 5
The Hats

Hat     Perspective     Representation
White   Observer        White paper, neutral
Red     Self & others   Fire, warmth
Black   Stern judge     Wearing black
Yellow  Self & others   Sunshine, optimism
Green   Self & others   Vegetation
Blue    Observer        Sky, cool
These six thinking hats are beneficial not only in our teaching episodes but also in the
learning intents that we set for our students. If qualities of thinking such as creative thinking,
communication, decision-making, and metacognition are among those that you want to develop
in your students, these six thinking hats could help you formulate statements of learning intents
that clearly set the direction of learning. An added benefit is that when your intents are
stated along the planes of these hats, the learning episodes can be defined easily. Consequently,
assessment is made more meaningful.
A. Formulate statements of learning intent using the Revised Taxonomy, focusing on any
category of the knowledge dimension but on the higher categories of the cognitive dimension.
B. Bring those statements of learning intents to Robert Gagne’s taxonomy and see where they
will fit. You may customize the statements a bit so that they fit well to any of Gagne’s
categories of learning.
C. Do the same process of fitting to Stiggins’ categories of learning, then The New
Taxonomy. Remember to customize the statements when necessary.
D. Draw insights from the process and share them in class.
Lesson 4
Specificity of the Learning Intent
Gronlund (in McMillan, 2005) uses the term instructional objectives to mean
intended learning outcomes. He emphasizes that instructional objectives should be
stated in terms of specific, observable, and measurable student responses.
In writing statements of learning intents for the courses we teach, we aim to state the behavior
outcomes to which our teaching efforts are devoted, so that, from these statements, we can
design specific tasks in the learning episodes for our students to engage in. However, we need
to make sure that these statements are set with the proper level of generality so that they
neither oversimplify nor complicate the outcome.
A statement of intent could have a rather long range of generality so that many sub-
outcomes may be indicated. Learning intents that are stated in general terms will need to be
defined further by a sample of the specific types of student performance that characterize the
intent. In doing this, assessment will be easy because the performance is clearly defined. Unlike
the general statements of intent that may permit the use of not-so-active verbs such as know,
comprehend, understand, and so on, the specific ones use active verbs in order to define specific
behaviors that will soon be assessed. The selection of these verbs is vital in the preparation
of a good statement of learning intent. Three points might help in selecting active verbs:
1. See that the verb clearly represents the desired learning intent.
2. Note that the verb precisely specifies acceptable performance of the student.
3. Make sure that the verb clearly describes relevant assessment to be made within or at the
end of the instruction.
The statement “Students know the meaning of terms in science” is general. Although it
gives us an idea of the general direction of the class towards the expected outcome, we might
be confused as to what specific behaviors of knowing will be assessed. Therefore, it is necessary
to draw a representative sample of specific learning intents, for example:
Given a short selection, the student can identify statements of facts and of
opinions.
If more specificity is still desired, you might want to add a statement of
criterion level. This time, the statement may sound like this:
Given a short selection, the student can correctly identify at least 5 statements of
facts and 5 statements of opinion in no more than five minutes without the aid of
any resource materials.
The lesson plan may allow the use of moderately specific statements of learning intents,
with condition and criterion level briefly stated. In doing assessment, however, these intents will
have to be broken down to their substantial details, such that the condition and criterion level are
specifically indicated. Note that it is not necessarily about choosing which one statement is better
than the other. We can use them in planning for our teaching. Take a look at this:
Learning Intent Student will differentiate between facts and opinions from written texts.
Assessment Given a short selection, the student can correctly identify at least 5
statements of facts and 5 statements of opinion in no more than five
minutes without the aid of any resource materials.
If you insert into the text the instructional activities or learning episodes, well described, as
well as the materials needed (plus other entries specified in your context), you now have a
simple lesson plan.
Exercises
Place an “X” before each of those in the following list which are student objectives stated in
measurable terms.
___ 1. To develop critical thinking skills.
___ 2. To identify those celestial bodies that are known as planets.
___ 3. To provide worthwhile experiences for the students.
___ 4. To recognize subject and verb in a sentence.
___ 5. To tie shoes in a bow, without making a knot.
___ 6. To write a summary of factors that led to World War II.
___ 7. To fully appreciate the value of music.
___ 8. To prepare a critical comparison of the two major political parties in the United States
today.
___ 9. To illustrate an awareness of the importance of balanced ecology by supplying relevant
newspaper articles.
___ 10. To know all the rules of spelling and grammar.
Classify each of the following instructional objectives by writing on the blank space the
appropriate letter according to the domain: (C) Cognitive; (P) Psychomotor; (A) Affective.
____ 1. The student will continue jumping rope until he or she can successfully jump it
ten times in succession.
____ 2. The student will identify the capitals of all fifty states.
____ 3. The student will summarize the history of the development of the Republican Party in
the United States.
____ 4. The student will demonstrate a continuing desire to learn to use the microscope by
volunteering to work with it during free time.
____ 5. The student will volunteer to tidy up the room.
____ 6. After reading and analyzing several books, the student will identify the respective
authors.
____ 7. The student will translate a favorite Vietnamese poem into English.
____ 8. The student will accurately predict the results of combining genes from an available
gene pool.
____ 9. The student will indicate his or her interest in the subject by voluntarily reading
additional books from the library about dinosaurs.
____ 10. The student will make the ring toss in a minimum of seven in ten attempts.
Instruction: Indicate the cognitive level of the following questions. Write whether they are
knowledge, comprehension, application, analysis, synthesis or evaluation.
_________________ 1. What was the name of the organization represented by our great
speaker?
_________________ 2. How are the styles of the two artists similar?
_________________ 3. Which of the poems do you think is the most interesting?
_________________ 4. What other tools could you use to accomplish the same task?
_________________ 5. What country lies between China and India?
_________________ 6. How might these rocks be logically grouped?
_________________ 7. Could you explain how these two types of redwood needles differ?
_________________ 8. What do you predict would happen if we mixed equal amounts of two
colored solutions, the red solution with the yellow solution?
_________________ 9. With the key words provided, compose an eight line poem.
_________________ 10. Describe how this poem makes you feel.
_________________ 11. Do you suppose everyone feels the same after reading that poem?
_________________ 12. What do you think caused the city to move the location of the zoo?
_________________ 13. How would the park be different today if it had been left there?
_________________ 14. Observe what happens when I pour in the second liquid.
_________________ 15. Using what you have learned about silent letters, circle all the
words in the one-page story that use silent letters.
11. Compare and contrast two works of art in terms of form, color, and texture
12. Conduct a debate about abortion
13. Name a newly discovered insect
14. Trace your own family tree
15. Which is the best brand of orange juice and why?
16. Recommend a revision in the Philippine constitution given the disaster events in the
Philippines
Criterion item: The instructor will play the melody of the attached musical score on the piano
and will make an error either in rhythm or melody. Raise your hand when the error occurs.
2. Objective: Given mathematical equations containing one unknown, be able to solve for the
unknown.
Criterion Item: Sam weighs 97 kilos. He weighs 3.5 kilos more than Barry. How much does
Barry weigh?
Criterion Item: Draw and label a sketch of the male and female reproductive systems.
4. Objective: Given any one of the computers in our product line, in its original carton, be able
to install and adjust the machine, preparing it for use. Criteria: The machine shows normal
indication, and the area is free of debris and cartons.
Criterion item: Select one of the cartons containing one of our model XX computers, and install
it for the secretary in Room 45. Make sure it is ready for use and the area is left clean.
5. Objective: When given a set of paragraphs (that use words within your vocabulary), some of
which are missing topic sentences, be able to identify the paragraph without topic sentences.
Criterion Item: Turn to page 29 in your copy of Silas Marner. Underline the topic sentence of
each paragraph on that page.
Answer key
1. P, 2. C, 3. C, 4. A, 5. P, 6. C, 7. C, 8. C, 9. A, 10. P
References
Byrd, P. A. (2002). The revised taxonomy and prospective teachers. Theory into Practice, 41(4),
244.
Ferguson, C. (2002). Using the revised taxonomy to plan and deliver team-taught, integrated,
thematic units. Theory into Practice, 41, 238.
Marzano, R. J., & Kendall, J. S. (2007). The new taxonomy of educational objectives (2nd ed.).
CA: Sage Publications Company.
Stiggins, R. & Conklin, N. (1992). In teachers’ hands: Investigating the practice of classroom
assessment. Albany, NY: SUNY Press.
Chapter 3
Characteristics of an Assessment Tool
Objectives
1. Determine the use of the different ways of establishing an assessment tool’s validity and
reliability.
2. Become familiar with the different methods of establishing an assessment tool’s validity
and reliability.
3. Assess how good an assessment tool is by determining its indices of validity, reliability,
item discrimination, and item difficulty.
Lessons
1 Reliability
Test-retest
Split-half
Parallel Forms
Internal Consistency
Inter-rater Reliability
2 Validity
Content
Criterion-related
Construct Validity
Divergent/Convergent
Lesson 1
Reliability
What makes a good assessment tool? How does one know that a test is good enough to be used?
Educational assessment tools are judged by their ability to provide results that meet the needs of
users. For example, a good test provides accurate findings about a student’s achievement if users
intend to determine achievement levels. The achievement results should remain stable across
different conditions so that they can be used over longer periods of time.
A good assessment tool should be reliable, valid, and able to discriminate traits. You
have probably encountered tests on the internet and in magazines that tell you what kind of
personality you have, your interests, and your dispositions. In order to determine these
characteristics accurately, such tests should show you evidence that they are indeed valid and
reliable. You need to be critical in selecting what test to use and consider well whether these
tests are indeed valid and reliable. There are several ways of determining how reliable and valid
an assessment tool is, depending on the nature of the variable and the purpose of the test. These
techniques involve different statistical analyses, and this chapter also provides the procedures
for their computation and interpretation.
Reliability is the consistency of scores across conditions of time, forms, tests, items,
and raters. The consistency of results in an assessment tool is determined statistically using the
correlation coefficient. You can refer to the next section of this chapter to see how a
correlation coefficient is estimated. Each type of reliability will be explained in two ways:
conceptually and analytically.
Test-retest Reliability
Test-retest reliability is the consistency of scores when the same test is administered again on
another occasion. For example, in order to determine whether a spelling test is reliable, the same
spelling test is administered again to the same students at a different time. If the scores in
the spelling test across the two occasions are the same, then the test is reliable. Test-retest is a
measure of temporal stability since the test score is tested for consistency across a time gap. The
time gap between the two testing conditions can be within a week or a month; generally it does not
exceed six months. Test-retest is more appropriate for variables that are stable, like psychomotor
skills (typing test, block manipulation tests, grip strength), aptitude (spatial, discrimination,
visual rotation, syllogism, abstract reasoning, topology, figure ground perception, surface
assembly, object assembly), and temperament (extraversion/introversion, thinking/feeling,
sensing/intuiting, judging/perceiving).
To analyze the test-retest reliability of an assessment tool, the first and second sets of
scores of a sample of test takers are correlated. The higher the correlation, the more reliable the
test.
Correlating two variables involves producing a linear relationship between the sets of scores. For
example, a 50-item aptitude test was administered to 10 students at one time. Then it was
administered again after two weeks to the same 10 students. In the resulting data, ‘student A’ got
a score of 45 on the first occasion of the aptitude test and, after two weeks, a score of 47 on the
same test. For ‘student B,’ a score of 30 was obtained on the first occasion and 33 after two
weeks. The same goes for students C, D, E, F, G, H, I, and J. The scores on the test at time 1 and
the retest at time 2 are plotted in a graph called a scatterplot below. The straight line projected
through the points is called a regression line. The closer the plots are to the regression line, the
stronger the relationship between the test and retest scores. If
their relationship is strong, then the test scores are consistent and can be interpreted as reliable.
To estimate the strength of the relationship a correlation coefficient needs to be obtained. The
correlation coefficient gives information about the magnitude, strength, significance, and
variance of the relationship of two variables.
[Scatterplot: Aptitude Test (Time 1) scores on the x-axis plotted against Aptitude Retest (Time 2)
scores on the y-axis, with students A–J falling close to the regression line.]
Different types of correlation coefficients are used depending on the level of measurement of a
variable. Levels of measurement can be nominal, ordinal, interval, and ratio. More information
about the levels of measurement is explained in the beginning chapters of any statistics book.
Most commonly, assessment data are on interval scales. For interval and ratio (continuous)
variables, the statistic that estimates the correlation coefficient is the Pearson Product Moment
correlation, or r. The r is computed using the formula:

r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}
Where
r = correlation coefficient
N = number of cases (respondents, examinees)
ΣXY = summation of the product of X and Y
ΣX = summation of the first set of scores designated as X
ΣY = summation of the second set of scores designated as Y
ΣX² = sum of the squares of the first set of scores
ΣY² = sum of the squares of the second set of scores
To obtain the parameters ΣX, ΣY, ΣX², and ΣY², a table is set up.
To obtain the value of 2115 in the 4th column (XY), simply multiply 45 by 47; 2025 in the 5th
column is obtained by squaring 45 (45², or 45 x 45); and 2209 in the last column is obtained by
squaring 47 (47², or 47 x 47). The same is done for each pair of scores in each row. The values of
ΣX, ΣY, ΣX², and ΣY² are obtained by adding up, or summating, the scores from student A to
student J. The values are then substituted in the equation for the Pearson r.
r = \frac{10(8095) - (254)(286)}{\sqrt{[10(7356) - (254)^2][10(8948) - (286)^2]}}

r = .996
A correlation coefficient of .996 indicates a very strong relationship between the aptitude test
and retest scores. Cut-off values can be used as a guide to determine the strength of the
relationship.
For significance, the test asks whether the odds favor the demonstrated relationship between X
and Y being real as opposed to being due to chance. If the odds favor the relationship being real,
then the relationship is said to be significant. Consult a statistics book for a detailed explanation
of testing for the significance of r. To test whether a correlation coefficient of .996 is significant,
it is compared with a critical value of r. The critical values for r are found in Appendix A of this
book. Assuming that the probability of error is set at an alpha level of .05 (meaning the
probability [p] is less than [<] 5 out of 100 [.05] that the demonstrated relationship is due to
chance) (DiLeonardi & Curtis, 1992), and the degrees of freedom are 8 (df = N - 2 = 10 - 2 = 8),
a critical value of .632 is obtained. The value .632 is the intersecting value in Appendix A for
df = 8 and an alpha level of .05. Significance is attained when the obtained value is greater than
the critical value. In this case, since .996 is greater than .632, there is a significant relationship
between the aptitude test and the retest scores.
The variance is interpreted as the amount of overlap between X and Y: the “percentage of the
time that the variability in X accounts for or explains the variability in Y.” It is determined by
squaring the correlation coefficient (r²). For the given data set, the variance would be
r² = .996² = .992, or 99.2 percent (.992 x 100). To interpret this value: “99.2 percent of the time,
the scores during the first aptitude test account for or explain the scores during the retest.”
Generally, a correlation coefficient of .996 indicates that the aptitude scores on the test
and the retest are highly reliable or consistent, since the value is very strong and significant.
Software is provided with this book to help you compute test-retest correlation coefficients and
the other techniques for establishing reliability and validity. A detailed demonstration of the
software is found at the end of this chapter.
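To make the computation concrete, here is a minimal Python sketch of the Pearson r formula above. This is only an illustration, not the software that accompanies this book, and the ten score pairs are hypothetical apart from students A and B, whose scores appear in the text.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from raw scores,
    following the raw-score formula given above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical test-retest scores for ten examinees; only students
# A (45, 47) and B (30, 33) are taken from the text.
time1 = [45, 30, 27, 20, 25, 22, 40, 30, 18, 32]
time2 = [47, 33, 26, 21, 27, 24, 42, 31, 17, 30]
print(round(pearson_r(time1, time2), 3))
```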
Parallel Forms Reliability
In this technique, two tests are used that are equivalent in difficulty, format, number of
items, and the specific skills measured. One of the equivalent forms is administered to the
same examinees on one occasion and the other on a different occasion. Parallel forms is both a
measure of temporal stability and of consistency of responses. Since the two tests are administered
separately across time, it is a measure of temporal stability like the test-retest. But on the second
occasion, what is administered is not the exact same test but an equivalent form of it.
Assuming that the two tests really measure the same characteristics, then there should be
consistency in the scores. Parallel forms can be used for affective and cognitive measures in
general, as long as there are available forms of the test.
Reliability is determined by correlating the scores from the first form and the second
form. In most cases, Form A of the test is correlated with Form B of the test. A strong and
significant relationship would indicate equivalence and consistency of the two forms.
Split-half Reliability
In split-half, the test is split into two parts, and the scores on each part should show
consistency. The logic behind splitting the test into two parts is to determine whether the scores
within the same test are internally consistent or homogeneous.
There are many ways of splitting the test into two halves. One is by randomly distributing
the items equally into two halves. Another is separating the odd-numbered items from the even-
numbered items. In doing split-half reliability, one ensures that the test contains a large number
of items so that several items remain in each half. The assumption is that a test with more items
is more reliable.
Split-half is analyzed by first summating the total scores for each half of the test for each
participant. The paired total scores are then correlated. A high correlation coefficient indicates
internal consistency of the responses in the test. Since only half of the test is correlated with the
other half, a correction formula called the Spearman-Brown (r_tt) is used to estimate the
reliability of the test at its full length. The formula is:

r_{tt} = \frac{2r}{1 + r}

where r is the correlation between the two halves of the test and r_tt is the estimated reliability
of the whole test.
Suppose that a test measuring aggression with 60 items was split into two halves of 30 items
each, and the computed r is .93. The Spearman-Brown coefficient would be .96. Observe that the
correlation coefficient of .93 increased to .96 when converted into the Spearman-Brown
coefficient.

r_{tt} = \frac{2(.93)}{1 + .93} = .96
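A minimal Python sketch of the split-half procedure with the Spearman-Brown correction may help. The odd-even split and the use of statistics.correlation (Python 3.10 or later) are implementation choices made here for illustration, not the software that accompanies the book.

```python
from statistics import correlation  # Python 3.10+

def spearman_brown(r_half):
    """Step the half-test correlation up to the full test length."""
    return (2 * r_half) / (1 + r_half)

def split_half_reliability(item_scores):
    """item_scores: one list of item responses per examinee.
    Totals the odd- and even-numbered items separately, correlates
    the two half-test totals, and applies the Spearman-Brown correction."""
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(correlation(odd_totals, even_totals))

# Using the half-test correlation from the aggression example:
print(round(spearman_brown(0.93), 2))  # -> 0.96
```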
Internal Consistency
Several techniques can be used to test whether the responses to the items of a test are
internally consistent. The Kuder-Richardson, Cronbach’s alpha, interitem correlation, and item-
total correlation can be used.
The Kuder-Richardson formula 20 (KR20) is used if the responses in the data are binary. Usually
it is used for tests with right or wrong answers, where correct responses are coded as “1” and
incorrect responses are coded as “0.” The KR20 formula is:

KR_{20} = \frac{k}{k - 1}\left(1 - \frac{\sum pq}{\sigma^2}\right)

To determine σ² (the variance):

\sigma^2 = \frac{\sum x^2}{N - 1}

Where
k = number of items
p = proportion of students with correct answers
q = proportion of students with incorrect answers
σ² = variance
Σx² = sum of squares (the squared deviations of the total scores from their mean)
N = number of examinees
Suppose that the following data were obtained on a 10-item math test (“1” = correct answer,
“0” = incorrect answer) among 10 students:
Student  Item1  Item2  Item3  Item4  Item5  Item6  Item7  Item8  Item9  Item10  Total(X)  X-X̄   (X-X̄)²
A        1      1      1      1      1      1      1      1      1      1       10        2.8    7.84
B        1      1      1      1      1      1      1      0      1      1       9         1.8    3.24
C        1      1      1      1      1      1      1      0      0      1       8         0.8    0.64
D        1      1      1      1      1      1      1      1      0      0       8         0.8    0.64
E        1      1      1      1      1      1      1      0      0      0       7        -0.2    0.04
F        1      1      1      1      1      1      0      0      0      1       7        -0.2    0.04
G        1      1      1      1      1      1      0      1      0      0       7        -0.2    0.04
H        1      1      1      0      0      0      1      1      1      0       6        -1.2    1.44
I        1      1      1      1      0      1      0      0      0      0       5        -2.2    4.84
J        1      1      1      1      0      0      0      0      0      1       5        -2.2    4.84
Total    10     10     10     9      7      8      6      4      3      5       X̄ = 7.2         Σ(X-X̄)² = 23.6
p        1      1      1      0.9    0.7    0.8    0.6    0.4    0.3    0.5     σ² = 2.62
q        0      0      0      0.1    0.3    0.2    0.4    0.6    0.7    0.5
pq       0      0      0      0.09   0.21   0.16   0.24   0.24   0.21   0.25    Σpq = 1.4
Computation of Variance:
Get the total score of each examinee (X), then compute the average of the ten examinees’
scores (X̄ = 7.2). Subtract the mean from each individual total score (X - X̄), then square each
of these differences, (X - X̄)². The summation of these squared differences is Σ(X - X̄)². In the
given data, Σ(X - X̄)² = 23.6 and N = 10. Substitute these values to obtain the variance.

\sigma^2 = \frac{23.6}{10 - 1} = 2.62
KR20 Computation:
With the variance computed (σ² = 2.62), the next step is to obtain the value of Σpq. This is done
by summating the total correct responses for each item (Total). This total is converted into a
proportion (p) by dividing by the total number of cases (N = 10). A total of 10, when divided by
10 (N), gives a proportion of 1. To determine q, the proportion incorrect, subtract the proportion
correct from 1. If the proportion correct is 0.9, the proportion incorrect is 0.1. Then pq is
determined by multiplying p and q. The summation of the pq values yields Σpq, which has the
value 1.4. Substitute all the values in the KR20 formula.

KR_{20} = \frac{10}{10 - 1}\left(1 - \frac{1.4}{2.62}\right)

KR_{20} = 0.52
The internal consistency of the 10-item math test is 0.52, indicating that the responses are not
highly consistent with each other.
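The KR20 computation can also be expressed compactly in code. The following minimal Python sketch reproduces the worked example above; the data matrix is the same 10-student by 10-item table.

```python
def kr20(scores):
    """KR20 for binary items; scores is one list of 0/1 responses per student."""
    k = len(scores[0])                       # number of items
    n = len(scores)                          # number of students
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / (n - 1)
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n   # proportion correct
        sum_pq += p * (1 - p)                   # p times q
    return (k / (k - 1)) * (1 - sum_pq / variance)

# The 10-student x 10-item data from the table above.
data = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],  # B
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 1],  # C
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # D
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 1],  # F
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],  # G
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],  # H
    [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],  # I
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],  # J
]
print(round(kr20(data), 2))  # -> 0.52
```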
Cronbach’s alpha also determines the internal consistency of the responses to items
in the same test. Cronbach’s alpha can be used for responses that are not limited to the binary
type, such as a five-point scale or other response formats that are expressed numerically. Usually,
tests beyond the binary type are affective measures and inventories where there are no right or
wrong answers.
Suppose that a five-item test measuring attitude towards school assignments was
administered to five high school students. Each item in the questionnaire is answered using a
5-point Likert scale (5 = strongly agree, 4 = agree, 3 = not sure, 2 = disagree, 1 = strongly disagree).
Below are the five items that measure attitude towards school assignments. Each student
selects, on a Likert scale of 1 to 5, how they respond to each of the items. Then their responses
are encoded.
The next table shows how Cronbach’s alpha is determined given the responses of the
five students. In the table, student A answered ‘5’ for item 1, ‘5’ for item 2, ‘4’ for item 3,
‘4’ for item 4, and ‘1’ for item 5. The same goes for students B, C, D, and E.
In computing Cronbach’s alpha, the variance of the students’ total scores (σt²) and the
summation of the variances of the individual items (ΣSD²) are used. Obtaining the variance of
the total scores is the same as in the Kuder-Richardson: the mean of the total scores is subtracted
from each score, each difference is squared, and the sum of the squared differences (22.8) is
divided by n - 1 (5 - 1 = 4), giving σt² = 5.7. To obtain the item variances, get the sum of all
scores per item (summating down each column in the table below, ΣX), then square each score
and summate down each column again (ΣX²). Each item has its own ΣX and ΣX². These
parameters are used to obtain the variance of each item (here n is the number of respondents):

SD^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}

After obtaining the variance of each item, summate all these variances to get ΣSD² = 5.2. The
values obtained can now be substituted in the formula for Cronbach’s alpha (here n is the
number of items):

\alpha = \frac{n}{n - 1}\left(\frac{\sigma_t^2 - \sum SD^2}{\sigma_t^2}\right)

\alpha = \frac{5}{5 - 1}\left(\frac{5.7 - 5.2}{5.7}\right)

Cronbach’s α = .10
Student   item1  item2  item3  item4  item5   Total (X)   Score-Mean   (Score-Mean)²
A         5      5      4      4      1       19           2.8          7.84
B         3      4      3      3      2       15          -1.2          1.44
C         2      5      3      3      3       16          -0.2          0.04
D         1      4      2      3      3       13          -3.2         10.24
E         3      3      4      4      4       18           1.8          3.24
ΣX        14     21     16     17     13      X̄ = 16.2                 Σ(Score-Mean)² = 22.8
ΣX²       48     91     54     59     39      σt² = 22.8 / (5 - 1) = 5.7
SD²       2.2    0.7    0.7    0.3    1.3     ΣSD² = 5.2
The internal consistency of the responses to the attitude towards school assignments scale is .10, indicating low internal consistency.
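The same logic can be expressed in code. The following minimal Python sketch reproduces the attitude example above; note that it uses sample variances (dividing by n - 1), as in the worked computation.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha; scores is one list of numeric item responses
    per student. Sample variances (n - 1) are used throughout."""
    n_items = len(scores[0])

    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    total_var = var([sum(row) for row in scores])        # sigma_t squared
    item_var_sum = sum(var([row[i] for row in scores])   # sum of SD squared
                       for i in range(n_items))
    return (n_items / (n_items - 1)) * ((total_var - item_var_sum) / total_var)

# The 5-student x 5-item attitude data from the table above.
data = [
    [5, 5, 4, 4, 1],  # A
    [3, 4, 3, 3, 2],  # B
    [2, 5, 3, 3, 3],  # C
    [1, 4, 2, 3, 3],  # D
    [3, 3, 4, 4, 4],  # E
]
print(round(cronbach_alpha(data), 2))  # -> 0.11 (the text rounds to .10)
```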
Interitem Correlation
In interitem correlation, each item is correlated with every other item in the test. Notice that a
perfect correlation coefficient (1.00) is obtained when an item is correlated with itself. It can
also be noted that strong correlation coefficients were obtained between items 1 and 3 and items
1 and 4, indicating internal consistency. Some pairs had negative correlations, like items 1 and 5,
and items 2 and 5. A negative correlation means that as the scores on one item increase, the
scores on the other decrease.
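As an illustration, an inter-item correlation matrix can be computed by correlating every item with every other item. The sketch below reuses the five-item attitude data from the alpha example and requires Python 3.10 or later for statistics.correlation.

```python
from statistics import correlation  # Python 3.10+

# The same 5-student x 5-item attitude data as in the alpha example.
data = [
    [5, 5, 4, 4, 1],
    [3, 4, 3, 3, 2],
    [2, 5, 3, 3, 3],
    [1, 4, 2, 3, 3],
    [3, 3, 4, 4, 4],
]
items = list(zip(*data))  # transpose: one tuple of responses per item
for i, x in enumerate(items, start=1):
    # Each row pairs item i with every item, including itself (r = 1.0).
    row = [round(correlation(x, y), 2) for y in items]
    print(f"item{i}:", row)
```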
Interrater Reliability
When rating scales are used by judges, the responses can also be tested for consistency.
The concordance or consistency of the ratings is estimated by computing Kendall’s W
coefficient of concordance.
Suppose that the following thesis presentation ratings were obtained from three judges
for 5 groups who presented their theses. The ratings are on a scale of 1 to 4, where 4 is the
highest and 1 is the lowest.
The concordance among the three raters using Kendall’s W is computed by summating
the total ratings for each case (thesis presentation). The mean of the sums of ratings is obtained
(X̄ = 8.4). The mean is then subtracted from each sum of ratings to get the difference (D). Each
difference is squared (D²), and the sum of squares is computed (ΣD² = 33.2). These values can
now be substituted in the Kendall’s W formula, in which m is the number of raters and N is the
number of cases.
W = \frac{12\sum D^2}{m^2 N(N^2 - 1)}

W = \frac{12(33.2)}{3^2(5)(5^2 - 1)}

W = 0.37
A Kendall’s W coefficient of .37 estimates the agreement of the three raters on the 5 thesis
presentations. Given this value, there is moderate concordance among the three raters, because
the value is not very high.
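A minimal Python sketch of the Kendall’s W computation follows. The five summed ratings are hypothetical values chosen to be consistent with the aggregates reported in the text (mean = 8.4, ΣD² = 33.2), since the original rating table is not reproduced here.

```python
def kendalls_w(rating_sums, n_raters):
    """Kendall's coefficient of concordance computed from each case's
    summed ratings across raters."""
    n = len(rating_sums)
    mean = sum(rating_sums) / n
    ss_d = sum((s - mean) ** 2 for s in rating_sums)   # sum of D squared
    return (12 * ss_d) / (n_raters ** 2 * n * (n ** 2 - 1))

# Hypothetical sums of three judges' ratings for five theses, chosen to
# match the aggregates in the text (mean = 8.4, sum of D squared = 33.2).
sums = [12, 10, 9, 6, 5]
print(round(kendalls_w(sums, n_raters=3), 2))  # -> 0.37
```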
Summary on Reliability
Activity:
Test whether the typing test is reliable. The following are the scores of 15 participants on a
typing test; use test-retest reliability.
Activity 2
Administer the “Academic Self-regulation Scale” to at least 30 students, then obtain its internal
consistency using split-half, Cronbach’s alpha, and interitem correlation.
Self-regulation Scale
Instruction: The following items assess your learning and study strategy use. Read each item carefully and
RESPOND USING THE SCALE PROVIDED. Encircle the number that corresponds to your answer.
4: Always 3: Often 2: Rarely 1: Never
Before answering the items, please recall some typical situations of studying which you have experienced. Kindly
encircle the number showing how you practice the following items.
Further Analysis
1. Show the Cronbach’s alpha for each factor and indicate whether the responses are
internally consistent.
2. Split the test into two then indicate whether the responses are internally consistent.
3. Intercorrelate each item.
Lesson 2
Validity
Content Validity
Content validity is the systematic examination of the test content to determine whether it
covers a representative sample of the behavior domain to be measured. For affective measures, it
concerns whether the items are enough to manifest the behavior measured. For cognitive tests, it
concerns whether the items cover all the content specified in the instruction.
Content validity is more appropriate for cognitive tests like achievement tests and teacher-
made tests. In these types of tests, there is a specified domain to be included in the test. The
content covered is found in the instructional objectives in the lesson plan, syllabus, table of
specifications, and textbooks.
Content validity is established through consultation with experts. In the process, the
objectives of the instruction, the table of specifications, and the items of the test are shown to the
consulting experts. The experts check whether the items are enough to cover the content of the
instruction provided, whether the items measure the objectives set, and whether the items are
appropriate for the cognitive skill intended. The process also involves checking whether the
items are appropriately phrased for the level of students who will take the test and whether the
items are relevant to the subject area tested, and correcting them where needed.
Details on constructing a Table of Specifications are explained in the next chapters.
Criterion-Prediction Validity
Criterion-prediction involves prediction from the test to a criterion situation over a time
interval. For example, to assess the predictive validity of an entrance exam, it is correlated
later with the students’ grades after a trimester/semester. The criterion in this case is the
students’ grades, which will come in the future.
Criterion-prediction is used in hiring job applicants, selecting students for admission to
college, and assigning military personnel to occupational training programs. For selecting job
applicants, pre-employment tests are correlated with supervisor ratings obtained in the future. In
assigning military personnel to training, the aptitude test administered before training is
correlated with the future post-assessment in the training. Positive and high correlation
coefficients should be obtained in these cases to adequately say that the test has predictive
validity.
Generally, the analysis involves correlating the test score with another criterion measure;
an example is correlating mechanical aptitude with job performance as a machinist.
Construct Validity
Construct validity is the extent to which a test may be said to measure a theoretical
construct or trait. It is usually conducted for measures that are multidimensional or contain
several factors. The goal of construct validation is to explain and prove that the factors of the
measure hold as the underlying theory claims.
There are several methods for analyzing the constructs of a measure. One way is to
correlate a new test with a similar earlier test that measures approximately the same general
behavior. For example, a newly constructed measure of temperament is correlated with an
existing measure of temperament. If high correlations are obtained between the two measures, it
means that the two tests are measuring the same constructs or traits.
Another widely used technique to study the factor structure of a test is factor analysis,
which can be exploratory or confirmatory. Factor analysis is a mathematical technique for
identifying sources of variation among the constructs involved. These sources of variation are
usually called factors or components (as explained in chapter 1). Factor analysis reduces the
number of variables, detects the structure in the relationships between variables, and classifies
variables. A factor is a set of highly intercorrelated variables. When Principal Components
Analysis is used as the method of factor analysis, the process involves extracting the possible
groups that can be formed through the eigenvalues, which measure how much variance each
successive factor extracts. The first factor is generally more highly correlated with the variables
than the second factor. This is to be expected because the factors are extracted successively and
account for less and less variance overall. Factor extraction stops when factors begin to yield
low eigenvalues. An example of the extraction showing eigenvalues is illustrated below from the
study of Magno (2008), who developed a scale measuring parental closeness with 49 items in
which four factors were hypothesized (bonding, support, communication, interaction).
[Scree plot: “Plot of Eigenvalues,” showing each successive eigenvalue (y-axis, from 0 to about
16) against the number of eigenvalues (x-axis).]
The scree plot shows that 13 factors could be used to classify the 49 items, since the
number of factors can be determined by counting the eigenvalues that are greater than 1.00. But
having 13 factors is not good because it does not further reduce the variables. One technique in
the scree test is to find the place where the smooth decrease of eigenvalues appears to level off
to the right of the plot. To the right of this point, presumably, one finds only "factorial scree" -
"scree" is the geological term for the debris that collects on the lower part of a rocky slope.
Applying this technique, the fourth eigenvalue is where the smooth decrease begins in the graph.
Therefore, four factors can be considered for the test.
The items that belong under each factor are determined by assessing the factor
loadings of each item. Each item loads on each factor extracted. An item that loads highly on a
factor technically belongs to that factor because it is highly correlated with the other items in
that factor or group. A factor loading of .30 means that the item contributes meaningfully to the
factor; a factor loading of .40 means the item contributes highly to the factor. An example of a
table with factor loadings is illustrated below.
Item      Factor 1   Factor 2   Factor 3   Factor 4
item1 0.032 0.196 0.172 0.696
item2 0.13 0.094 0.315 0.375
item3 0.129 0.789 0.175 0.068
item4 0.373 0.352 0.35 0.042
item5 0.621 -0.042 0.251 0.249
item6 0.216 -0.059 0.067 0.782
item7 0.093 0.288 0.307 0.477
item8 0.111 0.764 0.113 0.085
item9 0.228 0.315 0.144 0.321
item10 0.543 0.113 0.306 -0.01
In the table above, items that load highly on a factor have loadings of .40 and above. For
example, item 1 loaded highly on factor 4, with a factor loading of .696, as compared with its
loadings of .032, .196, and .172 on factors 1, 2, and 3, respectively. This means that item 1 will
be classified under factor 4, together with item 6 and item 7, because they all load highly on the
fourth factor. Factor loadings are best assessed when the items are rotated (consult scaling
theory references for details on factor rotation).
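To illustrate where the eigenvalues in a scree plot come from, the following minimal Python sketch extracts eigenvalues from an item intercorrelation matrix and applies the Kaiser criterion (eigenvalues greater than 1.00). The 200-examinee by 20-item random data matrix is only a stand-in for real item responses; this is not the procedure used in the cited study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 200 examinees answering 20 five-point items at random.
responses = rng.integers(1, 6, size=(200, 20))

corr = np.corrcoef(responses, rowvar=False)   # 20 x 20 item correlations
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted, largest first

# Kaiser criterion: count the factors whose eigenvalues exceed 1.00,
# the same rule applied to the scree plot in the text.
n_factors = int(np.sum(eigenvalues > 1.0))
print(eigenvalues.round(2))
print("factors retained:", n_factors)
```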
Another way of proving the factor structure of a construct is through Confirmatory Factor
Analysis (CFA). In this technique, there is a developed and specific hypothesis about the
factorial structure of a battery of attributes. The hypothesis concerns the number of common
factors, their pattern of intercorrelation, and the pattern of common factor weights. It is used to
indicate how well a set of data fits the hypothesized structure. The CFA is done as a follow-up to
a standard factor analysis. In the analysis, the parameters of the model are estimated, and the
goodness of fit of the solution to the data is evaluated. For example, the study of Magno
(2008) confirmed the factor structure of parental closeness (bonding, support, communication,
succorance) after a series of principal components analyses. The parameter estimates and the
goodness of fit of the measurement model were then analyzed.
Figure 1
Measurement Model of Parental Closeness using Confirmatory Factor Analysis
The model estimates in the CFA show that all the factors of parental closeness have
significant parameters (8.69*, 5.08*, 5.04*, 1.04*). Delta errors are used (28.83*, 18.02*,
18.08*, 2.58*), and each factor has a significant estimate as well. Having a good fit reflects
having all factor structures significant for the construct parental closeness. The goodness of fit
using chi-square is rather good (χ² = 50.11, df = 2). The goodness of fit based on the root
mean square standardized residual (RMS = 0.072) shows relatively small error, the value being
close to zero. Using noncentrality fit indices, the values show that the four-factor solution has a
good fit for parental closeness (McDonald Noncentrality Index = 0.910, Population Gamma
Index = 0.914).
Confirmatory Factor Analysis can also be used to assess the best factor structure of a
construct. For example, Magno, Tangco, and Sy (2007) assessed the factor structure of
metacognition (awareness of one’s own learning) and its effect on critical thinking (measured
by the Watson-Glaser Critical Thinking Appraisal). Two factor structures of metacognition were
assessed. The first model of metacognition includes two factors: regulation of cognition and
knowledge of cognition (see Schraw and Dennison). The second model tested metacognition
with eight factors: declarative knowledge, procedural knowledge, conditional knowledge,
planning, information management, monitoring, debugging strategy, and evaluation of learning.
The results of the analysis using CFA showed that model 1 has a better fit than model 2.
This indicates that metacognition is better viewed with two factors (knowledge of cognition and
regulation of cognition) than with eight factors.
Principal Components Analysis and Confirmatory Factor Analysis can be conducted
using available statistical software such as Statistica and SPSS.
Convergent and Divergent Validity
According to Anastasi and Urbina (2002), the method of convergent and divergent
validity is used to show that a test correlates with the variables with which it should theoretically
correlate (convergent) and does not correlate with the variables from which it should differ
(divergent).
In convergent validity, the intercorrelations among constructs that the theory links
together should be high and positive. For example, in the study of Magno (2008) on parental
closeness, when the factors of parental closeness were intercorrelated (bonding, support,
communication, and succorance), positive magnitudes were obtained, indicating convergence of
these constructs.
For divergent validity, a construct should correlate inversely with its opposite factors. For
example, the study by Magno, Lynn, Lee, and Kho (in press) constructed a scale that measures
mothers’ involvement with their grade school and high school children. The factors of mothers’
involvement in school-related activities were intercorrelated. Observe that although these factors
belong to the same test, controlling was negatively related to permissive, and loving was
negatively related to autonomy. This indicates divergence of factors within the same measure.
Summary on Validity
EMPIRICAL REPORT
The Development of the Self-disclosure Scale

Carlo Magno
Sherwin Cuason
Christine Figueroa
De La Salle University-Manila

Abstract
The purpose of the present study is to develop a measure for self-disclosure. The items were
based on a survey administered to 83 college students. From the survey, 114 items were
constructed under 9 hypothesized factors. The items were reviewed by experts. The main try out
form of the test was composed of 112 items administered to 100 high school and college
students. The data analysis showed that the test has a Cronbach’s alpha of .91. The factor
loadings retained 60 items with high summated correlations under five factors. The new factors
are beliefs, relationships, personal matters, interests, and intimate feelings.

Each person has a complex personality system. Individuals are oftentimes very much
interested in knowing our personality type, attitudes, interests, aptitude, achievement and
intelligence. This is the reason why we should develop a psychological test that would help us
assess our standing. The test we have developed aims to measure the self-disclosing frequency
of individuals in different areas. This will help them know what areas in their lives they are
willing to let other people know. This would be a good instrument for counselors to use for the
assessment of their clients. The result of the client’s test would help the counselor adjust his or
her skills in eliciting or disclosing more or other areas or other topics.
Self-disclosure is a very important aspect in the counseling process, because self-disclosure
is one of the instruments the counselor can use. The consequence of the client not disclosing
himself is their inability to respond to their problem and to the counselor. This is what the
researchers took into consideration in developing the test. It could also be used outside the
counseling process. An individual may want to take it to find out what areas in his or her life
have been easy for them to shell out and what areas need more revelations.
It has always been psychologists’ concern to explain what is going on inside a particular
individual in relation to his entire system of personality. One important component of looking
into the intrinsic phenomenon of human behavior is self-disclosure. Self-disclosure as defined by
Sidney Jourard (1958) is the process of making the self known to other persons; “target persons”
are persons to whom information about the self is communicated. In the process of
self-disclosure we make ourselves manifest in thinking and feeling through our actions - actions
expressed verbally (Chelune, Skiffington, & Williams, 1981). In addition, Hartley (1993)
stressed the importance of interpersonal communication in disclosing the self. Hartley (1993)
defined self-disclosure as the means of opening up about oneself with other people. Moreover,
Norrel (1989) defined self-disclosure as the process by which persons make themselves known
to each other, occurring when an individual communicates genuine thoughts and feelings.
Generally, self-disclosure is the process in which a person is willing to share or open
oneself to another person or group whom the individual can trust. This process is done verbally.
The factors identified in self-disclosure, which are potent areas in the content of communicating
superficial or intimate topics, are (1) Personal matters, (2) Thoughts and ideas, (3) Religion,
(4) Work, study, and accomplishments, (5) Sex, (6) Interpersonal relationship, (7) Emotional
state, (8) Tastes, and (9) Problems.
The process of self-disclosure occurs during interaction with others (Chelune, Skiffington,
& Williams, 1981). In the studies that Jourard (1961; 1969) conducted, he stated that a person
will permit himself to be known when “he believes his audience is a man of goodwill.” There
should be a guarantee of privacy that the
new clusters: interpersonal relationship, attitudes, sex, and tastes. These clusters contain items on
sensitive information one withholds. The self-disclosure reports are only moderately reliable
(.62 to .72 for men and .51 to .78 for women).
In marital relationships, it was found that partners have greater self-disclosure and marital
satisfaction (Levinger & Senn, 1967; Jorgensen, 1980).
In the parent-child relationship, it was reported that there are no differences in the content
of the self-disclosure of Filipino adolescents with their mother and father (Cruz, Custodio, &
Del Fierro, 1996). The study also indicated that birth order is highly relevant in analyzing the
content of self-disclosure. The results of the study also show that children are more disclosing
toward their mothers because they empathize.

Sex. One of the most intimate topics as a content in self-disclosure is sex. It is usually
embarrassing and hard to open up to others because some people have the faulty learning that it
is evil, lustful, and dirty (Coleman, Butcher, & Carson, 1980). But mature individuals view
human sexuality as a way of being in the world of men and women whose moments of life and
every aspect of living are spent to experience being with the entire world in a distinctly male or
female way (Maningas, 1995). Furthermore, sexuality is part of our natural power or capacity to
relate to others. It gives the necessary qualities of sensitivity, warmth, mental respect in our
interpersonal relationships, and openness (Maningas, 1995). Sexuality, as part of our
relationships, needs to be opened up or expressed, as Freud noted of the desire of our instinct or
id. Maningas (1995) stressed that sex is an integral part of our personal self-expression and our
mission of self-communication to others. Some findings by Jourard (1964) on subject matter
differences noted that details about one’s sex life are not as disclosable as other factors. Jourard
(1964) also noted that anyone who is reluctant to be known by another person and to know
another person - sexually and cognitively - will find the prospect terrifying.
Sex as a factor in self-disclosure is included because most closely knitted adolescents give
a focal view on sex. The survey study that was conducted shows that 5.26% of males and 3.44%
of females disclose themselves regarding sexual matters.

Personal matters about the self. Personal matters consist of private truths about oneself;
they may be favorable or unfavorable evaluative reactions toward something or someone,
exhibited in one’s beliefs, feelings, or intended behavior.
In an experiment conducted by Taylor, Gould, and Brounstein (1981), they found that the
level of intimacy of the disclosure was determined by (1) dispositional characteristics,
(2) characteristics of subjects, and (3) the situation. Their personalistic hypothesis was confirmed
that the level of disclosure affects the level of intimacy. Some studies also show that some
individuals are more willing to disclose personal information about themselves to high-disclosing
rather than low-disclosing others (Jourard, 1959; Jourard & Landsman, 1960; Jourard &
Richman, 1963; Altman & Taylor, 1973). Furthermore, Jones and Archer (1976) showed directly
that the recipient’s attraction towards a discloser would be mediated by the personalistic
attribution the recipient makes for the discloser’s level of intimacy.
Kelly and McKillop (1996) in their article stated that “choosing to reveal personal secrets
is a complex decision that could have distorting consequences, such as being rejected and
alienated from the listener.” But Jourard (1971) noted that a healthy behavior feels “right” and
should produce growth and integrity. Thus, disclosing personal matters about oneself is a means
of being honest and seeking others to understand you better.

Emotional State. One of the factors of self-disclosure, defined as one’s revelation of
emotions or feelings to other people. A retrospective study was conducted to determine what
students did to make their developing romantic relationships known to social network members
and what they did to keep their relationships from becoming known. It is shown in this study that
the most frequent reasons for revelation were a felt obligation to reveal based on the relationship
with the target, the desire for emotional expression, and the desire for psychological support
from the target. The most frequent reason to withhold information was the anticipation of a
negative reaction from the target (Baxter, 1993). The researchers felt that the determination of
the probability of self-disclosure will be a lot better if emotional state is considered as a factor.
Emotions, disclosures & health addresses some of the basic issues of psychology and
psychotherapy: how people respond to emotional upheavals, why they respond the way they do,
and why translating emotional events into language increases physical and mental health
(Pennebaker, 1995).

Taste. Taste is defined as the likes and dislikes of a person opened up to other people. In a
study made by Rubin and Shenker (1975), they made a test studying the friendship, proximity,
and self-disclosure of college students in the contexts of being roommates or hallmates. The
items were categorized in four clusters, in what we thought would be ascending order of
intimacy - tastes, attitudes, interpersonal relationships, and self-concept and sex. This would help
us determine whether people are willing to share superficial information right away as well as
intimate information.

Thoughts. Thoughts are defined as the things in mind that one is willing to share with
other people. “A friend,” Emerson wrote, “is a person with whom I may be sincere. Before him I
may think aloud.” A large number of studies have documented the link between friendship and
the disclosure of personal thoughts and feelings that Emerson’s statement implies (Rubin &
Shenker, 1975). Another study presents a self-psychological rationale for the selected use of
therapist self-disclosure, the conscious sharing of thoughts, feelings, attitudes, or experiences
with a patient (Goldstein, 1994).

Religion. We operationally defined religion in self-disclosure as the ability of an
individual to share his experiences, thoughts, and emotions toward his beliefs about God. Healey
(1990) provided an overview of the role of self-disclosure in Judeo-Christian religious
experience with emphasis on the process of spiritual direction. In the study done by Kroger
(1994), he shows the Catholic confession as the embodiment of common sense regarding the
social management of personal secrets, of the sins committed, and considers confession as a
model for understanding the problem of the social transmission of personal secrets in everyday
life. Religion is very important and considered as a factor in self-disclosure because the Filipino
people are very religious, and study shows that religious people disclose more (Kroger, 1994).

Problems. When a person is depressed, he tends to find others who will listen and with
whom he can share the problem. To release the tension a person feels, he usually discloses it.
Clarity about a problem is attained when people start to verbalize it, and in the process a solution
can be reached. The study of Rime (1995) revealed that after major negative life events and
traumatic emotional episodes, ordinary emotions, too, are commonly accompanied by intrusive
memories and the need to talk about the episode. It also considered the hypothesis that such
mental rumination and social sharing would represent spontaneously initiated ways of processing
emotional information.

Work/Study. Work or study is defined as the person’s present duty or responsibility which
is expected of him and needs to be fulfilled in a given time. It is considered a factor in
self-disclosure because this will give a glimpse of how open a person can be in sharing his joy
and burden
Table 3
Factor Transformation Matrix

           Factor 1   Factor 2   Factor 3   Factor 4   Factor 5
Factor 1     .48       -.56       -.25       -.001      -.61
Factor 2     .45        .43       -.56       -.49        .19
Factor 3     .45        .55        .57        .09       -.39
Factor 4     .40       -.42        .49       -.34        .52
Factor 5     .41        .02       -.21        .79        .39

The five new factors were given new names because their contents were different. Factor 1 was labeled Beliefs with 11 items, Factor 2 was labeled Relationships with 13 items, Factor 3 was labeled Personal Matters with 13 items, Factor 4 was labeled Intimate Feelings with 13 items, and Factor 5 was labeled Interests with 10 items.

Table 4
New Table of Specifications

Factor                        Number of items   Item numbers                                          Reliability
Factor 1: Beliefs                   11          8, 101, 18, 20, 33, 52, 59, 70, 77, 98, 3               .8031
Factor 2: Relationships             13          105, 15, 21, 24, 31, 41, 48, 55, 61, 63, 79, 84, 88     .7696
Factor 3: Personal Matters          13          11, 111, 53, 65, 66, 68, 75, 76, 93, 94, 95, 96, 99     .7962
Factor 4: Intimate Feelings         13          1, 6, 26, 27, 28, 32, 34, 35, 39, 43, 72, 73, 78        .7922
Factor 5: Interests                 10          10, 100, 104, 17, 56, 60, 62, 69, 82, 83                .7979

Discussion

At first, there were nine hypothesized factors based on a survey. Eighteen factors were then extracted with eigenvalues greater than 1.00. Finally, there was a final set of five factors with acceptable factor loadings. The five factors have new labels because the items were rotated differently based on the data from the main tryout. Factor 1 contains items about beliefs on religion and ideas on a particular topic, and it is labeled as such. Factor 2 contains items reflecting relationships with friends, and it was labeled "Relationships." Factor 3 contains items about a person's secrets and attitudes; most of the items contain personal matters, and it was labeled as such. Factor 4 is a cluster of taste and perceptions, so it was labeled Interests. Factor 5 contains feelings about oneself, problems, love, success, and frustrations, so it was labeled Intimate Feelings. The factors were reliable given their alphas of .8031, .7696, .7962, .7922, and .7979. This shows that each factor is consistent with the intended purpose of the researchers. In the result of the factor analysis, the items were not equal in each factor: Factor 1 has 11 items, Factor 2 has 13 items, Factor 3 has 13 items, Factor 4 has 10 items, and Factor 5 has 13 items. The five factors account for the areas in which a particular individual self-discloses.

There were nine hypothesized factors; all of these were disproved, and new factors arrived after the factor analysis. The items were reclassified in every factor and given new names. Only five factors were accepted following the four percent rating of the eigenvalue. These factors are Beliefs, Relationships, Interests, Personal Matters, and Intimate Feelings. The test we developed was intended to measure the degree of self-disclosure of individuals, but it was refocused to measure the self-disclosure each person makes in each of the different areas or factors.
Chelune, G. J., Skiffington, S., & Williams, C. (1981). Multidimensional analysis of observers' perceptions of self-disclosing behavior. Journal of Personality and Social Psychology, 41(3), 599-606.

Coleman, C., Butcher, A., & Carson, C. (1980). Abnormal psychology and modern life (6th ed.). New York: JMC.

Cozby, P. C. (1973). Self-disclosure: A literature review. Psychological Bulletin, 79(2), 73-91.

Goldstein, J. H. (1994). Toys, play, and child development. New York, NY: Cambridge University Press.

Hartley, P. (1993). Interpersonal communication. Florence, KY: Taylor & Francis/Routledge.

Healey, B. J. (1990). Self-disclosure in religious spiritual direction: Antecedents and parallels to …

Jourard, S. M. (1959). Healthy personality and self-disclosure. Mental Hygiene, 43, 499-507.

Jourard, S. M. (1961). Religious denomination and self-disclosure. Psychological Reports, 8, 446.

Jourard, S. M., & Jaffe, P. E. (1970). Influence of an interviewer's disclosure on the self-disclosing behavior of interviewees. Journal of Counseling Psychology, 17(3), 252-257.

Jourard, S. M., & Landsman, M. J. (1960). Cognition, cathexis, and the dyadic effect in men's self-disclosing behavior. Merrill-Palmer Quarterly, 6, 178-185.

Jourard, S. M., & Rubin, J. E. (1968). Self-disclosure and touching: A study of two modes of interpersonal encounter and their inter-relation. Journal of Humanistic Psychology, 8(1), 39-48.
Exercise
Give the best type of reliability and validity to use in the following cases.
___________________5. The scores on the depression diagnostic scale were correlated with the
Minnesota Multiphasic Personality Inventory (MMPI). It was found that clients who are
diagnosed as depressive have high scores on the factors of the MMPI.
___________________6. Mike's mental ability scores, taken during his fourth year of high
school, were used to determine whether he is qualified to enter the college where he wants to
study.
___________________7. Maria, who went for drug rehabilitation, was assessed using the self-
concept test, and her records from the company where she worked, which contained her previous
security scale scores, were requested. The two sets of scores were compared.
___________________8. Mrs. Ocampo, a math teacher, constructs a table of specifications before
preparing her test, and after the items are written, the test is checked by her subject area coordinator.
___________________13. In a battery of tests, the section A class received both the Strong
Vocational Interest Blank (SVIB) and the Jackson Vocational Interest Survey (JVIS). Both are
measures of vocational interest, and the scores are correlated to determine if they measure the
same construct.
___________________14. The Work Values Inventory (WVI) was separated into two forms, and
two sets of scores were generated. The two sets of scores were correlated to see if they measure
the same construct.
___________________17. When the EPPS items were presented in a free choice format, the
scores correlated quite highly with the scores obtained with the regular forced-choice form of
the test.
___________________18. The two forms of the MMPI (the Form F and Form K scales) were
correlated to detect faking or response sets.
Lesson 3
Item Analysis: Item Difficulty and Item Discrimination
Students are usually keen to judge whether an item is difficult or easy and whether a test is good
or bad based on their own judgment. How easy or difficult a test item is, is referred to as item
difficulty, and how well an item separates good from poor test takers is referred to as item
discrimination. Identifying a test item's difficulty and discrimination is referred to as item
analysis. Two approaches to item analysis will be presented in this chapter: Classical Test
Theory (CTT) and Item Response Theory (IRT). A detailed discussion of the difference between
CTT and IRT is found at the end of Lesson 3.
Regarded as the "True Score Theory." Responses of examinees are due only to variation in the ability of
interest. All other potential sources of variation existing in the testing materials, such as external
conditions or internal conditions of examinees, are assumed either to be constant through rigorous
standardization or to have an effect that is nonsystematic or random by nature. The focus of CTT is the
frequency of correct responses (to indicate question difficulty); the frequency of responses (to examine
distracters); and the reliability of the test and item-total correlation (to evaluate discrimination at the item
level).
Synonymous with latent trait theory, strong true score theory, or modern mental test theory. It is most
applicable to tests with right and wrong (dichotomous) responses. It is an approach to testing based
on item analysis that considers the chance of getting particular items right or wrong. In IRT, each item on a
test has its own item characteristic curve that describes the probability of getting the particular item
right or wrong given the ability of the test takers (Kaplan & Saccuzzo, 1997).
Item difficulty is the percentage of examinees responding correctly to each item in the
test. Generally, an item is difficult if a large percentage of the test takers are not able to
answer it correctly. On the other hand, an item is easy if a large percentage of the test takers are
able to answer it correctly (Payne, 1992).
2. Identify the high- and low-scoring groups by getting the upper 27% and the lower 27%. For
example, if there are 20 test takers, 27% of the test takers is 5.4; rounding it off gives 5 test
takers. This means that the top 5 (high-scoring) and the bottom 5 (low-scoring) test takers will be
included in the item analysis.
3. Tabulate the correct and incorrect responses of the high and low test takers for each item. For
example, in the table below there are 5 test takers in the high group (test takers 1 to 5) and 5 test
takers in the low group (test takers 6 to 10). Test takers 1 and 2 in the high group got a correct
response for items 1 to 5. Test taker 3 was wrong in item 5, marked as "0."

4. Get the total correct responses for each item and convert the total into a proportion. The proportion is
obtained by dividing the total correct responses for each item by the total number of test takers in
the group. For example, in item 2, 4 is the total correct response, and dividing it by 5, the total
number of test takers in the high group, gives a proportion of .8. The procedure is done for both the high
and the low group.
5. Obtain the item difficulty by adding the proportion of the high group (pH) and the proportion of
the low group (pL) and dividing by 2 for each item.

Item difficulty = (pH + pL) / 2
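Using item 2 from the example above, where the proportions correct are pH = .80 and pL = .60, the index of difficulty is (.80 + .60) / 2 = .70, which falls in the average range.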
The table below is used to interpret the index of difficulty. Given the table below, items 1 and 2
are easy items because they have high correct-response proportions for both the high and low groups.
Items 3 and 4 are average items because the proportions are within the .25 to .75 middle bound.
Item 5 is a difficult item, considering that the proportions correct are low for both the high and low
groups. In the case of item 5, only 40% were able to answer in the high group and none got it
correct in the low group (0). Generally, the closer the index of difficulty gets to "0," the more
difficult the item; the closer it gets to "1," the easier it becomes.
6. Obtain the item discrimination by getting the difference between the proportion of the high
group and the proportion of the low group for each item.

Item discrimination = pH − pL
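For item 2 again, the index of discrimination is .80 − .60 = .20; as discussed below, this small difference makes the item only reasonably good.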
The table below is used to interpret the index of discrimination. Generally, the larger the difference
between the proportions of the high and low groups, the better the item, because a large gap shows
that many more of the high group than of the low group answered correctly, as shown by items 1, 3, 4, and
5. In the case of item 2, a large proportion of the low group (60%) got the item correct as
contrasted with the high group (80%), resulting in a small difference (20%) that makes the item
only reasonably good.
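The steps above are easy to script. Below is a minimal sketch (not from the book) that computes both indices from a scored response matrix; the function name and data layout are assumptions for illustration.

```python
def item_analysis(responses):
    """Item difficulty and discrimination from a scored response matrix
    (one row per test taker, 1 = correct, 0 = incorrect)."""
    n = len(responses)
    k = max(1, round(n * 0.27))  # size of the upper and lower 27% groups
    # Rank test takers by total score, highest first
    ranked = sorted(responses, key=sum, reverse=True)
    high, low = ranked[:k], ranked[-k:]
    results = []
    for item in range(len(responses[0])):
        p_high = sum(row[item] for row in high) / k  # proportion correct, high group
        p_low = sum(row[item] for row in low) / k    # proportion correct, low group
        difficulty = (p_high + p_low) / 2            # index of difficulty
        discrimination = p_high - p_low              # index of discrimination
        results.append((difficulty, discrimination))
    return results
```

For 20 test takers, `round(20 * 0.27)` gives the 5-person groups used in the worked example above.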
What cognitive skill is demonstrated in the objective “Students will compose a five paragraph
essay about their reflection on modern day heroes”?
a. Understanding
b. Evaluating
c. Applying
d. Creating
Correct answer: d
The distracters in the given item are all cognitive skills in Bloom's revised taxonomy; each is a
possible answer, but there is one best answer. In analyzing whether the distracters are
effective, the frequency of examinees selecting each option is reported.
For the given item with the correct answer of letter d, the majority of the examinees in both the
high and low groups preferred option "d," which is the correct answer. Among the high group,
distracters a, b, and c are not effective because very few examinees selected them. For the low
group, option "b" can be considered an effective distracter because 40% (6 examinees) selected it
as their answer, as opposed to the 47% (7 examinees) who got the correct answer. In this case,
distracters "a" and "c" need some revision to bring them closer to the answer and make them
more attractive to test takers.
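A distracter analysis is just a tally of option choices within each group. The sketch below is an illustration; the response strings are invented, chosen only to reproduce the low-group percentages reported above.

```python
from collections import Counter

def distracter_analysis(choices_high, choices_low, options="abcd"):
    """Tally how often each option was chosen in the high and low groups."""
    high, low = Counter(choices_high), Counter(choices_low)
    for option in options:
        print(f"option {option}: high {high[option]:2d}  low {low[option]:2d}")

# 15 examinees per group; 'd' is keyed as correct
distracter_analysis(list("ddddabdddddddcd"), list("dbbddbbadcbbddd"))
```

In this invented data, 7 of 15 (47%) in the low group chose "d" and 6 of 15 (40%) chose "b," matching the pattern described in the text.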
EMPIRICAL REPORT
Construction and Development of a Test Instrument for Grade 3 Social Studies

Carlo Magno

Abstract

This study investigated the psychometric properties and item analysis of a one-unit test in geography for grade three students. The skills and contents of the test were based on the contents covered in the first quarter as indicated in the syllabus. A table of specifications was constructed to frame the items into three cognitive skills: knowledge, comprehension, and application. The test has a total of 40 items across 10 different test types. The items were reviewed by a social studies teacher and an academic coordinator. The split-half reliability was used, and a correlation of .3 was obtained. The test types were correlated with one another, resulting in both low and high coefficients. The item analysis showed that most of the items turned out to be easy and most are good items.

The purpose of this study is to construct and analyze the items of a one-unit geography test for grade three students. The test basically measures grade three students' achievement in Philippine Geography for the first quarter and served as a quarterly test. The test, when standardized through validation and reliability, would be used as a future achievement test in Philippine Geography.

There is a need to construct and standardize a particular achievement test in Philippine Geography since none is yet available locally.

The test is in the Filipino language because of the nature of the subject. The subject covers topics on (1) Kapuluan ng Pilipinas; (2) Malalaki at Maliliit na Pulo ng Bansa; (3) Mapa at Uri ng Mapa; (4) Mga Direksyon; (5) Anyong Lupa at Anyong Tubig; (6) Simbolong Ginagamit sa Mapa; (7) Panahon at Klima; (8) Mga Salik na may Kinalaman sa Klima; (9) Mga Pangunahing Hanapbuhay sa Bansa; (10) Pag-aangkop sa Kapaligiran. The topics were based upon the lessons provided by the Elementary Learning Competence from the Department of Education.

The test aims for the students to: (1) identify the important concepts and definitions; (2) comprehend and explain the reasons for given situations and phenomena; and (3) use and analyze different kinds of maps in identifying important symbols and familiarity with places.

Method

Search for Skills and Content Domain

The skills and contents of the test were identified based on the topics covered for grade three students in the first quarter. The test is intended to be administered as the first quarter exam. The skills intended for the first quarter's topics include identifying concepts and terms, comprehending explanations, applying principles to situations, using and analyzing maps, synthesizing different explanations for a particular event, and evaluating the truthfulness and validity of reasons and statements through inference.

In constructing the test, a table of specifications was first constructed to plan out the distribution of items for each topic and the objectives to be attained by the students.
Table 1
Table of Specifications

Topic                                                      Number of Items
…
Mga Direksyon                                                    6
Anyong Lupa at Anyong Tubig                                      5
Simbolong Ginagamit sa Mapa                                      4
Panahon at Klima                                                 5
Mga Salik na may Kinalaman sa Klima                              2
Mga Pangunahing Hanapbuhay ng Bansa                              3
Pag-aangkop sa Kapaligiran                                       3
Total (Knowledge 11, Comprehension 16, Application 13)          40
concordance was used in order to determine the inter-rater reliability of the essay type of test. There were two judges who evaluated the essay part of the test and used criteria to score it.

Results and Discussion

Reliability

The test's reliability was generated through the split-half method by correlating the odd-numbered and even-numbered items. The resulting internal consistency is 0.3, which is low. The low correlation between the odd- and even-numbered items can be accounted for by the different topic contents within the 40-item test. It would have been more appropriate to construct a large pool of items for the 10 content topics or factors that the test has, but 40 items is the school's usual standard for the quarterly test. The test was administered for the purpose of the quarterly test because the usability of the test was considered. With this type of measure, only the reliability of half of the test can be accounted for; this explains the low value of the correlation coefficient. The split-half coefficient is then transformed into a Spearman-Brown coefficient since the correlation is only for half of the test. The resulting Spearman-Brown coefficient is 0.46, which means that the items have a moderate relationship. Also, it is a rule of thumb that there should be at least 30 pairs of scores to be correlated, but in this case only 18 pairs of scores were correlated. The last item was not included since it had no matched item to be correlated with, and the other items were essay type, which was subjected to a different analysis. The low coefficient of internal consistency can also be accounted for by the various types of tests used, and thus by the variation and differences in the performance of the respondents. In other words, the respondents may respond and perform differently for each type of test.
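The step from the half-test correlation to the full-length estimate uses the standard Spearman-Brown prophecy formula for a test of doubled length, which reproduces the coefficient reported above:

$$ r_{\text{full}} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}} = \frac{2(0.30)}{1 + 0.30} \approx 0.46 $$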
The nature of the test cannot be measured by its general homogeneity since the test contains several topics and several types of response formats; respondents perform differently on different types of test. The test has 10 types measuring different skills, such as identifying the important concepts and definitions, comprehending and explaining the reasons for given situations and phenomena, and using and analyzing different kinds of maps in identifying important symbols and familiarity with places.

The dilemma is that the content domains included in the test are part of a general topic on Philippine geography. To test the internal consistency among the nine different contents, a correlation matrix was computed.
Table 2
Intercorrelation of the Nine Subtests

        I      II     III    IV     V      VI     VII    VIII   IX
I       --
II     -.13    --
III     .98*   .99*   --
IV      .18   -.81*  -.48*   --
V      -.21   -.42    .47*  -.19    --
VI      .19    .58*   .47*   .60   -.65*   --
VII    -.73    .28    .41*  -.56*   .73*  -.24    --
VIII    .07   -.19   -.47*   .96*   .08   -.80   -.25    --
IX      .85*  -.58*   .48*   .15    .97*  -.52*  -.52*   .28    --
There is a high relationship between test I and test IX: the higher the scores on the identification of concepts, the higher the scores on the comprehension of the weather map. Also, a high relationship existed between test V and test IX: the higher the scores on the interpretation of a physical map, the higher the scores on the interpretation of the weather map. There is also a high relationship between test IV and test VIII: the higher the scores on the inference about the Philippine islands, the higher the scores on the comprehension of weather. Generally, the inter-correlations among the contents showed pretty crude results because there were few items and the items for each type of test were not equal. The pairing in the computation was done based on the minimum number of items for each test type.

Item Difficulty and Index Discrimination

To evaluate the quality of each type of item in the test, item analysis was done by determining each item's difficulty and index of discrimination. The proportion of examinees getting each item correct was evaluated according to the scale below.

Difficulty Index    Remark
.76 or higher       Easy Item
.25 to .75          Average Item
.24 or lower        Difficult Item
Source: Lamberte, B. (1998). Determining the Scientific Usefulness of Classroom Achievement Test. Cutting Edge Seminar. De La Salle University.

The items' difficulty and discrimination index values are indicated in Table 3. The difficulty indices show a pattern: 67.6% of the items are easy and 32.43% of the test is on the average scale. Considering that the test was constructed for grade three students, the teacher was pitching it at the level of the students' capacity and ability. But it may also mean that the students gained such mastery of the subject matter that most of them were able to answer correctly. It should be noted that the easiness or difficulty of the items is dictated by the proportion of the students who answered the item correctly. In this case, most of the respondents got the answers, which is why most of the items turned out to be easy. In general, the test was fairly easy, since most of the item difficulty indices turned out to be .76 and above.

There were 27% of items considered poor. These items were rejected since most scores are in the high range of the low group, and some scores of the low group are near the scores of the high group who answered correctly. Considering the poor items, such as items 2, 4, 9, 13, 15, 30, 31, 32, 33, and 34, the pattern is indicative. There are very few marginal items that are subject to improvement: only 8% (3 items) are remarked as marginal, since the scores of the low group and the high group are almost the same. This means that both the high and the low group can answer these items fairly. 21.6% (8 items) of the items are reasonably good, since there is enough interval between the high and low groups. There are also a few items remarked as good items and enough to be considered very good items: 16.21% of the items are good items and 24.3% are very good items. There is a pattern of a wide distance of scores between the high group and the low group.

Interrater Reliability

The coefficient of concordance was used to determine the degree of agreement between the two raters who judged the essay type in the test. The essay type basically measures the students' knowledge of the adaptation of farmers in farming. The criteria used for rating the essay are: (a) at least 2 answers are correct (1.5 pts); (b) the answer was explained (1 pt); and (c) the instruction on answering was followed (0.5 pt). A high value of W, 0.74, was computed, indicating close concordance between the raters. This means that the two raters showed small variation in rating the answers to the essay. The small error variance can be accounted for by the difference in the dispositions of the two raters. The first rater
was the actual teacher of the subject, while the second rater was also an Araling Panlipunan teacher but taught at a higher level. There was a difference in how they viewed the answers even though they talked about the rating procedure at the start.
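The coefficient of concordance referred to here is Kendall's W. As a minimal sketch (not from the report; the ranks below are invented for illustration), it can be computed from the raters' ranks as follows:

```python
def kendalls_w(ratings):
    """Kendall's coefficient of concordance W for m raters ranking n subjects.
    ratings: list of m lists, each holding one rater's ranks for the n subjects.
    W = 1 means perfect agreement; W = 0 means none."""
    m, n = len(ratings), len(ratings[0])
    # Sum of ranks each subject received across raters
    rank_sums = [sum(r[j] for r in ratings) for j in range(n)]
    mean_rank_sum = sum(rank_sums) / n
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Two raters ranking five essays (1 = best)
print(kendalls_w([[1, 2, 3, 4, 5], [2, 1, 3, 5, 4]]))  # 0.9, close concordance
```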
Conclusion

A low internal consistency was generated due to the different subject contents in the test, and each test type measures different skills. These two factors affected the internal consistency of the test. It is indeed difficult to make the test entirely uniform since the subject contents are required as minimum learning competencies by the Department of Education. Also, the listed subject contents are the planned focus of the school's subject matter budgeting for the first quarter. A correlation analysis was performed to observe the relationships among the test types. It was found that the higher the scores on the interpretation of a physical map, the higher the scores on the interpretation of the weather map; and the higher the scores on the inference about the Philippine Islands, the higher the scores on the comprehension of topics about weather. A high correlation coefficient was found between these types, although the results may not be too accurate since the basis for the matrix comparison did not have equal numbers of items and only the minimum number of items was subjected to the analysis. It is recommended that an equal number of items for each test type be constructed to produce a more accurate result in the analysis. Agreement between the two raters for the essay type was not perfect, since they had different perceptions in giving points for the answers. The item difficulty showed that most of the items are easy, since the students had gained mastery of the subject matter. The index of discrimination showed that the items are distributed according to their power: there are almost equal numbers of items that are poor (27%), marginal (8%), reasonably good (22%), good (16%), and very good (24%).
Item Response Theory: Obtaining Item Difficulty Using the Rasch Model
It is said that IRT is an approach to testing based on item analysis that considers the
chance of getting particular items right or wrong. In IRT, each item on a test has its own item
characteristic curve that describes the probability of getting the particular item right or wrong
given the ability of the test takers (Kaplan & Saccuzzo, 1997). This will be made concrete in the
computational procedure later in this section.
In using the Rasch model as an approach for determining item difficulty, the calibration
of test item difficulty is independent of the persons used for the calibration, unlike in the classical
test theory approach, where it is dependent on the group. In this method of test calibration, it does not
matter whose responses to the items are used for comparison: it gives the same results regardless of
who takes the test. The scores persons obtain on the test can be used to remove the influence of
their abilities from the estimation of item difficulty. Thus, the result is a sample-free item
calibration.
Rasch’s (1960), the proponent who derived the technique, intended to eliminate
references to populations of examinees in analyses of tests unlike in classical test theory where
norms are used to interpret test scores. According to him that test analysis would only be
worthwhile if it were individual centered with separate parameters for the items and the
examinees (van der Linden & Hambleton, 2004).
The Rasch model is a probabilistic unidimensional model which asserts that: (1) The
easier the question the more likely the student will respond correctly to it, and (2) the more able
the student, the more likely he/she will pass the question compared to a less able student. When
the data fit the Rasch model, the relative difficulties of the questions are independent of the
relative abilities of the students, and vice versa (Rasch, 1977).
As shown in the graph below (Figure 1), a function of ability (θ), which is a latent trait,
forms the boundary between the probability areas of answering an item incorrectly and
answering it correctly.
Figure 1
Item Characteristic Curves of an 18-Item Mathematical Problem Solving Test
In the item characteristic curve, the y-axis represents the probability of a correct response given
ability (θ), and the x-axis is the range of item difficulties in log units. It can be noticed that items 1, 7,
14, 2, 8, and 15 do not require high ability to be answered correctly, as compared to items 5, 12, 18,
and 11, which require high ability. The item characteristic curves are judged at the 50% probability
level with a cutoff of "0" on item difficulty. The curves on the left side of the "0" item difficulty, as
marked at the 50% level, are easy items, and the ones on the right side are difficult items. The program
WINSTEPS was used to produce the curves.
The IRT Rasch model basically identifies the location of a person's ability within a set of
items for a given test. The test items have a predefined set of difficulties, and the person's position
should reflect that his ability is matched with the difficulty of the items. The ability of the person
is symbolized by θ and the items by δ. In the figure below, there are 10 items (δ1 to
δ10), and the location of the person's ability (θ) is between δ7 and δ8. In the continuum, the
items are arranged from the easiest (at the left) to the most difficult (at the right). If the
position of the person's ability is between δ7 and δ8, then it is expected that the person taking the
test should be able to answer items δ1 to δ7 ("1" correct response, "0" incorrect response), since
these items are answerable given the level of ability of the person. This kind of calibration is said
to fit the Rasch model, where the position of the person's ability lies within a defined line of item
difficulties.
Case 1
In Case 2, the person is able to answer four difficult items but unable to respond correctly to
the easy items. There is now difficulty in locating the person on the continuum. If the items are
valid measures of ability, then the easy items should be more answerable than the difficult ones. This
means that the items are not suited to the person's ability. This case does not fit the Rasch model.
Case 2
The Rasch model allows person ability (θ) to be estimated from the scores on the test and
item difficulty (δ) from the item totals, separately; that is why it is considered to be test-free
and sample-free.
In different cases, it may happen that the person's ability (θ) is higher
than the specified item difficulty (δ), so their difference (θ − δ) is greater than zero. But when the
ability (θ) is less than the specified item difficulty (δ), their difference (θ − δ) is less
than 0, as in Case 2. When the ability of the person (θ) is equivalent to the item's difficulty (δ),
the difference (θ − δ) is 0, as in Case 1. This variation in person responses and item difficulty is
represented in an Item Characteristic Curve (ICC), which shows the way the item elicits responses
from persons of every ability (Wright & Stone, 1979).
Figure 2
ICC of a Given Ability and Item Difficulty
An estimate of the response x is obtained when a person with ability (θ) acts on an item with
difficulty (δ). The model specifies that, in the interaction between ability (θ) and item
difficulty (δ), when the ability is greater than the difficulty, the probability of getting the correct
answer is more than .5 or 50%. When the ability is less than the difficulty, the probability of
getting the correct answer is less than .5 or 50%. The variation of these estimates of the
probability of getting a correct response is illustrated in Figure 2. The mathematical units for θ
and δ are defined in logistic functions (ln) to produce a linear scale and generality of measure.
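Written out, the relationship just described is the standard dichotomous Rasch model (the formula itself is not printed in the text above, but this is its usual form):

$$ P(x = 1 \mid \theta, \delta) = \frac{e^{\theta - \delta}}{1 + e^{\theta - \delta}} $$

so that θ = δ gives a probability of exactly .5, θ > δ gives more than .5, and θ < δ gives less than .5.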
The next section guides you in estimating the calibration of item difficulty and person
ability measure.
The Rasch model will be used on the responses of 10 students to a 25-item problem
solving test. In determining item difficulty with the Rasch model, all participants who took the
test are included, unlike in classical test theory, where only the upper and lower 27% are
included in the analysis.
ITEM NUMBER
Examinees 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 total
9 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 20
10 1 1 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 13
5 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1 11
3 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 0 0 1 10
8 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 10
1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 9
6 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 9
7 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 9
4 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 8
2 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 7
Total 5 3 6 1 3 4 8 5 2 4 3 3 6 5 6 1 4 2 6 5 2 5 5 2 10
1. Code each score for each item as "1" for a right answer and "0" for a wrong answer.
2. Arrange the scores (persons) from highest to lowest.
3. Remove items where all the respondents got a correct answer.
4. Remove items where all the respondents got a wrong answer.
5. Rearrange the scores (persons) from highest to lowest.
6. Group the items with similar total item scores (si).
7. Indicate the frequency (fi) of items for each group of items.
8. Divide each total item score (si) by the number of examinees (N) to get the proportion correct: ρi = si / N
9. Subtract the proportion correct from 1 to get the proportion incorrect: 1 − ρi
10. Divide the proportion incorrect by the proportion correct and take the natural log (using a scientific calculator) to get the logit incorrect: xi = ln[(1 − ρi) / ρi]
11. Multiply the frequency (fi) by the logit incorrect (xi): fixi
12. Square xi and multiply by each fi: fixi²
13. Compute the value of x̄: x̄ = Σfixi / Σfi
14. To get the initial item calibration (doi), subtract x̄ from the logit incorrect (xi): doi = xi − x̄
15. Estimate the value of U, which will be used later in the final estimates:

U = [Σfi(xi)² − (Σfi)(x̄)²] / (Σfi − 1)
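As a check on steps 8-14, take score group 2 (items 3, 13, 15, and 19, each with si = 6): ρi = 6/10 = .6, the proportion incorrect is .4, xi = ln(.4/.6) = −0.41, and doi = −0.41 − 0.48 = −0.89, matching the row in Table 1 below.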
Table 1
Grouped Distribution of the 7 Different Item Scores of 10 Examinees

Item score  Item name              Item score  Item freq.  Prop.    Prop.      Logit      Freq. ×  Freq. ×  Initial item
group                              (si)        (fi)        correct  incorrect  incorrect  logit    logit²   calibration (doi)
1           7                      8           1           0.8      0.2        -1.39      -1.39    1.92     -1.87
2           3, 13, 15, 19          6           4           0.6      0.4        -0.41      -1.62    0.66     -0.89
3           1, 8, 14, 20, 22, 23   5           6           0.5      0.5         0.00       0.00    0.00     -0.48
4           6, 10, 17              4           3           0.4      0.6         0.41       1.22    0.49     -0.07
5           2, 5, 11, 12           3           4           0.3      0.7         0.85       3.39    2.87      0.37
6           9, 18, 21, 24          2           4           0.2      0.8         1.39       5.55    7.69      0.91
7           4, 16                  1           2           0.1      0.9         2.20       4.39    9.66      1.72
                                               Σfi = 24                                   Σfixi    Σfi(xi)²
                                                                                          = 11.54  = 23.29

x̄ = Σfixi / Σfi = 11.54 / 24 = 0.48
16. For the person measures, list the possible scores (r) out of the total number of items (L).
17. Count the number of persons at each possible score: person frequency (nr)
18. Divide each possible score by the total number of items to get the proportion correct: ρr = r / L
19. Obtain the proportion incorrect by subtracting the proportion correct from 1: 1 − ρr
20. Determine the logit correct (yr) from the quotient of the proportion correct (ρr) and the proportion incorrect (1 − ρr): yr = ln[ρr / (1 − ρr)]
21. Multiply the logit correct (yr) by each person frequency (nr): nryr
22. Square the logit correct (yr²) and multiply by the person frequency (nr): nr(yr)²
23. The logit correct (yr) is the initial person measure: bro = yr
24. Compute the values of ȳ and V, to be used later in the final estimates.
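For example, for a score of r = 7 out of L = 24: ρr = 7/24 = 0.29 and yr = ln(.29/.71) = −0.89, which is the first row of Table 2 below.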
Table 2
Grouped Distribution of Observed Examinee Scores on the 24-Item Mathematical Problem Solving Test

Possible  Person      Proportion  Logit      Frequency ×   Frequency ×   Initial person
score (r) freq. (nr)  correct (ρr) correct (yr) logit (nryr) logit² (nr(yr)²) measure (bro = yr)
7         1           0.29        -0.89      -0.89         0.79          -0.89
8         1           0.33        -0.69      -0.69         0.48          -0.69
9         3           0.38        -0.51      -1.53         0.78          -0.51
10        2           0.42        -0.34      -0.67         0.23          -0.34
11        1           0.46        -0.17      -0.17         0.03          -0.17
12        0           0.50         0.00       0.00         0.00           0.00
13        1           0.54         0.17       0.17         0.03           0.17
14        0           0.58         0.34       0.00         0.00           0.34
15        0           0.63         0.51       0.00         0.00           0.51
16        0           0.67         0.69       0.00         0.00           0.69
17        0           0.71         0.89       0.00         0.00           0.89
18        0           0.75         1.10       0.00         0.00           1.10
19        0           0.79         1.34       0.00         0.00           1.34
20        1           0.83         1.61       1.61         2.59           1.61
          Σnr = 10                            Σnryr         Σnr(yr)²
                                              = -2.18       = 4.92

ȳ = Σnryr / Σnr = −2.18 / 10 = −0.22
V = [Σnr(yr)² − (Σnr)(ȳ)²] / (Σnr − 1) = 0.49

The item expansion factor (Y) is then obtained from U and V:

Y = √[(1 + V/2.89) / (1 − UV/8.35)] = √[(1 + 0.49/2.89) / (1 − (0.77)(0.49)/8.35)] = 1.11

Each initial item calibration is multiplied by Y to obtain the corrected item calibration, di = Y·doi, with standard error:

SE(di) = Y √[N / (si(N − si))]
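A note on the constants: these formulas follow Wright and Stone's (1979) PROX procedure, in which 2.89 = 1.7² and 8.35 ≈ 1.7⁴, with 1.7 being the usual scaling factor that aligns the logistic and normal ogives.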
Table 3
Final Estimates of Item Difficulties from 10 Examinees

Item score  Item   Initial item        Expansion    Corrected item          Item score  Calibration standard
group (i)   name   calibration (doi)   factor (Y)   calibration (di = Ydoi) (si)        error SE(di)
1           7      -1.87               1.11         -2.07                   8           0.878
The person expansion factor (X) is computed the same way, with U = 0.77:

X = √[(1 + U/2.89) / (1 − UV/8.35)] = √[(1 + 0.77/2.89) / (1 − (0.77)(0.49)/8.35)] = 1.16
29. Multiply the expansion factor (X) by each of the initial measures (bro) to obtain the corrected person measure: br = X·bro. The possible scores and initial measures are taken from Table 2.
30. Compute the standard error of each person measure:

SE(br) = X √[L / (r(L − r))]
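All of the hand computations in steps 1-30 can also be scripted. The following is a minimal sketch (an illustration, not the book's program; software such as WINSTEPS would normally be used) of the same PROX-style estimation. It assumes the response matrix has already had all-correct and all-wrong items and persons removed, as in steps 3-5 above.

```python
import math

def prox_calibrate(responses):
    """PROX-style Rasch calibration for a 0/1 response matrix
    (rows = persons, columns = items). Assumes all-correct and all-wrong
    items and persons have already been removed (steps 3-5 above)."""
    n_persons, n_items = len(responses), len(responses[0])

    # Item side: logit incorrect, x_i = ln((1 - p_i) / p_i)
    p_item = [sum(row[i] for row in responses) / n_persons for i in range(n_items)]
    x = [math.log((1 - p) / p) for p in p_item]
    x_bar = sum(x) / n_items
    d0 = [xi - x_bar for xi in x]                             # initial item calibrations
    U = sum((xi - x_bar) ** 2 for xi in x) / (n_items - 1)    # variance of item logits

    # Person side: logit correct, y_r = ln(p_r / (1 - p_r))
    p_person = [sum(row) / n_items for row in responses]
    y = [math.log(p / (1 - p)) for p in p_person]
    y_bar = sum(y) / n_persons
    V = sum((yr - y_bar) ** 2 for yr in y) / (n_persons - 1)  # variance of person logits

    # Expansion factors (2.89 = 1.7**2, 8.35 = 1.7**4)
    Y = math.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))        # item expansion factor
    X = math.sqrt((1 + U / 2.89) / (1 - U * V / 8.35))        # person expansion factor

    difficulties = [Y * d for d in d0]   # corrected item calibrations, d_i = Y * d_oi
    abilities = [X * yr for yr in y]     # corrected person measures, b_r = X * y_r
    return difficulties, abilities
```

Applied to the 10 × 24 response matrix above (item 25 removed), this closely reproduces the hand-computed estimates, for example roughly −2.1 logits for item 7.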
Figure 3
Item Map for the Calibrated Item Difficulty and Person Ability

Logit    Items (calibrated difficulty)        Persons (ability, raw score)
-2.07    Item 7
-1.05                                         Case 2 (score 7)
-0.98    Items 3, 13, 15, 19
-0.82                                         Case 4 (score 8)
-0.60                                         Cases 1, 6, 7 (score 9)
-0.53    Items 1, 8, 14, 20, 22, 23
-0.40                                         Cases 3, 8 (score 10)
-0.20                                         Case 5 (score 11)
-0.08    Items 6, 10, 17
 0.00    (δ = θ)
 0.20                                         Case 10 (score 13)
 0.41    Items 2, 5, 11, 12
 1.01    Items 9, 18, 21, 24
 1.90                                         Case 9 (score 20)
 1.91    Items 4, 16

Items below 0.00 do not require high ability (δ < θ); items above 0.00 require high ability (δ > θ).
Figure 3 shows the item map of calibrated item difficulty (left side) and person ability (right
side) across their logit values. Observe that as the items become more difficult (increasing logits),
the person with the highest score (high ability) is matched close to the most difficult items. This match is
termed goodness of fit in the Rasch model. A good fit indicates that difficult items require
high ability to be answered correctly. More specifically, the match between the logits of person ability
and item difficulty indicates goodness of fit. In this case, the goodness of fit of the item
difficulties is estimated using the z value: low and nonsignificant z values indicate a
good fit of item difficulty to person ability.
EMPIRICAL REPORT
prevalent at the elementary level within the context of constructivist theories.

Heuristics. Heuristics are kinds of information, available to students in making decisions during problem solving, that are aids to the generation of a solution, plausible in nature rather than prescriptive, seldom providing infallible guidance, and variable in results. Somewhat synonymous terms are strategies, techniques, and rules of thumb. For example, admonitions to "simplify an algebraic expression by removing parentheses," to "make a table," to "restate the problem in your own words," or to "draw a figure to suggest the line of argument for a proof" are heuristic in nature. Out of context, they have no particular value, but incorporated into situations of doing mathematics they can be quite powerful (Polya, 1973; Polya, 1962; Polya, 1965).

Theories of mathematics problem solving (Newell & Simon, 1972; Schoenfeld, 1985; Wilson, 1967) have placed a major focus on the role of heuristics. Surely it seems that providing explicit instruction on the development and use of heuristics should enhance problem solving performance; yet it is not that simple. Schoenfeld (1985) and Lesh (1981) have pointed out the limitations of such a simplistic analysis. Theories must be enlarged to incorporate classroom contexts, past knowledge and experience, and beliefs. What Polya (1967) describes in How to Solve It is far more complex than any theories we have developed so far.

Mathematics instruction stressing heuristic processes has been the focus of several studies. Kantowski (1977) used heuristic instruction to enhance the geometry problem solving performance of secondary school students. Wilson (1967) and Smith (1974) examined contrasts of general and task-specific heuristics. These studies revealed that task-specific heuristic instruction was more effective than general heuristic instruction. Jensen (1984) used the heuristic of subgoal generation to enable students to form problem solving plans. He used thinking aloud, peer interaction, playing the role of teacher, and direct instruction to develop students' abilities to generate subgoals.

It is useful to develop a framework to think about the processes involved in mathematics problem solving. Most formulations of a problem solving framework in U.S. textbooks attribute some relationship to Polya's (1973) problem solving stages. However, it is important to note that Polya's "stages" were more flexible than the "steps" often delineated in textbooks. These stages were described as understanding the problem, making a plan, carrying out the plan, and looking back.

According to Polya (1965), problem solving was a major theme of doing mathematics and "teaching students to think" was of primary importance. "How to think" is a theme that underlies much of genuine inquiry and problem solving in mathematics. However, care must be taken so that efforts to teach students "how to think" in mathematics problem solving do not get transformed into teaching "what to think" or "what to do." This is, in particular, a byproduct of an emphasis on procedural knowledge about problem solving as seen in the linear frameworks of U.S. mathematics textbooks and the very limited problems/exercises included in lessons.

Clearly, the linear nature of the models used in numerous textbooks does not promote the spirit of Polya's stages and his goal of teaching students to think. By their nature, all of these traditional models have the following defects:
1. They depict problem solving as a linear process.
2. They present problem solving as a series of steps.
3. They imply that solving mathematics problems is a procedure to be memorized, practiced, and habituated.
4. They lead to an emphasis on answer getting.

These linear formulations are not very consistent with genuine problem solving activity. They may, however, be consistent with how experienced problem solvers present their solutions and answers after the problem solving
is completed. In an analogous way, mathematicians present their proofs in very concise terms, but the most elegant of proofs may fail to convey the dynamic inquiry that went on in constructing the proof.

Another aspect of problem solving that is seldom included in textbooks is problem posing, or problem formulation. Although there has been little research in this area, this activity has been gaining considerable attention in U.S. mathematics education in recent years. Brown and Walter (1983) have provided the major work on problem posing. Indeed, the examples and strategies they illustrate show a powerful and dynamic side to problem posing activities. Polya (1972) did not talk specifically about problem posing, but much of the spirit and format of problem posing is included in his illustrations of looking back.

A framework is needed that emphasizes the dynamic and cyclic nature of genuine problem solving. A student may begin with a problem and engage in thought and activity to understand it. The student attempts to make a plan and in the process may discover a need to understand the problem better. When a plan has been formed, the student may attempt to carry it out and be unable to do so. The next activity may be attempting to make a new plan, or going back to develop a new understanding of the problem, or posing a new (possibly related) problem to work on.

Problem solving abilities, beliefs, attitudes, and performance develop in contexts (Schoenfeld, 1988), and those contexts must be studied as well as specific problem solving activities.

Rasch Analysis

Rasch analysis (Bond & Fox, 2001; Rasch, 1980; Wright & Stone, 1979) offers potential advantages over the traditional psychometric methods of classical test theory. It has been widely applied in health status assessment (e.g., Antonucci, Aprile, & Paulucci, 2002; Duncan, Bode, Lai, & Perera, 2003; Fortinsky, Garcia, Sheenan, Madigan, & Tullai-McGuinness, 2003; Lai, Cella, Chang, Bode, & Heinemann, 2003; Linacre, Heinemann, Wright, Granger, & Hamilton, 1994; Velozo, Magalhaes, Pan, & Leiter, 1995; Ware, Bjorner, & Kosinski, 2000) but has rarely been used in mathematical problem solving assessment (Willmes, 1981, 1992). Its primary advantages include the interval nature of the measures it provides and the theoretical independence of item difficulty and person ability scores from the particular samples used to estimate them.

The Rasch model, also referred to in the item response theory literature as the one-parameter logistic model, estimates the probability of a correct response to a given item as a function of item difficulty and person ability. The primary output of Rasch analysis is a set of item difficulty and person ability values placed along a single interval scale. Items with higher difficulty scores are less likely to be answered correctly, and items with lower scores are more likely to elicit correct responses. By the same token, persons with higher ability are more likely to provide correct responses, and those with lower ability are less likely to do so.

Rasch analysis (a) estimates the difficulty of dichotomous items as the natural logarithm of the odds of answering each item correctly (a log odds, or logit score), (b) typically scales these estimates to mean = 0, and then (c) estimates person ability scores on the same scale. In the analysis of dichotomous items, item difficulty and person ability are defined such that when they are equal, there is a 50% chance of a correct response. As person ability exceeds item difficulty, the chance of a correct response increases as a logistic ogive function, and as item difficulty exceeds person ability, the chance of success decreases. The formal relationship among response probability, person ability, and item difficulty is given in the mathematical equation by Bond and Fox (2001, p. 201). A graphic plot of this relationship, known as the item characteristic curve (ICC), is given for three items of different difficulty levels.

One useful feature of the Rasch model is referred to as parameter separation or specific
objectivity (Bond & Fox, 2001; Embretson & Reise, 2000). The implication of this mathematical property is that, at least in theory, item difficulty values do not depend on the person sample used to estimate them, nor do person ability scores depend on the particular items used to estimate them. In practical terms, this means that given well-calibrated sets of items that fit the Rasch model, robust and directly comparable ability estimates may be obtained from different subsets of items. This, in turn, facilitates both adaptive testing and the equating of scores obtained from different instruments (Bond & Fox, 2001; Embretson & Reise, 2000).

Rasch theory makes a number of explicit assumptions about the construct to be measured and the items used to measure it, two of which have already been discussed above. The first is that all test items respond to the same unidimensional construct. One set of tools for examining the extent to which test items approximate unidimensionality is the set of fit statistics provided by Rasch analysis. These fit statistics indicate the amount of variation between model expectations and observations. They identify items and people eliciting unexpected responses, such as when a person of high ability responds incorrectly to an easy question, perhaps because of carelessness or because of a poorly constructed or administered item. Fit statistics can be informative with respect to dimensionality because they indicate when different people may be responding to different aspects of an item's content or the testing situation.

A second key assumption of Rasch analysis, also mentioned above, is that individuals can be placed on an ordered continuum along the dimension of interest, from those having less ability to those having more (Bond & Fox, 2001). Similarly, the analysis assumes that items may be placed on the same scale, from those requiring less ability to those requiring more.

A third assumption underlying Rasch analysis is that of local, or conditional, independence (Embretson & Reise, 2000; Wainer & Mislevy, 2000). This assumption requires that individual items do not influence one another (i.e., they are uncorrelated once the dimension of item difficulty-person ability is taken into account). Thus, no considerations of item content, beyond their difficulty values, are necessary for estimating person ability, and changing the order of item administration should not change item or person estimates. In mathematical terms, this assumption states that the probability of a string of responses is equal to the product of the individual probabilities of each of the separate responses comprising it. Failure to meet this assumption can suggest the presence of another dimension in the data.
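Written compactly in standard IRT notation (the text states this assumption in words only), local independence says that for a response string x1, …, xn:

$$ P(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta, \delta_i) $$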
Local dependence is often a concern in the construction of reading comprehension tests that include multiple questions about the same passage, because responses to such questions may be determined not only by the difficulty of each individual item but also by the difficulty and content of the passage. Responses to items of this type are often intercorrelated even after their individual difficulties have been taken into account. To give another example, if a particular question occurring earlier in a test provides specific information about the answer to a later question, then these two items are also likely to demonstrate local dependence.

A final important assumption of the Rasch model is that the slope of the item characteristic curve, also known as the item discrimination parameter, is equal to 1 for all items (Bond & Fox, 2001; Embretson & Reise, 2000; Wainer & Mislevy, 2000). This assumption is presented graphically in Figure 1, where all three curves are parallel with a slope equal to 1. The consequence of this assumption is that a given change in ability level will have the same effect on the log odds of a correct response for all items. For items that have different discrimination values, a given change in ability has different consequences for different items. When an item's discrimination parameter is high, a relatively small change in ability level results in a large change in response probability. When discrimination is low, larger changes in ability
level are needed to change response probability. A highly discriminating item (i.e., one with a high ICC slope) is more likely to result in different responses from two individuals of different ability levels, whereas an item with a low discrimination parameter (i.e., a low ICC slope) more often results in the same response from both. Rasch models have been shown to be robust to small and/or unsystematic violations of this assumption (Penfield, 2004; van de Vijver, 1986), but when the ICC slopes in an item set differ substantially and/or systematically from 1, the test developer is advised to reconsider the extent to which the offending items measure the relevant construct (Wright, 1991).

An example of the use of the one-parameter Rasch model is the study by El-Korashy (1995), where the Rasch model was applied to the selection of items for an Arabic version of the Otis-Lennon Mental Ability Test. Correspondence of item calibration to person measurement indicated that the test is suitable for the range of mental ability intended to be measured. Another is the study by Lamprianou (2004), which analyzes data from three testing cycles of the National Curriculum tests in mathematics in England using the Rasch model. It was found that pupils having English as an additional language and pupils belonging to ethnic minorities are significantly more likely to generate aberrant response patterns. However, within the groups of pupils belonging to ethnic minorities, those who speak English as an additional language are not significantly more likely to generate misfitting response patterns. This may indicate that the ethnic background effect is more significant than the effect of the first language spoken. The results suggest that pupils having English as an additional language and pupils belonging to ethnic minorities are mismeasured significantly more than the remainder of pupils when taking the mathematics National Curriculum tests. More research is needed to generalize the results to other subjects and contexts.

Purpose of the Study

1. In the current investigation, the Rasch model was used to analyze a set of Mathematical Problem Solving data provided by a sample of fourth year high school students in two Chinese schools. One purpose of the study was to determine whether the construct validity of the test is supported by Rasch analysis. Specifically, it is hypothesized that the test responds to a cohesive unidimensional construct. Item fit statistics, a Rasch-based unidimensionality coefficient, and principal-components analysis of model residuals were used to evaluate this hypothesis.

2. To test the hypothesis that Rasch estimates of person ability, because of their status as interval-level measures, are more valid and sensitive than traditionally computed scores.

Method

Participants

The participants were 31 high school students from two different schools, UNO High School and Grace Christian High School. These two high schools were chosen for their popularity in molding high achievers in Mathematics. The participants were fourth year high school students, both male and female, belonging to the 16-18 age group. The decision to choose high school students was made because the high school educational system is much more regimented, and it can be safely assumed that any given fourth year student would have studied the lessons required of a third year student. Convenience sampling was used to select the respondents.

Instrument

Mathematical Problem Solving Test. The Mathematical Problem Solving Test was constructed to measure the problem solving ability of the students (see Appendix A). There are 25 items included in the test, covering third year high school lessons. Third year lessons were used because the participants would only be
mean. A unit on this scale, a logit, represents the change in ability or difficulty necessary to change the odds of a correct response by a factor of 2.718, the base of the natural logarithm. Persons who respond to all items correctly or incorrectly, and items to which all persons respond correctly or incorrectly, are uninformative with respect to item difficulty estimation and are thus excluded from the parameter estimation process.

Results

Item analysis was used to evaluate whether the items in the Mathematical Problem Solving Test are easy, average, or difficult. The difficulty of an item is based on the percentage of people who answered it correctly. The index of discrimination revealed that there are no marginal items or bad items; 84% of the items are very good, 2% are good items, and 2% are reasonably good items.

For item difficulty, each item indicates whether it is easy, average, or difficult. Item difficulty is determined by whether the items have the appropriate difficulty level. It was found that there are no difficult items, although 72% of the items are average and 28% are easy.

One-Parameter Rasch Model

When the test scores and abilities of the students in the Mathematical Problem Solving Test were calibrated, new indices for reliability were obtained. The student reliability was .50 with an RMSE of .52, and the Math reliability was .34 with an RMSE of .82. The errors associated with these estimates are high, indicating that the data do not fit well the expected ability and test difficulty. Figure 1 shows the test characteristic curve generated by WINSTEPS.

The computed separation for ability is 1.20, and for the item (expected score) it is 11, which is .73 when converted into a standardized estimate. These extreme values are adjusted by fine-tuning the slopes produced for each item.

The characteristic curve shows that items 5, 1, 6, and 4 have the probability of being answered with low ability, while items 3, 7, and 2 require higher ability to get a correct response.

The characteristic curve shows that items 11, 13, 2, and 8 have the probability of being answered with low ability, while items 9, 10, and 4 require higher ability to get a correct response. The overlap between items 11 and 13 and between items 9 and 10 means that the same ability is required to get the probability of answering the item correctly.

The characteristic curve shows that items 16 and 19 have the probability of being answered with low ability, while items 17, 18, 20, 21, and 22 require higher ability to get a correct response. The overlap between items 18, 20, 21, and 22, and between items 16 and 19, means that the same ability is required to get the probability of answering the item correctly. Items 23, 24, and 25 are excluded because of extreme responses.

Examination of Fit

The average INFIT statistic is 1.00 and the average OUTFIT statistic is .98, which indicates that the data for the items show goodness of fit because the values are less than 1.5, except for items 23, 24, and 25.

Unidimensionality Coefficient

To address the question of construct dimensionality, a Rasch unidimensionality coefficient was calculated. This coefficient was calculated as the ratio of the person separation reliability estimated using model standard errors (which treat model misfit as random variation) to the person separation reliability estimated using real standard errors (which regard misfit as true departure from the unidimensional model; Wright, 1994). The closer the value of the coefficient to 1.0, the more closely the data approximate unidimensionality. The unidimensionality coefficient for the current data set was .61 (the ratio of the 1.20 and .73 separation values), which is quite marginal relative to 1.00. This means that the data might form more than one dimension.

Principal components analysis shows that there can possibly be 7 factors that can be
121
formed with the items, excluding item 25, which had no variation, as indicated in the scree plot. Principal-components analysis of model residuals conducted for the 24-item pool (after exclusion of the seven misfitting items) revealed that 26.97% of the variance in the observations was accounted for by the Rasch dimension of item difficulty-person ability. The next largest factor extracted accounted for only 4.86% of the remaining variance.

The log functions for each item show large standard errors. This supports the principal components analysis finding that there might be factors formed out of the 22 items.

Discussion

The present results generally support the construct and content validity of the Mathematical Problem Solving Test. First, the acceptable fit of the 22 test items to the Rasch model and the marginal unidimensionality coefficient (.61) support the hypothesis that the test measures a unidimensional construct. Furthermore, acceptable item and person separation indices and reliability coefficients suggest that the parameter estimates obtained in the current study are both reproducible and useful for differentiating items and persons from one another.

In addition, principal-components analysis of Rasch model residuals (with the two misfitting items excluded) indicated that the dimension of person ability-item difficulty accounted for the majority of the variance in the data (26.97%), and the next largest factor extracted accounted for very little additional variance (4.86%), although this does not provide further support for the unidimensionality of the test.

The pattern of item difficulty across subtests was consistent with item content and similar for values derived by Rasch analysis and traditional methods. As expected, based on increasing lexical load, the results showed variation in difficulty. There are more items that can be answered requiring low ability because of the additional lexical load imposed by the inclusion of size adjectives.

Aspects of the test's validity were supported by the present analyses. First, only two items were excluded because of poor model fit. Perhaps participants were not generally able to figure out the proper response strategy by the end of the test (because of the provision of repeats and cues) and were then able to effectively implement problem solving strategies. If this is correct, then eliminating these items should introduce misfit for the items of this type. The two other items that were excluded because of poor model fit were the last test items, which differ from the earlier items in that they contain two-part commands and require responses using more skills. This suggests that initial responses to different kinds of commands might be determined in part by another construct, for example, the ability to switch set.

A second aspect of the test's validity that the present analysis failed to confirm concerns the homogeneity of item difficulty within subtests. The differences between the parameter estimates within the items suggest that they are not necessarily homogeneous with respect to difficulty. The present finding might have been in part the result of a relatively small and poorly targeted sample. A larger sample with a broader distribution might obtain less item variability. Although sample sizes of approximately 100 have been argued to produce stable item parameter estimates (Linacre, 1994; van de Vijver, 1986), larger samples are preferable. Willmes's (1981) prior finding suggests that the present result may be reliable, but his participant sample was similarly sized, if perhaps better targeted.

References

Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1, Suppl.), I7-I16.
Antonucci, G., Aprile, T., & Paolucci, S. (2002). Rasch analysis of the Rivermead Mobility Index: A study using mobility measures of first-stroke inpatients. Archives of Physical Medicine and Rehabilitation, 83, 1442-1449.
Arvedson, J. C., McNeil, M. R., & West, T. L. (1986). Prediction of Revised Token Test overall, subtest, and linguistic unit scores by two shortened versions. Clinical Aphasiology, 16, 57-63.
Blackwell, A., & Bates, E. (1995). Inducing agrammatic profiles in normals: Evidence for the selective vulnerability of morphology under cognitive resource limitation. Journal of Cognitive Neuroscience, 7, 228-257.
Bobrow, D. G. (1964). Natural language input for a computer problem solving system. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Cambridge.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4, 87-100.
Brown, S. I., & Walter, M. I. (1983). The art of problem posing. Hillsdale, NJ: Lawrence Erlbaum.
Chang, W.-C., & Chan, C. (1995). Rasch analysis for outcomes measures: Some methodological considerations. Archives of Physical Medicine and Rehabilitation, 76, 934-939.
Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3, 186-190.
DiSimoni, F. G., Keith, R. L., & Darley, F. L. (1980). Prediction of PICA overall score by short versions of the test. Journal of Speech and Hearing Research, 23, 511-516.
Duffy, J. R., & Dale, B. J. (1977). The PICA scoring scale: Do its statistical shortcomings cause clinical problems? In R. H. Brookshire (Ed.), Collected proceedings from clinical aphasiology (pp. 290-296). Minneapolis, MN: BRK.
Duncan, P. W., Bode, R., Lai, S. M., & Perera, S. (2003). Rasch analysis of a new stroke-specific outcome scale: The Stroke Impact Scale. Archives of Physical Medicine and Rehabilitation, 84, 950-963.
Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-77.
El-Korashy, A. (1995). Applying the Rasch model to the selection of items for a mental ability test. Educational and Psychological Measurement, 55, 753.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Fischer, G. H., & Molenaar, I. W. (1995). Rasch models: Foundations, recent developments and applications. New York: Springer.
Fortinsky, R. H., Garcia, R. I., Sheehan, T. J., Madigan, E. A., & Tullai-McGuinness, S. (2003). Measuring disability in Medicare home care patients: Application of Rasch modeling to the Outcome and Assessment Information Set. Medical Care, 41, 601-615.
Frederiksen, N. (1984). Implications of cognitive theory for instruction in problem solving. Review of Educational Research, 54, 363-407.
Freed, D. B., Marshall, R. C., & Chulantseff, E. A. (1996). Picture naming variability: A methodological consideration of inconsistent naming responses in fluent and nonfluent aphasia. In R. H. Brookshire (Ed.), Clinical aphasiology conference (pp. 193-205). Austin, TX: Pro-Ed.
Garofalo, J., & Lester, F. K. (1985). Metacognition, cognitive monitoring, and mathematical performance. Journal for Research in Mathematics Education, 16, 163-176.
Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Henderson, K. B., & Pingry, R. E. (1953). Problem solving in mathematics. In H. F. Fehr (Ed.), The learning of mathematics: Its theory and practice (21st Yearbook of the National Council of Teachers of Mathematics) (pp. 228-270). Washington, DC: National Council of Teachers of Mathematics.
Hobart, J. C. (2002). Measuring disease impact in disabling neurological conditions: Are patients' perspectives and scientific rigor compatible? Current Opinion in Neurology, 15, 721-724.
Howard, D., Patterson, K., Franklin, S., Morton, J., & Orchard-Lisle, V. (1984). Variability and consistency in naming by aphasic patients. Advances in Neurology, 42, 263-276.
Jensen, R. (1984). A multifaceted instructional approach for developing subgoal generation skills. Unpublished doctoral dissertation, The University of Georgia.
Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.
Kantowski, M. G. (1974). Processes involved in mathematical problem solving. Unpublished doctoral dissertation, The University of Georgia, Athens.
Kantowski, M. G. (1977). Processes involved in mathematical problem solving. Journal for Research in Mathematics Education, 8, 163-180.
Kaput, J. J. (1979). Mathematics learning: Roots of epistemological status. In J. Lochhead & J. Clement (Eds.), Cognitive process instruction. Philadelphia, PA: Franklin Institute Press.
Lai, J.-S., Cella, D., Chang, C. H., Bode, R., & Heinemann, A. W. (2003). Item banking to improve, shorten and computerize self-reported fatigue: An illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Quality of Life Research, 12, 485-501.
Lamprianou, I., & Boyle, B. (2004). Accuracy of measurement in the context of mathematics National Curriculum tests in England for ethnic minority pupils and pupils who speak English as an additional language. Journal of Educational Measurement, 41, 239-251.
Larkin, J. (1980). Teaching problem solving in physics: The psychological laboratory and the practical classroom. In F. Reif & D. Tuma (Eds.), Problem solving in education: Issues in teaching and research. Hillsdale, NJ: Lawrence Erlbaum.
Lesh, R. (1981). Applied mathematical problem solving. Educational Studies in Mathematics, 12(2), 235-265.
Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 328.
Linacre, J. M. (1998). Structure in Rasch residuals: Why principal components analysis? Rasch Measurement Transactions, 12, 636.
Linacre, J. M. (2002). Facets, factors, elements and levels. Rasch Measurement Transactions, 16, 880.
Linacre, J. M., & Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Linacre, J. M., & Wright, B. D. (2003). WINSTEPS: Multiple-choice, rating scale, and partial credit Rasch analysis [Computer software]. Chicago: MESA Press.
Linacre, J. M., Heinemann, A. W., Wright, B., Granger, C. V., & Hamilton, B. B. (1994). The structure and stability of the Functional Independence Measure. Archives of Physical Medicine and Rehabilitation, 75, 127-132.
Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27.
Lumsden, J. (1978). Tests are perfectly reliable. British Journal of Mathematical and Statistical Psychology, 31, 19-26.
Masters, G. (1993). Undesirable item discrimination. Rasch Measurement Transactions, 7, 289.
McHorney, C. A., Haley, S. M., & Ware, J. E. (1997). Evaluation of the MOS SF-36 physical functioning scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. Journal of Clinical Epidemiology, 50, 451-461.
McNeil, M. R. (1988). Aphasia in the adult. In N. J. Lass, L. V. McReynolds, J. Northern, & D. E. Yoder (Eds.), Handbook of speech-language pathology and audiology (pp. 738-786). Toronto, Ontario, Canada: D. C. Becker.
McNeil, M. R., & Hageman, C. F. (1979). Auditory processing deficits in aphasia evidenced on the Revised Token Test: Incidence and prediction of across subtest and across item within subtest patterns. In R. H. Brookshire (Ed.), Clinical aphasiology conference proceedings (pp. 47-69). Minneapolis, MN: BRK.
Merbitz, C., Morris, J., & Grip, J. C. (1989). Ordinal scales and foundations of misinference. Archives of Physical Medicine and Rehabilitation, 70, 308-312.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355-383.
Michell, J. (2004). Item response models, pathological science, and the shape of error. Theory and Psychology, 14, 121-129.
National Council of Supervisors of Mathematics. (1978). Position paper on basic mathematical skills. Mathematics Teacher, 71(2), 147-152. (Reprinted from position paper distributed to members January 1977)
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice Hall.
Norquist, J. M., Fitzpatrick, R., Dawson, J., & Jenkinson, C. (2004). Comparing alternative Rasch-based methods vs. raw scores in measuring change in health. Medical Care, 42, 125-136.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Orgass, B. (1976). Eine Revision des Token Tests, Teil I und II [A revision of the Token Test, Parts I and II]. Diagnostica, 22, 70-87.
Penfield, R. D. (2004). The impact of model misfit on partial credit model parameter estimates. Journal of Applied Measurement, 5, 115-128.
Polya, G. (1962). Mathematical discovery: On understanding, learning and teaching problem solving (Vol. 1). New York: Wiley.
Polya, G. (1965). Mathematical discovery: On understanding, learning and teaching problem solving (Vol. 2). New York: Wiley.
Polya, G. (1973). How to solve it. Princeton, NJ: Princeton University Press. (Original work published 1945)
Porch, B. (2001). Porch Index of Communicative Ability. Albuquerque, NM: PICA Programs.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960)
Reitman, W. R. (1965). Cognition and thought. New York: Wiley.
Schoenfeld, A. H. (1983). Episodes and executive decisions in mathematics problem solving. In R. Lesh & M. Landau (Eds.), Acquisition of mathematics concepts and processes. New York: Academic Press.
Schoenfeld, A. H. (1985). Mathematical problem solving. Orlando, FL: Academic Press.
Schoenfeld, A. H. (1988). When good teaching leads to bad results: The disasters of "well taught" mathematics classes. Educational Psychologist, 23, 145-166.
Schoenfeld, A. H., & Herrmann, D. (1982). Problem perception and knowledge structure in expert and novice mathematical problem solvers. Journal of Experimental Psychology: Learning, Memory and Cognition, 8, 484-494.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354.
Silver, E. A. (1987). Foundations of cognitive theory and research for mathematics problem-solving instruction. In A. H. Schoenfeld (Ed.), Cognitive science and mathematics education (pp. 33-60). Hillsdale, NJ: Lawrence Erlbaum.
Smith, J. P. (1974). The effects of general versus specific heuristics in mathematical problem-solving tasks (Columbia University, 1973). Dissertation Abstracts International, 34, 2400A.
Smith, R. M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46, 359-372.
Stanic, G., & Kilpatrick, J. (1988). Historical perspectives on problem solving in the mathematics curriculum. In R. I. Charles & E. A. Silver (Eds.), The teaching and assessing of mathematical problem solving (pp. 1-22). Reston, VA: National Council of Teachers of Mathematics.
Steffe, L. P., & Wood, T. (Eds.). (1990). Transforming children's mathematical education. Hillsdale, NJ: Lawrence Erlbaum.
Stevens, S. S. (1946, June 7). On the theory of scales of measurement. Science, 103, 677-680.
van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied Psychological Measurement, 10, 45-57.
Velozo, C. A., Magalhaes, L. C., Pan, A.-W., & Leiter, P. (1995). Functional scale discrimination at admission and discharge: Rasch analysis of the Level of Rehabilitation Scale-III. Archives of Physical Medicine and Rehabilitation, 76, 705-712.
von Glasersfeld, E. (1989). Constructivism in education. In T. Husen & T. N. Postlethwaite (Eds.), The international encyclopedia of education (Suppl. Vol. 1, pp. 162-163). New York: Pergamon.
Wainer, H., & Mislevy, R. J. (2000). Item response theory, item calibration, and proficiency estimation. In H. Wainer, N. J. Dorans, D. Eignor, R. Flaugher, B. F. Green, R. J. Mislevy, et al. (Eds.), Computerized adaptive testing: A primer (2nd ed., pp. 61-100). Mahwah, NJ: Erlbaum.
Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., et al. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Erlbaum.
Ware, J. E., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38, II73-II82.
Waters, W. (1984). Concept acquisition tasks. In G. A. Goldin & C. E. McClintock (Eds.), Task variables in mathematical problem solving (pp. 277-296). Philadelphia, PA: Franklin Institute Press.
Willmes, K. (1981). A new look at the Token Test using probabilistic test models. Neuropsychologia, 19, 631-645.
Willmes, K. (1992). Psychometric evaluation of neuropsychological test performances. In N. von Steinbuechel, D. Y. Cramon, & E. Poeppel (Eds.), Neuropsychological rehabilitation (pp. 103-113). Heidelberg, Germany: Springer-Verlag.
Willmes, K. (2003). Psychometric issues in aphasia therapy research. In I. Papathanasiou & R. De Bleser (Eds.), The sciences of aphasia: From theory to therapy (pp. 227-244). Amsterdam: Pergamon.
Wilson, J. W. (1967). Generality of heuristics as an instructional variable. Unpublished doctoral dissertation, Stanford University, Stanford, CA.
Wright, B. D. (1991). IRT in the 1990s: Which models work best? Rasch Measurement Transactions, 6, 196-200.
Wright, B. D. (1994). A Rasch unidimensionality coefficient. Rasch Measurement Transactions, 8, 385.
Wright, B. D. (1996). Local dependency, correlations and principal components. Rasch Measurement Transactions, 10, 509-511.
Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know (pp. 65-104). Mahwah, NJ: Erlbaum.
Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70, 857-860.
Wright, B. D., & Masters, G. S. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wright, B., & Masters, G. (1997). The partial credit model. In W. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 101-121). New York: Springer.
SPECIAL TOPIC
This review presents the nature of psychometrics, including issues in psychological measurement, its relevant theories, and current practice. The basic scaling models are discussed, since they are the processes that enable the quantification of psychological constructs. The issues and research trends in classical test theory and item response theory, with their different models and their implications for test construction, are explained. Towards the end of the article, different methods of scaling people, stimuli, and responses are discussed.
Psychometrics concerns itself with the science of measuring psychological constructs such as ability, personality, affect, and skills. Psychological measurement methods are crucially important for basic research in psychology, since research in psychology involves the measurement of variables in order to conduct further analyses. In the past, obtaining adequate measurement of psychological constructs was considered an issue in the science of psychology. Some references indicate that there are psychological constructs that are deemed unobservable and difficult to quantify. This issue is compounded by the fact that psychological theories are filled with variables that either cannot be measured at all at the present time or can be measured only approximately (Kaplan & Saccuzzo, 1997), such as anxiety, creativity, dogmatism, achievement, motivation, attention, and frustration. Moreover, according to Immanuel Kant, "it is impossible to have a science of psychology because the basic data could not be observed and measured." The field of psychological measurement has nevertheless advanced, and practitioners in the field of psychometrics have been able to deal properly with these issues and devise methods on the basic premise of scientific observation and measurement. Most psychological constructs involve subjective experiences, such as feelings, sensations, and desires; since individuals can make judgments, state their preferences, and even talk about these experiences, measurement can take place, and it thus meets the requirements of scientific inquiry. It is very much possible to assign numbers to psychological constructs so as to represent quantities of attributes, and even to formulate rules standardizing the measurement process.
Standardizing psychological measurement requires a process of abstraction in which psychological attributes are observed in relation to other constructs such as attitude and achievement (Magno, 2003). This process allows researchers to establish associations among variables, as in construct validation and criterion-prediction processes. Also, emphasizing the measurement of psychological constructs forces researchers and test developers to consider carefully the nature of a construct before attempting to measure it. This involves a thorough literature review of the conceptual definition of an attribute before valid test items are constructed. It is also common practice in psychometrics for numerical scores to be used to communicate the amount of an attribute an individual possesses. Quantification is closely intertwined with the concept of measurement: in the process of quantification, mathematical systems and statistical procedures are used that enable researchers to examine the internal relationships among data obtained through a measure. Such procedures enable psychometrics to build theories and to consider itself part of the system of science.
There are two branches of psychometric theory: classical test theory and item response theory. Both theories enable researchers to predict the outcomes of psychological tests by identifying parameters for item difficulty and the ability of test takers, and both are concerned with improving the reliability of psychological tests.

Classical test theory is regarded in the literature as "true score theory." The theory starts from the assumption that systematic effects between the responses of examinees are due only to variation in the ability of interest. All other potential sources of variation in the testing materials, such as external conditions or the internal conditions of examinees, are assumed either to be held constant through rigorous standardization or to have an effect that is nonsystematic or random by nature (van der Linden & Hambleton, 2004). The central model of classical test theory is that an observed test score (TO) is composed of a true score (T) and an error score (E), where the true and error scores are independent. The model was established by Spearman (1904) and Novick (1966) and is best illustrated in the formula:

TO = T + E
Classical theory assumes that each individual has a true score, which would be obtained if there were no errors in measurement. However, because measuring instruments are imperfect, the score observed for each person may differ from the individual's true ability. The difference between the true score and the observed test score results from measurement error. Using a variety of justifications, error is often assumed to be a random variable having a normal distribution. The implication of classical test theory for test takers is that tests are fallible, imprecise tools. The score achieved by an individual is rarely the individual's true score; although the true score for an individual will not change with repeated applications of the same test, the observed score is almost always the true score influenced by some degree of error, which pushes the observed score higher or lower. Theoretically, the standard deviation of the distribution of random errors for each individual tells about the magnitude of measurement error. It is usually assumed that the distribution of random errors will be the same for all individuals. Classical test theory uses the standard deviation of errors as the basic measure of error; usually this is called the standard error of measurement. In practice, the standard deviation of the observed scores and the reliability of the test are used to estimate the standard error of measurement (Kaplan & Saccuzzo, 1997). The larger the standard error of measurement, the less certain is the accuracy with which an attribute is measured. Conversely, a small standard error of measurement indicates that an individual's score is probably close to the true score. The standard error of measurement is calculated with the formula Sm = S√(1 - r), where S is the standard deviation of the observed scores and r is the reliability of the test. Standard errors of measurement are used to create confidence intervals around specific observed scores (Kaplan & Saccuzzo, 1997). The lower and upper bounds of the confidence interval approximate the value of the true score.
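To make the computation concrete, here is a minimal Python sketch of the standard error of measurement and a 95% confidence interval around an observed score. The test values (standard deviation 10, reliability .91, observed score 75) are hypothetical, chosen only for illustration.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: Sm = S * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed: float, sd: float, reliability: float, z: float = 1.96):
    """Confidence interval around an observed score (z = 1.96 gives ~95%)."""
    m = sem(sd, reliability)
    return observed - z * m, observed + z * m

# Hypothetical test: SD = 10, reliability r = .91, observed score = 75.
print(sem(10, 0.91))                      # 3.0
print(confidence_interval(75, 10, 0.91))  # (69.12, 80.88)
```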
Traditionally, methods of analysis based on classical test theory have been used to evaluate tests. The focus of the analysis is on the total test score: the frequency of correct responses (to indicate question difficulty), the frequency of responses (to examine distractors), the reliability of the test, and item-total correlations (to evaluate discrimination at the item level) (Impara & Plake, 1997). Although these statistics have been widely used, one limitation is that they relate to the sample under scrutiny, and thus all the statistics that describe items and questions are sample dependent (Hambleton, 2000). This critique may not be particularly relevant where successive samples are reasonably representative and do not vary across time, but this needs to be confirmed, and complex strategies have been proposed to overcome the limitation.
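As an illustration of the item statistics listed above, the following sketch computes item difficulty (the proportion of correct responses) and a corrected item-total correlation from a small matrix of scored responses; the response data are hypothetical.

```python
import numpy as np

# Hypothetical scored responses: 6 examinees x 4 items (1 = correct).
X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

difficulty = X.mean(axis=0)  # proportion correct per item (the p-value)

# Corrected item-total correlation: correlate each item with the
# total score computed from the remaining items.
totals = X.sum(axis=1)
discrimination = np.array([
    np.corrcoef(X[:, j], totals - X[:, j])[0, 1] for j in range(X.shape[1])
])

print("difficulty:", difficulty)
print("discrimination:", discrimination)
```

Both statistics are computed from this particular sample, which is exactly the sample dependence the passage describes.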
Another branch of psychometric theory is item response theory (IRT). IRT may be regarded as roughly synonymous with latent trait theory. It is sometimes referred to as strong true score theory or modern mental test theory, because IRT is a more recent body of theory and makes stronger assumptions than classical test theory. This approach to testing, based on item analysis, considers the chance of getting particular items right or wrong. In this approach, each item on a test has its own item characteristic curve that describes the probability of getting the particular item right or wrong given the ability of the test taker (Kaplan & Saccuzzo, 1997). The Rasch model, as an example of IRT, is appropriate for modeling dichotomous responses and models the probability of an individual's correct response on a dichotomous item. The logistic item characteristic curve, a function of ability, forms the boundary between the probability areas of answering an item incorrectly and answering it correctly. This one-parameter logistic model assumes that the discriminations of all items are equal to one (Maier, 2001).

Another fundamental feature of this theory is that item performance is related to the estimated amount of the respondent's latent trait (Anastasi & Urbina, 2002). A latent trait is symbolized as theta (θ), which refers to a statistical construct. In cognitive tests, the latent trait is called the ability measured by the test, and the total score on a test is taken as an estimate of that ability. A person of specified ability (θ) succeeds with a predictable probability on an item of specified difficulty.
There are various approaches to the construction of tests using item response theory. Some approaches use two dimensions that plot item discriminations and item difficulties. Other approaches add a third parameter for the probability of test takers with very low levels of ability getting a correct response (as demonstrated in Figure 1). Still other approaches, such as the Rasch model, use only the difficulty parameter (one dimension). All these approaches characterize the item in relation to the probability that those who do well or poorly on the exam will have different levels of performance.
Two-Parameter Model/Normal-Ogive Model. The ogive model postulates a normal cumulative distribution function as the response function for an item. The model demonstrates that an item's difficulty is the point on the ability scale where an examinee has a probability of success on the item of .50 (van der Linden & Hambleton, 2004). In the model, the difficulty of each item can be defined by the 50% threshold, which is customary in establishing sensory thresholds in psychophysics. The discriminative power of each item, represented by a curve in the graph, is indicated by its steepness: the steeper the curve, the higher the correlation of item performance with total score and the higher the discrimination index.

The original idea of the model can be traced back to Thurstone's use of the normal model in his discriminal dispersion theory of stimulus perception (Thurstone, 1927). Researchers in psychophysics studied the relation between the psychophysical properties of stimuli and their perception by human subjects. In this procedure, a stimulus is presented to the subject, who reports whether the stimulus is detected; detection increases as stimulus intensity increases. With this pattern, the cumulative normal distribution with parametrization was used as the response function.
Three-Parameter Model/Logistic Model. In plotting ability (θ) against the probability of a correct response Pi(θ) in a three-parameter model, the slope of the curve indicates the item discrimination: the higher the value of the item discrimination, the steeper the slope. In the model, Birnbaum (1950) proposed a third parameter to account for the nonzero performance of low-ability examinees on multiple-choice items. The nonzero performance is due to the probability of guessing correct answers to multiple-choice items (van der Linden & Hambleton, 2004).
Figure 1. Hypothetical Item Characteristic Curves for Three Items Using a Three-Parameter Model
[The figure plots the probability of a correct response (0-100%) on the vertical axis against the ability scale on the horizontal axis, with separate curves for Items 1, 2, and 3.]
The item difficulty parameters (b1, b2, b3) correspond to the locations on the ability axis at which the probability of a correct response is .50. The curves show that Item 1 is easier, while Items 2 and 3 have the same difficulty at a .50 probability of a correct response. Estimates of item parameters and ability are typically computed through successive approximation procedures, in which the approximations are repeated until the values stabilize.
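As a worked illustration of the curves in Figure 1, the sketch below evaluates the standard three-parameter logistic function, P(θ) = c + (1 - c)/(1 + e^(-a(θ - b))), for three hypothetical items, where a is the discrimination (slope), b the difficulty (location), and c the lower asymptote for guessing. The parameter values are assumptions chosen only to mimic the figure.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: c + (1 - c) / (1 + exp(-a*(theta - b)))."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)  # points along the ability scale
items = [                      # hypothetical (a, b, c) triples
    (1.0, -1.0, 0.20),         # easier item
    (1.5,  0.5, 0.20),         # harder, more discriminating item
    (0.8,  0.5, 0.25),         # same difficulty, flatter slope
]
for a, b, c in items:
    print(np.round(p_3pl(theta, a, b, c), 2))
```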
One-Parameter Model/Rasch Model. The Rasch model is based on the assumption that both guessing and item differences in discrimination are negligible. In constructing tests, the proponents of the Rasch model frequently discard those items that do not meet these assumptions (Anastasi & Urbina, 2002). Rasch began his work in educational and psychological measurement in the late 1940s. Early in the 1950s he developed his Poisson models for reading tests and a model for intelligence and achievement tests, later called the "structural models for items in a test," which is known today as the Rasch model.

Rasch's (1960) main motivation for his model was to eliminate references to populations of examinees in the analysis of tests. According to him, test analysis would only be worthwhile if it were individual centered, with separate parameters for the items and the examinees (van der Linden & Hambleton, 2004). His work marked IRT with its probabilistic modeling of the interaction between an individual item and an individual examinee. The Rasch model is a probabilistic unidimensional model which asserts that (1) the easier the question, the more likely the student will respond to it correctly, and (2) the more able the student, the more likely he/she will pass the question compared to a less able student.
The Rasch model was derived from the initial Poisson model, illustrated in the formula

λ = δ/θ

where λ is a function of the parameters describing the ability of the examinee and the difficulty of the test, θ represents the ability of the examinee, and δ represents the difficulty of the test, which is estimated from the summation of errors in the test. Furthermore, the model was enhanced to assume that the probability that a student will correctly answer a question is a logistic function of the difference between the student's ability (θ) and the difficulty of the question (β), that is, the ability required to answer the question correctly, and only a function of that difference:

P(X = 1) = e^(θ - β) / (1 + e^(θ - β))

giving way to the Rasch model.
From this, the expected pattern of responses to questions can be determined given the estimated θ and β. Even though each response to each question must depend upon the student's ability and the question's difficulty, in the data analysis it is possible to condition out, or eliminate, the students' abilities (by taking all students at the same score level) in order to estimate the relative question difficulties (Andrich, 2004; Dobby & Duckworth, 1979). Thus, when the data fit the model, the relative difficulties of the questions are independent of the relative abilities of the students, and vice versa (Rasch, 1977). A further consequence of this invariance is that it justifies the use of the total score (Wright & Panchapakesan, 1969). In the current analysis this estimation is done through a pairwise conditional maximum likelihood algorithm.
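A minimal sketch of the model in code: the first function is the Rasch probability just given, and the item calibration shown is only a rough normal-approximation (PROX-style) shortcut based on the logit of each item's proportion correct. Operational calibration uses conditional or joint maximum likelihood (as in WINSTEPS), so the simulated data and this shortcut are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_rasch(theta, beta):
    """Rasch probability of a correct response: exp(theta - beta) / (1 + exp(theta - beta))."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

# Simulate 500 hypothetical examinees on 5 items of known difficulty (logits).
true_beta = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
theta = rng.normal(0, 1, size=500)
responses = rng.random((500, 5)) < p_rasch(theta[:, None], true_beta[None, :])

# Rough PROX-style calibration: logit of the proportion incorrect, centered at 0.
p_correct = responses.mean(axis=0)
beta_hat = np.log((1 - p_correct) / p_correct)
beta_hat -= beta_hat.mean()
print(np.round(beta_hat, 2))  # roughly recovers true_beta
```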
According to Fischer (1974) the Rasch model can be derived from the following assumptions:
(1) Unidimensionality. All items are functionally dependent upon only one underlying continuum.
(2) Monotonicity. All item characteristic functions are strictly monotonic in the latent trait. The item
characteristic function describes the probability of a predefined response as a function of the latent trait.
(3) Local stochastic independence. Every person has a certain probability of giving a predefined
response to each item and this probability is independent of the answers given to the preceding items.
(4) Sufficiency of a simple sum statistic. The number of predefined responses is a sufficient statistic
for the latent parameter.
(5) Dichotomy of the items. For each item there are only two different responses, for example positive and negative. The Rasch model requires that an additive structure underlie the observed data. This additive structure applies to the logit of Pij, where Pij, the probability that subject i will give a predefined response to item j, is the sum of a subject scale value ui and an item scale value vj, i.e., ln(Pij / (1 − Pij)) = ui + vj.
There are various applications of the Rasch model in test construction, such as the item-mapping method (Wang, 2003) and the hierarchical measurement method (Maier, 2001).

1993; Reid, 1991; Shepard, 1995). Studies found that judges are able to rank order items accurately in terms of item difficulty, but they are not particularly accurate in estimating item performance for target examinee groups (Impara & Plake, 1998; National Research Council, 1999; Shepard, 1995). A fundamental flaw of the Angoff method is that it requires judges to perform the nearly impossible cognitive task of estimating the probability of minimally competent candidates (MCCs) answering each item in the pool correctly (Berk, 1996; National Academy of Education, 1993).
An item-mapping method, which applies the Rasch IRT model to the standard setting process, has
been used to remedy the cognitive deficiency in the Angoff method for multiple-choice licensure and
certification examinations (McKinley, Newman, & Wiser, 1996). The Angoff method limits judges to each
individual item while they make an immediate judgment of item performance for MCCs. In contrast, the
item-mapping method presents a global picture of all items and their estimated difficulties in the form of a
histogram chart (item map), which serves to guide and simplify the judges' process of decision making
during the cut score study. The item difficulties are estimated through application of the Rasch IRT model.
Like all IRT scaling methods, the Rasch estimation procedures can place item difficulty and candidate
ability on the same scale. An additional advantage of the Rasch measurement scale is that the difference
between a candidate's ability and an item's difficulty determines the probability of a correct response
(Grosse & Wright, 1986). When candidate ability equals item difficulty, the probability of a correct answer to
the item is .50. Unlike the Angoff method, which requires judges to estimate the probability of an MCC's
success on an item, the item-mapping method provides the probability (i.e., .50) and asks judges to
determine whether an MCC has this probability of answering an item correctly. By utilizing the Rasch
model's distinct relationship between candidate ability and item difficulty, the item-mapping method enables
judges to determine the passing score at the point where the item difficulty equals the MCC's ability level.
The item-mapping method incorporates item performance in the standard-setting process by
graphically presenting item difficulties. In item mapping, all the items for a given examination are ordered in
columns, with each column in the graph representing a different item difficulty. The columns of items are
ordered from easy to hard on a histogram-type graph, with very easy items toward the left end of the graph,
and very hard items toward the right end of the graph. Item difficulties in log odds units are estimated
through application of the Rasch IRT model (Wright & Stone, 1979). In order to present items on a metric
familiar to the judges, logit difficulties are converted to scaled values using the following formula: scaled
difficulty = (logit difficulty × 10) + 100. This scale usually ranges from 70 to 130.
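The conversion to the judge-friendly metric is a simple linear rescaling, sketched below under the assumption that item difficulties have already been estimated in logits.

```python
def scaled_difficulty(logit_difficulty: float) -> float:
    """Rescale a Rasch logit difficulty to the judge-friendly metric: (logit * 10) + 100."""
    return logit_difficulty * 10 + 100

# Items at -2, 0, and +2 logits map to 80, 100, and 120 on the item map.
print([scaled_difficulty(b) for b in (-2.0, 0.0, 2.0)])
```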
Rasch Hierarchical Measurement Method. In a study by Maier (2001), a hierarchical measurement model was developed that enables researchers to measure a latent trait variable and model the error variance corresponding to multiple levels. The Rasch hierarchical measurement model (HMM) results when a Rasch IRT model and a one-way ANOVA with random effects are combined; item response theory models and hierarchical linear models can be combined to model the effect of multilevel covariates on a latent trait. Through this combination, researchers may examine relationships between person-ability estimates and the person-level and contextual-level characteristics that may affect these ability estimates. Alternatively, it is also possible to model data obtained from the same individuals across repeated questionnaire administrations, which makes it possible to study the effect of person characteristics on ability estimates over time.
One benefit of item response theory is its treatment of reliability and error of measurement through item information functions, which are computed for each item (Lord, 1980). These functions provide a sound basis for choosing items in test construction. The item information function takes all item parameters into account and shows the measurement efficiency of the item at different ability levels. Another advantage of item response theory is the invariance of item parameters, which pertains to the sample-free nature of its results. In the theory, the item parameters are invariant when computed in groups of different abilities. This means that a uniform scale of measurement can be provided for use in different groups. It also means that groups, as well as individuals, can be tested with different sets of items appropriate to their ability levels, and their scores will be directly comparable (Anastasi & Urbina, 2002).
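For the Rasch model the item information function takes a particularly simple form, I(θ) = P(θ)(1 − P(θ)), which peaks at .25 where ability equals item difficulty. The sketch below computes it for a hypothetical item located at 0 logits.

```python
import numpy as np

def rasch_info(theta, beta):
    """Rasch item information: I(theta) = P(1 - P), maximal where theta == beta."""
    p = 1.0 / (1.0 + np.exp(-(theta - beta)))
    return p * (1 - p)

theta = np.linspace(-3, 3, 7)
print(np.round(rasch_info(theta, beta=0.0), 3))  # peaks at 0.25 when theta = 0
```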
Scaling Models
"Measurement is essentially concerned with the methods used to provide quantitative descriptions of the extent to which individuals manifest or possess specified characteristics" (Ghiselli, Campbell, & Zedeck, 1981, p. 2). "Measurement is the assigning of numbers to individuals in a systematic way as a means of representing properties of the individuals" (Allen & Yen, 1979, p. 2). "'Measurement' consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification)" (Nunnally & Bernstein, 1994, p. 3).
There are important aspects to consider in the process of measurement in psychometrics. First, the attribute of interest needs to be quantified; that is, numbers designate how much (or how little) of the attribute an individual possesses. Second, the attribute of interest must be quantified in a consistent and systematic way (i.e., standardization); that is, the measurement process is systematic enough that meaningful replication is possible. Finally, it is the attributes of individuals (or objects) that are measured, not the individuals per se.
Levels of Measurement
As the definition of Nunnally and Bernstein (1994) suggests, by systematically measuring the attribute of interest, individuals can be either classified or scaled with regard to that attribute. Whether one engages in classification or scaling depends in large part on the level of measurement used to assess the construct. For example, if the attribute is measured on a nominal scale of measurement, then it is only possible to classify individuals as falling into one or another mutually exclusive category (Agresti & Finlay, 2000). This is because the different categories (e.g., men versus women) represent only qualitative differences. Nominal scales are used as measures of identity (Downie & Heath, 1984). When gender is coded, such as males coded 0 and females 1, the values do not have any quantitative meaning; they are simply labels for the gender categories. At the nominal level of measurement, there are a variety of sorting techniques, in which subjects are asked to sort the stimuli into different categories based on some dimension.
Some data reflect the rank order of individuals or objects, such as a scale evaluating the beauty of a person from highest to lowest (Downie & Heath, 1984). This represents an ordinal scale of measurement, where objects are simply rank ordered. Ranking does not indicate how much more of the attribute one object has than another, but it can be determined that A has more of it than B if A is ranked higher than B. At the ordinal level of measurement, the Q-sort method, paired comparisons, Guttman's scalogram, Coombs's unfolding technique, and a variety of rating scales can be used. The major task of the subject is to rank order items from highest to lowest or from weakest to strongest.
The interval scale of measurement has equal intervals between degrees on the scale. However, the zero point on the scale is arbitrary: 0 degrees Celsius represents the point at which water freezes at sea level, so zero on the scale does not represent "true zero," which in this case would mean a complete absence of heat. In determining the area of a table, a ratio scale of measurement is used, because zero does represent "true zero."
When the construct of interest is measured at the nominal (i.e., qualitative) level of measurement, objects are only classified into categories. As a result, the types of data manipulations and statistical analyses that can be performed on the data are very limited. For descriptive statistics, it is possible to compute frequency counts or determine the modal response (i.e., category), but not much else. However, if it is at least possible to rank order the objects based on the degree to which they possess the construct of interest, then it is possible to scale the construct. In addition, higher levels of measurement allow for more in-depth statistical analyses. With ordinal data, for example, statistics such as the median, range, and interquartile range can be computed (Downie & Heath, 1984). When the data are interval level, it is possible to calculate statistics such as means, standard deviations, variances, and the various statistics of shape (e.g., skewness and kurtosis). With interval-level data, it is important to know the shape of the distribution, as different-shaped distributions imply different interpretations for statistics such as the mean and standard deviation.
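A small sketch of this point about permissible statistics, using hypothetical data at each level: counts and the mode for nominal codes, the median and interquartile range for ordinal ranks, and the mean, standard deviation, and skewness for interval scores.

```python
import numpy as np
from collections import Counter
from scipy import stats

gender = ["M", "F", "F", "M", "F"]                 # nominal: mode/frequency only
ranks = np.array([1, 2, 2, 3, 5, 4])               # ordinal: median, IQR
scores = np.array([72.0, 85.0, 90.0, 65.0, 78.0])  # interval: mean, SD, skewness

print(Counter(gender).most_common(1))              # modal category
print(np.median(ranks), np.percentile(ranks, 75) - np.percentile(ranks, 25))
print(scores.mean(), scores.std(ddof=1), stats.skew(scores))
```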
At the interval and ratio levels of measurement, there are direct estimation, the method of bisection, and Thurstone's methods of comparative and categorical judgment. With these methods, subjects are asked not only to rank order items but also to help determine the magnitude of the differences among items. With Thurstone's method of comparative judgment, subjects compare every possible pair of stimuli and select the item within the pair that is the better item for assessing the construct. Thurstone's method of categorical judgment, while less tedious for subjects when there are many stimuli to assess, in that they simply rate each stimulus (not each pair of stimuli), does require more cognitive energy for each rating provided. This is because the subject-matter expert must now estimate the actual value of the stimulus.
(1) Method of Adjustment. An experimental paradigm which allows the subject to make small adjustments to a comparison stimulus until it matches a standard stimulus. The intensity of the stimulus is adjusted until the target is just detectable.
(2) Method of Limits. The intensity is adjusted in discrete steps until the observer reports that the stimulus is just detectable.
(3) Method of Constant Stimuli. The experimenter has control of the stimuli. Several stimulus values are chosen to bracket the assumed threshold, and each stimulus is presented many times in random order. The psychometric function is derived from the proportion of detection responses.
(4) Staircase Method. Used to determine a threshold as quickly as possible; a compromise between the method of limits and the method of constant stimuli (a simple simulation appears after this list).
(5) Method of Forced Choice (2AFC). The observer must choose between two or more options. Good for cases where observers are less willing to guess.
(6) Method of Average Error. The subject is presented with a standard stimulus and then undergoes trials to match the target stimulus presented.
(7) Rank order. Requires the subject to rank stimuli from most to least with respect to some attribute of judgment or sentiment.
(8) Paired comparison. A subject is required to rank stimuli two at a time in all possible pairs.
(9) Successive categories. The subject is asked to sort a collection of stimuli into a number of distinct piles or categories, which are ordered with respect to a specified attribute.
(10) Ratio judgment. The experimenter selects a standard stimulus and a number of variable stimuli that differ quantitatively from the standard stimulus on a given characteristic. The subject selects, from the range of variable stimuli, the stimulus whose amount of the given characteristic corresponds to the ratio value.
(11) Q sort. Subjects are required to sort the stimuli into an approximately normal distribution, with it being specified how many stimuli are to be placed in each category.
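The staircase method in item (4) is straightforward to simulate. The sketch below assumes a crude all-or-none observer with a true threshold of 5.0 (a hypothetical value): the intensity steps down after each detection and up after each miss, so the track converges on and oscillates around the threshold.

```python
TRUE_THRESHOLD = 5.0   # hypothetical observer threshold, assumed for illustration
STEP = 0.5

def detected(intensity: float) -> bool:
    """Crude observer model: detection occurs whenever intensity reaches the threshold."""
    return intensity >= TRUE_THRESHOLD

intensity, track = 8.0, []
for _ in range(20):
    track.append(intensity)
    # 1-up/1-down rule: step down after a detection, step up after a miss.
    intensity += -STEP if detected(intensity) else STEP

print(track)  # descends to 5.0, then oscillates between 4.5 and 5.0
```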
Many issues arise when performing a scaling study. One important factor is who is selected to participate in the study. Many scaling studies involve some psychological (latent) dimension of people without any direct counterpart "physical" dimension. When people are scaled (psychometrics), it is typical to obtain a random sample of individuals from the population to which one wishes to generalize. In psychometrics, participants are asked to provide their individual feelings, attitudes, and/or personal ratings toward a particular topic. In doing so, one is able to determine how individuals differ on the construct of interest. With stimulus scaling, however, the researcher sums across raters within a given stimulus (e.g., a question) in order to obtain a rating of each stimulus. Once the researcher is confident that each stimulus did, in fact, tap into the construct, and has some estimate of the level at which it did so, only then should the researcher feel confident in presenting the now-scaled stimuli to a random sample of relevant participants for psychometric purposes. Thus, with psychometrics, items (i.e., stimuli) are summed across within an individual respondent in order to obtain his or her score on the construct.
The major requirement in scaling for people is that the variables should be monotonically related to each other. A relationship is monotonic if higher scores on one scale correspond to higher scores on another scale, regardless of the shape of the curve (Nunnally, 1970). In scaling for people, many items are used on a test to minimize measurement error; the specificity of items is averaged out when they are combined. By combining items, one can make relatively fine distinctions between people. The problem of scaling people with respect to attributes is then one of collapsing responses to a number of items so as to obtain one score for each person.
One variety of scaling model for people is the deterministic model, which assumes that there is no error in item trace lines. A trace line shows that a person with a high level of ability would have a probability close to 1.0 of producing a correct response. The model assumes that up to a point on the attribute the probability of response alpha is zero, and beyond that point the probability of response alpha is 1.0. Each item has a biserial correlation of 1.0 with the attribute, and consequently each item perfectly discriminates at a particular point of the attribute.

There are several varieties of scaling models for people, including Thurstone scaling, the Likert scale, the Guttman scale, and semantic differential scaling.
(1) Thurstone scaling. Some 300 judges rate 100 statements on a particular issue on an 11-point scale. A subset of statements is then shown to respondents, and their score is the mean of the ratings for the statements they select.
(2) Likert scale. Respondents are requested to state their level of agreement with a series of attitude statements. Each scale point is given a value (say, 1-5), and the person is given the score corresponding to his or her degree of agreement. Often a set of Likert items is summed to provide a total score for the attitude (a scoring sketch appears after this list).
(3) Guttman scale. This involves producing a set of statements that form a natural hierarchy. A positive answer to the item at one point on the hierarchy assumes positive answers to all the statements below it (e.g., a disability scale). This gets over the problem of identical item totals being formed by different sets of responses.
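A minimal sketch of Likert scoring as described in item (2), using hypothetical 5-point responses and one reverse-keyed statement:

```python
# Hypothetical responses of one person to four 5-point Likert items.
responses = [4, 5, 2, 4]
reverse_keyed = [False, False, True, False]  # item 3 is worded negatively

scored = [
    (6 - r) if rev else r                    # reverse-key: 1<->5, 2<->4
    for r, rev in zip(responses, reverse_keyed)
]
total = sum(scored)                          # summated attitude score
print(scored, total)                         # [4, 5, 4, 4] 17
```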
Scaling Responses
Scaling responses concerns the decisions arrived at through subjects' responses to stimuli. Such response options may include requiring participants to make comparative judgments (e.g., which is more important, A or B?), subjective evaluations (e.g., strongly agree to strongly disagree), or absolute judgments (e.g., how hot is this object?). Different response formats may well influence how one writes and edits stimuli. In addition, they may also influence how one evaluates the quality or "accuracy" of the response. For example, with absolute judgments, standards of comparison are used, especially if subjects are being asked to rate physical characteristics such as weight, height, or the intensity of sound or light. With attitudes and psychological constructs, such "standards" are hard to come by. There are a few options (e.g., Guttman's scalogram and Coombs's unfolding technique) for simultaneously scaling people and stimuli, but more often than not only one dimension is scaled at a time. Usually the stimuli are scaled first (or a well-established measure is sought) before one has confidence in scaling individuals on the stimuli.
With unidimensional scaling, as described previously, subjects are asked to respond to stimuli with regard to a particular dimension. With multidimensional scaling (MDS), however, subjects are typically asked to give just their general impression or broad rating of the similarities or differences among stimuli. Subsequent analyses, using Euclidean spatial models, "map" the products in multidimensional space. The multiple dimensions are then "discovered" or "extracted" with multivariate statistical techniques, thus establishing which dimensions the consumer is using to distinguish the products. MDS can be particularly useful when subjects are unable to articulate why they like a stimulus, yet they are confident that they prefer one stimulus to another.
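As a sketch of the "mapping" step, the following implements classical (Torgerson) metric MDS in plain NumPy: a matrix of hypothetical dissimilarity judgments among four stimuli is double-centered and eigendecomposed to recover two coordinates per stimulus. This is one simple MDS variant, not the only procedure used in practice.

```python
import numpy as np

# Hypothetical symmetric dissimilarity judgments among 4 stimuli.
D = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.5],
    [5.0, 4.0, 1.5, 0.0],
])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances

eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
k = 2
coords = eigvecs[:, order[:k]] * np.sqrt(np.maximum(eigvals[order[:k]], 0))
print(np.round(coords, 2))               # 2-D "map" of the four stimuli
```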
References
Agresti, A. & Finlay, B. (1997). Statistical methods for the social sciences (3rd ed.). New Jersey: Prentice
Hall.
Anastasi, A. & Urbina, S. (2002). Psychological testing. Prentice Hall: New York.
Andrich, D. (1998). Rasch models for measurement. Sage University: Sage Publications.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
Bejar, I. I. (1983). Subject matter experts' assessment of item statistics. Applied Psychological
Measurement, 7, 303-310.
Berk, R. A. (1996). Standard setting: the next generation. Applied Measurement in Education, 9, 215-235.
Chang, L. (1999). Judgmental item analysis of the Nedelsky and Angoff standard-setting methods. Applied
Measurement in Education, 12, 151-166.
Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA:
Wadsworth.
Dobby, J., & Duckworth, D. (1979). Objective assessment by means of item banking. Schools Council Examination Bulletin, 40, 1-10.
Downie, N.M., & Heath, R.W. (1984). Basic statistical methods (5th ed.). New York: Harper & Row
Publishers.
Fischer, G. H. (1974). Derivations of the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 15-38). New York: Springer-Verlag.
Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. New
York: W. H. Freeman.
Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing
levels for a group of borderline candidates. Applied Measurement in Education, 12, 13-28.
Grosse, M. E., & Wright, B. D. (1986). Setting, evaluating, and maintaining certification standards with the
Rasch model. Evaluation and the Health Professions, 9, 267-285.
Hambleton, R. K. (2000). Emergence of item response modeling in instrument development and data
analysis. Medical Care, 38, 60-65.
Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of Educational
Measurement, 34, 353-366.
Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in
the Angoff standard setting method. Journal of Educational Measurement, 35, 69-81.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of
Educational Research, 64, 425-461.
Kaplan, R. M. & Saccuzo, D. P. (1997). Psychological testing: Principles, applications and issues. Pacific
Grove: Brooks Cole Pub. Company.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ:
Erlbaum.
Magno, C. (2003). Relationship between attitude towards technical education and academic achievement in mathematics and science of the first and second year high school students, Caritas Don Bosco School, SY 2002-2003. Unpublished master’s thesis, Ateneo de Manila University, Quezon City, Philippines.
Maier, K. S. (2001). A Rasch hierarchical measurement model. Journal of Educational and Behavioral
Statistics, 26, 307-331.
McKinley, D. W., Newman, L. S., & Wiser, R. F. (1996, April). Using the Rasch model in the standard-setting process. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Future of selected methods.
Applied Measurement in Education, 1, 261-275.
National Academy of Education (1993). Setting performance standards for student achievement. Stanford,
CA: Author.
National Research Council (1999). Setting reasonable and useful performance standards. In J. W.
Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.), Grading the nation's report card: Evaluating NAEP and
transforming the assessment of educational progress (pp. 162-184). Washington, DC: National Academy
Press.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.) New York: McGraw-Hill.
Plake, B. S. (1998). Setting performance standards for professional licensure and certification. Applied
Measurement in Education, 11, 65-80.
Reid, J. B. (1991). Training judges to generate standard-setting data. Educational Measurement: Issues
and Practice, 10, 11-14.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark:
Danish Institute for Educational Research.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of
scientific statements. In G. M. Copenhagen (ed.). The Danish yearbook of philosophy (pp.58-94).
Munksgaard.
Shepard, L. A. (1995). Implications for standard setting of the national academy of education evaluation of
the national assessment of educational progress achievement levels. Proceedings of Joint Conference on
Standard Setting for Large-Scale Assessments (pp. 143-160). Washington, DC: The National Assessment
Governing Board (NAGB) and the National Center for Education Statistics (NCES).
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Wang, N. (2003). Use of the Rasch IRT model in standard setting: An item-mapping method. Journal of Educational Measurement, 40, 231.
Van der Linden, W. J., & Hambleton, R. K. (2004). Item response theory: Brief history, common models, and extension. New York: McGraw-Hill.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.
Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.
Exercise:
Calibrate the item difficulty and person ability of the scores in a Reading Comprehension test with 19 items administered to 15 Korean students. After performing the Rasch model analysis, determine item difficulty using the classical test theory approach. Compare the results.
Case  Items 1-19 (1 = correct, 0 = incorrect)
      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
A 0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 1 0 1
B 0 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0
C 0 0 0 1 0 1 0 0 0 0 1 1 0 1 1 0 1 0 1
D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
E 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
F 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 0
G 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1
H 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1
I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1
K 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0
L 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1
M 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1
N 0 0 1 1 1 1 1 1 1 0 0 1 0 1 0 0 1 0 1
O 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1
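The next lesson describes the “Analysis of Test Data” program bundled with the book; for readers working outside it, the following Python sketch computes the classical item difficulties (proportion correct) and approximates the Rasch item difficulties with the PROX normal-approximation method described by Wright and Stone (1979). A full calibration would iterate beyond PROX, so treat the output as a first approximation for comparing the two approaches.

import numpy as np

# 15 students (A-O) x 19 items, 1 = correct (the table above).
data = np.array([
    [0,1,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1,0,1],  # A
    [0,0,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0],  # B
    [0,0,0,1,0,1,0,0,0,0,1,1,0,1,1,0,1,0,1],  # C
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1],  # D
    [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],  # E
    [0,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,1,1,0],  # F
    [1,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,1],  # G
    [0,0,1,1,0,1,0,0,0,0,0,1,1,0,0,0,0,1,1],  # H
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],  # I (zero score)
    [0,0,1,1,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1],  # J
    [1,0,0,1,0,0,0,0,0,1,0,1,1,1,0,1,0,1,0],  # K
    [0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1],  # L
    [0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,1,0,1],  # M
    [0,0,1,1,1,1,1,1,1,0,0,1,0,1,0,0,1,0,1],  # N
    [0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,1],  # O
])

# Classical test theory: item difficulty = proportion of examinees correct.
ctt_p = data.mean(axis=0)

# Rasch PROX: drop zero- and perfect-score persons (they have no finite
# logit), convert item and person scores to logits, then expand for spread.
r = data.sum(axis=1)
kept = data[(r > 0) & (r < data.shape[1])]
n, L = kept.shape

p_i = kept.mean(axis=0)
x = np.log((1 - p_i) / p_i)                             # item logits (higher = harder)
y = np.log(kept.sum(axis=1) / (L - kept.sum(axis=1)))   # person logits

U, V = x.var(), y.var()
item_expand = np.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))
person_expand = np.sqrt((1 + U / 2.89) / (1 - U * V / 8.35))

rasch_d = item_expand * (x - x.mean())   # item difficulty in logits
rasch_b = person_expand * y              # person ability (mean item difficulty = 0)

for i in range(L):
    print(f"Item {i+1:2d}:  CTT p = {ctt_p[i]:.2f}   Rasch d = {rasch_d[i]:+.2f}")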
References
Anastasi, A. & Urbina, S. (2002). Psychological testing (7th ed.). NJ: Prentice Hall.
DiLeonardi, J. W., & Curtis, P. A. (1992). What to do when the numbers are in: A user’s guide to statistical data analysis in the human services. Chicago, IL: Nelson-Hall.
Magno, C. (2007). Exploratory and confirmatory factor analysis of parental closeness and
multidimensional scaling with other parenting models. The Guidance Journal, 36, 63-89.
Magno, C., Lynn, J., Lee, K., & Kho, R. (in press). Parents’ school-related behavior: Getting involved with a grade school and college child. The Guidance Journal.
Magno, C., Tangco, N., & Tan, C. (2007). The role of metacognitive skills in developing critical thinking. Paper presented at the Asian Association of Social Psychology conference, Universiti Malaysia, Kota Kinabalu, Sabah, Malaysia, July 25-28.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen,
Denmark: Danish Institute for Educational Research.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality
and validity of scientific statements. In G. M. Copenhagen (ed.). The Danish yearbook of
philosophy (pp.58-94). Munksgaard.
Van der Linden, W. J., & Hambleton, R. K. (2004). Item response theory: Brief history, common models, and extension. New York: McGraw-Hill.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA
Press.
Lesson 4
Using Computer Software in Analyzing Test Items
The “Analysis of Test Data” program that comes with this book is used to analyze test data and to estimate the reliability and validity of a test.
How to install:
3. Click the “Change Directory” button and choose the folder where you want to install the program.
8. A dialogue box may appear saying “The destination file is in use. Please ensure that all applications are closed.” Press the “Ignore” button, then click the “Yes” button to continue.
Look for the program “Analysis of Test Data” and click to open it.
The menu will appear first. Start by clicking the statistical analysis you would like to perform.
Note: The program can only input and analyze up to 500 respondents and up to 100 test items.
The Test-Retest button lets you analyze the test-retest reliability of the test data. This is used to find the reliability of test scores by repeating the same test on a second occasion.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
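For readers who want to verify the computation by hand, test-retest reliability is the Pearson correlation between the two administrations. A minimal Python sketch with hypothetical scores:

import numpy as np

# Hypothetical scores of eight students on the same test given twice.
first  = np.array([12, 15, 9, 20, 14, 11, 18, 16])
second = np.array([13, 14, 10, 19, 15, 10, 17, 18])

# The Pearson r between administrations is the coefficient of stability.
r_tt = np.corrcoef(first, second)[0, 1]
print(f"Test-retest reliability: {r_tt:.2f}")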
The Split-Half button allows you to analyze the split-half reliability of the test data. This is used when two sets of scores are obtained for each participant who took a test by dividing the test into equivalent halves.
6. Make sure that a blinking cursor appears in the cell where you want to put the score.
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
9. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
10. Once you are done entering the scores, click the “Proceed with the Analysis” button located above the screen or the button located below.
11. Select the item number, then click the arrow button for the set in which you want to place the selected item. Make sure that the selected item number is shaded gray before clicking the arrow sign.
12. If you want to remove an item from a set, select the item number you want to move (again making sure it is shaded gray), and then click the reverse arrow button.
13. Click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
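Computationally, split-half reliability is the correlation between the two half-test scores, stepped up with the Spearman-Brown formula to estimate the reliability of the full-length test. A short Python sketch using a hypothetical odd-even split (the program lets you assign items to sets yourself):

import numpy as np

# Hypothetical item responses (1/0) of six students on a 10-item test.
items = np.array([
    [1,1,1,1,1,1,1,1,1,1],
    [1,1,0,1,1,0,1,1,0,1],
    [1,0,1,0,1,1,0,1,1,0],
    [0,1,0,1,0,0,1,0,0,1],
    [0,0,1,0,0,1,0,0,1,0],
    [0,0,0,0,0,0,0,0,0,0],
])

odd  = items[:, 0::2].sum(axis=1)   # scores on items 1, 3, 5, ...
even = items[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, ...

r_half = np.corrcoef(odd, even)[0, 1]
# Spearman-Brown correction: reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, corrected = {r_full:.2f}")  # 0.71 and 0.83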
The Parallel Form button allows you to use the parallel-form technique to compute the reliability of test data. This is used when a person can be tested with one form on the first occasion and with another, equivalent form on the second.
5. Make sure that a blinking cursor appears in the cell where you want to put the score.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
7. If you want to add cases, enter the number of cases you want to add in the text box provided, and then press the “Enter” key to show the added cases.
8. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
The Cronbach’s Alpha button allows you to analyze the Cronbach’s alpha reliability of the test data. This is used to determine the internal consistency of participants’ responses to all items in the test.
6. Make sure that a blinking cursor appears in the cell where you want to put the score.
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
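The statistic itself is straightforward to compute: alpha = (k / (k - 1)) x (1 - sum of item variances / variance of total scores). A minimal Python sketch with hypothetical Likert-type data:

import numpy as np

# Hypothetical responses of six students to five Likert-type items (1-5).
X = np.array([
    [4, 5, 4, 4, 5],
    [3, 3, 2, 3, 3],
    [5, 5, 4, 5, 4],
    [2, 1, 2, 2, 1],
    [4, 4, 5, 4, 4],
    [3, 2, 3, 2, 3],
])

k = X.shape[1]
item_var = X.var(axis=0, ddof=1)          # sample variance of each item
total_var = X.sum(axis=1).var(ddof=1)     # sample variance of total scores
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")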
The Kuder-Richardson button allows you to analyze the Kuder-Richardson reliability of the test data. This is used to determine the internal consistency of participants’ binary responses to all items in the test.
6. Correct answers should be marked “1” and wrong answers “0” in each cell.
7. Make sure that a blinking cursor appears in the cell where you want to put the score.
8. To move the cursor to other cells, press the “Tab” key or simply click the cell.
9. If you want to add cases or items, click the corresponding button to make the necessary changes. Click the “Yes” button, enter the number of cases or items you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
10. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
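For binary data the corresponding index is KR-20, which replaces the item variances with p(1 - p), where p is the proportion answering the item correctly. A minimal Python sketch with hypothetical data (population variances are used throughout; some texts use n - 1, which changes the value slightly):

import numpy as np

# Hypothetical binary responses (1 = correct) of six students to six items.
X = np.array([
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
])

k = X.shape[1]
p = X.mean(axis=0)                      # proportion correct per item
pq = (p * (1 - p)).sum()                # sum of binary item variances
total_var = X.sum(axis=1).var()         # population variance of total scores
kr20 = (k / (k - 1)) * (1 - pq / total_var)
print(f"KR-20 = {kr20:.2f}")            # about 0.79 for these data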
The Interrater button allows you to analyze the interrater reliability of the test data. This is used to determine the concordance of raters’ scores.
6. Make sure that a blinking cursor appears in the cell where you want to put the score.
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
8. If you want to add cases, click the corresponding button to make the necessary changes. Click the “Yes” button, enter the number of cases or raters you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
9. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
10. Once you are done putting in all the scores and completing the cases, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
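The index the program reports is not shown in this excerpt; one common index of concordance among several raters is Kendall's coefficient of concordance (W), sketched in Python below under the assumption of no tied ranks (ties require a correction term):

import numpy as np
from scipy.stats import rankdata

# Hypothetical scores given by 3 raters (rows) to 5 examinees (columns).
scores = np.array([
    [78, 85, 90, 70, 88],
    [75, 80, 92, 68, 85],
    [80, 84, 89, 72, 90],
])

m, n = scores.shape
ranks = np.apply_along_axis(rankdata, 1, scores)   # rank each rater's scores
R = ranks.sum(axis=0)                              # rank sum per examinee
S = ((R - R.mean()) ** 2).sum()
W = 12 * S / (m ** 2 * (n ** 3 - n))               # 0 = no agreement, 1 = perfect
print(f"Kendall's W = {W:.2f}")                    # near 1 for these raters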
The Criterion-Reference button lets you analyze the criterion-related validity of the test data. This is used to indicate the effectiveness of a test in predicting an individual’s performance in the future.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
The Concurrent button lets you analyze the concurrent validity of the test data. This is used when both test measures are obtained at the same time and their relationship coincides with a theory.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
The Convergent and Discriminant buttons let you analyze the convergent and discriminant validity of the test data. These are used to show that the test correlates with variables with which it should theoretically correlate (convergent) and does not correlate with variables from which it should differ (discriminant, labeled “Divergent” in the program).
Once the “Convergent” or “Divergent” button is clicked, three submenu buttons will appear.
5. Make sure that a blinking cursor appears in the cell where you want to put the score.
6. Enter the total scores of each participant in each column per group.
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
9. If you want to reduce the number of cases, click the corresponding button to go back to the submenu and do the steps again.
10. Once you are done putting in all the scores and completing the cases, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
SAMPLE INPUT for Convergent and Discriminant Validity: Comparing 2 Dependent Groups
Analysis:
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
8. If you want to add cases, enter the number of participants you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
9. If you want to reduce the number of cases, click the corresponding button to go back to the submenu and do the steps again.
10. Once you are done putting in all the scores and completing the cases, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
SAMPLE INPUT for Convergent and Discriminant Validity: Comparing Two Variables
Analysis:
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
7. If you want to add cases, click the corresponding button to make the necessary changes. Click the “Yes” button, enter the number of participants you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
8. If you want to reduce the number of cases, click the corresponding button to go back to the submenu and do the steps again.
9. Once you are done putting in all the scores and completing the cases, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
SAMPLE RESULTS for Convergent and Discriminant Validity: Comparing Two Variables
Analysis:
The Item Analysis button uses the classical test theory approach in determining item difficulty and item discrimination based on the proportions of the high and low groups.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
7. If you want to add cases, click the corresponding button to make the necessary changes. Click the “Yes” button, enter the number of participants you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
8. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
9. Once you are done putting in all the scores and completing the cases, click the “Start Analysis” button.
To save the results:
2. Press the folder icon to choose the folder where you want to save the file.
To print the results:
2. Click the “Print” icon on the upper left of the screen.
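If you want to reproduce the high-low group computation yourself, here is a Python sketch under a common convention (not necessarily the program's exact rule): the upper and lower groups are the top and bottom 27% of total scorers, difficulty is the overall proportion correct, and discrimination is the difference between the groups' proportions correct.

import numpy as np

# Hypothetical binary responses (rows = examinees, columns = items).
X = np.array([
    [1,1,0,1,0], [1,1,1,1,0], [1,0,0,0,0], [1,1,1,0,1],
    [0,1,0,0,0], [1,1,1,1,1], [0,0,0,0,0], [1,1,0,1,1],
    [1,0,1,0,0], [0,1,0,1,0],
])

total = X.sum(axis=1)
order = np.argsort(total)                  # examinees sorted by total score
k = max(1, int(round(0.27 * len(X))))      # size of upper/lower groups
low, high = X[order[:k]], X[order[-k:]]

difficulty = X.mean(axis=0)                               # p-value per item
discrimination = high.mean(axis=0) - low.mean(axis=0)     # D index per item

for i, (p, d) in enumerate(zip(difficulty, discrimination), start=1):
    print(f"Item {i}: difficulty = {p:.2f}, discrimination = {d:+.2f}")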
NOTE:
Make sure that you have encoded the data in Microsoft Excel XP or a lower version, using the correct format, for the file to be opened by the program.
2. When the upload button is clicked, the data will be entered into the cases in the program. Make sure that the sheet name matches the type of analysis being used (check “How to encode and save data in Microsoft Excel” for the template guide).
3. Make sure that you know how to use the Excel format. Check the folder where you saved and installed the program to familiarize yourself with the given templates.
2. Five files will appear on the screen. Double-click the “Template” Excel file to view the formats required in encoding data in Microsoft Excel. Make sure to follow the formats given so that your encoded data can be opened using the “Analysis of Test Data” program.
3. After clicking the “Template” file, a Microsoft Excel file with multiple sheets will open.
4. Make sure to save your encoded data with the correct sheet name in Microsoft Excel.
Click the corresponding sheet tab at the bottom of the screen to view the correct template for encoding the data to be analyzed with each procedure:
- Cronbach’s Alpha reliability analysis;
- Interrater reliability analysis;
- Kuder-Richardson reliability analysis;
- Split-Half reliability analysis;
- Test-Retest reliability, Parallel Form reliability, Criterion-reference validity, Concurrent validity, Convergent (Correlate 2 Variables), and Divergent (Correlate 2 Variables), which share one template;
- Comparing Two Independent Groups validity analysis;
- Comparing Two Dependent Groups validity analysis;
- the Item Analysis procedure.
Chapter 4
Developing a Teacher-Made Test
Objectives
1. Explain the theories and concepts that rationalize the practice of assessment.
2. Make a table of specifications of the test items.
3. Design pen-and-paper tests that are aligned to the learning intents.
4. Justify the advantages and disadvantages of any pen-and-paper test.
5. Evaluate the test items according to the guidelines.
Lessons
Lesson 1
The Test Blueprint
To make sure that you have these domains accounted for in your assessment design, engage in making a table of specifications, one that will allow you to explicitly indicate what content to cover in your test, which knowledge dimensions to focus on, and which cognitive processes to pay attention to.
The Table of Specifications is a matrix where the rows consist of the specific topics or skills (content or skill areas) and the columns are the learning behaviors or competencies that we
desire to measure. Although we can also add more elements in the matrix, such as Test
Placement, Equivalent Points, or Percent values of items, the conventional prototype table of
specifications may look like this:
[Prototype table of specifications (figure): rows are content or skill areas; columns are cognitive processes such as Knowledge and Application. The TOTAL row in the example reads 2, 11, 7, and 20, with callouts identifying the number of items measuring Knowledge, the number of items on solving linear equations measuring Application, and the total number of test items.]
As you have seen in the above table of specifications, only three cognitive processes are
indicated. This means that if you use the old Bloom’s taxonomy of behavioral objectives, include
only those levels that you wish to measure in the test, although it is recommended that more than
a single process be measured in a test, depending, of course, on your purpose of testing.
As a test blueprint, the table of specifications ensures that the teacher sees all the essential details of testing and measuring student learning. It assures the teacher that the content areas (or skill areas) and the levels of behavior in which learning is expected to be anchored are measured. The
test’s degree of difficulty may also be seen in the table of specifications. When the distribution of
test items is concentrated in the higher-order cognitive behaviors (analysis, synthesis,
evaluation), the test’s difficulty level is higher as compared to when the items are concentrated in
the lower-order cognitive behaviors (knowledge, comprehension, application).
As you have learned in Chapter 2 of this book, there are many taxonomic tools that may
be used in our instructional planning. The taxonomic tool for planning the test should be consistent with the taxonomy of learning objectives used in the overall instructional plan.
Understandably, designing the table of specifications using any taxonomic tool will require a
little of our time, effort, and other personal and motivational resources. Before we may be
tempted to develop pen-and-paper test items without first preparing our table of specifications,
and run the risk of not actually evaluating our students on the basis of our learning intents, we
need to first brush up on our understanding of the instrumental function of the table of
specifications as a blueprint for our test, and convince ourselves that this is an important process
in any test development activity. In developing the table of specifications, we suggest that you do
not yet think of the types of pen-and-paper test you wish to give. Instead, just focus on planning your test in terms of your assessment domain.
Table of Specifications
The Table of Specifications (TOS) is a blueprint for selecting appropriate test items. The TOS can be one-grid or two-grid. A one-grid table of specifications only allows one to indicate the number of items in a test across the competencies or topics. In a two-grid TOS, the percentages of the cognitive skills and the time frame for each topic or competency are also indicated.
Content is shown on one axis and the cognitive (and/or affective) domain on the other:

                Cognitive Domain
Content    Knowledge   Comprehension   Application
I.
II.
III.
Example

Weight (Time Frame)   Content Outline                     Knowledge 30%   Comprehension 40%   Application 30%   No. of items by content area
35%                   1. Table of specifications          1               4                   4                 9
30%                   2. Test and item characteristics    2               3                   3                 8
10%                   3. Test layout                      1               1                   0                 2
5%                    4. Test instructions                0               1                   0                 1
5%                    5. Reproducing the test             1               0                   0                 1
5%                    6. Test length                      1               0                   1                 2
10%                   7. Scoring the test                 2               1                   0                 3
                      TOTAL                               8               10                  8                 26
Number of items per cell = (given time for the topic / total time) x percentage of cognitive skill x total number of items
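As a worked illustration of this formula, here is a short Python sketch using the weights and 26-item total from the example table above. Because of rounding, the computed cells will not exactly match the book's example (which reads 1, 4, 4 for the first topic); in practice the teacher adjusts a few cells by judgment.

topics = {
    "Table of specifications": 0.35,
    "Test and item characteristics": 0.30,
    "Test layout": 0.10,
    "Test instructions": 0.05,
    "Reproducing the test": 0.05,
    "Test length": 0.05,
    "Scoring the test": 0.10,
}
cognitive = {"Knowledge": 0.30, "Comprehension": 0.40, "Application": 0.30}
total_items = 26

for topic, w in topics.items():
    # Items per cell = topic weight x cognitive-skill weight x total items.
    cells = {c: round(w * cw * total_items) for c, cw in cognitive.items()}
    print(f"{topic}: {cells} (topic total ~ {round(w * total_items)})")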
Test Length
- The test must be of sufficient length to yield reliable scores.
- The longer the test, the more reliable the results. This also supports the validity of the test, because a test cannot be valid unless it is reliable.
- For grade school, one must consider the stamina and attention span of the pupils.
- The test should be long enough to be adequately reliable and short enough to be administered in the time available.
Test Instructions
- It is the function of the test instructions to furnish the learning experiences needed to enable each examinee to understand clearly what he is being asked to do.
- Instructions may be oral or written; a combination of written and oral instructions is probably desirable, except with very young children.
- Instructions should be clear, concise, and specific.
Test Layout
- The arrangement of the test items influences the speed and accuracy of the examinee.
- Utilize the space available while retaining readability.
- Items of the same type should be grouped together.
- Arrange test items from easiest to most difficult as a means of reducing test anxiety.
- The test should be ordered first by type, then by content.
- Each item should be completed in the column and page in which it is started.
- If reference material is needed, it should occur on the same page as the item.
- If you are using numbers to identify items, it is better to use letters for the options.
Lesson 2
Designing Selected-Response Items
When you are done with your test blueprint, you are now ready to start developing your
test items. For this phase of test development, you will need to decide what types or methods of
pen-and-paper assessment you wish to design. To aid you in this process, we will now discuss
some of the common pen-and-paper types of test and the basic guidelines in the formulation of
items for each type.
In deciding about the assessment method to use for a pen-and-paper test, you choose which among the selected-response or the constructed-response types would be appropriate for your blueprint. Selected-response tests use item types that require test takers to respond by choosing an option from a list of alternatives. Common types of selected-response tests are binary-choice, multiple-choice, and matching tests.
Binary-choice Items
The binary-choice test offers students the opportunity to choose between two options for
an answer. The items must be responded to by choosing one of two categorically distinct
alternatives. The true-false test is an example of this type of selected-response test. This type of
selected-response test typically contains short statements that represent less complex
propositions, and therefore, is efficient in assessing certain levels of students’ learning in a
reasonably short period of testing time. In addition to this, a binary-choice test may cover a wider
content area in a brief assessment session (Popham, 2005).
To assist you in developing binary-choice items, here are some guidelines with brief
descriptions of each. These guidelines may not capture everything that you need to be mindful of
in developing teacher-made tests. These are just the basics of what you need to know. It is
important that you also explore on other aspects of test development, including the context in
which the test is to be used, among others.
Make the Instructions Explicit. Basic in a pen-and-paper test is that the instructions indicate the task that students need to do and the credit they can earn from every correct answer. However, there is one more thing you need to indicate in your instructions for a binary-choice test
– the reference of validity or reference of truth. When you ask your students to judge whether the
statement is true or false, correct or erroneous, or valid or invalid, you need to state the reference
of truth or correctness of a response. If the reference is a reading article, textbook, teacher’s
lecture, class discussion, or resource person, state it in your instructions. This will help students
think contextually and on-track. This will also help you cluster your items according to specific
domain or context. Also, it can minimize the problem of conflict of information, such as one
resource material says this and one person (maybe your student’s parent or another teacher) says
otherwise. For items that vary in context and reference of truth, state the reference in the item
itself. For example, if the item is drawn from a person’s opinion, such as the principal’s speech,
or a guest speaker’s ideas, it is important that you attribute the opinion to its source. Lastly,
although not a must, it might be nicer to use “please” and “thank you” in our test instructions.
State the item as either definitely true or false. Statements must be categorically true or
false, and not conditionally so. It should clearly communicate the quality of the idea or concept
as to whether it is true, correct, and valid or false, erroneous, and invalid. Make sure that it
clearly corresponds to the reference of validity and that the context is explicit. For the
quality to be categorical, it must invite only a judgment of contradictories, not contraries. For
example, white or not white implies a contradiction because one idea is a denial of the other. To
say black or white indicates opposing ideas that imply values between them, such as gray. A
good item is one that implies only contradictory, mutually exclusive qualities, that is, either true
or false, and it does not need further qualification in order to make it true or false.
Keep the statements short, simple, but comprehensible. In formulating binary-choice
items, it is wise to consider brevity in the statement. Good binary-choice items are concisely
written so that they present the ideas clearly but avoid extraneous materials. Making the
statements too long is risky in that it might unintentionally indicate clues that will make your
statement obviously true or false. There is actually no clear-cut rule for brevity. It is usually left
to the teacher’s judgment. In preparing the whole binary-choice test, it is also important that all
the items or statements maintain relatively the same length. For a statement to be
comprehensible, it must make a clear sense of the ideas or concepts on focus, which is usually
lost when a teacher lifts a statement from a book and uses it as a test item statement.
Do away with tricks. We remember that the purpose of assessing our students’ learning
is based on the assessment objectives we set. Clearly, solving tricks is remote from, if not totally excluded in, our intents. Therefore, we need to avoid using tricks, such as using double negatives
in the statement or switching keys. The use of double-negative statements is a logical trickery
because the “valence” of the statement is still maintained, not altered. These statements are
usually puzzling, and will therefore take more time for students to understand. Switching keys is
when you ask students to answer “false” if the statement is true, or “true” if the statement is
false. This is obviously an unjustifiable trick. By all means, we have to avoid using any kind of
tricks not only in binary-choice tests but also in all other types and methods of assessment.
Get rid of those clues. Clues come in different forms. One of the common clues that can weaken the validity and reliability of our assessment comes from our use of certain words, such as those that denote universal quantity or definite degree (i.e., all, everyone, always, none, nobody, never, etc.). Statements with these words are usually false because it is almost always wrong to say that one instance applies to all sorts of things. Other verbal clues may come from the use of terms that denote indefinite degree (i.e., some, few, long time, many years, regularly, frequently, etc.). These words do not actually indicate a definite quantity or degree and thus violate the rule on definiteness of quality stated earlier. Other clues may come from the way statements are arranged
according to the key, such as alternating items that are true and false, or any other style of
placing the items in a systematic and predictable order. This should be avoided because once the
students notice the pattern, they are not likely to read the items anymore. Instead, they respond to
all items mindlessly but obtain high scores.
Basic in test development is our mindful tracking of our purpose. Binary-choice items can be a useful tool for assessing learning intents that are drawn from various types of knowledge, but they tap only simpler cognitive processes. In this test, students only recall their understanding of the subject matter covered in the assessment domain. They do not manipulate this knowledge by using more complex, deeper cognitive strategies and processes.
Another important point to consider in deciding whether to use the binary-choice test is its degree of difficulty. Because this type of test offers only two options, the chance that a student chooses the correct option is 50%; the remainder is the chance of choosing the wrong option. This 50-50 probability of selecting the correct answer is problematic because the chance of answering a question correctly is high even if the student is not quite sure of his understanding. One way of reducing the likelihood of guessing the right option, suggested by Popham (2004), is to include more items: even if students are successful in their guesswork on a 10-item binary-choice test, it is nearly impossible to maintain this success with, let us say, a 30-item test.
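The arithmetic behind this advice is binomial. The following Python sketch (the 70% cutoff is just an illustrative passing mark, not from the text) computes the chance of scoring at least 70% purely by guessing on tests of different lengths:

from math import comb

def p_at_least(n, k, p=0.5):
    """Probability of k or more successes in n independent guesses."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of getting at least 70% correct by blind guessing:
print(f"10 items: {p_at_least(10, 7):.3f}")   # about 0.172
print(f"30 items: {p_at_least(30, 21):.3f}")  # about 0.021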
Guidelines in Writing Binary-Type Items
2. Base true-false items upon statements that are absolutely true or false, without qualifications
or exceptions.
FAULTY: World War II was fought in Europe and the Far East.
IMPROVED: The primary combat locations in terms of military personnel during World War II
were Europe and the Far East.
3. Avoid negatively stated items when possible and eliminate all double negatives.
FAULTY: It is not frequently observed that copper turns green as a result of oxidation.
IMPROVED: Copper will turn green upon oxidizing.
4. Use quantitative and precise rather than qualitative language where possible.
FAULTY: Many people voted for Gloria Arroyo in the 2003 Presidential election.
IMPROVED: Gloria Arroyo received more than 60 percent of the popular votes cast in the
Presidential election of 2003.
6. Avoid making the true items consistently longer than the false items.
FAULTY: According to some peripatetic politicos, the raison d’etre for capital punishment is
retribution.
IMPROVED: According to some politicians, justification for the existence of capital punishment
can be traced to the Biblical statement, “An eye for an eye.”
FAULTY: Jane Austen, an American novelist born in 1790, was a prolific writer and is best
known for her novel Pride and Prejudice, which was published in 1820.
IMPROVED: Jane Austen is best known for her novel Pride and prejudice.
9. It is suggested that the crucial elements of an item be placed at the end of the statement.
FAULTY: Oxygen reduction occurs more readily because carbon monoxide combines with
hemoglobin faster than oxygen does.
IMPROVED: Carbon monoxide poisoning occurs because carbon monoxide dissolves delicate
lung tissue.
Multiple-choice Items
The multiple-choice test is another selected-response type where students respond to every item by choosing one option among a set of three to five alternatives. The item begins with a stem followed by the options or alternatives. This type of pen-and-paper test has been widely used in national achievement tests and other high-stakes assessments, such as the professional board examinations. Perhaps this is because the multiple-choice test is capable of measuring a range of knowledge and cognitive skills, obviously more than what other types of objective tests can do.
A multiple-choice test may come in two types. The correct-answer type is one whose
items pose specific problems with only one correct answer from the list of alternatives. For
instance, if a stem is followed by four alternatives, only one of them is correct (the keyed
answer), and the other three are incorrect. In this type of multiple-choice test, all items should be
designed in this fashion. The other is the best-answer type where the stem establishes a problem
to be answered by choosing one best option. Understandably, the other options are acceptable but
not necessarily the best alternatives to answer the problem posed in the stem. In this type of
multiple-choice test, only one option is the best answer (keyed answer), and the others may all be
conditionally acceptable, or some are acceptable while others are totally incorrect.
To guide you in formulating good multiple-choice items, here are some fundamental
guidelines that will be helpful in going through the process.
Make the instructions explicit. When giving a multiple-choice test, the instructions must indicate the content area or context, the way students respond to every item, and the scoring. If you are using the correct-answer type, it is helpful to students if your instructions state that they should “choose the correct answer.” Common sense should tell us not to use this expression when our multiple-choice test is of the best-answer type; “choose the best answer” would be more appropriate. Lastly, you may want to say “please” and “thank you.”
Formulate a problem. As mentioned above, every item in a multiple-choice test has a
stem and a set of alternatives. The stem should clearly formulate a problem. This is to compel
students to respond to it by choosing one option that will correctly answer the problem or best
address it. There are two ways of posing a problem in the stem of multiple-choice test. One way
is by formulating a question or and interrogative statement. If the stem is “In what year did the
first EDSA revolution happen?” it clearly poses a problem to be answered than “The year when
the first EDSA revolution happen.” The other way to pose a problem in the stem is by
formulating an incomplete sentence where one of the options correctly or best completes it. It
may be phrased as “The first EDSA revolution happened in the year” then the statement is
followed by the list of alternatives. As you will also learn about completion types in the
subsequent section of this chapter, when you use the incomplete sentence format to pose a
problem in the stem of a multiple-choice item, always remove a keyword at the end of the
statement, or at least near the end. If the keyword is at the end of the statement, you don’t end
with any punctuation mark or a blank space. If the missing keyword is near the end of the
statement but not necessarily the last word, replace that keyword which you removed with an
underlined blank space, and end your statement with an appropriate punctuation mark.
State the stem in positive form. Ask yourself: how reasonable is it to state your item’s stem in a negative form? How important is assessing students’ ability to deal with the “negatives” in your test? You will surely struggle to find a good answer that justifies the use of negative statements in your multiple-choice test.
One of the common problems we encounter in a negatively phrased stem is the high chance of not spotting the word that carries the negation (e.g., not). Another is the difficulty of anchoring the negative item to the learning intent. In general, “which one is” will work more effectively in assessing students’ learning than “which one is not.” The rule of thumb is to avoid the use of negative statements unless there is a really compelling reason why you need to phrase your stem in a negative form. If the reason is compelling enough, you need to highlight the word that carries the negation, for example by underlining, capitalizing, bolding, or italicizing it (e.g., “NOT”).
Include only useful alternatives. Remember that the set of alternatives following the stem
is a list of options from which students pick out their response. In any type of multiple-choice
test, only one alternative is keyed, and the rest are distractors. The keyed alternative is ultimately
useful because it is what we expect every student who learned the subject matter should choose.
If the set of alternatives does not contain the expected answer, it is clearly a bad item. This
problem is more dreadful in a correct-answer type than in the best-answer type. At least for the
latter, the second best alternative can stand as the key if the best answer is missing in the list. If
the correct answer is missing from the list of options in a correct-answer type, then there is really no answer to the problem posed in the stem, and the item must be removed from the test.
Even if the distractors are not the expected answers, they serve an important function in the multiple-choice test. As distractors, they should distract those students who have not learned the subject matter well enough, but not those who have. Therefore, these distractors should be plausible, that is, appear as if they are the correct or best options. Plausible distractors work in a multiple-choice test by making students believe that these distractors are the correct or best answer even if they are actually not. An important consideration in dealing with the alternatives is maintaining a homogeneous grouping. For example, if a stem asks about the name of a particular inventor in science, all alternatives should be names of scientific inventors.
As stated above, a multiple-choice item should have three to five alternatives. Choosing
to include 3, or 4, or 5, depends on the grade level or year level of the class of students you are
handling. We suggest that higher grade- or year-level students be given items with more than 3
options as this will increase the level of test difficulty and reduce the effects of guessing on your
assessment of students’ learning. In instances when you wish students to evaluate each option as to its plausibility, you may add the option “none of the above” as the fourth or fifth alternative.
However, you have to use this alternative with caution. Use it only for correct-answer type of
multiple-choice test and when you intend to increase the difficulty of an item and that the
presence of this option will help you come up with a better inference of your students’ learning.
Let us say, for example, you are testing computational skills of your students using multiple-
choice items and you encourage mental computations as they deal with the item. If you give
them only number options, they may just choose any one option based on simple estimation,
believing that one of them is the correct answer. Adding the “none of the above” option will
encourage students to do mental computation to check on each option’s correctness, because they
know it is possible that the correct answer is not in the list. Obviously, you cannot use this option
in a best-answer type of multiple-choice test.
The option “all of the above” should never be used at all, as it invites guessing that works in your students’ favor. If your last option (4th or 5th) is “all of the above” and your students
notice at least 2 options that are correct, they are likely to guess that “all of the above” is the
correct option. Similarly, if they spot one incorrect option, automatically they disregard the “all
of the above” option. When they do so, the item’s difficulty is reduced.
One of the instances when teachers are tempted to unreasonably use “none of the above” or to include the “all of the above” option even if it is not allowed is when they force themselves to maintain the same number of alternatives for all their multiple-choice items. In this case, they use these alternatives as “fillers” when they run out of options. To avoid this mistake, it is important to realize that, for classroom testing purposes, multiple-choice items do not have to come with the same number of options for all items. It is okay to have some items with four options while some other items have five.
Scatter the positions of keyed answers. In formulating your multiple-choice items, spread
the keyed answers to different response positions (i.e., a, b, c, d, and e). Make sure the number of
items whose keyed answer is “a” is proportional to the items keyed for each of the other response
positions. Better yet, if you give a 20-item multiple-choice test with 4 options per item, key five
items to each response position (25% of items per response position or approximately so).
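One simple way to produce such a balanced, unpredictable key is to build it programmatically. A minimal Python sketch (an illustration, not a required procedure):

import random

def balanced_key(n_items, options="abcd", seed=None):
    """Build an answer key with keyed positions spread evenly, then shuffled."""
    rng = random.Random(seed)
    key = [options[i % len(options)] for i in range(n_items)]  # equal counts
    rng.shuffle(key)                                           # random order
    return key

print(balanced_key(20, seed=1))   # five a's, b's, c's, and d's, in random order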
The good thing about the multiple-choice test is that it is capable of measuring skills higher than just recall or simple comprehension. If properly formulated, the test can measure higher-level thinking (Airasian, 2000). Also, the fact that every item in a multiple-choice test is followed by more than two response options gives it its reputation for a higher difficulty level, because the probability that a blindly chosen option is correct becomes smaller as you increase the number of options. Certainly, a 4-option item is more difficult than a 3-option item because the former indicates only a 25% probability that a guessed option is correct, which is lower than the 33% probability for a 3-option item. A 5-option item is clearly more difficult still.
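The same binomial arithmetic used earlier for binary-choice items shows the effect of adding options. This Python sketch (the 20-item length and 60% passing mark are illustrative assumptions) computes the chance of passing purely by guessing for 3-, 4-, and 5-option items:

from math import comb, ceil

def p_pass_by_guessing(n_items, n_options, pass_rate=0.60):
    """Chance of reaching the passing score purely by random guessing."""
    p = 1 / n_options
    k = ceil(pass_rate * n_items)
    return sum(comb(n_items, i) * p**i * (1 - p)**(n_items - i)
               for i in range(k, n_items + 1))

for n_options in (3, 4, 5):
    print(f"{n_options} options: {p_pass_by_guessing(20, n_options):.4f}")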
Activity:
A. Think of a specific subject matter in your field of specialization, one that you are very
familiar with.
B. Write a learning intent that can be measured using multiple-choice test.
C. Formulate at least 5 correct-answer-type items and another 5 best-answer-type items.
D. Check the quality of your output based on the guidelines discussed above. As you do this,
monitor your learning as well as your confusions, doubts or questions.
E. Raise questions in class.
3. Include in the stem any words that might otherwise be repeated in each response.
4. Items should be stated simply and understandably, excluding all nonfunctional words from
stem and alternatives.
7. Avoid making the correct alternative systematically different from the other options.
10. Make all responses plausible and attractive to the less knowledgeable and skillful student.
FAULTY: Which of the following statements makes clear the meaning of the word “electron”?
a. An electronic tool
b. Neutral particles
c. Negative particles
d. A voting machine
e. The nuclei of atoms
11. The response alternative “None of the above” should be used with caution, if at all.
FAULTY: What is the area of a right triangle whose sides adjacent to the right angle are 4 inches
long respectively?
a. 7
b. 12
c. 25
d. None of the above
IMPROVED: What is the area of a right triangle whose sides adjacent to the right angle are 4
inches and 3 inches respectively?
a. 6 sq. inches
b. 7 sq. inches
c. 12 sq. inches
d. 25 sq. inches
e. None of the above
12. Make options grammatically parallel to each other and consistent with the stem.
FAULTY: As compared with the American factory worker in the early part of the 19th century,
the American factory worker at the close of the century
a. was working long hours
b. received greater social security benefits
c. was to receive lower money wages
d. was less likely to belong to a labor union.
e. became less likely to have personal contact with employers
IMPROVED: As compared with the American factory worker in the early part of the century, the
American factory worker at the close of the century
a. worked longer hours.
b. had more social security.
c. received lower money wages.
d. was less likely to belong to a labor union
e. had less personal contact with his employer
13. Avoid such irrelevant cues as “common elements” and “pat verbal associations.”
IMPROVED: The “standard error of estimate” is most directly related to which of the following test characteristics?
a. Objectivity
b. Reliability
c. Validity
d. Usability
e. Specificity
14. In testing for understanding of a term or concept, it is generally preferable to present the term
in the stem and alternative definitions in the options.
FAULTY: What name is given to the group of complex organic compounds that occur in small
quantities in natural foods that are essential to normal nutrition?
a. Calorie
b. Minerals
c. Nutrients
d. Vitamins
15. Use objective items – items whose correct answers are agreed upon by experts.
Factual Knowledge
Conceptual Knowledge
Which of the following statements of the relationship between market price and normal price is
true?
a. Over a short period of time, market price varies directly with changes in normal price.
b. Over a long period of time, market price tends to equal normal price.
c. Market price is usually lower than normal price.
d. Over a long period of time, market price determines normal price.
Application
In the following items (4-8) you are to judge the effects of a particular policy on the distribution of income. In each case assume that there are no other changes in policy that would counteract the effect of the policy described in the item. Mark the item:
A. if the policy described would tend to reduce the existing degree of inequality in the distribution of income;
B. if the policy described would tend to increase the existing degree of inequality in the distribution of income; or
C. if the policy described would have no effect, or an indeterminate effect, on the distribution of income.
Analysis
An assumption basic to Lindsay’s preference for voluntary associations rather than government order… is a belief
a. that government is not organized to make the best use of experts
b. that freedom of speech, freedom of meeting, and freedom of association are possible only under a system of voluntary associations
c. in the value of experiment and initiative as a means of attaining an ever-improving society
d. in the benefits of competition
For items 14-16, assume that in doing research for a paper about the English language you find a statement by Otto Jespersen that contradicts a point of view about language that you have always accepted. Indicate which of the statements would be significant in determining the value of Jespersen’s statement. For the purpose of these items, you may assume that these statements are accurate. Mark each item using the following key:
A. Significant positively – that is, it might lead you to trust his statement and to revise your own opinion.
B. Significant negatively – that is, it might lead you to distrust his statement.
C. Has no significance.
Matching Items
Another common type of selected-response test is the matching type of test, which comes with two parallel lists (i.e., premises and responses), where students match the entries on one list with those on the other list. The first list consists of descriptions (words or phrases), each of which serves as the premise of a test item. Therefore, each premise is taken as a test item and must be numbered accordingly. Each premise will be matched with the entries in the second (or response) list. There is only one and the same response list for all the premises in the first list.
In developing good matching items, it is helpful to consider the following hints that will
guide you in the process of designing your lists.
Make instructions explicit. In making your instructions for a matching test explicit, the
context, task, and scoring must be clearly indicated. For its context, your instructions must
introduce the description as well as the response lists. If, for example, your description list
contains premises about scientific inventions you must state in your instructions that the first list
(or first column) is about scientific inventions. If your response list contains names of scientific
inventors, you must also state in your instructions that the second list (or second column)
contains names of scientific inventors. You may phrase it something like this: “In the first
column are scientific inventions. The second column lists names of scientific inventors. Match
the inventions with their inventors.” Then indicate the scoring. Having said this, we suggest that
your lists should be labeled with headings accordingly. In the case of the above example, you
may write the column heading as “Inventions” or “Column A: Inventions” for the first column or description list, and “Inventors” or “Column B: Inventors” for the second column or response list.
Maintain brevity and homogeneity of the lists. The list of premises or descriptions must
be fairly short, that is, include only those items that go together as a group. For example, if your
matching test covers the common laboratory operations in chemistry, choose only those that are
relevant to your assessment domain. Doing this, you are also maintaining homogeneity of your
list. In matching tests, it is extremely important that entries in the description list are drawn from
one relatively specific assessment domain. For example, never mix up common laboratory
operations with measurements. Instead, decide as to whether you will include only one of these.
The same is true for your response list. Include only those that belong to the assessment domain.
Note that here, homogeneity in your lists is non-negotiable.
Also, in writing good matching items, it is imperative that the descriptions are longer than the responses, not the other way around. After a student reads one of the descriptions, he reads all the options in the response list. If the description is longer than each of the options, at least the student only reads it once or twice. If the entries in the response list are long, it will take more time for the student to read all the options just to respond to one description or item.
Finally, include more options than descriptions. If your description list has 10 descriptions or items, make your responses 12 or a bit more. This strategy reduces the effect of response elimination, where the student disregards those options already chosen to match the other descriptions. For example, if the student has already responded to 8 out of 10 descriptions with high confidence in his responses but finds the last 2 items difficult, then with only 10 options, only 2 options remain available, and each of the remaining options has a 50% probability of being the correct one. If you include more than 10 responses, more options would still be available for the last 2 descriptions, and the probability that each option is correct is smaller than 50%. This will reduce the effect of guessing. Better yet, formulate your descriptions in a way that some options may be used more than once. In this case, you maintain the plausibility of all options for every description.
Keep the options plausible for each description. Because there is only one and the same
list of options for each of the descriptions, it is vital that you keep the options plausible for every
description. It means that if you have ten descriptions and twelve options, one option is keyed for
each description and the other eleven should be plausible distractors. Usually, if the rule on
homogeneity is very well observed, it is relatively easy to maintain one list of plausible options
for each description. In addition to this, never establish a systematic sequence of keyed
responses, such as coding with a word, such as G-O-L-D-E-N-S-T-A-R, which means that the
keyed response letter for the first description is “G” and the keyed response for the 10th
description is “R.” If this pattern is initially detected by the students, such as G-O-L- _ -E- _ -S-
T- _ -R, they immediately jump into guessing that the missing letters are D, N, and A,
respectively (and guessing it right).
Place the whole test on the same page of the test paper. After stating the instructions for a matching test, write the lists or columns below it and make sure all descriptions and options are written on the same page where the matching test is placed in the test paper. Never extend some items or options to the next page of the test paper because, if you do so, students will keep flipping between pages as they respond to your matching items. If you notice in your draft that some items already spill over to the next page, make some simple adjustments, like reducing the font size of your items (as long as it remains legible) or improving the efficiency of your test layout. If the problem still exists, shorten your list, or, if there are other types of test in your test paper, swap the position of your matching test with another test.
The use of selected-response tests is effective for various types of learning intents and assessment contexts. With careful design, these tests can measure capabilities beyond the lower-order kinds, especially if the items are formulated to elicit students’ higher levels of cognitive skills (Popham, 2004).
Guidelines in Writing Matching-Type Items
FAULTY: Match List A with List B. You will be given one point for each correct match.
List A List B
a. cotton gin a. Eli Whitney
b. reaper b. Alexander Graham Bell
c. wheel c. David Brinkley
d. TU54G tube d. Louisa May Alcott
e. steamboat e. None of these
IMPROVED: Famous inventions are listed in the left-hand column and inventors in the right-hand column below. Place the letter corresponding to the inventor in the space next to the invention for which he is famous. Each match is worth 1 point, and “None of these” may be the correct answer. Inventors may be used more than once.
Inventions Inventors
__ 1. steamboat a. Alexander Graham Bell
__ 2. cotton gin b. Robert Fulton
__ 3. sewing machine c. Elias Howe
__ 4. reaper d. Cyrus McCormick
e. Eli Whitney
f. None of these
Lesson 3
Designing Constructed-Response Types
Another set of options for the types of pen-and-paper test to give is the constructed-response test. Unlike the selected-response types, the constructed-response test does not provide students with options for answers, but rather requires students to produce a relevant answer to every test item. Drawing from its name, we understand that, in this type of test, students construct their response instead of just choosing it from a given list of alternatives.
Constructed-response methods of assessment include certain types of pen-and-paper tests and performance-based assessments. In this chapter we focus our discussion only on constructed-response types of pen-and-paper test. Some of the common types of pen-and-paper constructed-response test are short-answer and essay items.
Short-answer Items
As the name suggests, short-answer items ask students to provide short answers to questions or descriptions. This type of constructed-response test calls for students to respond to either a direct question, a specific description, or an incomplete sentence by supplying a word, a phrase, or a sentence. If a test contains direct questions, students are expected to answer each question by giving the word, symbol, number, phrase, or sentence being asked for. The same applies to items using specific descriptions of words, phrases, or sentences. Items composed of incomplete sentences ask students to complete every sentence by supplying the word or phrase that meaningfully completes it in terms of the assessment domain.
In formulating the questions or descriptions that compose your test items, always think according to the name of this test type, so that you remain mindful that the items should call for “short answers.” Do not ask questions that require long answers; otherwise you are using short-answer items as essay items. If your assessment target calls for students to respond with longer statements or written discussions, it is preferable that you give essay items instead of short-answer items.
Make instructions explicit. Short-answer items usually have simple instructions. In fact, it is tempting to just expect that students understand how to go about the test using only their common sense. However, it is never safe to assume that every student understands what you want them to do with your test. Besides, it is always advisable to give your students the necessary prompt before they respond to the test items. In short, you need to set clear instructions even for short-answer items, which should indicate the content area, the task, and the scoring. In directing students on the task you expect them to do, specify whether they should answer the question, indicate what is described, or complete the sentences, depending on your item’s format. Lastly, remember to say “please” and “thank you” in your instructions.
Decide on the item’s format. When you decide to use short-answer items, also decide if
all your items should come in questions, descriptions, or incomplete sentences. Whichever you
decide to use to format your items, maintain consistency of the format for all your short-answer
items. For example, if you wish to give a 15-item short-answer test and expect that students
supply short answers to your questions, have all your items of the test written in a direct question
form. Never mix up direct question items with descriptions or incomplete sentences. One
important criterion for choosing what format to use is the age of the student. For younger
learners, it is usually preferable to use direct questions than descriptions or incomplete sentences.
Once you have made up your mind as to the item format, proceed to formulating each item.
Structure the items accordingly. Because short-answer items call for “short answers,” as may be inferred from their name, always make sure you structure every item so that it requires only a brief answer (i.e., a word, a symbol, a number [or a set of numbers], a phrase, or a short sentence). This is achieved by formulating very clear, specific, explicit, and error-free statements in your items. A clear and specific question calls for a specific answer. If your description clearly and explicitly represents the object being described, and you are sure that it refers to a specific word, symbol, or phrase, then your item is structured properly. If your items are incomplete sentences, structure every item so that the missing word or phrase is a keyword or a key idea. Ordinarily, an incomplete sentence has only one blank, which corresponds to one missing keyword. You may remove two keywords as long as doing so does not distort the key idea of the incomplete sentence, which should guide the students in figuring out the missing words. Never use more than two blanks.
One important reason for ensuring that students supply only brief responses is that brief responses are easy to check objectively. We encounter a major scoring problem if students’ responses are lengthy: with long responses, it is difficult to give accurate scores. Of course, we already know, as discussed in Chapter 3, that inaccurate scoring of students’ responses undermines the reliability of our measures and reduces the validity of the inferences we make about our students’ learning outcomes.
Provide the blanks in appropriate places. Blanks are spaces in the items where
students supply their answers by writing a word, a symbol, a number, a phrase, or a sentence. If
your items are all in a direct question format where each question begins with an item number,
place the blanks to the left of the item number. When you type the item, begin with the
blank space, followed by the item number, then the question. This rule also applies to items
using explicit descriptions. If you are using the incomplete sentence format for your items, place
the blank near the end of the sentence. This means that you take out a keyword that is found near
the end of the sentence so that it becomes an incomplete sentence. Never take out a keyword
from the beginning of a sentence. The reason for this is that you need to first establish the key
idea of the sentence so that students immediately know what is missing in the sentence right after
one reading. If the blank space is near the beginning of the sentence, students will find it hard to
understand the key idea and will, therefore, read the sentence more than once in order to figure
out the missing word. In all item formats, always maintain the same length of the blanks in all
your short-answer items.
The good thing about short-answer items is that students actually produce the correct answer rather than merely selecting it from a set of given alternatives. Hence, students who possess only partial knowledge of the subject matter, which is usually enough for selected-response items, will find it difficult to give a correct response to every short-answer item. Although we generally recognize that these items are appropriate for measuring simple kinds of learning outcomes, they are capable of measuring various types of challenging outcomes if carefully developed. However, it is not advisable to force short-answer items to measure more complex and deeper levels of cognitive processes. It is always helpful to know other methods of assessment so that you have a wide range of options to navigate freely, depending on your assessment purposes.
FAULTY: _____ pointed out in ____ the freedom of thought in America was seriously
hampered by ___, ____, & __.
IMPROVED: That freedom of thought in America was seriously hampered by social pressures
toward conformity was pointed out in 1830 by ______.
4. Specify and announce in advance whether scoring will take spelling into account.
5. In testing for comprehension of terms and knowledge of definition, it is often better to supply
the term and require a definition than to provide a definition and require the term.
FAULTY: What is the general measurement term describing the consistency with which items in
a test measure the same thing?
6. It is generally recommended that in completion items the blanks come at the end of the
statement.
FAULTY: A (an) ________ is the index obtained by dividing a mental age score by
chronological age and multiplying by 100.
IMPROVED: The index obtained by dividing a mental age score by chronological age and
multiplying by 100 is called a (an) ________
FAULTY: Where does the Security Council of the United Nations hold its meeting?
IMPROVED: In what city of the United States does the Security Council of the United Nations
hold its meeting?
Essay Items
Relative to our learning intents, there are times when it is necessary that our students
supply lengthy responses so that they exhibit more complex cognitive processes. For some
learning targets, a single word, a phrase, or a sentence is not enough to measure students’
learning outcomes. For these targets, we need a constructed-response type of test that will allow
students to adequately exhibit their learning through sufficient writing; hence, essay items work
for these purposes.
Just like short-answer items, essay items call for students to produce rather than select
answers from the given alternatives. But unlike short-answer items, essay items call for more
substantial, usually lengthy response from students. Because the length and complexity of the
response may vary, essay items are appropriate measures of higher-level cognitive skills.
Following are some guidelines that will help you formulate good essay items.
Communicate the extensiveness of expected response. By reading the essay item, your
students must know exactly how brief or extensive their responses should be. This is made
possible by making your item clearly convey the degree of extensiveness you expect from their
response. Extensiveness depends on the degree of complexity of your item. To determine the
degree of complexity you desire to assess, you may design an essay item according to either of two types, depending, of course, on your assessment objective: the restricted-response and
extended-response items. If you wish to measure students’ ability to understand, analyze, or
apply certain concepts to new contexts while dealing with relatively simple dimensions of
knowledge, and if the task requires only a relatively short time period, the restricted-response
type may be preferred. If, however, you wish to assess students’ capability to evaluate or
synthesize various aspects of knowledge, which will naturally require longer time for their
responses to be completed, the extended-response type is preferable. Notice that even at this
phase of determining the degree of complexity of your essay item, it is very vital that you clearly
make a decision based on your learning intent. This phase is crucial because if you design an
essay item that is of extended type but give it to your students as if it is of restricted type, your
students’ failure to meet the assessment standards set for the item may not be due to their level of
learning, but rather because they needed more time to gather and process information before they
could come up with responses that are relevant to your assessment standards. Your inference on
students’ learning becomes problematically unreliable and invalid. Your inference becomes equally problematic if you construct a truly restricted type of essay item but give it as if it were an extended-type essay item.
Prime and prompt students through the item. Unlike the other types of pen-and-paper
tests, an essay item already includes the context, the assessment task, and the assessment standards altogether. The statement of context provides a background of the subject matter in
focus, and primes the students’ thinking of that subject matter. The prime helps students to be
selectively attentive to a subject matter that is relevant to the assessment task of the essay item.
Without it, students tend to grapple with understanding the subject matter that is embedded in the
statement of assessment task, and may find it difficult to stay in focus. The assessment task is
what the students directly respond to in order to write an essay. Both the statements of context
(or the prime) and the assessment task (or the prompt) are important in setting the students’
attention to the subject matter and in making them think of a response that meets the assessment
standards. Notice, for example, that if the item is phrased as “Compare and contrast the
governance of Estrada and Arroyo,” students first struggle to generate some ideas related to
these two names, and only then think of the governance or political administrations of the two Philippine presidents in a general sense. This is because the item does not have a prime. In this case, the item is not helping the student stay in clear focus on what the item really intends to assess. It will be
different if the item begins with a prime, such as when phrased something like, “Our country has
been run by a number of presidents already, and along with the change in political
administration are the changes in the agenda of reforms. Compare and contrast the economic
reform agenda of the presidential administrations of Estrada and Arroyo.” In this item, students
are primed to think of the reform agenda of the two presidents, which makes it very probable that they focus on the context as they respond to the assessment task. This latter example is not yet a complete essay item, as it lacks other necessary elements, but it clearly shows how effectively you can prime and prompt students to respond appropriately to your test item. This item may be
improved to become a full-blown essay item if you add other elements, such as the guide to the
extensiveness of the desired response as well as the assessment standards.
Provide clear assessment standards. You might think that, if it has both the prime and the prompt, your item can already stand. This is not true. For an essay item to stand as a good one, it must also indicate a clear guide to the value of the item. The assessment standards inform the students about which specific aspects of their responses you will give merit, and which aspects will earn more credit than others. If, for example, you give credit to their argument only if they can provide evidence, then you need to categorically ask for it in your essay item. Similarly, if you give two or three essay items and you wish to give more credit to one item based on its complexity, you also need to indicate the item’s value. This way, students know where to devote most of their time and effort, and can decide how much of these resources to invest in each item. One simple way of guiding students in terms of an item’s value is to indicate the assessment weight you assign to the item in parentheses at the end of the item.
Do away with optional items. While reading this part you may be recalling a common experience of taking an essay test in which the teacher asked you to choose some, but not all, essay items to answer, and you tended to choose those items that were more convenient to your understanding and readiness. This practice of providing optionality in essay items, where students are made to answer fewer items than are presented, should be stopped. From your own experience, it is obvious that, when students are free to choose only a few items to answer, they choose the items that are easy for them. As a consequence, each student answers items that are “easy” for him or her, which leads to flawed inferences about students’ learning, because students’ responses are marked under different standards and levels of complexity depending on the items they chose to answer. One of the basic questions you will need to answer if you plan to do this is: What is the assurance that all your items have an equal level of complexity and that they measure exactly the same knowledge domains and cognitive processes? This question is extremely difficult to answer. This guideline, therefore, says that if, for example, you have 3 essay items for the test you are about to administer, have each of your students answer all 3 items.
Prepare a scoring rubric. Because an essay item calls for relatively extensive response
from the students, it is always necessary that you prepare a scoring rubric or guide prior to giving
the test. The scoring scheme will help you pre-assess the validity and reliability of your item
because it will allow you to identify the criteria as well as the key ideas you expect your students
to give in response to the item. Your scoring rubric indicates the descriptions in scoring the
quality of your students’ responses in the essay item. It includes a set of standards that define
what is expected in a learning situation, and important indicators of how students’ responses to
the task will be scored. Having said this, we ask you to choose the scoring approach that will best
fit your assessment context. You have two options for this purpose: one is the holistic approach, the other is the analytic approach.
The holistic approach allows you to focus on your students’ overall response to an essay
item. As you assess the response as a whole, this approach will guide you in terms of what
dimensions of the learning outcome you pay attention to. For example, if your essay item intends
to let students manifest their ability to argue with appropriate evidence, and explain in good,
clear, and coherent language, you need to identify the dimensions that can capture those abilities in your assessment; hence you may have the following dimensions indicated in your holistic
rubric: Logic of the argument, Relevance of evidence, Communicative clarity, Lexical choice,
and Mechanics (spelling, punctuations, etc.). These dimensions serve as your criteria for
assessing students’ response. It is always appropriate that you indicate the dimensions to assess
because these dimensions keep you in focus as you assign a score to each of your students’
response. And for you to be guided further in terms of how much score to give, each dimension
must be assigned a corresponding point or set of points. For example, if you wish to give a maximum of 6 points for the logic of the argument, indicate it in your holistic rubric so that it might look like the items in the box below.
Another way of setting a guide to scoring is to assign the same points for each criterion while also indicating the weight of each criterion based on its importance or value. The box below gives you a view of how the contents may look.
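Because the boxes from the original layout are not reproduced here, the following Python sketch illustrates the weighted-criteria idea instead; the criterion names come from the holistic rubric above, while the weights and raw points are hypothetical.

```python
# Hypothetical weights for the five criteria named above (summing to 1.0),
# with each criterion scored on the same 0-6 point range.
weights = {
    "Logic of the argument": 0.30,
    "Relevance of evidence": 0.25,
    "Communicative clarity": 0.20,
    "Lexical choice": 0.15,
    "Mechanics": 0.10,
}

# Hypothetical raw points assigned to one student's essay.
raw_points = {
    "Logic of the argument": 5,
    "Relevance of evidence": 4,
    "Communicative clarity": 6,
    "Lexical choice": 4,
    "Mechanics": 5,
}

# The weighted total stays on the same 0-6 range as the individual criteria.
weighted_total = sum(weights[c] * raw_points[c] for c in weights)
print(round(weighted_total, 2))  # 4.8
```

Criteria with larger weights pull the total more strongly, which is exactly how the weighting communicates each criterion's importance.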
When employing a holistic approach for scoring students’ responses in an essay item,
your decision as to how much score to give based on each dimension is not guided by clear
descriptions of the quality of response. It usually rests on the teacher’s judgment of the student’s
response in terms of each criterion. Because this approach does not require specific descriptions
of the quality of response, it is easy and efficient to use. The major weakness of this approach,
however, is that it does not specify the graded levels of performance quality, which invites teachers’ subjective judgment of students’ responses. Acknowledging this major weakness,
we recommend that you use the holistic approach only for restricted-response items where
students are tested only on less complex skills requiring only a small amount of time.
In contrast, the analytic approach allows for a more detailed and specific assessment
scheme in that it indicates not only the dimensions or criteria, but also the specific descriptions
of the different levels of performance quality per criterion. Supposing we take the sample criteria
in the boxes above and use them as the same criteria for our analytic rubric, we proceed by
determining the levels of performance quality for each criterion. For the logic of the argument
criterion we set a scale of varying performance quality, perhaps ranging from Excellent to Poor,
with other levels of quality in between. A simple way to do this is exemplified in the box below.
As indicated in the box above, there are 4 scale indicators, each representing a level of
performance quality. In this example, the teacher will put a check on the space below the scale
indicator that matches the quality of a student’s response on every criterion. Scores are obtained
by assigning points in every scale indicator. You may also specify the weight of each criterion
depending on the degree of importance or value of the criterion.
A more calibrated analytic rubric not only indicates the scale levels for the teacher to check against the quality of students’ responses to an essay item, but also describes the performance quality that falls under each level of the scale. This rubric describes what quality of performance qualifies as “excellent” and what type of performance is “poor.” In this case, the analytic rubric should include descriptive statements for each scale level of each criterion. The table below shows an example of these descriptive statements applied to one of the criteria used in the earlier example, just to illustrate the point.
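As a rough illustration of how such descriptive statements can be organized, the sketch below stores one criterion's scale levels and descriptors in a Python dictionary; the descriptor wording is hypothetical, not taken from the book's table.

```python
# Hypothetical descriptors for one criterion of an analytic rubric.
logic_of_argument = {
    4: ("Excellent", "Claims are logically ordered and every claim is supported."),
    3: ("Good", "Claims are mostly ordered, with minor gaps in support."),
    2: ("Fair", "Some claims are unsupported or out of sequence."),
    1: ("Poor", "Claims are disconnected and largely unsupported."),
}

def describe(criterion: dict, level: int) -> str:
    """Return the label and descriptor the teacher checks for a given level."""
    label, descriptor = criterion[level]
    return f"{label}: {descriptor}"

print(describe(logic_of_argument, 3))  # Good: Claims are mostly ordered, ...
```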
The good thing about using the analytic approach in scoring essay responses is that it helps you identify the specific level of students’ performance, and your assessment of students’ learning outcomes becomes more objective. It therefore increases the reliability of your measure and facilitates more valid and reliable inferences. It is also beneficial for the students because, through the analytic rubric, they can pinpoint the specific level of their performance and judge its quality by matching it against the descriptions. This type of rubric is best for essay items that measure more complex cognitive skills and more sophisticated knowledge dimensions.
Whichever approach you wish to use for scoring your students’ response to your essay
items, your decision will work if you are already clear on the following questions:
What do you want your students to know and be able to do in the essay?
How well do you want your students to know and be able to do it in the essay?
How will you know when your students know it and do it well in the essay?
As you clarify your practice with reference to those questions, proceed to constructing your scoring scheme using either approach, following the simple steps indicated below.
Set appropriate assessment target.
Decide on the type of the rubric to use.
Identify the dimensions of performance that reflect the learning outcomes.
Weigh the dimensions in proportion to their importance or value.
Determine the points (or range of points) to be allocated to each level of performance.
Show the rubric to colleagues and/or students before using it.
Some teachers are excited to use essay items because these items provide more
opportunities to assess various types of learning outcomes, particularly those that involve higher
level cognitive processing. If carefully constructed, essay items can test students’ ability to
logically arrange concepts and analyze relationships between them; state assumptions, compare positions, evaluate them, and draw conclusions; formulate hypotheses and argue for the causal relationships of concepts; organize information or bring in evidence to support findings; and propose solutions to certain problems and evaluate those solutions in light of certain criteria. These and many more competencies can be measured using good essay items.
Lesson 4
Designing Interpretive Exercise
Items in an interpretive exercise usually contain a stimulus aside from the item stem and options. This format is useful in measuring more complex skills such as judging the relevance of information, making generalizations, drawing inferences, applying principles, recognizing assumptions, and making interpretations.
2. Select introductory material that is appropriate to the curricular experience and reading ability
of the examinees.
5. Revise introductory material for clarity, conciseness, and greater interpretive value.
6. Construct test items that require analysis and interpretation of introductory material.
7. Make the number of items roughly proportional to the length of the introductory material.
Reading Comprehension
Bem (1975) has argued that androgynous people are “better off” than their sex-typed
counterparts because they are not constrained by rigid sex-role concepts and are freer to respond to a
wider variety of situations. Seeking to test this hypothesis, Bem exposed masculine, feminine, and
androgynous men and women to situations that called for independence (a masculine attribute) or
nurturance (a feminine attribute). The test for masculine independence assessed the subject’s willingness
to resist social pressure by refusing to agree with peers who gave bogus judgments when rating cartoons
for funniness (for example, several peers might say that a very funny cartoon was hilarious). Nurturance, or feminine expressiveness, was measured by observing the behavior of the subject when left alone for ten minutes with a 5-month-old baby. The results confirmed Bem’s hypothesis. Both the masculine sex-typed and the androgynous subjects were more independent (less conforming) on the “independence” test than feminine sex-typed individuals. Furthermore, both the feminine and the androgynous subjects were more “nurturant” than the masculine sex-typed individuals when interacting with the baby. Thus, the androgynous subjects were quite flexible: they performed as masculine subjects did on the “masculine” task and as feminine subjects did on the “feminine” task.
a. task performance
b. frequency of refusals and conformity
c. rating scale
d. counting occurrences of the behavior
a. factorial
b. factorial design based on mixed model
c. randomized block design
d. switching replications design
Interpreting Diagrams
Instruction. Study the following illustration and answer the questions that follow.
Figure 1. Pretest-posttest design with Group A and Group B.
a. there is an EV
b. there is no treatment
c. there is the occurrence of a ceiling effect
References
Airasian, P. W. (2000). Assessment in the classroom: A concise approach (2nd ed.). USA: McGraw-Hill.
Popham, W. J. (2005). Classroom assessment: What teachers need to know (4th ed.). Boston,
MA: Allyn and Bacon.
Chapter 5
Constructing Non-Cognitive Measures
Objectives
Lessons
Lesson 1
The Nature of Non-Cognitive Constructs
Figure 1. Dimensions of affect: intensity (high to low) and target (person or object).
religion, where the Catholic group had the highest mean score. In another discriminant validity analysis, the participants who frequently attended church had the highest mean. Examples of items are:
1. I think the teaching of the church is altogether too superficial to have much social significance.
2. I feel the church services give me inspiration and help me to live up to my best during
the following week.
3. I think the church keeps business and politics up to a higher standard than they would otherwise tend to maintain.
Beliefs. Beliefs are judgments and evaluations that we make about ourselves, about
others, and about the world around us (Dilts, 1999). Beliefs are generalizations about things such
as causality or the meaning of specific actions (Pajares, 1992). Examples of belief statements
made in the educational environment are “A quiet classroom is conducive to learning,”
“Studying longer will improve a student’s score on the test,” “Grades encourage students to work
harder.”
Beliefs play an important part in how teachers organize knowledge and information and
are essential in helping teachers adapt, understand, and make sense of themselves and their world
(Schommer, 1990; Taylor, 2003; Taylor & Caldarelli, 2004). How and what teachers believe
have a tremendous impact on their behavior in the classroom (Pajares, 1992; Richardson, 1996).
An example of a measure of belief is the Schommer Epistemological Questionnaire.
Schommer (1990) developed this questionnaire to assess beliefs about knowledge and learning.
A 21-item questionnaire was developed by the researchers to measure epistemological beliefs of
Asian students. The questionnaire was adapted from Schommer's 63-item epistemological beliefs
questionnaire. This Asian version of the Schommer Epistemological Questionnaire has been
validated with a sample of 285 Filipino college students. This epistemological questionnaire was revised to have fewer items and simpler expressions of ideas to be more appropriate for Asian learners. The number of statements was reduced to ensure that the participants would not be placed under any stress while completing the questionnaire. Students are asked to rate their
degree of agreement for each item on a 5-point Likert scale ranging from 1 (strongly disagree) to
5 (strongly agree). Wording of items varied in voice from first person (I) to third person
(students) in an effort to illustrate how the same belief could be queried from somewhat different
perspectives. Items assessed four epistemological belief factors including beliefs about the
ability to learn (ranging from fixed at birth to improvable), structure of knowledge (ranging from
isolated pieces to integrated concepts), speed of learning (ranging from quick learning to gradual
learning), and stability of knowledge (ranging from certain knowledge to changing knowledge).
Schommer (1990) has reported reliability and validity testing for the Epistemological
Questionnaire; the instrument reliably measures adolescents' and adults' epistemological beliefs
and yields a four-factor model of epistemology. Schommer (1993) has reported test-retest
reliability of .74. Factor analyses were conducted on the mean for each subset, rather than at the
item level.
the doing of an activity" (p. 138). Interests may be referred to as instrumental means to an end
independent of perceived importance (Savickas, 1999).
According to Holland’s theory, there are six vocational interest types. Each of these six
types and their accompanying definitions are presented below:
Realistic People with Realistic interests like work activities that include practical,
hands-on problems and solutions. They enjoy dealing with plants, animals,
and real-world materials like wood, tools, and machinery. They enjoy outside
work. Often people with Realistic interests do not like occupations that
mainly involve doing paperwork or working closely with others.
Investigative People with Investigative interests like work activities that have to do with
ideas and thinking more than with physical activity. They like to search for
facts and figure out problems mentally rather than to persuade or lead people.
Artistic People with Artistic interests like work activities that deal with the artistic
side of things, such as forms, designs, and patterns. They like self-expression
in their work. They prefer settings where work can be done without
following a clear set of rules.
Social People with Social interests like work activities that assist others and
promote learning and personal development. They prefer to communicate
more than to work with objects, machines, or data. They like to teach, to give
advice, to help, or otherwise be of service to people.
Enterprising People with Enterprising interests like work activities that have to do with
starting up and carrying out projects, especially business ventures. They like
persuading and leading people and making decisions. They like taking risks
for profit. These people prefer action rather than thought.
Conventional People with Conventional interests like work activities that follow set
procedures and routines. They prefer working with data and detail rather than
with ideas. They prefer work in which there are precise standards rather than
work in which you have to judge things by yourself. These people like
working where the lines of authority are clear.
Examples of affective measures of interest are the Strong-Campbell Interest Inventory and the Strong Interest Inventory (SII), the Jackson Vocational Interest Inventory, the Guilford-Zimmerman Interest Inventory, and the Kuder Occupational Interest Survey. For a list of vocational interest tests, visit the site: http://www.yorku.ca/psycentr/tests/voc.html.
Values. Values refer to “the principles and fundamental convictions which act as general guides to behavior, the standards by which particular actions are judged to be good or desirable” (Halstead & Taylor, 2000, p. 169). These values are used as guiding principles to act and to justify actions accordingly (Knafo & Schwartz, 2003). Values are internalized and learned at an early stage in life. The school setting is one major avenue where people show how values are learned, respected, and upheld. A student who values education in school is provided with opportunities to behave in ways that allow him or her to do well in school and thus attain the values of hard work, perseverance, and diligence in academic-related tasks. Examples of values are diligence, respect for authority, emotional restraint, filial piety, and humility.
Dispositions. The National Council for Accreditation of Teacher Education (2001) defines dispositions as the values, commitments, and professional ethics that influence behaviors toward students, families, colleagues, and communities and affect student learning, motivation, and development, as well as the educator's own professional growth. Dispositions are guided by beliefs and attitudes related to values such as caring, fairness, honesty, responsibility, and social justice. Examples of dispositions include fairness, being democratic, empathy, enthusiasm, thoughtfulness, and respectfulness. Disposition measures have also been created for metacognition, self-regulation, self-efficacy, approaches to learning, and critical thinking.
Activity
Use the internet and give examples of affective scales under each of the following areas.
Lesson 2
Steps in Constructing Non-Cognitive Measures
The construction of a scale begins with clearly identifying what construct needs to be measured. A scale is constructed when (1) no scales are available to measure the construct, (2) existing scales are foreign and not suitable for the stakeholders or sample that will take the measure, (3) existing measures are not appropriate for the purpose of assessment, or (4) the test developer intends to explore the underlying factors of a construct and eventually confirm them. Once the purpose of developing a scale is clear, the test developer decides what type of questionnaire to use: whether the measure will assess an attitude, belief, interest, value, or disposition.
When the specific construct is clearly framed, it is very important that the test developer search for relevant literature from different studies involving the construct intended to be measured. What is needed from the literature review is the definition that the test developer wants to adopt and whether the construct has underlying factors. The definition and its underlying factors are the major basis for the test developer to later write the items. A thorough literature review helps the test developer provide a conceptual framework as a basis for the construct being measured. The framework can come in the form of theories, principles, models, or a taxonomy that the test developer can use as a basis for hypothesizing factors of the construct intended to be measured. Thorough knowledge of the literature about a construct helps the researcher identify different perspectives on how the factors were arrived at and possible problems with the application of these factors across different groups. This will help the test developer justify the purpose of constructing the scale.
When the construct and its underlying factors or subscales are established through a thorough literature review, a plan for the scale needs to be designed. The plan starts with creating a Table of Specifications, which indicates the number of items for each subscale, the items phrased in positive and negative statements, and the response format; a simple sketch of such a table follows.
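A Table of Specifications for a scale can be kept as simple as a per-subscale tally. The minimal Python sketch below is only an illustration; the subscale names and item counts are hypothetical.

```python
# Hypothetical Table of Specifications for a two-subscale measure.
table_of_specifications = {
    "response_format": "5-point Likert (1 = strongly disagree ... 5 = strongly agree)",
    "subscales": {
        "Assertiveness":             {"positive_items": 8, "negative_items": 4},
        "Intellectual independence": {"positive_items": 7, "negative_items": 5},
    },
}

# Total number of items planned for the preliminary form.
total_items = sum(spec["positive_items"] + spec["negative_items"]
                  for spec in table_of_specifications["subscales"].values())
print(total_items)  # 24
```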
The test developer uses the definitions provided in the framework to write the preliminary items of the scale. Items are created for each subscale as guided by the conceptual definition, and the number of items planned in the Table of Specifications is also considered. As much as possible, a large number of items is written so that the behavior being measured is well represented. To help the test developer write items, a well-represented set of behaviors manifesting the construct should be covered. Qualitative studies reporting specific responses are very helpful in writing items. An open-ended survey, focus group discussion, or interviews can be conducted in order to come up with statements that can be used to write items.
When these methods are employed at the start of item writing, the questions generally seek specific behavioral manifestations of the subscales intended to be measured. An example is the study of Magno and Mamauag (2008), who created the “Best Engineering Traits” (BET) scale, which measures dispositions of engineering students in the areas of assertiveness, intellectual independence, practical inclination, and analytical interest. The items in this scale were based on an open-ended survey conducted among engineering students. The survey asked the following questions:
5. What do you think are other personality traits or characteristics that would make you an
effective engineer?
Examples of item statements generated from the survey responses are as follows:
Notice that the item statements begin with the pronoun “I.” This indicates self-referencing for the respondents when they answer the items. Items 1 and 2 in the example are stated positively while items 3 and 4 are stated negatively. This ensures that respondents are consistent with their answers within a subscale, where the items should be responded to in the same way. For negative items, the responses are reverse-scored so that they are consistent with the positive items, as in the sketch below.
Most people favor the death penalty. What do you think? (Leading question)
Select a scaling technique
After writing the items, the test developer decides on the appropriate response format to be used in the scale. The most common response formats used in scales are the Likert scale (measure of position on an opinion), the verbal frequency scale (measure of a habit), the ordinal scale (ordering of responses), and the linear numeric scale (judging a single dimension in an array). A detailed description of each scaling technique is presented in the next lesson.
It is important that directions or instructions for the target respondents be created as early as when the items are created. When writing instructions, it is very important that they are clear and concise. Respondents should be informed how to answer. If you intend to have a separate answer sheet, make sure to inform the respondents about it in the instructions. Instructions should also include how to change answers and how to answer (encircle, check, or shade). Tell the respondents in the instructions specifically what they need to do.
The following are the instructions formulated for the BET:
This is an inventory to find out your suitability to further study Engineering. This can help guide you in
your pursuit of an academic life. The inventory attempts to assess what interests and strategies you have
learned or acquired over the years as a result of your study.
In the inventory, you will find statements describing various interests and strategies one acquires through
years of schooling and other learning experiences. Indicate the extent of your agreement or disagreement
to each of these statements by using the following scale:
There are no right or wrong answers here. You either AGREE or DISAGREE with the statement. It is
best if you do not think about each item too long --- just answer this test as quickly as you can, BUT
please DO NOT OMIT answering any item.
DO NOT WRITE OR MAKE ANY MARKS ON THE TEST BOOKLET. All answers are to be written on
your answer sheet.
Ensure that you have filled out your answer sheet properly and legibly for your name, school, date of
birth, age, and gender.
Be sure also that you have copied your test booklet number correctly on the space provided in your answer sheet. Do not turn the page until you are told to do so.
You have a total of 40 minutes to finish this whole test. Do not spend a lot of time in any one item.
Answer all items as truthfully and honestly as you can.
Notice that the instructions start with the purpose of the test. This is done to dispel any misconceptions that the respondents may have about the test. The instructions then describe the kind of items expected in the test, and the respondent is told how to answer the items. The scaling technique is also provided. The respondents are reminded that there are no right or wrong answers, to avoid faking good or bad on the test. The respondents are also reminded about practical matters, such as not making any marks on the test booklet, the use of answer sheets, answering all items, and the time allotment. As much as possible, detailed instructions are provided to avoid any problems.
For achievement tests and teacher-made tests, this procedure is called content validation. For affective measures, however, it is difficult to conduct content validation because there is no defined content area for an affective variable; the definition and behavioral manifestations from empirical reports can qualify as the areas measured. Instead, the items are reviewed against the definition or framework provided: whether they are relevant, whether they fall outside the confines of the theory or measure something else, whether they are applicable to the target respondents, and whether they need revision for clarity.
Item review is conducted among experts in the content being measured. In the process of item review, the conceptual definition of the constructs is provided together with the constructed items to guide the reviewer and ensure that the items are properly framed. It is also necessary to arrange the items according to the subscale where each belongs so that the reviewer can easily evaluate the appropriateness of the items in that subscale. A suggested format for item review is shown below:
When giving items for review, the test developer should write a formal letter to the reviewer and indicate specifically how the review should be done. Indicate specifically if you also intend the reviewer to check the grammar of the statements, because most reviewers would otherwise focus only on the content and its fit to the definition.
After the items have been reviewed, expect several corrections and comments. Several comments indicate that the items will be better because they have been thoroughly studied and critiqued. In fact, many comments should be appreciated more than a few, because they mean the reviewers are offering better ways to fix and reconstruct your items. At this stage, it is necessary to consider the suggestions and comments provided by the reviewer. If there are things that are not clear to you, do not hesitate to go back and ask the reviewer once more. This will ensure that the items are better when the final form of the scale is assembled.
Preparing the items for pilot testing requires laying out the test for the respondents. The general format of the scale should emphasize making it as easy as possible to use. Each item can be identified with a number or a letter to facilitate scoring of responses later. The items should be structured for readability and for recording responses. Whenever possible, items with the same response formats are placed together. In designing self-administered scales, it is suggested to make them visually appealing to increase the response rate. The items should be self-explanatory, and the respondents should be able to complete them in a short time. In ordering the items, note that the first few questions set the tone for the rest of the items and determine how willingly and conscientiously respondents will work on subsequent questions.
Before the actual pilot test, the items can first be administered to at least 3 respondents who belong to the target sample; observe which areas take them long to answer and whether the instructions are clearly followed. A retrospective verbal report can be conducted after the participants answer the scale to clarify any difficulties that arose in answering the items.
In the actual pilot testing, the scale is administered to a large sample (e.g., N = 320). The ideal sample size is at least three times the total number of items: if there are 100 items in the scale, the ideal sample size would be 300 or more. Having a large number of respondents makes the responses more representative of the characteristic being measured, and a large sample tends to make the distribution of the scores approach normality.
In administering a scale, proper testing conditions should be maintained, such as the absence of distractions, a comfortable room temperature, proper lighting, and other aspects that could otherwise cause large measurement errors.
The responses to the scale should be recorded in a spreadsheet, and the numerical responses are then analyzed. The analysis consists of determining whether the test is reliable and valid; techniques for establishing validity and reliability are explained in Chapter 3. If the test developer intends to use parallel forms or test-retest, then two time frames would be set in the design of the testing.
The analysis of items indicates whether the test as a whole, and the individual items, are valid and reliable. If principal components analysis is conducted, each item will have a corresponding factor loading, and items that do not load highly on any factor are removed from the item pool. Items whose removal would increase the Cronbach’s alpha reliability of the test are likewise flagged. These techniques suggest removing certain items to improve the indices of reliability and validity of the test, which implies that a new form is produced in line with the results of the item analysis. That is why a large pool of items is needed to begin with: not all items will be accepted in the final form of the test.
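As a rough sketch of the reliability side of this item analysis, the Python code below computes Cronbach's alpha and the alpha obtained when each item is deleted; it assumes a respondents-by-items score matrix and is illustrative, not the procedure used in any particular study.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def alpha_if_item_deleted(scores: np.ndarray) -> list[float]:
    """Alpha recomputed with each item removed; a value above the full-scale
    alpha flags an item whose removal would improve reliability."""
    return [cronbach_alpha(np.delete(scores, j, axis=1))
            for j in range(scores.shape[1])]
```

Items flagged by low factor loadings or by the alpha-if-deleted check become candidates for removal before the next administration.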
The instrument is then revised: items with low factor loadings are removed, and items whose removal would increase Cronbach’s alpha are also considered for removal. In the process of principal components analysis, even though the test developer has proposed a set of factors, these factors may not hold true because the items may group differently. The test developer then thinks of new factor labels for the new grouping of items. These cases require the test developer to revise the items and come up with another revised form. This revised form is again administered to another large sample to collect evidence of the scale being valid and reliable.
For the final pilot data gathering, a large sample is again selected, three times the number of items. The sample should have the same characteristics as the first pilot sample. The data gathered serve to establish the final estimates of the test’s validity and reliability.
The validity and reliability are again analyzed using the new pilot data. The test developer wants to determine whether the same factors will still be formed and whether the test will still show the same index of reliability.
Edit the questionnaire and specify the procedures for its use
Items with low factor loadings are again removed, resulting in fewer items. A new form of the test with the reduced set of items is assembled; the remaining items have evidence of good factor loadings. The final form of the test can now be formed.
The test manual indicates the purpose of the test, instructions for administering it, the procedure for scoring, and how to interpret the scores, including the norms. Establishing norms will be fully discussed in the next chapter.
Think of a construct that you want to study for a research or for your thesis in
the future. Follow the steps in test construction in developing the scale.
Lesson 3
Response Formats
This lesson presents the different scaling techniques used in tests, questionnaires, and inventories. The important assumption behind putting scales on tests and questionnaires is that they provide quantities that can be analyzed and interpreted statistically. One characteristic of research is that its variables should be measurable; through scales we are able to measure and quantify the concepts under study. Scales also enable the results to be analyzed with mathematical formulas to arrive at quantitative results.
The scaling techniques discussed here can be categorized according to the levels of measurement: nominal, ordinal, interval, and ratio. In some references, the scaling techniques come in conjunction with the levels of measurement. The levels of measurement are mentioned here to treat them as a separate topic and to show how they relate to the scaling techniques.
According to Bailey (1996), scaling is a process of assigning numbers or symbols to various levels of a particular concept that we wish to measure. Scales can be either open-ended or close-ended. For open-ended questions, scales refer to the criteria set in order to effectively and objectively assess the information presented. For close-ended questions, scales refer to response formats for certain concepts and statements. Varieties of these scales, serving as response formats on tests and questionnaires, are presented in this lesson.
Before presenting the varieties of scaling techniques, the following should be remembered as a framework for the discussion: the scaling techniques are classified into categories based on the type of question in which they are used. These categories are scaling techniques for multiple choice questions, conventional scale types used for measuring behavior on questionnaires, scale combinations, nonverbal scales for questions requiring illustrations, and social scaling for obtaining the profile of a group (Alreck & Settle, 1995).
Multiple choice questions are common and known for being simple and versatile. They can be used to measure mental ability and a variety of behavioral patterns. They are ideal for responses that fall into discrete categories. When the answers can be expressed as numbers, a direct question should be used, and the number of units should be recorded.
Please check any type of food that you regularly eat in the cafeteria.
___ Hamburger
___ Pasta
___ Soup
___ Fried chicken
___ French fries
2. Single-Response Item
In this scaling technique one alternative is singled out from among several by the
respondent. The item is still multiple choice but only one response is required. Single response
items can be used only when (1) the choice criterion is clearly stated and (2) the criterion
actually defines a single category.
What kind of food do you most often eat in the cafeteria? (Check only one)
___ Hamburger
___ Pasta
___ Soup
___ Fried chicken
___ French fries
These types of scales are commonly used for surveys. Every information need or survey question can be scaled effectively with one or more of these scales. One should remember that the choice of scaling technique is a matter of selecting among the conventional scales.
3. Likert Scale
The Likert scale is used to obtain people’s positions on certain issues or conclusions. This is a form of opinion or attitude measurement in which the respondents’ degree of agreement or disagreement with an issue or opinion is obtained.
The advantages of this scale include flexibility, economy, and ease of composition. The procedure is flexible because items can be only a few words long, or they can consist of several lines. The method is economical because one set of instructions and one scale can serve many items. The respondent can quickly and easily complete the items.
Also, the Likert scale makes it possible to obtain a summated value. Besides obtaining the results of each item, a total score can be obtained from a set of items. The total value serves as an index of attitude toward the major issue as a whole, as sketched below.
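A minimal sketch of a summated Likert value in Python; the responses are hypothetical, and the 1-to-5 scale follows the example below, where 1 means strongly agree (so a lower total indicates a more favorable attitude).

```python
# Hypothetical responses of one student to a five-item Likert scale
# (1 = strongly agree ... 5 = strongly disagree, as in the example below).
responses = [2, 1, 3, 2, 1]

summated_value = sum(responses)             # index of the overall attitude
item_mean = summated_value / len(responses)
print(summated_value, round(item_mean, 1))  # 9 1.8
```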
Please pick a number from the scale to show how much you agree or disagree with each statement and
jot it in the space to the left of the item.
Scale
1 Strongly agree
2 Agree
3 Neutral
4 Disagree
5 Strongly disagree
4. Verbal Frequency Scale
Please pick a number from the scale to show how often you do each of the things listed below and jot it in the space at the left.
Scale
1 Always
2 Often
3 Sometimes
4 Rarely
5 Never
5. Ordinal Scale
The ordinal scale is also a multiple choice item, but the response alternatives do not stand in any fixed relationship with one another; rather, they define an ordered sequence. The responses are ordinal because each category listed comes before the next one in the sequence.
The principal advantage of the ordinal scale is the ability to obtain a measure relative to some other benchmark. The order is the major focus, not simply the chronology.
Ordinarily, when would you or someone in your family read a pocket book at home on a weekday? (Please check only one)
Please rank the books listed below in their order of your preference. Jot the number 1 next to the one you
prefer most, number 2 by your second choice, and so forth.
For each pair of study skills listed below, please put a check mark by the one you most prefer, if you had
to choose between the two.
___ Memorizing
___ Graphic organizer
8. Comparative Scale
The comparative scale is appropriate when making comparisons between one object and one or more others. With this type of scale, one entity can be used as the standard or benchmark by which several others are judged.
The advantage of this scale is that no absolute standard is presented or required; all evaluations are made on a comparative basis, and ratings are all relative to the standard or benchmark used. When no absolute standard exists, the comparative scale approach is applicable. Another advantage is its flexibility: the same two entities can be compared on several dimensions or criteria, and several different entities can be compared with the standard.
The comparative scale is used when the research interest is in comparing one’s own sponsor’s store, brand, institution, organization, candidate, or individual with competing others.
According to Alreck and Settle (1995), comparative scales are more powerful in several ways: they present an easy, simple task to the respondent, ensuring cooperation and accuracy; they provide interval data, rather than only the ordinal values that rankings do; they permit several things that have been compared to the same standard to be compared with one another; and economy of space and time is inherent in them.
Compared to the previous teacher, the new one is… (Check one space)
1 2 3 4 5
Scale
___ 1. Directress
___ 2. Principal
___ 3. Teachers
___ 4. Academic Coordinator
___ 5. Discipline officer
___ 6. Cashier
___ 7. Registrar
___ 8. Librarian
___ 9. Janitor
Please put a check mark in the space on the line below to show your opinion about the school guidance
counselor
Please put a check mark in the space in front of any word that describes your school.
Please pick a number from the scale to show how well each word or phrase below describes your teacher
and jot it in the space in front of each item.
Scale
Not at all 1 2 3 4 5 6 7 Perfectly
Of the last 10 times you went to the library, how many times did you visit each of the following library sections?
___ Reference
___ Periodical
___ Circulation
___ Filipiniana
___ Other (What? __________________)
SCALE COMBINATION
Scale combinations list items together in the same format, sharing a common scale. This saves valuable questionnaire space, reduces the response task, and facilitates recording. The respondents mentally carry the same frame of reference and judgment criteria from one item to the next, so the data are closely comparable.
Several colleges and universities are listed below. Please indicate how safe or risky their location is by circling
the number beside it. If you feel it's very safe, circle a number towards the left. If you feel it's very risky, circle
one towards the right; and if you think it's somewhere in between, circle a number from the middle range that
indicates your opinion.
The table below lists 3 universities, and several characteristics of universities along the left side. Please
take one university at a time. Working down the column, pick a number from the scale indicating your
evaluation of each characteristic and jot it on the space in the column below the university and to the right
of the characteristic. Please fill in every space, giving your rating for each university on each
characteristic.
Scale
Very Poor 1 2 3 4 5 6 Excellent
Please list the ages of all those in your class in the spaces below. Jot the ages of the boys in the top
circles and the ages of the girls in the bottom circles.
♂ ♂ ♂ ♂ ♂ ♂
Boys
♀ ♀ ♀ ♀ ♀ ♀
Girls
NONVERBAL SCALES
Nonverbal scales take the form of pictures and graphs to obtain the data. They are
useful for respondents who have limited ability to read or to understand numeric scales. The scale
extremes visually represent none and all (or total). Picture and graphic scales are most often used
only for personal interview surveys because they are designed for a special need.
Which of the faces indicates your feeling about your math course?
[row of five faces, scored 5 4 3 2 1]
SOCIAL SCALING
20. Sociogram
The sociogram is a graphic representation of sociometric data. In a sociogram, each
individual is represented by an illustrative symbol. The symbols are then connected by arrows
that describe the relationships among the individuals involved. Those chosen most often are
referred to as stars, those not chosen by others are called isolates, and the small groups made up
of individuals who choose one another are called cliques (Best & Kahn, 1990).
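To make the tallying behind a sociogram concrete, here is a minimal sketch in Python. It is our own illustration, not part of Best and Kahn's treatment; the class roster and choices are invented.

# A minimal sketch of tallying sociometric choices. Names and data are
# hypothetical. "Stars" are the most-chosen students, "isolates" are
# never chosen, and mutual choices are the building blocks of cliques.
choices = {
    "Ana":  ["Ben", "Cara"],
    "Ben":  ["Ana"],
    "Cara": ["Ana", "Ben"],
    "Dino": ["Ben"],
    "Ella": [],                      # Ella chose no one
}

# Count how often each student is chosen by classmates.
times_chosen = {name: 0 for name in choices}
for chooser, chosen in choices.items():
    for name in chosen:
        times_chosen[name] += 1

stars = [n for n, c in times_chosen.items() if c == max(times_chosen.values())]
isolates = [n for n, c in times_chosen.items() if c == 0]

# Reciprocal (mutual) choices suggest cliques; each pair is listed once.
mutual_pairs = [(a, b) for a in choices for b in choices[a]
                if a < b and a in choices.get(b, [])]

print("Stars:", stars)          # ['Ben']
print("Isolates:", isolates)    # ['Dino', 'Ella']
print("Mutual pairs:", mutual_pairs)

In an actual sociogram these tallies are drawn as symbols connected by arrows, but the underlying counts are the same.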
Some scales are easily identified as potentially useful for a given information need or
question, while others are clearly inappropriate. The following guidelines help in choosing among them:
1. Keep it simple. The less complex scale should be used. Even after identifying a workable scale,
consider an easier and simpler one.
2. Respect the respondent. Select scales that make the task as quick and easy as possible for
respondents; this reduces non-response bias and improves accuracy.
3. Dimension the response. The dimensions respondents have in mind are not always common to
one another, so some commonality must be discovered. Dimensions must not be obscure or difficult,
and they should parallel respondents' thinking.
4. Pick the denominations. Always use the denominations that are best for respondents. The
data can later be converted to the denominations sought by information users.
5. Choose the range. Categories or scale increments should be about the same breadth as those
ordinarily used by respondents.
6. Group only when required. Never put things into categories when they can easily be
expressed in numeric terms.
7. Handle neutrality carefully. If respondents genuinely have no preference, they'll resent the
forced choice inherent in a scale with an even number of alternatives. If feelings aren't
especially strong, an odd number of scale points may result in fence-riding or piling up at the
midpoint, even when some preference exists.
233
8. State instructions clearly. Even the least capable respondents must be able to understand. Use
language that’s typical of the respondents. Explain exactly what the respondent should do
and the task sequence they should follow. List the criteria by which they should judge and
use an example or practice if there is any doubt.
9. Always be flexible. The scaling techniques can be modified to fit the task and the
respondents.
10. Pilot test the scales. Individual scales can be checked with a few typical respondents.
References
Alreck, P. L., & Settle, R. B. (1995). The survey research handbook (2nd ed.). Chicago: Irwin
Professional Books.
Anderson, L. W. (1981). Assessing affective characteristics in the schools. Boston: Allyn and
Bacon.
Bailey, K. D. (1995). Methods of social research (4th ed.). New York: Macmillan.
Best, J. W., & Kahn, J. V. (1995). Research in education (6th ed.). New Jersey: Prentice Hall.
Dilts, R. B. (1999). Sleight of mouth: The magic of conversational belief change. Capitola, CA:
Meta Publications.
Halstead, J. M., & Taylor, M. J. (2000). Learning and teaching about values: A review of recent
research. Cambridge Journal of Education, 30, 169-203.
Knafo, A., & Schwartz, S. H. (2003). Parenting and adolescents' accuracy in perceiving parental
values. Child Development, 74(2), 595-611.
Meece, J., Parsons, J., Kaczala, C., Goff, S., & Futterman, R. (1982). Sex differences in math
achievement: Toward a model of academic choice. Psychological Bulletin, 91, 324-348.
National Council for Accreditation of Teacher Education. (2001). Professional standards for the
accreditation of schools, colleges, and departments of education. Washington, DC: Author.
234
Overmier, J. B., & Lawry, J. A. (1979). Conditioning and the mediation of behavior. In G. H.
Bower (Ed.), The psychology of learning and motivation (pp. 1-55). New York: Academic Press.
Richardson, V. (1996). The role of attitude and beliefs in learning to teach. In J. Sikula, T.
Buttery, & E. Guyton (Eds.), Handbook of research on teacher education (pp. 102-119). New
York: Macmillan.
Sta. Maria, M., & Magno, C. (2007). Dimensions of Filipino negative emotions. Paper presented
at the 7th Conference of the Asian Association of Social Psychology, July 25-28, 2007, Kota
Kinabalu, Sabah, Malaysia.
Taylor, E. (2003). Making meaning of non-formal education in state and local parks: A park
educator's perspective. In T. R. Ferro (Ed.), Proceedings of the 6th Pennsylvania Association of
Adult Education Research Conference (pp. 125-131). Harrisburg, PA: Temple University.
Taylor, E., & Caldarelli, M. (2004). Teaching beliefs of non-formal environmental educators: A
perspective from state and local parks in the United States. Environmental Education Research,
10, 451-469.
Zimbardo, P. G., & Leippe, M. R. (1991). The psychology of attitude change and social
influence. New York: McGraw Hill.
235
Empirical Report

Development of the Academic Self-Regulation Scale

Carlo Magno
MR Aplaon
Carmine Gañac
Sheena Marie Morales
De La Salle University-Manila

Abstract

Exploratory (EFA) and confirmatory factor analysis (CFA) were conducted to verify the subscales of a
constructed academic self-regulation scale. The subscales were based on Zimmerman and Pons' (1986) model
of self-regulation. Items were written to assess the self-regulatory strategies that students use in an academic
setting. In the EFA, eight factors were extracted that were consistent with the eight factors proposed in the
previous self-regulation model. The CFA showed that self-regulation is indeed composed of eight factors
(goal-setting, organizing and transforming information, self-consequences, seeking information, seeking
social assistance, environmental structuring, rehearsal and mnemonic strategies, and self-evaluation). A
one-factor measurement model with eight components had an adequate fit (χ2=109.68, χ2/df=.5, RMS=.07,
Joreskog GFI=.90, Bentler-Bonnet Normed Fit Index=.90). Each of the subscales of academic self-regulation
showed adequate internal consistencies. Intercorrelations of all eight subscales showed convergence with
significant correlations (p<.01).

Self-regulated learning has been a topic of considerable interest in educational psychology. A
self-regulated strategy is defined as a self-directed process by which learners transform their mental
abilities into academic skills. Self-regulated strategies are also actions directed at acquiring
information or skill that involve agency, purpose or goals, and instrumentality self-perceptions by a
learner. Furthermore, self-regulated learners are generally characterized as active learners who
efficiently manage their own learning experiences in many different ways. They have adaptive learning
goals, are persistent in their efforts to reach those goals, and are proficient at monitoring and, if
necessary, modifying their strategy use in response to shifting task demands. In short, self-regulated
learners are motivated, independent, and metacognitively active participants in their own learning
(Zimmerman, 1990). Several studies have also developed measures for self-regulation. One such measure,
developed by Zimmerman and Martinez-Pons (1986), is the self-regulation interview.

From the existing literature, a number of categories of self-regulated strategies were identified.
The categories were drawn from social learning theory and research (e.g., Bandura, 1982; Schunk, 1984;
Zimmerman, 1983). They included goal-setting, environmental structuring, self-consequences, and
self-evaluating. Several other categories were included on the basis of closely allied theoretical
formulations, namely the strategies of organizing and transforming (Baird, 1983, cited in Zimmerman &
Pons, 1986), seeking and selecting information (Wang, 1983; Baird, 1983, cited in Zimmerman & Pons,
1986), and rehearsal and mnemonic strategies (McCombs, 1984, cited in Zimmerman & Pons, 1986). Also
included are the strategies of seeking social assistance and reviewing previously compiled records such
as class notes and notes on text material (Wang, 1983, cited in Zimmerman & Pons, 1986).

There has been increasing research on the mechanisms through which students regulate their own
motivation and academic learning (e.g., Corno, 1989; Harris, 1990; Zimmerman & Schunk, 1989). In
addition, task-related cognitive and metacognitive strategies such as mnemonic encoding and
self-monitoring have been the center of much research on self-regulated learning. According to
Zimmerman and Pons (1986), the social cognitive theory of academic regulation states that students
regulate the motivational, affective, and social determinants of their intellectual functioning as well
as its cognitive aspects, and that the exercise of self-regulatory skills produces beneficial results.
It is also said that good self-regulators do better academically than poor regulators, even after
controlling for other potentially influential factors.

Moreover, Zimmerman (1981) theorized that human achievement is heavily dependent on the use of
self-regulation, particularly in competitive and evaluative settings. In the earlier research of Bandura
(1982), Schunk (1984), and Zimmerman (1983), academic achievement is also the one realm where
self-regulated learning processes are assumed to be crucial. In the upper grades, success in school is
believed to be highly dependent on student self-regulation, especially in unstructured settings where
studying often occurs. On the other side, Krouse and Krouse (1981) found that a major cause of
underachievement is students' inability to control their own behavior. In addition, Zimmerman and his
colleagues have been interested in learning how students become willing and able to assume
responsibility for controlling or self-regulating their academic achievement. Research also suggests
that learning self-regulating skills can lead to academic achievement and an increased sense of
efficacy.

Self-regulation fosters learning. As indicated by Ertmer, Newby, and MacDougall (1996), students
with high levels of self-regulation possess attributes and skills that would be likely to enhance
performance in a case-based course. It is generally acknowledged that self-regulation is not an
automatic process for all learners. Schunk (1989) stated that self-regulation does not automatically develop
236
as people become older, nor is it passively acquired from the environment.

In the social cognitive theory of Bandura (1986), self-regulation operates through a set of
psychological subfunctions. Zimmerman (1986) added that these include one's self-monitoring of
activities, applying personal standards for judging and directing one's performance, enlisting
self-reactive influences to guide and motivate one's effort, and employing proper strategies to achieve
success.

The present study confirmed the factors of self-regulation using the model of Zimmerman and
Martinez-Pons (1986). This investigation was undertaken to validate a scale for measuring academic
self-regulation. There is a need to construct direct items and confirm the factors because the
instrument developed by Zimmerman and Pons is a structured interview questionnaire in which open-ended
responses are gathered. The instrument they developed is difficult to score, and the data collected are
qualitative, so scoring issues arise. Having direct items to measure academic self-regulatory
strategies is easier and more efficient.

Self-regulation is an action directed at acquiring information or skill that involves agency,
purpose or goals, instrumentality, and self-perceptions by a learner (Zimmerman, 1983). Social
cognitive theory explains how people acquire and maintain certain behavioral patterns while also
providing the basis for intervention strategies (Bandura, 1997). These are the classes underlying
self-regulation theory (Zimmerman & Pons, 1986):

Goal setting. Most theories of self-regulation emphasize its inherent link with goals. A goal
reflects one's purpose and refers to quantity, quality, or rate of performance (Locke & Latham, 1990).
Goal setting involves establishing a standard or objective to serve as the aim of one's actions. Goals
are involved across the different phases of self-regulation: forethought (setting a goal and deciding
on goal strategies); performance control (employing goal-directed actions and monitoring performance);
and self-reflection (evaluating one's goal progress and adjusting strategies to ensure success)
(Zimmerman, 1998). It also involves sequencing, timing, completing, time management, and pacing.

Organizing and transforming information. The use of outlines, summaries, rearrangement of
materials, highlighting, flash cards, and drawing of pictures, diagrams, and webbing or mapping.

Self-consequences. This refers to the person's self-reinforcement, how he or she handles
motivation, the arrangement or imagination of punishments, and delay of gratification.

Seeking and selecting information. This entails where and from whom a person seeks information:
the use of library resources, internet resources, and reviewing cards, and the rereading of records,
tests, and books.

Seeking social assistance. Seeking assistance from peers, teachers, and adults.

Rehearsal and mnemonic strategies. These may be written or verbal, overt or covert. They include
using mnemonic devices, teaching someone else the material, making sample questions, using mental
imagery, and using repetition.

Environmental structuring. Selecting or arranging the physical setting; isolating, eliminating,
or minimizing distractions; and breaking up study periods and spreading them over time.

Self-evaluation. A process of checking quality or progress. It involves task analysis,
self-instructions, enactive feedback, and attentiveness.

Method

Sample

The participants were 110 college students in the initial exploratory analysis. Another sample
of 310 was drawn for confirming the proposed factor structure. The participants were from different
private and public schools in Metro Manila, specifically De La Salle University-Manila, Ateneo de
Manila University, Polytechnic University of the Philippines, University of Sto. Tomas, and Lyceum of
the Philippines. Their ages ranged from 17 to 25 years. Convenience sampling was done by going to the
nearest colleges in Metro Manila and looking for college students who were willing to participate in
the study.

Test Development Design

The measure was constructed by confirming the factors proposed in previous studies on
self-regulation. This was done to verify that self-regulation has eight factors: goal-setting,
organizing and transforming information, seeking and selecting information, seeking social assistance,
self-consequences, rehearsal and mnemonic strategies, environmental structuring, and self-evaluation.

Search for Content Domain

The conceptual definitions of self-regulation and of the factors that compose academic
self-regulation were first established to guide the construction of the items. The conceptual
framework was based on the eight factors of Zimmerman and Martinez-Pons (1986): goal-setting,
organizing and transforming information, self-consequences, seeking information, seeking social
assistance, environmental structuring, rehearsal and mnemonic strategies, and self-evaluation.

Item Writing

The items constructed to make up the scale for academic self-regulation were based on the
definitions of each subscale. The scale was composed of 110 items. The items reflect the specific
strategies that the student uses while engaging in a learning task. The items
237
also show how students engage in different kinds of learning and their methods for studying,
completing their homework, and participating in class.

The constructed items were then distributed according to the factor they fall under. To cluster
the items appropriately, the conceptual definition of each factor was used. The items were modified to
fit the factors they belong to.

Selection of Scaling Technique

The responses of the participants were measured with the use of a four-point Verbal Frequency
Scale. Responses were based on how frequently participants used the particular study strategy
described in each item. This type of scale provides answers in the form of coded data that are
comparable and can be manipulated. The scale points are always, often, rarely, and never. Points were
assigned to each scale point: "always" corresponds to four (4) points, "often" to three (3) points,
"rarely" to two (2) points, and "never" to one (1) point.

Item Review

Experts in the study of self-regulation and scale development checked and reviewed the items
that were constructed. Each item was judged as accepted, rejected, or needing revision. The initial
pool of items reviewed was composed of 21 items for goal-setting, 26 items for organizing and
transforming information, 14 for self-consequences, 15 for seeking and selecting information, 12 for
seeking social assistance, 16 for rehearsal and mnemonic strategies, 12 for environmental structuring,
and 15 for self-evaluation.

Final Form

The items were revised based on the reviews provided. The first draft of the instrument was
composed of 151 items, and the final form was composed of only 110 items. The items in the final form
were then coded (e.g., GS for goal-setting, SE for self-evaluation, RM for rehearsal and mnemonic
strategies) for convenience in encoding the responses. There were 15 items for the goal-setting
factor, 24 items for organizing and transforming, 10 items for self-consequences, 13 items for seeking
information, 8 for seeking social assistance, 15 items for rehearsal and mnemonic strategies, 10 items
for environmental structuring, and 15 items for self-evaluation, for a total of 110 items in the final
form. After the revision of items, the test was administered.

Reliability Analysis

In the data analysis, the reliability of the test instrument was determined. The 110 items were
intercorrelated to show internal consistency through interitem correlation. The Cronbach's alpha of
each subscale was determined to assess the internal consistencies of the items.

Validity Analysis

A confirmatory factor analysis (CFA) was conducted to determine the plausibility of the
one-factor structure with eight components of academic self-regulation. The fit of the hypothesized
one-factor model was assessed by examining several fit indices, including three absolute and one
incremental fit index. The minimum fit function chi-square, the root mean square error of
approximation (RMSEA), and the standardized root mean square residual (SRMR) are absolute fit indices.
The chi-square statistic (χ2) assesses the difference between the sample covariance matrix and the
implied covariance matrix from the hypothesized model (Fan, Thompson, & Wang, 1999). A statistically
non-significant χ2 indicates adequate model fit. Because the χ2 test is very sensitive to large sample
sizes (Hu & Bentler, 1995), additional absolute fit indices were examined. The RMSEA is moderately
sensitive to simple model misspecification and very sensitive to complex model misspecification (Hu &
Bentler, 1998). Hu and Bentler (1999) suggest that values of .06 or less indicate a close fit. The
SRMR is very sensitive to simple model misspecification and moderately sensitive to complex model
misspecification (Hu & Bentler, 1998). Hu and Bentler (1999) suggest that adequate fit is represented
by values of .08 or less.

Results

The 110 items were first reduced into underlying factors by conducting a principal components
analysis with varimax rotation. The eigenvalues showed that 10 factors could be extracted with values
above 1.00, although the ninth and tenth eigenvalues were far from the values of the first eight. This
justifies the appropriateness of an eight-factor scale. The items contained in their respective
factors were maintained. The labels for each factor were also maintained, considering that the items
were grouped as hypothesized. Other items were omitted because of low factor loadings (below 0.4).
Only 77 items remained. The factors were goal setting (22 items), organizing and transforming (9
items), self-consequences (17 items), seeking and selecting information (8 items), seeking social
assistance (5 items), rehearsal and mnemonic strategies (5 items), environmental structuring (3
items), and self-evaluation (8 items). High factor loadings were obtained for each item on its
respective factor.

Using a separate sample (N=310), the confirmatory factor analysis (CFA) verified the eight
subscales of self-regulation in a one-latent-factor solution. All subscales had a significant path
estimate in a one-factor solution for academic self-regulation with adequate fit
238
(χ2=109.68, χ2/df=.5, RMS=.07, Joreskog GFI=.90, Bentler-Bonnet Normed Fit Index=.90). Other
supplementary goodness-of-fit indices were also adequate for a one-factor measurement model solution
(see Tables 1 and 2).

The standard deviation, skewness, kurtosis, CFA parameter, and Cronbach's alpha of each factor
were obtained (see Table 3). The relatively large standard deviations show that the scores are
dispersed, while the skewness and kurtosis values indicate a normal distribution of scores. High
internal consistencies were obtained for each subscale of the instrument, with Cronbach's alpha >.70.

The factor scores of each subscale were also intercorrelated. The intercorrelations all showed
a positive magnitude, where each dimension increases with one another. This means that all eight
factors are internally consistent and converge with one another (see Table 4).

Discussion

…"self-evaluation," which means that a person undergoes a process of checking the quality or
progress of his or her own learning. It involves task analysis, self-instructions, enactive feedback,
and attentiveness.

The factors are reliable given their Cronbach's alpha values of 0.88, 0.90, 0.79, 0.80, 0.72,
0.85, 0.65, and 0.87. This shows that each factor is consistent with its proposed definition. No new
factors emerged from the factor analysis, although the scree plot illustrates possible factors that
could still be drawn out. The items were not reclassified across factors, but some were omitted
because they did not reach a factor loading of 0.4. All factors were accepted, as shown by their
eigenvalues of 1.0 and above.

The value obtained for the overall internal consistency is 0.96, which is considerably high. The
test has just undergone its initial stage, which is only preliminary pilot testing; thus, it is
necessary to further confirm and establish more reliable test properties.

References

Zimmerman, B. J. (1983). Social learning theory: A contextualist account for cognitive functioning.
In C. J. Brainerd (Ed.), Recent advances in cognitive developmental theory (pp. 1-49). New York:
Springer.

Zimmerman, B. (1990). Self-regulated learning and academic achievement: An overview. Educational
Psychologist, 25, 3-17.
Table 1
Goodness of Fit Estimates
Values
Discrepancy Function 0.50
Maximum Residual Cosine 0.00
Maximum Absolute Gradient 0.00
ICSF Criterion 0.00
ICS Criterion 0.00
ML Chi-Square 109.68
Degrees of Freedom 20.00
p-level 0.00
RMS Standard Residual 0.07
Table 2
Single Sample Fit Indices Values
Joreskog GFI 0.87
Joreskog AGFI 0.77
Akaike Information Criterion 0.65
Schwarz’s Bayesian Criterion 0.89
Browne-Cudeck Cross Validation Index 0.65
Independence Model Chi-square 910.58
Independence Model df 28.00
Bentler-Bonnet Normed Fit Index 0.88
Bentler-Bonnet Non-normed Fit Index 0.86
Bentler Comparative Fit Index 0.90
James-Mulaik-Brett Parsimonious Fit Index 0.63
Bollen’s Rho 0.84
Bollen’s Delta 0.90
240
Table 3
Estimates of the Subscales of Self-regulation
Subscale | N | M | SD | Skewness | Kurtosis | Initial Eigenvalues (N=110) | CFA Parameter Estimates (N=310) | Cronbach's Alpha
Goal Setting | 310 | 42.06 | 7.96 | -0.47 | 0.18 | 22.79 | 5.42* | 0.88
Organizing and Transforming Information | 310 | 61.35 | 12.02 | -0.11 | 0.39 | 5.60 | 9.94* | 0.90
Self-consequence | 310 | 30.00 | 5.51 | -0.07 | 0.14 | 4.89 | 4.13* | 0.79
Seeking Information | 310 | 35.68 | 5.66 | -0.18 | -0.18 | 3.03 | 3.81* | 0.80
Seeking Social Assistance | 310 | 21.20 | 3.95 | -0.20 | -0.22 | 3.03 | 2.22* | 0.72
Rehearsal and Mnemonics | 310 | 38.45 | 7.50 | -0.01 | -0.25 | 2.67 | 5.95* | 0.85
Environmental Structuring | 310 | 29.91 | 4.68 | -0.32 | -0.34 | 2.50 | 2.89* | 0.70
Self-evaluating | 310 | 40.59 | 7.69 | 0.09 | -0.01 | 2.29 | 5.87* | 0.87
*p<.05
Table 4
Correlation Matrix of the Factors of Self-regulation
(1) (2) (3) (4) (5) (6) (7) (8)
(1) 1.00
(2) 0.64** 1.00
(3) 0.60** 0.62** 1.00
(4) 0.44** 0.51** 0.61** 1.00
(5) 0.24** 0.43** 0.32** 0.42** 1.00
(6) 0.47** 0.65** 0.54** 0.55** 0.61** 1.00
(7) 0.51** 0.54** 0.54** 0.40** 0.22** 0.42** 1.00
(8) 0.45** 0.63** 0.52** 0.50** 0.53** 0.68** 0.45** 1.00
**Significant at 0.01
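The internal consistencies reported in this study are Cronbach's alpha values. As a rough illustration of how alpha is computed from a respondents-by-items matrix, here is a minimal sketch in Python; this is our own aid, not the authors' procedure, and the four-point Verbal Frequency responses below are invented.

import numpy as np

def cronbach_alpha(item_scores):
    # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-4 Verbal Frequency Scale responses
# (rows = respondents, columns = items of one subscale).
responses = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 3, 4],
]
print(round(cronbach_alpha(responses), 2))

A real analysis like the one above would run this over hundreds of respondents and each of the eight subscales in turn.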
241
Chapter 6
Art of Questioning
Chapter Objectives:
Lessons
1 Functions of Questioning
2 Types of Questions
3 Taxonomic Questioning
4 Practical Considerations in Questioning
242
Lesson 1
Functions of Questioning
Every time we get inside our classrooms and deal with our students in various teaching
and learning circumstances, our ability to ask questions is brought to the fore. Being
intricately embedded in our pedagogies and assessments, questioning is one of the most basic
processes we deal with. But just how appropriate are our questions? To answer this, we need to
discuss the art of questioning.
To begin with, we ask ourselves this fundamental question, “Why do we ask questions?”
From our teaching methods and strategies to our assessments, questioning is inevitable. From the
transmissive to more constructivist approaches of teaching, asking questions is always a
“mainstay” process. To answer this fundamental question, we need to first look into the function
of questioning as it works in ourselves, and then in terms of how it works in the learning
process in general.
As you are reading this chapter, or even the previous chapters of this book, you
effortlessly ask questions. Why is that? How important is that process in our understanding of the
concepts we are trying to learn about? Whenever you ask a question, regardless of whether you just
keep it in mind or express it verbally, you activate your senses and drive your attention to what
you are currently processing. As you engage a reading material, for example, and you ask
questions about what you are reading, you are bringing yourself into a deeper level of the
learning experience where you become more “into the experience.” Obviously, questioning
brings you to the level of focused and engaged learning as you become particularly attentive to
everything that takes place within and around you.
In the classrooms, we ask our students many questions without always being aware that
the kinds of questions we ask make or break students’ deep academic engagement. At this
juncture, therefore, we emphasize the point that, as teachers, just asking questions is not enough
to bring our students to the level of engagement we desire. What matters in this case is the
quality of the questions we ask them. The effects of questioning on our students differ,
depending on how “good” or “bad” our questioning is.
From various studies, we now know that “good” questioning positively affects students’
learning. Teachers’ good questioning boosts students’ classroom engagement because the
atmosphere where good questions are tossed encourages them to push themselves some more
into the state of inquiry. If students feel that questions are interesting, sensible, and important,
they are driven not only to “know more” but also “think more.” Good questioning encourages
deep thinking and higher levels of cognitive processing, which can result in better learning of the
subject matter in focus. One distinct mark of a classroom that employs good questioning is that
students generally participate in a scholarly conversation. This happens because teachers’ good
questioning encourages the same good questioning from the students as they discuss with their
teachers and with each other.
On the contrary, bad questioning distorts the climate of inquiry and investigation. It
undermines the students’ motivation to “know more” and “think more” about the subject matter
in focus. If, for example, a teacher’s question makes the student feels stupid and impossibly
capable of answering, the whole process of questioning leads to a breakdown of students’
243
academic engagement. Indeed, it is important for a teacher to always think of his or her
intentions for tossing questions in the class. Certainly, questions encouraged by a sound motive
will work better than ill-motivated ones.
Lesson 2
Types of Questions
Now that you have explored the kinds of motives that may encourage or
undermine students' learning, it is helpful if you focus on those motives that establish an
atmosphere of inquiry in your classrooms. Focus on those intentions that will allow for the use of
questioning as a tool for deep learning rather than those that embarrass students and discourage
them from engaging your lessons.
However, because teaching is not a trial-and-error endeavor, motives might not be
enough to guide our questioning so that it produces desirable effects on our students' learning. With
the sound motive being the undercurrent of our questioning, we need to also know what types of
questions to ask to engage our students.
Interpretive Question
This type of question calls for students’ interpretation of the subject matter in focus. It
usually asks students to provide missing information or ideas so that the whole concept is
understood. An interpretive question assumes that, as students engage the question, they monitor
their understanding of the consequences of the information or ideas. In a class with primary
graders, a teacher narrated a story about a boy in a dark-blue shirt who was lost in a crowd of
people at a carnival one evening, and whose mother roved around for hours to find him. After
narrating the story, one of the questions the teacher asked her pupils was, "If the boy wore a
bright-colored shirt, what could change in the mother's effort in looking for the boy?" Questions
that call for interpretation of a situation are a powerful tool for activating your students'
analytical ability.
Inference Question
If the question you ask intends that students go beyond available facts or information and
focus on identifying and examining the suggestive clues embedded in the complex network of
facts or information, you may toss up an inference question. After a series of discussions on
the Katipunan revolution in a Philippine history class, the teacher presented a picture that
appeared to capture a perspective of the Katipunan revolution. As the teacher showed the picture,
he asked, "What do you know by looking at this picture?" Having learned about the Katipunan
revolution from its different angles, students were prompted to explore clues that may suggest
certain perspectives of the event, and to focus on a more salient clue that represented one
perspective, such as, for instance, the common people’s struggles during the revolution, or the
bravery of those who fought for the country, or the heroism of its leaders. Inference questions
encourage students to engage in higher-order thinking and organize their knowledge rather than
just randomly fire out bits and pieces of information.
245
Transfer Question
Questioning is one of the processes that affect transfer (Mayer, 2002). Transfer questions
are tools for a specific type of inference where students are asked to take their knowledge to new
contexts, or bring what they already know in one domain to another domain. Questions of this
type bring students to a level of thinking that goes beyond just using their knowledge where it is
used by default. For example, after a lesson on the literary works of Edgar Allan Poe, students
were already familiar with Poe's literary style or approach. So that the teacher can gauge the
students' familiarity with and understanding of Poe's rhetorical "trademark," the teacher thinks of a
literary work from a different source, let us say, one from the long list of fairy tales. Then the
teacher asks a transfer question: "Imagine that Edgar Allan Poe wrote his version of the fairy
tale 'Jack and the Beanstalk.' If you were making a critical review of his version of the story,
what would you expect to see in its rhetorical quality?" This question prompts the students to bring
their knowledge of Poe's rhetorical style to a new domain, that is, a different literary piece with a
different rhetorical quality. It further encourages the students to thresh out only the
relevant knowledge that must be transferred and, therefore, helps them account for their learning
of a subject matter.
Predictive Question
Asking predictive questions allows students to think in the context of a hypothesis.
Through questions of this type, students infer what is likely to happen given the
circumstances at hand. In other words, students are compelled to think about the "what if" of the
phenomenon under study, mindful of the circumstances in focus. This type of question has long
been used in the natural sciences, but is certainly not for their exclusive use. In any subject area,
we can let our students think scientifically. One of the ways to do so is to let them engage our
predictive questions or to drive them to raise the same type of question in the class. Predictive
questions prompt the students to go beyond the default condition and infer on what is likely to
happen if some circumstances change. Here, students make use of higher levels of cognitive
processing as they estimate probabilities.
Metacognitive Question
The types of questions discussed above all focus on students’ cognitive processes. To
bring students into the level of regulation over their own learning, we also need to ask
metacognitive questions. Questions of this type allow students to think about how they are
thinking, and learn about how they are learning your course lessons. Successful learners tend to
show higher level of awareness of how they are thinking and learning. They show clear
understanding of how they struggle with academic tasks, comprehend written texts, solve
problems, or make decisions. A metacognitive question invites students to know how they know,
and, thus, become more aware of the processes that take place within them while they are
thinking and learning. In a math class, for instance, the teacher not only asks the student to solve
a word problem but also to describe how he or she was able to solve it.
246
Lesson 3
Taxonomic Questioning
After trying your best to formulate questions of every type discussed above,
we now bring you to a discussion of planning your questioning in terms of taxonomic
structure. Questions differ not only in terms of types but also in terms of what cognitive
processes are involved based on the taxonomy of learning targets you are using. For our students
to benefit more from our questioning, it is necessary to plan our questioning taxonomically.
In Chapter 2 of this book we learned about the different taxonomic tools for setting your
learning intents or target. These tools also serve as frameworks for planning and constructing
your questions. Because questioning influences the quality of students’ reasoning, the questions
we ask our students to respond to must be pegged on certain levels of cognitive processes
(Chinn, O’Donnell, & Jinks, 2000). For example, Bloom’s taxonomy provides a way of
formulating questions in various levels of thinking, as in the following:
Questions intended for knowledge should encourage recall of information. Such
questions may be What is the capital city of…? or What facts does… tell?
For comprehension, questions should call for understanding of concepts, such as What is
the main idea of…? or Compare the…
Questions at the level of application must encourage the use of information or concept in
a new context, like How would you use…? or Apply… to solve…
If analysis is desired, where students are driven to think critically, the questions must
focus on relationships of concepts and the logic of arguments, such as What is the difference
between…? or How are…and…analogous?
To encourage synthesis, questioning must focus on students’ original thinking and
emergent knowledge, like Based on the information, what could be a good name for…? or What
would…be like if…?
In terms of questioning at the level of evaluation, students are prompted to judge the
ideas or concepts based on certain criteria. Questions may be like Why would you choose…? or
What is the best strategy for…?
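One practical way to keep such stems at hand is a simple look-up table keyed to the levels of the taxonomy. The sketch below is only our own illustration: the stems echo the examples above, while the dictionary structure and function name are hypothetical.

# A minimal sketch of a question-stem bank keyed to Bloom's levels.
bloom_stems = {
    "knowledge":     ["What is the capital city of...?", "What facts does ... tell?"],
    "comprehension": ["What is the main idea of...?", "Compare the..."],
    "application":   ["How would you use...?", "Apply ... to solve ..."],
    "analysis":      ["What is the difference between...?", "How are ... and ... analogous?"],
    "synthesis":     ["Based on the information, what could be a good name for...?",
                      "What would ... be like if ...?"],
    "evaluation":    ["Why would you choose...?", "What is the best strategy for...?"],
}

def stems_for(level):
    # Return the sample stems for a given cognitive level.
    return bloom_stems[level.lower()]

print(stems_for("analysis"))

A teacher planning a lesson could draw one stem per targeted level to make sure the questioning spans the whole taxonomy rather than clustering at recall.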
If you are to use the revised taxonomy where you need to consider both the knowledge
and cognitive process dimensions, it is important that you first identify the knowledge dimension
you wish to focus on, and ask yourself, "What questions will be appropriate for every knowledge
dimension?"
Your clear understanding of the kinds of questions to ask based on the types of
knowledge in focus helps you to categorically focus on any of those types of knowledge,
depending on what is relevant to your teaching and assessment at any given time. After
anchoring your questions into a particular type of knowledge, the next step is to frame your
question so that it conveys the relevant cognitive process needed for a successful learning of the
subject matter. If your focus is factual knowledge, you can toss up different questions that vary
according to the cognitive processes. You can raise a question on factual knowledge that
necessitates the use of recall (remember) or synthesis (create), depending on your learning
intents. You can navigate in the same way across the different levels of cognitive processing
while anchoring on any other type of knowledge.
You may also try out the alternative taxonomic tools discussed in Chapter 2, and see
how you can brush up on your art of questioning while keeping on track towards your
learning intents. When you wish to verify the validity of your questions, always go back to the
conceptual description of the taxonomy. It should be an important process as you build on your
art of questioning so that, aside from its artistic sense, your questioning also becomes scientific
insofar as the teaching-and-learning process is concerned.
249
Lesson 4
Practical Considerations in Questioning
We now give you some tips in questioning. These tips are add-on elements to the items
that have already been discussed in the preceding section of this chapter.
References
Airasian, P. W. (2000). Assessment in the classroom: A concise approach (2nd ed.). USA:
McGraw-Hill Companies.
Chinn, C. A., O’Donnell, A. M., & Jinks, T. S. (2000). The structure of discourse in
collaborative learning. Journal of Experimental Education, 69, 77-97.
Mayer, R. E. (2002). The promise of educational psychology Volume II: Teaching for
meaningful learning. NJ: Merrill Prentice Hall.
251
Chapter 7
Grading Students
Chapter Objectives
Lessons
1 Defining Grading
2 The Purposes of Grading
Feedback
Administrative Purposes
Discovering Exceptionalities
Motivation
3 Rationalizing Grades
Absolute/ Fixed Standards
Norms
Individual Growth
Achievement Relative to Ability
Achievement Relative to Effort
252
Lesson 1
Defining Grading
An effective and efficient way of recording and reporting evaluation results is very important
and useful to the persons concerned in the school setting. Hence, it is very important that students'
progress is recorded and reported to them, their parents, teachers, school administrators,
counselors, and employers as well, because this information is used to guide and motivate
students to learn, to establish cooperation and collaboration between the home and the school, and
to certify students' qualifications for higher educational levels and for employment. In the
educational setting, grades are used to record and report students' progress. Grades are essential
in education in that it is through them that students' learning can be assessed, quantified, and
communicated. Every teacher needs to assign grades based on assessment tools such
as tests, quizzes, projects, and so on. Through these grades, achievement of learning goals can be
communicated to students and parents, teachers, administrators, and counselors. However, it
should be remembered that grades are just one part of communicating student achievement;
therefore, they must be used with additional feedback methods.
According to Hogan (2007), grading implies (a) combining several assessments, (b)
translating the result into some type of scale that has evaluative meaning, and (c) reporting the
result in a formal way. From this definition, we can clearly say that grading is more than the
quantitative values many take it to be; rather, it is a process. Grades are frequently
confused with scores; however, it must be clarified that scores make up the grades. Grades
are what is written in students' report cards, which compile students' progress
and achievement throughout a quarter, a trimester, a semester, or a school year. Grades are
symbols used to convey the overall performance or achievement of a student and they are
frequently used for summative assessments of students. Take for instance two long exams, five
quizzes, and ten homework assignments as requirements for a quarter in a particular subject area.
To arrive at grades, a teacher must be able to combine scores from the different sets of
requirements and compute or translate them according to the assigned weights or percentages.
Then, he/ she should also be able to design effective ways on how he/ she can communicate it
with students, parents, administrators, and others concerned. Another term, though less
commonly used, for the process is marking. Figure 1 shows a graphical summary of the grading
process.
Figure 1. The grading process: scores from several assessments are combined; the combined scores
are TRANSLATED into scales with evaluative meaning; and the resulting grades are REPORTED, that is,
communicated to teachers, students, parents, administrators, etc.
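To make the combine-translate-report sequence concrete, here is a minimal sketch in Python. The requirements echo the earlier example (two long exams, five quizzes, ten homework assignments), but the weights, scores, and letter cut-offs are invented for illustration.

# A minimal sketch of the grading process: scores from several
# requirements are COMBINED using assigned weights, TRANSLATED into a
# scale with evaluative meaning, and then ready to be REPORTED.
# The weights, cut-offs, and scores below are hypothetical.
weights = {"long_exams": 0.40, "quizzes": 0.35, "homework": 0.25}

# Each entry: (points earned, points possible) for that requirement.
scores = {
    "long_exams": (180, 200),   # two long exams
    "quizzes":    (80, 100),    # five quizzes
    "homework":   (90, 100),    # ten homework assignments
}

percent = sum(weights[k] * earned / possible * 100
              for k, (earned, possible) in scores.items())

# Translate the combined percentage into a letter symbol.
if percent >= 90:   grade = "A"
elif percent >= 80: grade = "B"
elif percent >= 70: grade = "C"
else:               grade = "F"

print(f"{percent:.1f}% -> {grade}")   # 86.5% -> B

Note that the letter cut-offs here embody an absolute-standards rationale; the later lesson on rationalizing grades discusses alternatives to fixed cut-offs.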
253
Review Questions:
Lesson 2
The Purposes of Grading
Grading is very important because it serves many purposes. In the educational setting, the
primary purpose of grades is to communicate to parents and students the students' progress and
performance. For teachers, students' grades can serve as an aid in assessing and reflecting on
whether they were effective in implementing their instructional plans, whether their instructional
goals and objectives were met, and so on. Administrators, on the other hand, can use students'
grades for a more general purpose than teachers: they can use grades to
evaluate programs, identify and assess areas that need to be improved, and determine whether or
not the curriculum goals and objectives of the school and the state have been met by the students
through their institution. From the purposes identified, grading in the educational setting can be
sorted into four major purposes.
Feedback
Feedback plays an important role in the field of education in that it provides
information about the students' progress or lack of it. Feedback can be addressed to three distinct
groups concerned in the teaching and learning process: parents, students, and teachers.
Feedback to Parents. Grades, especially those conveyed through report cards, provide critical
feedback to parents about their children's progress in school. Aside from grades in the report
cards, however, feedback can also be obtained from standardized tests and teachers' comments.
Grades also help parents to identify the strengths and weaknesses of their child.
Depending on the format of the report card, parents may also receive feedback about their
children's behavior, conduct, social skills, and other variables that might be included in the report
card. From a general point of view, grades basically tell parents whether their child was able to
perform satisfactorily.
However, parents are not fully aware of the several separate assessments students have
taken that comprise their grades. Parents may see some of these assessments, but not all.
Therefore, students' grades, communicated formally to parents, give parents some assurance that
they are seeing an overall summary of their children's performance in school.
Feedback to Students. Grades are one way of providing feedback to students, in that it
is through grades that students can recognize their strengths and weaknesses. Upon knowing
these strengths and weaknesses, students can further develop their competencies and
improve on their deficiencies. Grades also help students to keep track of their progress and
identify changes in their performance.
Personally, I feel that the weight of this feedback is directly proportional to the age and year
level of the students, such that grades are given more importance and meaning by a high school
student than by a grade one student. However, I believe that the motivation grades can
give is equal across different ages and year levels: grade one students (young ones) are
motivated to get high grades because of external rewards, and high school students (older ones)
are also motivated internally to improve their competencies and performance.
255
Administrative Purposes
Promotion and Retention. Grades can serve as one factor in determining whether a student will
be promoted to the next level or not. Through a student's grades, it can be judged whether he or
she has acquired the skills and competencies required for a certain level and achieved the
curriculum goals and objectives of the school and/or the state. In some schools, a student's
grades are a factor taken into consideration for his/her eligibility in joining extracurricular
activities (performing, theater arts, varsity, cheering squads… etc.). Grades are also used to
qualify a student to enter high school or college in some cases. Other policies may arise
depending on the schools’ internal regulations. At times, failing marks may prohibit a student
from being a part of the varsity team, running for officer, joining school organizations, and some
privileges that students with passing grade get. In some colleges and universities, students who
get passing grades are given priority in enrolling for the succeeding term, as compared to
students who get failing grades.
Placement of Students and Awards. Students' grades can also be used for placement.
Grades are factors considered in placing students according to their competencies and
deficiencies, through which teaching can be more focused in terms of developing the strengths
and improving the weaknesses of students. For example, students who consistently get high,
average, or failing grades are each placed in their own sections, wherein teachers can focus on
and emphasize the students' needs and demands to ensure a more productive teaching-learning
process. Another example, which is more domain specific, would be grouping together students
having the same competency in a certain subject. Through this strategy, students who have high
ability in Science can further improve their knowledge and skills by receiving more complex and
advanced topics and activities at a faster pace, and students having low ability in Science can
receive simpler and more specific topics at a slower pace (but making sure they are able to
acquire the minimum competencies required for that level as prescribed by the state curriculum).
Aside from placement of students, grades are frequently used as a basis for academic awards. Many,
if not almost all, schools, colleges, and universities have honor rolls and dean's lists to recognize
student achievement and performance. Grades also determine graduation awards for the overall
achievement or excellence a student has garnered throughout his/her education, whether in a single
subject or for the whole program taken.
256
Program Evaluation and Improvement. Through the grades of students taking a certain
program, the program's effectiveness can be evaluated to some extent. Grades can be one factor
used in determining whether the program was effective or not. Through the evaluation process,
some factors that might have affected the program's effectiveness can be identified and
minimized to improve the program further for future implementations.
Admission and Selection. Organizations external to the school also use grades as a
reference for admission. When students transfer from one school to another, their grades play a
crucial role in their admission. Most colleges and universities also use students' grades in their
senior year of high school together with the grade they acquire on the entrance exam.
However, grades from academic records and high-stakes tests are not the sole basis for
admission; some colleges and universities also require recommendations from the school,
teachers, and/or counselors about students' behavior and conduct. The use of grades is not
limited to the educational context; grades are also used in employment for job-selection purposes
and at times even by insurance companies, which use grades as a basis for giving discounts on
insurance rates.
Discovering Exceptionalities
Counseling Purposes. It is through students' grades that teachers can be alerted to seek
the assistance of a counselor. For instance, if a student who normally performs well in class
suddenly incurs consecutive failing marks, the teacher who observes this should think and
reflect about the probable reasons that caused the student's performance to deteriorate and
consult the counselor about procedures he or she can follow to help the student. If the
situation requires skills that are beyond the capacity of the teacher, then a referral should be made.
Grades are also used in counseling when personality, ability, achievement, intelligence, and other
standardized tests are being administered.
Motivation
Motivation can be provided through grades; most students study hard in order to acquire
good grades, and once they get good grades, they are motivated to study harder to get even higher
grades. Some students are motivated to get good grades because of their enthusiasm for joining
extracurricular activities, since some schools do not allow students with failing grades to join
such activities.
they have failing grades. There are numerous ways on how grades serve as motivators for
students across different contexts (family, social, personal…etc.). Thus, grades may serve as one
of the many motivators for students.
257
Review Questions:
1. What are the different purposes of grades in the educational context? Explain each.
2. How do grades motivate you as a student?
3. How does feedback affect your performance in school?
Lesson 3
Rationalizing Grades
Attainment of educational goals can be made easier if grades are accurate enough to
convey a clear view of a student's performance and behavior. But the question is, what basis shall
we use in assigning grades? Should we grade students in relation to (a) an absolute standard, (b)
norms or the student's peer group, (c) the individual growth of each student, (d) the ability of
each student, or (e) the effort of the student? Each of these approaches has its own
advantages and disadvantages depending on the situation, the test takers, and the test being used. Teachers
are expected to be skillful in determining when to use a certain approach and when not
to.
Absolute Standards. Using absolute standards as a basis for grades means that students'
achievement is related to a well-defined body of content or a set of skills. This basis is strongly
used in criterion-referenced measurement. An example of a well-defined body of
content would be: "Students will be able to enumerate all the presidents of the Philippines and
the corresponding years they were in service." An example of a set of skills would be something
like: "Students will be able to assemble and disassemble the M16 in 5 minutes." However, this
type of grading system is somewhat questionable when different teachers make and use their
own standards for grading students' performance, since not all teachers have the same set of
standards. Teachers' standards may therefore vary across situations and are subjective
according to their own philosophies, competencies, and internal beliefs about assessing students
and about education in general. Hence, this type of grading system is more appropriate when it
is used in a standardized manner, such that the school administration or the state provides
the standards and makes them uniform for all. An example of a test wherein this type of grading is
appropriate would be a standardized test wherein scales come from established norms and grades
are obtained objectively.
Norms. The grades of students in this type of grading system are related to the performance
of all the others who took the same test: the grade one acquires is based not on a set of
standards but on the performance of all the other individuals who took the same test. This means that
students are evaluated based on what is reasonably expected of a representative group. To further
explain this grading system, take for instance a group of 20 students: the student who gets
the most correct answers, regardless of whether he answered 60% or 90% of the items
correctly, gets a high grade; and the student who gets the fewest correct answers,
regardless of whether he answered 10% or 90% of the items correctly, gets a low grade. It can be
observed in this example that (a) 60% would warrant a high grade if it was the highest among all
the participants who took the test, and (b) 90% could possibly be graded as low if it was the
lowest among all the participants who took the test.
Therefore, this grading system is not advisable when the test is administered to a
heterogeneous group, because results would be extremely high or extremely low. Another
problem with this approach is that teachers often lack the competency to create a norm for a certain test,
which makes them settle for absolute standards as the basis for grading students. This approach
also requires a lot of time and effort to create a norm for a sample. It is
also known as "grading on the curve."
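A rough sketch of what norm-referenced marking implies computationally is shown below; the class scores, group labels, and cut points are invented, and real norm groups would be far larger.

# A rough sketch of norm-referenced ("grading on the curve") marking:
# each student's grade depends on standing within the group, not on a
# fixed standard. Scores and cut points here are hypothetical.
scores = {"Abby": 60, "Brix": 48, "Cely": 75, "Dario": 52, "Ely": 66}

ranked = sorted(scores, key=scores.get, reverse=True)
n = len(ranked)

grades = {}
for rank, student in enumerate(ranked):
    top_fraction = (rank + 1) / n      # e.g., 0.2 = top 20 percent
    if top_fraction <= 0.2:
        grades[student] = "High"
    elif top_fraction <= 0.8:
        grades[student] = "Average"
    else:
        grades[student] = "Low"

print(grades)
# The same raw score of 60 could be "High" in a weak group and
# "Low" in a strong one, which is the core caution raised above.
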
259
Individual Growth. In this type of grading system, the level of improvement is seen as more
relevant than the level of achievement. However, this approach is somewhat
difficult to implement, since growth can only be observed by relating students' grades
prior to instruction to their grades after instruction; hence, pretests and posttests are to
be used in this type of grading system. Another issue with this type of grading system is that it
is very difficult to obtain reliable gain or growth scores even with highly refined instruments. This
system of grading disregards standards and the grades of others who took the test; rather, it uses the
amount of progress a student was able to make to determine whether he or she will receive a high
grade or a low grade. Notice that the initial status of students is required in this type of grading
system.
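As a small illustration of growth-based marking, consider pretest and posttest scores like those of Kyle and Lyra discussed later in this lesson; the gain cut-offs in this sketch are of our own choosing.

# A minimal sketch of growth-based grading: the mark depends on the
# gain from pretest to posttest, not on the final level reached.
# Scores and cut points are hypothetical.
records = {
    "Kyle": {"pre": 20, "post": 70},   # large gain, modest final score
    "Lyra": {"pre": 85, "post": 92},   # small gain, high final score
}

for student, r in records.items():
    gain = r["post"] - r["pre"]
    mark = "High" if gain >= 30 else "Average" if gain >= 10 else "Low"
    print(student, "gain =", gain, "->", mark)
# Kyle gain = 50 -> High; Lyra gain = 7 -> Low
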
Achievement Relative to Effort. Similarly, this type of grading system is relative to the
effort that students exert, such that a student who works really diligently and responsibly,
complying with all assignments and activities, doing extra-credit projects, and so on, should receive
a high grade regardless of the quality of work he was able to produce. On the contrary, a student
who produces good work will not be given a high grade if he did not exert enough
effort. Notice that grades are based merely on effort and not on standards.
As mentioned earlier, each of these approaches to arriving at grades has its own strengths and limitations.

Using absolute standards, one can focus on the achievement of students. However, this approach can fail to state reasonable standards of performance and can therefore be subjective. Another drawback is the difficulty of specifying clear definitions; this difficulty can be reduced, but never eliminated.

The second approach is appealing in that it ensures a realism that is at times lacking in the first approach. It avoids the problem of setting standards too high or too low, and a situation where everyone fails is prevented. However, each student’s grade depends on the others, which is quite unfair. A second drawback is the problem of choosing the relevant comparison group: will it be the students in one class, the students in the school, the students in the state, or the students in the past ten years? A teacher must answer these questions to have a rationale for interpreting achievement in relation to other students. A further difficulty of this approach is the tendency to encourage unhealthy competition; when this happens, students become competitors with one another, which is not a good environment for teaching and learning.
The last three approaches can be clustered because they have similar strengths and weaknesses. Their strength is that they focus on the individual, letting the individual define a standard for himself. However, these three approaches share two drawbacks. One is that the conclusions can seem awkward, if not objectionable. For example, a student who performed poorly but exerted effort gets a high grade, while a student who performed well but exerted less effort gets a lower grade. Or: Ken, with an IQ of 150, gets a lower grade than Tom, with an IQ of 80, because Ken should have performed better, while we were pleasantly amazed by Tom’s performance. Or: Kyle, starting with little knowledge of statistics, learned and progressed a great deal, while Lyra, who was already proficient and knowledgeable in statistics, gained less progress; after the term, Kyle gets the higher grade because he progressed more, although Lyra is clearly the stronger student. Conclusions of this kind make people uncomfortable. The second drawback is reliability. Reliability is hard to obtain when differences are used as the basis for students’ grades. Effort, in particular, is hard to measure and quantify, so it rests on subjective judgments and informal observations. Hence, grades from these three approaches, when combined with achievement, are somewhat unreliable. Table 1 presents a summary of the advantages and disadvantages of the different rationales in grading.
[Table 1: Summary of the advantages and disadvantages of the different rationales in grading]
Review Questions:
References
Brookhart, S. M. (2004). Grading. Upper Saddle River, NJ: Pearson Education.

Popham, W. J. (1998). Classroom assessment: What teachers need to know (2nd ed.). Needham Heights, MA: Allyn & Bacon.
EMPIRICAL REPORT
Do Parents and Teaching Approach Matter in Predicting Students’ Grades?

Aldrich B. Alvaera, Ma. Eloisa Bayan, and Darwin P. Martinez
De La Salle University-Manila

Abstract

The study determined whether teaching approach (teacher-centeredness and learner-centeredness), parental involvement, and parental autonomy can significantly predict students’ grades. With a sample of 382 grade four public school students in Metro Manila, the researchers administered the Teacher-Centered Practices Questionnaire and Learner-Centered Practices Questionnaire to measure the teaching approach of the students’ class adviser, and the Perception of Parents Scale for children to measure how involved and autonomous the students’ fathers and mothers were. The students’ general average grade from the previous grading period was used as a measure of their academic performance. Using stepwise forward regression, only mother involvement was revealed to be a significant predictor of academic achievement. Implications and recommendations are included in the discussion.

The parents’ and teachers’ approaches are important factors that influence students in their school performance. For instance, the way parents take care of their children and the way teachers deal with the students have an influence on the students’ behavior in school. The way parents relate with their children and the way teachers handle their students also help explain a student’s academic performance in school. Despite attempts to improve the learning and achievement of students, there are still issues regarding the outcome of the students’ performance. Results of the National Diagnostic Test (NDT) administered in the year 2002-2003 to first year students showed that of the 1.3 million first year students all over the Philippines, only 18% passed the competency level for English, 10% for Science, and only 8% for Math (Evasco, 2005). This implies a very alarming condition because it shows how low the literacy rate of Filipino students is. Given this issue, there is now a need to improve teaching in the Philippine setting.

Teaching approach has proved to be a factor that predicts a student’s achievement. It is defined as a customary way of teaching, described as either teacher-centered or learner-centered (Lefrancois, 2000). The tradition in the school setting has always been a teacher-centered approach, where the students are just passive receivers of knowledge. The underlying concept of the teacher-centered approach is based on traditional pedagogy wherein knowledge is passed from teacher to children (Katsuko, 1995). However, the trend in schools now is to move away from the teacher-centered approach and adopt a new approach called the learner-centered approach. Unlike a teacher-centered approach, the learner-centered approach does not limit the students to acquiring knowledge solely from their teachers. Instead, they are limited only by their own capabilities as to when and how they will learn (Fadul, 2006). Schools nowadays move towards the learner-centered approach because of the benefits that the new approach offers. The new approach claims that students are more actively involved with the subject matter, are more motivated as learners, and learn more skills, especially discipline, communication, and collaboration skills (Johnson, 2000). The diversity in students’ needs has grown too large for a teacher-centered approach to address (Laboard, 2003).

Despite the trend of moving from a traditional approach to the new learner-centered approach, there are still teachers who believe that the teacher-centered approach is more appropriate in the classroom setting. Biggs (1999) says that this approach is appropriate especially when the teacher, or the transmitter of knowledge, is one who comes from a position of expertise.

Parental influence has also been identified as an important factor affecting student achievement (Halawah, 2006). Previous studies (Boveja, 1998; Gregory, 2006; Halawah, 2006; Weiner, 1974; Wu & Qi, 2006) have found that low achievement has been associated with
students having parents who are less involved in their school work. On the other hand, students who have parents that are more involved with their school work have a higher tendency toward achievement (Rollins & Thomas, 1979).

Previous studies have found that the performance of students benefits most when their parents are highly involved in their school work (Gray & Steinberg, 1999). However, parental autonomy and support have also been proven to be significant predictors of academic achievement (Strage & Brandt, 1999).

Studies have shown that teaching approach, parental involvement, and autonomy predict academic achievement when studied separately. It has been well established that several factors influence the academic achievement of students. However, studies on teaching approach and achievement (Adams & Engelmann, 1996; Gleason, 1995; Spector, 1995; Nelson, 1995; Brent & DiObilda, 1993; Reyes, 2001) fall short of defining which particular paradigm would be most influential towards achievement. Instead, the studies showed how each of them supports or follows a particular learning theory.

The studies presented (Adams & Engelmann, 1996; Gleason, 1995; Spector, 1995; Nelson, 1995; Brent & DiObilda, 1993; Reyes, 2001; Boveja, 1998; Gregory, 2006; Halawah, 2006; Weiner, 1974; Wu & Qi, 2006) were limited to predicting the academic achievement of students without taking into account both the teaching and parenting factors. Hence, the purpose of this study is to bridge the gap between the factors that influence the achievement of students. In the present study the factors of teaching approach (teacher-centered and learner-centered), parental involvement, and parental autonomy were used as predictors of students’ academic achievement as measured by grades.

Parental Involvement, Autonomy and Achievement

Several studies have explored the relationship of parental involvement, autonomy, and achievement (Gray & Steinberg, 1999; Strage & Brandt, 1999). Gray and Steinberg (1999) analyzed the concept of parenting and its specific parts. In this study, they found that parental involvement contributes to every aspect of adolescent development. For instance, when parents were perceived to be more involved, the adolescents became more psychologically rounded and performed better in school. In another study, Strage and Brandt (1999) studied how the academic performance of college students was predicted by parental autonomy. It was revealed that when parents granted more autonomy, demands, and support, the students became more confident, persistent, and positively oriented toward their teachers. In other words, autonomy granting paired with other factors (i.e., demand and support) significantly predicts college students’ overall GPA. Both studies (Gray & Steinberg, 1999; Strage & Brandt, 1999) clearly showed that the performance of students is greatly affected by highly involved parenting and liberal autonomy.

In another study (Grolnick & Deci, 1991; Gray & Steinberg, 1999), the researchers examined a process model of relationships among children’s perception of their parents, their motivation, and their performance in school. The study involved three motivation variables, namely control understanding, perceived competence, and perceived autonomy. With the use of the developed Perception of Parents Scale, 456 children from Grades 3 through 6 participated in the study. The results showed that maternal autonomy support and involvement were positively associated with the three motivation variables. Paternal autonomy and support were related to the three motivation variables as well. Finally, analysis of the results showed that the three motivational factors mediate the child’s academic achievement. Although both mother and father autonomy support and involvement showed significant relationships with motivation, descriptive data from the children’s Perception of Parents Scale revealed that mothers were more
supportive and involved as compared to fathers. Unlike other studies (Gray & Steinberg, 1999; Strage & Brandt, 1999), Grolnick and Deci (1991) were able to show a clearer picture of how parental involvement and autonomy support influenced student performance by measuring mothers and fathers separately. By doing so, the researchers were able to emphasize who between mothers and fathers has more influence on the student’s academic achievement.

Academic Achievement

An article of the North Central Regional Educational Laboratory on achievement used standardized tests as the definition of achievement. The article states that standardized test scores are used to determine how well students are doing in school. This coincides with AbiSamra’s (2000) definition of achievement as the quality and quantity of a student’s work. Meanwhile, Steinberg (1993) discusses achievement as something comprised of ability and performance. Achievement is multidimensional in the sense that it is intricately associated with human growth and cognitive, emotional, social, and physical development. In addition, it reflects the child as a whole as he or she develops across time and levels.

The concept of achievement from the latter articles, as the multidimensional growth and development of the student, is most appropriate for the study. The current study focuses on academic achievement as measured by the general average grade of the student from the previous grading period.

It has been well established how academic achievement is influenced by particular factors. However, previous studies (Grolnick & Deci, 1991; Gray & Steinberg, 1999; Strage & Brandt, 1999) have focused on predicting achievement without considering other factors, such as teaching approach, that would also lead to achievement.

In the studies regarding teaching approach and achievement, related literature revealed how the teacher-centered and learner-centered paradigms work (Hara, 1995). Compared to the studies regarding parenting approach, studies regarding teaching approach did not highlight a particular paradigm that would be most influential towards achievement (Winsler, Madigan & Aquilino, 2006; Alfaro, Umana-Taylor & Bamaca, 2006; Assadi, Zokaei, Kaviani & Mohammadi, 2007; Rollins & Thomas, 1979; Boveja, 1998; Gregory, 2006; Halawah, 2006; Weiner, 1974; Wu & Qi, 2006). Instead, the studies showed how each of them supports or follows a particular learning theory.

The social-cognitive theory (Bandura, 1997) was used as the framework of the study. The theory explains the interaction of parental involvement and autonomy and teaching approach as they predict achievement. The social-cognitive theory explains the interaction between the person and the environment, in which cognitive competencies such as achievement are developed and modified by social influences and structures within the environment, such as parents and teachers. By its definition, two of the major factors of the social-cognitive theory, namely the person and the environment, can elaborate the relation of the variables used in the study. Environment refers to the factors that can affect a person’s behavior, which here are parenting styles and teaching approach.

In the present study, parental involvement, parental autonomy, and teaching approach are environmental factors that can influence the academic achievement of the student. More specifically, a student’s understanding and level of participation can vary depending on the way his parents influence him and on the teacher’s approach to the lesson. Thus, a combination of the teacher’s approach and high parental involvement can explain students’ academic achievement.

Parents’ involvement in the child’s schooling, like assisting the child in doing assignments, explains much of the grade of the child. Studies (Grolnick & Deci, 1991; Gray & Steinberg, 1999) showed that when parents pay more attention to their child’s studies, the
tendency is that the child performs better in school. Gray and Steinberg (1999) concluded that when students feel that their parents are involved with their school work, they become more psychologically rounded and, as a result, perform better in their school work.

Acknowledging individual differences between students and serving as a facilitator also explain the grades of the student. As part of the learner-centered principles, executing such behaviors would show that a student’s teacher is inclined toward the learner-centered approach. On the other hand, a teacher-centered approach is effective for learning basic skills (Snowman & Biehler, 2000).

Research Questions

The study intended to determine whether parental involvement and autonomy (mothers and fathers) and teaching approach can predict public school students’ achievement as measured by the general average grades of students. The study addressed the following questions:

(1) Is there a significant relationship among parental involvement, parental autonomy, and teaching approach with achievement?

(2) Can parental involvement, parental autonomy, learner-centeredness, and teacher-centeredness significantly predict student achievement?

(3) How much does each variable contribute in predicting student achievement?

Method

Research Design

A descriptive design was utilized in the current study. It investigated perceived parental involvement, perceived parental autonomy, and teaching approach as predictors of achievement. The current study focused on the relation of the different variables and described how the relationship of the said variables can predict achievement. Stepwise forward regression was used to further explain how parental involvement, parental autonomy, and teaching approaches can relate with each other to predict achievement.

Participants

A total of 400 grade four students were asked to participate in the study. Only students from the top four classes were included. The sample was recruited from different public schools in Metro Manila, namely Andres Bonifacio Elementary School, Rizal Elementary School, M. Hizon Elementary School, and Francisco Balagtas Elementary School. These public schools have self-contained classes wherein there is only one teacher per class who handles all the subjects. These self-contained classes are used from pre-school to fourth grade in the elementary level. Participants were recruited through purposive sampling, and the inclusion criteria included participants who grew up with at least one parent. Participants who grew up in the absence of both parents were still asked to complete the questionnaires but were no longer included in the analysis of the study. Out of the 400 respondents, 18 students were removed because they did not meet the criteria, and only 382 students were included in the analysis. Their ages ranged from 9 to 11 years old, with a mean age of 9.57. There were 183 males and 199 females.

Instruments

Perception of Parents Scale. The Perception of Parents Scale, Child Scale, is an instrument developed by Grolnick, Deci, and Ryan (1997) which assesses children’s perceptions of the degree to which their parents are autonomy supportive and the degree to which their parents are involved. The scale has 22 items, in which 11 items are about the mother and the same 11 items focus on the father. Factor analysis of the scale has revealed a clear four-factor solution with factors labeled mother involvement (items 1, 3, 5, 9, and 11), mother autonomy support (items 2, 4, 6, 7, 8, and 10), father involvement (items 12, 14, 16, 20, and 22), and father autonomy support (items 13, 15, 17, 18, 19, and 21).
The validity of the scale is .86 for mother autonomy support and .88 for mother involvement. Both father autonomy support and father involvement had a validity of .85.

Learner-Centered Practices Questionnaire. The Learner-Centered Practices Questionnaire (LCPQ) is based on the principles of learner-centered practices by McCombs (1997) (Magno & Sembrano, 2007). The 25-item scale consists of four areas: (1) Positive interpersonal characteristics (items 1 to 5) reflects the capacity to develop positive interpersonal relationships with students and the instructor’s ability to value and respect students; these items have a Cronbach’s alpha of .986, which establishes their internal consistency. (2) Encourages personal challenge (items 6 to 10) focuses on the instructor’s ability to take charge of the students’ learning and obtained an internal consistency of .983 using Cronbach’s alpha. (3) Adopts class learning needs (items 11 to 15) consists of items showing the ability of the teacher to be flexible in addressing students’ needs, with an internal consistency of .975 using Cronbach’s alpha. (4) Facilitates the learning process (items 16 to 19) consists of items reflecting the instructor’s ability to encourage students to monitor their own learning process; these items gathered a .990 internal consistency using Cronbach’s alpha. The overall reliability of the scale is .994, indicating high internal consistency of the items (Magno & Sembrano, 2007).

Teacher-Centered Practices Questionnaire. The Teacher-Centered Practices Questionnaire was adopted from Lefrancois (2000). The 25-item questionnaire was constructed under the areas of (1) direct instruction (item nos. 6, 7, 13, 14, 19, 20, 21, 22, 24, 25), (2) competition within students (item nos. 3, 8, 15, 16), (3) individual work (item nos. 10, 11, 17, 23), and (4) discipline (item nos. 1, 4, 5, 9, 12, 18). The book states that a formal teaching approach is more of a direct instruction from teachers to the students; this approach emphasizes competition, individual work, and discipline. The items from this questionnaire focused on these four factors. The reliability of the scale was analyzed using Cronbach’s alpha and yielded a value of .8968.

Procedure

For the test administration, the administrator gave the general instructions. Afterwards, a researcher passed around an information sheet which contained the following: research ID number, name, age, gender, grew up with mother (yes/no), and grew up with father (yes/no).

The first test administered was the Perception of Parents Scale (POPS), which took 20 minutes. For each item, the administrator read each statement and waited for the participants to finish writing down their answer.

After collecting the POPS, the administrator read the instructions for the Learner-Centered Practices Questionnaire as the other researchers distributed the questionnaires. For each item, the administrator read each statement and waited for the participants to finish writing down their answer. This took about 20 minutes as well. The final questionnaire was the Teacher-Centered Practices Questionnaire. The administrator read the instructions while the other researchers distributed the questionnaires. For each item, the administrator read each statement and waited for the participants to finish writing down their answer. After 20 minutes, the administrator asked the participants to pass their papers.

Data Analysis

The Pearson r was used to intercorrelate teacher-centeredness, learner-centeredness, parental involvement, parental autonomy, and achievement. Multiple regression was used as the main analysis in the study. Stepwise forward multiple regression was used to further investigate whether the factors of parental involvement, parental autonomy, and teaching approach can predict student achievement.
Results

The researchers determined the means and standard deviations of the general average of students, parental involvement, parental autonomy, and teaching approaches. Bivariate correlation was also conducted to determine whether there was a significant relationship among the variables.

Table 2
Stepwise Forward Regression Predicting Student Achievement

Predictor            Standardized Beta   SE of Standardized Beta   Unstandardized B   t(379)   p-level
Mother Involvement   0.11*               0.05                      0.32               2.16     0.03

*p < .05. Note. R = .11, R² = .012, Adjusted R² = .95, SE = 2.87

A t-test for dependent samples was used to compare the means of mother involvement and father involvement in order for the researchers to see which one was higher in terms of involvement. Table 4.1 shows that there was no significant difference between the parents’ involvement in terms of the mean scores. The same method was used to compare the mean scores of mother autonomy and father autonomy to find out which parent was higher in terms of autonomy. In Table 4.2 it was found that there was no significant difference between the mean scores of father autonomy and mother autonomy.

Discussion

Looking at the results of the study, it was found that although the participants came from the top classes of different public schools around Manila, the mean of their general averages was not as high as expected. In addition, it is safe to assume that there are relatively similar grading standards among the schools because the grades of all the students were close to one another. This was also indicated by the low standard deviation.

In determining which variable has a significant relationship with student achievement, mother involvement was significantly related with the students’ academic achievement. Other variables, such as teaching approach, mother autonomy, father involvement, and father autonomy, failed to show a significant relationship with the achievement of the students. The researchers believe that the nonsignificant relationship of teaching approach with student achievement suggests that there is inefficiency or poor quality of teaching in the public schools. Another reason for this is the inadequate training that teachers in the public schools have. The deteriorating quality of public education, especially the continuous decline of students’ performance in National Achievement Tests, is believed to be an effect of the poor quality of teachers in public school classrooms (Hicap, 2006a). It was even mentioned that teachers in public schools are so bad that they teach English subjects in Filipino due to their inadequate English speaking skills (Hicap, 2006b). In an article by Hicap (2006a), he discussed specific issues and steps which the Department of Education should take in order to improve the quality of public education in the country, one of which is allotting more funds for re-training of teachers. The problem of re-training teachers has been an issue for a long time. Sutaria (1990) cited that there is a need to strengthen teachers’ competence in teaching to maximize learning for poorly performing students. The above proves that teachers fail to maximize the potential of learners due to their incompetence in applying teaching strategies.

Although there was no significant relationship between the teaching approaches and achievement, Table 2 shows a significant relationship between teacher-centeredness and learner-centeredness. The significant correlation between the two approaches explains why the mean scores for both scales were close to each other (2.27 for teacher-centeredness and 2.38 for learner-centeredness). The relationship between ratings for both teaching approaches can be explained by the fact that behaviors of Filipinos are situation specific (Markus & Kitayama, 1991). The teachers utilize both teacher-centered and learner-centered practices according to the type of situation or the kind of student that they have. It was also because of this same reason that all the other factors were significantly correlated with the teaching approaches. Father involvement and father autonomy also did not have a significant relationship with the achievement of students. This could be attributed to the fact that Filipino fathers have a more procreative notion of parenting (Tan, 1989). Basically, Filipino fathers equate fatherhood with the biological aspect and their responsibility of providing for their family. Sadly, this approach does not go beyond those facets. Table 2 also showed a significant relationship
between all the parental factors and the teaching approaches. Again, due to the nature of Filipino teachers utilizing both teacher-centered and learner-centered practices, all the other variables yielded significant relationships with them.

The only variable that showed a significant relationship with the achievement of students was mother involvement. This finding supports other studies (Bogenshneider, 1997; Grolnick & Slowiaczek, 1994) finding that mother involvement influences student achievement. It coincides with the fact that mothers are the primary caretakers of children (Mendez & Jocano, 1979; Licuanan, 1979; Lagmay, 1983; Mindoza, Botor & Tablante, 1984; UP CHE, 1958) and are responsible for supervising the studies of their children. Decisions made regarding the child’s daily routine, health, and schooling are also attributed to Filipino mothers (UP CHE, 1958). Mothers were thought to be more involved in the sense that they were perceived to show interest and spend time relating to their child’s school activities (Murray, 2005). The finding of a significant relationship between mother involvement and achievement is further supported by the findings in Table 3 showing mother involvement significantly predicting student achievement. Of all the predictors of achievement used by the researchers, only mother involvement significantly predicted student achievement. Although the stepwise forward regression analysis did not show the other factors as significant predictors of student achievement, this does not mean that teaching approach, father involvement, father autonomy, and mother autonomy do not contribute to predicting achievement. It simply implies that their contribution to the achievement of the students is not as significant as the contribution of mother involvement. In addition, this further stresses the poor quality of teaching in public schools, given its failure to significantly predict the achievement of students.

To further look into the role of mothers and fathers in predicting student achievement, the researchers used a t-test for two dependent samples to determine which parent had higher involvement and autonomy ratings. Table 4.1 showed that there was no significant difference between the parents’ involvement in terms of their means. At the same time, Table 4.2 showed that there was no significant difference between the parents’ autonomy. Although both tables did not show any significant differences between mothers and fathers in their involvement and autonomy, this does not have any bearing on both factors’ contribution in predicting student achievement, since this was only a comparison of the mean scores of the variables.

The findings of the study have several implications. First, the poor quality of teaching in Philippine public education affects the relatively low achievement of students in the public schools. The role of the teacher is critical, for teachers are the people who determine the content to be taught, the teaching strategy to be used, and the conditions for learning the content. Cortes (1987) said that the teacher factor is believed to explain the low level of achievement of Filipino students. The fact that the students failed to recognize which particular approach their teachers were using shows that their teachers are failing to effectively practice their teaching strategies. Second, the high mother involvement rating implies that mothers have more time to look after the studies of their children as compared to the fathers. In the Philippine setting, fathers perceive their role as mainly the provider of the family, thus making them pay more attention to their job and less attention to their children (Tan, 1989).

It was concluded in the study that only mother involvement can predict students’ achievement. Teacher-centeredness and learner-centeredness were significantly related with each other. This indicates that teachers are failing to utilize teaching strategies appropriately, which may have influenced the students in distinguishing which teaching approach their teacher uses. In predicting student achievement, factors such as father involvement, father autonomy, mother autonomy, and teaching approach did not significantly contribute.
References

AbiSamra, N. (2000). The relationship between emotional intelligence and academic achievement in eleventh graders. Retrieved October 4, 2007, from http://members.fortunecity.com/nadabs/research-intell2.html

Adams, G. L., & Engelmann, S. (1996). Research on direct instruction: 25 years beyond DISTAR. Seattle, WA: Educational Achievement Systems.

Assadi, S. M., Zokaei, N., Kaviani, H., Mohammadi, M. R., et al. (2007). Effect of sociocultural context and parenting style on scholastic achievement among Iranian adolescents. Oxford, 16, 169.

Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman.

Biggs, J. (1999). Teaching for quality learning at university. Philadelphia: Open University Press.

Boveja, M. (1998). Parenting styles and adolescents’ learning strategies in the urban community. Journal of Multicultural Counseling and Development, 26, 110-120.

Brent, G., & DiObilda, N. (1993). Effects of curriculum alignment versus direct instruction on urban children. Journal of Educational Research, 86, 333-338.

Gleason, M. M. (1995). Using direct instruction to integrate reading and writing for students with learning disabilities. Reading and Writing Quarterly, 11, 91-108.

Gray, M. R., & Steinberg, L. (1999). Unpacking authoritative parenting: Reassessing a multidimensional construct. Journal of Marriage and the Family, 61, 574-587.

Gregory, A., & Weinstein, R. S. (2006). Connection and regulation at home and in school: Predicting growth in achievement for adolescents. Journal of Adolescent Research, 19, 405.

Grolnick, W. S. (2003). The psychology of parental control: How well-meant parenting backfires. Mahwah, NJ: Erlbaum.

Grolnick, W. S., Ryan, R. M., & Deci, E. L. (1991). The inner resources for school performance: Motivational mediators of children’s perceptions of their parents. Journal of Educational Psychology, 83, 508-517.

Halawah, I. (2006). The effect of motivation, family environment and student characteristics on achievement. Journal of Instructional Psychology, 33, 91-99.

Licuanan, P. (1979). Some aspects of child-rearing in an urban low-income community. Philippine Studies, 27, 453-468.

Magno, C. (in press). Exploratory and confirmatory analysis of parenting closeness and multidimensional scaling of other parenting models. The Guidance Journal.

Magno, C., & Sembrano, J. (2007). The role of teacher efficacy and characteristics on teaching effectiveness, performance and use of learner-centered practices. The Asia-Pacific Education Researcher, 60, 167-180.

McCombs, B. L. (2001). What do we know about learners and learning? The learner-centered framework: Bringing the educational system into balance. Educational Horizons, 8, 182-193.

McCombs, B. L. (2003). Applying the LCPs to high school education. Theory into Practice, 42, 117-126.

Rollins, B. C., & Thomas, D. L. (1979). Parental support, power and control techniques in the socialization of children. In W. R. Burr, R. Hill, F. I. Nye, & I. L. Reiss (Eds.), Contemporary theories about the family (Vol. 1, pp. 317-364).

Ryan, R. M., Deci, E. L., & Grolnick, W. S. (1995). Autonomy, relatedness, and the self: Their relation to development and psychopathology. In D. Cicchetti & D. J. Cohen (Eds.), Developmental psychopathology: Vol. 1. Theory and methods (pp. 618-655). New York: Wiley.

Schuh, K. L. (2003). Knowledge construction in the learner-centered classroom. Journal of Educational Psychology, 95, 246-442.

Snowman, J., & Biehler, R. (2000). Psychology applied to teaching (9th ed.). Boston: Houghton Mifflin.

Spector, J. E. (1995). Phonemic awareness training: Application of principles of direct instruction.
Chapter 8
Standardized Tests
Objectives
Lessons
Lesson 1
What are Standardized Tests?
A test is a tool used to measure a sample of behavior. Why do we say “a sample” and not the entire behavior? A test can only measure part of a behavior: it CANNOT measure the entire behavior of a person or the whole characteristic being measured. For example, in a personality test you cannot test the entire personality. In the case of the NEO-PI, the extraversion subscale can only measure part of extraversion. As an implication, during pre-employment testing, a series or battery of tests is administered before an applicant is accepted in order to better represent the behavior that needs to be uncovered. In school admission, the university or college requires student applicants’ grades, entrance exam, essay, recommendation letter, and bioprofile to decide on the suitability of the student. A test can never measure everything. There are proper uses of tests.
As discussed in Chapter 3, a test should be valid and reliable and should discriminate ability before one uses it. Validity means that the test measures what it is supposed to measure. Reliability means that test scores are consistent when the same test is administered again or when the test is paired with another form. Discrimination is the ability of the test to determine who has learned and who has not.
The primary purposes of standardization are (1) to facilitate the development of tools and (2) to ensure that results from a test are indeed reliable and can therefore be used to assign values or qualities to the attributes being measured (through the established norms of the test). The unique characteristics of a standardized test which differentiate it from other tests are (1) uniform procedures in test administration and scoring, and (2) the establishment of norms.
Uses of Tests
Classifications of Tests
Standardized vs. Non-Standardized. Standardized tests have fixed directions for administration and scoring. They can be purchased with test manuals, booklets, and answer sheets, and they were sampled to a norm group. A non-standardized or teacher-made test is intended for classroom assessment; it is used for classroom purposes and intends to measure behavior in line with the objectives of the course. Examples are quizzes, long tests, and exams. Can a teacher-made test become a standardized test? Yes, as long as it is valid and reliable and has a norm.

Individual Tests vs. Group Tests. Individual tests are administered to one examinee at a time. They are used for special populations such as children and people with mental disorders. Examples are the Stanford-Binet and the WISC. Group tests are administered to many examinees at a time. Examples are classroom tests.
Speed vs. Power. A speed test consists of easy items, but the time is limited. A power test consists of a few pre-calibrated difficult items, and the time is also limited.

Verbal vs. Non-Verbal Tests. Verbal tests consist of vocabulary and sentences; an example is a math test with word problems. Non-verbal tests consist of puzzles and diagrams; examples are abstract reasoning and projective tests. A performance test requires examinees to manipulate objects.

Cognitive vs. Affective. Cognitive tests measure the processes and products of natural ability; examples are tests of intelligence, aptitude, memory, and problem solving. An achievement test assesses what has been learned in the past. An aptitude test focuses on the future and what the person is capable of learning; examples are the Mechanical Aptitude Test and Structural Visualization. Affective tests assess interests, personality, and attitudes, the non-cognitive aspects.
Lesson 2
Interpreting Test Scores through Norm and Criterion Reference
Creating norms is usually done by test developers, psychometricians, and other practitioners in testing. When a test is created, it is administered to a large group of individuals. This group of individuals is the target sample for which the test is intended. If the test can be used for a wide range of individuals, then a norm for each specific group possessing a given characteristic needs to be constructed. This means that a separate norm is created for males and females, for ages 11-12, 13-14, 15-16, 17-18, and so on. There should be a norm for every kind of user of the test in order to interpret his or her position in a given distribution. A variety of norms is needed because one cannot take a norm that was made for 12-year-olds and use it for 18-year-olds: the ability of an 18-year-old is different from the ability of a 12-year-old. If a 21-year-old needs to take a test but you DO NOT have a norm for 21-year-olds, then you have to create one. There is a need to create norms for certain groups because the groups involved differ from one another in terms of curriculum, ability, and so on. For example, the majority of standardized tests used in the Philippine setting are from the West, which means that their content and norms are based in that setting. Thus, there is a need to create norms specifically for Filipinos. Another concern in developing norms is that they expire over time: norms created in the 1960s cannot be used to interpret the scores of test takers in 2008. Thus, a norm needs to be created every year.
In creating a norm, the goal is to come up with a distribution of scores that is typical of a normal curve. A normal distribution is asymptotic and symmetrical. Asymptotic means that the two tails of the normal curve do not touch the base line, which extends to infinity. The sides of the normal distribution are symmetrical. The normal curve is a theoretical distribution of cases in which the mean, median, and mode are the same and in which distances from the mean can be measured in standardized units such as standard deviation units or z scores. Z scores are standardized values transformed from raw score distributions. Six standard deviation areas are marked off under the normal curve. The z score ranges from -3 to +3, with a mean of 0 and a standard deviation of 1.

[Figure: the normal curve, with the baseline marked from -3 SD to +3 SD]
Suppose that a general ability test with 100 items was constructed and pilot tested on 25 participants. The goal is to construct a norm to interpret the scores of future test takers. (Generally, 25 respondents are not enough to create a norm.)

96 74 64 50 76
83 80 92 85 91
59 68 76 75 69
64 87 71 81 83
73 67 68 70 75

1. Determine the range of the scores: highest minus lowest plus one, or 96 - 50 + 1 = 47.
2. Divide the range by the desired number of class intervals (10) to obtain the interval size: i = 47 / 10, or approximately 5.
3. Start the class intervals with a score that is divisible by your interval size. The lowest score, 50, is divisible by 5 (the interval size), so the first class interval can start at 50.
The frequency (f) and relative frequency (rf) indicate how many participants scored within a class interval. The cumulative percentage (cP) indicates the point in a distribution that has a given percent of the cases below it. In the example, an examinee who scored 87 has 88% of the participants at or below his score and 12% of the cases above it.
[Table: frequency distribution of the 25 scores, showing each class interval with its midpoint, f, rf, cf, and cP]
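Since the original frequency table did not survive in this copy, the distribution can be rebuilt mechanically from the rules above. The short Python sketch below (a reconstruction assuming an interval size of 5 starting at 50, as stated in the steps) tabulates the midpoint, f, rf, cf, and cP for the 25 pilot scores.

```python
# Rebuild the frequency distribution for the 25 pilot-test scores
# using an interval size of 5 starting at 50 (the lowest score).
scores = [96, 74, 64, 50, 76, 83, 80, 92, 85, 91,
          59, 68, 76, 75, 69, 64, 87, 71, 81, 83,
          73, 67, 68, 70, 75]

i = 5          # interval size
start = 50     # lowest score, divisible by the interval size
n = len(scores)

cum = 0
for lower in range(start, max(scores) + 1, i):
    upper = lower + i - 1
    f = sum(lower <= s <= upper for s in scores)   # frequency
    cum += f                                       # cumulative frequency
    rf = f / n                                     # relative frequency
    cp = 100 * cum / n                             # cumulative percentage
    midpoint = (lower + upper) / 2
    print(f"{lower}-{upper}  mid={midpoint:5.1f}  f={f}  "
          f"rf={rf:.2f}  cf={cum}  cP={cp:.0f}%")
```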
When a histogram is created for the data set, it typifies a normal distribution. To determine whether a distribution of scores approximates a normal curve, there are indices to be assessed.

Estimating the Mean and Median. The mean is the sum of all scores divided by the number of cases:

\bar{X} = \frac{\sum X}{N} = \frac{1877}{25} = 75.08

The median (C50) for grouped data is

C_{50} = cb + \left( \frac{N(.50) - cf}{f} \right) i = 74.5 + \left( \frac{12.5 - 12}{4} \right) 5 = 75.13

Fifty percent of N = 25 is 12.5. Given this value, select from the cumulative frequency (cf) column of the frequency distribution table the value that is close to 12.5 but does not exceed it. This value is 12, which is then used as cf in the formula. The f used is 4 because, given a cf of 12, a frequency of 4 more cases is needed to reach 12.5; the value 4 is the frequency of the interval above the cf of 12. The i value is the interval size, which is 5. To determine cb, the class boundary, get the upper limit of the interval that accumulates the cf of 12. This upper limit is 74 (70 is the lower limit). The boundary between 74 and the next lower limit, 75, is 74.5; therefore 74.5 is used as cb.

The values of the mean (75.08) and median (75.13) are close. It can be assumed that the distribution is normal.
Notice that in a skewed distribution, the mean and median are not equal. In a positively skewed distribution, the mean is pulled by the extreme scores on the right and has a higher value than the median ($\bar{X} > C_{50}$), while in a negatively skewed curve the mean is pulled by the extreme scores on the left and the median has the higher value ($\bar{X} < C_{50}$).

Estimating Skewness. Skewness can be estimated using

sk = \frac{3(\bar{X} - C_{50})}{sd}

where sd is the standard deviation, $\bar{X}$ is the mean, and $C_{50}$ is the median. In the previous section the mean and median were already computed, with values 75.08 and 75.13, respectively. To determine the value of the standard deviation, the formula below is used:

sd = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} = \sqrt{\frac{143713 - \frac{(1877)^2}{25}}{25 - 1}} = 10.78

where $\sum X$ is the sum of all scores, $\sum X^2$ is the sum of squares, and N is the sample size. $\sum X^2$ is obtained by squaring each score and then summing the squares; this gives 143713 for the given data. Substituting the values in the formula:

sk = \frac{3(75.08 - 75.13)}{10.78} = -0.01

The value of sk is almost 0, which indicates that the distribution is normal.
Estimating Kurtosis. Kurtosis refers to the peakedness of the curve. If a curve is peaked and the tails are more elevated, the curve is leptokurtic; if the curve is flattened, it is said to be platykurtic. A normal distribution is mesokurtic. Kurtosis can be estimated using

Kurtosis = \frac{QD}{P_{90} - P_{10}}

where QD is the quartile deviation, $QD = \frac{Q_3 - Q_1}{2}$, $P_{90}$ is the 90th percentile, and $P_{10}$ is the 10th percentile. The formula used to determine the median can also be used to determine the percentiles P. $Q_3$ is equivalent to $P_{75}$ and $Q_1$ is equivalent to $P_{25}$. Four percentile estimates are needed to determine kurtosis: $P_{75}$, $P_{25}$, $P_{90}$, and $P_{10}$.

P_{75} = cb + \left( \frac{N(.75) - cf}{f} \right) i = 79.5 + \left( \frac{18.75 - 16}{4} \right) 5 = 82.94

P_{25} = cb + \left( \frac{N(.25) - cf}{f} \right) i = 64.5 + \left( \frac{6.25 - 4}{4} \right) 5 = 67.31

P_{10} = cb + \left( \frac{N(.10) - cf}{f} \right) i = 59.5 + \left( \frac{2.5 - 2}{2} \right) 5 = 60.75

P_{90} = cb + \left( \frac{N(.90) - cf}{f} \right) i = 89.5 + \left( \frac{22.5 - 22}{2} \right) 5 = 90.75

QD = \frac{Q_3 - Q_1}{2} = \frac{82.94 - 67.31}{2} = 7.81

Kurtosis = \frac{QD}{P_{90} - P_{10}} = \frac{7.81}{90.75 - 60.75} = 0.26

The distribution approximates a normal one, since the kurtosis value of 0.26 is close to .263, the value of this index for a mesokurtic (normal) distribution.
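As a check on the hand computations, the following Python sketch reproduces the grouped-data statistics above; the helper percentile_grouped simply implements the cb + ((Np - cf)/f)i formula used for the median and the percentiles.

```python
import math

scores = [96, 74, 64, 50, 76, 83, 80, 92, 85, 91,
          59, 68, 76, 75, 69, 64, 87, 71, 81, 83,
          73, 67, 68, 70, 75]
n, i = len(scores), 5

def percentile_grouped(p):
    """Grouped-data percentile: cb + ((N*p - cf) / f) * i."""
    target = n * p
    cum = 0
    for lower in range(50, max(scores) + 1, i):
        f = sum(lower <= s <= lower + i - 1 for s in scores)
        if cum + f >= target:
            cb = lower - 0.5                  # class boundary
            return cb + (target - cum) / f * i
        cum += f

mean = sum(scores) / n                                     # 75.08
median = percentile_grouped(0.50)                          # 75.13
sd = math.sqrt((sum(s * s for s in scores)
                - sum(scores) ** 2 / n) / (n - 1))         # 10.78
sk = 3 * (mean - median) / sd                              # about -0.01

q3, q1 = percentile_grouped(0.75), percentile_grouped(0.25)
qd = (q3 - q1) / 2                                         # 7.81
kurt = qd / (percentile_grouped(0.90) - percentile_grouped(0.10))  # 0.26
print(mean, median, sd, sk, kurt)
```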
What is the standard score corresponding to a raw score of 94? Locate this score on the normal curve.

To convert a raw score to a standard z score, the formula

z = \frac{X - \bar{X}}{sd}

is used. Just substitute the values in the formula, where X is the given score. Using the given data set, where $\bar{X}$ = 75.08 and sd = 10.78:

z = \frac{94 - 75.08}{10.78} = 1.76

[Figure: the z score of 1.76 located on the normal curve, between +1 and +2]
Notice that the z score has a mean of 0 and a standard deviation of 1. A T score has a mean of 50 and a standard deviation of 10. For the other scales used here, a CEEB score has a mean of 500 and a standard deviation of 100, the ACT scale a mean of 15 and a standard deviation of 5, and the stanine a mean of 5 and a standard deviation of 2.

Convert a raw score of 94 into a T score, CEEB, ACT, and stanine. Given the z value of 1.76 for a raw score of 94, just multiply the z by the standard deviation of the standard score, then add the mean value:

T = 50 + 10(1.76) = 67.6
CEEB = 500 + 100(1.76) = 676
ACT = 15 + 5(1.76) = 23.8
stanine = 5 + 2(1.76) = 8.52

A raw score of 94 has an equivalent T score of 67.6, CEEB of 676, ACT of 23.8, and stanine of 8.52.
Once a score is converted into a standard score, it can be interpreted based on its position in the normal curve. For example, a raw score of 94 is said to be above average, given that its location surpasses the mean.
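The conversion rule "multiply z by the scale's standard deviation, then add its mean" is one line of code. In the Python sketch below, the scale means and standard deviations (T: 50 and 10; CEEB: 500 and 100; ACT: 15 and 5; stanine: 5 and 2) are the values implied by the worked results above, since the book's own scale table did not survive in this copy.

```python
def standard_score(z, mean, sd):
    # Convert a z score to another standard-score scale.
    return mean + z * sd

z = 1.76  # raw score of 94, with mean 75.08 and sd 10.78
for name, (m, s) in {"T": (50, 10), "CEEB": (500, 100),
                     "ACT": (15, 5), "stanine": (5, 2)}.items():
    print(name, round(standard_score(z, m, s), 2))
# T 67.6, CEEB 676.0, ACT 23.8, stanine 8.52
```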
Since the normal distribution is symmetrical, it has constant areas. When cutoffs are made using z scores, the areas are as follows.

[Figure: the normal curve with its areas marked between -3 SD and +3 SD]

The areas show that from the mean to a z score of 1, the area covered is 34.13%, which is also 1 standard deviation away from the mean. From a standard score of -1 to +1, a total area of 68.26% (34.13% + 34.13%) is covered. From -2 to +2, a total area of 95.44% is covered in the curve. From -3 to +3, a total area of 99.72% is covered. The remaining area of the normal curve is 0.13% on each side. The approximate areas of the normal curve for every z score are found in Appendix B of this book.
For example, for a given raw score of 94, what is the area away from the mean? Given the z score of 1.76 for a raw score of 94, looking up 1.76 in Appendix B (first column, z score) gives a value of .4608, which is the area away from the mean. This means that the area from the mean (“0”) to a z score of 1.76 occupies 46.08% of the normal distribution, leaving 3.92% of the cases above it.

How many cases are within this 46.08% of the distribution? Just multiply the area .4608 by N (.4608 x 25), which gives about 12 participants.
1) How many cases are within 68.26% of the normal distribution? Multiply N = 25 by .6826: 25 x .6826 gives about 17 cases. The remaining 25 - 17 = 8 cases fall outside this area.
2) Given a score of 87 and another score of 73, how many people are between the two scores? Convert 87 and 73 into z scores ($\bar{X}$ = 75.08, sd = 10.78). A score of 87 corresponds to a z score of 1.11, and a score of 73 corresponds to a z score of -0.19. A z score of 1.11 is located on the right side of the curve above the mean, and a z score of -0.19 is on the left side below the mean because of the negative sign. Locate the area away from the mean for each z score and add the two areas to determine the proportion between the scores; then multiply this proportion by N = 25. The areas are .0753 and .3643, so the proportion is .0753 + .3643 = .4396, and .4396 x 25 gives about 11 participants between the two scores.
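The areas read from the z table in Appendix B can also be computed directly from the normal distribution function. This Python sketch redoes both examples using math.erf; small differences from the table values are due to rounding.

```python
import math

def area_mean_to_z(z):
    # Area under the standard normal curve between the mean (z = 0)
    # and the given z; equals erf(z / sqrt(2)) / 2 in absolute value.
    return abs(math.erf(z / math.sqrt(2)) / 2)

n, mean, sd = 25, 75.08, 10.78

# Raw score 94 -> z = 1.76; the area from the mean is about .4608.
z94 = (94 - mean) / sd
print(area_mean_to_z(z94) * n)        # about 11.5, i.e. roughly 12 cases

# Scores 73 and 87 lie on opposite sides of the mean, so the two
# areas away from the mean are added before multiplying by N.
z73 = (73 - mean) / sd                # about -0.19
z87 = (87 - mean) / sd                # about 1.11
between = area_mean_to_z(z73) + area_mean_to_z(z87)
print(between * n)                    # about 11 participants
```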
Comparison of Criterion-Referenced and Norm-Referenced Tests

Dimension: Purpose
Criterion-Referenced Tests: To determine whether each student has achieved specific skills or concepts; to find out how much students know before instruction begins and after it has finished.
Norm-Referenced Tests: To rank each student with respect to the achievement of others in broad areas of knowledge; to discriminate between high and low achievers.

Dimension: Content
Criterion-Referenced Tests: Measure specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
Norm-Referenced Tests: Measure broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Dimension: Item Characteristics
Criterion-Referenced Tests: Each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items which test any given skill are parallel in difficulty.
Norm-Referenced Tests: Each skill is usually tested by fewer than four items. Items vary in difficulty. Items are selected that discriminate between high and low achievers.

Dimension: Score Interpretation
Criterion-Referenced Tests: Each individual is compared with a preset standard for acceptable achievement; the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage. Student achievement is reported for individual skills.
Norm-Referenced Tests: Each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade equivalent score, or a stanine. Student achievement is reported for broad skill areas, although some norm-referenced tests do report student achievement for individual skills.
Exercise
(True False) 1. The mean of a score distribution is equivalent to zero in standard z scores.
(True False) 2. The mean and the median are equivalent to 0 in a normal curve.
(True False) 3. 68% of the normal distribution lies 2 standard deviations away from the mean.
(True False) 4. The entire area of the normal distribution is 100%.
(True False) 5. The area in percentage from -3 to -2 of the normal distribution is 86.26%.
(True False) 6. The extreme area of the normal distribution is 0.13% on each side.
(True False) 7. The area of the normal curve from +2 to -1 is 95.44%.
(True False) 8. The area from -2 to +1 is equivalent to the area from +2 to -1.
(True False) 9. The mode is found at zero in a normal distribution.
Lesson 3
Standards in Educational and Psychological Testing
There is a need to control the use of tests because of the issue of leakage; when leakage happens, it becomes difficult to determine abilities accurately. To control the use of tests, proper considerations are ensured: a qualified examiner and proper procedure in test administration. A person can be a qualified examiner provided that he or she undergoes training in administering a particular test. The psychometrician is responsible for the psychometric properties and the selection of tests. The psychometrician also trains the staff on how to administer standardized tests properly.
A qualified examiner needs to follow instructions precisely by undergoing training or orientation to develop the skill of administering a test. The examiner needs to follow the test manual precisely. If the examiner deviates substantially from the instructions, the purpose of standardization is defeated: one of the distinct qualities of standardized measures is uniformity of administration. Moreover, a lack of precision in following the instructions for administering the test can affect the test results.
The examiner should have thorough familiarity with the test’s instructions. Examiners should at least memorize their script, even for introducing themselves to the examinees.
Careful control of the testing conditions, which concerns the environment of the rooms where the exam is taken, is also important. If several groups will take the exam, the conditions should be the same for all. This includes lighting, temperature, noise, ventilation, and facilities. The condition of the testing room can affect the test-taking process.
Proper checking procedures should also be taken into consideration. It should be decided whether the test will be checked by computer scanning or manually. A second round of checking should also be done to verify that the checking was done accurately.
There should also be proper interpretation of results. Some trained examiners have the skills to make a psychological profile out of the battery of tests administered. The psychometrician is qualified to write a narrative integrating all test results. In some cases, the staff are trained to write psychological profiles, especially when there are occasional test takers.
Test content should be restricted in order to forestall deliberate efforts to fake scores. The questionnaires can be accessed only by the psychometricians; the staff, superiors, and anybody else are not allowed access to the tests. To avoid leakage and familiarity, the psychometrician can use different sets of standardized tests that measure the same characteristics for different groups of test takers.
Test results are confidential. The examiner is not allowed to show the results of the exam to anybody other than the test taker and the people who will use them for decision making. Test results are kept where they are accessible only to the psychometrician and qualified personnel.
The nature of the test should be effectively communicated to the test takers. It is important to dispel any mystery and misconceptions regarding the test. It should be clarified to the test takers what the test measures and how the results will be used in making whatever decision the test is intended for. The procedures of the test can be explained to test takers in case they are concerned. It is essential for them to know that the test is reliable and valid. Moreover, the examiner should also dispel the anxiety of the test takers to ensure that they will perform to the best of their ability. After the test, feedback on the results should be communicated to the test takers; it is the right of the test takers to know the result of the test they took. The psychometrician is responsible for keeping all the records of the results in case the test takers ask for them.
Test Administration
Before the test proper, the examiner should prepare for the test administration. Preparation includes memorizing the script and familiarization with the instructions and procedures. The examiner should memorize the exact verbal instructions, especially the introduction. However, there are some standardized tests that do not require the examiner to memorize the instructions and procedures; some tests permit the examiner to read the instructions and procedures from the manual.
In terms of preparing the test materials, it is advisable that the examiner prepare a day before the testing day. The examiner should count the test booklets, answer sheets, and pencils, and prepare the sign boards, stopwatch, other materials, and the room itself. The room reservation should have been made one month before the testing. The testing schedules are all prearranged. The room conditions are fixed, including ventilation, air-conditioning, and chairs.
Thorough familiarity with the specific testing procedure is also important, and it is shown in checking the names of the test takers: the pictures on the test permits should match the examinees’ faces. Testing materials such as the stopwatch provided for administering the test should be checked to see that they are working properly.

Advance briefing of the proctors is also done through orientation and training on how to administer the test. During the test, the examiner is responsible for reading the instructions carefully, taking charge of timing, and taking charge of the group taking the exam. Examiners should also prevent the test takers from cheating. After the session, the examiner checks that the number of test booklets corresponds to the number of test takers. The examiner also makes sure that the test takers follow instructions, such as shading the circles if they are required to shade them. For questions that cannot be answered by the proctor, there is a testing manager nearby who can be consulted.
As for the testing conditions, the environment should not be noisy. The examiner should select good and suitable testing rooms that can provide a good testing environment for the test takers. The area where the test is administered should be restricted, and noise in the place should be regulated. The temperature should be kept the same in all rooms. Each room should be free of noise, the lights should be bright enough, there should be good seating facilities, and other factors that can negatively affect the test takers while taking the exam should be controlled. Special steps should also be taken to prevent distractions, such as putting signs outside the testing room like “examination going on.” The examiner can also lock the door or ask assistants outside the room to tell people that a test is going on in that area. Subtle testing conditions, such as the tables, chairs, type of answer sheet, and paper-and-pencil versus computer administration, may affect performance on ability and personality tests.
The test administrator should establish rapport with the test takers. Rapport refers to the examiner's efforts to arouse the test takers' interest in the test, elicit their cooperation, and encourage them to respond appropriately. For ability tests, the examiner encourages test takers to exert their best effort to perform well. For personality inventories, the examiner tells test takers to be frank and honest in their responses to the questions. For projective tests, the examiner instructs test takers to report fully the associations evoked by the stimuli without censoring or editing the content. In general, the examiner motivates test takers to follow the instructions carefully.
For preschool children, the test administrator has to be friendly, cheerful, and relaxed. A short testing time is recommended considering the attention span of children. The tasks required in the test should be interesting, and the scoring should be flexible. Examples of how to answer each test type are demonstrated to the children.
For grade school students, the test administrator should appeal to their competitive side and their desire to do well.
The educationally disadvantaged may not be motivated in the same way as the usual test takers, so the examiner should adapt to their needs. Nonverbal tests are used for deaf examinees and for those who are not able to read and write. Oral tests should be given to examinees who have difficulty writing.
For the emotionally disturbed, test administrators should be sensitive to the difficulties these test takers might have and take them into account when interpreting scores. Testing should occur when these examinees are in a proper condition to be tested.
For adults, test administrators should explain the purpose of the test and convince the test takers that it is in their own interest.
Examiner variables such as age, sex, ethnicity, professional and socio-economic status, training, experience, personality characteristics, and appearance affect the test takers. Situational variables such as an unfamiliar or stressful environment, activities before the test, emotional disturbance, and fatigue also affect the test takers.
Intelligence Tests
The differences in level of reliability between the short form and the full test (Forms A and B) are sufficiently large to warrant administration of the full test. Scale 2 reliability coefficients are .80 to .87 for the full test and .67 to .76 for the short form. Scale 3, on the other hand, has reliability coefficients of .82 to .85 for the full test and .69 to .74 for the short form. The validity evidence gathered was construct and concurrent validity. Construct validity for Scale 2 was reported at .85 for the full test and .81 for the short form, and concurrent validity at .77 for the full test and .70 for the short form. For Scale 3, the reported construct validity was .92 for the full test and .85 for the short form, while concurrent validity was .65 for the full test and .66 for the short form. Standardization was done for both scales: for Scale 2, 4,328 males and females from varied regions of the US and Britain were included, and for Scale 3, 3,140 American first- to fourth-year high school students and young adults participated.
This test was developed by Arthur Otis and Roger Lennon and was published by Harcourt, Brace and World, Inc. in New York in 1957. It was designed to provide a comprehensive assessment of the general mental ability of students in American schools. It was also developed to measure students' facility in reasoning and in dealing abstractly with verbal, symbolic, and figural content. The content samples a broad range of mental ability. It is important to note that the test does not intend to measure the innate mental ability of students. There are six levels of the Otis-Lennon Mental Ability Test to ensure a comprehensive and efficient measure of the mental ability already developed among students in grades K-12. Primary Level I is intended for students in the last half of kindergarten, Primary Level II for the first half of grade 2, Elementary I for the last half of grade 2 through grade 3, Elementary II for grades 4-6, Intermediate for grades 7-9, and Advanced for grades 10-12. The norm was obtained from 200,000 students in 117 school systems across the 50 states who participated in the national standardization program; there were 12,000 pupils from grades 1-12, while 6,000 were from kindergarten. For reliability, the split-half method was used, with computed reliabilities ranging from .93 (Elementary I) to .96 (Intermediate). The Kuder-Richardson formula 20 (KR-20) also yielded reliability coefficients from .93 (Elementary I) to .96 (Intermediate), and alternate-forms reliability coefficients range from .89 (Elementary II) to .94 (Intermediate). As for validity, correlations with school grades and scores on achievement tests were computed. Moreover, the relationships between the OLMAT and other accepted mental ability and aptitude tests were computed.
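To make the KR-20 computation concrete, here is a minimal sketch in Python (not taken from any test manual; the data matrix is hypothetical) showing how the coefficient is obtained from dichotomously scored items:

    import numpy as np

    def kr20(scores):
        # scores: persons-by-items matrix of 0/1 responses
        k = scores.shape[1]                         # number of items
        p = scores.mean(axis=0)                     # proportion passing each item
        q = 1 - p
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

    # Hypothetical data: six examinees answering five items (1 = correct)
    data = np.array([[1, 1, 1, 0, 1],
                     [1, 0, 1, 1, 1],
                     [0, 1, 0, 0, 1],
                     [1, 1, 1, 1, 1],
                     [0, 0, 1, 0, 0],
                     [1, 1, 0, 1, 1]])
    print(round(kr20(data), 2))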
This test was developed by Arthur Otis and Roger Lennon and was published by Harcourt Brace Jovanovich, Inc. in New York in 1979. It was developed to give an accurate and efficient measure of the abilities needed to attain the desired cognitive outcomes of formal education. It intends to measure general mental ability, or Spearman's "g". Following Vernon's postulate of two major components of "g", the verbal-educational and the practical-mechanical, this test focuses on the verbal-educational factor through a variety of tasks that call for the application of several processes to verbal, quantitative, and pictorial content. The OLSAT was organized into five levels: Primary Level I for grade 1 students, Primary Level II for grades 2 and 3, Elementary for grades
4 and 5, Intermediate for grades 6-8, and Advanced for grades 9 through 12. Each level is designed to obtain reliable and efficient measurement for the students for whom it is intended. For each level, two parallel forms of the test, Forms R and S, were developed. Items in these two forms are balanced in terms of content, difficulty, and discriminatory power, and the two forms obtain comparable results. A norm group composed of 130,000 students in 70 school systems enrolled in grades 1-12 in American schools was used for standardization. For the reliability of the test, the Kuder-Richardson method yielded reliability coefficients of .91 to .95. Test-retest reliability was also computed, obtaining coefficients of .93 to .95. Lastly, the standard error of measurement was computed, wherein 2/3 of observed scores fell within +/- 1 standard error of measurement of the "true scores" and 95% fell within +/- 2 standard errors of measurement. For validity, the OLSAT was correlated with teachers' grades, yielding coefficients of .40 to .60 with a median of .49. The OLSAT was also correlated with achievement test scores.
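As a minimal sketch (the numbers below are hypothetical, not the OLSAT's), the standard error of measurement can be computed from the score standard deviation and the reliability coefficient:

    import math

    def sem(sd, reliability):
        # SEM = s * sqrt(1 - r): the expected spread of observed scores
        # around an examinee's true score
        return sd * math.sqrt(1 - reliability)

    # With an assumed SD of 10 and reliability of .93, about 2/3 of observed
    # scores fall within +/- 1 SEM of the true score, about 95% within +/- 2 SEM.
    print(round(sem(10, 0.93), 2))  # 2.65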
This test was originally developed by Dr. John C. Raven in 1936 and is distributed in the U.S. by The Psychological Corporation. It is a multiple-choice test of abstract reasoning designed to measure the ability of a person to form perceptual relations. Moreover, it intends to measure a person's ability to reason by analogy independent of language and formal schooling. This test is a measure of Spearman's g. It consists of 60 items arranged in five sets (A, B, C, D, and E) of 12 items each. Each item contains a figure with a missing piece. There are either six (sets A and B) or eight (sets C through E) alternative pieces to complete the figure, only one of which is correct. Each set involves a different principle or "theme" for obtaining the missing piece, and within a set the items are roughly arranged in increasing order of difficulty. The raw score is converted to a percentile rank through the use of the appropriate norms. The test is intended for people ranging in age from 6 up to adulthood. The matrices are offered in three different forms for participants of different ability: the Standard Progressive Matrices, the Colored Progressive Matrices, and the Advanced Progressive Matrices. The Standard Progressive Matrices were the original form of the matrices and were first published in 1938. This form comprises five sets (A to E) of 12 items each, with items within a set becoming increasingly difficult, requiring ever greater cognitive capacity to encode and analyze information. All of the items are presented in black ink on a white background. The Colored Progressive Matrices were designed for younger children, the elderly, and people with moderate or severe learning difficulties. This form consists of sets A and B from the standard matrices, with a further set of 12 items inserted between the two as set Ab. Most of the items are presented on a colored background so that the test appears visually stimulating to participants. The very last few items in set B, however, are presented as black-on-white so that, if participants exceed the tester's expectations, the transition to sets C, D, and E of the standard matrices is eased. The third form, the Advanced Progressive Matrices, contains 48 items, presented as one set of 12 (set I) and another of 36 (set II). Items here are also presented in black ink on a white background and become increasingly difficult as progress is made through each set. The items in this form are appropriate for adults and adolescents of above-average intelligence. The last two forms of the matrices were published in 1998. In terms of establishing the norms, the standardization sample included British children between the ages of 6 and 16, Irish children between the ages of 6 and 12, and military and civilian subjects between the ages of 20 and 65. Other samples came from
Canada, the United States, and Germany. The two main factors of Raven's Progressive Matrices correspond to the two main components of general intelligence originally identified by Spearman: eductive ability (the ability to think clearly and make sense of complexity) and reproductive ability (the ability to store and reproduce information). To determine reliability, split-half and KR-20 estimates yielded values ranging from .60 to .98, with a median of .90. Test-retest correlations were also computed; coefficients range from a low of .46 for an eleven-year interval to a high of .97 for a two-day interval. The median test-retest value is approximately .82. Raven provided test-retest coefficients for several age groups: .88 (13 years and above), .93 (under 30 years), .88 (30-39 years), .87 (40-49 years), and .83 (50 years and over). For validity, Spearman considered the SPM to be the best measure of g. When evaluated with the factor-analytic methods initially used to define g, the SPM comes as close to measuring it as one might expect. The majority of studies that factor analyzed the SPM along with other cognitive measures in Western cultures report loadings higher than .75 on a general factor. Moreover, concurrent validity coefficients between the SPM and the Stanford-Binet and Wechsler scales range between .54 and .88, with the majority in the .70s and .80s.
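The conversion of a raw score to a percentile rank mentioned above can be illustrated with a minimal Python sketch; the norm-group scores here are hypothetical, not Raven's published norms:

    import numpy as np

    def percentile_rank(raw, norm_scores):
        # Percentage of the norm group scoring at or below the raw score
        return 100 * (norm_scores <= raw).mean()

    norm_group = np.array([22, 31, 35, 38, 40, 42, 44, 47, 51, 55])  # hypothetical
    print(percentile_rank(44, norm_group))  # 70.0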
SRA Verbal
This test is a general ability test which measures the individual's overall adaptability and flexibility in comprehending and following instructions and in adjusting to alternating types of problems. It is designed for use in both school and industry. It has two forms, A and B, that can be used at all educational levels from junior high school to college and at all employee levels from unskilled laborers to middle management. However, it is intended only for persons familiar with the English language. To determine the general ability of persons who speak a foreign language or who are illiterate, a nonverbal or pictorial test should be used. The items in this test are of two types: vocabulary (linguistic) and arithmetic reasoning (quantitative). The test is intended for 12- to 17-year-olds. Reliability was determined, and the reported coefficients are in the high .70s for all the scores: linguistic, quantitative, and total. The means were also found to be very similar. For the validity of the test, the SRA was correlated with other tests, particularly the HS Placement Test (r = .60) and the Army General Classification Test (r = .82).
This test was designed to measure a person's critical thinking. It is a series of exercises requiring the application of some of the important abilities involved in thinking critically. It includes problems, statements, arguments, and interpretations of data similar to those which a citizen in a democracy might encounter in daily life. It has two forms, Ym and Zm, each consisting of five subtests. These subtests were designed to measure different but interdependent aspects of critical thinking. There are 100 items, and the instrument is used as a test of power rather than of speed. The five subtests are inference, recognizing assumptions, deduction, interpretation, and evaluation of arguments. Inference consists of 10 items in which students display the ability to discriminate among degrees of truth or falsity of inferences drawn from given data. Recognizing assumptions (16 items), on the other hand, requires students to recognize unstated assumptions or presuppositions that are taken for granted in given statements or assertions. Next, deduction (25 items) tests the ability to reason deductively from given statements or premises and to recognize the relation of implication between propositions. The
next is interpretation, which measures the ability to weigh evidence and to distinguish between generalizations from given data that are not warranted beyond a reasonable doubt and generalizations which, although not absolutely certain or necessary, do seem to be warranted beyond a reasonable doubt. Lastly, evaluation of arguments measures the ability to distinguish between arguments which are strong and relevant and those which are weak or irrelevant to a particular question or issue. For the standardization of the test, norms were set covering four grade levels: grades 9, 10, 11, and 12, with a total of 20,312 participating students. Participating high schools had to be regular public institutions in communities of 10,000-75,000 with a minimum of 100 students. This was done to avoid the biasing influences associated with extremely small schools and with the specialized high schools found in some very large systems. Reliability was determined using the split-half method. The computed reliability coefficients were .61, .74, .53, .67, and .62 for inference, recognizing assumptions, deduction, interpretation, and evaluation of arguments, respectively, in the Ym form. For the Zm form, the reliability coefficients were .55, .54, .41, .52, and .40 for inference, recognizing assumptions, deduction, interpretation, and evaluation of arguments, respectively. Validity was then determined through content and construct validity. The indication of content validity was the extent to which the critical thinking appraisal measures a sample of the specified objectives of instructional programs. Moreover, for construct validity, intercorrelations among the various forms of the test ranged from .21 to .50, and the correlations of the subtests with the appraisal as a whole ranged from .56 to .79.
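The split-half procedure used here and in several of the tests above can be sketched as follows; the data are hypothetical, and the odd-even split with the Spearman-Brown correction is the standard textbook procedure, not necessarily the publisher's exact method:

    import numpy as np

    # Persons-by-items matrix of 0/1 responses (hypothetical)
    items = np.array([[1, 0, 1, 1, 0, 1, 1, 0],
                      [1, 1, 1, 1, 1, 1, 0, 1],
                      [0, 0, 1, 0, 0, 1, 0, 0],
                      [1, 1, 1, 0, 1, 1, 1, 1],
                      [0, 1, 0, 0, 1, 0, 0, 0]])
    odd = items[:, ::2].sum(axis=1)    # total score on odd-numbered items
    even = items[:, 1::2].sum(axis=1)  # total score on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    # Spearman-Brown correction estimates the full-length reliability
    print(round(2 * r_half / (1 + r_half), 2))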
Achievement Tests
This test was designed to provide accurate and dependable data concerning the achievement of students in important skill and content areas of the school curriculum. The test reflects the view that an achievement test should assess what is being taught in the classroom. Its use has been extended to include the first half of kindergarten and grades 10-12. It is a two-component system of achievement evaluation designed to obtain both norm-referenced and criterion-referenced information. The first is the instructional component, designed for classroom teachers and curriculum specialists. This is an instructional planning tool that provides prescriptive information on the educational performance of individual students in terms of specific instructional objectives. It has a separate instructional battery that includes reading, mathematics, and language, all available in JI and KI forms. The other is the survey component, which provides the classroom teacher with considerable information about the strengths and weaknesses of the students in the class in the important skill and content areas of the school curriculum. Under this are eight overlapping batteries covering the age range from K-12, which also include reading, mathematics, and language. The norm was set with participants selected to represent the national population in terms of school system enrollment, public versus non-public school affiliation, geographic region, socio-economic status, and ethnic background. There were 550 students, with 10% of public schools drawn from the metropolitan population and 10% from the national population. For socio-economic status, 54% of the metropolitan sample and 52% of the national sample were adult high school graduates. Reliability was computed using KR-20, obtaining .93 for reading, .91 for mathematics, and .88 for language; the basic battery obtained .96. The standard error of measurement
was also computed, yielding 2.8 for reading, 2.9 for mathematics, 3.4 for language, and 5.3 for the basic battery. Validity was established through content validity, on the premise that the objectives and items should correspond to the school curriculum. With this in mind, a compendium of instructional objectives was made available.
This test was designed by Gardner, Rudman, Karlson, and Merwin in 1981. It is a comprehensive series of tests developed to assess the outcomes of learning at different levels of the educational sequence. It measures the objectives of general education from kindergarten through the first year of college. The series includes the Stanford Early School Achievement Test (SESAT) and the Stanford Test of Academic Skills (TASK). The SAT is intended for primary, intermediate, and junior high school and assesses the essential learning outcomes of the school curriculum. It was first established in 1923 and underwent several revisions until 1982. These revisions were done to keep a close match between test content and learning practices, to provide norms that accurately reflect the performance of students in different grade levels, and to adopt modern ways of interpreting scores made possible by improvements in measurement technology. The SESAT is for children in kindergarten and grade 1. It measures the cognitive development of children upon admission and entry into school in order to establish a baseline where learning experiences may best begin. The TASK, on the other hand, is intended for students in grades 8 to 13 (first year of college) and measures basic skills. Level I of TASK, for grades 8-12, measures the competencies and skills desired at the adult societal level, while Level II, for grades 9-13, measures the skills that are requisite to continued academic training. The SAT contains subtests on reading comprehension, vocabulary, listening comprehension, spelling, language, concepts of number, mathematics computation, mathematics applications, and science. Reading comprehension measures understanding skills covering textual reading (typically found in books), functional reading (printed materials found in daily life), and recreational reading (reading for enjoyment, such as poetry and fiction). Vocabulary measures the pupil's language competence without requiring prior reading for the test. Listening comprehension evaluates the ability of the student to process information that has been heard. Spelling tests the ability of the student to identify the misspelled word in a group of four words. The language test has three parts: proper use of capital letters, use of punctuation marks, and appropriate use of the parts of speech. Concepts of number covers the student's understanding of basic concepts about numbers. Mathematics computation includes the multiplication and division of whole numbers and operations on fractions, decimals, and percents. Mathematics applications tests the student's ability to apply the concepts they have learned to problem solving. Lastly, the science subtest measures the ability of the students to understand the basic physical and biological sciences. One of the items in the SAT under vocabulary is "When you have a disease, you are ____": a. sick, b. rich, c. lazy, d. dirty. The reliability of the test was obtained through internal consistency using KR-20 (computed r = .85 to .95), standard errors of measurement, and alternate-forms reliability. For validity, the test content was compared with the instructional objectives of the curriculum.
Aptitude Tests
This test was designed to meet the needs of guidance counselors and consulting psychologists, whose advice and ideas were sought in planning a battery that would meet exacting standards and be practical for daily use in schools, social agencies, and business organizations. The original forms (A and B) were developed in 1947 with the aim of providing an integrated, scientific, and well-standardized procedure for measuring the abilities of boys and girls in grades 8-12 for the purposes of educational and vocational guidance. It was intended primarily for junior and senior high school. It can also be used in the educational and vocational counseling of young adults out of school and in the selection of employees. The test was revised and restandardized in 1962 for Forms L and M and in 1972 for Forms S and T. Included in the DAT battery are verbal reasoning, numerical ability, abstract reasoning, clerical speed and accuracy, mechanical ability, space relations, and spelling. Verbal reasoning measures the ability of the student to understand concepts framed in words. The numerical ability subtest tests students' understanding of numerical relationships and facility in handling numerical concepts, including arithmetic computations. Abstract reasoning is intended as a nonverbal measure of the student's reasoning ability. Clerical speed and accuracy intends to measure speed of response in simple perceptual tasks, including simple number and letter combinations. The mechanical ability test is a reconstructed, easier version of the Mechanical Comprehension Test and measures mechanical intelligence. Space relations measures the ability to deal with concrete materials through visualization. Lastly, spelling measures the student's ability to detect errors in grammar, punctuation, and capitalization. Norms were obtained in the form of percentiles and stanines. Seventy-six school districts were included, testing students in grades 8-12, including schools in the District of Columbia. Schools with 300 or more students each were included. For small school districts, the entire grade 8-12 enrollment participated; for large school districts, representative schools were included, taking into consideration school achievement and racial composition. All in all, there were 14,049 grade 8 students, 14,793 grade 9 students, 13,613 grade 10 students, 11,573 grade 11 students, and 10,764 grade 12 students. Reliability coefficients were computed through the split-half method. Validity was determined, and the coefficients presented demonstrate the utility of the Differential Aptitude Test for educational guidance; the expectancy tables show that each of the tests is potentially useful.
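Stanine norms place scores into nine bands containing fixed percentages (4, 7, 12, 17, 20, 17, 12, 7, 4) of the norm group. A minimal sketch with simulated data follows; the cut percentages are the conventional stanine bands, not taken from the DAT manual:

    import numpy as np

    def stanines(raw):
        # Cut points at cumulative percentages 4, 11, 23, 40, 60, 77, 89, 96
        cuts = np.percentile(raw, [4, 11, 23, 40, 60, 77, 89, 96])
        return 1 + np.searchsorted(cuts, raw, side="right")

    scores = np.random.default_rng(0).normal(50, 10, 1000)  # simulated norm group
    print(np.bincount(stanines(scores))[1:])  # counts in stanines 1 through 9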
This test is a set of 18 short tests designed for use with adults in personnel selection programs for a wide variety of jobs. The tests are short and self-administering. The FIT battery measures 18 subscales: arithmetic, assembly, components, coordination, electronics, expression, ingenuity, inspection, judgment and comprehension, mathematics and reasoning, mechanics, memory, patterns, planning, precision, scales, tables, and vocabulary. Arithmetic measures accuracy in working with numbers. Assembly measures the ability to visualize the appearance of an object assembled from separate parts. Components is the ability to locate and identify the important parts of a whole. Coordination tests arm and hand coordination. Electronics measures the understanding of electronic principles and the ability to analyze diagrams of
electrical circuits. Expression is the feel for and knowledge of correct English and the ability to convey ideas in writing and speaking. Ingenuity refers to being creative and inventive and having the ability to devise procedures, equipment, and presentations. Inspection is the ability to spot flaws and imperfections in a series of articles accurately and quickly. Judgment and comprehension is the ability to read with understanding and to use good judgment in interpreting materials. Mathematics and reasoning refers to understanding basic mathematical concepts and the ability to apply them in solving certain problems. Mechanics is the ability to understand mechanical principles and analyze mechanical movements. Memory tests the ability to learn and recall associations. Patterns refers to the ability to perceive and reproduce simple pattern outlines accurately. Planning is the ability to foresee problems that may arise and to anticipate the best order for carrying out steps. Precision refers to the ability to make appropriate finger movements with accuracy. Scales is the ability to read and understand what scales, graphs, and charts convey. Tables refers to the ability to read and understand tables accurately and quickly. Vocabulary refers to the ability to choose the right terms to convey one's ideas. The standardization sample for this test consisted of 12th-grade students. Reliability was determined, with reported coefficients ranging from .50 to .90 for the individual tests. When the FIT was correlated with the FACT, the range was .28 (memory) to .79 (arithmetic). For validity, many of the short tests have fairly substantial validity coefficients, ranging from .20 to .50 using stepwise multiple regression. It was also found that five of the tests, namely mathematics and reasoning, judgment and comprehension, planning, arithmetic, and expression, yield a multiple correlation of .5898 with fall semester GPA. The first four of these tests, along with vocabulary and precision, provide a multiple correlation of .47 with spring semester GPA. In general, the multiple correlations vary from .40 to .57.
Personality Tests
This test was created by Allen L. Edwards and was published by The Psychological Corporation. It is an instrument for research and counseling purposes that provides convenient measures of independent personality variables, along with measures of test consistency and profile stability. It is a non-projective personality test derived from H. A. Murray's theory, measuring individuals on fifteen normal needs or motives. These needs or motives from Murray's theory underlie the statements used in the Edwards Personal Preference Schedule. It consists of 15 scales: achievement, deference, order, exhibition, autonomy, affiliation, intraception, succorance, dominance, abasement, nurturance, change, endurance, heterosexuality, and aggression. Achievement is described as the desire of the person to exert his or her best effort. Deference is the tendency of a person to seek suggestions from other people, do what is expected, praise others, conform, and accept others' leadership. Order is neatness and organization in doing one's work, arranging everything in proper order so that everything runs smoothly. Exhibition is the tendency to say smart and clever things to gain others' praise and be the center of attention. Autonomy is the ability to do whatever is desired, avoiding conformity and making independent decisions. Affiliation is having plenty of friends, the ability to form new acquaintances, and building intimate attachments with others. Intraception is the tendency to put oneself in others' shoes and to analyze others' behaviors and motives. Succorance is the desire to be helped by others in times of trouble, to seek encouragement, and
to want others to be sympathetic. Dominance is the tendency of a person to argue against another's view, act as a leader in the group, thereby influencing others, and make group decisions. Abasement is the tendency to feel guilty when one commits a mistake, to accept blame, and to feel the need to confess after a mistake is made. Nurturance is the inclination to help friends who are in trouble, the desire to help the less fortunate, showing great affection to others and being kind and sympathetic. Change is the tendency to explore new things and to do things out of routine. Endurance is the ability to keep doing a task until it is finished and to stick with a problem until it is solved. Heterosexuality is the desire to go out with friends of the opposite sex, becoming physically attracted to people of the opposite sex and being sexually excited. Lastly, aggression is the tendency to criticize others in public, attack contrary points of view, and make fun of others. This test is intended for college students and adults. To set the norm, 1,509 college students with high school graduation and college training were included, consisting of 749 college women and 760 college men. Another part of the normative sample consisted of adults: male and female household heads who were members of a consumer purchase panel used for market surveys. They came from rural and urban areas of counties in the 48 states, and the consumer panel consisted of 5,105 households. For reliability, a split-half technique was used; the coefficients of internal consistency for the 1,509 students in the college normative group range from .60 to .87, with a median of .78. Test-retest stability coefficients with a one-week interval were also computed; based on a sample of 89 students, they range from .55 to .87, with a median of .73. Other researchers have reported similar results over a three-week period, showing correlations of .55 to .87 with a median of .73. For validity, the manual reports studies comparing the EPPS with the Guilford-Martin Personality Inventory and the Taylor Manifest Anxiety Scale. Other researchers have correlated the California Psychological Inventory, the Adjective Check List, the Thematic Apperception Test, the Strong Vocational Interest Blank, and the MMPI with the EPPS. In these studies there are often statistically significant correlations among the scales of these tests and the EPPS, but the relationships are usually low to moderate and often difficult for the researcher to explain.
sociability means optimism and cheerfulness. A high score in objectivity may mean being less egoistic and less hypersensitive. A high score in friendliness means a lack of fighting tendencies and a desire to be liked by others. A high score in thoughtfulness may pertain to men who have an advantage in obtaining supervisory positions. High scores in personal relations mean a high capability of getting along with other people. A high score in masculinity may pertain to people who behave in ways that are more typical of men. Examples of items in this test are "You like to play practical jokes on others" and "Most people are out to get more than they give." Standardization of this test was done by gathering 523 college men and 389 college women from one southern California university and two junior colleges for all traits except thoughtfulness. The male sample included veterans aged 18-30 years old. Reliability was calculated using KR-20, with coefficients ranging from .79 for general activity to .87 for sociability. The intercorrelations among the ten traits are low; only two are high, between sociability and ascendance and between emotional stability and objectivity. For validity, it is believed that what each score measures is fairly well defined and that the scores represent confirmed dimensions of personality and dependable descriptive categories. The most impressive validity data have come from the use of the inventory with supervisory and administrative personnel.
The 16 PF was originally developed by Raymond Cattell, Karen Cattell, and Heather Cattell to help identify personality factors. It can be administered to individuals 16 years and older.
There are 16 bipolar dimensions of personality and 5 global factors. The bipolar dimensions of
Personality are Warmth (Reserved vs. Warm; Factor A), Reasoning (Concrete vs. Abstract;
Factor B), Emotional Stability (Reactive vs. Emotionally Stable; Factor C), Dominance
(Deferential vs. Dominant; Factor E), Liveliness (Serious vs. Lively; Factor F)
Rule-Consciousness (Expedient vs. Rule-Conscious; Factor G), Social Boldness (Shy vs.
Socially Bold; Factor H), Sensitivity (Utilitarian vs. Sensitive; Factor I), Vigilance (Trusting vs.
Vigilant; Factor L), Abstractedness (Grounded vs. Abstracted; Factor M), Privateness (Forthright
vs. Private; Factor N), Apprehension (Self-Assured vs. Apprehensive; Factor O), Openness to
Change (Traditional vs. Open to Change; Factor Q1), Self-Reliance (Group-Oriented vs. Self-
Reliant; Factor Q2), Perfectionism (Tolerates Disorder vs. Perfectionistic; Factor Q3), Tension
(Relaxed vs. Tense; Factor Q4). The global factors are Extraversion, Anxiety, Tough-
mindedness, Independence, and Self-Control. A stratified random sampling that reflects the 2000
U.S. Census was used to create the normative sample, which consisted of 10,261 adults. Test-
retest coefficients offer evidence of the stability over time of the different traits measured by the
16 PF. Pearson product-moment correlations were calculated for two-week and two-month test-retest intervals. Reliability coefficients for the primary factors ranged from .69 (Reasoning, Factor B) to .86 (Self-Reliance, Factor Q2), with a mean of .80. Test-retest coefficients for the global factors were higher, ranging from .84 to .90 with a mean of .87. Cronbach's alpha values ranged from .64 (Openness to Change, Factor Q1) to .85 (Social Boldness, Factor H), with an average of .74. Validity studies of the 16 PF (5th ed.) demonstrated its ability to predict various criterion measures such as the Coopersmith Self-Esteem Inventory, Bell's Adjustment Inventory, and the Social Skills Inventory. Its subscales correlate well with the factors of the Myers-Briggs Type Indicator.
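Both reliability indices reported for the 16 PF can be sketched briefly in Python (all data below are hypothetical): Cronbach's alpha from a persons-by-items score matrix, and test-retest reliability as the Pearson correlation between two administrations of the same scale.

    import numpy as np

    def cronbach_alpha(items):
        # items: persons-by-items matrix of (possibly polytomous) scores
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars / total_var)

    scale = np.array([[3, 4, 2, 5], [2, 2, 1, 3], [4, 5, 4, 4],
                      [1, 2, 2, 2], [5, 4, 5, 5]])  # hypothetical responses
    print(round(cronbach_alpha(scale), 2))

    # Test-retest: Pearson correlation between two administrations
    time1 = np.array([12, 15, 9, 20, 17, 11])
    time2 = np.array([13, 14, 10, 19, 18, 12])
    print(round(np.corrcoef(time1, time2)[0, 1], 2))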
remain the same overall type, and 36% remain the same after nine months. When people are asked to compare their preferred type to the type assigned by the MBTI, only half pick the same profile. Critics also argue that the MBTI lacks falsifiability, which can cause confirmation bias in the interpretation of results. Standardization was done using high school, college, and graduate students; recently employed college graduates; and public school teachers.
This test, also called the PUP, was developed by Virgilio G. Enriquez and Ma. Angeles Guanzon-Lapeña. It was published by the Research Training House in 1975. The Panukat ng Ugali at Pagkatao is a psychological test that can be used for research, for employment, and for screening members and students of an institution. Its reliability is .90, and its test-retest reliability was .94 (p < .01). It has five trait subscales, each with underlying personality traits: Extraversion or Surgency, Agreeableness, Conscientiousness, Emotional Stability, and Intellect or Openness to Experience. Under extraversion are ambition (+), guts/daring (+), shyness or timidity (-), and conformity (-). Ambition is the tendency of a person to act toward the accomplishment of his or her goal. Guts/daring is courage, a very strong emotion from within the person; it can relate to things at risk or in danger, whether life, aspects of life, or material things. Shyness or timidity is the trait of being timid, reserved, and unassertive. A person who is shy tends not to socialize with others, does not engage in eye contact, and lacks confidence in himself or herself, so he or she prefers to be alone. Conformity is the tendency of a person to take into consideration what other people are saying, especially a person of higher position. A conforming person tends to disregard his or her own opinion. For agreeableness, the factors are respectfulness (+), generosity (+), humility (+), helpfulness (+), difficulty to deal with (-), criticalness (+), and belligerence (-). Respectfulness is the trait of giving value to the person one is talking to regardless of his or her position and age. Generosity is the ability to satisfy the needs of others by giving what they need or want even if it is not in accordance with one's personal desire. Humility is the trait of showing modesty and humbleness in dealing with other people, not boasting of one's accomplishments and status in life. Helpfulness is the desire to attend to others' needs and fill their shortcomings. Difficulty to deal with is the tendency of a person to agree to something only after many requests. Criticalness is the tendency of a person to criticize every small detail of something, giving attention to things that are rarely noticed by others. Belligerence is the trait of being quarrelsome and hot-headed, easily angered and frequently in trouble due to little or no patience. For the conscientiousness dimension, the traits are thriftiness, perseverance, responsibleness, prudence, fickle-mindedness, and stubbornness. Thriftiness is the ability of a person to manage his or her resources wisely and to be conservative in spending money. Perseverance is the persistence of a person in achieving a goal and being constant with things already started until they are finished. Responsibleness is the capacity to do the task assigned and to be accountable for it. Prudence is the ability to make sound and careful decisions by weighing the available options. Fickle-mindedness is the tendency of a person to think twice before finally making up one's mind and to change one's mind every once in a while. Finally, stubbornness is the determination to do things despite any prohibitions, hindrances, and objections, and the difficulty of convincing the person that he or she has committed a mistake. For the fourth dimension, emotional stability, four traits are included: restraint (+/-), sensitiveness (-), low tolerance for joking/teasing (-), and moodiness (-). Restraint is the tendency of the person not to show his or her intense
emotions, keeping one's feelings to oneself as a self-control strategy. Sensitiveness is the tendency of a person to be easily hurt or affected by little things said or done that the person does not like. Low tolerance for joking/teasing is the tendency of a person to respond with intense emotion to the teasing or provocation of others. Moodiness is the tendency to show unusual attitudes or behavior and changing emotions due to an unexpected event. The last dimension of this test, intellect or openness to experience, includes three personality traits: thoughtfulness, creativity, and inquisitiveness. Thoughtfulness is the tendency of a person to be very concerned with the future, especially regarding problems or troubles. Creativity is the natural ability of a person to make or create something out of local materials or resources and the ability to express oneself; creative people have a wide imagination and a high inclination to music, arts, and culture. Lastly, inquisitiveness is the trait of being curious and sometimes intrusive. To build a norm, 3,702 participants from various ethnic groups were asked to participate: 412 Bicolano, 152 Chabacano, 642 Ilocano, 489 Cebuano, 170 Ilonggo, 190 Kapampangan, 513 Tagalog, 378 Waray, 29 Zambal, and 83 others. For the validity of the test, all items are said to have a positive direction. Two validity subscales were used: denial (items with which respondents are certain to disagree, such as "I never told a lie in my entire life") and tradition (items with which respondents are certain to agree, such as "I would take care of my parents when I get old").
This test was developed by Annadaisy J. Carlota of the Department of Psychology at the University of the Philippines. It was published in Quezon City in 1989. The PPP is a three-form personality test designed to measure 19 personality dimensions. Each personality dimension corresponds to a subscale comprising a homogeneous subset of items. The three forms are Form K, Form S, and Form KS. Form K corresponds to the traits salient for interpersonal relations. Under this form are eight personality traits: thoughtfulness, social curiosity, respectfulness, sensitiveness, obedience, helpfulness, capacity to be understanding, and sociability. Thoughtfulness is the tendency to be considerate toward others; a thoughtful person tries not to inconvenience other people. Social curiosity is inquisitiveness about others' lives; a socially curious person tends to ask someone about everything and loves to know everything happening around him or her. Respectfulness is the tendency to recognize others' beliefs and privacy; the behavior of a respectful person is concretized by simply knocking on the door before entering. Sensitiveness is the tendency of a person to be easily affected by any negative criticism, so a sensitive person does not want to hear negative criticism from other people. Obedience is the tendency of a person to do what others demand of him or her; an obedient person tends to follow whatever is commanded by others. Helpfulness is the tendency of a person to offer service to others, extend help, and give resources; it is characterized by a person who is always willing to lend his or her things to others. The capacity to be understanding is the person's tolerance of other people's shortcomings; when hurt by others, he or she is always ready to listen to explanations. And lastly, sociability is the ability of a person to easily get along with and befriend others; in social gatherings or events, this person will always take the first move to introduce himself or herself to others. The second form of this test is Form S, which includes seven factors: orderliness, emotional stability, humility, cheerfulness, honesty, patience, and responsibility. Orderliness is neatness and organization in one's appearance and work; the person who is orderly puts his or her things in their proper places. Emotional stability is the ability of a person
to control his or her emotions and remain calm even when facing great trouble. Humility is the tendency to remain modest despite accomplishments and to readily accept one's own mistakes; a humble person does not boast about his or her successes. Cheerfulness is the disposition to be cheerful and to see the happy and funny aspects of things that happen; a cheerful person is one who always finds funny things about situations. Honesty is the sincerity and truthfulness of a person; an honest person tends to tell the truth in every situation regardless of the feelings of others. Patience is the ability to cope with daily life's routine and repetitive activities; a patient person is one who responds to a child's repetitive questions without getting mad. Lastly, responsibility is the tendency of a person to do a particular task on his or her own initiative; a responsible person is characterized by not procrastinating in accomplishing an activity. The last form of the PPP, Form KS, has four subscales: creativity, risk-taking, achievement orientation, and intelligence. Creativity is the ability to be innovative and to think of various strategies in solving a problem. Risk-taking is the tendency to take on new challenges despite unknown consequences; a risk-taker is one who believes that one must take risks to be successful in life. Achievement orientation is the tendency of a person to strive for excellence and to emphasize quality over quantity in every task he or she does. And lastly, intelligence is the trait of perceiving oneself as an intelligent person; it is also characterized by easily understanding the material being read. This test can be taken by persons aged 13 and above. It is written in Filipino and has translations in English, Cebuano, Ilokano, and Ilonggo. During its pretest, 245 respondents aged 13-18 years old were included, with more females than males. Reliability was tested through internal consistency. All personality dimensions except achievement orientation have high reliability. Internal consistency analysis was done several times: at first the top 10 items were taken, then the top 12, then the top 14; on the fourth run, the top 8 were taken and included in the inventory. Form K has a mean reliability coefficient of .69, Form S .81, and Form KS .72. For validity, construct validity was applied: the internal structure of the original version of the PPP was examined by obtaining intercorrelations among the subscales before clustering them into the three forms. The test was deemed valid because, first, more positive than negative intercorrelations were obtained; second, among the personality subscales there were also more positive than negative correlations, except for social curiosity and sensitiveness; and lastly, the magnitudes of the correlations were small to moderate, although the majority were significant at the .05 alpha level. The predominance of positive intercorrelations means that all of the subscales measure the same construct, which is personality. The test was standardized through norms developed in two forms: percentiles and normalized standard scores with a mean of 50 and a standard deviation of 10.
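Normalized standard scores with a mean of 50 and a standard deviation of 10 (T-scores) can be derived by converting mid-rank percentiles to normal-curve equivalents. The following is a minimal sketch with hypothetical raw scores; the mid-rank convention is one common choice, not necessarily the PPP's exact procedure:

    import numpy as np
    from scipy.stats import norm, rankdata

    def t_scores(raw):
        pct = (rankdata(raw) - 0.5) / len(raw)  # mid-rank percentile of each score
        return 50 + 10 * norm.ppf(pct)          # map to normal deviate, rescale

    print(np.round(t_scores(np.array([10, 14, 14, 18, 22, 30])), 1))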
Attitude Tests
This test was developed to help meet the challenge posed by students with high scholastic aptitude who perform very poorly in school, while students who are mediocre on scholastic tests do well. It is an easily administered survey of study methods, motivation for studying, and certain attitudes toward scholastic activities that are important in the classroom. The purpose of developing it was to identify students whose study habits and attitudes differ from those of students who earn high grades, to aid the understanding of students with academic
difficulties, and to provide a basis for helping such students improve their study habits and attitudes and thus more fully realize their best potentialities. In addition, study habits are believed to be a strong predictor of achievement. The test consists of Form C for college and Form H for high school (grades 7-12). The four basic subscales are delay avoidance, work methods, teacher approval, and educational acceptance. It has 100 items and can be used as a screening instrument, a diagnostic tool, and a teaching aid. There were separate norms for the two forms. For Form C, 3,054 first-semester freshmen enrolled at the following nine colleges were included: Antioch College, Bowling Green State University, Colorado College, Reed College, San Francisco State College, Southwest Texas State College, Stephen F. Austin State College, Swarthmore College, and Texas Lutheran College. For Form H, 11,218 students in 16 different towns and metropolitan areas in America participated: Atacosta, Texas (grades 10-12); Austin, Texas (10-12); Buda, Texas (7-12); Durango, Colorado (10-12); Glen Ellyn, Illinois (9); Gunnison, Colorado (10-12); Hagerstown, Maryland (7-12); Marion, Texas (7-12); Navarro, Texas (7-12); New Braunfels, Texas (7-9); Salt Lake City, Utah (7-12); San Marcos, Texas (7-12); Seguin, Texas (7-12); St. Louis, Missouri (7-12); and Waelder, Texas (7-12). The computed reliability coefficients were based on Kuder-Richardson formula 8 and ranged from .87 to .89. Using the test-retest method with a four-week interval, the coefficients were .93, .91, .88, and .90 for delay avoidance, work methods, teacher approval, and educational acceptance, respectively; with a 14-week interval, the reliability coefficients were .88, .86, .83, and .85. For validity, the criterion used was the one-semester grade point average (GPA). The SSHA and GPA were correlated, and the results were .27 to .66 for men and .26 to .65 for women. The average validity coefficients for 10 colleges were .42 for men and .45 for women. The correlation of the SSHA with the ACE (American Council on Education Psychological Examination), a scholastic aptitude test, was always low. Form C of the SSHA correlated with GPA obtained .25 to .45, with a weighted average of .36. Averaged through Fisher's z-function, the subscale correlations were .31 for delay avoidance, .32 for work methods, .25 for teacher approval, and .35 for educational acceptance.
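Fisher's z-function, used above to pool correlations, transforms each coefficient with the inverse hyperbolic tangent, averages in the z metric, and transforms back. A minimal sketch with hypothetical coefficients:

    import numpy as np

    rs = np.array([0.27, 0.41, 0.36, 0.52])   # hypothetical validity coefficients
    mean_r = np.tanh(np.arctanh(rs).mean())   # average via Fisher's z
    print(round(float(mean_r), 2))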
This test intends to meet the need for assessing the goals that motivate people to work. It measures values that are extrinsic as well as intrinsic to work: the satisfactions which men and women seek in work and the satisfactions which may be the concomitants or outcomes of work. It seeks to measure these in boys and girls, and in men and women, at all age levels beginning with adolescence and at all educational levels beginning with entry into junior high school. It is distinctive both in the variety of values tapped and in the ages for which it is appropriate. Its factors are altruism, esthetics, creativity, intellectual stimulation, achievement, independence, prestige, management, economic returns, security, surroundings, supervisory relations, associates, way of life, and variety. Altruism refers to work which enables the person to contribute to the welfare of others. Esthetics refers to work which permits one to make beautiful things and to contribute beauty to the world. Creativity pertains to work which permits one to invent new things, design new products, or develop new ideas. Intellectual stimulation refers to work which provides opportunity for independent thinking and for learning how and why things work. Achievement refers to work which gives one a feeling of accomplishment in a job well done. Independence pertains to work which permits one to work in his own way, as fast or as slowly as he wishes. Prestige pertains to work which gives one standing in the eyes of others and evokes respect. Management refers to work which permits one to plan and lay out
work for others to do. Economic returns pertain to work which pays well and enables one to have the things he wants. Security pertains to work which provides one with the certainty of having a job even in hard times. Surroundings pertains to work which is carried out under pleasant conditions: not too hot or too cold, not noisy or dirty, and so on. Supervisory relations refer to work which is carried out under a supervisor who is fair and with whom one can get along. Associates refer to work which brings one into contact with fellow workers whom he likes. Way of life refers to the kind of work that permits one to live the kind of life he chooses and to be the type of person he wishes to be. Variety refers to work that provides an opportunity to do different types of jobs. One of the items in the inventory under creativity is "Create new ideas, programs or structures departing from those ideas already in existence." To set the standards of this test, a norm was obtained. The sample comprised students in grade 7 (902 females, 925 males), grade 8 (862 females, 949 males), grade 9 (844 females, 931 males), grade 10 (772 females, 859 males), grade 11 (824 females, 814 males), and grade 12 (724 females, 672 males). Reliability was obtained through the test-retest method, and the reliability coefficients reported for the subscales were .83, .82, .84, .81, .83, .83, .76, .84, .88, .87, .74, .82, and .80. Validity was determined through construct, content, concurrent, and predictive validity. Some construct validity evidence was obtained by correlating the Altruism subscale with a Social Service scale (r = .67) and with the Social scale of the AVL (r = .29), and the Esthetics subscale with the Artist key of the SVIB (r = .55), with the Artistic scale of the Kuder (r = .48), and with the Aesthetic scale of the AVL (r = .08).
Interest Tests
This test was designed to enable a systematic study of a person's interests. It is a standardized questionnaire designed to bring to the fore facts about a person's occupational interests so that he and his advisers can discuss his educational and occupational plans more intelligently and objectively. The test is intended for students in grades 8-12 and for adults. It requires relatively low reading skills, as determined by readability formulas. It provides information concerning a vital phase in the complex matter of setting a person's vocational plans wisely and planning a program for attaining his goals. It yields scores in six broad occupational fields for each sex. Both females and males obtain scores in the fields identified as commercial, mechanical, professional, esthetic, and scientific; the agricultural score is only for boys, and the personal service score only for girls. Each field has 20 questions divided among four occupational sections. A 5-point scale is used, from strongly dislike to strongly like. The norm sample includes 10,000 students in 14 school systems, both males and females, from grades 8 to 12. Reliability was obtained through the test-retest method: boys obtained r = .73 for commercial and .88 for scientific scores, while girls obtained .71 for commercial and .84 for esthetic scores. Split-half reliability was also computed: boys obtained .88 for commercial scores and .95 for mechanical and scientific scores, while girls obtained .82 for commercial scores and .95 for scientific scores. For validity, the Brainard was correlated with the Kuder Preference Record, and it was found that the latter measures interest differently in that it forces the respondents to choose among three activities indicative of different types of interest.
A. Look for other standardized tests and report their current validity and reliability.
B. Administer the test that you created in Lesson 2 of Chapter 5 to a large sample. Then create a norm.
References
(1973). Measuring intelligence with the Culture Fair Test: Manual for Scales 2 and 3. Institute of Personality and Ability Testing, Philippines.
Bennett, G.K., Seashore, H.G., & Wesman, A.G. (1973). Fifth edition manual for the Differential Aptitude Tests, Forms S and T. New York: The Psychological Corporation.
Brainard, P.P., & Brainard, R.T. (1991). Brainard Occupational Preference Inventory manual. San Jose, CA.
Briggs, K.C., & Myers, I.B. (1943). The Myers-Briggs Type Indicator manual. Consulting Psychologists Press, Inc.
Brown, W.F., & Holtzman, W.H. (1967). Survey of Study Habits and Attitudes: SSHA manual. New York: The Psychological Corporation.
Carlota, A. (1989). Panukat ng Pagkataong Pilipino (PPP) manual. Quezon City, Philippines.
Enriquez, V.G., & Guanzon, M.A. (1975). Panukat ng Ugali at Pagkatao manual. PPRTH-ASP Panukat na Sikolohikal.
Flanagan, J.C. (1965). Flanagan Industrial Tests manual. Chicago, IL: Science Research Associates.
Gardner, E.F., Rudman, H.C., Karlson, B., & Merwin, J.C. (1981). Manual directions for administering the Stanford Achievement Test. New York: Harcourt Brace Jovanovich.
Guilford, J.P., & Zimmerman, W.S. (1949). Guilford-Zimmerman Temperament Survey: Manual of instructions and interpretations. New York: Harcourt Brace Jovanovich.
Otis, A.S., & Lennon, R.T. (1957). Otis-Lennon Mental Ability Test: Manual for administration. New York: Harcourt, Brace and World.
Otis, A.S., & Lennon, R.T. (1979). Otis-Lennon School Ability Test: Manual for administration and interpretation. New York: Harcourt Brace Jovanovich.
Prescott, G.A., Balow, I.H., Hogan, T.P., & Farr, R.C. (1978). Advanced 2: Metropolitan Achievement Tests, Forms JS and KS. New York: Harcourt Brace Jovanovich.
Raven, J., Raven, J.C., & Court, J.H. (2003). Manual for Raven's Progressive Matrices and Vocabulary Scales. Section 1: General overview. San Antonio, TX: Harcourt Assessment.
Super, D.E. (1970). Manual: Work Values Inventory. Houghton Mifflin Company.
Thurstone, L.L., & Thurstone, T.G. (1967). SRA Verbal: Examiner's manual. Chicago, IL: Science Research Associates.
Watson, G., & Glaser, E.M. (1964). Watson-Glaser Critical Thinking Appraisal: Manual for Forms Ym and Zm. New York: Harcourt, Brace and World.
Chapter 9
The Status of Educational Assessment in the Philippines
Objectives
1. Realize the strong foundation of the field of educational assessment in the Philippines.
2. Describe the history of formal assessment in the Philippines.
3. Describe the pattern of assessment practices in the Philippines.
Lessons
Lesson 1
Assessment in the Early Years
Formal assessment in the Philippines started as a mandate from the government to look
into the educational status of the country (Elevazo, 1968). The first assessment was conducted
through a survey authorized by the Philippine legislature in 1925. The legislature created the
Board of Educational Survey, headed by Paul Monroe. Later, the board appointed an Educational
Survey Commission, which was also headed by Paul Monroe. The commission visited schools
around the Philippines and observed the activities conducted in them. The survey reported the
following results:
1. The public school system, which is highly centralized in administration, needs to be
humanized and made less mechanical.
2. Textbooks and materials need to be adapted to Philippine life.
3. Secondary education did not prepare students for life; training in agriculture, commerce, and
industry was recommended.
4. The standards of the University of the Philippines were high and should be maintained by
freeing the university from political interference.
5. Higher education should be concentrated in Manila.
6. English as the medium of instruction was best; the use of the local dialect in teaching
character education was suggested.
7. Almost all teachers (95%) were not professionally trained for teaching.
8. Private schools, except those under religious groups, were found to be unsatisfactory.
Research, Evaluation, and Guidance Division of the Bureau of Public Schools
The Research, Evaluation, and Guidance Division of the Bureau of Public Schools started
as the Measurement and Research Division in 1924, an offshoot of the Monroe Survey. It was
intended to be the major agent of research in the Philippines.
Its functions were:
1. To coordinate the work of teachers and supervisors in carrying out testing and research
programs
2. To conduct educational surveys
3. To construct and standardize achievement tests
Through a legislative mandate in 1927, the Director of Education created the Economic Survey
Committee, headed by Gilbert Perez of the Bureau of Education. The survey studied the
economic condition of the Philippines and made recommendations on the best means by which
graduates of the public schools could be absorbed into the economic life of the country. The
results of the survey pertaining to education included:
1. Vocational education is relevant to the economic and social status of the people.
2. It was recommended that the work of the schools should not be to develop a peasantry class
but to train intelligent, civic-minded homemakers, skilled workers, and artisans.
3. Secondary education should be devoted to agriculture, trades, industry, commerce, and home
economics.
After the Prosser Survey, several other surveys, mostly to determine the quality of
schools in the country, were conducted after the 1930s. All of these surveys were government
commissioned, such as the Quezon Educational Survey in 1935, headed by Dr. Jorge C. Bacobo.
Another study, a sequel to the Quezon Educational Survey, was made in 1939; it thoroughly
examined existing educational methods, curricula, and facilities and recommended changes in
the financing of public education in the country. This was followed by another congressional
survey in 1948 by the Joint Congressional Committee on Education to look into the
independence of the Philippines from America. This study employed several methodologies.
UNESCO undertook a survey of Philippine education from March 30 to April 16, 1948,
headed by Mary Trevelyan. The objective of the survey was to examine the educational situation
of the Philippines to guide planners of subsequent educational missions to the country. The
report of the survey was gathered from a conference with educators and laymen from private
and public schools all over the country.
The UNESCO study was followed by further government studies. In 1951, the Senate
Special Committee on Educational Standards of Private Schools, headed by Antonio Isidro,
undertook a study of private schools to investigate the standards of instruction in private
institutions of learning and to provide certificates of recognition in accordance with their
regulations. In 1967, the Magsaysay Committee on General Education, financed by the
University of the East Alumni Association, conducted another study. In 1960, the National
Economic Council and the International Cooperation Administration surveyed public schools.
This survey was headed by Vitaliano Bernardino, Pedro Guiang, and J. Chester Swanson, and it
provided three recommendations to public schools: (1) to improve the quality of educational
services, (2) to expand the educational services, and (3) to provide better financing for the
schools.
The assessments conducted in the early years were mandated, commissioned, and
initiated by the government. The private sector was not yet involved in the studies as a
proponent, and the studies were usually headed by foreign counterparts, as in the UNESCO,
Monroe, and Swanson surveys. The focus of the assessments was the overall state of education
in the country, which is considered national research, given the government's need to determine
the status of education nationwide.
Lesson 2
Assessment in the Contemporary Period and Future Directions
The EDCOM report in 1991 indicated that dropout rates, especially in the rural areas,
were markedly high. Learning outcomes, as shown by achievement levels, indicated the
students' mastery of important competencies. There were high levels of simple literacy among
both 15-24 year olds and those aged 15 and above. "Repetition in Grade 1 was the highest
among the six grades of primary education, [which] reflects the inadequacy of preparation
among the young children. All told, the children with which the formal education system had to
work with at the beginning of EFA were generally handicapped by serious deficiencies in their
personal constitution and in the skills they needed to successfully go through the absorption of
learning."
The Philippine Education Sector Study (PESS) was jointly conducted by the World Bank
and the Asian Development Bank and yielded a set of recommendations.
Aside from funding and conducting surveys that apply assessment methodologies and
processes, the government also practiced testing: the screening of government employees
through tests started in 1924, and students from Grade 4 to fourth year high school were tested
at the national level in 1960 to 1961. Private organizations also spearheaded the enrichment of
assessment practices in the Philippines. These private institutions are the Center for Educational
Measurement (CEM) and the Asian Psychological Services and Assessment Corporation
(APSA).
The Fund for Assistance to Private Education (FAPE) started with testing programs such
as the guidance and testing program in 1969. It began with the College Entrance Test (CET),
which was first administered in 1971 and again in 1972. The consultants who worked on the
project were Dr. Richard Pearson from the Educational Testing Service (ETS), Dr. Angelina
Ramirez, and Dr. Felipe. FAPE then worked with the Department of Education, Culture, and
Sports (DECS) to design the first National College Entrance Examination (NCEE), which would
serve to screen fourth year high school students eligible to take a formal four-year course. There
was a need to administer a national test then because most universities and colleges did not have
an entrance examination to screen students. Later, the NCEE was completely endorsed by FAPE
to the National Educational Testing Center of the DECS.
FAPE's testing program continued with the development of a package of four tests: the
Philippine Aptitude Classification Test (PACT), the Survey/Diagnostic Test (S/DT), the College
Scholarship Qualifying Test (CSQT), and the College Scholastic Aptitude Test (CSAT). In 1978,
FAPE institutionalized an independent agency, the Center for Educational Measurement (CEM),
to undertake testing and other measurement services.
CEM started as an initiative of FAPE and was headed by Dr. Leticia M. Asuzano as its
executive vice-president. Since then, several private schools have become members of CEM to
continue its commitment and goals. CEM has since developed up to 60 tests focused on
education, such as the National Medical Admissions Test (NMAT). The main advocacy of CEM
is to improve the quality of formal education through continuing advocacy and support for
systematic research. CEM promotes the role of educational testing and assessment in improving
the quality of formal education at the institutional and systems levels. Through test results, CEM
helps improve the effectiveness of teaching and student guidance.
Aside from CEM, by 1982 there was a growing demand for testing not only in the
educational setting but also in the industrial setting. Dr. Genevive Tan, then a consultant to
various industries, felt the need to measure the Filipino 'psyche' in a valid way because most
industries used foreign tests. The Asian Psychological Services and Assessment Corporation
(APSA) was created from this need. In 2001, headed by Dr. Leticia Asuzano, former EVP of
CEM, APSA extended its services to testing in the academic setting because of the growing
demand of private schools for quality tests.
The mission of APSA is a commitment to deliver excellent and focused assessment
technologies and competence-development programs to the academe and the industry, to ensure
the highest standards of scholastic achievement and work performance, and to ensure
stakeholders' satisfaction in accordance with company goals and objectives. APSA envisions
itself as the lead organization in assessment and a committed partner in the development of
quality programs, competencies, and skills for the academe and the industry.
APSA has numerous tests that measure mental ability, clerical aptitude, work habits, and
supervisory attitudes. For the academe, it has tests for basic education, the Assessment of
College Potential, and the Assessment of Nursing Potential. In the future, the first Assessment
for Engineering Potential and Assessment of Teachers Potential will be available for use in
higher education.
APSA pioneered the use of new mathematical approaches (the IRT Rasch model) in
developing tests, which go beyond the norm-referenced approach. In 2002 it launched
standards-based instruments in the Philippines that serve as benchmarks for local and
international schools. Standards-based assessment (1) provides objective and relevant feedback
to the school on the quality and effectiveness of its instruction measured against national norms
and international standards; (2) identifies the areas of strength and the developmental areas of
the institution's curriculum; (3) pinpoints student competencies and learning gaps, which serve
as the basis for learning reinforcement or remediation; and (4) provides good feedback to the
student on how well he has learned and his readiness to move to a higher educational level.
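To connect this with the Rasch model taken up in Chapter 3: the model expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty. The short Python sketch below uses made-up values to illustrate the model's core equation only; it is not APSA's implementation:

    import math

    def rasch_probability(theta, b):
        """Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b)),
        where theta is person ability and b is item difficulty (in logits)."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    # Hypothetical values: a person of average ability (theta = 0)
    # has a higher chance on an easy item (b = -1) than on a hard one (b = 2).
    print(round(rasch_probability(0.0, -1.0), 2))  # about 0.73
    print(round(rasch_probability(0.0, 2.0), 2))   # about 0.12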
Building Future Leaders and Scientific Experts in Assessment and Evaluation in the Philippines
Only a few universities in the Philippines offer graduate training in measurement and
evaluation. The University of the Philippines offers a master's program in education specializing
in measurement and evaluation and a Doctor of Philosophy in research and evaluation.
Likewise, De La Salle University-Manila has a Master of Science in psychological measurement
offered by its psychology department, while its College of Education, a center of excellence,
offers a Master of Arts in educational measurement and evaluation and a Doctor of Philosophy
in educational psychology major in research, measurement, and evaluation.
Only these two universities in the Philippines offer graduate training and specialization
in measurement and evaluation courses. Some practitioners were trained in other countries, such
as the United States and countries in Europe. There is a growing call for educators and those in
the industry involved in assessment to be trained so that more experts in the field can be
produced.
Aside from the government and educational institutions, the Philippine Educational
Measurement and Evaluation Association (PEMEA) is a professional organization geared
toward promoting a culture of assessment in the country. The organization started with the
National Conference on Educational Measurement and Evaluation, headed by Dr. Rose Marie
Salazar-Clemeña, then dean of the College of Education of De La Salle University-Manila,
together with De La Salle-College of Saint Benilde's Center for Learning and Performance
Assessment. It was attended by participants from all around the Philippines. The theme of the
conference was "Developing a Culture of Assessment in Learning Organizations." The
conference aimed to provide a venue for assessment practitioners and professionals to discuss
the latest trends, practices, and technologies in educational measurement and evaluation in the
Philippines. At the said conference, PEMEA was formed.
The first board of directors elected for PEMEA were Dr. Richard DLC Gonzales as
president (University of Santo Tomas Graduate School), Neil O. Pariñas as vice president (De
La Salle-College of Saint Benilde), Dr. Lina A. Miclat as secretary (De La Salle-College of
Saint Benilde), Marife M. Mamauag as treasurer (De La Salle-College of Saint Benilde), and
Belen M. Chu as PRO (Philippine Academy of Sakya). The board members were Dr. Carlo
Magno (De La Salle University-Manila), Dennis Alonzo (University of Southeastern
Philippines, Davao City), Paz H. Diaz (Miriam College), Ma. Lourdes M. Franco (Center for
Educational Measurement), Jimelo S. Tipay (De La Salle-College of Saint Benilde), and Evelyn
Y. Sillorequez (Western Visayas State University).
Aside from the universities and the professional organization that provide training in
measurement and evaluation, the field is growing in the Philippines because of the periodicals
that specialize in it. CEM has its "Philippine Journal of Educational Measurement." APSA
continues to publish its "APSA Journal of SBA Research." PEMEA will soon launch the
"Educational Measurement and Evaluation Review." Aside from these journals, there are
Filipino experts from different institutions who have published their work in international
journals and in journals listed in the Social Science Index.
References
Appendix A
Critical Values of the Pearson r Moment Correlation
Appendix B
Areas of the Normal Curve
Glossary
Absolute standards, 258
Abstraction, 6
Accountability, 30
Accuracy, 8
Achievement tests, 296
ACT, 285
Adjective checklist, 228
Admission, 256
Affect, 44
Affective characteristics, 211
Affective domain, 37
Alternate form, 61
Analysis of test data, 140
Analysis, 37, 45
Application, 37
Appraisal, 29
Aptitude test, 298
Articulation, 38
Asian Psychological Services and Assessment Corporation, 315
Assessment, 2, 22
Assignments, 27
Attitude tests, 305
Attitude, 43, 212
Audience, 35
Behavior, 35
Beliefs, 213
Binary-choice, 176
Bloom’s taxonomy, 37
Brainard Occupational Preference Inventory, 307
CEEB, 285
Center for Educational Measurement, 315
Characterization, 37
Chi-square goodness of fit, 76
Clarificative, 8
Classical test theory, 92, 126
Cognitive level, 37
Cognitive strategies, 43
Cognitive system, 44
Cognitive test, 275
Comparative scale, 226
Comprehension, 37, 45
Conceptual definition, 6
Conceptual knowledge, 40
Concrete concepts, 42
Condition, 35
Confirmatory factor analysis, 75
Construct validity, 74
Constructed-response, 191
Consumer-oriented, 9
Content validity, 73
Convergent validity, 78
Correlation coefficient, 58
Counseling, 30
Criterion, 24, 35
Criterion-prediction, 73
Critical value, 61
Cronbach’s alpha, 64
Culture fair test, 292
De Bono’s six thinking hats, 47
Defined concepts, 42
Degrees of freedom, 61
Diagram scale, 230
Differential aptitude test, 298
Direction, 211
Discrimination, 42
Disposition, 215
Divergent validity, 78
Economic Survey Committee, 311
EDCOM report, 314
Edwards Personal Preference Schedule, 298
Eigenvalue, 74
Essay, 194
Evaluation, 6, 7, 37
Exceptionality, 256
Expertise-oriented, 10
Exploratory factor analysis, 74
Factor analysis, 74
Factor loading, 75
Factor, 6
Factual knowledge, 40
Feasibility, 8
Feedback, 254
Fixed sum scale, 228
Flanagan industrial test, 298
Forced ranking scale, 225
Performance assessment, 26
Personality test, 299
Philippine Education Sector Study, 314
Philippine Educational Measurement and Evaluation, 316
Picture scale, 230
Placement, 255
Power test, 275
Precision, 38
Predictive question, 245
Proactive, 8
Procedural knowledge, 40
Product, 44
Program theory, 9
Projects, 26
Promotion, 255
Propriety, 8
Prosser Survey, 312
Psychomotor domain, 38
Psychomotor procedures, 46
Qualitative, 25
Quantification, 4, 5
Quantitative, 25
Quartile deviation, 282
Questioning, 242
Rasch Model, 103
Raven's Progressive Matrices, 294
Reasoning, 43
Receiving, 37
Recitation, 26
Reliability, 57
Responding, 37
Response format, 222
Retrieval, 44
Revised Bloom’s taxonomy, 40
Root Mean Square residual, 76
Rules, 42
Scales, 223
Scatterplot, 59
Scoring, 175
Scree plot, 75
Selected-response, 176
Selection, 30, 256
Self-system, 45
Semantic differential scale, 227
Semantic distance scale, 228
Variable, 6
Variance, 61, 64
Verbal frequency, 224
Verbal information, 42
Verbal test, 275
Watson Glaser Critical Thinking Appraisal, 295
Work values inventory, 306
z score, 286
Dr. Carlo Magno is presently a faculty member of the Counseling and Educational
Psychology Department at De La Salle University-Manila, where he teaches courses in
measurement and evaluation, educational research, psychometric theory, and statistics. He
took his undergraduate degree, a Bachelor of Arts major in Psychology, at De La Salle
University-Manila. He took his master's degree in Education, major in Basic Education
Teaching, at the Ateneo de Manila University. He received his PhD in Educational
Psychology, major in Measurement and Evaluation, at De La Salle University-Manila with
high distinction. He was trained in structural equations modeling at Freie Universität in
Berlin, Germany. In 2005 he was named the Most Outstanding Junior Faculty in
Psychology by the Samahan ng Mag-aaral sa Sikolohiya, and in 2007 he received the Best
Teacher Students' Choice Award from the College of Education in DLSU-Manila. In 2008,
he was recognized by the National Academy of Science and Technology for the Most
Outstanding Published Scientific Paper in the Social Sciences. The majority of his research
uses quantitative techniques in the field of educational psychology. Some of his work on
teacher performance, learner-centeredness, measurement and evaluation, self-regulation,
metacognition, and parenting has been published in local and international refereed journals
and presented in local and international conferences. He is presently a board member of
the Philippine Educational Measurement and Evaluation Association. He is also the senior
editor of the Philippine ESL Journal and The International Journal of Research and
Review.