Carlo Magno
Jerome Ouano
- Copyright page -
Acknowledgement
Special thanks to those who exerted their efforts in finishing this book: Ms. MR Aplaon for gathering the materials needed to complete each chapter; Mr. Paul Ong for his contribution to the chapter on “grading students”; Ms. Ma. Theresa Carmela Kanlapan for editing the grammar of the manuscript; Mr. Robert Chu for drafting the figures; and Ms. Sheena Morales for her contribution in testing the program and the guide in using the program.
Carlo Magno
- Editor's Note -
Table of Contents
Copyright …………………………………………………………………………………………II
Acknowledgement ………………………………………………………………………………III
Chapter 1
Lesson 1
Lesson 2
Lesson 3
Chapter 2
Lesson 1
Lesson 2
Lesson 3
Lesson 4
Chapter 3
Lesson 1
Reliability ……………………………………………………………………………………… 57
Kuder Richardson……………………………………………………………… 63
Cronbach’s Alpha ……………………………………………………………… 65
Lesson 2
Validity ……………………………………………………………………………………….. 73
Criterion-Prediction ………………………………………………………………….. 73
Construct ……………………………………………………………………………… 74
Lesson 3
Empirical Report: Construction and Development of a Test Instrument for Grade 3 Social
Studies ………………………………………………………………………………………… 96
Item Response Theory: Obtaining Item Difficulty Using the Rasch Model ………… 103
Procedure for the Calibration of Item and Person Ability …………………………… 106
Lesson 4
Chapter 4
Lesson 1
Lesson 2
Lesson 3
Lesson 4
Chapter 5
Lesson 1
Lesson 2
Lesson 3
Chapter 6
Lesson 1
Lesson 2
Lesson 3
Lesson 4
Chapter 7
Lesson 1
Lesson 2
Lesson 3
Chapter 8
Lesson 1
Lesson 2
Interpreting Test Scores through Norm and Criterion Reference ………………………….. 276
Lesson 3
Chapter 9
Lesson 1
Research, Evaluation, and Guidance Division of the Bureau of Public Schools ….. 311
Lesson 2
Building Future Leaders and Scientific Experts in Assessment and Evaluation in the
Philippines …………………………………………………………………………………. 316
List of Appendices
Appendix A
Appendix B
Chapter 1
Assessment, Measurement, and Evaluation
Chapter Objectives
Lessons
Lesson 1
Assessment in the Classroom Context
Assessment is integrated in all parts of the teaching and the learning process. This means
that assessment can take place before instruction, during instruction, and after instruction. Before
instruction, teachers can use assessment results as basis for the objectives and instructions for
their plans. These assessment results come from students' achievement tests from the previous year, their grades from the previous year, assessment results from the previous lesson, or pretest results gathered before instruction takes place. Knowing the assessment results from
different sources prior to planning the lesson helps teachers decide on a better instruction that is
more fit to the kind of learners they will handle, set objectives appropriate for their
developmental level, and think of better ways of assessing students to effectively measure the
skills acquired. During instruction, there are many ways of assessing student performance. While a class discussion is conducted, teachers can ask questions that students answer orally to assess whether students can recall, understand, apply, analyze, evaluate, and synthesize the facts presented. During instruction teachers can also provide seatwork and worksheets on every unit of the lesson to determine whether students have mastered the skills needed before moving to the next lesson. Assignments are also provided to reinforce what students learned inside the classroom.
Assessment done during instruction serves as formative assessment, meant to prepare students before they are finally assessed in major exams and tests. When students are ready to be assessed after instruction has taken place, they are assessed on the variety of skills they were trained in; this serves as a summative form of assessment. Final assessments come in the form of final exams, long tests, and final performance assessments, which cover a larger scope of the lesson and require more complex skills to be demonstrated. Assessments conducted at the end of instruction are more structured and are announced in advance, so students have time to prepare.
Review Questions:
Lesson 2
The Role of Measurement and Evaluation in Assessment
The concept of assessment is so broad that it involves other processes such as measurement and evaluation. Assessment involves several measurement processes in order to arrive at quantified results. When assessment results are used to make decisions and come up with judgments, evaluation takes place.
Figure
Measurement and evaluation as processes within assessment
3. Quantification allows objective comparison of groups. Suppose that male and female students were tested on their math ability using the same test for both groups. The mean of the males' math scores is 92.3 and the mean of the females' math scores is 81.4. It can be said that males performed better on the math test than females when the difference is tested for significance (see the sketch after this list).
5. Quantification makes the data amenable to further analysis. When data are quantified, teachers, guidance counselors, researchers, administrators, and other personnel can obtain different results to summarize and make inferences about the data. The data may be presented in charts, graphs, and tables showing means and percentages. The quantified data can be further analyzed using inferential statistics, such as when comparing groups, benchmarking, or assessing the effectiveness of an instructional program.
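To make the group comparison in point 3 concrete, here is a minimal sketch in Python using an independent-samples t-test. The individual scores below are hypothetical stand-ins; only the scenario of comparing males' and females' math scores comes from the list above.

# A minimal sketch: testing whether two groups' mean math scores differ
# significantly (independent-samples t-test). The scores are hypothetical.
from scipy.stats import ttest_ind

male_scores = [95, 90, 88, 94, 93, 91]    # hypothetical male math scores
female_scores = [82, 79, 85, 80, 81, 84]  # hypothetical female math scores

t_stat, p_value = ttest_ind(male_scores, female_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A p-value below .05 supports the claim that the two groups differ in math ability.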
be at least consistent. Repeating the measurement process several times and obtaining consistent results would indicate the objectivity of the procedure undertaken.
The process of measurement involves abstraction. Before a variable is measured using an
instrument, the variable’s nature needs to be clarified and studied well. The variable needs to be
defined conceptually and operationally to identify ways on how it is going to be measured.
Knowing the conceptual definition based on several references will show the theory or
conceptual framework that fully explains the variable. The framework reveals whether the
variable is composed of components or specific factors. Then these specific factors that comprise the variable need to be measured. A characteristic that is composed of several factors or components is called a latent variable. The components are usually called factors, subscales, or manifest variables. An example of a latent variable is “achievement.” Achievement is
composed of factors that include different subject areas in school such as math, general science,
English, and social studies. Once the variable is defined and its underlying factors are identified,
then the appropriate instrument that can measure the achievement can now be selected. When the
instrument or measure for achievement is selected, it will now be easy to operationally define the
variable. Operational definition includes the procedures on how a variable will be measured or
made to occur. For example, ‘achievement’ can be operationally defined as measured by the
Graduate Record Examination (GRE) that is composed of verbal, quantitative, analytical,
biology, mathematics, music, political science, and psychology.
When a variable is composed of several factors, it is said to be multidimensional. This means that a multidimensional variable requires an instrument with several subtests in order to directly measure the underlying factors. A variable that does not have underlying factors is said to be unidimensional. A unidimensional variable measures an isolated, unitary attribute. Examples of unidimensional measures are the Rosenberg Self-Esteem Scale and the Penn State Worry Questionnaire (PSWQ). Examples of multidimensional measures are various ability tests and personality tests that are composed of several factors. The 16 PF is a personality test that is composed of 16 components (reserved, more intelligent, affected by feelings, assertive, sober, conscientious, venturesome, tough-minded, suspicious, practical, shrewd, placid, experimenting, self-sufficient, controlled, and relaxed).
The common tools used to measure variables in the educational setting are tests,
questionnaires, inventories, rubrics, checklists, surveys and others. Tests are usually used to
determine student achievement and aptitude that serve a variety of purposes such as entrance
exam, placement tests, and diagnostic tests. Rubrics are used to assess performance of students in
their presentations such as speech, essays, songs, and dances. Questionnaires, inventories, and
checklists are used to identify certain attributes of students such as their attitude in studying,
attitude in math, feedback on the quality of food in the canteen, feedback on the quality of
service during enrollment, and other aspects.
Evaluation is arrived at when the necessary measurement and assessment have taken place. In order to evaluate whether a student will be retained or promoted to the next level, different aspects of the student's performance, such as grades and conduct, are carefully assessed and measured. To evaluate whether a remedial program in math is effective, the students' improvement in math, the teachers' teaching performance, and whether students' attitude toward math changed should be carefully assessed. Different measures are used to assess different aspects of the remedial program to come up with an evaluation. According to Scriven (1967), evaluation is "judging the worth or merit" of a case (e.g., a student), program, policy, process, event, or activity. These objective judgments derived from evaluation enable stakeholders (persons or groups with a direct interest, involvement, or investment in the program) to make further decisions about the case, programs, policies, processes, events, and activities.
In order to come up with a good evaluation, Fitzpatrick, Sanders, and Worthen (2004)
indicated that there should be standards for judging quality and deciding whether those standards
should be relative or absolute. The standards are applied to determine the value, quality, utility,
effectiveness, or significance of the case evaluated. In evaluating whether a university has a good reputation and offers quality education, it should be compared to a standard university that topped the world university rankings. The features of the university being evaluated should be similar to those of the standard university selected. A standard can also be in the form of ideal objectives such as the ones set by the Philippine Accreditation of Schools, Colleges, and Universities (PAASCU). A university is evaluated on whether it can meet the necessary standards set by the external evaluators.
Fitzpatrick, Sanders, and Worthen (2004) clarified the aims of evaluation in terms of its
purpose, outcome, implication, setting of agenda, generalizability, and standards. The purpose of
evaluation is to help those who hold a stake in whatever is being evaluated. Stakeholders consist
of many groups such as students, teachers, administrators, and staff. The outcome of evaluation
leads to judgment whether a program is effective or not, whether to continue or stop a program,
whether to accept or reject a student in the school. The implication that evaluation gives is to
describe the program, policies, organization, product, and individuals. In setting the agenda for
evaluation, the questions for evaluation come from many sources, including the stakeholders. In
making generalizations, a good evaluation is specific to the context in which the evaluation
object rests. The standards of a good evaluation are assessed in terms of its accuracy, utility,
feasibility, and propriety.
A good evaluation adheres to the four standards of accuracy, utility, feasibility, and
propriety set by the ‘Joint Committee on Standards for Educational Evaluation’ headed by
Daniel Stufflebeam in 1975 at Western Michigan University’s Evaluation Center. These four
standards set are now referred to as ‘Standards for Evaluation of Educational Programs, Projects,
and Materials.’ Table 1 presents the description of the four standards.
Table 1
Standards for Evaluation of Educational Programs, Projects, and Materials
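Following the Joint Committee (1994; see the References), the four standards can be summarized as follows:

Standard     Description
Utility      The evaluation serves the practical information needs of its intended users.
Feasibility  The evaluation is realistic, prudent, diplomatic, and frugal.
Propriety    The evaluation is conducted legally, ethically, and with due regard for the welfare of those involved in and affected by it.
Accuracy     The evaluation reveals and conveys technically adequate information about the features that determine the worth or merit of the program being evaluated.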
Forms of Evaluation
Owen (1999) classified evaluation according to its form. He said that evaluation can be
proactive, clarificative, interactive, monitoring, and impact.
1. Proactive. Ensure that all critical areas are addressed in an evaluation process.
Proactive evaluation is conducted before a program begins. It assists stakeholders in making
decisions on determining the type of program needed. It usually starts with needs assessment to
identify the needs of stakeholders that will be implemented in the program. A review of literature
is conducted to determine the best practices and creation of benchmarks for the program.
4. Monitoring. This evaluation is conducted when the program has settled. It aims to justify and fine-tune the program. It focuses on whether the outcomes of the program have been delivered to its intended stakeholders. It determines whether the target population is reached, whether the implementation meets the benchmarks, and what needs to be changed in the program to make it more efficient.
These forms of evaluation are appropriate at certain time frames and stage of a program.
The illustration below shows when each evaluation is appropriate.
Program Duration

Planning and Development Phase: Proactive, Clarificative
Implementation: Interactive and Monitoring
Settled: Impact
Models of Evaluation
Evaluation is also classified according to the models and framework used. The
classifications of the models of evaluation are objectives-oriented, management oriented,
consumer-oriented, expertise-oriented, participant-oriented, and theory driven.
5. Participant-oriented. The primary concern of this model is to serve the needs of those
who participate in the program such as students and teachers in the case of evaluating a course.
This model depends on the values and perspectives of the recipients of an educational program.
The specific models for this evaluation are Stake’s Responsive evaluation, Patton’s Utilization-
focused evaluation, Rappaport’s Empowerment Evaluation (see Fitzpatrick, Sanders, &
Worthen, 2004).
Figure 1
Implicit Theory for Proper Waste Disposal

Intervention (Quality of Instruction and Training) → Determinants (Adaptability, learning strategies, patience, and self-determination) → Outcome (Reduction of wastes, improvement of waste disposal practices, attitude change, and rating of environmental sanitation)
Table 2
Integration of the Forms and Models of Evaluation
Table 3
Implementing procedures of the Different Models of Evaluation
EMPIRICAL REPORTS
Examples of Evaluation Studies
Program Evaluation of the Civic Welfare Training Services
By Carlo Magno

…being related to the objectives and context, input, and process information. The information will be used to decide on whether to terminate, modify, or refocus a program.

There were a total of 250 participants in the study, composed of students, beneficiaries, program staff members, and selected clients. The instruments used were three sets of evaluation questionnaires for the students, program implementers, and beneficiaries, and one interview guide used for the recipients of the CSP. Data analysis was both quantitative and qualitative in nature.

For the context evaluation, the evaluators looked into the objectives of the CSP, the mission-vision of CSB, the objectives of the Social Action Office (SAO), and their congruence. The DLS-CSB mission-vision is realized in the six core Benildean values, and to realize the mission-vision, SAO created the CSP to enhance the social awareness of the students and instill social responsibility. Likewise, the objectives of the CSP are also aligned to the CSB mission and vision. 75% of the respondents said that the CSP objectives are in line with the CSB mission-vision. This was supported with actual experiences. A moderate rating was given by the students and beneficiaries as to the extent the community service program has met its objectives.

For the input evaluation, the profile of the students, program recipients, and implementers was reported. Most of the students were males, with an average age of 21, and from Manila. The recipients were mostly centers in the metropolis run by religious groups. Program implementers, on the other hand, are staff members responsible for the implementation of the program and have been with the college for 1-5 years.

The process evaluation of the program focused on the policies and procedures of the CSP, the role of the community service adviser, the strengths and weaknesses of the CSP, recommendations for improvement, and the insights of the program beneficiaries. In terms of policies, the CSP is a requirement for CSB students written in the Handbook. The program has 10 procedures, including application, general assembly, group meetings, leadership training, orientation seminar, initial area visit, immersion, group processing, and submission of documents. The students rated these as moderate as well; seven out of 10 of the procedures need improvement. On the role of the community service adviser, 68 of the students considered the role of advisers helpful. However, the effectiveness of the performance was rated only moderately satisfactory. Three strong points given to the CSP are the provision of opportunities to gain social awareness, the actualization of social responsibility, and the personal growth of the students. The weaknesses include the difficulty of program procedures, processes, and locations, and the negative attitude of some students. Some of the recommendations focus on program preparation, program staff, and community service locations. For the insights of the beneficiaries, some problems such as the attendance and seriousness of the students are taken into account and resolved through dialogue, feedback, and meetings. They also suggested to the CSP more intensive orientation and preparation as well as closer coordination and program continuity.

Lastly, for the product evaluation, the internalization and personification of the core Benildean values and the benefits gained by the students and beneficiaries were taken into account. For the internalization and personification, it appears that four out of the 6 core values are manifested by the students: deeply rooted faith, appreciation of individual uniqueness, professional competency, and creativity. Students also gained personal benefits such as increased social awareness, actualization of social responsibility, positive values, and realization of their blessings. On the other hand, the beneficiaries' benefits include long-term and short-term benefits. The short-term benefits are the socialization activities, interaction between the students and clients, material help, manpower assistance, and tutorial classes, while the long-term benefits are the values inculcated in the children, interpersonal relationships, knowledge imparted to them, and contribution to physical growth. The program beneficiaries also identified the solidarity and collaboration with the immersion centers.
…other; (3) which factors carry the most weight and which actions are likely to produce the greatest result; and (4) where do the greatest risks and constraints lie. The World Bank divided the countries according to different regions, such as Sub-Saharan Africa, East Asia and the Pacific, Europe and Central Asia, Latin America and the Caribbean, and the Middle East and North Africa.

Areas of Investigation

There are 28 studies done on educational policy with a manifested evaluation component. Education studies with no evaluation aspect were not included. A synopsis of each study with the corresponding methodology and recommendations is found in the World Bank reports. Most of the studies were conducted at the start of the 21st century. This can be explained by the growing trend in globalization, where communication across countries is more accessible. It can also be noted that no studies on educational policy with evaluation were completed for the years 1993, 1995, 1997, and 2005. The trend in the number of studies shows that, after a year, a study gives more generalized findings, since these studies covered a larger and wider array of sampling and took a long period of time to finish. More results are expected before the end of 2005. The trend of studies across the years is significantly different from the expected number of studies, as revealed using a one-way chi-square where the computed value (χ² = 28.73, df = 14) exceeds the critical value of χ² = 23.58 at a 5% probability of error.

Table 1
Counts of Area of Investigation From 1990 - 2006

Year   Country                       Area of Investigation              No. of Studies   Total No. of Studies per Year
2006   Vanuatu                       Language learning                  1                1
2005   None                                                             0                0
2004   Indonesia, Thailand           Undergraduate/Tertiary Education   2                5
       Senegal                       Adult Literacy                     1
       Different Regions, Columbia   Early Child Development            2
2003   Thailand                      Undergraduate/Tertiary Education   1                2
       Different Regions             AIDS/HIV Prevention                1
2002   Different regions             Textbook/Reading materials         1                2
2001   Africa                        Secondary Education                1                2
       Brazil                        Early Child Development            1
       China                         Secondary Education                1
2000   Different Regions             School Self-evaluation             1                7
       Different Regions             Early Child Development            1

Table 2
Counts of Area of Investigation From 1990 - 2006

Area of Investigation              Number of Studies
Language learning                  1
Undergraduate/Tertiary Education   4
Adult literacy                     2
Early Child Development            5
AIDS/HIV Prevention                1
Textbook/Reading material          1
Secondary education                3
School Self-evaluation             1
Basic education                    4
Test Evaluation                    1
Infant Care                        1
ICT                                2
Teacher Development                1
Vocational Education               1

Table 2 shows the number of studies conducted for every area in line with educational policy with evaluation. Most of the studies completed and funded are in the area of early child development, followed by tertiary education and basic education. This can be explained by the increasing number of early child care programs around the world, which are continuing and need to be evaluated in terms of their effectiveness at a certain period of time. Much of the concern is on early child development since it is a critical stage in life, and the development of an individual is evidently hampered if the child is not cared for at an early age. This also shows the increasing number of children whose needs are undermined and for whom intervention has to take place. These programs sought the assistance of the World Bank because they need further funding for the programs to exist. Having an evaluation of the child program likely supports the approval of a further grant.

There is also a large number of studies on basic and tertiary education where their effectiveness is evaluated. Almost all countries offer the same structure of education worldwide in terms of the levels from basic education to tertiary education. These deeply need attention since they are a basic key for developing nations to improve the quality of their education, because the quality of their people's skills depends on the country's overall labor force.

When the observed counts of studies for each area of interest are tested for goodness of fit, the computed chi-square value (χ² = 13, df = 13) did not reach significance at the 5% level of significance. This means that the observed counts per area do not significantly differ from what is expected to be produced.

Table 3
Study Grants by Country

Country             No. of studies
Vanuatu             1
Indonesia           1
Thailand            1
Senegal             1
Different Regions   10
Brazil              1
China               1
Pakistan            1
Cuba                1
Africa              2
USA                 1
Chile               1
Philippines         1

The studies done for each country are almost equally distributed, except for Africa with two studies from 1990 until the present period.
There is a bulk of studies done worldwide which covers a wider array of sampling across different countries. The worldwide studies usually evaluate common programs across different countries, such as teacher effectiveness and child development programs. However, there is great difficulty in coming up with an efficient judgment of the overall standards of each program. The advantage of having a worldwide study on educational programs for different regions is to have a simultaneous description of the common programs that are running, where the funding is most likely concentrated in one team of investigators rather than separate studies with different fund allocations. Another is the efficiency of maintaining consistency of procedures across different settings, unlike different researchers setting different standards for each country.

In the case of Africa, two studies were granted concentrating on adult literacy and distance education, because these educational programs are critical in their country as compared to others. As shown in the demographics of the African region, their programs (adult literacy, distance education) are increasingly gaining benefits for their stakeholders. There is a report of remarkable improvement in their adult education, and more tertiary students are benefiting from the distance education. Since they are showing effectiveness, much funding is needed to continue the programs.

When the numbers of studies are tested for significance across countries, the computed chi-square (χ² = 35.44, df = 12) reached significance against a critical value of χ² = 21.03 at a 5% probability of error. This means that the number of studies for each country differs significantly from what is expected to be produced. This is also due to having a large concentration of studies for different regions as compared to minimal studies for each country, which made the difference.

Method of Studies

Various methodologies are used to investigate the effectiveness of educational programs across different countries, although it can be seen in the reports that there is not much concentration and elaboration on the use and implementation of the procedures done to evaluate the programs. Most only mentioned the questionnaires and assessment techniques they used. Some mentioned a broad range of methodologies, such as quasi-experiments and case studies, but the specific designs are not indicated. It can also be noted that reports written by researchers/professors from universities are very clear in their method, which is academic in nature, but World Bank personnel writing the reports tend to focus on the justification of the funding rather than the clarity of the research procedure undertaken. It can also be noted that some reports did not show any part on the methodology. Most presented the introduction and some justifications of the program and, toward the end, the recommendations. The methodologies are just mentioned and not elaborated within the report, appearing only in some parts of the justification of the program.

Table 4
Counts of Methods Used

Method                                            Counts
Questionnaires/Inventories/Tests                  4
Quasi Experimental                                5
True Experimental                                 1
Archival Data (analyzed available demographics)   6
Observations                                      1
Case Studies                                      1
Surveys                                           1
Multimethod                                       9

It can be noted in Table 4 that most studies employ a multimethod approach, where different methods are employed in a study.
The multimethod approach creates an efficient way of cross-validating results for every methodology undertaken. One result from one method can be referenced against another result from another method, which makes it more powerful than using a single method. Since evaluation of the program is being done in most studies, it is indeed better to consider using a multimethod approach, since it can generate findings from which the researcher can arrive at a better judgment and description of the program.

It can also be noted that most studies are also using archival data to make justifications for the program. Most of these researchers, in reference to the archival data, are coming up with inferences from enrollment percentages, dropout rates, achievement levels, and statistics on physical conditions such as weight and height, which can be valid but do not directly assess the effectiveness of the program. The difficulty of using these statistics is that they do not provide a post-measurement of the program evaluated. This may be due to the difficulty of arriving at national surveys on achievement levels and enrollment profiles of different educational institutions, which are done annually but may not be in concordance with the timetable of the researchers. It is also commendable that a number of studies are considering quasi-experimental designs to directly assess the effectiveness of educational programs.

When the counts of the methodologies used are tested for significance, the computed chi-square value (χ² = 18.29, df = 7) reached significance over the critical chi-square value of χ² = 14.07 with a 5% probability of error. This shows that the methodologies used vary significantly from what is expected.

The Use of Evaluation Models

The evaluation model used by each of the studies was counted. There was difficulty in identifying the models used, since the researchers did not specifically elaborate the evaluation model or framework that they were using. It can also be noted that the researchers are not really after the model but after establishing the program or the continuity of the program. There is a marked difference between university academicians and World Bank personnel doing the study: the latter are misplaced in their assessment due to the lack of guidance from a model, while academicians would specifically state the context but somehow failed to elaborate on the process of adopting a CIPP model. Most studies are clear in their program objectives but failed to provide accurate measures of the program directly. The worst is that most studies are actually not guided by the use of a model in evaluating the educational programs proposed.

Table 5
Counts of Models/Frameworks Used

Model/Framework                   Counts
Objectives-Oriented Evaluation    10
Management-Oriented Evaluation    9
Consumer-Oriented Evaluation      0
Expertise-Oriented Evaluation     7
Participant-Oriented Evaluation   1
No model specified                3

As shown in Table 5, the majority of the evaluations used the objectives-oriented model, where they specified the program objectives and evaluated accordingly. A large number also used the management-oriented model and specifically made use of the CIPP by Stufflebeam (1968). A number of studies also used experts as external evaluators of the program implementation. Most of the studies actually did not mention the model used, and the models were just identified as described by the procedure in conducting the evaluation.

Most studies used the objectives-oriented model since the thrust is on educational policy, and most educational programs start with a means of stating objectives.
These objectives are also treated as ends, where the evaluation is basically used as the basis. The other studies, which used the management-oriented evaluation, are the ones that typically describe the context of the educational setting through the available archival data provided by national and countrywide surveys. The inputs and outputs are also described, but most are weak in elaborating the process undertaken. The counts on the use of evaluation models (χ² = 18, df = 5) reached significance at 5% error. This means that the counts are significantly different from what is expected. This shows a need to use other models of evaluation as appropriate to the study being conducted.

…program, since the judgment on how the program is taking place is concentrated on, and not other matters which undermine the results of the program. A good alternative is for the research grantee to allocate another budget for a follow-up program evaluation after establishing the program.

4. It is recommended that when screening studies, a criterion on the use of an evaluation model should be included. The researchers making an evaluation study can be better guided with the use of an evaluation model.

References

O'Gara, C., Lusk, D., Canahuati, J., Yablick, G., & Huffman, S. L. (1999). Good practices in infant and toddler group care. World Bank Reports.

Operational guidelines for textbooks and reading materials. (2002). World Bank Reports.

Orazem, P. F. (2000). The urban and rural fellowship school experiments in Pakistan: Design, evaluation, and sustainability. World Bank Reports.

Ware, S. A. (1992). Secondary school science in developing countries: Status and issues. World Bank Reports.

Xie, O., & Young, M. E. (1999). Integrated child development in rural China. World Bank Reports.

Young, E. M. (2000). From early child development to human development: Investing in our children's future. World Bank Reports.
Lesson 3
The Process of Assessment
The previous lesson clarified the distinction between measurement and evaluation. After learning the process of assessment in this lesson, you should know how measurement and evaluation are used in assessment.
Assessment goes beyond measurement. Evaluation can be involved in the process of
assessment. Some definitions from assessment references show the overlap between assessment
and evaluation. But Popham (1998), Gronlund (1993), and Huba and Freed (2000) defined
assessment without overlap with evaluation. Take note of the following definitions:
Cronbach (1960) identified three important features of assessment that make it distinct from evaluation: (1) the use of a variety of techniques, (2) reliance on observation in structured and unstructured situations, and (3) the integration of information. These three features emphasize that assessment is not based on a single measure but on a variety of measures.
In the classroom, a student’s grade is composed of the quizzes, assignments, recitations, long
tests, projects, and final exams. These sources were assessed through formal and informal
structures and integrated to come up with an overall assessment as represented by a student’s
final grade. In Lesson 1, assessment was defined as “the process of collecting various information needed to come up with overall information that reflects the attainment of goals and purposes.” There are three critical characteristics of this definition:
battery of intelligence tests should yield the same results in order to determine the overall ability of a case. In cases where some results are inconsistent, there should be a synthesis of the overall assessment indicating that in some measures the results do not support the overall assessment.
Assessment Procedures
The process of assessment was summarized by Bloom (1970). He indicated that there are
two processes involved in assessment:
2. It proceeds to the determination of the kind of evidence that is appropriate about the
individuals who are placed in the learning environment such as their relevant strengths and
weaknesses, skills, and abilities.
In the classroom context, it was explained in Lesson 1 that assessment takes place before,
during and after instruction. This process emphasizes that assessment is embedded in the
teaching and the learning process. Assessment generally starts in the planning of learning
processes when learning objectives are stated. A learning objective is defined in measurable terms to have an empirical way of testing it. Specific behaviors are stated in the objectives so that they correspond with some form of assessment. During the implementation of the lesson, assessment can occur. A teacher may provide feedback based on student recitation exercises, short quizzes, and classroom activities that allow students to demonstrate the skill intended in the
objectives. The assessment done during instruction should be consistent with the skills required
in the objectives of the lesson. The final assessment is then conducted after enough assessment
can demonstrate student mastery of the lesson and their skills. The final assessment conducted
can be the basis for the succeeding objectives for the next lesson. The figure below illustrates the
process of assessment.
Figure 1
The Process of Assessment in the Teaching and Learning Context
Assessment
Forms of Assessment
Tests. Tests are basically tools that measure a sample of behavior. Generally, a variety of tests is provided inside the classroom. They can take the form of a quiz, a long test (usually covering smaller units or chapters of a lesson), or a final exam. The majority of tests for students are teacher-made tests. These tests are tailored for students depending on the lessons covered by the syllabus. The tests are usually checked by colleagues to ensure that items are properly constructed.
Teacher-made tests vary in the form of a unit, chapter, or long test. These generally assess how much a student learned within a unit or chapter. It is a summative test in the sense that it is given after instruction. The coverage is only what has been taught in a given chapter or tackled within a given unit.
Tests also come in the form of a quiz. A quiz is a short form of assessment. It usually measures how much the student acquired within a given period or class. The questions are usually drawn from what has been taught within the lesson for the day or a topic tackled in a short period of time, say a week. A quiz can be summative or formative: summative if it aims to measure the learning from instruction, or formative if it aims to test how much the students already know prior to instruction. The results of a quiz can be used by the teacher to know where to start the lesson (for example, if the students already know how to add single digits, then she can proceed to adding double digits). It can also determine whether the objectives for the day were met.
Should the teacher call more on the students who are silent most of the time in class?
Should the teacher ask students who could not comprehend the lesson easily more often?
Should recitation be a surprise?
Are the difficult questions addressed to disruptive students?
Are easy questions only for students who are not performing well in class?
Projects. Projects can come in a variety of forms depending on the objectives of the lesson; a reaction paper, a drawing, or a class demonstration can all be considered projects depending on the purpose. The features of a project should include: (1) tasks that are relevant in a real-life setting, (2) activities that require higher order cognitive skills, (3) assignments that can assess and demonstrate affective and psychomotor skills which supplement instruction, and (4) activities that require application of the theories taught in class.
and creating a script for a play, painting a vase. These tasks are usually extended as an
assignment if the time in school is not sufficient. Portfolios are collections of students’ works.
For an art class the students will compile all paintings made, for a music class all compositions
are collected, for a drafting class all drawings are compiled. Table 4 shows the different tasks
using performance assessment.
Table 4
Outcomes Requiring Performance Assessment
Outcome Behavior
Skills Speaking, writing, listening, oral reading, performing experiments, drawing,
playing a musical instrument, gymnastics, work skills, study skills, and social
skills
Work habits Effectiveness in planning, use of time, use of equipment resources, the
demonstration of such traits as initiative, creativity, persistence, dependability
Social Concern for the welfare of others, respect for laws, respect for the property of
attitudes others, sensitivity to social issues, concern for social institutions, desire to work
toward social improvement
Scientific Open-mindedness, willingness to suspend judgment, cause-effect relations, an
attitudes inquiring mind
Interests Expressing feelings toward various educational, mechanical, aesthetic, scientific,
social, recreational, vocational activities
Appreciations Feeling of satisfaction and enjoyment expressed toward music, art, literature,
physical skill, outstanding social contributions
Adjustments Relationship to peers, reaction to praise and criticism, reaction to authority,
emotional stability, social adaptability
Over the years, the practice of assessment has changed due to improvements in teaching and learning principles. These principles are a result of research that called for more information on how learning takes place. The shift from old practices to what is ideal in the classroom is shown below.

From To
Summative Formative
The old practice of assessment focuses on traditional forms of assessment, such as paper-and-pencil tests with a single correct answer, usually conducted at the end of the lesson. In the contemporary perspective, assessment is not necessarily in the form of paper-and-pencil tests, because there are skills that are better captured through performance assessment such as presentations, psychomotor tasks, and demonstrations. Contemporary practice welcomes a variety of answers from students, who are allowed to make interpretations of their own learning. It is now accepted that assessment is conducted concurrently with instruction and does not serve only a summative function. There is also a shift toward assessment items that are contextualized and have more utility. Rather than asking for the definitions of verbs, nouns, and pronouns, students are required to make an oral or written communication about their favorite book. It is also important that students assess their own performance to facilitate self-monitoring and self-evaluation.
Uses of Assessment
Assessment results have a variety of applications, from selection to appraisal to aiding stakeholders in the decision-making process. These functions of assessment vary within the educational setting, whether assessment is conducted for human resources, counseling, instruction, research, or learning.
1. Appraising. Assessment is used for appraisal. Forms of appraisal are grades, scores, ratings, and feedback. Appraisals are used to provide feedback on an individual's performance to determine how much improvement can be made. A low appraisal or negative feedback indicates that performance still has room for improvement, while a high appraisal or positive feedback means that performance needs to be maintained.
5. Accountability and program evaluation. Assessment results are used for evaluation and accountability. In making judgments about individuals or educational programs, multiple sources of assessment information are used. Results of evaluations make the administrators or the ones who implemented the program accountable to the stakeholders and other recipients of the program. This accountability ensures that the program implementation is improved depending on the recommendations from the evaluations conducted. Improvement takes place if assessment coincides with accountability.
6. Counseling. Counseling also uses a variety of assessment results. Variables such as study habits, attention, personality, and dispositions are assessed in order to help students improve them. Students who are assessed to be easily distracted inside the classroom can be helped by the school counselor by focusing the counseling session on devising ways to improve the student's attention. A student who is assessed to have difficulties in classroom tasks is taught to self-regulate during the counseling session. Students' personality and vocational interests are also assessed to guide them toward future courses suitable for them to take.
Guide Questions:
References
Fitzpatrick, J. L., Sanders, J. R., & Worthen, B. R. (2004). Program evaluation: Alternative
approaches and practical guidelines (3rd ed.). New York: Pearson.
Gronlund, N. E. (1993). How to write achievement tests and assessment (5th ed.). Needham
Heights: Allyn & Bacon.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation
standards (2nd ed.). Thousand Oaks, CA: Sage.
Magno, C. (2007). Program evaluation of the civic welfare training services (Tech Rep. No. 3).
Manila, Philippines: De La Salle-College of Saint Benilde, Center for Learning and
Performance Assessment.
McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction.
Boston: Allyn & Bacon.
Popham, W. J. (1998). Classroom assessment: What teachers need to know (2nd ed.). Needham
Heights, MA: Allyn & Bacon.
Chapter 2
The Learning Intents
Chapter Objectives
Lessons
Lesson 1
Stating Learning Intents
Having learned about measurement, assessment, and evaluation, this chapter discusses learning intents, which refer to the objectives or targets the teacher sets as the competencies to build in students. These are the target skills or capacities that students need to develop as they engage in the learning episodes. The same competencies are assessed using relevant tools to generate quantitative and qualitative information about your students' learning behavior.
Prior to designing the learning activities and assessment tasks, you first have to formulate
your learning intents. These intents exemplify the competency you wish students will develop in
themselves. At this point, your deep understanding on how learning intents should be formulated
is very useful. As you go through this chapter, your knowledge about the guidelines in
formulating these learning intents will help you understand how assessment tasks should be
defined.
One of the important skills of teachers is determining the appropriate learning intents for their students. Learning intents are the targets of the instruction and assessment that take place. Learning intents come in the form of objectives, goals, standards, criteria, and expectations. Usually teachers state their objectives at the start of instruction and assessment.
The holistic formation of the student is the primary concern in the teaching and learning process, and objectives are aimed at developing students' cognitive, affective, and psychomotor skills. The common sources of objectives are:
Existing lists of objectives. Teachers make use of existing lists of objectives in writing their objectives for a specific lesson. These existing objectives are found in the syllabus and school goals. They are made specific in the lesson plans.
National standards. There are available national standards that can be used as a basis for identifying appropriate objectives. One example is the minimum learning competencies provided by the Department of Education.
Needs of students and society. The needs of the students are a basic source for identifying objectives. These needs can be drawn from the results of previous achievement tests, needs assessments, and findings from previous school research involving the students.
Mission and vision of the school. The mission and vision of the school provide the major guide in selecting objectives for students. The mission and vision are further broken down into specific goals to be translated into unit lessons.
Writing an objective includes a criterion in order to make it more specific and measurable. It is important to make an objective concrete in order to directly observe whether it is attained during instruction. Consider the following objectives:
“From a standing still position on a level, hard surface (condition), male students (audience) will
jump (behavior) at least two feet (criterion).”
“Given two hours in the library without notes (condition), students in the high reading group (audience) will identify (behavior) five sources on the topic ‘national health insurance’ (criterion).”
1. Objectives are always intended for the learners, audience, students, or participants in a training program.
2. Objectives are specific and measurable. Each objective should have a corresponding
assessment to test whether it was met. The following table shows how each objective will be
assessed.
Objective: Given a microscope with glass slides, students in the biology class will mount 5 specimens found in the school garden.
Assessment: Performance assessment in the proper mounting of at least 5 specimens.

Objective: Given the constructed anemometer, the grade 4 pupils will record the wind speed every 5 hours.
Assessment: Listing of the wind speed every 5 hours during school time.

Objective: Given a 1-inch paper clip, the grade 5 students will measure the length, width, and area of the gym floor.
Assessment: Measurement of the gym floor.
3. Objectives should be attainable given the parameters of instruction and learning. For a 40-minute class, objectives should be realistically accomplished within the time frame.
Objectives can be classified as general and specific. Usually a general objective is stated and then broken down into specific objectives in order to attain it. Consider the following example:
General Objective:
Specific Objectives:
Lesson 2
The Conventional Taxonomic Tools
Psychomotor Domain
Imitation Observes a skill and attempts to repeat it
Manipulation Performs skill according to instruction rather than observation
Articulation Combines more than one skill in sequence with harmony and
consistency
Naturalization Completes one or more skills with ease and becomes automatic with
limited physical or mental exertion
Figure 1
Bloom’s Taxonomy
EVALUATION
SYNTHESIS
ANALYSIS
APPLICATION
COMPREHENSION
KNOWLEDGE
Figure 1 shows a guide for teachers in stating learning intents based on six dimensions of cognitive process. Knowledge, the dimension lowest in complexity, includes simple cognitive activity such as recall or recognition of information. The cognitive activity in comprehension includes understanding of the information and concepts, translating them into
comprehension includes understanding of the information and concepts, translating them into
other forms of communication without altering the original sense, interpreting, and drawing
conclusions from them. For application, emphasis is on students’ ability to use previously
acquired information and understanding, and other prior knowledge in new settings and applied
contexts that are different from those in which it was learned. For learning intents stated at the
Analysis level, tasks require identification and connection of logic, and differentiation of
concepts based on logical sequence and contradictions. Learning intents written at this level
indicate behaviors that indicate ability to differentiate among information, opinions, and
inferences. Learning intents at the synthesis level are stated in ways that indicate students’ ability
to produce a meaningful and original whole out of the available information, understanding,
contexts, and logical connections. Evaluation includes students’ ability to make judgments and
sound decisions based on defensible criteria. Judgments include the worth, relevance, and value
of some information, ideas, concepts, theories, rules, methods, opinions, or products.
Knowledge of cognition and awareness. The subtypes are strategic knowledge (e.g., the use of heuristics), knowledge of cognitive tasks (e.g., knowledge of the cognitive demands of different tasks), and self-knowledge (awareness of one's own knowledge level).
The Cognitive Process Dimension is where specific behaviors are pegged using active verbs. So that there is consistency in the description of specific learning behaviors, the categories in the original taxonomy, which were labeled in noun forms, are now replaced with their verb counterparts. Synthesis changed places with Evaluation, and both are now stated in verb forms.
Remembering. This includes recalling and recognizing relevant knowledge from long-
term memory.
Understanding. This is the determination of the meanings of messages from oral, written
or graphic sources.
Applying. This involves carrying out procedural tasks, executing or implementing them in
particular realistic contexts.
Analyzing. This includes deducing concepts into clusters or chunks of ideas and
meaningfully relating them together with other dimensions.
Evaluating. This is making judgments relative to clear standards or defensible criteria to
critically check for depth, consistency, relevance, acceptability, and other areas.
Creating. This includes putting together some ideas, concepts, information, and other
elements to produce complex and original, but meaningful whole as an outcome.
The use of the revised taxonomy in different programs has benefited both teachers and
students in many ways (Ferguson, 2002; Byrd, 2002). The benefits generally come from the fact
that the revised taxonomy provides clear dimensions of knowledge and cognitive processes in
which to focus in the instructional plan. It also allows teachers to set targets for metacognition
concurrently with other knowledge dimensions, which is difficult to do with the old taxonomy.
Figure 3
Lesson 3
Other Learning Taxonomies
Bloom's taxonomy and the revised taxonomy are not the only existing taxonomic tools for setting our instructional targets. There are other equally useful taxonomies.
Gagne’s Taxonomy
One of these was developed by Robert M. Gagne. In his theory of instruction, Gagne sought to help teachers make sound educational decisions so that the probability of achieving the desired learning results is high. These decisions necessitate the setting of intentional goals that assure learning.
In stating learning intents using Gagne’s taxonomy, we can focus on three domains. The
cognitive domain includes Declarative (verbal information), Procedural (intellectual skills), and
Conditional (cognitive strategies) knowledge. The psychological domain includes affective
knowledge (attitudes). The psychomotor domain involves the use of physical movement (motor
skills).
Verbal Information. Verbal information includes a vast body of organized knowledge that
students acquire through formal instructional processes, and other media, such as television, and
others. Students understand the meaning of concepts rather than just memorizing them. This
condition of learning lumps together the first two cognitive categories of Bloom’s taxonomy.
Learning intents must focus on differentiation of contents in texts and other modes of
communication; chunking the information according to meaningful subsets; remembering and
organizing information.
Intellectual skills. Intellectual skills include procedural knowledge that ranges from
Discrimination, to Concrete Concepts, to Defined Concepts, to Rules, and to Higher Order
Rules.
Discrimination involves the ability to distinguish objects, features, or symbols. Detection
of difference does not require naming or explanation.
Concrete Concepts involve the identification of classes of objects, features, or events,
such as differentiating objects according to concrete features, such as shape.
Defined Concepts include classifying new and contextual examples of ideas, concepts, or events by their definitions. Here, students make use of labels of terms denoting defined concepts for certain events or conditions.
Rules apply a single relationship to solve a group of problems. The problem to be solved
is simple, requiring conformance to only one simple rule.
Higher order rules include the application of a combination of rules to solve a complex
problem. The problem to be solved requires the use of complex formula or rules so that
meaningful answers are arrived at.
Learning intents stated at this level of the cognitive domain must give attention to abilities to spot distinctive features, use information from memory to respond to intellectual tasks in various contexts, and make connections between concepts and relate them to appropriate situations.
Cognitive strategies. Cognitive strategies consist of a number of ways to make students develop skills in guiding and directing their own thinking, actions, feelings, and their learning process as a whole. Students create and hone their metacognitive strategies. These processes help them regulate and oversee their own learning, and consist of planning and monitoring their cognitive activities, as well as checking the outcomes of those activities. Learning intents should emphasize abilities to describe and demonstrate original and creative strategies that students have tried out in various conditions.
Attitudes. Attitudes are internal states of being that are acquired through earlier
experience of task engagement. These states influence the choice of personal response to things,
events, persons, opinions, concepts, and theories. Statements of learning intents must establish a
degree of success associated with desired attitude, call for demonstration of personal choice for
actions and resources, and allow observation of real-world and human contexts.
Motor skills. Motor Skills are well defined, precise, smooth and accurately timed
execution of performances involving the use of the body parts. Some cognitive skills are required
for the proper execution of motor activities. Learning intents drawn at this domain should focus
on the execution of fine and well-coordinated movements and actions relative to the use of
known information, with acceptable degree of mastery and accuracy of performance.
Stiggins and Conklin’s Taxonomy
Another taxonomic tool is the one developed by Stiggins and Conklin (1992), which uses
categories of learning as bases for stating learning intents.
Knowledge. This includes simple understanding and mastery of a great deal of subject
matter, processes, and procedures. Very fundamental to the succeeding stages of learning is the
knowledge and simple understanding of the subject matter. This learning may take the form of
remembering facts, figures, events, and other pertinent information, or of describing, explaining,
and summarizing concepts and citing examples. Learning intents must endeavor to develop
mastery of facts and information as well as simple understanding and comprehension of them.
Reasoning. This indicates the ability to use deep knowledge of subject matter and
procedures to reason defensibly and solve problems with efficiency. Tasks under this
category include critical and creative thinking, problem solving, making judgments and
decisions, and other higher order thinking skills. Learning intents must, therefore, focus on the
use of knowledge and simple understanding of information and concepts to reason and solve
problems in contexts.
Skills. This highlights the ability to demonstrate skills to perform tasks with an acceptable
degree of mastery and adeptness. Skills involve overt behaviors that show knowledge and deep
understanding. For this category, learning intents have to take particular interest in the
demonstration of overt behaviors or skills in actual performance that requires procedural
knowledge and reasoning.
Product. In this area, the ability to create and produce outputs for submission or oral
presentations is given importance. Because outputs generally represent mastery of knowledge,
deep understanding, and skills, they must be considered as products that demonstrate the ability
to use that knowledge and deep understanding, and to employ skills in a strategic manner so that
tangible products are created. For the statement of learning intents, teachers must state expected
outcomes, either process- or product-oriented.
Affect. Focus is on the development of values, interests, motivation, attitudes, self-
regulation, and other affective states. In stating learning intents on this category, it is important
that clear indicators of affective behavior can easily be drawn from the expected learning tasks.
Although many teachers find it difficult to determine indicators of affective learning, it is
inspiring to realize that it is not impossible to assess it.
These categories of learning by Stiggins and Conklin are helpful especially if your intents
focus on complex intellectual skills and the use of these skills in producing outcomes to increase
self-efficacy among students. In attempting to formulate statements of learning outcome at any
category, you can be clear about what performance you want to see at the end of the instruction.
In terms of assessment, you would know exactly what to do and what tools to use in assessing
learning behaviors based on the expected performance. Although stating learning outcomes in
the affective category is not as easy as in the knowledge and skill categories, trying it
can help you approximate the degree of engagement and motivation required to perform what is
expected. If you would also like to give prominence to this category without stating another
learning intent that particularly focuses on the affective states, you might just look for some
indicators in the cognitive intents. This is possible because knowledge, skills, and attitudes are
embedded in every single statement of learning intent.
Marzano’s Dimension of Learning
Another alternative guide for setting learning targets is the one introduced by Robert J.
Marzano in his Dimensions of Learning (DOL). As a taxonomic tool, the DOL provides a
framework for assessing various types of knowledge as well as different aspects of processing,
which comprise six levels of learning in a taxonomic model called the new taxonomy
(Marzano & Kendall, 2007). These levels of learning are categorized into different systems.
The Cognitive System. The cognitive system includes those cognitive processes that
effectively use or manipulate information, mental procedures, and psychomotor procedures in
order to successfully complete a task. It covers the first four levels of learning.
Level 1: Retrieval. At this level of the cognitive system, students engage in mental
operations for the recognition and retrieval of information, mental procedures, or psychomotor
procedures. Students engage in recognizing, where they identify the characteristics, attributes,
qualities, aspects, or elements of information, a mental procedure, or a psychomotor procedure;
recalling, where they remember relevant features of information, a mental procedure, or a
psychomotor procedure; or executing, where they carry out a specific mental or psychomotor
procedure. Neither an understanding of the structure and value of the information nor of the
hows and whys of the mental or psychomotor procedure is necessary.
In each system, three dimensions of knowledge are involved: information, mental
procedures, and psychomotor procedures.
Information
The domain of informational knowledge involves various types of declarative knowledge
that are ordered according to levels of complexity. From its most basic to more complex levels, it
includes vocabulary knowledge, in which the meanings of words are understood; factual knowledge,
in which information constituting the characteristics of specific facts are understood; knowledge
of time sequences, where understanding of important events between certain time points is
obtained; knowledge of generalizations of information, where pieces of information are
understood in terms of their warranted abstractions; and knowledge of principles, in which causal
or correlational relationships of information are understood. The first three types of
informational knowledge focus on knowledge of informational details, while the next two types
focus on informational organization.
Mental Procedures
The domain of mental procedures involves those types of procedural knowledge that
make use of the cognitive processes in a special way. In its hierarchic structure, mental
procedures could be as simple as the use of a single rule, in which production is guided by a small
set of rules requiring a single action. If single rules are combined into general rules and are
used in order to carry out an action, the mental procedures are of the tactical type, or an
algorithm, especially if specific steps are set for specific outcomes. Macroprocedures are at the
top of the hierarchy of mental procedures, involving the execution of multiple interrelated
processes and procedures.
Psychomotor Procedures
The domain of psychomotor procedures involves those physical procedures for
completing a task. In the new taxonomy, psychomotor procedures are considered a dimension of
knowledge because, very similar to mental procedures, they are regulated by the memory system
and develop in a sequence from information to practice, then to automaticity (Marzano &
Kendall, 2007).
In summary, the new taxonomy of Marzano and Kendall (2007) provides us with a
multidimensional taxonomy where each system of thinking comprises three dimensions of
knowledge that will guide us in setting learning targets for our classrooms. Table 2a shows the
matrix of the thinking systems and dimensions of knowledge.
The six levels of learning in the new taxonomy, from highest to lowest, are:
Level 6: Self System
Level 5: Metacognitive System
Level 4: Knowledge Utilization (Cognitive System)
Level 3: Analysis (Cognitive System)
Level 2: Comprehension (Cognitive System)
Level 1: Retrieval (Cognitive System)
Figure 5
The Hats

Hat     Perspective     Representation
White   Observer        White paper, neutral
Red     Self & others   Fire, warmth
Black   Stern judge     Wearing black
Yellow  Self & others   Sunshine, optimism
Green   Self & others   Vegetation
Blue    Observer        Sky, cool
These six thinking hats are beneficial not only in our teaching episodes but also in the
learning intents that we set for our students. If qualities of thinking such as creative thinking,
communication, decision-making, and metacognition are among those that you want to develop
in your students, these six thinking hats could help you formulate statements of learning intents
that clearly set the direction of learning. An added benefit is that when your intents are
stated along the planes of these hats, the learning episodes can be defined easily. Consequently,
assessment is made more meaningful.
A. Formulate statements of learning intent using the Revised Taxonomy, focusing on any
category of the knowledge dimension but on the higher categories of the cognitive dimension.
B. Bring those statements of learning intents to Robert Gagne’s taxonomy and see where they
will fit. You may customize the statements a bit so that they fit well to any of Gagne’s
categories of learning.
C. Do the same process of fitting to Stiggins’ categories of learning, then The New
Taxonomy. Remember to customize the statements when necessary.
D. Draw insights from the process and share them in class.
Lesson 4
Specificity of the Learning Intent
Gronlund (in McMillan, 2005) uses the term instructional objectives to mean
intended learning outcomes. He emphasizes that instructional objectives should be
stated in terms of specific, observable, and measurable student responses.
In writing statements of learning intents for the courses we teach, we aim to state the behavior
outcomes to which our teaching efforts are devoted, so that, from these statements, we can
design specific tasks in the learning episodes for our students to engage in. However, we need
to make sure that these statements are set with the proper level of generality so that they
neither oversimplify nor complicate the outcome.
A statement of intent could have a rather long range of generality so that many sub-
outcomes may be indicated. Learning intents that are stated in general terms will need to be
defined further by a sample of the specific types of student performance that characterize the
intent. In doing this, assessment will be easy because the performance is clearly defined. Unlike
the general statements of intent that may permit the use of not-so-active verbs such as know,
comprehend, understand, and so on, the specific ones use active verbs in order to define specific
behaviors that will soon be assessed. The selection of these verbs is vital in the preparation
of a good statement of learning intent. Three points might help in selecting active verbs:
1. See that the verb clearly represents the desired learning intent.
2. Note that the verb precisely specifies acceptable performance of the student.
3. Make sure that the verb clearly describes relevant assessment to be made within or at the
end of the instruction.
The statement “Students know the meaning of terms in science” is general. Although it
gives us an idea of the general direction of the class towards the expected outcome, we might
be confused as to what specific behaviors of knowing will be assessed. Therefore, it is necessary
to draw a representative sample of specific learning intents, for example:
Given a short selection, the student can identify statements of facts and of
opinions.
If more specificity is still desired, you might want to add a statement of
criterion level. This time, the statement may sound like this:
Given a short selection, the student can correctly identify at least 5 statements of
facts and 5 statements of opinion in no more than five minutes without the aid of
any resource materials.
The lesson plan may allow the use of moderately specific statements of learning intents,
with condition and criterion level briefly stated. In doing assessment, however, these intents will
have to be broken down to their substantial details, such that the condition and criterion level are
specifically indicated. Note that it is not necessarily about choosing which one statement is better
than the other. We can use them in planning for our teaching. Take a look at this:
Learning Intent Student will differentiate between facts and opinions from written texts.
Assessment Given a short selection, the student can correctly identify at least 5
statements of facts and 5 statements of opinion in no more than five
minutes without the aid of any resource materials.
If you insert into the text the instructional activities or learning episodes, well described, as
well as the materials needed (plus other entries specified in your context), you now have a
simple lesson plan.
Exercises
Place an “X” before each of those in the following list which are student objectives stated in
measurable terms.
___ 1. To develop critical thinking skills.
___ 2. To identify those celestial bodies that are known as planets.
___ 3. To provide worthwhile experiences for the students.
___ 4. To recognize subject and verb in a sentence.
___ 5. To tie shoes in a bow, without making a knot.
___ 6. To write a summary of factors that led to World War II.
___ 7. To fully appreciate the value of music.
___ 8. To prepare a critical comparison of the two major political parties in the United States
today.
___ 9. To illustrate an awareness of the importance of balanced ecology by supplying relevant
newspaper articles.
___ 10. To know all the rules of spelling and grammar.
Classify each of the following instructional objectives by writing on the blank space the
appropriate letter according to the domain: (C) Cognitive; (P) Psychomotor; (A) Affective.
____ 1. The student will continue jumping rope until he or she can successfully jump it
ten times in succession.
____ 2. The student will identify the capitals of all fifty states.
____ 3. The student will summarize the history of the development of the Republican Party in
the United States.
____ 4. The student will demonstrate a continuing desire to learn to use the microscope by
volunteering to work with it during free time.
____ 5. The student will volunteer to tidy up the room.
____ 6. After reading and analyzing several books, the student will identify the respective
authors.
____ 7. The student will translate a favorite Vietnamese poem into English.
____ 8. The student will accurately predict the results of combining genes from an available
gene pool.
____ 9. The student will indicate his or her interest in the subject by voluntarily reading
additional books from the library about dinosaurs.
____ 10. The student will make the ring toss in a minimum of seven in ten attempts.
Instruction: Indicate the cognitive level of the following questions. Write whether they are
knowledge, comprehension, application, analysis, synthesis or evaluation.
_________________ 1. What was the name of the organization represented by our great
speaker?
_________________ 2. How are the styles of the two artists similar?
_________________ 3. Which of the poems do you think is the most interesting?
_________________ 4. What other tools could you use to accomplish the same task?
_________________ 5. What country lies between China and India?
_________________ 6. How might these rocks be logically grouped?
_________________ 7. Could you explain how these two types of redwood needles differ?
_________________ 8. What do you predict would happen if we mixed equal amounts of two
colored solutions, the red solution with the yellow solution?
_________________ 9. With the key words provided, compose an eight line poem.
_________________ 10. Describe how this poem makes you feel.
_________________ 11. Do you suppose everyone feels the same after reading that poem?
_________________ 12. What do you think caused the city to move the location of the zoo?
_________________ 13. How would the park be different today if it had been left there?
_________________ 14. Observe what happens when I pour in the second liquid.
_________________ 15. Using what you have learned about silent letters, circle all the
words in the one-page story that use silent letters.
11. Compare and contrast two works of art in terms of form, color, and texture
12. Conduct a debate about abortion
13. Name a newly discovered insect
14. Trace your own family tree
15. Which is the best brand of orange juice and why?
16. Recommend a revision in the Philippine constitution given the disaster events in the
Philippines
Criterion item: The instructor will play the melody of the attached musical score on the piano
and will make an error either in rhythm or melody. Raise your hand when the error occurs.
2. Objective: Given mathematical equations containing one unknown, be able to solve for the
unknown.
Criterion Item: Sam weighs 97 kilos. He weighs 3.5 kilos more than Barry. How much does
Barry weigh?
Criterion Item: Draw and label a sketch of the male and female reproductive systems.
4. Objective: Given any one of the computers in our product line, in its original carton, be able
to install and adjust the machine, preparing it for use. Criteria: The machine shows normal
indication, and the area is free of debris and cartons.
Criterion item: Select one of the cartons containing one of our model XX computers, and install
it for the secretary in Room 45. Make sure it is ready for use and the area is left clean.
5. Objective: When given a set of paragraphs (that use words within your vocabulary), some of
which are missing topic sentences, be able to identify the paragraph without topic sentences.
Criterion Item: Turn to page 29 in your copy of Silas Marner. Underline the topic sentence of
each paragraph on that page.
Answer key
1. P, 2. C, 3. C, 4. A, 5. P, 6. C, 7. C, 8. C, 9. A, 10. P
References
Byrd, P. A. (2002). The revised taxonomy and prospective teachers. Theory into Practice, 41(4),
244.
Ferguson, C. (2002). Using the revised taxonomy to plan and deliver team-taught, integrated,
thematic units. Theory into Practice, 41, 238.
Marzano, R. J., & Kendall, J. S. (2007). The new taxonomy of educational objectives (2nd ed.).
CA: Sage Publications Company.
Stiggins, R. & Conklin, N. (1992). In teachers’ hands: Investigating the practice of classroom
assessment. Albany, NY: SUNY Press.
Chapter 3
Characteristics of an Assessment Tool
Objectives
1. Determine the use of the different ways of establishing an assessment tool’s validity and
reliability.
2. Become familiar with the different methods of establishing an assessment tool’s validity
and reliability.
3. Assess how good an assessment tool is by determining its indices of validity, reliability,
item discrimination, and item difficulty.
Lessons
1 Reliability
Test-retest
Split-half
Parallel Forms
Internal Consistency
Inter-rater Reliability
2 Validity
Content
Criterion-related
Construct Validity
Divergent/Convergent
Lesson 1
Reliability
What makes a good assessment tool? How does one know that a test is good enough to be used?
Educational assessment tools are judged by their ability to provide results that meet the needs of
users. For example, a good test provides accurate findings about a student’s achievement if users
intend to determine achievement levels. The achievement results should remain stable across
different conditions so that they can be used over longer periods of time.
A good assessment tool should be reliable, valid, and able to discriminate traits. You
have probably encountered tests on the internet and in magazines that tell you what kind of
personality you have, your interests, and your dispositions. In order to determine these
characteristics accurately, such tests should show you evidence that they are indeed valid and
reliable. You need to be critical in selecting what test to use and consider well whether these
tests are indeed valid and reliable. There are several ways of determining how reliable and valid
an assessment tool is, depending on the nature of the variable and the purpose of the test. These
techniques involve different statistical analyses, and this chapter also provides the procedures
for their computation and interpretation.
Reliability is the consistency of scores across conditions of time, forms, tests, items,
and raters. The consistency of results in an assessment tool is determined statistically using the
correlation coefficient. You can refer to the next section of this chapter to see how a
correlation coefficient is estimated. Each type of reliability will be explained in two ways:
conceptually and analytically.
Test-retest Reliability
Test-retest reliability is the consistency of scores when the same test is administered again on
another occasion. For example, in order to determine whether a spelling test is reliable, the same
spelling test is administered again to the same students at a different time. If the scores in
the spelling test across the two occasions are the same, then the test is reliable. Test-retest is a
measure of temporal stability since the test score is tested for consistency across a time gap. The
time gap between the two testing conditions can be within a week or a month; generally it does not
exceed six months. Test-retest is more appropriate for variables that are stable, like psychomotor
skills (typing test, block manipulation tests, grip strength), aptitude (spatial, discrimination,
visual rotation, syllogism, abstract reasoning, topology, figure ground perception, surface
assembly, object assembly), and temperament (extraversion/introversion, thinking/feeling,
sensing/intuiting, judging/perceiving).
To analyze the test-retest reliability of an assessment tool, the first and second sets of
scores of a sample of test takers are correlated. The higher the correlation, the more reliable the
test.
Correlating two variables involves producing a linear relationship between the sets of scores. For
example, a 50-item aptitude test was administered to 10 students at one time. Then it was
administered again after two weeks to the same 10 students. In the resulting data, ‘student A’ got
a score of 45 on the first occasion of the aptitude test and, after two weeks, a score of 47 on the
same test. For ‘student B,’ a score of 30 was obtained on the first occasion and 33 after two
weeks. The same goes for students C, D, E, F, G, H, I, and J. The scores on the test at time 1 and
the retest at time 2 are plotted in a graph called a scatterplot below. The straight line projected
through the points is called a regression line. The closer the plots are to the regression line, the
stronger the relationship between the test and retest scores. If
their relationship is strong, then the test scores are consistent and can be interpreted as reliable.
To estimate the strength of the relationship a correlation coefficient needs to be obtained. The
correlation coefficient gives information about the magnitude, strength, significance, and
variance of the relationship of two variables.
[Scatterplot: Aptitude Test (Time 1) scores on the x-axis plotted against Aptitude Retest (Time 2)
scores on the y-axis, with students A–J falling close to the regression line.]
Different types of correlation coefficients are used depending on the level of measurement of a
variable. Levels of measurement can be nominal, ordinal, interval, and ratio. More information
about the levels of measurement is explained in the beginning chapters of any statistics book.
Most commonly, assessment data are on interval scales. For interval and ratio (continuous)
variables, the statistic that estimates the correlation coefficient is the Pearson Product Moment
correlation, or r. The r is computed using the formula:

r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}
Where
r = correlation coefficient
N = number of cases (respondents, examinees)
ΣXY = summation of the product of X and Y
ΣX = summation of the first set of scores designated as X
ΣY = summation of the second set of scores designated as Y
ΣX² = sum of the squares of the first set of scores
ΣY² = sum of the squares of the second set of scores
To obtain the parameters ΣX, ΣY, ΣX², and ΣY², a table is set up.
To obtain the value of 2115 in the 4th column (XY), simply multiply 45 by 47; 2025 in the 5th
column is obtained by squaring 45 (45², or 45 x 45); and 2209 in the last column is obtained by
squaring 47 (47², or 47 x 47). The same is done for each pair of scores in each row. The values of
ΣX, ΣY, ΣX², and ΣY² are obtained by adding up, or summating, the scores from student A to
student J. The values are then substituted in the equation for the Pearson r.
r = \frac{10(8095) - (254)(286)}{\sqrt{[10(7356) - (254)^2][10(8948) - (286)^2]}}

r = .996
A correlation coefficient of .996 indicates a very strong relationship between the aptitude test
and retest scores. Cut-off values can be used as a guide to determine the strength of the
relationship.
For significance, the test asks whether the odds favor the demonstrated relationship between X
and Y being real as opposed to being due to chance. If the odds favor the relationship being real,
then the relationship is said to be significant. Consult a statistics book for a detailed explanation
of testing for the significance of r. To test whether a correlation coefficient of .996 is significant,
it is compared with a critical value of r. The critical values for r are found in Appendix A of this
book. Assuming that the probability of error is set at an alpha level of .05 (meaning the
probability [p] is less than [<] 5 out of 100 [.05] that the demonstrated relationship is due to
chance) (DiLeonardi & Curtis, 1992), and the degrees of freedom are 8 (df = N - 2 = 10 - 2 = 8),
a critical value of .632 is obtained. The value .632 is the intersecting value in Appendix A for
df = 8 and an alpha level of .05. Significance is attained when the obtained value is greater than
the critical value. In this case, since .996 is greater than .632, there is a significant relationship
between the aptitude test and the retest scores.
The variance is interpreted as the amount of overlap between X and Y: the “percentage of the
time that the variability in X accounts for or explains the variability in Y.” It is determined by
squaring the correlation coefficient (r²). For the given data set, the variance would be
r² = .996² = .992, or 99.2 percent (.992 x 100). To interpret this value: “99.2 percent of the time,
the scores during the first aptitude test account for or explain the scores during the retest.”
Generally, a correlation coefficient of .996 indicates that the aptitude scores on the test
and the retest are highly reliable or consistent, since the value is very strong and significant.
Software is provided with this book to help you compute test-retest correlation coefficients and
the other techniques for establishing reliability and validity. A detailed demonstration of the
software is found at the end of this chapter.
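To make the computation concrete, here is a minimal Python sketch of the Pearson r formula above. This is only an illustration, not the software that accompanies this book, and the ten score pairs are hypothetical apart from students A and B, whose scores appear in the text.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from raw scores,
    following the raw-score formula given above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical test-retest scores for ten examinees; only students
# A (45, 47) and B (30, 33) are taken from the text.
time1 = [45, 30, 27, 20, 25, 22, 40, 30, 18, 32]
time2 = [47, 33, 26, 21, 27, 24, 42, 31, 17, 30]
print(round(pearson_r(time1, time2), 3))
```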
Parallel Forms Reliability
In this technique, two tests are used that are equivalent in difficulty, format, number of
items, and the specific skills measured. One of the equivalent forms is administered to the
same examinees on one occasion and the other on a different occasion. Parallel forms is both a
measure of temporal stability and of consistency of responses. Since the two tests are administered
separately across time, it is a measure of temporal stability like the test-retest. But on the second
occasion, what is administered is not the exact same test but an equivalent form of it.
Assuming that the two tests really measure the same characteristics, then there should be
consistency in the scores. Parallel forms can be used for affective and cognitive measures in
general, as long as there are available forms of the test.
Reliability is determined by correlating the scores from the first form and the second
form. In most cases, Form A of the test is correlated with Form B of the test. A strong and
significant relationship would indicate equivalence and consistency of the two forms.
Split-half Reliability
In split-half, the test is split into two parts, and the scores on each part should show
consistency. The logic behind splitting the test into two parts is to determine whether the scores
within the same test are internally consistent or homogeneous.
There are many ways of splitting the test into two halves. One is by randomly distributing
the items equally into two halves. Another is separating the odd-numbered items from the even-
numbered items. In doing split-half reliability, one ensures that the test contains a large number
of items so that several items remain in each half. The assumption is that a test with more items
is more reliable.
Split-half is analyzed by first summating the total scores for each half of the test for each
participant. The paired total scores are then correlated. A high correlation coefficient indicates
internal consistency of the responses in the test. Since only half of the test is correlated with the
other half, a correction formula called the Spearman-Brown (r_tt) is used to estimate the
reliability of the test at its full length. The formula is:

r_{tt} = \frac{2r}{1 + r}

where r is the correlation between the two halves of the test and r_tt is the estimated reliability
of the whole test.
Suppose that a test measuring aggression with 60 items was split into two halves of 30 items
each, and the computed r is .93. The Spearman-Brown coefficient would be .96. Observe that the
correlation coefficient of .93 increased to .96 when converted into the Spearman-Brown
coefficient.

r_{tt} = \frac{2(.93)}{1 + .93} = .96
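A minimal Python sketch of the split-half procedure with the Spearman-Brown correction may help. The odd-even split and the use of statistics.correlation (Python 3.10 or later) are implementation choices made here for illustration, not the software that accompanies the book.

```python
from statistics import correlation  # Python 3.10+

def spearman_brown(r_half):
    """Step the half-test correlation up to the full test length."""
    return (2 * r_half) / (1 + r_half)

def split_half_reliability(item_scores):
    """item_scores: one list of item responses per examinee.
    Totals the odd- and even-numbered items separately, correlates
    the two half-test totals, and applies the Spearman-Brown correction."""
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(correlation(odd_totals, even_totals))

# Using the half-test correlation from the aggression example:
print(round(spearman_brown(0.93), 2))  # -> 0.96
```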
Internal Consistency
Several techniques can be used to test whether the responses to the items of a test are
internally consistent. The Kuder-Richardson, Cronbach’s alpha, interitem correlation, and item-
total correlation can be used.
The Kuder-Richardson formula 20 (KR20) is used if the responses in the data are binary. Usually
it is used for tests with right or wrong answers, where correct responses are coded as “1” and
incorrect responses are coded as “0.” The KR20 formula is:

KR_{20} = \frac{k}{k - 1}\left(1 - \frac{\sum pq}{\sigma^2}\right)

To determine σ² (the variance):

\sigma^2 = \frac{\sum x^2}{N - 1}

Where
k = number of items
p = proportion of students with correct answers
q = proportion of students with incorrect answers
σ² = variance
Σx² = sum of squares (the squared deviations of the total scores from their mean)
N = number of examinees
Suppose that the following data were obtained on a 10-item math test (“1” = correct answer,
“0” = incorrect answer) among 10 students:
Student  Item1  Item2  Item3  Item4  Item5  Item6  Item7  Item8  Item9  Item10  Total(X)  X-X̄   (X-X̄)²
A        1      1      1      1      1      1      1      1      1      1       10        2.8    7.84
B        1      1      1      1      1      1      1      0      1      1       9         1.8    3.24
C        1      1      1      1      1      1      1      0      0      1       8         0.8    0.64
D        1      1      1      1      1      1      1      1      0      0       8         0.8    0.64
E        1      1      1      1      1      1      1      0      0      0       7        -0.2    0.04
F        1      1      1      1      1      1      0      0      0      1       7        -0.2    0.04
G        1      1      1      1      1      1      0      1      0      0       7        -0.2    0.04
H        1      1      1      0      0      0      1      1      1      0       6        -1.2    1.44
I        1      1      1      1      0      1      0      0      0      0       5        -2.2    4.84
J        1      1      1      1      0      0      0      0      0      1       5        -2.2    4.84
Total    10     10     10     9      7      8      6      4      3      5       X̄ = 7.2         Σ(X-X̄)² = 23.6
p        1      1      1      0.9    0.7    0.8    0.6    0.4    0.3    0.5     σ² = 2.62
q        0      0      0      0.1    0.3    0.2    0.4    0.6    0.7    0.5
pq       0      0      0      0.09   0.21   0.16   0.24   0.24   0.21   0.25    Σpq = 1.4
Computation of Variance:
Get the total score of each examinee (X), then compute the average of the ten examinees’
scores (X̄ = 7.2). Subtract the mean from each individual total score (X - X̄), then square each
of these differences, (X - X̄)². The summation of these squared differences is Σ(X - X̄)². In the
given data, Σ(X - X̄)² = 23.6 and N = 10. Substitute these values to obtain the variance.

\sigma^2 = \frac{23.6}{10 - 1} = 2.62
KR20 Computation:
With the variance computed (σ² = 2.62), the next step is to obtain the value of Σpq. This is done
by summating the total correct responses for each item (Total). This total is converted into a
proportion (p) by dividing by the total number of cases (N = 10). A total of 10, when divided by
10 (N), gives a proportion of 1. To determine q, the proportion incorrect, subtract the proportion
correct from 1. If the proportion correct is 0.9, the proportion incorrect is 0.1. Then pq is
determined by multiplying p and q. The summation of the pq values yields Σpq, which has the
value 1.4. Substitute all the values in the KR20 formula.

KR_{20} = \frac{10}{10 - 1}\left(1 - \frac{1.4}{2.62}\right)

KR_{20} = 0.52
The internal consistency of the 10-item math test is 0.52, indicating that the responses are not
highly consistent with each other.
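The KR20 computation can also be expressed compactly in code. The following minimal Python sketch reproduces the worked example above; the data matrix is the same 10-student by 10-item table.

```python
def kr20(scores):
    """KR20 for binary items; scores is one list of 0/1 responses per student."""
    k = len(scores[0])                       # number of items
    n = len(scores)                          # number of students
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / (n - 1)
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n   # proportion correct
        sum_pq += p * (1 - p)                   # p times q
    return (k / (k - 1)) * (1 - sum_pq / variance)

# The 10-student x 10-item data from the table above.
data = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],  # B
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 1],  # C
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # D
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 1],  # F
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],  # G
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],  # H
    [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],  # I
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],  # J
]
print(round(kr20(data), 2))  # -> 0.52
```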
Cronbach’s alpha also determines the internal consistency of the responses to items
in the same test. Cronbach’s alpha can be used for responses that are not limited to the binary
type, such as a five-point scale or other response formats that are expressed numerically. Usually,
tests beyond the binary type are affective measures and inventories where there are no right or
wrong answers.
Suppose that a five-item test measuring attitude towards school assignments was
administered to five high school students. Each item in the questionnaire is answered using a
5-point Likert scale (5 = strongly agree, 4 = agree, 3 = not sure, 2 = disagree, 1 = strongly disagree).
Below are the five items that measure attitude towards school assignments. Each student
selects, on a Likert scale of 1 to 5, how they respond to each of the items. Then their responses
are encoded.
The next table shows how Cronbach’s alpha is determined given the responses of the
five students. In the table, student A answered ‘5’ for item 1, ‘5’ for item 2, ‘4’ for item 3,
‘4’ for item 4, and ‘1’ for item 5. The same goes for students B, C, D, and E.
In computing Cronbach’s alpha, the variance of the students’ total scores (σt²) and the
summation of the variances of the individual items (ΣSD²) are used. Obtaining the variance of
the total scores is the same as in the Kuder-Richardson: the mean of the total scores is subtracted
from each score, each difference is squared, and the sum of the squared differences (22.8) is
divided by n - 1 (5 - 1 = 4), giving σt² = 5.7. To obtain the item variances, get the sum of all
scores per item (summating down each column in the table below, ΣX), then square each score
and summate down each column again (ΣX²). Each item has its own ΣX and ΣX². These
parameters are used to obtain the variance of each item (here n is the number of respondents):

SD^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}

After obtaining the variance of each item, summate all these variances to get ΣSD² = 5.2. The
values obtained can now be substituted in the formula for Cronbach’s alpha (here n is the
number of items):

\alpha = \frac{n}{n - 1}\left(\frac{\sigma_t^2 - \sum SD^2}{\sigma_t^2}\right)

\alpha = \frac{5}{5 - 1}\left(\frac{5.7 - 5.2}{5.7}\right)

Cronbach’s α = .10
Student   item1  item2  item3  item4  item5   Total (X)   Score-Mean   (Score-Mean)²
A         5      5      4      4      1       19           2.8          7.84
B         3      4      3      3      2       15          -1.2          1.44
C         2      5      3      3      3       16          -0.2          0.04
D         1      4      2      3      3       13          -3.2         10.24
E         3      3      4      4      4       18           1.8          3.24
ΣX        14     21     16     17     13      X̄ = 16.2                 Σ(Score-Mean)² = 22.8
ΣX²       48     91     54     59     39      σt² = 22.8 / (5 - 1) = 5.7
SD²       2.2    0.7    0.7    0.3    1.3     ΣSD² = 5.2
The internal consistency of the responses to the attitude towards school assignments scale is .10, indicating low internal consistency.
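The same logic can be expressed in code. The following minimal Python sketch reproduces the attitude example above; note that it uses sample variances (dividing by n - 1), as in the worked computation.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha; scores is one list of numeric item responses
    per student. Sample variances (n - 1) are used throughout."""
    n_items = len(scores[0])

    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    total_var = var([sum(row) for row in scores])        # sigma_t squared
    item_var_sum = sum(var([row[i] for row in scores])   # sum of SD squared
                       for i in range(n_items))
    return (n_items / (n_items - 1)) * ((total_var - item_var_sum) / total_var)

# The 5-student x 5-item attitude data from the table above.
data = [
    [5, 5, 4, 4, 1],  # A
    [3, 4, 3, 3, 2],  # B
    [2, 5, 3, 3, 3],  # C
    [1, 4, 2, 3, 3],  # D
    [3, 3, 4, 4, 4],  # E
]
print(round(cronbach_alpha(data), 2))  # -> 0.11 (the text rounds to .10)
```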
Interitem Correlation
In interitem correlation, each item is correlated with every other item in the test. Notice that a
perfect correlation coefficient (1.00) is obtained when an item is correlated with itself. It can
also be noted that strong correlation coefficients were obtained between items 1 and 3 and items
1 and 4, indicating internal consistency. Some pairs had negative correlations, like items 1 and 5,
and items 2 and 5. A negative correlation means that as the scores on one item increase, the
scores on the other decrease.
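As an illustration, an inter-item correlation matrix can be computed by correlating every item with every other item. The sketch below reuses the five-item attitude data from the alpha example and requires Python 3.10 or later for statistics.correlation.

```python
from statistics import correlation  # Python 3.10+

# The same 5-student x 5-item attitude data as in the alpha example.
data = [
    [5, 5, 4, 4, 1],
    [3, 4, 3, 3, 2],
    [2, 5, 3, 3, 3],
    [1, 4, 2, 3, 3],
    [3, 3, 4, 4, 4],
]
items = list(zip(*data))  # transpose: one tuple of responses per item
for i, x in enumerate(items, start=1):
    # Each row pairs item i with every item, including itself (r = 1.0).
    row = [round(correlation(x, y), 2) for y in items]
    print(f"item{i}:", row)
```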
Interrater Reliability
When rating scales are used by judges, the responses can also be tested for consistency.
The concordance or consistency of the ratings is estimated by computing Kendall’s W
coefficient of concordance.
Suppose that the following thesis presentation ratings were obtained from three judges
for 5 groups who presented their theses. The ratings are on a scale of 1 to 4, where 4 is the
highest and 1 is the lowest.
The concordance among the three raters using Kendall’s W is computed by summating
the total ratings for each case (thesis presentation). The mean of the sums of ratings is obtained
(X̄ = 8.4). The mean is then subtracted from each sum of ratings to get the difference (D). Each
difference is squared (D²), and the sum of squares is computed (ΣD² = 33.2). These values can
now be substituted in the Kendall’s W formula, in which m is the number of raters and N is the
number of cases.
W = \frac{12\sum D^2}{m^2 N(N^2 - 1)}

W = \frac{12(33.2)}{3^2(5)(5^2 - 1)}

W = 0.37
A Kendall’s W coefficient of .37 estimates the agreement of the three raters on the 5 thesis
presentations. Given this value, there is moderate concordance among the three raters, because
the value is not very high.
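A minimal Python sketch of the Kendall’s W computation follows. The five summed ratings are hypothetical values chosen to be consistent with the aggregates reported in the text (mean = 8.4, ΣD² = 33.2), since the original rating table is not reproduced here.

```python
def kendalls_w(rating_sums, n_raters):
    """Kendall's coefficient of concordance computed from each case's
    summed ratings across raters."""
    n = len(rating_sums)
    mean = sum(rating_sums) / n
    ss_d = sum((s - mean) ** 2 for s in rating_sums)   # sum of D squared
    return (12 * ss_d) / (n_raters ** 2 * n * (n ** 2 - 1))

# Hypothetical sums of three judges' ratings for five theses, chosen to
# match the aggregates in the text (mean = 8.4, sum of D squared = 33.2).
sums = [12, 10, 9, 6, 5]
print(round(kendalls_w(sums, n_raters=3), 2))  # -> 0.37
```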
Summary on Reliability
Activity:
Test whether the typing test is reliable. The following are the scores of 15 participants on a
typing test; use test-retest reliability.
Activity 2
Administer the “Academic Self-regulation Scale” to at least 30 students, then obtain its internal
consistency using split-half, Cronbach’s alpha, and interitem correlation.
Self-regulation Scale
Instruction: The following items assess your learning and study strategy use. Read each item carefully and
RESPOND USING THE SCALE PROVIDED. Encircle the number that corresponds to your answer.
4: Always 3: Often 2: Rarely 1: Never
Before answering the items, please recall some typical situations of studying which you have experienced. Kindly
encircle the number showing how you practice the following items.
Further Analysis
1. Show the Cronbach’s alpha for each factor and indicate whether the responses are
internally consistent.
2. Split the test into two then indicate whether the responses are internally consistent.
3. Intercorrelate each item.
Lesson 2
Validity
Content Validity
Content validity is the systematic examination of the test content to determine whether it
covers a representative sample of the behavior domain to be measured. For affective measures, it
concerns whether the items are enough to manifest the behavior measured. For cognitive tests, it
concerns whether the items cover all the content specified in the instruction.
Content validity is more appropriate for cognitive tests like achievement tests and teacher-
made tests. In these types of tests, there is a specified domain to be included in the test. The
content covered is found in the instructional objectives in the lesson plan, syllabus, table of
specifications, and textbooks.
Content validity is established through consultation with experts. In the process, the
objectives of the instruction, the table of specifications, and the items of the test are shown to the
consulting experts. The experts check whether the items are enough to cover the content of the
instruction provided, whether the items measure the objectives set, and whether the items are
appropriate for the cognitive skill intended. The process also involves checking whether the
items are appropriately phrased for the level of students who will take the test and whether the
items are relevant to the subject area tested, and correcting them where needed.
Details on constructing a Table of Specifications are explained in the next chapters.
Criterion-Prediction Validity
Criterion-prediction involves prediction from the test to a criterion situation over a time
interval. For example, to assess the predictive validity of an entrance exam, it is correlated
later with the students’ grades after a trimester/semester. The criterion in this case is the
students’ grades, which will come in the future.
Criterion-prediction is used in hiring job applicants, selecting students for admission to
college, and assigning military personnel to occupational training programs. For selecting job
applicants, pre-employment tests are correlated with supervisor ratings obtained in the future. In
assigning military personnel to training, the aptitude test administered before training is
correlated with the future post-assessment in the training. Positive and high correlation
coefficients should be obtained in these cases to adequately say that the test has predictive
validity.
Generally, the analysis involves correlating the test score with another criterion measure;
an example is correlating mechanical aptitude with job performance as a machinist.
Construct Validity
Construct validity is the extent to which a test may be said to measure a theoretical
construct or trait. It is usually conducted for measures that are multidimensional or contain
several factors. The goal of construct validation is to explain and prove that the factors of the
measure hold as the underlying theory claims.
There are several methods for analyzing the constructs of a measure. One way is to
correlate a new test with a similar earlier test that measures approximately the same general
behavior. For example, a newly constructed measure of temperament is correlated with an
existing measure of temperament. If high correlations are obtained between the two measures, it
means that the two tests are measuring the same constructs or traits.
Another widely used technique to study the factor structure of a test is factor analysis,
which can be exploratory or confirmatory. Factor analysis is a mathematical technique for
identifying sources of variation among the constructs involved. These sources of variation are
usually called factors or components (as explained in chapter 1). Factor analysis reduces the
number of variables, detects the structure in the relationships between variables, and classifies
variables. A factor is a set of highly intercorrelated variables. When Principal Components
Analysis is used as the method of factor analysis, the process involves extracting the possible
groups that can be formed through the eigenvalues, which measure how much variance each
successive factor extracts. The first factor is generally more highly correlated with the variables
than the second factor. This is to be expected because the factors are extracted successively and
account for less and less variance overall. Factor extraction stops when factors begin to yield
low eigenvalues. An example of the extraction showing eigenvalues is illustrated below from the
study of Magno (2008), who developed a scale measuring parental closeness with 49 items in
which four factors were hypothesized (bonding, support, communication, interaction).
[Scree plot: “Plot of Eigenvalues,” showing each successive eigenvalue (y-axis, from 0 to about
16) against the number of eigenvalues (x-axis).]
The scree plot shows that 13 factors could be used to classify the 49 items, since the
number of factors can be determined by counting the eigenvalues that are greater than 1.00. But
having 13 factors is not good because it does not further reduce the variables. One technique in
the scree test is to find the place where the smooth decrease of eigenvalues appears to level off
to the right of the plot. To the right of this point, presumably, one finds only "factorial scree" -
"scree" is the geological term for the debris that collects on the lower part of a rocky slope.
Applying this technique, the fourth eigenvalue is where the smooth decrease begins in the graph.
Therefore, four factors can be considered for the test.
The items that belong under each factor are determined by assessing the factor
loadings of each item. Each item loads on each factor extracted. An item that loads highly on a
factor technically belongs to that factor because it is highly correlated with the other items in
that factor or group. A factor loading of .30 means that the item contributes meaningfully to the
factor; a factor loading of .40 means the item contributes highly to the factor. An example of a
table with factor loadings is illustrated below.
Item      Factor 1   Factor 2   Factor 3   Factor 4
item1 0.032 0.196 0.172 0.696
item2 0.13 0.094 0.315 0.375
item3 0.129 0.789 0.175 0.068
item4 0.373 0.352 0.35 0.042
item5 0.621 -0.042 0.251 0.249
item6 0.216 -0.059 0.067 0.782
item7 0.093 0.288 0.307 0.477
item8 0.111 0.764 0.113 0.085
item9 0.228 0.315 0.144 0.321
item10 0.543 0.113 0.306 -0.01
In the table above, items that load highly on a factor have loadings of .40 and above. For
example, item 1 loaded highly on factor 4, with a factor loading of .696, as compared with its
loadings of .032, .196, and .172 on factors 1, 2, and 3, respectively. This means that item 1 will
be classified under factor 4, together with item 6 and item 7, because they all load highly on the
fourth factor. Factor loadings are best assessed when the items are rotated (consult scaling
theory references for details on factor rotation).
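To illustrate where the eigenvalues in a scree plot come from, the following minimal Python sketch extracts eigenvalues from an item intercorrelation matrix and applies the Kaiser criterion (eigenvalues greater than 1.00). The 200-examinee by 20-item random data matrix is only a stand-in for real item responses; this is not the procedure used in the cited study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 200 examinees answering 20 five-point items at random.
responses = rng.integers(1, 6, size=(200, 20))

corr = np.corrcoef(responses, rowvar=False)   # 20 x 20 item correlations
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted, largest first

# Kaiser criterion: count the factors whose eigenvalues exceed 1.00,
# the same rule applied to the scree plot in the text.
n_factors = int(np.sum(eigenvalues > 1.0))
print(eigenvalues.round(2))
print("factors retained:", n_factors)
```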
Another way of proving the factor structure of a construct is through Confirmatory Factor
Analysis (CFA). In this technique, there is a developed and specific hypothesis about the
factorial structure of a battery of attributes. The hypothesis concerns the number of common
factors, their pattern of intercorrelation, and the pattern of common factor weights. It is used to
indicate how well a set of data fits the hypothesized structure. The CFA is done as a follow-up to
a standard factor analysis. In the analysis, the parameters of the model are estimated, and the
goodness of fit of the solution to the data is evaluated. For example, the study of Magno
(2008) confirmed the factor structure of parental closeness (bonding, support, communication,
succorance) after a series of principal components analyses. The parameter estimates and the
goodness of fit of the measurement model were then analyzed.
Figure 1
Measurement Model of Parental Closeness using Confirmatory Factor Analysis
The model estimates in the CFA show that all the factors of parental closeness have
significant parameters (8.69*, 5.08*, 5.04*, 1.04*). Delta errors are used (28.83*, 18.02*,
18.08*, 2.58*), and each factor has a significant estimate as well. Having a good fit reflects
having all factor structures significant for the construct parental closeness. The goodness of fit
using chi-square is rather good (χ² = 50.11, df = 2). The goodness of fit based on the root
mean square standardized residual (RMS = 0.072) shows relatively small error, the value being
close to zero. Using noncentrality fit indices, the values show that the four-factor solution has a
good fit for parental closeness (McDonald Noncentrality Index = 0.910, Population Gamma
Index = 0.914).
Confirmatory Factor Analysis can also be used to assess the best factor structure of a
construct. For example, Magno, Tangco, and Sy (2007) assessed the factor structure of
metacognition (awareness of one’s own learning) and its effect on critical thinking (measured
by the Watson-Glaser Critical Thinking Appraisal). Two factor structures of metacognition were
assessed. The first model of metacognition includes two factors: regulation of cognition and
knowledge of cognition (see Schraw and Dennison). The second model tested metacognition
with eight factors: declarative knowledge, procedural knowledge, conditional knowledge,
planning, information management, monitoring, debugging strategy, and evaluation of learning.
The results of the analysis using CFA showed that model 1 has a better fit than model 2.
This indicates that metacognition is better viewed with two factors (knowledge of cognition and
regulation of cognition) than with eight factors.
Principal Components Analysis and Confirmatory Factor Analysis can be conducted
using available statistical software such as Statistica and SPSS.
Convergent and Divergent Validity
According to Anastasi and Urbina (2002), the method of convergent and divergent
validity is used to show that a test correlates with the variables with which it should theoretically
correlate (convergent) and does not correlate with the variables from which it should differ
(divergent).
In convergent validity, the intercorrelations among constructs that the theory links
together should be high and positive. For example, in the study of Magno (2008) on parental
closeness, when the factors of parental closeness were intercorrelated (bonding, support,
communication, and succorance), positive magnitudes were obtained, indicating convergence of
these constructs.
For divergent validity, a construct should correlate inversely with its opposite factors. For
example, the study by Magno, Lynn, Lee, and Kho (in press) constructed a scale that measures
mothers’ involvement with their grade school and high school children. The factors of mothers’
involvement in school-related activities were intercorrelated. Observe that although these factors
belong to the same test, controlling was negatively related to permissive, and loving was
negatively related to autonomy. This indicates divergence of factors within the same measure.
Summary on Validity
EMPIRICAL REPORT
The Development of the Self-disclosure Scale

Carlo Magno
Sherwin Cuason
Christine Figueroa
De La Salle University-Manila

Abstract
The purpose of the present study is to develop a measure for self-disclosure. The items were
based on a survey administered to 83 college students. From the survey, 114 items were
constructed under 9 hypothesized factors. The items were reviewed by experts. The main try out
form of the test was composed of 112 items administered to 100 high school and college
students. The data analysis showed that the test has a Cronbach’s alpha of .91. The factor
loadings retained 60 items with high summated correlations under five factors. The new factors
are beliefs, relationships, personal matters, interests, and intimate feelings.

Each person has a complex personality system. Individuals are oftentimes very much
interested in knowing our personality type, attitudes, interests, aptitude, achievement and
intelligence. This is the reason why we should develop a psychological test that would help us
assess our standing. The test we have developed aims to measure the self-disclosing frequency
of individuals in different areas. This will help them know what areas in their lives they are
willing to let other people know. This would be a good instrument for counselors to use for the
assessment of their clients. The result of the client’s test would help the counselor adjust his or
her skills in eliciting or disclosing more or other areas or other topics.
Self-disclosure is a very important aspect in the counseling process, because self-disclosure
is one of the instruments the counselor can use. The consequence of the client not disclosing
himself is their inability to respond to their problem and to the counselor. This is what the
researchers took into consideration in developing the test. It could also be used outside the
counseling process. An individual may want to take it to find out what areas in his or her life
have been easy for them to shell out and what areas need more revelations.
It has always been psychologists’ concern to explain what is going on inside a particular
individual in relation to his entire system of personality. One important component of looking
into the intrinsic phenomenon of human behavior is self-disclosure. Self-disclosure as defined by
Sidney Jourard (1958) is the process of making the self known to other persons; “target persons”
are persons to whom information about the self is communicated. In the process of
self-disclosure we make ourselves manifest in thinking and feeling through our actions - actions
expressed verbally (Chelune, Skiffington, & Williams, 1981). In addition, Hartley (1993)
stressed the importance of interpersonal communication in disclosing the self. Hartley (1993)
defined self-disclosure as the means of opening up about oneself with other people. Moreover,
Norrel (1989) defined self-disclosure as the process by which persons make themselves known
to each other, occurring when an individual communicates genuine thoughts and feelings.
Generally, self-disclosure is the process in which a person is willing to share or open
oneself to another person or group whom the individual can trust. This process is done verbally.
The factors identified in self-disclosure, which are potent areas in the content of communicating
superficial or intimate topics, are (1) Personal matters, (2) Thoughts and ideas, (3) Religion,
(4) Work, study, and accomplishments, (5) Sex, (6) Interpersonal relationship, (7) Emotional
state, (8) Tastes, and (9) Problems.
The process of self-disclosure occurs during interaction with others (Chelune, Skiffington,
& Williams, 1981). In the studies that Jourard (1961; 1969) conducted, he stated that a person
will permit himself to be known when “he believes his audience is a man of goodwill.” There
should be a guarantee of privacy that the
new clusters: interpersonal relationship, attitudes, sex, and tastes. These clusters contain items on
sensitive information one withholds. The self-disclosure reports are only moderately reliable
(.62 to .72 for men and .51 to .78 for women).
In marital relationships, it was found that partners have greater self-disclosure and marital
satisfaction (Levinger & Senn, 1967; Jorgensen, 1980).
In the parent-child relationship, it was reported that there are no differences in the content
of the self-disclosure of Filipino adolescents with their mother and father (Cruz, Custodio, &
Del Fierro, 1996). The study also indicated that birth order is highly relevant in analyzing the
content of self-disclosure. The results of the study also show that children are more disclosing
toward their mothers because they empathize.

Sex. One of the most intimate topics as a content in self-disclosure is sex. It is usually
embarrassing and hard to open up to others because some people have the faulty learning that it
is evil, lustful, and dirty (Coleman, Butcher, & Carson, 1980). But mature individuals view
human sexuality as a way of being in the world of men and women whose moments of life and
every aspect of living are spent to experience being with the entire world in a distinctly male or
female way (Maningas, 1995). Furthermore, sexuality is part of our natural power or capacity to
relate to others. It gives the necessary qualities of sensitivity, warmth, mental respect in our
interpersonal relationships, and openness (Maningas, 1995). Sexuality, as part of our
relationships, needs to be opened up or expressed, as Freud noted of the desire of our instinct or
id. Maningas (1995) stressed that sex is an integral part of our personal self-expression and our
mission of self-communication to others. Some findings by Jourard (1964) on subject matter
differences noted that details about one’s sex life are not as disclosable as other factors. Jourard
(1964) also noted that anyone who is reluctant to be known by another person and to know
another person - sexually and cognitively - will find the prospect terrifying.
Sex as a factor in self-disclosure is included because most closely knitted adolescents give
a focal view on sex. The survey study that was conducted shows that 5.26% of males and 3.44%
of females disclose themselves regarding sexual matters.

Personal matters about the self. Personal matters consist of private truths about oneself;
they may be favorable or unfavorable evaluative reactions toward something or someone,
exhibited in one’s beliefs, feelings, or intended behavior.
In an experiment conducted by Taylor, Gould, and Brounstein (1981), they found that the
level of intimacy of the disclosure was determined by (1) dispositional characteristics,
(2) characteristics of subjects, and (3) the situation. Their personalistic hypothesis was confirmed
that the level of disclosure affects the level of intimacy. Some studies also show that some
individuals are more willing to disclose personal information about themselves to high-disclosing
rather than low-disclosing others (Jourard, 1959; Jourard & Landsman, 1960; Jourard &
Richman, 1963; Altman & Taylor, 1973). Furthermore, Jones and Archer (1976) showed directly
that the recipient’s attraction towards a discloser would be mediated by the personalistic
attribution the recipient makes for the discloser’s level of intimacy.
Kelly and McKillop (1996) in their article stated that “choosing to reveal personal secrets
is a complex decision that could have distorting consequences, such as being rejected and
alienated from the listener.” But Jourard (1971) noted that a healthy behavior feels “right” and
should produce growth and integrity. Thus, disclosing personal matters about oneself is a means
of being honest and seeking others to understand you better.

Emotional State. One of the factors of self-disclosure, defined as one’s revelation of
emotions or feelings to other people. A retrospective study was conducted to determine what
students did to make their developing romantic relationships known to social network members
and what they did to keep their relationships from becoming known. It is shown in this study that
the most frequent reasons for revelation were a felt obligation to reveal based on the relationship
with the target, the desire for emotional expression, and the desire for psychological support
from the target. The most frequent reason to withhold information was the anticipation of a
negative reaction from the target (Baxter, 1993). The researchers felt that the determination of
the probability of self-disclosure will be a lot better if emotional state is considered as a factor.
Emotions, disclosures & health addresses some of the basic issues of psychology and
psychotherapy: how people respond to emotional upheavals, why they respond the way they do,
and why translating emotional events into language increases physical and mental health
(Pennebaker, 1995).

Taste. Taste is defined as the likes and dislikes of a person opened up to other people. In a
study made by Rubin and Shenker (1975), they made a test studying the friendship, proximity,
and self-disclosure of college students in the contexts of being roommates or hallmates. The
items were categorized in four clusters, in what we thought would be ascending order of
intimacy - tastes, attitudes, interpersonal relationships, and self-concept and sex. This would help
us determine whether people are willing to share superficial information right away as well as
intimate information.

Thoughts. Thoughts are defined as the things in mind that one is willing to share with
other people. “A friend,” Emerson wrote, “is a person with whom I may be sincere. Before him I
may think aloud.” A large number of studies have documented the link between friendship and
the disclosure of personal thoughts and feelings that Emerson’s statement implies (Rubin &
Shenker, 1975). Another study presents a self-psychological rationale for the selected use of
therapist self-disclosure, the conscious sharing of thoughts, feelings, attitudes, or experiences
with a patient (Goldstein, 1994).

Religion. We operationally defined religion in self-disclosure as the ability of an
individual to share his experiences, thoughts, and emotions toward his beliefs about God. Healey
(1990) provided an overview of the role of self-disclosure in Judeo-Christian religious
experience with emphasis on the process of spiritual direction. In the study done by Kroger
(1994), he shows the Catholic confession as the embodiment of common sense regarding the
social management of personal secrets, of the sins committed, and considers confession as a
model for understanding the problem of the social transmission of personal secrets in everyday
life. Religion is very important and considered as a factor in self-disclosure because the Filipino
people are very religious, and study shows that religious people disclose more (Kroger, 1994).

Problems. When a person is depressed, he tends to find others who will listen and with
whom he can share the problem. To release the tension a person feels, he usually discloses it.
Clarity about a problem is attained when people start to verbalize it, and in the process a solution
can be reached. The study of Rime (1995) revealed that after major negative life events and
traumatic emotional episodes, ordinary emotions, too, are commonly accompanied by intrusive
memories and the need to talk about the episode. It also considered the hypothesis that such
mental rumination and social sharing would represent spontaneously initiated ways of processing
emotional information.

Work/Study. Work or study is defined as the person’s present duty or responsibility which
is expected of him and needs to be fulfilled in a given time. It is considered a factor in
self-disclosure because this will give a glimpse of how open a person can be in sharing his joy
and burden
Table 3
Factor Transformation Matrix

           Factor 1   Factor 2   Factor 3   Factor 4   Factor 5
Factor 1     .48       -.56       -.25       -.001      -.61
Factor 2     .45        .43       -.56       -.49        .19
Factor 3     .45        .55        .57        .09       -.39
Factor 4     .40       -.42        .49       -.34        .52
Factor 5     .41        .02       -.21        .79        .39

The five new factors were given new names because their contents were different. Factor 1 was labeled Beliefs with 11 items, Factor 2 was labeled Relationships with 13 items, Factor 3 was labeled Personal Matters with 13 items, Factor 4 was labeled Intimate Feelings with 13 items, and Factor 5 was labeled Interests with 10 items.

Table 4
New Table of Specifications

Factor                        Number of items   Item numbers                                          Reliability
Factor 1: Beliefs                   11          8, 101, 18, 20, 33, 52, 59, 70, 77, 98, 3               .8031
Factor 2: Relationships             13          105, 15, 21, 24, 31, 41, 48, 55, 61, 63, 79, 84, 88     .7696
Factor 3: Personal Matters          13          11, 111, 53, 65, 66, 68, 75, 76, 93, 94, 95, 96, 99     .7962
Factor 4: Intimate Feelings         13          1, 6, 26, 27, 28, 32, 34, 35, 39, 43, 72, 73, 78        .7922
Factor 5: Interests                 10          10, 100, 104, 17, 56, 60, 62, 69, 82, 83                .7979

Discussion

At first, there were nine hypothesized factors based on a survey. Eighteen factors were then extracted with eigenvalues greater than 1.00. Finally, there was a final set of five factors with acceptable factor loadings. The five factors have new labels because the items were rotated differently based on the data from the main tryout. Factor 1 contains items about beliefs on religion and ideas on a particular topic, and it is labeled as such. Factor 2 contains items reflecting relationships with friends, and it was labeled "Relationships." Factor 3 contains items about a person's secrets and attitudes; most of the items contain personal matters, and it was labeled as such. Factor 4 is a cluster of taste and perceptions, so it was labeled Interests. Factor 5 contains feelings about oneself, problems, love, success, and frustrations, so it was labeled Intimate Feelings. The factors were reliable given their alphas of .8031, .7696, .7962, .7922, and .7979. This shows that each factor is consistent with the intended purpose of the researchers. In the result of the factor analysis, the items were not equal in each factor: Factor 1 has 11 items, Factor 2 has 13 items, Factor 3 has 13 items, Factor 4 has 10 items, and Factor 5 has 13 items. The five factors account for the areas in which a particular individual self-discloses.

There were nine hypothesized factors; all of these were disproved, and new factors arrived after the factor analysis. The items were reclassified in every factor and given new names. Only five factors were accepted following the four percent rating of the eigenvalue. These factors are Beliefs, Relationships, Interests, Personal Matters, and Intimate Feelings. The test we developed was intended to measure the degree of self-disclosure of individuals, but it was refocused to measure the self-disclosure each person makes in each of the different areas or factors.
Chelune, G. J., Skiffington, S., & Williams, C. (1981). Multidimensional analysis of observers' perceptions of self-disclosing behavior. Journal of Personality and Social Psychology, 41(3), 599-606.

Coleman, C., Butcher, A., & Carson, C. (1980). Abnormal psychology and modern life (6th ed.). New York: JMC.

Cozby, P. C. (1973). Self-disclosure: A literature review. Psychological Bulletin, 79(2), 73-91.

Goldstein, J. H. (1994). Toys, play, and child development. New York, NY: Cambridge University Press.

Hartley, P. (1993). Interpersonal communication. Florence, KY: Taylor & Francis/Routledge.

Healey, B. J. (1990). Self-disclosure in religious spiritual direction: Antecedents and parallels to …

Jourard, S. M. (1959). Healthy personality and self-disclosure. Mental Hygiene, 43, 499-507.

Jourard, S. M. (1961). Religious denomination and self-disclosure. Psychological Reports, 8, 446.

Jourard, S. M., & Jaffe, P. E. (1970). Influence of an interviewer's disclosure on the self-disclosing behavior of interviewees. Journal of Counseling Psychology, 17(3), 252-257.

Jourard, S. M., & Landsman, M. J. (1960). Cognition, cathexis, and the dyadic effect in men's self-disclosing behavior. Merrill-Palmer Quarterly, 6, 178-185.

Jourard, S. M., & Rubin, J. E. (1968). Self-disclosure and touching: A study of two modes of interpersonal encounter and their inter-relation. Journal of Humanistic Psychology, 8(1), 39-48.
Exercise
Give the best type of reliability and validity to use in the following cases.
___________________5. The scores on the depression diagnostic scale were correlated with the
Minnesota Multiphasic Personality Inventory (MMPI). It was found that clients who are
diagnosed as depressive have high scores on the factors of the MMPI.
___________________6. Mike's mental ability scores, taken during his fourth year of high
school, were used to determine whether he is qualified to enter the college where he wants to
study.
___________________7. Maria, who went for drug rehabilitation, was assessed using the self-
concept test, and her records from the company where she worked, which contained her previous
security scale scores, were requested. The two sets of scores were compared.
___________________8. Mrs. Ocampo, a math teacher, constructs a table of specifications before
preparing her test, and after the items are written, the test is checked by her subject area coordinator.
___________________13. In a battery of tests, the section A class received both the Strong
Vocational Interest Blank (SVIB) and the Jackson Vocational Interest Survey (JVIS). Both are
measures of vocational interest, and the scores are correlated to determine if they measure the
same construct.
___________________14. The Work Values Inventory (WVI) was separated into two forms, and
two sets of scores were generated. The two sets of scores were correlated to see if they measure
the same construct.
___________________17. When the EPPS items were presented in a free choice format, the
scores correlated quite highly with the scores obtained with the regular forced-choice form of
the test.
___________________18. The two forms of the MMPI (the Form F and Form K scales) were
correlated to detect faking or response sets.
Lesson 3
Item Analysis: Item Difficulty and Item Discrimination
Students are usually keen to judge whether an item is difficult or easy and whether a test is good
or bad based on their own judgment. How easy or difficult a test item is, is referred to as item
difficulty, and how well an item separates good from poor test takers is referred to as item
discrimination. Identifying a test item's difficulty and discrimination is referred to as item
analysis. Two approaches to item analysis will be presented in this chapter: Classical Test
Theory (CTT) and Item Response Theory (IRT). A detailed discussion of the difference between
CTT and IRT is found at the end of Lesson 3.
Regarded as the "True Score Theory." Responses of examinees are due only to variation in the ability of
interest. All other potential sources of variation existing in the testing materials, such as external
conditions or internal conditions of examinees, are assumed either to be constant through rigorous
standardization or to have an effect that is nonsystematic or random by nature. The focus of CTT is the
frequency of correct responses (to indicate question difficulty); the frequency of responses (to examine
distracters); and the reliability of the test and item-total correlation (to evaluate discrimination at the item
level).
Synonymous with latent trait theory, strong true score theory, or modern mental test theory. It is most
applicable to tests with right and wrong (dichotomous) responses. It is an approach to testing based
on item analysis that considers the chance of getting particular items right or wrong. In IRT, each item on a
test has its own item characteristic curve that describes the probability of getting the particular item
right or wrong given the ability of the test takers (Kaplan & Saccuzzo, 1997).
Item difficulty is the percentage of examinees responding correctly to each item in the
test. Generally, an item is difficult if a large percentage of the test takers are not able to
answer it correctly. On the other hand, an item is easy if a large percentage of the test takers are
able to answer it correctly (Payne, 1992).
2. Identify the high- and low-scoring groups by getting the upper 27% and the lower 27%. For
example, if there are 20 test takers, 27% of the test takers is 5.4; rounding it off gives 5 test
takers. This means that the top 5 (high-scoring) and the bottom 5 (low-scoring) test takers will be
included in the item analysis.
3. Tabulate the correct and incorrect responses of the high and low test takers for each item. For
example, in the table below there are 5 test takers in the high group (test takers 1 to 5) and 5 test
takers in the low group (test takers 6 to 10). Test takers 1 and 2 in the high group got a correct
response for items 1 to 5. Test taker 3 was wrong in item 5, marked as "0."

4. Get the total correct responses for each item and convert the total into a proportion. The proportion is
obtained by dividing the total correct responses for each item by the total number of test takers in
the group. For example, in item 2, 4 is the total correct response, and dividing it by 5, the total
number of test takers in the high group, gives a proportion of .8. The procedure is done for both the high
and the low group.
5. Obtain the item difficulty by adding the proportion of the high group (pH) and the proportion of
the low group (pL) and dividing by 2 for each item.

Item difficulty = (pH + pL) / 2
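Using item 2 from the example above, where the proportions correct are pH = .80 and pL = .60, the index of difficulty is (.80 + .60) / 2 = .70, which falls in the average range.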
The table below is used to interpret the index of difficulty. Given the table below, items 1 and 2
are easy items because they have high correct-response proportions for both the high and low groups.
Items 3 and 4 are average items because the proportions are within the .25 to .75 middle bound.
Item 5 is a difficult item, considering that the proportions correct are low for both the high and low
groups. In the case of item 5, only 40% were able to answer in the high group and none got it
correct in the low group (0). Generally, the closer the index of difficulty gets to "0," the more
difficult the item; the closer it gets to "1," the easier it becomes.
6. Obtain the item discrimination by getting the difference between the proportion of the high
group and the proportion of the low group for each item.

Item discrimination = pH − pL
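For item 2 again, the index of discrimination is .80 − .60 = .20; as discussed below, this small difference makes the item only reasonably good.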
The table below is used to interpret the index of discrimination. Generally, the larger the difference
between the proportions of the high and low groups, the better the item, because a large gap shows
that many more of the high group than of the low group answered correctly, as shown by items 1, 3, 4, and
5. In the case of item 2, a large proportion of the low group (60%) got the item correct as
contrasted with the high group (80%), resulting in a small difference (20%) that makes the item
only reasonably good.
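The steps above are easy to script. Below is a minimal sketch (not from the book) that computes both indices from a scored response matrix; the function name and data layout are assumptions for illustration.

```python
def item_analysis(responses):
    """Item difficulty and discrimination from a scored response matrix
    (one row per test taker, 1 = correct, 0 = incorrect)."""
    n = len(responses)
    k = max(1, round(n * 0.27))  # size of the upper and lower 27% groups
    # Rank test takers by total score, highest first
    ranked = sorted(responses, key=sum, reverse=True)
    high, low = ranked[:k], ranked[-k:]
    results = []
    for item in range(len(responses[0])):
        p_high = sum(row[item] for row in high) / k  # proportion correct, high group
        p_low = sum(row[item] for row in low) / k    # proportion correct, low group
        difficulty = (p_high + p_low) / 2            # index of difficulty
        discrimination = p_high - p_low              # index of discrimination
        results.append((difficulty, discrimination))
    return results
```

For 20 test takers, `round(20 * 0.27)` gives the 5-person groups used in the worked example above.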
What cognitive skill is demonstrated in the objective “Students will compose a five paragraph
essay about their reflection on modern day heroes”?
a. Understanding
b. Evaluating
c. Applying
d. Creating
Correct answer: d
The distracters in the given item are all cognitive skills in Bloom's revised taxonomy; each is a
possible answer, but there is one best answer. In analyzing whether the distracters are
effective, the frequency of examinees selecting each option is reported.
For the given item with the correct answer of letter d, the majority of the examinees in both the
high and low groups preferred option "d," which is the correct answer. Among the high group,
distracters a, b, and c are not effective because very few examinees selected them. For the low
group, option "b" can be considered an effective distracter because 40% (6 examinees) selected it
as their answer, as opposed to the 47% (7 examinees) who got the correct answer. In this case,
distracters "a" and "c" need some revision to bring them closer to the answer and make them
more attractive to test takers.
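A distracter analysis is just a tally of option choices within each group. The sketch below is an illustration; the response strings are invented, chosen only to reproduce the low-group percentages reported above.

```python
from collections import Counter

def distracter_analysis(choices_high, choices_low, options="abcd"):
    """Tally how often each option was chosen in the high and low groups."""
    high, low = Counter(choices_high), Counter(choices_low)
    for option in options:
        print(f"option {option}: high {high[option]:2d}  low {low[option]:2d}")

# 15 examinees per group; 'd' is keyed as correct
distracter_analysis(list("ddddabdddddddcd"), list("dbbddbbadcbbddd"))
```

In this invented data, 7 of 15 (47%) in the low group chose "d" and 6 of 15 (40%) chose "b," matching the pattern described in the text.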
EMPIRICAL REPORT
Construction and Development of a Test Instrument for Grade 3 Social Studies

Carlo Magno

Abstract

This study investigated the psychometric properties and item analysis of a one-unit test in geography for grade three students. The skills and contents of the test were based on the contents covered in the first quarter as indicated in the syllabus. A table of specifications was constructed to frame the items into three cognitive skills: knowledge, comprehension, and application. The test has a total of 40 items across 10 different test types. The items were reviewed by a social studies teacher and an academic coordinator. The split-half reliability was used, and a correlation of .3 was obtained. The test types were correlated with one another, resulting in both low and high coefficients. The item analysis showed that most of the items turned out to be easy and most are good items.

The purpose of this study is to construct and analyze the items of a one-unit geography test for grade three students. The test basically measures grade three students' achievement in Philippine Geography for the first quarter and served as a quarterly test. The test, when standardized through validation and reliability, would be used as a future achievement test in Philippine Geography.

There is a need to construct and standardize a particular achievement test in Philippine Geography since none is yet available locally.

The test is in the Filipino language because of the nature of the subject. The subject covers topics on (1) Kapuluan ng Pilipinas; (2) Malalaki at Maliliit na Pulo ng Bansa; (3) Mapa at Uri ng Mapa; (4) Mga Direksyon; (5) Anyong Lupa at Anyong Tubig; (6) Simbolong Ginagamit sa Mapa; (7) Panahon at Klima; (8) Mga Salik na may Kinalaman sa Klima; (9) Mga Pangunahing Hanapbuhay sa Bansa; (10) Pag-aangkop sa Kapaligiran. The topics were based upon the lessons provided by the Elementary Learning Competence from the Department of Education.

The test aims for the students to: (1) identify the important concepts and definitions; (2) comprehend and explain the reasons for given situations and phenomena; and (3) use and analyze different kinds of maps in identifying important symbols and familiarity with places.

Method

Search for Skills and Content Domain

The skills and contents of the test were identified based on the topics covered for grade three students in the first quarter. The test is intended to be administered as the first quarter exam. The skills intended for the first quarter's topics include identifying concepts and terms, comprehending explanations, applying principles to situations, using and analyzing maps, synthesizing different explanations for a particular event, and evaluating the truthfulness and validity of reasons and statements through inference.

In constructing the test, a table of specifications was first constructed to plan out the distribution of items for each topic and the objectives to be attained by the students.
Table 1
Table of Specifications

Topic                                                      Number of Items
…
Mga Direksyon                                                    6
Anyong Lupa at Anyong Tubig                                      5
Simbolong Ginagamit sa Mapa                                      4
Panahon at Klima                                                 5
Mga Salik na may Kinalaman sa Klima                              2
Mga Pangunahing Hanapbuhay ng Bansa                              3
Pag-aangkop sa Kapaligiran                                       3
Total (Knowledge 11, Comprehension 16, Application 13)          40
concordance was used in order to determine the inter-rater reliability of the essay type of test. There were two judges who evaluated the essay part of the test and used criteria to score it.

Results and Discussion

Reliability

The test's reliability was generated through the split-half method by correlating the odd-numbered and even-numbered items. The resulting internal consistency is 0.3, which is low. The low correlation between the odd- and even-numbered items can be accounted for by the different topic contents within the 40-item test. It would have been more appropriate to construct a large pool of items for the 10 content topics or factors that the test has, but 40 items is the school's usual standard for the quarterly test. The test was administered for the purpose of the quarterly test because the usability of the test was considered. With this type of measure, only the reliability of half of the test can be accounted for; this explains the low value of the correlation coefficient. The split-half coefficient is then transformed into a Spearman-Brown coefficient since the correlation is only for half of the test. The resulting Spearman-Brown coefficient is 0.46, which means that the items have a moderate relationship. Also, it is a rule of thumb that there should be at least 30 pairs of scores to be correlated, but in this case only 18 pairs of scores were correlated. The last item was not included since it had no matched item to be correlated with, and the other items were essay type, which was subjected to a different analysis. The low coefficient of internal consistency can also be accounted for by the various types of tests used, and thus by the variation and differences in the performance of the respondents. In other words, the respondents may respond and perform differently for each type of test.
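The step from the half-test correlation to the full-length estimate uses the standard Spearman-Brown prophecy formula for a test of doubled length, which reproduces the coefficient reported above:

$$ r_{\text{full}} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}} = \frac{2(0.30)}{1 + 0.30} \approx 0.46 $$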
The nature of the test cannot be measured by its general homogeneity since the test contains several topics and several types of response formats; respondents perform differently on different types of test. The test has 10 types measuring different skills, such as identifying the important concepts and definitions, comprehending and explaining the reasons for given situations and phenomena, and using and analyzing different kinds of maps in identifying important symbols and familiarity with places.

The dilemma is that the content domains included in the test are part of a general topic on Philippine geography. To test the internal consistency among the nine different contents, a correlation matrix was computed.
Table 2
Intercorrelation of the Nine Subtests

        I      II     III    IV     V      VI     VII    VIII   IX
I       --
II     -.13    --
III     .98*   .99*   --
IV      .18   -.81*  -.48*   --
V      -.21   -.42    .47*  -.19    --
VI      .19    .58*   .47*   .60   -.65*   --
VII    -.73    .28    .41*  -.56*   .73*  -.24    --
VIII    .07   -.19   -.47*   .96*   .08   -.80   -.25    --
IX      .85*  -.58*   .48*   .15    .97*  -.52*  -.52*   .28    --
There is a high relationship between test I and test IX: the higher the scores on the identification of concepts, the higher the scores on the comprehension of the weather map. Also, a high relationship existed between test V and test IX: the higher the scores on the interpretation of a physical map, the higher the scores on the interpretation of the weather map. There is also a high relationship between test IV and test VIII: the higher the scores on the inference about the Philippine islands, the higher the scores on the comprehension of weather. Generally, the inter-correlations among the contents showed pretty crude results because there were few items and the items for each type of test were not equal. The pairing in the computation was done based on the minimum number of items for each test type.

Item Difficulty and Index Discrimination

To evaluate the quality of each type of item in the test, item analysis was done by determining each item's difficulty and index of discrimination. The proportion of examinees getting each item correct was evaluated according to the scale below.

Difficulty Index    Remark
.76 or higher       Easy Item
.25 to .75          Average Item
.24 or lower        Difficult Item
Source: Lamberte, B. (1998). Determining the Scientific Usefulness of Classroom Achievement Test. Cutting Edge Seminar. De La Salle University.

The items' difficulty and discrimination index values are indicated in Table 3. The difficulty indices show a pattern: 67.6% of the items are easy and 32.43% of the test is on the average scale. Considering that the test was constructed for grade three students, the teacher was pitching it at the level of the students' capacity and ability. But it may also mean that the students gained such mastery of the subject matter that most of them were able to answer correctly. It should be noted that the easiness or difficulty of the items is dictated by the proportion of the students who answered the item correctly. In this case, most of the respondents got the answers, which is why most of the items turned out to be easy. In general, the test was fairly easy, since most of the item difficulty indices turned out to be .76 and above.

There were 27% of items considered poor. These items were rejected since most scores are in the high range of the low group, and some scores of the low group are near the scores of the high group who answered correctly. Considering the poor items, such as items 2, 4, 9, 13, 15, 30, 31, 32, 33, and 34, the pattern is indicative. There are very few marginal items that are subject to improvement: only 8% (3 items) are remarked as marginal, since the scores of the low group and the high group are almost the same. This means that both the high and the low group can answer these items fairly. 21.6% (8 items) of the items are reasonably good, since there is enough interval between the high and low groups. There are also a few items remarked as good items and enough to be considered very good items: 16.21% of the items are good items and 24.3% are very good items. There is a pattern of a wide distance of scores between the high group and the low group.

Interrater Reliability

The coefficient of concordance was used to determine the degree of agreement between the two raters who judged the essay type in the test. The essay type basically measures the students' knowledge of the adaptation of farmers in farming. The criteria used for rating the essay are: (a) at least 2 answers are correct (1.5 pts); (b) the answer was explained (1 pt); and (c) the instruction on answering was followed (0.5 pt). A high value of W, 0.74, was computed, indicating close concordance between the raters. This means that the two raters showed small variation in rating the answers to the essay. The small error variance can be accounted for by the difference in the dispositions of the two raters. The first rater
was the actual teacher of the subject, while the second rater was also an Araling Panlipunan teacher but taught at a higher level. There was a difference in how they viewed the answers even though they talked about the rating procedure at the start.
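The coefficient of concordance referred to here is Kendall's W. As a minimal sketch (not from the report; the ranks below are invented for illustration), it can be computed from the raters' ranks as follows:

```python
def kendalls_w(ratings):
    """Kendall's coefficient of concordance W for m raters ranking n subjects.
    ratings: list of m lists, each holding one rater's ranks for the n subjects.
    W = 1 means perfect agreement; W = 0 means none."""
    m, n = len(ratings), len(ratings[0])
    # Sum of ranks each subject received across raters
    rank_sums = [sum(r[j] for r in ratings) for j in range(n)]
    mean_rank_sum = sum(rank_sums) / n
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Two raters ranking five essays (1 = best)
print(kendalls_w([[1, 2, 3, 4, 5], [2, 1, 3, 5, 4]]))  # 0.9, close concordance
```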
Conclusion

A low internal consistency was generated due to the different subject contents in the test, and each test type measures different skills. These two factors affected the internal consistency of the test. It is indeed difficult to make the test entirely uniform since the subject contents are required as minimum learning competencies by the Department of Education. Also, the listed subject contents are the planned focus of the school's subject matter budgeting for the first quarter. A correlation analysis was performed to observe the relationships among the test types. It was found that the higher the scores on the interpretation of a physical map, the higher the scores on the interpretation of the weather map; and the higher the scores on the inference about the Philippine Islands, the higher the scores on the comprehension of topics about weather. A high correlation coefficient was found between these types, although the results may not be too accurate since the basis for the matrix comparison did not have equal numbers of items and only the minimum number of items was subjected to the analysis. It is recommended that an equal number of items for each test type be constructed to produce a more accurate result in the analysis. Agreement between the two raters for the essay type was not perfect, since they had different perceptions in giving points for the answers. The item difficulty showed that most of the items are easy, since the students had gained mastery of the subject matter. The index of discrimination showed that the items are distributed according to their power: there are almost equal numbers of items that are poor (27%), marginal (8%), reasonably good (22%), good (16%), and very good (24%).
Item Response Theory: Obtaining Item Difficulty Using the Rasch Model
It is said that IRT is an approach to testing based on item analysis that considers the
chance of getting particular items right or wrong. In IRT, each item on a test has its own item
characteristic curve that describes the probability of getting the particular item right or wrong
given the ability of the test takers (Kaplan & Saccuzzo, 1997). This will be made concrete in the
computational procedure later in this section.
In using the Rasch model as an approach for determining item difficulty, the calibration
of test item difficulty is independent of the persons used for the calibration, unlike in the classical
test theory approach, where it is dependent on the group. In this method of test calibration, it does not
matter whose responses to the items are used for comparison: it gives the same results regardless of
who takes the test. The scores persons obtain on the test can be used to remove the influence of
their abilities from the estimation of item difficulty. Thus, the result is a sample-free item
calibration.
Rasch’s (1960), the proponent who derived the technique, intended to eliminate
references to populations of examinees in analyses of tests unlike in classical test theory where
norms are used to interpret test scores. According to him that test analysis would only be
worthwhile if it were individual centered with separate parameters for the items and the
examinees (van der Linden & Hambleton, 2004).
The Rasch model is a probabilistic unidimensional model which asserts that: (1) The
easier the question the more likely the student will respond correctly to it, and (2) the more able
the student, the more likely he/she will pass the question compared to a less able student. When
the data fit the Rasch model, the relative difficulties of the questions are independent of the
relative abilities of the students, and vice versa (Rasch, 1977).
As shown in the graph below (Figure 1), a function of ability (θ), which is a latent trait,
forms the boundary between the probability areas of answering an item incorrectly and
answering it correctly.
Figure 1
Item Characteristic Curves of an 18-Item Mathematical Problem Solving Test
In the item characteristic curve, the y-axis represents the probability of a correct response given
ability (θ), and the x-axis is the range of item difficulties in log units. It can be noticed that items 1, 7,
14, 2, 8, and 15 do not require high ability to be answered correctly, as compared to items 5, 12, 18,
and 11, which require high ability. The item characteristic curves are judged at the 50% probability
level with a cutoff of "0" on item difficulty. The curves on the left side of the "0" item difficulty, as
marked at the 50% level, are easy items, and the ones on the right side are difficult items. The program
WINSTEPS was used to produce the curves.
The IRT Rasch model basically identifies the location of a person's ability within a set of
items for a given test. The test items have a predefined set of difficulties, and the person's position
should reflect that his ability is matched with the difficulty of the items. The ability of the person
is symbolized by θ and the items by δ. In the figure below, there are 10 items (δ1 to
δ10), and the location of the person's ability (θ) is between δ7 and δ8. In the continuum, the
items are arranged from the easiest (at the left) to the most difficult (at the right). If the
position of the person's ability is between δ7 and δ8, then it is expected that the person taking the
test should be able to answer items δ1 to δ7 ("1" correct response, "0" incorrect response), since
these items are answerable given the level of ability of the person. This kind of calibration is said
to fit the Rasch model, where the position of the person's ability lies within a defined line of item
difficulties.
Case 1
In Case 2, the person is able to answer four difficult items but unable to respond correctly to
the easy items. There is now difficulty in locating the person on the continuum. If the items are
valid measures of ability, then the easy items should be more answerable than the difficult ones. This
means that the items are not suited to the person's ability. This case does not fit the Rasch model.
Case 2
The Rasch model allows person ability (θ) to be estimated from the scores on the test and
item difficulty (δ) from the item totals, separately; that is why it is considered to be test-free
and sample-free.
In different cases, it may happen that the person's ability (θ) is higher
than the specified item difficulty (δ), so their difference (θ − δ) is greater than zero. But when the
ability (θ) is less than the specified item difficulty (δ), their difference (θ − δ) is less
than 0, as in Case 2. When the ability of the person (θ) is equivalent to the item's difficulty (δ),
the difference (θ − δ) is 0, as in Case 1. This variation in person responses and item difficulty is
represented in an Item Characteristic Curve (ICC), which shows the way the item elicits responses
from persons of every ability (Wright & Stone, 1979).
Figure 2
ICC of a Given Ability and Item Difficulty
An estimate of the response x is obtained when a person with ability (θ) acts on an item with
difficulty (δ). The model specifies that, in the interaction between ability (θ) and item
difficulty (δ), when the ability is greater than the difficulty, the probability of getting the correct
answer is more than .5 or 50%. When the ability is less than the difficulty, the probability of
getting the correct answer is less than .5 or 50%. The variation of these estimates of the
probability of getting a correct response is illustrated in Figure 2. The mathematical units for θ
and δ are defined in logistic functions (ln) to produce a linear scale and generality of measure.
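Written out, the relationship just described is the standard dichotomous Rasch model (the formula itself is not printed in the text above, but this is its usual form):

$$ P(x = 1 \mid \theta, \delta) = \frac{e^{\theta - \delta}}{1 + e^{\theta - \delta}} $$

so that θ = δ gives a probability of exactly .5, θ > δ gives more than .5, and θ < δ gives less than .5.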
The next section guides you in estimating the calibration of item difficulty and person
ability measure.
The Rasch model will be used on the responses of 10 students to a 25-item problem
solving test. In determining item difficulty with the Rasch model, all participants who took the
test are included, unlike in classical test theory, where only the upper and lower 27% are
included in the analysis.
ITEM NUMBER
Examinees 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 total
9 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 20
10 1 1 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 13
5 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1 11
3 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 0 0 1 10
8 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 10
1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 9
6 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 9
7 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 9
4 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 8
2 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 7
Total 5 3 6 1 3 4 8 5 2 4 3 3 6 5 6 1 4 2 6 5 2 5 5 2 10
1. Code each score for each item as "1" for a right answer and "0" for a wrong answer.
2. Arrange the scores (persons) from highest to lowest.
3. Remove items where all the respondents got a correct answer.
4. Remove items where all the respondents got a wrong answer.
5. Rearrange the scores (persons) from highest to lowest.
6. Group the items with similar total item scores (si).
7. Indicate the frequency (fi) of items for each group of items.
8. Divide each total item score (si) by the number of examinees (N) to get the proportion correct: ρi = si / N
9. Subtract the proportion correct from 1 to get the proportion incorrect: 1 − ρi
10. Divide the proportion incorrect by the proportion correct and take the natural log (using a scientific calculator) to get the logit incorrect: xi = ln[(1 − ρi) / ρi]
11. Multiply the frequency (fi) by the logit incorrect (xi): fixi
12. Square xi and multiply by each fi: fixi²
13. Compute the value of x̄: x̄ = Σfixi / Σfi
14. To get the initial item calibration (doi), subtract x̄ from the logit incorrect (xi): doi = xi − x̄
15. Estimate the value of U, which will be used later in the final estimates:

U = [Σfi(xi)² − (Σfi)(x̄)²] / (Σfi − 1)
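As a check on steps 8-14, take score group 2 (items 3, 13, 15, and 19, each with si = 6): ρi = 6/10 = .6, the proportion incorrect is .4, xi = ln(.4/.6) = −0.41, and doi = −0.41 − 0.48 = −0.89, matching the row in Table 1 below.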
Table 1
Grouped Distribution of the 7 Different Item Scores of 10 Examinees

Item score  Item name              Item score  Item freq.  Prop.    Prop.      Logit      Freq. ×  Freq. ×  Initial item
group                              (si)        (fi)        correct  incorrect  incorrect  logit    logit²   calibration (doi)
1           7                      8           1           0.8      0.2        -1.39      -1.39    1.92     -1.87
2           3, 13, 15, 19          6           4           0.6      0.4        -0.41      -1.62    0.66     -0.89
3           1, 8, 14, 20, 22, 23   5           6           0.5      0.5         0.00       0.00    0.00     -0.48
4           6, 10, 17              4           3           0.4      0.6         0.41       1.22    0.49     -0.07
5           2, 5, 11, 12           3           4           0.3      0.7         0.85       3.39    2.87      0.37
6           9, 18, 21, 24          2           4           0.2      0.8         1.39       5.55    7.69      0.91
7           4, 16                  1           2           0.1      0.9         2.20       4.39    9.66      1.72
                                               Σfi = 24                                   Σfixi    Σfi(xi)²
                                                                                          = 11.54  = 23.29

x̄ = Σfixi / Σfi = 11.54 / 24 = 0.48
16. For the person measures, list the possible scores (r) out of the total number of items (L).
17. Count the number of persons at each possible score: person frequency (nr)
18. Divide each possible score by the total number of items to get the proportion correct: ρr = r / L
19. Obtain the proportion incorrect by subtracting the proportion correct from 1: 1 − ρr
20. Determine the logit correct (yr) from the quotient of the proportion correct (ρr) and the proportion incorrect (1 − ρr): yr = ln[ρr / (1 − ρr)]
21. Multiply the logit correct (yr) by each person frequency (nr): nryr
22. Square the logit correct (yr²) and multiply by the person frequency (nr): nr(yr)²
23. The logit correct (yr) is the initial person measure: bro = yr
24. Compute the values of ȳ and V, to be used later in the final estimates.
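For example, for a score of r = 7 out of L = 24: ρr = 7/24 = 0.29 and yr = ln(.29/.71) = −0.89, which is the first row of Table 2 below.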
Table 2
Grouped Distribution of Observed Examinee Scores on the 24-Item Mathematical Problem Solving Test

Possible  Person      Proportion  Logit      Frequency ×   Frequency ×   Initial person
score (r) freq. (nr)  correct (ρr) correct (yr) logit (nryr) logit² (nr(yr)²) measure (bro = yr)
7         1           0.29        -0.89      -0.89         0.79          -0.89
8         1           0.33        -0.69      -0.69         0.48          -0.69
9         3           0.38        -0.51      -1.53         0.78          -0.51
10        2           0.42        -0.34      -0.67         0.23          -0.34
11        1           0.46        -0.17      -0.17         0.03          -0.17
12        0           0.50         0.00       0.00         0.00           0.00
13        1           0.54         0.17       0.17         0.03           0.17
14        0           0.58         0.34       0.00         0.00           0.34
15        0           0.63         0.51       0.00         0.00           0.51
16        0           0.67         0.69       0.00         0.00           0.69
17        0           0.71         0.89       0.00         0.00           0.89
18        0           0.75         1.10       0.00         0.00           1.10
19        0           0.79         1.34       0.00         0.00           1.34
20        1           0.83         1.61       1.61         2.59           1.61
          Σnr = 10                            Σnryr         Σnr(yr)²
                                              = -2.18       = 4.92

ȳ = Σnryr / Σnr = −2.18 / 10 = −0.22
V = [Σnr(yr)² − (Σnr)(ȳ)²] / (Σnr − 1) = 0.49

The item expansion factor (Y) is then obtained from U and V:

Y = √[(1 + V/2.89) / (1 − UV/8.35)] = √[(1 + 0.49/2.89) / (1 − (0.77)(0.49)/8.35)] = 1.11

Each initial item calibration is multiplied by Y to obtain the corrected item calibration, di = Y·doi, with standard error:

SE(di) = Y √[N / (si(N − si))]
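A note on the constants: these formulas follow Wright and Stone's (1979) PROX procedure, in which 2.89 = 1.7² and 8.35 ≈ 1.7⁴, with 1.7 being the usual scaling factor that aligns the logistic and normal ogives.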
Table 3
Final Estimates of Item Difficulties from 10 Examinees

Item score  Item   Initial item        Expansion    Corrected item          Item score  Calibration standard
group (i)   name   calibration (doi)   factor (Y)   calibration (di = Ydoi) (si)        error SE(di)
1           7      -1.87               1.11         -2.07                   8           0.878
The person expansion factor (X) is computed the same way, with U = 0.77:

X = √[(1 + U/2.89) / (1 − UV/8.35)] = √[(1 + 0.77/2.89) / (1 − (0.77)(0.49)/8.35)] = 1.16
29. Multiply the expansion factor (X) by each of the initial measures (bro) to obtain the corrected person measure: br = X·bro. The possible scores and initial measures are taken from Table 2.
30. Compute the standard error of each person measure:

SE(br) = X √[L / (r(L − r))]
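All of the hand computations in steps 1-30 can also be scripted. The following is a minimal sketch (an illustration, not the book's program; software such as WINSTEPS would normally be used) of the same PROX-style estimation. It assumes the response matrix has already had all-correct and all-wrong items and persons removed, as in steps 3-5 above.

```python
import math

def prox_calibrate(responses):
    """PROX-style Rasch calibration for a 0/1 response matrix
    (rows = persons, columns = items). Assumes all-correct and all-wrong
    items and persons have already been removed (steps 3-5 above)."""
    n_persons, n_items = len(responses), len(responses[0])

    # Item side: logit incorrect, x_i = ln((1 - p_i) / p_i)
    p_item = [sum(row[i] for row in responses) / n_persons for i in range(n_items)]
    x = [math.log((1 - p) / p) for p in p_item]
    x_bar = sum(x) / n_items
    d0 = [xi - x_bar for xi in x]                             # initial item calibrations
    U = sum((xi - x_bar) ** 2 for xi in x) / (n_items - 1)    # variance of item logits

    # Person side: logit correct, y_r = ln(p_r / (1 - p_r))
    p_person = [sum(row) / n_items for row in responses]
    y = [math.log(p / (1 - p)) for p in p_person]
    y_bar = sum(y) / n_persons
    V = sum((yr - y_bar) ** 2 for yr in y) / (n_persons - 1)  # variance of person logits

    # Expansion factors (2.89 = 1.7**2, 8.35 = 1.7**4)
    Y = math.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))        # item expansion factor
    X = math.sqrt((1 + U / 2.89) / (1 - U * V / 8.35))        # person expansion factor

    difficulties = [Y * d for d in d0]   # corrected item calibrations, d_i = Y * d_oi
    abilities = [X * yr for yr in y]     # corrected person measures, b_r = X * y_r
    return difficulties, abilities
```

Applied to the 10 × 24 response matrix above (item 25 removed), this closely reproduces the hand-computed estimates, for example roughly −2.1 logits for item 7.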
Figure 3
Item Map for the Calibrated Item Difficulty and Person Ability

Logit    Items (calibrated difficulty)        Persons (ability, raw score)
-2.07    Item 7
-1.05                                         Case 2 (score 7)
-0.98    Items 3, 13, 15, 19
-0.82                                         Case 4 (score 8)
-0.60                                         Cases 1, 6, 7 (score 9)
-0.53    Items 1, 8, 14, 20, 22, 23
-0.40                                         Cases 3, 8 (score 10)
-0.20                                         Case 5 (score 11)
-0.08    Items 6, 10, 17
 0.00    (δ = θ)
 0.20                                         Case 10 (score 13)
 0.41    Items 2, 5, 11, 12
 1.01    Items 9, 18, 21, 24
 1.90                                         Case 9 (score 20)
 1.91    Items 4, 16

Items below 0.00 do not require high ability (δ < θ); items above 0.00 require high ability (δ > θ).
Figure 3 shows the item map of calibrated item difficulty (left side) and person ability (right
side) across their logit values. Observe that as the items become more difficult (increasing logits),
the person with the highest score (high ability) is matched close to the most difficult items. This match is
termed goodness of fit in the Rasch model. A good fit indicates that difficult items require
high ability to be answered correctly. More specifically, the match between the logits of person ability
and item difficulty indicates goodness of fit. In this case, the goodness of fit of the item
difficulties is estimated using the z value: low and nonsignificant z values indicate a
good fit of item difficulty to person ability.
EMPIRICAL REPORT
prevalent at the elementary level within the context of constructivist theories.

Heuristics. Heuristics are kinds of information, available to students in making decisions during problem solving, that are aids to the generation of a solution, plausible in nature rather than prescriptive, seldom providing infallible guidance, and variable in results. Somewhat synonymous terms are strategies, techniques, and rules of thumb. For example, admonitions to "simplify an algebraic expression by removing parentheses," to "make a table," to "restate the problem in your own words," or to "draw a figure to suggest the line of argument for a proof" are heuristic in nature. Out of context, they have no particular value, but incorporated into situations of doing mathematics they can be quite powerful (Polya, 1973; Polya, 1962; Polya, 1965).

Theories of mathematics problem solving (Newell & Simon, 1972; Schoenfeld, 1985; Wilson, 1967) have placed a major focus on the role of heuristics. Surely it seems that providing explicit instruction on the development and use of heuristics should enhance problem solving performance; yet it is not that simple. Schoenfeld (1985) and Lesh (1981) have pointed out the limitations of such a simplistic analysis. Theories must be enlarged to incorporate classroom contexts, past knowledge and experience, and beliefs. What Polya (1967) describes in How to Solve It is far more complex than any theories we have developed so far.

Mathematics instruction stressing heuristic processes has been the focus of several studies. Kantowski (1977) used heuristic instruction to enhance the geometry problem solving performance of secondary school students. Wilson (1967) and Smith (1974) examined contrasts of general and task-specific heuristics. These studies revealed that task-specific heuristic instruction was more effective than general heuristic instruction. Jensen (1984) used the heuristic of subgoal generation to enable students to form problem solving plans. He used thinking aloud, peer interaction, playing the role of teacher, and direct instruction to develop students' abilities to generate subgoals.

It is useful to develop a framework to think about the processes involved in mathematics problem solving. Most formulations of a problem solving framework in U.S. textbooks attribute some relationship to Polya's (1973) problem solving stages. However, it is important to note that Polya's "stages" were more flexible than the "steps" often delineated in textbooks. These stages were described as understanding the problem, making a plan, carrying out the plan, and looking back.

According to Polya (1965), problem solving was a major theme of doing mathematics and "teaching students to think" was of primary importance. "How to think" is a theme that underlies much of genuine inquiry and problem solving in mathematics. However, care must be taken so that efforts to teach students "how to think" in mathematics problem solving do not get transformed into teaching "what to think" or "what to do." This is, in particular, a byproduct of an emphasis on procedural knowledge about problem solving as seen in the linear frameworks of U.S. mathematics textbooks and the very limited problems/exercises included in lessons.

Clearly, the linear nature of the models used in numerous textbooks does not promote the spirit of Polya's stages and his goal of teaching students to think. By their nature, all of these traditional models have the following defects:
1. They depict problem solving as a linear process.
2. They present problem solving as a series of steps.
3. They imply that solving mathematics problems is a procedure to be memorized, practiced, and habituated.
4. They lead to an emphasis on answer getting.

These linear formulations are not very consistent with genuine problem solving activity. They may, however, be consistent with how experienced problem solvers present their solutions and answers after the problem solving
is completed. In an analogous way, mathematicians present their proofs in very concise terms, but the most elegant of proofs may fail to convey the dynamic inquiry that went on in constructing the proof.

Another aspect of problem solving that is seldom included in textbooks is problem posing, or problem formulation. Although there has been little research in this area, this activity has been gaining considerable attention in U.S. mathematics education in recent years. Brown and Walter (1983) have provided the major work on problem posing. Indeed, the examples and strategies they illustrate show a powerful and dynamic side to problem posing activities. Polya (1972) did not talk specifically about problem posing, but much of the spirit and format of problem posing is included in his illustrations of looking back.

A framework is needed that emphasizes the dynamic and cyclic nature of genuine problem solving. A student may begin with a problem and engage in thought and activity to understand it. The student attempts to make a plan and in the process may discover a need to understand the problem better. When a plan has been formed, the student may attempt to carry it out and be unable to do so. The next activity may be attempting to make a new plan, or going back to develop a new understanding of the problem, or posing a new (possibly related) problem to work on.

Problem solving abilities, beliefs, attitudes, and performance develop in contexts (Schoenfeld, 1988), and those contexts must be studied as well as specific problem solving activities.

Rasch Analysis

Rasch analysis (Bond & Fox, 2001; Rasch, 1980; Wright & Stone, 1979) offers potential advantages over the traditional psychometric methods of classical test theory. It has been widely applied in health status assessment (e.g., Antonucci, Aprile, & Paulucci, 2002; Duncan, Bode, Lai, & Perera, 2003; Fortinsky, Garcia, Sheenan, Madigan, & Tullai-McGuinness, 2003; Lai, Cella, Chang, Bode, & Heinemann, 2003; Linacre, Heinemann, Wright, Granger, & Hamilton, 1994; Velozo, Magalhaes, Pan, & Leiter, 1995; Ware, Bjorner, & Kosinski, 2000) but has rarely been used in mathematical problem solving assessment (Willmes, 1981, 1992). Its primary advantages include the interval nature of the measures it provides and the theoretical independence of item difficulty and person ability scores from the particular samples used to estimate them.

The Rasch model, also referred to in the item response theory literature as the one-parameter logistic model, estimates the probability of a correct response to a given item as a function of item difficulty and person ability. The primary output of Rasch analysis is a set of item difficulty and person ability values placed along a single interval scale. Items with higher difficulty scores are less likely to be answered correctly, and items with lower scores are more likely to elicit correct responses. By the same token, persons with higher ability are more likely to provide correct responses, and those with lower ability are less likely to do so.

Rasch analysis (a) estimates the difficulty of dichotomous items as the natural logarithm of the odds of answering each item correctly (a log odds, or logit score), (b) typically scales these estimates to mean = 0, and then (c) estimates person ability scores on the same scale. In the analysis of dichotomous items, item difficulty and person ability are defined such that when they are equal, there is a 50% chance of a correct response. As person ability exceeds item difficulty, the chance of a correct response increases as a logistic ogive function, and as item difficulty exceeds person ability, the chance of success decreases. The formal relationship among response probability, person ability, and item difficulty is given in the mathematical equation by Bond and Fox (2001, p. 201). A graphic plot of this relationship, known as the item characteristic curve (ICC), is given for three items of different difficulty levels.

One useful feature of the Rasch model is referred to as parameter separation or specific
objectivity (Bond & Fox, 2001; Embretson & Reise, 2000). The implication of this mathematical property is that, at least in theory, item difficulty values do not depend on the person sample used to estimate them, nor do person ability scores depend on the particular items used to estimate them. In practical terms, this means that given well-calibrated sets of items that fit the Rasch model, robust and directly comparable ability estimates may be obtained from different subsets of items. This, in turn, facilitates both adaptive testing and the equating of scores obtained from different instruments (Bond & Fox, 2001; Embretson & Reise, 2000).

Rasch theory makes a number of explicit assumptions about the construct to be measured and the items used to measure it, two of which have already been discussed above. The first is that all test items respond to the same unidimensional construct. One set of tools for examining the extent to which test items approximate unidimensionality is the set of fit statistics provided by Rasch analysis. These fit statistics indicate the amount of variation between model expectations and observations. They identify items and people eliciting unexpected responses, such as when a person of high ability responds incorrectly to an easy question, perhaps because of carelessness or because of a poorly constructed or administered item. Fit statistics can be informative with respect to dimensionality because they indicate when different people may be responding to different aspects of an item's content or the testing situation.

A second key assumption of Rasch analysis, also mentioned above, is that individuals can be placed on an ordered continuum along the dimension of interest, from those having less ability to those having more (Bond & Fox, 2001). Similarly, the analysis assumes that items may be placed on the same scale, from those requiring less ability to those requiring more.

A third assumption underlying Rasch analysis is that of local, or conditional, independence (Embretson & Reise, 2000; Wainer & Mislevy, 2000). This assumption requires that individual items do not influence one another (i.e., they are uncorrelated once the dimension of item difficulty-person ability is taken into account). Thus, no considerations of item content, beyond their difficulty values, are necessary for estimating person ability, and changing the order of item administration should not change item or person estimates. In mathematical terms, this assumption states that the probability of a string of responses is equal to the product of the individual probabilities of each of the separate responses comprising it. Failure to meet this assumption can suggest the presence of another dimension in the data.
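Written compactly in standard IRT notation (the text states this assumption in words only), local independence says that for a response string x1, …, xn:

$$ P(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta, \delta_i) $$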
Local dependence is often a concern in the construction of reading comprehension tests that include multiple questions about the same passage, because responses to such questions may be determined not only by the difficulty of each individual item but also by the difficulty and content of the passage. Responses to items of this type are often intercorrelated even after their individual difficulties have been taken into account. To give another example, if a particular question occurring earlier in a test provides specific information about the answer to a later question, then these two items are also likely to demonstrate local dependence.

A final important assumption of the Rasch model is that the slope of the item characteristic curve, also known as the item discrimination parameter, is equal to 1 for all items (Bond & Fox, 2001; Embretson & Reise, 2000; Wainer & Mislevy, 2000). This assumption is presented graphically in Figure 1, where all three curves are parallel with a slope equal to 1. The consequence of this assumption is that a given change in ability level will have the same effect on the log odds of a correct response for all items. For items that have different discrimination values, a given change in ability has different consequences for different items. When an item's discrimination parameter is high, a relatively small change in ability level results in a large change in response probability. When discrimination is low, larger changes in ability
level are needed to change response probability. A highly discriminating item (i.e., one with a high ICC slope) is more likely to result in different responses from two individuals of different ability levels, whereas an item with a low discrimination parameter (i.e., a low ICC slope) more often results in the same response from both. Rasch models have been shown to be robust to small and/or unsystematic violations of this assumption (Penfield, 2004; van de Vijver, 1986), but when the ICC slopes in an item set differ substantially and/or systematically from 1, the test developer is advised to reconsider the extent to which the offending items measure the relevant construct (Wright, 1991).

An example of the use of the one-parameter Rasch model is the study by El-Korashy (1995), where the Rasch model was applied to the selection of items for an Arabic version of the Otis-Lennon Mental Ability Test. Correspondence of item calibration to person measurement indicated that the test is suitable for the range of mental ability intended to be measured. Another is the study by Lamprianou (2004), which analyzes data from three testing cycles of the National Curriculum tests in mathematics in England using the Rasch model. It was found that pupils having English as an additional language and pupils belonging to ethnic minorities are significantly more likely to generate aberrant response patterns. However, within the groups of pupils belonging to ethnic minorities, those who speak English as an additional language are not significantly more likely to generate misfitting response patterns. This may indicate that the ethnic background effect is more significant than the effect of the first language spoken. The results suggest that pupils having English as an additional language and pupils belonging to ethnic minorities are mismeasured significantly more than the remainder of pupils when taking the mathematics National Curriculum tests. More research is needed to generalize the results to other subjects and contexts.

Purpose of the Study

1. In the current investigation, the Rasch model was used to analyze a set of Mathematical Problem Solving data provided by a sample of fourth year high school students in two Chinese schools. One purpose of the study was to determine whether the construct validity of the test is supported by Rasch analysis. Specifically, it is hypothesized that the test responds to a cohesive unidimensional construct. Item fit statistics, a Rasch-based unidimensionality coefficient, and principal-components analysis of model residuals were used to evaluate this hypothesis.

2. To test the hypothesis that Rasch estimates of person ability, because of their status as interval-level measures, are more valid and sensitive than traditionally computed scores.

Method

Participants

The participants were 31 high school students from two different schools, UNO High School and Grace Christian High School. These two high schools were chosen for their popularity in molding high achievers in Mathematics. The participants were fourth year high school students, both male and female, belonging to the 16-18 age group. The decision to choose high school students was made because the high school educational system is much more regimented, and it can be safely assumed that any given fourth year student would have studied the lessons required of a third year student. Convenience sampling was used to select the respondents.

Instrument

Mathematical Problem Solving Test. The Mathematical Problem Solving Test was constructed to measure the problem solving ability of the students (see Appendix A). There are 25 items included in the test, covering third year high school lessons. Third year lessons were used because the participants would only be
mean. A unit on this scale, a logit, represents the change in ability or difficulty necessary to change the odds of a correct response by a factor of 2.718, the base of the natural logarithm. Persons who respond to all items correctly or incorrectly, and items to which all persons respond correctly or incorrectly, are uninformative with respect to item difficulty estimation and are thus excluded from the parameter estimation process.

Results

Item analysis was used to evaluate whether the items in the Mathematical Problem Solving Test are easy, average, or difficult. The difficulty of an item is based on the percentage of people who answered it correctly. The index of discrimination revealed that there are no marginal items or bad items; 84% of the items are very good, 2% are good items, and 2% are reasonably good items.

For item difficulty, each item indicates whether it is easy, average, or difficult. Item difficulty is determined by whether the items have the appropriate difficulty level. It was found that there are no difficult items, although 72% of the items are average and 28% are easy.

One-Parameter Rasch Model

When the test scores and abilities of the students in the Mathematical Problem Solving Test were calibrated, new indices for reliability were obtained. The student reliability was .50 with an RMSE of .52, and the Math reliability was .34 with an RMSE of .82. The errors associated with these estimates are high, indicating that the data do not fit well the expected ability and test difficulty. Figure 1 shows the test characteristic curve generated by WINSTEPS.

The computed separation for ability is 1.20, and for the item (expected score) it is 11, which is .73 when converted into a standardized estimate. These extreme values are adjusted by fine-tuning the slopes produced for each item.

The characteristic curve shows that items 5, 1, 6, and 4 have the probability of being answered with low ability, while items 3, 7, and 2 require higher ability to get a correct response.

The characteristic curve shows that items 11, 13, 2, and 8 have the probability of being answered with low ability, while items 9, 10, and 4 require higher ability to get a correct response. The overlap between items 11 and 13 and between items 9 and 10 means that the same ability is required to get the probability of answering the item correctly.

The characteristic curve shows that items 16 and 19 have the probability of being answered with low ability, while items 17, 18, 20, 21, and 22 require higher ability to get a correct response. The overlap between items 18, 20, 21, and 22, and between items 16 and 19, means that the same ability is required to get the probability of answering the item correctly. Items 23, 24, and 25 are excluded because of extreme responses.

Examination of Fit

The average INFIT statistic is 1.00 and the average OUTFIT statistic is .98, which indicates that the data for the items show goodness of fit because the values are less than 1.5, except for items 23, 24, and 25.

Unidimensionality Coefficient

To address the question of construct dimensionality, a Rasch unidimensionality coefficient was calculated. This coefficient was calculated as the ratio of the person separation reliability estimated using model standard errors (which treat model misfit as random variation) to the person separation reliability estimated using real standard errors (which regard misfit as true departure from the unidimensional model; Wright, 1994). The closer the value of the coefficient to 1.0, the more closely the data approximate unidimensionality. The unidimensionality coefficient for the current data set was .61 (the ratio of the 1.20 and .73 separation values), which is quite marginal relative to 1.00. This means that the data might form more than one dimension.

Principal components analysis shows that there can possibly be 7 factors that can be
121
formed with the items, excluding item 25, which had no variation, as indicated in the scree plot. Principal-components analysis of model residuals conducted for the 24-item pool (after exclusion of the seven misfitting items) revealed that 26.97% of the variance in the observations was accounted for by the Rasch dimension of item difficulty-person ability. The next largest factor extracted accounted for only 4.86% of the remaining variance.

The log functions for each item show large standard errors. This supports the principal components analysis finding that there might be factors formed out of the 22 items.

Discussion

The present results generally support the construct and content validity of the Mathematical Problem Solving Test. First, the acceptable fit of the 22 test items to the Rasch model and the marginal unidimensionality coefficient (.61) support the hypothesis that the test measures a unidimensional construct. Furthermore, acceptable item and person separation indices and reliability coefficients suggest that the parameter estimates obtained in the current study are both reproducible and useful for differentiating items and persons from one another.

In addition, principal-components analysis of Rasch model residuals (with the two misfitting items excluded) indicated that the dimension of person ability-item difficulty accounted for the majority of the variance in the data (26.97%), and the next largest factor extracted accounted for very little additional variance (4.86%), although this does not provide further support for the unidimensionality of the test.

The pattern of item difficulty across subtests was consistent with item content and similar for values derived by Rasch analysis and traditional methods. As expected, based on increasing lexical load, the results showed variation in difficulty. There are more items that can be answered requiring low ability because of the additional lexical load imposed by the inclusion of size adjectives.

Aspects of the test's validity were supported by the present analyses. First, only two items were excluded because of poor model fit. Perhaps participants were not generally able to figure out the proper response strategy by the end of the test (because of the provision of repeats and cues) and were then able to effectively implement problem solving strategies. If this is correct, then eliminating these items should introduce misfit for the items of this type. The two other items that were excluded because of poor model fit were the last test items, which differ from the earlier items in that they contain two-part commands and require responses using more skills. This suggests that initial responses to different kinds of commands might be determined in part by another construct, for example, the ability to switch set.

A second aspect of the test's validity that the present analysis failed to confirm concerns the homogeneity of item difficulty within subtests. The differences between the parameter estimates within the items suggest that they are not necessarily homogeneous with respect to difficulty. The present finding might have been in part the result of a relatively small and poorly targeted sample. A larger sample with a broader distribution might obtain less item variability. Although sample sizes of approximately 100 have been argued to produce stable item parameter estimates (Linacre, 1994; van de Vijver, 1986), larger samples are preferable. Willmes's (1981) prior finding suggests that the present result may be reliable, but his participant sample was similarly sized, if perhaps better targeted.

References

Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1, Suppl.), I7-I16.
Antonucci, G., Aprile, T., & Paolucci, S. (2002). Rasch analysis of the Rivermead Mobility Index: A study using mobility measures of first-stroke inpatients. Archives of Physical Medicine and Rehabilitation, 83, 1442-1449.
Arvedson, J. C., McNeil, M. R., & West, T. L. (1986). Prediction of Revised Token Test overall, subtest, and linguistic unit scores by two shortened versions. Clinical Aphasiology, 16, 57-63.
Blackwell, A., & Bates, E. (1995). Inducing agrammatic profiles in normals: Evidence for the selective vulnerability of morphology under cognitive resource limitation. Journal of Cognitive Neuroscience, 7, 228-257.
Bobrow, D. G. (1964). Natural language input for a computer problem solving system. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Cambridge.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4, 87-100.
Brown, S. I., & Walter, M. I. (1983). The art of problem posing. Hillsdale, NJ: Lawrence Erlbaum.
Chang, W.-C., & Chan, C. (1995). Rasch analysis for outcomes measures: Some methodological considerations. Archives of Physical Medicine and Rehabilitation, 76, 934-939.
Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3, 186-190.
DiSimoni, F. G., Keith, R. L., & Darley, F. L. (1980). Prediction of PICA overall score by short versions of the test. Journal of Speech and Hearing Research, 23, 511-516.
Duffy, J. R., & Dale, B. J. (1977). The PICA scoring scale: Do its statistical shortcomings cause clinical problems? In R. H. Brookshire (Ed.), Collected proceedings from clinical aphasiology (pp. 290-296). Minneapolis, MN: BRK.
Duncan, P. W., Bode, R., Lai, S. M., & Perera, S. (2003). Rasch analysis of a new stroke-specific outcome scale: The Stroke Impact Scale. Archives of Physical Medicine and Rehabilitation, 84, 950-963.
Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-77.
El-Korashy, A. (1995). Applying the Rasch model to the selection of items for a mental ability test. Educational and Psychological Measurement, 55, 753.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Fischer, G. H., & Molenaar, I. W. (1995). Rasch models: Foundations, recent developments and applications. New York: Springer.
Fortinsky, R. H., Garcia, R. I., Sheehan, T. J., Madigan, E. A., & Tullai-McGuinness, S. (2003). Measuring disability in Medicare home care patients: Application of Rasch modeling to the Outcome and Assessment Information Set. Medical Care, 41, 601-615.
Frederiksen, N. (1984). Implications of cognitive theory for instruction in problem solving. Review of Educational Research, 54, 363-407.
Freed, D. B., Marshall, R. C., & Chulantseff, E. A. (1996). Picture naming variability: A methodological consideration of inconsistent naming responses in fluent and nonfluent aphasia. In R. H. Brookshire (Ed.), Clinical aphasiology conference (pp. 193-205). Austin, TX: Pro-Ed.
Garofalo, J., & Lester, F. K. (1985). Metacognition, cognitive monitoring, and mathematical performance. Journal for Research in Mathematics Education, 16, 163-176.
Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Henderson, K. B., & Pingry, R. E. (1953). Problem solving in mathematics. In H. F. Fehr (Ed.), The learning of mathematics: Its theory and practice (21st Yearbook of the National Council of Teachers of Mathematics) (pp. 228-270). Washington, DC: National Council of Teachers of Mathematics.
Hobart, J. C. (2002). Measuring disease impact in disabling neurological conditions: Are patients' perspectives and scientific rigor compatible? Current Opinion in Neurology, 15, 721-724.
Howard, D., Patterson, K., Franklin, S., Morton, J., & Orchard-Lisle, V. (1984). Variability and consistency in naming by aphasic patients. Advances in Neurology, 42, 263-276.
Jensen, R. (1984). A multifaceted instructional approach for developing subgoal generation skills. Unpublished doctoral dissertation, The University of Georgia.
Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.
Kantowski, M. G. (1974). Processes involved in mathematical problem solving. Unpublished doctoral dissertation, The University of Georgia, Athens.
Kantowski, M. G. (1977). Processes involved in mathematical problem solving. Journal for Research in Mathematics Education, 8, 163-180.
Kaput, J. J. (1979). Mathematics learning: Roots of epistemological status. In J. Lochhead & J. Clement (Eds.), Cognitive process instruction. Philadelphia, PA: Franklin Institute Press.
Lai, J.-S., Cella, D., Chang, C. H., Bode, R., & Heinemann, A. W. (2003). Item banking to improve, shorten and computerize self-reported fatigue: An illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Quality of Life Research, 12, 485-501.
Lamprianou, I., & Boyle, B. (2004). Accuracy of measurement in the context of mathematics National Curriculum tests in England for ethnic minority pupils and pupils who speak English as an additional language. Journal of Educational Measurement, 41, 239-251.
Larkin, J. (1980). Teaching problem solving in physics: The psychological laboratory and the practical classroom. In F. Reif & D. Tuma (Eds.), Problem solving in education: Issues in teaching and research. Hillsdale, NJ: Lawrence Erlbaum.
Lesh, R. (1981). Applied mathematical problem solving. Educational Studies in Mathematics, 12(2), 235-265.
Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 328.
Linacre, J. M. (1998). Structure in Rasch residuals: Why principal components analysis? Rasch Measurement Transactions, 12, 636.
Linacre, J. M. (2002). Facets, factors, elements and levels. Rasch Measurement Transactions, 16, 880.
Linacre, J. M., & Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Linacre, J. M., & Wright, B. D. (2003). WINSTEPS: Multiple-choice, rating scale, and partial credit Rasch analysis [Computer software]. Chicago: MESA Press.
Linacre, J. M., Heinemann, A. W., Wright, B., Granger, C. V., & Hamilton, B. B. (1994). The structure and stability of the Functional Independence Measure. Archives of Physical Medicine and Rehabilitation, 75, 127-132.
Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27.
Lumsden, J. (1978). Tests are perfectly reliable. British Journal of Mathematical and Statistical Psychology, 31, 19-26.
Masters, G. (1993). Undesirable item discrimination. Rasch Measurement Transactions, 7, 289.
McHorney, C. A., Haley, S. M., & Ware, J. E. (1997). Evaluation of the MOS SF-36 physical functioning scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. Journal of Clinical Epidemiology, 50, 451-461.
McNeil, M. R. (1988). Aphasia in the adult. In N. J. Lass, L. V. McReynolds, J. Northern, & D. E. Yoder (Eds.), Handbook of speech-language pathology and audiology (pp. 738-786). Toronto, Ontario, Canada: D. C. Becker.
McNeil, M. R., & Hageman, C. F. (1979). Auditory processing deficits in aphasia evidenced on the Revised Token Test: Incidence and prediction of across subtest and across item within subtest patterns. In R. H. Brookshire (Ed.), Clinical aphasiology conference proceedings (pp. 47-69). Minneapolis, MN: BRK.
Merbitz, C., Morris, J., & Grip, J. C. (1989). Ordinal scales and foundations of misinference. Archives of Physical Medicine and Rehabilitation, 70, 308-312.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355-383.
Michell, J. (2004). Item response models, pathological science, and the shape of error. Theory and Psychology, 14, 121-129.
National Council of Supervisors of Mathematics. (1978). Position paper on basic mathematical skills. Mathematics Teacher, 71(2), 147-152. (Reprinted from position paper distributed to members January 1977)
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice Hall.
Norquist, J. M., Fitzpatrick, R., Dawson, J., & Jenkinson, C. (2004). Comparing alternative Rasch-based methods vs. raw scores in measuring change in health. Medical Care, 42, 125-136.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Orgass, B. (1976). Eine Revision des Token Tests, Teil I und II [A revision of the Token Test, Parts I and II]. Diagnostica, 22, 70-87.
Penfield, R. D. (2004). The impact of model misfit on partial credit model parameter estimates. Journal of Applied Measurement, 5, 115-128.
Polya, G. (1962). Mathematical discovery: On understanding, learning and teaching problem solving (Vol. 1). New York: Wiley.
Polya, G. (1965). Mathematical discovery: On understanding, learning and teaching problem solving (Vol. 2). New York: Wiley.
Polya, G. (1973). How to solve it. Princeton, NJ: Princeton University Press. (Original work published 1945)
Porch, B. (2001). Porch Index of Communicative Ability. Albuquerque, NM: PICA Programs.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960)
Reitman, W. R. (1965). Cognition and thought. New York: Wiley.
Schoenfeld, A. H. (1983). Episodes and executive decisions in mathematics problem solving. In R. Lesh & M. Landau (Eds.), Acquisition of mathematics concepts and processes. New York: Academic Press.
Schoenfeld, A. H. (1985). Mathematical problem solving. Orlando, FL: Academic Press.
Schoenfeld, A. H. (1988). When good teaching leads to bad results: The disasters of "well taught" mathematics classes. Educational Psychologist, 23, 145-166.
Schoenfeld, A. H., & Herrmann, D. (1982). Problem perception and knowledge structure in expert and novice mathematical problem solvers. Journal of Experimental Psychology: Learning, Memory and Cognition, 8, 484-494.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354.
Silver, E. A. (1987). Foundations of cognitive theory and research for mathematics problem-solving instruction. In A. H. Schoenfeld (Ed.), Cognitive science and mathematics education (pp. 33-60). Hillsdale, NJ: Lawrence Erlbaum.
Smith, J. P. (1974). The effects of general versus specific heuristics in mathematical problem-solving tasks (Columbia University, 1973). Dissertation Abstracts International, 34, 2400A.
Smith, R. M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46, 359-372.
Stanic, G., & Kilpatrick, J. (1988). Historical perspectives on problem solving in the mathematics curriculum. In R. I. Charles & E. A. Silver (Eds.), The teaching and assessing of mathematical problem solving (pp. 1-22). Reston, VA: National Council of Teachers of Mathematics.
Steffe, L. P., & Wood, T. (Eds.). (1990). Transforming children's mathematical education. Hillsdale, NJ: Lawrence Erlbaum.
Stevens, S. S. (1946, June 7). On the theory of scales of measurement. Science, 103, 677-680.
van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied Psychological Measurement, 10, 45-57.
Velozo, C. A., Magalhaes, L. C., Pan, A.-W., & Leiter, P. (1995). Functional scale discrimination at admission and discharge: Rasch analysis of the Level of Rehabilitation Scale-III. Archives of Physical Medicine and Rehabilitation, 76, 705-712.
von Glasersfeld, E. (1989). Constructivism in education. In T. Husen & T. N. Postlethwaite (Eds.), The international encyclopedia of education (Suppl. Vol. 1, pp. 162-163). New York: Pergamon.
Wainer, H., & Mislevy, R. J. (2000). Item response theory, item calibration, and proficiency estimation. In H. Wainer, N. J. Dorans, D. Eignor, R. Flaugher, B. F. Green, R. J. Mislevy, et al. (Eds.), Computerized adaptive testing: A primer (2nd ed., pp. 61-100). Mahwah, NJ: Erlbaum.
Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., et al. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Erlbaum.
Ware, J. E., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38, II73-II82.
Waters, W. (1984). Concept acquisition tasks. In G. A. Goldin & C. E. McClintock (Eds.), Task variables in mathematical problem solving (pp. 277-296). Philadelphia, PA: Franklin Institute Press.
Willmes, K. (1981). A new look at the Token Test using probabilistic test models. Neuropsychologia, 19, 631-645.
Willmes, K. (1992). Psychometric evaluation of neuropsychological test performances. In N. von Steinbuechel, D. Y. Cramon, & E. Poeppel (Eds.), Neuropsychological rehabilitation (pp. 103-113). Heidelberg, Germany: Springer-Verlag.
Willmes, K. (2003). Psychometric issues in aphasia therapy research. In I. Papathanasiou & R. De Bleser (Eds.), The sciences of aphasia: From theory to therapy (pp. 227-244). Amsterdam: Pergamon.
Wilson, J. W. (1967). Generality of heuristics as an instructional variable. Unpublished doctoral dissertation, Stanford University, Stanford, CA.
Wright, B. D. (1991). IRT in the 1990s: Which models work best? Rasch Measurement Transactions, 6, 196-200.
Wright, B. D. (1994). A Rasch unidimensionality coefficient. Rasch Measurement Transactions, 8, 385.
Wright, B. D. (1996). Local dependency, correlations and principal components. Rasch Measurement Transactions, 10, 509-511.
Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know (pp. 65-104). Mahwah, NJ: Erlbaum.
Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70, 857-860.
Wright, B. D., & Masters, G. S. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wright, B., & Masters, G. (1997). The partial credit model. In W. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 101-121). New York: Springer.
SPECIAL TOPIC
This review presents the nature of psychometrics, including issues in psychological measurement, its relevant theories, and current practice. The basic scaling models are discussed, since they are the processes that enable the quantification of psychological constructs. The issues and research trends in classical test theory and item response theory, with their different models and their implications for test construction, are explained. Towards the end of the article, different methods of scaling people, stimuli, and responses are discussed.
Psychometrics concerns itself with the science of measuring psychological constructs such as ability, personality, affect, and skills. Psychological measurement methods are crucially important for basic research in psychology, since research in psychology involves the measurement of variables in order to conduct further analyses. In the past, obtaining adequate measurement of psychological constructs was considered an issue in the science of psychology. Some references indicate that there are psychological constructs that are deemed unobservable and difficult to quantify. This issue is compounded by the fact that psychological theories are filled with variables that either cannot be measured at all at the present time or can be measured only approximately (Kaplan & Saccuzzo, 1997), such as anxiety, creativity, dogmatism, achievement, motivation, attention, and frustration. Moreover, according to Immanuel Kant, "it is impossible to have a science of psychology because the basic data could not be observed and measured." The field of psychological measurement has nevertheless advanced, and practitioners in the field of psychometrics have been able to deal properly with these issues and devise methods on the basic premise of scientific observation and measurement. Most psychological constructs involve subjective experiences, such as feelings, sensations, and desires; since individuals can make judgments, state their preferences, and even talk about these experiences, measurement can take place, and it thus meets the requirements of scientific inquiry. It is very much possible to assign numbers to psychological constructs so as to represent quantities of attributes, and even to formulate rules standardizing the measurement process.
Standardizing psychological measurement requires a process of abstraction in which psychological attributes are observed in relation to other constructs such as attitude and achievement (Magno, 2003). This process allows researchers to establish associations among variables, as in construct validation and criterion-prediction processes. Also, emphasizing the measurement of psychological constructs forces researchers and test developers to consider carefully the nature of a construct before attempting to measure it. This involves a thorough literature review of the conceptual definition of an attribute before valid test items are constructed. It is also common practice in psychometrics for numerical scores to be used to communicate the amount of an attribute an individual possesses. Quantification is closely intertwined with the concept of measurement: in the process of quantification, mathematical systems and statistical procedures are used that enable researchers to examine the internal relationships among data obtained through a measure. Such procedures enable psychometrics to build theories and to consider itself part of the system of science.
There are two branches of psychometric theory: classical test theory and item response theory. Both theories enable researchers to predict the outcomes of psychological tests by identifying parameters for item difficulty and the ability of test takers, and both are concerned with improving the reliability of psychological tests.

Classical test theory is regarded in the literature as "true score theory." The theory starts from the assumption that systematic effects between the responses of examinees are due only to variation in the ability of interest. All other potential sources of variation in the testing materials, such as external conditions or the internal conditions of examinees, are assumed either to be held constant through rigorous standardization or to have an effect that is nonsystematic or random by nature (van der Linden & Hambleton, 2004). The central model of classical test theory is that an observed test score (TO) is composed of a true score (T) and an error score (E), where the true and error scores are independent. The model was established by Spearman (1904) and Novick (1966) and is best illustrated in the formula:

TO = T + E
Classical theory assumes that each individual has a true score, which would be obtained if there were no errors in measurement. However, because measuring instruments are imperfect, the score observed for each person may differ from the individual's true ability. The difference between the true score and the observed test score results from measurement error. Using a variety of justifications, error is often assumed to be a random variable having a normal distribution. The implication of classical test theory for test takers is that tests are fallible, imprecise tools. The score achieved by an individual is rarely the individual's true score; although the true score for an individual will not change with repeated applications of the same test, the observed score is almost always the true score influenced by some degree of error, which pushes the observed score higher or lower. Theoretically, the standard deviation of the distribution of random errors for each individual tells about the magnitude of measurement error. It is usually assumed that the distribution of random errors will be the same for all individuals. Classical test theory uses the standard deviation of errors as the basic measure of error; usually this is called the standard error of measurement. In practice, the standard deviation of the observed scores and the reliability of the test are used to estimate the standard error of measurement (Kaplan & Saccuzzo, 1997). The larger the standard error of measurement, the less certain is the accuracy with which an attribute is measured. Conversely, a small standard error of measurement indicates that an individual's score is probably close to the true score. The standard error of measurement is calculated with the formula Sm = S√(1 - r), where S is the standard deviation of the observed scores and r is the reliability of the test. Standard errors of measurement are used to create confidence intervals around specific observed scores (Kaplan & Saccuzzo, 1997). The lower and upper bounds of the confidence interval approximate the value of the true score.
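To make the computation concrete, here is a minimal Python sketch of the standard error of measurement and a 95% confidence interval around an observed score. The test values (standard deviation 10, reliability .91, observed score 75) are hypothetical, chosen only for illustration.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: Sm = S * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed: float, sd: float, reliability: float, z: float = 1.96):
    """Confidence interval around an observed score (z = 1.96 gives ~95%)."""
    m = sem(sd, reliability)
    return observed - z * m, observed + z * m

# Hypothetical test: SD = 10, reliability r = .91, observed score = 75.
print(sem(10, 0.91))                      # 3.0
print(confidence_interval(75, 10, 0.91))  # (69.12, 80.88)
```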
Traditionally, methods of analysis based on classical test theory have been used to evaluate tests. The focus of the analysis is on the total test score: the frequency of correct responses (to indicate question difficulty), the frequency of responses (to examine distractors), the reliability of the test, and item-total correlations (to evaluate discrimination at the item level) (Impara & Plake, 1997). Although these statistics have been widely used, one limitation is that they relate to the sample under scrutiny, and thus all the statistics that describe items and questions are sample dependent (Hambleton, 2000). This critique may not be particularly relevant where successive samples are reasonably representative and do not vary across time, but this needs to be confirmed, and complex strategies have been proposed to overcome the limitation.
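As an illustration of the item statistics listed above, the following sketch computes item difficulty (the proportion of correct responses) and a corrected item-total correlation from a small matrix of scored responses; the response data are hypothetical.

```python
import numpy as np

# Hypothetical scored responses: 6 examinees x 4 items (1 = correct).
X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

difficulty = X.mean(axis=0)  # proportion correct per item (the p-value)

# Corrected item-total correlation: correlate each item with the
# total score computed from the remaining items.
totals = X.sum(axis=1)
discrimination = np.array([
    np.corrcoef(X[:, j], totals - X[:, j])[0, 1] for j in range(X.shape[1])
])

print("difficulty:", difficulty)
print("discrimination:", discrimination)
```

Both statistics are computed from this particular sample, which is exactly the sample dependence the passage describes.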
Another branch of psychometric theory is item response theory (IRT). IRT may be regarded as roughly synonymous with latent trait theory. It is sometimes referred to as strong true score theory or modern mental test theory, because IRT is a more recent body of theory and makes stronger assumptions than classical test theory. This approach to testing, based on item analysis, considers the chance of getting particular items right or wrong. In this approach, each item on a test has its own item characteristic curve that describes the probability of getting the particular item right or wrong given the ability of the test taker (Kaplan & Saccuzzo, 1997). The Rasch model, as an example of IRT, is appropriate for modeling dichotomous responses and models the probability of an individual's correct response on a dichotomous item. The logistic item characteristic curve, a function of ability, forms the boundary between the probability areas of answering an item incorrectly and answering it correctly. This one-parameter logistic model assumes that the discriminations of all items are equal to one (Maier, 2001).

Another fundamental feature of this theory is that item performance is related to the estimated amount of the respondent's latent trait (Anastasi & Urbina, 2002). A latent trait is symbolized as theta (θ), which refers to a statistical construct. In cognitive tests, the latent trait is called the ability measured by the test, and the total score on a test is taken as an estimate of that ability. A person of specified ability (θ) succeeds with a predictable probability on an item of specified difficulty.
There are various approaches to the construction of tests using item response theory. Some approaches use two dimensions that plot item discriminations and item difficulties. Other approaches add a third parameter for the probability of test takers with very low levels of ability getting a correct response (as demonstrated in Figure 1). Still other approaches, such as the Rasch model, use only the difficulty parameter (one dimension). All these approaches characterize the item in relation to the probability that those who do well or poorly on the exam will have different levels of performance.
Two-Parameter Model/Normal-Ogive Model. The ogive model postulates a normal cumulative distribution function as the response function for an item. The model demonstrates that an item's difficulty is the point on the ability scale where an examinee has a probability of success on the item of .50 (van der Linden & Hambleton, 2004). In the model, the difficulty of each item can be defined by the 50% threshold, which is customary in establishing sensory thresholds in psychophysics. The discriminative power of each item, represented by a curve in the graph, is indicated by its steepness: the steeper the curve, the higher the correlation of item performance with total score and the higher the discrimination index.

The original idea of the model can be traced back to Thurstone's use of the normal model in his discriminal dispersion theory of stimulus perception (Thurstone, 1927). Researchers in psychophysics studied the relation between the psychophysical properties of stimuli and their perception by human subjects. In this procedure, a stimulus is presented to the subject, who reports whether the stimulus is detected; detection increases as stimulus intensity increases. With this pattern, the cumulative normal distribution with parametrization was used as the response function.
Three-Parameter Model/Logistic Model. In plotting ability (θ) against the probability of a correct response Pi(θ) in a three-parameter model, the slope of the curve indicates the item discrimination: the higher the value of the item discrimination, the steeper the slope. In the model, Birnbaum (1950) proposed a third parameter to account for the nonzero performance of low-ability examinees on multiple-choice items. The nonzero performance is due to the probability of guessing correct answers to multiple-choice items (van der Linden & Hambleton, 2004).
Figure 1. Hypothetical Item Characteristic Curves for Three Items Using a Three-Parameter Model
[The figure plots the probability of a correct response (0-100%) on the vertical axis against the ability scale on the horizontal axis, with separate curves for Items 1, 2, and 3.]
The item difficulty parameters (b1, b2, b3) correspond to the locations on the ability axis at which the probability of a correct response is .50. The curves show that Item 1 is easier, while Items 2 and 3 have the same difficulty at a .50 probability of a correct response. Estimates of item parameters and ability are typically computed through successive approximation procedures, in which the approximations are repeated until the values stabilize.
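As a worked illustration of the curves in Figure 1, the sketch below evaluates the standard three-parameter logistic function, P(θ) = c + (1 - c)/(1 + e^(-a(θ - b))), for three hypothetical items, where a is the discrimination (slope), b the difficulty (location), and c the lower asymptote for guessing. The parameter values are assumptions chosen only to mimic the figure.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: c + (1 - c) / (1 + exp(-a*(theta - b)))."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)  # points along the ability scale
items = [                      # hypothetical (a, b, c) triples
    (1.0, -1.0, 0.20),         # easier item
    (1.5,  0.5, 0.20),         # harder, more discriminating item
    (0.8,  0.5, 0.25),         # same difficulty, flatter slope
]
for a, b, c in items:
    print(np.round(p_3pl(theta, a, b, c), 2))
```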
One-Parameter Model/Rasch Model. The Rasch model is based on the assumption that both guessing and item differences in discrimination are negligible. In constructing tests, the proponents of the Rasch model frequently discard those items that do not meet these assumptions (Anastasi & Urbina, 2002). Rasch began his work in educational and psychological measurement in the late 1940s. Early in the 1950s he developed his Poisson models for reading tests and a model for intelligence and achievement tests, later called the "structural models for items in a test," which is known today as the Rasch model.

Rasch's (1960) main motivation for his model was to eliminate references to populations of examinees in the analysis of tests. According to him, test analysis would only be worthwhile if it were individual centered, with separate parameters for the items and the examinees (van der Linden & Hambleton, 2004). His work marked IRT with its probabilistic modeling of the interaction between an individual item and an individual examinee. The Rasch model is a probabilistic unidimensional model which asserts that (1) the easier the question, the more likely the student will respond to it correctly, and (2) the more able the student, the more likely he/she will pass the question compared to a less able student.
The Rasch model was derived from the initial Poisson model, illustrated in the formula

λ = δ/θ

where λ is a function of the parameters describing the ability of the examinee and the difficulty of the test, θ represents the ability of the examinee, and δ represents the difficulty of the test, which is estimated from the summation of errors in the test. Furthermore, the model was enhanced to assume that the probability that a student will correctly answer a question is a logistic function of the difference between the student's ability (θ) and the difficulty of the question (β), that is, the ability required to answer the question correctly, and only a function of that difference:

P(X = 1) = e^(θ - β) / (1 + e^(θ - β))

giving way to the Rasch model.
From this, the expected pattern of responses to questions can be determined given the estimated θ and β. Even though each response to each question must depend upon the student's ability and the question's difficulty, in the data analysis it is possible to condition out, or eliminate, the students' abilities (by taking all students at the same score level) in order to estimate the relative question difficulties (Andrich, 2004; Dobby & Duckworth, 1979). Thus, when the data fit the model, the relative difficulties of the questions are independent of the relative abilities of the students, and vice versa (Rasch, 1977). A further consequence of this invariance is that it justifies the use of the total score (Wright & Panchapakesan, 1969). In the current analysis this estimation is done through a pairwise conditional maximum likelihood algorithm.
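A minimal sketch of the model in code: the first function is the Rasch probability just given, and the item calibration shown is only a rough normal-approximation (PROX-style) shortcut based on the logit of each item's proportion correct. Operational calibration uses conditional or joint maximum likelihood (as in WINSTEPS), so the simulated data and this shortcut are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_rasch(theta, beta):
    """Rasch probability of a correct response: exp(theta - beta) / (1 + exp(theta - beta))."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

# Simulate 500 hypothetical examinees on 5 items of known difficulty (logits).
true_beta = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
theta = rng.normal(0, 1, size=500)
responses = rng.random((500, 5)) < p_rasch(theta[:, None], true_beta[None, :])

# Rough PROX-style calibration: logit of the proportion incorrect, centered at 0.
p_correct = responses.mean(axis=0)
beta_hat = np.log((1 - p_correct) / p_correct)
beta_hat -= beta_hat.mean()
print(np.round(beta_hat, 2))  # roughly recovers true_beta
```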
According to Fischer (1974) the Rasch model can be derived from the following assumptions:
(1) Unidimensionality. All items are functionally dependent upon only one underlying continuum.
(2) Monotonicity. All item characteristic functions are strictly monotonic in the latent trait. The item
characteristic function describes the probability of a predefined response as a function of the latent trait.
(3) Local stochastic independence. Every person has a certain probability of giving a predefined
response to each item and this probability is independent of the answers given to the preceding items.
(4) Sufficiency of a simple sum statistic. The number of predefined responses is a sufficient statistic
for the latent parameter.
(5) Dichotomy of the items. For each item there are only two different responses, for example positive and negative. The Rasch model requires that an additive structure underlie the observed data. This additive structure applies to the logit of Pij, where Pij, the probability that subject i will give a predefined response to item j, is the sum of a subject scale value ui and an item scale value vj, i.e., ln(Pij / (1 − Pij)) = ui + vj.
There are various applications of the Rasch model in test construction, such as the item-mapping method (Wang, 2003) and the hierarchical measurement method (Maier, 2001).

1993; Reid, 1991; Shepard, 1995). Studies found that judges are able to rank order items accurately in terms of item difficulty, but they are not particularly accurate in estimating item performance for target examinee groups (Impara & Plake, 1998; National Research Council, 1999; Shepard, 1995). A fundamental flaw of the Angoff method is that it requires judges to perform the nearly impossible cognitive task of estimating the probability of minimally competent candidates (MCCs) answering each item in the pool correctly (Berk, 1996; National Academy of Education, 1993).
An item-mapping method, which applies the Rasch IRT model to the standard setting process, has
been used to remedy the cognitive deficiency in the Angoff method for multiple-choice licensure and
certification examinations (McKinley, Newman, & Wiser, 1996). The Angoff method limits judges to each
individual item while they make an immediate judgment of item performance for MCCs. In contrast, the
item-mapping method presents a global picture of all items and their estimated difficulties in the form of a
histogram chart (item map), which serves to guide and simplify the judges' process of decision making
during the cut score study. The item difficulties are estimated through application of the Rasch IRT model.
Like all IRT scaling methods, the Rasch estimation procedures can place item difficulty and candidate
ability on the same scale. An additional advantage of the Rasch measurement scale is that the difference
between a candidate's ability and an item's difficulty determines the probability of a correct response
(Grosse & Wright, 1986). When candidate ability equals item difficulty, the probability of a correct answer to
the item is .50. Unlike the Angoff method, which requires judges to estimate the probability of an MCC's
success on an item, the item-mapping method provides the probability (i.e., .50) and asks judges to
determine whether an MCC has this probability of answering an item correctly. By utilizing the Rasch
model's distinct relationship between candidate ability and item difficulty, the item-mapping method enables
judges to determine the passing score at the point where the item difficulty equals the MCC's ability level.
The item-mapping method incorporates item performance in the standard-setting process by
graphically presenting item difficulties. In item mapping, all the items for a given examination are ordered in
columns, with each column in the graph representing a different item difficulty. The columns of items are
ordered from easy to hard on a histogram-type graph, with very easy items toward the left end of the graph,
and very hard items toward the right end of the graph. Item difficulties in log odds units are estimated
through application of the Rasch IRT model (Wright & Stone, 1979). In order to present items on a metric
familiar to the judges, logit difficulties are converted to scaled values using the following formula: scaled
difficulty = (logit difficulty × 10) + 100. This scale usually ranges from 70 to 130.
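The conversion to the judge-friendly metric is a simple linear rescaling, sketched below under the assumption that item difficulties have already been estimated in logits.

```python
def scaled_difficulty(logit_difficulty: float) -> float:
    """Rescale a Rasch logit difficulty to the judge-friendly metric: (logit * 10) + 100."""
    return logit_difficulty * 10 + 100

# Items at -2, 0, and +2 logits map to 80, 100, and 120 on the item map.
print([scaled_difficulty(b) for b in (-2.0, 0.0, 2.0)])
```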
Rasch Hierarchical Measurement Method. In a study by Maier (2001), a hierarchical measurement model was developed that enables researchers to measure a latent trait variable and model the error variance corresponding to multiple levels. The Rasch hierarchical measurement model (HMM) results when a Rasch IRT model and a one-way ANOVA with random effects are combined; item response theory models and hierarchical linear models can be combined to model the effect of multilevel covariates on a latent trait. Through this combination, researchers may examine relationships between person-ability estimates and the person-level and contextual-level characteristics that may affect these ability estimates. Alternatively, it is also possible to model data obtained from the same individuals across repeated questionnaire administrations, which makes it possible to study the effect of person characteristics on ability estimates over time.
One benefit of item response theory is its treatment of reliability and error of measurement through item information functions, which are computed for each item (Lord, 1980). These functions provide a sound basis for choosing items in test construction. The item information function takes all item parameters into account and shows the measurement efficiency of the item at different ability levels. Another advantage of item response theory is the invariance of item parameters, which pertains to the sample-free nature of its results. In the theory, the item parameters are invariant when computed in groups of different abilities. This means that a uniform scale of measurement can be provided for use in different groups. It also means that groups, as well as individuals, can be tested with different sets of items appropriate to their ability levels, and their scores will be directly comparable (Anastasi & Urbina, 2002).
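For the Rasch model the item information function takes a particularly simple form, I(θ) = P(θ)(1 − P(θ)), which peaks at .25 where ability equals item difficulty. The sketch below computes it for a hypothetical item located at 0 logits.

```python
import numpy as np

def rasch_info(theta, beta):
    """Rasch item information: I(theta) = P(1 - P), maximal where theta == beta."""
    p = 1.0 / (1.0 + np.exp(-(theta - beta)))
    return p * (1 - p)

theta = np.linspace(-3, 3, 7)
print(np.round(rasch_info(theta, beta=0.0), 3))  # peaks at 0.25 when theta = 0
```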
Scaling Models
"Measurement is essentially concerned with the methods used to provide quantitative descriptions of the extent to which individuals manifest or possess specified characteristics" (Ghiselli, Campbell, & Zedeck, 1981, p. 2). "Measurement is the assigning of numbers to individuals in a systematic way as a means of representing properties of the individuals" (Allen & Yen, 1979, p. 2). "'Measurement' consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification)" (Nunnally & Bernstein, 1994, p. 3).
There are important aspects to consider in the process of measurement in psychometrics. First, the attribute of interest needs to be quantified; that is, numbers designate how much (or how little) of the attribute an individual possesses. Second, the attribute of interest must be quantified in a consistent and systematic way (i.e., standardization); that is, the measurement process is systematic enough that meaningful replication is possible. Finally, it is the attributes of individuals (or objects) that are measured, not the individuals per se.
Levels of Measurement
As the definition of Nunnally and Bernstein (1994) suggests, by systematically measuring the attribute of interest, individuals can be either classified or scaled with regard to that attribute. Whether one engages in classification or scaling depends in large part on the level of measurement used to assess the construct. For example, if the attribute is measured on a nominal scale of measurement, then it is only possible to classify individuals as falling into one or another mutually exclusive category (Agresti & Finlay, 2000). This is because the different categories (e.g., men versus women) represent only qualitative differences. Nominal scales are used as measures of identity (Downie & Heath, 1984). When gender is coded, such as males coded 0 and females 1, the values do not have any quantitative meaning; they are simply labels for the gender categories. At the nominal level of measurement, there are a variety of sorting techniques, in which subjects are asked to sort the stimuli into different categories based on some dimension.
Some data reflect the rank order of individuals or objects, such as a scale evaluating the beauty of a person from highest to lowest (Downie & Heath, 1984). This represents an ordinal scale of measurement, where objects are simply rank ordered. Ranking does not indicate how much more of the attribute one object has than another, but it can be determined that A has more of it than B if A is ranked higher than B. At the ordinal level of measurement, the Q-sort method, paired comparisons, Guttman's scalogram, Coombs's unfolding technique, and a variety of rating scales can be used. The major task of the subject is to rank order items from highest to lowest or from weakest to strongest.
The interval scale of measurement has equal intervals between degrees on the scale. However, the zero point on the scale is arbitrary: 0 degrees Celsius represents the point at which water freezes at sea level, so zero on the scale does not represent "true zero," which in this case would mean a complete absence of heat. In determining the area of a table, a ratio scale of measurement is used, because zero does represent "true zero."
When the construct of interest is measured at the nominal (i.e., qualitative) level of measurement, objects are only classified into categories. As a result, the types of data manipulations and statistical analyses that can be performed on the data are very limited. For descriptive statistics, it is possible to compute frequency counts or determine the modal response (i.e., category), but not much else. However, if it is at least possible to rank order the objects based on the degree to which they possess the construct of interest, then it is possible to scale the construct. In addition, higher levels of measurement allow for more in-depth statistical analyses. With ordinal data, for example, statistics such as the median, range, and interquartile range can be computed (Downie & Heath, 1984). When the data are interval level, it is possible to calculate statistics such as means, standard deviations, variances, and the various statistics of shape (e.g., skewness and kurtosis). With interval-level data, it is important to know the shape of the distribution, as different-shaped distributions imply different interpretations for statistics such as the mean and standard deviation.
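A small sketch of this point about permissible statistics, using hypothetical data at each level: counts and the mode for nominal codes, the median and interquartile range for ordinal ranks, and the mean, standard deviation, and skewness for interval scores.

```python
import numpy as np
from collections import Counter
from scipy import stats

gender = ["M", "F", "F", "M", "F"]                 # nominal: mode/frequency only
ranks = np.array([1, 2, 2, 3, 5, 4])               # ordinal: median, IQR
scores = np.array([72.0, 85.0, 90.0, 65.0, 78.0])  # interval: mean, SD, skewness

print(Counter(gender).most_common(1))              # modal category
print(np.median(ranks), np.percentile(ranks, 75) - np.percentile(ranks, 25))
print(scores.mean(), scores.std(ddof=1), stats.skew(scores))
```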
At the interval and ratio levels of measurement, there are direct estimation, the method of bisection, and Thurstone's methods of comparative and categorical judgment. With these methods, subjects are asked not only to rank order items but also to help determine the magnitude of the differences among items. With Thurstone's method of comparative judgment, subjects compare every possible pair of stimuli and select the item within the pair that is the better item for assessing the construct. Thurstone's method of categorical judgment, while less tedious for subjects when there are many stimuli to assess, in that they simply rate each stimulus (not each pair of stimuli), does require more cognitive energy for each rating provided. This is because the subject-matter expert must now estimate the actual value of the stimulus.
(1) Method of Adjustment. An experimental paradigm which allows the subject to make small adjustments to a comparison stimulus until it matches a standard stimulus. The intensity of the stimulus is adjusted until the target is just detectable.
(2) Method of Limits. The intensity is adjusted in discrete steps until the observer reports that the stimulus is just detectable.
(3) Method of Constant Stimuli. The experimenter has control of the stimuli. Several stimulus values are chosen to bracket the assumed threshold, and each stimulus is presented many times in random order. The psychometric function is derived from the proportion of detection responses.
(4) Staircase Method. Used to determine a threshold as quickly as possible; a compromise between the method of limits and the method of constant stimuli (a simple simulation appears after this list).
(5) Method of Forced Choice (2AFC). The observer must choose between two or more options. Good for cases where observers are less willing to guess.
(6) Method of Average Error. The subject is presented with a standard stimulus and then undergoes trials to match the target stimulus presented.
(7) Rank order. Requires the subject to rank stimuli from most to least with respect to some attribute of judgment or sentiment.
(8) Paired comparison. A subject is required to rank stimuli two at a time in all possible pairs.
(9) Successive categories. The subject is asked to sort a collection of stimuli into a number of distinct piles or categories, which are ordered with respect to a specified attribute.
(10) Ratio judgment. The experimenter selects a standard stimulus and a number of variable stimuli that differ quantitatively from the standard stimulus on a given characteristic. The subject selects, from the range of variable stimuli, the stimulus whose amount of the given characteristic corresponds to the ratio value.
(11) Q sort. Subjects are required to sort the stimuli into an approximately normal distribution, with it being specified how many stimuli are to be placed in each category.
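The staircase method in item (4) is straightforward to simulate. The sketch below assumes a crude all-or-none observer with a true threshold of 5.0 (a hypothetical value): the intensity steps down after each detection and up after each miss, so the track converges on and oscillates around the threshold.

```python
TRUE_THRESHOLD = 5.0   # hypothetical observer threshold, assumed for illustration
STEP = 0.5

def detected(intensity: float) -> bool:
    """Crude observer model: detection occurs whenever intensity reaches the threshold."""
    return intensity >= TRUE_THRESHOLD

intensity, track = 8.0, []
for _ in range(20):
    track.append(intensity)
    # 1-up/1-down rule: step down after a detection, step up after a miss.
    intensity += -STEP if detected(intensity) else STEP

print(track)  # descends to 5.0, then oscillates between 4.5 and 5.0
```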
Many issues arise when performing a scaling study. One important factor is who is selected to participate in the study. Many scaling studies involve some psychological (latent) dimension of people without any direct counterpart "physical" dimension. When people are scaled (psychometrics), it is typical to obtain a random sample of individuals from the population to which one wishes to generalize. In psychometrics, participants are asked to provide their individual feelings, attitudes, and/or personal ratings toward a particular topic. In doing so, one is able to determine how individuals differ on the construct of interest. With stimulus scaling, however, the researcher sums across raters within a given stimulus (e.g., a question) in order to obtain a rating of each stimulus. Once the researcher is confident that each stimulus did, in fact, tap into the construct, and has some estimate of the level at which it did so, only then should the researcher feel confident in presenting the now-scaled stimuli to a random sample of relevant participants for psychometric purposes. Thus, with psychometrics, items (i.e., stimuli) are summed across within an individual respondent in order to obtain his or her score on the construct.
The major requirement in scaling for people is that the variables should be monotonically related to each other. A relationship is monotonic if higher scores on one scale correspond to higher scores on another scale, regardless of the shape of the curve (Nunnally, 1970). In scaling for people, many items are used on a test to minimize measurement error; the specificity of items is averaged out when they are combined. By combining items, one can make relatively fine distinctions between people. The problem of scaling people with respect to attributes is then one of collapsing responses to a number of items so as to obtain one score for each person.
One variety of scaling model for people is the deterministic model, which assumes that there is no error in item trace lines. A trace line shows that a person with a high level of ability would have a probability close to 1.0 of producing a correct response. The model assumes that up to a point on the attribute the probability of response alpha is zero, and beyond that point the probability of response alpha is 1.0. Each item has a biserial correlation of 1.0 with the attribute, and consequently each item perfectly discriminates at a particular point of the attribute.

There are several varieties of scaling models for people, including Thurstone scaling, the Likert scale, the Guttman scale, and semantic differential scaling.
(1) Thurstone scaling. Some 300 judges rate 100 statements on a particular issue on an 11-point scale. A subset of statements is then shown to respondents, and their score is the mean of the ratings for the statements they select.
(2) Likert scale. Respondents are requested to state their level of agreement with a series of attitude statements. Each scale point is given a value (say, 1-5), and the person is given the score corresponding to his or her degree of agreement. Often a set of Likert items is summed to provide a total score for the attitude (a scoring sketch appears after this list).
(3) Guttman scale. This involves producing a set of statements that form a natural hierarchy. A positive answer to the item at one point on the hierarchy assumes positive answers to all the statements below it (e.g., a disability scale). This gets over the problem of identical item totals being formed by different sets of responses.
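A minimal sketch of Likert scoring as described in item (2), using hypothetical 5-point responses and one reverse-keyed statement:

```python
# Hypothetical responses of one person to four 5-point Likert items.
responses = [4, 5, 2, 4]
reverse_keyed = [False, False, True, False]  # item 3 is worded negatively

scored = [
    (6 - r) if rev else r                    # reverse-key: 1<->5, 2<->4
    for r, rev in zip(responses, reverse_keyed)
]
total = sum(scored)                          # summated attitude score
print(scored, total)                         # [4, 5, 4, 4] 17
```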
Scaling Responses
Scaling responses concerns the decisions arrived at through subjects' responses to stimuli. Such response options may include requiring participants to make comparative judgments (e.g., which is more important, A or B?), subjective evaluations (e.g., strongly agree to strongly disagree), or absolute judgments (e.g., how hot is this object?). Different response formats may well influence how one writes and edits stimuli. In addition, they may also influence how one evaluates the quality or "accuracy" of the response. For example, with absolute judgments, standards of comparison are used, especially if subjects are being asked to rate physical characteristics such as weight, height, or the intensity of sound or light. With attitudes and psychological constructs, such "standards" are hard to come by. There are a few options (e.g., Guttman's scalogram and Coombs's unfolding technique) for simultaneously scaling people and stimuli, but more often than not only one dimension is scaled at a time. Usually the stimuli are scaled first (or a well-established measure is sought) before one has confidence in scaling individuals on the stimuli.
With unidimensional scaling, as described previously, subjects are asked to respond to stimuli with regard to a particular dimension. With multidimensional scaling (MDS), however, subjects are typically asked to give just their general impression or broad rating of the similarities or differences among stimuli. Subsequent analyses, using Euclidean spatial models, "map" the products in multidimensional space. The multiple dimensions are then "discovered" or "extracted" with multivariate statistical techniques, thus establishing which dimensions the consumer is using to distinguish the products. MDS can be particularly useful when subjects are unable to articulate why they like a stimulus, yet they are confident that they prefer one stimulus to another.
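As a sketch of the "mapping" step, the following implements classical (Torgerson) metric MDS in plain NumPy: a matrix of hypothetical dissimilarity judgments among four stimuli is double-centered and eigendecomposed to recover two coordinates per stimulus. This is one simple MDS variant, not the only procedure used in practice.

```python
import numpy as np

# Hypothetical symmetric dissimilarity judgments among 4 stimuli.
D = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.5],
    [5.0, 4.0, 1.5, 0.0],
])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances

eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
k = 2
coords = eigvecs[:, order[:k]] * np.sqrt(np.maximum(eigvals[order[:k]], 0))
print(np.round(coords, 2))               # 2-D "map" of the four stimuli
```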
References
Agresti, A. & Finlay, B. (1997). Statistical methods for the social sciences (3rd ed.). New Jersey: Prentice
Hall.
Anastasi, A. & Urbina, S. (2002). Psychological testing. Prentice Hall: New York.
Andrich, D. (1998). Rasch models for measurement. Sage University: Sage Publications.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
Bejar, I. I. (1983). Subject matter experts' assessment of item statistics. Applied Psychological
Measurement, 7, 303-310.
Berk, R. A. (1996). Standard setting: the next generation. Applied Measurement in Education, 9, 215-235.
Chang, L. (1999). Judgmental item analysis of the Nedelsky and Angoff standard-setting methods. Applied
Measurement in Education, 12, 151-166.
Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA:
Wadsworth.
Dobby, J., & Duckworth, D. (1979). Objective assessment by means of item banking. Schools Council Examination Bulletin, 40, 1-10.
Downie, N.M., & Heath, R.W. (1984). Basic statistical methods (5th ed.). New York: Harper & Row
Publishers.
Fischer, G. H. (1974). Derivations of the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 15-38). New York: Springer-Verlag.
Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. New
York: W. H. Freeman.
Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing
levels for a group of borderline candidates. Applied Measurement in Education, 12, 13-28.
Grosse, M. E., & Wright, B. D. (1986). Setting, evaluating, and maintaining certification standards with the
Rasch model. Evaluation and the Health Professions, 9, 267-285.
Hambleton, R. K. (2000). Emergence of item response modeling in instrument development and data
analysis. Medical Care, 38, 60-65.
Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of Educational
Measurement, 34, 353-366.
Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in
the Angoff standard setting method. Journal of Educational Measurement, 35, 69-81.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of
Educational Research, 64, 425-461.
Kaplan, R. M. & Saccuzo, D. P. (1997). Psychological testing: Principles, applications and issues. Pacific
Grove: Brooks Cole Pub. Company.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ:
Erlbaum.
Magno, C. (2003). Relationship between attitude towards technical education and academic achievement in mathematics and science of the first and second year high school students, Caritas Don Bosco School, SY 2002-2003. Unpublished master’s thesis, Ateneo de Manila University, Quezon City, Philippines.
Maier, K. S. (2001). A Rasch hierarchical measurement model. Journal of Educational and Behavioral
Statistics, 26, 307-331.
McKinley, D. W., Newman, L. S., & Wiser, R. F. (1996, April). Using the Rasch model in the standard-setting process. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Future of selected methods.
Applied Measurement in Education, 1, 261-275.
National Academy of Education (1993). Setting performance standards for student achievement. Stanford,
CA: Author.
National Research Council (1999). Setting reasonable and useful performance standards. In J. W.
Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.), Grading the nation's report card: Evaluating NAEP and
transforming the assessment of educational progress (pp. 162-184). Washington, DC: National Academy
Press.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.) New York: McGraw-Hill.
Plake, B. S. (1998). Setting performance standards for professional licensure and certification. Applied
Measurement in Education, 11, 65-80.
Reid, J. B. (1991). Training judges to generate standard-setting data. Educational Measurement: Issues
and Practice, 10, 11-14.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark:
Danish Institute for Educational Research.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of
scientific statements. In G. M. Copenhagen (ed.). The Danish yearbook of philosophy (pp.58-94).
Munksgaard.
Shepard, L. A. (1995). Implications for standard setting of the national academy of education evaluation of
the national assessment of educational progress achievement levels. Proceedings of Joint Conference on
Standard Setting for Large-Scale Assessments (pp. 143-160). Washington, DC: The National Assessment
Governing Board (NAGB) and the National Center for Education Statistics (NCES).
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Wang, N. (2003). Use of the Rasch IRT model in standard setting: An item-mapping method. Journal of Educational Measurement, 40, 231.
Van der Linden, W. J., & Hambleton, R. K. (2004). Item response theory: Brief history, common models, and extension. New York: McGraw-Hill.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.
Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.
Exercise:
Calibrate the item difficulty and person ability of the scores in a Reading Comprehension test with 19 items administered to 15 Korean students. After performing the Rasch model analysis, determine item difficulty using the classical test theory approach. Compare the results.
Case  Items 1-19 (1 = correct, 0 = incorrect)
      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
A 0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 1 0 1
B 0 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0
C 0 0 0 1 0 1 0 0 0 0 1 1 0 1 1 0 1 0 1
D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
E 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
F 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 0
G 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1
H 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1
I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1
K 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0
L 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1
M 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1
N 0 0 1 1 1 1 1 1 1 0 0 1 0 1 0 0 1 0 1
O 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1
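The next lesson describes the “Analysis of Test Data” program bundled with the book; for readers working outside it, the following Python sketch computes the classical item difficulties (proportion correct) and approximates the Rasch item difficulties with the PROX normal-approximation method described by Wright and Stone (1979). A full calibration would iterate beyond PROX, so treat the output as a first approximation for comparing the two approaches.

import numpy as np

# 15 students (A-O) x 19 items, 1 = correct (the table above).
data = np.array([
    [0,1,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1,0,1],  # A
    [0,0,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0],  # B
    [0,0,0,1,0,1,0,0,0,0,1,1,0,1,1,0,1,0,1],  # C
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1],  # D
    [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],  # E
    [0,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,1,1,0],  # F
    [1,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,1],  # G
    [0,0,1,1,0,1,0,0,0,0,0,1,1,0,0,0,0,1,1],  # H
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],  # I (zero score)
    [0,0,1,1,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1],  # J
    [1,0,0,1,0,0,0,0,0,1,0,1,1,1,0,1,0,1,0],  # K
    [0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1],  # L
    [0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,1,0,1],  # M
    [0,0,1,1,1,1,1,1,1,0,0,1,0,1,0,0,1,0,1],  # N
    [0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,1],  # O
])

# Classical test theory: item difficulty = proportion of examinees correct.
ctt_p = data.mean(axis=0)

# Rasch PROX: drop zero- and perfect-score persons (they have no finite
# logit), convert item and person scores to logits, then expand for spread.
r = data.sum(axis=1)
kept = data[(r > 0) & (r < data.shape[1])]
n, L = kept.shape

p_i = kept.mean(axis=0)
x = np.log((1 - p_i) / p_i)                             # item logits (higher = harder)
y = np.log(kept.sum(axis=1) / (L - kept.sum(axis=1)))   # person logits

U, V = x.var(), y.var()
item_expand = np.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))
person_expand = np.sqrt((1 + U / 2.89) / (1 - U * V / 8.35))

rasch_d = item_expand * (x - x.mean())   # item difficulty in logits
rasch_b = person_expand * y              # person ability (mean item difficulty = 0)

for i in range(L):
    print(f"Item {i+1:2d}:  CTT p = {ctt_p[i]:.2f}   Rasch d = {rasch_d[i]:+.2f}")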
References
Anastasi, A. & Urbina, S. (2002). Psychological testing (7th ed.). NJ: Prentice Hall.
DiLeonardi, J. W., & Curtis, P. A. (1992). What to do when the numbers are in: A user’s guide to statistical data analysis in the human services. Chicago, IL: Nelson-Hall.
Magno, C. (2007). Exploratory and confirmatory factor analysis of parental closeness and
multidimensional scaling with other parenting models. The Guidance Journal, 36, 63-89.
Magno, C., Lynn, J., Lee, K., & Kho, R. (in press). Parents’ school-related behavior: Getting involved with a grade school and college child. The Guidance Journal.
Magno, C., Tangco, N., & Tan, C. (2007). The role of metacognitive skills in developing critical thinking. Paper presented at the Asian Association of Social Psychology conference, Universiti Malaysia, Kota Kinabalu, Sabah, Malaysia, July 25-28.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen,
Denmark: Danish Institute for Educational Research.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality
and validity of scientific statements. In G. M. Copenhagen (ed.). The Danish yearbook of
philosophy (pp.58-94). Munksgaard.
Van der Linden, W. J., & Hambleton, R. K. (2004). Item response theory: Brief history, common models, and extension. New York: McGraw-Hill.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA
Press.
Lesson 4
Using Computer Software in Analyzing Test Items
The “Analysis of Test Data” program that comes with this book is used to analyze test data and to estimate the reliability and validity of a test.
How to install:
3. Click the “Change Directory” button and choose the folder where you want to install the program.
8. A dialogue box may appear saying “The destination file is in use. Please ensure that all applications are closed.” Press the “Ignore” button, then click the “Yes” button to continue.
Look for the program “Analysis of Test Data” and click to open it.
The menu will appear first. Start by clicking the statistical analysis you would like to perform.
Note: The program can only input and analyze up to 500 respondents and up to 100 test items.
The Test-Retest button lets you analyze the test-retest reliability of the test data. This is used to find the reliability of test scores by repeating the same test on a second occasion.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
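For readers who want to verify the computation by hand, test-retest reliability is the Pearson correlation between the two administrations. A minimal Python sketch with hypothetical scores:

import numpy as np

# Hypothetical scores of eight students on the same test given twice.
first  = np.array([12, 15, 9, 20, 14, 11, 18, 16])
second = np.array([13, 14, 10, 19, 15, 10, 17, 18])

# The Pearson r between administrations is the coefficient of stability.
r_tt = np.corrcoef(first, second)[0, 1]
print(f"Test-retest reliability: {r_tt:.2f}")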
The Split-Half button allows you to analyze the split-half reliability of the test data. This is used when two sets of scores are obtained for each participant who took a test by dividing the test into equivalent halves.
6. Make sure that a blinking cursor appears in the cell where you want to put the score.
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
9. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
10. Once you are done entering the scores, click the “Proceed with the Analysis” button located above the screen or the button located below.
11. Select the item number, then click the arrow button for the set in which you want to place the selected item. Make sure that the selected item number is shaded gray before clicking the arrow sign.
12. If you want to remove an item from a set, select the item number you want to move (again making sure it is shaded gray), and then click the reverse arrow button.
13. Click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
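Computationally, split-half reliability is the correlation between the two half-test scores, stepped up with the Spearman-Brown formula to estimate the reliability of the full-length test. A short Python sketch using a hypothetical odd-even split (the program lets you assign items to sets yourself):

import numpy as np

# Hypothetical item responses (1/0) of six students on a 10-item test.
items = np.array([
    [1,1,1,1,1,1,1,1,1,1],
    [1,1,0,1,1,0,1,1,0,1],
    [1,0,1,0,1,1,0,1,1,0],
    [0,1,0,1,0,0,1,0,0,1],
    [0,0,1,0,0,1,0,0,1,0],
    [0,0,0,0,0,0,0,0,0,0],
])

odd  = items[:, 0::2].sum(axis=1)   # scores on items 1, 3, 5, ...
even = items[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, ...

r_half = np.corrcoef(odd, even)[0, 1]
# Spearman-Brown correction: reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, corrected = {r_full:.2f}")  # 0.71 and 0.83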
The Parallel Form button allows you to use the parallel-form technique to compute the reliability of test data. This is used when a person can be tested with one form on the first occasion and with another, equivalent form on the second.
5. Make sure that a blinking cursor appears in the cell where you want to put the score.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
7. If you want to add cases, enter the number of cases you want to add in the text box provided, and then press the “Enter” key to show the added cases.
8. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
The Cronbach’s Alpha button allows you to analyze the Cronbach’s alpha reliability of the test data. This is used to determine the internal consistency of participants’ responses to all items in the test.
6. Make sure that a blinking cursor appears in the cell where you want to put the score.
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
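The statistic itself is straightforward to compute: alpha = (k / (k - 1)) x (1 - sum of item variances / variance of total scores). A minimal Python sketch with hypothetical Likert-type data:

import numpy as np

# Hypothetical responses of six students to five Likert-type items (1-5).
X = np.array([
    [4, 5, 4, 4, 5],
    [3, 3, 2, 3, 3],
    [5, 5, 4, 5, 4],
    [2, 1, 2, 2, 1],
    [4, 4, 5, 4, 4],
    [3, 2, 3, 2, 3],
])

k = X.shape[1]
item_var = X.var(axis=0, ddof=1)          # sample variance of each item
total_var = X.sum(axis=1).var(ddof=1)     # sample variance of total scores
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")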
The Kuder-Richardson button allows you to analyze the Kuder-Richardson reliability of the test data. This is used to determine the internal consistency of participants’ binary responses to all items in the test.
6. Correct answers should be marked “1” and wrong answers “0” in each cell.
7. Make sure that a blinking cursor appears in the cell where you want to put the score.
8. To move the cursor to other cells, press the “Tab” key or simply click the cell.
9. If you want to add cases or items, click the corresponding button to make the necessary changes. Click the “Yes” button, enter the number of cases or items you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
10. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
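For binary data the corresponding index is KR-20, which replaces the item variances with p(1 - p), where p is the proportion answering the item correctly. A minimal Python sketch with hypothetical data (population variances are used throughout; some texts use n - 1, which changes the value slightly):

import numpy as np

# Hypothetical binary responses (1 = correct) of six students to six items.
X = np.array([
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
])

k = X.shape[1]
p = X.mean(axis=0)                      # proportion correct per item
pq = (p * (1 - p)).sum()                # sum of binary item variances
total_var = X.sum(axis=1).var()         # population variance of total scores
kr20 = (k / (k - 1)) * (1 - pq / total_var)
print(f"KR-20 = {kr20:.2f}")            # about 0.79 for these data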
The Interrater button allows you to analyze the interrater reliability of the test data. This is used to determine the concordance of raters’ scores.
6. Make sure that a blinking cursor appears in the cell where you want to put the score.
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
8. If you want to add cases, click the corresponding button to make the necessary changes. Click the “Yes” button, enter the number of cases or raters you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
9. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
10. Once you are done putting in all the scores and completing the cases, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
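The index the program reports is not shown in this excerpt; one common index of concordance among several raters is Kendall's coefficient of concordance (W), sketched in Python below under the assumption of no tied ranks (ties require a correction term):

import numpy as np
from scipy.stats import rankdata

# Hypothetical scores given by 3 raters (rows) to 5 examinees (columns).
scores = np.array([
    [78, 85, 90, 70, 88],
    [75, 80, 92, 68, 85],
    [80, 84, 89, 72, 90],
])

m, n = scores.shape
ranks = np.apply_along_axis(rankdata, 1, scores)   # rank each rater's scores
R = ranks.sum(axis=0)                              # rank sum per examinee
S = ((R - R.mean()) ** 2).sum()
W = 12 * S / (m ** 2 * (n ** 3 - n))               # 0 = no agreement, 1 = perfect
print(f"Kendall's W = {W:.2f}")                    # near 1 for these raters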
The Criterion-Reference button lets you analyze the criterion-related validity of the test data. This is used to indicate the effectiveness of a test in predicting an individual’s performance in the future.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
The Concurrent button lets you analyze the concurrent validity of the test data. This is used when both test measures are obtained at the same time and their relationship coincides with a theory.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
Once you are done entering all the scores, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
The Convergent and Discriminant buttons let you analyze the convergent and discriminant validity of the test data. These are used to show that the test correlates with variables with which it should theoretically correlate (convergent) and does not correlate with variables from which it should differ (discriminant, labeled “Divergent” in the program).
Once the “Convergent” or “Divergent” button is clicked, three submenu buttons will appear.
5. Make sure that a blinking cursor appears in the cell where you want to put the score.
6. Enter the total scores of each participant in each column per group.
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
9. If you want to reduce the number of cases, click the corresponding button to go back to the submenu and do the steps again.
10. Once you are done putting in all the scores and completing the cases, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
SAMPLE INPUT for Convergent and Discriminant Validity: Comparing 2 Dependent Groups
Analysis:
7. To move the cursor to other cells, press the “Tab” key or simply click the cell.
8. If you want to add cases, enter the number of participants you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
9. If you want to reduce the number of cases, click the corresponding button to go back to the submenu and do the steps again.
10. Once you are done putting in all the scores and completing the cases, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
SAMPLE INPUT for Convergent and Discriminant Validity: Comparing Two Variables
Analysis:
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
7. If you want to add cases, click the corresponding button to make the necessary changes. Click the “Yes” button, enter the number of participants you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
8. If you want to reduce the number of cases, click the corresponding button to go back to the submenu and do the steps again.
9. Once you are done putting in all the scores and completing the cases, click the “Proceed with the Analysis” button located above the screen or the button located below to reveal the results.
SAMPLE RESULTS for Convergent and Discriminant Validity: Comparing Two Variables
Analysis:
The Item Analysis button uses the classical test theory approach in determining item difficulty and item discrimination based on the proportions of the high and low groups.
6. To move the cursor to other cells, press the “Tab” key or simply click the cell.
7. If you want to add cases, click the corresponding button to make the necessary changes. Click the “Yes” button, enter the number of participants you want to add in the text boxes provided, and then press the “Enter” key to show the added cases.
8. If you want to reduce the number of cases, click the corresponding button to go back to the main menu and do the steps again.
9. Once you are done putting in all the scores and completing the cases, click the “Start Analysis” button.
To save the results:
2. Press the folder icon to choose the folder where you want to save the file.
To print the results:
2. Click the “Print” icon on the upper left of the screen.
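If you want to reproduce the high-low group computation yourself, here is a Python sketch under a common convention (not necessarily the program's exact rule): the upper and lower groups are the top and bottom 27% of total scorers, difficulty is the overall proportion correct, and discrimination is the difference between the groups' proportions correct.

import numpy as np

# Hypothetical binary responses (rows = examinees, columns = items).
X = np.array([
    [1,1,0,1,0], [1,1,1,1,0], [1,0,0,0,0], [1,1,1,0,1],
    [0,1,0,0,0], [1,1,1,1,1], [0,0,0,0,0], [1,1,0,1,1],
    [1,0,1,0,0], [0,1,0,1,0],
])

total = X.sum(axis=1)
order = np.argsort(total)                  # examinees sorted by total score
k = max(1, int(round(0.27 * len(X))))      # size of upper/lower groups
low, high = X[order[:k]], X[order[-k:]]

difficulty = X.mean(axis=0)                               # p-value per item
discrimination = high.mean(axis=0) - low.mean(axis=0)     # D index per item

for i, (p, d) in enumerate(zip(difficulty, discrimination), start=1):
    print(f"Item {i}: difficulty = {p:.2f}, discrimination = {d:+.2f}")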
NOTE:
Make sure that you have encoded the data in Microsoft Excel XP or a lower version, using the correct format, for the file to be opened by the program.
2. When the upload button is clicked, the data will be entered into the cases in the program. Make sure that the sheet name matches the type of analysis being used (check “How to encode and save data in Microsoft Excel” for the template guide).
3. Make sure that you know how to use the Excel format. Check the folder where you saved and installed the program to familiarize yourself with the given templates.
2. Five files will appear on the screen. Double-click the “Template” Excel file to view the formats required in encoding data in Microsoft Excel. Make sure to follow the formats given so that your encoded data can be opened using the “Analysis of Test Data” program.
3. After clicking the “Template” file, a Microsoft Excel file with multiple sheets will open.
4. Make sure to save your encoded data with the correct sheet name in Microsoft Excel.
Click the corresponding sheet tab at the bottom of the screen to view the correct template for encoding the data to be analyzed with each procedure:
- Cronbach’s Alpha reliability analysis;
- Interrater reliability analysis;
- Kuder-Richardson reliability analysis;
- Split-Half reliability analysis;
- Test-Retest reliability, Parallel Form reliability, Criterion-reference validity, Concurrent validity, Convergent (Correlate 2 Variables), and Divergent (Correlate 2 Variables), which share one template;
- Comparing Two Independent Groups validity analysis;
- Comparing Two Dependent Groups validity analysis;
- the Item Analysis procedure.
Chapter 4
Developing a Teacher-Made Test
Objectives
1. Explain the theories and concepts that rationalize the practice of assessment.
2. Make a table of specifications of the test items.
3. Design pen-and-paper tests that are aligned to the learning intents.
4. Justify the advantages and disadvantages of any pen-and-paper test.
5. Evaluate the test items according to the guidelines.
Lessons
Lesson 1
The Test Blueprint
To make sure that you have these domains accounted for in your assessment design, engage in making a table of specifications, one that will allow you to explicitly indicate what content to cover in your test, which knowledge dimensions to focus on, and which cognitive processes to pay attention to.
The Table of Specifications is a matrix where the rows consist of the specific topics or skills (content or skill areas) and the columns are the learning behaviors or competencies that we
desire to measure. Although we can also add more elements in the matrix, such as Test
Placement, Equivalent Points, or Percent values of items, the conventional prototype table of
specifications may look like this:
[Prototype table of specifications (figure): rows are content or skill areas; columns are cognitive processes such as Knowledge and Application. The TOTAL row in the example reads 2, 11, 7, and 20, with callouts identifying the number of items measuring Knowledge, the number of items on solving linear equations measuring Application, and the total number of test items.]
As you have seen in the above table of specifications, only three cognitive processes are
indicated. This means that if you use the old Bloom’s taxonomy of behavioral objectives, include
only those levels that you wish to measure in the test, although it is recommended that more than
a single process be measured in a test, depending, of course, on your purpose of testing.
As a test blueprint, the table of specifications ensures that the teacher sees all the essential details of testing and measuring student learning. It assures the teacher that the content areas (or skill areas) and the levels of behavior in which learning is expected to be anchored are measured. The
test’s degree of difficulty may also be seen in the table of specifications. When the distribution of
test items is concentrated in the higher-order cognitive behaviors (analysis, synthesis,
evaluation), the test’s difficulty level is higher as compared to when the items are concentrated in
the lower-order cognitive behaviors (knowledge, comprehension, application).
As you have learned in Chapter 2 of this book, there are many taxonomic tools that may
be used in our instructional planning. The taxonomic tool for planning the test should be consistent with the taxonomy of learning objectives used in the overall instructional plan.
Understandably, designing the table of specifications using any taxonomic tool will require a
little of our time, effort, and other personal and motivational resources. Before we may be
tempted to develop pen-and-paper test items without first preparing our table of specifications,
and run the risk of not actually evaluating our students on the basis of our learning intents, we
need to first brush up on our understanding of the instrumental function of the table of
specifications as a blueprint for our test, and convince ourselves that this is an important process
in any test development activity. In developing the table of specifications, we suggest that you do
not yet think of the types of pen-and-paper test you wish to give. Instead, just focus on planning your test in terms of your assessment domain.
Table of Specifications
The Table of Specifications (TOS) is a blueprint for selecting appropriate test items. The TOS can be one-grid or two-grid. A one-grid table of specifications only allows one to indicate the number of items in a test across the competencies or topics. In a two-grid TOS, the percentages of the cognitive skills and the time frame for each topic or competency are also indicated.
Content is shown on one axis and the cognitive (and/or affective) domain on the other:

                Cognitive Domain
Content    Knowledge   Comprehension   Application
I.
II.
III.
Example

Weight (Time Frame)   Content Outline                     Knowledge 30%   Comprehension 40%   Application 30%   No. of items by content area
35%                   1. Table of specifications          1               4                   4                 9
30%                   2. Test and item characteristics    2               3                   3                 8
10%                   3. Test layout                      1               1                   0                 2
5%                    4. Test instructions                0               1                   0                 1
5%                    5. Reproducing the test             1               0                   0                 1
5%                    6. Test length                      1               0                   1                 2
10%                   7. Scoring the test                 2               1                   0                 3
                      TOTAL                               8               10                  8                 26
Number of items per cell = (given time for the topic / total time) x percentage of cognitive skill x total number of items
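As a worked illustration of this formula, here is a short Python sketch using the weights and 26-item total from the example table above. Because of rounding, the computed cells will not exactly match the book's example (which reads 1, 4, 4 for the first topic); in practice the teacher adjusts a few cells by judgment.

topics = {
    "Table of specifications": 0.35,
    "Test and item characteristics": 0.30,
    "Test layout": 0.10,
    "Test instructions": 0.05,
    "Reproducing the test": 0.05,
    "Test length": 0.05,
    "Scoring the test": 0.10,
}
cognitive = {"Knowledge": 0.30, "Comprehension": 0.40, "Application": 0.30}
total_items = 26

for topic, w in topics.items():
    # Items per cell = topic weight x cognitive-skill weight x total items.
    cells = {c: round(w * cw * total_items) for c, cw in cognitive.items()}
    print(f"{topic}: {cells} (topic total ~ {round(w * total_items)})")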
Test Length
- The test must be of sufficient length to yield reliable scores.
- The longer the test, the more reliable the results. This also supports the validity of the test, because a test cannot be valid unless it is reliable.
- For grade school, one must consider the stamina and attention span of the pupils.
- The test should be long enough to be adequately reliable and short enough to be administered in the time available.
Test Instructions
- It is the function of the test instructions to furnish the learning experiences needed to enable each examinee to understand clearly what he is being asked to do.
- Instructions may be oral or written; a combination of written and oral instructions is probably desirable, except with very young children.
- Instructions should be clear, concise, and specific.
Test Layout
- The arrangement of the test items influences the speed and accuracy of the examinee.
- Utilize the space available while retaining readability.
- Items of the same type should be grouped together.
- Arrange test items from easiest to most difficult as a means of reducing test anxiety.
- The test should be ordered first by type, then by content.
- Each item should be completed in the column and page in which it is started.
- If reference material is needed, it should occur on the same page as the item.
- If you are using numbers to identify items, it is better to use letters for the options.
Lesson 2
Designing Selected-Response Items
When you are done with your test blueprint, you are now ready to start developing your
test items. For this phase of test development, you will need to decide what types or methods of
pen-and-paper assessment you wish to design. To aid you in this process, we will now discuss
some of the common pen-and-paper types of test and the basic guidelines in the formulation of
items for each type.
In deciding about the assessment method to use for a pen-and-paper test, you choose which among the selected-response or the constructed-response types would be appropriate for your blueprint. Selected-response tests use item types that require test takers to respond by choosing an option from a list of alternatives. Common types of selected-response tests are binary-choice, multiple-choice, and matching tests.
Binary-choice Items
The binary-choice test offers students the opportunity to choose between two options for
an answer. The items must be responded to by choosing one of two categorically distinct
alternatives. The true-false test is an example of this type of selected-response test. This type of
selected-response test typically contains short statements that represent less complex
propositions, and therefore, is efficient in assessing certain levels of students’ learning in a
reasonably short period of testing time. In addition to this, a binary-choice test may cover a wider
content area in a brief assessment session (Popham, 2005).
To assist you in developing binary-choice items, here are some guidelines with brief
descriptions of each. These guidelines may not capture everything that you need to be mindful of
in developing teacher-made tests. These are just the basics of what you need to know. It is
important that you also explore on other aspects of test development, including the context in
which the test is to be used, among others.
Make the Instructions Explicit. Basic in a pen-and-paper test is that the instructions indicate the task that students need to do and the credit they can earn from every correct answer. However, there is one more thing you need to indicate in your instructions for a binary-choice test
– the reference of validity or reference of truth. When you ask your students to judge whether the
statement is true or false, correct or erroneous, or valid or invalid, you need to state the reference
of truth or correctness of a response. If the reference is a reading article, textbook, teacher’s
lecture, class discussion, or resource person, state it in your instructions. This will help students
think contextually and on-track. This will also help you cluster your items according to specific
domain or context. Also, it can minimize the problem of conflict of information, such as one
resource material says this and one person (maybe your student’s parent or another teacher) says
otherwise. For items that vary in context and reference of truth, state the reference in the item
itself. For example, if the item is drawn from a person’s opinion, such as the principal’s speech,
or a guest speaker’s ideas, it is important that you attribute the opinion to its source. Lastly,
although not a must, it might be nicer to use “please” and “thank you” in our test instructions.
State the item as either definitely true or false. Statements must be categorically true or
false, and not conditionally so. It should clearly communicate the quality of the idea or concept
as to whether it is true, correct, and valid or false, erroneous, and invalid. Make sure that it
clearly corresponds to the reference of validity and that the context is explicit. For the
quality to be categorical, it must invite only a judgment of contradictories, not contraries. For
example, white or not white implies a contradiction because one idea is a denial of the other. To
say black or white indicates opposing ideas that imply values between them, such as gray. A
good item is one that implies only contradictory, mutually exclusive qualities, that is, either true
or false, and it does not need further qualification in order to make it true or false.
Keep the statements short, simple, but comprehensible. In formulating binary-choice
items, it is wise to consider brevity in the statement. Good binary-choice items are concisely
written so that they present the ideas clearly but avoid extraneous materials. Making the
statements too long is risky in that it might unintentionally indicate clues that will make your
statement obviously true or false. There is actually no clear-cut rule for brevity. It is usually left
to the teacher’s judgment. In preparing the whole binary-choice test, it is also important that all
the items or statements maintain relatively the same length. For a statement to be
comprehensible, it must make a clear sense of the ideas or concepts on focus, which is usually
lost when a teacher lifts a statement from a book and uses it as a test item statement.
Do away with tricks. We remember that the purpose of assessing our students’ learning
is based on the assessment objectives we set. Clearly, solving tricks is remote from, if not totally excluded in, our intents. Therefore, we need to avoid using tricks, such as using double negatives
in the statement or switching keys. The use of double-negative statements is a logical trickery
because the “valence” of the statement is still maintained, not altered. These statements are
usually puzzling, and will therefore take more time for students to understand. Switching keys is
when you ask students to answer “false” if the statement is true, or “true” if the statement is
false. This is obviously an unjustifiable trick. By all means, we have to avoid using any kind of
tricks not only in binary-choice tests but also in all other types and methods of assessment.
Get rid of those clues. Clues come in different forms. One of the common clues that can weaken the validity and reliability of our assessment comes from our use of certain words, such as those that denote universal quantity or definite degree (i.e., all, everyone, always, none, nobody, never, etc.). Statements with these words are usually false because it is almost always wrong to say that one instance applies to all sorts of things. Other verbal clues may come from the use of terms that denote indefinite degree (i.e., some, few, long time, many years, regularly, frequently, etc.). These words do not actually indicate a definite quantity or degree and thus violate the rule on definiteness of quality stated earlier. Other clues may come from the way statements are arranged
according to the key, such as alternating items that are true and false, or any other style of
placing the items in a systematic and predictable order. This should be avoided because once the
students notice the pattern, they are not likely to read the items anymore. Instead, they respond to
all items mindlessly but obtain high scores.
Basic in test development is our mindful tracking of our purpose. Binary-choice items can be a useful tool for assessing learning intents that are drawn from various types of knowledge, but they tap only simpler cognitive processes. In this test, students only recall their understanding of the subject matter covered in the assessment domain. They do not manipulate this knowledge by using more complex, deeper cognitive strategies and processes.
Another important point to consider in deciding whether to use the binary-choice test is its degree of difficulty. Because this type of test offers only two options, the chance that a student chooses the correct option is 50%; the remainder is the chance of choosing the wrong option. This 50-50 probability of selecting the correct answer is problematic because the chance of answering a question correctly is high even if the student is not quite sure of his understanding. One way of reducing the likelihood of guessing the right option, suggested by Popham (2004), is to include more items: even if students are successful in their guesswork on a 10-item binary-choice test, it is nearly impossible to maintain this success with, let us say, a 30-item test.
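The arithmetic behind this advice is binomial. The following Python sketch (the 70% cutoff is just an illustrative passing mark, not from the text) computes the chance of scoring at least 70% purely by guessing on tests of different lengths:

from math import comb

def p_at_least(n, k, p=0.5):
    """Probability of k or more successes in n independent guesses."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of getting at least 70% correct by blind guessing:
print(f"10 items: {p_at_least(10, 7):.3f}")   # about 0.172
print(f"30 items: {p_at_least(30, 21):.3f}")  # about 0.021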
Guidelines in Writing Binary-Type Items
2. Base true-false items upon statements that are absolutely true or false, without qualifications
or exceptions.
FAULTY: World War II was fought in Europe and the Far East.
IMPROVED: The primary combat locations in terms of military personnel during World War II
were Europe and the Far East.
3. Avoid negatively stated items when possible and eliminate all double negatives.
FAULTY: It is not frequently observed that copper turns green as a result of oxidation.
IMPROVED: Copper will turn green upon oxidizing.
4. Use quantitative and precise rather than qualitative language where possible.
FAULTY: Many people voted for Gloria Arroyo in the 2003 Presidential election.
IMPROVED: Gloria Arroyo received more than 60 percent of the popular votes cast in the
Presidential election of 2003.
6. Avoid making the true items consistently longer than the false items.
FAULTY: According to some peripatetic politicos, the raison d’etre for capital punishment is
retribution.
IMPROVED: According to some politicians, justification for the existence of capital punishment
can be traced to the Biblical statement, “An eye for an eye.”
FAULTY: Jane Austen, an American novelist born in 1790, was a prolific writer and is best
known for her novel Pride and Prejudice, which was published in 1820.
IMPROVED: Jane Austen is best known for her novel Pride and prejudice.
9. It is suggested that the crucial elements of an item be placed at the end of the statement.
FAULTY: Oxygen reduction occurs more readily because carbon monoxide combines with
hemoglobin faster than oxygen does.
IMPROVED: Carbon monoxide poisoning occurs because carbon monoxide dissolves delicate
lung tissue.
Multiple-choice Items
The multiple-choice test is another selected-response type where students respond to every item by choosing one option among a set of three to five alternatives. The item begins with a stem followed by the options or alternatives. This type of pen-and-paper test has been widely used in national achievement tests and other high-stakes assessments, such as the professional board examinations. Perhaps this is because the multiple-choice test is capable of measuring a range of knowledge and cognitive skills, obviously more than what other types of objective tests can do.
A multiple-choice test may come in two types. The correct-answer type is one whose
items pose specific problems with only one correct answer from the list of alternatives. For
instance, if a stem is followed by four alternatives, only one of them is correct (the keyed
answer), and the other three are incorrect. In this type of multiple-choice test, all items should be
designed in this fashion. The other is the best-answer type where the stem establishes a problem
to be answered by choosing one best option. Understandably, the other options are acceptable but
not necessarily the best alternatives to answer the problem posed in the stem. In this type of
multiple-choice test, only one option is the best answer (keyed answer), and the others may all be
conditionally acceptable, or some are acceptable while others are totally incorrect.
To guide you in formulating good multiple-choice items, here are some fundamental
guidelines that will be helpful in going through the process.
Make the instructions explicit. When giving a multiple-choice test, the instructions must indicate the content area or context, the way students respond to every item, and the scoring. If you are using the correct-answer type, it is helpful to students if your instructions state that they should “choose the correct answer.” Common sense should tell us not to use this expression when our multiple-choice test is of the best-answer type; “choose the best answer” would be more appropriate. Lastly, you may want to say “please” and “thank you.”
Formulate a problem. As mentioned above, every item in a multiple-choice test has a
stem and a set of alternatives. The stem should clearly formulate a problem. This is to compel
students to respond to it by choosing one option that will correctly answer the problem or best
address it. There are two ways of posing a problem in the stem of multiple-choice test. One way
is by formulating a question or and interrogative statement. If the stem is “In what year did the
first EDSA revolution happen?” it clearly poses a problem to be answered than “The year when
the first EDSA revolution happen.” The other way to pose a problem in the stem is by
formulating an incomplete sentence where one of the options correctly or best completes it. It
may be phrased as “The first EDSA revolution happened in the year” then the statement is
followed by the list of alternatives. As you will also learn about completion types in the
subsequent section of this chapter, when you use the incomplete sentence format to pose a
problem in the stem of a multiple-choice item, always remove a keyword at the end of the
statement, or at least near the end. If the keyword is at the end of the statement, you don’t end
with any punctuation mark or a blank space. If the missing keyword is near the end of the
statement but not necessarily the last word, replace that keyword which you removed with an
underlined blank space, and end your statement with an appropriate punctuation mark.
State the stem in positive form. Ask yourself: how reasonable is it to state your item’s stem in a negative form? How important is assessing students’ ability to deal with the “negatives” in your test? You will surely struggle to find a good answer that justifies the use of negative statements in your multiple-choice test.
One of the common problems we encounter in a negatively phrased stem is the high chance of not spotting the word that carries the negation (e.g., not). Another is the difficulty of anchoring the negative item to the learning intent. In general, “which one is” will work more effectively in assessing students’ learning than “which one is not.” The rule of thumb is to avoid the use of negative statements unless there is a really compelling reason why you need to phrase your stem in a negative form. If the reason is compelling enough, you need to highlight the word that carries the negation, for example by underlining, capitalizing, bolding, or italicizing it (e.g., “NOT”).
Include only useful alternatives. Remember that the set of alternatives following the stem
is a list of options from which students pick out their response. In any type of multiple-choice
test, only one alternative is keyed, and the rest are distractors. The keyed alternative is ultimately
useful because it is what we expect every student who learned the subject matter should choose.
If the set of alternatives does not contain the expected answer, it is clearly a bad item. This
problem is more dreadful in a correct-answer type than in the best-answer type. At least for the
latter, the second best alternative can stand as the key if the best answer is missing in the list. If
the correct answer is missing from the list of options in a correct-answer type, then there is really no answer to the problem posed in the stem, and the item must be removed from the test.
Even if the distractors are not the expected answers, they serve an important function in the multiple-choice test. As distractors, they should distract those students who have not learned the subject matter well enough, but not those who have. Therefore, these distractors should be plausible, that is, appear as if they are the correct or best options. Plausible distractors work in a multiple-choice test by making students believe that these distractors are the correct or best answer even if they are actually not. An important consideration in dealing with the alternatives is maintaining a homogeneous grouping. For example, if a stem asks about the name of a particular inventor in science, all alternatives should be names of scientific inventors.
As stated above, a multiple-choice item should have three to five alternatives. Choosing
to include 3, or 4, or 5, depends on the grade level or year level of the class of students you are
handling. We suggest that higher grade- or year-level students be given items with more than 3
options as this will increase the level of test difficulty and reduce the effects of guessing on your
assessment of students’ learning. In instances when you wish students to evaluate each option as to its plausibility, you may add the option “none of the above” as the fourth or fifth alternative.
However, you have to use this alternative with caution. Use it only for correct-answer type of
multiple-choice test and when you intend to increase the difficulty of an item and that the
presence of this option will help you come up with a better inference of your students’ learning.
Let us say, for example, you are testing computational skills of your students using multiple-
choice items and you encourage mental computations as they deal with the item. If you give
them only number options, they may just choose any one option based on simple estimation,
believing that one of them is the correct answer. Adding the “none of the above” option will
encourage students to do mental computation to check on each option’s correctness, because they
know it is possible that the correct answer is not in the list. Obviously, you cannot use this option
in a best-answer type of multiple-choice test.
The option “all of the above” should never be used at all, as it invites guessing that works in your students’ favor. If your last option (4th or 5th) is “all of the above” and your students
notice at least 2 options that are correct, they are likely to guess that “all of the above” is the
correct option. Similarly, if they spot one incorrect option, automatically they disregard the “all
of the above” option. When they do so, the item’s difficulty is reduced.
One of the instances when teachers are tempted to unreasonably use “none of the above” or to include the “all of the above” option even if it is not allowed is when they force themselves to maintain the same number of alternatives for all their multiple-choice items. In this case, they use these alternatives as “fillers” when they run out of options. To avoid this mistake, it is important to realize that, for classroom testing purposes, multiple-choice items do not have to come with the same number of options for all items. It is okay to have some items with four options while some other items have five.
Scatter the positions of keyed answers. In formulating your multiple-choice items, spread
the keyed answers to different response positions (i.e., a, b, c, d, and e). Make sure the number of
items whose keyed answer is “a” is proportional to the items keyed for each of the other response
positions. Better yet, if you give a 20-item multiple-choice test with 4 options per item, key five
items to each response position (25% of items per response position or approximately so).
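One simple way to produce such a balanced, unpredictable key is to build it programmatically. A minimal Python sketch (an illustration, not a required procedure):

import random

def balanced_key(n_items, options="abcd", seed=None):
    """Build an answer key with keyed positions spread evenly, then shuffled."""
    rng = random.Random(seed)
    key = [options[i % len(options)] for i in range(n_items)]  # equal counts
    rng.shuffle(key)                                           # random order
    return key

print(balanced_key(20, seed=1))   # five a's, b's, c's, and d's, in random order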
The good thing about the multiple-choice test is that it is capable of measuring skills higher than just recall or simple comprehension. If properly formulated, the test can measure higher-level thinking (Airasian, 2000). Also, the fact that every item in a multiple-choice test is followed by more than two response options gives it its reputation for a higher difficulty level, because the probability that a blindly chosen option is correct becomes smaller as you increase the number of options. Certainly, a 4-option item is more difficult than a 3-option item because the former indicates only a 25% probability that a guessed option is correct, which is lower than the 33% probability for a 3-option item. A 5-option item is clearly more difficult still.
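The same binomial arithmetic used earlier for binary-choice items shows the effect of adding options. This Python sketch (the 20-item length and 60% passing mark are illustrative assumptions) computes the chance of passing purely by guessing for 3-, 4-, and 5-option items:

from math import comb, ceil

def p_pass_by_guessing(n_items, n_options, pass_rate=0.60):
    """Chance of reaching the passing score purely by random guessing."""
    p = 1 / n_options
    k = ceil(pass_rate * n_items)
    return sum(comb(n_items, i) * p**i * (1 - p)**(n_items - i)
               for i in range(k, n_items + 1))

for n_options in (3, 4, 5):
    print(f"{n_options} options: {p_pass_by_guessing(20, n_options):.4f}")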
Activity:
A. Think of a specific subject matter in your field of specialization, one that you are very
familiar with.
B. Write a learning intent that can be measured using multiple-choice test.
C. Formulate at least 5 correct-answer-type items and another 5 best-answer-type items.
D. Check the quality of your output based on the guidelines discussed above. As you do this,
monitor your learning as well as your confusions, doubts or questions.
E. Raise questions in class.
3. Include in the stem any words that might otherwise be repeated in each response.
4. Items should be stated simply and understandably, excluding all nonfunctional words from
stem and alternatives.
7. Avoid making the correct alternative systematically different from the other options.
10. Make all responses plausible and attractive to the less knowledgeable and skillful student.
FAULTY: Which of the following statements makes clear the meaning of the word “electron”?
a. An electronic tool
b. Neutral particles
c. Negative particles
d. A voting machine
e. The nuclei of atoms
11. The response alternative “None of the above” should be used with caution, if at all.
FAULTY: What is the area of a right triangle whose sides adjacent to the right angle are 4 inches
long respectively?
a. 7
b. 12
c. 25
d. None of the above
IMPROVED: What is the area of a right triangle whose sides adjacent to the right angle are 4
inches and 3 inches respectively?
a. 6 sq. inches
b. 7 sq. inches
c. 12 sq. inches
d. 25 sq. inches
e. None of the above
12. Make options grammatically parallel to each other and consistent with the stem.
FAULTY: As compared with the American factory worker in the early part of the 19th century,
the American factory worker at the close of the century
a. was working long hours
b. received greater social security benefits
c. was to receive lower money wages
d. was less likely to belong to a labor union.
e. became less likely to have personal contact with employers
IMPROVED: As compared with the American factory worker in the early part of the century, the
American factory worker at the close of the century
a. worked longer hours.
b. had more social security.
c. received lower money wages.
d. was less likely to belong to a labor union
e. had less personal contact with his employer
13. Avoid such irrelevant cues as “common elements” and “pat verbal associations.”
IMPROVED: The “standard error of estimate” is most directly related to which of the following test characteristics?
a. Objectivity
b. Reliability
c. Validity
d. Usability
e. Specificity
14. In testing for understanding of a term or concept, it is generally preferable to present the term
in the stem and alternative definitions in the options.
FAULTY: What name is given to the group of complex organic compounds that occur in small
quantities in natural foods that are essential to normal nutrition?
a. Calorie
b. Minerals
c. Nutrients
d. Vitamins
15. Use objective items – items whose correct answers are agreed upon by experts.
Factual Knowledge
Conceptual Knowledge
Which of the following statements of the relationship between market price and normal price is
true?
a. Over a short period of time, market price varies directly with changes in normal price.
b. Over a long period of time, market price tends to equal normal price.
c. Market price is usually lower than normal price.
d. Over a long period of time, market price determines normal price.
Application
In the following items (4-8) you are to judge the effects of a particular policy on the distribution of income. In each case assume that there are no other changes in policy that would counteract the effect of the policy described in the item. Mark the item:
A. if the policy described would tend to reduce the existing degree of inequality in the distribution of income;
B. if the policy described would tend to increase the existing degree of inequality in the distribution of income; or
C. if the policy described would have no effect, or an indeterminate effect, on the distribution of income.
Analysis
An assumption basic to Lindsay’s preference for voluntary associations rather than government order… is a belief
a. that government is not organized to make the best use of experts
b. that freedom of speech, freedom of meeting, and freedom of association are possible only under a system of voluntary associations
c. in the value of experiment and initiative as a means of attaining an ever-improving society
d. in the benefits of competition
For items 14-16, assume that in doing research for a paper about the English language you find a statement by Otto Jespersen that contradicts a point of view about language that you have always accepted. Indicate which of the statements would be significant in determining the value of Jespersen’s statement. For the purpose of these items, you may assume that these statements are accurate. Mark each item using the following key:
A. Significant positively – that is, it might lead you to trust his statement and to revise your own opinion.
B. Significant negatively – that is, it might lead you to distrust his statement.
C. Has no significance.
Matching Items
Another common type of selected-response test is the matching type of test, which comes with two parallel lists (i.e., premises and responses), where students match the entries on one list with those on the other list. The first list consists of descriptions (words or phrases), each of which serves as the premise of a test item. Therefore, each premise is taken as a test item and must be numbered accordingly. Each premise will be matched with the entries in the second (or response) list. There is only one and the same response list for all the premises in the first list.
In developing good matching items, it is helpful to consider the following hints that will
guide you in the process of designing your lists.
Make instructions explicit. In making your instructions for a matching test explicit, the
context, task, and scoring must be clearly indicated. For its context, your instructions must
introduce the description as well as the response lists. If, for example, your description list
contains premises about scientific inventions you must state in your instructions that the first list
(or first column) is about scientific inventions. If your response list contains names of scientific
inventors, you must also state in your instructions that the second list (or second column)
contains names of scientific inventors. You may phrase it something like this: “In the first
column are scientific inventions. The second column lists names of scientific inventors. Match
the inventions with their inventors.” Then indicate the scoring. Having said this, we suggest that
your lists should be labeled with headings accordingly. In the case of the above example, you
may write the column heading as “Inventions” or “Column A: Inventions” for the first column or description list, and “Inventors” or “Column B: Inventors” for the second column or response list.
Maintain brevity and homogeneity of the lists. The list of premises or descriptions must
be fairly short, that is, include only those items that go together as a group. For example, if your
matching test covers the common laboratory operations in chemistry, choose only those that are
relevant to your assessment domain. Doing this, you are also maintaining homogeneity of your
list. In matching tests, it is extremely important that entries in the description list are drawn from
one relatively specific assessment domain. For example, never mix up common laboratory
operations with measurements. Instead, decide as to whether you will include only one of these.
The same is true for your response list. Include only those that belong to the assessment domain.
Note that here, homogeneity in your lists is non-negotiable.
Also, in writing good matching items, it is imperative that the descriptions are longer than the responses, not the other way around. After a student reads one of the descriptions, he reads all the options in the response list. If the description is longer than each of the options, at least the student only reads it once or twice. If the entries in the response list are long, it will take more time for the student to read all the options just to respond to one description or item.
Finally, include more options than descriptions. If your description list has 10 descriptions or items, make your responses 12 or a bit more. This strategy reduces the effect of response elimination, where the student disregards those options already chosen to match the other descriptions. For example, if the student has already responded to 8 out of 10 descriptions with high confidence in his responses but finds the last 2 items difficult, then with only 10 options, only 2 options remain available, and each of the remaining options has a 50% probability of being the correct one. If you include more than 10 responses, more options would still be available for the last 2 descriptions, and the probability that each option is correct is smaller than 50%. This will reduce the effect of guessing. Better yet, formulate your descriptions in a way that some options may be used more than once. In this case, you maintain the plausibility of all options for every description.
Keep the options plausible for each description. Because there is only one and the same
list of options for each of the descriptions, it is vital that you keep the options plausible for every
description. It means that if you have ten descriptions and twelve options, one option is keyed for
each description and the other eleven should be plausible distractors. Usually, if the rule on
homogeneity is very well observed, it is relatively easy to maintain one list of plausible options
for each description. In addition to this, never establish a systematic sequence of keyed
responses, such as coding with a word, such as G-O-L-D-E-N-S-T-A-R, which means that the
keyed response letter for the first description is “G” and the keyed response for the 10th
description is “R.” If this pattern is initially detected by the students, such as G-O-L- _ -E- _ -S-
T- _ -R, they immediately jump into guessing that the missing letters are D, N, and A,
respectively (and guessing it right).
Place the whole test on the same page of the test paper. After stating the instructions for a matching test, write the lists or columns below it and make sure all descriptions and options are written on the same page where the matching test is placed in the test paper. Never extend some items or options to the next page of the test paper because, if you do so, students will keep flipping between pages as they respond to your matching items. If you notice in your draft that some items already spill over to the next page, make some simple adjustments, like reducing the font size of your items (as long as it remains legible) or improving the efficiency of your test layout. If the problem still exists, shorten your list, or, if there are other types of test in your test paper, swap the position of your matching test with another test.
The use of selected-response tests is effective for various types of learning intents and assessment contexts. With careful design, these tests can measure capabilities beyond the lower-order kinds, especially if the items are formulated to elicit students’ higher levels of cognitive skills (Popham, 2004).
Guidelines in Writing Matching-Type Items
FAULTY: Match List A with List B. You will be given one point for each correct match.
List A List B
a. cotton gin a. Eli Whitney
b. reaper b. Alexander Graham Bell
c. wheel c. David Brinkley
d. TU54G tube d. Louisa May Alcott
e. steamboat e. None of these
IMPROVED: Famous inventions are listed in the left-hand column and inventors in the right-hand column below. Place the letter corresponding to the inventor in the space next to the invention for which he is famous. Each match is worth 1 point, and “None of these” may be the correct answer. Inventors may be used more than once.
Inventions Inventors
__ 1. steamboat a. Alexander Graham Bell
__ 2. cotton gin b. Robert Fulton
__ 3. sewing machine c. Elias Howe
__ 4. reaper d. Cyrus McCormick
e. Eli Whitney
f. None of these
Lesson 3
Designing Constructed-Response Types
Another set of options for the types of pen-and-paper test to give is the constructed-response test. Unlike the selected-response types, the constructed-response test does not provide students with options for answers, but rather requires students to produce a relevant answer to every test item. Drawing from its name, we understand that, in this type of test, students construct their response instead of just choosing it from a given list of alternatives.
Constructed-response methods of assessment include certain types of pen-and-paper tests and performance-based assessments. In this chapter we focus our discussion only on constructed-response types of pen-and-paper test. Some of the common types of pen-and-paper constructed-response test are short-answer and essay items.
Short-answer Items
As the name suggests, short-answer items ask students to provide short answers to questions or descriptions. This type of constructed-response test calls for students to respond to either a direct question, a specific description, or an incomplete sentence by supplying a word, a phrase, or a sentence. If a test contains direct questions, students are expected to answer each question by giving the word, symbol, number, phrase, or sentence being asked for. The same applies to items using specific descriptions of words, phrases, or sentences. Items composed of incomplete sentences ask students to complete every sentence by supplying the word or phrase that meaningfully completes it in terms of the assessment domain.
In formulating the questions or descriptions that compose your test items, always think according to the name of this test type, so that you remain mindful that the items should call for “short answers.” Do not ask questions that require long answers; otherwise you are using short-answer items as essay items. If your assessment target calls for students to respond with longer statements or written discussions, it is preferable that you give essay items instead of short-answer items.
Make instructions explicit. Short-answer items usually have simple instructions. In fact, it is tempting to just expect that students understand how to go about the test using only their common sense. However, it is never safe to assume that every student understands what you want them to do with your test. Besides, it is always advisable to give your students the necessary prompt before they respond to the test items. In short, you need to set clear instructions even for short-answer items, which should indicate the content area, the task, and the scoring. In directing students on the task you expect them to do, specify whether they should answer the question, indicate what is described, or complete the sentences, depending on your item’s format. Lastly, remember to say “please” and “thank you” in your instructions.
Decide on the item’s format. When you decide to use short-answer items, also decide if
all your items should come in questions, descriptions, or incomplete sentences. Whichever you
decide to use to format your items, maintain consistency of the format for all your short-answer
items. For example, if you wish to give a 15-item short-answer test and expect that students
supply short answers to your questions, have all your items of the test written in a direct question
form. Never mix up direct question items with descriptions or incomplete sentences. One
important criterion for choosing what format to use is the age of the student. For younger
learners, it is usually preferable to use direct questions than descriptions or incomplete sentences.
Once you have made up your mind as to the item format, proceed to formulating each item.
Structure the items accordingly. Because short-answer items call for “short answers,” as may be inferred from their name, always make sure you structure every item so that it requires only a brief answer (i.e., a word, a symbol, a number [or a set of numbers], a phrase, or a short sentence). This is achieved by formulating very clear, specific, explicit, and error-free statements in your items. A clear and specific question calls for a specific answer. If your description clearly and explicitly represents the object being described, and you are sure that it refers to a specific word, symbol, or phrase, then your item is structured properly. If your items are incomplete sentences, structure every item so that the missing word or phrase is a keyword or a key idea. Ordinarily, an incomplete sentence has only one blank, which corresponds to one missing keyword. You may remove two keywords as long as doing so does not distort the key idea of the incomplete sentence, which should guide the students in figuring out the missing words. Never use more than two blanks.
One important reason for ensuring that students supply only brief responses is that brief responses are easy to check objectively. We encounter a major scoring problem if students’ responses are lengthy: with long responses, it is difficult to give accurate scores. Of course, we already know, as discussed in Chapter 3, that inaccurate scoring of students’ responses undermines the reliability of our measures and reduces the validity of the inferences we make about our students’ learning outcomes.
Provide the blanks in appropriate places. Blanks are spaces in the items where
students supply their answers by writing a word, a symbol, a number, a phrase, or a sentence. If
your items are all in a direct question format where each question begins with an item number,
place the blanks to the left of the item number. When you type the item, begin with the
blank space, followed by the item number, then the question. This rule also applies to items
using explicit descriptions. If you are using the incomplete sentence format for your items, place
the blank near the end of the sentence. This means that you take out a keyword that is found near
the end of the sentence so that it becomes an incomplete sentence. Never take out a keyword
from the beginning of a sentence. The reason for this is that you need to first establish the key
idea of the sentence so that students immediately know what is missing in the sentence right after
one reading. If the blank space is near the beginning of the sentence, students will find it hard to
understand the key idea and will, therefore, read the sentence more than once in order to figure
out the missing word. In all item formats, always maintain the same length of the blanks in all
your short-answer items.
The good thing about short-answer items is that students actually produce the correct answer rather than merely selecting it from a set of given alternatives. Hence, students who possess only partial knowledge of the subject matter, which is usually enough for selected-response items, will find it difficult to give a correct response to every short-answer item. Although we generally recognize that these items are appropriate for measuring simple kinds of learning outcomes, they are capable of measuring various types of challenging outcomes if carefully developed. However, it is not advisable to force short-answer items to measure more complex and deeper levels of cognitive processes. It is always helpful to know other methods of assessment so that you have a wide range of options to navigate freely, depending on your assessment purposes.
FAULTY: _____ pointed out in ____ the freedom of thought in America was seriously
hampered by ___, ____, & __.
IMPROVED: That freedom of thought in America was seriously hampered by social pressures
toward conformity was pointed out in 1830 by ______.
4. Specify and announce in advance whether scoring will take spelling into account.
5. In testing for comprehension of terms and knowledge of definition, it is often better to supply
the term and require a definition than to provide a definition and require the term.
FAULTY: What is the general measurement term describing the consistency with which items in
a test measure the same thing?
6. It is generally recommended that in completion items the blanks come at the end of the
statement.
FAULTY: A (an) ________ is the index obtained by dividing a mental age score by
chronological age and multiplying by 100.
IMPROVED: The index obtained by dividing a mental age score by chronological age and
multiplying by 100 is called a (an) ________
FAULTY: Where does the Security Council of the United Nations hold its meeting?
IMPROVED: In what city of the United States does the Security Council of the United Nations
hold its meeting?
Essay Items
Relative to our learning intents, there are times when it is necessary that our students
supply lengthy responses so that they exhibit more complex cognitive processes. For some
learning targets, a single word, a phrase, or a sentence is not enough to measure students’
learning outcomes. For these targets, we need a constructed-response type of test that will allow
students to adequately exhibit their learning through sufficient writing; hence, essay items work
for these purposes.
Just like short-answer items, essay items call for students to produce rather than select
answers from the given alternatives. But unlike short-answer items, essay items call for more
substantial, usually lengthy response from students. Because the length and complexity of the
response may vary, essay items are appropriate measures of higher-level cognitive skills.
Following are some guidelines that will help you formulate good essay items.
Communicate the extensiveness of expected response. By reading the essay item, your
students must know exactly how brief or extensive their responses should be. This is made
possible by making your item clearly convey the degree of extensiveness you expect from their
response. Extensiveness depends on the degree of complexity of your item. To determine the
degree of complexity you desire to assess, you may design an essay item according to either of two types, depending, of course, on your assessment objective: the restricted-response and
extended-response items. If you wish to measure students’ ability to understand, analyze, or
apply certain concepts to new contexts while dealing with relatively simple dimensions of
knowledge, and if the task requires only a relatively short time period, the restricted-response
type may be preferred. If, however, you wish to assess students’ capability to evaluate or
synthesize various aspects of knowledge, which will naturally require longer time for their
responses to be completed, the extended-response type is preferable. Notice that even at this
phase of determining the degree of complexity of your essay item, it is very vital that you clearly
make a decision based on your learning intent. This phase is crucial because if you design an
essay item that is of extended type but give it to your students as if it is of restricted type, your
students’ failure to meet the assessment standards set for the item may not be due to their level of
learning, but rather because they needed more time to gather and process information before they
could come up with responses that are relevant to your assessment standards. Your inference on
students’ learning becomes problematically unreliable and invalid. Your inference becomes equally problematic if you construct a truly restricted type of essay item but give it as if it were an extended-type essay item.
Prime and prompt students through the item. Unlike the other types of pen-and-paper
tests, an essay item already includes the context, the assessment task, and the assessment standards altogether. The statement of context provides a background of the subject matter in
focus, and primes the students’ thinking of that subject matter. The prime helps students to be
selectively attentive to a subject matter that is relevant to the assessment task of the essay item.
Without it, students tend to grapple with understanding the subject matter that is embedded in the
statement of assessment task, and may find it difficult to stay in focus. The assessment task is
what the students directly respond to in order to write an essay. Both the statements of context
(or the prime) and the assessment task (or the prompt) are important in setting the students’
attention to the subject matter and in making them think of a response that meets the assessment
standards. Notice, for example, that if the item is phrased as “Compare and contrast the
governance of Estrada and Arroyo,” students first struggle to generate some ideas related to
these two names, and only then think of the governance or political administrations of the two Philippine presidents in a general sense. This is because the item does not have a prime. In this case, the item is not helping the student stay in clear focus on what the item really intends to assess. It will be
different if the item begins with a prime, such as when phrased something like, “Our country has
been run by a number of presidents already, and along with the change in political
administration are the changes in the agenda of reforms. Compare and contrast the economic
reform agenda of the presidential administrations of Estrada and Arroyo.” In this item, students
are primed to think of the reform agenda of the two presidents, which makes it very probable that they focus on the context as they respond to the assessment task. This latter example is not yet a complete essay item, as it lacks other necessary elements, but it clearly shows how effectively you can prime and prompt students to respond appropriately to your test item. This item may be
improved to become a full-blown essay item if you add other elements, such as the guide to the
extensiveness of the desired response as well as the assessment standards.
Provide clear assessment standards. You might think that, if it has both the prime and the prompt, your item can already stand. This is not true. For an essay item to stand as a good one, it must also indicate a clear guide to the value of the item. The assessment standards inform the students about which specific aspects of their responses you will give merit, and which aspects will earn more credit than others. If, for example, you give credit to their argument only if they can provide evidence, then you need to categorically ask for it in your essay item. Similarly, if you give two or three essay items and you wish to give more credit to one item based on its complexity, you also need to indicate the item’s value. This way, students know where to devote most of their time and effort, and can decide how much of these resources to invest in each item. One simple way of guiding students in terms of an item’s value is to indicate the assessment weight you assign to the item in parentheses at the end of the item.
Do away with optional items. While reading this part you may be recalling a common experience of taking an essay test in which the teacher asked you to choose some, but not all, essay items to answer, and you tended to choose those items that were more convenient to your understanding and readiness. This practice of providing optionality in essay items, where students are made to answer fewer items than are presented, should be stopped. From your own experience, it is obvious that, when students are free to choose only a few items to answer, they choose the items that are easy for them. As a consequence, each student answers items that are “easy” for him or her, which leads to flawed inferences about students’ learning, because students’ responses are marked under different standards and levels of complexity depending on the items they chose to answer. One of the basic questions you will need to answer if you plan to do this is: What is the assurance that all your items have an equal level of complexity and that they measure exactly the same knowledge domains and cognitive processes? This question is extremely difficult to answer. This guideline, therefore, says that if, for example, you have 3 essay items for the test you are about to administer, have each of your students answer all 3 items.
Prepare a scoring rubric. Because an essay item calls for relatively extensive response
from the students, it is always necessary that you prepare a scoring rubric or guide prior to giving
the test. The scoring scheme will help you pre-assess the validity and reliability of your item
because it will allow you to identify the criteria as well as the key ideas you expect your students
to give in response to the item. Your scoring rubric indicates the descriptions in scoring the
quality of your students’ responses in the essay item. It includes a set of standards that define
what is expected in a learning situation, and important indicators of how students’ responses to
the task will be scored. Having said this, we ask you to choose the scoring approach that will best
fit your assessment context. You have two options for this purpose: one is the holistic approach, the other is the analytic approach.
The holistic approach allows you to focus on your students’ overall response to an essay
item. As you assess the response as a whole, this approach will guide you in terms of what
dimensions of the learning outcome you pay attention to. For example, if your essay item intends
to let students manifest their ability to argue with appropriate evidence, and explain in good,
clear, and coherent language, you need to identify the dimensions that can capture those abilities in your assessment; hence you may have the following dimensions indicated in your holistic
rubric: Logic of the argument, Relevance of evidence, Communicative clarity, Lexical choice,
and Mechanics (spelling, punctuations, etc.). These dimensions serve as your criteria for
assessing students’ response. It is always appropriate that you indicate the dimensions to assess
because these dimensions keep you in focus as you assign a score to each of your students’
response. And for you to be guided further in terms of how much score to give, each dimension
must be assigned a corresponding point or set of points. For example, if you wish to give a maximum of 6 points for the logic of the argument, indicate it in your holistic rubric so that it might look like the items in the box below.
Another way of setting a guide to scoring is to assign the same points for each criterion while also indicating the weight of each criterion based on its importance or value. The box below gives you a view of how the contents may look.
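Because the boxes from the original layout are not reproduced here, the following Python sketch illustrates the weighted-criteria idea instead; the criterion names come from the holistic rubric above, while the weights and raw points are hypothetical.

```python
# Hypothetical weights for the five criteria named above (summing to 1.0),
# with each criterion scored on the same 0-6 point range.
weights = {
    "Logic of the argument": 0.30,
    "Relevance of evidence": 0.25,
    "Communicative clarity": 0.20,
    "Lexical choice": 0.15,
    "Mechanics": 0.10,
}

# Hypothetical raw points assigned to one student's essay.
raw_points = {
    "Logic of the argument": 5,
    "Relevance of evidence": 4,
    "Communicative clarity": 6,
    "Lexical choice": 4,
    "Mechanics": 5,
}

# The weighted total stays on the same 0-6 range as the individual criteria.
weighted_total = sum(weights[c] * raw_points[c] for c in weights)
print(round(weighted_total, 2))  # 4.8
```

Criteria with larger weights pull the total more strongly, which is exactly how the weighting communicates each criterion's importance.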
When employing a holistic approach for scoring students’ responses in an essay item,
your decision as to how much score to give based on each dimension is not guided by clear
descriptions of the quality of response. It usually rests on the teacher’s judgment of the student’s
response in terms of each criterion. Because this approach does not require specific descriptions
of the quality of response, it is easy and efficient to use. The major weakness of this approach,
however, is that it does not specify the graded levels of performance quality, which invites teachers’ subjective judgment of students’ responses. Acknowledging this major weakness,
we recommend that you use the holistic approach only for restricted-response items where
students are tested only on less complex skills requiring only a small amount of time.
In contrast, the analytic approach allows for a more detailed and specific assessment
scheme in that it indicates not only the dimensions or criteria, but also the specific descriptions
of the different levels of performance quality per criterion. Supposing we take the sample criteria
in the boxes above and use them as the same criteria for our analytic rubric, we proceed by
determining the levels of performance quality for each criterion. For the logic of the argument
criterion we set a scale of varying performance quality, perhaps ranging from Excellent to Poor,
with other levels of quality in between. A simple way to do this is exemplified in the box below.
As indicated in the box above, there are 4 scale indicators, each representing a level of
performance quality. In this example, the teacher will put a check on the space below the scale
indicator that matches the quality of a student’s response on every criterion. Scores are obtained
by assigning points in every scale indicator. You may also specify the weight of each criterion
depending on the degree of importance or value of the criterion.
A more calibrated analytic rubric not only indicates the scale levels for the teacher to check against the quality of students’ responses to an essay item, but also describes the performance quality that falls under each level of the scale. This rubric describes what quality of performance qualifies as “excellent” and what type of performance is “poor.” In this case, the analytic rubric should include descriptive statements for each scale level of each criterion. The table below shows an example of these descriptive statements applied to one of the criteria used in the earlier example, just to illustrate the point.
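As a rough illustration of how such descriptive statements can be organized, the sketch below stores one criterion's scale levels and descriptors in a Python dictionary; the descriptor wording is hypothetical, not taken from the book's table.

```python
# Hypothetical descriptors for one criterion of an analytic rubric.
logic_of_argument = {
    4: ("Excellent", "Claims are logically ordered and every claim is supported."),
    3: ("Good", "Claims are mostly ordered, with minor gaps in support."),
    2: ("Fair", "Some claims are unsupported or out of sequence."),
    1: ("Poor", "Claims are disconnected and largely unsupported."),
}

def describe(criterion: dict, level: int) -> str:
    """Return the label and descriptor the teacher checks for a given level."""
    label, descriptor = criterion[level]
    return f"{label}: {descriptor}"

print(describe(logic_of_argument, 3))  # Good: Claims are mostly ordered, ...
```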
The good thing about using the analytic approach in scoring essay responses is that it helps you identify the specific level of students’ performance, and your assessment of students’ learning outcomes becomes more objective. It therefore increases the reliability of your measure and facilitates more valid and reliable inferences. It is also beneficial for the students because, through the analytic rubric, they can pinpoint the specific level of their performance and judge its quality by matching it against the descriptions. This type of rubric is best for essay items that measure more complex cognitive skills and more sophisticated knowledge dimensions.
Whichever approach you wish to use for scoring your students’ response to your essay
items, your decision will work if you are already clear on the following questions:
What do you want your students to know and be able to do in the essay?
How well do you want your students to know and be able to do it in the essay?
How will you know when your students know it and do it well in the essay?
As you clarify your practice with reference to those questions, proceed to constructing your scoring scheme using either approach, following the simple steps indicated below.
Set appropriate assessment target.
Decide on the type of the rubric to use.
Identify the dimensions of performance that reflect the learning outcomes.
Weigh the dimensions in proportion to their importance or value.
Determine the points (or range of points) to be allocated to each level of performance.
Show the rubric to colleagues and/or students before using it.
Some teachers are excited to use essay items because these items provide more
opportunities to assess various types of learning outcomes, particularly those that involve higher
level cognitive processing. If carefully constructed, essay items can test students’ ability to
logically arrange concepts and analyze relationships between them; state assumptions, compare positions, evaluate them, and draw conclusions; formulate hypotheses and argue for the causal relationships of concepts; organize information or bring in evidence to support findings; and propose solutions to certain problems and evaluate those solutions in light of certain criteria. These and many more competencies can be measured using good essay items.
Lesson 4
Designing Interpretive Exercise
Items in an interpretive exercise usually contain a stimulus aside from the item stem and options. This format is useful in measuring more complex skills such as judging the relevance of information, making generalizations, drawing inferences, applying principles, recognizing assumptions, and making interpretations.
2. Select introductory material that is appropriate to the curricular experience and reading ability
of the examinees.
5. Revise introductory material for clarity, conciseness, and greater interpretive value.
6. Construct test items that require analysis and interpretation of introductory material.
7. Make the number of items roughly proportional to the length of the introductory material.
Reading Comprehension
Bem (1975) has argued that androgynous people are “better off” than their sex-typed
counterparts because they are not constrained by rigid sex-role concepts and are freer to respond to a
wider variety of situations. Seeking to test this hypothesis, Bem exposed masculine, feminine, and
androgynous men and women to situations that called for independence (a masculine attribute) or
nurturance (a feminine attribute). The test for masculine independence assessed the subject’s willingness
to resist social pressure by refusing to agree with peers who gave bogus judgments when rating cartoons
for funniness (for example, several peers might say that a very funny cartoon was hilarious). Nurturance, or feminine expressiveness, was measured by observing the behavior of the subject when left alone for ten minutes with a 5-month-old baby. The results confirmed Bem’s hypothesis. Both the masculine sex-typed and the androgynous subjects were more independent (less conforming) on the “independence” test than feminine sex-typed individuals. Furthermore, both the feminine and the androgynous subjects were more “nurturant” than the masculine sex-typed individuals when interacting with the baby. Thus, the androgynous subjects were quite flexible: they performed as masculine subjects did on the “masculine” task and as feminine subjects did on the “feminine” task.
a. task performance
b. frequency of refusals and conformity
c. rating scale
d. counting occurrences of the behavior
a. factorial
b. factorial design based on mixed model
c. randomized block design
d. switching replications design
Interpreting Diagrams
Instruction. Study the following illustration and answer the questions that follow.
Figure 1. Pretest-posttest design with Group A and Group B.
a. there is an EV
b. there is no treatment
c. there is the occurrence of a ceiling effect
References
Airasian, P. W. (2000). Assessment in the classroom: A concise approach (2nd ed.). USA: McGraw-Hill.
Popham, W. J. (2005). Classroom assessment: What teachers need to know (4th ed.). Boston,
MA: Allyn and Bacon.
Chapter 5
Constructing Non-Cognitive Measures
Objectives
Lessons
Lesson 1
The Nature of Non-Cognitive Constructs
Figure 1. Dimensions of affect: intensity (high to low) and target (person or object).
religion, where the Catholic group had the highest mean score. In another discriminant validity analysis, the participants who frequently attended church had the highest mean. Examples of items are:
1. I think the teaching of the church is altogether too superficial to have much social significance.
2. I feel the church services give me inspiration and help me to live up to my best during
the following week.
3. I think the church keeps business and politics up to a higher standard than they would otherwise tend to maintain.
Beliefs. Beliefs are judgments and evaluations that we make about ourselves, about
others, and about the world around us (Dilts, 1999). Beliefs are generalizations about things such
as causality or the meaning of specific actions (Pajares, 1992). Examples of belief statements
made in the educational environment are “A quiet classroom is conducive to learning,”
“Studying longer will improve a student’s score on the test,” “Grades encourage students to work
harder.”
Beliefs play an important part in how teachers organize knowledge and information and
are essential in helping teachers adapt, understand, and make sense of themselves and their world
(Schommer, 1990; Taylor, 2003; Taylor & Caldarelli, 2004). How and what teachers believe
have a tremendous impact on their behavior in the classroom (Pajares, 1992; Richardson, 1996).
An example of a measure of belief is the Schommer Epistemological Questionnaire.
Schommer (1990) developed this questionnaire to assess beliefs about knowledge and learning.
A 21-item questionnaire was developed by the researchers to measure epistemological beliefs of
Asian students. The questionnaire was adapted from Schommer's 63-item epistemological beliefs
questionnaire. This Asian version of the Schommer Epistemological Questionnaire has been
validated with a sample of 285 Filipino college students. This epistemological questionnaire was revised to have fewer items and simpler expressions of ideas to be more appropriate for Asian learners. The number of statements was reduced to ensure that the participants would not be placed under any stress while completing the questionnaire. Students are asked to rate their
degree of agreement for each item on a 5-point Likert scale ranging from 1 (strongly disagree) to
5 (strongly agree). Wording of items varied in voice from first person (I) to third person
(students) in an effort to illustrate how the same belief could be queried from somewhat different
perspectives. Items assessed four epistemological belief factors including beliefs about the
ability to learn (ranging from fixed at birth to improvable), structure of knowledge (ranging from
isolated pieces to integrated concepts), speed of learning (ranging from quick learning to gradual
learning), and stability of knowledge (ranging from certain knowledge to changing knowledge).
Schommer (1990) has reported reliability and validity testing for the Epistemological
Questionnaire; the instrument reliably measures adolescents' and adults' epistemological beliefs
and yields a four-factor model of epistemology. Schommer (1993) has reported test-retest
reliability of .74. Factor analyses were conducted on the mean for each subset, rather than at the
item level.
the doing of an activity" (p. 138). Interests may be referred to as instrumental means to an end
independent of perceived importance (Savickas, 1999).
According to Holland’s theory, there are six vocational interest types. Each of these six
types and their accompanying definitions are presented below:
Realistic People with Realistic interests like work activities that include practical,
hands-on problems and solutions. They enjoy dealing with plants, animals,
and real-world materials like wood, tools, and machinery. They enjoy outside
work. Often people with Realistic interests do not like occupations that
mainly involve doing paperwork or working closely with others.
Investigative People with Investigative interests like work activities that have to do with
ideas and thinking more than with physical activity. They like to search for
facts and figure out problems mentally rather than to persuade or lead people.
Artistic People with Artistic interests like work activities that deal with the artistic
side of things, such as forms, designs, and patterns. They like self-expression
in their work. They prefer settings where work can be done without
following a clear set of rules.
Social People with Social interests like work activities that assist others and
promote learning and personal development. They prefer to communicate
more than to work with objects, machines, or data. They like to teach, to give
advice, to help, or otherwise be of service to people.
Enterprising People with Enterprising interests like work activities that have to do with
starting up and carrying out projects, especially business ventures. They like
persuading and leading people and making decisions. They like taking risks
for profit. These people prefer action rather than thought.
Conventional People with Conventional interests like work activities that follow set
procedures and routines. They prefer working with data and detail rather than
with ideas. They prefer work in which there are precise standards rather than
work in which you have to judge things by yourself. These people like
working where the lines of authority are clear.
Examples of affective measures of interest are the Strong-Campbell Interest Inventory and the Strong Interest Inventory (SII), the Jackson Vocational Interest Inventory, the Guilford-Zimmerman Interest Inventory, and the Kuder Occupational Interest Survey. For a list of vocational interest tests, visit the site: http://www.yorku.ca/psycentr/tests/voc.html.
Values. Values refer to “the principles and fundamental convictions which act as general guides to behavior, the standards by which particular actions are judged to be good or desirable” (Halstead & Taylor, 2000, p. 169). These values are used as guiding principles to act and to justify actions accordingly (Knafo & Schwartz, 2003). Values are internalized and learned at an early stage in life. The school setting is one major avenue where people show how values are learned, respected, and upheld. A student who values education in school is provided with opportunities to behave in ways that allow him or her to do well in school and thus attain the values of hard work, perseverance, and diligence in academic-related tasks. Examples of values are diligence, respect for authority, emotional restraint, filial piety, and humility.
Dispositions. The National Council for Accreditation of Teacher Education (2001) defines dispositions as the values, commitments, and professional ethics that influence behaviors toward students, families, colleagues, and communities and affect student learning, motivation, and development, as well as the educator's own professional growth. Dispositions are guided by beliefs and attitudes related to values such as caring, fairness, honesty, responsibility, and social justice. Examples of dispositions include fairness, being democratic, empathy, enthusiasm, thoughtfulness, and respectfulness. Disposition measures have also been created for metacognition, self-regulation, self-efficacy, approaches to learning, and critical thinking.
Activity
Use the internet and give examples of affective scales under each of the following areas.
Lesson 2
Steps in Constructing Non-Cognitive Measures
The construction of a scale begins with clearly identifying what construct needs to be measured. A scale is constructed when (1) no scales are available to measure the construct, (2) existing scales are foreign and not suitable for the stakeholders or sample that will take the measure, (3) existing measures are not appropriate for the purpose of assessment, or (4) the test developer intends to explore the underlying factors of a construct and eventually confirm them. Once the purpose of developing a scale is clear, the test developer decides what type of questionnaire to use: whether the measure will assess an attitude, belief, interest, value, or disposition.
When the specific construct is clearly framed, it is very important that the test developer search for relevant literature from different studies involving the construct intended to be measured. What is needed from the literature review is the definition that the test developer wants to adopt and whether the construct has underlying factors. The definition and its underlying factors are the major basis for the test developer to later write the items. A thorough literature review helps the test developer provide a conceptual framework as a basis for the construct being measured. The framework can come in the form of theories, principles, models, or a taxonomy that the test developer can use as a basis for hypothesizing factors of the construct intended to be measured. Thorough knowledge of the literature about a construct helps the researcher identify different perspectives on how the factors were arrived at and possible problems with the application of these factors across different groups. This will help the test developer justify the purpose of constructing the scale.
When the construct and its underlying factors or subscales are established through a thorough literature review, a plan for the scale needs to be designed. The plan starts with creating a Table of Specifications, which indicates the number of items for each subscale, the items phrased in positive and negative statements, and the response format; a simple sketch of such a table follows.
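A Table of Specifications for a scale can be kept as simple as a per-subscale tally. The minimal Python sketch below is only an illustration; the subscale names and item counts are hypothetical.

```python
# Hypothetical Table of Specifications for a two-subscale measure.
table_of_specifications = {
    "response_format": "5-point Likert (1 = strongly disagree ... 5 = strongly agree)",
    "subscales": {
        "Assertiveness":             {"positive_items": 8, "negative_items": 4},
        "Intellectual independence": {"positive_items": 7, "negative_items": 5},
    },
}

# Total number of items planned for the preliminary form.
total_items = sum(spec["positive_items"] + spec["negative_items"]
                  for spec in table_of_specifications["subscales"].values())
print(total_items)  # 24
```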
The test developer uses the definitions provided in the framework to write the preliminary items of the scale. Items are created for each subscale as guided by the conceptual definition, and the number of items planned in the Table of Specifications is also considered. As much as possible, a large number of items is written so that the behavior being measured is well represented. To help the test developer write items, a well-represented set of behaviors manifesting the construct should be covered. Qualitative studies reporting specific responses are very helpful in writing items. An open-ended survey, focus group discussion, or interviews can be conducted in order to come up with statements that can be used to write items.
When these methods are employed at the start of item writing, the questions generally seek specific behavioral manifestations of the subscales intended to be measured. An example is the study of Magno and Mamauag (2008), who created the “Best Engineering Traits” (BET) scale, which measures dispositions of engineering students in the areas of assertiveness, intellectual independence, practical inclination, and analytical interest. The items in this scale were based on an open-ended survey conducted among engineering students. The survey asked the following questions:
5. What do you think are other personality traits or characteristics that would make you an
effective engineer?
Examples of item statements generated from the survey responses are as follows:
Notice that the item statements begin with the pronoun “I.” This indicates self-referencing for the respondents when they answer the items. Items 1 and 2 in the example are stated positively while items 3 and 4 are stated negatively. This ensures that respondents are consistent with their answers within a subscale, where the items should be responded to in the same way. For negative items, the responses are reverse-scored so that they are consistent with the positive items, as in the sketch below.
Most people favor the death penalty. What do you think? (Leading question)
Select a scaling technique
After writing the items, the test developer decides on the appropriate response format to be used in the scale. The most common response formats used in scales are the Likert scale (measure of position on an opinion), the verbal frequency scale (measure of a habit), the ordinal scale (ordering of responses), and the linear numeric scale (judging a single dimension in an array). A detailed description of each scaling technique is presented in the next lesson.
It is important that directions or instructions for the target respondents be created as early as when the items are created. When writing instructions, it is very important that they are clear and concise. Respondents should be informed how to answer. If you intend to have a separate answer sheet, make sure to inform the respondents about it in the instructions. Instructions should also include how to change answers and how to answer (encircle, check, or shade). Tell the respondents in the instructions specifically what they need to do.
The following are the instructions formulated for the BET:
This is an inventory to find out your suitability to further study Engineering. This can help guide you in
your pursuit of an academic life. The inventory attempts to assess what interests and strategies you have
learned or acquired over the years as a result of your study.
In the inventory, you will find statements describing various interests and strategies one acquires through
years of schooling and other learning experiences. Indicate the extent of your agreement or disagreement
to each of these statements by using the following scale:
There are no right or wrong answers here. You either AGREE or DISAGREE with the statement. It is
best if you do not think about each item too long --- just answer this test as quickly as you can, BUT
please DO NOT OMIT answering any item.
DO NOT WRITE OR MAKE ANY MARKS ON THE TEST BOOKLET. All answers are to be written on
your answer sheet.
Ensure that you have filled out your answer sheet properly and legibly for your name, school, date of
birth, age, and gender.
Be sure also that you have copied your test booklet number correctly on the space provided in your answer sheet. Do not turn the page until you are told to do so.
You have a total of 40 minutes to finish this whole test. Do not spend a lot of time in any one item.
Answer all items as truthfully and honestly as you can.
Notice that the instructions start with the purpose of the test. This is done to dispel any misconceptions that the respondents may have about the test. The instructions then describe the kind of items expected in the test, and the respondent is told how to answer the items. The scaling technique is also provided. The respondents are reminded that there are no right or wrong answers, to avoid faking good or bad on the test. The respondents are also reminded about practical matters, such as not making any marks on the test booklet, the use of answer sheets, answering all items, and the time allotment. As much as possible, detailed instructions are provided to avoid any problems.
For achievement tests and teacher-made tests, this procedure is called content validation. For affective measures, however, it is difficult to conduct content validation because there is no defined content area for an affective variable; the definition and behavioral manifestations from empirical reports can qualify as the areas measured. Instead, the items are reviewed against the definition or framework provided: whether they are relevant, whether they fall outside the confines of the theory or measure something else, whether they are applicable to the target respondents, and whether they need revision for clarity.
Item review is conducted among experts in the content being measured. In the process of item review, the conceptual definition of the constructs is provided together with the constructed items to guide the reviewer and ensure that the items are properly framed. It is also necessary to arrange the items according to the subscale where each belongs so that the reviewer can easily evaluate the appropriateness of the items in that subscale. A suggested format for item review is shown below:
When giving items for review, the test developer should write a formal letter to the reviewer and indicate specifically how the review should be done. Indicate specifically if you also intend the reviewer to check the grammar of the statements, because most reviewers would otherwise focus only on the content and its fit to the definition.
After the items have been reviewed, expect several corrections and comments. Several comments indicate that the items will be better because they have been thoroughly studied and critiqued. In fact, many comments should be appreciated more than a few, because they mean the reviewers are offering better ways to fix and reconstruct your items. At this stage, it is necessary to consider the suggestions and comments provided by the reviewer. If there are things that are not clear to you, do not hesitate to go back and ask the reviewer once more. This will ensure that the items are better when the final form of the scale is assembled.
Preparing the items for pilot testing requires laying out the test for the respondents. The general format of the scale should emphasize making it as easy as possible to use. Each item can be identified with a number or a letter to facilitate scoring of responses later. The items should be structured for readability and for recording responses. Whenever possible, items with the same response formats are placed together. In designing self-administered scales, it is suggested to make them visually appealing to increase the response rate. The items should be self-explanatory, and the respondents should be able to complete them in a short time. In ordering the items, note that the first few questions set the tone for the rest of the items and determine how willingly and conscientiously respondents will work on subsequent questions.
Before the actual pilot test, the items can first be administered to at least 3 respondents who belong to the target sample; observe which areas take them long to answer and whether the instructions are clearly followed. A retrospective verbal report can be conducted after the participants answer the scale to clarify any difficulties that arose in answering the items.
In the actual pilot testing, the scale is administered to a large sample (e.g., N = 320). The ideal sample size is at least three times the total number of items: if there are 100 items in the scale, the ideal sample size would be 300 or more. Having a large number of respondents makes the responses more representative of the characteristic being measured, and a large sample tends to make the distribution of the scores approach normality.
In administering a scale, proper testing conditions should be maintained, such as the absence of distractions, a comfortable room temperature, proper lighting, and other aspects that could otherwise cause large measurement errors.
The responses to the scale should be recorded in a spreadsheet, and the numerical responses are then analyzed. The analysis consists of determining whether the test is reliable and valid; techniques for establishing validity and reliability are explained in Chapter 3. If the test developer intends to use parallel forms or test-retest, then two time frames would be set in the design of the testing.
The analysis of items indicates whether the test as a whole, and the individual items, are valid and reliable. If principal components analysis is conducted, each item will have a corresponding factor loading, and items that do not load highly on any factor are removed from the item pool. Items whose removal would increase the Cronbach’s alpha reliability of the test are likewise flagged. These techniques suggest removing certain items to improve the indices of reliability and validity of the test, which implies that a new form is produced in line with the results of the item analysis. That is why a large pool of items is needed to begin with: not all items will be accepted in the final form of the test.
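As a rough sketch of the reliability side of this item analysis, the Python code below computes Cronbach's alpha and the alpha obtained when each item is deleted; it assumes a respondents-by-items score matrix and is illustrative, not the procedure used in any particular study.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def alpha_if_item_deleted(scores: np.ndarray) -> list[float]:
    """Alpha recomputed with each item removed; a value above the full-scale
    alpha flags an item whose removal would improve reliability."""
    return [cronbach_alpha(np.delete(scores, j, axis=1))
            for j in range(scores.shape[1])]
```

Items flagged by low factor loadings or by the alpha-if-deleted check become candidates for removal before the next administration.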
The instrument is then revised: items with low factor loadings are removed, and items whose removal would increase Cronbach’s alpha are also considered for removal. In the process of principal components analysis, even though the test developer has proposed a set of factors, these factors may not hold true because the items may group differently. The test developer then thinks of new factor labels for the new grouping of items. These cases require the test developer to revise the items and come up with another revised form. This revised form is again administered to another large sample to collect evidence of the scale being valid and reliable.
For the final pilot data gathering, a large sample is again selected, three times the number of items. The sample should have the same characteristics as the first pilot sample. The data gathered serve to establish the final estimates of the test’s validity and reliability.
The validity and reliability are again analyzed using the new pilot data. The test developer wants to determine whether the same factors will still be formed and whether the test will still show the same index of reliability.
Edit the questionnaire and specify the procedures for its use
Items with low factor loadings are again removed, resulting in fewer items. A new form of the test with the reduced set of items is assembled; the remaining items have evidence of good factor loadings. The final form of the test can now be formed.
The test manual indicates the purpose of the test, instructions for administering it, the procedure for scoring, and how to interpret the scores, including the norms. Establishing norms will be fully discussed in the next chapter.
Think of a construct that you want to study for a research or for your thesis in
the future. Follow the steps in test construction in developing the scale.
Lesson 3
Response Formats
This lesson presents the different scaling techniques used in tests, questionnaires, and inventories. The important assumption behind putting scales on tests and questionnaires is that they provide quantities that can be analyzed and interpreted statistically. One characteristic of research is that its variables should be measurable; through scales we are able to measure and quantify the concepts under study. Scales also enable the results to be analyzed with mathematical formulas to arrive at quantitative results.
The scaling techniques discussed here can be categorized according to the levels of measurement: nominal, ordinal, interval, and ratio. In some references, the scaling techniques come in conjunction with the levels of measurement. The levels of measurement are mentioned here to treat them as a separate topic and to show how they relate to the scaling techniques.
According to Bailey (1996), scaling is a process of assigning numbers or symbols to various levels of a particular concept that we wish to measure. Scales can be either open-ended or close-ended. For open-ended questions, scales refer to the criteria set in order to effectively and objectively assess the information presented. For close-ended questions, scales refer to response formats for certain concepts and statements. Varieties of these scales, serving as response formats on tests and questionnaires, are presented in this lesson.
Before presenting the varieties of scaling techniques, the following should be remembered as a framework for the discussion: the scaling techniques are classified into categories based on the type of question in which they are used. These categories are scaling techniques for multiple choice questions, conventional scale types used for measuring behavior on questionnaires, scale combinations, nonverbal scales for questions requiring illustrations, and social scaling for obtaining the profile of a group (Alreck & Settle, 1995).
Multiple choice questions are common and known for being simple and versatile. They can be used to measure mental ability and a variety of behavioral patterns. They are ideal for responses that fall into discrete categories. When the answers can be expressed as numbers, a direct question should be used, and the number of units should be recorded.
Please check any type of food that you regularly eat in the cafeteria.
___ Hamburger
___ Pasta
___ Soup
___ Fried chicken
___ French fries
2. Single-Response Item
In this scaling technique one alternative is singled out from among several by the
respondent. The item is still multiple choice but only one response is required. Single response
items can be used only when (1) the choice criterion is clearly stated and (2) the criterion
actually defines a single category.
What kind of food do you most often eat in the cafeteria? (Check only one)
___ Hamburger
___ Pasta
___ Soup
___ Fried chicken
___ French fries
These types of scales are commonly used for surveys. Every information need or survey question can be scaled effectively with one or more of these scales. One should remember that the choice of scaling technique is a matter of selecting among the conventional scales.
3. Likert Scale
The Likert scale is used to obtain people’s positions on certain issues or conclusions. This is a form of opinion or attitude measurement in which the respondents’ degree of agreement or disagreement with an issue or opinion is obtained.
The advantages of this scale include flexibility, economy, and ease of composition. The procedure is flexible because items can be only a few words long, or they can consist of several lines. The method is economical because one set of instructions and one scale can serve many items. The respondent can quickly and easily complete the items.
Also, the Likert scale makes it possible to obtain a summated value. Besides obtaining the results of each item, a total score can be obtained from a set of items. The total value serves as an index of attitude toward the major issue as a whole, as sketched below.
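A minimal sketch of a summated Likert value in Python; the responses are hypothetical, and the 1-to-5 scale follows the example below, where 1 means strongly agree (so a lower total indicates a more favorable attitude).

```python
# Hypothetical responses of one student to a five-item Likert scale
# (1 = strongly agree ... 5 = strongly disagree, as in the example below).
responses = [2, 1, 3, 2, 1]

summated_value = sum(responses)             # index of the overall attitude
item_mean = summated_value / len(responses)
print(summated_value, round(item_mean, 1))  # 9 1.8
```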
Please pick a number from the scale to show how much you agree or disagree with each statement and
jot it in the space to the left of the item.
Scale
1 Strongly agree
2 Agree
3 Neutral
4 Disagree
5 Strongly disagree
4. Verbal Frequency Scale
Please pick a number from the scale to show how often you do each of the things listed below and jot it in the space at the left.
Scale
1 Always
2 Often
3 Sometimes
4 Rarely
5 Never
5. Ordinal Scale
The ordinal scale is also a multiple choice item, but the response alternatives do not stand in any fixed relationship with one another; rather, they define an ordered sequence. The responses are ordinal because each category listed comes before the next one in the sequence.
The principal advantage of the ordinal scale is the ability to obtain a measure relative to some other benchmark. The order is the major focus, not simply the chronology.
Ordinarily, when would you or someone in your family read a pocket book at home on a weekday? (Please check only one)
Please rank the books listed below in their order of your preference. Jot the number 1 next to the one you
prefer most, number 2 by your second choice, and so forth.
For each pair of study skills listed below, please put a check mark by the one you most prefer, if you had
to choose between the two.
___ Memorizing
___ Graphic organizer
8. Comparative Scale
The comparative scale is appropriate when making comparisons between one object and one or more others. With this type of scale, one entity can be used as the standard or benchmark by which several others are judged.
The advantage of this scale is that no absolute standard is presented or required; all evaluations are made on a comparative basis, and ratings are all relative to the standard or benchmark used. When no absolute standard exists, the comparative scale approach is applicable. Another advantage is its flexibility: the same two entities can be compared on several dimensions or criteria, and several different entities can be compared with the standard.
The comparative scale is used when the research interest is in comparing one’s own sponsor’s store, brand, institution, organization, candidate, or individual with competing others.
According to Alreck and Settle (1995), comparative scales are more powerful in several ways: they present an easy, simple task to the respondent, ensuring cooperation and accuracy; they provide interval data, rather than only the ordinal values that rankings do; they permit several things that have been compared to the same standard to be compared with one another; and economy of space and time is inherent in them.
Compared to the previous teacher, the new one is… (Check one space)
1 2 3 4 5
Scale
___ 1. Directress
___ 2. Principal
___ 3. Teachers
___ 4. Academic Coordinator
___ 5. Discipline officer
___ 6. Cashier
___ 7. Registrar
___ 8. Librarian
___ 9. Janitor
Please put a check mark in the space on the line below to show your opinion about the school guidance
counselor
Please put a check mark in the space in front of any word that describes your school.
Please pick a number from the scale to show how well each word or phrase below describes your teacher
and jot it in the space in front of each item.
Scale
Not at all 1 2 3 4 5 6 7 Perfectly
Of the last 10 times you went to the library, how many times did you visit each of the following library sections?
___ Reference
___ Periodical
___ Circulation
___ Filipiniana
___ Other (What? __________________)
SCALE COMBINATION
Scale combinations list items together in the same format, sharing a common scale. This saves valuable questionnaire space, reduces the response task, and facilitates recording. The respondents mentally carry the same frame of reference and judgment criteria from one item to the next, so the data are closely comparable.
Several colleges and universities are listed below. Please indicate how safe or risky their location is by circling
the number beside it. If you feel it's very safe, circle a number towards the left. If you feel it's very risky, circle
one towards the right; and if you think it's somewhere in between, circle a number from the middle range that
indicates your opinion.
The table below lists 3 universities, and several characteristics of universities along the left side. Please
take one university at a time. Working down the column, pick a number from the scale indicating your
evaluation of each characteristic and jot it on the space in the column below the university and to the right
of the characteristic. Please fill in every space, giving your rating for each university on each
characteristic.
Scale
Very Poor 1 2 3 4 5 6 Excellent
Please list the ages of all those in your class in the spaces below. Jot the ages of the boys in the top
circles and the ages of the girls in the bottom circles.
♂ ♂ ♂ ♂ ♂ ♂
Boys
♀ ♀ ♀ ♀ ♀ ♀
Girls
NONVERBAL SCALES
Nonverbal scales take the form of pictures and graphs to obtain the data. They are
useful for respondents who have limited ability to read or to understand numeric scales. The scale
extremes visually represent none and all (or total). Picture and graphic scales are most often used
only for personal interview surveys because they are designed for a special need.
Which of the faces indicates your feeling about your math course?
[row of five faces, scored 5 4 3 2 1]
SOCIAL SCALING
20. Sociogram
The sociogram is a graphic representation of sociometric data. In a sociogram, each
individual is represented by an illustrative symbol. The symbols are then connected by arrows
that describe the relationships among the individuals involved. Those chosen most often are
referred to as stars, those not chosen by others are called isolates, and the small groups made up
of individuals who choose one another are called cliques (Best & Kahn, 1990).
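To make the tallying behind a sociogram concrete, here is a minimal sketch in Python. It is our own illustration, not part of Best and Kahn's treatment; the class roster and choices are invented.

# A minimal sketch of tallying sociometric choices. Names and data are
# hypothetical. "Stars" are the most-chosen students, "isolates" are
# never chosen, and mutual choices are the building blocks of cliques.
choices = {
    "Ana":  ["Ben", "Cara"],
    "Ben":  ["Ana"],
    "Cara": ["Ana", "Ben"],
    "Dino": ["Ben"],
    "Ella": [],                      # Ella chose no one
}

# Count how often each student is chosen by classmates.
times_chosen = {name: 0 for name in choices}
for chooser, chosen in choices.items():
    for name in chosen:
        times_chosen[name] += 1

stars = [n for n, c in times_chosen.items() if c == max(times_chosen.values())]
isolates = [n for n, c in times_chosen.items() if c == 0]

# Reciprocal (mutual) choices suggest cliques; each pair is listed once.
mutual_pairs = [(a, b) for a in choices for b in choices[a]
                if a < b and a in choices.get(b, [])]

print("Stars:", stars)          # ['Ben']
print("Isolates:", isolates)    # ['Dino', 'Ella']
print("Mutual pairs:", mutual_pairs)

In an actual sociogram these tallies are drawn as symbols connected by arrows, but the underlying counts are the same.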
Some scales are easily identified as potentially useful for a given information need or
question, while others are clearly inappropriate. The following guidelines help in choosing among them:
1. Keep it simple. The less complex scale should be used. Even after identifying a workable scale,
consider an easier and simpler one.
2. Respect the respondent. Select scales that make the task as quick and easy as possible for
respondents; this reduces non-response bias and improves accuracy.
3. Dimension the response. The dimensions respondents have in mind are not always common to
one another, so some commonality must be discovered. Dimensions must not be obscure or difficult,
and they should parallel respondents' thinking.
4. Pick the denominations. Always use the denominations that are best for respondents. The
data can later be converted to the denominations sought by information users.
5. Choose the range. Categories or scale increments should be about the same breadth as those
ordinarily used by respondents.
6. Group only when required. Never put things into categories when they can easily be
expressed in numeric terms.
7. Handle neutrality carefully. If respondents genuinely have no preference, they'll resent the
forced choice inherent in a scale with an even number of alternatives. If feelings aren't
especially strong, an odd number of scale points may result in fence-riding or piling up at the
midpoint, even when some preference exists.
233
8. State instructions clearly. Even the least capable respondents must be able to understand. Use
language that’s typical of the respondents. Explain exactly what the respondent should do
and the task sequence they should follow. List the criteria by which they should judge and
use an example or practice if there is any doubt.
9. Always be flexible. The scaling techniques can be modified to fit the task and the
respondents.
10. Pilot test the scales. Individual scales can be checked with a few typical respondents.
References
Alreck, P. L., & Settle, R. B. (1995). The survey research handbook (2nd ed.). Chicago: Irwin
Professional Books.
Anderson, L. W. (1981). Assessing affective characteristics in the schools. Boston: Allyn and
Bacon.
Bailey, K. D. (1995). Methods of social research (4th ed.). New York: Macmillan.
Best, J. W., & Kahn, J. V. (1995). Research in education (6th ed.). New Jersey: Prentice Hall.
Dilts, R. B. (1999). Sleight of mouth: The magic of conversational belief change. Capitola, CA:
Meta Publications.
Halstead, J. M., & Taylor, M. J. (2000). Learning and teaching about values: A review of recent
research. Cambridge Journal of Education, 30, 169-203.
Knafo, A., & Schwartz, S. H. (2003). Parenting and adolescents' accuracy in perceiving parental
values. Child Development, 74(2), 595-611.
Meece, J., Parsons, J., Kaczala, C., Goff, S., & Futterman, R. (1982). Sex differences in math
achievement: Toward a model of academic choice. Psychological Bulletin, 91, 324-348.
National Council for Accreditation of Teacher Education. (2001). Professional standards for the
accreditation of schools, colleges, and departments of education. Washington, DC: Author.
234
Overmier, J. B., & Lawry, J. A. (1979). Conditioning and the mediation of behavior. In G. H.
Bower (Ed.), The psychology of learning and motivation (pp. 1-55). New York: Academic Press.
Richardson, V. (1996). The role of attitude and beliefs in learning to teach. In J. Sikula, T.
Buttery, & E. Guyton (Eds.), Handbook of research on teacher education (pp. 102-119). New
York: Macmillan.
Sta. Maria, M., & Magno, C. (2007). Dimensions of Filipino negative emotions. Paper presented
at the 7th Conference of the Asian Association of Social Psychology, July 25-28, 2007, Kota
Kinabalu, Sabah, Malaysia.
Taylor, E. (2003). Making meaning of non-formal education in state and local parks: A park
educator's perspective. In T. R. Ferro (Ed.), Proceedings of the 6th Pennsylvania Association of
Adult Education Research Conference (pp. 125-131). Harrisburg, PA: Temple University.
Taylor, E., & Caldarelli, M. (2004). Teaching beliefs of non-formal environmental educators: A
perspective from state and local parks in the United States. Environmental Education Research,
10, 451-469.
Zimbardo, P. G., & Leippe, M. R. (1991). The psychology of attitude change and social
influence. New York: McGraw Hill.
235
Empirical Report

Development of the Academic Self-Regulation Scale

Carlo Magno
MR Aplaon
Carmine Gañac
Sheena Marie Morales
De La Salle University-Manila

Abstract

Exploratory (EFA) and confirmatory factor analysis (CFA) were conducted to verify the subscales of a
constructed academic self-regulation scale. The subscales were based on Zimmerman and Pons' (1986) model
of self-regulation. Items were written to assess the self-regulatory strategies that students use in an academic
setting. In the EFA, eight factors were extracted that were consistent with the eight factors proposed in the
previous self-regulation model. The CFA showed that self-regulation is indeed composed of eight factors
(goal-setting, organizing and transforming information, self-consequences, seeking information, seeking
social assistance, environmental structuring, rehearsal and mnemonic strategies, and self-evaluation). A
one-factor measurement model with eight components had an adequate fit (χ2=109.68, χ2/df=.5, RMS=.07,
Joreskog GFI=.90, Bentler-Bonnet Normed Fit Index=.90). Each of the subscales of academic self-regulation
showed adequate internal consistencies. Intercorrelations of all eight subscales showed convergence with
significant correlations (p<.01).

Self-regulated learning has been a topic of considerable interest in educational psychology. A
self-regulated strategy is defined as a self-directed process by which learners transform their mental
abilities into academic skills. Self-regulated strategies are also actions directed at acquiring
information or skill that involve agency, purpose or goals, and instrumentality self-perceptions by a
learner. Furthermore, self-regulated learners are generally characterized as active learners who
efficiently manage their own learning experiences in many different ways. They have adaptive learning
goals, are persistent in their efforts to reach those goals, and are proficient at monitoring and, if
necessary, modifying their strategy use in response to shifting task demands. In short, self-regulated
learners are motivated, independent, and metacognitively active participants in their own learning
(Zimmerman, 1990). Several studies have also developed measures for self-regulation. One such measure,
developed by Zimmerman and Martinez-Pons (1986), is the self-regulation interview.

From the existing literature, a number of categories of self-regulated strategies were identified.
The categories were drawn from social learning theory and research (e.g., Bandura, 1982; Schunk, 1984;
Zimmerman, 1983). They included goal-setting, environmental structuring, self-consequences, and
self-evaluating. Several other categories were included on the basis of closely allied theoretical
formulations, namely the strategies of organizing and transforming (Baird, 1983, cited in Zimmerman &
Pons, 1986), seeking and selecting information (Wang, 1983; Baird, 1983, cited in Zimmerman & Pons,
1986), and rehearsal and mnemonic strategies (McCombs, 1984, cited in Zimmerman & Pons, 1986). Also
included are the strategies of seeking social assistance and reviewing previously compiled records such
as class notes and notes on text material (Wang, 1983, cited in Zimmerman & Pons, 1986).

There has been increasing research on the mechanisms through which students regulate their own
motivation and academic learning (e.g., Corno, 1989; Harris, 1990; Zimmerman & Schunk, 1989). In
addition, task-related cognitive and metacognitive strategies such as mnemonic encoding and
self-monitoring have been the center of much research on self-regulated learning. According to
Zimmerman and Pons (1986), the social cognitive theory of academic regulation states that students
regulate the motivational, affective, and social determinants of their intellectual functioning as well
as its cognitive aspects, and that the exercise of self-regulatory skills produces beneficial results.
It is also said that good self-regulators do better academically than poor regulators, even after
controlling for other potentially influential factors.

Moreover, Zimmerman (1981) theorized that human achievement is heavily dependent on the use of
self-regulation, particularly in competitive and evaluative settings. In the earlier research of Bandura
(1982), Schunk (1984), and Zimmerman (1983), academic achievement is also the one realm where
self-regulated learning processes are assumed to be crucial. In the upper grades, success in school is
believed to be highly dependent on student self-regulation, especially in unstructured settings where
studying often occurs. On the other side, Krouse and Krouse (1981) found that a major cause of
underachievement is students' inability to control their own behavior. In addition, Zimmerman and his
colleagues have been interested in learning how students become willing and able to assume
responsibility for controlling or self-regulating their academic achievement. Research also suggests
that learning self-regulating skills can lead to academic achievement and an increased sense of
efficacy.

Self-regulation fosters learning. As indicated by Ertmer, Newby, and MacDougall (1996), students
with high levels of self-regulation possess attributes and skills that would be likely to enhance
performance in a case-based course. It is generally acknowledged that self-regulation is not an
automatic process for all learners. Schunk (1989) stated that self-regulation does not automatically develop
236
as people become older, nor is it passively acquired from the environment.

In the social cognitive theory of Bandura (1986), self-regulation operates through a set of
psychological subfunctions. Zimmerman (1986) added that these include one's self-monitoring of
activities, applying personal standards for judging and directing one's performance, enlisting
self-reactive influences to guide and motivate one's effort, and employing proper strategies to achieve
success.

The present study confirmed the factors of self-regulation using the model of Zimmerman and
Martinez-Pons (1986). This investigation was undertaken to validate a scale for measuring academic
self-regulation. There is a need to construct direct items and confirm the factors because the
instrument developed by Zimmerman and Pons is a structured interview questionnaire in which open-ended
responses are gathered. The instrument they developed is difficult to score, and the data collected are
qualitative, so scoring issues arise. Having direct items to measure academic self-regulatory
strategies is easier and more efficient.

Self-regulation is an action directed at acquiring information or skill that involves agency,
purpose or goals, instrumentality, and self-perceptions by a learner (Zimmerman, 1983). Social
cognitive theory explains how people acquire and maintain certain behavioral patterns while also
providing the basis for intervention strategies (Bandura, 1997). These are the classes underlying
self-regulation theory (Zimmerman & Pons, 1986):

Goal setting. Most theories of self-regulation emphasize its inherent link with goals. A goal
reflects one's purpose and refers to quantity, quality, or rate of performance (Locke & Latham, 1990).
Goal setting involves establishing a standard or objective to serve as the aim of one's actions. Goals
are involved across the different phases of self-regulation: forethought (setting a goal and deciding
on goal strategies); performance control (employing goal-directed actions and monitoring performance);
and self-reflection (evaluating one's goal progress and adjusting strategies to ensure success)
(Zimmerman, 1998). It also involves sequencing, timing, completing, time management, and pacing.

Organizing and transforming information. The use of outlines, summaries, rearrangement of
materials, highlighting, flash cards, and drawing of pictures, diagrams, and webbing or mapping.

Self-consequences. This refers to the person's self-reinforcement, how he or she handles
motivation, the arrangement or imagination of punishments, and delay of gratification.

Seeking and selecting information. This entails where and from whom a person seeks information:
the use of library resources, internet resources, and reviewing cards, and the rereading of records,
tests, and books.

Seeking social assistance. Seeking assistance from peers, teachers, and adults.

Rehearsal and mnemonic strategies. These may be written or verbal, overt or covert. They include
using mnemonic devices, teaching someone else the material, making sample questions, using mental
imagery, and using repetition.

Environmental structuring. Selecting or arranging the physical setting; isolating, eliminating,
or minimizing distractions; and breaking up study periods and spreading them over time.

Self-evaluation. A process of checking quality or progress. It involves task analysis,
self-instructions, enactive feedback, and attentiveness.

Method

Sample

The participants were 110 college students in the initial exploratory analysis. Another sample
of 310 was drawn for confirming the proposed factor structure. The participants were from different
private and public schools in Metro Manila, specifically De La Salle University-Manila, Ateneo de
Manila University, Polytechnic University of the Philippines, University of Sto. Tomas, and Lyceum of
the Philippines. Their ages ranged from 17 to 25 years. Convenience sampling was done by going to the
nearest colleges in Metro Manila and looking for college students who were willing to participate in
the study.

Test Development Design

The measure was constructed by confirming the factors proposed in previous studies on
self-regulation. This was done to verify that self-regulation has eight factors: goal-setting,
organizing and transforming information, seeking and selecting information, seeking social assistance,
self-consequences, rehearsal and mnemonic strategies, environmental structuring, and self-evaluation.

Search for Content Domain

The conceptual definitions of self-regulation and of the factors that compose academic
self-regulation were first established to guide the construction of the items. The conceptual
framework was based on the eight factors of Zimmerman and Martinez-Pons (1986): goal-setting,
organizing and transforming information, self-consequences, seeking information, seeking social
assistance, environmental structuring, rehearsal and mnemonic strategies, and self-evaluation.

Item Writing

The items constructed to make up the scale for academic self-regulation were based on the
definitions of each subscale. The scale was composed of 110 items. The items reflect the specific
strategies that the student uses while engaging in a learning task. The items
237
also show how students engage in different kinds of learning and their methods for studying,
completing their homework, and participating in class.

The constructed items were then distributed according to the factor they fall under. To cluster
the items appropriately, the conceptual definition of each factor was used. The items were modified to
fit the factors they belong to.

Selection of Scaling Technique

The responses of the participants were measured with the use of a four-point Verbal Frequency
Scale. Responses were based on how frequently participants used the particular study strategy
described in each item. This type of scale provides answers in the form of coded data that are
comparable and can be manipulated. The scale points are always, often, rarely, and never. Points were
assigned to each scale point: "always" corresponds to four (4) points, "often" to three (3) points,
"rarely" to two (2) points, and "never" to one (1) point.

Item Review

Experts in the study of self-regulation and scale development checked and reviewed the items
that were constructed. Each item was judged as accepted, rejected, or needing revision. The initial
pool of items reviewed was composed of 21 items for goal-setting, 26 items for organizing and
transforming information, 14 for self-consequences, 15 for seeking and selecting information, 12 for
seeking social assistance, 16 for rehearsal and mnemonic strategies, 12 for environmental structuring,
and 15 for self-evaluation.

Final Form

The items were revised based on the reviews provided. The first draft of the instrument was
composed of 151 items, and the final form was composed of only 110 items. The items in the final form
were then coded (e.g., GS for goal-setting, SE for self-evaluation, RM for rehearsal and mnemonic
strategies) for convenience in encoding the responses. There were 15 items for the goal-setting
factor, 24 items for organizing and transforming, 10 items for self-consequences, 13 items for seeking
information, 8 for seeking social assistance, 15 items for rehearsal and mnemonic strategies, 10 items
for environmental structuring, and 15 items for self-evaluation, for a total of 110 items in the final
form. After the revision of items, the test was administered.

Reliability Analysis

In the data analysis, the reliability of the test instrument was determined. The 110 items were
intercorrelated to show internal consistency through interitem correlation. The Cronbach's alpha of
each subscale was determined to assess the internal consistencies of the items.

Validity Analysis

A confirmatory factor analysis (CFA) was conducted to determine the plausibility of the
one-factor structure with eight components of academic self-regulation. The fit of the hypothesized
one-factor model was assessed by examining several fit indices, including three absolute and one
incremental fit index. The minimum fit function chi-square, the root mean square error of
approximation (RMSEA), and the standardized root mean square residual (SRMR) are absolute fit indices.
The chi-square statistic (χ2) assesses the difference between the sample covariance matrix and the
implied covariance matrix from the hypothesized model (Fan, Thompson, & Wang, 1999). A statistically
non-significant χ2 indicates adequate model fit. Because the χ2 test is very sensitive to large sample
sizes (Hu & Bentler, 1995), additional absolute fit indices were examined. The RMSEA is moderately
sensitive to simple model misspecification and very sensitive to complex model misspecification (Hu &
Bentler, 1998). Hu and Bentler (1999) suggest that values of .06 or less indicate a close fit. The
SRMR is very sensitive to simple model misspecification and moderately sensitive to complex model
misspecification (Hu & Bentler, 1998). Hu and Bentler (1999) suggest that adequate fit is represented
by values of .08 or less.

Results

The 110 items were first reduced into underlying factors by conducting a principal components
analysis with varimax rotation. The eigenvalues showed that 10 factors could be extracted with values
above 1.00, although the ninth and tenth eigenvalues were far from the values of the first eight. This
justifies the appropriateness of an eight-factor scale. The items contained in their respective
factors were maintained. The labels for each factor were also maintained, considering that the items
were grouped as hypothesized. Other items were omitted because of low factor loadings (below 0.4).
Only 77 items remained. The factors were goal setting (22 items), organizing and transforming (9
items), self-consequences (17 items), seeking and selecting information (8 items), seeking social
assistance (5 items), rehearsal and mnemonic strategies (5 items), environmental structuring (3
items), and self-evaluation (8 items). High factor loadings were obtained for each item on its
respective factor.

Using a separate sample (N=310), the confirmatory factor analysis (CFA) verified the eight
subscales of self-regulation in a one-latent-factor solution. All subscales had a significant path
estimate in a one-factor solution for academic self-regulation with adequate fit
238
(χ2=109.68, χ2/df=.5, RMS=.07, Joreskog GFI=.90, Bentler-Bonnet Normed Fit Index=.90). Other
supplementary goodness-of-fit indices were also adequate for a one-factor measurement model solution
(see Tables 1 and 2).

The standard deviation, skewness, kurtosis, CFA parameter, and Cronbach's alpha of each factor
were obtained (see Table 3). The relatively large standard deviations show that the scores are
dispersed, while the skewness and kurtosis values indicate a normal distribution of scores. High
internal consistencies were obtained for each subscale of the instrument, with Cronbach's alpha >.70.

The factor scores of each subscale were also intercorrelated. The intercorrelations all showed
a positive magnitude, where each dimension increases with one another. This means that all eight
factors are internally consistent and converge with one another (see Table 4).

Discussion

…"self-evaluation," which means that a person undergoes a process of checking the quality or
progress of his or her own learning. It involves task analysis, self-instructions, enactive feedback,
and attentiveness.

The factors are reliable given their Cronbach's alpha values of 0.88, 0.90, 0.79, 0.80, 0.72,
0.85, 0.65, and 0.87. This shows that each factor is consistent with its proposed definition. No new
factors emerged from the factor analysis, although the scree plot illustrates possible factors that
could still be drawn out. The items were not reclassified across factors, but some were omitted
because they did not reach a factor loading of 0.4. All factors were accepted, as shown by their
eigenvalues of 1.0 and above.

The value obtained for the overall internal consistency is 0.96, which is considerably high. The
test has just undergone its initial stage, which is only preliminary pilot testing; thus, it is
necessary to further confirm and establish more reliable test properties.

References

Zimmerman, B. J. (1983). Social learning theory: A contextualist account for cognitive functioning.
In C. J. Brainerd (Ed.), Recent advances in cognitive developmental theory (pp. 1-49). New York:
Springer.

Zimmerman, B. (1990). Self-regulated learning and academic achievement: An overview. Educational
Psychologist, 25, 3-17.
Table 1
Goodness of Fit Estimates
Values
Discrepancy Function 0.50
Maximum Residual Cosine 0.00
Maximum Absolute Gradient 0.00
ICSF Criterion 0.00
ICS Criterion 0.00
ML Chi-Square 109.68
Degrees of Freedom 20.00
p-level 0.00
RMS Standard Residual 0.07
Table 2
Single Sample Fit Indices Values
Joreskog GFI 0.87
Joreskog AGFI 0.77
Akaike Information Criterion 0.65
Schwarz’s Bayesian Criterion 0.89
Browne-Cudeck Cross Validation Index 0.65
Independence Model Chi-square 910.58
Independence Model df 28.00
Bentler-Bonnet Normed Fit Index 0.88
Bentler-Bonnet Non-normed Fit Index 0.86
Bentler Comparative Fit Index 0.90
James-Mulaik-Brett Parsimonious Fit Index 0.63
Bollen’s Rho 0.84
Bollen’s Delta 0.90
240
Table 3
Estimates of the Subscales of Self-regulation
Subscale | N | M | SD | Skewness | Kurtosis | Initial Eigenvalues (N=110) | CFA Parameter Estimates (N=310) | Cronbach's Alpha
Goal Setting | 310 | 42.06 | 7.96 | -0.47 | 0.18 | 22.79 | 5.42* | 0.88
Organizing and Transforming Information | 310 | 61.35 | 12.02 | -0.11 | 0.39 | 5.60 | 9.94* | 0.90
Self-consequence | 310 | 30.00 | 5.51 | -0.07 | 0.14 | 4.89 | 4.13* | 0.79
Seeking Information | 310 | 35.68 | 5.66 | -0.18 | -0.18 | 3.03 | 3.81* | 0.80
Seeking Social Assistance | 310 | 21.20 | 3.95 | -0.20 | -0.22 | 3.03 | 2.22* | 0.72
Rehearsal and Mnemonics | 310 | 38.45 | 7.50 | -0.01 | -0.25 | 2.67 | 5.95* | 0.85
Environmental Structuring | 310 | 29.91 | 4.68 | -0.32 | -0.34 | 2.50 | 2.89* | 0.70
Self-evaluating | 310 | 40.59 | 7.69 | 0.09 | -0.01 | 2.29 | 5.87* | 0.87
*p<.05
Table 4
Correlation Matrix of the Factors of Self-regulation
(1) (2) (3) (4) (5) (6) (7) (8)
(1) 1.00
(2) 0.64** 1.00
(3) 0.60** 0.62** 1.00
(4) 0.44** 0.51** 0.61** 1.00
(5) 0.24** 0.43** 0.32** 0.42** 1.00
(6) 0.47** 0.65** 0.54** 0.55** 0.61** 1.00
(7) 0.51** 0.54** 0.54** 0.40** 0.22** 0.42** 1.00
(8) 0.45** 0.63** 0.52** 0.50** 0.53** 0.68** 0.45** 1.00
**Significant at 0.01
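The internal consistencies reported in this study are Cronbach's alpha values. As a rough illustration of how alpha is computed from a respondents-by-items matrix, here is a minimal sketch in Python; this is our own aid, not the authors' procedure, and the four-point Verbal Frequency responses below are invented.

import numpy as np

def cronbach_alpha(item_scores):
    # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-4 Verbal Frequency Scale responses
# (rows = respondents, columns = items of one subscale).
responses = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 3, 4],
]
print(round(cronbach_alpha(responses), 2))

A real analysis like the one above would run this over hundreds of respondents and each of the eight subscales in turn.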
241
Chapter 6
Art of Questioning
Chapter Objectives:
Lessons
1 Functions of Questioning
2 Types of Questions
3 Taxonomic Questioning
4 Practical Considerations in Questioning
242
Lesson 1
Functions of Questioning
Every time we get inside our classrooms and deal with our students in various teaching
and learning circumstances, our ability to ask questions is brought to the fore. Being
intricately embedded in our pedagogies and assessments, questioning is one of the most basic
processes we deal with. But just how appropriate are our questions? To answer this, we need to
discuss the art of questioning.
To begin with, we ask ourselves this fundamental question, “Why do we ask questions?”
From our teaching methods and strategies to our assessments, questioning is inevitable. From the
transmissive to more constructivist approaches of teaching, asking questions is always a
“mainstay” process. To answer this fundamental question, we need to first look into the function
of questioning as it works in ourselves, and then in terms of how it works in the learning
process in general.
As you are reading this chapter, or even the previous chapters of this book, you
effortlessly ask questions. Why is that? How important is that process in our understanding of the
concepts we are trying to learn about? Whenever you ask a question, regardless of whether you just
keep it in mind or express it verbally, you activate your senses and drive your attention to what
you are currently processing. As you engage a reading material, for example, and you ask
questions about what you are reading, you are bringing yourself into a deeper level of the
learning experience where you become more “into the experience.” Obviously, questioning
brings you to the level of focused and engaged learning as you become particularly attentive to
everything that takes place within and around you.
In the classrooms, we ask our students many questions without always being aware that
the kinds of questions we ask make or break students’ deep academic engagement. At this
juncture, therefore, we emphasize the point that, as teachers, just asking questions is not enough
to bring our students to the level of engagement we desire. What matters in this case is the
quality of the questions we ask them. The effects of questioning on our students differ,
depending on how “good” or “bad” our questioning is.
From various studies, we now know that “good” questioning positively affects students’
learning. Teachers’ good questioning boosts students’ classroom engagement because the
atmosphere where good questions are tossed encourages them to push themselves some more
into the state of inquiry. If students feel that questions are interesting, sensible, and important,
they are driven not only to “know more” but also “think more.” Good questioning encourages
deep thinking and higher levels of cognitive processing, which can result in better learning of the
subject matter in focus. One distinct mark of a classroom that employs good questioning is that
students generally participate in a scholarly conversation. This happens because teachers’ good
questioning encourages the same good questioning from the students as they discuss with their
teachers and with each other.
On the contrary, bad questioning distorts the climate of inquiry and investigation. It
undermines the students’ motivation to “know more” and “think more” about the subject matter
in focus. If, for example, a teacher’s question makes the student feels stupid and impossibly
capable of answering, the whole process of questioning leads to a breakdown of students’
243
academic engagement. Indeed, it is important for a teacher to always think of his or her
intentions for tossing questions in the class. Certainly, questions encouraged by a sound motive
will work better than ill-motivated ones.
Lesson 2
Types of Questions
Now that you have explored the kinds of motives that may encourage or
undermine students' learning, it is helpful if you focus on those motives that establish an
atmosphere of inquiry in your classrooms. Focus on those intentions that will allow for the use of
questioning as a tool for deep learning rather than those that embarrass students and discourage
them from engaging your lessons.
However, because teaching is not a trial-and-error endeavor, motives might not be
enough to guide our questioning so that it produces desirable effects on our students' learning. With
the sound motive being the undercurrent of our questioning, we need to also know what types of
questions to ask to engage our students.
Interpretive Question
This type of question calls for students’ interpretation of the subject matter in focus. It
usually asks students to provide missing information or ideas so that the whole concept is
understood. An interpretive question assumes that, as students engage the question, they monitor
their understanding of the consequences of the information or ideas. In a class with primary
graders, a teacher narrated a story about a boy in a dark-blue shirt who was lost in a crowd of
people at a carnival one evening, and whose mother roved around for hours to find him. After
narrating the story, one of the questions the teacher asked her pupils was, "If the boy wore a
bright-colored shirt, what could change in the mother's effort in looking for the boy?" Questions
that call for interpretation of a situation are a powerful tool for activating your students'
analytical ability.
Inference Question
If the question you ask intends that students go beyond available facts or information and
focus on identifying and examining the suggestive clues embedded in the complex network of
facts or information, you may toss up an inference question. After a series of discussions on
the Katipunan revolution in a Philippine history class, the teacher presented a picture that
appeared to capture a perspective of the Katipunan revolution. As the teacher showed the picture,
he asked, "What do you know by looking at this picture?" Having learned about the Katipunan
revolution from its different angles, students were prompted to explore clues that may suggest
certain perspectives of the event, and to focus on a more salient clue that represented one
perspective, such as, for instance, the common people’s struggles during the revolution, or the
bravery of those who fought for the country, or the heroism of its leaders. Inference questions
encourage students to engage in higher-order thinking and organize their knowledge rather than
just randomly fire out bits and pieces of information.
245
Transfer Question
Questioning is one of the processes that affect transfer (Mayer, 2002). Transfer questions
are tools for a specific type of inference where students are asked to take their knowledge to new
contexts, or bring what they already know in one domain to another domain. Questions of this
type bring students to a level of thinking that goes beyond just using their knowledge where it is
used by default. For example, after a lesson on the literary works of Edgar Allan Poe, students
were already familiar with Poe's literary style or approach. So that the teacher can gauge the
students' familiarity with and understanding of Poe's rhetorical "trademark," the teacher thinks of a
literary work from a different source, let us say, one from the long list of fairy tales. Then the
teacher asks a transfer question: "Imagine that Edgar Allan Poe wrote his version of the fairy
tale 'Jack and the Beanstalk.' If you were making a critical review of his version of the story,
what would you expect to see in its rhetorical quality?" This question prompts the students to bring
their knowledge of Poe's rhetorical style to a new domain, that is, a different literary piece with a
different rhetorical quality. It further encourages the students to thresh out only the
relevant knowledge that must be transferred and, therefore, helps them account for their learning
of a subject matter.
Predictive Question
Asking predictive questions allows students to think in the context of a hypothesis.
Through questions of this type, students infer what is likely to happen given the
circumstances at hand. In other words, students are compelled to think about the "what if" of the
phenomenon under study, mindful of the circumstances in focus. This type of question has long
been used in the natural sciences, but is certainly not for their exclusive use. In any subject area,
we can let our students think scientifically. One of the ways to do so is to let them engage our
predictive questions or to drive them to raise the same type of question in the class. Predictive
questions prompt the students to go beyond the default condition and infer on what is likely to
happen if some circumstances change. Here, students make use of higher levels of cognitive
processing as they estimate probabilities.
Metacognitive Question
The types of questions discussed above all focus on students’ cognitive processes. To
bring students into the level of regulation over their own learning, we also need to ask
metacognitive questions. Questions of this type allow students to think about how they are
thinking, and learn about how they are learning your course lessons. Successful learners tend to
show higher level of awareness of how they are thinking and learning. They show clear
understanding of how they struggle with academic tasks, comprehend written texts, solve
problems, or make decisions. A metacognitive question invites students to know how they know,
and, thus, become more aware of the processes that take place within them while they are
thinking and learning. In a math class, for instance, the teacher not only asks the student to solve
a word problem but also to describe how he or she was able to solve it.
246
Lesson 3
Taxonomic Questioning
After trying your best to formulate questions of every type discussed above,
we now bring you to a discussion of planning your questioning in terms of taxonomic
structure. Questions differ not only in terms of types but also in terms of what cognitive
processes are involved based on the taxonomy of learning targets you are using. For our students
to benefit more from our questioning, it is necessary to plan our questioning taxonomically.
In Chapter 2 of this book we learned about the different taxonomic tools for setting your
learning intents or target. These tools also serve as frameworks for planning and constructing
your questions. Because questioning influences the quality of students’ reasoning, the questions
we ask our students to respond to must be pegged on certain levels of cognitive processes
(Chinn, O’Donnell, & Jinks, 2000). For example, Bloom’s taxonomy provides a way of
formulating questions in various levels of thinking, as in the following:
Questions intended for knowledge should encourage recall of information. Such
questions may be What is the capital city of…? or What facts does… tell?
For comprehension, questions should call for understanding of concepts, such as What is
the main idea of…? or Compare the…
Questions at the level of application must encourage the use of information or concept in
a new context, like How would you use…? or Apply… to solve…
If analysis is desired, where students are driven to think critically, the questions must
focus on relationships of concepts and the logic of arguments, such as What is the difference
between…? or How are…and…analogous?
To encourage synthesis, questioning must focus on students’ original thinking and
emergent knowledge, like Based on the information, what could be a good name for…? or What
would…be like if…?
In terms of questioning at the level of evaluation, students are prompted to judge the
ideas or concepts based on certain criteria. Questions may be like Why would you choose…? or
What is the best strategy for…?
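One practical way to keep such stems at hand is a simple look-up table keyed to the levels of the taxonomy. The sketch below is only our own illustration: the stems echo the examples above, while the dictionary structure and function name are hypothetical.

# A minimal sketch of a question-stem bank keyed to Bloom's levels.
bloom_stems = {
    "knowledge":     ["What is the capital city of...?", "What facts does ... tell?"],
    "comprehension": ["What is the main idea of...?", "Compare the..."],
    "application":   ["How would you use...?", "Apply ... to solve ..."],
    "analysis":      ["What is the difference between...?", "How are ... and ... analogous?"],
    "synthesis":     ["Based on the information, what could be a good name for...?",
                      "What would ... be like if ...?"],
    "evaluation":    ["Why would you choose...?", "What is the best strategy for...?"],
}

def stems_for(level):
    # Return the sample stems for a given cognitive level.
    return bloom_stems[level.lower()]

print(stems_for("analysis"))

A teacher planning a lesson could draw one stem per targeted level to make sure the questioning spans the whole taxonomy rather than clustering at recall.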
If you are to use the revised taxonomy where you need to consider both the knowledge
and cognitive process dimensions, it is important that you first identify the knowledge dimension
you wish to focus on, and ask yourself, "What questions will be appropriate for every knowledge
dimension?"
Your clear understanding of the kinds of questions to ask based on the types of
knowledge in focus helps you to categorically focus on any of those types of knowledge,
depending on what is relevant to your teaching and assessment at any given time. After
anchoring your questions into a particular type of knowledge, the next step is to frame your
question so that it conveys the relevant cognitive process needed for a successful learning of the
subject matter. If your focus is factual knowledge, you can toss up different questions that vary
according to the cognitive processes. You can raise a question on factual knowledge that
necessitates the use of recall (remember) or synthesis (create), depending on your learning
intents. You can navigate in the same way across the different levels of cognitive processing
while anchoring on any other type of knowledge.
You may also try out the alternative taxonomic tools discussed in Chapter 2, and see
how you can brush up on your art of questioning while keeping on track towards your
learning intents. When you wish to verify the validity of your questions, always go back to the
conceptual description of the taxonomy. It should be an important process as you build on your
art of questioning so that, aside from its artistic sense, your questioning also becomes scientific
insofar as the teaching-and-learning process is concerned.
249
Lesson 4
Practical Considerations in Questioning
We now give you some tips in questioning. These tips are add-on elements to the items
that have already been discussed in the preceding section of this chapter.
References
Airasian, P. W. (2000). Assessment in the classroom: A concise approach (2nd ed.). USA:
McGraw-Hill Companies.
Chinn, C. A., O’Donnell, A. M., & Jinks, T. S. (2000). The structure of discourse in
collaborative learning. Journal of Experimental Education, 69, 77-97.
Mayer, R. E. (2002). The promise of educational psychology Volume II: Teaching for
meaningful learning. NJ: Merrill Prentice Hall.
251
Chapter 7
Grading Students
Chapter Objectives
Lessons
1 Defining Grading
2 The Purposes of Grading
Feedback
Administrative Purposes
Discovering Exceptionalities
Motivation
3 Rationalizing Grades
Absolute/ Fixed Standards
Norms
Individual Growth
Achievement Relative to Ability
Achievement Relative to Effort
252
Lesson 1
Defining Grading
An effective and efficient way of recording and reporting evaluation results is very important
and useful to the persons concerned in the school setting. Hence, it is very important that students'
progress is recorded and reported to them, their parents, teachers, school administrators,
counselors, and employers as well, because this information is used to guide and motivate
students to learn, to establish cooperation and collaboration between the home and the school, and
to certify students' qualifications for higher educational levels and for employment. In the
educational setting, grades are used to record and report students' progress. Grades are essential
in education in that it is through them that students' learning can be assessed, quantified, and
communicated. Every teacher needs to assign grades based on assessment tools such
as tests, quizzes, projects, and so on. Through these grades, achievement of learning goals can be
communicated to students and parents, teachers, administrators, and counselors. However, it
should be remembered that grades are just one part of communicating student achievement;
therefore, they must be used with additional feedback methods.
According to Hogan (2007), grading implies (a) combining several assessments, (b)
translating the result into some type of scale that has evaluative meaning, and (c) reporting the
result in a formal way. From this definition, we can clearly say that grading is more than the
quantitative values many take it to be; rather, it is a process. Grades are frequently
confused with scores; however, it must be clarified that scores make up the grades. Grades
are what is written in students' report cards, which compile students' progress
and achievement throughout a quarter, a trimester, a semester, or a school year. Grades are
symbols used to convey the overall performance or achievement of a student and they are
frequently used for summative assessments of students. Take for instance two long exams, five
quizzes, and ten homework assignments as requirements for a quarter in a particular subject area.
To arrive at grades, a teacher must be able to combine scores from the different sets of
requirements and compute or translate them according to the assigned weights or percentages.
Then, he/ she should also be able to design effective ways on how he/ she can communicate it
with students, parents, administrators, and others concerned. Another term, though less
commonly used, for the process is marking. Figure 1 shows a graphical summary of the grading
process.
Figure 1. The grading process: scores from several assessments are combined; the combined scores
are TRANSLATED into scales with evaluative meaning; and the resulting grades are REPORTED, that is,
communicated to teachers, students, parents, administrators, etc.
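To make the combine-translate-report sequence concrete, here is a minimal sketch in Python. The requirements echo the earlier example (two long exams, five quizzes, ten homework assignments), but the weights, scores, and letter cut-offs are invented for illustration.

# A minimal sketch of the grading process: scores from several
# requirements are COMBINED using assigned weights, TRANSLATED into a
# scale with evaluative meaning, and then ready to be REPORTED.
# The weights, cut-offs, and scores below are hypothetical.
weights = {"long_exams": 0.40, "quizzes": 0.35, "homework": 0.25}

# Each entry: (points earned, points possible) for that requirement.
scores = {
    "long_exams": (180, 200),   # two long exams
    "quizzes":    (80, 100),    # five quizzes
    "homework":   (90, 100),    # ten homework assignments
}

percent = sum(weights[k] * earned / possible * 100
              for k, (earned, possible) in scores.items())

# Translate the combined percentage into a letter symbol.
if percent >= 90:   grade = "A"
elif percent >= 80: grade = "B"
elif percent >= 70: grade = "C"
else:               grade = "F"

print(f"{percent:.1f}% -> {grade}")   # 86.5% -> B

Note that the letter cut-offs here embody an absolute-standards rationale; the later lesson on rationalizing grades discusses alternatives to fixed cut-offs.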
253
Review Questions:
Lesson 2
The Purposes of Grading
Grading is very important because it serves many purposes. In the educational setting, the
primary purpose of grades is to communicate to parents and students the students' progress and
performance. For teachers, students' grades can serve as an aid in assessing and reflecting on
whether they were effective in implementing their instructional plans, whether their instructional
goals and objectives were met, and so on. Administrators, on the other hand, can use students'
grades for a more general purpose than teachers: they can use grades to
evaluate programs, identify and assess areas that need to be improved, and determine whether or
not the curriculum goals and objectives of the school and the state have been met by the students
through their institution. From the purposes identified, grading in the educational setting can be
sorted into four major purposes.
Feedback
Feedback plays an important role in the field of education in that it provides
information about the students' progress or lack of it. Feedback can be addressed to three distinct
groups concerned in the teaching and learning process: parents, students, and teachers.
Feedback to Parents. Grades, especially those conveyed through report cards, provide critical
feedback to parents about their children's progress in school. Aside from grades in the report
cards, however, feedback can also be obtained from standardized tests and teachers' comments.
Grades also help parents to identify the strengths and weaknesses of their child.
Depending on the format of the report card, parents may also receive feedback about their
children's behavior, conduct, social skills, and other variables that might be included in the report
card. From a general point of view, grades basically tell parents whether their child was able to
perform satisfactorily.
However, parents are not fully aware of the several separate assessments students have
taken that comprise their grades. Parents may see some of these assessments, but not all.
Therefore, students' grades, communicated formally to parents, give parents some assurance that
they are seeing an overall summary of their children's performance in school.
Feedback to Students. Grades are one way of providing feedback to students, in that it
is through grades that students can recognize their strengths and weaknesses. Upon knowing
these strengths and weaknesses, students can further develop their competencies and
improve on their deficiencies. Grades also help students to keep track of their progress and
identify changes in their performance.
Personally, I feel that the weight of this feedback is directly proportional to the age and year
level of the students, such that grades are given more importance and meaning by a high school
student than by a grade one student. However, I believe that the motivation grades can
give is equal across different ages and year levels: grade one students (young ones) are
motivated to get high grades because of external rewards, and high school students (older ones)
are also motivated internally to improve their competencies and performance.
255
Administrative Purposes
Promotion and Retention. Grades can serve as one factor in determining whether a student will
be promoted to the next level or not. Through a student's grades, it can be judged whether he or
she has acquired the skills and competencies required for a certain level and achieved the
curriculum goals and objectives of the school and/or the state. In some schools, a student's
grades are a factor taken into consideration for his/her eligibility in joining extracurricular
activities (performing, theater arts, varsity, cheering squads… etc.). Grades are also used to
qualify a student to enter high school or college in some cases. Other policies may arise
depending on the schools’ internal regulations. At times, failing marks may prohibit a student
from being a part of the varsity team, running for officer, joining school organizations, and some
privileges that students with passing grade get. In some colleges and universities, students who
get passing grades are given priority in enrolling for the succeeding term, as compared to
students who get failing grades.
Placement of Students and Awards. Students' grades can also be used for placement.
Grades are factors considered in placing students according to their competencies and
deficiencies, through which teaching can be more focused in terms of developing the strengths
and improving the weaknesses of students. For example, students who consistently get high,
average, or failing grades are each placed in their own sections, wherein teachers can focus on
and emphasize the students' needs and demands to ensure a more productive teaching-learning
process. Another example, which is more domain specific, would be grouping together students
having the same competency in a certain subject. Through this strategy, students who have high
ability in Science can further improve their knowledge and skills by receiving more complex and
advanced topics and activities at a faster pace, and students having low ability in Science can
receive simpler and more specific topics at a slower pace (but making sure they are able to
acquire the minimum competencies required for that level as prescribed by the state curriculum).
Aside from placement of students, grades are frequently used as a basis for academic awards. Many,
if not almost all, schools, colleges, and universities have honor rolls and dean's lists to recognize
student achievement and performance. Grades also determine graduation awards for the overall
achievement or excellence a student has garnered throughout his/her education, whether in a single
subject or for the whole program taken.
256
Program Evaluation and Improvement. Through the grades of students taking a certain
program, the program's effectiveness can be evaluated to some extent. Grades can be one factor
used in determining whether the program was effective or not. Through the evaluation process,
some factors that might have affected the program's effectiveness can be identified and
minimized to improve the program further for future implementations.
Admission and Selection. Organizations external to the school also use grades as a
reference for admission. When students transfer from one school to another, their grades play a
crucial role in their admission. Most colleges and universities also use students' grades in their
senior year of high school together with the grade they acquire on the entrance exam.
However, grades from academic records and high-stakes tests are not the sole basis for
admission; some colleges and universities also require recommendations from the school,
teachers, and/or counselors about students' behavior and conduct. The use of grades is not
limited to the educational context; grades are also used in employment for job-selection purposes
and at times even by insurance companies, which use grades as a basis for giving discounts on
insurance rates.
Discovering Exceptionalities
Counseling Purposes. It is through students' grades that teachers can be alerted to seek
the assistance of a counselor. For instance, if a student who normally performs well in class
suddenly incurs consecutive failing marks, the teacher who observes this should think and
reflect about the probable reasons that caused the student's performance to deteriorate and
consult the counselor about procedures he or she can follow to help the student. If the
situation requires skills that are beyond the capacity of the teacher, then a referral should be made.
Grades are also used in counseling when personality, ability, achievement, intelligence, and other
standardized tests are being administered.
Motivation
Motivation can be provided through grades; most students study hard in order to acquire
good grades, and once they get good grades, they are motivated to study harder to get even higher
grades. Some students are motivated to get good grades because of their enthusiasm for joining
extracurricular activities, since some schools do not allow students with failing grades to join
such activities.
they have failing grades. There are numerous ways on how grades serve as motivators for
students across different contexts (family, social, personal…etc.). Thus, grades may serve as one
of the many motivators for students.
257
Review Questions:
1. What are the different purposes of grades in the educational context? Explain each.
2. How do grades motivate you as a student?
3. How does feedback affect your performance in school?
Lesson 3
Rationalizing Grades
Attainment of educational goals can be made easier if grades are accurate enough to
convey a clear view of a student's performance and behavior. But the question is, what basis shall
we use in assigning grades? Should we grade students in relation to (a) an absolute standard, (b)
norms or the student's peer group, (c) the individual growth of each student, (d) the ability of
each student, or (e) the effort of the student? Each of these approaches has its own
advantages and disadvantages depending on the situation, the test takers, and the test being used. Teachers
are expected to be skillful in determining when to use a certain approach and when not
to.
Absolute Standards. Using absolute standards as a basis for grades means that students'
achievement is related to a well-defined body of content or a set of skills. This basis is strongly
used in criterion-referenced measurement. An example of a well-defined body of
content would be: "Students will be able to enumerate all the presidents of the Philippines and
the corresponding years they were in service." An example of a set of skills would be something
like: "Students will be able to assemble and disassemble the M16 in 5 minutes." However, this
type of grading system is somewhat questionable when different teachers make and use their
own standards for grading students' performance, since not all teachers have the same set of
standards. Teachers' standards may therefore vary across situations and are subjective
according to their own philosophies, competencies, and internal beliefs about assessing students
and about education in general. Hence, this type of grading system is more appropriate when it
is used in a standardized manner, such that the school administration or the state provides
the standards and makes them uniform for all. An example of a test wherein this type of grading is
appropriate would be a standardized test wherein scales come from established norms and grades
are obtained objectively.
Norms. The grades of students in this type of grading system are related to the performance
of all the others who took the same test: the grade one acquires is based not on a set of
standards but on the performance of all the other individuals who took the same test. This means that
students are evaluated based on what is reasonably expected of a representative group. To further
explain this grading system, take for instance a group of 20 students: the student who gets
the most correct answers, regardless of whether he answered 60% or 90% of the items
correctly, gets a high grade; and the student who gets the fewest correct answers,
regardless of whether he answered 10% or 90% of the items correctly, gets a low grade. It can be
observed in this example that (a) 60% would warrant a high grade if it was the highest among all
the participants who took the test, and (b) 90% could possibly be graded as low if it was the
lowest among all the participants who took the test.
Therefore, this grading system is not advisable when the test is administered to a
heterogeneous group, because results would be extremely high or extremely low. Another
problem with this approach is that teachers often lack the competency to create a norm for a certain test,
which makes them settle for absolute standards as the basis for grading students. This approach
also requires a lot of time and effort to create a norm for a sample. It is
also known as "grading on the curve."
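A rough sketch of what norm-referenced marking implies computationally is shown below; the class scores, group labels, and cut points are invented, and real norm groups would be far larger.

# A rough sketch of norm-referenced ("grading on the curve") marking:
# each student's grade depends on standing within the group, not on a
# fixed standard. Scores and cut points here are hypothetical.
scores = {"Abby": 60, "Brix": 48, "Cely": 75, "Dario": 52, "Ely": 66}

ranked = sorted(scores, key=scores.get, reverse=True)
n = len(ranked)

grades = {}
for rank, student in enumerate(ranked):
    top_fraction = (rank + 1) / n      # e.g., 0.2 = top 20 percent
    if top_fraction <= 0.2:
        grades[student] = "High"
    elif top_fraction <= 0.8:
        grades[student] = "Average"
    else:
        grades[student] = "Low"

print(grades)
# The same raw score of 60 could be "High" in a weak group and
# "Low" in a strong one, which is the core caution raised above.
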
259
Individual Growth. In this type of grading system, the level of improvement is seen as more
relevant than the level of achievement. However, this approach is somewhat
difficult to implement, since growth can only be observed by relating students' grades
prior to instruction to their grades after instruction; hence, pretests and posttests are to
be used in this type of grading system. Another issue with this type of grading system is that it
is very difficult to obtain reliable gain or growth scores even with highly refined instruments. This
system of grading disregards standards and the grades of others who took the test; rather, it uses the
amount of progress a student was able to make to determine whether he or she will receive a high
grade or a low grade. Notice that the initial status of students is required in this type of grading
system.
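As a small illustration of growth-based marking, consider pretest and posttest scores like those of Kyle and Lyra discussed later in this lesson; the gain cut-offs in this sketch are of our own choosing.

# A minimal sketch of growth-based grading: the mark depends on the
# gain from pretest to posttest, not on the final level reached.
# Scores and cut points are hypothetical.
records = {
    "Kyle": {"pre": 20, "post": 70},   # large gain, modest final score
    "Lyra": {"pre": 85, "post": 92},   # small gain, high final score
}

for student, r in records.items():
    gain = r["post"] - r["pre"]
    mark = "High" if gain >= 30 else "Average" if gain >= 10 else "Low"
    print(student, "gain =", gain, "->", mark)
# Kyle gain = 50 -> High; Lyra gain = 7 -> Low
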
Achievement Relative to Effort. Similarly, this type of grading system is relative to the
effort that students exert, such that a student who works really diligently and responsibly,
complying with all assignments and activities, doing extra-credit projects, and so on, should receive
a high grade regardless of the quality of work he was able to produce. On the contrary, a student
who produces good work will not be given a high grade if he did not exert enough
effort. Notice that grades are based merely on effort and not on standards.
As mentioned earlier, each of these approaches to arriving at grades has its own strengths and limitations.

Using absolute standards, one can focus on the achievement of students. However, this approach can fail to state reasonable standards of performance and can therefore be subjective. Another drawback is the difficulty of specifying clear definitions; this difficulty can be reduced, but never eliminated.

The second approach is appealing in that it ensures a realism that is at times lacking in the first approach. It avoids the problem of setting standards too high or too low, and a situation where everyone fails is prevented. However, each student’s grade depends on the others, which is quite unfair. A second drawback is the problem of choosing the relevant comparison group: will it be the students in one class, the students in the school, the students in the state, or the students in the past ten years? A teacher must answer these questions to have a rationale for interpreting achievement in relation to other students. A further difficulty of this approach is the tendency to encourage unhealthy competition; when this happens, students become competitors with one another, which is not a good environment for teaching and learning.
The last three approaches can be clustered because they have similar strengths and weaknesses. Their strength is that they focus on the individual, letting the individual define a standard for himself. However, these three approaches share two drawbacks. One is that the conclusions can seem awkward, if not objectionable. For example, a student who performed poorly but exerted effort gets a high grade, while a student who performed well but exerted less effort gets a lower grade. Or: Ken, with an IQ of 150, gets a lower grade than Tom, with an IQ of 80, because Ken should have performed better, while we were pleasantly amazed by Tom’s performance. Or: Kyle, starting with little knowledge of statistics, learned and progressed a great deal, while Lyra, who was already proficient and knowledgeable in statistics, gained less progress; after the term, Kyle gets the higher grade because he progressed more, although Lyra is clearly the stronger student. Conclusions of this kind make people uncomfortable. The second drawback is reliability. Reliability is hard to obtain when differences are used as the basis for students’ grades. Effort, in particular, is hard to measure and quantify, so it rests on subjective judgments and informal observations. Hence, grades from these three approaches, when combined with achievement, are somewhat unreliable. Table 1 presents a summary of the advantages and disadvantages of the different rationales in grading.
[Table 1: Summary of the advantages and disadvantages of the different rationales in grading]
Review Questions:
References
Brookhart, S. M. (2004). Grading. Upper Saddle River, NJ: Pearson Education.

Popham, W. J. (1998). Classroom assessment: What teachers need to know (2nd ed.). Needham Heights, MA: Allyn & Bacon.
EMPIRICAL REPORT
Do Parents and Teaching Approach Matter in Predicting Students’ Grades?

Aldrich B. Alvaera, Ma. Eloisa Bayan, and Darwin P. Martinez
De La Salle University-Manila

Abstract

The study determined whether teaching approach (teacher-centeredness and learner-centeredness), parental involvement, and parental autonomy can significantly predict students’ grades. With a sample of 382 grade four public school students in Metro Manila, the researchers administered the Teacher-Centered Practices Questionnaire and Learner-Centered Practices Questionnaire to measure the teaching approach of the students’ class adviser, and the Perception of Parents Scale for children to measure how involved and autonomous the students’ fathers and mothers were. The students’ general average grade from the previous grading period was used as a measure of their academic performance. Using stepwise forward regression, only mother involvement was revealed to be a significant predictor of academic achievement. Implications and recommendations are included in the discussion.

The parents’ and teachers’ approaches are important factors that influence students in their school performance. For instance, the way parents take care of their children and the way teachers deal with the students have an influence on the students’ behavior in school. The way parents relate with their children and the way teachers handle their students also help explain a student’s academic performance in school. Despite attempts to improve the learning and achievement of students, there are still issues regarding the outcome of the students’ performance. Results of the National Diagnostic Test (NDT) administered in the year 2002-2003 to first year students showed that of the 1.3 million first year students all over the Philippines, only 18% passed the competency level for English, 10% for Science, and only 8% for Math (Evasco, 2005). This implies a very alarming condition because it shows how low the literacy rate of Filipino students is. Given this issue, there is now a need to improve teaching in the Philippine setting.

Teaching approach has proved to be a factor that predicts a student’s achievement. It is defined as a customary way of teaching, described as either teacher-centered or learner-centered (Lefrancois, 2000). The tradition in the school setting has always been a teacher-centered approach, where the students are just passive receivers of knowledge. The underlying concept of the teacher-centered approach is based on traditional pedagogy wherein knowledge is passed from teacher to children (Katsuko, 1995). However, the trend in schools now is to move away from the teacher-centered approach and adopt a new approach called the learner-centered approach. Unlike a teacher-centered approach, the learner-centered approach does not limit the students to acquiring knowledge solely from their teachers. Instead, they are limited only by their own capabilities as to when and how they will learn (Fadul, 2006). Schools nowadays move towards the learner-centered approach because of the benefits that the new approach offers. The new approach claims that students are more actively involved with the subject matter, are more motivated as learners, and learn more skills, especially discipline, communication, and collaboration skills (Johnson, 2000). The diversity in students’ needs has grown too large for a teacher-centered approach to address (Laboard, 2003).

Despite the trend of moving from a traditional approach to the new learner-centered approach, there are still teachers who believe that the teacher-centered approach is more appropriate in the classroom setting. Biggs (1999) says that this approach is appropriate especially when the teacher, or the transmitter of knowledge, is one who comes from a position of expertise.

Parental influence has also been identified as an important factor affecting student achievement (Halawah, 2006). Previous studies (Boveja, 1998; Gregory, 2006; Halawah, 2006; Weiner, 1974; Wu & Qi, 2006) have found that low achievement has been associated with
students having parents who are less involved in their school work. On the other hand, students who have parents that are more involved with their school work have a higher tendency toward achievement (Rollins & Thomas, 1979).

Previous studies have found that the performance of students benefits most when their parents are highly involved in their school work (Gray & Steinberg, 1999). However, parental autonomy and support have also been proven to be significant predictors of academic achievement (Strage & Brandt, 1999).

Studies have shown that teaching approach, parental involvement, and autonomy predict academic achievement when studied separately. It has been well established that several factors influence the academic achievement of students. However, studies on teaching approach and achievement (Adams & Engelmann, 1996; Gleason, 1995; Spector, 1995; Nelson, 1995; Brent & DiObilda, 1993; Reyes, 2001) fall short of defining which particular paradigm would be most influential towards achievement. Instead, the studies showed how each of them supports or follows a particular learning theory.

The studies presented (Adams & Engelmann, 1996; Gleason, 1995; Spector, 1995; Nelson, 1995; Brent & DiObilda, 1993; Reyes, 2001; Boveja, 1998; Gregory, 2006; Halawah, 2006; Weiner, 1974; Wu & Qi, 2006) were limited to predicting the academic achievement of students without taking into account both the teaching and parenting factors. Hence, the purpose of this study is to bridge the gap between the factors that influence the achievement of students. In the present study the factors of teaching approach (teacher-centered and learner-centered), parental involvement, and parental autonomy were used as predictors of students’ academic achievement as measured by grades.

Parental Involvement, Autonomy and Achievement

Several studies have explored the relationship of parental involvement, autonomy, and achievement (Gray & Steinberg, 1999; Strage & Brandt, 1999). Gray and Steinberg (1999) analyzed the concept of parenting and its specific parts. In this study, they found that parental involvement contributes to every aspect of adolescent development. For instance, when parents were perceived to be more involved, the adolescents became more psychologically rounded and performed better in school. In another study, Strage and Brandt (1999) studied how the academic performance of college students was predicted by parental autonomy. It was revealed that when parents granted more autonomy, demands, and support, the students became more confident, persistent, and positively oriented toward their teachers. In other words, autonomy granting paired with other factors (i.e., demand and support) significantly predicts college students’ overall GPA. Both studies (Gray & Steinberg, 1999; Strage & Brandt, 1999) clearly showed that the performance of students is greatly affected by highly involved parenting and liberal autonomy.

In another study (Grolnick & Deci, 1991; Gray & Steinberg, 1999), the researchers examined a process model of relationships among children’s perception of their parents, their motivation, and their performance in school. The study involved three motivation variables, namely control understanding, perceived competence, and perceived autonomy. With the use of the developed Perception of Parents Scale, 456 children from Grades 3 through 6 participated in the study. The results showed that maternal autonomy support and involvement were positively associated with the three motivation variables. Paternal autonomy and support were related to the three motivation variables as well. Finally, analysis of the results showed that the three motivational factors mediate the child’s academic achievement. Although both mother and father autonomy support and involvement showed significant relationships with motivation, descriptive data from the children’s Perception of Parents Scale revealed that mothers were more
supportive and involved as compared to fathers. Unlike other studies (Gray & Steinberg, 1999; Strage & Brandt, 1999), Grolnick and Deci (1991) were able to show a clearer picture of how parental involvement and autonomy support influenced student performance by measuring mothers and fathers separately. By doing so, the researchers were able to emphasize who between mothers and fathers has more influence on the student’s academic achievement.

Academic Achievement

An article of the North Central Regional Educational Laboratory on achievement used standardized tests as the definition of achievement. The article states that standardized test scores are used to determine how well students are doing in school. This coincides with AbiSamra’s (2000) definition of achievement as the quality and quantity of a student’s work. Meanwhile, Steinberg (1993) discusses achievement as something comprised of ability and performance. Achievement is multidimensional in the sense that it is intricately associated with human growth and cognitive, emotional, social, and physical development. In addition, it reflects the child as a whole as he or she develops across time and levels.

The concept of achievement from the latter articles, as the multidimensional growth and development of the student, is most appropriate for the study. The current study focuses on academic achievement as measured by the general average grade of the student from the previous grading period.

It has been well established how academic achievement is influenced by particular factors. However, previous studies (Grolnick & Deci, 1991; Gray & Steinberg, 1999; Strage & Brandt, 1999) have focused on predicting achievement without considering other factors, such as teaching approach, that would also lead to achievement.

In the studies regarding teaching approach and achievement, related literature revealed how the teacher-centered and learner-centered paradigms work (Hara, 1995). Compared to the studies regarding parenting approach, studies regarding teaching approach did not highlight a particular paradigm that would be most influential towards achievement (Winsler, Madigan & Aquilino, 2006; Alfaro, Umana-Taylor & Bamaca, 2006; Assadi, Zokaei, Kaviani & Mohammadi, 2007; Rollins & Thomas, 1979; Boveja, 1998; Gregory, 2006; Halawah, 2006; Weiner, 1974; Wu & Qi, 2006). Instead, the studies showed how each of them supports or follows a particular learning theory.

The social-cognitive theory (Bandura, 1997) was used as the framework of the study. The theory explains the interaction of parental involvement and autonomy and teaching approach as they predict achievement. The social-cognitive theory explains the interaction between the person and the environment, in which cognitive competencies such as achievement are developed and modified by social influences and structures within the environment, such as parents and teachers. By its definition, two of the major factors of the social-cognitive theory, namely the person and the environment, can elaborate the relation of the variables used in the study. Environment refers to the factors that can affect a person’s behavior, which here are parenting styles and teaching approach.

In the present study, parental involvement, parental autonomy, and teaching approach are environmental factors that can influence the academic achievement of the student. More specifically, a student’s understanding and level of participation can vary depending on the way his parents influence him and on the teacher’s approach to the lesson. Thus, a combination of the teacher’s approach and high parental involvement can explain students’ academic achievement.

Parents’ involvement in the child’s schooling, like assisting the child in doing assignments, explains much of the grade of the child. Studies (Grolnick & Deci, 1991; Gray & Steinberg, 1999) showed that when parents pay more attention to their child’s studies, the
tendency is that the child performs better in school. Gray and Steinberg (1999) concluded that when students feel that their parents are involved with their school work, they become more psychologically rounded and, as a result, perform better in their school work.

Acknowledging individual differences between students and serving as a facilitator also explain the grades of the student. As part of the learner-centered principles, executing such behaviors would show that a student’s teacher is inclined toward the learner-centered approach. On the other hand, a teacher-centered approach is effective for learning basic skills (Snowman & Biehler, 2000).

Research Questions

The study intended to determine whether parental involvement and autonomy (mothers and fathers) and teaching approach can predict public school students’ achievement as measured by the general average grades of students. The study addressed the following questions:

(1) Is there a significant relationship among parental involvement, parental autonomy, and teaching approach with achievement?

(2) Can parental involvement, parental autonomy, learner-centeredness, and teacher-centeredness significantly predict student achievement?

(3) How much does each variable contribute in predicting student achievement?

Method

Research Design

A descriptive design was utilized in the current study. It investigated perceived parental involvement, perceived parental autonomy, and teaching approach as predictors of achievement. The current study focused on the relation of the different variables and described how the relationship of the said variables can predict achievement. Stepwise forward regression was used to further explain how parental involvement, parental autonomy, and teaching approaches can relate with each other to predict achievement.

Participants

A total of 400 grade four students were asked to participate in the study. Only students from the top four classes were included. The sample was recruited from different public schools in Metro Manila, namely Andres Bonifacio Elementary School, Rizal Elementary School, M. Hizon Elementary School, and Francisco Balagtas Elementary School. These public schools have self-contained classes wherein there is only one teacher per class who handles all the subjects. These self-contained classes are used from pre-school to fourth grade in the elementary level. Participants were recruited through purposive sampling, and the inclusion criteria included participants who grew up with at least one parent. Participants who grew up in the absence of both parents were still asked to complete the questionnaires but were no longer included in the analysis of the study. Out of the 400 respondents, 18 students were removed because they did not meet the criteria, and only 382 students were included in the analysis. Their ages ranged from 9 to 11 years old, with a mean age of 9.57. There were 183 males and 199 females.

Instruments

Perception of Parents Scale. The Perception of Parents Scale, Child Scale, is an instrument developed by Grolnick, Deci, and Ryan (1997) which assesses children’s perceptions of the degree to which their parents are autonomy supportive and the degree to which their parents are involved. The scale has 22 items, in which 11 items are about the mother and the same 11 items focus on the father. Factor analysis of the scale has revealed a clear four-factor solution with factors labeled mother involvement (items 1, 3, 5, 9, and 11), mother autonomy support (items 2, 4, 6, 7, 8, and 10), father involvement (items 12, 14, 16, 20, and 22), and father autonomy support (items 13, 15, 17, 18, 19, and 21).
The validity of the scale is .86 for mother autonomy support and .88 for mother involvement. Both father autonomy support and father involvement had a validity of .85.

Learner-Centered Practices Questionnaire. The Learner-Centered Practices Questionnaire (LCPQ) is based on the principles of learner-centered practices by McCombs (1997) (Magno & Sembrano, 2007). The 25-item scale consists of four areas: (1) Positive interpersonal characteristics (items 1 to 5) reflects the capacity to develop positive interpersonal relationships with students and the instructor’s ability to value and respect students; these items have a Cronbach’s alpha of .986, which establishes their internal consistency. (2) Encourages personal challenge (items 6 to 10) focuses on the instructor’s ability to take charge of the students’ learning and obtained an internal consistency of .983 using Cronbach’s alpha. (3) Adopts class learning needs (items 11 to 15) consists of items showing the ability of the teacher to be flexible in addressing students’ needs, with an internal consistency of .975 using Cronbach’s alpha. (4) Facilitates the learning process (items 16 to 19) consists of items reflecting the instructor’s ability to encourage students to monitor their own learning process; these items gathered a .990 internal consistency using Cronbach’s alpha. The overall reliability of the scale is .994, indicating high internal consistency of the items (Magno & Sembrano, 2007).

Teacher-Centered Practices Questionnaire. The Teacher-Centered Practices Questionnaire was adopted from Lefrancois (2000). The 25-item questionnaire was constructed under the areas of (1) direct instruction (item nos. 6, 7, 13, 14, 19, 20, 21, 22, 24, 25), (2) competition within students (item nos. 3, 8, 15, 16), (3) individual work (item nos. 10, 11, 17, 23), and (4) discipline (item nos. 1, 4, 5, 9, 12, 18). The book states that a formal teaching approach is more of a direct instruction from teachers to the students; this approach emphasizes competition, individual work, and discipline. The items from this questionnaire focused on these four factors. The reliability of the scale was analyzed using Cronbach’s alpha and yielded a value of .8968.

Procedure

For the test administration, the administrator gave the general instructions. Afterwards, a researcher passed around an information sheet which contained the following: research ID number, name, age, gender, grew up with mother (yes/no), and grew up with father (yes/no).

The first test administered was the Perception of Parents Scale (POPS), which took 20 minutes. For each item, the administrator read each statement and waited for the participants to finish writing down their answer.

After collecting the POPS, the administrator read the instructions for the Learner-Centered Practices Questionnaire as the other researchers distributed the questionnaires. For each item, the administrator read each statement and waited for the participants to finish writing down their answer. This took about 20 minutes as well. The final questionnaire was the Teacher-Centered Practices Questionnaire. The administrator read the instructions while the other researchers distributed the questionnaires. For each item, the administrator read each statement and waited for the participants to finish writing down their answer. After 20 minutes, the administrator asked the participants to pass their papers.

Data Analysis

The Pearson r was used to intercorrelate teacher-centeredness, learner-centeredness, parental involvement, parental autonomy, and achievement. Multiple regression was used as the main analysis in the study. Stepwise forward multiple regression was used to further investigate whether the factors of parental involvement, parental autonomy, and teaching approach can predict student achievement.
Results

The researchers determined the means and standard deviations of the general average of students, parental involvement, parental autonomy, and teaching approaches. Bivariate correlation was also conducted to determine whether there was a significant relationship among the variables.

Table 2
Stepwise Forward Regression Predicting Student Achievement

Predictor            Standardized Beta   SE of Standardized Beta   Unstandardized B   t(379)   p-level
Mother Involvement   0.11*               0.05                      0.32               2.16     0.03

*p < .05. Note. R = .11, R² = .012, Adjusted R² = .95, SE = 2.87

A t-test for dependent samples was used to compare the means of mother involvement and father involvement in order for the researchers to see which one was higher in terms of involvement. Table 4.1 shows that there was no significant difference between the parents’ involvement in terms of the mean scores. The same method was used to compare the mean scores of mother autonomy and father autonomy to find out which parent was higher in terms of autonomy. In Table 4.2 it was found that there was no significant difference between the mean scores of father autonomy and mother autonomy.

Discussion

Looking at the results of the study, it was found that although the participants came from the top classes of different public schools around Manila, the mean of their general averages was not as high as expected. In addition, it is safe to assume that there are relatively similar grading standards among the schools because the grades of all the students were close to one another. This was also indicated by the low standard deviation.

In determining which variable has a significant relationship with student achievement, mother involvement was significantly related with the students’ academic achievement. Other variables, such as teaching approach, mother autonomy, father involvement, and father autonomy, failed to show a significant relationship with the achievement of the students. The researchers believe that the nonsignificant relationship of teaching approach with student achievement suggests that there is inefficiency or poor quality of teaching in the public schools. Another reason for this is the inadequate training that teachers in the public schools have. The deteriorating quality of public education, especially the continuous decline of students’ performance in National Achievement Tests, is believed to be an effect of the poor quality of teachers in public school classrooms (Hicap, 2006a). It was even mentioned that teachers in public schools are so bad that they teach English subjects in Filipino due to their inadequate English speaking skills (Hicap, 2006b). In an article by Hicap (2006a), he discussed specific issues and steps which the Department of Education should take in order to improve the quality of public education in the country, one of which is allotting more funds for re-training of teachers. The problem of re-training teachers has been an issue for a long time. Sutaria (1990) cited that there is a need to strengthen teachers’ competence in teaching to maximize learning for poorly performing students. The above proves that teachers fail to maximize the potential of learners due to their incompetence in applying teaching strategies.

Although there was no significant relationship between the teaching approaches and achievement, Table 2 shows a significant relationship between teacher-centeredness and learner-centeredness. The significant correlation between the two approaches explains why the mean scores for both scales were close to each other (2.27 for teacher-centeredness and 2.38 for learner-centeredness). The relationship between ratings for both teaching approaches can be explained by the fact that behaviors of Filipinos are situation specific (Markus & Kitayama, 1991). The teachers utilize both teacher-centered and learner-centered practices according to the type of situation or the kind of student that they have. It was also because of this same reason that all the other factors were significantly correlated with the teaching approaches. Father involvement and father autonomy also did not have a significant relationship with the achievement of students. This could be attributed to the fact that Filipino fathers have a more procreative notion of parenting (Tan, 1989). Basically, Filipino fathers equate fatherhood with the biological aspect and their responsibility of providing for their family. Sadly, this approach does not go beyond those facets. Table 2 also showed a significant relationship
between all the parental factors and the teaching approaches. Again, due to the nature of Filipino teachers utilizing both teacher-centered and learner-centered practices, all the other variables yielded significant relationships with them.

The only variable that showed a significant relationship with the achievement of students was mother involvement. This finding supports other studies (Bogenshneider, 1997; Grolnick & Slowiaczek, 1994) finding that mother involvement influences student achievement. It coincides with the fact that mothers are the primary caretakers of children (Mendez & Jocano, 1979; Licuanan, 1979; Lagmay, 1983; Mindoza, Botor & Tablante, 1984; UP CHE, 1958) and are responsible for supervising the studies of their children. Decisions made regarding the child’s daily routine, health, and schooling are also attributed to Filipino mothers (UP CHE, 1958). Mothers were thought to be more involved in the sense that they were perceived to show interest and spend time relating to their child’s school activities (Murray, 2005). The finding of a significant relationship between mother involvement and achievement is further supported by the findings in Table 3 showing mother involvement significantly predicting student achievement. Of all the predictors of achievement used by the researchers, only mother involvement significantly predicted student achievement. Although the stepwise forward regression analysis did not show the other factors as significant predictors of student achievement, this does not mean that teaching approach, father involvement, father autonomy, and mother autonomy do not contribute to predicting achievement. It simply implies that their contribution to the achievement of the students is not as significant as the contribution of mother involvement. In addition, this further stresses the poor quality of teaching in public schools, given its failure to significantly predict the achievement of students.

To further look into the role of mothers and fathers in predicting student achievement, the researchers used a t-test for two dependent samples to determine which parent had higher involvement and autonomy ratings. Table 4.1 showed that there was no significant difference between the parents’ involvement in terms of their means. At the same time, Table 4.2 showed that there was no significant difference between the parents’ autonomy. Although both tables did not show any significant differences between mothers and fathers in their involvement and autonomy, this does not have any bearing on both factors’ contribution in predicting student achievement, since this was only a comparison of the mean scores of the variables.

The findings of the study have several implications. First, the poor quality of teaching in Philippine public education affects the relatively low achievement of students in the public schools. The role of the teacher is critical, for teachers are the people who determine the content to be taught, the teaching strategy to be used, and the conditions for learning the content. Cortes (1987) said that the teacher factor is believed to explain the low level of achievement of Filipino students. The fact that the students failed to recognize which particular approach their teachers were using shows that their teachers are failing to effectively practice their teaching strategies. Second, the high mother involvement rating implies that mothers have more time to look after the studies of their children as compared to the fathers. In the Philippine setting, fathers perceive their role as mainly the provider of the family, thus making them pay more attention to their job and less attention to their children (Tan, 1989).

It was concluded in the study that only mother involvement can predict students’ achievement. Teacher-centeredness and learner-centeredness were significantly related with each other. This indicates that teachers are failing to utilize teaching strategies appropriately, which may have influenced the students in distinguishing which teaching approach their teacher uses. In predicting student achievement, factors such as father involvement, father autonomy, mother autonomy, and teaching approach did not significantly contribute.
References

AbiSamra, N. (2000). The relationship between emotional intelligence and academic achievement in eleventh graders. Retrieved October 4, 2007, from http://members.fortunecity.com/nadabs/research-intell2.html

Adams, G. L., & Engelmann, S. (1996). Research on direct instruction: 25 years beyond DISTAR. Seattle, WA: Educational Achievement Systems.

Assadi, S. M., Zokaei, N., Kaviani, H., Mohammadi, M. R., et al. (2007). Effect of sociocultural context and parenting style on scholastic achievement among Iranian adolescents. Oxford, 16, 169.

Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman.

Biggs, J. (1999). Teaching for quality learning at university. Philadelphia: Open University Press.

Boveja, M. (1998). Parenting styles and adolescents’ learning strategies in the urban community. Journal of Multicultural Counseling and Development, 26, 110-120.

Brent, G., & DiObilda, N. (1993). Effects of curriculum alignment versus direct instruction on urban children. Journal of Educational Research, 86, 333-338.

Gleason, M. M. (1995). Using direct instruction to integrate reading and writing for students with learning disabilities. Reading and Writing Quarterly, 11, 91-108.

Gray, M. R., & Steinberg, L. (1999). Unpacking authoritative parenting: Reassessing a multidimensional construct. Journal of Marriage and the Family, 61, 574-587.

Gregory, A., & Weinstein, R. S. (2006). Connection and regulation at home and in school: Predicting growth in achievement for adolescents. Journal of Adolescent Research, 19, 405.

Grolnick, W. S. (2003). The psychology of parental control: How well-meant parenting backfires. Mahwah, NJ: Erlbaum.

Grolnick, W. S., Ryan, R. M., & Deci, E. L. (1991). The inner resources for school performance: Motivational mediators of children’s perceptions of their parents. Journal of Educational Psychology, 83, 508-517.

Halawah, I. (2006). The effect of motivation, family environment and student characteristics on achievement. Journal of Instructional Psychology, 33, 91-99.

Licuanan, P. (1979). Some aspects of child-rearing in an urban low-income community. Philippine Studies, 27, 453-468.

Magno, C. (in press). Exploratory and confirmatory analysis of parenting closeness and multidimensional scaling of other parenting models. The Guidance Journal.

Magno, C., & Sembrano, J. (2007). The role of teacher efficacy and characteristics on teaching effectiveness, performance and use of learner-centered practices. The Asia-Pacific Education Researcher, 60, 167-180.

McCombs, B. L. (2001). What do we know about learners and learning? The learner-centered framework: Bringing the educational system into balance. Educational Horizons, 8, 182-193.

McCombs, B. L. (2003). Applying the LCPs to high school education. Theory into Practice, 42, 117-126.

Rollins, B. C., & Thomas, D. L. (1979). Parental support, power and control techniques in the socialization of children. In W. R. Burr, R. Hill, F. I. Nye, & I. L. Reiss (Eds.), Contemporary theories about the family (Vol. 1, pp. 317-364).

Ryan, R. M., Deci, E. L., & Grolnick, W. S. (1995). Autonomy, relatedness, and the self: Their relation to development and psychopathology. In D. Cicchetti & D. J. Cohen (Eds.), Developmental psychopathology: Vol. 1. Theory and methods (pp. 618-655). New York: Wiley.

Schuh, K. L. (2003). Knowledge construction in the learner-centered classroom. Journal of Educational Psychology, 95, 246-442.

Snowman, J., & Biehler, R. (2000). Psychology applied to teaching (9th ed.). Boston: Houghton Mifflin.

Spector, J. E. (1995). Phonemic awareness training: Application of principles of direct instruction.
Chapter 8
Standardized Tests
Objectives
Lessons
Lesson 1
What are Standardized Tests?
A test is a tool used to measure a sample of behavior. Why do we say “a sample” and not the entire behavior? A test can only measure part of a behavior: it CANNOT measure the entire behavior of a person or the whole characteristic being measured. For example, in a personality test you cannot test the entire personality. In the case of the NEO-PI, the extraversion subscale can only measure part of extraversion. As an implication, during pre-employment testing, a series or battery of tests is administered before an applicant is accepted in order to better represent the behavior that needs to be uncovered. In school admission, the university or college requires student applicants’ grades, entrance exam, essay, recommendation letter, and bioprofile to decide on the suitability of the student. A test can never measure everything. There are proper uses of tests.
As discussed in Chapter 3, a test should be valid and reliable and should discriminate ability before one uses it. Validity means that the test measures what it is supposed to measure. Reliability means that test scores are consistent when the same test is administered again or when the test is paired with another form. Discrimination is the ability of the test to determine who has learned and who has not.
The primary purposes of standardization are (1) to facilitate the development of tools and (2) to ensure that results from a test are indeed reliable and can therefore be used to assign values or qualities to the attributes being measured (through the established norms of the test). The unique characteristics of a standardized test which differentiate it from other tests are (1) uniform procedures in test administration and scoring, and (2) the establishment of norms.
Uses of Tests
Classifications of Tests
Standardized vs. Non-Standardized. Standardized tests have fixed directions for administration and scoring. They can be purchased with test manuals, booklets, and answer sheets, and they were sampled to a norm group. A non-standardized or teacher-made test is intended for classroom assessment; it is used for classroom purposes and intends to measure behavior in line with the objectives of the course. Examples are quizzes, long tests, and exams. Can a teacher-made test become a standardized test? Yes, as long as it is valid and reliable and has a norm.

Individual Tests vs. Group Tests. Individual tests are administered to one examinee at a time. They are used for special populations such as children and people with mental disorders. Examples are the Stanford-Binet and the WISC. Group tests are administered to many examinees at a time. Examples are classroom tests.
Speed vs. Power. A speed test consists of easy items, but the time is limited. A power test consists of a few pre-calibrated difficult items, and the time is also limited.

Verbal vs. Non-Verbal Tests. Verbal tests consist of vocabulary and sentences; an example is a math test with word problems. Non-verbal tests consist of puzzles and diagrams; examples are abstract reasoning and projective tests. A performance test requires examinees to manipulate objects.

Cognitive vs. Affective. Cognitive tests measure the processes and products of natural ability; examples are tests of intelligence, aptitude, memory, and problem solving. An achievement test assesses what has been learned in the past. An aptitude test focuses on the future and what the person is capable of learning; examples are the Mechanical Aptitude Test and Structural Visualization. Affective tests assess interests, personality, and attitudes, the non-cognitive aspects.
Lesson 2
Interpreting Test Scores through Norm and Criterion Reference
Creating norms is usually done by test developers, psychometricians, and other practitioners in testing. When a test is created, it is administered to a large group of individuals. This group of individuals is the target sample for which the test is intended. If the test can be used for a wide range of individuals, then a norm for each specific group possessing a given characteristic needs to be constructed. This means that a separate norm is created for males and females, for ages 11-12, 13-14, 15-16, 17-18, and so on. There should be a norm for every kind of user of the test in order to interpret his or her position in a given distribution. A variety of norms is needed because one cannot take a norm that was made for 12-year-olds and use it for 18-year-olds: the ability of an 18-year-old is different from the ability of a 12-year-old. If a 21-year-old needs to take a test but you DO NOT have a norm for 21-year-olds, then you have to create one. There is a need to create norms for certain groups because the groups involved differ from one another in terms of curriculum, ability, and so on. For example, the majority of standardized tests used in the Philippine setting are from the West, which means that their content and norms are based in that setting. Thus, there is a need to create norms specifically for Filipinos. Another concern in developing norms is that they expire over time: norms created in the 1960s cannot be used to interpret the scores of test takers in 2008. Thus, a norm needs to be created every year.
In creating a norm, the goal is to come up with a distribution of scores that is typical of a normal curve. A normal distribution is asymptotic and symmetrical. Asymptotic means that the two tails of the normal curve do not touch the base line, which extends to infinity. The sides of the normal distribution are symmetrical. The normal curve is a theoretical distribution of cases in which the mean, median, and mode are the same and in which distances from the mean can be measured in standardized units such as standard deviation units or z scores. Z scores are standardized values transformed from raw score distributions. Six standard deviation areas are marked off under the normal curve. The z score ranges from -3 to +3, with a mean of 0 and a standard deviation of 1.

[Figure: the normal curve, with the baseline marked from -3 SD to +3 SD]
Suppose that a general ability test with 100 items was constructed and pilot tested on 25 participants. The goal is to construct a norm to interpret the scores of future test takers. (Generally, 25 respondents are not enough to create a norm.)

96 74 64 50 76
83 80 92 85 91
59 68 76 75 69
64 87 71 81 83
73 67 68 70 75

1. Determine the range of the scores: highest minus lowest plus one, or 96 - 50 + 1 = 47.
2. Divide the range by the desired number of class intervals (10) to obtain the interval size: i = 47 / 10, or approximately 5.
3. Start the class intervals with a score that is divisible by your interval size. The lowest score, 50, is divisible by 5 (the interval size), so the first class interval can start at 50.
The frequency (f) and relative frequency (rf) indicate how many participants scored within a class interval. The cumulative percentage (cP) indicates the point in a distribution that has a given percent of the cases below it. In the example, an examinee who scored 87 has 88% of the participants at or below his score and 12% of the cases above it.
[Table: frequency distribution of the 25 scores, showing each class interval with its midpoint, f, rf, cf, and cP]
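Since the original frequency table did not survive in this copy, the distribution can be rebuilt mechanically from the rules above. The short Python sketch below (a reconstruction assuming an interval size of 5 starting at 50, as stated in the steps) tabulates the midpoint, f, rf, cf, and cP for the 25 pilot scores.

```python
# Rebuild the frequency distribution for the 25 pilot-test scores
# using an interval size of 5 starting at 50 (the lowest score).
scores = [96, 74, 64, 50, 76, 83, 80, 92, 85, 91,
          59, 68, 76, 75, 69, 64, 87, 71, 81, 83,
          73, 67, 68, 70, 75]

i = 5          # interval size
start = 50     # lowest score, divisible by the interval size
n = len(scores)

cum = 0
for lower in range(start, max(scores) + 1, i):
    upper = lower + i - 1
    f = sum(lower <= s <= upper for s in scores)   # frequency
    cum += f                                       # cumulative frequency
    rf = f / n                                     # relative frequency
    cp = 100 * cum / n                             # cumulative percentage
    midpoint = (lower + upper) / 2
    print(f"{lower}-{upper}  mid={midpoint:5.1f}  f={f}  "
          f"rf={rf:.2f}  cf={cum}  cP={cp:.0f}%")
```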
When a histogram is created for the data set, it typifies a normal distribution. To determine whether a distribution of scores approximates a normal curve, there are indices to be assessed.

Estimating the Mean and Median. The mean is the sum of all scores divided by the number of cases:

\bar{X} = \frac{\sum X}{N} = \frac{1877}{25} = 75.08

The median (C50) for grouped data is

C_{50} = cb + \left( \frac{N(.50) - cf}{f} \right) i = 74.5 + \left( \frac{12.5 - 12}{4} \right) 5 = 75.13

Fifty percent of N = 25 is 12.5. Given this value, select from the cumulative frequency (cf) column of the frequency distribution table the value that is close to 12.5 but does not exceed it. This value is 12, which is then used as cf in the formula. The f used is 4 because, given a cf of 12, a frequency of 4 more cases is needed to reach 12.5; the value 4 is the frequency of the interval above the cf of 12. The i value is the interval size, which is 5. To determine cb, the class boundary, get the upper limit of the interval that accumulates the cf of 12. This upper limit is 74 (70 is the lower limit). The boundary between 74 and the next lower limit, 75, is 74.5; therefore 74.5 is used as cb.

The values of the mean (75.08) and median (75.13) are close. It can be assumed that the distribution is normal.
Notice that in a skewed distribution, the mean and median are not equal. In a positively skewed distribution, the mean is pulled by the extreme scores on the right and has a higher value than the median ($\bar{X} > C_{50}$), while in a negatively skewed curve the mean is pulled by the extreme scores on the left and the median has the higher value ($\bar{X} < C_{50}$).

Estimating Skewness. Skewness can be estimated using

sk = \frac{3(\bar{X} - C_{50})}{sd}

where sd is the standard deviation, $\bar{X}$ is the mean, and $C_{50}$ is the median. In the previous section the mean and median were already computed, with values 75.08 and 75.13, respectively. To determine the value of the standard deviation, the formula below is used:

sd = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} = \sqrt{\frac{143713 - \frac{(1877)^2}{25}}{25 - 1}} = 10.78

where $\sum X$ is the sum of all scores, $\sum X^2$ is the sum of squares, and N is the sample size. $\sum X^2$ is obtained by squaring each score and then summing the squares; this gives 143713 for the given data. Substituting the values in the formula:

sk = \frac{3(75.08 - 75.13)}{10.78} = -0.01

The value of sk is almost 0, which indicates that the distribution is normal.
Estimating Kurtosis. Kurtosis refers to the peakedness of the curve. If a curve is peaked and the tails are more elevated, the curve is leptokurtic; if the curve is flattened, it is said to be platykurtic. A normal distribution is mesokurtic. Kurtosis can be estimated using

Kurtosis = \frac{QD}{P_{90} - P_{10}}

where QD is the quartile deviation, $QD = \frac{Q_3 - Q_1}{2}$, $P_{90}$ is the 90th percentile, and $P_{10}$ is the 10th percentile. The formula used to determine the median can also be used to determine the percentiles P. $Q_3$ is equivalent to $P_{75}$ and $Q_1$ is equivalent to $P_{25}$. Four percentile estimates are needed to determine kurtosis: $P_{75}$, $P_{25}$, $P_{90}$, and $P_{10}$.

P_{75} = cb + \left( \frac{N(.75) - cf}{f} \right) i = 79.5 + \left( \frac{18.75 - 16}{4} \right) 5 = 82.94

P_{25} = cb + \left( \frac{N(.25) - cf}{f} \right) i = 64.5 + \left( \frac{6.25 - 4}{4} \right) 5 = 67.31

P_{10} = cb + \left( \frac{N(.10) - cf}{f} \right) i = 59.5 + \left( \frac{2.5 - 2}{2} \right) 5 = 60.75

P_{90} = cb + \left( \frac{N(.90) - cf}{f} \right) i = 89.5 + \left( \frac{22.5 - 22}{2} \right) 5 = 90.75

QD = \frac{Q_3 - Q_1}{2} = \frac{82.94 - 67.31}{2} = 7.81

Kurtosis = \frac{QD}{P_{90} - P_{10}} = \frac{7.81}{90.75 - 60.75} = 0.26

The distribution approximates a normal one, since the kurtosis value of 0.26 is close to .263, the value of this index for a mesokurtic (normal) distribution.
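As a check on the hand computations, the following Python sketch reproduces the grouped-data statistics above; the helper percentile_grouped simply implements the cb + ((Np - cf)/f)i formula used for the median and the percentiles.

```python
import math

scores = [96, 74, 64, 50, 76, 83, 80, 92, 85, 91,
          59, 68, 76, 75, 69, 64, 87, 71, 81, 83,
          73, 67, 68, 70, 75]
n, i = len(scores), 5

def percentile_grouped(p):
    """Grouped-data percentile: cb + ((N*p - cf) / f) * i."""
    target = n * p
    cum = 0
    for lower in range(50, max(scores) + 1, i):
        f = sum(lower <= s <= lower + i - 1 for s in scores)
        if cum + f >= target:
            cb = lower - 0.5                  # class boundary
            return cb + (target - cum) / f * i
        cum += f

mean = sum(scores) / n                                     # 75.08
median = percentile_grouped(0.50)                          # 75.13
sd = math.sqrt((sum(s * s for s in scores)
                - sum(scores) ** 2 / n) / (n - 1))         # 10.78
sk = 3 * (mean - median) / sd                              # about -0.01

q3, q1 = percentile_grouped(0.75), percentile_grouped(0.25)
qd = (q3 - q1) / 2                                         # 7.81
kurt = qd / (percentile_grouped(0.90) - percentile_grouped(0.10))  # 0.26
print(mean, median, sd, sk, kurt)
```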
What is the standard score corresponding to a raw score of 94? Locate this score on the normal curve.

To convert a raw score to a standard z score, the formula

z = \frac{X - \bar{X}}{sd}

is used. Just substitute the values in the formula, where X is the given score. Using the given data set, where $\bar{X}$ = 75.08 and sd = 10.78:

z = \frac{94 - 75.08}{10.78} = 1.76

[Figure: the z score of 1.76 located on the normal curve, between +1 and +2]
Notice that the z score has a mean of 0 and a standard deviation of 1. A T score has a mean of 50 and a standard deviation of 10. For the other scales used here, a CEEB score has a mean of 500 and a standard deviation of 100, the ACT scale a mean of 15 and a standard deviation of 5, and the stanine a mean of 5 and a standard deviation of 2.

Convert a raw score of 94 into a T score, CEEB, ACT, and stanine. Given the z value of 1.76 for a raw score of 94, just multiply the z by the standard deviation of the standard score, then add the mean value:

T = 50 + 10(1.76) = 67.6
CEEB = 500 + 100(1.76) = 676
ACT = 15 + 5(1.76) = 23.8
stanine = 5 + 2(1.76) = 8.52

A raw score of 94 has an equivalent T score of 67.6, CEEB of 676, ACT of 23.8, and stanine of 8.52.
Once a score is converted into a standard score, it can be interpreted based on its position in the normal curve. For example, a raw score of 94 is said to be above average, given that its location surpasses the mean.
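The conversion rule "multiply z by the scale's standard deviation, then add its mean" is one line of code. In the Python sketch below, the scale means and standard deviations (T: 50 and 10; CEEB: 500 and 100; ACT: 15 and 5; stanine: 5 and 2) are the values implied by the worked results above, since the book's own scale table did not survive in this copy.

```python
def standard_score(z, mean, sd):
    # Convert a z score to another standard-score scale.
    return mean + z * sd

z = 1.76  # raw score of 94, with mean 75.08 and sd 10.78
for name, (m, s) in {"T": (50, 10), "CEEB": (500, 100),
                     "ACT": (15, 5), "stanine": (5, 2)}.items():
    print(name, round(standard_score(z, m, s), 2))
# T 67.6, CEEB 676.0, ACT 23.8, stanine 8.52
```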
Since the normal distribution is symmetrical, it has constant areas. When cutoffs are made using z scores, the areas are as follows.

[Figure: the normal curve with its areas marked between -3 SD and +3 SD]

The areas show that from the mean to a z score of 1, the area covered is 34.13%, which is also 1 standard deviation away from the mean. From a standard score of -1 to +1, a total area of 68.26% (34.13% + 34.13%) is covered. From -2 to +2, a total area of 95.44% is covered in the curve. From -3 to +3, a total area of 99.72% is covered. The remaining area of the normal curve is 0.13% on each side. The approximate areas of the normal curve for every z score are found in Appendix B of this book.
For example, for a given raw score of 94, what is the area away from the mean? Given the z score of 1.76 for a raw score of 94, looking up 1.76 in Appendix B (first column, z score) gives a value of .4608, which is the area away from the mean. This means that the area from the mean (“0”) to a z score of 1.76 occupies 46.08% of the normal distribution, leaving 3.92% of the cases above it.

How many cases are within this 46.08% of the distribution? Just multiply the area .4608 by N (.4608 x 25), which gives about 12 participants.
1) How many cases are within 68.26% of the normal distribution? Multiply N = 25 by .6826: 25 x .6826 gives about 17 cases. The remaining 25 - 17 = 8 cases fall outside this area.
2) Given a score of 87 and another score of 73, how many people are between the two scores? Convert 87 and 73 into z scores ($\bar{X}$ = 75.08, sd = 10.78). A score of 87 corresponds to a z score of 1.11, and a score of 73 corresponds to a z score of -0.19. A z score of 1.11 is located on the right side of the curve above the mean, and a z score of -0.19 is on the left side below the mean because of the negative sign. Locate the area away from the mean for each z score and add the two areas to determine the proportion between the scores; then multiply this proportion by N = 25. The areas are .0753 and .3643, so the proportion is .0753 + .3643 = .4396, and .4396 x 25 gives about 11 participants between the two scores.
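The areas read from the z table in Appendix B can also be computed directly from the normal distribution function. This Python sketch redoes both examples using math.erf; small differences from the table values are due to rounding.

```python
import math

def area_mean_to_z(z):
    # Area under the standard normal curve between the mean (z = 0)
    # and the given z; equals erf(z / sqrt(2)) / 2 in absolute value.
    return abs(math.erf(z / math.sqrt(2)) / 2)

n, mean, sd = 25, 75.08, 10.78

# Raw score 94 -> z = 1.76; the area from the mean is about .4608.
z94 = (94 - mean) / sd
print(area_mean_to_z(z94) * n)        # about 11.5, i.e. roughly 12 cases

# Scores 73 and 87 lie on opposite sides of the mean, so the two
# areas away from the mean are added before multiplying by N.
z73 = (73 - mean) / sd                # about -0.19
z87 = (87 - mean) / sd                # about 1.11
between = area_mean_to_z(z73) + area_mean_to_z(z87)
print(between * n)                    # about 11 participants
```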
Comparison of Criterion-Referenced and Norm-Referenced Tests

Dimension: Purpose
Criterion-Referenced Tests: To determine whether each student has achieved specific skills or concepts; to find out how much students know before instruction begins and after it has finished.
Norm-Referenced Tests: To rank each student with respect to the achievement of others in broad areas of knowledge; to discriminate between high and low achievers.

Dimension: Content
Criterion-Referenced Tests: Measure specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
Norm-Referenced Tests: Measure broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Dimension: Item Characteristics
Criterion-Referenced Tests: Each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items which test any given skill are parallel in difficulty.
Norm-Referenced Tests: Each skill is usually tested by fewer than four items. Items vary in difficulty. Items are selected that discriminate between high and low achievers.

Dimension: Score Interpretation
Criterion-Referenced Tests: Each individual is compared with a preset standard for acceptable achievement; the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage. Student achievement is reported for individual skills.
Norm-Referenced Tests: Each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade equivalent score, or a stanine. Student achievement is reported for broad skill areas, although some norm-referenced tests do report student achievement for individual skills.
Exercise
(True False) 1. The mean of a score distribution is equivalent to zero in standard z scores.
(True False) 2. The mean and the median are equivalent to 0 in a normal curve.
(True False) 3. 68% of the normal distribution lies 2 standard deviations away from the mean.
(True False) 4. The entire area of the normal distribution is 100%.
(True False) 5. The area in percentage from -3 to -2 of the normal distribution is 86.26%.
(True False) 6. The extreme area of the normal distribution is 0.13% on each side.
(True False) 7. The area of the normal curve from +2 to -1 is 95.44%.
(True False) 8. The area from -2 to +1 is equivalent to the area from +2 to -1.
(True False) 9. The mode is found at zero in a normal distribution.
Lesson 3
Standards in Educational and Psychological Testing
There is a need to control the use of tests because of the issue of leakage; when leakage happens, it becomes difficult to determine abilities accurately. To control the use of tests, proper considerations are ensured: a qualified examiner and proper procedure in test administration. A person can be a qualified examiner provided that he or she undergoes training in administering a particular test. The psychometrician is responsible for the psychometric properties and the selection of tests. The psychometrician also trains the staff on how to administer standardized tests properly.
A qualified examiner needs to follow instructions precisely by undergoing training or orientation to develop the skill of administering a test. The examiner needs to follow the test manual precisely. If the examiner deviates substantially from the instructions, the purpose of standardization is defeated: one of the distinct qualities of standardized measures is uniformity of administration. Moreover, a lack of precision in following the instructions for administering the test can affect the test results.
The examiner should have thorough familiarity with the test’s instructions. Examiners should at least memorize their script, even for introducing themselves to the examinees.
Careful control of the testing conditions, which concerns the environment of the rooms where the exam is taken, is also important. If several groups will take the exam, the conditions should be the same for all. This includes lighting, temperature, noise, ventilation, and facilities. The condition of the testing room can affect the test-taking process.
Proper checking procedures should also be taken into consideration. It should be decided whether the test will be checked by computer scanning or manually. A second round of checking should also be done to verify that the checking was done accurately.
There should also be proper interpretation of results. Some trained examiners have the skills to make a psychological profile out of the battery of tests administered. The psychometrician is qualified to write a narrative integrating all test results. In some cases, the staff are trained to write psychological profiles, especially when there are occasional test takers.
Test content should be restricted in order to forestall deliberate efforts to fake scores. The questionnaires can be accessed only by the psychometricians; the staff, superiors, and anybody else are not allowed access to the tests. To avoid leakage and familiarity, the psychometrician can use different sets of standardized tests that measure the same characteristics for different groups of test takers.
Test results are confidential. The examiner is not allowed to show the results of the exam to anybody other than the test taker and the people who will use them for decision making. Test results are kept where they are accessible only to the psychometrician and qualified personnel.
The nature of the test should be effectively communicated to the test takers. It is important to dispel any mystery and misconceptions regarding the test. It should be clarified to the test takers what the test measures and how the results will be used in making whatever decision the test is intended for. The procedures of the test can be explained to test takers in case they are concerned. It is essential for them to know that the test is reliable and valid. Moreover, the examiner should also dispel the anxiety of the test takers to ensure that they will perform to the best of their ability. After the test, feedback on the results should be communicated to the test takers; it is the right of the test takers to know the result of the test they took. The psychometrician is responsible for keeping all the records of the results in case the test takers ask for them.
Test Administration
Before the test proper, the examiner should prepare for the test administration. Preparation includes memorizing the script and familiarization with the instructions and procedures. The examiner should memorize the exact verbal instructions, especially the introduction. However, there are some standardized tests that do not require the examiner to memorize the instructions and procedures; some tests permit the examiner to read the instructions and procedures from the manual.
In terms of preparing the test materials, it is advisable that the examiner prepare a day before the testing day. The examiner should count the test booklets, answer sheets, and pencils, and prepare the sign boards, stopwatch, other materials, and the room itself. The room reservation should have been made one month before the testing. The testing schedules are all prearranged. The room conditions are fixed, including ventilation, air-conditioning, and chairs.
Thorough familiarity with the specific testing procedure is also important, and it is shown in checking the names of the test takers: the pictures on the test permits should match the examinees’ faces. Testing materials such as the stopwatch provided for administering the test should be checked to see that they are working properly.

Advance briefing of the proctors is also done through orientation and training on how to administer the test. During the test, the examiner is responsible for reading the instructions carefully, taking charge of timing, and taking charge of the group taking the exam. Examiners should also prevent the test takers from cheating. After the session, the examiner checks that the number of test booklets corresponds to the number of test takers. The examiner also makes sure that the test takers follow instructions, such as shading the circles if they are required to shade them. For questions that cannot be answered by the proctor, there is a testing manager nearby who can be consulted.
As for the testing conditions, the environment should not be noisy. The examiner should select good and suitable testing rooms that can provide a good testing environment for the test takers. The area where the test is administered should be restricted, and noise in the place should be regulated. The temperature should be kept the same in all rooms. Each room should be free of noise, the lights should be bright enough, there should be good seating facilities, and other factors that can negatively affect the test takers while taking the exam should be controlled. Special steps should also be taken to prevent distractions, such as putting signs outside the testing room like “examination going on.” The examiner can also lock the door or ask assistants outside the room to tell people that a test is going on in that area. Subtle testing conditions, such as the tables, chairs, type of answer sheet, and paper-and-pencil versus computer administration, may affect performance on ability and personality tests.
The test administrator should establish rapport with the test takers. Rapport refers to the examiner's efforts to arouse the test takers' interest in the test, elicit their cooperation, and encourage them to respond appropriately. For ability tests, the examiner encourages test takers to exert their best effort to perform well. For personality inventories, the examiner tells test takers to be frank and honest in their responses to the questions. For projective tests, the examiner instructs test takers to report fully the associations evoked by the stimuli without censoring or editing the content. In general, the examiner motivates test takers to follow the instructions carefully.
For preschool children, the test administrator has to be friendly, cheerful, and relaxed. A short testing time is recommended considering the attention span of children. The tasks required in the test should be interesting, and the scoring should be flexible. Examples of how to answer each test type are demonstrated to the children.
For grade school students, the test administrator should appeal to their competitive side and their desire to do well.
The educationally disadvantaged may not be motivated in the same way as the usual test takers, so the examiner should adapt to their needs. Nonverbal tests are used for deaf examinees and for those who are not able to read and write. Oral tests should be given to examinees who have difficulty writing.
For the emotionally disturbed, test administrators should be sensitive to the difficulties these test takers might have and take them into account when interpreting scores. Testing should occur when these examinees are in a proper condition to be tested.
For adults, test administrators should explain the purpose of the test and convince the test takers that it is in their own interest.
Examiner variables such as age, sex, ethnicity, professional and socio-economic status, training, experience, personality characteristics, and appearance affect the test takers. Situational variables such as an unfamiliar or stressful environment, activities before the test, emotional disturbance, and fatigue also affect the test takers.
Intelligence Tests
The differences in level of reliability between the short form and the full test (Forms A and B) are sufficiently large to warrant administration of the full test. Scale 2 reliability coefficients are .80 to .87 for the full test and .67 to .76 for the short form. Scale 3, on the other hand, has reliability coefficients of .82 to .85 for the full test and .69 to .74 for the short form. The validity evidence gathered was construct and concurrent validity. Construct validity for Scale 2 was reported at .85 for the full test and .81 for the short form, and concurrent validity at .77 for the full test and .70 for the short form. For Scale 3, the reported construct validity was .92 for the full test and .85 for the short form, while concurrent validity was .65 for the full test and .66 for the short form. Standardization was done for both scales: for Scale 2, 4,328 males and females from varied regions of the US and Britain were included, and for Scale 3, 3,140 American first- to fourth-year high school students and young adults participated.
This test was developed by Arthur Otis and Roger Lennon and was published by Harcourt, Brace and World, Inc. in New York in 1957. It was designed to provide a comprehensive assessment of the general mental ability of students in American schools. It was also developed to measure students' facility in reasoning and in dealing abstractly with verbal, symbolic, and figural content. The content samples a broad range of mental ability. It is important to note that the test does not intend to measure the innate mental ability of students. There are six levels of the Otis-Lennon Mental Ability Test to ensure a comprehensive and efficient measure of the mental ability already developed among students in grades K-12. Primary Level I is intended for students in the last half of kindergarten, Primary Level II for the first half of grade 2, Elementary I for the last half of grade 2 through grade 3, Elementary II for grades 4-6, Intermediate for grades 7-9, and Advanced for grades 10-12. The norm was obtained from 200,000 students in 117 school systems across the 50 states who participated in the national standardization program; there were 12,000 pupils from grades 1-12, while 6,000 were from kindergarten. For reliability, the split-half method was used, with computed reliabilities ranging from .93 (Elementary I) to .96 (Intermediate). The Kuder-Richardson formula 20 (KR-20) also yielded reliability coefficients from .93 (Elementary I) to .96 (Intermediate), and alternate-forms reliability coefficients range from .89 (Elementary II) to .94 (Intermediate). As for validity, correlations with school grades and scores on achievement tests were computed. Moreover, the relationships between the OLMAT and other accepted mental ability and aptitude tests were computed.
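To make the KR-20 computation concrete, here is a minimal sketch in Python (not taken from any test manual; the data matrix is hypothetical) showing how the coefficient is obtained from dichotomously scored items:

    import numpy as np

    def kr20(scores):
        # scores: persons-by-items matrix of 0/1 responses
        k = scores.shape[1]                         # number of items
        p = scores.mean(axis=0)                     # proportion passing each item
        q = 1 - p
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

    # Hypothetical data: six examinees answering five items (1 = correct)
    data = np.array([[1, 1, 1, 0, 1],
                     [1, 0, 1, 1, 1],
                     [0, 1, 0, 0, 1],
                     [1, 1, 1, 1, 1],
                     [0, 0, 1, 0, 0],
                     [1, 1, 0, 1, 1]])
    print(round(kr20(data), 2))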
This test was developed by Arthur Otis and Roger Lennon and was published by Harcourt Brace Jovanovich, Inc. in New York in 1979. It was developed to give an accurate and efficient measure of the abilities needed to attain the desired cognitive outcomes of formal education. It intends to measure general mental ability, or Spearman's "g". Following Vernon's postulate of two major components of "g", the verbal-educational and the practical-mechanical, this test focuses on the verbal-educational factor through a variety of tasks that call for the application of several processes to verbal, quantitative, and pictorial content. The OLSAT was organized into five levels: Primary Level I for grade 1 students, Primary Level II for grades 2 and 3, Elementary for grades
4 and 5, Intermediate for grades 6-8, and Advanced for grades 9 through 12. Each level is designed to obtain reliable and efficient measurement for the students for whom it is intended. For each level, two parallel forms of the test, Forms R and S, were developed. Items in these two forms are balanced in terms of content, difficulty, and discriminatory power, and the two forms obtain comparable results. A norm group composed of 130,000 students in 70 school systems enrolled in grades 1-12 in American schools was used for standardization. For the reliability of the test, the Kuder-Richardson method yielded reliability coefficients of .91 to .95. Test-retest reliability was also computed, obtaining coefficients of .93 to .95. Lastly, the standard error of measurement was computed, wherein 2/3 of observed scores fell within +/- 1 standard error of measurement of the "true scores" and 95% fell within +/- 2 standard errors of measurement. For validity, the OLSAT was correlated with teachers' grades, yielding coefficients of .40 to .60 with a median of .49. The OLSAT was also correlated with achievement test scores.
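As a minimal sketch (the numbers below are hypothetical, not the OLSAT's), the standard error of measurement can be computed from the score standard deviation and the reliability coefficient:

    import math

    def sem(sd, reliability):
        # SEM = s * sqrt(1 - r): the expected spread of observed scores
        # around an examinee's true score
        return sd * math.sqrt(1 - reliability)

    # With an assumed SD of 10 and reliability of .93, about 2/3 of observed
    # scores fall within +/- 1 SEM of the true score, about 95% within +/- 2 SEM.
    print(round(sem(10, 0.93), 2))  # 2.65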
This test was originally developed by Dr. John C. Raven in 1936 and is distributed in the U.S. by The Psychological Corporation. It is a multiple-choice test of abstract reasoning designed to measure the ability of a person to form perceptual relations. Moreover, it intends to measure a person's ability to reason by analogy independent of language and formal schooling. This test is a measure of Spearman's g. It consists of 60 items arranged in five sets (A, B, C, D, and E) of 12 items each. Each item contains a figure with a missing piece. There are either six (sets A and B) or eight (sets C through E) alternative pieces to complete the figure, only one of which is correct. Each set involves a different principle or "theme" for obtaining the missing piece, and within a set the items are roughly arranged in increasing order of difficulty. The raw score is converted to a percentile rank through the use of the appropriate norms. The test is intended for people ranging in age from 6 up to adulthood. The matrices are offered in three different forms for participants of different ability: the Standard Progressive Matrices, the Colored Progressive Matrices, and the Advanced Progressive Matrices. The Standard Progressive Matrices were the original form of the matrices and were first published in 1938. This form comprises five sets (A to E) of 12 items each, with items within a set becoming increasingly difficult, requiring ever greater cognitive capacity to encode and analyze information. All of the items are presented in black ink on a white background. The Colored Progressive Matrices were designed for younger children, the elderly, and people with moderate or severe learning difficulties. This form consists of sets A and B from the standard matrices, with a further set of 12 items inserted between the two as set Ab. Most of the items are presented on a colored background so that the test appears visually stimulating to participants. The very last few items in set B, however, are presented as black-on-white so that, if participants exceed the tester's expectations, the transition to sets C, D, and E of the standard matrices is eased. The third form, the Advanced Progressive Matrices, contains 48 items, presented as one set of 12 (set I) and another of 36 (set II). Items here are also presented in black ink on a white background and become increasingly difficult as progress is made through each set. The items in this form are appropriate for adults and adolescents of above-average intelligence. The last two forms of the matrices were published in 1998. In terms of establishing the norms, the standardization sample included British children between the ages of 6 and 16, Irish children between the ages of 6 and 12, and military and civilian subjects between the ages of 20 and 65. Other samples came from
Canada, the United States, and Germany. The two main factors of Raven's Progressive Matrices correspond to the two main components of general intelligence originally identified by Spearman: eductive ability (the ability to think clearly and make sense of complexity) and reproductive ability (the ability to store and reproduce information). To determine reliability, split-half and KR-20 estimates yielded values ranging from .60 to .98, with a median of .90. Test-retest correlations were also computed; coefficients range from a low of .46 for an eleven-year interval to a high of .97 for a two-day interval. The median test-retest value is approximately .82. Raven provided test-retest coefficients for several age groups: .88 (13 years and above), .93 (under 30 years), .88 (30-39 years), .87 (40-49 years), and .83 (50 years and over). For validity, Spearman considered the SPM to be the best measure of g. When evaluated with the factor-analytic methods initially used to define g, the SPM comes as close to measuring it as one might expect. The majority of studies that factor analyzed the SPM along with other cognitive measures in Western cultures report loadings higher than .75 on a general factor. Moreover, concurrent validity coefficients between the SPM and the Stanford-Binet and Wechsler scales range between .54 and .88, with the majority in the .70s and .80s.
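The conversion of a raw score to a percentile rank mentioned above can be illustrated with a minimal Python sketch; the norm-group scores here are hypothetical, not Raven's published norms:

    import numpy as np

    def percentile_rank(raw, norm_scores):
        # Percentage of the norm group scoring at or below the raw score
        return 100 * (norm_scores <= raw).mean()

    norm_group = np.array([22, 31, 35, 38, 40, 42, 44, 47, 51, 55])  # hypothetical
    print(percentile_rank(44, norm_group))  # 70.0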
SRA Verbal
This test is a general ability test which measures the individual's overall adaptability and flexibility in comprehending and following instructions and in adjusting to alternating types of problems. It is designed for use in both school and industry. It has two forms, A and B, that can be used at all educational levels from junior high school to college and at all employee levels from unskilled laborers to middle management. However, it is intended only for persons familiar with the English language. To determine the general ability of persons who speak a foreign language or who are illiterate, a nonverbal or pictorial test should be used. The items in this test are of two types: vocabulary (linguistic) and arithmetic reasoning (quantitative). The test is intended for 12- to 17-year-olds. Reliability was determined, and the reported coefficients are in the high .70s for all the scores: linguistic, quantitative, and total. The means were also found to be very similar. For the validity of the test, the SRA was correlated with other tests, particularly the HS Placement Test (r = .60) and the Army General Classification Test (r = .82).
This test was designed to measure a person's critical thinking. It is a series of exercises requiring the application of some of the important abilities involved in thinking critically. It includes problems, statements, arguments, and interpretations of data similar to those which a citizen in a democracy might encounter in daily life. It has two forms, Ym and Zm, each consisting of five subtests. These subtests were designed to measure different but interdependent aspects of critical thinking. There are 100 items, and the instrument is used as a test of power rather than of speed. The five subtests are inference, recognizing assumptions, deduction, interpretation, and evaluation of arguments. Inference consists of 10 items in which students display the ability to discriminate among degrees of truth or falsity of inferences drawn from given data. Recognizing assumptions (16 items), on the other hand, requires students to recognize unstated assumptions or presuppositions that are taken for granted in given statements or assertions. Next, deduction (25 items) tests the ability to reason deductively from given statements or premises and to recognize the relation of implication between propositions. The
next is interpretation, which measures the ability to weigh evidence and to distinguish between generalizations from given data that are not warranted beyond a reasonable doubt and generalizations which, although not absolutely certain or necessary, do seem to be warranted beyond a reasonable doubt. Lastly, evaluation of arguments measures the ability to distinguish between arguments which are strong and relevant and those which are weak or irrelevant to a particular question or issue. For the standardization of the test, norms were set covering four grade levels: grades 9, 10, 11, and 12, with a total of 20,312 participating students. Participating high schools had to be regular public institutions in communities of 10,000-75,000 with a minimum of 100 students. This was done to avoid the biasing influences associated with extremely small schools and with the specialized high schools found in some very large systems. Reliability was determined using the split-half method. The computed reliability coefficients were .61, .74, .53, .67, and .62 for inference, recognizing assumptions, deduction, interpretation, and evaluation of arguments, respectively, in the Ym form. For the Zm form, the reliability coefficients were .55, .54, .41, .52, and .40 for inference, recognizing assumptions, deduction, interpretation, and evaluation of arguments, respectively. Validity was then determined through content and construct validity. The indication of content validity was the extent to which the critical thinking appraisal measures a sample of the specified objectives of instructional programs. Moreover, for construct validity, intercorrelations among the various forms of the test ranged from .21 to .50, and the correlations of the subtests with the appraisal as a whole ranged from .56 to .79.
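The split-half procedure used here and in several of the tests above can be sketched as follows; the data are hypothetical, and the odd-even split with the Spearman-Brown correction is the standard textbook procedure, not necessarily the publisher's exact method:

    import numpy as np

    # Persons-by-items matrix of 0/1 responses (hypothetical)
    items = np.array([[1, 0, 1, 1, 0, 1, 1, 0],
                      [1, 1, 1, 1, 1, 1, 0, 1],
                      [0, 0, 1, 0, 0, 1, 0, 0],
                      [1, 1, 1, 0, 1, 1, 1, 1],
                      [0, 1, 0, 0, 1, 0, 0, 0]])
    odd = items[:, ::2].sum(axis=1)    # total score on odd-numbered items
    even = items[:, 1::2].sum(axis=1)  # total score on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    # Spearman-Brown correction estimates the full-length reliability
    print(round(2 * r_half / (1 + r_half), 2))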
Achievement Tests
This test was designed to provide accurate and dependable data concerning the achievement of students in important skill and content areas of the school curriculum. The test reflects the view that an achievement test should assess what is being taught in the classroom. Its use has been extended to include the first half of kindergarten and grades 10-12. It is a two-component system of achievement evaluation designed to obtain both norm-referenced and criterion-referenced information. The first is the instructional component, designed for classroom teachers and curriculum specialists. This is an instructional planning tool that provides prescriptive information on the educational performance of individual students in terms of specific instructional objectives. It has a separate instructional battery that includes reading, mathematics, and language, all available in JI and KI forms. The other is the survey component, which provides the classroom teacher with considerable information about the strengths and weaknesses of the students in the class in the important skill and content areas of the school curriculum. Under this are eight overlapping batteries covering the age range from K-12, which also include reading, mathematics, and language. The norm was set with participants selected to represent the national population in terms of school system enrollment, public versus non-public school affiliation, geographic region, socio-economic status, and ethnic background. There were 550 students, with 10% of public schools drawn from the metropolitan population and 10% from the national population. For socio-economic status, 54% of the metropolitan sample and 52% of the national sample were adult high school graduates. Reliability was computed using KR-20, obtaining .93 for reading, .91 for mathematics, and .88 for language; the basic battery obtained .96. The standard error of measurement
was also computed, yielding 2.8 for reading, 2.9 for mathematics, 3.4 for language, and 5.3 for the basic battery. Validity was established through content validity, on the premise that the objectives and items should correspond to the school curriculum. With this in mind, a compendium of instructional objectives was made available.
This test was designed by Gardner, Rudman, Karlson, and Merwin in 1981. It is a comprehensive series of tests developed to assess the outcomes of learning at different levels of the educational sequence. It measures the objectives of general education from kindergarten through the first year of college. The series includes the Stanford Early School Achievement Test (SESAT) and the Stanford Test of Academic Skills (TASK). The SAT is intended for primary, intermediate, and junior high school and assesses the essential learning outcomes of the school curriculum. It was first established in 1923 and underwent several revisions until 1982. These revisions were done to keep a close match between test content and learning practices, to provide norms that accurately reflect the performance of students in different grade levels, and to adopt modern ways of interpreting scores made possible by improvements in measurement technology. The SESAT is for children in kindergarten and grade 1. It measures the cognitive development of children upon admission and entry into school in order to establish a baseline where learning experiences may best begin. The TASK, on the other hand, is intended for students in grades 8 to 13 (first year of college) and measures basic skills. Level I of TASK, for grades 8-12, measures the competencies and skills desired at the adult societal level, while Level II, for grades 9-13, measures the skills that are requisite to continued academic training. The SAT contains subtests on reading comprehension, vocabulary, listening comprehension, spelling, language, concepts of number, mathematics computation, mathematics applications, and science. Reading comprehension measures understanding skills covering textual reading (typically found in books), functional reading (printed materials found in daily life), and recreational reading (reading for enjoyment, such as poetry and fiction). Vocabulary measures the pupil's language competence without requiring prior reading for the test. Listening comprehension evaluates the ability of the student to process information that has been heard. Spelling tests the ability of the student to identify the misspelled word in a group of four words. The language test has three parts: proper use of capital letters, use of punctuation marks, and appropriate use of the parts of speech. Concepts of number covers the student's understanding of basic concepts about numbers. Mathematics computation includes the multiplication and division of whole numbers and operations on fractions, decimals, and percents. Mathematics applications tests the student's ability to apply the concepts they have learned to problem solving. Lastly, the science subtest measures the ability of the students to understand the basic physical and biological sciences. One of the items in the SAT under vocabulary is "When you have a disease, you are ____": a. sick, b. rich, c. lazy, d. dirty. The reliability of the test was obtained through internal consistency using KR-20 (computed r = .85 to .95), standard errors of measurement, and alternate-forms reliability. For validity, the test content was compared with the instructional objectives of the curriculum.
Aptitude Tests
This test was designed to meet the needs of guidance counselors and consulting psychologists, whose advice and ideas were sought in planning a battery that would meet exacting standards and be practical for daily use in schools, social agencies, and business organizations. The original forms (A and B) were developed in 1947 with the aim of providing an integrated, scientific, and well-standardized procedure for measuring the abilities of boys and girls in grades 8-12 for the purposes of educational and vocational guidance. It was intended primarily for junior and senior high school. It can also be used in the educational and vocational counseling of young adults out of school and in the selection of employees. The test was revised and restandardized in 1962 for Forms L and M and in 1972 for Forms S and T. Included in the DAT battery are verbal reasoning, numerical ability, abstract reasoning, clerical speed and accuracy, mechanical ability, space relations, and spelling. Verbal reasoning measures the ability of the student to understand concepts framed in words. The numerical ability subtest tests students' understanding of numerical relationships and facility in handling numerical concepts, including arithmetic computations. Abstract reasoning is intended as a nonverbal measure of the student's reasoning ability. Clerical speed and accuracy intends to measure speed of response in simple perceptual tasks, including simple number and letter combinations. The mechanical ability test is a reconstructed, easier version of the Mechanical Comprehension Test and measures mechanical intelligence. Space relations measures the ability to deal with concrete materials through visualization. Lastly, spelling measures the student's ability to detect errors in grammar, punctuation, and capitalization. Norms were obtained in the form of percentiles and stanines. Seventy-six school districts were included, testing students in grades 8-12, including schools in the District of Columbia. Schools with 300 or more students each were included. For small school districts, the entire grade 8-12 enrollment participated; for large school districts, representative schools were included, taking into consideration school achievement and racial composition. All in all, there were 14,049 grade 8 students, 14,793 grade 9 students, 13,613 grade 10 students, 11,573 grade 11 students, and 10,764 grade 12 students. Reliability coefficients were computed through the split-half method. Validity was determined, and the coefficients presented demonstrate the utility of the Differential Aptitude Test for educational guidance; the expectancy tables show that each of the tests is potentially useful.
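Stanine norms place scores into nine bands containing fixed percentages (4, 7, 12, 17, 20, 17, 12, 7, 4) of the norm group. A minimal sketch with simulated data follows; the cut percentages are the conventional stanine bands, not taken from the DAT manual:

    import numpy as np

    def stanines(raw):
        # Cut points at cumulative percentages 4, 11, 23, 40, 60, 77, 89, 96
        cuts = np.percentile(raw, [4, 11, 23, 40, 60, 77, 89, 96])
        return 1 + np.searchsorted(cuts, raw, side="right")

    scores = np.random.default_rng(0).normal(50, 10, 1000)  # simulated norm group
    print(np.bincount(stanines(scores))[1:])  # counts in stanines 1 through 9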
This test is a set of 18 short tests designed for use with adults in personnel selection programs for a wide variety of jobs. The tests are short and self-administering. The FIT battery measures 18 subscales: arithmetic, assembly, components, coordination, electronics, expression, ingenuity, inspection, judgment and comprehension, mathematics and reasoning, mechanics, memory, patterns, planning, precision, scales, tables, and vocabulary. Arithmetic measures accuracy in working with numbers. Assembly measures the ability to visualize the appearance of an object assembled from separate parts. Components is the ability to locate and identify the important parts of a whole. Coordination tests arm and hand coordination. Electronics measures the understanding of electronic principles and the ability to analyze diagrams of
electrical circuits. Expression is the feel for and knowledge of correct English and the ability to convey ideas in writing and speaking. Ingenuity refers to being creative and inventive and having the ability to devise procedures, equipment, and presentations. Inspection is the ability to spot flaws and imperfections in a series of articles accurately and quickly. Judgment and comprehension is the ability to read with understanding and to use good judgment in interpreting materials. Mathematics and reasoning refers to understanding basic mathematical concepts and the ability to apply them in solving certain problems. Mechanics is the ability to understand mechanical principles and analyze mechanical movements. Memory tests the ability to learn and recall associations. Patterns refers to the ability to perceive and reproduce simple pattern outlines accurately. Planning is the ability to foresee problems that may arise and to anticipate the best order for carrying out steps. Precision refers to the ability to make appropriate finger movements with accuracy. Scales is the ability to read and understand what scales, graphs, and charts convey. Tables refers to the ability to read and understand tables accurately and quickly. Vocabulary refers to the ability to choose the right terms to convey one's ideas. The standardization sample for this test consisted of 12th-grade students. Reliability was determined, with reported coefficients ranging from .50 to .90 for the individual tests. When the FIT was correlated with the FACT, the range was .28 (memory) to .79 (arithmetic). For validity, many of the short tests have fairly substantial validity coefficients, ranging from .20 to .50 using stepwise multiple regression. It was also found that five of the tests, namely mathematics and reasoning, judgment and comprehension, planning, arithmetic, and expression, yield a multiple correlation of .5898 with fall semester GPA. The first four of these tests, along with vocabulary and precision, provide a multiple correlation of .47 with spring semester GPA. In general, the multiple correlations vary from .40 to .57.
Personality Tests
This test was created by Allen L. Edwards and was published by The Psychological Corporation. It is an instrument for research and counseling purposes that provides convenient measures of independent personality variables, along with measures of test consistency and profile stability. It is a non-projective personality test derived from H. A. Murray's theory, measuring individuals on fifteen normal needs or motives. These needs or motives from Murray's theory underlie the statements used in the Edwards Personal Preference Schedule. It consists of 15 scales: achievement, deference, order, exhibition, autonomy, affiliation, intraception, succorance, dominance, abasement, nurturance, change, endurance, heterosexuality, and aggression. Achievement is described as the desire of the person to exert his or her best effort. Deference is the tendency of a person to seek suggestions from other people, do what is expected, praise others, conform, and accept others' leadership. Order is neatness and organization in doing one's work, arranging everything in proper order so that everything runs smoothly. Exhibition is the tendency to say smart and clever things to gain others' praise and be the center of attention. Autonomy is the ability to do whatever is desired, avoiding conformity and making independent decisions. Affiliation is having plenty of friends, the ability to form new acquaintances, and building intimate attachments with others. Intraception is the tendency to put oneself in others' shoes and to analyze others' behaviors and motives. Succorance is the desire to be helped by others in times of trouble, to seek encouragement, and
to want others to be sympathetic. Dominance is the tendency of a person to argue against another's view, act as a leader in the group, thereby influencing others, and make group decisions. Abasement is the tendency to feel guilty when one commits a mistake, to accept blame, and to feel the need to confess after a mistake is made. Nurturance is the inclination to help friends who are in trouble, the desire to help the less fortunate, showing great affection to others and being kind and sympathetic. Change is the tendency to explore new things and to do things out of routine. Endurance is the ability to keep doing a task until it is finished and to stick with a problem until it is solved. Heterosexuality is the desire to go out with friends of the opposite sex, becoming physically attracted to people of the opposite sex and being sexually excited. Lastly, aggression is the tendency to criticize others in public, attack contrary points of view, and make fun of others. This test is intended for college students and adults. To set the norm, 1,509 college students with high school graduation and college training were included, consisting of 749 college women and 760 college men. Another part of the normative sample consisted of adults: male and female household heads who were members of a consumer purchase panel used for market surveys. They came from rural and urban areas of counties in the 48 states, and the consumer panel consisted of 5,105 households. For reliability, a split-half technique was used; the coefficients of internal consistency for the 1,509 students in the college normative group range from .60 to .87, with a median of .78. Test-retest stability coefficients with a one-week interval were also computed; based on a sample of 89 students, they range from .55 to .87, with a median of .73. Other researchers have reported similar results over a three-week period, showing correlations of .55 to .87 with a median of .73. For validity, the manual reports studies comparing the EPPS with the Guilford-Martin Personality Inventory and the Taylor Manifest Anxiety Scale. Other researchers have correlated the California Psychological Inventory, the Adjective Check List, the Thematic Apperception Test, the Strong Vocational Interest Blank, and the MMPI with the EPPS. In these studies there are often statistically significant correlations among the scales of these tests and the EPPS, but the relationships are usually low to moderate and often difficult for the researcher to explain.
sociability means optimism and cheerfulness. A high score in objectivity may mean being less egoistic and less hypersensitive. A high score in friendliness means a lack of fighting tendencies and a desire to be liked by others. A high score in thoughtfulness may pertain to men who have an advantage in obtaining supervisory positions. High scores in personal relations mean a high capability of getting along with other people. A high score in masculinity may pertain to people who behave in ways that are more typical of men. Examples of items in this test are "You like to play practical jokes on others" and "Most people are out to get more than they give." Standardization of this test was done by gathering 523 college men and 389 college women from one southern California university and two junior colleges for all traits except thoughtfulness. The male sample included veterans aged 18-30 years old. Reliability was calculated using KR-20, with coefficients ranging from .79 for general activity to .87 for sociability. The intercorrelations among the ten traits are low; only two are high, between sociability and ascendance and between emotional stability and objectivity. For validity, it is believed that what each score measures is fairly well defined and that the scores represent confirmed dimensions of personality and dependable descriptive categories. The most impressive validity data have come from the use of the inventory with supervisory and administrative personnel.
The 16 PF was originally developed by Raymond Cattell, Karen Cattell, and Heather Cattell to help identify personality factors. It can be administered to individuals 16 years and older.
There are 16 bipolar dimensions of personality and 5 global factors. The bipolar dimensions of
Personality are Warmth (Reserved vs. Warm; Factor A), Reasoning (Concrete vs. Abstract;
Factor B), Emotional Stability (Reactive vs. Emotionally Stable; Factor C), Dominance
(Deferential vs. Dominant; Factor E), Liveliness (Serious vs. Lively; Factor F)
Rule-Consciousness (Expedient vs. Rule-Conscious; Factor G), Social Boldness (Shy vs.
Socially Bold; Factor H), Sensitivity (Utilitarian vs. Sensitive; Factor I), Vigilance (Trusting vs.
Vigilant; Factor L), Abstractedness (Grounded vs. Abstracted; Factor M), Privateness (Forthright
vs. Private; Factor N), Apprehension (Self-Assured vs. Apprehensive; Factor O), Openness to
Change (Traditional vs. Open to Change; Factor Q1), Self-Reliance (Group-Oriented vs. Self-
Reliant; Factor Q2), Perfectionism (Tolerates Disorder vs. Perfectionistic; Factor Q3), Tension
(Relaxed vs. Tense; Factor Q4). The global factors are Extraversion, Anxiety, Tough-
mindedness, Independence, and Self-Control. A stratified random sampling that reflects the 2000
U.S. Census was used to create the normative sample, which consisted of 10,261 adults. Test-
retest coefficients offer evidence of the stability over time of the different traits measured by the
16 PF. Pearson product-moment correlations were calculated for two-week and two-month test-retest intervals. Reliability coefficients for the primary factors ranged from .69 (Reasoning, Factor B) to .86 (Self-Reliance, Factor Q2), with a mean of .80. Test-retest coefficients for the global factors were higher, ranging from .84 to .90 with a mean of .87. Cronbach's alpha values ranged from .64 (Openness to Change, Factor Q1) to .85 (Social Boldness, Factor H), with an average of .74. Validity studies of the 16 PF (5th ed.) demonstrated its ability to predict various criterion measures such as the Coopersmith Self-Esteem Inventory, Bell's Adjustment Inventory, and the Social Skills Inventory. Its subscales correlate well with the factors of the Myers-Briggs Type Indicator.
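Both reliability indices reported for the 16 PF can be sketched briefly in Python (all data below are hypothetical): Cronbach's alpha from a persons-by-items score matrix, and test-retest reliability as the Pearson correlation between two administrations of the same scale.

    import numpy as np

    def cronbach_alpha(items):
        # items: persons-by-items matrix of (possibly polytomous) scores
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars / total_var)

    scale = np.array([[3, 4, 2, 5], [2, 2, 1, 3], [4, 5, 4, 4],
                      [1, 2, 2, 2], [5, 4, 5, 5]])  # hypothetical responses
    print(round(cronbach_alpha(scale), 2))

    # Test-retest: Pearson correlation between two administrations
    time1 = np.array([12, 15, 9, 20, 17, 11])
    time2 = np.array([13, 14, 10, 19, 18, 12])
    print(round(np.corrcoef(time1, time2)[0, 1], 2))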
remain the same overall type, and 36% remain the same after nine months. When people are asked to compare their preferred type to the type assigned by the MBTI, only half pick the same profile. Critics also argue that the MBTI lacks falsifiability, which can cause confirmation bias in the interpretation of results. Standardization was done using high school, college, and graduate students; recently employed college graduates; and public school teachers.
This test, also called the PUP, was developed by Virgilio G. Enriquez and Ma. Angeles Guanzon-Lapeña. It was published by the Research Training House in 1975. The Panukat ng Ugali at Pagkatao is a psychological test that can be used for research, for employment, and for screening members and students of an institution. Its reliability is .90, and its test-retest reliability was .94 (p < .01). It has five trait subscales, each with underlying personality traits: Extraversion or Surgency, Agreeableness, Conscientiousness, Emotional Stability, and Intellect or Openness to Experience. Under extraversion are ambition (+), guts/daring (+), shyness or timidity (-), and conformity (-). Ambition is the tendency of a person to act toward the accomplishment of his or her goal. Guts/daring is courage, a very strong emotion from within the person; it can relate to things at risk or in danger, whether life, aspects of life, or material things. Shyness or timidity is the trait of being timid, reserved, and unassertive. A person who is shy tends not to socialize with others, does not engage in eye contact, and lacks confidence in himself or herself, so he or she prefers to be alone. Conformity is the tendency of a person to take into consideration what other people are saying, especially a person of higher position. A conforming person tends to disregard his or her own opinion. For agreeableness, the factors are respectfulness (+), generosity (+), humility (+), helpfulness (+), difficulty to deal with (-), criticalness (+), and belligerence (-). Respectfulness is the trait of giving value to the person one is talking to regardless of his or her position and age. Generosity is the ability to satisfy the needs of others by giving what they need or want even if it is not in accordance with one's personal desire. Humility is the trait of showing modesty and humbleness in dealing with other people, not boasting of one's accomplishments and status in life. Helpfulness is the desire to attend to others' needs and fill their shortcomings. Difficulty to deal with is the tendency of a person to agree to something only after many requests. Criticalness is the tendency of a person to criticize every small detail of something, giving attention to things that are rarely noticed by others. Belligerence is the trait of being quarrelsome and hot-headed, easily angered and frequently in trouble due to little or no patience. For the conscientiousness dimension, the traits are thriftiness, perseverance, responsibleness, prudence, fickle-mindedness, and stubbornness. Thriftiness is the ability of a person to manage his or her resources wisely and to be conservative in spending money. Perseverance is the persistence of a person in achieving a goal and being constant with things already started until they are finished. Responsibleness is the capacity to do the task assigned and to be accountable for it. Prudence is the ability to make sound and careful decisions by weighing the available options. Fickle-mindedness is the tendency of a person to think twice before finally making up one's mind and to change one's mind every once in a while. Finally, stubbornness is the determination to do things despite any prohibitions, hindrances, and objections, and the difficulty of convincing the person that he or she has committed a mistake. For the fourth dimension, emotional stability, four traits are included: restraint (+/-), sensitiveness (-), low tolerance for joking/teasing (-), and moodiness (-). Restraint is the tendency of the person not to show his or her intense
emotions, keeping one's feelings to oneself as a self-control strategy. Sensitiveness is the tendency of a person to be easily hurt or affected by little things said or done that the person does not like. Low tolerance for joking/teasing is the tendency of a person to respond with intense emotion to the teasing or provocation of others. Moodiness is the tendency to show unusual attitudes or behavior and changing emotions due to an unexpected event. The last dimension of this test, intellect or openness to experience, includes three personality traits: thoughtfulness, creativity, and inquisitiveness. Thoughtfulness is the tendency of a person to be very concerned with the future, especially regarding problems or troubles. Creativity is the natural ability of a person to make or create something out of local materials or resources and the ability to express oneself; creative people have a wide imagination and a high inclination to music, arts, and culture. Lastly, inquisitiveness is the trait of being curious and sometimes intrusive. To build a norm, 3,702 participants from various ethnic groups were asked to participate: 412 Bicolano, 152 Chabacano, 642 Ilocano, 489 Cebuano, 170 Ilonggo, 190 Kapampangan, 513 Tagalog, 378 Waray, 29 Zambal, and 83 others. For the validity of the test, all items are said to have a positive direction. Two validity subscales were used: denial (items with which respondents are certain to disagree, such as "I never told a lie in my entire life") and tradition (items with which respondents are certain to agree, such as "I would take care of my parents when I get old").
This test was developed by Annadaisy J. Carlota of the Department of Psychology at the University of the Philippines. It was published in Quezon City in 1989. The PPP is a three-form personality test designed to measure 19 personality dimensions. Each personality dimension corresponds to a subscale comprising a homogeneous subset of items. The three forms are Form K, Form S, and Form KS. Form K corresponds to the traits salient for interpersonal relations. Under this form are eight personality traits: thoughtfulness, social curiosity, respectfulness, sensitiveness, obedience, helpfulness, capacity to be understanding, and sociability. Thoughtfulness is the tendency to be considerate toward others; a thoughtful person tries not to inconvenience other people. Social curiosity is inquisitiveness about others' lives; a socially curious person tends to ask someone about everything and loves to know everything happening around him or her. Respectfulness is the tendency to recognize others' beliefs and privacy; the behavior of a respectful person is concretized by simply knocking on the door before entering. Sensitiveness is the tendency of a person to be easily affected by any negative criticism, so a sensitive person does not want to hear negative criticism from other people. Obedience is the tendency of a person to do what others demand of him or her; an obedient person tends to follow whatever is commanded by others. Helpfulness is the tendency of a person to offer service to others, extend help, and give resources; it is characterized by a person who is always willing to lend his or her things to others. The capacity to be understanding is the person's tolerance of other people's shortcomings; when hurt by others, he or she is always ready to listen to explanations. And lastly, sociability is the ability of a person to easily get along with and befriend others; in social gatherings or events, this person will always take the first move to introduce himself or herself to others. The second form of this test is Form S, which includes seven factors: orderliness, emotional stability, humility, cheerfulness, honesty, patience, and responsibility. Orderliness is neatness and organization in one's appearance and work; the person who is orderly puts his or her things in their proper places. Emotional stability is the ability of a person
to control his or her emotions and remain calm even when facing great trouble. Humility is the tendency to remain modest despite accomplishments and to readily accept one's own mistakes; a humble person does not boast about his or her successes. Cheerfulness is the disposition to be cheerful and to see the happy and funny aspects of things that happen; a cheerful person is one who always finds funny things about situations. Honesty is the sincerity and truthfulness of a person; an honest person tends to tell the truth in every situation regardless of the feelings of others. Patience is the ability to cope with daily life's routine and repetitive activities; a patient person is one who responds to a child's repetitive questions without getting mad. Lastly, responsibility is the tendency of a person to do a particular task on his or her own initiative; a responsible person is characterized by not procrastinating in accomplishing an activity. The last form of the PPP, Form KS, has four subscales: creativity, risk-taking, achievement orientation, and intelligence. Creativity is the ability to be innovative and to think of various strategies in solving a problem. Risk-taking is the tendency to take on new challenges despite unknown consequences; a risk-taker is one who believes that one must take risks to be successful in life. Achievement orientation is the tendency of a person to strive for excellence and to emphasize quality over quantity in every task he or she does. And lastly, intelligence is the trait of perceiving oneself as an intelligent person; it is also characterized by easily understanding the material being read. This test can be taken by persons aged 13 and above. It is written in Filipino and has translations in English, Cebuano, Ilokano, and Ilonggo. During its pretest, 245 respondents aged 13-18 years old were included, with more females than males. Reliability was tested through internal consistency. All personality dimensions except achievement orientation have high reliability. Internal consistency analysis was done several times: at first the top 10 items were taken, then the top 12, then the top 14; on the fourth run, the top 8 were taken and included in the inventory. Form K has a mean reliability coefficient of .69, Form S .81, and Form KS .72. For validity, construct validity was applied: the internal structure of the original version of the PPP was examined by obtaining intercorrelations among the subscales before clustering them into the three forms. The test was deemed valid because, first, more positive than negative intercorrelations were obtained; second, among the personality subscales there were also more positive than negative correlations, except for social curiosity and sensitiveness; and lastly, the magnitudes of the correlations were small to moderate, although the majority were significant at the .05 alpha level. The predominance of positive intercorrelations means that all of the subscales measure the same construct, which is personality. The test was standardized through norms developed in two forms: percentiles and normalized standard scores with a mean of 50 and a standard deviation of 10.
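Normalized standard scores with a mean of 50 and a standard deviation of 10 (T-scores) can be derived by converting mid-rank percentiles to normal-curve equivalents. The following is a minimal sketch with hypothetical raw scores; the mid-rank convention is one common choice, not necessarily the PPP's exact procedure:

    import numpy as np
    from scipy.stats import norm, rankdata

    def t_scores(raw):
        pct = (rankdata(raw) - 0.5) / len(raw)  # mid-rank percentile of each score
        return 50 + 10 * norm.ppf(pct)          # map to normal deviate, rescale

    print(np.round(t_scores(np.array([10, 14, 14, 18, 22, 30])), 1))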
Attitude Tests
This test was developed to help meet the challenge posed by students with high scholastic aptitude who perform very poorly in school, while students who are mediocre on scholastic tests do well. It is an easily administered survey of study methods, motivation for studying, and certain attitudes toward scholastic activities that are important in the classroom. The purpose of developing it was to identify students whose study habits and attitudes differ from those of students who earn high grades, to aid the understanding of students with academic
difficulties, and to provide a basis for helping such students improve their study habits and attitudes and thus more fully realize their best potentialities. In addition, study habits are believed to be a strong predictor of achievement. The test consists of Form C for college and Form H for high school (grades 7-12). The four basic subscales are delay avoidance, work methods, teacher approval, and educational acceptance. It has 100 items and can be used as a screening instrument, a diagnostic tool, and a teaching aid. There were separate norms for the two forms. For Form C, 3,054 first-semester freshmen enrolled at the following nine colleges were included: Antioch College, Bowling Green State University, Colorado College, Reed College, San Francisco State College, Southwest Texas State College, Stephen F. Austin State College, Swarthmore College, and Texas Lutheran College. For Form H, 11,218 students in 16 different towns and metropolitan areas in America participated: Atacosta, Texas (grades 10-12); Austin, Texas (10-12); Buda, Texas (7-12); Durango, Colorado (10-12); Glen Ellyn, Illinois (9); Gunnison, Colorado (10-12); Hagerstown, Maryland (7-12); Marion, Texas (7-12); Navarro, Texas (7-12); New Braunfels, Texas (7-9); Salt Lake City, Utah (7-12); San Marcos, Texas (7-12); Seguin, Texas (7-12); St. Louis, Missouri (7-12); and Waelder, Texas (7-12). The computed reliability coefficients were based on Kuder-Richardson formula 8 and ranged from .87 to .89. Using the test-retest method with a four-week interval, the coefficients were .93, .91, .88, and .90 for delay avoidance, work methods, teacher approval, and educational acceptance, respectively; with a 14-week interval, the reliability coefficients were .88, .86, .83, and .85. For validity, the criterion used was the one-semester grade point average (GPA). The SSHA and GPA were correlated, and the results were .27 to .66 for men and .26 to .65 for women. The average validity coefficients for 10 colleges were .42 for men and .45 for women. The correlation of the SSHA with the ACE (American Council on Education Psychological Examination), a scholastic aptitude test, was always low. Form C of the SSHA correlated with GPA obtained .25 to .45, with a weighted average of .36. Averaged through Fisher's z-function, the subscale correlations were .31 for delay avoidance, .32 for work methods, .25 for teacher approval, and .35 for educational acceptance.
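Fisher's z-function, used above to pool correlations, transforms each coefficient with the inverse hyperbolic tangent, averages in the z metric, and transforms back. A minimal sketch with hypothetical coefficients:

    import numpy as np

    rs = np.array([0.27, 0.41, 0.36, 0.52])   # hypothetical validity coefficients
    mean_r = np.tanh(np.arctanh(rs).mean())   # average via Fisher's z
    print(round(float(mean_r), 2))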
This test intends to meet the need for assessing the goals that motivate people to work. It measures values that are extrinsic as well as intrinsic to work: the satisfactions which men and women seek in work and the satisfactions which may be the concomitants or outcomes of work. It seeks to measure these in boys and girls, and in men and women, at all age levels beginning with adolescence and at all educational levels beginning with entry into junior high school. It is distinctive both in the variety of values tapped and in the ages for which it is appropriate. Its factors are altruism, esthetics, creativity, intellectual stimulation, achievement, independence, prestige, management, economic returns, security, surroundings, supervisory relations, associates, way of life, and variety. Altruism refers to work which enables the person to contribute to the welfare of others. Esthetics refers to work which permits one to make beautiful things and to contribute beauty to the world. Creativity pertains to work which permits one to invent new things, design new products, or develop new ideas. Intellectual stimulation refers to work which provides opportunity for independent thinking and for learning how and why things work. Achievement refers to work which gives one a feeling of accomplishment in a job well done. Independence pertains to work which permits one to work in his own way, as fast or as slowly as he wishes. Prestige pertains to work which gives one standing in the eyes of others and evokes respect. Management refers to work which permits one to plan and lay out
work for others to do. Economic returns pertain to work which pays well and enables one to have the things he wants. Security pertains to work which provides one with the certainty of having a job even in hard times. Surroundings pertains to work which is carried out under pleasant conditions: not too hot or too cold, not noisy or dirty, and so on. Supervisory relations refer to work which is carried out under a supervisor who is fair and with whom one can get along. Associates refer to work which brings one into contact with fellow workers whom he likes. Way of life refers to the kind of work that permits one to live the kind of life he chooses and to be the type of person he wishes to be. Variety refers to work that provides an opportunity to do different types of jobs. One of the items in the inventory under creativity is "Create new ideas, programs or structures departing from those ideas already in existence." To set the standards of this test, a norm was obtained. The sample comprised students in grade 7 (902 females, 925 males), grade 8 (862 females, 949 males), grade 9 (844 females, 931 males), grade 10 (772 females, 859 males), grade 11 (824 females, 814 males), and grade 12 (724 females, 672 males). Reliability was obtained through the test-retest method, and the reliability coefficients reported for the subscales were .83, .82, .84, .81, .83, .83, .76, .84, .88, .87, .74, .82, and .80. Validity was determined through construct, content, concurrent, and predictive validity. Some construct validity evidence was obtained by correlating the Altruism subscale with a Social Service scale (r = .67) and with the Social scale of the AVL (r = .29), and the Esthetics subscale with the Artist key of the SVIB (r = .55), with the Artistic scale of the Kuder (r = .48), and with the Aesthetic scale of the AVL (r = .08).
Interest Tests
This test was designed to enable a systematic study of a person's interests. It is a standardized questionnaire designed to bring to the fore facts about a person's occupational interests so that he and his advisers can discuss his educational and occupational plans more intelligently and objectively. The test is intended for students in grades 8-12 and for adults. It requires relatively low reading skills, as determined by readability formulas. It provides information concerning a vital phase in the complex matter of setting a person's vocational plans wisely and planning a program for attaining his goals. It yields scores in six broad occupational fields for each sex. Both females and males obtain scores in the fields identified as commercial, mechanical, professional, esthetic, and scientific; the agricultural score is only for boys, and the personal service score only for girls. Each field has 20 questions divided among four occupational sections. A 5-point scale is used, from strongly dislike to strongly like. The norm sample includes 10,000 students in 14 school systems, both males and females, from grades 8 to 12. Reliability was obtained through the test-retest method: boys obtained r = .73 for commercial and .88 for scientific scores, while girls obtained .71 for commercial and .84 for esthetic scores. Split-half reliability was also computed: boys obtained .88 for commercial scores and .95 for mechanical and scientific scores, while girls obtained .82 for commercial scores and .95 for scientific scores. For validity, the Brainard was correlated with the Kuder Preference Record, and it was found that the latter measures interest differently in that it forces the respondents to choose among three activities indicative of different types of interest.
A. Look for other standardized tests and report their current validity and reliability.
B. Administer the test that you created in Lesson 2 of Chapter 5 to a large sample. Then create a norm.
References
(1973). Measuring intelligence with the Culture Fair Test: Manual for Scales 2 and 3. Institute of Personality and Ability Testing, Philippines.
Bennett, G.K., Seashore, H.G., & Wesman, A.G. (1973). Fifth edition manual for the Differential Aptitude Tests, Forms S and T. New York: The Psychological Corporation.
Brainard, P.P., & Brainard, R.T. (1991). Brainard Occupational Preference Inventory manual. San Jose, CA.
Briggs, K.C., & Myers, I.B. (1943). The Myers-Briggs Type Indicator manual. Consulting Psychologists Press, Inc.
Brown, W.F., & Holtzman, W.H. (1967). Survey of Study Habits and Attitudes: SSHA manual. New York: The Psychological Corporation.
Carlota, A. (1989). Panukat ng Pagkataong Pilipino (PPP) manual. Quezon City, Philippines.
Enriquez, V.G., & Guanzon, M.A. (1975). Panukat ng Ugali at Pagkatao manual. PPRTH-ASP Panukat na Sikolohikal.
Flanagan, J.C. (1965). Flanagan Industrial Tests manual. Chicago, IL: Science Research Associates.
Gardner, E.F., Rudman, H.C., Karlson, B., & Merwin, J.C. (1981). Manual directions for administering the Stanford Achievement Test. New York: Harcourt Brace Jovanovich.
Guilford, J.P., & Zimmerman, W.S. (1949). Guilford-Zimmerman Temperament Survey: Manual of instructions and interpretations. New York: Harcourt Brace Jovanovich.
Otis, A.S., & Lennon, R.T. (1957). Otis-Lennon Mental Ability Test: Manual for administration. New York: Harcourt, Brace and World.
Otis, A.S., & Lennon, R.T. (1979). Otis-Lennon School Ability Test: Manual for administration and interpretation. New York: Harcourt Brace Jovanovich.
Prescott, G.A., Balow, I.H., Hogan, T.P., & Farr, R.C. (1978). Advanced 2: Metropolitan Achievement Tests, Forms JS and KS. New York: Harcourt Brace Jovanovich.
Raven, J., Raven, J.C., & Court, J.H. (2003). Manual for Raven's Progressive Matrices and Vocabulary Scales. Section 1: General overview. San Antonio, TX: Harcourt Assessment.
Super, D.E. (1970). Manual: Work Values Inventory. Houghton Mifflin Company.
Thurstone, L.L., & Thurstone, T.G. (1967). SRA Verbal: Examiner's manual. Chicago, IL: Science Research Associates.
Watson, G., & Glaser, E.M. (1964). Watson-Glaser Critical Thinking Appraisal: Manual for Forms Ym and Zm. New York: Harcourt, Brace and World.
Chapter 9
The Status of Educational Assessment in the Philippines
Objectives
1. Realize the strong foundation of the field of educational assessment in the Philippines.
2. Describe the history of formal assessment in the Philippines.
3. Describe the pattern of assessment practices in the Philippines.
Lessons
Lesson 1
Assessment in the Early Years
Formal assessment in the Philippines started as a mandate from the government to look
into the educational status of the country (Elevazo, 1968). The first assessment was conducted
through a survey authorized by the Philippine legislature in 1925. The legislature created the
Board of Educational Survey, headed by Paul Monroe. Later, the board appointed an Educational
Survey Commission, which was also headed by Paul Monroe. The commission visited schools
around the Philippines and observed the activities conducted in them. The survey reported the
following results:
1. The public school system, which is highly centralized in administration, needs to be
humanized and made less mechanical.
2. Textbooks and materials need to be adapted to Philippine life.
3. Secondary education did not prepare students for life; training in agriculture, commerce, and
industry was recommended.
4. The standards of the University of the Philippines were high and should be maintained by
freeing the university from political interference.
5. Higher education should be concentrated in Manila.
6. English as the medium of instruction was best; the use of the local dialect in teaching
character education was suggested.
7. Almost all teachers (95%) were not professionally trained for teaching.
8. Private schools, except those under religious groups, were found to be unsatisfactory.
Research, Evaluation, and Guidance Division of the Bureau of Public Schools
The Research, Evaluation, and Guidance Division of the Bureau of Public Schools started
as the Measurement and Research Division in 1924, an offshoot of the Monroe Survey. It was
intended to be the major agent of research in the Philippines.
Its functions were:
1. To coordinate the work of teachers and supervisors in carrying out testing and research
programs
2. To conduct educational surveys
3. To construct and standardize achievement tests
Through a legislative mandate in 1927, the Director of Education created the Economic Survey
Committee, headed by Gilbert Perez of the Bureau of Education. The survey studied the
economic condition of the Philippines and made recommendations on the best means by which
graduates of the public schools could be absorbed into the economic life of the country. The
results of the survey pertaining to education included:
1. Vocational education is relevant to the economic and social status of the people.
2. It was recommended that the work of the schools should not be to develop a peasantry class
but to train intelligent, civic-minded homemakers, skilled workers, and artisans.
3. Secondary education should be devoted to agriculture, trades, industry, commerce, and home
economics.
After the Prosser Survey, several other surveys, mostly to determine the quality of
schools in the country, were conducted after the 1930s. All of these surveys were government
commissioned, such as the Quezon Educational Survey in 1935, headed by Dr. Jorge C. Bacobo.
Another study, a sequel to the Quezon Educational Survey, was made in 1939; it thoroughly
examined existing educational methods, curricula, and facilities and recommended changes in
the financing of public education in the country. This was followed by another congressional
survey in 1948 by the Joint Congressional Committee on Education to look into the
independence of the Philippines from America. This study employed several methodologies.
UNESCO undertook a survey of Philippine education from March 30 to April 16, 1948,
headed by Mary Trevelyan. The objective of the survey was to examine the educational situation
of the Philippines to guide planners of subsequent educational missions to the country. The
report of the survey was gathered from a conference with educators and laymen from private
and public schools all over the country.
The UNESCO study was followed by further government studies. In 1951, the Senate
Special Committee on Educational Standards of Private Schools, headed by Antonio Isidro,
undertook a study of private schools to investigate the standards of instruction in private
institutions of learning and to provide certificates of recognition in accordance with their
regulations. In 1967, the Magsaysay Committee on General Education, financed by the
University of the East Alumni Association, conducted another study. In 1960, the National
Economic Council and the International Cooperation Administration surveyed public schools.
This survey was headed by Vitaliano Bernardino, Pedro Guiang, and J. Chester Swanson, and it
provided three recommendations to public schools: (1) to improve the quality of educational
services, (2) to expand the educational services, and (3) to provide better financing for the
schools.
The assessments conducted in the early years were mandated, commissioned, and
initiated by the government. The private sector was not yet involved in the studies as a
proponent, and the studies were usually headed by foreign counterparts, as in the UNESCO,
Monroe, and Swanson surveys. The focus of the assessments was the overall state of education
in the country, which is considered national research, given the government's need to determine
the status of education nationwide.
Lesson 2
Assessment in the Contemporary Period and Future Directions
The EDCOM report in 1991 indicated that dropout rates, especially in the rural areas,
were markedly high. Learning outcomes, as shown by achievement levels, indicated the
students' mastery of important competencies. There were high levels of simple literacy among
both 15-24 year olds and those aged 15 and above. "Repetition in Grade 1 was the highest
among the six grades of primary education, [which] reflects the inadequacy of preparation
among the young children. All told, the children with which the formal education system had to
work with at the beginning of EFA were generally handicapped by serious deficiencies in their
personal constitution and in the skills they needed to successfully go through the absorption of
learning."
The Philippine Education Sector Study (PESS) was jointly conducted by the World Bank
and the Asian Development Bank and yielded a set of recommendations.
Aside from funding and conducting surveys that apply assessment methodologies and
processes, the government also practiced testing: the screening of government employees
through tests started in 1924, and students from Grade 4 to fourth year high school were tested
at the national level in 1960 to 1961. Private organizations also spearheaded the enrichment of
assessment practices in the Philippines. These private institutions are the Center for Educational
Measurement (CEM) and the Asian Psychological Services and Assessment Corporation
(APSA).
The Fund for Assistance to Private Education (FAPE) started with testing programs such
as the guidance and testing program in 1969. It began with the College Entrance Test (CET),
which was first administered in 1971 and again in 1972. The consultants who worked on the
project were Dr. Richard Pearson from the Educational Testing Service (ETS), Dr. Angelina
Ramirez, and Dr. Felipe. FAPE then worked with the Department of Education, Culture, and
Sports (DECS) to design the first National College Entrance Examination (NCEE), which would
serve to screen fourth year high school students eligible to take a formal four-year course. There
was a need to administer a national test then because most universities and colleges did not have
an entrance examination to screen students. Later, the NCEE was completely endorsed by FAPE
to the National Educational Testing Center of the DECS.
FAPE's testing program continued with the development of a package of four tests: the
Philippine Aptitude Classification Test (PACT), the Survey/Diagnostic Test (S/DT), the College
Scholarship Qualifying Test (CSQT), and the College Scholastic Aptitude Test (CSAT). In 1978,
FAPE institutionalized an independent agency, the Center for Educational Measurement (CEM),
to undertake testing and other measurement services.
CEM started as an initiative of FAPE and was headed by Dr. Leticia M. Asuzano as its
executive vice-president. Since then, several private schools have become members of CEM to
continue its commitment and goals. CEM has since developed up to 60 tests focused on
education, such as the National Medical Admissions Test (NMAT). The main advocacy of CEM
is to improve the quality of formal education through continuing advocacy and support for
systematic research. CEM promotes the role of educational testing and assessment in improving
the quality of formal education at the institutional and systems levels. Through test results, CEM
helps improve the effectiveness of teaching and student guidance.
Aside from CEM, by 1982 there was a growing demand for testing not only in the
educational setting but also in the industrial setting. Dr. Genevive Tan, then a consultant to
various industries, felt the need to measure the Filipino 'psyche' in a valid way because most
industries used foreign tests. The Asian Psychological Services and Assessment Corporation
(APSA) was created from this need. In 2001, headed by Dr. Leticia Asuzano, former EVP of
CEM, APSA extended its services to testing in the academic setting because of the growing
demand of private schools for quality tests.
The mission of APSA is a commitment to deliver excellent and focused assessment
technologies and competence-development programs to the academe and the industry, to ensure
the highest standards of scholastic achievement and work performance, and to ensure
stakeholders' satisfaction in accordance with company goals and objectives. APSA envisions
itself as the lead organization in assessment and a committed partner in the development of
quality programs, competencies, and skills for the academe and the industry.
APSA has numerous tests that measure mental ability, clerical aptitude, work habits, and
supervisory attitudes. For the academe, it has tests for basic education, the Assessment of
College Potential, and the Assessment of Nursing Potential. In the future, the first Assessment
for Engineering Potential and Assessment of Teachers Potential will be available for use in
higher education.
APSA pioneered the use of new mathematical approaches (the IRT Rasch model) in
developing tests, which go beyond the norm-referenced approach. In 2002 it launched
standards-based instruments in the Philippines that serve as benchmarks for local and
international schools. Standards-based assessment (1) provides objective and relevant feedback
to the school on the quality and effectiveness of its instruction measured against national norms
and international standards; (2) identifies the areas of strength and the developmental areas of
the institution's curriculum; (3) pinpoints student competencies and learning gaps, which serve
as the basis for learning reinforcement or remediation; and (4) provides good feedback to the
student on how well he has learned and his readiness to move to a higher educational level.
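To connect this with the Rasch model taken up in Chapter 3: the model expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty. The short Python sketch below uses made-up values to illustrate the model's core equation only; it is not APSA's implementation:

    import math

    def rasch_probability(theta, b):
        """Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b)),
        where theta is person ability and b is item difficulty (in logits)."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    # Hypothetical values: a person of average ability (theta = 0)
    # has a higher chance on an easy item (b = -1) than on a hard one (b = 2).
    print(round(rasch_probability(0.0, -1.0), 2))  # about 0.73
    print(round(rasch_probability(0.0, 2.0), 2))   # about 0.12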
Building Future Leaders and Scientific Experts in Assessment and Evaluation in the Philippines
Only a few universities in the Philippines offer graduate training in measurement and
evaluation. The University of the Philippines offers a master's program in education specializing
in measurement and evaluation and a Doctor of Philosophy in research and evaluation.
Likewise, De La Salle University-Manila has a Master of Science in psychological measurement
offered by its psychology department, while its College of Education, a center of excellence,
offers a Master of Arts in educational measurement and evaluation and a Doctor of Philosophy
in educational psychology major in research, measurement, and evaluation.
Only these two universities in the Philippines offer graduate training and specialization
in measurement and evaluation courses. Some practitioners were trained in other countries, such
as the United States and countries in Europe. There is a growing call for educators and those in
the industry involved in assessment to be trained so that more experts in the field can be
produced.
Aside from the government and educational institutions, the Philippine Educational
Measurement and Evaluation Association (PEMEA) is a professional organization geared
toward promoting a culture of assessment in the country. The organization started with the
National Conference on Educational Measurement and Evaluation, headed by Dr. Rose Marie
Salazar-Clemeña, then dean of the College of Education of De La Salle University-Manila,
together with De La Salle-College of Saint Benilde's Center for Learning and Performance
Assessment. It was attended by participants from all around the Philippines. The theme of the
conference was "Developing a Culture of Assessment in Learning Organizations." The
conference aimed to provide a venue for assessment practitioners and professionals to discuss
the latest trends, practices, and technologies in educational measurement and evaluation in the
Philippines. At the said conference, PEMEA was formed.
The first board of directors elected for PEMEA were Dr. Richard DLC Gonzales as
president (University of Santo Tomas Graduate School), Neil O. Pariñas as vice president (De
La Salle-College of Saint Benilde), Dr. Lina A. Miclat as secretary (De La Salle-College of
Saint Benilde), Marife M. Mamauag as treasurer (De La Salle-College of Saint Benilde), and
Belen M. Chu as PRO (Philippine Academy of Sakya). The board members were Dr. Carlo
Magno (De La Salle University-Manila), Dennis Alonzo (University of Southeastern
Philippines, Davao City), Paz H. Diaz (Miriam College), Ma. Lourdes M. Franco (Center for
Educational Measurement), Jimelo S. Tipay (De La Salle-College of Saint Benilde), and Evelyn
Y. Sillorequez (Western Visayas State University).
Aside from the universities and the professional organization that provide training in
measurement and evaluation, the field is growing in the Philippines because of the periodicals
that specialize in it. CEM has its "Philippine Journal of Educational Measurement." APSA
continues to publish its "APSA Journal of SBA Research." PEMEA will soon launch the
"Educational Measurement and Evaluation Review." Aside from these journals, there are
Filipino experts from different institutions who have published their work in international
journals and in journals listed in the Social Science Index.
References
Appendix A
Critical Values of the Pearson r Moment Correlation
Appendix B
Areas of the Normal Curve
Glossary
Absolute standards, 258
Abstraction, 6
Accountability, 30
Accuracy, 8
Achievement tests, 296
ACT, 285
Adjective checklist, 228
Admission, 256
Affect, 44
Affective characteristics, 211
Affective domain, 37
Alternate form, 61
Analysis of test data, 140
Analysis, 37, 45
Application, 37
Appraisal, 29
Aptitude test, 298
Articulation, 38
Asian Psychological Services and Assessment Corporation, 315
Assessment, 2, 22
Assignments, 27
Attitude tests, 305
Attitude, 43, 212
Audience, 35
Behavior, 35
Beliefs, 213
Binary-choice, 176
Bloom’s taxonomy, 37
Brainard Occupational Preference Inventory, 307
CEEB, 285
Center for Educational Measurement, 315
Characterization, 37
Chi-square goodness of fit, 76
Clarificative, 8
Classical test theory, 92, 126
Cognitive level, 37
Cognitive strategies, 43
Cognitive system, 44
Cognitive test, 275
Comparative scale, 226
Comprehension, 37, 45
Conceptual definition, 6
Conceptual knowledge, 40
Concrete concepts, 42
Condition, 35
Confirmatory factor analysis, 75
Construct validity, 74
Constructed-response, 191
Consumer-oriented, 9
Content validity, 73
Convergent validity, 78
Correlation coefficient, 58
Counseling, 30
Criterion, 24, 35
Criterion-prediction, 73
Critical value, 61
Cronbach’s alpha, 64
Culture fair test, 292
De Bono’s six thinking hats, 47
Defined concepts, 42
Degrees of freedom, 61
Diagram scale, 230
Differential aptitude test, 298
Direction, 211
Discrimination, 42
Disposition, 215
Divergent validity, 78
Economic Survey Committee, 311
EDCOM report, 314
Edwards Personal Preference Schedule, 298
Eigenvalue, 74
Essay, 194
Evaluation, 6, 7, 37
Exceptionality, 256
Expertise-oriented, 10
Exploratory factor analysis, 74
Factor analysis, 74
Factor loading, 75
Factor, 6
Factual knowledge, 40
Feasibility, 8
Feedback, 254
Fixed sum scale, 228
Flanagan industrial test, 298
Forced ranking scale, 225
Performance assessment, 26
Personality test, 299
Philippine Education Sector Study, 314
Philippine Educational Measurement and Evaluation, 316
Picture scale, 230
Placement, 255
Power test, 275
Precision, 38
Predictive question, 245
Proactive, 8
Procedural knowledge, 40
Product, 44
Program theory, 9
Projects, 26
Promotion, 255
Propriety, 8
Prosser Survey, 312
Psychomotor domain, 38
Psychomotor procedures, 46
Qualitative, 25
Quantification, 4, 5
Quantitative, 25
Quartile deviation, 282
Questioning, 242
Rasch Model, 103
Raven's Progressive Matrices, 294
Reasoning, 43
Receiving, 37
Recitation, 26
Reliability, 57
Responding, 37
Response format, 222
Retrieval, 44
Revised Bloom’s taxonomy, 40
Root Mean Square residual, 76
Rules, 42
Scales, 223
Scatterplot, 59
Scoring, 175
Scree plot, 75
Selected-response, 176
Selection, 30, 256
Self-system, 45
Semantic differential scale, 227
Semantic distance scale, 228
Variable, 6
Variance, 61, 64
Verbal frequency, 224
Verbal information, 42
Verbal test, 275
Watson Glaser Critical Thinking Appraisal, 295
Work values inventory, 306
z score, 286
Dr. Carlo Magno is presently a faculty member of the Counseling and Educational
Psychology Department at De La Salle University-Manila, where he teaches courses in
measurement and evaluation, educational research, psychometric theory, and statistics. He
took his undergraduate degree, a Bachelor of Arts major in Psychology, at De La Salle
University-Manila. He took his master's degree in Education, major in Basic Education
Teaching, at the Ateneo de Manila University. He received his PhD in Educational
Psychology, major in Measurement and Evaluation, at De La Salle University-Manila with
high distinction. He was trained in structural equations modeling at Freie Universität in
Berlin, Germany. In 2005 he was named the Most Outstanding Junior Faculty in
Psychology by the Samahan ng Mag-aaral sa Sikolohiya, and in 2007 he received the Best
Teacher Students' Choice Award from the College of Education in DLSU-Manila. In 2008,
he was recognized by the National Academy of Science and Technology for the Most
Outstanding Published Scientific Paper in the Social Sciences. The majority of his research
uses quantitative techniques in the field of educational psychology. Some of his work on
teacher performance, learner-centeredness, measurement and evaluation, self-regulation,
metacognition, and parenting has been published in local and international refereed journals
and presented in local and international conferences. He is presently a board member of
the Philippine Educational Measurement and Evaluation Association. He is also the senior
editor of the Philippine ESL Journal and The International Journal of Research and
Review.