
KNOWLEDGE COMPETENCY EXAM

REVIEW PACKET

METHODOLOGICAL CONSIDERATIONS

PART 1 OF 1

If you would like to add to, correct, or somehow modify this
packet, please contact the Academic Success Center.
A color version is available on-line.

METHODOLOGICAL CONSIDERATIONS

ABOUT THE COMPS AND THIS STUDY PACKET ...................................................................................... 5


RECOMMENDED READING LIST ............................................................................................................ 6
I) RESEARCH DESIGN ................................................................................................................. 7
A) Terminology in scientific inquiry............................................................................................. 7
B) Hypotheses and hypothesis testing .........................................................................................11
C) Variables ..................................................................................................................................13
D) Controls and their application ................................................................................................14
E) Samples: Types and methods of selection and assignment ....................................................17
F) Criteria and criterion measures..............................................................................................21
G) Threats to internal, external, construct, and statistical conclusion validity..........................21
H) Pre-experimental, quasi-experimental and experimental designs .........................................25
I) Single subject designs ..............................................................................................................30
J) Meta analysis ...........................................................................................................................31
K) Research ethics ........................................................................................................................31
II) STATISTICS ...........................................................................................................................34
A) Descriptive ...............................................................................................................................34
1) Measures of central tendency..............................................................................................34
2) Measures of dispersion or variability .................................................................................35
3) Skewness, Kurtosis ..............................................................................................................39
B) Non-Parametric Statistics .......................................................................................................42
C) Univariate Parametric Statistics .............................................................................................45
1) Sampling distributions ........................................................................................................45
2) Assumptions and their violation .........................................................................................45
3) T-tests...................................................................................................................................46
4) ANOVA................................................................................................................................47
5) ANCOVA .............................................................................................................................56
6) Magnitude of effect ..............................................................................................................57
D) Correlational Techniques........................................................................................................58
1) Pearson correlations ............................................................................................................59
2) Partial and semi-partial correlations ..................................................................................60
3) Spearman rank correlation .................................................................................................61
4) Point biserial and phi coefficients .......................................................................................61
5) Scatter plot...........................................................................................................................61
6) Significance of a correlation ................................................................................................62
7) Assumptions.........................................................................................................................62
E) Multivariate Statistics .............................................................................................................62
1) Multiple Regression and Path Analysis ..............................................................................63
a) Simultaneous, stepwise & hierarchical regression .........................................................63
b) R, R2, significance ............................................................................................................67
c) Regression weights (raw and standard) ..........................................................................69
d) Incremental variance and significance ...........................................................................70
e) Assumptions .....................................................................................................................71
f) Shrinkage .........................................................................................................................72
g) Causal models ..................................................................................................................72
h) Moderating and mediating variables ..............................................................................73
j) Standard Error of Estimate (Prediction)........................................................................74
k) Statistical Control in MRC..............................................................................................74
l) Outliers ............................................................................................................................74
2) Principal Components and Factor Analysis .......................................................74
a) Principal components vs. principal axis ..........................................................75
May 2007


b) Rotation (orthogonal and oblique) ..................................................................75
c) Eigenvalues and percent of variance ..............................................................76
d) Confirmatory vs. exploratory .........................................................................76
e) Factor loadings ................................................................................................77
3) Discriminant function analysis and logistic regression ......................................................77
4) Multivariate analysis of variance and covariance (MANOVA) ...........................................79
a) Justification for use .........................................................................................................80
b) Synthetic variables...........................................................................................................80
c) Multivariate tests of significance.....................................................................................80
d) Assumptions .....................................................................................................................80
5) General linear model ...........................................................................................................80
F) Statistical Inference .................................................................................................................81
1) Type 1 and Type 2 errors ....................................................................................................81
2) Power of a statistical test .....................................................................................................82
3) Interpretation of significance testing ..................................................................................82
4) Confidence intervals ............................................................................................................82
III) MEASUREMENT ...................................................................................................................83
A) Scales of measurement ............................................................................................................83
1) Nominal variables ................................................................................................................83
2) Ordinal variables .................................................................................................................83
3) Equal interval variables ......................................................................................................83
4) Ratio variables .....................................................................................................................83
B) Interpretation of Measures .....................................................................................................84
1) Transforming scores ............................................................................................................84
2) Creation of norms................................................................................................................84
3) Appropriate use of norms ...................................................................................................84
4) Criterion-referenced vs. norm-referenced tests .................................................................84
5) Scaling, Likert, Guttman, etc. .............................................................................................84
6) Confidence intervals and bands of error ............................................................................85
C) Reliability of measurement .....................................................................................................86
1) Types of reliability ...............................................................................................................87
a) Stability, test-retest ..........................................................................................................87
b) Equivalence, Parallel forms ............................................................................................87
c) Homogeneity, Internal consistency .................................................................................88
d) Inter-rater ........................................................................................................................89
2) Reliability models ................................................................................................................89
a) True-score theory ............................................................................................................89
b) Domain Sampling Model .................................................................................................89
3) Relationship to other features .............................................................................................90
a) Test length........................................................................................................................90
b) Composite of measures ....................................................................................................90
c) Sample selection...............................................................................................................90
4) Relationship to validity .......................................................................................90
a) Standard error of measurement/estimate .......................................................90
b) Correction for attenuation ..............................................................................90
D) Validity ....................................................................................................................................91
1) Face ......................................................................................................................................91
2) Content ................................................................................................................................91
3) Criterion ..............................................................................................................................92
a) Forecasting efficiency ......................................................................................................92


Correcting for range restriction ..........................................................................92
Validation strategies ............................................................................................92
4) Construct .............................................................................................................................92
a) Convergent validity .........................................................................................................92
b) Divergent validity ............................................................................................................92
c) Discriminant validity .......................................................................................................92
d) Multitrait-multimethod matrix .......................................................................................93
5) Incremental validity ............................................................................................................95
E) Test Construction ....................................................................................................................95
1) Item creation........................................................................................................................95
2) Item analysis ........................................................................................................................96
[See Summary of this section in Appendix]................................................................................96
a) Difficulty ..........................................................................................................................96
b) Discriminability ...............................................................................................................96
c) Chance .............................................................................................................................98
d) Homogeneity ....................................................................................................................98
3) Test bias ...............................................................................................................................99
F) Tests (Basic features of various types, not specific tests) ....................................................... 100
1) Cognitive ............................................................................................................................ 100
a) Intelligence ..................................................................................................................... 100
b) Aptitude ......................................................................................................................... 100
c) Achievement .................................................................................................................. 100
d) Neuropsychological........................................................................................................ 100
2) Sentiments.......................................................................................................................... 101
a) Interests.......................................................................................................................... 101
b) Values ............................................................................................................................. 101
c) Attitudes......................................................................................................................... 101
3) Vocational .......................................................................................................................... 101
4) Personality ......................................................................................................................... 101
a) Objective ........................................................................................................................ 101
b) Projective ....................................................................................................................... 101
G) Utility of Tests ....................................................................................................................... 101
1) Taylor-Russell model......................................................................................................... 101
2) Naylor-Shine model ........................................................................................................... 102
SUMMARY TABLES .............................................................................................................................. 103
STATISTICAL DECISION TREE ............................................................................................................... 108
GLOSSARY .......................................................................................................................................... 113
FORMULAS .......................................................................................................................................... 117
FACTOR ANALYSIS SUMMARY ............................................................................................................. 118
ITEM ANALYSIS OVERVIEW ................................................................................................................. 120
REFERENCES ....................................................................................................................................... 122
METHODOLOGY STUDY QUESTIONS .....................................................................................................123
METHODOLOGY SAMPLE TEST I ........................................................................................................... 128
METHODOLOGY SAMPLE TEST II.......................................................................................................... 133
METHODOLOGY SAMPLE TEST III ........................................................................................................ 138
METHODOLOGY SAMPLE TEST IV ........................................................................................................ 143
METHODOLOGY SAMPLE TEST V ......................................................................................................... 148
ABOUT THE COMPS AND THIS STUDY PACKET


Please Note: The Faculty as a whole endorses the policy that each doctoral program in psychology should ensure
knowledge in certain core content areas in psychology. Such knowledge enhances the breadth of students'
understanding of the field, and allows students to make connections between basic and applied areas of the
discipline. The Knowledge Competency Examinations provide one way of assessing this knowledge.
This outline provides an overview of the topic areas you should study to prepare for the Knowledge Competency
Exam in Methodological Considerations. These areas define core knowledge bases with which you should be
familiar to demonstrate basic graduate-level knowledge in this area. The exam will test your knowledge of
concepts and their application, terminology, important and consistent research findings, and historical figures and
their major contributions.

THIS REVIEW PACKET HAS BEEN PREPARED BY STUDENTS FOR STUDENTS
This packet was prepared by Kopitzee Parra-Thornton in June 2007.
• This packet is designed to be an additional resource to help students study.
• This packet is not designed to be a comprehensive guide.
• Although the packet was prepared diligently and carefully, the reader is advised that the packet was prepared
per a student's interpretation of the topic headings of the outline. It is your responsibility to verify any
information that seems incorrect to you.
• The inclusion of material in this packet has been decided by students with no input or review from faculty.
• For a complete list of all possible areas you are responsible for on the comprehensive exam, please see the
outline designed by faculty, available online at www.alliant.edu/sandiego/comps.

Tip: A marginal symbol indicates a topic that was found on the exam.
Not all items on the exam are represented by this symbol or addressed in the packet.

RECOMMENDED READING LIST


Anastasi, A. & Urbina, S. (1997). Psychological Testing (7th ed.). Upper Saddle River, NJ: Prentice Hall, Inc.
Cook, T. D., Campbell, D. T., & Peracchio, L. (1990). Quasi Experimentation. In M. D. Dunnette, & L. M.
Hough (Eds.), Handbook of Industrial and Organizational Psychology: Vol 1. (2nd ed.). Palo Alto, CA:
Consulting Psychologists Press, Inc.
Ethical Principles in the Conduct of Research With Human Participants (1982). Washington, D. C.: American
Psychological Association, Inc.
Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. New York:
W. H. Freeman and Co.
Keppel, G. (1991). Design and Analysis: A Researcher's Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall,
Inc.
Ray, W. (1993). Methods toward a science of behavior and experience (4th ed.). Belmont, CA: Wadsworth
Publishing Company.
Siegel, S. & Castellan, N. J., Jr. (1998). Non-parametric statistics for the behavioral sciences. New York, NY:
McGraw Hill.
Tabachnick, B. G., & Fidell, L. S. (2001). Using Multivariate Statistics (4th ed.). Boston: Allyn and Bacon.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical Principles in Experimental Design (3rd ed.).
New York, NY: McGraw-Hill.

I) RESEARCH DESIGN
A) Terminology in scientific inquiry
1) Operational Definition: Definition of a concept or variable on the basis of the specific measures used
in the research project. The operational definition provides the working definition, however every
definition has limits.
a) The definition captures only a portion of the concept
b) The definition may include some irrelevant correlated features
c) The use of a single measure means the definition relies heavily upon the specific and unique
features of the measure. E.g., intelligence would be defined as a person's score on the WAIS.
In short, an operational definition defines a concept in such a way that it can be measured.
Vogt, 220: (a) a description of the way researchers will observe and measure a variable; so called because it
specifies the actions (operations) that will be taken to measure the variable. (b) The criteria used to
identify a variable or condition.
Operational definitions are essential. They make intersubjectivity (objectivity) possible because
they can be replicated; operational definitions are always imperfect, usually being artificial or too
narrow. E.g., the operational definition of an overweight person could be a person whose body mass
index (BMI) is over 25. However, some heavily muscled individuals have BMIs over 25 but would not
fit most other definitions of overweight.
2) Qualitative Designs: Research strategies that are designed to provide or allow a full and thorough
description of a specific sample. Not intended to make between group comparisons or to provide
inferential data. Descriptive only.
Vogt, 256: Said of research designs commonly used to study qualitative data. The distinction between
qualitative and quantitative design is hard to maintain. Virtually every major research design can be
employed to gather either qualitative or quantitative data. E.g., surveys are usually thought of as a
quantitative design or method, but that is not necessarily the case. Surveys might ask respondents to
answer questions on a Likert scale, which are then summed into a quantitative index. But
respondents could just as easily be asked open-ended questions; their answers would become texts
that could be studied with grounded theory, which is a qualitative method of analysis.
3) Quantitative: research strategies that are designed to provide inferential statistics from a sample.
These statistics can then be generalized to a target population which the original sample was supposed
to represent.
Vogt, 256: Said of variables or research that can be handled numerically. Usually contrasted (too sharply) with
qualitative variables and research. Many research designs lend themselves well to collecting both
quantitative and qualitative data, and many variables can be handled either qualitatively or
quantitatively. E.g., naturalistic observations can give rise to either or both kinds of data.
Interactions can be counted and timed with a stopwatch or they can be interpreted more holistically.
4) Analogue Research: Research which examines a treatment or phenomenon under conditions which
approximate the real world or clinical settings. E.g., bio-dome experiments: artificial settings used to
look at a natural behavior; there is some control over the environment.
Vogt, 8: (Also spelled: analog) Said of data or computers that use a system of representation that is physically
analogous to the thing being represented. E.g., a thermometer can use the height of a column of
mercury to indicate heat; the higher the column, the higher the temperature. Or, analog watches use
the physical movement of hour and minute hands to represent the passing of time; the more hands
have moved, the more time has passed.

5) True Experimental Designs: research strategies that provide the clearest case for drawing causal
inferences. These designs have the greatest control over independent variables and sources of error.
The most common characteristics include:
• Random assignment of subjects to conditions
• Manipulation of the independent variable
• Use of statistical analyses (ANOVA, MANOVA, ANCOVA)
• Use of a control group
These designs include between-subjects as well as within-subjects (repeated measures) independent
variables.

Between-Subjects Designs: each person serves in only one group and the groups are compared.
Between-Subjects Design (Vogt, 25): A research procedure that compares different subjects. Each
score in the study comes from a different subject. Usually contrasted to a within-subjects design,
which compares the same subjects at different times or under different treatments.

Within-Subjects Design: each person serves in every condition. For instance, time is the most
common within-subjects variable, whereby you compare each person's pre-test to his or her own post-test.
Within-Subjects Design (Vogt, 343): A before-and-after study or a study of the same subjects given
different treatments. A research design that pretests and posttests within the same group of subjects,
that is, one which uses no control group.

Repeated-Measures Design (Vogt, 274): A design in which subjects are measured two or more times
on the dependent variable. Rather than using different subjects for each level of treatment, the
subjects are given more than one treatment and are measured after each. This means that each subject
serves as his or her own control.
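The contrast between the two designs can be sketched numerically (a minimal sketch; all scores are invented): a between-subjects comparison contrasts the means of separate groups, while a within-subjects comparison looks at each subject's own change.

```python
from statistics import mean

# Between-subjects: every score comes from a different subject.
treatment_group = [12, 15, 14, 16]
control_group = [10, 11, 9, 12]
between_effect = mean(treatment_group) - mean(control_group)

# Within-subjects: the same four subjects measured before and after treatment,
# so each subject serves as his or her own control.
pre_scores = [10, 11, 9, 12]
post_scores = [12, 15, 14, 16]
difference_scores = [post - pre for pre, post in zip(pre_scores, post_scores)]
within_effect = mean(difference_scores)

print(between_effect, within_effect)  # both 3.75 with these numbers
```

With these numbers the two estimates coincide; the advantage of the within-subjects design is that stable individual differences drop out of the error term, not that the effect estimate itself changes.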

Vogt, 328: An experiment. True is contrasted with the methods of quasi-experiments and natural experiments.
The key distinction is that, unlike in other research designs, in a true experiment subjects are
randomly assigned to treatment groups and the researchers manipulate the independent variables.
6) Quasi-Experimental Designs: Designs that approximate the control offered by experimental designs,
but where true random assignment of subjects to all groups or conditions is not possible.
Characteristics include:
• A static variable that does not allow random assignment of subjects to groups/conditions; the
condition is such that you cannot randomize
• Manipulation of at least one independent variable
• Statistical analyses (ANOVA, Regression, MANOVA, ANCOVA)
Vogt, 257: A type of research design for conducting studies in field or real-life situations where the researcher
may be able to manipulate some independent variables but cannot randomly assign subjects to
control and experimental groups. E.g., you cannot cut off some individuals' unemployment benefits
to see how well they could get along without them or to see whether an alternative job training
program would be more effective. But you could try to find volunteers for the new job training
program. You could compare the results for the volunteer group (experimental group) with those of
people in the regular program (control group). The study is quasi-experimental because you were
unable to assign subjects randomly to treatment and control groups.
7) Correlational Designs, Static Group/Case Control: Designs in which the variable of interest is
studied by selecting participants who vary in their display of the variable or characteristic of interest.
This design is more observational/descriptive because it does not allow for causal inferences.

• Independent variables are not manipulated!
• Participants are not randomly assigned!
• Statistical analyses used include Chi-Square, Regression, and Correlations
• Observational or Descriptive: does not allow for inferences about causal relationships.
Vogt, 64: A design in which the variables are not manipulated. Rather, the researcher uses measures of
association to study their relations. The term is usually used in contrast with experimental research.
| Design | Subtypes | Descriptive or Causal? | Random Assignment? | Static or Control Group? | Manipulated IVs? | Statistics Used |
|---|---|---|---|---|---|---|
| Experimental | Pre-Post Control; Post Only Control; Solomon 4 Group; Time Series; Factorial | Causal | Yes | Both | Yes | t-test, ANOVA, ANCOVA, MANOVA |
| Quasi-Experimental | Multiple Treatment; Counterbalanced; Crossover | Causal | Yes/No | Both | Yes/No | ANOVA, ANCOVA, MANOVA, Regression |
| Correlational | Case Study; Case Control; Cross-Sectional | Descriptive | No | Static | No | Chi-Square, Regression, Correlation |
| Observational | Retrospective Cross-Sectional; Cohort (Prospective Longitudinal, Single Group, Multiple Group, or Accelerated) | Descriptive | No | Static | No | Qualitative |
| Single Subject | ABAB; Multiple Baseline; Changing Criterion | Descriptive | No | Static | No | Correlation, Qualitative |

8) Error Variance: Vogt, 107: Any uncontrolled or unexplained variability, such as within-group
differences in an ANOVA. Also called random error, random variance, and residual. The error
variance is the variance of the error term.
Severino: This is all the variation within each condition of your independent variable (i.e., the
experimental group and the control group). It basically represents all the individual differences among
participants (differences between individuals that will keep you from seeing the true effect of the
experiment) in each group, and it is assumed to be randomly distributed. This noise, or error variance,
can increase the probability of a Type II error (not finding significant differences when they exist).
Variance: all the ways something can differ. For instance, if you looked at the income in La Jolla, the
variance would probably be rather small, whereas if you included all areas of San Diego County the
variance would be much broader!
9) Covariance: Also known as secondary variance. This refers to differences that are NOT randomly
distributed across groups. A covariate is a variable that is a potential confound in the study and you
have found that it does, in fact, affect your dependent variable! This is a huge threat to your internal
validity! Covariance is also described as systematic differences between groups that are not accounted
for by the treatment effect!
Vogt, 66: A measure of the joint or (co-) variance of two or more variables. A covariance is an unstandardized
correlation coefficient, and it is more often reported as an r. However, the covariance is very
important in calculating many multivariate statistics.

E.g., suppose we want to see if there is a relation between knowledge of politics (variable X) and
political tolerance (variable Y). Our tests of these two variables are each measured on a scale of
1-20. We give the two tests to a sample of 10 people. The scores and the calculation of the covariance
are shown in the following table. Column 1 assigns a number to each individual taking the two
tests. Columns 2 and 4 are their results on Test X and Test Y. Columns 3 and 5 subtract the mean of
each variable from each individual's score (to get the deviation scores). Column 6 shows the
product of multiplying column 3 times column 5 (the cross product). Total Column 6 and divide by
the number of cases minus 1 (10 - 1 = 9) to get the covariance of X and Y (Covxy), which equals 11.4.
The correlation r is a standardized version of the covariance, which for these data equals .79.
Covariance (Example of How to Compute) Vogt, 68

Case     X     X - X̄     Y     Y - Ȳ     (X - X̄)(Y - Ȳ)
01      18       5      16       4             20
02       9      -4      10      -2              8
03      12      -1      11      -1              1
04      17       4      14       2              8
05      13       0      13       1              0
06       8      -5      13       1             -5
07      17       4      16       4             16
08      14       1      11      -1             -1
09      16       3      12       0              0
10       6      -7       4      -8             56
Total  130       0     120       0            103
Mean    13             12            Covariance = 103/9 = 11.4
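The computation in the table can be reproduced in a few lines. The sketch below (using only Python's standard library, with the table's ten score pairs hard-coded) recovers the same covariance and the standardized correlation r:

```python
# Recompute the worked covariance example above.
from math import sqrt

X = [18, 9, 12, 17, 13, 8, 17, 14, 16, 6]
Y = [16, 10, 11, 14, 13, 13, 16, 11, 12, 4]

n = len(X)
mean_x = sum(X) / n          # 13.0
mean_y = sum(Y) / n          # 12.0

# Sum of the cross products of deviation scores, divided by n - 1
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)

# r is the covariance standardized by the two standard deviations
sd_x = sqrt(sum((x - mean_x) ** 2 for x in X) / (n - 1))
sd_y = sqrt(sum((y - mean_y) ** 2 for y in Y) / (n - 1))
r = cov_xy / (sd_x * sd_y)

print(round(cov_xy, 1))  # 11.4
print(round(r, 2))       # 0.79
```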
10) Population: the group to which you are interested in generalizing your results. Parameters must be specific.
11) Sample: the selected subset of your population (ideally, the sample mean approximates the population
mean). For instance, your population may be all college students, but your sample might only include
those at AIU, UCSD, and SDSU due to sampling difficulties!
12) Internal Validity: The extent to which the intervention or manipulation of the IV can be considered to
account for the results, changes or group difference, rather than extraneous influences. The extent to
which an experiment rules out alternative explanations.
Vogt, 156: The extent to which the results of a study (usually an experiment) can be attributed to the treatments
rather than to flaws in the research design. In other words, internal validity is the degree to which
one can draw valid conclusions about the causal effects of one variable on another. It depends on
the extent to which extraneous variables have been controlled by the researcher.
13) External Validity: The extent to which the results can be generalized beyond the conditions of the
research to other populations, settings or conditions. Also known as generalizability.
14) Interaction Effect: (Vogt, 154) The joint effect of two or more independent variables on a dependent
variable. Interaction effects occur when independent variables not only have separate effects, but also
have combined effects that are different from the simple sum of their separate effects. In other terms,
interaction effects occur when the relation between two variables differs depending on the value of
another variable. The presence of statistically significant interaction effects makes it difficult to
interpret main effects. Also called conditioning effect, contingency effect, joint effect, and
moderating effect.
When two variables interact, this is called a first-order interaction; when three interact, it is a
second-order interaction, and so on. Interaction effects may be ordinal or disordinal.
E.g., Suppose a cholesterol reduction clinic had two diets and one exercise plan. Exercise alone was
effective and dieting alone was effective. For patients who did not exercise, the two diets worked

about equally well. Those who went on Diet A and exercised got the benefits of both. But those who
combined exercise with Diet B got a bonus, an interaction effect. All patients could benefit by dieting
and exercising, but those who followed Diet B and exercised benefited more.
Another instance of an interaction effect is when the treatments or independent variables interact with
attributes of the people being studied, such as their age or sex; these interactions are often referred to
as moderating effects.
15) Main Effect: (Vogt, 183) The simple effect of an independent variable on a dependent variable; the
effect of an independent variable uninfluenced by (without controlling for the effects of) other
variables. Used in contrast with the interaction effect of two or more independent variables on a
dependent variable. It is difficult to interpret main effects in the presence of interaction effects.
B) Hypotheses and hypothesis testing
Vogt, 146: Hypothesis. A tentative answer to a research question; a statement of (or conjecture about) the
relationships among the variables that a researcher intends to study. Hypotheses are sometimes testable
statements of relations. In such cases, they are usually thought of as predictions which, if confirmed, will
support a theory.
E.g., suppose a social psychologist theorized that racial prejudice is due to ignorance. Hypotheses for
testing the theory might be as follows: If (1) education reduces ignorance, then (2) the more highly
educated people are, the less likely they are to be prejudiced. If an attitude survey showed that there was
indeed an inverse relation between education and prejudice levels, this would tend to confirm the theory
that prejudice is a function of ignorance.

Alternative hypothesis (Vogt, 7) In hypothesis testing any hypothesis that does not conform to the
one being tested, usually the opposite of the null hypothesis. Also called the research hypothesis.
Rejecting the null hypothesis shows that the alternative (or research) hypothesis may be true.
E.g., a researcher conducting a study of the relation between teenage drug use and teenage suicide
would probably use a null hypothesis something like: There is no difference between the suicide
rates of teenagers who use drugs and those who do not. The alternative hypothesis might be:
Drug use by teenagers increases their likelihood of committing suicide. Finding evidence that
allowed the rejection of the null hypothesis (that there was no difference) would increase the
researcher's confidence in the probability that the alternative hypothesis was true.

Null hypothesis (H0) (Vogt, 216) An hypothesis that a researcher usually hopes to reject, thereby
substantiating its opposite. Often the hypothesis that two or more variables are not related or that
two or more statistics (e.g., means for two different groups) are not the same. The null does not
necessarily refer to zero or no difference (although it usually does); rather it refers to the hypothesis
to be nullified or rejected. In accumulating evidence that the null hypothesis is false, the researcher
indirectly demonstrates that the variables are related or that the statistics are different. The null
hypothesis is the core idea in hypothesis testing.
The null hypothesis is something like the presumption of innocence in a trial; to find someone
guilty, the jury has to reject the presumption of innocence. To continue with the analogy, they have
to reject it beyond a reasonable doubt. The reasonable doubt in hypothesis tests is the alpha level.

Research question (Vogt, 276) the problem to be investigated in a study stated in the form of a
question. It is crucial for focusing the investigation at all stages, from the gathering through the
analysis of evidence. A research question is usually more exploratory than a research hypothesis or
a null hypothesis.
E.g., a research question might be: What is the relation between A and B? A parallel research
hypothesis could be: Increases in A lower the incidence of B. The associated null hypothesis might
be: There is no difference in the mean levels of B among subjects who have different levels of A.


Hypothesis Testing: (Vogt, 146) The classical approach to assessing the statistical significance of
findings. Basically it involves comparing empirically observed sample findings with theoretically
expected findings (expected if the null hypothesis is true). This comparison allows one to compute the
probability that the observed outcomes could have been due to chance or random error.
E.g., suppose you wanted to study the effects on performance of working in groups as compared to
working alone. You get 80 students to volunteer for your study. You assign them randomly into two
categories: those who work in teams of four students and those who would work individually. You
provide subjects with a large number of math problems to solve and record the number of answers they
got right in 20 minutes. Your alternative or research hypothesis might be that people who work in teams
are more efficient than those who work individually. To examine the research hypothesis, you would
try to find evidence that would allow you to reject your null hypothesis- which would probably be
something like: There is no difference between the average score of students who work individually and
those who work in teams.
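As a sketch of how the null hypothesis in the example above might actually be tested, the code below computes a pooled-variance two-sample t statistic. The scores are invented for illustration (the packet does not supply data), and only the standard library is used:

```python
# Hypothetical scores for the teams-vs.-individuals example above.
from math import sqrt
from statistics import mean, stdev

teams = [14, 17, 15, 16, 18, 15, 17, 16]        # hypothetical scores
individuals = [12, 14, 13, 15, 13, 14, 12, 13]  # hypothetical scores

n1, n2 = len(teams), len(individuals)
# Pooled variance combines the two within-group (error) variances
sp2 = ((n1 - 1) * stdev(teams) ** 2 + (n2 - 1) * stdev(individuals) ** 2) / (n1 + n2 - 2)
# t compares the observed mean difference to the difference expected under the null (zero)
t = (mean(teams) - mean(individuals)) / sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 2))  # 4.66
```

A t this large would let you reject the null hypothesis of no difference at conventional alpha levels.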
POSSIBLE OUTCOMES OF A HYPOTHESIS TEST

Reality: Null Hypothesis is True (2+ variables are not related)
 Decision: retain the null -> Correct retention (say there is no relationship because there is none)
 Decision: reject the null -> Type I (alpha) error, wrong rejection (say there is a relationship and there is not)

Reality: Alternative Hypothesis is True (2+ variables are related)
 Decision: retain the null -> Type II (beta) error, wrong retention (say they aren't related when they are)
 Decision: reject the null -> Correct rejection (say there is a relationship and there is one)

Severino: Hypotheses are attempts to explain, predict, and explore specific relations. When formulated,
they represent if-then statements about a particular phenomenon.
If = the I.V., which is manipulated in some way
Then = the D.V., or resultant data
E.g., if depressed patients undergo cognitive therapy, then they will be less depressed.
The statistical hypothesis that we test is the null hypothesis, which states that there are no differences
(H0: μ1 = μ2 = μ3). We set out to reject the null and accept the alternative hypothesis (which states that
there are differences).
Scientific/empirical research is an iterative process that requires researchers to continually collect and
analyze data. Because we cannot prove models, we merely have to continue gaining support for them, or
rejecting them, through multiple studies.
Event A might be considered fairly likely, given that the assumed model is correct. As a result, the model
would be retained, along with the null hypothesis. Event B, on the other hand, is unlikely, given the
model. Here the model would be rejected, along with the null hypothesis.

Alpha/Type I Error: (Vogt, 330) An error made by wrongly rejecting a true null hypothesis. This
might involve incorrectly concluding that two variables are related when they are not, or wrongly
deciding that a sample statistic exceeds the value that would be expected by chance. Also called
alpha error or false positive.

Beta/Type II Error: (Vogt, 330) An error made by wrongly retaining (or accepting or failing to reject)
a false null hypothesis. Also called beta error or false negative.

Type I and Type II errors are inversely related; the smaller the risk of one, the greater the risk of the
other. The probability of making a Type I error can be precisely computed in advance for a particular
investigation, but the exact probability of Type II error is generally unknown.
(See Section II, F, 1 for more on Type 1 and Type 2 Errors)
C) Variables
Vogt, 336: Loosely, anything studied by a researcher. Anything that can change, that can vary, that can be
expressed as more than one value or in various values or categories. The opposite of a variable is a
constant.
Examples of variables include anything that can be measured or assigned a number, such as
unemployment rate, religious affiliation, experimental treatment, GPA, etc. Much of social science is
aimed at discovering and explaining how differences in some variables are related to differences in others.
a) Independent Variable: (Vogt, 151) The presumed cause in a study. Also a variable that can be used
to predict or explain the values of another variable. A variable manipulated by an experimenter who
predicts that the manipulation will have an effect on another variable (the dependent variable).
Some authors use the term independent variable for experimental research only. For these authors,
the key criterion is whether the researcher can manipulate the variable; for nonexperimental research,
these authors use the term predictor variable or explanatory variable. However, most writers use
independent variable when they mean any causal variable, whether in experimental or
nonexperimental research.
(Vogel) Classifies the common bond between groups we are comparing (environmental/situational,
static, instructional), the construct, experimental manipulation, intervention, or factor whose impact
will be evaluated in the study
Examples of IV: (Severino)
 Environmental/Situational: varying what is done to, with, or by the subject. E.g., a task is
provided to some but not to others. (Is CBT more effective than Mindfulness Meditation in
treating depression? Therapy is the IV here, and it has 2 levels: meditation and CBT.)
 Instructional: variations in what participants are told or led to believe through verbal or written
statements, usually aimed at altering the perception of the situation. (Do therapists interpret
psychological test results differently when they are told that the test responses were produced by
disturbed patients versus people who are functioning well?)
 Organismic (Static): an independent variable that cannot be manipulated, so subjects cannot be
assigned randomly to these conditions. For instance, gender could be an IV; however, you cannot
randomly assign a person to be male or female!! Other static variables that may be of interest as
IVs could be year level in grad school or generation/age. (Do men score higher on a measure of
physical aggression than do women?)
b) Control Variable: (Vogt, 62) An extraneous variable that you do not wish to examine in your study;
hence you control for it. Also called covariate. (Vogel) extraneous variable held constant because it
could offer an alternative explanation for the results (more in threats to internal validity). Examples
include age, gender, time of day, environment.
c) Dependent Variable: (Vogt, 86) The presumed effect in a study; so called because it depends on
another variable. The variable whose values are predicted by the independent variable, whether or not
they are caused by it. Also called outcome, criterion, and response variable.
E.g., in a study to see if there were a relationship between students' drinking of alcoholic beverages
and their GPA, the drinking behavior would probably be the presumed cause (independent variable);
the GPA would be the effect (dependent variable). But it could be the other way around if, for
instance, one wanted to study whether students' grades drive them to drink.


d) Categorical: (Vogt, 39) A variable that distinguishes among subjects by sorting them into a limited
number of categories, indicating type or kind, as religion can be categorized: Buddhist, Christian,
Jewish, Muslim, Other, None. Breaking a continuous variable, such as age, to make it categorical is a
common practice, but since this involves discarding information it is usually not a good idea. Also
called discrete or nominal variable.
e) Continuous: (Vogt, 61) A variable that can be expressed by a large (often infinite) number of
measures. Loosely, a variable that can be measured on an interval or ratio scale. While all continuous
variables are interval or ratio, all interval or ratio scales are not continuous, in the strict sense of the
term.
Deciding whether to treat data as continuous can have important consequences for choosing statistical
techniques. Ordinal data are often treated as continuous when there are many ranks in the data, but as
categorical when there are few.
E.g., height and GPA are continuous variables. People's heights could be 69.38 inches, 69.39 inches,
and so on; GPAs could be 3.17, 3.18, and so on. In fact, since values always have to be rounded,
theoretically continuous variables are measured as discrete variables. There is an infinite number of
values between 69.38 and 69.39 inches, but the limits of our ability to measure or the limits of our
interest in precision lead us to round off continuous values.
GPA is a good example of the difficulty of making these distinctions. It is routinely treated as a
continuous variable, but it is constructed out of a rank order scale (A, B, C, etc.). Numbers are
assigned to those ranks, which are then treated as though they were an interval scale.
f) Random: (Vogt, 262) A variable that varies in ways the researcher does not control; a variable whose
values are randomly determined. Random refers to the way the events, values, or subjects are
chosen or occur, not to the variable itself. Men and women are not random, but sex could be a random
variable in a research study; the sex of subjects included in the study could be left to chance and not
controlled by the researcher. Also called stochastic variable.
INDEPENDENT VARIABLE IS MANIPULATED!!!
DEPENDENT VARIABLE IS OBSERVED!!!
CONTROL VARIABLES/COVARIATES ARE HELD CONSTANT!!!
D) Controls and their application
TYPES OF CONTROL (Controlling sources of error) (Outline from Severino)
1) Statistical Methods of Control
ANCOVA: (Vogt, 8) Analysis of Covariance- An extension of ANOVA that provides a way of
statistically controlling the (linear) effects of variables one does not want to examine in a study. These
extraneous variables are called covariates, or control variables. ANCOVA allows you to remove
covariates from the list of possible explanations of variance in the dependent variable. ANCOVA
does this by using statistical techniques (such as regression) to partial out the effects of covariates
rather than using direct experimental methods to control extraneous variables.
ANCOVA is used in experimental studies when researchers want to remove the effects of some
antecedent variable. E.g., pretest scores are used as covariates in pre-/posttest experimental designs.
ANCOVA is also used in nonexperimental research, such as surveys of nonrandom samples, or in
quasi-experiments when subjects cannot be assigned randomly to control and experimental groups.
Although fairly widespread, the use of ANCOVA for nonexperimental research is controversial. All
ANCOVA problems can be handled with multiple regression analysis using dummy coding for the
nominal variables, and, with the advent of powerful computers, this is a more efficient approach.
Because of this, ANCOVA is used less frequently than in the past.


2) Holding Constant
(Vogt, 144) To subtract the effects of a variable from a complex relationship so as to study what the
relationship would be if the variable were in fact a constant. Holding a variable constant essentially
means assigning it an average value.
E.g., in a study of managerial behaviors and their effects on workers productivity, a researcher might
want to hold the education of the managers constant. This would especially be the case if she had
reason to believe that different kinds or amounts of education might lead managers to behave
differently.
3) Matching
(Vogt, 186) A research design in which subjects are matched on characteristics that might affect their
reaction to a treatment. After the pairs are determined, one member of each pair is assigned at random
to the group receiving treatment (experimental group); the other group (control group) does not
receive treatment. Without random assignment, matching is not considered good research practice.
Also called subject matching.
E.g., if professors wanted to test the effectiveness of two different textbooks for an undergraduate
statistics course, they might match the students on quantitative aptitude scores before assigning them
to classes using one or another of the texts. An alternative, if the professors had no control over class
assignment, would be to treat the quantitative aptitude scores as a covariate and control for it using an
ANCOVA design.
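The matching procedure described above (pair subjects on the relevant characteristic, then randomly assign within each pair) can be sketched as follows. All subject IDs and aptitude scores are invented for illustration:

```python
# Sketch of subject matching with random assignment within pairs.
import random

random.seed(7)
subjects = [(f"S{i}", random.randint(40, 100)) for i in range(20)]  # (id, aptitude score)
ranked = sorted(subjects, key=lambda s: s[1])  # order by aptitude so neighbors match

treatment, control = [], []
for i in range(0, len(ranked), 2):
    pair = list(ranked[i:i + 2])   # two subjects with similar aptitude
    random.shuffle(pair)           # random assignment within each matched pair
    treatment.append(pair[0])
    control.append(pair[1])

print(len(treatment), len(control))  # 10 10
```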
4) Blocking
Block Design (Vogt, 29) - An experimental design in which subjects are grouped into categories or
blocks. These blocks may then be treated as the experiments unit of analysis. The goal of
categorizing subjects into blocks is to control for a covariate. See matched pairs and randomized
block design.
Randomized-Blocks Design (Vogt, 261) - A research design in which subjects are matched on a
variable the researcher wishes to control. The subjects are put into groups (blocks) of the same size as
the number of treatments. The members of each block are assigned randomly to different treatment
groups. Compare Latin square, repeated-measures ANOVA.
E.g., say we are doing a study of the effectiveness of four methods of teaching statistics. We use 80
subjects and plan to divide them into 4 treatment groups of 20 students each. Using a randomized-blocks
design, we give the subjects a test of their knowledge of statistics. The four who score highest
on the test are the first block; the next highest four are the second block, and so on to the 20th block.
The four members of each block are randomly assigned, one to each of the four treatment groups. We
use the blocks to equalize the variance within each treatment group by making sure that each group
has subjects with a similar prior knowledge of statistics.
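The 80-subject, 20-block example above can be sketched in code. The pretest scores are randomly generated stand-ins, and the subject IDs are hypothetical:

```python
# Sketch of a randomized-blocks assignment: 80 subjects, blocks of 4,
# each block spread randomly across the 4 treatment groups.
import random

random.seed(1)
subjects = [(f"S{i:02d}", random.randint(0, 100)) for i in range(80)]  # (id, pretest score)
ranked = sorted(subjects, key=lambda s: s[1], reverse=True)            # highest scorers first

groups = {g: [] for g in range(4)}
for i in range(0, 80, 4):                    # each slice of 4 ranked subjects is one block
    block = ranked[i:i + 4]
    treatments = random.sample(range(4), 4)  # random treatment order within the block
    for subject, g in zip(block, treatments):
        groups[g].append(subject)

print([len(groups[g]) for g in range(4)])  # [20, 20, 20, 20]
```

Each treatment group ends up with one subject from every block, so prior knowledge is spread evenly across groups.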
5) Counterbalancing Technique
(Vogt, 67) In a within-subjects factorial experiment, presenting conditions (treatments) in all possible
orders to avoid order effects. See Latin square.
E.g., an experimenter might wish to study the effects of three kinds of lighting (A, B, and C) on
performance on a visual skill. Subjects could first be placed in Condition A and be given a test of the
skill; then they could be put in Condition B and get a second test, and so on. By Condition C and the
third test, subjects scores might go up simply because they had the practice of the first two tests. Or
their scores might go down because they become fatigued.
The effects of practice and fatigue could be counterbalanced by rotating the lighting conditions so that
subjects would experience them in all possible orders. Since there are six possible orders (ABC,
ACB, BAC, BCA, CAB, and CBA), subjects could be divided into six groups, one for each possible
order.
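The six orders listed above are simply all permutations of the three conditions, which can be generated directly:

```python
# Generate every possible presentation order of conditions A, B, and C.
from itertools import permutations

orders = ["".join(p) for p in permutations("ABC")]
print(orders)  # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']
```

With four conditions there would be 24 orders, which is why full counterbalancing is often replaced by a Latin square as the number of conditions grows.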


6) Double Blind Technique


(Vogt, 97) A means of reducing bias in an experiment by ensuring that both those who administer a
treatment and those who receive it do not know (are blind to) which subjects are in the
experimental and control groups, that is, who is and who is not receiving the treatment.
E.g., in a study of the effectiveness of various headache remedies, 80 headache sufferers could be
randomly assigned to four groups. Group A would receive aspirin, Group B ibuprofen, Group C
acetaminophen, and Group D a placebo. The pills might be color-coded, but otherwise look the same
so that the experimenter handing them out would not know which subjects were getting which; and,
of course, the subjects would not know. When subjects experienced pain, they would be given pills
depending upon their group and then asked about the relief they got from their pills. Their responses
would be data used to evaluate the effectiveness of the various remedies. If the experiment used true
double-blind procedures, the researchers analyzing the data would not know, until after they had
reached their conclusions, which group received which remedy; they would only know, say, that on
average blue pills worked better than red ones.
7) Control groups
(Vogt, 62) In experimental research, a group that, for the sake of comparison, does not receive the
treatment the experimenter is interested in studying. Compare experimental group.
E.g., psychologists studying the effects of TV violence on attitudes might give subjects a
questionnaire to measure their attitudes, divide the group into two, show a videotape of a violent
program to one half (the experimental group), and show a nonviolent program to the other half (the
control group). A second attitude questionnaire would then be given to the two groups to see whether
the programs affected their scores.
a) (Severino) Control groups usually employed to address threats to internal validity (i.e., history,
maturation, selection, testing, etc.).
b) You want the control group to share these influences with the experimental group, but not
receive the treatment.
i) No-Treatment Control Group:
 A group that is assessed but does not receive treatment or intervention.
 Use of a no-tx control directly controls for the effects of history and maturation.
 Assesses the base rate of improvement for clients who do not receive the treatment under
investigation.
 Ethical considerations include problems associated with withholding tx.
Group          Test    Treatment    Test
Experimental    X          X          X
Control         X                     X

ii) Wait-List Control
 Like a no-tx control, but treatment is only withheld temporarily.
 The period of time that tx is withheld usually corresponds to the pre- to post-test
assessment interval. As soon as the second assessment battery is administered, the
wait-list subjects receive their tx.
 Because subjects in wait-list controls receive treatment after the post-test period, long-term
follow-up is not possible since the control is no longer available for comparisons.
Group          Test    Treatment    Test    Treatment
Experimental    X          X          X
Control         X                     X     X (receives tx later)


iii) Attention-Placebo Control Group
 Designed to control for common factors that are associated with participation in treatment
(e.g., Hawthorne effects).
 The goal is to provide a pseudo intervention that involves participants in some sort of
experience, thus controlling for the possibility that factors common to coming to treatment
(i.e., attending sessions, meeting with a professional, etc.) account for the change.
 (Vogel) Both the experimental and control group have equivalent contact with research
personnel (pseudo intervention) to control for placebo effects.
Group          Test    Treatment       Test
Experimental    X      Experimental     X
Control         X      Pseudo           X

iv) Yoked-Control Group:
 Used in studies in which differences in procedures may arise as a function of
implementing a particular intervention.
 Purpose is to ensure that groups are equal with respect to potentially important but
conceptually and procedurally irrelevant factors.
 Pairs are formed arbitrarily (unless matching was used to assign to groups). A subject in
the experimental group receives a certain number of sessions on the basis of his/her
progress. The yoked-control participant receives the same number of sessions.
 (Vogel) Used when the procedure changes for each subject based on their performance;
pairs are arbitrarily formed so that the subject in the experimental group and the
yoked-control subject receive the same number of sessions/trials/etc.
Experimental Group    Sessions    Control Group    Sessions
Subject 1                5        Subject 2           5
Subject 3                7        Subject 4           7
Subject 5                3        Subject 6           3

v) Patched-Up Control Group
 Groups that are added to an experiment that utilize subjects who were not part of the
original subject pool and were not randomly assigned to treatment.
E) Samples: Types and methods of selection and assignment

Sampling Distribution (of a Statistic): (Vogt, 284) A theoretical frequency distribution of the scores for
or values of a statistic, such as a mean. Any statistic that can be computed for a sample has a sampling
distribution. A sampling distribution is the distribution of statistics that would be produced in repeated
random sampling (with replacement) from the same population. It is composed of all possible values of a
statistic and their probabilities of occurring for a sample of a particular size.
A sampling distribution is constructed by assuming that an infinite number of samples of a given size
have been drawn from a particular population and that their distributions have been recorded. Then the
statistic, such as the mean, is computed for the scores of each of these hypothetical samples; then this
infinite number of statistics is arranged in a distribution to arrive at the sampling distribution. The
sampling distribution is compared with the actual sample statistic to determine if that statistic is or is not
likely to be the size it is due to chance.
It is hard to overestimate the importance of sampling distributions of statistics. The entire process of
inferential statistics (by which we move from known information about samples to inferences about
populations) depends on sampling distributions. We use sampling distributions to calculate the probability
that sample statistics could have occurred by chance, and thus to decide whether something that is true of
a sample statistic is also likely to be true of a population parameter.
(Vogel) The sampling distribution is a distribution used as a model of what would happen if:
 The null hypothesis were true (there really were no effects), and
 The experiment were repeated an infinite number of times.

Sampling distributions are created using Monte Carlo experiments, whereby a large number of equal-sized
random samples are drawn from a population you wish to represent. For each sample, the statistic is
computed, and the statistics are arranged in a frequency distribution so you can see the curve for that
population. Drawing these samples over and over again allows you to finally approximate the population
sampling distribution.
(Vogt, 196) Any generating of random values (most often with a computer) in order to study statistical
models. Monte Carlo methods involve producing several sets of artificial data and using these to study the
properties of estimators. E.g., statisticians who develop a new theory want to test it on data. They could
collect real data, but it is much more cost-efficient, initially at least, to test the theory on sets of data
(often hundreds of sets of data) generated by a Monte Carlo computer program.
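The Monte Carlo construction of a sampling distribution described above can be sketched in a few lines. The population parameters (mean 100, SD 15) and sample size are arbitrary illustrative choices:

```python
# Monte Carlo sketch of the sampling distribution of the mean: draw many
# equal-sized random samples from one population and record each sample's mean.
import random
from statistics import mean, stdev

random.seed(42)
population = [random.gauss(100, 15) for _ in range(100_000)]

sample_means = [mean(random.sample(population, 25)) for _ in range(2_000)]

# The mean of the sample means sits near the population mean (100), and their
# SD (the standard error) sits near 15 / sqrt(25) = 3.
print(round(mean(sample_means), 1))
print(round(stdev(sample_means), 1))
```

Comparing an actual sample statistic against this simulated distribution shows how likely a value that size would be by chance alone.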
Sampling Error
(Retrieved on June 1, 2007 from http://npts.ornl.gov/npts/1995/courseware/Useable_Nav3_18_27.html)
Whenever a sample is drawn, by definition, only that part of the population that is included in the sample
is measured, and is used to represent the entire population. Hence, there must always be some error in the
data, resulting from those members of the population who were not measured. Error will, therefore, be
reduced as the sample size is increased, so that, if a census is performed (a 100 percent sample is a
census), by definition there will be no sampling error.
If a population from which a sample is drawn is large, then population values will not be affected very
much by one or two members of the population who have extreme values of a particular measure. For
example, the population of the U.S. is about 100 million households; most households have between 1
and 6 members; if 100,000 households have 10 members, and the remaining 99.9 million average 2.300
persons per household, the average for all households will change from 2.300 to 2.308. As a result of this,
samples that are quite small in numeric size and very small as a percentage of the population will often
have very small sampling errors.

Sampling error of a mean value estimated from a sample is equal to the estimated standard deviation of
the variable divided by the square root of the sample size. It is therefore not dependent on the population
size, but only on the variability of the variable of concern and the sample size:

    Sampling Error = SD / √n

For example:

If one drew a sample of four observations from a large population, the sampling error would be
equal to the standard deviation divided by 2 (the square root of four).
To halve the sampling error of that variable, one would need to increase the sample to 16; it could
be halved again by increasing the sample size to 64; and halved again by increasing the sample to
256.
If a sample of 1,024 were selected, the sampling error would be 1/32 of the standard deviation;
because standard deviations on many variables are fairly small values, this represents a very small
error.
It also follows that increasingly large increases in sample size are necessary to continue to decrease the
sampling error - to halve the error again to 1/64th of the standard deviation would require an increase of
the sample size to 4,096, while halving again would require an increase to 16,384.
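The halving pattern in these examples follows directly from the formula; a quick, illustrative Python check:

```python
import math

def sampling_error(sd, n):
    """Sampling error (standard error) of a mean: the standard deviation
    divided by the square root of the sample size."""
    return sd / math.sqrt(n)

# Working in standard-deviation units (sd = 1), quadrupling the sample
# size halves the sampling error each time: n = 4 gives SD/2, n = 16
# gives SD/4, ... , n = 1024 gives SD/32.
errors = {n: sampling_error(1.0, n) for n in (4, 16, 64, 256, 1024)}
```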
(Outline from Severino)
1) Random Selection (Vogt, 262) Another term for random sampling. Selection is more often used
in experimental research; sampling is the more common term in survey research, but the underlying
principle is the same. See random assignment, probability sample.
Random Sampling (Vogt, 262) Selecting a group of subjects (a sample) for study from a larger
group (population) so that each individual (or other unit of analysis) is chosen entirely by chance.
When used without qualification (such as stratified random sampling), random sampling means
simple random sampling. Also sometimes called equal probability sample, since every member
of the population has an equal probability of being included in the sample. A random sample is not
the same thing as a haphazard or accidental sample. Using random sampling reduces the likelihood of
bias. Compare probability sample, cluster sample, quota sample, stratified random sample.

2) Sampling: (Vogt, 284) Selecting elements (subjects or other units of analysis) from a population in
such a way that they are representative of the population. This is done to increase the likelihood of
being able to generalize accurately about the population. Sampling is often a more accurate and
efficient way to learn about a large population than a census of the whole population.
a) Convenience Sample (Vogt, 62) A sample of subjects selected for a study not because they are
representative but because it is convenient to use them- as when college professors study their
own students. Compare accidental sample, bias.
Oddly, using this term sometimes tends to legitimize bad practice. Students of mine occasionally
say, "I used a convenience sample to gather the data for my project," as though this hard-to-justify
method were a reasonable option among the many types of samples such as random, systematic,
stratified, and so on.
b) Snowball Sample (Vogt, 300) A technique for finding research subjects. One subject gives the
researcher the name of another subject, who in turn provides the name of a third, and so on. This
is an especially useful technique when the researcher wants to contact people with unusual
experiences or characteristics who are likely to know one another- members of a small religious
group, for example. Also called networking sample or word-of-mouth sampling.
c) Stratified Random Sampling (Vogt, 312) Also known as stratified sampling. A random or
probability sample drawn from particular categories (or strata) of the population being studied.
The method works best when the individuals within the strata are highly similar to one another
and different from individuals in other strata. Indeed, if the strata were not different from one
another, there would be no point in stratifying. The strata have a function similar to blocks in
randomized blocks designs. Stratified random sampling can be proportionate, so that the size of
the strata corresponds to the size of the groups in the population. It can also be disproportionate,
as in the following example.
Suppose you wanted to compare the attitudes of Protestants, Catholics, Muslims, and Jews in a
population in which those four groups were not present in equal numbers. If you drew a simple
random sample, you might not get enough cases from one of the groups to make meaningful
comparisons. To avoid this problem you could select random samples of equal size within each of
the four religious groups (strata).
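That disproportionate strategy can be sketched as follows (the function, group sizes, and data are hypothetical):

```python
import random

def stratified_sample(population, stratum_of, per_stratum, seed=0):
    """Disproportionate stratified random sampling: draw a simple random
    sample of equal size within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical population: four religious groups of very unequal size.
population = ([("Protestant", i) for i in range(500)]
              + [("Catholic", i) for i in range(300)]
              + [("Muslim", i) for i in range(120)]
              + [("Jewish", i) for i in range(80)])

# Equal-sized random samples within each stratum, despite unequal strata.
sample = stratified_sample(population, stratum_of=lambda u: u[0], per_stratum=50)
```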
d) Proportional Stratified Random Sampling (Vogt, 251) A stratified random sample in which the
proportion of the subjects in each category (stratum) is the same as in the population. Compare
quota sample.
e) Cluster Sampling (Vogt, 46) A method for drawing a sample from a population in two or more
stages. It is typically used when researchers cannot get a complete list of the members of a
population they wish to study, but can get a complete list of groups or clusters in the population.
It is also used when a random sample would produce a list of subjects so widely scattered that
surveying them would be prohibitively expensive. Generally, the researcher wishes to use clusters
containing subjects as diverse as possible. By contrast, in stratified sampling the goal is often to
find strata containing subjects as similar to one another as possible.
The disadvantage of cluster sampling is that each stage of the process increases sampling error.
The margin of error is therefore larger in cluster sampling than in simple or stratified random
sampling; but, since cluster sampling is usually much easier (cheaper), this error can be
compensated for by increasing the sample size. See central limit theorem.
E.g., suppose you wanted to survey undergraduates on social and political issues. There is no
complete list of all college students. But there are complete lists of all 3,000+ colleges in the
country. You could begin by getting such a list of colleges (which are clusters of students).
You could then select a probability sample of, say, 100 colleges. Once the clusters (colleges)
were identified, the researchers could go to each school and get a list of its students; students to
be surveyed would be selected (perhaps by simple random sampling) from each of these lists.
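The two-stage procedure in this example can be sketched as follows (college names, list sizes, and the function are hypothetical):

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=0):
    """Stage 1: randomly select whole clusters; stage 2: draw a simple
    random sample of units within each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical frame: 30 colleges (clusters), each with a student list.
colleges = {f"College {i:02d}": [f"student_{i}_{j}" for j in range(200)]
            for i in range(30)}

# Select 5 colleges, then 20 students within each selected college.
sample = two_stage_cluster_sample(colleges, n_clusters=5, n_per_cluster=20)
```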
f) Probability Sample (Vogt, 248) A sample in which each case that could be chosen has a known
probability of being included in the sample. Often a random sample, which is an equal probability
sample. At some point, random selection is part of the process of every probability sample.
g) Quota Sample (Vogt, 62) A stratified nonrandom sample, that is, a sample selected by dividing a
population into categories and selecting a certain number (a quota) of respondents from each
category. Individual cases within each category are not selected randomly; they are usually
chosen on the basis of convenience. Compare accidental sampling, purposive sample, random
sampling, stratified random sampling, proportional stratified random sampling.
E.g., interviewers might be given the following assignment: Go out and interview 20 men and 20
women, with half of each 50+ years old. Despite its superficial resemblance to stratified random
sampling, quota sampling is not a reliable method to use for making inferences about a
population.
h) Accidental sampling (Vogt, 3) A sample gathered haphazardly, e.g., by interviewing the first 100
people you ran into on the street who were willing to talk to you. An accidental sample is not a
random sample. The main disadvantage of an accidental sample is that the researcher has no way
of knowing what the population might be. See convenience sample, probability sample.
i) Purposive sample (Vogt, 252) A sample composed of subjects selected deliberately (on purpose)
by researchers, usually because they think certain characteristics are typical or representative of
the population. Compare quota sample.
This is generally an unwise procedure; it assumes that the researcher knows in advance what the
relevant characteristics are; it runs the risk (because it is not random) of introducing unknown
bias. Inferences about a population cannot legitimately be made using a purposive sample. On the
other hand, purposive sampling is often the only way to try to increase representativeness in field
research, and it can be an improvement over simple convenience sampling. A frequent
compromise between random sampling and purposive sampling is stratified random sampling.

3) Random Assignment (Vogt, 260) Putting subjects into experimental and control groups in such a
way that each individual in each group is assigned entirely by chance. Otherwise put, each subject has
an equal probability of being placed in each group. Using random assignment reduces the likelihood
of bias. Also called random allocation. Goal is group equivalence, but random assignment does not
necessarily produce equivalent groups.
4) Matching (See also I, D, 3)
(Vogel) This is used when a subject variable is known to be related to scores on the DV. It is essential
to distribute these characteristics across groups so that the groups do not differ prior to treatment. First,
pair subjects based on their similarity on the variable in question. Then randomly assign one person
from each pair to the experimental group and the other to the control group.

(Severino) Match subjects with identical pretreatment scores. When two are matched perfectly,
randomly assign each to either the experimental or control group through a coin toss.
If there are three groups in the experiment, rank order subjects from high to low; the three subjects
with the highest scores form the first block and are assigned randomly to the conditions, and so on.
If you want to make sure that groups are equivalent on a categorical variable, keep separate lists
for each category (i.e., male/female) and assign randomly from each list.
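The coin-toss procedure described above can be sketched for the two-group case (subject names and pretest scores are hypothetical):

```python
import random

def matched_random_assignment(subjects, score, seed=0):
    """Rank subjects on a pretreatment score, block them into consecutive
    pairs, and assign one member of each pair to each group at random.
    Assumes an even number of subjects and two groups."""
    rng = random.Random(seed)
    ranked = sorted(subjects, key=score, reverse=True)
    experimental, control = [], []
    for i in range(0, len(ranked), 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # the "coin toss"
        experimental.append(pair[0])
        control.append(pair[1])
    return experimental, control

# Hypothetical subjects with pretest scores related to the DV.
subjects = [("s1", 12), ("s2", 47), ("s3", 30), ("s4", 31),
            ("s5", 45), ("s6", 11), ("s7", 29), ("s8", 46)]
experimental, control = matched_random_assignment(subjects, score=lambda s: s[1])
```

Because each pair is split across the two groups, the groups start out matched on the pretest variable.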

Central Limit Theorem: (Vogt, 41) A statistical proposition to the effect that the larger a sample size,
the more closely a sampling distribution of a statistic will approach a normal distribution. This is true
even if the population from which the sample is drawn is not normally distributed. A sample size of 30 or
more will usually result in a sampling distribution for the mean that is very close to a normal distribution.
The central limit theorem explains why sampling error is smaller with a large sample than with a small
sample and why we can use the normal distribution to study a wide variety of statistical problems.
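This claim can be checked empirically: even for a markedly non-normal population, the spread of sample means shrinks in proportion to 1/√n. A Python sketch (population and function are hypothetical):

```python
import random
import statistics

def empirical_sampling_error(population, n, reps, seed=0):
    """Standard deviation of sample means for samples of size n:
    an empirical estimate of the sampling error of the mean."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

# A decidedly non-normal (right-skewed) population.
pop_rng = random.Random(1)
population = [pop_rng.expovariate(1.0) for _ in range(20_000)]

# Compare the observed spread of sample means with the SD/sqrt(n) prediction.
sd = statistics.stdev(population)
results = {n: (empirical_sampling_error(population, n, reps=3_000),
               sd / n ** 0.5)
           for n in (4, 36)}
```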
F) Criteria and criterion measures
Criterion is another term for dependent variable. Criterion measures are used in correlational research,
where it is not possible to establish a causal relationship; the criterion is the outcome of a study.
G) Threats to internal, external, construct, and statistical conclusion validity
Validity (Vogt, 335) A term to describe a measurement instrument or test that accurately measures what
it is supposed to measure; the extent to which a measure is free of systematic error. Validity also refers to
designs that help researchers gather data appropriate for answering their questions. Validity requires
reliability, but the reverse is not true.
E.g., say we want to measure individuals' heights. If all we had was a bathroom scale, we could ask
our individuals to step on the scale and record the results. Even if the measurements were highly reliable,
that is, consistent from one weighing to the next, they would not be very valid. The weights would not be
completely useless, however, since there generally is some correlation between height and weight.
Social scientists have not found it easy to define validity. To try to explain it they have created several
subcategories, including: internal, external, content, construct, criterion-related, and face validity.
Threats to Validity (Vogt, 324) Problems that can lead to false conclusions. Refers to the characteristics
of various research methods and designs that can lead to spurious or misleading conclusions.
Discussions of threats to validity often lead researchers to recommend using more than one method (see
mixed-method research, triangulation). Because different kinds of research designs are open to different
kinds of threats, you can reduce the risk of error by using two or more methods. Researchers sometimes
use a list of threats to validity as a checklist to review before putting the finishing touches on a design.
(Shadish et al, 39-42) (39) Threats to validity are specific reasons why we can be partly or completely
wrong when we make an inference about covariance, about causation, about constructs, or about whether
the causal relationship holds over variations in persons, settings, treatments, and outcomes(40) The
likelihood that they will occur varies across contexts. Lists of validity threats are heuristic aids; they are
not etched in stone and they are not universally relevant across all research areas in the social sciences.
(40) They help experimenters to anticipate the likely criticisms of inferences from experiments that
experience has shown occur frequently, so that the experimenter can try to rule them out. The primary
method we advocate for ruling them out is to use design controls that minimize the number and
plausibility of those threats that remain by the end of the study
(40) However, many threats to validity cannot be ruled out by design controls, either because the logic of
design control does not apply (e.g., with some threats to construct validity such as inadequate construct
explication) or because practical constraints prevent available controls from being used. In these cases,
the appropriate method is to identify and explore the role and influence of the threat in the study. In doing
this three questions are critical: 1) How would the threat apply in this case? 2) Is there evidence that the
threat is plausible rather than just possible? 3) Does the threat operate in the same direction as the
observed effect, so that it could partially or totally explain the observed findings?
1) Statistical Conclusion Validity (Vogt, 309) The accuracy of conclusions about covariation (or
correlation) made on the basis of statistical evidence. More specifically, inferences about whether it is
reasonable to conclude that covariation exists- given a particular alpha level and given the variances
obtained in the study. See validity, statistical significance.
(Shadish, 42) Concerns two related statistical inferences that affect the covariation component of
causal inferences: 1) whether the presumed cause and effect covary and 2) how strongly they covary.
For the first of these inferences, we can incorrectly conclude that cause and effect covary when they
do not (a Type I error) or incorrectly conclude that they do not covary when they do (a Type II error).
For the second inference, we can overestimate or underestimate the magnitude of covariation, as well
as the degree of confidence that magnitude estimate warrants.
(Severino) Important concepts

Alpha: The probability of rejecting the null hypothesis when it is true (a Type I error).
Beta: The probability of accepting the null hypothesis when it is false (a Type II error).
Power: The probability of detecting real differences between conditions.
Effect size: A way of expressing the differences between groups, treatments, or conditions.
 The magnitude of the difference between two or more conditions expressed in standard
deviation units (the difference between the means divided by the pooled standard deviation):

    ES = (m1 − m2) / s

Useful in meta-analysis.
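The ES formula can be computed as follows (a sketch assuming a Cohen's-d-style pooled standard deviation; the scores are hypothetical):

```python
import statistics

def effect_size(group1, group2):
    """ES = (m1 - m2) / s, where s is the pooled standard deviation."""
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    n1, n2 = len(group1), len(group2)
    # Pool the two sample variances, weighted by degrees of freedom.
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

# Hypothetical outcome scores for two conditions.
treatment = [14, 16, 15, 17, 18]
control = [12, 13, 11, 14, 12]
es = effect_size(treatment, control)  # difference in SD units
```

Because ES is expressed in standard-deviation units, results from different studies can be combined, which is what makes it useful in meta-analysis.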

Threats to Statistical Conclusion Validity: Reasons Why Inferences about Covariation between
Two Variables May Be Incorrect (Shadish, 45)
a) Low Statistical Power: An insufficiently powered experiment may incorrectly conclude that the
relationship between treatment and outcome is not significant.
b) Violated Assumptions of Statistical Tests: Violations of statistical test assumptions can lead to
either overestimating or underestimating the size and significance of an effect.
c) Fishing and the Error Rate Problem: Repeated tests for significant relationships, if uncorrected
for the number of tests, can artificially inflate statistical significance.
d) Unreliability of Measures: Measurement error weakens the relationship between two variables
and strengthens or weakens the relationships among three or more variables.
e) Restriction of Range: Reduced range on a variable usually weakens the relationship between it
and another variable.
f) Unreliability of Treatment Implementation: If a treatment that is intended to be implemented in
a standardized manner is implemented only partially for some respondents, effects may be
underestimated compared with full implementation.
g) Extraneous Variance in the Experimental Setting: Some features of an experimental setting
may inflate error, making detection of an effect more difficult.
h) Heterogeneity of Units: Increased variability on the outcome variable within conditions increases
error variance, making detection of a relationship more difficult.
i) Inaccurate Effect Size Estimation: Some statistics systematically overestimate or underestimate
the size of an effect.
2) Internal Validity (Vogt, 157) The extent to which the results of a study (usually an experiment) can
be attributed to the treatments rather than to the flaws in the research design. In other words, internal
validity is the degree to which one can draw valid conclusions about the causal effects of one variable
on another. It depends on the extent to which extraneous variables have been controlled by the
researcher. Compare external validity.
(Shadish, 53) Refers to inferences about whether observed covariation between A and B reflects a
causal relationship from A to B in the form in which the variables were manipulated or measured. To
support such an inference, the researcher must show that A preceded B in time, that A covaries with
B (already covered in statistical conclusion validity) and that no other explanations for the
relationship are plausible.

Threats to Internal Validity: Reasons Why Inferences That the Relationship between
Two Variables Is Causal May Be Incorrect (Shadish, 55)

a) Ambiguous Temporal Precedence: Lack of clarity about which variable occurred first may yield
confusion about which variable is the cause and which is the effect.
b) Selection: Systematic differences over conditions in respondent characteristics that could also
cause the observed effect.
c) History: Events occurring concurrently with treatment could cause the observed effect.
d) Maturation: Naturally occurring changes over time could be confused with a treatment effect.
(E.g., growing older, stronger, wiser)
e) Regression: When units are selected for their extreme scores, they will often have less extreme
scores on other variables, an occurrence that can be confused with a treatment effect. (57) E.g.,
the people who come to psychotherapy when they are extremely distressed are likely to be less
distressed on subsequent occasions, even if psychotherapy had no effect. This phenomenon is
often called regression to the mean.
f) Attrition: a.k.a. mortality. Loss of respondents to treatment or to measurement can produce
artifactual effects if that loss is systematically correlated with conditions.
g) Testing: Exposure to a test can affect scores on subsequent exposures to that test, an occurrence
that can be confused with a treatment effect. (60) Sometimes taking a test once will influence
scores when the test is taken again. Practice, familiarity, or other forms of reactivity are the
relevant mechanisms and could be mistaken for treatment effects.
h) Instrumentation: The nature of a measure may change over time or conditions in a way that
could be confused with a treatment effect. (60) E.g., a change in a measuring instrument can
occur over time even in the absence of treatment, mimicking a treatment effect. The spring on a
bar press might become weaker and easier to push over time artifactually increasing reaction
times.
i) Additive and Interactive Effects of Threats to Internal Validity: The impact of a threat can be
added to that of another threat or may depend on the level of another threat. (61) Validity threats
need not operate singly. Several can operate simultaneously. If they do, the net bias depends on
the direction and magnitude of each individual bias plus whether they combine additively or
multiplicatively (interactively). E.g., a selection-maturation additive effect may result when
nonequivalent experimental groups formed at the start of the treatment are also maturing at
different rates over time. A selection-history additive effect may result if nonequivalent groups
also come from different settings and each group experiences a unique local history.
3) Construct Validity: Does the instrument measure the construct it says it does? (Vogt, 58) The extent
to which variables accurately measure the constructs of interest. In other words: How well are the
variables operationalized? Do the operations really get at the things you are trying to measure? How
well can you generalize from your operations to your construct? In practice, construct validity is used
to describe a scale, index, or other measure of a variable that correlates with measures of other
variables in ways that are predicted by, or make sense according to, a theory of how the variables are
related. See concurrent, content, convergent, and criterion-related validity. Absolute distinctions
among these kinds of validity are difficult to make, in large part because procedures for assessing
them tend to be similar if not identical. Convergent and Discriminant validity, for instance, are used
as tests of construct validity.
E.g., if you were studying racist attitudes, and you believed that racism (the construct) was more
common among people with low self-esteem, you could put together some questions that you thought
were a good index of racism. If subjects' scores on that index were strongly (negatively) correlated
with their scores on a measure of self-esteem, this would be evidence that your index had construct
validity. The index is more likely to be a good measure of racism if it correlates with something your
theory says it should correlate with than if it does not. All this assumes, of course, that your theory is
right in the first place about the relation between self-esteem and racism and that you have a valid
measure of self-esteem.
(Shadish, 506) The degree to which inferences are warranted from the observed persons, settings, and
cause-and-effect operations sampled within a study to the constructs that these samples represent.
Threats to Construct Validity: Reasons Why Inferences about the Constructs That
Characterize Study Operations May Be Incorrect (Shadish, 73) (Severino) Sources of secondary
variance. Features within an experiment that interfere with interpretation and create alternative
explanations for results which are different from the theoretical assumptions about the action of the
IV or the presumed agents of change.
a) Inadequate Explication of Constructs: Failure to adequately explicate a construct may lead to
incorrect inferences about the relationship between operation and construct.
b) Construct Confounding: Operations usually involve more than one construct, and failure to
describe all the constructs may result in incomplete construct inferences.
(Severino)
c) Attention and/or Simple Contact with Participants (Hawthorne Effect): The intervention may
impact participants simply because of the attention or human contact it provided, rather than the
specific content or characteristics of the intervention. Controlled for by the use of Attention
Control- giving the group attention but not the intervention.
d) Single Operations and Narrow Stimulus Sampling: The way in which the intervention,
treatment, or program is operationalized or delivered may limit the researcher's ability to examine
why the intervention affected participants. E.g., holding the therapist constant: the therapist might
be more adept or confident with one intervention than with the others. The results then reflect a
therapist-by-treatment interaction rather than the intervention itself; that is, the results are due to
the therapist, not the treatment.
e) Experimenter Expectancy Effects: The extent to which it is possible that the researcher's beliefs,
ideas, hopes, opinions, and hypotheses inadvertently affected participants' responses. May be
communicated either verbally or nonverbally. Controlled for by the use of blind and double-blind
studies.
f) Demand Characteristics or Cues in the Experimental Setting: The extent to which extraneous
cues in the intervention or experimental procedure may account for the results, rather than the
intervention itself.
4) External Validity (Vogt, 114) The extent to which the findings of a study are relevant to subjects and
settings beyond those in the study. Another term for generalizability.
Threats to External Validity: Reasons Why Inferences about How Study Results Would Hold
over Variations in Persons, Settings, Treatments, and Outcomes May Be Incorrect (Shadish, 87)
a) Interaction of the Causal Relationship with Units: An effect found with certain kinds of units
might not hold if other kinds of units had been studied.
b) Interaction of the Causal Relationship over Treatment Variations: An effect found with one
treatment variation might not hold with other variations of that treatment, or when that treatment
is combined with other treatments, or when only part of that treatment is used.
(Severino)
c) Sample Characteristics: The degree to which the results of the research may be generalized to
others who vary from the particular characteristics of the selected sample. E.g., Demographics,
age, gender, religion, ethnicity, nationality, ability or disability, SES, or geography.
d) Stimulus Characteristics & Settings: the degree to which the conditions in the natural research
setting may impact the results in ways which may not generalize to situations or persons who are
not involved in an experiment.
e) Reactivity of Experimental Arrangement: Participants' awareness that they are participating in an
experiment may impact the results in ways which may not generalize to situations or persons who
are not involved in an experiment.

f) Multiple Treatments: When participants receive more than one experimental condition or
treatment, the results may not generalize to situations where only a single treatment is given.
g) Novelty effects: The possibility that effects of an intervention are in part due to the novelty of the
administration circumstances.
h) Experimenter Expectancy: The unintentional effect that the experimenter exerts on the study in
the direction of the hypothesis.
i) Reactivity of Assessment: Participants may respond to tests or assessment measures differently
when they are aware that they are being assessed, than if they are not aware. To the degree that
participants would not be aware of such measurement or assessment in other non-experimental
situations, the results would perhaps not be generalizable.
j) Test Sensitization: The effect produced by pretesting participants. May make them more or less
attentive or receptive to the intervention and limit generalizability.
k) Timing of Measurement: The degree to which the results of the intervention or research project
vary as a result of the point in time at which the post-intervention measurement is given.
More on other types of validity in the foundations of measurement section!!
H) Pre-experimental, quasi-experimental and experimental designs
OBSERVATIONAL DESIGNS [PRE-EXPERIMENTAL]: (Severino & Vogel) Observational designs are
purely descriptive in nature (qualitative) and there are no requirements for random assignment of
individuals, independent variables, or control groups. This design uses many different techniques such as
observation, interview, records, etc. It is usually conducted to collect preliminary information that may
lead to specific IVs, DVs, and hypotheses about relationships. The biggest problem with observational
designs is that there is poor external validity (Generalizability). [See Summary Table in Appendix]
(Vogt, 218) Any of several research designs in which the investigators observe subjects but do not interact
with them, as they would have to, for example, in interviews.
Subtypes:
1) Case Study: an in depth study of a single individual, family, organization, or other social unit.
Purely descriptive. Uses techniques such as observation, interviews, records, etc.
(Vogt, 38) Gathering and analyzing data about one or a small number of examples as a way of
studying a broader phenomenon. This is done on the assumption that the example (the case) is in
some way typical of the broader phenomenon. An advantage of the case-study method is that it
allows more intensive analysis of specific empirical details. A disadvantage is that it is hard to use
the results to generalize to other cases.
2) Case Control: (Vogt, 218) A method of studying an outcome by sampling cases with and without
that outcome and studying their backgrounds. E.g., in a study of lung cancer, the cases are
individuals who have the disease. The controls are similar people who do not have it. The
backgrounds of those with and without the disease are compared to understand the origins of the
disease.

3) Cross-Sectional: (Vogt, 73) A study conducted at a single point in time by, so to speak, taking a
slice (a cross section) of a population at a particular time. Cross-sectional studies provide only
indirect evidence about the effects of time and must be used with great caution when drawing
conclusions about change. E.g., if a cross-sectional survey shows that respondents aged 60-65 are
more likely to be racially prejudiced than respondents aged 20-25, this does not necessarily mean
that as the younger group ages it will become more prejudiced- nor does it necessarily mean that the
old group was once less prejudiced.

4) Retrospective Cross-Sectional: Draws inferences about an antecedent event that results in, or is
associated with, an outcome. Observe the past to see what determined present-day outcomes. It
attempts to identify the timeline between possible cause or antecedents (risk factors) and subsequent
outcome of interest. Subjects are identified who already show the outcome of interest (cases) and are
compared to those who do not show the outcome (controls). E.g., the relationship of attachment
patterns to suicidal behavior among adolescents.
5) Cohort Designs (Prospective Longitudinal Study): A.k.a. cohort analysis. (Vogt, 49) Studying the
same cohort over time. E.g., individuals who graduated from high school in 1985 form a cohort
whose subsequent educational experiences could be followed in a cohort analysis- for example, how
many went on to college immediately? How many went on eventually? Of those who attended
college, how many went to a two-year college? How many to a four-year college? How many
graduated? - and so on. (Severino) Strength of the design lies in establishing the relations between
antecedent events and outcomes.
Cohort: group of individuals having a statistical factor (usually age) in common.
Problem: cohort effects may confound the results of cross-sectional studies, as differences may be
due to the specific cohort rather than the variable you are interested in.
Difference from Case-Control: Cohort designs follow sample over time to identify factors
leading to an outcome of interest and the group is assessed before the outcome has occurred.
a) Single-Group Cohort Design: A single group of individuals who meet particular criteria is
selected and followed over time in order to study the emergence of the outcome.
b) Multiple-Group Cohort Design: Two or more groups are identified and followed over time to
examine outcomes of interest.
c) Accelerated Cohort Design: The study of multiple groups. Cohorts are included who vary in
age when they enter the study- each group covers only a portion of the total time frame of
interest and the groups overlap in ways that allow the investigator to discuss the entire
development period. Requires less time than if a single group was studied. This design is in
between longitudinal and cross sectional designs.

EXPERIMENTAL DESIGNS (Vogt, 112) The art of planning and executing an experiment. The greatest
strength of an experimental research design, due largely to random assignment, is its internal validity:
One can be more certain than with any other design about attributing cause to the independent variables.
The greatest weakness of experimental designs may be external validity. It may be inappropriate to
generalize results beyond the laboratory or other experimental context.
Experimental designs allow for the clearest case for drawing causal inferences because they have the
greatest control of independent variables and error. [See Summary Table in Appendix]

Independent Variable: the variable identifying the groups you are comparing; it can be manipulated
or static.
Sources of Error:
• Error variance: differences within the groups you are comparing that are distributed randomly
• Covariance: occurs when a variable is associated with/related to the dependent variable;
systematic differences between the groups that are not accounted for by the treatment effect
(e.g., social desirability)
• Between-Group versus Within-Group designs: WG designs have less error variance because
you are comparing each subject to themselves.
Must Have:
• Random Assignment: all subjects are equally likely to be in any treatment group, with the goal
being to evenly distribute characteristics across groups (e.g., age, sex, education); the goal is
group equivalence!
• IVs must be manipulated!
  Manipulated: differences chosen by the researcher
  Static: the researcher cannot assign them to subjects (age, gender, sexual orientation)
• Control Group: group that experiences all of the same things as the experimental group but does
not receive treatment (i.e., is subject to the same history, maturation, demographics, selection,
testing, etc.); accounts for spontaneous remission cases, which affect internal validity
Statistical Analyses:
• t-test, ANOVA, ANCOVA, MANOVA
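As a minimal sketch (not from the packet) of the random assignment requirement, the following Python function deals shuffled subjects into groups so every subject is equally likely to land in any group; the group labels and function name are hypothetical:

```python
import random

def randomly_assign(subjects, groups=("treatment", "control"), seed=None):
    """Shuffle subjects, then deal them round-robin into groups so that
    every subject is equally likely to land in any group and group sizes
    stay as equal as possible (the goal: group equivalence)."""
    rng = random.Random(seed)
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    assignment = {g: [] for g in groups}
    for i, subject in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(subject)
    return assignment

# 20 hypothetical subject IDs split into two equal groups
result = randomly_assign(list(range(1, 21)), seed=42)
print(len(result["treatment"]), len(result["control"]))  # 10 10
```

Because the shuffle, not the researcher, determines group membership, subject characteristics are distributed across groups by chance alone.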


Subtypes

R = Random Assignment; O = Observation; X = Treatment

Pretest-Posttest Control Group Design: 2 levels of the IV, with 1 receiving treatment and the other
not receiving treatment; assesses amount of change within a group from time 1 (pretest) to time 2
(posttest). However, this does not control for pretest sensitization.

Random Assignment   Observation   Treatment   Posttest
R                   O             X           O
R                   O                         O

Posttest Only Control Group Design: at least 2 levels of the IV, with 1 receiving treatment; only a
posttest is given, to control for pretest sensitization. However, we now cannot be sure that our
groups started out equivalent!

Random Assignment   Observation   Treatment   Posttest
R                                 X           O
R                                             O
Solomon 4 Group: used to control for the effects of pretesting by including the pretest as an IV. This
design, however, is costly, requires a great amount of effort, and needs a large number of subjects!

Group   Random Assmnt   Observation   Treatment   Posttest
1       R               O             X           O
2       R               O                         O
3       R                             X           O
4       R                                         O

Group 2 versus 4: assesses effect of pretest on internal validity (neither group got treatment)
Group 1 versus 3: assesses effect of pretest on external validity (both groups got treatment)

Time Series Design: within-subjects design in which the dependent variable is administered to all
subjects before and after the independent variable is applied (pre, post, follow-up testing). This
design focuses on comparing subjects against themselves, which increases power by decreasing
error variance.

Random Assignment   Observation   Posttest   Follow-Up
R                   O             O          O
R                   O             O          O

Factorial Design: involves 2 or more IVs, each with 2 or more levels; this design allows for the
simultaneous investigation of 2 or more variables in a single experiment. It provides information on
both the main effects of each IV and the interaction between the variables. See more information in
the factorial ANOVA section.

Random Assignment   Treatment   Posttest
R                   X(A1B1)     O
R                   X(A1B2)     O
R                   X(A2B1)     O
R                   X(A2B2)     O
Strengths
It can assess the effects of separate variables
Different variables can be studied with fewer subjects
Provides unique information about the combined effects of IVs
Interactions provide important information such as whether there may be variables that
moderate the effects of other variables!
Weaknesses
The number of groups multiplies quickly as new factors or new levels are added (can become
unwieldy)
Optimally informative when an investigator predicts an interactive relationship among
variables, but with more variables, interactions become complicated and difficult to interpret!
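As a small illustration (not from the packet), the cells of a factorial design can be enumerated with Python's itertools.product; the factor names here are hypothetical:

```python
from itertools import product

def factorial_conditions(factors):
    """Enumerate every cell of a factorial design: one condition per
    combination of factor levels (e.g., a 2x2 design yields 4 cells)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]

# Hypothetical 2x2 design: two IVs (A and B), each with two levels
cells = factorial_conditions({"A": ["A1", "A2"], "B": ["B1", "B2"]})
print(len(cells))  # 4
print(cells[0])    # {'A': 'A1', 'B': 'B1'}
```

Adding a third two-level factor doubles the cell count to 8, which is exactly the "groups multiply quickly" weakness noted above.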

QUASI-EXPERIMENTAL (COMBO DESIGNS) [See Summary Table in Appendix]

Basic Design. Quasi-experimental designs are similar to experimental designs; however, at least one
IV is static (e.g., gender). Thus, this design approximates the control offered by experimental designs,
but true random assignment is not possible!
(Vogt, 257) A type of research design for conducting studies in field or real-life situations where the
researcher may be able to manipulate some independent variables but cannot randomly assign subjects to
control and experimental groups.
E.g., you cannot cut off some individuals' unemployment benefits to see how well they could get along
without them or to see whether an alternative job training program would be more effective. But you
could try to find volunteers for the new job training program. You could compare the results for the
volunteer group (experimental group) with those of people in the regular program (control group). The
study is quasi-experimental because you were unable to assign subjects randomly to treatment and control
groups.

Must Have:
• 1 Static IV
• 1 Manipulated IV

Statistical Analyses: ANOVA, ANCOVA, Regression, MANOVA

(Shadish, 105) A causal inference from any quasi-experiment must meet the basic requirements for all
causal relationships: that cause precede effect, that cause covary with effect, and that alternative
explanations for the causal relationship are implausible. Both randomized and quasi-experiments
manipulate the treatment to force it to occur before the effect. Assessing covariation between cause and
effect is easily accomplished in all experiments, usually during statistical analysis. To meet the 3rd
requirement, randomized experiments make alternative explanations implausible by ensuring that they are
randomly distributed over the experimental conditions. Because quasi-experiments do not use random
assignment, they rely on other principles to show that alternative explanations are implausible. We
emphasize three closely related principles to address this requirement in quasi-experimentation.
1) Identification and study of plausible threats to internal validity. Once identified, those threats can be
studied to probe how likely it is that they explain treatment-outcome covariation.
2) Primacy of control by design. By adding design elements (e.g., observation at more pretest time
points, additional control groups), quasi-experimentation aims either to prevent the confounding of a
threat to validity with treatment effects or to provide evidence about the plausibility of those threats.
3) Coherent pattern matching. A complex prediction is made about a given causal hypothesis; the more
complex the pattern of results successfully predicted, the fewer alternative explanations could
plausibly have generated it.
Subtypes
(Anastasi, 109) Within-participants designs can increase statistical power by controlling individual
differences between units within conditions, and so they can use fewer units to test the same number of
treatments. However, within-participants designs can cause fatigue effects, practice effects, carryover
effects, and order effects. To avoid confounding such effects with treatment, the order of treatments in a
within-participants design is often either randomized for each unit or deliberately counterbalanced so that
some units get treatments in one order (e.g., A then B) but others get a second order (B then A) so that
order effects can be assessed.

Order Effects: also called sequence effects, multiple treatment interference, or carryover effects.
(Vogt, 222) (a) In experiments where subjects receive more than one treatment (a within-subjects
design), the influence of order in which they receive those treatments. Order effects may confound
(make it difficult to distinguish) the treatment effects. To avoid this problem, experimenters often use
counterbalancing. (b) In survey research, when respondents are asked more than one question about
the same topic, the order in which the questions are asked can influence the answers. One example in
which researchers experimented with question order involved asking US respondents whether
Russian newspaper reporters should be able to travel freely in the US in order to gather information.
Most respondents said no - unless they had first been asked whether US reporters should be able to
travel freely in Russia to gather information.
Counterbalanced Designs: Designs that try to balance the order of treatments across subjects. (Vogt, 67)
In a within-subjects factorial experiment, presenting conditions (treatments) in all possible orders to avoid
order effects. See Latin Square. (See I, D, 5)
Crossover Designs: Partway through the experiment (treatments), usually at the midpoint, all subjects
cross over (are switched) to another experimental condition. Both groups get both the control and
experimental conditions. This increases power because each group serves as its own control.
Multiple Treatment Counterbalanced: controls for carryover (order) effects that may result with a
within-subjects design; administer levels of an IV to different subjects or groups in a different order
(balances the order of treatments)
Latin Square: (Vogt, 169) A method of allocating subjects, in a within-subjects experiment, to treatment
group orders. So called because the treatments are symbolized by Latin (not Greek) letters. The main goal
of using Latin squares is to avoid order effects by rotating the order of treatments.
In the following example, A, B, C, and D are treatments. There are four subjects and four orders of
treatment. Subject 1 would receive the treatments ABCD in that order, the order for Subject 2 would be
BDAC, and so on. Note that in a Latin square the number of rows and columns must be equal. The
number of subjects must equal, or be a multiple of, the number of treatments: in this example, 4, 8, 12,
16, and so forth.

LATIN SQUARE
Order:      1st   2nd   3rd   4th
Subject 1    A     B     C     D
Subject 2    B     D     A     C
Subject 3    C     A     D     B
Subject 4    D     C     B     A
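A rotation scheme like this can be generated programmatically. Below is a minimal Python sketch (not from the packet) that builds a simple cyclic Latin square; the packet's 4x4 example uses a different, equally valid arrangement with the same defining property (each treatment appears once per row and once per column):

```python
def latin_square(treatments):
    """Build a cyclic Latin square: each row rotates the treatment order
    by one position, so every treatment appears exactly once in each row
    and each column (rotating treatment order to avoid order effects)."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]

for row in latin_square(["A", "B", "C", "D"]):
    print("".join(row))
# ABCD
# BCDA
# CDAB
# DABC
```

Each printed row is one subject's treatment order; reading down any column shows that every treatment occupies every ordinal position exactly once.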

Ceiling Effect: (Vogt, 40) A term used to describe what happens when many subjects in a study
have scores on a variable that are at or near the possible upper limit (ceiling). The ceiling effect
makes analysis difficult, since it reduces the amount of variation in a variable.
E.g., suppose a group of statistics professors wants to see whether a new method of teaching
increases knowledge of elementary statistics. They give students in their classes a test, try the
new method, and then give the students another version of the same test to see whether their
scores went up. If one of the professors had students who knew a lot of statistics already, and
scored at or near 100% on the first test, she could not tell whether the new method was effective
in her class. The scores of her students were so high (at the ceiling) that they could hardly go up,
even if the students learned a great deal using the new method.
Floor effect: (Vogt, 123) A term used to describe a situation in which many subjects in a study
measure at or near the possible lower limit (the floor). This makes analysis difficult, since it
reduces the amount of variation in the variable.
E.g., a study of counseling strategies to reduce suicide could be more difficult to conduct with
subjects who were black women in their thirties, since this group has a very low suicide rate, one
that could be hard to reduce further. It might be easier to conduct the study with white men in
their sixties, since they have a much higher suicide rate.

I) Single subject designs

Also called single case designs. Single subject designs compare the effects of different treatment
conditions on performance of one individual over time. (Shadish, 512) A time series done on one person,
common in clinical research.

(Severino) Compare the effects of different treatment conditions (IV) on performance. Inferences are
made about the effects of the intervention by comparing different conditions presented to the same
subject over time.

Must have:
Baseline Assessment: Each design begins with observing behavior for a period prior to the
intervention. This is referred to as the baseline phase.
Used to predict the level of performance in the immediate future if the intervention is not applied.
It is important that the data are stable and provide a stable rate of performance (e.g., absence of a
slope).
Continuous assessment starting with a strong baseline! The client's performance is observed on several
occasions prior to the intervention, and then continuously observed during the period of time the
intervention is in effect.
Examination of trends: Trend or slope refers to the tendency for performance to decrease or increase
systematically or consistently over time.
Single-subject designs relate to clinical practice. While the unit of analysis in single-subject designs is
usually an individual, it can also be a family, a group, a community, or an organization. Single subject
designs are within subject designs (each participant is his/her own control), whereas group experimental
designs are between subject designs (the participant is either in the treatment or control group).
(Retrieved on May 31, 2007 from http://www.brynmawr.edu/Acads/GSSW/schram/sw132111301.html)

Subtypes

ABAB Designs: (Vogt, 1) Research designs that alternate baseline measures of a variable with
measures of that variable after a treatment. Such designs are used in single-subject or single-group
experiments having one treatment and no control group. The A represents the baseline, the B
represents the treatment. A-B and A-B-A designs are shorter versions of the same research procedure.
A-B-A-B designs are often used when it would be unethical to withhold a treatment from a control
group.
E.g., in order to see whether a treatment had an effect, a patient's symptoms might be measured daily
for four weeks. During the first week there would be no treatment (baseline, A) then the treatment (B)
would be given for a week. In the third week, the treatment could be withdrawn, and a second
baseline (A) would be established. In the fourth week the treatment would be given again.
(Vogel) This design examines the effects of an intervention by alternating the baseline with treatment
for 4 complete phases. In other words, the design begins by taking a baseline measurement of a
behavior; treatment is then implemented; when the behavior changes, treatment is withdrawn and
there is often a return to baseline behaviors; treatment is then administered again; and so on through
4 phases:
Baseline (A) → Treatment (B) → Baseline (A) → Treatment (B), i.e., ABAB (or ABABABAB, etc.)
The general paradigm for single-subject designs involves first collecting data for a baseline activity
of the target problem, e.g., drug addiction, depression or whatever, and then collecting more data after
the introduction of a treatment intervention. The baseline activity phase is referred to as A and the
intervention phase as B. The baseline phase is basically a control phase (serving the same function
as control groups do in group experiments). To infer that an intervention is effective requires a
comparison of shifts in the pattern of the data which coincides with shifts between the baseline and
intervention phases. Extraneous events are much better controlled when there are several shifts
between baseline and intervention phases, for example in ABAB designs (withdrawal or reversal
designs). However, even these more rigorous designs, where internal validity is maximized, can still
have problems: carry-over effects, where results from the previous phase carry over into the next
phase; order effects, where the ordering of the intervention or treatment affects what results you
will get; and the irreversibility of effects, where once a change is effected it cannot be undone. Also,
multiple-stage designs that withdraw treatment can at times present ethical and feasibility problems,
because they often cannot be implemented in the complex world of practice, including the
constraints of managed care.
(Retrieved on May 31, 2007 from http://www.brynmawr.edu/Acads/GSSW/schram/sw132111301.html)

Multiple Baseline: This design demonstrates the effects of an intervention by showing that behavior
change accompanies the introduction of the intervention at different points in time. Data is collected
continuously by taking assessments over time of at least two precisely defined behaviors. A baseline
period of assessment is taken, and then the intervention is introduced to the different baselines at
different points in time. The main difference from ABAB designs is that at least 2 behaviors are being
studied concurrently!

Changing Criterion: This design demonstrates the effect of an intervention by showing that behavior
changes incrementally to match a performance criterion (expected outcome/goal). You would start
out with a baseline phase followed by an intervention. The intervention ceases once the criterion is
reached, at which point a reward/incentive is administered. You would then keep setting new
criterion/goals to show incremental improvement.
Baseline (A) → Treatment (B) → Criterion Reached and Reward Administered → Set New Goal,
Return to Baseline → Treatment → Goal Reached and Reward Given → etc.

SINGLE-SUBJECT and TRADITIONAL GROUP DESIGNS

Similarities: Both are longitudinal and are concerned with:
• issues of control;
• specifying targets of intervention in operational terms;
• developing measurement and recording plans for assessing these targets; and
• both can use a combination of process and outcome measures, though single-subject designs
rarely employ process measures, which try to assess the "black box" of treatment, or what kinds
of interactions go on between clients and therapists during the course of an intervention.

Dissimilarities: Compared with group designs, single-subject designs:
• typically use more repeated measures;
• have a more variable duration of research;
• involve participants more actively in setting the goals and targets of interventions;
• have the choice of design typically established by the worker rather than the researcher;
• are more flexible, responding to the needs of the particular case rather than being fixed;
• produce findings with more direct and immediate impact on interventions at the individual case
level; and
• are much less costly than group designs.
J) Meta analysis
These designs are conducted to provide an overview or review of all the literature in a specific area of
interest. The methodology involves evaluating and combining the results of multiple studies. The goal
of a meta analysis is to provide an estimate of the effect size for the particular area of research.
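As an illustration of the combining step, here is a minimal Python sketch of a fixed-effect (inverse-variance weighted) summary effect size; the study values and function name are hypothetical:

```python
def fixed_effect_summary(effects, variances):
    """Combine study effect sizes using inverse-variance weights:
    pooled = sum(w_i * d_i) / sum(w_i), where w_i = 1 / variance_i,
    so more precise studies (smaller variance) count for more."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return pooled

# Three hypothetical studies: effect sizes (d) and their sampling variances
print(round(fixed_effect_summary([0.5, 0.3, 0.4], [0.04, 0.02, 0.08]), 3))  # 0.371
```

Note the pooled estimate falls closest to the most precise study's effect (0.3, variance 0.02), which carries the largest weight.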

K) Research ethics
(APA Ethics Code, Section 8: http://www.apa.org/ethics/code2002.html#8)

8.01 Institutional Approval
When institutional approval is required, psychologists provide accurate information about their research
proposals and obtain approval prior to conducting the research. They conduct the research in accordance
with the approved research protocol.
8.02 Informed Consent to Research
(a) When obtaining informed consent as required in Standard 3.10, Informed Consent, psychologists
inform participants about (1) the purpose of the research, expected duration, and procedures; (2) their
right to decline to participate and to withdraw from the research once participation has begun; (3) the
foreseeable consequences of declining or withdrawing; (4) reasonably foreseeable factors that may be
expected to influence their willingness to participate such as potential risks, discomfort, or adverse
effects; (5) any prospective research benefits; (6) limits of confidentiality; (7) incentives for
participation; and (8) whom to contact for questions about the research and research participants'
rights. They provide opportunity for the prospective participants to ask questions and receive answers.
(See also Standards 8.03, Informed Consent for Recording Voices and Images in Research; 8.05,
Dispensing With Informed Consent for Research; and 8.07, Deception in Research.)
(b) Psychologists conducting intervention research involving the use of experimental treatments clarify to
participants at the outset of the research (1) the experimental nature of the treatment; (2) the services
that will or will not be available to the control group(s) if appropriate; (3) the means by which
assignment to treatment and control groups will be made; (4) available treatment alternatives if an
individual does not wish to participate in the research or wishes to withdraw once a study has begun;
and (5) compensation for or monetary costs of participating including, if appropriate, whether
reimbursement from the participant or a third-party payor will be sought. (See also Standard 8.02a,
Informed Consent to Research.)
8.03 Informed Consent for Recording Voices and Images in Research
Psychologists obtain informed consent from research participants prior to recording their voices or images
for data collection unless (1) the research consists solely of naturalistic observations in public places, and
it is not anticipated that the recording will be used in a manner that could cause personal identification or
harm, or (2) the research design includes deception, and consent for the use of the recording is obtained
during debriefing. (See also Standard 8.07, Deception in Research.)
8.04 Client/Patient, Student, and Subordinate Research Participants
(a) When psychologists conduct research with clients/patients, students, or subordinates as participants,
psychologists take steps to protect the prospective participants from adverse consequences of
declining or withdrawing from participation.
(b) When research participation is a course requirement or an opportunity for extra credit, the prospective
participant is given the choice of equitable alternative activities.
8.05 Dispensing With Informed Consent for Research
Psychologists may dispense with informed consent only
(1) where research would not reasonably be assumed to create distress or harm and involves
(a) the study of normal educational practices, curricula, or classroom management methods
conducted in educational settings;
(b) only anonymous questionnaires, naturalistic observations, or archival research for which
disclosure of responses would not place participants at risk of criminal or civil liability or damage
their financial standing, employability, or reputation, and confidentiality is protected; or
(c) the study of factors related to job or organization effectiveness conducted in organizational
settings for which there is no risk to participants' employability, and confidentiality is protected or
(2) where otherwise permitted by law or federal or institutional regulations.
8.06 Offering Inducements for Research Participation
(a) Psychologists make reasonable efforts to avoid offering excessive or inappropriate financial or other
inducements for research participation when such inducements are likely to coerce participation.
(b) When offering professional services as an inducement for research participation, psychologists clarify
the nature of the services, as well as the risks, obligations, and limitations. (See also Standard 6.05,
Barter With Clients/Patients.)
8.07 Deception in Research
(a) Psychologists do not conduct a study involving deception unless they have determined that the use of
deceptive techniques is justified by the study's significant prospective scientific, educational, or
applied value and that effective nondeceptive alternative procedures are not feasible.
(b) Psychologists do not deceive prospective participants about research that is reasonably expected to
cause physical pain or severe emotional distress.
(c) Psychologists explain any deception that is an integral feature of the design and conduct of an
experiment to participants as early as is feasible, preferably at the conclusion of their participation,
but no later than at the conclusion of the data collection, and permit participants to withdraw their
data. (See also Standard 8.08, Debriefing.)
8.08 Debriefing
(a) Psychologists provide a prompt opportunity for participants to obtain appropriate information about
the nature, results, and conclusions of the research, and they take reasonable steps to correct any
misconceptions that participants may have of which the psychologists are aware.
(b) If scientific or humane values justify delaying or withholding this information, psychologists take
reasonable measures to reduce the risk of harm.
(c) When psychologists become aware that research procedures have harmed a participant, they take
reasonable steps to minimize the harm.
8.09 Humane Care and Use of Animals in Research
(a) Psychologists acquire, care for, use, and dispose of animals in compliance with current federal, state,
and local laws and regulations, and with professional standards.
(b) Psychologists trained in research methods and experienced in the care of laboratory animals supervise
all procedures involving animals and are responsible for ensuring appropriate consideration of their
comfort, health, and humane treatment.
(c) Psychologists ensure that all individuals under their supervision who are using animals have received
instruction in research methods and in the care, maintenance, and handling of the species being used,
to the extent appropriate to their role. (See also Standard 2.05, Delegation of Work to Others.)
(d) Psychologists make reasonable efforts to minimize the discomfort, infection, illness, and pain of
animal subjects.
(e) Psychologists use a procedure subjecting animals to pain, stress, or privation only when an alternative
procedure is unavailable and the goal is justified by its prospective scientific, educational, or applied
value.
(f) Psychologists perform surgical procedures under appropriate anesthesia and follow techniques to
avoid infection and minimize pain during and after surgery.
(g) When it is appropriate that an animal's life be terminated, psychologists proceed rapidly, with an
effort to minimize pain and in accordance with accepted procedures.
8.10 Reporting Research Results
(a) Psychologists do not fabricate data. (See also Standard 5.01a, Avoidance of False or Deceptive
Statements.)
(b) If psychologists discover significant errors in their published data, they take reasonable steps to
correct such errors in a correction, retraction, erratum, or other appropriate publication means.
8.11 Plagiarism
Psychologists do not present portions of another's work or data as their own, even if the other work or
data source is cited occasionally.
8.12 Publication Credit
(a) Psychologists take responsibility and credit, including authorship credit, only for work they have
actually performed or to which they have substantially contributed. (See also Standard 8.12b,
Publication Credit.)
(b) Principal authorship and other publication credits accurately reflect the relative scientific or
professional contributions of the individuals involved, regardless of their relative status. Mere
possession of an institutional position, such as department chair, does not justify authorship credit.
Minor contributions to the research or to the writing for publications are acknowledged appropriately,
such as in footnotes or in an introductory statement.
(c) Except under exceptional circumstances, a student is listed as principal author on any multiple-authored article that is substantially based on the student's doctoral dissertation. Faculty advisors
discuss publication credit with students as early as feasible and throughout the research and
publication process as appropriate. (See also Standard 8.12b, Publication Credit.)
8.13 Duplicate Publication of Data


Psychologists do not publish, as original data, data that have been previously published. This does not
preclude republishing data when they are accompanied by proper acknowledgment.
8.14 Sharing Research Data for Verification
(a) After research results are published, psychologists do not withhold the data on which their
conclusions are based from other competent professionals who seek to verify the substantive claims
through reanalysis and who intend to use such data only for that purpose, provided that the
confidentiality of the participants can be protected and unless legal rights concerning proprietary data
preclude their release. This does not preclude psychologists from requiring that such individuals or
groups be responsible for costs associated with the provision of such information.
(b) Psychologists who request data from other psychologists to verify the substantive claims through
reanalysis may use shared data only for the declared purpose. Requesting psychologists obtain prior
written agreement for all other uses of the data.
8.15 Reviewers
Psychologists who review material submitted for presentation, publication, grant, or research
proposal review respect the confidentiality of and the proprietary rights in such information of
those who submitted it.

II) STATISTICS
A) Descriptive
(Salkind, 8) Descriptive statistics are used to organize and describe the characteristics of a collection of
data. The collection is sometimes called a data set or just data.
1) Measures of central tendency
A.k.a. measures of location. (Salkind, 19) Groups of data can be summarized using an average.
Averages, also called measures of central tendency, come in three flavors: the mean, median, and
mode. Each provides you with a different type of information about a distribution of scores and is
simple to compute and interpret.
a) Mean: the average score; (Salkind, 20-21) the sum of all the values in a group, divided by the
number of values in that group. Represented as x̄. (21) The mean is sometimes represented by
the letter M and is also called the typical, average, or most central score.

x̄ = ΣX / n, where
x̄ = the mean
X = each individual score in the group of scores
Σ = the sum of
n = the size of the sample from which you are computing the mean

E.g., if 10 scores (including values such as 10, 12, 14, and 20) sum to 100, then the mean = 100/10 = 10.

b) Median: (Salkind, 23) the median is also an average. The median is defined as the midpoint in a
set of scores. It's the point at which one half, or 50%, of the scores fall above and one half, 50%,
fall below. There's no standard formula for computing the median. To compute the median,
follow these steps: 1) list the values in order, either from highest to lowest or lowest to highest;
2) find the middle-most score; that's the median. E.g.:

Household incomes: $135,456; $25,500; $32,456; $54,365; $37,668
Listed in order: $135,456; $54,365; $37,668; $32,456; $25,500

There are five values. The middle-most value is $37,668, and that's the median.

May 2007

METHODOLOGICAL CONSIDERATIONS

When there is an even number of values, average the middle two values. E.g., add a sixth income
of $34,500:

Listed in order: $135,456; $54,365; $37,668; $34,500; $32,456; $25,500

The middle two values are $37,668 and $34,500. Their average, $36,084, is the median for that
set of six values.
The median is the 50th percentile. E.g., the median of 2, 3, 4, 5 is 3.5, and the median of
2, 3, 4, 5, 6 is 4.
The median is not affected by outliers.

c) Mode: most frequently occurring score. (Salkind, 27) The value that occurs most frequently. To
compute the mode, follow these steps: 1) list all the values in a distribution, but list each only
once; 2) tally the number of times that each value occurs; 3) the value that occurs most often is
the mode.
E.g., for the scores 4, 10, 10, 10, 12, 14, 20, 20, the MODE = 10.
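The three measures of central tendency can be sketched in a few lines of Python with the standard library's statistics module, using the data set from the mode example above:

```python
from statistics import mean, median, mode

scores = [4, 10, 10, 10, 12, 14, 20, 20]  # the data set from the mode example

m = mean(scores)    # sum of all values divided by the number of values
md = median(scores) # midpoint; with an even n, the average of the middle two
mo = mode(scores)   # the most frequently occurring value
```

Here the mean (12.5) is pulled above the median (11) by the two 20s, while the mode is 10.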
2) Measures of dispersion or variability
a) Range: this is simply the difference between the minimum and maximum score; (Salkind, 35)
The range is the most general measure of variability. It gives you an idea of how far apart scores
are from one another. The range is computed simply by subtracting the lowest score in a
distribution from the highest score in the distribution:

Maximum - Minimum = Range
E.g., for the data set 3, 6, 9, 12: 12 - 3 = 9, so 9 is the range.

(36) The range is used almost exclusively to get a very general estimate of how wide or different
scores are from one another - that is, the range shows how much spread there is from the lowest
to the highest point in a distribution. Although the range is fine as a general indicator of
variability, it should not be used to reach any conclusions regarding how individual scores
differ from one another.
b) Standard Deviation (square root of variance): standardized measure of dispersion of scores.
(Salkind, 36-39) The most frequently used measure of variability. Just think about what the term
implies; it's a deviation from the standard. The SD represents the average amount of variability in
a set of scores. In practical terms, it's the average distance from the mean. The larger the SD, the
larger the average distance each data point is from the mean of the distribution.

s = √[ Σ(X - x̄)² / (n - 1) ]

s = standard deviation
Σ = the sum of what follows
X = each individual score
x̄ = the mean of all scores
n = the sample size

(41) Things to Remember:
 The SD is computed as the average distance from the mean. So, you will need to first
compute the mean as a measure of central tendency. Don't fool around with the median or
mode in trying to compute the standard deviation.
 The larger the SD, the more spread out the values are, and the more different they are from
one another.
 Just like the mean, the SD is sensitive to extreme scores. When you are computing the SD of
a sample and you have extreme scores, note that somewhere in your written report.
c) Variance: spread or amount of differences found between groups,
within groups, etc; variability of scores around mean; mean squared
distance between an observed score and the overall mean. Variability
reflects how scores differ from one another.

See the illustration (not reproduced here) to observe how two different distributions show a
different amount of spread in the scores. The peaked distribution has much less variance than the
flatter distribution.
(Salkind, 42) Variance is another measure of variability. If you know the SD of a set of scores
and you can square a number, you can easily compute the variance of the same set of scores. It's
the same formula as above without the square root:

s² = Σ(X - x̄)² / (n - 1)

How are the SD and the variance the same, and how are they different? Well,
they are both measures of variability, dispersion, or spread. The formulas used to compute them
are very similar. They are also quite different. First, and most important, the SD (because we take
the square root of the average summed squared deviation) is stated in the original units from
which it was derived. The variance is stated in units that are squared (the square root is never
taken).
What does this mean? Let's say that we need to know the variability of a group of production
workers assembling circuit boards. Let's say that they average 8.6 boards per hour, and the SD is
1.59. The value 1.59 means that the difference in the average number of boards assembled per
hour is about 1.59 circuit boards from the mean. Let's look at an interpretation of the variance,
which is 1.59², or 2.53. This would be interpreted as meaning that the average difference between
the workers is about 2.53 circuit boards squared from the mean. Which of these two makes more
sense?
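The three dispersion measures can be sketched with the standard library, using hypothetical boards-per-hour counts chosen so the mean matches the 8.6 in the example above (statistics.stdev and statistics.variance both divide by n - 1, as in the formula):

```python
from statistics import mean, stdev, variance

# Hypothetical boards-assembled-per-hour counts for ten workers
boards = [7, 8, 9, 10, 9, 8, 10, 7, 9, 9]

rng = max(boards) - min(boards)  # range: highest score minus lowest score
s = stdev(boards)                # sample SD, in the original units (boards)
s2 = variance(boards)            # sample variance, in squared units; s2 == s ** 2
m = mean(boards)                 # 8.6 boards per hour
```

Squaring the SD always reproduces the variance, which is why knowing one gives you the other.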
d) Confidence Interval: (Vogt, 55) A range of values of a sample statistic that is likely (at a given
level of probability, called a confidence level) to contain a population parameter. The interval that
will include the population parameter a certain percentage (confidence level) of the time. In other
words, a range of values with a known probability of including the true population value. The
wider the confidence interval, the higher the confidence level. It is common to say, for example,
that one can be 95% confident that the confidence interval contains the true value. Although this
is the usual way to report confidence intervals and limits, it is not technically correct. Rather, it is
correct to say that, were one to take an infinite number of samples of the same size, on average
95% of them would produce confidence intervals containing the true population value.
(Vogel) The smaller the range, the better the estimate. For instance, in IQ testing you get a
confidence interval for each subscale. If you were using a 99% confidence interval, you would
say that you are 99% sure that the subscale score would fall between 90-110. The smaller your
CI, the better your estimate!
If you have a variable that is normally distributed (many variables are), then standard deviations
are important because they allow the calculation of confidence intervals into which certain
known percentages of scores reside. Approximately 68% of the scores in a normal distribution
fall within +/- 1 standard deviation of the mean, approximately 95% fall within +/- 2 standard
deviations, and approximately 99.7% fall within +/- 3 standard deviations. (The figure
illustrating this property of normal distributions is not reproduced here.)
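These percentages can be verified with Python's statistics.NormalDist; this is just a sketch of the check (any normal table gives the same values):

```python
from statistics import NormalDist

nd = NormalDist()  # the standard normal distribution: mean 0, SD 1

within_1sd = nd.cdf(1) - nd.cdf(-1)  # about 68% of scores
within_2sd = nd.cdf(2) - nd.cdf(-2)  # about 95% of scores
within_3sd = nd.cdf(3) - nd.cdf(-3)  # about 99.7% of scores
```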

e) Confidence Level: (Vogt, 55) A desired percentage of the scores (often 95% or 99%) that would
fall within a certain range of confidence limits. It is calculated by subtracting the alpha level from
1 and multiplying the result by 100; e.g., 100 x (1 - .05) = 95%.
E.g., say a poll predicted that, if the election were held today, a candidate would win 60% of the
vote. This prediction could be qualified by saying that the pollster was 95% certain (confidence
level) that the prediction was accurate plus or minus 3% (confidence interval). The larger the
sample the narrower the confidence interval or margin of error.
Confidence Interval (Retrieved on May 17, 2007 from
http://www.stats.gla.ac.uk/steps/glossary/confidence_intervals.html#conflevel)
A confidence interval gives an estimated range of values which is likely to include an unknown
population parameter, the estimated range being calculated from a given set of sample data.
If independent samples are taken repeatedly from the same population, and a confidence interval
calculated for each sample, then a certain percentage (confidence level) of the intervals will
include the unknown population parameter. Confidence intervals are usually calculated so that
this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals
for the unknown parameter.
The width of the confidence interval gives us some idea about how uncertain we are about the
unknown parameter (see precision). A very wide interval may indicate that more data should be
collected before anything very definite can be said about the parameter.
Confidence intervals are more informative than the simple results of hypothesis tests (where we
decide "reject H0" or "don't reject H0") since they provide a range of plausible values for the
unknown parameter. See also confidence limits.
Confidence Limits - Confidence limits are the lower and upper boundaries / values of a
confidence interval, that is, the values which define the range of a confidence interval.
The upper and lower bounds of a 95% confidence interval are the 95% confidence limits. These
limits may be taken for other confidence levels, for example, 90%, 99%, 99.9%.
Confidence Level - The confidence level is the probability value (1 - α) associated with a
confidence interval. It is often expressed as a percentage. For example, say α = 0.05 = 5%; then
the confidence level is equal to (1 - 0.05) = 0.95, i.e. a 95% confidence level.
Example: Suppose an opinion poll predicted that, if the election were held today, the
Conservative party would win 60% of the vote. The pollster might attach a 95% confidence level
to the interval 60% plus or minus 3%. That is, he thinks it very likely that the Conservative party
would get between 57% and 63% of the total vote.
Confidence Interval for a Mean - A confidence interval for a mean specifies a range of values
within which the unknown population parameter, in this case the mean, may lie. These intervals
may be calculated by, for example, a producer who wishes to estimate his mean daily output; a
medical researcher who wishes to estimate the mean response by patients to a new drug; etc.
The (two sided) confidence interval for a mean contains all the values of μ0 (the true population
mean) which would not be rejected in the two-sided hypothesis test of:
H0: μ = μ0
against
H1: μ not equal to μ0
The width of the confidence interval gives us some idea about how uncertain we are about the
unknown population parameter, in this case the mean. A very wide interval may indicate that
more data should be collected before anything very definite can be said about the parameter.
We calculate these intervals for different confidence levels, depending on how precise we want to
be. We interpret an interval calculated at a 95% level as: we are 95% confident that the interval
contains the true population mean. We could also say that 95% of all confidence intervals formed
in this manner (from different samples of the population) will include the true population mean.
Compare one sample t-test.
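A confidence interval for a mean can be sketched from a sample's mean and standard error. This minimal example uses hypothetical daily-output data and the normal approximation (critical value 1.96); with a sample this small, a t critical value would actually be more appropriate:

```python
from math import sqrt
from statistics import mean, stdev

output = [102, 98, 110, 95, 104, 99, 107, 101, 96, 108]  # hypothetical daily outputs

n = len(output)
m = mean(output)
se = stdev(output) / sqrt(n)  # standard error of the mean

z = 1.96  # critical value for a 95% confidence level (normal approximation)
lower, upper = m - z * se, m + z * se  # the 95% confidence limits
```

As the passage notes, a narrower interval (larger n or smaller SD) means a more precise estimate of the population mean.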

Confidence Interval for the Difference Between Two Means - A confidence interval for the
difference between two means specifies a range of values within which the difference between
the means of the two populations may lie. These intervals may be calculated by, for example, a
producer who wishes to estimate the difference in mean daily output from two machines; a
medical researcher who wishes to estimate the difference in mean response by patients who are
receiving two different drugs; etc.
The confidence interval for the difference between two means contains all the values of μ1 - μ2
(the difference between the two population means) which would not be rejected in the two-sided
hypothesis test of:
H0: μ1 = μ2
against
H1: μ1 not equal to μ2
i.e.
H0: μ1 - μ2 = 0
against
H1: μ1 - μ2 not equal to 0
If the confidence interval includes 0 we can say that there is no significant difference between the
means of the two populations, at a given level of confidence.
The width of the confidence interval gives us some idea about how uncertain we are about the
difference in the means. A very wide interval may indicate that more data should be collected
before anything definite can be said.
We calculate these intervals for different confidence levels, depending on how precise we want to
be. We interpret an interval calculated at a 95% level as: we are 95% confident that the interval
contains the true difference between the two population means. We could also say that 95% of all
confidence intervals formed in this manner (from different samples of the population) will
include the true difference. Compare two sample t-test.
Summary Review Notes: (Chantelle)
 Measure of variability: standard deviation
 Standard deviation squared = variance
 SD squared (s²) = statistic used if talking about your sample
 Omega squared = parameter statistic (used when generalizing about the population)
 x̄ = mean of our sample
 μ (mu) = mean of the whole population
 Variance = how much does one number stray from the middle
In the standard deviation formula: by subtracting 1 we have an end value that is more
generalizable to the population - a.k.a. (df = n - 1) {degrees of freedom}
Standard deviation has a special relationship to the normal curve:
68% of the curve includes +/- 1 standard deviation
95% of the curve includes +/- 2 standard deviations
Bell curve: if you get a z score of +1, then you are 1 SD above the mean
IQ: mean is 100, SD = 15; an SD of +1 is a z score of 1, which is equal to a score of 115
A very large variance indicates that people's scores are very difficult to predict


3) Skewness, Kurtosis
(Salkind, 48) Previously, you learned about two important types of descriptive statistics- measures of
central tendency and measures of variability. Both of these provide you with the one best score for
describing a group of data (central tendency) and a measure of how diverse, different, scores are from
one another (variability). Now we will examine how differences in these two measures result in
different-looking distributions. Numbers alone (such as x = 10 and s =
Skewness is about the
3) may be important, but a visual representation is a much more effective
behavior of the tail.
way of examining the characteristics of a distribution as well as the
characteristics of any set of data.
Skewness (Salkind, 61) is a measure of the lack of symmetry, or the lopsidedness, of a distribution.
One tail of the distribution is longer than another.

 Positively Skewed (mean > median > mode): A positively skewed distribution is asymmetrical
and points in the positive direction. If a test was very difficult and almost everyone in the class
did very poorly on it, the resulting distribution would most likely be positively skewed. In the
case of a positively skewed distribution, the mode is smaller than the median, which is smaller
than the mean. This relationship exists because the mode is the point on the x-axis corresponding
to the highest point, that is, the score with the greatest value, or frequency. The median is the
point on the x-axis that cuts the distribution in half, such that 50% of the area falls on each side.
One way to remember the order of the mean, median, and mode in a skewed distribution is to
remember that the mean is pulled in the direction of the extreme scores. In a positively skewed
distribution, the extreme scores are larger, thus the mean is larger than the median.

 Negatively Skewed (mean < median < mode): A negatively skewed distribution is asymmetrical
and points in the negative direction, such as would result with a very easy test. On an easy test,
almost all students would perform well and only a few would do poorly. The order of the
measures of central tendency would be the opposite of the positively skewed distribution, with
the mean being smaller than the median, which is smaller than the mode.
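The mean > median > mode ordering for positive skew can be checked on a small hypothetical data set whose long tail points to the right:

```python
from statistics import mean, median, mode

# Hypothetical positively skewed scores: most are low, one extreme high score
scores = [1, 2, 2, 2, 3, 3, 4, 10]

# The extreme score (10) pulls the mean above the median, which sits above the mode
assert mode(scores) < median(scores) < mean(scores)
```

Mirroring the data (making the tail point left) would reverse the ordering, as in a negatively skewed distribution.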

Kurtosis: (Salkind, 62) has to do with how flat or peaked a distribution appears, and the terms used to
describe this characteristic are relative ones. Kurtosis refers to the extent to which data are
concentrated in the middle or at the tails of a distribution.
 Leptokurtic: This type of distribution looks very peaked because all of the scores are in the
middle. (Mnemonic: looks like a leap.)
 Platykurtic: This type of distribution looks very flat because the scores are spread out evenly
through the middle and tails (flat like a platypus; "plata-flata"). The probability of responding
is the same; it has a uniform distribution.
 Bimodal: In this case the mean and the median fall at the same point, while the two modes
correspond to the two highest points of the distribution.
 Normal (mesokurtic): (mean = median = mode) This distribution is considered normal and is
represented by a bell-shaped curve, with most of the scores gathering in the middle and a few
extreme scores pulling the tails out a bit.

4) Types of Data

a) Nominal (Vogt, 207) Nominal Variable is another term for categorical (or discrete or a
qualitative) variable. See nominal scale.
Nominal scale (or level of measurement) (Vogt, 207). A scale of measurement in which
numbers stand for names, but have no order or value. E.g., coding female = 1 and male = 2 would
be a nominal scale; females do not come first, two females do not add up to a male, and so on.
The numbers are merely labels. See categorical variable.
Categorical Variable (Vogt, 39) A variable that distinguishes among subjects by sorting them
into a limited number of categories, indicating type or kind, as religion can be categorized:
Buddhist, Christian, Jewish, Muslim, Other, None. Breaking a continuous variable, such as age,
into categories is a common practice, but since this involves discarding information it is
usually not a good idea. Also called discrete or nominal variable.
b) Discrete Variable (Vogt, 92) Commonly, another term for categorical (or nominal) variable.
Compare continuous variable.
More formally, a discrete variable is one made up of distinct and separate units or categories.
When a variable is discrete, only a finite number of values separate any two points. While all
categorical variables are discrete, in some usages there might be dispute about whether to label
particular variables discrete or continuous. This matters because it determines appropriate
statistical techniques.
E.g., the number of people in a family is clearly a discrete variable. So are flips of a coin; if you
flip a coin 10 times you can't get 3.27 tails. But the distinction is not always so clear. Take
personal income. It looks like a continuous variable, and it is usually treated as one in research.
Millions of possible values stretch from zero to Bill Gates's income. More strictly, however,
income is discrete. Income does not come in units smaller than one cent; there is only one value
between $411.01 and $411.03 ($411.02). Thus, while income is measured on a ratio scale, it is a
discrete variable. By contrast, weight is truly continuous: no matter how close two people's
weights might be, there is always an intermediate value, although an ordinary scale might not
capture it. Because of limits in how accurately we can measure, all measurements are discrete in
practice.
c) Ordinal Variable (Vogt, 222) - A variable that is measured using an ordinal scale, such as the
shirt sizes of small, medium, large, and extra large.
Ordinal scale (or level of measurement) (Vogt, 222). A way of measuring that ranks subjects
(puts them in order) on some variable. The differences between the ranks need not be equal (as
they are in an interval scale). Team standings or scores on an attitude scale (highly concerned,
very concerned, concerned, etc.) are examples.
A question that sometimes arises in statistical analyses is whether ordinal variables ought to be
considered continuous. A rule of thumb is that if there are many ranks, it is permissible to treat
the variable as continuous, but such rules of thumb leave much room for disagreement.
d) Interval Scale (or level of measurement) (Vogt, 158) A scale or measurement that describes
variables in such a way that the distance between any two adjacent units of measurement (or
intervals) is the same, but in which there is no meaningful zero point. Scores on an interval
scale can meaningfully be added and subtracted, but not multiplied and divided. Compare ratio
scale.
E.g., the Fahrenheit temperature scale is an interval scale because the difference, or interval,
between (say) 72 and 73 degrees is the same as that between 20 below 0 and 21 below 0. Since
there is no true zero point (zero is just a line on the thermometer), it is an interval, not a ratio,
scale. There is a zero on the thermometer, of course, but it is not a true zero; when it's zero
degrees outside, there is still some warmth, more than when it's 20 below zero.


E.g., if on a 20-item vocabulary test Mr. A got 12 correct and Mr. B got 6 right, it would be
correct to say that A answered two times as many correctly, but it would not be correct to say that
A's vocabulary was twice as large as B's - unless the test measured all vocabulary knowledge and
getting a zero on it meant that a person had no vocabulary at all (in that case the test would be an
example of a ratio scale).
e) Ratio Scale (or level of measurement) (Vogt, 264) A measurement or scale in which any two
adjoining values are the same distance apart and in which there is a true zero point. The scale gets
its name from the fact that one can make ratio statements about variables measured on a ratio
scale. See interval scale, level of measurement.
E.g., height measured in inches is measured on a ratio scale. This means that the size of the
difference between being 60 and 61 inches tall is the same as between being 66 and 67 inches tall.
And, because there is a true zero point, 70 inches is twice as tall as 35 inches (a ratio of 2 to 1).
The same kind of ratio statements cannot be made, for example, about measures on an ordinal
scale. The person who is second tallest in a group is probably not twice as tall as the person who
is fourth tallest.
Statistics appropriate for each level of measurement
(adapted from http://web.uccs.edu/lbecker/SPSS/scalemeas.htm):

NOMINAL
[Crosstabs] chi-square; phi; Cramér's V; contingency coefficient, CC; lambda;
uncertainty coefficient, UC; kappa; likelihood ratio, LR; Goodman & Kruskal tau
[Nonparametric] chi-square; runs; binomial; McNemar; Cochran Q

ORDINAL
[Frequencies] median, Mdn; interquartile range, IQR
[Crosstabs] Spearman's rank-order correlation, rs (formerly rho); Kendall's tau-b;
Kendall's tau-c; Somers' d; gamma; Mantel-Haenszel
[Nonparametric] Kolmogorov-Smirnov; sign; Wilcoxon; Kendall coefficient of
concordance, W; Friedman two-way ANOVA; Mann-Whitney U; Wald-Wolfowitz;
Kruskal-Wallis

INTERVAL
mean, M; standard deviation, SD; Pearson's product-moment correlation, r; t test, t;
analysis of variance, ANOVA; multivariate analysis of variance, MANOVA;
factor analysis; regression; multiple correlation, R

RATIO
coefficient of variation, CFVAR (CFVAR = SD / M)

5) z-Scores (Vogt, 349) (lowercase z) The most commonly used standard score. It is a measure of
relative location in a distribution; it gives, in standard deviation units, the distance from the mean of a
particular score. In z-score notation, the mean is 0 and a standard deviation is 1. Thus, a z-score of
1.25 is one and one quarter standard deviations above the mean; a z-score of -2.0 is 2 standard
deviations below the mean. Therefore, z-scores are especially useful for comparing performance on
several measures, each with a different mean and standard deviation.
E.g., say you took two midterm exams. On the first you got 90 right, on the second 60. If you knew
the means and standard deviations, you could compute z-scores for each of your exams to see which
one you did better on. The procedure is to take your score, subtract from it the mean of all the scores,
and divide the result by the SD:

z = (X - M) / SD

where X is your score, M is the mean, and SD is the standard deviation. The following table shows
how to compute your z-scores and compare them. In this example, you did better (ranked higher in
the class) on your second midterm. Your score of 60 was 2 SDs above the mean; your 90 was only
1 SD above.

COMPUTING YOUR Z-SCORE
First midterm:  X = 90, M = 80, SD = 10; (90 - 80) / 10 = 10/10 = 1
Second midterm: X = 60, M = 42, SD = 9;  (60 - 42) / 9 = 18/9 = 2

T-score (Vogt, 350) A standard score in which the mean of the distribution is 50 and the standard
deviation is 10. The T-score is obtained by transforming the z-score (multiplying z by 10 and adding
50), which is why the T-score is called a transformed standard score. The main advantage of this
transformation is that it eliminates decimals and negative numbers. Sometimes, but not consistently,
called the T score. [Not to be confused with a t-test, which is a test of significance used for
inferential statistics.]
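Both standard scores can be sketched directly from their formulas; the midterm numbers come from the z-score example above:

```python
def z_score(x, m, sd):
    """Distance of a score from the mean, in standard deviation units."""
    return (x - m) / sd

def t_score(z):
    """Transformed standard score: multiply z by 10 and add 50."""
    return 10 * z + 50

z1 = z_score(90, 80, 10)  # first midterm: z = 1
z2 = z_score(60, 42, 9)   # second midterm: z = 2
```

Note how t_score(z2) = 70 carries the same information as z = 2, but without decimals or negative numbers.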
B) Non-Parametric Statistics
Parametric Statistics: (Vogt, 227) Statistical techniques designed for use when data have certain
characteristics - usually when they approximate a normal distribution and are measurable with interval or
ratio scales. Also, statistics used to test hypotheses about population parameters.
Nonparametric Statistics: (Vogt, 210) Statistical techniques designed to be used when the data being
analyzed depart from the distributions that can be analyzed with parametric statistics. In practice, this
most often means data measured on a nominal or an ordinal scale. Nonparametric tests generally have less
power than parametric tests. The chi-squared test is a well known example.
(Salkind, 261) Also called distribution-free statistics. These tests don't follow the same rules (meaning
they don't require the same assumptions of the parametric tests), but the nonparametrics are just as
valuable. The use of nonparametric tests also allows us to analyze data that come as frequencies, such as
the number of children in different grades or the percentage of people receiving social security.
E.g., if we wanted to know whether the number of people who voted for the school voucher in the most
recent election is what we would expect by chance, or if there was really a pattern of preference, we
would then use a nonparametric technique called chi-square. [See Summary table]
1) Chi-square/Z test for proportions: (Vogt, 43) A test statistic for categorical data. As a test statistic it
is used as a test of independence, but it is also used as a goodness-of-fit test. The chi-squared test
statistic can be converted into one of several measures of association, including the phi coefficient,
the contingency coefficient, and Cramér's V.
The chi-squared test is known by many names: Pearson chi-square, X², chi², and χ².
The simplest use of the chi-squared test, illustrated in the following example, occurs when a
researcher wants to see if there are statistically significant differences between the observed (or
actual) frequencies and the expected (or hypothesized, given the null hypothesis) frequencies of
variables presented in a cross tabulation or contingency table. The larger the observed frequency is in
comparison with the expected frequency, the larger the chi-squared statistic. The larger the
chi-squared statistic, the less likely the difference is due to chance, that is, the more statistically
significant it is.

E.g., say that a researcher gives a pass/fail test to a sample of 100 subjects, 42 men and 58 women; 61
subjects pass, and 39 fail. If the researcher were interested in whether there are differences in test
performance by gender, she could use the chi-squared test to test the null hypothesis of no statistically
significant differences between the sexes. To do so, she might arrange the information about her
subjects in Tables C.2, C.3, and C.4. Table C.2 gives the total (or marginal) frequencies for the two
variables. Table C.3 shows what the (approximate) frequencies of passes and fails on the two tests
would have been if the null hypothesis were true, that is, what you would expect if there were no
differences on the exam between men and women. Table C.4 shows the actual or observed number of
men and women who passed or failed the exam.

[Parametric stats: statistical techniques for data that approximate a normal distribution;
measurable with interval or ratio scales.
Nonparametric stats: statistical techniques for data that are NOT normally distributed;
measurable with nominal or ordinal scales. Distribution-free stats.]

Table C.2 Chi-Squared Test (Marginal Frequencies)
        Pass   Fail   Total
Men                    42
Women                  58
Total   61     39      100

Table C.3 Chi-Squared Test (Expected Frequencies)
        Pass   Fail   Total
Men     26     16      42
Women   35     23      58
Total   61     39      100

Table C.4 Chi-Squared Test (Observed Frequencies)
        Pass   Fail   Total
Men     19     23      42
Women   42     16      58
Total   61     39      100

Comparing Table C.3 and Table C.4, it is clear that the actual and expected frequencies are not
identical. For example, 26 men were expected to pass, but only 19 did; 23 women were expected to
fail, but only 16 did, and so on. But are these differences statistically significant, that is, are they
unlikely to have occurred by chance? Conducting the chi-squared test can tell you. The answer
(calculations not shown) is that the null hypothesis of no difference between men and women should
be rejected. The differences are greater than what would be expected by chance alone; they are
significant at the .01 level.
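The "calculations not shown" can be reconstructed as a sketch: each expected frequency is (row total × column total) / n, and the statistic sums (observed - expected)² / expected over the four cells. (The exact expected values, e.g. 25.62 for men passing, differ slightly from the rounded 26 shown in Table C.3.)

```python
# Observed frequencies from Table C.4
observed = {("men", "pass"): 19, ("men", "fail"): 23,
            ("women", "pass"): 42, ("women", "fail"): 16}

row_totals = {"men": 42, "women": 58}   # marginal frequencies (Table C.2)
col_totals = {"pass": 61, "fail": 39}
n = 100

chi2 = 0.0
for (row, col), o in observed.items():
    e = row_totals[row] * col_totals[col] / n  # expected count under H0
    chi2 += (o - e) ** 2 / e

# With 1 degree of freedom, chi2 must exceed 6.63 to be significant at .01
significant_at_01 = chi2 > 6.63
```

The statistic comes out to about 7.56, which exceeds the .01 critical value, matching the conclusion in the text.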
2) Kruskal-Wallis: (Vogt, 166) A nonparametric test of statistical significance used when testing more
than two independent samples. It is an extension of the Mann-Whitney U test, and of the Wilcoxon
test, to three or more independent samples. It is a nonparametric, one-way ANOVA for rank-order
data and is based on medians rather than means. Symbolized H.


3) Friedman Test: (Vogt, 126) A nonparametric test of statistical significance for use with ordinal data
from correlated groups. It is similar to the Wilcoxon test, but can be used with more than two groups.
It is an extension of the sign test and is a nonparametric version of a one-way, repeated measures
ANOVA.
4) Mann-Whitney U Test: (Vogt, 184) A test of the statistical significance of differences between two
groups. It is used when the data for two samples are measured on an ordinal scale. It is a
nonparametric equivalent of the t-test. Although ordinal measures are used with the Mann-Whitney
test, an underlying continuous distribution is assumed. This test is also used instead of the t-test with
interval-level data when researchers do not assume that the populations are normal. It is very similar
to the Wilcoxon test.
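A sketch of how the U statistic is computed from ranks, hand-rolled for illustration (in practice one would use a statistics package, e.g. scipy.stats.mannwhitneyu):

```python
def mann_whitney_u(a, b):
    """Smaller U statistic for two independent samples (ties get average ranks)."""
    combined = sorted(a + b)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1  # find the end of a run of tied values
        ranks[combined[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    r1 = sum(ranks[x] for x in a)  # rank sum for the first sample
    n1, n2 = len(a), len(b)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    return min(u1, n1 * n2 - u1)   # the smaller U is compared against a table
```

If every score in one group outranks every score in the other, U = 0 (maximal separation); heavily overlapping groups give a large U.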
5) Wilcoxon Test: (Vogt, 343) More fully, the Wilcoxon signed-rank or rank-sum test for ordinal
data. A nonparametric test of statistical significance for use with two correlated samples, such as the
same subjects on a before-and-after measure. See the Mann-Whitney U test and the Kruskal-Wallis
test, which require independent samples.
6) Kolmogorov Test: (Vogt, 166) Nonparametric tests (for ordinal data) of whether two distributions
differ and whether two samples may reasonably be assumed to come from the same population; they
are goodness-of-fit tests.
(Salkind, 270) Chi-square is one of many different kinds of nonparametric statistics that help you answer
questions based on data that violate the basic assumptions of the normal distribution or are just too small.
These nonparametric tests are a very valuable tool, and even as limited an introduction as this will provide
you with some assistance.
[See Summary Tables]
From Review: (Chantelle)
There is at least one nonparametric equivalent for each parametric general type of statistic. These tests fall
into the following categories:

Tests of differences between groups (independent samples)

Number of Groups   Type of Data                Parametric Test (continuous data)   Nonparametric Equivalent
2                  Rank                        t-test for independent samples      Mann-Whitney U test
2                  Rank                        t-test for independent samples      Kolmogorov-Smirnov test
More than 2        Rank                        ANOVA/MANOVA                        Kruskal-Wallis test

Tests of differences between variables (dependent samples)

Number of Groups   Type of Data                Parametric Test (continuous data)   Nonparametric Equivalent
2                  Rank                        t-test for dependent samples        Wilcoxon matched pairs
2                  Dichotomous (categorical)   t-test for dependent samples        Chi-square
More than 2        Rank                        Repeated measures ANOVA             Friedman test

Tests of relationships between variables

Number of Groups   Type of Data                Parametric Test (continuous data)   Nonparametric Equivalent
2                  Rank                        Correlation                         Spearman
2                  Dichotomous (categorical)   Chi-square                          Chi-square

C) Univariate Parametric Statistics


Univariate Analysis (Vogt, 333) (a) Studying the distribution of cases of one variable only, for example,
studying the ages of welfare recipients, but not relating that variable to their sex, ethnicity, and so on.
Compare multivariate analysis. (b) Occasionally used in regression analysis to mean a problem in which
there is only one dependent variable, a usage that conflicts with the more common meaning in definition
(a).
1) Sampling distributions
(See Section I, D)

Sampling distributions exist for each particular statistic: mean, variance, correlation coefficient, F
test, etc.
Every sampling distribution has a standard error.
(Vogel) Created with a Monte Carlo study, whereby a large number of equal-sized random samples
are drawn from the population you wish to represent. For each sample, the statistic is computed, and
the statistics are arranged in a frequency distribution so you can see the curve for that
population. Drawing these samples over and over again allows you to finally approximate that
population's sampling distribution.
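The Monte Carlo procedure described above can be sketched in a few lines of Python. The population parameters (mean 100, SD 15) are made up for illustration:

```python
# Draw many equal-sized random samples from a population, compute the
# statistic (here, the mean) for each, and the resulting frequency
# distribution approximates the sampling distribution of that statistic.
import random
import statistics

random.seed(42)
population = [random.gauss(100, 15) for _ in range(100_000)]  # invented population

sample_means = []
for _ in range(5_000):                      # draw 5,000 samples...
    sample = random.sample(population, 30)  # ...each of size n = 30
    sample_means.append(statistics.mean(sample))

# The standard deviation of the sampling distribution is the standard error,
# which should land near sigma / sqrt(n) = 15 / sqrt(30), i.e., about 2.74.
standard_error = statistics.stdev(sample_means)
print(round(statistics.mean(sample_means), 1))  # close to 100
print(round(standard_error, 2))                 # close to 2.74
```

Plotting `sample_means` as a histogram would show the roughly normal curve the packet describes.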

2) Assumptions and their violation


Independence of Observations: all scores are independent of each other
E.g., if a study put all of its subjects in a waiting room and called each person in one at a
time, there is the potential for subjects to talk to each other about the study while they are
waiting together. This causes a violation, as their scores may not be entirely independent of
the influence of other subjects! Another example is when your control group gets the
treatment. This assumption usually gets violated in the design of the study.
Violations are VERY BAD! We cannot correct this violation statistically. Instead, you must
rerun the experiment with changes made to the design!

Normality: distribution should be mesokurtic (see normality of data section)

a.k.a. normality of distribution
Violations include any distributions that are positively/negatively skewed, leptokurtic,
platykurtic, or bimodal
ANOVAs are robust (resistant) to violations of normality when groups have equal sample
sizes, so no big deal! This does not hold for imbalanced designs, e.g., n = 20 for group 1
and n = 50 for group 2.
If you are being especially conservative, though, you could use a nonparametric test rather
than an ANOVA if the data are not normal. For instance, you could use a Kruskal-Wallis, Mann-Whitney U, or Wilcoxon rank-sum test.

Homogeneity of Variance: the variance within each condition/group is similar to the other
groups; within-group and between-group variances are similar
In a study, you want to make sure that if you have a lot of variance in age or income in one
group, you have the same amount of differences in the other group. If you don't, then you
have to worry that the first group may show significant differences on your dependent
variable merely because of these variations in age and income!
There are several tests for this assumption that can be calculated using SPSS.
 Levene: assesses the differences of scores from each of their means.
 Brown-Forsythe: like Levene's, but it looks at the differences of scores from the median.
 F Max: biggest variance/smallest variance (if the biggest variance is 4-10x greater, then you
have a violation!)
Early work on the effects of violating the assumption of equal variances suggested that
the F test was relatively insensitive to the presence of variance heterogeneity, except
when unequal sample sizes were involved. The F test becomes seriously biased in the
positive direction when the largest within-group variance divided by the smallest
within-group variance (Fmax) is 9 or greater. What this means is that the actual significance level
revealed by Monte Carlo studies is substantially greater than the level chosen by
researchers to evaluate an obtained F; with the significance level set at α = .05, for
instance, the actual significance level may be as high as .08 or even higher. (Keppel, 98)
You can correct violations of homogeneity of variance by transforming the data (square
root transformation)
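A sketch of two of the checks above: Levene's test via scipy and the Fmax ratio computed by hand. The group scores are invented, with group 2 deliberately more spread out:

```python
# Check homogeneity of variance two ways on made-up data.
import statistics
from scipy import stats

group1 = [12, 14, 15, 13, 16, 14, 12, 15]
group2 = [8, 22, 5, 25, 10, 20, 6, 24]   # much more spread out

# Levene's test: a small p value suggests the variances differ
stat, p = stats.levene(group1, group2)

# Fmax: largest group variance divided by smallest group variance
v1 = statistics.variance(group1)
v2 = statistics.variance(group2)
fmax = max(v1, v2) / min(v1, v2)
# By the rule of thumb above, an Fmax of 9 or greater signals a serious violation.
```

Here `fmax` comes out far above 9, so a transformation (e.g., square root) or a nonparametric test would be in order.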

3) T-tests
(Vogt, 329) A test of statistical significance, often of the difference between two group means, such as
the average score on a manual dexterity test of those who have and have not been given caffeine. Also
used as a test statistic for correlation and regression coefficients.
A two-tailed t-test is used to test the significance of a nondirectional hypothesis, that is, a
hypothesis that says there is a difference between two averages without saying which of the two is
bigger. A one-tailed t-test is called directional because it tests the hypothesis that one
of the two group averages is bigger. When the researcher is uncertain about which is larger, the two-tailed test should be used.
There are several formulas for t. The one to use depends on the nature of the data and the groups being
studied, most often upon whether the groups are independent or correlated.
(Salkind, 163) Almost every statistical test has certain assumptions that underlie the use of the test.
E.g., the t test has a major assumption that the amount of variability in each of the two groups is equal.
This is the homogeneity of variance assumption. Although this assumption can be violated if the
sample size is big enough, small samples and a violation of this assumption can lead to ambiguous
results and conclusions.

Basic Design:
 One Independent Variable (IV)
 One Dependent Variable (DV)
 Stats = t (t2 = F)

Statistic Options:
 Single Sample: used when you only have one group to compare to normative data
 Independent Samples: used when you compare groups (same as 1-way ANOVA)
 Paired (Related): used when your only IV is a repeated measure (subjects serve in every
condition); same as a 1-way ANOVA with 1 RM

Single Sample T Test Examples:


Independent Samples T Test Examples:
 Researchers are interested in comparing the number of antibodies following an influenza shot
in the corporate world. Corporate employees were randomized to receive Mindfulness Based
Stress Reduction therapy or Music therapy. They were all given a flu shot and antibody levels
were tested 3 months later.
 300 participants are given a questionnaire to assess their level of emotional impact from 9/11.
The scores for firemen were compared with the scores of policemen.

Paired Sample T Test Example:


 Students are compared on depression, as measured by the BDI (Beck Depression Inventory),
before and after they finish their statistics project.
 You are investigating the sleep efficiency of women diagnosed with nonmetastatic breast
cancer. You assign a sleep efficiency score before and after a 4-week sleep hygiene class.
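The three variants above can be sketched with scipy (all scores below are made up; the function names are real scipy.stats calls):

```python
# One made-up data set per t-test variant.
from scipy import stats

# Single sample: one group compared against a normative value (here, 100)
scores = [104, 98, 110, 95, 102, 108, 99, 105]
t1, p1 = stats.ttest_1samp(scores, popmean=100)

# Independent samples: two separate groups (e.g., two therapy conditions)
mbsr  = [45, 52, 48, 50, 47, 55, 49]
music = [40, 42, 38, 44, 41, 39, 43]
t2, p2 = stats.ttest_ind(mbsr, music)

# Paired (related) samples: the same subjects measured twice (pre/post)
pre  = [20, 25, 18, 30, 22, 27]
post = [15, 20, 16, 24, 18, 21]
t3, p3 = stats.ttest_rel(pre, post)
```

In each case the sign of t tells you which mean is larger, and the p value is two-tailed by default.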


Determining That a t Test Is the Correct Statistic (Salkind, 162)

Are you examining relationships between variables, or examining the difference between groups on one
or more variables?

I'm examining relationships between variables.
  How many variables are you dealing with?
    Two variables: t-test for the significance of the correlation coefficient
    More than two variables: regression, factor analysis, or canonical analysis

I'm examining the differences between groups on one or more variables.
  Are the same participants being tested more than once?
    Yes. How many groups are you dealing with?
      Two groups: t-test for dependent samples
      More than two groups: repeated measures analysis of variance
    No. How many groups are you dealing with?
      Two groups: t-test for independent samples
      More than two groups: simple analysis of variance

4) ANOVA
From Review: (Chantelle)
3 major assumptions you make when running an ANOVA (robust = how sensitive a test is to a violation;
very robust = not sensitive, so the violation does not bias the results):
Normality of distribution: ANOVA is NOT very sensitive to violations of normality; ANOVA
is most robust to normality.
You can fix a violation by transforming the data or by using a nonparametric test.
You can test for a violation by eyeballing/plotting the data with histograms.
For univariate normality, use Z scores and charts, histograms, examination of residuals
(scatter plots), skewness, and kurtosis; fix through data transformation.
Bivariate normality is checked through a scatter plot matrix, looking for elliptical shapes.
Multivariate normality is assessed through Mahalanobis distance, checking for outliers.
Independence of sampling: ANOVA is VERY sensitive to violations of independent sampling.
Can we fix a violation of independence post hoc? NO; it has to do with the design of the
experiment or a lack of random sampling.
It does NOT improve with increasing your sample size.
Homogeneity of variance: ANOVA IS sensitive to violations of homogeneity. You can fix a
violation through transformations:
*rank order
*category
*square root
*logarithm
*nonparametric tests, which are NOT sensitive to violations
(Vogt, 8-11) A test of the statistical significance of the differences among the mean scores of two or
more groups on one or more variables or factors. It is an extension of the t-test, which can handle only
two groups at a time, to a larger number of groups. More specifically, it is used for assessing the
statistical significance of the relationship between categorical independent variables and a continuous
dependent variable. The procedure in ANOVA involves computing a ratio (F ratio) of the variance
between the groups (explained variance) to the variance within the groups (error variance). See one-way ANOVA, two-way ANOVA. ANOVA is equivalent to multiple regression with dummy-coded
independent variables and a continuous dependent variable.
E.g., a professor tried different teaching methods. He randomly assigned members of his class of 30
students to three groups of 10 students each. All three groups were given the same required readings,
but class time was spent differently in each. Group 1 (Discuss) spent class time in directed discussions
of the assigned readings. Group 2 (No Class) was excused from any obligation to attend classes for
the first half of the semester, but they were given additional text materials they could use to help them
understand the assigned readings. Group 3 (Lecture) was taught by traditional lecture methods. The
students' scores on the midterm examination are listed in Table A.1. Students in the three groups
obviously got different average scores. The professor wanted to know whether the differences were
statistically significant, that is, whether they were bigger than would be likely due to chance alone. To
find out, he entered the information from the table into his computer and conducted an ANOVA. As
Table A.2 shows, the results were highly statistically significant (at the p < .001 level); the teaching
methods almost certainly made a difference.
Table A.1 Analysis of Variance: Students' Midterm Scores, by Method of Instruction

         Group 1 (Discuss)   Group 2 (No Class)   Group 3 (Lecture)
                94                   78                   87
                92                   76                   85
                91                   72                   84
                89                   71                   84
                88                   68                   81
                88                   68                   80
                86                   67                   80
                86                   66                   79
                83                   64                   72
                83                   60                   68
Total          880                  690                  800
Mean            88                   69                   80

Table A.2 ANOVA Summary Table: Three Teaching Methods

Source (of variance)          SS      Df       MS         F
Between Groups              1820       2    910.00    35.10*
Within Groups (error)        700      27     25.93
Total                       2520      29
*p < .001

How to Read an ANOVA Summary Table. Source means source of the variance. Between groups
is explained variance, that is, explained by the treatments the different groups received. Within
groups is unexplained or error variance, since differences among individuals within a group cannot be
explained by differences in the treatments the groups received. SS is sum of squares (total of squared
deviation scores). Degrees of freedom are abbreviated to Df. MS stands for mean squares, which
are calculated by dividing the SS by the Df. F is the F ratio of the MS between to the MS within,
which is statistically significant at the .001 level (p<.001).
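The professor's ANOVA can be reproduced from the Table A.1 scores with scipy's one-way ANOVA (the group lists below are the midterm scores as reconstructed from the table's totals and means):

```python
# One-way ANOVA on the three teaching-method groups; should give F of about 35.10.
from scipy import stats

discuss  = [94, 92, 91, 89, 88, 88, 86, 86, 83, 83]   # mean 88
no_class = [78, 76, 72, 71, 68, 68, 67, 66, 64, 60]   # mean 69
lecture  = [87, 85, 84, 84, 81, 80, 80, 79, 72, 68]   # mean 80

f, p = stats.f_oneway(discuss, no_class, lecture)
print(round(f, 2))   # 35.1
print(p < 0.001)     # True
```

The F and p values match the summary table: MS between / MS within = 910.00 / 25.93 ≈ 35.10, significant at p < .001.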
F Ratio. (Salkind, 198) The logic behind this ratio goes something like this. If there was absolutely no
variability within each group (all the scores were the same), then any difference between groups would
be meaningful, right? Probably so. The ANOVA formula (which is a ratio) compares the amount of
variability between groups (which is due to the treatment) with the amount of variability within groups
(which is due to chance). If that ratio is 1, then the amount of variability due
to within-group differences is equal to the amount of variability due to between-group differences, and
any difference between groups would not be significant. As the average difference between groups gets
larger (and the numerator of the ratio increases in value), the F value increases as well. As the F value
increases, it becomes more extreme in relation to the distribution of all F values and is more likely due
to something other than chance.
Relationship between t value and F value. (Salkind, 203) The t value (which is always used for the test
of the difference between the means of two groups) and the F value (which is used for two or more
groups) are related. Interestingly enough, for the two-group situation, an F value
is equal to the t value squared, or F = t2.
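This relationship is easy to verify numerically with two made-up groups:

```python
# Check F = t squared for the two-group case.
from scipy import stats

a = [5, 7, 6, 8, 9, 6]
b = [3, 4, 2, 5, 3, 4]

t, _ = stats.ttest_ind(a, b)   # equal-variance t-test
f, _ = stats.f_oneway(a, b)    # one-way ANOVA on the same two groups

# The two statistics agree up to floating-point rounding.
assert abs(t**2 - f) < 1e-9
```

So running a one-way ANOVA on two groups gives exactly the same conclusion as the independent-samples t-test.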
Different Flavors of ANOVA. (Salkind, 194-197) ANOVA comes in many different flavors. The
simplest kind is simple analysis of variance, where there is one factor or one treatment variable (such as
group membership) being explored, and there are more than two groups within this factor. Simple
ANOVA is also called one-way analysis of variance because there is only one grouping dimension. The
technique is called analysis of variance because the variance due to differences in performance is
separated into variance thats due to differences between individuals within groups and variance due to
differences between groups. Then the two types of variance are compared with one another.
In fact, ANOVA is, in many ways, similar to a t test. In both procedures, differences between means are
computed. But with ANOVA, there are more than two means.
E.g., let's say we were investigating the effects on language development of being in preschool for 5,
10, or 20 hours per week. The group the children belong to is the treatment variable, or the grouping
factor. Language development is the dependent variable, or the outcome. The experimental design looks
something like this:
Group 1 (5 hours per week)    Group 2 (10 hours per week)    Group 3 (20 hours per week)
Language development test     Language development test      Language development test
score                         score                          score

The more complex type of ANOVA is called a factorial design, where there is more than one treatment
factor being explored (more than one IV). Here's an example where the effect of number of hours of
preschool participation is being examined, but the effects of gender differences are being examined as
well. The experimental design looks something like this:

                        Number of Hours of Preschool Participation
Gender     Group 1 (5 hours per week)   Group 2 (10 hours per week)   Group 3 (20 hours per week)
Male       Language development         Language development          Language development
           test score                   test score                    test score
Female     Language development         Language development          Language development
           test score                   test score                    test score

This factorial design is described as a 3x2 factorial design. The 3 indicates that there are 3 levels of one
grouping factor (Group 1, Group 2, and Group 3). The 2 indicates that there are two levels of the other
grouping factor (male and female). In combination, there are 6 different possibilities (males who spend
5 hours per week in preschool, females who spend 5 hours per week in preschool, males who spend 10
hours per week in preschool, etc.)

These factorial designs follow the same basic logic and principles of simple ANOVA, but they are just
more ambitious in that they can test the influence of more than one factor at a time as well as a
combination of factors.

Basic Design (Vogel)


 ANOVAs are used when you are looking for differences between 2 or more groups (cannot
use ordinal or nominal data! Only interval or ratio data!)
 Independent Variables (IV): comparison variable
One IV = One way ANOVA
2+ IVs = Factorial ANOVA
A Within group variable = Repeated Measure


Nomenclature:
(# of IVs)-Way ANOVA with (# of within group IVs) Repeated Measures
So, if you have 2 IVs (One is treatment group and one is time) you will have a 2-way
ANOVA with 1 RM

One Dependent Variable (DV): measure

 E.g., in a study looking at the relationship between gender and depression:


IV = gender
DV = depression.
A one-way ANOVA with 0 Repeated Measures
___-Way  ___NOVA  with ___ Repeated Measures
   |         |             |
# of IVs  # of DVs    # of within-group IVs
                      (between group vs. within group)

One-Way ANOVA examples:


You want to compare fibromyalgia, myofascial pain, and pain-free individuals on coping styles. You
administer a coping inventory and compare the 3 groups on their level of avoidant coping.
IV = diagnosis (fibromyalgia vs. myofascial pain vs. pain free)
DV = level of avoidant coping

Factorial ANOVA and Repeated Measure examples: Repeated measure means that subjects
participate in EVERY condition. For instance, therapy group as an IV is usually not a repeated
measure because subjects are randomly assigned to one of the treatments. However, any time you
see pretest, posttest, and follow-up you know that every subject did a pretest, posttest, follow-up!
That is a repeated measure!
We are interested in the effects of yoga on sleep. 40 subjects are randomized to participate
in either a yoga class or a psychoeducational group. Subjects' sleep is assessed using the Sleep Inventory
one week before groups begin, one week after the last group session, and 3 months later.
2 WAY ANOVA W/ 1 REPEATED MEASURE!!!
IV #1 = treatment (yoga vs. psychoeducation)
IV #2 = Time (pretreatment vs. posttreatment vs. 3 month follow-up) RM
DV = sleep inventory score
Researchers are studying the effects of 3 treatments for generalized anxiety disorder. Subjects
with GAD are randomized into a CBT, group discussion, or wait list control group. All subjects
take a worry questionnaire at pretreatment, posttreatment, and 6 months after treatment.
2 WAY ANOVA W/ 1 REPEATED MEASURE

IV #1 = treatment (CBT vs, Group vs. Wait List Control)


IV #2 = Time (pretreatment vs. posttreatment vs. 6 month follow-up) RM
DV = worry questionnaire

Running an ANOVA: (Vogel) ANOVAs assess for group differences by comparing the means of
each group. This involves partitioning (spreading out) the variance into different sources. The
following are all the sources of variance and other statistics of interest when running an
ANOVA.

Sum of Squares (SS): (Vogt, 305) Sum of squares, the sum of the squared deviation of scores
from the mean. (Vogt, 317) The result of adding together the squares of deviation scores.
Analysis of variance is in fact analysis of sums of squares. Not to be confused, as it often is when
doing calculations, with the square of sums, that is, all the scores first added together to get a
sum, which is then squared.
E.g., the following table lists the scores on a test, calculates the mean, subtracts the mean from
each score, squares each of those results, and adds (sums) these numbers. Computing the sum of
squares is a step on the way to calculating the variance and the standard deviation. The variance is
found by dividing the sum of squares by the number of scores minus 1 (238 divided by 6 = 39.67
in this example), and the standard deviation is calculated by taking the square root of the variance
(which equals 6.3 in this example).
Sum of Squares

Scores              Minus Mean    Deviation Score    Deviation Score Squared
88                  - 80           8                  64
86                  - 80           6                  36
84                  - 80           4                  16
80                  - 80           0                   0
77                  - 80          -3                   9
73                  - 80          -7                  49
72                  - 80          -8                  64
560 (sum)
560/7 = 80 (mean)                                    238 (sum of squares)
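The worked example above, computed directly in Python:

```python
# Sum of squares, variance, and standard deviation for the seven test scores.
scores = [88, 86, 84, 80, 77, 73, 72]

mean = sum(scores) / len(scores)                 # 560 / 7 = 80
deviations = [s - mean for s in scores]          # 8, 6, 4, 0, -3, -7, -8
sum_of_squares = sum(d**2 for d in deviations)   # 238

variance = sum_of_squares / (len(scores) - 1)    # 238 / 6, about 39.67
std_dev = variance ** 0.5                        # about 6.3
```

Dividing by n - 1 rather than n gives the sample variance, which is what the packet's example uses.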

Between Sum of Squares: (Vogt, 25) A measure of between-group differences. It is


calculated by squaring and summing deviation scores. It is used in comparison to withingroup differences to compute the F ratio in an analysis of variance. Symbolized SSbetween.

Variance: (Vogel) spread of scores, differences


 The first value calculated to represent variance is called the Sum of Squares. Then, you
must keep partitioning down the variance into what is known as the Mean Squares. This
is kind of a purer estimate of variance.
Mean Square (MS) (Vogt, 189) Short for the mean of the squared deviation scores, that
is, the variance. The variance is most often referred to as the MS in an ANOVA.


(Vogel)You will get a Sum of Squares and Mean Square to estimate the variance within
each group as well as between the groups. You can imagine that the goal would be to see
small differences within each group but a large variance between the groups so that you
can say that they are significantly different from each other.

Sum of Squares (SS): measures variability


 Between Groups: sum of squared deviations of groups means from grand mean
 Within Groups: sum of squared deviations of individual scores from group mean
 Total SS: SS between + SS within

Degrees of Freedom (Df)


 a= groups, n= subjects
 Between Groups: a-1
 Within Groups: a(n-1)
Mean Square (MS): purer form of SS
 Between Groups: SS/ Df (This is also called the treatment effect)
 Within Groups: SS/ Df (This is also called the error term)
Omnibus F: this is the statistic that tells us if the groups are different
 MS between /MS within
 treatment effect + error/error
 Are my groups significantly different from each other?
If F = 1, Null hypothesis is true (error/error)
If F > 1.0, Null is false (treatment effect + error/error)
EXAMPLE
You want to compare 10 fibromyalgia, 10 myofascial pain, and 10 pain-free individuals on
coping styles. You administer a coping inventory to all 30 subjects and compare the 3
groups on their level of avoidant coping.
IV = diagnosis (fibromyalgia vs. myofascial pain vs. pain free)
DV = level of avoidant coping
Sum of Squares = found in SPSS output
Degrees of Freedom (Between) = 3 - 1 = 2
Degrees of Freedom (Within) = 3(10 - 1) = 27
Mean Squares Between = 90/2 = 45
Mean Squares Within = 522/27 = 19.33
Omnibus F = 45/19.33 = 2.33

                  Sum of Squares   Degrees of Freedom   Mean Squares   Omnibus F
Between Groups          90                  2               45.00         2.33
Within Groups          522                 27               19.33
Total                  612                 29
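Applying the degrees-of-freedom and mean-square formulas above to this example's sums of squares (a = 3 groups, n = 10 subjects per group, SS between = 90, SS within = 522):

```python
# Build the ANOVA summary quantities from a, n, and the two sums of squares.
a, n = 3, 10
ss_between, ss_within = 90, 522

df_between = a - 1                       # 2
df_within = a * (n - 1)                  # 27
ms_between = ss_between / df_between     # 45.0 (treatment effect)
ms_within = ss_within / df_within        # about 19.33 (error term)
omnibus_f = ms_between / ms_within       # about 2.33
```

The resulting F would then be compared against the critical value for (2, 27) degrees of freedom, as described below.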

Determining Significance: how sure you are that a real difference exists (alpha level)
 For every study, you must determine the critical value that your Omnibus F must
exceed. This critical value is based upon the degrees of freedom in your study. Let's say your
critical value for the above study (determined using a special table) is 3.45 with an alpha level
of .05; your omnibus F must be 3.45 or greater to be statistically significant. In this case, you
would say that you are 95% (alpha = .05) sure that the group differences are real! Alpha
levels of .05 mean that for every 100 tests, 5 will be significant by chance alone! Likewise, an
alpha of .01 means that for every 100 tests, 1 will be significant by chance alone!


If Omnibus F is not significant, check your power to see if the study was sensitive
enough to detect a real difference anyway!
Power = [n(magnitude of effect)/within group variance]
Need high sample size and effect size combined with low WG variance for good power
Usually want about .80 power!

If Omnibus F is significant, check your magnitude of effect to see how big the
differences (treatment effect) are.
Magnitude of effect = effect size
Calculated using R2 or omega2
Decimal format (example = .60 means that 60% of the variance in the DV is
accounted for by the IV)

Factorial ANOVA (feel free to make up your own variables for the example below)
Basic Design: 2+ IVs
 X axis = IV 1
 Separate Lines = IV 2
 Y axis = DV
Sum of Squares: measures variability
 SS 1: sum of squared deviations of groups means from grand mean for IV(1)
 SS 2: sum of squared deviations of groups means from grand mean for IV(2)
 SS (1x2): sum of squares for the interaction between IV1 and IV2

[Figure: interaction plot of marginal means (y-axis) across the levels of Variable A (x-axis), with separate lines for levels 1.00 and 2.00 of Variable B]
1. Interaction: levels of one IV differ across levels of another IV


 In this e.g., the levels 1 and 2 of Variable B change in a different pattern across Variable A
 Is there an interaction AxB?
 Yes, if the lines for the levels of B intersect at some point across the levels of A
 No, if there is no intersection of the lines at all!
2. If there is NO significant interaction,
 Main effects: There are differences across levels of one IV
 In this example, there is a main effect for Variable A because the levels differ and there is a
main effect for Variable B because the levels differ
 Look for main effects of A and/or B (marginal means or SPSS output)
 If there are no main effects, stop analyzing and check your power (Power should be around
.80, meaning you have an 80% chance of finding real differences)
3. If there is a significant interaction,
 Simple Effect: effects of one level of an IV across every level of another
 Test Simple Effects within AxB interaction
 Comparing the Bs (A at B1 and A at B2)



Comparing the As (B at A1, B at A2, and B at A3)


4. If there are significant simple effects, test for simple comparisons (only when a variable has more
than 2 levels)
 Don't forget to make an alpha correction to account for multiple comparisons (Post Hoc/Planned
Comps)
 A1 vs. A2 at B2, A2 vs. A3 at B1, etc. (comparisons made within one variable)
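A toy numeric sketch of reading an interaction from cell means (all cell means below are invented): the simple effect of B at each level of A is the B1 minus B2 difference, and an interaction is suggested when that difference changes across levels of A:

```python
# Cell means for a 3x2 design: (level of A, level of B) -> mean of the DV.
cell_means = {
    ("A1", "B1"): 2.0, ("A1", "B2"): 4.0,
    ("A2", "B1"): 3.0, ("A2", "B2"): 3.0,
    ("A3", "B1"): 4.5, ("A3", "B2"): 1.5,
}

# B1 - B2 difference at each level of A (the simple effects of B)
diffs = [cell_means[(a, "B1")] - cell_means[(a, "B2")] for a in ("A1", "A2", "A3")]
print(diffs)  # [-2.0, 0.0, 3.0] -- the difference reverses, so the lines cross

# If all differences were (roughly) equal, the lines would be parallel:
# no interaction, only possible main effects.
interaction_suggested = max(diffs) != min(diffs)
```

Whether such a pattern is statistically significant is, of course, what the factorial ANOVA's interaction F test decides.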

Running Planned Comparisons


 Only run planned comparisons if the Omnibus F is significant!
 We know that our groups are significantly different. Now, we need to have a specific plan of
attack to find where the differences are! Test for differences between levels of your IVs. For
instance, if therapy was your Variable A, then you'd want to compare CBT versus
Meditation, Meditation versus Control, and possibly CBT versus Control, etc.



Familywise or Experimentwise Error


 (Vogt, 120) The probability that a Type I error has been committed in research involving
multiple comparisons. Family in this context means group or set of related statistical
tests. Also called experimentwise error.
E.g., if you set your alpha level at .05 and make three comparisons using the same data, the
probability of familywise error is roughly .15 (.05 + .05 + .05); more precisely, 1 - (.95 x .95 x .95),
or about .14. One way around this problem is to lower the alpha level to, say, .01. But this increases
the probability of Type II error. A better alternative is the Scheffé test.


(Salkind, 208) OK, so you've run an ANOVA and you know that there is an overall
difference between the means of three or four or more groups. But where does that difference
lie? You already know not to perform multiple t tests. You need to perform what are called
post hoc, or after-the-fact, comparisons. Here's where each mean is compared to each other
mean so you can see where the difference lies, but what's most important is that the Type I
error for each comparison is controlled at the level you set. There are a bunch of
these different comparisons, including Bonferroni, and you compute them using SPSS. It's really
simple to see how this analysis tells you, for example, that the significant pairwise difference
contributing to the overall significant difference among three groups lies between Group 1 and
Group 3, with no pairwise difference between Groups 1 and 2 or Groups 2 and 3. This pairwise
stuff is very important because it allows you to understand the source of the difference between
more than two groups.

(Vogel) Allowed a-1 comparisons (remember, a = # of groups) without having to correct for
Type I error
 The more comparisons you run, the greater the chance for Type I error (alpha of .05 means
that 5/100 comparisons will be significant by chance alone!)
 If you have a few more than a-1 allowed comparisons, then you can run these planned
comparisons as long as you use a correction for Type I error
Bonferroni: strict, stringent, conservative
Sidak-Bonferroni: not as stringent, more Type I but less Type II compared with Bonferroni
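The familywise-error arithmetic above, plus the Bonferroni and Sidak corrections, in a few lines (k = 3 comparisons at a nominal alpha of .05):

```python
# Familywise error rate and two per-comparison alpha corrections.
alpha, k = 0.05, 3

# Probability of at least one Type I error across k independent tests
familywise = 1 - (1 - alpha) ** k          # about 0.143 (roughly .05 * 3)

# Bonferroni: divide alpha by the number of comparisons (conservative)
bonferroni_alpha = alpha / k               # about 0.0167

# Sidak: slightly less stringent than Bonferroni
sidak_alpha = 1 - (1 - alpha) ** (1 / k)   # about 0.0170
```

Each comparison is then evaluated against the corrected alpha rather than .05, which holds the familywise rate near the nominal level.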


Running Post Hoc Comparisons


These comparisons are decided upon after the fact (like predicting the horse race after it's over!)
 Familywise/Experimentwise error is high!
 The type of correction used depends on the comparisons:
Dunnett's: only treatment versus control
Tukey: all possible pairwise comparisons (any pairs can be done)
Fisher-Hayter: same as Tukey but uses (a - 1)
Scheffé: no logic required, any comparison
You have to find a good balance between making a Type I versus a Type II error!!!
In degree of conservativeness: Scheffé > Tukey > Fisher-Hayter > Dunnett's. The more
conservative the correction, the lower the risk of Type I error but the higher the risk of
Type II error.

Repeated Measures ANOVA


Design
 Doesn't matter how many IVs
 What matters is how many IVs are within group (do subjects serve in every condition?
Usually time is a repeated measure IV)
 The most common within-group IV is time (pre, post, follow-up)
Advantages
 Can use a smaller sample size because you have more control over subject variability
 Comparing each person's score against their own previous score, not someone else's
 Great if you want to study learning or practice effects
Disadvantages
 Practice effects: subjects show improvement over time or become bored/fatigued
 Carryover effects: performance on one measure impacts the next
 Counterbalance measures to take care of this problem
Assumptions
 Same as ANOVA, but now we include sphericity!
 Sphericity: everyone stays in their relative rank (if you were the most anxious on the
measure, you will be the most anxious in every other condition as everyone fluctuates;
like swimming in a school of fish)
The Greenhouse-Geisser procedure tests for this assumption, and if it is violated we look at
the Huynh-Feldt values because they correct slightly for violations (see SPSS for Huynh-Feldt)

5) ANCOVA
(Vogt, 8) An extension of ANOVA that provides a way of statistically controlling the (linear) effects
of variables one does not want to examine in a study. These extraneous variables are called
covariates, or control variables. ANCOVA allows you to remove covariates from the list of possible
explanations of variance in the dependent variable. ANCOVA does this by using statistical techniques
(such as regression) to partial out the effects of covariates rather than direct experimental methods to
control extraneous variables.
ANCOVA is used in experimental studies when researchers want to remove the effects of some
antecedent variable. E.g., pretest scores are used as covariates in pre-/posttest experimental designs.
ANCOVA is also used in nonexperimental research, such as surveys of nonrandom samples, or in
quasi-experiments when subjects cannot be assigned randomly to control and experimental groups.
Although fairly widespread, the use of ANCOVA for nonexperimental research is controversial. All
ANCOVA problems can be handled with multiple regression analysis using dummy coding for the
nominal variables, and, with the advent of powerful computers, this is a more efficient approach.
Because of this, ANCOVA is now used less frequently than in the past.

May 2007

METHODOLOGICAL CONSIDERATIONS

57

(Severino) Reduces experimental error by statistical means


Subjects are first measured on the covariate and then randomly assigned to groups without regard
for their scores on the covariate
During the analysis, scores on the covariate are used to
 Adjust estimates of experimental error
 Adjust treatment effect for any differences between the treatment groups that existed prior to
the experimental treatment
 Uses linear regression
The covariate should be correlated with the DV but not with any of the IVs

Assumptions of ANCOVA
Independence of observation > Durbin-Watson statistic (or) visual inspection of the residuals
Normality > examine the scatter plot matrices (bi-variate)
Homogeneity of Variance > Levene's & Brown-Forsythe
Linearity > visual examination of scatter plots
Homogeneity of Regression: the slope of the regression line (beta) is assumed to be the same for
each group, condition, cell. (Vogel) states that the relationship between the CV and DV is the
same at every level of the IV (look at multivariate scatter plot for elliptical shapes in SPSS)
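As the Vogt entry above notes, any ANCOVA problem can be handled with multiple regression using dummy coding. A minimal sketch with invented data (solving the normal equations directly rather than calling a stats package):

```python
# ANCOVA as regression with a dummy-coded group variable (0 = control,
# 1 = treatment). Data and coefficients below are hypothetical.
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def ancova_coefs(group, covariate, y):
    """OLS fit of y = b0 + b1*group + b2*covariate via the normal equations."""
    X = [[1.0, g, c] for g, c in zip(group, covariate)]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]
    return solve3(XtX, Xty)

# Toy data built so that y = 2 + 3*group + 0.5*covariate exactly
group     = [0, 0, 0, 1, 1, 1]
covariate = [2, 4, 6, 2, 4, 6]
y         = [2 + 3 * g + 0.5 * c for g, c in zip(group, covariate)]
b0, b1, b2 = ancova_coefs(group, covariate, y)
```

Here b1 is the group effect after the covariate's linear effect has been partialled out, which is exactly the covariate-adjusted treatment effect ANCOVA reports.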
6) Magnitude of effect
Effect Size (Vogt, 103) (a) Broadly, any of several measures of association or of the strength of a
relation, such as Pearson's r or eta. ES is often thought of as a measure of practical significance. (b) A
statistic, often abbreviated d, Cohen's d, D, or delta, indicating the difference in outcome for the
average subject who received a treatment from the average subject who did not (or who received a
different level of treatment). This statistic is often used in meta-analysis. It is calculated by taking the
difference between the control and experimental groups' means and dividing that difference by the
standard deviation of the control group's scores, or by the standard deviation of the scores of both
groups combined (in psychological research it is often the scores of both groups combined). It is
often referred to as the effect size statistic, but it is in reality one of many.
(c) In statistical power analysis, ES is the degree to which the null hypothesis is false.
(Severino) The MOE = importance: the size of the treatment effect
a) Omega squared: (Vogt, 219) A measure of strength of association, that is, of the proportion of
the variability in the dependent variable associated with the variability in the independent
variable. Omega squared ranges from 0 to 1. When it is 0, knowing X (the independent variable)
tells us nothing at all about Y (the dependent variable). When it is 1.0, knowing X lets us predict Y
exactly. The omega squared for a particular study will yield an estimate smaller than either eta
squared or R2.
b) R2: (Vogt, 259) Symbol for a coefficient of multiple determination between a dependent variable
and two or more independent variables. It is a commonly used measure of the goodness-of-fit of a
linear model. Sometimes written R-squared.
E.g., if the R2 between average individual income (the dependent variable) and father's income,
education level, and IQ were .4, that would mean that the effects of father's income, educational
level, and IQ together explained (or predicted) 40% of the variance in the individuals' average
incomes, and that they did not explain 60% of the variance.
c) Eta squared: (Vogt, 108) [η2] A measure of how much of the variance in a DV (measured at the
interval level) can be explained by a categorical (nominal, discrete) IV. It may also be used as a
measure of association between 2 interval variables. Eta squared can be interpreted as a PRE
measure, that is, it tells us how much better we can guess the value of the DV by knowing the IV.
Eta squared in ANOVA is analogous to R2 in multiple regression; it is an estimate of the variance
associated with all the IVs taken together.


(Vogel) The Magnitude of Effect (effect size): tells us how big the treatment effect (differences) is
Small  = .01  1% of the variance in scores is due to the treatment
Medium = .06  6% of the variance in scores is due to the treatment
Large  = .15  15% of the variance in scores is due to the treatment
Calculated as R2 or omega squared (ω2), with the former always being a little larger because it
does not take error into account.
Effect size is the proportion of total variability that is due to the treatment!
Not affected by sample size.
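For a one-way between-groups design, "proportion of total variability due to the treatment" is just SS-between over SS-total. A sketch with invented scores:

```python
# Eta squared by hand for a one-way between-groups design.
def eta_squared(groups):
    all_scores = [x for g in groups for x in g]
    grand = sum(all_scores) / len(all_scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    return ss_between / ss_total

# Two invented groups of scores; here 60% of the variance is between groups,
# well past the "large" benchmark above
es = eta_squared([[1, 2, 3], [3, 4, 5]])
```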
D) Correlational Techniques
Correlation and Causation (Vogel)
No discussion of correlation would be complete without a discussion of causation. It is possible for
two variables to be related (correlated), but not have one variable cause another. For example,
suppose there exists a high correlation between the number of popsicles sold and the number of
drowning deaths. Does that mean that one should not eat popsicles before one swims? Not
necessarily. Both of the above variables are related to a common variable, the heat of the day. The
hotter the temperature, the more popsicles sold and also the more people swimming, thus the more
drowning deaths. This is an example of correlation without causation.
Much of the early evidence that cigarette smoking causes cancer was correlational. It may be that
people who smoke are more nervous and nervous people are more susceptible to cancer. It may also
be that smoking does indeed cause cancer. The cigarette companies made the former argument, while
some doctors made the latter.
If a high correlation was found between the age of the teacher and the students' grades, it does not
necessarily mean that older teachers are more experienced, teach better, and give higher grades.
Neither does it necessarily imply that older teachers are soft touches, don't care, and give higher
grades. Some other explanation might also explain the results. The correlation means only that older
teachers give higher grades and younger teachers give lower grades; it does not explain why this is
the case.
Description (Vogel)
 Relationships between variables
 Ranges from -1 to 1, with 0 meaning the variables are unrelated
 Negative versus Positive tells us the direction of the relationship
 Negative = one variable increases while the other decreases
   Freshmen in college find that as their social life increases, their grades begin declining!
 Positive = variables increase at the same time or decrease at the same time
   The more you study for the methodology comp, the better your score will be!
 Closeness to -1 or 1 tells us the strength of the relationship
 Relationships are stronger as they get further from 0, regardless of whether they are positive or
negative!
 Significance level (p) tells us how sure we are that the relationship is real!
Assumptions
 Linearity: data is linear, not curvilinear
 Normality: normal distribution curve for that sample
 Not bimodal, kurtotic (leptokurtotic versus platykurtotic), or skewed (positive or negative)
 Full Range of Scores:
 Restriction of range will cause your correlation to decrease because you did not represent the full
distribution of the population from which you sampled (see example in foundations of
measurement section)


Types:

CORRELATION STATISTICS (RELATIONSHIP BETWEEN 2 VARIABLES)

STATISTIC        DV 1       DV 2       EXAMPLE
Pearson          Score      Score      Is there a relationship between IQ score and GRE score?
Spearman         Rank       Rank       Is there a relationship between shirt size (S, M, L, XL)
                                       and contest results (1st, 2nd, 3rd place)?
Point biserial   Score      Category   Is there a relationship between gender and achievement
                                       score?
Chi-Squared      Category   Category   Is there a relationship between gender (female/male) and
                                       treatment group (treatment/no treatment)?

1) Pearson correlations
(Vogt, 233) Pearson's Correlation Coefficient. More fully, the Pearson product-moment correlation
coefficient. More briefly, Pearson's r.

    r = Σ(z_x · z_y) / (n − 1)        where z = standard score

A statistic, usually symbolized as r, showing the degree of linear relationship between two variables
that have been measured on interval or ratio scales, such as the relationship between height in inches
and weight in pounds. It is called product-moment because it is calculated by multiplying the z-scores
of two variables by one another to get their product and then calculating the average (mean value),
which is called a moment, of these products. Note: Pearson's r is rarely computed this way; the
preceding is known as the definitional, not the computational, formula. See correlation coefficient.
Pearson's correlation is so frequently used that it is often assumed that the word correlation by
itself refers to it; other kinds of correlation, such as Kendall's and Spearman's, have to be specified
by name.
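The definitional formula above can be checked directly in a few lines (data invented; perfectly linear on purpose, so r comes out at its maximum):

```python
# Pearson's r via the definitional formula: sum of z-score products
# divided by n - 1 (using the sample standard deviation throughout).
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = (sum((v - mx) ** 2 for v in x) / (n - 1)) ** 0.5
    sy = (sum((v - my) ** 2 for v in y) / (n - 1)) ** 0.5
    zx = [(v - mx) / sx for v in x]
    zy = [(v - my) / sy for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

r = pearson_r([1, 2, 3], [2, 4, 6])   # perfectly linear data -> r = 1.0
```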

In the following example, showing the association of education levels and birth rates in 100
countries, the points definitely are arranged in a linear pattern. Since the line and the pattern of
points run from the upper left to the lower right, the correlation is negative, as is the regression
coefficient. The correlation coefficient is -.84. This is also the standardized regression coefficient.
The unstandardized regression coefficient is -3.7, which means that, on average, for every one-year
increase in education, the birth rate goes down by 3.7 per 1,000.

[Scatter plot with regression line: x-axis = Mean Years of Schooling, Adults (0 to 14);
y-axis = Births Per 1,000 (0 to 60)]

Correlation and regression are often discussed together. This is because correlation is a special case
of regression. Pearson's r is a standardized regression coefficient (or beta) between two variables.
The essential link between the two is most easily discussed by referring to a scatter diagram that
includes a regression line. If the correlation were perfect (-1.0 or +1.0) all points would be on the
line.


(Vogel) Pearson: relationship between 2 variables (only continuous scores/interval or ratio data, no
nominal or ordinal data)
To assess whether there is a relationship between mood and satiety, a researcher gives 230
college freshmen a mood inventory (scores) and a satiated scale (scores).
We are interested in the relationship between head injury, avoidant coping, and depression several
months following the head injury. Participants with recent head injuries were given the Beck
Depression Inventory and the Coping Inventory.
We are interested in the neural correlates of social exclusion. Subjects are asked to play a virtual
ball toss game while an fMRI records brain activity. Subjects are also asked to rate the level of
social exclusion in the game. How do we look at the relationship between levels of anterior
cingulate cortex activity and the level of rated exclusion?
2) Partial and semi-partial correlations
Partial Correlation (Vogt, 228) Called partial for short. A correlation between two variables after
the researcher statistically subtracts or removes (controls for, holds constant, or partials out) the
linear effect of one or more other variables. The opposite of a partial relation is not a whole
relation, but rather a simple relation, that is, one uncomplicated by considering other variables. The
difference between a partial correlation and a beta weight is that the beta weight is an asymmetric
measure, while the partial is symmetric.
Symbolized r with subscripts. E.g., r12.3 means the correlation between 1 and 2 when variable 3 is
controlled; r13.2 means the correlation between 1 and 3 when 2 is controlled. Another way to put it:
r12.3 = 0.27 means that the correlation between variables 1 and 2 would have been .27 had all the
subjects been alike with respect to variable 3.
Semipartial Correlation (Vogt, 293) A correlation that partials out (controls for) a variable, but only
from one of the other variables being correlated. Also called part correlation. See partial
correlation, which partials out a variable from all other variables. It is computed using multiple
regression analysis.
(Vogel) Relationship between the predictor and outcome when all other predictors are partialled out
of only the original predictor

Gives us the unique contribution of the predictor to the outcome


Pure relationship between predictor and outcome
Smaller relationship than the Partial gives us because we are leaving all of the variance of the
outcome intact. In Partial Correlations we took variance out of both the predictor and outcome!

(Cohen, 72) One of the important problems that arises in multiple regression correlation is that of
defining the contribution of each IV in the multiple correlation. We shall see that the solution to this
problem is not so straightforward as in the case of a single IV, the choice of coefficient depending
on the substantive reasoning underlying the exact formulation of the research questions. One answer
is provided by the semipartial correlation coefficient sr and its square, sr2.

[Venn diagram: overlapping circles for Y, X1, and X2, with areas labeled a, b, c, and e]

The figure here shows that this area is equal to the sum of areas designated a, b, and c. The areas a
and b represent those portions of Y overlapped uniquely by IVs X1 and X2, respectively, whereas
area c represents their simultaneous overlap with Y. The unique areas, expressed as proportions of
Y variance, are squared semipartial correlation coefficients, and each equals the increase in the
squared multiple correlation that occurs when the variable is added to the other IV.

The semipartial correlation sr1 is the correlation between all of Y and X1 from which X2 has been
partialled. It is a semipartial correlation because the effects of X2 have been removed from X1 but not
from Y.
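For the one-covariate case, both coefficients can be computed from the three pairwise correlations. A sketch with hypothetical r values (note that the semipartial comes out no larger than the partial, as the Vogel notes above state):

```python
# First-order partial (r_y1.2) and semipartial (sr1) correlations of Y with X1,
# controlling for X2. Inputs are the three pairwise Pearson correlations.
def partial_r(r_y1, r_y2, r_12):
    """X2 partialled out of BOTH Y and X1."""
    return (r_y1 - r_y2 * r_12) / ((1 - r_y2 ** 2) * (1 - r_12 ** 2)) ** 0.5

def semipartial_r(r_y1, r_y2, r_12):
    """X2 partialled out of X1 only (part correlation, sr1)."""
    return (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2) ** 0.5

pr = partial_r(0.5, 0.4, 0.3)       # hypothetical r values
sr = semipartial_r(0.5, 0.4, 0.3)   # never exceeds the partial in absolute value
```

The two share a numerator; the partial's denominator has an extra factor below 1, which is why |partial| ≥ |semipartial|.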
Multiple Correlation: (Vogt, 200) A correlation with more than two variables, one of which is
dependent, the others independent. The object is to measure the combined influence of two or more
independent variables on a dependent variable. R is the symbol for a multiple correlation coefficient.
R2 gives the proportion of variance in the dependent variable that can be explained by the action of all
of the independent variables taken together and is known as the coefficient of determination.
E.g., researchers could use multiple correlations to measure the combined effects of age and years of
education on individuals' incomes.
3) Spearman rank correlation
Spearman Correlation Coefficient (Rho) (Vogt, 303) A statistic that shows the degree of monotonic
relationship between two variables that are arranged in rank order (measured on an ordinal scale).
Also called rank-difference correlation. Abbreviated rs for a sample.
E.g., suppose you wanted to see if there was a relationship between knowledge of the political system
and self-esteem among college students. You take a sample of students and give them a test of
political knowledge and a psychological assessment of their self-esteem. Then you rank each of the
students on the two scales. Spearman's rho measures the association between the two sets of ranks.
The null hypothesis is that the two ranks are independent.
(Vogel) A researcher is convinced that men are faster runners than women. The researcher takes the
finishing places of competitors in a recent marathon and wants to see if there was a strong
relationship between gender and placement in the race. What statistic should the researcher use?
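Rho can be sketched with the rank-difference formula (data invented; assumes no tied ranks):

```python
# Spearman's rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), where d is the difference
# between paired ranks. This shortcut assumes there are no tied ranks.
def spearman_rho(x, y):
    def rank(v):
        order = sorted(v)
        return [order.index(val) + 1 for val in v]
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Monotonic but nonlinear data: the ranks agree perfectly, so rho = 1.0
# even though Pearson's r on the raw scores would be below 1
rho = spearman_rho([1, 2, 3, 4], [1, 4, 9, 16])
```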
4) Point biserial and phi coefficients
Point Biserial Correlation (Vogt, 237) A type of correlation to measure the association between two
variables, one of which is dichotomous (category) and the other continuous (score).
(Vogel) We are interested in the relationship between clinical depression and coping among head
injury patients. Head injury patients take the BDI, and those scoring above 12 indicate clinical
depression while lower scores indicate the patient is not depressed (categorical data!). The
avoidant subscale (scores) is also used from the Coping Inventory. How do we analyze the
relationship between depression diagnosis and avoidant coping?
(Note: a biserial correlation is used when the dichotomy is artificial, as it is here.)
If a researcher decided to group data from a satiated scale as full or not full (category) and used a
mood inventory (score), what statistic should be used to explore their relationship to one another?
Phi Coefficient (Vogt, 235) A type of correlation or measure of association between two variables
used when both are categorical and one or both are dichotomous. Phi is a symmetric measure. It is
based on the chi-squared statistic (specifically, to get phi you divide chi-squared by the sample size
and take the square root of the result).
Relationship between 2 variables when both are nominal (dichotomous)
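The chi-squared-to-phi conversion in the Vogt entry above can be sketched directly (counts are invented):

```python
# Phi for a 2x2 table: compute chi-squared, divide by n, take the square root.
def phi_from_2x2(table):
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][c] + table[1][c] for c in range(2)]
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_tot[r] * col_tot[c] / n   # expected count under independence
            chi2 += (table[r][c] - expected) ** 2 / expected
    return (chi2 / n) ** 0.5

# Hypothetical gender x treatment counts with perfect association
phi = phi_from_2x2([[10, 0], [0, 10]])   # -> 1.0
```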
5) Scatter plot
(Vogt, 287) Also called scatter diagram and scattergram. The pattern of points that results from
plotting two variables on a graph. Each point or dot represents one subject or unit of analysis and is
formed by the intersection of the values of the two variables.
The pattern of points indicates the strength and direction of the correlation between the two variables.
The more the points tend to cluster around a straight line, the stronger the relation (the higher the
correlation). If the line around which the points tend to cluster runs from lower left to upper right,
the relation is positive (or direct); if it runs from upper left to lower right, the relation is negative
(or inverse). If the dots are scattered randomly throughout the grid, there is no relationship between
the two variables.

(Vogel) The scatter plots presented below perhaps best illustrate how the correlation coefficient
changes as the linear relationship between the two variables is altered. When r=0.0 the points scatter
widely about the plot, the majority falling roughly in the shape of a circle. As the linear relationship
increases, the circle becomes more and more elliptical in shape until the limiting case is reached
(r=1.00 or r=-1.00) and all the points fall on a straight line.

[Five scatter plots showing progressively tighter elliptical clustering:
r = 0.17, r = 0.42, r = 0.54, r = 0.85, r = 1.00]

6) Significance of a correlation
(Severino) Significance of a correlation indicates the probability of a true non-zero relationship.
The significance of a correlation has nothing to do with the magnitude of the correlation (e.g., just
because a correlation is significant at the .001 level does not mean that it is a strong relationship;
it simply means that there is a 0.1% probability that the correlation occurred by chance).
P values only tell you whether you can interpret the relationship.
Sample size affects p values, not the magnitude of the correlation.
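A sketch of why sample size drives the p value while r stays fixed: the usual test statistic for a correlation, t = r·sqrt(n − 2)/sqrt(1 − r²), grows with n (the r and n values below are invented):

```python
# t statistic for testing H0: rho = 0. With r held fixed, t (and hence the
# significance level) depends entirely on the sample size.
def t_for_r(r, n):
    return r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5

t_small = t_for_r(0.30, 10)    # the same modest correlation ...
t_large = t_for_r(0.30, 100)   # ... yields a far larger t (smaller p) with more cases
```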
7) Assumptions
a) Linearity
b) Bivariate Normality
c) Full range of scores (no restriction)
E) Multivariate Statistics
Multivariate Analysis (Methods) (Vogt, 201) Any of several methods for examining multiple (three or
more) variables at the same time- usually two or more independent variables and one dependent variable.
Usage varies.
a) Stricter usage reserves the term for designs with two or more IVs and two or more DVs.
b) More commonly, multivariate analysis applies to designs with more than one IV or more than one DV
or both.
Whichever usage you prefer, multivariate analyses allow researchers to examine the relation between
two variables while simultaneously controlling for how each of these may be influenced by other
variables. Examples include path analysis, factor analysis, principal components analysis, multiple
regression analysis, MANOVA, MANCOVA, structural equation modeling, canonical correlations,
and discriminant analysis.

1) Multiple Regression and Path Analysis


Multiple Regression Analysis (MRA) (Vogt, 200) Any of several related statistical methods for
evaluating the effects of more than one independent (or predictor) variable on a dependent (or
outcome) variable. Since MRA can handle all ANOVA problems (but the reverse is not true), some
researchers prefer to use MRA exclusively. See regression analysis. MRA answers two main
questions: (1) What is the effect (as measured by a regression coefficient) on a dependent variable
(DV) of a one-unit change in an independent variable (IV), while controlling for the effects of all the
other IVs? (2) What is the total effect (as measured by the R2) on the DV of all the IVs taken
together?
Multiple Regression
(Vogel) Used to determine the utility of a set of predictor variables for predicting an event or
behavior (criterion variable). DV = Criterion; IV = Predictor
MR yields a weighted linear combination of predictors that provides the best prediction of the
criterion.
Types of MR to run:
a) Simultaneous, stepwise & hierarchical regression
Simultaneous (Severino) A form of MR that examines the contributions of all the predictors at
the same time, rather than by adding or subtracting variables one at a time.
Stepwise Regression (Vogt, 311) (a) A technique for calculating a regression equation that
instructs a computer to find the best equation for entering independent variables in various
combinations and orders. Stepwise regression combines the methods of backward elimination and
forward selection. The variables are in turn subject first to the inclusion criteria of forward
selection and then to the exclusion procedures of backward elimination. Variables are selected
and eliminated until there are none left that meet the criteria for removal. Stepwise regression is
often, but inconsistently, contrasted with hierarchical regression analysis, in which the researcher,
not the computer program, determines the order of the variables in the regression equation. (b) A
less common use of stepwise is to describe regression in which the researcher enters the
variables in a logical, theoretical order- which is almost the exact opposite of definition (a).
Hierarchical Regression (Vogt, 142) (a) A method of regression analysis in which independent
variables are entered into the regression equation in a sequence specified by the researcher in
advance. The hierarchy (order of the variables) is determined by the researcher's theoretical
understanding of the relations among the variables. Hierarchical techniques are often contrasted
with stepwise regression, in which the order of the variables is determined by a computer
program using statistical associations among the variables in the particular data set.
It is common to combine hierarchical and stepwise procedures. E.g., the researcher will enter the
variables in groups or blocks. The order of the blocks is determined by the researcher as in
hierarchical regression, but within each block the order is determined by a computer program
using stepwise techniques (backward elimination and forward selection). In part because of such
possible combinations, usage tends to vary quite a bit.
(b) A type of regression model which assumes that when a higher order interaction term is
included, all the lower order terms (main effects) are also included.
MR Basics:
Line of Best Fit- (Vogel) There is always one straight line that fits best through the data and
is referred to as the Least Sum of Squares because it attempts to find a line that minimizes
the distance between all of the points and itself. This is the regression line and it produces the
regression equation described below!
Regression Line (Vogt, 271) A graphic representation of a regression equation. It is the line
drawn through the pattern of points on a scatter diagram that best summarizes the relationship
between the dependent and independent variables. It is most often computed by using the
ordinary least squares (OLS) criterion. When the regression line slopes down (from left to
right) this indicates a negative or inverse relationship; when it slopes up (as in the following
illustration), this indicates a positive or direct relationship.

Simple Regression Equation (Severino) Regression equation plots a line through the data
points that minimizes the residuals (errors).
(Vogt, 271) An algebraic equation expressing the relationship between two (or more)
variables. Also called prediction equation. Usually written Y = a + bX + e. Y is the
dependent variable; X is the independent variable; b is the slope or regression coefficient; a is
the intercept; and e is the error term.
E.g., in a study of the relationship between income and life expectancy we might find that: (1)
people with no income have a life expectancy of 60 years; (2) each additional $10,000 in
income, up to $100,000, adds two years to the average life expectancy so that people with
incomes of $100,000 or more have a life expectancy of 80 years. This would yield the
following regression equation:
Life expectancy = 60 years + 2 times the number of $10,000 units of income.
Y = 60 + 2X, where Y equals predicted life expectancy, and X equals number of $10,000
units. The intercept is 60. (In this highly simplified example, there is no error term).
(Severino) Simple Regression Equation: Y = a + bx
Y= predicted Y
a = y intercept (the constant). Indicates the criterion score when all of the predictors
equal 0. The Y value at which the line touches the vertical Y axis.
b = slope (beta). Indicates the effects of the predictor on the criterion. The expected
change in Y for each 1 unit change in X.
x = a score on the x variable
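The slope and intercept defined above come straight from the least-squares formulas. This sketch recovers the packet's Y = 60 + 2X life-expectancy example from noiseless data (x = number of $10,000 income units):

```python
# Ordinary least squares for one predictor:
#   b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
#   a = mean_y - b * mean_x
def simple_regression(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx     # intercept: predicted y when x = 0
    return a, b

x = [0, 1, 2, 3, 4]               # number of $10,000 income units
y = [60 + 2 * xi for xi in x]     # noiseless, so the fit is exact
a, b = simple_regression(x, y)    # a = 60, b = 2
```

With noisy data the same formulas give the best-fitting line and the residuals would be nonzero.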
Multiple Regression

(Vogel)
Basic Design: MR is used to predict an event or behavior
No Independent Variable
Two or more Dependent Variables (continuous scores/interval data)
DVs can be split up into one outcome (criterion) and the rest are predictors
Sometimes textbooks will use different terminology and say the outcome/criterion is the
Dependent Variable while the predictors are the Independent Variables (just be familiar
with both terminologies)
Consider the following example:
You are interested in predicting weight loss based on calorie consumption, # of exercise
minutes, and gender. Weight loss would be considered your criterion or outcome whereas
caloric consumption, exercise minutes, and gender are your predictors!

Statistics Used
Line of Best Fit
Weights
R2
Residuals
R
Advantages of Multiple Regression
Statistical control: this technique allows you to partial out (hold constant) all other
predictors so you can focus on the unique contribution of each one separately! SPSS does
this automatically!

Residuals (Vogt, 277) The portion of the score on a dependent variable not explained by
independent variables. The residual is the difference between the value observed and the
value predicted by a model such as a regression equation or a path model. Residuals are,
in brief, errors in prediction. Error in this context usually means degree of inaccuracy
rather than a mistake. It is sometimes assumed that unmeasured residual variables could
account for the unexplained parts of the dependent variable. See error term, deviation
score. In ANOVA designs, residual means error variance or within-group variance. See
residual SS.
Residual SS (Vogt, 271) In regression analysis, the sum of squares not explained by the
regression equation. Analogous to within-groups SS in analysis of variance. Also called
error SS.

Multiple Regression Equation (Severino)


Raw weights: Y' = a + b1x1 + b2x2
Standardized weights: Y' = B1x1 + B2x2 (when using standardized betas, the a
intercept always equals zero)

Path Analysis (Vogt, 230) A kind of multivariate analysis in which causal relations among several
variables are represented by graphs (path diagram) showing the paths along which causal inferences
travel. The causal relationships must be stipulated by the researcher. They cannot be calculated by a
computer; the computer is used to calculate path coefficients, which provide estimates of the strength
of the relationships in the researcher's hypothesized causal system. Path analysis is an early form of
structural equation modeling.
In path analysis, researchers use data to examine the accuracy of causal models. A big advantage of
path analysis is that the researcher can calculate direct and indirect effects of independent variables.
(Vogel) Path Analysis and Structural Equation Modeling
Basic Design: These techniques are used when we want to identify
causal relationships between variables so you must first draw a path
connecting all your variables (extension of MR that allows researcher to
test a theory of causal ordering among variables). Structural Equation
Modeling is the most complex version of path analysis.

[Path diagram: variables A and B on the left, C and D in the middle, E on the right,
connected by causal arrows]

Relationships
 Direct: A → C
 Indirect: A → C → E
 Spurious: one variable causes 2 different variables (A causes D and C)
 Unanalyzed: there is no causal relationship, just a correlation (double-headed arrow between
A and B)
 Moderators: two variables have a causal relationship but another variable can change that
relationship
Stress (A) → Immune Functioning Decline (C)
When social support (B) is considered, high amounts of social support can alleviate immune system
responses while poor social support might exacerbate the immune system response. Without social
support though, the relationship between stress and immune functioning still remains.

Cell Phone (A) → Tumor (C)


When a sticker (B) is added to the phone to reduce the amount of radiation, the relationship
between cell phones and tumors reduces! Without the sticker though, the relationship still exists!


Mediators: when a variable must be added in order for two variables to be causally related
A → B → C (B is the mediator)
Cell phone (A) → Radiation (B) → Tumor (C)
You have to have radiation (B) in order to get a tumor from cell phones!
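Path tracing in numbers (all coefficients invented): the indirect effect is the product of the coefficients along the mediated path, and the total effect is direct plus indirect:

```python
# Effect decomposition in the model A -> C (direct) and A -> B -> C (mediated)
direct_AC   = 0.20   # hypothetical path coefficient for A -> C
path_AB     = 0.90   # hypothetical A -> B
path_BC     = 0.70   # hypothetical B -> C (B mediates part of A's effect)
indirect_AC = path_AB * path_BC          # effect transmitted through the mediator
total_AC    = direct_AC + indirect_AC    # direct plus mediated effect
```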

Variables
 Must be measured on an interval/ratio scale (no categories or ranks)
 Endogenous Variables: variable explained by another variable (C,D,E)
 Exogenous Variables: variable that causes others but is not explained by any of the others
(A)

Statistics Used
 Run a multiple regression and report Betas and Semipartials. These values are called path
coefficients and indicate the amount of influence and unique contribution of the causal
variable, respectively.
 Number of Multiple Regressions run = Number of Endogenous Variables

From Review: (Chantelle)


Structural Equation Modeling (Path) Concepts
Path Coefficients: Betas and Semi-partials
Betas (standardized) can compare them
Semi-partial: actually indicates the % of variance uniquely accounted for by the variable with all
others held constant
Endogenous: one we are trying to explain (the variable the arrows point TO)
Exogenous: predictor variable
Latent Variables: measures a construct>>but is composed of several different measures
--for example: g
Indirect vs direct paths, unanalyzed, spurious
Direct: a variable that indicates causality directly to the DV; uses Betas as the path coefficient
Indirect: includes more than one variable as influencing causation to the DV; still uses path
coefficients (aka betas)
Unanalyzed: paths for which we cannot conclude the direction of causality but are testing for a
relationship, as indicated by r
Spurious: pathways in which an apparent relationship between two variables is produced by a
common cause (one variable causes both)
Path Diagrams:
Goodness of fit index>>results in a chi-square; several programs run the entire model to
indicate how well the data fits the model --- Do NOT want significance and want a small
chi-square --- Path analysis goodness-of-fit software:
 LISREL- programs designed to do structural equation modeling (aka: confirmatory factor
analyses & latent variable modeling) gives you the goodness of fit
 EQS
 AMOS
Reproduced correlation matrix: a way to check whether your data fit the model (model fit /
goodness of fit in path analysis)
Advantage of regression within path analysis is it allows us to hold certain paths constant in order to
analyze one path at a time
Mediators: indicate causation; compare both models:
A (IV) ----→ C (DV)
This yields a path coefficient that may be significant.
If you plug in B (IV) and that Beta is significant, then we can assume that B (IV) is a mediator in the
model and shows causation regarding the DV.
The relationship between A and C (the amount of variance explained) would go down:
Without the mediator:  A --(.90)--→ C
With the mediator:     A --(.20)--→ C, because A --→ B --→ C


Moderators:
the idea that a variable interacts w/ some other variable to influence the DV but is NOT directly
causal (serves as a buffer)
influences the relationship between two variables
alters or adjusts the relationship (cannot cause the relationship) but rather changes the way that A
interacts w/ C
b) R, R2, significance
R (Vogt, 259) (a) A symbol for a multiple correlation, that is, among more than two variables. (b)
Abbreviation for range. (Vogel) (r= correlation between 2 variables whereas R= correlation
between multiple variables such as predictors and outcome)
R2 (Vogt, 259) Symbol for a coefficient of multiple determination between a dependent variable
and two or more independent variables. It is a commonly used measure of the goodness-of-fit of a
linear model. Sometimes written R-squared.
E.g., if the R2 between average individual income (the dependent variable) and father's income,
education level, and IQ were .4, that would mean that the effects of father's income, educational
level, and IQ together explained (or "predicted") 40% of the variance in the individuals' average
incomes, and that they did not explain 60% of the variance.
(Vogel) Anytime you square an r or R it will give you the variance shared between the
variables. Just a plain r gives you a correlation whereas squaring it gives you a variance.

R2 = coefficient of determination
R = .46 → The variables of weight loss, caloric intake, exercise, and gender are positively
correlated at .46
R2 = .21 → the variables above share 21% of variance with each other (.46 squared)
Adjusted R2 is an estimated variance for smaller samples, so people say it accounts for
shrinkage. When you have a smaller sample you have to be more conservative in
applying it to the population; therefore, adjusted R2 is always smaller (shrunken).
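The numbers above can be checked directly. A minimal sketch (the sample size and predictor count are hypothetical, and the adjustment formula is the standard one rather than anything quoted in the packet):

```python
n, k = 50, 4      # hypothetical: 50 subjects, 4 predictors (weight-loss example)
R = 0.46          # multiple correlation of the predictors with weight loss

R2 = R ** 2                                     # shared variance: .46 squared
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)   # standard adjustment; always smaller than R2

print(round(R2, 3))       # 0.212
print(round(R2_adj, 3))   # 0.142
```

Shrinking the sample (smaller n) or piling on predictors (larger k) pushes the adjusted value further below R2, which is the conservatism the notes describe.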
UNSTANDARDIZED PREDICTION EQUATION
Y = CONSTANT + B WEIGHT(PREDICTOR)

Weight Loss = constant + B weight(caloric consumption) + B weight(exercise) + B weight(gender)


STANDARDIZED PREDICTION EQUATION
Y = BETA WEIGHT (PREDICTOR)

NO CONSTANT!!!

Weight Loss = Beta weight (caloric consumption) + Beta weight (exercise minutes) + Beta weight (gender)
Significance (Vogt, 295) The degree to which a research finding is meaningful or important.
Without qualification, the term usually means statistical significance, but lack of specificity leads
to confusion (or allows obfuscation).
(Salkind, 142-144) (142) Because our world is not a perfect one, we must allow for some leeway
in how confident we are that only those factors we identify could cause any difference between
groups. You need to be able to say that although (for example) you are pretty sure the difference
between two groups of adolescents is due to maternal employment, you cannot be absolutely,
100%, positively sure. There's always a chance, no matter how small, that you are wrong. So
even though you are a good researcher and take into account other sources of difference, you
still need to take the possibility of being wrong into account.
In most scientific endeavors that involve testing hypotheses, there is bound to be a certain amount
of error that cannot be controlled for. The level of chance or risk you are willing to take is
expressed as a significance level.

(143) Significance level is the risk associated with not being 100% confident that what you
observe in an experiment is due to the treatment or what is being tested. If you read that
significant findings occurred at the .05 level (or p < .05), the translation is that there is 1 chance in
20 (or .05 or 5%) that any differences found were not due to the hypothesized reason (whether
mom works) but to some other, unknown reason(s).
(144) Statistical Significance is the degree of risk you are willing to take that you will reject a
null hypothesis when it is actually true. (147) E.g., if the level of significance is .01, it means that
on any one test of the null hypothesis, there is a 1% chance you will reject the null hypothesis
when the null is true and conclude that there is a group difference when there really is no group
difference at all. Notice that the level of significance is associated with an independent test of the
null, and it is not appropriate to say that "on 100 tests of the null hypothesis, I will make an error
on 1, or 1% of the time." In a research report, statistical significance is usually represented as p <
.05, read as "the probability of observing that outcome is less than .05," often expressed in a report
or journal article simply as "significant at the .05 level."
Significance Testing (Vogel)

http://www.statsoft.com/textbook/stnonpar.html

The concept of statistical significance testing is based on the sampling distribution of a particular
statistic. If we have a basic knowledge of the underlying distribution of a variable, then we can
make predictions about how, in repeated samples of equal size, this particular statistic will
"behave," that is, how it is distributed. For example, if we draw 100 random samples of 100
adults each from the general population, and compute the mean height in each sample, then the
distribution of the standardized means across samples will likely approximate the normal
distribution. Now imagine that we take an additional sample in a particular city ("Tallburg")
where we suspect that people are taller than the average population. If the mean height in that
sample falls outside the upper 95% tail area of the normal distribution then we conclude that,
indeed, the people of Tallburg are taller than the average population.
In the above example we relied on our knowledge that, in repeated samples of equal size, the
standardized means (for height) will be distributed following the t distribution (with a particular
mean and variance). However, this will only be true if in the population the variable of interest
(height in our example) is normally distributed, that is, if the distribution of people of particular
heights follows the normal distribution (the bell-shape distribution). For many variables of
interest, we simply do not know for sure that this is the case. For example, is income distributed
normally in the population? -- probably not. The incidence rates of rare diseases are not normally
distributed in the population, the number of car accidents is also not normally distributed, and
neither are very many other variables in which a researcher might be interested.
Another factor that often limits the applicability of tests based on the assumption that the
sampling distribution is normal is the size of the sample of data available for the analysis (sample
size; n). We can assume that the sampling distribution is normal even if we are not sure that the
distribution of the variable in the population is normal, as long as our sample is large enough
(e.g., 100 or more observations). However, if our sample is very small, then those tests can be
used only if we are sure that the variable is normally distributed, and there is no way to test this
assumption if the sample is small.
Because we encounter data for which we do not have good normal distributions, and/or we have
very small samples that may not be representative of the distributions we are basing our
significance testing on, the need is evident for statistical procedures that allow us to process data
of "low quality." In other words, low-quality data usually come from small samples or concern
variables about which nothing is known. Specifically, nonparametric methods were developed to
be used in cases when the researcher knows nothing about the parameters of the variable of
interest in the population (hence the name nonparametric). In more technical terms,
nonparametric methods do not rely on the estimation of parameters (such as the mean or the
standard deviation) describing the distribution of the variable of interest in the population.

There is at least one nonparametric equivalent for each parametric general type of statistic. These
tests fall into the following categories:

Tests of differences between groups (independent samples)

Number of Groups | Type of Data              | Parametric Test (continuous data) | Nonparametric Equivalent
2                | Rank                      | t-test for independent samples    | Mann-Whitney U-Test
2                | Rank                      | t-test for independent samples    | Kolmogorov-Smirnov Test
More than 2      | Rank                      | ANOVA/MANOVA                      | Kruskal-Wallis Test

Tests of differences between variables (dependent samples)

Number of Groups | Type of Data              | Parametric Test (continuous data) | Nonparametric Equivalent
2                | Rank                      | t-test for dependent samples      | Wilcoxon Matched Pairs
2                | Dichotomous (Categorical) | t-test for dependent samples      | Chi-Square (McNemar)
More than 2      | Rank                      | Repeated Measures ANOVA           | Friedman Test

Tests of relationships between variables

Number of Groups | Type of Data              | Parametric Test (continuous data) | Nonparametric Equivalent
2                | Rank                      | Correlation                       | Spearman
2                | Dichotomous (Categorical) | Correlation                       | Chi-Square
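To make one parametric/nonparametric pairing concrete, here is a Mann-Whitney U computed by brute-force pair counting, with the usual large-sample normal approximation for the p-value (the two groups' scores are invented):

```python
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Mann-Whitney U with a large-sample normal approximation (no tie correction)."""
    n1, n2 = len(x), len(y)
    # U counts, over all cross-group pairs, how often x beats y (ties count half)
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean = n1 * n2 / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean) / sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed
    return u, p

# Hypothetical scores for two independent groups
treatment = [12, 15, 17, 19, 22, 25]
control = [8, 9, 11, 13, 14, 16]
u, p = mann_whitney_u(treatment, control)
print(u, round(p, 3))   # U = 32; p ≈ .025 by the normal approximation
```

Because only the ordering of scores matters, this test needs no assumption about the population distribution, which is the whole point of the nonparametric column above.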

Threats to Finding Significant Differences


1. Low Power (due to small sample size, small effect size, and large error): want at least .80

   Power = (Sample Size × Effect Size) / Variance

Thus, if you have a lot of within group variance (error) OR low sample size OR low effect size,
power will be a problem. Remember, power is the SENSITIVITY of an experiment to find real
differences between groups!!!
2. Subject heterogeneity: When subjects are very different, you will see a lower effect size (tells us how
big the difference is!) and subsequently decreased power
3. Unreliable measures
4. Multiple comparisons: Making numerous comparisons causes family wise error. This type of error is
based on the assumption that the more comparisons the greater chance for type I error (alpha of .05
says for every 100 comparisons, 5 will be significant by chance alone). So if you run 100 tests and
find 5 that are significant, researchers will question your work highly because they theoretically could
have been significant by chance alone, seeing as you ran so many analyses.
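Threats 1 and 4 can both be made concrete in one short sketch: a toy Monte Carlo z-test for power, plus the familywise error computation for multiple comparisons (sample sizes and effect sizes are invented):

```python
import random
from statistics import NormalDist

# Threat 1 -- power: Monte Carlo power of a two-group z-test
def power_sim(n, effect, sd, alpha=0.05, trials=2000):
    random.seed(7)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(trials):
        g1 = [random.gauss(0, sd) for _ in range(n)]
        g2 = [random.gauss(effect, sd) for _ in range(n)]
        diff = sum(g2) / n - sum(g1) / n
        se = sd * (2 / n) ** 0.5              # standard error of the mean difference
        hits += abs(diff / se) > crit
    return hits / trials

print(power_sim(n=20, effect=0.5, sd=1.0))   # small sample: low power
print(power_sim(n=80, effect=0.5, sd=1.0))   # larger n: power rises
print(power_sim(n=80, effect=0.5, sd=2.0))   # more within-group variance: power drops

# Threat 4 -- multiple comparisons: familywise Type I error rate at alpha = .05
for m in (1, 5, 20, 100):
    print(m, round(1 - 0.95 ** m, 3))        # chance of at least one false positive in m tests
```

The three power numbers move exactly as the formula above predicts, and the familywise column shows why 100 uncorrected tests make 5 "hits" unconvincing.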
c) Regression weights (raw and standard)

(Vogt, 272) Another term for beta weight or regression coefficient.


Beta Weight (Vogt, 24) Another term for standardized regression coefficients, or beta
coefficients. Beta weights enable researchers to compare the size of the influence of independent
variables measured using different metrics or scales of measurement. Also called regression
weights.
E.g., imagine a regression analysis studying the influence of age and income on attitudes.
Subjects could be adults ranging in age from 18 to 80. Their incomes might vary from $4,000 for
a high school senior working after school to $200,000 for a tax lawyer. By reporting years and
dollars as standard scores, rather than in the original metric, beta weights allow the researcher to
make easier comparisons of the relative influence of age and income on attitudes.

b = Raw: b weights in the original metric (unstandardized). The size of b is dependent on the
scale of measurement of the IV and DV. You cannot compare unstandardized beta
weights across samples.
B = Standardized: B is based on z-scores. Allows for comparisons across samples.

(Vogel) Weights (B or Beta) are assigned to each predictor to be put into a prediction equation.
 These weights indicate the amount of influence each predictor has on the outcome
 b weight is the unstandardized weight and gives us the amount of influence of that
predictor
 Beta weights are similar, however, they are standardized. If you want to compare 2
predictors in your model, usually they are measured on different scales (i.e. gender versus
calories have completely different scales). Beta weights allow you to compare them
because they standardize the values. Whichever one has a larger Beta weight can be
considered a more valuable predictor of weight loss!
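Rescaling a predictor shows the difference between raw and standardized weights directly. A sketch with invented weight-loss data:

```python
from statistics import mean, stdev

calories = [1800, 2000, 2200, 2500, 2700, 3000]   # hypothetical daily intake
weight_loss = [6.0, 5.5, 4.8, 3.9, 3.1, 2.2]      # hypothetical pounds lost

def slope(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

b_cal = slope(calories, weight_loss)                       # raw b, per single calorie
b_kcal = slope([c / 1000 for c in calories], weight_loss)  # raw b, per 1000 calories

def beta(x, y):
    return slope(x, y) * stdev(x) / stdev(y)               # standardized weight

print(b_cal)        # tiny number: depends on the calorie metric
print(b_kcal)       # 1000x larger: same relationship, different scale
print(beta(calories, weight_loss))                         # scale-free
print(beta([c / 1000 for c in calories], weight_loss))     # identical Beta
```

The raw b changes by a factor of 1000 just from relabeling the units, while Beta is untouched; that is why only standardized weights let you rank predictors measured on different scales.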
d) Incremental variance and significance
(Severino)
The amount of variance in the criterion that a predictor explains, above and beyond the other
predictors in the analysis
Change in R2
Best when each predictor correlates highly with the criterion but not with other predictors
Semi-Partials (areas from the Venn diagram of predictors x1, x2 and the criterion):

   sr1² = a / (a + b + c + d)
   sr2² = c / (a + b + c + d)

Partials:

   pr1² = a / (a + d)
   pr2² = c / (c + d)

[Venn diagram: x1 and x2 overlap the criterion; a = x1's unique overlap with the criterion,
c = x2's unique overlap, b = overlap shared by x1 and x2, d = criterion variance left
unexplained.]
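In numbers, the squared semipartial is the change in R2 when a predictor is added last. A sketch with invented correlations (the two-predictor R2-from-correlations formula is standard, not quoted in the packet):

```python
# Invented correlations among predictor x1, predictor x2, and the criterion y
r1y, r2y, r12 = 0.50, 0.40, 0.30

# R2 for y ~ x1 + x2, computed from the correlations (standard two-predictor formula)
R2_full = (r1y**2 + r2y**2 - 2 * r1y * r2y * r12) / (1 - r12**2)

sr2_x1 = R2_full - r2y**2        # x1's unique variance: change in R2 when x1 enters last
sr2_x2 = R2_full - r1y**2        # x2's unique variance
pr2_x2 = sr2_x2 / (1 - r1y**2)   # squared partial: unique variance relative to what x1 left over

print(round(R2_full, 3), round(sr2_x1, 3), round(sr2_x2, 3), round(pr2_x2, 3))
```

The semipartial divides the unique slice by all of the criterion's variance (a over a+b+c+d), while the partial divides it only by the variance the other predictor left unexplained (a over a+d), which is why the partial comes out larger.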


e) Assumptions
Assumptions of Multiple Regression

Independence of Observations (all scores are independent of each other)


Normality- can be corrected by transforming scores

Linearity
Multivariate normality and linearity are assessed using a scatter plot matrix like the one
below. Make sure all blocks are relatively elliptical in shape.
[Scatter plot matrix of respondent's income, age of respondent, highest year of school, and
number of hours worked; all panels are roughly elliptical, i.e., normal and linear.]

Homoscedasticity (Evenness of Errors)
(Vogt, 145) Homogeneity of variances. A condition of substantially equal variances in the
dependent variable for the same values of the independent variable in the different
populations being sampled and compared in a regression analysis or an ANOVA. Comes
from homo, meaning the same or equal, and scedasticity, meaning tendency to scatter
(skedaddle?). Parametric statistical tests usually assume homoscedasticity. If that
assumption is violated, results of those tests will be of doubtful validity.
(Vogel) Naturally, you will make some errors in predicting outcomes for subjects based on
your chosen variables. The trick is that you want to be sure that you've made the same
amount of error across the spectrum of the outcome.
If you are predicting weight loss based on calorie consumption, # of exercise minutes, and
gender, then you want to make sure that you are equally good at predicting small, moderate,
and large amounts of weight loss. Sometimes you may find that you are only good at
predicting when there is a large amount of weight loss.
[Residual plot of standardized residuals against standardized predicted values, used to
determine normality and homoscedasticity.]
Draw a vertical and a horizontal line through the origin of the residual plot. The plot shows
homoscedasticity/evenness of errors when the errors are centered around the middle. In the
example plot, there is some scattering in the upper right quadrant, but this violation is very
minimal; you can assume homoscedasticity in that example.

Independence of Errors

Error Score Assumptions: They have a mean of zero
They are uncorrelated with each other
They have equal variances at all values of the predictor (e.g., homoscedastic)
They are normally distributed

Specification Errors: The relationship between variables must be linear


All relevant predictors must be included
No irrelevant predictors can be included

Measurement Errors: Measures should be reliable and valid


f) Shrinkage
(Vogt, 294) The tendency for the strength of prediction in a regression or correlation study to
decrease in subsequent studies with new data sets. The regression model derived from one set of
data usually works less well with others. The degree of shrinkage is measured by change in R2.
Compare regression to the mean.

(Severino) MRC derives a prediction equation that is, to some degree, sample specific.
Derivation Sample: The original sample that the regression equation is derived from.
Measures of Association and accuracy in prediction are expected to be lower (shrink) when
the regression equation is used on another sample of subjects- this is known as shrinkage.
R2 is a maximizing procedure that yields an inflated estimate, because it takes advantage of
sample specific error.
Adjusted R2: A more accurate estimate of prediction
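Shrinkage can be watched directly: derive the equation on a small sample, then score it on fresh data from the same population. A toy simulation (population, coefficients, and sample sizes invented):

```python
import random

random.seed(3)

def sample(n):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [0.5 * xi + random.gauss(0, 1) for xi in x]   # population R2 = .20
    return x, y

def fit(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return b, my - b * mx

def r2(x, y, b, a):
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

deriv, cross = [], []
for _ in range(500):
    xd, yd = sample(25)             # small derivation sample
    b, a = fit(xd, yd)              # equation derived here...
    xc, yc = sample(25)             # ...scored on a fresh sample
    deriv.append(r2(xd, yd, b, a))
    cross.append(r2(xc, yc, b, a))

print(sum(deriv) / 500)   # above .20 on average: R2 capitalizes on sample-specific error
print(sum(cross) / 500)   # lower on new data: shrinkage
```

The derivation-sample average overshoots the true .20 while the cross-validation average undershoots it, which is the gap adjusted R2 tries to anticipate.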

g) Causal models
(Severino) MRC is a correlational technique that does not imply causality.
Path analysis is an extension of MRC, which allows the researcher to test a theory of causal
ordering among a set of variables
Variables must be measured on interval or ordinal scale
The number of cases required depends on the model's complexity. Most require about 200-300 cases.
The number of regressions that need to be run = the number of endogenous variables in the
model.
1) Recursive vs. Non-recursive Models: (Vogt, 266) A causal model in which all the
causal influences are assumed to work in one direction only, that is, they are asymmetric
(and the error or disturbance terms are not correlated across equations). By contrast,
nonrecursive models allow two-way causes. Note: Readers are often confused by the
term, because they take recursive to refer to recurring or repeating and therefore to
mean a reciprocal relationship.
E.g., if you were looking at the influence of age and sex on math achievement, your
model would probably be recursive since, while age and sex might influence students'
math achievement, doing well or poorly in math certainly could not change their age or
sex. (Recursive model: A → B.) On the other hand, your model of the relationship
between achievement and time spent studying might not be recursive. It could be
reciprocal (or nonrecursive): Studying could boost math achievement, and increased
achievement could make it more likely that a student would enjoy studying math.
(Nonrecursive model: A ⇄ B.)
2) Endogenous Variables: (Vogt, 266) A variable that is caused by other variables in a
causal system. Generally contrasted with exogenous variables. In the following diagram,
Child's aspirations, Child's education, and Child's income are endogenous. Parents'
education and Parents' aspirations are exogenous.
[Path diagram: Parents' Education and Parents' Aspirations (exogenous) lead to Child's
Aspirations, Child's Education, and Child's Income (endogenous).]

3) Exogenous Variables: (Vogt, 110) A variable entering from and determined from outside
the system being studied. A causal system says nothing about its exogenous variables.
Their values are given, not analyzed. Also called prior variables. In path analysis,
cause is illustrated by an arrow. If a variable does not have an arrow pointing at it, it is
exogenous.
E.g., say we were studying the relation of hours spent practicing to score in an archery
contest. Subjects' dexterity and strength might be related to their scores, but would be
exogenous variables for the purposes of our study.
4) Direct Effects (Severino) In the path diagram, direct effects are indicated by straight
arrows from one variable to another.
Path Coefficient (Vogt, 230) A numerical representation of the strength of the relations
between pairs of variables in a path analysis when all the other variables are held
constant. Path coefficients are standardized regression coefficients (beta weights), that is,
they are regression coefficients expressed as z-scores. Unstandardized path coefficients
are usually called path regression coefficients.
5) Indirect Effects (Severino) The product of two direct effects. The total causal impact of
a variable on the criterion is the sum of its direct effect and its indirect effects (each
indirect effect being a product of direct effects).
6) Spurious Effects (Severino) When two variables have a common cause. Represented by
a path that goes against the direction of the arrows in the model.
7) Unanalyzed Effects (Severino) Represented by a two-headed arrow.
h) Moderating and mediating variables
Moderating Variable (Vogt, 195) A variable that influences
(moderates) the relation between two other variables and
thus produces a moderating effect or an interaction effect.
Mediating Variable (Vogt, 190) Another term for intervening
variable, that is, a variable that transmits the effects of
another variable.
E.g., parents transmit their social status to their children
directly. But they also do so indirectly, through education, as
in the following diagram, where the child's education is the
mediating variable.

A → B → C: B mediates the relations between A and C.
A → C, with B acting on the A-C path: B moderates the relationship between A and C.

Parents' Status → Child's Education → Child's Status
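The simplest way to see a moderator is as an A-C slope that changes across levels of B. A toy simulation (the stress/support example and its coefficients are invented):

```python
import random

random.seed(2)

# Invented moderation: social support (B) buffers the effect of stress (A) on symptoms (C)
n = 4000
A = [random.gauss(0, 1) for _ in range(n)]      # stress
B = [random.choice([0, 1]) for _ in range(n)]   # support absent (0) or present (1)
C = [a * (1.0 - 0.8 * b) + random.gauss(0, 1) for a, b in zip(A, B)]

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((p - mx) * (q - my) for p, q in zip(xs, ys)) / sum((p - mx) ** 2 for p in xs)

# The A-C relationship at each level of the moderator
low = slope([a for a, b in zip(A, B) if b == 0], [c for c, b in zip(C, B) if b == 0])
high = slope([a for a, b in zip(A, B) if b == 1], [c for c, b in zip(C, B) if b == 1])

print(round(low, 2))    # near 1.0: strong stress effect without support
print(round(high, 2))   # near 0.2: support changes (moderates) the A-C relationship
```

B never causes C directly here; it only alters how strongly A translates into C, which is the buffering idea in the definition above.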


i) Multicollinearity (Vogt, 198) In multiple regression analysis,
multicollinearity exists when two or more independent variables are highly
correlated; this makes it difficult if not impossible to determine their separate
effects on the dependent variable. Also called collinearity.
(Severino) When there is a lot of overlap between predictors (e.g., predictors
are redundant).
Tolerance: Test of multicollinearity. The proportion of a predictor's
variance that is not shared by the other predictors. It should be as close to
one as possible.
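Tolerance can be computed straight from the predictor intercorrelations. A sketch with three invented predictor correlations (the two-predictor R2 formula used here is standard, not from the packet):

```python
# Invented correlations among three predictors x1, x2, x3
r12, r13, r23 = 0.80, 0.30, 0.25

# R2 of x1 predicted from x2 and x3 (standard two-predictor formula from correlations)
R2_1 = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)

tolerance_x1 = 1 - R2_1   # proportion of x1's variance NOT shared with x2 and x3

print(round(tolerance_x1, 3))   # well below 1: multicollinearity concern for x1
```

Because x1 correlates .80 with x2, most of its variance is redundant and its tolerance drops far below one, flagging exactly the overlap problem described above.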


j) Standard Error of Estimate (Prediction)


(Vogt, 307) The estimate is a regression line. The error is how much you are off when using
the regression line to predict particular scores. The standard error is the standard deviation of
the errors. It measures the average error over the entire scatter plot. The lower the
SEE, the higher the degree of linear relationship between the two variables in the regression. The
larger the SEE, the less confidence one can put in the estimate. Symbolized s(yx) to distinguish it
from s(y) (i.e., the standard deviation of scores, not the error scores).
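A minimal computation of the SEE for a simple regression (the scores are invented; the n - 2 degrees of freedom is the usual convention for one predictor):

```python
from statistics import mean

x = [2, 4, 5, 7, 8, 10]        # hypothetical predictor scores
y = [30, 38, 41, 50, 52, 61]   # hypothetical criterion scores

mx, my = mean(x), mean(y)
b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
a0 = my - b * mx
errors = [yi - (a0 + b * xi) for xi, yi in zip(x, y)]   # prediction errors (residuals)

# Standard error of estimate: spread of the errors around the regression line
see = (sum(e ** 2 for e in errors) / (len(x) - 2)) ** 0.5

print(round(b, 3), round(see, 3))
```

A tight linear relationship gives small residuals and a small SEE; the weaker the relationship, the larger the SEE and the less confidence you can put in any individual prediction.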
k) Statistical Control in MRC
Statistical Control (Vogt, 309) Using statistical techniques to isolate or subtract variance in the
dependent variable attributable to variables that are not the subject of study. See control for,
partial out, ANCOVA.

Partialling
Holding Constant
Covarying

l) Outliers
(Vogt, 223) A subject or other unit of analysis that has extreme values on a variable. Outliers are
important because they can distort the interpretation of data or make misleading a statistic that
summarizes values (such as a mean). Outliers may also indicate that a sampling error has
occurred by including a case from a population different from the target population.
(Vogel) Run 4 different statistics to determine where your outliers are
Mahalanobis: gives you a number for each subject telling you how far away they are from
the theoretical center of data
Leverage: gives you a number for each subject telling you how far they are from all other
subjects
DfBeta: gives you a number for each subject on every predictor telling you how much that
subject influenced each predictor's regression weight
Cooks: gives you a number for each subject telling you how much they influenced the
regression equation as a whole
SPSS has an explore option that will identify your top 10 possible outliers which you will
then compare to a norm value that you look up on a chart.
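The flavor of a DfBeta-type diagnostic can be reproduced by brute force: refit the regression with each subject left out and see how much the slope moves (the data are invented; this is a sketch of the idea, not SPSS's exact computation):

```python
# Invented data: the last subject has an extreme x value and a poor fit to the trend
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 5.0]

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)

full = slope(x, y)
dfbetas = []
for i in range(len(x)):
    loo = slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])   # slope with subject i left out
    dfbetas.append(full - loo)                          # how much subject i alone moves the slope

for i, d in enumerate(dfbetas):
    print(i, round(d, 3))   # subject 7 dwarfs everyone else's influence
```

One extreme subject flips the slope from roughly 2 to near 0, which is why a statistic summarizing the whole sample (like a regression weight) can be badly distorted by a single outlier.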
2) Principal Components and Factor Analysis
[See Summary Page]
Principal Components Analysis (PCA) (Vogt, 245) Methods for undertaking a linear
transformation of a large set of correlated variables into smaller uncorrelated groups of variables.
This makes analysis easier by grouping data into more manageable units and eliminating problems of
multicollinearity. Principal components analysis is similar in aim to factor analysis, but it is an
independent technique; advocates of the two methods stress the differences rather than the
similarities, and debates between them are often highly contentious. However, the outcomes
produced by the two methods are usually quite similar, and PCA is often used as a first step in factor
analysis.
Factor Analysis (FA) (Vogt,117) Any of several methods of analysis that enable researchers to
reduce a large number of variables to a smaller number of variables, or factors, or latent variables. A
factor is a set of variables, such as items on a survey that can be conceptually and statistically related
or grouped together. Factor analysis is done by finding patterns among the variations in the values of
several variables; a cluster of high intercorrelated variables is a factor. Exploratory factor analysis
was the original type. Confirmatory factor analysis developed later and is generally considered more
theoretically advanced. Principal components analysis is sometimes regarded as a form of factor
analysis, though the mathematical models on which they are based are different. While each method
has strong advocates, the two techniques tend to produce similar results, especially when the number
of variables is large.

E.g., factor analysis is often used in survey research to see if a long series of questions can be
grouped into shorter sets of questions, each of which describes an aspect or factor of the phenomena
being studied.
(Anastasi, 303) The principal object of factor analysis is to simplify the description of data by
reducing the number of necessary variables, or dimensions. Thus, if we find that five factors are
sufficient to account for all the common variance in a battery of 20 tests, we can for most purposes
substitute 5 scores for the original 20 without sacrificing essential information. The usual practice is
to retain from among the original tests those providing the best measure of each of the factors.

Basic Design: This is a technique in which a large number of interrelated variables are reduced
into a smaller number of latent dimensions
 No Independent Variables
 More than 2 Dependent Variables that cannot be divided into predictors and outcomes
 Principal Components is used when we want 100% of the variance between items explained
whereas Factor Analysis only explains shared/common variance between the variables used
(communalities are less than 1 in FA but equal 1 in Principal Components)
 Must have 5-10 people per item!
 The more heterogeneous the sample, the more factors will emerge


Factor Loadings: under each factor, every variable gets a loading that indicates how
important that variable is (the correlation between item and factor)

Communalities: value for each variable telling you how much of the variable was
used by the components/factors/subscales
Low communalities = did not load highly on any factor and can be thrown out
High communality = high loadings on one or more factors

a) Principal components vs. principal axis

b) Rotation (orthogonal and oblique)


Factor Rotation (Vogt, 119) Any of several methods in factor analysis by which the researcher
attempts (by transformation of loadings) to relate the calculated factors to theoretical entities.
This is done differently depending upon whether the factors are believed to be correlated
(oblique) or uncorrelated (orthogonal).
(Anastasi, 304-307) It is customary to represent factors geometrically as reference axes in terms
of which each test can be plotted... In this connection it should be noted that the position of the
reference axes is not fixed by the data. The original correlation table determines only the position
of the tests in relation to each other. The same points can be plotted with the reference axes in
any position. For this reason, factor analysts usually rotate axes until they obtain the most
satisfactory and easily interpretable pattern.
Oblique Axes and Second-Order Factors (Anastasi, 309) Orthogonal axes- are at right angles
to each other.
Occasionally, the test clusters are so situated that a better fit can be obtained with oblique axes. In
such a case, the factors would themselves be correlated. Some investigators have maintained that
orthogonal, or uncorrelated, factors should always be employed, since they provide a simpler and
clearer picture of trait relationships. Others insist that oblique axes should be used when they fit
the data better, since the most meaningful categories need not be uncorrelated. An obvious
example is height and weight. Although it is well known that height and weight are highly
correlated, they have proved to be useful categories in the measurement of physique.
When factors are themselves correlated, it is possible to subject the intercorrelations among the
factors to the same statistical analysis we employed with intercorrelations among tests. In other
words, we can factorize the factors and derive second-order factors.

Rotated Component Matrix: rotations clear up the focus of the data by giving up some
magnification and finding the best fit for all the retained factors even if it means increasing or
decreasing the importance of each factor
Varimax Rotation: orthogonal (independent, each factor has zero correlation with others),
most common, rotation that wiggles around after the first factor is set so the others can get
a best fit
Oblique Rotation: don't force factors to be orthogonal because you suspect they may be
intercorrelated

c) Eigenvalues and percent of variance


Eigenvalues (Vogt, 103) A statistic used in factor analysis to indicate how much of the variation
in the original group of variables is accounted for by a particular factor. It is the sum of the
squared factor loadings of a factor. Eigenvalues of less than 1.0 are usually not considered
significant. Usually symbolized lambda [λ]. Also called characteristic root and latent root.
Eigenvalues have similar uses in canonical correlation analysis and principal component analysis.
Statistics Used
 Deciding how many factors/subscales to retain (so if you ran through all the items on the
MMPI, you need to decide how many subscales you will keep once the items have been
clustered together)
Eigenvalues: each factor (clustered items) gets an eigenvalue score that tells you the amount of
variance among all items that this one factor accounts for
Low values would mean that not many variables clustered together in this particular
grouping/factor

Kaiser Rule: you can tell SPSS to retain all factors with eigenvalues >1
Scree Plot: the plot will be steep for the first few factors and then level off. You do not
want to keep the factors that have leveled off
Theory (research shows a certain number of factors is best!)

Scree Plot
Plot of the eigenvalues for each factor/component that was created by lumping variables
together
As the plot levels, each factor is explaining less unique information
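Eigenvalue extraction and the Kaiser rule can be sketched with power iteration on a small invented correlation matrix. For the matrix below, items 1 and 2 cluster and the eigenvalues work out analytically to 1.8, 0.9, and 0.3, so only the first factor survives the Kaiser rule:

```python
# Invented 3x3 correlation matrix: items 1 and 2 cluster, item 3 stands apart
corr = [[1.0, 0.7, 0.2],
        [0.7, 1.0, 0.2],
        [0.2, 0.2, 1.0]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def power_iteration(M, steps=200):
    """Largest eigenvalue/eigenvector of a symmetric matrix (a standard sketch)."""
    v = [1.0 / (i + 1) for i in range(len(M))]   # start vector overlapping all eigenvectors
    for _ in range(steps):
        w = matvec(M, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(x * y for x, y in zip(v, matvec(M, v)))   # Rayleigh quotient
    return lam, v

def deflate(M, lam, v):
    # Remove the found factor so the next-largest eigenvalue can be extracted
    return [[M[i][j] - lam * v[i] * v[j] for j in range(len(M))] for i in range(len(M))]

eigs, M = [], corr
for _ in range(3):
    lam, v = power_iteration(M)
    eigs.append(lam)
    M = deflate(M, lam, v)

total = len(corr)
for i, lam in enumerate(eigs, 1):
    # eigenvalue, % of total variance, and the Kaiser retain/drop decision
    print(i, round(lam, 3), f"{lam / total:.0%}", "retain" if lam > 1 else "drop")
```

The printed percentages (60%, 30%, 10%) are each eigenvalue over the number of variables, i.e., the variance-accounted-for figures a scree plot visualizes.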

d) Confirmatory vs. exploratory


Types of Factor Analysis
 Exploratory (most common)
Not for hypothesis testing
Used in test construction (creating subscores on assessments)
Used in empirical exploration (study brand new areas to see what symptoms cluster
together)
Used for data reduction/reduce number of DVs (if you have 12 measures of depression
and several are highly intercorrelated, identify which ones cluster together so you can
eliminate inventories)
(Vogt, 113) Factor analysis conducted to discover what latent variables (factors) are behind a
set of variables or measures. Generally contrasted with confirmatory factor analysis, which
tests theories and hypotheses about the factors one expects to find.


Confirmatory
Used in testing the DSM-IV criteria (take an extensive history of many patients from
each diagnostic category and make sure they cluster together in accordance with their
diagnosis)
Schizophrenics should not cluster with Bipolar etc
OCD and Phobias or PTSD may cluster together a little

Checking existing classifications to see if they hold up, check to see if the model fits
existing data (if chi-square is significant, real life data does not fit current model)

(Vogt, 56) Factor analysis conducted to test hypotheses (or confirm theories) about the factors
one expects to find. It is a type of or element of structural equation modeling.
e) Factor loadings
(Vogt, 119) The correlations between each variable and each factor in a factor analysis. They are
analogous to regression (slope) coefficients. The higher the loading, the closer the association of
the item with the group of items that make up the factor. Loadings of less than .3 or .4 are
generally not considered meaningful.
3) Discriminant function analysis and logistic regression
(Vogel unless otherwise indicated)
Basic Design
 Logistic Regression is like Multiple Regression except that we are predicting an event or
behavior that either occurs or does not (CATEGORICAL), whereas multiple regression predicts an
outcome on a continuum. For instance, if you were to predict how much weight loss a person
achieved based on the previous 3 variables discussed, you would use MR. If you were to
predict whether each subject did or did not lose weight, you would run a logistic regression
because the outcome is dichotomous!
 No Independent Variable
 More than 2 Dependent Variables
DVs can be split up into one outcome (criterion) and the rest are predictors
Outcome is categorical/dichotomous and predictors can be continuous or dichotomous
 Sometimes texts will use different terminology and say the outcome/criterion is the
Dependent Variable while the predictors are the Independent Variables (just be familiar with
both terminologies)
Statistics Used
 Weights (B or Beta) are assigned to each predictor to be put into a prediction equation.
 Chi-Square, -2 Log Likelihood, Significance Level: tells us if the whole model (all
predictors lumped together) is significantly predicting the outcome! Are you better at
predicting the outcome when you've added predictors versus by chance alone? This should be
SIGNIFICANT
 Cox and Snell/Nagelkerke Pseudo R Squared: gives us a range for the variance in the
outcome that is explained by our model (predictors)!
 Hosmer and Lemeshow Goodness of Fit Chi Square: tells us whether the predictions we are
making fit the actual data we collected; if there are a lot of discrepancies, then we have to
rethink the model! This should be NOT SIGNIFICANT
 Predicted Probability: takes the scores of each individual and plugs them into the equation
p = 1 / (1 + e^-(a + b1x1 + b2x2 + ...)) in order to yield a probability value between 0 and 1.
The predicted probability is then graphed and compared against the observed outcome for
each individual.
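The predicted-probability equation can be sketched in a few lines of Python. The intercept and coefficients below are hypothetical illustrations, not output from a fitted model:

```python
import math

def predicted_probability(intercept, coefs, values):
    """Logistic prediction: p = 1 / (1 + e^-(a + b1*x1 + b2*x2 + ...))."""
    linear = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1 / (1 + math.exp(-linear))

# Hypothetical (made-up) coefficients: intercept a = -2.0, b1 = 0.8, b2 = 0.5,
# applied to a subject with predictor values x1 = 2.0 and x2 = 1.0.
p = predicted_probability(-2.0, [0.8, 0.5], [2.0, 1.0])
print(round(p, 3))  # 0.525 -- a probability between 0 and 1
```

Whatever the predictor values, the result always falls between 0 and 1, which is what makes the logistic form suitable for a dichotomous outcome.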

 Classification Table
Tells you the observed values and predicted values so you can see how many you
correctly classified overall as well as your hits and misses
Goal is to correctly classify as many as possible!
 Sensitivity: ability to detect the presence of the outcome (True Positives)
Mistakes (missed cases) are false negatives, i.e., Type II errors
At a crisis house, the ability to screen for individuals who will attempt suicide refers to
sensitivity
(Vogt, 293) The ability of a diagnostic test to correctly identify the presence of a disease or
condition. Sensitivity is the conditional probability of the test giving a positive result if the
subjects do have the condition or disease. Originating in medical research, the term is now
used more broadly.


Specificity: ability to detect the absence of the outcome (True Negatives)
Mistakes (false alarms) are false positives, i.e., Type I errors
At a crisis house, the ability to weed out those individuals who will not attempt suicide
refers to specificity
In this example, sensitivity is likely more important than specificity, as we would want to
be cautious! Other situations may be the opposite!
Low specificity: trouble detecting absence results in false positives
(Vogt, 293) The ability of a test to correctly judge that subjects do not have a disease or
condition, in other words, to avoid false negatives. Specificity is the conditional probability
of a test giving a negative result when patients or subjects do not have a disease. Compare
sensitivity, which is the ability of a test to avoid false positives.
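Sensitivity, specificity, and the overall hit rate can all be computed from the four cells of a classification table. A short Python sketch using made-up counts:

```python
# Made-up classification-table counts:
tp, fn = 40, 10   # outcome present: hits vs. misses (false negatives)
tn, fp = 90, 10   # outcome absent: correct rejections vs. false positives

sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
hit_rate = (tp + tn) / (tp + fn + tn + fp)   # overall percent correct

print(sensitivity)         # 0.8
print(specificity)         # 0.9
print(round(hit_rate, 3))  # 0.867
```

Note that a high hit rate can hide a poor sensitivity when the outcome is rare, which is why all three values are worth inspecting.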

From Review: (Chantelle)


Logistic Concept List
Wald Statistic and its limitations: equivalent of a t score when dealing with binary
logistic regressions. Danger is that very large Betas inflate the standard error, which
deflates the Wald statistic and increases the chance of making Type II errors.
Log (likelihood) ratio test: indicates how well the equation or the regression model fits
the data (the closer to 0, the better the equation or model fits).
Classification Tables: 2 x 2 tables that demonstrate hit rates and miss rates as well as
indications of Type I and Type II error.
Cohen's Kappa: an agreement statistic that corrects for chance agreement; we want a high
percentage of agreement beyond what chance alone would produce.
Hit Rate: the percentage of cases we correctly classified.
Assumptions:
Logistic Regression:
Independence of Observations (indicated by the Durbin-Watson statistic; range of
1.8-2.2)
Requirement: subject/predictor ratio (maybe 20-50 subjects per predictor)
Discriminant Analysis Assumptions:
Normality
Linearity
Homoscedasticity
Independence of Observations (more powerful when these assumptions are met)
Contrast to discriminant analysis:
Advantages of Logistic Regression:
Few assumptions
More flexible
No negative probabilities
Good for variables of all types
Fewer limitations
Good when you expect an IV is nonlinear
Disadvantage: lose a little power with a dichotomous outcome


Odds ratio
If your Odds Ratio = 2 for a predictor, then a one-unit increase in the predictor doubles
the odds of the outcome (2:1), with all other predictors held constant

Odds Ratio (OR) (Vogt, 218) A ratio of one odds to another. The odds ratio is a
measure of association, but unlike other measures of association, 1.0 means that there
is no relationship between the variables. The size of any relationship is measured by the
difference (in either direction) from 1.0. An odds ratio of less than 1.0 indicates an
inverse or negative relation; an odds ratio greater than 1.0 indicates a direct or positive
relation. Also called cross-product ratio after a method of computing this statistic. [odds
of a certain outcome happening or not happening predicted by the predictors]
An adjusted odds ratio is an OR computed after having controlled for the effects of other
predictor variables. An unadjusted OR would be a bivariate OR.
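A quick Python sketch of the two ways an odds ratio usually appears: as the cross-product ratio of a 2 x 2 table, and as exp(B) for a logistic regression coefficient. All counts and the B value here are hypothetical:

```python
import math

# Odds ratio as the cross-product ratio of a 2x2 table (made-up counts):
#                outcome present   outcome absent
# predictor yes        30               20
# predictor no         15               35
a, b, c, d = 30, 20, 15, 35
odds_ratio = (a * d) / (b * c)
print(odds_ratio)  # 3.5 -- the "yes" group has 3.5 times the odds

# In logistic regression output, exp(B) for a coefficient is its odds ratio:
B = 0.693  # hypothetical coefficient
print(round(math.exp(B), 2))  # 2.0 -- each unit increase doubles the odds
```

An odds ratio of exactly 1.0 would mean the two odds are equal, i.e., no relationship, matching the definition above.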

Assumptions
 It's almost impossible to violate assumptions! Very lenient!!!
 Independence of Observations
 Outcome must be dichotomous
 Need large sample size (approx. 20-50 subjects per predictor)
 Discriminant Analysis: you can run this analysis as an alternative to logistic regression;
however, it requires meeting many more assumptions than Logistic!

4) Multivariate analysis of variance and covariance MANOVA


(Vogt, 202) The extension of ANOVA techniques to studies with multiple dependent variables.
MANOVA allows the simultaneous study of two or more related dependent variables while
controlling for the correlations among them. If the dependent variables are not related, there is no
point in doing a MANOVA; rather, separate ANOVAs for each (unrelated) dependent variable would
be appropriate.
E.g., to study the effects of exercise on at-rest heart rate, you could use ANOVA to test the (null)
hypothesis that there is no difference in average heart rate of three groups: women who never
exercise, who exercise sometimes, and who exercise frequently. MANOVA makes it possible to add
related dependent variables to the design, such as mean blood pressure and respiratory rates of the
three groups.

(Vogel) Basic Design: similar to ANOVA in that you are looking for differences between 2 or
more groups; however, now we also have 2 or more Dependent Variables!

Advantages
 Could run several separate ANOVAs, except that you would have to worry about Type I error
because you are running many analyses (familywise/experimentwise error)
 Increases power
 Takes into account the correlations between DVs

Basic Stats
 Synthetic Variable: MANOVAs create a new DV from the set of correlated DVs (lumps
them all into one DV).
 Multivariate Tests of Significance
Takes all the DVs together and creates the synthetic variable/factor and compares the
synthetic variable across levels of the IV
4 different tests to choose from depending on whether you are interested in the primary
synthetic variable or other related ones
Wilks' Lambda: very common; tells us the variance in the synthetic variable that is
accounted for by the IV (more intuitive meaning)
Roy's Largest Root: highly sensitive to only the most important synthetic factor, so
don't use this if you are interested in other dimensions
Pillai's Trace, Hotelling's, and Roy's are more arbitrary than Wilks' Lambda, while
Wilks' Lambda and Hotelling's are the most commonly used statistics.


3 Rough rules of thumb for selecting the most appropriate MANOVA test (Chantelle)
 Roy's GCR: this test should be employed to confirm a hypothesis of one single dimension
(or one predominant factor in the dependent variable set)
 Wilks' Lambda: this test is maximally sensitive when two or more dimensions are contained
in the set of dependent variables and are of relatively equal importance in accounting for the
trace
 Lawley-Hotelling trace & Pillai's trace: these two test criteria appear to be intermediate in
sensitivity when compared with Roy's GCR and Wilks' Lambda. However, there is evidence
that Pillai's trace criterion may be more robust to lack of homogeneity of dispersion matrices
than the other 3 MANOVA criteria, but see also Stevens
4 Statistical Reasons For Preferring A Multivariate Analysis
1. Use of univariate tests (ANOVAs) leads to an inflated overall Type I error rate and a higher
probability of at least one false rejection
2. Univariate tests ignore important information: the correlations among the variables. A
multivariate test incorporates the correlations right into the test statistic.
3. Although the groups may not differ significantly on any of the variables individually,
jointly the set of variables may reliably differentiate the groups.
Small differences on several of the variables may combine to produce a reliable overall
difference. Multivariate tests may be more POWERFUL.
4. It is sometimes argued that the groups should be compared on total test score first to see if there is
a difference. If so, then compare the groups further on subtest scores to locate the source
responsible for the global difference. If there is NO total score difference, then STOP. Use a
MANOVA as a gatekeeping function: if significant, then you can go ahead and run univariate tests.
Assumptions in ANOVA
 Observations are Normally distributed on the DV in each group
 The population variances for the groups are equal (homogeneity of variance)
 Independence of observation
Basic Assumptions (see descriptions in ANOVA section)
 Normality
 Independence of Observations
 Homogeneity of Variance
 Homogeneity of Variance/Covariance
The covariance (variance shared between variables) for each pair of DVs is the same
across levels of the IV; relationship between DVs stays same across levels of IV
Use Box Test in SPSS

Multivariate Analysis of Covariance (MANCOVA)


This is used when you have a covariate you need to control for
Remember, covariates are variables that are correlated with the DV but not with your IVs
(confound not controlled for in the research design)
When you run SPSS, you will enter the covariate into the analysis
Assumptions: same as MANOVA and ANCOVA
a) Justification for use
b) Synthetic variables
c) Multivariate tests of significance
d) Assumptions
5) General linear model

F) Statistical Inference
1) Type 1 and Type 2 errors
DIFFERENT TYPES OF ERROR (Salkind, 145)

The four cells below cross the action you take (accept or reject the null hypothesis) with the true
nature of the null hypothesis (really true or really false):

1. Accept the null when it is really true: Bingo, you accepted the null when there really is no
difference between groups. True Negative (Specificity).
2. Reject the null when it is really true: Oops, you made a Type I error and rejected a null
hypothesis when it is true and there is really no difference between groups. Type I errors are
also represented by the Greek letter alpha (the level of significance). False Positive.
3. Accept the null when it is really false: Uh-oh, you made a Type II error and accepted a false
null hypothesis. Type II errors are also represented by the Greek letter beta. False Negative.
4. Reject the null when it is really false: Good job, you rejected the null hypothesis when there
really are differences between the two groups. This is also called power, or 1 - beta. True
Positive (Sensitivity).

(Salkind, 145-148) Example: A researcher is interested in seeing whether there is a difference in the
academic achievement of children who participated in a preschool program and children who did not.
The null hypothesis is that the two groups are equal to each other on some measure of achievement.
The research hypothesis is that the mean score for the group of children who participated in the
program is higher than the mean score for the group of children who did not participate in the
program.

Cell 1 represents a situation where the null hypothesis is really true (there's no difference
between groups) and the researcher made the correct decision accepting it. No problem here. In
our example, our results would show that there is no difference between the two groups of
children, and we have acted correctly by accepting the null that there is no difference.

Cell 2 represents a serious error. Here, we have rejected the null hypothesis (that there is no
difference) when it is really true (and there is no difference). Even though there is no difference
between the two groups of children, we will conclude there is, and that's an error. Clearly a boo-boo, called a Type I error; its probability is also known as the level of significance.

Type I error, or level of significance, has certain values associated with it that define the risk
you are willing to take in any test of the null hypothesis. The conventional levels set are
between .01 and .05. E.g., if the level of significance is .01, it means that on any one test of
the null hypothesis, there is a 1% chance you will reject the null hypothesis when the null is
true and conclude there is a group difference when there really is no group difference at all.

Cell 3 represents a serious error as well. Here, we have accepted the null hypothesis (that there is
no difference) when it is really false (and, indeed, there is a difference). We have said that even
though there is a difference between the two groups of children, we will conclude there is not.
Clearly a boo-boo known as a Type II error.


Type II error occurs when you inadvertently accept a false null hypothesis. E.g., there may
really be differences between the populations represented by the sample groups, but you
mistakenly conclude there are not.

Cell 4 represents a situation where the null hypothesis is really false and the researcher made the
correct decision in rejecting it. No problem here. In our example, our results show that there is a
difference between the two groups of children, and we have acted correctly by rejecting the null
that states there is no difference.

Ideally you want to minimize both Type I and Type II errors, but it is not always easy or under
your control. You have complete control over the Type I error level or the amount of risk that you
are willing to take (because you actually set the level itself). Type II errors are not as directly
controlled but, instead, are related to factors such as sample size. Type II errors are particularly
sensitive to the number of subjects in a sample, and as that number increases, Type II error
decreases. In other words, as the sample characteristics more closely match those of the population
(achieved by increasing the sample size), the likelihood that you will accept a false null
hypothesis decreases.
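The meaning of the alpha level can be checked by simulation: if the null hypothesis really is true and we test at alpha = .05, we should wrongly reject about 5% of the time. A Python sketch using simulated data with a fixed seed:

```python
import math
import random

# If the null hypothesis is really true, testing at alpha = .05 should
# produce a Type I error (a false rejection) about 5% of the time.
random.seed(42)
n, trials, rejections = 30, 5000, 0
for _ in range(trials):
    # Draw a sample from a population where H0 holds (mean really is 0).
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(sample) / n
    z = mean / (1 / math.sqrt(n))  # z test of H0: population mean = 0
    if abs(z) > 1.96:              # two-tailed rejection at the .05 level
        rejections += 1

print(rejections / trials)  # close to .05
```

This is exactly the sense in which setting alpha gives you direct control over the Type I error rate: it is the rejection threshold you chose.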

2) Power of a statistical test


(Vogt, 242) Broadly, the ability of a technique, such as a statistical test, to detect relationships.
Specifically, the probability of rejecting a null hypothesis when it is false- and therefore should be
rejected. The power of a test is calculated by subtracting the probability of a Type II error from
1.0. The maximum total power a test can have is 1.0; the minimum size is zero; .8 is often
considered an acceptable level for a particular test in a particular study. Also called statistical
power.
Power analyses are sometimes conducted before data are gathered in order to determine how
large a sample needs to be to avoid Type II error. They are also conducted after the data have
been gathered; the data and the sample size are then used to estimate the probability that a Type II
error has been committed.

(Vogel) Power: how sensitive is our design to the effects we want to find, if true effects really do
exist, will we find them?
 Power depends on sample size, effect size, and within-group (error) variance; roughly,
Power ~ (sample size x effect size) / within-group variance
 To have good power, we want a large sample, large effect, and low error
 If error is low, then having either a huge sample size or huge magnitude of effect will give
you enough power and vice versa
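The relationship among power, sample size, and effect size can be sketched with a simplified power formula for a two-tailed one-sample z test. This is an approximation for illustration, not a full power analysis:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function via erf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def approx_power(effect_size, n, crit=1.96):
    """Approximate power of a two-tailed one-sample z test.

    effect_size is Cohen's d (mean difference / SD). This is a simplified
    sketch: it ignores the tiny rejection region on the opposite tail,
    which is negligible for reasonable effect sizes.
    """
    return 1 - normal_cdf(crit - effect_size * math.sqrt(n))

# Power grows with both sample size and effect size:
print(round(approx_power(0.5, 30), 2))   # moderate effect, n = 30
print(round(approx_power(0.5, 60), 2))   # same effect, larger n
print(round(approx_power(0.8, 30), 2))   # larger effect, same n
```

Running it shows the pattern described above: doubling the sample or enlarging the effect both push power well past the conventional .8 benchmark.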

3) Interpretation of significance testing


Significance: tells us how sure we are that the differences found between groups are real

Alpha < .05 means there is less than a 5% chance of obtaining differences this large when the
null hypothesis is true (informally, we are "95% sure" the differences found are real)

Alpha < .01 means less than a 1% chance (informally, "99% sure" the differences are real)

Affected by sample size: the smaller the sample, the harder it is to reach significance

4) Confidence intervals
Confidence Interval (Vogt, 307) A range of values of a sample statistic that is likely (at a given level
of probability, called a confidence level) to contain a population parameter. The interval that will
include the population parameter a certain percentage (confidence level) of the time. In other words, a
range of values with a known probability of including the true population value. The wider the
confidence interval, the higher the confidence level. See confidence level for an example.
It is common to say, for example, that one can be 95% confident that the confidence interval contains
the true value. Although this is the usual way to report confidence intervals and limits, it is not
technically correct. Rather, it is correct to say that, were one to take an infinite number of samples of
the same size, on average 95% of them would produce confidence intervals containing the true
population value.

Confidence Level (Vogt, 307) A desired percentage of the scores (often 95% or 99%) that would fall
within a certain range of confidence limits. It is calculated by subtracting the alpha level from 1 and
multiplying the result by 100; e.g., (1 - .05) x 100 = 95%.
E.g., say a poll predicted that, if the election were held today, a candidate would win 60% of the vote.
This prediction could be qualified by saying that the pollster was 95% certain (confidence level) that
the prediction was accurate plus or minus 3% (confidence interval). The larger the sample, the
narrower the confidence interval or margin of error.
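The poll example can be reproduced with the usual formula for the margin of error of a proportion. The sample size of 1,000 respondents is an assumption added for illustration:

```python
import math

# Poll example: candidate polls at 60%, 95% confidence (z = 1.96).
# The sample size n = 1000 is an assumption for illustration.
p, n, z = 0.60, 1000, 1.96
margin = z * math.sqrt(p * (1 - p) / n)  # standard error of p, times z

print(round(margin * 100, 1))  # 3.0 -- about plus or minus 3 points
print(round((p - margin) * 100, 1), "to", round((p + margin) * 100, 1))  # 57.0 to 63.0
```

Quadrupling n would halve the margin, illustrating the closing sentence above: the larger the sample, the narrower the confidence interval.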
II) MEASUREMENT
A) Scales of measurement
Nominal: Gender. Ethnicity. Marital status.
Ordinal: Movie ratings (0, 1, or 2 thumbs up). SES. U.S.D.A. quality of beef ratings (good,
choice, prime). The rank order of anything.
Interval: Degrees F. Most personality measures. WAIS intelligence score.
Ratio: Degrees K. Annual income in dollars. Length or distance in centimeters, inches, miles, etc.

http://web.uccs.edu/lbecker/SPSS/scalemeas.htm

1) Nominal variables
Nominal/Categorical: grouping people into categories based upon stated political party preference
(Republican, Democrat, or Other) or upon sex (Male or Female). In the political party preference
system, Republicans might be assigned the number "1", Democrats "2", and Others "3", while in the
latter females might be assigned the number "1" and males "2".
2) Ordinal variables
 Ordinal/Rank Ordered: ordering, ranking, or rank ordering; the ordinal scale of measurement
represents the ranks of a variable's values. Values measured on an ordinal scale contain
information about their relationship to other values only in terms of whether they are "greater
than" or "less than" other values but not in terms of "how much greater" or "how much smaller."

Rank ordering people in a classroom according to height and assigning the shortest person the
number "1", the next shortest person the number "2", etc. is an example of an ordinal scale.

3) Equal interval variables


 Interval (equal): This scale of measurement allows you to not only rank order the items that are
measured, but also to quantify and compare the sizes of differences between them (no absolute
zero is required). This is typically the type of data you use for dissertations in which the score is
on a continuous scale.

Any scores found on the Beck Depression Inventory, Ratings on Stress Level, Hours of Sleep
per Night, etc are all continuous scores.

4) Ratio variables
 Ratio (definite zero): The added power of a rational zero allows ratios of numbers to be
meaningfully interpreted; i.e. the ratio of John's height to Mary's height is 1.32, whereas this is
not possible with interval scales.


B) Interpretation of Measures
1) Transforming scores
(Vogel, unless otherwise referenced)
 Transforming Scores
 Raw scores are not helpful
 Percentiles
 Percent of people who scored below you
 Advantages: indicates each persons relative position
 Disadvantages: cannot compare a 50th-60th percentile difference to an 80th-90th percentile
difference because a larger group (more differences) falls in the 50th-60th percentile range than in the 80th-90th
 Standardized Scores
 Allows you to compare scores on different tests
 T scores, Z scores, etc
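A minimal sketch of the z-score and T-score transformations; the raw score, mean, and SD below are made-up illustrative numbers:

```python
# Made-up example: raw score of 65 on a test with mean 50 and SD 10.
raw, mean, sd = 65, 50, 10

z = (raw - mean) / sd  # z score: distribution with mean 0, SD 1
t = 50 + 10 * z        # T score: distribution with mean 50, SD 10

print(z)  # 1.5 -- the person scored 1.5 SDs above the mean
print(t)  # 65.0 -- same standing expressed on the T scale
```

Because both transformations only shift and rescale, a person's relative standing is preserved, which is what makes scores from different tests comparable.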
2) Creation of norms
(Vogel, unless otherwise referenced)
 Creation of Norms
 Created/established by administering tests to a sample that is representative of the population
of interest
3) Appropriate use of norms
(Vogel, unless otherwise referenced)
 Appropriate Use of Norms
 Do not use test norms (at least be cautious) if the individual is not represented in the
normative sample
4) Criterion-referenced vs. norm-referenced tests
(Vogel, unless otherwise referenced) [See Summary Table]
Criterion (Domain) Referenced: how each person performs based on a criterion/outcome;
determines if they learned the material
 Mastery: all or none score (comps/licensing exam) that assess a content area
(Vogt,69) A test that examines a specific skill (the criterion) that students are expected to have
learned, or a level (the criterion) students are expected to have attained. Unlike a norm-referenced
test, it measures absolute levels of achievement; students' scores are not dependent upon
comparisons with the performance of other students. Also called content-referenced tests.

Norm Referenced: score is relative to those in normative sample; tests for individual differences.
(Vogt,215) A test in which the scores are calculated on the basis of how subjects did in
comparison to (relative to) others taking the test (others' scores provide the norm or standard).
The alternative is some absolute standard or criterion.

5) Scaling, Likert, Guttman, etc.


(Vogel, unless otherwise referenced)
Scaling
Polychotomous Scales (usually used to assess attitudes)
 Thurstone
Method of creating and scoring a questionnaire
Many statements (100, for example) are presented to a group of judges that express a
range of attitudes about a certain subject
Then, the group of judges sorts the statements into 11 groups that classify them as similar
attitudes (kind of creates subscales that lump certain questions together)
Subject's score depends on the number (1-11) associated with the statement they endorse


Guttman
A set of statements about a topic from which you choose to endorse one
The endorsement of that one statement implies that you would endorse all other milder
statements
Continuum of Divorce: If you endorse "I have filed for divorce," then you will also
endorse "I have occasionally thought of divorce"

Likert
Opinion statement on how much you agree versus disagree
5-point continuum:

Strongly Disagree (1) | Disagree (2) | Undecided (3) | Agree (4) | Strongly Agree (5)

OTHER EXAMPLES:
Never | Seldom | Sometimes | Often | Always
Strongly Disapprove | Disapprove | Need more info | Approve | Strongly Approve
Strongly Opposed | Definitely Opposed | A bit of both | Definitely Unopposed | Strongly Unopposed

Semantic Differential (Osgood)

Each concept is rated on a 7-point scale indicating which of two opposite adjectives the
construct is more closely related to
Evaluative: Good vs. Bad; Valuable vs. Worthless; Clean vs. Dirty
Potency: Strong vs. Weak; Large vs. Small; Heavy vs. Light
Activity: Slow vs. Fast; Active vs. Passive; Sharp vs. Dull

Identifies two opposite constructs (good vs. bad), and you identify where you fall on a
scale between these two constructs (good ------- 7-point scale ------- bad)

OTHER EXAMPLES:
Bad - Good
Deep - Shallow
Weak - Strong
Unfair - Fair
Quiet - Loud
Modern - Traditional
Simple - Complex
Fast - Slow
Dirty - Clean

6) Confidence intervals and bands of error


(Vogel, unless otherwise referenced)
Confidence Intervals and Standard Error of Measurement (SEM)
 Error is randomly and normally distributed, so we don't know where a person's true score
falls based on the obtained score.
 We must, therefore, determine the reliability of the test on which the subject receives that
score. This is referred to as the SEM.
 SEM = SD x square root of (1 - r)
 r = internal consistency (reliability) of the measure
 SD = standard deviation of the test scores

We are then able to calculate the confidence interval (see earlier definition II, F, 4) for that
particular measure! This will tell you the range in which the true score is likely to fall
We want to be correct 95% of the time (1.96 SD in either direction, see below distribution), so
the band is: observed score +/- (SEM x 1.96)

Example:
SEM = 1.6
Person's score = 5
1.6 x 1.96 = 3.1
We can say with 95% certainty that the person's true score lies between 5 +/- 3.1, which translates to
between 1.9 and 8.1 on this measure!
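The packet's SEM example can be reproduced in a few lines of Python:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_band(score, sem_value, z=1.96):
    """95% band around an observed score: score +/- 1.96 * SEM."""
    return score - z * sem_value, score + z * sem_value

# Reproducing the example above: SEM = 1.6, observed score = 5.
low, high = confidence_band(5, 1.6)
print(round(low, 1), round(high, 1))  # 1.9 8.1
```

Note how a more reliable test (larger r) shrinks the SEM and therefore narrows the band around the observed score.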
General Rule: 68% of the area of any normal distribution is within one standard deviation of the
mean.
[Figure: Normal distribution with a mean of 50 and standard deviation of 10; 68% of the area is
within one standard deviation (10) of the mean (50).]
General Rule: about 95% of the area of any normal distribution is within two standard deviations
of the mean; 95% is within exactly 1.96 standard deviations.
[Figure: Normal distribution with a mean of 75 and a standard deviation of 10; 95% of the area is
within 1.96 standard deviations of the mean.]
General Rule: 99% of the area of any normal distribution is within three standard deviations of
the mean.
C) Reliability of measurement
(Anastasi,84) Reliability refers to the consistency of scores obtained by the same persons when they are
reexamined with the same test on different occasions, or with different sets of equivalent items, or under
other variable examining conditions. This concept of reliability underlies the computation of the error of
measurement of a single score, whereby we can predict the range of fluctuation likely to occur in a single
individuals score as a result of irrelevant or unknown chance factors.
The concept of reliability has been used to cover several aspects of score consistency. In its broadest
sense, test reliability indicates the extent to which individual differences in test scores are attributable to
true differences in the characteristics under consideration and the extent to which they are attributable
to chance errors. To put it in more technical terms, measures of test reliability make it possible to estimate
what proportion of the total variance of test scores is error variance.
(85) Since all types of reliability are concerned with the degree of consistency or agreement between two
independently derived sets of scores, they can all be expressed in terms of a correlation coefficient.
Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between
two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable
2, the second-best individual in variable 1 is second best in variable 2, and so on, then there would be a
perfect correlation between variables 1 and 2. Such a correlation would have a value of 1.0.
Reliability Coefficient (Vogel)
 Used to interpret group scores by testing the consistency across the group/population NOT
individuals
 Pearson r, Spearman-Brown, KR-20, Cronbach's Alpha, Cohen's Kappa, Session Total, and Intraclass
Correlation

(Anastasi, 90) Correlation coefficients have many uses in the analysis of psychometric data. The
measurement of test reliability represented one application of such coefficients. The use of correlation
coefficients in computing different measures of test reliability will be considered in the Types of
Reliability section.

Change/Difference Scores: Pre-Post


 Unreliable because you are taking items away ("Take away and you will pay!")
 If using this, use SEM or reliability coefficient calculations
 Notoriously unreliable because they don't account for error in both of the scores

1) Types of reliability
(Vogel, unless otherwise referenced)
a) Stability, test-retest
(Anastasi, 91) The most obvious method for finding the reliability of test scores is by repeating
the identical test on a second occasion. The reliability coefficient in this case is simply the
correlation between the scores obtained by the same persons on the two administrations of the
test. (93) The concept of reliability is generally restricted to short-range, random changes that
characterize test performance itself rather than the entire behavior domain that is being tested.
Limitations:
Practice- will probably produce varying amounts of improvement in the retest scores of different
individuals. If the interval between retests is fairly short, the test takers may recall many of their
former responses. Thus, the scores on two administrations of the test are not independently
obtained, and the correlation between them will be spuriously high.
Nature of the test- Only tests that are not appreciably affected by repetition lend themselves to the
retest technique (e.g., sensory discrimination and motor tests- because once you figure it out, you
dont go through all the intervening steps and jump straight to the answer).

Test Retest (stability): consistency of a measure over time


Administer the same test to the same group twice (everyone takes the same test 2x)
Correlation depends on
 Time between administrations (i.e. time frame should be long enough to avoid
practice/carryover effects but short enough so nothing happened to the construct)
 Construct being measured (i.e. if construct is stable like IQ, use a longer interval but
if it is unstable like behavior then use a shorter interval)
Correlate each subject's score X (time 1) with Y (time 2)

b) Equivalence, Parallel forms


(Anastasi, 93) One way of avoiding the difficulties encountered in test-retest reliability is through
the use of alternate forms of the test. The same persons can thus be tested with one form on the
first occasion and with another, equivalent form on the second. The correlation between the two
scores obtained on the two forms represents the reliability coefficient of the test. This is a
measure of both temporal stability and consistency of response to different item samples (or test
forms). This coefficient thus combines two types of reliability. Since both types are important for
most testing purposes, however, alternate-form reliability provides a useful measure for
evaluating many tests.
Limitations:
If the behavior functions under consideration are subject to a large practice effect, the use of
alternate forms will reduce but not eliminate such an effect. If individuals differ in amount of
improvement, owing to extent of previous practice with similar material, motivation in taking the
test, and other factors, the practice effect represents another source of variance that will tend to
reduce the correlation between the two test forms. (95) Finally, alternate forms are unavailable for
many tests, because of the practical difficulties of constructing truly equivalent forms.

May 2007

METHODOLOGICAL CONSIDERATIONS

88

Alternate Form (equivalence): consistency across forms of the same instrument


Used when you create 2 forms of the same test (prevents cheating and practice effects)
Tests are identical in format and construct but have different content!
Generally, one version is self-report and the other is standardized
c) Homogeneity, Internal consistency
 Interval Based: a type of inter-rater reliability. (Each piece of material to be coded is broken
into intervals, and each interval is scored for either an occurrence or nonoccurrence of
what you're interested in.)
Overall Percent Agreement: used for rating single observers; easy to calculate but
inflated by chance agreement
% occurrence agreement = A/(A+B+C) = 1/3 = 33%
% nonoccurrence agreement = D/(B+C+D) = 7/9 = 78%
                         Observer 2
                         Yes        No
Observer 1     Yes       1 (A)      1 (C)
               No        1 (B)      7 (D)

Cohen's Kappa: corrects for chance agreement, giving a smaller, more stringent estimate.
Kappa = (Po - Pe)/(1 - Pe), where Po = (A+D)/(A+B+C+D) is the observed agreement and Pe is the
agreement expected by chance from the marginal totals. Higher values indicate better inter-rater agreement.
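The chance correction can be sketched directly from the 2x2 cell counts used in the text (A = 1, B = 1, C = 1, D = 7, where A and D are the agreement cells); the chance-agreement term comes from the marginal totals:

```python
# Cohen's kappa for the 2x2 agreement table in the text (A=1, B=1, C=1, D=7)
A, B, C, D = 1, 1, 1, 7          # A, D = agreements; B, C = disagreements
n = A + B + C + D

p_o = (A + D) / n                # observed agreement (same as overall % agreement)
# chance agreement: product of the "Yes" marginals plus product of the "No" marginals
p_e = ((A + C) * (A + B) + (B + D) * (C + D)) / n**2
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, p_e, kappa)
```

Note that kappa (.375) comes out well below the raw 80% agreement, which is exactly the "more stringent" behavior the notes describe.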
Session Total (each piece of material to be coded is NOT broken into intervals. Instead, you
get a total session score for the target behavior)
Example: kids swearing- one row per kid, with a session total from Observer 1 and a session
total from Observer 2.
Even though observer 2 always coded more swearing, the kids stayed in their same
relative rank, so the Pearson r stays high despite the systematic disagreement!
 Solution: Intraclass Correlation (more stringent like the Kappa)
Multiple Observers (all data is coded by 2 or more observers and data is then averaged or
summed)
 Calculate reliability of the average not individual scores
 Average correlation of all pairs of observers
 A formula calculates how many raters you will need to get reliable results
 Just like when you add items you increase r, when you add observers you increase r
 Use Spearman Brown
Split Half (homogeneity): consistency between 2 halves of the same instrument
 Systematically (even vs odd, beginning vs end) or randomly split the test items
 If you run a Pearson r, you are essentially correlating only half the items which will
decrease r
Solutions: either add items (longer = stronger) or use Spearman Brown correlation
because it estimates the reliability of the entire test!!
Data layout: one row per subject, with X = score on the even items and Y = score on the odd items; correlate X with Y.
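A minimal sketch of the split-half procedure with hypothetical item responses (the 8-item data below are invented); the Spearman-Brown formula then steps the half-test correlation up to an estimate for the full-length test:

```python
import numpy as np

# Hypothetical item responses (1 = correct) for 6 subjects on an 8-item test
items = np.array([
    [1, 1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
])

odd_half = items[:, 0::2].sum(axis=1)   # odd-numbered items (1st, 3rd, 5th, 7th)
even_half = items[:, 1::2].sum(axis=1)  # even-numbered items (2nd, 4th, 6th, 8th)

r_half = np.corrcoef(odd_half, even_half)[0, 1]
# Spearman-Brown: estimate the reliability of the full-length (doubled) test
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```

Because correlating halves uses only half the items, r_half underestimates the whole test; Spearman-Brown corrects for that ("longer = stronger").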

(Anastasi, 95) From a single administration of one form of a test it is possible to arrive at a
measure of reliability by various split-half procedures. In such a way, two scores are obtained
for each person by dividing the test into equivalent halves. It is apparent that split-half
reliability provides a measure of consistency with regard to content sampling. Temporal


stability of the scores does not enter into such reliability, because only one test session is
involved. This type of reliability coefficient is sometimes called a coefficient of internal
consistency, since only a single administration of a single form is required.

Internal Consistency (homogeneity): consistency between items/content

 Measures how items on a measure are correlated with each other
 Cronbach Alpha (continuous items) or KR-20 (dichotomous items)
 Alpha is the average of all possible split-half reliabilities!!! Remove or add items until you
maximize the alpha value
 Influenced by:
Magnitude (degree of) correlation among items
Length of test (Longer = Stronger)
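A rough sketch of Cronbach's alpha for a hypothetical 4-item scale (the responses below are invented for illustration):

```python
import numpy as np

# Rows = respondents, columns = items on a hypothetical 4-item Likert scale
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 1],
    [4, 5, 4, 4],
], dtype=float)

k = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1)       # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total score
# Cronbach's alpha: high when items covary strongly relative to their own variances
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))
```

With these strongly intercorrelated items alpha comes out above .9, illustrating both influences the notes list: inter-item correlation and test length.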

d) Inter-rater
 Inter-observer/Inter-rater: consistency across scorers
 Used any time there is subjective human judgment (i.e. coding, Rorschach, Olympics, etc)
 Consensual Drift: observers talk and influence each other
 Individual Drift: individual interpretations influence observations over time
 Control drift by calculating reliability during training as well as throughout the study
(retraining/recalibration)!
Data layout: one row per subject, with X = score from scorer 1 and Y = score from scorer 2; correlate X with Y.

(Anastasi, 99) Certain types of tests- notably tests of creativity and projective tests of personality- leave a good deal to the judgment of the scorer. With such tests, there is as much need for a
measure of scorer reliability (inter-rater reliability) as there is for the more usual reliability
coefficients. Scorer reliability can be found by having a sample of test papers independently
scored by two examiners. The two scores thus obtained by each test taker are then correlated in
the usual way, and the resulting correlation coefficient is a measure of scorer reliability.
2) Reliability models
(Vogel, unless otherwise referenced)
a) True-score theory
Classical Measurement Theory (True Score Theory)
 Measurement of a psychological construct will yield a score on a measure (True Score) that is
reasonably stable and fixed
 Observed Score = True Score + Error
 Error is random and normally distributed (just as likely to overestimate or underestimate
scores!)
True Score (Vogt, 328) In measurement theory, a score is thought to consist of the true score plus
or minus random measurement error. If the errors are random, they will cancel one another out in
the long run, after many measurements, and yield the true score. That means that the true score
can be assumed to be the mean of a large number of measurements.
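The idea that random errors cancel out over many measurements can be sketched in a small simulation (the true score of 100 and error SD of 5 are arbitrary illustration values):

```python
import numpy as np

# Classical test theory sketch: Observed = True + random Error.
# Random, normally distributed errors are as likely to overestimate as to
# underestimate, so over many measurements they average out toward zero.
rng = np.random.default_rng(42)

true_score = 100.0
errors = rng.normal(loc=0.0, scale=5.0, size=10_000)  # random measurement error
observed = true_score + errors

print(observed.mean())  # the mean observed score approaches the true score
```

This is exactly Vogt's point: the true score can be taken as the mean of a large number of measurements.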
b) Domain Sampling Model
Generalizability Theory (Domain Sampling Model)
 Rather than having an observed score that is imperfect, we just recognize that we can have a
bunch of different observed scores based on different circumstances
 Under what circumstances would you expect similar or different results
Generalizability Theory (Vogt, 131) An alternate way to estimate reliability suggested by
Cronbach; it identifies the different sources of error in a measure rather than simply estimating
total error. The procedures for doing this are complicated, which may be why most researchers
continue to estimate reliability using simpler statistics, such as Cronbach's (same statistician)
alpha.

Domain Sampling (Vogt, 96) Sampling items, such as questions on a questionnaire, in a
particular subject area or domain.
E.g., a researcher might be interested in the domain of respondents' attitudes toward affirmative
action. Rather than study all responses to all questions that are pertinent, the researcher could take
a sample of the questions in the domain (which, in this usage, is a population of items).
3) Relationship to other features
a) Test length
You want a test to have multiple items because using few or even just one item to assess a
construct makes it notoriously unreliable!!!
LONGER = STRONGER
TAKE AWAY AND YOU WILL PAY
b) Composite of measures
c) Sample selection
4) Relationship of validity
Validity will always be lower because reliability sets limits (maximum validity coefficient)
If unreliable, validity must be low!
If r11 = reliability of measure 1 and r22 = reliability of measure 2, then the maximum possible
validity coefficient is r12(max) = √(r11 × r22)
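The reliability ceiling on validity can be sketched numerically (the two reliability values here are hypothetical):

```python
import math

# Reliability sets a ceiling on validity: the correlation between two measures
# cannot exceed the square root of the product of their reliabilities.
r11 = 0.81   # hypothetical reliability of measure 1
r22 = 0.64   # hypothetical reliability of measure 2

max_validity = math.sqrt(r11 * r22)
print(max_validity)  # the highest validity coefficient these measures could show
```

If either measure is unreliable (say r22 drops toward zero), the ceiling collapses too, which is the "if unreliable, validity must be low" point above.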
a) Standard error of measurement/estimate
Standard Error of Measurement and Confidence Intervals
Used when interpreting individual scores
Std Error of Measurement: estimates a band of error so that we can identify where a person's
true score lies
Observed Score = True Score + Error (randomly and systematically distributed)
Confidence Interval: range around the observed score where the true score is likely to fall
See above section for more details!
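A small sketch with hypothetical values (an IQ-style SD of 15 and a reliability of .91 are assumed purely for illustration):

```python
import math

# Standard error of measurement from the test SD and its reliability,
# then a 95% confidence interval around an individual's observed score.
sd = 15.0          # hypothetical test standard deviation
reliability = 0.91 # hypothetical reliability coefficient
observed = 110.0   # an individual's observed score

sem = sd * math.sqrt(1 - reliability)                    # band of error
ci_low, ci_high = observed - 1.96 * sem, observed + 1.96 * sem
print(round(sem, 2), (round(ci_low, 2), round(ci_high, 2)))
```

The higher the reliability, the smaller the SEM and the tighter the interval in which the true score is likely to fall.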
b) Correction of attenuation
Reliability Issues/Attenuation
Test Length (longer = stronger, take away and you will pay!)
Restriction of Range: this occurs when you do not have the full range of scores for that
population (e.g., you only test experts); the correlation is then attenuated (lowered)
[Four scatterplots of IQ (x-axis) vs. Achievement (y-axis) illustrate restriction of range:]
When you consider all data points, the correlation is r = .70. There is no restriction of range here
because you have included the entire range of scores.
When you consider only the gifted students, the correlation decreases to r = .28. There is a
restriction of range, such that you did not include half of the scores.
When you consider only the honor students, the correlation decreases to r = .30, again because of
restriction of range.
When you consider only the gifted and honor students, the correlation drops to r = .09, because
you did not include the lower IQ/Achievement scores.
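The attenuation from restriction of range can be sketched in a simulation (the population correlation of .70 echoes the example above; all other numbers are arbitrary):

```python
import numpy as np

# Simulated IQ/achievement data with a population correlation of about .70,
# then the correlation recomputed after restricting to the top half on IQ.
rng = np.random.default_rng(0)
n = 5000
iq = rng.normal(100, 15, n)
achievement = 0.7 * (iq - 100) / 15 + np.sqrt(1 - 0.7**2) * rng.normal(0, 1, n)

r_full = np.corrcoef(iq, achievement)[0, 1]

top_half = iq > np.median(iq)              # restricted sample: high-IQ students only
r_restricted = np.corrcoef(iq[top_half], achievement[top_half])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))
```

Even though the underlying relationship is unchanged, throwing away the lower half of the range visibly shrinks the observed correlation.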

D) Validity
(Vogel, unless otherwise referenced)
1) Face
(content NOT scores)
 Does the content look appropriate for the construct being measured? Does it appear to measure
what it is supposed to measure?
 Non-experts make a superficial, global judgment just by reviewing the items
 No stats used, just a Yes or No answer
 May influence consumers and may be a bad idea (malingering)
(Chantelle)

Face validity: does the test appear to measure what it is supposed to measure? (Do the items
superficially appear to assess what the test is really measuring?) It involves only a superficial
look-see, and the consumer does the evaluating.

2) Content
(content NOT scores)
 Does the content look appropriate for the construct being measured? Does it appear to measure
what it is supposed to measure?
 Experts make formal judgment based on reviewing the items
 Consider:
 Is the appropriate content included?
 Is the inappropriate content excluded?
 Is there a good balance of appropriate content? (depression measure must include somatic,
cognitive, and behavioral items)

Content Validity: a more formal, systematic evaluation that is related to one of 3 questions:
 Have I included the appropriate content?
 Is my instrument free of irrelevant content (have I left out the stuff that should be left out)?
 Are these elements in appropriate balance?
The people contributing to the content evaluation are experts- those who have a great deal of
experience with the domain or construct being tested- which makes this a more systematic,
formal evaluation than face validity.


3) Criterion
Criterion-Related Validity (Vogt, 67) The ability of a test to make accurate predictions. The name
comes from the fact that the test's validity is measured by how well it predicts an outside criterion.
Also called predictive validity. See also concurrent validity, which is often held to be another
aspect of criterion-related validity. E.g., the extent to which students' SAT scores predict their college
grades is an indication of the SAT's criterion-related validity.
(Vogel) (utility of measure)
 Can scores be used for a certain purpose?
 Ideally, we would like a perfect correlation!
 Types of Criterion Related Validity
 Concurrent: assess the criterion (outcome) and give the measure at the same time
If the measure predicts the criterion and is simpler, less expensive, or quicker, then we
probably want to start using that measure!
 Predictive: give the measure now and assess the criterion (outcome) in the future
Predicts future performance (GRE or GPA predicts success in grad school so they can be
used during admission process)
Also used in selecting personnel for employment (Taylor Russell Tables assess the base
rate of successful employees hired, selection ratio of how many positions are available,
and the validity of the measure being used)
Problem: criterion contamination (knowledge of a score influences the outcome) i.e.
Professor knows GPA/GRE score which influences your class grade
a) Forecasting efficiency
b) Correcting for range restriction
c) Validation strategies
4) Construct
(See definition in section I, G, 3) The extent to which variables accurately measure the constructs of
interest. How well are the variables operationalized?
Construct Validity (meaning of scores, utility of test)
 Can you interpret the scores as meaning what you want them to mean?
 You never really want a perfect correlation because you are trying to create a new measure!
 Types of Construct Validity:
a) Convergent validity
(Vogt, 63) The overlap between different tests that presumably measure the same construct.
(Vogel) Convergent: correlation between scores on 2 different measures that assess the same
construct (want a high correlation)
Data layout: one row per subject, with X = measure 1 score and Y = measure 2 score
b) Divergent validity
(Vogel) Discriminant/Divergent: correlation between scores on 2 different measures that assess 2
different constructs (depression versus antisocial pd); want low correlation
Data layout: one row per subject, with X = construct 1 score and Y = construct 2 score
c) Discriminant validity
(Vogt,93) A measure of the validity of a construct that is high when the construct fails to correlate
with other, theoretically distinct constructs. Discriminant validity is often called divergent
validity and is the mirror image of convergent validity.

E.g., suppose researchers are writing a questionnaire containing several questions designed to
measure the construct "patriotism." They worry that respondents may just be giving the answers
they think are proper or that they think the researcher wants to hear (social desirability bias). So
the researchers include questions that measure the construct "socially desirable responding." If the
two measures were not correlated, the measure of patriotism would have more discriminant
validity; that is, it would be unrelated to a measure of something to which it would not be related
if it were valid.
(Vogel) Discriminative: do scores on my measure differentiate between groups known or
expected to differ on the construct? (want the groups to differ clearly)
Depressed versus Not Depressed
d) Multitrait-multimethod matrix
 When you assess 2 or more constructs with the same method (self-report, report-by-others,
etc.) you need to make sure your method doesn't affect the scores; some people may naturally
rate everything in extremes, and measures may correlate simply because of the format of
the test
 Use convergent validities (same construct) and discriminant validities (different constructs)
 Convergent should always be higher than discriminant, because convergent validity involves 2
measures of the same construct whereas discriminant validity differentiates between 2 constructs!
 Heterotrait Monomethod (2 constructs, 1 method of assessment) triangle should have low
discriminant correlations. If they are high, then you should be worried about shared method
variance (the method contributes to the correlation instead of the construct contributing)
Summary Review Notes: (Chantelle)
MTMM Essentials
Convergent validity
Discriminant validity
Shared method variance: whenever you calculate a correlation it is based on two scores, X and Y,
but we must also take into account the method of measurement. At least 2 things contribute
to the relationship between X and Y: the real relationship, and the method of measuring
the X and the Y.
If X and Y share the same method, there is some indication that the shared method may
contribute to the correlation between X and Y (e.g., measuring anxiety and depression through
self-report measures).
Something about the method may influence the pattern of variance among the scores and
thus may inflate correlations.
It is a possibility, but it does NOT always inflate correlations.
It is a possibility but it does NOT always inflate correlations
What do we need to do a MTMM Analysis
We need unrelated constructs (we need at least 2)
We need to assess each of the constructs with different methods (sometimes when using
ratings by others will work if you use different informants)
Assess each construct assessed by at least 2 methods
Several principles for examining convergent and discriminant validity:
Convergent validity correlations > discriminant validity correlations (convergent should always
be higher):
 when the r's are based on an X and Y assessed with different methods,
 when the discriminant validity r's are based on scores assessed with 2 different methods, and
 when the discriminant validity is based on two scores assessed with the same method.
Patterns of relationships among pairs of scores should not be method dependent:
if A and C are more highly related than B and C, the method of assessing A, B, and C should not
matter.

Four Steps in looking @ MTMM Analysis:


1) Look at the convergent validity values (if you can answer yes to both of these questions, then
we can go on to step 2)
a. Validity values need to be significant and large enough to inquire further about validity
(needs to meet both criteria); these are found in the validity diagonal
b. We need to establish a cut-off criterion; generally .60 is a good cutoff for convergent
validity coefficients
1. .60 or higher = excellent
2. .50-.60 = adequate or good
3. .35-.50 = marginal to adequate
4. For the purposes of this class, if correlations are below .35 we will not look further
2) The next step looks at discriminant validity (going trait by trait): ask whether the convergent
validity correlations are higher than the discriminant validity correlations
a. Take the convergent validity value and compare it with the HTHM triangle correlations that
relate to the specific trait we are examining- the row and column of the convergent validity
correlation
b. We can test for the difference between 2 dependent correlations to see whether there is a
significant difference between the 2 correlations
c. Depends on the sample sizes
d. We will use a difference of >= .10 as the criterion for considering the discriminant and
convergent validity correlations statistically different
3) Look at the discriminant validity coefficients (HTMM) vs. the convergent validity
correlations, wanting convergent > discriminant correlations
a. May get inflated correlations due to shared method variance
b. In the overall statement about discriminant validity we would state specifically which
monomethod and heteromethod comparisons of convergent vs. discriminant validity
yielded differences
When discriminant correlations are not significant on their own, you can compare the
convergent validity coefficients to the lowest significant discriminant correlation to see if there is
a significant difference there
4) Looking at whether the pattern of trait relationship is the same regardless of the method that
is being used to obtain a correlation
a. We look at correlation pairs between different constructs/traits
b. We lay out the correlations in a certain way
Correlation    Lower    Upper    Method   Method
between        HTHM     HTHM     1        2
A-B            .22      .22      .51      .68
A-C            .11      .09      .38      .59
B-C            .11      .10      .37      .58

In each column: A-B > A-C, A-B > B-C, and B-C = A-C.

This pattern repeating is evidence for discriminant validity


This method is dependent on all the measures working well in order to do step 4
Shared Method Variance
When assessing shared method variance, we go method by method rather than trait by trait


Step 1: Look at the magnitude of the correlations in the monomethod triangle. We do not want
these correlations to be high (these correlations might be high due to 2 potential reasons:
1) shared method variance
2) picked bad constructs)
How do we know whether we have a problem or not?
STEP 1
The first thing you do to evaluate shared method variance is to examine Method 1 and isolate its
HTMM triangle: if we picked our constructs well, the correlations should be close to zero/very
low; if they are not, it could be due to shared variance from using the same method
Ideally discriminant correlations should be lower than .3 in order to indicate good
discriminant validity
STEP 2
In order to eliminate the possibility that we may have picked our constructs poorly, we then
go back to step 3: comparing the convergent validity values with the monomethod triangles
for the method we are examining
We are looking across traits within methods
The convergent validity diagonal should be higher than the discriminant HTMM coefficients, and
we look to see whether the convergent validity values are significantly higher in the right
direction
One by one we take each convergent validity coefficient and compare it to the HTMM scores
STEP 3:
We still may not know whether we have shared method variance, but we can go back to the last
step (step 4): there are 2 sets of HTMM triangles and 2 sets of HTHM triangles, and we take all
of the HTHM triangles and compare them to the HTMM for method 1.
The HTMM correlations are our potential shared-method-variance correlations; if they are
systematically and significantly higher than the lower HTHM and upper HTHM correlations, we
have strong reason to believe that we have a problem of shared method variance
5) Incremental validity
E) Test Construction
1) Item creation
 Dichotomous: 2 choices
 Polychotomous: multiple choices
 Scaling and Types of Data (see previous sections)


2) Item analysis
[See Summary of this section in Appendix]
a) Difficulty
 Item Difficulty (How hard is the item?)
 Problems w/ Pass vs. Fail
Does not give us range
 Problems with % passing:
Dependent on overall ability level of the sample
Percentages are not based on interval scales so you must convert the % into
standard deviation units (the difference between 10-20% is not the same as the
difference between 50-60%)
 % Ideals
Not 50-50 because we want a range of performance
Moderate internal consistency because if all items perfectly correlate then
subjects either get all right or all wrong
Bell curve where average person gets 50 right, 50 wrong
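The conversion of percent passing into standard-deviation units mentioned above can be sketched with the standard normal inverse CDF; this assumes normally distributed ability and uses the 10-20% vs. 50-60% comparison from the text:

```python
from statistics import NormalDist

# Convert percent-passing into z (standard-deviation) units, assuming
# normally distributed ability. Equal percentage gaps are NOT equal z gaps.
z = NormalDist().inv_cdf

for p in (0.10, 0.20, 0.50, 0.60):
    print(p, round(z(p), 2))

gap_low = z(0.20) - z(0.10)   # the 10% vs. 20% passing gap, in z units
gap_mid = z(0.60) - z(0.50)   # the 50% vs. 60% passing gap, in z units
print(round(gap_low, 2), round(gap_mid, 2))
```

The same 10-percentage-point difference is much larger in z units near the tail (10-20%) than near the middle (50-60%), which is why raw percentages can't be treated as an interval scale.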
b) Discriminability
 Item Discriminability (extent to which an item reflects or differentiates subjects with
varying amounts of traits being assessed)
 Methods of Assessment
 Mastery: items passed by experts
 Empirical Item (criterion) Keyed: assess performance on item in relation to an
outcome or incident (i.e. MMPI, discriminative validity)
 Total Score (extreme groups): compare the proportion of high scorers (upper 25%)
passing the item with the proportion of low scorers (lower 25%); the larger the
difference, the better the item discriminates
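One common way to compute such an extreme-groups discrimination index can be sketched as follows (the total scores and item responses below are invented):

```python
import numpy as np

# Extreme-groups discrimination index: proportion of top-25% scorers passing
# an item minus the proportion of bottom-25% scorers passing it.
total = np.array([55, 60, 62, 70, 72, 80, 85, 90, 93, 95, 40, 45])  # total test scores
item  = np.array([ 0,  0,  1,  0,  1,  1,  1,  1,  1,  1,  0,  0])  # 1 = passed the item

order = np.argsort(total)        # subjects sorted by total score
k = len(total) // 4              # size of each extreme group (25%)
low, high = order[:k], order[-k:]

d_index = item[high].mean() - item[low].mean()
print(d_index)  # positive: high scorers pass the item more often than low scorers
```

An index near zero means the item fails to separate strong from weak test takers; strongly positive values mark items that discriminate well.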

Item Response Theory


(Vogt, 160) A group of methods designed to assess the reliability, validity, and difficulty of items
on tests. The assumption is that each of the items is measuring some aspect of the same
underlying (latent) ability, trait, or attitude. IRT is important for determining equivalency of
tests (such as two versions of the same proficiency exam) as well as for determining
individuals' scores. IRT uses a version of logistic regression as its basic tool, with the
dependent variable being the log of the odds of answering questions correctly.




Provides info about item difficulty as well as discriminability while avoiding the problem
of sample dependence found in % passing
Every item gets its own Item Characteristic Curve (ICC), which describes its
performance.
 Item response theory uses item characteristic curves to assess difficulty and
discriminability.
Give a test to 1000 people and divide them into subgroups based on their score
Subgroup   Total Score   # Subjects   % Right on Item 1
1          0-20          50           2
2          21-40         50           6
3          41-60         100          10
4          61-80         150          10
5          81-100        150          15
6          101-120       200          25
7          121-140       100          50
8          141-160       100          60
9          161-180       50           60
10         181-200       50           80
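The curve behind such data is often modeled as a two-parameter logistic ICC; here is a minimal sketch (the discrimination a = 1.5 and difficulty b = 0 are arbitrary illustration values, not estimates from the table above):

```python
import math

# Two-parameter logistic item characteristic curve:
#   P(correct) = 1 / (1 + exp(-a * (theta - b)))
# b = difficulty (the ability level at which P = .50)
# a = discrimination (the slope of the curve at theta = b)
def icc(theta, a=1.5, b=0.0):
    return 1 / (1 + math.exp(-a * (theta - b)))

# At theta == b the examinee has a 50/50 chance; far below b the probability
# approaches 0, and far above b it approaches 1.
print(icc(0.0), icc(-3.0), icc(3.0))
```

Raising a steepens the curve (sharper discrimination around b); shifting b moves the whole curve along the ability axis (a harder or easier item).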

[ICC plot for item 1: x-axis = total score on the entire exam (subgroups 1-10), y-axis = percent
that passed item 1 (0-80%).]
You can see from this plot that students reach 50% passing on item 1 at around a 70% score on
the exam, which tells you the difficulty of the item. You can also see that the item doesn't start to
really discriminate students from each other (slope) until between the 5-8 total-score subgroups.
So, 50% of students start passing item 1 if they got a 70% or better on the exam, and item 1 best
discriminates between students who know at least some of the material!

This is a very bad situation! As students are scoring higher on the test they are failing item
1!!! This makes no sense. If you see an ICC like this you have to wonder why the poor students
get it right but the good students do not!!
[Two more ICC plots: % passing (y-axis) vs. Total Score on the Exam (x-axis).]

This graph shows that the item differentiates more for the poor students and
the good students are all doing the same (passing the item).

This graph shows us that this item on the exam differentiates well for
the more gifted students while all the poor students are getting it
wrong.

Advantages of these ICCs: can plot 2 items side by side and compare
(Item Test Regression)
 Picture another curve on the plot above, one being for item 1 and another for item 2 on
the exam. You can then compare different items on the test regarding their difficulty and
discriminability.

Summary Review Notes: (Chantelle)


Advantages of ITEM RESPONSE THEORY:
1) Indices of item difficulty and discrimination
2) Independent of overall ability of sample (it assumes that you have a full range of the
attributes in question >> you have to have sufficient variability level in your sample)
3) Interchangeability of items ( 2 items of different content can function exactly the same)
5) Efficient testing at different ability levels (we assume that once you reach your 50/50 level
you have reached the individual's truest ability level)
a. Greater reliability with fewer items!
5) Easily compare new items to old items (not sample dependent)
6) Use to test for biases in testing

7) Can be used to select tests that are sensitive across ability levels or at particular points in the
scale
Assumptions of Item Response theory:
*assumes that the construct you are assessing is unidimensional
*it requires a large number of items (at least 20)
*a large number of people (at least 100)
c) Chance
Summary Review Notes: (Chantelle)
Someone thought it would be a good idea to break up overall percent agreement into
1) Percentage occurrence agreement
2) Percent nonoccurrence reliability
Whether a statistic is inflated by chance is still not accounted for directly with this parceling out
of percent agreement into reliability of occurrence/nonoccurrence
The statistic that corrects for chance agreement is Kappa: if you get anything over 1 then
you have done it wrong, because Kappa yields a statistic within the correlation range (0-1) that is
expressed as a decimal rather than as a percentage.
Kappa is generally going to be lower than percent agreement ---it will only be the same with
100% perfect agreement >>>the higher the better
In general the agreement statistic of choice should be Kappa because it takes chance agreement
into account (it will be more conservative)
d) Homogeneity
Summary Review Notes: (Chantelle)
Other things to look at when having a problem with a test:
1) we may be assessing different facets of the same construct, and thus may need to do an
exploratory factor analysis to see whether items are grouping together in different ways
2) we may have bad questions
3) our questions may be assessing different constructs
Alpha is meant to be a measure about content homogeneity
Alpha is NOT a perfect measure>>it can be high and have some bad items in it & contain items
that are not homogeneous with the construct
How could that happen?
Alpha is influenced by 2 things:
> a large number of items can inflate alpha (elevating it to where the good items disguise the bad)
> the degree of inter-item correlation

An alpha can be high and you might still have some sub-domains within your total score, if you
are assessing 2 parts of a construct and the two parts are correlated (sub-scale scores).
There is such a thing as too high an alpha: you have to be wary, depending on the nature of what
you are assessing. A multifaceted construct like anxiety (physiological, cognitive, and behavioral
components) has parts that are all related but should not be perfectly related.


3) Test bias
 Assess Face Validity: do items look biased toward one group or another? Are any items
offensive or demeaning?
 Content Validity Problems
 Unnecessary content (some or all content irrelevant for group investigated, i.e. measuring
depression in other cultures)
 Inadequate Content (fails to sample important aspects consistent for that group)
 Solutions:
 Have cultural informant assess content validity
 Use only culturally relevant content
 Use only culture-common content (no unique symptoms for that population, may miss
important concepts though!)
 Common + Unique items used (Use mix of content but score according to cultural relevance)
Item Bias
 Do items discriminate differently for different groups?
 Draw an ICC for each ethnicity; if the curves look the same, then the item performs equally well
for each group
For the below graphs, imagine that the first one is for Caucasian individuals. The second represents
African Americans. The third represents Pacific Islanders. For each different ethnicity, the ICC for
item 1 on an exam differentiates differently and the difficulty also varies among ethnicities!!!! Watch
out for this!!!

[Three example ICCs for item 1, one per group: Caucasian, African American, Pacific Islander.]

Slope Bias
 When a measure has different criterion validity coefficients for different groups (predicts
differently for them)
 Foreign-born versus US-born (GRE used to predict grad school performance)
 e.g., the GRE is a better predictor of grad school success for US-born students than for
foreign-born students
[Two scatterplots of GRE Scores (x-axis) vs. Performance in School (y-axis), each with separate
US Born and Foreign born regression lines.]
This first graph shows that using the GRE to predict performance in grad school is good for the US
students but not foreign born students! The second graph shows that using the GRE to predict
performance is just as good for both groups of students!


Intercept Bias
 Test scores systematically over-predict or under-predict on criterion variable
[Two scatterplots of GRE Scores (x-axis) vs. Performance in School (y-axis), each with separate
US Born and Foreign born regression lines.]
The first graph shows that using GRE norms would over-predict for US-born and under-predict for
foreign-born students! The second graph shows very little difference in the prediction for the two
groups of students.
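Slope bias can be sketched by fitting a separate regression line per group (all numbers below are invented for illustration):

```python
import numpy as np

# Fit a separate regression predicting grad-school performance from GRE score
# for each hypothetical group, then compare the slopes.
gre_us       = np.array([140, 150, 160, 165, 170], dtype=float)
perf_us      = np.array([2.6, 3.0, 3.4, 3.6, 3.8])   # strong linear relationship
gre_foreign  = np.array([140, 150, 160, 165, 170], dtype=float)
perf_foreign = np.array([3.1, 3.2, 3.2, 3.3, 3.3])   # much flatter relationship

slope_us, intercept_us = np.polyfit(gre_us, perf_us, 1)
slope_f, intercept_f = np.polyfit(gre_foreign, perf_foreign, 1)

# A clearly steeper slope for one group = slope bias: each GRE point buys more
# predicted performance for that group than for the other.
print(round(slope_us, 3), round(slope_f, 3))
```

Intercept bias would show up the same way, but as a gap between `intercept_us` and `intercept_f` with similar slopes: the test would systematically over- or under-predict for one group.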
Psychometric Equivalence
 Do scores have the same reliability and validity values across populations?
 Problem: we may have limited data for certain groups
 Solution:
 Make sure reliability and validity for at least one group is good
 Establish rel/val for other group (time consuming)
 Build tests of psychometrics in study (problem with Generalizability)
 New measure with culturally appropriate info (time consuming and costly!)
 Existing measure with culturally appropriate info (culturally fair but redo psychometrics)
F) Test (Basic features of various types, not specific tests)
1) Cognitive
a) Intelligence
Latent construct which is composed of several variables
Norm referenced (tests individual differences)
 Stanford Binet, Wechsler IQ (WAIS, WISC), Kaufman Scales (K-ABC, KAIT, K-BIT)
b) Aptitude
Cumulative influence of multiple experiences of daily living which then predicts subsequent
performance
Could this individual benefit from more experience or have they achieved quality
performance
Determine ability, skill level (e.g. a quiz to measure ability at the start of a class)
c) Achievement
Effect of specific program on instruction/training (i.e. course exams)
Final evaluation
d) Neuropsychological
Indicates organicity (brain damage)
Tests concepts such as spatial skills and memory
Flexible Battery Approach: choose test specific to the presenting problem, however, you may
miss the big picture
Standardized Battery Approach: combination measures all neuropsychological abilities to
detect brain damage


2) Sentiments
a) Interests
Looks for individual areas of interest and compares them to those of people in different occupations,
fields, specialties, etc.
b) Values
Often referred to as Interests or Attitudes
Little research distinguishing values from these other terms
c) Attitudes
Tendency to react favorably or unfavorably
Scales indicate direction and intensity of attitude toward a stimulus
Thurstone, Guttman Scaling, and Likert Scales
3) Vocational
4) Personality
a) Objective
Structured
Normed
Self Report
Concerns about social desirability and response sets (tendency to respond in a certain way
regardless of item content)
b) Projective
Unstructured directions and responses
Stimuli are vague/ambiguous
A person's perception and interpretation reflect fundamental aspects of their personality; they
project their thoughts, needs, anxieties, conflicts, etc. onto the stimuli
Global approach to personality, NOT the study of separate traits
Covert, latent, unconscious (UCS) aspects of personality
G) Utility of Tests
1) Taylor-Russell model
Used when we want to use a measure to predict which applicants for a job will be successful
(employee selection process)
This model calculates the percentage of selected employees that will be successful
Depends on:
- Base rate: % of employees successful prior to the use of the test (not too low, because we'll get many false positives; not too high, because then we can't improve)
- Selection Ratio: number of total positions to be filled (as this increases, we profit less from testing)
- Validity Coefficient: correlation between measure and outcome (obtained from previous studies; as this increases, we profit more!)
Summary Review Notes: (Chantelle)
Taylor-Russell Tables: intended to help you know how much your measure will improve your
decision making
In order to use them, you have to have 3 things:
1) We have to know our criterion-validity correlation (study yielding scores on our measure and
the outcome of interest)
2) Need to know the selection ratio (percentage of applicants that we are going to hire)
>>knowing the percentage of people that can be selected
3) What is the base rate of the outcome of interest
There is a different table for every base-rate for the outcome of interest

The tables tell us that if we use a measure with these characteristics, this is the percent of
people that we would expect to succeed
In the absence of any other information, your best guess is the base rate
As the validity goes up, so does the proportion of accurate predictions for the sample of
interest
As the selection ratio goes up and you have to take more and more people, the ability
of the test to predict well goes down

How helpful a measure is to us depends on 3 things:


1) Base Rate
2) Selection Ratio
3) Validity of the test

These are not often used because in psychology we typically do not work with a fixed selection ratio
We can instead use an approach that applies to a single measure that is trying to predict a
dichotomous outcome
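The Taylor-Russell logic above can be illustrated by simulation rather than the published tables: generate a predictor and a criterion correlated at the validity coefficient, define "success" by the base rate, hire by the selection ratio, and see how the success rate among those hired rises above the base rate. A hedged sketch (the function name and all numbers are illustrative assumptions, not the actual tables):

```python
# Hedged simulation sketch of the Taylor-Russell idea (not the published tables):
# with a valid predictor, the success rate among selected applicants exceeds
# the base rate, and more so as validity rises or the selection ratio shrinks.
import math
import random

def selected_success_rate(validity, selection_ratio, base_rate,
                          n=20_000, seed=1):
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        x = rng.gauss(0, 1)                      # predictor (e.g., test score)
        y = validity * x + math.sqrt(1 - validity ** 2) * rng.gauss(0, 1)
        scores.append((x, y))                    # y: criterion performance
    ys = sorted(y for _, y in scores)
    y_cut = ys[int((1 - base_rate) * n)]         # top base_rate fraction "succeed"
    xs = sorted(x for x, _ in scores)
    x_cut = xs[int((1 - selection_ratio) * n)]   # hire the top selection_ratio fraction
    hired = [(x, y) for x, y in scores if x >= x_cut]
    return sum(1 for _, y in hired if y >= y_cut) / len(hired)
```

With validity .50, selection ratio .20, and base rate .50, the simulated success rate among the hired comes out well above the .50 base rate, which is exactly what the tables quantify.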

2) Naylor-Shine model
Used to predict expected mean performance of the group you hire
Depends on:
- Selection ratio (number of positions)
- Validity Coefficient (correlation between measure and outcome)


SUMMARY TABLES

RESEARCH DESIGN

PRE-EXPERIMENTAL (OBSERVATIONAL)
Definition: Observational designs are purely descriptive in nature (qualitative), and there are no requirements for random assignment of individuals, independent variables, or control groups. This design uses many different techniques, such as observation, interviews, and records. It is usually conducted to collect preliminary information that may lead to specific IVs, DVs, and hypotheses about relationships. The biggest problem with observational designs is poor external validity (generalizability).
Subtypes: Case Study; Case-Control; Cross-Sectional; Retrospective Cross-Sectional; Cohort Designs (single-group, multiple-group, and accelerated-group cohort designs).

QUASI-EXPERIMENTAL
Definition: Similar to experimental designs; however, at least one IV is static (e.g., gender). This design approximates the control offered by experimental designs, but true random assignment is not possible. It is a type of research design for conducting studies in field or real-life situations where the researcher may be able to manipulate some independent variables.
Must have: At least one static IV.
Statistical analysis: t-test, ANOVA, ANCOVA, MANOVA.

EXPERIMENTAL
Definition: The greatest strength of an experimental research design, due largely to random assignment, is its internal validity: one can be more certain than with any other design about attributing cause to the independent variables. The greatest weakness of experimental designs may be external validity: it may be inappropriate to generalize results beyond the laboratory or other experimental context.
Must have: Random assignment; at least one manipulated IV (IVs must be manipulated!); a control group.
Sources of error: Error variance; covariance; between-group versus within-group designs.
Statistical analysis: ANOVA, ANCOVA, Regression, MANOVA.
Subtypes: Pretest-Posttest Control Group Design; Posttest-Only Control Group Design; Solomon Four-Group Design; Time Series Design; Factorial Design; Crossover Designs; Counterbalanced Designs; Multiple-Treatment Counterbalanced Designs; Latin Squares.

THREATS TO VALIDITY SUMMARY

INTERNAL
Reasons why inferences that the relationship between two variables is causal may be incorrect:
- Ambiguous Temporal Precedence
- Selection
- History
- Maturation
- Regression
- Attrition
- Testing
- Instrumentation
- Additive and Interactive Effects of Threats to Internal Validity

EXTERNAL
Reasons why inferences about how study results would hold over variations in persons, settings, treatments, and outcomes may be incorrect:
- Sample Characteristics
- Stimulus Characteristics & Settings
- Reactivity of Experimental Arrangement
- Multiple Treatments
- Novelty Effects
- Experimenter Expectancy
- Reactivity of Assessment
- Test Sensitization
- Timing of Measurement

CONSTRUCT
Reasons why inferences about the constructs that characterize study operations may be incorrect:
- Inadequate Explication of Constructs
- Construct Confounding
- Attention and/or Simple Contact with Participants (Hawthorne Effect)
- Single Operations and Narrow Stimulus Sampling
- Experimenter Expectancy Effects
- Demand Characteristics or Cues in the Experimental Setting

STATISTICAL CONCLUSION VALIDITY
Accuracy of conclusions about covariation (or correlation) made on the basis of statistical evidence; reasons such inferences may be incorrect:
- Low Statistical Power
- Violated Assumptions of Statistical Tests
- Fishing and Error-Rate Problems
- Unreliability of Measures
- Restriction of Range

RELIABILITY AND VALIDITY

RELIABILITY
- Test-Retest: assesses stability over time. Statistics: Pearson r (time 1 with time 2).
- Interrater/Interobserver: assesses consistency across scorers. Statistics: % agreement; Cohen's kappa (corrects for chance agreement); Session Total (rank data); Intraclass Correlation (rank data); Spearman-Brown (>2 observers).
- Alternate Form: assesses equivalency across forms. Statistics: Pearson r.
- Split Half: assesses homogeneity of construct. Statistics: Pearson r (correlation for half of the test); Spearman-Brown (correlation for the whole test).
- Internal Consistency: assesses homogeneity of construct (broader than split half). Statistics: KR-20 (dichotomous items); Cronbach's Alpha (continuous items).
(Vogel)

VALIDITY
- Face: assesses content (superficial judgment).
- Content: assesses content (formal expert judgment).
- Construct: assesses meaning. Subtypes: Convergent; Discriminant/Divergent; Discriminative; Other (Factorial).
- Criterion-Related: assesses utility. Subtypes: Concurrent; Predictive.
(Vogel)
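The internal-consistency statistics in the table can be computed directly. A minimal pure-Python sketch with made-up item scores (not from any real scale); the function names are illustrative:

```python
# Hedged sketch: split-half (via Spearman-Brown) and Cronbach's alpha
# for a tiny invented data set; do not treat the numbers as real norms.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_brown(r_half):
    # step-up formula: reliability of the full test from the half-test correlation
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(items):
    # items: one list of scores per item, respondents in the same order
    k = len(items)
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    totals = [sum(col) for col in zip(*items)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))
```

For example, `spearman_brown(pearson_r(odd_half, even_half))` estimates full-test reliability from a split-half correlation.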


CORRELATION STATISTICS
(RELATIONSHIP BETWEEN 2 VARIABLES)

- Pearson (DV 1: score; DV 2: score): Is there a relationship between IQ score and GRE score?
- Spearman (DV 1: rank; DV 2: rank): Is there a relationship between shirt size (S, M, L, XL) and contest results (1st, 2nd, 3rd place)?
- Point biserial (DV 1: score; DV 2: category, dichotomous): Is there a relationship between gender and achievement score?
- Chi-Squared (DV 1: category; DV 2: category): Is there a relationship between age (old, young) and treatment group (CBT, psychotherapy, meditation)?
- Phi Coefficient (DV 1: category; DV 2: category): association between two variables when BOTH are categorical and one or both are dichotomous.
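The score-versus-rank distinction in the table is easy to see in code: Spearman's rho is just Pearson's r computed on ranks, so it treats any monotonic relationship as perfect. A pure-Python sketch with made-up data:

```python
# Sketch: Pearson r on raw scores vs. Spearman rho on ranks (invented data).
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ranks(v):
    # average ranks, with ties sharing the mean of their rank positions
    order = sorted(range(len(v)), key=lambda i: v[i])
    out = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman_rho(x, y):
    # Spearman = Pearson computed on the ranks
    return pearson_r(ranks(x), ranks(y))
```

On the monotonic but curved pair x = 1..4, y = x², Spearman gives exactly 1 while Pearson falls just short of 1.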

RELATIONSHIPS OF PREDICTION
- Linear Regression: one predictor -> one outcome.
- Multiple predictors -> one outcome:
  - If the outcome is categorical: Logistic Regression or Discriminant Analysis.
  - If the outcome is a score: Multiple Regression.
- Canonical Correlation: multiple predictors -> multiple outcomes.

PARAMETRIC & NON PARAMETRIC STATISTICS

PARAMETRIC (statistical techniques for data that approximate a normal distribution and are measurable with interval or ratio scales)
Assumptions: independence of observations; normality; homogeneity of variance.
Types of data analyzed: interval; ratio; continuous scores.
Statistics: t-tests; ANOVA; ANCOVA; magnitude of effect / effect size; correlational techniques; Multiple Regression; Factor Analysis; Discriminant Function Analysis and Logistic Regression; MANOVA; MANCOVA.

NON PARAMETRIC (statistical techniques for data that are NOT normally distributed and are measurable with nominal or ordinal scales)
Assumptions: none about the shape of the population distribution.
Types of data analyzed: nominal; ordinal.
Statistics: Chi-Square / Z test for proportions; Kruskal-Wallis; Friedman Test; Mann-Whitney U Test; Wilcoxon Test; Kolmogorov test.

NONPARAMETRIC TESTS TO ANALYZE DATA IN CATEGORIES AND BY RANKS

TO ANALYZE DATA ORGANIZED IN CATEGORIES
- Chi-square one-sample test: used to determine whether the number of occurrences across categories is random. Sample question: Did brands Fruities, Whammies, and Zippies each sell an equal number of units during the recent sale?

TO ANALYZE DATA ORGANIZED BY RANKS
- Kolmogorov-Smirnov test: used to see whether scores from a sample came from a specified population. Sample question: How representative is a set of judgments from a sample of children of the entire elementary school they attend?
- Mann-Whitney U test: used to compare two independent samples. Sample question: Did the transfer of learning, measured by number correct, occur faster for Group A than for Group B?
- Wilcoxon rank test: compares the magnitude as well as the direction of differences between two groups. Sample question: Is preschool twice as effective as no preschool experience for helping develop children's language skills?
- Kruskal-Wallis one-way analysis of variance: compares the overall difference between two or more independent samples. Sample question: How do rankings of supervisors differ between four regional offices?
- Spearman rank correlation coefficient: computes the correlation between ranks. Sample question: What is the correlation between rank in the senior year of high school and rank during the freshman year of college?
- Friedman two-way analysis of variance: compares the overall difference between two or more samples on more than one dimension. Sample question: How do rankings of supervisors differ as a function of regional office and gender?
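The Mann-Whitney U statistic above has a simple counting definition: over all cross-group pairs, count how often a value from one group beats a value from the other. A hedged sketch with invented data (the statistic only; looking up significance would need a table or a normal approximation):

```python
# Sketch of the Mann-Whitney U statistic on two small made-up samples.
def mann_whitney_u(a, b):
    # U counts, over all cross-pairs, how often an 'a' value beats a 'b' value
    # (ties count 0.5). By convention, report the smaller of U_a and U_b.
    ua = 0.0
    for x in a:
        for y in b:
            if x > y:
                ua += 1
            elif x == y:
                ua += 0.5
    ub = len(a) * len(b) - ua
    return min(ua, ub)
```

Completely separated groups give U = 0 (the strongest possible evidence of a group difference), while heavily interleaved groups give U near half of len(a)*len(b).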

STATISTICAL DECISION TREE


(SEE NEXT PAGE)
Statistical Decision Tree
Richard Gevirtz, Ph.D. (copyright 1990)

Page 1, UNIVARIATE
- Two or more Dependent Variables? Go to Page 3.
- One DV, two or more Independent Variables? Go to Page 2.
- One DV, one IV:
  - One sample, two levels (one sample vs. a norm): Single-Sample t-test.
  - Two levels, subjects cross levels (repeated measures): Category = ??; Rank = Wilcoxon test; Score = Paired t-test.
  - Two levels, subjects do not cross levels: Category = Chi-Squared; Rank = Mann-Whitney U Test; Score = t-test for independent samples.
  - More than two levels, subjects cross levels: Category = ??; Rank = Friedman Test; Score = 1-way ANOVA with 1 RM.
  - More than two levels, subjects do not cross levels: Category = Chi-Squared; Rank = Kruskal-Wallis; Score = 1-way ANOVA, no RM.
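The one-IV branches of Page 1 can be sketched as a small lookup function (a hedged illustration of the tree's logic, not part of the original chart; the "??" cells of the tree are likewise left unlisted here):

```python
# Sketch of the Page 1 (one DV, one IV) branch of the decision tree above.
def choose_test(levels, repeated, scale):
    # levels: number of IV levels; repeated: do subjects cross levels?
    # scale: "category", "rank", or "score"
    table = {
        (2, True):  {"rank": "Wilcoxon test", "score": "Paired t-test"},
        (2, False): {"category": "Chi-squared", "rank": "Mann-Whitney U test",
                     "score": "t-test for independent samples"},
        (3, True):  {"rank": "Friedman test", "score": "1-way ANOVA with 1 RM"},
        (3, False): {"category": "Chi-squared", "rank": "Kruskal-Wallis",
                     "score": "1-way ANOVA, no RM"},
    }
    key = (2 if levels == 2 else 3, repeated)  # 3 stands for "more than two"
    return table[key].get(scale, "no test listed in the tree")
```

For instance, four levels with repeated measures on ranked data routes to the Friedman test, matching the tree.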

Page 2, ANOVA DESIGNS: Labeling I.V.s
For each I.V., determine the number of levels and whether subjects cross the levels (Repeated Measures).

Example A (two I.V.s):
  I.V. 1: 3 levels, Between Groups (BG)
  I.V. 2: 2 levels, Within Groups (WG)
  Nomenclature: Two-way ANOVA with 1 RM

Example B (three I.V.s):
  I.V. 1: 2 levels, BG
  I.V. 2: 2 levels, WG
  I.V. 3: 4 levels, WG
  Nomenclature: Three-way ANOVA with 2 RM

Page 3, TWO OR MORE D.V.s (Bivariate/Multivariate)
- One DV: follow the univariate tree (Page 1).
- Two or more DVs, one or more IVs: MANOVA (follow univariate nomenclature, e.g., 2-way MANOVA with 1 RM). With a single two-group IV: Hotelling's T².
- Two or more DVs, no IVs (relationships among variables): Can the DVs be divided into predictors and outcomes?
  - No:
    - Both score = Pearson Correlation
    - Both rank = Spearman Correlation
    - Both category = Chi-Squared / Phi coefficient
    - One score, one category = Point Biserial
    - Based on a correlation or covariance matrix = Factor Analytic technique; Structural Equation Modeling
  - Yes: How many outcomes?
    - One outcome, categorical: Logistic Regression (Discriminant Analysis)
    - One outcome, score: Multiple Regression Analysis
    - Several outcomes: Canonical Correlation
CRITERION-REFERENCED VS. NORM-REFERENCED TESTS

Purpose
- Criterion-referenced: To determine whether each student has achieved specific skills or concepts (mastery/domain); to find out how much students know before instruction begins and after it has finished.
- Norm-referenced: To rank each student with respect to the achievement of others in broad areas of knowledge; to discriminate between high and low achievers.

Content
- Criterion-referenced: Measures specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
- Norm-referenced: Measures broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Item Characteristics
- Criterion-referenced: Each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items that test any given skill are parallel in difficulty.
- Norm-referenced: Fewer than four items usually test each skill. Items vary in difficulty. Items are selected that discriminate between high and low achievers.

Score Interpretation
- Criterion-referenced: Each individual is compared with a preset standard for acceptable achievement; the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage. Student achievement is reported for individual skills.
- Norm-referenced: Each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade-equivalent score, or a stanine. Student achievement is reported for broad skill areas, although some norm-referenced tests do report student achievement for individual skills.
http://chiron.valdosta.edu/whuitt/col/measeval/crnmref.html


GLOSSARY

Demand Characteristics
Things the person being observed can perceive as demands, suggestions, or contingencies for responding in a certain way. Not unique to a particular type of method; they relate to the context, setting, and situation in which you are being evaluated (e.g., child custody battles). They can influence any method, but will not necessarily do so.

Social Desirability
Presenting yourself in a more desirable light than is actually the case. Goes along with the method of self-report (reconstrual, reconstruction, and retrospective reflection are likely to be influenced). It is less likely to occur with self-observation, in the moment, with very precisely defined behaviors; if you set it up right, it is much less likely to affect the individual's report.

Halo
Related to reports by others (how you think about or rate someone else); it may be positive or negative in either direction.

Reactivity
Changes in behavior that occur as a result of doing a measurement (unique to the method). Most often found with self-observation; sometimes with direct observation, but not as commonly as we believe to be the case.

Carryover Effects (Also known as: Order Effects; Sequence Effects; Multiple Treatment Interference)
Effects of an earlier treatment condition that persist and influence performance under later conditions when the same subjects receive multiple treatments.

Random Allocation (Also known as: Random Assignment)
Putting subjects into experimental and control groups in such a way that each individual in each group is assigned entirely by chance. Otherwise put, each subject has an equal probability of being placed in each group. Using random assignment reduces the likelihood of bias.

Random Sampling (Also known as: Random Selection; Equal Probability Sample; Simple Random Sampling)
Selecting a group of subjects (a sample) for study from a larger group (population) so that each individual (or other unit of analysis) is chosen entirely by chance. Every member of the population has an equal probability of being included in the sample. A random sample is not the same thing as a haphazard or accidental sample. Using random sampling reduces the likelihood of bias.

Within-Subjects Design (Also known as: Within-Participants Design)
A before-and-after study or a study of the same subjects given different treatments. A research design that pretests and posttests within the same group of subjects, that is, one which uses no control group.

Between-Subjects Design (Also known as: Between-Participants Design)
A research procedure that compares different subjects; each score in the study comes from a different subject. Usually contrasted with a within-subjects design, which compares the same subjects at different times or under different treatments.

Single-Subject Design (Also known as: Single-Case Design)
Compares the effects of different treatment conditions on the performance of one individual over time.

Independent Variable (Also known as: Predictor Variable; Explanatory Variable)
The presumed cause in a study. Also a variable that can be used to predict or explain the values of another variable. A variable manipulated by an experimenter who predicts that the manipulation will have an effect on another variable (the dependent variable). COMPARED.

Dependent Variable (Also known as: Outcome; Criterion; Response Variable)
The presumed effect in a study; so called because it "depends on" another variable. The variable whose values are predicted by the independent variable, whether or not they are caused by it. MEASURED.

GLOSSARY (continued)

Familywise Error (Also known as: Experimentwise Error)
The probability that a Type I error has been committed in research involving multiple comparisons. "Family" in this context means a group or set of related statistical tests.

Type I Error (Also known as: Alpha Error; False Positive)
Rejecting the null hypothesis (that there is no difference) when it is really true (there is no difference). Even though there is no difference between the two groups, we conclude there is, and that is an error. The probability of a Type I error is set by alpha, the level of significance.

Type II Error (Also known as: Beta Error; False Negative)
Accepting the null hypothesis (that there is no difference) when it is really false (there is, indeed, a difference). Even though there is a difference between the two groups, we conclude there is not. Power (1 - beta) is the probability of avoiding this error.

Effect Size (Also known as: Magnitude of Effect)
(a) Broadly, any of several measures of association or of the strength of a relation, such as Pearson's r or eta. ES is often thought of as a measure of practical significance. (b) A statistic, often abbreviated d, Cohen's d, D, or delta, indicating the difference in outcome for the average subject who received a treatment from the average subject who did not (or who received a different level of treatment).

Pearson's r (Also known as: Pearson's Correlation Coefficient)
A statistic, usually symbolized as r, showing the degree of linear relationship between two variables that have been measured on interval or ratio scales, such as the relationship between height in inches and weight in pounds.

Interaction Effect (Also known as: Conditioning Effect; Contingency Effect; Joint Effect; Moderating Effect)
The joint effect of two or more independent variables on a dependent variable. Interaction effects occur when independent variables not only have separate effects but also have combined effects that are different from the simple sum of their separate effects. In other terms, interaction effects occur when the relation between two variables differs depending on the value of another variable. The presence of statistically significant interaction effects makes it difficult to interpret main effects.

Main Effect
The simple effect of an independent variable on a dependent variable; the effect of an independent variable uninfluenced by (without controlling for the effects of) other variables.

Parametric Statistics
Statistical techniques designed for use when data have certain characteristics, usually when they approximate a normal distribution and are measurable with interval or ratio scales.

Nonparametric Statistics
Statistical techniques designed to be used when the data being analyzed depart from the distributions that can be analyzed with parametric statistics. In practice, this most often means data measured on a nominal or an ordinal scale. Nonparametric tests generally have less power than parametric tests.

Homogeneity of Variance
The variance within each condition/group is similar to that of the other groups; within-group and between-group variances are similar.

Regression Weights (Also known as: Beta Weight; Regression Coefficient)
- b (raw): b weights in the original (unstandardized) metric. The size of b depends on the scale of measurement of the IV and DV, so you cannot compare unstandardized b weights across samples. The b weight gives the amount of influence of that predictor.
- B (standardized): based on z-scores, which allows comparisons across samples. These weights indicate the amount of influence each predictor has on the outcome.
GLOSSARY (continued)

Homoscedasticity (Also known as: Evenness of Errors; Homogeneity of Variances)
A condition of substantially equal variances in the dependent variable for the same values of the independent variable in the different populations being sampled and compared in a regression analysis or an ANOVA.

Outliers
A subject or other unit of analysis that has extreme values on a variable. Outliers are important because they can distort the interpretation of data or make misleading a statistic that summarizes values (such as a mean).

Shrinkage
The tendency for the strength of prediction in a regression or correlation study to decrease in subsequent studies with new data sets. The regression model derived from one set of data usually works less well with others. The degree of shrinkage is measured by the change in R². Adjusted R² is a more accurate estimate of prediction.

Partial Correlation
A correlation between two variables after the researcher statistically subtracts or removes (controls for, holds constant, or partials out) the linear effect of one or more other variables.

Semipartial Correlation
A correlation that partials out (controls for) a variable, but only from one of the other variables being correlated.

Sensitivity
The ability of a diagnostic test to correctly identify the presence of a disease or condition. Sensitivity is the conditional probability of the test giving a positive result if the subject does have the condition or disease.

Specificity
The ability of a test to correctly judge that subjects do not have a disease or condition, in other words, to avoid false positives. Specificity is the conditional probability of a test giving a negative result when patients or subjects do not have the disease. Low specificity (trouble detecting absence) results in false positives.

Exploratory Factor Analysis
Factor analysis conducted to discover what latent variables (factors) lie behind a set of variables or measures. Generally contrasted with confirmatory factor analysis, which tests theories and hypotheses about the factors one expects to find. Used in test construction (creating subscores on assessments), in empirical exploration (studying brand-new areas to see which symptoms cluster together), and for data reduction (reducing the number of DVs).

Confirmatory Factor Analysis
Factor analysis conducted to test hypotheses (or confirm theories) about the factors one expects to find: checking existing classifications to see whether they hold up, and checking whether the model fits existing data (if chi-square is significant, the real-life data do not fit the current model).

Eigenvalues (Also known as: Characteristic Root; Latent Root)
A statistic used in factor analysis to indicate how much of the variation in the original group of variables is accounted for by a particular factor. It is the sum of the squared factor loadings of a factor. Eigenvalues of less than 1.0 are usually not considered significant.

Synthetic Variable
MANOVAs create a new DV from the set of correlated DVs (lumping them all into one DV).
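The sensitivity and specificity definitions above reduce to two conditional proportions over confusion-matrix counts. A minimal sketch with made-up counts (tp = true positives, fn = false negatives, tn = true negatives, fp = false positives):

```python
# Sketch: sensitivity and specificity from confusion-matrix counts
# (the counts used below are invented for illustration).
def sensitivity(tp, fn):
    # P(test positive | condition present); low values mean missed cases
    return tp / (tp + fn)

def specificity(tn, fp):
    # P(test negative | condition absent); low values mean false positives
    return tn / (tn + fp)
```

For example, a screen that catches 80 of 100 true cases has sensitivity .80; one that clears 90 of 100 healthy subjects has specificity .90.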

GLOSSARY (continued)

Homogeneity of Regression
The slope of the regression line (beta) is assumed to be the same for each group, condition, or cell. (Vogel) states that the relationship between the CV and DV is the same at every level of the IV (look at the multivariate scatter plot for elliptical shapes in SPSS). An assumption of ANCOVA and MANCOVA: it requires that the covariate-dependent variable relationship is the same across all levels of the IV.

Criterion-Referenced Test (Also known as: Content-Referenced Test)
A test that examines a specific skill (the criterion) that students are expected to have learned, or a level (the criterion) students are expected to have attained. Unlike a norm-referenced test, it measures absolute levels of achievement; students' scores are not dependent upon comparisons with the performance of other students.

Norm-Referenced Test
A test in which the scores are calculated on the basis of how subjects did in comparison to (relative to) others taking the test (others' scores provide the norm or standard). The alternative is some absolute standard or criterion.

Latent Variables (Also known as: Factors)
(Vogt, 169) An underlying characteristic that cannot be observed or measured directly; it is hypothesized to exist so as to explain variables, such as behavior, that can be observed (manifest variables). Latent variables are also often called factors, especially in the context of factor analysis. E.g., if we observed the votes of members of a legislature on spending bills for the military, medical care, nutrition programs, education, law enforcement, and promoting business investment, we might find underlying patterns that could be explained by postulating latent variables (factors), such as conservatism and liberalism.

Manifest Variables (Also known as: Indicator Variables)
(Vogt, 169) A variable that can be directly observed, often assumed to indicate the presence of a latent variable. Also called an indicator variable. E.g., we cannot observe intelligence directly; it is a latent variable. But we can look at indicators such as size of vocabulary, success in one's occupation, IQ test score, the ability to play complicated games such as chess or bridge well, and so on.


FORMULAS

Effect Size
  ES = (M1 - M2) / SD
  where M1 = mean for Group 1, M2 = mean for Group 2, and SD = the standard deviation from either group.
  Or, using the pooled standard deviation:
  ES = (M1 - M2) / sqrt((s1² + s2²) / 2)
  where s1² = variance of Group 1 and s2² = variance of Group 2.

Variance
  s² = Σ(X - M)² / (n - 1)
  where Σ = the sum of what follows, X = each individual score, M = the mean of all scores, and n = the sample size.

Standard Deviation
  s = sqrt( Σ(X - M)² / (n - 1) )
  (the same components as the variance; the standard deviation is its square root)

z-score
  z = (X - M) / SD
  where X is your score, M is the mean, and SD is the standard deviation.

Regression Equation
  Y = a + bX + e
  where Y is the DV, X is the IV, b is the slope or regression coefficient, a is the intercept, and e is the error term.

Sampling Error (Standard Error of the Mean)
  SE = SD / sqrt(n)
  where SD is the standard deviation and n is the sample size.
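The formulas above translate directly into code. A pure-Python sketch (the group scores used in the usage line are invented; the pooled-SD form of the effect size is used):

```python
# Sketch of the formulas above on small made-up groups of scores.
import math

def mean(v):
    return sum(v) / len(v)

def variance(v):
    # s^2 = sum((X - M)^2) / (n - 1)
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def sd(v):
    return math.sqrt(variance(v))

def effect_size(g1, g2):
    # ES = (M1 - M2) / pooled SD, with pooled SD = sqrt((s1^2 + s2^2) / 2)
    return (mean(g1) - mean(g2)) / math.sqrt((variance(g1) + variance(g2)) / 2)

def z_score(x, v):
    # z = (X - M) / SD
    return (x - mean(v)) / sd(v)

def standard_error(v):
    # sampling error of the mean: SD / sqrt(n)
    return sd(v) / math.sqrt(len(v))
```

For example, `effect_size([1, 2, 3], [2, 3, 4])` returns -1.0: the groups differ by exactly one pooled standard deviation.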


FACTOR ANALYSIS SUMMARY


Factor Analysis is used for several common purposes:
1) Data reduction: when trying to determine how to score a measure (e.g., not sure whether to use one total score or subscales within it)
2) Exploratory purposes: wanting to see how a phenomenon clusters together
3) Examining construct validity: whether or not the items group together as predicted by theory or past research
4) Theory testing: as a way of looking at theory
Exploratory factor analysis can be dangerous because you don't know what you are looking for; the computer combines items only on a mathematical basis, without regard to conceptual information.
Be sure to think about whether or not something is a good scale.
Decisions with Factor Analyses
1) Exploratory versus confirmatory factor analysis?
a. Confirmatory: you specify the model in advance, and then the computer assesses how well your model fits the data
b. Exploratory: asking the computer to generate mathematical clusters (a fishing expedition), and also to help you with scoring (for data reduction)
c. You could use confirmatory factor analysis for construct-validation purposes
2) With exploratory factor analysis:
a. The population you choose to assess can affect the factor analysis
b. The more heterogeneous the population, the greater the number of factors
c. Rules of thumb at least give you some idea of how many people you need (a minimum of 5 participants per item and a maximum of 20-25 participants per item)
d. Some people think you should always have at least 100 participants. If factor loadings are high and the items clearly fit their factors, you don't need as many people; if factor loadings are lower, you need more people to reach a stable solution
e. The first thing you have to decide is: How should each item correlate with the factors (the starting communalities)? That is, what should the factor extraction method be?
Communalities: the correlation of the item with the underlying construct (the latent variable)
The algorithm needs starting values and a baseline solution before it can calculate the relationships between the items
Two main ways to establish baseline solutions (methods of extracting factors):
1. Principal Components: the starting values assume a perfect correlation between each item and the construct you are measuring (the calculations are easier with PCA)
2. Principal Axis (also known as Principal Factor or Common Factor): more modest; starts with values that are less than perfect, acknowledging the existence of error variance. It takes the probability of error into account and assumes a less-than-perfect correlation between factor loadings and items
3. The real question: how similar are the two? When factor loadings are high, the results are very similar; when factor loadings are low, it is better to go with Principal Axis
f. How are the factors probably related (this has to do with rotations)?
If you don't want the factors to be correlated, use an orthogonal rotation, which keeps the factors as mathematically separate as possible (maximizing the separation between factors)
If the factors should be related, choose an oblique rotation
g. How should I decide how many factors to retain?
Eigenvalues: the amount of standardized variance accounted for by the factor; the usual cut-off is 1

May 2007

METHODOLOGICAL CONSIDERATIONS

119

Scree plot: plot the eigenvalues on the y-axis and the factors on the x-axis; the cutoff is where the
scree plot bends (cut right before or after the bend). The bend is the point of diminishing returns: you
don't get much new information from the factors after it.
Then look at the results and make judgments about whether they make conceptual sense
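As a sketch of the eigenvalue rule (this example and its correlation matrix are invented, not from the packet): compute the eigenvalues of the item correlation matrix and retain factors whose eigenvalue exceeds 1.

```python
import numpy as np

# Invented 6-item correlation matrix: items 1-3 and items 4-6 form two
# clusters (within-cluster r = .6, between-cluster r = .1), so we expect
# two factors to survive the eigenvalue-greater-than-1 cutoff.
R = np.eye(6)
for i in range(6):
    for j in range(6):
        if i != j:
            R[i, j] = 0.6 if (i < 3) == (j < 3) else 0.1

# Eigenvalues, sorted largest first (these are what a scree plot shows).
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Kaiser criterion: keep factors that account for more standardized
# variance than a single item contributes (eigenvalue > 1).
n_retained = int(np.sum(eigenvalues > 1))

print(np.round(eigenvalues, 2))
print("Factors retained:", n_retained)  # 2
```

Plotting `eigenvalues` against factor number would show the scree "bend" after the second factor, agreeing with the Kaiser cutoff here.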
h. How do I decide which items go with which factors?
Usually this is based on factor loadings (how well the item correlates with the factor score; conceptually
analogous to an item-to-total-score correlation). You want an item to load highly on its own factor and
low on the other factors (prefer a loading somewhere around .3/.4 or above on its own factor, and low
loadings on the others)
Confirmatory Factor Analysis (a subset of structural equation modeling):
What is my model? How do my items and factors relate to each other?
LISREL and AMOS are structural equation modeling programs; with AMOS you can draw the model as a
path diagram
Chi-square is the statistic that tells us whether the model significantly differs from the data; we do NOT
want chi-square to be SIGNIFICANT. But the larger the sample, the more likely a statistic is to be
significant, so we may get significant chi-square values even with a good model because of a very large
sample.
To deal with this, researchers developed a set of fit indicators (they fall into different classes), and we
would ideally like them to be close to 1. The usual cutoff is around .95, although for some indices you want
high values and for others you want low values.
a. Normed fit index (NFI): you want it to be .95 or higher
b. Goodness-of-fit index (GFI)
*Confirmatory and exploratory analyses do not always give you the same solutions; they address different kinds of
questions.
A single analysis can have multiple implications. Doesn't all validation work tell you something theoretically about
the construct? Yes, depending on how you do your study.
***If the factor loadings are high, then Principal Components and Principal Axis will produce similar results.


ITEM ANALYSIS OVERVIEW


(Retrieved on June 1, 2007 from http://www.rpi.edu/~verwyc/chap4tm.htm)

During the construction of psychological tests, developers must determine the effectiveness of each individual
test item. This process of evaluating each item is called Item Analysis.
Item Analysis can identify which items should be reworded (or dropped), and tells us how good a job each
individual item does in predicting the overall score.
Item Analysis can tell us:
How difficult an individual item is.
How good a job a particular item does in discriminating between high and low performance on the test.

How do we determine what constitutes high or low performance on a psychological test?


2 ways:
Criterion-Referenced (domain referenced) Testing: High and Low performance is determined by the
test developer who compares test performance to a set list of objectives or standards.
Mastery Test: a type of criterion referenced test designed to measure a limited range of cognitive skills.
Total score is the % of correct answers.

Norming Distributions: High or low performance on the test is determined by comparing an individual score to
the score distribution of a representative sample of test takers.
The item analysis procedures vary, depending upon which criterion procedure is used.

Item Validity

Test Validity: does the test actually measure what it intends to measure?
Item Validity: does the specific test item correlate with what you are trying to measure? We can assess item
validity with respect to both internal and external criteria.

External Criterion: Data from outside the test which we expect to correlate in some meaningful way with our
test items.
For achievement tests, an external criterion might be your overall grade in that subject area.
For aptitude tests, an external criterion might be a supervisor's rating of your job performance.
To determine the validity of an individual test item, we correlate the scores on that test item with the external
criterion from the domain of interest.

Item validity and External Criterion


Interpreting the item-criterion correlation
The higher the correlation (the closer to 1), the more accurate (or valid) the test item is.
Test items with a zero or very low correlation (<.2) should be reworded or removed from the test.
The point-biserial coefficient formula can also be used with an internal criterion such as the total score on the
test itself.
Another way to measure internal consistency (the relationship between performance on an individual item
and performance on the entire test) is to examine the performance of the low and high performers on an item.
Split test takers into three groups, based on overall test performance:
Top 25% or 27% (depends upon sample size; if there are more than thirty test takers, use 27%)
Middle 50%
Bottom 25%


Measuring Internal Consistency of Items


Need to know two things:
Number of people in high and low groups.
Number of people in high and low groups who get a particular answer right.
Assume 200 people take the test.
Our top group is the highest 54 scorers.
Our bottom group is the lowest 54 scorers.
On question 17, 48 of the top group and 37 of the bottom group got the question correct.

What is the item difficulty and the item discriminability of this question?
Item difficulty and item discriminability are both measures of internal consistency

Calculating Internal Consistency Measures


Item Difficulty Index (4.2): p = (Up + Lp) / (U + L)

Item Discrimination Index (4.3): D = (Up - Lp) / U


Up = Number of high performers who got question right
Lp = Number of low performers who got question right
U = Number of high performers
L = Number of Low performers
Item difficulty is a measure of the overall difficulty of the test item. The lower the p, the more difficult a particular
item is.
Item discrimination tells us how good a job a question does in separating high and low performers.
It is more important for an item to be discriminating than it is to be difficult.
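Plugging the example above (top and bottom groups of 54; on question 17, 48 of the top group and 37 of the bottom group answered correctly) into formulas (4.2) and (4.3):

```python
U, L = 54, 54      # number of high and low performers
Up, Lp = 48, 37    # number in each group who got question 17 right

p = (Up + Lp) / (U + L)  # item difficulty index (4.2)
D = (Up - Lp) / U        # item discrimination index (4.3)

print(round(p, 2))  # 0.79 -- a fairly easy item
print(round(D, 2))  # 0.2  -- below the usual .30 discrimination cutoff
```

So question 17 is easy and discriminates only weakly between high and low scorers.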

Interpreting the Item Discrimination Index (D index)


The higher the value of D (up to 1), the better the job of separating high and low performance.
If D = 1, all of the high-performance group and none of the low-performance group answered the question
correctly.
D rarely (if ever) equals 1.
An item has an acceptable level of discrimination if D >= .30.
p and D are not independent.
Discrimination indexes less than .30 are sometimes acceptable if we have a very high p value.
Other Factors may affect item difficulty and discriminability and are sometimes analyzed as part of a
comprehensive item analysis.
If gender, age, ethnicity, or socioeconomic status is theorized to possibly affect test performance, then
statistical indexes of Differential Item Functioning can be calculated.
The same procedures discussed earlier are used, but we divide our test takers into the various groupings of
interest before calculating difficulty and discrimination Indexes.
Remember, measures of Internal Consistency are not indexes of Validity. Validity can only be assessed by
comparing test performance with some external criterion.
In general, as the overall D increases for an exam, the variability and the ability to separate high and low
performers increase.
In general, as p increases, your test average will increase as well (high p = easy test).


REFERENCES
Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied Multiple Regression/Correlation Analysis for the
Behavioral Sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Gevirtz, R. (1990). Statistical Decision Tree. **Included with permission from Dr. Gevirtz.
Salkind, N.J. (2004). Statistics for People Who (Think They) Hate Statistics. Thousand Oaks, CA: Sage.
Severino, J.P. (n.d.). Study packet prepared by student. **Unless otherwise indicated, unreferenced material comes
from the Severino packet.
Schwab, D.P. (2005). Research Methods for Organizational Studies. Mahwah, NJ: Lawrence Erlbaum Associates.
Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and Quasi-Experimental Designs for Generalized
Causal Inference. Boston, MA: Houghton Mifflin.
Vogel, L. (2005). Prior study packet prepared by student. Nearly all graphics were sourced from this version of the
packet.
Vogt, W.P. (2005). Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences. Thousand
Oaks, CA: Sage.
Ethical Principles in the Conduct of Research With Human Participants (1982). Washington, DC: American
Psychological Association.
Ghiselli, E.E., Campbell, J.P., & Zedeck, S. (1981). Measurement Theory for the Behavioral Sciences. New York:
W.H. Freeman.
Keppel, G. (1991). Design and Analysis: A Researcher's Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Ray, W. (1993). Methods Toward a Science of Behavior and Experience (4th ed.). Belmont, CA: Wadsworth.


METHODOLOGY STUDY QUESTIONS


The following review questions are intended to support studying the materials. They do not represent all of the material
that may be on the exam. They are intended to serve as a review, not as your sole means of studying for the exam.

Test Construction
1. List the types of cognitive tests.
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
2. List the types of sentiment tests.
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
3. List the types of vocational tests.
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
4. List the types of personality tests.
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
Reliability and Validity
5. List, define, and provide an example of each type of validity.

Define:
E.g.:
Define:
E.g.:
Define:
E.g.:
Define:
E.g.:
Define:
E.g.:
Define:
E.g.:
Define:
E.g.:
Define:
E.g.:
Define:
E.g.:

6. List, define, and provide an example of each type of reliability. Then specify the statistic used for each.

Define:

Statistic(s):

E.g.:
Define:

Statistic(s):

E.g.:
Define:

Statistic(s):

E.g.:
Define:

Statistic(s):

E.g.:
Define:

Statistic(s):

E.g.:

7. List and define the threats to Internal Validity


Internal validity is:

8. List and define the threats to External Validity


External validity is:

9. List and define the threats to Construct Validity


Construct validity is:


10. List and define the threats to Statistical Conclusion Validity


Statistical Conclusion Validity is:

11. What is Type I error?


b. It is also referred to as?
12. What is Type 2 error?
b. It is also referred to as?
13. What is Power?
14. What is the difference between β and 1 − β?

(Tip: When there do not appear to be any IVs, automatically go to page 3 of the decision tree)
1. 165 participants are given a questionnaire to assess their level of justification for the LA riots. The scores of
African American participants were compared with the scores for European American participants. What
statistic should be used?
2. Researchers at a mental hospital want to compare the treatment of depression by cognitive therapy with
treatment of depression by behavioral therapy. 60 depressed male participants were randomly assigned to
either treatment group. At the end of the treatment, each participant was measured on a depression inventory
and a life satisfaction scale. What statistic should be used?
3. To assess whether there is a relationship between a person's sociability and their expression of anger, a
researcher gives 150 freshmen both a Sociability Scale and an Anger Expression Inventory. What statistic
should be used?
b. If the researcher made sure the data from the sociability scale came out as either high sociability or low
sociability and kept the Anger Expression Inventory the same, what statistic should he use?
4. A history professor wished to determine whether comments made on her students exams had any effect on
their final exam scores. Records were used to divide students according to their GPA into above average
scoring students, average scoring students, and below average scoring students. Students from each group
were randomly assigned to receive either negative, positive, or neutral comments on their exams throughout
the course. What statistic would she use to see what effect her comments had?
5. Researchers are interested in finding out if certain parental characteristics can predict aggression in teenage
boys. 68 male participants were given an Aggression Inventory, and their parents were given the following
scales assessing the following characteristics: assertiveness, negativism, self-esteem, work-ethic, emotional
openness, and anger expression. What statistic should be used?
b. What if the teenage boys were not included in the study and the researchers just wanted to see how the
same parental characteristics that were measured are related? What will the researchers use now?
6. Using a sample of 137 gay and bisexual males between 14 and 21 years old, researchers were interested in
factors predicting whether or not a person has seriously considered committing suicide in their lifetime.
Participants were first asked whether or not they had seriously considered committing suicide in their lifetime.
They were then given separate measures assessing the following: whether they had used drugs or not, had

been sexually abused or not, had felt hopeless about their future or not, and whether or not their friends and
family were supportive of their gay or bisexual identity. Data from these measures were obtained to see if
they could predict the participants' reports as to whether or not they had considered committing suicide in
their lifetime. What statistics should be used?
7. 250 elderly subjects between the ages of 65-75 were randomly assigned to one of four treatment groups to
improve their level of depression. Subjects either received an exercise program, a pet-visit program,
cognitive-behavioral therapy, or a waiting list. Subjects filled out inventories assessing their level of
depression, degree of hopelessness, and their amount of irrational beliefs. Subjects filled out the inventories
before the treatment, after the treatment, then two months later. What analysis should they use to determine
whether these interventions had any effect on depression?
8. Before conducting their analyses the researchers noticed that many subjects dropped out of one condition in
particular. What statistics should they use to determine whether the percent of subjects dropping out of this
condition is significantly more than the drop outs of the other conditions?
9. The cognitive and behavioral functioning of children of working mothers was investigated. Five hundred
children (250 males, 250 females) were measured over a course of three months on three scales: WISC-R,
Behavior Checklist, and Behavioral Inventory. A child's cognitive functioning was assessed by a Full Scale
IQ score and his/her behavior was assessed by combining the results of the two behavior inventories to yield
one score. The intensity of the mother's job (1-7 scale), hours worked per week, duration of the mother's
employment, and number of siblings were all recorded. Researchers are interested in seeing if they can
determine a child's cognitive and behavioral functioning from the number of siblings and the mother's
employment-related information.
10. A sexist pig is convinced that men are faster runners than women. A researcher sets out to prove that he is
wrong. The researcher takes the place standings of competitors in a recent marathon and compares the
differences between men and women. What statistics should the researcher use?
11. The sexist pig is still not convinced, so the researcher decides to use the same data to prove to him that there
is a strong relationship between males' and females' placements in races. What statistic should the researcher
use?
12. A group of Bush supporters was asked to listen to a political speech on the merits of Clinton. On the basis of
a questionnaire, participants were divided into those that were highly committed to voting for Bush and
those that were only mildly committed to Bush. The subjects were randomly assigned to three groups. Each
group watched a video of either moderate, strong, or intense appeal. All the subjects were then administered a
new questionnaire to measure how positive they felt toward Bush. What statistic should they use to analyze
this?


ANSWERS:
1. T-test for independent samples
2. Hotelling's T-test
3. (a) Pearson Correlation
(b) Point Biserial
4. 2-way ANOVA with no repeated measures
5. (a) Multiple Regression Analysis
(b) Factor Analysis
6. Logistic Regression
7. 2-way MANOVA with 1 repeated measure
8. Chi-squared
9. Canonical Correlation
10. Mann-Whitney U Test
11. Spearman Correlation
12. 2-way ANOVA with no repeated measures

Decision-tree summary:
One predictor → one outcome: Linear Regression
One predictor → multiple outcomes: Hotelling's T-test
Multiple predictors → one outcome: Logistic Regression or Discriminant Analysis (if the outcome is
categorical); Multiple Regression (if the outcome is a score)
Multiple predictors → multiple outcomes: Canonical Correlation

METHODOLOGY SAMPLE TEST I
The following review questions were not taken from past
comp exams or review packets. They were created by the
reviewer (Lynn Vogel) and do not represent all of the
material that may be on the exam. This is intended to
serve as a review and not your sole means of studying for
the exam.
Research Design
1. A variable that is comprised of groups you are
comparing is a(n):
a. Covariate
b. Independent Variable
c. Dependent Variable
d. Confound
2. A measured variable that correlates with the
dependent variable but not with the independent
variable could be a confound and should be treated
as a(n):
a. Covariate
b. Independent Variable
c. Dependent Variable
d. None of the above
3. Random selection is used to be reasonably sure that:
a. Any person from the population is equally
likely to be chosen to participate
b. All subjects are equally likely to be in any level
of the independent variable
c. All groups are equal (i.e. gender, age, ses)
d. Both b and c
4. Which of the following designs results in less error?
a. Between Group
b. Wait List Control
c. Yoked Control
d. Within Group
5. A control group that receives a pseudo
intervention instead of the experimental treatment is
called:
a. No Treatment Control
b. Wait List Control
c. Attention Placebo Control
d. Yoked Control
6. Which of the following techniques is used to control
for carryover effects?
a. Counterbalancing
b. Retrospective Ceiling
c. Cross-Section
d. All of the above
7. An individual starts with so little depression that their
score decreases much less than the scores of those with
higher levels of depression. This is referred to as:
a. Floor Effect
b. Ceiling Effect
c. Crossover Effect
d. None of the above


8. Studies that are used to collect preliminary data that
eventually leads to specific independent variables,
dependent variables, and hypotheses are considered:
a. Qualitative
b. Observational
c. Not Generalizable
d. All of the Above
9. Which observational design would be most
appropriate in studying characteristics of individuals
with antisocial personality disorder by comparing
them to those without the diagnosis?
a. Case Study
b. Case Control
c. Case Cohort
d. Case Sectional
10. All of the following are relevant for single subject
designs except:
a. Continuous Assessment
b. Moderate Generalizability
c. Strong Baseline
d. Assessment of Performance over Time
11. Elizabeth collects her dissertation sample by word of
mouth. This is an example of:
a. Stratified Random Sampling
b. Convenience Sampling
c. Cluster Sampling
d. Snowball Sampling
12. AIU is using GRE scores, undergraduate GPA, and
gender to predict achievement in a doctoral program.
What is the dependent variable?
a. GRE scores
b. Undergraduate GPA
c. Gender
d. All of the Above
13. Dr. Gevirtz tells you that your basic design is flawless;
however, you will not be able to generalize to a large
population. You must have good ______________ but
poor ___________.
a. Reliability, Validity
b. Internal Validity, External Validity
c. Validity, Reliability
d. External Validity, Internal Validity
14. All of the following are threats to external validity except:
a. Instrumentation
b. Sample Characteristics
c. Test Sensitization
d. Demand Characteristics
Statistical Inference and Other Concepts
15. I am studying depression among students in a statistics
course and have students from Dr. Gevirtz's and Dr.
Dalenberg's classes each take the Beck Depression
Inventory. Students in each class then receive a
ranking of how depressed they are relative to each
other (1st, 2nd, 3rd). This depression data is:
a. Categorical
b. Nominal
c. Ordinal
d. Interval

16. Alice is the teaching assistant for statistics and tells
you the mode of the final exam was an 82%. This
value represents the:
a. Average of the scores
b. Middle point of the scores
c. Curve of the test
d. Most frequent score
17. Which of the following is not a measure of
dispersion?
a. Variance
b. Standard Deviation
c. Median
d. All of the above are measures of dispersion
18. The following distribution is:
[Histogram of VAR00003: Std. Dev = 9.82, Mean = 8.2, N = 25.00; x-axis from 0 to 40]
a. Leptokurtotic
b. Positively Skewed
c. Significant
d. Bimodal
19. Power is an estimate of:
a. How sure we are that differences exist
b. Sensitivity of the experiment to find real
differences
c. How big the differences are
d. All of the above
20. In a normal distribution,
a. Mean = Median = Mode
b. Mode < Median < Mean
c. Median < Mean < Mode
d. None of the above
21. Population distributions are developed based on
what studies?
a. Repeated Measures
b. Monte Carlo
c. Boxs M
d. Hosmer and Lemeshow
Univariate Statistics
22. When comparing 2 groups on one dependent
variable (i.e. CBT vs. Control for treatment of
depression), the best analysis used is:
a. One Way Analysis of Variance
b. One Way Multivariate Analysis of Variance
c. Analysis of Covariance
d. Point Biserial
23. The larger the differences between the groups, the
___________ the F statistic:
a. More Accurate
b. Less Accurate
c. Larger
d. Smaller
24. If F is not significant, check your:
a. Power
b. Magnitude of Effect
c. Homogeneity of Variance
d. Independence of Observations

25. The sum of the squared deviations of individual scores
from the group mean is referred to as:
a. Sum of Squares within group
b. Error
c. Sum of Squares between groups
d. Both a and b
26. In a study of the effects of gender and treatment for
depression, Mary found that males depression levels
decreased with CBT while Females decreased in the
Meditation treatment. This is an example of:
a. Main Effect
b. Interaction
c. Planned Comparison
d. Post Hoc Analysis
27. A disadvantage of repeated measures designs is:
a. Practice Effects
b. Carryover Effects
c. Both a and b
d. None of the above
28. Mike is doing research on healthy eating habits of
college students but he is concerned subjects will
deliberately disclose only healthy habits. As a result,
he measures their level of social desirability. In
preliminary analyses, he realizes that social
desirability is significantly affecting the healthy eating
measure. What should he do?
a. Nothing, social desirability is not significantly
related to the independent variable
b. Use a counterbalancing technique
c. Run a repeated measures design instead
d. Treat social desirability as a covariate
Correlational Techniques
29. Debbie is interested in the relationship between
gender and level of physical activity (exercise minutes
per week). Which correlation will be calculated?
a. Point Biserial
b. Spearman Rank
c. Pearson
d. Multiple R
Multivariate Statistics
30. Kate wants to predict the amount of pain within
individuals in a motor vehicle accident using the
variables of psychiatric history, gender, and anger.
What would be the most appropriate analysis?
a. Multivariate Analysis of Variance
b. Factor Analysis
c. Logistic Regression
d. Multiple Regression
31. Each predictor in the above study receives a weight
indicating how influential it is on the outcome of
amount of pain. Which statistic would Kate report as
the weight for each predictor?
a. b weight
b. Beta weight
c. Either a or b
d. None, these are loading statistics not weights


32. Dr. Gevirtz is trying to predict the amount of
minutes graduate students exercise based on how
stress relieving exercise is for them. However, he is
unaware that one student is training for a marathon.
This student could be a:
a. Influential outlier
b. Covariate
c. Central limit
d. None of the above
33. _____ is the correlation between 2 variables while
_____ is the correlation between multiple variables.
a. r, r²
b. R, r²
c. r², R
d. r, R
34. Kathryn wants to predict whether individuals in a
motor vehicle accident will use narcotics or not 6
months following the accident. Which analysis
would you recommend?
a. Analysis of Variance
b. Logistic Regression
c. Discriminant Analysis
d. Path Analysis
35. At the local crisis house you are asked to design a
method of assessing suicidality. Your supervisor wants
you to be able to detect all individuals who will attempt
suicide, but doesn't want you to end up flagging some
who will not attempt suicide. He is looking for a
measure that contains:
a. Low sensitivity and low specificity
b. Low sensitivity and high specificity
c. High sensitivity and low specificity
d. High sensitivity and high specificity
36. In the following path model, the exogenous variable is:
[Path diagram with variables: Metabolism, Exercise, Caloric Intake, and Weight Loss]
a. Metabolism
b. Weight Loss
c. Caloric Intake
d. Both a and c
37. In the above model, the endogenous variable is:
a. Metabolism
b. Weight Loss
c. Caloric Intake
d. Both a and c
38. A study has suggested that cell phone use causes
radiation which then causes cancerous tumors.
Radiation is considered a:
a. Moderator
b. Mediator
c. Confound
d. Semipartial

39. You are studying the DSM by taking an extensive
history (i.e., symptoms) of patients at a local mental
health agency to make sure the symptoms cluster into
the appropriate diagnoses. For instance, symptoms of
Bipolar Disorder should not correlate with symptoms
of Antisocial Personality Disorder. Which subtype of
factor analysis is most relevant?
a. Exploratory
b. Causal
c. Confirmatory
d. This is a principal components analysis
40. Which of the following are not used in factor
analysis/principal components analysis?
a. Boxs M test
b. Eigenvalues
c. Scree plot
d. Kaiser criteria
41. Bob wants to compare treatment modalities (group
therapy vs. individual therapy vs. control group) on
quality of life, beck depression inventory, and beck
anxiety inventory. Which analysis is most appropriate?
a. Analysis of Variance
b. Multivariate Analysis of Variance
c. Analysis of Covariance
d. Multivariate Analysis of Covariance
42. Which of the following assumptions is crucial to
Analyses of Covariance and Multivariate Analyses of
Covariance and requires that the covariate-dependent
variable relationship is the same across all levels of
the independent variables?
a. Homogeneity of Regression
b. Homogeneity of Variance
c. Homogeneity of Variance-Covariance
d. None of the above
Interpretation of Measures
43. Comprehensive Exams are _____________ tests
whereas Symptom Inventories are _______________
tests.
a. Criterion Referenced, Norm Referenced
b. Domain Referenced, Norm Referenced
c. Empirical Keyed, Norm Referenced
d. All of the above
44. On a measure assessing relationship satisfaction,
endorsing the item "I have filed for divorce" implies
that all other milder statements, such as "I have
thought about divorce," would also be endorsed. This
is an example of what kind of scaling?
a. Thurstone
b. Guttman
c. Likert
d. Semantic Differential
45. Which theory states that observed scores are the
outcome of error added to one's true score?
a. Classical Measurement
b. Generalizability
c. Domain Sampling
d. None of the above


46. Gary tells you that the range of IQ score for subject
#1 is 90-110. The statistic he used was:
a. Confidence Interval
b. Standard Error
c. Both a and b
d. None of the above
Reliability of Measurement
47. All of the following are true for reliability except:
a. Good reliability = Good consistency
b. Shorter tests have better reliability
c. Longer tests have better reliability
d. Restriction of range is a potential problem
48. What is the relationship between reliability and validity?
a. Reliability = Validity
b. Reliability < Validity
c. Reliability > Validity
d. Unpredictable
49. Ann administers her personality inventory to subjects
twice to test its consistency. She is assessing:
a. Split Half reliability
b. Repeated measures reliability
c. Interrater reliability
d. Test retest reliability
50. Kevin reported a strong Cohen's kappa during his
dissertation defense. This indicated:
a. Good interrater reliability
b. Good internal consistency
c. Poor interrater reliability
d. Poor internal consistency
51. Dr. Greenberg decides to prevent recent occurrences
of cheating on exams by making 2 versions. Which
of the following should she check before
administering?
a. Split Half reliability
b. Internal consistency
c. Alternate Form reliability
d. Test retest reliability
Validity of Measurement
52. Jason asks experts in the area of autism to review
items on his new measure assessing the spectrum of
autistic disorders in order to identify if the items are
appropriate. Jason is assessing:
a. Face Validity
b. Construct Validity
c. Content Validity
d. Criterion related Validity
53. Linda is correlating scores on the Beck Depression
Inventory to scores on an Antisocial Personality
Disorder measure to establish ___________ validity.
a. Construct
b. Divergent
c. Discriminant
d. All of the above

54. You are interested in predicting future success (GPA)
in graduate school based on GRE scores and
undergraduate GPA; however, professors have
knowledge of their students' scores. What problem
might you have in calculating the predictive validity of
GRE and GPA?
a. Criterion contamination
b. Crossover contamination
c. Predictor contamination
d. Construct contamination
55. Which of the following are not frequently used in
selecting personnel for employment?
a. Taylor Russell Model
b. Thurstone Model
c. Naylor Shine Model
d. Predictive Validity
Test Construction, Features, and Utility
56. Item Response Theory uses Item Characteristic
Curves to assess:
a. Difficulty
b. Discriminability
c. Mastery
d. Both a and b
57. Bret believes that GRE scores are better predictors of
grad school success for US students only, not for
foreign-born students. In response, Dr. Foster states
that his results actually show that GRE scores are
systematically overpredicting US performance and
underpredicting foreign-born performance. This bias
is referred to as:
a. Item
b. Slope
c. Intercept
d. None of the above
58. Dr. Gevirtz gives a pop quiz on the first day of
statistics to determine the ability/skill level of students
before beginning the course. What was Dr. Gevirtz
interested in?
a. Achievements
b. Criterion Keying
c. Latency
d. Aptitude

May 2007

Answers Test I
1. b
2. a
3. a
4. d
5. c
6. a
7. a
8. d
9. b
10. b
11. d
12. d
13. b
14. a
15. c
16. d
17. c
18. b
19. b
20. a
21. b
22. a
23. c
24. a
25. d
26. b
27. c
28. d
29. a
30. d
31. c
32. a
33. d
34. b
35. d
36. d
37. b
38. b
39. c
40. a
41. b
42. a
43. d
44. b
45. a
46. c
47. b
48. c
49. d
50. a
51. c
52. c
53. d
54. a
55. b
56. d
57. c
58. d

METHODOLOGY SAMPLE TEST II

The following review questions were not taken from past comp
exams or review packets. They were created by the reviewer
(Lynn Vogel) and will not represent all of the material that may
be on the exam. This is intended to serve as a review and not
your sole means of studying for the exam.

Research Design
1. A measured variable is called a(n):
a. Covariate
b. Independent Variable
c. Dependent Variable
d. Confound
2. Random assignment is used to be reasonably sure that:
a. Any person from the population is equally
likely to be chosen to participate
b. All subjects are equally likely to be in any level
of the independent variable
c. All groups are equal (e.g., gender, age, SES)
d. Both b and c
3. An experimental design comparing 2 treatments for
depression controls for pretesting effects by
including a pretest and posttest for depression. This
design is called:
a. Pre-Post Control
b. Post Only Control
c. Time Series
d. Solomon 4 Group
4. A within subjects design in which the dependent
variable is administered at pretreatment,
posttreatment, and follow-up is called a ____ design.
a. Time Series
b. Factorial
c. Pre Post Control
d. Solomon 4 Group
5. Gender is a _______ independent variable while
Treatment Group is a ______ independent variable.
a. Patched, Manipulated
b. Static, Manipulated
c. Patched, Static
d. Gender cannot be an independent variable
6. Quasi-Experimental designs are unlike experimental
designs primarily due to:
a. Having at least one static independent variable
b. No true random assignment
c. Both a and b
d. Neither a nor b
7. Which observational design would be best to study the
risk factors associated with a particular outcome?
a. Cross Sectional
b. Retrospective Cross-Sectional
c. Cohort
d. Observational designs cannot be used
8. I am interested in assessing differences in IQ
between 15, 25, and 35 year olds. What design
would you use?
a. Cross Sectional
b. Retrospective Cross Sectional
c. Case Control
d. Cohort

9. In the above study, what may be your biggest concern
regarding internal validity?
a. Generalizability
b. Ceiling Effects
c. Floor Effects
d. Cohort Effects
10. ABAB, Multiple Baseline, and Changing Criterion are
designs used in which of the following types of
research?
a. Experimental
b. Quasi-Experimental
c. Observational
d. Single Subject
11. John is studying the differences between Ivy League,
Public, and Private Colleges. He randomly selects 5
Ivy League, 5 Public, and 5 Private schools and then
randomly selects students from only those schools.
This sampling technique is referred to as:
a. Stratified Random Sampling
b. Convenience Sampling
c. Cluster Sampling
d. Snowball Sampling
12. Karen is conducting political polling in election year
2008. She divides the US into geographical locations
and then randomly selects subjects from each location
in order to fully represent the entire US population.
This type of sampling is referred to as:
a. Stratified Random Sampling
b. Convenience Sampling
c. Cluster Sampling
d. Snowball Sampling
13. AIU is conducting a study comparing G1s to G4s on
their level of satisfaction with the psychology
program. What is the independent variable?
a. G1 versus G4
b. Level of Satisfaction
c. Psychology Program
d. There are only dependent variables
14. All of the following are threats to internal validity
except:
a. Maturation
b. Reactivity
c. Attrition
d. Regression to the Mean

Statistical Inference and Other Concepts


15. Eric's dissertation is based on comparing yoga versus
a control group on sleep quality. His ______
hypothesis states that the groups will exhibit no
significant differences on sleep quality whereas his
____ hypothesis states that the yoga group will exhibit
significantly better sleep quality than the control
group.
a. Alternative, Experimental
b. Experimental, Null
c. Alternative, Null
d. Null, Alternative


16. Maria is running so many tests that some could be
significant by chance alone and not due to real
differences. This is an example of:
a. Familywise Error
b. Experimentwise Error
c. Type I error
d. All of the above
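The inflation in question 16 is easy to see numerically: across m independent tests run at alpha = .05, the probability of at least one Type I error is 1 - (1 - alpha)^m. A small sketch (the values of m are illustrative):

```python
# Familywise error rate for m independent tests at a fixed alpha:
# P(at least one Type I error) = 1 - (1 - alpha) ** m.
alpha = 0.05
for m in (1, 5, 20):                  # number of tests (illustrative)
    familywise = 1 - (1 - alpha) ** m
    print(m, round(familywise, 2))    # grows quickly: .05, .23, .64
```

This is why corrections such as Bonferroni are applied when many comparisons are run.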
17. Maria reruns her study and finds out that her power
is only .40 (low). What trouble will she encounter?
a. Type I error
b. Type II error
c. Confounds
d. Covariates
18. This theorem states that the more subjects we
randomly choose from the population, the better
representation we have.
a. Restriction of Range
b. Omnibus Selection
c. Central Limit
d. Factorial
19. The following distribution is:
[Histogram of VAR00004: N = 95.00, Mean = 5.4, Std. Dev. = 1.63]
a. Leptokurtotic
b. Skewed
c. Platykurtotic
d. Mesokurtotic
20. Which of the following does not measure the spread
of scores?
a. Variance
b. Sum of Squares
c. Degrees of Freedom
d. Mean Squares
Nonparametric Statistics
21. Which of the following is not a nonparametric
analysis for rank data?
a. Chi Square
b. Kruskal Wallis
c. Mann Whitney U Test
d. Wilcoxon Signed
Univariate Statistics
22. The assumption of homogeneity of variance can be
tested by which of the following statistics?
a. Levene's
b. Brown-Forsythe
c. F-Max
d. All of the above
23. An F of 1 indicates:
a. Groups differ significantly
b. F values must be lower than 1
c. Groups are the same
d. F values must be greater than 1
24. If F is significant, the next step is to check:
a. Power
b. Magnitude of Effect
c. Homogeneity of Variance
d. Independence of Observations

25. Effect size is calculated as which of the following?
a. R2
b. Eta2
c. Omega2
d. All of the above
26. Ryan has 3 groups he is comparing and decides to run
4 additional planned comparisons. What is his next
step?
a. Evaluate familywise error rates
b. Use an alpha correction
c. Assess Type I Error
d. All of the above
27. A factorial analysis of variance is run when:
a. You have more than 1 dependent variable
b. You have more than 1 independent variable
c. You have a covariate
d. You do a factor analysis
28. In a study of the effects of year level (G1 vs G5) and
program (PhD vs. PsyD) at AIU on level of
satisfaction, Paul found that G1s rated satisfaction
lower than G5s in both PhD and PsyD programs. This
is an example of:
a. Main Effect
b. Interaction
c. Planned Comparison
d. Post Hoc Analysis
29. Which of the following is not an advantage of a
repeated measures design?
a. Can use a smaller sample size
b. Increase power
c. Reduce within group variance
d. Sphericity
Correlational Techniques
30. As depression decreases, quality of life increases.
This is an example of a:
a. Positive correlation
b. Negative correlation
c. Causal Relationship
d. Indirect relationship
Multivariate Statistics
31. Which of the following is not a type of Multiple
Regression?
a. Simultaneous
b. Predictive
c. Stepwise
d. Hierarchical
32. Alice wants to be sure she has made an even spread
of errors in her prediction. She would most likely
assess the assumption of:
a. Homoscedasticity
b. Homogeneity of Regression
c. Linearity
d. Independence of Observations
33. To assess one student's impact on Dr. Gevirtz's
regression equation (model), which statistic would
you use?
a. Mahalanobis
b. Leverage
c. DfBeta
d. Cook's

34. When applying the amount of variance explained by
your model (multiple regression) to other
populations, which statistic would be most
appropriate?
a. R2
b. Adjusted R2
c. Magnitude of Effect
d. Omega2
35. In order to compare the influence of predictors,
which statistic is best?
a. Beta weight
b. b weight
c. Tolerance
d. Durbin Watson
36. Beth states that metabolism uniquely accounts for
6% of the variance in weight loss with other
predictors held constant. Beth must have interpreted
which statistic?
a. Point Biserial
b. Semipartial
c. Chi-Square
d. Pearson
37. At Kayla's yearly mammogram, her doctor
comments that detection results in many false
positives. Mammograms must therefore have:
a. High specificity
b. High sensitivity
c. Low specificity
d. Low sensitivity
38. To determine the influence/contribution of
predictors in a Logistic Regression, one would
interpret the:
a. Significance level
b. Odds Ratio
c. b Weights
d. None of the above
39. In Path Analysis/Structural Equation Modeling, a
variable that is measured by many other variables
(e.g., IQ) is referred to as:
a. Static
b. Manifest
c. Latent
d. Both a and c
40. A recent study has found that strong social support
may influence disease progression of breast cancer
patients. Social support is considered a:
a. Moderator
b. Mediator
c. Confound
d. Semipartial
41. You have an Anger Inventory with 100 items and
you want to see if those items can be reduced into
different subscales of anger. What statistical analysis
is most appropriate?
a. Repeated Measure
b. Multivariate Analysis of Variance
c. Structural Equation Modeling
d. Factor Analysis/Principal Components Analysis

42. The measures quality of life, Beck Depression
Inventory, and Beck Anxiety Inventory are used in a
MANOVA. They are first tested as one dependent
variable called a:
a. Synthetic factor
b. Orthogonal factor
c. Familywise factor
d. Both a and c
43. Alan realizes that although he has randomly assigned
subjects to groups, they differ on the demographic
variable of socioeconomic status. In order to treat
socioeconomic status as a covariate, what is
necessary?
a. Relationship between the covariate and dependent
variable
b. Nonsignificant interaction between the covariate
and independent variable
c. Both a and b
d. None of the above
Interpretation of Measures
44. The licensing exam would be described as:
a. Domain Referenced
b. Mastery
c. Criterion Referenced
d. All of the above
45. Which theory does not account for error but
recognizes that observed scores do vary by
circumstances?
a. Classical Measurement
b. Generalizability
c. Domain Sampling
d. None of the above
46. Cathy tells you that her standard error is 4.27. This
value indicates:
a. Estimated band of error
b. Maximum amount of error
c. Range where true scores fall
d. Standard error is not a valid statistic
Reliability of Measurement
47. Difference scores (pre-post) are:
a. Notoriously unreliable
b. Highly recommended
c. Notoriously invalid
d. None of the above
48. Ann calculates her test-retest reliability and tells you r
= .13. All of the following could be true except:
a. The construct is not stable
b. The interval was too long
c. Practice/carryover effects were not a problem
d. The interval was too short
49. Interrater reliability was calculated between Olympic
judges and was found to be low due to individual
interpretations of scoring that influenced their
observations over time. This problem is referred to as:
a. Individual Drift
b. Consensual Drift
c. Independent Drift
d. Subjective Drift


50. Which of the following would you recommend to
calculate internal consistency?
a. Cronbach's Alpha
b. KR-20
c. Both a and b
d. Only Pearson r's are used for reliability
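For review, Cronbach's alpha (question 50, option a) can be computed directly from an item-score matrix. The sketch below uses a tiny made-up data set (4 respondents x 3 items) purely for illustration:

```python
# Cronbach's alpha: (k / (k - 1)) * (1 - sum(item variances) / variance of totals).
def cronbach_alpha(rows):
    k = len(rows[0])                          # number of items
    def var(xs):                              # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([r[i] for r in rows]) for i in range(k)]
    total_var = var([sum(r) for r in rows])   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical item scores: rows = respondents, columns = items.
scores = [[3, 4, 3], [2, 2, 3], [4, 5, 5], [1, 2, 1]]
alpha = cronbach_alpha(scores)
print(round(alpha, 2))
```

High alpha here reflects items that rise and fall together across respondents, i.e., internal consistency.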
Validity of Measurement
51. You are interested in comparing individuals who are
not clinically depressed with those who are on the
Beck Depression Inventory to establish its
__________ validity.
a. Divergent
b. Discriminative
c. Discriminant
d. Definitive
52. Multitrait Multimethod Matrices are used to explore:
a. Shared Method Variance
b. Convergent Constructs
c. Discriminant Variance
d. All of the above
53. Criterion related validity:
a. Ideally should be a perfect correlation
b. Addresses the utility of a measure
c. Can be concurrent or predictive
d. All of the above
Test Construction, Features, and Utility
54. Item discriminability refers to the extent to which the:
a. Subjects pass the item
b. Subjects fail the item
c. Item differentiates subjects
d. All of the above
55. Item test regression is used to:
a. Compare items on a test
b. Compare skill level of subjects
c. Assess hit rate percentages
d. Both a and c
56. Bret finds that GRE scores are better predictors of
grad school success for US students only, not for
foreign-born students. He is demonstrating what
kind of test bias?
a. Item
b. Slope
c. Intercept
d. None of the above
57. Dr. Gevirtz then gives a final exam to assess:
a. Achievement
b. Criterion Keying
c. Latency
d. Aptitude


Answers Test II
1. c
2. d
3. d
4. a
5. b
6. c
7. b
8. a
9. d
10. d
11. c
12. a
13. a
14. b
15. d
16. d
17. b
18. c
19. a
20. c
21. a
22. d
23. c
24. b
25. d
26. d
27. b
28. a
29. d
30. b
31. b
32. a
33. d
34. b
35. a
36. b
37. c
38. b
39. c
40. a
41. d
42. a
43. c
44. d
45. b
46. a
47. a
48. d
49. a
50. c
51. b
52. a
53. d
54. c
55. a
56. b
57. a


METHODOLOGY SAMPLE TEST III


The following review questions were not taken from past comp
exams or review packets. They were created by the reviewer
(Lynn Vogel) and will not represent all of the material that may
be on the exam. This is intended to serve as a review and not
your sole means of studying for the exam.

Research Design
1. A psychologist wants to compare the effects of
behavior therapy, cognitive therapy, and
psychodynamic therapy on anxiety as measured by a
self-report level of anxiety, heart rate, and the Taylor
Manifest Anxiety Scale. Identify the independent
variable(s).
a. Therapy
b. Level of Anxiety
c. Heart Rate
d. Taylor Manifest Anxiety Scale
2. A psychologist wants to compare the effects of
behavior therapy, cognitive therapy, and
psychodynamic therapy on anxiety as measured by a
self-report level of anxiety, heart rate, and the Taylor
Manifest Anxiety Scale. Identify the dependent
variable(s).
a. Taylor Manifest Anxiety Scale
b. Level of Anxiety
c. Heart Rate
d. All of the above
3. To investigate the effects of television violence on
aggressive behavior, a social psychologist has male
and female children who have been identified as
either very aggressive, moderately aggressive,
mildly aggressive, or nonaggressive watch either a
violent or neutral film. Following the film, each
child is observed during a 60 minute free play period
and coded for the number of aggressive acts the
child exhibits. Identify the independent variable(s).
a. Gender of Child
b. Level of identified aggression
c. Number of aggressive acts
d. Both a and b
4. To investigate the effects of television violence on
aggressive behavior, a social psychologist has male
and female children who have been identified as
either very aggressive, moderately aggressive,
mildly aggressive, or nonaggressive watch either a
violent or neutral film. Following the film, each
child is observed during a 60 minute free play period
and coded for the number of aggressive acts the
child exhibits. Identify the dependent variable(s).
a. Gender of Child
b. Level of identified aggression
c. Number of aggressive acts
d. Both a and b
5. In a research study, what is being measured is
referred to as the:
a. Dependent Variable
b. Predictor Variable
c. Criterion Variable
d. All of the above

6. You would most likely conduct a case study in order to:
a. Determine the degree of association between
variables
b. Investigate variables over an extended period of
time
c. Identify variables to examine more systematically
later
d. There is no good reason to conduct case studies
7. Variability in the dependent variable due to
confounding variables is a source of:
a. Systematic error
b. Random error
c. Extraneous error
d. Both a and c
8. The presence of observers during the course of a
research study alters the way that subjects behave.
This is an example of
the threat to ________ validity known as _______.
a. External, Reactivity
b. Internal, Reactivity
c. External, Expectancy
d. Internal, Expectancy
9. The random assignment of subjects to treatment
groups is most useful for ensuring that the study has
adequate __________ validity.
a. Internal
b. Predictive
c. Discriminant
d. External
10. A high school teacher administers an achievement test
to high school freshmen at the beginning of the school
year, teaches the students test-taking strategies, and
then readministers the achievement test at the end of
the school year. Which of the following is not a threat
to this study's internal validity?
a. Maturation
b. History
c. Selection
d. None of the above
11. A researcher evaluates the effects of a 15-month
training program on the conversation skills of
preoperational children. At the end of the program, he
determines that a significantly greater number of
children converse after the program than before. The
greatest threat to this study's internal validity is:
a. Maturation
b. Within Group History
c. Order Effects
d. Carryover
12. A factorial design:
a. Includes 3 or more levels of a single independent
variable
b. Includes 2 or more independent variables
c. Includes 2 or more dependent variables
d. Includes at least 1 covariate

13. To assess the effects of communicator credibility on
attitude change, an experimenter randomly assigns
60 males and 60 females to either the high, average,
or low communicator credibility condition. After the
communicator delivers a speech about a political
candidate, the experimenter asks each subject for
his/her opinion of the candidate. This study
illustrates which of the following research designs?
a. Between groups
b. Within Group
c. Mixed Design
d. Repeated Measures
14. The biggest threat to the study's external validity in
a within-subjects (repeated measures) design is:
a. Statistical regression
b. Regression to the mean
c. Interaction between selection and treatment
d. Order effects
15. Which of the following studies would ALWAYS be
considered quasi-experimental?
a. An analogue study because its results would
have limited Generalizability
b. A study using a counterbalanced design because
subjects receive more than one treatment
c. A study using a repeated measures design
because the study may be confounded by
practice effects
d. An ex-post facto study because the researcher
cannot assign subjects to groups
16. Twenty children enrolled in Preschool A are
assigned to the experimental group of a research
study and twenty children enrolled in Preschool B
are assigned to the control group. Children in the
experimental group are given instruction on the
differences between squares and rectangles, while
children in the control group do not receive this
instruction. Several weeks later, a test is
administered to all children to assess their ability
to distinguish between squares and rectangles.
Which of the following are MOST likely to threaten
the study's internal validity?
a. Instrumentation and maturation
b. Selection and history
c. Regression and maturation
d. Diffusion and Selection
17. A researcher is using a repeated measures design to
assess the effectiveness of different behavioral
techniques for reducing transitory anxiety. To
control potential carryover effects, the researcher
should use which of the following?
a. A control group
b. Blocking
c. Counterbalancing
d. Cross-Sequential Design
18. An experimenter wants to investigate the short and
long term effects of four different smoking cessation
programs. She randomly assigns 80 smokers to one
of the four programs. After each program is
completed, the experimenter determines the number
of cigarettes smoked by each participant each day

immediately following the program, 3 months after
the program, and 6 months after the program. The
research design being used by the experimenter is best
described as:
a. Between Subjects
b. Within Subjects
c. Mixed Design
d. Counterbalanced
19. When conducting evaluation research, the first step is
ordinarily which of the following:
a. Conducting job analyses
b. Operationally defining the predictors
c. Clarifying the content domain
d. Defining the programs objectives
Statistical Inference and Other Concepts
20. Beta is the probability of:
a. Correctly rejecting the null hypothesis
b. Incorrectly rejecting the null hypotheses
c. Correctly retaining the null hypothesis
d. Incorrectly retaining the null hypothesis
21. A _______ test states that the null is false but not
whether the sample will be greater than or less than
the control.
a. One-tailed
b. Two-tailed
c. Either a or b
d. None of the above
22. A biofeedback technique gives temperature in
degrees F. This is an example of what scale?
a. Ordinal
b. Nominal
c. Ratio
d. Interval
23. Ranking of peers in terms of popularity is on what
type of scale?
a. Ordinal
b. Nominal
c. Ratio
d. Interval
24. Of the three measures of central tendency, which is
the least susceptible to sampling fluctuations?
a. Mean
b. Median
c. Mode
d. Variance
25. Of the three measures of central tendency, which is
the least susceptible to outliers?
a. Standard error
b. Median
c. Mode
d. Variance
26. In a normal distribution, approximately _________%
of observations fall between the scores that are plus
and minus 1 standard deviation from the mean?
a. 99
b. 95
c. 68
d. 50
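The empirical rule behind question 26 can be checked against the standard normal CDF using Python's built-in `statistics.NormalDist`:

```python
from statistics import NormalDist

# Fraction of a normal distribution lying within k standard
# deviations of the mean, from the standard normal CDF.
std_normal = NormalDist()  # mean 0, sd 1

def within(k):
    return std_normal.cdf(k) - std_normal.cdf(-k)

print(round(within(1) * 100))  # 68
print(round(within(2) * 100))  # 95
```

The familiar 68-95-99.7 rule drops straight out of the CDF.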


27. The standard error of the mean is the:
a. Standard Error of the sampling distribution
b. Standard deviation of the sampling distribution
of the mean
c. Standard error of the sample population
d. Monte Carlo statistic
Univariate Statistics
28. You have collected IQ scores from schizophrenic
and non-schizophrenic patients who have been
classified as either high, middle, or low in
socioeconomic status. Which test(s) will you use to
analyze the data?
a. Multi Sample Chi-Square
b. 1-way ANOVA
c. Factorial ANOVA
d. Either b or c
29. A psychologist obtains a statistically significant F
ratio for the interaction of Factor A x B. This means
that:
a. The effects of Factor A are the same across the
same levels of Factor B
b. The effects of Factor A are different across
different levels of Factor B
c. The effects of Factor A are the same across
different levels of Factor B
d. The effects of Factor A are different across the
same levels of Factor B
30. You have collected IQ scores from patients who
have and have not received a diagnosis of
Schizophrenia and whose families have been
classified as either high, moderate, or low in
expressed emotion. To analyze the data you have
collected, you will use the:
a. Multiple Sample Chi-Square test
b. One Way ANOVA
c. Two Way ANOVA
d. Regression
31. The numerator of the F ratio is a measure of the
variability due to:
a. Error + Treatment Effects
b. Treatment Effects Only
c. Error Only
d. None of the above
Correlational Techniques
32. A curvilinear relationship between achievement and
anxiety is likely to have what effect on the Pearson r?
a. Overestimates
b. No significant effect
c. Underestimates
d. Unpredictable effects
33. When data points are widely scattered in a scatter
plot for variables X and Y, this indicates that
the correlation between X and Y is:
a. Very low
b. Moderate
c. Very High
d. Perfect

34. You are interested in correlating sex of respondent
with attitude toward abortion (measured on an interval
scale). Which of the following correlation techniques
would you use?
a. Pearson r
b. Point Biserial
c. Phi Coefficient
d. Contingency Coefficient
Multivariate Statistics
35. A researcher interested in determining the causal
relationship between several variables would be most
likely to use which of the following?
a. Discriminant Analysis
b. Canonical Correlation
c. Logistic Regression
d. Path Analysis
36. The shrinkage associated with the cross-validation
of multiple regression equations is ordinarily
attributable to which of the following?
a. High correlations between predictors
b. Low initial correlations between the predictors
and the criterion
c. Regression to the mean
d. The impact of chance factors in the original
sample
37. Which analysis would you use if you wanted to
predict a dichotomous outcome using few
assumptions?
a. Path Analysis
b. Logistic Regression
c. Multiple Regression
d. None of the above
38. When data points are narrowly scattered around the
regression line in a scatter plot, this indicates that the
correlation between X and Y is:
a. Very low
b. Moderate
c. Very High
d. Zero
39. One of the assumptions of regression analysis is that:
a. There is a linear relationship between the variables
b. There is a causal relationship between the variables
c. Subjects were randomly assigned to groups
d. The predictors are dichotomous
Interpretation of Measures
40. On a 75-item test, a student obtains a score that is at
the 75th percentile. This means that:
a. The student answered all questions correctly
b. The student answered 75% of the items correctly
c. The student's score exceeded 25% of the scores
obtained by others
d. The student's score exceeded 75% of the scores
obtained by others
41. If you are interested in assessing individual
differences, you would prefer test scores to be:
a. Criterion referenced
b. Normed
c. Standardized
d. Percentages

Reliability of Measurement
42. When a test's reliability coefficient is equal to 1.0,
its standard error of measurement:
a. Is equal to 1.0
b. Is equal to 0.0
c. Is greater than 0 but less than 1
d. Is equal to the test's standard deviation
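Question 42 follows directly from the standard error of measurement formula, SEM = SD x sqrt(1 - r). A quick sketch (the SD of 15 is an arbitrary illustrative value, in the style of an IQ scale):

```python
import math

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

print(sem(15, 1.00))  # 0.0 -- perfect reliability leaves no measurement error
print(sem(15, 0.91))  # ~4.5
print(sem(15, 0.00))  # 15.0 -- zero reliability: SEM equals the test SD
```

So as reliability rises toward 1.0, the SEM shrinks to 0; as it falls to 0, the SEM grows to the test's own standard deviation.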
43. To reduce consensual observer drift, you would:
a. Have observers check each other's ratings
b. Have observers work independently
c. Use a single blind technique
d. Use a double blind technique
Validity of Measurement
44. To determine the criterion-related validity of a test, a
psychologist administers the test to current
employees and correlates their test scores with
available supervisor ratings. This is an example of a
___________________ validation study.
a. Convergent
b. Predictive
c. Concurrent
d. Construct
45. In a multitrait multimethod matrix, a test's validity
would be confirmed if the:
a. Monotrait monomethod coefficients are low and
the heterotrait heteromethod coefficients are high
b. Monotrait heteromethod coefficients are high and
the heterotrait monomethod coefficients are low
c. Monotrait monomethod coefficients are high and
the monotrait heteromethod coefficients are low
d. Heterotrait monomethod coefficients are high and
the heterotrait heteromethod coefficients are low
46. Criterion contamination tends to artificially _______
a predictor's criterion-related validity coefficient.
a. Invalidate
b. Prove
c. Inflate
d. Deflate
Test Construction, Features, and Utility
47. A screening test developed to identify children with
learning disabilities is administered to 100 children,
60 who have a learning disability and 40 who do
not. The test correctly identifies 50 of the 60
children with learning disabilities and 25 of the
children who do not have learning disabilities. The
number of false positives in this situation is:
a. 50
b. 25
c. 15
d. 10
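The arithmetic behind question 47 amounts to filling in a 2 x 2 classification table; the counts below come straight from the question text:

```python
# Screening-test counts from question 47.
has_ld, no_ld = 60, 40                        # children with / without a learning disability
true_positives = 50                           # LD children correctly flagged
true_negatives = 25                           # non-LD children correctly cleared
false_negatives = has_ld - true_positives     # LD children the test missed
false_positives = no_ld - true_negatives      # non-LD children wrongly flagged
print(false_positives)                        # 15
sensitivity = true_positives / has_ld         # 50/60, about .83
specificity = true_negatives / no_ld          # 25/40, about .63
```

Laying out all four cells first makes it hard to confuse false positives (healthy cases flagged) with false negatives (true cases missed).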


Answers Test III


1. a
2. d
3. c
4. c
5. d
6. c
7. a
8. a
9. a
10. c
11. a
12. b
13. a
14. d
15. d
16. b
17. c
18. c
19. d
20. d
21. b
22. d
23. a
24. a
25. b
26. c
27. b
28. c
29. b
30. c
31. a
32. c
33. a
34. b
35. d
36. d
37. b
38. c
39. a
40. d
41. b
42. b
43. b
44. c
45. b
46. c
47. c


METHODOLOGY SAMPLE TEST IV


The following review questions were not taken from past comp
exams or review packets. They were created by the reviewer
(Lynn Vogel) and will not represent all of the material that may
be on the exam. This is intended to serve as a review and not
your sole means of studying for the exam.

Research Design
1. An educational psychologist wants to test the
hypothesis that the effectiveness of a mastery
learning method (versus a traditional method) for
learning college algebra is a function of both math
aptitude and math anxiety. Identify the independent
variable(s).
a. Math aptitude and anxiety
b. Learning method
c. College algebra achievement
d. Both a and b
2. An educational psychologist wants to test the
hypothesis that the effectiveness of a mastery
learning method (versus a traditional method) for
learning college algebra is a function of both math
aptitude and math anxiety. Identify the dependent
variable(s).
a. Math aptitude and anxiety
b. Learning method
c. College algebra achievement
d. Both a and b
3. A researcher asks a sample of male and female
mental health professionals to describe a healthy
male adult and a healthy female adult. Based on
his review of the literature, the researcher
hypothesizes that the adjectives used by both male
and female mental health professionals to describe a
healthy male adult will be more positive than the
adjectives used to describe the healthy female adult.
Identify the independent variable(s).
a. Gender of mental health professional
b. Gender of healthy adult
c. Descriptive adjectives
d. Both a and b
4. A researcher asks a sample of male and female
mental health professionals to describe a healthy
male adult and a healthy female adult. Based on
his review of the literature, the researcher
hypothesizes that the adjectives used by both male
and female mental health professionals to describe a
healthy male adult will be more positive than the
adjectives used to describe the healthy female adult.
Identify the dependent variable(s).
a. Gender of mental health professional
b. Gender of healthy adult
c. Descriptive adjectives
d. Both a and b
5. In a research study, different treatment groups that
are being compared represent the different levels of
the:
a. Dependent Variable
b. Independent Variable
c. Static Variable
d. Control Variable

6. An investigator believes that job satisfaction and
motivation are related to level of self-esteem. To test
this hypothesis, she will administer measures of self-esteem, job satisfaction, and job motivation to a
sample of workers at a large manufacturing company.
The problem with this study is that any relationship
that the investigator finds between variables may be
due to ____ variables rather than the ____ variable.
a. Extraneous, Independent
b. Confounding, Independent
c. Both a and b
d. None of the above
7. Due to the unreliability of a test, many subjects who
receive extremely high scores on the first
administration of a test receive scores closer to the
mean on the second administration. This is referred to
as:
a. Multiple Regression
b. Reactivity
c. Regression to the mean
d. Logistic regression
8. Demand characteristics are a threat to a study's
______ validity when cues suggest the behavior that is
expected by the experimenters.
a. Internal
b. Predictive
c. Discriminant
d. External
9. Counterbalancing is used to control for:
a. Order effects
b. Practice effects
c. Maturation
d. Selection
10. Pretesting is a threat to a study's internal validity when:
a. Different tests are used as the pretest and posttest
b. Exposure to the pretest alters subjects' reactions
to the treatment
c. Exposure to the pretest alters subjects'
performance on the posttest
d. Pretesting is never a threat to validity
11. Attrition is most likely to be a threat to a study's
internal validity when:
a. A significant number of subjects drop out of a
study before completion
b. A larger number of subjects drop out of one group
more than the other groups
c. The type of subjects that drop out of one group
differ from the type of subjects that drop out of
another group
d. Attrition is always a problem
12. An experimenter compares the effects of 3 different
diets on weight loss by assigning overweight subjects
to either Diet A, Diet B, or Diet C and then
determining each subject's weight one week, 6 weeks,
and 3 months after beginning the diet. This study is an
example of which of the following research?
a. Between Groups
b. Within Groups
c. Mixed Design
d. ABAB Design
May 2007

13. A behavioral psychologist interested in the
effectiveness of a self-reinforcement procedure on the
caloric intake of overweight adolescents would most
likely use a reversal single-subject design in order
to control which of the following?
a. History effects
b. Reactivity effects
c. Order effects
d. Placebo effects
14. Which of the following would be most useful for
helping the investigator above control the threat to
internal validity of his study?
a. An unobtrusive measure
b. A control group
c. A Solomon 4-Group Design
d. Random Selection
15. Ethical considerations often dictate which research
design an investigator should use. Thus, when
investigating the effectiveness of aversion therapy
for reducing violent and aggressive behaviors, a
researcher would MOST likely use which of the
following single-subject designs?
a. Time series
b. Multiple Baseline
c. Posttest only
d. Reversal
16. When conducting a research study, a psychologist
would use matching in order to:
a. Ensure that variability in the dependent variable
is not due to random error
b. Maximize the effects of an extraneous variable
on the dependent variable
c. Maximize the effects of the independent
variable on the dependent variable
d. Ensure that groups are initially equivalent with
regard to an extraneous variable
17. An investigator would use blocking to control an
extraneous variable when:
a. The study includes a small number of subjects
b. The variable does not correlate with the
dependent variable
c. The investigator wants to statistically remove
the effects of that variable
d. The investigator wants to statistically analyze
the effects of that variable
18. Dr. I Que conducts a cross-sectional study to assess
the effects of increasing age on certain cognitive
abilities. The results of her study suggest that these
abilities begin to deteriorate during the early 20s.
When interpreting the results of her study, Dr. Que
should be aware that the study may have been
contaminated by:
a. Demand characteristics
b. Halo effects
c. Cohort effects
d. Carryover effects

19. Which of the following is an example of demand
characteristics?
a. An experimenter double-checks his data whenever
it doesn't conform to the research hypothesis
b. Subjects alter their behaviors in ways that help
them avoid negative evaluations by the
experimenter
c. Subtle cues in the environment communicate to
subjects what behaviors are expected of them
d. Research assistants change the way they code
data after speaking to each other
20. Threats to internal validity reduce an investigator's
ability to determine:
a. Relationships
b. Causality
c. Generalizability
d. None of the above
Statistical Inference and Other Concepts
21. If a researcher compares the difference between 2
means to assess the effect of an independent
variable on the dependent variable, she will have the
most confidence in the results if they are statistically
significant at the _________ level of significance:
a. .01
b. .5
c. .001
d. .05
22. A psychologist administers an achievement test to a
group of 75 hyperactive 6th graders. The mean of the
distribution of scores is 40 and the standard deviation
is 8. In this distribution, a raw score of 50 would be
equivalent to a z score of:
a. +1.25
b. +1.00
c. +5.00
d. +10.0
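The arithmetic in item 22 can be verified directly; this snippet is an illustrative check, not part of the original packet:

```python
def z_score(raw, mean, sd):
    """Standard score: number of SDs a raw score lies from the mean."""
    return (raw - mean) / sd

# Item 22: mean = 40, SD = 8, raw score = 50
print(z_score(50, 40, 8))  # 1.25, i.e. option (a)
```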
23. In a negatively skewed distribution:
a. The median is greater than the mean
b. The median is less than the mean
c. The mean is greater than the mode
d. The median is greater than the mode
24. If one or two extreme scores are added to a
distribution of 50 scores:
a. The value of the mean will be affected more than
the value of the median
b. The value of the median will be affected more
than the value of the mean
c. The value of the mode will be affected more than
the median or mean
d. The mean, median, and mode are affected equally
25. A ______ test states that the null hypothesis is false
and does not predict a direction.
a. One-tailed
b. Two-tailed
c. Either a or b
d. None of the above


26. Scores obtained on the Beck Depression Inventory
are on which scale?
a. Ordinal
b. Nominal
c. Ratio
d. Interval
27. College major is measured on what kind of scale?
a. Ordinal
b. Nominal
c. Ratio
d. Interval
28. As alpha increases, the probability of making a Type
II error ____________ and power _____________.
a. Increases, Decreases
b. Decreases, Increases
c. Increases, Increases
d. Decreases, Decreases
29. Power refers to the:
a. Sensitivity of finding real differences
b. How big a difference there is between groups
c. Probability of correctly rejecting the null
d. Both a and c
Nonparametric Statistics
Univariate Statistics
30. A researcher wants to assess the effectiveness of a
training course for improving SAT scores by
comparing pretest and posttest scores for a group of
high school seniors. To analyze the data obtained in
this study, the researcher should use which statistical
test?
a. 2-Way ANOVA
b. Repeated Measures Chi-Square
c. Kolmogorov
d. Paired Samples T-test
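For item 30, the paired-samples t statistic is computed on the pre-to-post difference scores. A minimal sketch with made-up SAT scores (the data are hypothetical, not from the packet):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre, post):
    """t = mean(differences) / (SD(differences) / sqrt(n))."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical scores before and after the training course
pre  = [1000, 1100, 950, 1050, 1200]
post = [1050, 1150, 1000, 1080, 1230]
print(round(paired_t(pre, post), 2))  # 8.57 for these made-up data
```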
31. In a study with two independent variables, finding
that the effects of different levels of one variable are
not the same at all levels of the other variable is
referred to as a(n):
a. Interaction
b. Main Effect
c. Post Hoc
d. Confound
32. Post Hoc comparisons are conducted when:
a. The F ratio is insignificant
b. The F ratio is significant
c. The F ratio indicates an interaction
d. The F ratio indicates a main effect
33. To assess interaction effects you must be using:
a. A true experimental design
b. A multivariate design
c. A factorial design
d. None of the above
34. An increase in experimentwise error refers to an
increase in:
a. Type I error
b. Type II error
c. Familywise error
d. Both a and b

Correlational Techniques
35. Which of the following correlation coefficients is
most appropriate for the relationship between gender
and SAT score?
a. Spearman
b. Point Biserial
c. Biserial
d. Chi Square
36. A ____ correlation indicates that people scoring low
on one variable tend to obtain high scores on another.
a. Positive
b. Zero
c. Perfect
d. Negative
37. If the correlation between X and Y is .70, this means
that _____% of the variability in Y is explained by X.
a. 49
b. 30
c. 70
d. 7
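Item 37 is the coefficient of determination: squaring the correlation gives the proportion of shared variance. A quick check (illustrative, not from the packet):

```python
# r = .70, so r squared gives the percent of variability in Y explained by X
r = 0.70
print(round(r ** 2 * 100))  # 49, i.e. option (a)
```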
Multivariate Statistics
38. The sum of the deviations of data points from a
regression line:
a. Is always positive
b. Is always negative
c. Always equals 0
d. Is equal to the variance explained
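Item 38 can be demonstrated numerically: when a least-squares line includes an intercept, its residuals always sum to zero. This sketch uses made-up data points (an assumption for illustration only):

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
b, a = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(sum(residuals))  # effectively 0, floating-point noise aside
```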
39. A researcher is likely to conduct a _______ rotation in a
factor analysis if he believes that the factors underlying
the tests included in the analysis are correlated.
a. Oblique
b. Orthogonal
c. Varimax
d. Both b and c
40. Which analysis would you use if you were to test
theories about relationships between variables?
a. Path Analysis
b. Logistic Regression
c. Multiple Regression
d. None of the above
41. If the relationship between two variables disappears
when you take away another variable, you have a:
a. Mediator
b. Confound
c. Moderator
d. Covariate
42. Multicollinearity:
a. Increases the probability that a correlation will be
statistically significant
b. Refers to high correlations between predictors and
is a problem for multiple regression
c. Refers to high correlations between the predictor
and outcome and is not a problem
d. Provides semi-partial correlations for each
predictor


Interpretation of Measures
43. The test taken to qualify for a driver's license is an
example of a:
a. Criterion referenced test
b. Normed test
c. Validation test
d. Construct validity
Reliability of Measurement
44. To maximize a test's reliability coefficient, you
would:
a. Make sure the test is homogeneous with regard
to content domain
b. Include in the tryout sample individuals who are
homogeneous with regard to the attribute
measured by the test
c. Use a true-false item format
d. Make sure the test is valid
45. Which of the following would be the least
appropriate for assessing the reliability of a 20 item
arithmetic test?
a. Test Retest
b. Coefficient Alpha
c. Split-Half
d. All of the above are appropriate
Validity of Measurement
46. Convergent and Discriminant validity are both
methods for assessing ___________ validity.
a. Predictive
b. Construct
c. Criterion Related
d. Discriminative
47. A test developer would use multitrait-multimethod
matrices in order to assess a test's:
a. Differential validity
b. Incremental validity
c. Concurrent and predictive validity
d. Convergent and discriminant validity
48. The difference between predictive and concurrent
validity is most related to:
a. The time interval between administration of the
predictor and the criterion
b. The type of statistic used to analyze data
collected on the predictor and the criterion
c. The nature of the construct measured by the
predictor
d. The sources of measurement error
Test Construction, Features, and Utility
49. A 25-item test is administered to a group of
examinees and the resulting distribution of scores is
negatively skewed. Adding a few items to the test
that have difficulty indexes between .00 and .50 will
most likely:
a. Increase the negative skewness of the
distribution
b. Cause the distribution to become positively
skewed
c. Change the shape of the distribution so that it is
closer to normal
d. Change the shape of the distribution so that it is
further from normal

Answers Test IV
1. d
2. c
3. d
4. c
5. b
6. c
7. c
8. d
9. a
10. c
11. c
12. c
13. a
14. b
15. b
16. d
17. d
18. c
19. c
20. b
21. c
22. a
23. a
24. a
25. a
26. d
27. b
28. b
29. d
30. d
31. a
32. b
33. c
34. d
35. b
36. d
37. a
38. c
39. a
40. a
41. a
42. b
43. a
44. a
45. a
46. b
47. d
48. a
49. c


METHODOLOGY SAMPLE TEST V

The following review questions were not taken from past comp
exams or review packets. They were created by the reviewer
(Lynn Vogel) and will not represent all of the material that may
be on the exam. This is intended to serve as a review and not
your sole means of studying for the exam.

Research Design
1. An educational psychologist believes that children
will be better spellers if they are provided with
spaced rather than massed practice. Identify the
independent variable(s).
a. Spaced Practice
b. Massed Practice
c. Type of Practice
d. Spelling ability
2. An educational psychologist believes that children
will be better spellers if they are provided with
spaced rather than massed practice. Identify the
dependent variable(s).
a. Spaced Practice
b. Massed Practice
c. Type of Practice
d. Spelling ability
3. A psychologist suspects that a teacher's expectations
about a student's academic performance will have a
self-fulfilling prophecy effect on the student's
actual academic achievement but that the magnitude
of the effect will depend on the student's level of
self-esteem. Identify the independent variable(s).
a. Self Esteem
b. Academic performance
c. Teacher's Expectations
d. Both a and c
4. A psychologist suspects that a teacher's expectations
about a student's academic performance will have a
self-fulfilling prophecy effect on the student's
actual academic achievement but that the magnitude
of the effect will depend on the student's level of
self-esteem. Identify the dependent variable(s).
a. Self Esteem
b. Academic performance
c. Teacher's Expectations
d. Both a and c
5. In a true experiment, the variable that subjects are
randomly assigned to is referred to as the:
a. Dependent Variable
b. Independent Variable
c. Static Variable
d. Control Variable
6. A researcher divides the population into subgroups
according to certain characteristics (e.g., age,
ethnicity) and then randomly selects subjects
from each subgroup. This sampling technique is
known as:
a. Stratified Random
b. Cluster Sampling
c. Quota Sampling
d. Stratified Clusters

7. The primary difference between true experimental
research and quasi-experimental research is that in the
former:
a. Subjects are randomly assigned to groups
b. Subjects are randomly selected from the
population
c. Subjects are both randomly assigned and selected
d. Subjects are unaware of which group they were
selected for
8. Variability in the dependent variable due to the
unreliability of the measuring instruments is a source of:
a. Systematic error
b. Random error
c. Extraneous error
d. Both a and c
9. The random selection of subjects for a research study is
most useful for maximizing a study's _____ validity.
a. Internal
b. Predictive
c. Discriminant
d. External
10. The double-blind technique is most useful for
controlling:
a. Carryover effects
b. Reactivity
c. Practice effects
d. Differential selection
11. The Solomon Four-Group Design is used to control
which of the following?
a. Instrumentation
b. Pretesting
c. Order Effects
d. Selection
12. Subjects for a research study at a university are
volunteers from the subject pool. Most of the subjects
are psychology undergraduates. In this situation,
selection is a threat to the study's:
a. External validity
b. Internal Validity
c. Incremental Validity
d. Both a and b
13. Which of the following single-subject designs would
you be LEAST likely to use when assessing the
effectiveness of a behavioral treatment for reducing
head-banging in autistic children?
a. AB
b. ABAB
c. Multiple Baseline
d. Either a or b
14. An advantage of the ABA design over the AB design
is that the former better controls which of the
following threats to internal validity?
a. Instrumentation
b. Regression
c. History
d. Experimenter Expectancy


15. An investigator, using a one-group time series design
to assess the effects of a new safety campaign on the
number of work-related accidents at a large
manufacturing company, measures the number of
accidents at regular intervals 6 months before and 6
months after instituting the safety campaign. The
investigator can probably consider which of the
following to be the biggest threat to the internal
validity of his study?
a. Regression
b. Maturation
c. History
d. Attrition
16. Dr. CP Anderson, an industrial psychologist, is
conducting a research study to assess the effects of a
special training course on the job performance of
accountants. In addition, Dr. Anderson wants to
determine if the effectiveness of the course is
affected by the administration of a pretest.
Therefore, she is most likely to use which of the
following research designs?
a. Solomon 4-Group
b. Latin Square
c. Time Series
d. Multivariate
17. The independent variable(s) in the above study is (are):
a. Training Course
b. Training Course and Pretesting
c. Job Satisfaction and Job Motivation
d. Course effectiveness
18. An educational psychologist believes that the use of
reinforcers to improve the academic achievement of
primary school children will be more effective for
slow learners. She administers an academic
achievement test to all first grade children in a large
public school and then includes in her study only
those children who received the lowest scores on the
test. After eight weeks of reinforcement, the
psychologist readministers the academic
achievement test to the students to determine if the
reinforcement has had a positive effect. The major
threat to the internal validity of this study is:
a. Restriction of range
b. Attrition
c. Selection
d. Carryover effects
19. When conducting an analogue study, you would be
most concerned about:
a. Limited ecological validity
b. Limited population validity
c. Limited internal validity
d. Limited construct validity
Statistical Inference and Other Concepts
20. The scale of measurement that is characterized by
equal intervals and an arbitrary 0 point is which of
the following?
a. Ordinal
b. Ratio
c. Nominal
d. Interval

21. The shape of a distribution on z scores is:
a. Always flat
b. Always bimodal
c. Identical to the distribution of raw scores
d. Always bell-shaped (normal)
22. Alpha is the probability of:
a. Correctly rejecting the null hypothesis
b. Incorrectly rejecting the null hypothesis
c. Correctly retaining the null hypothesis
d. Incorrectly retaining the null hypothesis
23. A psychologist obtains IQ scores for a group of 250
junior high school students. Assuming that the scores
are normally distributed, the psychologist can
conclude that approximately _______ % of the scores
fall within 2 standard deviations above and below the
mean of the distribution?
a. 99
b. 95
c. 68
d. 50
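The 95% figure in item 23 comes straight from the normal curve; a quick check (illustrative only, not from the packet):

```python
from statistics import NormalDist

# Proportion of a normal distribution within 2 SDs of the mean
nd = NormalDist()  # standard normal: mean 0, SD 1
coverage = nd.cdf(2) - nd.cdf(-2)
print(round(coverage * 100))  # 95, i.e. option (b)
```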
24. The number of times a rat presses a lever for a reward
demonstrates what kind of data?
a. Ordinal
b. Nominal
c. Ratio
d. Interval
25. Birth order is measured on what scale?
a. Ordinal
b. Nominal
c. Ratio
d. Interval
26. Which of the following describes the relationship
between the variance and the standard deviation?
a. Variance is 2x the size of standard deviation
b. Variance is the square root of standard deviation
c. Variance is the square of standard deviation
d. None of the above
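Item 26 can be confirmed numerically with any data set (the numbers below are arbitrary): the variance is the square of the standard deviation.

```python
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary example scores
print(pvariance(data))    # 4.0
print(pstdev(data) ** 2)  # 4.0 -- variance equals SD squared
```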
Nonparametric Statistics
27. Nonparametric techniques are also known as:
a. Small sample inferential statistics
b. Distribution free tests
c. Multivariate tests
d. Descriptive techniques
Univariate Statistics
28. A psychologist would use the analysis of covariance
when analyzing the data collected in a research study
in order to:
a. Maximize true score variability
b. Statistically remove the effects of extraneous
variables
c. Statistically analyze the effects of
confounding variables
d. Minimize the effects of random error
29. In order to assess interaction effects, a research study
must include at least:
a. 2 levels of one independent variable
b. More than 2 levels of one independent
variable
c. 2 levels each of 2 independent variables
d. 2 levels each of 2 dependent variables

30. You are interested in analyzing the difference in IQ
scores of two samples. Each sample consists of 25
students. The appropriate statistical technique is:
a. Multiple sample chi-square
b. Kolmogorov test
c. T test for independent samples
d. Regression
31. Which of the following is used to compare means?
a. 1-Way ANOVA
b. T-test
c. Both a and b
d. None of the above
32. The relationship between main effects and
interactions is described as:
a. You can only have an interaction if both main
effects are significant
b. You can only have main effects if the
interaction is significant
c. You can only have either a main effect or an
interaction
d. None of the above
Correlational Techniques
33. Individuals scoring low on a quality of life scale
tend to score highly on a depression scale. This
relationship is an example of a:
a. Positive Correlation
b. Negative Correlation
c. Multiple Correlation
d. Point Biserial Correlation
34. Which of the following correlation coefficients is most
appropriate when data on both variables is rank ordered?
a. Biserial
b. Phi
c. Spearman
d. None of the above
35. A researcher obtains IQ scores and psychology exam
scores from a group of students. He correlates the 2
sets of scores and obtains an r of -.42. This means
approximately ___% of the variance in IQ scores is
shared in common with psychology exam scores.
a. 32
b. 58
c. 17
d. 42
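The arithmetic for item 35, as an illustrative check: shared variance is r squared, and the sign of r is irrelevant.

```python
# r = -.42; squaring drops the sign and gives the shared variance
r = -0.42
print(round(r ** 2 * 100, 1))  # 17.6, i.e. about 17% (option c)
```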
Multivariate Statistics
36. A researcher uses scores on several measures to
predict scores on a criterion. The researcher is using
the multivariate technique known as:
a. Logistic Regression
b. Predictive Regression
c. Discriminant Regression
d. Multiple Regression
37. The least squares criterion is used to:
a. Determine the location of the regression line in
a scatter plot
b. Statistically remove the effects of a confounding
variable
c. Identify the criterion group that an examinee
most closely resembles
d. Determine if the model fits the data

38. In factor analysis, a factor loading expresses the
correlation between:
a. Communality and Component
b. Item and Component
c. Eigenvalue and Item
d. Synthetic factors are used instead of loadings
39. When a correlation coefficient is significantly higher
for males than for females, gender is acting as a:
a. Blocking variable
b. Suppressor variable
c. Dichotomous variable
d. Moderator variable
40. Dr. Locke wants to investigate his theory that past
performance determines one's feelings of self-efficacy,
which in turn affect one's goals and goal attainment.
Which of the following techniques will be most useful
for this investigation?
a. Canonical
b. Path Analysis
c. Discriminant Analysis
d. Both a and b
Interpretation of Measures
41. Mastery tests are ____________.
a. Criterion Referenced
b. Normed
c. Both a and b
d. None of the above
42. According to classical test theory, measurement error is:
a. Unsystematic
b. Systematic
c. Both a and b
d. Due to invalidity
Reliability of Measurement
43. KR-20 is a variation of coefficient alpha that can be
used to test items that are:
a. Dichotomous
b. Polychotomous
c. Thurstone scaled
d. All of the above
44. The major problem when using percent agreement as a
measure of interrater reliability is that this method:
a. Tends to underestimate the level of agreement
b. Cannot be used when the variability of one
scorer's scores differs substantially from another's
c. Doesn't take into account chance agreement
d. There are no major problems with percent
agreement statistics
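Cohen's kappa is the usual chance-corrected alternative to percent agreement in item 44. A minimal sketch with a hypothetical 2x2 agreement table (the counts are invented for illustration):

```python
def cohens_kappa(table):
    """Kappa for a square agreement table: table[i][j] counts cases
    where rater 1 assigned category i and rater 2 assigned category j."""
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / total
    expected = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(len(table))
    )
    return (observed - expected) / (1 - expected)

# Raters agree on 45 + 15 of 70 cases (about 86% raw agreement),
# but part of that agreement is expected by chance alone.
table = [[45, 5], [5, 15]]
print(round(cohens_kappa(table), 2))  # 0.65, well below the raw 86%
```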
45. The same test is readministered to the same examinee
six times over a 12 month period and the examinee
gets six different scores. This suggests the test has:
a. Low reliability
b. Low predictive validity
c. Low convergent validity
d. High differential validity


Validity of Measurement
46. Tenured professors would be asked to review items
for an observational scale used to rate the
performance of 1st year instructors in order to
establish the scale's ___________ validity.
a. Construct
b. Face
c. Criterion
d. Content
47. A school psychologist develops a test for high
school freshmen to identify students who are likely
to quit school prior to graduation. The psychologist
will be most interested in establishing which type of
validity for this test?
a. Content
b. Construct
c. Convergent
d. Criterion-Related
48. Which of the following is a measure of construct
validity?
a. Convergent
b. Divergent
c. Discriminative
d. All of the above
Test Construction, Features, and Utility
49. Which item difficulty level is associated with the
greatest differentiation of examinees:
a. +1
b. .5
c. .01
d. -1
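The reasoning behind item 49 (an illustrative sketch): a dichotomous item's variance is p(1 - p), which peaks when the difficulty index p is .5, so mid-difficulty items differentiate examinees best.

```python
def item_variance(p):
    """Variance of a dichotomous item with difficulty (proportion passing) p."""
    return p * (1 - p)

for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, round(item_variance(p), 4))
# Output peaks at p = .5 (variance .25), i.e. option (b)
```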
50. When using an objective test:
a. An examinee will obtain the same score
regardless of who scores the test
b. Items included in the test have been found to
correlate highly with an objective criterion
c. The test must be administered in accord with
clearly defined guidelines
d. Both a and c
51. A D index refers to an item's:
a. Discriminability
b. Difficulty
c. Distractibility
d. Both a and b
52. Mortality in a study refers to:
a. The death of subjects
b. The death of researchers
c. Loss of respondents to a treatment group
d. Both a and b


Answers Test V
1. c
2. d
3. d
4. b
5. b
6. a
7. a
8. b
9. d
10. b
11. b
12. a
13. b
14. c
15. c
16. a
17. b
18. a
19. a
20. d
21. c
22. b
23. b
24. c
25. a
26. c
27. b
28. b
29. c
30. c
31. c
32. d
33. b
34. c
35. c
36. d
37. a
38. b
39. d
40. b
41. a
42. a
43. a
44. c
45. a
46. d
47. d
48. d
49. b
50. a
51. a
52. c
