
Psychological Bulletin © 2010 American Psychological Association
2010, Vol. 136, No. 6, 1123–1135 0033-2909/10/$12.00 DOI: 10.1037/a0021276
New Trends in Gender and Mathematics Performance: A Meta-Analysis
Sara M. Lindberg, Janet Shibley Hyde, and Jennifer L. Petersen
University of Wisconsin–Madison

Marcia C. Linn
University of California, Berkeley

In this article, we use meta-analysis to analyze gender differences in recent studies of mathematics
performance. First, we meta-analyzed data from 242 studies published between 1990 and 2007, representing
the testing of 1,286,350 people. Overall, d = 0.05, indicating no gender difference, and variance
ratio = 1.08, indicating nearly equal male and female variances. Second, we analyzed data from large
data sets based on probability sampling of U.S. adolescents over the past 20 years: the National
Longitudinal Surveys of Youth, the National Education Longitudinal Study of 1988, the Longitudinal
Study of American Youth, and the National Assessment of Educational Progress. Effect sizes for the
gender difference ranged between −0.15 and +0.22. Variance ratios ranged from 0.88 to 1.34. Taken
together, these findings support the view that males and females perform similarly in mathematics.
Keywords: mathematics performance, gender, meta-analysis
Supplemental materials: http://dx.doi.org/10.1037/a0021276.supp
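
The two statistics summarized in the abstract can be illustrated with a small sketch. It follows the definitions given in the Method section: d is the male mean minus the female mean, divided by the pooled within-gender standard deviation, and the variance ratio is the male variance divided by the female variance. The pooled-variance formula used here is the usual textbook one and is an assumption, since the article does not spell it out. The function names and all input numbers are hypothetical and are not data from the studies analyzed.

```python
import math


def cohens_d(mean_m, mean_f, sd_m, sd_f, n_m, n_f):
    """Standardized mean difference; positive values mean males scored higher."""
    # Pooled within-gender variance (standard textbook formula, assumed here).
    pooled_var = ((n_m - 1) * sd_m ** 2 + (n_f - 1) * sd_f ** 2) / (n_m + n_f - 2)
    return (mean_m - mean_f) / math.sqrt(pooled_var)


def variance_ratio(sd_m, sd_f):
    """Male variance over female variance; values above 1 mean greater male variability."""
    return sd_m ** 2 / sd_f ** 2


# Hypothetical summary statistics for one sample (illustration only).
d = cohens_d(mean_m=52.0, mean_f=51.5, sd_m=10.0, sd_f=9.6, n_m=100, n_f=100)
vr = variance_ratio(sd_m=10.0, sd_f=9.6)
print(f"d = {d:.2f}, VR = {vr:.2f}")
```

With these made-up inputs, d is about 0.05 and VR is about 1.09, values of the same order as the overall results reported in the abstract.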

Sara M. Lindberg, School of Medicine and Public Health, University of Wisconsin–Madison; Janet Shibley Hyde and Jennifer L. Petersen, Department of Psychology, University of Wisconsin–Madison; Marcia C. Linn, Graduate School of Education, University of California, Berkeley.
This research was funded by the National Science Foundation (Grant REC 0635444 to Janet Shibley Hyde) and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (Grant T32 HD049302 to Sara M. Lindberg). The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of the funding organizations. We would like to thank Amanda Riek for her help coding articles.
Correspondence concerning this article should be addressed to Janet Shibley Hyde, University of Wisconsin, Department of Psychology, 1202 West Johnson Street, Madison, WI 53706. E-mail: jshyde@wisc.edu

Policy decisions, such as funding for same-sex education, as well as the continuing stereotype that girls and women lack mathematical ability, call for up-to-date information about gender differences in mathematical performance. Such stereotypes can discourage women from entering or persisting in careers in science, technology, engineering, and mathematics (STEM). Today women earn 45% of the undergraduate degrees in mathematics (National Science Foundation [NSF], 2008a) but make up only 17% of university faculty in mathematics (NSF, 2008b). We report on a meta-analysis of recent studies of gender and mathematics. We estimate the magnitude of the gender difference and test whether it varies as a function of factors such as age and the difficulty level of the test.

Stereotypes About Gender and Mathematics

Mathematics and science are stereotyped as male domains (Fennema & Sherman, 1977; Hyde, Fennema, Ryan, Frost, & Hopp, 1990; Nosek et al., 2009). Stereotypes about female inferiority in mathematics are prominent among children and adolescents, parents, and teachers. Although children may view boys and girls as being equal in mathematical ability, they nonetheless view adult men as being better at mathematics than adult women (J. Steele, 2003). Implicit attitudes that link men and mathematics have been demonstrated repeatedly in studies of college students (e.g., Kiefer & Sekaquaptewa, 2007; Nosek, Banaji, & Greenwald, 2002).

Parents believe that their sons’ mathematical ability is higher than their daughters’. In one study, fathers estimated their sons’ mathematical IQ at 110 on average, and their daughters’ at 98; mothers estimated 110 for sons and 104 for daughters (Furnham, Reeves, & Budhani, 2002; see also Frome & Eccles, 1998). Teachers, too, tend to stereotype mathematics as a male domain. In particular, they overrate boys’ ability relative to girls’ (Li, 1999; but see Helwig, Anderson, & Tindal, 2001).

These stereotypes are of concern for several reasons. First, in the language of cognitive social learning theory (Bandura, 1986), stereotypes can influence competency beliefs or self-efficacy; correlational research does indeed show that parents’ and teachers’ stereotypes about gender and mathematics predict children’s perceptions of their own abilities, even with actual mathematics performance controlled (Bouchey & Harter, 2005; Frome & Eccles, 1998; Keller, 2001; Tiedemann, 2000). Competency beliefs are important because of their profound effect on individuals’ selection of activities and environments (Bandura, 1997; Bussey & Bandura, 1999). According to an earlier meta-analysis, girls reported lower mathematics competence than boys did, although the difference was not large (d = +0.16; Hyde, Fennema, et al., 1990). In recent studies, elementary school boys still reported significantly higher mathematics competency beliefs than girls did (Else-Quest, Hyde, & Linn, 2010; Fredricks & Eccles, 2002; Lindberg, Hyde, & Hirsch, 2008; Watt, 2004).

A second concern is that stereotypes can have a deleterious effect on actual performance. Stereotype threat effects (C. M. Steele, 1997; C. M. Steele & Aronson, 1995) have been found for women in mathematics. In the standard paradigm, half the participants (talented college students) are told that the math test they are
about to take typically shows gender differences (threat condition), and the other half are told that the math test is gender fair and does not show gender differences (control). Studies have found that college women underperform compared with men in the threat condition but perform equal to men in the control condition, indicating that priming for gender differences in mathematics indeed impairs girls’ math performance (e.g., Ben-Zeev, Fein, & Inzlicht, 2005; Cadinu, Maass, Rosabianca, & Kiesner, 2005; Inzlicht & Ben-Zeev, 2000; Johns, Schmader, & Martens, 2005; Quinn & Spencer, 2001; Spencer, Steele, & Quinn, 1999). Stereotype threat effects have been found in children as early as kindergarten (Ambady, Shih, Kim, & Pittinsky, 2001). Other research, measuring implicit stereotypes about gender and math, has found that these implicit stereotypes predict performance in a calculus course (Kiefer & Sekaquaptewa, 2007).

Stereotypes play a role in policy decisions as well as personal decision making. For example, schools and states may base a decision to offer single-sex mathematics classes on the belief that these gender differences exist (Arms, 2007).

Gender and Mathematics Performance

The stereotypes about female inferiority in mathematics stand in distinct contrast to the scientific data on actual performance. A 1990 meta-analysis found an effect size of d = 0.15, with men and boys scoring higher, for gender differences in mathematics performance averaged over all samples; however, in samples of the general population (i.e., national samples and classrooms, as opposed to exceptionally precocious or low ability samples), women and girls scored higher but by a negligible amount (d = −0.05; Hyde, Fennema, & Lamon, 1990). Hedges and Nowell (1995), using data sets representing large probability samples of American adolescents, found d = 0.03 to 0.26 across the different data sets. Moreover, girls earn better grades in mathematics courses than boys through the end of high school (Dwyer & Johnson, 1997; Kenney-Benson, Pomerantz, Ryan, & Patrick, 2006; Kimball, 1989). In short, previous research showed that gender differences in mathematics performance were very small and, depending on the sample and outcome measure, sometimes favored boys and sometimes favored girls.

Several features of the 1990 meta-analysis (Hyde, Fennema, & Lamon, 1990) warrant more detailed description. Using computerized literature searches, the researchers identified 100 useable studies, which yielded 254 independent effect sizes representing the testing of more than 3 million persons. One key moderator analysis examined the magnitude of gender differences as a function of age and cognitive level of the test (computation was considered the lowest level, understanding of concepts was considered intermediate, and complex problem solving was considered the highest level). Girls performed better than boys at computation in elementary school and middle school, but the differences were small (d = −0.20 and −0.22, respectively) and there was no gender difference in high school. There was no gender difference in understanding of mathematical concepts at any age. For complex problem solving, there was no gender difference in elementary or middle school, but a gender difference favoring boys emerged in high school (d = 0.29). This last gender difference, although small, is of concern because complex problem solving is crucial for STEM careers.

A second moderator analysis examined the magnitude of gender differences in mathematics performance as a function of the ethnicity of the sample (Hyde, Fennema, & Lamon, 1990). The striking finding was that the small gender difference favoring boys was found for Whites (d = 0.13), but not for Blacks (−0.02) or Latinos (0.00).

Depth of Knowledge

Historically, researchers maintained that girls might do as well as or even better than boys on tests of computation, which require relatively simple cognitive processes (e.g., Anastasi, 1958). These same researchers concluded that male superiority emerged for tests requiring more advanced cognitive processing, such as complex problem solving. The 1990 meta-analysis by Hyde, Fennema, and Lamon provided some support for these ideas, although the gender difference in complex problem solving did not appear until the high school years and was not large even then.

Current mathematics education researchers conceptualize this issue of complexity of cognitive processes as a question of item demand and the depth of knowledge required to solve a particular problem. Webb (1999) developed a four-level depth of knowledge framework to identify the cognitive difficulty of mathematics items on standardized assessments. In this framework, Level 1 (recall) includes the recall of information such as facts or definitions, as well as performing simple algorithms. Level 2 (skill/concept) includes items that require students to make decisions about how to approach a problem. These items typically ask students to classify, organize, estimate, or compare information. Level 3 (strategic thinking) includes complex and abstract cognitive demands that require students to reason, plan, and use evidence. Level 4 (extended thinking) requires complex reasoning, planning, developing, and thinking over an extended period of time. Items at Level 4 require students to connect ideas within the content area or among content areas as they develop one problem-solving approach from many alternatives. This depth of knowledge framework was used to rate the cognitive demands of the tests that assess mathematics performance in the studies reviewed here.

New Trends

Cultural shifts have occurred since the 1980s that call for the reexamination of gender differences in mathematics. In the 1980s, a prominent explanation of male superiority in complex problem solving beginning in high school was gender differences in course choice (Meece, Eccles-Parsons, Kaczala, & Goff, 1982). Girls were less likely than boys to take advanced mathematics courses and advanced science courses. Because mathematical problem solving is an important component of chemistry and physics courses, students may learn those skills in science courses as much as in mathematics courses. Today, however, the gender gap in course taking has disappeared in all areas except physics. For the high school graduating class of 2005, 7.7% of boys and 7.8% of girls took calculus; 57.8% of girls and 50.6% of boys took chemistry; and 32.8% of girls and 36.8% of boys took physics (NSF, 2008c). Insofar as courses taken by students influence their mathematics performance, we expect the gender difference in complex problem solving in high school to have narrowed.

In addition, cross-national data show that the gender gap in mathematics performance narrows or even reverses in societies
with more gender equality (e.g., Sweden and Iceland), compared with those with more gender inequality (e.g., Turkey; Else-Quest et al., 2010; Guiso, Monte, Sapienza, & Zingales, 2008). Insofar as the United States has moved toward gender equality over the past 30 to 40 years, the gender gap in mathematics performance should have narrowed.

Findings from a recent analysis of data from state assessments of mathematics performance provide evidence that the gender gap in mathematics performance in the United States has indeed diminished or even vanished (Hyde, Lindberg, Linn, Ellis, & Williams, 2008). Those data had several limitations and raised some questions that deserve analysis in a larger study. First, they were based only on tests administered by the states to satisfy the requirements of the No Child Left Behind legislation. Items tapping Level 3 or 4 depth of knowledge were notably absent. Second, the data were derived only from students in Grades 2 through 11; trends in gender differences beyond Grade 11 (age 17) could therefore not be assessed. Third, the distributions of male and female performance were available for only part of the sample. The available results raise intriguing questions about gender differences in complex problem solving, because in some subgroups girls outperformed boys at the high end of the distribution.

Gender and Variability

Most of the research has focused on mean-level gender differences, but variability (variance) remains an issue even when means are similar. The greater male variability hypothesis was originally proposed in the 1800s and advocated by scientists such as Charles Darwin and Havelock Ellis to explain why there was an excess of men both in homes for the mentally deficient and among geniuses (Shields, 1982). In modern statistical terms, the hypothesis is that, independent of mean-level differences, men have a greater variance than women do on the intellectual trait of interest. Thus, the hypothesis states that men are more likely than women to be at both the top and the bottom of the statistical distribution of mathematics performance. Typically the statistic that is computed is the variance ratio (VR), the ratio of the male variance to the female variance. Thus values > 1.0 indicate greater male variability. On the basis of test norming data, Feingold (1992) found a VR of 1.11 for the Differential Ability Test’s Numerical Ability subtest, 1.20 for the SAT Mathematics test, and 1.02 for the Arithmetic subtest of the Wechsler Adult Intelligence Scale. Hedges and Nowell (1995) found VRs ranging between 1.05 and 1.25 for mathematics tests administered to national samples of adolescents such as the National Longitudinal Surveys of Youth (NLSY; Bureau of Labor Statistics, n.d.) and the National Education Longitudinal Study of 1988 (NELS:88; National Center for Education Statistics, n.d.-a). The analysis of recent state assessment data described earlier found VRs ranging from 1.11 to 1.21 (Hyde et al., 2008). Thus there is some evidence of greater male variability in mathematics performance, although the VRs are not terribly lopsided. The greater male variability hypothesis, of course, is a description of the data, not an explanation for it, but if true, it could partially account for findings of an excess of males at very high levels of mathematical performance (Hedges & Friedman, 1993). One goal of this meta-analysis was to reassess the greater male variability hypothesis for mathematics performance using contemporary data.

The Current Study

Several factors warrant a new meta-analysis of research on gender and mathematics performance. First, approximately 18 years of new data have accumulated since the 1990 meta-analysis (Hyde, Fennema, & Lamon, 1990). Second, cultural shifts have occurred over the last 2 decades. Specifically, girls are now taking advanced mathematics courses and some science courses in high school at the same rate as boys are, closing the gap in course choice. The magnitude of gender differences in mathematics performance is expected to be even smaller than it was in the 1990 meta-analysis; of particular interest is the gender difference favoring boys in complex problem solving in high school and whether this difference has narrowed in recent years. Third, statistical methods of meta-analysis have advanced. At the time of the 1990 meta-analysis, only fixed-effects models were available. Fixed-effects models have since been criticized, and random-effects and mixed models have been developed (Hedges & Vevea, 1998; Lipsey & Wilson, 2001). The current meta-analysis used a mixed-effects model, the advantages of which are detailed below.

Our goals in these meta-analyses were to provide answers to the following questions:

1. What is the magnitude of the gender difference in mathematics performance, using the d metric?

2. Does the direction or magnitude of the gender difference vary as a function of the depth of knowledge tapped by the test?

3. Developmentally, at what ages do gender differences appear or disappear?

4. Are there variations across U.S. ethnic groups, or across nations, in the direction or magnitude of the gender difference?

5. Has the magnitude of gender differences in mathematics performance declined from 1990 to 2007?

6. Do men and boys display greater variance in scores, and if so, by how much?

Study 1 addressed these questions with traditional methods of meta-analysis that involve identifying all possible studies using article databases. Study 2 addressed the questions using an alternative method advocated by Hedges and Nowell (1995), which involves the analysis of recent large, national U.S. data sets based on probability sampling. Cross-national probability-sampled data sets have been analyzed by others (Else-Quest et al., 2010; Guiso et al., 2008; Penner, 2008).

Study 1

Method

Identification of studies. Computerized database searches of ERIC, PsycINFO, and Web of Knowledge were used to generate a pool of potential articles. To identify all articles that investigated mathematics performance, the following search terms were used:
(math* or calculus or algebra or geometry) AND (performance or achievement or ability) NOT (mathematical model). These broad terms were selected to capture the widest possible range of research conducted on this topic while avoiding studies that used computational modeling methodology to study unrelated phenomena. Search limits restricted the results to articles that discussed research with human populations and that were published in English between 1990 and 2007. The three database searches identified 10,816, 9,577, and 18,244 studies, respectively, which were considered for inclusion. Given the tremendous volume of studies identified, we decided to rely solely on this method of identification, a potential limitation.

Abstracts and citations were imported into RefWorks citation manager, and then each article was evaluated for inclusion on the basis of the following criteria: (a) the title or abstract alluded to a measure of mathematics performance; (b) the study appeared to contain original data; (c) the study was conducted on a human population; (d) the sample included at least five males and five females. After evaluation, 3,941 studies met the aforementioned criteria. These articles were then printed and examined to determine whether they presented sufficient statistics for an effect size calculation. The final sample of studies included in Study 1 used data from 242 articles, comprising 441 samples and 1,286,350 people.

Although the final sample of studies is large, some readers may wonder why so many potential studies were lost during the coding processes. Most cases of exclusion were based on one of three reasons:

1. Many of the articles identified in our search reported on data from the large national data sets included in Study 2 (National Assessment of Educational Progress [NAEP], NELS:88, Longitudinal Study of American Youth [LSAY], NLSY) or from large international data sets (Second International Mathematics Study, Third International Mathematics and Science Study, Programme for International Student Assessment) that had been covered by other meta-analyses of gender differences in mathematics performance (Else-Quest et al., 2010; Guiso et al., 2008; Penner, 2008). Those articles were excluded from Study 1 to avoid redundancy.

2. The searches also picked up reports published by state, local, and federal educational organizations, few of which contained individual-level data.

3. Finally, many articles did not contain enough data to compute effect sizes (e.g., results were not disaggregated by gender) and additional data could not be obtained from the study authors.

Coding the studies. If studies reported data from large data sets or longitudinal studies that were likely to create multiple publications, this was noted so as to avoid inclusion of nonindependent effect sizes.

Several characteristics of each sample were coded as potential moderators: (a) age of the participants; (b) nationality of participants (American, Canadian, European, Australian/New Zealander, Asian, African, Latin American, or Middle Eastern); (c) for U.S. samples, majority ethnic group of participants (European American, African American, Asian American, Hispanic, other; mixed; or unreported), and (d) ability level of the participants (low ability, general ability, selective, highly selective).

Several aspects of the mathematics tests were also coded as potential moderators: (a) whether the test was time-limited; (b) whether the test included each of several problem types (multiple choice, short answer, open ended); (c) which types of mathematical content were included in the test (numbers and operations, algebra, geometry, measurement, data analysis and probability); (d) depth of knowledge (Level 1 recall, Level 2 skills/concepts, Level 3 strategic thinking, Level 4 extended thinking; levels are described in greater detail above; see also Webb, 1999); and (e) whether the test was specific to the local curriculum (i.e., based on published curricular standards for the region or developed in collaboration with local teachers, textbooks, and syllabi) or was relatively independent of the curriculum. Publication year was also coded as a potential moderator.

Fifty articles were double coded to determine interrater reliability. Percentage agreement for coding of the moderators was > 95% for all variables.

Effect size computation. Cohen’s d (Cohen, 1988) is the effect size for the standardized mean difference between two groups on a continuous variable (e.g., the mean difference between males and females on a continuous measure of mathematics performance). Thus, for each independent sample within an article, d was computed, with d = (M_m − M_f)/s_w and M_m = the mean for males, M_f = the mean for females, and s_w = the pooled within-gender standard deviation. Whenever possible, separate effect sizes were computed for independent groups within each sample (e.g., different age groups, Black and White participants). If means and standard deviations were not available, the effect size was computed from other statistics such as t or F, using formulas provided by Lipsey and Wilson (2001). The complete list of samples, with corresponding effect sizes and VRs, can be found in the online supplemental material for this article.

In a few cases, more than one effect size was available for the same sample. When that happened, the following decision rules were used to select or calculate a single effect size for inclusion in the analyses:

1. If a study used a longitudinal design, we used the effect size from the most recent measurement.

2. If multiple effect sizes were available from the same time point but missing data produced different sample sizes for different measures, then the effect size with the largest corresponding sample size was selected for inclusion in the analyses.

3. If there were multiple effect sizes from the same time point with the same sample size, means and standard deviations were pooled, and the pooled values were used to compute the composite effect size that was included in the analyses.

Raw effect sizes were corrected for bias, and standard errors were calculated. Specifically, estimated population effect sizes were used, which adjust for upward bias of effect sizes among
small samples (Hedges, 1981). In addition, inverse variance weights were calculated for each effect size so that analyses could be weighted by inverse variance, a procedure that allows large samples to have more leverage in the analyses than small samples have (Lipsey & Wilson, 2001).

VR computation. VRs were also computed for each study, with VR = variance_males / variance_females. Thus, a VR > 1 denotes greater male variability, and a VR < 1 denotes greater female variability. Standard errors and inverse variance weights were calculated for each VR. Analyses of the VRs were conducted as outlined by Katzman and Alliger (1992).

Data analyses. Results were analyzed using a mixed-effects model (Lipsey & Wilson, 2001) and SPSS macros by Wilson (2005). The mixed-effects model assumes that variability among effect sizes can be explained by both fixed factors (i.e., systematic differences due to moderators) and random factors (i.e., error variance). This approach is preferable to fixed-effects or random-effects models, which assume that all variability among effect sizes is accounted for by moderators or by error, respectively. In mixed-effects models, a random-effects variance component is computed on the basis of the residual homogeneity after moderator effects have been accounted for. Then, inverse variance weights are recalculated with the random-effects variance component added, and the model is refit.

To measure heterogeneity of variance of the effect sizes, Q statistics were computed using weights based on the mixed-effects model. When homogeneity analyses indicated that there was significant heterogeneity among the effect sizes, moderator analyses were conducted to test whether characteristics of the mathematics assessments or characteristics of the samples could explain the variability among effects. These moderator analyses allowed us to explore whether systematic differences among the studies led to reliably different effects. Moderators were tested for significance using an analog to analysis of variance in the case of categorical moderator variables (e.g., ethnicity) or an analog to multiple regression in the case of continuous moderator variables (e.g., publication year). The level of missing data for moderator variables (due to vague descriptions of mathematics measures and study procedures) made it untenable to conduct a simultaneous analysis of all moderators. The piece-wise approach is not ideal, but we believe that it is warranted in this case, because we were testing a priori hypotheses based on the findings of previous meta-analyses.

Results

Magnitude of gender differences. The overall weighted effect size, averaged over all studies, was d = +0.05, representing a negligible gender difference. Figure 1 shows the distribution of effect sizes, which is approximately normal and centered around 0. Heterogeneity analysis revealed that the set of effect sizes was significantly heterogeneous, Q_t(427) = 11,478.74, p < .001. The random effects variance component was .070.

Figure 1. Distribution of effect sizes in Study 1.

Moderator analyses. Given the heterogeneity among the effect sizes, we conducted analyses for suspected moderator variables. Table 1 displays the analyses for variations in effect size as a function of characteristics of the tests. Problem type (presence of multiple choice, short answer, and open ended questions) was the only aspect of the tests that significantly predicted heterogeneity among the effects, Q_b(3, 178) = 8.11, p < .05. The presence of multiple choice questions on exams predicted relatively better performance by boys (β = .16), whereas the presence of short answer and open ended questions predicted relatively better performance by girls (βs = −.02 and −.06, respectively).

The magnitude of the gender difference did not depend on whether there was a time limit for the test, Q_b(1, 136) = 2.21, p = .14, or whether the test was curriculum-focused, Q_b(1, 422) =
3.01, p = .08. Similarly, there were no variations in the magnitude of the effect size as a function of problem content (numbers and operations, algebra, geometry, measurement, probability), Q_b(5, 204) = 8.55, p = .13, or depth of knowledge, Q_b(3, 119) = 1.51, p = .68.

Table 1
Weighted Ordinary Least Squares Regressions Predicting Gender Differences in Math Performance, as Moderated by Test Characteristics (Study 1)

Moderator (k affirmative)                          β       k     R²    p
Problem type                                               189   .04   *
  Contains multiple choice (95)                    +.16
  Contains short answer (105)                      −.03
  Contains open-ended (36)                         −.09
Problem content                                            205   .04
  Contains numbers & operations (160)              +.02
  Contains algebra (54)                            +.10
  Contains geometry (90)                           +.14
  Contains measurement (38)                        −.08
  Contains data analysis & probability (29)        +.09
Problem depth of knowledge                                 120   .01
  Contains Level 1 (89)                            −.10
  Contains Level 2 (52)                            +.00
  Contains Level 3 or 4 (9)                        −.10
Test time limit                                            137   .02
  Yes (105)                                        +.12
Test curriculum focused                                    423   .01
  Yes (132)                                        −.08

Note. k = number of studies; d = effect size. Studies that provided insufficient information to code a particular moderator are omitted from that analysis. Therefore, k fluctuates between analyses, and the results of moderator analyses do not represent the full body of studies used to compute the overall mean effect size reported in Study 1. Positive betas denote increases in effect size (advantage for males) as the value of the predictor increases, whereas negative betas denote decreases in effect size (advantage for females) as the value of the predictor increases.
* p < .05.

Table 2 displays the analyses for variations in effect size as a function of characteristics of the sample. The magnitude of the gender difference varied significantly as a function of the selectivity of the sample, Q_b(3, 424) = 27.06, p < .001. For samples of the general population, d = +0.07, but d = +0.40 for highly selective samples.

Nationality was not a significant predictor of effect sizes, Q_b(7, 421) = 10.12, p = .18. All effects were small or negligible. Among U.S. studies, effect sizes varied as a function of ethnicity, Q_b(1, 89) = 10.00, p < .01. Samples composed mainly of Whites showed d = +0.13, whereas for ethnic minority samples, d = −0.05. Although we would have preferred to report results from these groups separately by ethnicity, the number of samples was too small to permit this.

The analysis also indicated that age was a significant moderator, Q_b(5, 423) = 44.75, p < .001. Gender differences were negligible in elementary school and middle school aged children and reached

predict effect size. This analysis indicated that publication year was not a significant predictor of effect size, Q_b(1, 427) = 1.05, p = .31.

A final analysis explored the interaction of age and depth of knowledge. Thus, our next analysis focused on both depth of knowledge and different age groups. This analysis was limited by the fact that many articles identified in our original search provided insufficient information to be able to code depth of knowledge, so they were not useable in this analysis. Furthermore, of those that had usable information about test content, only a small proportion of tests included items that tapped complex problem solving (Level 3 or 4). Some studies located in the literature search involved problem solving at Level 3 or 4 but reported only qualitative data on students’ approach to the problem, with no data on actual performance. Although these studies could not be included in the meta-analysis, they suggest the value of looking at more complex tasks. Table 3 provides a breakdown of effect sizes by age and depth of knowledge. Of particular interest in this analysis was whether a gender difference in complex problem solving would be seen among high school and college students, as was found in the 1990 meta-analysis by Hyde, Fennema, and Lamon. Our results showed that there was a small gender difference favoring male high school students on tests that included problems at Levels 3 or

Table 2
Summary of Meta-Analysis Results (Study 1), as Moderated by Sample Characteristics

Sample characteristic                    d       k     p
Ability                                                ***
  Low ability                            +0.07   15
  General ability                        +0.07   304
  Moderately selective                   +0.15   79
  Highly selective                       +0.40   27
Nationality
  U.S.                                   +0.10   226
  Canada                                 +0.01   13
  Central/South America & Mexico         −0.06   11
  Europe                                 +0.07   70
  Australia & New Zealand                +0.10   15
  Asia                                   +0.17   45
  Africa                                 +0.21   26
  Middle East                            +0.12   16
Ethnicity (U.S. samples)                               **
  Primarily European American            +0.13   58
  Primarily minority (combined)^a        −0.05   32
Age                                                    ***
  Preschool                              −0.15   3
  Elementary school                      +0.06   86
  Middle school                          −0.00   140
  High school                            +0.23   110
  College                                +0.18   78
  Adult                                  −0.07   7

Note. k = number of studies; d = effect size. Studies that provided insufficient information to code a particular moderator are omitted from that analysis. Therefore, k fluctuates between analyses, and the results of moderator analyses do not represent the full body of studies used to
a peak of d ⫽ ⫹0.23 in high school. The gender difference then compute the overall mean effect size reported in Study 1.
a
The number of samples for distinct U.S. ethnic groups was too small to
declined for college-aged samples and adults. analyze effects for each group separately, so all ethnic minority samples
To test for trends over time from 1990 to 2007, an analog to were combined.
ⴱⴱ
multiple regression was performed, using year of publication to p ⬍ .01. ⴱⴱⴱ p ⬍ .001.
Table 3
Mean Weighted Effect Sizes, by Highest Depth of Knowledge Tested and Age Group
Depth of knowledge Elementary Middle school High school College Adult
Level 1 ⫹0.03 (k ⫽ 21) ⫺0.02 (k ⫽ 16) ⫹0.21 (k ⫽ 9) ⫹0.24 (k ⫽ 11) ⫹0.17 (k ⫽ 3)

Level 2 ⫹0.12 (k ⫽ 5) ⫺0.07 (k ⫽ 16) ⫹0.37 (k ⫽ 16) ⫹0.12 (k ⫽ 8) ⫺0.28 (k ⫽ 1)
Level 3 or 4 ⫺0.10 (k ⫽ 1) ⫹0.16 (k ⫽ 3) ⫺0.11 (k ⫽ 5)
4 (d = +0.16), but the effect was reversed among college students (d = −0.11). However, these findings are based on small numbers of studies and therefore cannot be considered robust.
Gender differences in variability. A mixed-effects analysis of gender differences in variance was conducted in parallel to the effect size analyses reported above. The overall weighted VR, averaged over all studies, was VR = 1.07, indicating a slightly larger variance for males than for females. The residual variance component was .073.

Study 2

Method

Large United States data sets. Study 1 excluded articles that reported secondary data analyses from large national data sets, because original data from those studies were acquired directly for a separate analysis, which constitutes Study 2. Data sets were included in Study 2 if they (a) included relevant information about math performance, (b) represented data collected after 1990, (c) were nationally representative with a large sample size, and (d) provided statistics for both males and females. International data sets were excluded from Study 2 because they have been thoroughly reviewed elsewhere in the literature (see Else-Quest et al., 2010). The following large U.S. data sets were analyzed in Study 2: the NLSY97 (Bureau of Labor Statistics, n.d.), the NELS:88 (National Center for Education Statistics, n.d.-a), the LSAY (n.d.), and the NAEP (National Center for Education Statistics, n.d.-b).
The NLSY97 began data collection in 1997 and followed students each year until 2002. At the first assessment, 58.3% of participants were White, 27.0% were Black, 1.7% were Asian American, 0.8% were American Indian, and 12.4% did not report ethnicity. Math achievement was measured using the Peabody Individual Achievement Test–Revised (PIAT-R; Markwardt, 1998). During round one of data collection, the test was administered to all participants who were in the ninth grade or lower. During round two of data collection, the test was administered to all participants who had taken the test during round one, as well as those who were at least 12 years old on December 31, 1996. The PIAT-R consists of multiple choice items about three areas of math content: (a) foundations (i.e., number, size, and shape discretion), (b) basic facts (i.e., addition, subtraction, multiplication, division), and (c) applications (i.e., algebra, geometry, fractions, word problems, and numerical relationships). The PIAT-R math assessment begins with an age-appropriate question and increases or decreases in difficulty until the youth establishes a basal, that is, when the youth correctly answers five consecutive questions. Once a basal is reached, the questions increase in difficulty until a ceiling is reached when the youth incorrectly answers five of seven consecutive questions. The ceiling is then adjusted for incorrect responses given between the basal and the ceiling and is standardized with a mean of 100 and a standard deviation of 15. The current study determines math achievement using the standardized score of the PIAT-R math for each assessment.
The NELS:88 is a longitudinal study that began examining eighth graders in 1988 and followed these youth in the 10th and 12th grades in 1990 and 1992. At the first assessment, 67.0% of participants were White, 12.2% were Black, 12.7% were Latino, 6.3% were Asian American, and 1.6% did not report ethnicity. The NELS math assessment was developed by Educational Testing Service and consisted of multiple choice questions. Item content included arithmetic, algebra, geometry, data and probability, and advanced topics. One version of the test was administered at the base year, and three versions of the test were administered at the first and second follow-up. On the basis of their performance on the math test in the base year, students were divided into three groups (low, moderate, and high ability) and were assigned versions of the math test at the first and second follow-up in accordance with their ability. Each test, regardless of ability, assessed skill/knowledge, comprehension, and problem solving in all five content areas described above. In addition to the multiple choice test, a constructed response test was given to 12th graders at the 1992 assessment; this test involved items examining measurement, geometry, and data analysis.
The LSAY followed youth from 1987 to 1992 with assessments at each year from Grades 7 to 12. At the first assessment, 70% of participants were White, 11% were Black, 9.2% were Latino, 3.5% were Asian American, 1.5% were American Indian, and 4.8% did not report ethnicity. At each assessment students completed a multiple-choice math test that assessed skills in geometry, measurement, data analysis, algebra, and simple operations (for a complete list of problems, see http://lsay.msu.edu/instruments_006.html).
The NAEP math assessment was the only large U.S. database in the current study that was not longitudinal. It consisted of two different studies: the long-term trend assessment and the main assessment.
The NAEP long-term trend assessment was given every 4 years from 1992 to 2004 to students ages 9, 13, and 17. Ethnic diversity was different for each assessment and age group, but consisted of Whites (M = 74.07%, SD = 4.87), Blacks (M = 14.78%, SD = 1.31), Latinos (M = 7.82%, SD = 3.31), and those who did not report ethnicity (M = 3.29%, SD = 1.38). The math assessment in this study has remained virtually unchanged since its inception in 1978. It includes both multiple choice and short constructed response items that focus on math skills, including number operations, measurement, algebra, and geometry.
In contrast to the long-term trend data, the NAEP main assessment selected students by grade rather than by age. The main assessment was given to fourth and eighth graders every 2 years
from 1990 to 2007. Twelfth graders were included in 1990 through 2000. Ethnic diversity was different for each assessment and age group, but consisted of Whites (M = 67.07%, SD = 6.43), Blacks (M = 15.94%, SD = 0.99), Latinos (M = 11.89%, SD = 4.89), Asian Americans (M = 3.17%, SD = 1.66), American Indians (M = 0.94%, SD = 0.42), and those who did not report ethnicity (M = 3.29%, SD = 1.38). The math portion of the NAEP main assessment used multiple choice, short constructed response, and long constructed response items. In addition to the math skills assessed in the long-term trend analysis, the main analysis also included skills in data analysis and probability.
A unique aspect of the NAEP assessments is that performance scores are constructed via item response theory. Thus, they are not a direct reflection of the number of problems any student got right or wrong. Rather, the pattern of correct responses is used to construct a “plausible value” score that reflects the student’s overall understanding of mathematics. (See Mislevy, Johnson, and Muraki, 1992, for a further discussion of this approach and its advantages in estimating population values.) Our analyses are based on these plausible value scale scores, along with the corresponding weights and standard errors reported by NAEP.
Data analysis. Data analysis for Study 2 was similar to Study 1. All effect sizes were calculated, and a mixed effects model was used to determine whether the effect sizes within each data set were heterogeneous. If effect sizes were heterogeneous, a weighted ordinary least squares regression was applied to predict gender differences in math performance.
Moderating variables used in Study 2 were age, publication year, percentage of each type of problem (number sense, algebra, geometry, measurement), percentage of problems in each type of format (multiple choice, short answer, open ended), and percentage of each ethnic group in the sample. Similar to Study 1, VRs were also computed. More information about sample and test characteristics was available for the large data sets than was available for the studies uncovered in the literature reviews in Study 1. Therefore, with the exception of depth of knowledge, we were able to code moderator variables with more detail, and many moderators that were coded as categorical variables in Study 1 were considered continuous variables in Study 2. For example, Study 1 coded whether the tests used multiple choice, short answer, or open-ended questions, whereas Study 2 coded the percentage of questions in each format.

Results

Across all data sets in Study 2, the average weighted effect size was d = +0.07. The average weighted VR across all data sets was 1.09. The effect sizes were heterogeneous, Qt(55) = 393.04, p < .001, with a random effects variance component of .001. Differences among the national data sets were a significant source of heterogeneity, Qb(4, 55) = 43.12, p < .001; therefore we describe findings from each data set in turn.
Effect sizes for each assessment of the NLSY are presented in Table 4. The mean weighted effect size for all six assessments of the NLSY was d = +0.08. The average weighted VR was 1.05. Effect sizes for NLSY-97 were homogeneous, Qw(5) = 5.72, p = .33.
Effect sizes for each assessment of the NELS:88 are presented in Table 5. The average weighted effect size across all eight assessments was d = +0.10. The average weighted VR for the NELS:88 was 0.94. Effect sizes within the NELS:88 were heterogeneous, Qw(7) = 18.66, p < .01.
Effect sizes for each assessment of the LSAY are presented in Table 6. Results for the LSAY indicated small or negligible gender differences for each assessment. The average weighted effect size for all six assessments was d = −0.07. The weighted average VR was 1.26. Effect sizes were homogeneous, Qw(5) = 2.50, p = .78.
Effect sizes for each assessment of the NAEP are presented in Table 7. Results for NAEP indicated small or negligible gender differences at all grades. The average weighted effect size across all 18 assessments of the long-term trend data was d = +0.09. The average weighted effect size across all 18 of the main assessments was d = +0.06. The average weighted VR for the long-term trend data was 1.13, and for the main assessment was 1.04. Effect sizes for both the long-term trend data and the main assessment were homogeneous, Qw(17) = 16.05, p = .52, and Qw(17) = 12.88, p = .74, respectively.
The heterogeneity of the effect sizes across data sets indicates that these studies are not replications of each other but rather vary along some dimension(s). We therefore conducted additional moderation analyses to examine whether sample characteristics or test characteristics could explain the heterogeneity among effect sizes across data sets. These analyses were conducted using an analog to multiple regression in which each level of each moderator was entered as a separate predictor of the studies’ effect sizes. The resulting betas can be interpreted much like correlation coefficients, with positive values indicating an increase in effect size as
Table 4
Descriptive Statistics for the NLSY-97
Year Age Nmale Mmale Nfemale Mfemale d VR
1997 13 3,184 96.57 2,860 96.53 0.00 1.07
1998 14 835 97.80 767 96.27 +0.08 1.06
1999 15 767 97.44 747 95.15 +0.12 0.99
2000 16 779 96.75 737 95.21 +0.09 1.09
2001 17 748 98.30 713 95.89 +0.14 1.04
2002 18 226 96.20 180 92.29 +0.22 1.05
Note. NLSY-97 = National Longitudinal Surveys of Youth; Nmale = number of males in the sample; Nfemale = number of females in the sample; Mmale = mean for males; Mfemale = mean for females; d = effect size; VR = variance ratio.
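The d and VR columns in Tables 4 through 7 can be read against the definitions below, offered as a sketch rather than the authors' exact estimator. The sign conventions (positive d favors males; VR above 1.0 means larger male variance) come from the text. The pooled standard deviation s_p and the symbols s_male and s_female are assumptions, since this section does not restate the formula or any small-sample correction.

```latex
\[
d = \frac{M_{\text{male}} - M_{\text{female}}}{s_p},
\qquad
s_p = \sqrt{\frac{(N_{\text{male}} - 1)\,s_{\text{male}}^{2} + (N_{\text{female}} - 1)\,s_{\text{female}}^{2}}{N_{\text{male}} + N_{\text{female}} - 2}},
\qquad
\mathrm{VR} = \frac{s_{\text{male}}^{2}}{s_{\text{female}}^{2}}
\]
```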
Table 5
Descriptive Statistics for NELS:88
Year Grade Ability Question type Nmale Mmale Nfemale Mfemale d VR
1988 8 All MC 11,763 0.54 11,885 0.54 +0.00 1.00
1990 10 Low MC 1,570 0.45 1,629 0.42 +0.18 0.89
1990 10 Moderate MC 4,873 0.58 4,907 0.58 +0.00 0.88
1990 10 High MC 2,452 0.81 2,362 0.79 +0.13 0.88
1992 12 Low MC 1,293 0.49 1,261 0.46 +0.15 0.90
1992 12 Moderate MC 3,746 0.56 3,971 0.54 +0.09 0.91
1992 12 High MC 2,087 0.64 1,878 0.61 +0.12 0.92
1992 12 All OE 1,141 11.87 1,091 10.74 +0.21 1.20
Note. NELS:88 = National Education Longitudinal Study of 1988; MC = multiple choice; OE = open ended; Nmale = number of males in the sample; Nfemale = number of females in the sample; Mmale = mean for males; Mfemale = mean for females; d = effect size; VR = variance ratio.
the value of the moderator increases (relative advantage for males) and negative values indicating a decrease in effect size as the value of the moderator increases (relative advantage for females).
As seen in Table 8, two aspects of the tests accounted for heterogeneity among studies: problem type and mathematical content. With regard to problem type, tests with a higher proportion of multiple choice and open-ended items yielded smaller gender effect sizes, whereas tests with a higher proportion of short answer items yielded larger gender effect sizes. This finding surprised us, given that three of the big data sets (LSAY, NLSY, NELS:88) were 100% multiple choice, and only one of them had a negative overall effect size; if multiple choice items conferred a significant female advantage, we might have expected negative effects across all three of those studies. Therefore, we conducted an additional analysis looking just at the 36 NAEP effect sizes, which have variation in the proportion of multiple choice, short answer, and open-ended questions. When looking at just the NAEP effect sizes, we found a different pattern of results, such that males did better on tests with a greater proportion of multiple choice items (β = +.29), and females did better on tests with a greater proportion of short answer and open-ended items (βs = −.32 and −.19, respectively). Thus, problem type had a similar effect on gender differences in the NAEP as was found in Study 1.
With regard to mathematical content, tests with a higher proportion of algebra items yielded smaller effect sizes (females performed relatively better), and tests with a higher proportion of measurement items yielded larger effect sizes (males performed relatively better). The other three types of mathematical content were not significant predictors of effect size in this analysis.
With regard to depth of knowledge, a marginally significant trend emerged such that tests containing items at Level 3 or 4 yielded larger effect sizes (males performed better or females performed worse). All of the tests contained items at Levels 1 and 2, and therefore we were not able to examine the specific effects of items at those levels.
The ethnic composition of the samples did not have an effect on the magnitude of the gender difference in mathematics performance. However, age was a marginally significant predictor of effect size (p = .0516), with older samples yielding relatively larger gender differences favoring males.

Discussion

We proposed to answer six questions with these meta-analyses. We take up each question in turn.
First, what is the magnitude of the gender difference in mathematics performance, based on contemporary studies? Taking Study 1 and Study 2 together, the answer appears to be that there is no longer a gender difference in mathematics performance. For Study 1, d values averaged +0.05 on the basis of data from 1,286,350 persons. For Study 2, d values averaged +0.07 on the basis of data from 1,309,587 persons. These results are consistent with a recent analysis of U.S. data from state assessments of youth in Grades 2 through 11, which found that girls had reached parity with boys in math performance (Hyde et al., 2008).
Second, does the direction or magnitude of the gender difference vary as a function of the depth of knowledge tapped by the test? By itself, depth of knowledge was not a significant predictor of
Table 6
Descriptive Statistics for LSAY
Year Grade Nmale Mmale Nfemale Mfemale d VR
1987 7 1,593 28.66 1,472 29.41 −0.07 1.14
1988 8 1,426 26.48 1,323 27.83 −0.15 1.28
1989 9 1,246 34.20 1,189 34.81 −0.04 1.29
1990 10 1,152 39.42 1,112 40.21 −0.05 1.34
1991 11 934 42.15 898 42.83 −0.04 1.33
1992 12 755 29.57 712 30.30 −0.07 1.21
Note. LSAY = Longitudinal Study of American Youth; Nmale = number of males in the sample; Nfemale = number of females in the sample; Mmale = mean for males; Mfemale = mean for females; d = effect size; VR = variance ratio.
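The weighted average VR reported for each data set summarizes the per-assessment VR column. The Python sketch below shows one way such an average could be formed from the LSAY rows of Table 6. It rests on two assumptions that are not taken from the article: ratios are averaged on the log scale, and each assessment is weighted by its total sample size. The function name `mean_variance_ratio` is hypothetical, and the result need not reproduce the published value.

```python
import math

def mean_variance_ratio(rows):
    """Average variance ratios on the log scale.

    rows: iterable of (n_male, n_female, vr) tuples.
    Assumption (not from the article): weight each row by total N.
    """
    total_weight = 0.0
    weighted_log_sum = 0.0
    for n_male, n_female, vr in rows:
        weight = n_male + n_female
        weighted_log_sum += weight * math.log(vr)
        total_weight += weight
    # Back-transform the weighted mean of ln(VR) to the ratio scale.
    return math.exp(weighted_log_sum / total_weight)

# (Nmale, Nfemale, VR) for the six LSAY assessments in Table 6.
lsay = [
    (1593, 1472, 1.14),
    (1426, 1323, 1.28),
    (1246, 1189, 1.29),
    (1152, 1112, 1.34),
    (934, 898, 1.33),
    (755, 712, 1.21),
]

lsay_mean_vr = mean_variance_ratio(lsay)
print(f"LSAY mean VR (log scale, N-weighted): {lsay_mean_vr:.2f}")
```

Averaging on the log scale treats a VR of 2.0 and a VR of 0.5 as equal and opposite departures from equal variances, which a plain arithmetic mean would not.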
Table 7
Descriptive Statistics for NAEP
Assessment Year Age Grade Nmale Mmale Nfemale Mfemale d VR
LTT 1990 9 N/A 3,038 229 3,162 230 −0.03 1.06
LTT 1992 9 N/A 3,577 231 3,723 228 +0.09 1.06
LTT 1994 9 N/A 2,793 232 2,907 230 +0.06 1.06
LTT 1996 9 N/A 2,700 233 2,700 229 +0.12 1.12
LTT 1999 9 N/A 2,940 233 3,060 231 +0.06 1.12
LTT 2004 9 N/A 2,548 243 2,654 240 +0.09 1.06
LTT 1990 13 N/A 3,300 271 3,300 270 +0.03 1.14
LTT 1992 13 N/A 2,950 274 2,950 272 +0.06 1.14
LTT 1994 13 N/A 2,989 276 3,111 273 +0.09 1.20
LTT 1996 13 N/A 2,736 276 2,964 272 +0.13 1.07
LTT 1999 13 N/A 2,950 277 2,950 274 +0.09 1.13
LTT 2004 13 N/A 2,736 283 2,964 279 +0.12 1.27
LTT 1990 17 N/A 2,156 306 2,244 303 +0.09 1.14
LTT 1992 17 N/A 2,244 309 2,156 305 +0.13 1.14
LTT 1994 17 N/A 1,862 309 1,938 304 +0.17 1.14
LTT 1996 17 N/A 1,750 310 1,750 305 +0.17 1.14
LTT 1999 17 N/A 1,824 310 1,938 307 +0.10 1.14
LTT 2004 17 N/A 1,824 308 1,938 305 +0.10 1.23
MA 1990 N/A 4 1,780 214 1,643 213 +0.03 1.13
MA 1992 N/A 4 3,588 221 3,588 219 +0.06 1.13
MA 1996 N/A 4 3,380 224 3,248 223 +0.03 1.14
MA 2000 N/A 4 7,066 227 6,789 224 +0.10 1.14
MA 2003 N/A 4 96,951 236 93,149 233 +0.11 1.07
MA 2005 N/A 4 87,720 239 84,280 237 +0.07 1.07
MA 2007 N/A 4 102,000 241 102,000 239 +0.07 1.07
MA 1990 N/A 8 1,750 263 1,681 262 +0.03 1.12
MA 1992 N/A 8 3,908 268 3,755 269 −0.03 1.06
MA 1996 N/A 8 3,715 271 3,430 269 +0.05 1.05
MA 2000 N/A 8 8,124 274 7,806 272 +0.05 1.11
MA 2003 N/A 8 76,600 278 76,600 277 +0.03 1.12
MA 2005 N/A 8 81,000 280 81,000 278 +0.06 1.12
MA 2007 N/A 8 76,500 282 76,500 280 +0.06 1.12
MA 1990 N/A 12 1,537 297 1,600 291 +0.17 1.06
MA 1992 N/A 12 3,486 301 3,486 298 +0.09 1.12
MA 1996 N/A 12 3,314 303 3,590 300 +0.09 1.12
MA 2000 N/A 12 6,716 302 6,716 299 +0.08 1.18
Note. NAEP = National Assessment of Educational Progress; LTT = long-term trend data; MA = NAEP main assessment; N/A = long-term trend data assessed youth by age, but the main assessment assessed youth by grade level; Nmale = number of males in the sample; Nfemale = number of females in the sample; Mmale = mean for males; Mfemale = mean for females; d = effect size; VR = variance ratio.
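The weighted mean effect sizes and homogeneity statistics (Q) reported in the Results rest on weighting each effect size by the inverse of its sampling variance. A minimal Python sketch of that calculation follows, using the age-17 long-term trend rows of Table 7. It assumes a fixed-effect approximation and the usual large-sample variance of d. The function names `d_variance` and `weighted_mean_d` are hypothetical, and the authors' mixed-effects model and NAEP sampling weights are not reproduced, so the output need not match the published figures.

```python
def d_variance(n_male, n_female, d):
    """Large-sample approximation to the sampling variance of d."""
    n_total = n_male + n_female
    return n_total / (n_male * n_female) + d ** 2 / (2 * n_total)

def weighted_mean_d(rows):
    """Inverse-variance weighted mean of d, with homogeneity statistic Q.

    rows: iterable of (n_male, n_female, d) tuples.
    Returns (mean_d, q, degrees_of_freedom).
    """
    rows = list(rows)
    weights = [1.0 / d_variance(nm, nf, d) for nm, nf, d in rows]
    effects = [d for _, _, d in rows]
    mean_d = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared deviations of each d from the weighted mean.
    q = sum(w * (d - mean_d) ** 2 for w, d in zip(weights, effects))
    return mean_d, q, len(effects) - 1

# (Nmale, Nfemale, d) for the NAEP long-term trend, age 17 (Table 7).
ltt_age17 = [
    (2156, 2244, 0.09),
    (2244, 2156, 0.13),
    (1862, 1938, 0.17),
    (1750, 1750, 0.17),
    (1824, 1938, 0.10),
    (1824, 1938, 0.10),
]

mean_d, q, df = weighted_mean_d(ltt_age17)
print(f"weighted mean d = {mean_d:+.3f}, Q({df}) = {q:.2f}")
```

Under the homogeneity hypothesis, Q is compared against a chi-square distribution with the returned degrees of freedom; a large Q relative to that reference indicates that the effect sizes vary more than sampling error alone would predict.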
differences in effect sizes in Study 1. However, Study 2 indicated that there may be a modest effect of depth of knowledge on gender differences in mathematics performance, with tests containing a greater proportion of items at Level 3 or 4 favoring males. These results are inconsistent with a recent analysis of NAEP data, examining gender differences for items categorized as difficult by NAEP and as at Level 3 or 4 depth of knowledge; at 12th grade, the average d = +0.07, or a negligible difference (Hyde et al., 2008). However, an examination of depth of knowledge and age simultaneously in Study 1 (see Table 3) indicates a male advantage (d = +0.16) in Level 3 or 4 problems in high school, a finding that is consistent with the earlier meta-analysis by Hyde, Fennema, and Lamon (1990). This finding, however, is based on only three studies, so it should be interpreted with caution. Very few studies used items requiring this greater depth of knowledge, yet it is precisely the skill that is required for high-level STEM careers.
Third, developmentally, at what ages do gender differences appear or disappear? Consistent with previous meta-analyses, are gender differences larger in high school than in elementary or middle school? The data sets reviewed in Study 2 showed a marginally significant increase in effect sizes as age increased (β = +.24). This is consistent with the results of Study 1, which found gender differences close to 0 for elementary and middle school students and small effects favoring male high-school and college students (ds = +0.23 and +0.18, respectively). These results, too, are inconsistent with the Hyde et al. (2008) analysis of data from state assessments, which showed no gender difference in performance at any grade level through Grade 11. Again, though, it is important to consider age and depth of knowledge required by the test simultaneously.
Overall, we conclude that a small gender difference favoring boys in complex problem solving is still present in high school. Multiple factors may account for this gender gap. As noted earlier, girls are less likely to take physics than boys are, and complex problem solving is taught in physics classes, perhaps even more than in math classes. Gender differences in patterns of interest may play a role (Su, Rounds, & Armstrong, 2009), although these patterns, too, are shaped by culture. Moreover, even in very recent studies, parents and teachers give higher ability estimates to boys than to girls (Lindberg et al., 2008), and the effects of parents’ and
Table 8
Weighted Ordinary Least Squares Regressions Predicting Gender Differences in Math Performance for all Large U.S. Datasets (Study 2)
Predictor β R2 p
Problem type
  % Multiple choice −.27 .07 *
  % Short answer +.29 .09 *
  % Open ended −.05 .00
Mathematical content
  % Numbers and operations +.07 .00
  % Algebra −.32 .10 *
  % Geometry −.19 .04
  % Measurement +.40 .16 **
  % Data analysis & probability −.18 .03
Depth of knowledge
  Contains Level 3 +.22 .05 †
  Contains Level 4 +.21 .04 †
Age +.24 .06 †
Ethnicity
  % European American −.19 .03
  % African American +.19 .04
  % Hispanic +.12 .01
  % Asian American −.01 .00
  % American Indian −.14 .02
Note. Positive betas denote increases in effect size (advantage for males) as the value of the predictor increases, whereas negative betas denote decreases in effect size (advantage for females) as the value of the predictor increases.
† p < .10. * p < .05. ** p < .01.

teachers’ expectations on children’s estimates of their own ability and their course choices are well documented (Eccles, 1994; Jacobs, Davis-Kean, Bleeker, Eccles, & Malanchuk, 2005).
Fourth, are there variations across U.S. ethnic groups, or across nations, in the direction or magnitude of the gender difference? In regard to ethnicity, Table 2 shows that, for Study 1, a small gender difference was found favoring males among Whites, d = +0.13, but for all ethnic minorities combined, d = −0.05. This result is similar to the one found in the 1990 meta-analysis (Hyde, Fennema, & Lamon, 1990), which found a small gender difference favoring males among Whites, but no difference for ethnic minority groups. Table 2 also shows variation in effect sizes according to nationality or region of the world. The largest gender difference favoring males was found in studies from Africa, d = +0.21, but even this difference is small. The largest difference favoring females was found in Central/South America and Mexico, d = −0.06. These variations in the magnitude and direction of the gender difference in math performance are consistent with those found in analyses of international data sets such as PISA and TIMSS. These other studies have, in addition, found that values of d for nations correlate significantly with measures of gender inequality for those nations (e.g., Else-Quest et al., 2010; Guiso et al., 2008; Penner, 2008).
Fifth, has the magnitude of gender differences in mathematics performance declined from 1990 to 2007? Study 1 found no relation between year of publication and effect sizes, indicating no discernible trend over time toward smaller gender differences. This may be because even in 1990, gender differences were already small (Hyde, Fennema, & Lamon, 1990), leaving little room for further decline.
Sixth, do males display greater variance in scores and, if so, by how much? The overall VR in Study 1 was 1.07. That is, males displayed a somewhat larger variance, but the VR was not far from 1.0, or equal variances. In Study 2, the average VR was 1.09, again not far from 1.0. In addition, the NELS:88 data (see Table 5) show several VRs that are < 1.0, indicating that greater male variability is not ubiquitous. VRs less than 1.0 have also been found in some national and international data sets (Hyde et al., 2008; Hyde & Mertz, 2009).
Overall, to put these findings in a broader context, gender can be conceptualized as one of many predictors of mathematics performance. Other factors include socioeconomic status (SES), parents’ education, and the quality of schooling. Melhuish et al. (2008) compared the effect sizes of nine predictors of children’s mathematics performance at age 10: birth weight, gender, SES, mother’s education, father’s education, family income, quality of the home learning environment, preschool effectiveness, and elementary school effectiveness. The striking finding was that gender was the weakest of these nine predictors (i.e., it had the smallest effect size). Mother’s education, quality of the home learning environment, and elementary school effectiveness were far stronger predictors. Our findings are consistent with those of Melhuish et al.: gender is not a strong predictor of mathematics performance.

Implications

Overall, the results of these two studies provide strong evidence of gender similarities in mathematics performance. The heterogeneity of the findings suggests that there are moderator variables that might clarify the pattern of effect sizes. Detecting consistent moderators of gender differences would be strengthened by measures that tap the full range of mathematical reasoning, including items that require sustained reasoning about complex problems. The existence and magnitude of gender differences in mathematics performance vary as a function of many factors, including nation, ethnicity, and age.
These findings have several policy implications. First, these findings call into question current trends toward single-sex math classrooms. Advocates of single-sex education base their argument in part on the assumption that girls lag behind boys in mathematics performance and need to be in a protected, all-girls environment to be able to learn math (e.g., Streitmatter, 1999). The data, however, show that girls are performing as well as boys in mathematics, on the basis of 242 separate studies (Study 1) and four large, well-sampled national U.S. data sets (Study 2). The great majority of these girls and boys did their learning in coeducational classrooms. Thus, the argument that girls’ mathematics performance suffers in gender-integrated classrooms simply is not supported by the data. If we wish to improve students’ mathematics performance, we would do better to focus not on gender but on factors that have larger effects, such as the quality and implementation of the curriculum (Tarr et al., 2008), as well as the quality of the elementary school and the quality of the home learning environment (Melhuish et al., 2008).
Second, the dearth of Level 3 or 4 items in assessments has a serious consequence. Given the importance of mathematics tests for school evaluation under the No Child Left Behind legislation,
it is common for teachers to teach to the test (Au, 2007). If the test fails to emphasize the skills that citizens need, American students are disadvantaged. In addition, without evidence concerning student progress on these important forms of mathematical reasoning, teachers, administrators, and policy makers cannot determine which curriculum materials or teaching strategies contribute to mathematical proficiency. Finally, tests that fail to emphasize complex problem solving or sustained reasoning communicate an inaccurate picture of mathematics to students.
These findings also have implications for dispelling stereotypes. Overall, it is clear that in the United States and some other nations, girls have reached parity with boys in mathematics performance. It is crucial that this information be made widely known to counteract stereotypes about female math inferiority held by gatekeepers such as parents and teachers and by students themselves.

References

For a complete list of studies included in the meta-analysis in Study 1, go to http://dx.doi.org/10.1037/a0021276.supp

Ambady, N., Shih, M., Kim, A., & Pittinsky, T. L. (2001). Stereotype susceptibility in children: Effects of identity activation on quantitative performance. Psychological Science, 12, 385–390. doi:10.1111/1467-9280.00371
Anastasi, A. (1958). Differential psychology (3rd ed.). New York, NY: Macmillan.
Arms, E. (2007). Gender equity in coeducational and single-sex environments. In S. Klein (Ed.), Handbook for achieving equity through education (pp. 171–190). Mahwah, NJ: Erlbaum.
Au, W. (2007). High-stakes testing and curricular control: A qualitative metasynthesis. Educational Researcher, 36, 258–267. doi:10.3102/0013189X07306523
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Englewood Cliffs, NJ: Prentice-Hall.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York, NY:
meta-analysis. Psychological Bulletin, 136, 103–127. doi:10.1037/a0018053
Feingold, A. (1992). Sex differences in variability in intellectual abilities: A new look at an old controversy. Review of Educational Research, 62, 61–84.
Fennema, E., & Sherman, J. (1977). Sex-related differences in mathematics achievement, spatial visualization, and affective factors. American Educational Research Journal, 14, 51–71.
Fredricks, J. A., & Eccles, J. S. (2002). Children’s competence and value beliefs from childhood through adolescence: Growth trajectories in two male-sex-typed domains. Developmental Psychology, 38, 519–533. doi:10.1037/0012-1649.38.4.519
Frome, P. M., & Eccles, J. S. (1998). Parents’ influence on children’s achievement-related perceptions. Journal of Personality and Social Psychology, 74, 435–452. doi:10.1037/0022-3514.74.2.435
Furnham, A., Reeves, E., & Budhani, S. (2002). Parents think their sons are brighter than their daughters: Sex differences in parental self-estimations and estimations of their children’s multiple intelligences. Journal of Genetic Psychology, 163, 24–39. doi:10.1080/00221320209597966
Guiso, L., Monte, F., Sapienza, P., & Zingales, L. (2008, May 30). Culture, gender, and math. Science, 320, 1164–1165. doi:10.1126/science.1154094
Hedges, L. V. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128. doi:10.2307/1164588
Hedges, L. V., & Friedman, L. (1993). Gender differences in variability in intellectual abilities: A reanalysis of Feingold’s results. Review of Educational Research, 63, 94–105.
Hedges, L. V., & Nowell, A. (1995, July 7). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 269, 41–45. doi:10.1126/science.7604277
Hedges, L. V., & Vevea, J. L. (1998). Fixed- and random-effects models in meta-analysis. Psychological Methods, 3, 486–504. doi:10.1037/1082-989X.3.4.486
Helwig, R., Anderson, L., & Tindal, G. (2001). Influence of elementary student gender on teachers’ perceptions of mathematics achievement.
Freeman. Journal of Educational Research, 95, 93–102. doi:10.1080/
Ben-Zeev, T., Fein, S., & Inzlicht, M. (2005). Arousal and stereotype 00220670109596577
threat. Journal of Experimental Social Psychology, 41, 174 –181. doi: Hyde, J. S., Fennema, E., & Lamon, S. (1990). Gender differences in
10.1016/j.jesp.2003.11.007 mathematics performance: A meta-analysis. Psychological Bulletin,
Bouchey, H. A., & Harter, S. (2005). Reflected appraisals, academic 107, 139 –155. doi:10.1037/0033-2909.107.2.139
self-perceptions, and math/science performance during early adoles- Hyde, J. S., Fennema, E., Ryan, M., Frost, L. A., & Hopp, C. (1990). Gender
cence. Journal of Educational Psychology, 97, 673– 686. doi:10.1037/ comparisons of mathematics attitudes and affect. Psychology of Women
0022-0663.97.4.673 Quarterly, 14, 299 –324. doi:10.1111/j.1471-6402.1990.tb00022.x
Bureau of Labor Statistics. (n.d.). National Longitudinal Surveys: The Hyde, J. S., Lindberg, S. M., Linn, M. C., Ellis, A. B., & Williams, C. C.
NLSY 97. Retrieved from http://www.bls.gov/nls/nlsy97.htm (2008, July 25). Gender similarities characterize math performance.
Bussey, K., & Bandura, A. (1999). Social cognitive theory of gender Science, 321, 494 – 495. doi:10.1126/science.1160364
development and differentiation. Psychological Review, 106, 676 –713. Hyde, J. S., & Mertz, J. E. (2009). Gender, culture, and mathematics
doi:10.1037/0033-295X.106.4.676 performance. Proceedings of the National Academy of Sciences, USA,
Cadinu, M., Maass, A., Rosabianca, A., & Kiesner, J. (2005). Why do 106, 8801– 8807. doi:10.1073/pnas.0901265106
women underperform under stereotype threat? Evidence for the role of Inzlicht, M., & Ben-Zeev, T. (2000). A threatening intellectual environ-
negative thinking. Psychological Science, 16, 572–578. doi:10.1111/ ment: Why females are susceptible to experiencing problem-solving
j.0956-7976.2005.01577.x deficits in the presence of males. Psychological Science, 11, 365–371.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. doi:10.1111/1467-9280.00272
Hillsdale, NJ: Erlbaum. Jacobs, J., Davis-Kean, P., Bleeker, M., Eccles, J., & Malanchuk, O.
Dwyer, C. A., & Johnson, L. M. (1997). Grades, accomplishments and (2005). “I can, but I don’t want to”: The impact of parents, interests, and
correlates. In W. A. Willingham & N. S. Cole (Eds.), Gender and fair activities on gender differences in math. In A. Gallagher & J. Kaufman
assessment (pp. 127–156). Mahwah, NJ: Erlbaum. (Eds.), Gender differences in mathematics: An integrative psychological
Eccles, J. S. (1994). Understanding women’s educational and occupational approach (pp. 73–98). New York, NY: Cambridge University Press.
choices: Applying the Eccles et al. model of achievement-related Johns, M., Schmader, T., & Martens, A. (2005). Knowing is half the battle:
choices. Psychology of Women Quarterly, 18, 585– 609. doi:10.1111/ Teaching stereotype threat as a means of improving women’s math
j.1471-6402.1994.tb01049.x performance. Psychological Science, 16, 175–179. doi:10.1111/j.0956-
Else-Quest, N. M., Hyde, J. S., & Linn, M. C. (2010). Cross-national 7976.2005.00799.x
patterns of gender differences in mathematics and gender equity: A Katzman, S., & Alliger, G. M. (1992). Averaging untransformed variance
ratios can be misleading: A comment on Feingold. Review of Educa- National Science Foundation. (2008b). Thirty-three years of women in
tional Research, 62, 427– 428. S&E faculty positions (NSF InfoBrief No. 08-308). Retrieved from
Keller, C. (2001). Effect of teachers’ stereotyping on students’ stereotyping http://www.nsf.gov/statistics/infbrief/nsf08308/nsf08308.pdf
of mathematics as a male domain. The Journal of Social Psychology, National Science Foundation. (2008c). Science and engineering indicators
141, 165–173. doi:10.1080/00224540109600544 2008. Retrieved from http://www.nsf.gov/statistics/seind08
Kenney-Benson, G. A., Pomerantz, E., Ryan, A., & Patrick, H. (2006). Sex Penner, A. J. (2008). Gender differences in extreme mathematical achieve-
differences in math performance: The role of children’s approach to ment: An international perspective on biological and social factors.
schoolwork. Developmental Psychology, 42, 11–26. doi:10.1037/0012- American Journal of Sociology, 114, S138 –S170. doi:10.1086/589252
1649.42.1.11 Quinn, D. N., & Spencer, S. J. (2001). The interference of stereotype threat
Kiefer, A. K., & Sekaquaptewa, D. (2007). Implicit stereotypes, gender with women’s generation of mathematical problem-solving strategies.
identification, and math-related outcomes: A prospective study of female Journal of Social Issues, 57, 55–71. doi:10.1111/0022-4537.00201
college students. Psychological Science, 18, 13–18. doi:10.1111/j.1467- Shields, S. A. (1982). The variability hypothesis: The history of a biolog-
9280.2007.01841.x ical model of sex differences in intelligence. Signs: Journal of Women in
Kimball, M. M. (1989). A new perspective on women’s math achievement. Culture and Society, 7, 769 –797. doi:10.1086/493921
Psychological Bulletin, 105, 198 –214. doi:10.1037/0033-2909.105.2.198 Spencer, S. J., Steele, C. M., & Quinn, D. M. (1999). Stereotype threat and
Li, Q. (1999). Teachers’ beliefs and gender differences in mathematics: A women’s math performance. Journal of Experimental Social Psychol-
review. Educational Research, 41, 63–76. ogy, 35, 4 –28. doi:10.1006/jesp.1998.1373
Lindberg, S. M., Hyde, J. S., & Hirsch, L. M. (2008). Gender and mother– Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual
child interactions during mathematics homework. Merrill-Palmer Quar- identity and performance. American Psychologist, 52, 613– 629. doi:
terly, 54, 232–255. doi:10.1353/mpq.2008.0017 10.1037/0003-066X.52.6.613
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual
Oaks, CA: Sage. test performance of African Americans. Journal of Personality and
Longitudinal Study of American Youth [Data set]. (n.d.). Retrieved January Social Psychology, 69, 797– 811. doi:10.1037/0022-3514.69.5.797
29, 2009, from http://lsay.msu.edu/ Steele, J. (2003). Children’s gender stereotypes about math: The role of
Markwardt, F. C., Jr. (1998). Peabody Individual Achievement Test– stereotype stratification. Journal of Applied Social Psychology, 33,
Revised. Circle Pines, MN: American Guidance Service. 2587–2606. doi:10.1111/j.1559-1816.2003.tb02782.x
Meece, J. L., Eccles Parsons, J., Kaczala, C. M., & Goff, S. B. (1982). Sex Streitmatter, J. L. (1999). For girls only: Making a case for single-sex
differences in math achievement: Toward a model of academic choice. schooling. Albany, NY: State University of New York Press.
Psychological Bulletin, 91, 324 –348. doi:10.1037/0033-2909.91.2.324 Su, R., Rounds, J., & Armstrong, P. I. (2009). Men and things, women and
Melhuish, E. C., Sylva, K., Sammons, P., Siraj-Blatchford, I., Taggart, B., people: A meta-analysis of sex differences in interests. Psychological
Phan, M. B., & Malin, A. (2008, August 29). The early years: Preschool Bulletin, 135, 859 – 884.
influences on mathematics achievement. Science, 321, 1161–1162. doi: Tarr, J. E., Reys, R. E., Reys, B. J., Chávez, Ó., Shih, J., & Osterlind, S. J.
10.1126/science.1158808 (2008). The impact of middle-grades mathematics curricula and the
Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992). Scaling procedures in classroom learning environment on student achievement. Journal for
NAEP. Journal of Educational Statistics, 17, 131–154. doi:10.2307/ Research in Mathematics Education 39, 247–280.
1165166 Tiedemann, J. (2000). Parents’ gender stereotypes and teachers’ beliefs as
National Center for Education Statistics. (n.d.-a). National Education Lon- predictors of children’s concept of their mathematical ability in elemen-
gitudinal Study of 1988 (NELS88). Retrieved January 29, 2009, from tary school. Journal of Educational Psychology, 92, 144 –151. doi:
http://nces.ed.gov/surveys/NELS88/ 10.1037/0022-0663.92.1.144
National Center for Education Statistics. (n.d.-b). The Nation’s Report Watt, H. M. G. (2004). Development of adolescents’ self-perceptions,
Card: Mathematics. Retrieved January 29, 2009, from http:// values, and task perceptions according to gender and domain in 7th-
nces.ed.gov/nationsreportcard/mathematics/ through 11th-grade Australian students. Child Development, 75, 1556 –
Nosek, B. A., Banaji, M. R., & Greenwald, A. G. (2002). Math ⫽ male, 1574. doi:10.1111/j.1467-8624.2004.00757.x
me ⫽ female, therefore math ⫽ me. Journal of Personality and Social Webb, N. L. (1999). Alignment of science and mathematics standards and
Psychology, 83, 44 –59. doi:10.1037/0022-3514.83.1.44 assessments in four states (Research Monograph No. 18). Washington,
Nosek, B. A., Smyth, F. L., Sriram, N., Lindner, N. M., Devos, T., Ayala, DC: Council of Chief State School Officers.
A., . . . Greenwald, A. G. (2009). National differences in gender–science Wilson, D. B. (2005). Meta-analysis macros for SAS, SPSS, and Stata
stereotypes predict national sex differences in science and math achieve- [Computer software]. Retrieved from http://mason.gmu.edu/⬃dwilsonb/
ment. Proceeding of the National Academy of Sciences, 106, 10593– ma.html
10597. doi:10.1073/pnas.0809921106
National Science Foundation. (2008a). Women, minorities, and persons Received November 23, 2009
with disabilities in science and engineering. Retrieved May 26, 2008, Revision received June 9, 2010
from www.nsf.gov/statistics/wmpd Accepted July 20, 2010 䡲


