Methodological Considerations in Using Complex Survey Data: An Applied Example With the Head Start Family and Child Experiences Survey

Evaluation Review 35(3) 269-303
© The Author(s) 2011
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0193841X11412071
http://erx.sagepub.com
Debbie L. Hahs-Vaughn1,
Christine M. McWayne2,
Rebecca J. Bulotsky-Shearer3,
Xiaoli Wen4, and Ann-Marie Faria5
Abstract
Complex survey data are collected by means other than simple random
samples. This creates two analytical issues: nonindependence and unequal selection probabilities.
1 Department of Educational Research, University of Central Florida, Orlando, FL, USA
2 Tufts University, Boston, MA, USA
3 University of Miami, Miami, FL, USA
4 National-Louis University, Chicago, IL, USA
5 American Institutes for Research, Washington, DC, USA
Corresponding Author:
Debbie L. Hahs-Vaughn, Educational and Human Sciences, University of Central Florida, PO
Box 161250, Orlando, FL 32816, USA
Email: debbie.hahs-vaughn@ucf.edu
Keywords
complex samples, survey weights, sampling weights, survey research,
multilevel modeling
Widely available and easily accessible, national and international data sets,
many of which have an educational focus, are a valuable resource that can
be used for evaluating programs and interventions. For example, Hong and
Raudenbush (2005) used the Early Childhood Longitudinal Study-Kindergarten Class of 1998-1999, a national database available through the
U.S. Department of Education National Center for Education Statistics, to
examine the effects of kindergarten retention on performance in reading and
mathematics. As another example, Hindman et al. (2010) used the 1997
Head Start Family and Child Experiences Survey (FACES) to analyze the
relationship between child, family, and classroom variables during Head
Start and growth in literacy and mathematics through first grade. It is
important that researchers who use these data sets approach the analyses
with a full understanding of the way in which these data are collected, especially
when the findings may be used to inform national policies or educational
interventions. Complex designs arise, specifically, from the lack of a sampling frame from which
participants can be randomly selected (e.g., no list exists of everyone in
the population) and from the need to ensure sufficient sample sizes for subgroups
of interest (e.g., underrepresented groups; Stapleton and Thomas
2008). As a result of the data collection process, these data sets are termed
complex samples.
Complex survey data are data that are collected by means other than sim-
ple random samples. Often, complex sampling designs include stratified
multistage cluster sampling that creates nonindependence among units
along with disproportionate sampling where some groups may be over-
sampled or probability proportional to size sampling has been applied. This
then creates two primary issues that must be addressed when analyzing
complex survey data: (a) homogeneity that is created due to the nonsimple
random sample (i.e., nonindependence) and (b) disproportionate sampling
that results in unequal selection probabilities (e.g., oversampling or adjustment
for nonresponse; Brick, Morganstein, and Valliant 2000; Lee, Forthofer,
and Lorimor 1989; Skinner, Holt, and Smith 1989). Failing to address these
issues results in incorrectly estimated standard errors (often underestimated,
which translates into an increased probability of a Type I error; in other
words, results that suggest statistical significance when in reality the effects are not significant)
and biased parameter estimates (Kish 1992; Korn and Graubard 1995; Landis
et al. 1982; Brogan 1998; Hahs-Vaughn 2005, 2006a, 2006b; Kaplan and
Ferguson 1999; Stapleton 2002; Kalton 1983a; DuMouchel and Duncan
1983). Although accurate variances can be estimated in multilevel models
(a model-based approach), unequal selection probabilities at any level
within the hierarchical structure can produce biased parameter estimates
(Pfeffermann et al. 1998).
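The first of these issues can be illustrated with a small simulation. The following sketch (cluster structure and all numbers invented for illustration) shows how intra-cluster homogeneity inflates the true sampling variance of a mean relative to the naive simple-random-sample formula, producing a design effect well above 1:

```python
import random

random.seed(1)

def cluster_sample_mean_and_naive_var(n_clusters=20, m=10):
    """Draw one cluster sample and return its mean plus the naive
    SRS variance estimate s^2/n, which ignores the clustering."""
    ys = []
    for _ in range(n_clusters):
        b = random.gauss(0, 1.0)                      # shared cluster effect
        ys += [b + random.gauss(0, 1.0) for _ in range(m)]
    n = len(ys)
    mean = sum(ys) / n
    s2 = sum((y - mean) ** 2 for y in ys) / (n - 1)
    return mean, s2 / n

# Repeat the survey many times to see the true sampling variance of the mean.
means, naive_vars = zip(*(cluster_sample_mean_and_naive_var() for _ in range(2000)))

grand = sum(means) / len(means)
true_var = sum((x - grand) ** 2 for x in means) / (len(means) - 1)
avg_naive_var = sum(naive_vars) / len(naive_vars)

# The ratio is roughly 1 + (m - 1) * ICC = 1 + 9 * 0.5 = 5.5 here, so naive
# standard errors would be understated by a factor of about sqrt(5.5).
deff = true_var / avg_naive_var
print(f"design effect: {deff:.1f}")
```

With an intraclass correlation of .5 and clusters of 10, the naive variance formula understates the true sampling variance several-fold, which is exactly the mechanism behind the inflated Type I error rates described above.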
The purpose of this article is to serve as a resource for researchers
wishing to use complex sample data for evaluation research and includes:
(a) an overview of complex samples; (b) strategies for addressing complex
sampling design when computing advanced statistical procedures such as
multilevel modeling; (c) an applied example of correctly versus incorrectly
analyzing complex sample data using extant data in three varied design-
based approaches; (d) limitations of working with complex survey data; and
(e) recommendations for reporting results from complex sample data.
Disproportionate Sampling
Disproportionate sampling can occur in many ways. For example, (a) when
subgroups of the population are oversampled to ensure sufficient sample
size for estimation (e.g., oversampling children with disabilities, linguistic
or ethnic minority children, novice teachers); (b) when there are poststratification
adjustments (e.g., to correct for nonresponse when sampled units do
not cooperate with providing data; note that this is not the same as item
nonresponse); and/or (c) when probability proportional to size sampling has
been applied (e.g., selecting programs with probability proportional to
the enrollment in the program; Pike 2008; Biemer and Christ 2008). Research
has consistently shown that failing to address disproportionate sampling leads to underestimated standard
errors, which in turn leads to overestimated test statistics,
and ultimately an increased probability of Type I errors (i.e., rejecting the
null hypothesis when it is true; Kish 1992; Korn and Graubard 1995; Landis
et al. 1982; Brogan 1998; Hahs-Vaughn 2005, 2006a, 2006b; Kaplan and
Ferguson 1999; Stapleton 2002; Kalton 1983a; DuMouchel and Duncan
1983). More specifically, the groups that were oversampled or that responded
(as units that did not respond will not be accounted for in the analysis unless
weights are applied) will artificially influence the results (Stapleton and
Thomas 2008). To correct for the unequal selection probability, survey
weights should be applied during the analysis (Biemer and Christ 2008).
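The effect of applying survey weights can be shown with a toy example (population, sampling rates, and weights all invented): the unweighted mean is pulled toward the oversampled subgroup, while weighting by the inverse selection probabilities recovers the population mean.

```python
import random

random.seed(0)

# Hypothetical population: 90% of units centered at 10, a small subgroup
# (10% of units) centered at 20 that the survey oversamples.
majority_pop = [random.gauss(10, 2) for _ in range(9000)]
subgroup_pop = [random.gauss(20, 2) for _ in range(1000)]
true_mean = (sum(majority_pop) + sum(subgroup_pop)) / 10000

# Disproportionate sample: the subgroup is sampled at 5x the majority's rate.
majority = random.sample(majority_pop, 450)   # selection probability 1/20
subgroup = random.sample(subgroup_pop, 250)   # selection probability 1/4

# Survey weight = inverse of the selection probability.
sample = [(y, 20.0) for y in majority] + [(y, 4.0) for y in subgroup]

unweighted_mean = sum(y for y, _ in sample) / len(sample)
weighted_mean = sum(y * w for y, w in sample) / sum(w for _, w in sample)

print(f"population: {true_mean:.2f}  unweighted: {unweighted_mean:.2f}  "
      f"weighted: {weighted_mean:.2f}")
```

The unweighted mean overrepresents the oversampled subgroup (roughly 36% of the sample versus 10% of the population); the weighted mean restores each group to its population share.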
Nonindependence
Nonindependence results from the nonsimple random sampling design uti-
lized in collecting complex survey data. The common assumption for most
statistical procedures, however, is that the residuals are independent across
the observed units. Clustered data, as seen in complex surveys, violate this
assumption of independence and lead to inaccurate estimates of variation
because the variance observed within the clusters (due to units within the
cluster being more homogeneous than units selected by a simple random
sample) is usually less than the variance between the clusters (Lee,
Forthofer, and Lorimor 1989; Skinner, Holt, and Smith 1989; Hox and Kreft
1994). For example, in school settings where students are nested within
classrooms, there is often less within-classroom variance than between-classroom
variance. Neglecting the violation of the independence assumption
has been shown to produce biased parameter estimates and lead to inflated
Type I errors (Hox and Kreft 1994).
Model-based and design-based approaches are two methods used to
address nonindependence and the resulting homogeneity due to clustering
(Kalton 1983b). The research question should guide which approach is
selected. When interest is in estimating variation that can result from the
clustered relationships (e.g., children within classroom) as well as from the
individual, a model-based approach should be selected. If the sample is
examined as one group, without attention to nesting, then the appropriate
approach is a design-based model (e.g., examination of children in aggregate
without attention to any nesting within a higher level unit such as a classroom
or school; Thomas and Heck 2001).
In other words, once the strata are created, the researcher samples from each stratum
as if the strata were independent of the remaining strata. Additional technical
details on Taylor series linearization can be found in Wolter (1985).
Commonly used replication methods include balanced repeated replica-
tion (BRR), jackknife (JK) methods, and bootstrapping (Wolter 1985; Rust
and Rao 1996). Although the computations for the replicate weights are
estimated slightly differently, all replication methods divide the full sample
into subsamples and then estimate variances based on the subsamples with
replicate weights created for each subsample, and thus are considered
resampling methods (Rust and Rao 1996). Because bootstrapping is infre-
quently available in statistical software, discussion is not provided on it, and
interested readers are referred to other sources (e.g., Rust and Rao 1996).
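As a rough sketch of the resampling logic shared by these methods, the following delete-one-PSU jackknife (data and weights invented; this is the simple unstratified JK1 variant, not the exact procedure used in any particular survey) forms one replicate per primary sampling unit (PSU) and estimates the variance of a weighted mean from the spread of the replicate estimates:

```python
# Each PSU holds (value, weight) pairs; all numbers are invented.
psus = [
    [(12.0, 2.0), (14.0, 2.0)],
    [(9.0, 3.0), (11.0, 3.0)],
    [(15.0, 1.5), (13.0, 1.5)],
    [(8.0, 2.5), (10.0, 2.5)],
]
n_psu = len(psus)

def wmean(pairs):
    """Weighted mean of (value, weight) pairs."""
    return sum(y * w for y, w in pairs) / sum(w for _, w in pairs)

full = wmean([p for psu in psus for p in psu])

# Form one replicate per PSU: drop that PSU and scale the remaining
# weights by n/(n-1) to compensate for the reduced sample.
replicates = []
for drop in range(n_psu):
    kept = [(y, w * n_psu / (n_psu - 1))
            for i, psu in enumerate(psus) if i != drop
            for y, w in psu]
    replicates.append(wmean(kept))

# JK1 variance: scaled squared deviations of replicates from the full estimate.
jk_var = (n_psu - 1) / n_psu * sum((r - full) ** 2 for r in replicates)
print(f"weighted mean: {full:.3f}, jackknife SE: {jk_var ** 0.5:.3f}")
```

Stratified variants (such as the JKn procedure described below) apply the reweighting adjustment only within the stratum of the dropped PSU, but the variance is assembled from replicate estimates in the same way.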
BRR, also known as balanced half samples (Rust and Rao 1996), was
originally developed using basic replication concepts of subsampling
(Mahalanobis 1946) with the addition of orthogonal balancing introduced
later (McCarthy 1966). In BRR methods, each stratum is divided in half,
and the estimate of interest is computed from a half sample (i.e., a
replicate). The design-corrected variance can then be estimated on the basis
of the estimates from the half samples using the full-sample mean (Cavin
and Ohls 1990). BRR is restricted to cases where there are only two PSUs
selected from each stratum (Rust and Rao 1996). A variation of BRR is the
Fay method (Fay 1989), which is attractive when there are sparse subgroups
as all observations are used to form each replicate (Rust and Rao 1996).
With standard BRR, in comparison, the possibility exists that the estimator of interest is
not defined for sparse subgroups, since only one half of the observations are
used to form the replicate (and thus the algorithm cannot converge because
there is division by zero; Rust and Rao 1996).
Jackknife repeated replication (JRR) methods were introduced approxi-
mately 40 years ago (Frankel 1971). JRR replicate weights are created with
the leave-one-out approach (Kish and Frankel 1974; Rust and Rao 1996).
For FACES 1997, for example, one Head Start program at a time was given
a zero replicate weight. An adjustment factor was applied to the weights of
the children in the remaining programs in the same stratum to account for
the reduced sample size. The weights of children in the other strata were not
adjusted. This was repeated for each of the 40 sampled programs in FACES
1997, thus 40 replicate weights were computed that can be applied to accu-
rately estimate the variances, given the nonindependence. This specific JRR
procedure is the general standard stratified jackknife (JKn) and is used in
cases where two or more PSUs have been selected from each stratum
(e.g., a different number of PSUs selected from each stratum; Brick,
A normalized weight is the raw weight divided by its mean (Thomas and Heck 2001; Thomas, Heck, and Bauer
2005; Peng 2000). The normalized weight takes the unequal selection probability
into account but does so assuming a simple random sample (Thomas
and Heck 2001; Thomas, Heck, and Bauer 2005; Hahs-Vaughn 2005). In
other words, the nonindependence has still not been addressed through a
normalized weight, but this can be done by the inclusion of a DEFF.
The DEFF provides a measure of the degree of departure from SRS in the
precision of a sample estimate (Kish 1965). The DEFF is the ratio of the
estimated variance derived from considering the sampling design to that
derived from a simple random sample, where S_w^2 is the variance of a statistic
from complex sample data and S^2 is the variance from a simple random
sample (Kish 1965):

DEFF = S_w^2 / S^2.
A DEFF that is larger than 1.0 suggests that there is decreased precision of
the estimate relative to what would have been obtained from a simple ran-
dom sample, and thus DEFF values less than 1.0 indicate increased preci-
sion (Kalton 1983a; Muthen and Satorra 1995). Adjusting the normalized
weight by the DEFF will produce more accurate standard errors (than when
the complex sampling features are simply ignored). A DEFF-adjusted normalized
weight is computed as follows, where DEFF is the design effect for the outcome of interest
(i.e., dependent variable):

w_DEFF = Normalized Weight / DEFF.
When using secondary data, researchers must rely on the technical manual
for their survey data to provide the DEFFs. Not all outcome variables of
interest may have a DEFF reported, however. In those cases, it is appropri-
ate to use the DEFF for a similar variable, the average DEFF averaged over
a set of variables, or the average DEFF of the dependent variable averaged
over subgroups of the independent variable (Huang et al. 1996).
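The two weight computations described above can be sketched as follows (the raw weights and the DEFF value are invented; in practice the DEFF comes from the survey's technical manual):

```python
# Invented raw weights for four respondents and an invented DEFF of 1.6.
raw_weights = [120.0, 80.0, 200.0, 100.0]
deff = 1.6

# Normalized weight: raw weight divided by the mean weight, so the
# normalized weights sum to the actual sample size.
mean_w = sum(raw_weights) / len(raw_weights)
normalized = [w / mean_w for w in raw_weights]

# DEFF-adjusted normalized weight: divide by the DEFF, shrinking the
# effective sample size to reflect the precision lost to the design.
deff_adjusted = [w / deff for w in normalized]

print(f"sum of normalized weights: {sum(normalized):.2f}")        # = n
print(f"effective sample size (n/DEFF): {sum(deff_adjusted):.2f}")
```

The DEFF-adjusted weights sum to n/DEFF rather than n, which is how the adjustment compensates for the homogeneity induced by the design.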
Adjusting the test statistic or alpha level may be appropriate when the survey weight, strata,
cluster variables (needed for Taylor series linearization), replicate weights
(needed for replication methods), and/or the DEFF for the variables of inter-
est (needed to compute a DEFF adjusted weight) are not available. If the
DEFF is available, the t-test statistic can be divided by the square root of
the DEFF and the F test statistic can be divided by the DEFF (West and
Rathburn 2004). In situations where the DEFF is not available, the
researcher can apply an alpha level adjusted for the intraclass correlation
coefficient (ICC) value (Thomas and Heck 2001):

ICC = Var_BetweenClusters / (Var_BetweenClusters + Var_WithinClusters).
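These adjustments can be sketched with invented numbers:

```python
# Invented values for illustration.
deff = 2.0
t_naive, f_naive = 2.5, 6.25

# West and Rathburn (2004): divide t by sqrt(DEFF) and F by DEFF.
t_adjusted = t_naive / deff ** 0.5
f_adjusted = f_naive / deff

# ICC from between- and within-cluster variance components
# (Thomas and Heck 2001).
var_between, var_within = 4.0, 12.0
icc = var_between / (var_between + var_within)

print(f"t: {t_naive} -> {t_adjusted:.3f}, F: {f_naive} -> {f_adjusted}, "
      f"ICC: {icc}")
```

Both adjustments deflate test statistics toward the values they would take under the design-corrected (larger) standard errors.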
(second stage). All children in their first year of Head Start were selected
from the sampled classrooms (Zill et al. 2006).
FACES Weights
To address the unequal selection probability and nonresponse in FACES,
weights (cross sectional and longitudinal) are available for each cohort.
Although similarities exist between the sampling designs of the FACES
cohorts, there are some differences in the way the weights are created
depending on the FACES cohort. For example, in FACES 1997, only
child-level weights (that incorporate program and center information, in
addition to information about the child, in their calculation) are available
(Zill et al. 2005). Thus, if a model-based approach is used with the analyses,
only the Level 1 units can be weighted. In FACES 1997, there are five sets
of weights available: one cross-sectional weight for analysis of fall 1997
data and four longitudinal weights for analyses that examine each subse-
quent wave of data collection. Also in FACES 1997 are replicate weights
that can be analyzed using JK replication, along with the application of the
appropriate child weight (Zill et al. 2005). For researchers who are using a
design-based approach, strata and cluster variables are available so that
Taylor Series linearization can be applied.
FACES 2000 includes cross-sectional (fall 2000) and longitudinal (fall
2000 to spring 2001) child and classroom weights and one additional longitudinal
child weight for analyses that include the kindergarten year. The JK
replicate weights are also available in FACES 2000 (Zill et al. 2006) that
can be applied along with the appropriate child and/or classroom weight.
For researchers who are using a design-based approach, strata and cluster
variables are available to allow for Taylor Series linearization.
In the multilevel analyses presented here, the only adjustment needed
given the complex sample is the application of the weight to compensate for
unequal selection probability.
Readers are reminded that the purpose of this article is methodological in
scope, to illustrate how results differ based on addressing the issues associated
with complex samples, and thus detailed information on the specific methods
(e.g., instrumentation, handling missing data) for each of the illustrations is not
presented here. Rather, readers are referred to the original manuscripts for the
theoretical framework and a comprehensive presentation of methods. In all
three examples, the weighted analyses were computed first. The unweighted
analyses are identical with the exception that the weights were not applied
(e.g., same data set, centering where applicable, and other analytical aspects).
[Table 1. Weighted and unweighted multilevel multinomial logistic regression results: coefficient (standard error), [t, p], odds ratio, and confidence interval for the odds ratio, for Time 2 Profile 1 (T2P1: high average) and Time 2 Profile 2 (T2P2: average). Fixed effects include the intercept; classroom-level predictors (Arnett, Early Childhood Environment Rating Scale [ECERS] mean, years teaching experience, teacher education, mean classroom parent involvement in Head Start); child-level predictors (age in months, female, Spanish assessment flag, disabled, race/ethnicity, family structure, mother's education, weekly/monthly combined activities, authoritative, authoritarian, parental involvement in Head Start); and random effects (intercept variance with df and chi-square) for the contextual models.]
statistically significant predictor for the unweighted model (but not statisti-
cally significant in the weighted model). The results for average profile
membership at Time 2 (i.e., T2P2), relative to the reference profile (high behavior
problems, low average) at Time 2, suggest that two variables are
statistically significantly different between the weighted versus
unweighted models. In both cases, they were not statistically significant
in the weighted model but are significant in the unweighted model: (a)
ECERS mean (a global rating of classroom quality based on structural fea-
tures of the classroom); and (b) female.
Table 1 also presents the odds ratios and confidence intervals of the odds
ratios. The odds ratios that are presented in bold correspond to the parameter
estimates that were statistically significantly different between the weighted
and unweighted models. The confidence intervals for the weighted and
unweighted odds ratios were examined for overlap (Schenker and Gentleman
2001), and nonoverlapping confidence intervals suggest statistically signifi-
cant differences in odds ratios. The confidence intervals of the odds ratios all
overlapped suggesting that there were no statistically significant differences
in the effect sizes of the weighted to the unweighted models.
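The overlap screen used here amounts to a simple interval comparison; a minimal sketch (interval endpoints invented for illustration):

```python
def intervals_overlap(ci_a, ci_b):
    """True when two (lower, upper) confidence intervals overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Invented odds-ratio confidence intervals for one predictor.
weighted_ci = (0.95, 3.53)
unweighted_ci = (1.12, 3.10)

# Overlapping intervals -> no statistically significant difference flagged.
print(intervals_overlap(weighted_ci, unweighted_ci))
```

As Schenker and Gentleman (2001) note, this screen is conservative: nonoverlap implies a significant difference, but overlap does not guarantee equivalence.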
[Table 2. Weighted and unweighted three-level growth model results: coefficient (standard error), t, p.]

Selected fixed effects:
Parent involvement in Head Start (g2130): weighted .15 (.11), 1.39, .17; unweighted .208 (.090), 2.306, .021
Parent involvement in Head Start × ECERS (g2131): .193 (.170), 1.135, .257
Parent involvement in Head Start × Arnett (g2132): .0001, .019, .985

Random effects (variance components), Variance (df) [chi-square, p]:
Level 1 (temporal variation, e_tij): 28.228
Level 2 (within classrooms), individual initial status (r_0ij): 38.899 (707) [4,369.568, <.001]
Level 3 (between classrooms), classroom mean Head Start status (u_00j): 19.395 (330) [606.124, <.001]
Classroom mean Head Start growth rate (u_10j): 2.678 (328) [389.558, .011]
Classroom mean kindergarten-first grade growth rate (u_20j): .879 (328) [310.874, >.500]
Deviance (no. of estimated parameters): 26,807.471 (62)
estimated models (i.e., those models estimated after the two-profile parent
involvement and a one-profile classroom quality model) incrementally
increased the number of Level 1 (parent involvement) and Level 2 (classroom
quality) profiles. This resulted in several multilevel solution combinations
(up to six parent involvement profiles and up to three classroom quality pro-
files). For example, multilevel solutions were estimated that consisted of
three parent involvement profiles varying between two classroom quality
profiles; or three parent involvement profiles varying among three profiles
of classroom quality. As model estimation became more complex, random
starts, the number of iterations, and final-stage optimizations were increased
to avoid local maxima (Hipp and Bauer 2006). The best fitting model consisted
of three within (parent involvement) and two between (classroom quality)
profiles, resulting in a total of six distinct latent profile combinations.
Table 3 presents the unweighted results for the final multilevel latent
profile analysis (LPA) model and their comparability to the analysis that
correctly applied the weights. The results of the weighted models are pro-
vided in their entirety in Bulotsky-Shearer, Wen, Faria, Hahs-Vaughn,
and Korfmacher (under review). There were no statistically
significant differences in means based on nonoverlapping confidence
intervals (Schenker and Gentleman 2001). The confidence intervals are not
presented here but are available upon request. In reviewing the frequencies
(and percentages) of children in each profile, there was a small percentage
of children (n = 153; 8.18%) who changed profile membership depending
on whether the child weights were applied or were not applied during the
analyses. All profiles experienced some movement between weighted and
unweighted models. However, four of the six profiles had about 1% or less
movement, one profile had approximately 2% movement (Profile 6, high
parent school and home involvement, above average classroom quality),
and one profile experienced about 4% movement (Profile 4, low parent
school and home involvement, above average classroom quality). The pro-
file that had the most movement (approximately 4%) had low parent school
and home involvement. Only two profiles (Profile 2, high parent school
involvement, very low classroom quality; Profile 4, low parent school and
home involvement, above average classroom quality) had a larger member-
ship when weighted as compared to unweighted.
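The membership comparison amounts to counting disagreements between the two classification vectors; a minimal sketch with invented profile assignments:

```python
# Invented profile assignments for ten children under the weighted and
# unweighted solutions.
weighted_profiles = [1, 2, 4, 4, 5, 6, 3, 4, 2, 5]
unweighted_profiles = [1, 2, 4, 5, 5, 6, 3, 4, 2, 4]

changed = sum(w != u for w, u in zip(weighted_profiles, unweighted_profiles))
pct = 100 * changed / len(weighted_profiles)
print(f"{changed} of {len(weighted_profiles)} children changed profiles "
      f"({pct:.1f}%)")
```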
Discussion
Three illustrations of complex sample analyses within a multilevel framework
(i.e., model-based approach) were provided: multilevel multinomial logistic
Table 3. Unweighted Prevalence and Mean Z Scores (Standard Errors) for the Final Multilevel Latent Profile Solution [Weighted n, % in Brackets]
Columns: Latent Profile (a) | n (%) | Parent Involvement: School Invol., Home Weekly, Home Monthly | Classroom Quality: ECERS, AP Learning Environ., Arnett, Classroom Climate
1 179 (9.6) .894 (.069) .200 (.083) .402 (.085) .894 (.139) .460 (.135) 1.603 (.215) 1.247 (.155)
[165, 8.8%]
2 92 (4.9) .822 (.157) .147 (.114) .032 (.192) .894 (.139) .460 (.135) 1.603 (.215) 1.247 (.155)
[93, 5.0%]
3 45 (2.4) .121 (.349) .329 (.207) 1.722 (.199) .792 (.058) .835 (.057) .410 (.047) .667 (.062)
[37, 2.0%]
4 814 (43.5) .729 (.036) .291 (.057) .394 (.080) .224 (.049) .071 (.045) .357 (.040) .243 (.054)
[889, 47.5%]
5 470 (25.1) .838 (.048) .013 (.070) .213 (.069) .224 (.049) .071 (.045) .357 (.040) .243 (.054)
[449, 24.0%]
6 270 (14.4) .655 (.150) .746 (.086) 1.349 (.093) .224 (.049) .071 (.045) .357 (.040) .243 (.054)
[236, 12.6%]
Note. AP Learning Environ. = Assessment Profile learning environment; School Invol. = school involvement.
Weighted model fit indices: log likelihood = -9,885.080; Akaike information criterion (AIC) = 19,842.161; Bayesian information criterion (BIC) = 20,041.374; adjusted BIC = 19,927.002; entropy = 0.743.
Unweighted model fit indices: log likelihood = -9,887.078; AIC = 19,846.155; BIC = 20,045.368; adjusted BIC = 19,930.997; entropy = 0.730.
N = 1,870.
a Latent profile types: 1 = low parent school and home involvement, very low classroom quality; 2 = high parent school involvement, very low classroom quality; 3 = high parent home involvement, very low classroom quality; 4 = low parent school and home involvement, above average classroom quality; 5 = high parent school involvement, above average classroom quality; 6 = high parent school and home involvement, above average classroom quality.
was applied. If a classroom weight had been applied as well, it is likely that there
would have been even more movement between the profiles for the weighted
versus unweighted models.
Methodologically, these results are interesting; substantively, however
(in terms of evidence to support theory, research informing practice, and
policy implications), there are great ramifications suggested by the results
of this study. More specifically, FACES data are designed such that, when
the sampling design is acknowledged through the analysis, the variances are
estimated correctly (i.e., there is an adjustment for nonindependence) and
the results are representative of all children who attended Head Start at a
particular point in time (i.e., there is an adjustment for disproportionate
sampling). The analyses presented herein were all multilevel; thus, weighting
to address unequal selection probability was the only adjustment
made. If the weight is not applied, as stated by Kalton (1989), the sample
becomes "simply a collection of individuals that represents no meaningful
population" (p. 583), which likewise suggests meaningless interpretations.
There are additional ramifications based on these results. Relationships
between factors that were important when the sampling design was consid-
ered were not important when unweighted (and vice versa). The conclu-
sions this leads to, in terms of effectiveness of programs (such as Head
Start) and stakeholders (e.g., parents and teachers), can then fluctuate
quite wildly. One may swing (depending on the weighting of the analyses)
from concluding that parent involvement at home and/or at school is
important to concluding that there is no relationship between parent involvement
vement and Head Start outcomes. In this era of accountability, given that
funding decisions for programs are often made based on existing research
findings (e.g., continuation of Title I funding for parent involvement pro-
gramming), results stemming from complex sample data that are appropri-
ately analyzed are critical as the findings have very real consequences for
programs such as Head Start. This research illustrates the substantially
different conclusions that could have been made if the analyses had not
accounted for the sampling design.
All three illustrations presented are model-based approaches and there-
fore adjust for the nonindependence of observations by default in the multi-
level framework. This left only one issue to examine, unequal selection
probability, which can be handled by application of the appropriate weights.
There is still debate on weighting complex survey data within a multilevel
framework. As stated by Stapleton and Thomas (2008), ". . . [M]ultilevel
analyses require a more complex way of thinking about the purpose and
behavior of sample weights" (p. 40), and similar comments have been made
Authors
Best practices for researchers who have analyzed complex sample data
include the following: (a) Specify the survey weight applied during the anal-
yses to correct for disproportionate sampling, and this should include the
exact variable name as presented in the data set as well as a general
description (e.g., cross-sectional weight for the spring 2001 data collection
wave; Hahs-Vaughn 2006c). For design-based models (i.e., single level anal-
ysis), similar information should be reported for the variables applied to com-
pute correct standard errors and adjust for nonindependence. For Taylor series
linearization, these would be the strata and cluster variables. For replicate
methods, these would be the survey weight along with replicate weights.
(b) If a design-based approach is followed, specify how variances have been
estimated (e.g., Taylor Series linearization, replication methodand more
specifically, which replication method). (c) If the complex sample design is
not addressed in the analyses, provide information to assist the reader in
understanding why that may have been appropriate and the impact on the
interpretations of the results (e.g., to whom the results can be generalized;
Hahs-Vaughn 2006c).
Journal Editors
Journal editors are especially poised, through manuscript submission guide-
lines and requirements, to assist researchers in understanding best practices
of conducting and reporting complex sample analysis. Best practices for jour-
nal editors include the following: (a) Requiring authors to report the previously
mentioned items (e.g., survey weights, variance estimation method) when their
manuscript includes analysis of complex sample data (Hahs-Vaughn 2006c).
(b) Providing resources, such as this manuscript or other primers (e.g.,
Hahs-Vaughn 2005; Thomas and Heck 2001), to the editorial board that pro-
vides foundational information about complex survey data and can be a helpful
tool when reviewing manuscripts that have analyzed this type of data.
Conclusions
A decade ago, the following sentiment was characteristic of the field: "The
techniques [of complex samples] generally require specialized software that
is difficult to learn and use and is based on concepts that are not familiar to
many analysts. In some cases, the providers of the data do not even supply
analysts with the information necessary to implement the techniques.
Recent advances in software help reduce these difficulties and bring practice
more into line with theory" (Brick, Morganstein, and Valliant 2000).
Ten years later, strategies for addressing design issues of complex samples
are now almost commonplace in the most frequently used statistical software.
Most statistical software programs provide users with the capability
to apply one or more survey weights to adjust for unequal selection prob-
ability. Taylor series linearization, a good but less than desirable approach
as compared to the use of replication methods, is the most commonly
available method in statistical software programs for producing accurate
standard error estimates. A few software programs also provide replica-
tion methods as an option and it is anticipated that the options for analysis
using replication will increase dramatically as well. The limitation, how-
ever, that still exists with some statistical software packages is that while
there may be options that allow the researcher to weight for unequal selec-
tion probability and correctly estimate variances due to nonindependence,
for which procedures these estimates can be produced may be limited to
those most commonly used. For example, in the case of standard statistical
software, regression procedures are often those which can be applied with
complex sample data.
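To make concrete what these software options compute, consider a minimal, purely illustrative Python sketch. It is not the API of any package mentioned above; the data, weights, and function names are all hypothetical. It estimates a survey-weighted mean and then its standard error two ways: a simplified with-replacement Taylor series linearization of the ratio (Hajek) mean, and a delete-one jackknife, one of the replication methods. Stratum and cluster adjustments, which production software handles, are omitted for brevity.

```python
# Illustrative sketch of survey-weighted estimation (hypothetical data).
# Strata and cluster corrections are intentionally omitted.

def weighted_mean(y, w):
    """Ratio (Hajek) estimator: sum(w*y) / sum(w)."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def taylor_linearized_se(y, w):
    """Simplified with-replacement Taylor series linearization for the
    ratio mean: var ~ n/(n-1) * sum((w_i*(y_i - ybar))^2) / (sum(w))^2."""
    n = len(y)
    ybar = weighted_mean(y, w)
    wsum = sum(w)
    # linearized (score) residuals for the ratio estimator
    z = [wi * (yi - ybar) for wi, yi in zip(w, y)]
    var = (n / (n - 1)) * sum(zi ** 2 for zi in z) / wsum ** 2
    return var ** 0.5

def jackknife_se(y, w):
    """Delete-one jackknife: re-estimate the mean with each unit removed,
    rescaling the remaining weights, then combine the replicates."""
    n = len(y)
    theta = weighted_mean(y, w)  # full-sample estimate
    reps = []
    for d in range(n):
        yk = [yi for i, yi in enumerate(y) if i != d]
        # standard delete-1 reweighting: remaining weights scaled by n/(n-1)
        wk = [wi * n / (n - 1) for i, wi in enumerate(w) if i != d]
        reps.append(weighted_mean(yk, wk))
    var = (n - 1) / n * sum((r - theta) ** 2 for r in reps)
    return var ** 0.5

if __name__ == "__main__":
    # hypothetical outcome scores and unequal selection weights
    y = [2.0, 3.5, 4.0, 5.5, 3.0, 4.5]
    w = [1.2, 0.8, 1.5, 0.5, 1.0, 1.0]
    print(weighted_mean(y, w))
    print(taylor_linearized_se(y, w))
    print(jackknife_se(y, w))
```

Because the jackknife rescaling multiplies every remaining weight by the same factor, the factor cancels in the ratio estimator used here; it is retained only to mirror how replicate weights are constructed in practice. With real design features (strata, clusters, nonresponse adjustments), the two approaches can diverge more noticeably, which is why the choice of variance estimation method should be reported.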
Despite these advances in statistical processing, limitations remain in
what we know and understand about how the sampling design affects
statistical results (and thereby the conclusions drawn from the data).
For example, research on complex survey data has
often relied on simulated rather than extant data (e.g., Kaplan and Ferguson
1999; Stapleton 2002) and has focused on more commonly used statistical procedures such
as ordinary least squares regression (e.g., DuMouchel and Duncan 1983;
Korn and Graubard 1995; Skinner, Holt, and Smith 1989; Hahs-Vaughn
2005, 2006c) and structural equation modeling (e.g., Kaplan and Ferguson
1999; Hahs-Vaughn 2006c; Stapleton 2002). There is still much left to
understand about working with complex sample data within a multilevel
framework (Stapleton and Thomas 2008) and within more complex statis-
tical procedures such as those presented herein (multinomial logistic regres-
sion, growth modeling, latent profile analysis).
Echoing previous calls (e.g., Hahs-Vaughn 2006c; Rodgers-Farmer and
Davis 2001), wider dissemination of the importance of addressing disproportionate
sampling and nonindependence is needed, given that many researchers
may not understand the ramifications of analyzing complex survey data
as if it had been collected through a simple random sample.
Funding
The author(s) received no financial support for the research, authorship, and/or
publication of this article.
References
Bell-Ellison, B. A., and J. D. Kromrey. 2007. Alternatives for Analysis of Complex
Sample Surveys: A Comparison of SAS, SUDAAN, and AM Software. Paper
presented at SAS Global Forum, Orlando, FL, April 16.
Berglund, P. A. 2002. Analysis of Complex Sample Survey Data: Using
the SURVEYMEANS and SURVEYREG Procedures and Macro Coding.
Paper presented at SAS Users Group International (SUGI), Orlando, FL,
April 14.
Biemer, P. P., and S. L. Christ. 2008. Weighting Survey Data. In International
Handbook of Survey Methodology (pp. 317-341), edited by E. D. DeLeeuw.
New York, NY: Lawrence Erlbaum.
Brick, J. M., D. Morganstein, and R. Valliant. 2000. Analysis of Complex Sample
Data Using Replication. Westat. http://www.westat.com/Wesvar/techpapers/
ACS-Replication.pdf (accessed December 10, 2008).
Brogan, D. J. 1998. Pitfalls of Using Standard Statistical Software Packages for
Sample Survey Data. In Encyclopedia of Biostatistics, edited by P. Armitage,
and T. Colton. New York, NY: Wiley.
Bulotsky-Shearer, R. J., X. Wen, A. Faria, D. L. Hahs-Vaughn, and J. Korfmacher.
Under review. National Profiles of Classroom Quality and Family Involvement:
A Multilevel Examination of Proximal Influences on Head Start Children's School
Readiness.
Campbell, C., and M. Meyer. 1978. Some Properties of t Confidence Intervals for
Survey Data. Paper presented at American Statistical Association Annual
Meeting, Washington, DC.
Cavin, Edward S., and James C. Ohls. 1990. An Application of Balanced Repeated
Replication (BRR) Variance Estimation to Program Evaluation. Evaluation
Review 14:206-13.
DuMouchel, W. H., and G. J. Duncan. 1983. Using Sample Survey Weights in
Multiple Regression Analyses of Stratified Samples. Journal of the American
Statistical Association 78:535-43.
Fay, R. E. 1989. Theoretical Application of Weighting for Variance Calculation.
In Proceedings of the Section on Survey Research Methods of the American
Statistical Association (pp. 212-217).
Frankel, M. R. 1971. Inference From Survey Samples. Ann Arbor, MI: Institute for
Social Research.
Hahs-Vaughn, D. L. 2005. A Primer for Using and Understanding Weights with
National Datasets. Journal of Experimental Education 73:221-48.
. 2006a. Analysis of Data From Complex Samples. International Journal
of Research & Method in Education 29:163-81.
. 2006b. Utilization of Sample Weights in Single Level Structural Equation
Modeling. Journal of Experimental Education 74:163-90.
. 2006c. Weighting Omissions and Best Practices When Using Large-Scale
Data in Educational Research. Association for Institutional Research Profes-
sional File 101:1-9.
Hindman, A. H., L. E. Skibbe, A. Miller, and M. Zimmerman. 2010. Ecological
Contexts and Early Learning: Contributions of Child, Family, and Classroom
Factors during Head Start, to Literacy and Mathematics Growth through First
Grade. Early Childhood Research Quarterly 25:235-50.
Hipp, J. R., and D. J. Bauer. 2006. Local Solutions in the Estimation of Growth
Mixture Models. Psychological Methods 11:36-53.
Hofferth, S. L. 2005. Secondary Data Analysis in Family Research. Journal of
Marriage and Family 67:891-907.
Hong, G., and S. W. Raudenbush. 2005. Effects of Kindergarten Retention Policy
on Children's Cognitive Growth in Reading and Mathematics. Educational
Evaluation and Policy Analysis 27:205-24.
Hox, J. J., and I. G. G. Kreft. 1994. Multilevel Analysis Methods. Sociological
Methods and Research 22:283-99.
Huang, G., S. Salvucci, S. Peng, and J. Owings. 1996. National Educational Longitudinal
Study of 1988 (NELS:88) Research Framework and Issues. Arlington,
VA: Synectics for Management Decisions.
Kalton, G. 1983a. Introduction to Survey Sampling. Thousand Oaks, CA: Sage.
Bios
Debbie L. Hahs-Vaughn, PhD, is an associate professor in the Department
of Educational and Human Sciences at the University of Central Florida. She earned
her doctorate in educational research from the University of Alabama. Her research
interests include methodological issues associated with applying quantitative statistical
methods to complex survey data, using complex survey data to answer substantive
research questions, practitioner use of research, quality in reporting research,
and program evaluation.