
Methodological Considerations in Using Complex Survey Data: An Applied Example With the Head Start Family and Child Experiences Survey

Evaluation Review 35(3) 269-303
© The Author(s) 2011
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0193841X11412071
http://erx.sagepub.com

Debbie L. Hahs-Vaughn1, Christine M. McWayne2, Rebecca J. Bulotsky-Shearer3, Xiaoli Wen4, and Ann-Marie Faria5

Abstract

Complex survey data are collected by means other than simple random samples. This creates two analytical issues: nonindependence and unequal selection probability. Failing to address these issues results in underestimated standard errors and biased parameter estimates. Using data from the nationally representative Head Start Family and Child Experiences Survey (FACES; 1997 and 2000 cohorts), three diverse multilevel models are presented that illustrate differences in results depending on addressing or ignoring the complex sampling issues. Limitations of using complex survey data are reported, along with recommendations for reporting complex sample results.

Keywords

complex samples, survey weights, sampling weights, survey research, multilevel modeling

1 Department of Educational Research, University of Central Florida, Orlando, FL, USA
2 Tufts University, Boston, MA, USA
3 University of Miami, Miami, FL, USA
4 National-Louis University, Chicago, IL, USA
5 American Institutes for Research, Washington, DC, USA

Corresponding Author:
Debbie L. Hahs-Vaughn, Educational and Human Sciences, University of Central Florida, PO Box 161250, Orlando, FL 32816, USA
Email: debbie.hahs-vaughn@ucf.edu

Widely available and easily accessible, national and international data sets,
many of which have an educational focus, are a valuable resource that can
be used for evaluating programs and interventions. For example, Hong and
Raudenbush (2005) used the Early Childhood Longitudinal Study-Kindergarten Class of 1998–1999, a national database available through the U.S. Department of Education National Center for Education Statistics, to examine the effect of grade retention in kindergarten on performance in reading and mathematics. As another example, Hindman et al. (2010) used the 1997
Head Start Family and Child Experiences Survey (FACES) to analyze the
relationship between child, family, and classroom variables during Head
Start and growth in literacy and mathematics through first grade. It is important that researchers who use these data sets fully understand the way in which these data are collected, especially when the findings may be used to inform national policies or educational interventions. Specifically, such data are typically collected in the absence of a sampling frame from which participants can be randomly selected (e.g., no list exists of everyone in the population) and with the need to ensure sufficient sample sizes for subgroups of interest (e.g., underrepresented groups; Stapleton and Thomas 2008). As a result of the data collection process, these data sets are termed
complex samples.
Complex survey data are data that are collected by means other than sim-
ple random samples. Often, complex sampling designs include stratified
multistage cluster sampling that creates nonindependence among units
along with disproportionate sampling where some groups may be over-
sampled or probability proportional to size sampling has been applied. This
then creates two primary issues that must be addressed when analyzing
complex survey data: (a) homogeneity that is created due to the nonsimple
random sample (i.e., nonindependence) and (b) disproportionate sampling
that results in unequal selection probabilities (e.g., oversampling or adjust-
ment for nonresponse; Brick, Morganstein, and Valliant 2000; Lee, Forthofer,
and Lorimor 1989; Skinner, Holt, and Smith 1989). Failing to address these
issues results in incorrectly estimated standard errors (often underestimated, which translates into an increased probability of a Type I error; in other words, results that suggest statistical significance when in reality there is none)
and biased parameter estimates (Kish 1992; Korn and Graubard 1995; Landis
et al. 1982; Brogan 1998; Hahs-Vaughn 2005, 2006a, 2006b; Kaplan and
Ferguson 1999; Stapleton 2002; Kalton 1983a; DuMouchel and Duncan
1983). Although accurate variances can be estimated in multilevel models (a model-based approach), unequal selection probabilities at any level
within the hierarchical structure can produce biased parameter estimates
(Pfeffermann et al. 1998).
The purpose of this article is to serve as a resource for researchers
wishing to use complex sample data for evaluation research and includes:
(a) an overview of complex samples; (b) strategies for addressing complex
sampling design when computing advanced statistical procedures such as
multilevel modeling; (c) an applied example of correctly versus incorrectly
analyzing complex sample data using extant data in three varied design-
based approaches; (d) limitations of working with complex survey data; and
(e) recommendations for reporting results from complex sample data.

Overview of Complex Samples


Tests of inference assume an independent and identically distributed (IID)
sample. In an ideal world, IID is met via simple random sampling (SRS;
Lumley 2004; Kish and Frankel 1974). In reality, the assumption of independence is often difficult to achieve. Take, for example, an examination of children enrolled in Head Start, a federally funded intervention program for low-income children. A list of all children in the United States who are enrolled in Head Start would be difficult to obtain.
What is more feasible to access, therefore, is a list of all Head Start pro-
grams (primary sampling unit [PSU]) and within that frame, all Head Start
centers (secondary sampling unit). From the Head Start centers, classrooms
and then ultimately children can be identified. This results in a multistage
sampling design. In addition, to ensure subgroups of the population are suf-
ficiently represented in the sample, oversampling is often employed (Lee,
Forthofer, and Lorimor 1989). Also, there is usually some level of unit
nonresponse in complex survey data (Pike 2008; Lee, Forthofer, and Lorimor 1989). This type of complex sampling design is much more efficient than a simple random sample (Cavin and Ohls 1990), but it comes at the expense of requiring the researcher to employ analytic strategies that ensure the results are representative of the intended population and the variance estimates are correct.

Strategies for Addressing Complex Sampling Design


There are two issues that are inherent in complex survey data: (a)
disproportionate sampling resulting in unequal selection probability (e.g.,
oversampling, selection based on probability proportional to size, or post-
stratification to adjust for nonresponse; Lee, Forthofer, and Lorimor
1989); and (b) nonindependence due to stratified cluster sampling (usually
multistage) resulting in homogeneity that may produce small standard
errors (as compared to simple random samples; Kish 1965; Skinner 1989).

Disproportionate Sampling
Disproportionate sampling can occur in many ways. For example, (a) when
subgroups of the population are oversampled to ensure sufficient sample
size for estimation (e.g., oversampling children with disabilities, linguistic
or ethnic minority children, novice teachers); (b) when there are poststratification adjustments (e.g., to correct for nonresponse when sampled units do not cooperate with providing data; note that this is not the same as item nonresponse); and/or (c) when probability proportional to size sampling has
been applied (e.g., selecting programs based on probability proportional to
the enrollment in the program; Pike 2008; Biemer and Christ 2008). Failing to address disproportionate sampling has consistently been shown to result in underestimated standard errors, which then leads to overestimated test statis-
tics, and ultimately increased probabilities of Type I errors (i.e., rejecting the null hypothesis when it is true; Kish 1992; Korn and Graubard 1995; Landis
et al. 1982; Brogan 1998; Hahs-Vaughn 2005a, 2006a, 2006b; Kaplan and
Ferguson 1999; Stapleton 2002; Kalton 1983a; DuMouchel and Duncan
1983). More specifically, the groups that were oversampled or that responded
(as units that did not respond will not be accounted for in the analysis unless
weights are applied) will artificially influence the results (Stapleton and
Thomas 2008). To correct for the unequal selection probability, survey
weights should be applied during the analysis (Biemer and Christ 2008).
Sample weights, in simple terms, are the inverse of the selection probability, where i is the unit sampled and P_i is its probability of selection (Potthoff, Woodbury, and Manton 1992):

W_i = 1 / P_i.
In practice, sample weights usually also incorporate nonresponse or other
adjustments (e.g., oversampled units). Sample weights, when applied to com-
plex sample analyses, produce meaningful estimates (Kalton 1989) that
correspond to the population. In other words, application of weights ensures
that the results estimated on the sample are representative of the intended
population. If sample weights are not applied, the results reflect a collection
of individuals that represents no meaningful population (Kalton 1989).
Units that are oversampled (i.e., have a higher probability of being selected)
have a smaller weight value so that, when weighted, those oversampled units
have less influence in the analysis (Thomas, Heck, and Bauer 2005).
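To make the weighting mechanics concrete, here is a minimal sketch in Python; all selection probabilities, response indicators, and outcome values are invented for illustration, and operational files such as FACES release final weights with these adjustments already applied:

```python
import numpy as np

# Invented selection probabilities: the middle units were sampled at a
# lower rate, so they receive larger base weights.
selection_prob = np.array([0.10, 0.10, 0.02, 0.02, 0.05, 0.05])
responded = np.array([True, True, True, False, True, True])

base_weight = 1.0 / selection_prob  # W_i = 1 / P_i

# Crude unit-nonresponse adjustment: responding units absorb the weight
# of nonrespondents so the weights still sum to the estimated population.
adjustment = base_weight.sum() / base_weight[responded].sum()
final_weight = base_weight * adjustment

y = np.array([12.0, 15.0, 30.0, 0.0, 22.0, 25.0])  # outcome (nonrespondent's value unused)

print(y[responded].mean())                                        # unweighted mean: 20.8
print(np.average(y[responded], weights=final_weight[responded]))  # weighted mean: ~24.6
```

The heavily oversampled units (selection probability .10) carry the smallest weights and therefore have less influence on the weighted mean, which is exactly the behavior described above.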

Nonindependence
Nonindependence results from the nonsimple random sampling design uti-
lized in collecting complex survey data. The common assumption for most
statistical procedures, however, is that the residuals are independent across
the observed units. Clustered data, as seen in complex surveys, violate this assumption of independence and lead to inaccurate estimates of variation because the variance observed within the clusters (due to units within a cluster being more homogeneous than units selected by a simple random sample) is usually less than the variance between the clusters (Lee, Forthofer, and Lorimor 1989; Skinner, Holt, and Smith 1989; Hox and Kreft 1994). For example, in school settings where students are nested within classrooms, there is often less within-classroom variation than between-classroom variation. Neglecting the violation of the independence assumption has been shown to produce biased parameter estimates and lead to inflated
Type I errors (Hox and Kreft 1994).
Model-based and design-based approaches are two methods used to
address nonindependence and the resulting homogeneity due to clustering
(Kalton 1983b). The research question should guide which approach is
selected. When interest is in estimating variation that can result from the
clustered relationships (e.g., children within classroom) as well as from the
individual, a model-based approach should be selected. If the sample is
examined as one group, without attention to nesting, then the appropriate
approach is a design-based model (e.g., examination of children in aggregate
without attention to any nesting within a higher level unit such as a classroom
or school; Thomas and Heck 2001).

Model based. In a model-based approach, the statistical methodology directly incorporates the clustering in the analysis by partitioning the variance of the dependent variable into within- and between-variances that are
explained by including predictors at each level (Raudenbush and Bryk
2002). Multilevel statistical procedures (e.g., multilevel regression, multile-
vel latent profile analysis) are model-based approaches (e.g., the examina-
tion of children within classrooms). However, disproportionate sampling
still has to be addressed, and this can be accomplished through application
of weights. Many software programs now readily allow users to apply weights in such analyses.

Design based. A design-based approach is a single-level model (e.g., the examination of children or classrooms but not simultaneously), and thus both the disproportionate sampling and the nonindependence must be addressed in the analyses (Muthén and Satorra 1995). Three different recommendations have
been made on how this can be achieved, including, in rank order of producing the most accurate estimates: (a) utilizing specialized software; (b) applying a normalized weight; and (c) adjusting by an inflation factor (Thomas and Heck 2001).

First design-based method: Use specialized software. Specialized software
includes methods (either Taylor series linearization or replication methods)
that most directly address complex sample issues. These methods allow the
incorporation of sample weights and procedures to accurately estimate the
variances.
Taylor series linearization, also known as Taylor expansion or the delta method (Kish and Frankel 1974), is similar to the sandwich variance estimators that are used in econometrics (Lumley 2004). In this computation, an
approximate estimate of the variance/covariance matrix of population totals
(i.e., the sum of the first derivatives) is sandwiched between converged
parameter values. Researchers who use Taylor series linearization apply a
weight (to address unequal selection probability) and strata and cluster
(i.e., PSU) variables. Strata are the groups of clusters which are sampled
as independent populations (e.g., in the Head Start FACES 1997 sampling
design, stratification was based on census region, urbanicity, and percent-
age of enrolled minority children creating 16 cells or strata). PSUs
(e.g., Head Start programs) are clusters sampled within the stratum. In other
words, once the strata are created, the researcher samples from each stratum as if the strata were independent of the remaining strata. Additional techni-
cal details on Taylor series linearization can be found in Wolter (1985).
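As a rough illustration of the computation such software performs, the following sketch linearizes a weighted mean and combines PSU totals within strata under the usual with-replacement approximation. It is a simplified teaching example with invented data, not a substitute for a vetted survey package:

```python
import numpy as np

def taylor_se_mean(y, w, stratum, psu):
    """Linearized (Taylor series) standard error of a weighted mean for a
    stratified, clustered sample, treating PSUs as drawn with replacement."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    stratum, psu = np.asarray(stratum), np.asarray(psu)
    ybar = np.sum(w * y) / np.sum(w)
    z = w * (y - ybar) / np.sum(w)  # linearized values for the ratio mean
    var = 0.0
    for h in np.unique(stratum):
        in_h = stratum == h
        # PSU totals of the linearized values within stratum h
        totals = np.array([z[in_h & (psu == a)].sum() for a in np.unique(psu[in_h])])
        n_h = len(totals)
        if n_h > 1:
            var += n_h / (n_h - 1) * np.sum((totals - totals.mean()) ** 2)
    return ybar, np.sqrt(var)

# Toy data: two strata, two PSUs per stratum
y = [10, 12, 30, 28, 15, 14, 22, 25]
w = [5, 5, 20, 20, 8, 8, 10, 10]
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 3, 3, 4, 4]
print(taylor_se_mean(y, w, stratum, psu))
```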
Commonly used replication methods include balanced repeated replication (BRR), jackknife (JK) methods, and bootstrapping (Wolter 1985; Rust and Rao 1996). Although the computations for the replicate weights are estimated slightly differently, all replication methods divide the full sample into subsamples and then estimate variances based on the subsamples, with replicate weights created for each subsample; thus, they are considered resampling methods (Rust and Rao 1996). Because bootstrapping is infrequently available in statistical software, discussion is not provided on it, and interested readers are referred to other sources (e.g., Rust and Rao 1996).
BRR, also known as balanced half samples (Rust and Rao 1996), was
originally developed using basic replication concepts of subsampling
(Mahalanobis 1946) with the addition of orthogonal balancing introduced
later (McCarthy 1966). In BRR methods, each stratum is divided in half, and the estimate of interest is computed from one half sample (i.e., a replicate). The design-corrected variance can then be estimated on the basis of the estimates from the half samples using the full-sample mean (Cavin and Ohls 1990). BRR is restricted to cases where there are only two PSUs selected from each stratum (Rust and Rao 1996). A variation of BRR is the Fay method (Fay 1989), which is attractive when there are sparse subgroups because all observations are used to form each replicate (Rust and Rao 1996). With standard BRR, in comparison, the possibility exists that the estimator of interest is not defined for sparse subgroups, since only one half of the observations are used to form the replicate (and thus the algorithm cannot converge since there is division by zero; Rust and Rao 1996).
Jackknife repeated replication (JRR) methods were introduced approxi-
mately 40 years ago (Frankel 1971). JRR replicate weights are created with
the leave-one-out approach (Kish and Frankel 1974; Rust and Rao 1996).
For FACES 1997, for example, one Head Start program at a time was given
a zero replicate weight. An adjustment factor was applied to the weights of
the children in the remaining programs in the same stratum to account for
the reduced sample size. The weights of children in the other strata were not
adjusted. This was repeated for each of the 40 sampled programs in FACES 1997; thus, 40 replicate weights were computed that can be applied to accurately estimate the variances, given the nonindependence. This specific JRR
procedure is the general standard stratified jackknife (JKn) and is used in cases where two or more PSUs have been selected from each stratum (e.g., a different number of PSUs may be selected from each stratum; Brick, Morganstein, and Valliant 2000; Zill et al. 2005). Additional JK methods include JK1, where there is only one stratum and a random sample of PSUs from the stratum, and JK2, which is applicable when two PSUs have been selected per stratum (and thus is similar to BRR). Replicate weights are also available in
FACES 2000 and were also calculated using JKn. For all replication meth-
ods, unequal selection probability still must and can be addressed by apply-
ing the appropriate weight (e.g., child and/or classroom). While space
limitations prevent more detailed presentation of the replication methods,
interested readers may refer to Wolter (1985) or Rust and Rao (1996).
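The combining step that replication software performs is itself compact. The sketch below assumes replicate weights and JKn-style factors are already supplied, as they are in FACES; the function, variable names, and toy values are illustrative, not the actual FACES variable names:

```python
import numpy as np

def replication_se(estimate_fn, y, full_weight, replicate_weights, factors):
    """Variance of an estimate from supplied replicate weights; factors[r]
    is the replicate's multiplier, e.g., (n_h - 1)/n_h for a JKn design."""
    theta = estimate_fn(y, full_weight)
    var = sum(c * (estimate_fn(y, w_r) - theta) ** 2
              for c, w_r in zip(factors, replicate_weights))
    return theta, np.sqrt(var)

wmean = lambda y, w: np.average(y, weights=w)  # the statistic of interest

y = np.array([10.0, 12.0, 30.0, 28.0])
w_full = np.array([5.0, 5.0, 20.0, 20.0])
# One stratum with two PSUs: each replicate zeroes out one PSU
# (the leave-one-out approach) and reallocates its weight to the other.
replicates = [np.array([0.0, 0.0, 40.0, 40.0]), np.array([10.0, 10.0, 0.0, 0.0])]
factors = [0.5, 0.5]  # (n_h - 1)/n_h with n_h = 2

print(replication_se(wmean, y, w_full, replicates, factors))
```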
In terms of accuracy, all three approaches provide good and similar results for basic statistics (e.g., means, correlations; Lemeshow and Levy 1979; Campbell and Meyer 1978; Kovar, Rao, and Wu 1988). However, BRR, followed by JRR and then Taylor expansion, has been shown to produce the most accurate estimates in more complex procedures (e.g., multiple regression; Kish and Frankel 1973). Thus, when the data have replicate weights available and the software will support replication methods, a replication procedure should be applied. The choice of BRR or JK is usually dictated by the means by which the replicate weights were created, as documented in the technical manual. For example, FACES replicate weights were created using the standard stratified jackknife procedure, and thus JKn is the appropriate replication method to use.
Another deciding factor when choosing among the methods available for producing accurate variance estimates (i.e., Taylor series linearization or replicate weights) is the software being used. A number of statistical software programs now provide options for either or both Taylor series linearization and replicate weights (e.g., the SPSS complex samples add-on, SAS, HLM, LISREL, and Mplus). There are some publications that report on
various software program features as they pertain to analysis of complex
sample data. For example, Bell-Ellison and Kromrey (2007) provide a com-
parison of SAS, SUDAAN, and AM software programs. Hahs-Vaughn
(2005) illustrates and compares SPSS and AM. Berglund (2002) illustrates
the use of various SAS procedures (e.g., PROC SURVEYREG). Within the
context of the study, Stapleton (2006) provides general information on
many of the widely used programs for structural equation modeling.

Second design-based method: Apply a normalized weight adjusted by the design effect. When specialized software is not available, a second design-
based approach for addressing disproportionate sampling and nonindepen-
dence is to adjust the normalized weight by the design effect (DEFF; Cavin
and Ohls 1990). A normalized weight is computed as the raw weight
divided by its mean (Thomas and Heck 2001; Thomas, Heck, and Bauer
2005; Peng 2000). The normalized weight takes the unequal selection prob-
ability into account but does so assuming a simple random sample (Thomas
and Heck 2001; Thomas, Heck, and Bauer 2005; Hahs-Vaughn 2005). In
other words, the nonindependence has still not been addressed through a
normalized weight, but this can be done by the inclusion of a DEFF.
The DEFF provides a measure of the degree of departure from SRS in the precision of a sample estimate (Kish 1965). The DEFF is the ratio of the estimated variance derived from considering the sampling design to that derived from a simple random sample, where S_w^2 is the variance of a statistic from complex sample data and S^2 is the variance from a simple random sample (Kish 1965):

DEFF = S_w^2 / S^2.
A DEFF that is larger than 1.0 suggests that there is decreased precision of
the estimate relative to what would have been obtained from a simple ran-
dom sample, and thus DEFF values less than 1.0 indicate increased preci-
sion (Kalton 1983a; Muthen and Satorra 1995). Adjusting the normalized
weight by the DEFF will produce more accurate standard errors (than when the complex sampling features are simply ignored). A DEFF-adjusted normalized weight is computed as follows, where the DEFF used is that of the outcome of interest (i.e., the dependent variable):

W_DEFF = Normalized Weight / DEFF.
When using secondary data, researchers must rely on the technical manual
for their survey data to provide the DEFFs. Not all outcome variables of interest may have a DEFF reported, however. In those cases, it is appropriate to use the DEFF for a similar variable, the DEFF averaged over a set of variables, or the DEFF of the dependent variable averaged over subgroups of the independent variable (Huang et al. 1996).
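In code, this second design-based approach reduces to two divisions; the raw weights and the DEFF of 2.1 below are invented, and in practice the DEFF would come from the survey's technical manual:

```python
import numpy as np

raw_w = np.array([250.0, 250.0, 1200.0, 800.0, 500.0])  # raw sampling weights

# Normalized weight: raw weight divided by its mean, so the weights
# sum to the actual sample size rather than to the population size.
norm_w = raw_w / raw_w.mean()

# DEFF-adjusted normalized weight: dividing by the design effect deflates
# the effective sample size to reflect the loss of precision from clustering.
DEFF = 2.1
deff_w = norm_w / DEFF

print(norm_w.sum())  # 5.0, the sample size
print(deff_w.sum())  # ~2.38, the design-corrected effective sample size
```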

Third design-based method: Adjust the test statistic or alpha level by an inflation factor. A third adjustment that can be accomplished without the
use of specialized software is to adjust by an inflation factor. Although
the necessity of this approach is dwindling, thanks in part to advances in
technology and surveys that more frequently include variables that will
allow the complex structure to be accounted for in the analysis, it may still
be appropriate in some instances. For example, adjustments of the test
statistic or alpha level may be appropriate when the survey weight, strata,
cluster variables (needed for Taylor series linearization), replicate weights
(needed for replication methods), and/or the DEFF for the variables of inter-
est (needed to compute a DEFF adjusted weight) are not available. If the
DEFF is available, the t-test statistic can be divided by the square root of
the DEFF and the F test statistic can be divided by the DEFF (West and
Rathburn 2004). In situations where the DEFF is not available, the
researcher can apply an alpha level adjusted for the intraclass correlation coefficient (ICC) value (Thomas and Heck 2001):

ICC = Var_BetweenClusters / (Var_BetweenClusters + Var_WithinClusters).
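The arithmetic of this third approach is equally small; the test statistics, DEFF, and variance components below are invented to show the direction of the corrections:

```python
import math

# Divide t by the square root of the DEFF and F by the DEFF
# (West and Rathburn 2004). DEFF and statistics are illustrative only.
DEFF = 2.1
t_srs, F_srs = 2.40, 5.76        # statistics from an SRS-assuming analysis
t_adj = t_srs / math.sqrt(DEFF)  # ~1.66: no longer significant at alpha = .05
F_adj = F_srs / DEFF             # ~2.74

# ICC from a variance decomposition, used to justify a more conservative
# alpha level when no DEFF is published for the outcome of interest.
var_between, var_within = 4.0, 16.0
icc = var_between / (var_between + var_within)  # 0.20

print(t_adj, F_adj, icc)
```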

FACES: An Illustration of Using Weights in Complex Survey Data
There are a large number of examples of complex survey data available
through agencies such as the U.S. Department of Education National Center
for Education Statistics (http://nces.ed.gov) and the National Science Foun-
dation (http://nsf.gov). For purposes of this illustration, we will focus on the
Head Start FACES (http://www.acf.hhs.gov/programs/opre/hs/faces/) which
is available through the U.S. Department of Health and Human Services
Administration for Children and Families (ACF). Data collected in FACES
include characteristics, experiences, and outcomes for nationally representa-
tive samples of Head Start children as well as their families, classrooms, and
Head Start programs.
To date, three cohorts of 3- and 4-year-old children enrolled in Head
Start programs have been followed and data from these cohorts are avail-
able. For purposes of this article, we will focus on FACES 1997 and FACES
2000. Data on the first cohort of children were collected in 1997 (FACES
1997), and these children were followed through five waves of data collec-
tion (fall of their first year in Head Start through spring of first grade), end-
ing in spring 2001. Similar patterns of data collection are seen in the cohorts that followed (e.g., FACES 2000), although the last data collection period for cohorts other than FACES 1997 is kindergarten rather than first grade.

FACES Sampling Design


Of the FACES cohorts that are completed, each follows a similar multistage
sampling scheme. As such, only the sampling design for FACES 1997 (the
first FACES cohort) is presented here as an illustration. Readers interested in the sampling designs for FACES 2000 are referred to its technical manual
(Zill et al. 2006).
The sampling frame for FACES 1997 was Head Start programs (N = 1,734) identified in the 1995–1996 Head Start Program Information Report database, to which all Head Start programs must submit information yearly. Excluded from the population source were programs that were
designated as American Indian, Migrant, or Early Head Start. The sampling
frame was stratified three ways: (a) census region (Northeast, Midwest,
South, and West), (b) urbanicity (urban vs. nonurban); and (c) percentage
of enrolled minority children (≥50% minority or <50% minority). This created a 4 × 2 × 2 stratification (i.e., 16 cells or strata; O'Brien et al. 2002; Zill et al. 2005).
A three-stage (i.e., multistage) sampling design (selection of Head Start
programs, then centers, then children) was then followed for FACES 1997.
First, 40 Head Start programs (the PSUs) were selected based on probability proportional to size (based on proportion of enrollment of children aged 3 and older). The sampling frame (N = 1,734) of Head Start programs was slotted into the 16 stratification cells in proportion to enrollment of 3-year-old children (based on the 1995–1996 Program Information Report [PIR] data for each stratum). Two cells (NE, rural, over 50% minority; and Midwest, rural, over 50% minority) had fewer than 12 programs, and thus no programs were selected from those strata (resulting in 14 stratification cells). In the second stage
of sampling, Head Start centers were selected from each of the selected pro-
grams. In a field test in spring 1997, four Head Start centers from 36 of the 40
programs (i.e., the programs with four or more centers) were selected (result-
ing in 157 selected centers). For the fall 1997 cohort, additional centers
(beyond the four from each program which participated in the field test) were
selected resulting in 180 centers that ultimately participated in the main
study. Children from the selected centers were identified in the third, and
final, stage of sampling. Children were randomly selected from the class-
rooms that had the largest proportion of 3- and 4-year-old children who were
newly enrolled in Head Start. The number of 3- and 4-year-old children ran-
domly selected was based on estimated allocation at the program level and
then at the center level. The desired sample size was 3,200. Of the 3,648 families who were asked to participate, 3,179 provided consent, resulting in a response rate of 87.1% (O'Brien et al. 2002; Zill et al. 2005).
As stated previously, FACES 2000 follows a similar sampling design.
FACES 2000 was a two-stage (rather than three-stage) sample with Head
Start programs selected first (PSU) followed by classes within the programs
(second stage). All children in their first year of Head Start were selected
from the sampled classrooms (Zill et al. 2006).

FACES Weights
To address the unequal selection probability and nonresponse in FACES,
weights (cross sectional and longitudinal) are available for each cohort.
Although similarities exist between the sampling designs of the FACES
cohorts, there are some differences in the way the weights are created
depending on the FACES cohort. For example, in FACES 1997, only
child-level weights (that incorporate program and center information, in
addition to information about the child, in their calculation) are available
(Zill et al. 2005). Thus, if a model-based approach is used with the analyses,
only the Level 1 units can be weighted. In FACES 1997, there are five sets
of weights available: one cross-sectional weight for analysis of fall 1997
data and four longitudinal weights for analyses that examine each subse-
quent wave of data collection. Also in FACES 1997 are replicate weights
that can be analyzed using JK replication, along with the application of the
appropriate child weight (Zill et al. 2005). For researchers who are using a
design-based approach, strata and cluster variables are available so that
Taylor series linearization can be applied.
FACES 2000 includes cross-sectional (fall 2000) and longitudinal (fall 2000–spring 2001) child and classroom weights and one additional longitudinal child weight for analyses that include the kindergarten year. The JK
replicate weights are also available in FACES 2000 (Zill et al. 2006) that
can be applied along with the appropriate child and/or classroom weight.
For researchers who are using a design-based approach, strata and cluster
variables are available to allow for Taylor series linearization.

Analytic Illustrations Using FACES: To Weight or Not to Weight
To demonstrate the effect that addressing or failing to address the complex
sampling design can have on the results of a study, three illustrations using
different cohorts of FACES data (1997 and 2000) and varied model-based
techniques (multilevel multinomial logistic regression, multilevel growth
model, and multilevel latent profile analysis) will be presented. All three
approaches employed a multilevel procedure. Because multilevel models
are a model-based approach and variances are estimated correctly through
the statistical methodology, the only additional issue that must be addressed

given the complex sample is the application of the weight to compensate for
unequal selection probability.
Readers are reminded that the purpose of this article is methodological in
scope, to illustrate how results differ based on addressing the issues associated
with complex samples, and thus detailed information on the specific methods
(e.g., instrumentation, handling missing data) for each of the illustrations is not
presented here. Rather, readers are referred to the original manuscripts for the
theoretical framework and a comprehensive presentation of methods. In all
three examples, the weighted analyses were computed first. The unweighted analyses are identical with the exception that the weights were not applied (e.g., same data set, centering where applicable, and other analytical aspects).

Multilevel Multinomial Logistic Regression


In McWayne, C. M., D. L. Hahs-Vaughn, K. Cheung, and L. G. Wright (under review), a series of five multilevel logistic regression models (each representing one of the five Time 1 profiles) were undertaken to estimate profile membership in spring of children's first year in Head Start (spring 2001, referred to as Time 2), given family and classroom factors. Of the five logistic Time 1 Profile membership subgroup analyses conducted, the analysis presented here is for the Time 1 "Average" Profile subgroup and represents a multilevel multinomial logistic model because the outcome (Time 2 Profile membership) included three categories (i.e., three Time 2 profiles). While not presented, examination of weighted as compared to unweighted results for the other four Time 1 Profile subgroup analyses yielded similar results. This research was conducted using FACES 2000, and thus both child- and classroom-level weights were available and applied in the study.
Table 1 presents the unweighted results for the contextual model esti-
mates and their comparability to the analysis that correctly applied the
weights. For brevity and because the results of the weighted models are pro-
vided in their entirety in McWayne, C. M., D. L. Hahs-Vaughn, K. Cheung,
and L. G. Wright (under review), the only weighted parameter estimates
that are presented are those that reflect statistically significant differences
in the model estimates (i.e., a predictor that is statistically significant in the
unweighted model but that is not statistically significant in the weighted
model or vice versa), and these are presented in boldface type.
Table 1. Multilevel Multinomial Logistic Regression Predicting Profile Membership at End of First Year in Head Start for Time 1 Profile "Average"^a: Model Estimates

Cell entries are b (SE) [t, p], OR [OR CI]. T2 P1 = Time 2 Profile 1 ("high average")^b; T2 P2 = Time 2 Profile 2 ("average")^b. Weighted estimates (boldface in the original) appear only for predictors whose statistical significance differed between the weighted and unweighted models; all other cells are unweighted.

Intercept b0j g00 | T2 P1 unweighted: -.510 (.183) [-2.788, .006], OR .601 [.419, .861] | T2 P2 unweighted: .620 (.132) [4.707, .000], OR 1.858 [1.434, 2.408]
Intercept × Arnett g01 | T2 P1 unweighted: .026 (.020) [1.294, .197], OR 1.026 [.987, 1.067] | T2 P2 unweighted: -.011 (.014) [-.792, .429], OR .989 [.961, 1.017]
Intercept × Early Childhood Environment Rating Scale (ECERS) Mean g02 | T2 P1 unweighted: -.102 (.245) [-.418, .676], OR .903 [.558, 1.462] | T2 P2 weighted: .350 (.196) [1.787, .075], OR 1.419 [.965, 2.087] | T2 P2 unweighted: .394 (.187) [2.106, .036], OR 1.482 [1.026, 2.142]
Intercept × Years Teaching Experience g03 | T2 P1 unweighted: .001 (.019) [.046, .963], OR 1.001 [.963, 1.040] | T2 P2 unweighted: .021 (.015) [1.413, .159], OR 1.021 [.992, 1.051]
Intercept × Teacher Education: Less than Some College^c g04 | T2 P1 unweighted: .519 (.514) [1.009, .315], OR 1.680 [.610, 4.626] | T2 P2 unweighted: .125 (.429) [.291, .771], OR 1.113 [.487, 2.638]
Intercept × Teacher Education: Bachelor's Degree^c g05 | T2 P1 unweighted: .047 (.429) [.110, .913], OR 1.049 [.450, 2.442] | T2 P2 unweighted: .204 (.331) [.618, .537], OR 1.227 [.639, 2.354]
Intercept × Teacher Education: Graduate School or Greater^c g06 | T2 P1 unweighted: .244 (.519) [.470, .639], OR 1.276 [.459, 3.546] | T2 P2 unweighted: .347 (.399) [.871, .385], OR 1.416 [.645, 3.106]
Intercept × Mean Classroom Parent Involvement in Head Start g07 | T2 P1 unweighted: -.210 (.096) [-2.185, .030], OR .810 [.671, .979] | T2 P2 unweighted: -.052 (.076) [-.688, .492], OR .949 [.818, 1.102]
Age in months b1j g10 | T2 P1 unweighted: .147 (.031) [4.697, .000], OR 1.158 [1.089, 1.232] | T2 P2 unweighted: .030 (.025) [1.225, .222], OR 1.031 [.982, 1.082]
Female^d b2j g20 | T2 P1 unweighted: .604 (.334) [1.808, .071], OR 1.830 [.949, 3.529] | T2 P2 weighted: .463 (.297) [1.560, .119], OR 1.589 [.887, 2.845] | T2 P2 unweighted: .622 (.259) [2.406, .017], OR 1.863 [1.122, 3.096]
Spanish assessment flag^e b3j g30 | T2 P1 unweighted: -.941 (.733) [-1.283, .200], OR .390 [.093, 1.646] | T2 P2 unweighted: .085 (.522) [.164, .871], OR 1.089 [.391, 3.034]
Disabled^f b4j g40 | T2 P1 unweighted: .028 (.518) [.054, .958], OR 1.028 [.372, 2.842] | T2 P2 unweighted: -.108 (.436) [-.249, .804], OR .897 [.381, 2.113]
Black^g b5j g50 | T2 P1 unweighted: .068 (.460) [.148, .883], OR 1.071 [.434, 2.643] | T2 P2 unweighted: .218 (.362) [.600, .548], OR 1.243 [.610, 2.532]
Hispanic^g b6j g60 | T2 P1 unweighted: -.049 (.535) [-.092, .928], OR .952 [.333, 2.723] | T2 P2 unweighted: -.372 (.445) [-.835, .404], OR .690 [.288, 1.651]
Other non-White race^g b7j g70 | T2 P1 unweighted: -1.107 (.924) [-1.198, .232], OR .331 [.054, 2.029] | T2 P2 unweighted: -.327 (.623) [-.526, .599], OR .721 [.212, 2.449]
Family structure: single parent^h b8j g80 | T2 P1 unweighted: -1.051 (.376) [-2.797, .006], OR .350 [.167, .731] | T2 P2 unweighted: -.510 (.278) [-1.837, .067], OR .600 [.348, 1.036]
Family structure: mother-grandmother or other non mother-male^h b9j g90 | T2 P1 unweighted: -.247 (.479) [-.515, .607], OR .781 [.305, 2.002] | T2 P2 unweighted: -.490 (.401) [-1.223, .223], OR .613 [.279, 1.346]
Mother's education: less than high school^i b10j g100 | T2 P1 unweighted: -.927 (.402) [-2.306, .022], OR .396 [.180, .871] | T2 P2 unweighted: -.115 (.302) [-.380, .704], OR .892 [.493, 1.614]
Mother's education: some college or more^i b11j g110 | T2 P1 unweighted: -.463 (.387) [-1.194, .234], OR .630 [.294, 1.347] | T2 P2 unweighted: -.446 (.314) [-1.418, .157], OR .640 [.345, 1.187]
Weekly/monthly combined activities b12j g120 | T2 P1 unweighted: .049 (.044) [1.101, .272], OR 1.050 [.962, 1.146] | T2 P2 unweighted: .022 (.034) [.630, .529], OR 1.022 [.955, 1.093]
Authoritative b13j g130 | T2 P1 unweighted: -.008 (.284) [-.029, .977], OR .992 [.568, 1.731] | T2 P2 unweighted: .232 (.234) [.993, .322], OR 1.261 [.797, 1.996]
Authoritarian b14j g140 | T2 P1 unweighted: -.433 (.260) [-1.706, .088], OR .642 [.385, 1.069] | T2 P2 unweighted: -.558 (.202) [-2.76, .006], OR .573 [.385, .850]
Parental involvement in Head Start b15j g150 | T2 P1 weighted: .333 (.282) [1.181, .239], OR 1.142 [1.027, 1.270] | T2 P1 unweighted: .139 (.053) [2.641, .009], OR 1.149 [1.036, 1.275] | T2 P2 unweighted: .027 (.047) [.566, .575], OR 1.027 [.934, 1.129]

Random effects (contextual models), variance (df) [χ2]: Intercept t00 | T2 P1 weighted: .360 (179) [153.379] | T2 P1 unweighted: .327 (179) [151.484] | T2 P2 weighted: .027 (179) [170.080] | T2 P2 unweighted: .124 (179) [178.061]

Note. CI = confidence interval; OR = odds ratio; SE = standard error.
^a Reference group is "high behavior problems, low average."
^b T2 P1 = Time 2 Profile 1 ("high average"); T2 P2 = Time 2 Profile 2 ("average").
^c Reference group is associate's degree or some college.
^d Reference group is male.
^e Reference group is non-Spanish flagged assessment.
^f Reference group is nondisabled.
^g Reference group is White.
^h Reference group is mother-male household.
^i Reference group is high school diploma or GED.

In examining the Time 1 "Average" Profile subgroup weighted versus unweighted models for predicting "high average" profile membership at Time 2 (i.e., T2 P1) relative to "high behavior problems, low average" at Time 2, the results suggest that only parent involvement in Head Start is a
statistically significant predictor for the unweighted model (but not statistically significant in the weighted model). The results for the "average" profile membership at Time 2 (i.e., T2 P2) relative to "high behavior problems, low average" at Time 2 suggest there are two variables that are statistically significantly different between the weighted and unweighted models. In both cases, they were not statistically significant in the weighted model but were significant in the unweighted model: (a) ECERS mean (a global rating of classroom quality based on structural features of the classroom); and (b) female.
Table 1 also presents the odds ratios and confidence intervals of the odds
ratios. The odds ratios that are presented in bold correspond to the parameter
estimates that were statistically significantly different between the weighted
and unweighted models. The confidence intervals for the weighted and
unweighted odds ratios were examined for overlap (Schenker and Gentleman
2001), and nonoverlapping confidence intervals suggest statistically signifi-
cant differences in odds ratios. The confidence intervals of the odds ratios all
overlapped, suggesting that there were no statistically significant differences in the effect sizes between the weighted and the unweighted models.

Piecewise Multilevel Growth Model


Wen, X., R. J. Bulotsky-Shearer, D. L. Hahs-Vaughn, and J. Korfmacher (under review) examined two dimensions of Head Start program quality, classroom quality and parent involvement, and looked at how these dimensions uniquely and interactively contributed to children's academic growth from the beginning of Head Start through the end of first grade for various children's outcomes, one of which, presented here as an illustration, was receptive language (measured by the Peabody Picture Vocabulary Test [PPVT]). This research, which applied piecewise multilevel growth models, was conducted using FACES 1997, and thus only the child-level weights were available and applied in the study.
Table 2 presents the unweighted results for the contextual model and their comparability to the analysis that correctly applied the weights. Once again, for brevity and because the results of the weighted models are provided in their entirety in Wen, X., R. J. Bulotsky-Shearer, D. L. Hahs-Vaughn, and J. Korfmacher (under review), the only weighted parameter estimates that are presented are those that reflect statistically significant differences in the model estimates (i.e., a predictor that is statistically significant in the unweighted model but that is not statistically significant in the weighted model or vice versa), and these are presented in boldface type.
Table 2. Final Contextual Growth Model for PPVT W Score: Weighted Versus Unweighted Model Estimate Comparison

Cell entries are b (SE) t p. Weighted estimates (boldface in the original) appear only for predictors whose statistical significance differed between the weighted and unweighted models; all other cells are unweighted.

Fixed effects

Model for initial status p0ij:
Intercept g000 | unweighted: 62.642 (.371) 168.942 <.001
Age cohort (4-year-olds) g010 | unweighted: .875 (1.005) 1.005 .384
Child age g020 | unweighted: .796 (.078) 10.210 <.001
Child gender (girls) g030 | unweighted: .238 (.532) .447 .655
Race: Black^a g040 | unweighted: 7.536 (.772) 9.768 <.001
Race: Hispanic^a g050 | unweighted: 4.494 (1.069) 4.205 <.001
Race: Other (non-White)^a g060 | unweighted: 3.002 (1.107) 2.712 .007
Language minority^b g070 | unweighted: 6.902 (.986) 6.997 <.001
Maternal education (high school or greater)^c g080 | unweighted: 4.438 (.629) 7.054 <.001
Mom-dad household^d g090 | unweighted: .508 (.582) .874 .383
Maternal depression g0100 | weighted: .10 (.05) 1.77 .08 | unweighted: .081 (.041) 1.963 .049
Program location (urban)^e g0110 | unweighted: .471 (.822) .573 .566
Parent involvement at home g0120 | unweighted: .492 (.112) 4.411 <.001
Parent involvement in Head Start g0130 | unweighted: .192 (.149) 1.290 .198

Model for Head Start growth rate p1ij:
Intercept g100 | unweighted: 9.540 (.257) 37.182 <.001
Intercept × ECERS g101 | unweighted: .421 (.379) 1.112 .267
Intercept × Arnett g102 | unweighted: .040 (.020) 1.983 .048
Age cohort (4-year-old cohort) g110 | weighted: 1.39 (.94) 1.48 .14 | unweighted: 1.738 (.744) 2.335 .020
Child age g120 | unweighted: .126 (.061) 2.058 .039
Child gender (girls) g130 | unweighted: .484 (.405) 1.195 .233
Race: Black^a g140 | weighted: .23 (.87) .26 .79 | unweighted: 1.179 (.525) 2.245 .025
Race: Hispanic^a g150 | unweighted: .253 (.784) .323 .747
Race: Other (non-White)^a g160 | unweighted: .778 (.814) .956 .339
Language minority^b g170 | unweighted: .205 (.744) .276 .782
Maternal education (high school or greater)^c g180 | unweighted: .652 (.482) 1.354 .176
Mom-dad household^d g190 | unweighted: .281 (.436) .643 .520
Maternal depression g1100 | unweighted: .020 (.032) .629 .529
Program location (urban)^e g1110 | unweighted: .324 (.491) .660 .509
Parent involvement at home g1120 | weighted: .26 (.11) 2.31 .02 | unweighted: .160 (.084) 1.908 .056
Parent involvement at home × ECERS g1121 | unweighted: .009 (.144) .063 .950
Parent involvement at home × Arnett g1122 | weighted: .02 (.01) 2.05 .04 | unweighted: .013 (.008) 1.626 .104
Parent involvement in Head Start g1130 | weighted: .28 (.20) 1.44 .15 | unweighted: .291 (.114) 2.566 .011
Parent involvement in Head Start × ECERS g1131 | unweighted: .093 (.191) .489 .624
Parent involvement in Head Start × Arnett g1132 | unweighted: .010 (.009) 1.101 .271

Model for kindergarten-first grade growth rate p2ij:
Intercept g200 | unweighted: 7.641 (.189) 40.392 <.001
Intercept × ECERS g201 | unweighted: .107 (.336) .318 .751
Intercept × Arnett g202 | weighted: .04 (.02) 2.02 .04 | unweighted: .019 (.017) 1.119 .265
Age cohort (4-year-old cohort) g210 | weighted: 1.66 (.68) 2.43 .02 | unweighted: .720 (.586) 1.229 .220
Child age g220 | unweighted: .242 (.048) 5.006 <.001
Child gender (girls) g230 | unweighted: .075 (.323) .233 .816
Race: Black^a g240 | unweighted: .306 (.429) .714 .475
Race: Hispanic^a g250 | unweighted: .385 (.642) .601 .548
Race: Other (non-White)^a g260 | unweighted: .509 (.687) .742 .458
Language minority^b g270 | unweighted: 1.626 (.614) 2.648 .009
Maternal education (high school or greater)^c g280 | weighted: .96 (.45) 2.13 .03 | unweighted: .690 (.381) 1.812 .070
Mom-dad household^d g290 | unweighted: .108 (.352) .307 .759
Maternal depression g2100 | unweighted: .012 (.026) .455 .648
Program location (urban)^e g2110 | unweighted: .259 (.389) .666 .505
Parent involvement at home g2120 | unweighted: .043 (.067) .641 .521
Parent involvement at home × ECERS g2121 | unweighted: .008 (.131) .064 .949
Parent involvement at home × Arnett g2122 | weighted: .01 (.01) 2.06 .04 | unweighted: .012 (.007) 1.767 .077
Parent involvement in Head Start g2130 | weighted: .15 (.11) 1.39 .17 | unweighted: .208 (.090) 2.306 .021
Parent involvement in Head Start × ECERS g2131 | unweighted: .193 (.170) 1.135 .257
Parent involvement in Head Start × Arnett g2132 | unweighted: .0001 .019 .985

Random effects (variance components), variance (df) [χ2, p]:
Level-1 temporal variation e_tij: 28.228
Level-2 (within classrooms) individual initial status r0ij: 38.899 (707) [4,369.568, <.001]^f
Level-3 (between classrooms) classroom mean Head Start status u00j: 19.395 (330) [606.124, <.001]^g
Level-3 (between classrooms) classroom mean Head Start growth rate u10j: 2.678 (328) [389.558, .011]^g
Level-3 (between classrooms) classroom mean kindergarten-first grade growth rate u20j: .879 (328) [310.874, >.500]^g
Deviance (no. of estimated parameters): 26,807.471 (62)

Note. SE = standard error.
^a Reference category is White.
^b Reference category is nonlanguage minority.
^c Reference category is no high school diploma or GED.
^d Reference category is non mother-father household.
^e Reference category is rural.
^f Based on 1,051 of 1,132 units that had sufficient data for computation. Fixed effects and variance components are based on all the data.
^g Based on 331 of 356 units that had sufficient data for computation. Fixed effects and variance components are based on all the data.

In examining the weighted versus unweighted models, the estimates for initial status of receptive vocabulary in Head Start were similar for both
conditions (weighted and unweighted) with one exception. Maternal
depression was a statistically significant predictor of initial receptive voca-
bulary in the unweighted but not the weighted model.
In terms of growth in receptive vocabulary during Head Start, there
were five variables for which the statistical significance of the estimates
differed between the weighted and unweighted models. More specifically,
two variables were statistically significant in the weighted model (parent
involvement at home and the interaction of parent involvement at home
with the Arnett, a rating scale of teacher behavior toward children in the
classroom) but were not statistically significant in the unweighted model.
There were three variables that were not statistically significant in the
weighted model but that were significant in the unweighted model (age
cohort, Black, and parent involvement in Head Start).
In terms of growth in receptive vocabulary during kindergarten and first
grade, there were also five variables for which the statistical significance of
the estimates differed between the weighted and unweighted models. More
specifically, four variables were statistically significant in the weighted model (Arnett as a predictor of baseline receptive vocabulary, age cohort, maternal education of high school or greater, and the interaction of parent involvement at home and Arnett) but were not statistically significant in the
unweighted model. There was one variable that was not statistically signif-
icant in the weighted model but was significant in the unweighted model
(parent involvement in Head Start).

Multilevel Latent Profile Analysis


Bulotsky-Shearer, R. J., X. Wen, A. Faria, D. L. Hahs-Vaughn, and J. Korfmacher (under review) identified six distinctive multilevel latent profiles of Head Start classroom quality and parent involvement, and then examined their differential relationship to academic and social outcomes at the end of Head Start. The latent profiles were estimated in Mplus v. 6 (Muthén and Muthén 1998–2010). The study used FACES 1997, and only the child-level longitudinal child base weight (CHLGWT0) was applied.
All profiles estimated parent involvement variables at Level 1 (within
children) and classroom quality at Level 2 (between children). The first
model estimated was a two-profile parent involvement and a one-profile
classroom quality model. In other words, this modeled two distinct latent pro-
files of parent involvement at one level of classroom quality. Subsequently
estimated models (i.e., those models estimated after the two-profile parent
involvement and a one-profile classroom quality model) incrementally
increased the number of Level 1 (parent involvement) and Level 2 (classroom
quality) profiles. This resulted in several multilevel solution combinations
(up to six parent involvement profiles and up to three classroom quality pro-
files). For example, multilevel solutions were estimated that consisted of
three parent involvement profiles varying between two classroom quality
profiles; or three parent involvement profiles varying among three profiles
of classroom quality. As model estimation became more complex, random
starts, the number of iterations, and final-stage optimizations were increased
to avoid local maxima (Hipp and Bauer 2006).The best fitting model con-
sisted of three within (parent involvement) and two between (classroom qual-
ity) profiles resulting in a total of six distinct latent profile combinations.
Table 3 presents the unweighted results for the final multilevel latent
profile analysis (LPA) model and their comparability to the analysis that
correctly applied the weights. The results of the weighted models are provided in their entirety in Bulotsky-Shearer, R. J., X. Wen, A. Faria, D. L. Hahs-Vaughn, and J. Korfmacher (under review). There were no statistically significant differences in means based on nonoverlapping confidence intervals (Schenker and Gentleman 2001). The confidence intervals are not presented here but are available upon request. In reviewing the frequencies (and percentages) of children in each profile, there was a small percentage of children (n = 153; 8.18%) who changed profile membership depending on whether the child weights were applied during the analyses. All profiles experienced some movement between weighted and
unweighted models. However, four of the six profiles had about 1% or less movement, one profile had approximately 2% movement (Profile 6, "high parent school and home involvement, above average classroom quality"), and one profile experienced about 4% movement (Profile 4, "low parent school and home involvement, above average classroom quality"). The profile that had the most movement (approximately 4%) had low parent school and home involvement. Only two profiles (Profile 2, "high parent school involvement, very low classroom quality"; Profile 4, "low parent school and home involvement, above average classroom quality") had a larger membership when weighted as compared to unweighted.

Table 3. Unweighted Prevalence and Mean Z Scores (Standard Errors) for the Final Multilevel Latent Profile Solution [Weighted n and % in Brackets]

Columns: n (%) [weighted n, %]; parent involvement indicators (School Invol., Home Weekly, Home Monthly); classroom quality indicators (ECERS, AP Learning Environ., Arnett, Classroom Climate).

Profile 1 | 179 (9.6) [165, 8.8%] | School Invol. .894 (.069); Home Weekly .200 (.083); Home Monthly .402 (.085) | ECERS .894 (.139); AP Learning Environ. .460 (.135); Arnett 1.603 (.215); Classroom Climate 1.247 (.155)
Profile 2 | 92 (4.9) [93, 5.0%] | School Invol. .822 (.157); Home Weekly .147 (.114); Home Monthly .032 (.192) | ECERS .894 (.139); AP Learning Environ. .460 (.135); Arnett 1.603 (.215); Classroom Climate 1.247 (.155)
Profile 3 | 45 (2.4) [37, 2.0%] | School Invol. .121 (.349); Home Weekly .329 (.207); Home Monthly 1.722 (.199) | ECERS .792 (.058); AP Learning Environ. .835 (.057); Arnett .410 (.047); Classroom Climate .667 (.062)
Profile 4 | 814 (43.5) [889, 47.5%] | School Invol. .729 (.036); Home Weekly .291 (.057); Home Monthly .394 (.080) | ECERS .224 (.049); AP Learning Environ. .071 (.045); Arnett .357 (.040); Classroom Climate .243 (.054)
Profile 5 | 470 (25.1) [449, 24.0%] | School Invol. .838 (.048); Home Weekly .013 (.070); Home Monthly .213 (.069) | ECERS .224 (.049); AP Learning Environ. .071 (.045); Arnett .357 (.040); Classroom Climate .243 (.054)
Profile 6 | 270 (14.4) [236, 12.6%] | School Invol. .655 (.150); Home Weekly .746 (.086); Home Monthly 1.349 (.093) | ECERS .224 (.049); AP Learning Environ. .071 (.045); Arnett .357 (.040); Classroom Climate .243 (.054)

Note. AP Learning Environ. = assessment profile learning environment; School Invol. = school involvement. N = 1,870.
Weighted model fit indices: log likelihood = -9,885.080; Akaike information criterion (AIC) = 19,842.161; Bayesian information criterion (BIC) = 20,041.374; adjusted BIC = 19,927.002; entropy = 0.743.
Unweighted model fit indices: log likelihood = -9,887.078; AIC = 19,846.155; BIC = 20,045.368; adjusted BIC = 19,930.997; entropy = 0.730.
Latent profile types: 1 = low parent school and home involvement, very low classroom quality; 2 = high parent school involvement, very low classroom quality; 3 = high parent home involvement, very low classroom quality; 4 = low parent school and home involvement, above average classroom quality; 5 = high parent school involvement, above average classroom quality; 6 = high parent school and home involvement, above average classroom quality.

Discussion

Three illustrations of complex sample analyses within a multilevel framework (i.e., a model-based approach) were provided: multilevel multinomial logistic regression, piecewise growth model, and a multilevel latent profile analysis.
The results suggest that, in all three analyses, there are important differences
in the findings (and therefore interpretations) depending on whether the anal-
yses incorporate the sample weights or ignore the sample weights.
The results of the multilevel multinomial logistic regression for the Time 1 "Average" Profile subgroup analyses suggest minimal impact on the findings between the weighted and unweighted models when examining "high average" profile membership relative to "high behavior problems, low average." Only one variable (parent involvement in Head Start) was found to be statistically significant in the unweighted model (but not the weighted model) when predicting Time 2 "high average" profile membership as compared to "high behavior problems, low average." Similar differences in the weighted to unweighted model are seen in the analysis that examined "average" profile membership relative to the "high behavior problems, low average" profile. When the data were not weighted, the results suggest classroom quality (as measured by the ECERS) and being female statistically significantly increase the odds of being in the "average" profile at Time 2 as compared to being in "high behavior problems, low average." In terms of interpretations, the unweighted results may lead the researcher (and readers) to believe different stories about what factors are important in affecting profile membership.
The results of the weighted and unweighted piecewise multilevel growth model lead to substantially different interpretations. The most dramatic differences between the weighted and unweighted models were evidenced in growth, both growth in Head Start and growth in kindergarten–first grade. Specifically, there were five variables that differed between the models for growth in Head Start and five variables that differed between the models for growth in kindergarten–first grade. Parent involvement at home and the interaction of parent involvement at home with teacher–child interactions (i.e., Arnett) were statistically significant in predicting growth in receptive vocabulary during Head Start in the weighted model (but were not statistically significant when the data were unweighted). Three variables (age cohort, Black, and parent involvement in Head Start) were statistically significant in predicting growth in Head Start in the unweighted model (but were not statistically significant in the weighted model). The interpretations of the unweighted model provide a substantially different scenario as compared to the weighted model. For example, interpreting only those differences between weighted and unweighted models suggests that when unweighted, a child's growth in receptive vocabulary in Head Start can be statistically significantly enhanced by receiving 1 year (as compared to 2 years) of Head Start, by being Black, and
by increased parental involvement in Head Start. In contrast, when weighted, a child's growth in receptive vocabulary in Head Start is statistically significantly predicted by parent involvement at home, and this is positively moderated by teacher–child interactions (i.e., Arnett).
Potentially more dramatically, in examining growth in kindergarten–first grade and reviewing only those variables where differences were evident in the weighted to unweighted models, the unweighted model suggests that only parent involvement in Head Start is statistically significant. In contrast, four additional variables in the weighted model were statistically significant in predicting growth in kindergarten–first grade: (a) teacher–child interactions (i.e., Arnett), (b) being 4 years old (as compared to 3 years old), (c) maternal education (mothers with a high school education or more), and (d) the interaction of parent involvement at home with teacher–child interactions.
There are small differences in the weighted results of the multilevel latent profile analyses (MLPA) as compared to the unweighted results. The latent profiles and model fit indices are relatively stable between the weighted and unweighted models, with no statistically significant mean differences between the weighted and unweighted models. However, there was some movement (ranging from less than 1% to 4%) of children between the profiles depending on whether the sample weight was included in the analyses. Across the six profiles, the largest movement of children within a profile ranged from about 2% (specifically in Profile 6, "high parent school and home involvement, above average classroom quality") to 4% (in Profile 4, "low parent school and home involvement, above average classroom quality"). In four profiles, membership was smaller when the model was weighted (Profiles 1, 3, and 5, membership less than 1% smaller when weighted; Profile 6, 2% less when weighted). The two most promising multilevel latent profile types (in terms of both parent and classroom factors) were Profiles 5 ("high parent school involvement, above average classroom quality") and 6 ("high parent school and home involvement, above average classroom quality"). Both of these profiles had slightly smaller memberships in the weighted than in the unweighted models (Profile 5, n = 449 vs. n = 470, about 1% movement; Profile 6, n = 236 vs. n = 270, about 2% movement). Although there were no statistically significant differences in weighted or unweighted means for the parent involvement or classroom quality indicators, the results do suggest that there are larger percentages of children in the most promising profiles when the unequal selection probability is not addressed in the analyses via the application of the appropriate sample weight. It is important to remember that FACES 1997 was the data source for the MLPA, and thus only the child-level weight
294 Evaluation Review 35(3)

was applied. If a classroom weight been applied as well, it is likely that there
would have been even more movement between the profiles for the weighted
versus unweighted models.
Methodologically, these results are interesting; substantively, however (in terms of evidence to support theory, research informing practice, and policy implications), the results of this study suggest great ramifications. More specifically, FACES data are designed such that, when the sampling design is acknowledged through the analysis, the variances are estimated correctly (i.e., there is an adjustment for nonindependence) and the results are representative of all children who attended Head Start at a particular point in time (i.e., there is an adjustment for disproportionate sampling). The analyses presented herein were all multilevel; thus, weighting to address unequal selection probability was the only adjustment made. If the weight is not applied, as stated by Kalton (1989), the sample becomes "simply a collection of individuals that represents no meaningful population" (p. 583), which likewise suggests meaningless interpretations.
There are additional ramifications based on these results. Relationships
between factors that were important when the sampling design was consid-
ered were not important when unweighted (and vice versa). The conclu-
sions this leads to, in terms of effectiveness of programs (such as Head
Start) and stakeholders (e.g., parents and teachers), can then fluctuate
quite wildly. One may swing (depending on the weighting of the analyses) from concluding that parent involvement at home and/or at school is important to concluding that there is no relationship between parent involvement and Head Start outcomes. In this era of accountability, given that
funding decisions for programs are often made based on existing research
findings (e.g., continuation of Title I funding for parent involvement pro-
gramming), results stemming from complex sample data that are appropri-
ately analyzed are critical as the findings have very real consequences for
programs such as Head Start. This research illustrates the substantially
different conclusions that could have been made if the analyses had not
accounted for the sampling design.
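A stylized sketch of how such a swing can arise follows; the data are simulated (not the FACES models), with a predictor that matters in the majority of the population but not in a heavily oversampled subgroup, so the unweighted estimate dilutes the relationship. Note that weighted least squares is used here only to illustrate weighted point estimation; its default standard errors do not adjust for clustering.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)

# Stratum A: 80% of the population, slope = 2; stratum B: 20% of the
# population, slope = 0, but oversampled four-to-one in the sample.
n_a, n_b = 100, 400
x_a, x_b = rng.normal(size=n_a), rng.normal(size=n_b)
y_a = 2.0 * x_a + rng.normal(size=n_a)
y_b = 0.0 * x_b + rng.normal(size=n_b)

x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])

# Design weights proportional to N_h / n_h within each stratum.
w = np.concatenate([np.full(n_a, 0.8 / n_a), np.full(n_b, 0.2 / n_b)])

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()             # ignores the sampling design
wls = sm.WLS(y, X, weights=w).fit()  # weights for unequal selection

print(f"Unweighted slope: {ols.params[1]:.2f}")  # near 0.4, diluted
print(f"Weighted slope:   {wls.params[1]:.2f}")  # near 1.6, population
```

With conventional significance thresholds, a diluted unweighted coefficient of this kind is exactly the sort of estimate that can fall on the other side of statistical significance from its weighted counterpart.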
All three illustrations presented are model-based approaches and there-
fore adjust for the nonindependence of observations by default in the multi-
level framework. This left only one issue to examine, unequal selection
probability, which can be handled by application of the appropriate weights.
There is still debate on weighting complex survey data within a multilevel
framework. As stated by Stapleton and Thomas (2008), "[M]ultilevel analyses require a more complex way of thinking about the purpose and behavior of sample weights" (p. 40), and similar comments have been made by others (Pfeffermann et al. 1998). Questions include, for example, at which level the weights should be applied (Stapleton and Thomas 2008). In the case of FACES 1997, weights are only available for the child. What this means in a multilevel context is that at Level 2, because there was no weight information for the classroom (i.e., no classroom-level weight applied), the classroom relationships are calculated assuming each of the classrooms is equally informative (Stapleton and Thomas 2008). When Level 2 weights are applied, the Level 2 estimates will inform estimation at Level 1 (Stapleton and Thomas 2008). The results presented herein, which illustrate analyses that include the Level 1 weight only (i.e., FACES 1997) as well as both child- and classroom-level weights (i.e., FACES 2000), assist in building the foundation for continuing to understand some of these complexities.
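To make one of these complexities concrete: one adjustment discussed in this literature (e.g., Pfeffermann et al. 1998) is to scale the Level 1 (e.g., child) weights within each cluster before multilevel estimation, for instance so that they sum to the cluster sample size. A minimal sketch of that scaling step, with hypothetical column names rather than the actual FACES variable names, follows:

```python
import pandas as pd

# Hypothetical long-format data: one row per child, with a classroom
# identifier and a raw child-level design weight.
df = pd.DataFrame({
    "classroom_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "child_weight": [1.2, 0.8, 2.0, 0.5, 1.5, 1.0, 1.0, 3.0, 1.0],
})

# Scale weights within each classroom so they sum to that classroom's
# sample size n_j, preserving relative weights within the cluster.
grouped = df.groupby("classroom_id")["child_weight"]
df["scaled_weight"] = df["child_weight"] * (
    grouped.transform("size") / grouped.transform("sum")
)

# Each classroom's scaled weights now sum to its number of children.
print(df.groupby("classroom_id")["scaled_weight"].sum())
```

Which scaling method (if any) is appropriate depends on the model and on how informative the design is, which is precisely why the literature treats multilevel weighting as an open question.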

Limitations of Complex Sample Data


Readers are reminded that complex sample data are secondary data sources.
As such, there are limitations on how constructs are operationalized and the
quality of the documentation that comes with the data (McCall and Appel-
baum 1991). Additionally, while research that uses primary data is built from the top (research question) to the bottom (measures collected based on the research question), research conducted with secondary data (i.e., complex survey data) is built from the bottom up (McCall and Appelbaum 1991). Researchers are limited by the measures collected in the survey, and
thus the research question is dictated, to some degree, by the measures
available to them (Hofferth 2005; McCall and Appelbaum 1991; Stapleton
and Thomas 2008).

Recommended Reporting Practices


Recommendations are made here for authors, editorial board members, and editors to assist in ensuring transparency and greater ease of replication when using complex sample data.

Authors
Best practices for researchers who have analyzed complex sample data
include the following: (a) Specify the survey weight applied during the anal-
yses to correct for disproportionate sampling, and this should include the
exact variable name as presented in the data set as well as a general
description (e.g., cross-sectional weight for the spring 2001 data collection wave; Hahs-Vaughn 2006c). For design-based models (i.e., single-level analyses), similar information should be reported for the variables used to compute correct standard errors and adjust for nonindependence: for Taylor series linearization, these would be the strata and cluster variables; for replication methods, these would be the survey weight along with the replicate weights. (b) If a design-based approach is followed, specify how variances have been estimated (e.g., Taylor series linearization or a replication method and, more specifically, which replication method). (c) If the complex sample design is not addressed in the analyses, provide information to assist the reader in understanding why that may have been appropriate and the impact on the interpretations of the results (e.g., to whom the results can be generalized; Hahs-Vaughn 2006c).
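As a point of reference for items (a) and (b), the calculation implied by replicate weights is mechanical once the replicate weights are supplied with the data file. The sketch below assumes BRR-style replicate weights (jackknife and Fay's method use slightly different multipliers); the data and variable names are hypothetical:

```python
import numpy as np

def brr_variance(y, full_weight, replicate_weights):
    """Variance of a weighted mean via BRR replicate weights:
    re-estimate the statistic under each replicate weight and average
    the squared deviations from the full-sample estimate."""
    theta = np.average(y, weights=full_weight)
    theta_reps = np.array(
        [np.average(y, weights=w_r) for w_r in replicate_weights]
    )
    return np.mean((theta_reps - theta) ** 2), theta

# Hypothetical outcome, full-sample weight, and four replicate weights
# (real survey files typically supply many more replicates).
rng = np.random.default_rng(7)
y = rng.normal(50, 10, size=200)
w = rng.uniform(0.5, 2.0, size=200)
rep_ws = [w * rng.uniform(0.6, 1.4, size=200) for _ in range(4)]

var, theta = brr_variance(y, w, rep_ws)
print(f"Weighted mean = {theta:.2f}, BRR SE = {np.sqrt(var):.3f}")
```

Reporting which replication method was used matters because each method pairs its replicate weights with a specific multiplier in this formula.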

Editorial Board Members


The best practices previously mentioned for authors provide the basic cri-
teria for editorial board members as they review manuscripts. Ideally, these
are implemented as part of the journal's manuscript submission guidelines.
If that is not the case, editorial board members can take the lead in using the
previously mentioned best practices when reviewing manuscripts where
complex sample data have been analyzed.

Journal Editors
Journal editors are especially poised, through manuscript submission guide-
lines and requirements, to assist researchers in understanding best practices
of conducting and reporting complex sample analysis. Best practices for jour-
nal editors include the following: (a) Requiring authors to report the previously
mentioned items (e.g., survey weights, variance estimation method) when their
manuscript includes analysis of complex sample data (Hahs-Vaughn 2006c).
(b) Providing resources to the editorial board, such as this manuscript or other primers (e.g., Hahs-Vaughn 2005; Thomas and Heck 2001), that provide foundational information about complex survey data and can serve as helpful tools when reviewing manuscripts that have analyzed this type of data.

Conclusions
A decade ago, the following sentiment was characteristic of the field: "The techniques [of complex samples] generally require specialized software that is difficult to learn and use and is based on concepts that are not familiar to many analysts. In some cases, the providers of the data do not even supply analysts with the information necessary to implement the techniques. Recent advances in software help reduce these difficulties and bring practice more into line with theory" (Brick, Morganstein, and Valliant 2000).
Ten years later, strategies for addressing design issues of complex samples are now almost commonplace in the most frequently used statistical software. Most statistical software programs provide users with the capability to apply one or more survey weights to adjust for unequal selection probability. Taylor series linearization, a good though less desirable approach compared to replication methods, is the most commonly available method in statistical software programs for producing accurate standard error estimates. A few software programs also provide replication methods as an option, and it is anticipated that the options for analysis using replication will increase dramatically as well. One limitation that remains with some statistical software packages, however, is that although options may allow the researcher to weight for unequal selection probability and to correctly estimate variances due to nonindependence, the procedures for which these estimates can be produced may be limited to those most commonly used. For example, in standard statistical software, regression procedures are often the only ones that can be applied with complex sample data.
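For readers who have not seen the calculation spelled out, the sketch below computes a Taylor-linearized standard error for a weighted mean under a stratified design in which PSUs are treated as sampled with replacement, the standard textbook approximation (e.g., Wolter 1985); the data and identifiers are hypothetical:

```python
import numpy as np
import pandas as pd

def linearized_se_of_mean(df, y, weight, stratum, psu):
    """Taylor-linearized SE of a weighted mean under a stratified,
    with-replacement PSU design (a common approximation)."""
    w_total = df[weight].sum()
    y_bar = np.average(df[y], weights=df[weight])

    # Linearized (influence) values for the ratio-mean estimator.
    df = df.assign(z=df[weight] * (df[y] - y_bar) / w_total)

    # Aggregate linearized values to PSU totals, then take the
    # between-PSU variation within each stratum.
    psu_totals = df.groupby([stratum, psu])["z"].sum()
    var = 0.0
    for _, z_h in psu_totals.groupby(level=0):
        n_h = len(z_h)
        if n_h > 1:
            var += n_h / (n_h - 1) * ((z_h - z_h.mean()) ** 2).sum()
    return np.sqrt(var), y_bar

# Hypothetical data: two strata, two PSUs (clusters) per stratum.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "stratum": np.repeat([1, 1, 2, 2], 50),
    "psu": np.repeat([1, 2, 3, 4], 50),
    "w": rng.uniform(0.5, 2.0, 200),
    "score": rng.normal(50, 10, 200),
})
se, mean = linearized_se_of_mean(df, "score", "w", "stratum", "psu")
print(f"Weighted mean = {mean:.2f}, linearized SE = {se:.3f}")
```

Comparing this standard error to the naive simple-random-sample standard error gives a rough sense of the design effect that dedicated survey software computes for the user.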
With the advances in statistical processing, there still exist limitations on what we know and understand about how the sampling design affects the statistical results (and thereby the conclusions that are made about the data). For example, research that has been conducted on complex survey data often uses simulated rather than extant data (e.g., Kaplan and Ferguson 1999; Stapleton 2002) and more commonly used statistical procedures such as ordinary least squares regression (e.g., DuMouchel and Duncan 1983; Hahs-Vaughn 2005, 2006c; Korn and Graubard 1995; Skinner, Holt, and Smith 1989) and structural equation modeling (e.g., Hahs-Vaughn 2006c; Kaplan and Ferguson 1999; Stapleton 2002). There is still much left to understand about working with complex sample data within a multilevel framework (Stapleton and Thomas 2008) and within more complex statistical procedures such as those presented herein (multinomial logistic regression, growth modeling, latent profile analysis).
Echoing previous calls (e.g., Hahs-Vaughn 2006c; Rodgers-Farmer and
Davis 2001), wider dissemination of the importance of addressing dispropor-
tionate sampling and nonindependence is needed, given that many research-
ers may not understand the ramifications of analyzing complex survey data
without attention to its unique sampling design. Suggestions for future
research, therefore, include examination of both simulated and extant data
using advanced statistical procedures (e.g., propensity score analysis, latent
class, and latent profile analysis). Dissemination through graduate-level courses, workshops through professional organizations (e.g., the American Educational Research Association's Statistics Institute or training programs
sponsored by federal agencies that collect/house complex survey data), and
other means will also assist in advancing the use and adoption of procedures
that appropriately accommodate complex sample data.

Declaration of Conflicting Interests


The author(s) declared no potential conflicts of interest with respect to the research,
authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or
publication of this article.

References
Bell-Ellison, B. A., and J. D. Kromrey. 2007. "Alternatives for Analysis of Complex Sample Surveys: A Comparison of SAS, SUDAAN, and AM Software." Paper presented at SAS Global Forum, Orlando, FL, April 16.
Berglund, P. A. 2002. "Analysis of Complex Sample Survey Data: Using the SURVEYMEANS and SURVEYREG Procedures and Macro Coding." Paper presented at SAS Users Group International (SUGI), Orlando, FL, April 14.
Biemer, P. P., and S. L. Christ. 2008. "Weighting Survey Data." In International Handbook of Survey Methodology (pp. 317-341), edited by E. D. DeLeeuw. New York, NY: Lawrence Erlbaum.
Brick, J. M., D. Morganstein, and R. Valliant. 2000. "Analysis of Complex Sample Data Using Replication." Westat. http://www.westat.com/Wesvar/techpapers/ACS-Replication.pdf (accessed December 10, 2008).
Brogan, D. J. 1998. "Pitfalls of Using Standard Statistical Software Packages for Sample Survey Data." In Encyclopedia of Biostatistics, edited by P. Armitage and T. Colton. New York, NY: Wiley.
Bulotsky-Shearer, R. J., X. Wen, A. Faria, D. L. Hahs-Vaughn, and J. Korfmacher. Under review. "National Profiles of Classroom Quality and Family Involvement: A Multilevel Examination of Proximal Influences on Head Start Children's School Readiness."
Campbell, C., and M. Meyer. 1978. "Some Properties of t Confidence Intervals for Survey Data." Paper presented at the American Statistical Association Annual Meeting, Washington, DC.
Cavin, E. S., and J. C. Ohls. 1990. "An Application of Balanced Repeated Replication (BRR) Variance Estimation to Program Evaluation." Evaluation Review 14:206-13.
DuMouchel, W. H., and G. J. Duncan. 1983. "Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples." Journal of the American Statistical Association 78:535-43.
Fay, R. E. 1989. "Theoretical Application of Weighting for Variance Calculation." Proceedings of the Section on Survey Research Methods of the American Statistical Association (pp. 212-217).
Frankel, M. R. 1971. Inference From Survey Samples. Ann Arbor, MI: Institute for Social Research.
Hahs-Vaughn, D. L. 2005. "A Primer for Using and Understanding Weights with National Datasets." Journal of Experimental Education 73:221-48.
Hahs-Vaughn, D. L. 2006a. "Analysis of Data From Complex Samples." International Journal of Research & Method in Education 29:163-81.
Hahs-Vaughn, D. L. 2006b. "Utilization of Sample Weights in Single Level Structural Equation Modeling." Journal of Experimental Education 74:163-90.
Hahs-Vaughn, D. L. 2006c. "Weighting Omissions and Best Practices When Using Large-Scale Data in Educational Research." Association for Institutional Research Professional File 101:1-9.
Hindman, A. H., L. E. Skibbe, A. Miller, and M. Zimmerman. 2010. "Ecological Contexts and Early Learning: Contributions of Child, Family, and Classroom Factors During Head Start to Literacy and Mathematics Growth Through First Grade." Early Childhood Research Quarterly 25:235-50.
Hipp, J. R., and D. J. Bauer. 2006. "Local Solutions in the Estimation of Growth Mixture Models." Psychological Methods 11:36-53.
Hofferth, S. L. 2005. "Secondary Data Analysis in Family Research." Journal of Marriage and Family 67:891-907.
Hong, G., and S. W. Raudenbush. 2005. "Effects of Kindergarten Retention Policy on Children's Cognitive Growth in Reading and Mathematics." Educational Evaluation and Policy Analysis 27:205-24.
Hox, J. J., and I. G. G. Kreft. 1994. "Multilevel Analysis Methods." Sociological Methods and Research 22:283-99.
Huang, G., S. Salvucci, S. Peng, and J. Owings. 1996. National Educational Longitudinal Study of 1988 (NELS:88) Research Framework and Issues. Arlington, VA: Synectics for Management Decisions.
Kalton, G. 1983a. Introduction to Survey Sampling. Thousand Oaks, CA: Sage.
Kalton, G. 1983b. "Models in the Practice of Survey Sampling." International Statistical Review 51:175-88.
Kalton, G. 1989. "Modeling Considerations: Discussion from a Survey Sampling Perspective." In Panel Surveys, edited by D. Kasprzyk, G. Duncan, G. Kalton, and M. Singh. New York, NY: Wiley.
Kaplan, D., and A. J. Ferguson. 1999. "On the Utilization of Sample Weights in Latent Variable Models." Structural Equation Modeling 6:305-21.
Kish, L. 1965. Survey Sampling. New York, NY: Wiley.
Kish, L. 1992. "Weighting for Unequal Pi." Journal of Official Statistics 8:183-200.
Kish, L., and M. R. Frankel. 1973. "Inference from Complex Samples." Paper presented at the annual meeting of the Royal Statistical Society, October 17.
Kish, L., and M. R. Frankel. 1974. "Inference from Complex Samples." Journal of the Royal Statistical Society, Series B 36:1-37.
Korn, E. L., and B. I. Graubard. 1995. "Examples of Differing Weighted and Unweighted Estimates from a Sample Survey." American Statistician 49:291-305.
Kovar, J. G., J. N. K. Rao, and C. F. J. Wu. 1988. "Bootstrap and Other Methods to Measure Errors in Survey Estimates." The Canadian Journal of Statistics 16:25-45.
Landis, R. J., J. M. Lepkowski, S. A. Eklund, and S. A. Stehouwer. 1982. "A Statistical Methodology for Analyzing Data from a Complex Survey: The First National Health and Nutrition Examination Survey." In Vital and Health Statistics. Hyattsville, MD: National Center for Health Statistics.
Lee, E. S., R. N. Forthofer, and R. J. Lorimor. 1989. Analyzing Complex Survey Data. Newbury Park, CA: Sage.
Lemeshow, S., and P. S. Levy. 1979. "Estimating the Variance of Ratio Estimates in Complex Sample Surveys with Two Primary Units per Stratum: A Comparison of Balanced Repeated Replication and Jackknife Techniques." Journal of Statistical Computing and Simulations 8:191-205.
Lumley, T. 2004. "Analysis of Complex Survey Samples." Journal of Statistical Software 9:1-19.
Mahalanobis, P. C. 1946. "On Large-Scale Sample Surveys." Philosophical Transactions of the Royal Society of London, Series B 231:329-451.
McCall, R. B., and M. I. Appelbaum. 1991. "Some Issues of Conducting Secondary Analyses." Developmental Psychology 27:911-17.
McCarthy, P. J. 1966. Replication: An Approach to the Analysis of Data From Complex Surveys. Public Health Service Publication No. 1000, Series 2, No. 14. Washington, DC: U.S. Department of Health, Education, and Welfare.
McWayne, C. M., D. L. Hahs-Vaughn, K. Cheung, and L. G. Wright. Submitted for review. "National Profiles of School Readiness Skills for Head Start Children: An Investigation of Stability and Change." Early Childhood Research Quarterly.
Muthén, L. K., and B. O. Muthén. 1998-2010. Mplus User's Guide. 6th ed. Los Angeles, CA: Muthén & Muthén.
Muthén, B. O., and A. Satorra. 1995. "Complex Sample Data in Structural Equation Modeling." Sociological Methodology 25:267-316.
O'Brien, R. W., M. A. D'Elio, M. Vaden-Kiernan, C. Magee, T. Younoszai, M. J. I'Keane, D. C. Connell, and L. Hailey. 2002. A Descriptive Study of Head Start Families: FACES Technical Report I. Washington, DC: Department of Health and Human Services, Administration for Children and Families, Head Start Bureau.
Peng, S. S. 2000. "Technical Issues in Using NCES Data." Paper presented at the AIR/NCES National Data Institute on the Use of Postsecondary Databases, Gaithersburg, MD.
Pfeffermann, D., C. J. Skinner, D. J. Holmes, H. Goldstein, and J. Rasbash. 1998. "Weighting for Unequal Selection Probabilities in Multilevel Models." Journal of the Royal Statistical Society B 60:23-40.
Pike, G. R. 2008. "Using Weighting Adjustments to Compensate for Survey Nonresponse." Research in Higher Education 49:153-71.
Potthoff, R. F., M. A. Woodbury, and K. G. Manton. 1992. "'Equivalent Sample Size' and 'Equivalent Degrees of Freedom' Refinements for Inference Using Survey Weights Under Superpopulation Models." Journal of the American Statistical Association 87:383-96.
Raudenbush, S. W., and A. S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods. Thousand Oaks, CA: Sage.
Rodgers-Farmer, A. Y., and D. Davis. 2001. "Analyzing Complex Survey Data." Social Work Research 25:185-92.
Rust, K. F., and J. N. K. Rao. 1996. "Variance Estimation for Complex Surveys Using Replication Techniques." Statistical Methods in Medical Research 5:283-310.
Schenker, N., and J. F. Gentleman. 2001. "On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals." The American Statistician 55:182-86.
Skinner, C. J. 1989. "Domain Means, Regression and Multivariate Analysis." In Analysis of Complex Surveys (pp. 59-88), edited by C. J. Skinner, D. Holt, and T. M. F. Smith. New York, NY: Wiley.
Skinner, C. J., D. Holt, and T. M. F. Smith, eds. 1989. Analysis of Complex Surveys. New York, NY: Wiley.
Stapleton, L. M. 2002. "The Incorporation of Sample Weights into Multilevel Structural Equation Models." Structural Equation Modeling 9:475-502.
Stapleton, L. M. 2006. "An Assessment of Practical Solutions for Structural Equation Modeling with Complex Sample Data." Structural Equation Modeling 13:28-58.
Stapleton, L. M., and S. L. Thomas. 2008. "The Use of National Datasets for Teaching and Research." In Multilevel Modeling of Educational Data (pp. 11-59), edited by A. A. O'Connell and D. B. McCoach. Charlotte, NC: Information Age Publishing.
Thomas, S. L., and R. H. Heck. 2001. "Analysis of Large-Scale Secondary Data in Higher Education Research: Potential Perils Associated with Complex Sampling Designs." Research in Higher Education 42:517-40.
Thomas, S. L., R. H. Heck, and K. W. Bauer. 2005. "Weighting and Adjusting for Design Effects in Secondary Data Analyses." New Directions for Institutional Research 127:51-72.
Wen, X., R. J. Bulotsky-Shearer, D. L. Hahs-Vaughn, and J. Korfmacher. Submitted for review. "Examination of Head Start Program Quality: Combining Classroom Quality and Parent Involvement to Understand Children's Vocabulary, Literacy, and Mathematics Achievement Trajectories." Early Childhood Research Quarterly.
West, J., and A. Rathburn. 2004. "ECLS-K Technical Issues." Paper presented at the American Educational Research Association Institute on Statistical Analysis for Education Policy, San Diego, CA.
Wolter, K. 1985. Introduction to Variance Estimation. New York, NY: Springer-Verlag.
Zill, N., K. Kim, A. Sorongon, R. Herbison, and C. Clark. 2006. Head Start Family and Child Experiences Survey (FACES) 2000 Cohort: User's Guide. Washington, DC: United States Department of Health and Human Services, Administration for Children and Families, Office of Planning, Research and Evaluation.
Zill, N., K. Kim, A. Sorongon, R. Herbison, and C. Clark. 2005. Head Start Family and Child Experiences Survey (FACES) 1997 Cohort: User's Guide. Ann Arbor, MI: Inter-University Consortium for Political and Social Research at the Institute for Social Research, University of Michigan, and the United States Department of Health and Human Services, Administration for Children and Families, Office of Planning, Research and Evaluation.

Bios
Debbie L. Hahs-Vaughn, PhD, is currently an associate professor in the department
of educational & human sciences at the University of Central Florida. She earned
her doctorate in Educational Research from the University of Alabama. Her research
interests include methodological issues associated with applying quantitative statis-
tical methods to complex survey data; using complex survey data to answer substan-
tive research questions; practitioner use of research; quality in reporting research;
and program evaluation.
Christine M. McWayne, PhD, is an associate professor in the Eliot-Pearson depart-
ment of child development in the school of arts and sciences at Tufts University. She
earned her doctorate in School, Community, and Clinical-Child Psychology from
the University of Pennsylvania in 2003. McWayne's work has focused primarily
on two areas: (1) the interplay of contextual factors in the home, school, and neigh-
borhood and low-income children's early development of social and academic com-
petencies; and (2) the advancement of a partnership-based research approach for
informing systems level interventions to improve developmental outcomes for
urban, low-income children.
Rebecca J. Bulotsky-Shearer, PhD, is currently an assistant professor in the depart-
ment of psychology, child division, at the University of Miami. She is involved in
community-based research within the Head Start community in Miami-Dade
County. Her research interests include the development of contextually relevant
measures of emotional and behavioral adjustment for culturally and linguistically
diverse, low-income preschool populations; and in identifying early protective influ-
ences in the home and school contexts that promote learning and social adjustment
for low-income preschool children.
Xiaoli Wen, PhD, is an assistant professor in early childhood education at National
Louis University. She received her doctoral degree from Purdue University. Her
research focuses on early childhood intervention evaluation and child care quality.
Ann-Marie Faria, PhD, is a research analyst at the American Institutes for
Research in the education and human development division. She earned her docto-
rate in Applied Developmental Psychology from the University of Miami in 2009,
where she was an Institute of Education Sciences (IES) predoctoral fellow. Faria's
work focuses on applying rigorous quantitative methodology to understand the over-
lap between young childrens social emotional and academic development.
