Blackwell Publishing and Royal Statistical Society are collaborating with JSTOR to digitize, preserve and
extend access to Journal of the Royal Statistical Society. Series C (Applied Statistics).
http://www.jstor.org
Appl. Statist. (1981), 30, No. 2, pp. 163-169
SUMMARY
In this paper the use of cluster analysis in stratification is considered. A clustering or stratification criterion function is suggested which is derived within the context of multivariate stratified sampling. A case study in which the States of Mexico are stratified with respect to nine socio-economic variables is presented.
Keywords: OPTIMUM STRATIFICATION; CLUSTER ANALYSIS; MULTIVARIATE SAMPLING
1. INTRODUCTION
STRATIFICATION may be used in surveys for various reasons. For example: for administration; to ...
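by

$$\hat{\theta}_k = \sum_{h=1}^{L} W_h \bar{X}_{k,h}, \tag{1}$$

$$V(\hat{\theta}_k) = \sum_{h=1}^{L} W_h^2 V(\bar{X}_{k,h}), \qquad k = 1, 2, \ldots, K. \tag{2}$$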
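Here only stratified random sampling with proportional allocation is considered and hence (2) reduces to

$$V(\hat{\theta}_k) = \frac{N-n}{Nn} \sum_{h=1}^{L} W_h S_{k,h}^2,$$

where $S_{k,h}^2$ denotes the variance of $X_k$ within stratum $h$.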
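The proportional-allocation variance can be sketched numerically. The block below is a minimal illustration, assuming the standard form (N - n)/(Nn) times the weighted sum of within-stratum variances, with W_h = N_h/N and variances computed with divisor N_h - 1. The function name `var_stratified_mean` and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import variance


def var_stratified_mean(strata, n):
    """Variance of the stratified sample mean under proportional allocation.

    strata: list of lists holding the population values of one variable,
            one inner list per stratum (hypothetical input layout).
    n:      total sample size; n_h = n * N_h / N is assumed, without rounding.
    """
    N = sum(len(s) for s in strata)
    factor = (N - n) / (N * n)
    total = 0.0
    for s in strata:
        w_h = len(s) / N
        # Within-stratum variance with divisor N_h - 1; a single-unit
        # stratum contributes zero.
        s2_h = variance(s) if len(s) > 1 else 0.0
        total += w_h * s2_h
    return factor * total


# Toy population of six units: two tight strata versus no stratification.
stratified = var_stratified_mean([[1, 2, 3], [10, 11, 12]], n=2)
unstratified = var_stratified_mean([[1, 2, 3, 10, 11, 12]], n=2)
print(stratified, unstratified)
```

Passing a single stratum gives the simple random sampling variance, so the same helper can be used to compare a stratification against the unstratified case.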
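$$x_{(h)} = \frac{1}{2}\left[ \frac{\int_{x_{(h-1)}}^{x_{(h)}} x f(x)\,dx}{\int_{x_{(h-1)}}^{x_{(h)}} f(x)\,dx} + \frac{\int_{x_{(h)}}^{x_{(h+1)}} x f(x)\,dx}{\int_{x_{(h)}}^{x_{(h+1)}} f(x)\,dx} \right], \qquad h = 1, \ldots, L-1. \tag{3}$$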
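Under proportional allocation, the stationarity condition for the stratum boundaries says that each boundary is the midpoint of the means of the two adjacent strata. One way to look for boundaries satisfying this condition on a finite list of values is a fixed-point iteration. The sketch below assumes the empirical distribution of the data in place of f(x) and an equally spaced starting point. It is an illustration only, not necessarily the procedure used in the paper, and `dalenius_boundaries` is a hypothetical name.

```python
def dalenius_boundaries(values, L, max_iter=100):
    """Iterate 'boundary = midpoint of adjacent stratum means' on 1-D data."""
    xs = sorted(values)
    lo, hi = xs[0], xs[-1]
    # Starting point: equally spaced interior boundaries (an arbitrary choice).
    bounds = [lo + (hi - lo) * h / L for h in range(1, L)]
    for _ in range(max_iter):
        # A value goes to the stratum indexed by the number of boundaries
        # it strictly exceeds.
        groups = [[] for _ in range(L)]
        for x in xs:
            groups[sum(x > b for b in bounds)].append(x)
        if any(len(g) == 0 for g in groups):
            break  # an empty stratum has no mean; keep the current boundaries
        means = [sum(g) / len(g) for g in groups]
        new_bounds = [(means[h] + means[h + 1]) / 2 for h in range(L - 1)]
        if new_bounds == bounds:
            break  # fixed point reached
        bounds = new_bounds
    return bounds


print(dalenius_boundaries([0, 1, 2, 3, 4, 20], L=2))
```

A fixed point of this iteration satisfies the midpoint condition but need not be the global optimum, so several starting points may be worth trying.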
$$F(S) = \sum_{k=1}^{K} d_k(S).$$
$$\sum_{k=1}^{K} V_k d_k(S) \tag{4}$$
3. CASE STUDY
The problem of strata construction may be seen as a classification problem and this has led authors to consider the use of clustering algorithms (e.g. Green et al., 1967; Golder and Yeomans, 1973). These algorithms have been devised for special problems, and it is not clear why they should work in different contexts. The discussion of Section 2 suggests that the use of Ward's clustering algorithm or the K-means clustering algorithm, appropriately applied, should yield sensible stratifications. In this section, a case study is presented. The procedure described in Section 2 is applied together with other stratification procedures (some of which use clustering algorithms) for the purpose of efficiency comparison.
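To make concrete how a K-means pass assigns units to strata, a minimal batch (Lloyd-style) sketch follows. It does not reproduce the procedure of Section 2: it uses unscaled squared Euclidean distance and takes the first L points as starting centres, both of which are assumptions made for illustration. The function name `kmeans` and the toy points are hypothetical.

```python
def kmeans(points, L, max_iter=100):
    """Batch K-means on a list of equal-length tuples; returns stratum labels."""
    centres = [list(p) for p in points[:L]]  # naive start: the first L points
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each unit joins the stratum with the nearest centre.
        new_labels = [
            min(range(L),
                key=lambda h: sum((a - c) ** 2 for a, c in zip(p, centres[h])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for h in range(L):
            members = [p for p, lab in zip(points, labels) if lab == h]
            if members:  # keep the old centre if a stratum becomes empty
                centres[h] = [sum(col) / len(members) for col in zip(*members)]
    return labels


toy = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
print(kmeans(toy, L=2))
```

Because the distance is unscaled, variables with large variances dominate the assignment; any rescaling of the variables before clustering would have to be chosen to match the criterion being minimized.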
It is assumed that the aim is to estimate the means θ1, θ2, ..., θ9 of nine socio-economic variables X1, X2, ..., X9 (for their definition see Table 3) using a stratified random sample. The 31 States of Mexico are stratified with respect to these variables. The data refer to the year 1974 and are found in IEPES (1976); the sample size is considered to be n = 12. The K-means algorithm was used to minimize F(S), thus providing S*. The resulting variances of the estimators of the means are presented in Table 1 for a number of strata L = 2, 3, 4 and 5. In the table, the row marked RS refers to the variances obtained using simple random sampling, and the numbers in brackets refer to the corresponding lower bounds (e.g. for L = 4, V_S(θ5) ≥ 1.1 for all S).
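The link between the variances in Table 1 and the criterion values in Table 4 can be checked with a short calculation. The definition of d_k(S) is not reproduced in this section, so the sketch below assumes that d_k(S) is the ratio of the variance to its lower bound. The decimal points in the Table 1 entries for L = 2 are also restored by assumption. Under both assumptions the sums come out close to the values 19 (S*) and 28 (RS) reported for L = 2 in Table 4.

```python
# Table 1 entries for L = 2, variables 1 to 9, with decimal points restored
# (an assumption about the printed values).
var_s_star = [1.21, 0.141, 2.59, 4.06, 3.95, 10.67, 3.55, 0.16, 0.04]
lower_bound = [0.41, 0.056, 1.19, 1.97, 3.18, 5.0, 1.55, 0.13, 0.016]
var_rs = [1.23, 0.144, 3.68, 5.76, 9.82, 11.13, 5.00, 0.35, 0.08]


def criterion(variances, bounds):
    """Sum of d_k over the K variables.

    ASSUMPTION: d_k is taken as variance / lower bound; this definition is
    inferred for illustration and is not quoted from the paper.
    """
    return sum(v / b for v, b in zip(variances, bounds))


f_s_star = criterion(var_s_star, lower_bound)
f_rs = criterion(var_rs, lower_bound)
print(round(f_s_star), round(f_rs))
```

Each ratio is at least 1 by construction, so a value of 9 would mean that every variable attains its lower bound simultaneously.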
Frequently in survey practice, stratification is done with respect to a variable that is considered as the main indicator of the variables of interest. Table 2 presents (for L = 5) the variances obtained with S* and those that would be obtained if X1 was considered as "main indicator" of X1, ..., X9 and Dalenius's optimum stratification procedure was applied to f(X1).
TABLE 1
Variances of the estimators of the means using stratification S* (with lower bounds in brackets)

         V(θ1)    V(θ2)    V(θ3)    V(θ4)    V(θ5)    V(θ6)    V(θ7)    V(θ8)    V(θ9)
RS       1.23     0.144    3.68     5.76     9.82     11.13    5.00     0.35     0.08
L = 2    1.21     0.141    2.59     4.06     3.95     10.67    3.55     0.16     0.04
         (0.41)   (0.056)  (1.19)   (1.97)   (3.18)   (5.0)    (1.55)   (0.13)   (0.016)
L = 3    1.16     0.13     2.81     1.91     4.11     9.25     2.34     0.16     0.04
         (0.22)   (0.03)   (0.56)   (0.95)   (2.0)    (2.6)    (0.53)   (0.06)   (0.01)
L = 4    0.72     0.11     2.59     1.81     4.4      10.5     1.8      0.19     0.06
         (0.10)   (0.02)   (0.35)   (0.47)   (1.1)    (1.3)    (0.31)   (0.05)   (0.005)
L = 5    1.23     0.08     1.51     2.01     2.69     8.03     1.87     0.09     0.06
         (0.074)  (0.011)  (0.25)   (0.33)   (0.51)   (0.76)   (0.18)   (0.033)  (0.003)
TABLE 2
Variances obtained with stratifications S*, S1*, S4* and S9* (L = 5)

Parameter   S*      S1*       S4*     S9*
θ1          1.2     (0.074)   1.1     1.13
θ2          0.08    0.16      0.16    0.12
θ3          1.5     3.7       3.6     3.5
θ4          2.0     4.7       0.33    5.0
θ5          2.7     10.0      5.7     7.5
θ6          8.0     12.2      14.0    12.8
θ7          1.9     6.7       5.7     3.6
θ8          0.09    0.35      0.25    0.28
θ9          0.06    0.09      0.07    0.003
TABLE 3
Strata and national means

Variable    Stratum means    National mean

46 84 27 47 56 246 46 82 9 43 7 19 41 6 9 27 30 48 46 341 41 49 62 65
80 53.7 37 18 10 28 20 26 36 20 25 15 11 248 13 73 4.5 45
42 45.5 7.3 21.1 38
Stratum 1: Chiapas, Guanajuato, Guerrero, Hidalgo, Michoacan, Oaxaca, Puebla, Queretaro, San Luis Potosi, Tabasco, Zacatecas
Stratum 2: Baja California Sur, Campeche, Durango, Nayarit, Quintana Roo, Sinaloa, Veracruz
Stratum 3: Edo de Mexico, Morelos, Tlaxcala, Yucatan
Stratum 4: Aguascalientes, Coahuila, Colima, Chihuahua, Jalisco, Sonora, Tamaulipas
Stratum 5: Baja California Norte, Nuevo León
SPCT: Stratification using the first principal component of the original data measured from their means and divided by their totals (explains 50 per cent of total variance).
Index stratification procedure:
SI: Stratification using a welfare index defined as the sum of the ranks but with each rank multiplied by (+1) or (-1) according to whether the attribute is desirable or not.
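The welfare index used by SI can be sketched as a signed sum of ranks. The block below is a minimal illustration under stated assumptions: ranks run from 1 for the smallest value, tied values receive the average rank (the text does not say how ties were handled), and the function name `welfare_index` and the toy data are hypothetical. How the index is then cut into strata is not specified in this section and is not attempted here.

```python
def welfare_index(rows, signs):
    """Signed rank sum per unit.

    rows:  one list of attribute values per unit (equal lengths).
    signs: +1 for a desirable attribute, -1 for an undesirable one.
    """
    n = len(rows)
    index = [0.0] * n
    for j, sign in enumerate(signs):
        column = [row[j] for row in rows]
        order = sorted(range(n), key=lambda i: column[i])
        ranks = [0.0] * n
        pos = 0
        while pos < n:
            end = pos
            # Extend the block over tied values (average-rank assumption).
            while end + 1 < n and column[order[end + 1]] == column[order[pos]]:
                end += 1
            avg_rank = (pos + end) / 2 + 1
            for t in range(pos, end + 1):
                ranks[order[t]] = avg_rank
            pos = end + 1
        for i in range(n):
            index[i] += sign * ranks[i]
    return index


# Three units, one desirable and one undesirable attribute.
print(welfare_index([[10, 5], [20, 1], [30, 3]], [+1, -1]))
```

Units with a high index rank well on desirable attributes and low on undesirable ones, so sorting by the index gives a one-dimensional ordering that a univariate stratification rule could then divide.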
By applying these procedures, a range of quite different stratifications was obtained. For each stratification the functions F(S) and D(S) (defined in Section 2) were evaluated (for L = 2, 3, 4 and 5). The values obtained are given in Table 4. For L = 2, four stratification procedures give approximately the same value of F(S) (F(S*) = F(SPC) = 19, F(SPCT) = 20 and F(SW) = 21). This is because, in this case, all four stratifications are effectively the same. For L = 3, 4 and 5, however, the difference in the values of F(S) corresponding to these stratification procedures becomes larger. For example, with L = 5, F(S*) = 79 and F(SPCT) = 97. It is also interesting to note that the single linkage clustering algorithm (SL) provides worse values of F(S) than if no stratification was carried out. That is, SL gives larger values than those obtained using a simple random sample (see row marked RS). (A reason why this algorithm was so bad may be that elements are assigned hierarchically to the stratum of the closest element, forming a "tree", without regard to the stratum means.) This result shows that clustering algorithms should not be used indiscriminately in stratification.
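The chaining behaviour of single linkage can be seen on a small one-dimensional example. The sketch below is a naive agglomerative implementation written for illustration. The data are invented, and the comparison uses the weighted within-stratum variance rather than F(S), so it does not reproduce the result in Table 4. The names `single_linkage` and `weighted_within_variance` are hypothetical.

```python
from statistics import variance


def single_linkage(points, L):
    """Naive agglomerative single linkage, merged down to L clusters."""
    clusters = [[i] for i in range(len(points))]

    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    while len(clusters) > L:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


def weighted_within_variance(values, clusters):
    """Sum over strata of W_h times the within-stratum variance."""
    N = len(values)
    total = 0.0
    for c in clusters:
        s2 = variance([values[i] for i in c]) if len(c) > 1 else 0.0
        total += len(c) / N * s2
    return total


values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.5]
sl = single_linkage([(v,) for v in values], L=2)
balanced = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(sl)
print(weighted_within_variance(values, sl),
      weighted_within_variance(values, balanced))
```

Single linkage chains the nine evenly spaced values together and isolates the last one, because only the gaps between neighbours matter and the stratum means play no role. The balanced split has a much smaller weighted within-stratum variance.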
4. CONCLUSION
TABLE 4
Values of stratification criterion functions F(S) and D(S)
(N = 31, n = 12, K = 9)

                    F(S)                              D(S)
S       L = 2   L = 3   L = 4   L = 5     L = 2   L = 3   L = 4   L = 5
S*      19      33      53      79        14      77      283     648
SW      21      34      56      81        19      81      297     673
SPC     19      34      60      91        16      92      371     934
SPCT    20      38      64      97        18      114     453     1074
SI      23      38      69      105       28      117     498     1202
S9*     25      44      80      117       30      181     719     1544
S4*     24      45      81      131       34      197     773     2194
S1*     27      54      93      150       45      300     1048    2965
RS      28      54      96      153       44      351     928     2500
SL      29      59      107     166       52      357     1237    3133
Stockholm: Almqvist and Wiksell.
GHOSH, S. P. (1963). Optimum stratification with two characters. Ann. Math. Statist., 34, 866-872.
GOLDER, P. A. and YEOMANS, K. A. (1973). The use of cluster analysis for stratification. Appl. Statist., 22, 213-219.
GOWER, J. C. and ROSS, G. J. S. (1969). Minimum spanning trees and single linkage cluster analysis. Appl. Statist., 18, 54-64.
GREATER LONDON COUNCIL (1971). Research Report No. 9. Classification of the London Boroughs.
GREEN, P. E., FRANK, R. E. and ROBINSON, P. J. (1967). Cluster analysis in test market selection. Manag. Sci., 13(8), 387-400.
HAGOOD, M. J. and BERNERT, E. H. (1945). Component indexes as a basis for stratification in sampling. J. Amer. Statist. Ass., 40, 330-341.
IEPES (1976). La Campaña Presidencial en Cifras. Jose Lopez Portillo.
KOKAN, A. R. (1963). Optimum allocation in multivariate surveys. J. R. Statist. Soc. A, 126, 557-565.
MACQUEEN, J. (1967). Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. and Prob. 1, 281-297. University of California Press.
SADASIVAN, G. and AGGARWAL, R. (1978). Optimum points of stratification in bi-variate populations. Sankhya, 40, C, 84-97.
WARD, J. H. (1963). Hierarchical grouping to optimise an objective function. J. Amer. Statist. Ass., 58, 236-244.