8.1 Introduction
A cluster sample is a probability sample in which each sampling unit is a collection, or
cluster, of elements.
Cluster sampling is less costly than simple or stratified random sampling if the cost of
obtaining a frame that lists all population elements is very high or if the cost of obtaining
observations increases as the distance separating the elements increases.
Cluster sampling is therefore an effective design under the following conditions:
1. A good frame listing population elements either is not available or is very costly to
obtain, while a frame listing clusters is easily obtained.
2. The cost of obtaining observations increases as the distance separating the elements
increases.
Example 8.1:
A sociologist wants to estimate the average per capita income in a certain small city. No
list of resident adults is available. How should he design the sample survey?
Solution:
Cluster sampling seems to be the logical choice for the survey design because no list of
elements is available. The city is marked off into rectangular blocks, except for two industrial
areas and three parks that contain only a few houses. The sociologist decides that each of the
city blocks will be considered one cluster, the two industrial areas will be considered one
cluster, and, finally, the three parks will be considered one cluster. The clusters are numbered
on a city map, with the numbers from 1 to 415. The experimenter has enough time and money to
sample n = 25 clusters and to interview every household within each cluster. Hence 25 random
numbers between 1 and 415 are selected from Table 2 of the Appendix, and the clusters having
these numbers are marked on the map. Interviewers are then assigned to each of the sampled
clusters.
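The selection step in this example can also be done in software instead of with a random number table. The following is a minimal sketch, assuming the clusters are labeled 1 to 415 as on the map; the function name `select_clusters` and the seed are illustrative choices, not part of the original example.

```python
import random


def select_clusters(num_clusters, sample_size, seed=None):
    """Draw a simple random sample of cluster labels 1..num_clusters,
    without replacement, and return them in ascending order."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(range(1, num_clusters + 1), sample_size))


# 25 of the 415 numbered clusters, as in the example (seed chosen arbitrarily)
sampled = select_clusters(415, 25, seed=1)
print(sampled)
```

Sampling without replacement matters here: a block drawn twice would be interviewed only once, so the effective sample would be smaller than n = 25.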
8.3 Estimation of a Population Mean and Total
Cluster sampling is simple random sampling with each sampling unit containing a number of
elements. Hence the estimators of the population mean 𝜇 and total 𝜏 are similar to those for
simple random sampling. In particular, the sample mean 𝑦̅ is a good estimator of the population
mean 𝜇. An estimator of 𝜇 and two estimators of 𝜏 are discussed in this section.
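Let N be the number of clusters in the population, n the number of clusters selected in a simple random sample, \(m_i\) the number of elements in cluster i, and \(y_i\) the total of all observations in cluster i. Then
\[ M = \sum_{i=1}^{N} m_i = \text{the number of elements in the population} \]
The estimator of the population mean 𝜇 is the sample mean \(\bar{y}\), which is given by
\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \qquad (8.1) \]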
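Thus \(\bar{y}\) takes the form of a ratio estimator, as developed in Chapter 6 (regression), with
\(m_i\) taking the place of \(x_i\). Then the estimated variance of \(\bar{y}\) has the form of the
variance of a ratio estimator.
Estimated variance of \(\bar{y}\):
\[ \hat{V}(\bar{y}) = \left(\frac{N-n}{Nn\bar{M}^2}\right)\frac{\sum_{i=1}^{n}(y_i - \bar{y}m_i)^2}{n-1} \qquad (8.2) \]
Bound on the error of estimation:
\[ 2\sqrt{\hat{V}(\bar{y})} = 2\sqrt{\left(\frac{N-n}{Nn\bar{M}^2}\right)\frac{\sum_{i=1}^{n}(y_i - \bar{y}m_i)^2}{n-1}} \]
Here \(\bar{M} = M/N\) is the average cluster size for the population; it can be estimated by \(\bar{m}\), the average size of the sampled clusters, if M is unknown.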
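The estimator (8.1), its estimated variance (8.2), and the bound can be sketched in a few lines of code. This is a minimal illustration, not the text's own procedure: the function name `cluster_mean_estimate` and the four-cluster data set are hypothetical, and the average cluster size is replaced by the sample average when M is not supplied.

```python
from math import sqrt


def cluster_mean_estimate(y, m, N, M=None):
    """Ratio-type estimate of the population mean from a cluster sample.

    y: cluster totals, m: cluster sizes (same order), N: number of
    clusters in the population, M: number of population elements if known.
    Returns (ybar, estimated variance, bound = 2 * sqrt(variance)).
    """
    n = len(y)
    ybar = sum(y) / sum(m)                          # Eq. (8.1)
    # Average cluster size: M/N if M is known, otherwise the sample average
    Mbar = M / N if M is not None else sum(m) / n
    ss = sum((yi - ybar * mi) ** 2 for yi, mi in zip(y, m))
    var = (N - n) / (N * n * Mbar ** 2) * ss / (n - 1)   # Eq. (8.2)
    return ybar, var, 2 * sqrt(var)


# Hypothetical data: 4 sampled clusters out of N = 20
y = [10, 18, 14, 30]   # cluster totals
m = [2, 4, 3, 5]       # cluster sizes
ybar, var, bound = cluster_mean_estimate(y, m, N=20)
print(ybar, var, bound)
```

When every cluster total is exactly proportional to its size, all residuals \(y_i - \bar{y}m_i\) vanish and the estimated variance is zero, which is one way to see why the estimator works best when the points (m, y) fall near a line through the origin.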
Example 8.2:
Interviews are conducted in each of the 25 blocks sampled in Example 8.1. The
data on incomes are presented in Table 8.1. Use the data to estimate the per-
capita income in the city and place a bound on the error of estimation.
The estimated variance in Eq. (8.2) is biased and is a good estimator of 𝑉(𝑦̅) only
if n is large, say n ≥ 20. The bias disappears if the cluster sizes \(m_1, m_2, \ldots, m_N\)
are equal. As in all cases of ratio estimation, the estimator and its standard
error can be calculated by fitting a weighted regression line forced through
the origin, with weights equal to the reciprocal of the m values. Example 8.2
illustrates this estimation procedure.
The best estimate of the population mean 𝜇 is given by Eq. (8.1) and calculated
as follows:
Because M is not known, the 𝑀̅ appearing in Eq. (8.2) must be estimated by 𝑚̅,
where 𝑚̅ is the average size of the n sampled clusters.
The best estimate of the average per-capita income is $8801, and the error of
estimation should be less than $1617 with probability close to .95. This bound
on the error of estimation is rather large; it could be reduced by sampling more
clusters.
Recall that the ratio estimator is nearly unbiased when the plot of y versus m
shows points falling close to a straight line through the origin. A plot of the data
from
Example 8.3 Use the data in Table 8.1 to estimate the total income of all
residents of the city and place a bound on the error of estimation. There are 2500
residents of the city.
\[ M\bar{y} = 2500(8801) = \$22{,}002{,}500 \]
The quantity \(\hat{V}(\bar{y})\) is calculated by the method used in Example 8.2, except that
\[ M\bar{y} \pm 2\sqrt{M^2\hat{V}(\bar{y})} = 22{,}002{,}500 \pm 2\sqrt{(2500)^2(653{,}785)} \]
\[ = 22{,}002{,}500 \pm 4{,}042{,}848 \]
Again, this bound on the error of estimation is large, and it could be reduced
by increasing the sample size.
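The arithmetic of this example can be checked directly. The sketch below assumes the printed figures are read as M = 2500, an estimated mean of 8801, and an estimated variance of 653,785 for the mean; the variable names are illustrative.

```python
from math import sqrt

M = 2500             # number of residents (population elements)
ybar = 8801          # estimated per-capita income from Example 8.2
var_ybar = 653_785   # estimated variance of ybar, as printed in the example

total = M * ybar                            # estimate of the population total
bound_total = 2 * sqrt(M ** 2 * var_ybar)   # bound on the error for the total
bound_mean = 2 * sqrt(var_ybar)             # bound on the error for the mean

print(total, bound_total, bound_mean)
```

The bound for the total is simply M times the bound for the mean, so the two examples agree to within rounding: about 1617 for the mean and about 4,042,848 for the total.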
Often the number of elements in the population is not known in problems
for which cluster sampling is appropriate. Thus, we cannot use the estimator \(M\bar{y}\),
but we can form another estimator of the population total that does not
depend on M. The quantity \(\bar{y}_t\), given by
\[ \bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i \]
is the average of the cluster totals for the n sampled clusters. Hence, \(\bar{y}_t\) is an
unbiased estimator of the average of the N cluster totals in the population. By
the same reasoning as employed in Chapter 4, \(N\bar{y}_t\) is an unbiased estimator of
the sum of the cluster totals or, equivalently, of the population total 𝜏.
For example, it is highly unlikely that the number of adult males in a city would
be known, and hence the estimator \(N\bar{y}_t\), rather than \(M\bar{y}\), would have to be used
to estimate 𝜏.
If there is a large amount of variation among the cluster sizes and if cluster sizes
are highly correlated with cluster totals, the variance of \(N\bar{y}_t\) in Eq. (8.8) is
generally larger than the variance of \(M\bar{y}\) in Eq. (8.5). The estimator \(N\bar{y}_t\) does
not use the information provided by the cluster sizes \(m_1, m_2, \ldots, m_n\) and hence
may be less precise.
8.4 EQUAL CLUSTER SIZE; COMPARISON TO SIMPLE RANDOM SAMPLING
For a more precise study of the relationship between cluster sampling and simple random
sampling, we confine our discussion to the case in which all of the 𝑚𝑖 ’s are equal to a common value, say
m. We assume this to be true for the entire population of clusters, as in the case of sampling cartons of
canned foods where each carton contains exactly 24 cans. In this case, 𝑀 = 𝑁𝑚 and the total sample size
is 𝑛𝑚 elements (n clusters of m elements each).
The estimators of 𝜇 and 𝜏 possess special properties when all cluster sizes are equal (that is,
\(m_1 = m_2 = \cdots = m_N\)). First, the estimator \(\bar{y}\), given by Equation (8.1), is an unbiased estimator of the
population mean 𝜇. Second, \(\hat{V}(\bar{y})\), given by Equation (8.2), is an unbiased estimator of the variance of \(\bar{y}\).
Finally, the two estimators, \(M\bar{y}\) and \(N\bar{y}_t\), of the population total 𝜏 are equivalent.
The estimator (8.1) of the population mean per element will be denoted in this equal cluster size
case by 𝑦̿𝑐, and it becomes
\[ \bar{\bar{y}}_c = \frac{1}{m}\left[\frac{1}{n}\sum_{i=1}^{n} y_i\right] = \frac{1}{mn}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{ij} \]
where \(y_{ij}\) denotes the jth sample observation from cluster i. Note that \(\bar{\bar{y}}_c\) can be thought of as the overall
average of all nm sample measurements, or as the average of the sampled cluster totals divided by m.
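In the equal cluster size case, the estimated variance (8.2) becomes
\[ \hat{V}(\bar{\bar{y}}_c) = \left(\frac{N-n}{N}\right)\left(\frac{1}{nm^2}\right)\left(\frac{1}{n-1}\right)\sum_{i=1}^{n}(y_i - \bar{y}_t)^2 \]
where
\[ \bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i = m\bar{\bar{y}}_c \]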
If we let the sample average for cluster i be denoted by 𝑦̅𝑖 , we have 𝑦̅𝑖 = 𝑦𝑖 /𝑚, or 𝑦𝑖 = 𝑚𝑦̅𝑖 . We can then
write
\[ \frac{1}{m^2 n(n-1)}\sum_{i=1}^{n}(y_i - \bar{y}_t)^2 = \frac{1}{m^2 n(n-1)}\sum_{i=1}^{n}(m\bar{y}_i - m\bar{\bar{y}}_c)^2 = \frac{1}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}}_c)^2 \]
To simplify the variance computations and to explore the relationship between cluster sampling
and simple random sampling, we use a sum-of-squares identity similar to that developed in classical
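analysis of variance arguments. It can be shown that
\[ \sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{\bar{y}}_c)^2 = \sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{n}\sum_{j=1}^{m}(\bar{y}_i - \bar{\bar{y}}_c)^2 \]
\[ = \sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2 + m\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}}_c)^2 \]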
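The three terms, from the left, are named the total sum of squares (SST), the within-cluster sum of squares
(SSW), and the between-cluster sum of squares (SSB). The above equality is then
\[ \text{SST} = \text{SSW} + \text{SSB} \]
Dividing SSB and SSW by their degrees of freedom gives the corresponding mean squares:
\[ \text{MSB} = \frac{\text{SSB}}{n-1} = \frac{m}{n-1}\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}}_c)^2 \]
\[ \text{MSW} = \frac{\text{SSW}}{n(m-1)} = \frac{1}{n(m-1)}\sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2 \]
The estimated variance of \(\bar{\bar{y}}_c\) can then be written as
\[ \hat{V}(\bar{\bar{y}}_c) = \left(\frac{N-n}{N}\right)\frac{1}{nm}\,\text{MSB} \]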
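The sum-of-squares identity and the mean squares can be computed from raw element-level data. The sketch below is one way to do it under the equal-cluster-size assumption; the function name `anova_cluster` and the three small clusters are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def anova_cluster(data, N):
    """Sums of squares and mean squares for n sampled clusters of equal size m.

    data: list of clusters, each a list of m element measurements.
    N: number of clusters in the population.
    """
    n = len(data)
    m = len(data[0])
    if any(len(cluster) != m for cluster in data):
        raise ValueError("all clusters must have the same size m")

    cluster_means = [sum(cluster) / m for cluster in data]
    grand_mean = sum(sum(cluster) for cluster in data) / (n * m)

    sst = sum((v - grand_mean) ** 2 for cluster in data for v in cluster)
    ssw = sum((v - cm) ** 2
              for cluster, cm in zip(data, cluster_means) for v in cluster)
    ssb = m * sum((cm - grand_mean) ** 2 for cm in cluster_means)

    msb = ssb / (n - 1)
    msw = ssw / (n * (m - 1))
    var = (N - n) / N * msb / (n * m)   # estimated variance of the mean
    return {"mean": grand_mean, "SST": sst, "SSW": ssw, "SSB": ssb,
            "MSB": msb, "MSW": msw, "var": var}


# Hypothetical sample: n = 3 clusters of m = 3 elements, N = 10 clusters
result = anova_cluster([[1, 2, 3], [2, 4, 6], [0, 1, 2]], N=10)
print(result)
```

For these data SSW = 12 and SSB = 14, which add to SST = 26, so the identity can be confirmed by hand before trusting the function on larger samples.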
Example 8.5 The circulation manager of a newspaper wishes to estimate the average number of
newspapers purchased per household in a given community. Travel costs from household
to household are substantial. Therefore the 4000 households in the community are listed
in 400 geographical clusters of 10 households each, and a simple random sample of 4
clusters is selected. Interviews are conducted, with the results as shown in the
accompanying table. Estimate the average number of newspapers per household for the
community, and place a bound on the error of estimation.
\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \]
\[ \bar{\bar{y}}_c = \frac{\sum_{i=1}^{n} y_i}{nm} = \frac{19 + 20 + 16 + 20}{4(10)} = 1.875 \]
Analysis of Variance
Source    DF      SS      MS
Factor     3     1.07    0.36
Error     36    43.30    1.20
Total     39    44.38
In this output, Factor denotes the between-cluster calculations and Error denotes the
within-cluster calculations. Thus, MSB = .36 and MSW = 1.20. It follows that
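\[ \hat{V}(\bar{\bar{y}}_c) = \left(\frac{N-n}{N}\right)\left(\frac{1}{nm}\right)\text{MSB} = \left(\frac{396}{400}\right)\frac{1}{4(10)}(.36) = .0089 \]
and the estimate of 𝜇 with a bound on the error of estimation is
\[ \bar{\bar{y}}_c \pm 2\sqrt{\hat{V}(\bar{\bar{y}}_c)} = 1.88 \pm 2\sqrt{.0089} = 1.88 \pm .19 \]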
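Because the cluster sizes are equal, the between-cluster mean square in this example can be recovered from the four cluster totals alone, which gives a quick check on the printed results. The sketch below uses only the totals 19, 20, 16, and 20 and the design figures N = 400, n = 4, m = 10; the variable names are illustrative.

```python
from math import sqrt

N, n, m = 400, 4, 10          # clusters in population, clusters sampled, cluster size
totals = [19, 20, 16, 20]     # newspapers purchased in each sampled cluster

ybar_c = sum(totals) / (n * m)               # overall mean per household
cluster_means = [t / m for t in totals]

# Between-cluster sum of squares and mean square
ssb = m * sum((cm - ybar_c) ** 2 for cm in cluster_means)
msb = ssb / (n - 1)

var = (N - n) / N * msb / (n * m)            # estimated variance of ybar_c
bound = 2 * sqrt(var)

print(ybar_c, ssb, msb, var, bound)
```

The computed SSB and MSB round to the 1.07 and 0.36 shown in the analysis of variance output, and the variance and bound round to .0089 and .19.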
How can we compare the precision of cluster sampling with that of simple random sampling? If
we had taken the nm observations in a simple random sample and then computed the mean \(\bar{y}\) and variance \(s^2\),
then we would have
\[ \hat{V}(\bar{y}) = \left(\frac{Nm - nm}{Nm}\right)\cdot\frac{s^2}{nm} = \left(\frac{N-n}{N}\right)\frac{s^2}{nm} \]
since there would be Nm total observations in the population. Thus we can measure the relative
efficiency of \(\bar{\bar{y}}_c\) to \(\bar{y}\) by comparing MSB to \(s^2\). But we did not take a simple random sample; we took a
cluster sample, so \(s^2\) is not available. Fortunately, it turns out that we can approximate \(s^2\) (the variance
we would have obtained in a simple random sample) from quantities available in the cluster sample
results. This approximation is
\[ \hat{s}^2 = \frac{1}{m}\left[(m-1)\text{MSW} + \text{MSB}\right] \]
when N is large.
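For the data of Example 8.5,
\[ \hat{s}^2 = \frac{1}{10}\left[(9)(1.20) + .36\right] = 1.12 \]
and the estimated relative efficiency is
\[ \widehat{RE}(\bar{\bar{y}}_c / \bar{y}) = \frac{\hat{s}^2}{\text{MSB}} = \frac{1.12}{.36} = 3.11 \]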
In this case, cluster sampling is more efficient, because there is so little variation between clusters (each
cluster seems to be fairly representative of the entire population). This is somewhat unusual, since, in
most cases of naturally occurring clusters, cluster sampling will be less efficient than simple random
sampling.
Analysis of Variance
Source    DF       SS       MS
Factor     3     40.54    13.51
Error     76    610.85     8.04
Total     79    651.39
From this,
\[ \hat{s}^2 = \frac{1}{20}\left[19(8.04) + 13.51\right] = 8.31 \]
and
\[ \widehat{RE}(\bar{\bar{y}}_c / \bar{y}) = \frac{8.31}{13.51} = .62 \]
This is a little lower than expected, but not far from 1.0.
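Both relative-efficiency calculations follow the same two steps, so they can be wrapped in small helper functions. The sketch below uses the large-N approximation given in this section; the function names are illustrative, and the inputs are the mean squares printed in the two analysis of variance tables.

```python
def srs_variance_estimate(msb, msw, m):
    """Approximate the simple-random-sample variance s^2 from cluster-sample
    mean squares (large-N approximation used in this section)."""
    return ((m - 1) * msw + msb) / m


def relative_efficiency(msb, msw, m):
    """Estimated efficiency of cluster sampling relative to simple random
    sampling: values above 1 favor cluster sampling."""
    return srs_variance_estimate(msb, msw, m) / msb


# First table: MSB = .36, MSW = 1.20, clusters of m = 10
s2_first = srs_variance_estimate(0.36, 1.20, 10)
re_first = relative_efficiency(0.36, 1.20, 10)

# Second table: MSB = 13.51, MSW = 8.04, clusters of m = 20
s2_second = srs_variance_estimate(13.51, 8.04, 20)
re_second = relative_efficiency(13.51, 8.04, 20)

print(s2_first, re_first, s2_second, re_second)
```

Carrying full precision gives a ratio of about 3.10 for the first table; the 3.11 in the text comes from rounding the variance estimate to 1.12 before dividing. The second ratio rounds to .62 either way.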
We will continue this discussion of comparisons between cluster sampling and simple random sampling in
Chapter 9.
8.5 SELECTING THE SAMPLE SIZE FOR ESTIMATING POPULATION MEANS AND
TOTALS
The quantity of information in a cluster sample is affected by two factors, the number of clusters
and the relative cluster size. We have not encountered the latter factor in any of the sampling
procedures discussed previously. In the problem of estimating the number of homes with
inadequate fire insurance in a state, the clusters could be counties, voting districts, school
districts, communities, or any other convenient grouping of homes. As we have already seen, the
size of the bound on the error of estimation depends crucially upon the variation among the
cluster totals. Thus in attempting to achieve small bounds on the error of estimation, one must
select clusters with as little variation as possible among these totals. We will now assume that the
cluster size (sampling unit) has been chosen and will consider only the problem of choosing the
number of clusters, n.
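From Equation (8.2), the estimated variance of \(\bar{y}\) can be written as
\[ \hat{V}(\bar{y}) = \left(\frac{N-n}{Nn\bar{M}^2}\right)s_c^2 \]
where
\[ s_c^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}m_i)^2}{n-1} \]
The actual variance of \(\bar{y}\) is then approximately
\[ V(\bar{y}) = \left(\frac{N-n}{Nn\bar{M}^2}\right)\sigma_c^2 \]
where \(\sigma_c^2\) is the population quantity estimated by \(s_c^2\). The number of clusters n is found by setting
\[ 2\sqrt{V(\bar{y})} = B \]
and solving for n.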
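Approximate sample size required to estimate 𝜇 with a bound B on the error of estimation:
\[ n = \frac{N\sigma_c^2}{ND + \sigma_c^2} \qquad (8.13) \]
where \(\sigma_c^2\) is estimated by \(s_c^2\) and
\[ D = \frac{B^2\bar{M}^2}{4} \]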
Example 8.6 Suppose the data in Table 8.1 represent a preliminary sample of incomes in the city. How
large a sample should be taken in a future survey in order to estimate the average per
capita income 𝜇 with a bound of $500 on the error of estimation?
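Solution
To use Equation (8.13), we must estimate \(\sigma_c^2\); the best estimate available is \(s_c^2\), which can
be calculated by using the data in Table 8.1. Using the calculations in Example 8.2, we
have
\[ s_c^2 = 634{,}479{,}260 \]
Estimating \(\bar{M}\) by \(\bar{m} = 6.04\), we have
\[ D = \frac{B^2\bar{m}^2}{4} = \frac{(500)^2(6.04)^2}{4} = (62{,}500)(6.04)^2 \]
and
\[ n = \frac{N\sigma_c^2}{ND + \sigma_c^2} = \frac{415(634{,}479{,}260)}{415(6.04)^2(62{,}500) + 634{,}479{,}260} = 166.58 \]
Thus 167 clusters should be sampled.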
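The sample-size calculation in this example is easy to reproduce. The sketch below plugs the example's figures (N = 415, the preliminary variance estimate 634,479,260, B = 500, and an average cluster size of 6.04) into the sample-size formula; the function name `clusters_needed` is illustrative.

```python
from math import ceil


def clusters_needed(N, sigma2, D):
    """Approximate number of clusters n = N*sigma2 / (N*D + sigma2)."""
    return N * sigma2 / (N * D + sigma2)


N = 415                # clusters in the population
s_c2 = 634_479_260     # preliminary estimate of sigma_c^2
B = 500                # desired bound on the error of estimation
mbar = 6.04            # average size of the sampled clusters

D = B ** 2 * mbar ** 2 / 4      # D for estimating the mean
n_exact = clusters_needed(N, s_c2, D)
n_clusters = ceil(n_exact)      # round up so the bound is not exceeded

print(n_exact, n_clusters)
```

Rounding up rather than to the nearest integer is the usual convention, since a smaller n would give a bound slightly larger than the one requested.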
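Approximate sample size required to estimate 𝜏, using \(M\bar{y}\), with a bound B on the error of estimation:
\[ n = \frac{N\sigma_c^2}{ND + \sigma_c^2} \]
where \(\sigma_c^2\) is estimated by \(s_c^2\) and
\[ D = \frac{B^2}{4N^2} \]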
Example 8.7 Again using the data in Table 8.1 as a preliminary sample of incomes in the city, how
large a sample is necessary to estimate the total income of all residents, 𝜏, with a bound
of $1,000,000 on the error of estimation? There are 2500 residents of the city (M = 2500).
Solution
We use
\[ s_c^2 = 634{,}479{,}260 \]
to estimate \(\sigma_c^2\), as in Example 8.6. When estimating 𝜏, we use
\[ D = \frac{B^2}{4N^2} = \frac{(1{,}000{,}000)^2}{4(415)^2} \]
\[ ND = \frac{(1{,}000{,}000)^2}{4(415)} = 602{,}409{,}000 \]
\[ n = \frac{N\sigma_c^2}{ND + \sigma_c^2} = \frac{415(634{,}479{,}260)}{602{,}409{,}000 + 634{,}479{,}260} = 212.88 \]
Thus 213 clusters should be sampled to estimate the total income with a bound of
$1,000,000 on the error of estimation.
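The same calculation for the total differs only in the definition of D. A minimal sketch using the figures of this example (N = 415, preliminary variance estimate 634,479,260, B = 1,000,000) follows; the function name `clusters_for_total` is illustrative.

```python
from math import ceil


def clusters_for_total(N, sigma2, B):
    """Approximate number of clusters needed to estimate the population
    total with bound B, using D = B^2 / (4 N^2)."""
    D = B ** 2 / (4 * N ** 2)
    return N * sigma2 / (N * D + sigma2)


N = 415
s_c2 = 634_479_260     # preliminary estimate of sigma_c^2
B = 1_000_000          # desired bound on the error of estimation

n_exact = clusters_for_total(N, s_c2, B)
n_clusters = ceil(n_exact)

print(n_exact, n_clusters)
```

The code carries ND at full precision rather than the rounded 602,409,000 used in the text; the result still rounds to 212.88, so 213 clusters are required.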
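When M is unknown, the total is estimated by \(N\bar{y}_t\), whose estimated variance is
\[ \hat{V}(N\bar{y}_t) = N^2\left(\frac{N-n}{Nn}\right)s_t^2 \]
where
\[ s_t^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}_t)^2}{n-1} \]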
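The estimator of the total that does not depend on M, together with its estimated variance, can be sketched as follows. The function name `total_estimate_unknown_M` and the four cluster totals are hypothetical and serve only to illustrate the formulas.

```python
from math import sqrt


def total_estimate_unknown_M(y, N):
    """Estimate the population total as N * (average cluster total).

    y: totals for the n sampled clusters, N: clusters in the population.
    Returns (estimate, estimated variance, bound = 2 * sqrt(variance)).
    """
    n = len(y)
    ybar_t = sum(y) / n                                    # average cluster total
    s_t2 = sum((yi - ybar_t) ** 2 for yi in y) / (n - 1)   # variance of totals
    total = N * ybar_t
    var = N ** 2 * (N - n) / (N * n) * s_t2
    return total, var, 2 * sqrt(var)


# Hypothetical data: 4 sampled cluster totals out of N = 20 clusters
total, var, bound = total_estimate_unknown_M([10, 18, 14, 30], N=20)
print(total, var, bound)
```

Only the cluster totals enter the calculation, so any variation in cluster size shows up directly as variation among the totals and inflates the bound.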
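The corresponding population variance is
\[ V(N\bar{y}_t) = N^2 V(\bar{y}_t) = N^2\left(\frac{N-n}{Nn}\right)\sigma_t^2 \]
Setting
\[ 2\sqrt{V(N\bar{y}_t)} = B \]
and solving for n gives the approximate sample size required to estimate 𝜏, using \(N\bar{y}_t\), with a bound B on the error of estimation:
\[ n = \frac{N\sigma_t^2}{ND + \sigma_t^2} \]
where \(\sigma_t^2\) is estimated by \(s_t^2\) and \(D = B^2/(4N^2)\).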
Example 8.8 Assume the data of Table 8.1 are from a preliminary study of incomes in the city and M
is not known. How large a sample must be taken to estimate the total income of all
residents, 𝜏, with a bound of $1,000,000 on the error of estimation?
Solution
The quantity \(\sigma_t^2\) must be estimated by \(s_t^2\), which is calculated from the data of Table 8.1.
Using the calculations of Example 8.4 gives
\[ D = \frac{B^2}{4N^2} = \frac{(1{,}000{,}000)^2}{4(415)^2} \]
Thus a sample of 183 clusters must be taken to have a bound of $1,000,000 on the error
of estimation.