
Cluster Sampling

8.1 Introduction
A cluster sample is a probability sample in which each sampling unit is a collection, or
cluster, of elements.

Cluster sampling is less costly than simple or stratified random sampling if the cost of
obtaining a frame that lists all population elements is very high or if the cost of obtaining
observations increases as the distance separating the elements increases.

To summarize, cluster sampling is an effective design for obtaining a specified amount
of information at minimum cost under the following conditions:

1. A good frame listing population elements either is not available or is very costly to
obtain, while a frame listing clusters is easily obtained.
2. The cost of obtaining observations increases as the distance separating the elements
increases.

8.2 How to draw a cluster sample

Example 8.1

A sociologist wants to estimate the average per capita income in a certain small city. No
list of resident adults is available. How should he design the sample survey?

Solution:

Cluster sampling seems to be the logical choice for the survey design because no list of
elements is available. The city is marked off into rectangular blocks, except for two industrial
areas and three parks that contain only a few houses. The sociologist decides that each of the
city blocks will be considered one cluster, the two industrial areas will be considered one
cluster, and, finally, the three parks will be considered one cluster. The clusters are numbered
on a city map, with the numbers from 1 to 415. The experimenter has enough time and money to
sample n = 25 clusters and to interview every household within each cluster. Hence 25 random
numbers between 1 and 415 are selected from Table 2 of the Appendix, and the clusters having
these numbers are marked on the map. Interviewers are then assigned to each of the sampled
clusters.
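
The selection step in this example can be sketched in code. The snippet below is only an illustration: it swaps the printed random number table for Python's pseudo-random generator, and the function name `draw_cluster_sample` and the seed are hypothetical choices, not part of the original survey design.

```python
import random


def draw_cluster_sample(num_clusters, sample_size, seed=None):
    """Draw cluster labels 1..num_clusters by simple random sampling
    without replacement (a stand-in for reading a random number table)."""
    rng = random.Random(seed)
    labels = range(1, num_clusters + 1)
    return sorted(rng.sample(labels, sample_size))


# 415 numbered clusters on the city map, n = 25 clusters to visit.
selected = draw_cluster_sample(415, 25, seed=1)
print(selected)
```

Every household in each selected cluster would then be interviewed, as described above.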

8.3 Estimation of a population mean and total

Cluster sampling is simple random sampling with each sampling unit containing a number of
elements. Hence the estimators of the population mean \(\mu\) and total \(\tau\) are similar to those for
simple random sampling. In particular, the sample mean \(\bar{y}\) is a good estimator of the population
mean \(\mu\). An estimator of \(\mu\) and two estimators of \(\tau\) are discussed in this section.

The following notation is used in this chapter:

N = the number of clusters in the population

n = the number of clusters selected in a simple random sample

\(m_i\) = the number of elements in cluster i, i = 1, ..., N

\(\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i\) = the average cluster size for the sample

\(M = \sum_{i=1}^{N} m_i\) = the number of elements in the population

\(\bar{M} = M/N\) = the average cluster size for the population

\(y_i\) = the total of all observations in the ith cluster

The estimator of the population mean \(\mu\) is the sample mean \(\bar{y}\), which is given by

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \]

Thus \(\bar{y}\) takes the form of a ratio estimator, as developed in Chapter 6 (regression), with
\(m_i\) taking the place of \(x_i\). The estimated variance of \(\bar{y}\) then has the form of the
variance of a ratio estimator.

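Estimator of the population mean \(\mu\):

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \qquad (8.1) \]

Estimated variance of \(\bar{y}\):

\[ \hat{V}(\bar{y}) = \left(\frac{N-n}{Nn\bar{M}^2}\right)\frac{\sum_{i=1}^{n}(y_i - \bar{y}m_i)^2}{n-1} \qquad (8.2) \]

Bound on the error of estimation:

\[ 2\sqrt{\hat{V}(\bar{y})} = 2\sqrt{\left(\frac{N-n}{Nn\bar{M}^2}\right)\frac{\sum_{i=1}^{n}(y_i - \bar{y}m_i)^2}{n-1}} \]

Here \(\bar{M}\) can be estimated by \(\bar{m}\) if M is unknown.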
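
As a rough sketch of how Eqs. (8.1) and (8.2) translate into a computation, the function below takes the sampled cluster totals and cluster sizes. The name `cluster_mean_estimate` and the three-cluster data set are hypothetical and serve only to show the arithmetic; when the average cluster size is not supplied, the sample average is used in its place, as the text allows.

```python
import math


def cluster_mean_estimate(y, m, N, M_bar=None):
    """Ratio-type estimate of the mean per element from a cluster sample.

    y: cluster totals y_i for the n sampled clusters
    m: cluster sizes m_i for the same clusters
    N: number of clusters in the population
    M_bar: average cluster size in the population; if None, the sample
           average m_bar is substituted (M unknown).
    Returns (y_bar, estimated variance, bound on the error of estimation).
    """
    n = len(y)
    if n != len(m) or n < 2:
        raise ValueError("need matching y and m with at least two clusters")
    y_bar = sum(y) / sum(m)                      # Eq. (8.1)
    if M_bar is None:
        M_bar = sum(m) / n                       # m_bar stands in for M_bar
    ss = sum((yi - y_bar * mi) ** 2 for yi, mi in zip(y, m))
    var_hat = (N - n) / (N * n * M_bar ** 2) * ss / (n - 1)   # Eq. (8.2)
    return y_bar, var_hat, 2 * math.sqrt(var_hat)


# Hypothetical data: three sampled clusters out of N = 10.
y_bar, var_hat, bound = cluster_mean_estimate([12, 18, 30], [1, 2, 3], N=10)
print(y_bar, var_hat, bound)
```

Passing `M_bar=M / N` uses the known population value instead of the sample average.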

Example 8.2

Interviews are conducted in each of the 25 blocks sampled in Example 8.1. The
data on incomes are presented in Table 8.1. Use the data to estimate the per-capita
income in the city and place a bound on the error of estimation.

The estimated variance in Eq. (8.2) is biased and is a good estimator of \(V(\bar{y})\) only
if n is large, say, n ≥ 20. The bias disappears if the cluster sizes \(m_1, m_2, \ldots, m_N\)
are equal. As in all cases of ratio estimation, the estimator and its standard
error can be calculated by fitting a weighted regression line forced through
the origin with weights equal to the reciprocal of the m values. Example 8.2
illustrates this estimation procedure.

Because the estimator of the mean per element is a ratio estimator,
computations proceed exactly as they do for ratio estimators in Chapter 6. A
summary of the basic statistics for these data is presented in the table.

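The best estimate of the population mean \(\mu\) is given by Eq. (8.1) and calculated
as follows:

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} = \$8801 \]

Because M is not known, the \(\bar{M}\) appearing in Eq. (8.2) must be estimated by \(\bar{m}\),
where

\[ \bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i = 6.04 \]

Example 8.1 gives N = 415. Then from Eq. (8.2),

\[ \hat{V}(\bar{y}) = 653{,}785 \]

Thus, the estimate of \(\mu\) with a bound on the error of estimation is given by

\[ \bar{y} \pm 2\sqrt{\hat{V}(\bar{y})} = 8801 \pm 2\sqrt{653{,}785} = 8801 \pm 1617 \]

The best estimate of the average per-capita income is $8801, and the error of
estimation should be less than $1617 with probability close to .95. This bound
on the error of estimation is rather large; it could be reduced by sampling more
clusters.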

Recall that the ratio estimator is nearly unbiased when the plot of y versus m
shows points falling close to a straight line through the origin. A plot of the data
from Table 8.1 is shown in Figure 8.1. Although there is something of a linear trend
here, it does not appear to be strong (\(\hat{\rho} = 0.303\)). Even so, the approximate
relative bias is small, and the ratio estimate of \(\mu\) should be reasonably good.

The population total \(\tau\) is now \(M\mu\) because M denotes the total number of
elements in the population. Consequently, as in simple random sampling, \(M\bar{y}\)
provides an estimator of \(\tau\).
Note that the estimator \(M\bar{y}\) is useful only if the number of elements in the
population, M, is known.

Example 8.3

Use the data in Table 8.1 to estimate the total income of all residents of the city
and place a bound on the error of estimation. There are 2500 residents of the city.

Solution

The sample mean \(\bar{y}\) is calculated to be $8801 in Example 8.2. Thus,
the estimate of \(\tau\) is

\[ M\bar{y} = 2500(8801) = \$22{,}002{,}500 \]

The quantity \(\hat{V}(\bar{y})\) is calculated by the method used in Example 8.2, except that
\(\bar{M}\) can now be used in place of \(\bar{m}\). The estimate of \(\tau\) with a bound on the
error of estimation is

\[ M\bar{y} \pm 2\sqrt{\hat{V}(M\bar{y})} = M\bar{y} \pm 2\sqrt{M^2\hat{V}(\bar{y})} \]

\[ = 22{,}002{,}500 \pm 2\sqrt{(2500)^2(653{,}785)} \]

\[ = 22{,}002{,}500 \pm 4{,}042{,}848 \]

Again, this bound on the error of estimation is large, and it could be reduced
by increasing the sample size.
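
The arithmetic of Example 8.3 can be checked with a few lines of code. This is a minimal sketch that takes the sample mean and its estimated variance as given inputs (the values quoted in the example); the helper name `total_from_mean` is hypothetical.

```python
import math


def total_from_mean(y_bar, var_y_bar, M):
    """Estimate the population total as M * y_bar, with the bound
    2 * sqrt(M^2 * V_hat(y_bar)) on the error of estimation."""
    tau_hat = M * y_bar
    bound = 2 * math.sqrt(M ** 2 * var_y_bar)
    return tau_hat, bound


# Values quoted in Example 8.3: y_bar = 8801, V_hat(y_bar) = 653,785, M = 2500.
tau_hat, bound = total_from_mean(8801, 653_785, 2500)
print(tau_hat, round(bound))
```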
Often the number of elements in the population is not known in problems
for which cluster sampling is appropriate. Thus, we cannot use the estimator \(M\bar{y}\),
but we can form another estimator of the population total that does not
depend on M. The quantity \(\bar{y}_t\), given by

\[ \bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i \]

is the average of the cluster totals for the n sampled clusters. Hence, \(\bar{y}_t\) is an
unbiased estimator of the average of the N cluster totals in the population. By
the same reasoning as employed in Chapter 4, \(N\bar{y}_t\) is an unbiased estimator of
the sum of the cluster totals or, equivalently, of the population total \(\tau\).

For example, it is highly unlikely that the number of adult males in a city would
be known, and hence the estimator \(N\bar{y}_t\), rather than \(M\bar{y}\), would have to be used
to estimate \(\tau\).

If there is a large amount of variation among the cluster sizes and if cluster sizes
are highly correlated with cluster totals, the variance of \(N\bar{y}_t\) in Eq. (8.8) is
generally larger than the variance of \(M\bar{y}\) in Eq. (8.5). The estimator \(N\bar{y}_t\) does
not use the information provided by the cluster sizes \(m_1, m_2, \ldots, m_n\) and hence
may be less precise.
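
A minimal sketch of the second estimator of the total, which needs only the cluster totals and N, is given below. The variance form used is the one this chapter quotes for this estimator in Section 8.5, N^2 ((N - n)/(Nn)) s_t^2; the function name and the three-cluster data set are hypothetical.

```python
import math


def total_from_cluster_totals(y, N):
    """Estimate the population total as N * y_bar_t when M is unknown.

    y: cluster totals for the n sampled clusters
    N: number of clusters in the population
    Returns (estimate, estimated variance, bound on the error of estimation).
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two sampled clusters")
    y_bar_t = sum(y) / n                               # average cluster total
    s_t2 = sum((yi - y_bar_t) ** 2 for yi in y) / (n - 1)
    var_hat = N ** 2 * (N - n) / (N * n) * s_t2
    return N * y_bar_t, var_hat, 2 * math.sqrt(var_hat)


# Hypothetical data: three sampled clusters out of N = 10.
tau_hat, var_hat, bound = total_from_cluster_totals([12, 18, 30], N=10)
print(tau_hat, var_hat, bound)
```

Because the cluster sizes never enter the calculation, any spread in the totals that is driven by cluster size goes straight into the variance, which is the loss of precision described above.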
8.4 EQUAL CLUSTER SIZE; COMPARISON TO SIMPLE RANDOM SAMPLING

For a more precise study of the relationship between cluster sampling and simple random
sampling, we confine our discussion to the case in which all of the \(m_i\)'s are equal to a common value, say
m. We assume this to be true for the entire population of clusters, as in the case of sampling cartons of
canned foods where each carton contains exactly 24 cans. In this case, M = Nm and the total sample size
is nm elements (n clusters of m elements each).

The estimators of \(\mu\) and \(\tau\) possess special properties when all cluster sizes are equal (that is,
\(m_1 = m_2 = \cdots = m_N\)). First, the estimator \(\bar{y}\), given by Equation (8.1), is an unbiased estimator of the
population mean \(\mu\). Second, \(\hat{V}(\bar{y})\), given by Equation (8.2), is an unbiased estimator of the variance of \(\bar{y}\).
Finally, the two estimators, \(M\bar{y}\) and \(N\bar{y}_t\), of the population total \(\tau\) are equivalent.

The estimator (8.1) of the population mean per element will be denoted in this equal cluster size
case by \(\bar{\bar{y}}_c\), and it becomes

\[ \bar{\bar{y}}_c = \frac{1}{m}\left[\frac{1}{n}\sum_{i=1}^{n} y_i\right] = \frac{1}{mn}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{ij} \]

where \(y_{ij}\) denotes the jth sample observation from cluster i. Note that \(\bar{\bar{y}}_c\) can be thought of as the overall
average of all nm sample measurements, or as the average of the sampled cluster totals divided by m.

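In view of Equation (8.2), it is easy to see that

\[ \hat{V}(\bar{\bar{y}}_c) = \left(\frac{N-n}{N}\right)\left(\frac{1}{nm^2}\right)\left(\frac{1}{n-1}\right)\sum_{i=1}^{n}(y_i - \bar{y}_t)^2 \]

where

\[ \bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i = m\bar{\bar{y}}_c \]

If we let the sample average for cluster i be denoted by \(\bar{y}_i\), we have \(\bar{y}_i = y_i/m\), or \(y_i = m\bar{y}_i\). We can then
write

\[ \frac{1}{m^2 n(n-1)}\sum_{i=1}^{n}(y_i - \bar{y}_t)^2 = \frac{1}{m^2 n(n-1)}\sum_{i=1}^{n}(m\bar{y}_i - m\bar{\bar{y}}_c)^2 = \frac{1}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}}_c)^2 \]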

To simplify the variance computations and to explore the relationship between cluster sampling
and simple random sampling, we use a sum-of-squares identity similar to that developed in classical
analysis of variance arguments. It can be shown that

\[ \sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{\bar{y}}_c)^2 = \sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{n}\sum_{j=1}^{m}(\bar{y}_i - \bar{\bar{y}}_c)^2 \]

\[ = \sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2 + m\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}}_c)^2 \]

The three terms, from the left, are named the total sum of squares (SST), the within-cluster sum of squares
(SSW), and the between-cluster sum of squares (SSB). The above equality is then

SST = SSW + SSB


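With appropriate divisors, these sums of squares become the usual mean squares of analysis of variance.
Thus, the between-cluster mean square, MSB, is given by

\[ \text{MSB} = \frac{\text{SSB}}{n-1} = \frac{m}{n-1}\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}}_c)^2 \]

and the within-cluster mean square, MSW, is given by

\[ \text{MSW} = \frac{\text{SSW}}{n(m-1)} = \frac{1}{n(m-1)}\sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2 \]

Because \(\frac{1}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}}_c)^2 = \text{MSB}/(nm)\), it now follows that

\[ \hat{V}(\bar{\bar{y}}_c) = \left(\frac{N-n}{N}\right)\frac{1}{nm}\,\text{MSB} \]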


Example 8.5 The circulation manager of a newspaper wishes to estimate the average number of
newspapers purchased per household in a given community. Travel costs from household
to household are substantial. Therefore the 4000 households in the community are listed
in 400 geographical clusters of 10 households each, and a simple random sample of 4
clusters is selected. Interviews are conducted, with the results as shown in the
accompanying table. Estimate the average number of newspapers per household for the
community, and place a bound on the error of estimation.

Cluster    Number of newspapers             Total
1          1  2  1  3  3  2  1  4  1  1     19
2          1  3  2  2  3  1  4  1  1  2     20
3          2  1  1  1  1  3  2  1  3  1     16
4          1  1  3  2  1  5  1  2  3  1     20

Solution

From Equation (8.1),

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \]

When \(m_1 = m_2 = \cdots = m_n = m\), the equation becomes

\[ \bar{\bar{y}}_c = \frac{\sum_{i=1}^{n} y_i}{nm} = \frac{19 + 20 + 16 + 20}{4(10)} = 1.875 \]

Standard analysis of variance computations were performed (using Minitab) on
the data with the following results:

Analysis of Variance
Source    DF    SS       MS
Factor     3     1.07    0.36
Error     36    43.30    1.20
Total     39    44.38

In this output, Factor denotes the between-cluster calculations and Error denotes
the within-cluster calculations. Thus, MSB = .36 and MSW = 1.20. It follows that

\[ \hat{V}(\bar{\bar{y}}_c) = \left(\frac{N-n}{N}\right)\left(\frac{1}{nm}\right)\text{MSB} = \left(\frac{396}{400}\right)\frac{1}{4(10)}(.36) = .0089 \]

and

\[ 2\sqrt{\hat{V}(\bar{\bar{y}}_c)} = .19 \]

Therefore, our best estimate of the number of newspapers per household is

1.88 ± .19
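
The analysis of variance quantities in Example 8.5 can be recomputed directly from the table of household counts. The sketch below is one way to do so in plain Python rather than Minitab; the function name `equal_size_cluster_summary` and the dictionary keys are hypothetical.

```python
import math

# Newspapers per household, one row per sampled cluster (Example 8.5).
clusters = [
    [1, 2, 1, 3, 3, 2, 1, 4, 1, 1],
    [1, 3, 2, 2, 3, 1, 4, 1, 1, 2],
    [2, 1, 1, 1, 1, 3, 2, 1, 3, 1],
    [1, 1, 3, 2, 1, 5, 1, 2, 3, 1],
]


def equal_size_cluster_summary(clusters, N):
    """Mean per element, ANOVA sums of squares, and variance estimate
    for a cluster sample in which every cluster has the same size m."""
    n, m = len(clusters), len(clusters[0])
    if any(len(c) != m for c in clusters):
        raise ValueError("all clusters must have the same size")
    grand_mean = sum(sum(c) for c in clusters) / (n * m)
    means = [sum(c) / m for c in clusters]
    ssb = m * sum((cm - grand_mean) ** 2 for cm in means)
    ssw = sum((v - cm) ** 2 for c, cm in zip(clusters, means) for v in c)
    msb = ssb / (n - 1)
    msw = ssw / (n * (m - 1))
    var_hat = (N - n) / N * msb / (n * m)
    return {"mean": grand_mean, "SSB": ssb, "SSW": ssw, "MSB": msb,
            "MSW": msw, "var": var_hat, "bound": 2 * math.sqrt(var_hat)}


summary = equal_size_cluster_summary(clusters, N=400)
print(summary)
```

Up to rounding, the sums of squares agree with the Factor and Error lines of the analysis of variance output, and the bound rounds to .19.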
How can we compare the precision of cluster sampling with that of simple random sampling? If
we had taken the nm observations in a simple random sample and then computed the mean \(\bar{y}\) and variance \(s^2\),
we would have

\[ \hat{V}(\bar{y}) = \left(\frac{Nm - nm}{Nm}\right)\cdot\frac{s^2}{nm} = \left(\frac{N-n}{N}\right)\frac{s^2}{nm} \]

since there would be Nm total observations in the population. Thus we can measure the relative
efficiency of \(\bar{\bar{y}}_c\) to \(\bar{y}\) by comparing MSB to \(s^2\). But we did not take a simple random sample; we took a
cluster sample, so \(s^2\) is not available. Fortunately, it turns out that we can approximate \(s^2\) (the variance
we would have obtained in a simple random sample) from quantities available in the cluster sample
results. This approximation is

\[ \hat{s}^2 = \frac{N(m-1)\text{MSW} + (N-1)\text{MSB}}{Nm - 1} \approx \frac{1}{m}\left[(m-1)\text{MSW} + \text{MSB}\right] \]

when N is large.

Using the calculations from Example 8.5, we see that

\[ \hat{s}^2 = \frac{1}{10}\left[(9)(1.20) + .36\right] = 1.12 \]

The estimated relative efficiency of \(\bar{\bar{y}}_c\) to \(\bar{y}\) is thus

\[ \widehat{RE}(\bar{\bar{y}}_c/\bar{y}) = \frac{\hat{s}^2}{\text{MSB}} = \frac{1.12}{.36} = 3.11 \]

In this case, cluster sampling is more efficient, because there is so little variation between clusters (each
cluster seems to be fairly representative of the entire population). This is somewhat unusual, since, in
most cases of naturally occurring clusters, cluster sampling will be less efficient than simple random
sampling.

As another example, we sampled n = 4 clusters of m = 20 contiguous random digits from a
random number table. If the goal is to estimate the mean of the random digits (known to be 4.5 in this
case), how should our cluster sample compare to taking 80 = 4(20) random digits in a simple random
sample? Since the clusters themselves contain randomly generated digits, we would expect the relative
efficiency to be close to one.

The analysis of variance results for our sample were as follows:

Analysis of Variance
Source    DF    SS        MS
Factor     3     40.54    13.51
Error     76    610.85     8.04
Total     79    651.39

From this,

\[ \hat{s}^2 = \frac{1}{20}\left[19(8.04) + 13.51\right] = 8.31 \]

and

\[ \widehat{RE}(\bar{\bar{y}}_c/\bar{y}) = \frac{8.31}{13.51} = .62 \]

This is a little lower than expected, but not far from 1.0.
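
Both relative efficiency calculations above follow the same recipe, so they can be wrapped in a small helper. This is a sketch under the large-N approximation used in the text, with an optional exact form when N is supplied; the function name `relative_efficiency` is hypothetical.

```python
def relative_efficiency(msb, msw, m, N=None):
    """Approximate s^2 from cluster-sample mean squares and return
    (s2_hat, s2_hat / MSB). With N given, the finite-population form
    [N(m-1)MSW + (N-1)MSB] / (Nm - 1) is used; otherwise the large-N
    approximation [(m-1)MSW + MSB] / m."""
    if N is None:
        s2_hat = ((m - 1) * msw + msb) / m
    else:
        s2_hat = (N * (m - 1) * msw + (N - 1) * msb) / (N * m - 1)
    return s2_hat, s2_hat / msb


# Example 8.5 (newspapers): MSB = .36, MSW = 1.20, m = 10.
s2_news, re_news = relative_efficiency(0.36, 1.20, 10)
# Random digits: MSB = 13.51, MSW = 8.04, m = 20.
s2_digits, re_digits = relative_efficiency(13.51, 8.04, 20)
print(round(s2_news, 2), round(re_news, 2))
print(round(s2_digits, 2), round(re_digits, 2))
```

For the newspaper data the unrounded ratio is about 3.10; the 3.11 in the text comes from rounding the variance estimate to 1.12 before dividing.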

We will continue this discussion of comparisons between cluster sampling and simple random
sampling in Chapter 9.

8.5 SELECTING THE SAMPLE SIZE FOR ESTIMATING POPULATION MEANS AND TOTALS

The quantity of information in a cluster sample is affected by two factors, the number of clusters
and the relative cluster size. We have not encountered the latter factor in any of the sampling
procedures discussed previously. In the problem of estimating the number of homes with
inadequate fire insurance in a state, the clusters could be counties, voting districts, school
districts, communities, or any other convenient grouping of homes. As we have already seen, the
size of the bound on the error of estimation depends crucially upon the variation among the
cluster totals. Thus in attempting to achieve small bounds on the error of estimation, one must
select clusters with as little variation as possible among these totals. We will now assume that the
cluster size (sampling unit) has been chosen and will consider only the problem of choosing the
number of clusters, n.

From Equation (8.2), the estimated variance of \(\bar{y}\) is

\[ \hat{V}(\bar{y}) = \frac{N-n}{Nn\bar{M}^2}\, s_c^2 \]

where

\[ s_c^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}m_i)^2}{n-1} \]

The actual variance of \(\bar{y}\) is approximately

\[ V(\bar{y}) = \frac{N-n}{Nn\bar{M}^2}\,\sigma_c^2 \qquad (8.12) \]

where \(\sigma_c^2\) is the population quantity estimated by \(s_c^2\).

Because we do not know \(\sigma_c^2\) or the average cluster size \(\bar{M}\), choice of the sample size, that
is, the number of clusters necessary to purchase a specified quantity of information concerning a
population parameter, is difficult. We overcome this difficulty by using the same method we used
for ratio estimation. That is, we use an estimate of \(\sigma_c^2\) and \(\bar{M}\) available from a prior survey, or we
select a preliminary sample containing n' elements. Estimates of \(\sigma_c^2\) and \(\bar{M}\) can be computed
from the preliminary sample and used to acquire an approximate total sample size n. Thus, as in
all problems of selecting a sample size, we equate two standard deviations of our estimator to a
bound on the error of estimation, B. This bound is chosen by the experimenter and represents the
maximum error that he or she is willing to tolerate. That is,

\[ 2\sqrt{V(\bar{y})} = B \]

Using Equation (8.12), we can solve for n.

We obtain similar results when using \(M\bar{y}\) to estimate the population total \(\tau\), because
\(V(M\bar{y}) = M^2 V(\bar{y})\).

Approximate sample size required to estimate \(\mu\) with a bound B on the error of estimation:

\[ n = \frac{N\sigma_c^2}{ND + \sigma_c^2} \qquad (8.13) \]

where \(\sigma_c^2\) is estimated by \(s_c^2\) and

\[ D = \frac{B^2\bar{M}^2}{4} \]

Example 8.6 Suppose the data in Table 8.1 represent a preliminary sample of incomes in the city. How
large a sample should be taken in a future survey in order to estimate the average per
capita income 𝜇 with a bound of $500 on the error of estimation?

Solution

To use Equation (8.13), we must estimate \(\sigma_c^2\); the best estimate available is \(s_c^2\), which can
be calculated by using the data in Table 8.1. Using the calculations in Example 8.2, we
have

\[ s_c^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}m_i)^2}{n-1} = \frac{15{,}227{,}502{,}247}{24} = 634{,}479{,}260 \]

The quantity \(\bar{M}\) can be estimated by \(\bar{m} = 6.04\), calculated from Table 8.1. Then D is
approximately

\[ \frac{B^2\bar{m}^2}{4} = \frac{(500)^2(6.04)^2}{4} = (62{,}500)(6.04)^2 \]

Using Equation (8.13) yields

\[ n = \frac{N\sigma_c^2}{ND + \sigma_c^2} = \frac{415(634{,}479{,}260)}{415(6.04)^2(62{,}500) + 634{,}479{,}260} = 166.58 \]

Thus 167 clusters should be sampled.

Approximate sample size required to estimate \(\tau\), using \(M\bar{y}\), with a bound B on the error of
estimation:

\[ n = \frac{N\sigma_c^2}{ND + \sigma_c^2} \qquad (8.14) \]

where \(\sigma_c^2\) is estimated by \(s_c^2\) and

\[ D = \frac{B^2}{4N^2} \]

Example 8.7 Again using the data in Table 8.1 as a preliminary sample of incomes in the city, how
large a sample is necessary to estimate the total income of all residents, 𝜏, with a bound
of $1,000,000 on the error of estimation? There are 2500 residents of the city (M = 2500).

Solution

We use Equation (8.14) and estimate \(\sigma_c^2\) by

\[ s_c^2 = 634{,}479{,}260 \]

as in Example 8.6. When estimating \(\tau\), we use

\[ D = \frac{B^2}{4N^2} = \frac{(1{,}000{,}000)^2}{4(415)^2} \]

\[ ND = \frac{(1{,}000{,}000)^2}{4(415)} = 602{,}409{,}000 \]

Then using Equation (8.14) gives

\[ n = \frac{N\sigma_c^2}{ND + \sigma_c^2} = \frac{415(634{,}479{,}260)}{602{,}409{,}000 + 634{,}479{,}260} = 212.88 \]

Thus 213 clusters should be sampled to estimate the total income with a bound of
$1,000,000 on the error of estimation.
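
Examples 8.6 and 8.7 use the same formula with two different definitions of D, which the following sketch makes explicit. The function names are hypothetical, and the preliminary-sample variance and average cluster size are taken as given from the examples rather than recomputed from Table 8.1.

```python
import math


def clusters_needed(N, sigma2, D):
    """n = N * sigma^2 / (N * D + sigma^2), before rounding up."""
    return N * sigma2 / (N * D + sigma2)


def d_for_mean(B, M_bar):
    """D when estimating the mean per element with bound B."""
    return B ** 2 * M_bar ** 2 / 4


def d_for_total(B, N):
    """D when estimating the total with M * y_bar and bound B."""
    return B ** 2 / (4 * N ** 2)


N = 415
s_c2 = 634_479_260          # preliminary estimate of sigma_c^2 (Example 8.6)

n_mean = clusters_needed(N, s_c2, d_for_mean(500, 6.04))          # Example 8.6
n_total = clusters_needed(N, s_c2, d_for_total(1_000_000, N))     # Example 8.7
print(math.ceil(n_mean), math.ceil(n_total))
```

The result is rounded up because a fractional cluster cannot be sampled and rounding down would miss the requested bound.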

The estimator \(N\bar{y}_t\), shown in Equation (8.8), is used to estimate \(\tau\) when M is
unknown. The estimated variance of \(N\bar{y}_t\), shown in Equation (8.9), is

\[ \hat{V}(N\bar{y}_t) = N^2\left(\frac{N-n}{Nn}\right)s_t^2 \]

where

\[ s_t^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}_t)^2}{n-1} \]

Thus the population variance of \(N\bar{y}_t\) is

\[ V(N\bar{y}_t) = N^2 V(\bar{y}_t) = N^2\left(\frac{N-n}{Nn}\right)\sigma_t^2 \qquad (8.16) \]

where \(\sigma_t^2\) is the population quantity estimated by \(s_t^2\).


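Estimation of \(\tau\) with a bound of B units on the error of estimation leads to the
following equation:

\[ 2\sqrt{V(N\bar{y}_t)} = B \]

Using Equation (8.16), we can solve for n.

Approximate sample size required to estimate \(\tau\), using \(N\bar{y}_t\), with a bound B on the
error of estimation:

\[ n = \frac{N\sigma_t^2}{ND + \sigma_t^2} \qquad (8.17) \]

where \(\sigma_t^2\) is estimated by \(s_t^2\) and

\[ D = \frac{B^2}{4N^2} \]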

Example 8.8 Assume the data of Table 8.1 are from a preliminary study of incomes in the city and M
is not known. How large a sample must be taken to estimate the total income of all
residents, 𝜏, with a bound of $1,000,000 on the error of estimation?
Solution

The quantity \(\sigma_t^2\) must be estimated by \(s_t^2\), which is calculated from the data of Table 8.1.
Using the calculations of Example 8.4 gives

\[ s_t^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}_t)^2}{n-1} = \frac{11{,}389{,}360{,}000}{24} = 474{,}556{,}667 \]

The bound on the error of estimation is B = $1,000,000. Hence

\[ D = \frac{B^2}{4N^2} = \frac{(1{,}000{,}000)^2}{4(415)^2} \]

From Equation (8.17),

\[ n = \frac{N\sigma_t^2}{ND + \sigma_t^2} = \frac{415(474{,}556{,}667)}{415(1{,}000{,}000)^2/4(415)^2 + 474{,}556{,}667} = 182.88 \]

Thus a sample of 183 clusters must be taken to have a bound of $1,000,000 on the error
of estimation.
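
The calculation in Example 8.8 can be sketched in the same way. The helper name is hypothetical, and the value of the preliminary variance of cluster totals is taken as given from the example.

```python
import math


def clusters_needed_unknown_M(N, s_t2, B):
    """Clusters needed to estimate the total with N * y_bar_t and bound B,
    using n = N * sigma_t^2 / (N * D + sigma_t^2) with D = B^2 / (4 N^2)."""
    D = B ** 2 / (4 * N ** 2)
    return N * s_t2 / (N * D + s_t2)


# Example 8.8: N = 415 clusters, s_t^2 = 474,556,667, B = $1,000,000.
n_exact = clusters_needed_unknown_M(415, 474_556_667, 1_000_000)
n_required = math.ceil(n_exact)
print(round(n_exact, 2), n_required)
```

Carrying full precision through the division gives a value a shade under the 182.88 printed above, which does not change the rounded-up answer of 183 clusters.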
