SAMPLING DISTRIBUTIONS

Population Distribution: When we talk of a population distribution, we assume that we have investigated the population and have full knowledge of its mean and standard deviation. The population mean is denoted by $\mu$ and the population standard deviation by $\sigma$. The measures $\mu$ and $\sigma$ of a population are called parameters.

Sample Distribution: When we talk of a sample distribution, we take a sample from the population. The mean and standard deviation of the sample are denoted by $\bar{X}$ and $s$. These sample measures are called statistics. Note that several sample distributions are possible from a given population.

Distribution of Sample Means: Considering the sample mean $\bar{X}$ as a variable, we observe that the expected value of $\bar{X}$ is the population mean, i.e., $E(\bar{X}) = \mu_{\bar{X}} = \mu$, and the standard deviation of $\bar{X}$ is given by

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

where $n$ is the sample size. The standard deviation of the mean is also known as the standard error of the mean.

In order to use the sample standard deviation $s$ as an estimate for $\sigma$, we have the following formula:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

and the standard error of the mean is given by

$$\sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n - 1)}}$$
When the sample size $n$ is not very small in comparison with the finite population size $N$, we consider the following formula:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$
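
The two standard error formulas above can be sketched as one small Python function. This is only an illustration: the function name, the `population_size` parameter and the numbers passed to it are assumptions made for the example, not part of the notes.

```python
import math

def standard_error(sd, n, population_size=None):
    """Standard error of the sample mean.

    sd: population standard deviation (or the sample estimate s).
    n: sample size.
    population_size: finite population size N; None means an infinite population.
    """
    se = sd / math.sqrt(n)
    if population_size is not None:
        # Finite population factor sqrt((N - n) / (N - 1)).
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

# Hypothetical inputs: sd = 10, n = 25, and a finite population of N = 100.
print(standard_error(10, 25))
print(standard_error(10, 25, population_size=100))
```

The finite-population value is always the smaller of the two, since the extra factor is below 1 whenever $n > 1$.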

Exercise: The time between two arrivals in a queuing model is normally distributed with mean 2 minutes and standard deviation 0.25 minute. If a random sample of size 36 is drawn, what is the probability that the sample mean will be greater than 2.1 minutes?
Solution: $n = 36$; $\mu = 2$; $\sigma = 0.25$. The standard error of the mean is calculated as under:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.25}{\sqrt{36}} = 0.042$$

The probability that the sample mean is greater than 2.1 is then given by

$$P(\bar{X} \ge 2.1) = P\left(\frac{\bar{X} - 2}{0.042} \ge \frac{2.1 - 2}{0.042}\right) = P(Z \ge 2.38) = 1 - 0.9913 = 0.0087$$
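
As a cross-check on the queuing exercise, the same probability can be computed with Python's standard library. This is a minimal sketch; it keeps the standard error unrounded, so it gives z = 2.40 and a probability of about 0.0082 rather than the hand-calculated 2.38 and 0.0087, which come from rounding the standard error to 0.042.

```python
import math
from statistics import NormalDist

mu, sigma, n = 2.0, 0.25, 36
se = sigma / math.sqrt(n)             # unrounded standard error of the mean
z = (2.1 - mu) / se                   # standardised value of 2.1 minutes
p_greater = 1 - NormalDist().cdf(z)   # P(Z >= z) under the standard normal

print(f"SE = {se:.4f}, z = {z:.2f}, P = {p_greater:.4f}")
```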

Exercise: The weight of a certain type of car tire is normally distributed with a mean of 25 pounds and a variance of 3 pounds. A random sample of 50 tires is selected. What is the probability that the mean of this sample lies between 24.5 and 25.5 pounds?

Exercise: An auditor takes a sample of size 36 from a population of 1000 accounts receivable.
The standard deviation of the population is unknown, but the standard deviation of the
sample is Rs 43. If the true mean value of the accounts receivable is Rs 260, what is the
probability that the sample mean will be less than or equal to Rs 250?

Estimation of Population Mean:

In most research studies, population parameters are unknown and have to be estimated from a sample. As such, the methods of estimating parameters assume an important role in statistical analysis.

The estimate of a population parameter may be a single value or a range of values. If the estimate is a single value, it is referred to as a point estimate; if it is a range of values, it is termed an interval estimate.

A good estimator possesses the following properties:

(i) An estimator should on the average be equal to the value of the parameter being estimated. (Property of Unbiasedness)
(ii) An estimator should have relatively small variance. (Property of Efficiency)
(iii) An estimator should use as much as possible of the information available from the sample. (Property of Sufficiency)
(iv) An estimator should approach the value of the parameter as the sample size becomes larger and larger. (Property of Consistency)

The point estimator of the population mean $\mu$ is $\bar{X}$, the sample mean.

The interval estimator for the mean $\mu$ is an interval around $\bar{X}$, constructed for a given degree of confidence with the help of the standard error (SE).
For example, the 95% confidence interval for the population mean has the lower limit $\bar{X} - 1.96\,SE$ and the upper limit $\bar{X} + 1.96\,SE$. In other words, the probability of $\mu$ being in the interval $[\bar{X} - 1.96\,SE,\ \bar{X} + 1.96\,SE]$ is 0.95, or

$$P[\bar{X} - 1.96\,SE \le \mu \le \bar{X} + 1.96\,SE] = 0.95$$

In the above, 1.96 is the z-variate of the standard normal distribution for the confidence level of 95% (or the significance level of 5%).
If the sample size is small, i.e., less than 30, we use the t-variate with $n - 1$ degrees of freedom for the estimation.

Exercise: From a random sample of 36 civil service personnel, the mean age and sample standard deviation were found to be 40 years and 4.5 years respectively. Construct a 95% confidence interval for the mean age of civil servants. Also construct a 96% confidence interval for the mean age of civil servants.
Solution: In the above, $n = 36$, $\bar{X} = 40$ and $s = 4.5$. The population is not finite, and the sample size may be considered large. The standard error of the mean is given by

$$SE = \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{4.5}{\sqrt{36}} = 0.75$$

The standard normal variate for 95% confidence is 1.96.
Thus the 95% confidence interval for the population mean is given by the limits $\bar{X} \pm 1.96\,SE$:

$$\bar{X} \pm 1.96\,SE = 40 \pm (1.96)(0.75) = 40 \pm 1.47$$

Therefore the 95% confidence interval for the population mean is [38.53, 41.47].
In other words, $P(38.53 \le \mu \le 41.47) = 0.95$.

The standard normal variate for 96% confidence is 2.054.
Thus the 96% confidence interval for the population mean is given by the limits $\bar{X} \pm 2.054\,SE$:

$$\bar{X} \pm 2.054\,SE = 40 \pm (2.054)(0.75) = 40 \pm 1.54$$

Therefore the 96% confidence interval for the population mean is [38.46, 41.54].
In other words, $P(38.46 \le \mu \le 41.54) = 0.96$.
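
The interval construction used in this solution can be sketched in Python as follows. The function name is an assumption made for illustration, and the z-variate is taken from the standard normal quantile function rather than from a printed table, so the last decimal may differ slightly from table-based hand calculations.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence):
    """Large-sample (z-based) confidence interval for a population mean."""
    se = sd / math.sqrt(n)
    # Two-sided interval: (1 - confidence) / 2 of the probability in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * se
    return sample_mean - margin, sample_mean + margin

for level in (0.95, 0.96):
    low, high = mean_confidence_interval(40, 4.5, 36, level)
    print(f"{level:.0%}: [{low:.2f}, {high:.2f}]")
```

For samples smaller than 30 the z quantile would have to be replaced by a t-table value with $n - 1$ degrees of freedom, which this sketch does not cover.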

Exercise: In a random selection of 64 of the 2400 intersections in a small city, the mean number of scooter accidents per year was 3.2 and the sample standard deviation was 0.8.
(i) Make an estimate of the standard deviation of the population from the sample standard deviation.
(ii) Work out the standard error of the mean for this finite population.
(iii) If the desired confidence level is 0.90, what will be the upper and lower limits of the confidence interval for the mean number of accidents per year?

Exercise: A random sample of 16 values from a normal population showed a mean of 41.5 inches, and the sum of squares of deviations from this mean is 135 square inches. Obtain the 95% and 99% confidence limits for the population mean.

Exercise: The foreman of ABC mining company has estimated the average quantity of iron
ore extracted to be 36.8 tons per shift and the sample standard deviation to be 2.8 tons per
shift, based upon a random selection of four shifts. Construct a 90% confidence interval
around the estimate.

Estimation of Sample Size:

The size of the sample should be determined by a researcher keeping the following points in view:
(i) Nature of Universe: If the items of the universe are homogeneous, a small sample
can serve the purpose. But if the items are heterogeneous, a large sample would be
required. Technically, this can be termed as dispersion factor.
(ii) Number of classes proposed: If many class groups are to be formed, a large
sample would be required because a small sample may not be able to give
reasonable number of items in each class-group.
(iii) Nature of Study: If items are to be intensively and continuously studied, the
sample should be small. For a general survey the size of the sample should be
large, but small sample is considered appropriate in technical survey.
(iv) Type of sampling: Sampling technique plays an important part in determining the
size of the sample. A small random sample is apt to be much superior to a larger
but badly selected sample.
(v) Standard of accuracy and acceptable confidence level: If the standard of accuracy
or the level of precision is to be kept high, we shall require relatively larger
sample. For doubling the accuracy for fixed significance level, the sample size has
to be increased fourfold.
(vi) Availability of finance: In practice, the size of the sample depends upon the amount of money available for the study. This factor should be kept in view while determining the size of the sample. Larger samples increase the cost of sampling estimates.
(vii) Other considerations: The nature of units, size of population, size of questionnaire, availability of trained investigators, the conditions under which the sample is being conducted, and the time available for completion of the study are a few other considerations to which a researcher must pay attention while selecting the size of the sample.

Sample Size when estimating a mean:

Note that the limits of the confidence interval for the population mean are given by

$$\bar{X} \pm z \cdot SE = \bar{X} \pm z \frac{\sigma}{\sqrt{n}},$$

where $\bar{X}$ is the sample mean,
$z$ is the value of the standard variate at the given confidence level,
$n$ is the sample size, and
$\sigma$ is the standard deviation of the population.
If the researcher would like to estimate the population mean within a desired precision $\pm e$, then we get

$$e = z \frac{\sigma}{\sqrt{n}} \quad \text{and therefore} \quad n = \frac{z^2 \sigma^2}{e^2}.$$

In case of a finite population, we get

$$e = z \cdot SE = z \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} \quad \text{and therefore} \quad n = \frac{z^2 \sigma^2 N}{(N - 1) e^2 + z^2 \sigma^2}.$$

Many a time the standard deviation of the population is not known and the sample has not yet been taken. A rough estimate of the population standard deviation is then given by

$$\hat{\sigma} = \frac{\text{Range of Population Distribution}}{6}$$

The range in the above may have to be obtained from past records or through a pilot survey of a large number of items.

Exercise: If the acceptable error in estimating the population mean is within 3 units of the sample mean with 95% confidence, estimate the sample size when the standard deviation of the population is known and equals 4.8.

Solution: Here $e = 3$, $z = 1.96$ (for the 95% confidence level) and $\sigma = 4.8$. The sample size for 95% confidence and within 3 units of the sample mean is given by

$$n = \frac{z^2 \sigma^2}{e^2} = \frac{(1.96)^2 (4.8)^2}{(3)^2} = 9.834 \cong 10$$

Therefore the sample size for estimating the population mean within 3 units and with 95% confidence is 10.
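
The sample-size formulas for a mean can be sketched in Python as below. The function name and the `population_size` parameter are assumptions made for the illustration; the result is rounded up, since a fractional sample size must be raised to the next whole item to meet the stated precision.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, e, confidence, population_size=None):
    """Sample size needed to estimate a mean within +/- e."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    if population_size is None:
        # Infinite population: n = z^2 sigma^2 / e^2.
        n = (z * sigma / e) ** 2
    else:
        # Finite population of size N.
        N = population_size
        n = (z**2 * sigma**2 * N) / ((N - 1) * e**2 + z**2 * sigma**2)
    return math.ceil(n)

print(sample_size_for_mean(4.8, 3, 0.95))   # the exercise above
```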

Exercise: A cigarette manufacturer wishes to use a random sample to estimate the average nicotine content. The error should not be more than 1 milligram above or below the true mean, with a 99% confidence coefficient. The population standard deviation is 4 milligrams. What sample size should the company use in order to satisfy the requirement?

Exercise: Determine the size of the sample for estimating the true weight of the 5000 cereal containers on the basis of the following information:
The variance of weight is 4 ounces on the basis of past records.
The estimate should be within 0.8 ounces of the true average weight with 99% probability.
Will there be a change in the size of the sample if we assume an infinite population in the given case? If so, explain by how much.

Sample size when estimating the population proportion:

If we are to find the sample size for estimating a population proportion, our reasoning remains similar to what we have said in the context of the population mean. It is required to specify the precision and the confidence level and then estimate the sample size as under.
Note that the standard error of proportion is given by

$$SE = \sigma_p = \sqrt{\frac{pq}{n}} \quad \text{(in case of infinite population)}$$

$$SE = \sigma_p = \sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}} \quad \text{(in case of finite population of size } N\text{)}$$

where $p$ is the sample proportion, $q = 1 - p$, $z$ is the standard variate for the appropriate confidence level and $n$ is the sample size.
Further, the confidence interval for the population proportion is given by

$$p \pm z \cdot SE$$

If $e$ is the precision rate, i.e., the acceptable error, then the sample size can be expressed as

$$n = \frac{z^2 pq}{e^2} \quad \text{(in case of infinite population)}$$

$$n = \frac{z^2 pq N}{e^2 (N - 1) + z^2 pq} \quad \text{(in case of finite population)}$$

Exercise: What should be the sample size if a simple random sample from a population of 4000 items is to be drawn to estimate the percentage of defectives within 2% of the true value with 95.5% probability? What should be the size of the sample if the population is assumed to be infinite in the given case? (From the pilot study, it has been observed that the proportion of defective items is about 2%.)
Solution:
In this case $N = 4000$, $z = 2.005$, $p = 0.02$ and $e = 0.02$.

$$n = \frac{z^2 pq N}{e^2 (N - 1) + z^2 pq} = \frac{(2.005)^2 (0.02)(0.98)(4000)}{(0.02)^2 (4000 - 1) + (2.005)^2 (0.02)(0.98)} = 187.78 \cong 188$$

Therefore the sample size is estimated to be 188 for the sample proportion to be within the 2% limit with 95.5% confidence.

If we assume that the population size is infinite, then

$$n = \frac{z^2 pq}{e^2} = \frac{(2.005)^2 (0.02)(0.98)}{(0.02)^2} = 196.98 \cong 197$$
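
Both sample-size formulas for a proportion can be checked with a short Python sketch. The function name is an assumption made for illustration; the z-variate 2.005 is passed in directly, as quoted in the solution, instead of being derived from the confidence level.

```python
import math

def sample_size_for_proportion(p, e, z, population_size=None):
    """Sample size needed to estimate a proportion within +/- e."""
    q = 1 - p
    if population_size is None:
        # Infinite population: n = z^2 p q / e^2.
        n = z**2 * p * q / e**2
    else:
        # Finite population of size N.
        N = population_size
        n = (z**2 * p * q * N) / (e**2 * (N - 1) + z**2 * p * q)
    return math.ceil(n)

print(sample_size_for_proportion(0.02, 0.02, 2.005, population_size=4000))
print(sample_size_for_proportion(0.02, 0.02, 2.005))
```

The finite-population answer is the smaller one, because sampling a sizeable share of a finite population reduces the standard error.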

Exercise: Suppose a certain hotel management is interested in determining the percentage of the hotel's guests who stay for more than 3 days. The reservation manager wants to be 95% confident that the percentage has been estimated within 3% of the true value. What is the most conservative sample size needed for the problem?

Exercise: Suppose the following ten values represent random observation from a normal
parent population:
2, 6, 7, 9, 5, 1, 0, 3, 5, 4.
Construct a 99 percent confidence interval for the mean of the parent population.

Exercise: A team of medical research experts feels confident that a new drug they have developed will cure about 80% of the patients. How large should the sample size be for the team to be 98% certain that the sample proportion of cures is within plus and minus 2% of the proportion of all cases that the drug will cure?

Exercise: The annual income of 900 salesmen employed by a company is known to be approximately normally distributed. If the company wants to be 95% confident that the true mean of this year's salesmen's income does not differ by more than 2% from last year's mean income of Rs 12,000, what sample size would be required, assuming the population standard deviation to be Rs 1500?

Exercise: In a random sample of 64 items taken from a large consignment, some were found
to be defective. Deduce that percentage of defective items in the consignment almost
certainly lies between 31.25 and 68.75 given that the standard error of the proportion of
defective items in the sample is 1/16.

Exercise: A cigarette manufacturer claims that his cigarettes have an average nicotine content of 18.3 mg. If a random sample of cigarettes of this type has nicotine contents of 20, 17, 21, 19, 22, 21, 20 and 16 mg, would you agree with the manufacturer's claim? Assume a suitable value for the level of significance. (Level of significance = 1 – Level of confidence)

Distribution of Sample Standard deviation:

If a population is large and normally distributed with standard deviation $\sigma$, the standard deviations of random samples of size $n$ ($n$ large) are approximately normally distributed with standard deviation $\sigma / \sqrt{2n}$ (the standard error of the standard deviation).

The standard deviation of the distribution of standard deviations of samples drawn from a normal population is called the standard error of the standard deviation and is denoted by

$$SE = \frac{\sigma}{\sqrt{2n}}.$$

TESTING OF HYPOTHESES

Hypothesis:
It is an assumption or some supposition to be proved or rejected.
Definition: Hypothesis is a proposition or a set of propositions set forth as an explanation for
the occurrence of some specified group of phenomena either asserted merely as a provisional
conjecture to guide some investigation or accepted as highly probable in the light of
established facts.

Characteristics of a Hypothesis:
(i) A hypothesis should be clear and precise.
(ii) A hypothesis should be capable of being tested.
(iii) A hypothesis should state the relationship between variables, if it happens to be a relational hypothesis.
(iv) A hypothesis should be limited in scope and must be specific. A researcher must remember that a narrower hypothesis is generally more testable and should develop such hypotheses.
(v) A hypothesis should be stated as far as possible in the simplest terms so that it is easily understandable by all concerned.
(vi) A hypothesis should be consistent with most known facts, i.e., it must be consistent with a substantial body of established facts. In other words, it should be one which judges accept as being the most likely.
(vii) A hypothesis should be amenable to testing within a reasonable time.
(viii) A hypothesis must explain the facts that gave rise to the need for explanation.

Null Hypothesis and Alternative Hypothesis: The null hypothesis is an initial statement concerning a population parameter. It is generally denoted by $H_0$. Any hypothesis which differs from the null hypothesis is called an alternative hypothesis. The alternative hypothesis is denoted by $H_1$.

Type I error: The error of rejecting the null hypothesis when it should have been accepted (that is, when it is true) is known as a type I error.

Type II error: The error of accepting the null hypothesis when it should have been rejected (that is, when it is false) is known as a type II error.

The probability of a type I error is usually fixed in advance and is understood as the level of significance of the test. If the type I error is fixed at 5%, it means that there are about 5 chances in 100 that we reject $H_0$ when $H_0$ is true.

But with a fixed sample size $n$, when we try to reduce the type I error, the probability of committing a type II error increases. Both types of error cannot be reduced simultaneously.

Two-tailed and One-tailed test: A two-tailed test rejects the null hypothesis if, say, the sample mean is significantly higher or lower than the hypothesized value of the population mean. Thus in a two-tailed test there are two rejection regions, one on each tail of the normal curve.

$$H_0: \mu = \mu_{H_0}, \qquad H_1: \mu \ne \mu_{H_0}$$

A one-tailed test is used when we are to test, say, whether the population mean is either lower than or higher than some hypothesized value.

$$H_0: \mu = \mu_{H_0},\ H_1: \mu > \mu_{H_0} \qquad \text{or} \qquad H_0: \mu = \mu_{H_0},\ H_1: \mu < \mu_{H_0}$$

Example: A random sample of 25 tires from a large consignment gave an average life of 38,000 km and a standard deviation of 5,000 km. Could the sample come from a population with a mean tire life of 40,000 km?

Solution:
We state the null hypothesis and the alternative hypothesis as under:

$$H_0: \mu = 40000, \qquad H_1: \mu \ne 40000$$

We make a two-tailed test for the population mean. Consider the level of significance $\alpha = 0.05$.
The test criterion is $t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$. Here $n = 25$, the sample mean is 38,000 and the sample standard deviation is $s = 5000$. Therefore

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{38000 - 40000}{5000 / \sqrt{25}} = -2, \qquad |t| = 2$$

From the table, the t-variate value for the 5% significance level (95% confidence level) with 24 degrees of freedom is 2.064.
Since the calculated value $|t| = 2$ is less than the table value, we accept the null hypothesis that the mean life of the tires is 40,000 km at the 5% significance level (95% confidence level).
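
The computation in this example can be sketched in Python. The standard library has no t-distribution, so the critical value 2.064 is taken from the table as quoted in the solution; the function name is an assumption made for illustration.

```python
import math

def one_sample_t(sample_mean, mu0, s, n):
    """t-statistic for H0: mu = mu0 when sigma is estimated by s."""
    return (sample_mean - mu0) / (s / math.sqrt(n))

t = one_sample_t(38_000, 40_000, 5_000, 25)
t_critical = 2.064                 # table value for 24 degrees of freedom, two-tailed 5%
reject = abs(t) > t_critical       # two-tailed rejection rule |t| > t_critical

print(f"t = {t}, reject H0: {reject}")
```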

Flow Chart for Hypothesis Testing:

State $H_0$ as well as $H_1$
    ↓
Specify the level of significance ($\alpha$)
    ↓
Decide the correct sampling distribution
    ↓
Obtain the sample and work out an appropriate value from the sample data
    ↓
Calculate the probability that the sample result would diverge as widely as it has from expectations, if the null hypothesis were true (find the z-value or t-value for the purpose)
    ↓
Compare this probability with the significance level ($\alpha / 2$ in case of a two-tailed test; $\alpha$ in case of a one-tailed test), i.e., find whether the calculated z or t value is in the rejection region
    ↓
Yes: Reject $H_0$        No: Accept $H_0$

Exercise: A certain stimulus administered to each of 12 patients resulted in the following changes in blood pressure:
5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6
Can it be concluded that the stimulus will, in general, be accompanied by a change in blood pressure?

Solution:
From the given data, we obtain the sample mean and sample variance as

$$\bar{X} = \frac{\sum X}{n} = \frac{31}{12} = 2.583$$

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = 9.538, \qquad s = 3.088$$

We make the null hypothesis that the stimulus is, in general, not accompanied by a change in blood pressure. Therefore the null hypothesis and the alternative hypothesis can be formulated as under:

$$H_0: \mu = 0, \qquad H_1: \mu \ne 0$$

Assume a 5% level of significance, i.e., a 95% level of confidence. The corresponding t-value with 11 degrees of freedom is 2.201 ($t_{0.025,\,11}$), so the rejection region is given by $R: |t| > 2.201$.

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{2.583 - 0}{3.088 / \sqrt{12}} = 2.90$$

Since the calculated t-value is bigger than the table value, we reject the null hypothesis.

Exercise: A certain stimulus administered to each of 12 patients resulted in the following changes in blood pressure:
5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6
Can it be concluded that the stimulus will, in general, be accompanied by an increase in blood pressure?
Solution:
From the given data, we obtain the sample mean and sample variance as

$$\bar{X} = \frac{\sum X}{n} = \frac{31}{12} = 2.583$$

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = 9.538, \qquad s = 3.088$$

We make the null hypothesis that the stimulus is, in general, not accompanied by an increase in blood pressure. Therefore the null hypothesis and the alternative hypothesis can be formulated as under:

$$H_0: \mu = 0, \qquad H_1: \mu > 0$$

Assume a 5% level of significance, i.e., a 95% level of confidence. The corresponding t-value (one-tail test) with 11 degrees of freedom is 1.796 ($t_{0.05,\,11}$), and the rejection region is $R: t > 1.796$.

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{2.583 - 0}{3.088 / \sqrt{12}} = 2.90$$

Since the calculated t-value is bigger than the table value, we reject the null hypothesis.
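
The two blood-pressure tests can be reproduced from the raw data with a short Python sketch. The critical values 2.201 and 1.796 are the table values quoted in the solutions; variable names are assumptions made for illustration. Carrying full precision gives a t of roughly 2.90, so a hand computation with rounded intermediate values can differ slightly in the second decimal.

```python
import math
import statistics

changes = [5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6]
n = len(changes)
mean = statistics.mean(changes)
s = statistics.stdev(changes)            # sample standard deviation (divides by n - 1)
t = (mean - 0) / (s / math.sqrt(n))      # H0: mu = 0

# Table values for 11 degrees of freedom at the 5% level.
two_tailed_reject = abs(t) > 2.201       # H1: mu != 0
one_tailed_reject = t > 1.796            # H1: mu > 0

print(f"mean = {mean:.3f}, s = {s:.3f}, t = {t:.3f}")
print(f"two-tailed reject: {two_tailed_reject}, one-tailed reject: {one_tailed_reject}")
```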

Exercise: A cigarette manufacturer claims that his cigarettes have an average nicotine content of 18.3 mg. If a random sample of cigarettes of this type has nicotine contents of 20, 17, 21, 19, 22, 21, 20 and 16 mg, would you agree with the manufacturer's claim? Assume a suitable value for the level of significance.

Exercise: Raju Restaurant near the railway station has been having average sales of 500 tea cups per day. Because of the development of a bus stand nearby, it expects to increase its sales. During the first 12 days after the start of the bus stand, the daily sales were as under:
550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526
On the basis of this sample information, can one conclude that Raju Restaurant's sales have increased? Use a 5% level of significance.

Solution: Consider the null hypothesis that the sales average is 500 cups and sales have not increased unless proved otherwise. We can write:

$$H_0: \mu = 500, \qquad H_1: \mu > 500$$

The sample is small and drawn from an infinite population, so we shall use a one-tailed t-test and compute the t-statistic given by $t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$. Further note that the population standard deviation is not given. We shall compute $\bar{X}$ and $s$.

$X_i$      $X_i - \bar{X}$      $(X_i - \bar{X})^2$
550          2                  4
570         22                484
490        -58               3364
615         67               4489
505        -43               1849
580         32               1024
570         22                484
460        -88               7744
600         52               2704
580         32               1024
530        -18                324
526        -22                484
Total: $\sum X_i = 6576$, $\sum (X_i - \bar{X})^2 = 23978$

$$\bar{X} = \frac{6576}{12} = 548$$

$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{23978}{11}} = 46.69$$

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{548 - 500}{46.69 / \sqrt{12}} = 3.56$$

Degrees of freedom = n – 1 = 12 – 1 = 11. Therefore the corresponding t-value (one-tail test) at the 5% significance level with 11 degrees of freedom is 1.796 ($t_{0.05,\,11}$), and $R: t > 1.796$.

Since the calculated t-value is greater than the table value and lies in the rejection region, we reject the null hypothesis that there is no change in sales and conclude that there is an increase in sales.

Exercise: A sample of 400 male students is found to have a mean height of 67.47 inches. Can it be regarded as a sample from a large population with mean height 67.39 inches and standard deviation 1.30 inches? Test at the 5% level of significance.
Solution: Consider the null hypothesis that the average height is 67.39 inches. We can write

$$H_0: \mu = 67.39, \qquad H_1: \mu \ne 67.39$$

Since the sample size is large (400), the population is infinite and the standard deviation of the population is known, we shall use a two-tailed z-test with the z-statistic $z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$. Note that at the 5% significance level for a two-tailed test the z-variate is 1.96, and the rejection region is $R: |z| > 1.96$.

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{67.47 - 67.39}{1.30 / \sqrt{400}} = 1.231$$

The calculated z-variate is therefore within the acceptance region. We accept the null hypothesis that the mean height of students is 67.39 inches at the 5% significance level.
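
For the height example, the z-statistic and a two-tailed p-value can be sketched with Python's standard library. Reporting a p-value is an addition for illustration only; comparing it with 0.05 is equivalent to comparing $|z|$ with 1.96.

```python
import math
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """z-statistic for H0: mu = mu0 when sigma is known."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

z = one_sample_z(67.47, 67.39, 1.30, 400)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed p-value

print(f"z = {z:.3f}, p-value = {p_value:.3f}, reject H0 at 5%: {p_value < 0.05}")
```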

Exercise: Suppose that we are interested in a population of 20 industrial units of the same size, all of which are experiencing excessive labor turnover problems. Past records show that the mean of the distribution of annual turnover is 320 employees, with a standard deviation of 75 employees. A sample of 5 of these industrial units is taken at random, which gives a mean annual turnover of 300 employees. Is the sample mean consistent with the population mean? Test at the 5% significance level.

Exercise: The mean of a certain production process is known to be 50 with a standard deviation of 2.5. The production manager may welcome any change in the mean value towards the higher side but would like to safeguard against decreasing values of the mean. He takes a sample of 36 items that gives a mean value of 48.5. What inference should the manager draw about the production process on the basis of the sample results? Use a 5% level of significance for the purpose.

Exercise: The mean lifetime of a random sample of 50 similar torch bulbs drawn from a batch of 500 bulbs is 72 hours. The standard deviation of the lifetime of the sample is 10.4 hours. The batch is classed as inferior if the mean lifetime is less than 75 hours. Determine whether, as a result of the sample data, the batch is considered to be inferior at a level of significance of (a) 0.05 and (b) 0.01.

Solution: The population is finite with $N = 500$. The sample size is $n = 50$, the sample mean is $\bar{X} = 72$ hours and the sample standard deviation is $s = 10.4$ hours. The claimed lifetime of the bulbs (population mean) is a minimum of 75 hours. The objective is to test whether the given batch is of inferior quality (lifetime less than 75 hours). Therefore we make the null hypothesis that the lifetime of the bulbs is not less than 75 hours, i.e.,

$$H_0: \mu \ge 75, \qquad H_1: \mu < 75$$

We shall use a one-tail test for a large sample from a finite population.

$$z = \frac{\bar{X} - \mu}{\dfrac{s}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}} = \frac{72 - 75}{\dfrac{10.4}{\sqrt{50}} \sqrt{\dfrac{500 - 50}{500 - 1}}} = \frac{-3}{(1.471)(0.95)} = -2.148$$

(a) Test at the 5% significance level: The table value for z is -1.645. Therefore the rejection region is $R: z < -1.645$. The calculated value of z is in the rejection region and therefore we reject the null hypothesis at the 5% level of significance.
(b) Test at the 1% significance level: The table value for z is -2.33. Therefore the rejection region is $R: z < -2.33$. The calculated value of z is not in the rejection region and therefore we accept the null hypothesis at the 1% level of significance.
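
The torch-bulb test, including the finite population factor, can be sketched in Python as below. The lower-tail critical values are computed from the standard normal quantile function instead of being read from a table; variable names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

n, N = 50, 500
sample_mean, s, mu0 = 72, 10.4, 75

# Standard error with the finite population factor sqrt((N - n) / (N - 1)).
se = (s / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))
z = (sample_mean - mu0) / se

for alpha in (0.05, 0.01):
    z_critical = NormalDist().inv_cdf(alpha)   # lower-tail critical value
    print(f"alpha = {alpha}: critical z = {z_critical:.3f}, reject H0: {z < z_critical}")
```

The same sample therefore leads to rejection at the 5% level but not at the 1% level, which is why the significance level has to be fixed before the data are examined.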

Hypothesis Testing for Difference of Means:

In some decision-making situations, we may have to find whether the parameters of two populations are alike or different. For example, one may like to know whether female workers earn the same as male workers or not. In this situation, we test whether the mean incomes of males and females are the same or not.
In this case the parameter of interest is $\mu_1 - \mu_2$, where $\mu_1$ may be the mean income of the female population and $\mu_2$ the mean income of the male population. Suppose $n_1$ and $n_2$ are the sizes of the two samples, and $\sigma_1$ and $\sigma_2$ are the standard deviations of the respective populations. In the absence of the population standard deviations, we use the standard deviations of the samples for the estimation.
The standard error for the difference of means is given by

$$SE = \sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

and the test statistic is given by

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \quad \text{(in case of large samples)}$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \quad \text{(with } n_1 + n_2 - 2 \text{ degrees of freedom, in case of small samples)}$$

When the samples are presumed to be drawn from the same population whose variance $\sigma^2$ is known, the z-statistic and t-statistic for the difference in means are computed as under:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma^2 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \quad \text{(in case of large samples)}$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma^2 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \quad \text{(with } n_1 + n_2 - 2 \text{ degrees of freedom; in case of small samples)}$$

In case the population variance is not known, we estimate the standard deviation of the population as under:

$$\hat{\sigma} = \sqrt{\frac{n_1 (s_1^2 + D_1^2) + n_2 (s_2^2 + D_2^2)}{n_1 + n_2}}, \quad \text{where } D_1 = \bar{X}_1 - \bar{X}_{1,2}, \quad D_2 = \bar{X}_2 - \bar{X}_{1,2}$$

$$\bar{X}_{1,2} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$
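When small samples are presumed to be taken from the same population and the population variance is not known, we use the t-test for the difference of means. The z-statistic and t-statistic are computed as under:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}}\ \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}}\ \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}; \quad \text{with } n_1 + n_2 - 2 \text{ degrees of freedom}$$

Alternatively,

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}\ \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}};$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}\ \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}; \quad \text{with } n_1 + n_2 - 2 \text{ degrees of freedom}$$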

Exercise: The mean produce of wheat of a sample of 100 fields is 200 quintal per acre with a standard deviation of 10 quintal. Another sample of 150 fields gives a mean of 220 quintal per acre with a standard deviation of 12 quintal. Can the two samples be considered to have been taken from two populations with the same mean yield? Use a 5% level of significance.

Solution: Taking the null hypothesis that the means of the two populations do not differ, consider

$$H_0: \mu_1 = \mu_2, \qquad H_1: \mu_1 \ne \mu_2$$

It is given that
$n_1 = 100$; $n_2 = 150$;
$\bar{X}_1 = 200$; $\bar{X}_2 = 220$;
$s_1 = 10$; $s_2 = 12$.
The sample sizes are large, so we can use a two-tailed test to compare the means at the 5% level of significance. The z-variate for the 5% level of significance in a two-tailed test is 1.96. Therefore the rejection region is $R: |z| > 1.96$.
Note that the standard deviations of the populations are not given, so we use the sample standard deviations. From the given data we have

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{200 - 220}{\sqrt{\dfrac{10^2}{100} + \dfrac{12^2}{150}}} = \frac{-20}{1.4} = -14.29$$

Since the calculated z-variate is not in the acceptance region and is in fact in the rejection region, we reject the null hypothesis at the 5% level of significance.
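
The large-sample test for a difference of means can be sketched in Python as follows, using the sample standard deviations $s_1 = 10$ and $s_2 = 12$ from the solution in place of the unknown population values. The function name is an assumption made for illustration.

```python
import math

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """z-statistic for H0: mu1 = mu2 with large, independent samples."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)   # SE of the difference of means
    return (mean1 - mean2) / se

z = two_sample_z(200, 220, 10, 12, 100, 150)
print(f"z = {z:.2f}, reject H0 at 5%: {abs(z) > 1.96}")
```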

Exercise: The mean produce of wheat of a sample of 100 fields is 200 quintal per acre with a standard deviation of 10 quintal. Another sample of 150 fields gives a mean of 220 quintal per acre with a standard deviation of 12 quintal. Can the two samples be considered to have been taken from the same population whose standard deviation is 11 quintal? Use a 5% level of significance.
Solution: Assuming that both the samples are from the same population, consider the hypotheses

$$H_0: \mu_1 = \mu_2, \qquad H_1: \mu_1 \ne \mu_2$$

where
$n_1 = 100$; $n_2 = 150$;
$\bar{X}_1 = 200$; $\bar{X}_2 = 220$.
The standard deviation of the population is given as 11 quintal, i.e., $\sigma = 11$. Since the null hypothesis is that both the samples are from the same population, we can take $\sigma_1 = \sigma_2 = 11$.
The sample sizes are large, so we can use a two-tailed test to compare the means at the 5% level of significance. The z-variate for the 5% level of significance in a two-tailed test is 1.96. Therefore the rejection region is $R: |z| > 1.96$.
Further, the z-statistic is calculated as:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{200 - 220}{\sqrt{\dfrac{11^2}{100} + \dfrac{11^2}{150}}} = \frac{-20}{1.42} = -14.08$$

The calculated z-value falls in the rejection region and therefore we reject the null hypothesis at the 5% significance level.

Exercise: A simple random sampling survey in respect of the monthly earnings of semi-skilled workers in two cities gives the following information:

City    Average monthly earning    St. deviation of monthly earning    Size of sample
A       695                        40                                  200
B       710                        60                                  175

Test the hypothesis that there is no difference between the monthly earnings of workers of the two cities.

Exercise: Samples of sales in similar shops in two groups are taken for a new product with the following results:

Group    Mean sales    Variance    Size of sample
A        57            5.3         5
B        61            4.8         7

Is there any evidence that both the groups are in the same town without any difference in sales pattern? Use a 5% level of significance.
Solution: We presume that both the groups are from the same town and have the same sales pattern. In other words, we make the null hypothesis that both the groups are from a single population. Consider the hypotheses

$$H_0: \mu_1 = \mu_2, \qquad H_1: \mu_1 \ne \mu_2$$

It is given that
$n_1 = 5$; $n_2 = 7$;
$\bar{X}_1 = 57$; $\bar{X}_2 = 61$;
$s_1^2 = 5.3$; $s_2^2 = 4.8$.
Since the samples are small and the population variances are not known, we consider the following t-statistic:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}\ \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{57 - 61}{\sqrt{\dfrac{(5 - 1)(5.3) + (7 - 1)(4.8)}{5 + 7 - 2}}\ \sqrt{\dfrac{1}{5} + \dfrac{1}{7}}} = -3.055$$

At the 5% level of significance and with 5 + 7 - 2 = 10 degrees of freedom, the t-value from the table is 2.228, and therefore the rejection region is given by $R: |t| > 2.228$. Note that the calculated t-value is in the rejection region, so we reject the null hypothesis at the 5% level of significance. We may conclude that the sample groups A and B are from different populations with different sales patterns.
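
The pooled-variance t-statistic used in this solution can be sketched in Python. The inputs 5.3 and 4.8 are treated as sample variances, as in the exercise table, and the critical value 2.228 is the table value quoted above; the function name is an assumption made for illustration.

```python
import math

def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Pooled-variance t-statistic for H0: mu1 = mu2 and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se, df

t, df = pooled_t(57, 61, 5.3, 4.8, 5, 7)
print(f"t = {t:.3f} with {df} degrees of freedom; reject H0: {abs(t) > 2.228}")
```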

Exercise: Two independent samples of size 9 and 7 respectively had the following values:
Sample 1: 18 20 36 50 49 36 34 49 41
Sample 2: 29 28 26 35 30 44 46
Is the difference between the means of sample significant at 5% level of significance?

Exercise: A group of seven-week old chickens reared on a high protein diet weigh 12, 15, 11,
16, 14, 14 and 16 ounces; a second group of five chickens, similarly treated except that they
receive a low protein diet, weigh 8, 10, 14, 10 and 13 ounces. Test at 5% level whether there
is significant evidence that additional protein has increased the weight of chickens.

Hypothesis testing of Proportions & Difference between Proportions:

Recall that the standard error of proportion is given by
\[ SE = \sigma_p = \sqrt{\frac{pq}{n}} \]   (in case of infinite population)
\[ SE = \sigma_p = \sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}} \]   (in case of finite population of size N)
where p is the proportion of the items in the population, q = 1 - p, z is the standard
variate for the appropriate confidence level and n is the sample size.
If p̂ is the observed proportion, then to test the null hypothesis H0 : p = p_{H0}, we
compute the z-statistic as under:
\[ z = \frac{\hat{p} - p_{H_0}}{SE} \]
For a large population, we have
\[ z = \frac{\hat{p} - p_{H_0}}{\sqrt{\dfrac{pq}{n}}} \]
The standard error in case of difference between proportions is
\[ SE = \sigma_{p_1 - p_2} = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}} \]
where p̂1 and p̂2 are the sample proportions of samples of sizes n1 and n2 respectively.
The above formula is used whenever the samples are drawn from two heterogeneous
populations. But when we assume that the populations are similar as regards the given
attribute, we use the following formula to compute SE:
\[ SE = \sigma_{p_1 - p_2} = \sqrt{\hat{p}_0 \hat{q}_0 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \]
where
\[ \hat{p}_0 = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}, \qquad \hat{q}_0 = 1 - \hat{p}_0 \]

Exercise: A sample survey indicates that out of 3232 births, 1705 were boys and the rest were
girls. Do these figures confirm the hypothesis that the sex ratio is 50:50? Test at 5% level of
significance.

Solution: Define p as the proportion of boys among the births. We make the null hypothesis
and alternative hypothesis as under:
H0 : p = 0.5
H1 : p ≠ 0.5
The observed value of p is given by
\[ \hat{p} = \frac{1705}{3232} = 0.5275 \]
The standard error for the proportion is given by
\[ SE = \sigma_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.5)(0.5)}{3232}} = 0.0088 \]
and the z-test statistic is given by
\[ z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} = \frac{0.5275 - 0.5}{0.0088} = 3.125 \]
With reference to the null and alternative hypotheses, we apply a two-tailed test, and the
rejection region at the 5% significance level is R : |z| > 1.96. The calculated z-value lies
in the rejection region; therefore we reject the null hypothesis at the five percent
significance level and conclude that the sex ratio among the births is not 50:50.

Exercise: A certain process produces 10% defective items. A supplier of new raw material
claims that the use of his material would reduce the proportion of defectives. The random
sample of 400 units using this new material was taken out of which 34 were defective. Can
the supplier claim be accepted? Test at 1% level of significance.

Solution: Since the supplier claims that there is a decrease in defective items, we
consider the following null hypothesis and alternative hypothesis:
H0 : p = 0.10
H1 : p < 0.10
From these hypotheses, we have a one-tailed (left) test at the 1% level of significance.
The rejection region at the 1% level of significance is R : z < -2.33.
The observed sample proportion is given by
\[ \hat{p} = \frac{34}{400} = 0.085 \]
and the z-statistic from the given data is
\[ z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} = \frac{0.085 - 0.1}{\sqrt{\dfrac{(0.1)(0.9)}{400}}} = -1.00 \]
Since the computed z-value does not fall in the rejection region, we do not reject the null
hypothesis at the 1% level of significance. So at the 1% level of significance, the sample
does not support the supplier’s claim that there is a significant reduction in the
defective items.
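
The two one-sample proportion tests above follow the same arithmetic, which can be sketched in Python as below. The function name `one_proportion_z` is an illustrative choice; the standard error uses the hypothesised proportion, as in the formulas of this section.

```python
import math

def one_proportion_z(successes, n, p0):
    """z statistic for H0: p = p0, with SE = sqrt(p0 * (1 - p0) / n)."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error

# Births example: two-tailed test, reject when |z| > 1.96
z_births = one_proportion_z(1705, 3232, 0.5)

# Defectives example: left-tailed test, reject only when z is below about -2.33
z_defects = one_proportion_z(34, 400, 0.10)

print(round(z_births, 3), round(z_defects, 3))
```

Carried at full precision, the births statistic is about 3.13 (the text obtains 3.125 from rounded intermediate values), and the defectives statistic is -1.00, which is not in the left-tail rejection region.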

Exercise: The null hypothesis is that 20% of the passengers go in first class, but management
recognizes the possibility that this percentage could be more or less. A random sample of 400
passengers includes 70 passengers holding first class ticket. Can the null hypothesis be
rejected at 10% level of significance?

Exercise: A drug research experimental unit is testing two drugs newly developed to reduce
BP level. The drugs are administered to two different sets of animals. In group one, 350 of
600 animals tested respond to drug one and in group two, 260 of 500 animals tested respond
to drug two. The research unit wants to test whether there is difference between the efficiency
of the said two drugs at 5% level of significance. How will you deal with this problem?
Solution: Let p1 be the proportion of animals that respond to drug one and p2 be the
proportion of animals that respond to drug two. Here we may consider that the samples are
from different populations.
Consider the null hypothesis:
H0 : p1 = p2, i.e., the proportions of response for both the drugs are the same.
And the alternative hypothesis:
H1 : p1 ≠ p2
We have a two-tailed test for samples from different populations at the 5% significance
level. The rejection region is R : |z| > 1.96.
From the given data, we have n1 = 600, n2 = 500 and
\[ \hat{p}_1 = \frac{350}{600} = 0.583, \qquad \hat{p}_2 = \frac{260}{500} = 0.520 \]
Further, the z-value for the observed data is given by
\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_1 \hat{q}_1}{n_1} + \dfrac{\hat{p}_2 \hat{q}_2}{n_2}}} = \frac{0.583 - 0.520}{\sqrt{\dfrac{(0.583)(0.417)}{600} + \dfrac{(0.520)(0.480)}{500}}} = 2.093 \]

As calculated value is in the rejection region, we reject the null hypothesis at 5% level of
significance.

Exercise: A drug research experimental unit is testing two drugs newly developed to reduce
BP level. The drugs are administered to two different sets of animals. In group one, 350 of
600 animals tested respond to drug one and in group two, 260 of 500 animals tested respond
to drug two. The research unit wants to test whether the efficiency of the first drug is more
than the second drug at 5% level of significance. How will you deal with this problem?

Exercise: At a certain date in a large city 400 out of a random sample 500 men were found to
be smokers. After the tax on tobacco had been heavily increased, another random sample of
600 men in the same city included 400 smokers. Was the observed decrease in the proportion
of smokers significant? Test at 5% level of significance.

Solution: We start with the null hypothesis that the proportion of smokers remains unchanged
even after the heavy tax on tobacco, i.e., H0 : p1 = p2, and the alternative hypothesis that
the proportion of smokers has decreased after the tax, i.e., H1 : p1 > p2. So we have a
one-tailed (right) test. The rejection region at the 5% level of significance is R : z ≥ 1.645.

From the given data, we have
\[ \hat{p}_1 = \frac{400}{500} = 0.8, \qquad \hat{p}_2 = \frac{400}{600} = 0.667 \]
On the presumption that the populations are similar, the best estimator for the proportion is
given by
\[ \hat{p}_0 = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{500(0.8) + 600(0.667)}{500 + 600} = 0.7273 \]
\[ \hat{q}_0 = 1 - 0.7273 = 0.2727 \]
Further,
\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_0 \hat{q}_0 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} = \frac{0.8 - 0.667}{\sqrt{(0.7273)(0.2727) \left( \dfrac{1}{500} + \dfrac{1}{600} \right)}} = 4.926 \]

So the calculated value is in the rejection region and therefore we reject the null
hypothesis at the 5% level of significance. There is a significant decrease in smokers after
the increase in tax on tobacco.
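
The two standard-error formulas for a difference of proportions (separate estimates for heterogeneous populations, a pooled estimate for similar populations) can be compared side by side. The Python sketch below is illustrative only; the function name `two_proportion_z` and its `pooled` flag are assumptions made for the example.

```python
import math

def two_proportion_z(x1, n1, x2, n2, pooled):
    """z statistic for p1 - p2 using either the pooled or the unpooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    if pooled:
        p0 = (x1 + x2) / (n1 + n2)  # same as (n1*p1 + n2*p2) / (n1 + n2)
        standard_error = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    else:
        standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / standard_error

z_drugs = two_proportion_z(350, 600, 260, 500, pooled=False)   # two-tailed, 1.96
z_smokers = two_proportion_z(400, 500, 400, 600, pooled=True)  # right-tailed, 1.645
print(round(z_drugs, 3), round(z_smokers, 3))
```

At full precision the values are roughly 2.11 and 4.94; the text's 2.093 and 4.926 come from proportions rounded to three decimals, and the conclusions are the same either way.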

Exercise: There are 100 students in a university college and in the whole university, inclusive
of this college; the number of students is 2000. In a random sample study of 20 were found
smokers in the college and the proportion of smokers in the university is 0.05. Is there a
significant difference between the proportions of smokers in the college and in the
university? Test at 5% level.

CHI-SQUARE TEST

Chi-square Distribution: The chi-square distribution is used when we deal with collections
of values that involve sums of squares. It is defined for positive values of the random
variable, and the distribution curve is not symmetric. The distribution depends on one
further parameter, the degrees of freedom (n-1), where n is the sample size.
Chi-square, denoted χ², is a statistical measure used in the context of sampling analysis
for comparing a sample variance to a theoretical variance. As a non-parametric test, it can
be used to determine whether categorical data show dependency or whether two classifications
are independent. It can also be used to make comparisons between theoretical populations and
actual data when categories are used. Thus, the chi-square test is applicable to a large
number of problems, such as:
(i) testing the goodness of fit,
(ii) testing the significance of association between two attributes, and
(iii) testing the homogeneity or the significance of population variance.

Chi-square Test for testing significance of Population Variance:

We can use the test to judge if a random sample has been drawn from a normal population
with mean µ and with a specified variance σ². Given a sample of size ‘n’ and the sample
variance s², we observe that the quantity
\[ \chi^2 = \frac{s^2}{\sigma^2} (n - 1) \]
has the chi-square distribution with n-1 degrees of freedom. To test the null hypothesis
that the population variance equals the specified value σ², we compare the calculated χ²
value against the table value at n-1 degrees of freedom and the given level of significance.
If the calculated value is higher than the table value, then we reject the null hypothesis;
otherwise we accept the null hypothesis.

Exercise: The weights of ten students are as follows:
S.No:        1   2   3   4   5   6   7   8   9   10
Weight (kg): 38  40  45  53  47  43  55  48  52  49
Can we say that the variance of the distribution of weight of all students from which the
above sample of 10 students was drawn is equal to 20 kg²? Test at 5% level of significance.

Solution:
First we find the variance of the given sample data.
S.No     X_i (weight)    (X_i − X̄)    (X_i − X̄)²
1        38              -9            81
2        40              -7            49
3        45              -2            4
4        53              6             36
5        47              0             0
6        43              -4            16
7        55              8             64
8        48              1             1
9        52              5             25
10       49              2             4
Total    470                           280
\[ \bar{X} = \frac{470}{10} = 47 \]
\[ s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{280}{9} = 31.11 \]
Let the null hypothesis be H0 : σ² = 20. To test the hypothesis, we compute
\[ \chi^2 = \frac{s^2}{\sigma^2} (n - 1) = \frac{31.11}{20} (10 - 1) = 14.00 \]
The table value of χ² at 10 – 1 = 9 degrees of freedom and 5% level of significance is 16.92.
Since the calculated value is less than the table value, we accept the null hypothesis at the
5% level of significance. In other words, we can say that the sample is taken from a
population with variance 20 kg².
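
The variance test above reduces to a few sums, sketched here in Python. The function name `chi_square_variance` is illustrative; the table value 16.92 is the one quoted in the text.

```python
def chi_square_variance(sample, sigma_squared):
    """Return (sample variance, chi-square statistic) for H0: variance = sigma_squared."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in sample)
    sample_variance = sum_sq_dev / (n - 1)
    chi_square = sample_variance * (n - 1) / sigma_squared
    return sample_variance, chi_square

weights = [38, 40, 45, 53, 47, 43, 55, 48, 52, 49]
s2, chi2 = chi_square_variance(weights, 20)
accept_h0 = chi2 < 16.92  # table value for 9 degrees of freedom at the 5% level
print(round(s2, 2), round(chi2, 2), accept_h0)
```

Because (n - 1) cancels, the statistic is simply the sum of squared deviations divided by the hypothesised variance, here 280 / 20 = 14.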

Exercise: A sample of 10 is drawn randomly from a certain population. The sum of squared
deviation from the mean of given sample is 50. Test the hypothesis that the variance of the
population is 5 at 5% level of significance.

Chi-Square Test as Non-Parametric Test: This test can be used for (i) testing goodness of
fit and (ii) testing independence of data.
Testing goodness of fit: The chi-square test enables us to see how well an assumed
theoretical distribution fits the observed data. When some theoretical distribution is fitted
to the given data, we are always interested in knowing how well this distribution fits the
observed data.
The fit is considered to be good, in other words, the divergence between the observed and
expected frequencies is attributable to fluctuations of sampling, if the calculated value of
χ² is less than the table value for a certain level of significance. Otherwise, the fit is
not considered to be a good one.

Test of independence: The χ² test enables us to explain whether or not two attributes are
associated. For instance, we may be interested in knowing whether a new medicine is effective
in controlling fever or not; in such a case the χ² test helps us in deciding the issue.
In such a situation, we proceed with the null hypothesis that the two attributes are
independent, i.e., the new medicine is not effective in controlling fever. On this basis we
calculate the expected frequencies and then work out the value of χ². If the calculated χ²
value is less than the table value for the given degrees of freedom, we accept the null
hypothesis; otherwise, we reject it.

We calculate
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
where O is the observed frequency and E is the expected frequency.

Degree of Freedom:
If there are ‘n’ number of frequency classes and there is one independent constraint, then the
degree of freedom is given by ‘n-1’.
When we have two independent constraints (bivariate case) with ‘r’ rows and ‘c’ columns,
then the degree of freedom is given by (r-1)(c-1).
For instance, in the following data obtained during the outbreak of smallpox:
                  Attacked    Not attacked    Total
Vaccinated        31          469             500
Not vaccinated    185         1315            1500
Total             216         1784            2000
The degree of freedom is (2-1)(2-1) = 1

Exercise: Genetic theory states that children having one parent of blood type A and the other
of blood type B will always be one of the three types A, AB, B, and that the proportion of
the three types will on average be 1 : 2 : 1. A report states that out of 300 children having
one A parent and one B parent, 30 percent were found to be of type A, 45 percent type AB and
the remainder type B. Test the hypothesis by the χ² test.

Solution: The observed frequencies of types A, AB and B are 90, 135 and 75 respectively
(in the proportion 30 : 45 : 25). Theoretically, they should be in the proportion 1 : 2 : 1.
Therefore the expected frequencies of types A, AB and B are 75, 150 and 75 respectively. We
use the chi-square test to verify the goodness of fit of the given theoretical distribution.
Let the null hypothesis be that the given data fit the given distribution. We calculate χ²
as under:
Type    Observed         Expected         (O − E)    (O − E)²    (O − E)²/E
        frequency (O)    frequency (E)
A       90               75               15         225         3
AB      135              150              -15        225         1.5
B       75               75               0          0           0

χ² = 3 + 1.5 + 0 = 4.5
Degree of freedom = 3 – 1 = 2
Table value of χ² for 2 degrees of freedom at 5% level of significance is 5.991

The calculated χ² value is less than the table value. Therefore we accept the null hypothesis
that on average types A, AB and B stand in the proportion 1 : 2 : 1.
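
The goodness-of-fit calculation above can be sketched in Python as follows. The helper name `chi_square_gof` is an illustrative choice; expected frequencies are derived from the 1 : 2 : 1 ratio and the total of 300 children.

```python
def chi_square_gof(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [90, 135, 75]   # types A, AB, B
ratio = [1, 2, 1]          # theoretical proportions
total = sum(observed)
expected = [total * r / sum(ratio) for r in ratio]

chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1
accept_h0 = chi2 < 5.991   # table value for 2 degrees of freedom at the 5% level
print(expected, chi2, df, accept_h0)
```

The same helper applies to any goodness-of-fit exercise once the expected frequencies are written down.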

Exercise: A dice is rolled 240 times and observed frequencies are given below.
Face observed 1 2 3 4 5 6
Frequency observed 49 35 32 46 49 29
Using χ 2 test verify whether the dice is unbiased. Test at 5% level of significance.

Exercise: In a city a survey was carried out of 200 families, each with 5 children. The
distribution shown below was produced.
(Boys, Girls) (5, 0) (4, 1) (3, 2) (2, 3) (1, 4) (0, 5)
No of families 11 35 69 55 25 5
Test the null hypothesis that the observed frequencies are consistent with male and female
births being equally probable, assuming a binomial distribution, at a level of significance
of 0.05.
Solution: Assume that male and female births are equally probable, that is, p = q = 0.5. Note
that the probability of having k boys among 5 children in a family is given by
\[ \binom{5}{k} (0.5)^k (0.5)^{5-k} \]
(B, G)    Observed         Prob       Expected frequency (E)     (O − E)    (O − E)²    (O − E)²/E
          frequency (O)               (Prob x 200, rounded)
(5, 0)    11               0.03125    6                          5          25          4.167
(4, 1)    35               0.15625    31                         4          16          0.516
(3, 2)    69               0.3125     63                         6          36          0.571
(2, 3)    55               0.3125     63                         -8         64          1.016
(1, 4)    25               0.15625    31                         -6         36          1.161
(0, 5)    5                0.03125    6                          -1         1           0.167

\[ \chi^2 = \sum \frac{(O - E)^2}{E} = 7.598 \]
Degree of freedom = 6 – 1 = 5
Table value of χ² for 5 degrees of freedom at 5% level of significance is 11.1.

Since the calculated χ² is less than the table value, we accept the null hypothesis that the
observed frequencies are consistent with male and female births being equally probable.
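
A short Python sketch of the binomial fit above follows. It computes the probabilities with `math.comb`, then evaluates χ² twice: once with the whole-number expected frequencies used in the table and once with the unrounded ones. The variable names are illustrative.

```python
import math

families = 200
observed = [11, 35, 69, 55, 25, 5]   # families with 5, 4, 3, 2, 1, 0 boys
boys = [5, 4, 3, 2, 1, 0]

# Binomial probability of k boys among 5 children when p = q = 0.5
probabilities = [math.comb(5, k) * 0.5**k * 0.5**(5 - k) for k in boys]
expected_exact = [families * p for p in probabilities]
expected_rounded = [6, 31, 63, 63, 31, 6]   # whole-number values used in the table

def chi_square(obs, exp):
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

chi2_rounded = chi_square(observed, expected_rounded)
chi2_exact = chi_square(observed, expected_exact)
print(round(chi2_rounded, 3), round(chi2_exact, 3))
```

Both versions stay well below the table value of about 11.1 for 5 degrees of freedom, so the rounding of expected frequencies does not affect the conclusion.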

Exercise: Two research groups classified some people in income groups on the basis of
sampling studies. The results are as follows:
Investigator Income groups Total
Poor Middle Rich
A 160 30 10 200
B 140 120 40 300
Total 300 150 50 500
Show that the sampling technique of at least one research group is defective.
Solution: Let us make the hypothesis that the techniques adopted by both groups are similar
and the data are similar.
Expected frequencies are
Investigator    Income groups              Total
                Poor    Middle    Rich
A               120     60        20       200
B               180     90        30       300
Total           300     150       50       500
\[ \chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(160 - 120)^2}{120} + \frac{(30 - 60)^2}{60} + \frac{(10 - 20)^2}{20} + \frac{(140 - 180)^2}{180} + \frac{(120 - 90)^2}{90} + \frac{(40 - 30)^2}{30} = 55.56 \]
Degree of freedom = (3-1)(2-1) = 2
Table value of χ² for 2 degrees of freedom at 5% level of significance is 5.991. Since the
calculated value is bigger than the table value, we reject the null hypothesis at the 5% level
of significance. The technique adopted by at least one of the two groups in data collection
is defective.
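
Expected frequencies in a contingency table are (row total x column total) / grand total, which is how the table of expected frequencies above is obtained. The Python sketch below carries out that step and the χ² sum; the function name `chi_square_independence` is illustrative.

```python
def chi_square_independence(table):
    """Return (expected frequencies, chi-square, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi_square = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi_square, df

income = [[160, 30, 10],    # investigator A: poor, middle, rich
          [140, 120, 40]]   # investigator B
expected, chi2, df = chi_square_independence(income)
print(expected, round(chi2, 2), df)
```

The same function can be pointed at the smallpox and home-condition tables in the exercises that follow.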

Exercise: The following data is obtained during the outbreak of smallpox:
                  Attacked    Not attacked    Total
Vaccinated        31          469             500
Not vaccinated    185         1315            1500
Total             216         1784            2000
Test the effectiveness of vaccination in preventing the attack from the smallpox.

Exercise: Consider the following information regarding home condition and children’s
condition:
Condition of child Condition of home Total
Clean Dirty
Clean 70 50 120
Fairly Clean 80 20 100
Dirty 35 45 80
Total 185 115 300
State whether the two attributes viz., condition of home and condition of child are
independent. Use chi-square test for the purpose.

Conditions for Application of Chi-square test:

The following conditions should be satisfied before the χ² test is applied:
(i) The observations recorded and used are collected on a random basis.
(ii) All the items in the sample must be independent.
(iii) No group should contain very few items.
(iv) The overall number of items must also be reasonably large.
(v) The constraints must be linear, i.e., constraints which involve linear equations in
the cell frequencies of a contingency table.

ANOVA
Consider a case of three varieties of wheat, each grown on four plots and production of wheat
for each kind of wheat per acre land in each kind of plot is given below:
Plot of land Variety of wheat
A B C
1 6 5 5
2 7 5 4
3 3 3 3
4 8 7 4
Researcher may be interested if there is significant difference between varieties of wheat
and/or varieties of plots.
ANOVA technique is very useful in making analysis in the above context.

ANOVA is an important technique in all those situations where we want to compare more than
two populations, such as in comparing the yield of crop from several varieties of seeds, the
mileage of several automobiles and so on. In circumstances of this kind, one generally does
not want to consider all possible combinations of two populations at a time, since the
number of tests required before arriving at a decision would be large.

The basic principle of ANOVA is to test for differences among the means of the populations
by examining the amount of variation within each of the samples, relative to the amount of
variation between the samples. In terms of variation within a given population, it is
assumed that the values of X differ from the mean of this population only because of random
effects, i.e., there are influences on X which are unexplainable, whereas in examining
differences between populations we assume that the difference between the mean of the jth
population and the grand mean is attributable to what is called a ‘specific factor’ or what
is technically described as the treatment effect. Thus while using ANOVA, we assume that each
of the samples is drawn from a normal population and that each of these populations has the
same variance. We also assume that all the factors other than the one or more being tested
are effectively controlled. In other words, we assume the absence of other factors that
might affect our conclusions concerning the factor(s) to be studied.

In this case we make two estimates of the population variance, namely, one based on the
between-samples variance and the other based on the within-samples variance. The two
estimates of population variance are then compared with the F-test, wherein we work out
\[ F = \frac{\text{Estimate of population variance based on between-samples variance}}{\text{Estimate of population variance based on within-samples variance}} \]
This value of F is compared to the F-limit for the given degrees of freedom. If the
calculated F value is more than the F-limit value, we may say that there are significant
differences between the sample means.

ANOVA Technique: One-way ANOVA

Under one-way ANOVA, we consider just one factor and then observe that the reason for
said factor to be important is that several possible types of samples can occur within that
factor. We then determine if there are differences within that factor.
(i) Obtain the sample means \( \bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k \), where ‘k’ is the number of samples.
(ii) Work out the mean of the sample means by the formula
\[ \bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k}{k} \]
(iii) Calculate the sum of squares between the samples by the formula
\[ SS_{between} = n_1 (\bar{X}_1 - \bar{\bar{X}})^2 + n_2 (\bar{X}_2 - \bar{\bar{X}})^2 + \cdots + n_k (\bar{X}_k - \bar{\bar{X}})^2 \]
(iv) Compute the mean square between the samples by the formula
\[ MS_{between} = \frac{SS_{between}}{k - 1} \]
where k-1 represents the degrees of freedom between the samples.
(v) Calculate the sum of squares within samples by the formula
\[ SS_{within} = \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \cdots + \sum (X_{ki} - \bar{X}_k)^2 \]
(vi) Compute the mean square within samples by
\[ MS_{within} = \frac{SS_{within}}{n - k} \]
where n is the total number of items in all the samples and k is the number of samples.
(vii) Compute the sum of squares of deviations for total variance by
\[ SS_{total} = \sum (X_{ij} - \bar{\bar{X}})^2 = SS_{between} + SS_{within} \]
The degrees of freedom for total variance are n-1 = (k-1)+(n-k).
(viii) Finally, the F-ratio is computed by the formula
\[ F\text{-ratio} = \frac{MS_{between}}{MS_{within}} \]
If the calculated F-ratio is greater than the F-value for the given degrees of
freedom and the significance level, then we reject the null hypothesis and
conclude that the differences between the means are significant.
ANOVA TABLE (ONE-WAY ANALYSIS)
Source of variation | Sum of Squares (SS) | Degree of Freedom | Mean Square (MS) | F-Ratio
Between the samples | \( n_1 (\bar{X}_1 - \bar{\bar{X}})^2 + n_2 (\bar{X}_2 - \bar{\bar{X}})^2 + \cdots + n_k (\bar{X}_k - \bar{\bar{X}})^2 \) | (k-1) | SS between / (k-1) | MS between / MS within
Within samples | \( \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \cdots + \sum (X_{ki} - \bar{X}_k)^2 \) | (n-k) | SS within / (n-k) |
Total | \( \sum (X_{ij} - \bar{\bar{X}})^2 \) | (n-1) | |

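Short Cut Method:

Compute \( T = \sum X_{ij} \) (the total of all items), with \( T_j \) the total of the jth
sample and \( n_j \) its size, and further
Source of variation | Sum of Squares (SS) | Degree of Freedom | Mean Square (MS) | F-Ratio
Between the samples | \( \sum \dfrac{(T_j)^2}{n_j} - \dfrac{(T)^2}{n} \) | (k-1) | SS between / (k-1) | MS between / MS within
Within samples | \( \left( \sum X_{ij}^2 - \dfrac{(T)^2}{n} \right) - \left( \sum \dfrac{(T_j)^2}{n_j} - \dfrac{(T)^2}{n} \right) \) | (n-k) | SS within / (n-k) |
Total | \( \sum X_{ij}^2 - \dfrac{(T)^2}{n} \) | (n-1) | |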

Exercise: Setup an ANOVA table for the following per acre production for three varieties of
wheat, each grown on four plots and state if the variety difference is significant:
Plot of land Variety of wheat
A B C
1 6 5 5
2 7 5 4
3 3 3 3
4 8 7 4
Solution:
Plot of land Variety of wheat
A B C
1 6 5 5
2 7 5 4
3 3 3 3
4 8 7 4
Total 24 20 16
n1 = n2 = n3 = 4
n = 12
\[ \bar{X}_1 = \frac{24}{4} = 6; \quad \bar{X}_2 = \frac{20}{4} = 5; \quad \bar{X}_3 = \frac{16}{4} = 4 \]
\[ \bar{\bar{X}} = \frac{6 + 5 + 4}{3} = 5 \]
\[ SS_{between} = n_1 (\bar{X}_1 - \bar{\bar{X}})^2 + n_2 (\bar{X}_2 - \bar{\bar{X}})^2 + n_3 (\bar{X}_3 - \bar{\bar{X}})^2 = 4(6 - 5)^2 + 4(5 - 5)^2 + 4(4 - 5)^2 = 8 \]
\[ SS_{within} = \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \sum (X_{3i} - \bar{X}_3)^2 = 14 + 8 + 2 = 24 \]

Source of variation    SS    df            MS             F-Ratio
Between the samples    8     3 - 1 = 2     8/2 = 4.00     4.00/2.67 = 1.5
Within samples         24    12 - 3 = 9    24/9 = 2.67
Total                  32    12 - 1 = 11
F-limit at 5% level of significance: F(2, 9) = 4.26

Since the calculated value is less than the table value, we accept the null hypothesis that
the difference between outputs due to variety of wheat is not significant.
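
The one-way procedure can be traced end to end with a short Python sketch. The function name `one_way_anova` is illustrative. The grand mean is taken as the total of all items divided by n, which coincides with the mean of the sample means here because the three samples are of equal size.

```python
def one_way_anova(groups):
    """Return (SS between, SS within, F-ratio) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ss_between, ss_within, ms_between / ms_within

wheat = [[6, 7, 3, 8],   # variety A on plots 1 to 4
         [5, 5, 3, 7],   # variety B
         [5, 4, 3, 4]]   # variety C
ss_between, ss_within, f_ratio = one_way_anova(wheat)
significant = f_ratio > 4.26  # F(2, 9) limit at the 5% level
print(ss_between, ss_within, round(f_ratio, 2), significant)
```

The F-ratio of 1.5 is below the F-limit, matching the conclusion that the variety difference is not significant.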

Exercise: A manager of a firm wishes to test whether the salesmen of his firm (A, B, and C)
tend to make sales of same size. During a week there have been 14 sale calls. A made 5, B
made 4 and C made 5 calls respectively. Following are the sales data for the week of 3
salesmen.
A: 500 400 700 800 600
B: 300 700 400 600
C: 500 300 500 400 300
Perform ANOVA and draw your conclusion at 5% level of significance (F(2, 11)=3.98)

Hint : Use scaling factor (X-500)/100

TWO-WAY ANOVA
Two-way ANOVA technique is used when the data are classified on the basis of two factors.
For example:
(i) The agricultural output may be classified on the basis of different varieties of
seeds and also on the basis of different fertilizers used.
(ii) A business firm may have its sales data classified on the basis of different
salesmen and also on the basis of sales in different regions.
(iii) In a factory, the various units of a product produced during a certain period may
be classified on the basis of different varieties of machines and also on the basis of
different grades of labor.
Two-way designs may or may not have repeated measurements for each combination of factors.
The ANOVA technique is a little different in the case of repeated measurements, where we
also compute the interaction variation.

Two-way ANOVA when values are not repeated.

Source of variation | Sum of Squares (SS) | Degree of Freedom | Mean Square (MS) | F-Ratio
Between column treatment | \( \sum \dfrac{(T_j)^2}{n_j} - \dfrac{(T)^2}{n} \) | (c – 1) | SS between columns / (c – 1) | MS between columns / MS Residual
Between row treatment | \( \sum \dfrac{(T_i)^2}{n_i} - \dfrac{(T)^2}{n} \) | (r – 1) | SS between rows / (r – 1) | MS between rows / MS Residual
Residual error | Total SS – (SS between columns + SS between rows) | (c-1)(r-1) | SS Residual / ((c-1)(r-1)) |
Total | \( \sum X_{ij}^2 - \dfrac{(T)^2}{n} \) | c.r - 1 | |

Exercise: Set up an ANOVA table for the following two-way design results:
Per acre production of wheat
Variety of     Variety of wheat
Fertilizer     A     B     C
1              6     5     5
2              7     5     4
3              3     3     3
4              8     7     4
Total          24    20    16
Also state whether variety differences are significant at the 5% level of significance.
Solution:
Variety of      Variety of wheat     Row total
Fertilizer      A     B     C
1               6     5     5        16
2               7     5     4        16
3               3     3     3        9
4               8     7     4        19
Column Total    24    20    16       60

n = 12; T = 60
\[ \frac{T^2}{n} = \frac{60^2}{12} = 300 \]
\[ SS_{total} = \sum X_{ij}^2 - \frac{(T)^2}{n} = (6^2 + 5^2 + 5^2 + 7^2 + 5^2 + 4^2 + 3^2 + 3^2 + 3^2 + 8^2 + 7^2 + 4^2) - 300 = 32 \]
\[ SS_{between\ column\ treatment} = \sum \frac{(T_j)^2}{n_j} - \frac{(T)^2}{n} = \frac{24^2}{4} + \frac{20^2}{4} + \frac{16^2}{4} - 300 = 8 \]
\[ SS_{between\ row\ treatment} = \sum \frac{(T_i)^2}{n_i} - \frac{(T)^2}{n} = \frac{16^2}{3} + \frac{16^2}{3} + \frac{9^2}{3} + \frac{19^2}{3} - 300 = 18 \]
SS Residual = Total SS – (SS between columns + SS between rows)
= 32 – (8 + 18) = 6
ANOVA Table:
Source of variation         Sum of Squares (SS)    Degree of Freedom    Mean Square (MS)    F-Ratio    5% level F-limit
Between column treatment    8                      2                    4                   4          F(2, 6) = 5.14
Between row treatment       18                     3                    6                   6          F(3, 6) = 4.76
Residual error              6                      6                    1
Total                       32                     11
The F-ratio due to column treatment (4) is less than the table value (5.14) at the 5% level
of significance. Therefore the difference among mean yields due to column treatment
(varieties of wheat) is not significant.
But the F-ratio in the case of row treatment (6) is larger than the table value (4.76) at the
5% level of significance. Therefore the difference among mean yields due to row treatment
(varieties of fertilizer) is significant.
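
The shortcut computation for the two-way design without repeated values can be sketched in Python as below. The function name `two_way_anova` and the dictionary keys are illustrative; rows of the input are the row treatments (fertilizers) and columns are the column treatments (wheat varieties).

```python
def two_way_anova(data):
    """Two-way ANOVA without replication; data is a list of equal-length rows."""
    r, c = len(data), len(data[0])
    n = r * c
    grand_total = sum(sum(row) for row in data)
    correction = grand_total ** 2 / n                       # T^2 / n
    ss_total = sum(x ** 2 for row in data for x in row) - correction
    ss_cols = sum(sum(col) ** 2 / r for col in zip(*data)) - correction
    ss_rows = sum(sum(row) ** 2 / c for row in data) - correction
    ss_resid = ss_total - (ss_cols + ss_rows)
    ms_resid = ss_resid / ((r - 1) * (c - 1))
    return {
        "ss_total": ss_total,
        "ss_cols": ss_cols,
        "ss_rows": ss_rows,
        "ss_resid": ss_resid,
        "f_cols": (ss_cols / (c - 1)) / ms_resid,
        "f_rows": (ss_rows / (r - 1)) / ms_resid,
    }

wheat = [[6, 5, 5],
         [7, 5, 4],
         [3, 3, 3],
         [8, 7, 4]]
result = two_way_anova(wheat)
print({key: round(value, 2) for key, value in result.items()})
```

Comparing `f_cols` with F(2, 6) = 5.14 and `f_rows` with F(3, 6) = 4.76 gives the two conclusions stated above.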

Exercise: The following data gives the number of units produced per day by 5 workers using
4 different machines.
M1 M2 M3 M4
A: 45 42 48 38
B: 40 32 50 34
C: 43 36 44 40
D: 36 38 46 36
E: 41 37 47 37
Test if the production is equal with respect to machines and with respect to workers.
Answer:
Source of variation         Sum of Squares (SS)    Degree of Freedom    Mean Square (MS)    F-Ratio    5% level F-limit
Between column treatment    335                    3                    111.67              14.97      F(3, 12) = 3.49
Between row treatment       48.5                   4                    12.125              1.626      F(4, 12) = 3.26
Residual error              89.5                   12                   7.458
Total                       473                    19

Limitations of Test of Hypothesis:

(i) The tests should not be used in a mechanical fashion. It should be kept in view that
testing is not decision-making itself; the tests are only useful aids for
decision-making. Hence proper interpretation of statistical evidence is important to
intelligent decisions.
(ii) Tests do not explain the reasons why a difference exists. They simply indicate
whether the difference is due to fluctuations of sampling or to other reasons, but
they do not tell us which other reason(s) cause the difference.
(iii) Results of significance tests are based on probabilities and as such cannot be
expressed with full certainty. When a test shows that a difference is statistically
significant, it simply suggests that the difference is probably not due to chance.
(iv) Statistical inferences based on significance tests cannot be said to be entirely
correct evidence concerning the truth of a hypothesis. This is especially so in the
case of small samples, where the probability of drawing erroneous inferences is
generally higher. For greater reliability, the size of the samples should be
sufficiently enlarged.
