S28 Statistical Inferences & Hypothesis Testing (NEW)

CHAPTER 28
STATISTICAL INFERENCES
AND HYPOTHESIS TESTING

28.1 INTRODUCTION:
Induction and deduction are the two basic methods of logical reasoning. In the inductive method,
we draw more general conclusions on the basis of a few observed individual facts. In the
deductive method, on the other hand, we move from a commonly observed phenomenon to
specific individual conclusions. Of the two, the inductive method plays the crucial role in
inferential statistics. The process of inferring information about the population from sample
information is often called the theory of inference. Since inductive reasoning moves from
specific individual facts to more general conclusions, the conclusions so obtained may not be
100% true; we can never be sure that inductive inferences are flawless. However, if the
individual results are based on random sampling, then probability theory lets us attach a certain
degree of confidence to the inductive result. Thus, the theory of statistical inference may be
described as the application of the theory of probability to inductive reasoning. Such reasoning
allows us to draw broad-based conclusions about the population.
28.2 POPULATION VERSUS SAMPLING:
The word population has a technical meaning in statistics. It does not necessarily refer to people
in general. It is simply the totality of the objects under consideration: it may be a collection of
people, of students in an institution, or of teachers in a university. The population in this sense
is sometimes also called the universe. A population with a finite number of objects, such as the
students in a college, is called a finite population. A population with an infinite number of
objects, such as the stars in the sky, is called an infinite population.
To know something about the population, the best way is to undertake a study that covers
all the units of the population under reference. Such a detailed study is often called the survey
method. If the population is finite, such a study is viable in practice. For an infinite
population such a detailed study is not only impossible, but any attempt at it also gives rise to a
large number of human errors, often called non-sampling errors. Under such circumstances we
normally take samples and derive conclusions about the population from them. Such a task is not
flawless all the time. For example, the sample mean need not equal the population mean,
however sophisticated the sampling may be. The difference between the population value and
the corresponding sample value is called the sampling error. The only prescription needed is the
use of probability sampling while taking samples.
Quantitative Methods for Management
28.3 THEORY OF ESTIMATION AND CHARACTERISTICS OF A GOOD
ESTIMATOR
The estimation of population parameters like the mean, median, mode, variance, correlation
coefficient, regression coefficients etc. from their respective sample statistics with reliable
accuracy is often called the theory of estimation.
In practice, the sample mean is the most frequently used estimate of the population mean because
it satisfies several desirable properties that statisticians expect. Some of these properties are:
1. Linearity
2. Unbiasedness
3. Efficiency
4. Consistency
Let us now discuss each one of them in brief.
1. Linearity:
An estimator is said to be linear if it is a linear function of the given sample observations. The
sample mean is obviously linear because
$\bar{X} = \frac{\sum X_i}{n} = \frac{1}{n}(X_1 + X_2 + \dots + X_n)$
is a linear function of the n given observations.
2. Unbiasedness:
If the average of all possible sample means of a given population equals the population mean, then
the sample mean is an unbiased estimate of the population mean. For example, $\bar{X}$ is
an unbiased estimator of $\mu$, because $E(\bar{X}) = \mu$ always holds. In figure 1 we show the
sampling distribution of the sample means obtained by considering all possible samples. According
to the central limit theorem, such a distribution is approximately normal, as shown in the figure.
Accordingly, $E(\bar{X}) = \mu_X = 12$, the population mean in the illustrative diagram.
[Figure: Sampling Distribution of mean. Horizontal axis: Sample mean (340 to 460); vertical axis: Frequency (0 to 25); the centre of the distribution is marked $E(\bar{X}) = \mu_x$.]
Fig 28.1
[Figure: Sampling distribution, variance and bias. Horizontal axis: 9 to 15; vertical axis: 0 to 0.7.]
3. Efficiency:
The property of unbiasedness, though essential, is not sufficient for a good estimator. Suppose we
have two unbiased sample means for estimating the population mean, one from a sample of size 10
and another from a sample of size 20. Clearly, the sample mean from the sample of size 20 gives us
a better estimate of the population mean. When two unbiased estimators are available, the
efficiency rule tells us to choose the one having the least variance. In figure 2, the sample mean
is clearly more efficient in estimating the population mean than the sample median, because the
scatter of the sample means is less than the scatter of the sample medians. Remember that for a
normal distribution the mean, median and mode are equal.
[Figure: Distribution of sample means with $\mu_x$ = 400, n = 10, and distribution of sample means with $\mu_x$ = 400, n = 20. Horizontal axis: X (340 to 460); vertical axis: f(X) (0 to 0.045).]
Fig 28.2
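The efficiency rule can be checked with a small simulation. The sketch below is only illustrative: it assumes a normal population with mean 400 (the value shown in Fig 28.2) and a hypothetical standard deviation of 30, and the function name `spread_of_estimator` is introduced here for the example. It compares the variance of the sample mean for n = 10 and n = 20, and the variance of the sample mean with that of the sample median for n = 20.

```python
import random
import statistics

def spread_of_estimator(estimator, n, reps=4000, mu=400.0, sigma=30.0, seed=1):
    """Variance of an estimator over `reps` random samples of size n.

    mu = 400 follows Fig 28.2; sigma = 30 is an assumed value.
    """
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.pvariance(estimates)

var_mean_10 = spread_of_estimator(statistics.mean, n=10)
var_mean_20 = spread_of_estimator(statistics.mean, n=20)
var_median_20 = spread_of_estimator(statistics.median, n=20)

print(f"variance of sample mean,   n = 10: {var_mean_10:.1f}")
print(f"variance of sample mean,   n = 20: {var_mean_20:.1f}")
print(f"variance of sample median, n = 20: {var_median_20:.1f}")
```

Under these assumptions the mean based on 20 observations should show the smallest variance of the three, which is what the efficiency rule would lead us to expect.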
4. Consistency:
Suppose X is a normal variate having mean $\mu$ and variance $\sigma^2$, that is,
$X \sim N(\mu, \sigma^2)$. Now let us draw a random sample of size n from this normal population
and calculate the sample mean by using the formula $\bar{X} = \frac{\sum X_i}{n}$. The sample
mean so obtained from a single sample need not equal the population mean. However, as n tends
to infinity the sample mean tends to the population mean. This property is often called the
consistency property of the statistic concerned.
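Consistency can be seen numerically by letting the sample size grow. The following is a minimal sketch, assuming a normal population with the hypothetical values $\mu$ = 60 and $\sigma$ = 17; the helper `sample_mean` is a name chosen for this illustration only.

```python
import random
import statistics

MU, SIGMA = 60.0, 17.0  # assumed population mean and standard deviation

def sample_mean(n, seed=7):
    """Mean of one random sample of size n drawn from N(MU, SIGMA^2)."""
    rng = random.Random(seed)
    return statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))

means_by_n = {n: sample_mean(n) for n in (10, 1_000, 100_000)}

for n, xbar in means_by_n.items():
    print(f"n = {n:>7}: sample mean = {xbar:.3f}, distance from mu = {abs(xbar - MU):.3f}")
```

A single small sample may land some distance from $\mu$, while the mean of the largest sample should lie very close to it.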
Keeping the above properties in mind, the researcher must select appropriate
estimators for the study. With this brief discussion we now proceed to explain the methods of
estimating the population mean and population proportion with reasonable accuracy.
28.4 PARAMETERS AND STATISTICS:
Summary measures of the population, like the mean, median, mode, correlation coefficient and
regression coefficients, are called parameters. The corresponding measures computed from the
sample are called statistics. For example, the population mean, denoted by $\mu$, is a parameter;
the mean of the sample, denoted by $\bar{X}$, is a statistic.
28.5 APPROACHES TO TESTING HYPOTHESIS
There are two basic approaches to the theory of hypothesis testing. In the classical approach
sampling plays the crucial role in hypothesis testing. The Bayesian approach to statistics, on the
other hand, is of more recent origin. It is simply an extension of the classical sampling
theory approach to the ever-growing field of statistics. This approach places more emphasis on
subjective experience than on the data already collected on the subject, and it stresses new
information in addition to the information already gathered. The revised estimates made on the
basis of new information are often called the posterior distribution. When further new
information comes in, the estimate gets revised again. The Bayesian approach is based on the
centuries-old Bayes theorem, which calls for a revision of probability whenever additional
information comes in. It has emerged as an alternative hypothesis-testing tool since the mid 1950s.
28.6 STATISTICAL SIGNIFICANCE
By using samples alone we can neither accept nor reject the stated hypothesis outright. However
sophisticated the sampling may be, its findings will surely differ from the population in some
respect or other owing to sampling error. Therefore, before accepting results based on sampling,
one must judge whether the observed difference between the sample information and the
population is statistically significant or not. If the observed difference between the sample and
the population proves to be statistically insignificant, we accept the hypothesis that the
sample resembles the population. On the other hand, if the difference proves to be
statistically significant, we reject the hypothesis and state that the sample does not resemble the
population.
28.7 HYPOTHESIS AND ITS TYPES
In general, a hypothesis is a theoretical proposition capable of empirical verification. It may be
viewed as a statement about an event, which may or may not be true. A distinction is often made
between maintained and testable hypotheses. Assumptions that we normally use in theory as a
simplifying device are not tested empirically. They are called maintained hypotheses. For
example, when we formulate demand theory, we assume that the tastes and preferences of the
consumer remain constant. We do not, in general, test this hypothesis. It is used in the theory
merely as a simplifying device.
The testable hypothesis, on the other hand, states that there is no difference
between the sample statistic and the population parameter. To stress this "no difference"
statement we call it the null hypothesis. The following example illustrates the importance
of null and alternative hypotheses.
Suppose a coin is tossed 100 times and 52 heads are observed. It would not be correct to jump to
the conclusion that the coin is biased towards heads merely because the experiment yielded 2
heads more than the expected 50. In fact, 52 heads is consistent with the hypothesis that the coin
is unbiased; it would not be surprising to flip a fair coin 100 times and observe 52 heads. On the
other hand, observing 80 or 90 heads in 100 flips would seem to contradict the hypothesis that
the coin is unbiased. In that case the coin is almost certainly biased.
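The coin example can be made concrete by computing, under the hypothesis of a fair coin, the probability of a result at least as extreme as the one observed. The sketch below uses the binomial distribution directly; the function name `prob_at_least` is introduced only for this illustration.

```python
from math import comb

def prob_at_least(k, n=100, p=0.5):
    """P(X >= k) when X follows a Binomial(n, p) distribution."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p52 = prob_at_least(52)  # 52 or more heads in 100 tosses of a fair coin
p80 = prob_at_least(80)  # 80 or more heads in 100 tosses of a fair coin

print(f"P(at least 52 heads) = {p52:.4f}")
print(f"P(at least 80 heads) = {p80:.2e}")
```

A fair coin gives 52 or more heads in well over a third of such experiments, so 52 heads is no evidence of bias, whereas 80 or more heads has a probability far below one in a million.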
28.8 HYPOTHESIS FORMULATION
Benchmark value vs Observed value
Test of significance: Is the difference between the benchmark value and the observed
value statistically significant?
The advertising department of a leading farm newspaper believes that farmers who
subscribe to the newspaper earn a higher average income than the state average income of all
farmers, subscribers included. In support of this claim the manager of the advertising
department drew a sample of 3,600 subscribers from its mailing list and estimated their average
income as Rs.4,290. From government sources the average income of all the farmers in the
state was obtained as Rs.4,162. Since the difference between the two figures is Rs.128, the
newspaper claims that the subscribers earn more than the state average because of their
access to the farm newsletters. The real problem is to assess whether a difference of
this magnitude is too small to matter or too big to be ignored. If we show that the
difference is too small, and hence should be ignored, then the newspaper's claim is to be
ignored too. On the other hand, if there is strong evidence that the observed difference is too
big, then we accept the farm paper's claim that its subscribers' income is more than the state
average. The problem presented here is enough to illustrate the procedure for setting up hypotheses.
Since by inductive reasoning we try to establish a more general result from a sample, the proof
needs to be very convincing. A hypothesis, once disproved, is disproved forever. To begin
with, therefore, we do not magnify the difference; instead we say that the difference between the
population mean and the sample mean is insignificant. So in our illustrative case we start by
saying that the difference of Rs.128 is negligibly small and can be ignored. Since the difference
between the sample value and the corresponding population value is hypothesized as zero, the
hypothesis so framed is often called the null hypothesis and is denoted by $H_0$.
$H_0: \mu = 4,162$ (There is no change in the average income of the farming community in the state.)
On the other hand, if there is enough evidence that the difference of Rs.128 is very
high, then we reject the null hypothesis. By rejecting the null hypothesis we move
to an alternative, called the alternative hypothesis. The alternative hypothesis may take three
distinct forms: the not equal type, the greater than type (as the farm paper claims in
our illustration) or the less than type.
28.9 ONE TAIL TEST
1. Right side Tail
Of the three forms listed for $H_1$, since the farm magazine insists that its subscribers'
income is greater than the state average, the relevant hypotheses could be framed as follows.
$H_0: \mu = 4162$
$H_1: \mu > 4162$
[Figure: Standard normal curve, f(X) against Z from -3.5 to 3.5. The 95% region below Z = 1.645 is marked "Do not Reject Ho"; the 0.05 region in the right tail is marked "Reject Ho". Values marked on the axis: Rs.4029, Rs.4162, Rs.4295.]
Fig 28.3 One Tail Test (Right side Tail) at 5% level of significance
2. Left side Tail
If, in our example, a competing farm journal states instead that the per capita income of the
farmers subscribing to the said journal is only Rs.4100 and not Rs.4290 as claimed, then the one
tail test concentrating on the left side tail is framed as follows.
[Figure: One Tail (Left side) Test at 5% level of significance. Standard normal curve, f(X) against Z from -3.5 to 3.5. The 0.05 region in the left tail below Z = -1.645 is marked "Reject Ho"; the remaining 95% region is marked "Do not Reject Ho". Values marked on the axis: Rs.4029, Rs.4162, Rs.4295.]
Fig 28.4
28.10 TWO TAIL TESTS
In our farm newspaper example, if, instead of the farm newspaper, the state
government claims that there is no difference between the subscribing and non-subscribing
farmers, then we frame a two tail test of the not equal type as shown below. This time
the per capita income of the subscribing farmers could be either higher or lower than the state
average. So we distribute the level of significance (the probability of Type I error) equally
between the two tails. For example, if the level of significance $\alpha$ = 0.05, then we place
0.025 in each tail and redraw the diagram as in Fig 28.5.
$H_0: \mu = 4162$
$H_1: \mu \neq 4162$
[Figure: Standard normal curve, f(X) against Z from -3.5 to 3.5. The central 95% region between Z = -1.96 and Z = +1.96 is marked "Do not reject Ho"; the 0.025 regions in the two tails are marked "Reject Ho". Values marked on the axis: Rs.4029, Rs.4162, Rs.4295.]
Fig 28.5 Two-Tail Test at 5% level of significance
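The critical values 1.645 and 1.96 marked in Figs 28.3 to 28.5 need not be read from a printed table. As a minimal sketch, they can be recovered from the standard normal distribution with Python's standard library; the variable names are chosen for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1
alpha = 0.05                    # level of significance

right_tail = standard_normal.inv_cdf(1 - alpha)    # one tail test, right side
left_tail = standard_normal.inv_cdf(alpha)         # one tail test, left side
two_tail = standard_normal.inv_cdf(1 - alpha / 2)  # two tail test, use +/- this value

print(f"right tail critical z : {right_tail:.3f}")
print(f"left tail critical z  : {left_tail:.3f}")
print(f"two tail critical z   : +/-{two_tail:.3f}")
```

In the two tail case the 5% is split into 0.025 on each side, which is why the cut-off moves outward from 1.645 to 1.96.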
28.11 TYPE I AND TYPE II ERRORS:
A null hypothesis that should have been accepted sometimes gets rejected. This type of
error is often called a type I error.
On some other occasions the null hypothesis gets accepted though it should have been
rejected as false. Such errors are called type II errors.
To understand the seriousness of these two types of errors, consider for a moment the legal
system practised in India and elsewhere. An indicted person may be guilty or innocent. In our
system of justice, the innocence of an indicted person is presumed until guilt is
established beyond doubt. Let this be our null hypothesis. Accordingly, there should be no
difference between presumption and outcome unless counter evidence is furnished. Once
evidence of guilt beyond doubt is provided, the presumption of innocence ($H_0$) is no longer
valid and conviction is recommended. Equivalently, we reject the null hypothesis and accept the
alternative hypothesis. However, unjustly convicting an innocent person by mistake is
unacceptable. The following are the four possible decisions.
1. Innocent is punished
2. Innocent is left free
3. Guilty is left free
4. Guilty is punished
Of the four, decisions 2 and 4 are correct and decisions 1 and 3 are incorrect. If, as in
outcome 1, an innocent person is unjustly convicted, we say that there is a Type I error
($\alpha$) in the judgment: the true null hypothesis of innocence is rejected.
Similarly, in outcome 3 the guilty goes unpunished. Such a mistake is called a Type II error
($\beta$). Of the two, punishing an innocent person by mistake is clearly more dangerous than
leaving the guilty unpunished. That is why it is often said that one thousand guilty may go
free but one innocent should not be punished. Equivalently, in hypothesis testing we place
greater emphasis on the Type I error and try to keep it at 5% or 1%, and then try to keep the
Type II error as small as possible.
28.12 LEVEL OF SIGNIFICANCE
As long as we use sample data to draw conclusions about the population, it is
impossible to be certain that an accepted hypothesis is really true. Similarly, merely by rejecting
a hypothesis we cannot say that it is false. When a type I error ($\alpha$) is committed, a true
null hypothesis is rejected; an innocent person is unjustly punished. The value $\alpha$ is called
the level of significance and refers to the probability of rejecting a true null hypothesis. With a
type II error ($\beta$), one fails to reject a false null hypothesis; again the decision is
unjustified because the guilty is left free. Clearly, convicting an innocent person should be
avoided, or at least kept to a minimum, in preference to worrying about guilty persons left free.
Naturally one would be interested in minimizing both types of error. Unfortunately,
for any given sample size, it is not possible to minimize both errors
simultaneously. Thus, in practice we keep the probability of committing the more
dangerous type I error at a fairly low level, such as 0.01 (1%) or 0.05 (5%), and then try to
minimize the type II error as much as possible. The probability of committing a type I error is
called the 'Level of Significance' and is denoted by $\alpha$. By choosing a level of significance
we simply mean specifying the probability of committing a type I error. The probability of a
Type II error is denoted by $\beta$.
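That the level of significance is the probability of a type I error can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: samples of size 25 are drawn from a normal population for which the null hypothesis is true by construction (mean 0 and standard deviation 1, both hypothetical values), and a two tail z test at $\alpha$ = 0.05 is applied to each sample. The function name `type_one_error_rate` is introduced for this example.

```python
import random
from statistics import NormalDist

def type_one_error_rate(alpha=0.05, n=25, trials=10_000, seed=3):
    """Share of samples that wrongly reject a null hypothesis that is true."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0                      # H0: mu = 0 is true by construction
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        z = (xbar - mu) / std_error
        if abs(z) > z_crit:                   # falls in one of the two rejection tails
            rejections += 1
    return rejections / trials

rate = type_one_error_rate()
print(f"observed type I error rate: {rate:.3f}")
```

The observed share of false rejections should settle close to the chosen $\alpha$ of 0.05. Lowering $\alpha$ reduces this share, but for a fixed sample size it does so at the cost of a larger $\beta$.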
28.13 PROCEDURE TO TEST A HYPOTHESIS
Following are the six important steps that one will have to follow to test a hypothesis:
1. Formulate suitable null and alternative hypotheses.
2. Decide on the level of significance ($\alpha$), which is the probability of
committing a type I error.
3. Decide on the critical region. The most commonly used z values for a two tail test are
1.96 and 2.58 for $\alpha = 0.05$ and $\alpha = 0.01$ respectively.
4. Compute z from the sample. This is done by expressing the difference between
the sample mean and the universe mean in standard error units.
5. If the computed z is greater than the z corresponding to the chosen level of significance,
reject $H_0$ and accept $H_1$.
6. If the computed z is less than the z corresponding to the chosen level of significance, the
hypothesis $H_0$ is not rejected. The smaller value of z does not prove the hypothesis. It
simply fails to disprove it.
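The six steps can be sketched as a short function for the two tail case. This is a minimal illustration rather than a general-purpose routine: the function name `z_test` and its arguments are chosen for this example, and the figures used in the call (sample mean 60.50, hypothesised mean 60.20, standard error 11.40) are those of the numerical illustration in sections 28.14 to 28.17.

```python
from statistics import NormalDist

def z_test(sample_mean, hypothesised_mean, std_error, alpha=0.05):
    """Two tail z test of H0: mu = hypothesised_mean against H1: mu != hypothesised_mean.

    Returns the computed z, the critical z and the decision (True = reject H0).
    """
    # Steps 2 and 3: the level of significance fixes the critical region.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 4: difference between sample and universe mean in standard error units.
    z = (sample_mean - hypothesised_mean) / std_error
    # Steps 5 and 6: reject H0 only if z falls in either tail.
    reject = abs(z) > z_crit
    return z, z_crit, reject

z, z_crit, reject = z_test(sample_mean=60.50, hypothesised_mean=60.20, std_error=11.40)
print(f"computed z = {z:.4f}, critical z = {z_crit:.2f}")
print("Reject H0" if reject else "Do not reject H0")
```

With these figures the computed z is far smaller than the critical value, so $H_0$ is not rejected; the test fails to disprove the hypothesis rather than proving it.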
28.14 SAMPLE MEAN AS AN ESTIMATE OF POPULATION MEAN: A NUMERICAL
ILLUSTRATION
For manageability, consider a finite population of a very small size of 10 families and their
associated expenditure in hundreds as shown in table 1.
Table 28.1
Family        1   2   3   4   5   6   7   8   9   10
Expenditure   74  47  37  90  84  40  51  54  66  59
Calculation of population mean and standard deviation
Though we really plan to estimate the population mean $\mu$ by using the sample mean $\bar{X}$,
for a clear understanding of the issues we first calculate the population mean $\mu$ and its
standard deviation $\sigma$, as shown in table 2.
Table 28.2
Family   Expenditure (X)   $(X - \bar{X})$   $(X - \bar{X})^2$
1        74                 13.80             190.44
2        47                -13.20             174.24
3        37                -23.20             538.24
4        90                 29.80             888.04
5        84                 23.80             566.44
6        40                -20.20             408.04
7        51                 -9.20              84.64
8        54                 -6.20              38.44
9        66                  5.80              33.64
10       59                 -1.20               1.44
Total    602                                  2923.60
Mean = 60.2   SD = 17.10
$\mu = \frac{\sum X}{N} = \frac{602}{10} = 60.2$
$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{2923.60}{10}} = 17.10$
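The arithmetic of table 28.2 can be verified in a few lines. The sketch below uses the ten expenditure figures of table 28.1; `pstdev` is the population standard deviation, which divides by N as in the formula above.

```python
import statistics

# Expenditure of the 10 families from table 28.1
expenditure = [74, 47, 37, 90, 84, 40, 51, 54, 66, 59]

mu = statistics.fmean(expenditure)      # population mean
sigma = statistics.pstdev(expenditure)  # population standard deviation (divisor N)

print(f"population mean               = {mu:.2f}")
print(f"population standard deviation = {sigma:.2f}")
```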

Sample mean as an estimate of the population mean
Since "the sample is a typical representative of the population" is the mantra of sampling, let
us select a sample of size two, obtain its mean, and study its relevance
in estimating the population mean. Let our random sample consist of families 3 and 5, with
expenditures of 37 and 84 giving an average of 60.50 (Sample No 19 in table 4 below).
Let us now reproduce all the information gathered so far in table 3.
Table 28.3
Population          Sample
N = 10              n = 2
$\mu = 60.20$       $\bar{X} = 60.50$
$\sigma = 17.10$
Since the population mean is less than the sample mean (60.20 < 60.50), can we say that
families 3 and 5 do not belong to our population? Definitely not, because we
know that these two families are legitimate members of the population under reference. Then
how does this difference (0.3) arise? Is it attributable to
sampling (sampling error) or to something else? Given that we are working with a sample
rather than the population, there are two ways of hypothesizing about the observed difference.
$H_0$: There is really no significant difference between the sample mean and the population
mean.
$H_1$: There is really a significant difference between the sample mean and the population mean.
The very first step in the decision-making process invariably assumes $H_0$. This
assumption is normally referred to as the null hypothesis because it is framed on the assumption
that there is no difference between the population mean and the sample mean. $H_1$ is called the
alternative hypothesis. Symbolically,
$H_0: \mu = 60.20$
The alternative statement is written as
$H_1: \mu \neq 60.20$
Of the two, which explanation for the observed difference is correct? As long as we
are working with a sample rather than the population, we cannot identify the correct statement
with certainty. However, at this stage we can adopt a conservative decision-making rule: one of
the two statements is chosen as the right one while keeping the chance of making the incorrect
choice at its minimum.
If $H_0$ is chosen as the true statement, then the remaining problem is to determine the chance
(probability) of selecting our typical sample 19, having the mean $\bar{X} = 60.50$, out of the 45
possible samples in our illustration. If this chance (probability) of random selection from the
whole lot of 45 samples is less than 0.05 (5%, a prefixed objective decision
rule), then we declare that the explanation $H_0$ is unlikely with 95% confidence. We can
determine this probability with precision provided we know something more about the
distribution of all the 45 possible sample means, including the one under reference. Until then,
let us postpone our discussion of estimation.
28.16 SAMPLING DISTRIBUTION OF MEAN (SMALL SAMPLE FROM A FINITE
POPULATION)
In sampling, as already stated, sampling error is bound to occur. If such
errors prove to be statistically insignificant, then the sample statistic can be
used as an estimate of the population mean. Such an exercise invariably needs something more
than the mere observation of a single sample. One such exercise is to explore the
characteristics of all the 45 sample means listed in table 4.
Table 28.4
All possible 45 samples and their means
Sl. No   Family    Sample values   Mean
1        1 & 2     74 & 47         60.50
2        1 & 3     74 & 37         55.50
3        1 & 4     74 & 90         82.00
4        1 & 5     74 & 84         79.00
5        1 & 6     74 & 40         57.00
6        1 & 7     74 & 51         62.50
7        1 & 8     74 & 54         64.00
8        1 & 9     74 & 66         70.00
9        1 & 10    74 & 59         66.50
10       2 & 3     47 & 37         42.00
11       2 & 4     47 & 90         68.50
12       2 & 5     47 & 84         65.50
13       2 & 6     47 & 40         43.50
14       2 & 7     47 & 51         49.00
15       2 & 8     47 & 54         50.50
16       2 & 9     47 & 66         56.50
17       2 & 10    47 & 59         53.00
18       3 & 4     37 & 90         63.50
19       3 & 5     37 & 84         60.50
20       3 & 6     37 & 40         38.50
21       3 & 7     37 & 51         44.00
22       3 & 8     37 & 54         45.50
23       3 & 9     37 & 66         51.50
24       3 & 10    37 & 59         48.00
25       4 & 5     90 & 84         87.00
26       4 & 6     90 & 40         65.00
27       4 & 7     90 & 51         70.50
28       4 & 8     90 & 54         72.00
29       4 & 9     90 & 66         78.00
30       4 & 10    90 & 59         74.50
31       5 & 6     84 & 40         62.00
32       5 & 7     84 & 51         67.50
33       5 & 8     84 & 54         69.00
34       5 & 9     84 & 66         75.00
35       5 & 10    84 & 59         71.50
36       6 & 7     40 & 51         45.50
37       6 & 8     40 & 54         47.00
38       6 & 9     40 & 66         53.00
39       6 & 10    40 & 59         49.50
40       7 & 8     51 & 54         52.50
41       7 & 9     51 & 66         58.50
42       7 & 10    51 & 59         55.00
43       8 & 9     54 & 66         60.00
44       8 & 10    54 & 59         56.50
45       9 & 10    66 & 59         62.50
Total                              2709.00
Table 28.5
Frequency distribution of the entire 45 sample means
$\bar{X}$   f     $\bar{X}$   f
38.50       1     62.00       1
42.00       1     62.50       2
43.50       1     63.50       1
44.00       1     64.00       1
45.50       2     65.00       1
47.00       1     65.50       1
48.00       1     66.50       1
49.00       1     67.50       1
49.50       1     68.50       1
50.50       1     69.00       1
51.50       1     70.00       1
52.50       1     70.50       1
53.00       2     71.50       1
55.00       1     72.00       1
55.50       1     74.50       1
56.50       2     75.00       1
57.00       1     78.00       1
58.50       1     79.00       1
60.00       1     82.00       1
60.50       2     87.00       1
Total                         45
To illustrate the sampling distribution of the mean in our hypothetical population, let us
take a sample of two (for simplicity) at a time and calculate the sample mean by simply averaging
the two values. Since there are 10 families in the population, taking two at a time we can make
$^{10}C_2 = \frac{10 \times 9}{2 \times 1} = 45$
samples altogether. Table 28.4 gives all the 45 samples with their
respective sample means.
With the means so obtained we prepare the associated frequency distribution table by
grouping the samples with the same mean together. Table 28.5 provides this frequency
distribution. This distribution of means is often called the sampling distribution of the mean.
Figure 28.6 shows the graph of the associated sampling distribution.
[Figure: Frequency Distribution. Horizontal axis: Expenditure (sample means from 38.5 to 87); vertical axis: Frequency (0 to 2.5).]
Fig 28.6 Frequency distribution of means
Relative frequency distribution, otherwise called the probability distribution, of all the 45 sample means
$\bar{X}$   f   r.f         $\bar{X}$   f   r.f
38.50       1   0.022222    62.00       1   0.022222
42.00       1   0.022222    62.50       2   0.044444
43.50       1   0.022222    63.50       1   0.022222
44.00       1   0.022222    64.00       1   0.022222
45.50       2   0.044444    65.00       1   0.022222
47.00       1   0.022222    65.50       1   0.022222
48.00       1   0.022222    66.50       1   0.022222
49.00       1   0.022222    67.50       1   0.022222
49.50       1   0.022222    68.50       1   0.022222
50.50       1   0.022222    69.00       1   0.022222
51.50       1   0.022222    70.00       1   0.022222
52.50       1   0.022222    70.50       1   0.022222
53.00       2   0.044444    71.50       1   0.022222
55.00       1   0.022222    72.00       1   0.022222
55.50       1   0.022222    74.50       1   0.022222
56.50       2   0.044444    75.00       1   0.022222
57.00       1   0.022222    78.00       1   0.022222
58.50       1   0.022222    79.00       1   0.022222
60.00       1   0.022222    82.00       1   0.022222
60.50       2   0.044444    87.00       1   0.022222
Total                                   45  1
Sampling distribution: The sampling distribution of the sample mean is the probability
distribution of all possible sample means of size n drawn from the population under study.
Note: although this frequency distribution is not normal in this case, it will be approximately
normal provided the population is large and the sample size is reasonably large. Thus, the
non-normal character of the graph is mainly due to the very small population combined with a
very small sample size of 2.
Calculation of the mean and standard deviation (standard error) of all the 45 means by
using routine method
Table 28.6
$\bar{X}$   f   $f\bar{X}$   $f\bar{X}^2$   $\bar{X}$   f   $f\bar{X}$   $f\bar{X}^2$
38.50       1   38.50        1482.25        62.00       1   62.00        3844.00
42.00       1   42.00        1764.00        62.50       2   125.00       7812.50
43.50       1   43.50        1892.25        63.50       1   63.50        4032.25
44.00       1   44.00        1936.00        64.00       1   64.00        4096.00
45.50       2   91.00        4140.50        65.00       1   65.00        4225.00
47.00       1   47.00        2209.00        65.50       1   65.50        4290.25
48.00       1   48.00        2304.00        66.50       1   66.50        4422.25
49.00       1   49.00        2401.00        67.50       1   67.50        4556.25
49.50       1   49.50        2450.25        68.50       1   68.50        4692.25
50.50       1   50.50        2550.25        69.00       1   69.00        4761.00
51.50       1   51.50        2652.25        70.00       1   70.00        4900.00
52.50       1   52.50        2756.25        70.50       1   70.50        4970.25
53.00       2   106.00       5618.00        71.50       1   71.50        5112.25
55.00       1   55.00        3025.00        72.00       1   72.00        5184.00
55.50       1   55.50        3080.25        74.50       1   74.50        5550.25
56.50       2   113.00       6384.50        75.00       1   75.00        5625.00
57.00       1   57.00        3249.00        78.00       1   78.00        6084.00
58.50       1   58.50        3422.25        79.00       1   79.00        6241.00
60.00       1   60.00        3600.00        82.00       1   82.00        6724.00
60.50       2   121.00       7320.50        87.00       1   87.00        7569.00
Total                                                   45  2709         168929
From table 28.6 the grand mean is obtained as
$\bar{\bar{X}} = \frac{\sum f\bar{X}}{n} = \frac{2709}{45} = 60.20$
Thus, the mean of all the 45 means is exactly equal to the population mean 60.20 obtained
earlier. When this property is satisfied, the sample mean is an unbiased estimate of the
population mean. Similarly, the standard deviation of the sample means, otherwise called the
standard error, is obtained by using the usual formula.
$\sigma_{\bar{X}} = \sqrt{\frac{\sum f\bar{X}^2}{n} - \left(\frac{\sum f\bar{X}}{n}\right)^2} = \sqrt{\frac{168929}{45} - \left(\frac{2709}{45}\right)^2}$
$\sigma_{\bar{X}} = \sqrt{\frac{(45)(168929) - (2709)^2}{45^2}} = \sqrt{\frac{7601805 - 7338681}{2025}} = \sqrt{129.9378} = 11.40$
The distribution of all the $\bar{X}$s around their grand mean is called the sampling distribution
of the mean. The smallest among the 45 sample means is Rs.38.50 (the average of families 3 & 6 in
sample no.20). Similarly, the largest sample mean is Rs.87.00 (the average of families 4 & 5 in
sample no.25).
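The whole exercise of tables 28.4 to 28.6 can be reproduced by enumerating the samples directly. The sketch below uses the expenditure figures of table 28.1 and lists every pair of families with `itertools.combinations`; the variable names are chosen for this illustration.

```python
from itertools import combinations
import statistics

# Family number -> expenditure, from table 28.1
expenditure = {1: 74, 2: 47, 3: 37, 4: 90, 5: 84,
               6: 40, 7: 51, 8: 54, 9: 66, 10: 59}

# Every possible sample of two families, and the mean of each sample
samples = list(combinations(expenditure, 2))
sample_means = [(expenditure[a] + expenditure[b]) / 2 for a, b in samples]

grand_mean = statistics.fmean(sample_means)       # mean of all the sample means
standard_error = statistics.pstdev(sample_means)  # standard deviation of the sample means

print(f"number of samples = {len(samples)}")
print(f"grand mean        = {grand_mean:.2f}")
print(f"standard error    = {standard_error:.2f}")
print(f"smallest mean     = {min(sample_means):.2f}, largest mean = {max(sample_means):.2f}")
```

The enumeration gives 45 samples, a grand mean of 60.20 and a standard error of 11.40, in agreement with the hand calculation.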
In this section we took a very small universe with only 10 elements for illustration and
made the needed standard error calculations by considering all the 45 samples. The standard error
of the sampling distribution in our illustrative case is obtained as $\sigma_{\bar{X}}$ = 11.40.
In fact, it is not necessary to list all the possible samples in order to compute the standard
error of the sample means as we have done in our illustration. Moreover, for a large population,
finding the means of all the possible samples, however small or large the sample size, is an
impossible task. Such a vast exercise is unnecessary provided you know the standard deviation
of the population, $\sigma$, and its size N.
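Given the population data, we can easily calculate the needed standard error by using the
following formula.
S.E of sample mean: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$, where $\sigma$ is the
standard deviation of the population, n is the sample size and N is the population size.
In our illustrative case
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} = \frac{17.10}{\sqrt{2}} \sqrt{\frac{10 - 2}{10 - 1}} = \frac{17.10}{1.414} \sqrt{\frac{8}{9}} = 11.40$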
This is exactly the same standard error value that we got by using the sampling distribution of all
the 45 samples, a troublesome and unnecessary exercise indeed. Now let us sum up all the
discussions we had so far.
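The standard error formula for sampling from a finite population is easily put into code. The following is a minimal sketch; the function name `standard_error` is chosen for this illustration, and the values used are those of our example ($\sigma$ = 17.10, n = 2, N = 10).

```python
from math import sqrt

def standard_error(sigma, n, N):
    """Standard error of the sample mean for a sample of size n
    drawn from a finite population of size N."""
    finite_population_factor = sqrt((N - n) / (N - 1))
    return (sigma / sqrt(n)) * finite_population_factor

se = standard_error(sigma=17.10, n=2, N=10)
print(f"standard error of the sample mean = {se:.2f}")
```

When N is very large compared with n, the factor under the second square root approaches 1 and the formula reduces to $\sigma / \sqrt{n}$.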
28.17 MEASURE OF PRECISION OF A SAMPLE MEAN AS AN ESTIMATE OF THE
POPULATION MEAN:
Now we are in a position to resume our postponed discussion of the null hypothesis. The extent
to which an individual sample estimate deviates from its universe value is often taken as
the measure of precision. In our illustrative case we are not lucky enough to have Rs.60.20 as
the mean of every one of the 45 possible samples. Had that been so, every sample could have been
used as an estimate of the population mean with 100% accuracy. However, it can be seen that most
of the sample means are scattered around the population mean of 60.20, apart from a few extreme
cases. This observation gives us a certain amount of confidence in the individual sample mean as
a typical representative of the population mean. What is needed at this juncture is a way of
summarizing the discussion so far. The scatter of the individual
estimates ($\bar{X}$'s) around the population value ($\mu$) is a natural starting point.
The greater the scatter of the means, the lower the reliability of an individual mean
as an estimate of the population mean. Thus, the right answer to our question is to obtain a
measure of the scatter of the individual sample means around the grand mean (the population
mean). This standard deviation of the sample means of all the 45 samples, often called the
standard error, gives us a measure of the reliability of the sample mean.
To sum up:
Table 28.7
Population          Sample               Sampling Distribution
N = 10              n = 2                $\mu_{\bar{X}} = 60.20$
$\mu = 60.20$       $\bar{X} = 60.50$    $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
$\sigma = 17.10$
The remaining problem is to determine the probability of the observed single sample
outcome if $H_0$ is correct. Based on the central limit theorem we may assume that this sampling
distribution is normally distributed with mean $\mu_{\bar{X}} = 60.20$ and standard error
$\sigma_{\bar{X}} = 11.40$.
Fig 28.7: Standard normal distribution of the sample mean. The central 95% region (do not reject $H_0$) runs from Rs.37.856 ($Z = -1.96$) to Rs.82.544 ($Z = 1.96$) around Rs.60.20 ($Z = 0$); each tail (reject $H_0$) has area 0.025.
Using our knowledge of the standard normal distribution we can add further useful
information to our discussion of the sampling distribution of sample means. With the appropriate Z score
we can depict the decision rule stated previously when hypothesis $H_0$ is true.
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{60.50 - 60.20}{11.40} = \frac{0.30}{11.40} = 0.02631$$
Now at the 5% level of significance for a two-tailed test the standard normal Z score from the normal
table is $\pm 1.96$. Since the calculated Z score 0.02631 < 1.96, the explanation under $H_0$ is accepted.
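The decision rule above can be sketched in Python. This is a minimal sketch, assuming a two-tailed test; the function name `z_test_mean` is illustrative, and the critical value is computed with the standard library's `statistics.NormalDist` instead of being read from a table.

```python
import math
from statistics import NormalDist

def z_test_mean(x_bar, mu, sigma, n, N=None, alpha=0.05):
    """Two-tailed Z test for a sample mean with known population sigma.

    If N is given, the finite multiplier sqrt((N - n) / (N - 1)) is applied.
    Returns (z, critical value, reject H0?).
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    z = (x_bar - mu) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z, z_crit, abs(z) > z_crit

# Illustrative case: sample mean 60.50, mu = 60.20, sigma = 17.10, n = 2, N = 10
z, z_crit, reject = z_test_mean(60.50, 60.20, 17.10, n=2, N=10)
print(f"Z = {z:.5f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```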
28.18 ALTERNATIVE CONFIDENCE INTERVAL APPROACH FOR THE PRECISION
OF THE ESTIMATE:
Another method for indicating the precision of the estimate is to determine an interval
around the population mean within which the sample mean will fall. If the estimated individual
sample mean falls within this interval, then we reasonably assume that our sample estimate is a
good estimator of the population mean. If the distribution of the sample mean follows a normal
distribution (as follows from the central limit theorem), we can construct a confidence interval for
any given level of confidence. If a 95% level of confidence is required, then the needed confidence
interval is simply 1.96 times the standard error of the sampling distribution on either side of the population mean.
If instead we use the t distribution, as in the case of small samples, then the
appropriate confidence interval is constructed using t values from a t table. Also note that the larger
the confidence level we want, the wider the confidence interval will be. Table 28.8
and the associated Fig 28.8 give the confidence intervals for varying degrees of confidence.
Table 28.8
| Approximate Degree of Confidence | Confidence Interval |
|---|---|
| 68.27% | $\pm 1\,SE$ |
| 95.45% | $\pm 2\,SE$ |
| 99.73% | $\pm 3\,SE$ |
| 95.00% | $\pm 1.96\,SE$ |
| 99.00% | $\pm 2.5758\,SE$ |
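The degrees of confidence attached to each multiple of the standard error can be recomputed from the standard normal distribution. A minimal sketch using only the Python standard library; the helper name `coverage` is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def coverage(k: float) -> float:
    """Probability that a normal variable lies within +/- k standard errors."""
    return STD_NORMAL.cdf(k) - STD_NORMAL.cdf(-k)

for k in (1, 2, 3, 1.96, 2.5758):
    print(f"+/- {k} SE covers {coverage(k):.2%}")
```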
Fig 28.8: Standard normal curve, f(X) against Z, marking the central areas of 68%, 95% and 99.73%.
If the estimated individual sample mean falls within this interval then we reasonably
assume that our estimate is a good representative value for the population mean.
95% confidence interval for our illustration
In our illustrative case the 95 percent confidence interval is simply 1.96 times the
standard error of the sampling distribution on either side of the population mean:
$$\mu \pm 1.96\,SE = (\mu - 1.96\,SE,\ \mu + 1.96\,SE) = (60.20 - 1.96 \times 11.40,\ 60.20 + 1.96 \times 11.40) = (37.856,\ 82.544)$$
Fig 28.9: 95% confidence interval
Our confidence interval is (37.856, 82.544).
Since our sample mean 60.50 lies within this interval, we can safely take this value as a
sufficiently precise estimate of the population mean.
28.19 IMPORTANCE OF SAMPLE SIZE IN MEASURING THE PRECISION OF THE
ESTIMATE:
The size of the sample plays a crucial role in determining the precision of the
estimate. The whole argument in the previous section is valid only for the sample size 2. The
following table illustrates the relationship between the sample size and the precision measure
$\sigma_{\bar{X}}$. As the size of the sample increases, the standard error of the sampling distribution declines,
as shown in Table 28.9. Finally, when the whole of the population is taken as the
sample, the standard error becomes zero and the estimate becomes 100% correct.
Table 28.9
| Size of the sample | Standard Error |
|---|---|
| 2 | 11.40 |
| 3 | 8.71 |
| 4 | 6.98 |
| 5 | 5.70 |
| 6 | 4.65 |
| 7 | 3.73 |
| 8 | 2.85 |
| 9 | 1.90 |
| 10 | 0.00 |
Standard error = standard deviation of the sample means
As an example, if we take samples of 8 items instead of 2, we can again form
$^{10}C_8 = {}^{10}C_2 = 45$ samples, but this time our estimate is going to be more precise than
with samples of size 2.
This time the sample mean for the first 8 items = (74+47+37+90+84+40+51+54)/8 = 59.625
The associated 95% confidence interval is much narrower, as shown below.
$$(\mu - 1.96\,\sigma_{\bar{X}},\ \mu + 1.96\,\sigma_{\bar{X}}) = (60.20 - 1.96 \times 2.85,\ 60.20 + 1.96 \times 2.85) = (54.614,\ 65.786)$$
Fig 28.10: Estimate of the mean from the first eight items, shown within its 95% confidence interval around 60.20.
This time also our sample mean 59.625 lies within the confidence limits, although the interval is narrower.
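The relationship between sample size and standard error described in this section can be recomputed from the finite-population formula. A minimal sketch, assuming the illustrative values $\sigma = 17.10$, $\mu = 60.20$ and $N = 10$ used throughout; the names are illustrative.

```python
import math

SIGMA, MU, N = 17.10, 60.20, 10  # illustrative population values from the text

def standard_error(sigma: float, n: int, N: int) -> float:
    """Standard error of the sample mean with the finite multiplier."""
    return (sigma / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))

# Standard error and 95% interval around the population mean for each sample size
for n in range(2, N + 1):
    se = standard_error(SIGMA, n, N)
    lower, upper = MU - 1.96 * se, MU + 1.96 * se
    print(f"n = {n:2d}  SE = {se:5.2f}  95% interval = ({lower:.3f}, {upper:.3f})")
```

The standard error shrinks steadily as n grows and reaches zero when the sample is the whole population.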
28.20 RELEVANCE OF POPULATION SIZE IN MEASURING THE PRECISION OF
THE ESTIMATE
Let us take a sample of size 1000 from a population of size 1,00,000. This amounts to
saying that the sample is one out of every 100. If the population size is just 80,000, then a sample
of 1000 items is one out of every 80. It appears that the first case gives us a more precise
result than the second. This need not be the case. The following example illustrates one such
case.
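Standard error calculation when the population size is 10000 and the sample size is 100:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} = \frac{10}{\sqrt{100}} \sqrt{\frac{10000-100}{10000-1}} = \frac{10}{10} \sqrt{\frac{9900}{9999}} = 0.995037$$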
Standard error calculation when the population size is 8000 and the sample size is 100:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} = \frac{10}{\sqrt{100}} \sqrt{\frac{8000-100}{8000-1}} = \frac{10}{10} \sqrt{\frac{7900}{7999}} = 0.993792$$
Clearly, since the standard error is larger in the first case, it will provide a wider confidence
interval than the second case. The narrower the confidence interval, the greater the precision of the
estimate. Thus the second sample, drawn from the population of only 8000, gives the more precise
estimate.
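The two calculations can be compared side by side. A minimal sketch, assuming $\sigma = 10$ and $n = 100$ as in the example; the helper name `finite_multiplier` is illustrative.

```python
import math

def finite_multiplier(n: int, N: int) -> float:
    """Finite multiplier sqrt((N - n) / (N - 1)) applied to sigma / sqrt(n)."""
    return math.sqrt((N - n) / (N - 1))

sigma, n = 10, 100
for N in (10_000, 8_000):
    fpc = finite_multiplier(n, N)
    se = sigma / math.sqrt(n) * fpc
    print(f"N = {N:6d}  n/N = {n / N:.4f}  multiplier = {fpc:.6f}  SE = {se:.6f}")
```

The two standard errors differ only in the third decimal place, which shows how little the population size matters once the sampling fraction is small.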
Note: If the size of the sample is small in relation to the population, then irrespective of the size of the
population the second term in the above formula, called the finite multiplier, will be close to unity
and hence can be ignored for all practical calculations. In practice, when the sampling fraction n/N
does not exceed 0.05 (5%), we ignore the finite multiplier in the above formula. Likewise, if we know
that N is large, we can ignore the finite multiplier without committing serious error in our
result and rewrite the formula as
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
28.21 DETERMINATION OF OPTIMUM SAMPLE SIZE FOR A GIVEN LEVEL OF
PRECISION:
In the theory of sampling the trickiest question is: what is the optimum
size of the sample for a given precision level? If the sample size is too big, we will have to invest
a lot of money and time. The non-sampling errors will also be larger in such a situation. If
the sample size is too small, the needed precision in our estimate will not be achieved. In certain
problems a specified amount of precision is enough for the type of issue under consideration. Thus
for a given precision level and level of confidence, there exists an optimum sample size.
There are two alternative approaches for determining the correct size of the sample. In
the first approach we specify the precision of the estimate and determine the needed sample size.
In the second approach we use Bayesian statistics to weigh the cost of additional information
against the expected value of that information and decide on the correct size.
In this section we use the first approach to determine the size of the sample. In
this approach the researcher has to specify the level of precision that he wants for
the problem under reference. For example, suppose the researcher wants to estimate the population
mean by using the sample mean to within $\pm e$ of the true mean with 95% confidence. In
this case we say that the desired level of precision is $\pm e$. For a finite population like the one we
have discussed so far, the 95% confidence interval is normally written as:
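$$\bar{X} \pm Z \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$
So for the stated precision level
$$e = Z \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$
This equation can be solved for n as given below.
$$n = \frac{Z^2 \sigma^2 N}{(N-1) e^2 + Z^2 \sigma^2}$$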
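The algebra behind the solution for n is not spelled out in the text; one way to carry it out, starting from the precision equation $e = Z \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$, is sketched below.

```latex
\begin{aligned}
e^2 &= \frac{Z^2 \sigma^2}{n} \cdot \frac{N-n}{N-1}
      && \text{(square both sides)} \\
e^2\, n\,(N-1) &= Z^2 \sigma^2 (N - n)
      && \text{(multiply by } n(N-1)\text{)} \\
n\left[(N-1)e^2 + Z^2 \sigma^2\right] &= Z^2 \sigma^2 N
      && \text{(collect the terms in } n\text{)} \\
n &= \frac{Z^2 \sigma^2 N}{(N-1)e^2 + Z^2 \sigma^2}
\end{aligned}
```

When N is very large the term $(N-1)e^2$ dominates the denominator and the expression tends to $Z^2 \sigma^2 / e^2$.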
Using the above expression we calculate the needed sample size after substituting the relevant
values.
For a large population the relevant formula for the calculation may be written as
$$e = Z \frac{\sigma}{\sqrt{n}} \quad \Rightarrow \quad n = \frac{Z^2 \sigma^2}{e^2}$$
In our finite population illustrative example discussed above, if e = 5 units either way is the
precision needed, then the corresponding sample size is obtained as
$$n = \frac{Z^2 \sigma^2 N}{(N-1) e^2 + Z^2 \sigma^2} = \frac{(1.96)^2 (17.1)^2 (10)}{(10-1)(5)^2 + (1.96)^2 (17.1)^2} = 8.331259 \approx 8$$
Thus if the sample size is about 8 units then the needed precision is achieved.
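The sample-size formula lends itself to a small helper. A minimal sketch; the function name `sample_size` and its optional `N` argument are illustrative choices, with `N=None` standing for a large (effectively infinite) population.

```python
def sample_size(z: float, sigma: float, e: float, N=None) -> float:
    """Sample size needed to estimate a mean to within +/- e.

    z:     standard normal value for the chosen confidence level
    sigma: population standard deviation
    e:     desired precision (half-width of the interval)
    N:     population size; None means the finite multiplier is ignored
    """
    z2_sigma2 = (z * sigma) ** 2
    if N is None:
        return z2_sigma2 / e ** 2
    return z2_sigma2 * N / ((N - 1) * e ** 2 + z2_sigma2)

# Illustrative finite population: Z = 1.96, sigma = 17.1, e = 5, N = 10
n_needed = sample_size(1.96, 17.1, 5, N=10)
print(f"Required sample size = {n_needed:.6f}")
```

The function returns the unrounded value so that the rounding decision stays with the researcher.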
Determine the size of the sample needed for estimating the true average weight of the cereal containers for a
universe with N = 5000 on the basis of the following data.
1) The variance of weight = 4 (ounces squared) on the basis of past records.
2) The estimate should be within 0.8 ounces of the true average weight with 99% probability.
Will there be a change in the sample size if we assume an infinite population in the given case? If so,
explain by how much.
Solution:
Here the confidence interval for the population mean $\mu$
is given by
$$\bar{X} \pm Z \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$
$$n = \frac{Z^2 \sigma^2 N}{(N-1) e^2 + Z^2 \sigma^2} = \frac{(2.57)^2 (2)^2 (5000)}{(5000-1)(0.8)^2 + (2.57)^2 (2)^2} = \frac{132098}{3199.36 + 26.4196} = \frac{132098}{3225.7796} = 40.95 \approx 41$$
Thus if a sample of size 41 is taken then the needed precision is achieved.
If the population is infinite then the needed sample size is obtained as
$$n = \frac{Z^2 \sigma^2}{e^2} = \frac{(2.57)^2 (2)^2}{(0.8)^2} = 41.28 \approx 41$$
Thus there is no change in the sample size even though we brought in an infinite population in
place of the finite population.
28.22 SHAPE OF THE SAMPLING DISTRIBUTION OF MEAN
In our earlier discussion we opted for a very small finite population in order to keep our task simple
and illustrative. As a result we obtained a non-normal sampling distribution. In fact, in
many situations it is reasonable to assume that the population from which we are selecting a simple
random sample has a normal or nearly normal distribution. When the population has a
normal distribution, the sampling distribution of the sample mean also follows a normal distribution,
irrespective of the sample size. Generalizing the discussion of the previous section, we have
two important theorems relating to the shape of the sampling distribution.
Central Limit Theorem 1: If X is a normal population variable with mean $\mu$ and standard
deviation $\sigma$, then the sampling distribution of the sample mean $\bar{X}$ of samples of size n will also follow a
normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
Central Limit Theorem 2: If X is a non-normal population variable with mean $\mu$ and
standard deviation $\sigma$, then the sampling distribution of the sample mean $\bar{X}$ of samples of large size n will
also follow, approximately, a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
$$X \sim N(\mu, \sigma) \quad \Rightarrow \quad \bar{X} \sim N(\mu, \sigma/\sqrt{n})$$
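The second theorem can be illustrated by simulation. This is a minimal sketch under stated assumptions: the population is a hypothetical exponential distribution with mean 1 and standard deviation 1 (chosen only because it is clearly non-normal), the sample size is 50, and the seed is fixed so that the run is repeatable.

```python
import math
import random
import statistics

def sample_means(draw, n, trials):
    """Return the means of `trials` independent samples of size n."""
    return [statistics.fmean([draw() for _ in range(n)]) for _ in range(trials)]

random.seed(28)  # fixed seed so the sketch is repeatable

# Hypothetical non-normal population: exponential with mean 1 and sd 1
mu, sigma, n = 1.0, 1.0, 50
means = sample_means(lambda: random.expovariate(1.0), n, trials=5000)

se_theory = sigma / math.sqrt(n)
# Share of sample means within 1.96 standard errors of the population mean
within = sum(abs(m - mu) <= 1.96 * se_theory for m in means) / len(means)

print(f"mean of sample means : {statistics.fmean(means):.3f} (theory {mu})")
print(f"sd of sample means   : {statistics.stdev(means):.3f} (theory {se_theory:.3f})")
print(f"share within 1.96 SE : {within:.3f} (normal theory 0.95)")
```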

REVIEW PROBLEM FOR Z TEST (Normal Population)
1. Population normal, population finite, sample size may be large or small but variance of
the population is known. $H_1$ may be one sided or two sided.
$H_0: \mu = 60.20$, $H_1: \mu \neq 60.20$
OR
$H_0: \mu = 60.20$, $H_1: \mu > 60.20$
OR
$H_0: \mu = 60.20$, $H_1: \mu < 60.20$
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}}$$
Example 1: From a finite population of 10 observations a sample of 2 observations, 37 and 84,
was taken. Test whether the sample comes from a population whose mean is 60.20 and standard
deviation is 17.10 at the 5% level of significance. Also construct a 95% confidence interval and
confirm the hypothesis testing result.
$H_0: \mu = 60.20$
$H_1: \mu \neq 60.20$
Data: $\mu = 60.20$, $\sigma = 17.10$, $\bar{X} = (37 + 84)/2 = 60.50$
$$Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}} = \frac{60.50 - 60.20}{\frac{17.1}{\sqrt{2}} \sqrt{\frac{10-2}{10-1}}} = \frac{0.3}{11.4} = 0.026316$$
Now at the 5% level of significance for a two-tailed test the standard normal Z score is $\pm 1.96$.
Since the calculated Z score 0.026316 < 1.96, the null hypothesis $H_0$ is accepted.
Confidence Interval:
The 95% confidence interval is obtained as
$$\mu \pm 1.96\,SE = (\mu - 1.96\,SE,\ \mu + 1.96\,SE) = (60.20 - 1.96 \times 11.40,\ 60.20 + 1.96 \times 11.40) = (37.856,\ 82.544)$$
Clearly our sample mean $\bar{X} = 60.50$ lies well within this interval. So our $H_0$ gets confirmed
once again.
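Example 2: From a finite population of 1000 observations a sample of 121 items was taken and
the associated sample mean was obtained as 60. Test whether the sample comes from a
population whose mean is 62 and standard deviation is 15 at the 5% level of significance.
$H_0: \mu = 62$
$H_1: \mu \neq 62$
Data: $\mu = 62$, $\sigma = 15$, $\bar{X} = 60$
$$Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}} = \frac{60 - 62}{\frac{15}{\sqrt{121}} \sqrt{\frac{1000-121}{1000-1}}} = \frac{-2}{\frac{15}{11} \sqrt{\frac{879}{999}}} = -1.563579$$
Since $|Z| = 1.563579 < 1.96$, the null hypothesis $H_0$ is accepted.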
Example 3: The advertising department of a leading farm newspaper believes that the farmers who
subscribe to the newspaper earn a higher average income than the state average. In support of
this claim the manager of the advertising department collected a sample of 3,600 subscribers
from its mailing list and estimated the average income as Rs.4,290. From other government
sources the average income of all the farmers in the state was obtained as Rs.4,162 with a
standard deviation of Rs.2,280. Since the difference between the two figures is Rs.128, the
newspaper claims that the subscribers earn more than the state average because of their
access to the farm newsletters. Examine the validity of the farm newspaper's claim.
$H_0: \mu = 4162$
$H_1: \mu > 4162$
Data: $\mu = 4162$, $\sigma = 2280$, $\bar{X} = 4290$
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{4290 - 4162}{2280 / \sqrt{3600}} = \frac{128}{2280 / 60} = \frac{128}{38} = 3.37$$
Now from the Z table for a one-tailed test at the 5% level of significance the corresponding Z = 1.645.
Since the calculated 3.37 is greater than the table value of 1.645, we reject the null hypothesis,
accept the alternative hypothesis and confirm the newspaper's claim.
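The newspaper example can be reproduced with a short upper-tailed test. A minimal sketch; the function name `upper_tailed_z_test` is illustrative, and the one-tailed critical value is computed from `statistics.NormalDist` rather than read from a table.

```python
import math
from statistics import NormalDist

def upper_tailed_z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Upper-tailed Z test of H0: mu = mu0 against H1: mu > mu0.

    Returns (z, critical value, p-value) for a large population,
    so no finite multiplier is applied.
    """
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha)  # whole alpha in one tail
    p_value = 1 - NormalDist().cdf(z)
    return z, z_crit, p_value

z, z_crit, p_value = upper_tailed_z_test(x_bar=4290, mu0=4162, sigma=2280, n=3600)
print(f"Z = {z:.2f}, critical value = {z_crit:.3f}, p-value = {p_value:.5f}")
print("Reject H0" if z > z_crit else "Do not reject H0")
```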
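Confidence Interval
Alternatively, we can always set these critical limits in our original units as well.
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \ \Rightarrow\ 1.96 = \frac{\bar{X} - 4162}{38} \ \Rightarrow\ \bar{X} - 4162 = 1.96 \times 38 \ \Rightarrow\ \bar{X} = 4162 + 1.96 \times 38 = 4236.48$$
Similarly,
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \ \Rightarrow\ -1.96 = \frac{\bar{X} - 4162}{38} \ \Rightarrow\ \bar{X} - 4162 = -1.96 \times 38 \ \Rightarrow\ \bar{X} = 4162 - 1.96 \times 38 = 4087.52$$
So our lower and upper critical values in original units are Rs.4087.52 and Rs.4236.48
respectively. Since our sample mean of Rs.4290 falls outside this interval, we once again reject
$H_0$ and accept $H_1$.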
Example 4: A particular brand of electric bulb was found to have a mean lifetime of 1400 hours
with a standard deviation of 64 hours. A random sample of 100 bulbs, when tested, showed a mean
lifetime of 1250 hours. Is this evidence sufficient to confirm that the lifetime of the bulbs has
gone down?
Solution:
$H_0: \mu = 1400$
$H_1: \mu < 1400$
$$SE = \frac{\sigma}{\sqrt{n}} = \frac{64}{\sqrt{100}} = 6.4$$
$$Z = \frac{\bar{X} - \mu}{SE} = \frac{1250 - 1400}{6.4} = -23.4375$$
Now from the standard normal table the corresponding Z value at the 5% level of significance for a
one-tailed test is obtained as -1.645. Since the calculated Z value, -23.4375, is less than the
table value, -1.645, we reject the null hypothesis, accept the alternative and confirm
that the lifetime of the bulbs has gone down.
Example 5: The mean I.Q. score in a large competitive examination was found to be 120 with a
standard deviation of 4. In a random sample of 100, the mean score was found to be 125. Is there
any evidence that the mean I.Q. score has gone up?
Solution:
$H_0: \mu = 120$
$H_1: \mu > 120$
$$SE = \frac{\sigma}{\sqrt{n}} = \frac{4}{\sqrt{100}} = \frac{4}{10} = 0.4$$
$$Z = \frac{\bar{X} - \mu}{SE} = \frac{125 - 120}{0.4} = \frac{5}{0.4} = 12.5$$
Now from the standard normal table the corresponding Z value at the 5% level of significance for a
one-tailed test is obtained as 1.645. Since the calculated Z value, 12.5, is greater than the table
value, 1.645, we reject the null hypothesis, accept the alternative and confirm that the
I.Q. score has gone up.
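2. Population normal, population infinite, sample size may be large or small and variance of
the population is known. $H_1$ may be one sided or two sided.
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$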
Example 6: From a large population a sample of 2 observations, 37 and 84, was taken. Test
whether the sample comes from a population whose mean is 60.20 and standard deviation is 17.10
at the 5% level of significance.
Solution:
$H_0: \mu = 60.20$
$H_1: \mu \neq 60.20$
Data: $\mu = 60.20$, $\sigma = 17.10$, $\bar{X} = (37 + 84)/2 = 60.50$
Since the population is sufficiently large, the standard error is given by
$$SE_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{17.10}{\sqrt{2}} = 12.09153$$
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{60.50 - 60.20}{12.09153} = \frac{0.3}{12.09153} = 0.024811$$
Since the calculated Z score 0.024811 < 1.96, the null hypothesis $H_0$ is accepted.
Example 7: From a large population a sample of 225 observations was taken. If the mean of the
sample is 75, test whether the sample comes from a population whose mean is 70 and standard
deviation is 15 at the 5% level of significance.
Solution:
$H_0: \mu = 70$
$H_1: \mu \neq 70$
Data: $\mu = 70$, $\sigma = 15$, $\bar{X} = 75$
Since the population is sufficiently large, the standard error is given by
$$SE_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{225}} = 1$$
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{75 - 70}{1} = 5.00$$
Since the calculated Z score 5.00 > 1.96, we reject the null hypothesis and accept the alternative.
REVIEW PROBLEM FOR Z TEST (non-Normal Population)
1. Population is non-normal and infinite, sample size is large and variance of the population
is known. $H_1$ may be one sided or two sided.
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
This result is applicable in the case of an infinite non-normal population with known variance.
It follows from the central limit theorem. If the non-normal population is
finite, the sample size is large and the variance is known, then our test statistic is written as
$$Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}}$$
Example 13: From a large non-normal population a sample of 121 observations was taken. If the
mean of the sample is 75, test whether the sample comes from a population whose mean is 70 and
standard deviation is 15 at the 5% level of significance.
Solution:
$H_0: \mu = 70$
$H_1: \mu \neq 70$
Data: $\mu = 70$, $\sigma = 15$, $\bar{X} = 75$
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{75 - 70}{15 / \sqrt{121}} = \frac{5}{15/11} = 3.666667$$
Since the calculated Z score 3.6667 > 1.96, we reject the null hypothesis and accept the alternative.
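Example 14: From a non-normal finite population of 5000 units a sample of 225 observations
was taken. If the mean of the sample is 71, test whether the sample comes from a population
whose mean is 70 and standard deviation is 15 at the 5% level of significance.
Solution:
$H_0: \mu = 70$
$H_1: \mu \neq 70$
Data: $\mu = 70$, $\sigma = 15$, $\bar{X} = 71$
Since the sampling fraction n/N = 225/5000 = 0.045 does not exceed 0.05, the finite multiplier is ignored.
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{71 - 70}{15 / \sqrt{225}} = \frac{1}{1} = 1.000$$
Since the calculated Z score 1.000 < 1.96, we accept the null hypothesis.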
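2. Population non-normal, population infinite, sample size is large but variance of the
population is not known. $H_1$ may be one sided or two sided.
$$Z = \frac{\bar{X} - \mu}{\hat{\sigma} / \sqrt{n}},$$
where $\hat{\sigma}$ is the estimate of the population standard deviation obtained from the sample.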
28.23 RELEVANCE OF STUDENT t DISTRIBUTION:
While considering the sampling distribution, we made two assumptions: the mean $\mu$
and the standard deviation $\sigma$ of the given population are known. When one knows the standard
deviation of the universe from which the sample is drawn, it is a simple matter of arithmetic
to calculate the standard error $\sigma_{\bar{X}}$ by using the formula given above. This is not likely
to happen in all cases. When $\sigma$, the population standard deviation, is not known, it is quite
reasonable to use the standard deviation of the individual sample itself as a satisfactory estimate
of the population standard deviation. This means that the sample itself is used to estimate
both the population mean and its standard deviation. Note that this is a unique advantage of
probability sampling. Such a facility is not available for non-probability sampling.
Unfortunately, neither the sample variance
$$s^2 = \frac{\sum (X - \bar{X})^2}{n}$$
nor the sample standard deviation
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$$
is an unbiased estimate of the respective population value.
However, one can remove the bias present in the estimate by
introducing an appropriate correction term. To get an unbiased estimate of the population
variance, the variance of the sample, $s^2$, must be multiplied by n/(n-1). Once this correction
is incorporated, the standard deviation is obtained by simply taking the square root of the
corrected variance. To distinguish this estimated standard deviation from the actual one, we write
it with a hat, as $\hat{\sigma}$ instead of $\sigma$. Thus, the formula for the estimate
may be written as
$$\hat{\sigma}^2 = s^2 \frac{n}{n-1}$$
$$\hat{\sigma} = \sqrt{s^2 \frac{n}{n-1}} = \sqrt{\frac{\sum (X - \bar{X})^2}{n} \cdot \frac{n}{n-1}} = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}}$$
Thus in this formula the sample variance is first corrected for bias, and the estimated standard
deviation is then obtained by simply taking the square root. However, if the universe is normally
distributed, it is possible to make the unbiased estimate of the population standard deviation
directly from the standard deviation of the sample itself, instead of adjusting the variance first and
then taking the square root.
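The correction can be verified with Python's `statistics` module, whose `pvariance` divides by n and whose `variance` divides by n - 1. A minimal sketch using the two observations 37 and 84 from the illustrative sample.

```python
import math
import statistics

sample = [37, 84]
n = len(sample)

s2 = statistics.pvariance(sample)   # divisor n: the text's s^2
sigma_hat2 = s2 * n / (n - 1)       # corrected for bias by n / (n - 1)
sigma_hat = math.sqrt(sigma_hat2)   # estimated population standard deviation

print(f"s^2         = {s2}")
print(f"sigma_hat^2 = {sigma_hat2}")
print(f"sigma_hat   = {sigma_hat:.5f}")

# The corrected variance agrees with the divisor n - 1 formula
print(math.isclose(sigma_hat2, statistics.variance(sample)))
```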
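Once the standard deviation of the universe is estimated as $\hat{\sigma}$, the standard error of the
sample mean for a finite population is obtained by using the standard formula as usual.
$$\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$
So the appropriate t ratio may be written as
$$t = \frac{\bar{X} - \mu}{\frac{\hat{\sigma}}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}}$$
However, for a large population, as stated earlier, this formula after ignoring the finite multiplier
may be written as
$$\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sqrt{s^2 \frac{n}{n-1}} = \sqrt{\frac{s^2}{n-1}} = \frac{s}{\sqrt{n-1}}$$
So the appropriate t ratio may be written as
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n-1}} = \frac{(\bar{X} - \mu) \sqrt{n-1}}{s}$$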
For large samples of size more than 30, statisticians use the standard normal table as a proxy
for the t value, because in this range the t distribution approaches the standard normal. For a
sample of more than 120, the t value and the Z value are both 1.96 at the 5%
level of significance.
Properties of t distribution:
1. The t distribution is symmetrical but a little flatter than the normal distribution, as
shown in the figure.
2. In addition, the t distribution varies with the number of degrees of
freedom, which is its parameter.
Fig 28.11: Normal distribution and t distribution compared, f(X) against X.
When the number of degrees of freedom is very small, the t distribution differs markedly
from the normal, but as the degrees of freedom increase the t distribution
approaches the normal distribution. When there are 120 or more degrees of freedom,
there is very little difference between the t and the normal distribution. However, for more than
30 degrees of freedom it is common practice to use the normal distribution as an approximation to the t
distribution.
The mean of the t distribution, like that of the standard normal distribution, is 0, but its variance is
k / (k-2), where k is the number of degrees of freedom. Hence the variance of t is defined only when k is greater than 2.
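REVIEW PROBLEMS FOR t-TEST
1. Population normal, population finite, sample size is small but variance of the population
is not known. $H_1$ may be one sided or two sided.
$$t = \frac{\bar{X} - \mu}{\frac{\hat{\sigma}}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}}$$
with n-1 degrees of freedom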
Example 8: From a finite population of 10 a sample of 2 observations, 37 and 84, was taken. Test
whether the sample comes from a population whose mean is 60.20 at the 5% level of significance.
$H_0: \mu = 60.20$
$H_1: \mu \neq 60.20$
Data: $\mu = 60.20$, $\sigma$ not known, $\bar{X} = (37 + 84)/2 = 60.50$
Since the population standard deviation is not given, let us estimate it from the sample
standard deviation s.
$$s^2 = \frac{(37 - 60.50)^2 + (84 - 60.50)^2}{2} = 552.25$$
$$\hat{\sigma}^2 = s^2 \frac{n}{n-1} = 552.25 \times \frac{2}{2-1} = 552.25 \times 2 = 1104.50, \qquad \hat{\sigma} = 33.23402$$
$$\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} = \frac{33.23402}{\sqrt{2}} \sqrt{\frac{10-2}{10-1}} = \frac{33.23402}{\sqrt{2}} \sqrt{\frac{8}{9}} = 22.15601$$
$$t = \frac{\bar{X} - \mu}{\hat{\sigma}_{\bar{X}}} = \frac{60.50 - 60.20}{22.156} = 0.01354$$
Since this calculated t value is less than the table value of t corresponding to one degree of
freedom, namely 12.706, we accept the null hypothesis and confirm that the sample belongs to the
said population.
2. Population normal, population infinite, sample size is small or large, variance of the
population is not known. $H_1$ may be one sided or two sided.
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n-1}}$$
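Example 9: From a large population a sample of 2 observations, 37 and 84, was taken. Test whether the
sample comes from a population whose mean is 60.20 at the 5% level of significance.
Solution:
$H_0: \mu = 60.20$
$H_1: \mu \neq 60.20$
Data: $\mu = 60.20$, $\sigma$ not known, $\bar{X} = 60.50$
$$s^2 = \frac{(37 - 60.50)^2 + (84 - 60.50)^2}{2} = 552.25$$
$$\hat{\sigma}^2 = s^2 \frac{n}{n-1} = 552.25 \times \frac{2}{2-1} = 552.25 \times 2 = 1104.50, \qquad \hat{\sigma} = 33.23402$$
This time, since the population is large, the finite multiplier term is ignored in the calculation of
SE.
$$\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{33.23402}{\sqrt{2}} = 23.5$$
$$t = \frac{\bar{X} - \mu}{\hat{\sigma}_{\bar{X}}} = \frac{60.50 - 60.20}{23.5} = 0.012766$$
Since this calculated t value is less than the table value of t corresponding to one degree of
freedom, 12.706, we accept the null hypothesis.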
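Note:
$$\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sqrt{s^2 \frac{n}{n-1}} = \frac{s}{\sqrt{n}} \sqrt{\frac{n}{n-1}} = \frac{s}{\sqrt{n-1}}$$
So
$$t = \frac{\bar{X} - \mu}{\hat{\sigma}_{\bar{X}}} = \frac{\bar{X} - \mu}{s / \sqrt{n-1}}$$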
Alternatively
$H_0: \mu = 60.20$
$H_1: \mu \neq 60.20$
$$s^2 = \frac{(37 - 60.50)^2 + (84 - 60.50)^2}{2} = 552.25, \qquad s = 23.5$$
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n-1}} = \frac{60.50 - 60.20}{23.5 / \sqrt{2-1}} = \frac{0.3}{23.5} = 0.012766$$
Example 10: A random sample of 5 students from a large population was taken. The marks
scored by them are 80, 50, 40, 90, 80. Do these sample observations confirm that the class
average is 70?
Solution:
$H_0: \mu = 70$
$H_1: \mu \neq 70$
Since the sample size is very small, we propose to use t and not Z in this case.
Table 28.12
| $X$ | $X - \bar{X}$ | $(X - \bar{X})^2$ |
|---|---|---|
| 80 | 12 | 144 |
| 50 | -18 | 324 |
| 40 | -28 | 784 |
| 90 | 22 | 484 |
| 80 | 12 | 144 |
| Total 340 | | 1880 |
$$\bar{X} = \frac{340}{5} = 68$$
$$s^2 = \frac{1}{n} \sum (X - \bar{X})^2 = \frac{1880}{5} = 376, \qquad s = \sqrt{376} = 19.4$$
$$t = \frac{(\bar{X} - \mu) \sqrt{n-1}}{s} = \frac{(68 - 70) \sqrt{4}}{19.4} = -0.21$$
Since $|t| = 0.21$ is less than the table t value of 2.78 corresponding to n-1 = 4 degrees of
freedom, we accept the null hypothesis and confirm that the said class average is 70.
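The t ratio for the marks example can be computed directly from the raw observations. A minimal sketch that follows the text's convention of a sample standard deviation s with divisor n and a t ratio of $(\bar{X} - \mu)\sqrt{n-1}/s$; the function name `t_statistic` is illustrative, and the critical value 2.776 is the standard two-tailed 5% table value for 4 degrees of freedom, entered by hand because the Python standard library has no t distribution.

```python
import math
import statistics

def t_statistic(sample, mu0):
    """t ratio (x_bar - mu0) * sqrt(n - 1) / s, with s computed using divisor n."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.pstdev(sample)  # divisor n, as in the text
    return (x_bar - mu0) * math.sqrt(n - 1) / s

marks = [80, 50, 40, 90, 80]
t = t_statistic(marks, mu0=70)

T_TABLE_4_DF = 2.776  # two-tailed 5% critical value for 4 degrees of freedom
print(f"t = {t:.4f}")
print("Reject H0" if abs(t) > T_TABLE_4_DF else "Do not reject H0")
```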
Example 11: A group of 50 students from a large population was selected at random. The
average age of the sample was 21.5 years with SD = 4 years. Test whether the population mean
age is 22.
Solution:
$H_0: \mu = 22$
$H_1: \mu \neq 22$
$$t = \frac{(\bar{X} - \mu) \sqrt{n-1}}{s} = \frac{(21.5 - 22) \sqrt{49}}{4} = \frac{-0.5 \times 7}{4} = \frac{-3.5}{4} = -0.875, \qquad |t| = 0.875 < 1.96$$
Since the absolute value of t is less than the table Z value of 1.96, we accept the null hypothesis and confirm
that the population mean age is 22.
Note: Since the sample size is n = 50 from a large population, we took the Z value from the
normal table instead of the t value.
3. Population normal, population infinite, sample size large (greater than 120) but variance
of the population is not known. $H_1$ may be one sided or two sided.
$$t \approx Z = \frac{(\bar{X} - \mu) \sqrt{n-1}}{s}$$
with n-1 degrees of freedom for t
Example 12: A sample of 122 items from a large population was selected at random. The
average weight of the sample was 25 kgs with SD = 4 kgs. Test whether the population mean
weight is 21 kgs.
Solution:
$H_0: \mu = 21$
$H_1: \mu \neq 21$
$$t \approx Z = \frac{(\bar{X} - \mu) \sqrt{n-1}}{s} = \frac{(25 - 21) \sqrt{121}}{4} = \frac{4 \times 11}{4} = \frac{44}{4} = 11 > 1.96$$
Since this t value is greater than the table Z value of 1.96, we reject the null hypothesis.
28.25 MEAN AND STANDARD ERROR OF SAMPLING DISTRIBUTION OF
PROPORTIONS (SMALL SAMPLE)
Clearly this time also the grand mean of proportion of TV possession from all the 45 samples is
0.4. So the sample estimate of proportion from a single sample is an unbiased estimate of the
population proportion.
The standard error of this sampling distribution of all the 45 sample proportions is obtained in the
table shown below.
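Table 28.10
| Proportion owning colour TV (X) | No. of samples (f) | fX | X - p | (X - p)^2 | f(X - p)^2 |
|---|---|---|---|---|---|
| 0.00 | 15 | 0 | -0.4 | 0.16 | 2.40 |
| 0.50 | 24 | 12 | 0.1 | 0.01 | 0.24 |
| 1.00 | 6 | 6 | 0.6 | 0.36 | 2.16 |
| Total | 45 | 18 | - | - | 4.80 |
Mean proportion (p):
$$p = \frac{\sum fX}{\sum f} = \frac{18}{45} = 0.4$$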
The standard error of the sampling distribution of sample proportion:
$$SE_p = \sqrt{\frac{\sum f (X - p)^2}{45}} = \sqrt{\frac{4.8}{45}} = 0.327$$
Clearly the mean of all the 45 sample proportions is equal to the population proportion $\pi$.
Thus this time also p is an unbiased estimate of the population proportion.
Table 28.11
Sl. No  Family  Proportion    Sl. No  Family  Proportion
1 1 & 2* 0.50 25 4 & 5* 0.50
2 1 & 3* 0.50 26 4 & 6 0.00
3 1 & 4 0.00 27 4 & 7 0.00
4 1 & 5* 0.50 28 4 & 8* 0.50
5 1 & 6 0.00 29 4 & 9 0.00
6 1 & 7 0.00 30 4 & 10 0.00
7 1 & 8* 0.50
8 1 & 9 0.00 31 5* & 6 0.50
9 1 & 10 0.00 32 5* & 7 0.50
33 5* & 8* 1.00
10 2 & 3* 0.50 34 5* & 9 0.50
11 2 & 4 0.00 35 5* & 10 0.50
12 2 & 5* 0.50
13 2 & 6 0.00 36 6 & 7 0.00
14 2 & 7 0.00 37 6 & 8* 0.50
15 2 & 8* 0.50 38 6 & 9 0.00
16 2 & 9 0.00 39 6 & 10 0.00
17 2 & 10 0.00
40 7 & 8* 0.50
18 3* & 4 0.50 41 7 & 9 0.00
19 3* & 5* 1.00 42 7 & 10 0.00
20 3* & 6 0.50
21 3* & 7 0.50 43 8* & 9 0.50
22 3* & 8* 1.00 44 8 & 10 0.00
23 3* & 9 0.50
24 3* & 10 0.50 45 9* & 10 0.21
Total 18.00
Mean 0.4
Calculation of the standard error using the formula when the population is finite and the
population proportion is known
Without the laborious job of taking all the 45 samples, one can calculate the standard
error of the sampling distribution of all 45 sample proportions by using the following formula,
provided the population proportion ($\pi$) is known.
$$SE_p = \sqrt{\frac{\pi (1 - \pi)}{n}} \sqrt{\frac{N-n}{N-1}}$$
In our illustrative example this result is obtained as
$$SE_p = \sqrt{\frac{0.4 (1 - 0.4)}{2} \cdot \frac{10-2}{10-1}} = \sqrt{\frac{0.24}{2} \cdot \frac{8}{9}} = \sqrt{\frac{0.96}{9}} = 0.327$$
Hypothesis Testing:
Having known the population proportion $\pi$, the sample proportion p, and the standard error of the
sampling distribution, we can conduct the Z test as usual.
$$Z = \frac{p - \pi}{SE_p} = \frac{0.5 - 0.4}{0.327} = \frac{0.1}{0.327} = 0.30581$$
Since the calculated Z value is less than Z = 1.96 at the 5% level of significance, we accept $H_0$
and confirm that the sample proportion is a good estimator for the population proportion.
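The agreement between the enumeration of all 45 samples and the formula can be checked in a few lines. A minimal sketch under stated assumptions: the family labels 1 to 10 and the particular set of four TV-owning families are hypothetical, since only the count (4 out of 10) affects the result.

```python
import math
from itertools import combinations
from statistics import fmean, pstdev

N, n = 10, 2
# Assumption: which four families own a TV is hypothetical; only the count matters.
owners = {1, 2, 3, 4}
pi = len(owners) / N  # population proportion, 0.4

# Sample proportion of TV owners in every possible sample of size 2
proportions = [
    sum(family in owners for family in pair) / n
    for pair in combinations(range(1, N + 1), n)
]

se_enumerated = pstdev(proportions)  # standard deviation of all sample proportions
se_formula = math.sqrt(pi * (1 - pi) / n) * math.sqrt((N - n) / (N - 1))

print(f"number of samples   : {len(proportions)}")
print(f"mean of proportions : {fmean(proportions):.3f}")
print(f"SE by enumeration   : {se_enumerated:.3f}")
print(f"SE by formula       : {se_formula:.3f}")
```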
Calculation of the standard error when the population is infinite and population proportion
is known:
For a large population with a comparatively small sample, the correction term in the above
formula can be taken as unity and hence omitted. So the formula reduces to
$$SE_p = \sqrt{\frac{\pi (1 - \pi)}{n}} = \sqrt{\frac{0.4 \times 0.6}{2}} = \sqrt{\frac{0.24}{2}} = \sqrt{0.12} = 0.3464$$
Having known the population proportion $\pi$, the sample proportion p and the standard error of the
sampling distribution, we can conduct the Z test as usual.
$$Z = \frac{p - \pi}{SE_p} = \frac{0.5 - 0.4}{0.3464} = \frac{0.1}{0.3464} = 0.2887$$
Estimation of standard error when the population is finite and population proportion is
not known
In the absence of the population proportion we could use the sample proportion p itself as the
proxy for the population proportion $\pi$. Thus the standard error of the sampling distribution is
obtained by using the following formula.
$$SE_p = \sqrt{\frac{p (1 - p)}{n-1} \left( \frac{N-n}{N} \right)}$$
Estimation of SE in the absence of $\pi$ in our illustrative problem is shown below.
$$SE_p = \sqrt{\frac{p (1 - p)}{n-1} \left( 1 - \frac{n}{N} \right)} = \sqrt{\frac{0.4 \times 0.6}{2-1} \left( 1 - \frac{2}{10} \right)} = \sqrt{\frac{0.24}{1} \cdot \frac{8}{10}} = \sqrt{0.192} = 0.438178$$
Once the SE is estimated for our finite population with sample size 2, the appropriate test statistic is
t and not Z.
$$t = \frac{p - \pi}{SE_p} = \frac{0.5 - 0.4}{0.438178} = \frac{0.1}{0.438178} = 0.228218$$
Since the calculated t value 0.228218 is less than the table t value of 12.706 at the 5% level of
significance, we accept the null hypothesis and confirm that the sample proportion is a good
estimate of the population proportion.
Estimation of standard error when the population is large and the population proportion
is not known
If the population is large, then irrespective of the sample size we can use the following formula for
the calculation of the standard error.
$$SE_p = \sqrt{\frac{p (1 - p)}{n-1}} = \sqrt{\frac{0.4 \times 0.6}{2-1}} = \sqrt{0.24} = 0.4899$$
Once the SE is estimated, even though the population is large, for small samples we use only t and
not Z for testing.
$$t = \frac{p - \pi}{SE_p} = \frac{0.5 - 0.4}{0.4899} = \frac{0.1}{0.4899} = 0.2041$$
If the population is large and the sample size is also large, then we can use Z or t; both will give
the same result.
$$t \approx Z = \frac{p - \pi}{SE_p}$$
28.26 CHI-SQUARE TEST FOR GOODNESS OF FIT
The chi-square test is used to test significance in the analysis of frequency distributions.
Categorical variables such as sex distribution may be analysed for statistical significance.
Example 2: In a hospital 60 male and 40 female child births were recorded during Feb. 1990.
Test the hypothesis that male child births and female child births are equal.
[Figure: observed distribution (60 boys, 40 girls) against the expected distribution (50 boys, 50 girls). Is the difference between the expected and observed distributions statistically significant?]
To analyze the birth data given, we start with the null hypothesis that the number of male
births equals the number of female births. In other words, the actual distribution
of the sample equals the expected distribution of the population. The needed chi-square statistic
is obtained by using the following formula.
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
| Sex | Observed frequency (O) | Expected probability | Expected frequency (E) | O - E | (O - E)^2 / E |
|---|---|---|---|---|---|
| Male | 60 | 0.5 | 50 | 10 | 2 |
| Female | 40 | 0.5 | 50 | -10 | 2 |
| Total | 100 | 1 | 100 | | 4 |
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 4$$
Since the table value of chi-square for one degree of freedom is 3.841 and our
calculated chi-square value is 4 (4 > 3.841), we reject the null hypothesis.
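The goodness-of-fit calculation can be sketched as follows. This is a minimal illustration; the function name `chi_square_statistic` is an assumption of this sketch, and the critical value 3.841 is taken from the text instead of being computed.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [60, 40]                # male and female births
expected_probability = [0.5, 0.5]  # null hypothesis: equal numbers
total = sum(observed)
expected = [total * p for p in expected_probability]

chi2 = chi_square_statistic(observed, expected)
CRITICAL_1_DF_5PCT = 3.841  # table value for 1 degree of freedom at the 5% level

print(f"chi-square = {chi2}")
print("Reject H0" if chi2 > CRITICAL_1_DF_5PCT else "Do not reject H0")
```

The degrees of freedom equal the number of categories minus one, which is 1 for the two sexes.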
29.14 CHOICE OF THE APPROPRIATE STATISTICAL TESTS:

To select a particular test the researcher should consider the following three important points.
1. What type of question is to be answered?
2. Does the test statistic involve one, two or k samples at a time?
3. Is the scale of measurement of the data nominal, ordinal, interval or ratio?
1. Type of questions
The choice of statistical technique depends upon the type of question that the researcher is
attempting to answer. For example, if the researcher is concerned with a measure of central tendency such as the
mean, then one may opt for the Z or t test. If he is concerned with the distribution of the variable,
then the chi-square test for goodness of fit is appropriate.
2. Number of variables
The number of variables that will be investigated simultaneously is a primary consideration in the
choice of the statistical technique. If only one variable is involved, then univariate analysis is
appropriate. To answer questions relating to two variables one must use bivariate statistical
analysis. To answer questions relating to several variables one will have to opt for
multivariate tools.
3. Scale of measurement
Here again the choice of the statistical technique depends heavily upon the scale used to
measure the variable concerned. For example, testing a hypothesis about the mean requires
interval or ratio scale data. If the data are measured on a nominal scale, then clearly the mode is the
only measure of central tendency that the researcher can explore. On the other hand, if the data are measured on an
ordinal scale, then he can explore median, quartile and percentile related issues. The
following Table 1 gives the summary of the scale-wise classification of the relevant measures. Table
2 shows guidelines for selecting the relevant test statistics.
Table 1
Scale               Central tendency   Dispersion
Nominal             Mode               None
Ordinal             Median             Percentile
Interval or ratio   Mean               Standard deviation
Table 2
Guidelines for selecting univariate tests
Scale | Business problem | Relevant statistical question | Possible test statistic
Interval or ratio | Compare actual and hypothetical values of average salary | Does the sample mean differ significantly from the population mean? | Z for large sample and t for small sample
Ordinal | Compare actual and expected evaluations | Does the distribution of scores on a scale with categories excellent, good, fair and poor differ from the expected distribution? | Chi-square test
Ordinal | Determine ordered preferences for all brands in a product group | Does a set of rank orderings in a sample differ from an expected or hypothesised rank ordering? | Kolmogorov-Smirnov test
Nominal | Identify the sex of key executives | Is the number of female executives equal to the number of male executives? | Chi-square test
Nominal | Indicate the percentage of key executives who are male | Is the proportion of male executives the same as the hypothesized proportion? | t-test for proportion
29.13 PARAMETRIC AND NON-PARAMETRIC TESTS
The terms parametric statistics and non-parametric statistics refer to two major groups of statistical test procedures. The major distinction between these two groups lies in the underlying assumptions regarding the data to be analyzed. When the data are measured on interval or ratio scales and the sample size is large, parametric statistical procedures are appropriate. Parametric tests are more powerful because they use data measured on interval or ratio scales. Though parametric tests are powerful, some stringent conditions must be satisfied beforehand. The following are the conditions a parametric test must satisfy.
1. The observations must be independent.
2. The observations must come from a normal parent population. If the population is not normal, the sample should have been drawn from a very large population.
3. The populations must have equal variances if two populations are compared.
4. The measurement scale must be at least interval, which enables further arithmetic operations.
Thus it is the duty of the researcher to check the fulfillment of these assumptions before selecting the test statistic. Performing diagnostic checks on the data will help the researcher to identify the most appropriate tests.
On the other hand, in non-parametric tests we use only nominal and ordinal data. Non-parametric tests have fewer and less stringent assumptions. They do not require either normality or equal variance. Further, non-parametric tests are the only answer for data available on nominal and ordinal scales. These tests are also usable for interval and ratio scales, although they waste some of the information available in the data. The following Table 23 gives the summary of test procedures.
28.25 ONE SAMPLE TEST FOR MEAN: A REVIEW
3. Population normal, population finite, sample size is small but variance of the population is not known.
$$t = \frac{\bar{X} - \mu}{\dfrac{\hat{\sigma}}{\sqrt{n}}\sqrt{\dfrac{N - n}{N - 1}}}$$
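Example 8: From a finite population of 10, a sample of 2 observations, 37 and 84, was taken. Test whether the sample comes from a population whose mean is 60.20 at 5% level of significance.
$$H_0: \mu = 60.20 \qquad H_1: \mu \neq 60.20$$
Data: $\bar{X} = (37 + 84)/2 = 60.50$, $\mu = 60.20$, $\sigma$ not known, $N = 10$, $n = 2$.
$$s^2 = \frac{(37 - 60.50)^2 + (84 - 60.50)^2}{2} = 552.25$$
$$\hat{\sigma}^2 = \frac{n}{n - 1}\, s^2 = \frac{2}{2 - 1} \times 552.25 = 1104.50, \qquad \hat{\sigma} = 33.23402$$
$$\sigma_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}}\sqrt{\frac{N - n}{N - 1}} = \frac{33.23402}{\sqrt{2}}\sqrt{\frac{10 - 2}{10 - 1}} = \frac{33.23402}{\sqrt{2}}\sqrt{\frac{8}{9}} = 22.15601$$
$$t = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{60.50 - 60.20}{22.156} = 0.01354$$
Since the calculated t is less than the table value of t for n - 1 = 1 degree of freedom, namely 12.706, we accept the null hypothesis and conclude that the sample belongs to the said population.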
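As a cross-check of Example 8, the following minimal Python sketch recomputes the statistic step by step. The function name `t_finite_population` is hypothetical; the code assumes, as the example does, that the sample variance uses the divisor n and is then rescaled by n/(n - 1).

```python
import math

def t_finite_population(sample, mu, population_size):
    """t statistic for a small sample drawn from a finite population."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / n          # sample variance, divisor n
    sigma_hat = math.sqrt(s2 * n / (n - 1))                # estimate of population SD
    fpc = math.sqrt((population_size - n) / (population_size - 1))  # finite multiplier
    se = sigma_hat / math.sqrt(n) * fpc
    return (mean - mu) / se

t_value = t_finite_population([37, 84], mu=60.20, population_size=10)
print(round(t_value, 5))
```

Dropping the `fpc` factor gives the statistic for a large population.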
4. Population normal, population infinite, sample size is small or large, variance of the population is not known.
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n - 1}}$$
Example 9: From a large population a sample of 2 observations, 37 and 84, was taken. Test whether the sample comes from a population whose mean is 60.20 at 5% level of significance.
Solution:
$$H_0: \mu = 60.20 \qquad H_1: \mu \neq 60.20$$
Data: $\bar{X} = 60.50$, $\mu = 60.20$, $\sigma$ not known.
$$s^2 = \frac{(37 - 60.50)^2 + (84 - 60.50)^2}{2} = 552.25$$
$$\hat{\sigma}^2 = \frac{n}{n - 1}\, s^2 = \frac{2}{2 - 1} \times 552.25 = 1104.50, \qquad \hat{\sigma} = 33.23402$$
This time, since the population is large, the finite multiplier term is ignored in the calculation of SE.
$$\sigma_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{33.23402}{\sqrt{2}} = 23.5$$
$$t = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{60.50 - 60.20}{23.5} = 0.012766$$
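Note:
$$\sigma_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{1}{\sqrt{n}}\sqrt{\frac{n}{n - 1}}\; s = \frac{s}{\sqrt{n - 1}}$$
So
$$t = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{s / \sqrt{n - 1}}$$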
Alternatively
$$H_0: \mu = 60.20 \qquad H_1: \mu \neq 60.20$$
$$s^2 = \frac{(37 - 60.50)^2 + (84 - 60.50)^2}{2} = 552.25, \qquad s = 23.5$$
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n - 1}} = \frac{60.50 - 60.20}{23.5 / \sqrt{2 - 1}} = \frac{0.3}{23.5} = 0.012766$$
Example 10: A random sample of 5 students from a large population was taken. The marks scored by them are 80, 50, 40, 90, 80. Do these sample observations confirm that the class average is 70?
Solution:
$$H_0: \mu = 70 \qquad H_1: \mu \neq 70$$
Since the sample size is very small we propose to use t and not Z in this case.
Table 28.12
$X$      $X - \bar{X}$   $(X - \bar{X})^2$
80       12              144
50       -18             324
40       -28             784
90       22              484
80       12              144
Total 340                1880
$$\bar{X} = \frac{340}{5} = 68$$
$$s^2 = \frac{1}{n}\sum (X - \bar{X})^2 = \frac{1880}{5} = 376, \qquad s = \sqrt{376} = 19.4$$
$$t = \frac{(\bar{X} - \mu)\sqrt{n - 1}}{s} = \frac{(68 - 70)\sqrt{4}}{19.4} = -0.21$$
Since the absolute value of t (0.21) is less than the table t value of 2.78 corresponding to n - 1 = 4 degrees of freedom, we accept the null hypothesis and conclude that the sample is consistent with a class average of 70.
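The arithmetic of Example 10 can be reproduced with a short Python sketch. The function name `one_sample_t` is illustrative only, and the table value 2.78 is the one quoted in the text; the code follows the example's convention of computing s with the divisor n and then multiplying by the square root of n - 1.

```python
import math

def one_sample_t(sample, mu):
    """t = (mean - mu) * sqrt(n - 1) / s, with s computed using divisor n."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return (mean - mu) * math.sqrt(n - 1) / s

marks = [80, 50, 40, 90, 80]
TABLE_T_4DF = 2.78            # two-tailed 5% value for 4 degrees of freedom (from the text)

t_marks = one_sample_t(marks, mu=70)
accept_h0 = abs(t_marks) < TABLE_T_4DF
print(round(t_marks, 3), accept_h0)
```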
Example 11: A group of 50 students from a large population was selected at random. The average age of the sample was 21.5 years with SD = 4 years. Test whether the population mean age is 22.
Solution:
$$H_0: \mu = 22 \qquad H_1: \mu \neq 22$$
$$t = \frac{(\bar{X} - \mu)\sqrt{n - 1}}{s} = \frac{(21.5 - 22)\sqrt{49}}{4} = \frac{-0.5 \times 7}{4} = \frac{-3.5}{4} = -0.875, \qquad |t| = 0.875 < 1.96$$
Since the absolute value of t is less than the table Z value of 1.96, we accept the null hypothesis and conclude that the population mean age may be taken as 22.
Note: Since the sample size is n = 50 from a large population, we took the Z value from the normal table instead of the t value.
5. Population normal, population infinite, sample size large (greater than 120) but variance of the population is not known.
$$t \approx Z = \frac{(\bar{X} - \mu)\sqrt{n - 1}}{s}$$
with n - 1 degrees of freedom for t.
Example 12: A sample of 122 items from a large population was selected at random. The average weight of the sample was 25 kg with SD = 4 kg. Test whether the population mean weight is 21 kg.
Solution:
$$H_0: \mu = 21 \qquad H_1: \mu \neq 21$$
$$t \approx Z = \frac{(\bar{X} - \mu)\sqrt{n - 1}}{s} = \frac{(25 - 21)\sqrt{121}}{4} = \frac{4 \times 11}{4} = \frac{44}{4} = 11 > 1.96$$
Since this t value is greater than the table Z value of 1.96, we reject the null hypothesis.
6. Population is non-normal and infinite, sample size is large and variance of the population is known.
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
This result is applicable in the case of an infinite non-normal population with known variance. It follows from the central limit theorem. If the non-normal population is finite, the sample size is large and the variance is known, then our test statistic is written as
$$Z = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N - n}{N - 1}}}$$
Example 13: From a large non-normal population a sample of 121 observations was taken. If the mean of the sample is 75, test whether the sample comes from a population whose mean is 70 and standard deviation is 15 at 5% level of significance.
Solution:
$$H_0: \mu = 70 \qquad H_1: \mu \neq 70$$
Data: $\bar{X} = 75$, $\sigma = 15$, $\mu = 70$.
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{75 - 70}{15 / \sqrt{121}} = \frac{5}{15 / 11} = 3.666667$$
Since the calculated Z score 3.6667 > 1.96, we reject the null hypothesis and accept the alternative.
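A minimal Python sketch of the Z statistic with known population standard deviation is given below. The function name `z_known_sigma` and its optional `population_size` argument are assumptions of this illustration; when a population size is passed, the finite multiplier from the formula above is applied.

```python
import math

def z_known_sigma(sample_mean, mu, sigma, n, population_size=None):
    """Z statistic for a large sample when the population SD is known."""
    se = sigma / math.sqrt(n)
    if population_size is not None:
        # finite population multiplier sqrt((N - n) / (N - 1))
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return (sample_mean - mu) / se

z_13 = z_known_sigma(75, 70, 15, 121)   # data of Example 13
print(round(z_13, 4), abs(z_13) > 1.96)
```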
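Example 14: From a non-normal finite population of 5000 units a sample of 225 observations was taken. If the mean of the sample is 71, test whether the sample comes from a population whose mean is 70 and standard deviation is 15 at 5% level of significance.
Solution:
$$H_0: \mu = 70 \qquad H_1: \mu \neq 70$$
Data: $\bar{X} = 71$, $\sigma = 15$, $\mu = 70$.
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{71 - 70}{15 / \sqrt{225}} = \frac{1}{1} = 1.000$$
Applying the finite multiplier $\sqrt{(N - n)/(N - 1)} = \sqrt{4775/4999} \approx 0.977$ changes this only slightly, to about 1.02.
Since the calculated Z score is less than 1.96, we accept the null hypothesis.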
7. Population non-normal, population infinite, sample size is large but variance of the population is not known.
$$Z = \frac{\bar{X} - \mu}{\hat{\sigma} / \sqrt{n}}$$
28.26 ONE SAMPLE TEST FOR PROPORTIONS
1. Difference between sample proportion and the universe proportion when the universe
proportion is known:
Example 15: A wholesaler of mangoes claims that only 4% of the mangoes supplied by him are defective. In a random sample of 600 mangoes 36 were defective. Does this observation confirm the claim of the wholesaler?
Solution:
The hypotheses may be stated as
$$H_0: \pi = 0.04 \qquad H_1: \pi > 0.04$$
This is a one-tailed test, because nobody will object if the proportion of defectives is less than 0.04.
From the sample, the proportion of defective mangoes is p = 36/600 = 0.06.
$$SE = \sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.04 \times 0.96}{600}} = 0.008$$
$$Z = \frac{p - \pi}{\sigma_p} = \frac{0.06 - 0.04}{0.008} = 2.5$$
At 5% level of significance the critical value for a one-tailed test is 1.645. Since the calculated Z is greater than the table entry, the null hypothesis is rejected; the observation does not confirm the wholesaler's claim.
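The one-sample test for a proportion can be sketched in Python as follows. The function name `z_proportion` is illustrative; the standard error is computed from the hypothesised proportion, as in the worked example, and 1.645 is the one-tailed 5% critical value quoted in the text.

```python
import math

def z_proportion(p_sample, pi_0, n):
    """Z statistic comparing a sample proportion with a hypothesised proportion."""
    se = math.sqrt(pi_0 * (1 - pi_0) / n)   # SE under the null hypothesis
    return (p_sample - pi_0) / se

ONE_TAIL_5PCT = 1.645

z_mango = z_proportion(36 / 600, 0.04, 600)   # data of Example 15
reject_claim = z_mango > ONE_TAIL_5PCT
print(round(z_mango, 3), reject_claim)
```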
Example 16: An apple wholesaler claims that 95% of the apples supplied are good ones. In a random sample of 200 apples 182 were good. Does this confirm the wholesaler's claim?
Solution:
The hypotheses may be stated as
$$H_0: \pi = 0.95 \qquad H_1: \pi < 0.95$$
This is a one-tailed test, because nobody will object if the good ones are more than 95%.
From the sample, the proportion of good apples is p = 182/200 = 0.91.
$$SE = \sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.95 \times 0.05}{200}} = 0.0154$$
$$Z = \frac{p - \pi}{SE} = \frac{0.91 - 0.95}{0.0154} = -2.60$$
At 5% level of significance the critical value for a one-tailed test is 1.645. Since the absolute value of the calculated Z (2.60) is greater than the table entry, the null hypothesis is rejected; the sample does not confirm the wholesaler's claim.
Example 17: Assume that a metal stamping machine, when properly set, produces on average a proportion 0.05 of defective items. On inspection of a lot of 400 parts, 32 were found defective. Test whether the machine setting is proper or needs adjustment.
Solution:
Let $\pi$ be the population proportion of defectives and p the sample proportion of defectives.
$$H_0: \pi = 0.05 \qquad H_1: \pi > 0.05$$
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}\sqrt{\frac{N - n}{N - 1}}$$
Since the size of the sample, namely 400, is very small in relation to the universe, we can safely ignore the correction term.
$$p = \frac{32}{400} = 0.08$$
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.05 \times 0.95}{400}} = 0.0109$$
$$Z = \frac{p - \pi}{\sigma_p} = \frac{0.08 - 0.05}{0.0109} = 2.75$$
Since we are interested only in the probability of obtaining a proportion of defectives greater than 0.05, a one-tailed test is the appropriate one. From the Z table the entry corresponding to a level of significance of 0.05 is z = 1.645. Since the calculated z value, namely 2.75, is greater than the table value of 1.645, we reject the null hypothesis, accept the alternative hypothesis and recommend corrective measures for the machine.
Example 18: In a hospital 480 female and 520 male childbirths were recorded during Feb. 1990. Test the hypothesis that male children are born in greater proportion.
$H_0$: Male and female children are born in equal proportion ($\pi = 1/2$)
$H_1$: Male children are born in greater proportion ($\pi > 1/2$)
$$SE = \sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{\frac{1}{2} \times \frac{1}{2}}{1000}} = 0.016$$
Observed male proportion: $p = 520/1000 = 0.52$
Expected male proportion: $\pi = 500/1000 = 0.5$
$$Z = \frac{p - \pi}{\sigma_p} = \frac{0.52 - 0.5}{0.016} = 1.25 < 1.645$$
At 5% level of significance the Z value from the standard normal table is 1.645 (one-tailed test). Since the calculated Z value 1.25 is less than the table value of 1.645, we accept the null hypothesis and conclude that the proportions of male and female childbirths are equal.
2. Difference between sample proportion and the universe proportion when the universe
proportion is not known:
In all the previous illustrations the population proportion $\pi$ was assumed to be known in advance. In many cases such advance information is rarely available. Under such circumstances we usually take the sample proportion of success as the estimate of the population proportion of success. This is justified provided the sample size n is considerably large and neither p nor q is very small. If p and q do not differ much, the sample proportion approximately follows a normal distribution.
However, if p is very small, we often take the highest possible value of pq, namely (1/2)(1/2) = 1/4, in all calculations to be on the safer side.
Assume that a survey is made across 1,00,000 families in a large city to determine the percentage of female children. Out of 1,000 children chosen at random 620 (62%) were female. In the absence of any other information, and assuming that similar conditions exist in the whole city, we can safely use the sample proportion p itself as an estimate of the population proportion $\pi$. Using p as an estimate of $\pi$, it is possible to estimate the confidence interval within which the percentage of female children will lie. The value of p is substituted in place of $\pi$ and the standard error of p is obtained by using the following formula:
$$\sigma_p = \sqrt{\frac{pq}{n - 1} \cdot \frac{N - n}{N}} = \sqrt{\frac{0.62 \times 0.38}{1000 - 1}\,(1 - 0.01)} = 0.0153 \;\; (1.53\%)$$
This indicates that if all possible samples of size 1000 were taken from this universe, the sample percentage would fall within 62 ± 1.96 × 1.53, that is between 59.0% and 65.0%, in approximately 95 out of 100 samples.
If N is very large the finite multiplier is ignored in the calculation of SE:
$$\sigma_p = \sqrt{\frac{pq}{n}}$$
Example 19: In a sample of 400 children from a very large city, 150 were found to be underfed. Estimate the percentage of underfed in the population. Also construct the 99 percent confidence interval.
Solution:
From the data given, the proportion of underfed in the sample is obtained as
p = 150/400 = 3/8 = 37.5%
So q = 5/8 = 62.5%, and the total number of children examined is n = 400.
$$SE_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{\frac{3}{8} \times \frac{5}{8}}{400}} = 0.024 = 2.4\%$$
Whatever the percentage of underfed in the city as a whole may be, a simple sample would give a percentage within three times the standard error of it.
So the required interval is 37.5 ± (3 × 2.4)%, that is 30.3% to 44.7%. This means that about 99 sample results out of 100 will lie in the said interval.
Alternatively, if we take the maximum value of pq = 1/4 in the calculation of the SE, then
$$SE_p = \sqrt{\frac{\frac{1}{2} \times \frac{1}{2}}{400}} = 0.025$$
So the needed interval would be 37.5 ± (3 × 2.5)%, that is 30% to 45%.
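The interval calculation of Example 19 can be sketched in Python as below. The function name `proportion_interval` and its `conservative` flag are assumptions of this illustration; the flag replaces pq by its maximum value 1/4, as in the alternative calculation. Because the code does not round the standard error, the limits differ slightly in the third decimal from the hand calculation.

```python
import math

def proportion_interval(successes, n, multiplier=3.0, conservative=False):
    """Interval p +/- multiplier * SE for a sample proportion."""
    p = successes / n
    pq = 0.25 if conservative else p * (1 - p)   # 1/4 is the largest possible pq
    se = math.sqrt(pq / n)
    return p - multiplier * se, p + multiplier * se

low, high = proportion_interval(150, 400)
low_c, high_c = proportion_interval(150, 400, conservative=True)
print(round(low, 3), round(high, 3))      # interval using the sample pq
print(round(low_c, 3), round(high_c, 3))  # wider interval using pq = 1/4
```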
Whatever may be the percentage of bad mangoes in the lot as a whole, a simple sample would give a percentage within three times the standard error.
So the required interval in percentage is 10 ± (3 × 1.3)%, that is 6.1% to 13.9%.
Alternatively, if we take the maximum value of pq = 1/4 in the calculation of the SE, then
$$SE_p = \sqrt{\frac{\frac{1}{2} \times \frac{1}{2}}{500}} = 0.022 \;\; (2.2\%)$$
So the needed limits would be 10 ± (3 × 2.2)%, that is 3.4% to 16.6%.
28.27 ONE SAMPLE TESTING FOR NUMBER OF SUCCESS
Example 20: A die was thrown 6000 times and the occurrence of 1 or 6 was recorded as a success. There were 2020 successes. Test the null hypothesis that the die is unbiased.
Let $H_0$: the die is unbiased (p = 1/3)
$H_1$: the die is biased (p ≠ 1/3)
The probability of getting a 1 or 6 as success is
$$p = \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}$$
So the probability of failure is
$$q = 1 - \frac{1}{3} = \frac{2}{3}$$
$$SE = \sqrt{npq} = \sqrt{6000 \times \frac{1}{3} \times \frac{2}{3}} = 36.5148$$
The observed number of successes = 2020
The expected number of successes $= np = 6000 \times \frac{1}{3} = 2000$
$$Z = \frac{O - E}{SE} = \frac{O - E}{\sqrt{npq}} = \frac{2020 - 2000}{36.5148} = \frac{20}{36.5148} = 0.5477 < 1.96$$
At 5% level of significance the Z value from the standard normal table is 1.96. Since the calculated Z value 0.5477 is less than the table value of 1.96, we accept the null hypothesis and conclude that the die is unbiased.
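The test on the number of successes can be sketched in Python as follows. The function name `z_successes` is hypothetical; it simply compares the observed count with np in units of the standard error, the square root of npq.

```python
import math

def z_successes(observed, n, p):
    """Z statistic for an observed number of successes in n trials."""
    expected = n * p
    se = math.sqrt(n * p * (1 - p))
    return (observed - expected) / se

z_die = z_successes(2020, 6000, 1 / 3)   # data of Example 20
print(round(z_die, 4), abs(z_die) < 1.96)
```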
Example 21: A coin is tossed 400 times and 210 heads were recorded. Test whether the coin is biased or not.
Solution: We know that the theoretical probability of getting a head in a single toss is p = 1/2, so q = 1/2.
$H_0$: The coin is unbiased (p = 1/2)
$H_1$: The coin is biased (p ≠ 1/2)
The expected number of successes (heads) in 400 tosses = np = 0.5 × 400 = 200
As per the data the actual number of heads noticed = 210
$$SE = \sqrt{npq} = \sqrt{400 \times \frac{1}{2} \times \frac{1}{2}} = \sqrt{100} = 10$$
$$Z = \frac{O - E}{SE} = \frac{210 - 200}{10} = \frac{10}{10} = 1.00 < 1.96$$
Since the calculated Z value is less than the table value of 1.96, we accept the null hypothesis and conclude that the coin is unbiased.
28.28 TWO SAMPLE TESTS:
1. TESTS OF SIGNIFICANCE BETWEEN TWO SAMPLE MEANS
Case 1: The standard deviations $\sigma_1$ and $\sigma_2$ of two populations are known:
In the previous section we dealt with hypothesis testing relating to the population with the help of a single sample. In this section we consider a different research situation involving two samples. Suppose two samples have given us two sample means and we are interested in finding out whether there is a significant difference between the two means. Alternatively, the problem is to find whether they could have come from one and the same universe, or from universes having the same mean and standard deviation. Here we calculate the standard error of the difference between the two sample means and proceed in the usual manner for testing.
Although the general procedures for testing with a single sample and with two samples are identical, some important differences must be noted before we proceed further. In the single sample case we invariably use a random sampling method. In addition to this, in the two sample case both the samples must be independent. In other words, the selection of cases for one sample should not affect the selection of cases for the other sample. The second major difference lies in framing the null hypothesis. The null hypothesis is still a statement of no difference. In the one sample case it is the statement of no difference between the population and the sample. In the two sample case it is again a statement of no difference, this time between the two populations. As usual, if the test statistic falls in the critical region, the null hypothesis of no difference is rejected.
Here the test statistic is based on the difference in the sample means. As long as the samples are large, the sampling distribution of the differences follows the normal distribution, and hence the standard normal could be used to get the needed critical region. In this case the hypotheses and the appropriate test statistic are defined as
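$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}$$
In the above definition the term $(\mu_1 - \mu_2)$ appears to be troublesome since these two values are unknown. However, under the null hypothesis $\mu_1 - \mu_2 = 0$ this term vanishes. Thus our test statistic reduces to
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma_{\bar{X}_1 - \bar{X}_2}}$$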
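Case 2: When both the samples are drawn from the same population, or from two different populations with the same mean and standard deviation, with known SD = $\sigma$:
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{SE} = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$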
Example 22: Two large samples of size 1000 and 2000 from a given population yielded 67.5 and 68 as averages. Can these samples be regarded as drawn from a population having a standard deviation of 2.5 at one percent level of significance?
Solution:
$$n_1 = 1000, \;\; \bar{X}_1 = 67.5, \qquad n_2 = 2000, \;\; \bar{X}_2 = 68, \qquad \sigma = 2.5$$
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{67.5 - 68}{2.5\sqrt{\dfrac{1}{1000} + \dfrac{1}{2000}}} = \frac{-0.5}{2.5 \times 0.0387} = -5.16, \qquad |Z| > 2.58$$
Since the absolute value of Z exceeds the table value of 2.58, we reject the null hypothesis and conclude that the samples are not drawn from the population under reference.
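The calculation of Example 22 can be checked with the following Python sketch. The function name `z_two_means_common_sigma` is illustrative only. Carried to more decimals, the statistic is about -5.16; its absolute value is compared with 2.58.

```python
import math

def z_two_means_common_sigma(mean1, n1, mean2, n2, sigma):
    """Z statistic for two sample means when both share a known SD sigma."""
    se = sigma * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se

z_22 = z_two_means_common_sigma(67.5, 1000, 68.0, 2000, 2.5)
print(round(z_22, 3), abs(z_22) > 2.58)
```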
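Case 3: When samples are drawn from two different populations but with known standard deviations $\sigma_1$ and $\sigma_2$ of the respective populations
The test statistic is given by
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$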
Example 23: In a random sample of 200 villages of district A, the average population was 485 with SD = 50. Another random sample of 200 villages from district B gave an average population of 510 with SD = 40. Is the noticed difference between the averages of the two samples significant at 5% level of significance?
Solution:
From the data:
$$n_1 = 200, \;\; \bar{X}_1 = 485, \;\; \sigma_1 = 50, \qquad n_2 = 200, \;\; \bar{X}_2 = 510, \;\; \sigma_2 = 40$$
The relevant test is summarized as follows
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{SE} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{485 - 510}{\sqrt{\dfrac{50^2}{200} + \dfrac{40^2}{200}}} = -5.5, \qquad |Z| > 1.96$$
Since the absolute value of the calculated Z is greater than 1.96, we reject the null hypothesis and conclude that the difference between the two means is significant.
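When the two populations have different known standard deviations, the standard error combines the two variances. A minimal Python sketch, with the hypothetical function name `z_two_means`, applied to the village data of Example 23:

```python
import math

def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for two sample means with separate known SDs."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

z_23 = z_two_means(485, 50, 200, 510, 40, 200)
print(round(z_23, 2), abs(z_23) > 1.96)
```

The sign of Z only reflects which sample is listed first; for a two-tailed test its absolute value is compared with the table value.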
Case 4: When samples are drawn from two different populations whose standard deviations are not known, but the standard deviations $s_1$ and $s_2$ of the respective samples are known
Since we will rarely be in a position to know the standard deviations of the populations concerned, we use the respective sample standard deviations as proxies for the population values.
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1 - 1} + \dfrac{s_2^2}{n_2 - 1}}}$$
Example 24: A sample of heights of 6400 Indians was found to have a mean height of 67.85 inches and a standard deviation of 2.56, while a sample of 1600 in Pakistan has a mean height of 68.55 inches and SD of 2.52. Do the data indicate any significant difference in heights?
Solution:
$$H_0: \mu_1 - \mu_2 = 0 \qquad H_1: \mu_1 - \mu_2 \neq 0$$
where $\mu_1$ is the mean of the first population and $\mu_2$ is the mean of the second population. Taking the Pakistani sample as sample 1 and the Indian sample as sample 2,
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1 - 1} + \dfrac{s_2^2}{n_2 - 1}}} = \frac{68.55 - 67.85}{\sqrt{\dfrac{2.52^2}{1599} + \dfrac{2.56^2}{6399}}} \approx \frac{0.70}{0.07} = 10 > 2.58$$
The heights differ at 1% level of significance. In these samples the Pakistanis are taller than the Indians.
Example 25: For the data given below conduct the test for the equality of means.
Table 28.13
Sample 1      Sample 2
Mean = 6.2    Mean = 6.5
SD = 1.3      SD = 1.4
n = 324       n = 317
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1 - 1} + \dfrac{s_2^2}{n_2 - 1}}} = \frac{6.2 - 6.5}{\sqrt{\dfrac{1.3^2}{324 - 1} + \dfrac{1.4^2}{317 - 1}}} = \frac{-0.3}{\sqrt{0.0052 + 0.0062}} = \frac{-0.3}{0.107} = -2.80$$
At 5% level, since |Z| = 2.80 > 1.96, the null hypothesis is rejected.

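Case 5: When the sample variances $s_1^2$ and $s_2^2$ are known and the samples are drawn from the same population whose standard deviation is not known.
$$H_0: \mu_1 - \mu_2 = 0 \qquad H_1: \mu_1 - \mu_2 \neq 0$$
where $\mu_1$ is the mean of the first population and $\mu_2$ is the mean of the second population. The common standard deviation is estimated by pooling the two sample variances:
$$\hat{\sigma} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Here t is defined as
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$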
Example 26: An advertising company claims that an attractive picture display at a junction point will increase the sale of real estate plots. In the first 40 days, without the said display, the real estate firm was able to sell 100 plots per day with SD = 20. In the next 40 days, with the said advertisement, the sales went up to 110 plots per day with SD = 25. Test the advertising company's claim that the sales went up due to the advertisement.
Solution:
$$H_0: \mu_1 - \mu_2 = 0 \qquad H_1: \mu_1 - \mu_2 > 0$$
Sample 1: $n_1 = 40$, $\bar{X}_1 = 110$, $s_1 = 25$
Sample 2: $n_2 = 40$, $\bar{X}_2 = 100$, $s_2 = 20$
$$\hat{\sigma} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{39 \times 25^2 + 39 \times 20^2}{40 + 40 - 2}} = 22.64$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{110 - 100}{22.64\sqrt{1/40 + 1/40}} = \frac{10\sqrt{20}}{22.64} = 1.974$$
Since the degrees of freedom in this case, $n_1 + n_2 - 2 = 78$, is fairly large, the sampling distribution can very well be taken as standard normal. So the appropriate one-tailed table value is z = 1.645. Since the calculated t is more than the z from the table, we reject the null hypothesis. Thus, it is confirmed that the advertisement display has increased the sales.
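The pooled-variance calculation of Example 26 can be sketched in Python as follows. The function name `pooled_t` is illustrative; the inputs are the values used in the solution (s1 = 25 with the display, s2 = 20 without it).

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """t statistic using a pooled estimate of the common standard deviation."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se

t_ads = pooled_t(110, 25, 40, 100, 20, 40)
print(round(t_ads, 3), t_ads > 1.645)   # one-tailed comparison at the 5% level
```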
2. TESTS OF SIGNIFICANCE BETWEEN TWO SAMPLE PROPORTIONS:
1. When proportions of success in both the populations are not known:
In the previous section we tested the difference between the actual proportion observed in a sample and the expected proportion in the universe. There may be situations involving two samples with proportions $p_1$ and $p_2$ and sizes $n_1$ and $n_2$ respectively. The question that arises in this situation is whether these two sample proportions differ significantly from one another or not. If the no-difference hypothesis is true, then we accept the null hypothesis and conclude that the difference between these sample proportions is purely due to fluctuations in sampling.
The comparison of two sample proportions or percentages for their equality is similar to the test that we carried out in the preceding section for the equality of two sample means. The probabilities associated with the standard normal curve could be used as an approximate test between the two sample proportions. When the samples are reasonably large, this approximation is adopted almost universally.
Assume machine 1 turns out 25 defective items in a lot of 400 tested and machine 2 turns out 42 defective items in a lot of 600 tested. We want to know whether there is a significant difference in the proportions of defectives turned out by the two machines. Since no more information is available, we have to answer this question only on the basis of the sample proportions. As usual we frame the null hypothesis on a no-difference basis.
$$H_0: \pi_1 = \pi_2 \qquad H_1: \pi_1 \neq \pi_2$$
The proportion of defectives produced by the first machine: $p_1 = 25/400 = 0.0625$
The proportion of defectives produced by the second machine: $p_2 = 42/600 = 0.07$
Z is defined as
$$Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sigma_{p_1 - p_2}}$$
where $(\pi_1 - \pi_2)$ is the difference between the population proportions, $(p_1 - p_2)$ is the difference between the sample proportions, and $\sigma_{p_1 - p_2}$ is the SE of the sampling distribution.
Since under the null hypothesis the population proportions are equal, the second term in the numerator of Z vanishes. Therefore the formula for Z reduces to
$$Z = \frac{p_1 - p_2}{\sigma_{p_1 - p_2}}$$
Under the null hypothesis, since the population proportions are assumed to be equal, we use this information and compute the best possible pooled proportion of defectives as follows.
$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{25 + 42}{400 + 600} = 0.067, \qquad q_0 = 1 - p_0 = 1 - 0.067 = 0.933$$
The standard error of the sampling distribution is computed by using the following formula
$$\sigma_{p_1 - p_2} = \sqrt{p_0 q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{0.067 \times 0.933\left(\frac{1}{400} + \frac{1}{600}\right)} = 0.0161$$
To test the hypothesis let us compute Z
$$Z = \frac{p_1 - p_2}{\sigma_{p_1 - p_2}} = \frac{0.0625 - 0.07}{0.0161} = -0.47$$
Since the absolute value of the calculated Z is less than 1.96 at 5% level of significance, we are not justified in rejecting our null hypothesis.
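The pooled two-proportion test for the two machines can be sketched in Python as below. The function name `z_two_proportions_pooled` is hypothetical; it takes the raw counts so that the pooled proportion is computed directly from them.

```python
import math

def z_two_proportions_pooled(x1, n1, x2, n2):
    """Z statistic for two sample proportions using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)            # pooled proportion under H0
    se = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z_machines = z_two_proportions_pooled(25, 400, 42, 600)
print(round(z_machines, 2), abs(z_machines) < 1.96)
```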
Example 27: In a random sample of 1000 persons from city A, 400 are rice eaters. In another sample of 800 persons from city B, 400 are rice eaters. Is there any significant difference in the rice consumption habits?
Solution:
$$H_0: \pi_1 = \pi_2 \qquad H_1: \pi_1 \neq \pi_2$$
The proportion of rice eaters in city A: $p_1 = 400/1000 = 0.4$
The proportion of rice eaters in city B: $p_2 = 400/800 = 0.5$
Under the null hypothesis, since the population proportions are assumed to be equal, we use this information and compute the best possible pooled proportion of rice eaters as follows.
$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{400 + 400}{1000 + 800} = \frac{800}{1800} = 0.444 \; (44.4\%), \qquad q_0 = 1 - p_0 = 1 - 0.444 = 0.556 \; (55.6\%)$$
The standard error of the sampling distribution is computed by using the following formula
$$\sigma_{p_1 - p_2} = \sqrt{p_0 q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{0.444 \times 0.556\left(\frac{1}{1000} + \frac{1}{800}\right)} = 0.02357$$
To test the hypothesis let us compute Z
$$Z = \frac{p_1 - p_2}{\sigma_{p_1 - p_2}} = \frac{0.4 - 0.5}{0.02357} = -4.24$$
Since the absolute value of the calculated Z is greater than 1.96 at 5% level of significance, we reject the null hypothesis and conclude that the noticed difference is significant.
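Example 28: Out of a random sample of 1000 males selected from city A, 600 were smokers. Out of a random sample of 900 males from city B, 450 were smokers. Is there any significant difference between the smoking habits among males?
Solution:
From the data
$$p_1 = \frac{600}{1000} = 0.6, \;\; n_1 = 1000, \qquad p_2 = \frac{450}{900} = 0.5, \;\; n_2 = 900$$
$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{600 + 450}{1000 + 900} = \frac{21}{38}, \qquad \text{so } q_0 = \frac{17}{38}$$
$$SE = \sqrt{p_0 q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{\frac{21}{38} \times \frac{17}{38}\left(\frac{1}{1000} + \frac{1}{900}\right)} = \sqrt{0.2472 \times 0.00211} = 0.0228$$
$$Z = \frac{p_1 - p_2}{SE} = \frac{0.6 - 0.5}{0.0228} = 4.38 > 1.96$$
Since the calculated Z is greater than 1.96 at 5% level of significance, we reject the null hypothesis and conclude that there is a significant difference between the smoking habits of males in the two cities.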
2. When proportions of success in two distinct populations are not known:
In the previous illustrations, though the samples were drawn from two populations, we assumed that the two populations are alike and estimated a pooled proportion. If the populations are really different such a procedure cannot be used. Instead, we calculate the standard error separately for the two proportions and combine them to get the needed standard error.
$$SE_{p_1} = \sqrt{\frac{p_1 q_1}{n_1}}, \qquad SE_{p_2} = \sqrt{\frac{p_2 q_2}{n_2}}$$
$$SE_{p_1 - p_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$
Example 29: In two large cities the 1990 census reveals that the male populations in these cities are 30% and 25% respectively. Is this difference likely to be hidden in samples of 1200 and 900 respectively?
Solution:
$$H_0: \pi_1 = \pi_2 \qquad H_1: \pi_1 \neq \pi_2$$
$$p_1 = 0.30, \;\; q_1 = 1 - 0.30 = 0.70, \;\; n_1 = 1200, \qquad p_2 = 0.25, \;\; q_2 = 1 - 0.25 = 0.75, \;\; n_2 = 900$$
$$SE = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{0.3 \times 0.7}{1200} + \frac{0.25 \times 0.75}{900}} = \sqrt{0.000175 + 0.00021} = 0.0196$$
$$Z = \frac{p_1 - p_2}{SE} = \frac{0.30 - 0.25}{0.0196} = 2.55 > 1.96$$
At 5% level of significance the Z value from the standard normal table is 1.96. Since the calculated Z value 2.55 is greater than the table value of 1.96, we reject the null hypothesis and conclude that the noticed difference is unlikely to be hidden by sampling fluctuation.
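When the two populations are treated as distinct, the standard error is built from the two separate variances rather than from a pooled proportion. The Python sketch below uses the hypothetical function name `z_two_proportions_unpooled` and takes the second sample size as 900, which is the value that the intermediate figures 0.00021 and 0.0196 in the worked example correspond to; this choice of sample size is an assumption of the sketch.

```python
import math

def z_two_proportions_unpooled(p1, n1, p2, n2):
    """Z statistic for two proportions with separately estimated variances."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

z_census = z_two_proportions_unpooled(0.30, 1200, 0.25, 900)
print(round(z_census, 2), abs(z_census) > 1.96)
```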
3. Comparing a sample proportion with the pooled proportion:
Sometimes the proportion observed in one sample is compared with the pooled proportion of that sample and a second one. The associated standard error is obtained as
$$SE_{p_1 - p_0} = \sqrt{p_0 q_0\,\frac{n_2}{n_1(n_1 + n_2)}}$$
Example 30: In a sample of 100 students from the university department 75 got through the final examination. In another sample of 125 students from an affiliated college 75 got through the said examination. Test whether the university pass percentage is higher than the aggregate pass percentage of both the university and the affiliated college.
Solution:
Let
$H_0$: The pass proportion of the university department is equal to the pass proportion of the aggregate.
$H_1$: The pass proportion of the university department is not equal to the pass proportion of the aggregate.
Given $n_1 = 100$, $n_2 = 125$
The proportion of success in the university department: $p_1 = 75/100 = 0.75$
The proportion of success in the affiliated college: $p_2 = 75/125 = 0.60$
Under the null hypothesis, since the success rates are assumed to be equal, the pooled aggregate success rate is
$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{75 + 75}{100 + 125} = 0.67$$
So $q_0 = 0.33$
$$Z = \frac{p_0 - p_1}{\sqrt{p_0 q_0\,\dfrac{n_2}{n_1(n_1 + n_2)}}} = \frac{0.67 - 0.75}{\sqrt{0.67 \times 0.33 \times \dfrac{125}{100(100 + 125)}}} = \frac{-0.08}{0.035} = -2.3 < -1.96$$
At 5% level of significance the Z value from the standard normal table is -1.96. Since the calculated Z value -2.3 is less than the table value of -1.96, we reject the null hypothesis and conclude that the proportions of success are not equal.
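The comparison of one sample proportion with the pooled proportion can be sketched in Python as follows. The function name `z_sample_vs_pooled` is illustrative. Because the code does not round the pooled proportion to 0.67, it gives about -2.37 rather than the -2.3 of the hand calculation; the conclusion is the same.

```python
import math

def z_sample_vs_pooled(x1, n1, x2, n2):
    """Z statistic comparing the pooled proportion with the first sample's proportion."""
    p1 = x1 / n1
    p0 = (x1 + x2) / (n1 + n2)                         # pooled proportion
    se = math.sqrt(p0 * (1 - p0) * n2 / (n1 * (n1 + n2)))
    return (p0 - p1) / se

z_exam = z_sample_vs_pooled(75, 100, 75, 125)
print(round(z_exam, 2), z_exam < -1.96)
```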
28.29 CHI-SQUARE ($\chi^2$) DISTRIBUTION:
In statistics we often come across squared quantities like the sample variance $s^2 = \sum (X_i - \bar{X})^2 / (n - 1)$. Do these quantities have their own sampling distributions like $\bar{X}$? Under certain conditions they do. Such a distribution is called the chi-square distribution, denoted by $\chi^2$. Statistical theory shows that the square of a standard normal variable is distributed as $\chi^2$. Symbolically this is written as $Z^2 \sim \chi^2_{(1)}$. Here 1 refers to the degrees of freedom, specifying the number of independent observations.
If $Z_1, Z_2, \ldots, Z_k$ are k independent standard normal variables then
$$Z_1^2 + Z_2^2 + Z_3^2 + \cdots + Z_k^2 \sim \chi^2_{(k)}$$
Properties of chi-squares:
97
1. Unlike the standard normal, the chi-square distribution takes only non-negative values on the x-axis, because it is defined as a sum of squared quantities.
2. It is a skewed distribution; the degree of skewness depends upon the degrees of freedom. For few degrees of freedom it is highly skewed to the right. However, as the degrees of freedom increase towards infinity this distribution tends to the normal.
3. The mean value of this distribution is simply k, the degrees of freedom. The associated variance is 2k.
4. If $\chi^2_{k_1}$ and $\chi^2_{k_2}$ are two independent chi-squares with $k_1$ and $k_2$ degrees of freedom, then $\chi^2_{k_1} + \chi^2_{k_2}$ is a chi-square distribution with $k_1 + k_2$ degrees of freedom.
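These properties can be illustrated by simulation. The following is a minimal sketch using only the standard library; the seed, the choice k = 5 and the number of draws are arbitrary assumptions made for the illustration.

```python
import random

def chi_square_draw(k, rng):
    """One chi-square variate: the sum of k squared standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

rng = random.Random(28)      # fixed seed so the run is repeatable
k = 5
draws = [chi_square_draw(k, rng) for _ in range(20000)]

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(f"mean {mean:.2f} (theory {k}), variance {var:.2f} (theory {2 * k})")

# Property 4: chi-square(2) + chi-square(3) should behave like chi-square(5).
sums = [chi_square_draw(2, rng) + chi_square_draw(3, rng) for _ in range(20000)]
mean_of_sums = sum(sums) / len(sums)
print(f"mean of the sum {mean_of_sums:.2f} (theory 5)")
```

The simulated mean and variance should fall close to k and 2k, and no draw can be negative, in line with properties 1 and 3.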

Fig 28.12: Density function of chi square distribution
Non-parametric Tests
The chi-square test is probably the most frequently used test of hypothesis in the social sciences. Its popularity is mainly due to its adaptability even under relatively less restrictive assumptions. In particular, chi-square is a non-parametric test of hypothesis and requires no assumption about the exact shape of the population. Since this test is appropriate to the nominal scale of measurement, we make only the minimum possible assumptions with respect to the level of measurement. In addition to two-sample tests, the chi-square can also be used in situations with more than two samples. Depending upon the scale of measurement used, a variety of non-parametric tests is available. To deal with the classificatory type of nominal scale we often use either the binomial or the chi-square ($\chi^2$) test. The binomial is useful when the population has only two classes, like male or female, success or failure. Chi-square, however, is usable in situations involving more than two nominal groups, such as favour, undecided and non-favour. This test has several uses:
1. The test of independence
2. Test of goodness of fit
3. Test to determine whether the universe has a specified value of variance or not
1) Test of independence: In the context of chi-square, the meaning of independence is closely related to the concept that we have been using all through. Two variables are independent if, for all cases, the classification of a case into a particular category of one variable has no effect on the probability that the case will fall into any particular category of the second variable. For example, the problem of testing for a significant difference between two sample proportions, discussed in the previous section, may be approached in a different manner by using the chi-square distribution.
Example 31: The following example illustrates the method of getting the chi square value. The
following table reports the number of crimes committed in a given jurisdiction on monthly basis.
Is there any seasonal rhythm in the crime rate committed? The crime rate is the only variable
under consideration. If the crime rate does not vary by month, we could expect 1/12 of all crimes
committed in a year would be committed in each and every month.
Solution:
$H_0$: The crime rates are uniformly spread over all the twelve months.
$H_1$: The crime rates are not uniformly spread over all the twelve months.
Table 28.14
Months O E (O-E)²/E
Jan 190 181 0.447514
Feb 152 181 4.646409
Mar 121 181 19.8895
Apr 110 181 27.85083
May 147 181 6.38674
Jun 199 181 1.790055
Jul 250 181 26.30387
Aug 247 181 24.0663
Sep 201 181 2.209945
Oct 150 181 5.309392
Nov 193 181 0.79558
Dec 212 181 5.309392
Total 2172 Chi square = 125.0055
$$\chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] = 125.0055$$
This value of chi-square is greater than the table value of 19.675 at the 5% level of significance, corresponding to k - 1 = 12 - 1 = 11 degrees of freedom. So we reject the null hypothesis and conclude that the crime rate does vary with the month.
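The computation in Table 28.14 can be reproduced with a few lines of code. This is a sketch only; the helper name `chi_square_uniform` is an assumption made for the illustration, and the critical value 19.675 is the table entry quoted above.

```python
def chi_square_uniform(observed):
    """Chi-square statistic against equal expected frequencies in every class."""
    expected = sum(observed) / len(observed)
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, expected, len(observed) - 1   # statistic, E, degrees of freedom

# Monthly crime counts, January to December (Table 28.14).
crimes = [190, 152, 121, 110, 147, 199, 250, 247, 201, 150, 193, 212]
stat, expected, df = chi_square_uniform(crimes)
print(expected, df, round(stat, 4))

critical_5pct = 19.675   # table value for 11 degrees of freedom
print("reject H0" if stat > critical_5pct else "accept H0")
```

The same helper applies unchanged to any problem in which every class has the same expected frequency.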
Example 32: Three hundred apples were distributed among 10 persons. The distribution pattern
is given in the following table. Conduct a chi-square test to test the belief that the apples were
distributed equally.
Table 28.15
Persons 1 2 3 4 5 6 7 8 9 10
No. of Apples 28 29 33 31 26 35 32 30 31 25
Solution:
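$H_0$: The apples were distributed equally, so that each of the 10 persons is expected to receive 300/10 = 30 apples.
v = 10 - 1 = 9
$\chi^2_{0.05} = 16.92$
Since the calculated value of chi-square (2.8667, from Table 28.16) is less than the corresponding table entry, we accept the null hypothesis and confirm the belief that the apples were distributed equally.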
Table 28.16
O E O-E (O-E)²/E
28 30 -2 0.1333
29 30 -1 0.0333
33 30 3 0.3000
31 30 1 0.0333
26 30 -4 0.5333
35 30 5 0.8333
32 30 2 0.1333
30 30 0 0.0000
31 30 1 0.0333
25 30 -5 0.8333
Total 2.8667
The 2 × 2 contingency table and use of chi-square: The simplest form of test for independence is found when there are only two groups within each basis of classification, giving altogether four groups. Such a classification is often called a 2 × 2 contingency table. To represent this situation we use the cross-classification table shown below.
Example 33: 1000 parts of a product manufactured by two machines were tested for quality. The following table gives the summary details. It is believed that the defects are not related to the machines. Using the chi-square test, test this hypothesis at the 0.01 level of significance.
Table 28.17
Machine Number Defective Effective Total
1 25 375 400
2 42 558 600
Total 67 933 1000
Classification of output of two machines as defective and effective
Solution:
In the above table the cell frequencies indicate the number of cases belonging to each cell. It is a bivariate classification. The totals in the last row and last column are called marginals, indicating the respective univariate frequencies. For example, the last row shows the defective and effective frequencies in total, irrespective of the machine number. Similarly, the last column shows the respective univariate frequencies of machines 1 and 2, irrespective of effective or defective. Invariably, the row totals and the column totals must tally at the grand total as shown above.
The hypothesis to be tested is that the machine-wise classification is unnecessary, in the sense that both the machines turn out defective items in the same manner. If this hypothesis is true then the variation of the observed values from their expected values may be attributed to mere sampling fluctuations.
Since it is a 2 × 2 contingency table the d.f = (2-1)(2-1) = 1. Though we need 4 expected frequencies, we need to calculate only one among the four. The remaining frequencies are obtained by subtracting the obtained one from the respective column or row totals given.
Table 28.18
Machine Number   Defective   Effective   Total
1                25          375         400
2                42          558         600
Total            67          933         1000
Classification of expected defective and effective items
Machine Number   Defective   Effective   Total
1                26.80       373.20      400
2                40.20       559.80      600
Total            67.00       933.00      1000
Calculation of Chi-square value
Cell reference        O      E         (O-E)   (O-E)²/E
Machine 1 defective   25     26.80     -1.80   0.1209
Machine 1 effective   375    373.20    1.80    0.0087
Machine 2 defective   42     40.20     1.80    0.0806
Machine 2 effective   558    559.80    -1.80   0.0058
Total                 1000   1000.00   0.00    0.2160
The expected cell values are obtained as follows. Irrespective of the machine of origin there are 1000 items, out of which 67 are found to be defective and 933 are found to be effective. So the proportion of defectives in total is 0.067, irrespective of the origin of the output. If this is true for both the machines then the expected number of defectives from the first machine is simply 0.067 × 400 = (67/1000) × 400 = 26.8.
Now the expected frequency of defective ones from the second machine = 67.00 - 26.80 = 40.20
The expected frequency of effective ones from the first machine = 400 - 26.80 = 373.20
The expected frequency of effective ones from the second machine = 600 - 40.20 = 559.80
In general,
$$E = \frac{(\text{Row marginal}) \times (\text{Column marginal})}{\text{Total number of cases } (N)}$$
Calculation of Chi-square value: Once the expected values for all the four cells are obtained, the needed chi-square value is obtained by using the following formula
$$\chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right]$$
From the table
$$\chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] = 0.216$$
Formula method for the calculation of chi-square for a 2 × 2 classification
Table 28.19
a     b     a+b
c     d     c+d
a+c   b+d   n
$$\chi^2 = \frac{(ad - bc)^2 \, n}{(a+b)(c+d)(a+c)(b+d)} = \frac{[(25)(558) - (375)(42)]^2 (1000)}{(400)(600)(67)(933)} = \frac{3{,}240{,}000 \times 1000}{15{,}002{,}640{,}000} = 0.2160$$
The value of chi-square varies with the degrees of freedom. In the present problem there are four cells in the classification, but only one degree of freedom, in the sense that any one entry in one of the four cells automatically allows us to obtain the remaining cell entries by using the row and column totals given in the problem. The chi-square value corresponding to one degree of freedom from the table is 6.635 at the 0.01 level of significance. Since the calculated chi-square is comfortably less than the table entry, we accept the null hypothesis and conclude that there is no significant difference between the defective items produced on the two machines.
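The cell-by-cell method and the formula method for a 2 × 2 table can be compared directly. The sketch below uses illustrative function names that are not part of the text, with the cells labelled a, b, c, d as in Table 28.19.

```python
def chi_square_2x2_cells(a, b, c, d):
    """Sum of (O - E)^2 / E over the four cells, with E from the marginals."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - e) ** 2 / e
    return total

def chi_square_2x2_formula(a, b, c, d):
    """Direct formula: (ad - bc)^2 n / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))

# Machine data: a = 25, b = 375, c = 42, d = 558 (Table 28.17).
by_cells = chi_square_2x2_cells(25, 375, 42, 558)
by_formula = chi_square_2x2_formula(25, 375, 42, 558)
print(round(by_cells, 4), round(by_formula, 4))
```

Both routes give the same value, so the formula method is simply a convenience when only the four cell counts are at hand.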
Example 34: One hundred students appeared for an examination and their results were classified
as follows on the basis whether they received special training:
Table 28.20
Special training   Results
                   Pass   Fail   Total
Yes                36     12     48
No                 30     22     52
Total              66     34     100
Test whether the special training was useful to the students. The relevant table value for 5% level
is 3.84
Solution:
$H_0$: The special training has no effect on the students' performance.
v = (2-1)(2-1) = 1
$$\chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] = 3.332$$
Since the table value of chi-square, 3.84, is greater than the calculated value, we accept the null hypothesis and conclude that the special training has no effect on the students' performance.
Table 28.21
2 x 2 cross classification: pass and fail
Results   Special training: Yes   Special training: No   Total
Pass      36                      30                     66
Fail      12                      22                     34
Total     48                      52                     100
Calculation of expected frequencies
Results   Special training: Yes   Special training: No   Total
Pass      31.68                   34.32                  66
Fail      16.32                   17.68                  34
Total     48.00                   52.00                  100
O     E       (O-E)   (O-E)²    (O-E)²/E
36    31.68   4.32    18.6624   0.5891
30    34.32   -4.32   18.6624   0.5438
12    16.32   -4.32   18.6624   1.1435
22    17.68   4.32    18.6624   1.0556
100   100     -       -         3.3320
The 3 × 3 contingency table and use of chi-square: For a 3 × 3 contingency table the associated degrees of freedom = (3-1)(3-1) = 4. Thus, though we need nine expected frequencies, it is enough to calculate the expected frequencies for four cells. The remaining ones are obtained from the marginal frequencies by subtraction.
Example 35: The table shows a cross-classification of 500 individuals by income level and preference for ice creams of two types, A and B. Test whether the two attributes are associated at the 5% level of significance.
Solution:
$$\chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] = 51.0375$$
$\chi^2_{0.05}$ for 4 d.f. = 9.4877
Since the calculated value is much more than the table value, we reject the null hypothesis, accept the alternative, and conclude that the stated attributes are associated.
Table 28.22
Observed preference for ice cream
Income level   A       A or B   B       Total
Low            170     30       80      280
Medium         50      25       60      135
High           20      10       55      85
Total          240     65       195     500
Expected preference for ice cream
Income level   A       A or B   B       Total
Low            134.4   36.4     109.2   280
Medium         64.8    17.55    52.65   135
High           40.8    11.05    33.15   85
Total          240     65       195     500
Computation of chi-square
O     E       O-E     (O-E)²    (O-E)²/E
170   134.4   35.6    1267.36   9.4297619
50    64.8    -14.8   219.04    3.38024691
20    40.8    -20.8   432.64    10.6039216
30    36.4    -6.4    40.96     1.12527473
25    17.55   7.45    55.5025   3.16253561
10    11.05   -1.05   1.1025    0.09977376
80    109.2   -29.2   852.64    7.80805861
60    52.65   7.35    54.0225   1.02606838
55    33.15   21.85   477.4225  14.4018854
500   500     0                 51.0375268
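The expected frequencies and the chi-square value for a table of any size follow the same rule, E = (row total × column total) / grand total. A minimal sketch is given below; the function names are illustrative and not part of the text.

```python
def expected_table(observed):
    """Expected cell counts: (row total x column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    expected = expected_table(observed)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Rows: Low, Medium, High income; columns: A, A or B, B (Table 28.22).
ice_cream = [[170, 30, 80], [50, 25, 60], [20, 10, 55]]
stat, df = chi_square_independence(ice_cream)
print(round(stat, 4), df)
```

Computing all nine expected values directly, rather than four plus subtraction, gives the same table and is easier to automate.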
The r × c contingency table: The examples illustrated above are based on 2 × 2 and 3 × 3 contingency tables. Now let us extend this bivariate model to one involving r rows and c columns, totalling r × c cells altogether. The table so designed is called an r × c contingency table.
Example 36: The following example uses the 4 × 5 contingency classification of accidents by roads and days of the week in Bangalore for the year 1990. Use the chi-square test to test the belief that accidents in Bangalore are independent of roads and days.
Table 28.23
Roads Monday Tuesday Wednesday Thursday Friday Total
A 21 13 22 30 46 132
B 11 17 12 17 12 69
C 10 10 12 12 10 54
D 28 26 25 28 31 138
Total 70 66 71 87 99 393
Recorded accidents during 1996 in Bangalore
Solution:
As usual, we compute the expected cell frequencies by assuming that the accidents in
each road in Bangalore are distributed throughout the week in the same proportion as the total
accidents in the four roads. For example, 70 accidents out of the total 393 occurred on Monday.
Thus, it is assumed that this proportion of accidents occurs on all the roads on Monday. The
expected cell frequency is computed as follows
$$E = \frac{70}{393} \times 132 = 23.5115$$
This computation is repeated for all the remaining cells using the respective marginal frequencies
and reported in the following table.
Table 28.24
Roads Monday Tuesday Wednesday Thursday Friday Total
A 21 13 22 30 46 132
B 11 17 12 17 12 69
C 10 10 12 12 10 54
D 28 26 25 28 31 138
Total 70 66 71 87 99 393
Recorded accidents during 1996 in Bangalore
Table 28.25
Roads Monday Tuesday Wednesday Thursday Friday Total
A 23.5115 22.1679 23.8473 29.2214 33.2519 132
B 12.2901 11.5878 12.4656 15.2748 17.3817 69
C 9.6183 9.0687 9.7557 11.9542 13.6031 54
D 24.5802 23.1756 24.9313 30.5496 34.7634 138
Total 70 66 71 87 99 393
Expected accidents during 1996 in Bangalore
Table 28.26
Roads Days O E (O-E) (O-E)² (O-E)²/E
Road A Monday 21.0000 23.5115 -2.5115 6.3074 0.268269
Road B Monday 11.0000 12.2901 -1.2901 1.6643 0.135418
Road C Monday 10.0000 9.6183 0.3817 0.1457 0.015146
Road D Monday 28.0000 24.5802 3.4198 11.6954 0.475805
Road A Tuesday 13.0000 22.1679 -9.1679 84.0511 3.791562
Road B Tuesday 17.0000 11.5878 5.4122 29.2921 2.527839
Road C Tuesday 10.0000 9.0687 0.9313 0.8673 0.095638
Road D Tuesday 26.0000 23.1756 2.8244 7.9774 0.344215
Road A Wednesday 22.0000 23.8473 -1.8473 3.4126 0.143103
Road B Wednesday 12.0000 12.4656 -0.4656 0.2168 0.017394
Road C Wednesday 12.0000 9.7557 2.2443 5.0368 0.516289
Road D Wednesday 25.0000 24.9313 0.0687 0.0047 0.000189
Road A Thursday 30.0000 29.2214 0.7786 0.6063 0.020747
Road B Thursday 17.0000 15.2748 1.7252 2.9763 0.194849
Road C Thursday 12.0000 11.9542 0.0458 0.0021 0.000175
Road D Thursday 28.0000 30.5496 -2.5496 6.5006 0.212787
Road A Friday 46.0000 33.2519 12.7481 162.5138 4.887354
Road B Friday 12.0000 17.3817 -5.3817 28.9625 1.666264
Road C Friday 10.0000 13.6031 -3.6031 12.9820 0.954344
Road D Friday 31.0000 34.7634 -3.7634 14.1629 0.407408
Chi Square = 16.6748
Computation of Chi squares
By using Tables 28.24 and 28.25 we can compute the chi-square value in the usual manner by using the formula
$$\chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] = 16.6748$$
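However, one could derive a short cut computation formula
$$\chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] = \sum \left[ \frac{O^2 - 2OE + E^2}{E} \right] = \sum \frac{O^2}{E} - 2 \sum O + \sum E$$
Since $\sum O = \sum E = n$,
$$\chi^2 = \sum \frac{O^2}{E} - n$$
Computation of $\dfrac{O^2}{E} = \dfrac{21^2}{23.5115} = 18.7568$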
Similar calculations are made for all the remaining cells under reference and reported in the
following table
Table 28.27
Roads Monday Tuesday Wednesday Thursday Friday Total
A 18.7568 7.6236 20.2958 30.7994 63.6354 141.111
B 9.8453 24.9401 11.5517 18.9200 8.2846 73.54176
C 10.3968 11.0269 14.7606 12.0460 7.3513 55.58159
D 31.8957 29.1686 25.0689 25.6632 27.6440 139.4404
Total 70.8946 72.7593 71.6769749 87.42856 106.9154 409.6748
Computation
$$\chi^2 = \sum \frac{O^2}{E} - n = 409.6748 - 393 = 16.6748$$
Here the degrees of freedom are obtained as (r - 1)(c - 1) = (4 - 1)(5 - 1) = 12.
The value of chi-square at the 0.01 level of significance for 12 degrees of freedom is obtained from the chi-square table as 26.217. Since the calculated chi-square value is less than the corresponding table entry, we accept the null hypothesis of independence. In other words, the distribution of accidents among the roads in Bangalore is independent of the days of the week.
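The agreement between the usual formula and the short cut formula can be verified on the accident data. The sketch below computes both in one pass over the cells; the variable names are illustrative.

```python
# Rows: roads A to D; columns: Monday to Friday (Table 28.23).
accidents = [
    [21, 13, 22, 30, 46],
    [11, 17, 12, 17, 12],
    [10, 10, 12, 12, 10],
    [28, 26, 25, 28, 31],
]
row_totals = [sum(row) for row in accidents]
col_totals = [sum(col) for col in zip(*accidents)]
n = sum(row_totals)

long_form = 0.0        # sum of (O - E)^2 / E
sum_o2_over_e = 0.0    # sum of O^2 / E
for i, row in enumerate(accidents):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        long_form += (o - e) ** 2 / e
        sum_o2_over_e += o * o / e

short_form = sum_o2_over_e - n
print(round(long_form, 4), round(short_form, 4))
```

The short cut avoids computing the individual differences O - E, which is convenient for hand calculation; in code the two forms cost about the same.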
Summary of the test:
$H_0$: The accidents on different roads in Bangalore are independent of the days of the week
$H_1$: The accidents on different roads in Bangalore are related to the days of the week
Critical region: Reject $H_0$ and accept $H_1$ if the chi-square value is > 26.217
Steps to be followed for the calculation of chi-square:
1. Calculate the expected values.
2. Find the difference between the observed and the respective expected values.
3. Express the square of each difference so obtained as a fraction of the corresponding expected value.
4. Obtain the total.
5. Compare this calculated value with the table value and decide about the test result.
If E stands for the expected value and O stands for the observed value then the needed chi-square is defined as
$$\chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right]$$
The distribution of this term computed from successive samples constitutes the chi-square distribution.
For the two-machine problem of Example 33 the test may be summarized as follows
$H_0$: The defective and effective items are independent of the machine
$H_1$: The defective and effective items are related to the machine
Critical region: Reject $H_0$ and accept $H_1$ if the chi-square value is > 6.635
2. TEST OF GOODNESS OF FIT OF A NORMAL DISTRIBUTION
Till now we have dealt with the chi-square test for independence involving two variables, each of which has two or more categories. Another situation in which the chi-square test is useful, called goodness of fit, is one in which the distribution of scores of a single variable must be tested. The logic underlying the test is exactly the same as that of the independence test that we have used so far. If the observed and the expected frequencies are close, then we say that there is a good fit and conclude that the two distributions are not significantly different. The major difference in this new application lies in the method of assigning the expected frequencies, which are calculated on the basis of the null hypothesis.
Test of normality of the population by using chi-square test: Fitting a normal distribution
In almost all tests we have invariably assumed that the given population is normal, without any sound basis for such an assumption. In this section we explore the usefulness of chi-square in testing the normality of the given data. The hypothesis to be tested is that the universe from which the sample was taken is normal and hence the deviation of the sample distribution from normality is only random. However, since we are left with only the sample information about the population, we test the normality of the given sample first. Once this is done, our inductive logic extends the conclusion of normality to the population as well.
Example 37: For the data given below fit a normal curve.
Table 28.28
Average earnings per week   Mid value (X)   Frequency f   d'   fd'   fd'²
70 and below 80             75              1             -4   -4    16
80 and below 90             85              5             -3   -15   45
90 and below 100            95              24            -2   -48   96
100 and below 110           105             33            -1   -33   33
110 and below 120           115             40            0    0     0
120 and below 130           125             23            1    23    23
130 and below 140           135             4             2    8     16
140 and below 150           145             8             3    24    72
150 and below 160           155             11            4    44    176
160 and below 170           165             3             5    15    75
Total                                       152                14    552
As a first step we calculate the mean and the standard deviation of the sample in the routine
method. Once this is done we next trace the normal distribution having the mean, standard
deviation and frequencies as that of the very sample. This is more complex than the calculation of
expected frequencies in the previous section. However, in such calculations the expected
marginal frequencies and the observed marginal frequencies are equal both in column-wise and
row-wise. Such an observation is helpful in arriving at the needed expected frequencies in this
case.
$$\bar{X} = A + \frac{\sum fd'}{n} \times c = 115 + \frac{14}{152} \times 10 = 115.92$$
$$s = c \sqrt{\frac{\sum fd'^2}{n} - \left( \frac{\sum fd'}{n} \right)^2} = 10 \sqrt{\frac{552}{152} - \left( \frac{14}{152} \right)^2} = 10 \sqrt{3.6231} = 19.03$$
$$\hat{\sigma} = s \sqrt{\frac{n}{n - 1}} = 19.03 \sqrt{\frac{152}{151}} = 19.09$$
Thus, the estimated mean of the universe = 115.92
The estimated standard deviation of the universe $\hat{\sigma} = 19.09$
Table 28.29
Real lower class limit (X)   X - X̄    Z = (X - X̄)/σ̂   Proportion of area between Yo and Z   Expected frequency between Yo and Z
70                           -45.92   -2.41            0.49202                               74.79
80                           -35.92   -1.88            0.46995                               71.43
90                           -25.92   -1.36            0.41308                               62.79
100                          -15.92   -0.83            0.29673                               45.10
110                          -5.92    -0.31            0.12172                               18.50
120                          4.08     0.21             0.08317                               12.64
130                          14.08    0.74             0.27035                               41.09
140                          24.08    1.26             0.39617                               60.22
150                          34.08    1.79             0.46327                               70.42
160                          44.08    2.31             0.48956                               74.41
170                          54.08    2.83             0.49767                               75.65
Now our task is to compute a normal distribution with mean 115.92, standard deviation 19.09 and total frequency of 152. In such a distribution the total frequency represents the total area under the curve. The frequency of each class in the expected distribution will be in proportion to the area under the standard normal curve obtainable from the table. Table 28.29 illustrates the methodology.
In the first column the real lower limits of the respective intervals are reported. Since the mean and the standard deviation of the universe are not known, their estimates $\bar{X}$ and $\hat{\sigma}$ are used in the calculation of z. These values are reported in the third column. Corresponding to these values the proportion of area between the maximum ordinate $Y_o$ and z is obtained from the standard normal table and reported in the fourth column. The expected frequency between $Y_o$ and z, reported in the fifth column, is obtained simply by multiplying column 4 by the total frequency of 152.
Fig 28.13: Expected and original frequency distributions
Now from the cumulative frequencies between the maximum ordinate and the lower class limits we can calculate the respective expected class frequencies. These expected frequencies are reported in the third column of the following table. The first class corresponds to earnings of less than 70. The observed frequency of this class is nil in our problem. However, since a continuous theoretical distribution extends to minus infinity on one side and to plus infinity on the other, we must attribute a certain frequency to this first class interval under 70. According to the table the frequency between the maximum ordinate and 70 is 74.79. Since one half of the total of 152 falls below the maximum ordinate, the frequency below 70 is 76 - 74.79 = 1.21. The expected frequency between 70 and under 80 is obtained by subtracting the cumulative frequency at 80 from the cumulative frequency at 70.
This is equal to 74.79 - 71.43 = 3.36. All other class frequencies are obtained in the same way, except for the class in which the maximum ordinate falls. The frequency between the maximum ordinate and 110 is 18.50. The frequency between the maximum ordinate and 120 is 12.64. Thus the sum of these two frequencies (31.14) gives us the total for the said interval, namely 110-120. The frequency for the last open-end class above 170 is obtained as 76 - 75.65 = 0.35.
Computation of chi-square:
The number of classes in a frequency distribution is optional. But in chi-square, classes with small frequencies should not be used. If the frequency is less than 5 in any cell, two or more adjacent classes must be combined to arrive at a comfortable total.
Table 28.30
Average earnings per week   Observed frequency O   Expected frequency E   O²     O²/E
Under 90                    6                      13.21                  36     2.7252
90 and below 100            24                     17.69                  576    32.5684
100 and below 110           33                     26.60                  1089   40.9372
110 and below 120           40                     31.14                  1600   51.3766
120 and below 130           23                     28.45                  529    18.5927
130 and below 140           4                      19.12                  16     0.8366
140 and below 150           8                      10.20                  64     6.2743
150 and above               14                     5.58                   196    35.1254
Total                       152                    152.00                        188.4365
$$\chi^2 = \sum \frac{O^2}{E} - n = 188.4365 - 152 = 36.4365$$
At the 0.01 level of significance, for 5 degrees of freedom (8 classes, less one, less two more for the estimated mean and standard deviation), the table value of chi-square is 15.086. Since the calculated value is more than this value, we reject the null hypothesis and conclude that the given distribution is not normal.
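The fitting procedure can be sketched in code. Instead of reading areas from the standard normal table, the sketch below evaluates the normal cumulative distribution with `math.erf`, so its expected frequencies differ slightly from Tables 28.29 and 28.30, where z is rounded to two decimals. The pooling of classes follows Table 28.30; the variable names are illustrative.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal variable at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

mean, sd, n = 115.92, 19.09, 152
limits = [70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170]

# Cumulative proportions at each class limit, with open tails at both ends.
cdf = [0.0] + [normal_cdf(x, mean, sd) for x in limits] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]   # 12 classes

# Pool the small classes as in Table 28.30: under 90, six classes, 150 and above.
pooled_expected = [sum(expected[:3])] + expected[3:9] + [sum(expected[9:])]
observed = [6, 24, 33, 40, 23, 4, 8, 14]

chi_square = sum(o * o / e for o, e in zip(observed, pooled_expected)) - n
df = len(observed) - 1 - 2    # two parameters estimated from the sample
print([round(e, 2) for e in pooled_expected], round(chi_square, 2), df)
```

The statistic obtained this way is close to, but not identical with, the 36.4365 of the text; either value is far above the table value of 15.086.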
3. CHI-SQUARE TEST FOR POPULATION VARIANCE
Let X be a normal variable with mean $\mu$ and variance $\sigma^2$; that is, $X \sim N(\mu, \sigma^2)$.
Now we know that $Z = \dfrac{X - \mu}{\sigma}$ is a standard normal variable with mean 0 and variance unity. The chi-square variable in its simplest form is defined as the square of the standard normal variable. Symbolically this may be written as
$$\chi^2_{(1)} = Z^2$$
where the subscript (1) refers to the degrees of freedom. Just as the mean and variance are the parameters of the normal distribution, the d.f is the parameter of the chi-square distribution. Here by the degrees of freedom we mean the number of independent observations in the defined sum of squares. In the above definition the d.f is 1 because we are considering only the square of one standard normal variable.
Now let $Z_1, Z_2, \ldots, Z_k$ be k independent standard normal variables. Now redefine the chi-square as the sum of the squares of the said k independent standard normal variables.
$$\chi^2_{(k)} = Z_1^2 + Z_2^2 + Z_3^2 + \cdots + Z_k^2$$
The $\chi^2$ distribution is used to test the significance of the population variance through confidence intervals. In other words, we can use the $\chi^2$ distribution to judge whether the sample has been drawn from a normal population with mean $\mu$ and standard deviation $\sigma$ or not. For this purpose $\chi^2$ is defined as
$$\chi^2 = \frac{N \sigma_x^2}{\sigma^2} = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{\sigma^2}$$
with n-1 degrees of freedom, where $\sigma_x^2$ is the variance of the sample. By comparing the calculated value with the table value at a certain level of significance we can accept or reject our null hypothesis in the usual manner. As with the t statistic, computing the chi-square test statistic requires the number of sample observations. In addition we have to use certain population parameters as well. In the absence of such parameters we must estimate them from the sample.
Example 38: A sample of 10 students is drawn randomly from a certain population. The sum of the squared deviations from its mean is 50. Test the hypothesis that the variance of the population is 5 at the 5% level of significance.
Solution:
From the data n = 10
$$\sum (X - \bar{X})^2 = 50, \qquad \sigma_x^2 = \frac{50}{10} = 5$$
Now
$$H_0: \sigma_x^2 = \sigma^2$$
$$H_1: \sigma_x^2 \neq \sigma^2$$
$$\chi^2 = \frac{N \sigma_x^2}{\sigma^2} = \frac{10 \times 5}{5} = 10$$
Now the table value of the chi-square distribution with 10-1 = 9 degrees of freedom at the 5% level is 16.92. Since the calculated value, namely 10, is less than the table entry, we accept the null hypothesis and conclude that the variance of the population is 5.
Example 39: The marks scored by ten students in Economics are as follows.
Table 28.31
S.N 1 2 3 4 5 6 7 8 9 10
M 38 40 45 53 47 43 55 48 52 49
Can we infer that the variance of the distribution of marks of all the students, from which the sample of 10 has been taken, is equal to 20? Test this at both the 5% and 1% levels of significance.
Solution:
To perform the test first we must calculate the standard deviation from the above given data.
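Table 28.32
S.N    X     X - X̄   (X - X̄)²
1      38    -9      81
2      40    -7      49
3      45    -2      4
4      53    6       36
5      47    0       0
6      43    -4      16
7      55    8       64
8      48    1       1
9      52    5       25
10     49    2       4
N=10   ΣX = 470      Σ(X - X̄)² = 280
$$\bar{X} = \frac{\sum X}{n} = \frac{470}{10} = 47$$
$$\sigma_x^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{280}{10} = 28, \qquad \sigma_x = 5.3$$
Now let the null and alternative hypotheses be
$$H_0: \sigma_x^2 = \sigma_p^2$$
$$H_1: \sigma_x^2 \neq \sigma_p^2$$
$$\chi^2 = \frac{n \sigma_x^2}{\sigma_p^2} = \frac{10 \times 28}{20} = 14$$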
The degrees of freedom are 10 - 1 = 9.
Now from the table $\chi^2 = 16.92$ and 21.67 at the 5% and 1% levels of significance respectively, for 9 degrees of freedom. Since both these values are greater than the calculated value, we accept the null hypothesis and conclude that the variance of the population can be taken as 20 at both levels of significance.
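Both variance examples reduce to dividing the sum of squared deviations by the hypothesised population variance. A minimal sketch follows; the function name `variance_chi_square` is illustrative and not part of the text.

```python
def variance_chi_square(sample, hypothesised_variance):
    """Return (chi-square, d.f.) for H0: population variance = hypothesised value."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in sample)
    # n * (sample variance) equals the sum of squared deviations.
    return sum_sq_dev / hypothesised_variance, n - 1

marks = [38, 40, 45, 53, 47, 43, 55, 48, 52, 49]    # Table 28.31
stat, df = variance_chi_square(marks, 20)
print(stat, df)

# Example 38 supplies the sum of squared deviations (50) directly.
example_38 = 50 / 5
print(example_38)
```

The statistic is then compared with the table values for n - 1 degrees of freedom, 16.92 and 21.67 in the case of 9 degrees of freedom.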
Limitations of chi-square test
The chi-square test has two potential difficulties. The first normally occurs with small samples. When the sample size is small, we can no longer assume that the sampling distribution of all possible sample statistics is accurately described by the chi-square distribution. In the case of the chi-square test, a small sample is defined as one where a high percentage of cells have expected frequencies of less than 5.
Yates correction:
In the case of a 2 × 2 contingency table the obtained chi-square can be adjusted by using the Yates formula given below.
$$\chi^2_c = \sum \left[ \frac{(|O - E| - 0.5)^2}{E} \right]$$
where $\chi^2_c$ = the corrected chi-square.
However, for tables larger than 2 × 2 there is no correction formula for computing a corrected chi-square. Under such circumstances the cell frequencies are normally increased, by combining categories, to meet the requirement of being greater than 5. Obviously, this course of action should be taken only when it is sensible to do so.
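The effect of the correction can be seen by computing chi-square for a 2 × 2 table with and without it. The sketch below applies the correction exactly as in the formula above, subtracting 0.5 from each absolute difference; the function name and the `yates` flag are illustrative assumptions. The machine data of Example 33 is used only to show the mechanics, since its expected frequencies are all well above 5.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2 x 2 table, optionally with the Yates correction."""
    n = a + b + c + d
    cells = [
        (a, (a + b) * (a + c) / n),
        (b, (a + b) * (b + d) / n),
        (c, (c + d) * (a + c) / n),
        (d, (c + d) * (b + d) / n),
    ]
    total = 0.0
    for o, e in cells:
        diff = abs(o - e)
        if yates:
            diff -= 0.5      # continuity correction from the formula above
        total += diff ** 2 / e
    return total

plain = chi_square_2x2(25, 375, 42, 558)
corrected = chi_square_2x2(25, 375, 42, 558, yates=True)
print(round(plain, 4), round(corrected, 4))
```

The corrected value is smaller than the uncorrected one, which makes the test more conservative.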
The second limitation is related to large samples. In the earlier discussion we repeatedly stressed the role of sample size in rejecting or accepting the null hypothesis. The probability of rejecting the null hypothesis increases with sample size, regardless of the size of the difference and the selected alpha level of significance. The chi-square too is sensitive to sample size. In fact, chi-square is more sensitive to sample size than other sampling distributions. If the sample size is doubled while the cell proportions remain the same, then the chi-square value also gets exactly doubled.
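This sensitivity can be demonstrated with the special training data of Example 34. The sketch below doubles every cell, which keeps the proportions unchanged, and recomputes chi-square; the function name is illustrative.

```python
def chi_square(table):
    """Chi-square for an r x c table, with expected values from the marginals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total

exam = [[36, 12], [30, 22]]                        # Table 28.20
doubled = [[2 * x for x in row] for row in exam]   # same proportions, twice the size

original_value = chi_square(exam)
doubled_value = chi_square(doubled)
print(round(original_value, 3), round(doubled_value, 3))
```

With 100 students the value falls below the 5% table value of 3.84, while the same pattern observed in 200 students exceeds it, although the strength of the relationship is unchanged.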
Thus, as long as we are confined to a sample, we must always remember that our results could have been produced by mere random chance. In other words, tests of hypothesis, like any other statistical technique, are limited in the range of questions they can answer. These tests tell us whether our results are statistically significant or not. They do not necessarily tell us whether the results are important in any other sense.
28.30 ANALYSIS OF VARIANCE (ANOVA)
Like chi-square, ANOVA is a very flexible and widely used test in the social sciences. It is designed to be used with interval-ratio scales. In the previous section we tested the difference between two sample means, under the null hypothesis of no difference between them, by using both Z and t tests. The null hypothesis implies that both the samples were drawn from the same universe and hence there is no significant difference between the two means. In the absence of the standard deviation of the population, the two sample variances were used to estimate the population standard deviation while using the t test. However, the two samples may yield quite different estimates, though the population is one and the same, due to sampling variation. Under such circumstances the difference in variances must be tested for significance. "The analysis of variance is essentially a procedure for testing the difference between different groups of data for homogeneity." It is perhaps easiest to think of ANOVA as an extension of the t test that we used in the previous section for testing the difference between two sample means. However, the t test is limited to the two-sample case, whereas ANOVA allows us to test significance in a much broader range of situations involving more than two samples.
Suppose we are interested in the support of different religious groups for capital punishment, such as the death sentence. Suppose that we decide to conduct an investigation by devising a scale that measures support for capital punishment at the interval-ratio level. Let the question be studied by using a random sample from three major religions in India, namely Hindus, Muslims and Christians.
One option is to test for the significance of the differences by collapsing the scale of support for capital punishment into a few categories like pro-capital punishment, neutral and anti-capital punishment, and to apply a chi-square test in the usual manner as we did in the previous section. However, this approach has a severe disadvantage: collapsing an interval-ratio scale into a few categories leads to a potential loss of valuable information.

The second option could be to run a series of t tests by taking two groups at a time for all possible $^3C_2 = 3$ combinations. One obvious difficulty is that such an approach involves a lot of time and energy.
For ANOVA the null hypothesis is that the populations from which the samples are drawn are equal on the characteristic of interest. Pertaining to our problem, it could be that people of the various religions do not vary in their support for capital punishment. So symbolically it could be
$$H_0: \mu_1 = \mu_2 = \mu_3$$
The field data on this issue may show some differences between these means due to sampling. Here we test the significance of this no-difference hypothesis on the basis of the sample information. If variation exists, it can be either between the religious groups or within these groups or both. Thus, the technique of analysis of variance consists in splitting the variance for analytical purposes into its various components. Normally the variance is split into two parts:
1. Variance between samples
2. Variance within samples
In order to use analysis of variance with small samples, we must assume that all the samples are from normal populations. For larger sample sizes the normality assumption may be relaxed. The method of getting the components of variance is illustrated step by step below.
(a) Let N be the total number of observations in the random sample drawn from the population. Divide this sample into k groups of interest (for example, religion-wise). Obtain the mean of each of the k samples, i.e. $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots, \bar{X}_k$.
(b) Obtain the grand mean of all the k sample means, where $n_1, n_2, \ldots, n_k$ are the sample sizes, by using the formula
$$\bar{\bar{X}} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + n_3 \bar{X}_3 + \cdots + n_k \bar{X}_k}{N}$$
(c) Calculate the sum of the squares of deviations between samples
$$SSB = n_1 (\bar{X}_1 - \bar{\bar{X}})^2 + n_2 (\bar{X}_2 - \bar{\bar{X}})^2 + \cdots + n_k (\bar{X}_k - \bar{\bar{X}})^2$$
(d) Divide the above result by the degrees of freedom to obtain the variance between samples. The degrees of freedom between the samples are equal to the number of samples minus one: $d.f_B = k - 1$.
(e) Obtain the sum of squares of the deviations within the samples.
$$SSW = \sum (X_{i1} - \bar{X}_1)^2 + \sum (X_{i2} - \bar{X}_2)^2 + \cdots + \sum (X_{ik} - \bar{X}_k)^2$$
(f) Divide the above result by the degrees of freedom within samples to obtain the corresponding variance: $d.f_w = N - k$, where N = total number of items in all the samples and k = number of samples.
(g) The degrees of freedom for the total variance will be equal to the number of items in all the samples minus one: $d.f_T = N - 1$.
Table 28.33
Analysis of variance table
Source of variation   Sum of squares                                          Degrees of freedom   Variance
Between samples       SSB = $n_1(\bar{X}_1 - \bar{\bar{X}})^2 + \cdots$          (k-1)                SSB/(k-1)
Within samples        SSW = $\sum_i (X_{i1} - \bar{X}_1)^2 + \cdots$             (N-k)                SSW/(N-k)
Total                 SST = $\sum (X_i - \bar{\bar{X}})^2$                       (N-1)                SST/(N-1)
To test the null hypothesis, the ratio of the larger variance to the smaller variance is obtained first. If
we were to take many more sets of samples and compute many more such ratios, they would all form a sampling
distribution called the F distribution.
28.31 F DISTRIBUTION AND ANALYSIS OF VARIANCE (ANOVA)
Let $X_1, X_2, \ldots, X_m$ be a random sample of m observations from a normal population
with mean $\mu_x$ and variance $\sigma_x^2$. Similarly, let $Y_1, Y_2, \ldots, Y_n$ be another sample of n
observations from a normal population with mean $\mu_y$ and variance $\sigma_y^2$. Suppose we want to
test whether $\sigma_x^2 = \sigma_y^2$. Since the population variances are unobservable, we use their estimates
and define F as the ratio of the larger variance to the smaller variance:
$F = \dfrac{s_x^2}{s_y^2} = \dfrac{\sum (X_i - \bar{X})^2 / (m-1)}{\sum (Y_i - \bar{Y})^2 / (n-1)}$
If both the variances are equal, F must be equal to one. If they are different,
F must be greater than 1, because the larger variance is placed in the numerator. Thus this ratio is never
less than unity. The F coefficient is
often used to test whether the difference between two variances is significant or just due to
fluctuations of sampling. If the calculated value is equal to or less than the corresponding table
entry, then we accept the null hypothesis and conclude that the difference between the variances is
insignificant. On the other hand, if it is greater than the table entry, then we reject the null
hypothesis, accept the alternative and state that the difference between these two variances is
significant.
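The variance-ratio test described above can be sketched in a few lines of Python. This is only a minimal illustration and not part of the original text: the function names `sample_variance` and `variance_ratio` and the two small samples are assumptions made for the example, and the critical value still has to be read from an F table for the degrees of freedom that the function returns.

```python
def sample_variance(values):
    """Unbiased sample variance: sum of squared deviations divided by (n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def variance_ratio(x, y):
    """Return (F, df_numerator, df_denominator), larger variance in the numerator."""
    var_x, var_y = sample_variance(x), sample_variance(y)
    if var_x >= var_y:
        return var_x / var_y, len(x) - 1, len(y) - 1
    return var_y / var_x, len(y) - 1, len(x) - 1


# Hypothetical samples, chosen only for the demonstration.
sample_x = [8, 12, 13, 17]
sample_y = [12, 20, 25, 27]

f_value, df_num, df_den = variance_ratio(sample_x, sample_y)
print(round(f_value, 3), df_num, df_den)
```

Because the larger estimate always goes on top, the returned degrees of freedom are ordered the same way, which is the order needed when looking up the table entry.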
Properties of the F distribution:
1. Like chi-square, this distribution is right skewed and ranges from 0 to infinity.
2. For large values of $k_1$ and $k_2$ this distribution tends to a normal distribution.
3. The square of a t variable with k degrees of freedom follows the F distribution with 1 and k degrees
of freedom.
Note: When $n_1$ exceeds 2, the F distribution starts at zero, rises to its peak at the value
$\dfrac{n_2 (n_1 - 2)}{n_1 (n_2 + 2)}$ and
falls to zero again as F increases without limit. The mean of this distribution is
$\dfrac{n_2}{n_2 - 2}$ (when $n_2$ exceeds 2). The
shape of this distribution depends upon the values of $n_1$ and $n_2$. For larger values of $n_1$ and $n_2$
the distribution becomes more nearly symmetrical. If $n_2$ increases to infinity while $n_1$ remains
small, $n_1 F$ tends to the $\chi^2$ distribution with $n_1$ degrees of freedom. If $n_1 = 1$, F is the square of
a Student t variable with $n_2$ degrees of freedom, so that $F = t^2$.
In other words the normal, $\chi^2$ and t distributions are
special or limiting cases of this more general F distribution.
Example 40: The following table gives scores, measured on an interval-ratio scale, of support for
capital punishment among four categories of people in the society. Set up an analysis of variance table
and test whether religion is significant in determining support for capital punishment.
Table 28.34
Hindus   Muslims   Christians   No affiliation
8        12        12           15
12       20        13           16
13       25        18           23
17       27        21           28
Solution:
1. Direct Method
Table 28.35
Sample   Measurement   Sample   (Sample mean - X)^2      n(Sample mean - Grand mean)^2   (X - Grand mean)^2
item                   mean
H1       8                      (12.5-8)^2 = 20.25                                       (8-17.5)^2 = 90.25
H2       12            12.5     (12.5-12)^2 = 0.25       4(12.5-17.5)^2 = 100            (12-17.5)^2 = 30.25
H3       13                     (12.5-13)^2 = 0.25                                       (13-17.5)^2 = 20.25
H4       17                     (12.5-17)^2 = 20.25                                      (17-17.5)^2 = 0.25
M1       12                     (21-12)^2 = 81                                           (12-17.5)^2 = 30.25
M2       20            21       (21-20)^2 = 1            4(21-17.5)^2 = 49               (20-17.5)^2 = 6.25
M3       25                     (21-25)^2 = 16                                           (25-17.5)^2 = 56.25
M4       27                     (21-27)^2 = 36                                           (27-17.5)^2 = 90.25
C1       12                     (16-12)^2 = 16                                           (12-17.5)^2 = 30.25
C2       13            16       (16-13)^2 = 9            4(16-17.5)^2 = 9                (13-17.5)^2 = 20.25
C3       18                     (16-18)^2 = 4                                            (18-17.5)^2 = 0.25
C4       21                     (16-21)^2 = 25                                           (21-17.5)^2 = 12.25
N1       15                     (20.5-15)^2 = 30.25                                      (15-17.5)^2 = 6.25
N2       16            20.5     (20.5-16)^2 = 20.25      4(20.5-17.5)^2 = 36             (16-17.5)^2 = 2.25
N3       23                     (20.5-23)^2 = 6.25                                       (23-17.5)^2 = 30.25
N4       28                     (20.5-28)^2 = 56.25                                      (28-17.5)^2 = 110.25
Total    280                    SSW = 342                SSB = 194                       TSS = 536
Grand Mean = 280/16 = 17.5, N = 16, k = 4
Note: In the above table the TSS obtained in the last column is simply the total of SSW and SSB
(342 + 194 = 536). Thus the last-column calculations can safely be skipped in practical problems.
Table 28.36
Analysis of Variance
Source of variation   Sum of squares   Degrees of freedom   Variance
SSB                   194              k-1 = 3              SSB/(k-1) = 64.66667
SSW                   342              N-k = 12             SSW/(N-k) = 28.5
TSS                   536              N-1 = 15             F = 64.66667/28.5 = 2.269006
For the 5% level of significance with 3 (between) and 12 (within) degrees of freedom, the critical F
from the F table is 3.49. Since the calculated F (2.269) is less than the table value, we cannot reject the null
hypothesis of no difference.
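The direct method can also be checked with a short script. The following is a minimal sketch, assuming plain Python lists for the four samples of Table 28.34; the function name `one_way_anova` is illustrative and is not taken from the text.

```python
def one_way_anova(groups):
    """Direct method: return (SSB, SSW, F) for a list of samples."""
    k = len(groups)                                   # number of samples
    n_total = sum(len(g) for g in groups)             # N, all items together
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between samples: n_j * (sample mean - grand mean)^2, summed over samples.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within samples: (item - own sample mean)^2, summed over all items.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    f_ratio = (ssb / (k - 1)) / (ssw / (n_total - k))
    return ssb, ssw, f_ratio


# Data of Example 40: Hindus, Muslims, Christians, no affiliation.
samples = [[8, 12, 13, 17], [12, 20, 25, 27], [12, 13, 18, 21], [15, 16, 23, 28]]
ssb, ssw, f_ratio = one_way_anova(samples)
print(ssb, ssw, round(f_ratio, 3))
```

For these data the script gives SSB = 194, SSW = 342 and F of about 2.269, with 3 and 12 degrees of freedom. The function does not require equal sample sizes, so it applies to unbalanced designs as well.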
2) Shortcut Method: The direct method shown above is somewhat tedious
and time consuming. We therefore often follow a shortcut method to calculate the F value, using
the following steps.
1. Find the sum of all the N items across the categories. Let this total be T.
2. Calculate the correction term, given by $T^2/N$.
3. Find the squares of all the items of all the samples and add them together.
4. Find SST by subtracting the correction term from the total of all the squared values in step 3.
5. To find SSB, square the total of each sample and divide by its respective number of items.
Add them all.
6. Subtract the correction term from this total to get the needed SSB.
7. Find SSW as SST - SSB.
8. After obtaining all the needed values, set up the analysis of variance table for the F test.
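The shortcut steps above translate almost line for line into code. The sketch below is an illustration only: the function name `anova_shortcut` is an assumption, and the data are those of Example 40 (Table 28.34).

```python
def anova_shortcut(groups):
    """Shortcut method: return (correction term, SST, SSB, SSW)."""
    n_total = sum(len(g) for g in groups)
    total = sum(sum(g) for g in groups)                 # T, the sum of all items
    correction = total ** 2 / n_total                   # T^2 / N
    # SST: sum of squares of all items minus the correction term.
    sst = sum(x * x for g in groups for x in g) - correction
    # SSB: (sample total)^2 / (sample size), summed, minus the correction term.
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ssw = sst - ssb                                     # within = total - between
    return correction, sst, ssb, ssw


samples = [[8, 12, 13, 17], [12, 20, 25, 27], [12, 13, 18, 21], [15, 16, 23, 28]]
correction, sst, ssb, ssw = anova_shortcut(samples)
print(correction, sst, ssb, ssw)
```

For these data the correction term is 4900, SST = 536, SSB = 194 and SSW = 342, the same sums of squares as the direct method gives.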
Shortcut Method:
Table 28.37
Hindus            Muslims           Christians        None
$X_1$   $X_1^2$   $X_2$   $X_2^2$   $X_3$   $X_3^2$   $X_4$   $X_4^2$
8       64        12      144       12      144       15      225
12      144       20      400       13      169       16      256
13      169       25      625       18      324       23      529
17      289       27      729       21      441       28      784
50      666       84      1898      64      1078      82      1794     (column totals)
Sum of all items from all the samples: T = 50 + 84 + 64 + 82 = 280
Correction term = $T^2/N = 280^2/16 = 4900$
Sum of squares of all the items = 666 + 1898 + 1078 + 1794 = 5436
SST = Sum of squares of all items - Correction term = 5436 - 4900 = 536
SSB = $50^2/4 + 84^2/4 + 64^2/4 + 82^2/4$ - correction term = 5094 - 4900 = 194
SSW = SST - SSB = 536 - 194 = 342
Table 28.38
Analysis of variance
Source of variation   Sum of squares   Degrees of freedom   Mean squares
SSB                   194              k-1 = 3              SSB/(k-1) = 64.666667
SSW                   342              N-k = 12             SSW/(N-k) = 28.500000
SST                   536              N-1 = 15             F = 2.269006
Coding: In some problems, if the given data series are large, the computations by the above
methods can be laborious. The data may then be coded by any one, or a combination,
of the following operations. The F ratio remains unaltered by such operations: adding or subtracting
a constant leaves the variances unchanged, while multiplying or dividing by a constant rescales the
variances between and within samples by the same factor.
1. Multiply the entire given data series by any given non-zero number.
2. Divide the entire given data series by any given non-zero number.
3. Add any number to the entire data series.
4. Subtract any number from the entire data series.
Solution:
To make the entries smaller, we subtract 8 from all the entries of Example 40 and obtain the following new table
for our further calculation in the usual manner.
Table 28.39
Hindus   Muslims   Christians   Others
0        4         4            7
4        12        5            8
5        17        10           15
9        19        13           20
Table 28.40
Sample   Measurement   Sample   (X - Sample mean)^2      n(Sample mean - Grand mean)^2   (X - Grand mean)^2
item                   mean
H1       0                      (0-4.5)^2 = 20.25                                        (0-9.5)^2 = 90.25
H2       4             4.5      (4-4.5)^2 = 0.25         4(4.5-9.5)^2 = 100              (4-9.5)^2 = 30.25
H3       5                      (5-4.5)^2 = 0.25                                         (5-9.5)^2 = 20.25
H4       9                      (9-4.5)^2 = 20.25                                        (9-9.5)^2 = 0.25
M1       4                      (4-13)^2 = 81                                            (4-9.5)^2 = 30.25
M2       12            13       (12-13)^2 = 1            4(13-9.5)^2 = 49                (12-9.5)^2 = 6.25
M3       17                     (17-13)^2 = 16                                           (17-9.5)^2 = 56.25
M4       19                     (19-13)^2 = 36                                           (19-9.5)^2 = 90.25
C1       4                      (4-8)^2 = 16                                             (4-9.5)^2 = 30.25
C2       5             8        (5-8)^2 = 9              4(8-9.5)^2 = 9                  (5-9.5)^2 = 20.25
C3       10                     (10-8)^2 = 4                                             (10-9.5)^2 = 0.25
C4       13                     (13-8)^2 = 25                                            (13-9.5)^2 = 12.25
N1       7                      (7-12.5)^2 = 30.25                                       (7-9.5)^2 = 6.25
N2       8             12.5     (8-12.5)^2 = 20.25       4(12.5-9.5)^2 = 36              (8-9.5)^2 = 2.25
N3       15                     (15-12.5)^2 = 6.25                                       (15-9.5)^2 = 30.25
N4       20                     (20-12.5)^2 = 56.25                                      (20-9.5)^2 = 110.25
Total    152                    SSW = 342                SSB = 194                       SST = 536
Grand Mean = 152/16 = 9.5, N = 16, k = 4
Table 28.41
Analysis of variance Table
Source of variation   Sum of squares   Degrees of freedom   Variance
SSB                   194              k-1 = 3              SSB/(k-1) = 64.66667
SSW                   342              N-k = 12             SSW/(N-k) = 28.5
SST                   536              N-1 = 15             F = 2.269006
Since the F value is unchanged, we draw the same conclusion as from the original data.
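The effect of coding on F can be verified numerically. The sketch below is illustrative only: the helper name `f_ratio` and the scaling factor of 10 are assumptions, while the shift of 8 and the data are those used in the example above.

```python
def f_ratio(groups):
    """One-way ANOVA F ratio: (SSB / (k - 1)) / (SSW / (N - k))."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))


original = [[8, 12, 13, 17], [12, 20, 25, 27], [12, 13, 18, 21], [15, 16, 23, 28]]
shifted = [[x - 8 for x in g] for g in original]     # coding by subtraction
scaled = [[x / 10 for x in g] for g in original]     # coding by division (hypothetical factor)

f_original = f_ratio(original)
f_shifted = f_ratio(shifted)
f_scaled = f_ratio(scaled)
print(round(f_original, 3), round(f_shifted, 3), round(f_scaled, 3))
```

Subtraction leaves SSB and SSW themselves unchanged, whereas division by 10 shrinks both by a factor of 100, so in either case their ratio is the same.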
Example 41: A new brand of telephone was introduced at some targeted sales points in four
metros. The following table gives the sales details.
Table 28.42
Metro   Sales in thousands per month
P       14   16   17   16
Q       12   11   13   9
R       11   9    11   11
S       16   18   20   15
Using ANOVA, test the significance of the difference between the sales of the product in the four
metros.
Solution: Direct Method
Table 28.43
Calculation of variance between and within samples
Sample items   Measurement   Sample means   SSW       SSB
P1             14                           3.0625
P2             16            15.75          0.0625    17.015625
P3             17                           1.5625
P4             16                           0.0625
Q1             12                           0.5625
Q2             11            11.25          0.0625    23.765625
Q3             13                           3.0625
Q4             9                            5.0625
R1             11                           0.25
R2             9             10.5           2.25      40.640625
R3             11                           0.25
R4             11                           0.25
S1             16                           1.5625
S2             18            17.25          0.5625    50.765625
S3             20                           7.5625
S4             15                           5.0625
Total          219                          31.25     132.1875
Grand Mean = 219/16 = 13.6875, N = 16
Analysis of variance
Source of variation   Sum of squares   D. of freedom   Mean squares
SSB                   132.1875         3               44.0625
SSW                   31.25            12              2.6042
TSS                   163.4375         15
F = 44.0625/2.6042 = 16.9200
For $v_1 = 3$ and $v_2 = 12$ the table value of F is 3.4903.

Since the calculated value is more than the table value we reject the null hypothesis and confirm
that the sales in four Metros are different.
Shortcut Method
Table 28.44
P     $P^2$   Q     $Q^2$   R     $R^2$   S     $S^2$
14    196     12    144     11    121     16    256
16    256     11    121     9     81      18    324
17    289     13    169     11    121     20    400
16    256     9     81      11    121     15    225
63    997     45    515     42    444     69    1205    (column totals)
N = 16, k = 4
Sum of all the samples: T = 219
Correction term: $T^2/N$ = 2997.56
Sum of squares of all the items (997+515+444+1205) = 3161
SST = Sum of squares of all items - $T^2/N$ = 163.438
SSB = $(63)^2/4 + (45)^2/4 + (42)^2/4 + (69)^2/4 - T^2/N$ = 132.188
SSW = SST - SSB = 31.25
Analysis of variance
Source of variation   Sum of squares   Degrees of freedom   Variance
SSB                   132.1875         k-1 = 3              SSB/(k-1) = 44.0625
SSW                   31.25            N-k = 12             SSW/(N-k) = 2.60417
SST                   163.4375         N-1 = 15             F = 16.92
Example 42: A new brand of soap was introduced at some targeted sales points in four metros.
The following table gives the sales details.
Table 28.45
Cities   Sales in thousands per month
A        24   26   29   25
B        22   21   25   14
C        21   19   23   20
D        26   28   31   28
Using ANOVA, test the significance of the difference between the sales of the soap in the four metros.
Table 28.46
Calculation of variance between and within samples
Sample items   Measurement   Sample means   SSW       SSB
A1             24                           4
A2             26            26             0         18.0625
A3             29                           9
A4             25                           1
B1             22                           2.25
B2             21            20.5           0.25      45.5625
B3             25                           20.25
B4             14                           42.25
C1             21                           0.0625
C2             19            20.75          3.0625    39.0625
C3             23                           5.0625
C4             20                           0.5625
D1             26                           5.0625
D2             28            28.25          0.0625    76.5625
D3             31                           7.5625
D4             28                           0.0625
Total          382                          100.5     179.25
Grand Mean = 382/16 = 23.875, N = 16
Source of variation   Sum of squares   D. of freedom   Mean squares
SSB                   179.25           3               59.7500
SSW                   100.5            12              8.3750
TSS                   279.75           15
F = 59.7500/8.3750 = 7.1343
For $v_1 = 3$ and $v_2 = 12$ the table value of F is 3.4903.

Since the calculated value is more than the table value we reject the null hypothesis and confirm
that the sales in four Metros are different.
Example 43: The following table gives the output levels of three varieties of corn
cultivated in four different plots. Set up an analysis of variance table and find out whether the
variety difference is significant in determining the corn yield.
Table 28.47
Variety Plot 1 Plot 2 Plot 3 Plot 4
A 6 7 3 8
B 5 5 3 7
C 6 4 3 4
Solution:
Now by definition
$F = \dfrac{\text{Greater variance}}{\text{Smaller variance}} = \dfrac{4}{2.67} = 1.5$
The degrees of freedom for the greater variance are $df_1$ (or $v_1$) = 2 and the degrees of freedom for the
smaller variance are $df_2$ (or $v_2$) = 9. Corresponding to these values, at the 5% level of significance the
F table entry is 4.26. Since this table value is greater than the calculated value, we accept the null
hypothesis and conclude that the difference in variance between and within varieties is
insignificant.
Table 28.48
Sample     Measurement   Sample   (X - Sample mean)^2   n(Sample mean - Grand mean)^2   (X - Grand mean)^2
                         means
Sample 1   6                      (6-6)^2 = 0                                           (6-5)^2 = 1
           7             6        (7-6)^2 = 1           4(6-5)^2 = 4                    (7-5)^2 = 4
           3                      (3-6)^2 = 9                                           (3-5)^2 = 4
           8                      (8-6)^2 = 4                                           (8-5)^2 = 9
Sample 2   5                      (5-5)^2 = 0                                           (5-5)^2 = 0
           5             5        (5-5)^2 = 0           4(5-5)^2 = 0                    (5-5)^2 = 0
           3                      (3-5)^2 = 4                                           (3-5)^2 = 4
           7                      (7-5)^2 = 4                                           (7-5)^2 = 4
Sample 3   5                      (5-4)^2 = 1                                           (5-5)^2 = 0
           4             4        (4-4)^2 = 0           4(4-5)^2 = 4                    (4-5)^2 = 1
           3                      (3-4)^2 = 1                                           (3-5)^2 = 4
           4                      (4-4)^2 = 0                                           (4-5)^2 = 1
Total      60                     SSW = 24              SSB = 8                         SST = 32
Grand Mean = 60/12 = 5, N = 12, k = 3
Table 28.49
Source of variation   Sum of squares   Degrees of freedom   Variance
SSB                   8                k-1 = 2              SSB/(k-1) = 4
SSW                   24               N-k = 9              SSW/(N-k) = 2.666667
SST                   32               N-1 = 11             F = 1.5
Example 44: The following figures relate to the production in kg of three varieties A, B and C of
wheat sown in 12 plots. Is there any significant difference in the production of the three varieties?
Given: for $v_1 = 2$ and $v_2 = 9$, F = 4.26.
Table 28.50
Variety   Plot 1   Plot 2   Plot 3   Plot 4   Plot 5
A         14       16       18       -        -
B         14       13       15       22       -
C         18       16       19       19       20
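Table 28.51
        A ($X_1$)   B ($X_2$)   C ($X_3$)
        14          14          18
        16          13          16
        18          15          19
        -           22          19
        -           -           20
Total   48          64          92
Mean    16          16          18.4      ($\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$)
$\bar{\bar{X}} = \dfrac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3} = \dfrac{3 \times 16 + 4 \times 16 + 5 \times 18.4}{3 + 4 + 5} = 17$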
Table 28.52
Sum of squares between samples
        A: $(\bar{X}_1 - \bar{\bar{X}})^2$   B: $(\bar{X}_2 - \bar{\bar{X}})^2$   C: $(\bar{X}_3 - \bar{\bar{X}})^2$
        1                                    1                                    1.96
        1                                    1                                    1.96
        1                                    1                                    1.96
        -                                    1                                    1.96
        -                                    -                                    1.96
Total   3                                    4                                    9.8
SSB = 3 + 4 + 9.8 = 16.8
Table 28.53
Sum of squares within samples
        A: $(X_1 - \bar{X}_1)^2$   B: $(X_2 - \bar{X}_2)^2$   C: $(X_3 - \bar{X}_3)^2$
        4                          4                          0.16
        0                          9                          5.76
        4                          1                          0.36
        -                          36                         0.36
        -                          -                          2.56
Total   8                          50                         9.2
SSW = 8 + 50 + 9.2 = 67.2
Table 28.54
Analysis of Variance
Source of variation   Sum of squares   Degrees of freedom   Mean square   F-Ratio
SSB                   16.80            2                    8.40          1.125
SSW                   67.20            9                    7.47
Total                 84.00            11
Given $v_1 = 2$ and $v_2 = 9$, the table value is F = 4.26. Since the calculated F (1.125) is less than the table
value, we accept the null hypothesis and conclude that there is no significant difference between the
varieties of wheat.
Analysis of variance in a two-way classification:
A two-way ANOVA is used when we classify the data on the basis of two
factors. For example, in agriculture the yield may be classified on the basis of different varieties
of seeds and also on the basis of different plots. A business firm may have its sales data classified on
the basis of different salesmen and also on the basis of sales in different regions. In such cases
one variable is studied along the columns, as we did earlier, and the second variable is studied
along the rows.
1. ANOVA for a two-way design without repeated values
As we do not have repeated values, we cannot directly compute the sum of squares within
samples as we did in the case of one-way ANOVA. In such cases we calculate it as the
residual or error variation, by subtracting the sum of squares between columns and the sum of
squares between rows from the total sum of squares, each calculated in the usual manner. The variances are calculated for rows
and columns separately and compared with the residual variance. The analysis of variance table for this
type of problem is shown hereunder.
Table 28.55
Source of variation   Sum of squares   Degrees of freedom   Mean square              F ratio
Between columns       SSB              c-1                  MSB = SSB/(c-1)          MSB/MSE
Between rows          SSW              r-1                  MSW = SSW/(r-1)          MSW/MSE
Residual              SSE              (c-1)(r-1)           MSE = SSE/[(c-1)(r-1)]
Total                 SST              N-1
Example 45: The performance of three detergents at three different water temperatures is
examined for its effectiveness in cleaning. The cleaning score, read on a specially designed meter, is
given hereunder. Perform a two-way analysis of variance using the 5% level of significance.
Table 28.56
Water temperature   Detergent A   Detergent B   Detergent C
Cold                57            55            67
Warm                49            52            68
Hot                 54            46            58
Solution:
Null Hypothesis:
$H_0$: (i) There is no difference in cleaning due to the variety of detergent
(ii) There is no difference in cleaning due to water temperature
To make the calculations simple, let us code the data by subtracting 50 from all the entries.
Table 28.57
Coded Data
Water temperature   Detergent A   Detergent B   Detergent C   Row total
Cold                7             5             17            29
Warm                -1            2             18            19
Hot                 4             -4            8             8
Column total        10            3             43            56
N = 9, c = 3, r = 3
Now we use the shortcut method to solve this problem.
Table 28.58
Squares of Data
Water temperature   Detergent A   Detergent B   Detergent C   Row total
Cold                49            25            289           363
Warm                1             4             324           329
Hot                 16            16            64            96
Total                                                         788
Correction term = $T^2/N = 56^2/9 = 348.444$
Sum of squares between detergents ($SSB_d$) = $10^2/3 + 3^2/3 + 43^2/3 - T^2/N = 304.22$
Sum of squares between water temperatures ($SSB_w$) = $29^2/3 + 19^2/3 + 8^2/3 - T^2/N = 73.56$
Total sum of squares (SST) = $788 - T^2/N = 439.56$
Residual = SST - $SSB_d$ - $SSB_w$ = 439.56 - 304.22 - 73.56 = 61.78
Table 28.59
Analysis of Variance
Source of variation   Sum of squares   Degrees of freedom   Mean square   F-Ratio
SSB Detergent         304.22           2                    152.11        9.849
SSB Water temp.       73.56            2                    36.78         2.381
Residual              61.78            4                    15.44
Total                 439.56           8
The critical value of $F_{2,4}$ from the table = 6.94. The calculated value for the different detergents is
9.849, which is greater than the table value. So the difference between detergents is significant.
But for water temperature the calculated value 2.381 is less than the table value. So the difference
is not significant.
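The two-way calculation without repeated values can be sketched as follows. This is a minimal illustration using the coded data of Table 28.57 (rows are water temperatures, columns are detergents); the function name `two_way_anova` and the dictionary keys are assumptions made for the example.

```python
def two_way_anova(table):
    """Two-way ANOVA without repeated values, shortcut method.

    `table` is a list of rows; every row has one value per column.
    """
    r, c = len(table), len(table[0])
    n = r * c
    total = sum(sum(row) for row in table)                 # T
    correction = total ** 2 / n                            # T^2 / N
    sst = sum(x * x for row in table for x in row) - correction
    ss_rows = sum(sum(row) ** 2 for row in table) / c - correction
    ss_cols = sum(sum(col) ** 2 for col in zip(*table)) / r - correction
    sse = sst - ss_rows - ss_cols                          # residual
    mse = sse / ((r - 1) * (c - 1))
    return {
        "SS_columns": ss_cols,
        "SS_rows": ss_rows,
        "SS_residual": sse,
        "F_columns": (ss_cols / (c - 1)) / mse,
        "F_rows": (ss_rows / (r - 1)) / mse,
    }


coded = [[7, 5, 17],      # cold
         [-1, 2, 18],     # warm
         [4, -4, 8]]      # hot
result = two_way_anova(coded)
print({name: round(value, 3) for name, value in result.items()})
```

For these data the column (detergent) F ratio is about 9.849 and the row (temperature) F ratio about 2.381, each to be compared with the table value for 2 and 4 degrees of freedom.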
Example 46: The following table gives the number of units of a product produced by three
machines by three different workers. Test (i) whether the mean productivity is the same for all the
three machine types. Test (ii) whether the mean productivity is the same for all the three workers.
Solution:
Table 28.60
Workers        Machine X   Machine Y   Machine Z
A              8           32          20
B              28          36          38
C              6           28          14
Coded Data (subtract 30 from all data entries)
Workers        Machine X   Machine Y   Machine Z   Row total
A              -22         2           -10         -30
B              -2          6           8           12
C              -24         -2          -16         -42
Column total   -48         6           -18         -60
N = 9, c = 3, r = 3
Squares of Data
Workers        Machine X   Machine Y   Machine Z   Row total
A              484         4           100         588
B              4           36          64          104
C              576         4           256         836
Column total   1064        44          420         1528
T = -60
Correction factor = $T^2/N$ = 3600/9 = 400.00
SSB Machines = $(-48)^2/3 + (6)^2/3 + (-18)^2/3 - 400$ = 488.00, with $v_1$ = (c-1) = (3-1) = 2
SSB Workers = $(-30)^2/3 + (12)^2/3 + (-42)^2/3 - 400$ = 536.00, with $v_2$ = (r-1) = (3-1) = 2
SST = 1064 + 44 + 420 - 400 = 1128.00
Residual sum of squares = 1128 - 488 - 536 = 104.00
d.f. for remainder = (c-1)(r-1) = (3-1)(3-1) = 2 x 2 = 4
Analysis of Variance
Source of variation   Sum of squares   Degrees of freedom   Mean square   F-Ratio
SSB Machines          488.00           2                    244.00        9.385
SSB Workers           536.00           2                    268.00        10.308
Residual              104.00           4                    26.00
Total                 1128.00          8
The critical value of $F_{2,4}$ from the table = 6.94. Since the calculated value 9.385 is greater than the
table value, we conclude that the mean productivity is not the same for all the three machines.
Since the calculated value 10.308 is greater than the table value, we also conclude that the mean
productivity is not the same for all the three workers.
Example 47: The following table gives the output levels of three varieties of corn
cultivated in three different plots. Set up an analysis of variance table and find out whether the
variety difference is significant in productivity. Also find out whether the plot difference is
significant in productivity.
Table 28.61
Variety Plot 1 Plot 2 Plot 3
A 6 7 3
B 5 5 3
C 6 4 3
Table 28.62
Coded Data (subtract 3 from all data values)
Variety        Plot 1   Plot 2   Plot 3   Row total
A              3        4        0        7
B              2        2        0        4
C              3        1        0        4
Column total   8        7        0        15
N = 9, c = 3, r = 3
Squares of Data
Variety        Plot 1   Plot 2   Plot 3   Row total
A              9        16       0        25
B              4        4        0        8
C              9        1        0        10
Column total   22       21       0        43
T = 15
Correction factor = $T^2/N$ = 225/9 = 25.00
SSB plot = $(8^2/3 + 7^2/3 + 0^2/3) - 25$ = 12.67, with $v_1$ = (c-1) = (3-1) = 2
SSB variety = $(7^2/3 + 4^2/3 + 4^2/3) - 25$ = 2.00, with $v_2$ = (r-1) = (3-1) = 2
SST = (22 + 21 + 0) - 25 = 18.00
Residual sum of squares = 18 - 2 - 12.67 = 3.33
d.f. for remainder = (c-1)(r-1) = (3-1)(3-1) = 2 x 2 = 4
Analysis of Variance
Source of variation   Sum of squares   Degrees of freedom   Mean square   F-Ratio
SSB plot              12.67            2                    6.33          7.600
SSB variety           2.00             2                    1.00          1.200
Residual              3.33             4                    0.83
Total                 18.00            8
The critical value of $F_{2,4}$ from the table = 6.94. Since the calculated value 7.600 is greater than the
table value, we conclude that the mean productivity is not the same for all the three plots. Since
the calculated value 1.200 is less than the table value, we also conclude that the mean productivity
is the same for all the three varieties.
Example 48: The following table gives the number of units of a product produced by five
machines by four workers. Test (i) whether the mean productivity is the same for all the five
machine types. Test (ii) whether the mean productivity is the same for all the four workers.
Table 28.63
          Machine types
Workers   A     B     C     D     E     Total
1         4     5     3     7     6     25
2         6     8     6     5     4     29
3         7     6     7     8     8     36
4         3     5     4     8     2     22
Total     20    24    20    28    20    112
Squares of Data
Workers   A     B     C     D     E     Total
1         16    25    9     49    36    135
2         36    64    36    25    16    177
3         49    36    49    64    64    262
4         9     25    16    64    4     118
Total     110   150   110   202   120   692
N = 20, c = 5, r = 4
T = 112
Correction factor = $T^2/N$ = 12544/20 = 627.20
SSB Machines = $(20^2/4 + 24^2/4 + 20^2/4 + 28^2/4 + 20^2/4) - 627.2$ = 12.80, with $v_1$ = (c-1) = (5-1) = 4
SSB Workers = $(25^2/5 + 29^2/5 + 36^2/5 + 22^2/5) - 627.2$ = 22.00, with $v_2$ = (r-1) = (4-1) = 3
SST = (110 + 150 + 110 + 202 + 120) - 627.2 = 64.80
Residual sum of squares = 64.80 - 12.80 - 22.00 = 30.00
d.f. for remainder = (c-1)(r-1) = (5-1)(4-1) = 12
Analysis of Variance
Source of variation   Sum of squares   Degrees of freedom   Mean square   F-Ratio
SSB Machines          12.80            4                    3.20          1.280
SSB Workers           22.00            3                    7.33          2.933
Residual              30.00            12                   2.50
Total                 64.80            19
The critical value of $F_{4,12}$ from the table = 3.26. Since the calculated value 1.280 is less than the
table value, we conclude that the mean productivity is the same for all the five machines. Since
the calculated value 2.933 is less than the table value $F_{3,12}$ = 3.49, we also conclude that the mean
productivity is the same for all the four workers.
Example 49: The following table gives the number of units of a product produced by four
operators in three machines. Test whether the mean productivity is the same for all the four
operators.
Table 28.64
           Operators
Machines   A       B     C       D
1          174     173   171.5   173.5
2          173     172   171     171
3          173.5   173   173     172.5

Solution:
Table 28.65
Coded Data (subtract 172 from all data entries)
           Operators
Machines   A      B     C      D      Total
1          2      1     -0.5   1.5    4
2          1      0     -1     -1     -1
3          1.5    1     1      0.5    4
Total      4.5    2     -0.5   1      7
N = 12, c = 4, r = 3
Squared coded data
           Operators
Machines   A      B     C      D      Total
1          4      1     0.25   2.25   7.5
2          1      0     1      1      3
3          2.25   1     1      0.25   4.5
Total      7.25   2     2.25   3.5    15
T = 7
Correction term = $T^2/N$ = 49/12 = 4.083333
SSB = $(4.5^2/3 + 2^2/3 + (-0.5)^2/3 + 1^2/3) - 4.083333$ = 4.416667
SST = (7.25 + 2 + 2.25 + 3.5) - 4.083333 = 10.91667
SSW = 10.91667 - 4.416667 = 6.5
Source of variation   Sum of squares   d.f   Mean sum of squares   F-ratio
SSB                   4.416667         3     1.472222              1.811966
SSW                   6.5              8     0.8125
Total                 10.91667         11
The critical value of $F_{3,8}$ from the table = 4.07. Since the calculated value 1.811966 is less than
the table value, we conclude that the mean productivity is the same for all the four operators.
Example 50: The following table gives the number of units of a product produced by four
operators in three machines. Test (i) whether the mean productivity is the same for all the three
machine types. Test (ii) whether the mean productivity is the same for all the four operators.
Table 28.66
           Operators
Machines   A       B     C       D
1          174     173   171.5   173.5
2          173     172   171     171
3          173.5   173   173     172.5
Solution:
Table 28.67
Coded Data (subtract 172 from all data entries)
           Operators
Machines   A      B     C      D      Total
1          2      1     -0.5   1.5    4
2          1      0     -1     -1     -1
3          1.5    1     1      0.5    4
Total      4.5    2     -0.5   1      7
N = 12, c = 4, r = 3
Squared coded data
           Operators
Machines   A      B     C      D      Total
1          4      1     0.25   2.25   7.5
2          1      0     1      1      3
3          2.25   1     1      0.25   4.5
Total      7.25   2     2.25   3.5    15
T = 7
Correction term = $T^2/N = 7^2/12$ = 4.083333
SSB Operators = $(4.5^2/3 + 2^2/3 + (-0.5)^2/3 + 1^2/3) - 4.083333$ = 4.416667
SSB Machines = $(4^2/4 + (-1)^2/4 + 4^2/4) - 4.083333$ = 4.166667
SST = 15 - 4.083333 = 10.91667
Residual = 10.91667 - 4.166667 - 4.416667 = 2.333333
Source of variation   Sum of squares   d.f   Mean sum of squares   F-ratio
SSB Operators         4.416667         3     1.472222              3.785714
SSB Machines          4.166667         2     2.083333              5.357143
Residual              2.333333         6     0.388889
Total                 10.91667         11
The critical value of $F_{3,6}$ from the table = 4.76. Since the calculated value 3.785714 is less
than the table value, we conclude that the mean productivity is the same for all the four operators.
Since the calculated value 5.357143 is greater than the table value $F_{2,6}$ = 5.14, we also
conclude that the mean productivity is not the same for all the three machines.
2. ANOVA for a two-way design with repeated values
In the case of a two-way design with repeated measurements for all categories, we can obtain a
separate independent measure of inherent or smallest variation. For this measure we can
calculate the sum of squares and the degrees of freedom in the same way as we worked out
the sum of squares for variance within samples in the one-way case. Total SS, SS between
columns and SS between rows can also be worked out as stated above. The left-over
sum of squares and the left-over degrees of freedom are then attributed to what is known as
interaction variation. After making all such calculations we set up the analysis table in the usual
manner for the test.
Example 51: Set up an ANOVA table for the following information relating to three different
drugs, tested to judge their effectiveness in reducing blood pressure for three different groups of
people.
Table 28.68
                  Drugs
Group of people   X     Y     Z
A                 14    10    11
                  15    9     11
B                 12    7     10
                  11    8     11
C                 10    11    8
                  11    11    7
a) Do the drugs act differently?
b) Are the different groups of people affected differently?
c) Is the interaction term significant?
Answer the above questions at the 5% level of significance.
Solution:
T = 187, n = 18, so the correction term is obtained as $\dfrac{187 \times 187}{18} = 1942.72$
TSS = $[(14)^2 + (15)^2 + (12)^2 + (11)^2 + (10)^2 + (11)^2 + (10)^2 + (9)^2 + (7)^2 + (8)^2 + (11)^2 + (11)^2 + (11)^2 + (11)^2 + (10)^2 + (11)^2 + (8)^2 + (7)^2] - \dfrac{(187)^2}{18}$
= 2019 - 1942.72 = 76.28
SS between columns (drugs) =
$\left[\dfrac{73^2}{6} + \dfrac{56^2}{6} + \dfrac{58^2}{6}\right] - \dfrac{187^2}{18}$ = 888.16 + 522.66 + 560.67 - 1942.72 = 28.77
SS between rows (people) =
$\left[\dfrac{70^2}{6} + \dfrac{59^2}{6} + \dfrac{58^2}{6}\right] - \dfrac{187^2}{18}$ = 816.67 + 580.16 + 560.67 - 1942.72 = 14.78
SS within samples =
$(14-14.5)^2 + (15-14.5)^2 + (10-9.5)^2 + (9-9.5)^2 + (11-11)^2 + (11-11)^2$
$+ (12-11.5)^2 + (11-11.5)^2 + (7-7.5)^2 + (8-7.5)^2 + (10-10.5)^2 + (11-10.5)^2$
$+ (10-10.5)^2 + (11-10.5)^2 + (11-11)^2 + (11-11)^2 + (8-7.5)^2 + (7-7.5)^2$ = 3.50
SS for interaction variation = 76.28 - {28.77 + 14.78 + 3.50} = 29.23
Table 28.69
Analysis Table
Source of variation       SS      df        MS       F-Ratio   5% F value
Between columns (drugs)   28.77   3-1=2     14.385   36.98     F(2,9)=4.26
Between rows (people)     14.78   3-1=2     7.39     19.00     F(2,9)=4.26
Interaction               29.23   4         7.3075   18.79     F(4,9)=3.63
Within samples (error)    3.5     9         0.389
Total                     76.28   18-1=17
The above table shows that the calculated F value is greater than the table value in all cases. Thus
the drugs act differently, the different groups of people are affected differently and the interaction
term is significant.
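The sums of squares for a two-way design with repeated values can be reproduced with a short script. The sketch below is an illustration under stated assumptions: the function name `two_way_anova_replicated` and the nested-list layout (`cells[row][column]` holding the repeated readings) are choices made for the example, and the data are the eighteen readings of Example 51 with groups of people as rows and drugs as columns.

```python
def two_way_anova_replicated(cells):
    """Two-way ANOVA with an equal number of repeated values in every cell."""
    r, c = len(cells), len(cells[0])
    m = len(cells[0][0])                         # readings per cell (assumed equal)
    n = r * c * m
    values = [x for row in cells for cell in row for x in cell]
    correction = sum(values) ** 2 / n            # T^2 / n
    sst = sum(x * x for x in values) - correction

    row_totals = [sum(sum(cell) for cell in row) for row in cells]
    col_totals = [sum(sum(cells[i][j]) for i in range(r)) for j in range(c)]
    ss_rows = sum(t * t for t in row_totals) / (c * m) - correction
    ss_cols = sum(t * t for t in col_totals) / (r * m) - correction

    # Within samples: deviations of each reading from its own cell mean.
    ss_within = sum((x - sum(cell) / m) ** 2
                    for row in cells for cell in row for x in cell)
    ss_inter = sst - ss_rows - ss_cols - ss_within   # left-over sum of squares

    ms_within = ss_within / (n - r * c)
    return {
        "SS_total": sst, "SS_columns": ss_cols, "SS_rows": ss_rows,
        "SS_within": ss_within, "SS_interaction": ss_inter,
        "F_columns": (ss_cols / (c - 1)) / ms_within,
        "F_rows": (ss_rows / (r - 1)) / ms_within,
        "F_interaction": (ss_inter / ((r - 1) * (c - 1))) / ms_within,
    }


readings = [
    [[14, 15], [10, 9], [11, 11]],   # group A under drugs X, Y, Z
    [[12, 11], [7, 8], [10, 11]],    # group B
    [[10, 11], [11, 11], [8, 7]],    # group C
]
result = two_way_anova_replicated(readings)
print({name: round(value, 2) for name, value in result.items()})
```

The F ratios computed this way (about 37.0, 19.0 and 18.79) differ slightly in the second decimal from hand calculations that round the sums of squares and the error mean square before dividing.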
Exercises
1. What do you understand by a sampling distribution?
2. Explain the meaning and significance of the term standard error in sampling
analysis.
3. How does the size of the sample affect the standard error?
4. Distinguish between sampling and non-sampling errors.
5. Explain the role of small and large samples in hypothesis testing.
6. What are the salient features of a point estimator?
7. What is the Student t distribution? In what way is it different from the normal distribution?
8. What is the chi-square distribution? Explain its importance in testing hypotheses.
9. Explain the meaning and significance of standard error in the theory of sampling.
10. Write short notes on
(a) Type one and type two errors.
(b) Sampling distribution.
(c) One-tailed and two-tailed tests.
(d) Sampling and non-sampling errors.
11. Define the F distribution and assess its role in hypothesis testing.
12. A random sample of 5 students from a class was taken. The marks scored by them in the
subject economics are 70, 60, 50, 80, 70. Do these sample observations confirm that the
class average is 72?
13. The following are the sales figures of a certain product for two salesmen A and B.
Salesman A claims that his sales performance is better than that of B. Test his claim
using the t statistic, given the t value from the table as 2.26 for 9 degrees of freedom at the 5%
level of significance.
                  A   B
Number of Sales   7   7
Average Sale      4   6
SD                7   8
14. A sample of 10 students is drawn randomly from a certain population and their weights in
kg are given below. Test the hypothesis that the variance of the population is 7 at the 5% level
of significance.
S.N   1    2    3    4    5    6    7    8    9    10
W     45   65   64   58   60   62   56   67   65   61
15. A sample of 10 students is drawn randomly from a certain population. The sum of the
squared deviations from its mean is 50. Test the hypothesis that the variance of the
population is 5 at the 5% level of significance.
16. The following table gives the output levels of three varieties of wheat cultivated in
four different plots. Set up a two-way analysis table and find out whether the variety
difference and the plot difference are significant in determining the yield.
Wheat Output
Variety   Plot 1   Plot 2   Plot 3   Plot 4
A         7        7        6        6
B         4        6        4        5
C         7        8        7        4
17. The following table gives the yield of milk in liters for three varieties of cows when
they are fed with four types of feed. Set up a one-way analysis table and find out whether
the variety difference is significant or not in determining the milk yield.
Milk Yield
Variety of Cows   Feed 1   Feed 2   Feed 3   Feed 4
A                 5        5        6        5
B                 4        8        5        5
C                 6        6        7        6
18. The following table gives the level of sales by three salesmen in four states. Set up a variance
analysis table and find out whether the salesman difference is significant or not.
Sales
Salesman   State 1   State 2   State 3   State 4
A          5         5         6         5
B          4         8         5         5
C          6         6         7         6
Bivariate Analysis
Type of measurement   Difference between two independent groups    Difference among three or more independent groups
Interval and Ratio    Independent groups t-test or Z-test          One-way ANOVA
Ordinal               Mann-Whitney U-Test, Wilcoxon test           Kruskal-Wallis Test
Nominal               Z-test (two proportions), Chi-square Test    Chi-square test
Two variable chi-square test for goodness of fit
The 2 2 contingency table and use of chi-square: The simplest form of test for independence
is found when there are only two groups within each basis of classification, giving altogether four
groups. Such a classification is often called a 2 2 contingency table. To represent this situation
we use cross-classification table shown below.
Example 33: 1000 parts of a product manufactured by two machines were tested for quality. The
following table gives the summary details. It is believed that the defects are not related to the
machines. Using the chi-square test, test this hypothesis at the 0.01 level of significance.
Table 28.17
Machine Number   Defective   Effective   Total
1                25          375         400
2                42          558         600
Total            67          933         1000
Solution:
[Figure: Observed distribution of effective and defective items for Machine 1 and Machine 2. Is the difference between the two observed distributions statistically significant?]
[Figure: Machine-independent distribution of effective and defective items for Machine 1 and Machine 2.]
In Table 28.17 the cell frequencies indicate the number of cases belonging to each cell. It
is a bivariate classification. The totals in the last row and last column are called marginals
and give the respective univariate frequencies. For example, the last row shows the total defective
and effective frequencies without regard to the machine number. Similarly, the last
column shows the respective univariate frequencies of machines 1 and 2 irrespective of whether the
items are effective or defective. Invariably, the row totals and the column totals must tally at the
grand total, as shown above.
The hypothesis to be tested is that the machine-wise classification is unnecessary, in
the sense that both machines turn out defective items in the same proportion. If this hypothesis is
true, then the variation of the observed values from their expected values may be attributed to
mere sampling fluctuations.
Since it is a 2 × 2 contingency table, the d.f. = (2-1)(2-1) = 1. Though we need 4 expected
frequencies, we need to calculate only one of the four. The remaining frequencies are
obtained by subtracting the calculated one from the respective column or row totals given.
Table 28.18
Observed frequencies (from Table 28.17)
Machine   Defective   Effective   Total
1         25          375         400
2         42          558         600
Total     67          933         1000

Classification of expected defective and effective items
Machine   Defective   Effective   Total
1         26.80       373.20      400
2         40.20       559.80      600
Total     67.00       933.00      1000

Cell reference         O      E         (O-E)    (O-E)^2/E
Machine 1 defective    25     26.80     -1.80    0.1209
Machine 1 effective    375    373.20    1.80     0.0087
Machine 2 defective    42     40.20     1.80     0.0806
Machine 2 effective    558    559.80    -1.80    0.0058
Total                  1000   1000.00   0.00     0.2160
The expected cell values are shown in Table 28.18. Irrespective of the machine of origin
there are 1000 items, of which 67 are found to be defective and 933 are found to
be effective. So the proportion of defectives in the total is 0.067, irrespective of the origin of the output.
If this proportion holds for both machines, then the expected number of defectives from the first machine is simply
0.067 × 400 = (67/1000) × 400 = 26.8.
The expected frequency of defective ones from the second machine = 67.00 - 26.80 = 40.20
The expected frequency of effective ones from the first machine = 400 - 26.80 = 373.20
The expected frequency of effective ones from the second machine = 600 - 40.20 = 559.80
\[ E = \frac{(\text{Row marginal}) \times (\text{Column marginal})}{\text{Total number of cases } (N)} \]
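The marginal-product rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `expected_frequencies` and the nested-list layout are assumptions made for the example, not part of the text. The observed counts are those of Table 28.17.

```python
# Observed counts from Table 28.17.
# Rows are machines 1 and 2; columns are (defective, effective).
observed = [[25, 375],
            [42, 558]]

def expected_frequencies(table):
    """Return E = (row marginal * column marginal) / N for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)  # grand total N
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for machine, row in enumerate(expected, start=1):
    print("Machine", machine, [round(e, 2) for e in row])
```

Each row of the result should agree with the expected frequencies worked out by subtraction in the text (26.80 and 373.20 for machine 1, 40.20 and 559.80 for machine 2).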
Calculation of Chi-square value: Once the expected values for all four cells are obtained, the
needed chi-square value is obtained by using the following formula
\[ \chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] \]
From the table
\[ \chi^2 = \sum \left[ \frac{(O - E)^2}{E} \right] = 0.216 \]
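As a check on the arithmetic, the sum of (O - E)^2 / E over the four cells can be computed directly. The following is a minimal sketch; the name `chi_square` and the flat list ordering (machine 1 defective, machine 1 effective, machine 2 defective, machine 2 effective) are assumptions chosen for the example.

```python
# Observed and expected frequencies for the four cells, in the order:
# machine 1 defective, machine 1 effective, machine 2 defective, machine 2 effective.
observed = [25, 375, 42, 558]
expected = [26.80, 373.20, 40.20, 559.80]

def chi_square(obs, exp):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

chi2 = chi_square(observed, expected)
print(round(chi2, 4))  # approximately 0.216
```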
Formula method for the calculation of chi-square for a 2 × 2 classification
Table 28.19
a      b      a+b
c      d      c+d
a+c    b+d    n
\[ \chi^2 = \frac{n(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} = \frac{1000\,[(25)(558) - (375)(42)]^2}{(400)(600)(67)(933)} = \frac{(3{,}240{,}000)(1000)}{15{,}002{,}640{,}000} = 0.2160. \]
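The shortcut formula lends itself to a small helper. The sketch below assumes the cell labelling of Table 28.19 (a, b in the first row and c, d in the second); the function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table via n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Example 33: machine 1 has 25 defective and 375 effective items,
# machine 2 has 42 defective and 558 effective items.
chi2_shortcut = chi_square_2x2(25, 375, 42, 558)
print(round(chi2_shortcut, 4))  # approximately 0.216
```

Up to rounding, the result should match the value obtained from the cell-by-cell calculation, since the two formulas are algebraically equivalent for a 2 × 2 table.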
The value of chi-square varies with the degrees of freedom. In the present problem there
are four cells in the classification, but only one degree of freedom, in the sense that an entry
in any one of the four cells automatically gives the remaining cell entries through
the row and column totals given in the problem. The chi-square value corresponding to one
degree of freedom from the table is 6.635 at the 0.01 level of significance. Since the calculated
chi-square of 0.2160 is comfortably less than the table entry, we accept the null hypothesis and conclude that
there is no significant difference between the proportions of defective items produced on the two machines.
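The decision rule above can also be written out in code. The sketch below compares the calculated value with the tabulated critical value and, as an addition not found in the text, computes the upper-tail probability for one degree of freedom using the identity P(X > x) = erfc(sqrt(x/2)), which holds because a chi-square variable with one degree of freedom is the square of a standard normal variable. The function name `chi2_upper_tail_1df` is hypothetical.

```python
import math

def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

chi2_value = 0.2160      # calculated chi-square from the example
critical_value = 6.635   # table value for 1 d.f. at the 0.01 level

# Reject the null hypothesis only if the calculated value exceeds the table value.
reject_null = chi2_value > critical_value
p_value = chi2_upper_tail_1df(chi2_value)

print("Reject null hypothesis:", reject_null)
print("Upper-tail probability:", round(p_value, 3))
```

The same function evaluated at 6.635 returns approximately 0.01, which is a quick way to confirm that the table entry belongs to the 0.01 level.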