
SYLLABUS OF UNIT IV

Probability, Sampling Distribution, Fundamentals of Statistical Analysis and Inference, Estimation, Hypothesis Testing, Correlation, Regression, Types of Study Designing, Experimental Designing, Error Analysis.
4.1 Probability
Trial & Event: Consider an experiment which, though repeated under identical conditions, does not give unique results but may result in any one of several possible outcomes. The experiment is called a trial & the outcomes are called events. Throwing a die is a trial & getting a 2 or a 3 is an event.
4.2 Sampling distribution: If we take a certain number of samples and for each sample compute various statistical measures such as the mean, standard deviation, etc., then each sample may give its own value for the statistic. All such values of a particular statistic (say, the mean), together with their relative frequencies, constitute the sampling distribution of that statistic.
(A) Sampling distribution of mean: It refers to the probability distribution of all the possible means of random samples of a given size taken from a population. If samples are taken from a normal population N(μ, σ), the sampling distribution of mean would be normal with mean μ and standard deviation σ/√n, where σ = standard deviation of the population, μ = mean of the population and n = number of items in a sample. When sampling is from a population which is not normal, as per the central limit theorem the sampling distribution of mean tends quite close to the normal distribution if n is large (more than 30). If we want to reduce the sampling distribution of mean to the unit normal distribution N(0, 1), we can write the normal variate
z = (x̄ − μ)/(σ/√n)
for the sampling distribution of mean.
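The central limit theorem stated above can be checked numerically. The following is an illustrative sketch (not from the text): it draws many samples of size n from a deliberately non-normal (uniform) population and verifies that the sample means cluster around μ with spread σ/√n.

```python
# Sketch: sampling distribution of the mean for a non-normal population.
# Population: Uniform(0, 12), so mu = 6 and sigma = 12/sqrt(12) = 3.464.
import math
import random
import statistics

random.seed(1)                    # fixed seed so the run is reproducible
n = 36                            # sample size (> 30, so the CLT applies)
num_samples = 20_000
mu = 6.0
sigma = 12.0 / math.sqrt(12.0)    # std. deviation of Uniform(0, 12)

# mean of each of the 20,000 random samples of size n
means = [statistics.fmean(random.uniform(0.0, 12.0) for _ in range(n))
         for _ in range(num_samples)]

print(round(statistics.fmean(means), 3))   # close to mu = 6
print(round(statistics.stdev(means), 3))   # close to sigma/sqrt(n) = 0.577
```

Even though the population is flat, the histogram of the 20,000 sample means is approximately normal, with spread σ/√n rather than σ.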
(B) Sampling distribution of proportion: Proportion is a measure of an attribute.
Let N = population size,
X = number of people out of N possessing a particular attribute,
π = X/N = actual proportion of people possessing the specified attribute.
Let a sample of size n be selected from this population, with
x = number of people in the sample possessing the specified attribute,
p = x/n = sample proportion.
The distribution of x is the Binomial distribution B(n, π). Using properties of the Binomial distribution, for sufficiently large sample size n we have
z = (x − nπ)/√(nπ(1 − π)) = (p − π)/√(π(1 − π)/n) ~ N(0, 1).
This result holds for n ≥ 30 or nπ ≥ 5.

(C) Student's t-distribution: When the population standard deviation σ is unknown and the sample size is small (n < 30) we use the t-distribution for the sampling distribution of mean:
t = (x̄ − μ)/(S1/√n),
where S1² = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)² = sample variance.
The variable t differs from z because we use the sample standard deviation (S1) in the calculation of t, whereas we use the standard deviation of the population (σ) in the calculation of z. There is a different t-distribution for every possible sample size, i.e., for different degrees of freedom. The degrees of freedom for a sample of size n is n − 1. The larger the sample size, the closer the shape of the t-distribution to the normal distribution. For sample size n > 30 the t-distribution is close to the normal distribution & we can use the normal to approximate the t-distribution.

(D) Chi-square (χ²) Distribution: Computing the variance of a sample requires adding a collection of squared quantities, and this relates to the chi-square distribution.
Suppose we have a random sample x1, x2, …, xn from a N(μ, σ) population; then the distribution of the statistic
χ² = Σᵢ₌₁ⁿ ((xᵢ − μ)/σ)²
is the Chi-square distribution with n degrees of freedom. Alternatively, the distribution of the statistic
χ² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/σ² = (n − 1)S1²/σ² = nS²/σ²
is the Chi-square distribution with (n − 1) degrees of freedom. The Chi-square distribution is used in taking decisions about the variance, testing independence of attributes and testing goodness of fit.
(E) Snedecor's F-distribution: Let X and Y be two independent sample statistics such that X and Y both have Chi-square distributions with degrees of freedom d1 and d2 respectively. Then the distribution of the statistic
F = (X/d1)/(Y/d2)
is Snedecor's F-distribution with d1 and d2 degrees of freedom. Alternatively, if we have two independent samples x1, x2, …, xn1 and y1, y2, …, yn2 of sizes n1 and n2 respectively from normal populations having the same variance, then the distribution of the statistic
F = [Σᵢ₌₁ⁿ¹ (xᵢ − x̄)²/(n1 − 1)] / [Σⱼ₌₁ⁿ² (yⱼ − ȳ)²/(n2 − 1)] = Sx²/Sy²
is Snedecor's F-distribution with (n1 − 1) and (n2 − 1) degrees of freedom.


FUNDAMENTALS OF STATISTICAL ANALYSIS AND INFERENCE
When we do not study the whole population, we study a sample to draw conclusions. Drawing inference from a probabilistic sample about unknown population parameters is called statistical inference. It is concerned with hypothesis testing and estimation. Estimation means estimating unknown population parameters using a sample. There are two types of estimates: the point estimate and the interval estimate.
Point Estimate
The statistic x̄ (sample mean) is used to estimate the population mean μ, so the sample mean is a point estimate of the population mean. The statistics
s² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)² and s1² = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)²
are used to estimate the unknown population variance σ². So these two statistics s² and s1² are point estimates of σ².
PROPERTIES OF A GOOD ESTIMATOR
(i) Unbiasedness: Consider a population of size four: 18, 20, 22, 24.
Population mean μ = (18 + 20 + 22 + 24)/4 = 21.
Population variance σ² = [(18 − 21)² + (20 − 21)² + (22 − 21)² + (24 − 21)²]/4 = 5.
Now we collect all possible samples of size 2 (with replacement) from this population and calculate the sample mean x̄ and the sample variances s² (divisor n) and s1² (divisor n − 1) for each sample.

Sample    x̄    s²    s1²
18,18     18    0     0
20,18     19    1     2
22,18     20    4     8
24,18     21    9     18
18,20     19    1     2
20,20     20    0     0
22,20     21    1     2
24,20     22    4     8
18,22     20    4     8
20,22     21    1     2
22,22     22    0     0
24,22     23    1     2
18,24     21    9     18
20,24     22    4     8
22,24     23    1     2
24,24     24    0     0
Average   21    2.5   5

We see that the average of all sample means is the same as the population mean 21. The average of all s² values (2.5) is not the same as the population variance, while the average of all s1² values is the same as the population variance 5. This property of an estimate is called unbiasedness.
An estimate T of a parameter θ is known as an unbiased estimate of θ if the average of all possible values of T (from all possible samples of the given size) is the same as θ.
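The enumeration in the table above can be reproduced with a few lines of standard-library Python (an illustrative sketch, not from the text; `pvariance` divides by n while `variance` divides by n − 1, matching the s² and s1² columns):

```python
# Enumerate all 16 ordered samples of size 2 (with replacement) from the
# population {18, 20, 22, 24} and average each sample statistic.
from itertools import product
from statistics import fmean, pvariance, variance

population = [18, 20, 22, 24]
print(fmean(population), pvariance(population))   # population mean 21, variance 5

samples = list(product(population, repeat=2))     # all 16 ordered samples
avg_mean = fmean(fmean(s) for s in samples)       # average of the x-bar column
avg_s2 = fmean(pvariance(s) for s in samples)     # average of the s^2 column (divisor n)
avg_s1sq = fmean(variance(s) for s in samples)    # average of the s1^2 column (divisor n-1)
print(avg_mean, avg_s2, avg_s1sq)                 # 21, 2.5, 5
```

Only x̄ and s1² average back to the corresponding population parameter, which is why s1² (and not s²) is the unbiased estimator of σ².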
(ii) Consistency: An estimator should approach the value of the population parameter as the sample size becomes larger and larger. This is referred to as the property of consistency, and it is most desirable.
(iii) Sufficiency: An estimator should use as much as possible of the information available from the sample. This is the property of sufficiency.
(iv) Efficiency: An estimator should have a small variance. The most efficient estimator among a group of unbiased estimators is the one which has the smallest variance. This is the property of efficiency.
Interval Estimation
In interval estimation we attach a desired level of confidence: say, 99% confidence means the probability that the interval will contain the true unknown parameter is 0.99.
Let θ = unknown parameter,
T = unbiased point estimate of θ, i.e. E(T) = θ.
Fix a desired level of confidence (1 − α)100% (usually fixed at 95% or 99%). The value of (1 − α) is called the confidence coefficient. For a 95% confidence level, α = 0.05.
A confidence interval estimate [T − h, T + h] of θ means
P(T − h ≤ θ ≤ T + h) = 1 − α, where h is determined by the critical value.
(i) Estimating population mean μ when σ is known: To get an interval estimate of a parameter we require an unbiased point estimate of that parameter, the standard error of the point estimate, the confidence coefficient (to be fixed by the researcher) and the critical value.
The unbiased point estimate of the population mean μ is the sample mean x̄. The standard error of x̄ is σ/√n.
If the population is normal or the sample size n is large, then the distribution of (x̄ − μ)/(σ/√n) is N(0, 1). Under this condition the critical value is z(α/2). Thus the (1 − α)100% confidence interval estimate of μ is given by
x̄ ± z(α/2) · σ/√n.

Confidence level   Confidence coeff.   Critical value z(α/2)
90%                0.90                1.645
95%                0.95                1.96
99%                0.99                2.58



When σ² is known
Ex 9.1 A sample of 11 circuits from a large normal population has a mean of 2.2 ohms. We know from past testing that the population std. deviation is 0.35 ohms. Determine a 95% confidence interval for the true mean resistance of the population.
Solution: Here n = 11, x̄ = 2.2, σ = 0.35.
z(α/2) = critical value = 1.96 (from table).
z(α/2) × σ/√n = 1.96 × 0.35/√11 = 0.2068.
The 95% confidence interval for the true mean resistance of the population is
x̄ ± z(α/2) · σ/√n = (2.2 − 0.2068, 2.2 + 0.2068) = (1.9932, 2.4068),
i.e. 1.9932 < μ < 2.4068.
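Ex 9.1 can be checked with a small standard-library sketch (illustrative, not from the text); `NormalDist().inv_cdf` gives the normal critical value instead of reading it from a table:

```python
# (1 - alpha) confidence interval for mu when sigma is known:
# xbar +/- z(alpha/2) * sigma / sqrt(n)
import math
from statistics import NormalDist

def mean_ci_known_sigma(xbar, sigma, n, conf=0.95):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z(alpha/2), about 1.96 for 95%
    h = z * sigma / math.sqrt(n)
    return xbar - h, xbar + h

lo, hi = mean_ci_known_sigma(2.2, 0.35, 11)
print(round(lo, 4), round(hi, 4))                  # (1.9932, 2.4068)
```

Changing `conf` to 0.99 widens the interval, since the 99% critical value (about 2.58) is larger.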


When σ² is unknown
In the case of small samples (n < 30) the distribution of (x̄ − μ)/(S1/√n) is not normal; it is the t-distribution with (n − 1) degrees of freedom, provided the parent population is normal.
In such a situation the (1 − α) × 100% confidence interval estimate of μ is
x̄ ± t(α/2) · S1/√n,
where the critical value t(α/2) is given by the t-distribution table for (n − 1) deg. of freedom.

Ex. 9.2: A sample of 11 circuits from a large normal population has a mean resistance of 2.2 ohms. The population std. deviation is unknown; the sample std. deviation S1 = 0.35 ohm. Find a 95% confidence interval for the true mean resistance of the population.
Soln: x̄ = 2.2, S1 = 0.35, n = 11.
With n − 1 = 10 degrees of freedom & α = 0.05, from the t-distribution table t(α/2) = 2.228 = critical value.
t(α/2) × S1/√n = 2.228 × 0.35/√11 = 0.235.
The required 95% confidence interval is
x̄ ± t(α/2) · S1/√n = (2.2 − 0.235, 2.2 + 0.235) = (1.965, 2.435).


(ii) Estimating population proportion π: An unbiased estimate of the population proportion π is provided by the sample proportion p = x/n, and the std. error of p is √(π(1 − π)/n).
For large sample size n, the distribution of the statistic (p − π)/√(π(1 − π)/n) is N(0, 1).
Using this, the confidence interval estimate of π is given by
(p − z(α/2)√(π(1 − π)/n), p + z(α/2)√(π(1 − π)/n)).
As this expression itself contains π (which is unknown), we replace π by its unbiased estimate p. Hence the confidence interval estimate of π is
(p − z(α/2)√(p(1 − p)/n), p + z(α/2)√(p(1 − p)/n)),
provided the sample size n is large (n ≥ 30).
Ex. 9.3: A random sample of 100 people shows that 25 have opened an IRA (individual retirement arrangement) this year. Construct a 95% confidence interval for the true proportion of people who have opened an IRA.
Soln: From the normal table, for a 95% confidence interval the critical value z(α/2) = 1.96 (here α = 0.05).
Given n = 100 and x = 25, so p = x/n = 25/100 = 0.25.
z(α/2)√(p(1 − p)/n) = 1.96 × √(0.25 × 0.75/100) = 0.08487.
The 95% confidence interval for the true proportion of people who have opened an IRA is
(p − z(α/2)√(p(1 − p)/n), p + z(α/2)√(p(1 − p)/n))
= (0.25 − 0.08487, 0.25 + 0.08487) = (0.16513, 0.33487).
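The same large-sample formula can be wrapped in a small helper (a standard-library sketch, not from the text):

```python
# Large-sample CI for a population proportion: p +/- z(alpha/2)*sqrt(p(1-p)/n)
from math import sqrt
from statistics import NormalDist

def proportion_ci(p, n, conf=0.95):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # normal critical value
    h = z * sqrt(p * (1 - p) / n)
    return p - h, p + h

lo, hi = proportion_ci(25 / 100, 100)              # Ex. 9.3 data
print(round(lo, 4), round(hi, 4))                  # about (0.1651, 0.3349)
```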
(iii) Estimating population variance σ²: We have two commonly used point estimates of the population variance σ², given as s² = (1/n)Σ(xᵢ − x̄)² and s1² = (1/(n − 1))Σ(xᵢ − x̄)². s1² is an unbiased estimate, but s² is not an unbiased estimate of σ². Both are consistent estimators of σ².
For a sufficiently large sample, s² ≈ s1².

Case I. Small sample size (n < 30): The distribution of the statistic (n − 1)s1²/σ² is the chi-square distribution with (n − 1) degrees of freedom, provided the parent population is normal. Using this distribution, the (1 − α) × 100% confidence interval of σ² is given by
((n − 1)s1²/χ²(α/2), (n − 1)s1²/χ²(1 − α/2)),
where χ²(α/2) and χ²(1 − α/2) are critical values obtained using the chi-square distribution with (n − 1) degrees of freedom. These critical values can be taken from the chi-square table.
Ex. 9.4 The cholesterol concentrations in the yolks of a sample of 18 randomly selected eggs laid by chickens were found to have mean 9.38 mg/g of yolk & std. deviation 1.62 mg/g. Find a 95% confidence interval for the true variance of cholesterol concentration.
Solution: n = 18, s1 = 1.62.
For a 95% confidence interval, α = 1 − 0.95 = 0.05; with n − 1 = 17 degrees of freedom, from the chi-square table the critical values are χ²(α/2) = χ²(0.025) = 30.191 and χ²(1 − α/2) = χ²(0.975) = 7.564.
Hence the 95% confidence interval for the true variance of the cholesterol concentration is
((n − 1)s1²/χ²(0.025), (n − 1)s1²/χ²(0.975)) = (17 × 1.62²/30.191, 17 × 1.62²/7.564) = (1.4777, 5.8986).
Case 2: Large sample (n ≥ 30): For a large sample (n ≥ 30) the distribution of (s1² − σ²)/(σ²√(2/(n − 1))) is approximately N(0, 1). Using this we can obtain the (1 − α) × 100% confidence interval of σ² as
(s1²/(1 + z(α/2)√(2/(n − 1))), s1²/(1 − z(α/2)√(2/(n − 1)))).
Here the critical value z(α/2) is obtained using the N(0, 1) distribution.

Ex. 9.5 A technologist collects 50 specimens of product from a new process & determines the percent of water in each. Sample mean = 43.24% & sample std. deviation s1 = 7.93%. Compute a 95% confidence interval estimate of the true variance of the percentage of water for this new process.
Soln. (s1²/(1 + z(α/2)√(2/(n − 1))), s1²/(1 − z(α/2)√(2/(n − 1))))
= (7.93²/(1 + 1.96√(2/49)), 7.93²/(1 − 1.96√(2/49))) = (45.04714, 104.1106)
is the 95% confidence interval estimate of the true variance.
Determination of sample size (confidence level approach): In a sample study there arises some sampling error, which can be controlled by taking a sample of adequate size. The researcher will have to specify the precision he wants in his estimates. For example, if for the sample mean the desired precision is ±3 and the actual sample mean is 100, this means 97 < mean < 103.
(a) Sample size decision while estimating mean
The confidence interval for the universe mean is given by x̄ ± z(α/2) · σ/√n, where
x̄ = sample mean, σ = std. deviation of population,
n = sample size, z(α/2) = value of the standard variate at the given confidence level.
Sample size n = (z(α/2))²σ²/e² for an infinite population,
n = (z(α/2))²σ²N / ((N − 1)e² + (z(α/2))²σ²) for a finite population, where
N = size of population,
e = acceptable error or the precision or estimation error.
Ex. 9.6: Find the sample size for estimating the true weight of cereal containers for a universe with N = 5000.
Variance of weight = 4 ounces² from past records. The estimate should be within 0.8 ounces of the true average weight with 99% probability.
Solution: N = 5000, σ = 2 ounces, e = 0.8 ounces; with 99% confidence level, z(α/2) = 2.57 from the normal table.
n = (z(α/2))²σ²N / ((N − 1)e² + (z(α/2))²σ²) = (2.57² × 4 × 5000)/((4999 × 0.64) + (2.57² × 4)) = 40.95 ≈ 41.
Hence sample size n = 41 for the given precision & confidence level.
If the population is infinite then n = (z(α/2))²σ²/e² = (2.57 × 2/0.8)² = 41.28 ≈ 41.
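Both sample-size formulas from Ex. 9.6 can be sketched in a small helper (illustrative, not from the text; the 99% critical value 2.57 is the one the text reads from its normal table):

```python
# Required sample size for estimating a mean within +/- e.
def sample_size_mean(z, sigma, e, N=None):
    """N = None means an infinite population; otherwise the finite-population formula."""
    if N is None:
        return (z * sigma / e) ** 2
    return (z**2 * sigma**2 * N) / ((N - 1) * e**2 + z**2 * sigma**2)

n_finite = sample_size_mean(2.57, 2, 0.8, N=5000)   # Ex. 9.6, N = 5000
n_infinite = sample_size_mean(2.57, 2, 0.8)         # same precision, infinite population
print(round(n_finite, 2), round(n_infinite, 2))     # 40.95 and 41.28, as in the text
```

Note that the finite-population answer is always slightly smaller than the infinite-population one, since a finite universe needs a slightly smaller sample for the same precision.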

(b) Sample size decision while estimating a percentage or proportion
n = (z(α/2))²·p·q·N / ((N − 1)e² + (z(α/2))²·p·q) (for a finite population),
n = (z(α/2))²·p·q/e² (for an infinite population),
where p = sample proportion, q = 1 − p,
n = sample size, N = size of population,
z(α/2) = value of the standard variate at the given confidence level.
Ex. 9.7: Find the sample size if a random sample from a population of 4000 items is drawn to estimate the percent defective within 2% of the true value with probability 95.5%.
Soln: N = 4000, e = 0.02.
z(α/2) = 2.005 for a 95.5% confidence level, from the normal table.
As the value of p is not given, let p = 0.02 (on the basis of past data or experience),
q = 1 − p = 0.98 (p = proportion of defectives in the universe).
n = (z(α/2))²pqN / ((N − 1)e² + (z(α/2))²pq) = (2.005² × 0.02 × 0.98 × 4000)/((3999 × 0.0004) + (2.005² × 0.02 × 0.98)) = 187.78 ≈ 188.
If the population happens to be infinite, n = (z(α/2))²pq/e² = (2.005² × 0.02 × 0.98)/0.0004 = 196.98 ≈ 197.

DETERMINATION OF SAMPLE SIZE (BAYESIAN APPROACH)


1. Find the expected value of the sample information (EVSI) for every possible n.
2. Find a reasonable approximate cost of taking a sample of every possible n.
3. Compare EVSI & the cost of the sample for every possible n:
Expected net gain (ENG) = (EVSI) − (cost of sample).
4. The optimal sample size is that value of n which maximizes the difference between EVSI and the cost of the sample.
The computation of EVSI for every possible n & comparison with the respective cost is a tedious task, and this approach is rarely used in practice.
TESTING OF HYPOTHESIS
A hypothesis is defined as a proposition or a set of propositions set forth as an explanation for the occurrence of some specified group of phenomena, either asserted as a provisional conjecture to guide some investigation or accepted as highly probable in the light of established facts.
A research hypothesis is a predictive statement, capable of being tested by scientific methods, that relates an independent variable to some dependent variable. A hypothesis states what we are looking for & is a proposition which can be put to a test to determine its validity.

Characteristics of a Hypothesis
1. A hypothesis should be clear & precise, otherwise the inferences drawn from it cannot be reliable.
2. It should be capable of being tested.
3. It should state the relationship between variables.
4. It should be limited in scope & specific.
5. It should be stated in simple terms.
6. It should be consistent with most known facts.
7. It should be amenable to testing within a reasonable time.
Null hypothesis and alternative hypothesis
To compare method A with method B about its superiority, if we proceed on the assumption that both methods are equally good, then this assumption is called the null hypothesis. The assumption that method A is superior or inferior to B is called the alternative hypothesis. Suppose we want to test the hypothesis that the population mean (μ) equals the hypothesized mean μ0 = 100:
Null hypothesis H0: μ = μ0 = 100.
Alternative hypothesis H1: μ > μ0 (population mean > 100)
or H1: μ < μ0 (population mean < 100).
The alternative hypothesis is usually the one which one wishes to prove, & the null hypothesis the one which one wishes to disprove. Thus a null hypothesis represents the hypothesis we are trying to reject, and the alternative hypothesis represents all other possibilities.
Type I & Type II errors: A Type I error means rejecting a true null hypothesis. A Type II error means accepting a false null hypothesis.

Decision      H0 true                    H0 false
Accept H0     No error (prob. 1 − α)     Type II error (prob. β)
Reject H0     Type I error (prob. α)     No error (prob. 1 − β)

If the Type I error is fixed at 5%, it means that there are about 5 chances in 100 that we will reject H0 when H0 is true. We can control the Type I error by fixing it at a lower level (say 1%). With a fixed sample size, when we try to reduce the Type I error, the probability of committing a Type II error increases. Decision makers decide the appropriate level of Type I error by examining the costs or penalties attached to both types of errors.
Level of significance (α): It is the probability of a Type I error. A 5% level of significance means that the researcher is willing to take as much as a 5% risk of rejecting the null hypothesis H0 when it happens to be true. Thus the significance level is the maximum value of the probability of rejecting H0 when it is true, and it is usually determined in advance, before testing the hypothesis.

Two-tailed & one-tailed tests:
We test three types of hypotheses:
(i) H0: μ = μ0 against H1: μ ≠ μ0
(ii) H0: μ = μ0 against H1: μ > μ0
or H0: μ ≤ μ0 against H1: μ > μ0
(iii) H0: μ = μ0 against H1: μ < μ0
or H0: μ ≥ μ0 against H1: μ < μ0
In the alternative hypothesis:
when we have a ≠ sign, we have a two-tailed test;
when we have a > sign, we have a right-tailed test;
when we have a < sign, we have a left-tailed test.
Critical value & decision rule
The critical value divides the area under the probability curve of the distribution into a critical (or rejection) region & an acceptance region.
In the case of a two-tailed test, when we test H0: μ = μ0 against H1: μ ≠ μ0, we have two critical values, as the critical region is divided into two equal parts at the two ends of the probability curve. The size of each part is α/2.

[Figures: rejection regions for the two-tailed, right-tailed and left-tailed tests]

For a right-tailed test, when we wish to test H0: μ = μ0 against H1: μ > μ0, the critical value is at the right-side end of the probability curve. When the distribution of the test statistic is N(0, 1) & the level of significance is 5% (α = 0.05), the critical value is 1.645, and the null hypothesis is rejected when the value of the test statistic is greater than 1.645.
PROCEDURE FOR HYPOTHESIS TESTING
1. Setting up the hypothesis: The hypothesis must be clearly stated.
If Mr. X of the civil dept. wants to test whether the load-bearing capacity of an old bridge is more than 10 tons:
Null hypothesis H0: μ = 10 tons & alternative hypothesis H1: μ > 10 tons.
Formulation of the hypothesis is an important step which must be done with care, in accordance with the object & nature of the problem. If H1 is of the "greater than" (or "less than") type we use a one-tailed test. When H1 is of the type "whether greater or smaller" we use a two-tailed test.
2. Selecting a significance level: In practice a 5% or 1% level of significance is adopted. The factors that affect the level of significance are the sample size, the magnitude of the difference between sample means, the variability of measurements within samples and whether the hypothesis is directional or not.
3. Test statistic: We obtain the value of the test statistic using the sample observations selected by the researcher & the hypothetical parametric value stated under the null hypothesis.
4. Critical value: Using the distribution of the test statistic, the level of significance (α) & the type of test, we get the critical value.
5. Decision: Comparing the test statistic value & the critical value, we reject the null hypothesis when
(a) value of test statistic > critical value, in a right-tailed test;
(b) value of test statistic < critical value, in a left-tailed test;
(c) value of test statistic < lower critical value or > upper critical value, in a two-tailed test.
HYPOTHESIS TESTING FOR MEAN
Case I: Population standard deviation σ is known
μ0 = hypothesized population mean,
μ = population mean.
In this case the population should be normal, or the sample size should be large (n ≥ 30 in practice).
The test statistic is Zc = (x̄ − μ0)/(σ/√n), where x̄ = sample mean and n = sample size.

Ex. 10.1 Average salary = $74914 (ten years ago). A researcher would like to test whether this average has increased or not. A sample of 112 produced a mean salary of $78695. Given σ = $14530.
Solution: Here we wish to test H0: μ ≤ 74914 against H1: μ > 74914 (right-tailed test).
We have x̄ = 78695. Let the level of significance be 5%.
The researcher need not fix the Type II error; he should try to have an optimum size of Type I error so that the size of the Type II error is small. As the population standard deviation σ = $14530 is known & the sample size n = 112 is large enough, the test statistic is
Zc = (x̄ − μ0)/(σ/√n) = (78695 − 74914)/(14530/√112) = 2.7539.
The distribution of the test statistic is N(0, 1).
So the critical value at the 5% level of significance from the normal table is 1.645 (right-tailed test). Since the computed value 2.7539 > critical value 1.645, we reject the null hypothesis at the 5% level of significance and conclude that the average salary has increased.
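The arithmetic of Ex. 10.1 can be checked with a few lines of Python (an illustrative sketch, not from the text):

```python
# One-sample z test for the mean (Ex. 10.1), population sigma known.
import math

xbar, mu0, sigma, n = 78695, 74914, 14530, 112
z = (xbar - mu0) / (sigma / math.sqrt(n))
print(round(z, 4))   # 2.7539, which exceeds the 5% right-tail critical value 1.645
```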
Case II: σ is unknown: In this case the population should be normally distributed. When the population is not normal, we require a large sample. Thus this method is generally used for a small sample drawn from a normal population.
The test statistic is Tc = (x̄ − μ0)/(S1/√n), where S1² = (1/(n − 1))Σ(xᵢ − x̄)² = sample variance.
The distribution of this test statistic is Student's t-distribution with (n − 1) degrees of freedom. The critical value can be obtained from the t-distribution table.
Ex. 10.2 A sample of 25 people is taken. The length of time to prepare dinner is recorded in minutes as given below:
44, 51.9, 49.7, 40, 55.5, 33, 43.4, 41.3, 45.2, 40.7, 41.1, 49.1, 30.9, 45.2, 55.3, 52.1, 55.1, 38.8, 43.1, 39.2, 58.6, 49.8, 43.2, 47.9, 46.6.
Is there any evidence that the population mean time to prepare dinner is < 48 minutes? Use a level of significance of 0.05.
Solution: We have to test H0: μ ≥ 48 against H1: μ < 48 (left-tailed test).
The population standard deviation is unknown. From the given data n = 25, x̄ = 45.628 (calculated sample mean) and S1 = 6.9587 (calculated sample std. deviation). The test statistic is
Tc = (x̄ − μ0)/(S1/√n) = (45.628 − 48)/(6.9587/√25) = −1.70435.
The test statistic follows the t-distribution with n − 1 = 24 deg. of freedom. For a left-tailed test & level of significance 0.05, from the t-distribution table the critical value is −1.711.
Since the calculated value of the test statistic −1.70435 > −1.711 (critical value), we can't reject the null hypothesis at the given level of significance 0.05 (i.e. the hypothesis is accepted).
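Ex. 10.2 can be reproduced from the raw data with the standard library (a sketch, not from the text; `statistics.stdev` uses the divisor n − 1 and therefore matches S1):

```python
# One-sample t test (Ex. 10.2): sigma unknown, small normal sample.
from math import sqrt
from statistics import fmean, stdev

times = [44, 51.9, 49.7, 40, 55.5, 33, 43.4, 41.3, 45.2, 40.7, 41.1, 49.1, 30.9,
         45.2, 55.3, 52.1, 55.1, 38.8, 43.1, 39.2, 58.6, 49.8, 43.2, 47.9, 46.6]
n, mu0 = len(times), 48
xbar, s1 = fmean(times), stdev(times)     # 45.628 and about 6.9587
t = (xbar - mu0) / (s1 / sqrt(n))
print(round(t, 3))                        # about -1.704, above the critical value -1.711
```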

Note: In the case of a finite population (when n/N > 0.05) we should use the finite population correction fpc = √((N − n)/(N − 1)). In such a case we multiply the denominator (the standard error) of both of the above test statistics by the fpc; the remaining steps are the same.


HYPOTHESIS TESTING FOR PROPORTION
The parameter of interest here is the population proportion π having a given attribute, for example the proportion of female students in a college or the proportion of defective items produced by a company in a whole day.
Based on a sample we test any of the following hypotheses:
H0: π ≤ π0 against H1: π > π0 (right-tailed test)
H0: π ≥ π0 against H1: π < π0 (left-tailed test)
H0: π = π0 against H1: π ≠ π0 (two-tailed test)
The test statistic for this test is Zc = (p − π0)/√(π0(1 − π0)/n),
where p = sample proportion. The distribution of the test statistic is N(0, 1) for a sufficiently large sample; in practice n ≥ 30 or nπ0 ≥ 5.
Ex 10.3 A marketing company claims that it receives 8% responses from its mailing. It seems that the company is giving an exaggerated picture & the said percentage is less. To test this claim, a random sample of 500 was surveyed, with 30 responses. Test the company's claim at significance level α = 0.05 against the alternative that the company is giving an exaggerated picture.
Solution: We have to test H0: π = 0.08 against H1: π < 0.08 (left-tailed test).
The sample proportion is p = 30/500 = 0.06 & the sample size is n = 500 (large enough).
Test statistic: Zc = (p − π0)/√(π0(1 − π0)/n) = (0.06 − 0.08)/√(0.08 × 0.92/500) = −1.64845.
The distribution of the test statistic is N(0, 1). From the normal table, at the 0.05 level of significance we get critical value = −1.645.
At the 5% level of significance, the computed value of the test statistic < critical value, so we reject the null hypothesis at the 5% level of significance.
Note: If we change the level of significance to 2.5% then the critical value will be −1.96 & the null hypothesis will be accepted. So the researcher needs to decide the level of significance carefully. Acceptance of a null hypothesis is caused by the unavailability of any statistical evidence against the null hypothesis in the selected sample.
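Ex. 10.3 can be checked numerically (an illustrative sketch, not from the text); note how close the statistic sits to the 5% critical value, which is exactly why the note above matters:

```python
# One-sample z test for a proportion (Ex. 10.3), left-tailed.
from math import sqrt

p, pi0, n = 30 / 500, 0.08, 500
z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
print(round(z, 3))   # about -1.648, just below the 5% critical value -1.645
```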
HYPOTHESIS TESTING FOR VARIANCE
We wish to test any of the following:
(i) H0: σ² = σ0² against H1: σ² ≠ σ0² (two-tailed test)
(ii) H0: σ² = σ0² against H1: σ² > σ0² (right-tailed test)
(iii) H0: σ² = σ0² against H1: σ² < σ0² (left-tailed test)
The test statistic is χ² = (n − 1)s1²/σ0²,
where s1² is the sample variance and σ0² is the hypothesized value of the population variance.
The distribution of the test statistic χ² is the chi-square distribution with (n − 1) degrees of freedom, provided that the population has a normal distribution. The range of the chi-square distribution is (0, ∞) and it is not symmetric.
Ex. 10.4 A pharmaceutical company is considering the purchase of new bottling machines to increase efficiency. The factory currently uses machines that fill cough syrup bottles whose volume of medicine has a std. deviation of 1.6 ml. The new machine they are considering was tested on 30 bottles, producing a batch with std. deviation 1.2 ml. Does this machine produce a std. deviation less than 1.6 ml? Assuming a normal distribution, test at the 5% significance level.
Solution: Here we have to test
H0: σ² ≥ (1.6)² against H1: σ² < (1.6)²
or H0: σ² = (1.6)² against H1: σ² < (1.6)².
Test statistic: χ² = (n − 1)s1²/σ0² = (30 − 1) × 1.2²/1.6² = 16.3125.
The distribution of the test statistic is chi-square with n − 1 = 29 degrees of freedom. It is a left-tailed test at the 5% level of significance. Using the chi-square table, the critical value is 17.708. Since the computed value of the test statistic < the critical value, we reject the null hypothesis at the 5% level of significance & conclude that the new machine produces a std. deviation < 1.6 ml.
Note: Table 3 has critical values for the chi-square distribution with d.f. 1 to 30.
For larger samples the N(0, 1) distribution can be used: after obtaining the value of the test statistic χ², apply the formula Z = √(2χ²) − √(2·d.f. − 1). The computed value of Z can then be compared with the critical values obtained using N(0, 1).
HYPOTHESIS TESTING FOR DIFFERENCE OF TWO MEANS
Here we have two samples, possibly drawn from different populations, & based on these two samples we compare the population means.
We have a sample x1, x2, …, xn drawn from a population with mean μx and variance σx². Another sample y1, y2, …, ym is drawn from a population with mean μy and variance σy².
We wish to test any of the following:
(i) H0: μx = μy against H1: μx ≠ μy
(ii) H0: μx = μy against H1: μx > μy
or H0: μx ≤ μy against H1: μx > μy
(iii) H0: μx = μy against H1: μx < μy
or H0: μx ≥ μy against H1: μx < μy

Case I: Both samples are independent of each other & both population variances σx² & σy² are known:
Let x̄ and ȳ be the sample means. The parameter of interest is μx − μy. A feasible estimate of this unknown parameter is x̄ − ȳ.
From the central limit theorem, for sufficiently large samples the distribution of (T − E(T))/S.E.(T) is N(0, 1), where T is the sample statistic, S.E.(T) is the std. error of T and E(T) is the expected value or mean of T.
Using the sampling distribution of the mean we have E(x̄ − ȳ) = μx − μy and S.E.(x̄ − ȳ) = √(σx²/n + σy²/m). Since the value of the parameter μx − μy under the null hypothesis is zero, the test statistic is
Zc = (x̄ − ȳ)/√(σx²/n + σy²/m).
When both populations are normal, or both sample sizes are large enough, the distribution of this test statistic is N(0, 1).
Ex. 10.5 In a survey of buying habits in supermarket A, 400 women shoppers are chosen at random. Their average weekly food expenditure is Rs. 250 with std. deviation Rs. 40. In another supermarket B, for 400 women shoppers chosen at random, the avg. weekly food expenditure is Rs. 220 with std. deviation Rs. 55. Do these two populations have similar shopping habits, i.e. is the average weekly food expenditure of the two populations of shoppers equal? Test at the 5% level of significance.
Solution: We wish to test H0: μx = μy against H1: μx ≠ μy.
Both sample sizes are very large, so the sample variances & population variances are almost the same; we therefore treat the population variances as known:
x̄ = 250, ȳ = 220, σx = 40, σy = 55, n = 400, m = 400.
Test statistic: Zc = (x̄ − ȳ)/√(σx²/n + σy²/m) = (250 − 220)/√(40²/400 + 55²/400) = 8.8226.
For a two-tailed test at the 5% significance level the critical values are −1.96 and 1.96, using the N(0, 1) distribution. As the computed value of the test statistic > upper critical value 1.96, we reject the null hypothesis at the 5% level of significance & conclude that the two populations have different shopping habits.
Case II: σx² & σy² are unknown and both samples are independent of each other:
We use Student's t-distribution, which requires two conditions:
(i) both populations have normal distributions,
(ii) σx² = σy² (both population variances are equal).
The unknown common variance is estimated by the pooled sample variance
Sp² = [Σᵢ₌₁ⁿ (xᵢ − x̄)² + Σⱼ₌₁ᵐ (yⱼ − ȳ)²]/(n + m − 2).
The test statistic is Tc = (x̄ − ȳ)/(Sp√(1/n + 1/m)).
The distribution of Tc is Student's t-distribution with n + m − 2 degrees of freedom.
Ex. 10.6 You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks listed on the NYSE & NASDAQ?

                  NYSE    NASDAQ
No. of stocks     21      25
Sample mean       3.27    2.53
Sample std. dev.  1.3     1.16

Assuming both populations are approximately normal with equal variances, is there a difference in average yield (α = 0.05)?
Solution: We wish to test H0: μx = μy against H1: μx ≠ μy.
Pooled variance: Sp² = [(21 − 1) × 1.3² + (25 − 1) × 1.16²]/(21 + 25 − 2) = 1.5021.
Test statistic: Tc = (x̄ − ȳ)/(Sp√(1/n + 1/m)) = (3.27 − 2.53)/√(1.5021 × (1/21 + 1/25)) = 2.04.
The test statistic has a t-distribution with 21 + 25 − 2 = 44 degrees of freedom. The critical value of the t-distribution can be approximated using the N(0, 1) distribution (for more than 30 deg. of freedom).
The critical values for this two-tailed test at the 5% level of significance are −1.96 and 1.96. The computed value of the test statistic is 2.04, which is larger than the upper critical value 1.96, so we reject the null hypothesis at the 5% level of significance.
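The pooled-variance computation of Ex. 10.6 can be sketched as follows (illustrative, not from the text):

```python
# Two-sample pooled t test (Ex. 10.6): equal but unknown variances.
from math import sqrt

n1, n2 = 21, 25
xbar, ybar = 3.27, 2.53
s1, s2 = 1.3, 1.16
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
t = (xbar - ybar) / sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(sp2, 4), round(t, 2))   # 1.5021 and 2.04, as in the text
```

Note that the pooled variance is a weighted average of the two sample variances, with weights proportional to their degrees of freedom.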
Case III: Both samples are related
Here the two samples are not independent but related, and the observations are paired. For example, the heights of a group of 15 children before & after they are fed a certain nutritional food product for two months make two related samples.
We find the differences dᵢ = xᵢ − yᵢ for i = 1, 2, …, n. Then the mean difference is d̄ = (1/n)Σdᵢ & the sample std. deviation is
Sd = √((1/(n − 1)) Σᵢ₌₁ⁿ (dᵢ − d̄)²).
The test statistic is Tc = d̄/(Sd/√n). This Tc has a t-distribution with (n − 1) deg. of freedom.
Ex. 10.7: You send your salespeople to a customer service training workshop. Has the training made a difference in the number of complaints? You collect the following data:

Salesperson   Complaints before   Complaints after
A             6                   4
B             20                  6
C             3                   2
D             0                   0
E             4                   0

Conduct the test assuming a normal distribution at the 1% level of significance.
Solution: We wish to test H0: μx = μy against H1: μx ≠ μy.
In this paired data we first find the differences (after − before):

      A     B     C    D    E
dᵢ   −2   −14   −1    0   −4

d̄ = (1/5)(−2 − 14 − 1 + 0 − 4) = −4.2 &
Sd = √((1/(n − 1)) Σ(dᵢ − d̄)²) = √((1/4)[(−2 + 4.2)² + (−14 + 4.2)² + (−1 + 4.2)² + (0 + 4.2)² + (−4 + 4.2)²]) = 5.67.
Test statistic: Tc = d̄/(Sd/√n) = −4.2/(5.67/√5) = −1.66.
Using the t-distribution with 4 deg. of freedom at the 1% level of significance, the critical values for this two-tailed test are −4.604 and 4.604.
Since the computed value −1.66 lies between the critical values, i.e. −4.604 (lower critical value) < −1.66 < 4.604 (upper critical value), we accept the null hypothesis.
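The paired-difference computation of Ex. 10.7 can be checked with the standard library (a sketch, not from the text; small discrepancies with the worked answer come from the text rounding Sd to 5.67):

```python
# Paired t test (Ex. 10.7): differences between related "after" and "before" samples.
from math import sqrt
from statistics import fmean, stdev

before = [6, 20, 3, 0, 4]
after_ = [4, 6, 2, 0, 0]
d = [a - b for a, b in zip(after_, before)]   # -2, -14, -1, 0, -4
dbar, sd = fmean(d), stdev(d)                 # -4.2 and about 5.67
t = dbar / (sd / sqrt(len(d)))
print(round(t, 2))   # about -1.66; inside (-4.604, 4.604), so H0 is accepted
```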

HYPOTHESIS TESTING FOR DIFFERENCE OF TWO PROPORTIONS
The test statistic is Zc = (p1 − p2)/√(p̂(1 − p̂)(1/n1 + 1/n2)),
where p̂ = (n1p1 + n2p2)/(n1 + n2) is the pooled sample proportion & n1, n2 are the sample sizes.
Zc has the N(0, 1) distribution provided both sample sizes are large enough (n1, n2 > 30).
Ex. 10.8 In a random sample, 400 men and 600 women were asked whether they would like to have a flyover near their residence; 200 men & 325 women were in favour of the proposal. Is the opinion of men & women the same, in general, on this issue?
Solution: We wish to test H0: π1 = π2 against H1: π1 ≠ π2.
p1 = 200/400 = 0.5, p2 = 325/600 = 0.5417,
p̂ = (200 + 325)/(400 + 600) = 0.525.
Test statistic: Zc = (p1 − p2)/√(p̂(1 − p̂)(1/n1 + 1/n2)) = (0.5 − 0.5417)/√(0.525 × 0.475 × (1/400 + 1/600)) = −1.2936.
For this two-tailed test at the 5% level of significance, using the N(0, 1) distribution the critical values are −1.96 and 1.96. Since
−1.96 (lower critical value) < computed value −1.2936 < 1.96 (upper critical value),
we accept the null hypothesis. Thus we cannot say that men and women differ in opinion.
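Ex. 10.8 can be reproduced as follows (an illustrative sketch, not from the text; carrying full precision for p2 gives about −1.293 rather than the text's −1.2936, which rounds p2 to 0.5417):

```python
# Two-sample z test for proportions (Ex. 10.8) with a pooled proportion.
from math import sqrt

n1, n2 = 400, 600
x1, x2 = 200, 325                                 # numbers in favour
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                    # pooled proportion 0.525
z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print(round(z, 3))   # about -1.293; between -1.96 and 1.96, so H0 is accepted
```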
HYPOTHESIS TESTING FOR DIFFERENCE OF TWO VARIANCES
To compare two population variances using sample variances, we use Snedecor's F-distribution.

Test statistic Fc = S1²/S2²

When both populations are normal, the distribution of Fc is Snedecor's F-distribution with (n-1, m-1) degrees
of freedom, where n, m are the sample sizes.
Ex.10.9:
                  NYSE    NASDAQ
No. of stocks     21      25
Sample mean       3.27    2.53
Sample std. dev   1.3     1.16

Is there a difference in the variances (or risks) between the NYSE and NASDAQ?
Solution: We wish to test at 5% significance level
H0: σ1² = σ2² against H1: σ1² ≠ σ2²

Test statistic Fc = S1²/S2² = (1.3)²/(1.16)² = 1.256

Critical values at (n-1, m-1) = (20, 24) degrees of freedom:
Upper critical value FU = F0.025 at (20, 24) d.f. = 2.33.
Lower critical value FL = F0.975 at (20, 24) d.f. = 1/[F0.025 at (24, 20) d.f.] = 1/2.41 = 0.41.

0.41 < Fc = 1.256 < 2.33

So we cannot reject H0 and conclude that there is insufficient evidence of a difference in variances at α = 0.05.

CORRELATION
Correlation coefficient, r = σxy/(σx σy) = cov(x, y)/(σx σy).

The coefficient of correlation is not affected by change of scale or change of location. It can be used to
compare the relationships between two pairs of variables. It is a unit-free measure of relationship between
two variables and takes values in [-1, 1]. When r is close to 1 there is a strong positive relationship; when r is
close to -1 there is a strong negative relationship. Like covariance, it measures only linear relationship. The
correlation coefficient is zero when the covariance is zero.

Spearman's coefficient of correlation: rs = 1 - [6 Σ di²] / [n(n² - 1)],

where n = number of pairs of observations and
di = difference between the ranks of the ith pair of the two variables.
This measure is useful when a group of n individuals is arranged in order of merit or proficiency in
possession of two characteristics. Observations on each characteristic are assigned ranks from 1 to n. If
some individuals have the same merit or proficiency in a characteristic, they are assigned the same rank,
which is the arithmetic mean of their positions after arranging them in rising order of merit.
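A small Python sketch of the rank formula, with illustrative (hypothetical) rankings of 5 candidates by two judges, chosen so there are no ties:

```python
# Spearman rank correlation: rs = 1 - 6*sum(di^2) / (n*(n^2 - 1)).
judge1 = [1, 2, 3, 4, 5]   # ranks given by the first judge
judge2 = [2, 1, 4, 3, 5]   # ranks given by the second judge

n = len(judge1)
d2 = sum((a - b) ** 2 for a, b in zip(judge1, judge2))   # sum of di^2 = 4
r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))                    # 1 - 24/120 = 0.8

print(r_s)
```

An rs of 0.8 indicates strong agreement between the two orderings.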
Ex.: Find the correlation coefficient.

x    1    3    5    11
y    10   15   25   30

Solution: n = 4, x̄ = 5, ȳ = 20

σx² = [1/(n-1)] Σ(xi - x̄)² = (1/3)[(1-5)² + (3-5)² + (5-5)² + (11-5)²] = 18.67, so σx = 4.32

σy² = [1/(n-1)] Σ(yi - ȳ)² = (1/3)[(10-20)² + (15-20)² + (25-20)² + (30-20)²] = 83.33, so σy = 9.128

σxy = [1/(n-1)] Σ(xi - x̄)(yi - ȳ) = (1/3)[(1-5)(10-20) + (3-5)(15-20) + (5-5)(25-20) + (11-5)(30-20)] = 36.67

r = σxy/(σx σy) = 36.67/(4.32 × 9.128) = 0.9297
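The same calculation in Python, stdlib-only, mirroring the n-1 definitions used above:

```python
# Pearson correlation for the worked example: r = sxy / (sx * sy),
# with sample (n-1) variances and covariance.
from math import sqrt

x = [1, 3, 5, 11]
y = [10, 15, 25, 30]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n                      # 5 and 20
sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))   # about 4.32
sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))   # about 9.129
sxy = sum((xi - xbar) * (yi - ybar)
          for xi, yi in zip(x, y)) / (n - 1)             # about 36.67
r = sxy / (sx * sy)

print(round(r, 4))
```

With unrounded intermediates r comes to about 0.9297, a strong positive linear relationship.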

REGRESSION
After finding correlation between two variables, we would like to determine a mathematical relationship
between them so that we can
(i) predict the value of a variable based on the value of the other variable;
(ii) explain the impact of changes in the values of one variable on the values of the other variable.
We regress the dependent variable on the independent variable. The dependent variable is called the
regressed (or study) variable; the independent variable is called the regressor variable. For example, rainfall
is an independent variable which affects the yield of a certain crop; yield becomes the dependent variable.
A linear regression model between a single study variable & a single explanatory variable is called a simple
linear regression model. The simplest relationship between X and Y is a linear relationship given by Y = a + bX.
The exact relationship between X and Y is Y = a + bX + error.
This error is the difference between the observed value & the predicted value of Y. Using the collected
observations (x1, y1), (x2, y2), ..., (xn, yn), these errors or residuals can be written as (yi - a - bxi) for i = 1, 2, ..., n.
We want to find the values of a and b for which these errors are minimum.

In the least squares method we minimize the sum of squared residuals, Σ(yi - a - bxi)². For this we
differentiate it w.r.t. a and b separately & equate the derivatives to zero. Solving those
equations we get the following estimates of a and b:

b = Sxy/Sxx = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²  and  a = ȳ - b x̄.

Ex.14.1: A sample of fifteen 10-year-old children was taken. The number of pounds each child was
overweight was recorded, along with the number of hours of TV viewing per week. These data are listed
below. Fit the regression line & describe what the coefficients tell about the relation between the two
variables.
TV         42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight 18  6  0 -1 13 14  7  7 -9  8  8  5  3 14 -7
Solution: We prepare the following table
TV (xi)  Overweight (yi)  (xi - x̄)  (yi - ȳ)  (xi - x̄)²  (xi - x̄)(yi - ȳ)
42 18 10.5333 12.2667 110.9511 129.2089
34 6 2.5333 0.2667 6.4178 0.6756
25 0 -6.4667 -5.7333 41.8178 37.0756
35 -1 3.5333 -6.7333 12.4844 -23.7911
37 13 5.5333 7.2667 30.6178 40.2089
38 14 6.5333 8.2667 42.6844 54.0089
31 7 -0.4667 1.2667 0.2178 -0.5911
33 7 1.5333 1.2667 2.3511 1.9422
19 -9 -12.4667 -14.7333 155.4178 183.6756
29 8 -2.4667 2.2667 6.0844 -5.5911
38 8 6.5333 2.2667 42.6844 14.8089
28 5 -3.4667 -0.7333 12.0178 2.5422
29 3 -2.4667 -2.7333 6.0844 6.7422
36 14 4.5333 8.2667 20.5511 37.4756
18 -7 -13.4667 -12.7333 181.3511 171.4756

x̄ = 31.4667,  ȳ = 5.7333,  Sxx = 671.7333,  Sxy = 649.8667


Using the above calculations,

b = Sxy/Sxx = 649.8667/671.7333 = 0.9674

and a = ȳ - b x̄ = 5.7333 - 0.9674 × 31.4667 = -24.709.

The fitted simple linear regression model is y = -24.709 + 0.9674x.
The value b = 0.9674 is the change in the value of y for a unit change in the value of x. The intercept a is the
constant term, the value of y when x is zero.
The fitted line can be used to predict the value of y for a new value of x.
For example, the predicted value of y when x = 30 is
-24.709 + 0.9674 × 30 = 4.31.
Thus the weight gain for 30 hours of TV watching per week is about 4.31 pounds.
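The least-squares fit can be reproduced with a short stdlib-only Python sketch using the Ex.14.1 data:

```python
# Simple linear regression by least squares: b = Sxy/Sxx, a = ybar - b*xbar.
tv = [42, 34, 25, 35, 37, 38, 31, 33, 19, 29, 38, 28, 29, 36, 18]
wt = [18, 6, 0, -1, 13, 14, 7, 7, -9, 8, 8, 5, 3, 14, -7]

n = len(tv)
xbar, ybar = sum(tv) / n, sum(wt) / n                       # 31.4667, 5.7333
sxx = sum((x - xbar) ** 2 for x in tv)                      # about 671.73
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(tv, wt))  # about 649.87
b = sxy / sxx                                               # about 0.9674
a = ybar - b * xbar                                         # about -24.71
predicted = a + b * 30                                      # about 4.31 pounds

print(round(a, 3), round(b, 4), round(predicted, 2))
```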
EXPERIMENTAL DESIGNS: Experimental design refers to the framework or structure of an experiment.
There are two types: informal and formal experimental designs.
(a) Informal experimental designs:
i. Before-and-after without control design
ii. After-only with control design
iii. Before-and-after with control design

(b) Formal experimental designs:
i. Completely randomized design (C.R. Design)
ii. Randomized block design (R.B. Design)
iii. Latin square design (L.S. Design)
iv. Factorial Designs
(a)(i). Before-and-after without control design
A single test group or area is selected & the dependent variable is measured before the
introduction of the treatment. The dependent variable is measured again after the treatment has
been introduced.
The effect of the treatment = (Y) - (X), where
Y = level of the phenomenon after the treatment,
X = level of the phenomenon before the treatment.
The difficulty with such a design is that, with the passage of time, considerable extraneous
variation may get mixed into the treatment effect.
(ii). After-only with control design: In this design two groups or areas (test area & control area) are
selected & treatment is introduced into test area only. The dependent variable is then measured in
both the areas at the same time.
Treatment impact is assessed by subtracting the value of the dependent variable in the control area
from its value in the test area. We assume that the two areas are identical w.r. to their behavior
towards the phenomenon considered.
(iii). Before-and-after with control design
In this design two areas are selected & the dependent variable is measured in both areas for an
identical time period before the treatment. The treatment is then introduced into the test area only &
the dependent variable is measured in both for an identical time period after the introduction of the
treatment. The treatment effect is determined by subtracting the change in the dependent variable in
the control area from the change in the dependent variable in the test area.
It is superior to above two designs because it avoids extraneous variation resulting both from the
passage of time and from non-comparability of the test & control areas.
COMPLETELY RANDOMIZED DESIGN
The essential characteristic of the design is that subjects are randomly assigned to experimental
treatments (or vice-versa). One-way ANOVA is used to analyze such a design. It provides the
maximum number of degrees of freedom to the error. There are two forms:
(i) Two-group simple randomized experimental design:
First the population is defined & then from the population a sample is selected randomly.
Items are randomly assigned to the experimental & control groups. It becomes possible to draw
conclusions on the basis of samples that are applicable to the population.
This design of experiment is common in research studies of the behavioural sciences. The merit of such a
design is that it is simple & randomizes the differences among the sample items.
Its limitation is that individual differences among those conducting the treatments are not eliminated.
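The random assignment step at the heart of this design can be sketched in a few lines of Python; the subject IDs are hypothetical:

```python
# Random assignment of a sample to experimental and control groups,
# as in the two-group simple randomized design.
import random

random.seed(7)                                 # fixed seed for a reproducible illustration
subjects = [f"S{i}" for i in range(1, 11)]     # a random sample of 10 subjects
random.shuffle(subjects)                       # randomize the order
experimental, control = subjects[:5], subjects[5:]

print("experimental:", experimental)
print("control:", control)
```

Every subject lands in exactly one group, and which group is determined purely by chance.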

(ii) RANDOM REPLICATION DESIGN
In the replication design there are clearly two populations. The sample is taken randomly
from the population under study & is randomly assigned to 4 experimental &
4 control groups. Similarly, a sample is taken randomly from the population available to conduct the
experiments, & the eight individuals so selected are randomly assigned to the eight groups.
Variables relating to both population characteristics are assumed to be randomly distributed among the two
groups.
RANDOMIZED BLOCK DESIGN (R.B. DESIGN):
In the R.B. design the principle of local control can be applied along with the other two principles of
experimental design. The subjects are first divided into groups, known as blocks, such that within each group
the subjects are relatively homogeneous in respect of some selected variable. The number of subjects in a
given block equals the number of treatments, & one subject in each block is randomly assigned to each
treatment. The main feature of the R.B. design is that each treatment appears the same number of times in
each block.
For example, suppose 4 different forms of a test were given to each of five students and they obtained the
following scores:

         Very low IQ   Low IQ      Avg. IQ     High IQ     Very high IQ
         student A     student B   student C   student D   student E
Form 1   82            67          57          71          73
Form 2   90            68          54          70          81
Form 3   86            73          51          69          84
Form 4   93            77          60          65          71
If each student separately randomized the order in which he took the 4 tests (using random numbers or
otherwise), we have an R.B. design. The purpose of this randomization is to take care of possible extraneous
factors such as experience gained from repeated testing.
LATIN SQUARE DESIGN (L.S. DESIGN)
This is used in agricultural research. L.S. design is used when there are two major extraneous factors such
as varying seeds and varying soil fertility. The treatments in L.S. design are so allocated among the plots
that no treatment occurs more than once in any one row or any one column.
                   FERTILITY LEVEL
                   I   II  III  IV  V
               X1  A   B   C    D   E
               X2  B   C   D    E   A
Seed           X3  C   D   E    A   B
differences    X4  D   E   A    B   C
               X5  E   A   B    C   D
The field is divided into as many blocks as there are varieties of fertilizers; each block is then divided into as
many parts as there are varieties of fertilizers, such that each fertilizer variety is used in each of the blocks
only once.
Merit of this design is that it enables differences in fertility gradients in the field to be eliminated in
comparison to the effects of different varieties of fertilizers on the yield of crop.
In L.S. design we must assume that there is no interaction between treatments & blocking factors.
Its limitation is that it requires equal numbers of rows, columns & treatments, which reduces the utility of this
design. If there are, say, 10 treatments, the rows & columns may not be homogeneous (as the rows &
columns become large).
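A Latin square layout like the one above can be generated cyclically; this Python sketch rotates the treatments one position per row, so each treatment occurs exactly once in every row and column:

```python
# Cyclic construction of an n x n Latin square of treatments.
def latin_square(treatments):
    n = len(treatments)
    # Row r, column c gets the treatment shifted by (r + c) mod n.
    return [[treatments[(r + c) % n] for c in range(n)] for r in range(n)]

square = latin_square(["A", "B", "C", "D", "E"])
for row in square:
    print(" ".join(row))
```

The first two rows come out as A B C D E and B C D E A, matching the 5x5 table shown above.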
FACTORIAL DESIGN
It is used in experiments where the effects of varying more than one factor are to be determined. There are
two types: simple factorial designs & complex factorial designs.
(i). Simple Factorial Design: We consider effects of varying two factors on the dependent variable in
simple factorial design. When an expt. is done with more than 2 factors, complex factorial design is used.

In this design the extraneous variable to be controlled by homogeneity is called the control variable, and the
independent variable which is manipulated is called the experimental variable. In the basic form there are two
treatments of the experimental variable & two levels of the control variable, giving 4 cells into which the
sample is divided. Each of the combinations provides one treatment or experimental condition. Subjects are
assigned at random to each treatment as in a randomized group design.
The means of the different cells represent the mean scores for the different variables. Column means are the
main effects for treatments, neglecting any differential effect due to the control variable; row means are the
main effects for levels without regard to treatment.
Control Variable    Experimental Variable
                    Treatment A   Treatment B   Treatment C   Treatment D
Level I             Cell 1        Cell 4        Cell 7        Cell 10
Level II            Cell 2        Cell 5        Cell 8        Cell 11
Level III           Cell 3        Cell 6        Cell 9        Cell 12

(Here the simple factorial design is extended to four treatments and three levels, giving 12 cells.)

The means for columns provide the researcher with an estimate of main effects for treatments and
means for rows provide an estimate of main effects for the levels. Such a design enables the
researcher to determine interaction between treatments & levels.
(ii). Complex Factorial Designs: A design which considers 3 or more independent variables
simultaneously is called complex factorial design.
For 3 factors with one expt. variable having 2 treatments & 2 control variables, each having two
levels, the design used is called 2x2x2 complex factorial design.
                               Experimental Variable
                        Treatment A                Treatment B
                   Control      Control       Control      Control
                   Variable 2   Variable 2    Variable 2   Variable 2
                   Level I      Level II      Level I      Level II
Control   Level I  Cell 1       Cell 3        Cell 5       Cell 7
Variable
1         Level II Cell 2       Cell 4        Cell 6       Cell 8

To determine main effects of experimental variable, the researcher must compare combined mean
of data in cells 1,2,3, & 4 for treatment A with combined mean of data in cell 5,6,7,8 for treatment
B. In this way main effect of exptl. variable independent of control variable 1 and variable 2 is
obtained. Similarly main effect of control variable 1 independent of experimental variable and
control variable 2 is obtained if we compare combined mean of data in cells 1,3,5 and 7 with
combined mean of data in cells 2,4,6 and 8 of 2x2x2 factorial design.
To get a first-order interaction, say EV x CV1 (experimental variable with control variable 1), the researcher
must ignore control variable 2, for which purpose he may develop a 2x2 design from the 2x2x2 design by
combining the data of the relevant cells of the latter design.
Complex factorial designs can be generalized to any number and combination of experimental & control
independent variables. The advantage of factorial designs is that they provide equivalent accuracy with less
labour & economy of effort, and permit various other comparisons of interest.
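The cell-combining arithmetic for main effects described above can be sketched in Python; the cell means here are hypothetical numbers, keyed by (treatment, CV1 level, CV2 level):

```python
# Main effects in a 2x2x2 complex factorial design from cell means.
cells = {
    ("A", "I", "I"): 10, ("A", "II", "I"): 12, ("A", "I", "II"): 11, ("A", "II", "II"): 13,
    ("B", "I", "I"): 14, ("B", "II", "I"): 16, ("B", "I", "II"): 15, ("B", "II", "II"): 17,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Main effect of the experimental variable: cells 5-8 (B) vs cells 1-4 (A).
ev_effect = (mean(v for (t, _, _), v in cells.items() if t == "B")
             - mean(v for (t, _, _), v in cells.items() if t == "A"))

# Main effect of control variable 1: level II cells (2,4,6,8) vs level I (1,3,5,7).
cv1_effect = (mean(v for (_, l1, _), v in cells.items() if l1 == "II")
              - mean(v for (_, l1, _), v in cells.items() if l1 == "I"))

print(ev_effect, cv1_effect)
```

With these illustrative numbers the experimental-variable main effect is 4.0 and the CV1 main effect is 2.0, each computed independently of the other factors.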

-------*******-------
