UNIT-IV
4.0 OBJECTIVES
4.1 NEED FOR SAMPLING
4.2 ELEMENTS OF SAMPLING PLAN
4.3 TYPES OF SAMPLING
4.3.1 Random or Probability Sampling
Simple Random Sampling
Stratified Random Sampling
Systematic Random Sampling
Cluster Sampling
4.3.2 Non-Random or Non-Probability Sampling
Convenience Sampling
Judgmental Sampling
Quota Sampling
4.4 SAMPLING AND NON-SAMPLING ERRORS
4.4.1 Reasons for sampling errors
4.4.2 Reasons for non-sampling errors
4.5 TESTING OF HYPOTHESIS
4.5.1 Sampling Distribution
4.5.2 Standard Error
4.5.3 Null & Alternative Hypothesis
4.5.4 Errors in testing of hypothesis
4.5.5 Critical Region
4.5.6 Two tailed and One tailed test
4.5.7 Large and Small sample test
4.6 PROCEDURE FOR TESTING OF HYPOTHESIS
4.7 TESTS OF SIGNIFICANCE
4.7.1 Test for single mean
4.7.2 Test for difference of two means
4.7.3 Test for two standard deviations
4.7.4 Test for Single Proportion
4.7.5 Test for difference of two proportions
4.8 Analysis of Variance
4.8.1 Assumptions
4.8.2 One way ANOVA
4.8.3 Applications
4.0 Objectives
Sampling is used in everyday life, often without our noticing it. For
example, a cook tests a small quantity of rice to see whether it has been
well cooked, and a grain merchant does not examine each grain of what he
intends to purchase but inspects only a small quantity of grains. Most of
our decisions are based on the examination of a few items only.
(ii) it could take a sample and inspect the sample for defectives
and then estimate the total number of
defectives for the population as a whole.
The second approach that uses sampling has two major advantages.
The main steps involved in the planning and execution of sample survey are:
I) Objectives The first task is to lay down in concrete terms the basic
objectives of the survey. Failure to define the objective(s) will
clearly undermine the purpose of carrying out the survey itself. For
example, if a nationalized bank wants to study savings bank
account holders' perception of the service quality rendered over a
period of one year, the objective of the sampling is to analyze
the perception of the account holders in the bank.
iv) Sampling Unit For the purpose of sample selection, the population
should be capable of being divided into sampling units. The
division of the population into sampling units should be
unambiguous. Every element of the population should belong to just
one sampling unit. In the bank example, each savings bank account
holder forms a sampling unit, and all the savings bank
account holders in the bank constitute the population.
Example
and interviewing them. This could be achieved in many ways. Two common
ways are:
The variables used to partition the population into strata are referred
to as stratification variables. The criteria for the selection of these
variables consist of homogeneity, heterogeneity, relatedness, and cost.
The elements within a stratum should be as homogeneous as possible, but
the elements in different strata should be as heterogeneous as possible.
The stratification variables should also be closely related to the
characteristic of interest. The more closely these criteria are met, the
greater the effectiveness in controlling extraneous sampling variation.
Finally, the variables should decrease the cost of the stratification
process by being easy to measure and apply.
the sample size n and rounding to the nearest integer. For example, suppose
there are 100,000 elements in the population and a sample of 1,000 is
desired. In this case, the sampling interval, i, is 100. A random number
between 1 and 100 is selected. If, for example, this number is 23, the
sample consists of elements 23, 123, 223, 323, 423, 523, and so on.
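The selection rule described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `systematic_sample` and its `start` and `seed` parameters are hypothetical and do not come from the original text.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the 1-based element numbers picked by systematic sampling."""
    # Sampling interval i = N / n, rounded to the nearest integer.
    interval = round(population_size / sample_size)
    if start is None:
        # Random start between 1 and i (inclusive).
        start = random.Random(seed).randint(1, interval)
    # Take every i-th element from the start, up to the desired sample size.
    return list(range(start, population_size + 1, interval))[:sample_size]

picked = systematic_sample(100_000, 1_000, start=23)
print(picked[:6])   # [23, 123, 223, 323, 423, 523]
print(len(picked))  # 1000
```

Passing `start=23` reproduces the elements listed in the text; leaving `start` unset draws the random start between 1 and the interval.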
chance and due to the existence of chance in sampling, the sampling errors
occur.
Errors in sampling arise primarily due to the following reasons:
Consider all possible samples of size ‘n’ which can be drawn from a
given population. For each sample we can compute a statistic such as mean,
standard deviation, etc. which will vary from sample to sample. The
aggregate of various values of the statistic under consideration may be
grouped into a frequency distribution. This distribution is known as sampling
distribution of the statistic. Thus the probability distribution of all the
possible values that a sample statistic can take is called the sampling
distribution of the statistic.
If x1, x2, x3, ..., xn are n independent random observations drawn from a
normal population with mean μ and standard deviation σ, then the sampling
distribution of x̄ (the sample mean) follows a normal distribution with mean
μ and standard deviation $\sigma/\sqrt{n}$.
Consider all possible samples of size n drawn from this population. For each
sample, determine the proportion p of successes. Applying the central limit
theorem, if the sample size n is large, the distribution of the sample
proportion p follows a normal distribution with mean $\mu_p = P$ and S.D.
$\sigma_p = \sqrt{PQ/n}$, where Q = 1 - P.
without replacement
Decision
Level of Significance
Tests of significance are of two kinds: (a) tests of significance for large
samples and (b) tests of significance for small samples. For large sample
sizes (n > 30), distributions such as the Binomial and Poisson are
approximated by the normal distribution. Thus the normal probability curve
can be used for testing of hypothesis.
The steps for testing a hypothesis are given below (for both large sample
and small sample tests).
$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Solution:
Null Hypothesis (H0): Sample mean has been drawn from a large
population with mean height of 171.17 cm. i.e., H0: μ = 171.17 cm
Alternative Hypothesis (H1): Sample mean has not been drawn from a
large population with mean 171.17cm i.e., H1: μ≠171.17cm.
Test Statistic:
$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{171.38 - 171.17}{4.40/\sqrt{400}}$
Inference: Since the calculated value of Z is less than the critical value of
Z at the 5% level, we accept the null hypothesis and conclude that the
sample mean has been drawn from a large population with mean height of
171.17 cm.
Example 4.2 The mean lifetime of 100 fluorescent light bulbs produced by a
company is computed to be 1570 hours with a standard deviation of 120
hours. If μ is the mean lifetime of all the bulbs produced by the company,
test the hypothesis μ = 1600 hours against the alternative hypothesis μ ≠
1600 hours using a 5% level of significance.
Solution:
We are given
Test Statistic:
$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Inference: Since the calculated value of |Z| is greater than the critical
value of Z at the 5% level, we reject the null hypothesis and conclude that
there is a significant difference between the sample mean and the population mean.
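The test for a single mean can be checked numerically with the figures of Example 4.2. The sketch below is illustrative only: the function name `z_single_mean` is hypothetical, and the two-tailed 5% critical value of 1.96 is taken from the standard normal table rather than from this passage.

```python
import math

def z_single_mean(sample_mean, pop_mean, sd, n):
    """Z statistic for a large-sample test of a single mean."""
    standard_error = sd / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error

Z_CRITICAL_5PCT = 1.96  # two-tailed value from the standard normal table

z = z_single_mean(sample_mean=1570, pop_mean=1600, sd=120, n=100)
print(z)  # -2.5
print("reject H0" if abs(z) > Z_CRITICAL_5PCT else "accept H0")  # reject H0
```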
1. A random sample of 900 members has a mean 3.4 cm and S.D 2.61
cm. Is the sample from a large population of mean 3.25 cm and S.D
2.61 cm?
2. A random sample of size 400 is drawn and the sample mean is found
to be 99. Test whether the sample could have come from a normal
population with mean 100 and standard deviation 8 at the 5% level.
n=400, x = 99, μ = 100, σ = 8, |Zcal| = 2.5
[Hint: H0 is rejected at 5% level]
Working Rule:
Example 4.3: A college conducts both day and evening classes intended to
be identical. A sample of 100 day students yields examination results with
mean x̄1 = 72 and σ1 = 14.8. A sample of 200 evening students yields
examination results with mean x̄2 = 73.9 and σ2 = 17.9. Are the two means
statistically equal at the 5% level?
Solution:
We are given
Null Hypothesis (H0): H0: μ1 = μ2, i.e., the two means are statistically equal.
Alternative Hypothesis (H1): μ1 ≠μ2 (Two tailed test) i.e., the two means
are not statistically equal.
Test Statistics
Inference: Since the calculated value of Zcal is less than the critical value of
Z at 5% level, hence we accept the null hypothesis and conclude that, the
two means are statistically equal.
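The test statistic for the difference of two means can be sketched as follows, using the figures of Example 4.3. This is a minimal illustration assuming the usual large-sample formula Z = (x̄1 - x̄2) / sqrt(σ1²/n1 + σ2²/n2); the function name `z_two_means` is hypothetical, and the critical value 1.96 comes from the standard normal table.

```python
import math

def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for the difference of two large-sample means."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# Example 4.3: day students versus evening students.
z_classes = z_two_means(72, 14.8, 100, 73.9, 17.9, 200)
print(round(z_classes, 2))  # -0.98
print("reject H0" if abs(z_classes) > 1.96 else "accept H0")  # accept H0
```

The same function applied to the wage data of Example 4.4 (47, 28, 1000 against 49, 40, 1500) gives a value of about -1.47, which is also inside the acceptance region at the 5% level.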
Example 4.4 A random sample of 1000 workers from South India shows that
their mean wages are Rs. 47 per week with a standard deviation of Rs. 28. A
random sample of 1500 workers from North India gives a mean wage of Rs. 49
per week with a standard deviation of Rs. 40. Is there any significant
difference between their mean levels of wages?
Solution
Alternative Hypothesis (H1): H1: μ1≠μ2 (Two tailed test) i.e., there is a
significant difference between their mean level of wages.
Test Statistics
Inference: Since the calculated value of Zcal is less than the critical value of
Z at 5% level, hence we accept the null hypothesis and conclude that the
two means are statistically equal.
Self-Assessment Question
Country Country
A B
[Hint: n1: 1000, x1 =67.42, s1=2.58, n2=1200, x2 =67.25 and s2=2.50, |Zcal|
=1.56]
Working Rule:
Step 1: Setting up of Null Hypothesis. The sample has been drawn from
a population with proportion P, i.e., P=P0.
$Z_{cal} = \frac{p - P}{\sqrt{PQ/n}}$
Example 4.5 In a sample of 1000 people in Karnataka, 540 are rice eaters
and the rest are wheat eaters. Can we assume that rice and wheat
are equally popular in this state at the 1% level of significance?
Solution:
Null Hypothesis (H0): The sample has been drawn from a population with
proportion P, i.e., H0: P=0.5
Alternative Hypothesis (H1): The sample has not been drawn from a
population with proportion P, i.e., H1: P≠0.5.
Test Statistics:
Inference: Since the calculated value of |Zcal| is less than the critical value of
Z at the 1% level, we accept the null hypothesis and conclude that the
sample has been drawn from a population with proportion P = 0.5.
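The calculation behind Example 4.5 can be sketched as below. The function name `z_single_proportion` is hypothetical, and the two-tailed 1% critical value of 2.58 is the standard normal table value, which this passage does not state explicitly.

```python
import math

def z_single_proportion(successes, n, p0):
    """Z statistic for testing a sample proportion against P = p0."""
    p = successes / n
    q0 = 1 - p0
    return (p - p0) / math.sqrt(p0 * q0 / n)

# Example 4.5: 540 rice eaters in a sample of 1000, H0: P = 0.5.
z_rice = z_single_proportion(540, 1000, 0.5)
print(round(z_rice, 2))  # 2.53
print("reject H0" if abs(z_rice) > 2.58 else "accept H0")  # accept H0
```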
Working Rule:
Example 4.6 In a sample of 600 men from a certain city, 450 are found to
be smokers. In a sample of 900 from another city 450 are found to be
smokers. Do the data indicate that the two cities are significantly different
with respect to prevalence of smoking habits among men?
Solution:
Null Hypothesis (H0): The two samples have been drawn from same
population, i.e., H0: P1 = P2
Alternative Hypothesis (H1): The two samples have not been drawn from
the same population, i.e., H1: P1 ≠ P2.
Test Statistic:
|Zcal|= 9.7
Inference: Since the calculated value of |Zcal| is greater than the critical
value of Z at the 5% level, we reject the null hypothesis and conclude
that the two samples have not been drawn from the same population.
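The value |Zcal| = 9.7 in Example 4.6 can be reproduced with a short sketch. It assumes the pooled-proportion form of the test, P = (x1 + x2)/(n1 + n2) and Z = (p1 - p2) / sqrt(PQ(1/n1 + 1/n2)); the function name `z_two_proportions` is hypothetical.

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """Z statistic for the difference of two proportions (pooled estimate)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # P
    q = 1 - pooled                   # Q
    standard_error = math.sqrt(pooled * q * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

# Example 4.6: 450 smokers out of 600 versus 450 smokers out of 900.
z_smokers = z_two_proportions(450, 600, 450, 900)
print(round(z_smokers, 2))  # 9.68
```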
Solution:
Test Statistic:
|Zcal|= 0.19
Inference: Since the calculated value of |Zcal| is less than the critical value of Z at
the 5% level, we accept the null hypothesis and conclude that there is no
improvement after overhauling.
1. In random samples of 600 and 1000 men from two cities, 400 and
600 men respectively are found to be literate. Do the data indicate at the
5% level of significance that the populations are significantly different
in the percentage of literacy?
[Hint: n1 = 600, n2 = 1000, p1 = 400/600 = 0.67, p2 = 600/1000 = 0.6;
P = 0.625; Q = 0.375; |Zcal| = 2.67]
2. Before an increase in excise duty on tea, 400 people out of a sample of
500 persons were found to be tea drinkers. After the increase in the
duty, 400 persons were found to be tea drinkers in a sample of 600
people. Do you think that there has been a significant decrease in the
consumption of tea after the increase in the excise duty?
[Hint: n1 = 500, n2 = 600, p1 = 400/500 = 0.80, p2 = 400/600 = 0.67;
P = 0.73; Q = 0.27; |Zcal| = 4.81; H0: P1 = P2; H1: P1 > P2 (one-tailed test)]
4.8 ANALYSIS OF VARIANCE
If we can control the factors being studied, we say that the data are
experimental. Furthermore, in this case the values, or levels, of the factor (or
combination of factors) are called treatments. The purpose of most
experiments is to compare and estimate the effects of the different
treatments on the response variable. For example, suppose that an oil
company wishes to study how three different gasoline types (A, B and C)
affect the mileage obtained by a popular midsized automobile model. Here
the response variable is gasoline mileage and the company will study a
single factor, gasoline type. Since the oil company can control which gasoline
type is used in the midsized automobile, the data that the oil company will
collect are experimental. Furthermore, the treatments (the levels of the
factor gasoline type) are gasoline types A, B and C.
Definition:
4.8.1 Assumptions:
For the validity of the F-test in ANOVA the following assumptions are
made
(i) The observations are independent.
(ii) Parent population from which the observations are taken is
normal and
(iii) Various treatment and environmental effects are additive in
nature.
Mean Total
G
Grand Total
The total variation in the observation xij can be split into the following two
components:
The first type of variation is due to assignable causes which can be detected
and controlled by human endeavor and the second type of variation due to
chance causes which are beyond the control of human hand.
In particular, let us consider the effect of k different rations on the milk
yield of N cows (of the same breed and stock) divided into k classes of
sizes n1, n2, ..., nk.
Test Procedure:
a) Find the sum of the values of all the N items of the given data. Let this
grand total be represented by ‘G’.
Then the correction factor is $C.F. = G^2/N$.
b) Find the sum of squares of all the individual items (xij); then the
total sum of squares is $TSS = \sum_i \sum_j x_{ij}^2 - C.F.$
c) Find the sum of squares of all the class totals (or treatment
totals) Ti (i = 1, 2, ..., k); then the sum of squares between the
classes or between the treatments is $SST = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - C.F.$,
where ni (i = 1, 2, ..., k) is the number of observations in the ith class,
or the number of observations receiving the ith treatment.
d) Find the sum of squares within the classes, or the sum of squares due to
error (SSE), by subtraction: SSE = TSS - SST.
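Source of variation    Sum of squares   Degrees of freedom   Mean square          F ratio
Between treatments     SST              k-1                  MST = SST/(k-1)      F = MST/MSE
Error                  SSE              N-k                  MSE = SSE/(N-k)
Total                  TSS              N-1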
If variance within the treatment is more than the variance between the
treatments, then numerator and denominator should be interchanged and
degrees of freedom adjusted accordingly.
5) Inference:
If calculated F value is less than table value of F, we may accept our
null hypothesis H0 and say that there is no significant difference
between treatments. If Calculated F value is greater than table value of
F, we reject our H0 and say that the difference between treatments is
significant.
Example 4.7
The following table gives the yields on 15 sample plots under three varieties
of seed
A: 20 21 23 16 20
B: 18 20 17 15 25
C: 25 28 22 28 32
Solution:
Level of Significance(α):0.05
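Test Statistic:
Variety   Observations        Total (Ti)   Ti^2
A         20 21 23 16 20      100          10000
B         18 20 17 15 25      95           9025
C         25 28 22 28 32      135          18225
Sum of squares of all the observations: $\sum\sum x_{ij}^2 = 7590$
Sum of squares between varieties: $SST = \sum T_i^2/n_i - C.F. = 7450 - 7260 = 190$
Total degrees of freedom: 15 - 1 = 14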
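The test procedure (correction factor, TSS, SST, SSE and the F ratio) can be traced on the seed-variety data of Example 4.7 with the sketch below. It is an illustration rather than the worked solution of the text; the function name `one_way_anova` and the dictionary keys are hypothetical.

```python
def one_way_anova(groups):
    """One-way ANOVA by the correction-factor method."""
    values = [x for group in groups for x in group]
    n_total = len(values)                 # N
    k = len(groups)                       # number of treatments
    grand_total = sum(values)             # G
    cf = grand_total**2 / n_total         # C.F. = G^2 / N
    tss = sum(x * x for x in values) - cf
    sst = sum(sum(g)**2 / len(g) for g in groups) - cf
    sse = tss - sst                       # by subtraction
    mst = sst / (k - 1)
    mse = sse / (n_total - k)
    return {"CF": cf, "TSS": tss, "SST": sst, "SSE": sse,
            "MST": mst, "MSE": mse, "F": mst / mse}

varieties = [
    [20, 21, 23, 16, 20],  # A
    [18, 20, 17, 15, 25],  # B
    [25, 28, 22, 28, 32],  # C
]
result = one_way_anova(varieties)
print(result["CF"], result["TSS"], result["SST"], result["SSE"])
# 7260.0 330.0 190.0 140.0
print(round(result["F"], 2))  # 8.14, with (2, 12) degrees of freedom
```

The computed F value is then compared with the table value of F for (k-1, N-k) degrees of freedom at the chosen level of significance.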
Hint:
19
Total 19-1=18
Chapter Summary
UNIT – V
5.0 OBJECTIVES
5.1 MEANING OF CORRELATION
5.2 TYPES OF CORRELATION
5.2.1 Positive and Negative Correlation
5.2.2 Linear and Non-linear Correlation
5.3 MEASUREMENT TECHNIQUES OF CORRELATION COEFFICIENT
5.3.1 Scatter Diagram
5.3.2 Karl Pearson’s Coefficient of Correlation
5.3.3 Spearman’s Rank Correlation
Ranks are given directly
Non -repeated ranks
Repeated ranks
5.4 PROPERTIES OF CORRELATION COEFFICIENT
5.5 MEANING OF REGRESSION
5.6 TYPES OF REGRESSION LINES
5.6.1 Regression line of X on Y
5.6.2 Regression line of Y on X
5.7 CONSTRUCTION OF REGRESSION EQUATIONS
5.8 PROPERTIES OF REGRESSION COEFFICIENTS
5.9 DIFFERENCES BETWEEN CORRELATION AND REGRESSION
5.10 APPLICATIONS OF REGRESSION ANALYSIS
INTRODUCTION
In this unit you will learn the concepts of correlation and
regression. You will also learn the various methods
of obtaining correlation coefficients, the rank correlation coefficient,
regression equations, etc. This unit explains the differences between
correlation and regression. The techniques discussed in this unit are
easier to understand with the help of a calculator. Try out the example
problems with a calculator.
We are familiar with the fact that a change in one factor, say the amount of
rainfall, affects another factor, say the yield of rice. This means
that there exists some kind of relationship between the two factors. Thus
correlation is the relationship between two factors.
There are different types of correlation. They can be classified into the
following categories.
If the changes in the factors are in the same direction then the
correlation is said to be “Positive degree correlation”. Relationship between
the amount of rainfall and yield of rice is an example of positive degree
correlation. If the rainfall level increases then the yield of rice also increases
and vice-versa.
If the changes in the factors are in the constant ratio then the
correlation is said to be “Linear correlation”.
For example
If the changes in the factors are not in the constant ratio then the
correlation is said to be “Non-linear correlation”.
For example
State the different types of correlation with example in the space given
below. Limit your answer in about 80 words.
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
________________________________
1. If all the points lie on a straight line rising from the lower left-hand
corner to the upper right-hand corner, the correlation is said to be
perfectly positive (Fig 5.1), i.e., the correlation coefficient r = +1.
Figure 5.1 r = +1
2. If all the points lie on a straight line falling from the upper left-hand
corner to the lower right-hand corner, the correlation is said to be
perfectly negative (Fig 5.2), i.e., the correlation coefficient r = -1.
Figure 5.2 r = -1
3. If all the points fall in a narrow band and show a rising tendency from
the lower left-hand corner to the upper right-hand corner, there is a high
degree of positive correlation (Fig 5.3).
Figure 5.3 High degree of positive correlation
4. If all the points fall in a narrow band and show a declining tendency
from the upper left-hand corner to the lower right-hand corner, there is a
high degree of negative correlation (Fig 5.4).
Figure 5.4 High degree of negative correlation
5. If all the points fall in a wide band and show a rising tendency from
the lower left-hand corner to the upper right-hand corner, there is a low
degree of positive correlation (Fig 5.5).
6. If all the points fall in a wide band and show a declining tendency from
the upper left-hand corner to the lower right-hand corner, there is a low
degree of negative correlation (Fig 5.6).
Fig 5.7 r = 0
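$r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \sum (Y-\bar{Y})^2}} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$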
Working Procedure
$\bar{X} = \frac{\sum X}{n}$ and $\bar{Y} = \frac{\sum Y}{n}$
Step 3: Take the deviations of the observations in the X series from X̄ and
write them under the column headed by x = X - X̄. Take the deviations of the
observations in the Y series from Ȳ and write them under the column headed
by y = Y - Ȳ.
Step 4: Multiply the respective deviations and write the products under the
column headed by xy.
Step 5: Square the deviations obtained in Step 3 for the X and Y series and
write them under the columns headed by x² and y².
$r_{xy} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$
Example 5.1 Find the coefficient of correlation between height of brothers
and sisters from the following data
Height of Brothers (in cm)   65  66  67  68  69  70  71
Height of Sisters (in cm)    67  68  66  69  72  72  69
X      x = X - X̄    Y      y = Y - Ȳ    xy    x²    y²
65     -3           67     -2           6     9     4
66     -2           68     -1           2     4     1
67     -1           66     -3           3     1     9
68     0            69     0            0     0     0
69     1            72     3            3     1     9
70     2            72     3            6     4     9
71     3            69     0            0     9     0
476    -            483    -            20    28    32
$\bar{X} = \frac{\sum X}{n} = \frac{476}{7} = 68$
$\bar{Y} = \frac{\sum Y}{n} = \frac{483}{7} = 69$
$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{20}{\sqrt{28 \times 32}} = \frac{20}{5.2915 \times 5.6569} = \frac{20}{29.9335}$
r = 0.6681
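The working procedure for the direct method can be sketched in Python with the brother and sister heights of Example 5.1. The function name `pearson_r` is hypothetical; the steps follow the deviation columns x, y, xy, x² and y² described above.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's coefficient by the direct (deviation) method."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]            # column x
    dev_y = [y - mean_y for y in ys]            # column y
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_x2 = sum(a * a for a in dev_x)
    sum_y2 = sum(b * b for b in dev_y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

brothers = [65, 66, 67, 68, 69, 70, 71]
sisters = [67, 68, 66, 69, 72, 72, 69]
r_heights = pearson_r(brothers, sisters)
print(round(r_heights, 3))  # 0.668
```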
Calculate the correlation coefficient between the height of sister and height
of the brothers from the given data:
Height of Sisters (in cm)    64  65  66  67  68  69  70
Height of Brothers (in cm)   66  67  65  68  70  68  72
The above direct method for calculating ‘r’ is not convenient when (i)
the terms of the series X and Y are large and the calculation of X̄ and Ȳ
becomes difficult, or (ii) the means of X or Y are not integers. In these
cases we apply the following assumed-mean formula:
$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$
where,
Working Procedure
Step 2: Take any term ‘A’ as the assumed mean of the X series and ‘B’ as the
assumed mean of the Y series (preferably the middle one).
Step 3: Take the deviations of the observations in the X series from A and
write them under the column headed by dx = X - A. Take the deviations of the
observations in the Y series from B and write them under the column headed
by dy = Y - B.
Step 4: Multiply the respective deviations and write the products under the
column headed by dxdy.
Step 5: Square the deviations obtained in Step 3 for the X and Y series and
write them under the columns headed by dx² and dy².
$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$
Example 5.2: Calculate the coefficient of correlation for the following pairs
of values of X and Y.
X 17 19 21 26 20 28 26 29
Y 23 27 25 26 27 25 30 33
Solution:
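X      Y      dx = X - 23   dy = Y - 27   dxdy   dx²    dy²
17     23     -6            -4            24     36     16
19     27     -4            0             0      16     0
21     25     -2            -2            4      4      4
26     26     3             -1            -3     9      1
20     27     -3            0             0      9      0
28     25     5             -2            -10    25     4
26     30     3             3             9      9      9
29     33     6             6             36     36     36
186    216    2             0             60     144    70
Note that here $\bar{X} = \frac{\sum X}{n} = \frac{186}{8} = 23.25$, which is not an
integer, so we use the short-cut method:
$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$
$r_{xy} = \frac{8(60) - 2(0)}{\sqrt{8(144) - (2)^2}\,\sqrt{8(70) - (0)^2}}$
$r_{xy} = \frac{480}{\sqrt{1148}\,\sqrt{560}}$
$r_{xy} = \frac{480}{33.8821 \times 23.6643}$
$r_{xy} = \frac{480}{801.7962}$
rxy = 0.5987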
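The short-cut calculation of Example 5.2 can be checked with the sketch below. The function name `pearson_r_assumed_mean` and its parameters `a` and `b` (the assumed means A and B) are hypothetical names chosen for the illustration.

```python
import math

def pearson_r_assumed_mean(xs, ys, a, b):
    """Correlation coefficient by the assumed-mean (short-cut) method."""
    n = len(xs)
    dx = [x - a for x in xs]
    dy = [y - b for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(p * q for p, q in zip(dx, dy))
    sum_dx2 = sum(p * p for p in dx)
    sum_dy2 = sum(q * q for q in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = (math.sqrt(n * sum_dx2 - sum_dx**2)
                   * math.sqrt(n * sum_dy2 - sum_dy**2))
    return numerator / denominator

x_values = [17, 19, 21, 26, 20, 28, 26, 29]
y_values = [23, 27, 25, 26, 27, 25, 30, 33]
r_short_cut = pearson_r_assumed_mean(x_values, y_values, a=23, b=27)
print(round(r_short_cut, 3))  # 0.599
```

Because the correlation coefficient is independent of change of origin, any other choice of `a` and `b` gives the same value.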
X 10 25 13 25 22 11 12 25 21 20
Y 12 22 16 15 18 18 17 23 24 17
$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$
Working Procedure
Step 2: Calculate the difference of R1 and R2 and write it under the column
headed by D.
Step 3: Square the difference D and write it under the column headed by D².
$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$
Example 5.3: Two judges in a beauty contest rank the 12 entries as follows.
Judge X   1   2   3   4   5   6   7   8   9   10  11  12
Judge Y   12  9   6   10  3   5   4   7   8   2   11  1
Calculate the rank correlation coefficient between the two judges X and Y.
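R1 (Judge X)   R2 (Judge Y)   D = R1 - R2   D²
1              12             -11           121
2              9              -7            49
3              6              -3            9
4              10             -6            36
5              3              2             4
6              5              1             1
7              4              3             9
8              7              1             1
9              8              1             1
10             2              8             64
11             11             0             0
12             1              11            121
Total                                       416
Now,
$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$
$\rho = 1 - \frac{6(416)}{12^3 - 12}$
$\rho = 1 - \frac{2496}{1728 - 12}$
$\rho = 1 - \frac{2496}{1716} = 1 - 1.4545$
ρ = -0.4545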
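When the ranks are given directly, the formula reduces to a few lines of code. The sketch below uses the two judges of Example 5.3; the function name `spearman_rho` is hypothetical, and the function assumes the ranks contain no ties.

```python
def spearman_rho(ranks1, ranks2):
    """Spearman's rank correlation for untied ranks: 1 - 6*sum(D^2)/(n^3 - n)."""
    n = len(ranks1)
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks1, ranks2))
    return 1 - 6 * sum_d2 / (n**3 - n)

judge_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
judge_y = [12, 9, 6, 10, 3, 5, 4, 7, 8, 2, 11, 1]
rho_judges = spearman_rho(judge_x, judge_y)
print(round(rho_judges, 4))  # -0.4545
```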
Judge 1   1   6   5   10  3   2   4   9   7   8
Judge 2   3   5   8   4   7   10  2   1   6   9
Judge 3   6   4   9   8   1   2   3   10  5   7
Use the rank correlation coefficient to determine which pair of judges has the
nearest approach to common taste in beauty.
Solution
Let R1, R2, R3 respectively be the ranks given by first, second and third judge.
Let ρij be the rank correlation coefficient between the ranks given by ith and
jth judges, i=1,2,3; j=1,2,3.
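Let Dij = Ri - Rj be the difference of the ranks of an individual given by
the ith and jth judges.
R1    R2    R3    D12   D12²   D23   D23²   D13   D13²
1     3     6     -2    4      -3    9      -5    25
6     5     4     1     1      1     1      2     4
5     8     9     -3    9      -1    1      -4    16
10    4     8     6     36     -4    16     2     4
3     7     1     -4    16     6     36     2     4
2     10    2     -8    64     8     64     0     0
4     2     3     2     4      -1    1      1     1
9     1     10    8     64     -9    81     -1    1
7     6     5     1     1      1     1      2     4
8     9     7     -1    1      2     4      1     1
Total             200          214          60
With n = 10, so that n³ - n = 990:
ρ12 = 1 - 6(200)/990 = -0.2121
ρ23 = 1 - 6(214)/990 = -0.2970
ρ13 = 1 - 6(60)/990 = 0.6364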
Since ρ13 is maximum, thus the pair of the first and third judges has the
nearest approach to common taste in beauty.
Judge 6 4 9 8 1 2 3 10 5 7
Y
1st Judge   1   5   4   8   9   6   10  7   3   2
2nd Judge   4   8   7   6   5   9   10  3   2   1
3rd Judge   6   7   8   1   5   10  9   2   3   4
Use spearman’s coefficient of rank correlation to determine which pair of
judges has the nearest approach to common taste in beauty:
In this case we are given only the data. We assign the ranks to both
the series of X and Y by giving the ranks in ascending order for both series
(or descending order).
Working Rule
Step 2: Calculate the difference of ranks and write it under the column
headed by D.
Step 3: Square the difference D and write it under the heading D2.
$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$
Example 5.5
Series X   80   91   99   71   61   81   70   59
Series Y   123  135  154  110  105  134  121  106
Solution:
X     Y     R1   R2   D = R1 - R2   D²
80    123   5    5    0             0
91    135   7    7    0             0
99    154   8    8    0             0
71    110   4    3    1             1
61    105   2    1    1             1
81    134   6    6    0             0
70    121   3    4    -1            1
59    106   1    2    -1            1
Total                               4
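Here, n = 8; ∑D² = 4
Now,
$\rho = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6(4)}{8^3 - 8} = 1 - \frac{24}{504} = 1 - 0.0476 = 0.9524$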
Calculate the rank correlation coefficient for the following data of two series
Series X   92  89  87  86  83  77  71  63  53  50
Series Y   86  83  91  77  68  85  52  82  37  57
For example, suppose an item is repeated at rank 5 (i.e., the 5th and
6th items have the same value); then the common rank assigned to the 5th
and 6th items is (5+6)/2 = 5.5. If a value is repeated thrice, the common
rank assigned to it is the sum of the three ranks divided by 3. In order to
find the rank correlation coefficient, an adjustment factor is added to ∑D²
in the formula for each repeated value, which is given by
$\frac{m(m^2 - 1)}{12}$, where m is the number of times the value is repeated.
The modified formula for the rank correlation coefficient is given by
$\rho = 1 - \frac{6\left[\sum D^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n^3 - n}$
Example 5.6 From the following data related to the series X and Y, calculate
the coefficient of rank correlation.
Series X   48  33  40  9   16  16  65  24  16  57
Series Y   13  13  24  6   15  4   20  9   6   19
Solution
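X     Y     R1    R2     D = R1 - R2   D²
48    13    8     5.5    2.5           6.25
33    13    6     5.5    0.5           0.25
40    24    7     10     -3            9.00
9     6     1     2.5    -1.5          2.25
16    15    3     7      -4            16.00
16    4     3     1      2             4.00
65    20    10    9      1             1.00
24    9     5     4      1             1.00
16    6     3     2.5    0.5           0.25
57    19    9     8      1             1.00
Total                                  41.00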
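[Remark: In the X series, the value 16 is repeated thrice; the common rank
given to it is 3, which is the average of 2, 3 and 4,
i.e., (2+3+4)/3 = 3.]
For the X series, the value 16 is repeated three times (m = 3), so the
adjustment factor is 3(3² - 1)/12 = 2.
For the Y series, the values 13 and 6 are each repeated twice (m = 2), so
each adjustment factor is 2(2² - 1)/12 = 0.5.
$\rho = 1 - \frac{6[41 + 2 + 0.5 + 0.5]}{10^3 - 10}$
$= 1 - \frac{6(44)}{990}$
$= 1 - \frac{264}{990}$
= 1 - 0.2667
ρ = 0.7333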
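The repeated-ranks procedure can be sketched in Python on the data of Example 5.6. The helper names `average_ranks`, `tie_adjustment` and `spearman_rho_with_ties` are hypothetical; the sketch assumes ranks are assigned in ascending order and that each tied group contributes m(m² - 1)/12 to the adjustment.

```python
from collections import Counter

def average_ranks(values):
    """Ascending ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    counts = Counter(values)
    rank_of = {}
    for value, m in counts.items():
        first = ordered.index(value) + 1        # first position held by value
        rank_of[value] = first + (m - 1) / 2    # average of m consecutive ranks
    return [rank_of[v] for v in values]

def tie_adjustment(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_rho_with_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_adjustment(xs) + tie_adjustment(ys)
    return 1 - 6 * adjusted / (n**3 - n)

series_x = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
series_y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]
rho_ties = spearman_rho_with_ties(series_x, series_y)
print(round(rho_ties, 4))  # 0.7333
```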
Series X   68  64  75  50  64  80  75  40  55  64
Series Y   62  58  68  45  81  60  68  48  50  70
➢ The value of ‘r’ does not depend on which of the two variables under
study is labeled X and which is labeled Y.
➢ The correlation coefficient lies between -1 and +1 i.e., -1≤r≤+1
➢ The correlation coefficient is independent of change of origin and
scale.
➢ r = +1 if all (Xi, Yi) pairs lie on a straight line with positive slope, and
r = -1 if all (Xi, Yi) pairs lie on a straight line with negative slope.
A line of regression is the line, which gives the best estimate of one variable
X, for any given value of the other variable. We have two types of regression
lines, namely,
○ Regression line of X on Y
○ Regression line of Y on X.
It is the line, which gives the best estimate for the values of X for a specified
value of Y.
It is given by
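$X - \bar{X} = b_{xy}(Y - \bar{Y})$
$b_{xy} = \frac{\sum xy}{\sum y^2}$
where $x = X - \bar{X}$ and $y = Y - \bar{Y}$
or
$b_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_y^2 - (\sum d_y)^2}$
or
$b_{xy} = r\frac{\sigma_x}{\sigma_y}$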
It is the line, which gives the best estimate for the value of Y for a
specified value of X.
It is given by
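$Y - \bar{Y} = b_{yx}(X - \bar{X})$
$b_{yx} = \frac{\sum xy}{\sum x^2}$
or
$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2}$
or
$b_{yx} = r\frac{\sigma_y}{\sigma_x}$, where ‘r’ is the correlation coefficient and σx, σy are the
standard deviations of the X and Y series.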
Example 5.7 The heights of a sample of 10 fathers and their eldest sons are
given below (to the nearest cm).
Height of Father (X)   170  167  162  163  167  166  169  171  166  169
Height of Son (Y)      166  167  164  166  166  164  168  170  163  166
Solution
Height of     Height of    x = X - X̄   y = Y - Ȳ   xy    x²    y²
Father (X)    Son (Y)
170           166          3           0           0     9     0
167           167          0           1           0     0     1
162           164          -5          -2          10    25    4
163           166          -4          0           0     16    0
167           166          0           0           0     0     0
166           164          -1          -2          2     1     4
169           168          2           2           4     4     4
171           170          4           4           16    16    16
166           163          -1          -3          3     1     9
169           166          2           0           0     4     0
1670          1660         0           0           35    76    38
MATHEMATICS AND STATISTICS Page 53
BHARTHIDASAN UNIVERSITY
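Regression line of Y on X
$Y - \bar{Y} = b_{yx}(X - \bar{X})$
byx = ∑xy/∑x² = 35/76 = 0.4605
Y - 166 = 0.4605(X - 167)
Y = 0.4605X + 89.0965
Regression line of X on Y
$X - \bar{X} = b_{xy}(Y - \bar{Y})$
bxy = ∑xy/∑y² = 35/38 = 0.9211
X - 167 = 0.9211(Y - 166)
X = 0.9211Y + 14.0974
When Y = 190 cm:
X = 0.9211(190) + 14.0974
X = 175.009 + 14.0974
X = 189.1064 cm
When X = 160 cm:
Y = 0.4605(160) + 89.0965
Y = 73.68 + 89.0965
Y = 162.78 cm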
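Both regression lines of Example 5.7 can be obtained from the deviation sums with the sketch below. The function name `regression_coefficients` and the two prediction helpers are hypothetical. Because the code keeps the unrounded slopes, its predictions may differ in the last decimal places from hand calculations that round the slopes to four places.

```python
def regression_coefficients(xs, ys):
    """Return (mean_x, mean_y, byx, bxy) from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sum_xy / sum_x2, sum_xy / sum_y2

fathers = [170, 167, 162, 163, 167, 166, 169, 171, 166, 169]
sons = [166, 167, 164, 166, 166, 164, 168, 170, 163, 166]
mean_x, mean_y, byx, bxy = regression_coefficients(fathers, sons)

def predict_y(x):
    """Regression line of Y on X: Y - mean_y = byx (X - mean_x)."""
    return mean_y + byx * (x - mean_x)

def predict_x(y):
    """Regression line of X on Y: X - mean_x = bxy (Y - mean_y)."""
    return mean_x + bxy * (y - mean_y)

print(round(byx, 4), round(bxy, 4))  # 0.4605 0.9211
print(round(predict_y(160), 2))      # 162.78
print(round(predict_x(190), 2))      # 189.11
```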
Purchase   71  75  69  97  70  91  99  61  80  47
Line of Y on X:Y=0.6132X+14.812;
Example 5.8
Also estimate the likely price at Mumbai when the price at Chennai is Rs 60/-
Solution
(Y)
55 24 0 2 0 0 4
61 26 6 0 0 36 0
76 15 21 -9 -189 441 81
Here, n=10, ∑x =296, ∑Y=130, ∑dx=34; ∑dy= -44; ∑dxdy= 238, ∑dx2=1848;
∑dy2=450
Regression Line of X on Y
$X - \bar{X} = b_{xy}(Y - \bar{Y})$
X - 49.33 = -0.089(Y-21.67)
X - 49.33= -0.089Y + 1.9286
Regression line of Y on X
$Y - \bar{Y} = b_{yx}(X - \bar{X})$
Y -21.67 = -0.0078 (X-49.33)
Y-21.67 = -0.0078X+0.3848
Y= -0.0078X+0.3848+21.67
Y=-0.0078X+22.0548
X = -0.089Y + 51.2586
X = -0.089(60)+51.2586 = 45.92
Age of Husband   23  22  28  26  35  20  22  40  20  18
Age of Wife      18  15  20  17  22  14  16  21  15  14
Hence estimate the age of husband when the age of wife is 19.
                     X     Y
Average              30    50
Standard Deviation   5     10
Regression line of Y on X
$Y - \bar{Y} = b_{yx}(X - \bar{X})$
byx = r(σy/σx) = 0.8(10/5) = 1.6
Y-50 = 1.6(X-30)
Y-50 = 1.6X-48
Y=1.6X-48+50
Y=1.6X+2
When rainfall X = 40 cm,
Y = 1.6(40) + 2
Y = 66 quintals
Estimate the most likely yield of paddy when the annual rainfall is 22 cm,
other factors being assumed to remain the same.
2. If one regression coefficient is greater than unity, then the other one has
to be less than unity.
(i.e., $r = \pm\sqrt{b_{xy} \cdot b_{yx}}$)
Correlation: It does not imply a cause and effect relationship between the
variables under study.
Regression: It indicates the cause and effect relationship between the
variables. The variable corresponding to the cause is taken as the
independent variable, whereas the variable corresponding to the effect is
taken as the dependent variable.
✔ The causes and effect relations are indicated from the study of
regression analysis.
✔ It establishes the rate of change in one variable in terms of the
changes in another variable
✔ It is useful in economic analysis as regression equation can determine
an increase in the cost of living index for a particular increase in
general price level.
✔ It helps in prediction and thus it can estimate the values of unknown
quantities
✔ It helps in determining the coefficient of correlation.
✔ It enables us to study the nature of relationship between the variables.
✔ It can be useful to all natural, social and physical sciences, where the
data are in functional relationship.
Chapter Summary