
BHARTHIDASAN UNIVERSITY

UNIT-IV

THEORY OF SAMPLING AND TESTING OF HYPOTHESIS

4.0 OBJECTIVES
4.1 NEED FOR SAMPLING
4.2 ELEMENTS OF SAMPLING PLAN
4.3 TYPES OF SAMPLING
4.3.1 Random or Probability Sampling
Simple Random Sampling
Stratified Random Sampling
Systematic Random Sampling
Cluster Sampling
4.3.2 Non-Random or Non-Probability Sampling
Convenience Sampling
Judgmental Sampling
Quota Sampling
4.4 SAMPLING AND NON-SAMPLING ERRORS
4.4.1 Reasons for sampling errors
4.4.2 Reasons for non-sampling errors
4.5 TESTING OF HYPOTHESIS
4.5.1 Sampling Distribution
4.5.2 Standard Error
4.5.3 Null & Alternative Hypothesis
4.5.4 Errors in testing of hypothesis
4.5.5 Critical Region
4.5.6 Two tailed and One tailed test
4.5.7 Large and Small sample test
4.6 PROCEDURE FOR TESTING OF HYPOTHESIS
4.7 TESTS OF SIGNIFICANCE
4.7.1 Test for single mean
4.7.2 Test for difference of two means
4.7.3 Test for two standard deviations
4.7.4 Test for Single Proportion
4.7.5 Test for difference of two proportions
4.8 Analysis of Variance
4.8.1 Assumptions
4.8.2 One way ANOVA
4.8.3 Applications

4.0 Objectives

MATHEMATICS AND STATISTICS Page 1



Sampling is used in everyday life, often without our noticing it. For
example, a cook tests a small quantity of rice to see whether it has been
well cooked, and a grain merchant does not examine each grain of what he
intends to purchase but inspects only a small quantity of grains. Most of
our decisions are based on the examination of a few items only.

In a statistical investigation, the interest usually lies in the assessment
of general magnitude and the study of variation with respect to one or more
characteristics relating to individuals belonging to a group. This group of
individuals or units under study is called population or universe. Thus in
statistics, population is an aggregate of objects or units under study. The
population may be finite or infinite.

Sampling and Sample

Sampling is a method of selecting units for analysis, such as households,
consumers, companies etc., from the respective population under statistical
investigation. The theory of sampling is based on the principle of
statistical regularity. According to this principle, a moderately large
number of items chosen at random from a large group are almost sure, on
average, to possess the characteristics of the larger group. The smallest
non-divisible part of the population is called a unit. A unit should be well
defined and should not be ambiguous. For example, if the unit is a
household, the definition should ensure that no person belongs to two
households and that no person belonging to the population is left out.

A finite subset of a population is called a sample and the number of
units in a sample is called its sample size. By analyzing the data collected
from the sample one can draw inference about the population under study.

Parameter and Statistic

The statistical constants of a population, such as the mean (μ), variance
(σ²) and proportion (P), are termed parameters. Statistical measures such as
the mean (x̄), variance (s²) and proportion (p), computed from the sampled
observations, are known as statistics. Sampling is employed to throw light
on the population parameter. A statistic is an estimate based on sample
data, used to draw inferences about the population parameter.


4.1 NEED FOR SAMPLING

Suppose that the raw materials department in a company receives items in
lots and issues them to the production department as and when required.
Before accepting these items, the inspection department inspects or tests
them to make sure that they meet the required specifications. It has two
options:

(i) it could inspect all items in the lot, or

(ii) it could take a sample, inspect the sample for defectives, and then
estimate the total number of defectives for the population as a whole.

The first approach is called complete enumeration (census). It has two
major disadvantages namely, the time consumed and the cost involved in it.

The second approach that uses sampling has two major advantages.

(i) It is significantly less expensive.

(ii) It takes much less time and, when well designed, still gives reliable
results.

There are situations involving destructive testing, where inspecting an item
destroys it; in such cases sampling is the only option. A well-designed
statistical sampling methodology gives accurate results while reducing both
cost and time. Thus sampling is the best available tool for decision makers.

4.2 ELEMENTS OF SAMPLING PLAN

The main steps involved in the planning and execution of sample survey are:

i) Objectives: The first task is to lay down in concrete terms the basic
objectives of the survey. Failure to define the objective(s) will
clearly undermine the purpose of carrying out the survey itself. For
example, if a nationalized bank wants to study savings bank account
holders' perception of the service quality rendered over a period of
one year, the objective of the sampling is to analyze the perception
of the account holders in the bank.


ii) Population to be covered: Based on the objectives of the survey,
the population should be well defined. The characteristics
concerning the population under study should also be clearly
defined. For example, to analyze the perception of the savings bank
account holders about the service rendered by the bank, all the
account holders in the bank constitute the population to be
investigated.

iii) Sampling frame: In order to cover the population decided upon,
there should be some list, map or other acceptable material (called
the frame) which serves as a guide to the population to be covered.
The list or map must be examined to be sure that it is reasonably
free from defects. The sampling frame will help us in the selection
of sample. All the account numbers of the savings bank account
holders in the bank are the sampling frame in the analysis of
perception of the customers regarding the service rendered by the
bank.

iv) Sampling Unit: For the purpose of sample selection, the population
should be capable of being divided up into sampling units. The
division of the population into sampling units should be
unambiguous. Every element of the population should belong to just
one sampling unit. In the bank example, each savings bank account
holder forms a sampling unit, since all the savings bank account
holders in the bank constitute the population.

v) Sample Selection: The size of the sample and the manner of
selecting the sample should be defined based on the objectives of
the statistical investigation. The estimation of the population
parameters, along with their margin of uncertainty, is an important
aspect to be considered in sample selection.

vi) Collection of data: The method of collecting the information has to
be decided, keeping in view the costs involved and the accuracy
aimed at. Physical observation, interviewing respondents and
collecting data through mail are some of the methods that can be
followed in the collection of data.

vii) Analysis of data: The collected data should be properly classified
and subjected to an appropriate analysis. The conclusions are
drawn based on the results of the analysis.

4.3 Types of Sampling


The technique of selecting a sample from a population usually depends on
the nature of the data and the type of enquiry. The procedure of
sampling may be broadly classified under the following heads:

1) Probability sampling or random sampling and

2) Non-probability sampling or non-random sampling.

4.3.1 Probability sampling

Probability sampling is a method of sampling that ensures that every unit in
the population has a known non-zero chance of being included in the sample.

The different methods of random sampling are:

(a) Simple Random Sampling (SRS)

Simple random sampling is the foundation of probability sampling. It is a
special case of probability sampling in which every unit in the population has
an equal chance of being included in a sample. Simple random sampling also
makes the selection of every possible combination of the desired number of
units equally likely. Sampling may be done with or without replacement. When
the sampling is with replacement, each unit drawn is returned to the
population before the next selection is made, so the population size remains
constant. If one wants to select n units from a population of size N without
replacement, then every possible selection of n units must have the same
probability. There are $\binom{N}{n}$ (also written NCn) possible ways to
pick n units from a population of size N, so simple random sampling
guarantees that every sample of n units has the same probability
$1/\binom{N}{n}$ of being selected.

Example

A bank wants to study the savings bank account holders' perception of
the service quality rendered over a period of one year. The bank has to
prepare a complete list of savings bank account holders, called the sampling
frame, containing, say, 500 accounts. The process then involves selecting a
sample of 50 out of the 500 and interviewing them. This could be achieved in
many ways. Two common ways are:

(1) Lottery method: Select 50 slips, without replacement, from a box
containing 500 well-shuffled slips of account numbers. This method can be
applied when the population is small enough to handle.

(2) Random numbers method: When the population size is very large, the most
practical and inexpensive method of selecting a simple random sample is to
use random number tables.
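The bank example can be sketched in a few lines of Python. This is only an illustration: the account numbers 1 to 500 are a hypothetical sampling frame, and a seeded pseudo-random generator stands in for the lottery box or the random number table.

```python
import math
import random

# Hypothetical sampling frame: account numbers 1..500 (illustrative only).
N, n = 500, 50
frame = list(range(1, N + 1))

# Number of distinct samples of size n without replacement: C(N, n).
num_samples = math.comb(N, n)
prob_each_sample = 1 / num_samples  # every possible sample is equally likely

rng = random.Random(42)  # fixed seed so the draw can be repeated

# Lottery-style draw without replacement: no account can appear twice.
srs_without = rng.sample(frame, n)

# Draw with replacement: the same account may be selected more than once.
srs_with = rng.choices(frame, k=n)

print(len(srs_without), len(set(srs_without)))
print(f"number of possible samples: {num_samples:.3e}")
```

Drawing without replacement mirrors the lottery method; the with-replacement draw is shown only to contrast the two schemes described above.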

(b) Stratified Random Sampling

Stratified sampling is a two-step process in which the population is
partitioned into sub-populations, or strata. The strata should be mutually
exclusive and collectively exhaustive in that every population element
should be assigned to one and only one stratum and no population elements
should be omitted. Next, elements are selected from each stratum by a
random procedure, usually SRS. Technically, only SRS should be employed in
selecting the elements from each stratum. In practice, sometimes systematic
sampling and other probability sampling procedures are employed. Stratified
sampling differs from quota sampling in that the sample elements are
selected probabilistically rather than based on convenience or judgment. A
major objective of stratified sampling is to increase precision without
increasing cost.

The variables used to partition the population into strata are referred
to as stratification variables. The criteria for the selection of these
variables consist of homogeneity, heterogeneity, relatedness, and cost. The
elements within a stratum should be as homogeneous as possible, but the
elements in different strata should be as heterogeneous as possible. The
stratification variables should also be closely related to the
characteristic of interest. The more closely these criteria are met, the
greater the effectiveness in controlling extraneous sampling variation.
Finally, the variables should decrease the cost of the stratification
process by being easy to measure and apply.
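The two-step process described above might be sketched as follows. The strata (account holders grouped by a hypothetical branch type) and the use of proportional allocation are assumptions made for illustration; the text does not prescribe how the sample is divided among strata.

```python
import random

def stratified_sample(strata, n, seed=0):
    """Draw a stratified random sample using proportional allocation.

    strata: dict mapping stratum name -> list of units (mutually exclusive).
    n: total sample size desired.
    """
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Proportional allocation (an assumption): a stratum's share of the
        # sample equals its share of the population. Rounding may make the
        # total differ slightly from n for awkward stratum sizes.
        k = round(n * len(units) / total)
        sample[name] = rng.sample(units, k)  # SRS within the stratum
    return sample

# Hypothetical strata: 500 account holders grouped by branch type.
strata = {
    "urban": list(range(0, 300)),
    "semi_urban": list(range(300, 450)),
    "rural": list(range(450, 500)),
}
picked = stratified_sample(strata, n=50)
sizes = {name: len(units) for name, units in picked.items()}
print(sizes)
```

Every stratum contributes to the sample, which is the feature that separates stratified sampling from cluster sampling.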

(c) Systematic Random Sampling

In systematic random sampling, the sample is chosen by selecting a random
starting point and then picking every i-th element in succession from the
sampling frame. The sampling interval, i, is determined by dividing the
population size N by the sample size n and rounding to the nearest integer.
For example, suppose there are 100,000 elements in the population and a
sample of 1,000 is desired. In this case, the sampling interval, i, is 100.
A random number between 1 and 100 is selected. If, for example, this number
is 23, the sample consists of elements 23, 123, 223, 323, 423, 523, and so
on.
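The worked example can be reproduced with a short sketch. The function name is hypothetical, and the starting point is passed in explicitly so that the result matches the text; in practice it would be drawn at random between 1 and the interval.

```python
def systematic_sample(N, n, start):
    """Return the 1-based positions chosen by systematic sampling."""
    i = round(N / n)  # sampling interval
    if not 1 <= start <= i:
        raise ValueError("start must lie between 1 and the sampling interval")
    # Take every i-th position from the starting point, at most n of them.
    return list(range(start, N + 1, i))[:n]

sample = systematic_sample(N=100_000, n=1_000, start=23)
print(sample[:6])   # [23, 123, 223, 323, 423, 523]
print(len(sample))  # 1000
```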

Systematic sampling is similar to SRS in that each population element has a
known and equal probability of selection. However, it differs from SRS in
that only the permissible samples of size n that can be drawn have a known
and equal probability of selection. The remaining samples of size n have
zero probability of being selected.

For systematic sampling, we assume that the population elements are ordered
in some respect. In some cases, the ordering is unrelated to the
characteristic of interest. Systematic sampling is a convenient way of
selecting a sample. It requires less time and cost than simple random
sampling.

(d) Cluster Random Sampling

In cluster sampling, the target population is first divided into mutually
exclusive and collectively exhaustive subpopulations, or clusters. Then a
random sample of clusters is selected, based on a probability sampling
technique such as SRS. For each selected cluster, either all the elements
are included in the sample or a sample of elements is drawn
probabilistically. If all the elements in each selected cluster are included
in the sample, the procedure is called one-stage cluster sampling. If a
sample of elements is drawn probabilistically from each selected cluster,
the procedure is two-stage cluster sampling. Furthermore, a cluster sample
can have multiple (more than two) stages, as in multistage cluster sampling.


The distinction between cluster sampling and stratified sampling is that in
cluster sampling only a sample of subpopulations (clusters) is chosen,
whereas in stratified sampling all the subpopulations (strata) are selected
for further sampling.
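One-stage and two-stage cluster sampling can be contrasted in a small sketch. The clusters here (ten hypothetical bank branches with twenty account holders each) and the function name are assumptions for illustration, and SRS is used at both stages.

```python
import random

def cluster_sample(clusters, num_clusters, per_cluster=None, seed=0):
    """One-stage (per_cluster=None) or two-stage cluster sampling.

    clusters: dict mapping cluster name -> list of elements.
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of clusters.
    chosen = rng.sample(sorted(clusters), num_clusters)
    sample = {}
    for name in chosen:
        elements = clusters[name]
        if per_cluster is None:
            sample[name] = list(elements)  # one-stage: keep every element
        else:
            # Stage 2: simple random sample within the selected cluster.
            sample[name] = rng.sample(elements, per_cluster)
    return sample

# Hypothetical clusters: 10 branches with 20 account holders each.
clusters = {f"branch_{b}": [f"b{b}_acct{a}" for a in range(20)] for b in range(10)}

one_stage = cluster_sample(clusters, num_clusters=3)
two_stage = cluster_sample(clusters, num_clusters=3, per_cluster=5)
print({name: len(units) for name, units in one_stage.items()})
print({name: len(units) for name, units in two_stage.items()})
```

Only three of the ten clusters appear in either sample, whereas a stratified design would draw from all ten.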

4.3.2 Non-Probability Sampling

The fundamental difference between probability sampling and non-probability
sampling is that in a non-probability sampling procedure, the selection of
the sample units does not ensure a known chance of selection for each unit.
In other words, the units are selected without using the principle of
probability. Even though non-probability sampling has advantages such as
reduced cost, speed and convenience in implementation, it lacks accuracy
because of selection bias. Non-probability sampling is suitable for pilot
studies and exploratory research.

The methods of non-random sampling are:

(a) Convenience sampling:

Convenience sampling attempts to obtain a sample of convenient
elements. The selection of sampling units is left primarily to the interviewer.
Often, respondents are selected because they happen to be in the right
place at the right time. Examples of convenience sampling include

(1) Use of students, church groups, and members of social organizations,

(2) mall-intercept interviews without qualifying the respondents,

(3) Department stores using charge account lists.

Convenience sampling is the least expensive and least time consuming of
all sampling techniques. The sampling units are accessible, easy to measure,
and cooperative.

(b) Judgmental sampling:

Judgmental sampling is a form of convenience sampling in which the
population elements are selected based on the judgment of the researcher.
The researcher, exercising judgment or expertise, chooses the elements to
be included in the sample because he or she believes that they are
representative of the population of interest or are otherwise appropriate.
Common examples of judgmental sampling include

(1) Test markets selected to determine the potential of a new product,

(2) Purchase engineers selected in industrial marketing research because
they are considered to be representative of the company,

(3) Expert witnesses used in court.

(c) Quota sampling:

This is a restricted type of judgmental sampling. It consists of specifying
quotas of the samples to be drawn from different groups and then drawing the
required samples from these groups by judgmental sampling. Quota sampling is
widely used in opinion and market research surveys.

4.4 SAMPLING AND NON SAMPLING ERRORS:

A sample is a part of the whole population. A sample drawn from the
population depends on chance, and as such all the characteristics of the
population may not be present in the sample. Any statistical measure of the
sample, say the mean, may not be equal to the corresponding statistical
measure of the population from which the sample has been drawn. Thus there
can be discrepancies between the statistical measures of the population,
i.e., parameters, and the statistical measures of a sample drawn from the
same population, i.e., statistics. These discrepancies are known as errors
in sampling. Errors in sampling are of two types:

(i) Sampling Errors

(ii) Non-sampling Errors

4.4.1 Sampling Errors

Sampling error is inherent in the method of sampling. Sampling depends on
chance, and because of this element of chance, sampling errors occur.
Errors in sampling arise primarily due to the following reasons:

1. Faulty selection of the sample. This may be due to the selection of a
defective sampling technique which may introduce an element of bias, e.g.,
purposive or judgmental sampling, in which the investigator deliberately
selects a non-representative sample.

2. Substitution. Sometimes an investigator, while collecting the information
from a particular sampling unit included in the random selection,
substitutes a convenient member of the population. This may lead to some
bias, as the characteristics possessed by the substituted unit may be
different from those possessed by the original unit included in the sample.

3. Faulty demarcation of sampling units.

4. Variability of the population. Sampling error may also depend on the
variability or heterogeneity of the population from which the samples are to
be drawn.

4.4.2 Non-Sampling Errors

Non-sampling errors, or bias, creep in due to human factors, which always
vary from one investigator to another. Bias may arise in the following
ways:

(i) Due to negligence and carelessness on the part of the investigator

(ii) Due to faulty planning of sampling

(iii) Due to the faulty selection of sample units

(iv) Due to incomplete investigation and sample survey

(v) Due to framing of a wrong questionnaire

(vi) Due to negligence and non-response on the part of the respondents


(vii) Due to substitution of a selected unit by another

(viii) Due to error in compilation

(ix) Due to applying wrong statistical measure.

4.5 TESTING OF HYPOTHESIS

The testing of hypothesis is a procedure that helps us to ascertain the
likelihood of hypothesized population parameter being correct by making use
of the sample statistic. In testing of hypothesis a statistic is computed from a
sample drawn from the parent population and on the basis of this statistic, it
is observed whether the sample so drawn has come from the population with
certain specified characteristic.

4.5.1 Sampling Distribution

Consider all possible samples of size ‘n’ which can be drawn from a
given population. For each sample we can compute a statistic such as mean,
standard deviation, etc. which will vary from sample to sample. The
aggregate of various values of the statistic under consideration may be
grouped into a frequency distribution. This distribution is known as sampling
distribution of the statistic. Thus the probability distribution of all the
possible values that a sample statistic can take is called the sampling
distribution of the statistic.

The sample mean and the sample proportion computed from a random sample are
examples of sample statistics.

Sampling distribution of the Mean from normal population

If x1, x2, x3, ..., xn are n independent random observations drawn from a
normal population with mean μ and standard deviation σ, then the sample mean
x̄ follows a normal distribution with mean μ and standard deviation
$\sigma/\sqrt{n}$.

It may be noted that

(i) the sample mean is
$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$.
Thus x̄ is a random variable and will be different every time a new sample
of n observations is taken.

(ii) x̄ is an unbiased estimator of the population mean μ, i.e.
E(x̄) = μ, denoted by $\mu_{\bar{x}} = \mu$.

(iii) The standard deviation of the sample mean x̄ is given by
$\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
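Properties (ii) and (iii) can be checked by listing every possible sample from a very small population. The population below is hypothetical and not normal; it is chosen only so that the enumeration stays short. The mean and standard deviation results hold regardless of the population's shape, while normality of x̄ itself is not tested here. Sampling is with replacement, so no finite population correction applies.

```python
from itertools import product
from statistics import mean, pstdev
import math

# Hypothetical small population, chosen only to make enumeration feasible.
population = [1, 2, 3, 4, 5]
n = 2

mu = mean(population)       # population mean
sigma = pstdev(population)  # population standard deviation

# Every possible ordered sample of size n drawn with replacement.
sample_means = [mean(s) for s in product(population, repeat=n)]

mean_of_means = mean(sample_means)  # property (ii): should equal mu
se_observed = pstdev(sample_means)  # spread of the sampling distribution
se_formula = sigma / math.sqrt(n)   # property (iii): sigma / sqrt(n)

print(len(sample_means), mean_of_means, se_observed, se_formula)
```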

Sampling distribution of proportions

Suppose that a population is infinite and that the probability of
occurrence of an event, say success, is P. Let Q = 1 - P denote the
probability of failure.

Consider all possible samples of size n drawn from this population. For each
sample, determine the proportion p of successes. By the central limit
theorem, if the sample size n is large, the sample proportion p
approximately follows a normal distribution with mean $\mu_p = P$ and
standard deviation $\sigma_p = \sqrt{PQ/n}$.

4.5.2 Standard Error

The standard deviation of the sampling distribution of a statistic is
called the standard error of the statistic. The standard deviation of the
distribution of the sample mean is called the standard error of the mean.
Likewise, the standard deviation of the distribution of the sample proportion
is called the standard error of the proportion.

The standard error is popularly known as sampling error. Sampling error
throws light on the precision and accuracy of the estimate. The standard
error is inversely proportional to the square root of the sample size, i.e.
the larger the sample size, the smaller the standard error.

The standard error measures the dispersion of all possible values of the
statistic in repeated samples of a fixed size from a given population. It is
used to set up confidence limits for population parameters and in tests of
significance. Thus the standard errors of the sample mean x̄ and the sample
proportion p are used to find confidence limits for the population mean μ
and the population proportion P respectively.

Statistic                Standard error                                     Remark

Sample mean (x̄)          $\dfrac{\sigma}{\sqrt{n}}$                         Population size is infinite, or
                                                                            sampling is with replacement

                         $\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$  Population size N is finite and
                                                                            sampling is without replacement

Sample proportion (p)    $\sqrt{\dfrac{PQ}{n}}$                             Population size is infinite, or
                                                                            sampling is with replacement

                         $\sqrt{\dfrac{PQ}{n}}\sqrt{\dfrac{N-n}{N-1}}$      Population size N is finite and
                                                                            sampling is without replacement
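The four standard errors in the table can be computed with two small helper functions. The function names are hypothetical; passing the population size N switches on the finite population correction factor.

```python
import math

def se_mean(sigma, n, N=None):
    """Standard error of the sample mean; pass N for a finite population."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))  # finite population correction
    return se

def se_proportion(P, n, N=None):
    """Standard error of the sample proportion; pass N for a finite population."""
    Q = 1 - P
    se = math.sqrt(P * Q / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))  # finite population correction
    return se

print(se_mean(4.40, 400))           # infinite population
print(se_mean(4.40, 400, N=2000))   # smaller, because of the correction
print(se_proportion(0.5, 100))
```

The correction factor is always at most 1, so sampling without replacement from a finite population never increases the standard error.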

4.5.3 Null and Alternative Hypothesis

Null Hypothesis: The statistical hypothesis that is set up for testing is
known as the null hypothesis. It is set up only so that, on the basis of the
sample, we can decide whether to accept or reject it. It asserts that there
is no difference between the sample statistic and the population parameter
and that whatever difference is there is attributable to sampling error. The
null hypothesis is usually denoted by H0.

Alternative Hypothesis: The negation of Null Hypothesis is called the
Alternative hypothesis. In other words, any hypothesis which is not a null
hypothesis is called alternative hypothesis. It is always denoted by H1 or Ha.
It is set in such a way that rejection of null hypothesis implies the acceptance
of alternative hypothesis.

4.5.4 Error in testing of hypothesis

For testing the hypothesis, we take a sample from the population and, on
the basis of the sample result obtained, decide whether to accept or reject
the hypothesis. Here, two types of error are possible. A null hypothesis
could be rejected when it is true. This is called a Type I error, and the
probability of committing a Type I error is denoted by α. Alternatively, an
error could result from accepting a null hypothesis when it is false. This
is known as a Type II error, and the probability of committing a Type II
error is denoted by β.

This is illustrated in the following table:

                     Statistical decision of the test

True situation       Accept H0 (H0 is true)     Reject H0 (H0 is false)

H0 is true           Correct decision           Type I error

H0 is false          Type II error              Correct decision

4.5.5 Critical Region

A region in the sample space which amounts to rejection of the null
hypothesis (H0) is called the critical region. After formulating the null and
alternative hypotheses about a population parameter, we take a sample from
the population, calculate the value of the relevant statistic, and compare
it with the hypothesized population parameter. We then have to decide the
criterion for accepting or rejecting the null hypothesis. This criterion is
given as a range of values in the form of an interval, say (a, b), so that
if the statistic value falls outside the interval, we reject the null
hypothesis. If the statistic value falls within the interval (a, b), we
accept H0. The criterion has to be decided on the basis of the level of
significance. A 5% level of significance means that, when H0 is true, 5% of
the statistic values obtained from samples will fall outside the interval
(a, b) and 95% of the values will fall within it. Thus the level of
significance is the probability of a Type I error. The levels of
significance usually employed in testing of hypothesis are 5% and 1%. The
higher the significance level chosen for testing a hypothesis, the higher
the probability of rejecting the null hypothesis when it is true.

Table of critical values Zα of Z

                       Level of significance

Critical value (Zα)    1%              5%              10%

Two tailed test        |Zα| = 2.58     |Zα| = 1.96     |Zα| = 1.645

One tailed test        |Zα| = 2.33     |Zα| = 1.645    |Zα| = 1.28
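The tabulated critical values can be recovered from the standard normal distribution. The sketch below assumes Python 3.8 or later, where `statistics.NormalDist` is available; the helper name `critical_z` is hypothetical.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Critical value of Z for significance level alpha (e.g. 0.05)."""
    # A two tailed test splits alpha equally between the two tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.01, 0.05, 0.10):
    two = critical_z(alpha)
    one = critical_z(alpha, two_tailed=False)
    print(f"alpha = {alpha:.2f}: two tailed {two:.3f}, one tailed {one:.3f}")
```

The computed values agree with the table once rounded to the same number of decimal places.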

4.5.6 Two tailed and one tailed test:


The probability curve of the sampling distribution of the test statistic is
a normal curve. In any test, the critical region is represented by a portion
of the area under this normal curve. The curve has two sides (or ends) known
as its two tails. The rejection region may be represented by a portion of
the area on each of the two sides or on only one side of the normal curve,
and correspondingly the test is known as a two tailed test (or two sided
test) or a one tailed test (or one sided test).

When the test of hypothesis is made on the basis of a rejection region
represented by both sides of the standard normal curve, it is called a two
tailed test or two sided test.

When the test of hypothesis is made on the basis of a rejection region
represented by only one side of the standard normal curve, it is called a
one tailed test or one sided test.

4.5.7 Large and small sample test

Tests of significance are classified as (a) tests of significance for large
samples and (b) tests of significance for small samples. For large sample
sizes (n > 30), distributions such as the Binomial and Poisson are
approximated by the normal distribution. Thus the normal probability curve
can be used for testing of hypothesis.

4.6 PROCEDURE FOR TESTING OF HYPOTHESIS:

The steps for testing a hypothesis are given below (for both large sample
and small sample tests).

Step 1: Null hypothesis: Set up the null hypothesis H0.

Step 2: Alternative hypothesis: Set up the alternative hypothesis H1, which
is complementary to H0 and which indicates whether a one tailed (right or
left tailed) or two tailed test is to be applied.

Step 3: Level of significance: Choose an appropriate level of significance
(α).

Step 4: Test statistic (or test criterion): Calculate the value of the test
statistic under the null hypothesis,

$Z = \dfrac{t - E(t)}{S.E.(t)}$

where t is the sample statistic.

Step 5: Critical value: Find the critical value Zα of Z at the chosen level
of significance from the table "areas under the normal curve (Zα values)" in
the case of large samples, or from the t-table, F-table or Chi-square table
in the case of small samples.

Step 6: Inference: We compare the computed value of Z (in absolute value)
with the critical value Zα/2 for a two tailed test (or Zα for a one tailed
test). If |Z| exceeds the critical value, we reject the null hypothesis H0
at the α% level of significance, and if |Z| is less than the critical value,
we accept H0 at the α% level of significance.
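Steps 4 to 6 for a large sample test can be sketched as a single function. The function name and the numbers in the usage lines are hypothetical. The sketch follows the text in comparing |Z| with the critical value, so for a one tailed test it does not check the direction of the difference; that is a simplification.

```python
from statistics import NormalDist

def z_test(t, expected, std_error, alpha=0.05, two_tailed=True):
    """Return (Z, critical value, reject H0?) for a large sample test."""
    z = (t - expected) / std_error            # Step 4: test statistic
    tail = alpha / 2 if two_tailed else alpha
    z_crit = NormalDist().inv_cdf(1 - tail)   # Step 5: critical value
    reject = abs(z) > z_crit                  # Step 6: inference
    return z, z_crit, reject

# Hypothetical figures: statistic 52, hypothesized value 50, standard error 1.
z, z_crit, reject = z_test(52, 50, 1.0)
print(f"Z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

Steps 1 to 3 (stating H0 and H1 and choosing α) remain the analyst's decisions and enter the function only through its arguments.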

4.7. LARGE SAMPLE TESTS:

4.7.1 Test for single mean:

Step 1: Setting up of a null hypothesis: There is no significant
difference between the sample mean and the population mean, or the sample has
been drawn from the parent population. H0: x̄ = μ

Step 2: Setting up of an alternative hypothesis: There is a significant
difference between the sample mean and the population mean. H1: x̄ ≠ μ

Step 3: Fixing of level of significance: α (normally it is 5%)

Step 4: Computation of test statistic:

$Z_{cal} = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

Step 5: Critical value: Find the critical value Zα at the α% level of
significance from the table "areas under the normal curve (Zα values)".

Step 6: Inference: If the modulus of the calculated value satisfies
|Zcal| ≤ Zα, where Zα is the critical value obtained in step 5, we accept the
null hypothesis at the α% level of significance. Otherwise we reject the null
hypothesis at the α% level of significance.

Now, we discuss the above with an example.

Example 4.1 A sample of 400 male students of a college is found to have a
mean height of 171.38 cm. Can it be regarded as a sample from a large
population with mean height 171.17 cm and standard deviation 4.40 cm?

Solution:

Given n = 400 (large sample)

μ = 171.17 cm; x̄ = 171.38 cm; σ = 4.40 cm

Null Hypothesis (H0): The sample has been drawn from a large population
with mean height 171.17 cm, i.e., H0: μ = 171.17 cm

Alternative Hypothesis (H1): The sample has not been drawn from a large
population with mean height 171.17 cm, i.e., H1: μ ≠ 171.17 cm

Level of significance (α): 5%

Test Statistic:

$Z_{cal} = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

$Z_{cal} = \dfrac{171.38 - 171.17}{4.40/\sqrt{400}} = \dfrac{0.21}{0.22} = 0.9545$

Critical value: At 5% level, Z0.05 = 1.96

Inference: Since the calculated value of Z is less than the critical value of
Z at the 5% level, we accept the null hypothesis and conclude that the
sample has been drawn from a large population with mean height 171.17 cm.

Example 4.2 The mean lifetime of 100 fluorescent light bulbs produced by a
company is computed to be 1570 hours with a standard deviation of 120
hours. If μ is the mean lifetime of all the bulbs produced by the company,
test the hypothesis μ = 1600 hours against the alternative hypothesis
μ ≠ 1600 hours using a 5% level of significance.

Solution:

We are given

x̄ = 1570 hrs, μ = 1600 hrs, σ = s = 120 hrs, n = 100

Null Hypothesis (H0): μ = 1600, i.e., there is no significant difference
between the sample mean and the population mean.

Alternative Hypothesis (H1): μ ≠ 1600 (two tailed alternative), i.e., there
is a significant difference between the sample mean and the population mean.

Level of Significance (α): 5%

Test Statistic:

$Z_{cal} = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

$Z_{cal} = \dfrac{1570 - 1600}{120/\sqrt{100}} = -2.5$

|Zcal| = 2.5

Critical value: At 5% level, Z0.05 = 1.96

Inference: Since the calculated value is greater than the critical value of Z
at the 5% level, we reject the null hypothesis and conclude that there is
a significant difference between the sample mean and the population mean.
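The arithmetic of Examples 4.1 and 4.2 can be verified with a short script. The function name is hypothetical, the critical value 1.96 is the two tailed 5% value from the table of critical values, and Example 4.1 uses the standard deviation 4.40 cm given in the problem statement.

```python
import math

def single_mean_z(x_bar, mu, sigma, n):
    """Test statistic for a single mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

Z_CRIT_5PCT_TWO_TAILED = 1.96  # from the table of critical values

z1 = single_mean_z(171.38, 171.17, 4.40, 400)  # Example 4.1
z2 = single_mean_z(1570, 1600, 120, 100)       # Example 4.2

for label, z in (("Example 4.1", z1), ("Example 4.2", z2)):
    decision = "reject H0" if abs(z) > Z_CRIT_5PCT_TWO_TAILED else "accept H0"
    print(f"{label}: Z = {z:.4f}, {decision}")
```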

Self – Assessment Question

1. A random sample of 900 members has a mean of 3.4 cm and S.D. 2.61
cm. Is the sample from a large population of mean 3.25 cm and S.D.
2.61 cm?

n = 900, x̄ = 3.4, μ = 3.25, σ = 2.61, |Zcal| = 1.724

[Hint: H0 is accepted at 5% level]

2. A random sample of size 400 is drawn and the sample mean is found
to be 99. Test whether the sample could have come from a normal
population with mean 100 and standard deviation 8 at the 5% level.

n = 400, x̄ = 99, μ = 100, σ = 8, |Zcal| = 2.5

[Hint: H0 is rejected at 5% level]

4.7.2 Test for difference of two means:

Working Rule:

Step 1: Setting up of a null hypothesis: The two samples have been drawn
from two different populations having the same mean and equal standard
deviations. H0: μ1 = μ2

Step 2: Setting up of an alternative hypothesis: The two samples have been
drawn from two populations having different means.
H1: μ1 ≠ μ2 (two tailed test), or H1: μ1 < μ2 (one tailed test), or
H1: μ1 > μ2 (one tailed test).

Step 3: Fixing the level of significance: α (normally it is 5%)

Step 4: Computation of test statistic:

$Z_{cal} = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$, if the population s.d.'s are known

$Z_{cal} = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$, if the population s.d.'s are not known.

Step 5: Critical value: Find the critical value Zα at the α% level of
significance from the table "areas under the normal curve (Zα values)".

Step 6: Inference: If the modulus of the calculated value, |Zcal|, is less
than or equal to the critical value Zα obtained in step 5, we accept the
null hypothesis at the α% level of significance. Otherwise we reject the
null hypothesis at the α% level of significance.

Example 4.3: A college conducts both day and evening classes intended to
be identical. A sample of 100 day students yields examination results
x̄1 = 72.4 and σ1 = 14.8. A sample of 200 evening students yields
examination results x̄2 = 73.9 and σ2 = 17.9. Are the two means
statistically equal at 5% level?

Solution:

We are given

n1 = 100, x̄1 = 72.4 and σ1 = 14.8; n2 = 200, x̄2 = 73.9 and σ2 = 17.9

Null Hypothesis (H0): μ1 = μ2, i.e., the two means are statistically equal.

Alternative Hypothesis (H1): μ1 ≠μ2 (Two tailed test) i.e., the two means
are not statistically equal.


Level of Significance (α): 5%

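Test Statistic:

$Z_{cal} = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \dfrac{72.4 - 73.9}{\sqrt{\dfrac{(14.8)^2}{100} + \dfrac{(17.9)^2}{200}}} = \dfrac{-1.5}{\sqrt{3.7925}} = \dfrac{-1.5}{1.947} = -0.77$

$|Z_{cal}| = 0.77$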

Critical value: At 5% level, Z0.05 = 1.96

Inference: Since the calculated value |Zcal| = 0.77 is less than the critical value of
Z at 5% level, we accept the null hypothesis and conclude that the
two means are statistically equal.

Example 4.4 A random sample of 1000 workers from South India shows that
their mean wages are Rs. 47 per week with a standard deviation of Rs. 28. A
random sample of 1500 workers from North India gives a mean wage of Rs. 49
per week with a standard deviation of Rs. 40. Is there any significant
difference between their mean levels of wages?

Solution

We are given n1 = 1000, x̄1 = 47 and s1 = 28; n2 = 1500, x̄2 = 49 and
s2 = 40

Null Hypothesis (H0): μ1 = μ2, i.e., there is no significant difference
between their mean levels of wages.

Alternative Hypothesis (H1): H1: μ1≠μ2 (Two tailed test) i.e., there is a
significant difference between their mean level of wages.

Level of significance (α): 5%

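Test Statistic:

$Z_{cal} = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{47 - 49}{\sqrt{\dfrac{(28)^2}{1000} + \dfrac{(40)^2}{1500}}} = \dfrac{-2}{\sqrt{1.8507}} = \dfrac{-2}{1.3604} = -1.47$

$|Z_{cal}| = 1.47$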


Critical value: At 5% level, Z0.05 = 1.96

Inference: Since the calculated value |Zcal| = 1.47 is less than the critical value of
Z at 5% level, we accept the null hypothesis and conclude that there is no
significant difference between the mean levels of wages.
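
The test for the difference of two means can be reproduced with a few lines of code. This is a minimal sketch under the large-sample assumption used above; the function name `z_two_means` is hypothetical and not from the original notes.

```python
import math

def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for the difference of two means (large samples, hypothetical helper)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Figures from Example 4.4 (South India vs North India wages)
z_wages = z_two_means(47, 28, 1000, 49, 40, 1500)
accept_h0 = abs(z_wages) <= 1.96  # two-tailed critical value at the 5% level
print(round(z_wages, 2), accept_h0)
```

The same function applies whether the standard deviations are population values (σ) or sample values (s), since the formula has the same form in both cases.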

Self-Assessment Question

1. In a survey of buying habits, 400 women shoppers are chosen at
random in super market 'A' located in a certain section of the city. Their
average weekly food expenditure is Rs. 250 with a standard deviation
of Rs. 40. For 400 women shoppers chosen at random in super market
'B' in another section of the city, the average weekly food expenditure
is Rs. 220 with a standard deviation of Rs. 55. Test at 5% level of
significance whether the average weekly food expenditures of the two
populations of shoppers are equal.
[Hint: n1 = 400, x̄1 = 250, s1 = 40, n2 = 400, x̄2 = 220 and s2 = 55, |Zcal| = 8.82]

2. Random samples drawn from two countries gave the following data
relating to the heights of adult males:

                           Country A    Country B
Mean height in inches        67.42        67.25
Standard deviation            2.58         2.50
Number in sample              1000         1200

Is the difference between the means significant?

[Hint: n1 = 1000, x̄1 = 67.42, s1 = 2.58, n2 = 1200, x̄2 = 67.25 and s2 = 2.50, |Zcal| = 1.56]

4.7.3 Test for single proportion

Working Rule:


Step 1: Setting up of Null Hypothesis. The sample has been drawn from
a population with proportion P0, i.e., H0: P = P0.

Step 2: Setting up of Alternative Hypothesis. The sample has not been
drawn from a population with proportion P0, i.e., H1: P ≠ P0.

Step 3: Fixing of level of significance: α (normally it is 5%)

Step 4: Computation of test statistic:

$Z_{cal} = \dfrac{p - P}{\sqrt{PQ/n}}$, where p is the sample proportion, P is the population proportion under H0 and Q = 1 − P.

Step 5: Critical Value: Find the critical value Zα at α% level of
significance, from the table "areas under the normal curve Zα-values".

Step 6: Inference: If |Zcal| ≤ Zα, where Zα is the critical value
obtained in step 5, we accept the null hypothesis at α% level of significance.
Otherwise we reject the null hypothesis at α% level of significance.

Now, we discuss the above with an example.

Example 4.5 In a sample of 1000 people in Karnataka, 540 are rice eaters
and the rest are wheat eaters. Can we assume that rice and wheat
are equally popular in this state at 1% level of significance?

Solution:

Given n = 1000; p = 540/1000 = 0.54; P = 0.5; Q = 1 − P = 1 − 0.5 = 0.5

Null Hypothesis (H0): The sample has been drawn from a population with
proportion P, i.e., H0: P=0.5

Alternative Hypothesis (H1): The sample has not been drawn from a
population with proportion P, i.e., H1: P≠0.5.

Level of significance (α): 1%

Test Statistic:

$Z_{cal} = \dfrac{p - P}{\sqrt{PQ/n}} = \dfrac{0.54 - 0.5}{\sqrt{0.5(0.5)/1000}} = \dfrac{0.04}{0.0158} = 2.53$

$|Z_{cal}| = 2.53$

Critical Value: At 1% level, Z0.01=2.58


Inference: Since the calculated value |Zcal| = 2.53 is less than the critical value of
Z at 1% level, we accept the null hypothesis and conclude that the
sample has been drawn from a population with proportion P = 0.5, i.e., rice and wheat are equally popular.
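
The single-proportion test can be sketched in code as follows. The function name `z_single_proportion` is an illustrative assumption; the critical value 2.58 is the two-tailed 1% value quoted in the example.

```python
import math

def z_single_proportion(successes, n, p0):
    """Z statistic for a single proportion against the hypothesised value p0."""
    p_hat = successes / n                          # sample proportion p
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # sqrt(PQ/n)
    return (p_hat - p0) / standard_error

# Figures from Example 4.5 (540 rice eaters out of 1000, H0: P = 0.5)
z_rice = z_single_proportion(540, 1000, 0.5)
accept_h0 = abs(z_rice) <= 2.58  # two-tailed critical value at the 1% level
print(round(z_rice, 2), accept_h0)
```

Note that the standard error is computed from the hypothesised proportion P, not from the sample proportion p, because the test statistic is evaluated under H0.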

Self – assessment Question

1. In a random sample of 400 persons from a large population, 120 are
female. Can it be said that males and females are in the ratio 5:3 in the
population? Use 5% level of significance.
[Hint: n = 400, p = 120/400 = 0.3; P = 3/8 = 0.375; Q = 1 − P = 1 − 0.375 = 0.625;
|Zcal| = 3.125]

4.7.4 Test for two proportions:

Working Rule:

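Step 1: Setting up of a Null Hypothesis. The two samples have been
drawn from the same population, i.e., H0: P1 = P2.

Step 2: Setting up of an Alternative Hypothesis. The two samples have
not been drawn from the same population, i.e., H1: P1 ≠ P2.

Step 3: Fixing of level of significance: α (normally it is 5%)

Step 4: Computation of test statistic:

$Z_{cal} = \dfrac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ ; where $P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$ and $Q = 1 - P$

Step 5: Critical value: Find the critical value Zα at α% level of significance,
from the table "areas under the normal curve Zα-values".

Step 6: Inference: If |Zcal| ≤ Zα, we accept the null hypothesis at α% level
of significance. Otherwise we reject the null hypothesis at α% level of significance.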

Example 4.6 In a sample of 600 men from a certain city, 450 are found to
be smokers. In a sample of 900 men from another city, 450 are found to be
smokers. Do the data indicate that the two cities are significantly different
with respect to prevalence of smoking habits among men?

Solution:

Given n1 = 600; n2 = 900; p1 = 450/600 = 0.75; p2 = 450/900 = 0.5;

$P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{600(0.75) + 900(0.5)}{600 + 900} = 0.6$

Q = 1 − P = 1 − 0.6 = 0.4

Null Hypothesis (H0): The two samples have been drawn from same
population, i.e., H0: P1 = P2

Alternative Hypothesis (H1): The two samples have not been drawn from
the same population, i.e., H1: P1 ≠ P2.

Level of significance (α): 5%

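Test Statistic:

$Z_{cal} = \dfrac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{0.75 - 0.5}{\sqrt{0.6(0.4)\left(\dfrac{1}{600} + \dfrac{1}{900}\right)}} = \dfrac{0.25}{0.0258} = 9.7$

$|Z_{cal}| = 9.7$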

Critical Value: At 5% level, Z0.05=1.96

Inference: Since the calculated value |Zcal| = 9.7 is greater than the critical
value of Z at 5% level, we reject the null hypothesis and conclude
that the two samples have not been drawn from the same population.
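
The pooled two-proportion test can be checked with a short sketch. The function name `z_two_proportions` is hypothetical; it takes the raw counts, so the pooled proportion P is (x1 + x2)/(n1 + n2), which equals (n1 p1 + n2 p2)/(n1 + n2).

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """Z statistic for the difference of two proportions using the pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # P = (n1*p1 + n2*p2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

# Figures from Example 4.6 (smokers in two cities)
z_smokers = z_two_proportions(450, 600, 450, 900)
reject_h0 = abs(z_smokers) > 1.96  # two-tailed critical value at the 5% level
print(round(z_smokers, 2), reject_h0)
```

Carrying more decimal places than the hand calculation gives about 9.68, which rounds to the 9.7 reported above.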

Example 4.7 A machine puts out 16 imperfect articles in a sample of 500.
After the machine is overhauled, it puts out 3 imperfect articles in a batch of
100. Has the machine improved?

Solution:

Given n1 = 500; n2 = 100; p1 = 16/500 = 0.032; p2 = 3/100 = 0.03;

$P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{500(0.032) + 100(0.03)}{500 + 100} = \dfrac{19}{600} = 0.0317$

Q = 1 − P = 1 − 0.0317 = 0.9683

Null Hypothesis (H0): P1 = P2.

Alternative Hypothesis (H1): P1 > P2 (one tailed test)

Level of significance (α): 5%

Test Statistic:

$Z_{cal} = \dfrac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{0.032 - 0.03}{\sqrt{0.0317(0.9683)\left(\dfrac{1}{500} + \dfrac{1}{100}\right)}} = \dfrac{0.002}{0.0192} = 0.104$

$|Z_{cal}| = 0.104$

Critical Value: At 5% level, Z0.05 = 1.645

Inference: Since the calculated value of |Zcal| is less than the critical value of Z at
5% level, we accept the null hypothesis and conclude that there is no
improvement after overhauling.

Self Assessment Question

1. In random samples of 600 and 1000 men from two cities, 400 and
600 men respectively are found to be literate. Do the data indicate at 5% level of
significance that the populations are significantly different in the
percentage of literacy?
[Hint: n1 = 600, n2 = 1000, p1 = 400/600 = 0.67, p2 = 600/1000 = 0.6;
P = 0.625; Q = 0.375; |Zcal| = 2.67]

2. Before an increase in excise duty on tea, 400 people out of a sample of
500 persons were found to be tea drinkers. After an increase in the
duty, 400 persons were found to be tea drinkers in a sample of 600
people. Do you think that there has been a significant decrease in the
consumption of tea after the increase in the excise duty?
[Hint: n1 = 500, n2 = 600, p1 = 400/500 = 0.80, p2 = 400/600 = 0.67;
P = 0.73; Q = 0.27; |Zcal| = 4.81; H0: P1 = P2; H1: P1 > P2 (one tailed test)]

4.8 ANALYSIS OF VARIANCE:

In many statistical studies a variable of interest, called the response
variable (or dependent variable), is identified. Data are then collected
that tell us how one or more factors (or independent variables)
influence the variable of interest. If we cannot control the factor(s) being
studied, we say that the data obtained are observational. For example,
suppose that in order to study how the size of a home relates to the sale
price of the home, a real estate agent randomly selects 50 recently sold
homes and records the square footages and sale prices of these homes.
Because the real estate agent cannot control the sizes of the randomly
selected homes, we say that the data are observational.


If we can control the factors being studied, we say that the data are
experimental. Furthermore, in this case the values, or levels, of the factor (or
combination of factors) are called treatments. The purpose of most
experiments is to compare and estimate the effects of the different
treatments on the response variable. For example, suppose that an oil
company wishes to study how three different gasoline types (A, B and C)
affect the mileage obtained by a popular midsized automobile model. Here
the response variable is gasoline mileage and the company will study a
single factor, gasoline type. Since the oil company can control which gasoline
type is used in the midsized automobile, the data that the oil company will
collect are experimental. Furthermore, the treatments (the levels of the
factor gasoline type) are gasoline types A, B and C.

In order to collect data in an experiment, the different treatments are
assigned to objects (people, cars, animals or the like) that are called
experimental units. For example, in the gasoline mileage situation, gasoline types
A, B and C will be compared by conducting mileage tests using a midsized
automobile. The automobiles used in the tests are the experimental units.

Definition:

According to R.A. Fisher, Analysis of Variance (ANOVA) is the "Separation of
Variance ascribable to one group of causes from the variance ascribable to
other group". By this technique the total variation in the sample data is
expressed as the sum of its nonnegative components, where each of these
components is a measure of the variation due to some specific independent
source or factor or cause.

4.8.1 Assumptions:

For the validity of the F-test in ANOVA the following assumptions are
made
(i) The observations are independent.
(ii) Parent population from which the observations are taken is
normal and
(iii) Various treatment and environmental effects are additive in
nature.

4.8.2 One Way Classification

Let us suppose that N observations xij (i = 1, 2, …, k; j = 1, 2, …, ni) of a
random variable X are grouped, on some basis, into k classes of sizes n1,
n2, …, nk respectively ($N = \sum_{i=1}^{k} n_i$), as exhibited below:

Class    Observations                 Mean    Total
1        x11  x12  . . .  x1n1        x̄1      T1
2        x21  x22  . . .  x2n2        x̄2      T2
. . .
i        xi1  xi2  . . .  xini        x̄i      Ti
. . .
k        xk1  xk2  . . .  xknk        x̄k      Tk
                          Grand Total         G

The total variation in the observation xij can be split into the following two
components:

(i) The variation between the classes, or the variation due to
different bases of classification, commonly known as
treatments.
(ii) The variation within the classes, i.e., the inherent variation of
the random variable within the observations of a class.

The first type of variation is due to assignable causes which can be detected
and controlled by human endeavor, and the second type of variation is due to
chance causes which are beyond the control of the human hand.

In particular, let us consider the effect of k different rations on the yield of milk
of N cows (of the same breed and stock) divided into k classes of sizes n1, n2,
…, nk respectively ($N = \sum_{i=1}^{k} n_i$). Here the sources of variation are

(i) Effect of rations
(ii) Error due to chance causes, produced by numerous causes that
are not detected and identified.

Test Procedure:


The steps involved in carrying out the analysis are:

1) Null Hypothesis (H0): The first step is to set up a null hypothesis
H0: μ1 = μ2 = … = μk
2) Alternative Hypothesis (H1): Not all μi's are equal (i = 1, 2, …, k)
3) Level of significance: Let α = 0.05
4) Test statistic:

Various sums of squares are obtained as follows:

a) Find the sum of values of all the (N) items of the given data. Let this
grand total be represented by 'G'.
Then Correction Factor (C.F.) = $G^2/N$
b) Find the sum of squares of all the individual items (xij) and then the
Total sum of squares (TSS) = $\sum_i \sum_j x_{ij}^2 - C.F.$
c) Find the sum of squares of all the class totals (or each treatment
total) Ti (i = 1, 2, …, k) and then the sum of squares between the
classes or between the treatments (SST) is SST = $\sum_{i=1}^{k} \dfrac{T_i^2}{n_i} - C.F.$,
where ni (i = 1, 2, …, k) is the number of observations in the ith class
or the number of observations received by the ith treatment.
d) Find the sum of squares within the classes or sum of squares due to
error (SSE) by subtraction: SSE = TSS − SST

1) Degrees of freedom (d.f.): The degrees of freedom for the total sum of
squares (TSS) is (N−1), for SST is (k−1), and for SSE is (N−k).
2) Mean sum of squares: The mean sum of squares for treatments is
MST = SST/(k−1) and the mean sum of squares for error is MSE = SSE/(N−k).
3) ANOVA Table: The above sums of squares together with their
respective degrees of freedom and mean sums of squares are
summarized in the following table.

ANOVA Table for one-way classification

Sources of Variation    d.f.    S.S.    M.S.S.               F-ratio
Between Treatments      k-1     SST     MST = SST/(k-1)      F = MST/MSE
Error                   N-k     SSE     MSE = SSE/(N-k)
Total                   N-1

Calculation of variance ratio

The variance ratio F is the ratio of the greater variance to the smaller
variance, thus

$F = \dfrac{\text{Variation between the treatments}}{\text{Variation within the treatments}} = \dfrac{MST}{MSE}$

If the variance within the treatments is more than the variance between the
treatments, then the numerator and denominator should be interchanged and the
degrees of freedom adjusted accordingly.

4) Critical Value of F or Table value of F:
The critical value of F or table value of F is obtained from the F table for
(k−1, N−k) d.f. at 5% level of significance.

5) Inference:
If calculated F value is less than table value of F, we may accept our
null hypothesis H0 and say that there is no significant difference
between treatments. If Calculated F value is greater than table value of
F, we reject our H0 and say that the difference between treatments is
significant.

Example 4.8

The following table gives the yields on 15 sample plots under three varieties
of seed

A: 20 21 23 16 20
B: 18 20 17 15 25
C: 25 28 22 28 32

Prepare an analysis of variance table

Solution:

Null Hypothesis (H0): μ1 = μ2 = μ3 (i.e., the various varieties of seeds are
homogeneous)

Alternative Hypothesis (H1): Not all of μ1, μ2, μ3 are equal (i.e., the various varieties of seeds
are not homogeneous)

Level of Significance(α):0.05

Test Statistic:

Variety    Observations            Total (Ti)    Square of total (Ti²)
A          20  21  23  16  20      100           10000
B          18  20  17  15  25       95            9025
C          25  28  22  28  32      135           18225
Grand Total                        330

Squares:

Variety    Squared observations           Total
A          400  441  529  256  400        2026
B          324  400  289  225  625        1863
C          625  784  484  784  1024       3701
Total                                     7590

Correction Factor (C.F.) = $\dfrac{G^2}{N} = \dfrac{330^2}{15} = 7260$

Total sum of squares (TSS) is TSS = $\sum\sum x_{ij}^2 - C.F. = 7590 - 7260 = 330$

Sum of squares between the classes or between the treatments (SST) is

SST = $\sum_{i=1}^{k} \dfrac{T_i^2}{n_i} - C.F. = \left(\dfrac{100^2}{5} + \dfrac{95^2}{5} + \dfrac{135^2}{5}\right) - 7260 = 7450 - 7260 = 190$

Sum of squares due to error (SSE) = TSS − SST = 330 − 190 = 140

ANOVA Table for one-way classification

Source of Variation     d.f.         S.S.    M.S.S.             F-ratio
Between treatments      3-1 = 2      190     190/2 = 95         95/11.667 = 8.142
Error                   15-3 = 12    140     140/12 = 11.667
Total                   15-1 = 14    330

Table Value: The table value of F for (2, 12) d.f. at 5% level of significance is
3.89 (from the F-table).

Inference: Since the calculated F is greater than the table value of F, we reject
our H0 and say that the various varieties of seeds are not homogeneous.
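
The correction-factor procedure above lends itself to a short program. The following is a minimal sketch of that procedure, assuming each treatment is passed as a list of observations; the function name `one_way_anova` and the dictionary keys are illustrative choices, not from the original notes.

```python
def one_way_anova(groups):
    """One-way ANOVA by the correction-factor method described above."""
    all_values = [x for group in groups for x in group]
    n_total = len(all_values)  # N
    k = len(groups)            # number of treatments
    grand_total = sum(all_values)                              # G
    cf = grand_total ** 2 / n_total                            # C.F. = G^2 / N
    tss = sum(x * x for x in all_values) - cf                  # total sum of squares
    sst = sum(sum(g) ** 2 / len(g) for g in groups) - cf       # between treatments
    sse = tss - sst                                            # error, by subtraction
    mst = sst / (k - 1)
    mse = sse / (n_total - k)
    return {"CF": cf, "TSS": tss, "SST": sst, "SSE": sse,
            "df": (k - 1, n_total - k), "MST": mst, "MSE": mse, "F": mst / mse}

# Yields of the three seed varieties from the example above
seed_yields = [
    [20, 21, 23, 16, 20],  # A
    [18, 20, 17, 15, 25],  # B
    [25, 28, 22, 28, 32],  # C
]
result = one_way_anova(seed_yields)
print(result["SST"], result["SSE"], round(result["F"], 3))
```

The computed F is then compared by hand with the table value for the degrees of freedom in `result["df"]`; the sketch does not look up critical values itself.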

Self – Assessment Question

1. Three processes A, B and C are tested to see whether their outputs are
equivalent.
The following observations of output are made:
A   10  12  13  11  10  14  15  13
B    9  11  10  12  13
C   11  10  15  14  12  13

Carry out the analysis of variance and state your conclusion

Hint:

Sources of Variation    d.f.         S.S.    M.S.S.           F-ratio
Between treatments      3-1 = 2      7       7/2 = 3.5        3.5/3.19 = 1.097
Error                   19-3 = 16    51      51/16 = 3.19
Total                   19-1 = 18    58

Table value of F for (2, 16) d.f. at 5% level of significance is 3.63

Chapter Summary

This chapter has explained different types of sampling. First we
discussed the need for sampling and the elements of a sampling plan. We continued by
discussing sampling and non-sampling errors and testing of hypothesis. We
saw that inferences can be made with both large sample and small sample tests.
We learned the procedure for testing of hypothesis for a single
mean, difference of two means, single proportion and difference of two
proportions under large sample tests. To conclude this chapter, we explained
how to test homogeneity using one way ANOVA.


UNIT – V

CORRELATION AND REGRESSION ANALYSIS

5.0 OBJECTIVES
5.1 MEANING OF CORRELATION
5.2 TYPES OF CORRELATION
5.2.1 Positive and Negative Correlation
5.2.2 Linear and Non-linear Correlation
5.3 MEASUREMENT TECHNIQUES OF CORRELATION COEFFICIENT
5.3.1 Scatter Diagram
5.3.2 Karl Pearson’s Coefficient of Correlation
5.3.3 Spearman’s Rank Correlation
Ranks are given directly
Non-repeated ranks
Repeated ranks
5.4 PROPERTIES OF CORRELATION COEFFICIENT
5.5 MEANING OF REGRESSION
5.6 TYPES OF REGRESSION LINES
5.6.1 Regression lines of X on Y
5.6.2 Regression line of Y on X
5.7 CONSTRUCTION OF REGRESSION EQUATIONS
5.8 PROPERTIES OF REGRESSION COEFFICIENTS
5.9 DIFFERENCES BETWEEN CORRELATION AND REGRESSION
5.10 APPLICATIONS OF REGRESSION ANALYSIS


INTRODUCTION

In this unit you will learn the concepts of correlation and
regression, and the various methods
of obtaining correlation coefficients, the rank correlation coefficient,
regression equations etc. This unit also explains the differences between
correlation and regression. The techniques
discussed in this unit are easier to understand with the help of a calculator. Try out the example
problems with the calculator.

5.1 MEANING OF CORRELATION

We are familiar with the fact that a change in one factor, say the amount of
rainfall, affects another factor, say the yield of rice. This means
that there exists some kind of relationship between the two factors. Thus
correlation is the relationship between two factors.

In simple words, correlation means "the degree of relationship
between two or more factors". Another example is the relationship that exists
between price and demand.

5.2 TYPES OF CORRELATION

There are different types of correlation. They can be classified into the
following categories.

a) Positive and Negative degree correlation
b) Linear and Non-linear correlation

First we will discuss positive and negative degree correlation.

5.2.1 Positive and Negative degree correlation

If the changes in the factors are in the same direction then the
correlation is said to be "Positive degree correlation". The relationship between
the amount of rainfall and the yield of rice is an example of positive degree
correlation. If the rainfall level increases then the yield of rice also increases
and vice-versa.

If the changes in the factors are in opposite directions then the
correlation is said to be "Negative degree correlation". The relationship between
price and demand is an example: as the price of a commodity rises, its demand
generally falls, and vice-versa.

Now, we will discuss the linear and non-linear correlation.


5.2.2 Linear and Non-linear Correlation

If the changes in the factors are in a constant ratio then the
correlation is said to be "Linear correlation".

For example

Amount of rainfall (in mm)    40     60     80     100    120
Yield of rice                 100    150    200    250    300

From the above example, it can be observed that the amount of rainfall
increases by 20 mm at each level and the yield of rice increases by 50
tonnes at each level.

If the changes in the factors are not in the constant ratio then the
correlation is said to be “Non-linear correlation”.

For example

Factor 1    40     60     80     100    120
Factor 2    100    150    200    250    300

From the above example, it can be observed that the changes at
various levels are different.

Self Assessment Question

State the different types of correlation with example in the space given
below. Limit your answer in about 80 words.


_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
________________________________

5.3.1 Scatter Diagram

If the values of two variables or factors, say X and Y, are plotted in the XY-plane,
the diagram so obtained is called a scatter diagram. The
greater the scatter of the plotted points on the diagram, the weaker is the
relationship between the two variables or factors.

1. If all the points lie on a straight line rising from the lower left-hand
corner to the upper right-hand corner, the correlation is said to be
perfectly positive, i.e., the correlation coefficient r = +1 (Fig 5.1).

Figure 5.1 r = +1

2. If all the points lie on a straight line falling from the upper left-hand corner
to the lower right-hand corner, the correlation is said to be perfectly
negative, i.e., the correlation coefficient r = -1 (Fig 5.2).

Figure 5.2 r = -1

3. If the points fall in a narrow band and
show a rising tendency from the lower left-hand corner to the upper right-hand
corner, there is a high degree of positive correlation (Fig 5.3).

Figure 5.3 r close to +1

4. If the points fall in a narrow band and
show a declining tendency from the upper left-hand corner to the lower
right-hand corner, there is a high degree of negative correlation
(Fig 5.4).

Figure 5.4 r close to -1

5. If the points fall in a wide band and
show a rising tendency from the lower left-hand corner to the upper
right-hand corner, there is a low degree of positive correlation
(Fig 5.5).

Figure 5.5 r > 0

6. If the points fall in a wide band and show a
declining tendency from the upper left-hand corner to the lower right-hand
corner, there is a low degree of negative correlation (Fig 5.6).

Figure 5.6 r < 0

7. If the plotted points lie on a straight line parallel to the x-axis or are scattered in a
haphazard manner, it shows the absence of correlation between the two
factors (Fig 5.7).

Figure 5.7 r = 0


5.3.2 KARL PEARSON’S COEFFICIENT OF CORRELATION

As a measure of the degree of linear relationship between two variables,
Karl Pearson developed a formula called the correlation coefficient. The
correlation coefficient between two variables X and Y, usually denoted by rxy, is
defined as

$r_{xy} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} = \dfrac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}}$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

Working Procedure

Step 1: Denote one series by X and other by Y

Step 2: Calculate the means X̄ and Ȳ of the X and Y series respectively, using the
formula

$\bar{X} = \dfrac{\sum X}{n}$ and $\bar{Y} = \dfrac{\sum Y}{n}$

Step 3: Take the deviations of the observations in the X series from X̄ and
write them under the column headed by x = X − X̄. Take the deviations of the
observations in the Y series from Ȳ and write them under the column headed by y = Y − Ȳ.

Step 4: Multiply the respective deviations and write the products under the column
headed by xy.

Step 5: Square the deviations obtained in step 3 for the X and Y series and write
them under the columns headed by x² and y².

Step 6: Apply the following formula to calculate the correlation coefficient
(r).

$r_{xy} = \dfrac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}}$

Example 5.1 Find the coefficient of correlation between height of brothers
and sisters from the following data


Height of Brothers (in cm)    65   66   67   68   69   70   71
Height of Sisters (in cm)     67   68   66   69   72   72   69

Solution: Let the heights of Brothers be denoted by X and that of Sisters by


Y. Let us prepare the following table

X       x = X − X̄    Y       y = Y − Ȳ    xy     x²     y²
65      -3           67      -2           6      9      4
66      -2           68      -1           2      4      1
67      -1           66      -3           3      1      9
68       0           69       0           0      0      0
69       1           72      +3          +3      1      9
70       2           72      +3          +6      4      9
71       3           69       0           0      9      0
476      -           483      -           20     28     32

From the above table.

n = 7; ∑X = 476; ∑Y = 483; ∑xy = 20; ∑x² = 28; ∑y² = 32

$\bar{X} = \dfrac{\sum X}{n} = \dfrac{476}{7} = 68$

$\bar{Y} = \dfrac{\sum Y}{n} = \dfrac{483}{7} = 69$

Karl Pearson’s Coefficient of Correlation is now calculated as follows:

$r = \dfrac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}} = \dfrac{20}{\sqrt{28 \times 32}} = \dfrac{20}{(5.2915)(5.6569)} = \dfrac{20}{29.9335}$

r = 0.6681
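
The direct method can be sketched in code to check the table arithmetic. This is an illustrative sketch; the function name `pearson_r` is an assumption, and the data are the heights from Example 5.1.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's coefficient by the direct (deviation from mean) method."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]  # x = X - X-bar
    dev_y = [y - mean_y for y in ys]  # y = Y - Y-bar
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_x2 = sum(a * a for a in dev_x)
    sum_y2 = sum(b * b for b in dev_y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

brothers = [65, 66, 67, 68, 69, 70, 71]
sisters = [67, 68, 66, 69, 72, 72, 69]
r = pearson_r(brothers, sisters)
print(round(r, 4))
```

For these data the sums are ∑xy = 20, ∑x² = 28 and ∑y² = 32, so r = 20/√896, roughly 0.668.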

Self – Assessment Question

Calculate the correlation coefficient between the height of sister and height
of the brothers from the given data:

Height of Sisters (in cm)     64   65   66   67   68   69   70
Height of Brothers (in cm)    66   67   65   68   70   68   72

[Hint: X̄ = 67, Ȳ = 68, ∑x² = 28, ∑y² = 34, ∑xy = 25, r = 0.81]

Short Cut Method

The above direct method for calculating 'r' is not convenient when (i)
the terms of the series X and Y are large and the calculation of X̄ and Ȳ
becomes difficult, or (ii) the means of X or Y are not integers. In these cases
we apply the following formula based on assumed means:

$r_{xy} = \dfrac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$

where,

dx = X − A, A is the assumed mean of the X series

dy = Y − B, B is the assumed mean of the Y series

n is the number of pairs of observations of X and Y

Working Procedure

Step 1: Denote one series by X and the other by Y.

Step 2: Take any term ‘A’ as assumed mean of X series and ‘B’ as assumed
mean of Y series (preferably the middle one).


Step 3: Take the deviations of the observations in the X series from A and write
them under the column headed by dx = X − A. Take the deviations of the
observations in the Y series from B and write them under the column headed by dy =
Y − B.

Step 4: Multiply the respective deviations and write the products under the column
headed by dxdy.

Step 5: Square the deviations obtained in step 3 for the X and Y series and write
them under the columns headed by dx² and dy².

Step 6: Apply the following formula to calculate the correlation coefficient
(r).

$r_{xy} = \dfrac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$

Example 5.2: Calculate the coefficient of correlation for the following pairs
of values of X and Y.

X 17 19 21 26 20 28 26 29

Y 23 27 25 26 27 25 30 33

Solution:

Let the assumed means for X and Y be 23 and 27 respectively, so that dx =
X − 23 and dy = Y − 27.

We have the following table

X       Y       dx = X − 23    dy = Y − 27    dxdy    dx²    dy²
17      23      -6             -4             24      36     16
19      27      -4              0              0      16      0
21      25      -2             -2              4       4      4
26      26       3             -1             -3       9      1
20      27      -3              0              0       9      0
28      25       5             -2            -10      25      4
26      30       3              3              9       9      9
29      33       6              6             36      36     36
186     216      2              0             60     144     70

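Note that here X̄ = ∑X/n = 186/8 = 23.25, which is not an integer, so we use
the short-cut method.

Here n = 8, ∑dx = 2, ∑dy = 0, ∑dxdy = 60, ∑dx² = 144, ∑dy² = 70,

$r_{xy} = \dfrac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$

$r_{xy} = \dfrac{8(60) - 2(0)}{\sqrt{8(144) - (2)^2}\,\sqrt{8(70) - (0)^2}}$

$r_{xy} = \dfrac{480}{\sqrt{1148}\,\sqrt{560}}$

$r_{xy} = \dfrac{480}{(33.8821)(23.6643)}$

$r_{xy} = \dfrac{480}{801.7962}$

rxy = 0.5987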
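
The short-cut method can also be sketched in code. The function name `pearson_r_assumed_mean` is hypothetical; the assumed means 23 and 27 are those chosen in Example 5.2. The last two lines illustrate that the result does not depend on which assumed means are used.

```python
import math

def pearson_r_assumed_mean(xs, ys, a, b):
    """Karl Pearson's coefficient by the short-cut (assumed mean) method."""
    n = len(xs)
    dx = [x - a for x in xs]  # deviations from assumed mean A
    dy = [y - b for y in ys]  # deviations from assumed mean B
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(p * q for p, q in zip(dx, dy))
    sum_dx2 = sum(p * p for p in dx)
    sum_dy2 = sum(q * q for q in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = (math.sqrt(n * sum_dx2 - sum_dx ** 2)
                   * math.sqrt(n * sum_dy2 - sum_dy ** 2))
    return numerator / denominator

xs = [17, 19, 21, 26, 20, 28, 26, 29]
ys = [23, 27, 25, 26, 27, 25, 30, 33]
r_shortcut = pearson_r_assumed_mean(xs, ys, 23, 27)
r_other_origin = pearson_r_assumed_mean(xs, ys, 0, 0)  # any A, B gives the same r
print(round(r_shortcut, 4), round(r_other_origin, 4))
```

Because the formula is unchanged by a shift of origin, the assumed means can be picked purely for arithmetic convenience.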

Self – assessment Question

Compute the coefficient of correlation for the following data

X 10 25 13 25 22 11 12 25 21 20

Y 12 22 16 15 18 18 17 23 24 17

[Ans: rxy= 0.53]

5.3.3 : Spearman’s Rank Correlation Coefficient

The coefficient of rank correlation is based on the ranks of the values of the
variates and is denoted by

$\rho = 1 - \dfrac{6\sum D^2}{n^3 - n}$

where D is the difference of corresponding ranks and n is the number of
pairs of observations.


TYPE I: RANKS ARE GIVEN DIRECTLY

Working Procedure

Step 1: Denote rank of X series by R1 and rank of Y series by R2.

Step 2: Calculate the difference of R1 and R2 and write it under the column
headed by D.

Step 3: Square the difference D and write it under the column headed by D².

Step 4: Apply the formula:

$\rho = 1 - \dfrac{6\sum D^2}{n^3 - n}$

This method is described with following example

Example 5.3: Two judges in a beauty contest rank the 12 entries as follows.

Judge X    1    2    3    4    5    6    7    8    9    10    11    12
Judge Y    12   9    6    10   3    5    4    7    8    2     11    1

Calculate the rank correlation coefficient between the two judges X and Y.

Judge X (R1)    Judge Y (R2)    D = R1 − R2    D²
1               12              -11            121
2               9               -7             49
3               6               -3             9
4               10              -6             36
5               3                2             4
6               5                1             1
7               4                3             9
8               7                1             1
9               8                1             1
10              2                8             64
11              11               0             0
12              1                11            121
Total                                          416

Here n = 12; ∑D² = 416

Now,

$\rho = 1 - \dfrac{6\sum D^2}{n^3 - n} = 1 - \dfrac{6(416)}{12^3 - 12} = 1 - \dfrac{2496}{1728 - 12} = 1 - \dfrac{2496}{1716} = 1 - 1.4545$

ρ = -0.4545

Example 5.4: Ten competitors in a beauty contest were ranked by three
judges in the following order:

Judge 1    1    6    5    10    3    2     4    9     7    8
Judge 2    3    5    8    4     7    10    2    1     6    9
Judge 3    6    4    9    8     1    2     3    10    5    7

Use the rank correlation coefficient to determine which pair of judges has the
nearest approach to common taste in beauty.

Solution

Let R1, R2, R3 respectively be the ranks given by first, second and third judge.

Let ρij be the rank correlation coefficient between the ranks given by ith and
jth judges, i=1,2,3; j=1,2,3.


Let Dij = Ri − Rj be the difference of the ranks given to an individual by the ith and jth
judges.

Judge 1    Judge 2    Judge 3    D12 =      D12²    D23 =      D23²    D13 =      D13²
R1         R2         R3         R1 − R2            R2 − R3            R1 − R3
1          3          6          -2         4       -3         9       -5         25
6          5          4           1         1        1         1        2         4
5          8          9          -3         9       -1         1       -4         16
10         4          8           6         36      -4         16       2         4
3          7          1          -4         16       6         36       2         4
2          10         2          -8         64       8         64       0         0
4          2          3           2         4       -1         1        1         1
9          1          10          8         64      -9         81      -1         1
7          6          5           1         1        1         1        2         4
8          9          7          -1         1        2         4        1         1
Total                                       200                214                60

Here n = 10; ∑D12² = 200; ∑D23² = 214; ∑D13² = 60

First and Second Judges

ρ12 = 1 - (6∑D12²) / (n³ - n) = 1 - 6(200) / (10³ - 10) = 1 - 1200/990 = 1 - 1.2121 = -0.2121

Second and Third Judges

ρ23 = 1 - (6∑D23²) / (n³ - n) = 1 - 6(214) / (10³ - 10) = 1 - 1284/990 = 1 - 1.2970 = -0.2970

First and Third Judges

ρ13 = 1 - (6∑D13²) / (n³ - n) = 1 - 6(60) / (10³ - 10) = 1 - 360/990 = 1 - 0.3636 = 0.6364

Since ρ13 is the largest of the three coefficients, the pair of the first and third judges has the nearest approach to common taste in beauty.
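
The pairwise comparison can be sketched in Python as follows. The helper name spearman_from_ranks and the dictionary layout are illustrative assumptions, and the ranks are taken to be untied. Note that the first two coefficients come out negative, because 6∑D² exceeds n³ - n = 990 for those pairs.

```python
from itertools import combinations

def spearman_from_ranks(r1, r2):
    """Spearman's rho for two untied rank lists: 1 - 6*sum(D^2) / (n^3 - n)."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

# Ranks from Example 5.4, keyed by judge number
judges = {
    1: [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    2: [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    3: [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# One coefficient per pair of judges: (1, 2), (1, 3), (2, 3)
rho = {pair: spearman_from_ranks(judges[pair[0]], judges[pair[1]])
       for pair in combinations(judges, 2)}

for pair, value in rho.items():
    print(pair, round(value, 4))

closest = max(rho, key=rho.get)  # pair with the largest coefficient
print("closest pair:", closest)  # closest pair: (1, 3)
```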

Self-assessment Questions

1. Two judges in a musical contest rank the 10 entries as follows. Calculate the rank correlation coefficient between the two judges.

Judge X:    3   5   8   4   7  10   2   1   6   9

Judge Y:    6   4   9   8   1   2   3  10   5   7

[Hint: n = 10; ∑D² = 214; ρ = -0.2970]

2. Ten competitors in a beauty contest were ranked by three judges in the following order:

1st Judge:    1   5   4   8   9   6  10   7   3   2

2nd Judge:    4   8   7   6   5   9  10   3   2   1

3rd Judge:    6   7   8   1   5  10   9   2   3   4

Use Spearman's coefficient of rank correlation to determine which pair of judges has the nearest approach to common taste in beauty.

[Hint: n = 10; ∑D12² = 74, ∑D23² = 44, ∑D13² = 156; ρ12 = 0.5515, ρ23 = 0.7333, ρ13 = 0.0545]

TYPE II: RANKS ARE NOT GIVEN – NON-REPEATED RANKS

In this case we are given only the data. We assign ranks to the values of both series X and Y, ranking both series in ascending order (or both in descending order).

Working Rule

Step 1: Assign ranks to each item of both series in ascending or descending order.

Step 2: Calculate the difference of ranks and write it under the column headed by D.

Step 3: Square the difference D and write it under the heading D².

Step 4: Apply the formula,

ρ = 1 - (6∑D²) / (n³ - n)

This method is explained with the help of the following example.

Example 5.5

For the following data calculate the coefficient of rank correlation.

Series X:    80   91   99   71   61   81   70   59

Series Y:   123  135  154  110  105  134  121  106

Solution:

Series X   Series Y   Rank X (R1)   Rank Y (R2)    D    D²
   80         123          5             5          0    0
   91         135          7             7          0    0
   99         154          8             8          0    0
   71         110          4             3          1    1
   61         105          2             1          1    1
   81         134          6             6          0    0
   70         121          3             4         -1    1
   59         106          1             2         -1    1
Total                                                    4

Here, n = 8; ∑D² = 4

Now,

ρ = 1 - (6∑D²) / (n³ - n) = 1 - 6(4) / (8³ - 8) = 1 - 24/504 = 1 - 0.0476 = 0.9524
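
The working rule (rank both series, difference, square, apply the formula) can be sketched in Python. The helper names are our own, and the sketch assumes that no value is repeated within a series; since 24/504 ≈ 0.0476, the coefficient evaluates to about 0.9524.

```python
def ranks_ascending(values):
    """Rank 1 goes to the smallest value; assumes all values are distinct."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_from_data(x, y):
    """Steps 1-4 of the working rule for untied data."""
    n = len(x)
    rx, ry = ranks_ascending(x), ranks_ascending(y)   # Step 1
    d = [a - b for a, b in zip(rx, ry)]               # Step 2
    sum_d2 = sum(v ** 2 for v in d)                   # Step 3
    return 1 - 6 * sum_d2 / (n ** 3 - n)              # Step 4

# Data from Example 5.5
series_x = [80, 91, 99, 71, 61, 81, 70, 59]
series_y = [123, 135, 154, 110, 105, 134, 121, 106]

print(ranks_ascending(series_x))  # [5, 7, 8, 4, 2, 6, 3, 1]
rho = spearman_from_data(series_x, series_y)
print(round(rho, 4))              # 0.9524
```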

Self-assessment Question

Calculate the rank correlation coefficient for the following data of two series.

Series X:   92  89  87  86  83  77  71  63  53  50

Series Y:   86  83  91  77  68  85  52  82  37  57

[Hint: n = 10; ∑D² = 44; ρ = 0.733]

TYPE III: RANKS ARE NOT GIVEN – REPEATED RANKS

If two or more individuals are placed together in any classification with respect to an attribute, so that more than one item has the same rank in either or both series, the problem is solved by assigning to each of the tied individuals the average of the ranks they would otherwise occupy.

For example, suppose the 5th and 6th items have the same value; then the common rank assigned to the 5th and 6th items is (5+6)/2 = 5.5. If a value occurs thrice, the common rank assigned to it is the sum of the three ranks it occupies divided by 3. In order to find the rank correlation coefficient, an adjustment factor is added to ∑D² in the formula, which is given by

Adjustment Factor (A.F.) = (1/12)(m³ - m)

where 'm' is the number of times an item is repeated. This adjustment factor is to be added for each repeated value in both the series.

The modified formula for the rank correlation coefficient is given by,

ρ = 1 - 6[∑D² + (1/12)∑(m³ - m)] / (n³ - n)

This method is explained with the following example.


Example 5.6 From the following data related to the series X and Y, calculate the coefficient of rank correlation.

Series X:   48  33  40   9  16  16  65  24  16  57

Series Y:   13  13  24   6  15   4  20   9   6  19

Solution

Series X   Series Y   Rank X (R1)   Rank Y (R2)   D = R1 - R2     D²
   48         13           8            5.5            2.5        6.25
   33         13           6            5.5            0.5        0.25
   40         24           7           10             -3          9.00
    9          6           1            2.5           -1.5        2.25
   16         15           3            7             -4         16.00
   16          4           3            1              2          4.00
   65         20          10            9              1          1.00
   24          9           5            4              1          1.00
   16          6           3            2.5            0.5        0.25
   57         19           9            8              1          1.00
Total                                                            41.00

Here n = 10, ∑D² = 41

[Remark: In the X series, the value 16 is repeated thrice; the common rank given to it is 3, which is the average of 2, 3 and 4, i.e., (2+3+4)/3 = 3. In the Y series, the values 6 and 13 are each repeated twice, so they receive the common ranks (2+3)/2 = 2.5 and (5+6)/2 = 5.5 respectively.]

Now, Adjustment Factor

For X series (16 repeated thrice), AF1 = (1/12)(3³ - 3) = 2

For Y series (6 and 13 each repeated twice),

AF2 = (1/12)(2³ - 2) = 0.5

AF3 = (1/12)(2³ - 2) = 0.5

The coefficient of rank correlation is,

ρ = 1 - 6[∑D² + (1/12)∑(m³ - m)] / (n³ - n)

= 1 - 6[41 + 2 + 0.5 + 0.5] / (10³ - 10)

= 1 - 6(44)/990

= 1 - 264/990

= 1 - 0.2667

ρ = 0.7333
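
The tie handling can be sketched in Python as below. The helper names (average_ranks, tie_adjustment, spearman_with_ties) are hypothetical; the sketch implements the modified formula of this section, with average ranks for tied values and one adjustment term (m³ - m)/12 per repeated value.

```python
from collections import Counter

def average_ranks(values):
    """Ascending ranks; tied values share the average of the positions they occupy."""
    order = sorted(values)
    rank_of = {}
    for v in set(values):
        first = order.index(v) + 1            # first position occupied by v
        count = order.count(v)                # m, the number of times v occurs
        rank_of[v] = first + (count - 1) / 2  # average of first .. first + count - 1
    return [rank_of[v] for v in values]

def tie_adjustment(values):
    """Sum of (m^3 - m)/12 over the values of one series (zero when m = 1)."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values())

def spearman_with_ties(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjustment = tie_adjustment(x) + tie_adjustment(y)
    return 1 - 6 * (sum_d2 + adjustment) / (n ** 3 - n)

# Data from Example 5.6
series_x = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
series_y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]

print(tie_adjustment(series_x), tie_adjustment(series_y))  # 2.0 1.0
rho = spearman_with_ties(series_x, series_y)
print(round(rho, 4))                                       # 0.7333
```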

Self-assessment Question

Obtain the rank correlation coefficient for the following data.

Series X:   68  64  75  50  64  80  75  40  55  64

Series Y:   62  58  68  45  81  60  68  48  50  70

[Hint: n = 10; ∑D² = 72; ρ = 0.545]

5.4 Properties of Correlation Coefficient

➢ The value of 'r' does not depend on which of the two variables under study is labeled X and which is labeled Y.
➢ The correlation coefficient lies between -1 and +1, i.e., -1 ≤ r ≤ +1.
➢ The correlation coefficient is independent of change of origin and scale.
➢ r = +1 if all (Xi, Yi) pairs lie on a straight line with positive slope, and r = -1 if all (Xi, Yi) pairs lie on a straight line with negative slope.
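
These properties can be checked numerically. The sketch below uses a small hypothetical data set and a hand-written pearson_r helper (both are our own assumptions, not part of the text); the change of scale is taken with positive divisors.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 1, 4, 3, 6]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                      # labels X and Y exchanged

# Change of origin and (positive) scale: u = (X - 3)/2, w = (Y - 10)/5
u = [(v - 3) / 2 for v in x]
w = [(v - 10) / 5 for v in y]
r_shifted = pearson_r(u, w)

r_line_up = pearson_r(x, [2 * v + 1 for v in x])     # points on a rising line
r_line_down = pearson_r(x, [-3 * v + 7 for v in x])  # points on a falling line

print(round(r, 4), round(r_swapped, 4), round(r_shifted, 4))
print(round(r_line_up, 4), round(r_line_down, 4))
```

The first three printed values agree with each other, and the last two are +1 and -1.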

5.5 REGRESSION ANALYSIS

Managers often make decisions by studying the relationship between variables, and process improvements can often be made by understanding how changes in one or more variables affect the process output. Regression analysis is a statistical technique in which we use observed data to relate a variable of interest, which is called the dependent (or response) variable, to one or more independent (or predictor) variables. The objective is to build a regression model, or prediction equation, that can be used to describe, predict and control the dependent variable on the basis of the independent variables. For example, a company might wish to improve its marketing process. After collecting data concerning the demand for a product, the product's price, and the advertising expenditures made to promote the product, the company might use regression analysis to develop an equation to predict demand on the basis of price and advertising expenditure. Predictions of demand for various price-advertising expenditure combinations can then be used to evaluate potential changes in the company's marketing strategies.

In the words of M.M. Blair,

Regression analysis is a "mathematical measure of average relationship between two or more variables in terms of the original unit of the data".

5.6 Types of Regression Lines

A line of regression is the line which gives the best estimate of one variable for any given value of the other variable. We have two types of regression lines, namely,

○ Regression line of X on Y
○ Regression line of Y on X.

First we give the regression line of X on Y.

It is the line which gives the best estimate of the value of X for a specified value of Y.


It is given by

X - X̄ = bxy (Y - Ȳ)

where bxy is the regression coefficient of X on Y, which can be calculated using any one of the following formulae, depending upon the nature of the data:

bxy = ∑xy / ∑y²

where x = X - X̄ and y = Y - Ȳ

or

bxy = [n∑dxdy - (∑dx)(∑dy)] / [n∑dy² - (∑dy)²]

where dx = X – A, dy = Y – B; and A, B are the assumed means

or

bxy = r (σx / σy)

where 'r' is the correlation coefficient, and σx and σy are the standard deviations of the X and Y series.

Now we give the regression line of Y on X.

It is the line which gives the best estimate of the value of Y for a specified value of X.

It is given by

Y - Ȳ = byx (X - X̄)

where byx is the regression coefficient of Y on X, which can be calculated using any one of the following formulae, depending upon the nature of the data:

byx = ∑xy / ∑x²

where x = X - X̄ and y = Y - Ȳ

or

byx = [n∑dxdy - (∑dx)(∑dy)] / [n∑dx² - (∑dx)²]

where dx = X – A, dy = Y – B; and A, B are the assumed means

or

byx = r (σy / σx)

where 'r' is the correlation coefficient, and σx and σy are the standard deviations of the X and Y series.
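
As a quick illustration that the deviation formula and the r, σ formula give the same regression coefficients, here is a minimal Python sketch. The two short series are hypothetical, and the standard deviations are computed with divisor n.

```python
from math import sqrt

x = [2, 4, 6, 8, 10]   # hypothetical X series
y = [1, 3, 2, 5, 4]    # hypothetical Y series
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
dev_x = [v - x_bar for v in x]           # x = X - mean of X
dev_y = [v - y_bar for v in y]           # y = Y - mean of Y

sum_xy = sum(p * q for p, q in zip(dev_x, dev_y))
sum_x2 = sum(p * p for p in dev_x)
sum_y2 = sum(q * q for q in dev_y)

# Formula 1: deviations taken from the actual means
bxy = sum_xy / sum_y2                    # regression coefficient of X on Y
byx = sum_xy / sum_x2                    # regression coefficient of Y on X

# Formula 3: correlation coefficient and standard deviations
sigma_x = sqrt(sum_x2 / n)
sigma_y = sqrt(sum_y2 / n)
r = sum_xy / (n * sigma_x * sigma_y)
bxy_from_r = r * sigma_x / sigma_y
byx_from_r = r * sigma_y / sigma_x

print(round(bxy, 4), round(bxy_from_r, 4))  # 1.6 1.6
print(round(byx, 4), round(byx_from_r, 4))  # 0.4 0.4
```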

5.7 CONSTRUCTION OF REGRESSION EQUATION

Example 5.7 The heights of a sample of 10 fathers and their eldest sons are given below (to the nearest cm).

Height of Father (X):   170  167  162  163  167  166  169  171  166  169

Height of Son (Y):      166  167  164  166  166  164  168  170  163  166

(i) Obtain the two regression equations.
(ii) Estimate the likely height of Father when the height of Son is 190 cm.
(iii) Estimate the likely height of Son when the height of Father is 160 cm.

Solution

Height of     Height of    x = X - X̄   y = Y - Ȳ    xy    x²    y²
Father (X)    Son (Y)
   170           166           3           0          0     9     0
   167           167           0           1          0     0     1
   162           164          -5          -2         10    25     4
   163           166          -4           0          0    16     0
   167           166           0           0          0     0     0
   166           164          -1          -2          2     1     4
   169           168           2           2          4     4     4
   171           170           4           4         16    16    16
   166           163          -1          -3          3     1     9
   169           166           2           0          0     4     0
Total 1670      1660           0           0         35    76    38

Here, n = 10, ∑X = 1670, ∑Y = 1660

X̄ = ∑X/n = 1670/10 = 167

Ȳ = ∑Y/n = 1660/10 = 166

From the table, ∑xy = 35, ∑x² = 76, ∑y² = 38

bxy = ∑xy / ∑y² = 35/38 = 0.9211

byx = ∑xy / ∑x² = 35/76 = 0.4605

(i) Regression line of X on Y

X - X̄ = bxy (Y - Ȳ)
X - 167 = 0.9211(Y - 166)
X - 167 = 0.9211Y - 152.9026
X = 0.9211Y - 152.9026 + 167
X = 0.9211Y + 14.0974

Regression line of Y on X

Y - Ȳ = byx (X - X̄)
Y - 166 = 0.4605(X - 167)
Y - 166 = 0.4605X - 76.9035
Y = 0.4605X - 76.9035 + 166
Y = 0.4605X + 89.0965

(ii) Given, height of Son (Y) = 190 cm.

To estimate the height of Father (X), we use the X on Y equation.

X = 0.9211Y + 14.0974

X = 0.9211(190) + 14.0974

X = 175.009 + 14.0974

X = 189.11 cm

(iii) Given, height of Father (X) = 160 cm.

To estimate the height of Son (Y), we use the Y on X equation.

Y = 0.4605X + 89.0965

Y = 0.4605(160) + 89.0965

Y = 73.68 + 89.0965

Y = 162.78 cm
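
The whole of Example 5.7 can be reproduced with a short Python sketch. The function names are our own, and because the code keeps the unrounded coefficients 35/38 and 35/76, its estimates may differ from a hand calculation in the last decimal place.

```python
father = [170, 167, 162, 163, 167, 166, 169, 171, 166, 169]  # X
son = [166, 167, 164, 166, 166, 164, 168, 170, 163, 166]     # Y
n = len(father)

x_bar = sum(father) / n   # 167.0
y_bar = sum(son) / n      # 166.0

sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(father, son))  # 35
sum_x2 = sum((x - x_bar) ** 2 for x in father)                        # 76
sum_y2 = sum((y - y_bar) ** 2 for y in son)                           # 38

bxy = sum_xy / sum_y2     # regression coefficient of X on Y
byx = sum_xy / sum_x2     # regression coefficient of Y on X

def father_from_son(y):
    """Regression line of X on Y: X = x_bar + bxy (Y - y_bar)."""
    return x_bar + bxy * (y - y_bar)

def son_from_father(x):
    """Regression line of Y on X: Y = y_bar + byx (X - x_bar)."""
    return y_bar + byx * (x - x_bar)

print(round(bxy, 4), round(byx, 4))     # 0.9211 0.4605
print(round(father_from_son(190), 2))   # 189.11
print(round(son_from_father(160), 2))   # 162.78
```

Each estimate uses the line whose dependent variable is the quantity being estimated: X on Y for the father's height, Y on X for the son's height.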

Self-assessment Question

From the following data, obtain the two regression equations.

Sales (X):      91  97  108  121  67  124  51  73  111  57

Purchase (Y):   71  75   69   97  70   91  39  61   80  47

Also estimate the sales when the purchase is 90.

[Hint: n = 10; X̄ = 90, Ȳ = 70; ∑xy = 3900, ∑x² = 6360, ∑y² = 2868;

bxy = 1.36; byx = 0.6132; Line of X on Y: X = 1.36Y - 5.2;

Line of Y on X: Y = 0.6132X + 14.812;

Estimated sales when the purchase is 90 = 117.2]

Example 5.8

Find the two lines of regression from the following data.

Price at Mumbai (in Rs.):    36  42  55  61  76  26

Price at Chennai (in Rs.):   15  36  24  26  15  14

Also estimate the likely price at Mumbai when the price at Chennai is Rs 60/-

Solution

Price at      Price at       dx = X - A   dy = Y - B   dxdy    dx²    dy²
Mumbai (X)    Chennai (Y)    (A = 55)     (B = 26)
    36            15            -19          -11        209    361    121
    42            36            -13           10       -130    169    100
    55            24              0           -2          0      0      4
    61            26              6            0          0     36      0
    76            15             21          -11       -231    441    121
    26            14            -29          -12        348    841    144
Total 296        130            -34          -26        196   1848    490

Here, n = 6, ∑X = 296, ∑Y = 130, ∑dx = -34, ∑dy = -26, ∑dxdy = 196, ∑dx² = 1848, ∑dy² = 490

X̄ = ∑X/n = 296/6 = 49.33

Ȳ = ∑Y/n = 130/6 = 21.67

bxy = [n∑dxdy - (∑dx)(∑dy)] / [n∑dy² - (∑dy)²] = [6(196) - (-34)(-26)] / [6(490) - (-26)²] = (1176 - 884) / (2940 - 676) = 292/2264 = 0.1290

byx = [n∑dxdy - (∑dx)(∑dy)] / [n∑dx² - (∑dx)²] = [6(196) - (-34)(-26)] / [6(1848) - (-34)²] = (1176 - 884) / (11088 - 1156) = 292/9932 = 0.0294

Regression line of X on Y

X - X̄ = bxy (Y - Ȳ)
X - 49.33 = 0.1290(Y - 21.67)
X - 49.33 = 0.1290Y - 2.7954
X = 0.1290Y - 2.7954 + 49.33
X = 0.1290Y + 46.5346

Regression line of Y on X

Y - Ȳ = byx (X - X̄)
Y - 21.67 = 0.0294(X - 49.33)
Y - 21.67 = 0.0294X - 1.4503
Y = 0.0294X - 1.4503 + 21.67
Y = 0.0294X + 20.2197

To estimate the likely price at Mumbai, we use the line of X on Y.

X = 0.1290Y + 46.5346

X = 0.1290(60) + 46.5346 = 7.74 + 46.5346 = 54.27

Hence the estimated price at Mumbai is Rs 54.27.
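
The assumed-mean calculation can be cross-checked in Python directly from the six price pairs. This is a minimal sketch with variable names of our own choosing; it keeps unrounded intermediate values, so the final estimate may differ from a hand calculation in the last decimal place.

```python
mumbai = [36, 42, 55, 61, 76, 26]    # X
chennai = [15, 36, 24, 26, 15, 14]   # Y
A, B = 55, 26                        # assumed means, as in the example
n = len(mumbai)

dx = [x - A for x in mumbai]
dy = [y - B for y in chennai]

sum_dx, sum_dy = sum(dx), sum(dy)                  # -34, -26
sum_dxdy = sum(p * q for p, q in zip(dx, dy))      # 196
sum_dx2 = sum(p * p for p in dx)                   # 1848
sum_dy2 = sum(q * q for q in dy)                   # 490

numerator = n * sum_dxdy - sum_dx * sum_dy         # 292
bxy = numerator / (n * sum_dy2 - sum_dy ** 2)      # 292 / 2264
byx = numerator / (n * sum_dx2 - sum_dx ** 2)      # 292 / 9932

x_bar, y_bar = sum(mumbai) / n, sum(chennai) / n

# Price at Mumbai when the price at Chennai is Rs 60 (line of X on Y)
estimate = x_bar + bxy * (60 - y_bar)

print(sum_dx, sum_dy, sum_dxdy, sum_dx2, sum_dy2)  # -34 -26 196 1848 490
print(round(bxy, 4), round(byx, 4))                # 0.129 0.0294
print(round(estimate, 2))                          # 54.28
```

A useful check on any assumed-mean table is that ∑dx = ∑X - nA and ∑dy = ∑Y - nB; here 296 - 6(55) = -34 and 130 - 6(26) = -26.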

Self-assessment Question

Obtain the two regression equations from the following data.

Age of Husband (X):   25  22  28  26  35  20  22  40  20  18

Age of Wife (Y):      18  15  20  17  22  14  16  21  15  14

Hence estimate the age of husband when the age of wife is 19.

[Hint: n = 10; X̄ = 25.6, Ȳ = 17.2; bxy = 2.23; byx = 0.385;

X = 2.23Y - 12.76; Y = 0.385X + 7.34; Age of Husband (X) = 29.61]

Example 5.9 Find out the likely production corresponding to a rainfall of 40 cm from the following data.

                      Rainfall (in cm)   Output (in quintals)
Average                     30                   50
Standard Deviation           5                   10

Coefficient of correlation, r = 0.8

Let X and Y denote the rainfall and output respectively.

Given: X̄ = 30, Ȳ = 50, σx = 5, σy = 10, r = 0.8

Regression line of Y on X

Y - Ȳ = byx (X - X̄)
byx = r (σy / σx) = 0.8(10/5) = 1.6
Y - 50 = 1.6(X - 30)
Y - 50 = 1.6X - 48
Y = 1.6X - 48 + 50
Y = 1.6X + 2

When rainfall X = 40 cm,

Y = 1.6(40) + 2

Y = 66 quintals
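
When only the means, standard deviations and r are given, the line of Y on X follows from byx = r(σy/σx). A minimal Python sketch, with a function name of our own choosing:

```python
def line_y_on_x(x_bar, y_bar, sigma_x, sigma_y, r):
    """Return (slope, intercept) of the regression line of Y on X."""
    byx = r * sigma_y / sigma_x
    intercept = y_bar - byx * x_bar   # from Y - y_bar = byx (X - x_bar)
    return byx, intercept

# Summary figures from Example 5.9: rainfall is X, output is Y
slope, intercept = line_y_on_x(x_bar=30, y_bar=50, sigma_x=5, sigma_y=10, r=0.8)
output_at_40 = slope * 40 + intercept

print(round(slope, 2), round(intercept, 2))  # 1.6 2.0
print(round(output_at_40, 2))                # 66.0
```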

Self-assessment Question

Estimate the most likely yield of paddy when the annual rainfall is 22 cm, other factors being assumed to remain the same.

                      Yield per hectare (in kg)   Annual Rainfall (in cm)
Mean                          973.5                       18.3
Standard Deviation             38.4                        2.0

Coefficient of Correlation = 0.58

[Hint: Taking yield as Y and rainfall as X, the regression line of Y on X is Y = 11.136X + 769.71; for X = 22, Yield (Y) = 1014.7 kg]

5.8 Properties of regression coefficients

1. There are two regression lines, namely X on Y and Y on X, and they always intersect at the mean (X̄, Ȳ).

2. If one regression coefficient is greater than unity, then the other one has to be less than unity.

3. The geometric mean of the regression coefficients is the correlation coefficient (i.e., r = ±√(bxy · byx)), where r takes the common sign of the two regression coefficients.

4. Although the two regression equations are usually different, they become identical if r = ±1.

5. If r = 0 then the regression lines are perpendicular to each other.
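
Properties 1 to 3 can be checked numerically. The sketch below reuses the coefficients of Example 5.7 (bxy = 35/38, byx = 35/76, means 167 and 166); the way the two lines are solved for their intersection is our own illustration.

```python
from math import sqrt, copysign

bxy, byx = 35 / 38, 35 / 76   # regression coefficients from Example 5.7
x_bar, y_bar = 167, 166

# Property 3: r is the geometric mean, with the common sign of bxy and byx
r = copysign(sqrt(bxy * byx), bxy)

# Property 1: write the lines as  X - bxy*Y = c1  and  -byx*X + Y = c2
c1 = x_bar - bxy * y_bar
c2 = y_bar - byx * x_bar
det = 1 - bxy * byx            # non-zero unless r = +1 or r = -1
x_meet = (c1 + bxy * c2) / det
y_meet = (c2 + byx * c1) / det

print(round(r, 4))                          # 0.6513
print(round(x_meet, 4), round(y_meet, 4))   # 167.0 166.0

# Property 2: the product bxy * byx = r^2 cannot exceed 1
print(bxy * byx <= 1)                       # True
```

The determinant 1 - bxy·byx equals 1 - r², which is why the two lines coincide (and no unique intersection exists) only when r = ±1.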

5.9 Difference between correlation and regression

1. Correlation: It is the degree of relationship between two or more variables or factors.
   Regression: It is the average relationship between two or more variables or factors.

2. Correlation: It is symmetric in x and y, i.e., rxy = ryx.
   Regression: The regression coefficients are not symmetric, i.e., bxy and byx are in general different.

3. Correlation: The correlation coefficient does not reflect upon the nature of the variables (independent or dependent variable).
   Regression: The regression coefficients reflect the nature of the variables, i.e., which is dependent and which is independent.

4. Correlation: It does not imply a cause and effect relationship between the variables under study.
   Regression: It expresses an assumed cause and effect relationship between the variables. The variable corresponding to the cause is taken as the independent variable, whereas the variable corresponding to the effect is taken as the dependent variable.

5. Correlation: It is a relative measure and is independent of the units of measurement.
   Regression: The regression coefficients are absolute measures of the relationship between two or more variables, expressed in the original units of the data.

6. Correlation: It indicates the degree of association.
   Regression: It is used to forecast the value of the dependent variable when the value of the independent variable is known.

5.10 Application of Regression


✔ Cause and effect relations are indicated by the study of regression analysis.
✔ It establishes the rate of change in one variable in terms of the changes in another variable.
✔ It is useful in economic analysis, as a regression equation can determine the increase in the cost of living index for a particular increase in the general price level.
✔ It helps in prediction, and thus it can estimate the values of unknown quantities.
✔ It helps in determining the coefficient of correlation.
✔ It enables us to study the nature of the relationship between the variables.
✔ It can be useful in all natural, social and physical sciences, where the data are in functional relationship.

Chapter Summary

This chapter has discussed the simple correlation coefficient, the rank correlation coefficient and simple linear regression analysis, which relates a dependent variable to a single independent variable. We began by considering the simple linear regression model, which employs two parameters: the slope and the y intercept. We next discussed how to compute the least squares point estimates of the parameters and how to construct the regression equations by using various methods. We also learned the difference between correlation and regression and the applications of regression analysis.
