Regression

Relationship Between Variables
TYPES OF RELATIONSHIP
► Deterministic relationship (functional relationship)

► The relationship between two variables is known exactly
► Area of a circle = πr²
► F = k(m₁m₂/r²) (Newton's law of gravitation)
► Dollar sales of a product sold at a fixed price and the number of units
sold.
► Probabilistic relationship (statistical relationship)

► The relationship between the variables is not known exactly, so we
have to approximate it and develop models that characterize its
main features.
Regression
Regression analysis is a statistical technique for
investigating and modeling the relationship
between variables. It examines the dependence
of one variable, called the dependent variable and
denoted by Y, on one or more variables, called
independent variables and denoted by X's, and
provides an equation for estimating or predicting
the average value of the dependent variable from
known values of the independent variables.
Regression Analysis
► Regression Analysis is used to estimate a function f( )
that describes the relationship between a continuous
dependent variable and one or more independent
variables.
Y = f(X1, X2, X3,…, Xn) + ε
Note:
• f( ) describes systematic variation in the relationship.
• ε represents the unsystematic variation (or random error) in the
relationship
where Y = dependent variable (response, predictand, regressand)
X = independent variable (stimulus, predictor, regressor)
Examples
► Sales = f(Advertising expenditure) + ε
► Fiber = f(Weight of jute plant) + ε
► Consumption expenditure = f(Income) + ε
► Yield = f(Fertilizer, Seed rate, Rainfall) + ε
► Marks = f(Study hours, IQ level) + ε
► Demand = f(Price, Price of related commodities,
Consumer income, Consumer taste, Advertising expenses
for creation of demand) + ε
Model building with one regressor
Example: Consider the relationship between
advertising (X) and sales (Y) for a company.
• There probably is a relationship:
as advertising increases, sales should increase.
• But how would we measure and quantify this
relationship?
Example [1]: The following data show the money spent
on research and development and the annual profit of a firm.
Fit an appropriate model to the data.
X = expenditure for R&D, Y = annual profit

X ($ million)    Y ($ million)
 5               31
11               40
 4               30
 5               34
 3               25
 2               20
Scatter plot
[Figure: scatter plot of annual profit (Y) against R&D expenditure (X)]
The observed data points do not all fall on a straight line but
cluster about one. Many lines can be drawn through the data points; the
problem is to select among them. The method of LEAST SQUARES
selects the line that minimizes the sum of squared vertical distances
from the observed data points to the line (i.e. the sum of squared
random errors). Any other line has a larger sum.
Best fit line to the data
LEAST SQUARES LINE
A least squares line is described in terms of its Y-intercept
(the height at which it crosses the Y-axis) and its
slope (the steepness of the line). The line can be expressed
by the following relation:

Y = a + bX (estimated regression of Y on X)

where
► b = slope of the line
$$b = \frac{S_{XY}}{S_X^2}$$
► a = intercept of the line
$$a = \bar{Y} - b\bar{X}$$

For the R&D data, n = 6, ΣX = 30, ΣY = 180, ΣXY = 1000 and ΣX² = 200, so
$$S_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n-1} = \frac{1}{n-1}\left[\sum XY - \frac{\sum X \sum Y}{n}\right] = \frac{1}{5}\left[1000 - \frac{(30)(180)}{6}\right] = 20$$
$$S_X^2 = \frac{\sum (X - \bar{X})^2}{n-1} = \frac{1}{n-1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right] = \frac{1}{5}\left[200 - \frac{(30)^2}{6}\right] = 10$$
$$\bar{X} = 5, \qquad \bar{Y} = 30$$
$$b = \frac{20}{10} = 2, \qquad a = 30 - 2(5) = 20$$

Y = 20 + 2X
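The arithmetic behind the fitted line can be checked with a short script. This is a minimal sketch in plain Python that applies the S_XY and S²_X formulas to the R&D data; the variable names are illustrative and not part of the original slides.

```python
# Example 1 data: R&D expenditure (X) and annual profit (Y), both in $ million.
x = [5, 11, 4, 5, 3, 2]
y = [31, 40, 30, 34, 25, 20]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Sample covariance and sample variance, both with divisor n - 1.
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
s_x2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)

b = s_xy / s_x2                  # slope
a = sum_y / n - b * sum_x / n    # intercept: Y-bar minus b times X-bar

print(f"S_XY = {s_xy}, S_X^2 = {s_x2}")
print(f"Y = {a} + {b} X")
```

The divisor n − 1 cancels in the ratio S_XY / S²_X, so the same slope results from the raw sums of squares and cross-products.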
Interpretation of the estimated parameters
Y=20+2X
► The value b = 2 indicates that the annual
profit is expected to increase by $2 million, on
average, with each $1 million increase in
R&D expenditure.
► The value a = 20 indicates that the annual
profit is $20 million when X = 0, i.e. without R&D
expenditure, but this interpretation is not
always valid.
Regression results are valid only within the scope of the
data, i.e. the experimental region.
Measuring the reliability of
the estimating equation
[Figure: expenditure vs. annual profit, showing actual values, fitted values and the fitted line y = 2x + 20, R² = 0.8264]
The observed values of (X, Y) do not all fall on the regression line; they scatter
about it. The degree of scatter of the observed values about the regression
line is measured by the standard error of estimate, also called the standard error
of regression and denoted by Se.
To judge the accuracy or reliability of the estimated regression, we therefore
compute the standard error of estimate. A small value indicates that the observed
points lie close to the line and that the estimated regression is adequate.
Partition of variation in dependent variable
into explained and unexplained variation
Total variation = Explained variation (variation due to X, also called
variation due to regression)
+
Unexplained variation
(variation due to unknown factors)
Total variation = (n−1)S²_Y = 5(48.4) = 242
Explained variation = b(n−1)S_XY = 2(5)(20) = 200
Unexplained variation = 242 − 200 = 42
Goodness of Fit
A commonly used measure of the goodness of fit of a
linear model is R², called the coefficient of determination.
If all the observations fall on the regression line, R² is
1. If there is no linear relationship between Y and X, R² is 0.
The coefficient of determination tells us the proportion
of variation in the dependent variable explained by the
independent variable.
Coefficient of determination (R²) = (Explained variation / Total variation) × 100 = (200/242) × 100 ≈ 83%
The higher the coefficient of determination, the better the
regression function explains the observed values. The value of
R² indicates that about 83% of the variation in the dependent variable
is explained by the linear relationship with X; the
remainder is due to other, unknown factors.
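The partition of variation and the resulting R² can be reproduced from the R&D data and the fitted line Y = 20 + 2X. The sketch below is plain Python with illustrative variable names; it computes each sum of squares directly from its definition.

```python
# Example 1 data and the fitted line Y = 20 + 2X from the text.
x = [5, 11, 4, 5, 3, 2]
y = [31, 40, 30, 34, 25, 20]
a, b = 20.0, 2.0

y_bar = sum(y) / len(y)
fitted = [a + b * xi for xi in x]

# Total variation: squared deviations of observed Y from the mean of Y.
total = sum((yi - y_bar) ** 2 for yi in y)
# Explained variation: squared deviations of fitted values from the mean of Y.
explained = sum((fi - y_bar) ** 2 for fi in fitted)
# Unexplained variation: squared residuals (observed minus fitted).
unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r_squared = explained / total

print(total, explained, unexplained)
print(f"R^2 = {r_squared:.4f}")
```

The explained and unexplained parts add up to the total only because the line is the least squares line; for an arbitrary line the identity does not hold.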
Estimation in regression
(Predicting unknown value of Y
from known value of X)
► Estimate the profit of a firm whose research and
development expenditure is $8 million.
Put X = 8 in the estimated equation:
Y = 20 + 2(8) = 36, i.e. $36 million
► NOTE: Predictions made using an estimated
regression function may have little or no
validity for values of the independent variables
that are substantially different from those
represented in the sample.
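One way to keep the note about the scope of the data in view is to return a flag alongside each prediction. The function below is a hypothetical helper, not something from the slides; it assumes the fitted line Y = 20 + 2X and the observed range of X in the R&D data (2 to 11).

```python
def predict_profit(x_new, a=20.0, b=2.0, x_min=2.0, x_max=11.0):
    """Return (prediction, inside_range) for the fitted line Y = a + bX.

    inside_range is False when x_new lies outside the sampled values of X,
    in which case the prediction is an extrapolation.
    """
    inside_range = x_min <= x_new <= x_max
    return a + b * x_new, inside_range


for x_new in (8, 20):
    y_hat, ok = predict_profit(x_new)
    label = "within the data" if ok else "extrapolation, treat with caution"
    print(f"X = {x_new}: predicted profit = {y_hat} ($ million), {label}")
```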
Example [2]:
Find the least squares regression line for the data on
incomes and food expenditures (both in hundreds of dollars) of
seven households.

Income    Food expenditure
35         9
49        15
21         7
39        11
15         5
28         8
25         9
Scatter Diagram
A plot of paired observations is called a scatter diagram.
[Figure: scatter diagram of food expenditure against income]
[Figure: scatter diagram with several candidate straight lines]
Least Squares Line
[Figure: regression line and random errors e, food expenditure against income]
Error Sum of Squares (SSE)
The error sum of squares, denoted SSE, is
$$SSE = \sum e^2 = \sum (y - \hat{y})^2$$
The values of a and b that give the minimum SSE
are called the least squares estimates of the population
parameters A and B, and the regression line obtained with these
estimates is called the least squares line.
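The minimizing property of the least squares estimates can be illustrated numerically with the income and food expenditure data. In this sketch the helper `sse` and the three alternative lines are chosen only for illustration; any line other than the least squares line should give a larger SSE.

```python
# Example 2 data, both variables in hundreds of dollars.
x = [35, 49, 21, 39, 15, 28, 25]   # income
y = [9, 15, 7, 11, 5, 8, 9]        # food expenditure


def sse(a, b):
    """Error sum of squares of the line y-hat = a + b*x for the data above."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares estimates in deviation form.
b_ls = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a_ls = y_bar - b_ls * x_bar

best = sse(a_ls, b_ls)

# Arbitrary alternative lines: shifted intercept, steeper slope, a rough guess.
candidates = [(a_ls + 0.5, b_ls), (a_ls, b_ls + 0.05), (0.0, 0.3)]
others = [sse(a, b) for a, b in candidates]

print(f"SSE of least squares line: {best:.4f}")
for (a, b), value in zip(candidates, others):
    print(f"SSE of y = {a:.4f} + {b:.4f}x: {value:.4f}")
```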
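The Least Squares Line
For the least squares regression line
ŷ = a + bx,
$$b = \frac{S_{xy}}{S_x^2} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$
where
$$S_{xy} = \frac{1}{n-1}\left[\sum xy - \frac{(\sum x)(\sum y)}{n}\right] \quad \text{and} \quad S_x^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]$$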
Solution

Income    Food expenditure
x         y         xy        x²
35         9        315       1225
49        15        735       2401
21         7        147        441
39        11        429       1521
15         5         75        225
28         8        224        784
25         9        225        625
Σx = 212  Σy = 64   Σxy = 2150  Σx² = 7222
Σx = 212, Σy = 64
x̄ = Σx / n = 212 / 7 = 30.2857
ȳ = Σy / n = 64 / 7 = 9.1429
$$S_{xy} = \frac{1}{n-1}\left[\sum xy - \frac{(\sum x)(\sum y)}{n}\right] = \frac{1}{6}\left[2150 - \frac{(212)(64)}{7}\right] = 35.2857$$
$$S_x^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{(\sum x)^2}{n}\right] = \frac{1}{6}\left[7222 - \frac{(212)^2}{7}\right] = 133.571$$
$$b = \frac{S_{xy}}{S_x^2} = \frac{35.2857}{133.571} = 0.2642$$
$$a = \bar{y} - b\bar{x} = 9.1429 - (0.2642)(30.2857) = 1.1414$$
Fitted line: ŷ = 1.1414 + 0.2642x
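Error of prediction
ŷ = 1.1414 + 0.2642x
[Figure: regression line of food expenditure on income, showing the error e for one household]
For the household with x = 35 and y = 9:
Predicted = $1038.84
Actual = $900
Error = Actual − Predicted = −$138.84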
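The error of prediction shown above corresponds to the household with income 35 and food expenditure 9 (in hundreds of dollars). The sketch below uses the rounded coefficients from the fitted line, which is an assumption made to match the slide figures; unrounded coefficients shift the result by a few cents.

```python
# Rounded coefficients of the fitted line y-hat = 1.1414 + 0.2642x.
a, b = 1.1414, 0.2642

# Observed household: income 35 and food expenditure 9, in hundreds of dollars.
x_obs, y_obs = 35, 9

y_hat = a + b * x_obs        # predicted food expenditure, hundreds of dollars
error = y_obs - y_hat        # error of prediction: actual minus predicted

predicted_dollars = 100 * y_hat
actual_dollars = 100 * y_obs
error_dollars = 100 * error

print(f"Predicted: {predicted_dollars:.2f} dollars")
print(f"Actual:    {actual_dollars:.2f} dollars")
print(f"Error:     {error_dollars:.2f} dollars")
```

A negative error means the line overestimates this household's food expenditure.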
Interpretation of a and b
ŷ = 1.1414 + 0.2642x
Interpretation of a
Consider a household with zero income:
ŷ = 1.1414 + 0.2642(0) = 1.1414 hundred dollars
Thus, we can state that a household with no
income is expected to spend $114.14 per
month on food.
Interpretation of a and b (cont.)
ŷ = 1.1414 + 0.2642x
Interpretation of b
The value of b in the regression model gives the
change in y due to a change of one unit in x.
We can state that, on average, a $1 increase in
the income of a household will increase food
expenditure by $0.2642.
The regression line is valid only for values of x
between 15 and 49 (the scope of the model).
Goodness of Fit
R² = 92%
The value of R² indicates that about 92% of the
variation in the dependent variable is
explained by the linear relationship
with X; the remainder is due to
other, unknown factors.
Positive and negative linear relationships between x and y.
[Figure: (a) positive linear relationship, b > 0; (b) negative linear relationship, b < 0]
Example [3]:
A random sample of eight drivers insured with a company and
having similar auto insurance policies was selected. The following
table lists their driving experience (in years) and monthly auto
insurance premiums.

Driving experience (years)    Monthly auto insurance premium ($)
 5                            64
 2                            87
12                            50
 9                            71
15                            44
 6                            56
25                            42
16                            60
a) Does the insurance premium depend on
the driving experience or does the driving
experience depend on the insurance
premium? Do you expect a positive or a
negative relationship between these two
variables?
Answer: The insurance premium depends on
driving experience.
The insurance premium is the dependent
variable.
The driving experience is the independent
variable.
A negative relationship is expected: the premium
should fall as driving experience increases.
b) Plot the scatter diagram and identify the
nature and strength of the relationship.
[Figure: scatter diagram of insurance premium against driving experience]
The relationship is negative and moderate.
c) Find the least squares regression line by
choosing appropriate dependent and
independent variables based on your answer in
part a.

Experience    Premium
x             y         xy        x²        y²
 5            64        320        25       4096
 2            87        174         4       7569
12            50        600       144       2500
 9            71        639        81       5041
15            44        660       225       1936
 6            56        336        36       3136
25            42       1050       625       1764
16            60        960       256       3600
Σx = 90       Σy = 474  Σxy = 4739  Σx² = 1396  Σy² = 29,642
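x̄ = Σx / n = 90 / 8 = 11.25
ȳ = Σy / n = 474 / 8 = 59.25
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{(90)(474)}{8} = -593.5$$
$$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{(90)^2}{8} = 383.5$$
$$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 29{,}642 - \frac{(474)^2}{8} = 1557.5$$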
LEAST SQUARES REGRESSION LINE
$$b = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5000}{383.5000} = -1.5476$$
$$a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605$$
ŷ = 76.6605 − 1.5476x
d) Interpret the meaning of the values
of a and b calculated.
a = 76.6605 gives the value of ŷ for x = 0, i.e. the
monthly premium for a driver with no driving
experience.
b = −1.5476 indicates that, on average, for
every extra year of driving experience, the
monthly auto insurance premium decreases by
$1.55.
f) Calculate the coefficient of
determination.
R² = b·SSxy / SSyy = (−1.5476)(−593.5) / 1557.5 ≈ 0.59, i.e. 59%
59% of the total variation in insurance
premiums is explained by years of driving
experience and 41% is due to other,
unknown factors.
Predict the monthly auto insurance premium for a
driver with 10 years of driving experience.
The predicted value of y for x = 10 is
ŷ = 76.6605 − 1.5476(10) = $61.18
Standard deviation of regression
Compute the standard deviation of regression, i.e. the
measure of variation of points around the regression line.
$$s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n-2}} = \sqrt{\frac{1557.5000 - (-1.5476)(-593.5000)}{8-2}} = 10.3199$$
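The whole insurance premium example (sums of squares, fitted line, R², prediction at x = 10 and the standard deviation of regression) can be retraced in a few lines. This is a sketch in plain Python with illustrative variable names; small differences from the slide values in the last decimal place come from rounding b to four decimals in the hand calculation.

```python
from math import sqrt

# Example 3 data.
x = [5, 2, 12, 9, 15, 6, 25, 16]        # driving experience (years)
y = [64, 87, 50, 71, 44, 56, 42, 60]    # monthly premium ($)
n = len(x)

# Sums of squares and cross-products about the means (no n - 1 divisor).
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

b = ss_xy / ss_xx                   # slope
a = sum(y) / n - b * sum(x) / n     # intercept

r_squared = ss_xy ** 2 / (ss_xx * ss_yy)     # coefficient of determination
s_e = sqrt((ss_yy - b * ss_xy) / (n - 2))    # standard deviation of regression
y_hat_10 = a + b * 10                        # predicted premium at 10 years

print(f"SSxy = {ss_xy}, SSxx = {ss_xx}, SSyy = {ss_yy}")
print(f"y-hat = {a:.4f} + ({b:.4f})x")
print(f"R^2 = {r_squared:.4f}, s_e = {s_e:.4f}, prediction at x=10: {y_hat_10:.2f}")
```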
Regression with more than one independent variable
Example: The following information has been gathered
from a random sample of apartment renters in a city. We
are trying to predict rent (in dollars per month) based on
the size of the apartment (number of rooms) and the
distance from downtown (in miles).

Rent ($)    Number of rooms    Distance (miles)
[Y]         [X1]               [X2]
 360        2                   1
1000        6                   1
 450        3                   2
 525        4                   3
 350        2                  10
 300        1                   4
Regression equation:
RENT = 96.458 +136.485 NUM_ROOM –2.403 DISTANCE
► b1 = 136.485 means that, keeping distance constant,
rent will increase, on average, by about $136
with each additional room.
► b2 = −2.403 means that, keeping the number of
rooms constant, rent will decrease, on
average, by $2.403 with each one-mile
increase in distance from downtown.
► In this example b0 has no valid interpretation.
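The slides report the fitted equation without the calculation. One way to reproduce it, assuming ordinary least squares with two regressors, is to solve the normal equations in deviation form; the sketch below does this in plain Python. The helper names `mean` and `cross` are illustrative, and statistical software would normally be used instead.

```python
# Apartment data from the example.
rent = [360, 1000, 450, 525, 350, 300]   # Y, dollars per month
rooms = [2, 6, 3, 4, 2, 1]               # X1, number of rooms
dist = [1, 1, 2, 3, 10, 4]               # X2, miles from downtown


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


s11, s22, s12 = cross(rooms, rooms), cross(dist, dist), cross(rooms, dist)
s1y, s2y, syy = cross(rooms, rent), cross(dist, rent), cross(rent, rent)

# Solve the 2x2 normal equations by Cramer's rule:
#   s11*b1 + s12*b2 = s1y
#   s12*b1 + s22*b2 = s2y
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(rent) - b1 * mean(rooms) - b2 * mean(dist)

# Proportion of variation in rent explained by the two regressors.
r_squared = (b1 * s1y + b2 * s2y) / syy

print(f"RENT = {b0:.3f} + {b1:.3f} NUM_ROOM + ({b2:.3f}) DISTANCE")
print(f"R^2 = {r_squared:.3f}")
```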
Goodness of fit
► R² = 92%
A high value of R² indicates that much of the
variation in rent is explained by the
regressors, number of rooms and distance
from downtown.
QUESTION: Which regressor is relatively more important in
explaining variation in the response variable?
Answer: Use the standardized regression coefficients.
Standardized Regression Coefficient
(Beta Coefficient)
► Often the independent variables are measured in different
units. The standardized coefficients, or betas, are an
attempt to make the regression coefficients more
comparable. A high absolute value of a standardized coefficient
(beta coefficient) indicates the relative importance of the
independent variable.
The beta coefficients are
► NUM_ROOM = 0.943 (relatively more important)
► DIS = −0.031
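A common definition of the beta coefficient, assumed here because the slides do not state the formula, is the regression coefficient multiplied by the ratio of the regressor's standard deviation to the standard deviation of the response. The sketch below applies that definition to the apartment data, using the rounded coefficients from the regression equation.

```python
from statistics import stdev

# Apartment data from the example.
rent = [360, 1000, 450, 525, 350, 300]
rooms = [2, 6, 3, 4, 2, 1]
dist = [1, 1, 2, 3, 10, 4]

# Unstandardized coefficients as reported in the regression equation.
b_rooms, b_dist = 136.485, -2.403

# beta_j = b_j * (standard deviation of X_j) / (standard deviation of Y)
beta_rooms = b_rooms * stdev(rooms) / stdev(rent)
beta_dist = b_dist * stdev(dist) / stdev(rent)

print(f"beta NUM_ROOM = {beta_rooms:.3f}")
print(f"beta DISTANCE = {beta_dist:.3f}")
```

Because the betas are free of units, their absolute values can be compared directly, which is what makes number of rooms the more important regressor here.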
CORRELATION
Correlation can be defined as the degree
of association (relationship) between two or
more variables.
► Marks of students in physics are associated with their
marks in mathematics.
► The cost of a commodity in the market is related to
the quantity of the commodity available for sale in the
market.
Types of correlation
(number of variables)
► Simple Correlation
The degree of relationship existing between two variables is
called simple correlation.
► Multiple Correlation
The degree of relationship connecting three or more
variables is called multiple correlation.
Scatter plot
► A plot of one variable against another is called a
scatter plot.
► The scatter plot indicates
  ► the nature of the relationship (positive, negative, none)
  ► the strength of the relationship (strong, moderate, weak)
Types of correlation
(direction of change)
► Positive Correlation
When variables tend to change together in
the same direction, e.g. the quantity of a
commodity supplied and its price.
► Negative Correlation
When variables tend to change in opposite
directions, e.g. the quantity demanded and the
price of a normal good.
► Uncorrelated
Two variables are uncorrelated when they
tend to change with no connection to
each other, e.g. the height of people and
steel production.
Possible patterns in Scatter plot
[Figure: scatter plots illustrating each pattern]
► Perfect positive linear correlation: r = 1
► Strong positive linear correlation: r is close to 1
► Weak positive linear correlation: r is close to 0
► No correlation: r = 0
► Perfect negative linear correlation: r = −1
► Strong negative linear correlation: r is close to −1
► Weak negative linear correlation: r is close to 0
Measurement of correlation
► We can determine the kind of correlation between two variables by
direct observation of the scatter plot.
► If the points lie close to a straight line, the correlation is strong.
► Inspection of the scatter diagram gives only a rough idea of the
relationship between the two variables.
• For a precise quantitative measurement of the degree of correlation
between two variables we use a quantity called the correlation
coefficient.
• Simple linear correlation coefficient: it is used to
measure the strength of the linear relationship between two variables.
$$r = \frac{S_{XY}}{\sqrt{S_X^2\,S_Y^2}}, \qquad -1 \le r \le +1$$
Limitations
► The simple correlation coefficient measures only the linear
relationship between variables, i.e. even if r = 0 the variables may be
related nonlinearly.
► Although correlation measures the co-variability of variables, it
does not imply any functional relationship between the
variables. It discovers existing co-variation but does not
establish or prove any causal relationship between the
variables.
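The first limitation can be seen with a small constructed data set, which is an illustration and not taken from the slides: Y is determined exactly by X through Y = X², yet the linear correlation coefficient is zero because the X values are symmetric about zero.

```python
from math import sqrt

# Constructed data: Y is an exact (nonlinear) function of X.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of cross-products and squares of deviations; the n - 1 divisors cancel in r.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)
print(f"r = {r}")
```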
Example: Suppose we want to compute the correlation coefficient
between the quantity supplied (Y) and price (X) with the following set
of observations on the variables.

Time period    Quantity supplied Yi (tons)    Price Xi (shillings)
 1              10                             2
 2              20                             4
 3              50                             6
 4              40                             8
 5              50                            10
 6              60                            12
 7              80                            14
 8              90                            16
 9              90                            18
10             120                            20
n = 10, ΣYi = 610, ΣXi = 110, ΣXY = 8520, ΣX² = 1540, ΣY² = 47700

[Figure: scatter plot of price against quantity supplied]

$$S_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n-1} = \frac{1}{n-1}\left[\sum XY - \frac{\sum X \sum Y}{n}\right] = \frac{1}{9}\left[8520 - \frac{(110)(610)}{10}\right] = 201.11$$
$$S_X^2 = \frac{\sum (X - \bar{X})^2}{n-1} = \frac{1}{n-1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right] = \frac{1}{9}\left[1540 - \frac{(110)^2}{10}\right] = 36.67$$
$$S_Y^2 = \frac{\sum (Y - \bar{Y})^2}{n-1} = \frac{1}{n-1}\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right] = \frac{1}{9}\left[47700 - \frac{(610)^2}{10}\right] = 1165.56$$
$$r = \frac{S_{XY}}{\sqrt{S_X^2\,S_Y^2}} = \frac{201.11}{\sqrt{(36.67)(1165.56)}} = 0.973$$
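The correlation coefficient for the supply data can be recomputed from the raw sums. The sketch below is plain Python with illustrative variable names and follows the same shortcut formulas for S_XY, S²_X and S²_Y.

```python
from math import sqrt

# Quantity supplied (tons) and price (shillings) for the ten time periods.
quantity = [10, 20, 50, 40, 50, 60, 80, 90, 90, 120]   # Y
price = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]           # X
n = len(price)

sum_x, sum_y = sum(price), sum(quantity)
sum_xy = sum(xi * yi for xi, yi in zip(price, quantity))
sum_x2 = sum(xi ** 2 for xi in price)
sum_y2 = sum(yi ** 2 for yi in quantity)

# Sample covariance and variances, divisor n - 1.
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
s_x2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s_y2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)

r = s_xy / sqrt(s_x2 * s_y2)

print(f"S_XY = {s_xy:.2f}, S_X^2 = {s_x2:.2f}, S_Y^2 = {s_y2:.2f}")
print(f"r = {r:.3f}")
```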
Amazing correlations
High positive correlation between
► Temperature in Faisalabad and employment rate
► Import of banana and Divorce rate
► Strength of police force and number of crimes
The following situations may bring
about a high correlation:
► X is the cause of Y.
► Y is the cause of X.
► There is a third factor Z that affects both X and Y, so
that they show a close relation.
► The correlation between X and Y may be due to
chance.
Correlation when variables are not quantitative
► The simple correlation coefficient is based on the assumption that the
variables involved are quantitative.
► However, in many cases the variables may be qualitative and cannot
be measured numerically.
► In such cases the data values are assigned ranks (1, 2, 3, …, n), the
relationship between the ranks of the variables is measured rather than
their actual numerical values, and the correlation coefficient is called
the Spearman rank correlation coefficient.
NOTE: If one variable is quantitative and the other is
qualitative, assign ranks to both variables and then
calculate the rank correlation coefficient.
Example: Calculate the rank correlation coefficient between
interview grade and test score.

Student    Interview grade    Test score    Rank (interview grade)    Rank (test score)    D       D²
1          A                  60            1.5                       4                    −2.5     6.25
2          B                  61            3                         3                     0       0
3          A                  50            1.5                       5                    −3.5    12.25
4          C                  72            4                         1                     3       9
5          D                  70            5                         2                     3       9
ΣD² = 36.5

$$r' = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6(36.5)}{5(5^2 - 1)} = -0.825$$
The negative value indicates that there is no
agreement between the two methods, so these
methods are not good for judging the students.
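The ranking step, including the shared rank of 1.5 for the two students with grade A, can be made explicit in code. The sketch below is plain Python; `average_ranks` is a hypothetical helper that gives tied values the mean of the positions they occupy, and it relies on the letter grades sorting alphabetically from best (A) to worst (D).

```python
grades = ["A", "B", "A", "C", "D"]   # interview grades, A is best
scores = [60, 61, 50, 72, 70]        # test scores, higher is better


def average_ranks(values, reverse=False):
    """Rank values from 1; tied values share the mean of their positions."""
    ordered = sorted(values, reverse=reverse)
    positions = {}
    for pos, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


rank_grade = average_ranks(grades)                 # A -> 1.5, B -> 3, C -> 4, D -> 5
rank_score = average_ranks(scores, reverse=True)   # highest score gets rank 1

d = [g - s for g, s in zip(rank_grade, rank_score)]
sum_d2 = sum(di ** 2 for di in d)
n = len(d)

# Spearman rank correlation via the shortcut formula used in the example.
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"D = {d}")
print(f"sum of D^2 = {sum_d2}, rank correlation = {r_s:.3f}")
```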