Variables
TYPES OF RELATIONSHIP
Regression
► Regression analysis is a statistical technique
for investigating and modeling the relationship
between variables.
► Regression investigates the dependence of one variable, called the
dependent variable and denoted by Y, on one or more other variables,
called independent variables and denoted by X's. It provides an
equation for estimating or predicting the average value of the
dependent variable from known values of the independent variables.
Regression Analysis
► Regression Analysis is used to estimate a function f( )
that describes the relationship between a continuous
dependent variable and one or more independent
variables.
Y = f(X1, X2, X3,…, Xk) + ε
Note:
• f( ) describes systematic variation in the relationship.
• ε represents the unsystematic variation (or random error) in
the relationship
· Where Y = dependent variable (also called response, predictand, or regressand)
X = independent variable (also called stimulus, predictor, or regressor)
Examples
► Sales=f(Adv.Expenditure)+E
► Fiber=f(Weight of jute plant)+E
► Consumption Exp.=f( Income) +E
► Yield=f( fertilizer, seed rate, rainfall)+E
► Marks=f(Study hours, IQ level)+E
► Demand=f(Price, Price of related commodities,
Consumer income, Consumer taste, Adv. Expenses
for creation of demand)+E
Model building with one
regressor
Example: Consider the relationship between
PRICE (Y) and square feet living area SQFT (X)
of a house.
• There probably is a relationship...
...as SQFT increases, PRICE should
increase.
• But how would we measure and quantify this
relationship?
Consider the data on sale price (in thousands of dollars) and living
area (in square feet) of 14 houses in a particular location. Estimate
the relationship between PRICE and SQFT.

PRICE(Y)   SQFT(X)
199.9      1065
228        1254
235        1300
285        1577
239        1600
293        1750
285        1800
365        1870
295        1935
290        1948
385        2254
505        2600
425        2800
415        3000
Scatter plot

[Scatter plot of PRICE against SQFT]

The observed data points do not all fall on a straight line but cluster
about it. Many lines can be drawn through the data points; the problem
is to select among them.
Method of LEAST SQUARE
► The method of LEAST SQUARE results in a line that minimizes the sum
of squared vertical distances from the observed data points to the
line (i.e. the random errors). Any other line has a larger sum.
Best fit line to the data
LEAST SQUARE LINE
A least square line is described in terms of its Y-
intercept (the height at which it intercepts the Y-
axis) and its slope (the angle of the line). The line
can be expressed by the following relation
Ŷ = bo + b1X   (estimated regression of Y on X)

Where

► b1 = slope of the line:  b1 = S_XY / S²_X
► bo = intercept of the line:  bo = Ȳ − b1·X̄

S_XY = Σ(X − X̄)(Y − Ȳ) / (n − 1) = [ΣXY − (ΣX)(ΣY)/n] / (n − 1) = 46315.31

S²_X = Σ(X − X̄)² / (n − 1) = [ΣX² − (ΣX)²/n] / (n − 1) = 333803.30

b1 = S_XY / S²_X = 46315.31 / 333803.30 = 0.13875

bo = Ȳ − b1·X̄ = 317.49 − (0.13875)(1910.93) = 52.351

Ŷ = 52.35 + 0.1388X
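As a check of the arithmetic above, the slope and intercept can be recomputed from the raw data. A minimal sketch in Python, assuming the 14 (PRICE, SQFT) pairs are typed in exactly as tabulated earlier:

```python
# Least-squares fit for the house-price data, using the same
# S_XY / S²_X formulas as the slide (PRICE in thousands of dollars).
price = [199.9, 228, 235, 285, 239, 293, 285, 365, 295, 290, 385, 505, 425, 415]
sqft  = [1065, 1254, 1300, 1577, 1600, 1750, 1800, 1870, 1935, 1948, 2254, 2600, 2800, 3000]

n = len(price)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

# S_XY = [ΣXY − (ΣX)(ΣY)/n] / (n − 1)
s_xy = (sum(x * y for x, y in zip(sqft, price)) - sum(sqft) * sum(price) / n) / (n - 1)
# S²_X = [ΣX² − (ΣX)²/n] / (n − 1)
s_xx = (sum(x * x for x in sqft) - sum(sqft) ** 2 / n) / (n - 1)

b1 = s_xy / s_xx         # slope     ≈ 0.13875
b0 = y_bar - b1 * x_bar  # intercept ≈ 52.35
```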
Interpretation of the estimated
parameters
PRICE=52.35+0.1388SQFT
► The value of b1 = 0.13875 indicates that the average price of a
house is expected to go up by 0.13875 thousand dollars (i.e. $138.75)
with each one-square-foot increase in living area.
► The value of bo indicates that the estimated average price of an
empty lot (X = 0) is $52,351, but this interpretation is not always
valid. Be careful in interpreting the intercept coefficient when the
scope of the model does not cover X = 0.
Fitted Least Square Line

Ŷ = 52.35 + 0.1388·SQFT

PRICE(Y)   SQFT(X)   Ŷ
199.9      1065      200.12
228        1254      226.34
235        1300      232.73
285        1577      271.16
239        1600      274.35
(first five houses shown)

[Plot of observed Y and predicted Y against SQFT]
Ŷ = 52.35 + 0.1388X   (least squares line)
Ŷ = 50 + 0.1388X   (other line)

                    LEAST SQUARE LINE                OTHER LINE
Y        X          Ŷ        e = Y−Ŷ   e²            Ŷ        e        e²
199.9    1065       200.12   -0.22     0.05          197.82   2.08     4.32
228      1254       226.34   1.66      2.74          224.06   3.94     15.56
235      1300       232.73   2.27      5.17          230.44   4.56     20.79
285      1577       271.16   13.84     191.54        268.89   16.11    259.61
239      1600       274.35   -35.35    1249.72       272.08   -33.08   1094.29
293      1750       295.16   -2.16     4.68          292.90   0.10     0.01
285      1800       302.10   -17.10    292.46        299.84   -14.84   220.23
365      1870       311.81   53.19     2828.75       309.56   55.44    3074.04
295      1935       320.83   -25.83    667.33        318.58   -23.58   555.92
290      1948       322.64   -32.64    1065.14       320.38   -30.38   923.09
385      2254       365.09   19.91     396.24        362.86   22.14    490.39
505      2600       413.10   91.90     8445.30       410.88   94.12    8858.57
425      2800       440.85   -15.85    251.28        438.64   -13.64   186.05
415      3000       468.60   -53.60    2873.16       466.40   -51.40   2641.96
4444.9   26753      4444.9   0         18273.58      4413.32  31.58    18344.83
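The two error sums of squares in the table can be verified numerically. A short sketch, assuming the two candidate lines quoted above and the same 14 data points:

```python
# Comparing the SSE of the least-squares line with the "other" line
# ŷ = 50 + 0.1388x; the least-squares line should always have the
# smaller sum of squared vertical distances.
price = [199.9, 228, 235, 285, 239, 293, 285, 365, 295, 290, 385, 505, 425, 415]
sqft  = [1065, 1254, 1300, 1577, 1600, 1750, 1800, 1870, 1935, 1948, 2254, 2600, 2800, 3000]

def sse(b0, b1):
    """Sum of squared vertical distances from the points to b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))

sse_least_squares = sse(52.35, 0.13875)  # ≈ 18273.6 (table total 18273.58)
sse_other_line    = sse(50.00, 0.1388)   # ≈ 18344.8 (table total 18344.83)
```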
[Plot of observed and predicted PRICE against SQFT with the fitted line]

The observed values of (X, Y) do not all fall on the regression line but
scatter away from it. The degree of scatter of the observed values
about the regression line is measured by what is called the standard
error of estimate (or standard error of regression), denoted by Se.

S²e = Σ(Y − Ŷ)² / (n − 2)   OR   S²e = [ΣY² − bo·ΣY − b1·ΣXY] / (n − 2) = 1522.79

Se = 39.023

SE(bo) = Se·√(1/n + X̄²/((n − 1)·S²X)) = 37.285

SE(b1) = Se·√(1/((n − 1)·S²X)) = 0.01873
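The standard errors follow directly from the formulas above. A sketch, again assuming the data are typed in from the earlier table:

```python
# Standard error of estimate and standard errors of the coefficients
# for the house-price fit, following the slide's formulas.
import math

price = [199.9, 228, 235, 285, 239, 293, 285, 365, 295, 290, 385, 505, 425, 415]
sqft  = [1065, 1254, 1300, 1577, 1600, 1750, 1800, 1870, 1935, 1948, 2254, 2600, 2800, 3000]
n = len(price)
x_bar, y_bar = sum(sqft) / n, sum(price) / n

s_xy = (sum(x * y for x, y in zip(sqft, price)) - sum(sqft) * sum(price) / n) / (n - 1)
s_xx = (sum(x * x for x in sqft) - sum(sqft) ** 2 / n) / (n - 1)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# standard error of estimate: Se = sqrt(Σ(Y − Ŷ)² / (n − 2))
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))
se = math.sqrt(sse / (n - 2))                                   # ≈ 39.02

# standard errors of the intercept and slope
se_b0 = se * math.sqrt(1 / n + x_bar ** 2 / ((n - 1) * s_xx))   # ≈ 37.285
se_b1 = se * math.sqrt(1 / ((n - 1) * s_xx))                    # ≈ 0.01873
```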
Inference in Simple Linear Regression
(From samples to population)
► Generally, more is sought in regression
analysis than a description of observed data.
One usually wishes to draw inferences about
the relationship of the variables in the
population from which the sample was taken
► The slope and the intercept estimated from a
single sample typically differ from the
population values and vary from sample to
sample. To use these estimates for inference
about the population values, the sampling
distributions of the two statistics are needed
Test of hypothesis
1) Construction of hypotheses:
   Ho: β1 = 0
   H1: β1 ≠ 0
2) Level of significance: α = 5%
3) Test statistic: t = (b1 − β1)/SE(b1) = (0.1388 − 0)/0.01873 = 7.41*
4) Decision rule: Reject Ho if |t_cal| ≥ tα/2(n−2) = 2.56
5) Result: Since 7.41 ≥ 2.56, reject Ho and conclude that there is a
significant relationship between PRICE and SQFT.
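The test statistic is a one-line computation from the slide's estimates:

```python
# t statistic for H0: β1 = 0, using b1 and SE(b1) from the slides.
b1, se_b1 = 0.1388, 0.01873
t = (b1 - 0) / se_b1   # ≈ 7.41, well beyond the tabulated critical value
```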
Confidence intervals for regression
parameters
► A statistic calculated from a sample provides
a point estimate of the unknown parameter.
► A point estimate can be thought of as the
single best guess for the population value.
► While the estimated value from the sample
is typically different from the value of the
unknown population parameter, the hope is
that it isn't too far away.
► Based on the sample estimates, it is
possible to calculate a range of values that,
with a designated likelihood, includes the
population value. Such a range is called a
confidence interval.
95% C.I. for β1

b1 ± tα/2(n−2)·SE(b1)
0.1388 ± t.025(12)·(0.01873)
(0.0909, 0.1867)

A 95% C.I. can be interpreted as follows: if we take 100 samples of
the same size under the same conditions and compute 100 C.I.s for the
parameter, one from each sample, then 95 such C.I.s will contain the
parameter (i.e. not all of the constructed C.I.s). A confidence
interval estimate of a parameter is more informative than a point
estimate because it provides a range of plausible values rather than
a single guess.
Ho: β1 = 0
H1: β1 ≠ 0
95% C.I. for β1: (0.0909, 0.1867). Since this interval does not
contain 0, reject Ho.

Total variation = Explained variation + Unexplained variation
(variation due to unknown factors)

Total variation = (n − 1)·S²y = 101815
Explained variation = b1·(n − 1)·S_xy = 83541
Unexplained variation = 101815 − 83541 = 18274
Goodness of Fit
A commonly used measure of the goodness of fit of a linear model is
R², called the coefficient of determination. If all the observations
fall on the regression line, R² is 1. If there is no linear
relationship between Y and X, R² is 0.
The coefficient of determination tells us the proportion of variation
in the dependent variable explained by the independent variable.

Coefficient of determination R² = (Explained variation / Total variation) × 100 = 82%
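R² can also be computed from the data as 1 minus the ratio of unexplained to total variation. A sketch using the fitted line from the earlier slides:

```python
# R² for the house-price fit: 1 − (unexplained variation)/(total variation).
price = [199.9, 228, 235, 285, 239, 293, 285, 365, 295, 290, 385, 505, 425, 415]
sqft  = [1065, 1254, 1300, 1577, 1600, 1750, 1800, 1870, 1935, 1948, 2254, 2600, 2800, 3000]
n = len(price)
y_bar = sum(price) / n

sst = sum((y - y_bar) ** 2 for y in price)  # total variation ≈ 101815
# unexplained variation, using the slide's fitted line Ŷ = 52.35 + 0.13875X
sse = sum((y - (52.35 + 0.13875 * x)) ** 2 for x, y in zip(sqft, price))  # ≈ 18274
r2 = 1 - sse / sst                          # ≈ 0.82
```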
Testing the hypothesis β1 = 0 by the analysis of variance procedure.

ANOVA TABLE

S.O.V        DF            SS       MSS = SS/df   Fcal     Ftab                p-Value
Regression   1             83541    83541         54.86*   F.05(1,12) = 4.84   0.000
Error        14 − 2 = 12   18274    1523
TOTAL        14 − 1 = 13   101815
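The ANOVA entries follow from the sums of squares already computed. A sketch of the table's arithmetic:

```python
# ANOVA arithmetic from the explained/unexplained sums of squares above.
reg_ss, err_ss = 83541, 18274
reg_df, err_df = 1, 14 - 2

reg_ms = reg_ss / reg_df   # 83541
err_ms = err_ss / err_df   # ≈ 1523
f = reg_ms / err_ms        # ≈ 54.86; note F = t² (7.41² ≈ 54.9)
```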
Example [2]:
Find the least squares regression line for the data on incomes (in
hundreds of dollars) and food expenditure of the seven households.

A plot of paired observations is called a scatter diagram.

[Scatter diagram: food expenditure against income]
Scatter diagram and straight lines

[Scatter diagram with several candidate straight lines: food
expenditure against income]
Least Squares Line

[Least squares regression line through the scatter of food expenditure
against income, with the errors e shown as vertical distances]
Error Sum of Squares (SSE)

SSE = Σe² = Σ(y − ŷ)²

b = S_xy / S²_x   and   a = ȳ − b·x̄

S_xy = [Σxy − (Σx)(Σy)/n] / (n − 1)
S²_x = [Σx² − (Σx)²/n] / (n − 1)
Solution

Income   Food expenditure
x        y        xy       x²
35       9        315      1225
49       15       735      2401
21       7        147      441
39       11       429      1521
15       5        75       225
28       8        224      784
25       9        225      625
Σx=212   Σy=64    Σxy=2150  Σx²=7222
S²x = [Σx² − (Σx)²/n] / (n − 1) = [7222 − (212)²/7] / 6 = 133.571

S_xy = [Σxy − (Σx)(Σy)/n] / (n − 1) = [2150 − (212)(64)/7] / 6 = 35.285

b = S_xy / S²x = 35.285 / 133.571 = 0.2642

a = ȳ − b·x̄ = 9.1429 − (0.2642)(30.2857) = 1.1414

Fitted line: ŷ = 1.1414 + 0.2642x
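The same fit can be reproduced from the raw data. A sketch, assuming income and food expenditure are typed in from the table (both in hundreds of dollars):

```python
# Least-squares fit for the food-expenditure example.
income = [35, 49, 21, 39, 15, 28, 25]
food   = [9, 15, 7, 11, 5, 8, 9]
n = len(income)
x_bar, y_bar = sum(income) / n, sum(food) / n

s_xy = (sum(x * y for x, y in zip(income, food)) - sum(income) * sum(food) / n) / (n - 1)
s_xx = (sum(x * x for x in income) - sum(income) ** 2 / n) / (n - 1)

b = s_xy / s_xx        # ≈ 0.2642
a = y_bar - b * x_bar  # ≈ 1.142 (the slide's 1.1414 comes from rounding b to 0.2642)
```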
Error of prediction

ŷ = 1.1414 + 0.2642x

[Plot: for the household with income x = 35, predicted food
expenditure = $1038.84, actual = $900, error e = −$138.84]
Interpretation of a and b
ŷ = 1.1414 + .2642 X
Interpretation of a
Consider a household with zero income:
ŷ = 1.1414 + 0.2642(0) = 1.1414 hundred
Thus, we can state that a household with no income is expected to
spend $114.14 per month on food.
Interpretation of a and b cont.
ŷ = 1.1414 + .2642 X
Interpretation of b
The value of b in the regression model gives the change in y due to a
one-unit change in x.
We can state that, on average, a $1 increase in the income of a
household will increase food expenditure by $0.2642.
The regression line is valid only for values of x between 15 and 49
(the scope of the model).
Goodness of Fit
R2=92%
The value of R2 indicates that about
92% variation in the dependent
variable has been explained by the
linear relationship with X and
remaining are due to some other
unknown factors.
41
Positive and negative linear relationships between x and y

[Two panels: a line with b > 0 (positive relationship) and a line
with b < 0 (negative relationship)]
Example [3]:
A random sample of eight drivers insured with a company and having
similar auto insurance policies was selected. The following table
lists their driving experience (in years) and monthly auto insurance
premiums.

Driving experience (years)   Monthly auto insurance premium ($)
5        64
2        87
12       50
9        71
15       44
6        56
25       42
16       60
a) Does the insurance premium depend on the driving experience, or
does the driving experience depend on the insurance premium? Do you
expect a positive or a negative relationship between these two
variables?

The premium depends on experience, and a negative relationship is
expected.

[Scatter diagram of premium against experience: the relationship is
negative and moderate]
c) Find the least squares regression line by choosing appropriate
dependent and independent variables based on your answer in part a.

Experience   Premium
x        y        xy       x²       y²
5        64       320      25       4096
2        87       174      4        7569
12       50       600      144      2500
9        71       639      81       5041
15       44       660      225      1936
6        56       336      36       3136
25       42       1050     625      1764
16       60       960      256      3600
Σx=90    Σy=474   Σxy=4739  Σx²=1396  Σy²=29642
SS_xy = Σxy − (Σx)(Σy)/n = 4739 − (90)(474)/8 = −593.5

SS_xx = Σx² − (Σx)²/n = 1396 − (90)²/8 = 383.5

SS_yy = Σy² − (Σy)²/n = 29,642 − (474)²/8 = 1557.5
LEAST SQUARE REGRESSION
LINE
b = SS_xy / SS_xx = −593.5000 / 383.5000 = −1.5476

a = ȳ − b·x̄ = 59.25 − (−1.5476)(11.25) = 76.6605

ŷ = 76.6605 − 1.5476x
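The line can be recomputed from the raw data as a check. A sketch, assuming the eight (experience, premium) pairs from the table:

```python
# Least-squares fit for the insurance-premium example, using the
# SS_xy / SS_xx quantities from the slide.
exp_years = [5, 2, 12, 9, 15, 6, 25, 16]
premium   = [64, 87, 50, 71, 44, 56, 42, 60]
n = len(exp_years)

ss_xy = sum(x * y for x, y in zip(exp_years, premium)) - sum(exp_years) * sum(premium) / n  # −593.5
ss_xx = sum(x * x for x in exp_years) - sum(exp_years) ** 2 / n                             # 383.5

b = ss_xy / ss_xx                              # ≈ −1.5476
a = sum(premium) / n - b * sum(exp_years) / n  # ≈ 76.66
```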
d) Interpret the meaning of the
values of a and b calculated
ŷ = 76.6605 − 1.5476x

a = 76.6605 gives the value of ŷ for x = 0: the expected amount of the
monthly premium for a driver with no driving experience is about $76.66.

b = −1.5476 indicates that, on average, for every extra year of
driving experience, the monthly auto insurance premium decreases by
about $1.55.
f) Calculate coefficient of
determination
R² = 59%

59% of the total variation in insurance premiums is explained by years
of driving experience; the remaining 41% is due to other unknown
factors.
Predict the monthly auto insurance
for a driver with 10 years of driving
experience.
The predicted value of y for x = 10 is
ŷ = 76.6605 − 1.5476(10) = 61.18
so the expected monthly premium for a driver with 10 years of driving
experience is about $61.18.
Regression with more than one independent variable

Example: The following information has been gathered from a random
sample of apartment renters in a city. We are trying to predict rent
(in dollars per month) based on the size of the apartment (number of
rooms) and the distance from downtown (in miles).

Rent ($)   Number of rooms   Distance (miles)
[Y]        [X1]              [X2]
360        2                 1
1000       6                 1
450        3                 2
525        4                 3
350        2                 10
300        1                 4
Matrix Plot (to identify the relationships)

[Matrix plot of Y, X1, and X2]
Least squares regression by SPSS

Coefficients(a)
             Unstandardized Coefficients   Standardized Coefficients
Model        B         Std. Error          Beta      t       Sig.
1 (Constant) 96.458    118.121                       .817    .474
  X1         136.485   26.864              .943      5.081   .015
  X2         -2.403    14.171              -.031     -.170   .876
a. Dependent Variable: Y
Regression equation:
RENT = 96.458 + 136.485 NUM_ROOM − 2.403 DISTANCE
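The SPSS coefficients can be reproduced without any statistics package by solving the least-squares normal equations directly. A sketch in plain Python, assuming the six data rows from the rent table:

```python
# Reproducing the SPSS coefficients by solving the normal equations
# (XᵀX)b = Xᵀy with Gaussian elimination.
rent  = [360, 1000, 450, 525, 350, 300]
rooms = [2, 6, 3, 4, 2, 1]
dist  = [1, 1, 2, 3, 10, 4]

# design matrix rows [1, x1, x2] (intercept column first)
X = [[1.0, float(r), float(d)] for r, d in zip(rooms, dist)]

# normal equations: A = XᵀX (3x3), v = Xᵀy (length 3)
A = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
v = [sum(row[i] * y for row, y in zip(X, rent)) for i in range(3)]

# forward elimination with partial pivoting
for k in range(3):
    p = max(range(k, 3), key=lambda r: abs(A[r][k]))
    A[k], A[p] = A[p], A[k]
    v[k], v[p] = v[p], v[k]
    for r in range(k + 1, 3):
        m = A[r][k] / A[k][k]
        for c in range(k, 3):
            A[r][c] -= m * A[k][c]
        v[r] -= m * v[k]

# back substitution
b = [0.0, 0.0, 0.0]
for k in (2, 1, 0):
    b[k] = (v[k] - sum(A[k][c] * b[c] for c in range(k + 1, 3))) / A[k][k]

# b ≈ [96.458, 136.485, -2.403]: constant, NUM_ROOM, DISTANCE
```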
ANOVA considering both X1 & X2

ANOVA(b)
Model         Sum of Squares   df   Mean Square   F        Sig.
1 Regression  306910.2         2    153455.102    16.280   .025(a)
  Residual    28277.297        3    9425.766
  Total       335187.5         5
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y

Coefficients(a)
             Unstandardized Coefficients   Standardized Coefficients
Model        B         Std. Error          Beta      t       Sig.
1 (Constant) 96.458    118.121                       .817    .474
  X1         136.485   26.864              .943      5.081   .015
  X2         -2.403    14.171              -.031     -.170   .876
a. Dependent Variable: Y
ANOVA considering X1 only

ANOVA(b)
Model         Sum of Squares   df   Mean Square    F        Sig.
1 Regression  306639.1         1    306639.063     42.964   .003(a)

Coefficients(a)
             Unstandardized Coefficients   Standardized Coefficients
Model        B         Std. Error          Beta      t       Sig.
1 (Constant) 82.188    72.140                        1.139   .318
  X1         138.438   21.120              .956      6.555   .003
a. Dependent Variable: Y
QUESTION: Which regressor (independent variable) is relatively more
important in explaining variation in the response variable (dependent
variable)?

Answer: Use the standardized regression coefficients:

b1* = b1·(S_x1 / S_y)
b2* = b2·(S_x2 / S_y)

Y = RENT, X1 = number of rooms, X2 = distance
S_y = standard deviation of Y
S_x1, S_x2 = standard deviations of X1 and X2
b1, b2 = unstandardized coefficients
b1*, b2* = standardized coefficients
Standardized Regression Coefficient (Beta Coefficient)

Often the independent variables are measured in different units. The
standardized coefficients, or betas, are an attempt to make the
regression coefficients more comparable. A high standardized (beta)
coefficient indicates the relative importance of the independent
variable.

[SPSS coefficients table: Beta = .943 for X1 and −.031 for X2, so X1
(number of rooms) is the relatively more important regressor]
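The betas in the SPSS output can be recovered from the unstandardized coefficients with the b* = b·(Sx/Sy) formulas above. A sketch, assuming the rent data and the SPSS coefficients quoted earlier:

```python
# Standardized (beta) coefficients for the rent example.
import math

rent  = [360, 1000, 450, 525, 350, 300]
rooms = [2, 6, 3, 4, 2, 1]
dist  = [1, 1, 2, 3, 10, 4]

def sd(v):
    """Sample standard deviation (divisor n − 1)."""
    m = sum(v) / len(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

b1, b2 = 136.485, -2.403            # unstandardized coefficients (SPSS)
beta1 = b1 * sd(rooms) / sd(rent)   # ≈ 0.943, X1 relatively more important
beta2 = b2 * sd(dist) / sd(rent)    # ≈ -0.031
```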
[Four panels showing quadratic curves for the sign combinations:
β1 < 0, β2 > 0; β1 > 0, β2 > 0; β1 < 0, β2 < 0; β1 > 0, β2 < 0]

β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
Estimation of Quadratic Regression Model

Yi = β0 + β1·Xi + β2·Xi² + εi

Convert the 2nd-degree model to a multiple linear regression model by
using the transformation X1 = X and X2 = X²:

Yi = β0 + β1·X1i + β2·X2i + εi

The above model is a multiple linear regression model with two
regressors, where the 2nd regressor is the square of the 1st regressor.
Testing for Significance: Quadratic Model

► Testing the quadratic effect: compare the quadratic model
Yi = β0 + β1·Xi + β2·Xi² + εi
with the linear model
Yi = β0 + β1·Xi + εi

Hypotheses
► H0: β2 = 0 (no quadratic term)
► H1: β2 ≠ 0 (quadratic term is needed)
Heating Oil Example

[Data table: monthly heating-oil consumption (Y), temperature (X1),
and insulation (X2) for 15 homes; e.g. Y = 40.80, X1 = 73, X2 = 6]
Scatter Diagram

a) Oil used vs Temp: a 1st-degree (straight-line) fit is appropriate
b) Oil used vs Insulation: a 2nd-degree (quadratic) fit is appropriate
Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .986(a)  .973       .965                24.29378
a. Predictors: (Constant), X2_2, X1, X2

Coefficients(a)
             Unstandardized Coefficients   Standardized Coefficients
Model        B         Std. Error          Beta      t         Sig.
1 (Constant) 624.586   42.435                        14.719    .000
  X1         -5.363    .317                -.854     -16.910   .000
  X2         -44.587   14.955              -1.019    -2.981    .012
  X2_2       1.867     1.124               .568      1.661     .125
a. Dependent Variable: Y

Ŷ = 624.59 − 5.36·X1 − 44.59·X2 + 1.87·X2²
Test of overall significance of regression

Ho: β1 = β2 = β3 = 0
H1: At least one β is not zero

F = (RegSS / Reg df) / (ESS / E df) = RMS / EMS = 129.70*

ANOVA(b)
Model         Sum of Squares   df   Mean Square   F         Sig.
1 Regression  229643.2         3    76547.721     129.701   .000(a)
  Residual    6492.065         11   590.188
  Total       236135.2         14
a. Predictors: (Constant), X2_2, X1, X2
b. Dependent Variable: Y
Heating Oil Example: model with and without the quadratic insulation term

Full model:     Yi = β0 + β1·X1i + β2·X2i + β3·X2i² + εi
Reduced model:  Yi = β0 + β1·X1i + β2·X2i + εi

Hypotheses
► H0: β3 = 0 (no quadratic term in insulation)
► H1: β3 ≠ 0 (quadratic term in insulation is needed)
Test of significance of the quadratic term

Is the quadratic term in insulation needed in the model for monthly
consumption of heating oil? Test at α = 0.05.

H0: β3 = 0
H1: β3 ≠ 0
df = 11

Test statistic: t = (b3 − β3) / S_b3 = (1.8667 − 0) / 1.1238 = 1.6611

p-value = 0.1249

Decision: Do not reject H0 at α = 0.05.
Conclusion: There is not sufficient evidence that the quadratic term
in insulation is needed.
Full model:     Yi = β0 + β1·X1i + β2·X2i + β3·X2i² + εi   (H0: β3 = 0)
Reduced model:  Yi = β0 + β1·X1i + β2·X2i + εi

Unrestricted ANOVA (x1, x2, x2²)      Restricted ANOVA (x1, x2)
S.O.V        DF    SS                 S.O.V        DF    SS
Regression   3                        Regression   2
Error        11    ESS(UR)            Error        12    ESS(R)
Total        14                       Total        14
Unrestricted ANOVA
ANOVA(b)
Model         Sum of Squares   df
1 Regression  229643.2         3
  Residual    6492.065         11
  Total       236135.2         14
b. Dependent Variable: Y

Restricted ANOVA
ANOVA(b)
Model         Sum of Squares   df
1 Regression  228014.6         2
  Residual    8120.603         12
  Total       236135.2         14
b. Dependent Variable: Y
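The partial F statistic for dropping the quadratic term follows from the two residual sums of squares. A sketch using the ANOVA values above:

```python
# Partial F test for H0: β3 = 0, comparing the restricted and
# unrestricted models' residual sums of squares.
ess_ur, df_ur = 6492.065, 11   # unrestricted model (x1, x2, x2²)
ess_r,  df_r  = 8120.603, 12   # restricted model (x1, x2)

f = ((ess_r - ess_ur) / (df_r - df_ur)) / (ess_ur / df_ur)
# f ≈ 2.76, which matches t² = 1.6611² from the t test of the same hypothesis
```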