
Unit I: Basic Regression Analysis

Nature and Scope of Econometrics; Simple Regression Model: Specification, OLS Method, Assumptions of CLRM and Gauss-Markov Theorem, Hypothesis Testing and Goodness of Fit; Extensions of Simple Regression Model: Regression through Origin, Scaling and Units of Measurement, Functional Forms of Regression Model; Maximum Likelihood Estimation
Nature and Scope of Econometrics
Definition
Goldberger defines econometrics as the social science in which the tools of
economic theory, mathematics and statistical inference are applied to the
analysis of economic phenomena.
It is concerned with the empirical determination of economic relationships. In econometric analysis we formulate economic theory in mathematical terms and combine it with empirical measurement of economic phenomena. The most important characteristic of econometric relationships is that they contain a random element, which is ignored by economic theory and mathematical economics, both of which postulate exact relationships between the various economic magnitudes.
(Diagram: econometrics as the combination of economic theory, mathematical economics, and statistical methods.)
Methodology of econometrics

Simple and multiple regressions


If we are studying the dependence of a variable on only a single explanatory
variable, such a study is known as simple, or two-variable regression
analysis. However, if we are studying the dependence of one variable on
more than one explanatory variable, it is known as multiple regression
analysis.
Types of Data
A time series is a set of observations on the values that a variable takes at different times. Cross-section data are data on one or more variables collected at the same point in time. Pooled data combine elements of both time series and cross-section data.
THE METHOD OF ORDINARY LEAST SQUARES
The method of ordinary least squares is attributed to Carl Friedrich Gauss, a
German mathematician. A population regression curve is simply the locus of
the conditional means of the dependent variable for the fixed values of the
explanatory variable(s).

The least-squares criterion states that the sample regression function should be fitted in such a way that Σûi² = Σ(Yi − Ŷi)² is as small as possible, where ûi² are the squared residuals.
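As a minimal sketch of how this criterion is applied in practice (the data are illustrative, in the spirit of textbook income-consumption examples), the OLS estimates that minimize the sum of squared residuals have closed forms:

    import numpy as np

    # Illustrative data: X is the regressor, Y the regressand
    X = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
    Y = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)

    # OLS estimates that minimize the sum of squared residuals
    b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b1 = Y.mean() - b2 * X.mean()

    residuals = Y - (b1 + b2 * X)
    print(b1, b2, np.sum(residuals ** 2))  # intercept, slope, residual sum of squares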

Categories of econometrics

Functions of econometrics
Measurement of economic relations
In many cases we can apply the various econometric techniques in
order to obtain estimates of the individual coefficients of the
economic relationships, from which we may evaluate other
parameters of economic theory.
Verification of economic theory
Econometrics aims primarily at the verification of economic theories. In this case the purpose of the analysis is to obtain empirical evidence to test the explanatory power of economic theories, that is, to decide how well they explain the observed behavior of economic units.
Forecasting
In formulating policy decisions it is essential to be able to forecast
the value of the economic magnitudes. Such forecasts will enable
the policy makers to judge whether it is necessary to take any
measure in order to influence the relevant economic variable.
Regression analysis
The term regression was introduced by Francis Galton. Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more explanatory variables.

Maximum Likelihood Estimation


A method of point estimation with some stronger theoretical properties than
the method of OLS is the method of maximum likelihood (ML).
The method of maximum likelihood, as the name indicates, consists in
estimating the unknown parameters in such a manner that the probability of
observing the given Ys is as high as possible.
The ML estimator of σ² is Σûi²/n. This estimator is biased, whereas the OLS estimator of σ², Σûi²/(n − 2), as we have seen, is unbiased. However, as n increases indefinitely, the bias disappears: the ML estimator of σ² is asymptotically unbiased.
Under the normality assumption, the ML and OLS estimators of the intercept
and slope parameters of the regression model are identical.
ML method is generally called a large-sample method.
ML method is of broader application in that it can also be applied to
regression models that are nonlinear in the parameters.
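A numerical sketch of these points (the data-generating values are invented for the demonstration): maximizing the normal log-likelihood directly reproduces the OLS intercept and slope, while the implied variance estimate is Σûi²/n rather than the unbiased Σûi²/(n − 2).

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n = 50
    X = rng.uniform(0, 10, n)
    Y = 2.0 + 0.5 * X + rng.normal(0, 1.5, n)   # assumed beta1=2, beta2=0.5, sigma=1.5

    def neg_log_likelihood(params):
        b1, b2, sigma = params
        resid = Y - b1 - b2 * X
        return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0, 1.0],
                   bounds=[(None, None), (None, None), (1e-6, None)])
    b1_ml, b2_ml, sigma_ml = res.x

    # ML intercept/slope coincide with OLS; the variance estimators differ by df
    rss = np.sum((Y - b1_ml - b2_ml * X) ** 2)
    print(sigma_ml**2, rss / n, rss / (n - 2))  # ML sigma^2 ~ RSS/n; OLS uses RSS/(n-2)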
Gauss Markov Theorem
Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of unbiased linear estimators, have minimum variance; that is, they are BLUE.
The OLS estimator β̂2 is said to be a best linear unbiased estimator (BLUE) of β2 if the following hold:
1. It is linear, that is, a linear function of a random variable, such as the dependent variable Y in the regression model.
2. It is unbiased, that is, its average or expected value, E(β̂2), is equal to the true value, β2.
3. It has minimum variance in the class of all such linear unbiased estimators; an unbiased estimator with the least variance is known as an efficient estimator.
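A small simulation makes the BLUE property concrete (a sketch with made-up parameter values): across repeated samples with the X values held fixed, the OLS slope is unbiased, and its variance is smaller than that of another linear unbiased estimator, here the slope through the first and last observations.

    import numpy as np

    rng = np.random.default_rng(42)
    beta1, beta2, sigma = 1.0, 0.8, 2.0      # assumed true values
    X = np.arange(1.0, 21.0)                 # fixed in repeated sampling
    xdev = X - X.mean()

    ols, endpoints = [], []
    for _ in range(10_000):
        Y = beta1 + beta2 * X + rng.normal(0, sigma, X.size)
        ols.append(np.sum(xdev * (Y - Y.mean())) / np.sum(xdev**2))
        endpoints.append((Y[-1] - Y[0]) / (X[-1] - X[0]))   # also linear and unbiased

    # Both means are close to beta2 = 0.8, but OLS has the smaller variance (BLUE)
    print(np.mean(ols), np.var(ols))
    print(np.mean(endpoints), np.var(endpoints))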

Assumptions of Classical Linear Regression Model


1. Linear regression model.
The regression model is linear in the parameters:
Yi = β1 + β2Xi + ui
where
Yi = explained variable
β1 = intercept coefficient
β2 = slope coefficient
Xi = explanatory variable
ui = stochastic error term
Linear in parameters means that the parameters, the βs, are raised to the first power only.
2. X values are fixed in repeated sampling.
Values taken by the regressor X are considered fixed in repeated samples. More technically, X is assumed to be nonstochastic. It means that our regression analysis is conditional regression analysis, that is, conditioned on the given values of the regressor(s) X.
3. Zero mean value of disturbance ui.
Given the value of X, the mean, or expected, value of the random disturbance term ui is zero. Technically, the conditional mean value of ui is zero. Symbolically, we have
E(ui | Xi) = 0
Geometrically, the positive ui values cancel out the negative ui values so that their mean effect on Y is zero.
4. Homoscedasticity or equal variance of ui.
Given the value of X, the variance of ui is the same for all observations; that is, the conditional variances of ui are identical. Symbolically, we have
var(ui | Xi) = E[ui − E(ui | Xi)]²
= E(ui² | Xi), because of Assumption 3
= σ²
where var stands for variance.
5. No autocorrelation between the disturbances.
Given any two X values, Xi and Xj (i ≠ j), the correlation between any two disturbances ui and uj (i ≠ j) is zero. Symbolically,
cov(ui, uj | Xi, Xj) = E(ui | Xi)(uj | Xj) = 0
When we have access to a randomly drawn sample from a population, this will be the case.
6. Zero covariance between ui and Xi, or E(uiXi) = 0. Formally,
cov(ui, Xi) = E(uiXi) = 0
There must not be any relation between the disturbance term and the X variable; that is, they are uncorrelated. The variables left unaccounted for and absorbed into the disturbance should have no relationship with the variable X included in the model.
7. The number of observations n must be greater than the number of parameters to be estimated.
Alternatively, the number of observations n must be greater than the number of explanatory variables.
8. Variability in X values.
The X values in a given sample must not all be the same. Technically, var(X) must be a finite positive number. X cannot be a constant within a given sample, since we are interested in how variation in X affects variation in Y. If all values of X are identical, it will be impossible to estimate the parameters.
9. The regression model is correctly specified.
Alternatively, there is no specification bias or error in the model used in empirical analysis. By omitting important variables from the model, by choosing the wrong functional form, or by making wrong stochastic assumptions about the variables of the model, the validity of interpreting the estimated regression becomes highly questionable.
10. There is no perfect multicollinearity.
That is, there are no perfect linear relationships among the explanatory variables. This assumption relates to multiple regression models. It requires that in the regression function we include only those variables that are not exact linear functions of one or more other variables in the model.


Extensions of Simple Regression Model:


Regression through Origin
Regression through the origin is a situation where the intercept term, β1, is absent from the model:
Yi = β2Xi + ui

In the model with the intercept term absent, we use raw sums of squares and cross products, but in the intercept-present model we use adjusted (from the mean) sums of squares and cross products. The df for computing σ̂² is (n − 1) in the first case and (n − 2) in the second case. Σûi, which is always zero for the model with the intercept term, need not be zero when that term is absent. The coefficient of determination is always nonnegative for the conventional model, but it can on occasion turn out to be negative for the interceptless model.
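A short sketch of these differences (simulated data, names illustrative): fitting the same data with and without an intercept shows that the residuals of the interceptless model need not sum to zero, and its conventionally computed R² can even be negative.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(1, 10, 30)
    Y = 5.0 + 0.4 * X + rng.normal(0, 1, 30)   # data generated WITH an intercept

    # Through-origin estimate uses raw sums: b2 = sum(XY)/sum(X^2)
    b2_origin = np.sum(X * Y) / np.sum(X ** 2)
    resid_origin = Y - b2_origin * X

    # Conventional estimate uses mean-adjusted sums
    b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b1 = Y.mean() - b2 * X.mean()
    resid = Y - b1 - b2 * X

    tss = np.sum((Y - Y.mean()) ** 2)
    print(resid.sum())          # ~0 by construction when the intercept is present
    print(resid_origin.sum())   # generally nonzero without the intercept
    print(1 - np.sum(resid_origin**2) / tss)  # this "R^2" can fall below zero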
Scaling and Units of Measurement
The units and scale in which the regressand and the regressor(s) are expressed are very important because the interpretation of regression coefficients critically depends on them.
Consider the model
Yi = β1 + β2Xi + ui
Now rescale Yi and Xi using the constants w1 and w2, called the scale factors, so that Yi* = w1Yi and Xi* = w2Xi, and run the regression in the starred variables:
Yi* = β1* + β2*Xi* + ui*

If the scaling factors are equal, w1 = w2, the slope coefficient and its standard error remain unaffected; the intercept and its standard error are both multiplied by w1.
If the X scale is not changed (i.e., w2 = 1) and the Y scale is
changed by the factor w1, the slope as well as the intercept
coefficients and their respective standard errors are all multiplied
by the same w1 factor.
If the Y scale remains unchanged (i.e., w1 = 1) but the X scale is
changed by the factor w2, the slope coefficient and its standard
error are multiplied by the factor (1/w2) but the intercept coefficient
and its standard error remain unaffected.
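These rules can be verified with a quick sketch (illustrative data; the factor of 1000 stands in for any unit change, such as kilograms to tonnes):

    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.uniform(0, 100, 40)
    Y = 10 + 0.3 * X + rng.normal(0, 2, 40)

    def ols(x, y):
        b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        return y.mean() - b2 * x.mean(), b2

    print(ols(X, Y))                # original intercept and slope
    print(ols(X, 1000 * Y))         # w1 only: both coefficients multiplied by 1000
    print(ols(1000 * X, Y))         # w2 only: slope multiplied by 1/1000, intercept unchanged
    print(ols(1000 * X, 1000 * Y))  # w1 = w2: slope unchanged, intercept multiplied by 1000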

Functional Forms of Regression Model


Regression models can take the following functional forms:

Log-log model (double log)
In the log-log model both the regressand and the regressor(s) are expressed in logarithmic form. One attractive feature of the log-log model is that the slope coefficient β2 represents the elasticity of Y with respect to X. It is to be noted that the elasticity coefficient between Y and X remains constant throughout, hence the alternative name constant elasticity model.

Log-linear model
In the log-linear model, the regressand is expressed in logarithmic form while the regressor enters linearly. In this model, the slope coefficient β2 measures the relative change in Y for a given absolute change in X; 100 times β2 gives the growth rate, or the semielasticity, of Y with respect to X.

Linear-log model
In the linear-log model, the regressor is expressed in logarithmic form. In this model, the slope coefficient β2 represents the absolute change in Y for a percent change in X. An interesting application has been found in the so-called Engel expenditure models: the total expenditure that is devoted to food tends to increase in arithmetic progression as total expenditure increases in geometric progression.

Reciprocal models
In the reciprocal models, either the regressand or the regressor is expressed in reciprocal, or inverse, form to capture nonlinear relationships between economic variables, as in the celebrated Phillips curve. As X increases indefinitely, the term β2(1/X) approaches zero.

Log hyperbola or logarithmic reciprocal model
The log reciprocal model takes the form
ln Yi = β1 − β2(1/Xi) + ui
Initially Y increases at an increasing rate and then it increases at a decreasing rate. Such a model may therefore be appropriate for modelling a short-run production function.
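As a sketch of a functional form in use (simulated demand data; the constant elasticity of −1.2 is invented), a log-log model is estimated by running OLS on the logged variables, and the slope is read directly as the elasticity:

    import numpy as np

    rng = np.random.default_rng(3)
    price = rng.uniform(1, 20, 60)
    quantity = 100 * price ** (-1.2) * np.exp(rng.normal(0, 0.1, 60))  # elasticity -1.2

    x, y = np.log(price), np.log(quantity)   # double-log transformation
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b1 = y.mean() - b2 * x.mean()
    print(b2)   # estimated elasticity of quantity with respect to price, close to -1.2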
Unit II: Multiple Regression Models

Model Specification, Interpretation of the Multiple Regression Equation, Testing Hypotheses: Individual Partial Regression Coefficients and Overall Significance, Goodness of Fit: F-tests, R² and Adjusted R².

Multiple Regression Models
Multiple regression models are models in which the dependent variable, or regressand, Y depends on two or more explanatory variables, or regressors.

Interpretation of the Multiple Regression Equation
Consider the following multiple regression model:
Yi = β1 + β2X2i + β3X3i + ui
where
Y = dependent variable
X2 and X3 = explanatory variables
u = stochastic disturbance term
β1 = intercept term
β2 and β3 = partial slope coefficients
β2 measures the change in the mean value of Y, E(Y), per unit change in X2, holding the value of X3 constant. β3 measures the change in the mean value of Y per unit change in X3, holding the value of X2 constant.

Goodness of Fit
Goodness of fit considers how well the sample regression line fits the data.

R² and Adjusted R²
The coefficient of determination R² is a summary measure of goodness of fit. It tells what proportion of the variation in the dependent variable is explained by the explanatory variable(s). It is a nonnegative quantity lying between zero and one: 0 ≤ R² ≤ 1. The closer R² is to 1, the better the fit. An important property of R² is that it is a nondecreasing function of the number of explanatory variables or regressors present in the model; as the number of regressors increases, R² almost invariably increases and never decreases.
To compare two R² terms, one must take into account the number of X variables present in the model. This can be done readily by considering an alternative coefficient of determination, the adjusted R²:
adjusted R² = 1 − (1 − R²)(n − 1)/(n − k)
where k = the number of parameters in the model including the intercept term. The term adjusted means adjusted for the df associated with the sums of squares. Unlike R², the adjusted R² will increase only if the absolute t value of the added variable is greater than 1. For comparative purposes, therefore, the adjusted R² is a better measure than R².
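A sketch of the adjustment at work (invented data; X3 is pure noise): adding an irrelevant regressor nudges R² up but typically pushes adjusted R² down.

    import numpy as np

    rng = np.random.default_rng(11)
    n = 40
    X2 = rng.uniform(0, 10, n)
    X3 = rng.normal(size=n)                  # irrelevant regressor (pure noise)
    Y = 3 + 1.5 * X2 + rng.normal(0, 2, n)

    def r2_and_adj(Xcols, y):
        X = np.column_stack([np.ones(len(y))] + Xcols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        k = X.shape[1]                       # parameters including the intercept
        return r2, 1 - (1 - r2) * (len(y) - 1) / (len(y) - k)

    print(r2_and_adj([X2], Y))        # baseline model
    print(r2_and_adj([X2, X3], Y))    # R^2 creeps up; adjusted R^2 typically falls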
Unit V: Dummy Variable Models
Qualitative Independent Variables: Qualitative Variables with Two
Categories and Many Categories, Estimating Seasonal Effects, Testing for
Structural Change, Piecewise Linear Regression, Qualitative and Limited
Dependent Variable: Binary Choice Model, Probit Model, Logit Model,
Limited Dependent Variable

Dummy Variables
Qualitative variables usually indicate the presence or absence of a quality
or an attribute, such as male or female, black or white, Democrat or
Republican. We can construct artificial variables that take on values of 1 or
0, 1 indicating the presence of that attribute and 0 indicating the absence of
that attribute.
For example, 1 may indicate that a person is female and 0 may designate a male. Variables that assume such 0 and 1 values are called dummy variables.
Dummy variable trap
Where you have a dummy variable for each category or group and also an intercept, you have a case of perfect collinearity, that is, exact linear relationships among the variables. This situation is called the dummy variable trap. If a qualitative variable has m categories, introduce only (m − 1) dummy variables.

ANOVA models
Regression models which contain regressors that are all exclusively dummy, or qualitative, in nature are called analysis of variance (ANOVA) models.

ANCOVA models
Regression models containing an admixture of quantitative and qualitative variables are called analysis of covariance (ANCOVA) models.
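A minimal ANCOVA-style sketch (the wage data and gender dummy are invented for illustration), which also shows the (m − 1) rule: one dummy for two categories, alongside an intercept:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 100
    experience = rng.uniform(0, 20, n)            # quantitative regressor
    female = rng.integers(0, 2, n)                # dummy: 1 = female, 0 = male
    wage = 12 + 0.8 * experience - 3.0 * female + rng.normal(0, 2, n)

    # One intercept + (m - 1) = 1 dummy avoids the dummy variable trap
    X = np.column_stack([np.ones(n), experience, female])
    beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
    print(beta)   # ~[12, 0.8, -3]: the dummy shifts the intercept between groups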

Estimating Seasonal Effects
Many economic time series based on monthly or quarterly data exhibit seasonal patterns. Often it is desirable to remove the seasonal factor from a time series so that one can concentrate on the other components, such as the trend. The process of removing the seasonal component from a time series is known as deseasonalization. Dummy variables can be used to deseasonalize economic time series. Consider the following model:
Yt = α1D1t + α2D2t + α3D3t + α4D4t + ut
where Yt = sales of refrigerators and the Ds are the dummies, taking a value of 1 in the relevant quarter and 0 otherwise. If there is any seasonal effect in a given quarter, it will be indicated by a statistically significant t value of the dummy coefficient for that quarter.

Testing for Structural Change
(The dummy variable alternative to the Chow test)
Dummy variables can be used as an alternative to the Chow test to examine structural stability. When we examine two periods, the introduction of the dummy variable D in the multiplicative form enables us to differentiate between the slope coefficients of the two periods, just as the introduction of the dummy variable in the additive form enables us to distinguish between the intercepts of the two periods.
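A sketch of the seasonal-dummy regression (the quarterly sales figures are simulated and the α values invented): regressing Yt on four quarterly dummies without an intercept recovers the quarterly means, and deseasonalized values are the residuals plus the overall mean.

    import numpy as np

    rng = np.random.default_rng(9)
    quarters = np.tile([1, 2, 3, 4], 8)                   # 8 years of quarterly data
    alpha = {1: 100.0, 2: 120.0, 3: 90.0, 4: 140.0}       # assumed seasonal means
    y = np.array([alpha[q] for q in quarters]) + rng.normal(0, 5, quarters.size)

    # Four dummies, no intercept: Yt = a1*D1t + a2*D2t + a3*D3t + a4*D4t + ut
    D = np.column_stack([(quarters == q).astype(float) for q in (1, 2, 3, 4)])
    a, *_ = np.linalg.lstsq(D, y, rcond=None)
    print(a)                                              # ~[100, 120, 90, 140]

    deseasonalized = (y - D @ a) + y.mean()               # residuals + overall mean
    print(deseasonalized[:8])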

Piecewise Linear Regression
A piecewise linear regression consists of two linear segments, labeled I and II. Assume that a company remunerates its sales representatives based on sales in such a manner that up to a certain threshold level, X*, there is one commission structure and beyond that level another. The technique of dummy variables can be used to estimate the differing slopes of the two segments of the piecewise linear regression:
Yi = α1 + β1Xi + β2(Xi − X*)Di + ui
where
Yi = sales commission
Xi = volume of sales generated by the sales person
X* = threshold value of sales
Di = 1 if Xi > X*; = 0 otherwise
Here β1 gives the slope of the regression line in segment I, and β1 + β2 gives the slope of the regression line in segment II. A test of the hypothesis that there is no break in the regression at the threshold value X* can be conducted easily by noting the statistical significance of the estimated differential slope coefficient β2.
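A minimal simulation of this model (the threshold and coefficient values are invented): the constructed regressor (Xi − X*)Di lets OLS fit both slopes in one equation.

    import numpy as np

    rng = np.random.default_rng(13)
    n = 80
    X = rng.uniform(0, 200, n)
    Xstar = 100.0                                  # assumed sales threshold
    D = (X > Xstar).astype(float)
    Y = 5 + 0.10 * X + 0.08 * (X - Xstar) * D + rng.normal(0, 1, n)

    Z = np.column_stack([np.ones(n), X, (X - Xstar) * D])
    a1, b1, b2 = np.linalg.lstsq(Z, Y, rcond=None)[0]
    print(b1, b1 + b2)   # slope in segment I (~0.10) and in segment II (~0.18)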

QUALITATIVE RESPONSE REGRESSION MODELS
There are three approaches to developing a probability model for a binary response variable:
1. The linear probability model (LPM)
2. The logit model
3. The probit model

The linear probability model (LPM)
In the LPM, the regressand is binary, or dichotomous. The model is called a probability model because the conditional expectation of Yi given Xi, E(Yi | Xi), can be interpreted as the conditional probability that the event will occur given Xi, that is, Pr(Yi = 1 | Xi). Consider the model
Yi = β1 + β2Xi + ui
where X = family income and Y = 1 if the family owns a house and 0 if it does not own a house.
If Pi = the probability that Yi = 1 (that is, the event occurs) and (1 − Pi) = the probability that Yi = 0 (that is, the event does not occur), the variable Yi has the following probability distribution:

Yi       Probability
0        1 − Pi
1        Pi
Total    1

That is, Yi follows the Bernoulli probability distribution. Now, by the definition of mathematical expectation, we obtain
E(Yi) = 0(1 − Pi) + 1(Pi) = Pi
and
E(Yi | Xi) = β1 + β2Xi = Pi
Since the probability Pi must lie between 0 and 1, we have the restriction 0 ≤ E(Yi | Xi) ≤ 1; that is, the conditional expectation (or conditional probability) must lie between 0 and 1.

Problems of LPM
Non-normality of the disturbances ui
Heteroscedastic variances of the disturbances
Nonfulfillment of 0 ≤ E(Yi | Xi) ≤ 1
Questionable value of R² as a measure of goodness of fit
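A sketch of the LPM's best-known defect (simulated income/ownership data, names illustrative): OLS on a binary regressand can yield fitted "probabilities" outside [0, 1].

    import numpy as np

    rng = np.random.default_rng(21)
    n = 200
    income = rng.uniform(5, 60, n)                        # illustrative family income
    true_p = 1 / (1 + np.exp(-(income - 30) / 6))         # true ownership probability
    owns = (rng.uniform(size=n) < true_p).astype(float)   # Y = 1 if family owns a house

    X = np.column_stack([np.ones(n), income])
    b1, b2 = np.linalg.lstsq(X, owns, rcond=None)[0]
    fitted = b1 + b2 * income
    print(fitted.min(), fitted.max())   # LPM fits often stray below 0 or above 1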


Applications of LPM
Until the availability of readily accessible computer packages to estimate the logit and probit models, the LPM was used quite extensively because of its simplicity. Some of the applications include:
Labor force participation
Predicting a bond rating
Predicting bond defaults
The logit model
In the logit model, the dependent variable is the logarithm of the odds ratio, L = ln[Pi/(1 − Pi)], called the logit. The probability function that underlies the logit model is the logistic distribution.
As P goes from 0 to 1 (i.e., as Z varies from −∞ to +∞), the logit L goes from −∞ to +∞. If L, the logit, is positive, it means that when the value of the regressor(s) increases, the odds that the regressand equals 1 increase; if L is negative, the odds that the regressand equals 1 decrease as the value of X increases. The slope β2 measures the change in L for a unit change in X; that is, it tells how the log-odds in favor of owning a house change as income changes by one unit. The intercept β1 is the value of the log-odds in favor of owning a house if income is zero.
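A sketch of fitting a logit by maximum likelihood, reusing the simulated income/ownership idea above (statsmodels is one common choice, assuming it is installed):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(21)
    income = rng.uniform(5, 60, 200)
    true_p = 1 / (1 + np.exp(-(income - 30) / 6))
    owns = (rng.uniform(size=200) < true_p).astype(float)

    X = sm.add_constant(income)          # intercept plus income
    logit_res = sm.Logit(owns, X).fit(disp=0)
    print(logit_res.params)              # beta1, beta2 on the log-odds scale
    print(np.exp(logit_res.params[1]))   # odds ratio per unit increase in income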

The probit model
The estimating model that emerges from the normal CDF is popularly known as the probit model. It is also known as the normit model.
Suppose the decision of the ith family to own a house or not depends on an unobservable utility index Ii that is determined by one or more explanatory variables, say income Xi, in such a way that the larger the value of the index Ii, the greater the probability of the family owning a house. We express the index Ii as
Ii = β1 + β2Xi
where Xi is the income of the ith family.
It is reasonable to assume that there is a critical or threshold level of the index, call it Ii*, such that if Ii exceeds Ii*, the family will own a house, otherwise it will not. The threshold Ii*, like Ii, is not observable, but if we assume that it is normally distributed with the same mean and variance, it is possible not only to estimate the parameters of the index but also to get some information about the unobservable index itself. Given the assumption of normality, the probability that Ii* is less than or equal to Ii can be computed from the standardized normal CDF as
Pi = P(Y = 1 | X) = P(Zi ≤ β1 + β2Xi)
where P(Y = 1 | X) means the probability that the event occurs given the values of X, and Zi is the standard normal variable. Since P represents the probability that the event will occur, here the probability of owning a house, it is measured by the area under the standard normal curve from −∞ to Ii.

Logit and Probit Models
The logistic distribution has slightly fatter tails; that is, the conditional probability Pi approaches zero or one at a slower rate in the logit model than in the probit model. Both models give similar results, but many researchers choose the logit model because of its comparative mathematical simplicity.

The Tobit Model
(Limited Dependent Variable Model / Censored Regression Model)
An extension of the probit model is the tobit model, originally developed by James Tobin. In this model, the response variable is observed only if certain conditions are met. Thus, the question of how much one spends on a house is meaningful only if one decides to buy a house. If a consumer does not purchase a house, we obviously have no data on housing expenditure for such consumers; we have such data only on consumers who actually purchase a house.
Thus consumers are divided into two groups: one consisting of n1 consumers about whom we have information on the regressors as well as the regressand, and another consisting of n2 consumers about whom we have information only on the regressors but not on the regressand. A sample in which information on the regressand is available only for some observations is known as a censored sample; therefore, the tobit model is also known as a censored regression model or a limited dependent variable regression model.
The model can be expressed as
Yi = β1 + β2Xi + ui   if RHS > 0
   = 0                otherwise
In the usual scatter diagram, if Y is not observed, all such observations (= n2), denoted by crosses, lie on the horizontal axis; if Y is observed, the observations (= n1), denoted by dots, lie in the X-Y plane. The parameters obtained from OLS applied to the subset of n1 observations will be biased as well as inconsistent; therefore, the maximum likelihood method is used.
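Since plain OLS cannot handle the censoring, the tobit likelihood can be maximized directly; the following is a minimal sketch for left-censoring at zero, with invented parameter values:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(17)
    n = 300
    X = rng.uniform(0, 10, n)
    y_latent = -4 + 1.0 * X + rng.normal(0, 2, n)   # assumed latent index
    Y = np.maximum(y_latent, 0.0)                   # observed only when positive

    def neg_loglik(params):
        b1, b2, sigma = params
        xb = b1 + b2 * X
        obs = Y > 0
        ll_obs = norm.logpdf((Y[obs] - xb[obs]) / sigma) - np.log(sigma)  # uncensored
        ll_cens = norm.logcdf(-xb[~obs] / sigma)                          # censored at 0
        return -(ll_obs.sum() + ll_cens.sum())

    res = minimize(neg_loglik, x0=[0.0, 0.5, 1.0],
                   bounds=[(None, None), (None, None), (1e-6, None)])
    print(res.x)   # ML estimates close to (-4, 1.0, 2); OLS on the Y > 0 subset is biased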
