
Linear Regression Model

Classical Linear Regression Model

Regression is the single most important tool at the econometrician's disposal.

What is regression analysis?

It is concerned with describing and evaluating the relationship between a given variable (usually called the dependent variable) and one or more other variables (usually known as the independent variable(s)).
Classical Linear Regression Model

Denote the dependent variable by y and the independent variable(s) by x1, x2, ... ,
xk where there are k independent variables.

Some alternative names for the y and x variables:

y                      x
dependent variable     independent variables
regressand             regressors
effect variable        causal variables
explained variable     explanatory variables

Note that there can be many x variables but we will limit ourselves to the case
where there is only one x variable to start with. In our set-up, there is only one
y variable.
Classical Linear Regression Model

Regression is different from correlation

If we say y and x are correlated, it means that we are treating y and x in a completely symmetrical way.

In regression, we treat the dependent variable (y) and the independent variable(s) (the x's) very differently. The y variable is assumed to be random or "stochastic" in some way, i.e. to have a probability distribution. The x variables are, however, assumed to have fixed ("non-stochastic") values in repeated samples.
Classical Linear Regression Model

For simplicity, say k=1. This is the situation where y depends on only
one x variable.
Classical Linear Regression Model
Suppose that we have the following data on the excess returns on a fund
manager’s portfolio (“fund XXX”) together with the excess returns on a
market index:
Year, t   Excess return on fund XXX   Excess return on market index
          = rXXX,t − rft              = rmt − rft
1         17.8                        13.7
2         39.0                        23.2
3         12.8                         6.9
4         24.2                        16.8
5         17.2                        12.3

We have some intuition that the beta on this fund is positive, and we
therefore want to find whether there appears to be a relationship between x
and y given the data that we have. The first stage would be to form a scatter
plot of the two variables.
Classical Linear Regression Model

[Scatter plot: excess return on fund XXX (vertical axis, 0 to 45) against excess return on market portfolio (horizontal axis, 0 to 25)]
Classical Linear Regression Model
We can use the general equation for a straight line,

y = a + bx

to get the line that best “fits” the data.

However, this equation (y = a + bx) is completely deterministic.

Is this realistic? No.

So what we do is to add a random disturbance term, ut, into the equation:

yt = α + βxt + ut

where t = 1, 2, 3, 4, 5
Classical Linear Regression Model

The disturbance term can capture a number of features:

- We always leave out some determinants of yt
- There may be errors in the measurement of yt that cannot be modelled
- Random outside influences on yt which we cannot model


Classical Linear Regression Model

So how do we determine what α and β are?

Choose α̂ and β̂ so that the (vertical) distances from the data points to the fitted line are minimised (so that the line fits the data as closely as possible):

[Diagram: scatter of data points in the (x, y) plane with a fitted straight line; the vertical distances from the points to the line are the quantities to be minimised]
Classical Linear Regression Model

The most common method used to fit a line to the data is known as OLS
(ordinary least squares).

What we actually do is take each distance and square it (i.e. take the area
of each of the squares in the diagram) and minimise the total sum of the
squares (hence least squares).

Tightening up the notation, let

yt denote the actual data point t,
ŷt denote the fitted value from the regression line, and
ût denote the residual, yt − ŷt.
Classical Linear Regression Model

[Diagram: at a given xi, the residual ûi is the vertical distance between the actual observation yi and the fitted value ŷi on the regression line]
Classical Linear Regression Model
So minimise û1² + û2² + û3² + û4² + û5², or equivalently minimise Σ(t=1 to 5) ût². This is known as the residual sum of squares (RSS).

But what was ût? It was the difference between the actual point and the line, yt − ŷt.

So minimising Σ(yt − ŷt)² is equivalent to minimising Σ ût² with respect to α̂ and β̂.


Classical Linear Regression Model

But yˆ t  ˆ  ˆxt , so let L   ( y t  yˆ t ) 2   ( y t  ˆ  ˆxt ) 2


t i

Want to minimise L with respect to (w.r.t.) $ and $ , so differentiate L w.r.t.


$ and $
L
ˆ

 2 ( yt  ˆ  ˆxt )  0 (1)
t
L
 2 xt ( yt  ˆ  ˆxt )  0 (2)
ˆ t

From (1),  ( y t  ˆ  ˆxt )  0  y t  Tˆ  ˆ  xt  0


t

But  y t  Ty and  xt  Tx .
Classical Linear Regression Model

So we can write Tȳ − Tα̂ − Tβ̂x̄ = 0, or ȳ − α̂ − β̂x̄ = 0   (3)

From (2), Σt xt (yt − α̂ − β̂xt) = 0   (4)

From (3), α̂ = ȳ − β̂x̄   (5)

Substituting (5) into (4) for α̂:

Σt xt (yt − ȳ + β̂x̄ − β̂xt) = 0

Σ xtyt − ȳ Σ xt + β̂x̄ Σ xt − β̂ Σ xt² = 0

Σ xtyt − Tȳx̄ + β̂Tx̄² − β̂ Σ xt² = 0
Classical Linear Regression Model

Rearranging for β̂,

β̂ (Tx̄² − Σ xt²) = Tȳx̄ − Σ xtyt

So overall we have

β̂ = (Σ xtyt − Tx̄ȳ) / (Σ xt² − Tx̄²)   and   α̂ = ȳ − β̂x̄

This method of finding the optimum is known as ordinary least squares (OLS).


Classical Linear Regression Model

In the CAPM example, plugging the 5 observations into the formulae given above leads to the estimates α̂ = −1.74 and β̂ = 1.64. We would write the fitted line as:

ŷt = −1.74 + 1.64xt

Question: If an analyst tells you that she expects the market to yield a return 20% higher than the risk-free rate next year, what would you expect the return on fund XXX to be?

Solution: The expected value of y is −1.74 + 1.64 × (value of x), so plug x = 20 into the equation to get the expected value of y:

ŷ = −1.74 + 1.64 × 20 = 31.06
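As a quick check, these estimates can be reproduced in R from the five observations above (a minimal sketch; the variable names are our own):

# The five observations from the fund XXX example
x <- c(13.7, 23.2, 6.9, 16.8, 12.3)    # excess return on market index
y <- c(17.8, 39.0, 12.8, 24.2, 17.2)   # excess return on fund XXX

# OLS formulae derived above
n <- length(x)                          # sample size (T in the notes)
beta_hat  <- (sum(x * y) - n * mean(x) * mean(y)) / (sum(x^2) - n * mean(x)^2)
alpha_hat <- mean(y) - beta_hat * mean(x)
c(alpha_hat, beta_hat)                  # approximately -1.74 and 1.64

coef(lm(y ~ x))                         # R's built-in estimator gives the same values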
Classical Linear Regression Model: Specification

In order to use OLS, we need a model which is linear in the parameters (α and β). It does not necessarily have to be linear in the variables (y and x).

Linear in the parameters means that the parameters are not multiplied together, divided, squared or cubed, etc.

Some models can be transformed to linear ones by a suitable substitution or manipulation, e.g. the exponential regression model

Yt = e^α · Xt^β · e^(ut)   ⇔   ln Yt = α + β ln Xt + ut

Then let yt = ln Yt and xt = ln Xt:

yt = α + βxt + ut
Classical Linear Regression Model

This is known as the exponential regression model. Here, the coefficients can be interpreted as elasticities.

Similarly, if theory suggests that y and x should be inversely related:

yt = α + β(1/xt) + ut

then the regression can be estimated using OLS by substituting

zt = 1/xt
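A minimal sketch of how such transformed models could be estimated in R (the data frame dat and its columns are hypothetical):

# Log-log (exponential) model: ln Y = alpha + beta ln X + u
loglog_fit  <- lm(log(Y) ~ log(X), data = dat)

# Inverse model: the I() term plays the role of the substituted variable z = 1/x
inverse_fit <- lm(y ~ I(1 / x), data = dat)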
Classical Linear Regression Model

We observe data for xt, but since yt also depends on ut, we must be specific about how the ut are generated.

We usually make the following set of assumptions about the ut's (the unobservable error terms):

Technical notation      Interpretation
1. E(ut) = 0            The errors have zero mean
2. Var(ut) = σ²         The variance of the errors is constant and finite over all values of xt
3. Cov(ui, uj) = 0      The errors are statistically independent of one another
4. Cov(ut, xt) = 0      No relationship between the error and the corresponding x variate
Classical Linear Regression Model

A fifth assumption is required if we want to make inferences about the population parameters (the actual α and β) from the sample parameters (α̂ and β̂):

Additional assumption
5. ut is normally distributed
Classical Linear Regression Model

If assumptions 1 through 4 hold, then the estimators α̂ and β̂ determined by OLS are known as Best Linear Unbiased Estimators (BLUE). What does the acronym stand for?

"Estimator" - β̂ is an estimator of the true value of β.
"Linear" - β̂ is a linear estimator.
"Unbiased" - On average, the actual values of α̂ and β̂ will be equal to the true values.
"Best" - The OLS estimator β̂ has minimum variance among the class of linear unbiased estimators; the Gauss-Markov theorem proves that the OLS estimator is best in this sense.
Classical Linear Regression Model

Consistency
The least squares estimators α̂ and β̂ are consistent. That is, the estimates will converge to their true values as the sample size increases to infinity. We need the assumptions E(xtut) = 0 and Var(ut) = σ² < ∞ to prove this. Consistency implies that

lim(T→∞) Pr[ |β̂ − β| > δ ] = 0   for all δ > 0

Unbiasedness
The least squares estimators α̂ and β̂ are unbiased. That is, E(α̂) = α and E(β̂) = β. Thus on average the estimated values will be equal to the true values. To prove this also requires the assumption that E(ut) = 0. Unbiasedness is a stronger condition than consistency.

Efficiency
An estimator β̂ of parameter β is said to be efficient if it is unbiased and no other unbiased estimator has a smaller variance. If the estimator is efficient, we are minimising the probability that it is a long way off from the true value of β.
Classical Linear Regression Model
Any set of regression estimates α̂ and β̂ are specific to the sample used in their estimation.

The estimators of α and β from the sample are given by

β̂ = (Σ xtyt − Tx̄ȳ) / (Σ xt² − Tx̄²)   and   α̂ = ȳ − β̂x̄

What we need is some measure of the reliability or precision of the estimators (α̂ and β̂). The precision of an estimate is given by its standard error. Given assumptions 1-4 above, the standard errors can be shown to be given by

SE(α̂) = s √[ Σ xt² / (T Σ(xt − x̄)²) ] = s √[ Σ xt² / (T (Σ xt² − Tx̄²)) ]

SE(β̂) = s √[ 1 / Σ(xt − x̄)² ] = s √[ 1 / (Σ xt² − Tx̄²) ]

where s is the estimated standard deviation of the residuals.


Classical Linear Regression Model
The variance of the random variable ut is given by

Var(ut) = E[ut − E(ut)]²

which, since E(ut) = 0, reduces to Var(ut) = E(ut²).

We could estimate this using the average of ut²:

s² = (1/T) Σ ut²

Unfortunately this is not workable, since ut is not observable. We can use the sample counterpart to ut, which is ût:

s² = (1/T) Σ ût²

But this estimator is a biased estimator of σ². An unbiased estimator is given by

s² = Σ ût² / (T − 2)
Classical Linear Regression Model

Assume we have the following data, calculated from a regression of y on a single variable x and a constant over 22 observations.

Data:

Σ xtyt = 830102,  T = 22,  x̄ = 416.5,  ȳ = 86.65,  Σ xt² = 3919654,  RSS = 130.6

Calculations:

β̂ = (830102 − 22 × 416.5 × 86.65) / (3919654 − 22 × 416.5²) = 0.35

α̂ = 86.65 − 0.35 × 416.5 = −59.12

We write ŷt = α̂ + β̂xt:

ŷt = −59.12 + 0.35xt
Classical Linear Regression Model

SE of regression:  s = √[Σ ût² / (T − 2)] = √(130.6 / 20) = 2.55

SE(α̂) = 2.55 × √[3919654 / (22 × (3919654 − 22 × 416.5²))] = 3.35

SE(β̂) = 2.55 × √[1 / (3919654 − 22 × 416.5²)] = 0.0079

We now write the results as

ŷt = −59.12 + 0.35xt
     (3.35)    (0.0079)
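The whole worked example can be verified in R from the summary statistics alone (a sketch; the variable names are our own):

Sxy <- 830102; Sxx <- 3919654; n <- 22; xbar <- 416.5; ybar <- 86.65; RSS <- 130.6

beta_hat  <- (Sxy - n * xbar * ybar) / (Sxx - n * xbar^2)   # 0.35
alpha_hat <- ybar - beta_hat * xbar                         # -59.12
s         <- sqrt(RSS / (n - 2))                            # 2.55
se_alpha  <- s * sqrt(Sxx / (n * (Sxx - n * xbar^2)))       # 3.35
se_beta   <- s * sqrt(1 / (Sxx - n * xbar^2))               # 0.0079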
Classical Linear Regression Model

We want to make inferences about the likely population values from the
regression parameters.

Example: Suppose we have the following regression results:


yˆ t  20.3  0.5091xt
(14.38) (0.2561)
$  0.5091is a single (point) estimate of the unknown population
parameter, . How “reliable” is this estimate?

The reliability of the point estimate is measured by the coefficient’s


standard error.
Classical Linear Regression Model

Assume the regression equation is given by

yt = α + βxt + ut,   t = 1, 2, ..., T

The steps involved in doing a test of significance are:

1. Estimate α̂, β̂ and SE(α̂), SE(β̂) in the usual way.

2. Calculate the test statistic. This is given by the formula

test statistic = (β̂ − β*) / SE(β̂)

where β* is the value of β under the null hypothesis.
Classical Linear Regression Model

Using the regression results above,

ŷt = 20.3 + 0.5091xt,   T = 22
     (14.38)  (0.2561)

test the hypothesis that β = 1 against a two-sided alternative, using both the test of significance and confidence interval approaches.

The first step is to obtain the critical value. We want tcrit = t20;5%.
Classical Linear Regression Model

The hypotheses are:

H0: β = 1
H1: β ≠ 1

Test of significance approach:

test stat = (β̂ − β*) / SE(β̂) = (0.5091 − 1) / 0.2561 = −1.917

Do not reject H0 since the test statistic lies within the non-rejection region.

Confidence interval approach:

β̂ ± tcrit × SE(β̂) = 0.5091 ± 2.086 × 0.2561 = (−0.0251, 1.0433)

Since 1 lies within the confidence interval, do not reject H0.
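Both approaches can be reproduced in R directly from the reported numbers (a sketch):

beta_hat <- 0.5091; se_beta <- 0.2561; n <- 22
t_stat <- (beta_hat - 1) / se_beta             # -1.917
t_crit <- qt(0.975, df = n - 2)                # 2.086 for a 5% two-sided test
ci <- beta_hat + c(-1, 1) * t_crit * se_beta   # (-0.0251, 1.0433)
abs(t_stat) > t_crit                           # FALSE, so do not reject H0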
Multiple Linear Regression Model

Before, we have used the two-variable model

yt = α + βxt + ut,   t = 1, 2, ..., T

But what if our dependent (y) variable depends on more than one independent variable?
For example the number of cars sold might plausibly depend on
1. the price of cars
2. the price of public transport
3. the price of petrol
4. the extent of the public’s concern about global warming

Similarly, stock returns might depend on several factors.

Having just one independent variable is no good in this case - we want to have more than one x variable. It is very easy to generalise the simple model to one with k − 1 regressors (independent variables).
Multiple Linear Regression Model

Now we write

yt = β1 + β2x2t + β3x3t + ... + βkxkt + ut,   t = 1, 2, ..., T

Where is x1? It is the constant term. In fact the constant term is usually represented by a column of ones of length T:

x1 = [1  1  ...  1]'

β1 is the coefficient attached to the constant term (which we called α before).
Multiple Linear Regression Model

We could write out a separate equation for every value of t:

y1 = β1 + β2x21 + β3x31 + ... + βkxk1 + u1
y2 = β1 + β2x22 + β3x32 + ... + βkxk2 + u2
⋮
yT = β1 + β2x2T + β3x3T + ... + βkxkT + uT

We can write this in matrix form:

y = Xβ + u

where y is T × 1, X is T × k, β is k × 1 and u is T × 1.
Multiple Linear Regression Model

e.g. if k is 2, we have 2 regressors, one of which is a column of ones:

[ y1 ]   [ 1  x21 ]            [ u1 ]
[ y2 ] = [ 1  x22 ]  [ β1 ]  + [ u2 ]
[ ⋮  ]   [ ⋮    ⋮ ]  [ β2 ]    [ ⋮  ]
[ yT ]   [ 1  x2T ]            [ uT ]

 T×1       T×2        2×1       T×1

Notice that the matrices written in this way are conformable.


Multiple Linear Regression Model

Previously, we took the residual sum of squares and minimised it w.r.t. α and β.

In the matrix notation, we have

û = [û1  û2  ...  ûT]'

The RSS would be given by

û'û = [û1  û2  ...  ûT] [û1  û2  ...  ûT]' = û1² + û2² + ... + ûT² = Σ ût²
Multiple Linear Regression Model

In order to obtain the parameter estimates β1, β2, ..., βk, we would minimise the RSS with respect to all the βs.

It can be shown that

β̂ = [β̂1  β̂2  ...  β̂k]' = (X'X)⁻¹X'y
Multiple Linear Regression Model

Check the dimensions: β̂ is k × 1 as required.

But how do we calculate the standard errors of the coefficient estimates?

Previously, to estimate the variance of the errors, σ², we used s² = Σ ût² / (T − 2).

Now, using the matrix notation, we use

s² = û'û / (T − k)

where k = number of regressors. It can be proved that the OLS estimator of the variance of β̂ is given by the diagonal elements of s²(X'X)⁻¹, so that the variance of β̂1 is the first diagonal element, the variance of β̂2 is the second, ..., and the variance of β̂k is the kth diagonal element.
Multiple Linear Regression Model

Example: The following model with k = 3 is estimated over 15 observations:

y = β1 + β2x2 + β3x3 + u

and the following data have been calculated from the original X's:

            [  2.0   3.5  −1.0 ]           [ −3.0 ]
(X'X)⁻¹ =   [  3.5   1.0   6.5 ] ,  X'y =  [  2.2 ] ,  û'û = 10.96
            [ −1.0   6.5   4.3 ]           [  0.6 ]

Calculate the coefficient estimates and their standard errors.

To calculate the coefficients, just multiply the matrix by the vector to obtain (X'X)⁻¹X'y.

To calculate the standard errors, we need an estimate of σ²:

s² = RSS / (T − k) = 10.96 / (15 − 3) = 0.91
Multiple Linear Regression Model

The variance-covariance matrix of β̂ is given by

                             [  1.83   3.20  −0.91 ]
s²(X'X)⁻¹ = 0.91·(X'X)⁻¹ =   [  3.20   0.91   5.94 ]
                             [ −0.91   5.94   3.93 ]

The variances are on the leading diagonal:

Var(β̂1) = 1.83  ⇒  SE(β̂1) = 1.35
Var(β̂2) = 0.91  ⇒  SE(β̂2) = 0.96
Var(β̂3) = 3.93  ⇒  SE(β̂3) = 1.98

We write:

ŷ = 1.10 − 4.40x2t + 19.88x3t
    (1.35)  (0.96)    (1.98)
Multiple Linear Regression Model

Recall that the formula for a test of significance approach to hypothesis testing using a t-test was

test statistic = (β̂i − βi*) / SE(β̂i)

If the test is

H0: βi = 0
H1: βi ≠ 0

i.e. a test that the population coefficient is zero against a two-sided alternative, this is known as a t-ratio test.

Since βi* = 0, test stat = β̂i / SE(β̂i).

The ratio of the coefficient to its SE is known as the t-ratio or t-statistic.


Multiple Linear Regression Model

Testing for the presence and significance of abnormal returns ("Jensen's alpha" - Jensen, 1968).

The data: annual returns on the portfolios of 115 mutual funds from 1945-1964.

The model: Rjt − Rft = αj + βj(Rmt − Rft) + ujt   for j = 1, ..., 115

We are interested in the significance of αj.

The null hypothesis is H0: αj = 0.

Testing Multiple Hypotheses: The F-test

We used the t-test to test single hypotheses, i.e. hypotheses involving only
one coefficient. But what if we want to test more than one coefficient
simultaneously?

We do this using the F-test. The F-test involves estimating 2 regressions.

The unrestricted regression is the one in which the coefficients are freely
determined by the data, as we have done before.

The restricted regression is the one in which the coefficients are restricted, i.e. the restrictions are imposed on some βs.
Testing Multiple Hypotheses: The F-test

Example
The general regression is

yt = β1 + β2x2t + β3x3t + β4x4t + ut   (1)

We want to test the restriction that β3 + β4 = 1 (we have some hypothesis from theory which suggests that this would be an interesting hypothesis to study). The unrestricted regression is (1) above, but what is the restricted regression?

yt = β1 + β2x2t + β3x3t + β4x4t + ut   s.t.   β3 + β4 = 1

We substitute the restriction (β3 + β4 = 1) into the regression so that it is automatically imposed on the data:

β3 + β4 = 1   ⇒   β4 = 1 − β3
Testing Multiple Hypotheses: The F-test

yt = 1 + 2x2t + 3x3t + (1-3)x4t + ut


yt = 1 + 2x2t + 3x3t + x4t - 3x4t + ut

Gather terms in ’s together and rearrange


(yt - x4t) = 1 + 2x2t + 3(x3t - x4t) + ut

This is the restricted regression. We actually estimate it by creating two new


variables, call them, say, Pt and Qt.
Pt = yt - x4t
Qt = x3t - x4t so
Pt = 1 + 2x2t + 3Qt + ut is the restricted regression we actually
estimate.
Testing Multiple Hypotheses: The F-test

The test statistic is given by

test statistic = [(RRSS − URSS) / URSS] × [(T − k) / m]

where URSS = RSS from the unrestricted regression
RRSS = RSS from the restricted regression
m = number of restrictions
T = number of observations
k = number of regressors in the unrestricted regression, including a constant (i.e. the total number of parameters to be estimated).
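A sketch of how this F-test could be computed in R for the example above (the data frame dat with columns y, x2, x3, x4 is hypothetical):

unrestricted <- lm(y ~ x2 + x3 + x4, data = dat)
restricted   <- lm(I(y - x4) ~ x2 + I(x3 - x4), data = dat)   # imposes beta3 + beta4 = 1

URSS <- sum(resid(unrestricted)^2)
RRSS <- sum(resid(restricted)^2)
n <- nrow(dat); k <- 4; m <- 1
F_stat <- ((RRSS - URSS) / URSS) * ((n - k) / m)
F_stat > qf(0.95, df1 = m, df2 = n - k)   # TRUE means reject the restriction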
Testing Multiple Hypotheses: The F-test

The test statistic follows the F-distribution, which has two degrees of freedom (d.f.) parameters.

The values of the d.f. parameters are m and (T − k) respectively (the order of the d.f. parameters is important). The appropriate critical value will be in column m, row (T − k) of the F-tables.

The F-distribution has only positive values and is not symmetrical. We therefore only reject the null if the test statistic > critical F-value.
Testing Multiple Hypotheses: The F-test

Examples:

H0: hypothesis                    No. of restrictions
β1 + β2 = 2                       1
β2 = 1 and β3 = −1                2
β2 = 0, β3 = 0 and β4 = 0         3

If the model is yt = β1 + β2x2t + β3x3t + β4x4t + ut, then the null hypothesis H0: β2 = 0 and β3 = 0 and β4 = 0 is tested by the regression F-statistic. It tests the null hypothesis that all of the coefficients except the intercept coefficient are zero.

Note the form of the alternative hypothesis for all tests when more than one restriction is involved: H1: β2 ≠ 0, or β3 ≠ 0, or β4 ≠ 0.
Goodness of the fit of the Model

We would like some measure of how well our regression model actually fits the data.

We have goodness of fit statistics to test this: i.e. how well the sample regression function (SRF) fits the data.

The most common goodness of fit statistic is known as R². One way to define R² is to say that it is the square of the correlation coefficient between y and ŷ.

For another explanation, recall that what we are interested in doing is explaining the variability of y about its mean value, ȳ, i.e. the total sum of squares, TSS:

TSS = Σt (yt − ȳ)²

We can split the TSS into two parts: the part which we have explained (known as the explained sum of squares, ESS) and the part which we did not explain using the model (the RSS).
Defining R2

That is, TSS = ESS + RSS:

Σt (yt − ȳ)² = Σt (ŷt − ȳ)² + Σt ût²

Our goodness of fit statistic is

R² = ESS / TSS

But since TSS = ESS + RSS, we can also write

R² = ESS/TSS = (TSS − RSS)/TSS = 1 − RSS/TSS

R² must always lie between zero and one. To understand this, consider two extremes:

RSS = TSS, i.e. ESS = 0, so R² = ESS/TSS = 0
ESS = TSS, i.e. RSS = 0, so R² = ESS/TSS = 1
Goodness of the fit of the Model

There are a number of issues with R²:

1. R² never falls if more regressors are added to the regression, e.g. consider:

Regression 1: yt = β1 + β2x2t + β3x3t + ut
Regression 2: yt = β1 + β2x2t + β3x3t + β4x4t + ut

R² will always be at least as high for regression 2 relative to regression 1.

2. R² quite often takes on values of 0.9 or higher for time series regressions.
Goodness of the fit of the Model

In order to get around these problems, a modification is often made which takes into account the loss of degrees of freedom associated with adding extra variables. This is known as R̄², or adjusted R²:

R̄² = 1 − [(T − 1)/(T − k)](1 − R²)

So if we add an extra regressor, k increases and, unless R² increases by a more than offsetting amount, R̄² will actually fall.
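A one-line check of the formula in R (a sketch, using numbers that also appear in the dummy variable example later in these notes):

adj_r2 <- function(r2, n, k) 1 - ((n - 1) / (n - k)) * (1 - r2)
adj_r2(0.616, n = 74, k = 3)   # approximately 0.605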
Violation of the Assumptions of the CLRM

Recall that we assumed of the CLRM disturbance terms:

1. E(ut) = 0
2. Var(ut) = σ² < ∞
3. Cov(ui, uj) = 0
4. The X matrix is non-stochastic or given in repeated samples
5. ut ~ N(0, σ²)
Assumption 1: E(ut) = 0

• The assumption is that the mean of the disturbances is zero.

• For all diagnostic tests, we cannot observe the disturbances, so we perform the tests on the residuals.

• The mean of the residuals will always be zero provided that there is a constant term in the regression.
Assumption 2: Var(ut) = σ² < ∞

We have so far assumed that the variance of the errors is constant, σ² - this is known as homoscedasticity. If the errors do not have a constant variance, we say that they are heteroscedastic, e.g. say we estimate a regression and calculate the residuals, ût.

[Diagram: residuals ût plotted against x2t, with their spread fanning out as x2t increases - a typical picture of heteroscedasticity]
Detection of Heteroscedasticity

Graphical methods

Formal tests: White's test is one of the most general tests for heteroscedasticity.

The test is carried out as follows:

1. Assume that the regression we carried out is as follows:

yt = β1 + β2x2t + β3x3t + ut

and we want to test Var(ut) = σ². We estimate the model, obtaining the residuals ût.

2. Then run the auxiliary regression:

ût² = α1 + α2x2t + α3x3t + α4x2t² + α5x3t² + α6x2tx3t + vt
Performing White’s Test for Heteroscedasticity

3. Obtain R² from the auxiliary regression and multiply it by the number of observations, T. It can be shown that

T·R² ~ χ²(m)

where m is the number of regressors in the auxiliary regression, excluding the constant term.

4. If the χ² test statistic from step 3 is greater than the corresponding critical value from the statistical tables, then reject the null hypothesis that the disturbances are homoscedastic.
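A sketch of White's test in R for the two-regressor model above (the data frame dat is hypothetical; the lmtest package offers a canned version via bptest()):

fit <- lm(y ~ x2 + x3, data = dat)
u2  <- resid(fit)^2
aux <- lm(u2 ~ x2 + x3 + I(x2^2) + I(x3^2) + I(x2 * x3), data = dat)

stat <- nrow(dat) * summary(aux)$r.squared   # T times R-squared
m <- 5                                       # auxiliary regressors, excluding the constant
stat > qchisq(0.95, df = m)                  # TRUE means reject homoscedasticity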
How Do we Deal with Heteroscedasticity?

If the form (i.e. the cause) of the heteroscedasticity is known, then we can use an estimation method which takes this into account (called generalised least squares, GLS).

A simple illustration of GLS is as follows. Suppose that the error variance is related to another variable zt by

var(ut) = σ²zt²

To remove the heteroscedasticity, divide the regression equation by zt:

yt/zt = β1(1/zt) + β2(x2t/zt) + β3(x3t/zt) + vt

where vt = ut/zt is an error term.

Now var(vt) = var(ut/zt) = var(ut)/zt² = σ²zt²/zt² = σ² for known zt.
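A sketch of this GLS correction in R, assuming zt is a column z of a hypothetical data frame dat:

# Divide every term, including the intercept, by z (0 + removes the default intercept)
gls_fit <- lm(I(y / z) ~ 0 + I(1 / z) + I(x2 / z) + I(x3 / z), data = dat)

# Equivalently, weighted least squares with weights proportional to 1/z^2
wls_fit <- lm(y ~ x2 + x3, data = dat, weights = 1 / z^2)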
Autocorrelation

We assumed of the CLRM's errors that Cov(ui, uj) = 0 for i ≠ j.

This is essentially the same as saying there is no pattern in the errors.

Obviously we never have the actual u's, so we use their sample counterpart, the residuals (the ût's).

If there are patterns in the residuals from a model, we say that they are autocorrelated.
Positive Autocorrelation

[Diagrams: ût plotted against ût−1 and against time; the residuals stay positive or negative for long runs]

Positive autocorrelation is indicated by a cyclical residual plot over time.
Negative Autocorrelation

[Diagrams: ût plotted against ût−1 and against time; the residuals tend to alternate in sign from one period to the next]

Negative autocorrelation is indicated by an alternating pattern where the residuals cross the time axis more frequently than if they were distributed randomly.
No pattern in residuals - no autocorrelation

[Diagrams: ût plotted against ût−1 and against time, showing no discernible pattern]

No pattern in residuals at all: this is what we would like to see


Detecting Autocorrelation:
The Durbin-Watson Test

The Durbin-Watson (DW) test is a test for first-order autocorrelation - i.e. it assumes that the relationship is between an error and the previous one:

ut = ρut−1 + vt   (1)

where vt ~ N(0, σv²).

The DW test statistic actually tests H0: ρ = 0 against H1: ρ ≠ 0.

The test statistic is calculated by

DW = Σ(t=2 to T) (ût − ût−1)² / Σ(t=2 to T) ût²
Detecting Autocorrelation:
The Durbin-Watson Test

Conditions which Must be Fulfilled for DW to be a Valid Test


1. Constant term in regression
2. Regressors are non-stochastic
3. No lags of dependent variable
Consequences of Ignoring Autocorrelation
if it is Present

The coefficient estimates derived using OLS are still unbiased, but they
are inefficient, i.e. they are not BLUE, even in large sample sizes.

Thus, if the standard error estimates are inappropriate, there exists the
possibility that we could make the wrong inferences.

R² is likely to be inflated relative to its "correct" value for positively correlated residuals.
“Remedies” for Autocorrelation

If the form of the autocorrelation is known, we could use a GLS procedure - i.e. an approach that allows for autocorrelated residuals.

But such procedures that "correct" for autocorrelation require assumptions about the form of the autocorrelation.

If these assumptions are invalid, the cure would be more dangerous than the disease! - see Hendry and Mizon (1978).

However, it is unlikely to be the case that the form of the autocorrelation is known, and a more "modern" view is that residual autocorrelation presents an opportunity to modify the regression.
A Strategy for Building Econometric Models

Our Objective:
To build a statistically adequate empirical model which
- satisfies the assumptions of the CLRM
- is parsimonious
- has the appropriate theoretical interpretation
- has the right “shape” - i.e.
- all signs on coefficients are “correct”
- all sizes of coefficients are “correct”
- is capable of explaining the results of all competing models
2 Approaches to Building Econometric Models

There are 2 popular philosophies of building econometric models: the


“specific-to-general” and “general-to-specific” approaches.

“Specific-to-general” was used almost universally until the mid 1980’s, and
involved starting with the simplest model and gradually adding to it.

Little, if any, diagnostic testing was undertaken. But this meant that all
inferences were potentially invalid.

An alternative and more modern approach to model building is the “LSE”


or Hendry “general-to-specific” methodology.

The advantages of this approach are that it is statistically sensible and also the
theory on which the models are based usually has nothing to say about the
lag structure of a model.
The General-to-Specific Approach

First step is to form a “large” model with lots of variables on the right hand side
At this stage, we want to make sure that the model satisfies all of the
assumptions of the CLRM
If the assumptions are violated, we need to take appropriate actions to remedy
this, e.g.
- taking logs
- adding lags
- dummy variables
We need to do this before testing hypotheses
Once we have a model which satisfies the assumptions, it could be very big with
several independent variables
The General-to-Specific Approach

The next stage is to reparameterise the model by
- knocking out very insignificant regressors
- combining coefficients that are insignificantly different from each other.

At each stage, we need to check that the assumptions are still OK.

Hopefully at this stage, we have a statistically adequate empirical model which we can use for
- testing underlying financial theories
- forecasting future values of the dependent variable
- formulating policies, etc.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

One way of dealing with the difference in the costs would be to run separate regressions for
the two types of school.
However this would have the drawback that you would be running regressions with two
small samples instead of one large one, with an adverse effect on the precision of the
estimates of the coefficients.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Diagram: COST against N; two parallel lines, the occupational-school line with intercept β1' above the regular-school line with intercept β1]

OCC = 0   Regular school:        COST = β1 + β2N + u
OCC = 1   Occupational school:   COST = β1' + β2N + u
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

=================================================================
Dependent variable:
---------------------------------------------
COST
(1) (2)
-----------------------------------------------------------------
N 436.777*** 152.298***
(58.621) (41.398)

Constant 47,974.070 51,475.250**


(33,879.030) (21,599.150)

-----------------------------------------------------------------
Observations 34 40
R2 0.634 0.263
Adjusted R2 0.623 0.243
Residual Std. Error 104,425.500 (df = 32) 56,544.870 (df = 38)
F Statistic 55.516*** (df = 1; 32) 13.534*** (df = 1; 38)
=================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

result1 <- lm(COST ~ N, data = subset(schools, OCC == 1))
result2 <- lm(COST ~ N, data = subset(schools, OCC == 0))
stargazer(list(result1, result2), type = "text")
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Diagram: COST against N; two parallel lines, the occupational-school line with intercept β1' above the regular-school line with intercept β1]

OCC = 0   Regular school:        COST = β1 + β2N + u
OCC = 1   Occupational school:   COST = β1' + β2N + u

Effectively, we are hypothesizing that the annual overhead cost is different for the two types
of school, but the marginal cost is the same. The marginal cost assumption is not very
plausible and we will relax it in due course.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Diagram: as above, with δ marking the gap between the two intercepts]

OCC = 0   Regular school:        COST = β1 + β2N + u
OCC = 1   Occupational school:   COST = β1' + β2N + u

Let us define δ to be the difference in the intercepts: δ = β1' − β1.


DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Diagram: as above, with the occupational-school intercept written as β1 + δ]

OCC = 0   Regular school:        COST = β1 + β2N + u
OCC = 1   Occupational school:   COST = β1 + δ + β2N + u

Then β1' = β1 + δ and we can rewrite the cost function for occupational schools as shown.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Diagram: as above]

Combined equation:               COST = β1 + δ·OCC + β2N + u

OCC = 0   Regular school:        COST = β1 + β2N + u
OCC = 1   Occupational school:   COST = β1 + δ + β2N + u

We can now combine the two cost functions by defining a dummy variable OCC that has
value 0 for regular schools and 1 for occupational schools.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Diagram: as above]

Combined equation:               COST = β1 + δ·OCC + β2N + u

OCC = 0   Regular school:        COST = β1 + β2N + u
OCC = 1   Occupational school:   COST = β1 + δ + β2N + u

Dummy variables always have two values, 0 or 1. If OCC is equal to 0, the cost function
becomes that for regular schools. If OCC is equal to 1, the cost function becomes that for
occupational schools.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Scatter plot: COST (0 to 700,000) against N (0 to 1,400), with occupational and regular schools marked separately]

We will now fit a function of this type using data for a sample of 74 secondary schools in
Shanghai.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

School   Type           COST      N    OCC

1        Occupational   345,000   623   1
2        Occupational   537,000   653   1
3        Regular        170,000   400   0
4        Occupational   526,000   663   1
5        Regular        100,000   563   0
6        Regular         28,000   236   0
7        Regular        160,000   307   0
8        Occupational    45,000   173   1
9        Occupational   120,000   146   1
10       Occupational    61,000    99   1

The table shows the data for the first 10 schools in the sample. The annual cost is
measured in rupees. N is the number of students in the school.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

School   Type           COST      N    OCC

1        Occupational   345,000   623   1
2        Occupational   537,000   653   1
3        Regular        170,000   400   0
4        Occupational   526,000   663   1
5        Regular        100,000   563   0
6        Regular         28,000   236   0
7        Regular        160,000   307   0
8        Occupational    45,000   173   1
9        Occupational   120,000   146   1
10       Occupational    61,000    99   1

OCC is the dummy variable for the type of school.


DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 331.449***
(39.758)

OCC 133,259.100***
(20,827.580)

Constant -33,612.550
(23,573.470)

-----------------------------------------------
Observations 74
R2 0.616
Adjusted R2 0.605
Residual Std. Error 89,248.090 (df = 71)
F Statistic 56.861*** (df = 2; 71)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

We now run the regression of COST on N and OCC, treating OCC just like any other
explanatory variable, despite its artificial nature.
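A sketch of this pooled regression in R, following the conventions of the earlier code (the schools data frame is assumed to contain COST, N, and OCC):

result <- lm(COST ~ N + OCC, data = schools)
stargazer(result, type = "text")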
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

ĈOST = −34,000 + 133,000·OCC + 331·N

The regression results have been rewritten in equation form. From it we can derive cost
functions for the two types of school by setting OCC equal to 0 or 1.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

ĈOST = −34,000 + 133,000·OCC + 331·N

Regular school (OCC = 0):   ĈOST = −34,000 + 331·N

If OCC is equal to 0, we get the equation for regular schools, as shown. It implies that the marginal cost per student per year is 331 rupees and that the annual overhead cost is −34,000 rupees.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

ĈOST = −34,000 + 133,000·OCC + 331·N

Regular school (OCC = 0):   ĈOST = −34,000 + 331·N

Obviously having a negative intercept does not make any sense at all and it suggests that the model is misspecified in some way. We will come back to this later.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

ĈOST = −34,000 + 133,000·OCC + 331·N

Regular school (OCC = 0):   ĈOST = −34,000 + 331·N

The coefficient of the dummy variable is an estimate of δ, the extra annual overhead cost of an occupational school.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

ĈOST = −34,000 + 133,000·OCC + 331·N

Regular school (OCC = 0):        ĈOST = −34,000 + 331·N

Occupational school (OCC = 1):   ĈOST = −34,000 + 133,000 + 331·N = 99,000 + 331·N

Putting OCC equal to 1, we estimate the annual overhead cost of an occupational school to
be 99,000 rupees. The marginal cost is the same as for regular schools. It must be, given
the model specification.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

[Scatter plot: COST against N with the two fitted parallel cost functions superimposed; occupational and regular schools marked separately]

The scatter diagram shows the data and the two cost functions derived from the regression
results.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 331.449***
(39.758)

OCC 133,259.100***
(20,827.580)

Constant -33,612.550
(23,573.470)

-----------------------------------------------
Observations 74
R2 0.616
Adjusted R2 0.605
Residual Std. Error 89,248.090 (df = 71)
F Statistic 56.861*** (df = 2; 71)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

In addition to the estimates of the coefficients, the regression results will include standard
errors and the usual diagnostic statistics.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 331.449***
(39.758)

OCC 133,259.100***
(20,827.580)

Constant -33,612.550
(23,573.470)

-----------------------------------------------
Observations 74
R2 0.616
Adjusted R2 0.605
Residual Std. Error 89,248.090 (df = 71)
F Statistic 56.861*** (df = 2; 71)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

We will perform a t test on the coefficient of the dummy variable. Our null hypothesis is H0: δ = 0 and our alternative hypothesis is H1: δ ≠ 0.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 331.449***
(39.758)

OCC 133,259.100***
(20,827.580)

Constant -33,612.550
(23,573.470)

-----------------------------------------------
Observations 74
R2 0.616
Adjusted R2 0.605
Residual Std. Error 89,248.090 (df = 71)
F Statistic 56.861*** (df = 2; 71)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

In words, our null hypothesis is that there is no difference in the overhead costs of the two types of school. The t statistic is 6.40, so the null hypothesis is rejected at the 0.1% significance level.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 331.449***
(39.758)

OCC 133,259.100***
(20,827.580)

Constant -33,612.550
(23,573.470)

-----------------------------------------------
Observations 74
R2 0.616
Adjusted R2 0.605
Residual Std. Error 89,248.090 (df = 71)
F Statistic 56.861*** (df = 2; 71)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

We can perform t tests on the other coefficients in the usual way. The t statistic for the
coefficient of N is 8.34, so we conclude that the marginal cost is (very) significantly different
from 0.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 331.449***
(39.758)

OCC 133,259.100***
(20,827.580)

Constant -33,612.550
(23,573.470)

-----------------------------------------------
Observations 74
R2 0.616
Adjusted R2 0.605
Residual Std. Error 89,248.090 (df = 71)
F Statistic 56.861*** (df = 2; 71)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

In the case of the intercept, the t statistic is −1.43, so we do not reject the null hypothesis H0: β1 = 0.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 331.449***
(39.758)

OCC 133,259.100***
(20,827.580)

Constant -33,612.550
(23,573.470)

-----------------------------------------------
Observations 74
R2 0.616
Adjusted R2 0.605
Residual Std. Error 89,248.090 (df = 71)
F Statistic 56.861*** (df = 2; 71)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

Thus one explanation of the nonsensical negative overhead cost of regular schools might
be that they do not actually have any overheads and our estimate is a random number.
DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 331.449***
(39.758)

OCC 133,259.100***
(20,827.580)

Constant -33,612.550
(23,573.470)

-----------------------------------------------
Observations 74
R2 0.616
Adjusted R2 0.605
Residual Std. Error 89,248.090 (df = 71)
F Statistic 56.861*** (df = 2; 71)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

A more realistic version of this hypothesis is that β1 is positive but small (as you can see, the 95 percent confidence interval includes positive values) and the error term is responsible for the negative estimate.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

COST = 1 + TTECH + WWORKER + VVOC + 2N + u

Suppose there are three types of occupational school. There are technical schools training
technicians and skilled workers’ schools training craftsmen, apart from other vocational
schools.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

COST = 1 + TTECH + WWORKER + VVOC + 2N + u

So now the qualitative variable has four categories. The standard procedure is to choose
one category as the reference category and to define dummy variables for each of the
others.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

COST = 1 + TTECH + WWORKER + VVOC + 2N + u

General School COST = 1 + 2N + u


(TECH = WORKER = VOC = 0)

If an observation relates to a general school, the dummy variables are all 0 and the
regression model is reduced to its basic components.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

COST = 1 + TTECH + WWORKER + VVOC + 2N + u

General School COST = 1 + 2N + u


(TECH = WORKER = VOC = 0)

Technical School COST = (1 + T) + 2N + u


(TECH = 1; WORKER = VOC = 0)

If an observation relates to a technical school, TECH will be equal to 1 and the other dummy
variables will be 0. The regression model simplifies as shown.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

COST = 1 + TTECH + WWORKER + VVOC + 2N + u

General School COST = 1 + 2N + u


(TECH = WORKER = VOC = 0)

Technical School COST = (1 + T) + 2N + u


(TECH = 1; WORKER = VOC = 0)

Skilled Workers’ School COST = (1 + W) + 2N + u


(WORKER = 1; TECH = VOC = 0)

Vocational School COST = (1 + V) + 2N + u


(VOC = 1; TECH = WORKER = 0)

The regression model simplifies in a similar manner in the case of observations relating to
skilled workers’ schools and vocational schools.
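A sketch of the four-category regression in R (the dummy columns are assumed to exist in the schools data frame, as in the table that follows; with a single factor column, R would create such dummies automatically):

result <- lm(COST ~ N + TECH + WORKER + VOC, data = schools)
stargazer(result, type = "text")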
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

[Diagram: COST against N; four parallel lines with intercepts β1 (general), β1 + δV (vocational), β1 + δW (workers'), and β1 + δT (technical)]

The diagram illustrates the model graphically. The δ coefficients are the extra overhead costs of running technical, skilled workers', and vocational schools, relative to the overhead cost of general schools.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

School   Type         COST      N    TECH   WORKER   VOC

1        Technical    345,000   623   1      0        0
2        Technical    537,000   653   1      0        0
3        General      170,000   400   0      0        0
4        Workers'     526,000   663   0      1        0
5        General      100,000   563   0      0        0
6        Vocational    28,000   236   0      0        1
7        Vocational   160,000   307   0      0        1
8        Technical     45,000   173   1      0        0
9        Technical    120,000   146   1      0        0
10       Workers'      61,000    99   0      1        0

Here are the data for the first 10 of the 74 schools. Note how the values of the dummy
variables TECH, WORKER, and VOC are determined by the type of school in each
observation.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

[Scatter plot: COST against N for the full sample, with technical, vocational, general, and workers' schools marked separately]

The scatter diagram shows the data for the entire sample, differentiating by type of school.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES
===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 342.634***
(40.219)

TECH 154,110.900***
(26,760.410)

WORKER 143,362.400***
(27,852.800)

VOC 53,228.640*
(31,061.650)

Constant -54,893.090**
(26,673.080)

-----------------------------------------------
Observations 74
R2 0.632
Adjusted R2 0.611
Residual Std. Error 88,578.370 (df = 69)
F Statistic 29.631*** (df = 4; 69)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

The coefficient of N indicates that the marginal cost per student per year is 343 rupees.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES
===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 342.634***
(40.219)

TECH 154,110.900***
(26,760.410)

WORKER 143,362.400***
(27,852.800)

VOC 53,228.640*
(31,061.650)

Constant -54,893.090**
(26,673.080)

-----------------------------------------------
Observations 74
R2 0.632
Adjusted R2 0.611
Residual Std. Error 88,578.370 (df = 69)
F Statistic 29.631*** (df = 4; 69)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

The coefficients of TECH, WORKER, and VOC are 154,000, 143,000, and 53,000, respectively,
and should be interpreted as the additional annual overhead costs, relative to those of
general schools.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES
===============================================
Dependent variable:
---------------------------
COST
-----------------------------------------------
N 342.634***
(40.219)

TECH 154,110.900***
(26,760.410)

WORKER 143,362.400***
(27,852.800)

VOC 53,228.640*
(31,061.650)

Constant -54,893.090**
(26,673.080)

-----------------------------------------------
Observations 74
R2 0.632
Adjusted R2 0.611
Residual Std. Error 88,578.370 (df = 69)
F Statistic 29.631*** (df = 4; 69)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

The constant term is –55,000, indicating that the annual overhead cost of a general
academic school is –55,000 rupees per year. Obviously this is nonsense and indicates that
something is wrong with the model.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

ĈOST = −55,000 + 154,000·TECH + 143,000·WORKER + 53,000·VOC + 343·N

General school (TECH = WORKER = VOC = 0):               ĈOST = −55,000 + 343·N

Technical school (TECH = 1; WORKER = VOC = 0):          ĈOST = −55,000 + 154,000 + 343·N = 99,000 + 343·N

Skilled workers' school (WORKER = 1; TECH = VOC = 0):   ĈOST = −55,000 + 143,000 + 343·N = 88,000 + 343·N

Vocational school (VOC = 1; TECH = WORKER = 0):         ĈOST = −55,000 + 53,000 + 343·N = −2,000 + 343·N

And similarly the extra overhead costs of skilled workers' and vocational schools, relative to those of general schools, are 143,000 and 53,000 rupees, respectively.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

ĈOST = −55,000 + 154,000·TECH + 143,000·WORKER + 53,000·VOC + 343·N

General school (TECH = WORKER = VOC = 0):               ĈOST = −55,000 + 343·N

Technical school (TECH = 1; WORKER = VOC = 0):          ĈOST = −55,000 + 154,000 + 343·N = 99,000 + 343·N

Skilled workers' school (WORKER = 1; TECH = VOC = 0):   ĈOST = −55,000 + 143,000 + 343·N = 88,000 + 343·N

Vocational school (VOC = 1; TECH = WORKER = 0):         ĈOST = −55,000 + 53,000 + 343·N = −2,000 + 343·N

Note that in each case the annual marginal cost per student is estimated at 343 rupees. The
model specification assumes that this figure does not differ according to type of school.
DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

[Scatter plot: COST against N with the four fitted parallel cost functions superimposed]

The four cost functions are illustrated graphically.


SLOPE DUMMY VARIABLES

[Scatter plot: COST against N with the two parallel fitted cost functions from the intercept-dummy regression; occupational and regular schools marked separately]

The scatter diagram shows the data for the 74 schools and the cost functions derived from
a regression of COST on N and a dummy variable for the type of curriculum (occupational /
regular).
SLOPE DUMMY VARIABLES

[Scatter plot: as above]

The specification of the model incorporates the assumption that the marginal cost per
student is the same for occupational and regular schools. Hence the cost functions are
parallel.
SLOPE DUMMY VARIABLES

[Scatter plot: as above]

However, this is not a realistic assumption. Occupational schools incur expenditure on training materials that is related to the number of students.
SLOPE DUMMY VARIABLES

[Scatter plot: as above]

Also, the staff-student ratio has to be higher in occupational schools because workshop
groups cannot be, or at least should not be, as large as academic classes.
SLOPE DUMMY VARIABLES

[Scatter plot: as above]

Looking at the scatter diagram, you can see that the cost function for the occupational
schools should be steeper, and that for the regular schools should be flatter.
SLOPE DUMMY VARIABLES

COST = 1 +  OCC + 2N + lN*OCC + u

We will relax the assumption of the same marginal cost by introducing what is known as a
slope dummy variable. This is N*OCC, defined as the product of N and OCC.
SLOPE DUMMY VARIABLES

COST = 1 +  OCC + 2N + lN*OCC + u

Regular school COST = 1 + 2N + u


(OCC = N*OCC = 0)

In the case of a regular school, OCC is 0 and hence so also is NOCC. The model reduces to
its basic components.
SLOPE DUMMY VARIABLES

COST = 1 +  OCC + 2N + lN*OCC + u

Regular school COST = 1 + 2N + u


(OCC = N*OCC = 0)

Occupational school COST = (1 +  ) + (2 + lN + u


(OCC = 1; N*OCC = N)

In the case of an occupational school, OCC is equal to 1 and N*OCC is equal to N. The
equation simplifies as shown.
SLOPE DUMMY VARIABLES

COST = 1 +  OCC + 2N + lN*OCC + u

Regular school COST = 1 + 2N + u


(OCC = N*OCC = 0)

Occupational school COST = (1 +  ) + (2 + lN + u


(OCC = 1; N*OCC = N)

The model now allows the marginal cost per student to be an amount l greater than that in
regular schools, as well as allowing the overhead costs to be different.
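A sketch of the slope-dummy regression in R (N:OCC is R's notation for the interaction term, matching the output shown later):

result <- lm(COST ~ N + OCC + N:OCC, data = schools)   # equivalently lm(COST ~ N * OCC)
stargazer(result, type = "text")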
SLOPE DUMMY VARIABLES

[Diagram: COST against N; the occupational-school line has intercept β1 + δ and slope β2 + λ, while the regular-school line has intercept β1 and slope β2]

The diagram illustrates the model graphically.


SLOPE DUMMY VARIABLES

School   Type           COST      N    OCC   N·OCC

1        Occupational   345,000   623   1     623
2        Occupational   537,000   653   1     653
3        Regular        170,000   400   0     0
4        Occupational   526,000   663   1     663
5        Regular        100,000   563   0     0
6        Regular         28,000   236   0     0
7        Regular        160,000   307   0     0
8        Occupational    45,000   173   1     173
9        Occupational   120,000   146   1     146
10       Occupational    61,000    99   1     99
SLOPE DUMMY VARIABLES

=================================================================
Dependent variable:
---------------------------------------------
COST
(1) (2)
-----------------------------------------------------------------
N 331.449*** 152.298**
(39.758) (60.019)

OCC 133,259.100*** -3,501.177


(20,827.580) (41,085.460)

N:OCC 284.479***
(75.632)

Constant -33,612.550 51,475.250


(23,573.470) (31,314.840)

-----------------------------------------------------------------
Observations 74 74
R2 0.616 0.680
Adjusted R2 0.605 0.667
Residual Std. Error 89,248.090 (df = 71) 81,979.800 (df = 70)
F Statistic 56.861*** (df = 2; 71) 49.643*** (df = 3; 70)
=================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Here is the regression output using the full sample of 74 schools. We will begin by
interpreting the regression coefficients.
SLOPE DUMMY VARIABLES

ĈOST = 51,000 − 4,000·OCC + 152·N + 284·N·OCC

Here is the regression in equation form.


SLOPE DUMMY VARIABLES

ĈOST = 51,000 − 4,000·OCC + 152·N + 284·N·OCC

Regular school (OCC = N·OCC = 0):   ĈOST = 51,000 + 152·N

Putting OCC, and hence N·OCC, equal to 0, we get the cost function for regular schools. We estimate that their annual overhead costs are 51,000 rupees and their annual marginal cost per student is 152 rupees.
SLOPE DUMMY VARIABLES

ĈOST = 51,000 − 4,000·OCC + 152·N + 284·N·OCC

Regular school (OCC = N·OCC = 0):           ĈOST = 51,000 + 152·N

Occupational school (OCC = 1; N·OCC = N):   ĈOST = 51,000 − 4,000 + 152·N + 284·N = 47,000 + 436·N

Putting OCC equal to 1, and hence N·OCC equal to N, we estimate that the annual overhead costs of the occupational schools are 47,000 rupees and the annual marginal cost per student is 436 rupees.
SLOPE DUMMY VARIABLES

[Scatter plot: COST against N with the two fitted cost functions from the slope-dummy regression; the occupational-school line is steeper]

You can see that the cost functions fit the data much better than before and that the real
difference is in the marginal cost, not the overhead cost.
SLOPE DUMMY VARIABLES

[Scatter plot: as above]

Now we can see why we had a nonsensical negative estimate of the overhead cost of a
regular school in previous specifications.
SLOPE DUMMY VARIABLES

[Scatter plot: as above]

The assumption of the same marginal cost led to an estimate of the marginal cost that was
a compromise between the marginal costs of occupational and regular schools.
SLOPE DUMMY VARIABLES

[Scatter plot: as above]

The cost function for regular schools was too steep and as a consequence the intercept
was underestimated, actually becoming negative and indicating that something must be
wrong with the specification of the model.
SLOPE DUMMY VARIABLES

=================================================================
Dependent variable:
---------------------------------------------
COST
(1) (2)
-----------------------------------------------------------------
N 331.449*** 152.298**
(39.758) (60.019)

OCC 133,259.100*** -3,501.177


(20,827.580) (41,085.460)

N:OCC 284.479***
(75.632)

Constant -33,612.550 51,475.250


(23,573.470) (31,314.840)

-----------------------------------------------------------------
Observations 74 74
R2 0.616 0.680
Adjusted R2 0.605 0.667
Residual Std. Error 89,248.090 (df = 71) 81,979.800 (df = 70)
F Statistic 56.861*** (df = 2; 71) 49.643*** (df = 3; 70)
=================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

We can perform t tests as usual. The t statistic for the coefficient of NOCC is 3.76, so the
marginal cost per student in an occupational school is significantly higher than that in a
regular school.
SLOPE DUMMY VARIABLES

=================================================================
Dependent variable:
---------------------------------------------
COST
(1) (2)
-----------------------------------------------------------------
N 331.449*** 152.298**
(39.758) (60.019)

OCC 133,259.100*** -3,501.177


(20,827.580) (41,085.460)

N:OCC 284.479***
(75.632)

Constant -33,612.550 51,475.250


(23,573.470) (31,314.840)

-----------------------------------------------------------------
Observations 74 74
R2 0.616 0.680
Adjusted R2 0.605 0.667
Residual Std. Error 89,248.090 (df = 71) 81,979.800 (df = 70)
F Statistic 56.861*** (df = 2; 71) 49.643*** (df = 3; 70)
=================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

The coefficient of OCC is now negative, suggesting that the overhead costs of occupational
schools are actually lower than those of regular schools.
SLOPE DUMMY VARIABLES

=================================================================
Dependent variable:
---------------------------------------------
COST
(1) (2)
-----------------------------------------------------------------
N 331.449*** 152.298**
(39.758) (60.019)

OCC 133,259.100*** -3,501.177


(20,827.580) (41,085.460)

N:OCC 284.479***
(75.632)

Constant -33,612.550 51,475.250


(23,573.470) (31,314.840)

-----------------------------------------------------------------
Observations 74 74
R2 0.616 0.680
Adjusted R2 0.605 0.667
Residual Std. Error 89,248.090 (df = 71) 81,979.800 (df = 70)
F Statistic 56.861*** (df = 2; 71) 49.643*** (df = 3; 70)
=================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

This is unlikely. However, the t statistic is only -0.09, so we do not reject the null hypothesis
that the overhead costs of the two types of school are the same.
Thanks
