
Regression

What Is Regression
Regression is a measure of relation between a dependent variable and a set of independent variables which affect the value of the dependent variable.

Example: preference for purchase vs. price, popularity of the brand, and product performance.

The relationship derived is in the form of an equation:

Y = a + b1x1 + b2x2 + b3x3 + ...

where Y = dependent variable; x1, x2, x3, ... = independent variables; a = constant; b1, b2, b3, ... = coefficients of the independent variables.

Regression is usually done on variables that are measured on an interval scale.
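
As a rough illustration of how the constant a and the coefficients b1, b2, ... in the equation can be estimated, the sketch below uses ordinary least squares via NumPy. The function name fit_linear_regression and the data are hypothetical; the data is generated from a known equation with no noise so that the recovered coefficients are easy to check.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return [a, b1, b2, ...] for Y = a + b1*x1 + b2*x2 + ... by least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A leading column of ones lets the solver estimate the constant a
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Made-up data generated from Y = 2 + 3*x1 - 1*x2, for illustration only
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [2 + 3 * x1 - x2 for x1, x2 in X]

coefs = fit_linear_regression(X, y)
print(coefs)  # approximately [2, 3, -1]
```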



Types of Regression
Linear:
Assumes a linear relationship between variables.
Simple: when one independent variable is used to predict the value of the dependent variable.
Multiple: when many independent variables are used to predict the value of the dependent variable.

Non-linear:
When the relationship is non-linear.
Many models are available: asymptotic, log-linear, log-logistic.

In most cases the underlying relationship is assumed to be linear.



How To Determine Which Variables To Include In Regression

Drop variables that are unlikely to affect the value of the dependent variable. Several methods are available for eliminating variables from a regression analysis.

Low correlation
Eliminate independent variables that have a low correlation with the dependent variable.

Stepwise regression
Start with the independent variable with the highest predictive value, then enter variables one by one, examining at each stage the improvement in predictive power over the previous iteration. At each stage, all variables in the equation are examined to check whether they are still needed; any found superfluous are dropped.

Forward selection
Similar to stepwise regression, except that no variable is dropped once it is entered into the equation.

Backward elimination
Start with all independent variables and eliminate, one by one, the variables that contribute the least.
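
The forward selection procedure described above can be sketched as follows. This is a minimal illustration, not a statistical package's implementation: it uses the gain in R2 as the entry criterion (an assumption; packages typically use an F or p-value threshold), and the names forward_selection, r_squared, and min_gain are hypothetical. The data is made up so that columns 0 and 2 drive y and column 1 is irrelevant.

```python
import numpy as np

def r_squared(X, y):
    """R2 of a least-squares fit of y on the columns of X plus a constant."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coefs
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1 - (resid @ resid) / ss_total

def forward_selection(X, y, min_gain=0.01):
    """Enter variables one by one; stop when the best gain in R2 is below min_gain."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Try each remaining variable alongside those already entered
        scores = [(r_squared(X[:, selected + [j]], y), j) for j in remaining]
        r2, j = max(scores)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)  # once entered, a variable is never dropped
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2

# Made-up data: y depends on columns 0 and 2 only
x0 = [1, 2, 3, 4, 5, 6, 7, 8]
x1 = [1, -1, 1, -1, -1, 1, -1, 1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
X = np.column_stack([x0, x1, x2])
y = 3 * np.array(x0) + 2 * np.array(x2)

selected, best_r2 = forward_selection(X, y)
print(selected, best_r2)
```

Stepwise regression would add a check after each entry that re-tests the variables already in the equation, and backward elimination would run the loop in reverse from the full set.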

Linear Regression : Standard Output and Interpretation (1)


Model summary
R2: 0.693
Adjusted R2: 0.694

Interpretation
The total variation of the dependent variable explained by the equation is 69%. This is a good fit, and hence one can proceed to draw further inferences based on the assumption that the relationship is linear.
Adjusted R2 is an improvement over R2 in that it takes into account the number of variables used for predicting.
If R2 is low, then the model cannot be assumed to be linear, and further inferences should not be drawn in such cases.
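
For reference, the usual adjustment is Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of cases and p the number of independent variables. Because (n - 1)/(n - p - 1) exceeds 1 whenever p is at least 1, the adjusted value does not exceed R2. The sketch below applies the formula; the function name and the input values are hypothetical and are not taken from the slide.

```python
def adjusted_r_squared(r2, n_cases, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_cases <= n_predictors + 1:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Hypothetical inputs, for illustration only
adj = adjusted_r_squared(r2=0.75, n_cases=50, n_predictors=4)
print(round(adj, 4))  # 0.7278
```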

Linear Regression : Standard Output and Interpretation (2)


ANOVA

Output
Sum of squares
Degrees of freedom (df)
F
Significance

Interpretation
The F statistic is a measure of whether any relationship exists between the dependent variable and the independent variables.
If (100 - significance, expressed as a percentage) is high (95% or above), that is, if the significance value is 0.05 or below, then a relationship exists and the model is robust for prediction.
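
To show how the ANOVA columns fit together, the sketch below computes F as the ratio of the regression mean square to the residual mean square. The function name and the sums of squares are hypothetical. The significance value would then be read from the F distribution with the two degrees of freedom, which statistical packages report directly and which is not computed here.

```python
def f_statistic(ss_regression, ss_residual, n_cases, n_predictors):
    """Return (F, df_regression, df_residual) from the ANOVA sums of squares."""
    df_regression = n_predictors
    df_residual = n_cases - n_predictors - 1
    ms_regression = ss_regression / df_regression  # mean square explained
    ms_residual = ss_residual / df_residual        # mean square unexplained
    return ms_regression / ms_residual, df_regression, df_residual

# Hypothetical sums of squares, for illustration only
f_value, df1, df2 = f_statistic(693.0, 307.0, n_cases=104, n_predictors=3)
print(round(f_value, 2), df1, df2)  # 75.24 3 100
```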

Linear Regression : Standard Output and Interpretation (3)


Output and interpretation

Constant
Constant to be used in the regression equation.

B
There is a B value for each independent variable. It is the coefficient of that independent variable in the equation. A unit change in the independent variable causes B units of change in the dependent variable, if all other independent variables are held constant.

Standard error
It is the standard error of the coefficient B.

Beta
It is the normalised value of B, and it removes the effect of scale differences in the independent variables. It is a measure of relative importance because it indicates the expected change in the dependent variable per unit change in the independent variable once both are expressed on a standardized scale.

Significance of t
If t is not significant, then the independent variable is not a good predictor and should be removed from the analysis.
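
One common way to obtain the normalised coefficient from B is to rescale it by the ratio of the standard deviations: Beta = B x (standard deviation of the independent variable) / (standard deviation of the dependent variable). The sketch below assumes that definition; the function name and data are hypothetical.

```python
from statistics import stdev

def standardized_beta(b, x_values, y_values):
    """Rescale a raw coefficient B into a standardized Beta."""
    return b * stdev(x_values) / stdev(y_values)

# Made-up data in which y = 2 * x exactly, so the raw coefficient B is 2
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
beta = standardized_beta(2.0, x, y)
print(round(beta, 6))  # 1.0
```

With a single independent variable, Beta equals the correlation between x and y, which is why a perfect linear relationship gives 1.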

Applications Of Regression

Estimating Relative Importance Of Variables In Choice


The Beta values can be used as a measure of the relative importance of independent variables in choice.
The Beta value rather than the B value should be used, as it eliminates problems related to differences in the scale of measurement of the independent variables. However, if all independent variables have been measured on the same scale, then it makes no difference whether Beta or B is used.

Dependent variable: overall satisfaction with the store

Parameter             Beta     Relative importance
Cleanliness           0.593    43%
Duration of billing   0.794    57%

The inference
Both cleanliness and duration of billing are important contributors to overall satisfaction with the store. Duration of billing is a relatively more important contributor.
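
The relative importance percentages are consistent with each coefficient being divided by the sum of the two coefficients (0.593 + 0.794 = 1.387). A small sketch of that arithmetic follows; the function name is hypothetical, and taking absolute values is an added assumption so that negative coefficients would still yield shares.

```python
def relative_importance(betas):
    """Share of each coefficient in the sum of absolute coefficients."""
    total = sum(abs(b) for b in betas.values())
    return {name: abs(b) / total for name, b in betas.items()}

shares = relative_importance({"Cleanliness": 0.593, "Duration of billing": 0.794})
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
# Cleanliness: 43%
# Duration of billing: 57%
```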

Forecasting
The regression equation can be used to predict the value of the dependent variable when the independent variable values are known.

Y = a + b1x1 + b2x2 + b3x3 + ...

Data available
Awareness for brand A during the period of a campaign.
GRPs in TV for the ad campaign.

Can predict
The likely levels of awareness of brand A during the next campaign, for which estimates of GRP are available.
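
A minimal sketch of the forecasting step, using a single independent variable (GRPs) to predict awareness. All numbers are invented for illustration, and the function name fit_simple_regression is hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates (a, b) for Y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Invented history: GRPs of past campaigns and the awareness (%) measured
grps = [100, 200, 300, 400]
awareness = [20, 30, 40, 50]
a, b = fit_simple_regression(grps, awareness)

planned_grps = 250  # estimate for the next campaign (hypothetical)
forecast = a + b * planned_grps
print(round(forecast, 2))  # 35.0
```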


Some Caveats To Remember While Predicting


Predictions are valid only within the range of values from which the original regression equation was estimated.

If the regression equation was obtained for the awareness of a brand vis-à-vis GRPs for a market leader, it cannot be extrapolated to a minority brand.
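
One simple way to enforce the first caveat in practice is to refuse predictions outside the range of the data used for estimation. The sketch below assumes a single independent variable; the function name, coefficients, and data are hypothetical.

```python
def predict_within_range(a, b, x, observed_xs):
    """Predict Y = a + b*x only if x lies inside the range used for estimation."""
    low, high = min(observed_xs), max(observed_xs)
    if not low <= x <= high:
        raise ValueError(f"x={x} is outside the estimation range [{low}, {high}]")
    return a + b * x

observed_grps = [100, 200, 300, 400]  # invented estimation data
print(predict_within_range(10.0, 0.1, 250, observed_grps))

try:
    predict_within_range(10.0, 0.1, 900, observed_grps)
except ValueError as err:
    print("refused:", err)
```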


Is my model fit to predict sales?


[Chart: actual sales compared with Prediction 1 and Prediction 2; vertical axis from 0 to 80]

DISCRIMINANT ANALYSIS


What is Discriminant Analysis?


A modelling technique used when the dependent variable is a categorical variable and the independent variables are continuous variables.

Applications
Selection process for a job; admission process of an educational program.
Dividing a group into potential buyers and non-buyers, or into high risk and low risk.

Relationship is derived in the form of an equation
Y = a + k1x1 + k2x2

k1 and k2 are the coefficients of the independent variables.
k1 and k2 should maximise the separation between the two groups.


Predicting the Group Membership


Model building is based on the linear discriminant equation.
Y, the discriminant score, is calculated for each case.
Cut-off point: the mid point of the mean discriminant scores of the two groups.
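
The scoring and cut-off steps can be sketched as below. The coefficients a and k, the cases, and the function names are all hypothetical, and the sketch assumes Group 2 has the higher mean score so that scores above the cut-off are assigned to Group 2.

```python
def discriminant_score(x, a, k):
    """Y = a + k1*x1 + k2*x2 + ... for one case."""
    return a + sum(ki * xi for ki, xi in zip(k, x))

def midpoint_cutoff(group1, group2, a, k):
    """Mid point of the mean discriminant scores of the two groups."""
    mean1 = sum(discriminant_score(x, a, k) for x in group1) / len(group1)
    mean2 = sum(discriminant_score(x, a, k) for x in group2) / len(group2)
    return (mean1 + mean2) / 2

def classify(x, a, k, cutoff):
    # Assumes Group 2 lies above the cut-off
    return 2 if discriminant_score(x, a, k) > cutoff else 1

# Invented coefficients and cases, for illustration only
a, k = 0.0, [1.0, 0.5]
group1 = [[1, 2], [2, 2]]   # scores 2.0 and 3.0
group2 = [[5, 4], [6, 6]]   # scores 7.0 and 9.0
cutoff = midpoint_cutoff(group1, group2, a, k)

print(cutoff)                            # 5.25
print(classify([4, 4], a, k, cutoff))    # 2
print(classify([2, 3], a, k, cutoff))    # 1
```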


Linear Discriminant Analysis Standard Outputs and Interpretation


Classification/Confusion Matrix

Percent correct/wrong column: 94.44%
The model has correctly classified 94.44% of the cases.
This level of accuracy may not hold true for future predictions, but it is a good pointer towards the model being a good one.
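
The percent correct figure is the share of cases on the diagonal of the confusion matrix. The counts below are hypothetical; they were chosen only because 17 correct out of 18 reproduces 94.44%, and the slide does not state the actual counts.

```python
def percent_correct(matrix):
    """Percentage of cases on the diagonal (rows = actual, columns = predicted)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total

# Hypothetical counts: 9 + 8 classified correctly, 1 misclassified
confusion = [
    [9, 0],  # actual Group 1: predicted Group 1, predicted Group 2
    [1, 8],  # actual Group 2: predicted Group 1, predicted Group 2
]
pct = percent_correct(confusion)
print(f"{pct:.2f}%")  # 94.44%
```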


Linear Discriminant Analysis Standard Outputs and Interpretation


Wilks' Lambda
A low value of Wilks' Lambda indicates high significance of the model.

F Test
The p-value is the decision criterion.
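
For a single discriminant function, Wilks' Lambda can be computed on the discriminant scores as the within-group sum of squares divided by the total sum of squares, so a value near 0 means the groups are well separated. The sketch below assumes that one-function case; the function name and scores are hypothetical.

```python
def wilks_lambda(scores_by_group):
    """Within-group sum of squares divided by total sum of squares."""
    all_scores = [s for group in scores_by_group for s in group]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    ss_within = 0.0
    for group in scores_by_group:
        group_mean = sum(group) / len(group)
        ss_within += sum((s - group_mean) ** 2 for s in group)
    return ss_within / ss_total

# Invented discriminant scores for two groups
lam = wilks_lambda([[2.0, 3.0], [7.0, 9.0]])
print(round(lam, 4))  # 0.0763
```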


Linear Discriminant Analysis Standard Outputs and Interpretation


Relative Importance of Independent Variables
Standardized coefficients indicate the relative importance of the variables.

Means of Canonical Variables
Computed based on the raw coefficient table.

Classifying the cases
Right side of the mid point is Group 2.
Left side of the mid point is Group 1.


Case Study
A business school selects its students every year through a written test, an interview, and a group discussion. It then tracks the performance of students during the two-year program by means of GPA. Students with a GPA above 2.75/4.0 are defined as Successful, and those below as Unsuccessful.

Can you develop a model that predicts whether a student would be potentially successful or not?
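
One possible starting point for the case is a two-group linear discriminant fitted with Fisher's method, where the coefficients are obtained from the pooled within-group scatter matrix and the difference in group means. The sketch below is an assumption about how such a model could be built, not the method used in the case; the scores (written test, interview, group discussion) are entirely invented, and the function names are hypothetical.

```python
import numpy as np

def fit_two_group_lda(X1, X2):
    """Return (k, cutoff): discriminant coefficients and mid-point cut-off."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled within-group scatter matrix
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Coefficients that maximise separation between the two group means
    k = np.linalg.solve(Sw, m2 - m1)
    cutoff = (k @ m1 + k @ m2) / 2
    return k, cutoff

def predict_group(x, k, cutoff):
    """Group 2 (Successful) if the score is above the cut-off, else Group 1."""
    return 2 if float(np.dot(k, x)) > cutoff else 1

# Invented past students: [written test, interview, group discussion]
unsuccessful = [[50, 5, 4], [55, 6, 5], [48, 4, 6], [60, 5, 5], [52, 6, 4]]
successful = [[70, 7, 8], [75, 8, 7], [68, 8, 8], [80, 7, 9], [72, 9, 7]]

k, cutoff = fit_two_group_lda(unsuccessful, successful)
new_applicant = [66, 7, 7]  # hypothetical applicant
print(predict_group(new_applicant, k, cutoff))
```

In practice the fitted model would also be checked with the classification matrix and Wilks' Lambda before being used on new applicants.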


How good is the model?
Statistical significance of the model
Predictors
Classification of a new student
