
Interpreting Multiple Regression: A Short Overview

Abdel-Salam G. Abdel-Salam
Laboratory for Interdisciplinary Statistical Analysis (LISA)
Department of Statistics, Virginia Polytechnic Institute and State University

http://www.stat.vt.edu/consult/
Short Course, November 12th, 2008


Learning Objectives

Today, we will cover how to do Linear Regression Analysis (LRA) in SPSS and SAS. We will learn concepts and vocabulary of regression analysis, such as:

1. How to use the F-test to determine whether your predictor variables have a statistically significant relationship with your outcome/response variable.
2. What the assumptions for LRA are, and what you should do to meet them.
3. Why adjusted R^2 is smaller than R^2, and what these numbers mean when comparing several models.
4. What the difference is between regression and ANOVA, and when they are equivalent.
5. How you can select the best model.
6. Other LR problems (multicollinearity and outlier observations), and what you should do about them.
7. A general strategy for doing LRA.

PLEASE think about your research problem!



Outline

1. Introduction: Definitions; Example
2. Regression Assumptions: Example Problem (1); Satisfying Assumptions; Example Problem (2)
3. Linear Regression vs. ANOVA
4. Model Selection: General Example
5. Strategy for Solving Problems


Introduction

When should we use Regression?

Data can be continuous or discrete, and the appropriate method depends on the types of the response and the predictors:

Response (Y)               Predictor (X): Discrete          Predictor (X): Continuous
Discrete (e.g., gender)    Contingency table analysis       Logistic regression
                           (chi-square test)                (binary or multinomial)
Continuous (e.g., time)    ANOVA                            Regression (our focus)

In regression, we need the response to be continuous and at least one predictor to be continuous.

Introduction

Definitions

Linear Regression Analysis

Linear regression is a general method for estimating/describing the association between a continuous outcome variable (dependent) and one or more predictors in a single equation.

Simple linear regression model:

    y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,    i = 1, 2, ..., n

where y_i represents the ith response value; \beta_0 is the intercept (the mean value of y at x = 0); \beta_1 (the slope, or regression coefficient) tells us that, on average, as x increases by 1, y increases by \beta_1; and \varepsilon_i is the error term.

The estimated model is

    \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i,    i = 1, 2, ..., n

where \hat{\varepsilon}_i = y_i - \hat{y}_i represents the residual for the ith observation.
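A minimal SAS sketch of fitting this model (mydata, y, and x are hypothetical placeholders; the course data sets appear in the examples later):

proc reg data=mydata;
  model y = x;   /* prints estimates of beta_0 and beta_1 and the overall F-test */
run;
quit;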



Introduction

Definitions

A Regression Line

(Figure: scatterplot with the estimated regression line, labeling the intercept and slope.)

Introduction

Definitions

Multiple Linear Regression

What are the reasons for using multiple regression? There are two:

1. To be able to make stronger causal inferences from observed associations between two or more variables.
2. To predict a dependent variable based on the values of a number of other independent variables.

Example: There might be many factors associated with crime, such as poverty, urbanisation, low social cohesion and informal social control, and education. We therefore want to understand the unique contribution of each variable to variation in crime levels.


Introduction

Definitions

Multiple Linear Regression

The multiple regression model:

    y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_K x_{Ki} + \varepsilon_i,    i = 1, 2, ..., n

What does it mean?

1. The intercept \beta_0 is the expected value of Y when X_1, X_2, ..., X_K are all equal to zero.
2. The partial regression coefficient \beta_1 means that for every unit increase (or decrease) in the value of X_1 we predict a \beta_1 change in Y, controlling for the effect of the others. What do we mean by that?
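A minimal SAS sketch of such a fit, using the crime example above (the data set crime and its variable names are hypothetical); each printed coefficient is a partial effect, adjusted for the other predictors:

proc reg data=crime;
  model crime_rate = poverty urban cohesion educ;   /* partial regression coefficients */
run;
quit;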


Introduction

Example

We are interested in the effect of education and occupational status on general happiness, all measured on scales from 1 to 10.

The prediction model is

    \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i},    i = 1, 2, ..., n

You estimate the model and get these results:

    predicted HAPPINESS = 3 + 1 EDUC + 0.5 STATUS

\hat{\beta}_0 = 3 is the predicted HAPPINESS when EDUC and STATUS are both zero. \hat{\beta}_1: for each unit increase in EDUC we predict a 1-unit rise in happiness, controlling for status. \hat{\beta}_2: for each unit increase in STATUS, we predict a 0.5-unit rise in happiness, controlling for education.
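As a small sketch, the fitted equation can be turned into predictions in a data step (the data set and variable names here are illustrative only):

data predict;
  input educ status;
  happy_hat = 3 + 1*educ + 0.5*status;   /* the fitted prediction equation */
  datalines;
7 5
7 6
;
run;

proc print data=predict; run;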

Introduction

Example

Imagine two people, Gordon and Tony, both having the same level of education, but Gordon has status score 5 and Tony has 6.

The prediction model is

    predicted HAPPINESS = 3 + 1 EDUC + 0.5 STATUS

Who is happier? The model predicts Tony's happiness score to be 0.5 greater than Gordon's.
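Checking this directly with the fitted equation, writing E for their shared education score:

    Gordon: 3 + 1(E) + 0.5(5) = E + 5.5
    Tony:   3 + 1(E) + 0.5(6) = E + 6.0

Tony's predicted happiness is 0.5 higher, whatever E is.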


Regression Assumptions

1. Independence means that the Y values are statistically independent of one another.
2. Linearity means that the mean value of Y is proportional to the independent variable (X), i.e., a straight-line function.
3. Normality means that, for a fixed value of X, Y has a normal distribution.

(Figure: a normal distribution contrasted with positively and negatively skewed distributions.)

4. Homoscedasticity, or homogeneity, means that the variance of Y is the same for any X (constant variance).

Regression Assumptions

Checking Assumptions

Checking can be done graphically (most popular) or by statistical tests.

Assumption      Diagnostic plot
Independence    Response (Y) or residual vs. time (run order)
Linearity       Response (Y) vs. predictor (X)
Normality       Probability plot (P-P plot or Q-Q plot)
Homogeneity     Residual vs. predicted value (\hat{y})

Statistical tests for these assumptions will not be covered here. Examples follow.
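One way to produce these diagnostic plots in SAS (a sketch with hypothetical variable names; PROC SGPLOT assumes SAS 9.2 or later):

proc reg data=mydata;
  model y = x;
  output out=diag p=pred r=resid;   /* save predicted values and residuals */
run;
quit;

proc sgplot data=diag;              /* homogeneity: residual vs. predicted */
  scatter x=pred y=resid;
  refline 0 / axis=y;
run;

proc univariate data=diag;          /* normality: Q-Q plot of the residuals */
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;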


Regression Assumptions

Independence

(Figures: two run-order plots — in one the response is independent of time; in the other it is dependent on time.)

Regression Assumptions

Linearity

(Figures: two scatterplots of Response vs. Temperature — a linear case, where the points follow a straight line, and a non-linear case, where the points follow a steeply increasing curve.)

Question: In MLR, plot the response against each predictor.


Regression Assumptions

Normality

(Figures: two normal probability plots — normal data, where the points follow a straight line, and non-normal data, where the points are noticeably curved.)

Regression Assumptions

Homogeneity

(Figures: two residual plots — constant variance, where the residuals are randomly scattered with no pattern, and non-constant variance (heterogeneity or heteroscedasticity), where the residuals show a curved or megaphone shape, i.e., monotone spread.)

Regression Assumptions

Now... Any questions?

Regression Assumptions

Example Problem (1)

A group of 13 children participated in a psychological study to analyze the relationship between age and average total sleep time (ATST). The results are displayed below. Determine the SLR model for the data.

AGE (X, years):     4.4  14   10.1  6.7  1.5  9.6  12.4  8.9  11.1  7.75  5.5  8.6  7.2
ATST (Y, minutes):  586  462  491   565  462  532  478   515  493   528   576  533  531

Using PROC GLM (or PROC REG):

proc glm data=sleep;
  model atst = age / solution clparm;
run;
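The same fit can be obtained via PROC REG; the CLB option plays the role of CLPARM, printing confidence limits for the parameter estimates:

proc reg data=sleep;
  model atst = age / clb;   /* CLB: 95% confidence limits for the coefficients */
run;
quit;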


Regression Assumptions

Example Problem (1)

SPSS Analysis

Before running the analysis, check that the data are listed as Scale in the Variable View screen.

Regression Assumptions

Example Problem (1)

SPSS Analysis

Analyze → Regression → Linear

Regression Assumptions

Example Problem (1)

SPSS Analysis

Enter ATST in the Dependent box and AGE in Independent(s). To add confidence intervals: Statistics → check the Confidence Interval box.

Regression Assumptions

Example Problem (1)

SPSS Analysis - 2nd Method

Analyze → General Linear Model → Univariate

Regression Assumptions

Example Problem (1)

SPSS Analysis - 2nd Method

Put ATST in Dependent Variable and AGE in Covariate(s).

Note that SPSS and JMP are similar for regression analysis. Data and SAS code are posted for future analysis:
http://filebox.vt.edu/users/abdo/statwww/Example%201.pdf
Output:
http://filebox.vt.edu/users/abdo/statwww/SAS%20Output1.mht

Regression Assumptions

Now... Any questions?

Regression Assumptions

Satisfying Assumptions

What if the assumptions are not met?

Independence
  If the dependence is due to time, add time to the model. Other methods for dealing with dependence:
  1. Repeated measures
  2. Paired data

Linearity
  Transformations: try log, square root, square, and inverse transformations.
  Add variables, add interactions, or add higher powers of variables.

Normality & Homoscedasticity
  Transformations: try log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria. If no transformation satisfies the normality criteria, use robust regression, where normality is not required. (A transformation sketch follows.)
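A sketch of trying these transformations in SAS (mydata, y, and x are hypothetical placeholders):

data trans;
  set mydata;
  log_y  = log(y);    /* log transform; requires y > 0 */
  sqrt_y = sqrt(y);   /* square-root transform; requires y >= 0 */
  inv_y  = 1/y;       /* inverse transform; requires nonzero y */
run;

proc reg data=trans;
  model log_y = x;    /* refit with the first transform that satisfies the criteria */
run;
quit;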

Regression Assumptions

Example Problem (2)

SAS Analysis (2)

In order to study the growth rate of a particular type of bacteria, biologists were interested in the relationship between time and the proportion of total area taken up by a colony of bacteria. The biologists placed samples in four Petri dishes and observed the percentage of total area taken up by the colony after fixed time intervals.

Data and SAS code are posted for future analysis:
http://filebox.vt.edu/users/abdo/statwww/Example2.pdf
Output:
http://filebox.vt.edu/users/abdo/statwww/SAS%20Output2.mht

Regression Assumptions

Now... Any questions?

Linear Regression vs. ANOVA

                Linear regression              ANOVA
Dependent       Continuous                     Continuous
Independent     Continuous or categorical      Categorical

Both are linear models. ANOVA is a special case of regression analysis!


Linear Regression vs. ANOVA

Example. Scientific question: Is there any difference in loneliness between females and males?

    H_0: \mu_{Female} = \mu_{Male}    vs.    H_1: \mu_{Female} \neq \mu_{Male}

Student's t-test or ANOVA? Or even regression analysis?

Linear Regression vs. ANOVA

ANOVA WITH SOLUTION:

proc glm data=mylib.loneliness;
  class gender;
  model loneliness = gender / solution;
run;

LINEAR REGRESSION USING GLM:

proc glm data=mylib.loneliness;
  model loneliness = gender;
run;

Both runs give the identical fit. Dependent variable: loneliness.

Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1          4.902852       4.902852       2.24    0.1347
Error            498       1087.709443       2.184156
Corrected Total  499       1092.612294

R-Square    Coeff Var    Root MSE    Loneliness Mean
0.004487    29.18049     1.477889    5.064648

Parameter estimates, ANOVA coding (gender as CLASS):

Parameter     Estimate          Std Error     t Value    Pr > |t|
Intercept     4.931462122 B     0.11077245    44.52      <.0001
gender 1      0.206809959 B     0.13803488     1.50      0.1347
gender 2      0.000000000 B     .              .         .

Parameter estimates, regression coding (gender as numeric):

Parameter     Estimate          Std Error     t Value    Pr > |t|
Intercept     5.345082039       0.19850165    26.93      <.0001
gender       -0.206809959       0.13803488    -1.50      0.1347

The gender effect is identical in both analyses (F = 2.24, p = 0.1347).

Linear Regression vs. ANOVA

In the example, we showed that ANOVA is a special case of linear regression. What if there are more than 2 groups in the ANOVA? Use dummy variables for the categorical data:

1. Outcome: Y (continuous)
2. Predictor: X (categorical), where X = (Group1, Group2, Group3, Group4)

ANOVA: Y ~ X

Regression: Y ~ Z1, Z2, Z3, where Z_j = 1 if an observation is in group j and 0 otherwise (Group4 serves as the reference level). A sketch follows.
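A sketch of this dummy coding in SAS (hypothetical data set, with a numeric group variable coded 1-4):

data dummies;
  set mydata;
  z1 = (group = 1);   /* logical comparisons evaluate to 1 or 0 */
  z2 = (group = 2);
  z3 = (group = 3);   /* Group4 is the reference level */
run;

proc reg data=dummies;
  model y = z1 z2 z3;   /* reproduces the one-way ANOVA on group */
run;
quit;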


Linear Regression vs. ANOVA

Now... Any questions?

Model Selection

Goodness of Fit

R^2: the larger, the better. It measures the proportion of variability in the response explained by the model, 0 <= R^2 <= 1. Use it to compare models and to evaluate how well the model explains your data.

Adjusted R^2: the larger, the better (takes degrees of freedom into consideration).

Root MSE: the smaller, the better.

Probably the most frequently used statistic is Mallows' C_P, with P = 1 + number of independent variables (k). The model with all variables always has C_P = P exactly. For other models, a good fit is indicated by C_P close to P, with C_P < P even better. (The standard formulas are given below.)


Model Selection

Forward Selection: starting with the null model, add variables sequentially.
  Drawback: variables added to the model cannot be taken out.

Backward Elimination: starting with the full model, delete variables with large p-values sequentially.
  Drawback: variables taken out of the model cannot be added back in.

Stepwise: a combination of the backward/forward methods. (A sketch of all three follows.)
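All three methods are available through the SELECTION= option of PROC REG; a sketch with hypothetical variables (SLENTRY and SLSTAY set the p-value thresholds for entering and staying in the model):

proc reg data=mydata;
  model y = x1 x2 x3 x4 / selection=forward  slentry=0.15;
  model y = x1 x2 x3 x4 / selection=backward slstay=0.15;
  model y = x1 x2 x3 x4 / selection=stepwise slentry=0.15 slstay=0.15;
run;
quit;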


Model Selection

Model Comparison

Full model (1): compute the residual sum of squares (RSS_1) and its error degrees of freedom (df_1).
Reduced model (2): compute the residual sum of squares (RSS_2) and its error degrees of freedom (df_2).

F-test:

    F = \frac{(RSS_2 - RSS_1)/(df_2 - df_1)}{RSS_1/df_1} \sim F_{(df_2 - df_1, df_1)}
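PROC REG computes this partial F-test directly through its TEST statement; a sketch with hypothetical variables, testing whether x3 and x4 can be dropped from the full model:

proc reg data=mydata;
  model y = x1 x2 x3 x4;
  test x3 = 0, x4 = 0;   /* F-test of the full model against the reduced model */
run;
quit;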


Model Selection

General Example

From Motor Trend magazine, data were obtained for n = 32 cars on a set of variables (listed in the posted data). Data and SAS code are posted for future analysis. Output:
http://filebox.vt.edu/users/abdo/statwww/General%20Example.pdf


Model Selection

Now... Any questions?

Model Selection

Linear Regression and Outliers

Outliers can distort the regression results. When an outlier is included in the analysis, it pulls the regression line towards itself. This can result in a solution that is more accurate for the outlier but less accurate for all of the other cases in the data set.

The problems of satisfying assumptions and detecting outliers are intertwined. For example, if a case has a value on the dependent variable that is an outlier, it will affect the skew, and hence the normality, of the distribution. Removing an outlier may improve the distribution of a variable, and transforming a variable may reduce the likelihood that the value for a case will be characterized as an outlier. A sketch of the workflow follows.
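A sketch of the usual workflow in SAS (hypothetical variables): save studentized residuals, drop cases beyond +/-3, and refit:

proc reg data=mydata;
  model y = x1 x2;
  output out=diag rstudent=rstud;   /* externally studentized residuals */
run;
quit;

data clean;
  set diag;
  if abs(rstud) <= 3;   /* keep only non-outlying cases */
run;

proc reg data=clean;
  model y = x1 x2;
run;
quit;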


Model Selection

Linear Regression and Outliers

(Figure: a scatterplot in which a single outlier pulls the fitted regression line toward itself.)

Strategy for Solving Problems

Our strategy for solving problems about violations of assumptions and outliers will include the following steps:

1. Run the type of regression specified in the problem statement on the variables, using the full data set.
2. Test the dependent variable for normality. If it does not satisfy the criteria for normality unless transformed, substitute the transformed variable in the remaining tests that call for the use of the dependent variable.
3. Test for normality, linearity, and homoscedasticity using scripts, and decide which transformations should be used.
4. Substitute the transformations and run the regression entering all independent variables, saving the studentized residuals.
5. Remove the outliers (studentized residual greater than 3 or smaller than -3) and run the regression with the method and variables specified in the problem.
6. Compare R^2 for the analysis using transformed variables and omitting outliers (step 5) to the R^2 obtained for the model using all data and the original variables (step 1).

The End...

Thank You For Your Attention!

Acknowledgments to Dr. Schabenberger, Dr. J. P. Morgan, Jonathan Duggins, Dingcai Cao, and all our consultants.

Selected References: [Laboratory for Interdisciplinary Statistical Analysis (LISA); Schabenberger and Morgan; Montgomery et al., 2006; Smaxone et al., 2005; Chatterjee and Hadi, 2006; Kutner et al., 2005; Kleinbaum et al., 2007; Myers, 1990; Zar, 1999; Grafarend, 2006; Hastie and Tibshirani, 1990; Rencher, 2000; Vonesh and Chinchilli, 1997; Lee et al., 2006]


References
Chatterjee, S. and Hadi, A. S. (2006). Regression Analysis by Example. 4th edition. ISBN: 978-0-471-74696-6.

Grafarend, E. W. (2006). Linear and Nonlinear Models: Fixed Effects, Random Effects, and Mixed Models. Walter de Gruyter.

Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. New York: Chapman and Hall.

Kleinbaum, D., Kupper, L., Nizam, A., and Muller, K. (2007). Applied Regression Analysis and Multivariate Methods. 4th edition. ISBN-13: 978-0-495-38496-0.

Kutner, M., Nachtsheim, C., Neter, J., and Li, W. (2005). Applied Linear Statistical Models. 5th edition. ISBN-13: 978-0-073-10874-2.

Laboratory for Interdisciplinary Statistical Analysis (LISA). http://www.stat.vt.edu/consult/index.html.

Lee, Y., Nelder, J. A., and Pawitan, Y. (2006). Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Chapman & Hall/CRC.

Montgomery, D. C., Peck, E. A., and Vining, G. G. (2006). Introduction to Linear Regression Analysis. 4th edition. John Wiley & Sons, New Jersey.

Myers, R. H. (1990). Classical and Modern Regression with Applications. 2nd edition. Boston, MA: PWS-KENT.

Rencher, A. C. (2000). Linear Models in Statistics. John Wiley and Sons, New York, NY.

Schabenberger, O. and Morgan, J. P. Regression and ANOVA course pack. STAT 5044.

Smaxone, Bgballe, M., Rasmussen, B., and Skafte, C. (2005). Regression. BL Music Scarlet.

Vonesh, E. F. and Chinchilli, V. M. (1997). Linear and Nonlinear Models for the Analysis of Repeated Measurements. Marcel Dekker, Inc., New York.

Zar, J. (1999). Biostatistical Analysis. 4th edition. ISBN-13: 978-0-130-81542-2.
