
Multiple Regression

MR Example with dummy variables



Problem / Background

The manager of a small sales force wants to know whether average monthly salary is different for males and females in the sales force. He obtains data on monthly salary and experience (in months) for each of the 9 employees as shown on the next slide.

You can use the data in the table below to replicate the results shown on the following slides.

Employee   Salary   Gender   Experience
1          7.5      Male     6
2          8.6      Male     10
3          9.1      Male     12
4          10.3     Male     18
5          13       Male     30
6          6.2      Female   5
7          8.7      Female   13
8          9.4      Female   15
9          9.8      Female   21

Creating a dummy variable for gender

Categorical data is included in regression analysis by using dummy variables


For example, we can assign a value of 0 for males and 1 for females in our data so that an MR model can be developed

Employee   Salary   Gender
1          7.5      0
2          8.6      0
3          9.1      0
4          10.3     0
5          13       0
6          6.2      1
7          8.7      1
8          9.4      1
9          9.8      1

What are dummy variables?

Dummy variables, also called indicator variables, allow us to include categorical data (like Gender) in regression models.
A dummy variable can take only 2 values: 0 (absence of a category) and 1 (presence of a category).
In our example, we set the dummy variable gender to 1 for females and 0 when the employee is not a female.
When interpreting results for gender, we remember that when the dummy variable is 0 (not a female), we are talking about males.
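The coding rule above can be sketched in a few lines of Python. This is only an illustration: the list name `gender_labels` is an assumption, and the labels are typed in the same order as the employee table.

```python
# Gender labels for the nine employees, in table order.
gender_labels = ["Male"] * 5 + ["Female"] * 4

# Dummy variable: 1 marks the category of interest (Female), 0 everything else.
gender_dummy = [1 if label == "Female" else 0 for label in gender_labels]

print(gender_dummy)
```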

Regression analysis: Salary vs. Gender


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.327783
R Square             0.107442
Adjusted R Square    -0.02007
Standard Error       1.908159
Observations         9

ANOVA
             df   SS         MS         F          Significance F
Regression   1    3.068056   3.068056   0.842624   0.38918002
Residual     7    25.4875    3.641071
Total        8    28.55556

            Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept   9.7            0.853355         11.3669    9.14E-06   7.682138167   11.7178618
Gender      -1.175         1.280032         -0.91795   0.38918    -4.20179275   1.85179275

Predicted salary for males: Salary = 9.7 - 1.175*0 = 9.7

Predicted salary for females: Salary = 9.7 - 1.175*1 = 8.525

However, the difference between male and female salaries is NOT statistically significant, because the p-value for gender is large (p = 0.389).
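As a cross-check on the intercept and slope, here is a minimal sketch that fits the one-predictor model with the textbook least-squares formulas, using only the Python standard library. The variable names are assumptions; the data are the nine salaries and the 0/1 gender codes from the table.

```python
salary = [7.5, 8.6, 9.1, 10.3, 13, 6.2, 8.7, 9.4, 9.8]  # $000
gender = [0, 0, 0, 0, 0, 1, 1, 1, 1]                     # 0 = male, 1 = female

n = len(salary)
mean_x = sum(gender) / n
mean_y = sum(salary) / n

# Least-squares slope and intercept for a single predictor.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gender, salary))
sxx = sum((x - mean_x) ** 2 for x in gender)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Group averages, for comparison with the fitted coefficients.
male_mean = sum(y for x, y in zip(gender, salary) if x == 0) / gender.count(0)
female_mean = sum(y for x, y in zip(gender, salary) if x == 1) / gender.count(1)

print(round(intercept, 3), round(slope, 3))
print(round(male_mean, 3), round(female_mean, 3))
```

With a single 0/1 predictor, the intercept equals the average of the group coded 0 and the slope equals the difference between the two group averages.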

More on the intercept and slope

The value of the intercept, 9.70, is the average salary for males (as we coded gender=1 for females and 0 for males)
The value of the slope, -1.175, tells us that the average female salary is lower than the average male salary by 1.175

Coding issues

What would have happened if we had used 0 for females and 1 for males in our data? Would our results be any different? Not really.
With this coding, the intercept would change to 8.525 (the average female salary). The slope for gender would have the same magnitude, 1.175, but a positive sign (reflecting that the average male salary is higher than the average female salary by 1.175).
Predicted salaries from the model for males and females do not change, no matter how the dummy variable is coded.
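The effect of the recoding described above can be checked with a short sketch. The helper `fit_simple` is a hypothetical name introduced here for illustration; it applies the standard one-predictor least-squares formulas to both codings of gender.

```python
def fit_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

salary = [7.5, 8.6, 9.1, 10.3, 13, 6.2, 8.7, 9.4, 9.8]  # $000
female = [0, 0, 0, 0, 0, 1, 1, 1, 1]   # original coding: 1 = female
male = [1 - f for f in female]         # reversed coding: 1 = male

a_f, b_f = fit_simple(female, salary)
a_m, b_m = fit_simple(male, salary)

# Predicted salary for each employee under each coding.
pred_female_coding = [a_f + b_f * f for f in female]
pred_male_coding = [a_m + b_m * m for m in male]

print(round(a_m, 3), round(b_m, 3))
```

Only the coefficients change with the coding; the fitted value for every employee is the same under both.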

Using additional information

The analyst decides to use additional information to explain employee salary: employees' experience at this company (months employed).
Gender is coded as 0 for males and 1 for females.

Employee   Months Employed   Gender   Salary ($000)
1          6                 0        7.5
2          10                0        8.6
3          12                0        9.1
4          18                0        10.3
5          30                0        13
6          5                 1        6.2
7          13                1        8.7
8          15                1        9.4
9          21                1        9.8

Multiple regression: Salary vs. Gender and Experience


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.986819
R Square             0.973812
Adjusted R Square    0.965083
Standard Error       0.353037
Observations         9

ANOVA
             df   SS         MS         F          Significance F
Regression   2    27.80774   13.90387   111.5565   1.79599E-05
Residual     6    0.747812   0.124635
Total        8    28.55556

            Coefficients   Std. Error   t Stat     P-value
Intercept   6.2485         0.29145      21.43927   6.72E-07
Months      0.2271         0.016117     14.08889   7.98E-06
Gender      -0.7890        0.238404     -3.3094    0.016217

Is the model valid? YES; Significance F is much smaller than 0.10.
Is gender significant (α = 0.10)? YES; its p-value (0.016) is smaller than 0.10.
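The coefficients can be reproduced without a spreadsheet. The sketch below is one possible way to do it, assuming nothing beyond the Python standard library: it builds the normal equations (X'X)b = X'y and solves them with a small Gauss-Jordan routine. The helper name `solve` and the variable names are assumptions made for illustration. In practice a statistics package would normally be used, and it would also report the standard errors and p-values.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

months = [6, 10, 12, 18, 30, 5, 13, 15, 21]
gender = [0, 0, 0, 0, 0, 1, 1, 1, 1]                     # 0 = male, 1 = female
salary = [7.5, 8.6, 9.1, 10.3, 13, 6.2, 8.7, 9.4, 9.8]  # $000

# Design matrix: a column of ones for the intercept, then Months, then Gender.
X = [[1.0, mo, g] for mo, g in zip(months, gender)]

# Normal equations: (X'X) beta = X'y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(X, salary)) for i in range(3)]

intercept, b_months, b_gender = solve(xtx, xty)
print(round(intercept, 4), round(b_months, 4), round(b_gender, 4))
```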


Is the multiple regression model better than the simple regression model?

Was gender significant in the simple regression model?
How do you explain the significant effect of gender in the multiple regression model?
What is the salary equation for men?
What is the salary equation for women?

More on dummy variables

For gender, we had only 2 categories, female and male, so we used a single 0/1 variable.
When there are more than 2 categories, the number of dummy variables that should be used equals the number of categories minus 1:
No. of Dummy Variables = No. of levels - 1

Example: Salary vs. Job Grade


In this example, the categorical variable job grade has 3 levels: 1 (lowest grade), 2, and 3 (highest job grade).

Employee   Job Grade   Salary ($000)
1          1           7.5
2          3           8.6
3          2           9.1
4          3           10.3
5          3           13
6          1           6.2
7          2           8.7
8          2           9.4
9          3           9.8

Dummy variables for a categorical variable with 3 levels

We could create 3 dummy variables for job grade as follows:


Job_1 = 1 if job grade = 1, zero otherwise
Job_2 = 1 if job grade = 2, zero otherwise
Job_3 = 1 if job grade = 3, zero otherwise

However, we should use only 2 of them (any 2) in the regression model to represent the three levels. The reason is technical: creating a dummy for each level leads to redundancy.
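The redundancy can be seen directly in the data. In the sketch below (variable names are assumptions), every employee belongs to exactly one job grade, so the three dummies always add up to 1. That sum duplicates the constant column a regression with an intercept already contains, and any one dummy can be rebuilt from the other two.

```python
job_grade = [1, 3, 2, 3, 3, 1, 2, 2, 3]

job_1 = [int(g == 1) for g in job_grade]
job_2 = [int(g == 2) for g in job_grade]
job_3 = [int(g == 3) for g in job_grade]

# Each employee is in exactly one grade, so the three dummies sum to 1 in every row.
row_sums = [a + b + c for a, b, c in zip(job_1, job_2, job_3)]

# Consequently the third dummy carries no new information.
job_3_rebuilt = [1 - a - b for a, b in zip(job_1, job_2)]

print(row_sums)
print(job_3_rebuilt == job_3)
```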

Representing 3-level job grade with two dummy variables

In the scheme below, job grades 1 and 2 are explicitly represented by their own dummy variables, while grade 3 becomes the reference level:

For each employee, we create 2 new dummy variables called Job_1 and Job_2
For employees whose Job grade=1, we set Job_1 equal to 1 and Job_2 equal to zero
For employees whose Job grade=2, we set Job_1 equal to zero and Job_2 equal to 1
Employees whose Job grade=3 are represented when both Job_1 and Job_2 are equal to zero (thus job grade=3 becomes the reference or default category)


Representing 3-level Job Grade using dummy variables Job_1 and Job_2

                       Dummy Variables
Employee's Job Grade   Job_1   Job_2
1                      1       0
2                      0       1
3                      0       0

Job Grade 3 is the reference category

EXCEL data file with dummy variables for job grade


Employee   Job Grade   Salary   Job_1   Job_2
1          1           7.5      1       0
2          3           8.6      0       0
3          2           9.1      0       1
4          3           10.3     0       0
5          3           13       0       0
6          1           6.2      1       0
7          2           8.7      0       1
8          2           9.4      0       1
9          3           9.8      0       0
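The same columns can be generated programmatically rather than typed by hand. The function `make_dummies` below is a hypothetical helper written for this illustration, not a library routine. It creates one 0/1 column for every level except the chosen reference level, so a variable with k levels yields k - 1 dummies.

```python
def make_dummies(values, reference, prefix):
    """Return {column_name: 0/1 list} for every level except the reference level."""
    levels = sorted(set(values))
    return {
        f"{prefix}_{level}": [int(v == level) for v in values]
        for level in levels
        if level != reference
    }

job_grade = [1, 3, 2, 3, 3, 1, 2, 2, 3]

# Grade 3 is the reference category, so only Job_1 and Job_2 are created.
dummies = make_dummies(job_grade, reference=3, prefix="Job")

for name, column in dummies.items():
    print(name, column)
```

The same helper should work for text labels, for example `make_dummies(drug, reference="C", prefix="Drug")` for a list of drug versions.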

Ready to use dummy variables?


Drug Effectiveness Study: A pharmaceutical company wants to study the effectiveness of three different versions of a drug. The company refers to these versions as A, B and C. It runs a clinical study in which each patient is treated with one of the three versions. Data on drug effectiveness, age, and the version of the drug taken are provided for 36 patients in the spreadsheet.

The company wants to know whether the three versions of the drug are equal in their effectiveness. It also wants to know whether age influences the effectiveness of these three versions.

Use a multiple regression model with dummy variables for Drug version to answer these questions. Create dummy variables for Drug versions A and B, making version C the reference level. You can follow the method described on slides 15 / 16 (representing the 3-level job grade with Job_1 and Job_2) for creating dummy variables.

Effectiveness 56 41 40 28 55 25 46 71 48 63 52 62 50 45 58 46 58 34 65 55 57 59 64 61 62 36 69 47 73 64 60 62 71 62 70 71

Age 21 23 30 19 28 23 33 67 42 33 33 56 45 43 38 37 43 27 43 45 48 47 48 53 58 29 53 29 58 66 67 63 59 51 67 63

Drug A B B C A C B C B A A C C B A C B C A B B C A A B C A B A B B A C C A C


Model Results
Your regression equation is right if it matches the one shown below:

Effectiveness = 22.29 + 0.66*Age + 10.25*Drug_A + 0.45*Drug_B

For a 50-year-old patient, predict effectiveness if she takes:

Drug_A
Drug_B
Drug_C
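One way to check your hand calculation is to evaluate the equation in code. The sketch below simply plugs values into the rounded coefficients shown in the equation; the function name `predict_effectiveness` is an assumption. Drug C is the reference level, so both dummies are 0 for it.

```python
def predict_effectiveness(age, drug):
    """Evaluate the fitted equation; drug is "A", "B", or "C" (C is the reference level)."""
    drug_a = 1 if drug == "A" else 0
    drug_b = 1 if drug == "B" else 0
    return 22.29 + 0.66 * age + 10.25 * drug_a + 0.45 * drug_b

# Predictions for a 50-year-old patient under each drug version.
predictions = {drug: round(predict_effectiveness(50, drug), 2) for drug in "ABC"}
print(predictions)
```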


