Problem / Background
The manager of a small sales force wants to know whether average monthly salary is different for males and females in the sales force. He obtains data on monthly salary and experience (in months) for each of the 9 employees as shown on the next slide.
You can use the data in the table below for replicating results shown on the following slides
Employee  Salary  Gender  Experience
1         7.5     Male    6
2         8.6     Male    10
3         9.1     Male    12
4         10.3    Male    18
5         13      Male    30
6         6.2     Female  5
7         8.7     Female  13
8         9.4     Female  15
9         9.8     Female  21
Employee  Salary  Gender
1         7.5     0
2         8.6     0
3         9.1     0
4         10.3    0
5         13      0
6         6.2     1
7         8.7     1
8         9.4     1
9         9.8     1
Dummy variables, also called indicator variables, allow us to include categorical data (like Gender) in regression models.
A dummy variable can take only 2 values: 0 (absence of a category) and 1 (presence of a category).
In our example, we set the dummy variable gender to 1 for females and 0 when the employee is not a female.
When interpreting results for gender, we remember that when the dummy variable is 0 (not a female), we are talking about males.
           Coefficients  Standard Error  t Stat    P-value   Lower 95%    Upper 95%
Intercept  9.7           0.853355        11.3669   9.14E-06  7.682138167  11.7178618
Gender     -1.175        1.280032        -0.91795  0.38918   -4.20179275  1.85179275
The value of the intercept, 9.70, is the average salary for males (as we coded gender=1 for females and 0 for males)
The value of the slope, -1.175, tells us that the average female salary is lower than the average male salary by 1.175
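The interpretation above can be checked by hand. The sketch below is a minimal illustration in plain Python (the helper name fit_simple is hypothetical, not something from the slides): it fits salary on the 0/1 gender dummy by least squares and compares the coefficients with the group averages.

```python
# Data from the table above; gender is coded 1 for females and 0 for males.
salary = [7.5, 8.6, 9.1, 10.3, 13, 6.2, 8.7, 9.4, 9.8]
gender = [0, 0, 0, 0, 0, 1, 1, 1, 1]

def fit_simple(x, y):
    """Least-squares fit of y = intercept + slope * x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = fit_simple(gender, salary)

# Group averages computed directly, for comparison with the coefficients.
male_mean = sum(s for s, g in zip(salary, gender) if g == 0) / gender.count(0)
female_mean = sum(s for s, g in zip(salary, gender) if g == 1) / gender.count(1)

print(round(intercept, 3), round(slope, 3))        # 9.7 -1.175
print(round(male_mean, 3), round(female_mean, 3))  # 9.7 8.525
```

The intercept equals the male average and the slope equals the female average minus the male average, which is what the regression output reports.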
Coding issues
What would have happened if we had used 0 for females and 1 for males in our data? Would our results be any different? Not really.
With this reversed coding, the intercept would change to 8.525 (the average female salary). The slope for gender would still be 1.175 in size, but now it would have a positive sign (reflecting that the average male salary is higher than the average female salary by 1.175).
Predicted salaries from the model for males and females would not change, no matter how the dummy variable is coded.
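A small sketch of this coding check, again in plain Python with a hypothetical helper fit_simple: the same salaries are fitted once with the female dummy and once with the reversed (male) dummy, and the predicted salaries are compared.

```python
salary = [7.5, 8.6, 9.1, 10.3, 13, 6.2, 8.7, 9.4, 9.8]
female = [0, 0, 0, 0, 0, 1, 1, 1, 1]   # original coding: 1 = female
male = [1 - f for f in female]         # reversed coding: 1 = male

def fit_simple(x, y):
    """Least-squares fit of y = intercept + slope * x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

a_f, b_f = fit_simple(female, salary)  # intercept = male average
a_m, b_m = fit_simple(male, salary)    # intercept = female average

print(round(a_m, 3), round(b_m, 3))    # 8.525 1.175

# Predicted salary for each employee under both codings.
pred_f = [a_f + b_f * f for f in female]
pred_m = [a_m + b_m * m for m in male]
max_gap = max(abs(p - q) for p, q in zip(pred_f, pred_m))
print(max_gap < 1e-9)                  # True: predictions agree
```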
The analyst decides to use additional information to explain employee salary: employees' experience at this company (months employed).
Gender is coded as 0 for males and 1 for females.

Employee  Months Employed  Gender  Salary ($000)
1         6                0       7.5
2         10               0       8.6
3         12               0       9.1
4         18               0       10.3
5         30               0       13
6         5                1       6.2
7         13               1       8.7
8         15               1       9.4
9         21               1       9.8
Is the model valid? YES; Significance F is much smaller than 0.10.
Is gender significant (α = 0.1)? YES; its p-value (0.016) is smaller than 0.10.

                 Coefficients  Std. Error  t Stat    P-value
Intercept        6.2485        0.29145     21.43927  6.72E-07
Months Employed  0.2271        0.016117    14.08889  7.98E-06
Gender           -0.7890       0.238404    -3.3094   0.016217
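The coefficients above can be replicated without a statistics package. The sketch below is one possible way, not the method used to produce the slides: it builds the normal equations for salary on months employed and the gender dummy and solves them with a small hand-written elimination routine (the helper name solve is hypothetical).

```python
months = [6, 10, 12, 18, 30, 5, 13, 15, 21]
gender = [0, 0, 0, 0, 0, 1, 1, 1, 1]   # 0 = male, 1 = female
salary = [7.5, 8.6, 9.1, 10.3, 13, 6.2, 8.7, 9.4, 9.8]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (hypothetical helper for small systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Design matrix with a leading column of 1s for the intercept.
X = [[1.0, mo, g] for mo, g in zip(months, gender)]
k = 3
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * y for row, y in zip(X, salary)) for i in range(k)]

b0, b_months, b_gender = solve(xtx, xty)
print(round(b0, 4), round(b_months, 4), round(b_gender, 4))
# approximately 6.2485 0.2271 -0.789
```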
Is the multiple regression model better than the simple regression model?
Was gender significant in the simple regression model?
How do you explain the significant effect of gender in the multiple regression model?
What is the salary equation for men?
What is the salary equation for women?
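One way to approach the last two questions is to substitute gender = 0 and gender = 1 into the fitted model. The sketch below assumes the coefficients from the multiple regression output (6.2485, 0.2271, -0.7890); the function name predicted_salary is hypothetical.

```python
# Coefficients copied from the multiple regression output.
b0, b_months, b_gender = 6.2485, 0.2271, -0.7890

def predicted_salary(months, gender):
    """Predicted salary ($000); gender is 0 for males, 1 for females."""
    return b0 + b_months * months + b_gender * gender

# Setting the dummy to 0 or 1 shifts only the intercept; the slope is shared.
male_intercept = b0
female_intercept = b0 + b_gender

print(f"Men:   Salary = {male_intercept:.4f} + {b_months:.4f} * Months")
print(f"Women: Salary = {female_intercept:.4f} + {b_months:.4f} * Months")
```

Under this model the two lines are parallel: at any given number of months employed, the predicted female salary is 0.789 lower than the predicted male salary.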
For gender, we had only 2 categories (female and male), so we used a single 0/1 variable.
When there are more than 2 categories, the number of dummy variables that should be used equals the number of categories minus 1:
No. of dummy variables = No. of levels - 1
In this example, the categorical variable job grade has 3 levels, 1 (lowest grade), 2, and 3 (highest job grade)
Employee  Job Grade  Salary ($000)
1         1          7.5
2         3          8.6
3         2          9.1
4         3          10.3
5         3          13
6         1          6.2
7         2          8.7
8         2          9.4
9         3          9.8
However, we should use only 2 dummy variables (any 2) in the regression model to represent the three levels (the reason is technical: creating a dummy for each level leads to redundancy)
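The redundancy can be seen directly in the job grade data. In the sketch below (the variable names job_1, job_2 and job_3 are illustrative), a dummy is created for every level; the three dummies always add up to 1, so the third carries no information beyond the other two and the intercept.

```python
# Job grades for the 9 employees, from the table above.
job_grade = [1, 3, 2, 3, 3, 1, 2, 2, 3]

# One dummy per level (one more than is needed).
job_1 = [int(g == 1) for g in job_grade]
job_2 = [int(g == 2) for g in job_grade]
job_3 = [int(g == 3) for g in job_grade]

# Every employee is in exactly one grade, so the dummies sum to 1 in each row.
row_sums = [a + b + c for a, b, c in zip(job_1, job_2, job_3)]
print(row_sums)            # [1, 1, 1, 1, 1, 1, 1, 1, 1]

# The third dummy can be rebuilt from the other two.
recovered = [1 - a - b for a, b in zip(job_1, job_2)]
print(recovered == job_3)  # True
```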
In the scheme below, job grades 1 and 2 will be explicitly represented using their own dummy variables, while grade 3 will become the reference level:
For each employee, we create 2 new dummy variables called Job_1 and Job_2
For employees whose Job grade=1, we set Job_1 equal to 1 and Job_2 equal to zero
For employees whose Job grade=2, we set Job_1 equal to zero and Job_2 equal to 1
Employees whose Job grade=3 are represented when both Job_1 and Job_2 are equal to zero (thus job grade=3 becomes the reference or default category)
Representing 3-level Job Grade using dummy variables Job_1 and Job_2
           Dummy Variables
Job Grade  Job_1  Job_2
1          1      0
2          0      1
3          0      0
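The coding scheme above can be sketched in a few lines of Python. The helper make_dummies is hypothetical; it creates one 0/1 variable per level except the chosen reference level, and is applied here to the job grades of the 9 employees.

```python
def make_dummies(values, reference, prefix="Job_"):
    """Return a dict of 0/1 lists, one per level except the reference level
    (hypothetical helper illustrating the levels-minus-1 rule)."""
    levels = sorted(set(values))
    return {
        f"{prefix}{level}": [int(v == level) for v in values]
        for level in levels
        if level != reference
    }

# Job grades for the 9 employees; grade 3 is the reference level.
job_grade = [1, 3, 2, 3, 3, 1, 2, 2, 3]
dummies = make_dummies(job_grade, reference=3)

for i, grade in enumerate(job_grade):
    print(i + 1, grade, dummies["Job_1"][i], dummies["Job_2"][i])
```

Each printed row shows the employee number, the job grade, and the Job_1 and Job_2 values; employees in grade 3 have both dummies equal to zero.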
Effectiveness 56 41 40 28 55 25 46 71 48 63 52 62 50 45 58 46 58 34 65 55 57 59 64 61 62 36 69 47 73 64 60 62 71 62 70 71
Age 21 23 30 19 28 23 33 67 42 33 33 56 45 43 38 37 43 27 43 45 48 47 48 53 58 29 53 29 58 66 67 63 59 51 67 63
Drug A B B C A C B C B A A C C B A C B C A B B C A A B C A B A B B A C C A C
Model Results
Your regression equation is right if it matches the one shown below:
Effectiveness = 22.29 + 0.66*Age + 10.25*Drug_A + 0.45*Drug_B
For a 50-year-old patient, predict effectiveness if she takes:
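Predictions from an equation of this form can be sketched as follows. Two assumptions are made here and are not stated in the slide: the drugs of interest are the three that appear in the data (A, B and C), and drug C is the reference level (both dummies equal to zero), which is what the absence of a Drug_C term suggests. The function name predict_effectiveness is hypothetical.

```python
def predict_effectiveness(age, drug):
    """Predicted effectiveness from the equation above.
    Assumes drug C is the reference level (Drug_A = Drug_B = 0)."""
    drug_a = 1 if drug == "A" else 0
    drug_b = 1 if drug == "B" else 0
    return 22.29 + 0.66 * age + 10.25 * drug_a + 0.45 * drug_b

# Predictions for a 50-year-old patient on each drug (assumed options).
predictions = {d: round(predict_effectiveness(50, d), 2) for d in "ABC"}
print(predictions)  # {'A': 65.54, 'B': 55.74, 'C': 55.29}
```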