
West Visayas State University College of Education Graduate School

Multiple Regression

Multiple Regression is very popular among sociologists because:
1. Most social phenomena have more than one cause.
2. It is very difficult to manipulate just one social variable through experimentation.
3. Sociologists must attempt to model complex social realities to explain them.

Multiple Regression allows us to:
a. Use several variables at once to explain the variation in a continuous dependent variable.
b. Isolate the unique effect of one variable on the continuous dependent variable while taking into consideration that other variables are affecting it too.
c. Write a mathematical equation that tells us the overall effects of several variables together and the unique effects of each on a continuous dependent variable.
d. Control for other variables to demonstrate whether bivariate relationships are spurious.

For example, a sociologist may be interested in the relationship between Education, Income, and Number of Children in a family.
- Null Hypothesis: There is no relationship between education of respondents and the number of children in families. Ho: b1 = 0
- Null Hypothesis: There is no relationship between family income and the number of children in families. Ho: b2 = 0

Bivariate regression is based on fitting a line as close as possible to the plotted coordinates of your data on a two-dimensional graph. Trivariate regression is based on fitting a plane as close as possible to the plotted coordinates of your data on a three-dimensional graph.

Mathematically, that plane is:

Y = a + b1X1 + b2X2

where a = the y-intercept (the value of Y where both X's equal zero) and each b = the coefficient or slope for that variable.

For our problem, SPSS says the equation is:

Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income

Conducting a Test of Significance for the Slopes of the Regression Shape

By slapping the sampling distribution for the slopes over a guess of the population's slopes, Ho, we can find out whether our sample could have been drawn from a population where the slopes are equal to our guess.
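In practice, that test boils down to a t-statistic: the estimated slope, divided by its standard error, compared against the sampling distribution under Ho. A minimal sketch follows; the standard error value used here is hypothetical (SPSS would report the real one), and the 1.96 cutoff is the large-sample 5% critical value.

```python
# Hedged sketch of a significance test for one regression slope.
# The slope matches the worked example below; the standard error is made up.

def slope_t_statistic(b, se, b_null=0.0):
    """t = (estimated slope - hypothesized slope) / standard error."""
    return (b - b_null) / se

b1, se_b1 = -0.36, 0.08        # hypothetical SPSS output for the education slope
t = slope_t_statistic(b1, se_b1)
print(round(t, 2))             # -4.5
# |t| exceeds ~1.96, so at the 5% level we would reject Ho: b1 = 0.
```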

So what does our equation tell us?

Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income

If Education equals:  If Income equals:  Then, Children equals:
0                     0                  11.8
10                    0                  8.2
10                    10                 4.2
20                    10                 0.6
20                    11                 0.2
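The arithmetic behind that table can be checked directly. This sketch just evaluates the fitted plane from the SPSS output above at each pair of values:

```python
# Sketch: evaluating Expected # of Children = 11.8 - .36*Educ - .40*Income

def expected_children(educ, income):
    return 11.8 - 0.36 * educ - 0.40 * income

for educ, income in [(0, 0), (10, 0), (10, 10), (20, 10), (20, 11)]:
    print(educ, income, round(expected_children(educ, income), 1))
# Reproduces the table: 11.8, 8.2, 4.2, 0.6, 0.2
```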

Holding Education constant at 1 and varying Income:

Income:    0      1      5      10     15
Children:  11.44  11.04  9.44   7.44   5.44

Holding Income constant at 1 and varying Education:

Education: 0      1      5      10     15
Children:  11.40  11.04  9.60   7.80   6.00

If graphed, holding one variable constant produces a two-dimensional graph for the other variable.
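Those constant-holding slices can be generated from the same equation. Each slice is a straight line: slope -.40 when Income varies, slope -.36 when Education varies.

```python
# Sketch: slicing Y = 11.8 - .36*Educ - .40*Income by holding one predictor fixed.

def expected_children(educ, income):
    return 11.8 - 0.36 * educ - 0.40 * income

# Hold Education = 1, vary Income: a line with slope -0.40.
print([round(expected_children(1, inc), 2) for inc in (0, 1, 5, 10, 15)])
# [11.44, 11.04, 9.44, 7.44, 5.44]

# Hold Income = 1, vary Education: a line with slope -0.36.
print([round(expected_children(e, 1), 2) for e in (0, 1, 5, 10, 15)])
# [11.4, 11.04, 9.6, 7.8, 6.0]
```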

An interesting effect of controlling for other variables is Simpson's Paradox: the direction of the relationship between two variables can change when you control for another variable.

More Variables! The social world is very complex. What happens when you have even more variables? For example: A sociologist may be interested in the effects of Education, Income, Sex, and Gender Attitudes on Number of Children in a family.

Now the Null Hypotheses are:
1. There will be no relationship between education of respondents and the number of children in families. Ho: b1 = 0; Ha: b1 ≠ 0
2. There will be no relationship between family income and the number of children in families. Ho: b2 = 0; Ha: b2 ≠ 0
3. There will be no relationship between sex and number of children. Ho: b3 = 0; Ha: b3 ≠ 0
4. There will be no relationship between gender attitudes and number of children. Ho: b4 = 0; Ha: b4 ≠ 0

Bivariate regression is based on fitting a line as close as possible to the plotted coordinates of your data on a two-dimensional graph. Trivariate regression is based on fitting a plane as close as possible to the plotted coordinates of your data on a three-dimensional graph. Regression with more than two independent variables is based on fitting a shape to your constellation of data on a multi-dimensional graph. The shape will be placed so that it minimizes the distance (sum of squared errors) from the shape to every data point. The shape is no longer a line, but if you hold all other variables constant, it is linear for each independent variable. Imagine a graph with four dimensions!

For our problem, our equation could be:

Y = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4
E(Children) = 7.5 - .30*Educ - .40*Income + 0.5*Sex + 0.25*Gender Att.

So what does our equation tell us?

Y = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4
E(Children) = 7.5 - .30*Educ - .40*Income + 0.5*Sex + 0.25*Gender Att.

Education:    10     10     10     10     10
Income:       5      5      10     5      5
Sex:          0      0      0      1      1
Gender Att.:  0      5      5      0      5
Children:     2.5    3.75   1.75   3.0    4.25
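Again, the table is just the equation evaluated at each column of values, which this sketch reproduces:

```python
# Sketch: evaluating E(Children) = 7.5 - .30*Educ - .40*Income + 0.5*Sex + 0.25*GenderAtt

def e_children(educ, income, sex, gender_att):
    return 7.5 - 0.30 * educ - 0.40 * income + 0.5 * sex + 0.25 * gender_att

rows = [(10, 5, 0, 0), (10, 5, 0, 5), (10, 10, 0, 5), (10, 5, 1, 0), (10, 5, 1, 5)]
for row in rows:
    print(row, round(e_children(*row), 2))
# Reproduces the table: 2.5, 3.75, 1.75, 3.0, 4.25
```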

Each variable, holding the other variables constant, has a linear, two-dimensional graph of its relationship with the dependent variable. Here we hold every other variable constant at zero.

What are dummy variables? They are simply dichotomous variables that are entered into regression. They have 0-1 coding, where 0 = absence of something and 1 = presence of something. E.g., Female (0 = M; 1 = F) or Southern (0 = Non-Southern; 1 = Southern).

Dummy variables are especially nice because they allow us to use nominal variables in regression. A nominal variable has no rank or order, rendering its numerical coding scheme useless for regression. The way you use nominal variables in regression is by converting them to a series of dummy variables.

Recode nominal variables into dummy variables:

Nominal Variable: Race (1 = White; 2 = Black; 3 = Other)
1. White: 0 = Not White; 1 = White
2. Black: 0 = Not Black; 1 = Black
3. Other: 0 = Not Other; 1 = Other

Nominal Variable: Religion (1 = Catholic; 2 = Protestant; 3 = Jewish; 4 = Muslim; 5 = Other Religions)
1. Catholic: 0 = Not Catholic; 1 = Catholic
2. Protestant: 0 = Not Prot.; 1 = Protestant
3. Jewish: 0 = Not Jewish; 1 = Jewish
4. Muslim: 0 = Not Muslim; 1 = Muslim
5. Other Religions: 0 = Not Other; 1 = Other Relig.

When you need to use a nominal variable in regression (like race), just convert it to a series of dummy variables. When you enter the variables into your model, you MUST LEAVE OUT ONE OF THE DUMMIES. For Race: leave out one (White); enter the rest (Black, Other) into the regression.

The reason you MUST LEAVE OUT ONE OF THE DUMMIES is that regression is mathematically impossible without an excluded group: the dummies for one nominal variable always sum to 1, so if all were in, holding all but one of them constant would leave no possible variation in the last. For Religion: leave out one (Catholic); enter the rest (Protestant, Jewish, Muslim, Other Religion) into the regression.
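The recode-and-drop-one step can be sketched in a few lines. This is an illustration, not SPSS syntax; the function name and coding scheme follow the Race example above:

```python
# Sketch: recoding a nominal race variable (1 = White, 2 = Black, 3 = Other)
# into dummies, leaving White out as the excluded reference category.

def race_dummies(race_code):
    """Return (black, other); White is the omitted reference group (0, 0)."""
    return (1 if race_code == 2 else 0,
            1 if race_code == 3 else 0)

for code, label in [(1, "White"), (2, "Black"), (3, "Other")]:
    print(label, race_dummies(code))
# White (0, 0) / Black (1, 0) / Other (0, 1)
```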

The regression equations for dummies will look the same as any other. For Race, with White left out, the equation predicting self-esteem takes the form: Self-Esteem = a + b1*Black + b2*Other.

Dummy variables can be entered into multiple regression along with other dichotomous and continuous variables. For example, you could regress self-esteem on sex, race, and education. Interpreting the slopes:

1. Women's self-esteem is 4 points lower than men's.
2. Blacks' self-esteem is 5 points higher than whites'.
3. Others' self-esteem is 2 points lower than whites' and consequently 7 points lower than blacks'.
4. Each year of education improves self-esteem by 0.3 units.
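Those slopes imply a prediction equation. A sketch, with one loud assumption: the intercept (30) is not stated in the text and is chosen here because it reproduces the worked group predictions that follow (e.g., white males with 10 years of education = 33).

```python
# Sketch of the implied equation. The intercept of 30 is an ASSUMPTION,
# back-solved from the worked examples; the slopes come from the text.

def self_esteem(female, black, other, educ):
    return 30 - 4 * female + 5 * black - 2 * other + 0.3 * educ

print(round(self_esteem(0, 0, 0, 10), 1))  # white male, 10 yrs of educ -> 33.0
print(round(self_esteem(0, 1, 0, 10), 1))  # black male, 10 yrs -> 38.0
print(round(self_esteem(1, 0, 1, 10), 1))  # "other" female, 10 yrs -> 27.0
print(round(self_esteem(1, 0, 1, 16), 1))  # "other" female, 16 yrs -> 28.8
```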

Plugging in some select values, we'd get self-esteem for select groups:
- White males with 10 years of education = 33
- Black males with 10 years of education = 38
- Other females with 10 years of education = 27
- Other females with 16 years of education = 28.8

The same regression rules apply. The slopes represent the linear relationship of each independent variable to the dependent variable while holding all other variables constant.

Standardized Coefficients

Sometimes you want to know whether one variable has a larger impact on your dependent variable than another. If your variables have different units of measure, it is hard to compare their effects. For example, if wages go up one thousand dollars for each year of education, is that a greater effect than if wages go up five hundred dollars for each year increase in age? So which is better for increasing wages, education or aging?

One thing you can do is standardize your slopes so that you can compare the standard deviation increase in your dependent variable for each standard deviation increase in your independent variables. You might find that wages go up 0.3 standard deviations for each standard deviation increase in education, but 0.4 standard deviations for each standard deviation increase in age. Recall that standardizing a regression coefficient is accomplished by the formula: β = b(Sx/Sy).
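The formula β = b(Sx/Sy) can be sketched directly. The standard deviations below are hypothetical, chosen only so the result mirrors the wage example (0.3 for education vs. 0.4 for age):

```python
# Sketch of standardizing a slope: beta = b * (Sx / Sy).
# All standard deviations here are hypothetical illustrations.

def standardized_slope(b, s_x, s_y):
    return b * (s_x / s_y)

# Wages rise $1000 per year of education; assume SD(educ) = 3, SD(wages) = $10000:
print(round(standardized_slope(1000, 3, 10000), 2))  # 0.3
# Wages rise $500 per year of age; assume SD(age) = 8, SD(wages) = $10000:
print(round(standardized_slope(500, 8, 10000), 2))   # 0.4
```

Even though the raw education slope ($1000) is twice the raw age slope ($500), the standardized slopes reverse the comparison once the spread of each predictor is taken into account.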

In the example above, education and income have very comparable effects on number of children: each lowers the number of children by .4 standard deviations for a standard deviation increase in itself, controlling for the other.

One last note of caution: it does not make sense to standardize slopes for dichotomous variables.

It makes no sense to refer to standard deviation increases in sex or in race: these variables are either 0 or they are 1 only.

Why is multiple regression important?

Multiple regression is a statistical method in which several variables, both independent and dependent, are examined and studied together. The term was first coined in 1908 by a statistician named Pearson. The general purpose of multiple regression is to explain the relationship among variables of interest, conventionally represented by X's and a Y. A multiple regression equation contains many elements, and one must understand how these elements relate to one another for the equation to be understood.

The technique does have requirements: the relationships studied must be linear, a minimum of about 30 observations is recommended for every pair of dependent and independent variables, and the number of independent variables must be kept limited. Regardless, multiple regression is important primarily because it can help predict the value of a dependent variable from the independent variables. In addition, it can help assess causal linkage: effects can be better understood through a study of what caused them. Moreover, multiple regression can forecast future outcomes: once the causes are studied and the effects determined, one can project the possibilities if the same variables are encountered again. Although multiple regression is a complex process, it is important that this method be known and understood for greater use.
