
Regression Using R

This page is one of a (small) set of additional appendices to the book Applying Regression and Correlation, by Jeremy Miles and Mark Shevlin. R is not the most user-friendly program in the known universe. But it is one of the most powerful. And it's free (as in beer, and also as in speech). To make it a little easier for beginners, here is a guide to doing regression in R.

1. First, you need to get hold of R. Typing R into Google will find you the R homepage, or you can go straight to the Comprehensive R Archive Network (CRAN) and download and install it from a mirror near you.

2. When you have R installed, and you run it, it should look a bit like this:

[Screenshot: the R console after starting up]

3. We need to get some data into R. We'll cover two types of data: first tab-delimited, then SPSS. For either kind of file, it will make life easier if you change the working directory to the directory where your data are currently stored. To do this, click on the File menu, and choose Change Dir:

Then choose the directory where your data are stored.

4. To read a tab-delimited file, with variable names at the top, we'll assume that your data look like this:
HASSLES  ANX
     38   10
     10   12
     60   21
     90   16
     88   27
     96   30
      1    9
     41    7
     86   32
     59   11
    ...

You can find the whole file here. We use the command read.table, and type:

data1 <- read.table("data6_1a.dat", header = TRUE)

If we go through that a bit at a time, we'll see what it means:
- data1 is the name of the data object that we are creating in R. Effectively, R can have lots of datasets that it is working with at the same time, and it needs to know what to call each one. Because we aren't very imaginative, we are going to call it data1.

- <- is what we would call the assignment operator, if we wanted to use computery talk. It's like using the = sign, except that the equals sign is ambiguous: if we write x = 4, does that mean "put the value 4 into the box labelled x", or does it mean "is x equal to 4?" In R, for the first, we would write x <- 4, and for the second, x == 4.

- Next we have our file name, in speech marks. We changed the directory that R would look in, so that we didn't have to type in the whole path (e.g. "c:\my documents\statistics\examples\R\data6_1a.dat").

- Finally, we put header = TRUE, which means that it is true that the variable names are included in the header at the top of the file.
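Before going further, it's worth checking that the data arrived intact. A quick sketch using three standard R functions (nothing here is specific to our file except the name data1):

# Show the first few rows of the dataset
head(data1)
# Show the structure: variable names, types, and number of observations
str(data1)
# Count the rows (one per person)
nrow(data1)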

For an SPSS file, we have to be introduced to the world of R packages. R packages are rather cool, and are what really make R what it is. To read an SPSS file, you need to have a package loaded called 'foreign'. Click on the Packages menu, and then choose Load Package. You are presented with a list of the packages that are available. If it's there, choose foreign, and click OK. If it's not there, you need to get it: click on Packages and choose Set CRAN Mirror. If you are in a UK university, choose UK (Bristol); if you aren't, choose the one that's nearest to you. Then choose Packages, Install Packages, and choose the package 'foreign'. R will now download and install the appropriate file over the web - but it won't load it. You have to go back to Packages, Load Package, foreign. Then you just type:

data1 <- read.spss("data6_1a.sav")

And we have successfully read in our file.
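If you prefer typing to menus, the same install-and-load sequence can be done from the console. A sketch (to.data.frame = TRUE is an option of read.spss that returns a data frame rather than a list, which is usually what you want):

install.packages("foreign")   # download and install from your chosen CRAN mirror
library(foreign)              # installing doesn't load it - library() does
data1 <- read.spss("data6_1a.sav", to.data.frame = TRUE)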

5. Each variable can now be referred to using the name of the dataset and the name of the variable, separated by a $ sign. So the first variable is called data1$HASSLES. Notice that R is case sensitive, and variable names (when read from SPSS) are always in capitals. If you type:

data1$HASSLES

R will give you the values of the first variable. You can refer to this in other ways (which is useful later on, when you want to program):

data1[1]

will also give the values of the first variable, with the following output:

> data1[1]
$HASSLES
 [1] 38 10 60 90 88 96  1 41 86 59 ...

If you use double square brackets, you get:

> data1[[1]]
 [1] 38 10 60 90 88 96  1 41 86 59 ...

Notice the difference? The second time, we didn't get the variable name at the top. We can also draw a histogram of the variable:

> hist(data1$HASSLES)

will give:

[Histogram of data1$HASSLES]

And:

> plot(data1$HASSLES, data1$ANX)

will give:

[Scatterplot of data1$ANX against data1$HASSLES]
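Both hist() and plot() accept the standard base-graphics arguments main, xlab and ylab, so the figures can be given titles and axis labels. For example:

# Histogram with a title and a labelled x axis
hist(data1$HASSLES, main = "Distribution of hassles", xlab = "Hassles score")
# Scatterplot with labelled axes
plot(data1$HASSLES, data1$ANX, xlab = "Hassles", ylab = "Anxiety")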

6. Now we've got our data in, we can go ahead and do our regression. But first we are going to do an extra step, because we are, fundamentally, lazy. It's a bit dull having to retype data1$ each time we want to access a variable. (It's useful, because we can have multiple datasets open, but it's still dull.) So, we are going to attach our dataset. When a dataset is attached, you only need to type the variable name, and R takes it from the attached dataset. We type:

> attach(data1)

and our data are attached.
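Once the dataset is attached, the bare variable name works on its own; and when you are finished with a dataset, detach() undoes the attachment (which matters if two open datasets share variable names). A quick sketch:

# With data1 attached, no data1$ prefix is needed
mean(HASSLES)
# Later, detach(data1) would remove it from the search path again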

7. The basic regression command in R is glm() (the general, or generalised, linear model). We use the command:

glm(outcome ~ predictor1 + predictor2 + predictor3)

For our first regression, the analysis we want to do is:

glm(ANX ~ HASSLES)

And R outputs:

Call:  glm(formula = ANX ~ HASSLES)

Coefficients:
(Intercept)      HASSLES
     5.4226       0.2526

Degrees of Freedom: 39 Total (i.e. Null);  38 Residual
Null Deviance:      4627
Residual Deviance:  2159     AIC: 279

But this is now gone - we have the estimates, but no more. We need to have the output stored somewhere, so we can do something with it. To do this, we use the assignment operator, as before:

glm.linear <- glm(ANX ~ HASSLES)

We are naming the output of our regression glm.linear - in R language, we are creating an object, called glm.linear. If you are used to almost any other program, the result will seem a little strange, because there isn't any. If you want to know what happened, you have to ask. One way to ask is to just type the name of the object, and the object will be given to you (just like when we typed the name of the dataset, the dataset was output):

> glm.linear

Call:  glm(formula = ANX ~ HASSLES)

Coefficients:
(Intercept)      HASSLES
     5.4226       0.2526

Degrees of Freedom: 39 Total (i.e. Null);  38 Residual
Null Deviance:      4627
Residual Deviance:  2159     AIC: 279
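A small aside, and nothing below depends on it: an ordinary linear regression like this one can equally be fitted with R's lm() function, whose summary() adds R-squared and an F test. A sketch:

# lm() fits the same model as glm() with the default (gaussian) family
lm.linear <- lm(ANX ~ HASSLES)
# summary() on an lm object reports R-squared and an overall F test
summary(lm.linear)

We'll stick with the glm() version here.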

But we can do other things with the object. The function summary(), for example, when applied to a variable, gave a certain kind of result. When applied to a dataset, it gave a different kind of result. When applied to a glm object, it gives yet another kind of result:

> summary(glm.linear)

Call:
glm(formula = ANX ~ HASSLES)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-13.3153   -5.0549   -0.3794    4.5765   17.5913

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.42265    2.46541   2.199    0.034 *
HASSLES      0.25259    0.03832   6.592 8.81e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 56.80561)

    Null deviance: 4627.1  on 39  degrees of freedom
Residual deviance: 2158.6  on 38  degrees of freedom
AIC: 279.05

Number of Fisher Scoring iterations: 2

And that is what we would consider to be our answer from the regression.
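If you only want parts of that table, there are accessor functions that are tidier than reading the printed output. For example:

# Just the two parameter estimates
coef(glm.linear)
# The full coefficient table: estimates, standard errors, t values, p values
coef(summary(glm.linear))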

8. However, we can do more. There are other functions that we can apply, to get different information from our regression analysis. We can ask for the residuals, using the command (and this will surprise you):

> residuals(glm.linear)

Trouble is that the output isn't very helpful:

          1           2           3           4           5           6
-5.02121610  4.05141414  0.42171728 -12.15610083 -0.65091296  0.32833554
          7           8           9          10          11          12
 3.32475957 -8.77899791  4.85427492  -9.32568878  1.26250508 -1.68561618
...

(and so on, for all 40 cases)

But we can do stuff with that output to make it more useful. Specifically, we could put it into a variable:

> glm.linear.resids <- residuals(glm.linear)

And then we could draw a histogram of that variable:

> hist(glm.linear.resids)

Which gives the following:

[Histogram of the residuals]

9. Finally, we could find the predicted values, using the fitted.values() function:

> glm.linear.preds <- fitted.values(glm.linear)

And draw a scatterplot of residuals against predicted values, to examine heteroscedasticity, linearity and multivariate outliers:

> plot(glm.linear.preds, glm.linear.resids)

[Scatterplot of residuals against predicted values]
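A horizontal reference line at zero makes that plot easier to read: the residuals should scatter evenly around it, with no fanning out (heteroscedasticity) and no curve (non-linearity). For example:

plot(glm.linear.preds, glm.linear.resids)
# Dashed horizontal line at zero, which the residuals should centre on
abline(h = 0, lty = 2)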

That was all rather hard work, wasn't it? Why bother, when we (probably) have access to SPSS, Excel, or other programs that can do regression? There are, as I see it, two reasons. First, R is more than a statistics package: it's a programming environment. You can access parts of the regression output, and do whatever you want with them. Because glm.linear is an object, you can access it like a dataset. The first element in that object is referred to as glm.linear[1] - if you type this you get:

> glm.linear[1]
$coefficients
(Intercept)     HASSLES
  5.4226465   0.2525939

The coefficients are another object. You can extract a part of that object, by asking for the first element of it:

> glm.linear[[1]][[1]]
[1] 5.422646

And because that's a bit of output, you can make it a variable, and do stuff with it:

> x <- glm.linear[[1]][[1]]
> x
[1] 5.422646
> x * 3
[1] 16.26794

It's a bit like the output management system in SPSS, but much easier to use. (Do you use SPSS? Have you ever used the output management system? Thought not - because it's fiddly.) The second reason to bother with this is that, although R has a steep learning curve to start with, it gets flatter, faster, than other programs, because all of the commands are very similar. To do a logistic regression in (say) SPSS, you need to use a different interface, with different rules. In R, to do a logistic regression, you add 'family = binomial'; to do a Poisson regression, you add 'family = poisson'. Multilevel models in R are similar: instead of the glm() command, we use the lme() command (for linear mixed effects) or nlme() (for non-linear mixed effects). The commands are very similar, so once you've learned one, you've learned them all.
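To make that concrete, here is a sketch of what those variations look like. The variables RECOVERED (which would need to be coded 0/1), TREATED and COUNT (a count variable) are hypothetical names invented for illustration - they are not in our dataset:

# Logistic regression: same formula interface, plus a family argument
glm.logistic <- glm(RECOVERED ~ TREATED, family = binomial)
# Poisson regression: only the family changes
glm.poisson <- glm(COUNT ~ HASSLES, family = poisson)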
