
Correlation and Linear Regression
Chapter 13

Copyright © 2015 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
Learning Objectives
LO13-1 Explain the purpose of correlation analysis.
LO13-2 Calculate a correlation coefficient to test and interpret the
relationship between two variables.
LO13-3 Apply regression analysis to estimate the linear
relationship between two variables.
LO13-4 Evaluate the significance of the slope of the regression
equation.
LO13-5 Evaluate a regression equation’s ability to predict using
the standard estimate of the error and the coefficient of
determination.
LO13-6 Calculate and interpret confidence and prediction
intervals.
LO13-7 Use a log function to transform a nonlinear relationship.
13-2
LO13-1 Explain the purpose of
correlation analysis.

Correlation Analysis – Measuring the Relationship Between Two Variables
 Analyzing relationships between two quantitative
variables.
 The basic hypothesis of correlation analysis: Does
the data indicate that there is a relationship between
two quantitative variables?
 For the Applewood Auto sales data, the data are displayed in a scatter graph.
 Are profit per vehicle and age correlated?

13-3
LO13-1

Correlation Analysis – Measuring the Relationship Between Two Variables
The Coefficient of Correlation (r) is a measure of the
strength of the relationship between two variables.

 The sample correlation coefficient is identified by the lowercase letter r.


 It shows the direction and strength of the linear relationship between two
interval- or ratio-scale variables.
 It ranges from -1 up to and including +1.
 A value near 0 indicates there is little linear relationship between the variables.
 A value near +1 indicates a direct or positive linear relationship between the
variables.
 A value near -1 indicates an inverse or negative linear relationship between
the variables.

13-4
LO13-1

Correlation Analysis – Measuring the Relationship Between Two Variables

13-5
LO13-2 Calculate a correlation coefficient to test and
interpret the relationship between two variables.

Correlation Analysis – Measuring the Relationship Between Two Variables
 Computing the Correlation Coefficient:

r = Σ(x − x̄)(y − ȳ) / [(n − 1)·sx·sy]
13-6
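As a minimal sketch, the definition above can be computed directly; the `correlation` helper and the data below are illustrative, not the textbook's Copier Sales sample:

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample correlation: r = sum((xi - xbar)(yi - ybar)) / ((n - 1) * sx * sy)."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    numerator = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return numerator / ((n - 1) * stdev(x) * stdev(y))

# Illustrative data, NOT the textbook's Copier Sales sample
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]
r = correlation(calls, sold)   # always falls between -1 and +1
```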
LO13-2

Correlation Analysis – Example


The sales manager of Copier Sales of America has a large sales force
throughout the United States and Canada and wants to determine whether
there is a relationship between the number of sales calls made in a month
and the number of copiers sold that month. The manager selects a random
sample of 15 representatives and determines the number of sales calls each
representative made last month and the number of copiers sold.
Determine if the number of sales calls and copiers sold are correlated.

13-7
LO13-2

Correlation Analysis – Example


Step 1: State the null and alternate hypotheses.

H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)

Step 2: Select a level of significance.


We select a .05 level of significance.

Step 3: Identify the test statistic.


To test a hypothesis about a correlation, we use the t-statistic, t = r√(n − 2) / √(1 − r²).
For this analysis, there are n − 2 degrees of freedom.

13-8
LO13-2

Correlation Analysis – Example


Step 4: Formulate a decision rule.

Reject H0 if:
t > tα/2,n-2 or t < -tα/2,n-2

t > t0.025,13 or t < -t0.025,13


t > 2.160 or t < -2.160

13-9
LO13-2

Correlation Coefficient – Example


Step 5: Take a sample, calculate the statistics, arrive at a decision.

x̄ = 96; ȳ = 45; sx = 42.76; sy = 12.89

r = Σ(x − x̄)(y − ȳ) / [(n − 1)·sx·sy] = .865

13-10
LO13-2

Correlation Coefficient – Example


Step 5 (continued): Take a sample, calculate the statistics, arrive at a decision.

The t-test statistic, 6.216, is greater than 2.160. Therefore, reject the null hypothesis that the correlation coefficient is zero.
Step 6: Interpret the result. The data indicate that there is a significant
correlation between the number of sales calls and copiers sold. We
can also observe that the correlation coefficient is .865, which
indicates a strong, positive relationship. In other words, more sales
calls are strongly related to more copier sales. Please note that this
statistical analysis does not provide any evidence of a causal
relationship. Another type of study is needed to test that hypothesis.
13-11
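The test in Steps 1 through 5 relies on the statistic t = r√(n − 2)/√(1 − r²), the standard form for this hypothesis. A sketch with the slide's values:

```python
import math

def t_for_correlation(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Copier Sales values from the slides: r = .865, n = 15
t = t_for_correlation(0.865, 15)   # about 6.216 > 2.160, so reject H0
```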
LO13-3 Apply regression analysis to estimate
the linear relationship between two variables.

Regression Analysis
Correlation Analysis tests for the strength and direction of the
relationship between two quantitative variables.

Regression Analysis evaluates and “measures” the relationship between two quantitative variables with a linear equation. This equation has the same elements as any equation of a line, that is, a slope and an intercept.

Y = a + bX

The relationship between X and Y is defined by the values of the intercept, a, and the slope, b. In regression analysis, we use data (observed values of X and Y) to estimate the values of a and b.

13-12
LO13-3

Regression Analysis
EXAMPLES
 Assuming a linear relationship between the size of a home,
measured in square feet, and the cost to heat the home in
January, how does the cost vary relative to the size of the
home?

 In a study of automobile fuel efficiency, assuming a linear relationship between miles per gallon and the weight of a car, how does the fuel efficiency vary relative to the weight of a car?

13-13
LO13-3

Regression Analysis: Variables


Y = a + bX

 Y is the Dependent Variable. It is the variable being predicted or estimated.

 X is the Independent Variable. For a regression equation, it is the variable used to estimate the dependent variable, Y. X is the predictor variable.

Examples of dependent and independent variables:

 How does the size of a home, measured in number of square feet, relate to the cost to heat the home in January? We would use the home size as X, the independent variable, to predict the heating cost, Y, the dependent variable.
Regression equation: Heating cost = a + b (home size)

 How does the weight of a car relate to the car’s fuel efficiency? We would use car weight as X, the independent variable, to predict the car’s fuel efficiency, Y, the dependent variable.

Regression equation: Miles per gallon = a + b (car weight)

13-14
LO13-3

Regression Analysis – Example


 Regression analysis estimates a and b by fitting a line
to the observed data.

 Each line (Y = a + bX) is defined by values of a and b.

A way to find the line of “best fit” to the data is the:

LEAST SQUARES PRINCIPLE: Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.
13-15
LO13-3

Regression Analysis – Example


Recall the example involving Copier
Sales of America. The sales manager
gathered information on the number of
sales calls made and the number of
copiers sold for a random sample of
15 sales representatives. Use the
least squares method to determine a
linear equation to express the
relationship between the two
variables.
In this example, the number of sales
calls is the independent variable, X,
and the number of copiers sold is the
dependent variable, Y.
Number of Copiers Sold = a + b ( Number of Sales Calls)

What is the expected number of copiers sold by a representative who made 20 calls?
13-16
LO13-3

Regression Analysis – Example


Descriptive statistics:

Correlation coefficient:

13-17
LO13-3

Regression Analysis – Example


Step 1: Find the slope (b) of the line.

Step 2: Find the y-intercept (a).

Step 3: Create the regression equation.


Number of Copiers Sold = 19.9632 + 0.2608 ( Number of Sales Calls)

Step 4: What is the predicted number of sales if someone makes 20 sales calls?
Number of Copiers Sold = 25.1792 = 19.9632 + 0.2608(20)

13-18
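Steps 1 through 4 above follow from the computational formulas b = r(sy/sx) and a = ȳ − b·x̄. A sketch using the summary statistics from the previous slide:

```python
# Summary statistics from the slides
r, sx, sy = 0.865, 42.76, 12.89
xbar, ybar = 96, 45

b = r * (sy / sx)        # slope, about 0.2608
a = ybar - b * xbar      # intercept, about 19.97

# Step 4: predicted copiers sold for 20 sales calls, about 25.18
predicted = a + b * 20
```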
LO13-3

Regression Analysis ANOVA (Excel) – Example

Number of Copiers Sold = 19.9800 + 0.2606 ( Number of Sales Calls)

* Note that the Excel differences in the values of a and b are due to rounding.
13-19
LO13-4 Evaluate the significance of the
slope of the regression equation.

Regression Analysis: Testing the Significance of the Slope – Example
Step 1: State the null and alternate hypotheses.

H0: β = 0 (the slope of the regression equation is 0)


H1: β ≠ 0 (the slope of the regression equation is not 0)

Step 2: Select a level of significance.


We select a .05 level of significance.

Step 3: Identify the test statistic.


To test a hypothesis about the slope of a regression equation, we use the t-statistic, t = b / sb, where sb is the standard error of the slope. For this analysis, there are n − 2 degrees of freedom.

13-20
LO13-4

Regression Analysis: Testing the Significance of the Slope – Example
Step 4: Formulate a decision rule.

Reject H0 if:
t > tα/2,n-2 or t < -tα/2,n-2

t > t0.025,13 or t < -t0.025,13

t > 2.160 or t < -2.160

13-21
LO13-4

Regression Analysis: Testing the Significance of the Slope – Example
Step 5: Take a sample, calculate the ANOVA (Excel), arrive at a decision.

Decision: Reject the null hypothesis that the slope of the regression equation is equal to zero.
13-22
LO13-4

Regression Analysis: Testing the Significance of the Slope – Example
Step 6: Interpret the result. For the regression equation that predicts the number of copier sales based on the number of sales calls, the data indicate that the slope (0.2606) is not equal to zero. Therefore, the slope can be interpreted and used to relate the dependent variable (number of copier sales) to the independent variable (number of sales calls). In fact, the value of the slope indicates that for an increase of 1 sales call, the number of copiers sold will increase by 0.2606. If a salesperson increases their number of sales calls by 10, the value of the slope indicates that the number of copiers sold is predicted to increase by 2.606.

As in correlation analysis, please note that this statistical analysis does not
provide any evidence of a causal relationship. Another type of study is
needed to test that hypothesis.

13-23
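The slides report this test via Excel's ANOVA output. As a hand-calculation sketch, assuming the standard forms t = b/sb and sb = sy·x/√Σ(x − x̄)², with Σ(x − x̄)² recovered from the sample variance:

```python
import math

# Slide values: n = 15, sx = 42.76, slope b = 0.2606, s_yx = 6.720
n, sx = 15, 42.76
b, s_yx = 0.2606, 6.720

ss_x = (n - 1) * sx ** 2            # sum of (x - xbar)^2, via the sample variance
se_b = s_yx / math.sqrt(ss_x)       # standard error of the slope, about 0.042
t = b / se_b                        # about 6.2; exceeds 2.160, so reject H0
```

Up to rounding, this matches the t-statistic from the correlation test; in simple regression the two tests are equivalent.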
LO13-5 Evaluate a regression equation’s ability to predict using the
standard estimate of the error and the coefficient of determination.

Regression Analysis: The Standard Error of Estimate
 The standard error of estimate measures the scatter, or
dispersion, of the observed values around the line of
regression for a given value of X.
 The standard error of estimate is important in the
calculation of confidence and prediction intervals.
 Formula used to compute the standard error:

sy·x = √[ Σ(y − ŷ)² / (n − 2) ]

13-24
LO13-5

Regression Analysis ANOVA: The Standard Error of Estimate – Example
Recall the example
involving Copier Sales
of America. The sales
manager determined
the least squares
regression.

Determine the standard error of estimate as a measure of how well the values fit the regression line.

sy·x = √[ Σ(y − ŷ)² / (n − 2) ] = √(587.11 / 13) = √45.16231 = 6.720

13-25
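The calculation above can be sketched as:

```python
import math

sse = 587.11                      # sum of squared residuals, from the slide
n = 15
s_yx = math.sqrt(sse / (n - 2))   # standard error of estimate, about 6.720
```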
LO13-5

Regression Analysis: Coefficient of Determination
The coefficient of determination (r2) is the proportion of
the total variation in the dependent variable (Y) that is
explained or accounted for by the variation in the
independent variable (X). It is the square of the coefficient
of correlation.

 It ranges from 0 to 1.
 It does not provide any information on the
direction of the relationship between the variables.

13-26
LO13-5

Regression Analysis: Coefficient of Determination – Example
 The coefficient of determination, r2, is 0.748. It
can be computed as the correlation coefficient,
squared: (0.865)2.

 The coefficient of determination is expressed as a proportion or percent; we say that 74.8 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.

13-27
LO13-5

Regression Analysis ANOVA: Coefficient of Determination – Example
The Coefficient of Determination can also be computed based on its definition: the Regression Sum of Squares (the variation in the dependent variable explained by the regression equation) divided by the Total Sum of Squares (the total variation in the dependent variable).

R² = Regression Sum of Squares / Total Sum of Squares = 1738.89 / 2326 = 0.748

13-28
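Both routes to the coefficient of determination can be sketched as:

```python
# Route 1: square the correlation coefficient
r = 0.865
r_sq_from_r = r ** 2                 # about 0.748

# Route 2: ratio of sums of squares from the ANOVA table
ssr, sst = 1738.89, 2326
r_squared = ssr / sst                # about 0.748
```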
LO13-6 Calculate and interpret
confidence and prediction intervals.

Regression Analysis: Computing Interval Estimates for Y
A regression equation is used to predict or estimate the population value
of the dependent variable, Y, for a given X. In general, estimates of
population parameters are subject to sampling error. Recall that
confidence intervals account for sampling error by providing an interval
estimate of a population parameter.

In regression analysis, interval estimates are also used to provide a complete picture of the point estimate of Y for a given X by computing an interval estimate that accounts for sampling error.

In regression analysis, there are two types of intervals:

 A confidence interval reports the interval estimate for the mean value
of Y for a given X.
 A prediction interval reports the interval estimate for an individual
value of Y for a particular value of X.
13-29
LO13-6

Regression Analysis: Computing Interval Estimates for Y
Assumptions underlying linear regression:
 For each value of X, the Y values are normally distributed.
The means of these normal distributions of Y values all lie on
the regression line.
 The standard deviations of these normal distributions are
equal.
 The Y values are statistically independent. This means
that in the selection of a sample, the Y values chosen for a
particular X value do not depend on the Y values for any other
X values.

13-30
LO13-6

Regression Analysis: Computing Interval Estimates for Y – Example
We return to the Copier Sales of America illustration. Determine a 95 percent confidence interval for the population mean number of copiers sold by all sales representatives who make 50 sales calls.

*Note: the values of “a” and “b” differ from the Excel values due to rounding.

Thus, the 95% confidence interval for all sales representatives who make 50 calls is from 27.3942 up to 38.6122. To interpret, let’s round the values. For all sales representatives who make 50 calls, the predicted mean number of copiers sold is 33. The mean sales will range from 27 to 39 copiers.

13-31
LO13-6

Regression Analysis: Computing Interval Estimates for Y – Example
Comments on calculation:

The t-statistic is 2.160, based on a two-tailed test with n – 2 = 15 – 2 = 13 degrees of freedom. The only new value is Σ(x − x̄)².

Note that the width of the interval, or the margin of error when predicting the dependent variable, is related to the standard error of the estimate.

*Note: the values of “a” and “b” differ from the Excel values due to rounding.

13-32
LO13-6

Regression Analysis: Computing Interval Estimates for Y – Example
We return to the Copier Sales of America illustration. Determine a 95 percent prediction interval for an individual sales representative, such as Sheila Baker, who makes 50 sales calls.

*Note: the values of “a” and “b” differ from the Excel values due to rounding.

Thus, the prediction interval of copiers sold by an individual salesperson, such as Sheila Baker, who makes 50 sales calls is from 17.442 up to 48.5644 copiers. Rounding these results, the predicted number of copiers sold will be between 17 and 49. This interval is quite large. It is much larger than the confidence interval for all sales representatives who made 50 calls. It is logical, however, that there should be more variation in the sales estimate for an individual than for the mean of a group.
13-33
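Assuming the standard interval formulas, ŷ ± t·sy·x·√(1/n + (x − x̄)²/Σ(x − x̄)²) for the confidence interval and the same expression with 1 added under the root for the prediction interval, a sketch reproduces the slide's numbers:

```python
import math

n = 15
a, b = 19.9632, 0.2608          # regression coefficients from the slides
xbar, sx = 96, 42.76
s_yx = 6.720                    # standard error of estimate
t = 2.160                       # t(0.025, 13)
x = 50

y_hat = a + b * x                          # point estimate, about 33.0
ss_x = (n - 1) * sx ** 2                   # sum of (x - xbar)^2
core = 1 / n + (x - xbar) ** 2 / ss_x

half_ci = t * s_yx * math.sqrt(core)       # confidence interval half-width
half_pi = t * s_yx * math.sqrt(1 + core)   # prediction: 1 added under the root
# CI: (y_hat - half_ci, y_hat + half_ci), about (27.39, 38.61)
# PI: (y_hat - half_pi, y_hat + half_pi), about (17.44, 48.56)
```

The prediction interval is wider than the confidence interval only because of the extra 1 under the square root.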
LO13-6

Regression Analysis: Computing Interval Estimates for Y – Example
Comments on calculation:

The t-statistic is 2.160, based on a two-tailed test with n – 2 = 15 – 2 = 13 degrees of freedom. The only new value is Σ(x − x̄)².

Note that the width of the interval, or the margin of error when predicting the dependent variable, is related to the standard error of the estimate. Also, note that the prediction interval is wider because 1 is added to the sum under the square root sign.

*Note: the values of “a” and “b” differ from the Excel values due to rounding.

13-34
LO13-6

Regression Analysis: Computing Interval Estimates for Y – Minitab Illustration

[Minitab fitted-line plot showing the confidence interval and prediction interval bands around the regression line.]

13-35
LO13-7 Use a log function to
transform a nonlinear relationship.

Regression Analysis: Transforming Non-linear Relationships
One of the assumptions of
regression analysis is that the
relationship between the dependent
and independent variables is
LINEAR. Sometimes, two variables
have a NON-LINEAR relationship.

When this occurs, the data can be transformed to create a linear relationship. The regression analysis is then applied to the transformed variables.

13-36
LO13-7

Regression Analysis: Transforming Non-Linear Relationships
In this case, the dependent variable, sales, is transformed to the
log(sales). The graph shows that the relationship between log(sales)
and price is linear. Now regression analysis can be used to create the
regression equation between log(sales) and price.

13-37
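A minimal sketch of the transform-then-fit approach; the price/sales data and the `fit_line` helper are hypothetical, not from the textbook:

```python
import math
from statistics import mean

def fit_line(x, y):
    """Least squares intercept and slope."""
    xbar, ybar = mean(x), mean(y)
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b * xbar, b

# Hypothetical price/sales data with a non-linear (roughly exponential) shape
price = [1, 2, 3, 4, 5]
sales = [1000, 370, 140, 50, 18]

log_sales = [math.log10(s) for s in sales]  # transform the dependent variable
a, b = fit_line(price, log_sales)           # linear in the transformed scale

# To predict sales at a given price, undo the transform: 10 ** (a + b * price)
```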
