Professional Documents
Culture Documents
9780521740494c23.xml
CUAU033-EVANS
11:42
C H A P T E R
23
Investigating the
relationship between
two numerical variables
Objectives
PL
E
P1: FXS/ABE
SA
M
In Chapter 22 statistics of one variable were discussed. Sometimes values of a variable for
more than one group have been examined, such as age of mothers and age of fathers, but only
one variable was considered for each individual at a time.
When two variables are observed for each subject, bivariate data are obtained. For example,
it might be interesting to record the number of hours spent studying for an exam by each
student in a class and the mark they achieved in the exam. If each of these variables were
considered separately the methods discussed earlier would be used. It may be of more interest
to examine the relationship between the two variables, in which case new bivariate techniques
are required. When exploring bivariate data, questions arise such as, Is there a relationship
between two variables? or Does knowing the value of one of the variables tell us anything
about the value of the other variable?
554
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
555
Consider the relationship between the number of cigarettes smoked per day and blood
pressure. Since one opinion might be that varying the number of cigarettes smoked may affect
blood pressure, it is necessary to distinguish between blood pressure, which is called the
dependent or response variable, and the number of cigarettes, which is called the
independent or explanatory variable. In this chapter some techniques are introduced which
enable questions concerning the nature of the relationship between such variables to be
answered.
PL
E
23.1
As with data concerning one variable, the most important first step in analysing bivariate data
is the construction of a visual display. When both of the variables of interest are numerical then
a scatterplot (or bivariate plot) may be constructed. This is the single most important tool in
the analysis of such bivariate data, and should always be examined before further analysis is
undertaken. The pairs of data points are plotted on the cartesian plane, with each pair
contributing one point to the plot. Using the normal convention, the variable plotted
horizontally is denoted as x, and the variable plotted vertically as y. The following example
examines the features of the scatterplot in more detail.
Example 1
SA
M
The number of hours spent studying for an examination by each member of a class, and the
marks they were awarded, are given in the table.
Student
Hours
Mark
1
4
27
2
36
87
3
23
67
4
28
84
5
25
66
6
11
52
7
18
61
8
13
43
9
4
38
10
8
52
Student
Hours
Mark
11
4
41
12
19
54
13
6
57
14
19
62
15
1
23
16
29
65
17
33
75
18
36
83
19
28
65
20
15
55
The first decision to be made when preparing this scatterplot is whether to show Mark
or Hours on the horizontal (x) axis. Since a students mark is likely to depend on the
hours that they spend studying, in this case Hours is the independent variable and
Mark is the dependent variable. By convention, the independent variable is plotted on
the horizontal (x) axis, and the dependent variable on the vertical (y) axis, giving the
scatterplot shown.
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
PL
E
P1: FXS/ABE
10
30
20
40
Hours x
From this scatterplot, a general trend can be seen of increasing marks with increasing hours of
study. There is said to be a positive association between the variables.
Two variables are positively associated when larger values of y are associated with larger
values of x, as shown in the previous scatterplot.
Examples of variables which exhibit positive association are height and weight, foot size and
hand size, and number of people in the family and household expenditure on food.
Example 2
SA
M
The age, in years, of several cars and their advertised price in a newspaper are given in the
following table.
Age (years)
4
6
5
Price ($)
13 000 9 800 11 000
Age (years)
7
Price ($)
9 700
7
4
2
3
3
8 300 10 500 15 800 14 300 13 800
6
4
6
4
8
9 500 13 200 10 000 11 800 8 000
6
12 200
In this case the independent variable is the age of the car, which is plotted on the
horizontal axis. The dependent variable, price, is plotted on the vertical axis.
Price ($) y
16 000
14 000
12 000
10 000
8 000
1
7
8
Age (years) x
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
From the scatterplot a general trend of decreasing price with increasing age of car can be seen.
There is said to be a negative association between the variables.
Two variables are negatively associated when larger values of y are associated with smaller
values of x, as shown in the scatterplot above.
Examples of other variables which exhibit negative association are weight and number of
weeks spent on a diet program, hearing ability and age, and number of cold rainy days per
week and sales of ice creams.
The third alternative is that a scatterplot shows no particular pattern, indicating no
association between the variables.
PL
E
P1: FXS/ABE
y
8
6
4
SA
M
There is no association between two variables when the values of y are not related to the
values of x, as shown in the preceding scatterplot.
Examples of variables which show no association are height and IQ for adults, price of cars
and fuel consumption, and size of family and number of pets.
When one point, or a few points, do not seem to fit with the rest of the data they are called
outliers. Sometimes a point is an outlier, not because its x value or its y value is in itself
unusual, but rather because this particular combination of values is atypical. Consequently such
an outlier cannot always be detected from single variable displays, such as stem-and-leaf plots.
For example, consider this scatterplot. While the
y
variable plotted on the horizontal axis takes values
8
from 1 to 8 and the variable plotted on the vertical
6
axis takes values from 2 to 8, the combination (2, 8)
4
is clearly an outlier.
2
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
PL
E
The calculator can be used to construct a scatterplot of statistical data. The procedure is
illustrated using the age and price of car data from Example 2.
The data is easiest entered in a Lists &
3).
Spreadsheet application (
) to
Firstly, use the up/down arrows (
name the first column age and the second
column price.
Then enter the data as shown.
SA
M
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
559
10
12
18
20
13
11
6
9
8
6
5
6
12
12
15
13
15
17
SA
M
PL
E
Enter the data into list1 (x) and list2 (y) in the
module. Tap SetGraph,
Setting . . . and select the tab for Graph3.
(Note: Following on from the types of graphs in
univariate statistics, this allows the Scatterplot
settings to be remembered and called upon when
required.)
Ensure that all other graphs are de-selected and
to produce the graph shown in the full
tap
screen.
Select the graph window (bold border) and tap Analysis, Trace to scroll from point to
point and display the coordinates at the bottom of the graph.
Exercise 23A
Save your data for 14 in named lists as they will be needed for later exercises.
1 The amount of a particular pain relief drug given to each patient and the time taken for the
patient to experience pain relief are shown.
Note:
Patient
Drug dose (mg)
Response time (min)
1
0.5
65
2
3
4
5
6
7
8
9 10
1.2 4.0 5.3 2.6 3.7 5.1 1.7 0.3 4.0
35 15 10 22 16 10 18 70 20
2 The proprietor of a hairdressing salon recorded the amount spent advertising in the local
paper and the business income for each month of a year, with the following results.
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
PL
E
3 The number of passenger seats on the most commonly used commercial aircraft, and the
airspeeds of these aircraft, in km/h, are shown in the following table.
Number of seats
Airspeed (km/h)
405
830
296
797
288
774
258
736
240
757
230
765
193
760
188
718
Number of seats
Airspeed (km/h)
148
683
142
666
131
661
122
378
115
605
112
620
103
576
102
603
SA
M
4 The price and age of several secondhand caravans are listed in the table.
Age (years)
7
7
8
9
4
8
1
Price ($)
4 800
3 900
4 275
3 900
6 900
6 500
11 400
Age (years)
10
9
9
11
3
4
7
Price ($)
8 700
1 950
3 300
1 650
9 600
8 400
6 600
23.2
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
561
numerical variables. Thus it is sensible, for example, to calculate the correlation between the
heights and weights for a group of students, but not between height and gender, as gender is
not a numerical variable. There are many different numerical measures of correlation, and each
has different properties. In this section the q-correlation coefficient will be introduced.
Consider the scatterplot of the number of hours spent by each member of a class when
studying for an examination, and the mark they were awarded, from Example 1. This shows a
positive association. To calculate the q-correlation coefficient, first find the median value for
each of the variables separately. This can be done from the data, but it is usually simpler to
calculate directly from the plot. There are 20 data points, and the median values are halfway
between the 10th and 11th points, both vertically and horizontally. A vertical line is then drawn
through the median x value, and a horizontal line through the median y value. The effect of this
is to divide the plot into four regions, as shown.
Marks y
80
60
40
SA
M
20
PL
E
P1: FXS/ABE
10
30
20
40
Hours x
Each of the four regions which have been created in this way is called a quadrant, and it can
be noticed immediately that most of the points in this plot are in the upper right and lower left
quadrants. In fact, wherever there is a positive association between variables this will be the
case.
Consider the scatterplot of the age of cars and the advertised price from Example 2, which
shows negative association. Again the median value for each of the variables is found
separately. There are 15 data points, giving the median values at the 8th points, both vertically
and horizontally. A vertical line is then drawn through the median x value, and a horizontal line
through the median y value. In this particular case they are coordinates of the same point, but
this need not be so.
Price ($) y
16 000
14 000
12 000
10 000
8 000
1
7
8
Age (years) x
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
It can be seen in this example that most points are in the upper left and the lower right
quadrants, and this is true whenever there is a negative association between variables.
These observations lead to a definition of the q-correlation coefficient.
The q-correlation coefficient can be determined from the scatterplot as follows.
Find the median of all the
x values in the data set, and
draw a vertical line through
this value.
PL
E
P1: FXS/ABE
The plane is now divided into four quadrants. Label the quadrants A, B, C and D as shown
in the diagram.
Count the number of points in each of the quadrants A, B, C and D. Any points which lie
on the median lines are omitted.
Let a, b, c, d represent the number of points in each of the quadrants A, B, C and D
respectively. Then the q-correlation coefficient is given by
(a + c) (b + d)
a+b+c+d
SA
M
q=
Example 3
Use the scatterplot from Example 1 to determine the q-correlation coefficient for the number
of hours each member of a class spent studying for an examination and the mark they were
awarded.
Solution
There are nine points in quadrant A, one in quadrant B, nine in quadrant C and one in
quadrant D.
Thus
(a + c) (b + d)
a+b+c+d
(9 + 9) (1 + 1)
=
9+1+9+1
18 2
=
20
16
=
20
= 0.8
q=
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
563
Example 4
Use the scatterplot from Example 2 to determine the q-correlation coefficient for the age of
cars and their advertised price.
Solution
There is one point in quadrant A, six in quadrant B, one in quadrant C and six in
quadrant D.
PL
E
P1: FXS/ABE
(a + c) (b + d)
a+b+c+d
(1 + 1) (6 + 6)
=
1+6+1+6
2 12
=
14
10
=
14
= 0.71
q=
From Examples 3 and 4 it can be seen that q-correlation coefficients may take both positive
and negative values. Consider the situation when all the points are in the quadrants A and C.
(a + c) (b + d)
a+b+c+d
a+c
(since b and d are both equal to zero)
=
a+c
=1
SA
M
q=
Thus the maximum value the q-correlation coefficient may take is 1, and this indicates a
measure of strong positive association.
Suppose all the points are in the quadrants B and D.
(a + c) (b + d)
a+b+c+d
(b + d)
(since a and c are both equal to zero)
=
b+d
= 1
q=
Thus the minimum value the q-correlation coefficient may take is 1, and this indicates a
measure of strong negative association.
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
When the same number of points are in each of the quadrants A, B, C and D then:
(a + c) (b + d)
a+b+c+d
0
(since a = b = c = d)
=
a+b+c+d
=0
q=
This value of the q-correlation coefficient clearly indicates that no association exists.
PL
E
0.75
0.50
0.25
< 0.25
< 0.50
< 0.75
1
Exercise 23B
SA
M
b q = 0.30
f q = 0.75
j q = 0.25
c q = 0.85
g q = 0.75
k q=1
d q = 0.33
h q = 0.24
l q = 0.50
2 Calculate the q-correlation coefficient for each pair of variables shown in the following
scatterplots.
a
36
24
12
15.0
20.0
25.0
30.0
35.0
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
565
280
210
140
120
y
8
6
4
2
0
y
160
200
SA
M
PL
E
P1: FXS/ABE
240
280
80
60
40
20
Example
10
20
30
40
3 The amount of a particular pain relief drug given to each patient and the time taken for the
patient to experience pain relief are shown.
Patient
Drug dose (mg)
Response time (min)
1
0.5
65
2
1.2
35
3
4.0
15
4
5.3
10
5
2.6
22
6
3.7
16
7
5.1
10
8
1.7
18
9
0.3
70
10
4.0
20
a Use your scatterplot from 1, Exercise 23A to find the q-correlation coefficient for
response time and drug dosage.
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
b Classify the strength and direction of the relationship between response time and drug
dosage according to the table given.
Example
4 The proprietor of a hairdressing salon recorded the amount spent advertising in the local
paper and the business income for each month of a year, with the following results.
Month Advertising ($) Business ($)
7
350
8 060
8
300
7 030
9
550
11 500
10
600
12 870
11
550
10 560
12
450
9 850
PL
E
a Use your scatterplot from 2, Exercise 23A to find the q-correlation coefficient for
advertising expenditure and total business conducted.
b Classify the strength and direction of the relationship between advertising expenditure
and business income according to the table given.
5 The number of passenger seats on the most commonly used commercial aircraft, and the
airspeeds of these aircraft, in km/h, are shown in the following table.
405
830
296
797
288
774
258
736
240
757
230
765
193
760
188
718
Number of seats
Airspeed (km/h)
148
683
142
666
131
661
122
378
115
605
112
620
103
576
102
603
SA
M
Number of seats
Airspeed (km/h)
a Use your scatterplot from 3, Exercise 23A to find the q-correlation coefficient for the
number of seats on an airline and the air speed.
b Classify the strength and direction of the relationship between the number of seats on
an airline and the air speed according to the table given.
6 The price and age of several secondhand caravans are listed in the table.
Age (years)
7
7
8
9
4
8
1
Price ($)
4 800
3 900
4 275
3 900
6 900
6 500
11 400
Age (years)
10
9
9
11
3
4
7
Price ($)
8 700
1 950
3 300
1 650
9 600
8 400
6 600
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
567
a Use your scatterplot from 4, Exercise 23A to find the q-correlation coefficient for price
and age of secondhand caravans.
b Classify the strength and direction of the relationship between price and age of
secondhand caravans according to the table given.
23.3
r=
PL
E
When a relationship is linear the most commonly used measure of strength of the relationship
is Pearsons product-moment correlation coefficient, r. It gives a numerical measure of the
degree to which the points in the scatterplot tend to cluster around a straight line.
Pearsons product-moment correlation is defined to be
degree which the variables vary together
degree which the two variables vary separately
Formally, if we call the two variables x and y and we have n observations then Pearsons
product-moment correlation for this set of observations is
n
1
xi x
yi y
r=
n 1 i=1
sx
sy
SA
M
where x and sx are the mean and standard deviation of the x scores and y and s y are the mean
and standard deviation of the y scores.
There are two key assumptions made in calculating Pearsons correlation coefficient r. They
are
the data is numerical
the relationship being described is linear.
Pearsons correlation coefficient r has the following properties:
If there is no linear relationship, r = 0.
y
r=0
r = +1
r = 1
Cambridge University Press Uncorrected Sample Pages 978-0-521-61252-4
2008 Evans, Lipson, Jones, Avery, TI-Nspire & Casio ClassPad material prepared in collaboration with Jan Honnens & David Hibbard
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
Otherwise, 1 r +1
Pearsons correlation coefficient r can be classified as follows:
0.75 strong negative linear relationship
0.50 moderate negative linear relationship
0.25 weak negative linear relationship
< 0.25
no linear relationship
< 0.50
weak positive linear relationship
< 0.75
moderate positive linear relationship
1
strong positive linear relationship
PL
E
1 r
0.75 r
0.50 r
0.25 < r
0.25 r
0.50 r
0.75 r
The following scatterplots show linear relationships of various strengths together with the
corresponding value of Pearsons product-moment correlation coefficient.
30
25
12
Age
CO level
14
10
20
15
SA
M
600
1400
160
120
100
Score
Mortality
140
100
80
90
60
80
100
Smoking ratio
120
Verbal
700
20
600
15
500
10 12 14 16 18 20
Age 1st word
Calf
P1: FXS/ABE
10
5
400
500
700
600
Mathematics
800
30
40
Age
50
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
569
PL
E
SA
M
1
2
3
5
5
7
4
2
7
9
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
After the mean and standard deviation, Pearsons product-moment correlation is one of the
most frequently computed descriptive statistics. It is a powerful tool but it is also easily
misused. The presence of a linear relationship should always be confirmed with a scatterplot
before Pearsons product-moment correlation is calculated. And, like the mean and the
standard deviation, Pearsons correlation coefficient r is very sensitive to the presence of
outliers in the sample.
Exercise 23C
PL
E
P1: FXS/ABE
b r = 0.30
f r = 0.75
j r = 0.25
c r = 0.85
g r = 0.75
k r =1
d r = 0.33
h r = 0.24
l r = 1
2 By comparing the plots given to those on page 538 estimate the value of Pearsons
correlation coefficient r.
a
280
36
210
24
140
SA
M
12
15.0
20.0
25.0
30.0
35.0
y
8
14 000
12 000
10 000
8 000
2
3
20
f y
280
240
40
200
60
160
80
120
8 x
10
20
30
40
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
571
3 The amount of a particular pain relief drug given to each patient and the time taken for the
patient to experience relief are shown.
Patient
Drug dose (mg)
Response time (min)
1
0.5
65
2
1.2
35
3
4.0
15
4
5.3
10
5
2.6
22
6
3.7
16
7
5.1
10
8
1.7
18
9
0.3
70
10
4.0
20
PL
E
4 The proprietor of a hairdressing salon recorded the amount spent on advertising in the
local paper and the business income for each month for a year, with the following results.
Month Advertising ($) Business ($)
1
350
9 450
2
450
10 070
3
400
9 380
4
500
9 110
5
250
5 220
6
150
3 100
SA
M
5 The number of passenger seats on the most commonly used commercial aircraft, and the
airspeeds of these aircraft, in km/h, are shown in the following table.
Number of seats
Airspeed (km/h)
405
830
296
797
288
774
258
736
240
757
230
765
193
760
188
718
Number of seats
Airspeed (km/h)
148
683
142
666
131
661
122
378
115
605
112
620
103
576
102
603
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
6 The price and age of several secondhand caravans are listed in the table.
Age (years)
7
7
8
9
4
8
1
Price ($)
4 800
3 900
4 275
3 900
6 900
6 500
11 400
Age (years)
10
9
9
11
3
4
7
Price ($)
8 700
1 950
3 300
1 650
9 600
8 400
6 600
PL
E
P1: FXS/ABE
7 The following are the scores for a group of ten students who each had two attempts at a
test (out of 70).
Attempt 1
Attempt 2
53
63
56
66
57
67
49
58
44
54
69
70
66
70
40
55
53
63
43
53
68
70
64
70
a Construct a scatterplot of these data, and describe the relationship between scores on
attempt 1 and attempt 2.
b Is it appropriate to calculate the value of Pearsons correlation coefficient for these
data? Give reasons for your answer.
SA
M
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
23.4
573
Lines on scatterplots
PL
E
If a linear relationship exists between two variables it is possible to predict the value of the
dependent variable from the value of the independent variable. The stronger the relationship
between the two variables the better the prediction that is made. To make the prediction it is
necessary to determine an equation which relates the variables and this is achieved by fitting a
line to the data. Fitting a line to data is often referred to as regression, which comes from a
Latin word regressum which means moved back.
The simplest equation relating two variables x and y is a linear equation of the form
y = a + bx
where a and b are constants. This is similar to the general equation of a straight line, where a
represents the coordinate of the point where the line crosses the y axis (the y axis intercept),
and b represents the slope of the line. In order to summarise any particular (x, y) data set,
numerical values for a and b are needed that will make the line pass close to the data. There are
several ways in which the values of a and b can be found, of which the simplest is to find the
straight line by placing a ruler on the scatter diagram, and drawing a line by eye, which
appears to follow the general trend of the data.
Example 6
SA
M
The following table gives the gold medal winning distance, in metres, for the mens long jump
for the Olympic games for the years 1896 to 1996. (Some years were missing owing to the two
world wars.)
Find a straight line which fits the general trend of the data, and use it to predict the winning
distance in the year 2008.
Year
Distance (m)
Year
Distance (m)
1896
6.35
1960
8.13
1900
7.19
1964
8.08
1904
7.34
1968
8.92
1908
7.49
1972
8.26
1912
7.59
1976
8.36
1920
7.16
1980
8.53
1924
7.44
1984
8.53
1928
7.75
1988
8.72
1932
7.65
1992
8.67
1936
8.05
1996
8.50
1948
7.82
2000
8.55
1952 1956
7.57 7.82
2004
8.59
Solution
9.0
Distance
8.5
8.0
7.5
7.0
6.5
6.0
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
PL
E
Note that this scatterplot does not start at the origin. Since the values of the
coordinates that are of interest on both axes are a long way from zero, it is sensible to
plot the graph for that range of values only. In fact, any values less than 1896 on the
horizontal axis are meaningless in this context.
The line shown on the scatterplot is only one of many which could be drawn. To
enable the line to be used for prediction it is necessary to find its equation. To do this,
first determine the coordinates of any two points through which it passes on the
scatterplot. Appropriate points are (1932, 7.65) and (1976, 8.36). The equation of the
straight line is then found by substituting in the formula which gives the equation for a
straight line between two points.
y2 y1
y y1
=
x x1
x2 x1
8.36 7.65
y 7.65
=
x 1932
1976 1932
0.71
= 0.016
=
44
y 7.65 = 0.016(x 1932)
y = 0.016x 23.26
SA
M
The intercept for this equation is 23.26 m. In theory, this is the winning distance
for the year 0! In practice, there is no meaningful interpretation for the y axis intercept
in this situation. But the same cannot be said about the slope. A slope of 0.016 means
that on average the gold medal winning distance increases by about 1.6 centimetres at
each successive games.
Using this equation the gold medal winning distance for the long jump in 2008
would be predicted as
y = 23.26 + 0.016 2008 = 8.87 m
Obviously, attempting to project too far into the future may give us answers which are not
sensible. When using an equation for prediction, derived from data, it is sensible to use values
of the explanatory variable which are within a reasonable range of the data. The relationship
between the variables may not be linear if we move too far from the known values.
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
575
Example 7
The following table gives the alcohol consumption per head (in litres) and the hospital
admission rate to each of the regions of Victoria in 199495.
Region
LoddonMallee
Grampians
Barwon
Gippsland
Hume
Western Metropolitan
Northern Metropolitan
Eastern Metropolitan
Southern Metropolitan
Per capita
consumption
(litres of alcohol)
9.0
8.4
8.7
9.1
10.0
9.0
6.7
6.2
8.1
Hospital
admissions per
1000 residents
42.0
44.7
38.6
44.7
41.0
40.4
36.2
32.3
43.0
PL
E
P1: FXS/ABE
Find a straight line which fits the general trend of the data, and interpret the intercept and slope.
Solution
Admissions
SA
M
45.0
40.0
35.0
30.0
6.0
7.0
9.0
8.0
Alcohol consumption
10.0
One possible line passes through the points (7, 36) and (9, 42).
y2 y1
y y1
=
x x1
x2 x1
y 36
42 36
=
x 7
97
6
=
2
=3
y 36 = 3(x 7)
y = 3x + 15
or admission rate = 15 + 3 alcohol consumption
Thus
The intercept for this equation is 15, implying that we predict a hospital admission
rate would be 15 per 1000 residents for a region with 0 alcohol consumption. While
this is interpretable, it would be a brave prediction as it is well out of the range of the
Cambridge University Press Uncorrected Sample Pages 978-0-521-61252-4
2008 Evans, Lipson, Jones, Avery, TI-Nspire & Casio ClassPad material prepared in collaboration with Jan Honnens & David Hibbard
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
data. A slope of 3 means that on average the admission rate rises by 3 per 1000
residents for each additional litre of alcohol consumed per capita.
Exercise 23D
Example
0
1
1
3
2
6
3
7
4
7
5
11
6
13
7
18
8
17
PL
E
P1: FXS/ABE
Draw a straight line which fits the data by eye, and find an equation for this line.
2 Plot the following set of points on graph paper.
x
y
3
5
2
2
1
0
0
6
1
7
2
11
3
13
4
20
Draw a straight line which fits the data by eye, and find an equation for this line.
7
3 The numbers of burglaries during two successive years for various districts in one state are
given in the following table.
a Make a scatterplot of the data.
District Year 1 (x) Year 2 (y)
b Find the equation of a straight line which
A
3233
2709
relates the two variables.
B
2363
2208
c Describe the trend in burglaries in this state.
C
4591
3685
SA
M
Example
D
E
F
G
H
I
J
K
4317
2474
3679
5016
6234
6350
4072
2137
4038
2792
3292
4402
5147
5555
4004
1980
4 The following data give a girls height (in cm) between the ages of 36 months and
60 months.
Age (x)
Height (y)
36
84
40
87
44
90
52
92
56
94
60
96
a
b
c
d
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
577
5 The following table gives the adult heights (in cm) of ten pairs of mothers and daughters.
Mother (x)
Daughter (y)
170
178
163
175
157
165
165
173
175
168
160
152
164
163
168
168
152
160
173
178
PL
E
P1: FXS/ABE
7 The amount of a particular pain relief drug given to each patient and the time taken for the
patient to experience pain relief are shown.
1
0.5
65
2
1.2
35
SA
M
Patient
Drug dose (mg)
Response time (min)
3
4.0
15
4
5.3
10
5
2.6
22
6
3.7
16
7
5.1
10
8
1.7
18
9
0.3
70
10
4.0
20
a Find the equation of a straight line which relates the two variables.
b Interpret the intercept and slope if appropriate.
c Use your equation to predict the time taken for the patient to experience pain relief if 6
mg of the drug is given. Is this answer realistic?
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
23.5
PL
E
P1: FXS/ABE
y
35
30
y = a + bx
25
20
(xi, yi)
15
SA
M
10
5
10
(yi a bxi ) = 0
i=1
i=1
n
(yi a bxi )2
i=1
n
From 1 ,
n
i=1
(yi a bxi ) = 0
i=1
yi na b
n
xi = 0
i=1
y a b x = 0
a = y b x
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
579
n
bxi ]2
[yi ( y b x)
i=1
n
2
=
[(yi y ) b(xi x)]
i=1
n
i y ) + b2 (xi x)
2]
=
[(yi y )2 2b(xi x)(y
i=1
PL
E
Simplifying gives
b=
i=1
n
2
(xi x)
i=1
SA
M
Equations 3 and 4 can then be used to calculate the least squares estimates of the y axis
intercept and the slope.
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
PL
E
P1: FXS/ABE
The following data gives the heights (in cm) and weights (in kg) of 11 people.
Height (x)
Weight (y)
177
74
182
75
167
62
178
63
173
64
184
74
162
57
169
55
164
56
170
68
180
72
SA
M
Note:
After the equation of the least squares line has been determined, we can interpret the
intercept and slope in terms of the problem at hand, and use the equation to make predictions.
The method of least squares is also sensitive to any outliers in the data.
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
581
Example 8
Consider again the gold medal winning distance, in metres, for the mens long jump for
the Olympic games for the years 1896 to 2004.
Find the equation of the least squares regression line for these data, and use it to predict
the winning distance for the year 2008.
Year
Distance (m)
Year
Distance (m)
Solution
1896
6.35
1960
8.13
1900
7.19
1964
8.08
1904
7.34
1968
8.92
1908
7.49
1972
8.26
1912
7.59
1976
8.36
1920
7.16
1980
8.53
1924
7.44
1984
8.53
1928
7.75
1988
8.72
1932
7.65
1992
8.67
1936
8.05
1996
8.50
1948
7.82
2000
8.55
1952 1956
7.57 7.82
2004
8.59
PL
E
P1: FXS/ABE
SA
M
Consider again the data from Example 7 which related alcohol consumption per head (in litres)
and the hospital admission rate to each of the regions of Victoria in 199495.
Region
LoddonMallee
Grampians
Barwon
Gippsland
Hume
Western Metropolitan
Northern Metropolitan
Eastern Metropolitan
Southern Metropolitan
Hospital admissions
per 1000 residents
42.0
44.7
38.6
44.7
41.0
40.4
36.2
32.3
43.0
Find the equation of the least squares regression line which fits these data.
Solution
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
PL
E
P1: FXS/ABE
Exercise 23E
Example
1 The following data give a girls height (in cm) between the ages of 36 months and
60 months.
Age (x)
Height (y)
36
84
40
87
44
90
52
92
56
94
60
96
SA
M
a Using the method of least squares find the equation of a straight line which relates the
two variables.
b Interpret the intercept and slope, if appropriate.
c Use your equation to estimate the girls height at age
i 42 months
ii 18 years
d How reliable are your answers to part c?
District
A
B
C
D
E
F
G
H
I
J
K
Year 1 (x)
3 233
2 363
4 591
4 317
2 474
3 679
5 016
6 234
6 350
4 072
2 137
Year 2 (y)
2 709
2 208
3 685
4 038
2 792
3 292
4 402
5 147
5 555
4 004
1 980
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
583
3 The following table gives the adult heights (in cm) of ten pairs of mothers and daughters.
Mother (x)
Daughter (y)
170 163 157 165 175 160 164 168 152 173
178 175 165 173 168 152 163 168 160 178
PL
E
a Using the method of least squares find the equation of a straight line which relates the
two variables.
b Interpret the slope in this context.
c Estimate the adult height of a girl whose mother is 170 cm tall.
4 The manager of a company which manufactures MP3 players keeps a weekly record of the
cost of running the business and the number of units produced. The figures for a period of
eight weeks are:
Number of MP3
players produced (x)
Cost in 000s $ (y)
a Using the method of least squares find the equation of a straight line which relates the
two variables.
b What is the manufactures fixed cost for operating the business each week?
c What is the cost of production of each unit, over and above this fixed operating cost?
SA
M
5 The amount of a particular pain relief drug given to each patient and the time taken for the
patient to experience pain relief are shown.
Patient
Drug dose (mg)
Response time (min)
1
2
3
4
5
6
7
8
9 10
0.5 1.2 4.0 5.3 2.6 3.7 5.1 1.7 0.3 4.0
65 35 15 10 22 16 10 18 70 20
a Using the method of least squares find the equation of a straight line which relates the
two variables.
b Interpret the intercept and slope if appropriate.
c Use your equation to predict the time taken for the patient to experience pain relief
if 6 mg of the drug is given. Is this answer realistic?
Month
1
2
3
4
5
6
7
8
9
10
11
12
Advertising ($)
350
450
400
500
250
150
350
300
550
600
550
450
Business ($)
9 450
10 070
9 380
9 110
5 220
3 100
8 060
7 030
11 500
12 870
10 560
9 850
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
c Use your equation to predict the volume which would be attracted if the proprietor of
the salon spent the following amounts on advertising.
i $1000
ii $0
PL
E
P1: FXS/ABE
10
12
18
20
13
11
6
9
8
6
5
6
12
12
15
13
15
17
SA
M
a
b
c
d
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
585
1
2
3
5
5
7
4
2
7
9
PL
E
SA
M
e Press
ENTER
Use a graphics calculator to determine the equation of the least squares regression line
that will enable weight to be predicted from height.
Enter the data into your calculator using the Stats/List Editor.
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
ENTER
SA
M
e Press
PL
E
P1: FXS/ABE
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
587
SA
M
PL
E
Bivariate data arises when measurements on two variables are collected for each subject.
A scatterplot is an appropriate visual display of bivariate data if both of the variables are
numerical.
A scatterplot of the data should always be constructed to assist in the identification of
outliers and illustrate the association (positive, negative or none).
Two variables are positively associated when larger values of y are associated with larger
values of x. Two variables are negatively associated when larger values of y are associated
with smaller values of x. There is no association between two variables when the values of
y are not related to the values of x.
When constructing the scatterplot, the independent or explanatory variable is plotted on the
horizontal (x) axis, and the dependent or response variable is plotted on the vertical (y) axis.
If a linear relationship is indicated by the scatterplot a measure of its strength can be found
by calculating the q-correlation coefficient, or Pearsons product-moment correlation
coefficient, r.
If the values on a scatterplot are divided by lines representing the median of x and the
median of y into four quadrants A, B, C and D, with a, b, c, d representing the number of
points in each quadrant respectively, then the q-correlation coefficient is given by
(a + c) (b + d)
q=
a+b+c+d
Pearsons product-moment correlation, r, is a measure of strength of linear relationship
between two variables, x and y. If we have n observations then for this set of observations
n
xi x
yi y
1
r=
n 1 i=1
sx
sy
where x and sx are the mean and standard deviation of the x scores and y and s y are the
mean and standard deviation of the y scores.
For these correlation coefficients
1 q 1
1 r 1
with values close to 1 indicating strong correlation, and those close to 0 indicating little
correlation.
If a linear relationship is indicated from the scatterplot a straight line may be fitted to the
data, either by eye or using the least squares regression method.
The least squares regression line is the line for which the sum of squares of the vertical
deviations from the data to the line is a minimum.
The value of the slope (b) gives the extent of the change in the dependent variable
associated with a unit change in the independent variable.
Once found, the equation to the straight line may be used to predict values of the response
variable (y) from the explanatory variable (x). The accuracy of the prediction depends on
how closely the straight line fits the data, and an indication of this can be obtained from the
correlation coefficient.
Review
Chapter summary
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
Multiple-choice questions
PL
E
1 For which one of the following pairs of variables would it be appropriate to construct a
scatterplot?
A eye colour (blue, green, brown, other) and hair colour (black, brown, blonde, red, other)
B score out of 100 on a test for a group of Year 9 students and a group of Year 11 students
C political party preference (Labor, Liberal, Other) and age in years
D age in years and blood pressure in mm Hg
E height in cm and gender (male, female)
2 For which one of the following plots would it be appropriate to calculate the value of the
q-correlation coefficient?
A
SA
M
Review
588
4 The scatterplot shows the relationship between age and the number of alcoholic drinks
consumed on the weekend by a group of people.
The value of the q-correlation coefficient is closest to
7
5
7
E 1
D
C
A 1
B
9
6
9
20
No. of drinks
15
10
0
10
20
30
40
Age
50
60
70
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
589
220
200
Weight
180
160
140
120
100
80
150
160
170 180
Height (cm)
190
200
SA
M
140
6 The value of the Pearson product-moment correlation coefficient r for these data is closest
to
E 0.8
D 0.7
C 0.6
B 0.4
A 0.2
7 The least squares regression line which would enable expenditure on food to be predicted
from weekly income is closest to
B 0.482 42.864 weekly income
A 0.482 + 42.864 weekly income
D 239.868 + 1.355 weekly income
C 42.864 + 0.482 weekly income
E 1.355 + 239.868 weekly income
Review
5 The following scatterplot shows the relationship between height and weight for a group of
people.
The value of the Pearsons product-moment correlation coefficient r is closest to
E 0
D 0.3
C 0.5
B 0.8
A 1
PL
E
P1: FXS/ABE
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
PL
E
8 Using this rule the expenditure on entertainment by an individual with an income of $600
per week is predicted to be
E $240
D $46
C $100
B $24 060
A $40
9 From this rule which of the following statements is correct?
A On average for each extra dollar of income an extra 10 cents is spent on entertainment
B On average for each extra 10 cents in income an extra $1 is spent on entertainment
C On average for each extra dollar of income an extra 40 cents is spent on entertainment
D On average people spend $40 per week on entertainment
E On average people spend $50 per week on entertainment
10 For the scatterplot shown the line of best fit would
have a slope closest to:
C 10
B 0.1
A 0.1
E 200
D 10
250
200
150
100
SA
M
Review
590
50
10
12
14
16
18
20
22
Short-answer questions
52
53
51
64
55
58
71
Score (points)
90
134
76
92
93
45
120
66
105
108
88
133
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
591
Distance (kms)
Time (mins)
12
15
50
75
40
50
25
50
45
80
20
50
10
10
3
5
10
10
30
35
4 The following scatterplot shows the relationship between height and weight for a group of
people. Draw a straight line which fits the data by eye, and find an equation for this line.
220
200
weight
180
160
140
120
SA
M
100
80
140
150
160
170
180
height (cm)
190
200
Time (seconds)
22.6
21.7
21.7
21.3
19.3
17.6
17.0
14.6
14.0
8.8
Errors
2
3
3
4
5
5
7
7
9
9
Review
3 The distance traveled to work and the time taken for a group of company employees are
given in the following table. Determine the value of the Pearson product-moment
correlation r for these data.
PL
E
P1: FXS/ABE
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
Extended-response questions
1 A marketing company wishes to predict the likely number of new clients each of its
graduates will attract to the business in their first year of employment, by using their scores
on a marketing exam in the final year of their course.
PL
E
SA
M
Review
592
2 To investigate the relationship between marks on an assignment and the final examination
mark a sample of 10 students was taken. The table indicates the marks for the assignment
and the final exam mark for each individual student.
a Which is the independent variable
Assignment mark Final exam mark
and which is the dependent variable?
(max = 80)
(max = 90)
b Construct a scatterplot of these data.
80
83
c Describe the association between the
77
83
assignment mark and exam mark.
71
79
d Determine the value of the q-correlation
78
75
coefficient for these data, and classify the
65
68
strength of the relationship.
80
84
e Determine the value of the Pearson
68
71
product-moment correlation coefficient
64
69
for these data and classify the strength of
50
66
the relationship.
66
58
P1: FXS/ABE
P2: FXS
9780521740494c23.xml
CUAU033-EVANS
11:42
593
PL
E
SA
M
3 A marketing firm wanted to investigate the relationship between airplay and CD sales (in the
following week) of newly released songs. Data was collected on a random sample of 10 songs.
a Which is the independent variable and which
No. of times the Weekly sales
is the dependent variable?
song was played
of the CD
b Construct a scatterplot of these data.
47
3950
c Describe the association between the number
34
2500
of times the song was played and weekly sales.
40
3700
d Determine the value of the q-correlation
34
2800
coefficient for these data, and classify the strength
33
2900
of the relationship.
50
3750
e Determine the value of the Pearson
28
2300
product-moment correlation coefficient for these
53
4400
data and classify the strength of the relationship.
25
2200
f Determine the equation for the least squares
46
3400
regression line and write it down in terms of the
variables Number of times the song was played and Weekly sales.
g Interpret the intercept and slope of the least squares regression line in terms of the
variables in the study.
h Use your regression equation to predict the weekly sales for a song which was played 60
times.
i How reliable is the prediction made in h?
Review
f Use your answer to d to comment on the statement: Good final exam marks are the
result of good assignment marks.
g Determine the equation for the least squares regression line and write it down in terms
of the variables Final exam mark and Assignment mark.
h Interpret the intercept and slope of the least squares regression line in terms of the
variables in the study.
i Use your regression equation to predict the Final exam mark for a student who scored 50
on the assignment.
j How reliable is the prediction made in i?