
QAM II: Week 2, Term 2

Dr Katy Hoad, 2016

Tutorial 1: Solutions
(Material from lectures 1 and 2 and revision from QAM I)
Q1: Independent random sampling from two Normally distributed populations gives the
following results:
$$n_x = 64, \quad n_y = 36$$
$$\bar{x} = 400, \quad \bar{y} = 360$$
$$s_x = 20, \quad s_y = 25$$

Find a 90% confidence interval estimate of the difference in the means of the two
populations. Clearly state all of your calculations.
(Question taken from Chapter 9, Statistics for Business and Economics, Sixth Edition, Paul Newbold, William L Carlson and Betty Thorne)

Answer:
We are told that we have independent samples and that both samples were taken from
Normal populations. As such, if we were to undertake a hypothesis test to compare the
means, we would use a two sample t test.
Hint: It can often help to think about the hypothesis test you would use to compare the
groups in order to decide on the correct form of the standard error to use in the
confidence interval.
Since the population variances are not known and are estimated from the data, we
could calculate a confidence interval based on the t distribution. However, since the
sample sizes are both large (> 30), we could also simply make use of the Normal
distribution. We will attempt both methods and compare the results:

A) Confidence interval based on t distribution


We can use the following standard error of the difference between the means to
calculate the confidence interval based on the t distribution:

$$SE(\bar{x} - \bar{y}) = s \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}$$

where s is the pooled standard deviation (pooling assumes that the two populations
share a common variance) and is calculated as follows:

$$s = \sqrt{\frac{(n_x - 1) s_x^2 + (n_y - 1) s_y^2}{n_x + n_y - 2}}$$

Putting the numbers into these formulas yields a standard error of 4.5661 (to four
decimal places); see the accompanying Excel spreadsheet for calculations.
In this case, we make use of the t distribution to obtain the 90% confidence interval as
follows:

$$(\bar{x} - \bar{y}) \pm t_{\nu, \alpha/2} \, SE(\bar{x} - \bar{y})$$

We have (64 − 1) + (36 − 1) = 98 degrees of freedom, but since t values are not reported in
the table for this exact number, we will use the t value corresponding to 120 degrees of
freedom, which is as close as we can get to 98 degrees of freedom without using a
computer. We therefore use $t_{120, 0.05} = 1.658$.
As such, the confidence interval is calculated as:
(400 − 360) ± (1.658 × 4.5661) = 32.43 to 47.57
(See Excel spreadsheet for in-depth calculations)
Note: If you look at the Excel spreadsheet, you will see that I have reported the exact t
value corresponding to 98 degrees of freedom which is, in fact, 1.661. As such, making
use of Excel yields the following 90% confidence interval based on the t distribution:
32.42 to 47.58 (practically the same as above).
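
The pooled calculation above can be checked with a few lines of Python. This is only a sketch using the summary statistics from the question; the variable names are my own, and the critical value 1.658 is simply the table value for 120 degrees of freedom quoted above rather than something the code computes.

```python
import math

# Summary statistics from Q1
n_x, n_y = 64, 36
mean_x, mean_y = 400.0, 360.0
s_x, s_y = 20.0, 25.0

# Pooled standard deviation (assumes a common population variance)
pooled_var = ((n_x - 1) * s_x**2 + (n_y - 1) * s_y**2) / (n_x + n_y - 2)
pooled_sd = math.sqrt(pooled_var)

# Standard error of the difference between the two sample means
se_pooled = pooled_sd * math.sqrt(1 / n_x + 1 / n_y)

# Critical value read from t tables (120 df, 5% in each tail)
T_CRIT_120 = 1.658

diff = mean_x - mean_y
margin = T_CRIT_120 * se_pooled
ci_t = (diff - margin, diff + margin)

print(f"SE = {se_pooled:.4f}")
print(f"90% CI: {ci_t[0]:.2f} to {ci_t[1]:.2f}")
```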

B) Confidence interval based on Normal distribution


Since the sizes of the two samples are fairly large ($n_x = 64$ and $n_y = 36$), we could also
calculate the confidence interval by making use of the Normal distribution. This works
on the assumption that the sample means have approximately Normal sampling
distributions, so the resulting confidence interval is also approximate; the approximation
improves as the sample sizes increase.

In this case, the standard error of the difference between the means is calculated as:

$$SE(\bar{x} - \bar{y}) = \sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}} = 4.8591$$

The z value for a 90% confidence interval is 1.645 so the 90% confidence interval
based on the Normal distribution is calculated as:
(400 − 360) ± (1.645 × 4.8591) = 32.01 to 47.99.

You can see that the confidence intervals obtained using the t distribution and the
Normal distribution are very similar which shows that the Normal confidence interval is a
good approximation in this case (since we have fairly large samples).
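
As a rough check of the Normal-based interval, the sketch below recomputes the unpooled standard error and obtains the z value from Python's standard library instead of tables. The variable names are illustrative only.

```python
import math
from statistics import NormalDist

# Summary statistics from Q1
n_x, n_y = 64, 36
mean_x, mean_y = 400.0, 360.0
s_x, s_y = 20.0, 25.0

# Unpooled standard error of the difference between the means
se_unpooled = math.sqrt(s_x**2 / n_x + s_y**2 / n_y)

# For a 90% interval we leave 5% in each tail, so we need the 95th percentile
z = NormalDist().inv_cdf(0.95)

diff = mean_x - mean_y
ci_z = (diff - z * se_unpooled, diff + z * se_unpooled)

print(f"SE = {se_unpooled:.4f}, z = {z:.3f}")
print(f"90% CI: {ci_z[0]:.2f} to {ci_z[1]:.2f}")
```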

Note: Don't panic about which method to use in the exam; I will ensure that
exam questions are written without any ambiguity!


Q2: A random sample of six salespersons that attended a motivational course on sales
techniques was monitored in the three months before and the three months after the
course. The table shows the values of sales (in thousands of dollars) generated by
these six salespersons in the two periods. Assume that the population distributions are
normal.
(Data taken from: Chapter 9, Statistics for Business and Economics, Sixth Edition, Paul Newbold, William L Carlson and Betty Thorne)

Salesperson   Before course   After course
1             212             237
2             282             291
3             203             191
4             327             341
5             165             192
6             198             180

a) Find a 95% confidence interval for the difference between the population means
b) By looking at the confidence interval you have calculated in part a) above, what
can you deduce about the mean sales before and after the motivational course?
Briefly justify your response.
c) Carry out a hypothesis test at the 5% level to formally assess whether there is
evidence of a difference in the mean sales before and after the course (note: this
is revision from QAM I)
Answer:
a) In this question, since sales were assessed before and after an intervention (i.e. the
motivational course), we have paired data. In other words, the two sets of observations
are not independent. To formally compare the two groups we must use a paired t test
and to calculate a 95% confidence interval for the difference between the population
means we can make use of the corresponding standard error as follows:
$$SE(\bar{d}) = \frac{SD(d)}{\sqrt{n}}$$
where d represents the differences between the two sets of observations.
All calculations are shown in the accompanying Excel spreadsheet, where it can be seen
that SE = 7.6627.

In this case we have n − 1 = 6 − 1 = 5 degrees of freedom and the corresponding t value
for a 95% confidence interval is $t_{5, 0.025} = 2.571$ (see tables of the t distribution). The
confidence interval is therefore calculated as:

$$\bar{d} \pm 2.571 \times SE(\bar{d})$$

where $\bar{d}$ is the mean difference between the before and after sales.
The confidence interval is therefore −7.5 ± (2.571 × 7.6627) = −27.2 to 12.2.
Note: in this case I subtracted the after values from the before values, but you could
have done the subtraction the other way round. This would simply have altered the
signs of the confidence interval to give the interval -12.2 to 27.2.
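
If you would like to reproduce the paired calculation outside Excel, something like the following should work. It is a sketch only: the variable names are mine, the differences are taken as before minus after (as in the solution), and 2.571 is the table value quoted above.

```python
import math
from statistics import mean, stdev

# Sales (thousands of dollars) for the six salespersons
before = [212, 282, 203, 327, 165, 198]
after = [237, 291, 191, 341, 192, 180]

# Paired differences: before minus after
d = [b - a for b, a in zip(before, after)]

mean_d = mean(d)
se_d = stdev(d) / math.sqrt(len(d))  # sample SD of differences / sqrt(n)

# Table value for 5 degrees of freedom, 2.5% in each tail
T_CRIT_5 = 2.571

ci_paired = (mean_d - T_CRIT_5 * se_d, mean_d + T_CRIT_5 * se_d)

print(f"mean difference = {mean_d}, SE = {se_d:.4f}")
print(f"95% CI: {ci_paired[0]:.1f} to {ci_paired[1]:.1f}")
```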
b) The confidence interval calculated in part a) has a lower bound which is negative
and an upper bound which is positive which means that the interval contains zero. Zero
is the null value (i.e. the value that indicates no difference between the groups). As
such, since we cannot rule out the possibility that the mean difference between the
groups is zero, we have no evidence of a difference in sales before and after the
motivational course.
c) See accompanying Excel spreadsheet for calculations. Look back at your QAM I
notes to revise the strategy.
Step 1: Null hypothesis: The mean difference between the two sets of sales is zero.
Step 2: Alternative hypothesis: The mean difference between the two sets of sales is not
zero.
Step 3: Assume that the null hypothesis is true.
Step 4: Calculate the test statistic for the paired t test as follows:
$$t = \frac{\bar{d}}{SE(\bar{d})} = \frac{-7.5}{7.6627} = -0.979$$

The obtained t value is negative but this just reflects how the differences were
calculated. Since we are undertaking a two sided test we can simply treat the t value as
positive. We are carrying out the test at the 5% level of significance on 6-1 = 5 degrees
of freedom. From tables of the t distribution, 0.727 < 0.979 < 1.476 and these two
values (i.e. 0.727 and 1.476) correspond to areas in one tail of the distribution of 0.25
and 0.1 respectively. However, since we are carrying out a two tailed test we must
double up these probabilities. The p value therefore lies between 0.2 and 0.5. In fact,
from the Excel spreadsheet we can see that the exact p value is 0.373.
Step 5: Since the obtained p value is much greater than 0.05, we have no evidence
from the data that there is a difference in sales before and after the motivational course.
This supports the conclusion we drew after calculating the confidence interval in part a).
The course doesn't appear to have been very motivational after all!
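
For those curious about where the exact p value comes from, here is one possible way to approximate it without a statistics package: integrate the density of the t distribution numerically. This is my own illustrative sketch (the helper functions `t_pdf` and `simpson` are not from the lectures), and it takes the mean difference and standard error from the solution above as given.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def simpson(f, a: float, b: float, n: int = 1000) -> float:
    """Composite Simpson's rule on [a, b] using n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

mean_d = -7.5   # mean of the paired differences
se_d = 7.6627   # standard error of the mean difference
df = 5          # n - 1 = 6 - 1

t_stat = mean_d / se_d

# Area between -|t| and +|t| (by symmetry), then the two tails are what is left over
central_area = 2 * simpson(lambda x: t_pdf(x, df), 0.0, abs(t_stat))
p_value = 1 - central_area

print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```

The result should agree with the Excel value to about three decimal places and falls inside the range 0.2 to 0.5 found from the tables.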


Q3: A random sample of 100 men contained 61 in favour of a state constitutional
amendment to retard the rate of growth of property taxes. An independent random
sample of 100 women contained 54 in favour of this amendment. The confidence
interval:

$$0.04 < P_x - P_y < 0.10$$

was calculated for the difference between the population proportions. What is the
confidence level for this interval? Show all steps of your calculations.

(Question taken from Chapter 9, Statistics for Business and Economics, Sixth Edition, Paul Newbold, William L Carlson and Betty Thorne)

Answer:
The estimated proportions of men and women in favour of the constitutional amendment
are as follows:
Men: px = 61/100 = 0.61
Women: py = 54/100 = 0.54
The point estimate of the difference between the population proportions is therefore
obtained by finding the difference between the estimates for men and women:
0.61 − 0.54 = 0.07
To calculate a confidence interval for the difference between the population proportions,
the standard error of the difference is calculated as:
$$SE(p_x - p_y) = \sqrt{SE(p_x)^2 + SE(p_y)^2}$$

where

$$SE(p_x) = \sqrt{\frac{p_x (1 - p_x)}{n_x}} \quad \text{and} \quad SE(p_y) = \sqrt{\frac{p_y (1 - p_y)}{n_y}}$$

Putting the relevant values into these formulas, we find that SE(diff) = 0.069735 as
shown in the accompanying Excel spreadsheet. Using the usual large sample formula
for obtaining a confidence interval, we know that:
$$0.07 - (z_{\alpha/2} \times 0.069735) = 0.04$$
(the stated lower bound of the confidence interval).

We also know that:

$$0.07 + (z_{\alpha/2} \times 0.069735) = 0.10$$
(the stated upper bound of the confidence interval).

Rearranging the formula for the lower bound, we find that:

$$z_{\alpha/2} = \frac{0.07 - 0.04}{0.069735} = 0.43$$
(to two decimal places). You can verify that you get the same result if you rearrange
the formula for the upper bound.
From tables of the Normal distribution, the z value 0.43 corresponds to an area
of 0.6664, which represents the area to the left of this positive z value. The area
in one tail of the distribution is therefore 1 − 0.6664 = 0.3336, and the area in two tails is
twice this value, or 0.6672. As such, the confidence level is 1 − 0.6672 = 0.3328 or
33.28%. This is a bizarre level of confidence to use and I doubt it would really be used
in practice!
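
The same back-calculation can be sketched in Python. Note that the code carries the unrounded z value through, so the confidence level it reports (roughly 33.3%) differs very slightly from the figure obtained with the table value z = 0.43. The variable names are illustrative.

```python
import math
from statistics import NormalDist

# Sample proportions and sample sizes from Q3
p_x, n_x = 61 / 100, 100
p_y, n_y = 54 / 100, 100

# Standard error of the difference between the two proportions
se_diff = math.sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)

# Half-width of the stated interval 0.04 < Px - Py < 0.10
half_width = (0.10 - 0.04) / 2

# Solve half_width = z * SE for z
z = half_width / se_diff

# Confidence level = area between -z and +z under the standard Normal curve
confidence_level = 2 * NormalDist().cdf(z) - 1

print(f"SE = {se_diff:.6f}, z = {z:.2f}, confidence level = {confidence_level:.1%}")
```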

SOME TIPS
Calculating confidence intervals
When obtaining a confidence interval the main thing is to ensure that you work with the
correct form of the standard error for the given problem. I realise that this is not an
insignificant detail, as it can be tricky to be sure that you have chosen the correct
formula. However, if you take a methodical approach to the problem and think carefully
about what you have been asked to do, you will get the correct answer.
When you begin, clearly write down any assumptions that are being made. Are you
comparing means or proportions? Do you have small or large samples? If you are
comparing means, do you know the population standard deviations or are you
estimating these from the data? You can then decide whether you need to work with
the Normal distribution (i.e. use a z-value) or the t distribution (i.e. use a t value). You
can also find the correct form of the standard error by comparing your assumptions to
those stated in the lecture notes. Once you have this information, apply the usual
confidence interval formula as shown at the beginning of Lecture 1.

Don't forget: once you have obtained a confidence interval, make sure
that you interpret it in simple terms for a lay reader. You always need to
relate it back to the original question.

Using statistical tables


In Question 1, when using the t distribution to calculate the confidence interval you have
98 degrees of freedom which, as you will have found out, does not appear in the tables.
This is a common problem when working with statistical tables. Firstly, it emphasises
the fact that it's often easier to work with the Normal distribution when you have large
samples, knowing that the results you get will be more-or-less the same.
However, if you must work with tables in this situation then you have two options. Either
pick the closest number of degrees of freedom to the one you are trying to work with
and calculate a single confidence interval based on this value (as I did in the solution
where I used 120 degrees of freedom). Or you can calculate two confidence intervals
using a number of degrees of freedom less than the one you need and a number of
degrees of freedom greater than the one you need. This way you get a sense of what
the exact values of the lower and upper limits of the confidence interval are likely to be.
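
To illustrate the second option for Question 1, the sketch below brackets 98 degrees of freedom using 60 and 120. The value 1.658 for 120 degrees of freedom is the one used in the solution; the value 1.671 for 60 degrees of freedom is the standard table entry and is my addition, so check it against your own tables.

```python
# Point estimate and pooled standard error from Question 1
DIFF = 400 - 360
SE_POOLED = 4.5661

# Table t values with 5% in each tail (90% interval).
# 1.658 is quoted in the solution; 1.671 is assumed from standard t tables.
T_CRIT = {60: 1.671, 120: 1.658}

def confidence_interval(diff: float, se: float, t_crit: float) -> tuple:
    """Return (lower, upper) for diff +/- t_crit * se."""
    margin = t_crit * se
    return (diff - margin, diff + margin)

ci_60 = confidence_interval(DIFF, SE_POOLED, T_CRIT[60])
ci_120 = confidence_interval(DIFF, SE_POOLED, T_CRIT[120])

for df, ci in ((60, ci_60), (120, ci_120)):
    print(f"{df} df: {ci[0]:.2f} to {ci[1]:.2f}")
```

The exact interval for 98 degrees of freedom (32.42 to 47.58) sits between the two, as you would expect.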
Of course, some of you will question why you would bother using tables when computer
software will give you exact values. I guess the main answer is a response based on
learning. It is fine to make use of software to undertake statistics and, in fact, if you go
on to work in this field it will be a necessity. However, you should always understand
the principles of the techniques you are using. No matter how good the software
programme is, it won't decide whether your analysis is any good: rubbish in, rubbish
out. It is therefore a valuable learning experience to calculate things like confidence
intervals by hand, from first principles, to test your understanding. I should also say that
there are many occasions when I carry out calculations by hand in my job to ensure that
I understand exactly what the software output is telling me.

Q4: Fitting a linear regression model


The following dataset shows the height (cm), weight (kg) and gender (male = 1; female
= 0) of 19 children aged 11 to 16. The data are taken from the sample library for the
statistical software SAS (http://www.sas.com). The dataset can be found in the Excel file
for this tutorial session on my.wbs.
Name      Height   Gender   Weight
Alice     56.5     0        84.0
Becka     65.3     0        98.0
Gail      64.3     0        90.0
Karen     56.3     0        77.0
Kathy     59.8     0        84.5
Mary      66.5     0        112.0
Sandy     51.3     0        50.5
Sharon    62.5     0        112.5
Tammy     62.8     0        102.5
Alfred    69.0     1        112.5
Duke      63.5     1        102.5
Guido     67.0     1        133.0
James     57.3     1        83.0
Jeffrey   62.5     1        84.0
John      59.0     1        99.5
Philip    72.0     1        150.0
Robert    64.8     1        128.0
Thomas    57.5     1        85.0
William   66.5     1        112.0


In this question, we are interested in exploring how height and gender are related to
weight. As such, we will treat weight as the response variable (dependent variable)
and height and gender as explanatory variables (independent variables).
a) Use appropriate plots to investigate the relationship between weight and each of
the other two variables in turn. Comment briefly on each of the plots.

Answer:
I have used scatter plots to investigate the relationships between height and weight and
gender and weight respectively. In the case of gender, a box plot would be better but
it's a pain to construct a decent box plot in Excel, so the scatter plot will suffice. In
practice, I would use a statistical software package for this kind of data exploration
where box plots are readily available (e.g. SPSS, Stata, SAS, R, Minitab).


Figure 1: Height (cm) against weight (kg) for 11 to 16 year old children (n = 19)

Figure 2: Gender against weight (kg) for 11 to 16 year old children (n = 19)
Figure 1 shows that there appears to be a linear association between height and weight
(i.e. as height increases weight increases in a linear fashion). The variability in the data
points looks roughly constant across the height values and there are no obvious outliers
that may cause problems in the analysis.
In Figure 2 we can see that, on average, the boys' weights tend to be higher than the
girls' weights. The variability (spread) of the weight values looks more-or-less the same
in both groups (just look at the heights of the two sets of points on the scatter plot)
and, again, there are no obvious outliers.
The plots suggest that there are some interesting potential relationships in the data to
investigate further.

b) Using Excel, or a statistical software package of your choice, fit a linear
regression model to investigate the influence of height and gender on weight.
c) Do both height and gender appear to be predictive of weight? Justify your
answer by making reference to relevant output from your analysis in part b).

Answer:
The complete regression output can be seen in the accompanying Excel file. The
estimated coefficients with corresponding p values and confidence intervals are as
follows:

By looking at the regression output above, it is clear that the p-value corresponding to
gender is not significant (p = 0.237). Remember that the associated hypothesis test
assesses whether the population coefficient corresponding to gender is equal to zero.
There is no evidence to suggest otherwise here, so we conclude that the gender
coefficient is not significantly different to zero and that the gender variable is therefore
not adding anything to the model in terms of explaining the variability in the response
variable. We will therefore remove this variable from the model and re-run the
regression analysis.

d) Based on your answer to part c), decide whether to retain both height and gender
in the model and, if necessary, re-run the regression analysis to obtain a final
fitted model that you are happy with. Carefully interpret your fitted model.
Answer:

After making the decision to re-run the regression without the gender variable, we are
now fitting a simple linear regression model to look at the relationship between height
and weight. The complete output from this model can be found in the accompanying
Excel file. The coefficient of determination, or $R^2$ value, for this model is 0.77, or 77%,
which means that height explains 77% of the variability in weight. The estimated slope
parameter for the model, i.e. the coefficient corresponding to height, is 3.90 as shown in
the following output:

The p value corresponding to height is highly significant (p < 0.001) which means that
height is explaining some of the variability in weight. The fitted model is as follows:
Predicted weight = −143.03 + 3.90 × Height
This tells us that a one cm increase in height is associated with an average increase in weight of
3.9 kg. The fitted model is simply a straight line. Since gender dropped out of the
model, from these data we have no evidence that weight is significantly different for
boys and girls after taking height into account.
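
If you want to see where the slope, intercept and coefficient of determination come from, the simple (height only) model can be fitted from first principles. The sketch below uses the usual least squares formulas with the height and weight values from the dataset listed in the question; the variable names are my own and this is not a substitute for the full Excel output.

```python
# Height and weight for the 19 children, in the order listed in the question
heights = [56.5, 65.3, 64.3, 56.3, 59.8, 66.5, 51.3, 62.5, 62.8, 69.0,
           63.5, 67.0, 57.3, 62.5, 59.0, 72.0, 64.8, 57.5, 66.5]
weights = [84.0, 98.0, 90.0, 77.0, 84.5, 112.0, 50.5, 112.5, 102.5, 112.5,
           102.5, 133.0, 83.0, 84.0, 99.5, 150.0, 128.0, 85.0, 112.0]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Sums of squares and cross-products about the means
s_hh = sum((h - mean_h) ** 2 for h in heights)
s_ww = sum((w - mean_w) ** 2 for w in weights)
s_hw = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))

# Least squares estimates for weight = intercept + slope * height
slope = s_hw / s_hh
intercept = mean_w - slope * mean_h

# For simple linear regression, R squared is the squared correlation
r_squared = s_hw ** 2 / (s_hh * s_ww)

print(f"Predicted weight = {intercept:.2f} + {slope:.2f} * Height")
print(f"R squared = {r_squared:.2f}")
```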

Q5: In a multiple linear regression model where the least squares estimates were
based on 27 sets of sample observations, the total sum of squares (SST) and the
regression sum of squares (SSR) were found to be:
SST = 3.881 and SSR = 3.549
(Note that these were respectively referred to as SStotal and SSreg in Lecture 2.)

a) Find and interpret the coefficient of determination


Answer:
The coefficient of determination (more commonly known as $R^2$) is the proportion of the
total variability (SST) explained by the regression model. It is therefore calculated as
SSR/SST = 3.549/3.881 = 0.9145 or 91.45%. This means that the fitted regression
model is explaining roughly 91% of the variability in the response variable, which is very
high indeed.

b) Find the error sum of squares


Answer:
The error sum of squares (which is more commonly referred to as the residual sum of
squares after fitting a model), represents the amount of unexplained variability in the
response variable after fitting the regression model. Recall from Lecture 2 that:
SSTotal = SSreg + SSerror
As such, the error sum of squares is simply obtained by subtracting SSR from SST:
SST − SSR = 3.881 − 3.549 = 0.332

**Extra (more difficult) questions to make you think a bit:


NOTE: these questions were included to stretch you. If you didn't know how to
attempt them at this stage, don't worry. I just wanted you to have a go and start
thinking about the issues. Read the solution below and see if you can grasp the
concepts.
c) The fitted regression model included three explanatory variables and therefore
took the following form:
$$\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$$

By making use of all information given in this question, complete the missing
information in the following Analysis of Variance (ANOVA) table:
Answer:
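The completed table is as follows:
ANOVA
              df    SS       MS                  F                     Significance F
Regression    3     3.549    3.549/3 = 1.183     1.183/0.014 = 84.5    < 0.001
Residual      23    0.332    0.332/23 = 0.014
Total         26    3.881

(The F value of 84.5 uses the rounded mean square 0.014. Carrying the unrounded value
0.332/23 = 0.01443 through gives an F value of about 82.0; the conclusion is unchanged.)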

d) By looking at the p value reported in this table (i.e. Significance F), what do you
conclude about the fitted regression model?
Answer:
The F test reported in the ANOVA table tests whether the regression model as a
whole (i.e. including all predictors) is explaining some of the variability in the
response variable. More formally, the null hypothesis for the test is that all
regression coefficients are equal to zero and the alternative hypothesis is that at
least one of the coefficients is not equal to zero. In other words, it tests to see
whether at least one of the explanatory variables is predictive of the outcome. In
this case the p value for the test is highly significant which tells us that at least
one of the predictor variables is usefully explaining some of the variability in the
response variable.
(Question taken from Chapter 13, Statistics for Business and Economics, Sixth Edition, Paul Newbold, William L Carlson and Betty Thorne)
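
The whole of Q5 can be reproduced from the three pieces of information given (n, SST and SSR). The following is a sketch with variable names of my own choosing. Because it does not round the residual mean square before dividing, the F statistic it prints is a little lower than the hand calculation based on 0.014.

```python
# Information given in Q5
n = 27          # number of sets of observations
k = 3           # number of explanatory variables
sst = 3.881     # total sum of squares
ssr = 3.549     # regression sum of squares

# a) Coefficient of determination
r_squared = ssr / sst

# b) Error (residual) sum of squares
sse = sst - ssr

# c) ANOVA table entries
df_regression = k
df_residual = n - k - 1
df_total = n - 1

ms_regression = ssr / df_regression
ms_residual = sse / df_residual

# F statistic using unrounded mean squares
f_stat = ms_regression / ms_residual

print(f"R squared = {r_squared:.4f}, SSE = {sse:.3f}")
print(f"df: {df_regression}, {df_residual}, {df_total}")
print(f"MS regression = {ms_regression:.3f}, MS residual = {ms_residual:.4f}")
print(f"F = {f_stat:.1f}")
```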

Q6: Interpreting a linear regression model


Gaming World is a chain of stores selling video gaming merchandise. The company
undertook a study to investigate the relationship between sales in the last year (000s)
and the number of households in each store's catchment area (000s). The stores are
also located in one of two location types: suburban (coded 0 in the dataset) and town
centre/shopping mall (coded 1 in the dataset), and this was taken into account in the
analysis.
Data were collected on 22 stores around the country and a linear regression model was
fitted to the data to investigate the relationship described above. Selected Excel output
from the fitted model is as follows:

ANOVA
              df    SS             MS             F              Significance F
Regression    2     45885.72175    22942.86087    92.59072267    1.5959E-10
Residual      19    4707.970129    247.7879016
Total         21    50593.69188

              Coefficients   Standard Error   t Stat         P-value        Lower 95%       Upper 95%
Intercept     23.43013017    13.26222973      1.766681067    0.093342736    -4.328035605    51.18829594
Households    0.778335508    0.094568835      8.230359471    1.09657E-07    0.580400662     0.976270354
Location      26.45878652    8.801292911      3.006238605    0.007260713    8.037469        44.880104


a) Fill in the missing sum of squares due to the regression in the ANOVA table and
compute the coefficient of determination for this model. Briefly discuss the result.
Answer:
Recall that SSreg + SSerror = SStotal. As such, the missing sum of squares is
calculated as 50593.69188 − 4707.970129 = 45885.72175.
We can obtain the coefficient of determination, or $R^2$, by dividing this value by the
total sum of squares as follows: $R^2$ = 45885.72175/50593.69188 = 0.91 or 91%.
This is a high $R^2$ value and tells us that 91% of the variability in sales is explained
by the two variables households and location.

b) Calculate a 95% confidence interval for the population coefficient corresponding
to the location variable to fill in the missing information in the table above.
Comment on the width of the confidence interval and the implications of this.
Note, in this example to calculate the confidence interval the relevant number of
degrees of freedom is n − K − 1 = 22 − 2 − 1 = 19.
(Note: I simply gave you the number of degrees of freedom here, but in
general K is the number of estimated parameters, not including the
intercept. It is therefore 2 in this case since we estimated parameters (or
coefficients) for households and location. Once you know the degrees of
freedom, then this test follows the same procedure that you were taught in
QAM I).
Answer:
To calculate the required confidence interval, we must use the following formula:
$$b \pm t_{n-K-1, 0.025} \times SE(b)$$
The relevant t-value is $t_{19, 0.025} = 2.093$, obtained from tables of the t distribution. The
estimate of the coefficient (labelled b above) is given in the regression output, as is
the standard error of the estimate. As such, we have:
26.45878652 ± (2.093 × 8.801292911) = 8.04 to 44.88. This is a very wide
confidence interval which means that there is a lot of uncertainty associated with the
parameter estimate and, therefore, the true extent of the effect of location on sales.
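
Parts a) and b) can be checked together in a few lines. This is a sketch using the figures from the Excel output in the question; the variable names are mine and 2.093 is the table value for 19 degrees of freedom quoted above.

```python
# Sums of squares from the ANOVA table
ss_total = 50593.69188
ss_residual = 4707.970129

# a) Missing regression sum of squares and coefficient of determination
ss_regression = ss_total - ss_residual
r_squared = ss_regression / ss_total

# b) 95% confidence interval for the location coefficient
b_location = 26.45878652
se_location = 8.801292911
T_CRIT_19 = 2.093  # table value: 19 df, 2.5% in each tail

margin = T_CRIT_19 * se_location
ci_location = (b_location - margin, b_location + margin)

print(f"SS regression = {ss_regression:.5f}, R squared = {r_squared:.2f}")
print(f"95% CI for location: {ci_location[0]:.2f} to {ci_location[1]:.2f}")
```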

c) Write down the form of the fitted regression model and clearly interpret it in
simple language.
Answer:
The fitted model is: Predicted sales = 23.43 + 0.78 × Households + 26.46 × Location
Firstly, an increase in households of one unit, or 1000 households, within a
catchment area, leads to an average increase in sales of 0.78, or 780, assuming
that location remains fixed.
Since the location variable is binary, a one unit change translates as a switch from
the reference location (suburban = 0) to the other location (town centre = 1). As
such, changing from a suburban to a town centre location increases sales, on
average, by 26.46 or 26,460 assuming that the number of households in the
catchment area remains fixed. As shown in Lecture 2, we could write two separate
fitted models, one for each location, which would represent two parallel lines.
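
To make the idea of two parallel lines concrete, the fitted model can be written as a small function. This is only a sketch using the rounded coefficients from the fitted model; the function name and the catchment size of 10 (thousand households) are hypothetical values chosen for illustration.

```python
def predicted_sales(households: float, location: int) -> float:
    """Predicted sales (000s) from the fitted model.

    households: number of households in the catchment area (000s)
    location:   0 = suburban, 1 = town centre/shopping mall
    """
    return 23.43 + 0.78 * households + 26.46 * location

# Intercepts of the two parallel lines (households = 0)
intercept_suburban = predicted_sales(0, 0)      # 23.43
intercept_town_centre = predicted_sales(0, 1)   # 23.43 + 26.46 = 49.89

# Hypothetical catchment area of 10 thousand households
example_suburban = predicted_sales(10, 0)
example_town_centre = predicted_sales(10, 1)

print(f"Suburban line:    sales = {intercept_suburban:.2f} + 0.78 * Households")
print(f"Town centre line: sales = {intercept_town_centre:.2f} + 0.78 * Households")
print(f"Gap at 10 (000s) households: {example_town_centre - example_suburban:.2f}")
```

Both lines share the slope 0.78, so the gap between them is 26.46 whatever the number of households.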

SOME TIPS

Multiple linear regression


If you found you struggled with the regression questions, don't worry: I will be spending
the next entire session talking about multiple linear regression so some of the ideas will
become clearer then.
Having said that, you also need to carefully review your notes on simple linear
regression from the QAM I module as well as the corresponding QAM I tutorial
exercises for this topic. This material gives you a basic grounding in regression
analysis and you need to make sure that you fully understand this foundation
information before you can hope to understand multiple linear regression.
One thing that students often ask about is whether you are expected to perform
calculations to fit a multiple regression model and estimate the model coefficients. The
short answer is no. I will not expect you to do this as it is too involved and, in practice,
no sane person would fit a multiple regression model by hand. Unless, of course, you
take a degree in statistics in which case you will probably have to do this kind of thing!
However, I do expect all of you to spend time ensuring that you understand how to fit a
model using software such as Excel. I also expect you to be able to interpret the output
from such software, and this is the main focus of the course. Whenever I introduce an
example in lectures, you should go through the example again at home and try to
replicate the outputs I have presented.
