Tutorial 1: Solutions
(Material from lectures 1 and 2 and revision from QAM I)
Q1: Independent random sampling from two Normally distributed populations gives the
following results:
$n_x = 64$, $n_y = 36$
$\bar{x} = 400$, $\bar{y} = 360$
$s_x = 20$, $s_y = 25$
Find a 90% confidence interval estimate of the difference in the means of the two
populations. Clearly state all of your calculations.
(Question taken from Chapter 9, Statistics for Business and Economics, Sixth Edition, Paul Newbold, William L Carlson and Betty Thorne)
Answer:
We are told that we have independent samples and that both samples were taken from
Normal populations. As such, if we were to undertake a hypothesis test to compare the
means, we would use a two sample t test.
Hint: It can often help to think about the hypothesis test you would use to compare the
groups in order to decide on the correct form of the standard error to use in the
confidence interval.
Since the population variances are not known and are estimated from the data, we
could calculate a confidence interval based on the t distribution. However, since the
sample sizes are both large (> 30), we could also simply make use of the Normal
distribution. We will attempt both methods and compare the results:
$$SE(\bar{x} - \bar{y}) = s\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}, \quad \text{where } s = \sqrt{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}}$$
Putting the numbers into these formulas yields a standard error of 4.5661 (to four
decimal places); see the accompanying Excel spreadsheet for calculations.
Using this standard error, the t distribution gives the 90% confidence interval as
follows:
$$(\bar{x} - \bar{y}) \pm t_{\nu,\,\alpha/2} \times SE(\bar{x} - \bar{y})$$
We have (64-1)+(36-1) = 98 degrees of freedom, but since t values are not reported in
the table for this exact number, we will use the t value corresponding to 120 degrees of
freedom, which is as close as we can get to 98 degrees of freedom without using a
computer. We therefore use $t_{120,\,0.05} = 1.658$.
As such, the confidence interval is calculated as:
$(400 - 360) \pm (1.658 \times 4.5661)$ = 32.43 to 47.57
(See Excel spreadsheet for in-depth calculations)
Note: If you look at the Excel spreadsheet, you will see that I have reported the exact t
value corresponding to 98 degrees of freedom which is, in fact, 1.661. As such, making
use of Excel yields the following 90% confidence interval based on the t distribution:
32.42 to 47.58 (practically the same as above).
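The pooled-variance calculation can also be reproduced outside Excel. The following is a minimal Python sketch, not the author's spreadsheet: the function name `pooled_se` is an illustrative assumption, and the t value 1.658 is the tabulated value quoted above for 120 degrees of freedom.

```python
import math

def pooled_se(n_x, s_x, n_y, s_y):
    """Standard error of (x-bar - y-bar) using a pooled variance estimate."""
    pooled_var = ((n_x - 1) * s_x**2 + (n_y - 1) * s_y**2) / (n_x + n_y - 2)
    return math.sqrt(pooled_var) * math.sqrt(1 / n_x + 1 / n_y)

# Summary statistics given in Q1
se_pooled = pooled_se(n_x=64, s_x=20, n_y=36, s_y=25)

t_crit = 1.658          # tabulated t value (120 df, 5% in each tail)
diff = 400 - 360        # difference between the sample means
ci_t = (diff - t_crit * se_pooled, diff + t_crit * se_pooled)

print(round(se_pooled, 4), [round(bound, 2) for bound in ci_t])
```

Swapping `t_crit` for the exact value 1.661 should reproduce the Excel-based interval.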
For the Normal approximation, the standard error of the difference between the means is calculated as:
$$SE(\bar{x} - \bar{y}) = \sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}} = 4.8591$$
The z value for a 90% confidence interval is 1.645 so the 90% confidence interval
based on the Normal distribution is calculated as:
$(400 - 360) \pm (1.645 \times 4.8591)$ = 32.01 to 47.99.
You can see that the confidence intervals obtained using the t distribution and the
Normal distribution are very similar which shows that the Normal confidence interval is a
good approximation in this case (since we have fairly large samples).
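As a rough check of the Normal-based interval, here is a short Python sketch. It assumes the unpooled standard error described above and takes the z value from the standard library's `statistics.NormalDist` rather than from tables; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n_x, n_y = 64, 36
s_x, s_y = 20, 25
diff = 400 - 360

# Unpooled standard error of the difference between the means
se_normal = math.sqrt(s_x**2 / n_x + s_y**2 / n_y)

# For a 90% interval we need the 95th percentile of the standard Normal
z_crit = NormalDist().inv_cdf(0.95)

ci_z = (diff - z_crit * se_normal, diff + z_crit * se_normal)
print(round(se_normal, 4), round(z_crit, 3), [round(b, 2) for b in ci_z])
```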
Note: Don't panic about which method to use in the exam; I will ensure that
exam questions are written without any ambiguity!
Q2: A random sample of six salespersons that attended a motivational course on sales
techniques was monitored in the three months before and the three months after the
course. The table shows the values of sales (in thousands of dollars) generated by
these six salespersons in the two periods. Assume that the population distributions are
normal.
(Data taken from: Chapter 9, Statistics for Business and Economics, Sixth Edition, Paul Newbold, William L Carlson and Betty Thorne)
| Salesperson | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Sales | 237 | 291 | 191 | 341 | 192 | 180 |
a) Find a 95% confidence interval for the difference between the population means
b) By looking at the confidence interval you have calculated in part a) above, what
can you deduce about the mean sales before and after the motivational course?
Briefly justify your response.
c) Carry out a hypothesis test at the 5% level to formally assess whether there is
evidence of a difference in the mean sales before and after the course (note: this
is revision from QAM I)
Answer:
a) In this question, since sales were assessed before and after an intervention (i.e. the
motivational course), we have paired data. In other words, the two sets of observations
are not independent. To formally compare the two groups we must use a paired t test
and to calculate a 95% confidence interval for the difference between the population
means we can make use of the corresponding standard error as follows:
$$SE = \frac{SD(d)}{\sqrt{n}}$$
where $d$ represents the differences between the two sets of observations.
All calculations are shown in the accompanying Excel spreadsheet where it can be seen
that SE = 7.6627.
The 95% confidence interval is then calculated as
$$\bar{d} \pm t_{n-1,\,0.025} \times SE$$
where $\bar{d}$ is the mean difference between the before and after sales.
The confidence interval is therefore $-7.5 \pm (2.571 \times 7.6627)$ = -27.2 to 12.2.
Note: in this case I subtracted the after values from the before values, but you could
have done the subtraction the other way round. This would simply have altered the
signs of the confidence interval to give the interval -12.2 to 27.2.
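The effect of the direction of subtraction can be seen in a small Python sketch. It works only from the summary figures quoted in this answer (mean difference, standard error and the tabulated t value for 5 degrees of freedom); the helper name `paired_ci` is an illustrative assumption.

```python
def paired_ci(mean_diff, se, t_crit):
    """Confidence interval for a mean paired difference."""
    margin = t_crit * se
    return mean_diff - margin, mean_diff + margin

# Before minus after, as in the answer above
lower, upper = paired_ci(mean_diff=-7.5, se=7.6627, t_crit=2.571)

# After minus before: only the sign of the mean difference changes
lower_rev, upper_rev = paired_ci(mean_diff=7.5, se=7.6627, t_crit=2.571)

print(round(lower, 1), round(upper, 1))
print(round(lower_rev, 1), round(upper_rev, 1))
```

The two intervals are mirror images of each other around zero, so the conclusion does not depend on which way round the differences are taken.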
b) The confidence interval calculated in part a) has a lower bound which is negative
and an upper bound which is positive which means that the interval contains zero. Zero
is the null value (i.e. the value that indicates no difference between the groups). As
such, since we cannot rule out the possibility that the mean difference between the
groups is zero, we have no evidence of a difference in sales before and after the
motivational course.
c) See accompanying Excel spreadsheet for calculations. Look back at your QAM I
notes to revise the strategy.
Step 1: Null hypothesis: The mean difference between the two sets of sales is zero.
Step 2: Alternative hypothesis: The mean difference between the two sets of sales is not
zero.
Step 3: Assume that the null hypothesis is true.
Step 4: Calculate the test statistic for the paired t test as follows:
$$t = \frac{\bar{d}}{SE(\bar{d})} = \frac{-7.5}{7.6627} = -0.979$$
The obtained t value is negative but this just reflects how the differences were
calculated. Since we are undertaking a two sided test we can simply treat the t value as
positive. We are carrying out the test at the 5% level of significance on 6-1 = 5 degrees
of freedom. From tables of the t distribution, 0.727 < 0.979 < 1.476 and these two
values (i.e. 0.727 and 1.476) correspond to areas in one tail of the distribution of 0.25
and 0.1 respectively. However, since we are carrying out a two tailed test we must
double up these probabilities. The p value therefore lies between 0.2 and 0.5. In fact,
from the Excel spreadsheet we can see that the exact p value is 0.373.
Step 5: Since the obtained p value is much greater than 0.05, we have no evidence
from the data that there is a difference in sales before and after the motivational course.
This supports the conclusion we drew after calculating the confidence interval in part a).
The course doesn't appear to have been very motivational after all!
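For readers without Excel, the test statistic and p value in part c) can be approximated with the Python standard library alone. This is a sketch under stated assumptions: the two-sided p value is obtained by numerically integrating the t density with Simpson's rule (a substitute for a statistics package, not the method used in the spreadsheet), and the function names are illustrative.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p value: 1 minus the area between -|t| and +|t|."""
    b = abs(t_stat)
    h = b / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_b = total * h / 3
    return 1 - 2 * area_zero_to_b

t_stat = -7.5 / 7.6627                  # mean difference / standard error
p_value = two_sided_p(t_stat, df=5)
print(round(t_stat, 3), round(p_value, 3))
```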
What is the
(Question taken from Chapter 9, Statistics for Business and Economics, Sixth Edition, Paul Newbold, William L Carlson and Betty Thorne)
Answer:
The estimated proportions of men and women in favour of the constitutional amendment
are as follows:
Men: px = 61/100 = 0.61
Women: py = 54/100 = 0.54
The point estimate of the difference between the population proportions is therefore
obtained by finding the difference between the estimates for men and women:
0.61 - 0.54 = 0.07
To calculate a confidence interval for the difference between the population proportions,
the standard error of the difference is calculated as:
$$SE(p_x - p_y) = \sqrt{SE(p_x)^2 + SE(p_y)^2}$$
where
$$SE(p_x) = \sqrt{\frac{p_x(1 - p_x)}{n_x}} \quad \text{and} \quad SE(p_y) = \sqrt{\frac{p_y(1 - p_y)}{n_y}}$$
Putting the relevant values into these formulas, we find that SE(diff) = 0.069735 as
shown in the accompanying Excel spreadsheet. Using the usual large sample formula
for obtaining a confidence interval, we know that:
$0.07 - (z_{\alpha/2} \times 0.069735) = 0.04$ (the stated lower bound of the confidence interval).
Rearranging gives $z_{\alpha/2} = (0.07 - 0.04)/0.069735 = 0.43$ (to two decimal places). You can verify that you
get the same result if you rearrange the formula for the upper bound.
From tables of the Normal distribution, the z value 0.43 corresponds to an area
of 0.6664, which is the area to the left of this positive z value. The area
in one tail of the distribution is therefore 1 - 0.6664 = 0.3336, and the area in two tails is
twice this value, or 0.6672. As such, the confidence level is 1 - 0.6672 = 0.3328 or
33.28%. This is a bizarre level of confidence to use and I doubt it would really be used
in practice!
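The back-calculation of the confidence level can be sketched in Python as follows. The sample proportions and the lower bound of 0.04 are taken from the answer above; using `statistics.NormalDist` in place of Normal tables is an assumption of this sketch, so the result differs from the hand calculation in the third or fourth decimal place because z is not rounded to 0.43.

```python
import math
from statistics import NormalDist

p_x, n_x = 61 / 100, 100    # men in favour
p_y, n_y = 54 / 100, 100    # women in favour
lower_bound = 0.04          # stated lower bound of the interval

# Standard error of the difference between two sample proportions
se_diff = math.sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)

# Solve (p_x - p_y) - z * se_diff = lower_bound for z
z = ((p_x - p_y) - lower_bound) / se_diff

# Confidence level = area between -z and +z under the standard Normal curve
confidence_level = 2 * NormalDist().cdf(z) - 1
print(round(se_diff, 6), round(z, 2), round(confidence_level, 3))
```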
SOME TIPS
Calculating confidence intervals
When obtaining a confidence interval the main thing is to ensure that you work with the
correct form of the standard error for the given problem. I realise that this is not an
insignificant detail, as it can be tricky to be sure that you have chosen the correct
formula. However, if you take a methodical approach to the problem and think carefully
about what you have been asked to do, you will get the correct answer.
When you begin, clearly write down any assumptions that are being made. Are you
comparing means or proportions? Do you have small or large samples? If you are
comparing means, do you know the population standard deviations or are you
estimating these from the data? You can then decide whether you need to work with
the Normal distribution (i.e. use a z-value) or the t distribution (i.e. use a t value). You
can also find the correct form of the standard error by comparing your assumptions to
those stated in the lecture notes. Once you have this information, apply the usual
confidence interval formula as shown at the beginning of Lecture 1.
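| Name | Height | Gender | Weight |
|---|---|---|---|
| Alice | 56.5 | | 84.0 |
| Becka | 65.3 | | 98.0 |
| Gail | 64.3 | | 90.0 |
| Karen | 56.3 | | 77.0 |
| Kathy | 59.8 | | 84.5 |
| Mary | 66.5 | | 112.0 |
| Sandy | 51.3 | | 50.5 |
| Sharon | 62.5 | | 112.5 |
| Tammy | 62.8 | | 102.5 |
| Alfred | 69.0 | | 112.5 |
| Duke | 63.5 | | 102.5 |
| Guido | 67.0 | | 133.0 |
| James | 57.3 | | 83.0 |
| Jeffrey | 62.5 | | 84.0 |
| John | 59.0 | | 99.5 |
| Philip | 72.0 | | 150.0 |
| Robert | 64.8 | | 128.0 |
| Thomas | 57.5 | | 85.0 |
| William | 66.5 | | 112.0 |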
In this question, we are interested in exploring how height and gender are related to
weight. As such, we will treat weight as the response variable (dependent variable)
and height and gender as explanatory variables (independent variables).
a) Use appropriate plots to investigate the relationship between weight and each of
the other two variables in turn. Comment briefly on each of the plots.
Answer:
I have used scatter plots to investigate the relationships between height and weight and
gender and weight respectively. In the case of gender, a box plot would be better but
it's a pain to construct a decent box plot in Excel so the scatter plot will suffice. In
practice, I would use a statistical software package for this kind of data exploration
where box plots are readily available (e.g. SPSS, Stata, SAS, R, Minitab).
Figure 1: Height (cm) against weight (kg) for 11 to 16 year old children (n = 19)
Figure 2: Gender against weight (kg) for 11 to 16 year old children (n = 19)
Figure 1 shows that there appears to be a linear association between height and weight
(i.e. as height increases weight increases in a linear fashion). The variability in the data
points looks roughly constant across the height values and there are no obvious outliers
that may cause problems in the analysis.
In Figure 2 we can see that, on average, the boys' weights tend to be heavier than the
girls' weights. The variability (spread) of the weight values looks more-or-less the same
in both groups (just look at the heights of the two sets of points on the scatter plot)
and again, there are no obvious outliers.
The plots suggest that there are some interesting potential relationships in the data to
investigate further.
Answer:
The complete regression output can be seen in the accompanying Excel file. The
estimated coefficients with corresponding p values and confidence intervals are as
follows:
By looking at the regression output above, it is clear that the p-value corresponding to
gender is not significant (p = 0.237). Remember that the associated hypothesis test
assesses whether the population coefficient corresponding to gender is equal to zero.
There is no evidence to suggest otherwise here, so we conclude that the gender
coefficient is not significantly different to zero and that the gender variable is therefore
not adding anything to the model in terms of explaining the variability in the response
variable. We will therefore remove this variable from the model and re-run the
regression analysis.
d) Based on your answer to part c), decide whether to retain both height and gender
in the model and, if necessary, re-run the regression analysis to obtain a final
fitted model that you are happy with. Carefully interpret your fitted model.
Answer:
After making the decision to re-run the regression without the gender variable, we are
now fitting a simple linear regression model to look at the relationship between height
and weight. The complete output from this model can be found in the accompanying
Excel file. The coefficient of determination, or $R^2$ value, for this model is 0.77, or 77%,
which means that height explains 77% of the variability in weight. The estimated slope
parameter for the model, i.e. the coefficient corresponding to height, is 3.90 as shown in
the following output:
The p value corresponding to height is highly significant (p < 0.001) which means that
height is explaining some of the variability in weight. The fitted model is as follows:
Predicted weight = -143.03 + 3.90 × Height
This tells us that a one cm increase in height is associated with an average increase in weight of
3.9 kg. The fitted model is simply a straight line. Since gender dropped out of the
model, from these data we have no evidence that weight is significantly different for
boys and girls after taking height into account.
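The simple linear regression can be checked without Excel. The sketch below fits weight on height by ordinary least squares using the 19 height and weight values listed in the data table for this question; it uses only plain Python, and the variable names are illustrative assumptions rather than anything from the accompanying spreadsheet.

```python
# Heights and weights in the order they appear in the data table
heights = [56.5, 65.3, 64.3, 56.3, 59.8, 66.5, 51.3, 62.5, 62.8, 69.0,
           63.5, 67.0, 57.3, 62.5, 59.0, 72.0, 64.8, 57.5, 66.5]
weights = [84.0, 98.0, 90.0, 77.0, 84.5, 112.0, 50.5, 112.5, 102.5, 112.5,
           102.5, 133.0, 83.0, 84.0, 99.5, 150.0, 128.0, 85.0, 112.0]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Sums of squares and cross-products about the means
s_hh = sum((h - mean_h) ** 2 for h in heights)
s_ww = sum((w - mean_w) ** 2 for w in weights)
s_hw = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))

slope = s_hw / s_hh                      # least squares slope
intercept = mean_w - slope * mean_h      # least squares intercept
r_squared = s_hw ** 2 / (s_hh * s_ww)    # coefficient of determination

print(round(intercept, 2), round(slope, 2), round(r_squared, 2))
```

The printed values should agree with the fitted model and coefficient of determination quoted in the answer to part d).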
Q5: In a multiple linear regression model where the least squares estimates were
based on 27 sets of sample observations, the total sum of squares (SST) and the
regression sum of squares (SSR) were found to be:
SST = 3.881 and SSR = 3.549
(Note that these were respectively referred to as SS total and SSreg in Lecture 2.)
model is explaining roughly 91% of the variability in the response variable, which is very
high indeed.
By making use of all information given in this question, complete the missing
information in the following Analysis of Variance (ANOVA) table:
Answer:
Missing values shown in bold in the table.
ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 3.549 | 3.549/3 = 1.183 | 1.183/0.014 = 84.5 | < 0.001 |
| Residual | 23 | 0.332 | 0.332/23 = 0.014 | | |
| Total | 26 | 3.881 | | | |
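The entries of this ANOVA table follow mechanically from SST, SSR, the number of observations and the number of predictors. The Python sketch below shows the arithmetic; the function name `anova_table` is illustrative, and the value K = 3 predictors is inferred from the regression degrees of freedom used in the table. Note that the F ratio computed from the unrounded residual mean square comes out at roughly 82, slightly below the 84.5 obtained when the residual mean square is first rounded to 0.014.

```python
def anova_table(sst, ssr, n, k):
    """Fill in a regression ANOVA table from SST, SSR, n observations, k predictors."""
    sse = sst - ssr                     # residual sum of squares
    df_reg, df_res, df_tot = k, n - k - 1, n - 1
    msr = ssr / df_reg                  # regression mean square
    mse = sse / df_res                  # residual mean square
    return {
        "df": (df_reg, df_res, df_tot),
        "SSE": sse,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,
        "R2": ssr / sst,
    }

table = anova_table(sst=3.881, ssr=3.549, n=27, k=3)
for name, value in table.items():
    print(name, value)
```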
d) By looking at the p value reported in this table (i.e. Significance F), what do you
conclude about the fitted regression model?
Answer:
The F test reported in the ANOVA table tests whether the regression model as a
whole (i.e. including all predictors) is explaining some of the variability in the
response variable. More formally, the null hypothesis for the test is that all
regression coefficients are equal to zero and the alternative hypothesis is that at
least one of the coefficients is not equal to zero. In other words, it tests to see
whether at least one of the explanatory variables is predictive of the outcome. In
this case the p value for the test is highly significant which tells us that at least
one of the predictor variables is usefully explaining some of the variability in the
response variable.
(Question taken from Chapter 13, Statistics for Business and Economics, Sixth Edition, Paul Newbold, William L Carlson and Betty Thorne)
ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2 | | 22942.86087 | 92.59072267 | 1.5959E-10 |
| Residual | 19 | 4707.970129 | 247.7879016 | | |
| Total | 21 | 50593.69188 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 23.43013017 | 13.26222973 | 1.766681067 | 0.093342736 | -4.328035605 | 51.18829594 |
| Households | 0.778335508 | 0.094568835 | 8.230359471 | 1.09657E-07 | 0.580400662 | 0.976270354 |
| Location | 26.45878652 | 8.801292911 | 3.006238605 | 0.007260713 | 8.037469 | 44.880104 |

a) Fill in the missing sum of squares due to the regression in the ANOVA table and
compute the coefficient of determination for this model. Briefly discuss the result.
Answer:
Recall that SSreg + SSerror = SStotal. As such, the missing sum of squares is
calculated as 50593.69188 - 4707.970129 = 45885.72175.
We can obtain the coefficient of determination, or $R^2$, by dividing this value by the
total sum of squares as follows: $R^2$ = 45885.72175/50593.69188 = 0.91 or 91%.
This is a high $R^2$ value and tells us that 91% of the variability in sales is explained
by the two variables households and location.
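The same arithmetic can be written as a short Python sketch. The residual and total sums of squares are the values quoted in this answer; the regression degrees of freedom (2, one per predictor) and the variable names are assumptions of the sketch. It also recomputes the F ratio from the mean squares as a consistency check.

```python
sst = 50593.69188       # total sum of squares
sse = 4707.970129       # residual (error) sum of squares
df_reg, df_res = 2, 19  # two predictors: households and location

ssr = sst - sse                 # missing regression sum of squares
r_squared = ssr / sst           # coefficient of determination

msr = ssr / df_reg              # regression mean square
mse = sse / df_res              # residual mean square
f_stat = msr / mse              # overall F statistic

print(round(ssr, 5), round(r_squared, 2), round(f_stat, 2))
```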
freedom, then this test follows the same procedure that you were taught in
QAM I).
Answer:
To calculate the required confidence interval, we must use the following formula:
$$b \pm t_{n-K-1,\,0.025} \times SE(b)$$
The relevant t-value is $t_{19,\,0.025} = 2.093$, obtained from tables of the t distribution. The
estimate of the coefficient (labelled b above) is given in the regression output, as is
the standard error of the estimate. As such, we have:
$26.45878652 \pm (2.093 \times 8.801292911)$ = 8.04 to 44.88. This is a very wide
confidence interval which means that there is a lot of uncertainty associated with the
parameter estimate and, therefore, the true extent of the effect of location on sales.
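A minimal Python sketch of this interval calculation is given below. The estimate and standard error are the values quoted for the location coefficient, and 2.093 is the tabulated t value for 19 degrees of freedom; the helper name `coefficient_ci` is illustrative.

```python
def coefficient_ci(estimate, std_error, t_crit):
    """Confidence interval for a regression coefficient: b +/- t * SE(b)."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin

# Location coefficient from the regression output
lower, upper = coefficient_ci(estimate=26.45878652,
                              std_error=8.801292911,
                              t_crit=2.093)
print(round(lower, 2), round(upper, 2))
```

The same helper could be applied to the households coefficient by passing its estimate and standard error instead.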
c) Write down the form of the fitted regression model and clearly interpret it in
simple language.
Answer:
The fitted model is: Predicted sales = 23.43 + 0.78 × Households + 26.46 × Location
Firstly, an increase in households of one unit, or 1000 households, within a
catchment area, leads to an average increase in sales of 0.78, or 780, assuming
that location remains fixed.
Since the location variable is binary, a one unit change translates as a switch from
the reference location (suburban = 0) to the other location (town centre = 1). As
such, changing from a suburban to a town centre location increases sales, on
average, by 26.46 or 26,460 assuming that the number of households in the
catchment area remains fixed. As shown in Lecture 2, we could write two separate
fitted models, one for each location, which would represent two parallel lines.
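The idea of two parallel lines can be illustrated with a small Python sketch of the fitted model. The coefficients are the rounded values from the fitted model above; the function name `predicted_sales` and the example value of 50 (thousand) households are hypothetical and chosen only for illustration.

```python
def predicted_sales(households, location):
    """Fitted model: households in thousands, location 0 = suburban, 1 = town centre."""
    return 23.43 + 0.78 * households + 26.46 * location

# Same (hypothetical) catchment size, two different locations
suburban = predicted_sales(households=50, location=0)
town_centre = predicted_sales(households=50, location=1)

# The gap between the two lines is the location coefficient, whatever the catchment size
gap = town_centre - suburban

# The slope with respect to households is the same for both locations
slope_suburban = predicted_sales(51, 0) - predicted_sales(50, 0)
slope_town = predicted_sales(51, 1) - predicted_sales(50, 1)

print(round(suburban, 2), round(town_centre, 2), round(gap, 2))
```

Because the location term only shifts the intercept, the two lines have the same slope and never cross.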
SOME TIPS