
DBA6000

Quantitative
Business
Research
Methods

Rob J Hyndman

© Rob J Hyndman, 2008.

Professor Rob Hyndman


Department of Econometrics and Business Statistics
Monash University (Clayton campus)
VIC 3800.

Email: Rob.Hyndman@buseco.monash.edu.au
Telephone: (03) 9905 2358
www.robhyndman.info
Contents

Preface

1 Research design
1.1 Statistics in research
1.2 Organizing a quantitative research study
1.3 Some quantitative research designs
1.4 Data structure
1.5 The survey process
Appendix A: Case studies

2 Data collection
2.1 Introduction
2.2 Data collecting instruments
2.3 Errors in statistical data
2.4 Questionnaire design
2.5 Data processing
2.6 Sampling schemes
2.7 Scale development
Appendix B: Case studies

3 Data summary
3.1 Summarizing categorical data
3.2 Summarizing numerical data
3.3 Summarizing two numerical variables
3.4 Measures of reliability
3.5 Normal distribution

4 Computing and quantitative research
4.1 Data preparation
4.2 Using a statistics package
4.3 Further reading
4.4 SPSS exercise

5 Significance
5.1 Proportions
5.2 Numerical differences

6 Statistical models and regression
6.1 One numerical explanatory variable
6.2 One categorical explanatory variable
6.3 Several explanatory variables
6.4 Comparing regression models
6.5 Choosing regression variables
6.6 Multicollinearity
6.7 SPSS exercises

7 Significance in regression
7.1 Statistical model
7.2 ANOVA tables and F-tests
7.3 t-tests and confidence intervals for coefficients
7.4 Post-hoc tests
7.5 SPSS exercises

8 Dimension reduction
8.1 Factor analysis
8.2 Further reading

9 Data analysis with a categorical response variable
9.1 Chi-squared test
9.2 Logistic and multinomial regression
9.3 SPSS exercises

10 A survey of statistical methodology

11 Further methods
11.1 Classification and regression trees
11.2 Structural equation modelling
11.3 Time series models
11.4 Rank-based methods

12 Presenting quantitative research
12.1 Numerical tables
12.2 Graphics
Appendix: Good graphs for better business

13 Readings



Preface

Subject convenor

Professor Rob J Hyndman


B.Sc.(Hons), Ph.D., A.Stat
Department of Econometrics and Business Statistics
Location: Room 671, Menzies Building, Clayton.
Phone: (03) 9905 2358
Email: Rob.Hyndman@buseco.monash.edu.au
WWW: http://www.robhyndman.info

Objectives

On completion of this subject, students should have:

• the necessary quantitative skills to conduct high quality independent research related to
business administration;
• a comprehensive grounding in a number of quantitative methods of data production and analysis;
• been introduced to quantitative data analysis through a practical research activity.

Synopsis

This unit considers the quantitative research methods used in studying business, management
and organizational analysis. Topics to be covered:

1. research design including experimental designs, observational studies, case studies, longitudinal analysis and cross-sectional analysis;
2. data collection including designing data collection instruments, sampling strategies and
assessing the appropriateness of archival data for a research purpose;
3. data analysis including graphical and numerical techniques for the exploration of large data sets and a survey of advanced statistical methods for modelling the relationships between variables;
4. communication of quantitative research; and
5. the use of statistical software packages such as SPSS in research.

The effective use of several quantitative research methods will be illustrated through reading
research papers drawn from several disciplines.

References

None of these are required texts—they provide useful background material if you want to read
further. Huck (2007) is excellent on interpreting statistical results in academic papers. Pallant
(2007) is very helpful when using SPSS and in giving advice on how to write up research results.
Use Wild and Seber (2000) if you need to brush up on your basic statistics; it contains lots of
helpful advice and interesting examples.

1. Huck, S.W. (2007) Reading statistics and research, 5th ed., Allyn & Bacon: Boston, MA.
2. Pallant, J. (2007) SPSS survival manual, 3rd ed., Allen & Unwin.
3. de Vaus, D. (2002) Analyzing social science data. SAGE Publications: London.
4. Wild, C.J. & Seber, G.A.F. (2000) Chance encounters: a first course in data analysis and inference. John Wiley & Sons: New York.

Timetable

17 July Introduction/Chapter 1
24 July Chapter 2
31 July Chapter 3
7 August Chapter 4 SPSS tutorial
14 August Chapter 5
21 August Chapter 6
28 August Chapter 7 SPSS tutorial
4 September Chapters 8–9 SPSS tutorial
11 September Chapter 10
18 September Chapters 11–12 First assignment due
25 September No class
2 October No class
9 October SPSS tutorial
16 October Oral presentations Second assignment due


Assessment
1. A written report presenting and critiquing a research paper which uses quantitative research methods. 45%
• It can be a published research paper from a scholarly journal, or a company report.
It must contain substantial quantitative research. It must be approved in advance.
• Your report should include comments on the research questions addressed, the appropriateness of the data used, how the data were collected, the method of analysis chosen, and the conclusions drawn.
• Length: 4000–5000 words excluding tables and graphs.
• Due: 17 September
2. A written report presenting some original quantitative analysis of a suitable multivariate
data set. 45%
• You may use your own data, or use data that I will provide. The data set must
include at least four variables. It can be data from your workplace.
• Your report should include comments on the research questions addressed, the appropriateness of the data used, how the data were collected, the method of analysis chosen, and the conclusions drawn.
• You may use any statistical computing package or Excel for analysis.
• Length: 4000–5000 words excluding tables and graphs.
• Due: 15 October
3. A 20-minute oral presentation of one of the above reports. 10%.
• On either 8 or 15 October.

Assignment marking scheme

• Research questions addressed: 6%


• Appropriateness of data: 6%
• Data collection: 6%
• Description of statistical methods used: 6%
• Suitability of statistical methods: 6%
• Discussion of statistical results: 8%
• Conclusions (are they supported/valid?): 7%

Choosing a paper for Assignment 1

Choose something you are interested in. For example, it can be an article you are reading as
part of your other DBA studies or something you have read as part of your professional life.

The following journals contain some articles that would be suitable. There are also many others.

• Australian Journal of Management


• International Journal of Human Resource Management
• Journal of Advertising
• Journal of Applied Management Studies
• Journal of Management
• Journal of Management Accounting Research


• Journal of Management Development


• Journal of Managerial Issues
• Journal of Marketing
• Management Decision

You can obtain online copies of some of these journals via the Monash Voyager Catalogue. Hard copies should be in the Monash library.

Things to look for:

• it should involve some substantial data analysis;


• it should involve more than summary statistics (e.g., a regression model, or some chi-squared tests);
• it should not use sophisticated statistical methods that are beyond this subject (e.g., avoid
factor analysis and structural equation models).

All papers should be approved by Rob Hyndman before you begin work on the assignment.

Choosing a data set for Assignment 2

• Choose something you know about. The best data analyses involve a mix of good knowledge of the data context as well as good use of statistical methodology.
• Don’t try to do too much. One response variable with 3–5 explanatory variables is usually sufficient. Resist the temptation to write a long treatise!
• You will find it easier if the response variable is numeric. Analysing categorical response variables with several explanatory variables can be tricky.
• Be clear about the purpose of your analysis. State some explicit objectives or hypotheses, and address them via your statistical analysis.
• Think about what you include. A few well-chosen graphics that tell a story are better than pages of computer output that mean very little.
• Start early. Even before we cover much methodology, you can do some basic data summaries and think about the key questions you want to address.
• All data sets should be approved by Rob Hyndman before you begin work on the assignment.

Readings

Most weeks we will read a case study from a research journal and discuss the analysis. Please
read these in advance. We will discuss them in the third hour. You cannot use a paper we
have discussed for your first assessment task. If you have a suggestion of a paper that may be
suitable for class discussion, please let me know.



CHAPTER
1
Research design

1.1 Statistics in research


“Statistics is the study of making sense of data.” (Ott and Mendenhall)
“The key principle of statistics is that the analysis of observations doesn’t depend only on the observations but also on how they were obtained.” (Anonymous)

• Data beat anecdotes
“For example” proves nothing. (Hebrew proverb)


• Data beat intuition
“Belief is no substitute for arithmetic.” (Henry Spencer)
• Data beat “expert” opinion
“When information becomes unavailable, the expert comes into his own.” (A.J. Liebling)

1.1.1 Statistics answers questions using data

• Do pollutants cause asthma?


• Do transaction volumes on the stock market react to price changes?
• Does deregulation reduce unemployment?
• Does fluoride reduce tooth decay?

A definition

Statistical Analysis: Mysterious, sometimes bizarre, manipulations performed upon the col-
lected data of an experiment in order to obscure the fact that the results have no generalizable
meaning for humanity. Commonly, computers are used, lending an additional aura of unreality
to the proceedings.
(Source unknown)

97.3% of all statistics are made up.


1.1.2 Some statistics stories

The Challenger disaster

[Figure: scatterplot of the number of O-rings damaged against ambient temperature at launch (55–80).]

Charlie’s chooks
[Figure: scatterplot of percentage mortality (Y) against percentage of Tegel birds (X).]


Risk factors for heart disease

A doctor wants to investigate who is most at risk for coronary-related deaths. He selects 12
patients at random from his clinic and records their age, blood pressure and drug used. He
also records whether they eventually died from heart disease or not.

Age BP Drug L/D


18 68 1 D
20 64 2 L
22 72 1 D
25 67 2 L
29 80 – D
33 70 – D
34 86 1 D
36 85 – D
37 73 2 L
39 82 – L
41 90 1 D
45 87 2 L

Drug    Lived   Died    % lived

1       0       4       0%
2       4       0       100%
–       1       3       25%
Total   5       7

Drug 1 looks bad, 2 looks good.
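
The summary table can be built directly from the raw data. A minimal sketch in Python using pandas (the twelve patients are entered by hand from the table above):

    import pandas as pd

    # The twelve patients above; drug "-" means no drug was recorded.
    drug    = ["1", "2", "1", "2", "-", "-", "1", "-", "2", "-", "1", "2"]
    outcome = ["D", "L", "D", "L", "D", "D", "D", "D", "L", "L", "D", "L"]

    patients = pd.DataFrame({"drug": drug, "outcome": outcome})
    print(pd.crosstab(patients["drug"], patients["outcome"]))
    # With only four patients per drug, and no random allocation to drugs,
    # such extreme 0% and 100% survival rates are very weak evidence.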


1.1.3 Causation and association

Smoking and Lung Cancer

There is a strong positive correlation between smoking and lung cancer. There are several
possible explanations.

• Causal hypothesis: Smoking causes lung cancer.


• Genetic hypothesis: There is a hereditary trait which predisposes people to both nicotine
addiction and lung cancer.
• Sloppy lifestyle hypothesis: Smoking is most prevalent amongst people who also drink
too much, don’t exercise, eat unhealthy food, etc.

Postnatal care

Mothers who return home from hospital soon after birth do better than those who stay in
hospital longer.

• Causation hypothesis: Hospital is harmful and/or home is helpful.


• Common response hypothesis: Mothers return home early because they are coping well.
• Confounding hypothesis: Mothers return home early if there is someone at home to help.

University applicants

Male Female Total


Accept 70 40 110
Reject 100 100 200
Total 170 140 310

Is there evidence of discrimination?

Course: Introduction to bean counting

Male Female Total


Accept 60 20 80
Reject 60 20 80
Total 120 40 160


Course: Advanced welding

Male Female Total


Accept 10 20 30
Reject 40 80 120
Total 50 100 150
This is an example of Simpson’s Paradox: the association between two variables is reversed, or masked, when data from several groups are combined.
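
A quick computation makes the paradox explicit. A minimal sketch in Python, using the admission figures from the tables above:

    # (accepted, rejected) counts by course and sex, from the tables above.
    courses = {
        "Bean counting":    {"male": (60, 60), "female": (20, 20)},
        "Advanced welding": {"male": (10, 40), "female": (20, 80)},
    }

    def rate(accepted, rejected):
        return accepted / (accepted + rejected)

    # Within each course, male and female acceptance rates are identical.
    for course, groups in courses.items():
        for sex, (acc, rej) in groups.items():
            print(course, sex, f"{rate(acc, rej):.0%}")

    # Pooled over courses, the male rate looks higher (41% vs 29%), because
    # more men applied to the course with the higher acceptance rate.
    for sex in ("male", "female"):
        acc = sum(groups[sex][0] for groups in courses.values())
        rej = sum(groups[sex][1] for groups in courses.values())
        print("Combined", sex, f"{rate(acc, rej):.0%}")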

Other examples of Simpson’s paradox

• The average tax rate has increased with time even though the rate in every income category has decreased. Why?
• The average salary of female B.Sc. graduates is lower than the average salary of male graduates. Why?

Causality or association?

1. A positive correlation between blood pressure and income is observed. Does this indicate a causal connection?
2. In a survey in 1960, it was found that for 25–34 y.o. males there was a positive
correlation between years of school completed and height. Does going to
school longer make a man taller?
3. The same survey showed a negative correlation between age and educational
level for persons aged over 25. Why?
4. Students at fee paying private schools perform better on average in VCE than
students at government funded schools. Why?

Some subtle differences

• Distinguish between: causation & association, prediction & causation, and prediction & explanation.
• Note difference between deterministic and probabilistic causation.


1.2 Organizing a quantitative research study

As a quick check, ask the following questions:

1. What is your hypothesis (your research question)?

2. What is already known about the problem (literature review)?

3. What sort of design is best suited to studying your hypothesis? (method)

4. What data will you collect to test your hypothesis? (sample)

5. How will you analyse these data? (data analysis)

6. What will you do with the results of the study? (communication)

These questions are broken down in more detail below. (These are mostly taken from Rubin et al. (1990), and have also appeared in Balnaves and Caputi (2001).)

1.2.1 Hypothesis

• What is the goal of the research?


• What is the problem, issue, or critical focus to be researched?
• What are the important terms? What do they mean?
• What is the significance of the problem?
• Do you want to test a theory?
• Do you want to extend a theory?
• Do you want to test competing theories?
• Do you want to test a method?
• Do you want to replicate a previous study?
• Do you want to correct previous research that was conducted in an inadequate manner?
• Do you want to resolve inconsistent results from earlier studies?
• Do you want to solve a practical problem?
• Do you want to add to the body of knowledge in another manner?

1.2.2 Review of literature

• What does previous research reveal about the problem?


• What is the theoretical framework for the investigation?
• Are there complementary or competing theoretical frameworks?
• What are the hypotheses and research questions that have emerged from the literature
review?


1.2.3 Method

• What methods or techniques will be used to collect the data? (This holds for applied and non-applied research.)
• What procedures will be used to apply the methods or techniques?
• What are the limitations of these methods?
• What factors will affect the study’s internal and external validity?
• Will any ethical principles be jeopardized?

1.2.4 Sample

• Who (what) will provide (constitute) the data for the research?
• What is the population being studied?
• Who will be the participants for the research?
• What sampling technique will be used?
• What materials and information are necessary to conduct the research?
• How will they be obtained?
• What special problems can be anticipated in acquiring needed materials and information?
• What are the limitations in the availability and reporting of materials and information?

1.2.5 Data analysis

• How will data be analysed?


• What statistics will be used?
• What criteria will be used to determine whether hypotheses are supported?
• What was discovered (about the goal, data, method, and data analysis) as a result of doing preliminary work (if conducted)?

1.2.6 Communication

• How will the final research report be organized? (Outline)


• What sources have you examined thus far that pertain to your study? (Reference list)
• What additional information does the reader need?
• What time frame (deadlines) have you established for collecting, analysing and presenting data? (Timetable)

1.3 Some quantitative research designs


• Case study: questionnaire, interview, observation. Best for exploratory work and hypothesis generation. Limited quantitative analysis possible.
• Survey: questionnaire, interview, observation. Best if sample is random.
• Experiment: questionnaire, interview, observation. Best for demonstrating causality.


1.3.1 Cross-sectional vs longitudinal analysis

All designs can be either cross-sectional or longitudinal.

• Cross-sectional design involves data collection for one time only.


• Longitudinal design involves successive data collection over a period of time. Necessary if you want to study changes over time.

1.3.2 Case study designs

• involves intense involvement with a few cases rather than limited involvement with many cases
• can’t generalize results easily
• useful in exploring ideas and generating hypotheses

1.3.3 Survey designs

• Most popular in business/management research


• useful when you cannot control the things you want to study
• difficult to get random and representative samples

1.3.4 Experimental designs

• requires a control group to allow for the placebo effect
• requires the experimenter to control all variables other than the variable of interest
• requires randomization to groups
• allows causation to be tested

Which research design would you use?

Hypotheses:
1. Women believe they are better at managing than men.
2. Children who listen to poetry in early childhood make better progress in learning to read than those who do not.
3. A business will run more efficiently if no person is directly responsible for more
than five other people.
4. There are inherent advantages in businesses staying small.
5. Employees with postgraduate qualifications have shorter job expectancy than
employees without postgraduate qualifications.

What data would you collect in each case?


1.4 Data structure

1.4.1 Populations and samples

A population is the entire collection of ‘things’ in which we are interested. A sample is a subset of
a population. We wish to make an inference about a population of interest based on information
obtained from a sample from that population.

E XAMPLES :

• You measure the profit/loss of 50 public hospitals in Victoria, randomly selected.


Population:
Sample:
Points of interest:
• Sales on 500 products from one company for the last 5 years are analysed.
Population:
Sample:
Points of interest:

1.4.2 Cases and variables

Think about your data in terms of cases and variables.

• A case is the unit about which you are taking measurements. E.g., a person, a business.
• A variable is a measurement taken on each case.
E.g., age, score on test, grade-level, income.

1.4.3 Types of Data

The ways of organizing, displaying and analysing data depend on the type of data we are investigating.

• Categorical Data (also called nominal or qualitative)

e.g. sex, race, type of business, postcode


Averages don’t make sense. Ordered categories are called ordinal data.

• Numerical Data (also called scale, interval and ratio)

e.g. income, test score, age, weight, temperature, time.


Averages make sense.

Note that we sometimes treat numerical data as categories. (e.g. three age groups.)


1.4.4 Response and explanatory variables

Response variable: measures the outcome of a study. Also called the dependent variable.

Explanatory variable: attempts to explain the variation in the observed outcomes. Also called an independent variable.

Many statistical problems can be thought of in terms of a response variable and one or more explanatory variables.

• Study of profit/loss in Victorian hospitals.


Response variable:
Explanatory variables:

• Monthly sales of 500 products


Response variable:
Explanatory variables: competitor advertising.

1.5 The survey process


1. Planning a survey
State the objectives: In order to state the objectives we often need to ask questions such as:
• What is the survey’s exact purpose?
• What do we not know and want to know?
• What inferences do we need to draw?
Begin by developing a specific list of information needs. Then write focused survey questions.
2. Design the sampling procedure
Identify the target population: Whom are we drawing conclusions about?
Select a sampling scheme: Examples: simple random sampling, stratified random sampling, systematic sampling, and cluster sampling. (Two of these are sketched after this list.)
3. Select a survey method
Decide how to collect the data: personal interviews, telephone interviews, mailed questionnaires, diaries, . . .
4. Develop the questionnaire
Write the questionnaire. Decide on the wording, types of questions, and other issues.
5. Pretest the questionnaire
Select a very small sample from the sampling frame. Conduct the survey and see what
goes wrong. Correct any problems before carrying out the full-scale study.
6. Conduct the survey
Run the survey in an efficient and time effective manner.
7. Analyze the data
Gather the results and determine outcomes.
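
Two of the sampling schemes mentioned in step 2, sketched in Python (the frame of 1000 units and the sample size of 50 are invented for illustration):

    import random

    frame = list(range(1000))   # hypothetical sampling frame of 1000 units
    n = 50

    # Simple random sampling: every subset of n units is equally likely.
    srs = random.sample(frame, n)

    # Systematic sampling: a random start, then every k-th unit thereafter.
    k = len(frame) // n
    start = random.randrange(k)
    systematic = frame[start::k]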


Appendix A: Case studies

Injury management in NSW

Four injury management pilots (IMP) running during 2001:

• private hospitals and nursing homes within NSW;

• all industry groups within the Central West NSW region;

• two insurance companies (QBE and EML).

We wish to do a statistical comparison of the injury management pilots with the current standard injury management arrangements.

Performance measures

• incidence of specific payment types


• duration of claims
• number of claims
• proportion of claimants in receipt of weekly benefits at 4, 8, 13 and 26 weeks.
• costs for claimants at 4, 8, 13 and 26 weeks.
– medical, rehabilitation, physiotherapy, chiropractic
– weekly-benefits
• timeliness
– number of days from injury to agent notification
– number of days from injury to first payment

Some potential driving variables

• age
• gender
• injury type
• agency (e.g., powered tools)
• severity of injury
• medical interventions
• employer size
• insuring agency
• weekly pay at time of injury
• industry (ANZSIC code)
• occupation (ASCO code)

• Driving variables affect the performance measures.


• Variations between groups in key driver variables can induce apparent differences between groups. These can then be confused with any real differences due to the programs being evaluated.
• Therefore any comparisons of groups of employees should either eliminate the effect of the drivers or try to measure the effect of the drivers.


The ideal design!

Ideally, we would use a randomized controlled trial. This eliminates the effect of driving variables.

• The control group would be employees on the old IM system.


• The treatment group would be employees in the new IMP.
• Employees would be randomly allocated to the two groups.
• Statistical comparisons between the two groups would show differences between the old
IM system and the new IMP.
• This random allocation would prevent any systematic differences between those in the
IMP and those not in the IMP.
• Such a scheme is impracticable.

The actual design

We have to use pseudo-control groups and eliminate differences between the control and IMP
groups using statistical models.

• All injuries within the specified industry group, geographical region or insurer will be
subject to the new IMP during 2001.
• The pseudo-controls will be the equivalent groups of employees in 2000 who are not
subject to the new IMP.

Problem of confounding

• If there are differences between the IMP and the control, is it due to the different IM
program or the different group?

Solution:

• adjust for as many driving variables as possible;


• compare similar groups not subject to the IMP.

Comparisons undertaken

IMP group: Private hospitals/nursing homes in NSW 2001
Pseudo-control: Private hospitals/nursing homes 2000

IMP group: Central West NSW region 2001
Pseudo-controls: Central West NSW region 2000

IMP group: Insurance company 2001
Pseudo-control: Insurance company 2000

Non-IMP group: Comparable industry group 2001
Pseudo-controls: Comparable industry group 2000

Non-IMP group: Comparable NSW region 2001
Pseudo-controls: Comparable NSW region 2000


We do not directly compare:

• private hospitals/nursing homes with other industry groups;


• Central West NSW region with other geographical regions.

Instead, we compare the change between 2000 and 2001 in each industry group and each geographical region.

How to interpret the results. . .

• If all 2001 groups are different from the 2000 groups after taking into account all drivers,
then it is likely there are changes between years not reflected in the drivers. We won’t be
able to attribute any changes to the IMP.

• If all IMP 2001 groups are different from the 2000 groups after taking into account all drivers,
but the non-IMP 2001 groups are not different from the 2000 groups, then it is likely the
changes between years are due to the IMP.


Needlestick injuries

You are interested in the number and severity of needlestick injuries amongst health workers involved in blood donation and transfusion. Work in groups of three to carefully define the objectives of your survey. You will need to specify:

• the objective of the survey


• what data are to be collected
• the target population
• the survey population
• the sample
• the data collection method
• potential errors which could occur in your survey.

Palliative care referrals

A few years ago, I helped the Health Department with a survey on palliative care. As part
of the study, it was necessary to study the ‘referral’ pattern for palliative care providers: how
many patients they send to hospital (for inpatient or outpatient treatment); how many they
refer to consultants for specialist comment; how many to community health programs; and so
on.

Possible sampling schemes:

1. sample a group of palliative care practitioners and study their referral patterns;
2. sample a group of palliative care patients and study their referral patterns.

Discuss the possible advantages and disadvantages of the two schemes.



CHAPTER
2
Data collection

2.1 Introduction

“You don’t have to eat the whole ox to know that the meat is tough.”
Samuel Johnson

Sampling is very familiar to all of us, because we often reach conclusions about phenomena
on the basis of a sample of such phenomena. You may test a swimming pool’s temperature by
dipping your toe in the water or the performance of a new vehicle by a short test drive. These
are among the countless small samples that we rely on when making personal decisions. We
tend to use haphazard methods in picking our sample and risk substantial sampling error.

Research also usually reaches its conclusions on the basis of sampling, but the methods used must adhere to certain rules, which are discussed below. The goal in obtaining data through survey sampling is to use a sample to make precise inferences about the target population. We want to be highly confident about our inferences. It is important to have a solid grasp of sampling theory in order to appraise the reliability and validity of the conclusions drawn from the sample taken.

2.2 Data collecting instruments

The choice of data collection instrument is crucial to the success of the survey. When determining an appropriate data collection method, many factors need to be taken into account, including the complexity or sensitivity of the topic, the response rate required, the time and money available for the survey, and the population that is to be targeted. Some of the most common data collection methods are described in the following sections.


2.2.1 Interviewer enumerated surveys

Interviewer enumerated surveys involve a trained interviewer going to the potential respondent, asking the questions and recording the responses.

The advantages of using this methodology are:


• provides better data quality
• special questioning techniques can be used
• greater rapport established with the respondent
• allows more complex issues to be included
• produces higher response rates
• more flexibility in explaining things to respondents
• greater success in dealing with language problems

The disadvantages of using this methodology are:


• expensive to conduct
• training for interviewers is required
• more intrusive for the respondent
• interviewer bias may become a source of error

2.2.2 Web surveys

Web surveys are increasingly popular, although care must be taken to avoid sample selection
bias and multiple responses from an individual.

The advantages of this methodology are:


• cheap to administer
• private and confidential
• easy to use conditional questions and to prompt if no response or inappropriate response.
• can build in live checking.
• can provide multiple language versions

The disadvantages of this methodology are:


• respondent bias may become a source of error
• not everyone has access to the internet
• language and interface must be very simple
• cannot build up a rapport with respondents
• resolution of queries is difficult
• only appropriate when straightforward data can be collected

2.2.3 Mail surveys

Self-enumeration mail surveys are those where the questionnaire is left with the respondent to complete.

The advantages of this methodology are:


• cheaper to administer
• more private and confidential
• in some cases does not require interviewers

The disadvantages of this methodology are:


• difficult to follow up non-response
• respondent bias may become a source of error
• response rates are much lower
• language must be very simple
• problems with poor English and literacy skills
• cannot build up a rapport with respondents
• resolution of queries is difficult
• only appropriate when straightforward data can be collected

2.2.4 Telephone surveys

A telephone survey is the process where a potential respondent is phoned and asked the survey
questions over the phone.

The advantages of this methodology are:


• cheap to administer
• convenient for interviewers and respondents

The disadvantages of this methodology are:


• interviews easily terminated by respondent
• cannot use prompt cards to provide alternatives for answers
• burden placed on interviewers and respondents
• sample biased towards households with phones

2.2.5 Diaries

Diaries can be used as a format for a survey. In these surveys respondents are directed to record
the required information over a predetermined period in the diary, book or booklet supplied.

The advantages of this methodology are:


• high quality and detailed data from the completed diaries
• more private and confidential circumstances for the respondent
• does not require interviewers

The disadvantages of this methodology are:


• response rates are lower and the diaries are rarely completed well
• language must be simple
• can only include relatively simple concepts
• cannot build up a rapport
• cannot explain the purpose of survey items to respondents


Face-to-face Telephone Mail


Response rates Good Good Good

Representative samples
Avoidance or refusal bias Good Good Poor
Control over who completes the questionnaire Good Good Satisfactory
Gaining access to the selected person Satisfactory Good Good
Locating the selected person Satisfactory Good Good

Effects on questionnaire design


Ability to handle:
Long questionnaires Good Satisfactory Satisfactory
Complex questions Good Poor Satisfactory
Boring questions Good Satisfactory Poor
Item non-response Good Good Satisfactory
Filter questions Good Good Satisfactory
Question sequence control Good Good Poor
Open ended questions Good Good Poor

Quality of answers
Minimize socially desirable responses Poor Satisfactory Good
Ability to avoid distortion due to
Interviewer characteristics Poor Satisfactory Good
Interviewer opinions Satisfactory Satisfactory Good
Influence of other people Satisfactory Good Poor
Allows opportunities to consult Satisfactory Poor Good
Avoids subversion Poor Satisfactory Good

Implementing the survey


Ease of finding suitable staff Poor Good Good
Speed Poor Good Satisfactory
Cost Poor Satisfactory Good

Table 2.1: Advantages and disadvantages of three methods of data collection. Table taken from de Vaus
(2001) who adapted it from Dillman (1978).

2.2.6 Ideas for increasing response rates

1. Provide a reward.
2. Systematic follow-up.
3. Keep it short.
4. Choose an interesting topic.


2.2.7 Archival data

Rather than collecting your own data, you may use some existing data. If you do, keep the
following points in mind.

Available information Is there sufficient documentation of the original research proposal for
which the data were collected? If not, there may be hidden problems in re-using the data.

Geographical area Are the data relevant to the geographical area you are studying? e.g., what
country, city, state or other area does the archive data cover?

Time period Are the data relevant to the time period you are studying? Does your research area cover recent events, is it historical, or does it look at changes over a specified range of time? Most data are at least a year old before they are released to the public.

Population What population do you wish to study? This can refer to a group or groups of
people, particular events, official records, etc. In addition you should consider whether
you will look at a specific sample or subset of people, events, records, etc.

Context Do the archival data contain the information relevant to your research area?

2.3 Errors in statistical data

In sample surveys there are two types of error that can occur:

• sampling error, which arises because only a part of the population is used to represent the whole population; and
• non-sampling error, which can occur at any stage of a sample survey.

It is important to be aware of these errors so that they can be minimized.

2.3.1 Sampling error

Sampling error is the error we make in selecting samples that are not representative of the
population. Since it is practically impossible for a smaller segment of a population to be exactly
representative of the population, some degree of sampling error will be present whenever we
select a sample. It is important to consider sampling error when publishing survey results as
it gives an indication of the accuracy of the estimate and therefore reflects the importance that
can be placed on interpretations.

If sampling principles are carefully applied within the constraints of available resources, sampling error can be accurately measured and kept to a minimum. Sampling error is affected by:

• sample size
• variability within the population
• sampling scheme


Generally, larger sample sizes decrease sampling error. To halve the sampling error, the sample size must be increased fourfold. Indeed, sampling error can be eliminated entirely by increasing the sample size to include every element in the population.
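
The fourfold rule follows from the standard error of an estimate being proportional to 1/sqrt(n). A minimal sketch (the population standard deviation of 10 is illustrative only):

    import math

    sigma = 10.0                # illustrative population standard deviation
    for n in (100, 400, 1600):  # each fourfold increase in sample size...
        se = sigma / math.sqrt(n)
        print(n, se)            # ...halves the standard error: 1.0, 0.5, 0.25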

The population variability also affects the error: more variable populations give rise to larger errors, since estimates calculated from different samples are more likely to differ greatly. The effect of variability within the population can be reduced by increasing the sample size to make it more representative of the target population.

2.3.2 Non-sampling error

Non-sampling error comprises all errors in a survey that are not sampling errors: any error not caused by the fact that only part of the population was selected. Even if we were to undertake a complete enumeration of the population, non-sampling errors might remain. In fact, as the size of the sample increases, non-sampling errors may get larger, because of such factors as a possible increase in the non-response rate, interviewer errors, and data processing errors.

For the most part we cannot measure the effect that non-sampling errors have on the results, and because of their nature these errors may never be totally eliminated. Perhaps the biggest source of non-sampling error is a poorly designed questionnaire: it can influence the response rate achieved in the survey, the quality of responses obtained, and consequently the conclusions drawn from the survey results.

Some common sources of non-sampling error are discussed in the following paragraphs.

Target Population
Failure to identify clearly who is to be surveyed. This can result in an inadequate sampling frame, imprecise definitions of concepts, and poor coverage rules.
Non-response
A non-response error occurs when the respondents do not reflect the sampling frame. This happens when the people who do not respond to the survey differ from the people who do respond. It often occurs in voluntary response polls. For example, suppose that in an air bag study we asked respondents to call a 0018 number to be interviewed. Because a 0018 call costs $2 per minute, many drivers may not respond. Furthermore, those who do respond may be the people who have had bad experiences with air bags. Thus the final sample of respondents may not even represent the sampling frame.
For example,
• telephone polls miss those people without phones
• household surveys miss homeless, prisoners, students in colleges, etc.
• train surveys only reach public transport users, and tend to over-represent regular public transport users.


Manufacturers and advertising agencies often use interviews at shopping malls to gather information about the habits of consumers and the effectiveness of ads. A sample of mall shoppers is fast and cheap. “Mall interviewing is being propelled primarily as a budget issue”, one expert told the New York Times. But people contacted at shopping malls are not representative of the entire population. They are richer, for example, and more likely to be teenagers or retired. Moreover, mall interviewers tend to select neat, safe-looking individuals from the stream of customers. Decisions based on mall interviews may not reflect the preferences of all consumers.

In 1991 it was claimed that data showed that right-handed persons live on average almost a decade longer than left-handed or ambidextrous persons. The investigators had compared the mean ages at death of people who were described by survivors as left-, right- or mixed-handed.
• What is the problem?
The questionnaire
Poorly designed questionnaires with mistakes in wording, content or layout may make it difficult to record accurate answers. The most effective methods of designing a questionnaire are discussed in Section 2.4. Following those principles will help reduce the non-sampling error associated with the questionnaire.
Interviewers
If an interviewer is used to administer the survey, their work has the potential to produce non-sampling error. This can be due to the personal characteristics of the interviewer: for example, an elderly person will often be more comfortable giving information to a female interviewer. The interviewer’s own opinions and characteristics may also influence the respondent’s answers.

In 1968, one year after a major racial disturbance in Detroit, a sample of black residents was asked:
“Do you personally feel that you can trust most white people, some white people, or none at all?”
Of those interviewed by whites, 35% answered “Most”, while only 7% of those interviewed by blacks gave this answer. Many questions were asked in this study. Only on some topics, particularly black-white trust or hostility, did the race of the interviewer have a strong effect on the answers given. The interviewer was a large source of non-sampling error in this study.

Respondents
Respondents can also be a source of non-sampling error. They may refuse to answer questions, or provide inaccurate information to protect themselves. They may have memory lapses and/or lack the motivation to answer the questionnaire, particularly if the questionnaire is lengthy, overly complicated or of a sensitive nature. Respondent fatigue is a very important factor.

Social desirability bias refers to the tendency of respondents to provide answers which they think are more acceptable, or which they think the interviewer wants to hear. For example, respondents may state that they have a higher income than is actually the case if they feel this will increase their status.


Respondents may refuse to answer a question which they find embarrassing or choose
a response which prevents them from continuing with the questions. For example, if
asked the question: “Are you taking oral contraceptive pills for any reason?”, and know-
ing that if they respond “Yes” they will be asked for more details, respondents who are
embarrassed by the question are likely to answer “No”, even if this is incorrect.

Fatigue can be a problem in surveys which require a high level of commitment from respondents. The level of accuracy and detail supplied may decrease as respondents become tired of recording all the information. Sometimes interviewer fatigue can also be a problem, particularly when the interviewers have a large number of interviews to conduct.

Processing and collection


Processing and collection errors can be a source of non-sampling error. For example, the results from the survey may be entered incorrectly. The time of year the survey is enumerated can also produce non-sampling error: if the survey is conducted in the school holidays, potential respondents with school children could be away or hard to contact.

The Shere Hite surveys

In 1987, Shere Hite published a best-selling book called Women and Love. The author distributed 100,000 questionnaires through various women’s groups, asking questions about love, sex, and relations between women and men. She based her book on the 4.5% of questionnaires that were returned (about 4,500 responses).

• 95% said they were unhappily married


• 91% of those who were divorced said that they had initiated the divorce

What are the problems with this research?
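
A small simulation shows how differential response rates can distort an estimate of this kind. All numbers here are invented for illustration; only the 100,000 questionnaires and the roughly 4.5% return rate echo the example above:

    import random

    random.seed(1)
    # Invented population: 30% of 100,000 women are unhappily married.
    population = [True] * 30_000 + [False] * 70_000

    # Suppose unhappy women return the questionnaire at 10%, happy women at 2%.
    responses = [x for x in population
                 if random.random() < (0.10 if x else 0.02)]

    print(len(responses))                   # roughly 4,400 returns (about 4.4%)
    print(sum(responses) / len(responses))  # about 68% unhappy among respondents,
                                            # versus 30% in the population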

Exercise 1: In Case 2, it was necessary to study the ‘referral’ pattern for palliative care providers: how many patients they send to hospital (for inpatient or outpatient treatment); how many they refer to consultants for specialist comment; how many to community health programs; and so on. Two alternative sampling schemes are available: sample a group of palliative care practitioners and study their referral patterns; or sample a group of palliative care patients and study their referral patterns. Discuss the possible advantages and disadvantages of the two schemes.

2.4 Questionnaire design

2.4.1 Introduction

The purpose of a questionnaire is to obtain specific information with tolerable accuracy and
completeness. Before the questionnaire is designed, the collection objectives should be defined.
These include:


• clarifying the objectives of the survey


• determining who is to be interviewed
• defining the content
• justifying the content
• prioritizing the data that are to be collected. This is important as it makes it easier to discard items if the survey, once developed, is too lengthy.

Careful consideration should be given to the content, wording and format of the questionnaire, as one of the largest sources of non-sampling error is poor questionnaire design. This error can be minimized by considering the objectives of the survey and the required output, and then devising a list of questions that will accurately obtain the information required.

2.4.2 Content of the questionnaire

Relevant questions

It is important to ask only questions that are directly related to the objectives of the survey, as a means of minimizing the burden placed on respondents. The concept of a fatigue point, which occurs when respondents can no longer be bothered answering questions, should be recognized, and questions designed so that the respondent is through the form before this point is reached.

Towards the end of long questionnaires, respondents may give less thought to their answers and concentrate less on the instructions and questions, thereby decreasing the accuracy of the information they provide. Very long questionnaires can also lead the respondent to refuse to complete the questionnaire. Hence it is necessary to ensure only relevant questions are asked.

Reliable questions

It is important to include questions in a questionnaire that can be easily answered. This objec-
tive can be achieved by adhering to the following techniques.

Appropriate recall If information is requested by recall, the events should be sufficiently recent or familiar to respondents. People tend to remember what they should have done, have selective memories, and may move events into the reference period which actually occurred outside it. Minimizing the need for recall improves the accuracy of response.

Common reference periods To make it easier for the respondent to answer, use reference periods
which match those of the respondent’s records.

Results justify efforts The amount of effort a respondent must go to in order to obtain the data must be worthwhile. It is reasonable to accept a respondent’s estimate when calculating the exact figures would make little difference to the outcome.

Filtering Respondents should not be asked questions they cannot answer. Filter questions should be used to exclude respondents from irrelevant questions.


2.4.3 Types of questions

Factual questions
These questions seek information rather than opinion. For example, respondents could be asked about behaviour patterns (e.g., When did you last visit a General Practitioner?).

Classification or demographic questions


These are used to gain a profile of the population that has been surveyed and provide
important data for analysis.

Opinion questions
Rather than facts, these questions seek opinion. There are many problems associated with
opinion questions:

• a respondent may not have an opinion/attitude towards the subject so the response
may be provided without much thought;
• opinion questions are very sensitive to changes in wording;
• it is impossible to check the validity of responses to opinion questions.

Hypothetical questions
The “What would you do if . . . ?” type of question. The problems with these questions are similar to those of opinion questions: you can never be certain how valid any answer to a hypothetical question is likely to be.

2.4.4 Answer formats

Questions can generally be classified as one of two types, open or closed, depending on the
amount of freedom allowed in answering the question. When deciding which type of question
to use, consideration should be given to the kind of information sought, ease of processing the
response, and the availability of the resources of time, money, and personnel.

Open questions

Open questions allow the respondents to answer the question in their own words. These questions allow any possible answer, and they can collect exact values from a wide range of possible values. Hence, open questions are used when the list of responses is very long and not obvious.

The major disadvantage of open questions is that they are far more demanding than closed questions, both to answer and to process. These questions are most commonly used where a wide range of responses is expected. Also, the answers to these questions depend on the respondents’ ability to write or speak as much as on their knowledge. Two respondents might have the same knowledge and opinions, but their answers may seem different because of their varying abilities.


Examples of question formats:

Open ended:
Which country makes the best cars?
...............................................

Multiple choice:
Which country makes the best cars?
1. USA  2. Germany  3. Japan

Partially closed:
Which country makes the best cars?
1. USA  2. Germany  3. Japan  4. Other (please specify)

Checklist:
From the list provided, indicate which brand/s of cars you have owned.
1. Ford  2. Toyota  3. BMW

Likert scale (opinion):
I believe Japanese cars are less reliable than European cars.
Strongly agree (1)  Agree (2)  No opinion (3)  Disagree (4)  Strongly disagree (5)

Closed questions

Closed questions ask the respondents to choose an answer from the alternatives provided. These questions should be used when the full range of responses is known. Closed questions are far easier to process than open questions. The main disadvantage of closed questions is that the reasons behind a particular selection cannot be determined.

There are a number of types of closed questions.

• Limited choice questions require the respondent to choose one of two mutually exclusive
answers. For example yes/no.
• Multiple choice questions require the respondent to choose from a number of responses
provided.
• Checklist questions allow a respondent to choose more than one of the responses pro-
vided.
• Partially closed questions provide a list of alternatives where the last alternative is “Other,
please specify”. These questions are useful when it is difficult to list all possible choices.
• Opinion (Likert) scale questions seek to locate a respondent’s opinion on a rating scale with a limited number of points. For example, a five-point scale, which measures both strong and weak attitudes, would ask the respondent whether they strongly agree/agree/are neutral/disagree/strongly disagree with a particular statement of opinion, whereas a three-point scale would only measure whether they agree, disagree or are neutral. Opinion scales of this sort are called Likert scales.
Five-point scales are best because:


Response Categories

When questions have categories provided, it is important that every response is catered for.

Number of Categories
The quality of the data can be affected if there are too few categories, as the respondent may have difficulty finding one which accurately describes their situation. If there are too many categories, the respondent may likewise have difficulty choosing among them.

Don’t Know A ‘Don’t Know’ category can be included so respondents are not forced to express attitudes or make decisions that they would not normally make. Excluding the option is not usually good; however, it is hard to predict the effect of including it. The decision of whether or not to include a ‘Don’t Know’ option depends, to a large extent, on the subject matter.
I was gratified to be able to answer promptly, and I did. I said I didn’t know.
Mark Twain, Life on the Mississippi

2.4.5 Wording of questions

Language

Questions which employ complex or technical language or jargon can confuse or irritate respondents. Respondents who do not understand the question may be unwilling to appear ignorant by asking the interviewer to explain it, or, if an interviewer is not present, may not answer or may answer incorrectly.

Ambiguity

If ambiguous words or phrases are included in a question, the meaning may be interpreted differently by different people. This will introduce errors in the data, since different respondents will effectively be answering different questions.

For example, consider “Why did you fly to New Zealand on Qantas?”. Most respondents might
interpret this question as was intended, but it contains three possible questions, so the response
might concern any of these:

• I flew (rather than another mode of travel) because . . .


• I went to New Zealand because . . .
• I selected Qantas because . . .


Double-barrelled questions

When one question contains two concepts, it is known as a double-barrelled question. For
example, “How often do you go grocery shopping and do you enjoy it?”.

Each concept in the question may have a different answer, or one concept may not be relevant,
so respondents may be unsure how to respond. Interpreting the answers to these questions
is almost impossible. Double-barrelled questions should be split into two or more separate
questions.

Leading questions

Questions which lead respondents towards particular answers can introduce error. For example,
the question “How many days did you work last week?”, if asked without first determining
whether respondents did in fact work in the previous week, is a leading question. It implies
that the person would have been at work. Respondents may answer incorrectly to avoid telling
the interviewer that they were not working.

Unbalanced questions
“Are you in favour of euthanasia?” is an unbalanced question because it provides only one al-
ternative. It can be reworded to ‘Do you favour or not favour euthanasia?’, to give respondents
more than one alternative.
Similarly, the use of a persuasive tone can affect the respondent’s answers. Wording should be
chosen carefully to avoid a tone that may produce bias in responses.

Recall/memory error
Respondents tend to remember what should have been done rather than what was done. The
quality of data collected from recall questions is influenced by the importance of the event to
the respondent and the length of time since the event took place. Subjects of greater interest or
importance to the respondent, or events which happen infrequently, will be remembered over
longer periods and more accurately. Minimizing the recall period also helps to reduce memory
bias.
Telescoping is a specific type of memory error: the respondent reports events as occurring
earlier or later than they actually occurred, for example by including details of an event which
actually occurred outside the specified reference period.

Sensitive questions
Questions on topics which respondents may see as embarrassing or highly sensitive can pro-
duce inaccurate answers. If respondents are required to give information that might seem
socially undesirable, they may provide the interviewer with responses they believe are more
‘acceptable’. If such questions are placed at the beginning of the questionnaire, they could lead
to non-response if respondents are unwilling to continue with the remaining questions.
For example, “Approximately how many cans of beer do you consume each week, on aver-
age?”
1. None
2. 1–3 cans
3. 4–6 cans
4. More than 6
A respondent might choose response 2 or 3 rather than admit to consuming the greatest quan-
tity on the scale. Consider extending the range of choices far beyond what is expected; heavy
consumers can then select an answer closer to the middle and feel within the normal range.

In 1980, the New York Times CBS News Poll asked a random sample of Americans
about abortion. When asked “Do you think there should be an amendment to the
Constitution prohibiting abortions, or shouldn’t there be such an amendment?”,
29% were in favour and 62% were opposed. The rest of the sample were uncer-
tain. The same people were later asked a different question: “Do you believe there
should be an amendment to the Constitution protecting the life of the unborn child,
or shouldn’t there be such an amendment?” Now 50% were in favour and only
39% were opposed.

Acquiescence

This situation arises when there is a long series of questions for which respondents answer
with the same response category. Respondents get used to providing the same answer and
may answer inaccurately.

2.4.6 Questionnaire format

Including an introduction

It can be advantageous to include an introductory statement or explanation at the beginning of
a survey. The introduction may include such information as the purpose of the survey or the
scope of collection. It will aid the respondent when answering the questions if they know why
the information is being sought, and it gives the respondent a context in which to frame his or
her answers. An assurance of confidentiality will give respondents confidence that the results
will not be obtained by unwanted parties.

Question and page numbers

To ensure that the questionnaire can be easily administered by interviewers or respondents, the
pages of the questionnaire and the questions should be numbered consecutively with a simple
numbering system. Question numbers provide sign-posts along the way: they help if remedial
action is required later and you want to refer the interviewer or respondent back to a particular
place.

Sequencing

The questions in a questionnaire should follow a logical order and flow smoothly from one
question to the next. The questionnaire layout should have the following characteristics.


Related questions grouped

Questions which are related should be grouped together and, where necessary, placed into
sections. Sections should contain an introductory heading or statement.

If possible, question ordering should try to anticipate the order in which respondents
will supply information. It is a sign of good survey design when a question not only
prompts its own answer but also prepares the respondent for a question that follows
shortly after.

Question ordering
It is important to be aware that earlier questions can influence the responses of later ques-
tions, so the order of questions should be carefully decided. In attitudinal questions, it
is important to avoid conditioning respondents in an early question which could then
bias their responses to later questions. For example, you should ask about awareness of
a concept before any other mention of the concept.

Respondent motivation

Whenever possible, start the questionnaire with easy and pleasant questions to promote inter-
est in the survey and give the respondent confidence in their ability to complete the survey.
The opening questions should ensure that the particular respondent is a member of the survey
population.

Questions that are perceived as irritating or obtrusive tend to get a low response rate and
may effectively trigger a refusal from the respondent. These questions need to be carefully
positioned in a questionnaire where they are least likely to be sensitive.

It is also important that respondents are only asked relevant questions; they may become
annoyed and lose interest if this does not occur. Include filter questions to direct respondents
past questions which do not apply to them. Filter questions often identify sub-populations.
For example,

  “Do you usually speak English at home?”    Yes (Go to Q34)    No (Go to Q10)

Questionnaire layout

The questionnaire layout should be aesthetically pleasing, so the layout does not contribute to
respondent fatigue. Things that can interfere with the answering of a questionnaire are: unclear
instructions and questions, insufficient space to provide answers, hard-to-read text, difficulty
in understanding language, back-tracking through the form. Many of these things are bad form
design and are avoidable.

Only include essentials on the questionnaire form. Keep the amount of ink on the form to the
minimum necessary for the form to work properly. Anything unnecessary adds to respondent
fatigue, to the detriment of data quality.


General layout

Consistency of layout: If consistency and logical patterns are introduced into the form design, it
eases the form filler’s task. Patterns that can be useful are:

• white spaces for responses
• using the same question type throughout the form
• using the same layout throughout the form
• using a different style, consistently, for instructions or directions.

Type Size: A font size between 10 and 12 is considered the best in most circumstances. If the
respondent does not have perfect vision, or ideal working conditions, small fonts can
cause problems.

Use of all upper-case text: It is best to avoid upper-case text. Upper-case text has been shown
to be hard to read, especially where large amounts of text are involved: words lose their
shape, becoming rectangles. Upper case should be reserved for titles or for emphasis,
but this can often be done just as well using other methods, such as bold, italics, or a
slightly larger type size.

Line length: As the eye has a clear focus range of only a few degrees, lines should be kept short.
It takes several eye movements to scan a line of text; if more than two or three such
movements are needed, the eye can become fatigued and tends to lose track of which
line it is reading. This leads to backtracking or misinterpretation.

Character and line spacing: It is very important to leave enough space on a form for answers.
Research has shown that forms requiring handwritten responses need a distance of
7–8mm between lines and a 4–5mm width for each possible character.

Response layout

Obtaining responses: A popular way of obtaining responses is using tick boxes. However, it is
usually preferable to use a labelled list (e.g., a, b, c, . . . ) and ask respondents to circle their
response. This makes coding and data entry easier.

If a written response is required it is best to provide empty answer spaces, with lines
made up of dots.

Positioning of responses: Vertical alignment of responses is preferred to horizontal alignment.
It is easier to read up and down a list, and select the correct box, than to read across the
page and locate an item in a horizontal string. Captions to the left of the answer box are
easier for respondents to complete.

Order of response options: The order of response options is important as it can be a source of
bias. The options presented first may be selected because they make an impact on re-
spondents, or because respondents lose concentration and do not hear or read the re-
maining options. The last options may be chosen because they are most easily recalled,
particularly if respondents are faced with a long list of options. Long or complex response
options may also make recall more difficult and increase the effects due to the order of
response options.

Prompt card: If the questionnaire is interviewer based, and a number of response options are
given for some questions, then a prompt card may be appropriate. A prompt card is a list
of possible responses to a question, displayed on a separate card which the interviewer
shows to assist respondents. This helps to decrease error resulting from respondents
being unable to remember all the options read out. However, respondents with poor
eyesight, migrants with limited English or adults with literacy problems may have
difficulty answering accurately.

Exercise 2: (Case 2) The questionnaire on pages 47–48 was an early draft of the
questionnaire prepared by the client. The questionnaire on pages 49–51 is a
later draft of the questionnaire after I had provided the client with some advice.
See if you can determine why each of the changes has been made. How could
you further improve the questionnaire?

2.4.7 Pretesting the questionnaire

A pretest of a questionnaire should be considered mandatory. Although the designer of the
questionnaire will have reviewed the draft meticulously on all points of good design, it is
still likely to contain faults. Normally, a number of these emerge only when the form is used
in the field, because the researcher did not completely anticipate what would take place. The
only way these faults can be fully detected is by actually administering the survey to the types
of respondents who would be sampled in the study.

Each type of testing is used at a different stage of survey development and aims to test different
aspects of the survey.

Skirmishing
Skirmishing is the process of informally testing questionnaire design with groups of re-
spondents. The questionnaire is basically unstructured and is tested with a group of
people who can provide feedback on issues such as each question’s frame of reference,
the level of knowledge needed to answer the questions, the range of likely answers to
questions and how answers are formulated by respondents. Skirmishing is also used to
detect flaws or awkward wording of questionnaires as well as testing alternative designs.
At this stage we may use open-ended response categories to work out likely responses.
The questionnaire should be redrafted after skirmishing.

Focus groups
A skirmish tests the questionnaire design against general respondents whilst focus groups
concentrate on a specific audience. For example, a survey studying the effects of living
on unemployment benefits could have a group of unemployed people as a focus group.

A focus group can be used to test questions directed at small sub-populations. For ex-
ample, if we were looking at community services we may have a filter question to target
disabled people. Since there may not be many disabled people in the sample, we need to
test the questions on a focus group of disabled people, which is a deliberately biased sample.


Observational studies
Respondents complete a draft questionnaire in the presence of an observer during an
observational study. Whilst completing the form the respondents explain their under-
standing of the questions and the method required in providing the information. These
studies can be a means of identifying problem questions through observations, questions
asked by the respondents, or the time taken to complete a particular question. Data avail-
ability and the most appropriate person to supply the information can also be gauged
through observational studies. It is the form that is being tested, not the respondent, and
this should be stressed to the respondent.

Pilot testing
Pilot testing involves formally testing a questionnaire or survey with a small represen-
tative sample of respondents. Semi-closed questions are usually used in pilot testing to
gather a range of likely responses which are used to develop a more highly structured
questionnaire with closed questions. Pilot testing is used to identify any problems asso-
ciated with the form, such as questionnaire format, length, question wording and allows
comparison of alternative versions of a questionnaire.

2.5 Data processing

Data processing involves translating the answers on a questionnaire into a form that can be
manipulated to produce statistics. In general, this involves coding, editing, data entry, and
monitoring the whole data processing procedure. The main aim of checking the various stages
of data processing is to produce a file of data that is as error free as possible.

2.5.1 Data coding

Up to this point, the questionnaire has been considered mainly as a means of communication
with the respondent. Just as important, the questionnaire is a working document for the trans-
fer of data on to a computer file. Consequently it is important to design the questionnaire to
facilitate data entry.

Unless all the questions on a questionnaire are “closed” questions, some degree of coding is
required before the survey data can be entered. The appropriate codes should be devised
before the questionnaires are processed, and are usually based on the results of pretesting.

Coding consists of labelling the responses to questions (using numerical or alphabetic codes) in
order to facilitate data entry and manipulation. Codes should be simple and easy to apply. For
example, if Question 1 has four responses then those four responses could be given the codes
a, b, c, and d. The advantage of coding is the compact storage of data as short codes rather
than lengthy verbal descriptions, which are difficult to categorize.

Coding is relatively expensive in terms of resource effort, although automated techniques are
continually being developed to reduce this cost. Other options include self-coding, where
respondents record the appropriate code themselves, or having the interviewer perform the
coding task.

Before the interviewing begins, the coding frame for most questions can be devised. That is, the
likely responses are obvious from previous similar surveys or thorough pilot testing, allowing
those responses and relevant codes to be printed on the questionnaire. An “Other (Please
Specify)” answer code is often added to the end of a question with space for interviewers to
write the answer. The standard instruction to interviewers in doubt about any precodes is that
they should write the answers on the questionnaire in full so that they can be dealt with by a
coder later.
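For those using R (see Section 4.2.5), here is a minimal sketch of how coded categorical
answers are typically stored and tabulated. The responses are made up for illustration; the
factor levels play the role of the coding frame.

    # Hypothetical responses; the factor levels define the coding frame.
    responses <- c("Yes", "No", "Yes", "Don't know", "No")
    coded <- factor(responses, levels = c("Yes", "No", "Don't know"))
    as.integer(coded)   # the stored codes: 1 2 1 3 2
    table(coded)        # frequency of each response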

2.5.2 Data entry

Ensure that the questionnaire is designed so data entry personnel have minimal handling of
pages. For example, all codes should be on the left (or right) hand side of the page. It is
advisable to use trained data entry people to enter the data. It is quicker and more reliable and
therefore more cost effective.

2.6 Sampling schemes

When you have a clear idea of the aims of the survey and the data requirements, the degree of
accuracy required, and have considered the resources and time available, you are in a position
to make a decision about the size and the form of collection of sampling units.

The two qualities most desired in a sample (besides that of providing the appropriate findings),
are its representativeness and stability. Sample units may be selected in a variety of ways. The
sampling schemes fall into two general types: probability and non-probability methods.

2.6.1 Non-probability samples

If the probability of selection for each unit is unknown, or cannot be calculated, the sample is
called a non-probability sample. For non-probability samples, since there is no control over rep-
resentativeness of the sample, it is not possible to accurately evaluate the precision of estimates
(i.e., closeness of estimates under repeated sampling of the same size). However, where time
and financial constraints make probability sampling infeasible, or where knowing the level of
accuracy in the results is not an important consideration, non-probability samples do have a
role to play. Non-probability samples are inexpensive, easy to run and no frame is required.
This form of sampling is popular amongst market researchers and political pollsters as a lot of
their surveys are based on a pre-determined sample of respondents of certain categories.

One common method of non-probability sampling is voluntary response polling. A general
appeal is made (often via television) for people to contact the researcher with their opinion.
Voluntary response samples are rarely useful because they over-represent people with strong
opinions, most often negative opinions.


2.6.2 Probability sampling schemes

Probability sampling schemes are those in which the population elements have a known chance
of being selected for inclusion in a sample. Probability sampling rigorously adheres to a pre-
cisely specified system that permits no arbitrary or biased selection. There are four main types
of probability sampling schemes.

Simple Random Sample: If a sample of size n is drawn from a population of size N in
such a way that every possible sample of size n has the same chance of being selected,
the sampling procedure is called simple random sampling. The sample thus obtained
is called a simple random sample. This is the simplest form of probability sample to
analyse.

Stratified Random Sample: A stratified random sample is one obtained by separating the pop-
ulation elements into non-overlapping groups, called strata, and then selecting a simple
random sample from each stratum. This can be useful when a population is naturally
divided into several groups. If the results on each stratum vary greatly, then it is possi-
ble to obtain more efficient estimators (and therefore more precise results) than would be
possible without stratification.

Systematic Sample: A sample obtained by randomly selecting one element from the first k el-
ements in the frame and every kth element thereafter is called a 1-in-k systematic sample,
with a random start. This is obviously a simple method if there is a list of elements in
the frame. Systematic sampling provides better results than simple random sampling
when the variance within the systematic samples is larger than the variance of the
population as a whole. This can occur when the frame is ordered.

Cluster Sample: A cluster sample is a probability sample in which each sampling unit is a
collection, or cluster, of elements. The population is divided into clusters and one or
more of the clusters is chosen at random and sampled. Sometimes the entire cluster is
sampled; on other occasions a simple random sample of the chosen clusters is taken.
Cluster sampling is usually done for administrative convenience, and is especially useful
if the population has a hierarchical structure.

A comparison of these four sampling schemes appears in the table below.

Example (Case 2): A few years ago, I advised the Department of Health and Com-
munity Services on a survey of palliative care patients in Victoria.
Objective: To estimate the proportion of palliative care patients in Vic-
torian hospitals.
Difficulties: What is a “palliative care patient”? Proportion of what?
Target population: Patients in acute beds at the time of the survey?
Survey population: All patients in acute beds in Victorian hospitals except for
very small (< 10 bed) country hospitals.
Sampling scheme: Stratified (hospital types) and clustered (hospitals). Ran-
dom selection of hospitals within each stratum. Total cover-
age of patients in the selected hospitals.
Sample: All patients in the 18 hospitals selected out of 115 hospitals
in Victoria.


Scheme: Simple Random Sample
  How to select the sample: Assign numbers to the elements. Use a random number table
  or random number generator to select the sample.
  Strengths/Weaknesses:
  • The basic building block in sampling.
  • Simple, but often costly.
  • Cannot use unless we can assign a number to each element in a target population.

Scheme: Stratified Sample
  How to select the sample: Divide the population into groups that are similar within and
  different between on the variable of interest. Use random numbers to select the sample
  from each stratum.
  Strengths/Weaknesses:
  • With proper strata, can produce very accurate estimates.
  • Less costly than simple random sampling.
  • Must stratify the target population correctly.

Scheme: Systematic Sample
  How to select the sample: Select every kth element from a list after a random start.
  Strengths/Weaknesses:
  • Produces very accurate estimates when elements in a population exhibit order.
  • Used when simple random or stratified sampling is impractical: e.g., the population
    size is not known.
  • Simplifies the selection process.
  • Do not use with periodic populations.

Scheme: Cluster Sample
  How to select the sample: Randomly choose clusters and sample all elements within each
  cluster.
  Strengths/Weaknesses:
  • With proper clusters, can produce very accurate estimates.
  • Useful when a sampling frame is unavailable or travel costs are high.
  • Must cluster the target population correctly.
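The selection step of each scheme is easy to express in software. The following R sketch draws
each type of sample from a hypothetical frame of N = 1000 numbered elements; the frame,
strata and clusters are all invented for illustration.

    N <- 1000; n <- 50
    frame <- 1:N

    srs <- sample(frame, n)                  # simple random sample

    # Stratified: two strata of 600 and 400, proportional allocation.
    stratum <- rep(c("A", "B"), times = c(600, 400))
    strat <- c(sample(frame[stratum == "A"], 30),
               sample(frame[stratum == "B"], 20))

    # Systematic: 1-in-k sample with a random start.
    k <- N / n
    sys <- seq(from = sample(1:k, 1), to = N, by = k)

    # Cluster: 100 clusters of 10; choose 5 clusters, take all elements.
    cluster <- rep(1:100, each = 10)
    clus <- frame[cluster %in% sample(1:100, 5)]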


Exercise 3: Consider the four cases listed in the Appendix. What sampling scheme
was used in each case? Why were these schemes used?

2.7 Scale development

With Likert scale data, it is common to construct a new numerical variable by summing the
values of questions on a related topic (treating the answers as numerical scores from 1–5). This
forms a “measure” or “scale” for the underlying “construct”.

More sophisticated means of deriving scales are possible. One common approach is to use
Factor Analysis (discussed in Section 8.1).
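As a concrete sketch of the simple summing approach in R (the item scores below are
invented), the scale is just the row sum of the items for each respondent:

    # Hypothetical Likert items q1-q4, scored 1-5, three respondents.
    items <- data.frame(q1 = c(5, 2, 4), q2 = c(4, 2, 5),
                        q3 = c(5, 1, 4), q4 = c(4, 2, 5))
    scale_score <- rowSums(items)   # one scale value per respondent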

2.7.1 Validity

A valid measure is one that measures the thing it is intended to measure.

E XAMPLE :

• A study compares job-satisfaction of people over time and finds it is declining. Does that
mean poor management is leading to declining job satisfaction?
• How would you construct a valid study which enables the measurement of the effect of
management on job satisfaction?
• How do you measure workplace harmony? Is frequency of arguments a valid measure?
• Are the results of a study in your company generalizable to other companies?
• How would you construct a valid study of this issue which applies to other companies?

2.7.2 Reliability

A reliable measure is one that gives the same ‘reading’ when used on repeated occasions.

• A measure is reliable but not valid if it is consistently wrong. e.g., survey on alcohol
intake.
• A measure is valid but unreliable if it sometimes measures the thing of interest, but not
always. e.g., survey on sexual experience.


Appendix B: Case studies

Case 1: Saulwick Poll

This appeared in The Age, 1 January 1990.

[The newspaper clipping was reproduced here in the original document.]

Case 2: Palliative care patients survey

This survey was designed to estimate the number of palliative care patients in Victorian hos-
pitals. A palliative care patient was defined as a patient who was terminally ill and whose
life expectancy was less than 6 months. The Department of Health and Community Services
did not know how many patients were in this category, but a previous survey in another
state indicated the proportion might be about 12%. The hospitals in Victoria were divided
into eight groups: metropolitan teaching, metropolitan large non-teaching, metropolitan small
non-teaching, country base, large country, small country, metropolitan extended care, country
extended care. These eight hospital types included 115 Victorian public hospitals.

Within each group of hospitals, one or more were selected at random for the sample. Eighteen
hospitals in total were sampled. For each hospital surveyed, the number of palliative care pa-
tients was recorded. From this information, the proportion of hospital patients in Victoria who
could be classified as “palliative care” patients was estimated. The final estimated proportion
was about 4.5%.

[The draft and revised questionnaires referred to in Exercise 2 were reproduced here in the
original document (pages 47–51).]

Case 3: Survey of frequency of needlestick injuries

This survey was conducted by a company who had designed and marketed health safety prod-
ucts including needle protectors. As part of their marketing, they were interested in the fre-
quency and severity of needlestick injuries amongst health workers. The survey was conducted
in seven Australian cities over a one week period. The sample consisted of 56 staff members of
the Red Cross Transfusion Services and 136 nursing staff in 25 Australian haemodialysis units.
All staff who worked during the survey week were included in the sample. Each filled in a
questionnaire.

Case 4: Church “Life Survey” of members’ opinions

The Catholic Church Life Survey is a collection of 25 separate questionnaires designed to collect
information about the opinions and characteristics of the Catholic church’s clergy and mem-
bership. Each diocese in Australia was surveyed. Within each diocese there are both urban and
rural parishes. A sample of urban parishes was surveyed and a sample of rural parishes was
surveyed within each diocese. For those parishes surveyed, a random sample consisting of 2/3
of the members who attended on the day of the survey completed the main questionnaire.

CHAPTER 3: Data summary

Recall: Types of data

The ways of organizing, displaying and analysing data depend on the type of data
we are investigating.
• Categorical Data (also called nominal or qualitative)
  e.g., sex, race, type of business, postcode. Averages don’t make sense.
  Ordered categories are called ordinal data.
• Numerical Data (also called scale, interval and ratio)
  e.g., income, test score, age, weight, temperature, time.
  Averages make sense.
Note that we sometimes treat numerical data as categories (e.g., three age groups).


3.1 Summarising categorical data

3.1.1 Percentages and frequency tables

Example: Causes of death

Deaths in 1979 for 20–25 year olds in Australia.


Cause Males Females Totals Percentage
Motor vehicle accidents 540 132 672 47.5%
All other accidents 197 43 240 16.9%
Suicide 149 48 197 13.9%
Diseases 78 52 130 9.2%
Neoplasms 48 36 84 5.9%
All other causes 56 37 93 6.6%
Totals 1068 348 1416 100.0%
This is a contingency table or frequency table or two-way table.

3.1.2 Bar charts

Pie chart shows proportion of observations in each category by angle of each segment; quite
poor at communicating the information.

Bar chart shows number of observations in each category by length of each bar. Much easier
to see differences.

For one categorical variable: use a bar chart:


[Pie chart and bar chart of the causes-of-death percentages from the table above. The bar
chart, with causes on one axis and percentage (0–40) on the other, makes the comparisons
much easier to see.]

• It is harder to make comparisons with the pie chart


• It is harder to estimate percentages with the pie chart
• Labelling is messier with the pie chart
• The pie chart shows “parts of a whole” better
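As a sketch, both charts can be reproduced in R from the counts in the causes-of-death table:

    deaths <- c("Motor vehicle accidents" = 672, "All other accidents" = 240,
                "Suicide" = 197, "Diseases" = 130, "Neoplasms" = 84,
                "All other causes" = 93)
    barplot(sort(100 * deaths / sum(deaths)), horiz = TRUE, las = 1,
            xlab = "Percentage of deaths")
    pie(deaths)   # the same data as a (less effective) pie chart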

3.1.3 Barcharts with two variables


[Two bar charts of the same table: “Sex by cause of death”, with one stacked bar per sex
(counts 0–1000), and “Cause of death by sex”, with one bar per cause split by sex
(counts 0–600).]


[A third bar chart, “Cause of death by sex”, with side-by-side bars for females and males
(counts 0–500).]

3.2 Summarizing numerical data

3.2.1 Percentiles

• Example: the 90th percentile is the point where 90% of data lie below that point and 10%
of data lie above that point.
• The median is the 50th percentile. It is sometimes labelled Q2.
• The median is the middle measurement when the measurements are arranged in order.
If there are an even number of measurements, it is the average of the middle two.
• The quartiles are the 25th and 75th percentiles. They are often labelled Q1 and Q3.
• The interquartile range is Q3−Q1.

Example: Letter recognition scores

Scores from letter recognition test conducted on 30 six-year-old girls.


0 0 0 0 0 0 1 1 1 2 2 2 3 3 3 3 3 4 4 4 4 5 5 6 7 7 8 12 13 20

Percentile   25%   50%   75%
              1     3    5.1

3.2.2 Five number summary

Minimum   Q1   Median   Q3    Maximum
   0       1     3      5.1      20

• 25% of data are between min and Q1


• 25% of data are between Q1 and median
• 25% of data are between median and Q3
• 25% of data are between Q3 and maximum
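These summaries are quick to compute in R. Note that packages use slightly different
percentile conventions, so the quartiles may differ a little from those quoted above.

    # Letter recognition scores (from the example above).
    scores <- c(0,0,0,0,0,0,1,1,1,2,2,2,3,3,3,3,
                3,4,4,4,4,5,5,6,7,7,8,12,13,20)
    quantile(scores, c(0.25, 0.5, 0.75))   # quartiles
    summary(scores)    # min, Q1, median, mean, Q3, max
    IQR(scores)        # interquartile range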


3.2.3 Outliers

One definition of an outlier: Any point which is more than 1.5(IQR) above Q3 or more than 1.5(IQR)
below Q1.
Don’t delete outliers. Investigate them!

Example: Letter recognition scores

For letter recognition data: Q1 = 1, Q3 = 5.1, IQR = 5.1 − 1 = 4.1.
• So observations above 5.1 + 1.5(4.1) = 11.25 are outliers.
• Observations below 1 − 1.5(4.1) = −5.15 are outliers. Of course, this can’t
  happen.

Example: Air accidents

Number of airline accidents for 17 Asian airlines for 1985–1994. Source: Newsday
(1995).
Accidents Airline
0 Air India (India)
0 Air Nippon (Japan)
0 All Nippon (Japan)
1 Asiana (South Korea)
0 Cathay Pacific (Hong Kong)
1 Garuda (Indonesia)
5 Indian Airlines (India)
1 Japan Airlines (Japan)
0 Japan Air System (Japan)
1 Korean Air Lines (South Korea)
0 Malaysia Air Lines (Malaysia)
10 Merpati (Indonesia)
0 Air Niugini (Papua New Guinea)
3 Philippine Air Lines (Philippines)
3 PIA (Pakistan)
0 SIA (Singapore)
1 Thai Airways (Thailand)

Percentile 25% 50% 75%


.... .... ....
IQR =
Outliers:


3.2.4 Boxplots

Graphical representation
of five number summary.
Outlier

Maximum when
outliers omitted

Q3 (upper quartile)
= 75th percentile

Median

Q1 (lower quartile)
= 25th percentile

Minimum when outliers


omitted
20
15
10
5
0

B5 G5 B6 G6 B7 G7 B8 G8 B9 G9 B 10G 10
Box plots of letter recognition scores in each age/sex group


3.2.5 Histograms

Useful for showing shape of distribution of a numerical variable.

[Histogram of letter recognition scores: Score (0–20) on the horizontal axis, Number of girls
on the vertical axis.]

3.2.6 Measures of location

Average (mean)

The average is the sum of the measurements divided by the number of measurements. It is
usually denoted by x̄.

Suppose we have n observations and let x1 denote the first observation, x2 the second, and so
on up to xn. Then the sample mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

This is the most widely used measure of the centre of the data set, and it has good arithmetic
properties. But it does have the drawback of being influenced by extreme values (“outliers”).

Trimmed mean

Mean of the data when the smallest and largest 5% of values are omitted.

• The trimmed mean is more resistant to outliers than the mean.
• The median is the most resistant to outliers.

Example: Letter recognition scores

Mean Trimmed mean Median


4.10 3.68 3

Example: Air accidents

Mean Trimmed mean Median


1.53 1.07 1
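Continuing with the scores vector defined earlier, the location measures in the letter
recognition table can be computed in R as:

    mean(scores)               # 4.1
    mean(scores, trim = 0.05)  # trimmed mean: about 3.68
    median(scores)             # 3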


QUIZ

• True or False?
1. The median and the average of any data set are always close together
2. Half of a data set is always below average.
3. With a large sample, the histogram is bound to follow the normal curve
quite closely.
• In a study of family incomes, 1000 observations range from $12,400 a year
to $132,800 a year. By accident, the highest income gets changed to
$1,328,000.
1. Does this affect the mean? If so by how much?
2. Does this affect the median? If so by how much?

3.2.7 Measures of spread

Range

The range is the difference between the maximum and minimum. However, it is not a good
measure of spread since it is generally larger when more data are collected and it is sensitive to
outliers.

Interquartile range

The interquartile range (IQR) is the difference between the upper and lower quartiles: Q3 − Q1.

Example: Letter recognition scores

Range = 20 - 0 = 20 IQR = 5.1 - 1 = 4.1.

Example: Air accidents

Range = IQR =

Variance and Standard deviation

The variance is based on the deviations from the mean, i.e., the differences between the indi-
vidual values and the mean of those values, represented by (xi − x̄). Obviously, if these were
simply added, or averaged, we would always end up with zero. Therefore, we make all the
deviations positive by squaring them, and then average the squared deviations. This is known
as the variance.


The variance of n observations x1, x2, . . . , xn is
$$s^2 = \frac{1}{n-1}\left[(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2\right]
     = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2.$$

Note that this is not quite the average of the squared differences from the mean.

We use n − 1 instead of n, as dividing by n tends to underestimate the ‘true’ value. Dividing
by n − 1 eliminates this problem.

The variance is in squared units, so by taking the square-root of the variance, we have a mea-
sure of dispersion that is in the same units of measurement as the original variable. This is
called the standard deviation, and is denoted by:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}.$$

[Four histograms of samples with standard deviations 1, 5, 10 and 100, showing how the
spread of the values increases with the standard deviation.]

Example: Letter recognition scores

$$s^2 = \frac{1}{29}\left[(0-4.1)^2 + (0-4.1)^2 + \cdots + (20-4.1)^2\right] = 20.02$$
$$s = \sqrt{20.02} = 4.47.$$


Example: Air accidents

$$s^2 = \frac{1}{16}\left[(0-1.53)^2 + (0-1.53)^2 + \cdots + (10-1.53)^2\right] = 6.765$$
$$s = \sqrt{6.765} = 2.60.$$
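In R (again using the scores vector defined earlier), var() and sd() apply the n − 1 formulas
above directly:

    var(scores)   # about 20.02
    sd(scores)    # about 4.47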

• s > 0 unless all observations have the same value, in which case s = 0.


• s is not a resistant measure of spread.
• For many data sets, the standard deviation is approximately 0.6 of the IQR.
• Approximately 95% of observations usually fall within 2 standard deviations of the mean.
• For small data sets (less than 50 points), the standard deviation is about one quarter of
the range.

3.2.8 Basic statistics commands in Excel

• Mean: =AVERAGE(A1:A20)

• Standard deviation: =STDEV(A1:A20)

• Median: =MEDIAN(A1:A20)

• 75th percentile: =PERCENTILE(A1:A20,0.75)


3.3 Summarising two numerical variables

3.3.1 Scatterplots

Scatterplots are good at graphically displaying the relationship between two numerical vari-
ables.

[Scatterplot: spotting bivariate outliers.]


3.3.2 Correlation

The Pearson correlation coefficient is a measure of the strength of the linear relationship be-
tween two numerical variables.

It is calculated by
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

where sx is the sample standard deviation of the x observations and sy is the sample standard
deviation of the y observations.

• The value of r always lies between −1 and 1.


• Positive r indicates positive association between the variables and negative r indicates
negative association.
• The extreme values r = −1 and r = 1 only occur when the data lie exactly on a straight
line.
• If the variables have a strong non-linear relationship, r may be small. Always plot the
graph.

r2 : a useful interpretation

The squared correlation, r2 , is the fraction of the variation in the y values that
is explained by the linear relationship.

High correlation does not prove causality


[Eight example scatterplots with correlations of −0.99, −0.75, −0.5, −0.25, 0.99, 0.75, 0.5
and 0.25.]


[Four scatterplots of four different (x, y) data sets which look strikingly different from one
another.]

In every case: n = 11, x̄ = 9.0, ȳ = 7.5 and r = 0.82!
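These summary statistics match Anscombe's famous quartet, which happens to be built into R
as the anscombe data set, so the point is easy to verify yourself:

    # Correlation is about 0.816 in all four data sets,
    # despite the very different scatterplots.
    sapply(1:4, function(i)
      cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
    plot(anscombe$x1, anscombe$y1)   # always plot the data!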

3.3.3 Spearman’s rank correlation

• Same as ordinary correlation but applied to ranks or ordinal data.
• Often used with Likert scales.
• The main difference is in the computation of p-values.
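In R, Spearman's rank correlation is the same function with a different method argument
(shown here on the built-in anscombe data, purely for illustration):

    cor(anscombe$x1, anscombe$y1, method = "spearman")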

3.4 Measures of reliability

In many questionnaires, there are several questions that are designed to measure the same
thing (sometimes called a “construct”). The answers to the questions are often added together
to provide an overall “scale” which gives a single measure of the construct.

In these circumstances, it is useful to judge how closely the results from the questions are re-
lated to each other. This is called “internal consistency reliability”.

For example, suppose we constructed a questionnaire to measure people's level of job-satisfaction.
We could provide several statements and ask respondents to give an answer on a 5-point Likert
scale (1=Strongly agree, 2=Agree, 3=Neutral, 4=Disagree, 5=Strongly disagree):

1. I look forward to going to work each day.


2. I feel I am engaged in work which is useful to my employer.
3. Staff morale in my workplace is generally high.
4. My employer treats me well.


Internal consistency reliability involves seeing how closely the answers to these questions (or
“items”) are related.

There are a range of internal consistency measures that can be used.

Average inter-item correlation

We can look at the correlation between any pair of items which are supposed to be measuring
the construct.

The average inter-item correlation is the average of all correlations between the pairs of items.

Split-half reliability

Here we randomly divide all items that are intended to measure the construct into two sets.
The total score for each set of items is then computed for each person. The split-half reliability
is the correlation between these two total scores.

Cronbach’s alpha

Cronbach’s alpha is the average of all split-half estimates. That is, if we computed all possible
split-half reliabilities (by computing it on all possible divisions of items), and averaged the
results, we would have Cronbach’s alpha.

In practice, there is a quicker way to compute it than actually doing all these split-half estimates.

Suppose there are k items, let si be the standard deviation of the answers to the ith item, and
let s be the standard deviation of the totals formed by summing all the items for each person.
Then Cronbach's alpha can be calculated as follows:
$$\alpha = \frac{k}{k-1}\left(1 - \frac{1}{s^2}\sum_{i=1}^{k} s_i^2\right).$$

How large is good enough? Some books suggest that α > 0.7 is necessary to have a reliable
scale. I think this is an arbitrary figure, but it gives you some idea of what is expected.
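The quick formula translates directly into R. A minimal sketch, where items is a data frame
with one column per item:

    cronbach_alpha <- function(items) {
      k <- ncol(items)                    # number of items
      item_vars <- sapply(items, var)     # s_i^2 for each item
      total_var <- var(rowSums(items))    # s^2 of the summed scale
      (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    }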


Example 1

Item scores (one row per item; columns are respondents 1–20):
Q1: 5 2 2 3 2 5 2 3 3 1 1 2 3 1 3 5 1 1 1 4
Q2: 2 3 3 3 1 5 1 2 3 2 3 2 3 1 2 5 1 4 3 5
Q3: 2 1 3 5 2 2 2 2 5 5 3 4 5 1 4 3 1 1 1 4
Q4: 5 1 1 1 2 5 2 1 1 2 1 2 3 3 4 5 1 2 1 3

Correlation matrix:
      Q1     Q2     Q3     Q4
Q1  1.000  0.521  0.250  0.726
Q2  0.521  1.000  0.182  0.328
Q3  0.250  0.182  1.000  0.029
Q4  0.726  0.328  0.029  1.000

Average inter-item correlation: 0.339
Cronbach's alpha: 0.664

Example 2

Item scores (one row per item; columns are respondents 1–20):
Q1: 1 2 3 5 1 4 4 4 5 5 4 2 3 3 2 4 5 3 5 2
Q2: 1 3 2 5 2 4 3 2 5 5 3 2 3 3 3 5 4 3 5 2
Q3: 1 2 4 5 1 2 5 3 5 5 4 3 2 5 2 5 5 3 5 1
Q4: 1 2 3 3 1 1 4 5 5 5 5 2 2 5 4 5 5 3 4 1

Correlation matrix:
      Q1     Q2     Q3     Q4
Q1  1.000  0.819  0.826  0.684
Q2  0.819  1.000  0.697  0.515
Q3  0.826  0.697  1.000  0.813
Q4  0.684  0.515  0.813  1.000

Average inter-item correlation: 0.725
Cronbach's alpha: 0.910

Example 3

Item scores (one row per item; columns are respondents 1–20):
Q1: 2 4 2 1 2 4 1 5 4 4 2 5 2 5 2 4 4 5 2 4
Q2: 2 4 2 1 2 3 1 3 4 5 2 5 2 5 2 2 4 5 2 4
Q3: 2 4 2 1 2 4 1 5 4 4 2 5 2 5 3 4 5 5 2 5
Q4: 2 4 2 1 1 4 1 5 4 4 2 3 2 5 2 4 4 5 2 4

Correlation matrix:
      Q1     Q2     Q3     Q4
Q1  1.000  0.874  0.968  0.939
Q2  0.874  1.000  0.864  0.795
Q3  0.968  0.864  1.000  0.921
Q4  0.939  0.795  0.921  1.000

Average inter-item correlation: 0.894
Cronbach's alpha: 0.971


3.5 Normal distribution

Often a set of data, or some statistic calculated from the data, is assumed to follow a normal
distribution. Data which are normally distributed have a histogram with a symmetric bell
shape.

[Figure: bell-shaped curve with horizontal axis marked µ−3σ, µ−2σ, µ−σ, µ, µ+σ, µ+2σ, µ+3σ.]

3.5.1 Parameters

The normal distribution is the basis of many statistical methods. It can be specified by two
parameters:

1. the mean µ (which determines the centre of the bell); and

2. the standard deviation σ (which determines the spread of the bell).

If we call the variable Y, we write Y ∼ N(µ, σ²). We use the probability model to draw conclu-
sions about future observations.

Mean µ: The mean µ is the average of measurements taken from the entire population (rather
than just a sample). We usually denote this by µ to distinguish it from the sample mean x̄. The
sample mean is often used as an estimate of µ.

Standard deviation σ: The standard deviation σ is defined similarly. It is denoted by σ to
distinguish it from the sample standard deviation, s. The sample standard deviation is often
used as an estimate of σ.

How do you know if your data are normal?

Many statistical methods assume the data are normal, or that the errors from a
fitted model are normal. To test this assumption:
• Plot the histogram. It should look bell-shaped.
• Do a QQ plot on a computer. It should look straight.
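In R, both checks take one line each. A sketch using simulated data; substitute your own
variable for x:

    x <- rnorm(100)        # example data only
    hist(x)                # should look roughly bell-shaped
    qqnorm(x); qqline(x)   # points should lie close to the line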


3.5.2 Normal Probability Tables

Probability of an observation lying within kσ of µ:

  k      Prob.
 0.50    38.3%
 0.67    50.0%
 1.00    68.3%
 1.28    80.0%
 1.50    86.6%
 1.64    90.0%
 1.96    95.0%
 2.00    95.5%
 2.50    98.8%
 2.58    99.0%
 3.00    99.7%
 3.29    99.9%
 3.89    99.99%

[Figure: normal curve with the area between µ−kσ and µ+kσ shaded.]

Probability of an observation greater than µ + kσ (or less than µ − kσ):

  k      Prob.
 0.00    50.0%
 0.50    30.9%
 0.84    20.0%
 1.00    15.9%
 1.28    10.0%
 1.50    6.7%
 1.64    5.0%
 2.00    2.3%
 2.33    1.0%
 2.50    0.62%
 3.00    0.13%
 3.09    0.10%
 3.50    0.02%
 3.72    0.01%

[Figure: normal curve with the upper tail beyond µ+kσ shaded.]
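The table values can be reproduced with the normal distribution function in R, rather than
read from tables:

    pnorm(1.96) - pnorm(-1.96)   # within 1.96 sd of the mean: 0.95
    1 - pnorm(2.33)              # above mu + 2.33 sd: about 0.01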

CHAPTER 4: Computing and quantitative research

4.1 Data preparation

4.1.1 Data in Excel

One case per row, one variable per column

4.1.2 Things to watch

• Missing values are not zeros.

• Missing values are not “No”

• Keep a spare copy of your data in another location.

• Beware of using Excel for statistics.

• If you must use Excel for basic statistics, use a different spreadsheet from your main data
file.

• For categorical variables, use a code. (e.g., 1, 2, 3, . . . ).


Figure 4.1: Typical set-up of an Excel spreadsheet ready for importing into a statistics package.

4.1.3 Data cleaning

Data cleaning is identifying mistakes and anomalies in your data.

• Almost all data is filthy.


• More than 50% of my consulting time is spent cleaning data.
• Double entry
• Range checks (e.g., age)
• Range checks on subsets (e.g., age by pregnant).
• Exploratory graphics
• Look for anomalies:
– The “out-of-range” score
– The 2000 degree day.
– The 96 year old who is pregnant.
– The person earning a negative salary.
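A minimal sketch of such checks in R, using a toy data frame (the column names are invented):

    dat <- data.frame(age = c(34, 96, -2), pregnant = c(0, 1, 0))
    dat[dat$age < 0 | dat$age > 110, ]        # out-of-range ages
    dat[dat$pregnant == 1 & dat$age > 60, ]   # implausible combinations
    summary(dat)      # scan the min and max of every variable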

4.2 Using a statistics package


• Almost all stats packages can read an Excel file directly with variables in columns, cases
  in rows.
• Check the data types and variable definitions in the statistics package:
  – categorical vs numerical variables;
  – missing values.
• Choose a package that does what you want easily.


4.2.1 Microsoft Excel

Advantages:
• Widely available
• Many people are already familiar with it.
• Intuitive, easy to use.
• Good for data entry

Disadvantages:
• Too easy to enter data without structure
• Numerical routines can be unreliable
• Graphics are clumsy
• Very limited statistical facilities

Numerical accuracy in Excel

• Computation of p-values is inaccurate when close to zero or close to 1.
• Unstable algorithm for computing variance and standard deviation (e.g., problems with
  large numbers).
• Negative sums of squares for some ANOVA problems.
• Problems with regression where there is high collinearity.
• Pseudo-random numbers fail some standard tests of randomness.

Some of these problems have been known since at least 1994. Microsoft won’t respond to any
requests for fixes.

Conclusion: Don’t use Excel for any extended statistical computation.

When to use Excel

• For data entry


• For simple numerical summaries (means, standard deviations) provided the numbers are
not too big.
• If you have very little statistical work to do.

Excel: Data Analysis add-in


4.2.2 SPSS

Advantages:
• Very widely used — lots of people to help.
• Most standard methods are available.
• Click and point interface as well as command interface.

Disadvantages:
• Few modern methods included (e.g., nonparametric smoothing)
• Lots of irrelevant output. Hard to know what’s important.
• Routines used not properly documented.
• Very difficult to produce customized analysis
• Graphics are difficult to customize with code.

Guidelines for SPSS

• For summary stats on categorical data:


Use Analyze – Descriptive Statistics – Frequencies

• For summary stats on numerical data:


Use Analyze – Descriptive Statistics – Explore

• For summary stats on numerical data with a categorical explanatory variable:


Use Analyze – Descriptive Statistics – Explore

• Make sure the selected method is appropriate for your data.

4.2.3 Interactive statistics packages

• Click-and-point interface.
• Easy to learn and use.
• Sometimes limits on data size
• Tedious for repetitive tasks and repeated analyses.
• Examples: JMP, Statgraphics, Statview.

4.2.4 Large statistics packages

• Handle large data sets


• Generally less interactive and flexible than smaller packages
• Some customized analysis possible with programming.
• Examples: SPSS, SAS, Systat, Stata, Statistica, S-Plus


4.2.5 Statistical programming languages

• Extremely flexible in data handling and application of methods.


• Can write own routines to do virtually anything.
• Most useful for experienced statisticians.
• Example: R, S-Plus, Stata

4.2.6 Speciality packages

• Forecast Pro
• EViews (for econometric methods)
• Amos (for structural equation modelling)

4.2.7 Some more thoughts

• Think about methodology you need first.


• A few good graphs can make an enormous difference to a paper, a talk and a thesis.
Spend some time getting them right.
• My rule-of-thumb: produce a graph for every p-value.
• Packages perform statistical analyses quickly and easily.
That means you can make a great many mistakes quickly and easily.
• THINK before you CLICK.
• If you are using a package which is not so widely used, check the results. Mistakes have
been made.
• The best data analysis comes not from key strokes or print outs but from spending time
thinking.

4.2.8 Publication quality graphics

• Journals differ in standards.


• Excel and SPSS are adequate if you take some care and don’t use the defaults.
– reduce the size of data points
– remove grid lines
– remove or simplify legends
– remove coloured background
– fix axes and scales
– add meaningful titles
– etc.
• R, S-Plus and Systat produce excellent graphics

4.2.9 Choosing a statistics package

For most purposes:

• Excel and SPSS will be satisfactory.


• Both are freely available at Monash

BUT. . .

1. Does SPSS do what you want?


(e.g., SPSS won’t fit an additive model)
2. Do you require customized statistical analysis?
(e.g., calculation of variance of residuals from smoothing spline)
3. Do you require interactive or repetitive data analysis?
Using commands is worthwhile if you have repetition.
4. What are your colleagues using?
They will often be the first point of help.

Packages at Monash

• SPSS is freely available to be installed on any university computer.


• Systat is freely available via a site licence. It is very similar to SPSS. Better graphics and
some more modern methods. Some SPSS techniques not available.
• Minitab is available on MRGS computers. It is also sold relatively cheaply at the computer
centre and bookstore.
• SAS is extremely expensive, but some departments like it. I find it cumbersome.
• R is freeware (www.r-project.org) and extremely powerful and flexible. But you need
to have a good computing knowledge to use it.

4.3 Further reading


• Axford, R.L., Grunwald, G.K. and Hyndman, R.J. (1995) “The use of information tech-
nology in the research process”. Invited chapter in Health informatics: an overview, (ed.
Hovenga, Kidd, Cesnik).
• Knüsel, L. (1998) On the accuracy of statistical distributions in Microsoft Excel 97, Com-
putational Statistics and Data Analysis, 26, 375–377.
• McCullough, B.D. (1998) Assessing the reliability of statistical software. The American
  Statistician, 52, 358–366.
• McCullough, B.D. and Wilson, B. (1999) On the accuracy of statistical procedures in Mi-
crosoft Excel 97, Computational Statistics and Data Analysis, 31, 27–37.
• McCullough, B.D. and Wilson, B. (2002) On the accuracy of statistical procedures in Mi-
crosoft Excel 2000 and Excel XP, Computational Statistics and Data Analysis, 40, 713–721.
• McCullough, B.D. and Wilson, B. (2005) On the accuracy of statistical procedures in Mi-
crosoft Excel 2003, Computational Statistics and Data Analysis, 49, 1244–1252.
• Sawitzki, G. (1994) Testing numerical reliability of data analysis systems. Computational
Statistics and Data Analysis, 18, 269–286.


4.4 SPSS exercise

Data set

We will use data on emergency calls to the New York Auto Club (the NY equivalent of the
RACV). Download the data from
http://www.robhyndman.info/downloads/NYautoclub.xls

and save it to your disk.

The variable Calls concerns emergency road service calls from the second half of January in
1993 and 1994. In addition, we have the following variables:

• Fhigh: the forecast highest temperature for that day;


• Flow: the forecast lowest temperature for that day;
• Rain: 1=rain or snow forecast for that day; 0 otherwise;
• Snow: 1=snow forecast for that day; 0 otherwise;
• Weekday: 1=regular workday; 0 otherwise;
• Sunday: 1=Sunday; 0 otherwise.

The idea is to use these variables to predict the number of emergency calls.

Loading data

1. Run SPSS and open the excel file with the data.
2. Go to the “Variable view” sheet, and ensure the variables are correctly set to Scale (i.e.,
   Numerical) or Nominal (i.e., Categorical).
3. For the categorical variables, give the values meaningful labels.

Data summaries

4. Calculate appropriate summary statistics for all variables.


5. Calculate appropriate summary statistics for the Calls variable separately for Weekdays
and weekends.
6. Calculate appropriate summary statistics for the Calls variable separately for rain forecast
days and other days.

Exploratory graphs

7. Try plotting the number of calls against each of the other variables using an appropriate
   plot (i.e., scatterplot or boxplot). [Go to Graphs in the menu.]
8. Are there any outliers in the data?
9. Which of the explanatory variables seem to be related to Calls?
10. Do you think the effects of some variables may be confounded with other variables?

CHAPTER 5: Significance

5.1 Proportions

Example: TV ratings survey

A survey of 400 people in Melbourne found that 45 of them were watching the Channel
Nine CSI show on Sunday night. The estimated proportion of people watching is
$$\hat{p} = \frac{45}{400} = 0.1125 \quad\text{or } 11.25\%.$$

5.1.1 Standard errors

Let’s do a thought experiment. If we were able to collect additional samples of 400 customers,
we could calculate p̂ for each sample. Suppose we obtained 999 additional samples of 400
observations each. We would obtain a different value of p̂ each time because each sample
would be random and different. We now have 1000 values of p̂, all of them different. The
variability in these p̂ values tells us how accurate p̂ is for estimating p.

Of course, we can’t collect additional samples. We just have one sample. But statistical theory
can be used to calculate the standard deviation of these p̂ values if we were able to conduct
such an experiment. The standard deviation of p̂ is called the standard error:
$$\text{s.e.}(\hat{p}) = \sqrt{\frac{p(1-p)}{n-1}}$$

where n is the number of observations in our sample. (This is the standard deviation of the
estimated proportions if we took many samples of size n and estimated the proportion from
each sample.)

For percentages, the standard error is 100 times that for proportions. Notice that

• the standard error depends on the size of the sample but not the size of the target popu-
lation (assuming the target population is very large);

77
Part 5. Significance

• the standard error is smaller if the sample size is increased. This is to be expected: the
more elements in the survey, the more you will know.

Example: TV ratings survey (continued)

The standard error of the TV ratings in our sample of 400 is


    s.e.(p̂) = √(p̂(1 − p̂)/(n − 1)) = √((0.1125)(0.8875)/399) ≈ 0.016.

5.1.2 Confidence intervals

A confidence interval is a range of values which we can be confident includes the true value
of the parameter of interest, in this case the proportion p. If we wish to construct a confidence
interval for p we take a multiple of the standard error either side of the estimate of the propor-
tion.

An approximate 95% confidence interval for the proportion is

p̂ ± 1.96s.e.(p̂).

(This is an interval which we are 95% sure will contain the true proportion.)

Example: TV ratings survey (continued)

An approximate 95% confidence interval for the TV rating is

0.1125 ± 1.96(0.016) = 0.1125 ± 0.031 = [0.081, 0.144].

Notice that this interval is quite wide. If another TV show rates 12.5%, then we can’t
say which of the two shows actually had the bigger audience.
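
A minimal Python sketch (not part of the original notes) verifies these calculations:

    import math

    n, watching = 400, 45
    phat = watching / n                          # 0.1125
    se = math.sqrt(phat * (1 - phat) / (n - 1))  # about 0.016
    print(phat - 1.96 * se, phat + 1.96 * se)    # about 0.081 and 0.144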

The 95% confidence interval of the proportion can be interpreted to be the range of values that
will contain the true proportion with a probability of 0.95. Thus if we calculate the confidence
interval for a proportion for each of 1000 samples, we would expect that about 950 of the cal-
culated confidence intervals would actually contain the true proportion.

Other confidence intervals besides 95% intervals can be calculated by replacing 1.96 by a different multiplying factor.

The multiplying factor (1.96 in the example above) depends on the number of observations in
the sample and the confidence level required.

• It only works for larger n. For small n, we need a different (and more complex) method
of calculation.

• As the confidence level increases, so does the multiplying factor.

These factors are given by tables or calculated by computer.

Example: Couples with children

Consider an example where the fraction of all Australian married couples with chil-
dren was to be estimated and a simple random sample was used. The population
characteristic is the proportion of married couples with children in the target pop-
ulation. We denote this by p. It cannot be known without surveying the entire
population. The statistic is the proportion of married couples with children in the
sample. We denote this by p̂. It is calculated from the survey data as follows.

    p̂ = (no. couples with children) / (no. couples in sample).

Then the 95% confidence interval for p is approximately


    p̂ ± 1.96√(p̂(1 − p̂)/(n − 1)),
where n = no. couples in the sample, and assuming the target population is very
much bigger than the sample. That is, we are 95% confident that the true proportion
of married couples with children lies within that range.
For example, if our sample proportion was p̂ = 0.72, where the sample size was
n = 1000, then the 95% confidence interval is
    0.72 ± 1.96√((0.72)(0.28)/999) = 0.72 ± 0.03.

5.1.3 Margin of error

The margin of error is usually defined as half the width of a 95% confidence interval.

So in the TV ratings example, the margin of error was 0.031. In the couples with children example, the margin of error is 0.03.

Generally, the margin of error is computed as


    m = 1.96 s.e.(p̂) = 1.96√(p̂(1 − p̂)/(n − 1)).    (5.1)

The following table shows the margin of error for proportions for a range of sample sizes and
proportions.

Sample proportion                 Sample size (n)
(p)             100     200     400     600     750    1000    1500
0.10 0.059 0.042 0.029 0.024 0.021 0.019 0.015
0.20 0.079 0.056 0.039 0.032 0.029 0.025 0.020
0.30 0.090 0.064 0.045 0.037 0.033 0.028 0.023
0.40 0.097 0.068 0.048 0.039 0.035 0.030 0.025
0.50 0.098 0.069 0.049 0.040 0.036 0.031 0.025
0.60 0.097 0.068 0.048 0.039 0.035 0.030 0.025
0.70 0.090 0.064 0.045 0.037 0.033 0.028 0.023
0.80 0.079 0.056 0.039 0.032 0.029 0.025 0.020
0.90 0.059 0.042 0.029 0.024 0.021 0.019 0.015

Notice that the margin of error is greatest when p = 0.5.

Exercise 4: In television ratings surveys, a simple random sample of 400 households is taken and their viewing patterns recorded in detail. A television station claims a rating of 33% of the total viewing audience. Find a 95% confidence interval for this proportion.

5.1.4 Sample size calculation

Sample size calculation is most often done by first specifying what is an acceptable margin of
error for a key population characteristic.

If the survey aims to estimate the proportion of couples with children, the key population
characteristic is p. Making n the subject of equation (5.1), we obtain

    n = 1 + 3.84p(1 − p)/m².

Then substituting in the chosen values for m and p, we can obtain the sample size required.
Again, we can ‘guess’ p from previous knowledge of the population such as a pilot survey or
previous surveys.

Alternatively, a conservative approach is to use p = 0.5 since this results in the largest sample
size. Using p = 0.5 gives the sample size

    n = 1 + 0.96/m² ≈ 1/m².

This provides an upper bound on the required sample size. Other values of p will give smaller
sample sizes.
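
A short sketch of this calculation (any statistics package or calculator does the same):

    import math

    def sample_size(m, p=0.5):
        # required n for a margin of error m at the 95% level
        return math.ceil(1 + (1.96 ** 2) * p * (1 - p) / m ** 2)

    print(sample_size(0.02))         # 2402, the conservative p = 0.5 value
    print(sample_size(0.02, 0.10))   # 866 when p = 0.10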

The following table gives sample sizes for different values of m and p.

Sample proportion            Margin of error (m)
(p)            0.005      0.01      0.02      0.05      0.10
0.10 13831 3459 866 140 36
0.20 24588 6148 1538 247 63
0.30 32271 8069 2018 324 82
0.40 36881 9221 2306 370 94
0.50 38417 9605 2402 386 98
0.60 36881 9221 2306 370 94
0.70 32271 8069 2018 324 82
0.80 24588 6148 1538 247 63
0.90 13831 3459 866 140 36
Upper bound 40000 10000 2500 400 100

Exercise 5: For television ratings surveys, what number of people would need to
be surveyed for the margin of error to be 2%?

5.2 Numerical differences

Example: Change in test scores

A researcher is studying the change in stress scores over time for an in-house stress management program. Ten employees complete the test at the start of the program, and they do the test again at the end of their first six weeks on the program.

Test 1   Test 2   Difference
  6.1      10.1       4.0
  6.4      11.9       5.5
  6.1       7.6       1.5
  4.4       6.9       2.5
  5.8      11.4       5.6
  7.0      10.0       3.0
  3.2       5.8       2.6
  5.5       5.3      −0.2
  8.3       9.8       1.5
  4.2       3.1      −1.1

Sample mean of differences: x̄ = 2.5
Sample sd of differences: s = 2.2

• Has there been a significant increase in score?
• Could this increase be due to chance?

5.2.1 Standard error

The standard deviation of x̄ (i.e., its standard error) is



    s.e.(x̄) = s/√n

where s is the standard deviation of the sample data and n is the number of observations in our sample. So in the example, the standard error is 2.2/√10 = 0.7. This figure is used to draw conclusions about the value of µ.

5.2.2 Confidence intervals for the mean

A confidence interval is a range of values which we can be confident includes the true value
of the parameter of interest, in this case the population mean µ. If we wish to construct a
confidence interval for µ we take a multiple of the standard error either side of the estimate of
the mean. For example a 95% confidence interval for µ in this example is x̄ ± 2.262s.e.(x̄).

Example: Change in test scores

That is, the 95% confidence interval is

2.488 ± 2.262(0.689) = 2.488 ± 1.558 = [0.9, 4.0].

A typical computer output for this computation is given below.


Variable N Mean StDev SE Mean 95.0 % C.I.
Difference 10 2.4884 2.1781 0.6888 ( 0.9302 4.0465 )

The 95% confidence interval of the mean can be interpreted to be the range of values that will
contain the true mean with a probability of 0.95. Thus if we calculate the confidence interval for
a mean for each of 1000 samples, we would expect that about 950 of the calculated confidence
intervals would actually contain the true mean.

Other confidence intervals besides 95% intervals can be calculated by replacing 2.262 by a different multiplying factor.

The multiplying factor (2.262 in the example above) depends on the number of observations in
the sample and the confidence level required.

• As n increases, the multiplying factor decreases.


• As the confidence level increases, so does the multiplying factor.

These factors are given by tables or calculated by computer.

5.2.3 Hypothesis testing

As well as finding an interval to contain µ it is of interest to test if µ is likely to be equal to


some specified value. In the example of the change in test results, we wish to know if µ is likely
to be different from zero (i.e., has there been an improvement). To answer this question, we
construct two competing hypotheses about µ.

Definition: The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis. They are denoted by H0 and H1 respectively.

In this example, our two hypotheses are


    H0 : µ = 0        H1 : µ ≠ 0

The null hypothesis states that, on average, there is no change in test results, whereas the alternative hypothesis states that, on average, there is a change in test results.

In a hypothesis testing problem, after observing the sample the experimenter must decide ei-
ther to accept H0 as true or reject H0 as false and decide in favour of H1 .

Definition: A hypothesis test is a rule that specifies:

1. For which sample values the decision is made to accept H0 as true.


2. For which sample values H0 is rejected and H1 is accepted as true.

To make this decision we use a test statistic. That is we calculate the value of some formula
which is a function of the sample data. The value of the test statistic provides evidence for or
against the null hypothesis.

In the case of a test for the mean µ, the test statistic we use is

    t = x̄ / s.e.(x̄)

Example: Change in test scores

x̄ 2.488
t= = = 3.613.
s.e.(x̄) 0.689

5.2.4 P-values
A p-value is the probability of observing a value at least as extreme as the one observed, when the null hypothesis is true. The decision to accept or reject the null hypothesis is based on the p-value.

In this context, the p-value is the probability of observing an absolute t value greater than or equal to the one observed (3.613), if µ = 0. That is the same as the probability of observing a value of x̄ at least as far away from 0 as the x̄ value we obtained for this sample (2.488). This probability can be calculated easily using a statistical computer package.

Example: Change in test scores


Test of mu = 0.00 vs mu not = 0.00

Variable N Mean StDev SE Mean T P-Value


Difference 10 2.4884 2.1781 0.6888 3.6127 0.0056
So if the population mean µ = 0, then the probability of obtaining a sample mean
x̄ more than 2.488 away from µ is 0.0056. This is small enough to believe that the
assumption of µ = 0 is incorrect.
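
A hedged scipy sketch (not part of the original SPSS workflow) reproduces this test from the differences above; the results differ slightly from the printed output because the differences are rounded to one decimal place:

    from scipy import stats

    d = [4.0, 5.5, 1.5, 2.5, 5.6, 3.0, 2.6, -0.2, 1.5, -1.1]
    res = stats.ttest_1samp(d, 0)      # two-sided test of mu = 0
    print(res.statistic, res.pvalue)   # roughly t = 3.6, p = 0.006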

If we obtain a ‘large’ p-value, then we say that data similar to that observed are likely to have
occurred if the null hypothesis was true. Conversely, a small p-value would indicate that it

is unlikely that the null hypothesis was true (because if the null hypothesis were true, it is
unlikely that such data would occur by chance). The smaller the p-value the more unlikely the
null hypothesis.

The p-value is used to define statistical significance. If the p-value is below 0.05 then we say
this result is statistically significant. The choice of threshold is completely arbitrary. It is only
convention that dictates the use of a 0.05 or 0.01 significance level. Instead of saying an effect is
significant at the 0.05 level, quoting the actual p-value will allow the reader to make their own
interpretation.

One-sided tests

A one-sided test only looks at the evidence against the null hypothesis in one di-
rection (e.g., the mean µ is positive) and ignores the evidence against the null
hypothesis in the other direction (e.g., the mean µ is negative).
The question of whether a p-value should be one or two-sided may arise; a one-
sided p-value is rarely appropriate. Even though there may be a priori evidence to
suggest a one-sided effect, we can never really be sure that one treatment, say, is
better than another. If we did then there would be no need to do an experiment to
determine this! Therefore, routinely use two-sided p-values.

5.2.5 Type I and type II errors

There are at least two reasons why we might get the wrong answer with an hypothesis test.

Type I Error is where we accept the alternative hypothesis (reject the null hypothesis) even though the null hypothesis is true. This is sometimes referred to as a false positive. The probability of a type I error is set in advance and is typically 5% (one in 20) or 1% (one in 100). This implies that up to one in 20 pieces of scientific research based on an hypothesis test could be mistaken! We use α to denote the probability of type I error (the size or level of the test).

Type II Error is the risk of accepting the null hypothesis (failing to reject it) when it is in fact false. This is sometimes referred to as a false negative. It is often denoted by β.

If the chance of making a type I error is made very small, then automatically the risk of making
a type II error will grow.

The power of a statistical test is 1 − β. This is the probability of accepting the alternative hy-
pothesis when it is true. Obviously we want this as high as possible. However, the smaller we
make α, the less power we have for the test.

These definitions are summarized in the following table.

Decision                        Null hypothesis false     Null hypothesis true

Reject null hypothesis          Correct                   Type I error
                                Prob = 1 − β              Prob = α

Don’t reject null hypothesis    Type II error             Correct
                                Prob = β                  Prob = 1 − α

5.2.6 Summary of key concepts

standard error: The standard deviation of a statistic calculated from the data, such as a pro-
portion or the mean difference.

p-value: The probability of observing a value at least as extreme as the one observed if, in fact, there is no real change.

If the p-value is small, we reject the hypothesis that there is no change.

Usually, this is done when the p-value is smaller than 0.05.

95% confidence interval: An interval which contains the true mean change with probability of
95%. So if the confidence interval does not include zero, then the p-value is smaller than
0.05.

Quiz from Campbell and Machin

Each statement is either true or false.

1. The diastolic blood pressures (DBP) of a group of young men are normally distributed
with mean 70mmHg and a standard deviation 10 mmHg. It follows that
(a) About 95% of the men have a DBP between 60 and 80 mmHg.
(b) About 50% of the men have a DBP above 70 mmHg.
(c) The distribution of DBP is not skewed
(d) All the DBPs must be less than 100 mmHg.
(e) About 2.5% of the men have DBP below 50 mmHg.
2. Following the introduction of a new treatment regime in an alcohol dependency unit,
‘cure’ rates improved. The proportion of successful outcomes in the two years following
the change was significantly higher than in the preceding two years (p < 0.05). It follows
that:
(a) If there had been no real change in cure rates, the probability of getting this differ-
ence or one more extreme by chance, is less than one in twenty.
(b) The improvement in treatment outcome is clinically important.
(c) The change in outcome could be due to a confounding factor.
(d) The new regime cannot be worse than the old treatment.
(e) Assuming that there are no biases in the study method, the new treatment should
be recommended in preference to the old.
3. As the size of a random sample increases:
(a) The standard deviation decreases.
(b) The standard error of the mean decreases.
(c) The mean decreases.
(d) The range is likely to increase.
(e) The accuracy of the parameter estimates increases.
4. A 95% confidence interval for a mean
(a) Is wider than a 99% confidence interval.
(b) In repeated samples will include the population mean 95% of the time.
(c) Will include the sample mean with a probability 1.
(d) Is a useful way of describing the accuracy of a study.
(e) Will include 95% of the observations of a sample.
5. The p-value
(a) Is the probability that the null hypothesis is false
(b) Is generally large for very small studies
(c) Is the probability of the observed result, or one more extreme, if the null hypothesis
were true.
(d) Is one minus the type II error
(e) Can only take a limited number of discrete values such as 0.1, 0.05, 0.01, etc.

Bicep circumference

[Ref: Bland, J.M. and Altman, D.G. (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, i, 307–310.]

The table below shows the circumference (cm) of the right and left bicep of 15 right-handed
tennis players.

Subject    Right     Left    Difference
   1       37.50    36.00       1.50
   2       35.75    34.50       1.25
   3       38.25    38.25       0.00
   4       40.50    40.00       0.50
   5       32.25    31.50       0.75
   6       37.50    36.75       0.75
   7       34.75    33.50       1.25
   8       35.75    34.75       1.00
   9       38.75    38.75       0.00
  10       40.25    40.00       0.25
  11       37.50    36.75       0.75
  12       35.75    35.25       0.50
  13       34.00    33.50       0.50
  14       40.00    39.25       0.75
  15       41.25    40.75       0.50
Mean       37.32    36.63       0.683
Stdev      2.606    2.803       0.438

Interpret the following computer output.

Paired samples t test on LEFT vs RIGHT with 15 cases

Mean RIGHT = 37.317


Mean LEFT = 36.633
Mean Difference = 0.683 95.00% CI = 0.441 to 0.926
SD Difference = 0.438 t = 6.045
df = 14 Prob = 0.000
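
A minimal scipy sketch that reproduces this paired test from the table above:

    from scipy import stats

    right = [37.50, 35.75, 38.25, 40.50, 32.25, 37.50, 34.75, 35.75,
             38.75, 40.25, 37.50, 35.75, 34.00, 40.00, 41.25]
    left = [36.00, 34.50, 38.25, 40.00, 31.50, 36.75, 33.50, 34.75,
            38.75, 40.00, 36.75, 35.25, 33.50, 39.25, 40.75]
    print(stats.ttest_rel(right, left))   # t close to 6.0, p well below 0.001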

CHAPTER 6: Statistical models and regression

Regression is useful when there is a numerical response variable and one or more explana-
tory variables.

6.1 One numerical explanatory variable


Recall:
Numerical summaries: correlation
Graphical summaries: scatterplot.

Example: Pulp shipments and price

Ref: Makridakis, Wheelwright and Hyndman, 1998. Forecasting: methods and applications, John
Wiley & Sons Chapter 5.
Pulp shipments World pulp price Pulp shipments World pulp price
(millions metric tons) (dollars per ton) (millions metric tons) (dollars per ton)
Si Pi Si Pi
10.44 792.32 21.40 619.71
11.40 868.00 23.63 645.83
11.08 801.09 24.96 641.95
11.70 715.87 26.58 611.97
12.74 723.36 27.57 587.82
14.01 748.32 30.38 518.01
15.11 765.37 33.07 513.24
15.26 755.32 33.81 577.41
15.55 749.41 33.19 569.17
16.81 713.54 35.15 516.75
18.21 685.18 27.45 612.18
19.42 677.31 13.96 831.04
20.18 644.59


6.1.1 Scatterplots

‘Eye-balling’ the data would suggest that Shipments decreases with price. A plot of shipments
against price is a good preliminary step to ensure that a linear relationship is appropriate.

6.1.2 Statistical model

In regression problems we are interested in how changes in one variable are related to changes
in another. In the case of Shipments and Price we are concerned with how Shipments changes
with Price, not how Price changes with Shipments. The explanatory variable is Price, and the
response variable it predicts is Shipments.

The relationship between the explanatory variable, x, and the response variable, y, is
yi = a + bxi + ei
where a is the intercept of the line, b is the slope, and ei is the error, or that part of the observed
data which is not described by the linear relationship. ei is assumed to be Normally distributed
with mean 0 and standard deviation σ.

If we can find the line that best fits the data, we can then determine how much shipments change, on average, for a unit increase in price.

Figure 6.1: The relationship between world pulp price and pulp shipments is negative. As the price
increases, the quantity shipped decreases.

The line of ‘best’ fit is found by minimizing the sum of squares of the deviations from the observed points to the line. This method is called the method of least squares. That is, the line of best fit minimizes

    ∑ (yi − ŷi)² = ∑ (yi − â − b̂xi)²,    summing over i = 1, …, n,

where â and b̂ are the estimates of a and b.

Using calculus we find

    b̂ = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²
    â = ȳ − b̂x̄

These calculations are done easily using a statistics package (or even a calculator).
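
As an illustration, a short numpy sketch (hypothetical; the notes themselves use a statistics package) applies these formulas to the pulp data from the table above:

    import numpy as np

    price = np.array([792.32, 868.00, 801.09, 715.87, 723.36, 748.32, 765.37,
                      755.32, 749.41, 713.54, 685.18, 677.31, 644.59, 619.71,
                      645.83, 641.95, 611.97, 587.82, 518.01, 513.24, 577.41,
                      569.17, 516.75, 612.18, 831.04])
    ship = np.array([10.44, 11.40, 11.08, 11.70, 12.74, 14.01, 15.11, 15.26,
                     15.55, 16.81, 18.21, 19.42, 20.18, 21.40, 23.63, 24.96,
                     26.58, 27.57, 30.38, 33.07, 33.81, 33.19, 35.15, 27.45,
                     13.96])
    dx = price - price.mean()
    b = np.sum(dx * (ship - ship.mean())) / np.sum(dx ** 2)
    a = ship.mean() - b * price.mean()
    print(a, b)   # close to the reported 71.7 and -0.075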

Example: Pulp price and shipments

The regression equation is


    S = 71.7 − 0.075P.

The negative relationship is seen in the downward slope of −0.075. That is, when the price increases by one dollar, shipments decrease, on average, by about 75 thousand metric tons. Further, this regression line can be used to predict the mean or expected Y for given X values. For example, when the price is $600 per ton, the predicted shipments are 71.7 − 0.075(600) = 26.7 million metric tons.

6.1.3 Outliers and influential observations

• Outliers: observations which produce large residuals.


• Influential observations: An observation is influential if removing it would markedly change
the position of the regression line. (Often outliers in the x variable).
• Lurking variable: an explanatory variable which was not included in the regression but
has an important effect on the response.
Points should not be removed without a good explanation of why they are different.

6.1.4 Residual plots

A useful plot for spotting outliers is the scatterplot of residuals ei against the explanatory vari-
able xi . This shows whether a straight line was appropriate. We expect to see a scatterplot
resembling a horizontal band with no values too far from the band and no patterns such as
curvature or increasing spread.

Another useful plot for spotting outliers and other unwanted features is to plot residuals
against the fitted values ŷi . Again, we expect to see no pattern.



Figure 6.2: Residual plot from the pulp regression. Here the residuals show a V-shaped pattern indicat-
ing that a straight line relationship is not appropriate for these data.

6.1.5 Correlation

Recall: the correlation coefficient is a measure of the strength of the linear relationship.

A useful formula: r = b̂sx /sy

The pulp price and shipments data have a correlation of r = −0.931, indicating a very strong
negative relationship between pulp price and pulp shipped. If the pulp price increases, the
quantity of pulp shipped tends on average to decrease and vice versa.

So r2 = 0.867 showing that 86.7% of the variation is explained by the regression line. The other
13.3% of the variation is random variation about the line.

Activity: Birth weights

The table below gives the values for 32 babies of x, the birth weight, and y, the increase in
weight between the 70th and 100th day of life, as a percentage of birth weight.

Birth weight Increase in weight Birth weight Increase in weight


(oz) 70–100 days as % of x (oz) 70–100 days as % of x
72 68 125 27
112 63 126 60
111 66 122 71
107 72 126 88
119 52 127 63
92 75 86 88
126 76 142 53
80 118 132 50
81 120 87 111
84 114 123 59
115 29 133 76
118 42 106 72
128 48 103 90
128 50 118 68
123 69 114 93
116 59 94 91
[Scatterplot: percentage increase in weight (days 70–100) plotted against birth weight (oz).]

Computer output for fitting a regression line is given below.

The regression equation is Increase = 168 - 0.864 Weight

Predictor Coef Stdev t-ratio p


Constant 167.87 19.88 8.44 0.000
Weight -0.8643 0.1757 -4.92 0.000

s = 17.80 R-sq = 44.7% R-sq(adj) = 42.8%

What would the expected percentage increase in weight be for an infant whose birth weight
was 94 oz?
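
A one-line check, using the fitted coefficients from the output above:

    a, b = 167.87, -0.8643    # Increase = 167.87 - 0.8643 * Weight
    print(a + b * 94)         # about 86.6 percent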

6.2 One categorical explanatory variable


Recall:
Numerical summaries: group means, standard deviations, etc.
Graphical summaries: side-by-side boxplots.

Example: Comparative volatility of stock exchanges

Data: returns for 30 stocks listed on NASDAQ and NYSE for 9–13 May 1994.
We look at absolute return in prices of stocks. This is a measure of volatility. For example,
a market where stocks average a weekly 10% change in price (positive or negative) is more
volatile than one which averages a 5% change.
Graphical summary: boxplots
[Side-by-side boxplots of absolute returns for NASDAQ and NYSE, on a scale from 0.00 to 0.12.]

Numerical summaries:
NASDAQ NYSE
Min. :0.00380 Min. :0.00260
1st Qu.:0.01745 1st Qu.:0.01120
Median :0.03930 Median :0.02480
Mean :0.04395 Mean :0.02913
3rd Qu.:0.05575 3rd Qu.:0.04010
Max. :0.12240 Max. :0.08910

6.2.1 Statistical model

Our model is that each group has a different mean. So if we let yi,j be the ith measurement
from the jth group and µj be the mean of the jth group, then we can write the model as
yi,j = µj + ei,j

Again, we assume ei,j ~ N(0, σ²); that is, all groups have the same standard deviation.

We can estimate µj by ȳj , the sample mean of group j.

6.2.2 Dummy variable

If a categorical variable takes only two values (e.g., ‘Yes’ or ‘No’), then an equivalent numerical
variable can be constructed taking value 1 if yes and 0 if no. This is called a dummy variable.
In this case, the problem becomes identical to the case with a numerical explanatory variable.

If there are more than two categories, then the variable can be coded using several dummy
variables (one fewer than the total number of categories). Then the problem is one of several
numerical explanatory variables and is discussed in the next section.
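
A small pandas sketch of this coding (the data frame here is made up purely to show the call):

    import pandas as pd

    df = pd.DataFrame({"exchange": ["NYSE", "NASDAQ", "AMEX", "NYSE"]})
    # one dummy variable fewer than the number of categories
    dummies = pd.get_dummies(df["exchange"], drop_first=True)
    print(dummies)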

6.3 Several explanatory variables

In multiple regression there is one variable to be predicted (e.g., sales), but there are two or
more explanatory variables. The general form of multiple regression is

Y = b0 + b1 X1 + b2 X2 + · · · + bk Xk + e.

Thus if sales were the variable to be modelled, several factors such as GNP, advertising, prices,
competition, R&D budget, and time could be tested for their influence on sales by using re-
gression. If it is found that these variables do influence the level of sales, they can be used to
predict future values of sales.

Each of the explanatory variables (X1 , . . . , Xk ) is numerical, although it is easy to handle cate-
gorical variables in a similar way using dummy variables.

Case Study: Mutual savings bank deposits

To illustrate the application of multiple regression, we will use a case study taken from Makri-
dakis, Wheelwright and Hyndman, 1998. Forecasting: methods and applications, John Wiley &
Sons Chapter 6.

These data refer to a mutual savings bank in a large metropolitan area. In 1993 there was
considerable concern within the mutual savings banks because monthly changes in deposits
were getting smaller and monthly changes in withdrawals were getting bigger. Thus it was
of interest to develop a short-term forecasting model to forecast the changes in end-of-month
(EOM) balance over the next few months. Table 6.1 shows 60 monthly observations (February
1988 through January 1993) of end-of-month balance (in column 2). Note that there was strong
growth in early 1991 and then a slowing down of the growth rate since the middle of 1991.

Also presented in Table 6.1 are the composite AAA bond rates (in column 3) and the rates on
U.S. Government 3-4 year bonds (in column 4). It was hypothesized that these two rates had
an influence on the EOM balance figures in the bank.

Now of interest to the bank was the change in the end-of-month balance and so first differences
of the EOM data in Table 6.1 are shown as column 2 of Table 6.2. These differences, denoted
D(EOM) in subsequent equations are plotted in Figure 6.3, and it is clear that the bank was

(1) (2) (3) (4) (1) (2) (3) (4)
Month (EOM) (AAA) (3-4) Month (EOM) (AAA) (3-4)
1 360.071 5.94 5.31 31 380.119 8.05 7.46
2 361.217 6.00 5.60 32 382.288 7.94 7.09
3 358.774 6.08 5.49 33 383.270 7.88 6.82
4 360.271 6.17 5.80 34 387.978 7.79 6.22
5 360.139 6.14 5.61 35 394.041 7.41 5.61
6 362.164 6.09 5.28 36 403.423 7.18 5.48
7 362.901 5.87 5.19 37 412.727 7.15 4.78
8 361.878 5.84 5.18 38 423.417 7.27 4.14
9 360.922 5.99 5.30 39 429.948 7.37 4.64
10 361.307 6.12 5.23 40 437.821 7.54 5.52
11 362.290 6.42 5.64 41 441.703 7.58 5.95
12 367.382 6.48 5.62 42 446.663 7.62 6.20
13 371.031 6.52 5.67 43 447.964 7.58 6.03
14 373.734 6.64 5.83 44 449.118 7.48 5.60
15 373.463 6.75 5.53 45 449.234 7.35 5.26
16 375.518 6.73 5.76 46 454.162 7.19 4.96
17 374.804 6.89 6.09 47 456.692 7.19 5.28
18 375.457 6.98 6.52 48 465.117 7.11 5.37
19 375.423 6.98 6.68 49 470.408 7.16 5.53
20 374.365 7.10 7.07 50 475.600 7.22 5.72
21 372.314 7.19 7.12 51 475.857 7.36 6.04
22 373.765 7.29 7.25 52 480.259 7.34 5.66
23 372.776 7.65 7.85 53 483.432 7.30 5.75
24 374.134 7.75 8.02 54 488.536 7.30 5.82
25 374.880 7.72 7.87 55 493.182 7.27 5.90
26 376.735 7.67 7.14 56 494.242 7.30 6.11
27 374.841 7.66 7.20 57 493.484 7.31 6.05
28 375.622 7.89 7.59 58 498.186 7.26 5.98
29 375.461 8.14 7.74 59 500.064 7.24 6.00
30 377.694 8.21 7.51 60 506.684 7.25 6.24

Table 6.1: Bank data: end-of-month balance (in thousands of dollars), AAA bond rates, and rates for
3-4 year government bond issues over the period February 1988 through January 1993.

facing a volatile situation in the last two years or so. The challenge is to forecast these rapidly
changing EOM values.

In preparation for some of the regression analyses to be done in this chapter, Table 6.2 desig-
nates D(EOM) as Y , the response variable, and shows three explanatory variables X1 , X2 , and
X3 . Variable X1 is the AAA bond rates from Table 6.1, but they are now shown leading the
D(EOM) values. Similarly, variable X2 refers to the rates on 3-4 year government bonds and
they are shown leading the D(EOM) values by one month. Finally, variable X3 refers to the first
differences of the 3-4 year government bond rates, and the timing for this variable coincides
with that of the D(EOM) variable.


Figure 6.3: (a) A time plot of the monthly change of end-of-month balances at a mutual savings bank.
(b) A time plot of AAA bond rates. (c) A time plot of 3-4 year government bond issues. (d) A time plot of
the monthly change in 3-4 year government bond issues. All series are shown over the period February
1988 through January 1993.

t Y X1 X2 X3 t Y X1 X2 X3
Month D(EOM) (AAA) (3-4) D(3-4) Month D(EOM) (AAA) (3-4) D(3-4)
1 1.146 5.94 5.31 0.29 31 2.169 8.05 7.46 -0.37
2 -2.443 6.00 5.60 -0.11 32 0.982 7.94 7.09 -0.27
3 1.497 6.08 5.49 0.31 33 4.708 7.88 6.82 -0.60
4 -0.132 6.17 5.80 -0.19 34 6.063 7.79 6.22 -0.61
5 2.025 6.14 5.61 -0.33 35 9.382 7.41 5.61 -0.13
6 0.737 6.09 5.28 -0.09 36 9.304 7.18 5.48 -0.70
7 -1.023 5.87 5.19 -0.01 37 10.690 7.15 4.78 -0.64
8 -0.956 5.84 5.18 0.12 38 6.531 7.27 4.14 0.50
9 0.385 5.99 5.30 -0.07 39 7.873 7.37 4.64 0.88
10 0.983 6.12 5.23 0.41 40 3.882 7.54 5.52 0.43
11 5.092 6.42 5.64 -0.02 41 4.960 7.58 5.95 0.25
12 3.649 6.48 5.62 0.05 42 1.301 7.62 6.20 -0.17
13 2.703 6.52 5.67 0.16 43 1.154 7.58 6.03 -0.43
14 -0.271 6.64 5.83 -0.30 44 0.116 7.48 5.60 -0.34
15 2.055 6.75 5.53 0.23 45 4.928 7.35 5.26 -0.30
16 -0.714 6.73 5.76 0.33 46 2.530 7.19 4.96 0.32
17 0.653 6.89 6.09 0.43 47 8.425 7.19 5.28 0.09
18 -0.034 6.98 6.52 0.16 48 5.291 7.11 5.37 0.16
19 -1.058 6.98 6.68 0.39 49 5.192 7.16 5.53 0.19
20 -2.051 7.10 7.07 0.05 50 0.257 7.22 5.72 0.32
21 1.451 7.19 7.12 0.13 51 4.402 7.36 6.04 -0.38
22 -0.989 7.29 7.25 0.60 52 3.173 7.34 5.66 0.09
23 1.358 7.65 7.85 0.17 53 5.104 7.30 5.75 0.07
24 0.746 7.75 8.02 -0.15 54 4.646 7.30 5.82 0.08
25 1.855 7.72 7.87 -0.73 55 1.060 7.27 5.90 0.21
26 -1.894 7.67 7.14 0.06 56 -0.758 7.30 6.11 -0.06
27 0.781 7.66 7.20 0.39 57 4.702 7.31 6.05 -0.07
28 -0.161 7.89 7.59 0.15 58 1.878 7.26 5.98 0.02
29 2.233 8.14 7.74 -0.23 59 6.620 7.24 6.00 0.24
30 2.425 8.21 7.51 -0.05

Table 6.2: Bank data: monthly changes in balance as response variable and three explanatory variables.
(Data for months 54–59 to be ignored in all analyses and then used to check forecasts.)

The numbers in the first row of Table 6.2 are explained as follows:

1.146 = (EOM balance Mar. 1988) − (EOM balance Feb. 1988)


5.94 = AAA bond rate for Feb. 1988
5.31 = 3-4 year government bond rate for Feb. 1988
0.29 = (3-4 rate for Mar. 1988) − (3-4 rate for Feb. 1988)

(Note that the particular choice of these explanatory variables is not arbitrary, but rather based
on an extensive analysis that will not be presented in detail here.)

For the purpose of illustration in this chapter, the last six rows in Table 6.2 will be ignored in
all the analyses that follow, so that they may be used to examine the accuracy of the various
models to be employed. (The idea is to forecast the D(EOM) figures for periods 54–59, and

then compare them with the known figures not used in developing our regression model. This
comparison hasn’t actually been made in these notes.)

The bank could model Y (the D(EOM) variable) on the basis of X1 alone, or on the basis of a
combination of the X1 , X2 , and X3 variables shown in columns 3, 4, and 5. So Y , the response
variable, is a function of one or more of the explanatory variables. Although several different
forms of the function could be written to designate the relationships among these variables, a
straightforward one that is linear and additive is
Y = b0 + b1 X1 + b2 X2 + b3 X3 + e, (6.1)
where Y = D(EOM),
X1 = AAA bond rates,
X2 = 3-4 rates,
X3 = D(3-4) year rates,
e = error term.
From equation (6.1) it can readily be seen that if two of the X variables were omitted, the
equation would be like those handled previously with simple linear regression.

Time plots of each of the variables are given in Figure 6.3. These show the four variables
individually as they move through time. Notice how some of the major peaks and troughs line
up, implying that the variables may be related.

Scatterplots of each combination of variables are given in Figure 6.4. These enable us to visu-
alize the relationship between each pair of variables. Each panel shows a scatterplot of one of
the four variables against another of the four variables. The variable on the vertical axis is the
variable named in that row; the variable on the horizontal axis is the variable named in that
column. So, for example, the panel in the top row and second column is a plot of D(EOM)
against AAA. Similarly, the panel in the second row and third column is a plot of AAA against
(3–4). This figure is known as a scatterplot matrix and is a very useful way of visualizing the
relationships between the variables.

Note that the mirror image of each plot above the diagonal is given below the diagonal. For
example, the plot of D(EOM) against AAA given in the top row and second column is mirrored
in the second row and first column with a plot of AAA against D(EOM).

Figure 6.4 shows that there is a weak linear relationship between D(EOM) and each of the
other variables. It also shows that two of the explanatory variables, AAA and (3-4), are related
linearly. This phenomenon is known as collinearity and means it may be difficult to distinguish
the effect of AAA and (3-4) on D(EOM).

For the bank data in Table 6.2—using only the first 53 rows—the model in equation (6.1) can be
solved using least squares to give
Ŷ = −4.34 + 3.37(X1 ) − 2.83(X2 ) − 1.96(X3 ). (6.2)
Note that a “hat” is used over Ŷ to indicate that this is an estimate of Y , not the observed
Y . This estimate Ŷ is based on the three explanatory variables only. The difference between
the observed Y and the estimated Ŷ tells us something about the “fit” of the model, and this
discrepancy is called the residual (or error):


Figure 6.4: Scatterplots of each combination of variables. The variable on the vertical axis is the variable
named in that row; the variable on the horizontal axis is the variable named in that column. This
scatterplot matrix is a very useful way of visualizing the relationships between each pair of variables.

                 Y         X1        X2        X3
              D(EOM)     (AAA)     (3-4)    D(3-4)
Y  = D(EOM)    1.000     0.257    −0.391    −0.195
X1 = (AAA)     0.257     1.000     0.587    −0.204
X2 = (3-4)    −0.391     0.587     1.000    −0.201
X3 = D(3-4)   −0.195    −0.204    −0.201     1.000

Table 6.3: Bank data: the correlations among the response and explanatory variables.

    ei = Yi − Ŷi,

where Yi is the observed value and Ŷi is the value estimated using the regression model.

Computer output

In the case of the bank data and the linear regression of D(EOM) on (AAA), (3-4), and D(3-4),
the full output from a regression program included the following information:

Term Coeff. Value se of bj t P -value


Constant b0 −4.3391 3.2590 −1.3314 0.1892
AAA b1 3.3722 0.5560 6.0649 0.0000
(3-4) b2 −2.8316 0.3895 −7.2694 0.0000
D(3-4) b3 −1.9648 0.8627 −2.2773 0.0272

R2 = 0.53.
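
Output like this can be reproduced with any regression package. Here is a hedged statsmodels sketch, assuming the Table 6.2 data have been saved to a hypothetical file bank.csv with columns DEOM, AAA, rate34 and Drate34:

    import pandas as pd
    import statsmodels.api as sm

    bank = pd.read_csv("bank.csv").iloc[:53]   # first 53 rows only, as in the text
    X = sm.add_constant(bank[["AAA", "rate34", "Drate34"]])
    fit = sm.OLS(bank["DEOM"], X).fit()
    print(fit.summary())   # coefficients, standard errors, t values, R-squared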

Residual analysis

Figure 6.5 shows four plots of the residuals after fitting the model

D(EOM) = −4.34 + 3.37(AAA) − 2.83(3-4) − 1.96(D(3-4)).

These plots help examine the linearity and homoscedasticity assumptions.

The bottom right panel of Figure 6.5 shows the residuals (ei ) against the fitted values (Ŷi ). The
other panels show the residuals plotted against the explanatory variables. Each of the plots can
be interpreted in the same way as the residual plot for simple regression. The residuals should
not be related to the fitted values or the explanatory variables. So each residual plot should
show scatter in a horizontal band with no values too far from the band and no patterns such as
curvature or increasing spread. All four plots in Figure 6.5 show no such patterns.

If there is any curvature pattern in one of the plots against an explanatory variable, it sug-
gests that the relationship between Y and X variable is non-linear (a violation of the linear-
ity assumption). The plot of residuals against fitted values is to check the assumption of ho-
moscedasticity and to identify large residuals (possible outliers). For example, if the residuals
show increasing spread from left to right (i.e., as Ŷ increases), then the variance of the residuals
is not constant.

It is also useful to plot the residuals against explanatory variables which were not included in
the model. If such plots show any pattern, it indicates that the variable concerned contains
some valuable predictive information and it should be added to the regression model.

To check the assumption of normality, we can plot a histogram of the residuals. Figure 6.6
shows such a histogram with a normal curve superimposed. The histogram shows the number
of residuals obtained within each of the intervals marked on the horizontal axis. The normal


Figure 6.5: Bank data: plots of the residuals obtained when D(EOM) is regressed against the three
explanatory variables AAA, (3-4), and D(3-4). The lower right panel shows the residuals plotted against
the fitted values (ei vs Ŷi ). The other plots show the residuals plotted against the explanatory variables
(ei vs Xj,i ).

Figure 6.6: Bank data: histogram of residuals with normal curve superimposed.

curve shows how many observations one would get on average from a normal distribution. In
this case, there does not appear to be any problem with the normality assumption.

There is one residual (with value −5.6) lying away from the other values; it can be seen in the histogram (Figure 6.6) and in the residual plots of Figure 6.5. However, this residual is not sufficiently far from the other values to warrant close attention.

6.4 Comparing regression models

Computer output for regression will always give the R2 value. This is a useful summary of the
model.

• It is equal to the square of the correlation between Y and Ŷ .

• It is often called the “coefficient of determination”.

• It can also be calculated as follows:

    R² = ∑(Ŷi − Ȳ)² / ∑(Yi − Ȳ)²    (6.3)

• It is the proportion of variance accounted for (explained) by the explanatory variables


X1 , X2 , . . . , Xk .

However, it needs to be used with caution. The problem is that R2 does not take into account
“degrees of freedom”. This arises because the models are more flexible when more variables
are added. Consequently, adding any variable tends to increase the value of R2 , even if that
variable is irrelevant.

To overcome this problem, an adjusted R2 is defined, as follows:

    R̄² = 1 − (1 − R²)(total df)/(error df) = 1 − (1 − R²)(n − 1)/(n − k − 1),

where n is the number of observations and k is the number of explanatory variables in the
model. Note that R̄2 is referred to as “adjusted R2 ” or “R-bar-squared,” or sometimes as “R2 ,
corrected for degrees of freedom.”

There are other measures which, like R̄², can be used to find the best regression model. Some computer programs will output several possible measures. Apart from R̄², the most commonly used are Mallows’ Cp statistic and Akaike’s AIC statistic.
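
A tiny sketch of the adjustment, matching the formula above:

    def adjusted_r2(r2, n, k):
        # R-bar-squared: R2 corrected for degrees of freedom
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    # bank data: R2 = 0.53 with n = 53 observations and k = 3 variables
    print(adjusted_r2(0.53, 53, 3))   # about 0.50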

6.5 Choosing regression variables

Developing a regression model for real data is never a simple process, but some guidelines
can be given. Generally, we have a long list of potential explanatory variables. The “long list”
needs to be reduced to a “short list” by various means, and a certain amount of creativity is
essential.

There are many proposals regarding how to select appropriate variables for a final model. Some
of these are straightforward, but not recommended:
• Plot Y against a particular explanatory variable (Xj ) and if it shows no noticeable rela-
tionship, drop it.
• Look at the correlations among the explanatory variables (all of the potential candidates)
and every time a large correlation is encountered, remove one of the two variables from
further consideration; otherwise you might run into multicollinearity problems (see Sec-
tion 6.6).
• Do a multiple linear regression on all the explanatory variables and disregard all variables whose p-values are very large (say p > 0.2).

Although these approaches are commonly followed, none of them is reliable in finding a good
regression model.

Some proposals are more complicated, but more justifiable:


• Do a best subsets regression (see Section 6.5.1).
• Do a stepwise regression (see Section 6.5.2).
• Do a principal components analysis of all the variables (including Y ) to decide on which
are key variables (see Draper and Smith, 1981).
• Do a distributed lag analysis to decide which leads and lags are most appropriate for the
study at hand.

Quite often, a combination of the above will be used to reach the final short list of explanatory
variables.

6.5.1 Best subsets regression

Ideally, we would like to calculate all possible regression models using our set of candidate ex-
planatory variables and choose the best model among them. There are two problems here. First
it may not be feasible to compute all the models because of the huge number of combinations
of variables that is possible. Second, how do we decide what is best?

We will consider the second problem first. A naïve approach to selecting the best model would
be to find the model which gives the largest value of R2 . In fact, that is the model which
contains all the explanatory variables! Every additional explanatory variable will result in an
increase in R2 . Clearly not all of these explanatory variables should be included. So maximizing
the value of R2 is not an appropriate method for finding the best model.

Instead, we can compare the R̄2 values for all the possible regression models and select the
model with the highest value for R̄2 . If we have 44 possible explanatory variables, then we

can use anywhere between 0 and 44 of these in our final model. That is a total of 2⁴⁴ ≈ 18 trillion possible regression models! Even using modern computing facilities, it is impossible to compute that many regression models in a person’s lifetime. So we need some other approach.
Clearly the problem can quickly get out of hand without some help. To select the best ex-
planatory variables from among 44 candidate variables, we need to use stepwise regression
(discussed in the next section).

6.5.2 Stepwise regression

Stepwise regression is a method which can be used to help sort out the relevant explanatory
variables from a set of candidate explanatory variables when the number of explanatory vari-
ables is too large to allow all possible regression models to be computed.

Several types of stepwise regression are in use today. The most common is described below.

Step 1: Find the best single variable (X1∗ ).

Step 2: Find the best pair of variables (X1∗ together with one of the remaining explanatory
variables—call it X2∗ ).

Step 3: Find the best triple of explanatory variables (X1∗ , X2∗ plus one of the remaining ex-
planatory variables—call the new one X3∗ ).

Step 4: From this step on, the procedure checks to see if any of the earlier introduced variables should be removed. For example, the regression of Y on X2∗ and X3∗ might give better R̄² results than the regression on all three of X1∗, X2∗, and X3∗. At step 2 the best pair of explanatory variables had to include X1∗, but by step 3 the pair X2∗ and X3∗ could actually be superior to all three variables.

Step 5: The process of (a) looking for the next best explanatory variable to include, and (b)
checking to see if a previously included variable should be removed, is continued until
certain criteria are satisfied. For example, in running a stepwise regression program, the
user is asked to enter two “tail” probabilities:
1. the probability, P1 , to “enter” a variable, and
2. the probability, P2 , to “remove” a variable.
When it is no longer possible to find any new variable that contributes at the P1 level
to the R̄2 value, or if no variable needs to be removed at the P2 level, then the iterative
procedure stops.
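
A simplified forward-selection sketch (assuming statsmodels is available; unlike full stepwise regression, this version only adds variables and never removes one added earlier):

    import statsmodels.api as sm

    def forward_select(y, X):
        # greedily add the variable that most improves adjusted R-squared
        chosen, best = [], float("-inf")
        remaining = list(X.columns)
        while remaining:
            scores = {}
            for var in remaining:
                fit = sm.OLS(y, sm.add_constant(X[chosen + [var]])).fit()
                scores[var] = fit.rsquared_adj
            var = max(scores, key=scores.get)
            if scores[var] <= best:
                break   # no remaining candidate improves the fit
            best, chosen = scores[var], chosen + [var]
            remaining.remove(var)
        return chosen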

The NY Auto Club example (page 106) shows

• Putting all the explanatory variables in can lead to a significant overall effect but with no
way of determining the individual effects of each variable.
• Using stepwise regression is useful for choosing only the most significant variables in the
regression model.
• Stepwise regression is not guaranteed to lead to the best possible model.

• If you are trying several different models, use the adjusted R2 value to select between
them.

6.6 Multicollinearity

In regression analysis, multicollinearity is the name given to any one or more of the following
conditions:

• Two explanatory variables are perfectly correlated.


• Two explanatory variables are highly correlated (i.e., the correlation between them is close
to +1 or −1).
• A linear combination of some of the explanatory variables is highly correlated with an-
other explanatory variable.
• A linear combination of one subset of explanatory variables is highly correlated with a
linear combination of another subset of explanatory variables.

The reason for concern about this issue is first and foremost a computational one. If perfect
multicollinearity exists in a regression problem, it is simply not possible to carry out the LS
solution. If nearly perfect multicollinearity exists, the LS solutions can be affected by round-
off error problems in some calculators and some computer packages. There are computational
methods that are robust enough to take care of all but the most difficult multicollinearity prob-
lems, but not all packages take advantage of these methods. Excel is notoriously bad in this
respect.

The other major concern is that the stability of the regression coefficients is affected by multi-
collinearity. As multicollinearity becomes more and more nearly perfect, the regression coef-
ficients computed by standard regression programs are therefore going to be (a) unstable—as
measured by the standard error of the coefficient, and (b) unreliable—in that different computer
programs are likely to give different solution values.

Multicollinearity is not a problem unless either (i) the individual regression coefficients are of
interest, or (ii) attempts are made to isolate the contribution of one explanatory variable to Y ,
without the influence of the other explanatory variables. Multicollinearity will not affect the
ability of the model to predict.

A common but incorrect idea is that an examination of the intercorrelations among the explanatory variables can reveal the presence or absence of multicollinearity. While a correlation very close to +1 or −1 does suggest multicollinearity, the converse does not hold: unless there are only two explanatory variables, we cannot infer that multicollinearity is absent just because no pair of explanatory variables is highly correlated.

6.7 SPSS exercises

6.7.1 Predicting NYAC calls

Recall the data on emergency calls to the New York Auto Club (the NY equivalent of the
RACV). (See p.76.)

1. Try fitting a linear regression model to predict Calls using the other variables. [Go to
Analyze → Regression → Linear .]

Include all possible explanatory variables and leave “Method” as “Enter”.

Which variables are significant? Are the coefficients in the direction you expected? Write
down the R2 and adjusted-R2 values.

2. Now try fitting the same regression model but with “Method” set to “Stepwise”. This
only puts in the variables which are useful and leaves out the others.

Now which variables are included? Write down the R2 and adjusted-R2 values. How
have they changed?

3. Finally, try the model with explanatory variables Flow and Rain. Write down the R2 and
adjusted-R2 values. This shows that the step-wise method in SPSS doesn’t always find
the best model!

4. This last model should have the following coefficients:

Constant 7952.7
Flow −173.8
Rain 1922.2

How can each of these be interpreted?

5. The busiest day of all in 1994 was January 27 when the daily forecast low was 14◦ F and
the ground was under six inches of snow. The club answered 8947 calls. Could this have
been predicted from the model?
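
One hedged reading of the model (the snow forecast counts as Rain = 1, since Rain is coded 1 for rain or snow):

    pred = 7952.7 - 173.8 * 14 + 1922.2 * 1
    print(pred)   # about 7442 calls, well short of the 8947 actually answered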

CHAPTER 7: Significance in regression

7.1 Statistical model

Our model is that the residuals are normally distributed with constant variance. So please
check the residual plots before doing any confidence intervals or tests of significance. If this
assumption is invalid, then the confidence intervals and tests are invalid.

7.2 ANOVA tables and F-tests

When a regression model is fitted, we can test whether the model is any better than having no
variables at all. The test is conducted using an ANOVA (ANalysis Of VAriance) table. The test
is called an F-test. Here the null hypothesis is that no variable has any effect (i.e., all coefficients
are zero). The alternative hypothesis is that at least one variable has some effect (i.e., at least
one coefficient is non-zero).

An analysis of variance seeks to split up the variation in the data into two components: the
variation due to the model and the variation left over in the residuals. If the null hypothesis
is true (no variable is relevant) then we would expect the variation in the residuals to be much
larger than the variation in the model. The calculations required to answer this question are
summarized in an “analysis of variance” or ANOVA table.

Example: Pulp shipments and price


Analysis of Variance

Source DF SS MS F P
Regression 1 1357.2 1357.2 149.38 0.000
Residual Error 23 209.0 9.1
Total 24 1566.2


The Analysis of Variance (ANOVA) table above contains six columns: source of variation, degrees of freedom (DF), sums of squares (SS), mean square (MS), the variance ratio or F-value (F), and the p-value (P). Of primary interest are the F and P columns.

• The F-value follows an F-distribution, and is used to decide if the model is significant.
• The p-value is the probability that a randomly selected value from the F-distribution is
greater than the observed variance ratio.
• As a general rule, if the F-probability (or p-value) is less than 0.05 then the model is
deemed to be significant.
In this case, there is a significant effect due to the included variable.
• If there are two groups, the p-value from the ANOVA (F-test) is the same as the p-value
from the t-test (provided a t-test with “pooled variance” is used).

7.3 t-tests and confidence intervals for coefficients

The regression equation gives us the equation for the line best relating shipments to price. We now ask a statistical question: is the relationship significant? In the context of linear regression, a relationship is significant if the slope of the line is significantly different from zero, since a slope equal to zero would imply that shipments remain unchanged as price increases; that is, no relationship.

To test the significance of the relationship between shipments and price the hypotheses are:

    H0 : b = 0        H1 : b ≠ 0.

As usual, if the p-value is less than 0.05 then the linear regression is deemed to be significant. This means that the estimated slope of the line is significantly different from zero.

Example: Pulp shipments and price


The regression equation is Shipments = 71.7 - 0.0751 Price

Predictor Coef StDev T P


Constant 71.668 4.195 17.08 0.000
Price -0.075135 0.006147 -12.22 0.000

S = 3.014 R-Sq = 86.7% R-Sq(adj) = 86.1%


The slope is estimated to be −0.075 with a standard error of 0.006. The estimated slope divided by the standard error gives the t statistic.
The p-value is 0.000, identical to the p-value from the ANOVA, as it should be: in simple regression the F statistic is the square of the t statistic, and here (−12.22)² ≈ 149.4.


Activity: Birth weights again

x = birth weight

y = increase in weight between the 70th and 100th day of life, as a percentage of birth weight.
[Scatterplot: percentage increase in weight between days 70 and 100 (vertical axis, 40-120) against birth weight in oz (horizontal axis, 80-140).]

Computer output for fitting a regression line is given below.

The regression equation is Increase = 168 - 0.864 Weight

Predictor Coef Stdev t-ratio p


Constant 167.87 19.88 8.44 0.000
Weight -0.8643 0.1757 -4.92 0.000

s = 17.80 R-sq = 44.7% R-sq(adj) = 42.8%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 7666.4 7666.4 24.20 0.000
Error 30 9502.1 316.7
Total 31 17168.5

1. Is there an association between birth weight and the percentage weight increase between the 70th and 100th days?

2. Is the constant in the regression equation needed?

3. Find a 95% confidence interval for the slope.


7.3.1 Two groups

When the explanatory variable takes only two values (e.g., male/female), we use a two-sample t-test and associated methods. The interpretation is similar to the paired t-test used in the previous section.

• The p-value gives the probability of observing group means at least as different as those in the sample if there were no real difference between the groups.
• The 95% confidence interval contains the true difference between the means of the two groups with probability 0.95.

Example: Stock exchange volatility again

Data: returns for 30 stocks listed on NASDAQ and NYSE for 9–13 May 1994.
We look at the absolute return in stock prices; this is a measure of volatility. For example, a market where stocks average a weekly 10% change in price (positive or negative) is more volatile than one which averages a 5% change.
Numerical summaries:
NASDAQ NYSE
Min. :0.00380 Min. :0.00260
1st Qu.:0.01745 1st Qu.:0.01120
Median :0.03930 Median :0.02480
Mean :0.04395 Mean :0.02913
3rd Qu.:0.05575 3rd Qu.:0.04010
Max. :0.12240 Max. :0.08910
Analysis of Variance Table

Response: absreturn
Df Sum Sq Mean Sq F value Pr(>F)
exchange 1 0.003293 0.003293 4.0405 0.04908 *
Residuals 58 0.047270 0.000815
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Confidence interval for difference of means


[0.0001, 0.0296]

Conclusion
There is some evidence (but not very strong evidence) that the NASDAQ is more volatile
than the NYSE.
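A sketch of this analysis in R, assuming a data frame returns with the numerical variable absreturn and the two-level factor exchange. The pooled two-sample t-test gives the same p-value as the one-way ANOVA, as noted above.

fit <- lm(absreturn ~ exchange, data = returns)
anova(fit)                             # the ANOVA table shown above
t.test(absreturn ~ exchange,
       data = returns, var.equal = TRUE)  # pooled t-test: same p-value, plus a CI for the difference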


7.4 Post-hoc tests

When we have a categorical explanatory variable with more than two categories, it is natural to
ask which categories are different from each other? For example, if the variable is “Day of the
week”, then are all days different from each other, or are weekends different from weekdays,
or something more complicated?

The inclusion or otherwise of the variable is determined by an F-test or an adjusted R2 value.


But to test differences between levels of a category requires a post-hoc test.
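One common post-hoc procedure is Tukey's "honest significant difference" test, which adjusts for the many pairwise comparisons being made. A minimal sketch in R; the data frame calls and the variables deviation and day are assumptions for illustration.

fit <- aov(deviation ~ day, data = calls)  # one-way ANOVA with a categorical explanatory variable
TukeyHSD(fit)  # adjusted confidence intervals and p-values for every pairwise difference between days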

7.5 SPSS exercises

7.5.1 Call centre patterns

We will use data on the number of calls to a Melbourne call centre. Download the data from
http://www.robhyndman.info/downloads/Calls.xls

and save it to your disk.

The variable Calls gives the total number of calls each day. The variable Trend gives the smooth trend through the data, eliminating the effect of daily fluctuations.

1. Produce a time plot of the data with the trend on the same graph. Can you explain the fluctuations in the trend?

2. Calculate the percentage deviation from the trend for each day.

3. Compute summary statistics and boxplots for the deviations for each day.

4. Use an ANOVA test to check whether the percentage deviations for each day are significantly different from each other.

5. Which days are significantly different?



CHAPTER 8
Dimension reduction

8.1 Factor analysis

Factor analysis is most useful as a way of combining many numerical explanatory variables into a smaller number of derived numerical variables.

Basic idea:

• You try to uncover some underlying, but unobservable quantities called “factors”. Each
variable is assumed to be (approximately) a linear combination of these factors.
• For example, if there are two factors called F1 and F2 , then the ith observed variable Xi
can be written as
Xi = b0 + b1 F1 + b2 F2 + error.
The coefficients b0 , b1 and b2 differ for each of the observed variables.
• The factors are assumed to be independent of each other.
• The factors are chosen so they explain as much of the variation in the observed variables
as possible.
• The factor loadings are the values of b1 and b2 in the above equation.
• Principal components analysis is the usual method for estimating the factors.
• The estimated factors (or scores) can be used as an explanatory variable in subsequent
regression models.
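In R, the factanal() function carries out this estimation and produces output in the style shown in the examples below. A minimal sketch, assuming the observed variables are the columns of a data frame dat:

fit <- factanal(dat, factors = 2, scores = "regression")  # varimax rotation by default
fit$loadings   # the factor loadings (the b coefficients for each variable)
fit$scores     # the estimated factor scores, one row per observation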


Example: National track records

The data on national track records for men are listed in the following table.

Country 100m 200m 400m 800m 1500m 5000m 10000m Marathon


(s) (s) (s) (min) (min) (min) (min) (min)
Argentina 10.39 20.81 46.84 1.81 3.70 14.04 29.36 137.72
Australia 10.31 20.06 44.84 1.74 3.57 13.28 27.66 128.30
Austria 10.44 20.81 46.82 1.79 3.60 13.26 27.72 135.90
Belgium 10.34 20.68 45.04 1.73 3.60 13.22 27.45 129.95
Bermuda 10.28 20.58 45.91 1.80 3.75 14.68 30.55 146.62
Brazil 10.22 20.43 45.21 1.73 3.66 13.62 28.62 133.13
...
Turkey 10.71 21.43 47.60 1.79 3.67 13.56 28.58 131.50
USA 9.93 19.75 43.86 1.73 3.53 13.20 27.43 128.22
USSR 10.07 20.00 44.60 1.75 3.59 13.20 27.53 130.55
W.Samoa 10.82 21.86 49.00 2.02 4.24 16.28 34.71 161.83

Figure 8.1: Scatterplot matrix of national track record data. All data in average metres per second. [Figure not reproduced.]

Correlation matrix:


100m 200m 400m 800m 1500m 5000m 10000m Marathon


100m 1.00 0.92 0.83 0.75 0.69 0.60 0.61 0.50
200m 0.92 1.00 0.85 0.80 0.77 0.69 0.69 0.59
400m 0.83 0.85 1.00 0.87 0.83 0.77 0.78 0.70
800m 0.75 0.80 0.87 1.00 0.91 0.85 0.86 0.80
1500m 0.69 0.77 0.83 0.91 1.00 0.93 0.93 0.86
5000m 0.60 0.69 0.77 0.85 0.93 1.00 0.97 0.93
10000m 0.61 0.69 0.78 0.86 0.93 0.97 1.00 0.94
Marathon 0.50 0.59 0.70 0.80 0.86 0.93 0.94 1.00

Loadings:
Factor1 Factor2
X100m 0.275 0.918
X200m 0.379 0.886
X400m 0.546 0.736
X800m 0.684 0.623
X1500m 0.799 0.527
X5000m 0.904 0.382
X10000m 0.911 0.387
Marathon 0.914 0.271

Factor1 Factor2
SS loadings 4.108 3.205
Proportion Var 0.513 0.401
Cumulative Var 0.513 0.914

Test of the hypothesis that 2 factors are sufficient.


The chi square statistic is 15.49 on 13 degrees of freedom.
The p-value is 0.278

Factor 1 seems to be mostly a measure of long-distance events. Factor 2 is mostly a measure of short-distance events.
Figure 8.2: Scatter plot of the two estimated factors. [Figure not reproduced; the outlying point labelled in the plot is Cook Islands.]


Example: Stock price data

Stock price data for 100 weekly rates of return on five stocks are listed below. The data were collected for January 1975 through December 1976. The weekly rates of return are defined as

Rate of return = (current Friday closing price − previous Friday closing price) / previous Friday closing price,

adjusted for stock splits and dividends.

Week Allied Chemical Du Pont Union Carbide Exxon Texaco


1 0.000000 0.000000 0.000000 0.039473 -0.000000
2 0.027027 -0.044855 -0.003030 -0.014466 0.043478
3 0.122807 0.060773 0.088146 0.086238 0.078124
4 0.057031 0.029948 0.066808 0.013513 0.019512
5 0.063670 -0.003793 -0.039788 -0.018644 -0.024154
...
99 0.050167 0.036380 0.004082 -0.011961 0.009216
100 0.019108 -0.033303 0.008362 0.033898 0.004566

Allied Chemical Du Pont Union Carbide Exxon Texaco


Allied Chemical 1.00 0.58 0.51 0.39 0.46
Du Pont 0.58 1.00 0.60 0.39 0.32
Union Carbide 0.51 0.60 1.00 0.44 0.43
Exxon 0.39 0.39 0.44 1.00 0.52
Texaco 0.46 0.32 0.43 0.52 1.00

Loadings:
Factor1 Factor2
Allied Chemical 0.683 0.192
Du Pont 0.692 0.519
Union Carbide 0.680 0.251
Exxon 0.621
Texaco 0.794 -0.439

Factor1 Factor2
SS loadings 2.424 0.567
Proportion Var 0.485 0.113
Cumulative Var 0.485 0.598

Test of the hypothesis that 2 factors are sufficient.


The chi square statistic is 0.58 on 1 degree of freedom.
The p-value is 0.448

Factor 1 is almost equally weighted across the five stocks and therefore indicates an overall measure of market activity. Factor 2 represents a contrast between the chemical stocks (Allied Chemical, Du Pont and Union Carbide) and the oil stocks (Exxon and Texaco); thus it measures an industry-specific difference.


Time series of stock returns

Figure 8.3: Time series of five stocks between January 1975 and December 1976. [Figure not reproduced; separate panels show the weekly returns for Allied Chemical, Du Pont, Union Carbide, Exxon and Texaco.]


Figure 8.4: Scatterplots of the five stocks. [Figure not reproduced.]


8.2 Further reading

8.2.1 Factor analysis


• J.F. Hair, R.E. Anderson, R.L. Tatham and W.C. Black (1998). Multivariate data analysis, 5th edition, Prentice Hall.



CHAPTER 9
Data analysis with a categorical response variable

9.1 Chi-squared test

Example: What’s your excuse?

The following results are from a survey of students’ excuses for not sitting exams.
United States France Britain
Dead grandparent 158 22 220
Car problem 187 90 45
Animal trauma 12 239 8
Crime victim 65 4 125
Do different nationalities have different excuses?

A two-way table consists of frequencies of observations split up by two categorical variables. Each combination of values for the two variables defines a cell. The question of interest is: is there a relationship between the two variables?

H0: there is no association between the two variables.
H1: there is an association (the two variables are not independent).
We use a χ² test: X² = Σ(O − E)²/E, where O = observed values and E = expected values (assuming independence).
• Large values of X 2 provide evidence against H0 .
• Small p-values provide evidence against H0 .


Notes on Chi-square tests


• The results are only valid if the cell counts are large.
For 2 × 2 tables: all expected cell counts ≥ 5. Where this is not true, use Fisher’s exact
test.
• The test is always two-sided.
• If there are too few values in a cell, try combining rows or columns.

Example: What’s your excuse?


Expected counts are printed below observed counts

US France Britain Total


1 158 22 220 400
143.66 120.85 135.49
2 187 90 45 322
115.65 97.29 109.07
3 12 239 8 259
93.02 78.25 87.73
4 65 4 125 194
69.67 58.61 65.71
Total 422 355 398 1175

ChiSq = 1.431 + 80.856 + 52.713 +


44.026 + 0.546 + 37.635 +
70.568 +330.222 + 72.459 +
0.314 + 50.886 + 53.491 = 795.146
df = 6, p = 0.000
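A sketch of this test in R. Entering the table as a matrix and calling chisq.test() should reproduce the statistic and expected counts above:

excuses <- matrix(c(158,  22, 220,
                    187,  90,  45,
                     12, 239,   8,
                     65,   4, 125),
                  nrow = 4, byrow = TRUE)
test <- chisq.test(excuses)
test            # X-squared = 795.1 on 6 degrees of freedom (cf. the hand calculation above)
test$expected   # expected counts under independence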


Example: Snoozing

How often do you press the snooze button in the morning?


Bentley College Babson College Total
Once 22 1 23
Twice 18 12 30
3 times 32 25 57
4 times 11 22 33
5 times 5 15 20
6+ times 12 25 37
Total 100 100 200
Expected counts are printed below observed counts
C1 C2 Total
1 22 1 23
11.50 11.50
2 18 12 30
15.00 15.00
3 32 25 57
28.50 28.50
4 11 22 33
16.50 16.50
5 5 15 20
10.00 10.00
6 12 25 37
18.50 18.50

Total 100 100 200

ChiSq = 9.587 + 9.587 +


0.600 + 0.600 +
0.430 + 0.430 +
1.833 + 1.833 +
2.500 + 2.500 +
2.284 + 2.284 = 34.468
df = 5, p = 0.000
So there is strong statistical evidence of an association between the variables: students from the two colleges differ in how often they press the snooze button.


Example: Survival and Pet ownership

Does having a pet help survival from coronary heart disease?


A 1980 study investigated 92 CHD patients who were classified according to
whether they owned a pet and whether they survived for 1 year.
Patient Status Owned Pet No Pet
Alive 50 28
Dead 3 11
Expected counts are printed below observed counts

Had.pet No.pet Total


Alive 50 28 78
44.93 33.07

Dead 3 11 14
8.07 5.93

Total 53 39 92

ChiSq = 0.571 + 0.776 +


3.181 + 4.323 = 8.851
df = 1, p = 0.003
1. Calculate relevant proportions.
2. What do you conclude?

9.2 Logistic and multinomial regression

Logistic regression is used when the response variable is categorical with two categories (e.g., Yes/No). The model allows the calculation of the probability of a "Yes" given the set of explanatory variables.

Multinomial regression is a regression model where the response variable is categorical with more than two categories.
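A minimal sketch of a logistic regression in R; the data frame dat and the variable names are assumptions for illustration:

# response is a two-level factor (No/Yes); glm() models the probability of the second level
fit <- glm(response ~ age + income, data = dat, family = binomial)
summary(fit)                      # coefficients on the log-odds scale
predict(fit, type = "response")   # fitted probabilities of a "Yes"
# For a response with more than two categories, multinom() in the nnet package
# fits a multinomial regression in much the same way.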

Useful reference
• Kleinbaum, D.G., and Klein, M. (2002) Logistic regression: a self-learning text, 2nd edition, Springer-Verlag.


9.3 SPSS exercises

1. Repeat the examples in Section 9.1 using SPSS to find the p-values.

2. In a study of health in Zambia, people were rated as having ‘good’, ‘fair’ or ‘poor’ health.
Similarly, the economy of the village in which each person lived was rated as ‘poor’, ‘fair’
or ‘good’. For the 521 villagers assessed, the following data were observed.

Health
Village Good Fair Poor Total
Poor 62 103 68 233
Fair 50 36 33 119
Good 80 69 20 169
Total 192 208 121 521

(a) Find a 95% confidence interval for the proportion of poor villages in Zambia.

(b) Use SPSS to carry out a chi-squared test for independence on these data.

(c) Explain in one or two sentences how these data differ from what you would expect
if health and village were independent.

(d) Do these data show that economic prosperity causes better health? Explain in one
or two sentences.

(e) Consider now only people from poor villages. What proportion of these people
have health that is rated less than good? Give a 95% confidence interval for this
proportion.

(f) An alternative approach to this problem would have been to measure health numer-
ically for each person. What sort of analysis would have been most appropriate in
that case?



CHAPTER 10
A survey of statistical methodology

I use a decision tree based on the type of response variable and the type of explanatory variable(s).

Recall: Response and explanatory variables

Response variable: measures the outcome of a study. Also called the dependent variable.

Explanatory variable: attempts to explain the variation in the observed outcomes. Also called independent variables.

Many statistical problems can be thought of in terms of a response variable and one or more explanatory variables.

• Study of level of stress-related leave amongst Australian small business employees.
  – Response variable: No. of days of stress-related leave in a fixed period.
  – Explanatory variables: Age, gender, business type, job level.
• Return on investment in Australian stocks.
  – Response variable: Return.
  – Explanatory variables: industry, risk profile of company, etc.


Taxonomy of statistical methodology

RESPONSE VARIABLE: Numerical

EXPLANATORY VARIABLE:  None                             Numerical    Categorical
Graphics               Boxplot, histogram               Scatterplot  Side-by-side boxplots
Summary Stats          Mean, percentiles, IQR, st.dev.  Correlation  Mean, st.dev., percentiles by group
Methods                t-test, confidence intervals     Regression   2-sample t-test (2 groups), one-way ANOVA
In general             Regression, General Linear Model

RESPONSE VARIABLE: Categorical

EXPLANATORY VARIABLE:  None                      Numerical               Categorical
Graphics               Bar chart                 Side-by-side boxplots   Side-by-side bar charts
Summary Stats          Percentages, proportions  Mean, st.dev. by group  Percentages by group, contingency tables
Methods                Confidence intervals      Logistic regression     Chi-square test
In general             Logistic regression, Generalized Linear Model


Numerical response variable, no explanatory variables


• Graphical summaries: boxplot, histogram.
• Numerical summaries: percentiles, mean, median, standard deviation, etc.
• Statistical methods: confidence intervals and t-test

Categorical response variable, no explanatory variables


• Graphical summaries: bar chart
• Numerical summaries: group frequencies/percentages
• Statistical methods: confidence interval for proportion.

Numerical response variable, one numerical explanatory variable

• Graphical summaries: scatterplots
• Numerical summaries: correlation
• Statistical methods: regression
Examples: income and age; sales and advertising; number of accidents and length of time worked; turnover of company and number of employees.

Categorical response variable, one categorical explanatory variable

• Graphical summaries: bar charts and variants
• Numerical summaries: cross tabulations
• Statistical methods: test for two proportions; χ² test for independence
Examples: gender and voting preference; religion and education level; head office location and industry.
(Note: response and explanatory variables reversible)

Numerical response variable, one categorical explanatory variable

• Graphical summaries: boxplots
• Numerical summaries: group means, standard deviations
• Statistical methods: t-tests; one-way ANOVA
Examples: strength of agreement to a survey question and location; income and gender; stock return and use of hedging.


Categorical response variable, one numerical explanatory variable

• Graphical summaries: boxplots
• Numerical summaries: group means, standard deviations
• Statistical methods: logistic regression
Examples: mortality and pollutant level; language spoken and length of time in Australia; bankruptcy and level of investment risk.

Numerical response variable, several explanatory variables


Statistical methods
• Multiple regression
(all numerical explanatory variables)
• multi-way ANOVA
(all categorical explanatory variables)
• General linear model

Categorical response variable, several explanatory variables


Statistical methods
• Multi-way contingency table and χ² test
(all categorical explanatory variables).
• Multiple logistic regression
(all numerical explanatory variables)
• Generalized linear model

Other methods

MULTIVARIATE METHODS
• Factor analysis
• Principal components
Used where there is more than one numerical response variable, or to combine many explanatory variables into a smaller number.

TIME SERIES
• Forecasting
• Spectral analysis
Used where the response variable is observed over time and where time is treated like an explanatory variable.


Case 1: Lactobacillus counts

Lactobacillus counts were measured in 21 people with different degrees of susceptibility to dental caries.
Group 1: Rampant caries (5+ new lesions in past year)
Group 2: Normal caries (1–4 new lesions in past year)
Group 3: Caries resistant (no lesions in the past year)
Lactobacillus counts (in thousands):
Group 1 118 562 722 238 169 133 201
Group 2 422 109 261 147 330 97
Group 3 278 150 69 164 95 131 170 68
Response variable:
Explanatory variables:
Type of analysis appropriate:
[Side-by-side boxplots of Lactobacillus counts (100-700, thousands) for the three groups: rampant caries, normal caries, caries resistant.]

Analysis of Variance on Lactobil


Source DF SS MS F p
Caries. 2 102691 51345 2.02 0.162
Error 18 458242 25458
Total 20 560933
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ---------+---------+---------+-------
1 7 306.1 237.4 (----------*---------)
2 6 227.7 131.9 (----------*----------)
3 8 140.6 68.6 (---------*---------)
---------+---------+---------+-------
Pooled StDev = 159.6 120 240 360


Case 2: Charlie’s chooks


157 chicken farmers are worried about bird mortality. There are two types of birds: Tegel
(Australian) and imported.
Response variable:
Explanatory variables:
Type of analysis appropriate:

[Scatterplot of Y: percentage mortality (4-14) against X: percentage Tegel birds (0-100).]

Regression analysis: Mortality and Tegel.pc

Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 4.2986 0.7063 6.0864 0.0000
tegelpc 0.0168 0.0079 2.1276 0.0350

Residual standard error: 1.852 on 155 degrees of freedom


Multiple R-Squared: 0.02838
F-statistic: 4.527 on 1 and 155 degrees of freedom,
the p-value is 0.03495
[Scatterplot of Y: percentage mortality against X: percentage Tegel birds, repeated.]


Example 1: Stress-related leave


What factors contribute to stress-related leave amongst Australian small business employees?
Response variable:
Explanatory variables:
Type of analysis appropriate:

Example 2: Return on investment


What types of Australian companies give the greatest return on investment?
Response variable:
Explanatory variables:
Type of analysis appropriate:

Example 3: English language proficiency


Large study of migrants from many different non-English speaking backgrounds. Want to
know what variables affect ability to speak English well.
Response variable:
Explanatory variables:
Type of analysis appropriate:



CHAPTER 11
Further methods

11.1 Classification and regression trees

A method for predicting a response variable given a set of explanatory variables constructed
using a “tree”. This is constructed using binary splits of the explanatory variables. The re-
sponse variable can be categorical or numerical. The explanatory variables can be categorical
or numerical.
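A minimal sketch in R using the rpart package; the data frame students and the variable names are assumptions for illustration:

library(rpart)
# method = "class" requests a classification tree for a categorical response
fit <- rpart(completed ~ faculty + course + age + gender + publications,
             data = students, method = "class")
plot(fit)   # draw the tree
text(fit)   # add the split labels and node information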

Example: completions of research students at Monash University


The data were provided by the Monash Research Graduate School (MRGS). There were 445
HDR students in the data set: 200 from the 1993 cohort and 245 from the 1994 cohort. These
were all the students who had received an Australian Postgraduate Award (APA), a Monash
Graduate Scholarship (MGS) or some other centrally-administered postgraduate scholarship.
The completion status of these students in July 2001 fell into two categories:
1. Completed, for students having their theses passed and degrees awarded, or whose theses
were submitted but still under examination;
2. Not completed, for all other students including students continuing to study in July 2001,
students who had discontinued candidature and those whose theses were submitted but
not passed.
The following ten variables were considered for their potential effect on students’ completion
status.
1. Faculty: the faculty that each student enrolled in. There are ten faculties.
2. Course: PhD or Master.
3. Age: the age in years of each student on the date of enrolment.
4. Gender: male or female.
5. Lab-based: whether the research of a student is laboratory-based or not.
6. International student: whether a student is an international or a local student.
7. First language: whether a student’s first language is English or not.
8. Admission degree: Honours, Masters or pass.


9. Admission qualification: the honours level of each student's admission qualification. This will be first class honours (H1), second class honours level A (H2A) or second class honours level B (H2B). Note that Masters candidates were classified as either H1, H2A or H2B equivalent.
10. Publications: whether a student had any publications when admitted to enter the course.
The tree is constructed by recursively partitioning the data into two groups at each step. The variable used for partitioning the data at each step is selected to maximize the differences between the two groups. The splitting process is repeated until the groups are too small to be split into significantly different groups.

Classification tree

Figure 11.1: Classification tree for completion status. [Figure not reproduced. The root node (completion probability 0.66, n = 428) splits on Faculty; later splits use age on enrolment, publication status, admission qualification, admission degree and international student status. Each node shows the estimated completion probability and the number of students in the node.]

Figure 11.1 shows a classification and regression tree for the completion rate. Only six of the
ten variables were significant and used in the tree construction. These were:
• faculty
• age on enrolment
• admission qualification
• admission degree
• international student
• prior publication
On this tree, the variable used to split the data set is shown at each node. At each leaf or
terminal node, the class split at the upper node is displayed with its completion probability.
Note that although there were 445 students in the data set, 11 students had their admission qualifications missing, 2 students had their admission degrees missing and a further four students had other data missing. These students were all removed from the data to be able to estimate the tree structure. So there are 428 observations used in the tree.
The following conclusions can be drawn from this analysis:
• The most important variable is Faculty, with BusEco, IT, Law, Medicine, Pharmacy and Science students having a higher completion probability than students from other faculties. Arts, in particular, had a lower completion probability than the other faculties.

• For Arts students, the next most important variable was age, with young students (enrolment age less than 22 years) having a much higher completion rate than older students. For the older Arts students, international students performed better.

• For students from BusEco, IT, Law, Medicine, Pharmacy and Science, the situation is more complex. Students with a publication had a higher completion rate, especially if they also had a Masters degree. Students without a publication did well if they had an H2A entry rather than an H1 entry. For students with no publication and an H1 entry, the older students (enrolment age greater than 23 years) did the worst.

• The groups having worst performance are:

– Arts students with completion probability of 0.47. Of these, students aged 22 or more
on enrolment had completion probability of 0.45 (and only 0.41 for non-international
students).

• The groups having best performance are:

– BusEco, IT, Law, Medicine, Pharmacy and Science students with a publication had
completion probability of 0.88 (and 100% for Masters students with a publication).
– Law, Pharmacy and Medicine students without a publication and over 23 on enrol-
ment had a completion probability greater than 0.9.

Further reading
• Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984) Classification and regression trees, Wadsworth & Cole: Belmont, CA.

11.2 Structural equation modelling

Sets of linear equations used to specify phenomena in terms of presumed cause-and-effect variables. Some variables can be unobserved.

Further reading
• Schumacker, R.E., and Lomax, R.G. (1996) A beginner's guide to structural equation modeling, Hillsdale, NJ: Lawrence Erlbaum Associates.
• Kline, R.B. (1998) Principles and practice of structural equation modeling, Guilford: New York.


11.3 Time series models

These are models of time series data and are usually designed for forecasting. The most common models are:
• exponential smoothing;
• ARIMA (or Box-Jenkins) models;
• VAR models (for modelling several time series simultaneously).
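A minimal sketch of the first two in R, assuming a seasonal time series object sales (e.g., monthly sales created with ts()):

fit1 <- HoltWinters(sales)                 # exponential smoothing (level, trend and season)
predict(fit1, n.ahead = 12)                # forecasts for the next 12 periods
fit2 <- arima(sales, order = c(1, 1, 1))   # an ARIMA (Box-Jenkins) model
predict(fit2, n.ahead = 12)                # forecasts with standard errors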

Further reading
• Makridakis, S., Wheelwright, S., and Hyndman, R.J. (1998). Forecasting: methods and applications, John Wiley & Sons: New York. Chapters 4 and 7.

11.4 Rank-based methods

Replacements for t-tests and ANOVA when the data are not normally distributed.
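A minimal sketch in R; the data frame dat, with numerical response y, two-level factor group and multi-level factor treatment, is an assumption for illustration:

wilcox.test(y ~ group, data = dat)       # rank-based replacement for the two-sample t-test
kruskal.test(y ~ treatment, data = dat)  # rank-based replacement for one-way ANOVA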

Further reading
• Gibbons, J.D., and Chakraborti, S. (2003). Nonparametric statistical inference, CRC Press.



CHAPTER 12
Presenting quantitative research

12.1 Numerical tables

• Give only as many decimal places as are accurate, meaningful and useful.

• Make sure decimal points are aligned.

• Use horizontal or vertical lines to help the reader make the desired comparisons.

• Avoid giving all grid lines.

• Give meaningful column and row headings.

• Give a detailed caption.

• A table should be as self-explanatory as possible.

• Do readers really need to know all the values?

• Replace large tables with graphs where possible.


12.2 Graphics

12.2.1 The purpose of data-based graphics


Data graphics are paragraphs about data.
E.R. Tufte
Communication: Graphs provide a visual summary of data. The intention is usually to display
discovered patterns in the data.
Analysis: Graphs reveal facts about data that are difficult to detect from tables. The intention
is usually to discover patterns in the data.
There was initial resistance to the use of graphics for understanding data. In the 1800s, the
Royal Society requested that the automatically recorded graphs of an early weather clock be
“reduce[d] into writing . . . that thereby the Society might have a specimen of the
weather-clock’s performances before they proceed to the repairing of it.”

12.2.2 Produce meaningful graphs


A graph is designed to represent data using symbols such as the length of a line, or the distance
from an axis. The graphical elements should be proportional to the data. Some people are
tempted to over-decorate a graph and forget about the purpose of the graph: to show the data.
• Keep graphical elements proportional to data.
• Don’t let decoration destroy content.
• Make graphs informative, not just impressive.

12.2.3 Avoid design variation


The only element of a graph which should change is the element proportional to the data. If
other elements also change with the data, the meaning of the graph is easily lost.
• Keep the scales constant.
• Keep non-data features constant.
• Don’t use perspective as it is difficult to distinguish changes in data with changes due to
perspective.
• Don’t plot one dimensional data with two or three dimensional symbols.

12.2.4 Avoid scale distortion


Some data need to be scaled before any meaningful comparisons can be made. For example,
monetary values change due to inflation, monthly sales change due to the number of trading
days in a month, government income from taxes changes with the population.
• Scale for population changes
• Scale for inflation
• Scale for different time periods

12.2.5 Avoid axis distortion


By changing an axis, a graph can be made to look almost flat or wildly fluctuating. Such extremes should be avoided, and those looking at a graph should also look at the scales.


Some graphs mislead by showing a small section of data rather than the whole context in which
the data lie. A decrease in sales over a few months may be part of a long decreasing trend or a
momentary drop in a long increasing trend.
We are accustomed to interpreting graphs with the dependent variable (e.g. sales) on the vertical axis and the independent variable (e.g. time) on the horizontal axis. Flouting convention can be misleading.
Some graphs attempt to show data over a wide range by using a broken axis. This also can be
misleading. If the data range is too wide for the graph, the data is probably on the wrong scale.
• Look carefully at the axis scales
• Show context
• Keep dependent variable on vertical axis
• Avoid broken axes

12.2.6 Graphical excellence


Graphical displays should:
• show the data
• induce the viewer to think about the substance rather than about methodology, graphic
design, the technology of graphic production, or something else.
• make large data sets coherent
• present many numbers in a small space
• encourage the eye to compare different pieces of data
• reveal the data at several levels of detail, from a broad overview to the fine structure
• serve a reasonably clear purpose: description, exploration, tabulation, or decoration
• be closely integrated with the statistical and verbal descriptions of a data set.
Graphical integrity is more likely if these six principles are followed:
• The representation of numbers, as physically measured on the surface of the graphic
itself, should be directly proportional to the numerical quantities represented.
• Clear, detailed, and thorough labeling should be used to defeat graphical distortion and
ambiguity. Write out explanations of the data on the graphic itself. Label important
events in the data.
• Show data variation, not design variation.
• In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.
• The number of information-carrying (variable) dimensions depicted should not exceed
the number of dimensions in the data.
• Graphics must not quote data out of context.
We choose to use graphs for a very good reason. Human beings are well equipped to recognise
and process visual patterns. Much of the processing power of the human brain is dedicated to
visual information. Of all our senses, vision is the most dominant. When data are presented
graphically, we can see the picture much more quickly than we would from the numbers themselves. We also tend to see subtleties that would be invisible if the data were presented in a
table.
Good graphs are a very effective way to present quantitative information. Bad graphs can be
misleading, or at best, useless.


Graphical competence demands three quite different skills: the substantive, statistical, and artistic. Yet now most graphical work, particularly at news publications, is under the direction of but a single expertise—the artistic. Allowing artist-illustrators to control the design and content of statistical graphics is almost like allowing typographers to control the content, style and editing of prose. Substantive and quantitative expertise must also participate in the design of data graphics, at least if statistical integrity and graphical sophistication are to be achieved.
E.R. Tufte

12.2.7 Cleveland’s paradigm


A graph encodes quantitative and categorical information using symbols, geometry and color.
Graphical perception is the visual decoding of the encoded information.
• The graph may be beautiful but a failure: the visual decoding has to work.
• To make graphs work we need to understand graphical perception: what the eye sees
and can decode well.
• We need a paradigm that integrates statistical graphics and visual perception.
Cleveland’s paradigm has been developed from intuition about works well in a graph, from
theory and experiments in visual perception, and experiments in graphical perception. There
are three basic elements:
1. A specification and ordering of elementary graphical-perception tasks.
2. The rôle of distance in graphical perception.
3. The rôle of detection in graphical perception.
Cleveland identified ten properties of graphs which relate to the judgements we make to visually decode quantitative information from graphs:

• Angle
• Area
• Colour hue
• Colour saturation
• Density (amount of black)
• Length (distance)
• Position along a common scale
• Position along identical, non-aligned scales
• Slope
• Volume

Cleveland’s conclusions are that there is an ordering in the accuracy with which we carry out
these tasks. The order, from most accurate to least accurate, is:
1. Position along a common scale
2. Position along identical, non-aligned scales
3. Length
4. Angle and slope
5. Area
6. Volume
7. Colour hue, colour saturation, density
Some of the tasks are tied in the list; we don’t have enough insight to determine which can be
done more accurately.
This leads to the basic principle:


Encode data on a graph so that the visual decoding involves tasks as high as
possible in the ordering.
There are some qualifications:
• It’s a guiding principle, not a rule to be slavishly followed;
• Detection and distance have to be taken into account; they may sometimes override the
basic principle.
This paradigm implies the following, which is not a systematic list, but a number of examples
of the insights which follow from it.
1. Pie charts are not a good method of graphing proportions because they rely on comparing
angles rather than distance. A better method is to plot proportions as a bar chart or dot
chart. It is also easier to label a bar chart or dot chart than a pie chart.

2. Categorical data with a categorical explanatory variable are difficult to plot. A common
approach is to use a stacked bar chart. The difficulty here is that we need to compare
lengths rather than distances. A better approach is the side-by-side bar chart which leads
to distance comparisons. Ordering the groups can assist making comparisons. However,
side-by-side bar charts can become very cluttered with several group variables.

3. Time series should be plotted as lines with time on the horizontal axis. This enables
distance comparisons, emphasises the ordering due to time and allows several time series
to be plotted on the same graph without visual clutter.

4. Avoid representing data using volumes.

5. If a key point is represented by a changing slope, consider plotting the rate of change
itself rather than the original data.

6. Think of simplifications which enhance the detection of the basic properties of the data.

7. Think of how the distance between related representations of data affects their interpre-
tation.

12.2.8 Aesthetics and technique in data graphical design


Thoughts from Tufte:
• Graphical elegance is often found in simplicity of design and complexity of data.
• Graphs should tend towards the horizontal, greater in length than height (about 50%
wider than taller).
• There are many specific differences between friendly and unfriendly graphics:


Friendly: words are spelled out; mysterious and elaborate encoding avoided.
Unfriendly: abbreviations abound, requiring the viewer to sort through text to decode abbreviations.

Friendly: words run from left to right, the usual direction for reading occidental languages.
Unfriendly: words run vertically, particularly along the Y-axis; words run in several different directions.

Friendly: little messages help explain data.
Unfriendly: graphic is cryptic, requires repeated references to scattered text.

Friendly: elaborately encoded shadings, cross-hatching and colors are avoided; instead, labels are placed on the graphic itself; no legend is required.
Unfriendly: obscure codings require going back and forth between legend and graphic.

Friendly: graphic attracts viewer, provokes curiosity.
Unfriendly: graphic is repellent, filled with chartjunk.

Friendly: colors, if used, are chosen so that the color-deficient and color-blind (5 to 10 percent of viewers) can make sense of the graphic (blue can be distinguished from other colors by most color-deficient people).
Unfriendly: design insensitive to color-deficient viewers; red and green used for essential contrasts.

Friendly: type is clear, precise, modest.
Unfriendly: type is clotted, overbearing.

Friendly: type is upper-and-lower case, with serifs.
Unfriendly: type is all capitals, sans serif.

References
• Cleveland, W.S. (1985) The elements of graphing data, Wadsworth.
• Cleveland, W.S. (1993) Visualizing data, Hobart Press.
• Tufte, E.R. (1983) The visual display of quantitative information, Graphics Press.
• Tufte, E.R. (1990) Envisioning information, Graphics Press.



The Quality Magazine 7 #4 (1998), 64-68.

Good Graphs for Better Business

W.S. Cleveland & N.I. Fisher

Are we on track with our target for market share? Is our installation process capable of meeting the industry benchmark? What things are causing most of the unhappiness with our staff?

These are typical of the management problems that require timely and accurate information. They are also problems for which effective use of graphs can make a big difference.

The good news is that graphical capabilities are now readily available in statistical, spreadsheet, and word-processing packages. The bad news is that much of this graphical capability produces graphs that can hide or seriously distort crucial information contained in the data.

For a long time, how to make a good graph was largely a matter of opinion. However, the last 20 years have seen the development of a set of principles for sound graphical construction, based on solid scientific research and experimentation. Good entry points to these principles are provided in the References. In this article, we look at a couple of common examples of applying these principles.

It is helpful to think about the whole process of graphing, shown schematically in Figure 1. The crucial question is: How does the choice of graph affect the information as perceived by the recipient of the graph?

Figure 1. The graphical process involves extraction of information from data, a decision about which patterns are to be displayed, and then selection of a type of graph that will reveal this pattern to the user, without distortion. [Schematic not reproduced: Data, Analysis and Interpretation, Information, Encoding (the graph), Decoding, Decoded Information, Business decision.]

To see how graphs can conceal, or reveal, information, consider the humble pie chart. Figure 2 shows data on the contributions to enterprise profits of a particular product in various regions R1, R2, … around Australia (labels modified from original, but retaining the same ordering, which was alphabetical). What information is this supposed to be purveying? Certainly the caption doesn't enlighten us. If we wanted the actual percentage share for each region, we should simply use a table: tables are intended to provide precise numerical data, whereas the purpose of graphs is to reveal pattern.

Figure 2. Pie chart, showing the relative contributions to the profits of an enterprise from various Divisions around Australia. [Figure not reproduced.]

For a more elaborate example, we turn to another popular graphical display, the divided bar chart for trend data. Figure 3 shows data on market share of whitegoods sales for an eighteen-month period, based on monthly industry surveys. What can we glean from this? Total sales aren't changing. Manufacturer 1 has the biggest market share. Is there nothing else?


Figure 3. Divided bar chart, showing monthly sales of different brands of whitegoods over an 18-month period. [Figure not reproduced.]

Returning to Figure 2, no obvious patterns emerge. So what's happened in the graphical process? Perhaps the Analysis and Interpretation step wasn't carried out. So, let's try a different plot that shows things more simply, a dotplot: see Figure 4.

Figure 4. A dotplot of the data used in Figure 2, showing the relative contributions to enterprise profits from its various Divisions around Australia. The discrete nature of the data is immediately evident. [Figure not reproduced.]

A simple pattern emerges immediately: there are only five different levels of contribution; some rounding of the raw data has been performed. Why wasn't this evident in the pie chart?

The answer to this is provided by one of the fundamental tenets of graphing. Detection of pattern with this type of data is best done when each measurement is plotted as a distance from a common baseline. The baseline in Figure 4 is the left vertical axis, and we're simply comparing horizontal lengths that are vertically aligned.

On the other hand, in the pie chart, we're trying to compare angles, and very small angles at that. This sort of comparison is known to be imprecise. Similar problems occur when the data are graphed in other colourful or picturesque ways, such as when their sizes are represented by 3-dimensional solid volumes.

However, we haven't finished with this data set yet. At least one more step should be taken: re-order the data so that they plot from largest to smallest. The final result is shown in Figure 5.

Figure 5. The display from Figure 4 has been modified, so that the data plot from largest to smallest. A further pattern emerges: different States tend to contribute differently to enterprise profits. [Figure not reproduced.]

A (potentially) more important pattern has now emerged: different States are not contributing equally to group profits. This information may be vital in helping management to identify a major improvement opportunity. We create one more graph to bring this out: see Figure 6.

What can we now say in defence of Figure 2 in the light of Figures 5 and 6? Really, only that it was produced from a spreadsheet at the press of a button. But judged by a standard of how well it conveys information, it has failed. This is typical of pie charts.

Now let's re-visit the market share data plotted in Figure 3. What aspects of this graph might be hindering or preventing us from seeing important pattern?


Figure 6. The contributions to group profit by different regions are plotted by State. The clear differences between States are evident. [Figure not reproduced.]

We can get some idea of what's happening with the goods sold by Manufacturer 1: probably not much. We see this pattern because we're effectively comparing aligned (vertical) lengths: the 18 values of monthly sales for this Manufacturer are all measured up from a common baseline (the horizontal axis).

However, this isn't the case for any of the other variables. For example, the individual values for the second manufacturer are measured upwards from where the corresponding values for the first manufacturer stop: the lengths are not aligned. So the first step is to re-plot the data in order that, for each variable, we are comparing aligned lengths. See Figure 7.

Figure 7. The monthly sales data from Figure 3 have been replotted so that sales patterns for each manufacturer can be seen without distortion. However, the curve for the dominant manufacturer is compressing patterns in the other curves. [Figure not reproduced.]

We have made some progress. Manufacturer 1 is more easily studied, and there is little evidence of anything other than random fluctuations around an average monthly sales volume of about 25,000 units. However, possible patterns in the other curves are difficult to detect because the scale of the axes is adjusted to allow all the data to be plotted on the same graph. The next step is to display the dominant curve separately, and to use a better aspect ratio when plotting the other variables. See Figure 8.

Figure 8. The monthly sales data from Figure 7 have been replotted so that the dominant curve is displayed separately, with a false origin, and the other curves that are measured on a much smaller scale can then be plotted using a better aspect ratio. This reveals more information about individual and comparative trends in the curves. [Figure not reproduced.]

Now some very interesting patterns emerge. It is evident that sales for Manufacturer 2 have been in decline for most of this period. Not much is happening with Manufacturer 4, who is averaging about 2500 units. However, there is something happening with Manufacturers 3 and 5. From months 6 to 18, sales for Manufacturer 3 rose significantly and then declined. This appears to have been at the expense of Manufacturer 5, whose sales declined significantly and then recovered. If this comparison is really of interest, it should be studied separately. The difference is plotted in Figure 9.


Figure 9. This graph shows the difference between sales of Manufacturers 5 and 3. Over the period January – July 1997 there was a marked increase in sales in favour of Manufacturer 5; after July, this advantage declined steadily to the end of the year. [Figure not reproduced.]

One plausible explanation would be a short but intense marketing campaign conducted during this period; there may be others. The main point is that appropriate graphs have elicited very interesting patterns in the data that may well be worthy of further exploration. The divided bar chart has done poorly; this is typically the case.

There is much to be learnt about constructing good graphs, to do with effective use of colour, choice of aspect ratio, displaying large volumes of data, use of false origin, and so on. These issues are discussed at length in the references.

To summarise, what basic messages can we draw from these simple examples? There are several:

• the process of effective graph construction begins with simple analyses to see what sorts of patterns, that is, information, are present
• for many graphs, pattern detection is far more acute when the data are measured from a common baseline, so that we are comparing aligned lengths
• re-arranging the values in a dot plot so that they are in decreasing order provides greatly enhanced pattern recognition
• graph construction is an iterative process
• sometimes more than one graph is needed to show all the interesting patterns in the best way
• two very commonly-used displays, pie charts and divided bar charts, typically do a poor job of revealing pattern

Can you afford not to be using graphs in the way information is reported to you, or the way you are reporting it? How else can vital patterns be revealed and presented, and so provide effective input to decision-making at all levels of your enterprise?

References

Cleveland, William S. (1993), Visualizing Data. Hobart Press, Summit, New Jersey.

Cleveland, William S. (1994), The Elements of Graphing Data. Second edition. Hobart Press, Summit, New Jersey.

Tufte, Edward R. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut.

Bill Cleveland is a member of the Statistics & Data Mining Research Department of Bell Labs, Murray Hill, NJ, USA.

Nick Fisher is a member of the Organisational Performance Measurement and Data Mining research groups in CSIRO.


CHAPTER 13
Readings

24 July 2008 Claver and Quer (2001)


31 July 2008 Baugher, Varanelli and Weisbord (2000)
14 August 2008 Kanazawa and Kovar (2004)
21 August 2008 Yilmaz and Chatterjee (2003)
4 September 2008 Lanen (1999)
11 September 2008 Rozell, Pettijohn and Parker (2002)
18 September 2008 Lewis (2001)

