• be introduced to the concepts of data contamination and outliers, and learn the
distinction between them
• Nonetheless, the data analyst must be critical of the data – it is not always the truth!
12 CHAPTER 2. EXPLORATORY DATA ANALYSIS
– the variable is correctly recorded, but the subject should not have been there in
the first place (e.g., a study on flowers from the species Iris setosa accidentally
includes a specimen from the species Iris versicolor)
– ...
2.2.2 Outliers
• A very unusual subject is called an outlier
– There are even cases where the exploration causes the investigator to reconsider
how to pose the scientific questions to be investigated and the experimental
design to be used
Hints
☞ When screening the data for errors it is useful to check the minimum and maximum
for unusual values
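A minimal sketch in R of this screening step. The vector age and its values are made up for illustration and do not come from any of the data sets; the cut-off of 120 years is likewise an arbitrary assumption.

```r
# Hypothetical ages in years; 250 is a deliberately implausible entry.
age <- c(23, 31, 27, 250, 19, 35)

# The minimum and maximum are the first places to look for recording errors.
age_range <- range(age)
print(age_range)

# summary() reports the extremes alongside the quartiles and the mean.
print(summary(age))

# Locate the suspect observation so that it can be checked against the source.
suspect <- which(age > 120)
print(suspect)
```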
• Sometimes, for some reason, there is a subject for which one of the variables is
not known
• Perhaps it was not measured in the first place
• Perhaps the value has been lost
• Perhaps the measuring equipment failed at that point
• We shall use the code NA in the place where the number should have been
☞ If the data have been entered by someone else, you must ascertain what code they
used for missing values
☞ Sometimes a value that could not be legitimate is used as a code for a missing value
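The recoding of such a missing-value code can be sketched in R as follows. The code -99 and the vector reading are assumptions chosen for illustration; the actual code must be ascertained from whoever entered the data.

```r
# Hypothetical readings in which -99 was entered wherever a value was missing.
reading <- c(4.1, 3.8, -99, 4.4, -99, 3.9)

# Replace the sentinel code with NA so that R treats it as missing.
reading[reading == -99] <- NA

# Count the missing values and summarise the observed ones.
n_missing <- sum(is.na(reading))
mean_observed <- mean(reading, na.rm = TRUE)
print(n_missing)
print(mean_observed)
```

Without na.rm = TRUE the mean of a vector containing NA is itself NA, which is a useful reminder that missing values are present.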
• The area of each rectangle represents the proportion of observations in the bin or
class interval covered by its base
– In most cases, the bins are of equal width, in which case, the vertical axis can
equivalently be simply the number or frequency of observations in the
corresponding bin
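The area property can be checked numerically. The following is a sketch only: the data are simulated (the sample size, mean and standard deviation are loosely based on Table 2.1, but these are not the real lung capacity data), and the name x is an assumption.

```r
# Simulated stand-in for a continuous variable (not the real data).
set.seed(1)
x <- rnorm(127, mean = 2.9, sd = 0.5)

# plot = FALSE returns the bin information without drawing.
h <- hist(x, plot = FALSE)

# Area of each rectangle = density (height) times bin width.
areas <- h$density * diff(h$breaks)
print(sum(areas))

# With equal-width bins, the frequency scale is just a rescaling of the heights.
hist(x, freq = TRUE, main = "Frequency scale", xlab = "x")
```

Each area equals the proportion of observations in the bin, so the areas sum to 1.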
❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)
• Construct a histogram of the boys’ lung capacities (measured as forced vital capacity
in litres)
• The histogram of the boys’ lung capacity data is shown in Figure 2.1
• The analyst uses the histogram and other summaries as investigating tools
– The examples do not form an exhaustive list, nor are they mutually exclusive
2.4. EXPLORING A SINGLE CONTINUOUS VARIABLE 15
Figure 2.1: Histogram of the boys’ lung capacities (measured as forced vital capacity in
litres). Vertical axis: frequency; horizontal axis: Boys.Lung.Capacity$fvc.
Normal
• One central concentration of values (unimodal – one mode), with values tailing off
symmetrically each side in a characteristic bell shape
Uniform
– The end rectangles of the histogram could be relatively much shorter – this is
just an artifact due to misalignment of bins and the natural range of the data
Positive Skew
• A long tail of values at the high end and an abrupt tail at the low end
– Some years may be very wet but precipitation cannot go below zero
Negative Skew
• A long tail of values at the low end and an abrupt tail at the high end
Bimodal
– Often indicates two sub-classes within the population from which the sample
was drawn
– Perhaps these should be examined separately
❡ Illustrative Example – The Obstetric Census data (Data Set A.6 in Appendix A of
the Subject Reader)
• Thus the boxplot depicts where the data are located and gives a coarse indication of
how they are distributed. The height of the box can be taken as a crude measure of
the spread of the data
– Outliers in the data set can give rise to a misleading graph, as the whisker
would simply go uncritically out from the quartile to the outlier
– If the length of a whisker would have to exceed 1.5× the height of the box
in order to reach the maximum (or minimum) value in the sample, then the
whisker is terminated at the most extreme value within 1.5 box-lengths and all
other points are shown separately
– Such points are interpreted as outliers
• Observe that there are a few high outliers in the mothers’ ages data set
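The whisker rule can be sketched in R with boxplot.stats(). The ages below are invented for illustration. Note that R measures the box by Tukey's hinges, which can differ slightly from quartiles computed by other methods.

```r
# Hypothetical ages in years; 45 sits well above the rest.
age <- c(18, 21, 22, 24, 25, 26, 27, 29, 30, 45)

# boxplot.stats() applies the 1.5 box-length rule (coef = 1.5 is the default).
bs <- boxplot.stats(age, coef = 1.5)

box_height <- bs$stats[4] - bs$stats[2]       # upper hinge minus lower hinge
upper_limit <- bs$stats[4] + 1.5 * box_height # furthest a whisker may reach

# stats: lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(bs$stats)
# out: points shown separately and interpreted as outliers
print(bs$out)

boxplot(age, ylab = "age (years)")
```

Here the upper whisker stops at 30, the most extreme value within the limit of 39.5, and the value 45 is drawn separately.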
✾ How to . . . construct a time-series plot (data frame has no time variable, only
index numbers)
3. Select ObsNumber as the x-variable and the response variable of interest (reading)
as the y-variable
[Figure: time-series plot of reading against Newcomb$ObsNumber]
❡ Illustrative Example – The Olympic Trends data (Data Set A.9 in Appendix A of
the Subject Reader)
• The time-series plot of the long jump time series is shown in Figure 2.9
• The long jump time series has an outlier at 1968
– This performance at the Mexico City Games is a correct but unusual value
• When time has no systematic effect on the variable of interest, the time series will
appear as white noise (see Figure 2.10)
• Some time series exhibit an upwards or downwards trend (see Figure 2.11)
– If the data come from the repetition of a physical experiment, a trend is
indicative of a drift in the calibration of the equipment
• The variability of the time series may systematically increase or decrease over time
(see Figure 2.12)
– If the data come from the repetition of a physical experiment and the variability
is systematically decreasing, this may reflect “learning” – the experimenter is
getting better at doing the experiment
– This typically means that the experimenter should have been given more time
to become familiar with the experiment before data were collected
• Some time series exhibit a periodic (cyclical) behaviour (see Figure 2.13)
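The four patterns above can be reproduced with simulated series, plotted against an observation number in place of a time variable. This is a sketch under stated assumptions: all series are simulated, and the slope, standard deviations and period are arbitrary illustrative choices rather than values from the Time.Series.Examples data.

```r
set.seed(42)
obs_number <- 1:30   # index used in place of a time variable

# Simulated series; every coefficient below is an arbitrary choice.
noise    <- rnorm(30)                                     # no dependence on time
trend    <- 0.3 * obs_number + rnorm(30)                  # upward trend
learning <- rnorm(30, sd = seq(3, 0.3, length.out = 30))  # shrinking variability
periodic <- 2 * sin(2 * pi * obs_number / 10) + rnorm(30, sd = 0.3)

# Plot each series against the observation number, joining successive points.
op <- par(mfrow = c(2, 2))
plot(obs_number, noise, type = "b", main = "White noise")
plot(obs_number, trend, type = "b", main = "Trend")
plot(obs_number, learning, type = "b", main = "Decreasing variability")
plot(obs_number, periodic, type = "b", main = "Periodic")
par(op)
```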
Figure 2.9: Olympic gold medal distance (inches) for the long jump. Vertical axis:
long.jump; horizontal axis: Olympic.Trends$year.
Figure 2.10: A time series with no apparent dependence on time (white noise). Vertical
axis: noise; horizontal axis: Time.Series.Examples$ObsNumber.
[Figures 2.11 to 2.13: time-series plots against Time.Series.Examples$ObsNumber; one
vertical axis is labelled learning]
• The normal quantile-quantile (Q-Q) plot is a specialised graphical tool for assessing
the approximate normality of a single continuous variable
• If an investigation requires an assessment of approximate normality, then the normal
quantile-quantile (Q-Q) plot is to be preferred over merely inspecting a histogram
❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)
• Construct a normal quantile-quantile (Q-Q) plot of the boys’ lung capacities
(measured as forced vital capacity in litres)
• The normal quantile-quantile (Q-Q) plot of the boys’ lung capacity data is shown in
Figure 2.14
• For an “ideal” normal distribution, the points on the plot should follow a straight
line
• The judgement of “approximate normality” requires skill and comes with experience
• The normal quantile-quantile (Q-Q) plot of the boys’ lung capacity data shown in
Figure 2.14 indicates approximate normality
• In order to begin to build an experience base for assessing normal quantile-quantile
(Q-Q) plots, consider the following examples
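One way to start building this experience is to draw Q-Q plots for samples whose shape is known in advance. The sketch below uses idealised samples built from theoretical quantiles rather than real data; the names ideal_normal and ideal_skewed are assumptions, and the correlation at the end is only a rough numerical companion to the visual judgement, not a formal test.

```r
n <- 127
p <- ppoints(n)          # evenly spread probabilities in (0, 1)

ideal_normal <- qnorm(p) # idealised normal sample
ideal_skewed <- qexp(p)  # idealised positively skewed sample

op <- par(mfrow = c(1, 2))
qq_normal <- qqnorm(ideal_normal, main = "Normal")
qqline(ideal_normal)
qq_skewed <- qqnorm(ideal_skewed, main = "Positive skew")
qqline(ideal_skewed)
par(op)

# Correlation between the plotted coordinates: 1 means a perfectly straight line.
r_normal <- cor(qq_normal$x, qq_normal$y)
r_skewed <- cor(qq_skewed$x, qq_skewed$y)
print(c(normal = r_normal, skewed = r_skewed))
```

The idealised normal sample falls exactly on a straight line, whereas the skewed sample bends away from it at the high end.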
Figure 2.14: Normal quantile-quantile (Q-Q) plot of the boys’ lung capacities (measured
as forced vital capacity in litres). Vertical axis: Boys.Lung.Capacity$fvc; horizontal
axis: norm quantiles.
Figure 2.15: Histogram and normal quantile-quantile (Q-Q) plot of an approximately
normal data set (Normal.QQ.Plots.Examples$normal.1)
– It is a measure of location
– The sample mean is the ‘centre of gravity’
• If the distribution is normal, then the sample mean lies approximately in the “middle” of the data
Figure 2.16: Histogram and normal quantile-quantile (Q-Q) plot of an approximately
normal data set (Normal.QQ.Plots.Examples$normal.2)
Figure 2.17: Histogram and normal quantile-quantile (Q-Q) plot of an approximately
normal data set (Normal.QQ.Plots.Examples$normal.3)
Figure 2.18: Histogram and normal quantile-quantile (Q-Q) plot of a positively skewed
data set (Normal.QQ.Plots.Examples$positive.skew)
Figure 2.19: Histogram and normal quantile-quantile (Q-Q) plot of a negatively skewed
data set (Normal.QQ.Plots.Examples$negative.skew)
Figure 2.20: Histogram and normal quantile-quantile (Q-Q) plot of a uniform data set
(Normal.QQ.Plots.Examples$uniform)
Figure 2.21: Histogram and normal quantile-quantile (Q-Q) plot of a heavily tailed data
set (Normal.QQ.Plots.Examples$heavy.tailed)
• If the distribution is skewed, then the sample mean is “pulled” in the direction of the
skew
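A small numerical illustration of this pull, using an invented positively skewed sample. The median is shown only as a reference point for the “middle” of the data.

```r
# Hypothetical positively skewed sample: one long-tail value at the high end.
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

sample_mean <- mean(x)
sample_median <- median(x)
print(c(mean = sample_mean, median = sample_median))
```

The single large value drags the mean above every other observation except itself, while the median stays with the bulk of the data.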
• The sample standard deviation is one way of quantifying the notion of the “spread”
of the data set
– If the standard deviation is 0 exactly, then all of the data values must be identical
so the data are not spread out at all
– In general, the larger the standard deviation, the more spread out the data are
• A rule of thumb for interpreting the standard deviation when the distribution of data
is approximately normal is
– approximately 68% of the data lie within one standard deviation of the sample
mean
– approximately 95% within two
– approximately 99.7% within three
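The percentages in this rule of thumb come from the normal distribution itself and can be reproduced with pnorm(). The helper name within_k_sd is an assumption introduced for illustration.

```r
# Proportion of a normal distribution within k standard deviations of the mean.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

proportions <- sapply(1:3, within_k_sd)
print(round(proportions, 4))
```

For real data the observed proportions will only be close to these values when the distribution is approximately normal.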
❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)
• Find the sample mean and the sample standard deviation of the boys’ lung capacities
(measured as forced vital capacity in litres)
✾ How to . . . calculate the sample mean and the sample standard deviation
Figure 2.22: Histogram and normal quantile-quantile (Q-Q) plot of a data set with an
outlier (Normal.QQ.Plots.Examples$outlier.present)
Table 2.1: The sample mean and the sample standard deviation of the boys’ lung capacities
(measured as forced vital capacity in litres)
    mean         sd    n
2.896850  0.5100509  127
2.5. EXPLORING MULTIVARIABLE DATA 35
• The sample mean and the sample standard deviation of the boys’ lung capacity data
are shown in Table 2.1
• Observe from the histogram in Figure 2.1 that the histogram “balances” at the point
2.90 litres (expressing the sample mean to two decimal places)
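Outside the menu system, the same three summaries can be computed directly in R. The vector fvc below holds invented values, not the real lung capacity data.

```r
# Hypothetical forced vital capacities in litres (not the real data).
fvc <- c(2.1, 2.5, 2.9, 3.0, 3.4)

result <- c(mean = mean(fvc), sd = sd(fvc), n = length(fvc))
print(result)

# sd() uses the n - 1 divisor; the same value computed from first principles:
sd_by_hand <- sqrt(sum((fvc - mean(fvc))^2) / (length(fvc) - 1))
print(sd_by_hand)
```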
❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)
3. Uncheck the items Marginal boxplots, Least-squares line and Smooth Line
that are checked by default
• The scatterplot of the boys’ lung capacities against their heights is shown in
Figure 2.24
• Observe that there is a positive linear relationship between boys’ lung capacity and
height – for each unit increase in height there is a fixed increase in average lung capacity
• Furthermore, observe that the variability in lung capacity for any particular height
level remains approximately constant (this property is called homoscedasticity)
• When commenting on a scatterplot, you should not only report on the trend in
location (or absence of one) but also the trend in variation (or absence of one), as
done above for the boys’ lung capacities and heights
• Let us call the variable on the horizontal axis the “predictor” and the variable on the
vertical axis the “response”
• The trend describes how the average response changes as the predictor increases
• The trend in variation describes how the variability in the response changes as the
predictor increases
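Both trends can be examined numerically as well as by eye. The sketch below uses invented heights and lung capacities, and it adds a least-squares line purely as a summary of the trend in location (the menu instructions above leave that option unchecked). Splitting the residuals at 150 cm is an arbitrary choice made for illustration.

```r
# Hypothetical heights (cm) and lung capacities (litres) for nine boys.
height <- seq(130, 170, by = 5)
fvc <- 0.04 * height - 3 + c(0.1, -0.1, 0.05, -0.05, 0, 0.05, -0.05, 0.1, -0.1)

# Predictor on the horizontal axis, response on the vertical axis.
plot(height, fvc, xlab = "height (cm)", ylab = "fvc (litres)")

# Trend in location: a least-squares line summarises the average response.
fit <- lm(fvc ~ height)
slope <- coef(fit)[["height"]]
abline(fit)

# Trend in variation: compare the residual spread at low and high heights.
res <- resid(fit)
spread_low <- sd(res[height < 150])
spread_high <- sd(res[height > 150])
print(c(slope = slope, spread_low = spread_low, spread_high = spread_high))
```

A positive slope with similar spreads at both ends corresponds to a positive linear trend with approximately constant variability.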
Figure 2.24: Boys’ lung capacities (measured as forced vital capacity in litres) against
heights (centimetres). Vertical axis: fvc; horizontal axis: height.
❡ Illustrative Example – The Mercury Contamination of Lakes data (Data Set A.10
in Appendix A of the Subject Reader)
– Infant’s Gender
– Infant’s Birth Weight
– Mother’s Smoking Status
• These types of questions require study of one variable conditionally on other variables
having certain values
• The next two sections examine the conditional boxplot (§ 2.5.3) and the conditional
sample mean and the conditional sample standard deviation (§ 2.5.4), respectively
• Some other types of conditional data analysis are illustrated in Practical 1
Figure 2.25: Mercury concentrations (parts per million) against alkalinity (measured as
milligrams per litre of calcium carbonate). Vertical axis: mercury.concentration;
horizontal axis: alkalinity.
• Construct a boxplot of the mothers’ ages separately for each category of insurance
status
• The boxplot of the mothers’ ages separately for each category of insurance status is
shown in Figure 2.26
• Although a histogram would have given more insight into the distribution of mothers’
ages for either group, it is difficult to compare between the two groups using
histograms
• The boxplot deliberately sacrifices internal group detail in order to permit ready
inter-group comparisons
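A conditional boxplot can be sketched in R with a formula of the form response ~ group. The data frame mothers and its values are invented for illustration; only the variable name insurance.status and its two category labels are taken from the text.

```r
# Hypothetical data frame; the ages are invented for illustration.
mothers <- data.frame(
  age = c(19, 22, 24, 26, 31, 25, 28, 29, 30, 36),
  insurance.status = factor(rep(c("health service patient", "private patient"),
                                each = 5))
)

# The formula age ~ insurance.status draws one box per category, side by side.
bp <- boxplot(age ~ insurance.status, data = mothers, ylab = "age (years)")

# Group medians, as drawn by the line inside each box.
print(bp$stats[3, ])
```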
• Find the sample mean and the sample standard deviation of the mothers’ ages
separately for each category of insurance status
[Figure 2.26: boxplots of the mothers’ ages by insurance.status]
✾ How to . . . calculate the conditional sample means and the conditional sample
standard deviations
4. Click on the Summarize by groups. . . button and then select the categorical variable
defining the groups of interest as the Groups variable (insurance.status)
• The sample means and the sample standard deviations of the mothers’ ages separately
for each category of insurance status are shown in Table 2.2
Table 2.2: The conditional sample means and the conditional sample standard deviations
of the mothers’ ages (years) by insurance status
                             mean        sd    n
health service patient   25.64586  5.133883  737
private patient          28.78891  4.160983  559
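A table of this form can also be produced directly in R. The sketch below uses tapply() on an invented data frame; the ages are hypothetical and the results are not those of Table 2.2.

```r
# Hypothetical data frame; the ages are invented for illustration.
mothers <- data.frame(
  age = c(22, 26, 30, 27, 29, 31),
  insurance.status = factor(rep(c("health service patient", "private patient"),
                                each = 3))
)

# tapply() applies a function to the ages within each category.
group_mean <- tapply(mothers$age, mothers$insurance.status, mean)
group_sd   <- tapply(mothers$age, mothers$insurance.status, sd)
group_n    <- tapply(mothers$age, mothers$insurance.status, length)

print(cbind(mean = group_mean, sd = group_sd, n = group_n))
```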
• Extract as a new data frame those subjects for which the mother’s insurance status
is “private patient”, and only include in this new data frame the mothers’ ages
3. If a subset of the original subjects is required, enter into the field Subset expression
the selection rule
(insurance.status=="private patient")
6. Note that the new data frame is now the “active data set”
• At any one time, only one of these is the “active data set”
☞ Note that the syntax for “is equal to” is a pair of equals signs: ==
☞ A single equals sign (=) already has another meaning in the R language (assignment and
the naming of function arguments), so the pair of equals signs is necessary to avoid
syntactic ambiguity
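The same extraction can be sketched in R with subset(). The data frame mothers and its values are invented for illustration; the selection rule is the one given above.

```r
# Hypothetical data frame; the values are invented for illustration.
mothers <- data.frame(
  age = c(22, 26, 30, 27, 29, 31),
  insurance.status = c("health service patient", "private patient",
                       "health service patient", "private patient",
                       "private patient", "health service patient")
)

# == tests equality row by row; select keeps only the listed variables.
private <- subset(mothers,
                  subset = insurance.status == "private patient",
                  select = age)
print(private)
```

The original data frame is left unchanged; private is a new data frame containing only the selected rows and the age variable.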
– ❡ Illustrative Example – The Obstetric Census data (Data Set A.6 in Appendix A
of the Subject Reader)
– For the variable gravidity, it may be desired to combine the categories ‘first
baby, no previous incomplete pregnancy’ and ‘first baby, previous incomplete
pregnancy’ into a single category ‘first baby’
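Combining categories in this way can be sketched in R by giving two factor levels the same label. The example values are invented, and the third category name, second or later baby, is an assumption made so that the factor has something left over after merging.

```r
# Hypothetical gravidity values; the third category name is an assumption.
gravidity <- factor(c(
  "first baby, no previous incomplete pregnancy",
  "first baby, previous incomplete pregnancy",
  "second or later baby",
  "first baby, no previous incomplete pregnancy"
))

# Identify the two levels to be combined.
first_baby <- levels(gravidity) %in% c(
  "first baby, no previous incomplete pregnancy",
  "first baby, previous incomplete pregnancy"
)

# Assigning the same label to both levels merges them into one.
gravidity_combined <- gravidity
levels(gravidity_combined)[first_baby] <- "first baby"

print(table(gravidity_combined))
```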
❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)
• Suppose that it is desired to classify 12-year-old boys according to the height ranges
specified in Table 2.4
Original value        New value
height < 145          short
145 ≤ height < 160    medium
160 ≤ height          tall
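This classification can be sketched in R with cut(). The heights below are invented and chosen to include the boundary values 145 and 160.

```r
# Hypothetical heights in centimetres, including the boundary values.
height <- c(140, 145, 159.9, 160, 172)

# right = FALSE makes each interval closed on the left and open on the right,
# matching "145 <= height < 160".
height_class <- cut(height,
                    breaks = c(-Inf, 145, 160, Inf),
                    labels = c("short", "medium", "tall"),
                    right = FALSE)
print(data.frame(height, height_class))
```

With the default right = TRUE the boundary values would fall into the lower class instead, so a height of exactly 145 would be classed as short.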
3. Usually it is appropriate to leave the Name for factor field as the default <same
as original>
6. If the Name for factor field was left as the default, then a dialog box will appear
asking if the existing variable is to be overwritten; click on the Yes button
7. A Reorder Levels dialog box will appear; adjust the order as required
9. Note that this reordering will only apply to this R session, and will need to be reset
if the data are subsequently imported into a new R session
❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)
• Suppose that the body mass index (BMI) is required for each boy
$$\mathrm{BMI} = \frac{\text{weight}}{\text{height}^{2}},$$
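A sketch of this computation in R. The data frame boys and its values are invented, and the units are an assumption: weight is taken to be in kilograms and height in centimetres, so height is converted to metres before squaring.

```r
# Hypothetical boys; weight in kilograms, height in centimetres (assumed units).
boys <- data.frame(weight = c(40, 45, 52), height = c(150, 155, 160))

# Convert height to metres, square it, then divide; the brackets make the
# conversion happen before the exponent is applied.
boys$bmi <- boys$weight / (boys$height / 100)^2
print(round(boys$bmi, 1))
```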
Hints
• As an example, consider 10 - 2 * 4 ^ 2:
10 - 2 * 4 ^ 2 = 10 - 2 * 16 = 10 - 32 = -22 ,
☞ For inbuilt functions in R, such as sin (sine), sqrt (square-root), etc, consult the
on-line help facility
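The precedence example above can be confirmed at the R prompt. The variable names are illustrative only.

```r
# Exponentiation first, then multiplication, then subtraction.
value <- 10 - 2 * 4 ^ 2

# Brackets force the subtraction to be done first.
with_brackets <- (10 - 2) * 4 ^ 2
print(c(value, with_brackets))

# Inbuilt functions are called by name; typing ?sqrt at an interactive
# prompt opens the corresponding help page.
root <- sqrt(16)
print(root)
```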