
Chapter 2

Exploratory Data Analysis

2.1 Chapter Outline


In this chapter, you will:

• be introduced to the concepts of data contamination and outliers, and learn how they
differ

• learn about exploratory data analysis for a single continuous variable

• learn how to explore multivariable data

• learn how to manipulate data and perform calculations

2.2 Data Contamination and Outliers

2.2.1 Data Contamination


• It is of course important to establish protocols in the experimental process to ensure
data integrity

• Nonetheless, the data analyst must be critical of the data – it is not always the truth!

• Despite careful protocols, there may still be contamination in the data

– a measuring machine has a fault


– a sensitive device is bumped, power surge, . . .
– an error is made in data copying (eg, 9.7 copied as 7.9)


– the variable is correctly recorded, but the subject should not have been there in
the first place (eg, a study on flowers from the species Iris setosa accidentally
includes a specimen from the species Iris versicolor )
– ...

• Often the contamination can never be identified

• Data contamination may distort the analysis

2.2.2 Outliers
• A very unusual subject is called an outlier

– This is a concept not a precise definition


– A subject might be an outlier if one of its variables has an extremely low or an
extremely high value
– But also, a subject might be an outlier if two or more of its variables appear in a
very unusual combination, even if their individual values considered separately
are not unusual

• Outliers should always be investigated if possible

– An outlier sometimes corresponds to contamination in the data (but note that
data contamination does not always give rise to an outlier)
– Sometimes an outlier is a bona fide member of the data set, but is “interesting”
for some reason

2.3 Why Explore the Data?


• Having obtained your data, the first step is to explore it

– Summaries can be either graphical or numerical


– Appropriate summaries can be very useful for finding anomalies in the data.
This is an important part of checking that your data are correct
– Sometimes the anomaly is an outlier – this could be contamination in the data
or a bona fide member of the data set that is “interesting” for some reason –
either way, the analyst needs to know about it and to follow up on it to the
extent possible
– Appropriate summaries are also useful for understanding the general features of
a particular set of data

– There are even cases where the exploration causes the investigator to reconsider
how to pose the scientific questions to be investigated and the experimental
design to be used

Hints

☞ When screening the data for errors it is useful to check the minimum and maximum
for unusual values

☞ Sometimes it happens that there are missing values in the data

• Sometimes, for some reason, there is a subject for which one of the variables is
not known
• Perhaps it was not measured in the first place
• Perhaps the value has been lost
• Perhaps the measuring equipment failed at that point

☞ Typically, some sort of code is used to indicate that a value is missing

• We shall use the code NA in the place where the number should have been

☞ If the data have been entered by someone else, you must ascertain what code they
used for missing values

☞ Sometimes a value that could not be legitimate is used as a code for a missing value

• Common choices are -9999, 9999 or 0.00


• Watch out for this: if the code is not recognised and calculations are performed then
nonsense will result!
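
The hints above can be sketched in R. This is a minimal illustration only: the vector reading and the use of -9999 as the missing-value code are made up for the example, and are not taken from any data set in the Subject Reader.

```r
# Hypothetical readings in which -9999 was used as the missing-value code
reading <- c(9.7, 8.4, -9999, 10.1, 9.2)

# Checking the minimum and maximum exposes the impossible value
range(reading)

# A naive mean is nonsense because -9999 is treated as a real measurement
mean(reading)

# Replace the code with NA, then ask R to skip missing values
reading[reading == -9999] <- NA
mean(reading, na.rm = TRUE)
```

Without na.rm = TRUE, the mean of a vector containing NA is itself NA, which is a useful reminder that missing values are present.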

2.4 Exploring a Single Continuous Variable

2.4.1 The Histogram


• The histogram is a graphical summary of a single continuous variable

• The horizontal axis represents the units of measurement

– On this axis lies a sequence of rectangles

• The area of each rectangle represents the proportion of observations in the bin or
class interval covered by its base

– In most cases, the bins are of equal width, in which case, the vertical axis can
equivalently be simply the number or frequency of observations in the corresponding
bin

❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)

• Construct a histogram of the boys’ lung capacities (measured as forced vital capacity
in litres)

✾ How to . . . construct a histogram

1. Use the menu item


Graphs / Histogram. . .

2. Select the variable of interest (fvc)

3. There is no need to alter any of the other settings

4. Click on the OK button

5. The graph appears in an R Graphics window in the RGui tab

• The histogram of the boys’ lung capacity data is shown in Figure 2.1
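
For readers who prefer typed commands to menus, a histogram can also be drawn with the base R function hist(). The sketch below is an assumption-laden stand-in: the values of fvc are simulated, not the real Boys' Lung Capacity data, and hist() is not necessarily the command that the menu item itself issues.

```r
# Simulated stand-in for the fvc variable (NOT the real data)
set.seed(1)
fvc <- rnorm(127, mean = 2.9, sd = 0.5)

# Draw the histogram; hist() also returns the bins and counts it used
h <- hist(fvc, xlab = "fvc (litres)", ylab = "frequency", main = "")

h$breaks   # bin boundaries (equal width by default)
h$counts   # number of observations in each bin
```

Because the bins are of equal width, the heights of the rectangles can be read directly as frequencies.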

2.4.2 The Shape of the Distribution


• The histogram is used to describe the shape of the distribution of the data over the
number line

• This description must be interpreted in the context of the data source

• The analyst uses the histogram and other summaries as investigating tools

– The analyst is the detective, alert for anything worthy of remark


– Interpret what is seen in the context of the data source

• Some examples of commonly occurring features are given below

– The examples do not form an exhaustive list, nor are they mutually exclusive

Figure 2.1: Histogram of the boys’ lung capacities (measured as forced vital capacity in
litres) [horizontal axis: Boys.Lung.Capacity$fvc; vertical axis: frequency]

Figure 2.2: A normal distribution

Normal

• One central concentration of values (unimodal – one mode), with values tailing off
symmetrically each side in a characteristic bell shape

• This is a commonly occurring natural distribution

– The normal distribution

• The accumulation of many small independent influences tends to give rise to approximately
normal data

– In particular, it is the natural distribution of independent measurement errors

• See Figure 2.2

Uniform

• No central concentration of values – data spread evenly over the range

– The end rectangles of the histogram may be noticeably shorter than the others – this is
just an artifact of misalignment between the bins and the natural range of the data

• See Figure 2.3



Figure 2.3: A uniform distribution

Figure 2.4: A positively skewed distribution

Positive Skew

• A long tail of values at the high end and an abrupt tail at the low end

• A common example is precipitation data at a location

– Some years may be very wet but precipitation cannot go below zero

• See Figure 2.4

Negative Skew

• A long tail of values at the low end and an abrupt tail at the high end

Figure 2.5: A negatively skewed distribution

Figure 2.6: A bimodal distribution

• See Figure 2.5

Bimodal

• Two central concentrations of values

– Often indicates two sub-classes within the population from which the sample
was drawn
– Perhaps these should be examined separately

• See Figure 2.6



2.4.3 The Boxplot


• The boxplot is a graphical summary of a single continuous variable
• The boxplot is much cruder than the histogram (§ 2.4.1)
• The utility of the boxplot is evident when considering conditional data analysis
(§ 2.5.3)

❡ Illustrative Example – The Obstetric Census data (Data Set A.6 in Appendix A of
the Subject Reader)

• Construct a boxplot of the mothers’ ages

✾ How to . . . construct a boxplot

1. Use the menu item


Graphs / Boxplot. . .
2. Select the variable of interest (age.mother)
3. Click on the OK button
4. The graph appears in an R Graphics window in the RGui tab

• The boxplot of the mothers’ ages is shown in Figure 2.7


• To interpret this graph, observe that the boxplot comprises a “box” divided by a
horizontal line with “whiskers” attached to the upper and lower edges of the box

– The vertical scale corresponds to the units of measurement


– The lower whisker begins at the minimum observed value
– The lower edge of the box is at the lower quartile. That is, 25% of the data lie
below the lower quartile and 75% of the data lie above
– The horizontal line in the middle of the box is at the sample median. That is,
50% of the data lie below the sample median and 50% lie above
– The upper edge of the box is at the upper quartile. That is, 75% of the data lie
below the upper quartile and 25% of the data lie above
– The upper whisker terminates at the maximum observed value

• Thus the boxplot depicts where the data are located and gives a coarse indication of
how they are distributed. The height of the box can be taken as a crude measure of
the spread of the data

• The boxplot includes a modification to help identify potential outliers

– Outliers in the data set can give rise to a misleading graph, as the whisker
would simply go uncritically out from the quartile to the outlier
– If the length of a whisker would have to exceed 1.5× the height of the box
in order to reach the maximum (or minimum) value in the sample, then the
whisker is terminated at the most extreme value within 1.5 box-lengths and all
other points are shown separately
– Such points are interpreted as outliers

• Observe that there are a few high outliers in the mothers’ ages data set
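
The 1.5 box-lengths rule described above can be sketched in base R. The ages below are simulated, with two artificial high values added, and are not the real Obstetric Census data. Note also that R's boxplot() places the box edges at "hinges", which can differ very slightly from the quartiles returned by quantile(), so the hand calculation is an approximation of what the plot shows.

```r
# Simulated stand-in for age.mother, plus two artificial high values
set.seed(2)
age <- c(round(rnorm(200, mean = 27, sd = 5)), 48, 51)

# Draw the boxplot; points beyond the whiskers are plotted individually
boxplot(age, ylab = "age (years)")

# The same rule by hand, using quantile() for the quartiles
q <- quantile(age, c(0.25, 0.75))
box_height  <- q[2] - q[1]                 # height of the box
lower_fence <- q[1] - 1.5 * box_height
upper_fence <- q[2] + 1.5 * box_height

# Values that fall outside the fences are the potential outliers
age[age < lower_fence | age > upper_fence]
```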

2.4.4 Time Series


Plotting Time Series

❡ Illustrative Example – Newcomb’s Speed of Light Experiment (Data Set A.1 in
Appendix A of the Subject Reader)

• There is no time variable explicitly recorded for this time series

• The index numbers are surrogates for the times

✾ How to . . . construct a time-series plot (data frame has no time variable, only
index numbers)

1. Use the menu item


Data / Manage variables in active data set / Add observation numbers to
data set

2. Use the menu item


Graphs / Line graph. . .

3. Select ObsNumber as the x-variable and the response variable of interest (reading)
as the y-variable

4. Click on the OK button

5. The graph appears in an R Graphics window in the RGui tab

• The time-series plot of Newcomb’s readings is shown in Figure 2.8
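
The same idea can be sketched with typed commands in base R. The readings below are simulated purely for illustration (they are not Newcomb's measurements), and the name obs_number is a hypothetical counterpart of the ObsNumber variable that the menu item creates.

```r
# Hypothetical readings recorded in time order, with no time variable
set.seed(3)
reading <- round(rnorm(60, mean = 27, sd = 5))

# The observation number acts as a surrogate for time
obs_number <- seq_along(reading)

# type = "b" draws both the points and the connecting lines
plot(obs_number, reading, type = "b",
     xlab = "observation number", ylab = "reading")
```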


Figure 2.7: Mothers’ ages (years) [boxplot; vertical axis: age.mother]

Figure 2.8: Time series plot of Newcomb’s experimental readings [horizontal axis:
Newcomb$ObsNumber; vertical axis: reading]

❡ Illustrative Example – The Olympic Trends data (Data Set A.9 in Appendix A of
the Subject Reader)

• Consider the long jump time series

✾ How to . . . construct a time-series plot (data frame contains a time variable)

1. Use the menu item


Graphs / Line graph. . .
2. Select the time variable (year) as the x-variable and the response variable of interest
(long.jump) as the y-variable
3. Click on the OK button
4. The graph appears in an R Graphics window in the RGui tab

• The time-series plot of the long jump time series is shown in Figure 2.9
• The long jump time series has an outlier at 1968

– This performance at the Mexico City Games is a correct but unusual value

Trends in Time Series

• When time has no systematic effect on the variable of interest, the time series will
appear as white noise (see Figure 2.10)
• Some time series exhibit an upwards or downwards trend (see Figure 2.11)

– If the data come from the repetition of a physical experiment, a trend may be
indicative of a drift in the calibration of the equipment

• The variability of the time series may systematically increase or decrease over time
(see Figure 2.12)

– If the data come from the repetition of a physical experiment and the variability
is systematically decreasing, this may reflect “learning” – the experimenter is
getting better at doing the experiment
– This typically means that the experimenter should have been given more time
to become familiar with the experiment before data were collected

• Some time series exhibit a periodic (cyclical) behaviour (see Figure 2.13)
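
To build intuition for these four behaviours, each can be simulated in base R. The sketch below is only one possible way of generating such series; the slopes, standard deviations and cycle length are arbitrary choices made for the illustration and do not reproduce the data behind the figures in this chapter.

```r
set.seed(4)
time_index <- 1:30

noise    <- rnorm(30)                                   # no dependence on time
trend    <- -0.2 * time_index + rnorm(30, sd = 0.5)     # steady downward drift
learning <- rnorm(30, sd = 3 / sqrt(time_index))        # spread shrinks over time
periodic <- 2 * sin(2 * pi * time_index / 10) +
  rnorm(30, sd = 0.3)                                   # cycle of length 10

# Four panels, one per behaviour
old_par <- par(mfrow = c(2, 2))
plot(time_index, noise,    type = "b", main = "white noise")
plot(time_index, trend,    type = "b", main = "trend")
plot(time_index, learning, type = "b", main = "decreasing variability")
plot(time_index, periodic, type = "b", main = "periodic")
par(old_par)
```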

Figure 2.9: Olympic gold medal distance (inches) for the long jump [horizontal axis:
Olympic.Trends$year; vertical axis: long.jump]

Figure 2.10: A time series with no apparent dependence on time (white noise)

Figure 2.11: A time series with a trend

Figure 2.12: A time series with decreasing variability (“learning”)

Figure 2.13: A time series with periodic (cyclical) behaviour

2.4.5 The Normal Quantile-Quantile (Q-Q) Plot


Constructing a Normal Quantile-Quantile (Q-Q) Plot

• The normal quantile-quantile (Q-Q) plot is a specialised graphical tool for assessing
the approximate normality of a single continuous variable
• If an investigation requires an assessment of approximate normality, then the normal
quantile-quantile (Q-Q) plot is to be preferred over merely inspecting a histogram

❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)

• Construct a normal quantile-quantile (Q-Q) plot of the boys’ lung capacities
(measured as forced vital capacity in litres)

✾ How to . . . construct a normal quantile-quantile (Q-Q) plot

1. Use the menu item


Graphs / Quantile-comparison plot. . .
2. Select the variable of interest (fvc)
3. Normal is selected by default
4. Click on the OK button
5. The graph appears in an R Graphics window in the RGui tab

• The normal quantile-quantile (Q-Q) plot of the boys’ lung capacity data is shown in
Figure 2.14
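
A comparable plot can be produced with the base R functions qqnorm() and qqline(). This is a swapped-in alternative to the menu item, not necessarily the command it runs, and the fvc values are simulated rather than taken from the real data set.

```r
# Simulated stand-in for fvc (NOT the real data)
set.seed(5)
fvc <- rnorm(127, mean = 2.9, sd = 0.5)

# Sample quantiles against theoretical normal quantiles
qq <- qqnorm(fvc, xlab = "norm quantiles", ylab = "fvc (litres)", main = "")

# Reference line through the first and third quartiles
qqline(fvc)
```

Points that stay close to the reference line suggest approximate normality; systematic curvature away from it suggests skewness or heavy tails.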

Interpreting a Normal Quantile-Quantile (Q-Q) Plot

• For an “ideal” normal distribution, the points on the plot should follow a straight
line
• The judgement of “approximate normality” requires skill and comes with experience
• The normal quantile-quantile (Q-Q) plot of the boys’ lung capacity data shown in
Figure 2.14 indicates approximate normality
• In order to begin to build an experience base for assessing normal quantile-quantile
(Q-Q) plots, consider the following examples

Figure 2.14: Normal quantile-quantile (Q-Q) plot of the boys’ lung capacities (measured
as forced vital capacity in litres) [horizontal axis: norm quantiles; vertical axis:
Boys.Lung.Capacity$fvc]

1. approximately normal (Figure 2.15)


2. approximately normal (Figure 2.16)
3. approximately normal (Figure 2.17)
4. positively skewed (Figure 2.18)
5. negatively skewed (Figure 2.19)
6. uniform (Figure 2.20)
7. heavy-tailed distribution – more data in the extremes than for a normal
distribution (Figure 2.21)
8. an outlier present in the data (Figure 2.22)

Figure 2.15: Histogram and normal quantile-quantile (Q-Q) plot of an approximately
normal data set

2.4.6 The Sample Mean and The Sample Standard Deviation


• The sample mean is one way of quantifying the notion of the “centre” of the data set

– It is a measure of location
– The sample mean is the ‘centre of gravity’

• An alternative word for mean is average

• The mean is sensitive to outliers

• If the distribution is normal, then the sample mean is about in the “middle”

Figure 2.16: Histogram and normal quantile-quantile (Q-Q) plot of an approximately
normal data set

Figure 2.17: Histogram and normal quantile-quantile (Q-Q) plot of an approximately
normal data set

Figure 2.18: Histogram and normal quantile-quantile (Q-Q) plot of a positively skewed
data set

Figure 2.19: Histogram and normal quantile-quantile (Q-Q) plot of a negatively skewed
data set

Figure 2.20: Histogram and normal quantile-quantile (Q-Q) plot of a uniform data set

Figure 2.21: Histogram and normal quantile-quantile (Q-Q) plot of a heavy-tailed data
set

• If the distribution is skewed, then the sample mean is “pulled” in the direction of the
skew

• The sample standard deviation is one way of quantifying the notion of the “spread”
of the data set

– It is a measure of variability or dispersion

• The standard deviation is also sensitive to outliers

• The standard deviation is always non-negative

– If the standard deviation is 0 exactly, then all of the data values must be identical
so the data are not spread out at all
– In general, the larger the standard deviation, the more spread out the data are

• A rule of thumb for interpreting the standard deviation when the distribution of data
is approximately normal is

– approximately 68% of the data lie within one standard deviation of the sample
mean
– approximately 95% within two
– approximately 99.7% within three

See Figure 2.23

❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)

• Find the sample mean and the sample standard deviation of the boys’ lung capacities
(measured as forced vital capacity in litres)

✾ How to . . . calculate the sample mean and the sample standard deviation

1. Use the menu item


Statistics / Summaries / Numerical summaries. . .

2. Select the variable of interest (fvc)

3. Leave Mean and Standard Deviation checked, but uncheck Quartiles

4. Click on the OK button


Figure 2.22: Histogram and normal quantile-quantile (Q-Q) plot of a data set with an
outlier

Figure 2.23: An idealised normal distribution [the central region is labelled 68%; the
horizontal axis is marked at mean-3stdev, mean-2stdev, mean-stdev, mean, mean+stdev,
mean+2stdev and mean+3stdev]

Table 2.1: The sample mean and the sample standard deviation of the boys’ lung capacities
(measured as forced vital capacity in litres)

mean sd n
2.896850 0.5100509 127

• The sample mean and the sample standard deviation of the boys’ lung capacity data
are shown in Table 2.1

• Observe from the histogram in Figure 2.1 that the histogram “balances” at the point
2.90 litres (expressing the sample mean to two decimal places)

• Furthermore, since the distribution of the boys’ lung capacities is approximately
normal (Figure 2.14), about 68% of 12-year-old boys have a lung capacity in the
range
(2.896850 − 0.5100509 , 2.896850 + 0.5100509) = (2.39 , 3.41)
litres (to two decimal places), and about 95% of 12-year-old boys have a lung capacity
in the range

(2.896850 − 2 × 0.5100509 , 2.896850 + 2 × 0.5100509) = (1.88 , 3.92)

litres (to two decimal places)
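
The arithmetic behind these two ranges can be checked in R. The mean and standard deviation are the values from Table 2.1; the second half of the sketch uses simulated normal data (an assumption for illustration, not the real fvc values) to show that the rule of thumb holds approximately.

```r
# Values taken from Table 2.1
m <- 2.896850
s <- 0.5100509

range_68 <- round(c(m - s, m + s), 2)           # 2.39 3.41
range_95 <- round(c(m - 2 * s, m + 2 * s), 2)   # 1.88 3.92
range_68
range_95

# Check the rule of thumb on simulated normal data (hypothetical values)
set.seed(6)
x <- rnorm(10000, mean = m, sd = s)
p68 <- mean(abs(x - mean(x)) <= sd(x))       # proportion within one sd
p95 <- mean(abs(x - mean(x)) <= 2 * sd(x))   # proportion within two sd
c(p68, p95)
```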

2.5 Exploring Multivariable Data


• Data with more than one variable per subject are multivariable data
• With multivariable data the questions of substantive interest usually refer
simultaneously to several of the variables

2.5.1 The Scatterplot


• When two numerical variables have been measured on each subject, we may wish to
explore their relationship

• A graphical tool for exploration of the relationship is the scatterplot

❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)

• Construct a scatterplot of the boys’ lung capacities against their heights

✾ How to . . . construct a scatterplot

1. Use the menu item


Graphs / Scatterplot. . .

2. Select the x-variable (height) and the y-variable (fvc)



3. Uncheck the items Marginal boxplots, Least-squares line and Smooth Line
that are checked by default

4. Click on the OK button

5. The graph appears in an R Graphics window in the RGui tab

• The scatterplot of the boys’ lung capacities against their heights is shown in
Figure 2.24

• Observe that there is a positive linear relationship between boys’ lung capacity and
height – for each unit increase in height there is a fixed increase in average lung capacity

• Furthermore, observe that the variability in lung capacity for any particular height
level remains approximately constant (this property is called homoscedasticity)

☞ Note the convention: a scatterplot of ‘variable A against (or versus) variable B’
means that A is on the vertical axis and B on the horizontal axis

• When commenting on a scatterplot, you should not only report on the trend in
location (or absence of one) but also the trend in variation (or absence of one), as
done above for the boys’ lung capacities and heights

• Let us call the variable on the horizontal axis the “predictor” and the variable on the
vertical axis the “response”

• The trend describes how the average response changes as the predictor increases

– It may be linear or nonlinear


– If it is nonlinear, then describe its nature
∗ eg, exponential growth, exponential decay, . . .

• The trend in variation describes how the variability in the response changes as the
predictor increases

– It may be homoscedastic or heteroscedastic


– If it is heteroscedastic, then describe its nature
– Heteroscedasticity often manifests as a fanning out or a fanning in of the
scatterplot from left to right

Figure 2.24: Boys’ lung capacities (measured as forced vital capacity in litres) against
heights (centimetres) [horizontal axis: height; vertical axis: fvc]

❡ Illustrative Example – The Mercury Contamination of Lakes data (Data Set A.10
in Appendix A of the Subject Reader)

• Consider the scatterplot of mercury concentration against alkalinity (Figure 2.25)


• Observe that there is a negative nonlinear relationship between mercury concentration
and alkalinity – for each unit increase in alkalinity there is a decrease in average
mercury concentration, but the amount of this decrease diminishes as alkalinity
increases
• Further, observe that the variability in mercury concentration for any particular
alkalinity level systematically changes – there is heteroscedasticity, such that lakes
with a higher alkalinity have a lower variability in their mercury concentrations
• Note also that some lakes are outliers, in that they have unusual combinations of
alkalinity and mercury concentration – a report on these data should name these
lakes and describe how they are unusual
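
The two kinds of scatterplot discussed in this section can be imitated with simulated data in base R. Everything in the sketch is hypothetical: the coefficients and noise levels are arbitrary choices that merely mimic a linear, homoscedastic relationship and a decaying, heteroscedastic one, and they are not estimates from the real data sets.

```r
set.seed(7)

# Linear trend with constant spread (homoscedastic)
height <- runif(127, min = 140, max = 175)
fvc <- -3.5 + 0.04 * height + rnorm(127, sd = 0.4)

# Decaying trend whose spread shrinks as the predictor grows (heteroscedastic)
alkalinity <- runif(50, min = 1, max = 120)
mercury <- exp(-alkalinity / 40 + rnorm(50, sd = 0.3))

# plot(x, y) puts x on the horizontal axis and y on the vertical axis,
# so these are plots of "fvc against height" and "mercury against alkalinity"
old_par <- par(mfrow = c(1, 2))
plot(height, fvc, xlab = "height (cm)", ylab = "fvc (litres)")
plot(alkalinity, mercury, xlab = "alkalinity", ylab = "mercury concentration")
par(old_par)
```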

2.5.2 Conditional Data Analysis


❡ Illustrative Example – The Obstetric Census data (Data Set A.6 in Appendix A of
the Subject Reader)

• In the Obstetric Census data, three of the variables are:

– Infant’s Gender
– Infant’s Birth Weight
– Mother’s Smoking Status

• We might ask questions such as:

1. Is the distribution of weights the same for boys and girls?


2. Does maternal smoking influence weight?
3. If so, is the effect of maternal smoking the same for both sexes?

• These types of questions require study of one variable conditionally on other variables
having certain values

– That is, to concentrate on the set of subjects defined by those values.

• The next two sections examine the conditional boxplot (§ 2.5.3) and the conditional
sample mean and the conditional sample standard deviation (§ 2.5.4), respectively
• Some other types of conditional data analysis are illustrated in Practical 1

Figure 2.25: Mercury concentrations (parts per million) against alkalinity (measured as
milligrams per litre of calcium carbonate) [horizontal axis: alkalinity; vertical axis:
mercury.concentration]

2.5.3 The Conditional Boxplot


❡ Illustrative Example – The Obstetric Census data (Data Set A.6 in Appendix A of
the Subject Reader)

• Construct a boxplot of the mothers’ ages separately for each category of insurance
status

✾ How to . . . construct a conditional boxplot

1. Use the menu item


Graphs / Boxplot. . .

2. Select the (continuous) variable of interest (age.mother)


3. Click on the Plot by groups. . . button and then select the categorical variable
defining the groups of interest as the Groups variable (insurance.status)

4. Click on both OK buttons


5. The graph appears in an R Graphics window in the RGui tab

• The boxplot of the mothers’ ages separately for each category of insurance status is
shown in Figure 2.26

• Although a histogram would have given more insight into the distribution of moth-
ers’ ages for either group, it is difficult to compare between the two groups using
histograms

• The boxplot deliberately sacrifices internal group detail in order to permit ready
inter-group comparisons

• Imagine if there were 10 groups to compare – histograms would be quite impractical


for the comparison
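
In typed R, a conditional boxplot can be requested with a formula. The data frame census below is a small simulated stand-in (its name, group sizes and values are assumptions for the example, not the real Obstetric Census data); only the variable names follow the text.

```r
set.seed(8)
# Hypothetical data frame standing in for the Obstetric Census data
census <- data.frame(
  age.mother = c(rnorm(70, mean = 25.6, sd = 5.1),
                 rnorm(50, mean = 28.8, sd = 4.2)),
  insurance.status = factor(rep(c("health service patient", "private patient"),
                                times = c(70, 50)))
)

# One box per category: the formula reads "age.mother by insurance.status"
boxplot(age.mother ~ insurance.status, data = census,
        xlab = "insurance.status", ylab = "age.mother")
```

The same call works unchanged for a grouping variable with many categories, which is where boxplots become far more practical than a set of histograms.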

2.5.4 The Conditional Sample Mean and the Conditional Sample Standard Deviation

❡ Illustrative Example – The Obstetric Census data (Data Set A.6 in Appendix A of
the Subject Reader)

• Find the sample mean and the sample standard deviation of the mothers’ ages
separately for each category of insurance status

Figure 2.26: Mothers’ ages (years) by insurance status [boxplots; horizontal axis:
insurance.status (health service patient, private patient); vertical axis: age.mother]

✾ How to . . . calculate the conditional sample means and the conditional sample
standard deviations

1. Use the menu item


Statistics / Summaries / Numerical summaries. . .

2. Select the (continuous) variable of interest (age.mother)

3. Leave Mean and Standard Deviation checked, but uncheck Quartiles

4. Click on the Summarize by groups. . . button and then select the categorical variable
defining the groups of interest as the Groups variable (insurance.status)

5. Click on both OK buttons

• The sample means and the sample standard deviations of the mothers’ ages separately
for each category of insurance status are shown in Table 2.2

Table 2.2: The conditional sample means and the conditional sample standard deviations
of the mothers’ ages (years) by insurance status

mean sd n
health service patient 25.64586 5.133883 737
private patient 28.78891 4.160983 559
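
A table of this shape can be assembled in base R with tapply(), which applies a summary separately within each group. The data frame below is simulated (its name census and its group sizes are assumptions for the example), so the numbers it produces will not match Table 2.2.

```r
set.seed(9)
# Hypothetical data frame standing in for the Obstetric Census data
census <- data.frame(
  age.mother = c(rnorm(70, mean = 25.6, sd = 5.1),
                 rnorm(50, mean = 28.8, sd = 4.2)),
  insurance.status = rep(c("health service patient", "private patient"),
                         times = c(70, 50))
)

# Apply each summary separately within each insurance category
group_mean <- tapply(census$age.mother, census$insurance.status, mean)
group_sd   <- tapply(census$age.mother, census$insurance.status, sd)
group_n    <- tapply(census$age.mother, census$insurance.status, length)

cbind(mean = group_mean, sd = group_sd, n = group_n)
```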

2.6 Data Manipulation and Calculations

2.6.1 Selecting a Subset of the Data


❡ Illustrative Example – The Obstetric Census data (Data Set A.6 in Appendix A of
the Subject Reader)

• Extract as a new data frame those subjects for which the mother’s insurance status
is “private patient”, and only include in this new data frame the mothers’ ages

✾ How to . . . select a subset of subjects or variables

1. Use the menu item


Data / Active data set / Subset active data set. . .

2. If a subset of the original variables is required, uncheck the box


Include all variables
and select only those variables required
(age.mother)

3. If a subset of the original subjects is required, enter into the field Subset expression
the selection rule
(insurance.status=="private patient")

4. Name the new data set


(Ages.Private.Patients)

5. Click on the OK button

6. Note that the new data frame is now the “active data set”
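
The steps above correspond roughly to a call to the base R function subset(). In this sketch the data frame census and its five rows are invented for illustration; the subset expression and the new data frame name are the ones used in the steps. Note the pair of equals signs in the selection rule.

```r
# Hypothetical stand-in for the Obstetric Census data
census <- data.frame(
  age.mother = c(24, 31, 28, 19, 35),
  insurance.status = c("private patient", "health service patient",
                       "private patient", "health service patient",
                       "private patient")
)

# Keep rows where the rule is TRUE, and only the age.mother column
Ages.Private.Patients <- subset(census,
                                subset = insurance.status == "private patient",
                                select = age.mother)
Ages.Private.Patients

# Rules can be combined with & (and), | (or) and ! (not)
older_private <- subset(census,
                        insurance.status == "private patient" & age.mother >= 30)
older_private
```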

• Several data sets may be open in the R session

• At any one time, only one of these is the “active data set”

• The user may select which is to be the active data set

✾ How to . . . select the active data set

1. Use the menu item


Data / Active data set / Select active data set. . .

2. Select the data frame required

3. Click on the OK button

• The syntax in R for logical operations is shown in Table 2.3

☞ Note that the syntax for “is equal to” is a pair of equals signs: ==

☞ A single equals sign (=) already has other uses in the R language (assignment and
naming function arguments), so the pair of equals signs is necessary to avoid
syntactic ambiguity

2.6.2 Recoding a Variable


• On occasion there is cause to recode a variable
• Sometimes a categorical variable is to be made coarser by amalgamating some of the
categories

– ❡ Illustrative Example – The Obstetric Census data (Data Set A.6 in
Appendix A of the Subject Reader)
– For the variable gravidity, it may be desired to combine the categories ‘first
baby, no previous incomplete pregnancy’ and ‘first baby, previous incomplete
pregnancy’ into a single category ‘first baby’

• Sometimes it is required to code a numerical variable into a categorical one

❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)

• Suppose that it is desired to classify 12-year-old boys according to the height ranges
specified in Table 2.4

✾ How to . . . recode a variable

1. Use the menu item


Data / Manage variables in active data set / Recode variable. . .
2. Select the variable to be recoded (height)
3. Name the new variable (height.coded)
4. Leave the Make (each) new variable a factor box checked
5. Enter the recode directives:
lo:144="short"
145:159="medium"
160:hi="tall"
6. Click on the OK button

• By default, the categories of a categorical variable are listed in alphabetical order


• In many examples, such as the one above, this is inappropriate

– The default order is ‘medium’, ‘short’, ‘tall’


– The natural order is ‘short’, ‘medium’, ‘tall’
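
As an alternative to the menu-driven recode, the height ranges of Table 2.4 can be applied with the base R function cut(), which is a different technique from the recode directives shown above. The heights below are hypothetical; for whole-number heights the two approaches give the same categories.

```r
# Hypothetical heights in centimetres
height <- c(138, 144, 145, 152, 159, 160, 171)

# Breaks follow Table 2.4; right = FALSE makes each interval [lower, upper)
height.coded <- cut(height,
                    breaks = c(-Inf, 145, 160, Inf),
                    labels = c("short", "medium", "tall"),
                    right = FALSE)

levels(height.coded)   # levels stay in the order given in labels
table(height.coded)

# A factor built from plain text falls back to alphabetical order ...
alphabetical <- factor(as.character(height.coded))
levels(alphabetical)

# ... and the natural order can be restored by listing the levels explicitly
reordered <- factor(alphabetical, levels = c("short", "medium", "tall"))
levels(reordered)
```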

Table 2.3: Logical operators in R

Description                    Usual Mathematical Symbol    R Syntax
is equal to                    =                            ==
is not equal to                ≠                            !=
is less than                   <                            <
is less than or equal to       ≤                            <=
is greater than                >                            >
is greater than or equal to    ≥                            >=

If expr1 and expr2 are logical expressions:
(expr1) and (expr2)                                         (expr1) & (expr2)
(expr1) or (expr2)                                          (expr1) | (expr2)
not(expr1)                                                  !(expr1)

Table 2.4: Recoding height

Original Value          New Value
height < 145            short
145 ≤ height < 160      medium
160 ≤ height            tall

✾ How to . . . change the category order

1. Use the menu item


Data / Manage variables in active data set / Reorder factor levels. . .

2. Select the factor (categorical variable) in question (height.coded)

3. Usually it is appropriate to leave the Name for factor field as the default <same
as original>

4. Leave the Make ordered factor box unchecked

5. Click on the OK button

6. If the Name for factor field was left as the default, then a dialog box will appear
asking if the existing variable is to be overwritten; click on the Yes button

7. A Reorder Levels dialog box will appear; adjust the order as required

8. Click on the OK button

9. Note that this reordering will only apply to this R session, and will need to be reset
if the data are subsequently imported into a new R session

2.6.3 Performing Calculations with Variables

❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)

• Suppose that the body mass index (BMI) is required for each boy

– The body mass index is defined as

$$\mathrm{BMI} = \frac{\text{weight}}{\text{height}^2},$$

where height is in metres and weight is in kilograms


– Note that the height variable is measured in centimetres, and so a rescaling is
necessary in the calculation

✾ How to . . . perform calculations with variables

1. Use the menu item


Data / Manage variables in active data set / Compute new variable. . .

2. Choose a name for the transformed variable (bmi)

3. Enter the expression for the required calculation


(weight / (height / 100)^2)

4. Click on the OK button

Hints

☞ The standard arithmetic operators of addition, subtraction, multiplication and
division are represented in R by the symbols:
+ - * /

☞ Exponentiation (raising a number to a power) is represented in R by the symbol ^
and so, for example, var^2 squares the variable (multiplies it by itself)

☞ Observe the precedence rules of arithmetic: in an arithmetic expression,
exponentiation is done before multiplication / division, which in turn is done before
addition / subtraction, unless brackets indicate otherwise

• As an example, consider 10 - 2 * 4 ^ 2:

10 - 2 * 4 ^ 2 = 10 - 2 * 16 = 10 - 32 = -22 ,

but (10 - 2) * 4 ^ 2 is:

8 * 4 ^ 2 = 8 * 16 = 128

☞ For inbuilt functions in R, such as sin (sine), sqrt (square-root), etc, consult the
on-line help facility
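
The calculation and the precedence rules can be tried directly at the R prompt. The weights and heights below are made-up values for three hypothetical boys, used only to show the rescaling from centimetres to metres.

```r
# Hypothetical boys: weight in kilograms, height in centimetres
weight <- c(38, 45, 52)
height <- c(148, 155, 163)

# Convert height to metres before squaring
bmi <- weight / (height / 100)^2
round(bmi, 1)

# Precedence: ^ first, then * and /, then + and -
10 - 2 * 4^2      # -22
(10 - 2) * 4^2    # 128

# An inbuilt function
sqrt(16)          # 4
```

Forgetting the brackets around height / 100 would divide by 100 squared instead of squaring the height in metres, which changes the result completely.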