
LC•GC Europe Online Supplement statistics and data analysis 3

Understanding the Structure of Scientific Data
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This is the first in a series of articles that aims to promote the better use of
statistics by scientists. The series intends to show everyone from bench
chemists to laboratory managers that the application of many statistical
methods does not require the services of a ‘statistician’ or a ‘mathematician’
to convert chemical data into useful information. Each article will be a
concise introduction to a small subset of methods. Wherever possible, diagrams
will be used and equations kept to a minimum; for those wanting more
theory, references to relevant statistical books and standards will be included.
By the end of the series, the scientist should have an understanding of the
most common statistical methods and be able to perform the tests while
avoiding the pitfalls that are inherent in their misapplication.

In this article we look at the initial steps in data analysis (i.e., exploratory data analysis), and how to calculate the basic summary statistics (the mean and sample standard deviation). These two processes, which increase our understanding of the data structure, are vital if the correct selection of more advanced statistical methods and interpretation of their results are to be achieved. From that base we will progress to significance testing (t-tests and the F-test). These statistics allow a comparison between two sets of results in an objective and unbiased way. For example, significance tests are useful when comparing a new analytical method with an old method or when comparing the current day’s production with that of the previous day.

Exploratory Data Analysis
Exploratory data analysis is a term used to describe a group of techniques (largely graphical in nature) that shed light on the structure of the data. Without this knowledge the scientist, or anyone else, cannot be sure they are using the correct form of statistical evaluation.

The statistics and graphs referred to in this first section are applicable to a single column of data (i.e., univariate data), such as the number of analyses performed in a laboratory each month. For small amounts of data (<15 points), a blob plot (also known as a dot plot) can be used to explore how the data set is distributed (Figure 1). Blob plots are constructed simply by drawing a line, marking it off with a suitable scale and plotting the data along the axis.

Figure 1: Blob plots of the raw data. Panels (a) and (b) each show the data plotted along a scale, with the mean marked.

A stem-and-leaf plot is yet another method for examining patterns in the data set. These are complex to describe and perceived as old fashioned, especially with the modern graphical packages available today. For the sake of completeness they are described in Box 1.

For larger data sets, frequency histograms (Figure 2(a)) and Box and Whisker plots (Figure 2(b)) may be better options to display the data distribution. Once the data set is entered or, as is more usual with modern instrumentation, electronically imported, most modern PC statistical packages can construct these graph types with a few clicks of the mouse. All of these plots can give an indication of the presence or absence of outliers (1). The frequency histogram, stem-and-leaf plot, and blob plot can also indicate the type of distribution the data belongs to. It should be remembered that if the data set is from a non-normal (2) distribution (Figure 2(a) and possibly Figure 1(a)), what looks like an outlier may in fact be a good piece of information. The outliers are the most extreme points on the right-hand side of Figures 1(a) and 2(a). Note: Outliers, outlier tests and robust methods will be the subject of a later article.

Assuming there are no obvious outliers, we still have to do one more plot to make sure we understand the data structure. The individual results should be plotted against a time index (i.e., the order the data were
obtained). If any systematic trends are observed (Figures 3(a)–3(c)) then the reasons for this must be investigated. Normal statistical methods assume a random distribution about the mean with time (Figure 3(d)) but if this is not the case the interpretation of the statistics can be erroneous.

Summary Statistics
Summary statistics are used to make sense of large amounts of data. Typically, the mean, sample standard deviation, range, confidence intervals, quantiles (1), and measures for skewness and spread/peakedness of the distribution (kurtosis) are reported (2). The mean and sample standard deviation are the most widely used and are discussed below together with how they relate to the confidence intervals for normally distributed data.

The Mean
The average or arithmetic mean (3) is generally the first statistic everyone is taught to calculate. This statistic is easily found using a calculator or spreadsheet and simply involves the summing of the individual results (x1, x2, x3, ..., xn) and division by the number of results (n),

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$$

Unfortunately, the mean is often reported as an estimate of the ‘true-value’ (µ) of whatever is being measured without considering the underlying distribution. This is a mistake. Before any statistic is calculated it is important that the raw data should be carefully scrutinized and plotted as described above. An outlying point can have a big effect on the mean (compare Figure 1(a) with 1(b)).

Box 1: Stem-and-leaf plot
A stem-and-leaf plot is another method of examining patterns in the data set. It shows the range, where the values are concentrated, and the symmetry. This type of plot is constructed by splitting each data value into the stem (the leading digit; in the plot below the stems run from 0.1 to 0.6) and the leaf (the trailing digit). Thus, 0.216 is represented as 2|1 and 0.350 by 3|5. Note, the decimal places are truncated and not rounded in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on that line and on the lines between it and the nearest end of the range). Thus, there are 13 points which lie between 0.40 and 0.63. The line containing the middle value is indicated differently, with a count (the number of items in the line) that is enclosed in parentheses.

Stem-and-leaf plot
Units = 0.1    1|2 = 0.12    Count = 42

    Depth   Stem|Leaves
      5     1|22677
     14     2|112224578
    (15)    3|000011122333355
     13     4|0047889
      6     5|56669
      1     6|3

Figure 2: Frequency histogram and Box and Whisker plot. Panel (a): frequency histogram; the vertical axis is the frequency (number of data points in each bar). Panel (b): Box and Whisker plot marking the lower quartile value, the median, the upper quartile value, the interquartile range*, whiskers extending 1.5 × interquartile range beyond the quartiles, and an outlier.
*The interquartile range is the range which contains the middle 50% of the data when it is sorted into ascending order.

Figure 3: Time-indexed plots (magnitude against time). Panel (a): n = 7, mean = 6, standard deviation = 2.16. Panel (b): n = 9, mean = 6, standard deviation = 2.65. Panel (c): n = 9, mean = 6, standard deviation = 2.06. Panel (d): n = 9, mean = 6, standard deviation = 1.80.

The Standard Deviation (3)
The standard deviation is a measure of the spread of data (dispersion) about the mean and can again be calculated using a calculator or spreadsheet. There is, however, a slight added complication; if you look at a typical scientific calculator you will notice there are two types of
standard deviation (denoted by the symbols σn and σn-1, or σ and s). The correct one to use depends upon how the problem is framed. For example, each batch of a chemical contains 10 sub-units. You are asked to analyse each sub-unit, in a single batch, for mercury contamination and report the mean mercury content and standard deviation. Now, if the mean and standard deviation are to be used solely with this analysed batch, then the 10 results represent the whole population (i.e., all are tested) and the correct standard deviation to use is the one for a population (σn). If, however, the intended use of the results is to estimate the mercury contamination for several batches of the chemical, the 10 results then represent a sample from the whole population and the correct standard deviation to use is that for a sample (σn-1). If you are using a statistical package you should always check that the correct standard deviation is being calculated for your particular problem.

$$\sigma = \sigma_n = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$$

$$s = \sigma_{n-1} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Figure 4: The relationship between the normal distribution curve, the mean and standard deviation. The horizontal axis is the number of standard deviations from the mean (-3 to 3); bands centred on the mean mark 68%, 95% and 99.7% of the results.

Figure 5: Comparison of different data sets. Each panel shows blob plots of two data sets, (i) and (ii).
(a) Probably not different and would 'pass' the t-test (tcrit > tcalculated value).
(b) Probably different and would 'fail' the t-test (tcrit < tcalculated value).
(c) Could be different but not enough data to say for sure (i.e., would 'pass' the t-test [tcrit > tcalculated value]).
(d) Practically identical means (µ1 and µ2), but with so many data points there is a small but statistically significant ('real') difference and so would 'fail' the t-test (tcrit < tcalculated value).
(e) Spread in the data as measured by the variance is similar; would 'pass' the F-test (Fcrit > Fcalculated value).
(f) Spread in the data as measured by the variance is different; would 'fail' the F-test (Fcrit < Fcalculated value) and hence (i) gives more consistent results than (ii).
(g) Could be a different spread but not enough data to say for sure; would 'pass' the F-test (Fcrit > Fcalculated value).

Interpreting the mean and standard deviation
If the distribution is normal (i.e., when the data are plotted it approximates to the curve shown in Figure 4) then the mean is located at the centre of the distribution. Sixty-eight per cent of the results will be contained within ±1 standard deviation from the mean, 95% within ±2 standard deviations and 99.7% within ±3 standard deviations.

Using the above facts it is possible to estimate a standard deviation from a stated confidence interval and, vice versa, a confidence interval from a standard deviation. For example, if a mean value of 0.72 ±0.02 g/L at the 95% confidence level is quoted then it follows that the standard deviation = 0.02/2 or 0.01 g/L. If the same figure was quoted at the 99.7% confidence level the standard deviation would be 0.02/3 or 0.0067 g/L.
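The two standard deviations and the interval conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `summarize` and `sd_from_interval` are hypothetical, the data are the Method 1 results from Table 3, and the conversion uses the rounded 2 and 3 standard-deviation multipliers quoted in the text.

```python
import statistics


def summarize(values):
    """Return n, the mean and both kinds of standard deviation (hypothetical helper)."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        # Population standard deviation (sigma_n): divides by n.
        "sd_population": statistics.pstdev(values),
        # Sample standard deviation (sigma_n-1, or s): divides by n - 1.
        "sd_sample": statistics.stdev(values),
    }


def sd_from_interval(half_width, n_sds):
    """Estimate a standard deviation from a quoted +/- interval.

    n_sds is the rounded multiplier from the text: 2 for the 95% level,
    3 for the 99.7% level.
    """
    return half_width / n_sds


# Method 1 results from Table 3 (selenium concentrations).
method_1 = [4.2, 4.5, 6.8, 7.2, 4.3]
summary = summarize(method_1)

print(f"mean = {summary['mean']:.2f}")
print(f"sample sd = {summary['sd_sample']:.3f}")
print(f"population sd = {summary['sd_population']:.3f}")

# 0.72 +/- 0.02 g/L quoted at the 95% and 99.7% confidence levels.
print(f"sd at 95% = {sd_from_interval(0.02, 2):.4f} g/L")
print(f"sd at 99.7% = {sd_from_interval(0.02, 3):.4f} g/L")
```

The population value is always the smaller of the two, which is why a package that silently uses the wrong divisor will understate the spread of a sample.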

Significance Testing
Suppose, for example, we have the following two sets of results for lead content in water: 17.3, 17.3, 17.4, 17.4 and 18.5, 18.6, 18.5, 18.6. It is fairly clear, by simply looking at the data, that the two sets are different. In reaching this conclusion you have probably considered the amount of data, the average for each set and the spread in the results. The difference between two sets of data is, however, not so clear in many situations. The application of significance tests gives us a more systematic way of assessing the results with the added advantage of allowing us to express our conclusion with a stated degree of confidence.

What does significance mean?
In statistics the words ‘significant’ and ‘significance’ have specific meanings. A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences unlikely to occur because of a purely random variation.

As previously mentioned, deciding if one set of results is significantly different from another depends not only on the magnitude of the difference in the means but also on the amount of data available and its spread. For example, consider the blob plots shown in Figure 5. For the two data sets shown in Figure 5(a), the means for set (i) and set (ii) are numerically different. From the limited amount of information available, however, they are from a statistical point of view the same. For Figure 5(b), the means for set (i) and set (ii) are probably different but when fewer data points are available, Figure 5(c), we cannot be sure with any degree of confidence that the means are different even if they are a long way apart. With a large number of data points, even a very small difference can be significant (Figure 5(d)). Similarly, when we are interested in comparing the spread of results, for example, when we want to know if method (i) gives more consistent results than method (ii), we have to take note of the amount of information available (Figures 5(e)–(g)).

It is fortunate that tables are published that show how large a difference needs to be before it can be considered not to have occurred by chance. These are critical t-values for differences between means, and critical F-values for differences between the spread of results (4).

Note: Significance is a function of sample size. Comparing very large samples will nearly always lead to a significant difference but a statistically significant result is not necessarily an important result. For example, in Figure 5(d) there is a statistically significant difference, but does it really matter in practice?

What is a t-test?
A t-test is a statistical procedure that can be used to compare mean values. A lot of jargon surrounds these tests (see Table 1 for definition of the terms used below) but they are relatively simple to apply using the built-in functions of a spreadsheet like Excel or a statistical software package. Using a calculator is also an option but you have to know the correct formula to apply (see Table 2) and have access to statistical tables to look up the so-called critical values (4).

Three worked examples are shown in Box 2 (5) to illustrate how the different t-tests are carried out and how to interpret the results.

Table 1: Definitions of statistical terms used in significance testing.
Jargon: Definition
Alternate hypothesis (H1): A statement describing the alternative to the null hypothesis (i.e., there is a difference between the means [see two-tailed] or mean1 is ≥ mean2 [see one-tailed]).
Critical value (tcrit or Fcrit): The value obtained from statistical tables or statistical packages at a given confidence level against which the result of applying a significance test is compared.
Null hypothesis (H0): A statement describing what is being tested (i.e., there is no difference between the two means [mean1 = mean2]).
One-tailed: A one-tailed test is performed if the analyst is only interested in the answer when the result is different in one direction, for example, (1) the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value ≤, >, <, or ≥ is used in the alternate hypothesis). In these cases the calculation to determine the t-value is the same as that for the two-tailed t-test but the critical value is different.
Population: A large group of items or measurements under investigation (e.g., 2500 lots from a single batch of a certified reference material).
Sample: A group of items or measurements taken from the population (e.g., 25 lots of a certified reference material taken from a batch containing 2500 lots).
Two-tailed: A two-tailed t-test is performed if the analyst is interested in any change. For example, is method A different from method B (i.e., ≠ is used in the alternate hypothesis). Under most circumstances two-tailed t-tests should be performed.

What is an F-test?
An F-test compares the spread of results in two data sets to determine if they could reasonably be considered to come from the same parent distribution. The test can, therefore, be used to answer questions such as: are two methods equally precise? The measure of spread used in the F-test is variance, which is simply the square of the standard deviation. The variances are ratioed (i.e., divide the variance of one set of data by the variance of the other) to get the test value

$$F = \frac{s_1^2}{s_2^2}$$

This F value is then compared with a critical value that tells us how big the ratio needs to be to rule out the difference in spread occurring by chance. The Fcrit value is found from tables using (n1–1) and (n2–1) degrees of freedom, at the appropriate level of confidence. [Note: it is usual to arrange s1 and s2 so that F > 1]. If the standard deviations are to be considered to come from the same population then Fcrit > F. As an example we use the data in Example 2 (see Box 2).

$$F = \frac{2.75^2}{1.471^2} = 3.49$$

Fcrit = 9.605 for (5–1) and (5–1) degrees of freedom at the 97.5% confidence level. As Fcrit > Fcalculated we can conclude that the spread of results in the two data sets is not significantly different and it is, therefore, reasonable to combine the two standard deviations as we have done.
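The F-test calculation above can be reproduced from the raw results as a rough sketch. The function name `f_ratio` is a hypothetical helper, the data are the two result sets from Table 3, and the critical value is simply the tabulated 9.605 quoted in the text rather than something computed here.

```python
import statistics


def f_ratio(sample_a, sample_b):
    """Ratio of sample variances, arranged so that F >= 1 (hypothetical helper)."""
    var_a = statistics.variance(sample_a)  # sample variance, divides by n - 1
    var_b = statistics.variance(sample_b)
    larger, smaller = max(var_a, var_b), min(var_a, var_b)
    if smaller == 0:
        raise ValueError("F is undefined when one data set has zero variance")
    return larger / smaller


# Results from Table 3 (two methods for determining selenium).
method_1 = [4.2, 4.5, 6.8, 7.2, 4.3]
method_2 = [9.2, 4.0, 1.9, 5.2, 3.5]

# Tabulated critical value quoted in the text for (5-1) and (5-1)
# degrees of freedom; it is an input here, not a computed quantity.
F_CRIT = 9.605

f_value = f_ratio(method_1, method_2)
spreads_differ = f_value > F_CRIT

print(f"F = {f_value:.2f}, Fcrit = {F_CRIT}")
print("spreads significantly different" if spreads_differ
      else "no significant difference in spread")
```

Working from the unrounded variances gives a ratio that agrees with the quoted 3.49 to two decimal places; taking the larger variance as the numerator follows the note about arranging s1 and s2 so that F > 1.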

Using statistical software (what is a p-value?)
When you use statistical software packages and some spreadsheet functions, the results of performing a significance test are often summarized as a p-value. The p-value represents an inverse index of the reliability of the statistic: strictly, it is the probability of obtaining a difference at least as large as the one observed if the null hypothesis were true. Loosely, if we are comparing two means to see if they are different, a p-value of 0.10 is commonly read as being 90% certain that the means are different; 0.05 as being 95% certain that the means are different; and 0.01 as being 99% certain that the means are different, i.e., [(1–p) × 100%]. It is usual when analysing chemical data (but somewhat arbitrary) to say that p-levels ≤ 0.05 are statistically significant.

Some assumptions behind significance testing
In most statistical tests it is assumed that the sample correctly represents the population and that the population follows a normal distribution. Although these assumptions are never complied with precisely, in a large number of situations where laboratory data are being used they are not grossly violated.

Conclusions
• Always plot your data and understand the patterns in it before calculating any statistic, even the arithmetic mean.
• Make sure the correct standard deviation is calculated for your particular circumstance. This will nearly always be the sample standard deviation (σn-1).
• Significance tests are used to compare, in an unbiased way, the means or spread (variance) of two data sets.
• The tests are easily performed using statistical routines in spreadsheets and statistical packages.
• The p-value is a measure of confidence in the result obtained when applying a significance test.

Acknowledgement
The preparation of this paper was supported under a contract with the UK Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (6).

References
(1) ISO 3534 part 1: Statistics Vocabulary and Symbols. Part 1: Probability and General Statistical Terms (1993).
(2) BS 2846 part 7: Tests for Departure from Normality (1984).
(3) BS 2846 part 4 (ISO 2854): Techniques of Estimation Relating to Means and Variances (1976).
(4) D.V. Lindley and W.F. Scott, New Cambridge Elementary Statistical Tables (ISBN: 0 521 48485 5), Cambridge University Press (1995).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide (ISBN: 085 404 4426), Royal Society of Chemistry (1997).
(6) M. Sargent, VAM Bulletin, Issue 13, 4–5 (Laboratory of the Government Chemist, Teddington, UK), Autumn 1995.

Bibliography
1. G.B. Wetherill, Elementary Statistical Methods, Chapman and Hall, London, UK.
2. J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
3. J. Tukey, Exploratory Data Analysis, Addison-Wesley.
4. T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide (ISBN: 085 404 4426), Royal Society of Chemistry, London, UK (1997).

Shaun Burke currently works in the Food Technology Department of RHM Technology Ltd, High Wycombe, Buckinghamshire, UK. However, these articles were produced while he was working at LGC, Teddington, Middlesex, UK (http://www.lgc.co.uk).

Table 2: Summary of statistical formulae.
t-test to use when comparing: Equation

The long-term average (population mean, µ) with a sample mean:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

The difference between two means (e.g., two analytical methods).
For a two-tailed test:

$$t = \frac{|\bar{d}| \sqrt{n}}{s_d}$$

For a one-tailed test (the sign is important):

$$t = \frac{\bar{d} \sqrt{n}}{s_d}$$

Difference between independent sample means with equal variances:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_c \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Difference between independent sample means with unequal variances†:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

where: $\bar{x}$ is the sample mean, µ is the population mean, s is the standard deviation for the sample, n is the number of items in the sample, $|\bar{d}|$ is the absolute mean difference between pairs, $\bar{d}$ is the mean difference between pairs, $s_d$ is the sample standard deviation for the pairs, $\bar{x}_1$ and $\bar{x}_2$ are two independent sample means, $n_1$ and $n_2$ are the number of items making up each sample, and $s_c$ is the combined standard deviation found using

$$s_c = \sqrt{\frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}$$

where $s_1$ and $s_2$ are the sample standard deviations.

†Note: The degrees of freedom (υ) used for looking up the critical t value for independent sample means with unequal variances is given by

$$\frac{1}{\upsilon} = \frac{1}{k^2} \cdot \frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{1}{k^2} \cdot \frac{s_2^4}{n_2^2 (n_2 - 1)} \quad \text{where} \quad k = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$$
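The formulae in Table 2 translate almost line for line into code. The following is a minimal sketch working from summary statistics only; all function names are hypothetical, and the numbers passed in at the end are taken from Examples 1 and 2 in Box 2. Critical values still have to be looked up in tables (4) or obtained from a statistical package.

```python
import math


def t_one_sample(mean, mu, s, n):
    """Sample mean against a long-term (population) mean."""
    return (mean - mu) / (s / math.sqrt(n))


def t_paired(mean_diff, s_diff, n, two_tailed=True):
    """Paired differences; the sign is kept only for a one-tailed test."""
    d = abs(mean_diff) if two_tailed else mean_diff
    return d * math.sqrt(n) / s_diff


def pooled_sd(s1, n1, s2, n2):
    """Combined standard deviation s_c for two independent samples."""
    return math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))


def t_equal_variances(mean1, mean2, s1, n1, s2, n2):
    """Independent sample means, equal variances (uses s_c)."""
    sc = pooled_sd(s1, n1, s2, n2)
    return (mean1 - mean2) / (sc * math.sqrt(1 / n1 + 1 / n2))


def t_unequal_variances(mean1, mean2, s1, n1, s2, n2):
    """Independent sample means, unequal variances.

    Returns (t, degrees of freedom) using the footnote formula of Table 2.
    """
    k = s1**2 / n1 + s2**2 / n2
    t = (mean1 - mean2) / math.sqrt(k)
    inv_df = (s1**4 / (n1**2 * (n1 - 1)) + s2**4 / (n2**2 * (n2 - 1))) / k**2
    return t, 1 / inv_df


# Example 1: new method mean 23.5, long-term mean 22.7, s = 0.9, n = 10.
t1 = t_one_sample(23.5, 22.7, 0.9, 10)

# Example 2: Method 1 (5.40, 1.471) against Method 2 (4.76, 2.750), n = 5 each.
sc = pooled_sd(1.471, 5, 2.750, 5)
t2 = t_equal_variances(5.40, 4.76, 1.471, 5, 2.750, 5)
t2_unequal, df_unequal = t_unequal_variances(5.40, 4.76, 1.471, 5, 2.750, 5)

print(f"Example 1: t = {t1:.2f}")
print(f"Example 2: sc = {sc:.3f}, t = {t2:.3f}")
print(f"Example 2 (unequal variances): t = {t2_unequal:.3f}, df = {df_unequal:.1f}")
```

With equal sample sizes the two independent-sample formulae give the same t-value; only the degrees of freedom, and therefore the critical value, differ.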

Box 2

Example 1
A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 µg/L. For the new method the mean is 23.5 µg/L, based on 10 results with a standard deviation of 0.9 µg/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:
• The null hypothesis (H0): new method mean = long-term check sample mean.
• The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.

To test the null hypothesis we calculate the t-value as below. Note, the calculated t-value is the ratio of the difference between the means and a measure of the spread (standard deviation) and the amount of data available (n).

$$t = \frac{23.5 - 22.7}{0.9 / \sqrt{10}} = 2.81$$

In the final step of the significance test we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value we need to know three pieces of information:
(i) Are we interested in the direction of the difference between the two means or only that there is a difference, for example, are we performing a one-sided or two-sided t-test (see Table 1)? In the case above, it is the latter; therefore, the two-sided critical value is used.
(ii) The degrees of freedom: this is simply the number of data points minus one (n–1).
(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research. In these cases, the 99% or even the 99.9% confidence level can be chosen.

tcrit = 2.26 at the 95% confidence level for 9 degrees of freedom. As tcalculated > tcrit we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods.
[Note: This does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are ‘fit for purpose’. The significance test is only one piece of information to be considered.]

Example 2 (5)
Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3.

Table 3: Results from two methods used to determine concentrations of selenium.
    Method      Results                       Mean     s
    Method 1    4.2   4.5   6.8   7.2   4.3   5.40     1.471
    Method 2    9.2   4.0   1.9   5.2   3.5   4.76     2.750

Using the t-test for independent sample means we define the null hypothesis H0 as $\bar{x}_1 = \bar{x}_2$. This means there is no difference between the means of the two methods (the alternative hypothesis is H1: $\bar{x}_1 \neq \bar{x}_2$). If the two methods have sample standard deviations that are not significantly different then we can combine (or pool) the standard deviation ($s_c$) (see What is an F-test?).

$$s_c = \sqrt{\frac{1.471^2 \times (5 - 1) + 2.750^2 \times (5 - 1)}{5 + 5 - 2}} = 2.205$$

If the standard deviations are significantly different then the t-test for unequal variances should be used (Table 2).

Evaluating the test statistic t:

$$t = \frac{5.40 - 4.76}{2.205 \sqrt{\frac{1}{5} + \frac{1}{5}}} = \frac{0.64}{2.205 \times 0.632} = \frac{0.64}{1.395} = 0.459$$

The 95% critical value is 2.306 for ν = 8 (n1 + n2 – 2) degrees of freedom. This exceeds the calculated value of 0.459, thus the null hypothesis (H0) cannot be rejected and we conclude there is no significant difference between the means of the results given by the two methods.

Example 3 (5)
Two methods are available for determining the concentration of vitamins in foodstuffs. To compare the methods several different sample matrices are prepared using the same technique. Each sample preparation is then divided into two aliquots and readings are obtained using the two methods, ideally commencing at the same time to lessen the possible effects of sample deterioration. The results are shown in Table 4.

Table 4: Comparison of two methods used to determine the concentration of vitamins in foodstuffs.
    Matrix             1      2      3      4      5      6      7      8
    Method A (mg/g)    2.52   3.13   4.33   2.25   2.79   3.04   2.19   2.16
    Method B (mg/g)    3.17   5.00   4.03   2.38   3.68   2.94   2.83   2.18
    Difference (d)    -0.65  -1.87   0.30  -0.13  -0.89   0.10  -0.64  -0.02

The null hypothesis is H0: $\bar{d} = 0$ against the alternative H1: $\bar{d} \neq 0$. The test is a two-tailed test as we are interested in both $\bar{d} < 0$ and $\bar{d} > 0$. The absolute mean difference is $|\bar{d}|$ = 0.475 and the sample standard deviation of the paired differences is $s_d$ = 0.700.

$$t = \frac{0.475 \times \sqrt{8}}{0.700} = 1.918$$

The tabulated value of tcrit (with ν = 7 degrees of freedom, at the 95% confidence level) is 2.365. Since the calculated value is less than the critical value, H0 cannot be rejected and it follows that there is no significant difference between the two techniques.
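Example 3 can also be worked directly from the raw readings in Table 4 rather than from the summary figures. The sketch below is illustrative only: the variable names are assumptions, and the critical value 2.365 is the tabulated figure quoted in the example, not something the code derives.

```python
import math
import statistics

# Readings from Table 4 (mg/g), one pair per sample matrix.
method_a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]
method_b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]

# Paired differences d = A - B, as in the last row of Table 4.
differences = [a - b for a, b in zip(method_a, method_b)]

n = len(differences)
mean_diff = statistics.fmean(differences)  # signed mean difference
sd_diff = statistics.stdev(differences)    # sample standard deviation (n - 1)

# Two-tailed paired t-test: the absolute mean difference is used.
t_value = abs(mean_diff) * math.sqrt(n) / sd_diff

# Tabulated two-tailed critical value quoted in Example 3
# (7 degrees of freedom, 95% confidence level).
T_CRIT = 2.365
reject_null = t_value > T_CRIT

print(f"mean difference = {mean_diff:.3f}, sd = {sd_diff:.3f}")
print(f"t = {t_value:.3f}, tcrit = {T_CRIT}")
print("reject H0" if reject_null else "cannot reject H0")
```

Pairing the readings by matrix removes the matrix-to-matrix variation from the comparison, which is why the test is carried out on the differences rather than on the two columns separately.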
