
Descriptive statistics

Descriptive statistics are used simply to describe the sample you are concerned with. They are used in the first instance to get a feel for the data, in the second for use in the statistical tests themselves, and in the third to indicate the error associated with results and graphical output. Many of the descriptions or "parameters", such as the mean, will be familiar to you already, and you probably use them far more than you are aware of. For instance, when have you taken a trip to see a friend without making a quick estimate of the time it will take you to get there (= mean)? Very often you will give your friend a time period within which you expect to arrive, say "between 7.30 and 8.00, traffic depending". This is an estimate of the standard deviation, or perhaps the standard error, of the times taken on previous trips. The more often you have made the same journey, the better the estimate will be. It is the same when measuring the length of the forelegs of a sample of donkeys in a biological experiment.

This page is divided up into two main sections but you must also refer to the pages on normal, binomial, negative binomial and Poisson distributions as these are also descriptive of data sets.

All examples on this page refer to samples and not the population as a whole.

All the following can be calculated using the "descriptive statistics" or "summary statistics" function in Excel or any of the statistics software.


Measures of central tendency


Most data sets have many values that cluster about the most common value (mean). The number of data points with a given value will decline the farther the value is from the mean. This phenomenon can clearly be seen in the following frequency distribution graphs.

Do ensure that your data does not follow the pattern displayed in bold "bimodal distribution". This suggests that you have sampled two populations (such as male and female where sexual dimorphism is apparent) and such data cannot be analysed easily.

Mean
The most common description of the central tendency is the mean (x̄), found using:

x̄ = Σx / n

i.e. the sum of all the values divided by the number of values. For example, take the sample:

28.5  18.75  22.9  25.4  24.55  23.7  23.9


By examination of the data the mean can be estimated at around 24. Using the above equation, it is: x̄ = 167.7 / 7 ≈ 23.96.
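As a quick check, this mean can be reproduced with Python's standard `statistics` module (a minimal sketch using the sample values listed above):

```python
import statistics

# The seven sample values listed above
values = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]

mean = statistics.fmean(values)  # equivalent to sum(values) / len(values)
print(round(mean, 2))  # → 23.96
```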

Median
However, the mean can distort the picture if there are a few extreme but legitimate values (i.e. not the result of inaccurate measurement). The median can help with this scenario and is found by locating the "middle" value of the sorted data: 18.75 22.9 23.7 23.9 24.55 25.4 28.5, giving a median of 23.9. If n is an even number the median is the mean of the two middle values.

Mode
This is the value that occurs most often; it does not exist in many data sets, including the one above. It is of use where the above two parameters cannot be found, most often in categorised data sets, e.g. in the following pitfall trap data:

The mean for each category is already displayed and the median is irrelevant. The mode is the Coleoptera group with 35 hits.
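Since the pitfall-trap table itself is not reproduced here, the sketch below uses hypothetical counts for the other groups (only Coleoptera's 35 hits come from the text) to show how the mode of categorical data can be found:

```python
import statistics

# Hypothetical pitfall-trap records: one entry per trapped individual.
# Only the Coleoptera count (35) is taken from the text above;
# the other group names and counts are illustrative.
hits = ["Coleoptera"] * 35 + ["Araneae"] * 12 + ["Collembola"] * 20

print(statistics.mode(hits))  # → Coleoptera
```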


Measures of Dispersion and Variability


We can describe data more fully using other parameters that are also used in the hypothesis tests.

Range - the highest and lowest values in a data set: 18.75 - 28.5 and 2 - 35 respectively in the above data sets.

Standard Deviation (s) - useful to assess how variable a sample is, but the coefficient of variation is easier to use.

Coefficient of Variation (CV) - useful to see how much variation occurs within your data set. The higher it is, the more data points you need to collect to be confident that the sample is representative of the population. It can also be used to compare variation between data sets. Calculated using: CV = (s / x̄) × 100, where s is the standard deviation and x̄ is the mean.

Variance (s²) - this is the most difficult value to use and need only be considered when using t-tests or ANOVA. Two or more s² values can be compared statistically using the F-test or homogeneity of variance tests.

Standard Error (SE) - this is essential to assess how closely your sample relates to the population, where SE = s / √n. By calculating the 95% confidence intervals (x̄ ± t × SE) you can say that the population mean has a 95% chance of being within this range. Such information should be included in graphical output.
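All of these dispersion measures can be computed from the sample used earlier; a minimal sketch with the standard library (the 1.96 multiplier is strictly a large-sample value, so for n = 7 the true t-based interval would be somewhat wider):

```python
import math
import statistics

values = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]
n = len(values)
mean = statistics.fmean(values)

s = statistics.stdev(values)       # sample standard deviation
var = statistics.variance(values)  # variance, s squared
cv = s / mean * 100                # coefficient of variation (%)
se = s / math.sqrt(n)              # standard error of the mean

# Approximate 95% confidence interval for the population mean
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(f"s={s:.2f} CV={cv:.1f}% SE={se:.2f} CI=({ci[0]:.2f}, {ci[1]:.2f})")
```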


Z-tests and t-tests


Data types that can be analysed with z-tests
- data points should be independent from each other
- the z-test is preferable when n is greater than 30
- the distributions should be normal if n is low; if however n > 30 the distribution of the data does not have to be normal
- the variances of the samples should be the same (F-test)
- all individuals must be selected at random from the population
- all individuals must have an equal chance of being selected
- sample sizes should be as equal as possible but some differences are allowed
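For two large samples the z statistic is simply the difference in means divided by its standard error; a sketch with made-up summary figures (all numbers here are illustrative):

```python
import math

# Hypothetical summary statistics for two large samples (n > 30)
mean1, s1, n1 = 24.0, 2.9, 45
mean2, s2, n2 = 22.5, 3.1, 50

# Two-sample z statistic
z = (mean1 - mean2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
print(round(z, 2))  # → 2.44

# Compare with the critical z-scores: 1.96 (5%) and 2.58 (1%)
print(abs(z) > 1.96, abs(z) > 2.58)  # → True False
```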

Data types that can be analysed with t-tests


- data sets should be independent from each other, except in the case of the paired-sample t-test
- where n < 30 the t-tests should be used
- the distributions should be normal for the equal and unequal variance t-tests (K-S test or Shapiro-Wilk)
- the variances of the samples should be the same (F-test) for the equal variance t-test
- all individuals must be selected at random from the population
- all individuals must have an equal chance of being selected
- sample sizes should be as equal as possible but some differences are allowed
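The equal-variances assumption can be screened with the F-test mentioned above: the larger sample variance is divided by the smaller, and the ratio is compared with a critical F value from tables. A sketch on two invented samples:

```python
import statistics

# Two small hypothetical samples (values are illustrative only)
sample_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
sample_b = [11.5, 12.8, 12.2, 11.1, 12.9, 11.7]

var_a = statistics.variance(sample_a)
var_b = statistics.variance(sample_b)

# F is conventionally the larger variance over the smaller, so F >= 1
f_ratio = max(var_a, var_b) / min(var_a, var_b)
print(round(f_ratio, 2))

# Compare f_ratio with the critical F value for (n-1, n-1) degrees of
# freedom before choosing the equal- or unequal-variance t-test.
```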

Limitations of the tests


if you do not find a significant difference in your data, you cannot say that the samples are the same

Introduction to the z and t-tests


Z-tests and t-tests are basically the same: they compare two means to suggest whether both samples come from the same population. There are, however, variations on the theme for the t-test. If you have a sample and wish to compare it with a known mean (e.g. the national average), the single sample t-test is available. If your two samples are not independent of each other and have some factor in common, e.g. geographical location or before/after treatment, the paired sample t-test can be applied. There are also two variations on the two sample t-test: the first uses samples that do not have equal variances and the second uses samples whose variances are equal.

It is well publicised that female students are currently doing better than male students! It could be speculated that this is due to brain size differences. To assess differences between a set of male students' brains and a set of female students' brains, a z- or t-test could be used. This is an important issue (as I'm sure you'll realise, lads) and we should use substantial numbers of measurements. Several universities and colleges are visited and a set of male brain volumes and a set of female brain volumes are gathered (I leave it to your imagination how the brain sizes are obtained!).


Results and interpretation


Degrees of freedom:
- For the z-test, degrees of freedom are not required, since z-scores of 1.96 and 2.58 are used for 5% and 1% respectively.
- For the unequal and equal variance t-tests, df = (n1 + n2) - 2
- For the paired sample t-test, df = number of pairs - 1

The output from the z- and t-tests is always similar, and there are several values you need to look for:

You can check that the program has used the right data by making sure that the means (1.81 and 1.66 for the t-test), number of observations (32, 32) and degrees of freedom (62) are correct. The information you then need, in order to reject or accept your H0, is in the bottom five values. The t Stat value is the calculated value relating to your data. This must be compared with the appropriate t Critical value, depending on whether you have decided on a one- or two-tail test (do not confuse these terms with the one- or two-way ANOVA). If the calculated value exceeds the critical value, H0 must be rejected at the level of confidence you selected before the test was executed. Both the one- and two-tailed results confirm that H0 must be rejected and HA accepted. We can also use the P(T<=t) values to ascertain the precise probability rather than the one specified beforehand. For the results of the t-test above, the probability of the differences occurring by chance for the one-tail test is 2.3×10⁻⁹ % (from 2.3E-11 × 100). All the above P-values denote very highly significant differences.
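The t Stat and degrees of freedom in such output can be reproduced by hand. The sketch below uses the means and sample sizes quoted above (1.81 and 1.66; n = 32 each), but the standard deviations are assumed values for illustration:

```python
import math

# Means and sample sizes from the worked example above;
# the standard deviations (s1, s2) are assumed for illustration
mean1, s1, n1 = 1.81, 0.12, 32
mean2, s2, n2 = 1.66, 0.10, 32

df = n1 + n2 - 2  # degrees of freedom for the two-sample t-test
print(df)  # → 62

# Pooled variance, then the equal-variance t statistic
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
t_stat = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t_stat, 2))

# Reject H0 if t_stat exceeds the t Critical value for df = 62
```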

Key for the selection of a test for comparisons

Are your data to be selected at random from the population you are studying?

Has every data point available got the same chance of being selected, and are they independent from each other (i.e. not leaves from the same plant if you are comparing leaves between plants)?

These two points should be set in stone! If not, the test results will be biased in one form or another and the conclusions that you make may be completely irrelevant. For example, if you were to sample a river bed with a net that had a mesh size of 0.5mm but your species of interest had a circumference of between 0.25 and 1.3mm, you would be introducing bias into your sampling by excluding those individuals smaller than 0.5mm.

Use the key by answering the questions in the most relevant way. It is advised that you double check through a good reference guide.

1. Have you got more than two samples?
   No......go to 2
   Yes.....go to 8
2. Have you got one or two samples?
   One.....single sample t-test
   Two.....go to 3

3. Are your data sets normally distributed (K-S test or Shapiro-Wilk)?
   No......go to 4
   Yes.....go to 5
4. Do your data sets have any factor in common (dependence), i.e. location or individuals?
   No......Mann Whitney U test
   Yes.....Wilcoxon Matched Pairs
5. Do your data sets have any factor in common (dependence), i.e. location or individuals?
   No......go to 6
   Yes.....paired sample t-test
6. Do your data sets have equal variances (F-test)?
   No......unequal variance t-test
   Yes.....go to 7
7. Is n greater or less than 30?
   <30.....equal variance t-test or ANOVA
   >30.....z-test or ANOVA
8. Are your samples normally distributed and with equal variances?
   No......Kruskal-Wallis non-parametric ANOVA
   Yes.....go to 9
9. Does your data involve one factor or two factors?
   One.....One-way ANOVA (see also Multiple comparison tests)
   Two.....Two-way ANOVA (see also Multiple comparison tests)
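The key above can also be restated as a small function. This is only a programmatic sketch of the same decisions (the argument names are this example's own, and step 7's ANOVA alternatives are omitted for simplicity):

```python
def choose_test(n_samples, normal, dependent=False,
                equal_variances=True, n_over_30=False, factors=1):
    """Sketch of the selection key above for comparison tests."""
    if n_samples == 1:
        return "single sample t-test"                              # step 2
    if n_samples == 2:
        if not normal:                                             # steps 3-4
            return ("Wilcoxon Matched Pairs" if dependent
                    else "Mann Whitney U test")
        if dependent:                                              # step 5
            return "paired sample t-test"
        if not equal_variances:                                    # step 6
            return "unequal variance t-test"
        return "z-test" if n_over_30 else "equal variance t-test"  # step 7
    # more than two samples: steps 8-9
    if not (normal and equal_variances):
        return "Kruskal-Wallis non-parametric ANOVA"
    return "One-way ANOVA" if factors == 1 else "Two-way ANOVA"

print(choose_test(2, normal=True))  # → equal variance t-test
```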

