Hypothesis-Driven and Exploratory Data Analysis
The 14th-century maxim known as Ockham's Razor, paraphrased by Jefferys and Berger (1992) as "It
is vain to do with more what can be done with less", is usually applied to the interpretation of
scientific results. However, it applies equally well to choice of analysis. Thus if one has a very simple
ecological data set, consisting of few species and few samples, ordination is not worthwhile. In such a
case, the data are easiest to interpret in a simple table.
In a typical data set, however, there are dozens of species and samples. It is impossible for the human
mind to simultaneously contemplate dozens of dimensions. The purpose of ordination is to assist the
implementation of Ockham's Razor: a few dimensions are easier to understand than many dimensions.
A good ordination technique will be able to determine the most important dimensions (or gradients) in
a data set, and ignore "noise" or chance variation.
Both direct and indirect gradient analysis have the potential to reduce the dimensionality of a data set.
However, reduction of dimensionality is not the only reason to use ordination. Before the
development of CCA, most widely-used ordination techniques were indirect, and the primary goal of
ordination was considered "exploratory" (Gauch 1982). It was the job of the ecologist to use his or her
knowledge and intuition to collect and interpret data; pure objectivity could potentially interfere with
the ability to distinguish important gradients. Ordination was often considered as much an art as a
science.
Once CCA was available, multivariate direct gradient analysis became feasible. It became possible to
rigorously test statistical hypotheses and go beyond mere "exploratory" analysis. However, testing
hypotheses requires complete objectivity, which results in repeatability and falsifiability. The two
basic motivations for multivariate direct gradient analysis, hypothesis testing and exploratory analysis,
conflict with each other to some extent:
Table 1. Hypothesis-driven analysis, exploratory analysis, and their major characteristics and
motivations. This table applies to regression techniques and indirect gradient analysis in addition to
CCA.
| HYPOTHESIS DRIVEN | EXPLORATORY |
| --- | --- |
| Motivating Question: "Can I reject the null hypothesis that species are unrelated to a postulated environmental factor or factors?" | Motivating Question: "How can I optimally explain or describe variation in my data set?" |
| objective | subjective |
| sites must be representative of universe: random, stratified random, regular placement | sites can be "encountered" or subjectively located |
| analyses must be planned a priori | "data diving" permissible; post-hoc analyses, explanations, hypotheses OK |
| p-values meaningful | p-values only a rough guide |
| stepwise techniques not valid without cross-validation | stepwise techniques (e.g. forward selection) valid and useful |

Hypothesis-Driven and Exploratory Data Analysis http://ordination.okstate.edu/motivate.htm
1 of 3 6/5/14, 6:44 AM
To perform a hypothesis-driven analysis, one must be very specific about the analyses one wishes to
perform. The null hypothesis must be clearly stated, and the data must be collected in a repeatable
manner. Usually, the sampling design will involve random, stratified random, or regular distribution of
study plots. If there is any subjectivity involved in locating or orienting study plots, the results are
technically not valid. All of the analyses, including variations of data transformation and use of
different ordination options (e.g. detrending or not), must be planned in advance, or else the user runs
the risk of "data diving" or "data mining", i.e. getting an artificially significant result because so many
options are tried. Stepwise techniques (discussed later) are automated forms of "data diving", and will
typically also lead to incorrect statistical inference (Cliff 1987, Draper and Smith 1981). The reward
for rigorously adhering to these rather stringent criteria is that the statistical inference (i.e. the p-value)
is valid.
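The cost of trying many options can be made concrete with a small calculation. The sketch below is an illustration only, not part of the original text. It assumes that the k analysis variants behave like independent tests of a true null hypothesis; variants of one analysis are usually correlated, so the real inflation is typically smaller, but it does not vanish. The function names are hypothetical.

```python
import random


def familywise_error(alpha: float, k: int) -> float:
    """Chance that at least one of k independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_data_diving(alpha: float, k: int, trials: int, seed: int = 0) -> float:
    """Estimate the same chance by simulation.

    Under a true null hypothesis a p-value is uniform on [0, 1], so each
    "analysis variant" is modelled here as one uniform random draw
    (an assumption made for illustration only).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # The data diver reports only the most favourable of the k results.
        best_p = min(rng.random() for _ in range(k))
        if best_p < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        analytic = familywise_error(0.05, k)
        simulated = simulate_data_diving(0.05, k, trials=20000)
        print(f"{k:2d} variants: analytic {analytic:.3f}, simulated {simulated:.3f}")
```

Under these assumptions, a nominal 0.05 threshold applied to the best of 20 variants yields a "significant" result about 64% of the time even when no relationship exists.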
Exploratory analyses might lack statistical rigor, but they are still a mainstay of vegetation research.
The purpose of exploratory analysis is to find pattern in nature, which is an inherently subjective
enterprise. Exploratory analyses incorporate the wisdom, skill, and intuition of the investigator into
the experiment. Unless you can find another investigator with identical wisdom, skill and intuition, the
analyses are not strictly repeatable, and are hence not falsifiable. While it is possible to perform
exploratory analyses on sample plots located according to a rigorous, objective sampling design, such
careful placement is not necessary. Indeed, an exploratory analysis can be aided if the investigator
subjectively places study plots in locations he or she considers to be important or interesting.
Orienting plots within vegetation which appears homogeneous is highly subjective, but very useful in
evaluating differences between plots.
With exploratory analysis, "data diving" (e.g. using different transformations of species abundances,
adjusting ordination options, selecting different subsets of environmental variables, or selecting
different subsets of study plots) is no longer to be avoided. Instead, it is a way for the investigator to
learn more about the data set. Stepwise analysis is a form of automated data diving. It is useful as a
tool to help discover "important" or "interesting" variables.
Ecologists are often misled into thinking that p-values from stepwise methods have a rigorous
meaning, and that the results of stepwise methods give the best possible model. Such thinking is false.
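A small simulation suggests why the statistics attached to a selected variable cannot be taken at face value. The sketch below is a deliberately simplified stand-in, not a full stepwise procedure and not CCA: it mimics only the first step of forward selection, using Pearson correlation on pure Gaussian noise. All names and parameter values are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_selection_step(n_plots, n_predictors, trials, seed=0):
    """Compare a pre-chosen noise predictor with the 'best' of many.

    Returns (mean |r| of the pre-chosen predictor, mean |r| of the selected one).
    All data are Gaussian noise, so every true correlation is zero.
    """
    rng = random.Random(seed)
    planned_total = 0.0
    selected_total = 0.0
    for _ in range(trials):
        response = [rng.gauss(0, 1) for _ in range(n_plots)]
        abs_r = []
        for _ in range(n_predictors):
            predictor = [rng.gauss(0, 1) for _ in range(n_plots)]
            abs_r.append(abs(pearson_r(predictor, response)))
        planned_total += abs_r[0]     # variable named before seeing the data
        selected_total += max(abs_r)  # variable picked because it looked best
    return planned_total / trials, selected_total / trials


if __name__ == "__main__":
    planned, selected = first_selection_step(n_plots=20, n_predictors=10, trials=500)
    print(f"planned variable:  mean |r| = {planned:.2f}")
    print(f"selected variable: mean |r| = {selected:.2f}")
```

The selected variable shows a consistently stronger apparent relationship than the pre-chosen one, although neither is related to the response. A p-value computed as if the selected variable had been named in advance ignores that selection.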
It is possible to combine exploratory analysis and hypothesis-driven analysis into a larger study. One
way of doing this is to perform a 2-phase study, in which the first phase is an exploratory analysis,
perhaps involving subjectively located plots and employing many variations on analysis. The patterns
found in the first phase are then posed as hypotheses for the second phase. The second phase involves
the collection of fresh data from objectively located plots, and an entirely planned data analysis.
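For the second phase, "objectively located" means that plot positions come from a rule rather than from the investigator's judgment. As a rough sketch of the three designs named earlier (random, stratified random, and regular placement), the following assumes a rectangular study area and rectangular strata. Both are simplifications, and the function names, stratum names, and dimensions are hypothetical.

```python
import random


def random_plots(n, width, height, rng):
    """Simple random placement: every point in the area is equally likely."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]


def stratified_random_plots(per_stratum, strata, rng):
    """Stratified random placement: random points within each stratum.

    `strata` maps a stratum name to its bounding box (x0, y0, x1, y1).
    """
    return {
        name: [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for _ in range(per_stratum)]
        for name, (x0, y0, x1, y1) in strata.items()
    }


def regular_plots(nx, ny, width, height):
    """Regular placement: plot centres on an evenly spaced nx-by-ny grid."""
    dx, dy = width / nx, height / ny
    return [((i + 0.5) * dx, (j + 0.5) * dy) for i in range(nx) for j in range(ny)]


if __name__ == "__main__":
    rng = random.Random(42)  # record the seed so the layout can be regenerated
    print(random_plots(3, 100.0, 50.0, rng))
    print(stratified_random_plots(
        2, {"upland": (0, 0, 50, 50), "lowland": (50, 0, 100, 50)}, rng))
    print(regular_plots(2, 2, 100.0, 50.0))
```

Recording the random seed (or the grid origin and spacing) is what makes the placement repeatable by another investigator.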
A second way to combine the two major types of analysis is through data set subdivision. The data set
is randomly divided into two subsets: an exploratory subset and a confirmatory subset (alternatively
called model building and model validation, respectively). Many, varied analyses can be performed on
the exploratory subset (including stepwise analysis) - and such analyses can be based upon intuition,
hunches, or superstition. If interesting patterns are found with respect to particular environmental
variables, and using particular data transformations, these patterns can be statistically tested using the
confirmatory subset. To use data set subdivision properly, samples must be objectively located.
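The random division itself is simple to carry out. A minimal sketch follows, assuming the samples are identified by plot labels and that an even split is wanted. The source does not prescribe a split ratio, and the function name and labels are hypothetical.

```python
import random


def subdivide(sample_ids, exploratory_fraction=0.5, seed=0):
    """Randomly split sample identifiers into two disjoint subsets.

    Returns (exploratory, confirmatory). The seed is fixed so that the
    same division can be reproduced later.
    """
    if not 0.0 < exploratory_fraction < 1.0:
        raise ValueError("exploratory_fraction must be strictly between 0 and 1")
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * exploratory_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    plots = [f"plot{i:02d}" for i in range(1, 41)]
    exploratory, confirmatory = subdivide(plots, seed=2014)
    print(len(exploratory), "exploratory plots;", len(confirmatory), "confirmatory plots")
```

The division should be made once, before any analysis, and the confirmatory subset left untouched until the hypotheses and analysis options have been fixed using the exploratory subset.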
Literature cited
(see also selected references for self-education)
Cliff, N. 1987. Analyzing Multivariate Data. Harcourt Brace Jovanovich, Publishers, San Diego,
California.
Draper, N. R., and H. Smith. 1981. Applied Regression Analysis. Second edition. Wiley, New York.
Gauch, H. G., Jr. 1982. Multivariate Analysis and Community Structure. Cambridge University Press,
Cambridge.
Hallgren, E., M. W. Palmer, and P. Milberg. 1999. Data diving with cross validation: an investigation
of broad-scale gradients in Swedish weed communities. Journal of Ecology 87:1037-1051.
Jefferys, W. H., and J. O. Berger. 1992. Ockham's Razor and Bayesian Analysis. American Scientist 80:64-72.
This page was created and is maintained by Michael Palmer