
WinTWINS

version 2.3

Mark O. Hill, Petr Šmilauer


2005

1 Preface
TWINSPAN, based partly on an earlier program called Indicator Species Analysis (Hill et al. 1975), was written in 1979, five years before the Apple Macintosh revolutionized personal computing, and two years before MS DOS was launched. The first version of Windows did not appear until November 1985. In 1979, a personal computer was an expensive luxury possessed only by a few geeks; all serious calculations were made on mainframes. At Cornell University, newly available computer terminals had made programming much easier than in earlier years. Programs and problems could be submitted electronically (to another room in Cornell's Langmuir Lab), though the output was still normally on paper. The time was therefore ripe for development of numerical methods in ecology to the point where they could become routine tools rather than interesting prospects for development.

The early proponents of numerical methods, notably Goodall (1953a, 1953b), had seen themselves as champions of objectivity. They were uncomfortable about the Zürich-Montpellier (Z-M) tradition of continental Europe, which had sought to construct a comprehensive system of knowledge. In the eyes of many British and American ecologists, the Z-M system was subjective and therefore intellectually dubious, because field workers sampled in a way that allowed them to prove what they wanted to find out in the first place. However, not everybody in Britain and America was convinced that objectivity and the Z-M system were in opposition. R.H. Whittaker (1962) urged ecologists to be pragmatic. He visited Tüxen in Germany, and with Tüxen's blessing edited the monumental Ordination and classification of communities (1973), in which the various protagonists set out their points of view. The good English-language manual published by the German ecologists Mueller-Dombois and Ellenberg (1974) included an exposition of Tablework, a nearly algorithmic method of sorting two-way tables. This narrowed the gap still further.

Thus it was natural that Mark Hill, visiting Whittaker at Cornell, should seek to develop an algorithm whose purpose was to sort tables in an objective way but in the spirit of Z-M methodology. There remained a difficulty, namely that existing numerical methods almost all classified either the samples (the so-called Q methods) or the species (R methods) but did not seek to arrange both together. Underlying this difficulty was the fact that many mathematicians were committed to the metric paradigm, according to which a classification should reflect as faithfully as possible a metric of compositional distance between samples (for Q methods) or between species (for R methods). By 1979, the metric paradigm for ordination had already been subjected to severe criticism (Austin 1976), and its applicability to classification was therefore also open to question.

This was the background to the writing of TWINSPAN. A practical problem that had already been solved for Indicator Species Analysis was to keep the magnitude of the calculation down, so that it rose only linearly with the size of the dataset. (Clearly any explicit calculation of distance matrices would increase the problem to magnitude m² if m is the number of samples, or n² where n is the number of species.) An algorithm linear in the size of the dataset was achieved by, figuratively speaking, sending signals through the data matrix in search of resonances in which the species and samples sounded together, and then dividing the data accordingly. When the samples and subsequently the species were repeatedly divided, TWINSPAN resulted.

TWINSPAN was originally written in FORTRAN 4, a language well suited to mathematical calculations but with poor handling of alphabetical data. Some improvements were made with FORTRAN 77, which handled alphabetical data rather better. However, when MS DOS was dropped in favour of Windows, the excellent Microsoft Fortran compiler was discontinued. TWINSPAN became difficult to run except under older operating systems such as UNIX or in a DOS window. Given that the program has retained its popularity, there is a need for it to be available in a modern form. It is with pleasure therefore that, after 25 years, we present it in an updated form suitable for the 21st century.

Mark O. Hill, Petr Šmilauer

2 Copyright
The WinTWINS software consists of three parts, with different code ownership. The twindll.dll dynamic library implements the actual algorithm of the TWINSPAN method and is owned by Mark Hill's employer, the Natural Environment Research Council of Great Britain. The executable file wintwins.exe represents a user-friendly wrapper for the method, allowing the user to easily specify analysis options and inspect analysis results. Its code is owned by Petr Šmilauer, České Budějovice, Czech Republic. The CanoDatC.dll dynamic library parses the input data files in a format compatible with the Canoco software; it was written by Cajo Ter Braak and is owned by Biometris, Wageningen, The Netherlands.

All the code owners give everyone the right to use the WinTWINS software for analysis of research data, both in non-commercial and in commercial work. This software can be further distributed in the form of its installation program. Integration of any part of the WinTWINS software into any commercial or non-commercial software package is explicitly prohibited.

When using this software, please cite it as follows:

Hill, M.O. & Šmilauer, P. (2005): TWINSPAN for Windows version 2.3. Centre for Ecology and Hydrology & University of South Bohemia, Huntingdon & Ceske Budejovice.

This software is provided on an "AS IS" basis, without warranty of any kind, including without limitation the warranties of merchantability, fitness for a particular purpose and noninfringement. The entire risk as to the quality and performance of the software is borne by you. Should the software prove defective, you and not the authors of the software assume the entire cost of any service and repair.

The source code for the twindll.dll and wintwins.exe parts is available from their respective authors:

Dr. Mark Hill - moh@ceh.ac.uk
Dr. Petr Šmilauer - petrsm@canodraw.com

3 TWINSPAN Method
3.1 Ordered two-way tables

TWINSPAN (Two-way indicator species analysis) is a computer program designed primarily for ecologists and vegetation scientists who have collected data on the occurrence of a set of species in a set of samples. The samples may be stands, relevés, stomach contents, island faunas, or whatever is appropriate to the study. The program first constructs a classification of the samples, and then uses this classification to obtain a classification of the species according to their ecological preferences. The two classifications are then used together to obtain an ordered two-way table that expresses the species' synecological relations as succinctly as possible.
Table 1: An ordered two-way table derived from the Dune meadow data and ordered by TWINSPAN; the 0/1 codes at the bottom and right specify the classifications of the samples and species. Only the ordering and the classification codes are listed here, not the cell values.

Samples (columns, left to right), with classification codes:

  17 (00)    19 (00)    11 (0100)  18 (0100)  5 (0101)
  6 (0101)   7 (0101)   10 (0101)  1 (011)    2 (011)
  3 (011)    4 (011)    8 (10)     9 (10)     12 (10)
  13 (10)    14 (11)    15 (11)    16 (11)    20 (11)

Species (rows, top to bottom), with classification codes:

  No.  Species   Code
    3  Air pra   0000
    5  Ant odo   0000
   12  Emp nig   0000
   13  Hyp rad   0000
    1  Ach mil   000100
    6  Bel per   000100
    7  Bro hor   000100
    9  Cir arv   000100
   18  Pla lan   000101
   26  Tri pra   000101
   28  Vic lat   000101
   23  Rum ace   00011
   11  Ely rep   001
   17  Lol per   001
   19  Poa pra   001
   16  Leo aut   01
   20  Poa tri   01
   27  Tri rep   01
   29  Bra rut   01
    4  Alo gen   10
   24  Sag pro   10
   25  Sal rep   10
    2  Agr sto   110
    8  Che alb   1110
   15  Jun buf   1110
   14  Jun art   11110
   10  Ele pal   11111
   21  Pot pal   11111
   22  Ran fla   11111
   30  Cal cus   11111

In an ordered two-way table, some of the structure of a dataset is apparent. For example, in Table 1 we see that the dry species Aira praecox, Anthoxanthum odoratum, Empetrum nigrum and Hypochaeris radicata are found especially in samples 17 and 19. Likewise the wet species Eleocharis palustris, Potentilla palustris, Ranunculus flammula and Calliergonella cuspidata are found especially in samples 14, 15, 16 and 20. Other species such as the semi-dry Lolium perenne and Poa pratensis avoid the wettest samples, while the semi-wet Agrostis stolonifera avoids the driest samples. In this case the arrangement reflects one major variable. This is often what is found. However, in some cases, more complex groupings are revealed and (often) aberrant samples or groups of samples are identified.

3.2 Differential species

The wet and the dry species are examples of differential species. A differential species is one with clear ecological preferences, so that its presence can be used to identify particular environmental conditions. In Table 1, the wet species Agrostis stolonifera and the dry species Lolium perenne are examples of differential species. There is some overlap between them, but on the whole they avoid each other. On the other hand, Leontodon autumnalis is a poor differential species, not showing any marked affinity for a particular group of samples.

3.3 Basic structure of TWINSPAN

TWINSPAN is designed to construct ordered two-way tables, and the method of doing so is by identification of differential species. In this respect it closely resembles the graphical method of classification outlined by Mueller-Dombois & Ellenberg (1974). It differs, however, in its treatment of the species. In the method outlined by Mueller-Dombois & Ellenberg, the species are classified at the same time as the samples. In TWINSPAN, on the other hand, the samples are classified first, and the species are classified second, using the classification of the samples as a basis. The basic structure of TWINSPAN is as follows.

1. Classify the samples in a divisive hierarchy, dividing them first into 2 subsets, then 4, 8, 16, etc.
2. Convert the sample classification into an ordering.
3. Using the groups of samples as a basis, construct attributes for the species. For example, in Table 1, species 1, 6, 7 and 18 would be described as possessing the attribute "preferential to the left side of the major division".
4. Classify the species in the same way as the samples, but with the difference that whereas the species were treated as attributes of the samples, the species have attributes of the kind indicated above.
5. Convert the species classification also into an ordering.
6. Print out the resulting ordered two-way table.

3.4 Making a dichotomy

The basic activity in TWINSPAN is to make a dichotomy. Indeed, as indicated above, all that the program does, in effect, is to divide up the samples into groups by repeated dichotomization, and then to do the same for the species. It can be argued that this is a bad way to organize a table because dichotomies do not arise naturally. Indeed, Mueller-Dombois & Ellenberg (1974) recommend dividing the species into three categories: those that are preferential to one side of the division, those that are preferential to the other, and those that are indifferent. In practice, the indifferent categories are usually picked out in later dichotomies. For example, the species labeled 01 in Table 1 are indifferent. The stages of making a dichotomy in TWINSPAN are as follows.

1. Identify a direction of variation in the data by ordinating the samples. This ordination is referred to below as the primary ordination, and is made by the method of correspondence analysis (also known as reciprocal averaging; Hill, 1973a, 1974).
2. Divide the ordination at its middle to get a crude dichotomy of the samples.
3. Identify differential species that are preferential to one side or the other of the crude dichotomy.
4. Construct an improved ordination (referred to below as the refined ordination), using the differential species as a basis.
5. Divide the refined ordination at an appropriate point to derive the desired dichotomy.

If this were all, TWINSPAN would be relatively easy to describe. However, yet a third ordination is also constructed, the indicator ordination. This is based on a small number of the most strongly differential species, and is designed to provide a simple criterion for reidentification of the groups. So a further stage must be added.

6. Construct a simplified ordination, the indicator ordination, based on a few of the most highly preferential species. See whether the dichotomy suggested by the refined ordination can be reproduced by a division of the indicator ordination.
To summarize, TWINSPAN makes its dichotomies by dividing ordinations in half. There are three ordinations involved:

1. The primary ordination (correspondence analysis), which is divided to obtain an initial, crude dichotomy;
2. The refined ordination, which is derived from the primary ordination through the identification of differential species; and
3. The indicator ordination.

With the exception of borderline cases, the refined ordination is the one that is used to determine the dichotomy. The indicator ordination is essentially an appendage, put there for the convenience of users who want a succinct characterization of the dichotomy.
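The first two stages (primary ordination, then a crude dichotomy) can be illustrated with a minimal Python sketch. It is not the code used in twindll.dll: the function names are hypothetical, the iteration count is arbitrary, and splitting at the weighted centroid (score 0 after centring) is an assumption about what "the middle" of the ordination means. The sketch also assumes that every sample and every species has at least one non-zero entry.

```python
def reciprocal_averaging(table, iterations=50):
    """Approximate the first correspondence-analysis axis of a
    samples x species table by reciprocal averaging.

    Illustrative sketch only; assumes every row and column has a
    non-zero total.
    """
    n_samples, n_species = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_samples))
               for j in range(n_species)]
    grand = sum(row_tot)

    species = [float(j) for j in range(n_species)]  # arbitrary start
    samples = [0.0] * n_samples
    for _ in range(iterations):
        # Sample score = weighted average of the species scores.
        samples = [sum(table[i][j] * species[j] for j in range(n_species))
                   / row_tot[i] for i in range(n_samples)]
        # Centre and rescale (weights = row totals) so that the
        # trivial constant solution is removed.
        mean = sum(w * s for w, s in zip(row_tot, samples)) / grand
        samples = [s - mean for s in samples]
        norm = (sum(w * s * s for w, s in zip(row_tot, samples)) / grand) ** 0.5
        samples = [s / norm for s in samples]
        # Species score = weighted average of the sample scores.
        species = [sum(table[i][j] * samples[i] for i in range(n_samples))
                   / col_tot[j] for j in range(n_species)]
    return samples, species


def crude_dichotomy(sample_scores):
    """Split samples at score 0 (assumed 'middle' of the ordination)."""
    negative = [i for i, s in enumerate(sample_scores) if s < 0]
    positive = [i for i, s in enumerate(sample_scores) if s >= 0]
    return negative, positive
```

For a small presence/absence table in which the first two samples share species with each other but not with the last two, the sketch places the two pairs on opposite sides of the crude dichotomy. The refined and indicator ordinations are not covered here.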

3.5 Pseudospecies

The idea of a differential species is essentially qualitative, and to be effective with quantitative data must be replaced by a quantitative equivalent. This equivalent is the pseudospecies (Hill, Bunce & Shaw, 1975; Hill, 1977). The essential idea is that much of the quantitative information can be retained by expressing it on a relatively crude scale such as the Braun-Blanquet scale of cover-abundance (Mueller-Dombois & Ellenberg, 1974). The levels of abundance that are used in TWINSPAN to define the crude scale are here termed pseudospecies cut levels. Consider, for the sake of example, the Braun-Blanquet scale. Ignoring the distinction between 1, +, and r, the scale is as follows:

  Scale value   Cover
  1             0 - 4%
  2             5 - 25%
  3             26 - 50%
  4             51 - 75%
  5             76 - 100%

If quantitative data with cover expressed on a percentage scale are entered into TWINSPAN, then the Braun-Blanquet scale can be used by entering the pseudospecies cut levels: 0 5 26 51 76

Consider now a species, for example Poa annua, whose cover in one sample is 18% and in another is 36%. Although these values differ by a factor of 2, the samples have an important feature in common, namely that they share a moderate abundance of Poa annua. Given the Braun-Blanquet scale defined above, this fact can be expressed in terms of pseudospecies by saying that the samples contain the following pseudospecies:

  Sample with Poa annua 18% - P. annua 1, P. annua 2
  Sample with Poa annua 36% - P. annua 1, P. annua 2, P. annua 3

Hence, using the Braun-Blanquet scale, the samples are registered as having two pseudospecies in common, and one different. This is arguably a correct way to view them; in spite of the rather large difference in cover, the samples actually have more in common than by way of difference.

The method of pseudospecies allows quantitative values to be used as differential species and as indicators. Thus, in TWINSPAN, instead of a differential species Poa annua, it is possible to have a differential species P. annua 2, which, with the pseudospecies cut levels defined above, occurs when and only when P. annua has cover 5% or greater.

In practice, users of WinTWINS who are mainly interested in a good tabular arrangement will hardly concern themselves with the technical detail of pseudospecies. For them it suffices to know that they are using a particular scale of abundance. However, there is one important practical point, namely that information on the occurrence of each pseudospecies is stored separately in the computer. Use of very numerous pseudospecies may therefore be undesirable with larger problems, as they use up storage space.
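The Poa annua example can be reproduced with a short Python sketch. The function name and the label format are hypothetical, and the treatment of the cut level 0 (any cover greater than zero counts as presence) is an assumption; the cut levels are the Braun-Blanquet ones given above.

```python
def pseudospecies(name, cover, cut_levels):
    """Return the pseudospecies present for one species in one sample.

    Level k is present when the cover reaches the k-th cut level.
    Assumption: a cover of 0 means the species is absent, so a cut
    level of 0 is read as 'any cover greater than zero'.
    """
    if cover <= 0:
        return []
    return [f"{name} {level}"
            for level, cut in enumerate(cut_levels, start=1)
            if cover >= cut]


BRAUN_BLANQUET_CUTS = [0, 5, 26, 51, 76]

sample_a = pseudospecies("P. annua", 18, BRAUN_BLANQUET_CUTS)
sample_b = pseudospecies("P. annua", 36, BRAUN_BLANQUET_CUTS)

shared = set(sample_a) & set(sample_b)     # pseudospecies in common
differing = set(sample_a) ^ set(sample_b)  # present in only one sample
```

With these inputs, `shared` holds P. annua 1 and P. annua 2 and `differing` holds P. annua 3, matching the counts in the text (two in common, one different).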

4 Working with WinTWINS


Each analysis is represented by a project that is stored in a file with the .twp file extension. This is a binary file that cannot be directly viewed or edited with other programs. To modify project settings, you must use the Setup Wizard, a series of dialog pages where you specify input data and other analysis options (see section 4.2 for details of this procedure).

4.1 Program workspace

WinTWINS works at any time with a single project. The project is manipulated using the commands in the program menu and the main analysis results are displayed in the WinTWINS window (see Figure 1).

Figure 1

The commands in the File submenu allow you to create a new project, open an existing one and save the project, optionally under a different name (the Save As command). The project log (shown in the window) can also be printed. The commands in the Edit submenu (the Undo, Cut, Copy, and Paste commands) provide basic support for working with the analysis log displayed in the program window. The commands in the View submenu allow you to show or hide the program toolbar (Toolbar), show or hide the status bar at the lower edge of the window (Status Bar), or change the type and size of the font used to display the analysis log (Window Font).

4.2 Setup Wizard

The project setup wizard is shown when you select the Project / Settings menu command and also when you start the program or open a new TWINSPAN project file. On the first wizard page (Figure 2), you must specify the name of the data file.

Figure 2

The WinTWINS program is able to read data in condensed, full or free formats, corresponding to formats supported by Canoco for Windows, version 4.x (Ter Braak & Šmilauer 2002). The maximum allowed number of samples is 25000; the maximum number of variables is 10000. Note that the latter number includes not only the variables (species) present in your data file, but also the pseudo-variables (pseudo-species) that are created based on the cut levels you specify in the next setup wizard page. The maximum number of presences for variables and pseudo-variables (pseudo-species) is set to 4000000, but it should not exceed 2000000 for a classification of species to be created in addition to a classification of your samples.

After you click the Next button, the WinTWINS program parses the data to check their size and range of values. This information is displayed on the following page (Figure 3), and the suggested cut levels are adjusted for the range of data values. WinTWINS uses the default cut levels (0, 2, 5, 10, 20), but removes the levels exceeding the largest value in the data.

Figure 3

You must select unique, increasing cut levels. They should be chosen so as to reflect typical values of abundance, e.g. "present", "a little", "a lot", "more-or-less dominant". It is important, however, not to over-weight the effect of dominance by including many relatively high cut levels. With percentage data the default cut levels have proved very effective, and the user is recommended to adhere to them until he or she has some experience with the method.
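The two rules for the Cut Levels page (default levels trimmed to the data range, and levels that must be unique and increasing) can be sketched in Python as follows. The function names are hypothetical and do not correspond to anything in WinTWINS; the sketch only mirrors the behaviour described in the text.

```python
DEFAULT_CUT_LEVELS = [0, 2, 5, 10, 20]


def suggest_cut_levels(max_value, defaults=DEFAULT_CUT_LEVELS):
    """Drop the default cut levels that exceed the largest data value."""
    return [cut for cut in defaults if cut <= max_value]


def cut_levels_are_valid(levels):
    """Check that the cut levels are unique and strictly increasing."""
    return all(a < b for a, b in zip(levels, levels[1:]))
```

For data whose largest value is 9, `suggest_cut_levels(9)` keeps only the levels 0, 2 and 5. A list such as 0, 5, 5 fails the validity check because a level is repeated.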


For each cut level, you can specify its relative weight in the analysis. Note that this weight must be specified with whole positive numbers, representing the multiple of the default weight (=1) to use for a particular level. In many applications, it is not necessary to use weights at all (i.e. to use the value of 1 for all cut levels). However, with community composition data taken from very large plots, it may sometimes be advisable to give presences and absences relatively low weights, as they may be due to anomalous small patches of terrain that occupy little of the plot area. The last column in the Cut Levels page represents the indicator potential for individual cut levels. If a particular box is checked, the (pseudo-)variables resulting from that level (or, usually, the presence of species for the first cut level) can be used as indicators for individual splits. Very often, all levels have an indicator potential. The most common variation is that all the indicators should be real species, not pseudospecies. In this case, only the first level is checked. Note that if you increase the number of cut levels from that suggested by the setup wizard, the extra levels do not have appropriate threshold values. You must specify the correct thresholds (increasing from top to bottom) before you can proceed to the next page. In the next property page (Figure 4), you can select samples that will be omitted in the analysis.

Figure 4

To specify the samples to be "deleted", select them in the left-hand list and click the Select>> button. Depending on the structure of the analysed data, you might prefer your samples to be listed in increasing order of their indices or sorted alphabetically by their names (labels). You choose the appropriate method using the two radio-button controls at the bottom. The next wizard page (Figure 5) can be used to select species to be ignored in the analysis, based on their frequency (number of occurrences) in the data.


Figure 5

You should specify the minimum number of occurrences (positive values) that a species must have to take part in the analysis. In Figure 5, the five listed species evidently have just one occurrence in the data (or they might be absent altogether). Note that the occurrences in the samples deleted on the previous page are ignored here. On the next page (Figure 6), you can explicitly specify species to omit from the analysis. Note that all the species in your data are shown there, even those with implied omission by the preceding page. The final list of omitted species represents a logical union of the two lists.

Figure 6

The Nondiagnostic Species wizard page (Figure 7) seems similar to the previous one, but species selected here are not completely omitted from the analysis. They are just ignored when WinTWINS looks for diagnostic species, used in the indicator ordination step (see section 3.4). This is useful, for example, if the indicator criterion is to be applied by relatively inexperienced field workers. It may then be better if difficult taxonomic groups are omitted.


Figure 7

In WinTWINS, you can also specify weights for some samples and/or species. These are combined with the sample and species weights implied by the reciprocal averaging (RA) algorithm used in the TWINSPAN method. In the Sample Weights wizard page (Figure 8), you may specify non-default user weights for samples. Samples not present in the list have the default user weight (1.0).

Figure 8

To remove weighted samples from the list, select the corresponding rows in the list and click the Delete button. Use the Add button to specify additional weighted samples. A new dialog box is shown (see Figure 9).

Figure 9


In this dialog box, all the samples with default weight are listed. Select in the list the samples to be weighted, enter the required weight (range 0.001 to 1000.0) and click the OK button. To change weight value for an already weighted sample, you must first Delete it in the parental page and then specify new weight for it here. A similar property page is also available for species. The species weights are applied to all corresponding pseudo-species that result from the specified pseudo-species cut levels. The weights have an effect on the species classifications as well. The final page of the project setup wizard (Figure 10) collects the options concerning the TWINSPAN algorithm and also the output shown in the WinTWINS main window or stored in the classification file.

Figure 10

The maximum number of division levels (1 to 9) determines the maximum level of recursive splitting for samples (in the sample classification) and for species (in the species classification). Even with large data sets there is little purpose in continuing to subdivide groups beyond a certain limit. Six levels of division are apt to produce about 64 groups. If there is a large number of groups, then interpretation can be difficult. Users will soon get a feel for how much division they require with any problem. Rarely will they want fewer than four levels, and almost never more than seven.

Groups of samples (or of species, in the species classification step) smaller than the value specified in the Minimum group size for division field will not be further divided. It must be noted, however, that groups smaller than the specified size will often be formed, e.g. when the natural structure of the data is one large group and a few outliers. It is generally better to control the maximum number of division levels (preceding parameter) than to control the size of generated groups, though there certainly does come a size below which it is not worth dividing groups further.

The value specified in the Maximum number of indicators per division field determines the maximum number of species that can be used in the indicator ordination (see section 3.4). If no indicator ordination is desired, then this parameter can be set to zero. If a particular dichotomy can be specified precisely by an indicator ordination based on a smaller number of indicators than the maximum, then the smaller number will be used.

The Number of species in final tabulation is used when creating the two-way table at the end of WinTWINS output (the table is created in two alternative formats, see section 5.7).

It is often inconvenient to clutter this table with rare species. Therefore only the N commonest species are shown, where N is the value specified here.

The diagrams of division display in a simple form how the indicator ordination relates to the refined ordination, and in particular whether the misclassifications are approximately borderline cases (see section 5.4 for further explanation). If the Show diagrams of division in the analysis log option is checked, a text-form scatter diagram is present in the analysis log for each dichotomy (split).

The hierarchical classification of samples and species can be stored in a simply formatted text file (known as the "machine-readable copy of solution" in the previous TWINSPAN versions), typically with a .pun extension. If the Store classification in a file option is checked, the output file name should be specified in the Classification file name edit field. The WinTWINS project setup wizard fills the file name field with a default value, which is the file path for the input data, with the file extension changed to .pun.

If the final option, named Continue with the project analysis, is checked, WinTWINS continues, after you click the Finish button, with the actual analysis of the project data. Instead of pressing the Finish button, you can also use the Back button to return to previous pages, to review or modify the settings made there.

4.3 Analysis output

The analysis results are placed (except the classification of individual samples and species, placed in the classification file) in the WinTWINS window, where they can be inspected, copied to other programs and saved into a file with text format (using the Project / Save log menu command). To make reading of the analysis log shown in the WinTWINS window easier, you can adjust the size of displayed characters using the dialog displayed by the View / Window Font command. Note that the output formatting relies in many places on use of a fixed pitch (non-proportional) font, such as Courier or Courier New. If you select instead a typeface with varying width of individual characters, the layout of many tables will be broken. The output from the TWINSPAN method is described in the next chapter.


5 WinTWINS Output
5.1 Reading the data matrix

The first part of the output concerns the input of the data matrix and the omission of samples. The cut levels are explained above. Only the beginning and end of the data matrix are printed, and all quantities are multiplied by 1000 and rounded to the nearest integer. Values of -1 indicate the end of a sample. Thus a record

  1 46000 5 2500 -1 1 56000 6 4500 -1

would refer to two samples, one with species 1 having quantity 46.0 and species 5 having quantity 2.5, and the other with species 1 having quantity 56.0 and species 6 having quantity 4.5. The matrix is printed out in the form that is held in the computer in order to remind the user of two things:

1. How the length of the raw data array relates to the number of non-zero items in the data matrix; and
2. That quantities smaller than 0.001 are lost in roundoff.

5.2 Entry of parameters

The program gives information about parameters entered when the project was set up. After the input parameters, three important statistics on the data are printed out:

1. Length of data array after defining pseudospecies;
2. Total number of species and pseudospecies; and
3. Number of species, excluding pseudospecies and ones with no occurrences.

With big problems these statistics may be relevant to determining whether it is necessary to reduce the number of pseudospecies to make room for the data in the computer.

5.3 Classification of the samples

Divisions are made successively, according to the scheme below.

                  1
                  *
         2                 3
         *0                *1
     4       5         6       7
    *00     *01       *10     *11


Each group is represented by two numbers, one in decimal, one in binary notation. If, in the hierarchy above, the symbol * is replaced by the number 1, then the decimal and binary numbers can be seen to be identical. Thus 10 is the representation of 2 in binary code, 110 represents 6, etc. In general, group n is divided to obtain group 2n (the negative group) and group 2n+1 (the positive group). The binary representation of the nodes of the hierarchy is more directly interpretable than the decimal representation, 0 denoting a left arm and 1 denoting a right arm.
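The relation between the decimal and binary group numbers can be checked with a few lines of Python. The helper names are hypothetical; only the numbering rule (group n divides into 2n and 2n+1, and * stands for the leading binary digit 1) comes from the text.

```python
def children(n):
    """Return the (negative, positive) groups obtained by dividing group n."""
    return 2 * n, 2 * n + 1


def binary_label(n):
    """Binary notation of group n, with the leading 1 written as '*'."""
    return "*" + format(n, "b")[1:]


def group_number(label):
    """Decimal group number for a label such as '*10'."""
    return int("1" + label[1:], 2)
```

For example, group 3 (label *1) divides into groups 6 and 7, whose labels are *10 and *11; reading * as 1 gives the binary numbers 110 and 111.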
***********************************************************************************
DIVISION 1 (N= 20) I.E. GROUP *
Eigenvalue 0.531 at iteration 3
INDICATORS, together with their SIGN
Agr sto 1(+) Ran fla 1(+) Lol per 5(-)
Maximum indicator score for negative group 0 Minimum indicator score for positive group 1
Items in NEGATIVE group 2 (N= 12) i.e. group *0
1....... 2....... 3....... 4....... 5....... 6....... 7....... 10...... 11...... 17...... 18...... 19......
Items in POSITIVE group 3 (N= 8) i.e. group *1
8....... 9....... 12...... 13...... 14...... 15...... 16...... 20......
NEGATIVE PREFERENTIALS
Ach mil 1( 7, 0) Ant odo 1( 6, 0) Bel per 1( 6, 0) Bro hor 1( 5, 0) Ely rep 1( 5, 1) Hyp rad 1( 3, 0) Lol per 1( 10, 2) Pla lan 1( 7, 0) Poa pra 1( 11, 3) Tri pra 1( 3, 0) Vic lat 1( 3, 0)
Ach mil 2( 6, 0) Ant odo 2( 6, 0) Bel per 2( 6, 0) Bro hor 2( 5, 0) Ely rep 2( 5, 1) Hyp rad 2( 3, 0) Lol per 2( 10, 2) Pla lan 2( 7, 0) Poa pra 2( 10, 3) Tri pra 2( 3, 0)
Ant odo 3( 5, 0) Bro hor 3( 3, 0) Ely rep 3( 5, 1) Leo aut 3( 8, 1) Lol per 3( 8, 1) Pla lan 3( 6, 0) Poa pra 3( 9, 2) Rum ace 3( 3, 0)
Ant odo 4( 4, 0) Ely rep 4( 5, 1) Leo aut 4( 4, 0) Lol per 4( 8, 1) Pla lan 4( 3, 0) Poa pra 4( 7, 2) Tri rep 4( 3, 1)
Leo aut 5( 4, 0) Lol per 5( 8, 0) Pla lan 5( 3, 0) Tri rep 5( 3, 1)
Lol per 6( 6, 0) Poa tri 6( 3, 1)
POSITIVE PREFERENTIALS
Agr sto 1( 2, 8) Alo gen 1( 3, 5) Ele pal 1( 0, 5) Jun art 1( 0, 5) Jun buf 1( 1, 3) Pot pal 1( 0, 2) Ran fla 1( 0, 6) Sag pro 1( 3, 4) Cal cus 1( 0, 3)
Agr sto 2( 2, 8) Alo gen 2( 3, 5) Ele pal 2( 0, 5) Jun art 2( 0, 5) Jun buf 2( 1, 3) Pot pal 2( 0, 2) Ran fla 2( 0, 6) Sag pro 2( 3, 4) Cal cus 2( 0, 3)
Agr sto 3( 2, 8) Alo gen 3( 1, 5) Ele pal 3( 0, 5) Jun art 3( 0, 5) Jun buf 3( 0, 3) Cal cus 3( 0, 3)
Agr sto 4( 2, 7) Alo gen 4( 1, 4) Ele pal 4( 0, 5) Jun art 4( 0, 3) Jun buf 4( 0, 2) Bra rut 4( 3, 4)
Agr sto 5( 1, 3) Alo gen 5( 1, 3) Ele pal 5( 0, 2)
NON-PREFERENTIALS
Leo aut 1( 11, 7) Poa tri 1( 8, 5) Rum ace 1( 3, 2) Tri rep 1( 10, 6) Bra rut 1( 9, 6)
Leo aut 2( 11, 7) Poa tri 2( 8, 5) Rum ace 2( 3, 2) Tri rep 2( 9, 5) Bra rut 2( 9, 6)
Poa tri 3( 7, 4) Tri rep 3( 4, 3) Bra rut 3( 4, 4)
Poa tri 4( 7, 4)
Poa tri 5( 5, 2)
End of level 1
***********************************************************************************

Figure 11 First dichotomy resulting from application of WinTWINS to the dune meadow data. The indicators have been underlined and emboldened. It may seem surprising that the pseudospecies Sag pro 2 (3, 4) is a positive preferential. The reason for this is that it occurs in 3/12 = 25% of the samples in the negative group, and in 4/8 = 50% of the samples in the positive group. It is twice as likely to occur in the positive group as in the negative group and hence qualifies as preferential.
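The arithmetic in the caption can be checked with a short Python sketch. It is only an illustration, not the WinTWINS code: the function name and its arguments are hypothetical, and the two thresholds (a frequency ratio of at least 2 and a frequency of at least 20% on the preferred side) are taken from the definition of preferentials given in section 5.5.

```python
def preference(neg_count, pos_count, n_neg, n_pos, ratio=2.0, min_freq=0.2):
    """Classify one pseudospecies from its occurrence counts on the two sides.

    Hypothetical helper: returns 'negative', 'positive', 'non-preferential',
    or None when the pseudospecies is too rare on both sides to be listed.
    """
    f_neg = neg_count / n_neg  # relative frequency on the negative side
    f_pos = pos_count / n_pos  # relative frequency on the positive side
    if f_neg >= min_freq and f_neg >= ratio * f_pos:
        return "negative"
    if f_pos >= min_freq and f_pos >= ratio * f_neg:
        return "positive"
    if max(f_neg, f_pos) >= min_freq:
        return "non-preferential"
    return None


# Entries from Figure 11 (12 negative samples, 8 positive samples).
print(preference(3, 4, 12, 8))    # Sag pro 2
print(preference(10, 2, 12, 8))   # Lol per 2
print(preference(11, 7, 12, 8))   # Leo aut 1
```

Comparing relative frequencies rather than raw counts is what allows Sag pro 2, with only one more occurrence on the positive side, to qualify as a positive preferential.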

Consider the output for a dichotomy line by line (refer to Fig. 11). Line 1 gives the number of the group to be divided, both in decimal and in binary notation, together with the number of elements that it contains. Line 2 gives information on the primary ordination (correspondence analysis). Each iteration involves eight passes of the data.


Lines 3-5 specify the indicator ordination. Each sample is given an indicator score found by adding +1 for each positive indicator and -1 for each negative indicator that it contains. The sign (+ or -) of each indicator follows immediately after its name in the list of indicators. The division derived from the indicator ordination is specified on line 5. The two values printed are the maximum indicator score for a sample to be assigned to the negative group, and the minimum score for it to be assigned to the positive group. If these values are A and B respectively, then B is always 1+A. In the example, only those samples with an indicator score of 0 or less are included in the negative group.

Referring to the two-way table (Table 1), it can be seen that two samples (numbers 3, 4) have indicators for both sides, namely Lol per 5 (high abundance of Lolium perenne) indicating the drier end and Agr sto 1 (presence of Agrostis stolonifera) indicating the wetter end. These samples are assigned to the negative group. They are obviously somewhat intermediate, though not so much so as to count as borderline cases for this dichotomy. (In the second division *0, separating the extreme dry samples 17, 19 from the ordinarily dry samples, sample 18 is signified as a borderline positive.) In the first division, the extreme dry samples 17, 19 contain no indicators but are classified correctly.

The indicators are listed in an approximate order of effectiveness. Thus in Fig. 11, Agrostis stolonifera (positive) is a better indicator than Ranunculus flammula (positive) and Lolium perenne 5 (negative).

5.4 Relation between indicator ordination and refined ordination

This section is somewhat technical, and is for readers who want to understand how borderline and misclassified cases are defined. We consider here the ordinations for the first dichotomy obtained when classifying the Danube meadow data discussed by Mueller-Dombois & Ellenberg (1974). A scatter diagram showing the position of the samples is shown below (Fig. 12).

For ease of computing, the ordination is divided into segments. These have no special significance, but are merely a convenient way of calculating where to locate the zone of indifference. The critical zone (Fig. 12) is a zone near the centre of the refined ordination where it is allowable to make divisions. There are five possible positions for the zone of indifference such that it lies entirely within the critical zone. The location of the zone of indifference is selected to minimize the number of misclassified samples (see below).

Segments of the refined ordination are shown along the top of the scatter diagram, which is not to scale. The critical zone (segments 5-12) is 20% of the length of the whole ordination. The length of the segments within the critical zone is one-quarter of the length of the segments outside it.

Borderline negatives are those items that are in the negative group and which also lie in the zone of indifference. In Fig. 12, sample 16 is a borderline negative, and is assigned to the negative group because it has an indicator score of -1. The refined ordination is highly polarized, so that there are normally few borderline cases. A borderline case is assigned to its class according to its indicator score, the refined ordination being held in this case to be indecisive.

Misclassified negatives are those samples lying to the left of the zone of indifference but whose indicator score would assign them to the positive arm of the dichotomy. This is regarded as a failure on the part of the indicator ordination to reproduce the dichotomy accurately, and hence as a misclassification. Because such samples are outside the narrow zone of indifference, the refined ordination takes priority. Items in positive group, borderline positives, and misclassified positives are defined as for negatives.
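The interplay between the indicator score and the refined ordination can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the WinTWINS implementation: the function names are hypothetical, positions are expressed as segment numbers of the refined ordination, and the segment assigned to sample 16 in the usage lines is invented for the example (the text states only that it lies in the zone of indifference). The indicator list is the one printed for the dune meadow data in Fig. 11.

```python
def indicator_score(sample, indicators):
    """Sum +1 / -1 over the indicator pseudospecies present in the sample."""
    return sum(sign for name, sign in indicators.items() if name in sample)


def assign(position, zone, score, max_negative_score):
    """Return (group, status) for one sample.

    position: segment of the refined ordination in which the sample lies
    zone: (first, last) segment of the zone of indifference
    score: indicator score of the sample
    max_negative_score: largest score still assigned to the negative group
    """
    first, last = zone
    by_indicator = "negative" if score <= max_negative_score else "positive"
    if first <= position <= last:
        # Refined ordination is indecisive: the indicator score decides.
        return by_indicator, "borderline"
    # Outside the zone of indifference the refined ordination takes priority.
    by_ordination = "negative" if position < first else "positive"
    status = "agrees" if by_ordination == by_indicator else "misclassified"
    return by_ordination, status


# Indicators of the first dune meadow division (Fig. 11).
INDICATORS = {"Agr sto 1": +1, "Ran fla 1": +1, "Lol per 5": -1}
# A sample with one indicator from each side scores 0, as samples 3 and 4 do.
print(indicator_score({"Lol per 5", "Agr sto 1"}, INDICATORS))

# Zone of indifference = segments 8-11, as in Fig. 12; segment 9 is assumed.
print(assign(position=9, zone=(8, 11), score=-1, max_negative_score=0))
```

The second call reproduces the treatment of sample 16 described above: it lies in the zone of indifference, so its indicator score of -1 places it in the negative group as a borderline case.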

Fig. 12 Relation between indicator ordination and refined ordination for Danube meadow data (not discussed here). The segments of the refined ordination are indicated along the top of the diagram. Segments 5-12 constitute the critical zone, and segments 8-11 constitute the zone of indifference (abbreviated Z.I.) in the diagram. Samples are designated as borderline cases if they lie in the zone of indifference. In this example, only sample 16 is borderline.

5.5 Preferential species and pseudospecies

Negative preferentials are those pseudospecies and species that are at least twice as likely to occur on the negative side as on the positive side. Only those that occur in at least 20% of the samples on the negative side are listed. Values given in brackets are actual numbers of occurrences, so that the entry Lol per 2( 10, 2) signifies that Lolium perenne occurs with abundance 2 or more in 10 samples on the negative side of the dichotomy and in 2 samples on the positive side of the dichotomy (Tab. 1).

When there is a very uneven split, negative preferentials can easily occur in more samples on the positive side of the dichotomy than on the negative side. Suppose, for example, that the split is into 2 samples on the negative side and 10 on the positive. Then a species that occurs in 2 samples on the negative side and 4 samples on the positive side is a good negative preferential. It has 100% frequency on the negative side and 40% frequency on the positive side, and is therefore 2.5 times as likely to occur on the negative side as on the positive side.

Positive preferentials are defined as for negative preferentials; non-preferentials are pseudospecies and species that are reasonably common and which are not preferentials. Here again, only those that achieve 20% frequency on one side or the other are listed.

5.6 Species classification

Species are classified by WinTWINS in much the same way as samples. However, there is an important difference in that the species classification is made in the light of the sample classification, and not using the raw data. In fact, the species classification is made on the basis of fidelity, i.e. using the degree to which species are confined to particular groups of samples. For example, in Tab. 1, Aira praecox and Achillea millefolium are both completely faithful to group *0 of the sample hierarchy, i.e. they have no occurrences outside this group. However, Aira is also completely faithful to group *00 of the sample hierarchy, whereas Achillea is not.

Species are assigned attributes according to differing degrees of fidelity to sample groups. For example, both Aira praecox and Achillea millefolium will be registered as having the attribute "very highly faithful to group *0", but Aira is also very highly faithful to group *00, whereas Achillea is not. Indeed, Achillea occurs widely outside group *00, in group *01. Because the species classification is made on the basis of fidelity to groups defined by the sample classification, altering the level of division of the classification may alter the species classification even if no other changes are made.

The species classification differs from the sample classification in that indicator characters are of little interest. It is scarcely of much significance to know that (for example) "low level of fidelity to sample group X" is a preferential attribute of a particular group of species. Consequently, the indicator ordination is omitted from the calculation and output. If the data set is very large, there may be no room for the new matrix of species' attributes as well as the original matrix. If so, then the species classification is abandoned.

5.7 Order of species and samples

The samples are generally ordered so that the group *1 is smaller than *0, and thereafter so that *01 resembles *1 more than *00 resembles *1. Further divisions are also ordered by similarity. Species are ordered so that species in group *0 generally occur in sample group *0 and species in group *1 generally occur in sample group *1. Thus the ordering of each dichotomy is not arbitrary, but is designed to ensure that the occurrences of species are located on the positive diagonal (Tab. 1).

These orderings are printed out for two purposes. In the first place, the final table may not contain all the less common species, so that the orderings may be convenient as an indication of where the rarer species would come in the arrangement. Secondly, and more importantly, the final tabulation refers to the samples by number, not by name as elsewhere in the program. The (historic) reason for this is that it was difficult to write the names vertically using FORTRAN 4. In WinTWINS, the sample names are available in the second tabulation, in TSV format. This can be read directly into a spreadsheet.

If there are more than 250 samples, the table is split into subtables, each (except, possibly, the last one) with 250 sample columns. In such a case, each subtable is introduced by another line saying Part <I> of <N>, where <N> is the total number of subtables and <I> is a value between 1 and <N>.
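For anyone post-processing the tabulation, the splitting rule can be mimicked with a few lines of Python. This is a sketch only: the function name is hypothetical, and it labels every part, whereas the text above says the Part line appears only when there are more than 250 samples.

```python
def split_into_subtables(sample_ids, width=250):
    """Split the ordered sample columns into consecutive subtables.

    Returns a list of (header, columns) pairs, with headers of the form
    'Part <I> of <N>'. Only the last subtable may have fewer than `width`
    columns.
    """
    parts = [sample_ids[i:i + width] for i in range(0, len(sample_ids), width)]
    total = len(parts)
    return [(f"Part {k} of {total}", part) for k, part in enumerate(parts, start=1)]


# 600 samples give three subtables of 250, 250 and 100 columns.
for header, columns in split_into_subtables(list(range(1, 601))):
    print(header, len(columns))
```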


6 Key algorithms of TWINSPAN method


6.1 Refined ordination

The primary ordination is polarized to produce the refined ordination, which is in effect a discriminant function to make the classification. First, the primary ordination is converted to a dichotomy by dividing it at its centre of gravity. Then the frequencies of attributes on the positive and negative sides of the dichotomy are used to construct a new ordination (and hence the final dichotomy).

The discriminant function is derived by adding together two ordinations. The first of these is additive; the second is based on a mean. The first ordination will generally be the dominant partner at the lower levels of the hierarchy; the second will be dominant at the higher levels.

The first ordination is got by adding together the preference scores of the commoner and more preferential species. A preference score of +1 is given to each attribute that is at least three times as frequent on the positive side as on the negative side, and which is commoner than a certain minimum. Common negative preferentials are treated similarly, and are given a score of -1. Attributes that are rarer, or which are less markedly preferential, are downweighted accordingly. In symbols, let an attribute J have frequency AYY on the positive side, and frequency AY on the negative. Define

PREF = (AYY - AY)/(AYY + AY)
FREQ = AYY + AY

Then the score given to the attribute is

Preference score = (FREQ/FRQLIM) * (PREF/PRLIM)^5

This downweighting is applied only when either FREQ falls below the threshold FRQLIM (set to 0.2), or when the absolute value of PREF falls below PRLIM (which is equal to (RATLIM - 1)/(RATLIM + 1), where RATLIM is the frequency-ratio limit, currently set to 3.0, so that PRLIM = 0.5). Clearly, rare attributes are mildly downweighted, but attributes with an insufficient preference ratio are greatly so. The first ordination is completed by standardizing the values so that the maximum absolute value is 1.
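The preference score can be sketched in Python as follows. This is one reading of the description, not the original FORTRAN: it assumes that the 5 attached to the formula is an exponent, that AYY and AY are relative frequencies (proportions of samples), and that each factor is capped at 1 so that downweighting applies only below the thresholds. The function name is hypothetical.

```python
import math

FRQLIM = 0.2                                # frequency threshold from the text
RATLIM = 3.0                                # frequency-ratio limit from the text
PRLIM = (RATLIM - 1.0) / (RATLIM + 1.0)     # = 0.5


def preference_score(ayy, ay):
    """Preference score of one attribute.

    ayy: frequency on the positive side, ay: frequency on the negative side.
    Common, strongly preferential attributes score +1 or -1; rarer or less
    preferential ones are downweighted (assumed interpretation, see lead-in).
    """
    freq = ayy + ay
    if freq == 0.0:
        return 0.0
    pref = (ayy - ay) / freq
    freq_factor = min(freq / FRQLIM, 1.0)            # mild, linear downweighting
    pref_factor = min(abs(pref) / PRLIM, 1.0) ** 5   # severe, fifth-power downweighting
    return math.copysign(freq_factor * pref_factor, pref)


print(preference_score(0.6, 0.2))      # common, three times as frequent on the positive side
print(preference_score(0.075, 0.025))  # same ratio, but rare
print(preference_score(0.45, 0.35))    # common, but only weakly preferential
```

Under these assumptions the rare attribute keeps half of its weight, while the weakly preferential one is reduced to about a thousandth, which matches the remark that an insufficient preference ratio is penalized far more heavily than rarity.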
The other ordination is got simply by taking the mean value of PREF/PRLIM (truncated to a maximum absolute value of 1) over the attributes present in the individual. This allows rarer attributes to contribute to the discrimination as well. Finally, the refined ordination is got by adding these two. In practice, this process polarizes the ordination quite strongly, so that there are few borderline cases.

6.2 Ordering of dichotomies

Consider the following hierarchy, and in particular the question of how to order the pairs 10, 11 and 14, 15.


                 1
         2               3
     4       5       6       7
   8   9  10  11  12  13  14  15

To get as smooth an arrangement as possible, we want 10 to be like 9, and 11 to be like 12; and we want 14 to be like 13, whereas 15 should be relatively unlike 13. One way to achieve this aim would be first to complete all the divisions at a particular level of the hierarchy, and then to try swiveling the dichotomies to see which ordering is the most satisfactory. An alternative approach, adopted here, is to try to see whether 10 or 11 more closely resembles 4, and whether 14 or 15 more closely resembles 6. This has the advantage that groups at level 4 are compared with those at level 3, so that the correct order can be determined at the time each dichotomy is formed. It has the added advantage that it compares the new groups with what are likely to be relatively large groups, so that order is determined more by general relationships than by particular accidents.

The basic ordering criterion in WinTWINS, then, is to take two groups (2n) and (2n + 1) and to compare them with (n - 1) if n is odd and with (n + 1) if n is even. The comparison is made by setting up a discriminant function to distinguish (2n) from (2n + 1), and then asking where this discriminant would place the comparison group (which is either (n - 1) or (n + 1), according to whether n is odd or even).

This simple ordering criterion is complicated by the fact that the groups (2n) and (2n + 1) are compared not only with one or other of (n - 1) and (n + 1), but also with one or other of (n/2 - 1) and (n/2 + 1). That is to say, not only is the local order considered, but also the relation to groups two stages back in the hierarchy. Let YSCOR1, YSCOR2 be the values of the discriminant function for comparisons one and two levels back in the hierarchy. Then the final discriminant is

YSCORE = YSCOR1 + W*YSCOR2

where W is a weight function set to +0.5 or -0.5 (the sign depending on the remainder when n is divided by 4).
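The choice of comparison group and the combination of the two discriminant values can be written down in a few lines of Python. The sketch below is illustrative only and the function names are hypothetical. The rule for the sign of W is not spelled out in the text, so W is passed in by the caller rather than computed.

```python
def comparison_group(n):
    """Group with which the children (2n) and (2n + 1) of group n are compared.

    Uses the rule stated in the text: (n - 1) if n is odd, (n + 1) if n is even.
    Only meaningful for n >= 2, since group 1 has no sibling.
    """
    if n < 2:
        raise ValueError("group 1 has no neighbouring group to compare with")
    return n - 1 if n % 2 == 1 else n + 1


def final_discriminant(yscor1, yscor2, w):
    """YSCORE = YSCOR1 + W*YSCOR2, with W restricted to +0.5 or -0.5."""
    if w not in (0.5, -0.5):
        raise ValueError("W must be +0.5 or -0.5")
    return yscor1 + w * yscor2


# Groups 10 and 11 are the children of group 5, which is compared with group 4;
# groups 14 and 15 are the children of group 7, which is compared with group 6.
print(comparison_group(5), comparison_group(7))
print(final_discriminant(1.0, 0.5, -0.5))
```

In both cases the comparison group is the sibling of the parent, which is why the order of a new pair can be fixed as soon as its dichotomy is formed.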
This device pushes responsibility for the order even further back in the hierarchy, with the aim of achieving greater stability in the ordering. In particular, if the division at level 2 was effectively orthogonal to that at level 3, then the order dictated by relations at level 2 may be irrelevant to those at level 3; whereas relations at level 1 may still indicate a clear preference.

6.3 Conversion of data matrix for species classification

It was noted above that the fidelities of species to sample groups are used as attributes for making the species classification. The rationale behind using fidelity is that species should be considered as similar or dissimilar not according to whether they are common or rare, but according to whether they occur in similar sample groups. Of course, an exceedingly common species such as Leontodon autumnalis in Tab. 1 will not be highly faithful to any group, and will therefore not appear similar to any exceedingly rare species.

Fidelity is defined as follows. Consider a class of samples, IC. Then the fidelity of species J to class IC is defined by the ratio RAT(IC,J), which is the frequency of J in class IC divided by its frequency in the samples not in IC. Using the fidelity ratio RAT, it is possible to define attributes for the species. Species J is deemed to have attribute F(IC,K) if RAT(IC,J) is greater than or equal to the limit SPE(K). Three values of SPE are used here: SPE(1) = 0.8; SPE(2) = 2.0; SPE(3) = 6.0.

In the species classification, both the species and their attributes are given differing weights, as follows.

1. Extra weight for high fidelity. Attributes of the form F(IC,2) and F(IC,3) are given twice the weight of F(IC,1).

2. Extra weight for commoner species. In most applications it is undesirable to have numerous small groups of similar rare species segregating off at an early stage of the classification. This can be avoided by weighting the species according to their abundance, in which case the rarer species are unlikely to dominate the classification, and will tend to align themselves with commoner ones. The weighting used here is

w_j = \sum_i b_{ij}

where the matrix elements b_{ij} are (as in the definition of RAT) those that appear in the final tabulation.

3. Extra weight for larger groups and for the higher levels of the hierarchy in the sample classification. Weights of the attributes F(IC,K) corresponding to the class IC are multiplied by

w_{IC} = 2^{-L/2} \sum_{i \in IC} \sum_j b_{ij}

where the first summation is taken over those samples i that belong to group IC and L is the level of group IC. In effect, this means that each level down the hierarchy is given a weight that is a factor of \sqrt{2} less than that above, and that less attention is paid to fidelity to small groups than to large ones.
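The fidelity ratio and the attributes derived from it can be illustrated with a small Python sketch. It is not the WinTWINS code: the function names are hypothetical, frequency is taken to mean the proportion of samples in which the species occurs, and a species with no occurrences outside the class is given an infinite ratio (an assumption; the text does not say how this case is handled). The occurrence vectors in the usage lines are invented for illustration.

```python
import math

# Fidelity limits SPE(1), SPE(2), SPE(3) as given in the text.
SPE = {1: 0.8, 2: 2.0, 3: 6.0}


def fidelity_ratio(occurs, in_class):
    """RAT(IC, J): frequency of a species inside class IC over its frequency outside.

    occurs[i] is truthy if the species occurs in sample i; in_class[i] is truthy
    if sample i belongs to IC. Both the class and its complement must be non-empty.
    """
    n_in = sum(1 for c in in_class if c)
    n_out = len(in_class) - n_in
    f_in = sum(1 for o, c in zip(occurs, in_class) if o and c) / n_in
    f_out = sum(1 for o, c in zip(occurs, in_class) if o and not c) / n_out
    if f_out == 0.0:
        # Assumed convention: completely faithful species get an infinite ratio.
        return math.inf if f_in > 0.0 else 0.0
    return f_in / f_out


def fidelity_attributes(rat):
    """Levels K for which the species has attribute F(IC, K)."""
    return [k for k, limit in SPE.items() if rat >= limit]


in_class = [1, 1, 1, 1, 0, 0]                 # four samples in IC, two outside
faithful = [1, 1, 1, 0, 0, 0]                 # never occurs outside IC
widespread = [1, 1, 0, 0, 1, 0]               # equally frequent inside and outside
print(fidelity_attributes(fidelity_ratio(faithful, in_class)))
print(fidelity_attributes(fidelity_ratio(widespread, in_class)))
```

Because the limits are nested, a species that reaches SPE(3) automatically carries the attributes for SPE(1) and SPE(2) as well.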


7 Sample dataset
The WinTWINS distribution includes a sample dataset, the dune meadows data, in the file TABLE01.DTA. A simple analysis of this dataset results in the arrangement of samples and species shown in Table 1. To achieve identical output, you must select 6 pseudospecies cut levels in the Cut Levels dialog, as shown in the following figure.

Note that the cut level with value 0 is not present. It is not necessary, because the smallest positive value in this data set is equal to 1, and zero values are not recorded.


8 Acknowledgements
TWINSPAN was written during a year of paid leave at Cornell University, granted to Mark Hill (MOH) by his employer, the Natural Environment Research Council of Great Britain. Funds for computing and travelling were provided by the US National Science Foundation Grant DEB 78-09340. The late Dr R.H. Whittaker invited MOH to visit Cornell, suggested that he should write a program to eliminate the irritating arch effect of correspondence analysis, and encouraged him to pursue the possibilities of numerical classification. TWINSPAN and a program called DECORANA were the result. MOH is indebted both to him and to H.G. Gauch, Jr, for much encouragement and criticism, and to S.B. Singer for his expert and good-humoured advice about the Cornell computer. Dr P.L. Marks provided some test data as well as valuable help and criticism. At Cornell, Beth Marks did an expert job in typing up the manuals. The manuals were subsequently retyped on a wordprocessor in England by Liz Guerin and parts of her script are reproduced here.

The TWINSPAN code includes the correction for an algorithmic flaw that was pointed out by J. Oksanen and P.R. Minchin (1997). The changes to the TWINSPAN code made by Cajo Ter Braak, for user-defined weights of species and samples, and by John Birks, for sideways printing of table labels, are also included and gratefully acknowledged. In preparing WinTWINS, we have received valuable criticism from Jan Lepš and John Birks.


9 References
1. Austin, M.P. (1976): On non-linear species response models in ordination. Vegetatio, 33, 33-41.
2. Cormack, R.M. (1971): A review of classification. Journal of the Royal Statistical Society, Series A, 134, 321-367.
3. Gauch, H.G. (1977): ORDIFLEX - a flexible computer program for four ordination techniques: weighted averages, polar ordination, principal components analysis, and reciprocal averaging, Release B. Ecology and Systematics, Cornell University, Ithaca, New York.
4. Gauch, H.G. (1982): Multivariate analysis in community ecology. Cambridge University Press.
5. Gauch, H.G. & Whittaker, R.H. (1972): Comparison of ordination techniques. Ecology, 53, 868-875.
6. Gauch, H.G., Whittaker, R.H. & Wentworth, T.R. (1977): A comparative study of reciprocal averaging and other ordination techniques. Journal of Ecology, 65, 157-174.
7. Goodall, D.W. (1953a): Objective methods for the classification of vegetation. I. The use of positive interspecific correlation. Australian Journal of Botany, 1, 39-63.
8. Goodall, D.W. (1953b): Objective methods for the classification of vegetation. II. Fidelity and indicator value. Australian Journal of Botany, 1, 434-456.
9. Gower, J.C. (1974): Maximal predictive classification. Biometrics, 30, 643-654.
10. Hill, M.O. (1973a): Reciprocal averaging: an eigenvector method of ordination. Journal of Ecology, 61, 237-249.
11. Hill, M.O. (1973b): Diversity and evenness: a unifying notation and its consequences. Ecology, 54, 427-432.
12. Hill, M.O. (1974): Correspondence analysis: a neglected multivariate method. Applied Statistics, 23, 340-354.
13. Hill, M.O., Bunce, R.G.H. & Shaw, M.W. (1975): Indicator species analysis, a divisive polythetic method of classification, and its application to a survey of native pinewoods in Scotland. Journal of Ecology, 63, 597-613.
14. Hill, M.O. (1977): Use of simple discriminant functions to classify quantitative phytosociological data. First International Symposium on Data Analysis and Informatics, Versailles 7-9 Sept. 1977 (Ed. by E Diday, L Lebart, J P Pages & R Tomassone), Vol. 1, pp. 181-199. Institut de Recherche d'Informatique et d'Automatique, Domaine de Voluceau, Rocquencourt, le Chesnay, France.
15. Hill, M.O. (1979a): DECORANA - a FORTRAN program for detrended correspondence analysis and reciprocal averaging. Ecology and Systematics, Cornell University, Ithaca, New York.
16. Hill, M.O. (1979b): TWINSPAN - a FORTRAN program for arranging multivariate data in an ordered two-way table by classification of the individuals and attributes. Cornell University, Ithaca, New York.
17. Hill, M.O. & Gauch, H.G. (1980): Detrended correspondence analysis: an improved ordination technique. Vegetatio, 42, 47-58.
18. Hill, M.O. (1993): TABLEFIT version 0.0, for identification of vegetation types. Institute of Terrestrial Ecology, Huntingdon.
19. Kendall, D.G. (1971): Seriation from abundance matrices. Mathematics in the Archaeological and Historical Sciences (Ed. by F.R. Hodson, D.G. Kendall & P. Tautu), pp. 215-252.
20. Kershaw, K.A. & Looney, J.H. (1985): Quantitative and dynamic plant ecology, 3rd edition. Edward Arnold, London.
21. Maarel, E. van der, Janssen, J.G.M. & Louppen, J.M.W. (1978): TABORD, a program for structuring phytosociological tables. Vegetatio, 38, 143-156.
22. Mueller-Dombois, D. & Ellenberg, H. (1974): Aims and methods of vegetation ecology. John Wiley & Sons, New York.
23. Oksanen, J. & Minchin, P.R. (1997): Instability of ordination results under changes in input data order: explanations and remedies. Journal of Vegetation Science, 8, 447-454.
24. Ter Braak, C.J.F. & Šmilauer, P. (2002): CANOCO Reference manual and CanoDraw for Windows User's guide: software for canonical community ordination (version 4.5). Microcomputer Power, Ithaca, New York, 500 pp.
25. Westhoff, V. & Maarel, E. van der (1973): The Braun-Blanquet approach. Ordination and Classification of Communities (Ed. by R H Whittaker), pp. 617-726. Junk, The Hague.
26. Whittaker, R.H. (1960): Vegetation of the Siskiyou Mountains, Oregon and California. Ecological Monographs, 30, 279-338.
27. Whittaker, R.H. (1962): Classification of natural communities. Botanical Review, 28, 1-239.
28. Whittaker, R.H., ed. (1973): Ordination and classification of communities. Junk, The Hague.
