
Randomization test: Fisher's exact test (by Prof. George Cobb at Mount Holyoke)


What if the data values are categorical? One of the simplest randomization tests is Fisher's exact
test, which is used to test hypotheses about data that can be summarized in a 2x2 table of counts.
Example 1. The Salem witchcraft hysteria
The year 1692 saw nineteen convicted witches hanged in Salem Village (now Danvers),
Massachusetts. Almost three centuries later, historians examining documents related to the trials
discovered a striking pattern relating trial testimony and geography. Those who testified against
the accused witches tended to live in the western part of Salem Village; those who testified in
defense of the accused tended to live in the eastern part, which was wealthier, more commercial,
more cosmopolitan, and closer to the town of Salem, at the time the second busiest port in the
colonies.1 A total of 61 residents testified in the trials, of whom 35 lived in the western part of the
village, and 26 in the eastern part. Of the 35 westerners, 30 were accusers and only 5 were
defenders; of the 26 easterners, only 2 were accusers; the remaining 24 were defenders:
                     Testimony
Geography    Accuser    Defender    Total    % accuser
West            30          5         35       85.7%
East             2         24         26        7.7%
Total           32         29         61

Display 2.5 Geography and testimony in the Salem witch trials of 1692
Is it possible to get a pattern as extreme as this just by chance? If the relationship between
geography and testimony were purely random, how likely would a pattern this extreme be?
Represent the residents who testified by poker chips, 35 marked "West" and 26 marked "East."
Put all 61 chips in a bag, mix thoroughly, and draw out 32 at random. Call these "Accusers"; count
how many of the accuser chips say West, how many say East, and record the results in a table.2 I
just did this, and got:
                     Testimony
Geography    Accuser    Defender    Total    % accuser
West            19         16         35       54.3%
East            13         13         26       50.0%
Total           32         29         61

Display 2.6 Results for a sample of accusers drawn at random


My random data set is not nearly as extreme as the actual one.
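A single draw from the bag can be sketched in R. This is a minimal illustration of the chip procedure, not code from the original handout; the seed and the names chips, accusers and draw_table are arbitrary choices made here, and a different seed will give a different table.

```r
# One random draw of 32 "accuser" chips from a bag of 35 West and 26 East chips.
set.seed(1692)  # arbitrary seed, used only to make the sketch repeatable

chips <- c(rep("West", 35), rep("East", 26))
accusers <- sample(chips, 32, replace = FALSE)  # chips drawn out of the bag

west_acc <- sum(accusers == "West")  # West chips among the accusers
east_acc <- 32 - west_acc            # the rest of the draw says East

# Defenders are the chips left in the bag, found by subtraction.
draw_table <- rbind(
  West = c(Accuser = west_acc, Defender = 35 - west_acc),
  East = c(Accuser = east_acc, Defender = 26 - east_acc)
)
draw_table <- cbind(draw_table, Total = rowSums(draw_table))
draw_table
```

The row totals are fixed at 35 and 26 no matter what is drawn; only the split between accusers and defenders within each row changes from draw to draw.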
1 Historians Boyer and Nissenbaum cite this pattern as part of the evidence in support of an economically based
interpretation of the witchcraft hysteria. See Boyer, Paul, and Stephen Nissenbaum (1974). Salem Possessed: The
Social Origins of Witchcraft. Cambridge: Harvard University Press.

2 The chips left in the bag are the defenders. You could count them to fill in the rest of the table, or you could get
the missing values by subtraction, since they are determined by what you draw out.

I repeated the whole process (random draw, count, compare) 10,000 times, using the R code
in Display 2.7, and not once did I get a table as extreme as the actual data. Conclusion: If you
draw at random, it is all but impossible to get a data table like the observed one. In other words,
"It's just a chance relationship" is not a believable explanation for the data.
For the R simulation, I used 0s and 1s to represent East and West. There were 26 people from
the East who testified, and 35 from the West, so my population has 26 0s and 35 1s.
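pop <- c(rep(0, 26), rep(1, 35))   # 0 = East (26 people), 1 = West (35 people)
NRep <- 10000                      # number of random data sets
# TRUE when a random draw of 32 accusers contains at least 30 West residents
compare <- replicate(NRep, sum(sample(pop, 32, replace = FALSE)) >= 30)
pHat <- sum(compare) / NRep        # estimated p-value: proportion of TRUEs
pHat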

Display 2.7 R code for drawing random samples of accusers and estimating p
Here's an abstract version of the same analysis:
Step 1. Generate random data sets.
Population: 61 individuals, 35 of them 1s (West) and 26 of them 0s (East).
Sample: A subset of 32.
Null model: The sample is chosen completely randomly; all subsets of 32 are
equally likely.
Step 2. Compare random data sets with the actual data.
Test statistic: Number of 1s in the sample (= number of West chips among the
randomly chosen accusers.)
Actual data value: There were, in fact, 30 residents of the western part of Salem
Village among the accusers.
Compare: Record a Yes if there are 30 or more 1s in the sample.
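Step 3. Estimate.
Out of 10,000 data sets, none had as many as 30 1s, so the estimated p-value is 0/10,000 = 0.
The true p-value is not exactly zero, but it is evidently very small.
Because the p-value is so very tiny, we reject the null model. It is not a believable explanation
for the actual data.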
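The simulation only estimates the p-value. As a cross-check that is not part of the original handout, the same tail probability can be computed exactly. Under the null model the number of 1s in a random sample of 32 follows a hypergeometric distribution, available in R as phyper, and fisher.test with alternative = "greater" reports the same one-sided value from the 2x2 table. The variable names below are illustrative choices.

```r
# Counts taken from Display 2.5.
n_west <- 35      # residents from the West (the 1s)
n_east <- 26      # residents from the East (the 0s)
n_accusers <- 32  # size of the random sample
observed <- 30    # West residents among the actual accusers

# P(X >= 30): with lower.tail = FALSE phyper returns P(X > q), so use q = 29.
p_exact <- phyper(observed - 1, n_west, n_east, n_accusers, lower.tail = FALSE)

# The same one-sided p-value, computed from the table of counts.
salem <- matrix(c(30, 2, 5, 24), nrow = 2,
                dimnames = list(Geography = c("West", "East"),
                                Testimony = c("Accuser", "Defender")))
p_fisher <- fisher.test(salem, alternative = "greater")$p.value

c(exact = p_exact, fisher = p_fisher)
```

Both numbers should agree and should be far smaller than 1 in 10,000, which is why the simulation in Display 2.7 never produced a table as extreme as the observed one.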
Drill Exercises:
13. A small version of the witch data.
Assume that only 10 people had testified, of whom 4 lived in the west, 6 in the east. Assume
also that 3 were accusers, and that all 3 came from the west. Set up the population, null model,
test statistic and observed value. Then find the p-value and state your conclusion.
14. There is more than one way to define the population and null model for R. Two easy
variations are (a) to reverse the labels for 1s and 0s, so that 1 represents East and 0 represents
West, and/or (b) to reverse the labels for "in the sample" and "not in the sample," so that the sample
represents Defenders, and those not in the sample are the Accusers. For each of these variations,
tell what the test statistic would be, and how to compute the p-value. Verify that the p-value is
the same for all these variations.
15. A more substantive variation reverses the roles of population and sample: Let the 1s and 0s
in the population tell whether an individual was an Accuser or Defender. Let "in the sample"
correspond to West, and "not in the sample" to East. Define a test statistic, and tell how to
modify the R code in Display 2.7 to compute the p-value. (Optional: Run your modified R
code, and verify the (non-obvious) fact that, apart from random variation, it is equal to the
p-value from the original code.)
Example 2. US v. Gilbert
For several years in the 1990s, Kristen Gilbert worked as a nurse in the intensive care unit (ICU)
of the Veterans Administration hospital in Northampton, Massachusetts. Over the course of her
time there, other nurses came to suspect that she was killing patients by injecting them with the
heart stimulant epinephrine.3 Part of the evidence against Gilbert was a statistical analysis of
more than one thousand 8-hour shifts during the time Gilbert worked in the ICU. Was there an
association between Gilbert's presence on the ICU and whether or not someone died on the shift?
Here are the data:

Display 2.8 Data on possible association between nurse Gilbert's presence in the ICU of the
Northampton VA hospital and deaths on a shift.
Drill Exercises:
16. Define the population of 0s and 1s: What does a 0 represent? A 1? How many 0s, and how
many 1s, are in the population? (Note: There is more than one right way to do this.)
17. Define the null model: If you think of drawing a random sample from the population, what
does "drawn out" (that is, "in the sample") represent? What does "not drawn out" represent?
18. Define the test statistic: What does the number of 1s in the sample represent? What is the
observed value of the test statistic?
3 A synthetic form of adrenaline.

19. p-value. Modify the R code in Display 2.7 so that it would compute the p-value. (Optional:
Compute the p-value using your code.)
Example 7. Anthrax: Does moving the air give a better test?
After Senator Tom Daschle received a letter containing anthrax, the Hart Senate Office Building
was fumigated in an attempt to kill the spores. After the first fumigation, public health officials
conducted two tests to see whether the building was safe to work in. In the first test, 17 strips
capable of detecting live anthrax spores were placed throughout the test area, and later were
checked for anthrax. Five of the 17 were positive. In the second test, another 17 strips were
placed in the same locations, but this time suitably protected technicians walked around on the
carpet, moving the room air in the process, to simulate normal office traffic. This time 16 of 17
strips were positive. The results of these tests led to a second, more vigorous, and successful
fumigation.
Exercises
20. Summarize the test results in a 2x2 table.
21. Define a suitable population, null model, and test statistic for Fisher's exact test.
22. Use R to compute an appropriate p-value.
Summary. Fisher's exact test is appropriate when (1) you want to compare two randomly
chosen groups of individuals, and (2) the feature of the individuals that you use to make the
comparison is dichotomous, that is, reducible to yes/no. Think of randomly drawing marbles from a
bucket. This gives two randomly chosen groups: those drawn out, and those left in. The
marbles are of two colors; color is the feature used to compare the two groups. For the Salem
witch data (Example 1), we asked, "What if the accusers and defenders had been chosen at
random?" In that example, the actual accusers and defenders were not chosen at random, but we
wanted to test whether the observed data was consistent with random selection. So under our
null model, the accusers and defenders were the randomly chosen groups. The feature used
for the comparison was geography, east or west. For the Gilbert data (Example 2) we asked,
"What if the deaths had occurred on randomly chosen shifts?" Here, also, the actual shifts were
not chosen that way, but we wanted to compare the actual data with what we would be likely to
get if the shifts had been chosen randomly. Thus shifts with and without deaths were the
randomly chosen groups. The feature used for comparing groups was whether or not Gilbert was
present on the shift.
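The marbles-from-a-bucket picture can be wrapped in one small R function. This is a hedged sketch rather than code from the handout: the function name randomization_p and its arguments are hypothetical, and it assumes a one-sided comparison in which "as extreme" means at least as many 1s as observed.

```r
# Hypothetical helper (not from the original handout): estimate a one-sided
# p-value by repeatedly drawing marbles from a bucket of 1s and 0s.
randomization_p <- function(n_ones, n_zeros, sample_size, observed,
                            n_rep = 10000) {
  bucket <- c(rep(1, n_ones), rep(0, n_zeros))
  # TRUE whenever a random draw has at least `observed` 1s
  hits <- replicate(n_rep,
                    sum(sample(bucket, sample_size, replace = FALSE)) >= observed)
  mean(hits)  # proportion of draws at least as extreme as the data
}

# Salem data: 35 West (1s), 26 East (0s), 32 accusers drawn, 30 West observed.
set.seed(1)  # arbitrary seed, used only for repeatability
p_salem <- randomization_p(35, 26, 32, 30)
p_salem
```

For another data set, only the four counts change: the number of marbles of each color, the number drawn out, and the observed count of 1s among those drawn.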
