
GV207 Political Analysis, Week 06

Department of Government, University of Essex

Crosstabulation and the χ² Test

Recap from last week:

We used observed sample means to draw inferences about unobserved population means.
We performed a significance test by formulating an assumption about the unknown population
mean and calculating the probability of observing our sample mean, assuming the hypothesised
population mean is correct. If the probability was less than 0.05 (i.e. p < 0.05), we concluded that
the difference between our sample mean and the assumed population mean was statistically
significant at the 5% level and we rejected our assumption about the population mean.
We also saw last week that we can follow exactly the same approach if we want to test whether
an observed relationship between two variables, X and Y, in our sample is significantly different
from the assumed relationship in the population.
Remember that we always assume the null hypothesis (H0) holds, meaning there is no
relationship between X and Y in the population. This allows us to calculate the probability of
observing the relationship in our sample (just because of sampling variance) if the null hypothesis
is correct. If the probability is less than 0.05 (i.e. p < 0.05), then we again conclude that the
observed relationship in our sample is statistically significant at the 5% level and we can reject
the null hypothesis.[1]

Measuring relationships between X and Y:


If we want to measure the relationship between an independent variable (X) and a dependent variable
(Y), we are typically interested in three things:
1. Is there a relationship?
This question is about the significance of the relationship: how confident can we be that
there really is a relationship in the population?
2. What is the form of the relationship?
This question deals with the nature of the relationship: is it positive or negative, linear or
U-shaped, or is there an interaction?
3. How strong is the relationship?
This question is about how strongly X and Y are related: How well can we predict the
values of Y given information on X?
Today, we only deal with the first question. We will look at the other questions next week.

Crosstabulations:
There are many different statistical techniques for analysing the relationships between variables. The
method we use depends on the level of measurement of both X and Y. Today we look at how to
analyse relationships between categorical (i.e. nominal or ordinal) variables.
When analysing relationships between categorical variables, we typically rely on
crosstabulations.
Crosstabs are a nice way of looking at the association between two or even three variables
(crosstabs with four or more variables get really nasty).
If a crosstab only includes dummy or ordinal variables, one easy way to evaluate the association
between them is to look at the diagonal strength of their relationship, i.e. at whether the
observations cluster along one diagonal of the table.
[1] Recall that all significance tests (including the test discussed below) rely on the assumption of random
sampling. This assumption is valid when analysing survey data that is based on a random sample of the
underlying population. However, it is clearly violated in the case of the cross-national country data set we are
using in our class. It is important to keep this in mind.

Let us look at the example of the relationship between two variables in our dataset. The variable
govspend has a value of 1 if a country's government spending is low (i.e. less than 20% of GDP) and
a value of 2 if government spending is high (i.e. greater than 20% of GDP). The variable
school_enrol has a value of 1 if school enrolment is low (i.e. less than 70% of the relevant age
groups) and a value of 2 if school enrolment is high (i.e. greater than 70%).[2] Using Stata's tab
command, we can get a crosstab of the two variables that allows us to evaluate whether government
spending is effective for increasing school enrolment.
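. tab govspend school_enrol

Government         |    School enrolment
spending           |  Low <70%  High >70% |     Total
-------------------+----------------------+----------
Low (<20% of GDP)  |        63         58 |       121
High (>20% of GDP) |        11         24 |        35
-------------------+----------------------+----------
Total              |        74         82 |       156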

Question: Does it matter which of the two variables we put into the rows and which into the columns?

We can also ask Stata to calculate row percentages using the row option of the tab-command.
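Government         |    School enrolment
spending           |  Low <70%  High >70% |     Total
-------------------+----------------------+----------
Low (<20% of GDP)  |        63         58 |       121
                   |     52.07      47.93 |    100.00
-------------------+----------------------+----------
High (>20% of GDP) |        11         24 |        35
                   |     31.43      68.57 |    100.00
-------------------+----------------------+----------
Total              |        74         82 |       156
                   |     47.44      52.56 |    100.00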

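Using the col option, we can get column percentages.

Government         |    School enrolment
spending           |  Low <70%  High >70% |     Total
-------------------+----------------------+----------
Low (<20% of GDP)  |        63         58 |       121
                   |     85.14      70.73 |     77.56
-------------------+----------------------+----------
High (>20% of GDP) |        11         24 |        35
                   |     14.86      29.27 |     22.44
-------------------+----------------------+----------
Total              |        74         82 |       156
                   |    100.00     100.00 |    100.00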
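
Row and column percentages can also be requested in a single table. The lines below are a minimal sketch rather than part of the original handout: the first command assumes the data set with govspend and school_enrol is loaded, while the second uses tabi (the immediate form of tabulate) with the four observed cell counts typed in by hand, so it runs without any data.

```stata
* With the data set loaded: frequencies, row and column percentages
tabulate govspend school_enrol, row column

* Without the data set: enter the observed counts row by row,
* separating the rows with a backslash
tabi 63 58 \ 11 24, row column
```

Each cell then shows the frequency first, followed by the row percentage and the column percentage.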

[2] Note that these variables are not included in the original data set, but I created them using the following
commands:
(1) gen govspend = .
(2) replace govspend = 1 if cengov2000 < 20
(3) replace govspend = 2 if cengov2000 >= 20 & cengov2000 != .
(4) gen school_enrol = .
(5) replace school_enrol = 1 if educ2001 < 70
(6) replace school_enrol = 2 if educ2001 >= 70 & educ2001 != .
Because of the >= conditions, a country at exactly 20% (or 70%) is coded as high.

Question: Do you think row or column percentages are more appropriate in our example?

Pearson's χ² test:
How can we use the information in the crosstab to find out whether there really is a relationship
between the two variables? One way to do so is to compare the observed frequencies in our sample
with the frequencies we would expect if the null hypothesis is true. For each cell we can calculate the
expected frequencies (E) in the following way:

$$E = \frac{\text{row total} \times \text{column total}}{N}$$

where N is the total number of observations in the table. For example, to get the expected
frequency for the first cell of our table we simply calculate (121 × 74) / 156 ≈ 57.4. If we do this
for all four cells of the crosstab, we get the following table of expected frequencies:
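Government         |    School enrolment
spending           |  Low <70%  High >70% |     Total
-------------------+----------------------+----------
Low (<20% of GDP)  |      57.4       63.6 |       121
High (>20% of GDP) |      16.6       18.4 |        35
-------------------+----------------------+----------
Total              |        74         82 |       156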
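
The expected frequencies can also be checked in Stata. This is a sketch that is not part of the original handout: it assumes the four observed cell counts are typed in by hand using tabi, so the data set does not need to be loaded.

```stata
* Expected frequency for the first cell: row total * column total / N
display 121 * 74 / 156

* Observed counts entered row by row; the expected option adds the
* frequency expected under the null hypothesis to each cell
tabi 63 58 \ 11 24, expected
```

The values should match the table of expected frequencies above up to rounding.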

Question: Comparing this with the observed frequencies, do you see any pattern?

By comparing the table of the observed frequencies with the table of the expected frequencies, we can
see that the observed frequencies differ from what we would have expected under the null hypothesis.
But how different are the observed frequencies from the expected frequencies?
To put it differently: are the differences between the observed frequencies and the expected
frequencies statistically significant?
We can test whether the observed frequencies are significantly different from our expectations based
on the null hypothesis (that there is no relationship between the variables) by using the chi-squared
(χ²) test statistic, calculated as follows:

$$\chi^2 = \sum \frac{(F_O - F_E)^2}{F_E}$$

where F_O is the observed frequency and F_E is the expected frequency of a cell, and the sum runs
over all cells of the table.
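
As a worked illustration (not part of the original handout), the test statistic for our example can be written out cell by cell, using the observed frequencies and the rounded expected frequencies from the two tables above:

```latex
\begin{aligned}
\chi^2 &= \sum \frac{(F_O - F_E)^2}{F_E} \\
       &= \frac{(63 - 57.4)^2}{57.4} + \frac{(58 - 63.6)^2}{63.6}
        + \frac{(11 - 16.6)^2}{16.6} + \frac{(24 - 18.4)^2}{18.4} \\
       &\approx 0.55 + 0.49 + 1.89 + 1.70 \\
       &\approx 4.63
\end{aligned}
```

Because the expected frequencies are rounded to one decimal place here, the result differs slightly from the value of 4.6371 that Stata reports from the unrounded frequencies.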


Question: What is the minimum value of χ²?


To find out the probability of observing the resulting value of χ² assuming that the null hypothesis
(that there is no relationship between X and Y) is true, we would need to look at the table of the χ²
distribution in the appendix of our statistics textbook.[3] Fortunately, Stata automatically does this for
us. By using the chi2 option, Stata will provide us with the value of the χ² statistic and the associated
probability of observing the value of the χ² statistic assuming that the null hypothesis is correct.[4]
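. tab govspend school_enrol, chi2

Government         |    School enrolment
spending           |  Low <70%  High >70% |     Total
-------------------+----------------------+----------
Low (<20% of GDP)  |        63         58 |       121
High (>20% of GDP) |        11         24 |        35
-------------------+----------------------+----------
Total              |        74         82 |       156

          Pearson chi2(1) =   4.6371   Pr = 0.031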
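
The link between the test statistic, the degrees of freedom and the p-value can be made explicit with Stata's built-in χ² functions. The following is a sketch that is not part of the original handout; it again assumes the observed counts are entered by hand with tabi, and it uses df = (2 − 1)(2 − 1) = 1 for our two-by-two table.

```stata
* Reproduce the test from the four observed cell counts
tabi 63 58 \ 11 24, chi2

* tabulate stores its results in r(): test statistic and p-value
display r(chi2)
display r(p)

* p-value by hand: upper-tail probability of a chi-squared
* distribution with 1 degree of freedom at the observed statistic
display chi2tail(1, 4.6371)

* Critical value for significance at the 5% level with 1 degree of freedom
display invchi2tail(1, 0.05)
```

The value returned by chi2tail() should agree with the Pr reported in the output above, and the relationship is significant at the 5% level whenever the test statistic exceeds the critical value returned by invchi2tail().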

Question: What would you conclude based on the p-value provided by Stata? Is the relationship
significant at the 5% level? Is it significant at the 1% level?

Crosstabs with more than two variables:


When using crosstabs, we can control for a third categorical variable (Z). This allows us to assess:
Whether the relationship between X and Y is spurious. This would be the case if there is no
longer a relationship between X and Y after adding the third variable (Z).
Whether X and Z interact. This would be the case if we observe different relationships between
X and Y for different values of Z.
To control for a third variable (Z) in Stata, we can use the if condition to create separate crosstabs for
all different values of Z. For example, we can test whether the relationship between government
spending and school enrolment still holds after controlling for a country's regime type. We can do this
by using the aclp_democ2000 variable, which has a value of 0 if a country is an autocracy and a value
of 1 if a country is a democracy.
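. tab govspend school_enrol if aclp_democ2000 == 0, chi2

Government         |    School enrolment
spending           |  Low <70%  High >70% |     Total
-------------------+----------------------+----------
Low (<20% of GDP)  |        29         19 |        48
High (>20% of GDP) |         7          4 |        11
-------------------+----------------------+----------
Total              |        36         23 |        59

          Pearson chi2(1) =   0.0390   Pr = 0.843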

[3] This is the same procedure as finding the probability associated with a given z-score in the case of a normal
distribution. Note that the appendix of the Kellstedt and Whitten (2009) textbook includes a χ² distribution table.
In order to find the critical χ² value, which would indicate statistical significance at the 5% level, we would also
need to calculate the so-called degrees of freedom (df) for our crosstab, which are defined as:
df = (c − 1)(r − 1), where c equals the number of columns and r the number of rows in the table.
[4] The χ² test was developed by Karl Pearson, which is why Stata calls it the Pearson χ² statistic.

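. tab govspend school_enrol if aclp_democ2000 == 1, chi2

Government         |    School enrolment
spending           |  Low <70%  High >70% |     Total
-------------------+----------------------+----------
Low (<20% of GDP)  |        34         39 |        73
High (>20% of GDP) |         4         20 |        24
-------------------+----------------------+----------
Total              |        38         59 |        97

          Pearson chi2(1) =   6.7805   Pr = 0.009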
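
Instead of typing one tab command per value of Z, the two crosstabs above can be requested in one step with the by prefix. This is an alternative sketch, not the approach used in the handout; it assumes the data set with govspend, school_enrol and aclp_democ2000 is loaded.

```stata
* One crosstab with a chi-squared test for each value of the control
* variable; bysort sorts the data by aclp_democ2000 first.
* The if condition keeps countries with a missing regime type from
* forming a separate group of their own.
bysort aclp_democ2000: tabulate govspend school_enrol if !missing(aclp_democ2000), chi2
```

With a control variable that takes many values, this avoids writing a separate if condition for each one.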

Question: What do you observe? Does the relationship between spending and school enrolment
hold, is it spurious or is there an interaction between spending and regime type?

Stata exercise:
1. Load up the data set Democracy small.dta and open a do-file to write the commands in.
2. Find two interval variables and create a crosstab using the tab-command. What happens and
why?
3. Find the variables internal2000 and aclp_democ2000. Find out what both variables are
measuring. Come up with a hypothesis and a null hypothesis about the relationship between the
two variables.
4. Create a crosstab of the two variables. Also ask for row and column percentages. Interpret the
percentages in the table. By looking at the frequencies and percentages, do you think there is a
relationship?
5. Ask Stata for the chi-squared value using the chi2 option. Is there a significant relationship at the
5% level?
6. Now use the fhcat2000 variable instead of the aclp_democ2000 variable and create a crosstab of
internal2000 and fhcat2000. Again calculate row and column percentages and request the value
of the chi-squared test statistic. Is there a significant relationship at the 5% level?
7. Finally, let's check whether the relationship between fhcat2000 and internal2000 is spurious and
driven by the level of development of a country. In order to do so let's create a new dummy
variable, which groups countries into low income countries (i.e. GDP per capita is less than
US$5,000) and high income countries (i.e. GDP per capita is US$5,000 or more):
generate incomecat = .
replace incomecat = 1 if gdppc2000 < 5000
replace incomecat = 2 if gdppc2000 >= 5000 & gdppc2000 != .
8. Create two crosstabs of fhcat2000 and internal2000, one for low income countries (i.e.
incomecat == 1) and one for high income countries (i.e. incomecat == 2). Also ask for
chi-squared test statistics. What do you observe? Does the relationship between fhcat2000 and
internal2000 hold, is it spurious, or is there an interaction?
9. Compare the total numbers of observations in the two crosstabs you just created with the number
of observations in the original crosstab between fhcat2000 and internal2000. What has happened
and why? Is this problematic?
