We used observed sample means to draw inferences about unobserved population means.
We performed a significance test by formulating an assumption about the unknown population
mean and calculating the probability of observing our sample mean, assuming the hypothesised
population mean is correct. If the probability was less than 0.05 (i.e. p < 0.05), we concluded that
the difference between our sample mean and the assumed population mean was statistically
significant at the 5% level and we rejected our assumption about the population mean.
We also saw last week that we can follow exactly the same approach if we want to test whether
an observed relationship between two variables, X and Y, in our sample is significantly different
from the assumed relationship in the population.
Remember that we always start by assuming that the null hypothesis (H0) holds, meaning there is no
relationship between X and Y in the population. This allows us to calculate the probability of
observing a relationship at least as strong as the one in our sample (purely because of sampling
variation) if the null hypothesis is correct. If this probability is less than 0.05 (i.e. p < 0.05),
we conclude that the observed relationship in our sample is statistically significant at the 5%
level and we can reject the null hypothesis.¹
Crosstabulations:
There are many different statistical techniques for analysing the relationships between variables. The
method we use depends on the level of measurement of both X and Y. Today we look at how to
analyse relationships between categorical (i.e. nominal or ordinal) variables.
When analysing relationships between categorical variables, we typically rely on
crosstabulations.
Crosstabs are a nice way of looking at the association between two or even three variables
(crosstabs with four or more variables get really nasty).
If a crosstab only includes dummy or ordinal variables, one easy way to evaluate the association
between them is to look at the diagonal strength of their relationship.
¹ Recall that all significance tests (including the test discussed below) rely on the assumption of random
sampling. This assumption is valid when analysing survey data that is based on a random sample of the
underlying population. However, it is clearly violated in the case of the cross-national country data set we are
using in our class. It is important to keep this in mind.
Let us look at the relationship between two variables in our dataset. The variable
govspend has a value of 1 if a country's government spending is low (i.e. less than 20% of GDP) and
a value of 2 if government spending is high (i.e. 20% of GDP or more). The variable
school_enrol has a value of 1 if school enrolment is low (i.e. less than 70% of the relevant age
groups) and a value of 2 if school enrolment is high (i.e. 70% or more).² Using Stata's tab-command,
we can get a crosstab of the two variables that allows us to evaluate whether government
spending is effective in increasing school enrolment.
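. tab govspend school_enrol

Government |    School enrolment
spending   |  Low <70%   High >70% |     Total
-----------+-----------------------+----------
Low <20%   |        63          58 |       121
High >20%  |        11          24 |        35
-----------+-----------------------+----------
Total      |        74          82 |       156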
Question: Does it matter which of the two variables we put into the rows and which into the columns?
We can also ask Stata to calculate row percentages using the row option of the tab-command.
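Government |    School enrolment
spending   |  Low <70%   High >70% |     Total
-----------+-----------------------+----------
Low <20%   |        63          58 |       121
           |     52.07       47.93 |    100.00
-----------+-----------------------+----------
High >20%  |        11          24 |        35
           |     31.43       68.57 |    100.00
-----------+-----------------------+----------
Total      |        74          82 |       156
           |     47.44       52.56 |    100.00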
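Alternatively, the col option of the tab-command gives column percentages:

Government |    School enrolment
spending   |  Low <70%   High >70% |     Total
-----------+-----------------------+----------
Low <20%   |        63          58 |       121
           |     85.14       70.73 |     77.56
-----------+-----------------------+----------
High >20%  |        11          24 |        35
           |     14.86       29.27 |     22.44
-----------+-----------------------+----------
Total      |        74          82 |       156
           |    100.00      100.00 |    100.00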
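As a minimal sketch, the two percentage tables above could be requested as follows. The row, column and nofreq options are standard options of Stata's tabulate command; the variable names are the ones created for this handout.

```stata
* Row percentages: each row of the table sums to 100
tabulate govspend school_enrol, row

* Column percentages: each column of the table sums to 100
tabulate govspend school_enrol, column

* Both at once, suppressing the raw frequencies to keep the table compact
tabulate govspend school_enrol, row column nofreq
```

Row percentages are calculated within each category of the row variable (government spending), while column percentages are calculated within each category of the column variable (school enrolment).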
² Note that these variables are not included in the original data set, but I created them using the following
commands:
(1) gen govspend = .
(2) replace govspend = 1 if cengov2000 < 20
(3) replace govspend = 2 if cengov2000 >= 20 & cengov2000 != .
(4) gen school_enrol = .
(5) replace school_enrol = 1 if educ2001 < 70
(6) replace school_enrol = 2 if educ2001 >= 70 & educ2001 != .
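The commands in the footnote leave govspend and school_enrol as bare 1/2 codes. One way to obtain readable table labels is to attach value labels, as sketched below. The label names spendlbl and enrollbl and the label texts are illustrative assumptions, not necessarily what was used for the tables in this handout.

```stata
* Define value labels (the label names are hypothetical)
label define spendlbl 1 "Low <20%" 2 "High >20%"
label define enrollbl 1 "Low <70%" 2 "High >70%"

* Attach the value labels to the recoded variables
label values govspend spendlbl
label values school_enrol enrollbl

* Variable labels are used as the headings of the crosstab
label variable govspend "Government spending"
label variable school_enrol "School enrolment"
```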
Question: Do you think row or column percentages are more appropriate in our example?
Pearson's χ² test:
How can we use the information in the crosstab to find out whether there really is a relationship
between the two variables? One way to do so is to compare the observed frequencies in our sample
with the frequencies we would expect if the null hypothesis is true. For each cell we can calculate the
expected frequency (E) in the following way:
$$E = \frac{\text{row total} \times \text{column total}}{\text{overall total}}$$
For example, to get the expected frequency for the first cell of our table we simply calculate
$(121 \times 74) / 156 \approx 57.4$. If we do this for all four cells of the crosstab, we get the following table of expected
frequencies:
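Government |    School enrolment
spending   |  Low <70%   High >70% |     Total
-----------+-----------------------+----------
Low <20%   |      57.4        63.6 |       121
High >20%  |      16.6        18.4 |        35
-----------+-----------------------+----------
Total      |        74          82 |       156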
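The expected frequencies do not have to be computed by hand. As a sketch, Stata's display command can reproduce the arithmetic for one cell, and the expected option of tabulate adds the expected frequency to every cell of the crosstab. The tabi command is the immediate form of tabulate, which takes the observed counts directly.

```stata
* Expected frequency for the first cell: row total * column total / overall total
display (121 * 74) / 156

* Observed and expected frequencies for all cells of the crosstab
tabulate govspend school_enrol, expected

* The same table typed in directly from the observed counts, without the data set
tabi 63 58 \ 11 24, expected
```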
Question: Comparing this with the observed frequencies, do you see any pattern?
By comparing the table of the observed frequencies with the table of the expected frequencies, we can
see that the observed frequencies differ from what we would have expected under the null hypothesis.
But how different are the observed frequencies from the expected frequencies?
To put it differently: are the differences between the observed frequencies and the expected
frequencies statistically significant?
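We can test whether the observed frequencies are significantly different from our expectations based
on the null hypothesis (that there is no relationship between the variables) by using the chi-squared
(χ²) test statistic, calculated as follows:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is the observed frequency and E the expected frequency of a cell, and the sum runs over all cells of the crosstab.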
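As a rough check, the statistic can be computed by hand: for each cell, take the squared difference between the observed and the expected frequency, divide it by the expected frequency, and add up the results. The sketch below does this for our crosstab with Stata's scalar and display commands; the scalar names e11 to e22 are illustrative.

```stata
* Expected frequencies from the row and column totals
scalar e11 = 121 * 74 / 156
scalar e12 = 121 * 82 / 156
scalar e21 = 35 * 74 / 156
scalar e22 = 35 * 82 / 156

* Sum of (observed - expected)^2 / expected over the four cells
display (63 - e11)^2 / e11 + (58 - e12)^2 / e12 + (11 - e21)^2 / e21 + (24 - e22)^2 / e22
```

Up to rounding, the result should agree with the Pearson chi2 value of 4.6371 that Stata reports for this table.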
To find out the probability of observing a χ² value at least as large as ours, assuming that the null hypothesis
(that there is no relationship between X and Y) is true, we would need to look at the table of the χ²
distribution in the appendix of our statistics textbook.³ Fortunately, Stata automatically does this for
us. By using the chi2 option, Stata will provide us with the value of the χ² statistic and the associated
probability of observing such a value assuming that the null hypothesis is correct.⁴
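. tab govspend school_enrol, chi2

Government |    School enrolment
spending   |  Low <70%   High >70% |     Total
-----------+-----------------------+----------
Low <20%   |        63          58 |       121
High >20%  |        11          24 |        35
-----------+-----------------------+----------
Total      |        74          82 |       156

Pearson chi2(1) =   4.6371   Pr = 0.031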
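The reported probability can also be reproduced without the printed table. As a sketch, Stata's chi2tail(df, x) function returns the upper-tail probability of the χ² distribution, and invchi2tail(df, p) returns the critical value for a given significance level. For a table with two rows and two columns, df = (2 − 1)(2 − 1) = 1.

```stata
* p-value for the observed statistic with 1 degree of freedom
display chi2tail(1, 4.6371)

* Critical values at the 5% and 1% levels with 1 degree of freedom
display invchi2tail(1, 0.05)
display invchi2tail(1, 0.01)
```

The first line should give approximately 0.031, matching the Pr value in the output above. Comparing 4.6371 with the two critical values is equivalent to comparing the p-value with 0.05 and 0.01.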
Question: What would you conclude based on the p-value provided by Stata? Is the relationship
significant at the 5% level? Is it significant at the 1% level?
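Government |    School enrolment
spending   |  Low <70%   High >70% |     Total
-----------+-----------------------+----------
Low <20%   |        29          19 |        48
High >20%  |         7           4 |        11
-----------+-----------------------+----------
Total      |        36          23 |        59

Pearson chi2(1) =   0.0390   Pr = 0.843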
³ This is the same procedure as finding the probability associated with a given z-score in the case of a normal
distribution. Note that the appendix of the Kellstedt and Whitten (2009) textbook includes a χ² distribution table.
In order to find the critical χ² value, which would indicate statistical significance at the 5% level, we would also
need to calculate the so-called degrees of freedom (df) for our crosstab, which are defined as:
df = (c − 1)(r − 1), where c equals the number of columns and r the number of rows in the table.
⁴ The χ² test was developed by Karl Pearson, which is why Stata calls it the Pearson χ² statistic.
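Government |    School enrolment
spending   |  Low <70%   High >70% |     Total
-----------+-----------------------+----------
Low <20%   |        34          39 |        73
High >20%  |         4          20 |        24
-----------+-----------------------+----------
Total      |        38          59 |        97

Pearson chi2(1) =   6.7805   Pr = 0.009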
Question: What do you observe? Does the relationship between spending and school enrolment
hold, is it spurious or is there an interaction between spending and regime type?
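Separate crosstabs for the categories of a third variable can be produced with the by prefix or with an if qualifier. In the sketch below, regime is a hypothetical name for the regime-type variable (assumed to be coded 0/1); substitute the variable actually used.

```stata
* One crosstab with a chi-squared test per category of the control variable
* (regime is a hypothetical variable name)
bysort regime: tabulate govspend school_enrol, chi2

* The same for a single subgroup, using an if qualifier
tabulate govspend school_enrol if regime == 1, chi2
```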
Stata exercise:
1. Load up the data set Democracy small.dta and open a do-file to write the commands in.
2. Find two interval variables and create a crosstab using the tab-command. What happens and
why?
3. Find the variables internal2000 and aclp_democ2000. Find out what both variables are
measuring. Come up with a hypothesis and a null hypothesis about the relationship between the
two variables.
4. Create a crosstab of the two variables. Also ask for row and column percentages. Interpret the
percentages in the table. By looking at the frequencies and percentages, do you think there is a
relationship?
5. Ask Stata for the chi-squared value using the chi2 option. Is there a significant relationship at the
5% level?
6. Now use the fhcat2000 variable instead of the aclp_democ2000 variable and create a crosstab of
internal2000 and fhcat2000. Again calculate row and column percentages and request the value
of the chi-squared test statistic. Is there a significant relationship at the 5% level?
7. Finally, let's check whether the relationship between fhcat2000 and internal2000 is spurious and
driven by the level of development of a country. In order to do so, let's create a new dummy
variable, which groups countries into low income countries (i.e. GDP per capita is less than US$
5000) and high income countries (i.e. GDP per capita is US$ 5000 or more):
generate incomecat = .
replace incomecat = 1 if gdppc2000 < 5000
replace incomecat = 2 if gdppc2000 >= 5000 & gdppc2000 != .
8. Create two crosstabs of fhcat2000 and internal2000, one for low income countries (i.e.
incomecat == 1) and one for high income countries (i.e. incomecat == 2). Also ask for chi-squared
test statistics. What do you observe? Does the relationship between fhcat2000 and
internal2000 hold, is it spurious, or is there an interaction?
9. Compare the total numbers of observations in the two crosstabs you just created with the number
of observations in the original crosstab between fhcat2000 and internal2000. What has happened
and why? Is this problematic?
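For the first steps of the exercise, a do-file might begin along the following lines. This is only a sketch: it assumes the data set sits in Stata's current working directory, and the use of describe and codebook to inspect the variables is a suggestion rather than a requirement.

```stata
* Load the data set (quotes are needed because the file name contains a space)
use "Democracy small.dta", clear

* Show the storage type and variable label of both variables
describe internal2000 aclp_democ2000

* Show which values each variable takes
codebook internal2000 aclp_democ2000
```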