In Lecture 3, we discussed the use of the normal distribution in descriptive statistics (bell-shaped traits)
If these are bullet holes on a target, where do you guess the bull's eye is?
Yes, but why?
Because we assume that error was randomly distributed around the bull's eye
o if you're trying to hit the target, it is unlikely that you
As we've seen, the normal distribution describes such random and symmetrical deviations around a central value
This common-sense observation captures the essence of the Central Limit Theorem, the analytical foundation of predictive statistics
Back to statistics: let's say we take a small sample from a population, and calculate the proportion of atheists, or the mean height in the sample
o This sample may not represent the true
But let's say I take many samples, and calculate mean height in each one; what would their distribution look like? Answer: the same as the bullet holes in the target
o errors are random
o best guess for position of true mean height is the centre of distribution of samples
o i.e. the best guess is the mean of the samples (the mean of sample means)
[Figure: distributions of sample means for N = 10, 30, 100 and 200 samples of 5]
o mean of means approaches true mean of the population of 100 balls!
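The bag-of-balls experiment can be imitated in R. This is only a sketch: the ball values, the seed and the helper name sample_means are assumptions made up for illustration, not the data behind the lecture's figures.

```r
# Hypothetical population: 100 balls with made-up values (an assumption)
set.seed(1)
balls <- round(rnorm(100, mean = 50, sd = 10))
true_mean <- mean(balls)

# Draw n_samples samples of 5 balls each and return the mean of each sample
sample_means <- function(n_samples, size = 5) {
  replicate(n_samples, mean(sample(balls, size)))
}

# Compare the mean of the sample means with the true mean
for (n in c(10, 30, 100, 200)) {
  m <- sample_means(n)
  cat("N =", n, " mean of means =", round(mean(m), 2),
      " true mean =", round(true_mean, 2), "\n")
}
```

Running hist(sample_means(200)) should show a roughly bell-shaped distribution centred near true_mean.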
So when we are trying to identify the true mean of a variable in a population, the best guess comes from the sample means
If we have many samples, their mean is the best guess
o But this is very rarely the case!
In most cases, we have only one sample; the sample mean is your best (and only!) estimator of true mean
Sample has mean and standard deviation; so what is the probability that it identifies a certain value (the true mean we want to find)? We can say that the true mean is the sample mean plus or minus x (the margin of error)
Example: take lifespan variable
> mean(lifespan, na.rm=T)
[1] 69.71495
> sd(lifespan, na.rm=T)
[1] 9.644646
So what is the true mean?
Conventionally, we take the 95% interval around the sample mean as the confidence interval: we can be 95% confident that the true mean is inside it
To calculate the 95% confidence interval around the sample mean, type
> t.test(lifespan)$conf.int
[1] 68.34922 71.08068
attr(,"conf.level")
[1] 0.95
Conclusion: if sample mean is 69.71 and standard deviation is 9.64, we can be 95% confident that the true mean is between 68.35 and 71.08
If my estimator has the mean and standard deviation shown, is the true mean (the bull's eye) inside my 95% confidence interval? In this example, yes!
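Sem (standard error of the mean) is like the standard deviation, but divided by the square root of the sample size (the standard deviation itself uses sample size minus one to avoid bias)
$\mathrm{sem} = \frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation and $n$ is the sample size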
when sampled population (balls in a bag, subjects in my study) shows more variation, samples are more variable and error (deviation between sample mean and true mean) is larger
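A small sketch of the sem calculation in R. The standard deviation is the one reported for lifespan; n = 194 is an assumption, inferred from the 193 degrees of freedom that R reports for this variable in the one-sample tests. The helper name sem_from is made up for illustration.

```r
# Standard error of the mean from a standard deviation and a sample size
sem_from <- function(s, n) s / sqrt(n)

s <- 9.644646   # sd(lifespan) as reported in the lecture
n <- 194        # assumed number of non-missing values (df + 1)

sem <- sem_from(s, n)
round(sem, 4)

# More variation in the population means a larger error:
# doubling the standard deviation doubles the sem
sem_from(2 * s, n) / sem
```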
Last thing: to calculate confidence intervals (margin of error), we have to use the Student's t-distribution
o it is similar to normal, but used when standard deviation in population is unknown (remember: we only know standard deviation of sample)
o works better than normal with small sample sizes
o approaches normal when n is large
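The points above can be checked in R with qt() and qnorm(). The sketch below also rebuilds the lifespan confidence interval by hand as mean plus or minus (critical t value times sem); n = 194 is an assumption inferred from the 193 degrees of freedom R reports for this variable.

```r
# Critical values for a 95% interval: t approaches the normal value as df grows
qnorm(0.975)
for (df in c(4, 9, 29, 193, 1000)) {
  cat("df =", df, " t critical =", round(qt(0.975, df), 3), "\n")
}

# Confidence interval by hand from the reported summary statistics
xbar <- 69.71495   # mean(lifespan)
s    <- 9.644646   # sd(lifespan)
n    <- 194        # assumed sample size (df + 1)

sem    <- s / sqrt(n)
margin <- qt(0.975, df = n - 1) * sem   # margin of error
ci     <- c(lower = xbar - margin, upper = xbar + margin)
round(ci, 2)
```

The result should agree with the interval returned by t.test(lifespan)$conf.int up to rounding of the inputs.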
sample t-test)
o are European countries richer than sub-Saharan countries? (two-sample t-test)
o does a new drug increase survival of patients? (paired t-test)
t-tests provide such group comparisons; they are important to validate statements about social indicators, income, fairness, justice, historical processes etc.
o does European colonisation affect country income
size
t-tests simply calculate whether the difference between two means/values is real = statistically significant = different from zero
In order to use probability distributions, we must standardise variables; so the difference is standardised
$t = \frac{\text{difference}}{\text{standard error of the difference}}$
So what we want to know is whether t (difference) is too different from zero (i.e. not similar)
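[Figure: t-distribution with 2.5% of the area below t = -1.96, 95% between t = -1.96 and t = 1.96, and 2.5% above t = 1.96]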
What is too different? Conventionally, we calculate 95% confidence intervals; if the test value is inside the interval, our estimate is not significantly different from the test value (we'll see how it works)
Basic rule: what we need to know is the P-value (probability value) of a t-test
In a t-test, the null hypothesis (=status quo, conservative hypothesis) is always that there is no difference between the two compared values
o i.e. if you want to prove that two groups differ, you
The P-value of a test is the probability of obtaining a difference at least as large as the one observed if the null hypothesis is true (i.e. if groups are not different)
o conventionally, we only reject the null hypothesis if the P value is less than 5% (P < 0.05)
o (as we'll see, that's because we use 95% confidence intervals)
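The link between the 5% rule and the 2.5% tails can be sketched in R. The helper p_two_sided is a hypothetical name, and df = 193 is only an illustrative choice of degrees of freedom.

```r
alpha <- 0.05

# Area in both tails beyond -1.96 and 1.96 under the normal curve (about 5%)
pnorm(-1.96) + (1 - pnorm(1.96))

# Hypothetical helper: two-sided P value for a t statistic
p_two_sided <- function(t, df) 2 * pt(-abs(t), df)

df     <- 193                       # illustrative degrees of freedom
t_crit <- qt(1 - alpha / 2, df)     # boundary of the central 95%

p_two_sided(t_crit, df)             # P at the boundary equals alpha
p_two_sided(0.5, df) < alpha        # t close to zero: do not reject
p_two_sided(3, df) < alpha          # t far from zero: reject
```

Rejecting when P < 0.05 is therefore the same decision as rejecting when t falls outside the central 95% of the t-distribution.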
> t.test(lifespan, mu=70)
One Sample t-test
data: lifespan
t = -0.4117, df = 193, p-value = 0.681
alternative hypothesis: true mean is not equal to 70
95 percent confidence interval:
68.34922 71.08068
sample estimates:
mean of x
69.71495
Sample mean = 69.71 (that doesn't seem to be very different from 70)
t = -0.41: the t statistic, the standardised difference between sample mean and test value, is close to zero
Confidence interval of lifespan: my sample suggests that life expectancy in the world is between 68.35 and 71.08 years; and this interval includes 70 years
P value = 0.681 = 68%: P is high
If the null hypothesis were true (=life expectancy is not different from 70 years), a difference at least this large would be seen in about 68% of samples
Therefore, we cannot reject the null hypothesis
Conclusion: based on our sample, life expectancy in the world is not significantly different from 70 years
> t.test(lifespan, mu=75)
One Sample t-test
data: lifespan
t = -7.6324, df = 193, p-value = 1.033e-12
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
68.34922 71.08068
sample estimates:
mean of x
69.71495
P = 1.033*10^(-12) = 0.000000000001033 = 0.0000000001033%; this is very low! We must reject the null hypothesis and accept the alternative hypothesis
t = -7.63; that's significantly different from 0
75 years is outside the 95% CI
Therefore, life expectancy is below 75 years
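Both one-sample tests can be reproduced by hand from the summary statistics, which shows what t.test() is doing. This is a sketch: n = 194 is an assumption inferred from df = 193, and the function name one_sample_t is made up for illustration.

```r
xbar <- 69.71495   # mean(lifespan)
s    <- 9.644646   # sd(lifespan)
n    <- 194        # assumed sample size (df + 1)

# One-sample t-test from summary statistics
one_sample_t <- function(mu0) {
  sem <- s / sqrt(n)                       # standard error of the mean
  t   <- (xbar - mu0) / sem                # standardised difference
  p   <- 2 * pt(-abs(t), df = n - 1)       # two-sided P value
  c(t = t, df = n - 1, p = p)
}

one_sample_t(70)   # t close to zero, high P
one_sample_t(75)   # t far from zero, very low P
```

The t and P values should match the t.test() outputs for mu=70 and mu=75 up to rounding.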
You may also want to test whether two samples are significantly different in some respect
o for example, are South and Southeast Asian countries richer than Latin American countries?
o i.e. do differences or similarities in economic models in recent decades cause differences in average income between the two areas?
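Procedure is similar, but the t-statistic is now the standardised difference between the means of the two compared groups
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$, where $\bar{x}_i$, $s_i$ and $n_i$ are the mean, standard deviation and size of group $i$ (the Welch form that R uses by default)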
In file HDR2011, variable continent is seasia for South and Southeast Asian countries and latin for Latin American countries; others are NA (not available)
> t.test(GNI ~ continent)
Welch Two Sample t-test
data: GNI by continent
t = -1.1455, df = 20.327, p-value = 0.2653
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13340.319 3876.397
sample estimates:
mean in group latin mean in group seasia
9054.355 13786.316
Conclusion: We may think that the difference of ~US$4,700 between the areas was large enough to prove a significant difference
But it isn't: there is too much variation in income in the two areas
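To see where the two-sample t comes from, the Welch statistic can be computed by hand and compared with t.test(). The income values below are made-up numbers for illustration, not the HDR2011 data, so the results will differ from the output above.

```r
# Hypothetical incomes per capita (made-up values, not from HDR2011)
latin  <- c(4200, 6100, 7300, 8800, 9500, 10400, 12900, 13200)
seasia <- c(1500, 2300, 3400, 4600, 7900, 13700, 28000, 49000)

# Difference between means, standardised by its standard error
diff_means <- mean(latin) - mean(seasia)
se_diff    <- sqrt(var(latin) / length(latin) + var(seasia) / length(seasia))
t_by_hand  <- diff_means / se_diff

# R's default two-sample test is the Welch test (unequal variances)
res <- t.test(latin, seasia)
c(by_hand = t_by_hand, t_test = unname(res$statistic))
```

When one group is very variable, se_diff is large and t stays close to zero even if the difference between means looks big.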
A paired test should be used when the two compared measurements are linked, i.e. the subjects/cases are not independent
For example, the two group means may be two measurements from the same individual
o In the case of a trial of a new drug for blood pressure, blood pressure before and
The file intake has data on pre- and post-menstrual calorie consumption in 11 women;
o Question: is there a difference in caloric intake before and after menstrual cycle?
> t.test(pre, post, paired=T)
Paired t-test
data: pre and post
t = 11.9414, df = 10, p-value = 3.059e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1074.072 1566.838
sample estimates:
mean of the differences
1320.455
P value: very low! We must reject the null hypothesis (no difference)
Confidence interval: we can be 95% confident that the difference in calorie intake is between 1074 and 1567 kcal
Conclusion: there is a clear difference between calorie intake pre and post
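A paired t-test is the same thing as a one-sample t-test on the differences within each pair, tested against zero. The sketch below illustrates this with made-up values (not the contents of the file intake).

```r
# Hypothetical paired measurements on the same 7 subjects (made-up values)
pre  <- c(5200, 5500, 6100, 6400, 7000, 7500, 8200)
post <- c(4100, 4300, 4900, 5600, 5300, 6800, 7000)

paired_res <- t.test(pre, post, paired = TRUE)   # paired t-test
diff_res   <- t.test(pre - post, mu = 0)         # one-sample test on differences

c(paired = unname(paired_res$statistic), differences = unname(diff_res$statistic))
c(paired = paired_res$p.value, differences = diff_res$p.value)
```

This is why df = 10 in the output above: with 11 women there are 11 differences, and degrees of freedom are the number of differences minus one.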
1) One-sample t-test
Is income per capita (GNI) in the world significantly less than US$20000?
2) Two-sample t-test
Let us compare schooling years in Southeast Asia and Latin America
o What is the average schooling of children in the two regions?
o Does schooling significantly differ between the two areas? What is
3) Paired t-test
Give two examples of studies that could require paired t-tests
Confidence intervals and all t-tests assume a normal distribution, even when the sample is small
o And they are based on a theory of means of various samples, which in practice we don't have
o That's why you do not prove differences; you compare groups and give an estimate of the probability that they are different or similar
Remember: the null hypothesis is always that means are not different
Current trend is to provide confidence intervals rather than P values when reporting results of tests in general (not just t-tests), so get used to calculating and interpreting them