EXERCISE 1
A plumber has noticed a certain association between the typology of sink chosen by his clients and
the area of residency. In order to confirm or reject his intuition, he selects a sample of 500 clients and he
classifies them according to the contingency table reported below:
The test statistic is
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}$$
is the expected frequency of cell (i, j) under the null hypothesis. This statistic has approximately a chi-square distribution with (r − 1)(c − 1) degrees of freedom. Setting α = 0.01 and noting that in this case r = 3 and c = 3, we have $\chi^2_{4,\,0.01} = 13.28$.
Therefore, the null hypothesis is rejected only if the observed value of the test statistic is larger than 13.28.
The “expected frequencies” are the absolute frequencies that would be expected if the characters A and B were independent:
$$E_{ij} = \frac{R_i C_j}{n} = \frac{(\text{total of the } i\text{-th row})(\text{total of the } j\text{-th column})}{\text{sample size}}$$
The following table shows the expected frequencies under the null hypothesis:
10.4294 < 13.28, so we do not reject H0: at the 1% significance level there is not sufficient evidence of an association between “sink” and “Area” (p-value 0.0338).
EXERCISE 2
A plumber has noticed a certain association between the typology of sink chosen by his clients and
the area of residency. In order to confirm or reject his intuition, he selects a sample of 1000 clients and he
classifies them according to the contingency table reported below:
The test statistic is
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}$$
is the expected frequency of cell (i, j) under the null hypothesis. This statistic has approximately a chi-square distribution with (r − 1)(c − 1) degrees of freedom. Setting α = 0.025 and noting that in this case r = 3 and c = 3, we have $\chi^2_{4,\,0.025} = 11.14$.
Therefore, the null hypothesis is rejected only if the observed value of the test statistic is larger than 11.14.
The “Expected frequencies” are the absolute expected frequencies in the case in which the characters A and
B were independent.
12.4279 > 11.14, so we reject H0: at the 2.5% significance level we conclude that there is an association between “sink” and “Area” (p-value 0.0144).
EXERCISE 3
On a sample of 100 workers, we want to find a possible link between the average contract duration (years)
and the industry type (A=Automotive, NA=Not Automotive). This contingency table shows the results
Contract duration (years)    2     5     10    Total
Industry type
A                            10    15    20    45
NA                           15    10    30    55
Total                        25    25    50    100
c) Make your final decision about hypothesis with a level of significance of 5%.
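With (2 − 1)(3 − 1) = 2 degrees of freedom, the critical value is $\chi^2_{2,\,0.05} = 5.99$: we reject H0 if the observed value of the test statistic is larger than 5.99.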
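Expected frequencies under H0:
Contract duration (years)    2         5         10
Industry type
A                            11.25     11.25     22.5
NA                           13.75     13.75     27.5
Contributions $(O_{ij}-E_{ij})^2/E_{ij}$ to the test statistic:
Contract duration (years)    2         5         10
Industry type
A                            0.1389    1.2500    0.2778
NA                           0.1136    1.0227    0.2273
The test statistic is the sum of these contributions, 3.0303. Since 3.0303 < 5.99, we do not reject H0.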
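The computation of Exercise 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `expected_frequencies` and `chi_square_statistic` are hypothetical helpers introduced here, not part of the original solution, and only the standard library is used.

```python
def expected_frequencies(observed):
    """Expected counts under independence: E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells of the table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Observed table of Exercise 3 (rows: A, NA; columns: 2, 5, 10 years).
observed = [[10, 15, 20], [15, 10, 30]]
expected = expected_frequencies(observed)
statistic = chi_square_statistic(observed)

print(expected)
print(round(statistic, 4))
print("reject H0" if statistic > 5.99 else "do not reject H0")
```

The same two helpers apply to any r × c table; only the critical value (here 5.99, for 2 degrees of freedom and α = 0.05) has to be read from the tables.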
EXERCISE 4
On a sample of 100 workers, we want to find a possible link between the age and the means of transport
used to reach the work place. This contingency table shows the results
c) Calculate (approximately) the p-value and make your final decision about hypothesis with a level of
significance of 5%.
To derive the p-value the observed value for the test statistics is required.
Age \ Transport    Private car    Public transport    Others
18 < age < 30      19.25          27.5                8.25
≥ 30               15.75          22.5                6.75
EXERCISE 5
In order to evaluate a possible association between the payment of the vehicle tax and the geographical
area, a survey of 400 users was conducted. The results are shown in the following table:
No 40 70
H1: there exists dependence between payment of the vehicle tax and the geographical area
b) Compute the p-value of the test. For a significance level equal to α=0.05, what is the decision
concerning the hypothesis of the test?
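On the basis of the marginal frequencies $O_{i.}$ and $O_{.j}$ and of the sample size n:
Payment vehicle tax    Center-North    South-Islands    Oi.
No                     40              70               110
Yes                    160             130              290
O.j                    200             200              400
we compute the expected frequencies $E_{ij} = \frac{O_{i.}\,O_{.j}}{n}$:
Payment vehicle tax    Center-North    South-Islands
No                     55              55
Yes                    145             145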
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(40-55)^2}{55}+\frac{(70-55)^2}{55}+\frac{(160-145)^2}{145}+\frac{(130-145)^2}{145} = 11.2853$$
Therefore, $p\text{-value} = \Pr(\chi^2_1 > 11.2853)$. From the tables, p-value < 0.005 (the exact value, not available in the table, is 0.0008). Therefore, for every significance level greater than 1%, and in particular for α = 0.05, p-value < α. We reject the null hypothesis: there is dependence between payment of the vehicle tax and the geographical area.
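The exact p-value quoted above can be checked without tables. The sketch below relies on a standard identity that is not stated in the original text: a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability equals erfc(sqrt(x/2)). The function name `chi2_sf_df1` is a hypothetical helper.

```python
import math


def chi2_sf_df1(x):
    """P(chi-square with 1 d.f. > x), valid only for 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# (observed, expected) pairs of Exercise 5, in the order used in the solution.
cells = [(40, 55), (70, 55), (160, 145), (130, 145)]
statistic = sum((o - e) ** 2 / e for o, e in cells)
p_value = chi2_sf_df1(statistic)

print(round(statistic, 4), round(p_value, 4))
print("reject H0" if p_value < 0.05 else "do not reject H0")
```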
EXERCISE 6
In order to evaluate the existence of a possible association between the payment of the Rai tax (i.e.
“canone Rai”) and the geographical area, a survey of 200 users was conducted. The results are shown in the
following table:
Payment Rai tax    Center-North    South-Islands
Yes                80              65
No                 20              35
H0: independence between payment of the Rai tax and the geographical area;
H1: there exists dependence between payment of the Rai tax and the geographical area
b) Compute the p-value of the test. For a significance level equal to α=0.01, what is the decision concerning
the hypothesis of the test?
On the basis of the marginal frequencies $O_{i.}$ and $O_{.j}$ and of the sample size n:
Payment Rai tax    Center-North    South-Islands    Oi.
Yes                80              65               145
No                 20              35               55
O.j                100             100              200
we compute the expected frequencies $E_{ij} = \frac{O_{i.}\,O_{.j}}{n}$:
Payment Rai tax    Center-North    South-Islands
Yes                72.5            72.5
No                 27.5            27.5
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(80-72.5)^2}{72.5}+\frac{(65-72.5)^2}{72.5}+\frac{(20-27.5)^2}{27.5}+\frac{(35-27.5)^2}{27.5}$$
$$= 0.7759 + 0.7759 + 2.0455 + 2.0455 = 5.6428$$
Therefore, $p\text{-value} = \Pr(\chi^2_1 > 5.6428)$. From the tables, 0.01 < p-value < 0.025 (the exact value, not available in the table, is 0.0175). Since p-value > α = 0.01, we do not reject the null hypothesis.
EXERCISE 7
Gianni is gambling with Andrea on the results of a dice roll. Gianni thinks that the dice used by Andrea is unfair. Gianni carries out 60 rolls of the dice and observes the following results.
Result 1 2 3 4 5 6
Observed frequency 15 5 5 15 10 10
In order to verify whether the dice used by Andrea is really unfair, Gianni decides to run an appropriate
hypothesis test.
a) Specify the hypothesis to be verified.
If the dice is fair, the probability of each outcome is 1/6 and so, over 60 rolls, the expected frequency for each outcome is $\frac{1}{6}\cdot 60 = 10$. Let us use a chi-square test to evaluate whether the deviation between the observed frequencies (Oi) and the expected frequencies (Ei) is significantly different from 0. Hence:
H0: p1 = p2 = … = p6 = 1/6 (the dice is fair)
H1: at least one pi differs from 1/6 (the dice is unfair)
b) Compute the p-value of the test. Fix a significance level α = 0.05 and decide about the hypothesis previously specified.
$$\chi^2 = \sum_{i=1}^{6}\frac{(O_i-E_i)^2}{E_i} = \frac{(15-10)^2+(5-10)^2+(5-10)^2+(15-10)^2+(10-10)^2+(10-10)^2}{10} = 10$$
The p-value of the test is: $p\text{-value} = \Pr(\chi^2_5 > 10)$.
From the tables, 0.05 < p-value < 0.1; therefore we do not reject the null hypothesis: there is not sufficient empirical evidence to consider the dice unfair (the exact p-value, not given in the tables, is 0.0752).
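The exact p-value can be reproduced with a small sketch. It assumes the standard closed-form expressions for the chi-square upper tail with an integer number of degrees of freedom (a finite sum for even d.f., erfc plus a finite sum for odd d.f.); these formulas and the helper name `chi2_sf` are not part of the original solution.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square_df > x) for integer df >= 1, x >= 0."""
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2))
    #         + sqrt(2/pi) * exp(-x/2) * sum_k x^(k - 1/2) / (1*3*...*(2k-1))
    total, term = 0.0, math.sqrt(x)
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2 / math.pi) * math.exp(-half) * total)


# Goodness-of-fit test for the dice of Exercise 7.
observed = [15, 5, 5, 15, 10, 10]
expected = [sum(observed) / 6] * 6          # 10 per face if the dice is fair
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi2_sf(statistic, df=len(observed) - 1)

print(statistic, round(p_value, 4))
```

With a fixed α, the decision is simply `p_value < alpha`; here the p-value lies between 0.05 and 0.1, in agreement with the table lookup.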
EXERCISE 8
Luigi is gambling with Filippo on the results of a dice roll. Luigi thinks that the dice used by Filippo is unfair. Luigi carries out 120 rolls of the dice and observes the following results.
Result 1 2 3 4 5 6
Observed frequency 15 25 23 22 17 18
In order to verify whether the dice used by Filippo is really unfair, we decide to run an appropriate
hypothesis test.
a) Specify the hypothesis to be verified.
If the dice is fair, the probability of each outcome is 1/6 and so, over 120 rolls, the expected frequency for each outcome is $\frac{1}{6}\cdot 120 = 20$. Let us use a chi-square test to evaluate whether the deviation between the observed frequencies (Oi) and the expected frequencies (Ei) is significantly different from 0. Hence:
H0: p1 = p2 = … = p6 = 1/6 (the dice is fair)
H1: at least one pi differs from 1/6 (the dice is unfair)
b) Compute the p-value of the test. Fix a significance level α = 0.05 and decide about the hypothesis previously specified.
$$\chi^2 = \sum_{i=1}^{6}\frac{(O_i-E_i)^2}{E_i} = \frac{(15-20)^2+(25-20)^2+(23-20)^2+(22-20)^2+(17-20)^2+(18-20)^2}{20} = 3.8$$
The p-value of the test is: $p\text{-value} = \Pr(\chi^2_5 > 3.8)$.
From the tables, p-value > 0.05; therefore we do not reject the null hypothesis: there is not sufficient empirical evidence to consider the dice unfair (the exact p-value, not given in the tables, is 0.5786).
EXERCISE 9
A pastry chef wants to know whether his customers predominantly like some kinds of pastry. A sample of 80 customers were asked which kind of pastry they prefer, and the classification in the following table was obtained:
With fruits    With cream    With chocolate    Dry
22             26            15                17
a) At a 10% significance level, is it possible for the pastry chef to conclude, on the basis of sample
evidence, that all the 4 typologies of pastry are equally preferred?
                                                                            With fruits    With cream    With chocolate    Dry          Total
Probability distribution of the population                                  p1             p2            p3                p4           1
Oi = observations drawn from the population                                 22             26            15                17           80
Assumed probability distribution of the population                          1/4            1/4           1/4               1/4          1
Ei = expected number of observations if the assumed distribution is true    80*1/4=20      80*1/4=20     80*1/4=20         80*1/4=20    80
H0: the probability distribution of the population is uniform, i.e. $p_1 = p_2 = p_3 = p_4 = \frac{1}{4}$
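$$U = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} = \frac{(22-20)^2}{20}+\frac{(26-20)^2}{20}+\frac{(15-20)^2}{20}+\frac{(17-20)^2}{20} = 3.7$$
We reject H0 if: $\sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} > \chi^2_{3,\,0.1}$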
3.7 < 6.25, so at a 10% significance level we do not reject the null hypothesis H0: there is not sufficient empirical evidence to state that the customers prefer a particular kind of pastry. The sample evidence is therefore compatible with all 4 typologies of pastry being equally preferred.
b) Describe the logic of the test statistic used to verify the hypothesis in the previous point.
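The statistic measures the overall discrepancy between the observed frequencies and those expected under H0: large values indicate that the data are unlikely if H0 is true. A test of H0 with significance level α, against the alternative hypothesis that the assumed probabilities are not correct, is based on the following decision rule:
reject H0 if: $\sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} > \chi^2_{K-1,\,\alpha}$
where $\chi^2_{K-1,\,\alpha}$ is the value for which $P\left(\chi^2_{K-1} > \chi^2_{K-1,\,\alpha}\right) = \alpha$,
and the random variable $\chi^2_{K-1}$ follows a chi-square distribution with (K − 1) degrees of freedom.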
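The definition of the critical value as the point leaving probability α in the right tail can be illustrated numerically for Exercise 9 (K = 4, so 3 degrees of freedom). The sketch below is an assumption-laden illustration: it uses the closed-form tail probability for 3 d.f. and a plain bisection search, and the helper names `chi2_sf_df3` and `critical_value_df3` are hypothetical.

```python
import math


def chi2_sf_df3(x):
    """P(chi-square with 3 d.f. > x); closed form valid only for 3 d.f."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


def critical_value_df3(alpha, lo=0.0, hi=100.0, tol=1e-8):
    """Solve P(chi-square_3 > c) = alpha for c by bisection.

    The tail probability decreases in c, so the root is bracketed by lo and hi.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chi2_sf_df3(mid) > alpha:
            lo = mid    # tail probability still above alpha: move right
        else:
            hi = mid
    return (lo + hi) / 2


observed = [22, 26, 15, 17]
expected = [20, 20, 20, 20]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical = critical_value_df3(0.10)
reject = statistic > critical

print(round(statistic, 2), round(critical, 2), reject)
```

The search recovers the tabulated value of about 6.25, and the decision rule compares the observed statistic with it.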
EXERCISE 10
A professor suggests to his students to choose one of the following books for their studies: A, B, C or D. He
believes that the students have no particular preferences and so the books will be equally (Uniformly)
chosen. To test this assumption, the professor collects the chosen book for a random sample of 100
students, with these results: 20 preferred book A, 40 B, 30 C and 10 D.
a) Which is the statistic we must use to help the professor in verifying his hypothesis?
We must use the “Chi-square test”, that is we must calculate the statistic:
$$\chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i}$$
where Oi are the observed frequencies and Ei those expected, in this case the frequencies of the Uniform
distribution.
Under H0, this statistic approximately follows a chi-square distribution with K − 1 degrees of freedom (K = number of classes).
A B C D
Oi 20 40 30 10
Ei 25 25 25 25
from which: $\chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} = \frac{(-5)^2}{25}+\frac{15^2}{25}+\frac{5^2}{25}+\frac{(-15)^2}{25} = 20$
In a chi-square distribution with 3 (= 4 − 1) degrees of freedom the 99th percentile is, according to our table, 11.34.
Since 20 > 11.34, we reject the professor's null hypothesis H0 at the 1% significance level: the distribution of preferences among books is not uniform.
EXERCISE 11
A professor suggests to his students to choose one of the following books for their studies: A, B, C or D. He
believes that the students have no particular preferences and so the books will be equally (Uniformly)
chosen. To test this assumption, the professor collects the chosen book for a random sample of 200
students, with these results: 70 preferred book A, 40 B, 60 C and 30 D.
We must use the “Chi-square test”, that is we must calculate the statistic:
$$\chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i}$$
where Oi are the observed frequencies and Ei those expected, in this case the frequencies of the Uniform
distribution.
Under H0, this statistic approximately follows a chi-square distribution with K − 1 degrees of freedom (K = number of classes).
A B C D
Oi 70 40 60 30
Ei 50 50 50 50
from which: $\chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} = \frac{20^2}{50}+\frac{(-10)^2}{50}+\frac{10^2}{50}+\frac{(-20)^2}{50} = 20$
In a chi-square distribution with 3 (= 4 − 1) degrees of freedom the 95th percentile is, according to our table, 7.81. Since 20 > 7.81, we reject the professor's null hypothesis H0 at the 5% significance level: the distribution of preferences among books is not uniform.
EXERCISE 12
You want to verify if there is an association between the area of residence of families and the presence of
underage children. To this aim, a random sample of 100 families is analyzed and the collected information
is organized in the following contingency table:
Area of residence
Urban Rural
b) Determine, at the significance level α = 0.05, if there is association between the two variables.
The test statistic is
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}.$$
Under the null hypothesis, the test statistic has an approximate chi-square distribution with (r − 1)(c − 1) degrees of freedom.
Setting α = 0.05, and since r = 2 and c = 2, we have $\chi^2_{1;\,0.05} = 3.84$. In the following table we calculate the expected frequencies $E_{ij} = \frac{R_i \cdot C_j}{n}$:
Area of residence
Urban Rural
While in the following table we calculate the quantities $\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$:
Area of residence
Urban Rural
Since 3.1232 < 3.84 we do not reject H0 and we conclude that there is no evidence that the Presence of
underage children and the Area of residence are associated.
EXERCISE 13
You want to verify if there is an association between the area of residence of families and the presence of
underage children. To this aim, a random sample of 500 families is analyzed and the collected information
is organized in the following contingency table:
Area of residence
Urban Rural
b) Determine, at the significance level α = 0.01, if there is association between the two variables.
The test statistic is
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}.$$
Under the null hypothesis, the test statistic has an approximate chi-square distribution with (r − 1)(c − 1) degrees of freedom.
Setting α = 0.01, and since r = 2 and c = 2, we have $\chi^2_{1;\,0.01} = 6.63$. In the following table we calculate the expected frequencies $E_{ij} = \frac{R_i \cdot C_j}{n}$:
Area of residence
Urban Rural
While in the following table we calculate the quantities $\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$:
Area of residence
Urban Rural
The sum of the values in the last table, i.e. the value of the test statistic, is 4.2618.
Since 4.2618 < 6.63 we do not reject H0 and we conclude that there is no evidence that the Presence of
underage children and the Area of residence are associated.
EXERCISE 14
A sample of cyclists has been classified according to their gender and to their opinion towards the bike
routes of a city:
OPINION
Low 16 39
Sufficient 24 17
Good 20 24
You would like to carry on a test in order to verify whether there is an association between the two
variables.
a) Write down the hypotheses to test.
b) Does the data constitute enough empirical evidence in order to affirm that the two variables are
significantly related? Answer by adopting an α = 0.05.
We need to run a chi-square test of independence. The test statistic to be used is:
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$
Under the null hypothesis, the random variable associated with the test follows a chi-square distribution with (r − 1)(c − 1) = (3 − 1)(2 − 1) = 2 degrees of freedom.
The null hypothesis will be rejected if the observed value of the test statistic is above the critical value $\chi^2_{2,\,0.05} = 5.99$.
OPINION
Low 16 39 55
Sufficient 24 17 41
Good 20 24 44
Cj 60 80 140
Below, the expected frequencies under the null hypothesis, $E_{ij} = \frac{R_i C_j}{n}$:
OPINION
$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(16-23.5714)^2}{23.5714}+\frac{(39-31.4286)^2}{31.4286}+\frac{(24-17.5714)^2}{17.5714}+\frac{(17-23.4286)^2}{23.4286}+\frac{(20-18.8571)^2}{18.8571}+\frac{(24-25.1429)^2}{25.1429} = 8.4931$$
8.4931 > 5.99 therefore we reject H0. Data provide sufficient empirical evidence to affirm that the two
variables are associated at a level of significance α = 0.05. (p-value 0.014313)
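The statistic and the p-value of Exercise 14 can be checked with a short sketch. It assumes the standard closed form of the chi-square upper tail for 2 degrees of freedom, P(χ² > x) = exp(−x/2), which is not stated in the original solution; the variable names are illustrative.

```python
import math

# Observed table of Exercise 14 (rows: Low, Sufficient, Good; two gender columns).
observed = [[16, 39], [24, 17], [20, 24]]

row_totals = [sum(row) for row in observed]          # R_i
col_totals = [sum(col) for col in zip(*observed)]    # C_j
n = sum(row_totals)

statistic = 0.0
for row, r in zip(observed, row_totals):
    for o, c in zip(row, col_totals):
        e = r * c / n                                # expected frequency E_ij
        statistic += (o - e) ** 2 / e

# For 2 degrees of freedom only, the upper-tail probability is exp(-x / 2).
p_value = math.exp(-statistic / 2)

print(round(statistic, 4), round(p_value, 6))
```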
EXERCISE 15
100 people are selected for an interview at random. The following two-way table refers to the questions
“will you take part in the next masked parade?” and “interviewee’s gender”.
Yes No
Gender M 12 28
F 13 47
a) What is the statistical test used to detect association between the variables in the two-way table? What
are the null and the alternative hypotheses?
In order to test for independence we carry out the Chi-squared ( χ 2 ) test. The test’s hypotheses are:
b) Calculate the p-value of the test you indicated in the previous point.
Yes No
Gender M 10 30
F 15 45
The test statistic is calculated as follows
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(12-10)^2}{10}+\frac{(13-15)^2}{15}+\frac{(28-30)^2}{30}+\frac{(47-45)^2}{45} = \frac{4}{10}+\frac{4}{15}+\frac{4}{30}+\frac{4}{45} = 0.8889$$
The p-value is:
$$p\text{-value} = P\left(\chi^2_1 > \chi^2\right) = P\left(\chi^2_1 > 0.8889\right)$$
Since 0.8889 is smaller than $\chi^2_{1,\,0.10} = 2.71$, the p-value is larger than 0.10.
c) Based on the result obtained in point b), make an assertion on H0, with α = 0.05. Briefly comment on the
output.
Since p-value > α we do not reject the null hypothesis: there is not sufficient empirical evidence to state that the willingness to take part in the parade varies across genders.
EXERCISE 16
100 people are selected for an interview at random. The following two-way table refers to the questions
“will you take part in the next masked parade?” and “interviewee’s gender”.
YES No
Gender M 10 30
F 30 30
a) What is the statistical test used to detect association between the variables in the two-way table? What
are the null and the alternative hypotheses?
In order to test for independence we carry out the Chi-squared ( χ 2 ) test. The test’s hypotheses are:
b) Calculate the p-value of the test you indicated in the previous point.
Yes No
Gender M 16 24
F 24 36
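The test statistic is calculated as follows
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(10-16)^2}{16}+\frac{(30-24)^2}{24}+\frac{(30-24)^2}{24}+\frac{(30-36)^2}{36} = \frac{36}{16}+\frac{36}{24}+\frac{36}{24}+\frac{36}{36} = 6.25$$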
c) Based on the result obtained in point b), make an assertion on H0, with α = 0.05. Briefly comment on the output.
The observed value 6.25 exceeds the critical value $\chi^2_{1,\,0.05} = 3.84$, so p-value < α. Since p-value < α we reject the null hypothesis: there is sufficient empirical evidence to state that the willingness to take part in the parade varies across genders (the variables are statistically dependent).
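The p-value of Exercise 16 can be obtained from the standard normal distribution, since a chi-square variable with 1 degree of freedom is the square of a standard normal variable. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the cell ordering in the lists is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Cells of Exercise 16 in the order M-Yes, F-Yes, M-No, F-No.
observed = [10, 30, 30, 30]
expected = [16, 24, 24, 36]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# P(chi-square_1 > x) = P(|Z| > sqrt(x)) = 2 * (1 - Phi(sqrt(x)))
p_value = 2 * (1 - NormalDist().cdf(sqrt(statistic)))

print(statistic, round(p_value, 4))
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

The p-value is slightly above 0.01, so H0 is rejected at α = 0.05 but would not be rejected at α = 0.01.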
EXERCISE 17
A retail manager of home appliances wants to analyze whether customers have a particular preference when they choose a flat TV. The three kinds of flat TV sold are: LCD, plasma, LED. In a simple random sample of 288 purchasers of a flat TV, 112 bought LCD, 103 plasma, 73 LED.
a) Calculate the p-value of the test to verify whether the customers do have a preference or not.
To verify whether the customers do have a preference or not we need to compare the distribution of the
preferences with a uniform distribution:
LCD PLASMA LED Total
p1 p2 p3 1
Test:
H0: uniform distribution, so $p_1 = p_2 = p_3 = \frac{1}{3}$
H1: distribution is not uniform, customers have preferences
$U = \sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}$ is distributed as a $\chi^2_{k-1}$ (with k = 3 and n = 288).
$$U = \sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i} = \frac{(112-96)^2}{96}+\frac{(103-96)^2}{96}+\frac{(73-96)^2}{96} = 8.6875$$
$$p\text{-value} = P\left(\chi^2_2 > U\right) = P\left(\chi^2_2 > 8.6875\right)$$
From the $\chi^2_2$ distribution table we can see that 0.01 < p-value < 0.025.
b) Assuming α equal to 0.05, which is the final conclusion you would suggest to the retail manager?
Justify your answer.
With a level of significance of 5%, we reject the Null hypothesis H 0 : there’s enough empirical evidence to say
that customers do have a preference in their TV set choice.
EXERCISE 18
A market research on drink consumption has been conducted in order to verify the association between
type of drinks and consumers’ age. A survey was administered to 120 customers and their preferences have
been collected and recorded into the following cross-table:
                      Type of drink
Age of customers      Cocktail    Liqueur
< 30 years            70          10
≥ 30 years            18          22
a) Using an appropriate statistical test, verify the hypothesis of independence of the two variables,
providing the assumptions and the computations needed to take the final decision. What are the
conclusions? Briefly explain.
Perform a chi-square test. The hypotheses to be tested are:
H0: The variables “Age of customer” and “Type of drink” are independent
H1: The variables “Age of customer” and “Type of drink” are dependent
Since no significance level is specified, we calculate the p-value using a $\chi^2$ distribution with (2 − 1)(2 − 1) = 1 d.f.
Using the sample data we can conclude that the dependence between Type of drink and Age of customer is
statistically significant.
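The numerical steps behind this conclusion can be sketched as follows. The code computes the statistic cell by cell and, as a cross-check, with the algebraic shortcut for 2 × 2 tables, n(ad − bc)² / (R1·R2·C1·C2); the shortcut and the erfc expression for the 1 d.f. tail are standard results assumed here, not taken from the original solution.

```python
import math

# Observed 2 x 2 table of Exercise 18 (rows: < 30, >= 30; columns: Cocktail, Liqueur).
observed = [[70, 10], [18, 22]]
(a, b), (c, d) = observed
n = a + b + c + d
rows = (a + b, c + d)
cols = (a + c, b + d)

# Definition: sum over cells of (O_ij - E_ij)^2 / E_ij with E_ij = R_i * C_j / n.
by_cells = sum(
    (observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
    for i in range(2)
    for j in range(2)
)

# Shortcut valid only for 2 x 2 tables.
shortcut = n * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])

# Upper-tail probability for 1 degree of freedom.
p_value = math.erfc(math.sqrt(by_cells / 2))

print(round(by_cells, 4), round(shortcut, 4), p_value)
```

The statistic is far above any tabulated critical value for 1 d.f., so the p-value is smaller than every usual significance level.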
EXERCISE 19
The business agents working for the same company suspect a salary gap against women. In order to verify their suspicion, they consider a sample of 46 agents and collect the following data on gender and income:
Women 5 10 4
Men 6 10 11
a) Can they conclude that there is actually a dependency between gender and income? Build and use an
appropriate statistical test with a level of significance α = 0.01. What could the agents conclude?
The appropriate test to verify the association between the two variables is the chi-square test of
independence.
The hypotheses to be tested are:
H 0 : there is no association between the variables "Income" and "Gender" (i.e. they are independent)
H1 : there is an association between the variables "Income" and "Gender" (i.e. they are dependent)
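The test statistic is
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$
which, when H0 is true, has distribution $\chi^2_{(r-1)(c-1)} = \chi^2_2$ (where c = 3 is the number of columns and r = 2 is the number of rows).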
11 20 15 46 11
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(5-4.5435)^2}{4.5435}+\frac{(10-8.2609)^2}{8.2609}+\frac{(4-6.1957)^2}{6.1957}+\frac{(6-6.4565)^2}{6.4565}+\frac{(10-11.7391)^2}{11.7391}+\frac{(11-8.8043)^2}{8.8043} = 2.027$$
We reject H0 when the observed value of the test statistic $\chi^2$ falls in the right tail of the distribution $\chi^2_2$. From the table of the chi-square distribution we get $\chi^2_{2;\,0.01} = 9.21 > 2.027$. We do not reject the null hypothesis of independence at the 1% significance level (there is no evidence that Income and Gender are dependent).
On the basis of the sample considered and the test result, the agents cannot conclude that there is a salary gap against women.
EXERCISE 20
A researcher wants to analyse the relation, if any, between children weight and time spent in sports in a
week. A random sample of 320 children between 8 and 10 years old is observed. Let X be weekly time
spent in sports (0 = “less than 1 hour”; 1 = “1-3 hours”; 2 = “more than 3 hours”) and Y be the weight (0 =
“normal”; 1 = “slightly overweight”; 2 = “heavily overweight”).
X\Y 0 1 2
0 30 70 0
1 20 30 60
2 10 40 60
b) Write the general formula of the statistic that should be used in this case.
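Compute the expected frequencies for each pair (i, j) of values of X and Y under the null hypothesis as $E_{ij} = \frac{R_i C_j}{n}$, where $R_i$ is the marginal frequency corresponding to the i-th row and $C_j$ is the marginal frequency corresponding to the j-th column. We reject H0 if
$$\sum_{i=1}^{3}\sum_{j=1}^{3}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} > \chi^2_{(3-1)(3-1),\,0.05},$$
where $O_{ij}$ are the observed frequencies.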
c) Test the hypothesis of a relation between the two variables with a significance level α = 0.05.
X\Y 0 1 2 Tot
The values $\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$ are computed for each cell.
The sum, 90.7359, exceeds $\chi^2_{(3-1)(3-1),\,0.05} = 9.49$: we reject the hypothesis that X and Y are independent.
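The cell-by-cell values and their sum can be reproduced with the sketch below. It is only an illustration under stated assumptions: the variable names are hypothetical, and the p-value uses the standard closed form for 4 degrees of freedom, exp(−x/2)(1 + x/2), which is not part of the original solution.

```python
import math

# Observed table of Exercise 20: rows are X = 0, 1, 2; columns are Y = 0, 1, 2.
observed = [[30, 70, 0], [20, 30, 60], [10, 40, 60]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contributions (O_ij - E_ij)^2 / E_ij with E_ij = R_i * C_j / n.
contributions = [
    [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]
statistic = sum(sum(row) for row in contributions)

# Upper-tail probability for 4 degrees of freedom only.
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)

for row in contributions:
    print([round(v, 4) for v in row])
print(round(statistic, 4), p_value)
```

Note that the empty cell (X = 0, Y = 2) contributes 37.5 on its own, which is already far above the critical value 9.49.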
EXERCISE 21
You want to study the relationship existing between the revenues of a group of companies and whether or
not the companies have a web page. You conduct a sample survey on 445 companies and obtain:
>10 Mil $ 56 88
i) Are revenues and having a web site independent (state the null and the alternative hypotheses and verify
whether there is an association between the two variables)? Use the p-value approach.
H1 the two variables X, Y are NOT independent, there is a relationship between the two variables X, Y
Expected frequency: $E_{ij} = \frac{R_i C_j}{n}$
Revenues \ Web site    Yes                        No
>10 Mil $              155*144/445 = 50.1573      290*144/445 = 93.8427
                       = 104.8427                 = 196.1573
We reject H0 when $\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} > \chi^2_{(r-1)(c-1),\,\alpha}$
$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = 0.6806 + 0.3638 + 0.3256 + 0.1740 = 1.544, \qquad p\text{-value} > 0.10$$
For any usual α we do not reject H0: there is no evidence of an association between the two variables.