
ESS 116
Introduction to Data Analysis in Earth Science

Instructor: Mathieu Morlighem


E-mail: mmorligh@uci.edu (include ESS116 in subject line)
Office Hours: 3218 Croul Hall, Friday 3:00 pm - 4:15 pm

Image Credit: NASA

Midterm exam
Part 1: take home available NOW on the class website:
https://eee.uci.edu/15f/42120/midterm
DUE: November 5th, 2:00 pm (do your own work)
I highly recommend starting to work on the midterm early

Part 2: 30 min in class next week
November 5th, 2:40 pm (MSTB 118)
Open Book (no laptop/cell phone/tablet)
Mix of multiple choice and short answer questions
Everything from Lecture 1 to 5 (no hypothesis testing)

No make-up exam for the midterm, no late submission
Midterm grade dropped if Final grade is better

Midterm Evaluation
Open until next lecture
What should be improved?
What can I do to help you learn, or is there something that isn't working?
Are the quick reviews useful?
Are the labs useful? Should they be longer?
Do you want more MATLAB, more stats, or is this a good balance?

Today's lecture
1. Lecture 5 quick review

2. Lecture 6: Hypothesis testing
Sampling Distribution of the sample mean
Central Limit Theorem (CLT)
Confidence intervals
Hypothesis Testing
t-test (Comparing means)
$\chi^2$-test (Goodness of fit)

Lecture 5 - review
Population: the actual properties of the real world
Sample: set of values imperfectly representing the population

Parameters: refer to the population (e.g., $\mu$ and $\sigma$)
Statistics: refer to the sample (e.g., $\bar{x}$ and $s$)

Accuracy: quality of being close to the true value
Precision: number of significant digits in a numerical value (measurements or calculation)

Lecture 5 - review
Sample visualization
Frequency Table
Cumulative Frequency
Histogram

Rules for a good histogram
number of bins $\approx \sqrt{\text{number of data values}}$

histogram takes either a number of bins, or a list of bin edges
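
As a rough illustration of the square-root rule, the sketch below bins a synthetic sample with MATLAB's histcounts and histogram functions. The data vector x and its parameters are made up for the example.

```matlab
% Synthetic sample (assumption: 400 normally distributed values)
rng(1);                          % fix the seed so the run is repeatable
x = 20 + 3*randn(400,1);

% Square-root rule: number of bins ~ sqrt(number of data values)
nbins = round(sqrt(numel(x)));   % 20 bins for 400 values

% histcounts returns the counts per bin without drawing anything
[counts, edges] = histcounts(x, nbins);

% histogram accepts either a number of bins or a vector of bin edges
figure; histogram(x, nbins);     % number of bins
figure; histogram(x, edges);     % same bins, passed as explicit edges
```

Passing the edges explicitly is useful when two samples must be compared on identical bins.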

What you need to know


Central Tendency:
Mean (average)
Median (50% higher, 50% lower)
Mode(s) (peak value(s))

Dispersion:
Range (max - min)
Standard deviation (average distance to mean)
Variance (square of Std Dev)

Shape:
Skewness (positive: tail to the right, negative: tail to the left)

Know how they relate to visual features on a histogram
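
A minimal sketch of how these quantities could be computed in MATLAB, using a small made-up sample with a tail to the right. The skewness function is assumed to be available through the Statistics Toolbox.

```matlab
% Small made-up sample with a tail to the right
x = [2 3 3 4 4 4 5 5 6 12];

m   = mean(x);            % central tendency: mean
med = median(x);          % 50% of values above, 50% below
mo  = mode(x);            % most frequent value
r   = max(x) - min(x);    % range
s   = std(x);             % sample standard deviation (divides by n-1)
v   = var(x);             % variance = s^2
sk  = skewness(x);        % > 0: tail to the right (Statistics Toolbox)

fprintf('mean = %.2f, median = %.2f, mode = %.2f\n', m, med, mo);
fprintf('range = %.2f, std = %.2f, var = %.2f, skewness = %.2f\n', r, s, v, sk);
```

Here the single large value pulls the mean above the median, which is the visual signature of positive skewness on a histogram.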

Probability Density Functions

Histograms: empirical frequency distribution of our sample.
A histogram for $N \to \infty$ and an infinitely small bin size will produce a Probability Density function (PDF)
The probability that x is between x1 and x2 is:

$$P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\,dx$$

Examples of theoretical Distributions:
Normal distribution (2 parameters: $\mu$ and $\sigma$)
Z distribution (0 parameters)
Student's t distribution (1 parameter: $\nu$)
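
The integral above can be checked numerically. The sketch below, with assumed values for mu, sigma, x1 and x2, integrates the Normal PDF between two bounds and compares the result with the difference of two CDF values.

```matlab
mu = 0; sigma = 1;          % assumed parameters of the Normal distribution
x1 = -1; x2 = 1;            % assumed integration bounds

% Numerical integral of the PDF between x1 and x2
p_int = integral(@(x) normpdf(x, mu, sigma), x1, x2);

% Same probability from the cumulative distribution function
p_cdf = normcdf(x2, mu, sigma) - normcdf(x1, mu, sigma);

fprintf('integral of PDF = %.4f, CDF difference = %.4f\n', p_int, p_cdf);
```

Both numbers should agree to within the integration tolerance, which is why the CDF functions are enough for the probability calculations in this lecture.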

MATLAB theoretical distributions

Normal ($\mu$, $\sigma$)
Given x0, find p0
>> p0 = normcdf(x0,mu,sigma);

Given p0, find x0
>> x0 = norminv(p0,mu,sigma);

Z-distribution
>> p0 = normcdf(x0);
>> x0 = norminv(p0);

t-distribution (V: number of degrees of freedom $\nu$)
>> p0 = tcdf(x0,V);
>> x0 = tinv(p0,V);

$\chi^2$-distribution
>> p0 = chi2cdf(x0,V);
>> x0 = chi2inv(p0,V);

p0 = P(x < x0)

e.g.: 0.88 = P(x < 1.17)

i>Clicker question
ESS 116 grades follow a Normal distribution of mean 800 with a standard deviation of 100. What is the probability of having a grade below 500?
A. 1 - normcdf(500,800,100);
B. 1 - norminv(500,800,100);
C. normcdf(500,800,100);
D. norminv(500,800,100);

Lecture 6: Hypothesis testing

Sampling distributions
For one population, results will vary from sample to sample
How much do we expect these results to vary from sample to sample?
Sampling distribution: distribution associated with samples rather than individual values from a population
Example: Sampling distribution of the sample mean
Graph of all possible values of the sample mean and how often they occur
The mean of the population of all possible sample means is the same as the mean of the entire population

[Figure: population distribution (left) and sampling distribution of the sample mean for n = 10 (right)]

i>Clicker question
What would happen to the standard deviation of the sample mean (right) if we increase the number of rolls for all samples (n=20 or n=100)?
A. The standard deviation would increase
B. The standard deviation would decrease
C. The standard deviation would remain unchanged
D. Don't know.

Standard Error (SE)

Variability of X is measured by the standard deviation $\sigma$
There might be a gap between the sample mean $\bar{x}$ and the population mean $\mu$
Standard Error: variability in the sample mean

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \quad \text{(population standard deviation over the square root of the sample size)}$$

Decreases as the sample size increases (more precise)
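
A quick simulation can make the standard error concrete. The sketch below rolls a fair six-sided die (an assumed population, with standard deviation sqrt(35/12)) and compares the spread of the simulated sample means with sigma/sqrt(n) for two sample sizes. The number of simulated samples is an arbitrary choice.

```matlab
rng(2);                               % repeatable random numbers
sigma  = sqrt(35/12);                 % population std of a fair six-sided die
nsamp  = 20000;                       % number of simulated samples (assumption)

for n = [10 100]
    rolls = randi(6, n, nsamp);       % each column is one sample of n rolls
    xbar  = mean(rolls, 1);           % one sample mean per column
    fprintf('n = %3d: std of sample means = %.3f, sigma/sqrt(n) = %.3f\n', ...
            n, std(xbar), sigma/sqrt(n));
end
```

The two printed columns should be close for each n, and both shrink as n grows.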

Central Limit Theorem

If the distribution of X is normal:
The distribution of the sample mean is also normal

If the distribution of X is unknown or not normal:
If n>30 the distribution of the sample mean $\bar{X}$ can be approximated by a normal distribution:
Mean: $\mu_{\bar{X}} = \mu$
Standard deviation: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$

The Central Limit Theorem does not care what the distribution of X is!
http://onlinestatbook.com/2/sampling_distributions/clt_demo.html
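
The theorem can also be tried directly in MATLAB. This sketch draws samples from a strongly skewed population (an exponential distribution with mean 1, chosen here only as an example) and compares the distribution of the sample means with the Normal curve predicted by the CLT.

```matlab
rng(3);
n     = 40;                        % sample size (> 30)
nsamp = 10000;                     % number of samples (assumption)
mu    = 1;  sigma = 1;             % exponential with mean 1 has mu = sigma = 1

x    = -log(rand(n, nsamp));       % exponential(1) draws via inverse transform
xbar = mean(x, 1);                 % sampling distribution of the sample mean

% Compare with the CLT prediction: Normal(mu, sigma/sqrt(n))
fprintf('mean of xbar = %.3f (CLT: %.3f)\n', mean(xbar), mu);
fprintf('std  of xbar = %.3f (CLT: %.3f)\n', std(xbar), sigma/sqrt(n));
histogram(xbar, 50, 'Normalization', 'pdf'); hold on;
t = linspace(min(xbar), max(xbar), 200);
plot(t, normpdf(t, mu, sigma/sqrt(n)), 'LineWidth', 1.5); hold off;
```

Even though the population is far from normal, the histogram of sample means should follow the overlaid Normal curve closely.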

Central Limit Theorem
[Figure: population distribution and sampling distribution of the sample mean, with $\mu_{\bar{X}} = \mu$]

Central Limit Theorem in Action

Example

The average male drinks 2 L of water when active outdoors (with a standard deviation of 0.7 L). You are planning a full day in nature with 50 men and will bring 110 L of water. What is the probability that you run out?

Example
Population distribution: $\mu$ = 2 L, $\sigma$ = 0.7 L
Sampling distribution of the sample mean: $\mu_{\bar{X}} = \mu$ = 2 L, $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 0.7/\sqrt{50}$

P(run out) = P(average water use > 110/50 L)
= P(average water use > 2.2 L)
= P($\bar{x}$ > 2.2)
= 1 - P($\bar{x}$ < 2.2)
= 1 - normcdf(2.2,2,0.7/sqrt(50))
= 0.0217

The probability of running out of water is 2.17%

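Confidence Interval

Confidence interval in the mean

Confidence Intervals: provide statistical limits for your mean values based on a degree of statistical confidence.
Ex: We can say with 95% confidence that the average temperature in Irvine is within [18°C, 24°C] or 21 ± 3 °C

How to calculate this interval?
Set the level of significance ($\alpha$ = 0.05 for a 95% CI)
Use a Normalized sample distribution of the sample mean ($\bar{X}$ follows a normal distribution):

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{x}}}$$

Find T such that

$$P(\bar{x} - T < \mu < \bar{x} + T) = P(-T < \bar{x} - \mu < +T) = P\left(-\frac{T}{\sigma_{\bar{x}}} < \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} < \frac{T}{\sigma_{\bar{x}}}\right) = 1 - \alpha$$

The normalized variable $(\bar{x} - \mu)/\sigma_{\bar{x}}$ follows a z-distribution!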

Distribution of the sample mean

CLT: the distribution of the sample mean is nearly Normal
What if we don't know $\sigma$, can we say $\sigma \approx s$?
If the sample size n > 30: Yes
If the sample size n < 30: Yes but
the distribution of X needs to be roughly normal
we pay a penalty: a t-distribution (fatter tails)

Example n>30
You sample 36 apples from your farm's harvest of over 200,000 apples. The mean weight of the sample is 112 grams (with s = 40 grams).
What is the probability that the mean weight of the 200,000 apples is between 100 and 124 grams?

Example n>30
Sampling distribution of the sample mean, normalized:

$$Z = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

P($\mu$ within 12 of $\bar{x}$) = P($\bar{x}$ within 12 of $\mu$)
= P($\bar{x}$ within 12 of $\mu_{\bar{X}}$)
= P($-12/\sigma_{\bar{x}} < Z < +12/\sigma_{\bar{x}}$)
= normcdf(12/(40/6)) - normcdf(-12/(40/6))
= 0.9281

We have a 92.8% chance that the actual mean is within 12 grams of our sample mean
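
The same numbers can be turned around to build a confidence interval. The sketch below reuses the apple values (n = 36, sample mean 112 g, s = 40 g) and assumes a 95% confidence level, which is not part of the original example.

```matlab
n = 36; xbar = 112; s = 40;        % values from the apple example
alpha = 0.05;                      % 95% confidence (assumed level)

se = s/sqrt(n);                    % standard error, using s in place of sigma
z  = norminv(1 - alpha/2);         % two-sided critical value, about 1.96
ci = [xbar - z*se, xbar + z*se];   % confidence interval for the mean

p12 = normcdf(12/se) - normcdf(-12/se);   % probability from the example, 0.9281

fprintf('95%% CI: [%.2f, %.2f] g, P(within 12 g) = %.4f\n', ci(1), ci(2), p12);
```

The 95% interval is slightly wider than ±12 g, consistent with ±12 g corresponding to 92.8% confidence.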

Example n<30
7 patients' blood pressures have been measured after having been given a new drug for 3 months. They had blood pressure increases of 1.5, 2.9, 0.9, 3.9, 3.2, 2.1 and 1.9

Construct the 95% Confidence Interval (CI) for the true expected blood pressure increase for all patients in a population.

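Example n<30
Here are our statistics for n = 7:
$\bar{x}$ = 2.3429
s = 1.0422

What is our 95% confidence interval?

Sampling distribution of the sample mean, normalized (Student's t distribution with $\nu$ = n-1 = 6):

$$Z = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

-z = tinv(0.025,7-1) = -2.4469 (2.5% of the distribution lies below -z)
z = tinv(0.975,7-1) = +2.4469 (2.5% of the distribution lies above +z)
95% of the distribution lies between -z and +z

$$\frac{\Delta x}{s/\sqrt{n}} = 2.4469$$

$$\Delta x = 2.4469\,\frac{s}{\sqrt{n}}$$

$$\Delta x = 0.9639$$

There is a 95% chance that the mean, $\mu$, is within 2.3429 ± 0.9639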
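
The whole calculation fits in a few lines of MATLAB. This is a sketch of one way to script it, using the seven measurements from the example; the variable names are arbitrary.

```matlab
x = [1.5 2.9 0.9 3.9 3.2 2.1 1.9];   % blood pressure increases from the example
n = numel(x);
alpha = 0.05;                        % 95% confidence interval

xbar  = mean(x);                     % 2.3429
s     = std(x);                      % 1.0422
tcrit = tinv(1 - alpha/2, n - 1);    % 2.4469 for nu = 6
half  = tcrit * s/sqrt(n);           % 0.9639

ci = [xbar - half, xbar + half];
fprintf('95%% CI: %.4f +/- %.4f\n', xbar, half);
```

Replacing tinv with norminv would give a narrower interval, which is the "penalty" paid for estimating the standard deviation from a small sample.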

Hypothesis Testing

Introduction
You read that, on average, a volcanic eruption lasts 7 weeks ($\mu$ = 7).
But we suspect that this number is wrong and should be higher ($\mu$ > 7).
How can we test this for a given level of significance ($\alpha$ = 0.05)?
We look at the past n=100 eruptions and find $\bar{x}$ = 7.2 and s = 1 week.

Assuming that $\mu$ = 7:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \simeq \frac{s}{\sqrt{n}}$$

P($\bar{x}$ > 7.2)
= 1 - P($\bar{x}$ < 7.2)
= 1 - normcdf(7.2,7,1/10)
= 0.0228 < $\alpha$

Assuming that $\mu$ = 7, there is only a 2.3% chance of finding a mean of 7.2 weeks or more, so we can reject $\mu$ = 7
Conclusion: $\mu$ > 7

Testing one population mean

You read that, on average, a volcanic eruption lasts 7 weeks ($\mu$ = 7). This is the Null Hypothesis H0.
But we suspect that this number is wrong and should be higher ($\mu$ > 7). This is the Alternative Hypothesis H1.
How can we test this for a given level of significance ($\alpha$ = 0.05)?
We look at the past n=100 eruptions and find $\bar{x}$ = 7.2 and s = 1 week.

Assuming that $\mu$ = 7:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \simeq \frac{s}{\sqrt{n}}$$

p-value = P($\bar{x}$ > 7.2)
= 1 - P($\bar{x}$ < 7.2)
= 1 - normcdf(7.2,7,1/10)
= 0.0228 < $\alpha$

Assuming that $\mu$ = 7, there is only a 2.3% chance of finding a mean of 7.2 weeks or more, so we can reject $\mu$ = 7
Conclusion: $\mu$ > 7

Hypothesis testing
The classical way to make statistical comparisons is to prepare a statement about a fact for which it is possible to calculate its probability of occurrence.
This statement is the null hypothesis and its counterpart is the alternative hypothesis.
The null hypothesis is traditionally written as H0 and the alternative hypothesis as H1.
A statistical test measures the experimental strength of evidence against the null hypothesis.
Curiously, depending on the risks at stake, the null hypothesis is often the reverse of what the experimenter actually believes for tactical reasons.

Examples of Hypotheses
Let $\mu_1$ and $\mu_2$ be the means of 2 samples
We want to investigate the likelihood that their means are the same:
Null Hypothesis: H0: $\mu_1 = \mu_2$
Alternative Hypothesis: H1: $\mu_1 \neq \mu_2$
The Alternative Hypothesis could also be: H1: $\mu_1 > \mu_2$

The first example of H1 is said to be two-sided or two-tailed (includes both $\mu_1 > \mu_2$ and $\mu_1 < \mu_2$)
The second is said to be one-sided or one-tailed.
The number of sides has implications on how to formulate the test

Possible outcomes

               | H0 is correct              | H0 is incorrect
H0 is accepted | Correct decision           | Type II error (missed detection)
               | Probability: 1 - $\alpha$  | Probability: $\beta$
H0 is rejected | Type I error (false alarm) | Correct decision
               | Probability: $\alpha$      | Probability: 1 - $\beta$

Level of significance $\alpha$: probability of committing a Type I error
$\alpha$ is set before performing the test.
In a two-sided test, $\alpha$ is split between the two options.
Often, H0 and $\alpha$ are designed with the intention of rejecting H0, thus risking a Type I error and avoiding the unbound Type II error.
The more likely this is, the more power the test has. Power is 1 - $\beta$
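
The meaning of the Type I error probability can be illustrated by simulation. In the sketch below every sample is drawn from a population where H0 is true, so each rejection is a false alarm; the population values and the number of repetitions are assumptions made for the example.

```matlab
rng(4);
alpha = 0.05;  n = 100;  ntests = 5000;   % assumed settings
mu0 = 7;  sigma = 1;                      % H0 is true in this simulation

x      = mu0 + sigma*randn(n, ntests);    % ntests samples drawn under H0
zstat  = (mean(x,1) - mu0) ./ (std(x,0,1)/sqrt(n));
reject = zstat > norminv(1 - alpha);      % one-sided test at level alpha

typeI = mean(reject);                     % fraction of false alarms
fprintf('Observed Type I error rate: %.3f (alpha = %.2f)\n', typeI, alpha);
```

The observed rate should land near the chosen level of significance, with some scatter from the finite number of repetitions.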

Importance of choosing H0
Selecting H0 has consequences on decision making
Customarily, tests operate on the left column of the contingency table and the harder to analyze right column remains unchecked
Consider a jury trial:

A: H0: not guilty
Test action | H0 True | H0 False
Accept      | Correct |
Reject      | Wrong   |
You assume that the defendant isn't guilty. Wrong rejection: an innocent person is found guilty and punished for a crime s/he did not commit

B: H0: guilty
Test action | H0 True | H0 False
Accept      | Correct |
Reject      | Wrong   |
You assume that the defendant is guilty. Wrong rejection: a guilty person is found innocent and let go free
Importance of choosing H0
Selecting H0 has consequences on decision making
Customarily, tests operate on the left column of the contingency table and the harder to analyze right column remains unchecked
Consider environmental remedial action:

A: H0: Site is clean
Test action | H0 True | H0 False
Accept      | Correct |
Reject      | Wrong   |
Wrong rejection means the site is declared contaminated when it is actually clean, which would lead to unnecessary cleaning

B: H0: Site is contaminated
Test action | H0 True | H0 False
Accept      | Correct |
Reject      | Wrong   |
Wrong decision declares a contaminated site clean. No action prolongs a health hazard

In both cases: P(Type I Error) = $\alpha$

Statistic
A key step in being able to run a test is finding an analytical expression for a statistic such that:
It is sensitive to all parameters involved in the null hypothesis
It has an associated probability distribution

p-value: the probability that the statistic takes values beyond the value calculated using the data while H0 is still true. Hence:
If p-value > $\alpha$ (level of significance), H0 is accepted
The lower the p-value, the stronger is the evidence provided by the data against the null hypothesis.

The p-value converts the statistic to probability units

Partition
The level of significance is employed to partition the range of possible values of the statistic into two classes:
One interval, usually the longest one (in green), contains those values that, although not necessarily satisfying the null hypothesis exactly, are quite possibly the result of random variation. If the statistic falls in this interval, H0 is accepted
[Figure: one-sided test, with an acceptance interval followed by a rejection interval]

The red interval comprises those values that, although possible, are highly unlikely to occur. In this situation, H0 is rejected. The departure from H0 is most likely real and significant.
When the test is two-sided, there are two rejection zones.
[Figure: two-sided test, with a rejection interval on each side of the acceptance interval]

Sampling distribution
The sampling distribution of a statistic is the distribution of values taken by the statistic for all possible random samples of the same size from the same population.
Examples of such sampling distributions are:
Standard normal and the t-distributions for the comparison of two means
The F-distribution for the comparison of two variances
The $\chi^2$-distribution for the comparison of two distributions

Testing Procedure
1. Select the null hypothesis H0 and the alternative hypothesis H1.
2. Choose the appropriate statistic
3. Set the level of significance $\alpha$
4. Evaluate the statistic for the case of interest, $z_s$.
5. Use the distribution for the statistic in combination with the level of significance to define the acceptance and rejection intervals. Find out either the corresponding:
p-value of the statistic in the probability space, or
level of significance in the statistic space, $z_\alpha$.
6. Accept the null hypothesis if $z_s < z_\alpha$ or if p-value > $\alpha$. Otherwise, reject H0 because, if it were true, a statistic this extreme would occur with probability less than $\alpha$.
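
The six steps can be sketched in MATLAB using the volcanic eruption numbers from earlier in the lecture (n = 100, sample mean 7.2 weeks, s = 1 week). The variable names are arbitrary, and the z-statistic is used because n is large.

```matlab
% 1. Hypotheses: H0: mu = 7, H1: mu > 7 (one-sided)
mu0 = 7;
% 2. Statistic: normalized sample mean (n is large, so a z-statistic)
n = 100; xbar = 7.2; s = 1;
% 3. Level of significance
alpha = 0.05;
% 4. Evaluate the statistic for the case of interest
zs = (xbar - mu0)/(s/sqrt(n));     % = 2
% 5. Critical value (statistic space) and p-value (probability space)
zalpha = norminv(1 - alpha);       % about 1.645
pval   = 1 - normcdf(zs);          % 0.0228
% 6. Decision
if zs < zalpha
    disp('H0 is accepted');
else
    disp('H0 is rejected');
end
```

Comparing zs with zalpha and comparing pval with alpha are two views of the same decision, so they always agree.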

t-test: Comparing Means

Difference in mean
Is the difference in mean between these 2 groups systematic, or just due to chance?


Difference in mean
Factors affecting our confidence in the answer:
Natural variability (standard deviation)
Sample sizes (n)

How do we quantify our confidence in the difference between the means of two data samples?
Due to variability within the sample
Due to the amount of data points in the sample

Student's t test
Null Hypothesis (H0): $\mu_1 = \mu_2$
Population means are not statistically different

Alternative Hypothesis (H1): $\mu_1 \neq \mu_2$
Population means are statistically different.

To accept the alternative hypothesis at 95% confidence:
We must show that, if the null hypothesis (H0) were true, there would be only a 5% probability of observing such a difference, which justifies rejecting it

Paired vs Unpaired t-test

Unpaired t-test: the two samples are from independent populations
Ex1: Are tropical fish larger than temperate fish?
Ex2: Are the temperatures in Long Beach and Death Valley significantly different?

Paired t-test: the two samples are from the same population
Ex1: Do fish get larger as they age?
Ex2: Is the annual temperature in the last 5 years in Death Valley significantly higher than in the earlier 5 years?

Paired vs Unpaired t-test

Unpaired t-test (independent populations)
Sample 1: size n1, mean m1 and standard deviation s1
Sample 2: size n2, mean m2 and standard deviation s2

$$t_{stat} = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}, \qquad \nu = n_1 + n_2 - 2$$

Paired t-test (same population)
We look at the differences between all n pairs: ($\bar{x}_d$, $s_d$)

$$t_{stat} = \frac{\bar{x}_d}{s_d/\sqrt{n}}, \qquad \nu = n - 1$$
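
Both statistics are short to compute by hand in MATLAB. The sketch below uses two made-up samples of equal size (purely illustrative) and evaluates the unpaired and paired formulas side by side.

```matlab
% Made-up samples, for illustration only
a = [5.1 4.9 5.6 5.8 5.2 5.0];
b = [4.7 4.8 5.1 5.3 4.9 4.6];

% Unpaired: difference of means over the combined standard error
t_unpaired  = (mean(a) - mean(b)) / sqrt(var(a)/numel(a) + var(b)/numel(b));
nu_unpaired = numel(a) + numel(b) - 2;

% Paired: work on the n pairwise differences
d = a - b;
t_paired  = mean(d) / (std(d)/sqrt(numel(d)));
nu_paired = numel(d) - 1;

fprintf('unpaired: t = %.3f (nu = %d)\n', t_unpaired, nu_unpaired);
fprintf('paired:   t = %.3f (nu = %d)\n', t_paired, nu_paired);
```

When the pairs vary together, the differences have a small spread and the paired statistic comes out much larger than the unpaired one, which is why pairing matters when it is justified.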

Student's t-test: Paired test

$$t_{stat} = \frac{\bar{x}_d}{s_d/\sqrt{n}}$$

If tstat is large:
The difference between groups is bigger than the normal variability within the sample
Therefore: the means of the 2 samples are significantly different from each other

If tstat is small:
The difference between groups is smaller than the normal variability within the sample
Therefore: the means of the 2 samples are not significantly different from each other

Student's t-test: Paired test

We need a threshold tcrit:
If |tstat| > tcrit: the difference between the means is unlikely to have occurred by chance
there's likely to be a real systematic difference between the two groups (and thus there's likely to be a real systematic difference between the two conditions)

Given a probability p, we can determine tcrit using Student's t distribution (with $\nu$ = n - 1) as a function of p

Student's t test (2-tailed)

tstat value is statistically distinguishable from zero. Example (90% conf)
[Figure: Student's t probability density (vertical axis: Density) with the two critical values marked. There is a 90% chance of finding t values between the critical values by chance, and only a 5% probability of finding t values beyond each critical value purely by chance.]

Summary
How to conduct a t-test?
1. Decide upon a level of significance $\alpha$.
e.g. 99% and 95% are typical ($\alpha$ = 0.01 or 0.05)
2. From this, decide if one-tailed or two-tailed and use MATLAB's tinv command to find tcrit
3. Compute tstat from your sample
4. Compare tstat and tcrit
If |tstat| > tcrit: the difference is significant (there's likely an actual difference between the two means)
else: the difference is not significant
5. Optional: determine the p-value

Example
We are interested in ocean acidification. We measure the pH of ocean water at the pier of Newport Beach at two different dates:

In 1994: 8.03, 8.08, 7.99, 8.00, 7.93, 7.98
In 2004: 7.99, 8.02, 7.92, 7.94, 8.01, 7.93
From our two samples, we have:
In 1994: m1 = 8.0017 and s1 = 0.0504
In 2004: m2 = 7.9683 and s2 = 0.0406
Does the difference between the two means show a significant decrease or is it likely caused just by chance?

Example
1. Choose a level of significance $\alpha$ = 0.1 (CI 90%)
2. This is a one-tailed test (H1: m2 < m1)
Number of degrees of freedom: $\nu$ = 6 - 1 = 5
tcrit = tinv(1-0.1,5)
= 1.4759

3. Now we have our critical value; what is our statistic?
d = [8.03, 8.08, 7.99, 8.00, 7.93, 7.98] ...
    - [7.99, 8.02, 7.92, 7.94, 8.01, 7.93];
tstat = mean(d)/(std(d)/sqrt(6))
= 1.4464

4. tstat < tcrit: We cannot reject H0
5. Optional: p-value = 1 - tcdf(tstat,5) = 0.1047
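
As a cross-check of the hand calculation, the same paired one-tailed test can be run with the ttest function from MATLAB's Statistics Toolbox. This is only a sketch: the variable names are ours, and the 'Tail' option is set to 'right' because the differences are taken as 1994 minus 2004. The p-value returned here is the upper-tail probability, 1 - tcdf(tstat,5).

```matlab
pH1994 = [8.03 8.08 7.99 8.00 7.93 7.98];
pH2004 = [7.99 8.02 7.92 7.94 8.01 7.93];

% Paired, one-tailed test of H1: mean(pH1994 - pH2004) > 0 at alpha = 0.1
[h, p, ~, stats] = ttest(pH1994, pH2004, 'Alpha', 0.1, 'Tail', 'right');

% h = 0 means H0 cannot be rejected; stats.tstat and stats.df should
% match the hand computation (tstat = 1.4464 with 5 degrees of freedom)
fprintf('h = %d, tstat = %.4f, df = %d, p = %.4f\n', h, stats.tstat, stats.df, p);
```

Since p is slightly above 0.1, the built-in test reaches the same conclusion as the comparison of tstat with tcrit.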

$\chi^2$-test: Goodness of fit

We want to compare an observed frequency distribution to a theoretical distribution.
Ex: we want to show that the yearly averaged rainfall in Irvine follows a normal distribution
Ex: we want to make sure that a die is not loaded

$\chi^2$ statistic
We decompose the number of observations (n) over k intervals (or bins, or classes)
k must satisfy n/k $\geq$ 5
k $\geq$ 10
So n $\geq$ 50

The Expected number of counts in any cell is $E_i$
The Observed number of counts is $O_i$

$$\chi^2_{stat} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

$\chi^2$ statistic
$\chi^2_{stat}$ measures the mismatch between the Expected and the Observed distributions
$\chi^2_{stat}$ = 0: perfect fit
$\chi^2_{stat}$ large: poor fit

Our statistic $\chi^2_{stat}$ follows a $\chi^2$-distribution!
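
The behaviour of the statistic is easy to see on toy numbers. In this sketch, chi2stat is a helper name introduced here (not a MATLAB built-in), and the counts are made up.

```matlab
% chi-square statistic as an anonymous function (the name chi2stat is ours)
chi2stat = @(O, E) sum((O - E).^2 ./ E);

E = [10 10 10 10 10 10];                  % expected counts (made up)
perfect = chi2stat(E, E);                 % observed equals expected
poor    = chi2stat([25 5 5 5 5 15], E);   % observed far from expected

fprintf('perfect fit: %.1f, poor fit: %.1f\n', perfect, poor);
```

The first value is exactly zero and the second is large, matching the two limiting cases listed above.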

Conducting a $\chi^2$ test
1. Formulate a null and alternative hypothesis:
H0: The data are consistent with a specified distribution
H1: The data are not consistent with a specified distribution
2. Choose a Significance level: $\alpha$ = 0.05 (5%)
3. Use MATLAB's chi2inv command to find $\chi^2_{crit}$
4. Analyze Sample data
Degrees of freedom: $\nu$ = k-1
Calculate the expected frequency counts $E_i$
Calculate the test statistic

$$\chi^2_{stat} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

5. Interpret the results

Example
Is our die loaded? Compare to a uniform distribution

Value | Observed freq. | Expected freq. | (O-E)^2/E
1     | 16             | 10             | 3.6
2     | 5              | 10             | 2.5
3     | 9              | 10             | 0.1
4     | 7              | 10             | 0.9
5     | 6              | 10             | 1.6
6     | 17             | 10             | 4.9
Total | 60             | 60             | 13.6

For alpha = 0.02:
chi2crit = chi2inv(1-0.02,6-1) = 13.3882
13.6 > 13.3882: the die is loaded (98% confidence)
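
The die example can be reproduced with a short script. This is a sketch under one assumption: the observed counts for faces 2 to 5 (5, 9, 7, 6) are inferred from the (O-E)^2/E column and the total of 60 rolls, since only the counts 16 and 17 are legible in the table.

```matlab
O = [16 5 9 7 6 17];                    % observed counts (faces 2-5 inferred)
n = sum(O);  k = numel(O);
E = repmat(n/k, 1, k);                  % uniform die: 10 expected per face

chi2stat = sum((O - E).^2 ./ E);        % 13.6
alpha    = 0.02;
chi2crit = chi2inv(1 - alpha, k - 1);   % 13.3882 for nu = 5
pval     = 1 - chi2cdf(chi2stat, k - 1);

if chi2stat > chi2crit
    disp('H0 rejected: the counts are not consistent with a fair die');
else
    disp('H0 accepted: the counts are consistent with a fair die');
end
```

The statistic only just exceeds the critical value, so the conclusion depends on the chosen level of significance: at alpha = 0.01 the same data would not reject H0.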

Next Week

Lab 6: Hypothesis testing
DUE: two weeks after the lab starts (EEE)
Lecture 7: Curve Fitting and interpolation
Midterm Part 1 take home
Midterm Part 2 in class
