Inference for Point Pattern Spatial Statistics
N. Bert Loosmore
nhl@u.washington.edu
QERM 550
University of Washington
May 11 & 13, 2005
Inference for Point Pattern Spatial Statistics – p.1/49

Outline
Use of Point Pattern Statistics in Ecology
The Failure of the Simulation Envelope
Diggle’s (1983, 2003) ‘Goodness of Fit’ Test
Unresolved Implementation Issues
Parameterization Based on the Ecological Research Question
Characterizing Type I, II Error Rate Performance

Point Pattern Statistics in Ecology
Spatial processes Ecological processes
[Figure: map of point locations; Easting (m) versus Northing (m), both 0 to 200]
What pattern for the green points?
What pattern for the red points?
Do we see (or expect) stationarity?
Point Pattern Spatial Stats: How?
Evaluate observed pattern against ideas of aggregation, CSR, inhibition
[Figure: example patterns: rMatClust() with 105 points, radius = 0.1 (aggregation); CSR pattern with 100 points; rSSI() with 100 points, radius = 0.05 (inhibition)]
Analyze distances between events:
G (nearest neighbor),
F (grid to nearest point),
K/L (all neighbors)
Typically perform analysis using ‘Simulation Envelope’

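Definition of the G and F Statistics
G statistic uses the nearest neighbor distances ($d_i$) for each of $n$ sample points as:
  $\hat{G}(t) = \frac{1}{n} \sum_{i=1}^{n} I(d_i \le t)$
F statistic uses the distances ($d_j$) from each of $m$ sample points (typically located on a grid) to their nearest event as:
  $\hat{F}(t) = \frac{1}{m} \sum_{j=1}^{m} I(d_j \le t)$
Here $I(\cdot)$ is the indicator function and $t$ is distance.
Under CSR with intensity $\lambda$, both the G and F statistics are approximated as
  $G(t) = F(t) = 1 - \exp(-\lambda \pi t^2)$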
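A minimal sketch of the empirical G and F statistics, assuming a unit-square study area, no edge correction, and a regular grid of sample points for F. The function names (`g_hat`, `f_hat`, `csr_g`) and the grid size are illustrative choices, not taken from the slides or from any package.

```python
import math
import random

def nearest_distances(sources, targets, skip_self=False):
    """Distance from each source point to its nearest target point."""
    out = []
    for i, (x, y) in enumerate(sources):
        best = math.inf
        for j, (u, v) in enumerate(targets):
            if skip_self and i == j:
                continue  # an event is not its own neighbor
            best = min(best, math.hypot(x - u, y - v))
        out.append(best)
    return out

def ecdf(distances, t):
    """Proportion of distances that are <= t."""
    return sum(d <= t for d in distances) / len(distances)

def g_hat(events, t):
    """Empirical G: event-to-nearest-event distances (no edge correction)."""
    return ecdf(nearest_distances(events, events, skip_self=True), t)

def f_hat(events, t, grid_size=10):
    """Empirical F: grid-point-to-nearest-event distances on the unit square."""
    grid = [((i + 0.5) / grid_size, (j + 0.5) / grid_size)
            for i in range(grid_size) for j in range(grid_size)]
    return ecdf(nearest_distances(grid, events), t)

def csr_g(t, intensity):
    """Theoretical G (and F) under CSR."""
    return 1.0 - math.exp(-intensity * math.pi * t * t)

random.seed(1)
events = [(random.random(), random.random()) for _ in range(100)]
for t in (0.02, 0.05, 0.10):
    print(t, g_hat(events, t), f_hat(events, t), round(csr_g(t, 100), 3))
```

For a CSR pattern the three printed columns should be broadly similar at each distance; aggregation tends to push the empirical G above the CSR curve and inhibition below it.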


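Definition of the K and L Statistics
K statistic uses the distances between all neighbors ($d_{ij}$) as:
  $\hat{K}(t) = \frac{1}{\hat{\lambda} n} \sum_{i=1}^{n} \sum_{j \ne i} I(d_{ij} \le t)$, with $\hat{\lambda} = n / |A|$ for a study area of size $|A|$
Under CSR, K statistic can be approximated by
  $K(t) = \pi t^2$
L statistic used to set mean to zero under CSR and (supposedly) stabilize variance as:
  $\hat{L}(t) = \sqrt{\hat{K}(t) / \pi} - t$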
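A small sketch of the K and L calculations, assuming the uncorrected estimator (no edge correction) and the centered form of L, i.e. sqrt(K/pi) - t, so that L is zero under CSR. Both choices are assumptions made for illustration; the names `k_hat` and `l_hat` are hypothetical.

```python
import math

def k_hat(events, t, area=1.0):
    """Uncorrected K estimate: ordered pairs within distance t, scaled by intensity."""
    n = len(events)
    count = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(events[i], events[j]) <= t:
                count += 1
    intensity = n / area
    return count / (intensity * n)

def l_hat(events, t, area=1.0):
    """Centered L transform of K; equals 0 when K(t) = pi * t^2."""
    return math.sqrt(k_hat(events, t, area) / math.pi) - t

# Four events on the corners of the unit square: each has two neighbors within 1.1.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(k_hat(square, 1.1), l_hat(square, 1.1))
```

Without edge correction the estimate is biased downward at larger distances, which is one reason the edge-correction choice matters later in the talk.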


Building the Simulation Envelope
A CSR pattern with
[Figure: G(t) versus distance (0.00 to 0.20) for a single CSR pattern]
99 CSR patterns with
[Figure: G(t) versus distance (0.00 to 0.20) for 99 CSR patterns]

Using the Simulation Envelope

Plot after subtracting
[Figure: $\hat{G} - \bar{G}$ (-0.3 to 0.3) versus distance (0.00 to 0.20) for an rSSI(r=0.03, n=100) pattern]
Perceived Level Performance

Using all results from 19 simulations yields , or

Throwing out upper and lower 2 simulations at each
distance ( ) from 99 simulations also yields


Kenkel (1988) Methods
Evaluated spatial locations of all live trees, and of all (live + standing dead) trees, in a jack pine (Pinus banksiana) forest.
Map of live + standing dead represents distribution following early sapling mortality, but prior to the onset of density-dependent mortality.
Methods: Used MC techniques for the G and L statistics to evaluate observed results against null hypotheses of i) random locations (CSR) and ii) random mortality.

Kenkel (1988) Conclusions
G: live + dead shows no departure from randomness, whereas live trees only show significant regularity
L: live + dead shows no departure from CSR at small scales; live trees show regularity at smaller scales
But is this interpretation correct?

Examples in Ecological Research
Author (Year)                          Statistics  Patterns in   "CI" (%)   Marginal
                                       Used        Sim Env (s)              Results (y/n)
Batista and Maguire (1998)             G, K        19            95%        n
Dolezal et al. (2004)                  K           99            95%        y
Freeman and Ford (2002)                G, K        99            99%        n
Grassi et al. (2004)                   K           99            95%        n
Hirayama and Sakimoto (2003)           K           19, 99        95%, 99%   n
Martens et al. (1997)                  L           99            95%        n
Moeur (1997)                           G, K        200           90%        n
Parish et al. (1999)                   G, K        19            95%        n
Salvador-Van Eysenrode et al. (2000)   G, K        1000          95%        y
Srutek et al. (2002)                   L           99            95%        y
Tirado and Pugnaire (2003)             K           1000          99%        n


Sim Env Level Performance
Simulation study with independent ‘trials’ of a CSR pattern against a CSR envelope.
Designate ‘failure’ if pattern exceeds envelope at any distance. (Type I error)
Expected type I error rate 0.05 ...
... actual type I error rate 0.5-0.7
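The kind of simulation study described above can be sketched on a toy scale. This is an illustration only, assuming a unit square, an uncorrected G statistic, a min/max envelope from s = 19 CSR patterns, 40 points per pattern, and 20 evaluation distances; none of these settings are taken from the study on the slide. For this setup the pointwise level at any single distance is at most 2/(s+1) = 0.10, and the sketch estimates how often a CSR pattern leaves the envelope at one or more distances.

```python
import math
import random

def g_curve(points, distances):
    """Empirical G (no edge correction) evaluated at each distance."""
    nn = []
    for i, p in enumerate(points):
        nn.append(min(math.dist(p, q) for j, q in enumerate(points) if j != i))
    n = len(points)
    return [sum(d <= t for d in nn) / n for t in distances]

def csr(n, rng):
    """n points placed uniformly at random on the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def envelope_rejects(n, s, distances, rng):
    """True if a CSR 'observed' curve leaves the min/max envelope of s CSR curves."""
    sims = [g_curve(csr(n, rng), distances) for _ in range(s)]
    obs = g_curve(csr(n, rng), distances)
    for k in range(len(distances)):
        column = [sim[k] for sim in sims]
        if obs[k] < min(column) or obs[k] > max(column):
            return True  # 'failure' at any one distance counts
    return False

def envelope_type1_rate(trials=100, n=40, s=19, seed=0):
    rng = random.Random(seed)
    distances = [0.01 * k for k in range(1, 21)]
    hits = sum(envelope_rejects(n, s, distances, rng) for _ in range(trials))
    return hits / trials

rate = envelope_type1_rate()
print("pointwise level:", 2 / 20, "rate over all distances:", rate)
```

Because the null hypothesis is true in every trial, any exceedance is a Type I error; comparing the printed rate with the pointwise level shows the effect of checking many distances at once.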

Monte Carlo Simulation Theory
For a univariate continuous distribution,

But does the simulation envelope comprise a univariate distribution?

How the Envelope is Really Made
Simulation envelope built from 100 patterns:
[Figure: $\hat{G} - G$ (-0.3 to 0.3) versus distance (0.00 to 0.25); 55 patterns comprising the simulation envelope]

Failure of the Simulation Envelope
Although built from patterns, complexity of both
1. G, F, and/or K statistics, and
2. spatial patterns
yields a multivariate result.
Since evaluation of the observed pattern occurs at many distances, we are performing simultaneous inference, and thus the Type I error rate is increased.
Further, if the simulation envelope is invalid, then how can we use it to determine scale?


Proper Statistical Methods
From Diggle (1983, 2003), for a given :

1. At a single a priori distance - use upper and lower
simulated values
2. Across a range of distances - use Goodness of Fit test

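The Goodness of Fit Test - 1
1. Represent the empirical results (with $\hat{H}$ denoting the estimated G, F, or K) as:
   $\hat{H}_1(t)$ for the observed pattern, and
   $\hat{H}_i(t)$, $i = 2, \ldots, s+1$, for the $s$ simulated patterns
2. Calculate:
   $u_i = \int_{t_{min}}^{t_{max}} [\hat{H}_i(t) - H(t)]^2 \, dt$   (1)
   for $i = 1, \ldots, s+1$
   Summary statistic indicative of the total deviation of the given pattern from the theoretical result $H(t)$
   but use $\bar{H}_i(t) = \frac{1}{s} \sum_{j \ne i} \hat{H}_j(t)$ in place of $H(t)$ to reduce bias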

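3. Reject (or fail to reject) $H_0$ based on the rank of $u_1$, using the p-value, calculated as
   $\hat{p} = 1 - \frac{j - 1}{s + 1}$
   for $u_1 = u_{(j)}$, the $j$-th smallest of the $s+1$ values. So, if $u_1 = u_{(s+1)}$ (the largest), then $\hat{p} = \frac{1}{s+1}$
Now we have quantitative results to evaluate a pattern’s significance based on an “exact” level test because of proper MC methods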
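The goodness of fit steps can be sketched as follows. This is a minimal illustration, assuming every curve (the observed one first, then the s simulated ones) has already been evaluated on a common, equally spaced distance list with spacing `step`, that each curve is compared with the mean of the other curves, and that the p-value is 1 - (rank - 1)/(s + 1). The function names are hypothetical.

```python
def gof_u(curves, step):
    """u_i for each curve: summed squared deviation from the mean of the OTHER curves."""
    m = len(curves)                 # m = s + 1 (observed first, then s simulations)
    n_t = len(curves[0])
    totals = [sum(c[k] for c in curves) for k in range(n_t)]
    u = []
    for c in curves:
        dev = 0.0
        for k in range(n_t):
            mean_others = (totals[k] - c[k]) / (m - 1)
            dev += (c[k] - mean_others) ** 2 * step
        u.append(dev)
    return u

def gof_p_value(curves, step):
    """Rank-based Monte Carlo p-value for the first (observed) curve."""
    u = gof_u(curves, step)
    rank = 1 + sum(v < u[0] for v in u[1:])   # 1 = smallest u, s + 1 = largest
    return 1.0 - (rank - 1) / len(u)

# Toy curves: the observed curve sits far from the three simulated ones.
observed = [1.0, 1.0]
simulated = [[0.0, 0.0], [0.1, 0.1], [-0.1, -0.1]]
print(gof_p_value([observed] + simulated, step=0.5))
```

With only three simulations the smallest attainable p-value is 1/4, which is why the number of simulations limits the resolution of the test.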

What is the optimal method to calculate the goodness of fit statistic $u_i$?
How to:
replace integration with summation
incorporate edge correction methods
choose limits, distance list
simulate patterns from null process

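Replacing Integration with Summation
We can rewrite Eqn (1) as
  $u_i \approx \sum_{t_k = t_{min}}^{t_{max}} [\hat{H}_i(t_k) - \bar{H}_i(t_k)]^2 \, \delta t_k$
where $\hat{H}_i$ is the estimated statistic for pattern $i$, $\bar{H}_i$ is the mean over the other patterns, and $\delta t_k$ is the spacing between successive distances.
But how accurate is this approximation?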

Edge Correction
Used to eliminate bias from the edge interfering with detecting a point’s neighbor
Reduced Sample edge correction approach:
Let $b_i$ be the distance from point $i$ to the closest boundary
Remove point $i$ from the calculation at distance $t$ where $t > b_i$
Other approaches (toroidal, isotropic, etc.)
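The reduced sample approach can be sketched for the G statistic as below, assuming a rectangular study area with its lower-left corner at the origin. The function names and the choice to return NaN when no points remain are illustrative assumptions.

```python
import math

def border_distance(p, width=1.0, height=1.0):
    """Distance from point p to the closest edge of the rectangle."""
    x, y = p
    return min(x, y, width - x, height - y)

def g_reduced_sample(events, t, width=1.0, height=1.0):
    """Reduced-sample G: use only events at least t from the boundary."""
    kept = 0
    hits = 0
    for i, p in enumerate(events):
        if border_distance(p, width, height) < t:
            continue  # a neighbor within t could lie outside the study area
        kept += 1
        nn = min(math.dist(p, q) for j, q in enumerate(events) if j != i)
        if nn <= t:
            hits += 1
    return hits / kept if kept else float("nan")

events = [(0.5, 0.5), (0.55, 0.5), (0.05, 0.05)]
print(g_reduced_sample(events, 0.1))   # the corner event is dropped at t = 0.1
```

Neighbors are still searched among all events; only the set of points contributing to the estimate shrinks as t grows, which is why the sample becomes small (and the estimate noisy) at large distances.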

Choice of Limits ( ), Distance List ( )
Recommended default for , but application dependent!
The estimated statistics are discrete step functions; they change where
  a new neighbor is detected, or
  a point is removed from the sample
Use empirical distance list for exact results from a single pattern
Because of the goodness of fit calculation, especially the mean over the other patterns, an exact solution needs the complete empirical distance list (i.e. from all patterns) for evaluation of each pattern
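Why the empirical distance list gives an exact result can be illustrated with two step functions. The sketch below assumes each curve is a simple empirical distribution function of a list of distances (no edge correction), integrates the squared difference exactly by using every distance at which either curve jumps, and compares that with a coarse regular grid. All names are hypothetical.

```python
def ecdf(values, t):
    """Proportion of values that are <= t (a right-continuous step function)."""
    return sum(v <= t for v in values) / len(values)

def exact_squared_area(a, b, t_max):
    """Integral over [0, t_max] of (ECDF_a - ECDF_b)^2, exact for step functions."""
    # Both curves are constant between consecutive jump points of either list.
    breaks = sorted({0.0, t_max} | {v for v in a + b if 0.0 < v < t_max})
    total = 0.0
    for left, right in zip(breaks, breaks[1:]):
        diff = ecdf(a, left) - ecdf(b, left)   # constant on [left, right)
        total += diff * diff * (right - left)
    return total

def grid_squared_area(a, b, t_max, steps):
    """Same integral approximated on a regular grid (left Riemann sum)."""
    width = t_max / steps
    return sum((ecdf(a, k * width) - ecdf(b, k * width)) ** 2 * width
               for k in range(steps))

a, b = [0.2], [0.6]   # the curves differ by 1 on [0.2, 0.6), so the area is 0.4
print(exact_squared_area(a, b, 1.0), grid_squared_area(a, b, 1.0, 4))
```

The merged list of jump points plays the role of the complete empirical distance list: using jumps from only one of the two curves would miss changes in the other.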

Resolution of Simulated Patterns
Complexity? - Number of distances grows with ,
Resolution (i.e. vs ) of simulated patterns should be equivalent to that of observed pattern
Limiting resolution helps constrain complexity
is highly accurate for ecological data (Freeman and Ford, 2002)
Combining resolution and default leads to at most 25,000 distances in , regardless of , or test statistic, and provides an exact solution


Parameterization - 1
“How to run any given test based on the ecological research question”
Number of simulations (s)
Choice of , including choice of
versus
Uncertainty in realized p-value ( ) results from the use of MC simulations
Ramifications of s? Affects precision of the realized p-value through
actual simulated patterns against which observed pattern tested, and
number of those patterns
Note about exact level performance (across many tests vs. variation of p-value for single test)

Distribution of P
Let and for . The p-value for the test is then:

The expected value of P is:

Assuming Y comes from , then . So, each of the


Variance of P ($\sigma_p^2$)
Looking at the variance of P we have
  $\sigma_p^2 = \frac{p(1 - p)}{s}$
Hence we can model the theoretical distribution of P as from a binomial(p,s) distribution.
Managing Uncertainty in the realized p-value
Remember that the binomial quickly converges to the Normal
Create 95% CI on $p$ (true p-value) near as
  $\hat{p} \pm 1.96 \sqrt{\hat{p}(1 - \hat{p}) / s}$
95% of CIs created this way should contain the true value of $p$, and so set decision rule: e.g. reject if CI contains or is fully below 0.05
Choose acceptable range of uncertainty for the realized p-value. For example if is ok, use
Use relationship between $\sigma_p$ and $s$ to find value of $s$
[Figure: $\sigma_p$ (0.01 to 0.07) as a function of the number of simulations $s$ (0 to 2000)]
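The relationship between the spread of the realized p-value and the number of simulations can be sketched numerically. This assumes the binomial model for the p-value, so that its standard deviation is sqrt(p(1 - p)/s), and the Normal approximation for the 95% interval; the function names and the example tolerance of 0.01 are illustrative assumptions.

```python
import math

def sd_p(p, s):
    """Standard deviation of the realized p-value under the binomial model."""
    return math.sqrt(p * (1.0 - p) / s)

def simulations_needed(p, half_width, z=1.96):
    """Smallest s with z * sd_p(p, s) <= half_width (Normal approximation)."""
    return math.ceil(z * z * p * (1.0 - p) / (half_width ** 2))

# Spread of the realized p-value near p = 0.05 for several simulation counts.
for s in (19, 99, 999, 1999):
    print(s, round(sd_p(0.05, s), 4))

# Simulations needed for a 95% interval of roughly 0.05 +/- 0.01.
print(simulations_needed(0.05, 0.01))
```

Because the standard deviation falls only with the square root of s, halving the uncertainty requires about four times as many simulated patterns.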

Choice of $H_0$
Use all available ecological knowledge for a more informative test
Null point process just needs to be able to be simulated, many models available (e.g. spatstat) or write your own!
At the very least, choose simple inhibition model based on physical separation
EDA vs. confirmatory analysis, results in iterative nature of research, with (hopefully) tests on independent data sets
Use the model to determine information on scale!

Example of model fitting
Attempt to fit a clustered model, representing establishment processes, to the lower SW quadrant of the WRCCRF data, for all trees in height.
Used Poisson Clustered model, where $\rho$ represents the number of parents and ( ) represents the expected number of children per parent, and where clustering of ‘children’ around each parent is described as

How to choose values for $\rho$ and $\sigma$? ( )
Note that my null ‘model’ here describes not only the process, but also the parameter values.
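A Poisson cluster null model of this general kind can be simulated with a few lines of code. The sketch below is an assumption-laden illustration: it places a fixed number of parents uniformly on the unit square, gives each a Poisson number of children with mean `mu` (a hypothetical parameter name), and scatters children around the parent with independent Gaussian offsets of standard deviation `sigma`. The dispersal distribution used on the slide is not recoverable from the text, so the Gaussian choice is only one possibility.

```python
import math
import random

def poisson(mean, rng):
    """Poisson draw by Knuth's multiplication method; adequate for small means."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def poisson_cluster(rho, mu, sigma, rng):
    """rho uniform parents; Poisson(mu) children each, Gaussian offsets with sd sigma.

    Children falling outside the unit square are dropped; parents are not returned.
    """
    pts = []
    for _ in range(rho):
        px, py = rng.random(), rng.random()
        for _ in range(poisson(mu, rng)):
            x, y = rng.gauss(px, sigma), rng.gauss(py, sigma)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                pts.append((x, y))
    return pts

rng = random.Random(3)
pattern = poisson_cluster(rho=20, mu=5, sigma=0.05, rng=rng)
print(len(pattern), "simulated children")
```

Fixing both the process and its parameter values, as here, is what makes the null model fully simulable for a Monte Carlo test.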

Example of model fitting - 2
This is Exploratory Data Analysis!
If we knew the theoretical value of G, K for this model, use Diggle’s ‘Least Squares Estimation’ method
Otherwise, use GoF test to estimate parameter space
Find the p-value for different combinations of $\rho$ and $\sigma$, and ‘accept’ model where
[Figure: accepted parameter combinations, $\sigma$ (0.1 to 0.4) versus $\rho$ (0 to 100); panel a) G statistic, panel b) K statistic]
Inference? For the observed data, if this model fits, then larger $\sigma$ suggests lower $\rho$ (i.e. few parents) and so more children/parent.
Conversely a smaller clustering radius requires higher $\rho$ and so fewer children per parent.
Is this model a good fit? What might the physiological and/or ecological implications be?
$\sigma$ gives us hints about scale.

Upper distance limit, Variance stabilization
The upper distance limit should be chosen before the test, and based on research question. (i.e. what is the interaction distance of interest?)
[Figure: K(t) (-0.10 to 0.05) versus distance (0.00 to 0.20)]
Variance stabilization - to make variance independent of distance.


Type I Error Rate ($\alpha$) - 1
Simulation study of Type I error rate performance
Evaluated different $\alpha$ levels, for different point pattern intensities ($\lambda$)
Results within LRT boundaries

Type I Error Rate ($\alpha$) - 2
Simulations of 1000 independent trials using
[Figure: estimated Type I error rate $\hat{\alpha}$ (0.00 to 0.15) versus number of points $\lambda$ (0 to 250); panel a) Type I error rates for G, panel b) Type I error rates for K]
Type II Error Rate (1-Power)
Type II error rate is the probability of accepting $H_0$ given that $H_1$ is really true.
Requires definition of $H_1$.
Power will be a function of ‘how far’ $H_1$ is from $H_0$. (‘Easy’ to think of this distance when using Normal distribution, but more difficult to conceptualize here.)
Often overlooked for spatial point process analysis, but can be simulated.

Analysis of Type II Error Rate
Analysis of power against $H_0$ of CSR for WRCCRF example for different parameterizations of $H_1$.
Type II error rate tells us the ability to distinguish the pattern from CSR.
As $\sigma$ increases, larger clusters are more like CSR.
[Figure: Power (0.0 to 1.0) versus $\sigma$ (0.05 to 0.35); panel a) $\rho = 20$, panel b) $\rho = 40$]

Power of the G Statistic
‘Large’ deviation at small distances may be swamped out
[Figure: $\hat{G} - G$ (-0.3 to 0.3) versus distance (0.00 to 0.20) for rSSI(r=0.02) and rSSI(r=0.03) patterns]

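Parameters that may improve Power
Rewriting Equation (2) in its ‘full’ form (Diggle, 2003):
  $u_i = \int_{t_{min}}^{t_{max}} w(t) \{ [\hat{H}_i(t)]^c - [\bar{H}_i(t)]^c \}^2 \, dt$
where $\hat{H}_i$ is the estimated statistic for pattern $i$ and $\bar{H}_i$ is the mean over the other patterns.
$w(t)$, $c$ as parameters to improve Power against certain
Use of $w(t)$ not well explored, but could be used to emphasize certain distances.
For my calculations,
For $c$,
use $c = 0.5$ for L statistic.
use $c = 0.25$ for power against clustered patterns (Diggle, 2003)
other?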

Conclusions
Simulation envelope does not result in expected Type I error rates. Limits are not confidence intervals.
For more precise, reliable results, implement Diggle’s goodness of fit test
Previous marginal results should be re-examined
Choice of , based on research question and previous knowledge
Evaluate the Power of your test
R software availability: http://students.washington.edu/nhl/masters.html

R software resources
CRAN (Comprehensive R Archive Network) site: http://cran.r-project.org/
A. Baddeley’s spatstat package: http://www.maths.uwa.edu.au/~adrian/spatstat.html
P. Diggle’s splancs package: http://www.maths.lancs.ac.uk/~rowlings/Splancs/
UW R and S-plus user support group: http://mailman1.u.washington.edu/mailman/listinfo/s plus
