
Spatio-Temporal Data Mining and

Analysis of Precipitation Extremes


ELIZABETH WU
SID: 200113854
Supervisor: Associate Professor Sanjay Chawla
Associate Supervisor: Associate Professor Joseph Davis
This thesis is submitted in partial fulfilment of
the requirements for the degree of
Master of Science
School of Information Technologies
The University of Sydney
Australia
10 February 2008
Copyright by Elizabeth Wu 2008
All Rights Reserved
Abstract
The purpose of the work presented in this thesis was to discover new ways to analyse and mine
information about precipitation extremes in South America from spatio-temporal data. This was done
in two ways. First, analysis was performed through the use of statistical measures that provided insight
into the behaviour of precipitation extremes between regions. Second, a new spatio-temporal outlier
detection algorithm was introduced to discover moving outliers from the data.
Using Extreme Value Theory (EVT) to model precipitation extremes and a measure of local spatial
autocorrelation known as the local Moran I statistic, the precipitation extremes and their relationship
with nearby regions were analysed. To then determine the effect that El Niño had on the local spatial
relationships of precipitation extremes, the mean Moran I value was compared to the mean Southern
Oscillation Index (SOI) over several strong El Niño periods. Bootstrap analysis was used to show the
relationship between the local spatial autocorrelation values and the SOI.
As a further extension of this and a contribution to the field of spatio-temporal data mining, we
provide a discussion and definition of a spatio-temporal outlier, and develop a new method for
spatio-temporal outlier detection in gridded data. The algorithm, known as Outstretch,
first discovers the top-k outliers using our extension of the Exact-Grid spatial scan statistic, called
Exact-Grid Top-k. Once these have been identified, Outstretch uses a stretching window to find all outliers in
subsequent time periods that fall within the window. Each of the sequences is then placed in a tree
for easy storage and retrieval of patterns.
Acknowledgements
The work conducted in this thesis is due to the support and encouragement of many staff members,
friends and family, and I am grateful to all those who have helped me complete my research.
First and foremost, I would like to thank my supervisor, Sanjay Chawla, for providing guidance
and support throughout this research project. His knowledge, assistance and feedback throughout my
research has been invaluable and very much appreciated.
I would also like to thank my associate supervisor Joseph Davis for providing an enjoyable and
educational research environment through Knowledge Management Research Group (KMRG) meetings.
The meetings allowed me to present my research and gave me valuable feedback that contributed towards
the quality of my research. They also helped to expand my knowledge of other topic areas relevant to
my research.
A special thanks also to Josiah Poon, who organised the Machine Learning seminars. These
provided an additional outlet for me to present my work and get some feedback and presentation experience.
Attending other presentations also helped me to gain insight into the work of other students and researchers.
For proofreading and providing comments and feedback, I would like to thank Vincent Hale, Peter
Phillips and my mum Irene Wu. Their comments have helped me to refine and enhance the presentation
of my work.
Finally, I would like to thank my parents for supporting me throughout my research.
Publications
Elizabeth Wu and Sanjay Chawla (2007) Spatio-Temporal Analysis of the relationship between
South American Precipitation Extremes and the El Nino Southern Oscillation, 2007 International
Workshop on Spatial and Spatio-temporal Data Mining (SSTDM) in conjunction with IEEE International
Conference on Data Mining (ICDM), October 28, Omaha, NE, USA. IEEE Computer Society.
[Accepted as a regular-presentation paper with an acceptance rate of 28%]
Dedication
It started out with data,
Which made no sense at all.
We wanted information,
That was our research call.
So people got together,
To come up with a plan.
They called it data mining,
That's how it all began.
Then different types of data,
Broke out onto the scene.
Some spread out in space,
Others over time they streamed.
This thesis is dedicated,
To all those in the past,
Who make more research possible,
And help to make it last!
- By Elizabeth Wu
Glossary
The following key concepts and terms are used in this thesis, and have been listed here for ease of
reference:
Autocorrelation: Loosely defined as how a single variable correlates to itself between pairs
of observations.
Block Maxima: A method of selecting extreme values where data is divided into blocks and
the maximum value from each block is selected for the set of extremes.
Bootstrap Analysis: A procedure where random samples are taken from a dataset with
replacement, and the same type of analysis is performed on them. It can be used to help
understand the uncertainty associated with some statistical estimators.
Clustering: A grouping of elements that are similar to each other in that grouping, but different
to objects in other groupings. (See also Spatial Cluster.)
Correlation: An association or relationship between variables.
Deseasonalisation: The removal of the effects of different seasons (such as Summer, Winter)
on values contained in the dataset.
El Niño: The warm phase of the El Niño Southern Oscillation (ENSO), indicated by the
warming of the Eastern Pacific Ocean, and a cooling of the Western Pacific Ocean. (A more
precise definition of El Niño including exact conditions is still subject to debate.)
El Niño Southern Oscillation (ENSO): A naturally occurring phenomenon that consists of
two phases, the warm phase El Niño and the cool phase La Niña.
Extreme Value Theory (EVT): A statistical methodology that attempts to quantify the
stochastic behaviour of a process at extremely large or small levels.
Geary's C Statistic: A statistical measure used to quantify spatial autocorrelation.
Generalised Extreme Value Distribution (GEVD): An extreme value distribution that
models the behaviour of extremes that were selected using the block-maxima technique.
Generalised Pareto Distribution (GPD): An extreme value distribution that models the
behaviour of extremes selected based on a threshold technique such as Peak Over Threshold
(POT).
Geographic Knowledge Discovery (GKD): The process of extracting information and
knowledge from massive geo-referenced databases. (Miller and Han, 2001)
Getis-Ord's G Statistic: A statistical measure used to quantify spatial autocorrelation.
Grid Data: Spatial data contained in grids, where each grid represents the region it covers
with values pertaining to that region.
High Discrepancy Region: Also known as a hotspot, a high discrepancy region is any region
whose discrepancy score is considered to be high. The discrepancy score is calculated using
a likelihood ratio test to find the actual location of anomalous regions.
Hotspot: See High Discrepancy Region.
Independent and Identically Distributed (iid) Random Variables: Random variables that
have the same probability distribution and whose outcomes are mutually independent.
La Niña: The cool phase of the El Niño Southern Oscillation (ENSO), named because of its
cooling effect on waters in the eastern Pacific Ocean.
Local Indicators of Spatial Association (LISA): A statistic that quantifies the local spatial
autocorrelation of a region to nearby regions. To qualify as a LISA, a statistic must be
proportional to a global statistic, and must give an indication as to the extent of the spatial clustering
around the observation.
Moran's I Statistic: A statistical measure used to quantify spatial autocorrelation.
Multivariate ENSO Index (MEI): A measure of the strength of an El Niño or La Niña event.
Outlier: (see also Spatial Outlier) An observation or subset of observations that appears
inconsistent with the remainder of that set of data. (Barnett and Lewis, 1994)
Peak Over Threshold (POT): An approach for selecting the extreme values from a given set
of data. Values that exceed a certain percentile, such as the 95th percentile, are considered
extreme.
Precipitation: The deposition of liquid water droplets and ice particles that are formed in the
atmosphere and grow to a sufficient size so that they are returned to the Earth's surface by
gravitational settling (Hornberger et al., 1998).
Precipitation Residuals: The maximum weekly values extracted for each week from the daily
data following deseasonalisation.
Region: The extended spatial location of something.
Sea Level Pressure (SLP): The atmospheric pressure at mean sea level.
Sea Surface Temperature (SST): The temperature representative of the upper few metres of
the ocean's surface (as opposed to the ocean's skin only, which is the upper few centimetres).
Southern Oscillation Index (SOI): The Southern Oscillation Index (SOI) is one measure of
the large-scale fluctuations in air pressure occurring between the western and eastern tropical
Pacific (i.e., the state of the Southern Oscillation) during El Niño and La Niña episodes.
Traditionally, this index has been calculated based on the differences in air pressure anomaly
between Tahiti and Darwin, Australia.
Spatial Autocorrelation: Correlation of a variable with itself through space.
Spatial Cluster: A group of spatial objects whose properties are similar to other objects in the
same group and dissimilar to the properties of objects in different groups.
Spatial Data: Data about the location and/or shape of an object.
Spatial Data Mining: A variation of data mining to enable the mining of patterns from spatial
data.
Spatial Neighbourhood: The area that lies in close proximity to a given object in a spatial
dataset, defined by a topological, distance or direction relationship (Ester et al., 2001).
Spatial Outlier: A spatially referenced object whose non-spatial attribute values are
significantly different from those of other spatially referenced objects in its spatial neighbourhood
(Shekhar and Chawla, 1993; Shashi Shekhar, 2003).
Spatial Scan Statistic: A spatial data mining approach that aims to find the highest
discrepancy region/s in a spatial dataset. One of the most well known is Kulldorff's spatial scan
statistic (Kulldorff, 1997).
Spatio-Temporal Data: Data that contains spatio-temporal objects.
Spatio-Temporal Data Mining: The automated discovery of interesting spatial patterns from
data over time using data mining techniques on spatially and temporally distributed data. In
other words, the automated discovery about the behaviour of and relationship between
spatio-temporal objects contained in spatio-temporal data sets.
Spatio-Temporal Neighbourhood: An object which is in the spatial neighbourhood of
another, at some particular time, is said to be in the spatio-temporal neighbourhood of the first
object for that particular time period.
Spatio-Temporal Object: A time-evolving spatial object whose evolution or history is
represented by a set of instances with a spacestamp (location identifier) and a timestamp (time
identifier).
Spatio-Temporal Outlier: A spatio-temporal object that is significantly different from other
objects in its spatial and temporal neighbourhoods.
Spatio-Temporal Outlier Detection: A spatio-temporal data mining technique that aims to
discover outliers in spatio-temporal data.
Teleconnection: The simultaneous variation in climate and related processes over widely
separated points on the Earth.
Temporal Neighbourhood: The period of time that occurs shortly before, at the same time as, or
shortly after a particular event is that event's temporal neighbourhood.
Tobler's first law of Geography: "Everything is related to everything else, but near things are
more related than distant things" (Tobler, 1970).
CONTENTS
Abstract
Acknowledgements
Publications
Dedication
Glossary
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Summary of our work
1.2 Motivation
1.3 Problem Description
1.4 Contributions
1.5 Thesis Structure
Chapter 2 Background
2.1 Spatial Data and Spatial GISs
2.2 Spatial Data Mining
2.2.1 Spatial Data Mining Techniques
2.3 Temporal Data
2.4 Temporal Data Mining
2.4.1 Temporal Data Mining Techniques
2.4.2 Temporal Data Characterisation and Comparison
2.4.3 Temporal Pattern Analysis
2.4.4 Temporal Prediction and Trend Analysis
2.5 Spatio-Temporal Data
2.6 Spatio-Temporal Data Mining
2.6.1 Spatio-Temporal Data Mining Techniques
2.6.2 Spatio-Temporal Co-location Episodes
2.6.3 Spatio-Temporal Association Rule Mining
2.6.4 Spatio-Temporal Episodes
2.6.5 Spatio-Temporal Co-Occurrence Patterns
2.6.6 Periodic Patterns
2.6.7 Spatio-Temporal Outlier Detection
2.7 Hydrology
2.7.1 Precipitation
2.7.2 El Niño Southern Oscillation
2.7.3 Southern Oscillation Index
2.8 Geographic Knowledge Discovery
2.9 Extreme Value Analysis
2.10 Spatial Autocorrelation
2.10.1 Measuring Spatial Autocorrelation
2.10.2 Local Indicators of Spatial Association
2.10.3 Local Moran's I Statistic
2.11 Data
2.11.1 Locations
2.11.2 Deseasonalisation
2.12 Summary
Chapter 3 Spatio-Temporal Analysis of the relationship between South American Precipitation Extremes and the El Niño Southern Oscillation
3.1 Related Work
3.2 Our Contribution
3.3 Experimental Setup
3.3.1 Data
3.3.2 Strong ENSO events and Local Moran I
3.4 Experimental Results
3.5 Discussion
Chapter 4 Spatio-Temporal Outlier Detection in Precipitation Data
4.1 Definition and Properties of a Spatio-Temporal Outlier
4.2 Previous Research
4.2.1 Spatial Scan Statistic: Exact-Grid
4.3 Limitations of Previous Work
4.4 Our Contribution
4.4.1 Exact-Grid Top-k
4.4.2 The Outstretch Algorithm
4.4.3 Data
4.4.4 Preprocessing
4.5 Experimental Setup
4.6 Experimental Results
4.7 Discussion
Chapter 5 Conclusion
5.1 Summary of Contributions
5.2 Discussion and Future Work
References
List of Figures
1.1 Activities performed during spatio-temporal analysis of precipitation extremes
1.2 Activities performed during spatio-temporal outlier detection
2.1 Example Co-location Dataset
2.2 Example Spatial Outlier
2.3 Types of spatio-temporal objects
2.4 A co-location episode
2.5 Types of Spatial Object Association Patterns
2.6 A moving cluster over three snapshots
2.7 The location of Darwin and Tahiti in the South Pacific Ocean
2.8 The modal and tail regions of a normal distribution
2.9 Different methods for selecting maxima from data
2.10 Examples of Positive, Negative and No Autocorrelation
3.1 Number of non-missing regions for each strong ENSO period
3.2 Mean of SOI and local Moran I's over strong ENSO periods between 1978-2004
3.3 Bootstrap Analysis of Local Moran I of Scale over strong ENSO periods
4.1 Spatio-Temporal Outlier Examples
4.2 An example of a spatio-temporal outlier in gridded data
4.3 An example grid for calculating Kulldorff's scan statistic
4.4 An example grid with missing data for calculating Kulldorff's scan statistic
4.5 ALGORITHM: Exact-Grid
4.6 Exact-Grid Sweep Lines
4.7 The overlap problem
4.8 All possible overlap types between two regions
4.9 The chain overlap problem
4.10 The union solution to the overlap problem
4.11 ALGORITHM: Exact-Grid Top-k
4.12 SUBROUTINE: Update Top-k
4.13 Region Stretch Size r
4.14 An example outlier tree built by the Outstretch algorithm
4.15 ALGORITHM: Outstretch
4.16 Example Outstretch Algorithm Sequence
4.17 ALGORITHM: RecurseNodes
4.18 Length and Number of Outliers Found
List of Tables
2.1 Autocorrelation values for Geary's C and Moran's I
3.1 Strong El Niño Periods and Analysis Years
4.1 The outlier table corresponding to the outlier tree shown in Figure 4.14
4.2 Experiment 1 Variable Setup
4.3 Experiment 1 Data Setup
4.4 Number of top-k outliers found for each year
4.5 Length and Number of Outlier Sequences Found
CHAPTER 1
Introduction
The popularity of databases has brought with it an ever increasing amount of data. For example, in 2005,
one of the largest known databases, managed by the Max Planck Institute for Meteorology, contained
over 222 terabytes of data (Winter Corporation, 2005). This enormous growth in data has highlighted
the need to make sense of it. Data mining has gathered significant attention, as it has the capability of
providing a better understanding of the information contained within the data.
The aim of data mining is to find previously unknown, nontrivial information from data. This process
can be partly automated using algorithms that enable the efficient discovery of such information
from the data. Data mining is one part in the process of Knowledge Discovery from Databases (KDD),
and it can be used to find information from data in a wide variety of application settings.
Data mining typically involves several steps. First, a dataset containing data on the topic of interest must
be acquired. This data is then preprocessed into an appropriate format for mining, and noisy, missing
and inconsistent data is dealt with. Once complete, a data mining algorithm is applied. The way in
which it is applied depends on a number of factors, such as the type of information or patterns being
mined, the algorithm being used and the final results needed by the end user.
There are many types of patterns that can be discovered using data mining techniques. Most patterns
can be categorised as either predictive or descriptive. Predictive patterns use historical data to build a
model that can help to predict future events. For example, by looking at previous credit card approvals,
a model can be built to determine which combinations of applicant attributes, such as whether the individual has
dependents or a job, are likely to lead to approval or denial of a credit card application. This model, built using
pre-classified data, can then be used to discover favourable candidates from new unseen data. Descriptive
patterns, on the other hand, examine historical data to find interesting patterns only within the given data
set. One example of a descriptive pattern is a spatial cluster of individuals with cholera located around
particular water wells, such as those discovered in London by John Snow in 1854 (Snow, 1856).
The focus of this thesis is on another type of descriptive pattern - an outlier. Since an outlier can
constitute different things in different application settings, the precise definition of an outlier is difficult to
capture (Ng, 2001). Barnett and Lewis (Barnett and Lewis, 1994) describe an outlier as "an observation
which appears to be inconsistent with the remainder of that set of data". Hawkins (Hawkins, 1980)
defines an outlier as "an observation which deviates so much from other observations as to arouse
suspicions that it was generated by a different mechanism". From these definitions, Ng (2001) notes two
possible types of outliers. The first type are extreme value outliers. If the values under examination
follow a normal distribution for example, then we would only be interested in the extreme values at the
tail of the distribution. The second type are non-extreme value outliers. For example, if we find in a
dataset containing different types of values that the incidence of one of these types is much less than
the others, it can be considered to be an outlier. In this work we consider extreme value outliers.
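As a rough illustration of how extreme values might be selected from a series, the sketch below implements the two selection approaches defined in the Glossary: block maxima and Peak Over Threshold. The function names, the nearest-rank percentile rule and the synthetic daily values are assumptions made for this example only; they are not taken from the thesis data or code.

```python
import math
import random

def block_maxima(values, block_size):
    """Return the maximum of each consecutive, complete block of `block_size` values."""
    n_blocks = len(values) // block_size  # any incomplete trailing block is ignored
    return [max(values[i * block_size:(i + 1) * block_size]) for i in range(n_blocks)]

def peaks_over_threshold(values, percentile=95.0):
    """Return (threshold, exceedances), using a nearest-rank percentile as the threshold."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    threshold = ordered[rank - 1]
    return threshold, [v for v in values if v > threshold]

# Synthetic daily totals for illustration only; these are not data from the thesis.
rng = random.Random(0)
daily = [rng.gammavariate(2.0, 5.0) for _ in range(364)]
weekly_maxima = block_maxima(daily, block_size=7)  # one extreme per 7-day block
threshold, exceedances = peaks_over_threshold(daily, percentile=95.0)
```

Block maxima yields exactly one extreme per block regardless of how large the values are, whereas the threshold approach keeps every value above the chosen percentile, so the two sets of extremes generally differ.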
Although spatio-temporal outlier detection is the main focus of this study, pre-processing and statistical
analysis are required to enable us to get to the stage where such discoveries are possible. Pre-processing of
the data is particularly important since the spatio-temporal data used is quite complex. For example, the
data must be deseasonalised in order to eliminate seasonal variations in precipitation levels that would
cause the results to be unreliable and uninformative. To prepare the data adequately, an understanding
of the area from which the data has been gathered is necessary so that particular features of that data set
can be taken into account. Of equal importance is the use of appropriate pre-processing techniques.
For example, there may be several ways to perform deseasonalisation, and selecting different methods
can lead to different results. Statistical analysis is also an important step, as it helps us to understand
what statistical methods can achieve with the data sets we are using. The methods used in this study
have not previously been applied to our precipitation dataset and so the results are able to provide greater
understanding of the way precipitation extremes behave.
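To make the idea of deseasonalisation concrete, the following sketch shows one simple variant: subtracting, from every value, the mean of all values observed at the same position in the seasonal cycle. This is only one of the several possible methods mentioned above and is not necessarily the one used in this thesis; the function name and the `period` parameter are hypothetical.

```python
from collections import defaultdict

def deseasonalise(values, period):
    """Subtract from each value the mean of all values at the same position in the cycle.

    `period` is the length of one seasonal cycle in observations
    (for example 365 for daily data with an annual cycle).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, v in enumerate(values):
        sums[i % period] += v
        counts[i % period] += 1
    seasonal_mean = {k: sums[k] / counts[k] for k in sums}
    # What remains is the departure from the typical value for that time of year.
    return [v - seasonal_mean[i % period] for i, v in enumerate(values)]

# Toy series with a cycle of length 2: a "dry" and a "wet" season alternating.
residuals = deseasonalise([1.0, 10.0, 3.0, 12.0], period=2)
```

In the toy series the wet-season values are always larger, yet after deseasonalisation both seasons are centred on zero, so an unusually wet dry season can be compared directly with an unusually wet wet season.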
A summary of the contents of this thesis and an overview of our contributions are provided in the
following section.
1.1 Summary of our work
There are four major outcomes that we achieve from this study. They are:
Preprocessing of precipitation data;
The use of statistical methods to calculate the relationship between neighbouring regions, in
terms of the behaviour of their precipitation extremes;
Comparing these relationships to the El Niño Southern Oscillation (ENSO); and,
Finding all moving precipitation extreme/outlier paths over several El Niño periods.
As with all data mining tasks, the purpose is to provide some understanding of the data in the form of
useful information. In the case of spatio-temporal data mining, the data itself is not always in a format
that is appropriate for immediate application of data mining algorithms, so pre-processing is essential.
Care must be taken in this step, as pre-processing is often the most time consuming procedure in a data
mining task. There are several stages involved in this process, so this forms the first part of our work.
The second part of this work focuses on improving our understanding of the behaviour of precipitation
extremes. First we focus on a single time period, for which we determine the distribution of extremes
for each grid in the dataset. Once we have this distribution we calculate a value to describe how similar
regions are to each other. This value is known as a Local Indicator of Spatial Association (LISA)
(Anselin, 1995), and gives a good indication of the similarity of precipitation extremes in neighbouring
grids.
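A minimal sketch of one such indicator, the local Moran's I, is given below for a small grid holding a single value per cell (for instance, a parameter of the fitted extreme value distribution). The choice of row-standardised rook (four-neighbour) weights and the function name are assumptions made for illustration; the thesis may define its neighbourhoods and weights differently.

```python
def local_morans_i(grid):
    """Local Moran's I for each cell of a 2-D grid (list of equal-length rows).

    Assumes row-standardised rook weights, at least two cells, and values
    that are not all identical (otherwise the variance term is zero).
    """
    rows, cols = len(grid), len(grid[0])
    cells = [v for row in grid for v in row]
    n = len(cells)
    mean = sum(cells) / n
    m2 = sum((v - mean) ** 2 for v in cells) / n  # second moment of the deviations
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [(r + dr, c + dc)
                          for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                          if 0 <= r + dr < rows and 0 <= c + dc < cols]
            # Spatial lag: average deviation from the mean over the neighbours.
            lag = sum(grid[nr][nc] - mean for nr, nc in neighbours) / len(neighbours)
            result[r][c] = (grid[r][c] - mean) / m2 * lag
    return result

checkerboard = local_morans_i([[1, 0], [0, 1]])  # every cell differs from its neighbours
strip = local_morans_i([[1, 1, 0, 0]])           # end cells match their only neighbour
```

Positive values indicate that a cell resembles its neighbours (both above or both below the overall mean), while negative values indicate that it differs from them.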
The LISAs are calculated for all time periods. Once complete, we then focus on the temporal changes
in the data to determine how these relationships vary over time. The time periods chosen correspond
to strong El Niño events. We then compare the strength of the El Niño events with the average spatial
autocorrelation value for all grids. To provide further understanding of the correlation between El Niño
and the average spatial autocorrelation value, bootstrapping analysis is performed so that results can be
visualised. These first portions of our work are depicted in Figure 1.1.
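The bootstrapping step can be sketched as follows: pairs of values (one per period) are resampled with replacement and the correlation is recomputed on each resample, which gives a spread of estimates rather than a single number. The helper names, the use of Pearson correlation and the numbers in the usage lines are illustrative assumptions, not results or procedures taken from this thesis.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def bootstrap_correlation(xs, ys, n_resamples=1000, seed=0):
    """Correlations computed on resamples of the (x, y) pairs drawn with replacement.

    Assumes both inputs contain at least two distinct values.
    """
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        # Skip degenerate resamples in which either variable has zero variance.
        if len(set(bx)) > 1 and len(set(by)) > 1:
            estimates.append(pearson(bx, by))
    return estimates

# Made-up values standing in for a mean index and a mean autocorrelation per period.
index_means = [1.0, 2.0, 3.0, 4.0, 5.0]
autocorr_means = [2.0, 4.0, 6.0, 8.0, 10.0]
estimates = bootstrap_correlation(index_means, autocorr_means, n_resamples=200)
```

The spread of the returned estimates (for example, their 2.5th and 97.5th percentiles) indicates how sensitive the observed correlation is to the particular periods included.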
FIGURE 1.1: Activities performed during spatio-temporal analysis of precipitation extremes
The last part of this work, summarised in Figure 1.2, is our spatio-temporal outlier detection algorithm.
The purpose of our algorithm is to find outlier regions of extreme precipitation and track their movements
over time. The algorithm first finds the top-k outlier regions over each time period using a spatial scan
statistic known as Exact-Grid Top-k. Then, using the outliers from preceding time periods, the
Outstretch algorithm finds those outliers in subsequent time periods that fall within an area that is stretched
just beyond the previous period's outlier region. As a result, all moving outlier regions are found in the
gridded data and stored in a tree data structure. The RecurseNodes algorithm is used to retrieve the
outliers.
FIGURE 1.2: Activities performed during spatio-temporal outlier detection
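The stretching-window idea can be sketched in a few lines. Here each outlier region is assumed to be an axis-aligned rectangle of grid coordinates `(x1, y1, x2, y2)`, the window is the region enlarged by a stretch size `r` on every side, and a later outlier is linked to an earlier one when it lies entirely inside that window. The representation, the containment rule and all function names are assumptions for illustration; this is not the Outstretch or RecurseNodes pseudocode itself.

```python
def stretch(region, r):
    """Enlarge a rectangle (x1, y1, x2, y2) by r cells on every side."""
    x1, y1, x2, y2 = region
    return (x1 - r, y1 - r, x2 + r, y2 + r)

def falls_within(region, window):
    """True if `region` lies entirely inside `window`."""
    x1, y1, x2, y2 = region
    wx1, wy1, wx2, wy2 = window
    return wx1 <= x1 and wy1 <= y1 and x2 <= wx2 and y2 <= wy2

def link_outliers(periods, r):
    """Map each outlier (period, index) to the outliers in the next period
    that fall within its stretched window. `periods` is a list, one entry per
    time period, of the outlier rectangles found in that period."""
    children = {}
    for t, regions in enumerate(periods):
        following = periods[t + 1] if t + 1 < len(periods) else []
        for i, region in enumerate(regions):
            window = stretch(region, r)
            children[(t, i)] = [(t + 1, j) for j, nxt in enumerate(following)
                                if falls_within(nxt, window)]
    return children

def sequences_from(node, children):
    """Enumerate every path from `node` down to a leaf of the linked structure."""
    if not children[node]:
        return [[node]]
    return [[node] + rest
            for child in children[node]
            for rest in sequences_from(child, children)]

# Hypothetical top-k outlier rectangles for three consecutive periods.
periods = [[(2, 2, 3, 3)], [(3, 3, 4, 4), (8, 8, 9, 9)], [(4, 4, 5, 5)]]
children = link_outliers(periods, r=1)
paths = sequences_from((0, 0), children)
```

In this toy input the outlier drifts one cell diagonally per period and is followed across all three periods, while the distant region at `(8, 8, 9, 9)` is never linked to it.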
1.2 Motivation
Each of the four major outcomes of this work focus on the main task of furthering our understanding of
precipitation extremes. Therefore it is necessary that we rst describe why precipitation extremes are a
challenging, interesting and important area of study.
The study of precipitation extremes is valuable for several reasons. Extreme levels of precipitation may
lead to ooding, which often has devastating effects on the natural and built environments. Although
precipitation is not the only factor that leads to ooding, it often plays a signicant part in ood events.
Floods can have both short and long-termcosts. Of the most devastating is the loss of life fromdrowning,
disease and lack of access to food and medical supplies. Financial costs also place a signicant burden on
those who must provide shelter and rebuild the damaged infrastructure. Being able to further understand
the nature of precipitation may help communities to better prepare themselves for events, to mitigate the
damage, reduce the negative effects of ooding, and to recover from such extreme events.
From another perspective, extreme precipitation events are also significant since some individuals may
rely on heavy precipitation to replenish dry agricultural areas. Discovering precipitation patterns is
therefore critical to the success of crops and livestock, which often depends on such knowledge.
Understanding precipitation extremes is of great importance, and is only made possible through the
development of techniques and methods that can be used to find information about precipitation from
data. Thus, the focus of this thesis is on the development of such techniques and methods.
The statistical techniques, such as the local indicators of spatial autocorrelation, have not previously
been applied to this type of data. We are able to provide a quantitative analysis of the relationships
between regions, by giving them a numerical value. This is important as previous work has involved
a more qualitative analysis of the relationship by observation, which is not as reliable as quantitative
results.
Analysing how the strength of inter-regional relationships is affected by the strength of ENSO is
something that has also not previously been studied, and as a result, our statistical analysis provides an
important step toward greater understanding of the relationship between the behaviour of precipitation
extremes and ENSO.
The spatio-temporal outlier detection algorithm, which we have named Outstretch, is a significant
contribution not only to the study of precipitation extremes but also to the field of spatio-temporal data
mining. This method allows the discovery of moving outliers in all types of grid-based data. There are
many reasons we might want to find moving outliers, depending on the type of data that is under study.
In our case, finding moving outliers may help to understand the movement of outliers over time in the
past, and help to anticipate future outlier locations. This leads us back to the original important goal
of enabling communities to understand, prepare for, cope with and recover from extreme precipitation
events.
1.3 Problem Description
There are three main areas where previous research is currently lacking.
Previous work from a geoscientific study has used Extreme Value Theory (EVT) to describe the
distribution of extreme precipitation. From this, Khan et al. (2007) were able to conduct a qualitative
analysis of the behaviour of precipitation extremes. However, qualitative analysis is unable to detect some
features that are not as noticeable from visual observation of the results on a map.
Another area lacking in previous research is the study of the relationship between precipitation extremes
and ENSO.
In the area of spatio-temporal outlier detection, no current algorithms or research were found to be suited
to the discovery of outliers in grid-based data. Most research into spatio-temporal outlier detection
focuses on point-based data, which is not suitable for many application domains.
1.4 Contributions
The contributions made in this thesis include the following:
• A quantitative analysis of the extent to which precipitation extremes in one region are correlated
  with those in neighbouring regions;
• An analysis of the relationship between the spatial autocorrelation of a region's precipitation
  extremes and the ENSO phenomenon using bootstrapping;
• A discussion and refinement of the definition of a spatio-temporal outlier; and,
• The development of a spatio-temporal outlier detection algorithm for grid-based data, using
  the spatial scan statistic, which is able to find outliers even where missing values are present.
Earlier work has investigated the nature of precipitation extremes using Extreme Value Theory (EVT)
(Khan et al., 2007). In their study, they fit each grid to the Generalised Pareto (GP) distribution, and
obtain parameters. They do this for a number of time periods. They then plot these onto a map and
visually interpret the results in a qualitative fashion for each time period. The problem with this approach
is that the interpretation of these results may be subjective, and therefore not as reliable as a quantitative
analysis. Our work extends the work performed by Khan et al. We provide a quantitative analysis to
describe how the distribution of precipitation extremes in one region correlates to those of neighbouring
regions. To the best of our knowledge, our research is the first to provide a quantitative measure of the
distribution of precipitation extremes between neighbouring regions.
Our work also considers the relationship between the spatial autocorrelation of regions' precipitation
extremes and the ENSO phenomenon. We investigate how the strength of an El Niño event correlates
with the inter-regional behaviour of precipitation extremes. This is done by taking the mean of the local
indicator of spatial autocorrelation of each GP parameter. These are then plotted on a graph for each El
Niño period. The mean of the SOI is also plotted on a graph. From this we may or may not be able to
visually identify a pattern, so in order to determine whether or not there is indeed a relationship, we use
bootstrap analysis. The identification of the effect that the strength of an El Niño event has on the local
spatial autocorrelation of precipitation extremes is an additional contribution of our work.
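To make the idea of bootstrap analysis concrete, the following Python sketch estimates a percentile
bootstrap confidence interval for the correlation between two short series, standing in for the mean local
Moran I values and the mean SOI values per El Niño period. The data values, the use of Pearson correlation
and the percentile interval are all assumptions made for illustration; they are not the figures or the exact
procedure used in this thesis.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation, or None if either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation of paired data."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        # Resample the (x, y) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Hypothetical values, one per El Nino period (not real results).
mean_moran = [0.41, 0.35, 0.52, 0.30, 0.47, 0.38]
mean_soi = [-1.2, -0.8, -1.9, -0.5, -1.5, -0.9]

point_estimate = pearson(mean_moran, mean_soi)
lower, upper = bootstrap_ci(mean_moran, mean_soi)
```

If the resulting interval excludes zero, the resampled data are consistent with a relationship between the
two series; with so few periods, such an interval should be read with caution.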
Lastly, we have contributed to spatio-temporal outlier detection through the creation of
our own algorithm. Although there has been previous work performed on the discovery of these types
of patterns, most earlier techniques focus on point-based spatio-temporal outlier discovery. In addition,
previous definitions of spatio-temporal outliers are limited, which has led us to provide an extended
and more comprehensive definition of a spatio-temporal outlier. In the process we have discovered that
the earlier definitions may exclude the discovery of many possible valid spatio-temporal outliers. Due
to this lack of grid-based algorithms for spatio-temporal outlier detection, and the limits imposed on
spatio-temporal outlier detection by previous definitions, we have developed Outstretch, which extends
the spatial scan statistic for use in the spatio-temporal domain. Our algorithm is able to discover
spatio-temporal outliers in grid-based data, while not excluding outliers that may have been excluded during
pattern discovery under previous definitions.
To the best of our knowledge, our Outstretch algorithm is the first grid-based spatio-temporal outlier
detection algorithm. The applicability of our algorithm to virtually any grid-based spatio-temporal
dataset, and therefore its versatility, makes it a valuable contribution to the field of data mining.
1.5 Thesis Structure
Following this introduction, we first provide the necessary background information that is required in
order to understand the work performed in this thesis. The background begins by providing an overview
of previous spatial, temporal and spatio-temporal data mining research. Following this is information
about hydrology including precipitation and ENSO, and a description of the statistical methods that were
utilised for our work.
Chapter 3 describes the setup and results of the work performed in (Wu and Chawla, 2007), relating to
the pre-processing of the data and the statistical analysis that was performed on it.
Chapter 4 details the spatio-temporal outlier detection algorithm, Outstretch.
Chapter 5 summarises the findings of our research and outlines potential future work that could
follow on from this work.
CHAPTER 2
Background
2.1 Spatial Data and Spatial GISs
Spatial data has been recorded in databases for many years. Several types of spatial data exist, but at a
minimum, spatial data must consist of location information. An example of spatial data in a traditional
database is the address of a client or vendor. Despite the long existence of such data, its use and value have
previously been relatively limited. For example, it may have been used to send out invoices, payments
or marketing documents. While these are important tasks, they are only the tip of the iceberg in terms of
the potential uses of such data. Only recently has it come to be recognised as a potential value-creating
asset, which has been highlighted by the growing adoption of Geographic Information Systems (GISs)
to store spatial data.
The adoption and storage of data into GISs has allowed users to find answers to spatially-related queries.
One such query might be, "What are the names of all customers who live within 10km of my store?".
Another example may be a query from a mobile device user who may ask "Where is the nearest
Automatic Teller Machine?". The latter example is already being performed through the use of a GIS by
Mastercard.
While the advances in GISs are significant and important, realising the full value of spatial data is not
yet complete. The growth in the use of such systems has led to an increase in the storage of spatial data.
As with traditional databases, the decrease in storage costs and ease of data collection has meant that
discovering useful and interesting spatial information from the raw data is a difficult but important task.
Most GISs, without the help of spatial data mining techniques, are only able to perform limited
analysis to find interesting and previously unknown information. One example of such information is "What
services are frequently requested together from a mobile device?". Spatial data mining, as described in
the following sections, is able to provide a solution to this problem.
2.2 Spatial Data Mining
Spatial data mining is the process of discovering interesting, previously unknown, but potentially useful
information from spatial data (Shekhar and Chawla, 1993; Shekhar et al., 2003). Spatial data is made
up of spatial objects, which can be considered to be the spatial equivalent of instances in a transactional
dataset. Spatial objects can take several forms including points, lines, areas and grids (Goodchild, 1986).
Points are often used to represent the location of physical objects such as people, animals, plants, a
shopping centre or fuel stations. Lines may or may not be connected, and can be used to represent roads
or train tracks. Areas are zones, and do not necessarily need to be uniformly sized or shaped. Areas
can be suburbs, districts, forests or areas of homogeneous soil. Grids, or lattices, are artificial square
uniform subdivisions of the spatial dataset. Grid data is often used in the analysis of climatic data, and
is the data format for the precipitation data that is used in our study.
Due to the greater complexity of spatial data compared to traditional data, previous methods for data
mining are generally not suitable for application directly on spatial data. There are several reasons for
this, identified by Shekhar and Vatsavai (2002) and Shekhar et al. (2003).
First, spatial data is generally distributed over a continuous space, as opposed to traditional datasets that
are usually discrete. Second, the patterns found in traditional data tend to be global in nature, whereas
spatial patterns tend to be local. Third, spatial regions tend to display a higher level of dependence
between them, while traditional data has a higher degree of independence from one sample to another.
Although it is possible to apply some traditional data mining techniques to spatial data, ignoring these
features of spatial data can often lead to poor performance in terms of both computational efficiency and
pattern discovery (Shekhar and Vatsavai, 2002). As a result, new data mining techniques to discover
information from spatial data have been developed.
2.2.1 Spatial Data Mining Techniques
Shekhar et al. (2003) identify four important spatial data mining techniques, which include predictive
modelling, spatial co-location rule discovery, spatial clustering and spatial outlier detection. A brief
description of these is provided in the following subsections. However, since spatial data mining is still
growing, it is important to note that other spatial data mining techniques also exist.
2.2.1.1 Predictive Modelling
Predictive models aim to predict events that are likely to occur at a particular geographic location
(Shekhar et al., 2003). For example, predictive models can be built to attempt to predict the location of
birds' nests in wetland regions.
2.2.1.2 Spatial Co-location Rule Discovery
The task of spatial co-location rule discovery involves finding subsets of objects whose boolean feature
sets are often in close geographic proximity to one another (Shekhar et al., 2003). A boolean spatial
feature refers to any spatial feature that is either present or not present in a particular location. Another
definition describes spatial co-location patterns as those that represent relationships among events
happening in different and possibly nearby locations (Shekhar and Huang, 2001).
A number of methods have been developed to mine co-location rules. Some have been developed to
mine spatial co-location rules without a support threshold (Huang et al., 2003), using confidence instead
of support measures to prune candidates. This method ensures that all high confidence co-location rules
are discovered, even when one or both items are of low support.
There have also been improvements in speed of rule discovery through the development of join-based
(Shekhar and Huang, 2001), partial-join (Yoo and Shekhar, 2004) and joinless (Yoo and Shekhar, 2006)
approaches.
The join-based approach, introduced by Shekhar and Huang (2001), uses an Apriori-like algorithm that
performs joins to discover event-centric co-location patterns. Event-centric co-location rules are those
that discover all association rules in a given region. That is, if event A occurs in region R, then we
want to find the likelihood of an event B occurring in region R. The join operation combines all
prevalent (k-1)-itemsets to generate k-length co-location candidates, while checking if they are in the
same neighbourhood. For example, given the dataset in Figure 2.1, we combine the length 2 frequent sets
{A, B} and {B, C} in their spatial neighbourhoods using a join to correctly produce length 3 candidates
of {A, B, C} with the sets {A.3, B.1, C.3} and {A.4, B.3, C.5}.
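The instance-level join described above can be sketched in Python as follows. This is a simplified
illustration, not the algorithm of Shekhar and Huang (2001): the coordinates are hypothetical (they are not
the instances of Figure 2.1), the neighbourhood is assumed to be a simple distance threshold, and the
function names are our own.

```python
from math import dist

# Hypothetical instance coordinates, keyed by (feature, instance id).
points = {
    ("A", 1): (0.0, 0.0),
    ("B", 1): (1.0, 0.0),
    ("C", 1): (0.5, 0.8),
    ("A", 2): (5.0, 5.0),
    ("B", 2): (5.5, 5.0),
    ("C", 2): (9.0, 9.0),
}
RADIUS = 1.5  # assumed neighbourhood distance threshold

def neighbours(p, q):
    """Two instances are neighbours if they lie within RADIUS of each other."""
    return dist(points[p], points[q]) <= RADIUS

def pair_instances(f, g):
    """All neighbouring instance pairs of features f and g (length 2 sets)."""
    fs = [p for p in points if p[0] == f]
    gs = [p for p in points if p[0] == g]
    return [(p, q) for p in fs for q in gs if neighbours(p, q)]

def join(ab, bc):
    """Join pairs that share the same middle instance, keeping a triple
    only if its two end instances are also neighbours."""
    return [(a, b, c) for (a, b) in ab for (b2, c) in bc
            if b == b2 and neighbours(a, c)]

ab = pair_instances("A", "B")
bc = pair_instances("B", "C")
abc = join(ab, bc)
```

Here the pair (A.2, B.2) does not extend to a length 3 instance because no C instance lies in its
neighbourhood, so only (A.1, B.1, C.1) is produced.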
The partial-join approach provided improvements to efficiency over the join-based approach by reducing
the number of instance joins (Yoo and Shekhar, 2004). This was done by transactionising the spatial
dataset based on their neighbour relationships, and then tracing only those neighbourhood relationships
that were cut apart by the transactionising process.
FIGURE 2.1: Example Co-location Dataset
Following this, a further improvement in the form of a joinless algorithm was made to completely
eliminate the need for spatial joins, which are a time-consuming operation (Yoo and Shekhar, 2006).
Candidates are generated using star instances, which consist of each object and its neighbourhood pairs,
and a participation index to check for prevalence of the patterns. The spatial join is avoided by
using the participation index. The candidates are then pruned where they do not form co-locations, and
when complete, co-location rules are generated.
2.2.1.3 Spatial Clustering
Grouping spatial objects whose features are similar to one another is known as spatial clustering (Shekhar
et al., 2003). In such clusters, the objects in the cluster should be similar to each other while being dis-
similar to objects in other clusters. An example of the application of spatial clustering could include
discovering regions where there are unusual levels of crime activities.
2.2.1.4 Spatial Outlier Detection
Outliers are inconsistencies that occur in a dataset (Shekhar et al., 2003). Outlier detection aims to find
those objects that are different from the remainder of the dataset. However, spatial outlier detection is
different to traditional outlier detection in relational databases, because the outlier does not have to be
different from the remainder of the dataset, but from its neighbouring values only.
A spatial outlier is defined as an object whose non-spatial attributes are different from other objects in
its surrounding neighbourhood. This means that it is possible for the values of a particular outlier to be
similar to the remaining dataset, while still being classified as a spatial outlier due to the fact that it is
significantly different from its neighbours. An example of this is illustrated in Figure 2.2 where there are
an even number of red and blue marbles laid out on a table such that the blue marbles are on the left and
the red marbles are on the right. If a red marble is then placed on the left amongst the blue marbles, that
red marble would be considered to be an outlier because it is different from its neighbours, even though
the existence of a red marble in the dataset is not unusual.
FIGURE 2.2: Example Spatial Outlier
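The marble example can be expressed as a small Python sketch. It assumes a grid layout, encodes blue as 0
and red as 1, and flags a cell when its value differs from the mean of its immediate (up to eight)
neighbours by more than a chosen threshold. The grid size, the neighbourhood and the threshold are
assumptions made for this illustration.

```python
def neighbour_mean(grid, r, c):
    """Mean value of the cells adjacent to (r, c), excluding the cell itself."""
    rows, cols = len(grid), len(grid[0])
    vals = [grid[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))
            if (i, j) != (r, c)]
    return sum(vals) / len(vals)

def spatial_outliers(grid, threshold):
    """Cells whose value differs from their neighbourhood mean by more
    than `threshold`."""
    return [(r, c)
            for r in range(len(grid))
            for c in range(len(grid[0]))
            if abs(grid[r][c] - neighbour_mean(grid, r, c)) > threshold]

# Blue marbles (0) on the left, red marbles (1) on the right.
grid = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
grid[1][1] = 1  # a red marble placed amongst the blue ones

outliers = spatial_outliers(grid, threshold=0.9)
```

Red marbles make up roughly half of this grid, so the value 1 is not globally unusual; the cell at row 1,
column 1 is flagged only because all of its neighbours are blue.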
Examples of the application of spatial outlier detection to real problems are detailed by Zhao et al.
(2003). They detect region outliers using a wavelet-based approach on meteorological datasets.
2.3 Temporal Data
Temporal data is data concerning time-based events and, like spatial data, has existed in databases since
early on in their adoption. Some examples include a customer's birth date, or a date on which a customer
purchased some inventory. In hydrology, temporal data could include the date and amount of precipita-
tion at that date. Despite the existence of such data, most data mining techniques treat temporal data as
an unordered collection of events (Antunes and Oliveira, 2001).
Temporal data can be arranged in a number of ways. A time series is one arrangement, where data is
characterised by continuous, real-valued elements (Antunes and Oliveira, 2001).
2.4 Temporal Data Mining
The analysis of temporal data in the form of ordered events (sequential data) is known as temporal
data mining. Temporal data mining tasks can be divided into two main sub-divisions. The first is the
discovery of causal relationships between temporal events. The second is the discovery of patterns in
the same or in more than one time series (Roddick and Spiliopoulou, 1999).
A more formal definition of temporal data mining is provided by Lin et al. (2002). They define temporal
data mining as a single step in the process of Knowledge Discovery in Temporal Databases that
enumerates structures (temporal patterns or models) over the temporal data, and any algorithm that enumerates
temporal patterns from, or fits models to, temporal data is a Temporal Data Mining Algorithm.
The discovery of associations in temporal data usually refers to temporal association rule mining tasks. A
temporal association rule is one where two or more events occur together during a specified time interval
(Li et al., 2001a). An example provided by Li et al. (2001a) is that eggs and coffee are frequently sold
together in the morning. Another example provided by Lee et al. (2004) is that on February 14th each
year, flowers and chocolates are frequently bought together.
2.4.1 Temporal Data Mining Techniques
There are a number of techniques that can be applied to temporal data according to Lin et al. (2002).
They include:
• Temporal data characterisation and comparison;
• Temporal clustering analysis;
• Temporal classification;
• Temporal association rules;
• Temporal pattern analysis; and,
• Temporal prediction and trend analysis.
2.4.2 Temporal Data Characterisation and Comparison
Temporal data characterisation can include the discovery of outliers or interesting patterns in temporal
data.
For example, Martin and Yohai (2001) focus on the detection of unusual movements and sudden changes
in temporal data, to include patterns such as isolated outliers, level shifts, shifts in variability, slope
changes and changes in frequency. In particular they focus on the characterisation of isolated outliers
which are single values that are uncharacteristic for that particular point in time in the sequence, and
level shifts which are an elevated or depressed change in the values that persists for an extended period
of time, unlike outlier values that are usually only present momentarily.
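The distinction between an isolated outlier and a level shift can be illustrated with a simple Python sketch
that compares the median of a few values before a point with the median of a few values after it. This is
only an illustration of the two concepts and is not the method of Martin and Yohai (2001); the window size
k, the tolerance tol and the function name are assumptions.

```python
from statistics import median

def classify_point(series, i, k=3, tol=2.0):
    """Classify position i using the k values on either side of it.
    Assumes k <= i and i + k < len(series)."""
    before = median(series[i - k:i])
    after = median(series[i + 1:i + 1 + k])
    if abs(after - before) > tol:
        # The level after the point differs persistently from the level before.
        return "level shift"
    if abs(series[i] - before) > tol:
        # The level is unchanged, but this single value stands apart.
        return "isolated outlier"
    return "normal"

# Hypothetical series: a momentary spike and a sustained change in level.
spike = [10, 10, 11, 10, 25, 10, 11, 10]
shift = [10, 10, 11, 10, 20, 21, 20, 20]

spike_label = classify_point(spike, 4)
shift_label = classify_point(shift, 4)
```

In the first series the value 25 is present only momentarily, whereas in the second series the elevated
values persist after position 4.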
These techniques can be applied to tobacco sales data for example, to determine whether an increase in
spending on tobacco is momentary or whether it will be sustained.
2.4.2.1 Temporal Clustering Analysis
Clustering temporal data can be performed on data that has been broken into subsequences as noted by
Oates (1999). For example, if we look at the records on the black boxes of failed aircraft, we would
notice that many events are the same, such as takeoff, but some would be different. By clustering time
series we could identify time series that failed aircraft have in common with each other but not with
functional aircraft.
2.4.2.2 Temporal Classification
Temporal classification involves predicting the class of a temporal sequence (Laxman and Sastry, 2006).
This is done by looking at previously classified temporal data, building a prediction model based on that
data, and then using it to predict the category of unclassified sequences in the data. There are many
applications for temporal classification such as speech recognition and gene sequence classification.
2.4.2.3 Temporal Association Rules
Association rules are used to describe relationships between items in transaction data. For example,
if bread and jam are frequently purchased together, we can determine the probability that either item
implies the purchase of the other item. The rules bread → jam and jam → bread are known as association
rules. Association rules can be extended to the temporal domain by considering that certain patterns
are more prevalent during certain times of the year. Li et al. (2001a) define a temporal association rule as
one where two or more events occur together during a specified time interval. For example, turkey
and pumpkin pie may rarely be purchased together, except during the week prior to the Thanksgiving
holiday in the United States (Li et al., 2001b). Such association rules, which include temporal intervals,
are known as Temporal Association Rules (Li et al., 2001b).
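The effect of restricting a rule to a time interval can be sketched in Python by computing the confidence of
turkey → pumpkin pie over all transactions and then over a single week. The transactions, dates and function
names below are hypothetical and serve only to illustrate the idea.

```python
from datetime import date

# Hypothetical dated transactions (not real sales data).
transactions = [
    (date(2007, 6, 3), {"turkey", "bread"}),
    (date(2007, 6, 10), {"pumpkin pie"}),
    (date(2007, 8, 1), {"turkey"}),
    (date(2007, 11, 17), {"turkey", "pumpkin pie"}),
    (date(2007, 11, 19), {"turkey", "pumpkin pie", "bread"}),
    (date(2007, 11, 20), {"turkey"}),
]

def confidence(rows, lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    with_lhs = [items for _, items in rows if lhs <= items]
    if not with_lhs:
        return 0.0
    return sum(1 for items in with_lhs if rhs <= items) / len(with_lhs)

def in_interval(rows, start, end):
    """Transactions dated within the inclusive interval [start, end]."""
    return [(d, items) for d, items in rows if start <= d <= end]

overall = confidence(transactions, {"turkey"}, {"pumpkin pie"})
holiday_week = confidence(
    in_interval(transactions, date(2007, 11, 15), date(2007, 11, 21)),
    {"turkey"}, {"pumpkin pie"})
```

With this toy data the rule holds for two of five turkey purchases overall, but for two of three within the
chosen week, which is the kind of difference a temporal association rule is intended to capture.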
2.4.3 Temporal Pattern Analysis
Temporal pattern analysis includes the application of temporal pattern matching, as described in (Falout-
sos et al., 1994). Some examples of such analysis include matching two companies whose stock prices
move in a similar fashion, or finding periods with similar weather patterns. Keogh and Smyth (1997)
also applied pattern matching techniques on space shuttle mission data. A query is supplied as a de-
scription of a sequence of peaks, troughs or other features, and the pattern is then matched to a similar
pattern in the dataset.
2.4.4 Temporal Prediction and Trend Analysis
Temporal prediction aims to forecast unknown future values of a time series based on known historical
values (Laxman and Sastry, 2006). Although temporal prediction and classification both use historical
data to perform their tasks, they are different. Temporal classification uses pre-classified sequence data
to build a classification model, and place new unclassified sequence data into a category. On the other
hand, temporal prediction uses historical data values to try to predict future data values.
Some examples of the applications for temporal prediction are outlined by Yu et al. (2001), such as
weather forecasting, prediction of future stock values, managing real and virtual traffic, predicting sales
or completing a composer's unfinished work.
2.5 Spatio-Temporal Data
Spatio-temporal data is spatial data that has been collected over several time snapshots. Snapshots are
often recorded in point or grid format.
Points are instances of a spatial object or feature, such as an animal or hospital, for example. Points
that may persist and move from one time period to the next are often referred to as moving points.
Some examples of moving point data include the trajectories of animals, the location of hurricanes, or
the locations of crimes over time. Points can also be grouped as clusters, and these can also be tracked
over time.
Grid data on the other hand usually consists of grids that contain summary data of the region that is
encompassed by the grid. Some examples of grid data that may be collected over several time snapshots
include mean precipitation or crime statistics for each grid location. A moving region is a set of grids
that have some common non-spatial characteristics and are located in neighbouring spatial regions over
one or more time periods. For example, for a region that is an outlier in one time period, a region in the
same spatio-temporal neighbourhood may exist in the following time period. This is an example of a
spatio-temporal outlier, and is elaborated upon in Section 4.1. A visual example of a moving region is
also shown in Figure 2.3(b).
Both types of data contain spatio-temporal objects. A spatio-temporal object, identified by its o_id, has
been defined by Theodoridis et al. (1999) as a time-evolving spatial object whose evolution or history
is represented by a set of instances (o_id, s_i, t_i), where the spacestamp s_i is the location of object o_id
at timestamp t_i. According to this definition, a two-dimensional point or region is represented by a line
or solid respectively in three-dimensional space. These are shown in Figure 2.3.
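The instance representation (o_id, s_i, t_i) maps naturally onto a small data structure. The following
Python sketch is one possible encoding for a moving point; the class name, the field names and the use of
two-dimensional coordinates for the spacestamp are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple

class Instance(NamedTuple):
    """One instance (o_id, s_i, t_i) of a spatio-temporal object."""
    o_id: str
    s: Tuple[float, float]  # spacestamp: location of the object
    t: int                  # timestamp

def history(instances: List[Instance], o_id: str) -> List[Instance]:
    """The instances of one object, ordered by timestamp."""
    return sorted((i for i in instances if i.o_id == o_id),
                  key=lambda i: i.t)

# Hypothetical records for two moving points, stored out of order.
records = [
    Instance("hurricane-1", (12.0, 40.5), 2),
    Instance("animal-7", (3.0, 3.0), 1),
    Instance("hurricane-1", (10.0, 40.0), 1),
    Instance("hurricane-1", (14.5, 41.0), 3),
]

track = [i.s for i in history(records, "hurricane-1")]
```

Reading the spacestamps in timestamp order gives the line traced by the moving point through
three-dimensional (x, y, t) space.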
A spatio-temporal database may consist of many spatio-temporal objects. It is the goal of spatio-
temporal data mining to discover the behaviour of and interaction between these objects.
2.6 Spatio-Temporal Data Mining
Spatio-temporal data mining is the automated discovery of interesting spatial patterns in data over time
using data mining techniques on spatially and temporally distributed data (Gudmundsson et al., 2004).
The same reasons that prevent the application of classical data mining techniques to spatial or temporal
data apply to their application to spatio-temporal data as well. Spatio-temporal data displays
dependence over both space and time. For example, whether or not it rained yesterday in London will affect
whether or not it will rain today in London. In addition, the high dimensionality of spatio-temporal data
means that classical methods may not be suited to such data and may be highly inefficient. For these
reasons, specific techniques to mine spatio-temporal data need to be developed.
Techniques used to mine patterns from spatio-temporal data are often spatial techniques that have been
extended to incorporate temporal data. In many cases this is because spatio-temporal data consists of
spatial snapshots over time. That is, for each time period, we would have the same location map with
values recorded for each time period.
(a) A moving point
(b) A moving region
FIGURE 2.3: Types of spatio-temporal objects
Point-based or grid-based spatial techniques may be modified to include the additional dimension of
time. For example, in spatial data mining we could find spatial co-locations of objects that frequently
occur together. This is extended to the spatio-temporal domain to find spatio-temporal co-location episodes
which show the relationships between the same and different types of spatial co-locations over time.
2.6.1 Spatio-Temporal Data Mining Techniques
Although spatio-temporal data mining is a relatively new field, several tasks for spatio-temporal data
mining have already been identified and are continuing to develop. These include the discovery of:
• Spatio-Temporal Co-location Episodes;
• Spatio-Temporal Association Rules;
• Spatio-Temporal Episodes;
• Spatio-Temporal Co-Occurrence Patterns (including Moving Clusters and Flock Patterns);
• Periodic Patterns; and,
• Spatio-Temporal Outliers.
2.6.2 Spatio-Temporal Co-location Episodes
A spatial co-location is made up of a subset of spatial features that are frequently located together in
spatial neighbourhoods. A co-location episode is an extension of a spatial co-location, and is defined as
a sequence of spatio-temporal co-location events consisting of a set of objects moving close to each
other for some time period (Cao et al., 2006). For example, if we notice a puma moving close to a
deer, we can expect that a vulture will move close to the deer soon after with high probability. This is
illustrated in Figure 2.4, where t represents a particular point in time.
FIGURE 2.4: A co-location episode
Co-location episodes usually have a centric feature, such as the deer in the above example. To be a valid
co-location episode instance, the deer must also be the same deer in each snapshot.
2.6.3 Spatio-Temporal Association Rule Mining
Spatio-Temporal Association Rule (STAR) mining can be considered an extension of the classical asso-
ciation rule mining task, where items frequently occur together. In the case of STARs, the relationship
can be dened in several ways, such that there are a number of different types of STARs that can be
mined.
One definition of a Spatio-Temporal Association Rule (STAR) describes it as the manner in which
objects move between regions over time (Verhein, 2006). For example, if an object is in region A in
time period x it is likely to move to region B in time period y.
Lee and Chan (2006) mine association rules in the spatial domain that persist over several time periods
in a calendar map, by extending the well known Apriori algorithm and developing their own Apriori-
like technique. They define calendar-map patterns that specify the location and time for an event to take
place. For example, the pattern (USA, Florida, *, 1, 10), of the form (country, province, week, day, hour),
specifies the 10th hour of the first day (i.e., Monday) of every week in Florida, USA. If an association rule
is prevalent enough in the particular calendar-map pattern then it can be considered a spatio-temporal
association rule.
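A calendar-map pattern with a wildcard can be matched against individual events with a few lines of Python.
The sketch below is only an illustration of the pattern form given above; the tuple encoding of events and
the function name are assumptions and are not taken from Lee and Chan (2006).

```python
WILDCARD = "*"

def matches(pattern, event):
    """True if every field of the event equals the corresponding pattern
    field, where a wildcard field in the pattern matches any value."""
    return len(pattern) == len(event) and all(
        p == WILDCARD or p == e for p, e in zip(pattern, event))

# Pattern of the form (country, province, week, day, hour).
pattern = ("USA", "Florida", WILDCARD, 1, 10)

# Hypothetical events in the same form.
monday_week_32 = ("USA", "Florida", 32, 1, 10)
tuesday_week_32 = ("USA", "Florida", 32, 2, 10)

monday_matches = matches(pattern, monday_week_32)
tuesday_matches = matches(pattern, tuesday_week_32)
```

Counting how many of the transactions that match a pattern also satisfy a given association rule would then
indicate whether the rule is prevalent enough within that pattern.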
Gyozo and Pederson (2005) outline a spatio-temporal association rule mining technique based on piv-
oting, which reduces the spatio-temporal data task into a market basket type of analysis before applying
classical Apriori-like data mining techniques to the data.
Another Apriori-based spatio-temporal association rule mining algorithm has also been developed by
Kechadi and Bertolotto (2006), which consists of a localiser and a miner. The localiser focuses on the
data attributes including the spatial and temporal dimensions, and then the miner processes the data
using the relationships that are provided by the localiser.
2.6.4 Spatio-Temporal Episodes
Spatio-temporal episodes are Spatial Object Association Patterns (SOAPs) such as a star or clique that
continues over some time period (Yang et al., 2005; Cao et al., 2006). They are different to Spatio-
Temporal Association Rules (STARs) since they deal with groups of objects rather than the association
between single objects.
According to Yang and Parthasarathy (2006), spatio-temporal episodes go through three stages known
as formation, dissipation and continuation. Formation is when the number of patterns changes from
zero to non-zero. Dissipation is when all the patterns become invalid. Continuation is when a pattern
continues to exist in both time period t and time period t + 1.
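These three stages can be expressed as simple checks on the sets of valid patterns in two consecutive time
periods. The following Python sketch is a direct reading of the descriptions above; representing patterns
as identifiers in a set, and the function name, are assumptions made for illustration.

```python
def stage_events(prev_patterns, curr_patterns):
    """Stages observed between time period t (prev) and t + 1 (curr)."""
    events = []
    if not prev_patterns and curr_patterns:
        events.append("formation")     # zero patterns to non-zero
    if prev_patterns and not curr_patterns:
        events.append("dissipation")   # all patterns became invalid
    if prev_patterns & curr_patterns:
        events.append("continuation")  # a pattern exists in both periods
    return events

# Hypothetical pattern identifiers over four time periods.
periods = [set(), {"p1", "p2"}, {"p2"}, set()]
timeline = [stage_events(a, b) for a, b in zip(periods, periods[1:])]
```

For this toy sequence the episode forms in the second period, continues through the third because p2 is
still valid, and dissipates in the fourth.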
Some of the possible SOAP patterns identified by Yang and Parthasarathy (2006) are shown in
Figure 2.5. The edges (lines) indicate a relationship between objects. The relationships are distance-based,
topological and directional. A star pattern (Figure 2.5(a)) consists of a center object that has a
relationship with all the other objects in the same SOAP. A clique pattern (Figure 2.5(b)) is where all the objects
have a relationship with all the other objects in the clique. In a sequence (Figure 2.5(c)), only items
that follow the right direction have relationships with one another. In a minLink SOAP a parameter is
specified. For example, in Figure 2.5(d) the parameter is specified as minLink=2, meaning that each
object must have a relationship with at least two other objects in the SOAP.
FIGURE 2.5: Types of Spatial Object Association Patterns: (a) a star, (b) a clique, (c) a sequence, (d) minLink = 2
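The star, clique and minLink conditions can be expressed as simple checks on an undirected relationship graph. The sketch below is illustrative only: it assumes relationships are given as pairs of object identifiers, the function names are hypothetical, and the directional sequence pattern is omitted because it requires directed relationships.

```python
def neighbours(edges):
    """Build an undirected adjacency map from (a, b) relationship pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_star(objects, adj):
    """True if some centre object is related to every other object."""
    objs = set(objects)
    return any(objs - {c} <= adj.get(c, set()) for c in objs)


def is_clique(objects, adj):
    """True if every object is related to every other object."""
    objs = set(objects)
    return all(objs - {o} <= adj.get(o, set()) for o in objs)


def satisfies_min_link(objects, adj, min_link):
    """True if each object is related to at least min_link other objects."""
    objs = set(objects)
    return all(len(adj.get(o, set()) & objs) >= min_link for o in objs)
```

With the relationships `a-b`, `b-c`, `c-d` and `d-a`, the four objects satisfy minLink=2 but form neither a star nor a clique.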
2.6.5 Spatio-Temporal Co-Occurrence Patterns
Co-occurrence patterns include moving clusters and flock patterns (Celik et al., 2006). According to
Celik et al., moving clusters are mixed groups of moving objects, while flock patterns are uniform groups
of moving objects. There also exist variations of flock patterns, which include leadership, convergence
and encounter patterns as described by Gudmundsson et al. (2004).
2.6.5.1 Moving Clusters
A moving cluster has been defined by Kalnis et al. (2005) as a set of objects that move close to each
other for a long time interval. Some examples provided by Kalnis et al. include a group of animals
that are migrating, or a convoy of cars moving through a city. These patterns are mined from a database
of object trajectories. In such patterns the identity of the moving cluster remains the same despite the
content changing. For example, some animals may leave the group while new animals may enter it.
A moving cluster can be viewed as a sequence of clusters that appear in consecutive time snapshots, such
that the snapshot of each cluster shares a large portion of common objects. An example of a moving
cluster is shown in Figure 2.6.
FIGURE 2.6: A moving cluster over three snapshots
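The requirement that consecutive snapshot clusters share a large portion of common objects can be sketched as an overlap test. This is an illustrative sketch only: it assumes the overlap is measured with the Jaccard ratio and compared against a threshold `theta`, and both the function names and the default threshold are assumptions made for this example.

```python
def jaccard(a, b):
    """Ratio of shared objects to all objects in two clusters."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_moving_cluster(snapshot_clusters, theta=0.5):
    """Check that each pair of consecutive snapshot clusters overlaps enough.

    snapshot_clusters: list of sets of object identifiers, one cluster per
    consecutive time snapshot. theta is an assumed overlap threshold.
    """
    pairs = zip(snapshot_clusters, snapshot_clusters[1:])
    return all(jaccard(c1, c2) >= theta for c1, c2 in pairs)
```

For the clusters `{1, 2, 3}`, `{2, 3, 4}` and `{3, 4, 5}`, each consecutive pair shares two of four objects, so the sequence qualifies at `theta=0.5` even though the membership changes at every step.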
2.6.5.2 Flock Patterns
Celik et al. (2006) define a flock as a moving group of the same kind of object, such as a sheep flock
or a bird flock. Gudmundsson et al. (2004) describe a flock pattern as entities moving in the same
direction while being close to each other. Celik et al. (2006) differentiate flock patterns from moving
clusters by describing flock patterns as uniform groups of moving objects, and moving clusters as mixed
groups of moving objects.
Variations of flock patterns include leadership, convergence and encounter patterns (Gudmundsson et
al., 2004). A leadership pattern is where a single object is already moving in a specified direction for
a period of time prior to the occurrence of a flock pattern which it is involved in. Convergence occurs
where objects move towards the same location and where the direction of their movements does not
change. An encounter pattern is the same as a convergence pattern, except that the objects involved
meet at the location they are travelling towards at the same time.
2.6.6 Periodic Patterns
Another type of spatio-temporal pattern mined from a trajectory database is known as a periodic pattern.
Periodic patterns are spatial trajectories that are repeated over time (Cao et al., 2007; Mamoulis et al.,
2004). For example, Bob may follow the same route to work every day.
What makes this task particularly difficult, and therefore well suited to data mining, is that the locations and
times recorded are usually not exactly the same for each occurrence of a pattern, but may vary slightly between
occurrences. For example, Bob may take a detour to work on one day, or may wake up late on another
day. In addition, there is usually a time range during which the periodic pattern is frequent. For example,
if Bob is employed at another location, his original path will no longer be frequent. Therefore there is
also a need to discover how the patterns vary with space and time.
2.6.7 Spatio-Temporal Outlier Detection
Cheng and Li (2006) define a spatio-temporal outlier to be a spatial-temporal object whose thematic
attribute values are significantly different from those of other spatially and temporally referenced objects
in its spatial and/or temporal neighbourhoods. Birant and Kut (2006) define a spatio-temporal outlier
as an object whose non-spatial attribute value is significantly different from those of other objects in its
spatial and temporal neighborhood.
In the spatio-temporal domain, outlier detection is a particularly important task. This is because there
is a need to understand the behaviour between regions whose values are out of the ordinary. This is
highlighted by the growth in the use of spatial autocorrelation techniques, as described in Section 2.10,
which measure the similarity and dissimilarity in the behaviour of values between neighbouring regions.
Spatial autocorrelation techniques, however, are focused on finding general patterns, and are not oriented
towards understanding and predicting changes over time (Ren et al., 2003). Therefore, there is a need
for spatio-temporal data mining techniques to be developed for outlier detection. In addition, the use
of data mining techniques is important because outliers are often difficult to visually detect and
quantify in the spatial domain, and this is made even more difficult when spatial outliers and their values
need to be compared over time to discover patterns.
Two of the most popular spatio-temporal outlier detection techniques are distance and density-based
methods (Adam et al., 2004). Most of the spatio-temporal outlier detection algorithms use one of these
two methods to discover spatio-temporal outliers. In addition to these, Birant and Kut (2006) have
also identified distribution, clustering and depth-based outlier detection methods, but not all have been
extended to mine spatio-temporal outliers.
Distribution-based techniques use standard statistical distributions to identify outliers as those points
which deviate from the model. The main problem with these techniques is that for many datasets the
underlying distribution is unknown, and a large number of tests are required to determine which would
be most suitable, so that the testing process can be slow and costly.
Clustering methods detect outliers as a by-product of clustering. Clusters are not themselves outliers,
but groups of objects that are similar to those in the same cluster and dissimilar to those in
other clusters. Clustering algorithms that may detect outliers in this way include CLARANS (Ng and Han, 1994),
DBSCAN (Ester et al., 1996) and CURE (Guha et al., 1998).
Clustering methods have been extended to the spatio-temporal domain. One such method is a
spatio-temporal extension of DBSCAN, known as ST-DBSCAN, developed by Birant and Kut (2007).
However, due to the orientation of clustering algorithms toward the discovery of clusters, they are not
optimised for outlier detection and may not perform well (Birant and Kut, 2006; Breunig et al., 2000).
Depth-based methods use computational geometry to compute different layers of k-d convex hulls,
where outliers are more likely to be those with smaller depths. That is, the depth of each point is
usually calculated using methods from computational geometry, and each point is then assigned to a layer, where
shallower layers are expected to contain more outlying points. In one implementation by Johnson et al.
(1998), rather than performing the computationally expensive task of calculating a depth score for each
point in a two-dimensional (2-d) space and then assigning it to a layer, a subset of points is examined,
and from these points depth contours can be identified.
Distance-based outlier detection methods use a distance metric to determine the distances between objects.
An example of a distance-based spatio-temporal outlier detection method that uses the Jaccard
similarity coefficient to determine the distance between neighbourhoods was developed by Adam et al.
(2004). Their method first identifies all the outliers and then from these, finds those with the same
timestamp to discover spatio-temporal anomalies.
Density-based approaches assign a Local Outlier Factor (LOF) to each sample based on its local
neighbourhood density. The LOF measures the degree to which an object is an outlier, when compared
to the density of its local neighbourhood (Angiulli et al., 2005; Breunig et al., 2000). A
density-based outlier detection method that uses an LOF has been proposed by Breunig et
al. (2000), and operates on two-dimensional point data. Another popular density-based method is the
spatial scan statistic (Kulldorff, 1997), which has the additional advantage of being able to find outliers
in both point and grid data. This method is described in more detail in Chapter 4, as it forms the basis
of our spatio-temporal outlier detection method.
Birant and Kut (2006) have also developed a spatio-temporal outlier detection algorithm. Their algorithm
uses aspects of clustering and density-based approaches, whereby any objects not located in a
cluster, or any clusters which are significantly different from other clusters, are considered to be spatial
outliers or S-Outliers. After detecting S-Outliers, they find Spatio-Temporal Outliers, or ST-Outliers. In
order for an outlier to be an ST-Outlier, the values in the S-Outlier's spatial neighbourhood in consecutive
temporal periods, such as the previous or following day, must be significantly different from the
S-Outlier. This is different to our definition of a spatio-temporal outlier, since in our work spatial outliers
may persist for one or more time periods, rather than a single time period as in their work. Our definition
of a spatio-temporal outlier is clarified in Section 4.1. In addition, Birant and Kut's spatial outlier detection
method performs clustering on point-based data to detect outliers, while the density-based spatial
scan statistic we utilise can also be applied to grid-based data.
2.7 Hydrology
Hydrology is the study of the movement (properties, distribution and circulation) of water on and below
the Earth's surface and in Earth's atmosphere.
Brutsaert (2005) describes Hydrology as the science that deals with those aspects of the cycling of water
in the natural environment that relate specifically with:
• The continental water processes, namely the physical and chemical processes along the various
  pathways of continental water (solid, liquid and vapour) at all scales, including those biological
  processes that influence this water cycle directly; and with
• The global water balance, namely the spatial and temporal features of the water transfers (solid,
  liquid and vapour) between all compartments of the global system, i.e. atmosphere, oceans and
  continents, in addition to stored water quantities and residence times in these compartments.
This study aims to place emphasis on the spatial and temporal features of precipitation over the Pacific
Ocean and South America.
2.7.1 Precipitation
Precipitation forms a fundamental part of the hydrologic cycle. Precipitation is defined by Hornberger et
al. (1998) as the deposition of liquid water droplets and ice particles that are formed in the atmosphere
and grow to a sufficient size so that they are returned to the Earth's surface by gravitational settling.
There are many different forms of precipitation, including drizzle, rain, snow, sleet, glaze (freezing rain),
snow pellets, small hail, soft hail, hail balls, dew and hoar frost (Brutsaert, 2005).
2.7.2 El Niño Southern Oscillation
Believed to have been first documented in 1892 by Carillo (Allen et al., 1996), the term El Niño in
Spanish refers to the baby Jesus or the Christ child. It is believed to have been coined by Peruvian
sailors who noticed a warmer current off the coastline during the Christmas period. At this time, the
phenomenon was thought to be of only local significance in Peru and Ecuador (Glantz, 2001).
Prompted by an address to the 6th International Geographical Congress in London in 1895, numerous
papers were then published on the topic of El Niño linking oceanic upwelling and flooding rains in
Peru to the phenomenon. In 1935, Schott was the first to describe El Niño's extended influence on the
movement of warm water south from the Galapagos Islands. Then in the International Geophysical
Year of 1957-58, observations of large-scale oceanic warming that reached across the Pacific Ocean
associated with El Niño were made.
Bjerknes was the first to link El Niño with the Southern Oscillation in papers published in 1966, 1969
and 1972, putting forward the idea that the irregular warmings were a result of large-scale,
ocean-atmosphere interactions across the Pacific basin.
Since the mid-1970s there has been growing interest in what is now known by the scientific community
as the El Niño Southern Oscillation (ENSO). It is described by D'Aleo and Grube (2002) as an
interannual, coupled oscillation in the atmosphere and ocean of the Tropical Pacific. Changes in the
atmosphere are an east-west see-saw of surface pressure and the related patterns of clouds, winds,
temperatures, and precipitation, while oceanic changes are an east-west flip-flop of the location and depth
of warm and cool pools of water. References to east and west refer to locations in the Pacific Ocean,
where east is the eastern Pacific Ocean near South America, and west is the western Pacific Ocean near
Tahiti and Australia.
There are a number of definitions of El Niño, as researchers have been unable to agree on a single
definition (Trenberth, 1997). This has led Glantz (2001) to attempt to identify some common aspects of
El Niño that are frequently mentioned in these definitions. These are that El Niño:
• Is an anomalous warming of surface water;
• Is a warm southward-flowing current off the coast of Peru;
• Involves sea surface temperature increases in the eastern and central Pacific;
• Appears off the coasts of Ecuador and northern Peru (sometimes Chile);
• Is linked to changes in pressure at sea level (the Southern Oscillation);
• Accompanies a slackening of westward-flowing equatorial trade winds;
• Recurs but not at regular intervals;
• Returns around Christmas time; and,
• Lasts between 12 and 18 months.
The Australian Bureau of Meteorology defines El Niño as a sustained warming over a large part of the
central and eastern tropical Pacific Ocean. Combined with this warming are changes in the atmosphere
that affect weather patterns across much of the Pacific Basin, including Australia.
Although ENSO is a natural phenomenon it is usually described in a negative light due to the extreme
nature of the weather patterns that it brings, and the consequent human costs. The most noticeable effects
are floods and droughts. During an ENSO episode, high precipitation around the east equatorial
Pacific Ocean often leads to floods, and low precipitation around the west equatorial Pacific Ocean often
causes drought.
Since these precipitation extremes are of the most concern, it is the aim of this study to analyse historical
precipitation data to find extreme patterns and learn more about the ENSO phenomenon.
The El Niño Southern Oscillation (ENSO) consists of two main phases, El Niño and La Niña. El Niño is
associated with heavier precipitation in many parts of South America due to the warming of the eastern
Pacific Ocean. At the same time, El Niño brings drought to the western Pacific Ocean near Australia.
An El Niño phase typically lasts between 1 and 2 years, and is often followed by a La Niña event, which
has the opposite effect of bringing low precipitation to South America (D'Aleo and Grube, 2002), and
higher precipitation to areas near the western Pacific including Australia. In this study we focus on the
relationship of high precipitation in South America with El Niño.
2.7.3 Southern Oscillation Index
Both atmospheric and oceanic changes can be used to evaluate the strength of an ENSO event. The three
most common methods are the Southern Oscillation Index (SOI) which uses atmospheric pressure, the
Sea Surface Temperature (SST) anomalies within oceanic data, and the Multivariate ENSO Index (MEI)
that provides an evaluation of both atmospheric and oceanic measures. Our study focuses on the SOI, as
it is one of the most widely used and longest recorded statistics.
The SOI measures the difference in Sea Level Pressure (SLP) between Tahiti and Darwin, relative to the
normal SLP. Tahiti and Darwin are identified on the map in Figure 2.7. Prolonged negative SOI values
have been associated with El Niño events (D'Aleo and Grube, 2002).
The SOI can be calculated in a number of different ways. The SOI data used in our experiments
comes from the National Oceanic & Atmospheric Administration (NOAA), who define the SOI as:
\[ \mathrm{SOI} = \frac{S_T - S_D}{\mathrm{MSD}} \tag{2.1} \]
where $S_T$ is the Standardised Tahitian SLP, $S_D$ is the Standardised Darwin SLP and MSD is the
Monthly Standard Deviation.
FIGURE 2.7: The location of Darwin and Tahiti in the South Pacific Ocean
The Standardised Tahitian SLP is calculated using the following:
\[ S_T = \frac{\mathrm{TA}}{\sigma(\mathrm{Tahiti})} \tag{2.2} \]
where TA stands for Tahiti Anomaly and is the actual(SLP) $-$ $\bar{x}$(SLP), with $\bar{x}$(SLP) representing
the mean SLP for Tahiti. The symbol $\sigma(\mathrm{Tahiti})$ is the Standard Deviation Tahiti, defined as:
\[ \sigma(\mathrm{Tahiti}) = \sqrt{\frac{\sum (\mathrm{TA})^2}{N}} \tag{2.3} \]
where N is the number of months.
The Standardised Darwin SLP is calculated in a similar way to the Tahitian SLP using:
\[ S_D = \frac{\mathrm{DA}}{\sigma(\mathrm{Darwin})} \tag{2.4} \]
where DA stands for Darwin Anomaly and is the actual(SLP) $-$ $\bar{x}$(SLP), with $\bar{x}$(SLP) representing
the mean SLP for Darwin. The symbol $\sigma(\mathrm{Darwin})$ is the Standard Deviation Darwin, defined as:
\[ \sigma(\mathrm{Darwin}) = \sqrt{\frac{\sum (\mathrm{DA})^2}{N}} \tag{2.5} \]
where N is the number of months.
The Monthly Standard Deviation (MSD) is calculated as:
\[ \mathrm{MSD} = \sqrt{\frac{\sum (S_T - S_D)^2}{N}} \tag{2.6} \]
with N the total number of summed months.
The SOI measures are provided on a daily basis by the NOAA. To compare the SOI with the average
local Moran I for each time period, we take the average of all the SOI measures for the same time period.
The result of this comparison is shown in Chapter 3.
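The calculation in Equations 2.1 to 2.6 can be sketched in a few lines. This is a simplified illustration rather than NOAA's procedure: it assumes a single mean and standard deviation per station computed over the whole input series, takes two equal-length lists of SLP readings, and uses hypothetical function names. It also assumes each series actually varies, since a constant series would give a zero standard deviation.

```python
import math


def standardise(slp):
    """Anomalies divided by their standard deviation (Eqs. 2.2 to 2.5)."""
    n = len(slp)
    mean = sum(slp) / n
    anomalies = [x - mean for x in slp]           # TA or DA
    sigma = math.sqrt(sum(a * a for a in anomalies) / n)
    return [a / sigma for a in anomalies]


def soi(tahiti_slp, darwin_slp):
    """SOI series from two equal-length SLP series (Eqs. 2.1 and 2.6)."""
    s_t = standardise(tahiti_slp)
    s_d = standardise(darwin_slp)
    diffs = [t - d for t, d in zip(s_t, s_d)]     # S_T - S_D
    msd = math.sqrt(sum(x * x for x in diffs) / len(diffs))
    return [x / msd for x in diffs]
```

Under these assumptions, a period in which Tahiti's pressure is below its mean while Darwin's is above its mean produces a negative SOI value, which is the sign associated with El Niño events.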
2.8 Geographic Knowledge Discovery
Geographic Knowledge Discovery (GKD) is the process of extracting information and knowledge from
massive geo-referenced databases (Miller, 2007). One part of GKD is geographic data mining. Geo-
graphic data mining involves the use of computational techniques and tools to discover patterns that are
distributed over geographic space and time (Miller and Han, 2001). Geographic data mining must also
take into consideration data features that are specific to geographic domains (Openshaw, 1999).
Geographical data mining techniques are necessary, as classical data mining algorithms often assume
characteristics of the data that are inconsistent with those of geographical data (Chawla et al., 2001). For
example, often the data is assumed to be independent and/or identically distributed. These assumptions
would violate Tobler's first law of Geography, which states that everything is related to everything else,
but near things are more related than distant things (Tobler, 1970).
As a result, geographical data dependencies and features must be considered when performing data
mining on geographical data. Some additional data features that are specific to geographical data have
been described by Miller and Han (2001) and Openshaw (1999).
Openshaw (1999) describes a number of features that are specific to geographic domains. These include
the following:
• Observations are not independent.
• Data uncertainty and errors are often spatially structured.
• Whole map statistics are seldom helpful.
• Non-stationarity is to be expected.
• Relationships are often geographically localised, rather than global.
• Non-linearity is the norm.
• Data distributions are non-normal.
• High levels of multivariateness but with redundancy.
• Time often interacts with space.
• Most GIS data layers are categorical.
• The locational element is important.
• The modifiable nature of all spatially aggregated data.
• Results reflect definitional dependencies.
• There can be a fair proportion of junk data.
A further discussion of geographic data features in relation to the precipitation data we used is given
in Section 4.4.3.
2.9 Extreme Value Analysis
The Extreme Value Theory (EVT) methodology was proposed in the 1950s to attempt to quantify the
stochastic behaviour of a process at extremely large or extremely small levels (Coles, 2001). The theory
was developed to assist in predicting the likelihood of future extreme values.
EVT is a statistical area that is concerned with developing techniques and models for describing the
unusual rather than the usual (Coles, 2001). Instead of using all the values of a distribution, EVT only
considers the extreme values located in the tail of the distribution. This is the main advantage of EVT
over earlier statistical techniques, which often perform poorly in low-density regions of the distribution,
such as the higher or lower ends, and the extremes. The reason for this is that these earlier models are
chosen based on their ability to fit well near the mode as shown in Figure 2.8(a), and therefore may not
model the extremes well, where less data is available. EVT is designed to fit tail data to a distribution,
such as that shown in Figure 2.8(b). However, before EVT can be applied to the data, maxima need to
be selected.
(a) The modal region of a normal distribution
(b) The tail end of a normal distribution
FIGURE 2.8: The modal and tail regions of a normal distribution
There are two basic approaches for selecting the maxima, and two models from EVT have been de-
veloped to handle the resulting values from both methods of selection. These are block maxima and
threshold-based models.
Block maxima models divide the data into blocks and take the maximum value of each block. For
example, we might divide weekly maxima residuals over several years into 3 month blocks, from which
we would take the maximum precipitation for the season. This would leave us with 4 blocks per year, from
which we derive 4 maxima for each year. This is shown in Figure 2.9(a), where the selected maxima are
highlighted in pink. The figure shows ten blocks, with a single maximum value selected from each block.
The main problem with this approach is that many extreme values may be overlooked, since only the
highest value in any given block is considered.
(a) Block maxima selection
(b) Variable threshold maxima selection
FIGURE 2.9: Different methods for selecting maxima from data
For precipitation extremes, a threshold approach is usually considered more suitable. This study and
previous work by Khan et al. (2007) use the Peak Over Threshold (POT) method to select the extremes.
With POT, any value that exceeds a given threshold is considered extreme. This concept is illustrated in
Figure 2.9(b). It is not necessary to divide the data into blocks. The threshold can be either fixed or
variable. Methods of selecting the threshold are described in detail in Chapter 3.
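The difference between the two selection methods can be illustrated with a short sketch. The function names, the fixed threshold and the sample values used in the note that follows are assumptions made for illustration; they are not the data or thresholds used in this study.

```python
def block_maxima(values, block_size):
    """Maximum of each consecutive block of block_size values."""
    return [max(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def peaks_over_threshold(values, threshold):
    """All values that exceed a fixed threshold, in their original order."""
    return [v for v in values if v > threshold]
```

For the series `[3, 9, 8, 1, 2, 4, 7, 10, 9, 2, 1, 3]` with blocks of four values, block maxima selection returns `[9, 10, 9]`, whereas a threshold of 7.5 returns `[9, 8, 10, 9]`. The value 8 in the first block is extreme under the threshold approach but is overlooked by block maxima because a larger value occurs in the same block.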
Once the extremes have been extracted from the data, the next step is to model them using an extreme
value distribution to approximate the distribution of exceedances. Two major extreme value models
exist. The classical Generalised Extreme Value (GEV) distribution is used to model the distribution of
block maxima. However, it does not utilise all the data available in the upper tail of the distribution.
Therefore, in order to model threshold-based POT exceedances, we instead use the Generalised Pareto
(GP) distribution (Khan et al., 2007).
The GP distribution takes extreme events described by a threshold u, where $y_1, y_2, \ldots, y_k$ are the k
exceedances over the threshold. The distribution function for the GP distribution is given as:
\[ F_{\xi,\sigma}(y) = \begin{cases} 1 - \left[1 + \xi (y/\sigma)\right]^{-1/\xi}, & 1 + \xi (y/\sigma) > 0,\ \xi \neq 0 \\ 1 - e^{-y/\sigma}, & \xi = 0 \end{cases} \]
Maximum Likelihood (ML) is used to estimate the parameters of the GP distribution. These parameters
are known as shape and scale. The shape parameter ($\xi$) has the following properties. If $\xi < 0$, the values
fit to a bounded tail distribution, meaning that the range of values is finite and has an upper limit. If
$\xi = 0$, the values fit to a light or exponential tail distribution. If $\xi > 0$, the values fit to a heavy or
polynomial tail distribution. Therefore if $\xi \geq 0$, there is no upper limit on the distribution's tail. The
scale parameter ($\sigma$) is always positive for the GP distribution and indicates the spread of the distribution.
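The effect of the shape parameter on the tail can be seen by evaluating the GP distribution function directly. The following is a minimal sketch of the distribution function only, with hypothetical argument names `shape` and `scale`; it does not perform the Maximum Likelihood estimation.

```python
import math


def gp_cdf(y, shape, scale):
    """Generalised Pareto distribution function for an exceedance y."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    if y <= 0:
        return 0.0
    if shape == 0:
        # Light (exponential) tail.
        return 1.0 - math.exp(-y / scale)
    z = 1.0 + shape * y / scale
    if z <= 0:
        # Only reachable when shape < 0: y lies at or beyond the
        # upper limit -scale/shape of the bounded tail.
        return 1.0
    return 1.0 - z ** (-1.0 / shape)
```

With a scale of 1, a shape of -0.5 gives an upper limit of 2, so `gp_cdf(2.0, -0.5, 1.0)` is 1. At an exceedance of 10, the heavy-tailed case with shape 0.5 still leaves a probability of 1/36 above that level, while the exponential case with shape 0 leaves far less.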
In this thesis the relationships of these parameters between geographical regions over South America
are examined. Understanding the shape and scale parameters can provide some insights into extreme
precipitation data, but further analysis is required to understand the relationships between regions. One
way of providing greater understanding is through spatial autocorrelation measures such as Moran's I
statistic, which is introduced in the following section.
2.10 Spatial Autocorrelation
Spatial autocorrelation is the correlation of a variable with itself over space. Goodchild (1986) defines
spatial autocorrelation as concerned with the degree to which objects or activities at some place on
the Earth's surface are similar to objects or activities nearby. That is, spatial autocorrelation statistics
describe the correlation of a variable with itself in reference to its spatial location. Spatial autocorrelation
is descriptive in the sense that it describes the way things are distributed over space, but is also seen as a
causal process since it measures the degree of influence objects or activities exert over their neighbours.
As pointed out by Goodchild (1986), spatial data consists of two types of information. These are the
attributes of spatial features such as the amount of precipitation for example, and their location such as
their geographical location on the map. The majority of earlier statistical techniques only used location
to select regions that fell within the boundaries specified, but once selected the locations are more or
less ignored, and shuffling the variables around would have little consequence on the results. On the
other hand, other techniques focused on the location of objects, but did not consider the attributes.
Spatial autocorrelation is different to previous techniques since it considers both the attribute values and
location of the data.
Cliff and Ord (1973) provide two examples of applications where spatial autocorrelation statistics would
be useful. The first was in discovering whether potential environmental or other factors that affect the
mortality rate of cancers were spatially grouped. Identification of such a grouping would help to isolate
the factors that might increase the incidence of a particular type of cancer. The second application was
to find out whether the voting preferences of one state affect those of neighbouring states.
In this study, the spatial autocorrelation of the parameters of the GP distribution is calculated. The value
returned helps us to understand how similar the extreme value distributions, that is, the distributions that
describe the behaviour of extreme events, are to one another in nearby regions.
2.10.1 Measuring Spatial Autocorrelation
As stated previously, spatial autocorrelation is concerned with the similarity of both attributes and location.
This is reflected when these features are combined into an index. Given any two spatial objects, i
and j in the dataset, we can calculate their attribute similarity as $c_{ij}$ and their locational similarity $w_{ij}$
and combine this into an index of the form $\sum_{j} c_{ij} w_{ij}$.
The resulting value of a spatial autocorrelation statistic can indicate either positive, negative or no spatial
autocorrelation between regions.
Positive autocorrelation indicates that an event in one region will increase the likelihood of the same
event in a neighbouring region. That is, nearby regions will likely have similar values of the particular
variable for which spatial autocorrelation is being calculated. Positive autocorrelation does not neces-
sarily distinguish between high or low data values that are positively correlated. For example, positive
autocorrelation can indicate that high precipitation in one region would mean there would also be high
precipitation in nearby regions. However, it could indicate that low precipitation in one region means
there is likely to be low precipitation in neighbouring regions.
Negative autocorrelation indicates that an event in one region will decrease the likelihood of the same
event in a neighbouring region. For example, high precipitation in one region would mean that it is
unlikely for there to be high precipitation in a nearby region. Conversely, low precipitation in one region
would mean that it is unlikely for there to be low precipitation in nearby regions.
Finally, no autocorrelation indicates that an event in one region will have no effect on the events of
another neighbouring region. No autocorrelation is often present when the values are random.
Positive, negative and no autocorrelation are illustrated in Figure 2.10. Figure 2.10(a) shows extreme
positive autocorrelation where all the same type of attributes (black cells or white cells) are located in
close spatial proximity to one another. Figure 2.10(b) illustrates extreme negative autocorrelation where
black cells are only located next to white cells. Figure 2.10(c) demonstrates no autocorrelation where
the nearby cells are selected randomly.
(a) Positive Autocorrelation (b) Negative Autocorrelation
(c) No Autocorrelation
FIGURE 2.10: Examples of Positive, Negative and No Autocorrelation
It is important to note that in this set of examples, neighbours are defined by their horizontal or vertical
adjacency. A different definition of a neighbour could change the results. Another important factor to
consider is that the degree of spatial autocorrelation is dependent on the scale. For example, if we were
to look at a subset of grids from one of those in Figure 2.10, then the value of the spatial autocorrelation
statistic could be very different.
Spatial autocorrelation is significant because, as an index, it provides information about a spatially
distributed phenomenon that is not available in other forms of statistical analysis. Goodchild (1986) asserts
that if one were forced to summarise a spatial distribution of unequal attributes in a single statistic,
one would in all likelihood choose a spatial autocorrelation index, just as one would probably choose a
measure of central tendency such as the mean or median to summarize a non spatial data set. Spatial
autocorrelation measures allow us to more accurately and objectively evaluate something that would
otherwise have relied on subjective and possibly inaccurate visual observations from a map.
There are two well known statistics for calculating spatial autocorrelation: Geary's C
statistic and Moran's I statistic.
While Geary's C statistic and Moran's I statistic differ in their computation, they are very similar. The
main advantage of the Moran I statistic over Geary's C is that the values associated with positive, negative
and no spatial autocorrelation are more intuitive. With Geary's C, positive autocorrelation is represented
by values between 0 and 1, no autocorrelation is represented by values equal to 1 and negative
autocorrelation is represented by values greater than 1. However, using the Moran I statistic, positive
autocorrelation is indicated by values greater than 0, no autocorrelation is indicated by values 0 or close
to 0 and negative autocorrelation is indicated by values below 0. The differences are summarised in
Table 2.1. Perhaps due to the intuitiveness of the resulting values of Moran's I, it has become the more
popular of the two statistics. Note that for Moran's I, the value that indicates no autocorrelation is
close to zero rather than exactly zero: its expected value is -1/(n - 1) for n regions.
Statistic                  Geary's C    Moran's I
Positive Autocorrelation   0 < c < 1    I > 0
No Autocorrelation         c = 1        I = 0
Negative Autocorrelation   c > 1        I < 0
TABLE 2.1: Autocorrelation values for Geary's C and Moran's I
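The value ranges of the two statistics can be checked on small grids like those in Figure 2.10. The sketch below uses the standard global forms of Moran's I and Geary's C, which are not written out in this chapter, and assumes binary weights with neighbours defined by horizontal or vertical adjacency. The function names are illustrative, and the grid values must not all be equal.

```python
def rook_pairs(rows, cols):
    """Yield ordered pairs of cell indices that share an edge."""
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    yield r * cols + c, rr * cols + cc


def morans_i_and_gearys_c(grid):
    """Global Moran's I and Geary's C for a rectangular grid of values."""
    rows, cols = len(grid), len(grid[0])
    x = [v for row in grid for v in row]
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    ss = sum(d * d for d in dev)                  # sum of squared deviations
    pairs = list(rook_pairs(rows, cols))
    w = len(pairs)                                # sum of the binary weights
    cross = sum(dev[i] * dev[j] for i, j in pairs)
    sqdiff = sum((x[i] - x[j]) ** 2 for i, j in pairs)
    moran = (n / w) * cross / ss
    geary = ((n - 1) / (2 * w)) * sqdiff / ss
    return moran, geary
```

On a 4 by 4 checkerboard of zeros and ones, every neighbour differs, giving I = -1 and C = 1.875. On a 4 by 4 grid whose left half is one and right half is zero, most neighbours agree, giving I of about 0.667 and C = 0.3125.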
More recently, Getis and Ord (1992) have introduced the G statistic as a measure of spatial association.
There are advantages and disadvantages to both the Moran I and Getis-Ord G statistics. For example,
a disadvantage of using Moran's I is that it is unable to discern between high and low values that are
correlated. This means when areas with high precipitation are located together, or when areas with
low precipitation are located together, the value of I could potentially be the same, while the value of
Getis-Ord's G would be different in order to discern between high or low values. One disadvantage
of Getis-Ord's G is that it is not suitable for all applications, and would prove inappropriate to use
in studying residuals from regression. Getis and Ord's study finds that the I and G statistics measure
different things, and so the authors suggest that Getis-Ord's G be used in conjunction with Moran's I.
In our study, a local variation of Moran's I has been utilised as the first stage of analysis. This is described
in the following section. However, future studies could incorporate Getis-Ord's G statistic to supplement
the results found in this thesis.
2.10.2 Local Indicators of Spatial Association
Spatial autocorrelation statistics can be used to measure global or local autocorrelation. Usually a global statistic describes the whole dataset with a single autocorrelation value for a given variable. A local statistic, on the other hand, provides a value for each of the local regions on the map.
There also exist differences in the way that global and local autocorrelation is calculated. Global statis-
tics tend to look at zones or regions with continuous variables, and compare the values of one region to
those of all other regions. Local statistics only consider the nearby regions when comparing values.
It has been suggested that global measures that consider a large number of spatial observations are unable to take into account spatial structural instability (Anselin, 1995). Therefore local statistics have been developed that provide a measure of the spatial association between regions on a local scale. These are known as Local Indicators of Spatial Association (LISAs), introduced by Anselin (1995), who defines a LISA as any statistic that satisfies the following requirements:
• The LISA for each observation gives an indication of the extent of significant spatial clustering of similar values around that observation; and,
• The sum of LISAs for all observations is proportional to a global indicator of spatial association.
A more formal definition expresses a LISA for a variable $y_i$, observed at location $i$, as a statistic $L_i$, such that $L_i = f(y_i, y_{J_i})$, where $f$ is a function and $y_{J_i}$ are the values observed in the neighbourhood $J_i$ of $i$.
Local spatial outliers, also known as hotspots, can be identified as those locations or sets of contiguous locations where the value of the LISA statistic is significant.
As previously discussed, a popular statistic for measuring spatial autocorrelation is Moran's I. It is used to test the null hypothesis that the spatial autocorrelation of a variable is zero (Ord and Getis, 1995).
2.10.3 Local Moran's I Statistic
The purpose of Moran's I statistic is to measure spatial autocorrelation between neighbouring regions. A local spatial autocorrelation statistic for Moran's I is also available in addition to the global statistic. As previously stated, the sum of the local indicators for all observations should be proportional to a global indicator of spatial association. The proportionality may vary between different statistics, therefore we define this for the Moran I statistic below.
For an observation $i$, the local Moran I statistic is defined as:
$$I_i = z_i \sum_{j \in J_i} w_{ij} z_j,$$
where $z_i$, $z_j$ are observations calculated as deviations from the mean, and the summation over $j$ incorporates only neighbouring values $j \in J_i$. The adjacency weights between regions $i$ and $j$ are represented by $w_{ij}$, where $w_{ii} = 0$.
To understand the relationship between the local and global Moran statistics, we compare the sum of the local Moran statistics to the global Moran statistic. The sum of the local Moran statistics is defined as:
$$\sum_i I_i = \sum_i z_i \sum_j w_{ij} z_j, \tag{2.7}$$
while the global Moran I is defined as:
$$I = (n / S_0) \sum_i \sum_j w_{ij} z_i z_j \Big/ \sum_i z_i^2, \tag{2.8}$$
or
$$I = \sum_i I_i \Big/ \left[ S_0 \left( \sum_i z_i^2 / n \right) \right], \tag{2.9}$$
where $S_0 = \sum_i \sum_j w_{ij}$. Therefore, by taking $m_2 = \sum_i z_i^2 / n$ as the second moment (which is a consistent, although slightly biased, measure of the variance), the factor of proportionality between the sum of the local Moran I values and the global Moran I value is defined as:
$$\gamma = S_0 m_2. \tag{2.10}$$
For more information on the local Moran I statistic, see Anselin (1995).
In our work, we use the local Moran I statistic to determine the extent to which the parameters of the GP extreme value distribution of one region are correlated with the parameters of the surrounding regions.
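The relationship between the local and global Moran statistics in Equations (2.7) to (2.10) can be checked numerically. The following Python sketch is illustrative only: the function name `local_moran`, the four-region chain and its binary adjacency weights are assumptions made for this example, and are not the weights or data used in our experiments.

```python
def local_moran(values, weights):
    """Return (local I values, global I, proportionality factor S0 * m2).

    values  : list of observations, one per region
    weights : n x n adjacency weights (hypothetical; w_ii is ignored)
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean

    # Local Moran: I_i = z_i * sum_j w_ij * z_j, skipping j == i so w_ii = 0
    local = [
        z[i] * sum(weights[i][j] * z[j] for j in range(n) if j != i)
        for i in range(n)
    ]

    s0 = sum(weights[i][j] for i in range(n) for j in range(n) if j != i)
    m2 = sum(zi * zi for zi in z) / n  # second moment
    factor = s0 * m2                   # Equation (2.10)
    global_i = sum(local) / factor     # Equation (2.9)
    return local, global_i, factor


# Four regions in a line; each region neighbours the regions next to it.
values = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
local, global_i, factor = local_moran(values, weights)
print(local, global_i, factor)
```

Here the sum of the local values equals the factor multiplied by the global value, which is the proportionality required of a LISA.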
2.11 Data
This section describes the South American precipitation dataset that was used in our experiments in both
Chapters 3 and 4.
The data used in our experiments is the South American precipitation data set obtained from the NOAA (Liebmann and Allured, 2005). It is provided in a geoscience format known as NetCDF. A description of the data is provided in both Liebmann and Allured (2005) and Wu and Chawla (2007). The data are presented in a grid consisting of 713 regions, which contain daily precipitation values from around 7900 stations. These values are averaged for each grid cell between 1940-2006. In our experiments we consider the ten year period from 1985-2004, since those years contain data for most regions that lie over land in South America.
2.11.1 Locations
The data are presented in a grid from latitude 60°S to 15°N and longitude 85°W to 35°W. Thus with a 2.5° gridding, there are 31 latitude points and 23 longitude points, totalling 713 regions. However, many of these regions do not contain observations. The data are taken from 7900 stations. For each grid cell, there is usually more than one station, so the average is taken over all stations in that cell for any given day.
2.11.2 Deseasonalisation
Raw precipitation values themselves may not provide interesting patterns, since we usually already know in which regions more or less rain is likely to fall.
A more interesting statistic is the deviation from the normal amount of precipitation. To find this we must deseasonalise the data. The first step of this process is to find, for each week of the year, the mean of the values for that week over the entire period. Then for each value in the period, we subtract the corresponding mean to get the deseasonalised value. This process is described in Wu and Chawla (2007), and outlined here for reference.
To perform deseasonalisation, let $m_{w_{iy}}$ be the maximum precipitation value at week $w_{iy}$, that is, week $i$ (between 1 and 53) of year $y$, from a time period of several years.
For any given week in the data $w_{iy}$, we calculate the weekly maximum residual by finding the average for that week of the year (between 1 and 53), $\bar{w}_i$, and subtracting it from $m_{w_{iy}}$. We can calculate $\bar{w}_i$ as:
$$\bar{w}_i = \frac{\sum_{y=sy}^{ey} w_{iy}}{N_y}$$
where $sy$ is the first year of the period (e.g. 1995), $ey$ is the last year of the period (e.g. 2004), and $N_y$ is the number of years in the period. We then use $\bar{w}_i$ to remove seasonal effects from a given week $w_{iy}$ to find its deseasonalised value $d_{w_{iy}}$ as follows:
$$d_{w_{iy}} = m_{w_{iy}} - \bar{w}_i \tag{2.11}$$
When calculating the average value $\bar{w}_i$, we must take into account missing values contained in the data. The sum over all years for that week, excluding those with missing values, $\sum_{y=sy}^{ey} w_{iy}$, is calculated and divided by the total number of years $N_k$ for that particular week $i$ that do not contain missing values. So for each week $i$ from 1 to 53,
$$\bar{w}_i = \frac{\sum_{y=sy}^{ey} w_{iy}}{N_k}. \tag{2.12}$$
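The deseasonalisation in Equations (2.11) and (2.12) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in our experiments (which was written in MATLAB): the function name `deseasonalise`, the dictionary layout keyed by (week, year), the use of `None` for missing values and the sample numbers are all assumptions made for this example.

```python
def deseasonalise(weekly_max):
    """Subtract each week-of-year mean from the weekly maxima.

    weekly_max maps (week, year) -> weekly maximum, or None if missing.
    Returns a dict with the same keys holding residuals (None if missing).
    """
    totals, counts = {}, {}
    for (week, _year), value in weekly_max.items():
        if value is None:
            continue  # missing years are excluded from the sum and the count
        totals[week] = totals.get(week, 0.0) + value
        counts[week] = counts.get(week, 0) + 1

    # Equation (2.12): divide by the number of non-missing years for the week
    week_mean = {week: totals[week] / counts[week] for week in totals}

    residuals = {}
    for (week, year), value in weekly_max.items():
        if value is None or week not in week_mean:
            residuals[(week, year)] = None
        else:
            # Equation (2.11): weekly maximum minus the mean for that week
            residuals[(week, year)] = value - week_mean[week]
    return residuals


# Hypothetical weekly maxima for two weeks over three years.
sample = {
    (1, 1995): 10.0, (1, 1996): None, (1, 1997): 14.0,
    (2, 1995): 3.0, (2, 1996): 5.0, (2, 1997): None,
}
residuals = deseasonalise(sample)
print(residuals)
```

Keeping missing entries as `None` in the output preserves the position of each week, so that later steps can decide how to treat them.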
2.12 Summary
This chapter has introduced the evolution from spatial data mining (Section 2.2) and temporal data mining (Section 2.4) to spatio-temporal data mining (Section 2.6).
Some of the techniques for spatio-temporal data mining have been outlined, with particular emphasis on spatio-temporal outlier detection, which is the focus of the research described in Chapter 4.
An introduction to some aspects of Hydrology, Geographic Knowledge Discovery and statistical analysis methods, including Extreme Value Analysis and spatial autocorrelation techniques, has also been provided as background for the spatio-temporal analysis performed in Chapter 3.
CHAPTER 3
Spatio-Temporal Analysis of the relationship between South American
Precipitation Extremes and the El Niño Southern Oscillation
This chapter is based on the work published in:
Elizabeth Wu and Sanjay Chawla (2007) Spatio-Temporal Analysis of the relationship between South American Precipitation Extremes and the El Niño Southern Oscillation, 2007 International Workshop on Spatial and Spatio-temporal Data Mining (SSTDM) in conjunction with IEEE International Conference on Data Mining (ICDM), October 28, Omaha, NE, USA. IEEE Computer Society.
3.1 Related Work
The spatio-temporal variability of South American precipitation extremes has previously been investi-
gated by Khan et al. (2007). Their study involved the use of a subset of the dataset used in our study
from the NOAA (Liebmann and Allured, 2005).
In their work, they analysed the daily and weekly variability of these extremes using data over the period 1940-2004 on 2.5° spatial grids. Daily precipitation measurements were taken from over 7900 stations in Brazil, Venezuela, North Argentina, Paraguay, Uruguay, Suriname and French Guiana.
Their analysis involved the use of the Generalized Pareto (GP) distribution from Extreme Value Theory (EVT) to approximate the distribution of exceedances. Rather than use a constant threshold for all spatial grid points, the authors chose to classify values exceeding the $x$th percentile of the distribution as extreme events. For daily extremes the 99%-quantile is used, while for weekly extremes the 95%-quantile is used.
By calculating the parameters of the GP distribution for each region, Khan et al. (2007) qualitatively evaluate several regions at a time by visually observing trends in the regions that cover particular countries in South America. However, better use can be made of the parameters of the GP distribution if a spatial autocorrelation statistic such as Moran's I is applied to quantitatively measure the relationship between regions. This is one of the contributions of our work. Their work also does not consider the relationship of precipitation with the El Niño Southern Oscillation (ENSO) weather phenomenon.
Some research efforts have placed emphasis on the relationship of ENSO with various climate phenomena, such as river flows in Khan et al. (2006), which are associated with precipitation. In their paper, they reveal relationships between two variables, such as stream flow and El Niño, using Mutual Information (MI). Their work, however, does not investigate precipitation directly.
Ropelewski and Halpert (1987) have looked at the relationship of El Niño with precipitation; however, they do not specifically consider extreme precipitation or its behaviour with neighbouring regions.
Gershunov (1998) investigates the relationship of El Niño events with extreme precipitation, but does not utilise EVT to do so, and so may miss the benefits of EVT described in Section 2.9.
There are a few major measures of an El Niño event that can be used. One is the Sea Surface Temperature (SST). Liebmann (2005) compared the SST between regions against the mean precipitation for a given time period. In our experiments we have used a different ENSO indicator, the Southern Oscillation Index (SOI), to evaluate the relationship between ENSO and the spatial relationships between the GP distribution parameters over several El Niño periods. While the Multivariate ENSO Index (MEI) also exists, its values have only been recorded by the NOAA since 1955.
In summary, previous work has investigated precipitation extremes using EVT, but does not use a quantitative spatial autocorrelation statistic. It also does not consider the relationship between the extreme values and the ENSO weather phenomenon. Earlier work does, however, lay the foundations for expansion to cover spatial autocorrelation statistics in addition to the use of EVT. This can then be furthered with a study of the relationship of precipitation extremes with ENSO. Although other work does consider ENSO in relation to different environmental processes, it does not use either EVT or spatial autocorrelation statistics to do so, and does not focus on extreme values.
3.2 Our Contribution
After performing our own analysis of South American precipitation extremes using the Generalised Pareto (GP) distribution of Extreme Value Theory (EVT), we extend it with additional analysis.
This analysis included the:
• Application of the local Moran I statistic, a Local Indicator of Spatial Association (LISA), to the parameters of the GP distribution; and,
• Comparison of the average local Moran I to the average Southern Oscillation Index (SOI) over several strong El Niño periods, incorporating the use of bootstrapping to evaluate the results.
Our contribution is therefore:
• To use a spatial autocorrelation statistic to provide a quantitative analysis, as opposed to the qualitative analysis done in previous work, of the relationship between regions based on their GP distribution parameters; and,
• To compare the values of the local Moran I statistic to El Niño to determine the extent to which their values are correlated.
3.3 Experimental Setup
The experiments in this chapter aim to evaluate the spatio-temporal variability of the parameters of the
GP distribution. This was done by analysing weekly precipitation residuals, which are the maximum
weekly values extracted for each week from the daily data following deseasonalisation. From these
weekly maxima residuals, the 95%-quantile is obtained for use with the GP distribution. The exper-
iments also aimed to evaluate the relationship between the parameters of the distribution over several
ENSO cycles.
The steps involved in our experimentation are as follows:
(1) Unpack netCDF formatted precipitation data using MATLAB.
(2) Pre-process: Deseasonalise, then remove missing, negative and zero values.
(3) Get the El Niño years from the NOAA.
(4) Calculate the number of valid regions for each El Niño period.
(5) For each El Niño period, get the top 5% of deseasonalised precipitation values.
(6) Fit these to a GP distribution to estimate the shape and scale parameters using Maximum Likelihood (ML).
(7) Calculate the local Moran I for each region in South America.
(8) Compare the average of the local Moran I values for each strong El Niño period between 1978 and 2004 to the SOI using bootstrap analysis.
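Steps (5) and (6) can be illustrated with a short Python sketch. Two assumptions should be noted. First, the threshold rule (the smallest value with at least 95% of the data at or below it) is one of several possible quantile conventions. Second, to keep the example dependent only on the standard library, the GP parameters are estimated by the method of moments, which is a simpler stand-in and not the Maximum Likelihood fit used in our experiments. The function names and the input values are hypothetical.

```python
import math
import statistics


def excesses_over_percentile(values, percent=95):
    """Return (threshold, excesses) for values above the given percentile."""
    ordered = sorted(values)
    # Index of the smallest value with at least `percent`% of data at or below it
    index = math.ceil(percent * len(ordered) / 100) - 1
    threshold = ordered[max(index, 0)]
    return threshold, [v - threshold for v in values if v > threshold]


def gp_method_of_moments(excesses):
    """Estimate GP (shape, scale) from the mean and variance of the excesses.

    Stand-in for Maximum Likelihood; only valid when the true shape < 0.5.
    Uses mean = scale / (1 - shape) and
    variance = scale^2 / ((1 - shape)^2 * (1 - 2 * shape)).
    """
    mean = statistics.mean(excesses)
    var = statistics.variance(excesses)
    ratio = mean * mean / var  # equals 1 - 2 * shape
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (1.0 + ratio)
    return shape, scale


# Hypothetical deseasonalised values: the integers 1 to 100.
residuals = list(range(1, 101))
threshold, excesses = excesses_over_percentile(residuals)
shape, scale = gp_method_of_moments(excesses)
print(threshold, excesses, shape, scale)
```

In practice a Maximum Likelihood routine from a statistics package would replace `gp_method_of_moments`, with the thresholding step left unchanged.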
To conduct this research, a number of important factors had to be taken into consideration. Consider-
ations made about the data are described in Section 3.3.1. The selection of strong ENSO periods is
described in Section 3.3.2.
After taking the mean of the Moran I values and the mean of the SOI, we are left with three separate line graphs. To determine the similarity of the trend between the graphs, we use a method known as bootstrapping or bootstrap analysis. Bootstrapping is the process of repeatedly sampling the paired values of two variables with replacement and finding the correlation coefficient of each sample. The resulting correlation coefficients are then compared with one another to assess how much they vary. This gives us an indication of how strongly one variable is correlated with the other. In our experiments, we compare the mean Moran I values for each of the GP parameters to the mean SOI values, in order to discover the extent to which the SOI and Moran I values are correlated.
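The bootstrap procedure described above can be sketched as follows. This is an illustrative Python example under stated assumptions: the Pearson correlation coefficient is used, resamples in which either variable is constant are skipped because the coefficient is undefined for them, and the function names, the seed and the input numbers are hypothetical rather than values from our experiments.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, or None if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlations(xs, ys, n_resamples=1000, seed=0):
    """Resample (x, y) pairs with replacement and collect the correlations."""
    rng = random.Random(seed)
    n = len(xs)
    results = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:
            results.append(r)
    return results


# Hypothetical period means: a perfectly linear pair of series.
mean_soi = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_moran = [2.0, 4.0, 6.0, 8.0, 10.0]
correlations = bootstrap_correlations(mean_soi, mean_moran)
print(min(correlations), max(correlations))
```

A histogram of `correlations` concentrated near one would indicate a consistently positive relationship between the two series.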
3.3.1 Data
South American precipitation data (Liebmann and Allured, 2005) was obtained for use in this study. The dataset used was the daily 2.5° × 2.5° gridded data, provided in NetCDF format.
(a) Extremes
There are two main ways to classify a precipitation value as an extreme value. The first is to set a fixed numerical threshold, where any precipitation value that exceeds the threshold is classified as an extreme. However, values are generally considered extreme relative to the normal behaviour of the region. For example, a warm day in a region far from the equator may be extreme, but the very same temperature closer to the equator may be considered normal. The second way, used in this and previous studies, is therefore to classify an extreme value as one which exceeds the $x$th percentile of the empirical distribution. This selection method is known as Peaks Over Threshold (POT). Our study fits a GP distribution to all those values exceeding the 95%-quantile of the weekly maxima residuals.
(b) Independent and Identical Distribution
As the analysis requires independent and identically distributed (iid) values, daily data could not be
used. This is because daily data exhibits a high degree of dependence from one day to the next. To
overcome this, we generate weekly precipitation data using the deseasonalised maximum value for each
single week period.
(c) Time Intervals
The data points are recorded in NetCDF format as an index corresponding to hours since 12:00am on January 1, 1800. To analyse the variability of precipitation extremes over the lifecycle of ENSO, several time periods were selected. These correspond to the strong El Niño periods between 1978 and 2004.
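Converting the hour index described above into a calendar date is straightforward. The following Python sketch is illustrative only; the helper names are hypothetical, and it assumes the index counts whole hours on the standard Gregorian calendar.

```python
from datetime import datetime, timedelta

# Reference instant of the time index: 12:00am on January 1, 1800
EPOCH = datetime(1800, 1, 1)


def hours_to_datetime(hours):
    """Convert an 'hours since 1800-01-01 00:00' index to a datetime."""
    return EPOCH + timedelta(hours=hours)


def datetime_to_hours(moment):
    """Convert a datetime back to the hour index."""
    return (moment - EPOCH).total_seconds() / 3600.0


start = datetime(1978, 1, 1)
start_index = datetime_to_hours(start)
print(start_index, hours_to_datetime(start_index))
```

The inverse helper is convenient for selecting the block of data that covers a given period start and end date.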
(d) Deseasonalisation
Since some seasons usually have more precipitation than others, we want to remove the effect of seasonality from the data. Seasonality is where precipitation is higher in some seasons, such as winter, and lower in others, such as summer. Removing this seasonality allows us to find precipitation that is significantly different from what is expected. In this way, we can see which precipitation events differ from the expected precipitation for that particular time of the year. As a result, extreme precipitation events are those which are significantly above or below the expected precipitation for a particular time of year. The process of deseasonalisation is described in Section 2.11.2.
3.3.2 Strong ENSO events and Local Moran I
Several strong El Niño events have been identified by the NOAA. Using time series of data that overlapped each of the El Niño events, we analysed the relationship between the average local Moran I values of the GPD parameters and the SOI during several El Niño events.
These are shown in Table 3.1, where Period Start and Period End specify the range of dates covered by each block of data for each strong El Niño event.
Event #   El Niño Event   Period Start   Period End
1         1939-1941       01-Jan-1940    30-Jun-1949
2         1957-1959       01-Jul-1949    30-Jun-1966
3         1972-1973       01-Jul-1966    31-Dec-1977
4         1982-1983       01-Jan-1978    31-Dec-1986
5         1990-1993       01-Jan-1987    30-Jun-1995
6         1997-1998       01-Jul-1995    30-Jun-2000
7         Remaining       01-Jul-2000    31-Dec-2004
TABLE 3.1: Strong El Niño Periods and Analysis Years
FIGURE 3.1: Number of non-missing regions for each strong ENSO period
3.4 Experimental Results
The periods, selected by El Niño event, are shown in Table 3.1. As some regions contain missing data in the earlier years, fewer regions are available to analyse. This is shown in Figure 3.1.
For each El Niño period, the mean SOI is calculated over all the months. The mean of the local Moran I statistic for the shape and scale parameters, over all non-missing locations, is also calculated. These means are presented in Figure 3.2. The years prior to the 1982-83 El Niño event have been omitted since they do not contain enough data to produce reliable Moran I means.
(a) Mean SOI
(b) Mean Shape Local Moran I
(c) Mean Scale Local Moran I
FIGURE 3.2: Mean of SOI and local Moran I values over strong ENSO periods between 1978-2004
(a) Bootstrap of Mean Local Moran I of Scale vs SOI Hist
(b) Bootstrap of Mean Local Moran I of Shape vs SOI Hist
FIGURE 3.3: Bootstrap Analysis of Local Moran I of Scale over strong ENSO periods
The results show that when the local Moran I values are low, the SOI is also low. This means that the
relationship between precipitation in neighbouring regions is weaker when the SOI is lower.
As we can see, there is a common pattern amongst the means in Figure 3.2. This has been illustrated using bootstrap analysis, shown in Figure 3.3. We see that the means of the Moran I values for the shape and scale parameters of the GP distribution are positively correlated with the SOI, since values closer to one indicate positive correlation. This positive correlation shows that during strong El Niño periods nearby regions experience similar extreme weather patterns. This could be attributed to the stronger El Niño periods bringing more extremes throughout all regions in South America, which means that a trend is present between regions also. In weaker El Niño events, extremes may be experienced in only some regions while neighbouring regions may have relatively few extremes, creating a weaker relationship between regions.
3.5 Discussion
In this study we have analysed the spatial autocorrelation, as measured by the local Moran I statistic, of both the scale and shape parameters of the GP distribution. We have shown that a relationship exists between the mean local Moran I values and the mean SOI over strong El Niño periods.
A higher local Moran I means that there is a stronger relationship between neighbouring regions, while a lower local Moran I means that there is a weaker relationship between neighbouring regions. A higher SOI indicates a weaker El Niño phase, while a lower SOI indicates a stronger El Niño phase. From the results we can see that during a stronger El Niño phase, the strength of the relationships between regions is weaker. This means that when we have a strong El Niño, the extreme values of neighbouring regions are not as closely correlated with each other as in weaker El Niño events.
To our knowledge, this is the first time such a relationship has been established. It is significant as it provides a further understanding of the teleconnection between ENSO and precipitation in South America.
Future work could investigate the potential non-linear relationship between the Moran I of the parameters of the GP distribution and other measures of El Niño periods, including SST anomalies and MEI. As more data becomes available, this study could also benefit from re-analysis that includes additional El Niño periods beyond 2004.
CHAPTER 4
Spatio-Temporal Outlier Detection in Precipitation Data
This chapter provides a definition of a spatio-temporal outlier and its associated properties. Since this work is an extension of current spatial outlier detection techniques, these spatial outlier detection methods are described. Following this, our spatio-temporal outlier detection method is introduced and the results of experiments conducted on the South American precipitation dataset are supplied.
4.1 Definition and Properties of a Spatio-Temporal Outlier
Cheng and Li (2006) define a spatio-temporal outlier to be a spatial-temporal object whose thematic attribute values are significantly different from those of other spatially and temporally referenced objects in its spatial or/and temporal neighbourhoods. Birant and Kut (2006) define a spatio-temporal outlier as an object whose non-spatial attribute value is significantly different from those of other objects in its spatial and temporal neighborhood.
The definition of a spatio-temporal object is given in Section 2.5. For a spatio-temporal object to be a spatio-temporal outlier, three important aspects need to be considered. These are the object's:
(1) Spatial neighbourhood;
(2) Temporal neighbourhood; and
(3) Thematic attribute values.
A spatial neighbourhood is the area that lies in close proximity to a given object in a spatial dataset, defined by a topological, distance or direction relationship (Ester et al., 2001). If another object lies within the spatial neighbourhood of the first, then they are said to be spatial neighbours. The period of time which occurs soon before, at the same time as, or soon after a particular event is the event's temporal neighbourhood. If another event occurs during that time period, the two events are said to be temporal neighbours. An object which lies in the spatial and temporal neighbourhood of another is said to be a spatio-temporal neighbour of the first object.
According to the definition of a spatio-temporal object given by Theodoridis et al. (1999), a spatio-temporal object is a line or solid in three-dimensional space. Therefore, the temporal neighbourhood is defined as the time periods that directly precede the commencement or follow the completion of the spatio-temporal event.
In previous work, Birant and Kut (2006) define two spatial objects (S-objects) as temporal neighbors if the values of these objects are observed in consecutive time units such as consecutive days in the same year or in the same day in consecutive years. As we follow the definition of a spatio-temporal object provided by Theodoridis et al. (1999), we consider that a spatio-temporal outlier may exist over more than one time period, while Birant and Kut regard a spatio-temporal outlier to be a spatial outlier from a single time period that is different from its immediate temporal neighbours. For example, in our work, if there is higher than average precipitation in Peru over the years 1998-2002, then the corresponding solid in three-dimensional space is an outlier. In experiments conducted by Birant and Kut (2006), they discover a region in the Mediterranean Sea that is a spatial outlier in 1998, where the years immediately preceding and following it, 1997 and 1999, contain different values for the region. While this is also considered to be a spatio-temporal outlier by our definition, we are also able to discover spatio-temporal outliers that persist over several time periods and which may move or evolve in shape and size.
The difference between the above two definitions can be explained through the use of an example. In Figure 4.1, the darker orange regions represent an outlier. In both sub-figures there are five time periods. Birant and Kut (2006) specify that a spatial outlier is only a spatio-temporal outlier if the same spatial region in the time periods immediately preceding and following the time period where the outlier is present is significantly different. This is illustrated in Figure 4.1(a), where the single outlier in the third time period would be considered to be a spatio-temporal outlier under Birant and Kut's definition. Figure 4.1(b) shows a spatio-temporal outlier that persists over three time periods. We consider this to be a single spatio-temporal object, and so it is detected as a spatio-temporal outlier under our definition. However, Birant and Kut would not consider this a spatio-temporal outlier. Our method is able to detect both forms of spatio-temporal outlier from Figure 4.1.
FIGURE 4.1: Spatio-Temporal Outlier Examples, with sub-figures (a) and (b)
It is clear that there are three types of attributes of a spatio-temporal object. These attributes are spatial, temporal and thematic. A spatial attribute is the location of an object, and could be recorded as a set of latitude and longitude coordinates or a street address, for example. A temporal attribute is the time of observation of an object, and can specify a particular point in time, such as a particular hour, calendar date or year. Thematic attributes are all other attributes which are neither spatial nor temporal in nature, and describe the object's characteristics, such as its volume, weight, texture or colour.
Therefore, a spatio-temporal object $O$ whose spatio-temporal neighbourhood is $O_N$, with thematic attributes $O_A$, can be considered a spatio-temporal outlier when the thematic attributes $O_A$ are significantly different to those of the other objects in $O_N$. That is, for an object to be a spatio-temporal outlier, it must have significantly different thematic attribute values to the objects that lie in its spatial and temporal neighbourhoods.
A simplified example of a spatio-temporal outlier is shown in Figure 4.2. In the figure we can see the smaller red region at the top right is an outlier since the thematic attributes, defined by its colour, are different to those in its spatio-temporal neighbourhood. There are two important things to notice from the diagram. The first is that the outlier may move between time periods. The second is that the outlier may change shape or size.
FIGURE 4.2: An example of a spatio-temporal outlier in gridded data
Figure 4.2 is a simplified version of the spatio-temporal outliers we aim to find. There are several additional factors that we need to consider. In our work, we discover the top-k outliers, as opposed to the single outlier depicted in the figure. Finding more than one outlier introduces new challenges that need to be overcome. An example of one such challenge is where two outlier regions overlap. In addition, spatio-temporal outliers may begin and end at any time, and can exist over single or multiple time periods. A spatio-temporal outlier can also have different thematic attribute values from one time period to the next, but can still be considered a continuous spatio-temporal outlier if its thematic attributes are significantly different to those of the surrounding spatial neighbourhood at that snapshot in time. For example, in Figure 4.2, the centre of the outlier does not have to remain red over each time snapshot; as long as it is different from its neighbours, it can still be a continuous spatio-temporal outlier.
4.2 Previous Research
Detecting spatial relationship changes over time is the aim of our spatio-temporal outlier detection algorithm. In order to detect these changes over time, however, the spatial relationships first need to be discovered over space for each time snapshot. Ren et al. (2003) noted that efforts in spatial data mining either focus on the spatial features
or the thematic attributes, while both spatial and thematic attributes should be considered. They also
noted that spatial autocorrelation studies focus on general patterns and are not oriented to help predict
changes over time.
Since spatial autocorrelation studies have drawbacks when conducting spatio-temporal pattern discov-
ery, we have chosen to use a spatial scan statistic to detect spatial outliers in gridded data. This is
described in Section 4.2.1. We then extend this algorithm to incorporate temporal aspects of the spatio-
temporal objects in order to discover moving outlier regions.
4.2.1 Spatial Scan Statistic: Exact-Grid
The Exact-Grid spatial scan statistic was introduced by Agarwal et al. (2006a). It finds every possible different rectangular region in the data using four sweep lines to bound them. Once found, a well-known spatial scan statistic known as Kulldorff's scan statistic is applied to give each rectangular region a discrepancy value that indicates how different it is from the rest of the dataset (Agarwal et al., 2006b).
The Kulldorff spatial scan statistic uses two values: a measurement and a baseline. The measurement is the number of incidences of an event, and the baseline is the total population at risk. For example, when finding disease clusters, the measurement m would be the number of cases of the disease and the baseline b would be the population at risk of catching the disease (Agarwal et al., 2006a).
To calculate the Kulldorff scan statistic $d(m, b, R)$ for a region $R$ with measurement value $m$ and
baseline value $b$, we first need to find the measurement $M$ and baseline $B$ values for the whole dataset,
where $M = \sum_{p \in U} m(p)$ and $B = \sum_{p \in U} b(p)$, and $U$ is a box enclosing the entire dataset.
We then use these global values to find $m$ and $b$ for the local region $R$, by letting
$$m_R = \frac{\sum_{p \in R} m(p)}{M} \quad \text{and} \quad b_R = \frac{\sum_{p \in R} b(p)}{B}.$$
Once these values have been found, all we need to do is perform a simple substitution into the Kulldorff
scan statistic, which is given by
$$d(m_R, b_R) = m_R \log\left(\frac{m_R}{b_R}\right) + (1 - m_R) \log\left(\frac{1 - m_R}{1 - b_R}\right)$$
if $m_R > b_R$, and 0 otherwise.
An example of the application of the Kulldorff scan statistic is provided in Figure 4.3. In this example,
the discrepancy of the shaded area is calculated by finding $M = 6$, $B = 16$, $m_R = \frac{4}{6}$ and $b_R = \frac{4}{16}$.
Substituting this into the formula, with the natural logarithm, gives a value of $d(m_R, b_R) = 0.3836$.
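The substitution above can be checked with a few lines of code. The following is a minimal sketch, not the implementation used in this thesis: the function name kulldorff is our own, the logarithm is taken to be the natural logarithm, and a region with zero baseline is assumed to have discrepancy 0.

```python
import math

def kulldorff(m_r, b_r):
    """Kulldorff discrepancy for a region holding fraction m_r of the total
    measurement and fraction b_r of the total baseline."""
    # The statistic is 0 unless the region is over-represented (m_r > b_r).
    # Assumption: a region with no baseline at all is also scored 0.
    if b_r <= 0.0 or m_r <= b_r:
        return 0.0
    d = m_r * math.log(m_r / b_r)
    # The second term tends to 0 as m_r tends to 1, so it is skipped at m_r = 1.
    if m_r < 1.0:
        d += (1.0 - m_r) * math.log((1.0 - m_r) / (1.0 - b_r))
    return d

# Values quoted in the text for Figure 4.3: m_R = 4/6 and b_R = 4/16.
print(round(kulldorff(4 / 6, 4 / 16), 4))  # 0.3836
```

The guard on m_r = 1 avoids evaluating log(0) when a region contains every incidence in the dataset.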
One of the most notable advantages of using the spatial scan statistic is that its ability to detect outliers
is unaffected by missing data regions. This is particularly important in geographical data, which often
contains a large number of missing values for regions and time periods.
An example of how the spatial scan statistic deals with missing data is provided in Figure 4.4. The
discrepancy of the shaded area is calculated by finding $M = 4$, $B = 12$, $m_R = \frac{3}{4}$ and $b_R = \frac{4}{12}$. From
these values it can be seen that we have considered the baseline measurement for the missing grids to be
zero, and so they do not affect the calculation of the discrepancy. The resulting value of the calculations
would be $d(m_R, b_R) = 0.87$.
FIGURE 4.3: An example grid for calculating Kulldorff's scan statistic
FIGURE 4.4: An example grid with missing data for calculating Kulldorff's scan statistic
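To illustrate why a zero baseline makes missing cells neutral, the sketch below scores a rectangle on a small hypothetical grid (it is not the grid of Figure 4.4). The helper names kulldorff and region_discrepancy and the example values are assumptions made for illustration only.

```python
import math

def kulldorff(m_r, b_r):
    """Kulldorff discrepancy (natural logarithm); 0 unless m_r > b_r > 0."""
    if b_r <= 0.0 or m_r <= b_r:
        return 0.0
    d = m_r * math.log(m_r / b_r)
    if m_r < 1.0:
        d += (1.0 - m_r) * math.log((1.0 - m_r) / (1.0 - b_r))
    return d

def region_discrepancy(m_grid, b_grid, rows, cols):
    """Discrepancy of the rectangle covering the inclusive ranges rows x cols."""
    total_m = sum(map(sum, m_grid))
    total_b = sum(map(sum, b_grid))
    cells = [(r, c) for r in range(rows[0], rows[1] + 1)
                    for c in range(cols[0], cols[1] + 1)]
    m = sum(m_grid[r][c] for r, c in cells)
    b = sum(b_grid[r][c] for r, c in cells)
    return kulldorff(m / total_m, b / total_b)

# Hypothetical 2 x 3 grid; the cell at row 0, column 2 is missing.
m_grid = [[1, 1, 0],
          [0, 0, 0]]
b_zero = [[1, 1, 0],      # missing cell given baseline 0
          [1, 1, 1]]
b_imputed = [[1, 1, 1],   # missing cell treated as observed and non-extreme
             [1, 1, 1]]

with_missing = region_discrepancy(m_grid, b_zero, (0, 0), (0, 2))
without_missing = region_discrepancy(m_grid, b_zero, (0, 0), (0, 1))
imputed = region_discrepancy(m_grid, b_imputed, (0, 0), (0, 2))
print(with_missing == without_missing, imputed < with_missing)
```

With the zero baseline, extending the rectangle over the missing cell leaves its discrepancy unchanged, whereas giving the missing cell a baseline of 1 lowers the discrepancy of the same rectangle.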
Exact-Grid Algorithm
The Exact-Grid algorithm was proposed by Agarwal et al. (2006a). It is given in
Figure 4.5. Exact-Grid uses 4 sweep lines to find all possible different shaped regions that are located
over a grid space. Each sweep line is one of the boundaries of the region whose discrepancy we are
calculating. That is, they define the left, right, bottom and top sides of a rectangular region. This is
illustrated in Figure 4.6, where the region being examined is shaded in blue.
The left sweep line moves from left to right, and the right sweep line is always at a position at least
one grid cell greater than the left sweep line, so that at least one cell is between the sweep lines. For
example, if the left sweep line is at position x = 2 then the right sweep line will sweep from x = 3 to
the maximum x value, until all regions are examined.
Each time the left or right sweep lines move, we examine all positions of the bottom and top sweep lines
in a similar fashion. That is, once the positions of the left and right sweep lines have been chosen, we
check all different sized regions that are bounded by the bottom and top sweep lines. The horizontal
sweep lines move from the bottom to the top of the grids. For each position of the bottom line, we check
each position of the top sweep line that is at least one cell greater than the bottom sweep line. That is, if
the bottom sweep line is at position y = 2 then the top sweep line will move from position y = 3 until
the maximum y value, until all the different bounding regions are examined.
The Exact-Grid algorithm takes $O(g^4)$ time to run, since there are $O(g^4)$ rectangles to consider. Running
time is minimised by maintaining a count of the m and b values for each row between the left and right
scan lines. By doing this they are able to calculate the Kulldorff discrepancy value in constant time.
ALGORITHM: Exact-Grid
INPUT:
(1) G: a g × g grid with values m(i, j), b(i, j)
OUTPUT:
(1) max: highest discrepancy region
for i = 1 to g do {Left Sweep Line}
    Initialize m[y] = m(i, y), b[y] = b(i, y) for all y
    for y = 2 to g do
        m[y] += m[y - 1], b[y] += b[y - 1]
    for j = i + 1 to g do {Right Sweep Line}
        m = 0, b = 0
        for y = 1 to g do
            m += m(j, y), b += b(j, y)
            m[y] += m, b[y] += b
        for k = 1 to g do {Bottom Sweep Line}
            for l = k to g do {Top Sweep Line}
                if k = 1 then
                    m = m[l], b = b[l]
                else
                    m = m[l] - m[k - 1]
                    b = b[l] - b[k - 1]
                if (d(m, b) > max) then
                    max = d(m, b)
FIGURE 4.5: ALGORITHM: Exact-Grid
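The sweep-line procedure can be sketched in runnable form as follows. This is an illustrative reading of Figure 4.5 rather than the code of Agarwal et al. (2006a): it uses 0-based indices, treats the sweep lines as inclusive cell ranges (so single-column and single-row rectangles are also scored, which is our assumption), stores the cumulative row sums with a leading zero so that no special case is needed for the bottom row, and normalises the sums by the global totals before calling the discrepancy function. The names exact_grid, m_demo and b_demo are hypothetical.

```python
import math

def kulldorff(m_r, b_r):
    """Kulldorff discrepancy (natural logarithm); 0 unless m_r > b_r > 0."""
    if b_r <= 0.0 or m_r <= b_r:
        return 0.0
    d = m_r * math.log(m_r / b_r)
    if m_r < 1.0:
        d += (1.0 - m_r) * math.log((1.0 - m_r) / (1.0 - b_r))
    return d

def exact_grid(m, b):
    """Return (max discrepancy, (i, j, k, l)) over all rectangles of a g x g grid.

    m[x][y] and b[x][y] hold the measurement and baseline of column x, row y.
    The rectangle spans columns i..j and rows k..l, all inclusive.
    """
    g = len(m)
    total_m = sum(map(sum, m))
    total_b = sum(map(sum, b))
    if total_m == 0 or total_b == 0:
        return 0.0, None
    best, best_region = 0.0, None
    for i in range(g):                       # left sweep line
        # cm[y + 1] = sum of m over columns i..j and rows 0..y (cm[0] = 0)
        cm = [0.0] * (g + 1)
        cb = [0.0] * (g + 1)
        for j in range(i, g):                # right sweep line
            run_m = run_b = 0.0
            for y in range(g):               # add column j's prefix sums
                run_m += m[j][y]
                run_b += b[j][y]
                cm[y + 1] += run_m
                cb[y + 1] += run_b
            for k in range(g):               # bottom sweep line
                for l in range(k, g):        # top sweep line
                    m_r = (cm[l + 1] - cm[k]) / total_m
                    b_r = (cb[l + 1] - cb[k]) / total_b
                    d = kulldorff(m_r, b_r)  # constant time per rectangle
                    if d > best:
                        best, best_region = d, (i, j, k, l)
    return best, best_region

# Hypothetical 4 x 4 grid with a block of extreme cells in the centre.
m_demo = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
b_demo = [[1] * 4 for _ in range(4)]
print(exact_grid(m_demo, b_demo)[1])  # (1, 2, 1, 2)
```

Because the cumulative sums are updated as the right sweep line advances, each of the O(g^4) rectangles is scored with two subtractions and one call to the discrepancy function.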
FIGURE 4.6: Exact-Grid Sweep Lines
4.3 Limitations of Previous Work
There are two limitations in previous work that we have identified, and that we solve through our
research.
While Exact-Grid is able to discover outliers in spatial data, it is not able to find outliers in
spatio-temporal data. Without this extension, finding spatio-temporal outliers would require the examination
of each time snapshot to manually identify outliers that are in the same spatial and temporal
neighbourhoods, which would be a difficult and time consuming task. This is compounded by the fact that due to
the high dimensionality of the data, many regions would need to be examined.
In addition, the Exact-Grid algorithm only finds the maximum discrepancy region, rather than the top-k
maximum discrepancy regions, which is important when we want to find more than one outlier region. A
single outlier region finds the most different region in the data but ignores all the other regions which may
also be spatial outliers. This is a problem where more than one region may be significant. For example,
if we are looking at a map containing precipitation levels, then knowing the most significant outlier
would only reveal data about one location, when there may be many outliers of importance contained
in the data. For instance, this could occur when there is heavy precipitation in both Peru and Brazil.
Finding only the highest discrepancy region may only notice heavy precipitation in Brazil because it has
a higher discrepancy value, while ignoring the heavy precipitation in Peru.
4.4 Our Contribution
To overcome the limitations described in the previous section, we have developed three algorithms:
(1) Exact-Grid Top-k: to find the top-k outliers in any given time snapshot;
(2) Outstretch: to find all spatio-temporal outlier sequences and store them in an outlier tree; and,
(3) RecurseNodes: to retrieve all possible sequences from the outlier tree.
Finding more than one high discrepancy region introduces new issues that need to be addressed, such
as what to do when there are two regions in the top-k list of high discrepancy regions that overlap one
another, or a chain of regions which overlap each other. We discuss this in detail in Section 4.4.1.
Discovering a sequence of spatial outliers over time, or spatio-temporal outliers, is also not a simple task,
since the maximum discrepancy regions are rarely the same from one time period to the next. Therefore
we need to work out which outliers are related over time. To do this we introduced a stretching window,
which sets the spatial boundary of outliers in subsequent time periods which can be considered to be the
next outlier in a sequence. Once we have the outlier sequences, we store them in an outlier tree so as to
optimise retrieval operations.
RecurseNodes is the final algorithm which retrieves the sequences from the outlier tree and returns them
as a list of spatio-temporal outlier sequences.
The final resulting output from the above steps is a list of all sequences and subsequences of outliers that
are found in the dataset. Each of these steps is detailed in the following subsections.
4.4.1 Exact-Grid Top-k
Our extension to the Exact-Grid algorithm, called Exact-Grid Top-k, finds the top-k outliers for each
time period.
As the Exact-Grid algorithm only finds the single highest discrepancy outlier, it did not have to take into
account overlapping regions, as any region with a lower discrepancy was simply replaced. When adding
regions to the list of top-k regions, however, we need to consider the case where there are overlapping
regions, or we could end up with a list of top-k regions that cover the same area. This is illustrated in
Figure 4.7, where the green region is overlapping the blue region.
FIGURE 4.7: The overlap problem
The different types of overlap that we considered are shown in Figure 4.8.
However, simply eliminating the lower discrepancy region of the two is not always the best solution,
particularly if the regions only overlap slightly, as this could eliminate some potentially interesting
outliers. Therefore, we have introduced a threshold parameter that allows us to specify the maximum
amount of allowable overlap between regions.
Another issue that had to be dealt with is the scenario where there is a chain of overlaps, as shown
in Figure 4.9. In this scenario, the discrepancy of the blue region is less than that of the green, and
the discrepancy of the green is less than the yellow (d(blue) < d(green) < d(yellow)). If we are
eliminating based on the highest discrepancy and we find the blue region first and add it to our list of
top-k outliers, when we find the green region, it will replace the blue region in the top-k outlier list.
Then, when we find the yellow region, it will replace the green region in the list of top-k outliers. This
creates a chain effect, which is problematic since the blue region may be quite different or far from the
yellow region and yet has been eliminated.
One option that was considered was to form a union between the two regions, and then if the union was
of higher discrepancy, discard the other two outliers and store the union in the list of top-k. This concept
is shown in Figure 4.10. However, this would have decreased the efficiency of our algorithm, as it would
be more difficult to search irregular shapes for overlaps.
FIGURE 4.8: All possible overlap types between two regions, panels (a) to (i)
To overcome this problem, we chose to allow some overlap between regions. The amount of overlap is
specified as a percentage, and the algorithm allows the user to vary this to the most appropriate amount
for their particular application domain. The procedure is described in the following paragraphs.
The Exact-Grid Top-k algorithm finds the top-k outliers for each time period by keeping track of the
highest discrepancy regions as they are found. As it iterates through all the region shapes, it may find a
new region that has a discrepancy value higher than the lowest discrepancy value (the k-th value) of the top-k
regions so far. We then need to determine if this region should be added to the list of top-k regions. To
do this we need to determine the amount of overlap that the new region has with regions already in the
top-k.
FIGURE 4.9: The chain overlap problem
FIGURE 4.10: The union solution to the overlap problem
For any top-k region that this new candidate region overlaps with, we first calculate the percentage of
overlap between the two regions. If it overlaps more than the maximum overlap percentage specified
by the user, then we compare the discrepancy values of the two regions. If the new region has a higher
discrepancy value, it will replace the other region, otherwise the new region will not be added. In the case
where the percentage overlap is less than the specified maximum allowable overlap, the
region will be added to the list of top-k, provided that it does not violate the overlap threshold condition
with any of the other top-k regions. This algorithm is shown in Figure 4.11, which calls the subroutine
in Figure 4.12.
ALGORITHM: Exact-Grid Top-k
INPUT:
(1) G: a g × g grid with values m(i, j), b(i, j)
(2) max_overlap: The maximum allowed overlap.
OUTPUT:
(1) topk: Top-k highest discrepancy regions
for i = 1 to g do {Left Sweep Line}
    Initialize m[y] = m(i, y), b[y] = b(i, y) for all y
    for y = 2 to g do
        m[y] += m[y - 1], b[y] += b[y - 1]
    for j = i + 1 to g do {Right Sweep Line}
        m = 0, b = 0
        for y = 1 to g do
            m += m(j, y), b += b(j, y)
            m[y] += m, b[y] += b
        for k = 1 to g do {Bottom Sweep Line}
            for l = k to g do {Top Sweep Line}
                if k = 1 then
                    m = m[l], b = b[l]
                else
                    m = m[l] - m[k - 1]
                    b = b[l] - b[k - 1]
                if (d(m, b) > topk(k)) then {topk(k) is the k-th, lowest, value in the top-k list}
                    c = the current region
                    topk = update_topk(c, topk, max_overlap)
FIGURE 4.11: ALGORITHM: Exact-Grid Top-k
Exact-Grid Top-k computes the overlapping region in $O(n)$ using the subroutine in Figure 4.12, since it
has to check the new potential top-k region against all previous regions for overlap. Because of this, the
total time required by the algorithm is $O(n^5)$.
The Update Top-k subroutine calls the get_overlap method, which calculates the percentage overlap
between region c, the current region under examination, and each of the regions tk in the list of top-k
regions. If the overlap is less than the maximum allowable overlap then the region will be added to
the top-k and will bump the k-th highest discrepancy region off the list. Otherwise only the highest
discrepancy region will be kept in the top-k list.
SUBROUTINE: Update Top-k
INPUT:
(1) c: the current region under examination,
(2) topk: the current list of top-k highest discrepancy regions, and
(3) max_overlap: The maximum allowed overlap.
OUTPUT:
(1) topk: Top-k highest discrepancy regions
for tk = topk(1) to topk(size(topk)) do
    ov = get_overlap(c, tk)
    if ov < max_overlap then
        add c to topk
    else
        if (dval(c) > dval(tk)) then
            replace tk with c in topk
FIGURE 4.12: SUBROUTINE: Update Top-k
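One possible reading of the overlap rule is sketched below. Several details are our assumptions rather than part of the algorithm as specified: regions are stored as inclusive cell bounds, the percentage overlap is taken as the shared area divided by the area of the smaller region, the candidate is rejected if any clashing region has an equal or higher discrepancy, and the list is truncated to k after insertion. The names Region, area, get_overlap and update_topk are illustrative.

```python
from typing import List, NamedTuple

class Region(NamedTuple):
    x0: int    # left column (inclusive)
    x1: int    # right column (inclusive)
    y0: int    # bottom row (inclusive)
    y1: int    # top row (inclusive)
    d: float   # discrepancy value

def area(r: Region) -> int:
    return (r.x1 - r.x0 + 1) * (r.y1 - r.y0 + 1)

def get_overlap(a: Region, b: Region) -> float:
    """Shared cells as a fraction of the smaller region (assumed definition)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0) + 1
    h = min(a.y1, b.y1) - max(a.y0, b.y0) + 1
    if w <= 0 or h <= 0:
        return 0.0
    return w * h / min(area(a), area(b))

def update_topk(c: Region, topk: List[Region], k: int,
                max_overlap: float) -> List[Region]:
    """Return the new top-k list after offering candidate region c."""
    clashes = [t for t in topk if get_overlap(c, t) > max_overlap]
    # The candidate loses if any region it clashes with is at least as discrepant.
    if any(t.d >= c.d for t in clashes):
        return topk
    # Otherwise it replaces every region it clashes with.
    kept = [t for t in topk if t not in clashes]
    kept.append(c)
    kept.sort(key=lambda r: r.d, reverse=True)
    return kept[:k]

topk: List[Region] = []
topk = update_topk(Region(0, 1, 0, 1, 0.5), topk, k=5, max_overlap=0.10)
topk = update_topk(Region(1, 2, 1, 2, 0.8), topk, k=5, max_overlap=0.10)  # replaces the first
topk = update_topk(Region(5, 6, 5, 6, 0.3), topk, k=5, max_overlap=0.10)  # no overlap, added
print([tuple(r) for r in topk])
```

Dividing by the smaller area makes a small region that sits entirely inside a larger one count as fully overlapping; other normalisations, such as the area of the union, would be equally easy to substitute.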
4.4.2 The Outstretch Algorithm
This section describes our algorithm, known as Outstretch (Figure 4.15). Outstretch takes as input
the top-k values for each year period under analysis, and a variable r, the region_stretch, which is the
number of grids to stretch by on each side of an outlier. This is shown in Figure 4.13.
FIGURE 4.13: Region Stretch Size r
Outstretch then examines the top-k values of the second to last available year periods. For all the years,
each of the outliers from the current year is examined to see if it is framed by any of the stretched
regions from the previous year. If it is, the variable framed will return true, and the item will be
added to the end of the previous year's child list. In this way, the Outstretch algorithm stores all possible
sequences over all years into a tree structure.
An example of a possible tree structure is shown in Figure 4.14. The first node is the empty set. The
following nodes are each of the outliers found at different time periods. In the figure, each is labeled
with 2 numbers. The first indicates the time period from which the outlier comes, and the second is
an identification number. The Outstretch algorithm stores all single top-k outliers from each time period
as rows in a table, where each of these rows contains the outlier's children. An example of this table, which
corresponds with the example tree in Figure 4.14, is given in Table 4.1.
FIGURE 4.14: An example outlier tree built by the Outstretch algorithm
The example shown in Figure 4.16 shows how a sequence of outliers over three time periods can be
collected. In each diagram, the outlier is represented by the solid blue region, while the stretch region
is represented by the shaded blue region. In this example, the stretch size r equals 1. To begin, the
outlier from the first time period is found. Then the region is extended by r on all sides, and is searched
for outliers that are enclosed by it in the following year. If one is found that lies completely within the
stretch region, a new stretch region is generated around the new outlier. This new stretch region is then
searched for outliers in the third time period. This process continues until all time periods have been
examined or there are no outliers that fall completely within the stretch region. As each of these outlier
sequences is discovered, it is stored in a tree.
Outlier   Number of Children   Child List
(1,1)     2                    (2,1),(2,2)
(1,2)     2                    (2,3),(2,4)
(1,3)     0
(2,1)     2                    (3,1),(3,2)
(2,2)     0
(2,3)     2                    (3,3),(3,4)
(2,4)     1                    (3,5)
(2,5)     2                    (3,6),(3,7)
(3,1)     0
(3,2)     0
(3,3)     0
(3,4)     0
(3,5)     0
(3,6)     0
(3,7)     0
TABLE 4.1: The outlier table corresponding to the outlier tree shown in Figure 4.14
ALGORITHM: Outstretch
INPUT:
(1) k: yrly_topkvals; and
(2) r: region_stretch
OUTPUT:
(1) tr: outlier_tree
for yr = 2 to y do {y is the number of year periods}
    cur = yrly_topkvals(yr)
    for c = 1 to size(cur) do
        prev = yrly_topkvals(yr - 1)
        for p = 1 to size(prev) do
            framed = is_framed(cur(c), prev(p), r)
            if framed == true then
                tr(prev(p), len(tr(prev(p))) + 1) = cur(c)
FIGURE 4.15: ALGORITHM: Outstretch
The Outstretch algorithm runs in $O(n^3)$, since for each of the time periods available it iterates through
all the top-k outliers for that period, and compares them against all the outliers from the previous time
period.
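The framing test and the construction of the child lists can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: outliers are represented as inclusive cell bounds (x0, x1, y0, y1), nodes are labelled (period, id) with both numbers starting at 1 as in Table 4.1, the stretched region is not clipped at the edge of the grid, and the example regions are hypothetical.

```python
def is_framed(c, p, r):
    """True if outlier c lies completely inside outlier p stretched by r cells
    on every side. Regions are (x0, x1, y0, y1) with inclusive bounds."""
    return (p[0] - r <= c[0] and c[1] <= p[1] + r and
            p[2] - r <= c[2] and c[3] <= p[3] + r)

def outstretch(yearly_topk, r):
    """Build the outlier table: a dict mapping each node to its child list.

    yearly_topk[t] is the list of top-k regions for period t (0-based);
    nodes are labelled (period, id), both counted from 1.
    """
    tree = {(t + 1, i + 1): []
            for t, regions in enumerate(yearly_topk)
            for i in range(len(regions))}
    for t in range(1, len(yearly_topk)):                # second period onwards
        for ci, c in enumerate(yearly_topk[t]):         # current outliers
            for pi, p in enumerate(yearly_topk[t - 1]): # previous outliers
                if is_framed(c, p, r):
                    tree[(t, pi + 1)].append((t + 1, ci + 1))
    return tree

# Hypothetical outliers for three periods, stretched by r = 1.
periods = [
    [(2, 3, 2, 3)],
    [(3, 4, 2, 3), (8, 9, 8, 9)],
    [(4, 5, 3, 4)],
]
tree = outstretch(periods, r=1)
print(tree)
```

In this example the outlier drifts one cell per period, so it stays inside each stretched window and forms a single chain, while the distant region at period 2 has neither a parent nor a child.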
From the tree structure, all possible sequences can be extracted using a simple recursive algorithm,
described in Figure 4.17.
RecurseNodes takes 4 input variables. The first, outlier_tree, contains a list of each node and its children.
The second, sequence, contains the sequence nodes so far, not including the child value. The third,
child_list, is a list of all the children of the last sequence item. Finally, sequence_list is a list of all the
sequences that have been generated from the outlier_tree at any point in time.
The RecurseNodes algorithm has a running time of $O(n^y)$ where n is the total number of length-1
outliers, and y is the number of years of data and maximum possible length of an outlier sequence.
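The recursive extraction can be sketched using the child lists of Table 4.1 as input. The sketch assumes that the initial call passes every node as a child of the empty sequence, so that subsequences starting at later periods are also returned; the function and variable names are our own.

```python
def recurse_nodes(tree, seq, children, seq_list):
    """Append to seq_list every sequence that extends seq by a path through
    children, following the child lists stored in tree."""
    for c in children:
        new_seq = seq + [c]
        seq_list.append(new_seq)
        grandchildren = tree[c]
        if grandchildren:
            recurse_nodes(tree, new_seq, grandchildren, seq_list)
    return seq_list

# Child lists transcribed from Table 4.1; nodes are (period, id).
table_4_1 = {
    (1, 1): [(2, 1), (2, 2)], (1, 2): [(2, 3), (2, 4)], (1, 3): [],
    (2, 1): [(3, 1), (3, 2)], (2, 2): [], (2, 3): [(3, 3), (3, 4)],
    (2, 4): [(3, 5)], (2, 5): [(3, 6), (3, 7)],
    (3, 1): [], (3, 2): [], (3, 3): [], (3, 4): [],
    (3, 5): [], (3, 6): [], (3, 7): [],
}

# Start from every node so that all sequences and subsequences are listed.
sequences = recurse_nodes(table_4_1, [], list(table_4_1), [])
by_length = {}
for s in sequences:
    by_length[len(s)] = by_length.get(len(s), 0) + 1
print(len(sequences), by_length)  # 31 {1: 15, 2: 11, 3: 5}
```

For this table the procedure returns 15 sequences of length 1 (one per outlier), 11 of length 2 (one per parent and child pair) and 5 of length 3.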
4.4.3 Data
This section describes features that are specific to geographic data mining and must be taken into
consideration when using geographical data. This is discussed in relation to the precipitation dataset we
used.
Geographical Data Features
As previously mentioned in Section 2.8, classical data mining algorithms are not typically suited for
application to geographical data. Therefore when we are performing data mining, we need to consider
the data dependencies and special features of the data.
For the purpose of our experiments, we have removed some of the temporal dependencies in the data
through deseasonalisation. By removing the seasonal effect of precipitation, and instead considering
how much each value deviates from the value expected at that particular time of the year, we are able to
discover more interesting patterns.
Secondly, we have had to consider the effects of missing data. In our dataset, there are a large number
of missing values, mostly in regions that do not lie over land masses. One of the advantages of using
the spatial scan statistic is that it is able to discover significant outliers despite their close proximity to
regions that contain missing data.
For the Exact-Grid algorithm, one approach to dealing with missing values was to set the missing value
as the average value in the dataset. That would mean that the measurement m for the grid cell would
be 0 since it is not above the extreme threshold, while the baseline b for the grid cell would become 1.
This large baseline measure has an impact on the data, and causes larger grids to be selected as high
discrepancy regions due to the larger baseline population considered to be at risk. Instead, the approach
we have adopted for each missing grid is to set the baseline b to 0, and the measure m to 0. Our approach
to dealing with missing values is also described in Section 4.2.1.
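The treatment of missing cells described above can be sketched as a small conversion step. The function name, the use of None or NaN to mark a missing cell, and the example values are assumptions made for illustration; the threshold is taken as given.

```python
import math

def measurement_and_baseline(values, threshold):
    """Turn a grid of deseasonalised values into measurement and baseline grids.

    Observed cells get baseline 1 and measurement 1 only if they exceed the
    extreme threshold; missing cells (None or NaN) get 0 for both.
    """
    m_grid, b_grid = [], []
    for row in values:
        m_row, b_row = [], []
        for v in row:
            if v is None or math.isnan(v):
                m_row.append(0)   # missing: contributes no incidence
                b_row.append(0)   # missing: contributes no population at risk
            else:
                m_row.append(1 if v > threshold else 0)
                b_row.append(1)
        m_grid.append(m_row)
        b_grid.append(b_row)
    return m_grid, b_grid

# Hypothetical 2 x 2 grid with two missing cells and a threshold of 1.0.
values = [[0.2, None],
          [1.5, float("nan")]]
m_grid, b_grid = measurement_and_baseline(values, threshold=1.0)
print(m_grid, b_grid)  # [[0, 0], [1, 0]] [[1, 0], [1, 0]]
```

Because missing cells add nothing to either grid, the global totals M and B count observed cells only.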
By taking into consideration the above features of geographical data, we have helped to ensure that
interesting patterns can be discovered.
4.4.4 Preprocessing
There are two steps involved in preprocessing the data prior to applying the Outstretch algorithm.
First, we need to deseasonalise it so that interesting patterns can be found. The deseasonalisation proce-
dure is the same as that described in Section 2.11.2.
Once we have completed this deseasonalisation procedure, before running the Exact-Grid Top-k algo-
rithm, we take the mean of the deseasonalised values over each period, for each grid.
This data is then examined using our spatio-temporal outlier detection algorithm to discover spatio-
temporal outliers.
4.5 Experimental Setup
The effectiveness of the Outstretch algorithm is evaluated by counting the total number of spatio-
temporal outliers in the outlier tree, and the length of these outliers.
For this experiment, we set the input variables as shown in Table 4.2.
Variable                Value
Allowable Overlap       10%
Number of top-k         5
Extreme Precipitation   90%
Region Stretch          1
TABLE 4.2: Experiment 1 Variable Setup
The first variable, Allowable Overlap, is the maximum allowable overlap between two regions. The
allowable overlap depends on the application setting and the type and size of patterns that the individual
performing the analysis aims to discover. In our experiments, we use an allowable threshold of 10%.
This means that if two regions are overlapping by less than 10%, we will not replace the one with the
lower discrepancy value with the other as normal. This enables us to find outlier regions that are located
nearby but represent different regions.
The second, Number of top-k, is the maximum number of high discrepancy regions that are to be found
by the Exact-Grid Top-k algorithm. In this case, we chose to find a maximum of 5 high discrepancy
regions.
The third, Extreme Precipitation, sets the threshold percentage that the deseasonalised average
precipitation must exceed to be considered extreme. In our experiments we chose the 90th percentile of
values. This means that in S
The final variable, Region Stretch, describes the number of grids to extend beyond the outlier to check
for outliers that fall completely inside it in the subsequent time period, as shown in Figure 4.13. In these
experiments, we stretch each outlier region by 1 grid cell on each of its sides to find outliers in subsequent
time periods.
Table 4.3 describes the subset of South American precipitation data that we use in our experiments. The
subset of data consists of data from the 10 year period from 1994 to 2004. It contains 713 grids of
2.5° × 2.5°, which are defined by 31 latitudes and 23 longitudes.
Variable                    Value
Number of Years (Periods)   10
First Year                  1994
Last Year                   2004
Grid Size                   2.5° × 2.5°
Num Latitudes               31
Num Longitudes              23
Total Grids                 713
TABLE 4.3: Experiment 1 Data Setup
4.6 Experimental Results
The application of our algorithm to South American precipitation data involved three steps. First, each
of the top-k highest discrepancy regions is found for each time period. Second, a tree is built to store
all possible sequences. Third, and finally, all possible sequences and sub-sequences are extracted from
the tree.
The first stage of our algorithm involved finding the top-k outliers at each time step using the Exact-Grid
Top-k algorithm. The results of this are shown in Table 4.4, which shows that when we set k = 5, we
were able to find 5 top outlier regions for all years except for 2003, where only 3 were found.
Year    Number of top-k Outliers
1994    5
1995    5
1996    5
1997    5
1998    5
1999    5
2000    5
2001    5
2002    5
2003    3
2004    5
Total   48
TABLE 4.4: Number of top-k outliers found for each year
From the second and third stages of our algorithm, we found a total of 8819 outlier sequences and sub-
sequences over a 10 year period. These sequences ranged from a minimum length of 1 to a maximum
length of 10, corresponding to year periods in the subset of data we used. The results of this are reported
in Table 4.5 and summarised in Figure 4.18, which show the number of outliers and the length of these
outliers.
Length of Outliers Total Number Found
All 8819
1 48
2 98
3 151
4 241
5 458
6 763
7 1217
8 1621
9 2172
10 2050
TABLE 4.5: Length and Number of Outlier Sequences Found
4.7 Discussion
In this Chapter, we have provided a complete definition of a spatio-temporal outlier. Based on this
definition, with the aim of developing a method to discover outliers in grid-based spatio-temporal data,
we have introduced the Exact-Grid Top-k algorithm, which finds the top-k high discrepancy regions
from a spatial grid. This is our extension to the Exact-Grid algorithm. We have also furthered the
Exact-Grid Top-k algorithm to include the ability to discover spatio-temporal outlier sequences that
change location, shape and size, with our Outstretch algorithm that stores the discovered sequences
into a tree structure. And finally, to extract all possible sequences from the tree, we have provided the
RecurseNodes algorithm.
Our results demonstrate the successful application of our algorithm to precipitation data. We have shown
that our algorithm is capable of finding spatial outlier sequences and subsequences that occur over
several time periods in the South American precipitation data.
Future work in this area could use a similar approach to that taken in this thesis, but instead apply it to
point data. One such algorithm has been proposed by Agarwal et al. (2006a), who introduce an algorithm
known as Exact to discover high discrepancy regions in point data.
(a) Outlier at t=1
(b) Outlier at t=2
(c) Outlier at t=3
FIGURE 4.16: Example Outstretch Algorithm Sequence
ALGORITHM: RecurseNodes
INPUT:
(1) tr: outlier_tree
(2) seq: sequence
(3) ch_list: child_list
(4) seq_list: sequence_list
OUTPUT:
(1) seq_list: sequence_list
for c = ch_list(1) to ch_list(size(ch_list)) do
    new_seq = seq + c
    // append new_seq to the end of seq_list:
    seq_list(len(seq_list) + 1) = new_seq
    // get the grandchildren:
    gchildr = tr(c)
    if size(gchildr) > 0 then
        seq_list = RecurseNodes(tr, new_seq, gchildr, seq_list)
FIGURE 4.17: ALGORITHM: RecurseNodes
FIGURE 4.18: Length and Number of Outliers Found
CHAPTER 5
Conclusion
5.1 Summary of Contributions
The work conducted in this thesis addresses the four main areas we identified at the start of this project.
The first contribution was to provide a quantitative analysis of the behaviour of precipitation extremes
between regions. Each region was analysed using EVT to determine the extremal behaviour of the
region, as done in previous research. Our extension used the local Moran I spatial autocorrelation
statistic to assign a value to each region that quantitatively described their similarity or difference to
neighbouring regions, which was not done by previous research.
The second contribution was to analyse how the spatial autocorrelation of regions varied with the
strength of an El Niño event. By taking the mean local Moran I value over all regions and the mean
SOI value, for each time period, we discovered that during a stronger El Niño event, the spatial
autocorrelation between regions is weaker. We then used bootstrapping to validate our results. To the best of
our knowledge, spatio-temporal analysis of this kind has not previously been done.
The third contribution was a discussion and refinement of the definition of a spatio-temporal outlier.
Before discovery of spatio-temporal outliers could be performed, a more precise definition and discussion
of the properties of a spatio-temporal outlier was needed. Taking into account previous
definitions of spatio-temporal objects and spatio-temporal outliers, we have described a new definition of
a spatio-temporal outlier, since earlier definitions excluded some potential patterns that could be mined
from the data. Following on from this, as our fourth contribution, we developed an algorithm to mine
spatio-temporal outliers that meet our definition.
The fourth contribution is the discovery of spatio-temporal outliers from gridded data. To do this we
first examined each time snapshot to discover outliers, then we used a stretched window to find outliers
in subsequent snapshots. This would show a region that is moving over a number of snapshots. These
then had to be stored and retrieved. To complete these tasks we extended the Exact-Grid algorithm from
previous work to find the top-k outliers, rather than only the most discrepant outlier. To then incorporate
time into the pattern discovery, we developed the Outstretch algorithm, which finds outliers that are
related from one time snapshot to the next. Then an outlier tree storage and retrieval method was
defined for efficient retrieval.
By performing each of these tasks we have satised the aims of our research and contributed advance-
ments in spatio-temporal analysis and data mining of spatio-temporal and geographic data.
5.2 Discussion and Future Work
The contributions we have made to the discovery of interesting patterns from spatio-temporal databases
are significant because they provide tools that knowledge seekers can use. This has been shown through
our application of these methods to South American precipitation data, where we have been able to
quantify and analyse precipitation extremes using EVT, and where we have discovered moving outliers.
There are a number of directions that future work based on this research could follow.
The Getis-Ord G statistic could be used in addition to the Moran I statistic to provide another type
of quantitative analysis of the spatial autocorrelation between regions. This statistic has the ability to
distinguish between high and low values where positive autocorrelation exists, so may provide new
information about inter-regional extremal behaviour.
As more and more precipitation data is collected from sensor networks, the potential to discover
interesting patterns increases. Since there have only been a few major El Niño events during the periods
we have examined, studies in the future could benefit from an increased amount of data available for
analysis.
Another option is to study the data for not only strong El Niño events but also moderate events. This
would mean that more temporal periods could be examined and compared to the mean local Moran I
values to determine the spatial autocorrelation with other types of El Niño events.
Spatially, we have taken the mean of the whole region of South America and compared this to the SOI.
The data could be broken up into several sub-regions in South America, such as by political divisions, or
by larger sub-grids. Each of these subdivisions could then be compared to the SOI. From this we could
see if the strength of an El Niño event causes weaker spatial autocorrelation as a general trend over all
subdivisions. The reason we did not perform this is that our study maintains a focus on the techniques
used rather than on studying the behaviour of hydrologic patterns in detail.
While our study uses 2.5° gridded precipitation data, there are also 1° grids available. The reason we
did not use 1° grids is because they contain a lot more missing data. However, as future data is collected
and past data gaps are filled, there could be the potential for more interesting patterns to be discovered.
In terms of the methodological approach taken, we have only analysed univariate (single variable) pre-
cipitation values. The study of multivariate spatio-temporal data using similar methods could prove very
interesting, and provides a challenge suitable for the application of data mining.
In addition to spatio-temporal analysis, we have also made contributions in the area of spatio-temporal
data mining. In this area, spatio-temporal outlier detection presents a variety of other challenges that
future research could tackle.
Performance improvements could be made to the Exact-Grid Top-k algorithm. As an alternative to
the Exact-Grid algorithm, another algorithm, Approx-Grid, was proposed by Agarwal et al. (2006a) to
improve efficiency. This algorithm could be extended to find the top-k outliers for situations where speed
is an important consideration. The drawback of this approach is that the algorithm exploits and therefore
relies on there only being a single maximum discrepancy region. Modifications could potentially be
made to this to make it able to discover more than one outlier region.
Another significant issue is to find a better way of interpreting the results. On completion, our method
delivers a list of outliers to the user. Some visualisation techniques could help the user to
interpret and understand the results. An attempt at the visualisation of spatio-temporal data mining
results has been made by Kechadi and Bertolotto (2006), but their visualisation technique models different
phenomena.
In this study, we have met the goals that we set out to achieve through our spatio-temporal analysis
and data mining techniques. We have shown that these techniques can be applied to real precipitation data to
find interesting spatio-temporal patterns. We have also uncovered patterns that may be of interest to the
hydrological community in relation to ENSO and precipitation extremes. Given such a wide variety of
future research challenges, this work has provided a good foundation in an area that has not yet been
widely investigated.
References
Nabil R. Adam, Vandana Pursnani Janeja, and Vijayalakshmi Atluri. 2004. Neighborhood based detec-
tion of anomalies in high dimensional spatio-temporal sensor datasets. In Proceedings of the 2004
ACM symposium on Applied computing SAC '04, pages 576-583. ACM Press, March.
Deepak Agarwal, Andrew McGregor, Jeff M. Phillips, Suresh Venkatasubramanian, and Zhenguan Zhu.
2006a. Spatial scan statistics: Approximations and performance study. In KDD '06: Proceedings
of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
24-33.
Deepak Agarwal, Jeff M. Phillips, and Suresh Venkatasubramanian. 2006b. The hunting of the bump:
On maximizing statistical discrepancy. In Proc. 17th Ann. ACM-SIAM Symp. on Disc. Alg., pages
1137-1146, January.
Rob Allen, Janette Lindesay, and David Parker. 1996. El Nino Southern Oscillation and climatic
variability. CSIRO Publishing.
Fabrizio Angiulli, Stefano Basta, and Clara Pizzuti. 2005. Detection and prediction of distance-based
outliers. In ACM Symposium on Applied Computing.
Luc Anselin. 1995. Local indicators of spatial association - LISA. Geographical Analysis, 27(2):93-115.
Cláudia M. Antunes and Arlindo L. Oliveira. 2001. Temporal data mining: an overview. In Proceedings
of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
V. Barnett and T. Lewis. 1994. Outliers in statistical data. John Wiley & Sons, Chichester, UK.
Derya Birant and Alp Kut. 2006. Spatio-temporal outlier detection in large databases. In 28th Interna-
tional Conference on Information Technology Interfaces, pages 179-184.
Derya Birant and Alp Kut. 2007. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data
& Knowledge Engineering, 60(1):208-221, January.
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: Identifying
density-based local outliers. ACM SIGMOD Record, 29(2):93-104, June.
Wilfried Brutsaert. 2005. Hydrology: An introduction. Cambridge University Press.
Huiping Cao, Nikos Mamoulis, and D.W. Cheung. 2006. Discovery of collocation episodes in spa-
tiotemporal data. In Sixth International Conference on Data Mining, 2006, pages 823-827.
Huiping Cao, Nikos Mamoulis, and D.W. Cheung. 2007. Discovery of periodic patterns in spatiotem-
poral sequences. IEEE Transactions on Knowledge and Data Engineering, 19(4):453-467.
Mete Celik, Shashi Shekhar, James P. Rogers, James A. Shine, and Jin Soung Yoo. 2006. Mixed-drove
spatio-temporal co-occurrence pattern mining: A summary of results. In Proceedings of the sixth
International Conference on Data Mining, pages 119-128. IEEE Computer Society.
Sanjay Chawla, Shashi Shekhar, Weili Wu, and Uygar Ozesmi. 2001. Modelling spatial dependencies
for mining geospatial data: An introduction. Geographic Data Mining and Knowledge Discovery,
Taylor & Francis, New York, NY, pages 131-159.
Tao Cheng and Zhilin Li. 2006. A multiscale approach for spatio-temporal outlier detection. Transac-
tions in GIS, 10(2):253-263.
A.D. Cliff and J.K. Ord. 1973. Spatial Autocorrelation. Pion Limited.
Stuart Coles. 2001. An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag London.
Joseph S. D'Aleo and Pamela G. Grube. 2002. The Oryx Resource Guide to El Niño and La Niña. Oryx
Press, CT, USA.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm
for discovering clusters in large spatial databases with noise. In Proceedings of 2nd International
Conference on Knowledge Discovery and Data Mining, pages 226-231, Portland, OR.
Martin Ester, Hans-Peter Kriegel, and Jörg Sander. 2001. Algorithms and applications for spatial data
mining. Geographic Data Mining and Knowledge Discovery, Taylor & Francis, New York, NY, pages
160-187.
Christos Faloutsos, M. Ranganathan, and Yannis Manolopoulos. 1994. Fast subsequence matching in time
series databases. In Proceedings of the 1994 ACM SIGMOD international conference on Management
of data, pages 419-429, Minneapolis, MN, USA. ACM Press.
Alexander Gershunov. 1998. ENSO influence on intraseasonal extreme rainfall and temperature fre-
quencies in the contiguous United States: Implications for long-range predictability. Monthly Weather
Review, 11(12):3192-3203.
Arthur Getis and J.K. Ord. 1992. The analysis of spatial association by use of distance statistics.
Geographical Analysis, 24(3):189-206, July.
Győző Gidófalvi and Torben Bach Pedersen. 2005. Spatio-temporal rule mining: Issues and techniques.
Data Warehousing and Knowledge Discovery, 3589:275-284.
Michael Howard Glantz. 2001. Currents of Change: Impacts of El Niño and La Niña on Climate and
Society, 2nd Edition. Cambridge University Press.
Michael F. Goodchild. 1986. Spatial Autocorrelation. Norwich : Geo Books.
Joachim Gudmundsson, Marc van Kreveld, and Bettina Speckmann. 2004. Efficient detection of motion
patterns in spatio-temporal data sets. In Proceedings of the 12th annual ACM international workshop
on Geographic information systems, pages 250-257. ACM Press.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. 1998. CURE: an efficient clustering algorithm for
large databases. In Proceedings of the ACM SIGMOD International Conference on Management of
Data, pages 73-84, Seattle, WA.
D. Hawkins. 1980. Identification of Outliers. Chapman and Hall.
George M. Hornberger, Jeffrey P. Raffensperger, Patricia L. Wiberg, and Keith N. Eshleman. 1998.
Elements of Physical Hydrology. The Johns Hopkins University Press.
Yan Huang, Hui Xiong, Shashi Shekhar, and Jian Pei. 2003. Mining confident colocation rules without a
support threshold. In Proceedings of the 2003 ACM Symposium on Applied Computing (SAC), March
9-12, 2003, Melbourne, FL, USA, pages 497-501. ACM Press.
Ted Johnson, Ivy Kwok, and Raymond Ng. 1998. Fast computation of 2-dimensional depth contours.
In Knowledge Discovery and Data Mining.
Panos Kalnis, Nikos Mamoulis, and Spiridon Bakiras. 2005. On discovering moving clusters in spatio-
temporal data. Lecture Notes in Computer Science, 3633/2005:364-381.
M-Tahar Kechadi and Michela Bertolotto. 2006. A visual approach for spatio-temporal data mining. In
2006 IEEE International Conference on Information Reuse and Integration, pages 504-509. IEEE,
September.
Eamonn Keogh and Padhraic Smyth. 1997. A probabilistic approach to fast pattern matching in time
series databases. In Proceedings of the third international conference on knowledge discovery and
data mining, pages 24-30, Menlo Park, CA, USA. AAAI Press.
Shiraj Khan, Auroop R. Ganguly, Sharba Bandyopadhyay, Sunil Saigal, David J. Erickson III, Vladimir
Protopopescu, and George Ostrouchov. 2006. Nonlinear statistics reveals stronger ties between ENSO
and the tropical hydrological cycle. Geophysical Research Letters, 33(24):1, December.
Shiraj Khan, Gabriel Kuhn, Auroop R. Ganguly, David Julius Erickson III, and George Ostrouchov.
2007. Spatio-temporal variability of daily and weekly precipitation extremes in South America. Water
Resources Research, 43(11), November.
M. Kulldorff. 1997. A spatial scan statistic. Comm. in Stat.: Th. and Meth., 26:1481-1496.
Srivatsan Laxman and P. S. Sastry. 2006. A survey of temporal data mining. SADHANA: Academy Pro-
ceedings in Engineering Sciences, Special Issue on Statistical Techniques in Electrical and Computer
Engineering, 31(2):173-198.
Eric M. H. Lee and Keith C. C. Chan. 2006. Discovering association patterns in large spatio-temporal
databases. In Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops,
pages 349-354, Washington, DC, USA. IEEE Computer Society.
Wan-Jui Lee, Jung-Yi Jiang, and Shie-Jue Lee. 2004. An efficient algorithm to discover calendar-based
temporal association rules. In Proceedings of the IEEE International Conference on Systems, Man &
Cybernetics, pages 3122-3127, Los Alamitos, CA, USA. IEEE Computer Society.
Yingjiu Li, Peng Ning, X. Sean Wang, and Sushil Jajodia. 2001a. Discovering calendar-based temporal
association rules. In Proceedings of the Eighth International Symposium on Temporal Representation
and Reasoning, pages 111-118, Los Alamitos, CA, USA. IEEE Computer Society.
Yingjiu Li, Peng Ning, X. Sean Wang, and Sushil Jajodia. 2001b. Discovering calendar-based temporal
association rules. In Eighth International Symposium on Temporal Representation and Reasoning
(TIME01), page 111.
Brant Liebmann and D. Allured. 2005. Daily precipitation grids for South America. Bull. Amer. Meteor.
Soc., 86:1567-1570.
Weiqiang Lin, Mehmet A. Orgun, and Graham J. Williams. 2002. An overview of temporal data mining.
In Proceedings of the 1st Australian Data Mining Workshop, pages 8309, Canberra, ACT, Australia.
Nikos Mamoulis, Marios Hadjieleftheriou, Huiping Cao, Yufei Tao, George Kollios, and David W.
Cheung. 2004. Mining, indexing, and querying historical spatiotemporal data. In Proceedings of
the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages
236-245, NY, NY, USA. ACM.
R. Douglas Martin and Victor Yohai. 2001. Data mining for unusual movements in temporal data. In
7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop
on Temporal Data Mining, San Francisco, CA, USA.
Harvey J. Miller and Jiawei Han. 2001. Geographic data mining and knowledge discovery: An overview
in Geographic Data Mining and Knowledge Discovery. Taylor & Francis, New York, NY.
Harvey J. Miller. 2007. Geographic data mining and knowledge discovery. Wilson and A. S. Fothering-
ham (eds.) Handbook of Geographic Information Science.
Raymond T. Ng and Jiawei Han. 1994. Efficient and effective clustering methods for spatial data
mining. In 20th International Conference on Very Large Data Bases, pages 144-155, Santiago,
Chile, September.
Raymond T. Ng. 2001. Detecting outliers from large datasets in Geographic Data Mining and Knowl-
edge Discovery. Taylor & Francis, New York, NY.
Tim Oates. 1999. Identifying distinctive subsequences in multivariate time series by clustering. In
S. Chaudhuri and D. Madigan, editors, Fifth International Conference on Knowledge Discovery and
Data Mining, pages 322-326, San Diego, CA, USA. ACM Press.
Stan Openshaw. 1999. Geographical data mining: key design issues. In Proceedings of GeoComputation
'99.
J. K. Ord and Arthur Getis. 1995. Local spatial autocorrelation statistics: Distributional issues and an
application. Geographical Analysis, 27(4):286-306.
Jia-Dong Ren, Jie Bao, and Hui-Yu Huang. 2003. The research on spatio-temporal data model and
related data mining. In Proceedings of the Second International Conference on Machine Learning
and Cybernetics, pages 37-40. IEEE, November.
John F. Roddick and Myra Spiliopoulou. 1999. A bibliography of temporal, spatial and spatio-temporal
data mining research. ACM SIGKDD Explorations Newsletter, 1(1):34-38.
C. F. Ropelewski and M. S. Halpert. 1987. Global and regional scale precipitation patterns associated
with the El Niño Southern Oscillation. Monthly Weather Review, 115(8):1606-1626.
Shashi Shekhar, Chang-Tien Lu, and Pusheng Zhang. 2003. A unified approach to detecting spatial outliers.
GeoInformatica, 7:139-166.
Shashi Shekhar and Sanjay Chawla. 1993. Spatial Databases: A Tour. Prentice Hall.
Shashi Shekhar and Yan Huang. 2001. Discovering spatial co-location patterns: A summary of
results. Lecture Notes in Computer Science: Proceedings of Advances in Spatial and Temporal
Databases : 7th International Symposium, SSTD 2001, Redondo Beach, CA, USA, July 12-15, 2001,
2121/2001:236.
Shashi Shekhar and Ranga Raju Vatsavai. 2002. Spatial data mining research by the spatial database
research group, University of Minnesota.
Shashi Shekhar, Pusheng Zhang, Yan Huang, and Ranga Raju Vatsavai. 2003. Data Mining: Next Gen-
eration Challenges and Future Directions. Chapter 19. Trends In Spatial Data Mining. AAAI/MIT
Press.
John Snow. 1856. Cholera and the water supply in the south districts of London in 1854. Journal of
Public Health, 2:239257.
Yannis Theodoridis, Jefferson R. O. Silva, and Mario A. Nascimento. 1999. On the generation of
spatiotemporal datasets. In Proceedings of the 6th International Symposium on Advances in Spatial
Databases, pages 147-164. Springer-Verlag.
Waldo R. Tobler. 1970. A computer model simulation of urban growth in the Detroit region. Economic
Geography, 46(2):234-240.
Kevin E. Trenberth. 1997. The definition of El Niño. Bulletin of the American Meteorological Society,
78:2771-2777.
Florian Verhein. 2006. k-STARs: Sequences of spatio-temporal association rules. In Sixth IEEE
International Conference on Data Mining Workshops, pages 387-394. IEEE, December.
Winter Corporation. 2005. 2005 TopTen Award Winners. http://www.wintercorp.com/vldb/
2005_topten_survey/toptenwinners_2005.asp [Accessed 12 December, 2007].
Elizabeth Wu and Sanjay Chawla. 2007. Spatio-temporal analysis of the relationship between South
American precipitation extremes and the El Niño Southern Oscillation. In Proceedings of the 2007
International Workshop on Spatial and Spatio-temporal Data Mining. IEEE.
Hui Yang and Srinivasan Parthasarathy. 2006. Mining spatial and spatio-temporal patterns in scientific
data. In Proceedings of the 22nd International Conference on Data Engineering Workshops, page
146. IEEE Computer Society.
Hui Yang, Srinivasan Parthasarathy, and Sameep Mehta. 2005. A generalized framework for mining
spatio-temporal patterns in scientific data. In Proceeding of the eleventh ACM SIGKDD international
conference on Knowledge discovery in data mining, pages 716-721, New York, NY, USA. ACM
Press.
Jin Soung Yoo and Shashi Shekhar. 2004. A partial join approach for mining co-location patterns. In
Proceedings of the 12th annual ACM international workshop on Geographic information systems,
pages 241-249, New York, NY, USA. ACM Press.
Jin Soung Yoo and Shashi Shekhar. 2006. A joinless approach for mining spatial colocation patterns. In
IEEE Transactions on Knowledge and Data Engineering, pages 1323-1337, Piscataway, NJ, USA.
IEEE Educational Activities Department.
Peng Yu, Anna Goldberg, and Zhiqiang Bi. 2001. Time series forecasting using wavelets with predictor-
corrector boundary treatment. In 7th ACM SIGKDD International Conference on Knowledge Discov-
ery and Data Mining Workshop on Temporal Data Mining, San Francisco, CA, USA.
Jiang Zhao, Chang-Tien Lu, and Yufeng Kou. 2003. Detecting region outliers in meteorological data.
In Proceedings of the 11th ACM international symposium on Advances in geographic information
systems, pages 49-55, New York, NY, USA. ACM Press.
