Shashi Shekhar
Pusheng Zhang
⋆ Spatial patterns
• Spatial outlier, discontinuities
– bad traffic sensors on highways (DOT)
• Location prediction models
– model to identify habitat of endangered species
• Spatial clusters
– crime hot-spots (NIJ), cancer clusters (CDC)
• Co-location patterns
– predator-prey species, symbiosis
– Dental health and fluoride
GIScience 2004 - Spatial Data Mining: Accomplishments and Research Needs
Location As Attribute
⋆ Spatial Outliers
• Traffic Data in Twin Cities
• Abnormal Sensor Detections
• Spatial and Temporal Outliers
[Figure: traffic data plotted by I35W station ID (south bound) against time]
[Figure: map with legend entries "Marsh land" and "Nest sites"]
⋆ Given:
• A collection of different types of spatial events
⋆ Illustration
Application Domains
Figure 1: Geographic distribution of male colorectal cancer in Long Island, New York (courtesy of TerraSeer)
Business Applications
⋆ Sample Questions:
• What happens if a new store is added?
• How much business will a new store divert from existing stores?
• Other “what if” questions:
– changes in population, ethnic mix, and transportation network
– changes in retail space of a store
– changes in choices and communication with customers
Map Construction
⋆ Sample Questions
• Which features are anomalous?
• Which layers are related?
• How can the gaps be filled?
⋆ Korea Data
• Latitude 37deg15min to 37deg30min
• Longitude 128deg23min51sec to 128deg23min52sec
⋆ Layers
• Obstacles (Cut, embankment, depression)
• Surface drainage (Canal, river/stream, island, common
open water, ford, dam)
• Slope
• Soils (Poorly graded gravel, clayey sand, organic silt, disturbed soil)
• Vegetation (Land subject to inundation, cropland, rice
field, evergreen trees, mixed trees)
• Transport (Roads, cart tracks, railways)
⋆ Road: river/stream
Colocation Example
⋆ Road-River/Stream Colocation
⋆ SQL3/OGC (Postgres/Postgis)
Colocation: Road-River
⋆ Outlier detection
• Extra/erroneous features
• Positional accuracy of features
• Predict mislabeled/misclassified features
Outliers in Example
⋆ Map production
• Identifying errors
e.g., expected colocation: (bridge, (road, river))
Figure 5: Finding errors in maps having road, river and bridges (Korea dataset)
Overview
⇒ Input
⋆ Statistical Foundation
⋆ Output
⋆ Computational process
Overview of Input
⋆ Data
• Table with many columns(attributes)
tid f1 f2 ... fn
⋆ Non-spatial Information
• Same as data in traditional data mining
• Numerical, categorical, ordinal, boolean, etc
• e.g., city name, city population
⋆ Spatial Information
• Spatial attribute: geographically referenced
– Neighborhood and extent
– Location, e.g., longitude, latitude, elevation
• Spatial data representations
– Raster: gridded space
– Vector: point, line, polygon
– Graph: node, edge, path
Figure 6: Raster and Vector Data for UMN Campus (courtesy of UMN, MapQuest)
OGC Model
⋆ Choices:
• Materialize spatial info + classical data mining
• Customized spatial data mining techniques
Relationships | Materialization | Customized SDM Tech.
Topological: Neighbor, Inside, Outside | Classical Data Mining can be used | NEM, co-location
Euclidean: Distance, density | | K-means, DBSCAN
Directional: North, Left, Above | | Clustering on sphere
Others: Shape, visibility | |
Overview
√ Input
⇒ Statistical Foundation
⋆ Output
⋆ Computational process
Spatial Autocorrelation(SA)
[Figure: (a) Pixel property with independent identical distribution; (b) Vegetation Durability with SA]
⋆ Spatial autocorrelation
• Nearby things are more similar than distant things
• Traditional i.i.d. assumption is not valid
• Measures: K-function, Moran’s I, Variogram, ···
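As a rough illustration of one of the measures listed above, the sketch below computes Moran's I for values observed at a few locations. The binary "adjacent cells on a line" weight matrix and the two sample series are assumptions made for the example and do not come from the slides.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a dense weight matrix (list of lists)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations, counted only for neighboring pairs.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def chain_weights(n):
    """Hypothetical neighborhood: cells on a line, adjacent cells are neighbors."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]


w = chain_weights(6)
smooth = [1, 2, 3, 4, 5, 6]        # nearby values are similar
alternating = [1, 6, 1, 6, 1, 6]   # nearby values are dissimilar

print(morans_i(smooth, w))        # positive: spatial autocorrelation
print(morans_i(alternating, w))   # negative: neighbors tend to differ
```

A value near zero would be what the i.i.d. assumption predicts; the smooth series departs from it in the positive direction.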
⋆ K-function Definition:
• Test against randomness for point pattern
• $K(h) = \lambda^{-1} E[\text{number of events within distance } h \text{ of an arbitrary event}]$
– $\lambda$ is intensity of event
• Model departure from randomness in a wide range of scales
⋆ Inference
• For Poisson complete spatial randomness (CSR): $K(h) = \pi h^2$
• Plot $\hat{K}(h)$ against $h$, compare to Poisson CSR
– $\hat{K}(h) > \pi h^2$: cluster
– $\hat{K}(h) < \pi h^2$: decluster/regularity
[Figure: K-function plotted against distance h for Poisson CSR, a cluster process, and a decluster process, with an envelope]
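The comparison against Poisson CSR can be sketched with a naive estimator of K(h). This version estimates the intensity as n / area and applies no edge correction, which is a simplification; the two point sets (a regular grid and a tight cluster in a 100 x 100 window) are made up for illustration.

```python
import math


def k_hat(points, h, area):
    """Naive K-function estimate at distance h, without edge correction."""
    n = len(points)
    close_pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                dx = points[i][0] - points[j][0]
                dy = points[i][1] - points[j][1]
                if math.hypot(dx, dy) <= h:
                    close_pairs += 1
    # lambda_hat = n / area, so K_hat = (1 / lambda_hat) * (close_pairs / n)
    return area * close_pairs / (n * n)


area = 100.0 * 100.0
# Hypothetical regular pattern: points 10 units apart.
grid = [(5.0 + 10.0 * i, 5.0 + 10.0 * j) for i in range(10) for j in range(10)]
# Hypothetical clustered pattern: 20 points packed within half a unit.
cluster = [(50.0 + 0.1 * i, 50.0 + 0.1 * j) for i in range(5) for j in range(4)]

h = 5.0
csr_value = math.pi * h ** 2  # K(h) under Poisson CSR
print(k_hat(grid, h, area) < csr_value)     # regular pattern falls below CSR
print(k_hat(cluster, h, area) > csr_value)  # clustered pattern rises above CSR
```

In practice the estimate would be evaluated over a range of h values and compared with a simulation envelope rather than with the single CSR curve.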
Cross-Correlation
[Figure: plot with axes X and Y]
Illustration of Cross-Correlation
[Figure: curves for the feature pairs "* and x" and "* and +", plotted against distance h]
Spatial Slicing
⋆ Spatial heterogeneity
• “Second law of geography”[M. Goodchild, UCGIS 2003]
• Global model might be inconsistent with regional models
– spatial Simpson’s Paradox (or Ecological Inference)
⋆ Spatial Slicing
• Slicing inputs can improve the effectiveness of SDM
• Slicing output can illustrate support regions of a pattern
– e.g., association rule with support map
Edge Effect
⋆ Research Needs
• Correlating extended features:
– Example data: Korea data
– e.g. road, river (line strings)
– e.g. cropland (polygon), road, river
• Edge effect
• Relationship to classical statistics
– Ex. SVM with spatial basis function vs. SAR
Overview
√ Input
√ Statistical Foundation
⇒ Output
⋆ Computational process
[Figure: map over longitude and latitude with a Support scale]
⋆ Unsupervised Learning:
• Clustering
• Outlier Detection
• Association
[Figure: panels include "Nest sites for 1995 Darr location" (legend: Marsh land, Nest sites; nz = 5372) and "Vegetation distribution across the marshland"]
⋆ Prediction
• Continuous: trend, e.g., regression
– Location aware: spatial autoregressive model(SAR)
• Discrete: classification, e.g., Bayesian classifier
– Location aware: Markov random fields(MRF)
Classical | Spatial
$y = X\beta + \epsilon$ | $y = \rho W y + X\beta + \epsilon$
$Pr(C_i \mid X) = \frac{Pr(X \mid C_i)\, Pr(C_i)}{Pr(X)}$ | $Pr(c_i \mid X, C_N) = \frac{Pr(c_i)\, Pr(X, C_N \mid c_i)}{Pr(X, C_N)}$
Table 7: Prediction Models
[Plots: ROC curves for learning and ROC curves for testing; vertical axis: True Positive Rate]
Figure 11: (a) Comparison of the classical regression model with the spatial autoregression model
on the Darr learning data. (b) Comparison of the models on the Stubble testing data.
$y = \rho W y + X\beta + \epsilon$.
$y_i = f(y_j), \quad i \neq j$.
• Example:
$y = X\beta' + \epsilon'$.
• where $\beta'$ and $\epsilon'$ are location dependent
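To make the first equation concrete, the following sketch solves y = ρWy + Xβ + ε for y on a toy layout of five locations on a line. The row-normalized W, the precomputed vector `xb` standing in for Xβ, and the fixed-point solver are illustrative assumptions, not the estimation procedure used in the talk.

```python
def row_normalized_line(n):
    """Hypothetical W: locations on a line, each row sums to 1 over adjacent cells."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in nbrs:
            W[i][j] = 1.0 / len(nbrs)
    return W


def solve_sar(rho, W, xb, eps, iterations=200):
    """Solve y = rho*W*y + xb + eps by fixed-point iteration.

    Converges when |rho| < 1 and W is row-normalized.
    """
    n = len(xb)
    c = [xb[i] + eps[i] for i in range(n)]
    y = list(c)
    for _ in range(iterations):
        y = [rho * sum(W[i][j] * y[j] for j in range(n)) + c[i]
             for i in range(n)]
    return y


W = row_normalized_line(5)
xb = [1.0, 2.0, 3.0, 4.0, 5.0]   # stands in for X * beta
eps = [0.0] * 5                  # noise switched off for clarity

y_classical = solve_sar(0.0, W, xb, eps)  # rho = 0 reduces to y = X*beta + eps
y_spatial = solve_sar(0.5, W, xb, eps)    # each y_i also depends on its neighbors
print(y_classical)
print(y_spatial)
```

With ρ = 0 the spatial term vanishes and the classical regression model is recovered, which is the sense in which SAR generalizes it.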
⋆ Open Problems
• Estimate W for SAR and MRF-BC
• Scaling issue in SAR
– scale difference: ρWy vs. Xβ
• Spatial interest measure: e.g., avg dist(actual, predicted)
[Figure legend: marker = nest location; A = actual nest in pixel; P = predicted nest in pixel]
Figure 12: An example showing different predictions: (a) The actual sites, (b) Pixels with actual sites, (c) Prediction 1, (d) Prediction 2. Prediction 2 is spatially more accurate than 1.
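One possible reading of the "avg dist(actual, predicted)" measure is the mean distance from each actual site to its nearest predicted site. The sketch below uses that reading; the pixel coordinates are hypothetical and are not taken from Figure 12.

```python
import math


def avg_dist_to_nearest(actual, predicted):
    """Mean distance from each actual site to the closest predicted site."""
    total = 0.0
    for ax, ay in actual:
        total += min(math.hypot(ax - px, ay - py) for px, py in predicted)
    return total / len(actual)


# Hypothetical pixel coordinates. Neither prediction hits an actual pixel,
# so a per-pixel accuracy score rates both predictions the same.
actual = [(1, 1), (3, 3)]
prediction_1 = [(5, 1), (1, 6)]
prediction_2 = [(1, 2), (3, 4)]

print(avg_dist_to_nearest(actual, prediction_1))
print(avg_dist_to_nearest(actual, prediction_2))  # smaller: spatially closer
```

A measure of this kind separates the two predictions even though classical per-pixel accuracy does not.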
Clustering
⋆ Statistical Significance
• Complete spatial randomness, cluster, and decluster
Figure 13: Inputs: Complete Spatial Random (CSR), Cluster, and Decluster
[Figure: panel labels include "Data is of Complete Spatial Randomness" and "Data is of Decluster Pattern"]
Clustering
⋆ Similarity Measures
• Non-spatial: e.g., soundex
• Classical clustering: Euclidean, metric, graph-based
• Topological: neighborhood EM(NEM)
– seeks a partition that is both well clustered in feature space
and spatially regular
– Implicitly based on locations
• Interest measure:
– spatial continuity
– cartographic generalization
– unusual density
– keep nearest neighbors in common cluster
⋆ Challenges
• Spatial constraints in algorithmic design
– Clusters should obey obstacles
– Ex. rivers, mountain ranges, etc
⋆ Semi-supervised MRF
• Idea: use unlabeled samples to improve classification
– Ex. reduce salt-and-pepper noise
• Effects on land-use data - smoothing
[Figure grid: columns Supervised and Semi-Supervised; rows Pixel-Based (BC) and Context-Based (MRF-BC)]
Figure 16: Bayesian Classifier (Top Left); Semi-Supervised BC (Top Right); BC-MRF (Bottom Left); BC-EM-MRF (Bottom Right)
⋆ Location-awareness
[Figure: left panel plots Attribute Values against Location, with labeled points S, P, D, Q, and L; right panel plots I35W Station ID (North Bound) against Time]
⋆ Model Building
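• Neighborhood aggregate function $f_{aggr}^{N}$: $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
⋆ Testing
• Difference function $F_{diff}$
– $\epsilon = E(x) - (m \cdot f(x) + b)$, where $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
– $\left| \frac{\epsilon - \mu_{\epsilon}}{\sigma_{\epsilon}} \right| > \theta$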
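The model-building and testing steps above can be sketched as follows. The slope m and intercept b are taken here from an ordinary least-squares fit of E(x) on f(x), which is an assumption about how they are obtained; the 21 sensor readings and the "within two positions" neighborhood are invented for the example.

```python
from statistics import mean, pstdev


def spatial_outliers(f, neighbors, theta=2.0):
    """Return indices x whose standardized residual exceeds theta.

    f[x] is the attribute value at location x; neighbors[x] lists N(x).
    """
    n = len(f)
    # Model building: neighborhood average E(x) for every location.
    E = [mean(f[y] for y in neighbors[x]) for x in range(n)]
    # Assumed: m and b come from a least-squares fit of E(x) on f(x).
    f_bar, e_bar = mean(f), mean(E)
    m = (sum((f[x] - f_bar) * (E[x] - e_bar) for x in range(n))
         / sum((f[x] - f_bar) ** 2 for x in range(n)))
    b = e_bar - m * f_bar
    # Testing: residual epsilon, then the standardized test against theta.
    eps = [E[x] - (m * f[x] + b) for x in range(n)]
    mu, sigma = mean(eps), pstdev(eps)
    return [x for x in range(n) if abs((eps[x] - mu) / sigma) > theta]


# Hypothetical sensors on a line with a smooth trend; sensor 3 misreports.
readings = [10.0 * i for i in range(21)]
readings[3] = 100.0
neighbors = [[j for j in range(21) if j != i and abs(i - j) <= 2]
             for i in range(21)]

print(spatial_outliers(readings, neighbors, theta=2.0))
```

The flagged reading of 100 is unremarkable in the global distribution of values (which runs from 0 to 200); it stands out only relative to its neighbors, which is what makes the test location-aware.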
Spatial Colocation
⋆ Association
• Domain($f_i$) = union {any, domain($f_i$)}
• Co-location
– Effect of transactionizing: loss of info
– Alternative: use spatial join, statistics
[Figure: two panels plotted on X and Y axes]
Figure 17: (a) A spatial dataset. Shapes represent different spatial feature types. (b) Transactionizing continuous space splits circled instances of colocation patterns into separated transactions
⋆ Approaches
• Spatial Join-based Approaches
– Join based on map overlay, e.g. [Estivill-Castro and Lee, 2001]
– Join using K-function, e.g. [Shekhar and Huang, 2001]
• Transaction-based Approaches
– e.g., [Koperski and Han, 1995] and [Morimoto, 2001]
⋆ Challenges
• Neighborhood definition
• “Right” transactionization
• Statistical interpretation
• Computational complexity
– large number of joins
– join predicate is a conjunction of:
∗ neighbor
∗ distinct item types
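The join predicate described above (neighbor AND distinct item types) can be sketched as a nested-loop spatial self-join. The feature instances, the Euclidean-distance neighbor test, and the radius are assumptions for illustration; a real system would likely replace the nested loop with a spatial index.

```python
import math
from collections import Counter


def neighbor_type_pairs(instances, radius):
    """Count neighboring instance pairs per pair of distinct feature types.

    instances is a list of (feature_type, x, y) tuples.
    """
    counts = Counter()
    for i in range(len(instances)):
        type_i, xi, yi = instances[i]
        for j in range(i + 1, len(instances)):
            type_j, xj, yj = instances[j]
            is_neighbor = math.hypot(xi - xj, yi - yj) <= radius
            # Join predicate: neighbor AND distinct item types.
            if is_neighbor and type_i != type_j:
                counts[tuple(sorted((type_i, type_j)))] += 1
    return counts


# Hypothetical point instances of three feature types.
instances = [
    ("road", 0.0, 0.0), ("river", 1.0, 0.0), ("bridge", 0.5, 0.0),
    ("road", 10.0, 10.0), ("river", 10.0, 11.0),
    ("road", 50.0, 50.0),
]

for pair, count in sorted(neighbor_type_pairs(instances, radius=2.0).items()):
    print(pair, count)
```

Because the join works directly on coordinates, no neighboring pair is lost to an arbitrary transaction boundary, at the price of the quadratic number of comparisons noted under computational complexity.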
Overview
√ Input
√ Statistical Foundation
√ Output
⇒ Computational process
Computational Process
⋆ Challenges
• Does spatial domain provide computational efficiency?
– Low dimensionality: 2-3
– Spatial autocorrelation
– Spatial indexing methods
• Generalize to solve spatial problems
– Linear regression vs SAR
∗ Contiguity matrix W is assumed known for SAR; however, estimation of anisotropic W is non-trivial
– Spatial outlier detection: spatial join
– Co-location: bunch of joins
⋆ Teleconnection
• Find locations with climate correlation over θ
Figure 18: Global Influence of El Nino during the Northern Hemisphere Winter (D: Dry; W: Warm; R: Rainfall)
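A brute-force version of the teleconnection query, finding location pairs whose time series correlate above θ, might look like the sketch below. It assumes Pearson correlation between land and ocean series; the location names and the short series are made up.

```python
import math
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length series."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlated_pairs(land, ocean, theta):
    """All (land, ocean) location pairs with correlation above theta."""
    return [(l_name, o_name)
            for l_name, l_series in land.items()
            for o_name, o_series in ocean.items()
            if pearson(l_series, o_series) > theta]


# Hypothetical time series per location.
land = {"L1": [1, 2, 3, 4, 5], "L2": [5, 1, 4, 2, 3]}
ocean = {"O1": [2, 4, 6, 8, 10], "O2": [3, 3, 4, 3, 3]}

print(correlated_pairs(land, ocean, theta=0.9))
```

This exhaustive comparison touches every land-ocean pair, which is the cost that spatial autocorrelation and indexing are meant to reduce at the data sizes quoted in the talk.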
⋆ Challenge:
• high-dimensional (e.g., 600) feature space
• 67k land locations and 100k ocean locations
• 50-year monthly data
⋆ Computational Efficiency
• Spatial autocorrelation:
– Reduce Computational Complexity
• Spatial indexing to organize locations
– Top-down tree traversal is a strong filter
– Spatial join query: filter-and-refine
∗ save 40% to 98% computational cost at θ = 0.3 to 0.9
$y = \rho W y + X\beta + \epsilon$
⋆ Computational Challenges
• Eigenvalue computation + least squares + maximum likelihood (M.L.)
• Computing all eigenvalues of a large matrix
• Memory requirement
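One common reason eigenvalues appear in maximum-likelihood estimation of SAR is that the likelihood contains the term ln|I − ρW|, which equals the sum of ln(1 − ρλ_i) over the eigenvalues λ_i of W. Reading the slide this way is an assumption. The sketch checks the identity on a small ring-shaped neighborhood, where the eigenvalues are known in closed form, against a direct determinant computation.

```python
import math


def ring_weights(n):
    """Hypothetical row-normalized W: n >= 3 locations on a ring, two neighbors each."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][(i - 1) % n] = 0.5
        W[i][(i + 1) % n] = 0.5
    return W


def ring_eigenvalues(n):
    """Closed-form eigenvalues of the ring matrix above: cos(2*pi*k/n)."""
    return [math.cos(2.0 * math.pi * k / n) for k in range(n)]


def log_det_from_eigenvalues(rho, eigenvalues):
    """ln|I - rho*W| as a sum over eigenvalues; O(n) for each new rho."""
    return sum(math.log(1.0 - rho * lam) for lam in eigenvalues)


def log_det_direct(rho, W):
    """ln|det(I - rho*W)| by Gaussian elimination with partial pivoting."""
    n = len(W)
    A = [[(1.0 if i == j else 0.0) - rho * W[i][j] for j in range(n)]
         for i in range(n)]
    total = 0.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        total += math.log(abs(A[c][c]))
        for r in range(c + 1, n):
            factor = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= factor * A[c][k]
    return total


n, rho = 8, 0.5
print(log_det_from_eigenvalues(rho, ring_eigenvalues(n)))
print(log_det_direct(rho, ring_weights(n)))
```

Once the eigenvalues are available, the log-determinant can be re-evaluated cheaply for every candidate ρ during the likelihood search; obtaining all eigenvalues of a large W is the expensive, memory-hungry step.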
Summary
 | Classical DM | Spatial DM
Input | All explicit, simple types | Often implicit relationships, complex types and transactions
Stat Foundation | Independence of samples | Spatial autocorrelation
Output | Interest measures: set-based | Location-awareness
Computational Process | Combinatorial optimization; numerical alg. | Computational efficiency opportunity: spatial autocorrelation, plane-sweeping; new complexity: SAR, co-location mining; estimation of anisotropic W is nontrivial
Objective Function | Max likelihood; min sum of squared errors | Map Similarity(Actual, Predicted)
Constraints | Discrete space; support threshold; confidence threshold | Keep NN together; honor geo-boundaries
Other Issues | | Edge effect, scale
Book
Figure 19: Spatial Databases: A Tour (a) English Version (b) Russian Version (c) Chinese Version
References
⋆ References
• [Cressie, 1991], N. Cressie, Statistics for Spatial Data, John Wiley and Sons,
1991
• [Goodchild, 2001], M. Goodchild, Spatial Analysis and GIS, 2001 ESRI User
Conference Pre-Conference Seminar
• [Miller, Han, 2001], H. Miller and J. Han(eds), Geographic Data Mining and
Knowledge Discovery, Taylor and Francis, 2001
• [Roddick, 2001], J. Roddick, K. Hornsby and M. Spiliopoulou, Yet Another Bibliography of Temporal, Spatial and Spatio-temporal Data Mining Research, KDD Workshop, 2001
• [Tan et al, 2001], P. Tan, M. Steinbach, V. Kumar, C. Potter, S. Klooster and A. Torregrosa, Finding Spatio-Temporal Patterns in Earth Science Data, KDD Workshop on Temporal Data Mining, 2001