
Clustering Analysis

(of Spatial Data, using Peano Count Trees)


(Ptree technology is patented by NDSU)

Notes:
1. There are over 100 slides; we will not go through each one in detail.

Clustering Methods
A Categorization of Major Clustering Methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods

Clustering Methods based on Partitioning


Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
k-means (MacQueen67): Each cluster is represented by the
center of the cluster
k-medoids or PAM method (Partition Around Medoids)
(Kaufman & Rousseeuw87): Each cluster is represented by 1
object in the cluster (~ the middle object or median-like object)

The K-Means Clustering Method


Given k, the k-means algorithm is implemented in 4 steps (assumes the partitioning criterion is: maximize intra-cluster similarity and minimize inter-cluster similarity. Of course, a heuristic is used; the method isn't really an optimization).
1. Partition the objects into k nonempty subsets (or pick k initial means).
2. Compute the mean (center) or centroid of each cluster of the current partition (if one started with k means, then this step is already done).
   centroid ~ the point that minimizes the sum of dissimilarities from the mean or the sum of the square errors from the mean.
   Assign each object to the cluster with the most similar (closest) center.
3. Go back to Step 2.
4. Stop when the new set of means doesn't change (or some other stopping condition holds).
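A minimal sketch of these four steps in Python/NumPy (not the vertical P-tree formulation discussed later in these slides); the random initialization, iteration cap, and convergence test are illustrative assumptions.

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Lloyd's k-means: X is an (n, p) array of objects."""
        rng = np.random.default_rng(seed)
        means = X[rng.choice(len(X), size=k, replace=False)]   # step 1: pick k initial means
        for _ in range(max_iter):
            # step 2: assign each object to the closest center
            dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute the mean of each (nonempty) cluster
            new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else means[j] for j in range(k)])
            if np.allclose(new_means, means):    # step 4: stop when the means no longer change
                break
            means = new_means                    # step 3: go back to step 2
        return labels, means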

k-Means

[Figure: k-means iterations, Steps 1-4, on a 0-10 by 0-10 plot: points are assigned to the nearest center and the centers are recomputed until the assignments stabilize.]

The K-Means Clustering Method


Strength
Relatively efficient: O(tkn),
n is # objects,
k is # clusters
t is # iterations.

Normally, k, t << n.

Weakness
Applicable only when mean is defined (e.g., a vector space)
Need to specify k, the number of clusters, in advance.
It is sensitive to noisy data and outliers since a small number of
such data can substantially influence the mean value.

The K-Medoids Clustering Method


Find representative objects, called medoids (a medoid must be an actual object in the cluster, whereas the mean seldom is).
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids
iteratively replaces one of the medoids by a non-medoid;
if the swap improves the aggregate similarity measure, retain it. Do this over all medoid/non-medoid pairs.
PAM works for small data sets but does not scale to large data sets.
CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990): sub-samples the data, then applies PAM to each sample.
CLARANS (Clustering Large Applications based on RANdomized Search) (Ng & Han, 1994): randomizes the sampling.

PAM (Partitioning Around Medoids) (1987)


Uses real objects to represent the clusters
Select k representative objects arbitrarily
For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_i,h
For each pair of i and h,
if TC_i,h < 0, i is replaced by h
Then assign each non-selected object to the most similar representative object
Repeat steps 2-3 until there is no change
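A rough sketch of the PAM swap loop in Python, assuming a precomputed pairwise distance matrix D; here the total swapping cost TC_i,h is computed simply as the change in the sum of distances to the nearest medoid (a brute-force illustration, not an efficient implementation).

    import numpy as np

    def pam(D, k, max_iter=50):
        """D: (n, n) pairwise distance matrix. Returns k medoid indices and labels."""
        n = len(D)
        medoids = list(range(k))                      # arbitrary initial representatives
        def total_cost(meds):
            return D[:, meds].min(axis=1).sum()       # each object to its closest medoid
        for _ in range(max_iter):
            best_swap, best_delta = None, 0.0
            for i in medoids:
                for h in range(n):
                    if h in medoids:
                        continue
                    trial = [h if m == i else m for m in medoids]
                    delta = total_cost(trial) - total_cost(medoids)   # TC_i,h
                    if delta < best_delta:                            # TC_i,h < 0: improvement
                        best_swap, best_delta = (i, h), delta
            if best_swap is None:                                     # no improving swap: stop
                break
            i, h = best_swap
            medoids = [h if m == i else m for m in medoids]
        labels = D[:, medoids].argmin(axis=1)                         # final assignment
        return medoids, labels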

CLARA (Clustering Large Applications) (1990)


CLARA (Kaufmann and Rousseeuw in 1990)
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the sample
is biased

CLARANS (Randomized CLARA) (1994)


CLARANS (A Clustering Algorithm based on Randomized Search)
(Ng and Han, 1994)
CLARANS draws a sample of neighbors dynamically
The clustering process can be presented as searching a graph
where every node is a potential solution, that is, a set of k medoids
If a local optimum is found, CLARANS starts with a new
randomly selected node in search of a new local optimum
(Genetic-Algorithm-like)
Finally the best local optimum is chosen after some stopping
condition.
It is more efficient and scalable than both PAM and CLARA

Distance-based partitioning has drawbacks

Simple and fast: O(N)

The number of clusters, K, has to be chosen arbitrarily before the correct number of clusters is known.

Produces round-shaped clusters, not arbitrary shapes (Chameleon data set below).

Sensitive to the selection of the initial partition; it may converge to a local minimum of the criterion function if the initial partition is not well chosen.

[Figure: the Chameleon data set, comparing the correct result with the k-means result.]

Distance-based partitioning (Cont.)


If we start with A, B, and C as the
initial centroids around which the
three clusters are built, then we end
up with the partition {{A}, {B, C},
{D, E, F, G}} shown by ellipses.
In contrast, the correct three-cluster
solution is obtained by choosing, for
example, A, D, and F as the initial
cluster means (rectangular clusters).

A Vertical Data Approach


Partition the data set using rectangle P-trees (a gridding).
These P-trees can be viewed as a grouping (partition) of the data.
Prune out outliers by disregarding sparse values.
Input: total number of objects (N), percentage of outliers (t)
Output: grid P-trees after pruning
(1) Choose the grid P-tree with the smallest root count (P_gc)
(2) outliers := outliers OR P_gc
(3) if (|outliers| / N <= t) then remove P_gc and repeat (1)-(2)

Finding clusters using PAM method; each grid P-tree is an object


Note: when we have a P-tree mask for each cluster, the mean is just
the vector sum of the basic Ptrees ANDed with the cluster Ptree,
divided by the rootcount of the cluster Ptree

Distance Function
Data Matrix (n objects × p variables):

  X_11 ... X_1f ... X_1p
  ...
  X_i1 ... X_if ... X_ip
  ...
  X_n1 ... X_nf ... X_np

Dissimilarity Matrix (n objects × n objects, lower triangular):

  0
  d(2,1)  0
  d(3,1)  d(3,2)  0
  ...
  d(n,1)  d(n,2)  ...  d(n,n-1)  0

Euclidean distance between objects i and j:

  d(i,j) = sqrt( (X_i1 - X_j1)^2 + (X_i2 - X_j2)^2 + ... + (X_ip - X_jp)^2 )
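A small sketch computing the full n × n dissimilarity matrix with this Euclidean distance; the data matrix X is an assumed (n, p) NumPy array.

    import numpy as np

    def dissimilarity_matrix(X):
        """Euclidean d(i, j) for every pair of rows of the (n, p) data matrix X."""
        diff = X[:, None, :] - X[None, :, :]        # (n, n, p) pairwise differences
        return np.sqrt((diff ** 2).sum(axis=2))     # d(i, j) = sqrt(sum_f (X_if - X_jf)^2)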

AGNES (Agglomerative Nesting)


Introduced in Kaufmann and Rousseeuw (1990)
Uses the Single-Link method (the distance between two sets is the minimum pairwise distance)
Merges the nodes that are most similar
Eventually all nodes belong to the same cluster

[Figure: AGNES merging the points of a 2-D data set step by step into a single cluster.]
DIANA (Divisive Analysis)


Introduced in Kaufmann and Rousseeuw (1990)
Inverse order of AGNES (initially all objects are in one cluster; then it is split according to some criterion, e.g., maximizing some aggregate measure of pairwise dissimilarity)
Eventually each node forms a cluster on its own

[Figure: DIANA splitting a single cluster step by step until each point stands alone.]

Contrasting Clustering Techniques


Partitioning algorithms: partition a dataset into k clusters, e.g., k = 3.
Hierarchical algorithms: create a hierarchical decomposition of ever-finer partitions, e.g., top down (divisive) or bottom up (agglomerative).

Hierarchical Clustering

[Figure: agglomerative (bottom-up) clustering of objects a, b, c, d, e over Steps 0-4: a and b merge into ab, d and e into de, c joins de to form cde, and finally ab and cde merge into abcde.]

Agglomerative

Hierarchical Clustering (top down)

[Figure: divisive (top-down) clustering of abcde over Steps 4-0: abcde splits into ab and cde, cde splits into c and de, and so on down to single objects.]

Divisive

In either case, one gets a nice dendrogram in which any maximal antichain (no 2 nodes linked) is a clustering (partition).

Hierarchical Clustering (Cont.)

Recall that any maximal anti-chain (maximal set of nodes in which no 2 are chained) is a clustering (a dendrogram offers many).

Hierarchical Clustering (Cont.)

But the horizontal anti-chains are the clusterings resulting from the top-down (or bottom-up) method(s).

Hierarchical Clustering (Cont.)


Most hierarchical clustering algorithms are variants of single-link, complete-link, or average-link.
Of these, single-link and complete-link are the most popular.
In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn one from each cluster.
In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between pairs of patterns drawn one from each cluster.
In the average-link algorithm, the distance between two clusters is the average of all pairwise distances between pairs of patterns drawn one from each cluster (which, in the vector-space case, is the same as the distance between the means, and is easier to calculate).
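A sketch of these three inter-cluster distances, given a pairwise distance matrix D and two clusters expressed as index lists; the function and variable names are illustrative.

    import numpy as np

    def single_link(D, A, B):    # minimum pairwise distance between clusters A and B
        return D[np.ix_(A, B)].min()

    def complete_link(D, A, B):  # maximum pairwise distance
        return D[np.ix_(A, B)].max()

    def average_link(D, A, B):   # average of all pairwise distances
        return D[np.ix_(A, B)].mean()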

Distance Between Clusters


Single Link: smallest distance between any pair of
points from two clusters

Complete Link: largest distance between any pair


of points from two clusters

Distance between Clusters (Cont.)


Average Link: average distance between points
from two clusters
Centroid: distance between centroids of
the two clusters

Single Link vs. Complete Link (Cont.)

Single link
works but not
complete link

Complete link
works but not
single link

Single Link vs. Complete Link (Cont.)

Single link works; complete link doesn't.

[Figure: points labeled 1 and 2 forming two clusters that single link recovers but complete link does not.]

Single Link vs. Complete Link (Cont.)

Complete link works; single link doesn't.

[Figure: a 1-cluster and a 2-cluster with noise points between them; complete link recovers the clusters but single link does not.]
Hierarchical vs. Partitional

Hierarchical algorithms are more versatile than partitional


algorithms.
For example, the single-link clustering algorithm works well on data
sets containing non-isotropic (non-roundish) clusters, including well-separated, chain-like, and concentric clusters, whereas a typical
partitional algorithm such as the k-means algorithm works well only
on data sets having isotropic clusters.

On the other hand, the time and space complexities of the


partitional algorithms are typically lower than those of the
hierarchical algorithms.

More on Hierarchical Clustering Methods


Major weakness of agglomerative clustering methods
do not scale well: time complexity of at least O(n²), where n is the
total number of objects
can never undo what was done previously (greedy algorithm)

Integration of hierarchical with distance-based clustering


BIRCH (1996): uses Clustering Feature tree (CF-tree) and
incrementally adjusts the quality of sub-clusters
CURE (1998): selects well-scattered points from the cluster and then
shrinks them towards the center of the cluster by a specified fraction
CHAMELEON (1999): hierarchical clustering using dynamic
modeling

Density-Based Clustering Methods


Clustering based on density (local cluster criterion), such as
density-connected points
Major features:

Discover clusters of arbitrary shape


Handle noise
One scan
Need density parameters as termination condition

Several interesting studies:

DBSCAN: Ester, et al. (KDD96)


OPTICS: Ankerst, et al (SIGMOD99).
DENCLUE: Hinneburg & D. Keim (KDD98)
CLIQUE: Agrawal, et al. (SIGMOD98)

Density-Based Clustering: Background


Two parameters:
Eps (ε): maximum radius of the neighbourhood
MinPts: minimum number of points in an ε-neighbourhood of that point

N_ε(p) = {q ∈ D | dist(p,q) ≤ ε}

Directly (density) reachable: a point p is directly density-reachable from a point q wrt. ε, MinPts if
1) p belongs to N_ε(q)
2) q is a core point: |N_ε(q)| ≥ MinPts

[Figure: p inside the ε-neighbourhood of core point q; MinPts = 5, ε = 1 cm.]

Density-Based Clustering: Background (II)


Density-reachable:
A point p is density-reachable from a point q wrt ε, MinPts if there is a chain of points p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i.
(For any q, q is density-reachable from q.)

[Figure: p density-reachable from q via an intermediate point p_1.]

Density reachability is reflexive and transitive, but not symmetric, since only core objects can be density-reachable from other objects.

Density-connected:
A point p is density-connected to a point q wrt ε, MinPts if there is a point o such that both p and q are density-reachable from o wrt ε, MinPts.
Density reachability is not symmetric. Density connectivity inherits the reflexivity and transitivity and provides the symmetry. Thus, density connectivity is an equivalence relation and therefore gives a partition (clustering).

[Figure: p and q both density-reachable from o.]

DBSCAN: Density Based Spatial Clustering of Applications with Noise


Relies on a density-based notion of cluster: a cluster is defined as an equivalence class of density-connected points.
Density connectivity is transitive (as well as reflexive and symmetric), so it is an equivalence relation, and its classes form a partition (clustering) under the equivalence-relation/partition duality.
Discovers clusters of arbitrary shape in spatial databases with noise.

[Figure: core, border, and outlier points; ε = 1 cm, MinPts = 3.]

DBSCAN: The Algorithm


Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt ε, MinPts.
If p is a core point, a cluster is formed (note: it doesn't matter which of the core points within a cluster you start at, since density reachability is symmetric on core points).
If p is a border point or an outlier, no points are density-reachable from p and DBSCAN visits the next point of the database. Keep track of such points. If they don't get scooped up by a later core point, then they are outliers.
Continue the process until all of the points have been processed.
What about a simpler version of DBSCAN (sketched below)?
Define core points and core neighborhoods the same way.
Define an (undirected graph) edge between two points if they cohabit a core neighborhood.
The connected-component partition is the clustering.

Other related methods? How does vertical technology help here? Gridding?
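A sketch of the simpler variant proposed above: points are linked when one lies in a core point's neighborhood, and the connected components are the clusters. The parameters eps and min_pts and the brute-force neighbor search are illustrative choices, not the original DBSCAN code.

    import numpy as np

    def simple_dbscan(X, eps, min_pts):
        """Cluster = connected component of the 'shares a core neighborhood' graph."""
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        nbrs = [np.flatnonzero(D[i] <= eps) for i in range(n)]   # eps-neighborhoods (incl. self)
        is_core = np.array([len(nbrs[i]) >= min_pts for i in range(n)])
        labels = -np.ones(n, dtype=int)                          # -1 marks noise / outliers
        cluster = 0
        for c in np.flatnonzero(is_core):
            if labels[c] != -1:
                continue
            labels[c] = cluster
            stack = [c]
            while stack:                                         # flood-fill one component
                p = stack.pop()
                for q in nbrs[p]:                                # everyone in p's core nbrhd joins
                    if labels[q] == -1:
                        labels[q] = cluster
                        if is_core[q]:                           # keep expanding only from cores
                            stack.append(q)
            cluster += 1
        return labels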

OPTICS
Ordering Points To Identify Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD99)

http://portal.acm.org/citation.cfm?id=304187
Addresses the shortcoming of DBSCAN, namely choosing
parameters.
Develops a special order of the database wrt its density-based
clustering structure
This cluster-ordering contains info equivalent to the density-based
clusterings corresponding to a broad range of parameter settings
Good for both automatic and interactive cluster analysis, including
finding intrinsic clustering structure

OPTICS

Does this order resemble the Total Variation order?

[Figure: OPTICS reachability plot; y-axis: reachability-distance (undefined for some objects), x-axis: cluster-order of the objects.]

DENCLUE: using density functions


DENsity-based CLUstEring by Hinneburg & Keim (KDD98)

Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45, claimed by the authors ???)
But needs a large number of parameters

Denclue: Technical Essence


Uses grid cells but only keeps information about grid cells that do actually contain data
points and manages these cells in a tree-based access structure.
Influence function: describes the impact of a data point within its neighborhood.
f(x,y) measures the influence that y has on x.
A very good influence function is the Gaussian, f(x,y) = e^( -d²(x,y) / (2σ²) ).
Others include functions similar to the squashing functions used in neural networks.
One can think of the influence function as a measure of the contribution to the density at x made by y.
The overall density of the data space can be calculated as the sum of the influence functions of all data points.
Clusters can be determined mathematically by identifying density attractors.
Density attractors are local maxima of the overall density function.
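A tiny sketch of the Gaussian influence function and the resulting overall density at a point x; sigma is the smoothing parameter, and a real DENCLUE implementation would restrict the sum to nearby grid cells rather than summing over all data points.

    import numpy as np

    def gaussian_influence(x, y, sigma):
        """f(x, y) = exp( -d^2(x, y) / (2 * sigma^2) )."""
        d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
        return np.exp(-d2 / (2 * sigma ** 2))

    def overall_density(x, data, sigma):
        """Overall density at x = sum of the influences of all data points."""
        return sum(gaussian_influence(x, y, sigma) for y in data)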

DENCLUE(D, σ, ξc, ξ)

1. Grid the data set (use r = σ, the std. dev.).
2. Find (highly) populated cells (use a threshold = ξc) (shown in blue); identify populated cells (and nonempty cells).
3. Find density attractor points, C*, using hill climbing:
   1. Randomly pick a point, p_i.
   2. Compute the local density (use r = 4).
   3. Pick another point, p_{i+1}, close to p_i, and compute the local density at p_{i+1}.
   4. If LocDen(p_i) < LocDen(p_{i+1}), climb.
   5. Put all points within distance σ/2 of the path p_i, p_{i+1}, ..., C* into a density attractor cluster called C*.
4. Connect the density attractor clusters, using a threshold, ξ, on the local densities of the attractors.

A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Multimedia Databases with Noise. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1998; & KDD'99 Workshop.

Comparison: DENCLUE Vs DBSCAN

BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using Hierarchies, by
Zhang, Ramakrishnan, Livny (SIGMOD'96)
http://portal.acm.org/citation.cfm?id=235968.233324&dl=GUIDE&dl=ACM&idx=235968&part=periodical&WantType=periodical&title
=ACM%20SIGMOD%20Record&CFID=16013608&CFTOKEN=14462336

Incrementally construct a CF (Clustering Feature) tree, a hierarchical data


structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of
the data that tries to preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Scales linearly: finds a good clustering with a single scan and improves
quality with a few additional scans
Weakness: handles only numeric data, and sensitive to the order of the data
record.

ABSTRACT

BIRCH

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most
widely studied problems in this area is the identification of clusters, or densely populated regions, in
a multi-dimensional dataset.
Prior work does not adequately address the problem of large datasets and minimization of I/O costs.
This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies), and demonstrates that it is especially suitable for very large
databases.
BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to
produce the best quality clustering with the available resources (i.e., available memory and time
constraints).
BIRCH can typically find a good clustering with a single scan of the data, and improve the quality
further with a few additional scans.
BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points
that are not part of the underlying pattern) effectively.
We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through
several experiments.

Clustering Feature Vector


Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: componentwise linear sum, Σ_{i=1..N} X_i
SS: componentwise square sum, Σ_{i=1..N} X_i²

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).

Branching factor = max # children
Threshold = max diameter of a leaf cluster

[Figure: the five points plotted on a 0-10 by 0-10 grid.]

Iteratively put points into the closest leaf until the threshold is exceeded, then split the leaf.
Internal nodes summarize their subtrees and are split when the branching factor is exceeded.
Once the in-memory CF tree is built, use another method to cluster the leaves together.
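A sketch of a clustering feature and its additivity (absorbing a point or merging two CFs just adds components); the example values reproduce the five points above. This is a hedged illustration, not the BIRCH implementation.

    import numpy as np

    class CF:
        """Clustering Feature: CF = (N, LS, SS)."""
        def __init__(self, dim):
            self.N = 0
            self.LS = np.zeros(dim)          # componentwise linear sum of the points
            self.SS = np.zeros(dim)          # componentwise sum of squares
        def add(self, x):
            x = np.asarray(x, dtype=float)
            self.N += 1
            self.LS += x
            self.SS += x ** 2
        def merge(self, other):              # CFs are additive, so subtrees summarize cheaply
            self.N += other.N
            self.LS += other.LS
            self.SS += other.SS
        def centroid(self):
            return self.LS / self.N

    cf = CF(dim=2)
    for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
        cf.add(p)
    print(cf.N, cf.LS, cf.SS)                # 5 [16. 30.] [ 54. 190.]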

Birch

Branching factor B = 6, Threshold L = 7

[Figure: a CF tree. The root holds entries CF1 ... CF6, each pointing to a child; a non-leaf node holds entries CF1 ... CF5 pointing to its children; leaf nodes hold up to six CF entries and are chained together with prev/next pointers.]

CURE (Clustering Using REpresentatives)

CURE: proposed by Guha, Rastogi & Shim, 1998
http://portal.acm.org/citation.cfm?id=276312
Stops the creation of a cluster hierarchy if a level consists of k clusters
Uses multiple representative points to evaluate the distance between clusters
Adjusts well to arbitrarily shaped clusters (not necessarily distance-based)
Avoids the single-link effect

Drawbacks of Distance-Based Method

Drawbacks of square-error based clustering method


Consider only one point as representative of a cluster
Good only for convex shaped, similar size and density, and
if k can be reasonably estimated

Cure: The Algorithm


Very much a hybrid method (involves pieces from many others).
Draw a random sample s.
Partition the sample into p partitions of size s/p.
Partially cluster each partition into s/(pq) clusters.
Eliminate outliers:
by random sampling;
if a cluster grows too slowly, eliminate it.
Cluster the partial clusters.
Label the data on disk.
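A sketch of the representative-shrinking step only (choosing c well-scattered points greedily, then moving them a fraction alpha toward the cluster mean); the sampling and partitioning machinery around it is omitted, and c and alpha are illustrative parameters.

    import numpy as np

    def shrunk_representatives(cluster_pts, c=4, alpha=0.3):
        """Pick c well-scattered points, then shrink them toward the centroid by alpha."""
        center = cluster_pts.mean(axis=0)
        # greedy farthest-point selection for "well scattered" representatives
        reps = [cluster_pts[np.argmax(np.linalg.norm(cluster_pts - center, axis=1))]]
        while len(reps) < min(c, len(cluster_pts)):
            d = np.min([np.linalg.norm(cluster_pts - r, axis=1) for r in reps], axis=0)
            reps.append(cluster_pts[np.argmax(d)])
        reps = np.array(reps)
        return reps + alpha * (center - reps)    # shrink toward the gravity center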

ABSTRACT

Cure

Clustering, in data mining, is useful for discovering groups and identifying interesting
distributions in the underlying data. Traditional clustering algorithms either favor clusters with
spherical shapes and similar sizes, or are very fragile in the presence of outliers.
We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies
clusters having non-spherical shapes and wide variances in size.
CURE achieves this by representing each cluster by a certain fixed number of points that are
generated by selecting well scattered points from the cluster and then shrinking them toward
the center of the cluster by a specified fraction.
Having more than one representative point per cluster allows CURE to adjust well to the
geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers.
To handle large databases, CURE employs a combination of random sampling and partitioning. A
random sample drawn from the data set is first partitioned and each partition is partially
clustered. The partial clusters are then clustered in a second pass to yield the desired clusters.
Our experimental results confirm that the quality of clusters produced by CURE is much better
than those found by existing algorithms.
Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only
outperform existing algorithms but also to scale well for large databases without sacrificing
clustering quality.

Data Partitioning and Clustering

[Figure: a sample of s = 50 points split into p = 2 partitions of size s/p = 25, each partially clustered into s/(pq) = 5 clusters.]

Cure: Shrinking Representative Points

Shrink the multiple representative points towards the gravity center by a fraction α.
Multiple representatives capture the shape of the cluster.

[Figure: a cluster's representative points before and after shrinking toward the center.]
Clustering Categorical Data: ROCK

http://portal.acm.org/citation.cfm?id=351745
ROCK: RObust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE'99).

Agglomerative hierarchical
Uses links to measure similarity/proximity
Not distance based
Computational complexity: O(n² + n·m_m·m_a + n² log n)

Basic ideas:
Similarity function and neighbors:
Let T1 = {1,2,3}, T2 = {3,4,5}

  Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2

Abstract

ROCK

Clustering, in data mining, is useful to discover distribution patterns in the underlying


data.
Clustering algorithms usually employ a distance metric based (e.g., euclidean)
similarity measure in order to partition the database such that data points in the
same partition are more similar than points in different partitions.
In this paper, we study clustering algorithms for data with boolean and categorical
attributes.
We show that traditional clustering algorithms that use distances between points for
clustering are not appropriate for boolean and categorical attributes. Instead, we
propose a novel concept of links to measure the similarity/proximity between a pair
of data points.
We develop a robust hierarchical clustering algorithm ROCK that employs links and
not distances when merging clusters.
Our methods naturally extend to non-metric similarity measures that are relevant in
situations where a domain expert/similarity table is the only source of knowledge.
In addition to presenting detailed complexity results for ROCK, we also conduct an
experimental study with real-life as well as synthetic data sets to demonstrate the
effectiveness of our techniques.
For data with categorical attributes, our findings indicate that ROCK not only generates
better quality clusters than traditional algorithms, but it also exhibits good
scalability properties.

Rock: Algorithm
Links: the number of common neighbors for two points.

Example points (all 3-subsets of {1,2,3,4,5}):
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}

link({1,2,3}, {1,2,4}) = 3 (see the sketch below)
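A sketch of the Jaccard similarity, the neighbor sets under a similarity threshold theta, and the link counts; theta is an illustrative parameter, and with the convention used here (points are not their own neighbors) the printed link count is 3.

    def jaccard(t1, t2):
        return len(t1 & t2) / len(t1 | t2)        # |T1 n T2| / |T1 u T2|

    def links(points, theta):
        """link(p, q) = number of common neighbors, where 'neighbor' means sim >= theta."""
        nbrs = [{j for j, q in enumerate(points) if i != j and jaccard(p, q) >= theta}
                for i, p in enumerate(points)]
        return {(i, j): len(nbrs[i] & nbrs[j])
                for i in range(len(points)) for j in range(i + 1, len(points))}

    pts = [frozenset(s) for s in ([1,2,3], [1,2,4], [1,2,5], [1,3,4], [1,3,5],
                                  [1,4,5], [2,3,4], [2,3,5], [2,4,5], [3,4,5])]
    print(jaccard(pts[0], pts[-1]))               # {1,2,3} vs {3,4,5} -> 1/5 = 0.2
    lk = links(pts, theta=0.5)
    print(lk[(0, 1)])                             # link({1,2,3}, {1,2,4}) -> 3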

Algorithm

Draw a random sample
Cluster with links
Label the data on disk

CHAMELEON
CHAMELEON: hierarchical clustering using dynamic
modeling, by G. Karypis, E.H. Han and V. Kumar, 1999
http://portal.acm.org/citation.cfm?id=621303
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity and closeness
(proximity) between two clusters are high relative to the internal
interconnectivity of the clusters and closeness of items within the
clusters

A two phase algorithm


1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these sub-clusters

ABSTRACT

CHAMELEON

Many advanced algorithms have difficulty dealing with highly variable clusters that do
not follow a preconceived model.
By basing its selections on both interconnectivity and closeness, the Chameleon
algorithm yields accurate results for these highly variable clusters.
Existing algorithms use a static model of the clusters and do not use information about
the nature of individual clusters as they are merged.
Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores the
information about the aggregate interconnectivity of items in two clusters.
Another set of schemes (the Rock algorithm, group averaging method, and related
schemes) ignores information about the closeness of two clusters as defined by the
similarity of the closest items across two clusters.
By considering either interconnectivity or closeness only, these algorithms can select
and merge the wrong pair of clusters.
Chameleon's key feature is that it accounts for both interconnectivity and closeness in
identifying the most similar pair of clusters.
Chameleon finds the clusters in the data set by using a two-phase algorithm.
During the first phase, Chameleon uses a graph-partitioning algorithm to cluster the
data items into several relatively small subclusters.
During the second phase, it uses an algorithm to find the genuine clusters by repeatedly
combining these sub-clusters.

Overall Framework of CHAMELEON

[Figure: Data Set → Construct a Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters.]

Grid-Based Clustering Method


Using multi-resolution grid data structure
Several interesting methods
STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB98)
A multi-resolution clustering approach using
wavelet method
CLIQUE: Agrawal, et al. (SIGMOD98)

Vertical gridding
We can observe that almost all methods discussed so far suffer from the curse of cardinality (for very large-cardinality data sets, the algorithms are too slow to finish in an average lifetime!) and/or the curse of dimensionality (points are all at ~ the same distance).
The work-arounds employed to address the curses:
Sampling: throw out most of the points so that what remains is of low enough cardinality for the algorithm to finish, in such a way that the remaining sample contains all the information of the original data set (therein lies the problem: that is impossible to do in general).
Gridding: agglomerate all points in a grid cell and treat them as one point (smooth the data set to this gridding level). The problem with gridding, often, is that information is lost and the data structure that holds the grid-cell information is very complex. With vertical methods (e.g., P-trees), all the information can be retained and griddings can be constructed very efficiently on demand. Horizontal data structures can't do this.
Subspace restrictions (e.g., Principal Components, Subspace Clustering).
Gradient-based methods (e.g., the gradient tangent vector field of a response surface reduces the calculations to the number of dimensions, not the number of combinations of dimensions).
j-hi gridding: the j high-order bits identify a grid cell and the rest identify points within a particular cell. Thus, j-hi cells are not necessarily cubical (unless all attribute bit-widths are the same).
j-lo gridding: the j low-order bits identify points within a particular cell and the rest identify a grid cell. Thus, j-lo cells always have a nice uniform shape (cubical).

1-hi gridding of Vector Space R(A1, A2, A3), in which all bit-widths are the same (= 3), so each grid cell contains 2² * 2² * 2² = 64 potential points. Grid cells are identified by their Peano id (Pid); internally the cell coordinates are shown (called the grid cell id, gci), and cell points are identified by coordinates within their cell (gcp).

[Figure: the 8 cells of the 1-hi gridding along the high bits of A1, A2, A3 (hi-bit = 0 or 1 in each dimension); the cell with Pid = 001 is shown with its 64 potential points, each labeled gci = 001 and gcp = (two bits per dimension).]

2-hi gridding of Vector Space R(A1, A2, A3), in which all bit-widths are the same (= 3), so each grid cell contains 2¹ * 2¹ * 2¹ = 8 potential points.

[Figure: the 2-hi grid cells indexed 00..11 in each of A1, A2, A3; the cell with Pid = 001.001 is shown with its 8 potential points, each labeled by gci and a one-bit-per-dimension gcp.]

1-hi gridding of R(A1, A2, A3), bit-widths of 3, 2, 3

[Figure: the 8 cells along the hi-bits of A1, A2, A3; the cell with Pid = 001 is shown with its 2² * 2¹ * 2² = 32 potential points labeled gci = 001 and gcp = (coordinates within the cell).]

2-hi gridding of R(A1, A2, A3), bit-widths of 3, 2, 3 (each grid cell contains 2¹ * 2⁰ * 2¹ = 4 potential points).

[Figure: the 2-hi grid cells indexed by the two high bits of A1 and A3 and the single bit of A2; the cell with Pid = 3.1.3 is shown with its 4 potential points.]

HOBBit disks and rings (HOBBit = Hi Order Bifurcation Bit)

In a 4-lo grid where A1, A2, A3 have bit-widths b1+1, b2+1, b3+1, the HOBBit grid centers are points of the form
x = (x1,b1..x1,4 1010, x2,b2..x2,4 1010, x3,b3..x3,4 1010)
(exactly one per grid cell), where the xi,j's range over all binary patterns.

The HOBBit disk about x of radius 2⁰ is H(x, 2⁰). Note: we have switched the direction of A3.

[Figure: H(x, 2⁰), the 2x2x2 block of points whose coordinates agree with x except possibly in the last bit of each dimension (..1010 or ..1011 in each of A1, A2, A3).]

H(x, 2¹): the HOBBit disk of radius 2¹ about a HOBBit grid center x = (x1,b1..x1,4 1010, x2,b2..x2,4 1010, x3,b3..x3,4 1010).

[Figure: H(x, 2¹), the 4x4x4 block of points whose coordinates agree with x above the last two bits of each dimension (ranging from ..1000 to ..1011 in each of A1, A2, A3).]

The regions of H(x, 2¹):

These REGIONs are labeled with the dimensions in which the length is increased (e.g., the 123-REGION is increased in all three dimensions).

[Figure: H(x, 2¹) decomposed into its 1-, 2-, 3-, 12-, 13-, 23-, and 123-REGIONs; H(x, 2⁰) itself is the 123-REGION of H(x, 2⁰).]

Algorithm (for computing gradients):

1. Select an outlier threshold, ot. Points without neighbors in their ot L∞-disk are outliers; that is, there is no gradient at these outlier points (the instantaneous rate of response change is zero).
2. Create a j-lo grid with j = ot.
3. Pick a point x in R. Build out alternating one-sided rings centered at x until a neighbor is found or radius ot is exceeded (in which case x is declared an outlier). If a neighbor is found at a radius r_i < ot, ∂f/∂x_k(x) is estimated as below.

(See the previous slides, where HOBBit disks are built out from HOBBit centers
x = (x1,b1..x1,ot+1 10..10, ..., xn,bn..xn,ot+1 10..10), the xi,j's ranging over all binary patterns.)

Note: one can use L∞-HOBBit or ordinary L∞ distance.
Note: "one-sided" means that each successive build-out increases alternately only in the positive direction in all dimensions, then only in the negative direction in all dimensions.
Note: building out HOBBit disks from a HOBBit center automatically gives one-sided rings (a built-out ring is defined to be the built-out disk minus the previous built-out disk), as shown in the next few slides.

Estimate: ∂f/∂x_k(x) ≈ ( RootCount D(x,r_i) - RootCount D(x,r_i)_k ) / Δx_k

where D(x,r_i)_k is D(x,r_{i-1}) expanded in all dimensions except k.

Alternatively, actually calculate the mean (or median?) of the new points encountered in D(x,r_i) (we have a P-tree mask for the set, so this is trivial) and measure the x_k-distance.
NOTE: one might want to go one more ring out to see whether one gets the same or a similar gradient (this seems particularly important when j is odd, since the gradient then points the opposite way).

Example: estimating the gradient at a HOBBit center x = (x1,b1..x1,4 1010, x2,b2..x2,4 1010, x3,b3..x3,4 1010) from the first new point found in H(x, 2¹):

Est ∂f/∂x_1(x) = ( RootCount D(x,r_i) - RootCount D(x,r_i)_1 ) / Δx_1 = (2-1)/(-1) = -1
Est ∂f/∂x_2(x) = ( RootCount D(x,r_i) - RootCount D(x,r_i)_2 ) / Δx_2 = (2-1)/(-1) = -1
Est ∂f/∂x_3(x) = ( RootCount D(x,r_i) - RootCount D(x,r_i)_3 ) / Δx_3 = (2-1)/(-1) = -1

[Figures: H(x, 2¹) and the disks H(x, 2¹)_1, H(x, 2¹)_2, H(x, 2¹)_3, with the estimated gradient drawn from x toward the first new point. Check other regions too?]

If the first new point lies in a region that does not extend along some dimension k, the k-th component comes out 0, e.g.:

Est ∂f/∂x_1(x) = (2-2)/(-1) = 0,  Est ∂f/∂x_2(x) = (2-1)/(-1) = -1,  Est ∂f/∂x_3(x) = (2-1)/(-1) = -1
or
Est ∂f/∂x_1(x) = (2-1)/(-1) = -1,  Est ∂f/∂x_2(x) = (2-2)/(-1) = 0,  Est ∂f/∂x_3(x) = (2-2)/(-1) = 0

[Figures: the same build-out with the first new point falling in different regions, and the resulting estimated gradients.]

Intuitively, this gradient estimation method seems to work.
Next we consider a potential accuracy improvement in which we take the medoid of all new points as the gradient (or, more accurately, as the point to which we climb in any response-surface hill-climbing technique).

Estimate the gradient arrowhead as being at the medoid of the new point set (or, more correctly, estimate the next hill-climb step).
Note: if the original points are truly part of a strong cluster, the hill climb will be excellent.
Note: if the original points are not truly part of a strong cluster, the weak hill climb will indicate that.

[Figures: H(x, 2¹) with the new points and their centroid marked, for a strong-cluster case and a weak-cluster case.]

If the first new point is found only farther out, e.g., in H(x, 2²):

Est ∂f/∂x_1(x) = ( RootCount D(x,r_i) - RootCount D(x,r_i)_1 ) / Δx_1 = (2-1)/3 = 1/3
Est ∂f/∂x_2(x) = ( RootCount D(x,r_i) - RootCount D(x,r_i)_2 ) / Δx_2 = (2-1)/3 = 1/3
Est ∂f/∂x_3(x) = ( RootCount D(x,r_i) - RootCount D(x,r_i)_3 ) / Δx_3 = (2-1)/3 = 1/3

Note that the gradient points in the right direction and is very short (as it should be!).

[Figure: H(x, 2²), H(x, 2²)_1, and the first new point.]

To evaluate how well the formula estimates the gradient, it is important to consider all cases of the new point appearing in one of these regions (if one point appears, gradient components are additive, so it suffices to consider one at a time).

[Figures: the nested HOBBit disks H(x, 2⁰), H(x, 2¹), H(x, 2²), H(x, 2³) and their labeled regions, e.g., H(x,1)_1, H(x,1)_12, H(x,1)_123, ..., H(x,2)_123, H(x,3)_123.]

Notice that the HOBBit center moves more and more toward the true center as the grid size increases.

[Figure: H(x, 2³).]

Grid-based Gradients and Hill Climbing

If we are using gridding to produce the gradient vector field of a response surface, might we always vary x_i in the positive direction only? How can that be done most efficiently?

1. j-lo gridding, building out HOBBit rings from HOBBit grid centers (see the previous slides where this approach was used), or j-lo gridding, building out HOBBit rings from lo-value grid points (ending in j 0-bits), x = (x1,b1..x1,j+1 0..0, ..., xn,bn..xn,j+1 0..0).
2. Ordinary j-lo gridding, building out rings from lo-value ids (ending in j zero bits).
3. Ordinary j-lo gridding, building out rings from true centers.
4. Other? (There are many other possibilities, but we will first explore 2.)

Using j-lo gridding with j = 3 and lo-value cell identifiers is shown on the next slide.
Of course, we need not use HOBBit build-out.
With ordinary unit-radius build-out, the results are more exact, but the calculations may be more complex.

HOBBit j-lo rings using lo-value cell ids

x = (x1,b1..x1,j+1 0..0, ..., xn,bn..xn,j+1 0..0)

[Figure: nested one-sided HOBBit rings about the lo-value point x: H(x,0), the regions of H(x,1) (H(x,1)_1, H(x,1)_2, H(x,1)_3, H(x,1)_12, H(x,1)_13, H(x,1)_23, H(x,1)_123), and the regions of H(x,2).]

Ordinary j-lo rings using lo-value cell ids

x = (x1,b1..x1,j+1 0..0, ..., xn,bn..xn,j+1 0..0)

[Figure: Disk(x,0), Ring(x,1), and Ring(x,2) about x.]

Ring(x,2) = PDisk(x,3) ^ PDisk(x,2)'
Ring(x,1) = PDisk(x,2) ^ PDisk(x,1)'
where PDisk(x,i) = P_{x,b} ^ .. ^ P_{x,j+1} ^ P'_j ^ .. ^ P'_{i+1}

k-Medoids Clustering Review:


Find representative objects (medoids) (actual objects in the cluster - mean seldom is).
PAM (Partitioning Around Medoids)
Select k representative objects arbitrarily
For each pair of non-selected object h and selected object i, calculate the total swapping cost TCi,h
For each pair of i and h,
If TCi,h < 0, i is replaced by h
Then assign each non-selected object to the most similar representative object
repeat steps 2-3 until there is no change
CLARA (Clustering LARge Apps) draws multiple samples of the data set, applies PAM on each sample,
and gives the best clustering as the output. Strength: deals with larger data sets than PAM. Weakness:
Efficiency depends on the sample size. A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the sample is biased
CLARANS (Clustering Large Apps based on RANdom Search) draws sample of neighbors dynamically.
The clustering process can be presented as searching a graph where every node is a potential solution,
that is, a set of k medoids. If the local optimum is found, CLARANS starts with new randomly selected
node in search for a new local optimum (Genetic-Algorithm-like). Finally the best local optimum is
chosen after some stopping condition. It is more efficient and scalable than both PAM and CLARA

A Vertical k-Medoids Clustering Algorithm


Following PAM (to illustrate the main killer idea, but it can apply much more widely):
Select k component P-trees
1. The goal here is to efficiently get one P-tree mask for each component,
2. e.g., calculate the smallest j such that the j-lo gridding has > k cells.
3. Agglomerate into precisely k components (by ORing the P-trees of cells with the closest means/medoids/corners (single_link)).
Where PAM uses "For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_i,h. For each pair of i and h, if TC_i,h < 0, i is replaced by h; then assign each non-selected object to the most similar object," use instead:
1. Find the Medoid of each component, Ci: calculate TV(Ci, x) for each x in Ci (keep a P-tree of all points with the minimum TV so far; on a smaller TV, reset this P-tree). This ends with a P-tree, PtMi, of tying Medoids for Ci. Calculate TV(PtMi, x) for each x in PtMi and pick its medoid (if there are still multiple, repeat). This is the Ci-medoid! (Alternatively, just pick one pre-Medoid.)
Note: this avoids the expense of pairwise swappings, and avoids sub-sampling as in CLARA and CLARANS.
2. Put each point with its closest Medoid (building P-tree component masks as you do this).
3. Repeat 1 & 2 until (some stopping condition, such as: no change in the Medoid set?).
4. Can we do the assignment step without a scan (create the component P-trees directly)?

A Vertical k-Means Clustering Algorithm


As mentioned on the previous slide, a vertical k-means algorithm goes similarly:
1. Select k component P-trees (the goal here is to efficiently get one P-tree mask for each component, e.g., calculate the smallest j such that the j-lo gridding has > k cells, and agglomerate into precisely k components by ORing the P-trees of cells with the closest means).
2. Calculate the mean of each component, Ch, by
   1. ANDing each basic P-tree with P_Ch
   2. in dimension Ak, calculating the k-th component of the mean as
      ( Σ_{i = b_k .. 0} 2^i * rc(P_Ch ^ P_{k,i}) ) / rc(P_Ch)   (see the sketch below)
3. Put each point with the closest mean (building P-tree component masks as you do this).
4. Repeat 2 and 3 until (some stopping condition, such as: no change in the mean set?).
5. Can we do the assignment step without a scan (create the component P-trees directly)?
Zillions of vertical hybrid clustering algorithms leap to mind (involving partitioning, hierarchical, and density methods)!
Pick one!
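A hedged sketch (not the patented P-tree implementation) of the bit-slice mean formula above, using plain NumPy boolean arrays in place of compressed P-trees; rc() is just a popcount of the ANDed masks, and the example values are illustrative.

    import numpy as np

    def root_count(mask):
        """rc(): number of 1-bits in a (here uncompressed) vertical bit mask."""
        return int(mask.sum())

    def component_mean(bit_slices, cluster_mask):
        """Mean of attribute A_k over a cluster:
           ( sum_i 2^i * rc(P_Ch AND P_{k,i}) ) / rc(P_Ch),
           where bit_slices[0] is the highest-order bit slice."""
        b = len(bit_slices) - 1
        total = sum((2 ** (b - j)) * root_count(cluster_mask & slice_j)
                    for j, slice_j in enumerate(bit_slices))
        return total / root_count(cluster_mask)

    # tiny usage example: attribute values [5, 3, 6, 2] (3-bit), cluster = rows 0 and 2
    vals = np.array([5, 3, 6, 2])
    bit_slices = [((vals >> i) & 1) == 1 for i in (2, 1, 0)]   # high bit first
    cluster = np.array([True, False, True, False])
    print(component_mean(bit_slices, cluster))                 # (5 + 6) / 2 = 5.5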

Finding Density Attractors for Density Clustering Alg


Finding density attractors (and their attractor sets, which, when agglomerated via a density threshold, constitute the generic density clustering algorithm):
1. Pick a point.
2. Build out (HOBBit or ordinary or other) rings one at a time until the first k neighbors are found.
3. In that ring, compute the medoid points (as in step 1 of the V-k-Medoids algorithm above).
4. If the medoid increases the density, climb to it and go to 2.
5. Else declare it a density attractor and go to 1.

j-grids - P-tree relationship

1-hi-gridding of R(A1, A2, A3).
Bit-widths: 3, 2, 3 (dimension cardinalities: 8, 4, 8).
Using tree-node-like identifiers:
cell (tree) id (ci) of the form c0.c1...cd
point (coord) id (pi) of the form p1.p2.p3

[Figure: the 8 cells ci = 000 .. 111 of the 1-hi gridding, each containing the points identified by their coordinates within the cell, and the corresponding tree: a root with one child per cell (000 .. 111) and the point ids as leaves under each cell.]

A 1-hi-grid yields a P-tree with level-0 (cell level) fanout of 2³ and level-1 (point level) fanout of 2⁵. If leaves are segment-labelled (not coords), the leaf labels within a cell run 000.00, 000.01, 000.10, 000.11, ...

j-grids - P-tree relationship (Cont.)

One can view a standard P-tree as nested 1-hi-griddings, with constant subtrees compressed out.
R(A1, A2, A3) with bit-widths 3, 2, 3.

[Figure: the nested 1-hi griddings of the space and the corresponding P-tree, whose root has children 000 .. 111 and whose next level refines each cell with labels 00 .. 11.]

Gridding categorical data?

The following bioinformatics (yeast genome) data used was extracted mostly from the MIPS database (Munich Information center for Protein Sequences).
The left column shows features; treat these with a high-order bit (1 iff the gene participates).
There may be more levels of hierarchy (e.g., function: some genes actually cause the function when they express in sufficient quantities, while others are transcription factors for those primary genes. Primary genes have the high bit on; transcription factors have the second bit on).
The right column shows the number of distinct feature values; bitmap these.

  Feature          Total Values
  pathway          80
  EC               622
  complexes        316
  function         259
  localization     43
  protein class    191
  phenotype        181
  interactions     6347

Data Representation
A gene-by-feature table.
For a categorical feature, we consider each category as a separate attribute or column by bit-mapping it.
The resulting table has a total of
8039 distinct feature bit vectors (corresponding to items in MBR) for
6374 yeast genes (corresponding to transactions in MBR)

STING: A Statistical Information Grid Approach


Wang, Yang and Muntz (VLDB97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different
levels of resolution

STING: A Statistical Information Grid Approach (2)


Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
Parameters of higher-level cells can be easily calculated from the parameters of the lower-level cells:
count, mean, s (standard deviation), min, max
type of distribution (normal, uniform, etc.)
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer, typically with a small number of cells
For each cell in the current level, compute the confidence interval
STING: A Statistical Information Grid Approach (3)


Remove the irrelevant cells from further consideration
When finished examining the current layer, proceed to the next lower
level
Repeat this process until the bottom layer is reached
Advantages:
Query-independent, easy to parallelize, incremental update
O(K), where K is the number of grid cells at the lowest level
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and
no diagonal boundary is detected

WaveCluster (1998)
Sheikholeslami, Chatterjee, and Zhang (VLDB98)
A multi-resolution clustering approach which applies
wavelet transform to the feature space
A wavelet transform is a signal processing technique that
decomposes a signal into different frequency sub-bands.

Both grid-based and density-based


Input parameters:
# of grid cells for each dimension
the wavelet, and the # of applications of wavelet transform.

WaveCluster (1998)
How to apply the wavelet transform to find clusters:
Summarize the data by imposing a multidimensional grid structure onto the data space
These multidimensional spatial data objects are represented in an n-dimensional feature space
Apply the wavelet transform on the feature space to find the dense regions in the feature space
Apply the wavelet transform multiple times, which results in clusters at different scales, from fine to coarse

What Is Wavelet (2)?

[Figure: quantization of the feature space followed by the wavelet transformation.]

WaveCluster (1998)
Why is wavelet transformation useful for clustering
Unsupervised clustering
It uses hat-shaped filters to emphasize regions where points cluster, while
simultaneously suppressing weaker information on their boundaries
Effective removal of outliers
Multi-resolution
Cost efficiency

Major features:

Complexity O(N)
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data

CLIQUE (Clustering In QUEst)


Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD98).

http://portal.acm.org/citation.cfm?id=276314
Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
CLIQUE can be considered as both density-based and grid-based
It partitions each dimension into the same number of equal-length intervals
It partitions an m-dimensional data space into non-overlapping rectangular
units
A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps


Partition the data space and find the number of points that
lie inside each cell of the partition.
Identify the subspaces that contain clusters using the Apriori
principle
Identify clusters:
Determine dense units in all subspaces of interest
Determine connected dense units in all subspaces of interest.

Generate minimal description for the clusters


Determine maximal regions that cover a cluster of connected dense
units for each cluster
Determination of minimal cover for each cluster
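A sketch of the first step only, under stated assumptions: counting points per grid unit in a chosen subspace and keeping the dense ones. Here xi plays the role of the input density threshold, dims and n_intervals are illustrative parameters, and the Apriori-style subspace search and cluster description steps are omitted.

    from collections import Counter
    import numpy as np

    def dense_units(X, dims, n_intervals, xi):
        """Partition each selected dimension into n_intervals equal-length intervals
           and return the units whose point fraction exceeds the threshold xi."""
        lo = X[:, dims].min(axis=0)
        hi = X[:, dims].max(axis=0)
        width = (hi - lo) / n_intervals
        width = np.where(width > 0, width, 1.0)          # guard constant dimensions
        cells = np.minimum(((X[:, dims] - lo) / width).astype(int), n_intervals - 1)
        counts = Counter(map(tuple, cells))
        return {unit for unit, c in counts.items() if c / len(X) > xi}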

ABSTRACT

CLIQUE

Data mining applications place special requirements on clustering algorithms including:


the ability to find clusters embedded in subspaces of high dimensional data,
scalability, end-user comprehensibility of the results,
non-presumption of any canonical data distribution, and
insensitivity to the order of input records.
We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies
dense clusters in subspaces of maximum dimensionality.
It generates cluster descriptions in the form of DNF expressions that are minimized for ease of
comprehension.
It produces identical results irrespective of the order in which input records are presented and does not
presume any specific mathematical form for data distribution.
Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high-dimensional
datasets.

[Figure: CLIQUE example with density threshold = 3. Dense one-dimensional units are found along salary ($10,000, 0-7), vacation (weeks, 0-7), and age (20-60); intersecting the dense regions in the (age, salary) and (age, vacation) subspaces identifies the candidate cluster in (age, salary, vacation).]

Strength and Weakness of CLIQUE


Strength
It automatically finds subspaces of the highest dimensionality such
that high density clusters exist in those subspaces
It is insensitive to the order of records in input and does not
presume some canonical data distribution
It scales linearly with the size of input and has good scalability as
the number of dimensions in the data increases

Weakness
The accuracy of the clustering result may be degraded at the
expense of simplicity of the method

Model-Based Clustering Methods


Attempt to optimize the fit between the data and some
mathematical model
Statistical and AI approach
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds characteristic description for each concept (class)

COBWEB (Fisher, 1987)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that
concept

COBWEB Clustering Method

A classification tree

More on Statistical-Based Clustering


Limitations of COBWEB
The assumption that the attributes are independent of each other is
often too strong because correlation may exist
Not suitable for clustering large database data: skewed tree and
expensive probability distributions

CLASSIT
an extension of COBWEB for incremental clustering of continuous
data
suffers similar problems as COBWEB

AutoClass (Cheeseman and Stutz, 1996)


Uses Bayesian statistical analysis to estimate the number of
clusters
Popular in industry

Other Model-Based Clustering Methods


Neural network approaches
Represent each cluster as an exemplar, acting as a
prototype of the cluster
New objects are distributed to the cluster whose exemplar
is the most similar according to some distance measure

Competitive learning
Involves a hierarchical architecture of several units
(neurons)
Neurons compete in a winner-takes-all fashion for the
object currently being presented

Model-Based Clustering Methods

Self-organizing feature maps (SOMs)


Clustering is also performed by having several units
competing for the current object
The unit whose weight vector is closest to the current object
wins
The winner and its neighbors learn by having their weights
adjusted
SOMs are believed to resemble processing that can occur in
the brain
Useful for visualizing high-dimensional data in 2- or 3-D
space

Hybrid Clustering

Hybrid clustering combines the partitioning and hierarchical clustering approaches.
One of the common approaches is to combine the k-means method and hierarchical clustering:
first partition the dataset into k small clusters, and then merge the clusters based on similarity using a hierarchical method.

[Figure: a dataset partitioned into K = 7 preliminary clusters.]

Problems of Existing Hybrid Clustering

Predefine the number of preliminary clusters, K.
Unable to handle noisy data.

[Figure: a noisy example dataset plotted on 0-120 axes.]

Similarity Measures
Similarity is fundamental to the definition of a
cluster.
1. Minkowski Metric
2. Context Metrics:
1. Mutual neighbor distance
2. Conceptual similarity measure

Minkowski Metric
Minkowski metric: d_p(x_i, x_j) = ||x_i - x_j||_p = ( Σ_k |x_i,k - x_j,k|^p )^(1/p)
It works well when a data set has compact or
isolated clusters
The drawback is the tendency of the largest-scaled
feature to dominate the others.
Solutions to this problem include normalization of the
continuous features (to a common range or variance) or
other weighting schemes.
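A small sketch of the Minkowski metric and the normalization fix mentioned above; p and the z-score normalization are illustrative choices.

    import numpy as np

    def minkowski(xi, xj, p=2):
        """||xi - xj||_p = ( sum_k |xi_k - xj_k|^p )^(1/p); p=2 Euclidean, p=1 Manhattan."""
        return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

    def zscore(X):
        """Normalize each feature to zero mean / unit variance so no feature dominates."""
        return (X - X.mean(axis=0)) / X.std(axis=0)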

Mutual neighbor distance (MND)


MND comes from real life observations and is based on a nonsymmetric similarity (e.g., friendship level).
Two persons A and B group together as close friends if they
mutually feel that the other is his closest friend.
If A feels that B is not such a close friend to him, then even though
B may feel that A is his closest friend, the bond of friendship
between them is comparatively weak.
The strength of the bond of friendship between two persons is a
function of mutual feeling rather than one-way feeling.

Mutual neighbor distance (MND)

MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i)

where NN(x_i, x_j) is the neighbor-ness value (or similarity) of x_j to x_i (in the examples below, the rank of x_j among x_i's neighbors).
The MND is not a metric (it does not satisfy the triangle inequality). It satisfies the first two conditions of a metric:
1. MND(x_i, x_j) >= 0, and MND(x_i, x_j) = 0 iff x_i = x_j   (positive definite)
2. MND(x_i, x_j) = MND(x_j, x_i)   (symmetric)

Note that the symmetry was manufactured by defining MND as the sum of the two non-symmetric NN measures.

In spite of this, MND has been successfully applied in several clustering applications.
This observation supports the viewpoint that a dissimilarity does not need to be a metric.

MND Distance (Cont.)

Figure 4:
NN(A, B) = 1, NN(B, A) = 1, MND(A,B) = 2
NN(B, C) = 1, NN(C, B) = 2, MND(B, C) = 3

Figure 5:
NN(A, B) = 1, NN(B, A) = 4, MND(A,B) = 5
NN(B, C) = 1, NN(C, B) = 2, MND(B, C) = 3
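A sketch of MND assuming NN(x_i, x_j) is the rank of x_j among x_i's neighbors ordered by distance (rank 1 = nearest); this convention reproduces the figure values above when the point layout matches, but it is an assumption about how NN is computed.

    import numpy as np

    def nn_rank(D, i, j):
        """NN(x_i, x_j): position of x_j in x_i's neighbor list, nearest = 1."""
        order = np.argsort(D[i])
        order = order[order != i]            # a point is not its own neighbor
        return int(np.where(order == j)[0][0]) + 1

    def mnd(D, i, j):
        """Mutual neighbor distance: NN(x_i, x_j) + NN(x_j, x_i)."""
        return nn_rank(D, i, j) + nn_rank(D, j, i)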

Conceptual Similarity Measure


In the case of conceptual clustering, the similarity
between xi and xj is defined as
s(xi, xj) = f(xi, xj, E, C)
where E is the context (the set of surrounding points) and
C is a set of pre-defined concepts.
The conceptual similarity measure is the most general
similarity measure.

Conceptual Similarity Measure (Cont.)

The Euclidean distance between points A and B is less
than that between B and C.
However, B and C can be
viewed as more similar than
A and B because B and C
belong to the same concept
(ellipse) and A belongs to a
different concept (rectangle).
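A toy illustration of s(xi, xj) = f(xi, xj, E, C) for the ellipse/rectangle example above; the particular f below (inverse distance plus a bonus when both points satisfy the same concept predicate) is an illustrative choice of mine, not a standard definition.

```python
import numpy as np

def conceptual_similarity(xi, xj, concepts):
    """Toy s(xi, xj) = f(xi, xj, E, C): start from inverse Euclidean distance
    and boost the score when xi and xj belong to the same concept in C.
    `concepts` is a list of predicates (the pre-defined concept set C)."""
    base = 1.0 / (1.0 + np.linalg.norm(np.asarray(xi) - np.asarray(xj)))
    same_concept = any(c(xi) and c(xj) for c in concepts)
    return base + (1.0 if same_concept else 0.0)

# two concepts: "inside the ellipse" and "inside the rectangle"
in_ellipse   = lambda p: (p[0] / 4.0) ** 2 + (p[1] / 2.0) ** 2 <= 1.0
in_rectangle = lambda p: 5.0 <= p[0] <= 7.0 and -1.0 <= p[1] <= 1.0
C = [in_ellipse, in_rectangle]

A, B, Cpt = (5.5, 0.0), (3.5, 0.0), (0.0, 0.0)   # B, Cpt in the ellipse; A in the rectangle
print(conceptual_similarity(B, Cpt, C))  # same concept: higher similarity despite larger distance
print(conceptual_similarity(A, B, C))    # closer in Euclidean distance but different concepts
```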

Problems and Challenges


Some progress has been made in scalable clustering methods
Partitioning: k-means, k-medoids, CLARANS
Hierarchical: BIRCH, CURE
Density-based: DBSCAN, CLIQUE, OPTICS
Grid-based: STING, WaveCluster
Model-based: Autoclass, Denclue, Cobweb

Current clustering techniques do not address all the requirements
adequately
Constraint-based clustering analysis: Constraints exist in data space
(bridges and highways) or in user queries
Still the curse of cardinality stands as
THE difficult problem!

What Is Outlier Discovery?


What are outliers?
A set of objects that are considerably dissimilar from the
remainder of the data
Example: sports outliers such as Michael Jordan, Wayne Gretzky, ...

Problem
Find top n outlier points

Applications:

Credit card fraud detection


Telecom fraud detection
Customer segmentation
Medical analysis

Outlier Discovery:
Statistical Approaches

Assume a model of the underlying distribution that generates the
data set (e.g., normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers

Drawbacks
most tests are for a single attribute
In many cases, data distribution may not be known
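A sketch of the simplest single-attribute discordancy test under an assumed normal model; the 3-sigma threshold is illustrative.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Simple discordancy test under an assumed normal model:
    flag values whose z-score exceeds the threshold.
    Works on a single attribute only, as noted above."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.where(np.abs(z) > threshold)[0]

data = np.concatenate([np.random.normal(50, 5, 500), [120.0, -10.0]])
print(zscore_outliers(data))   # indices of the injected extreme values
```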

Outlier Discovery: Distance-Based Approach


Introduced to counter the main limitations imposed by statistical methods
We need multi-dimensional analysis without knowing data distribution.
Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at
least fraction p (usually ~99/100 or so) of the objects in T lie at a distance greater
than D from O (basically: Too many of the points lie far from O)
A dual formulation is, basically: too few of the points lie close to O; equivalently, O is a
DB(p, D)-outlier if at most a fraction 1-p (usually ~1/100 or so) of the objects in T lie at
a distance of D or less from O

Algorithms for mining distance-based outliers


Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
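A naive sketch of the nested-loop idea for DB(p, D) outliers, following the definition above; it is O(n²) and only meant to make the definition concrete, which is exactly the scalability gap the later sections address.

```python
import numpy as np

def db_outliers_nested_loop(X, p=0.99, D=2.0):
    """Naive nested-loop detection of DB(p, D) outliers:
    x is an outlier if at least a fraction p of the points lie
    at distance greater than D from x."""
    n = len(X)
    outliers = []
    for i in range(n):
        far = np.sum(np.linalg.norm(X - X[i], axis=1) > D)  # points farther than D from X[i]
        if far >= p * n:
            outliers.append(i)
    return outliers

X = np.vstack([np.random.randn(200, 2), [[8.0, 8.0]]])   # dense blob plus one distant point
print(db_outliers_nested_loop(X, p=0.99, D=2.0))
```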

Outlier Discovery: Deviation-Based Approach


Identifies outliers by examining the main characteristics of
objects in a group
Objects that deviate from this description are considered
outliers
sequential exception technique
simulates the way in which humans can distinguish unusual objects from
among a series of supposedly like objects

OLAP data cube technique


uses data cubes to identify regions of anomalies in large
multidimensional data

Outlier Detection
[Diagram: huge amounts of data → outlier detection → unexpected knowledge / anomalies of interest]

Outlier Detection

Mining rare events, deviant objects and exceptions


Find anomalies of interest in many domains:

Criminal activities in electronic commerce


Intrusion in network
Pest infestation in agriculture
Detecting bugs in software, etc

An example in business:
British Telecom broke a multimillion-dollar fraud case

Outlier Analysis (Cont.)

Outlier Analysis (taxonomy of approaches)

Statistic-based
  Distribution-based (Barnett 1994)
  Depth-based (Preparata 1988)

Distance-based
  Knorr and Ng (VLDB 1998)
  Ramaswamy (MOD 2000)
  Angiulli (TKDE 2005)

Density-based
  Breunig's LOF (SIGMOD 2000)
  Papadimitriou's LOCI (ICDE 2002)

Cluster-based
  Jiang's MST (PRL)
  He's CBLOF (PRL)

Distance-based Outlier Detection


Distance-based outlier definition, DB (p, D), by Knorr and Ng (1997)
p: a percentage; D: a distance threshold
Unifies several distribution-based outlier definitions
The best-known distance-based outlier definition

Outlier detection methods

Nested loop method (Knorr and Ng, 1998)


Cell structure-based method (Knorr and Ng, 1998)
Partition-based method (Ramaswamy, 2000)
Work well for small datasets
Not efficient for large datasets

Our proposed work aims to improve the efficiency of the DB
outlier detection process

Outlier Definition
Knorr and Ng's Definition
x is an outlier if |{y ∈ X : d(y, x) > D}| ≥ p·|X|
where d(y, x) is the distance between y and x.

[Figure 1. Knorr's definition: at least p·|X| points lie at distance greater than D from x]
[Figure 2. Equivalent definition: fewer than (1-p)·|X| points lie within distance D of x]

Equivalent Description
x is an outlier if |{y ∈ X : d(y, x) ≤ D}| < (1-p)·|X|

Our Definition
Definition 1 (Disk neighborhood): the disk-neighborhood of x with radius r is the set
DiskNbr(x, r) = {y ∈ X : d(y, x) ≤ r}
where d(y, x) is the distance between y and x.
Points in DiskNbr(x, r) are called r-neighbors of x.

Definition 2 (DiskNbr-based outlier):
A point x is a DB(p, D) outlier iff |DiskNbr(x, D)| < (1-p)·|X|,
where |DiskNbr(x, D)| is the number of D-neighbors of x,
|X| is the total size of the dataset X, and
(1-p)·|X| is a number threshold derived from the user-defined p.

Proposition 1
If |DiskNbr(x, D/2)| ≥ (1-p)·|X|, then none of the D/2-neighbors of x
are outliers.
Proof: For any q ∈ DiskNbr(x, D/2),
DiskNbr(q, D) ⊇ DiskNbr(x, D/2), so
|DiskNbr(q, D)| ≥ |DiskNbr(x, D/2)| ≥ (1-p)·|X|,
hence q is not an outlier.

This gives a pruning rule: mark all D/2-neighbors of x as non-outliers at once.

[Figure 3. D/2-neighbors of x marked as non-outliers]

Proposition 2
If |DiskNbr(x, 2D)| < (1-p)·|X|, then all D-neighbors of x
are outliers.
Proof: For any q ∈ DiskNbr(x, D),
DiskNbr(q, D) ⊆ DiskNbr(x, 2D), so
|DiskNbr(q, D)| ≤ |DiskNbr(x, 2D)| < (1-p)·|X|,
hence q is an outlier.

So all points in DiskNbr(x, D) are outliers; identifying outliers
neighborhood by neighborhood is faster than point by point.

[Figure 4. All D-neighbors of x are outliers]

The DBODLP Algorithm

Step 1: Select a point x arbitrarily from X.
Step 2: Search the D-neighbors of x and compute |DiskNbr(x, D)|:
Case |DiskNbr(x, D)| < (1-p)·|X|:
r ← 2D; compute |DiskNbr(x, 2D)|
IF |DiskNbr(x, 2D)| < (1-p)·|X|, insert all D-neighbors into the outlier set
ELSE only x is an outlier
Case |DiskNbr(x, D)| ≥ (1-p)·|X|:
r ← D/2; compute |DiskNbr(x, D/2)|
IF |DiskNbr(x, D/2)| ≥ (1-p)·|X|, mark all D/2-neighbors as non-outliers
ELSE only x is a non-outlier
Step 3: Repeat until all points in dataset X are examined
(a Python sketch of these steps follows the figure below).

[Figure 5. An example showing the D/2-, D- and 2D-neighborhoods of x]
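A plain-Python sketch of Steps 1-3 above, using brute-force neighborhood counts in place of the P-tree counts of the actual implementation; the helper names and the details of the marking scheme are mine.

```python
import numpy as np

def dbodlp_sketch(X, p=0.98, D=2.0):
    """Sketch of the DBODLP pruning loop: pick an unexamined point x,
    count its D-neighbors, and use the 2D- and D/2-neighborhoods
    (Propositions 1 and 2) to label whole neighborhoods at once."""
    n = len(X)
    thresh = (1 - p) * n
    status = np.zeros(n, dtype=int)          # 0 = unexamined, 1 = non-outlier, 2 = outlier

    def nbrs(i, r):                          # indices of points within distance r of X[i]
        return np.where(np.linalg.norm(X - X[i], axis=1) <= r)[0]

    while np.any(status == 0):
        x = int(np.flatnonzero(status == 0)[0])          # Step 1: pick an unexamined point
        d_nbrs = nbrs(x, D)                              # Step 2: count D-neighbors
        if len(d_nbrs) < thresh:                         # x is an outlier
            if len(nbrs(x, 2 * D)) < thresh:             # Proposition 2: whole D-neighborhood
                status[d_nbrs[status[d_nbrs] == 0]] = 2  # is made of outliers
            status[x] = 2
        else:                                            # x is not an outlier
            half = nbrs(x, D / 2)
            if len(half) >= thresh:                      # Proposition 1: whole D/2-neighborhood
                status[half[status[half] == 0]] = 1      # is made of non-outliers
            status[x] = 1
    return np.flatnonzero(status == 2)                   # Step 3: stop once all points are examined

X = np.vstack([np.random.randn(300, 2), [[10.0, 10.0], [-9.0, 11.0]]])
print(dbodlp_sketch(X, p=0.98, D=2.0))
```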

The DBODLP Algorithm


The algorithm is implemented based on a vertical
storage and index model
Vertical Data Structures

Vertical DBODLP
DBODLP is implemented based on P-Tree
structure
A predicate range tree is used to calculate the
number of disk neighbors:
P_DiskNbr(x, D) = P_{y > x-D} AND P_{y ≤ x+D}
|DiskNbr(x, D)| = COUNT(P_DiskNbr(x, D))
The experiments are done in DataMIME
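P-tree internals are not shown in these slides (the technology is patented), so the sketch below only emulates the predicate range count with per-attribute Boolean masks; a real P-tree would answer the same COUNT from compressed vertical bit slices.

```python
import numpy as np

def range_predicate_count(X, x, D):
    """Emulate |DiskNbr(x, D)| = COUNT(P_{y > x-D} AND P_{y <= x+D}) with
    Boolean masks: one lower-bound and one upper-bound predicate per attribute,
    ANDed together, then counted."""
    lower = np.all(X > x - D, axis=1)    # P_{y > x - D}, applied per attribute
    upper = np.all(X <= x + D, axis=1)   # P_{y <= x + D}, applied per attribute
    return int(np.count_nonzero(lower & upper))

X = np.random.randint(0, 100, size=(1000, 3)).astype(float)
print(range_predicate_count(X, X[0], D=10.0))
```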

Architecture of the DataMIME System
(DataMIME™ = P-tree based database and data mining system)

[Architecture diagram. Components shown:
- Your data enters through the DII (Data Integration Interface) using the Data Integration Language (DIL), with P-tree feeders and a meta-data file
- Your data mining (queries and data mining) goes through the DMI (Data Mining Interface) using the P-tree (predicates) query language
- Data repository: a lossless, compressed, distributed, vertically-structured P-tree database]

Preliminary experimental analysis

Figure 6. Run Time Comparison

Figure 7. Scalability Comparisons

1. Our method gives almost an order-of-magnitude improvement over the
nested-loop method in terms of run time
2. All methods produce the same outlier set
3. Our method scales better than the others

Summary
Cluster analysis groups objects based on their similarity and
has wide applications
Measure of similarity can be computed for various types of
data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc., and can be performed by statistical, distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis, such
as constraint-based clustering

References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the
clustering structure, SIGMOD99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in
large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing
techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on
dynamic systems. In Proc. VLDB98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases.
SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

References (2)
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis.
John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering.
John Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets.
Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering
approach for very large spatial databases. VLDB98.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data
mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very
large databases. SIGMOD'96.
