Note: with over 100 slides, we are not going to go through each in detail.
Clustering Methods
A Categorization of Major Clustering Methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
1. Partition the objects into k nonempty subsets and compute the mean (center) of each cluster.
2. Assign each object to the cluster with the most similar (closest) center.
3. Recompute the means, then go back to Step 2.
4. Stop when the new set of means doesn't change (or some other stopping condition holds); see the sketch below.
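A minimal sketch of these steps in Python (illustrative only; it assumes Euclidean distance and takes Step 1's initial centers as a random sample):

```python
import random

def kmeans(points, k, max_iter=100):
    """Naive k-means over a list of equal-length numeric tuples."""
    means = random.sample(points, k)          # Step 1: k initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                      # Step 2: assign to the closest center
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[i].append(p)
        new_means = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[j]
                     for j, cl in enumerate(clusters)]   # Step 3: recompute means
        if new_means == means:                # Step 4: stop when the means are stable
            break
        means = new_means
    return means, clusters
```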
k-Means
[Figure: k-means iterations on a 10 x 10 plot, Steps 1-4: points are assigned to the nearest center, the means are recomputed, and the assignment repeats until stable.]
Normally, k, t << n (n objects, k clusters, t iterations), so the O(tkn) cost is effectively linear in n.
Weakness
Applicable only when the mean is defined (e.g., a vector space).
Need to specify k, the number of clusters, in advance.
Sensitive to noisy data and outliers, since a small number of such points can substantially influence the mean value.
Produces round-shaped clusters, not arbitrary shapes (Chameleon data set below).
Sensitive to the selection of the initial partition; may converge to a local minimum of the criterion function if the initial partition is not well chosen.
[Figure: Chameleon data set: the correct result vs. the k-means result.]
Distance Function
Data Matrix (n objects, p variables):

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$
Dissimilarity Matrix (n objects by n objects):

$$\begin{bmatrix}
0      &        &        &        \\
d(2,1) & 0      &        &        \\
d(3,1) & d(3,2) & 0      &        \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots\; 0
\end{bmatrix}$$
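As a small illustration (not from the slides), the dissimilarity matrix can be computed from the data matrix with the Euclidean distance:

```python
def dissimilarity_matrix(X):
    """X: n x p data matrix (list of rows). Returns the n x n matrix d(i, j)."""
    n = len(X)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):            # only the lower triangle is computed:
            d[i][j] = d[j][i] = sum(  # the matrix is symmetric with zero diagonal
                (a - b) ** 2 for a, b in zip(X[i], X[j])) ** 0.5
    return d
```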
Hierarchical Clustering
[Figure: dendrogram over objects a, b, c, d, e. Bottom-up (agglomerative), Step 0 to Step 4: a and b merge into ab, d and e into de, c joins de to form cde, and ab and cde merge into abcde. Top-down (divisive) runs the same steps in reverse, Step 4 to Step 0.]
In either case, one gets a nice dendrogram in which any maximal antichain (no two nodes linked) is a clustering (partition). The horizontal antichains are the clusterings produced by the top-down (or bottom-up) method(s).
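A compact sketch of the bottom-up process (illustrative; single link merges the pair of clusters with the smallest minimum pairwise distance, and complete link would use max in place of the inner min):

```python
def agglomerative(points, dist, k=1):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]       # Step 0: every point is its own cluster
    def single_link(c1, c2):               # min pairwise distance between clusters
        return min(dist(a, b) for a in c1 for b in c2)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)     # merge; recording merges gives the dendrogram
    return clusters
```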
[Figure: two labeled data sets (points marked 1 and 2): one where single link works but complete link does not, and one where complete link works but single link does not.]
[Figure: a case where single link does not work: noise points form a bridge between the 1-cluster and the 2-cluster, so single link chains the two clusters together.]
ε-neighborhood: N_ε(p) = {q ∈ D | dist(p,q) ≤ ε}
[Figure: points p and q with MinPts = 5, ε = 1 cm; q is a core point and p is directly density-reachable from q.]
Density-reachable: p is density-reachable from q wrt ε, MinPts if there is a chain of points p₁, ..., pₙ with p₁ = q and pₙ = p such that each pᵢ₊₁ is directly density-reachable from pᵢ.
Density-connected
A point p is density-connected to a point q wrt ε, MinPts if there is a point o such that both p and q are density-reachable from o wrt ε, MinPts.
Density-reachability is not symmetric. Density-connectivity inherits the reflexivity and transitivity and adds the symmetry; thus density-connectivity is an equivalence relation and therefore gives a partition (clustering).
[Figure: p and q density-connected through a point o; core, border, and outlier points illustrated with ε = 1 cm, MinPts = 3.]
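A minimal sketch of DBSCAN as implied by these definitions (illustrative; points are hashable tuples, and eps, min_pts, dist stand for ε, MinPts, and the distance function):

```python
def dbscan(points, eps, min_pts, dist):
    """Return {point: cluster id}; -1 marks noise/outliers."""
    NOISE = -1
    labels = {}
    def neighbors(p):                      # N_eps(p) = {q in D | dist(p, q) <= eps}
        return [q for q in points if dist(p, q) <= eps]
    cid = 0
    for p in points:
        if p in labels:
            continue
        seeds = neighbors(p)
        if len(seeds) < min_pts:           # p is not a core point (may remain noise)
            labels[p] = NOISE
            continue
        cid += 1
        labels[p] = cid                    # start a new cluster at core point p
        while seeds:
            q = seeds.pop()
            if labels.get(q, NOISE) == NOISE:  # unseen, or noise becoming a border point
                labels[q] = cid
                qn = neighbors(q)
                if len(qn) >= min_pts:     # q is core: expand by density-reachability
                    seeds.extend(qn)
    return labels
```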
Other related method? How does vertical technology help here? Gridding?
OPTICS
Ordering Points To Identify Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD99)
http://portal.acm.org/citation.cfm?id=304187
Addresses the shortcoming of DBSCAN, namely choosing
parameters.
Develops a special order of the database wrt its density-based
clustering structure
This cluster-ordering contains info equivalent to the density-based
clusterings corresponding to a broad range of parameter settings
Good for both automatic and interactive cluster analysis, including
finding intrinsic clustering structure
OPTICS
[Figure: reachability plot: the reachability-distance (undefined for some points) plotted against the cluster-order of the objects; valleys correspond to clusters.]
Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45, claimed by the authors ???)
But needs a large number of parameters
Gaussian influence function: $f_{\mathrm{Gauss}}(x, y) = e^{-d(x,y)^2 / (2\sigma^2)}$
Others include functions similar to the squashing functions used in neural networks.
One can think of the influence function as a measure of the contribution to the
density at x made by y.
Overall density of the data space can be calculated as the sum of the influence function
of all data points.
Clusters can be determined mathematically by identifying density attractors.
Density attractors are the local maxima of the overall density function.
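A sketch of these ideas under the Gaussian influence function above (illustrative; the step size and tolerance are arbitrary). For the Gaussian kernel, the density gradient at x is Σᵢ f_Gauss(x, yᵢ)·(yᵢ − x)/σ², and hill-climbing along it reaches a density attractor:

```python
import math

def influence(x, y, sigma):
    """Gaussian influence of data point y at location x."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(influence(x, y, sigma) for y in data)

def climb_to_attractor(x, data, sigma, step=0.05, max_steps=500):
    """Hill-climb from x toward a density attractor (a local maximum)."""
    x = list(x)
    for _ in range(max_steps):
        w = [influence(x, y, sigma) for y in data]
        grad = [sum(wi * (y[k] - x[k]) for wi, y in zip(w, data)) / sigma ** 2
                for k in range(len(x))]
        if all(abs(g) < 1e-6 for g in grad):
            break                          # (near-)zero gradient: at an attractor
        x = [xi + step * g for xi, g in zip(x, grad)]
    return tuple(x)
```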
DENCLUE(D, σ, ξc, ξ)
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96)
http://portal.acm.org/citation.cfm?id=235968.233324
Scales linearly: finds a good clustering with a single scan and improves
quality with a few additional scans
Weakness: handles only numeric data, and sensitive to the order of the data
record.
ABSTRACT
BIRCH
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most
widely studied problems in this area is the identification of clusters, or densely populated regions, in
a multi-dimensional dataset.
Prior work does not adequately address the problem of large datasets and minimization of I/O costs.
This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies), and demonstrates that it is especially suitable for very large
databases.
BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to
produce the best quality clustering with the available resources (i.e., available memory and time
constraints).
BIRCH can typically find a good clustering with a single scan of the data, and improve the quality
further with a few additional scans.
BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points
that are not part of the underlying pattern) effectively.
We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through
several experiments.
[Figure: five points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on a 0-10 grid; their clustering feature is CF = (N, LS, SS) = (5, (16,30), (54,190)).]
Iteratively put points into the closest leaf until a threshold is exceeded, then split the leaf.
Internal nodes summarize their subtrees, and internal nodes get split when a threshold is exceeded.
Once the in-memory CF tree is built, use another method to cluster the leaves together.
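The node summaries are clustering features CF = (N, LS, SS), which are additive under merges; a tiny sketch (per-dimension square sums, matching the example above):

```python
class CF:
    """Clustering feature (N, LS, SS): count, linear sum, per-dim square sum."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = [0.0] * dim
    def add_point(self, x):
        self.n += 1
        self.ls = [a + v for a, v in zip(self.ls, x)]
        self.ss = [s + v * v for s, v in zip(self.ss, x)]
    def merge(self, other):
        """CF additivity: CF1 + CF2 summarizes the union of the two point sets."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]
    def centroid(self):
        return [v / self.n for v in self.ls]

cf = CF(2)
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
    cf.add_point(p)
print(cf.n, cf.ls, cf.ss)   # 5 [16.0, 30.0] [54.0, 190.0]
```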
Birch
[Figure: a CF tree. The root holds entries CF1-CF6, each with a child pointer; a non-leaf node holds entries CF1-CF5 with children; leaf nodes hold entries CF1, CF2, ..., CF6 and are chained together by prev/next pointers.]
ABSTRACT
Cure
Clustering, in data mining, is useful for discovering groups and identifying interesting
distributions in the underlying data. Traditional clustering algorithms either favor clusters with
spherical shapes and similar sizes, or are very fragile in the presence of outliers.
We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies
clusters having non-spherical shapes and wide variances in size.
CURE achieves this by representing each cluster by a certain fixed number of points that are
generated by selecting well scattered points from the cluster and then shrinking them toward
the center of the cluster by a specified fraction.
Having more than one representative point per cluster allows CURE to adjust well to the
geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers.
To handle large databases, CURE employs a combination of random sampling and partitioning. A
random sample drawn from the data set is first partitioned and each partition is partially
clustered. The partial clusters are then clustered in a second pass to yield the desired clusters.
Our experimental results confirm that the quality of clusters produced by CURE is much better
than those found by existing algorithms.
Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only
outperform existing algorithms but also to scale well for large databases without sacrificing
clustering quality.
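A sketch of the representative-point step described in the abstract (illustrative; c and alpha stand for the number of representatives and the shrink fraction, and the farthest-point rule is one common way to pick "well scattered" points):

```python
def representatives(cluster, c=10, alpha=0.3):
    """Pick c scattered points and shrink them toward the centroid by alpha."""
    dim = len(cluster[0])
    centroid = [sum(p[k] for p in cluster) / len(cluster) for k in range(dim)]
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    reps = [max(cluster, key=lambda p: d2(p, centroid))]  # start farthest from center
    while len(reps) < min(c, len(cluster)):
        # farthest-point heuristic: maximize distance to the points chosen so far
        reps.append(max((p for p in cluster if p not in reps),
                        key=lambda p: min(d2(p, r) for r in reps)))
    # shrinking toward the centroid dampens the effect of outliers
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centroid)) for p in reps]
```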
[Figure: CURE partitioning example with s/pq = 5; scattered representative points on x-y plots are shrunk toward the cluster center.]
ROCK: Agglomerative Hierarchical
Uses links to measure similarity/proximity; not distance-based.
Computational complexity: O(n² + n·m_m·m_a + n² log n).
Basic ideas: similarity function and neighbors.
Let T1 = {1,2,3}, T2 = {3,4,5}:
Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
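The same similarity and link computation as code (illustrative; theta is the neighbor threshold):

```python
def sim(t1, t2):
    """Jaccard similarity between two transactions (sets)."""
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta):
    """link(a, b) = number of common neighbors, where x and y are neighbors
    iff sim(x, y) >= theta."""
    nbrs = [{j for j, u in enumerate(transactions) if i != j and sim(t, u) >= theta}
            for i, t in enumerate(transactions)]
    n = len(transactions)
    return {(i, j): len(nbrs[i] & nbrs[j]) for i in range(n) for j in range(i + 1, n)}

print(sim({1, 2, 3}, {3, 4, 5}))   # 1/5 = 0.2, as in the example above
```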
ROCK: Algorithm
Links: the number of common neighbors of the two points.
CHAMELEON
CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E. H. Han, and V. Kumar (1999)
http://portal.acm.org/citation.cfm?id=621303
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity and closeness
(proximity) between two clusters are high relative to the internal
interconnectivity of the clusters and closeness of items within the
clusters
ABSTRACT
CHAMELEON
Many advanced algorithms have difficulty dealing with highly variable clusters that do
not follow a preconceived model.
By basing its selections on both interconnectivity and closeness, the Chameleon
algorithm yields accurate results for these highly variable clusters.
Existing algorithms use a static model of the clusters and do not use information about
the nature of individual clusters as they are merged.
Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores the
information about the aggregate interconnectivity of items in two clusters.
Another set of schemes (the Rock algorithm, group averaging method, and related
schemes) ignores information about the closeness of two clusters as defined by the
similarity of the closest items across two clusters.
By considering either interconnectivity or closeness only, these algorithms can select
and merge the wrong pair of clusters.
Chameleon's key feature is that it accounts for both interconnectivity and closeness in
identifying the most similar pair of clusters.
Chameleon finds the clusters in the data set by using a two-phase algorithm.
During the first phase, Chameleon uses a graph-partitioning algorithm to cluster the
data items into several relatively small subclusters.
During the second phase, it uses an algorithm to find the genuine clusters by repeatedly
combining these sub-clusters.
[Figure: the Chameleon framework: Data Set → construct a Sparse Graph → Partition the graph → Merge the partitions → Final Clusters.]
Vertical gridding
We can observe that almost all methods discussed so far suffer from the curse of cardinality (for very large data sets, the algorithms are too slow to finish in an average lifetime!) and/or the curse of dimensionality (points are all at approximately the same distance).
The work-arounds employed to address the curses:
Sampling: throw out most of the points so that what remains is of low enough cardinality for the algorithm to finish, in such a way that the remaining sample contains all the information of the original data set (therein lies the problem: that is impossible to do in general).
Gridding: agglomerate all points in a grid cell and treat them as one point (smooth the data set to this gridding level). The problem with gridding, often, is that information is lost and the data structure that holds the grid-cell information is very complex. With vertical methods (e.g., P-trees), all the information can be retained and griddings can be constructed very efficiently on demand. Horizontal data structures can't do this.
Subspace restrictions (e.g., Principal Components, Subspace Clustering).
Gradient-based methods (e.g., the gradient tangent vector field of a response surface reduces the calculations to the number of dimensions, not the number of combinations of dimensions).
j-hi gridding: the j high-order bits identify a grid cell and the rest identify points within a particular cell. Thus j-hi cells are not necessarily cubical (unless all attribute bit-widths are the same).
j-lo gridding: the j low-order bits identify points within a particular cell and the rest identify a grid cell. Thus j-lo cells always have a nice uniform (cubical) shape. A sketch of both splits follows.
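A small sketch of the two splits with bit operations (illustrative; b is the attribute bit-width):

```python
def j_hi(value, b, j):
    """Split a b-bit attribute value: j high-order bits -> cell id,
    remaining b-j bits -> coordinate within the cell."""
    cell = value >> (b - j)
    in_cell = value & ((1 << (b - j)) - 1)
    return cell, in_cell

def j_lo(value, b, j):
    """j low-order bits -> coordinate within the cell, the rest -> cell id."""
    cell = value >> j
    in_cell = value & ((1 << j) - 1)
    return cell, in_cell

# 1-hi gridding of a 3-bit attribute: value 0b101 -> cell 1, in-cell coordinate 0b01
print(j_hi(0b101, 3, 1))   # (1, 1)
```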
1-hi gridding of Vector Space R(A1, A2, A3) in which all bit-widths are the same (= 3), so each grid cell contains 2² · 2² · 2² = 64 potential points. Grid cells are identified internally by their Peano id (Pid); the cell's coordinates are called the grid-cell id (gci), and cell points are identified by coordinates within their cell (gcp).
[Figure: the grid cell with Pid = 001 (gci = 001), with its 64 potential points labeled by in-cell coordinates gcp = 00,00,00 through 11,11,11; the hi bit of each of A1, A2, A3 identifies the cell.]
2-hi gridding of Vector Space R(A1, A2, A3) in which all bit-widths are the same (= 3), so each grid cell contains 2¹ · 2¹ · 2¹ = 8 potential points.
[Figure: the grid cell with Pid = 001.001; its 8 points labeled by 1-bit in-cell coordinates gcp = 0,0,0 through 1,1,1; the two hi bits of each attribute (00-11) identify the cell.]
[Figure: 1-hi gridding of R(A1, A2, A3) with bit-widths 3, 2, 3; the cell with Pid = 001 contains 2² · 2¹ · 2² = 32 potential points, labeled by in-cell coordinates gcp = 00,0,00 through 11,1,11.]
2-hi gridding of R(A1, A2, A3) with bit-widths 3, 2, 3 (each grid cell contains 2¹ · 2⁰ · 2¹ = 4 potential points).
[Figure: the cell with Pid = 3.1.3; its 4 points labeled gcp = 0,,0; 0,,1; 1,,0; 1,,1 (A2 contributes no in-cell bits); the 2-hi bits of A1 and A3 and both bits of A2 identify the cell.]
[Figure: a HOBBit disk of radius 2¹ about a point in R(A1, A2, A3); nearby points such as (1011, 1010) and (1011, 1011) fall within the disk.]
[Figure: the HOBBit disk H(x,2⁰) and its build-out regions along each subset of dimensions: the 1-, 2-, 3-, 12-, 13-, 23-, and 123-REGIONs of H(x,2⁰).]
1. Select an outlier threshold ε_ot (points without neighbors in their ε_ot-disk are outliers; that is, there is no gradient at these outlier points, i.e., the instantaneous rate of response change is zero).
2., 3. (See the previous slides, where HOBBit disks are built out from HOBBit centers x = (x_{1,b1}...x_{1,ot+1}1010..., ..., x_{n,bn}...x_{n,ot+1}1010...), the x_{i,j}'s ranging over all binary patterns.)
Estimate the k-th gradient component as (RootCount D(x,rᵢ) − RootCount D(x,rᵢ)_k) / Δx_k.
Alternatively, in step 3, actually calculate the mean (or median?) of the new points encountered in D(x,rᵢ) (we have a P-tree mask for the set, so this is trivial) and measure the x_k-distance.
NOTE: one might want to go one more ring out to see whether one gets the same or a similar gradient (this seems particularly important when j is odd, since the gradient then points the opposite way).
[Figure: a sequence of panels over A1, A2, A3 showing the gradient estimated from each build-out component H(x,2¹)₁, H(x,2¹)₂, H(x,2¹)₃ of the HOBBit disk H(x,2¹), followed by the combined estimated gradient. When a component contributes no new points, the build-out continues to H(x,2²)₁.]
Note that the gradient points in the right direction and is very short (as it should be!).
[Figure: nested HOBBit disks H(x,0), H(x,1), H(x,2), H(x,3) around x, decomposed into build-out components H(x,r)_S for each dimension subset S ⊆ {1,2,3} (e.g., H(x,1)₁₂₃, H(x,2)₁₃); the first new point is encountered in H(x,2)₁, and the full build-out reaches H(x,2³).]
Build-out options:
1. j-lo gridding, building out HOBBit rings from HOBBit grid centers (see the previous slides, where this approach was used), or building out HOBBit rings from lo-value grid points (ending in j 0-bits): x = (x_{1,b1}...x_{1,j+1}0...0, ..., x_{n,bn}...x_{n,j+1}0...0).
2. Ordinary j-lo gridding, building out rings from lo-value ids (ending in j zero-bits).
3. Ordinary j-lo gridding, building out rings from true centers.
4. Other? (There are many other possibilities, but we will first explore 2.)
Using j-lo gridding with j = 3 and lo-value cell identifiers is shown on the next slide. With ordinary unit-radius build-out the results are more exact, but the calculations may be more complex???
[Figure: the HOBBit disks Disk(x,0), H(x,1), H(x,2) around x and the rings between them, decomposed into components such as H(x,1)₁₂₃ and H(x,2)₁₂.]
Ring(x,2) = PDisk(x,3) ^ PDisk(x,2)′ (the outer disk AND the complement of the inner disk)
Ring(x,1) = PDisk(x,2) ^ PDisk(x,1)′
where PDisk(x,i) = P_{x,b} ^ ... ^ P_{x,j+1} ^ P_j ^ ... ^ P_{i+1}
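For reference, a sketch of the HOBBit distance these disks and rings are built on, under the usual definition (the HOBBit distance between two values is the number of right bit-shifts needed to make them equal, combined across dimensions by max; this coding is an assumption for illustration):

```python
def hobbit(a, b):
    """HOBBit distance: number of right bit-shifts until a and b are equal."""
    s = 0
    while a != b:
        a, b, s = a >> 1, b >> 1, s + 1
    return s

def in_disk(x, y, j):
    """y lies in the HOBBit disk H(x, j) iff every coordinate of y is within
    HOBBit distance j of the matching coordinate of x (max over dimensions)."""
    return all(hobbit(xi, yi) <= j for xi, yi in zip(x, y))
```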
1. Choose a set of k Medoids.
2. Put each point with its closest Medoid (building P-tree component masks as you do this).
3. Repeat 1 & 2 until some stopping condition holds (such as no change in the Medoid set?).
4. Can we cluster at step 4 without a scan (create component P-trees??)
[Figure: a 1-hi gridding of R(A1, A2, A3); the eight grid cells have ids ci = 000 through 111, and the points within each cell are labeled by coordinates such as 00.1.00 through 11.0.11.]
[Figure: the corresponding P-tree: the root fans out to the eight cells 000-111, and each cell's subtree lists its points (11.1.11, 11.1.10, ..., 00.0.00).]
A 1-hi-grid yields a P-tree with level-0 (cell-level) fanout of 2³ and level-1 (point-level) fanout of 2⁵. If leaves are segment-labelled (not coords):
[Figure: the same P-tree with segment-labelled leaves: 010.11, 010.10, 000.11, 000.10, 010.01, 010.00, 000.01, 000.00, ...]
[Figure: the grid cells 000-111 arranged over A1, A2, A3 and the resulting P-tree, with cell-level fanout 2³ and leaf labels 00-11.]
Bitmap these features:

Feature         Total Values
pathway         80
EC              622
complexes       316
function        259
localization    43
protein class   191
phenotype       181
interactions    6347
Data Representation
gene-by-feature table.
For a categorical feature, we consider each category as a
separate attribute or column by bit-mapping it.
The resulting table has a total of
8039 distinct feature bit vectors (corresponding to items in
MBR) for
6374 yeast genes (corresponding to transactions in MBR)
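A tiny sketch of the bit-mapping step (illustrative; the category names are made up):

```python
def bitmap(values):
    """One column per distinct category; bit i is set iff row i has that category."""
    cats = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in cats}

print(bitmap(["kinase", "ligase", "kinase"]))
# {'kinase': [1, 0, 1], 'ligase': [0, 1, 0]}
```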
WaveCluster (1998)
Sheikholeslami, Chatterjee, and Zhang (VLDB98)
A multi-resolution clustering approach which applies
wavelet transform to the feature space
A wavelet transform is a signal-processing technique that decomposes a signal into different frequency sub-bands.
WaveCluster (1998)
How to apply wavelet transform to find clusters:
Summarize the data by imposing a multidimensional grid structure onto the data space.
These multidimensional spatial data objects are represented in an n-dimensional feature space.
Apply wavelet transform on the feature space to find the dense regions in the feature space.
Apply wavelet transform multiple times, which results in clusters at different scales, from fine to coarse.
Quantization
Transformation
WaveCluster (1998)
Why is wavelet transformation useful for clustering?
Unsupervised clustering
It uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information on their boundaries
Effective removal of outliers
Multi-resolution
Cost efficiency
Major features:
Complexity O(N)
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
http://portal.acm.org/citation.cfm?id=276314
Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
CLIQUE can be considered as both density-based and grid-based
It partitions each dimension into the same number of equal-length intervals
It partitions an m-dimensional data space into non-overlapping rectangular
units
A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
A cluster is a maximal set of connected dense units within a subspace
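A sketch of the dense-unit counting for one subspace (illustrative; dims, n_intervals, and tau stand for the chosen subspace, the per-dimension partitioning, and the density threshold; the Apriori-style join of lower-dimensional dense units is omitted):

```python
from collections import Counter

def dense_units(points, dims, n_intervals, lo, hi, tau):
    """Count points per rectangular unit in subspace `dims`; keep the units
    whose count exceeds the density threshold tau."""
    def unit(p):
        return tuple(min(int((p[d] - lo[d]) / (hi[d] - lo[d]) * n_intervals),
                         n_intervals - 1) for d in dims)
    counts = Counter(unit(p) for p in points)
    return {u for u, c in counts.items() if c > tau}
```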
CLIQUE
[Figure: example with density threshold τ = 3: dense units in the Salary ($10,000) × age and Vacation (weeks) × age subspaces (age 20-60); intersecting the dense units identifies a candidate dense region in the 3-D Salary × Vacation × age space.]
Weakness
The accuracy of the clustering result may be traded away for the simplicity of the method.
COBWEB (Fisher, 1987)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that
concept
A classification tree
CLASSIT
an extension of COBWEB for incremental clustering of continuous
data
suffers similar problems as COBWEB
Competitive learning
Involves a hierarchical architecture of several units
(neurons)
Neurons compete in a winner-takes-all fashion for the
object currently being presented
Hybrid Clustering
[Figure: hybrid clustering result with K = 7; horizontal axis 20-120.]
Similarity Measures
Similarity is fundamental to the definition of a
cluster.
1. Minkowski Metric
2. Context Metrics:
1. Mutual neighbor distance
2. Conceptual similarity measure
Minkowski Metric
$d_p(x_i, x_j) = \|x_i - x_j\|_p = \left(\sum_k |x_{ik} - x_{jk}|^p\right)^{1/p}$
It works well when a data set has compact or
isolated clusters
The drawback is the tendency of the largest-scaled
feature to dominate the others.
Solutions to this problem include normalization of the
continuous features (to a common range or variance) or
other weighting schemes.
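The metric as code (p = 2 gives Euclidean distance, p = 1 Manhattan):

```python
def minkowski(x, y, p=2):
    """||x - y||_p = (sum_k |x_k - y_k|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)
```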
MND is symmetric. Note that the symmetry was manufactured by defining MND as the sum of the two nonsymmetric NN measures.
Figure 4:
NN(A, B) = 1, NN(B, A) = 1, MND(A,B) = 2
NN(B, C) = 1, NN(C, B) = 2, MND(B, C) = 3
Figure 5:
NN(A, B) = 1, NN(B, A) = 4, MND(A,B) = 5
NN(B, C) = 1, NN(C, B) = 2, MND(B, C) = 3
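A sketch matching the figures' numbers, where NN(a, b) is the rank of b in a's neighbor list (1 = closest):

```python
def nn_rank(a, b, points, dist):
    """NN(a, b): position of b in a's neighbor list ordered by distance."""
    others = sorted((p for p in points if p != a), key=lambda p: dist(a, p))
    return others.index(b) + 1

def mnd(a, b, points, dist):
    """Mutual neighbor distance: symmetric by construction."""
    return nn_rank(a, b, points, dist) + nn_rank(b, a, points, dist)
```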
Outlier Discovery
Problem: find the top n outlier points.
Applications: e.g., fraud detection (below).
Statistical Approaches
Drawbacks:
most tests are for a single attribute
in many cases, the data distribution may not be known
Outlier Detection
Huge amounts of data
An example in business: British Telecom broke a multimillion-dollar fraud.
Outlier Analysis
Statistic-based: Distribution (Barnett 1994); Depth (Preparata 1988)
Distance-based: Knorr and Ng (VLDB 1998); Ramaswamy (MOD 2000); Angiulli (TKDE 2005)
Density-based: Breunig's LOF (SIGMOD 2000); Papadimitriou's LOCI (ICDE 2002)
Cluster-based: Jiang's MST (PRL); He's CBLOF (PRL)
Outlier Definition
Knorr and Ng's Definition
x is an outlier if |{y ∈ X | d(y,x) ≥ D}| ≥ p·|X|, i.e., at least fraction p of the points lie at distance D or more from x; d(y,x) is the distance between y and x.
[Figure: x and its D-disk; the points inside the disk number < (1-p)·|X|, and the complement holds ≥ p·|X| points.]
Equivalent Description
x is an outlier if |{y ∈ X | d(y,x) < D}| < (1-p)·|X|.
Our Definition
Definition 1 (Disk neighborhood): define the disk-neighborhood of x with radius r as the set
DiskNbr(x, r) = {y ∈ X | d(y,x) ≤ r},
where y is any point in DiskNbr(x,r) and d is the distance between y and x.
Proposition 1
If |DiskNbr(x, D/2)| ≥ (1-p)·|X|, then none of the D/2-neighbors of x are outliers.
Proof: for q ∈ DiskNbr(x, D/2), DiskNbr(q, D) ⊇ DiskNbr(x, D/2), so |DiskNbr(q, D)| ≥ |DiskNbr(x, D/2)| ≥ (1-p)·|X|; hence q is not an outlier.
[Figure: q inside the D/2-disk of x; the D-disk of q covers the D/2-disk of x.]
Proposition 2
If |DiskNbr(x, 2D)| < (1-p)·|X|, then all D-neighbors of x are outliers.
Proof: for q ∈ DiskNbr(x, D), DiskNbr(q, D) ⊆ DiskNbr(x, 2D), so |DiskNbr(q, D)| ≤ |DiskNbr(x, 2D)| < (1-p)·|X|; hence q is an outlier, and all points in DiskNbr(x, D) are outliers.
Identifying outliers neighborhood-by-neighborhood is faster than point-by-point.
[Figure: disks of radius D and 2D around x, with q in the D-disk.]
If |DiskNbr(x,D)| ≥ (1-p)·|X|:
r ← D/2; calculate |DiskNbr(x, D/2)|.
IF |DiskNbr(x, D/2)| ≥ (1-p)·|X|, mark all D/2-neighbors as non-outliers;
ELSE only x is a non-outlier.
[Figure 5: an example, showing the D/2-, D-, and 2D-disks around x.]
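A sketch of the pruning loop these propositions justify (illustrative; the disk function stands in for the P-tree range count used below):

```python
def mark_non_outliers(X, D, p, dist):
    """Clear non-outliers a whole D/2-disk at a time (Proposition 1)."""
    threshold = (1 - p) * len(X)
    def disk(x, r):
        return {y for y in X if dist(x, y) <= r}
    non_outliers = set()
    for x in X:
        if x in non_outliers:
            continue                        # already cleared by an earlier batch
        if len(disk(x, D)) >= threshold:    # x itself is not an outlier
            half = disk(x, D / 2)
            if len(half) >= threshold:
                non_outliers |= half        # Proposition 1: clear the whole disk
            else:
                non_outliers.add(x)         # only x can be cleared this round
    return non_outliers                     # the rest remain outlier candidates
```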
Vertical DBODLP
DBODLP is implemented based on the P-tree structure.
A predicate range tree is used to calculate the number of disk neighbors:
PDiskNbr(x,D) = P_{y > x-D} AND P_{y ≤ x+D}
|DiskNbr(x,D)| = COUNT(PDiskNbr(x,D))
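The same idea sketched with Python integers as bit vectors, one bit per row (illustrative; a real P-tree is compressed and tree-structured, but the AND-then-count logic is the same, shown here for one dimension):

```python
def range_mask(values, lo, hi):
    """Bit vector with bit i set iff lo <= values[i] <= hi
    (stands in for P_{y > x-D} AND P_{y <= x+D})."""
    mask = 0
    for i, v in enumerate(values):
        if lo <= v <= hi:
            mask |= 1 << i
    return mask

def disk_nbr_count(values, x, D):
    """|DiskNbr(x, D)| = COUNT(PDiskNbr(x, D)) via a population count."""
    return bin(range_mask(values, x - D, x + D)).count("1")
```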
The experiments are done in DataMIME
YOUR DATA
YOUR DATA
MINING
Query and Data
mining
Meta
Data File
P-Tree
Feeders
Data Integration
Language
DIL
Data Repository
lossless, compressed, distributed, verticallystructured P-tree database
Summary
Cluster analysis groups objects based on their similarity and
has wide applications
Measure of similarity can be computed for various types of
data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
There are still lots of research issues on cluster analysis, such
as constraint-based clustering
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in
large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing
techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB'98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases.
SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis.
John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets.
Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very
large databases. SIGMOD'96.