Ramalingaswamy cheruku
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al. (SIGMOD’99)
– DENCLUE: Hinneburg & Keim (KDD’98)
Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-
neighbourhood of that point
• NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition: |NEps(q)| ≥ MinPts
(Figure: example with MinPts = 5.)
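The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`eps_neighborhood`, `is_core`, `directly_density_reachable`) and the toy points are assumptions made for the example, and Euclidean distance is assumed for dist(p,q).

```python
import math

def eps_neighborhood(points, p, eps):
    """Indices q with dist(p, q) <= eps. Note that p itself is included."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def is_core(points, q, eps, min_pts):
    """Core point condition: the Eps-neighbourhood of q holds at least MinPts points."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in the Eps-neighbourhood of q and q is a core point."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical toy data: four points close together and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (3.0, 3.0)]
print(directly_density_reachable(points, 1, 0, eps=1.0, min_pts=4))
print(directly_density_reachable(points, 0, 4, eps=1.0, min_pts=4))
```

Note that the relation is not symmetric: a border point can be directly density-reachable from a core point without the reverse being true.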
DBSCAN
Published by Martin Ester, Hans-Peter Kriegel, Jörg
Sander, and Xiaowei Xu in the KDD-96 proceedings.
Test of Time award at KDD 2014
11500 citations on Google Scholar
DBSCAN: The Algorithm
1. Arbitrarily select a point p
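Only the first step of the algorithm survives in this text. As a rough illustration, the sketch below follows the commonly described DBSCAN procedure: pick an unvisited point, and if it is a core point, grow a cluster from it by repeatedly collecting Eps-neighbourhoods. The names `region_query`, `dbscan`, and `NOISE`, the brute-force neighbour search, and the toy data are assumptions of this sketch, not the slides' own pseudocode.

```python
import math

NOISE = -1  # label used here for points that belong to no cluster

def region_query(points, i, eps):
    """Brute-force Eps-neighbourhood of point i (includes i itself)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

# Hypothetical toy data: two tight squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

The brute-force neighbour search keeps the sketch short; practical implementations typically use a spatial index for the region queries.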
When DBSCAN Works Well
When DBSCAN Does NOT Work Well
DBSCAN: Sensitive to Parameters
Choosing Parameters of the DBSCAN Algorithm
• The DBSCAN algorithm requires two parameters:
– epsilon, which specifies how close points should be to each other to be considered part of a cluster; and
– minPts, which specifies how many neighbors a point should have to be included in a cluster.
• However, you may not know these values in advance.
Estimating epsilon:
Estimating distance to the nearest neighbor: this calculates the distance from each point to its nearest neighbor within the same cluster.
The distances to the nearest neighbor are summarized in a histogram (depicted in the figure).
The histogram indicates that the vast majority of points lie within 21.7027 units of their nearest neighbor. So, 22 may be a reasonable guess for the epsilon parameter.
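A minimal sketch of this estimate is shown below. Because cluster membership is not known before clustering, the sketch computes each point's nearest-neighbor distance over the whole data set, which is an assumption of this example. The function names `nearest_neighbor_distances` and `suggest_eps`, and the `quantile` parameter that stands in for "the vast majority of points", are also hypothetical.

```python
import math

def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point (brute force)."""
    return [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def suggest_eps(points, quantile=0.95):
    """Smallest distance that covers the given fraction of nearest-neighbor distances."""
    d = sorted(nearest_neighbor_distances(points))
    idx = max(0, min(len(d) - 1, math.ceil(quantile * len(d)) - 1))
    return d[idx]

# Hypothetical toy data: four evenly spaced points and one outlier.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(nearest_neighbor_distances(points))
print(suggest_eps(points, quantile=0.5))
```

Plotting a histogram of the returned distances, as the text describes, gives a visual check that the chosen quantile sits just past the bulk of the distribution.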
MinPts estimation:
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
OPTICS: Ordering Points To Identify
Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
DBSCAN extension.
Idea: Higher-density points should be processed
first, i.e. find the high-density clusters first.
OPTICS stores such a clustering order using two pieces
of information:
1. Core-distance
2. Reachability-distance
OPTICS: Terminology
• Core Distance: The core distance of an object p is the smallest value
of Eps such that the Eps-neighborhood of p has at least MinPts
objects.
• Reachability Distance: The reachability distance of an object p from a core object q is the
minimum radius that makes p density-reachable from q.
Mathematically:
max(core-distance(q), distance(q, p)).
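The two quantities can be sketched as follows. This is an illustrative sketch only: the function names, the `None` return value for "undefined", and the toy data are assumptions. The Eps-neighborhood is taken to include the point itself, consistent with the NEps(p) definition earlier in the slides.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its MinPts-th nearest object (counting p itself).

    Returns None (undefined) if p is not a core object within radius eps.
    """
    d = sorted(math.dist(points[p], q) for q in points)
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None
    return d[min_pts - 1]

def reachability_distance(points, p, q, eps, min_pts):
    """Reachability distance of p from q: max(core-distance(q), dist(q, p)).

    Returns None (undefined) if q is not a core object.
    """
    cd = core_distance(points, q, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[q], points[p]))

# Hypothetical toy data on a line.
points = [(0, 0), (1, 0), (2, 0), (5, 0)]
print(core_distance(points, 0, eps=10, min_pts=3))
print(reachability_distance(points, 1, 0, eps=10, min_pts=3))
print(reachability_distance(points, 3, 0, eps=10, min_pts=3))
```

A point very close to q still gets a reachability distance of at least core-distance(q), which is what smooths the reachability plot inside dense regions.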
(Figure: reachability plot for the dataset. Vertical axis: reachability-distance, which may be undefined; horizontal axis: cluster-order of the objects.)
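• Influence of y on x (Gaussian influence function):
$$f_{\mathrm{Gaussian}}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$
• Density at x, summed over the N points of the data set D:
$$f^{D}_{\mathrm{Gaussian}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
• Gradient of x in the direction of xi:
$$\nabla f^{D}_{\mathrm{Gaussian}}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
• Major features
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
– Significantly faster than existing algorithms (e.g., DBSCAN)
– But needs a large number of parameters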
DENCLUE:
It builds on kernel density estimation functions.
It estimates the probability density of the data directly from the
data instances.
In DENCLUE the probability density in the data space is
estimated as a function of all data instances:
• A clustering in DENCLUE is defined by the local maxima of the estimated density
function.
• A hill-climbing procedure is started for each data instance, which assigns the instance
to a local maximum.
• In the case of Gaussian kernels, the hill climbing is guided by the gradient of p̂(x), which
takes the form
• The hill-climbing procedure starts at a data point and iterates until the density does
not grow anymore. The update formula of the iteration to proceed from x(l) to x(l+1) is
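The density estimate, its gradient, and the update formula are not reproduced in this text. For orientation, a common way to write them for a Gaussian kernel is given below. The symbols are assumptions of this sketch rather than the slides' own notation: N data instances x_t in d dimensions, a bandwidth h, and a fixed step size δ.

```latex
% Kernel density estimate with a Gaussian kernel K
\hat{p}(x) = \frac{1}{N h^{d}} \sum_{t=1}^{N} K\!\left(\frac{x - x_t}{h}\right),
\qquad
K(u) = (2\pi)^{-d/2} \exp\!\left(-\frac{u^{\top} u}{2}\right)

% Gradient of the estimate (differentiate each kernel term with respect to x)
\nabla \hat{p}(x) = \frac{1}{N h^{d+2}} \sum_{t=1}^{N} K\!\left(\frac{x - x_t}{h}\right) (x_t - x)

% Fixed-step hill-climbing update from x^{(l)} to x^{(l+1)}
x^{(l+1)} = x^{(l)} + \delta \,
  \frac{\nabla \hat{p}\left(x^{(l)}\right)}{\left\lVert \nabla \hat{p}\left(x^{(l)}\right) \right\rVert_{2}}
```

The extra factor of 1/h² in the gradient comes from differentiating the exponent of the Gaussian kernel.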
The sum of all kernels gives an estimate of the probability density at x.
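The hill-climbing procedure described above can be illustrated with a small sketch. It assumes a Gaussian kernel with a hypothetical bandwidth `h`, a fixed step size `step`, and normalized-gradient steps that stop as soon as the density no longer increases. The function names and the toy data are made up for the example and are not taken from the DENCLUE authors' implementation.

```python
import math

def gaussian_density(x, data, h):
    """Kernel density estimate at x: normalized sum of Gaussian kernels."""
    d = len(x)
    norm = len(data) * (h ** d) * (2 * math.pi) ** (d / 2)
    return sum(math.exp(-math.dist(x, xt) ** 2 / (2 * h * h)) for xt in data) / norm

def density_gradient(x, data, h):
    """Gradient of the density estimate at x."""
    d = len(x)
    norm = len(data) * (h ** (d + 2)) * (2 * math.pi) ** (d / 2)
    grad = [0.0] * d
    for xt in data:
        k = math.exp(-math.dist(x, xt) ** 2 / (2 * h * h))
        for a in range(d):
            grad[a] += k * (xt[a] - x[a])
    return [g / norm for g in grad]

def hill_climb(x, data, h, step=0.05, max_iter=1000):
    """Follow the normalized gradient until the density stops growing."""
    x = list(x)
    density = gaussian_density(x, data, h)
    for _ in range(max_iter):
        grad = density_gradient(x, data, h)
        length = math.sqrt(sum(g * g for g in grad))
        if length == 0.0:
            break  # flat spot: nothing to climb
        candidate = [xi + step * g / length for xi, g in zip(x, grad)]
        cand_density = gaussian_density(candidate, data, h)
        if cand_density <= density:
            break  # density no longer grows: treat x as the local maximum
        x, density = candidate, cand_density
    return x

# Hypothetical toy data: two small groups of points.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
        (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
end_a = hill_climb((0.0, 0.0), data, h=0.5)
end_b = hill_climb((5.2, 5.0), data, h=0.5)
print(end_a, end_b)
```

Instances whose climbs end at (approximately) the same local maximum would be placed in the same cluster. With a fixed step size the end point is only accurate to within about one step.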