You are on page 1of 6

Caterina Alexandra Predun

252 MaCIP2
DBSCAN: A Density-Based Spatial Clustering Algorithm

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a data


clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jrg Sander and Xiaowei
Xu in the Second International Conference on Knowledge Discovery and Data Mining (KDD-
96).[1]. This paper received the highest Impact Paper Award in the conference KDD 2014.

Notions

DBSCAN discovers clusters of arbitrary shape.


A density-based notion of cluster
o A cluster is defined as a maximal set of density-connected points.
Two parameters:
o Eps: maximum radius of neighborhood
o MinPts: minimum number of points in the Eps-neighborhood of a point

The Eps-neighborhood of a point q

o NEps(q): {p belongs to D | dist(p,q) Eps


o E.g. if we look above we can identify the core point by noticing that is
surrounded by 5 points, if we look further above the points we can see the
border which doesnt has so many neighbors but it can be reached by the
cluster we called core. The points which cannot be reached by the cluster
we wll call them noise.
Directly density-reachable :
o A point p is directly density-reachable from a point q w.r.t
Eps, MinPts if
P belongs to NEps (q)
Core point condition : | NEps (q)| MinPts
Density-reachable :
o A point p is density-reachable from a point q w.r.t.
MinPts if there is a chain of points p1, pn, p1=q, pn= p such that pi+1 is
directly density-reachable from pi
Density-connected
o A point p is density-connected from a point q w.r.t.
MinPts if there is a point o such that both p and q are density-reachable from o
w.r.t. Eps and MinPts

Algorithm

Steps:

1. Arbitrarily select point p


2. Retrieve all points density-reachable from p w.r.t. Eps and MinPts
a. If p is a core point, a cluster is formed
b. If p is a border point, no points are density-reachable from p, and DBSCAN visits
the next point of the database
3. Continue the process until all of the points have been processed

Complexity

If a spatial index is used, the computational complexity of DBSCAN is O(nlogn), where


n is the number of database objects
Otherwise the complexity is O(n2)
Conclusion

The good thing about DBSCAN is you dont give how many clusters (class labels) there will be
in dataset.

You tell to algorithm:

1- Dataset

2- Size of a Region

3- Minimum # of points needs to be in a Region

And it produces the number of clusters based on this density rate. It clusters point on three base
structures: Core, Density (Neighbor), Outlier (Noise).

What is good about it, in some domains, you never know how many clusters you need to find. So
giving the expected number of clusters like in k-Means algorithm are not suitable at all. At the
other hand you may decide minimum density rate of the given dataset and set the threshold value
for it. Especially it has very good use cases in video processing domains.
Pros:

+ No predefined number of clusters

+ Built-in noise points estimation

+ Different sized and shaped clusters can be handled

Implementation
References

[1] Ester, Martin; Kriegel, Hans-Peter; Sander, Jrg; Xu, Xiaowei (1996). Simoudis, Evangelos;
Han, Jiawei; Fayyad, Usama M., eds. A density-based algorithm for discovering clusters in large
spatial databases with noise. Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD-96). AAAI Press. pp. 226231

[2]Wikipedia, DBSCAN https://en.wikipedia.org/wiki/DBSCAN

You might also like