
Available online at www.sciencedirect.com

ScienceDirect
Procedia Computer Science 107 (2017) 442 – 447

International Congress of Information and Communication Technology (ICICT 2017)

Parallel Implementation of Density Peaks Clustering Algorithm Based on Spark

Rui Liu a, Xiaoge Li a,*, Liping Du a, Shuting Zhi a, Mian Wei b

a School of Computing, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
b Tulane University, New Orleans, LA 70118, USA
* Corresponding author: lixg@xupt.edu.cn. Tel.: 15055114114

Abstract

Clustering algorithms are widely used in data mining. They attempt to partition elements into several clusters such that elements in the same cluster are similar to each other, while elements in different clusters are not. The recently published density peaks clustering algorithm overcomes a disadvantage of distance-based algorithms, which can only find clusters of nearly circular shape: it can discover clusters of arbitrary shape and is insensitive to noise. However, it must calculate the distances between all pairs of data points and therefore does not scale to big data. To reduce the computational cost of the algorithm, we propose an efficient distributed density peaks clustering algorithm based on Spark's GraphX. This paper demonstrates the effectiveness of the method on two different data sets. The experimental results show that our system improves performance significantly (up to 10x) compared to a MapReduce implementation. We also evaluate the expansibility and scalability of our system.

Keywords: density peaks; clustering; Spark; GraphX; big data

1. Introduction

Clustering analysis is an important technique in machine learning and data mining. Clustering analysis1 divides elements into several clusters such that elements in the same cluster are similar to each other, while elements in different clusters are not. At present, there are many clustering algorithms, such as partition-based methods (e.g. k-medoids2, k-means3), hierarchical methods (e.g. Agglomerative Nesting (AGNES)4), density-based methods (e.g. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)5), grid-based methods (e.g. the Grid-Clustering algorithm for High-dimensional very Large spatial databases (GCHL)6) and probability-model-based methods. In 2014, a paper on the density peaks clustering algorithm

1877-0509 © 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the scientific committee of the 7th International Congress of Information and Communication Technology
doi:10.1016/j.procs.2017.03.138

was published in Science magazine7. The core of the algorithm is that cluster centers are characterized by a higher
density than their neighbors and by a relatively large distance from points with higher densities 7.
In this paper, we present a parallel implementation of a density peaks clustering system using GraphX on Spark. We study the effectiveness of the method and evaluate the running time both under a varying number of nodes with a fixed amount of data and under a varying amount of data with a fixed number of nodes. Finally, we compare the running times of the Spark and MapReduce implementations.
The rest of this paper is organized as follows. In Section 2, we review the density peaks clustering algorithm and the Spark RDD model. In Section 3, we introduce our parallel density peaks clustering system based on Spark. Section 4 provides the details of our experiments and analyzes the results. Finally, in the Conclusions we summarize our contribution and indicate directions for future research.

2. Related works

This section reviews the density peaks clustering algorithm and introduces Spark RDD model.

2.1. Density peaks clustering algorithm

The kernel of the density peaks clustering algorithm is the computation of two values for each point i: the local density ρ_i and the distance δ_i from points of higher density. For point i, the local density ρ_i is defined as:

ρ_i = Σ_j χ(d_ij − d_c)    (1)

where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, d_ij is the distance between point i and point j, and d_c is a cutoff distance. In effect, ρ_i is equal to the number of points closer to point i than d_c. Remarkably, the algorithm is robust with respect to the choice of d_c for large data sets, since it is sensitive only to the relative magnitude of ρ_i across different points.
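As a concrete illustration, the cutoff kernel of Eq. (1) can be sketched in a few lines of sequential Python (a toy sketch for clarity, not the distributed implementation; the point set and d_c below are made up):

```python
import math

def local_density(points, d_c):
    """rho_i = sum_j chi(d_ij - d_c), where chi(x) = 1 if x < 0
    and 0 otherwise (Eq. 1): the number of points within d_c."""
    rho = [0] * len(points)
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) < d_c:
                rho[i] += 1
    return rho

# Three points cluster near the origin; one outlier sits far away.
print(local_density([(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)], d_c=1.0))
# -> [2, 2, 2, 0]
```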
δ_i is obtained as the minimum distance between point i and any other point with higher density:

δ_i = min_{j: ρ_j > ρ_i} d_ij    (2)

For the point i with the highest density, we take δ_i = max_j(d_ij). δ_i is much larger than the typical nearest-neighbor distance only for points that are global or local density maxima. Therefore, cluster centers are recognized as points for which the value of δ_i is anomalously large.
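Eq. (2), together with the max rule for the highest-density point, can likewise be sketched sequentially (toy data invented for illustration; points of exactly equal density both fall back to the max rule here):

```python
import math

def delta_distances(points, rho):
    """delta_i = min over j with rho_j > rho_i of d_ij (Eq. 2);
    a point with no denser neighbor (the global density peak)
    gets delta_i = max_j d_ij instead."""
    n = len(points)
    delta = []
    for i in range(n):
        higher = [math.dist(points[i], points[j])
                  for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            delta.append(max(math.dist(points[i], points[j])
                             for j in range(n) if j != i))
    return delta

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(delta_distances(points, rho=[2, 3, 2, 0]))  # -> [1.0, 9.0, 1.0, 8.0]
```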

Fig. 1. Point distribution. Fig. 2. Decision graph for the data in Fig. 1.

For each point i, ρ_i and δ_i can be plotted in a two-dimensional decision graph. For example, Fig. 1 shows 28 points embedded in a two-dimensional space; points 1 and 10 are the density maxima, i.e. the cluster centers. Fig. 2 shows ρ_i and δ_i for each point i in a decision graph. The values of δ_9 and δ_10 are very different, while the values of ρ_9 and ρ_10 are very similar: point 9 belongs to the cluster of point 1, whereas point 10 is the center of another cluster. Hence, only points with a high δ and a relatively high ρ are cluster centers. Points 26, 27, and 28 are isolated because they have a relatively high δ but a low ρ.

2.2. Spark RDD model

Spark is a fast and general engine for large-scale data processing. All operations in Spark are based on resilient distributed datasets (RDDs), a fault-tolerant, parallel data structure that offers a rich set of operations for processing data sets. In general, there are several common models for data processing: iterative algorithms, relational queries, MapReduce, and stream processing. For example, Hadoop MapReduce is based on the MapReduce model, and Storm is based on the stream processing model. RDDs combine all four models, so Spark can be applied to a wide variety of large-scale data processing tasks.
RDDs support persistence and partitioning, which users can control through the persist and partitionBy functions. The partitioning characteristic and the parallel computing capability of RDDs enable Spark to make better use of scalable hardware resources; combining partitioning and persistence makes processing massive data even more efficient.
RDDs support two types of operations: transformations and actions. No matter how many transformations are applied, an RDD is not actually computed; computation is triggered only when an action is performed. In the internal implementation of RDDs, the underlying interface is based on iterators, which makes data access more efficient and avoids the memory consumption of a large number of intermediate results.
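The lazy behaviour of transformations versus actions can be illustrated with a plain-Python analogy (generators stand in for RDDs here; this is not Spark code):

```python
log = []

def double_all(data):
    """Analogue of an RDD transformation: it only describes the
    work; nothing runs until the result is actually consumed."""
    for x in data:
        log.append(x)  # records when an element is really processed
        yield x * 2

pipeline = double_all([1, 2, 3])  # "transformation": nothing computed yet
assert log == []

result = list(pipeline)           # "action": triggers the computation
assert result == [2, 4, 6]
assert log == [1, 2, 3]
```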

3. The parallel density peaks clustering System

Fig. 3 outlines the architecture of our parallel density peaks clustering system.

Fig.3 The architecture of our parallel density peaks clustering system.

First, Spark is initialized; this includes setting the local density threshold ρ and the threshold δ for the distance from points of higher density. Second, the vertex and edge data stored on HDFS are imported into a vertex RDD and an edge RDD respectively, and the distance of each edge is computed. Third, the vertex RDD and the edge RDD are combined to form a graph in GraphX8, after which the truncated distance is calculated from the generated graph. Then, for each point i, the local density ρ_i and the distance δ_i from points of higher density are computed. Lastly, clustering is performed according to the local density ρ_i and the distance δ_i.

3.1. Building graph

Building the graph consists of three steps. First, the vertex and edge data stored on HDFS or another file system are imported into a vertex RDD and an edge RDD respectively, and the initial value of each edge is set to a constant. Second, the distance of each edge is computed with a distance measure and the value of each edge is updated to that distance. Lastly, the vertex RDD and the edge RDD are combined to form a graph in GraphX. For example, consider a vertex set {1, 2, 3, 4, 5} and an edge set {(1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5)}. When the vertex set and the edge set are imported, the initial value of each edge is set to 1, as shown in Fig. 4. When the distance of each edge is computed, the value of each edge is updated to that distance, as shown in Fig. 5.

Fig. 4 Initial graph. Fig. 5 Updated graph.
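The three steps above can be sketched sequentially as follows (plain Python dicts standing in for the vertex RDD and edge RDD; the vertex coordinates are invented for illustration):

```python
import math
from itertools import combinations

def build_weighted_graph(vertices):
    """Step 1: create every undirected edge with initial value 1;
    step 2: update each edge value to the Euclidean distance
    between its endpoints. The 'graph' here is just two dicts."""
    edges = {(i, j): 1 for i, j in combinations(sorted(vertices), 2)}
    for (i, j) in edges:
        edges[(i, j)] = math.dist(vertices[i], vertices[j])
    return vertices, edges

vertices = {1: (0.0, 0.0), 2: (3.0, 4.0), 3: (0.0, 8.0)}
_, edges = build_weighted_graph(vertices)
print(edges[(1, 2)])  # -> 5.0
```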

3.2. Computing the truncated distance

To reduce the computational load, the truncated distance d_c is calculated before the local density ρ_i. Following the heuristic of reference 7, d_c is selected at the 98%~99% position of the pairwise distances sorted in descending order, so that on average each point has about 1% to 2% of the other points as neighbors.
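One common reading of this heuristic (assumed here, since the exact procedure is not spelled out) is to sort all pairwise distances ascending and pick d_c so that roughly 1%–2% of them fall below it:

```python
import math
from itertools import combinations

def cutoff_distance(points, fraction=0.02):
    """Pick d_c so that about `fraction` of all pairwise distances
    are smaller, i.e. each point has on average roughly
    fraction * n neighbors (one interpretation of the 98%~99% rule)."""
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    idx = max(0, int(round(fraction * len(dists))) - 1)
    return dists[idx]

pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
print(cutoff_distance(pts, fraction=0.5))  # -> 2.0
```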

3.3. Computing the local density

The local density ρ_i of each vertex is calculated according to formula (1). Fig. 6 shows the local density ρ_i of each vertex in Fig. 5; for example, the local density ρ_1 of vertex 1 is 4.

Fig. 6 The local density. Fig. 7 The local density and the distance from points of higher density.

3.4. Computing the distance from points of higher density

The distance δ_i from points of higher density is calculated for each vertex according to formula (2). With GraphX on Spark, δ_i is computed as follows: first, for each edge, if the local density ρ_source of the source vertex is less than the local density ρ_target of the target vertex, a message is sent to the source vertex; otherwise a message is sent to the target vertex. Second, all messages received by each vertex are merged. Lastly, each vertex finds the minimum edge length among all its messages, and δ_i is set to that minimum. Fig. 7 shows ρ_i and δ_i for each vertex in Fig. 6.
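This message-passing scheme maps onto GraphX's aggregateMessages; the sketch below simulates it sequentially in plain Python (density ties go to the target vertex, an arbitrary choice):

```python
def delta_via_messages(rho, edges):
    """For each edge, send its length to the lower-density endpoint
    (the sendMsg step); each vertex then keeps the minimum received
    length as its delta_i (the mergeMsg step). The highest-density
    vertex receives nothing and must be handled separately."""
    inbox = {v: [] for v in rho}
    for (src, dst), length in edges.items():
        target = src if rho[src] < rho[dst] else dst
        inbox[target].append(length)
    return {v: min(msgs) for v, msgs in inbox.items() if msgs}

rho = {1: 3, 2: 2, 3: 1}
edges = {(1, 2): 1.0, (1, 3): 2.0, (2, 3): 1.5}
print(delta_via_messages(rho, edges))  # -> {2: 1.0, 3: 1.5}
```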

3.5. Clustering

The clustering process is divided into three steps: selecting cluster centers, selecting isolated points, and classification. First, points whose ρ_i is greater than the threshold ρ and whose δ_i is greater than the threshold δ are selected as cluster centers. Second, points whose ρ_i is less than the threshold ρ but whose δ_i is greater than the threshold δ are selected as isolated points. Lastly, every remaining point is assigned to its nearest cluster center.
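These three steps can be sketched sequentially (the thresholds and toy data are invented for illustration; the real system performs this over RDDs):

```python
import math

def cluster(points, rho, delta, rho_th, delta_th):
    """Centers: rho_i > rho_th and delta_i > delta_th.
    Isolated: rho_i < rho_th and delta_i > delta_th (labelled -1).
    Remaining points join the nearest cluster center."""
    idx = range(len(points))
    centers = [i for i in idx if rho[i] > rho_th and delta[i] > delta_th]
    isolated = {i for i in idx if rho[i] < rho_th and delta[i] > delta_th}
    labels = {}
    for i in idx:
        if i in isolated:
            labels[i] = -1
        else:
            labels[i] = min(centers, key=lambda c: math.dist(points[i], points[c]))
    return centers, sorted(isolated), labels

points = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0), (5.5, 0.0), (10.0, 10.0)]
centers, isolated, labels = cluster(points, rho=[3, 2, 3, 2, 0],
                                    delta=[10, 0.5, 5, 0.5, 7],
                                    rho_th=1, delta_th=2)
print(centers, isolated, labels)
# -> [0, 2] [4] {0: 0, 1: 0, 2: 2, 3: 2, 4: -1}
```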

4. Experiment

4.1. Experimental environment

The Spark cluster includes one master and six slaves. Table 1 describes the hardware configuration of the cluster.

Table 1. Hardware configuration.


Machine Name Role Memory CPU
Hadoop-server Master 16 GB 2 cores
Hadoop1~6 Slave 32 GB 2 cores

4.2. Experimental data

To evaluate the parallel density peaks clustering algorithm empirically, two data sets of different sizes are used. The first data set is provided by reference 7, and the second is news-domain text data downloaded from DataTang1. The news data set contains 10 topics and 47,956 texts. We preprocess the text data with a Chinese word segmentation system9.

4.3. Experimental results and analysis

Fig. 8 The running time of Spark and MapReduce. Fig. 9 Decision graph.

First, on the data set used in reference 7, our experimental result is consistent with that of reference 7, which validates our system. Fig. 8 compares the running time of Spark and MapReduce on this data set: the running time of Spark is almost 1/10 of that of MapReduce.

1 http://www.datatang.com/data/43922.

Fig. 9 shows the results of our system on the second data set. When the local density threshold ρ is 3600 and the distance threshold δ is 11, the number of cluster centers is 10, consistent with the 10 topics in the data set.
Fig. 10 shows the trend of running time for different numbers of nodes with a fixed amount of data: the running time is longest with a single node and decreases as the number of nodes increases. Fig. 11 shows the trend of running time for different amounts of data with a fixed number of nodes: the running time increases almost linearly. These trends demonstrate that our system has good expansibility and scalability.

Fig. 10 The trend of running time under different number of nodes. Fig. 11 The trend of running time under different amount of data.

Conclusions

To reduce the high computational cost of the density peaks clustering algorithm, we propose an efficient distributed density peaks clustering algorithm using GraphX on Spark. In this paper, we demonstrate the effectiveness of the method on two different data sets, and the experimental results show that our system improves performance significantly (up to 10x) compared to a MapReduce implementation. We also evaluate the expansibility and scalability of our system. Future work will study a method for adaptive thresholds, instead of fixing the thresholds ρ and δ during Spark initialization.

Acknowledgements

This work is supported by the Shaanxi science and technology innovation project foundation (2016PTJS3-02 and 2016PTJS3-05).

References

1. Xu Rui, Wunsch D II. Survey of clustering algorithms. IEEE Trans on Neural Networks, 2005, 16(3):645-678.
2. Kaufman L, Rousseeuw PJ. Clustering by means of medoids// Statistical Data Analysis Based on the L1 Norm and Related Methods. North-Holland: North-Holland Press, 1987: 405-416.
3. MacQueen J. Some methods for classification and analysis of multivariate observations[C]// Proc of the 5th Berkeley Symp on Mathematical
Statistics and Probability. Berkeley: University of California Press, 1967: 281-297.
4. Huang Xing, Liu Xiaoqing, Cao Buqing, Tang Mingdong and Liu Jianxun. MSCA: Mashup Service Clustering Approach Integrating K-Means and AGNES Algorithms. Journal of Chinese Computer System, 2015, 36(11):2492-2497.
5. Ester M, Kriegel HP, Sander J and Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise// Proc of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD). Portland: AAAI Press, 1996: 226-231.
6. Wang Mingkun, Yuan Shaoguang, Zhu Yongli and Wang Dewen. Real-time Clustering for Massive Data Using Storm. Journal of Chinese
Computer Applications, 2014, 34(11):3078-3081.
7. Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science, 2014, 344(6191):1492-1496.
8. Jacobs, S. A. and A. Dagnino (2016). Large-Scale Industrial Alarm Reduction and Critical Events Mining Using Graph Analytics on Spark.
2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService).
9. Du Liping, Li Xiaoge, Yu, Gen, Liu Chunli and Liu Rui. New word detection based on an improved PMI algorithm for enhancing Chinese
segmentation system. Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, 52(1):35-40.