
ScienceDirect

Procedia Computer Science 107 (2017) 442 – 447

Based on Spark

Rui Liu (a), Xiaoge Li (a,*), Liping Du (a), Shuting Zhi (a), Mian Wei (b)

(a) School of Computing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
(b) Tulane University, New Orleans, LA 70118, USA

* Corresponding author: lixg@xupt.edu.cn; Tel.: 15055114114

Abstract

Clustering algorithms are widely used in data mining. They attempt to partition elements into several clusters such that elements in the same cluster are similar to each other while elements belonging to different clusters are dissimilar. The recently published density peaks clustering algorithm overcomes the disadvantage of distance-based algorithms, which can only find clusters of nearly-circular shape: it can discover clusters of arbitrary shape and is insensitive to noise. However, it needs to calculate the distances between all pairs of data points and therefore does not scale to big data. To reduce the computational cost of the algorithm, we propose an efficient distributed density peaks clustering algorithm based on Spark's GraphX. This paper demonstrates the effectiveness of the method on two different data sets. The experimental results show that our system improves performance significantly (up to 10x) compared to a MapReduce implementation. We also evaluate the expansibility and scalability of our system.

1. Introduction

Clustering analysis is an important technique in machine learning and data mining. Clustering analysis [1] divides elements into several clusters such that elements in the same cluster are similar to each other while elements belonging to different clusters are dissimilar. At present there are many clustering algorithms, such as partition-based methods (e.g. k-medoids [2], k-means [3]), hierarchical methods (e.g. Agglomerative Nesting (AGNES) [4]), density-based methods (e.g. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [5]), grid-based methods (e.g. the Grid-Clustering algorithm for High-dimensional very Large spatial databases (GCHL) [6]) and probability-model-based methods. In 2014, a paper on the density peaks clustering algorithm

1877-0509 © 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license

(http://creativecommons.org/licenses/by-nc-nd/4.0/).

Peer-review under responsibility of the scientific committee of the 7th International Congress of Information and Communication Technology

doi:10.1016/j.procs.2017.03.138


was published in Science magazine [7]. The core of the algorithm is that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher density [7].

In this paper, we present a parallel implementation of a density peaks clustering system using GraphX based on Spark. We study the effectiveness of the method and evaluate the running time under different numbers of nodes with the same amount of data, and under different amounts of data with the same number of nodes. Finally, we compare the running time of Spark and MapReduce to see which is better.

The rest of this paper is organized as follows. In Section 2, we review the density peaks clustering algorithm and the Spark RDD model. In Section 3, we introduce our parallel density peaks clustering system based on Spark. Section 4 provides the details of our experiment and analyzes the results in depth. Finally, in the Conclusions section we summarize our contributions and indicate directions for future research.

2. Related works

This section reviews the density peaks clustering algorithm and introduces Spark RDD model.

The kernel of the density peaks clustering algorithm is to compute two values for each point i: the local density \rho_i and the distance \delta_i from points of higher density. For point i, the local density \rho_i is defined as:

\rho_i = \sum_j \chi(d_{ij} - d_c)    (1)

where \chi(x) = 1 if x < 0 and \chi(x) = 0 otherwise, d_{ij} is the distance between point i and point j, and d_c is a cutoff distance. Typically, \rho_i equals the number of points closer to point i than d_c. Remarkably, the algorithm is robust with respect to the choice of d_c for large data sets, since it is sensitive only to the relative magnitude of \rho_i across different points.
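As a point of reference, formula (1) can be sketched in a few lines of serial Python (an illustrative toy implementation, not the paper's distributed one; the Euclidean metric and the sample points are assumptions):

```python
import math

def local_density(points, d_c):
    """Formula (1): rho_i = number of points j != i with d_ij < d_c."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) < d_c)
        for i, p in enumerate(points)
    ]

points = [(0, 0), (0.5, 0), (0, 0.5), (5, 5)]
print(local_density(points, d_c=1.0))  # → [2, 2, 2, 0]
```

The three nearby points each count two neighbors inside d_c, while the far point counts none.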

\delta_i is obtained as the minimum distance between point i and any point of higher density:

\delta_i = \min_{j: \rho_j > \rho_i} d_{ij}    (2)

For the point i with the highest density, we take \delta_i = \max_j d_{ij}. \delta_i is much larger than the typical nearest-neighbor distance only for points that are global or local maxima of the density. Therefore, cluster centers are recognized as points for which the value of \delta_i is anomalously large.
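Formula (2) can be sketched the same way (serial and illustrative; the local densities are assumed to be given):

```python
import math

def delta_distances(points, rho):
    """Formula (2): delta_i = min distance to any point of higher density;
    the globally densest point gets delta_i = max_j d_ij instead."""
    delta = []
    for i, p in enumerate(points):
        higher = [math.dist(p, q) for j, q in enumerate(points) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:  # no denser point exists: this is the density maximum
            delta.append(max(math.dist(p, q) for j, q in enumerate(points) if j != i))
    return delta

points = [(0, 0), (1, 0), (4, 4)]
rho = [2, 1, 0]                      # assumed local densities
print(delta_distances(points, rho))  # the densest point gets the max distance
```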

Fig. 1. Point distribution. Fig. 2. Decision graph for the data in Fig. 1.

For each point i, \rho_i and \delta_i can be plotted in a two-dimensional decision graph. For example, Fig. 1 shows 28 points embedded in a two-dimensional space; points 1 and 10 are the density maxima, i.e. points 1 and 10 are cluster centers. Fig. 2 shows \rho_i and \delta_i for each point i in a decision graph. The values of \delta_9 and \delta_10 are very different, while the values of \rho_9 and \rho_10 are very similar. In fact, point 9 belongs to the cluster of point 1, while point 10 is the center of another cluster. Hence, only points with a high \delta and a relatively high \rho are cluster centers. Points 26, 27 and 28 are isolated because they have a relatively high \delta but a low \rho.

Spark is a fast and general engine for large-scale data processing. All Spark operations are based on resilient distributed datasets (RDDs), a fault-tolerant, parallel data structure. RDDs also offer a rich set of operations for processing data sets. In general, there are several common models for data processing, including iterative algorithms, relational queries, MapReduce and stream processing. For example, Hadoop MapReduce is based on the MapReduce model, and Storm is based on the stream processing model. The RDD abstraction combines these four models, so that Spark can be applied to a wide variety of big data processing tasks.

RDDs support persistence and partitioning, which users can control through the persist and partitionBy functions. The partitioning and parallel computing capabilities of RDDs enable Spark to make better use of scalable hardware resources. Combining partitioning with persistence makes processing massive data even more efficient.

RDDs have two types of operations: transformations and actions. No matter how many transformations have been applied, an RDD is not actually computed; computation is only triggered when an action is performed. In the internal implementation of RDDs, the underlying interface is based on iterators, which makes data access more efficient and avoids the memory consumption of materializing a large number of intermediate results.
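The lazy-transformation/eager-action contract can be mimicked outside Spark with plain Python generators (my_map and collect are hypothetical stand-ins for illustration, not Spark APIs):

```python
# Minimal sketch of lazy transformations vs. eager actions,
# mimicking the RDD contract with plain Python generators.
log = []

def my_map(data, f):
    # Transformation: returns a lazy pipeline, computes nothing yet.
    return (log.append(("map", x)) or f(x) for x in data)

def collect(data):
    # Action: forces the whole pipeline to execute.
    return list(data)

pipeline = my_map(range(4), lambda x: x * 10)
assert log == []            # nothing computed after the transformation
result = collect(pipeline)  # the action triggers computation
print(result)               # → [0, 10, 20, 30]
```

As in Spark, work is deferred until the action, so intermediate results need never be fully materialized.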

3. Parallel density peaks clustering system

Fig. 3 outlines the architecture of our parallel density peaks clustering system. Firstly, Spark is initialized, which includes setting the threshold on the local density \rho and the threshold on the distance \delta from points of higher density. Secondly, vertex and edge data stored on HDFS are imported into a vertex RDD and an edge RDD respectively, and the distance of each edge is computed. Thirdly, the vertex RDD and edge RDD are combined into a Graph in GraphX [8], and the truncation distance is then calculated from the generated Graph. Next, for each point i, the local density \rho_i and the distance \delta_i from points of higher density are computed. Lastly, clustering is performed according to the local density \rho_i and the distance \delta_i.

Building the graph involves three steps. Firstly, vertex and edge data stored on HDFS or another file system are imported into a vertex RDD and an edge RDD respectively, and the initial value of each edge is set to a constant. Secondly, the distance of each edge is computed with a distance measure and the value of each edge is updated to that distance. Lastly, the vertex RDD and edge RDD are combined to form a Graph in GraphX. For example, suppose there is a vertex set {1, 2, 3, 4, 5} and an edge set {(1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5)}. When the vertex set and the edge set are imported, the initial value of each edge is set to 1, as shown in Fig. 4. When the distance of each edge is computed, the value of each edge is updated to that distance, as shown in Fig. 5.
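The three steps above can be sketched in plain Python (a serial stand-in for the RDD/GraphX version; the 2-D coordinates are invented for illustration, since the paper's figures are not reproduced here):

```python
import math

# Hypothetical 2-D coordinates for vertices 1..5 (not from the paper's figures).
coords = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.0, 1.0), 4: (3.0, 3.0), 5: (3.0, 4.0)}
edges = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

# Step 1: import vertices and edges, each edge initialized to the constant 1.
graph = {e: 1.0 for e in edges}

# Step 2: update each edge value with the Euclidean distance of its endpoints.
for (i, j) in edges:
    (xi, yi), (xj, yj) = coords[i], coords[j]
    graph[(i, j)] = math.hypot(xi - xj, yi - yj)

print(graph[(1, 2)], graph[(4, 5)])  # → 1.0 1.0
```

Step 3 (combining the vertex and edge RDDs into a Graph) has no serial analogue; in GraphX it is the Graph(vertexRDD, edgeRDD) construction.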

To reduce the computation load, the truncation distance is calculated before the local density \rho_i. According to reference 1, the truncation distance is selected at 98%~99%.
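As a sketch of one plausible reading of this rule, d_c can be taken at a chosen position of the sorted pairwise distances (the exact convention is not spelled out in the text, so the position is left as a parameter):

```python
import itertools
import math

def cutoff_distance(points, position=0.98):
    """Pick d_c at a given position (e.g. 98%) of the sorted pairwise distances."""
    dists = sorted(math.dist(p, q) for p, q in itertools.combinations(points, 2))
    return dists[min(int(position * len(dists)), len(dists) - 1)]

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
d_c = cutoff_distance(pts, 0.5)  # median pairwise distance for this toy set
print(d_c)
```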

The local density \rho_i of each vertex is then calculated by formula (1). Fig. 6 shows the local density \rho_i of each vertex in Fig. 5; for example, the local density \rho_1 of vertex 1 is 4.

Fig. 6 The local density. Fig. 7 The local density and the distance from points of higher density.

The distance \delta_i from points of higher density is then calculated for each vertex by formula (2). The method for calculating \delta_i with GraphX on Spark is as follows: firstly, for each edge, if the local density \rho_source of the source vertex is less than the local density \rho_target of the target vertex, a message is sent to the source vertex; otherwise a message is sent to the target vertex. Secondly, all the messages received by each vertex are merged. Lastly, for each vertex, the minimum edge length among all received messages is found, and the \delta_i of that vertex is set to this minimum edge length. Fig. 7 shows \rho_i and \delta_i for each vertex in Fig. 6.
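The send/merge/min steps can be simulated without Spark (plain Python dictionaries standing in for GraphX's message passing; the densities and edge lengths below are invented, and a complete graph is assumed so every pair of points is connected):

```python
# Simulate the GraphX message-passing scheme for delta on a tiny complete graph.
rho = {1: 4, 2: 3, 3: 2}                      # assumed local densities
edge_len = {(1, 2): 2.0, (1, 3): 5.0, (2, 3): 1.5}

inbox = {v: [] for v in rho}
for (src, dst), length in edge_len.items():
    # Send the edge length to the endpoint with the LOWER density.
    target = src if rho[src] < rho[dst] else dst
    inbox[target].append(length)

# Merge: delta is the minimum received edge length (the densest vertex gets none).
delta = {v: min(msgs) for v, msgs in inbox.items() if msgs}
print(delta)  # → {2: 2.0, 3: 1.5}
```

This matches formula (2): vertex 2's only denser neighbor is vertex 1 (distance 2.0), and vertex 3's nearest denser neighbor is vertex 2 (distance 1.5).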

3.5. Clustering

The clustering process is divided into three steps: selecting cluster centers, selecting isolated points, and classification. Firstly, points whose \rho_i is greater than the threshold \rho and whose \delta_i is greater than the threshold \delta are selected as cluster centers. Secondly, points whose \rho_i is less than the threshold \rho and whose \delta_i is greater than the threshold \delta are selected as isolated points. Lastly, each remaining point is assigned to the nearest cluster center.
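The three clustering steps can be sketched as follows (serial and illustrative; the thresholds, densities and coordinates are invented):

```python
import math

def cluster(points, rho, delta, rho_th, delta_th):
    """Threshold-based center/outlier selection, then nearest-center assignment."""
    n = len(points)
    centers = [i for i in range(n) if rho[i] > rho_th and delta[i] > delta_th]
    outliers = [i for i in range(n) if rho[i] < rho_th and delta[i] > delta_th]
    labels = {}
    for i in range(n):
        if i in outliers:
            labels[i] = -1  # isolated point
        else:
            # assign to the nearest cluster center (centers map to themselves)
            labels[i] = min(centers, key=lambda c: math.dist(points[i], points[c]))
    return centers, outliers, labels

points = [(0, 0), (0.2, 0), (5, 5), (5.2, 5), (10, -10)]
rho = [3, 2, 3, 2, 0]       # assumed local densities
delta = [9.0, 0.2, 7.0, 0.2, 12.0]
centers, outliers, labels = cluster(points, rho, delta, rho_th=1, delta_th=5)
print(centers, outliers)  # → [0, 2] [4]
```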

4. Experiment

The Spark cluster includes one master and six slaves. Table 1 describes the hardware configuration of the cluster.

Table 1. Hardware configuration of the cluster.

Machine name    Role    Memory  CPU
Hadoop-server   Master  16 GB   2 cores
Hadoop1~6       Slave   32 GB   2 cores

To conduct the empirical experiments for the parallel density peaks clustering algorithm, two separate data sets of different sizes are used. The first data set is provided by reference 7, and the second is news-domain text data downloaded from DataTang (footnote 1). The news data set contains 10 topics and 47,956 texts. We preprocess the text data using a Chinese word segmentation system [9].

Fig. 8 The running time of Spark and MapReduce. Fig. 9 Decision graph.

Firstly, on the first data set, used in reference 7, our experimental result is consistent with the result of reference 7, which shows that our system is valid. Fig. 8 compares the running time of Spark and MapReduce on the first data set; the running time of Spark is almost 1/10 that of MapReduce.

Footnote 1: http://www.datatang.com/data/43922.


Fig. 9 shows the results of our system on the second data set. When the local density threshold \rho is 3600 and the distance threshold \delta is 11, the number of cluster centers is 10. This result is consistent with the second data set containing 10 topics.

Fig. 10 shows the trend of running time under different numbers of nodes with the same amount of data. The running time is longest with only one node and decreases as the number of nodes increases. Fig. 11 shows the trend of running time under different amounts of data with the same number of nodes; the running time increases almost linearly. These trends show that our system has good expansibility and scalability.

Fig. 10 The trend of running time under different numbers of nodes. Fig. 11 The trend of running time under different amounts of data.

Conclusions

To reduce the high computational cost of the density peaks clustering algorithm, we propose an efficient distributed density peaks clustering algorithm using GraphX based on Spark. In this paper we demonstrate the effectiveness of the method on two different data sets, and the experimental results show that our system improves performance significantly (up to 10x) compared to a MapReduce implementation. We also evaluate the expansibility and scalability of our system. Future work is to study a method for adaptive thresholds, instead of setting fixed thresholds \rho and \delta when initializing Spark.

Acknowledgements

This work is supported by the Shaanxi science and technology innovation project foundation (2016PTJS3-02 and 2016PTJS3-05).

References

1. Xu Rui, Wunsch D II. Survey of clustering algorithms. IEEE Trans on Neural Networks, 2005, 16(3):645-678.

2. Kaufman L, Rousseeuw P. Clustering by means of medoids. In: Statistical Data Analysis Based on the L1 Norm and Related Methods. North-Holland: North-Holland Press, 1987: 405-416.

3. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proc of the 5th Berkeley Symp on Mathematical Statistics and Probability. Berkeley: University of California Press, 1967: 281-297.

4. Huang Xing, Liu Xiaoqing, Cao Buqing, Tang Mingdong and Liu Jianxun. MSCA: Mashup Service Clustering Approach Integrating K-Means and Agnes Algorithms. Journal of Chinese Computer System, 2015, 36(11):2492-2497.

5. Huang Xing, Liu Xiaoqing, Cao Buqing, Tang Mingdong and Liu Jianxun. MSCA: Mashup Service Clustering Approach Integrating K-Means and Agnes Algorithms. Journal of Chinese Computer System, 2015, 36(11):2492-2497.

6. Wang Mingkun, Yuan Shaoguang, Zhu Yongli and Wang Dewen. Real-time Clustering for Massive Data Using Storm. Journal of Chinese Computer Applications, 2014, 34(11):3078-3081.

7. Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science, 2014, 344(6191):1492-1496.

8. Jacobs S A, Dagnino A. Large-Scale Industrial Alarm Reduction and Critical Events Mining Using Graph Analytics on Spark. In: 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), 2016.

9. Du Liping, Li Xiaoge, Yu Gen, Liu Chunli and Liu Rui. New word detection based on an improved PMI algorithm for enhancing Chinese segmentation system. Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, 52(1):35-40.
