
2014 International Conference on Advances in Electronics, Computers and Communications (ICAECC)

Distributed In-memory Cluster Computing Approach in Scala for Solving Graph Data Applications
Johnpaul C I, Neetha Susan Thampi
Department of Computer Science and Engineering
Amrita School Of Engineering, Coimbatore, India
Email: johnpaul.ci@gmail.com, netasusan@gmail.com
Abstract—Large graph analysis is one of the significant applications of distributed computing frameworks. Distributed computing applications are solved by developing programs over different types of established distributed computing frameworks. Since graph analysis and prediction is one of the new trends in data analytics, designing problems on an in-memory cluster framework that consumes graph data-sets has a significant role in distributed computing. Traditional disk-based distributed computing frameworks like Hadoop are confined to a specific group of problems in data analytics. Utilizing the memory of the cluster, apart from the disk-based storage space, contributes significantly to reducing latency and increasing speedup. This work describes the significance of the Spark framework in solving graph-related problems in a distributed manner, using the page ranking algorithm and a proteome-protein annotation method in Scala.
Key Words: Apache Hadoop, Hama, Spark, Pregel, Network flow, Fault tolerance, Distributed computing, Scala, Cluster computing.

I. INTRODUCTION

Cluster computing is becoming one of the necessities of the day-to-day computing environment[1]. Since resources like memory, disk, and other components are becoming less expensive, large cluster machines are being established with various prototypes for data representation, computation process models, programming paradigms, node communication, and so on. Cluster computing frameworks draw on everything from base theoretical distributed system concepts to newly developed modules of distributed databases[2]. This work explores an in-memory distributed framework and finds a solution to a particular bio-informatics problem by applying an abstraction from graph theory on the in-memory framework. It involves the use of the functional flow algorithm, which takes its elements from the network flow concepts of graph theory[3][4]. The whole work is done in Scala[5], which emerged recently and has been widely accepted by the programming community.
In distributed computing there are some key challenges that need to be addressed: data representation, data distribution, the distributed computation model, communication complexity, and ease of use[2]. The first three attributes need deeper insight; in fact, most current research in the distributed computing world is based on how effectively these three Ds can be handled. Due to the emergence of a large number of database-related organizations, social networks, forums, and discussion groups, there is a large influx of data across the network.

The rest of the paper is organized as follows: Section II contains a literature survey covering two frameworks of distributed computing and related concepts. Section III presents the proposed work on the Spark-Bagel distributed graph processing system, taking the Google page rank algorithm and a scenario from bio-informatics. Section IV discusses results and analysis, with comparison graphs of various experiments done on the Spark cluster framework. Section V describes the conclusion and scope for future work.
II. RELATED WORK

The background study discusses the existing distributed system frameworks widely used in big-data management. The two basic distributed system frameworks explained in this section give an insight into the processing that takes place on each data module over the framework.
A. Graph Theory and Related Concepts
Graph theory and its related concepts are of prime importance in designing a graph database for big data[6][3]. Elements of interest from graph theory include the in-degree and out-degree of nodes, the number of cut-edges and cut-vertices, shortest path algorithms, network flow analysis, graph traversal methods, and more. Pregel, Apache Hama, Apache Giraph, HipG, and Signal/Collect are some of the graph processing frameworks that process a massive graph: starting from an initial set of vertices, one can move from vertex to vertex and propagate the execution of the algorithm across the set of active vertices[7][8].
B. Hadoop Distributed Framework
Hadoop is an elegant map-reduce programming platform which forms the basis of Hadoop frameworks, providing high-performance computation on a network of inexpensive, fully configured cluster machines[9]. Hadoop frameworks rest on a distributed file-system called HDFS, which provides reliable data storage over the distributed architecture. Data stored in HDFS is available to all worker nodes for processing locally[9][10].
C. Spark: The In-memory Cluster Computing Framework
One important factor to be taken into account where a large cluster is concerned is disk access. As the number of worker nodes in the cluster increases, HDFS disk access also increases, and so does the communication overhead in the reduce phase. All of this has a negative impact on the performance of the framework. Research on this problem gave rise to the in-memory cluster computing framework called Spark[11][12]. The most attractive feature of the Spark framework is that a user job can load data into memory and query it repeatedly with an iterative algorithm, which is much faster than disk-based distributed systems like Hadoop map-reduce, as the sketch below illustrates.
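A minimal, hedged sketch of this load-once, query-repeatedly pattern (the master setting, HDFS path, and "src dst" edge-line format are illustrative assumptions, not taken from the paper's source):

    import org.apache.spark.{SparkConf, SparkContext}

    object CacheDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("cache-demo").setMaster("local[2]"))
        // Load once and pin the partitions in cluster memory; every later
        // action reuses the cached copy instead of re-reading the disk.
        val edges = sc.textFile("hdfs:///data/graph.txt").cache()
        val total = edges.count()        // first action materializes the cache
        val loops = edges.filter { l =>
          val p = l.split("\\s+")
          p.length > 1 && p(0) == p(1)   // self-loop check
        }.count()                        // second query hits memory, not HDFS
        println(s"edges=$total selfLoops=$loops")
        sc.stop()
      }
    }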
III. PROPOSED WORK

This section details the work done on the proposed system over the Spark-Bagel distributed framework. It starts from the basic modeling of problems onto the Spark-Bagel programming paradigm, with the Google page ranking algorithm as the first step in programming over the Spark-Bagel module in Scala[13][11]. As a preliminary step, a random graph generation algorithm was also developed to generate the graph data-sets[3][4]. Algorithm 1 explains the steps for generating a random graphic sequence.
Algorithm 1 Random Graph Generation Algorithm
Data: Number of vertices, average degree
Result: Random graph
initialization: edgeMax ← (vertexNum × averageDegree) / 2; the input is valid if edgeMax mod 2 == 0 and averageDegree < vertexNum − 1.
if input is valid then
    generate a random sequence of numbers whose sum equals edgeMax.
end
if Havel-Hakimi(generated random sequence) == true then
    terminate.
else
    go to step 3 and repeat until a valid sequence is generated.
end
while degree sequences are not zero do
    take two vertices at random;
    add an edge between them by updating the corresponding indexes;
    reduce the degree sequence of each chosen vertex by one.
end
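The Havel-Hakimi test used by Algorithm 1 can be written compactly in Scala. The following is a sketch (the function name is ours) that checks whether a degree sequence is realizable as a simple graph:

    // Havel-Hakimi: repeatedly remove the largest degree d and subtract 1
    // from the next d entries; the sequence is graphical iff this ends in
    // all zeros without producing a negative degree.
    def isGraphical(seq: List[Int]): Boolean = {
      val s = seq.sortBy(-_)                       // non-increasing order
      if (s.isEmpty || s.head == 0) s.forall(_ == 0)
      else {
        val d = s.head
        val rest = s.tail
        if (d > rest.length) false
        else {
          val reduced = rest.take(d).map(_ - 1) ++ rest.drop(d)
          if (reduced.exists(_ < 0)) false else isGraphical(reduced)
        }
      }
    }

    isGraphical(List(3, 3, 2, 2, 1, 1))   // true: realizable
    isGraphical(List(4, 1, 1, 1))         // false: not realizable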

A. Problem Modeling Towards Spark-Bagel Programming

Spark-Bagel programming is exclusively meant for graph processing applications. Iterative algorithms can generally be modeled onto the Spark framework, provided the applications maintain certain disciplines prescribed by the Spark-Bagel framework. The iterative part of the algorithm is handled by the framework itself; the user need not worry about communication or the distribution of the graph over the cluster systems. The following are the tasks a user has to take into account when a problem is modeled onto the Spark-Bagel module.

• Identify the vertex abstraction from the given problem and the value of the vertex. Define the physical instance that can be converted to a vertex with a value.

• Model the edges according to target vertices.

• Find the apt parameters to be sent as messages to neighboring vertices in the next super-step.

• Find the computation logic for updating the value of a vertex. This is the core part of the Bagel module, where the rest of the decisions rest on this value.

• Make classes for vertex, edge, and message with the appropriate variables, which form the building blocks of the whole program, and cache the data in an object (a sketch of such classes follows this list).

• Formulate the computation function over the vertices and execute the Spark-Bagel run function.
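As a hedged sketch of the class-building step (the field names are illustrative, not taken from the paper's source; it assumes Bagel's Vertex trait exposes an active flag and its Message[K] trait a targetId, as in Spark's bundled Bagel examples):

    import org.apache.spark.bagel.{Vertex, Message}

    // Vertex: id, current rank value, outgoing edges, and the active flag
    // consulted by the framework at each super-step.
    class PRVertex(val id: String,
                   val rank: Double,
                   val outEdges: Array[String],
                   val active: Boolean) extends Vertex with Serializable

    // Message: destination vertex and the rank share carried to it.
    class PRMessage(val targetId: String,
                    val rankShare: Double) extends Message[String] with Serializable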

B. Page Rank Algorithm in a Nutshell


Algorithm 2 Page-Rank Algorithm
Data: Pages, output links
Result: Page rank
initialization;
while Convergence(page-ranks) != convergence value do
    get the number of outgoing links of the page;
    Current Cash = Σ_{i=0}^{#inlinks} Rank Share_i;
    Rank Share = Current Cash / #outlinks;
    History = History + Current Cash;
    Current Cash = 0;
    (this happens on every page read operation)
end
Sort(pages according to page-rank)
The building blocks of a Spark-Bagel program are as explained below. The compute function consumes three parameters: the vertex object, the message object, and the super-step variable. It returns an object packed into a pair of vertex id and outgoing messages. Each message contains the rank share from the incoming links[14]; hence the message sum is calculated with a program statement. When designing the computational logic, the programmer has to keep in mind that it operates on a single vertex; the same function runs in parallel on the other active vertices of the data-set. The iteration variable is handled by the Spark framework and incremented at the end of each super-step; it can be used as a control variable in the vertices. During the zeroth step, all the vertices are active. Whenever a vertex becomes inactive it stays in a dormant state and does not send any messages. The computation function over each vertex is designed in such a way that when the page-ranks reach the convergence factor, all the vertices become inactive[15].
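A hedged sketch of such a compute function, reusing the PRVertex and PRMessage classes sketched earlier (the convergence threshold and the step budget are illustrative assumptions; the (vertex, messages, superstep) → (vertex, messages) shape mirrors what Bagel's run method expects):

    def compute(self: PRVertex,
                msgs: Option[Array[PRMessage]],
                superstep: Int): (PRVertex, Array[PRMessage]) = {
      // Sum of the rank shares that arrived along incoming links.
      val msgSum = msgs.map(_.map(_.rankShare).sum).getOrElse(0.0)
      // Step 0 keeps the initial cash of 1.0; later steps adopt the inflow,
      // as in Algorithm 2.
      val newRank = if (superstep == 0) self.rank else msgSum
      // Go dormant once the rank stops moving, or after a step budget.
      val halt = (superstep > 0 && math.abs(newRank - self.rank) < 1e-4) ||
                 superstep >= 30
      val out =
        if (halt || self.outEdges.isEmpty) Array.empty[PRMessage]
        else self.outEdges.map(d => new PRMessage(d, newRank / self.outEdges.length))
      (new PRVertex(self.id, newRank, self.outEdges, !halt), out)
    }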
C. Proteome-Protein Function Prediction of Unannotated Nodes of Yeast Proteome
The work explained in this section is an approach to protein annotation using the Spark-Bagel graph processing module. The output of this module is the number of unknown proteins that can be identified as a result of their interaction with known proteins; the process as a whole is called proteome-protein annotation. Protein-protein interaction maps give a graphical representation of the proteins' interaction with each other. Many well-known bio-informatics labs generate such data-sets in a graph model with the labels of known proteins. The data-set of interest in this work is from yeast, having 3112 nodes and 25162 edges.
Functional annotation is the process of identifying unknown proteins based on the known proteins. To find how much impact is created on the unknown proteins, and to know the connection diagrams of the proteins, protein-protein interaction maps are used. How can a programmer calculate or model the impact of known proteins over the unannotated ones? It is done with the help of a score value assigned to the known proteins. The score value is found through various chemical experiments that determine how important the protein is in the proteome. Such experiments will not identify all the proteins in the proteome; these proteins are categorized as unannotated. The goal of this work on Spark is to identify to what extent the framework can identify the unknown proteins[16].


D. Problem Modeling of the Proteome-Protein Annotation Method

The functional flow algorithm for annotating protein function in interaction networks uses the idea of network flow from graph theory[4][16]. Even though it does not follow the network flow concepts strictly, it borrows some abstractions from them. In the protein-protein interaction map of yeast, each protein of known functional annotation is treated as a source of functional flow. From these source proteins, viewed as a high-potential region, the flow is propagated to the unannotated nodes using the edges between them. The flow is restricted by certain conditions, namely the edges and the distance between nodes: proteins farthest from an annotated protein get a lower score than nearby ones, and the flow is limited by the capacity of the edges between the nodes.
E. Functional Flow Algorithm: Mathematical Modeling

Reservoir functions:

    R_0^a(u) = \begin{cases} \infty, & \text{if } u \text{ is annotated} \\ 0, & \text{otherwise} \end{cases}    (1)

    R_t^a(u) = R_{t-1}^a(u) + \sum_{v \in N(u)} g_t^a(v, u)    (2)

Flow function:

    g_t^a(u, v) = \begin{cases} 0, & \text{if } R_{t-1}^a(u) \le R_{t-1}^a(v) \\ \min\!\left( w_{u,v},\; R_{t-1}^a(u) \cdot \dfrac{w_{u,v}}{\sum_{(u,y) \in E} w_{u,y}} \right), & \text{otherwise} \end{cases}    (3)
The term g_0^a(u, v) is the initial flow, which is initialized to zero in the initial condition. R_0^a(u) is the initial reservoir value: annotated nodes are assigned ∞ and unannotated nodes zero. g_t^a(u, v) is the flow generated from the reservoir of node u to node v during an iteration t, and w_{u,y} is the weight of the edge between node u and node y. The flow is based on the condition that, during an iteration t, if the reservoir value of node u is less than that of node v, the flow is zero; otherwise it is the minimum of two elements, the weight of the edge between u and v and the proportionate share from the reservoir of u. The final reservoir value of node u, R_t^a(u), is the sum of the previous reservoir value R_{t-1}^a(u) and the aggregate of the flows into that node from its neighbors. The computation function terminates after a predetermined number of iteration steps.

F. Functional Flow Algorithm: Spark-Bagel Implementation

As with the page ranking implementation, this implementation can be divided into three parts: the initialization and caching of the data-set into memory, the computation function, and the Spark-Bagel run function that executes the computation function on every vertex. The pictorial representation of the proteome-protein annotation can be viewed in Fig. 1.

Fig. 1: Proteome-Protein Functional Flow Algorithm, showing the flow of messages and score calculations
The initialization and caching section covers the primary steps in Spark-Bagel programming, where the data-sets are prepared for processing; this includes the unique feature of caching the data-set into the cluster memory[17]. The physical interpretation of the yeast data-set is as follows: each record holds the node id, the node label, and the number of destination nodes, followed by (destination node, weight) pairs, and this continues for every node in the yeast protein interaction graph. This is the data-set passed into the Spark context for creating the RDD[11]. The nodes and destinations are separated and stored for creating the object, and the links and weights are stored separately to calculate the messages from the proper neighbor node; hence it is convenient to calculate the score share sent to the respective neighbors. This maps the links and corresponding weights to an edge and attaches the new edge to the respective node id, as in the hedged sketch below.
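A sketch of this loading step under stated assumptions (the file name, the exact field layout, and the use of the label "unknown" for unannotated proteins are ours; sc is an existing SparkContext):

    // Each record: node id, node label, then alternating (destination,
    // weight) pairs; cached in cluster memory as keyed vertices.
    case class ProteinVertex(id: String, label: String, score: Double,
                             edges: Array[(String, Double)], active: Boolean)

    val verts = sc.textFile("hdfs:///data/yeast-ppi.txt").map { line =>
      val t = line.trim.split("\\s+")
      val (id, label) = (t(0), t(1))
      val pairs = t.drop(2).grouped(2)
        .collect { case Array(d, w) => (d, w.toDouble) }.toArray
      // Annotated proteins seed the flow with a non-zero reservoir score.
      val score = if (label == "unknown") 0.0 else 1.0
      (id, ProteinVertex(id, label, score, pairs, active = true))
    }.cache()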
The message RDD is then initialized empty:

    val emptyMsgs = sc.parallelize(List[(String, GraphGenMyst)]())
After initializing the message array, the programmer has to remember that the messages flowing into a vertex are the score shares of the reservoir values of known proteins.

That concludes the initialization and caching section of the functional flow algorithm. The next section describes the core part of the program, the computation logic. The score-update expression of the functional flow algorithm in Scala is as follows:

    scoreIncrement += (
      if (self.score < insideMsg.score.toDouble)
        List(insideMsg.weight.toDouble,
             (insideMsg.score.toDouble * insideMsg.weight.toDouble) /
               insideMsg.weightSum).min
      else 0.0
    )
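Wrapping that increment, here is a hedged sketch of the full per-vertex computation, using the ProteinVertex case class from the loading sketch (the message fields mirror the snippet above; the five-step budget is an illustrative stand-in for the predetermined iteration count of Section III-E):

    case class FlowMsg(targetId: String, score: Double,
                       weight: Double, weightSum: Double)

    def flowCompute(self: ProteinVertex,
                    msgs: Option[Array[FlowMsg]],
                    superstep: Int): (ProteinVertex, Array[FlowMsg]) = {
      // Admissible inflow: flow only runs from fuller reservoirs, capped by
      // the edge weight and the sender's proportionate share (Eq. 3).
      val inflow = msgs.getOrElse(Array.empty[FlowMsg]).map { m =>
        if (self.score < m.score)
          math.min(m.weight, m.score * m.weight / m.weightSum)
        else 0.0
      }.sum
      val newScore = self.score + inflow       // top up the reservoir (Eq. 2)
      val wSum = self.edges.map(_._2).sum
      val halt = superstep >= 5                // predetermined iteration count
      val out =
        if (halt || wSum == 0.0) Array.empty[FlowMsg]
        else self.edges.map { case (dst, w) => FlowMsg(dst, newScore, w, wSum) }
      (self.copy(score = newScore, active = !halt), out)
    }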

G. Understanding the Architectural View of the Spark-Bagel Framework

Some questions arise naturally after obtaining the results on the Spark framework: What happens internally in the framework? What do the intermediate results look like? Do they match our mathematical calculation exactly? Does the framework obey all the properties it claims? With the help of the page-rank algorithm implemented on the Spark-Bagel framework, all of these questions can be answered. The graph representation file looks like: 1 4,6,7; 2 1,4,3,6; 3 5,1; 4 3,5,7,8; 5 2; 6 5,7,8; 7 3,5. The processing starts with the zeroth super-step, which is the beginning of message passing. Since all the vertices are initialized with a cash of one, the messages are of the form 1/#outlinks. Take, for instance, the entry for vertex 2, whose four target vertices are 1, 4, 3, and 6. Since in the zeroth super-step all vertices are initialized to a default cash of 1, the message sent from vertex 2 to each of the vertices 1, 4, 3, and 6 is 0.25. Similarly, the other nodes in the graph send and receive messages from their neighbors; a short worked sketch follows.
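The zeroth super-step for vertex 2 can be worked out in a couple of lines (a plain-Scala illustration of the numbers above, outside the framework):

    // Vertex 2 targets 1, 4, 3, 6 and starts with cash 1.0,
    // so each target receives 1.0 / 4 = 0.25.
    val targets = List("1", "4", "3", "6")
    val share   = 1.0 / targets.length          // 0.25
    val msgs    = targets.map(t => (t, share))  // List((1,0.25), (4,0.25), ...)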

IV. RESULTS AND ANALYSIS

The experiments with the cluster machines are divided into three sections. The first deals with the results of the proteome-protein annotation algorithm, in which a test file is fed to the results of the algorithm to check how many proteins can be predicted, labeled, or remain unknown; for these experiments, random files of 1000 proteins are selected at a time and checked against the result. The second section contains the results of random graph generation. The third deals with the results of experiments with the ranking algorithm over the generated random graphs.

A. Results of Proteome-protein Annotation

The graph shown in Fig. 2 contains the number of proteins identified from a set of random files of 1000 random proteins. The intuition is that the functional flow algorithm is executed over the original data-set; when the test file is given, the proteins in the base file are ranked according to the results of the functional flow algorithm. In all the graphs of the different files, the proportion of unknown proteins is small.

Fig. 2: Proteome-protein Annotation Results on Test Datasets

B. Results of Random Graph Generation

The execution time of random graph generation, shown in Fig. 3, grows exponentially; the average in-degree is taken as 12, which is not an appreciable trend where large graphs are concerned.

Fig. 3: Execution Time of Random Graph Generation Algorithm

C. Standalone Execution and Analysis

The test data-sets contain from 100 to 25000 nodes, with the number of edges varied in accordance with the number of nodes. The time taken by each graph data-set for the two partition settings is described in Table I, and Fig. 4 shows the job execution time for the test data. It can be seen that over a certain region the graph shows a constant rate of time; that is the region where the numbers of edges are similar. When the number of edges increases, the execution time also increases: each vertex has a set of incoming and outgoing edges, which has a direct effect on the result. The whole results are taken with partitions of 2 and 4.

TABLE I: Total Execution Time using Different Partitions on Spark Framework in Standalone Mode

Number of Vertices   Time (Partition 2)   Time (Partition 4)
100                  12.33                15.33
200                  12                   16.33
400                  13.33                15.33
500                  13.66                19.66
1000                 17.33                18.66
3000                 23.66                34.33
4000                 26.5                 36.7
6000                 28.3                 40.2
12000                44.5                 53.33
20000                70.33                88


Fig. 4: Total Execution Time using Different Partitions on Spark Framework in Standalone Mode

The term partition refers to a property of the framework by which the user can specify the number of graph partitions needed for the graph data-set. As the graph data-set is stored in key-value pairs (VertexId, Vertex Object), a partition count of 2 indicates that a machine can hold two sets of (key, value) pairs. A brief sketch of how the partition count is requested is given below.
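As a hedged illustration (the file path is hypothetical), the partition count can be requested when the data-set is first read; the second argument of textFile is the minimum number of splits, so the same graph can be re-run with 2, 3, or 4 partitions as in the experiments:

    val graph2 = sc.textFile("hdfs:///data/random-graph.txt", 2)  // 2 partitions
    val graph4 = sc.textFile("hdfs:///data/random-graph.txt", 4)  // 4 partitions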


D. Cluster Job Execution Results

The description of the resultant graphs of the cluster analysis, shown in Fig. 5, is as follows. Total execution time is taken for various numbers of partitions, viz. 2, 3, and 4. From the resultant graph it can be seen that a larger number of partitions gives a lower job execution time; the time taken by each graph data-set under each partition setting is shown in Table II. This is exactly the opposite of the single-machine case. The reason is that, in the case of a single machine with a small partition count, say 2, the graph data-set is divided into only two sets of key-value pairs, and hence the message-passing complexity is reduced; that result is taken using one machine. In the case of two machines, more partitions give more locality: since the partitions are accessed through a hashing function, more partitions help retrieve the key-value pairs without much searching in the RDD.

Fig. 5: Execution Time in Cluster Mode based on Number of Partitions
TABLE II: Total Execution Time in Different Partitions on Spark Framework in Cluster Mode

Number of Vertices   Time (#2) (s)   Time (#3) (s)   Time (#4) (s)
20000                10.3            9.2             6.33
25000                12.1            10.32           7.2
36000                18.2            15.3            12.2
45000                34.1            30.1            27.3
50000                43.1            41.3            37.3
75000                62.3            58.3            50.1
100000               80              73              66


Fig. 6: Execution Time based on Data Access from Different Locations
The graph shown in Fig. 6 indicates the variation of time based on the location of the data. The whole experiment was done based on the storage of the data: the data-set is stored both in a disk-based distributed system (HDFS) and in the in-memory cluster, a property of the Spark cluster whereby the data-set is stored as an RDD. The program is executed using data from both storage systems. The results shown in Table III establish that the execution time is lower for data stored in the Spark in-memory cluster.

TABLE III: Execution Time based on HDFS Data and RDD Cached Data

Number of Vertices   Time (HDFS)   Time (RDD Cached)
100                  17.5          12.5
200                  18            12
400                  18.2          14
500                  19            13.66
1000                 21.37         17.33
3000                 29.5          23.66
4000                 33            24.2
6000                 44            31.2
12000                67            46.66
20000                76            63

Fig. 7: Execution Time based on Number of Worker Nodes



Fig. 7 shows the variation of execution time based on the number of workers in the cluster. The first part of the graph shows a zig-zag pattern. The reason is that the execution time also depends on the number of edges: in the experiments, the number of edges was increased for smaller numbers of vertices. Even though the number of vertices is low, if the number of edges relative to it is increased, the execution time also increases, because the complexity of the program also depends on the message passing between the vertices.


V. CONCLUSION AND FUTURE SCOPE

Large graph processing is one of the most challenging areas of data analytics. Graph processing needs a large impulse from parallel programming to reduce the processing time. The world's largest search engine, Google, is itself a massive graph processing giant. When the number of nodes and edges increases to a very large extent, common iterative programs are not well suited for processing them and take more time to process such large graphs. This is the context in which researchers have to think of parallel programming frameworks, particularly graph-processing frameworks, which make our life simpler. In big data analytics there are established disk-based frameworks like Hadoop which process the data according to the applications where it is needed; for large graph processing, it would be more advantageous if a framework exists that helps handle the basic functionalities of graph processing in parallel.

The Spark framework establishes an in-memory cluster, where most iterative programs access a data-set from the cluster memory. This work explores how to utilize the Spark in-memory cluster technique to approach real-world problems with the help of two major instances. In fact, Spark proves to be a strong candidate for data analytics provided the data-set can be represented as a graph. Apart from the disk-based distributed system, the cluster memory is also used for storing and retrieving data-sets repeatedly; this is the region where the user has to be conscious of wise utilization of cluster memory. The whole work described above is done in Scala.

Compared with Java, it is more advantageous to use Scala, since it contains a wealth of inbuilt functions which make our life simple. Configuring and mastering Spark needs a lot of assistance from user groups and developers; support from developers is one of the major criteria that Spark programmers need to seek. The Spark framework is a good candidate for cluster computing that utilizes data-sets repeatedly through iterative algorithms, for the analysis of data-sets from different domains such as health-care and prediction of market trends.


REFERENCES

[1] Andrew Duggan, "Beowulf computer clusters," Tessella Support Services PLC, accessed on April 02, 2013.
[2] Andrew S. Tanenbaum and Maarten van Steen, Distributed Systems: Principles and Paradigms, Pearson Prentice Hall, 2nd edition, May 2005.
[3] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, O'Reilly Media, 2nd edition, January 2013.
[4] Maarten van Steen, Graph Theory and Complex Networks: An Introduction, Altera Corporation, 1st edition, January 2010.
[5] "Scala programming concepts and examples," http://docs.scala-lang.org/tutorials/scala-for-javaprogrammers.html, accessed on May 19, 2013.
[6] Ian Robinson, Jim Webber, and Emil Eifrem, Graph Databases, O'Reilly publications, 1st edition, June 2013.
[7] Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal, "A high-level framework for distributed processing of large-scale graphs," in 12th International Conference on Distributed Computing and Networking, 2011, pp. 155-166.
[8] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, "Pregel: A system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010, pp. 135-146.
[9] Tom White, Hadoop: The Definitive Guide, O'Reilly publications, 2nd edition, October 2010.
[10] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI, 2004, pp. 137-150.
[11] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012, pp. 2-20.
[12] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica, "Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters," in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, 2012, pp. 1-10.
[13] "Scala programming language," http://www.informatics.indiana.edu, accessed on June 23, 2013.
[14] "OPIC page ranking basics," http://www.w2003.org/cdrom/papers, accessed on February 11, 2013.
[15] "Comparison of graph processing frameworks," http://blog.octo.com/en/introduction-to-large-scale-graph-processing, accessed on May 19, 2013.
[16] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh, "Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps," in ISMB 2005 Proceedings, Thirteenth International Conference on Intelligent Systems for Molecular Biology, 2005, pp. 1302-1310.
[17] Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Shark: Fast data analysis using coarse-grained distributed memory," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012, pp. 689-692.
