
Exploration of declarative languages' applicability to the development of large-scale data processing systems

Vyacheslav Derevyanko
David R. Cheriton School of Computer Science
University of Waterloo
vderevya@uwaterloo.ca

Anil Pacaci
David R. Cheriton School of Computer Science
University of Waterloo
apacaci@uwaterloo.ca

ABSTRACT
In this project we've set out to explore whether using declarative languages can ease the development of distributed software systems, in comparison to conventional imperative languages such as Java. We've conducted an experiment in developing a distributed data processing system using Bud, a declarative language platform developed at UC Berkeley. We've developed GraphBud (available at https://github.com/slavik112211/declarativity), an implementation of Google's distributed graph processing model Pregel. Using a declarative language resulted in a very compact Pregel implementation. We've verified the correctness of our model by implementing PageRank on top of GraphBud and carrying out a number of experiments, parallelizing the computation over up to 10 worker nodes and ensuring that parallel execution produces results identical to a single-node PageRank calculation.

We conclude that using declarative languages is a viable option for developing distributed systems prototypes, and that using a declarative framework lets developers focus on the high-level correctness of a distributed protocol's implementation, as opposed to more technical details such as event loops and state-machine transitions.

Keywords
Distributed systems, declarative languages, logic programming, Bloom Bud, Pregel, PageRank.

1. INTRODUCTION

Present-day internet-based companies operate on a truly global scale. The most popular web applications are accessed concurrently by millions of users, and such online services are built to endure these extreme workloads, providing fault-tolerant service 24 hours a day, all year round. This poses stringent operational requirements on the underlying data storage subsystems - these systems must be capable of responding to millions of simultaneous data requests with millisecond delays.

1.1 The perils of distributed computing

The growth of online services has spawned the recent evolution of large-scale distributed databases. These datastores are designed to scale automatically when additional machines are added to the cluster, providing additional storage and processing capacity. This focus on scalability requires fundamentally reconsidering the foundational knowledge about database design that was accumulated over the past decades. Scaling conventional relational databases to the required capacities has proved to be difficult due to the advanced functionality that these systems provide, such as support for transactions, a relational model with complex schemas, and joining records from multiple tables. The natural demand for data integrity has been a driver behind the definition of the ACID (Atomicity, Consistency, Isolation, Durability) transaction properties that most conventional database systems are built upon. These properties ensure that simultaneous transactions accessing the database are serialized, meaning that the order in which read and write operations are performed is regulated using heavy locking of data items. Support for this functionality in a large-scale distributed database can only be achieved through the additional overhead of network-based coordination between database cluster nodes, which greatly increases latency, reduces throughput, and restricts the number of simultaneous transactions to impractical levels.
To circumvent these issues, recent scalable data storage solutions have focused on providing the simplest possible form of storage - a collection of objects retrievable by a key. Such key-value stores provide no support for relations between the stored objects, which allows greater scalability to be achieved, as objects have no dependencies on one another. On the other hand, such datastores replicate stored objects across cluster machines for fault-tolerance and availability purposes, so the challenge of distributed coordination remains.
There are two opposing approaches to handling the issue of consistency of object replicas in distributed datastores. At one end are systems that favor strict consistency of the stored data at any point in time. This approach requires the adoption of distributed coordination protocols [7] such as Two-Phase Commit (TPC) to ensure that an update has propagated to all replicas of an object. This invariably leads to decreased availability of the overall system, as TPC locks access to the modified object and all its replicas until the update propagates to all of them. At the other end of the spectrum are systems that employ so-called eventual consistency. Such systems propagate updates to the replicas in the background, while the master copy of the object remains available for subsequent updates. This trade-off between achieving strict data consistency and data availability in distributed datastores has been known since the very early work conducted in the area of distributed systems [11].

1.2 Declarative languages to the rescue

A research group at UC Berkeley led by Prof. Hellerstein [3] claims that these fundamental difficulties faced by designers of distributed systems stem from the use of imperative sequential programming languages to describe distributed systems that are inherently non-sequential. Their claim is that declarative programming, which grew out of early research on database programming [17] and is disorderly by nature, is a much better match for the challenges of distributed systems programming. To confirm their hypothesis, they conducted an experiment in recreating a large, widely adopted distributed filesystem - HDFS [2] - using their own custom implementation of the Datalog declarative language, called Overlog [14]. They succeeded in fully recreating HDFS in Overlog, and highlight that using declarative languages leads to a cleaner design and a significantly more concise codebase - the original HDFS code spans 20k LOC, whereas their implementation doesn't exceed 1000 LOC.
The key principle that distinguishes declarative programming from conventional imperative programming languages is that in declarative languages programs are expressed as unordered sets of declarative rules. Such rules express transformations of data collections and store the output as derived data collections. Languages based on this paradigm are considered declarative because the user doesn't specify an exact imperative procedure for how derived data collections have to be created (as is done in most conventional imperative programming languages), but rather declares transformation rules that define what derived data collections may be considered logically correct. When everything in the system is represented as collections of facts, it becomes easier to define correct system states and to reason about how parallel execution affects the correctness of the system state.

2. PROJECT OBJECTIVES

We've set out to explore the use of declarative languages in developing distributed software systems. Experiments conducted by Prof. Hellerstein's group provide convincing evidence that there's tremendous benefit in employing declarative programming for this purpose. In an effort to better understand the impact of using declarative languages, we've decided to conduct an experiment of rebuilding a well-known distributed system using declarative programming.
In our search for a suitable distributed system for this experiment we settled on recreating Pregel - a large-scale graph processing system created at Google [16]. Pregel's distributed graph processing model is based on the Bulk Synchronous Parallel (BSP) computation model [19], which allows graph processing to be parallelized while the computation is executed in a fully synchronous manner. The BSP model avoids the complexities of asynchronous parallel execution through the use of barrier synchronization. In Pregel, a user-defined vertex program is executed simultaneously on each vertex in parallel, in a sequence of supersteps. Pregel's stateful iterative model has received broad positive acclaim, and has been widely adopted in open-source implementations such as Apache Giraph and GPS [18]. Efficient distributed graph processing is an important computational challenge faced by internet companies, as much of the data handled by such companies can be represented as large interconnected graphs.
Thus our goals for this project lay at the intersection of two important research avenues: first, we wanted to explore the benefits of using declarative languages for the implementation of distributed systems; and second, we wanted to experiment with implementing different models for distributed graph processing, such as synchronous and asynchronous computation models, and to evaluate their advantages and limitations.

3. BACKGROUND INFORMATION

3.1 Disorderly data-centric programming for distributed systems

One of Hellerstein's students, Peter Alvaro, studied the challenge of building distributed systems and proposed the use of disorderly declarative programming languages as an approach to dealing with the fundamental non-determinism inherent in these systems. In his thesis work [1], Alvaro outlines the following basic causes of non-deterministic behavior in distributed systems: (I) network communication is inherently asynchronous, and the order of network messages cannot be predicted; and (II) remote components cannot be considered reliable (at larger scale), and the presence of network failures may lead to the computation of incomplete results. Traditional approaches to handling these issues require distributed coordination and include techniques such as distributed locking of data items and distributed consistency protocols like Paxos and Two-Phase Commit. Such approaches greatly reduce the scalability of the resulting distributed system and significantly slow down its operation.
As a solution to the aforementioned problems, Alvaro claims that declarative languages are a much better match for the challenges faced by designers of distributed systems, compared to commonly used imperative languages. The level of abstraction provided by declarative runtime frameworks lets programmers avoid imperative specifications of how a data transformation is performed and instead focus on specifying what data transformation is desirable. Most importantly, such logical rules are disorderly by nature, as conceptually all of the rules are applicable at every point in time. It is thus claimed that the use of declarative programming languages encourages programmers to think in this disorderly paradigm.
Nonetheless, distributed systems do require distributed coordination in cases where the system design depends on the ordering of network messages. Disorderly programming makes it possible to employ analysis techniques to determine when distributed coordination is necessary to produce correct outcomes, and to reason about what type of distributed coordination is sufficient when it is required. Under the Consistency as Logical Monotonicity (CALM) theorem [3], data transformations are divided into two categories: monotonic, where additional manipulations on the data do not cause previous statements about the data to be revoked (operators include selection, projection, and join), and non-monotonic, where additional manipulations lead to a change of facts and truths about the data (modification/removal and aggregation operators). The CALM theorem postulates that monotonic transformations do not require distributed coordination and are eventually consistent. As an example, the MapReduce model divides computation into monotonic map operations and non-monotonic reduce operations. Reducers perform aggregation, and require distributed coordination (a wait barrier) before the reducer can proceed.

Figure 1: Declarative transformation rule in Bud: multicasting messages to all network nodes

3.2 Bloom Bud: declarative runtime for distributed systems

To demonstrate the practicality of the ideas about the applicability of the declarative programming paradigm to the development of distributed systems, the group of researchers led by Prof. Hellerstein has developed a prototype declarative programming framework, Bloom Bud. The framework is distributed as open-source software (https://github.com/bloom-lang/bud), and thus is freely available for the general public to experiment with.

Considering that Bloom Bud has been specifically created to provide a platform for developing distributed systems in a declarative style, we have picked this framework as the basis for our declarative implementation of Pregel. This section provides a short introduction to the main concepts behind the framework. The project has excellent documentation (https://github.com/bloom-lang/bud/tree/master/docs) and a whole collection of sample distributed software modules (https://github.com/bloom-lang/bud-sandbox), and we encourage the reader to look at these examples to get a more thorough understanding of the framework's concepts.

The Bud framework provides a domain-specific language (DSL) that is built on top of the Ruby interpreter. Bud programs are executed as plain Ruby programs, with the Bud runtime imported as a mandatory Ruby library (or gem, in Ruby parlance).

3.2.1 Declarative transformation rules

As with imperative programs, declarative programs express the transformations that are performed on datasets. Nevertheless, the mechanics of how transformations are expressed and applied in declarative programs are quite different. All datasets manipulated by a Bud program are represented as collections of data records, or facts, that closely resemble database tables storing information in record tuples. Programs that manipulate such collections are expressed as a disorderly set of logical rules that specify how data collections can be transformed to create derived collections of data records.

The primary differences from imperative programs are as follows. First, such transformation rules are based on set semantics, meaning that a rule is applied to a collection as a whole, and not to individual records within a collection. Second, the rules are disorderly by nature, meaning they aren't applied in any particular order, unlike the statements of conventional imperative programs. All of the rules specify logically valid transformations and can be evaluated by the Bud runtime in any order. Such transformations bear some resemblance to relational views in traditional databases: as soon as the records in an original table are modified, a derived relational view is re-evaluated by the database engine.
Figure 1 shows an example of a transformation rule, where joined records from the network_nodes and messages collections are used to form records in the control_pipe collection. A custom transformation can be applied to each record from the origin collection (right-hand side, RHS) before it is projected into the destination collection (left-hand side, LHS). This transformation code is supplied as a block of Ruby code in curly braces, and a record is supplied as a variable defined between the vertical bars, as per Ruby convention. The block can include arbitrary imperative Ruby code, allowing for flexibility in defining transformations.

As can be seen in Figure 1, RHS collections can be joined in a manner similar to how tables are JOINed in SQL. Bud provides a variety of JOIN operators and allows the key columns on which JOINs have to be performed to be specified. In case no match columns are specified, Bud performs an all-to-all record match. Besides JOINs, Bud provides traditional SQL grouping functionality, which can be used in conjunction with aggregation operators such as min(), max(), bool_and(), bool_or(), sum(), avg(), and count(). Custom projection and aggregation operators can be defined by programmers using constructs such as flat_map() and reduce(). This rich variety of set-based transformation operators provides a flexible environment for expressing custom data transformations.
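Since Figure 1 is not reproduced here, the following is a minimal sketch of what such a multicast rule could look like in Bud. The collection names (control_pipe, network_nodes, messages) follow the text, but the exact schemas are our assumptions rather than the original GraphBud code:

```ruby
require 'bud'

class MulticastSketch
  include Bud

  state do
    table   :network_nodes, [:address]                 # known peers
    scratch :messages,      [:command, :payload]       # requests to broadcast
    channel :control_pipe,  [:@to, :from, :command, :payload]
  end

  bloom :multicast do
    # Join every pending message with every known node (no key columns are
    # given, so Bud performs an all-to-all match) and emit one copy per node
    # on the channel. The @to column of a channel tuple names the recipient;
    # ip_port is the sending instance's own address.
    control_pipe <~ (network_nodes * messages).pairs do |node, msg|
      [node.address, ip_port, msg.command, msg.payload]
    end
  end
end
```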

3.2.2 Network cooperation

Even though Bud is designed for creating distributed systems, each single Bud program running on a network node is completely independent of the other instances, and instances communicate and transmit data through network messages. Bud provides an additional collection type to support the notion of sending network messages - the channel. Records that are put into a channel on one network node are received on the same channel on the other network node. The recipient's address is specified as part of the message metadata: the first column of a message tuple designates the IP address of the recipient, and the second column designates the IP address of the sender.
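As a concrete illustration (a sketch with assumed collection names, not code from GraphBud), a channel is declared with an @-prefixed address column, written to with the asynchronous <~ operator on the sender, and read like any other collection on the receiver:

```ruby
require 'bud'

class ChannelSketch
  include Bud

  state do
    scratch :outbox,   [:peer_addr, :payload]     # records we want to ship out
    channel :ping,     [:@to, :from] => [:payload]
    table   :received, [:from] => [:payload]
  end

  bloom do
    # Sender side: the first (@-marked) column addresses the tuple; the
    # asynchronous <~ operator ships it over the network.
    ping <~ outbox { |o| [o.peer_addr, ip_port, o.payload] }

    # Receiver side: on the destination node the same channel yields the
    # tuple, where it can be persisted into an ordinary table.
    received <= ping { |p| [p.from, p.payload] }
  end
end
```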

3.2.3 Notion of time in Bloom Bud

To extend the model of declarative programming of database queries to distributed systems, additional issues had to be considered. Query languages are designed to work with static data stored in databases, whereas distributed systems are based on network cooperation protocols, meaning network nodes react to incoming messages and dynamically change their state by modifying the data.
Figure 2: State non-determinism in distributed programs

Figure 2 provides an example of the problem of non-deterministic state [1]. The first rule increments a counter whenever a network request comes in; the second sends a response back with the current counter value. In conventional programming languages these instructions would be executed sequentially, leaving no concerns about which value would be returned: x or x+1. Declarative rules are disorderly by nature, meaning the second rule might be evaluated before the first.

This consideration exposes two main issues with the use of declarative programming in a distributed environment. First, distributed systems inevitably face non-determinism due to network operation: a message sent at one moment in time can arrive later than a message sent after it. Second, support for dynamically changing state - the idea that data collections evolve over time - has to be modelled.

As a solution to the aforementioned issues, Alvaro designed a declarative language with support for time semantics - Dedalus [4] - and these ideas were later incorporated into Bloom Bud as well. Dedalus introduces the notion of timesteps into what previously was a completely disorderly set of declarative rules. This solves the problem of non-determinism in Figure 2, as we can specify that when the request is received the counter's value equals x, whereas in the next timestep it will increase to x+1. Bloom Bud divides declarative rules into three categories: (I) deductive rules (<=), where assignment from the RHS to the LHS collection happens instantaneously; (II) inductive rules (<+, <-, <+-), where addition, modification, or removal of records is deferred until the next timestep; and lastly (III) asynchronous rules (<~), meaning that assignment happens at a non-deterministic time in the future, as a way to model non-determinism in network communication.
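Figure 2 is not shown here; the sketch below (with assumed collection names) illustrates how the deferred operators resolve the ambiguity: within a single timestep the response always sees the old counter value, and the increment only becomes visible at the next timestep.

```ruby
require 'bud'

class CounterSketch
  include Bud

  state do
    table   :counter,  [:key] => [:value]
    channel :request,  [:@to, :from]
    channel :response, [:@to] => [:value]
  end

  bootstrap do
    counter <= [[:hits, 0]]
  end

  bloom do
    # Reads the counter as it stands in the current timestep and sends the
    # response asynchronously (<~) back to the requester.
    response <~ (request * counter).pairs { |req, c| [req.from, c.value] }

    # Deferred update (<+-): the incremented value replaces the old row, but
    # only becomes visible at the next timestep, so there is no ambiguity
    # about which value the response rule observes.
    counter <+- (request * counter).pairs { |req, c| [c.key, c.value + 1] }
  end
end
```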

3.3 BSP processing, Pregel architecture

Bulk Synchronous Parallel (BSP) [19] is a parallel computation model for shared-nothing environments where computation is modelled as a sequence of steps. In the BSP model, each worker performs computation on its local partition of the data at each step, and communicates through message passing after each step. A synchronization barrier is enforced to guarantee that every worker is ready to start the next iteration. MapReduce [10], the most prominent example of a large-scale data processing framework based on BSP, has been used for batch processing of web-scale graphs. The MapReduce model relies on a distributed file system for node-to-node communication and was initially designed to support single-stage applications rather than iterative ones like graph processing. The stateless nature of the MapReduce model is ill-suited to iterative graph applications, which resulted in the development of dedicated graph processing frameworks.
Pregel by Google [16] is the first large-scale BSP data processing model with a graph-specific API. Pregel employs a "Think like a Vertex" programming model through a vertex-centric API, where a user-defined compute function is iteratively executed over the graph vertices and communication is performed by sending messages along the edges of the graph. Giraph (http://giraph.apache.org/) is an open-source implementation of Google's Pregel. GPS [18] employs Pregel's vertex-centric model and utilizes vertex migration for dynamic load balancing. Additionally, GPS introduces the Large Adjacency List Partitioning (LALP) optimization to address limitations of existing systems on scale-free graphs. GraphLab [15] employs an asynchronous communication model which can expedite convergence for various machine learning tasks. PowerGraph [12] and PowerLyra [9] factor vertex computation over edges to efficiently process real-world graphs with power-law degree distributions.

Figure 3: Superstep processing at worker nodes in the Pregel model

Pregel employs a master-worker architecture where the master node is responsible for coordinating the computation of the worker nodes. The master node does not perform any computation over the graph data but is responsible for enforcing the synchronization barrier between supersteps. Worker nodes, on the other hand, perform the processing of the supersteps and communicate with the master node through heartbeat messages. A single superstep iteration at the worker nodes is depicted in Figure 3. Each worker is assigned a disjoint subset of the vertices of the graph. During the computation phase, a worker process iteratively executes a user-defined compute() function over its set of vertices, and the messages generated by each vertex are placed into an outgoing buffer. During the communication phase, messages in the outgoing buffers are delivered to the relevant worker processes and placed into the incoming message queues of the corresponding vertices through many-to-many communication among all processes. The master enforces a synchronization barrier to ensure that all vertex messages are delivered and all worker processes are ready to start the next superstep. Once all messages are successfully delivered, the master commands the workers to start the next iteration.

3.4 PageRank calculation in Pregel

To understand how Pregel can be used to parallelize graph processing algorithms, let's take a look at how PageRank, a well-known algorithm for evaluating the importance of web pages, can be implemented in this model. In PageRank, the link structure between web pages is represented as a directed graph, where directed edges signify a link pointing to another web page [8]. The weight of a particular page is determined by the cumulative weight of the pages that point (link) to it. The final weights of the web pages are determined after multiple rounds of weight calculation, and iterations continue until the change of the weights calculated in consecutive iterations doesn't exceed a certain small threshold.
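For reference, the standard PageRank update that each round recomputes (with damping factor d, commonly 0.85, and N the total number of vertices) can be written as follows; the exact variant used by GraphBud may differ slightly, e.g. in normalization:

$$ PR(v) = \frac{1 - d}{N} + d \sum_{u \in \mathrm{In}(v)} \frac{PR(u)}{|\mathrm{Out}(u)|} $$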
PageRank's weight calculation rounds naturally lend themselves to the notion of supersteps in Pregel. In every round (or superstep), each vertex receives network messages with the current weights of the vertices linking to it, calculates its own weight, and passes this newly updated weight as messages to the vertices that it points to. These messages with newly updated page weights are not processed until the master commands the workers to start the next superstep.

Figure 4: Membership protocol for worker nodes to join the GraphBud cluster

Figure 5: Master rules for coordinating superstep execution


4. PREGEL IMPLEMENTATION IN BUD

4.1 System architecture

The main objective of this project was to explore the applicability of declarative programming languages to the development of large-scale distributed data processing systems. For this purpose, we have designed and implemented a prototype graph processing system based on the Pregel model - GraphBud. Our prototype employs a master-worker architecture, as in the original Pregel implementation. A brief overview of distributed graph processing systems and Pregel is presented in Section 3.3.
First, we implemented a simple membership protocol to make network nodes aware of their peer nodes. Each network node maintains a collection of worker nodes and a communication channel for master-worker communication. Each worker process implements two simple rules. First, the worker sends a join message to the master over the network upon initialization. The master node adds the newly joined worker to the worker list and broadcasts the updated workers_list collection to every worker. Second, a worker node populates its own workers_list collection upon receiving the network message from the master. This protocol ensures that complete membership information is synchronized across the entire GraphBud cluster. The implementation of these two simple rules is depicted in Figure 4.
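Since Figure 4 is not reproduced here, the following rough sketch shows what these rules could look like in Bud. The collection names (workers_list, join_channel, members_bcast) and schemas are our assumptions, both sides are combined in one class for brevity, and the broadcast rule is simplified to push each join announcement rather than the whole updated list:

```ruby
require 'bud'

class MembershipSketch
  include Bud

  def initialize(master_addr, opts = {})
    @master_addr = master_addr
    super(opts)
  end

  state do
    table   :workers_list,  [:address]
    channel :join_channel,  [:@to, :worker_addr]
    channel :members_bcast, [:@to, :worker_addr]
  end

  bloom :worker_side do
    # Rule 1 (worker): announce ourselves to the master. The rule is evaluated
    # every timestep; a real implementation would guard it so the join is only
    # sent until the worker sees itself in workers_list.
    join_channel <~ [[@master_addr, ip_port]]

    # Rule 2 (worker): adopt membership updates broadcast by the master.
    workers_list <= members_bcast { |m| [m.worker_addr] }
  end

  bloom :master_side do
    # Master: record a newly joined worker...
    workers_list <= join_channel { |j| [j.worker_addr] }

    # ...and push the announcement to every worker already known.
    members_bcast <~ (workers_list * join_channel).pairs do |w, j|
      [w.address, j.worker_addr]
    end
  end
end
```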
GraphBud relies on a simple range-partitioning scheme to create a disjoint vertex partitioning of the original graph. Each worker node reads the input graph in adjacency-list format and can create its local subgraph in a distributed manner, without communication.

GraphBud relies on a one-to-many broadcast between the master and the worker nodes in order to coordinate worker activities. Any request generated by the master is broadcast to all nodes that are alive at that moment. The following subsections describe the details of the master and worker process implementations, in addition to two optimizations implemented in GraphBud to reduce communication overhead.

4.2 Master Implementation

The main responsibilities of the GraphBud master process include coordination of the synchronous procedures executed on the worker nodes, such as graph loading and the execution of supersteps. The master process maintains a collection of all active workers with information including each worker's network address, the numeric id assigned during registration, and boolean flags marking the completion of various processes on the worker nodes, such as graph loading and superstep progress.

Initially, the user provides the file path of the input graph through a command-line interface. The same graph loading request is sent to all workers. Upon receiving a success message from all workers, the master is ready to initiate the first superstep. The rule on lines 2-4 of Figure 5 corresponds to the initial superstep execution. This rule translates to: if the user sent a start command and graph loading is complete, then create a new record for superstep 0 and insert it into the supersteps collection. The rule on lines 16-21 is responsible for broadcasting the actual start-superstep request to all workers whenever a new superstep record is inserted into the supersteps collection. Once a worker completes a superstep, a success message is sent back to the master, and the master updates the workers_list collection, marking that worker as finished. The rule on lines 9-13 closes the loop, inserting a subsequent superstep record into supersteps whenever all workers report successful completion of the current superstep.
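Since Figure 5 is not reproduced here, the sketch below illustrates the general shape of this coordination logic. The collection names and schemas are assumptions, the broadcast rule is not guarded against re-firing on every tick, and the resetting of per-worker flags between supersteps is omitted:

```ruby
require 'bud'

class MasterSuperstepSketch
  include Bud

  state do
    table   :workers_list,  [:address] => [:graph_loaded, :superstep_done]
    table   :superstep,     [:key] => [:number]    # the current superstep
    scratch :all_loaded,    [:ok]
    scratch :all_done,      [:ok]
    scratch :start_request, [:cmd]                 # the user's "start" command
    channel :control_pipe,  [:@to, :cmd, :number]
  end

  bloom :status do
    # Global aggregates (empty grouping list) over the per-worker flags.
    all_loaded <= workers_list.group([], bool_and(:graph_loaded))
    all_done   <= workers_list.group([], bool_and(:superstep_done))
  end

  bloom :superstep_control do
    # Superstep 0: the user issued a start command and every worker has
    # finished loading its graph partition.
    superstep <+ (start_request * all_loaded).pairs do |req, loaded|
      [:current, 0] if loaded.ok
    end

    # Advance: every worker reported completion of the current superstep.
    superstep <+- (superstep * all_done).pairs do |s, done|
      [:current, s.number + 1] if done.ok
    end

    # Broadcast the current superstep number to every worker.
    control_pipe <~ (superstep * workers_list).pairs do |s, w|
      [w.address, :start_superstep, s.number]
    end
  end
end
```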

4.3 Worker Implementation

The worker implementation of the GraphBud prototype is depicted in Figure 6. Each worker node maintains the state of its local set of vertices in a data collection. The vertex state includes the following information: the value of the vertex in the current superstep, the list of outgoing edges, and a message buffer holding the set of incoming messages for the current superstep. A list of incoming edges is not stored as part of the vertex information, as in the Pregel model messages are only sent to the vertices that a particular vertex points to.

In addition to the current vertex state, each worker node maintains two buffers: one for messages received for the current superstep, and one for messages that will be delivered in the next superstep. Lines 1-5 of Figure 6 depict the set of data collections that each worker node maintains.
Figure 6: Worker node implementation in Bud

Workers start execution of a superstep on the master's start command. In each superstep, the user-defined compute() function is executed on each vertex. The three main phases of the compute function are as follows: (I) iterate over the set of incoming messages; (II) update the value of the current vertex; (III) prepare the messages that will be sent in the next superstep. The user-defined compute function does not have to be written in a declarative style; any imperative code block can be executed as long as it follows these three phases of computation. The set of messages returned from compute() is accumulated in the queue_out buffer, as shown on lines 8-12. queue_out maintains two flags for each message, tracking whether the message was already sent and whether it was successfully delivered.
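As an illustration of this three-phase contract, a PageRank compute function could look roughly like the following plain-Ruby sketch; the Vertex record layout, the damping factor, and the helper names are assumptions rather than the exact GraphBud API:

```ruby
# Hypothetical vertex record - not the exact GraphBud schema.
Vertex = Struct.new(:id, :value, :out_edges, :messages)

DAMPING = 0.85

def compute(vertex, num_vertices)
  # Phase I: iterate over incoming messages (each is a neighbour's contribution).
  incoming_sum = vertex.messages.inject(0.0) { |sum, w| sum + w }

  # Phase II: update the vertex value using the PageRank formula.
  vertex.value = (1 - DAMPING) / num_vertices + DAMPING * incoming_sum

  # Phase III: prepare messages for the next superstep - the new weight is
  # split evenly among the out-neighbours.
  share = vertex.value / vertex.out_edges.size
  vertex.out_edges.map { |dst| [dst, share] }
end

# Example: a vertex with two out-edges that received contributions 0.2 and 0.3.
v = Vertex.new(1, 0.15, [2, 3], [0.2, 0.3])
puts compute(v, 4).inspect
```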
The rule on lines 15-20 of Figure 6 is responsible for the delivery of vertex messages for the next superstep. The outgoing message buffer (queue_out) and the collection of workers (workers_list) are joined on the worker identifier so that the IP address of the worker node holding the target vertex can be identified. A many-to-many communication channel is used to deliver each message to the corresponding worker. The current prototype of the Bud framework relies on the User Datagram Protocol (UDP) for network communication, so each vertex message is sent as a separate network datagram. Section 4.4 and Section 4.5 further describe the vertex message delivery and present the optimizations in GraphBud that reduce communication overhead.

One limitation of network communication via UDP is the lack of delivery guarantees. GraphBud uses an implementation of a reliable delivery protocol for vertex messaging in order to ensure delivery of each vertex message before a worker reports the successful completion of a superstep to the master. Each vertex message is assigned a unique identifier, and the recipient of a vertex message sends an acknowledgement to the original sender so that the sender can mark the message as delivered. Once all messages in the outgoing buffer are successfully delivered, the worker reports completion of the superstep to the master node. The verification of the delivery of all vertex messages is implemented through an aggregation over the delivered flags of the queue_out collection, as shown on lines 29-31 of Figure 6.
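A sketch of this acknowledgement protocol in Bud follows; the collection names echo the text, but the schemas and the exact rule structure are assumptions:

```ruby
require 'bud'

class ReliableDeliverySketch
  include Bud

  state do
    table   :queue_out,   [:msg_id] => [:dest_addr, :payload, :sent, :delivered]
    channel :vertex_pipe, [:@to, :from, :msg_id, :payload]
    channel :ack_pipe,    [:@to, :msg_id]
    scratch :all_delivered, [:ok]
  end

  bloom :reliable_delivery do
    # Send every not-yet-sent vertex message to its destination worker.
    vertex_pipe <~ queue_out do |m|
      [m.dest_addr, ip_port, m.msg_id, m.payload] unless m.sent
    end

    # Receiver side: acknowledge each vertex message back to its sender.
    ack_pipe <~ vertex_pipe { |v| [v.from, v.msg_id] }

    # Sender side: mark the acknowledged message as delivered (deferred update).
    queue_out <+- (queue_out * ack_pipe).pairs(queue_out.msg_id => ack_pipe.msg_id) do |m, a|
      [m.msg_id, m.dest_addr, m.payload, m.sent, true]
    end

    # The superstep is complete once every outgoing message is acknowledged.
    all_delivered <= queue_out.group([], bool_and(:delivered))
  end
end
```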

4.4 Message Packing

As described in the previous section, the Bud framework relies on UDP for network communication. A UDP datagram is created and transmitted over the network for each record inserted into a channel collection, which results in an excessive number of very small messages being transmitted over the network. In order to reduce the number of messages, we initially implemented a vertex message packing scheme where a worker creates a single large package for every other worker node. However, UDP-based communication also suffers from datagram size limitations, apart from the lack of delivery guarantees. The maximum allowed size of a UDP package is around 500-1500 bytes, depending on the platform, and any package larger than the platform-specific maximum is silently dropped. Although the message-packing optimization we implemented works well on toy-sized graphs, it resulted in lost network messages on any reasonably sized graph. Therefore, we excluded it from the current prototype. However, it can be re-enabled once a TCP-based communication protocol is implemented in the Bud framework. TCP automatically splits large network messages into reasonably sized packets and guarantees reliable, ordered delivery of each packet.

Figure 7: Vertex-to-vertex communication in the case of ghost vertices

4.5 Large Adjacency List Partitioning

Large Adjacency List Partitioning (LALP) is an optimization initially proposed in GPS [18] to reduce network communication for algorithms, such as PageRank, where outgoing edges are only used for sending messages and the same message is sent to all adjacent vertices. LALP is essentially a receiver-side scatter, where a high-degree vertex sends a single message to a worker machine instead of sending separate messages to each adjacent vertex. LALP partitions the adjacency lists of high-degree vertices across multiple workers. On each worker other than the original worker holding the vertex, a ghost vertex is created to store the partial adjacency list.
Consider a 4-vertex graph partitioned across two worker nodes, as shown in Figure 7. In the standard implementation, vertex 2 needs to send two separate messages to the remote worker: one for vertex 1 and one for vertex 3. With LALP enabled, a ghost vertex is created on worker 1, as this worker stores local neighbours of vertex 2. Vertex 2 then sends a single network message to the remote node at each superstep, and the ghost copy of vertex 2 replicates this message to the local neighbours on the remote worker. Even in this simple example, the LALP optimization reduces the total number of messages from 5 to 4.

In GraphBud, the LALP optimization significantly reduces the number of UDP messages delivered over the network. Vertices whose out-degree is larger than a user-defined threshold are identified during the graph loading stage, and ghost vertices are created for them. During superstep execution, the compute() function is only invoked on the master copy of a high-degree vertex. Instead of generating a separate message for each neighbour, a single message is generated per worker and inserted into the outgoing message buffer. On the receiver side, any message whose target is a ghost vertex is replicated for each adjacent vertex in the incoming message buffer.
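A sketch of how such high-degree vertices could be flagged during graph loading; the collection names and the threshold value are assumptions, not the exact GraphBud code:

```ruby
require 'bud'

class GhostCandidateSketch
  include Bud

  DEGREE_THRESHOLD = 100   # hypothetical user-defined cut-off

  state do
    table :vertices,         [:id] => [:out_edges]   # out_edges is an array of ids
    table :ghost_candidates, [:id] => [:out_edges]
  end

  bloom :lalp_loading do
    # During loading, flag every vertex whose out-degree exceeds the threshold;
    # these are the vertices whose adjacency lists get split into ghost copies.
    ghost_candidates <= vertices do |v|
      [v.id, v.out_edges] if v.out_edges.size > DEGREE_THRESHOLD
    end
  end
end
```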
To evaluate the impact of LALP on GraphBud's performance, we conducted a micro-benchmark. We generated a 500-vertex scale-free graph using the Barabasi-Albert model [6]. We ran a 15-iteration PageRank algorithm over the generated graph using 8 worker processes, with and without the LALP optimization. Figure 8 presents the total execution time of each superstep. We observe that the LALP optimization reduces the average superstep time from 23s to 13s. Furthermore, we monitored the number of messages transmitted over the network: without LALP, the average number of packages per superstep is 2475, while LALP reduces it to 1337. This micro-benchmark clearly shows that LALP can significantly reduce the number of network messages, which in turn improves overall job completion time.

Figure 8: Impact of LALP on PageRank performance
PowerGraph [12] is yet another vertex-centric graph processing system, optimized for natural graphs with power-law degree distributions. It employs a similar strategy of partitioning adjacency lists in order to reduce network overhead. Unlike Pregel, PowerGraph creates an edge-disjoint partitioning of the input graph, where the adjacency list of a vertex spans multiple machines and the computation of a single vertex program is distributed across multiple workers. In a given superstep, each partial copy of a vertex computes a partial result from its incoming messages, and the designated master copy accumulates the final result from the partial results. In this way, the computation of high-degree vertices can be distributed effectively. Implementing edge-disjoint partitioning and PowerGraph's distributed vertex-computation model might further improve GraphBud's performance on real-world graphs with power-law degree distributions.

5. DECLARATIVE EXPERIENCE

In this section we lay out our considerations on the benefits of programming distributed systems in a declarative style, as opposed to conventional imperative programming languages, as well as the issues we've encountered along the way.

Getting used to programming distributed systems using declarative rules takes some time. As mentioned in subsection 3.2.3, using disorderly declarative logical rules works well for expressing database queries, as databases can be considered static storage containers: each declarative rule transforms the data stored in a collection in some way and presents that as its output. There's no need for any time semantics, as a query doesn't need to specify when it has to be called. Distributed systems are quite different: they can be characterized as highly dynamic, reactive systems. Each network node that is part of a distributed system must actively listen to incoming requests and messages from other nodes, perform calculations and data manipulations on behalf of the incoming requests, and respond back to the caller with appropriate response messages.

5.1 Declarative rules as event callbacks

The mismatch between using declarative rules for static database queries and for reactive distributed systems has been considered by the Bloom Bud developers, and the notion of timesteps incorporated into Bud makes it possible to express how data collections evolve based on the requests received from other network nodes. Bud also offers special collections that preserve data records for only a single timestep and that can be used to deliver network and intra-process messages: channels and interfaces. We found that, very often, the rules we declared would join regular data collections, for example lists of vertices, with temporary one-step collections used as a channel to deliver request messages. Effectively, this corresponds to creating a callback function that processes a list of vertices whenever a request to do so has been put into a temporary one-step collection. In a way, this shifts reasoning about declarative rules from an unordered set of rules that are logically valid all the time to a set of callbacks that fire whenever a request comes in. For example, this can be seen in Figure 6 on lines 8-12, where the temporary one-step collection worker_input is joined with the collection of vertices. This rule processes the list of vertices whenever a request is put into the worker_input collection.
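A sketch of this callback-like pattern follows; the worker_input and vertices names follow the text, while the output collection and all schemas are assumptions:

```ruby
require 'bud'

class CallbackSketch
  include Bud

  state do
    table   :vertices,      [:id] => [:value]
    scratch :worker_input,  [:request_id] => [:command]       # lives for one timestep
    scratch :worker_output, [:request_id, :vertex_id] => [:value]
  end

  bloom :callback_style do
    # worker_input only holds records during the timestep in which a request
    # arrives, so this join behaves like a callback over the vertices table:
    # it produces output only on the tick when a request shows up.
    worker_output <= (worker_input * vertices).pairs do |req, v|
      [req.request_id, v.id, v.value]
    end
  end
end
```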
The described pattern of using declarative rules as event callbacks gives Bud a reactive feel and invites comparison of the Bud framework to other reactive frameworks, such as AngularJS (https://angularjs.org/). In AngularJS, all modifiable data objects are attached to visual DOM elements of an HTML page. AngularJS runs a control loop in the background, monitoring changes to the data objects: if an object is changed, the respective DOM element is redrawn in the browser so that the user can see the changed value. Bud iterates its timesteps on network events, which likewise gives the feel of a background loop running internally, executing the appropriate callbacks (or declarative rules) when a request comes in. In comparison to imperative programming, this takes the burden of maintaining such an event loop within distributed software off the developer and delegates it to the framework.

5.2 Iterating timesteps in Bud

Nonetheless, this impression of a Bud event loop running in the background is not quite correct. There are only a few cases in which Bud iterates to the next timestep automatically: on initialization, and when a network request comes in. Per Prof. Hellerstein, this follows from a design decision in Bud: time only moves forward on the arrival of events (private correspondence). To our understanding, one other necessary case is when the insertion of collection records is deferred until the next timestep (<+, <-, and <+-). For example, in our implementation of Pregel, whenever a worker receives a command from the master to start superstep processing, some initialization procedures are in order: processing the vertex message queue and assigning message lists to the corresponding vertices. Only after that can individual vertices start processing their input vertex messages. We implement this protocol by separating the two phases by Bud timesteps, deferring the start of the vertex processing by one timestep. Unfortunately, this leads to stalling of the Bud instance, as no other network messages are sent to this node. We get around this problem by employing a special time-ticking periodic collection. This collection is a simple timer abstraction, and joining it with a normal collection of records allows a declarative rule to be executed with a predefined periodicity. In our case, we aren't using the timer in any declarative rule, but simply declare it to ensure that Bud iterates its timesteps (we use millisecond precision). This introduces an unnecessary delay, as a finished timestep's successor isn't started until the timer ticks. It seems more appropriate to keep the abstraction of timesteps within the Bud framework, but to modify Bud to iterate to the next timestep whenever record insertions are pending.
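Concretely, the workaround amounts to declaring a periodic collection (here with a 1 ms period) that is never referenced by any rule; the class and collection names below are placeholders:

```ruby
require 'bud'

class TickerSketch
  include Bud

  state do
    # Fires every millisecond and is never joined in any rule. Its only role
    # is to make Bud advance to the next timestep, so that records inserted
    # with <+ become visible without waiting for a network event.
    periodic :ticker, 0.001
  end
end
```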

5.3 Avoiding state machine programming

Implementing event loops is quite characteristic of distributed software systems. When developing distributed systems, programmers often need to reason about the spectrum of possible input requests that can be received by a network node, and to define how a network node should handle an event based on the state it's currently in. Alvaro calls this putting a quadratic reasoning task on the programmer [1]. In contrast, Bud's declarative timestepping model seems to help alleviate the problem, as timestep semantics make it possible to clearly define the ordering of successive system states, and to isolate the data manipulations performed in one system state from those of the next.
As an example, consider our declarative implementation of the Pregel worker shown in Figure 6. We use Bud timesteps to separate a number of successive system states from one another. First, the rule on line 8 is invoked on superstep start, processing all vertices and creating a queue of outgoing vertex messages in the queue_out collection, deferring the population of queue_out until the next timestep (by using <+ instead of <= as the assignment operator). The rule on line 15 sends out all of the outgoing vertex messages from queue_out to the network nodes that store the corresponding vertices. Lastly, the rule on line 24 removes all of the messages from the queue_out collection once it's confirmed that all the messages were sent. If the first and second rules weren't separated by timesteps, it would be unclear which messages should be sent out by the second rule, as queue_out is initially empty and is only populated as a result of the execution of the first rule. The third rule is also properly separated from the first two, ensuring that the clean-up of the queue_out collection only happens after all messages were sent. Thus, all three worker node states - processing, sending messages, and cleaning up the queue - are properly separated and cannot overlap, as they are defined in successive timesteps. Bud ensures that any operations defined to be executed in a single timestep are executed as such: the first rule in our example, vertex processing, is a time-consuming operation, but Bud ensures that it completes before progressing to the next timestep, even if a timer collection is set up to tick timesteps every millisecond (effectively delaying the timer).
In contrast, in imperative languages we would need to employ an event loop through which we would communicate changes of state, and before the state could be advanced to the next one, we would need to ensure that all the conditions are met, through heavy use of boolean flags and if statements.

Figure 9: Handling of worker responses on the master

An additional example of avoiding state machine programming can be seen in our implementation of the Pregel master, shown in Figure 5. The rule on line 9 allows the master to move on to the next superstep whenever all workers have reported that the current superstep is complete. There's no event loop explicitly programmed here, no waiting for success messages from workers, and there's no control logic counting the number of responses, checking whether all of the workers have finished, or deciding whether the master should switch to the next superstep. Instead, there are two declarative rules: the first marks a worker as finished (in the workers_list collection) whenever that worker responds with a success message (Fig. 9), and the second - the rule on line 9 (Fig. 5) - checks whether all workers have finished and moves on to the next superstep.

Overall, declarative rules make it possible to express a distributed system at a very high level of abstraction, compactly capturing the most crucial aspects of distributed protocols.

5.4 Reusing rules on multiple occasions

In addition to the effort saved by avoiding the programming of state-machine transitions, we noticed that a single declarative rule is often reused in many different scenarios. An example of this is shown in Figure 1 - the multicast message delivery rule. Whenever a message is put into the messages collection, it's joined with the list of network nodes, the first column of each resulting record is set to a recipient's IP address, and the whole record is pushed into the control_pipe channel collection, which delivers the message to that recipient. This rule is used any time the master needs to instruct all the workers at once - when loading the graph, when starting the graph computation, and when iterating to the next superstep.
Another example is the rule that handles response messages from workers on the master node, shown in Figure 9. This rule updates a worker's record in the workers_list collection whenever a response comes back from that worker. Currently it handles two cases: when a worker reports the completion of loading its part of the graph, and when a worker reports the completion of a superstep. In both cases, boolean flags are updated on the worker record, and if conditions are used to distinguish between the cases and to update the correct flags.
Of course, code reuse and modularization are key concerns in software systems in general, and conventional programming paradigms also provide a rich variety of tools for code reuse, starting with extracting frequently used code into functions. Nonetheless, we want to highlight that sufficiently general declarative rules can be applied in a number of different scenarios, saving the effort of providing customized code for handling each case separately.

5.5 Issues encountered

After outlining our considerations on the beneficial sides of declarative programming, we want to point out some issues that we faced along the way. As the flip side of the overall compactness achieved by expressing distributed systems in the declarative programming paradigm, the set-based nature of declarative rules is sometimes a bit too high-level to express fine-grained details of software functionality. Here we provide a number of examples where we felt it was hard to come up with a way to express the needed programming primitive using the available declarative constructs.
As a first example, the rule on line 8 of Figure 6 is used to generate outgoing vertex messages for each vertex in the vertices collection. Considering that each network node might store a large number of vertices, it could be more efficient to send vertex messages as soon as a given vertex has generated them, rather than when all of the vertices have finished generating messages. This seems trivial to do in an imperative language, whereas it isn't as easy with declarative rules, as they are suited to expressing transformations on whole collections.
As another example, in one rule we didn't manage to get rid of a for-loop, and had to code it explicitly into the imperative block attached to a declarative rule. That block is provided by Bud to customize how each individual record of an input collection should be transformed before being pushed to an output collection, so in a way this usage of the block is not correct. The rule requires a JOIN of three collections: vertices, the incoming vertex messages in the queue_in_next collection, and a channel collection that carries the request from the master to start superstep processing. The program logic requires all vertices to be processed, so an OUTER JOIN of vertices with the other two collections would be appropriate, but Bud's built-in OUTER JOIN function outer() can only join two collections. This pushed us to define the rule as an OUTER JOIN of vertices with the channel collection, and to loop over the messages in the queue_in_next collection manually within an imperative block, matching messages to the currently processed vertex.
As a last example, consider the following simple scenario: after sending a network message, we want to update a boolean flag marking the message as sent. This pattern can be seen in the rule on lines 15-20 of Figure 6. A declarative rule can only modify records in the collection specified as its output, not the collections used on the right-hand side of the rule. In this case, we want to mark the message as sent right after it's been put into the vertex_pipe channel, so that it won't be sent twice. We only managed to do this through an explicit assignment of the sent flag in the imperative code block on line 17. A more correct declarative way would be to define a second rule, modifying all messages in the queue_out collection and setting sent to true for all of them at once, but this raises the question of whether the second rule should be executed in the current timestep or deferred until the next one. Both variants introduce a time discrepancy between the action and the marking of the action as completed, which could potentially lead to unwanted duplication of messages.
All of the aforementioned issues should be taken with a grain of salt. This implementation of Pregel is the only system we've developed using the declarative paradigm, and we might have missed some opportunities to properly express the needed functionality in a clean declarative form. Nonetheless, we felt that this report wouldn't be complete without outlining some of the challenges we encountered.

6. FUTURE WORK

As described in Section 4.4, one of the biggest limitations of the current GraphBud prototype is the UDP-based network communication used in the Bud framework. The connectionless nature of the UDP protocol means that every single vertex message is transmitted as a separate datagram, which hinders the scalability of GraphBud to larger graph sizes. TCP-based network communication would allow GraphBud to maintain a single connection for each worker pair and to use buffering to reduce the number of network packages transmitted over the network. As described in Section 4.4, TCP would also make it possible to use the message packing optimization, where a single network message would be used to transmit all of the vertex messages addressed to a particular worker node. We are considering modifying the Bud framework to allow TCP-based communication, in order to handle real-world large datasets. This extension has already been planned by the original Bud developers, but hasn't been implemented so far (https://github.com/bloom-lang/bud/issues/100).

The micro-benchmark presented in Section 4.5 clearly shows that the performance of GraphBud is limited by network communication; a 40% decrease in the total number of network messages resulted in a 40% improvement in superstep completion time. In our experiments with real-world large datasets (a scientific paper citation graph with 500K edges from SNAP [13]), the overhead of network buffers dominated the computation time, and it took a day to complete a 15-iteration PageRank. Therefore, we have omitted performance experiments. We leave extensive experimental analysis for future work, once the aforementioned issues are resolved.
One other important aspect that we wanted to explore in our work was the claim that using declarative languages helps in reasoning about the consistency of distributed data. Hellerstein and his group of students promote the idea of coordination avoidance in distributed systems [3, 5]. The development of these concepts can have a significant impact on the way software engineers reason about the design of distributed systems, as distributed coordination remains one of the key challenges in this area. Given that we only implemented the fully synchronous BSP Pregel model, it's hard for us to argue whether declarative languages really help in understanding the nature of the transformations applied to data and ease reasoning about data consistency in distributed systems. Section 4.5 provides a brief overview of PowerGraph's [12] computation model. In addition to edge-disjoint partitioning, PowerGraph supports asynchronous computation, which can expedite the convergence of machine-learning applications. Implementing an asynchronous computation model such as PowerGraph's would allow us to better assess the validity of the claims about the applicability of the declarative paradigm, as the distributed coordination employed in asynchronous models is inherently more complex. Nonetheless, we felt that this would constitute another major project, and we have left the implementation of PowerGraph for future work.

7. CONCLUSIONS

Even though at first it might be quite hard to see the connection between declarative languages, which were initially developed as advanced database query languages, and present-day high-throughput distributed systems, Bloom Bud bridges the gap between the two. Our exploration leads us to believe that declarative languages can be successfully used to implement distributed systems, and that the resulting systems tend to be expressed in a concise manner. The Bud framework shifts the reasoning about declarative rules from an unordered set of data transformation rules that are logically correct at any moment in time to a set of event callbacks that perform data transformations upon receiving network requests.

We've presented the advantages that using declarative languages brings to the table, and we've documented the issues that we encountered during our implementation of a distributed system using the declarative paradigm. We've come to understand that the declarative paradigm doesn't change the fundamental aspects of developing software - after all, any software system can be viewed as a data processing system. The declarative paradigm simply puts that into the spotlight by pushing developers to design software systems by reasoning about the data collections the software works upon, and by declaring rules about how these collections need to be transformed in response to external requests. We've also identified some weaker points of the current Bud framework implementation, such as the use of UDP datagrams for network communication.

We hope that the analysis presented in this report provides an incentive for other teams to try out declarative languages when implementing distributed systems.

8. REFERENCES

[1] P. Alvaro. Data-centric Programming for Distributed Systems. PhD thesis, University of California, Berkeley, 2015.
[2] P. Alvaro, T. Condie, N. Conway, K. Elmeleegy, J. M. Hellerstein, and R. Sears. Boom analytics: exploring data-centric, declarative programming for the cloud. In Proceedings of the 5th European Conference on Computer Systems, pages 223-236. ACM, 2010.
[3] P. Alvaro, N. Conway, J. M. Hellerstein, and W. R. Marczak. Consistency analysis in Bloom: a CALM and collected approach. In CIDR, pages 249-260. Citeseer, 2011.
[4] P. Alvaro, W. R. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. Sears. Dedalus: Datalog in time and space. In Datalog Reloaded, pages 262-281. Springer, 2011.
[5] P. D. Bailis. Coordination Avoidance in Distributed Databases. PhD thesis, University of California, Berkeley, 2015.
[6] A.-L. Barabási and R. Albert. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
[7] P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys (CSUR), 13(2):185-221, 1981.
[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Seventh International World-Wide Web Conference (WWW 1998), 1998.
[9] R. Chen, J. Shi, Y. Chen, and H. Chen. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems, page 1. ACM, 2015.
[10] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[11] B. Lindsay, P. Selinger, C. Galtieri, J. Gray, R. Lorie, T. Price, F. Putzolu, and B. W. Wade. Notes on distributed databases. IBM Thomas J. Watson Research Center, 1979.
[12] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12), pages 17-30, 2012.
[13] J. Leskovec et al. Stanford network analysis project. http://snap.stanford.edu, 2010.
[14] B. T. Loo, T. Condie, J. M. Hellerstein, P. Maniatis, T. Roscoe, and I. Stoica. Implementing declarative overlays. In ACM SIGOPS Operating Systems Review, volume 39, pages 75-90. ACM, 2005.
[15] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716-727, 2012.
[16] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135-146. ACM, 2010.
[17] R. Ramakrishnan and J. D. Ullman. A survey of deductive database systems. The Journal of Logic Programming, 23(2):125-149, 1995.
[18] S. Salihoglu and J. Widom. GPS: a graph processing system. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management, page 22. ACM, 2013.
[19] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.
