Anil Pacaci
apacaci@uwaterloo.ca
vderevya@uwaterloo.ca
ABSTRACT
In this project we set out to explore whether using declarative languages can ease the development of distributed software systems, in comparison to conventional imperative languages such as Java. We conducted an experiment in developing a distributed data processing system using Bud, a declarative language platform developed at UC Berkeley. We developed GraphBud (available at https://github.com/slavik112211/declarativity), an implementation of Google's distributed graph processing model Pregel. Using a declarative language resulted in a very compact Pregel implementation. We verified the correctness of our model by implementing PageRank on top of GraphBud and carrying out a number of experiments, parallelizing the computation across up to 10 worker nodes and ensuring that parallel execution produces results identical to a single-node PageRank calculation.
We conclude that declarative languages are a viable option for developing distributed system prototypes, and that a declarative framework allows developers to focus on the high-level correctness of distributed protocol implementations, as opposed to more technical details such as event loops and state-machine transitions.
Keywords
Distributed systems, declarative languages, logic programming, Bloom/Bud, Pregel, PageRank.
1. INTRODUCTION
1.1
The growth of online services has spawned the recent evolution of large-scale distributed databases. These datastores are designed to scale automatically when additional machines are added to the cluster, providing additional storage and processing capacity. This focus on scalability requires fundamentally reconsidering the foundational knowledge about database design that was accumulated over the past decades. Scaling conventional relational databases to the required capacities has proved difficult due to the advanced functionality these systems provide, such as support for transactions, a relational model with complex schemas, and joining records from multiple tables. The natural demand for data integrity has been a driver behind the definition of the ACID (Atomicity, Consistency, Isolation, Durability) transaction properties that most conventional database systems are built upon. These properties ensure that simultaneous transactions accessing the database are serialized, meaning that the order in which read and write operations are performed is regulated using heavy locking of data items. Supporting this functionality in a large-scale distributed database can only be achieved through the additional overhead of network-based coordination between database cluster nodes, which greatly increases latency, reduces throughput, and restricts the number of simultaneous transactions to impractical levels.
To circumvent these issues, recent scalable data storage solutions have focused on providing the simplest possible form of storage: a collection of objects retrievable by a key. Such key-value stores provide no support for relations between the stored objects, which allows greater scalability to be achieved, as objects have no dependencies between them. On the other hand, such datastores replicate stored objects across cluster machines for fault tolerance and availability, so the challenge of distributed coordination remains.
There are two opposing approaches to handling the consistency of object replicas in distributed datastores. At one end are systems that favor strict consistency of the stored data at any point in time. This approach requires the adoption of distributed coordination protocols [7] such as Two-Phase Commit (TPC) to ensure that an update has propagated to all replicas of an object. This invariably decreases the availability of the overall system, as TPC locks access to the modified object and all its replicas until the update propagates to every replica. At the other end of the spectrum are systems that employ so-called eventual consistency. Such systems propagate updates to the replicas in the background, while the master copy of the object remains available for subsequent updates. This trade-off between strict data consistency and data availability in distributed datastores has been known since the very early work in the area of distributed systems [11].
1.2
2. PROJECT OBJECTIVES
3. BACKGROUND INFORMATION
3.1 Disorderly data-centric programming for distributed systems
One of Hellerstein's students, Peter Alvaro, studied the challenge of building distributed systems and proposed the use of disorderly declarative programming languages as an approach to dealing with the fundamental non-determinism inherent in these systems. In his thesis work [1], Alvaro outlines the following basic causes of non-deterministic behavior in distributed systems: (I) network communication is inherently asynchronous, and the order of network messages cannot be predicted; and (II) remote components cannot be considered reliable (at larger scales), and network failures may lead to the computation of incomplete results. Traditional approaches to handling these issues require distributed coordination and include techniques such as distributed locking of data items and distributed consistency protocols like Paxos and Two-Phase Commit. Such approaches greatly reduce the scalability of the resulting distributed system and significantly slow down its operation.
As a solution to the aforementioned problems, Alvaro argues that declarative languages provide a much better match for the challenges faced by designers of distributed systems than commonly used imperative languages. The level of abstraction provided by declarative runtime frameworks lets programmers avoid imperative specifications of how a data transformation is performed, and instead focus on specifying what data transformation is desired. Most importantly, such logical rules are disorderly by nature, as conceptually all of the rules are applicable at every point in time. It is thus claimed that using declarative programming languages encourages programmers to think in this disorderly paradigm.
Nonetheless, distributed systems do require distributed coordination in cases where the system design depends on the ordering of network messages. Disorderly programming makes it possible to employ analysis techniques that determine when distributed coordination is necessary to produce correct outcomes, and to reason about what type of distributed coordination is sufficient when it is required. Under the Consistency as Logical Monotonicity (CALM) theorem [3], data transformations are divided into two categories: monotonic, where additional manipulations of the data do not cause previous statements about the data to be revoked (operators include selection, projection, and join), and non-monotonic, where additional manipulations change the facts and truths about the data (modification/removal and aggregation operators). The CALM theorem postulates that monotonic transformations do not require distributed coordination and are eventually consistent. As an example, the MapReduce model divides computation into monotonic map operations and non-monotonic reduce operations. Reducers perform aggregation, and require distributed coordination (a barrier wait) before a reducer can proceed.
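To make the distinction concrete, the following Bloom-style sketch contrasts the two categories. The collections and rules are our own illustration, not taken from the CALM literature:

  require 'bud'

  # Hypothetical collections illustrating CALM's monotonic/non-monotonic split.
  class CalmExample
    include Bud

    state do
      table :links, [:src, :dst]      # raw facts
      table :reachable, [:src, :dst]  # derived facts
      scratch :out_degree, [:src] => [:cnt]
    end

    bloom do
      # Monotonic: selection, projection and join only ever add facts,
      # so these rules are eventually consistent without coordination.
      reachable <= links
      reachable <= (links * reachable).pairs(:dst => :src) { |l, r| [l.src, r.dst] }

      # Non-monotonic: an aggregate can change as new links arrive, so its
      # result is only final once the input is known to be complete.
      out_degree <= links.group([:src], count(:dst))
    end
  end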
Figure 1: Declarative transformation rule in Bud: multicasting messages to all network nodes
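As an illustration of the kind of rule the figure describes, here is a minimal sketch modeled on the multicast example in the Bud documentation; the collection names are illustrative:

  require 'bud'

  # A minimal multicast: every tuple inserted into send_mcast is shipped
  # to every peer recorded in nodelist.
  class Multicast
    include Bud

    state do
      table :nodelist, [:addr]        # known peers ("ip:port" strings)
      scratch :send_mcast, [:val]     # payloads to multicast this timestep
      channel :mcast, [:@addr, :val]  # '@' marks the recipient-address column
    end

    bloom :multicast do
      # one declarative rule: pair each payload with each peer and send
      mcast <~ (send_mcast * nodelist).pairs { |m, n| [n.addr, m.val] }
    end
  end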
3.2
3.2.1
Bud is available at https://github.com/bloom-lang/bud, with documentation at https://github.com/bloom-lang/bud/tree/master/docs and example programs at https://github.com/bloom-lang/bud-sandbox.
3.2.2 Network cooperation
Even though Bud is designed for creating distributed systems, each Bud program running on a network node is completely independent of the other instances; instances communicate and exchange data through network messages. Bud provides an additional collection type, the channel, to support sending network messages. Records that are put into a channel on one network node are received on the same channel on the receiving node. The recipient address is specified as part of the message metadata: the first column of a message tuple designates the IP address of the recipient, and the second column designates the IP address of the sender.
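For illustration, a channel with explicit recipient and sender columns might be declared and used as follows. This is a sketch: in Bud the '@' prefix marks the address column, and ip_port yields the local node's own address:

  require 'bud'

  class PingExample
    include Bud

    state do
      # recipient address first (the '@' column), sender address second
      channel :ping, [:@dst, :src] => [:payload]
      scratch :outbox, [:dst, :payload]  # local requests to send
      table   :inbox,  [:src, :payload]  # everything received so far
    end

    bloom do
      # sending: fill in the recipient, and our own address via ip_port
      ping <~ outbox { |o| [o.dst, ip_port, o.payload] }
      # receiving: tuples arrive on the same channel at the destination node
      inbox <= ping { |p| [p.src, p.payload] }
    end
  end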
3.2.3
3.3
Bulk Synchronous Parallel (BSP) [19] is a parallel computation model for shared-nothing environments in which computation is modelled as a sequence of steps. In the BSP model, each worker performs computation on its local partition of the data at each step and communicates through message passing after each step. A synchronization barrier is enforced to guarantee that every worker is ready to start the next iteration.
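Schematically, a BSP execution can be sketched as the loop below. This is our own illustration in plain Ruby; partitions are assumed to expose an id and a step method that consumes this round's inbox and returns outgoing messages:

  # Schematic BSP driver: compute on local partitions, route messages,
  # and let the end of the round act as the synchronization barrier.
  Message = Struct.new(:dest, :body)

  def bsp_run(partitions, num_steps)
    inboxes = Hash.new { |h, k| h[k] = [] }
    num_steps.times do
      # step 1: every worker computes on its local partition
      outgoing = partitions.flat_map { |p| p.step(inboxes[p.id]) }
      # step 2: message passing between partitions
      inboxes = Hash.new { |h, k| h[k] = [] }
      outgoing.each { |m| inboxes[m.dest] << m }
      # step 3: the loop boundary is the barrier; no partition starts
      # round t+1 before all messages from round t are routed
    end
  end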
MapReduce [10], the most prominent example of a large-scale data processing framework based on BSP, has been used for batch processing of web-scale graphs. The MapReduce model relies on a distributed file system for node-to-node communication and was initially designed to support single-stage applications rather than iterative ones like graph processing. This stateless nature of the MapReduce model is ill-suited to iterative graph applications, which led to the development of dedicated graph processing frameworks.
Pregel by Google [16] is the first large-scale BSP data processing model with a graph-specific API. Pregel employs the "Think Like a Vertex" programming model through a vertex-centric API, where a user-defined compute function is iteratively executed over the graph vertices and communication is performed by sending messages through the edges of the graph. Giraph (http://giraph.apache.org/) is an open-source implementation of Google's Pregel. GPS [18] employs Pregel's vertex-centric model and utilizes vertex migration for dynamic load balancing. Additionally, GPS introduces the Large Adjacency List Partitioning (LALP) optimization to address limitations of existing systems on scale-free graphs. GraphLab [15] employs an asynchronous communication model, which can expedite convergence for various machine learning tasks. PowerGraph [12] and PowerLyra [9] factor vertex computation over edges to efficiently process real-world graphs with power-law degree distributions.
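The canonical example from the Pregel paper [16] is PageRank. Below is a Ruby transliteration of the paper's pseudocode; the vertex and ctx objects and their methods are hypothetical stand-ins for a vertex-centric API:

  # PageRank as a Pregel-style compute function (sketch after [16]).
  def compute(vertex, messages, ctx)
    if ctx.superstep >= 1
      # fold incoming rank mass into this vertex's value
      vertex.value = 0.15 / ctx.num_vertices + 0.85 * messages.sum(0.0)
    end
    if ctx.superstep < 30
      # distribute our rank evenly along outgoing edges
      share = vertex.value / vertex.out_degree
      vertex.out_neighbours.each { |n| ctx.send_message(n, share) }
    else
      vertex.vote_to_halt  # computation ends once every vertex has halted
    end
  end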
3.4
4.
4.1
The main objective of this project was to explore the applicability of declarative programming languages to the development of large-scale distributed data processing systems. For this purpose, we designed and implemented a prototype graph processing system based on the Pregel model: GraphBud. Our prototype employs a master-worker architecture, as in the original Pregel implementation. A brief overview of distributed graph processing systems and Pregel is presented in Section 3.3.
First, we implemented a simple membership protocol to make network nodes aware of their peers. Each network node maintains a collection of worker nodes and a communication channel for master-worker communication. Each worker process implements two simple rules. First, a worker sends a join message to the master over the network upon initialization. The master node adds the newly joined worker to the worker list and broadcasts the updated worker list collection to every worker. Second, a worker node populates its own worker list collection upon receiving the broadcast from the master. This protocol ensures that complete membership information is synchronized across the entire GraphBud cluster. The implementation of these two simple rules is depicted in Figure 4.
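As a sketch, the two rules can be written along the following lines in Bloom. The collection and channel names are ours, not necessarily GraphBud's, and @master_addr is assumed to be configured by the enclosing program:

  require 'bud'

  # A sketch of the membership rules described above.
  module Membership
    state do
      table :worker_list, [:addr]              # maintained on every node
      scratch :dests, [:addr]                  # per-tick copy for broadcasting
      channel :join_chan, [:@master, :worker]  # worker -> master
      channel :list_chan, [:@worker, :member]  # master -> workers
    end

    bootstrap do
      # Rule 1 (worker side): send a join message to the master on startup.
      join_chan <~ [[@master_addr, ip_port]] unless ip_port == @master_addr
    end

    bloom :master_side do
      # Master records the joining worker...
      worker_list <= join_chan { |j| [j.worker] }
      # ...and broadcasts the full, updated list to every known worker.
      # (For brevity this rebroadcasts every tick; real rules fire on change.)
      dests <= worker_list
      list_chan <~ (dests * worker_list).pairs { |d, m| [d.addr, m.addr] }
    end

    bloom :worker_side do
      # Rule 2 (worker side): adopt the membership list sent by the master.
      worker_list <= list_chan { |l| [l.member] }
    end
  end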
GraphBud relies on a simple range partitioning scheme to create a disjoint vertex partitioning of the original graph. Each worker node reads the input graph in adjacency list format and can create its local subgraph in a distributed manner, without communication.
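Under range partitioning, the owner of a vertex is a pure function of its identifier, so every worker can compute placement locally. For example (our illustration, not GraphBud's exact scheme):

  # Owner of a vertex under range partitioning: ids 0..R-1 go to worker 0,
  # R..2R-1 to worker 1, and so on (R = range of ids per worker).
  def owner_of(vertex_id, num_vertices, num_workers)
    range = (num_vertices + num_workers - 1) / num_workers  # ceiling division
    vertex_id / range
  end

  owner_of(7, 12, 3)  # => 1 (vertices 4..7 live on worker 1)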
GraphBud relies on one-to-many broadcast between the master and worker nodes to coordinate worker activities. Any request generated by the master is broadcast to all nodes that are alive at that moment. The following subsections describe the details of the master and worker process implementations, in addition to two optimizations implemented in GraphBud to reduce communication overhead.
4.2 Master Implementation
4.3 Worker Implementation
The queue_out collection maintains two flags for each message, tracking whether the message has already been sent and whether it was successfully delivered.
The rule on lines 15-20 of Figure 6 is responsible for the delivery of vertex messages for the next superstep. The outgoing message buffer (queue_out) and the collection of workers (workers_list) are joined on the worker identifier, so that the IP address of the worker node holding the target vertex can be identified. A many-to-many communication channel is used to deliver each message to the corresponding worker. The current prototype of the Bud framework relies on the User Datagram Protocol (UDP) for network communication, so each vertex message is sent as a separate network datagram. Section 4.4 and Section 4.5 further describe vertex message delivery and present the optimizations in GraphBud that reduce communication overhead.
One limitation of network communication via UDP is the lack of delivery guarantees. GraphBud therefore implements a reliable delivery protocol for vertex messaging, ensuring the delivery of every vertex message before workers report successful completion of a superstep to the master. Each vertex message is assigned a unique identifier, and the recipient of a vertex message sends an acknowledgement to the original sender, so that the sender can mark the message as delivered. Once all messages in the outgoing buffer are successfully delivered, the worker reports completion of the superstep to the master node. The verification that all vertex messages were delivered is implemented as an aggregation over the delivered flags of the queue_out collection, shown on lines 29-31 of Figure 6.
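A sketch of the acknowledgement half of this protocol follows. The collection names are ours; we also assume the Bud version in use supports the <+- merge operator (a deferred update of a keyed row) and an empty grouping key for a global aggregate:

  # Sketch of the acknowledgement protocol for reliable vertex delivery.
  module ReliableDelivery
    state do
      table :queue_out, [:msg_id] => [:dst_worker, :vid, :payload, :sent, :delivered]
      channel :vertex_pipe, [:@dst, :src, :msg_id, :vid, :payload]
      channel :ack_pipe, [:@dst, :msg_id]
      scratch :superstep_done, [:done]
    end

    bloom :acks do
      # recipient side: acknowledge every vertex message back to its sender
      ack_pipe <~ vertex_pipe { |v| [v.src, v.msg_id] }

      # sender side: flip the delivered flag of the acknowledged message
      queue_out <+- (ack_pipe * queue_out).pairs(:msg_id => :msg_id) do |a, q|
        [q.msg_id, q.dst_worker, q.vid, q.payload, q.sent, true]
      end

      # completion test: one aggregate over the delivered flags (cf. Figure 6)
      superstep_done <= queue_out.group([], bool_and(:delivered))
    end
  end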
4.4 Message Packing
4.5 Large Adjacency List Partitioning
Large Adjacency List Partitioning (LALP) is an optimization initially proposed in GPS [18] to reduce network communication for algorithms, such as PageRank, in which outgoing edges are used only for sending messages and the same message is sent to all adjacent vertices. LALP is essentially a receiver-side scatter: a high-degree vertex sends a single message to a worker machine instead of separate messages for each adjacent vertex. LALP partitions the adjacency lists of high-degree vertices across multiple workers. On each worker other than the original worker holding the vertex, a ghost vertex is created to store the partial adjacency list.
Consider a 4-vertex graph partitioned across two worker nodes, as shown in Figure 7. In the standard implementation, vertex 2 needs to send two separate messages to the remote worker: one for vertex 1 and one for vertex 3. With LALP enabled, a ghost vertex is created on worker 1, as this worker stores local neighbours of vertex 2. Vertex 2 then sends a single network message to the remote node at each superstep, and the ghost copy of vertex 2 replicates this message to the local neighbours on the remote worker. Even in this simple example, the LALP optimization reduces the total number of messages from 5 to 4.
In GraphBud, the LALP optimization significantly reduces the number of UDP messages delivered over the network. Vertices whose out-degree is larger than a user-defined threshold are identified during the graph loading stage, and ghost vertices are created for them. During superstep execution, the compute() function is only invoked on the master copy of a high-degree vertex. Instead of generating a separate message for each neighbour, a single message is generated for each worker and inserted into the outgoing message buffer. On the receiver side, any message whose target is a ghost vertex is replicated for each adjacent vertex in the incoming message buffer.
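On the receiving side, the replication step can be expressed as a single join. The following is a sketch with illustrative names, where ghost_edges holds the partial adjacency lists of ghost vertices:

  # Receiver-side scatter for LALP (sketch).
  module LalpScatter
    state do
      table :ghost_edges, [:ghost_id, :neighbour]  # partial adjacency lists
      scratch :incoming, [:vid, :payload]          # messages received this step
      scratch :queue_in, [:vid, :payload]          # expanded per-vertex inbox
    end

    bloom :scatter do
      # messages addressed to ordinary vertices pass through unchanged
      queue_in <= incoming.notin(ghost_edges, :vid => :ghost_id)
      # a message addressed to a ghost vertex fans out to each local neighbour
      queue_in <= (incoming * ghost_edges).pairs(:vid => :ghost_id) do |m, g|
        [g.neighbour, m.payload]
      end
    end
  end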
In order to evaluate the impact of LALP on GraphBud
5. DECLARATIVE EXPERIENCE
In this section we lay out our considerations on the benefits of programming distributed systems in a declarative style, as opposed to using conventional imperative programming languages, as well as the issues we've encountered along the way.
5.1
Getting used to programming distributed systems with declarative rules takes some time. As mentioned in subsection 3.2.3, disorderly declarative logical rules work well for expressing database queries, as databases can be considered static storage containers: each declarative rule transforms the data stored in a collection in some way and
5.2
5.3
Implementing event loops is quite characteristic of distributed software systems. When developing distributed systems, programmers often need to reason about the spectrum of possible input requests that a network node can receive, and to define how a network node should handle an event based on the state it is currently in. Alvaro calls this putting a quadratic reasoning task on the programmer [1]. In contrast, Bud's declarative timestepping model helps alleviate the problem: timestep semantics make it possible to clearly define the ordering of consecutive system states, and to isolate the data manipulations performed in one system state from those of the following state.
As an example, consider our declarative implementation of the Pregel worker shown in Figure 6. We use Bud timesteps to separate a number of consecutive system states from one another. First, the rule on line 8 is invoked at superstep start, processing all vertices and creating a queue of outgoing vertex messages in the queue_out collection, deferring the population of queue_out until the next timestep (using <+ instead of <= as the merge operator). The rule on line 15 sends out all of the outgoing vertex messages from queue_out to the network nodes that store the corresponding vertices. Lastly, the rule on line 24 removes all messages from the queue_out collection once it is confirmed that all messages were sent. If the first and second rules were not separated by timesteps, it would be unclear what messages should be sent out by the second rule, as queue_out is initially empty and is only populated as a result of executing the first rule. The third rule is likewise properly separated from the first two, ensuring that the cleanup of the queue_out collection only happens after all messages were sent. Thus, all three worker node states (processing, sending messages, and cleaning up the queue) are properly separated and cannot overlap, as they are defined in consecutive timesteps. Bud ensures that any operations defined to execute in a single timestep are executed as such. The first rule in our example, vertex processing, is a time-consuming operation, but Bud ensures that it completes before progressing to the next timestep, even if a timer collection is set up to trigger timesteps every millisecond (effectively delaying the timer).
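Compressed to its skeleton, the three-state structure looks like this. This is a sketch: owner_of is a hypothetical helper, and the collection names are ours, mirroring the Figure 6 description:

  # The three worker states separated by Bud's timestep operators.
  module SuperstepStates
    state do
      table :vertices, [:vid] => [:value]
      table :queue_out, [:dst_worker, :vid, :payload]
      table :workers_list, [:dst_worker] => [:addr]
      scratch :start_superstep, [:num]  # fired when the master says "go"
      scratch :all_sent, [:flag]        # derived once every send is confirmed
      channel :vertex_pipe, [:@addr, :vid, :payload]
    end

    bloom :three_states do
      # State 1 (compute): <+ defers the insert, so the send rule below
      # never sees a half-built queue in the same timestep.
      queue_out <+ (start_superstep * vertices).pairs do |s, v|
        [owner_of(v.vid), v.vid, v.value]  # owner_of: hypothetical helper
      end

      # State 2 (send): ships whatever queue_out holds in THIS timestep.
      vertex_pipe <~ (queue_out * workers_list).pairs(:dst_worker => :dst_worker) do |m, w|
        [w.addr, m.vid, m.payload]
      end

      # State 3 (clean up): <- defers the delete, so it cannot race the send.
      queue_out <- (queue_out * all_sent).lefts
    end
  end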
In contrast, in imperative languages we would need to em-
5.4
5.5 Issues encountered
Having outlined our considerations on the beneficial sides of declarative programming, we now point out some issues that we faced along the way. As the flip side of the overall compactness achieved by expressing distributed systems in the declarative programming paradigm, the set-based nature of declarative rules is sometimes too high-level to express the fine-grained details of software functionality. Here we provide a number of examples where we found it hard to express the needed programming primitive using the available declarative constructs.
As a first example, the rule on line 8 of Figure 6 is used to generate outgoing vertex messages for each vertex in the vertices collection. Considering that each network node might store a large number of vertices, it might be more efficient to send vertex messages as soon as the corresponding vertex has generated them, rather than when all vertices have finished generating messages. This is trivial to do in an imperative language, but it is not as easy with declarative rules, as they are suited to expressing transformations on whole collections.
As another example, in one rule we did not manage to get rid of a for-loop, and had to code it explicitly in an imperative block of code attached to a declarative rule. Such a block is provided by Bud to customize how each single record of an input collection should be transformed before being pushed to an output collection, so in a way this usage of the block is not correct. The rule requires a JOIN of three collections: vertices, the incoming vertex messages in the queue_in_next collection, and a channel collection that carries the request from the master to start superstep processing. The program logic requires all vertices to be processed, so an OUTER JOIN of vertices with the other two collections would be appropriate, but Bud's built-in OUTER JOIN function outer() can only join two collections. This pushed us to define the rule as an OUTER JOIN of vertices with the channel collection, and to loop over the messages in the queue_in_next collection manually within an imperative block, matching messages to the currently processed vertex.
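In outline, the workaround has the following shape. This is a sketch: outer() is Bud's left outer join, the collection names mirror the Figure 6 description, and compute is the user-defined vertex function, assumed to be defined elsewhere:

  # Shape of the outer-join workaround described above (sketch).
  module ProcessVertices
    state do
      table :vertices, [:vid] => [:worker_id, :value]
      table :queue_in_next, [:vid, :payload]  # incoming vertex messages
      table :queue_out, [:dst_worker, :vid, :payload]
      channel :start_superstep, [:@addr, :worker_id, :num]
    end

    bloom :process_vertices do
      # OUTER JOIN vertices with the superstep-start channel, so every
      # vertex is processed even if it received no messages this superstep.
      queue_out <+ (vertices * start_superstep).outer(:worker_id => :worker_id) do |v, s|
        # Imperative escape hatch: scan queue_in_next for this vertex's
        # messages, since a 3-way outer join cannot be written with outer().
        msgs = queue_in_next.select { |m| m.vid == v.vid }.map { |m| m.payload }
        compute(v, msgs)  # returns the outgoing-message tuple for v
      end
    end
  end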
As a last example, consider the following simple scenario: after sending a network message, we want to update a boolean flag marking the message as sent. This pattern can be seen in the rule on lines 15-20 of Figure 6. A declarative rule can only modify records in the collection specified as its output, not in collections used on the right-hand side of the rule. In this case, we want to mark the message as sent right after it has been put into the vertex_pipe channel, so that it will not be sent twice. We only managed to do this with an explicit assignment of the sent flag in the imperative code block on line 17. A more correct declarative way would be to define a second rule, modifying all messages in the queue_out collection and setting sent to true for all of the messages at once, but this raises the question of whether the second rule should be executed in the current timestep or deferred until the next one. Both variants introduce a time discrepancy between the action and marking the action as completed, which could potentially lead to unwanted duplication of messages.
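Concretely, the workaround we used has this shape. This is a sketch, reusing collections like those in the earlier sketches (with a sent flag added to queue_out); mutating the joined tuple in place inside the block is the imperative assignment mentioned above:

  bloom :send_and_mark do
    vertex_pipe <~ (queue_out * workers_list).pairs(:dst_worker => :dst_worker) do |m, w|
      m.sent = true  # imperative escape hatch: flag the row in place so the
                     # same message is not shipped again in a later timestep
      [w.addr, m.vid, m.payload]
    end
  end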
All of the aforementioned issues we've documented should be taken with a grain of salt. This implementation of Pregel is the only system we've developed using the declarative paradigm, and we might have missed some opportunities to properly express the needed functionality in a clean declarative way.
6. FUTURE WORK
https://github.com/bloom-lang/bud/issues/100
7. CONCLUSIONS
Even though at first it might be quite hard to see the connection between declarative languages, which were initially developed as advanced database query languages, and present-day high-throughput distributed systems, Bloom/Bud bridges the gap between the two. Our exploration leads us to believe that declarative languages can be successfully used to implement distributed systems, and that the resulting systems tend to be expressed in a concise manner. The Bud framework allows the reasoning about declarative rules to shift from an unordered set of data transformation rules that are logically correct at any moment in time to a set of event callbacks that perform data transformations upon receiving network requests.
We've presented the advantages that using declarative languages brings to the table, as well as documented the issues that we encountered through our implementation of a distributed system using the declarative paradigm. We've come to understand that the declarative paradigm doesn't change the fundamental aspects of developing software: after all, any software system can be viewed as a data processing system. The declarative paradigm simply puts this into the spotlight by pushing developers to design software systems by reasoning about the data collections the software works upon, and by declaring rules about how these collections need to be transformed in response to external requests. We've also identified some weaker points of the current Bud framework implementation, such as the use of UDP datagrams for network communication.
We hope that the analysis presented in this report provides an incentive for other teams to try out declarative languages when implementing distributed systems.
8. REFERENCES
[9] R. Chen, J. Shi, Y. Chen, and H. Chen. PowerLyra: differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems, page 1. ACM, 2015.
[10] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[11] B. Lindsay, P. Selinger, C. Galtieri, J. Gray, R. Lorie, T. Price, F. Putzolu, and B. W. Wade. Notes on distributed databases. IBM Research Report, 1979.
[12] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12), pages 17-30, 2012.
[13] J. Leskovec et al. Stanford network analysis project.
http://snap.stanford.edu, 2010.
[14] B. T. Loo, T. Condie, J. M. Hellerstein, P. Maniatis, T. Roscoe, and I. Stoica. Implementing declarative overlays. In ACM SIGOPS Operating Systems Review, volume 39, pages 75-90. ACM, 2005.
[15] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716-727, 2012.
[16] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135-146. ACM, 2010.
[17] R. Ramakrishnan and J. D. Ullman. A survey of deductive database systems. The Journal of Logic Programming, 23(2):125-149, 1995.
[18] S. Salihoglu and J. Widom. GPS: a graph processing system. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management, page 22. ACM, 2013.
[19] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.