
Future Generation Computer Systems 8 (1992) 121-135

North-Holland

A comparative study of five parallel programming languages
Henri E. Bal
Dept. of Mathematics and Computer Science, Vrije Universiteit, De Boelelaan 1081, 1081 HV Amsterdam, Netherlands

Abstract
Bal, H.E., A comparative study of five parallel programming languages, Future Generation Computer Systems 8 (1992)
121-135.
Many different paradigms for parallel programming exist, nearly every one of which is employed in dozens of languages. Several
researchers have tried to compare these languages and paradigms by examining the expressivity and flexibility of their
constructs. Few attempts have been made, however, at practical studies based on actual programming experience with
multiple languages. Such a study is the topic of this paper.
We will look at five parallel languages, all based on different paradigms. The languages are: SR (based on message
passing), Emerald (concurrent objects), Parlog (parallel Horn clause logic), Linda (Tuple Space), and Orca (logically shared
data). We have implemented the same parallel programs in each language, using real parallel machines. The paper reports
on our experiences in implementing three frequently occurring communication patterns: message passing through a mailbox,
one-to-many communication, and access to replicated shared data.

Keywords. Parallel programming; SR; Emerald; Parlog; Linda; Orca.

1. Introduction

During the previous decade, a staggering number of languages for programming parallel and distributed systems has emerged [2,7]. These languages are based on widely different programming paradigms, such as message passing, concurrent objects, logic, and functional programming. Both within each paradigm and between paradigms, heated discussions are held about which approach is best [20,28,31].

The intent of this paper is to cast new light on these discussions, using a practical approach. We have implemented a number of parallel applications in each of several parallel languages. Based on this experience, we will draw some conclusions about the relative advantages and disadvantages of each language. So, unlike most of the discussions in the literature, this paper is based on actual programming experience in several parallel languages on real parallel systems.

The languages studied in this paper obviously do not cover the whole spectrum of design choices. Still, they represent a significant subset of what we feel are the most important paradigms for parallel programming. We discuss only a single language for each paradigm, although other languages may exist within each paradigm that are significantly different.

The languages that have been selected for this study are: SR, Emerald, Parlog, Linda, and Orca (see Table 1).

Correspondence to: Henri E. Bal, Dept. of Mathematics and Computer Science, Vrije Universiteit, De Boelelaan 1081, 1081 HV Amsterdam, Netherlands.
This research was supported in part by the Netherlands Organization for Scientific Research (N.W.O.).
This paper was first published in the Proceedings of the EurOpen Spring 1991 Conference on Open Distributed Systems in Perspective, Tromso, 20-24 May 1991.
A preliminary version of the paper appeared in the Proceedings of the PRISMA Workshop on Parallel Database Systems, Noordwijk, The Netherlands, September 1990.
SR represents message passing languages. It provides a range of message sending and receiving constructs, rather than a single model. Emerald is an object-based language. Parlog is a concurrent logic language. Linda is a set of language primitives based on the Tuple Space model. Orca is representative of the Distributed Shared Memory model.

We focus on languages for parallel applications, where the aim is to achieve a speedup on a single application. These applications can be run on either multiprocessors with shared memory or distributed systems without shared memory. We have selected only languages that are suitable for both architectures. So, we do not discuss shared-variable or monitor-based languages, since their usage is restricted to shared-memory multiprocessors. Functional languages are not discussed either. Most functional languages are intended for different parallel architectures (e.g. dataflow or graph reduction machines), and often try to hide parallelism from the programmer. This makes an objective comparison with the other languages hard. We also do not deal with distributed languages based on atomic transactions (e.g. Argus [30]), since these are primarily intended for fault-tolerant applications. The issue of fault-tolerant parallel programming is discussed in a separate paper [12].

The outline of the rest of this paper is as follows. In Section 2, we will briefly describe the applications we have used, focusing on their communication patterns. Next, in Sections 3 to 7, we will discuss each of the five languages, one language per section. Each section has the following structure:
• Background information on the language. (All languages have been described in a recent survey paper [7], so we will be very brief here.)
• A description of our programming experience. We will comment on the ease of learning the language and on the effort needed to implement the communication patterns discussed in Section 2.
• Comments on the language implementation and its performance. Unfortunately, there is no single platform on which all the languages run, so we had to use many different platforms. The systems we used differ in the number of processors, processor type and speed, as well as in the way processors are interconnected. A fair comparison between the languages is therefore not possible, but the measurements do give some rough indication of the relative speedups that can be obtained.
• Conclusions on the language.
Finally, in Section 8, we will compare the approaches used for the different languages.

2. The applications and their communication patterns

There are many ways to compare parallel languages. One way is a theoretical study of the expressiveness of their primitives. This works well for languages using the same paradigm (e.g. message passing), but is more problematic for comparison between different paradigms. Comparing, say, remote procedure calls and shared logical variables is not a trivial task.

The approach taken in this paper is to implement a set of small, yet realistic, problems in each language, and compare the resulting programs. The example problems we have used include matrix multiplication, the All Pairs Shortest Paths problem, the Traveling Salesman Problem, alpha-beta search, and successive overrelaxation. The applications and the algorithms used for them are described in detail in [13].
Table 1
Overview of the languages discussed in the paper
Language Paradigm Origin
SR Message passing University of Arizona
Emerald Concurrent object-based language University of Washington
Parlog Concurrent logic language Imperial College
Linda Tuple space Yale University
Orca Distributed shared memory Vrije Universiteit

[Fig. 1. Communication patterns used by the two applications discussed in the paper: (a) message passing through a mailbox (used by TSP); (b) one-to-many communication (used by ASP); (c) communication through replicated shared data (used by TSP).]

For this paper, we will restrict ourselves to only two applications: the All Pairs Shortest Paths problem and the Traveling Salesman Problem. These applications will be described below. We will focus on the communication aspects of the applications, since, from a parallel programming point of view, these are most interesting.

2.1. The Traveling Salesman Problem (TSP)

The Traveling Salesman Problem computes the shortest route for a salesman among a given set of cities. The program uses a simple branch-and-bound algorithm and is based on replicated workers style parallelism [5,16]. The TSP program uses two interesting communication patterns: mailboxes and replicated shared data.

A mailbox (see Fig. 1(a)) is a communication port with send (nonblocking) and receive operations [7]. Mailboxes can be contrasted with direct message passing, in which the sender always specifies the destination process (receiver) of the message. With mailboxes, any process that can access the mailbox can receive a message sent to it. So, each message sent to a mailbox is handled by one process, but it is not determined in advance which process will accept the message.

The TSP program uses a mailbox for distributing work. A process that has computed a new job (to be executed in parallel) sends it to a mailbox, where it will eventually be picked up by an idle worker process. Since it is not known in advance which worker process will accept the job, mailbox communication is required here, rather than direct message passing.

The second communication pattern used in the TSP program is replicated shared data (see Fig. 1(c)). The branch-and-bound algorithm requires a global variable containing the length of the current best solution. This variable is used for pruning partial solutions whose initial paths are already longer than the current best full route.

In a distributed system, this global variable cannot be put in shared memory, since such systems lack shared memory. One solution is to store the variable on one processor and let other processors access it through remote operations. For TSP (and many other applications), however, a much more efficient solution is possible. The bound is usually changed (improved) only a few times, but may be used millions of times by each processor, so its read/write ratio is very high. Therefore, the variable can be implemented efficiently by replicating it in the local memories of the processors. Each processor can directly read the variable. Physical communication only occurs when the variable is written, which happens infrequently.
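To make this concrete, the following small sketch (ours, in Python; it is not code from the TSP program, and all names are invented) shows a bound whose reads are purely local and whose infrequent writes touch every copy:

    # Sketch of the replicated bound of Fig. 1(c): one copy per processor.
    class ReplicatedBound:
        def __init__(self, nprocs):
            self.copies = [float("inf")] * nprocs

        def read(self, proc):
            return self.copies[proc]            # local read: no communication

        def update(self, value):
            for p in range(len(self.copies)):   # a write updates every copy
                self.copies[p] = min(self.copies[p], value)

    bound = ReplicatedBound(4)
    bound.update(42)                 # a worker found a route of length 42
    assert bound.read(0) == 42       # every processor now reads 42 locally

The min in update reflects the branch-and-bound usage: a new bound is only ever an improvement, so concurrent writers cannot make the bound worse.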

2.2. The All Pairs Shortest Paths problem (ASP)

The second application is the All Pairs Shortest Paths problem, which computes the lengths of the shortest paths between each pair of nodes in a given graph. ASP uses a parallel iterative algorithm. Each processor is assigned a fixed portion of the rows of the distances matrix. At the beginning of each iteration, one process sends a pivot row of the matrix to all the other processes. Each process then uses this pivot row to update its portion of the matrix.
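The update step itself is simple; the following fragment (our sketch in Python, not one of the programs discussed below) shows what each process computes once it has received the pivot row for iteration k:

    # One ASP iteration: the owner of row k makes it available to all
    # processes; each process then updates only its own rows.
    def asp_iteration(dist, k, my_rows, pivot):
        for i in my_rows:
            for j in range(len(dist)):
                dist[i][j] = min(dist[i][j], dist[i][k] + pivot[j])

    INF = 10**9
    dist = [[0, 2, INF], [INF, 0, 2], [1, INF, 0]]
    for k in range(3):                         # all iterations, one process
        asp_iteration(dist, k, range(3), dist[k])
    print(dist)                                # all-pairs shortest path lengths

In the parallel program the interesting part is not this arithmetic but how the pivot row reaches all processes, which is the pattern discussed next.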
The most important communication pattern of the ASP program thus is one-to-many communication (see Fig. 1(b)). This pattern transmits data from one process to many others, all of which use these data. (In contrast, a message sent to a mailbox is used by only one process.)

Of course, this pattern can be simulated through multiple point-to-point messages, but frequently much better solutions are possible. Many networks have a multicast or broadcast capability, which can be used to speed up one-to-many communication significantly. So, there are two issues involved here: how one-to-many communication is expressed in a given language and how it is actually implemented. For ASP, it is very important that the implementation uses a real multicast. Otherwise, the communication costs may easily become a dominating factor.

3. Synchronizing Resources (SR)

SR [3,4] is a language for writing distributed programs, developed by Greg Andrews, Ron Olsson, and their colleagues at the University of Arizona and the University of California at Davis. The language supports a wide variety of (reliable) message passing constructs, including shared variables (for processes on the same node), asynchronous message passing, rendezvous, remote procedure call, and multicast.

3.1. Programming experience

Given its ambitious goal of supporting many communication models, it is not surprising that SR is a fairly large language. Yet, we found it reasonably easy to learn. With regard to the sequential parts, the syntax, type system, and module constructs are different from most other languages. Nevertheless, these were fairly easy to learn, although the type system is far from perfect [9].

SR tries to reduce the number of concepts for distributed and parallel programming by using an orthogonal design. There are two ways for sending messages (blocking and nonblocking) and two ways for accepting messages (explicit and implicit). These can be combined in all four ways, yielding four different communication mechanisms. We agree with the designers that this orthogonality principle simplifies SR's design. Unfortunately, there also are some less elegant design features. The concurrent-send (co) command, for example, is a rather ad-hoc extension of the basic model, with specialized syntax rules.

Our programming experience indicates that, even within the restricted domain of parallel programming, nearly all facilities provided by SR are useful. We found uses for synchronous and asynchronous message invocation, explicit, implicit, conditional, and ordered message receipt, and multicast [9]. Below we will report on our experiences in implementing the three communication patterns of Fig. 1 in SR.¹

¹ The language used for this paper is referred to as SR Version 1.1. The SR designers are currently working on Version 2, in which many of the problems described here will be solved.

3.1.1. Mailbox communication

Despite its large number of features, SR does not directly support message passing through a mailbox.

The receiver of a message is fully determined when the message is sent. With a mailbox, the destination process is not determined until the message is accepted (serviced).

In contrast with the sender of a message, the receiver need not specify the other party, so in this sense message passing in SR is asymmetric. This observation also implies a solution to the mailbox problem. We can simply add an intermediate buffer process between the sender and receivers, as shown in Fig. 2. The sender sends its message to this buffer process, so the (initial) destination is fixed. The receivers ask the buffer process for a message, whenever they need one. The buffer process accepts SendMsg and ReceiveMsg requests one at a time; if the buffer is empty, only SendMsg will be accepted. With this scheme, the destination of each message is fixed: it is sent to the buffer process. In this way, the asymmetry of message passing is worked around, at the overhead of implementing an extra process.

[Fig. 2. Simulating message passing through a mailbox in SR.]

In applications where only one process is sending messages, a simpler solution can be used. When the sender wants to send a message, it blocks until a receiver asks for a message. In this case, the receiver can directly fetch a message from the sender, thus eliminating the need for a buffer process.
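In Python, the essence of the buffer-process scheme can be sketched as follows (ours, not the SR code; a thread-safe queue stands in for the active buffer process of Fig. 2):

    import queue, threading

    buffer = queue.Queue()            # plays the role of the buffer process

    def worker(name):
        job = buffer.get()            # ReceiveMsg: blocks while buffer is empty
        print(name, "got", job)

    for name in ("w1", "w2"):
        threading.Thread(target=worker, args=(name,)).start()
    buffer.put("job 1")               # SendMsg: nonblocking, destination fixed
    buffer.put("job 2")               # either worker may pick up either job

The senders always name the same destination (the buffer), yet the actual receiver of each message is whichever worker asks first, which is exactly the mailbox behaviour of Fig. 1(a).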
3.1.2. One-to-many communication

The second communication pattern, one-to-many communication, is supported in SR through a special language construct:
    co (i := 1 to P)
        send receiver[i].SendMsg(msg)
    oc

The co statement sends a message concurrently to several processes, as specified in the array receiver. This approach to multicasting has an important disadvantage, however. If two SR processes concurrently multicast two messages, these messages need not arrive in the same order everywhere. In other words, multicast in SR is not indivisible. Applications for which the ordering matters must enforce it themselves.

3.1.3. Shared data

The third communication pattern, replicated shared data, is not supported by SR. Only processes running on the same node can share variables; other processes must communicate through message passing.

Of course, shared data can be simulated by storing them on one server process and having other processes send messages to this server. As stated before, however, we assume that the read/write ratio of the shared data is very high, in which case it is far more efficient to replicate the shared data in the local memories. Each processor keeps its own local copy, which is used for reading. Whenever the variable is written, all these copies are updated.

Concurrent updates of the shared data will have to be synchronized. For example, if two processes P and Q simultaneously write the shared data, all copies should be updated in a consistent way. It should never be the case that part of the copies are set to P's value while another part is set to Q's value. With message passing this requirement is difficult to realize, because messages are not globally ordered. In other words, the update messages sent by P and Q may arrive in different orders at different receivers.

The solution we have taken is to send update messages through a central manager process. This process orders the update messages and forwards them in a consistent order to all other processes. These update messages are accepted implicitly by each receiver, which means that the run time system will automatically create a new process for servicing such a message. This is important, since it is not known in advance when the update messages may arrive.
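The ordering argument can be made concrete with a small sketch (ours, in Python; the names are invented): all updates funnel through one manager thread, which forwards them in a single order:

    import queue, threading

    def manager(inbox, receiver_inboxes):
        while True:
            update = inbox.get()      # a single global order arises here
            if update is None:        # sentinel used to stop the sketch
                break
            for r in receiver_inboxes:
                r.put(update)         # forwarded in that order to everyone

    inbox = queue.Queue()
    receivers = [queue.Queue() for _ in range(3)]
    t = threading.Thread(target=manager, args=(inbox, receivers))
    t.start()
    inbox.put(("bound", 42))          # updates from P and Q both pass through
    inbox.put(("bound", 40))          # the manager, so no receiver can see
    inbox.put(None)                   # them in a different order
    t.join()
    print([[r.get(), r.get()] for r in receivers])

Whatever interleaving the writers produce, every receiver dequeues the updates in the order the manager chose.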
3.2. Implementation and performance

SR has been implemented on a range of multiprocessors (Encore, Sequent Balance, Sequent Symmetry) and distributed systems (homogeneous networks of VAXes, Sun-3s, Sun-4s, and others). The compiler and run time system are available from the University of Arizona.

We have done some initial performance measurements on a Sequent Symmetry with 6 CPUs. Although this machine has a shared memory, SR uses it only for implementing message passing, so the machine is not really used as a multiprocessor.

For the All Pairs Shortest Paths problem, we have measured a speedup of 4.08 (on 6 CPUs).

The reason why this speedup is less than linear is the fact that the co statement currently is not implemented as a true (physical) multicast. The message is copied once for every receiver. The communication overhead of the ASP program therefore is high, which prevents a linear speedup.

For the Traveling Salesman Problem, we measured a maximum speedup of 5.87. The latter program uses the simplified solution for mailbox communication (i.e. the manager generates only one job at a time and blocks until this job has been accepted).

3.3. Conclusion on SR

Since SR provides so many communication primitives, it is a flexible language. SR is also more expressive than most other message passing languages. It can be argued, however, that message passing is a low level of abstraction. As we will see, for several applications other mechanisms than message passing are simpler to use. These higher-level mechanisms are frequently more expressive yet less flexible. In conclusion, SR is reasonably suited for virtually all applications. It is seldom spectacularly good or bad for any application.

4. Emerald

Emerald [15,26] is an object-based language, designed at the University of Washington by Andrew Black, Norman Hutchinson, Eric Jul, and Henry Levy. An object in Emerald encapsulates both static data and an active process. Objects communicate by invoking each other's operations. There can be multiple active invocations within one object, which synchronize through a monitor. The remote invocation mechanism is location transparent.

Central to Emerald's design is the concept of object mobility. An object may migrate from one processor to another, as initiated either by the programmer or the system. Emerald uses a novel parameter mode, call-by-move. This mode has similar semantics as call-by-reference, but additionally moves the object parameter to the node of the invoked object.

4.1. Programming experience

Emerald is reasonably easy to learn. Since it is an object-based language, it treats all entities as objects. Unlike object-oriented languages, it does not support inheritance. Notwithstanding its object-based nature, Emerald contains many constructs also found in procedural languages (e.g. nested scopes, functions, expressions, assignment and control statements). The type system is one of the more important contributions of the language. Although it is not easy to get used to, it is flexible and features static type checking and polymorphism.

Below, we will discuss how mailboxes, one-to-many communication, and shared data can be implemented in Emerald.

4.1.1. Mailbox communication

A message (or operation) in Emerald is always sent to a specific object, so mailbox-style communication is not provided. It is possible, however, to construct a mailbox object, which can be accessed by the senders and receivers. Such an object has the following user-defined polymorphic type:

    type Mailbox
        operation AddMsg[eType]
            % Add a message to the mailbox
        operation GetMsg -> [job: eType]
            % Fetch a message from the mailbox
    end Mailbox

The object type is implemented using a queue of messages. To synchronize access to the queue, it is encapsulated in a monitor. The AddMsg and GetMsg operations are thus executed in a mutually exclusive way. Also, the GetMsg operation will block on a condition variable if the queue is empty; this condition variable will be signalled by an invocation of AddMsg.

This implementation of mailboxes is roughly similar to the SR version, except that a passive object rather than an active process is used for storing the message queue. Also, the synchronization of the queue operations is entirely different. In the SR version, the buffer process synchronizes the operations by accepting them one at a time and by delaying requests for messages when the buffer is empty. The Emerald version uses a monitor and a condition variable for synchronizing the operations.
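This monitor discipline is easy to render in Python (our sketch, not Emerald code): a lock makes AddMsg and GetMsg mutually exclusive, and a condition variable provides the blocking and signalling just described:

    import threading
    from collections import deque

    class MonitorMailbox:
        def __init__(self):
            self.monitor = threading.Lock()              # mutual exclusion
            self.nonempty = threading.Condition(self.monitor)
            self.msgs = deque()

        def add_msg(self, msg):
            with self.monitor:
                self.msgs.append(msg)
                self.nonempty.notify()                   # signal a blocked GetMsg

        def get_msg(self):
            with self.monitor:
                while not self.msgs:                     # wait while queue empty
                    self.nonempty.wait()
                return self.msgs.popleft()

    box = MonitorMailbox()
    threading.Thread(target=box.add_msg, args=("hello",)).start()
    print(box.get_msg())              # blocks until the AddMsg above signals

Note that the mailbox here is a passive object: all activity comes from the invoking threads, which is the main contrast with the SR buffer process.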

4.1.2. One-to-many communication

Emerald does not support any form of one-to-many communication. To send data to multiple objects, a sequential for loop has to be used. A subtle problem arises here that does not occur in the other languages. Emerald provides a uniform parameter mechanism: all objects are passed by reference, no matter where the sender and receiver are located. With multicasting, however, each receiver should be given a copy of the data, not a remote reference to it. What is needed here is call-by-value semantics, which is not supported in Emerald.

Thus, the sender must copy the data explicitly and pass this copy as a call-by-move parameter. A distinct copy must be made for every receiver. So a multicast is simulated as follows in Emerald:

    for all receivers r do
        r.send[move copy[msg]]

Here, copy is a user-defined procedure that copies a message.

The Emerald implementation of one-to-many communication is fairly complex. In addition, the solution is far from efficient. Not only does it refrain from using physical multicast, but it also forces the sender to copy the message once for every receiver, which may become a sequential bottleneck.

4.1.3. Shared data

Although Emerald supports a shared namespace for objects, this is not sufficient for implementing shared data. If a shared variable were stored in a single object, nearly all accesses to the variable would require physical communication, including read-only operations. What is needed is a replicated object, which is not provided in Emerald.

The programmer therefore has to replicate data explicitly. A copy of the shared data is kept by each process needing the data. To update these copies, a similar scheme as for SR is used, based on implicitly received messages. The main difference with the SR solution is the usage of a monitor for synchronizing access to the local copy of the shared variable.

4.2. Implementation and performance

A prototype implementation of Emerald exists on networks of VAXes or Sun-3 workstations, connected by an Ethernet. The Emerald system is not yet available to other users. We have not been able to do any meaningful performance measurements on the prototype system.

4.3. Conclusions on Emerald

Support for parallel and distributed programming in Emerald is best understood using two levels of abstraction. At the highest level, we have concurrent objects that invoke each other's operations in a synchronous (blocking) way; this is certainly a nice and simple abstraction. To see what is really going on, we need to look at how invocations are implemented and synchronized. Here, we are at the level of monitors. Monitors are well understood, but are harder to program than most other mechanisms discussed in this paper. This clearly shows up in the implementation code: most of our Emerald programs are significantly longer than their counterparts in the other languages.

For parallel programming, Emerald is less flexible than SR. It provides only one form of interprocess communication: synchronous remote procedure calls that are accepted implicitly. The parameter mechanism is consistent (call-by-reference is used throughout), but copying parameters is a problem. In principle, call-by-value parameters could have been allowed for passive objects (not containing a process). This extension would have made the parameter mechanism less uniform, however, and would have created a distinction between active and passive objects.

Emerald probably is more suitable for distributed applications (e.g. electronic mail, name servers) than for parallel applications. For such distributed applications, features like object migration and location independent invocations are more beneficial and the need for copying objects (e.g. electronic mailboxes) will be less.

5. Parlog

We have chosen Parlog [22,21,23,24] as representative for the large class of concurrent logic languages. Parlog has been developed at Imperial College, London, by Keith Clark, Steve Gregory, and their colleagues.

The language is based on AND/OR parallelism and committed-choice nondeterminism. The user can specify the order (parallel or sequential) in which clauses are to be evaluated. For this purpose, sequential and parallel conjunction and disjunction operators can be used.

5.1. Programming experience

The time needed for learning Parlog depends on one's background in concurrent logic programming. The language itself is quite simple. In addition, there are certain programming idioms one should master, such as streams and objects built with shared logical variables.

5.1.1. Mailbox communication

As in most concurrent logic languages, processes in Parlog can communicate through message streams. Such streams can easily be built out of shared logical variables. Streams, however, have one disadvantage: the receiving end can scan over the stream, but it cannot remove items from it [20]. Thus, mailbox-type communication cannot be expressed easily with streams.

Instead, we can use similar solutions as for SR, which means either adding a buffer process between the sender and receivers (see Fig. 2), or blocking the sender of the message. For our TSP program [10], we have chosen the latter option. There is only a single sender, which blocks when it wants to send a message. The sender takes a stream of incomplete messages of the form getmsg(Msg) as input. These messages are generated by the receivers. After receiving such a message, the sender instantiates the logical variable Msg to the next message it wants to send.

5.1.2. One-to-many communication

One-to-many communication is easy to express using shared logical variables. All that is needed is a stream of messages shared among the sender and the receivers. All receivers can scan this stream, thus receiving all the messages.

It depends on the language implementation whether physical multicast is used for this type of one-to-many communication. For example, multicast is used to some extent in the hypercube implementation of Flat Concurrent Prolog [33]. The Parlog system we have used uses shared memory, which takes away the need for physical multicast.

Our Parlog ASP program uses an even simpler approach to one-to-many communication. Rather than creating a fixed number of long-living processes, it creates a new set of parallel processes for each iteration of the algorithm. The pivot row for the next iteration is passed as a parameter to each of these processes. In other words, the program does not send a message to existing processes, but it creates new processes and passes the message as a parameter. This approach only works well because the Parlog system efficiently supports fine-grained parallelism. With the other languages discussed in this paper, the overhead of creating new processes for each iteration would be far too high.

5.1.3. Shared data

Parlog supports shared logical variables, but these variables can be assigned only once. Implementing mutable shared variables in Parlog is much more complicated. We represent such a variable as a stream of values, the last one of which is the current value of the variable. The predicate current_value scans the stream until the tail is an unbound variable, and returns the current last element of the stream as output value:

    mode current_value(Stream?, Value^).
    % Stream is input, Value is output
    current_value([V|Vs], Value) <-
        var(Vs) : Value = V;            % tail is unbound
    current_value([_|Vs], Value) <-
        current_value(Vs, Value).       % try next element

To update the variable, a new value is appended to the end of the stream. A process using the variable must periodically check for new values, by scanning the stream until the end. (This technique is also used by Huntbach [25].)

An important issue is how often to check the stream. Since scanning streams is expensive, it cannot be done too often. On the other hand, if it is done infrequently, the process will usually have an old value of the shared variable. For branch-and-bound applications like TSP, this means pruning will become less efficient, so more nodes will be searched (the so-called search overhead).

This solution is somewhat similar to the SR and Emerald implementations described above. The stream representing the shared variable can be regarded as a stream of update messages. An important difference is the way these messages are accepted. In SR and Emerald, a new process is created when a message arrives, which will service the message immediately (i.e. the message is received implicitly). Parlog does not have implicit message receipt, so the receiver must explicitly look for new messages. Since it is not known in advance when update messages may arrive, there is a problem in deciding when to look for them.
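The stream representation itself is easy to mimic (our Python sketch; a growing list stands in for the partly instantiated stream, and its end for the unbound tail):

    class ValueStream:
        def __init__(self, initial):
            self.values = [initial]             # the stream built so far

        def update(self, value):
            self.values.append(value)           # bind a new value at the tail

        def current_value(self, pos):
            while pos + 1 < len(self.values):   # scan up to the "unbound" tail
                pos += 1
            return self.values[pos], pos        # value, plus a resume point

    bound = ValueStream(100)
    pos = 0
    bound.update(90)
    bound.update(80)
    value, pos = bound.current_value(pos)
    print(value)    # a reader that checks now sees 80; one that checked
                    # earlier would still be pruning with 100

The sketch also shows why the checking frequency matters: a value is only as fresh as the reader's last scan.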
5.2. Implementation and performance

An interpreter for Parlog has been implemented on several shared-memory multiprocessors (Sequent Balance and Symmetry, Butterfly). A commercially available subset of Parlog, called Strand, has also been implemented on distributed systems (hypercubes, networks). The Parlog system is available from Imperial College.

We have used a 6-CPU Sequent Balance for running some initial performance measurements. This implementation of Parlog relies on the presence of shared memory. Also, the implementation is based on an interpreter and runs on slow processors, so its absolute performance currently is one to two orders of magnitude less than that of the other languages described in this paper. These two issues taken together result in a relative communication overhead that is far less than what would be expected in a production-quality, distributed implementation.

We have measured a speedup of 5.33 for ASP and 4.98 for TSP, using 6 CPUs. The speedup for ASP is fairly high, due to the low communication overhead. For TSP, the speedup is not optimal, because the global bound is not kept up-to-date everywhere. The TSP program therefore suffers from a search overhead.

5.3. Conclusions on Parlog

The shared logical variable is at a higher level of abstraction than message passing. For some applications, it is spectacularly expressive. Our Parlog program for ASP, for example, is just as simple as the original sequential algorithm. The synchronization of the parallel tasks is done implicitly, using suspension on unbound logical variables. On the negative side, it is not clear whether the program will run efficiently on a realistic large-scale parallel system.

For other applications, shared logical variables are less suitable, but one can then fall back on message passing through streams. This form of message passing has some drawbacks, however, as discussed in [20].

6. Linda

Linda is a set of language primitives developed by David Gelernter and colleagues at Yale University [1,16,17]. Linda is based on the Tuple Space model of communication. The Tuple Space is a global memory consisting of tuples (records) that are addressed associatively. Three atomic operations are defined on Tuple Space: out adds a tuple to TS; read reads a tuple contained in TS; in reads a tuple and also deletes it from TS, in one atomic action.

6.1. Programming experience

Of all five languages discussed in this paper, Linda undoubtedly is the simplest one to learn. It adds only a few primitives to an existing base language. Below, we will discuss how these primitives can be used to implement the three communication patterns.

6.1.1. Mailbox communication

The simulation of a mailbox in Linda is simple. A mailbox is represented as a distributed data structure [16] in Tuple Space. To send a message to the mailbox, a new tuple containing the message is added to this data structure. To receive a message, a tuple is retrieved from Tuple Space and its contents are read.

To preserve the ordering of the messages, a sequence number field is added to each message tuple. The tuples are generated and retrieved in the same order.

The next sequence number to generate and the sequence number of the next message to accept are also stored in tuples. They are initialized to zero, by the statements:

    out("head", 0);         # initialize tuple containing index of head of queue
    out("tail", 0);         # initialize tuple containing index of tail of queue

To send a message msg to a mailbox, the following code is executed:

    in("tail", ?&tail);     # obtain next sequence number
    out("tail", tail + 1);  # put back next sequence number
    out("MB", msg, tail);   # put message with sequence number in TS

The in operation blocks until a matching tuple is found. Next, it assigns the formal parameters of the in (denoted by a "?") the corresponding values of the tuple. Finally, it deletes the tuple from Tuple Space. All of this is done atomically.

Receiving a message from a mailbox is implemented through the following code:

    in("head", ?&head);     # first obtain sequence number
    out("head", head + 1);  # put sequence-number tuple back in TS
    in("MB", ?&msg, head);  # now fetch message with right sequence number

The tuples can be thought of as forming a distributed queue data structure, with pointers (indices) to the head and tail of the queue.

This example clearly illustrates the advantages and disadvantages of Linda. The mailbox implementation is very simple: it requires only a few lines of code. On the other hand, the operations used for accessing the mailbox are fairly low-level. For example, three Tuple Space operations are needed for sending or receiving a single message. It is far from trivial that this code is correct. Also, the implementation must do extensive optimization to make the send/receive operations efficient.
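To see the whole protocol in one runnable piece, here is a toy tuple space (our Python sketch; real Linda implementations look nothing like this, and a real in would block instead of failing):

    class TupleSpace:
        def __init__(self):
            self.tuples = []

        def out(self, *t):
            self.tuples.append(t)

        def in_(self, *pattern):        # None stands for a formal ("?") field
            for t in self.tuples:
                if len(t) == len(pattern) and \
                        all(p is None or p == f for p, f in zip(pattern, t)):
                    self.tuples.remove(t)
                    return t
            raise LookupError("no match")   # a real in() would block here

    ts = TupleSpace()
    ts.out("head", 0); ts.out("tail", 0)    # the initialization shown above
    _, tail = ts.in_("tail", None)          # send: claim a sequence number,
    ts.out("tail", tail + 1)                # put it back incremented,
    ts.out("MB", "job 1", tail)             # and add the message tuple
    _, head = ts.in_("head", None)          # receive: the mirror image
    ts.out("head", head + 1)
    print(ts.in_("MB", None, head))         # ('MB', 'job 1', 0)

Even in this toy form, the three-operation send and receive sequences are visible, as is the atomicity burden that each in places on a real implementation.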
6.1.2. One-to-many communication

In Linda, data can be transferred from one process to all the others by putting the data in Tuple Space, where it can be read by everyone. So, expressing one-to-many communication in Linda is trivial; it just requires a single out statement:

    out(msg);

A key question that remains, however, is what really happens. For efficiency, it makes considerable difference whether the data are transferred through a real multicast protocol or not.

There are many different implementations of Tuple Space to consider. The S/Net system replicates all tuples everywhere, using the S/Net broadcast capability [17]. The hypercube and transputer implementations of Linda, on the other hand, hash each tuple onto one specific processor and do not replicate tuples [14,34]. In this case, the data in the message will not be multicast. Each receiver will have to fetch the data itself, using a read statement. The communication overhead will thus be linear in the number of receivers. In conclusion, expressing one-to-many communication in Linda is trivial, but the performance will be hard to predict.

6.1.3. Shared data

In theory, a shared variable can be simulated in Linda by storing it in Tuple Space. This solution makes heavy demands on the implementation of Tuple Space, however. If the variable is read very frequently (as is true in TSP), the overhead of reading it must be very low. So, for efficiency each processor should have a local copy of the tuple. Not all Tuple Space implementations have this property. The hypercube and transputer implementations mentioned above, for example, store each tuple on only a single processor. An additional performance problem is the associative addressing of Tuple Space. Part of this overhead can be optimized away [18], but it is not clear whether it can be eliminated entirely. So, whether or not the above solution is practical depends on the implementation.

6.2. Implementation and performance

Linda has been implemented on many parallel machines, both with and without shared memory, and has been used for numerous applications [20]. The system is distributed as a commercial product. (The Linda system we have used for our performance measurements is not the most recent one; newer versions of the Linda software may obtain better performance.)

We have used a VME-bus based multiprocessor for some initial performance measurements.

For the All Pairs Shortest Paths problem, we have measured a speedup of 7.4 on 8 CPUs. Since the implementation uses shared memory, the distribution of the pivot rows is efficient. Each new pivot row is put in a tuple in shared memory, where it can be read by all processors.

The Traveling Salesman Problem program obtains a speedup of 7.06 on 8 CPUs. The program stores the global bound in a tuple. In our Linda system, using this tuple for every read access is too expensive. Therefore, each processor also keeps a local copy of the variable. These copies are updated occasionally. So, this implementation is similar to the Parlog implementation, except that the bound is stored in a tuple rather than in a stream. Updating the local copies is relatively cheaper in the Linda version, so it can be done more frequently. As a result, the relative search overhead in the Linda program is less than that of the Parlog version.

6.3. Conclusions on Linda

Most of the criticism of Linda in the literature is related to efficiency. The associative addressing and global visibility of the Tuple Space have led many people to believe that Linda cannot be implemented efficiently. However, its implementors have made considerable progress during the past few years in optimizing the performance on several machines. The in operation, for example, hardly ever scans the entire Tuple Space, but typically uses hashing or something even more efficient. Just as with virtual memory, however, there will probably always remain cases where the easy-to-program approach will not be optimal. So, the performance of Linda programs may sometimes be hard to predict.

An important decision in Linda is to hide the physical distribution of data from the user. In contrast, Emerald gives the programmer control over the placement of data, by supporting user-initiated object migration. The Linda approach is simpler, but it makes heavier demands on the implementation. Again, the transparent approach will sometimes be less efficient, but it remains to be seen how big the differences in performance are for actual programs.

The concept of distributed data structures is probably one of the most important contributions of Linda. However, the way Linda implements distributed data structures (through a fixed number of operations on Tuple Space) is rather low-level, in our view [27].

7. Orca

Orca is a language for implementing parallel applications on distributed systems. Orca was designed at the Vrije Universiteit in Amsterdam [13,8,11,6].

The programming model of Orca is based on logically shared data. The language hides the physical distribution of the memory and allows processes to share data even if they run on different nodes. In this way, Orca combines the advantages of distributed systems (good price/performance ratio and scalability) and shared-memory multiprocessors (ease of programming).

The entities shared among processes are data objects, which are variables of user-defined abstract data types. These data objects are replicated in the local memories, so each process can directly read its own copy, without doing any communication. The language run time system atomically updates all copies when an object is modified.

This model is similar to that of Distributed Shared Memory (DSM) systems [29]. In Orca, however, the unit of sharing is a logical (user-defined) object rather than a physical (system-defined) page, which has many advantages [13].

7.1. Programming experience

Orca is a new language rather than an extension to an existing sequential language. An important disadvantage of extending a base language is the difficulty of implementing pointers and global variables on systems lacking shared memory. These problems can more easily be avoided if the language is designed from scratch. Orca, for example, supports first-class graph variables rather than pointers. Unlike pointers, graphs can freely be moved or copied from one machine to another. Of course, this approach also implies that programmers have to learn a new language. The design of Orca has been kept as simple as possible, so this disadvantage should not be overestimated.

7.1.1. Mailbox communication

A mailbox can be implemented in Orca in a similar way as in Emerald, by using a shared mailbox object. The specification of a generic abstract data type Mailbox in Orca is shown below:

    generic (type T)
    object specification GenericMailbox;
        operation AddMsg(Msg: T);
        operation GetMsg(): T;
    end generic;

The implementation of the mailbox is simpler than the one in Emerald, because operations in Orca are indivisible. In other words, mutual exclusion synchronization is done automatically in Orca, whereas Emerald requires the usage of a monitor construct for this purpose. Also, Orca provides a powerful mechanism for condition synchronization (based on guarded commands), so blocking the receivers when the mailbox is empty is easy to express.

7.1.2. One-to-many communication

Orca's shared data-objects can be used for expressing one-to-many communication. If one process applies a write-operation to an object, all other processes sharing the object can observe the effects. Our ASP program in Orca, for example, uses an object-type RowCollection, with the following operations:

    object specification RowCollection;
        type RowType = array[integer] of integer;
        operation AddRow(iter: integer; R: RowType);
            # Add the row for the given iteration number
        operation AwaitRow(iter: integer): RowType;
            # Wait until the row for the given iteration is
            # available, then return it.
    end;

The process that wants to send the pivot row applies the operation AddRow to the object. The run time system will then update all copies of this object by multicasting the operation [8]. A process requiring the pivot row invokes the operation AwaitRow, which blocks until the requested row has been added to the object and then returns this row. The latter operation is done locally, without needing any communication. So, the Orca solution is efficient, since it uses physical multicasting, if available.
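The observable behaviour of RowCollection can be sketched in Python (ours; it shows only the blocking semantics, not the replication and multicasting that the Orca run time system performs underneath):

    import threading

    class RowCollection:
        def __init__(self):
            self.rows = {}                    # iteration number -> pivot row
            self.present = threading.Condition()

        def add_row(self, iteration, row):
            with self.present:
                self.rows[iteration] = row
                self.present.notify_all()     # wake all processes awaiting it

        def await_row(self, iteration):
            with self.present:
                while iteration not in self.rows:
                    self.present.wait()       # condition synchronization
                return self.rows[iteration]

    rc = RowCollection()
    threading.Thread(target=rc.add_row, args=(0, [0, 2, 9])).start()
    print(rc.await_row(0))                    # blocks until AddRow(0, ...) runs

In Orca itself the guarded commands express the AwaitRow condition declaratively, and every process runs this logic against its own local copy of the object.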
7.1.3. Shared data

Orca has support for logically shared data as a design goal, so it is no surprise that communication through shared data is easy to express in this language. The shared variable is put in a data object shared among all processes. The run time system automatically replicates the object in the local memories, so processes can directly read the value. Whenever the object is changed, all copies are updated immediately, by broadcasting the new value. Moreover, atomicity of the operations is already guaranteed by the language. This solution is both simple and efficient. The only overhead in reading the value is that of a local operation invocation. When the variable is changed, its new value is broadcast to all processors containing a copy.

7.2. Implementation and performance

Orca has been implemented on top of Amoeba [32] as well as on a collection of MC68030s connected through an Ethernet. The latter implementation uses the physical multicast capability of the Ethernet. The Orca implementation is being distributed as part of the Amoeba system.

We have done many performance measurements on these systems, as described in detail elsewhere [13]. Here, we present some recent results for the multicast system, using 16 CPUs.

The measured speedup for the All Pairs Shortest Paths problem on 16 CPUs is 15.9. This high speedup is mainly due to the efficient broadcast protocol, which is used for transmitting the pivot rows. For the Traveling Salesman Problem, the speedup on 16 CPUs is 14.44. Since all copies of the global bound are updated immediately, the search overhead is low.

7.3. Conclusions on Orca

Orca is not an object-based language; it merely provides abstract data types. It supports both active processes and passive data-objects. Since objects in Orca are purely passive, they can be replicated, which is a very important goal in the implementation.

An important difference with Linda is the support for user-defined, high-level operations on shared data [27]. Linda only provides a fixed number of built-in operations on tuples, but Orca allows programmers to construct their own atomic operations. Unlike Linda, Orca uses direct rather than associative addressing of shared data, and thus avoids any problems with associative addressing.

For some applications, Orca has important advantages over other languages. Programs that need logically shared data are easy to implement in Orca and are efficient. Orca also is one of the few languages that uses physical broadcasting in its implementation. As we have seen, for ASP this is of critical importance. On the other hand, there also are cases where the model is less efficient, for example when plain point-to-point message passing is required.

8. Discussion

In the previous sections we have looked at how the five languages deal with three example communication patterns. The results of this study are summarized in Table 2. Below, we will compare the approaches taken for the different languages.

For communication through mailboxes, there are three different solutions. For Linda, we store a mailbox as a distributed data structure in Tuple Space. This solution requires only a few lines of code. For Emerald and Orca, a mailbox is represented as an abstract object, with operations to send and receive messages. This approach requires more code, especially for synchronizing access to the mailbox. On the other hand, the abstract operations on a mailbox object are higher level than the Linda operations on tuples. The third solution, used for SR and Parlog, is to add an extra buffer process between the sender and receivers.

For one-to-many communication, Parlog, Linda, and Orca provide the simplest solutions, all based on shared data. SR has a concurrent-send primitive built in, but it does not make any guarantees about the order in which messages are delivered. Emerald has no provision for one-to-many communication, so it must be simulated with multiple point-to-point messages, which are sent sequentially. An important issue is how one-to-many communication is implemented: as a physical multicast or not. Most language implementations do not use multicast, Orca and Linda being two notable exceptions.

The third communication pattern, replicated shared data, is simple to express in Orca and Linda, since these languages provide logically shared data. For Linda, the performance of the resulting programs is hard to predict, because many different strategies are used for distributing tuples. Orca, on the other hand, always tries to replicate shared objects wherever they are needed. For the other languages, we simulate shared data through message passing. Here, the ability to accept messages implicitly (i.e. by a newly created process) is very important. SR and Emerald both provide this facility. Parlog uses only explicit message receipt, which makes efficient updating of the copies of shared data harder.
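The difference in this last column can be made concrete with a final sketch (ours, in Python; a handler thread stands in for the implicitly created server process):

    import queue, threading

    def implicit_receiver(inbox, state):   # SR/Emerald style: an update is
        while True:                        # serviced the moment it arrives
            update = inbox.get()
            if update is None:
                break
            state["bound"] = update

    def explicit_poll(inbox, state):       # Parlog style: an update takes
        try:                               # effect only when the process
            while True:                    # remembers to look for it
                state["bound"] = inbox.get_nowait()
        except queue.Empty:
            pass

    state = {"bound": 100}
    inbox = queue.Queue()
    t = threading.Thread(target=implicit_receiver, args=(inbox, state))
    t.start()
    inbox.put(42); inbox.put(None); t.join()
    print(state["bound"])                  # 42, without any polling

    state2, inbox2 = {"bound": 100}, queue.Queue()
    inbox2.put(42)
    explicit_poll(inbox2, state2)          # stale until this poll happens
    print(state2["bound"])

With implicit receipt the copies are refreshed as soon as updates arrive; with explicit receipt their freshness depends entirely on how often the program polls.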

Table 2
Summary of the solutions taken for all five languages to the three communication patterns

Language  Mailboxes                            One-to-many communication      Replicated shared data
SR        Buffer process                       Concurrent send                Messages with implicit receive
Emerald   Shared-object message queue          Point-to-point messages        Messages with implicit receive
Parlog    Buffer process                       Shared stream (or solution     Messages with explicit receive
                                               with fine-grained parallelism)
Linda     Distr. data structure message queue  Shared data                    Shared tuple (or m.p. with
                                                                              explicit receive)
Orca      Shared-object message queue          Shared data                    Distributed shared memory

Acknowledgements

The work on SR and Emerald was done while the author was visiting the University of Arizona, Department of Computer Science, Tucson, AZ. The work on Parlog was done while he was at Imperial College, Department of Computing, London. The author is grateful to both departments for receiving him as an academic visitor. Also, he would like to thank Nick Carriero, Greg Andrews, Dave Bakken, Gregg Townsend, Mike Coffin, Norman Hutchinson, Keith Clark, Jim Crammond, and Andrew Davison for the discussions on their languages. The work on Linda and Orca has been done in cooperation with Frans Kaashoek. Erik Baalbergen, Arnold Geels, Frans Kaashoek, and Andy Tanenbaum provided useful comments on an earlier version of the paper.

References

[1] S. Ahuja, N. Carriero and D. Gelernter, Linda and friends, IEEE Comput. 19 (8) (Aug. 1986) 26-34.
[2] G.R. Andrews and F.B. Schneider, Concepts and notations for concurrent programming, ACM Comput. Surveys 15 (1) (Mar. 1983) 3-43.
[3] G.R. Andrews and R.A. Olsson, The evolution of the SR programming language, Distributed Comput. 1 (Jul. 1986) 133-149.
[4] G.R. Andrews, R.A. Olsson, M. Coffin, I. Elshoff, K. Nilsen, T. Purdin and G. Townsend, An overview of the SR language and implementation, ACM Trans. Program. Lang. Syst. 10 (1) (Jan. 1988) 51-86.
[5] G.R. Andrews, Paradigms for process interaction in distributed programs, TR 89-24, University of Arizona, Tucson, AZ (Oct. 1989) (accepted for publication in ACM Computing Surveys).
[6] H.E. Bal and A.S. Tanenbaum, Distributed programming with shared data, Proc. IEEE CS 1988 Internat. Conf. on Computer Languages, Miami, FL (Oct. 1988) 82-91.
[7] H.E. Bal, J.G. Steiner and A.S. Tanenbaum, Programming languages for distributed computing systems, ACM Comput. Surveys 21 (3) (Sep. 1989) 261-322.
[8] H.E. Bal, M.F. Kaashoek and A.S. Tanenbaum, A distributed implementation of the shared data-object model, USENIX Workshop on Experiences with Building Distributed and Multiprocessor Systems, Ft. Lauderdale, FL (Oct. 1989) 1-19.
[9] H.E. Bal, An evaluation of the SR language design, IR-219, Vrije Universiteit, Amsterdam, The Netherlands, Aug. 1990.
[10] H.E. Bal, Heuristic search in PARLOG using replicated worker style parallelism, IR-229, Vrije Universiteit, Amsterdam, The Netherlands, Nov. 1990.
[11] H.E. Bal, M.F. Kaashoek and A.S. Tanenbaum, Experience with distributed programming in Orca, Proc. IEEE CS 1990 Internat. Conf. on Computer Languages, New Orleans, LA (Mar. 1990) 78-89.
[12] H.E. Bal, Fault-tolerant parallel programming in Argus, IR-214, Vrije Universiteit, Amsterdam, The Netherlands, May 1990.
[13] H.E. Bal, Programming Distributed Systems (Silicon Press, Summit, NJ, 1990).
[14] R. Bjornson, N. Carriero and D. Gelernter, The implementation and performance of hypercube Linda, Report RR-690, Yale University, New Haven, CT, Mar. 1989.
[15] A. Black, N. Hutchinson, E. Jul, H. Levy and L. Carter, Distribution and abstract types in Emerald, IEEE Trans. Softw. Engrg. SE-13 (1) (Jan. 1987) 65-76.
[16] N. Carriero, D. Gelernter and J. Leichter, Distributed data structures in Linda, Proc. 13th ACM Symp. Princ. Progr. Lang., St. Petersburg, FL (Jan. 1986) 236-242.
[17] N. Carriero and D. Gelernter, The S/Net's Linda kernel, ACM Trans. Comput. Syst. 4 (2) (May 1986) 110-129.
[18] N. Carriero, The implementation of tuple space machines, Research Report 567 (Ph.D. dissertation), Yale University, New Haven, CT, Dec. 1987.
[19] N. Carriero and D. Gelernter, How to write parallel programs: A guide to the perplexed, ACM Comput. Surveys 21 (3) (Sep. 1989) 323-357.
[20] N. Carriero and D. Gelernter, Linda in context, Commun. ACM 32 (4) (Apr. 1989) 444-458.
[21] K.L. Clark and S. Gregory, PARLOG: Parallel programming in logic, ACM Trans. Program. Lang. Syst. 8 (1) (Jan. 1986) 1-49.
[22] K.L. Clark, PARLOG and its applications, IEEE Trans. Softw. Engrg. SE-14 (12) (Dec. 1988) 1792-1804.
[23] T. Conlon, Programming in PARLOG (Addison-Wesley, Wokingham, UK, 1989).
[24] S. Gregory, Parallel Logic Programming in PARLOG (Addison-Wesley, Wokingham, UK, 1987).
[25] M. Huntbach, Combinatorial search in PARLOG using speculative computation, Imperial College, London, May 1989.
[26] E. Jul, H. Levy, N. Hutchinson and A. Black, Fine-grained mobility in the Emerald system, ACM Trans. Comput. Syst. 6 (1) (Feb. 1988) 109-133.
[27] M.F. Kaashoek, H.E. Bal and A.S. Tanenbaum, Experience with the distributed data structure paradigm in Linda, USENIX Workshop on Experiences with Building Distributed and Multiprocessor Systems, Ft. Lauderdale, FL (Oct. 1989) 175-191.
[28] K.M. Kahn and M.S. Miller, Technical correspondence on "Linda in Context", Commun. ACM 32 (10) (Oct. 1989) 1253-1255.
[29] K. Li and P. Hudak, Memory coherence in shared virtual memory systems, ACM Trans. Comput. Syst. 7 (4) (Nov. 1989) 321-359.
[30] B. Liskov, Distributed programming in Argus, Commun. ACM 31 (3) (Mar. 1988) 300-312.
[31] E. Shapiro, Technical correspondence on "Linda in Context", Commun. ACM 32 (10) (Oct. 1989) 1244-1249.

[32] A.S. Tanenbaum, R. van Renesse, H. van Staveren, G.J. Sharp, S.J. Mullender, A.J. Jansen and G. van Rossum, Experiences with the Amoeba distributed operating system, Commun. ACM 33 (12) (Dec. 1990) 46-63.
[33] S. Taylor, S. Safra and E. Shapiro, A parallel implementation of flat concurrent Prolog, Internat. J. Parallel Programming 15 (3) (1987) 245-275.
[34] S.E. Zenith, Linda coordination language; Subsystem kernel architecture (on transputers), RR-794, Yale University, New Haven, CT, May 1990.

Henri E. Bal received an M.Sc. in Mathematics from the Delft University of Technology in 1982 and a Ph.D. in Computer Science from the Vrije Universiteit in Amsterdam in 1989. His research interests include programming languages, parallel and distributed programming, and compilers.

From 1982 until 1985, Dr. Bal participated in the Amsterdam Compiler Kit project. He is the author of the ACK global optimizer. Since 1985, he has been working on Orca, a new programming language for implementing parallel applications on distributed systems. Orca has been implemented on top of the Amoeba distributed operating system and has been used for various applications. The language design, implementation, and usage are described in several published research papers and in his book Programming Distributed Systems.

Dr. Bal has been a visiting researcher at MIT, the University of Arizona, and Imperial College. At present, he is a staff member of the Department of Computer Science at the Vrije Universiteit in Amsterdam.
