

Let’s Go: a Data-Driven Multi-Threading Support
Alberto Scionti, Istituto Superiore Mario Boella, Turin, IT (scionti@ismb.it)
Somnath Mazumdar, Università di Siena, Siena, IT (mazumdar@dii.unisi.it)
ABSTRACT
Increasing the performance of computing systems requires solutions that improve both scalability and productivity. In recent times, manycore multiprocessors and data-driven Program eXecution Models (PXMs) have been gaining popularity, due to the superior support they offer compared to traditional von Neumann execution models. However, exposing the benefits of such PXMs within a high-level programming language is a challenge. Many high-level languages offer support for multi-threading by exposing thread synchronisation mechanisms via external libraries (e.g., OpenMP, MPI) or directly through the language itself (e.g., C++11, Java). However, such synchronisation models make heavy use of mutexes and locks, generally leading to poor system performance. Conversely, one major appeal of the Go programming language is the way it supports concurrency: goroutines (tagged functions) are mapped on OS threads and communicate with each other through data structures that buffer input data (channels). By forcing goroutines to exchange data only through channels, it is possible to enable a data-driven execution. This paper proposes a first attempt to map goroutines on a data-driven thread execution model. The Go compilation procedure and the run-time library are modified to support the execution of fine-grain threads on an abstract parallel machine model.

CCS CONCEPTS
• Computer systems organization → Data flow architectures; Parallel architectures; • Software and its engineering → Compilers;

1 INTRODUCTION
Current silicon technology allows the integration of 100+ cores into a single die [1], offering support for the execution of an enormous number of concurrent threads. The ability to exploit a high level of parallelism mainly depends on three factors: i) an inherently highly parallel environment for executing threads, ii) the availability of effective communication and synchronisation mechanisms, and iii) the capability of exposing parallelism at the programming level. All these factors play a key role in the implementation of an efficient, high-performance system.

Still, there is a gap in providing an effective link between the architectural and the programming level. High-level languages (such as C++11, Java and Haskell) allow the user to express concurrent execution without resorting to external run-time libraries. Conversely, programming environments such as OpenMP and MPI provide language extensions (i.e., run-time libraries) to ease the parallelisation activity. However, thread execution and synchronisation generally remain confined to the heavy use of locks and mutexes over a von Neumann execution model, which strongly limits system performance. Instead, in a dataflow-inspired Program eXecution Model (PXM), threads are executed concurrently once their inputs have been produced. By monitoring thread activity, it is possible to notify threads as soon as their input data become available, thus reducing communication overheads and improving performance for generic applications. Programming languages such as Concurrent Collections [4], X10 [6] and Chapel [5] provide support for dataflow-like fine-grain threads, but unfortunately their adoption is very low. In general, programmers tend to prefer more popular languages that ensure support from developer communities, as well as an extensive set of additional libraries. The Go language [11] falls into this last category: it is officially supported by Google and provides several libraries to ease the programmer's work. Furthermore, it provides a concurrency model for the parallel execution of threads.

This paper focuses on supporting a data-driven PXM directly within the Go programming language. Our solution leverages the concurrency model implemented by the Go language, which resembles a producer-consumer schema. By forcing concurrent functions to use only dedicated data structures that buffer input data (channels) as a means for exchanging values, the user can easily express parallelism in the application following a data-driven schema. The paper also describes an abstract machine model (AMM) which serves as a reference architectural template, as well as provides the low-level execution substrate. Preliminary experiments and a possible mapping of such an AMM on a high-performance reconfigurable network-on-chip (NoC) architecture show the advantages of combining high-level programming with an effective execution mechanism.
2 BACKGROUND

2.1 Fine-grain data-driven PXMs
Modern multi-/manycore processors provide the substrate for the execution of data-driven multi-threaded applications. Data-driven self-scheduling PXMs prefer a fine-grain threading model over coarse-grain threads. Such threads are composed of a few hundred instructions and follow a producer-consumer communication schema. A private block of memory (frame) is used to store the inputs required to run each thread, along with the scheduling slot (SS), a counter storing the number of inputs still not received, and a pointer to the thread code. Every time a thread (producer) sends data to another thread (consumer), the SS of the receiver thread is decremented. The triggering rule states that a thread becomes runnable once its SS is reduced to zero. Also, threads are allowed to explicitly schedule new threads, by properly initialising the SS and passing the instruction pointer. We call threads adhering to this model data-driven threads (DD-Threads).
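To make the triggering rule concrete, the following minimal sketch models a frame with its SS counter (written in Go, the language this paper targets; the type and method names are our own illustration, not part of any cited system):

    package main

    import "fmt"

    // DDThread models the per-thread state kept in a frame: the inputs
    // received so far, the scheduling slot (SS) counting the inputs still
    // missing, and a pointer to the thread code.
    type DDThread struct {
        frame []int       // private block of memory holding the inputs
        ss    int         // scheduling slot: inputs still not received
        body  func([]int) // pointer to the thread code
    }

    // Receive stores one input produced by another thread and decrements
    // the SS; per the triggering rule, the thread fires once SS reaches zero.
    func (t *DDThread) Receive(slot, value int) {
        t.frame[slot] = value
        t.ss--
        if t.ss == 0 {
            t.body(t.frame)
        }
    }

    func main() {
        t := &DDThread{
            frame: make([]int, 2),
            ss:    2, // two inputs must arrive before the thread runs
            body:  func(in []int) { fmt.Println("sum:", in[0]+in[1]) },
        }
        t.Receive(0, 3) // first producer writes: SS goes 2 -> 1
        t.Receive(1, 4) // second producer writes: SS hits 0, thread fires
    }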
Hardware extensions can be integrated in the processor design to increase the performance of multi-/manycore processors supporting data-driven PXMs. Recently, the Teraflux project [8] proposed a hierarchically organised solution that provides hardware scheduling features, as well as an instruction set extension to accelerate thread synchronisation. The Codelet model [3, 10] is another recent data-driven inspired model. Compared to the one proposed by Teraflux, it does not consider any specific hardware implementation, but rather an AMM. Also, threads are organised in a hierarchy, aiming at better satisfying a principle of locality of computations. In [2] the authors extended a NoC design in such a way that routers accelerate the scheduling of, and communication among, threads following this kind of hierarchical data-driven execution model. Threaded-C [7, 9] provides an extension of the standard C language to explicitly support a fine-grain data-driven execution model. The extension exposes the synchronisation features of the EARTH architecture directly at the programming level, making it possible to take advantage of a lighter thread communication mechanism. From this perspective, Threaded-C represents an attempt to support dataflow semantics in a widespread programming language.
2.2 Concurrency in Go
In this section, we briefly review the concurrency model provided by the Go language. Go implements concurrent execution in two flavours: a more classic approach based on locks and mutexes, and communicating sequential processes (CSP). The latter provides the features required to integrate a dataflow-like PXM. Concurrent execution is enabled by placing the go keyword before a function call, which turns the function into a goroutine. By default, goroutines execute independently from each other, once their input list becomes available. Inputs can be supplied at the time of the function call, using traditional value passing methods (e.g., by value or by pointer). However, goroutines can exploit a more effective way of communicating. In fact, the input list can include special buffers referred to as channels. Channels behave like storage elements and, similarly to variables, are typed. Also, channels can be declared as buffered by specifying the number of available slots. In that case, a read operation on an empty slot makes the goroutine wait until a value becomes available (similarly, a write operation on a full buffer blocks). Finally, goroutines can mark channels as input (i.e., only read operations are permitted) or output (i.e., only write operations are allowed) channels. The scheduling policy is implemented in a run-time library: one or more goroutines are mapped on a system thread for execution. All these language features allow Go applications to exploit the underlying data-driven execution model, as we will discuss in the following sections.
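As a concrete illustration of these language features (a self-contained example of ours, not code taken from the paper), the following program uses the go keyword, a buffered channel, and directional channel parameters:

    package main

    import "fmt"

    // producer sees the channel as output-only (chan<-): writes are
    // allowed, reads are rejected at compile time.
    func producer(out chan<- int, n int) {
        for i := 0; i < n; i++ {
            out <- i * i // blocks only when the buffer is full
        }
        close(out)
    }

    // consumer sees the same channel as input-only (<-chan): a read on
    // an empty buffer makes the goroutine wait until a value arrives.
    func consumer(in <-chan int, done chan<- bool) {
        for v := range in {
            fmt.Println("received:", v)
        }
        done <- true
    }

    func main() {
        ch := make(chan int, 4) // buffered channel with 4 available slots
        done := make(chan bool)
        go producer(ch, 8) // the go keyword turns the call into a goroutine
        go consumer(ch, done)
        <-done // wait for the consumer to drain the channel
    }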
3 DATA-DRIVEN EXECUTION SUPPORT
Our approach extends the capabilities of the Go run-time scheduler to support the self-scheduling property provided by the data-driven PXM (see Section 2.1). To this end, we consider an AMM providing the set of features needed to execute, in a data-driven fashion, the fine-grain threads generated by the compilation process of the Go application. Figure 1 shows the architecture of the proposed AMM, which can be viewed as an abstraction for the class of manycore processors and software systems that implement the actual execution substrate. The AMM consists of a set of execution cores (Ci) abstracting the structure of a manycore processor. A dedicated interconnection allows the cores to exchange data with each other in a fast way. To this end, both the cores and the interconnection have a communication interface that provides access to network functions, such as sending and receiving data to/from other cores and broadcasting messages. The same interface can be extended to allow sending and receiving messages from external components. Each core can execute one thread at a time, albeit it can store information regarding multiple DD-Threads waiting for execution. Such information is kept in a fast local scratchpad-like memory block. This memory is managed directly by the core through the functionalities provided by a Thread Unit (TU). The TU is responsible for scheduling the execution of new threads created during the application execution, as well as for reserving space for the threads' communication channels in the local memory. A set of memory controllers, attached to the interconnection medium, allows access to the external DRAM memory banks, which in turn are used to create a shared memory space.

[Figure 1: The Abstract Machine Model (AMM) supporting the execution of dataflow fine-grain threads. The figure shows a manycore chip with cores C0 ... Cn, each paired with a local memory and a Thread Unit (TU), connected through an interconnection to memory controllers and external chips.]

Each core executes DD-Threads once all their inputs have been produced. Since goroutines communicate through channels, the inputs required by a thread must be kept as close as possible to the executing core assigned to that thread. To this end, the TU reserves a portion of the local memory of the core corresponding to the size of the created communication channel. The TU keeps track of the number of inputs already loaded for a given thread, so that it can select one thread for execution among those that are runnable.
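The components just described can be summarised in a types-only sketch (our own Go rendering of Figure 1; the structure and field names are illustrative assumptions, not an interface defined by the paper):

    package amm

    // Core abstracts one execution core (Ci): it executes one DD-Thread
    // at a time, but stores the state of several waiting ones in a fast
    // local scratchpad-like memory block.
    type Core struct {
        localMem []int      // scratchpad: thread frames and channel buffers
        tu       ThreadUnit // Thread Unit managing this core's local memory
    }

    // ThreadUnit (TU) schedules newly created threads and reserves local
    // memory for their communication channels; it tracks how many inputs
    // each pending DD-Thread still misses (its SS counter).
    type ThreadUnit struct {
        ss   map[int]int    // TID -> inputs still not received
        code map[int]func() // TID -> pointer to the thread body
    }

    // AMM is the whole abstract machine: cores tied together by an
    // interconnection, plus memory controllers toward external DRAM
    // that realise the shared memory space.
    type AMM struct {
        cores    []Core
        memCtrls int // memory controllers attached to the interconnect
    }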
3.1 Mapping goroutines to data-driven threads
Goroutines are flexible enough that they can be easily mapped to DD-Threads for execution on the proposed AMM. The process can be explained through an illustrative example: a simple application that multiplies two matrices. The Go code for this application is shown in figure 2 (right), along with the corresponding dataflow graph, DFG (left). Yellow edges represent scheduling operations, while red arrows identify data dependencies between producer and consumer threads.

[Figure 2: The block matrix multiply application written in Go (right), along with the corresponding dataflow graph and the mapping table (left). In the DFG, the entry point leads to Main, which schedules the workers W0, W1, ..., Wn-1 and the collector Res; the workers feed Res, which leads to the exit node. Queues (channels) are attached to each node of the graph. The mapping table and the Go listing, reconstructed from the figure, follow; the listing's line numbers are referenced in the text.

    Ch-name    Size   chID   TID
    ch1        n      ID1    TID1
    ch[0]      R+C    ID2    TID2
    ch[1]      R+C    ID3    TID3
    ...        ...    ...    ...
    ch[n-1]    R+C    IDn    TIDn

     1. func main ( ) {
     2.
     3.     // n = R*C;
     4.     ch1 = make(chan int, n);
     5.     go res(ch1);
     6.     for i:=0; i<n; i++ {
     7.         ch[i] = make(chan int, R+C);
     8.         go worker(ch[i]);
     9.     }
    10. }
    11. func res (in <-chan int) {
    12.     exit(0);
    13. }
    14. func worker (in <-chan int, out chan<- int) {
    15.     for v:= range in {
    16.         // read row and column
    17.         // multiply row by column
    18.         out <- elm_i_j;
    19.     }
    20. }
]

During the compilation process, the compiler analyses the source code to collect information on how to allocate channels. Since the communication process resembles a producer-consumer schema, the compiler needs to decide whether to allocate each buffer in the consumer's or in the producer's core local memory. Keeping in mind that input data are consumed by the local DD-Thread, while they are potentially produced by several DD-Threads running on different cores, we force the compiler to allocate buffers in the consumer's core local memory. The result of the channel allocation is illustrated in figure 2 (left): the queues associated with each node of the DFG map the communication channels.

An important aspect to consider is the decoupling of a channel declaration from its actual allocation in the local memory. The former is provided by the make keyword, which allows the compiler to keep track of the channel allocation request along with its size. During this phase, the compiler obtains a unique identifier for the channel (i.e., chID = get_chID(channel_name)), which also identifies the consumer core. The latter is performed the first time the compiler encounters the consumer function (e.g., line 5 and line 8 in figure 2). In that case, the go keyword associated with the function call is mapped by the compiler into a request for a new unique thread identifier (i.e., TID = get_tID(chID, size), which also identifies the thread frame within the assigned core) and a thread scheduling request (i.e., schedule(function, chID, TID), which copies the instruction pointer of the thread body to the destination core). It is worth noting that the size of the communication channel defines the initial value of the scheduling slot (SS) associated with each DD-Thread, since it represents the number of inputs to receive before the thread body can execute. During the compilation phase, a table is dynamically built: it maps functions and channel names to their actual unique identifiers. Figure 2 (left) also shows the structure of this table, along with the information gathered during the compilation phase. Read operations from a channel are always performed on the local memory. To ensure their correctness, every time the compiler parses the definition of a goroutine, it substitutes the channel name with the corresponding channel identifier (chID), obtained through a lookup in the mapping table. Code involving the actual read operation is substituted with the corresponding data = read(chID, offset) operation, which relies on an offset variable to correctly address the local memory during a channel read. Similarly, write operations are mapped into corresponding write(chID, TID, offset, data) operations. As in the case of reads, the compiler substitutes the output channel name found in a goroutine definition with the corresponding tuple formed by the channel identifier (chID) and a DD-Thread identifier (TID).

At the end of the compilation phase, the only missing information is the identifier of the core that will execute each DD-Thread whenever it is created. Conversely, the mapping table can be removed: the information regarding the execution core is obtained at run-time, when executing the get_chID(), get_tID(), schedule(), read(), and write() operations. The specific implementation of these operations depends on the actual implementation of the AMM. To this end, both an instruction set extension of the processor ISA and a pure software run-time library can be used. In the latter case, the library provides an implementation of such operations exploiting OS-level functions, and maps DD-Threads on OS threads (e.g., using Pthreads).
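To make the translation concrete, the following sketch replays the handling of lines 4 and 5 of figure 2. The signatures follow the paper's description of get_chID, get_tID and schedule; the bodies are illustrative stubs of ours, not the actual run-time library:

    package main

    import "fmt"

    var nextID int

    // getChID mirrors the paper's get_chID(channel_name): it returns a
    // unique channel identifier, which also designates the consumer core.
    func getChID(name string) int { nextID++; return nextID }

    // getTID mirrors get_tID(chID, size): it returns a unique DD-Thread
    // identifier; size initialises the thread's scheduling slot (SS).
    func getTID(chID, size int) int { nextID++; return nextID }

    // schedule mirrors schedule(function, chID, TID): conceptually it
    // copies the instruction pointer of the thread body to the core
    // selected by the hash function of Section 3.2 (stubbed here).
    func schedule(function string, chID, tID int) {
        fmt.Printf("schedule %s (chID=%d, TID=%d)\n", function, chID, tID)
    }

    func main() {
        // Translation of lines 4-5 of figure 2:
        //   ch1 = make(chan int, n)  becomes  chID := getChID("ch1")
        //   go res(ch1)              becomes  tID := getTID(chID, n)
        //                                     schedule("res", chID, tID)
        n := 4
        chID := getChID("ch1")
        tID := getTID(chID, n)
        schedule("res", chID, tID)
    }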
3.2 Dynamic Execution
The starting point in the application execution is the instantiation of the main thread. To this end, one of the cores is selected for its execution (e.g., core 0 can always be assumed as the main executor, without loss of generality). Every time the get_chID() and get_tID() functions are executed, they return a unique identifier that can be associated with a channel or a thread. Selecting the actual executing core requires that a unique core (cID_dst) is associated with each DD-Thread. To this end, a hash function

    cID_dst = H(chID, TID) = (chID + TID) mod K

is applied, where + denotes the concatenation operator and K is the total number of cores available for the execution. The execution context (i.e., the TID, the chID, the channel size, and the pointer to the thread code) is then transferred to the destination core (i.e., cID_dst).
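A minimal Go sketch of this hashing step follows; the paper does not fix the encoding of the concatenation, so rendering it as decimal digit concatenation is our assumption:

    package main

    import (
        "fmt"
        "strconv"
    )

    // coreOf implements cID_dst = (chID + TID) mod K, where + is the
    // concatenation operator and K is the number of available cores.
    // The concatenation is done here on decimal digits; the bit-level
    // encoding is left open by the paper and is an assumption of ours.
    func coreOf(chID, tID, k int) int {
        concat, _ := strconv.Atoi(strconv.Itoa(chID) + strconv.Itoa(tID))
        return concat % k
    }

    func main() {
        // The same (chID, TID) pair always lands on the same core, so a
        // writer can locate the channel owner without a global table.
        fmt.Println(coreOf(12, 7, 16)) // concat("12","7") = 127; 127 mod 16 = 15
    }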
The read() and write() operations rely on the same mechanism. Read: applying the hash function to the channel identifier (i.e., chID) yields the base address of the channel in the core local memory; an offset value then allows iterating over the channel entries. Write: applying H(·) to the concatenation of chID and TID determines the core that owns the channel to be written. To terminate the execution of the application, the last executing thread performs an exit(0) call, which has previously been translated into a broadcast termination request by the compiler. This broadcast request is sent to all the cores through the interconnection medium, and causes all the cores to flush their local memory content and enter an idle state (eventually, control is returned to the OS).

An important feature of the supported PXM is the capability of scheduling the execution of threads on a restricted group of cores (by tuning the value K of the hash function H(·)). Such a group of cores provides execution capabilities for DD-Threads that have a tight dependency relationship in the DFG (locality of computations [3]). In this way, the compiler can enforce the selection of a subset of the neighbours of the core executing the scheduling operation, providing an effective solution to the problem of data/computation locality. Although the Go language does not expose such a distinction at the programming level (i.e., it is the responsibility of the compiler and run-time library to allocate threads on the available cores), the compiler can apply optimisation techniques to discover scheduling operations that can benefit from a restricted core selection.
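The effect of tuning K can be pictured with a small variation of the previous sketch (the choice of a contiguous window of cores around a base identifier is our illustrative assumption; the paper does not prescribe how the restricted group is formed):

    // coreOfRestricted confines scheduling to a group of k neighbouring
    // cores starting at base, so that tightly dependent DD-Threads are
    // kept close to each other. It reuses coreOf from the sketch above.
    func coreOfRestricted(chID, tID, base, k int) int {
        return base + coreOf(chID, tID, k)
    }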
4 EXECUTION SYSTEM AND RESULTS
In this section, we discuss preliminary experimental results obtained by executing a block matrix multiply (BMM) application on a possible implementation of the proposed AMM. We also discuss how the proposed AMM functionalities can be integrated as hardware functions made available by extending a generic NoC architecture.

[Figure 3: BMM main function implemented on the proposed AMM (left), along with the AMM data structures (right). The right side shows, for each virtual core CPU_0 ... CPU_n, a pool of DD-Thread instruction pointers (IP pool), the SS counters, the frames, and an LFSR-based selector, together with the core_id and t_id values. The C listing on the left, reconstructed from the figure:

    ...
    int main (void) {
        // Pthreads initialisation and core binding
        // Input matrices: mat_A[R][C] and mat_B[R][C]
        // R := number of rows (constant)
        // C := number of columns (constant)
        int n, k, i, j, r, c;
        n = R * C;
        k = R + C;
        core_id = get_chID("ch1", -1);
        t_id = get_tID(core_id, n);
        schedule("res", core_id, t_id);
        for (i = 0; i < n; i++) {
            offset = 0;
            core_id = get_chID("ch[i]", i);
            t_id = get_tID(core_id, k + 2);
            schedule("worker", core_id, t_id);
            write(core_id, t_id + (offset++), R);
            write(core_id, t_id + (offset++), C);
            j = i % C;
            for (r = 0; r < R; r++) {
                // copy input column
                write(core_id, t_id + (offset++), mat_B[r][j]);
                ...
            }
            j = (i - j) / C;
            for (c = 0; c < C; c++) {
                // copy input row
                write(core_id, t_id + (offset++), mat_A[j][c]);
                ...
            }
        }
    }
]

The AMM has been implemented using standard Pthreads. A group of Pthreads (i.e., virtual cores) has been created in such a way that each of them is bound to a physical core on the host machine. The data structures used by the AMM are shown in figure 3 (right): a set of arrays holds the information regarding the DD-Thread contexts and frames. Specifically, each virtual core has access to a pool of instruction pointers (i.e., pointers to DD-Thread bodies), the SS counters, and space for the exchanged data (frames). In addition, two counters (managed as linear feedback shift registers, LFSRs) associated with each virtual core allow selecting the actual execution core and the thread slot. To reduce system imbalance and avoid limiting performance, the number of Pthreads was set to a multiple of the number of physical cores. The Pthreads continuously access their associated data structures, checking for a DD-Thread ready for execution. To simulate the presence of an interconnection subsystem, the data structures are maintained in the shared memory of the host, with a minimal use of locks and mutexes to protect access to the SS counters. The Go source code has been translated into an equivalent C representation, which directly integrates the AMM. Figure 3 (left) shows the result of the translation process for the Go main function. Executing the BMM application on top of the AMM implementation running on a standard 8-core Intel Xeon processor, we observed a large speedup when scaling from 2 to 128 virtual cores, although increasing the number of virtual cores beyond 32 provided smaller gains.
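The polling scheme just described can be illustrated with a Go analogue of the Pthreads-based implementation (the actual system is written in C; this sketch of ours only mirrors its structure: one goroutine per virtual core scanning SS counters under a mutex):

    package main

    import (
        "fmt"
        "sync"
    )

    // vCore scans its slice of SS counters looking for a DD-Thread whose
    // inputs have all arrived (SS == 0), mimicking the Pthreads virtual
    // cores of the C implementation; a mutex protects the SS counters.
    func vCore(ss []int, bodies []func(), mu *sync.Mutex, wg *sync.WaitGroup) {
        defer wg.Done()
        for remaining := len(ss); remaining > 0; {
            mu.Lock()
            for t := range ss {
                if ss[t] == 0 { // triggering rule: all inputs received
                    ss[t] = -1 // mark the thread slot as consumed
                    remaining--
                    mu.Unlock()
                    bodies[t]() // run the DD-Thread body
                    mu.Lock()
                }
            }
            mu.Unlock()
        }
    }

    func main() {
        ss := []int{0, 0} // both threads already received their inputs
        bodies := []func(){
            func() { fmt.Println("DD-Thread 0 fired") },
            func() { fmt.Println("DD-Thread 1 fired") },
        }
        var mu sync.Mutex
        var wg sync.WaitGroup
        wg.Add(1)
        go vCore(ss, bodies, &mu, &wg) // one virtual core for brevity
        wg.Wait()
    }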
Further performance scaling can be obtained by introducing specific communication and scheduling features in the interconnection subsystem. To this end, an extension of a highly scalable NoC can be used. In [2] the authors propose such a NoC architecture: the ISA (RISC-V based) of the processing elements attached to each router is extended with a small set of instructions, which allows the compiler to access specialised hardware functions. Also, each router is equipped with a local scratchpad memory, which serves as high-speed storage for the allocation of communication channels. Embedding a hardware version of H(·) makes it possible to speed up the scheduling of new DD-Threads, as well as to compute the core to contact when a write operation is executed.

5 CONCLUSIONS AND FUTURE WORK
The exploitation of the unprecedented level of thread parallelism in next-generation, high-performance computers requires a simple way to express parallelism in the application. Although modern programming languages support concurrent thread execution, effectively supporting fine-grain data-driven execution models is still a challenge. In this position paper, we discuss the integration of such an execution model with the Go programming language. Concurrent functions are translated into threads following the dataflow semantics and run on a highly parallel abstract machine. Preliminary results show the benefit of dataflow execution, while the description of a possible effective implementation of such an AMM provides a glimpse of the achievable performance improvement. As a future research direction, we will investigate the optimisation of the compilation process.

ACKNOWLEDGMENT
The authors thank S. Zuckerman and R. Giorgi for their initial discussion on the presented idea and for useful suggested references.

REFERENCES
[1] B. Bohnenstiehl et al., "A 5.8 pJ/Op 115 Billion Ops/sec, to 1.78 Trillion Ops/sec 32nm 1000-Processor Array", IEEE Symp. on VLSI Circuits, 2016.
[2] A. Scionti et al., "Software Defined Network-on-Chip for Scalable CMPs", IEEE HPINI, 2016.
[3] J. Suetterlein et al., "An Implementation of the Codelet Model", Euro-Par, 2013.
[4] Z. Budimlic et al., "Concurrent collections", Scientific Programming, 2010.
[5] B. L. Chamberlain, D. Callahan, H. P. Zima, "Parallel programmability and the Chapel language", IJHPCA, 21(3):291–312, 2007.
[6] P. Charles et al., "X10: an object-oriented approach to non-uniform cluster computing", OOPSLA, pp. 519–538, ACM, 2005.
[7] L. J. Hendren et al., "Compiling C for the EARTH multithreaded architecture", IEEE PACT, 1996.
[8] R. Giorgi, A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles", Future Generation Computer Systems, Vol. 53, Dec. 2015.
[9] H. H. J. Hum, O. Maquelin, K. B. Theobald et al., "A Study of the EARTH-MANNA Multithreaded System", Int. J. Parallel Prog., 1996.
[10] S. Zuckerman, A. Landwehr, K. Livingston, G. Gao, "Toward a Self-Aware Codelet Execution Model", ACM DFM, 2014.
[11] The Go Programming Language, https://golang.org/
