1 INTRODUCTION

Current silicon technology allows the integration of 100+ cores into a single die [1], and offers support to execute an enormous number of concurrent threads. The ability to exploit a high level of parallelism mainly depends on three factors: i) an inherently highly parallel environment for executing threads, ii) the availability of an effective communication and synchronisation mechanism, and iii) the capability of exposing parallelism at the programming level. All these factors play a key role in the implementation of an efficient, high-performance system.

Still, there exists a gap in providing an effective link between the architectural and the programming level. High-level languages (such as C++11, Java and Haskell) allow the user to express concurrent execution without resorting to external run-time libraries. Conversely, programming environments such as OpenMP and MPI provide language extensions (i.e., run-time libraries) to ease the parallelization activity. However, thread execution and synchronisation remain generally confined to the large use of locks and mutexes over a von Neumann execution model, which strongly limits the system performance. Instead, in a dataflow-inspired Program eXecution Model (PXM), threads are executed concurrently once their input data are available.

2 BACKGROUND

2.1 Fine-grain data-driven PXMs

Modern multi-/manycore processors provide the substrate for the execution of data-driven multi-threaded applications. Data-driven self-scheduling PXMs prefer a fine-grain threading model to coarse-grain threads. Such threads are composed of a few hundred instructions, following a producer-consumer communication schema. A private block of memory (frame) is used to store the inputs required to run each thread, along with the scheduling slot (SS) – a counter storing the number of inputs still not received – and a pointer to the thread code. Every time a thread (producer) sends data to another thread (consumer), the SS of the receiver thread is decremented. The triggering rule assumes that a thread becomes runnable once its SS is reduced to zero. Also, threads are allowed to explicitly schedule new threads, by properly initializing the SS and passing the instruction pointer. We call threads adhering to this model data-driven threads (DD-Threads).

Hardware extensions can be integrated into the processor design to increase the performance of multi-/manycore processors supporting data-driven PXMs. Recently, the Teraflux project [8] proposed such hardware extensions.
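The triggering rule described above can be sketched in Go (the language used for the source programs later in the paper). This is a minimal single-threaded sketch, not the paper's implementation: the Frame type and the write helper are hypothetical names, and only the frame/SS/triggering semantics come from the text.

```go
package main

import "fmt"

// Frame is the private block of memory of a DD-Thread: its inputs,
// the scheduling slot (SS) counting inputs still not received, and
// a pointer to the thread code.
type Frame struct {
	inputs []int
	ss     int         // scheduling slot: inputs still missing
	code   func([]int) // thread body, fired when ss reaches 0
}

// write models a producer sending one datum to a consumer frame:
// the value is stored and the consumer's SS is decremented; once
// the SS reaches zero the triggering rule makes the thread runnable.
func write(f *Frame, slot, value int) {
	f.inputs[slot] = value
	f.ss--
	if f.ss == 0 {
		f.code(f.inputs)
	}
}

func main() {
	// A consumer expecting two inputs, so its SS is initialised to 2.
	consumer := &Frame{
		inputs: make([]int, 2),
		ss:     2,
		code:   func(in []int) { fmt.Println("sum:", in[0]+in[1]) },
	}
	write(consumer, 0, 3) // first producer write: SS 2 -> 1
	write(consumer, 1, 4) // second write: SS 1 -> 0, thread fires
}
```

A real run-time would enqueue the runnable thread on a core instead of firing it inline; the inline call only keeps the sketch self-contained.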
CF'17, May 15-17, 2017, Siena, Italy. A. Scionti et al.
[Figure: Compilation of a Go program into DD-Thread code. The source code (left) is translated (right); a mapping table links each channel name (Ch-name) to its size and a (chID, TID) tuple; each core (CPU_0 ... CPU_n) hosts DD-Thread frames, selected through an IP pool, an LFSR-based selector and the SS counters; the frame shown collects a column of mat_B (mat_B[0][j], mat_B[1][j], ...) and a row of mat_A (mat_A[j][0], ...).]

Mapping table:

    Ch-name   Size   chID   TID
    ch1       n      ID1    TID1
    ch[0]     R+C    ID2    TID2
    ch[1]     R+C    ID3    TID3

Go source code:

    func main() {
        ...
        // n = R*C;
        ch1 = make(chan int, n);
        go res(ch1);
        for i := 0; i < n; i++ {
            ...
            ch[i] = make(chan int, R+C);
            ...
        }
        ...
    }

Translated code:

    // Pthreads initialisation and core binding
    // Input matrices: mat_A[R][C] and mat_B[R][C]
    // R := number of rows (constant)
    // C := number of columns (constant)
    int n, k, i, j, r, c;
    n = R * C;
    k = R + C;
    core_id = get_chID("ch1", -1);
    t_id = get_tID(core_id, n);
    schedule("res", core_id, t_id);
    for (i = 0; i < n; i++) {
        offset = 0;
        core_id = get_chID("ch[i]", i);
        t_id = get_tID(core_id, k + 2);
        ...
        write(core_id, t_id + (offset++), C);
        j = i % C;
        for (r = 0; r < R; r++) {
            // copy input column
            write(core_id, t_id + (offset++), mat_B[r][j]);
            ...
        }
        j = (i - j) / C;
        for (c = 0; c < C; c++) {
            // copy input row
            write(core_id, t_id + (offset++), mat_A[j][c]);
            ...
        }
    }

In the case of reads, the compiler substitutes the output channel name found in a goroutine definition with the corresponding tuple formed by the channel identifier (chID) and a DD-Thread identifier (TID). At the end of the compilation phase, the only missing information is the identifier of the core that will execute DD-Threads.

4 EXECUTION SYSTEM AND RESULTS

In this section, we discuss preliminary experimental results obtained by executing a block matrix multiply (BMM) application, and show how the proposed AMM functionalities can be easily integrated as hardware functions made available by extending a generic NoC architecture. The AMM has been implemented using standard Pthreads. A group of Pthreads (i.e., virtual cores) has been created in such a way that each of them is bound to a physical core on the host machine.