Abstract
In this paper we discuss our initial experiences adapting
OpenMP to enable it to serve as a programming model for
high performance embedded systems. A high-level programming model such as OpenMP has the potential to increase programmer productivity, reducing design and development costs and time to market for such systems. However,
OpenMP needs to be extended if it is to meet the needs of
embedded application developers, who require the ability to
express multiple levels of parallelism, real-time and resource
constraints, and to provide additional information in support
of optimization. It must also be capable of supporting the
mapping of different software tasks, or components, to the
devices configured in a given architecture.
1. Introduction
The rapid penetration of multicore and simultaneous
multithreading hardware technology in the marketplace is
changing the nature of application development in all areas
of computing. The use of multicore technology, which offers
more compute power with lower energy consumption, will
increase the design space and significantly increase the
performance of Multiprocessor Systems-on-Chip (MPSoCs)
that are targeted at embedded applications. Software for
embedded systems will become more complex, as multicore hardware will enable more functions to be implemented
on the same device. This complexity will make the production of software the critical path in embedded systems
development, and make hardware/software design, partitioning, scheduling, correctness verification, quality control, and quality assurance more difficult.
In this paper, we discuss our initial work to adapt an
existing shared memory programming interface, OpenMP,
to enable the specification of high performance embedded
applications. We are considering new language features that
allow non-functional characteristics or constraints, such as
deadlines, priorities, and power constraints, to be specified, as well as notation to enable a high-level description of the
This work is supported by the National Science Foundation under contract CCF-0833201 and by Texas Instruments.
3.2. OpenUH
The work described here makes use of the OpenUH [15]
compiler, a branch of the open source Open64 compiler suite
for C, C++, and Fortran 95 developed at the University
of Houston. OpenUH contains a variety of state-of-the-art
analyses and transformations, sometimes at multiple levels.
We have implemented OpenMP 2.5 for all three input languages
in OpenUH, and designed and implemented novel extensions
that provide greater flexibility and scalability when mapping
work to many CPUs or cores [19]. OpenMP is lowered relatively early in the translation process to enable optimization
of the explicitly parallel code. The output makes calls to our
runtime library, which is based upon PThreads. OpenUH
provides native code generation for IA-32, IA-64 and
Opteron architectures. It has a source-to-source translator
that translates OpenMP code into optimized portable C code
with calls to the runtime library, which enables OpenMP programs to run on other platforms. Its portable, multithreading
runtime library includes a built-in performance monitoring
capability [20]. Moreover, OpenUH has been coupled with
external performance tools [21]. It is directly and freely
available via the web (http://www.cs.uh.edu/openuh).
converter(s1, s2)
{
    #pragma omp master
    {
        #pragma omp parallel
        {   // create M tasks
            // define priority
            DoSthElse();
            for (int i = 0; i < N; i++)
                s2[i] = process(s1[i]);
        }
    }
}

[Figure: C64x+-based MPSoC architecture. C64+ cores 0 through N-1, each with its own L1D and L1P caches and a local L2, are connected to a shared L2 memory. Within each core, instruction dispatch and decode units feed two data paths (A and B), each with its own register file and functional units (.L1, .S1, .M1, .D1 and .L2, .S2, .M2, .D2).]
5.2. Compiler
OpenUH implements OpenMP on our multicore MPSoC
by performing a source-to-source translation from C code
with OpenMP directives to explicit multithreaded C code including calls to our OpenMP runtime library (see Figure 4).
[Figure 4: Compilation and execution flow. Sequential C/C++ source code with OpenMP directives/pragmas (app.c) passes through the OpenUH compiler infrastructure (front-end, IPA, optimizer, whirl2c) to produce parallel C/C++ source code containing calls to the OpenMP runtime library (app.w2c.c). The cl6x compiler builds app.o, and lnk6x links it against the OpenMP runtime library (libopenmp.so, which in turn calls the POSIX pthread library for DSP/BIOS, libpthread.so) into an application image containing multiple DSP/BIOS tasks (app.out). At run time, the master (Core #0) creates and initializes the slave threads on Cores #1 through #5, sends each slave a micro_task context with MSGQ_put(), the slaves execute micro_task() for each parallel region, and the master collects completion messages with MSGQ_get().]
6. Results
We conducted a few experiments to assess the performance of various implementation choices for the lightweight runtime library. We executed a simple OpenMP matrix multiplication program and a dot product (which requires a reduction into a temporary variable) on a C64x+-based MPSoC.
While the implementation of the synchronization (fork,
join, and barrier) using the MSGQ represents an elegant and
[Figure: Speedup ratio versus number of cores for the dot product (N = 16384 and N = 4096, using a coherent shared buffer for fork and join) and an [M,N]*[N,K] matrix multiply (M = N = K = 64), compared with ideal speedup.]
7. Conclusions
We believe that, with suitable extensions, OpenMP might
become a suitable and productive programming model for
embedded systems. We have begun to explore a variety of
enhancements for this purpose. High performance embedded systems lack features that are commonly found in general purpose processors, such as memory management units, coherent caches, and uniform memory access times.
We are in the process of extending OpenUH to enable it
to be used as an OpenMP compiler for a NUMA multicore
embedded DSP system that is not cache coherent, and runs
a lightweight embedded OS. This work is in its early stages.
Our next steps will be to fully implement the current OpenMP specification and to further explore language extensions for embedded systems.
References
[1] Texas Instruments, TMS320DM6467 Digital Media System-on-Chip, December 2007.
[20] V. Bui, O. Hernandez, B. Chapman, R. Kufrin, P. Gopalkrishnan, and D. Tafti, "Towards an implementation of the OpenMP Collector API," in PARCO, 2007.