
TECHNISCHE UNIVERSITÄT MÜNCHEN

FAKULTÄT FÜR INFORMATIK
Software & Systems Engineering
Prof. Dr. Dr. h.c. Manfred Broy

SPES 2020 Deliverable 1.3.B-7

Concurrency Analysis and Transformation


An Overview

Author: Wolfgang Schwitzer
Version: 1.3
Date: April 11, 2011
Status: Released

Technische Universität München - Fakultät für Informatik - Boltzmannstr. 3 - 85748 Garching

Version History
Version 0.1, Draft, 12.10.2010
Schwitzer: Initial structure and contents.
Version 0.2, Draft, 22.10.2010
Schwitzer: Included feedback from Prof. Broy.
Schwitzer: First version of the Analysis section.
Version 0.3, Draft, 14.12.2010
Schwitzer: Introduced Software Engineering Questions concerning Multicores.
Version 1.0, Draft, 05.01.2011
Schwitzer: First version for review by SPES partners.
Version 1.1, Draft, 06.01.2011
Schwitzer: Consistent use of term overlapping schedule.
Version 1.2, Draft, 12.01.2011
Schwitzer: Changed conclusions on retiming.
Version 1.3, Reviewed, 11.04.2011
Schwitzer: Included reviewer comments.

ABSTRACT. This document gives an overview of concurrency-related analysis and transformation techniques. These techniques can be applied to support software engineering for embedded systems with parallel processors. For several years there has been an ongoing paradigm shift from singlecore towards multicore processors. Employing multicore architectures for embedded systems brings several advantages, but poses new challenges for software engineering. This document focuses on software engineering aspects related to programming embedded systems with parallel processors. First, related work and the scope of this document are discussed. Iterative data-flow is presented as the model of computation that is used throughout this document. The main part of this document comprises analysis, metrics, and transformation of these data-flow models. A tool integration of these techniques is illustrated. The document concludes with a brief overview of ongoing research and future work.
Acknowledgements. The author thanks Manfred Broy, Martin Feilkas and Tobias Schüle for their helpful comments and fruitful discussions about conceptual and technical topics during the writing of this document. Additional thanks go to Sebastian Voss for introducing the author to his outstanding work on SMT-based scheduling methods. Florian Hölzl and Vlad Popa have the author's fullest respect for investing so much of their time and excellent programming skills in the tools AutoFocus and Cadmos that are used to evaluate and discuss many of the topics covered in this document. This work was mainly funded by the German Federal Ministry of Education and Research (BMBF), grant SPES2020, 01IS08045A.

Contents
1 Introduction
2 Related Work and Scope
  2.1 Iterative Data-Flow
  2.2 Scheduling
  2.3 Causality and Timing
  2.4 Composition
3 Concurrency Analysis
  3.1 Concurrency, Parallelism and Precedence
  3.2 Weakly Connected Components
  3.3 Strongly Connected Components and Cycles
  3.4 Iteration Bound
  3.5 Delay Profiles
4 Concurrency Metrics
  4.1 Software-concerned metrics
    4.1.1 Spatial concurrency
    4.1.2 Temporal concurrency
    4.1.3 Data-parallelism
  4.2 Hardware-concerned metrics
    4.2.1 Speed-up
    4.2.2 Efficiency
    4.2.3 Utilization of resources
  4.3 Deployment-concerned metrics
    4.3.1 Frequency
    4.3.2 Reactiveness
    4.3.3 Jitter-robustness
5 Concurrency Transformation
  5.1 Unfolding-Transformation
  5.2 Retiming-Transformation
  5.3 Look-Ahead-Transformation
6 Tool-Integration
7 Conclusion and Future Work
References

1 Introduction
Motivation. For several years there has been an ongoing paradigm shift in the silicon industry from singlecore towards multicore processors. Multicore processors are already in widespread use in desktop and server systems. Many next-generation embedded systems are likely to be built upon multicore processors, too. Employing multicore architectures for embedded systems brings several advantages. For example, the overall number of embedded controllers per system can be reduced and parallel computing performance is increased. Still, moderate (electrical) power consumption can be maintained. However, the leap into the multicore era poses novel challenges for several disciplines, e.g. for electrical engineering, semiconductor production processes, education and software engineering [ABC+ 06].
This document focuses on the software engineering aspects related to programming embedded systems with parallel processors. Engineering software for distributed embedded systems is a complex and challenging task [Bro06b]. Software engineering efforts are going to become even more demanding with the introduction of platforms that offer highly parallel processing capabilities. In particular, it is important to be aware of an application's concurrency throughout the software development process. Concurrent parts of applications are deployed onto concrete multicore-based hardware and finally yield parallelism in application execution.
Awareness of an application's concurrency and adequate deployment with respect to platform parallelism are key factors to leverage the potential of parallel architectures.
Software engineering questions concerning parallel embedded software systems. Some typical questions arise along the software engineering process for parallel architectures. In distributed systems, concurrency and parallelism are found on different levels of granularity [Bod95]. This document emphasizes software engineering questions that arise when coarse-grain software concurrency is deployed onto coarse-grain hardware parallelism. The technical architecture [TRS+ 10] of a distributed embedded system comprises coarse-grain parallel structures like controller networks connected by gateways, embedded control units (ECUs) connected by field buses, CPUs and input/output controllers connected by on-board buses, and processor cores connected by on-chip interlink buses. The software architecture [Bro06a] of embedded systems is commonly described in terms of coarse-grain concurrent structures like independent or coupled applications, tasks and subtasks communicating via channels (e.g. pipes and shared variables). This software has to be deployed on the hardware. Hence, stakeholders (e.g. software and hardware engineers, managers) might ask the following questions when developing a distributed and highly parallel embedded software system:
Q1 What does an adequate parallel hardware platform for a given concurrent software architecture look like?
Q2 What does an adequate concurrent software architecture for a given parallel technical architecture look like?
Q3 Given a software architecture and a technical architecture, what does an adequate deployment of concurrent software components on parallel hardware components look like?

In these questions (Q1-Q3), the meaning of the term adequate strongly depends on the design goals for the overall system. Thus, in a highly reactive system, adequate can mean with the shortest possible response times. In a safety-critical system, adequate can mean with the highest possible robustness against timing jitter on buses. In a system produced for high-volume markets, adequate can mean with the lowest possible cost of hardware units.
Metrics for parallel embedded software systems. Unfortunately, for complex real-world systems there usually does not exist such a distinct definition of the term adequate. Rather, there are several competing design goals, which require trade-offs to be made. Hence, it is important to have a set of system metrics at hand that assist in answering questions Q1-Q3 in a comprehensive way. The following is a summary of the metrics discussed in section 4 of this document. These metrics answer questions either about the software, the hardware, or the deployment of the system:
Software-concerned metrics:
Spatial concurrency of applications
Temporal concurrency of applications
Data-parallelism achievable by stateless parts of applications
Hardware-concerned metrics:
Speed-up gained by investing in parallel processing power
Efficiency (average utilization of parallel processing capabilities)
Quantity and absolute utilization of resources (cores, buses, etc.)
Deployment-concerned metrics:
Frequencies of the system (software-side and hardware-side)
Response times (end-to-end delays from sensors to actuators)
Robustness against timing jitter on buses and distributed cores
Different levels of detail along the development process. Most of these metrics can be
expressed on different levels of detail. In early phases of development, there is usually less
detail available. In late phases of development, details about the software architecture, the
software implementation and the hardware platform are at hand. Depending on how much
information about the concrete software architecture and technical architecture is available,
metrics can be expressed within the following levels of detail: uniform computation time,
arbitrary integer computation times, arbitrary real computation times. These levels of detail
have also been used by Xu and Parnas [XP93] for classifying scheduling algorithms.
In uniform computation time (also referred to as unit-time) each software task is considered to consume one unit of time for processing. In particular, this means that the time consumed for processing a task is independent of the implementation of the task. Hence, metrics in uniform computation time can already be retrieved in early stages of the development process. In early stages, a first software architecture (a decomposition into applications, tasks and channels) may be available, though the implementation can be incomplete. Additionally, a concrete hardware setup (number of processors, buses, etc.) need not be known to perform analyses in uniform time. Rather, a parallel random access machine (PRAM [FW78]) is chosen as hardware platform, providing an unlimited number of processors and unlimited communication bandwidth. Communication latencies between processors are not considered in uniform time, that is, communication is considered to come at zero cost. In summary, uniform computation time analysis draws the picture of an ideally parallelizable system, which can appear significantly better than what is achievable with the actual software implementation and hardware platform.
At the next level of detail, uniform computation time is refined to arbitrary integer computation times. Here, first estimations of processing times and communication times exist. Processing and communication times are interpreted relative to each other. For example, a task A takes one unit of time to process, a task B takes two units of time to process, and any communication over a channel from A to B takes three units of time. Again, a PRAM is chosen as hardware platform. In contrast to uniform time, communication here comes at the given relative cost. Metrics gained in arbitrary integer computation times give a refined picture of the concurrency and parallelism quality of the system under design. The more detail on implementation and potential hardware is used for estimation, the more closely the metrics can reflect the properties of the final system.
Arbitrary real computation times analysis is the most detailed level. Here, the detailed software implementation and the hardware platform are known and, in consequence, the worst-case execution time (WCET) for each task and communication operation is available. If a task can be scheduled on several different processors, the task's WCET on each of these processors is available. If a communication operation can be scheduled on several different buses, the communication operation's WCET on each of these buses is available. Metrics gained on the level of real computation times reflect the deployed system's concurrency and parallelism properties as closely as possible.
Figure 1: The contents of this document: presentation, analysis, metrics and transformation of multicore-based embedded software systems.

Goal and contents of this document. The goal of this document is to provide a (non-exhaustive) overview of concurrency analysis and transformation techniques that support the software engineering process for multicore-based embedded software systems. Basically all of the techniques discussed are well-known and have been thoroughly studied by others, so this document merely gives an overview and outlines how these isolated methods can be combined.
Figure 1 illustrates the contents of this document: presentation, analysis, metrics and transformation of concurrent applications. First, applications are presented by appropriate models that already expose the concurrency present in these applications. In the context of this document, data-flow programs (and their respective data-flow graphs) are employed for this purpose. Second, these data-flow programs are analyzed with graph analysis techniques to retrieve raw properties regarding concurrency. Third, these raw properties are put into relation with each other in order to gain metrics regarding concurrency or parallelism. Fourth, the models can be transformed to achieve improved concurrency and parallelism if the quality metrics do not satisfy the design goals of the system.
Outline. This document is structured as follows. Section 2 discusses related work and sets the scope for this document: iterative data-flow is presented as the model of computation used throughout this document, some constraints on scheduling are set, the strong alignment between causality and data-flow is reviewed, and fundamental functional composition operators are introduced. The main part of this document comprises sections 3, 4, and 5. Analysis techniques on iterative data-flow models are presented, metrics are derived that are used for assessing the quality of concurrency in those models, and three well-studied transformation techniques are illustrated: namely unfolding, retiming and look-ahead. A tool integration of these techniques is illustrated in section 6. The document concludes with section 7, giving an outline of ongoing research and future work.

2 Related Work and Scope


This section discusses related work and sets the scope and constraints for the following sections of this document. First, iterative data-flow is presented as the model of computation,
which is used throughout this paper to describe concurrent programs. Second, constraints
and assumptions on scheduling techniques within this document are given. Third, the strong
alignment between the concepts of causality and data-flow is reviewed and a translation from
causal systems to data-flow systems is sketched. Finally, fundamental functional composition
operators related to concurrent and sequential evaluation of functions are introduced.
2.1 Iterative Data-Flow
History, application and scope. This section briefly introduces iterative data-flow as a model
to describe concurrent programs. The analyses and transformations presented throughout this
document refer to iterative data-flow models. In a constructive approach, data-flow models
represent a visual coordination language [Lee06] for implementing concurrent programs. In a
more analytic approach, these models can be derived from other sources, e.g. models from
MATLAB/Simulink [Mat], AutoFocus [Aut] or from C/C++ -like source code [Sch09]. In
general, data-flow models are well-known to expose application concurrency ([Rei68], [Den80],
[DK82], [LM87b], [LM87a], [PM91]) and go back to the work of Karp and Miller [KM66] who
showed determinacy of data-flow systems. A good introduction to iterative data-flow can be
found in [PM91] and concepts of general data-flow modeling are illustrated well in [DK82].

Program 1:
for (t = 1 to ∞) {
    z(t) = 0.5 · (x(t) + y(t))
}

Figure 2: Iterative data-flow. (a) A simple nonterminating program, assigning the arithmetic mean of x(t) and y(t) to z(t) for each given t. (b) A data-flow graph
that represents a nonterminating program with infinite input series x(t) and y(t),
output series z(t), and two tasks. The dashed circles x, y and z are interface
points to the environment, that is x and y are environment inputs and z is an
environment output.

Embedded systems programming. Iterative data-flow offers several features that are desirable for embedded systems programming. Deadlock and boundedness can be statically analyzed. Capacities of FIFO queues on sender- and receiver-side are finite and can be statically
derived as well. The execution order can be statically scheduled, i.e. at compile time, which
minimizes runtime scheduling overhead or, as Lee and Messerschmitt put it, most of the
synchronization overhead evaporates [LM87a]. Static scheduling can be important for safety
critical and hard real-time embedded systems, where certification processes may require the
use of static techniques.
Data-flow programs and data-flow graphs. Within the context of this document, we consider data-flow programs that are iterative and nonterminating. This reflects the intention to use these programs for embedded software systems, which execute the same tasks repeatedly (iterative nature) and for which it is not known a priori when the system is going to terminate (nonterminating nature). Data-flow programs are visualized by data-flow graphs (DFGs). In data-flow graphs, vertices represent tasks and edges represent directed communication between tasks. According to Lee and Messerschmitt [LM87a], a data-flow program is called synchronous if the number of messages transmitted per unit of time is known at compile time. Furthermore, a synchronous data-flow (SDF) program is called homogeneous if the number of messages transmitted per unit of time is equal for all edges, or heterogeneous otherwise.
A special case of homogeneous SDF is called iterative data-flow, which is restricted to a homogeneous constant rate of 1 message transmitted per unit of time on each of the edges. An iterative data-flow program processes inputs and produces outputs by executing all tasks repeatedly. Inputs from and outputs of these programs are infinite time series. During a single iteration, each task executes exactly once, consuming one token from each of its input channels and producing one token on each of its output channels.
An example of a simple iterative nonterminating program that operates over infinite time series is given in figure 2(a). Any execution of the for-loop (over t) in the nonterminating program corresponds to one iteration. The execution time of an iteration is called the iteration period of the program. In the example program 2(a), the iteration period is the time required to execute one addition operation plus one multiplication operation.
Figure 2(b) shows a data-flow graph that corresponds to this program, where one task represents the (2, 1)-ary addition operation and one task represents the (1, 1)-ary constant multiplication by a factor of 1/2. The infinite input and output series x(t), y(t) and z(t) are represented by edges of the graph. In the iterative data-flow model of computation, each task consumes a single token from each of its incoming series and produces a single token on each of its output series. Synonyms for token are sample and message. The iteration period is also called sampling period, with the reciprocal being the iteration rate or sampling rate.
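To make the execution model concrete, the following minimal Python sketch (not part of the original tool chain, purely illustrative) simulates the program of figure 2 over a finite prefix of the infinite input series; each pass of the loop is one iteration in which both tasks fire exactly once and every channel carries exactly one token.

# Minimal simulation of the iterative data-flow program of figure 2.
def add(a, b):          # the (2, 1)-ary addition task
    return a + b

def half(a):            # the (1, 1)-ary constant multiplication by 1/2
    return 0.5 * a

def run(xs, ys):
    """One loop pass per iteration: consume one token per input, emit one output token."""
    zs = []
    for x_t, y_t in zip(xs, ys):       # t = 1, 2, 3, ... (finite prefix of the series)
        token = add(x_t, y_t)          # token on the internal channel
        zs.append(half(token))         # token on the environment output z
    return zs

print(run([1, 2, 3], [3, 4, 5]))       # [2.0, 3.0, 4.0]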
Note that in general the concept of tasks in data-flow programs is not limited to basic arithmetic operations as in figure 2(b). In general, a task may describe functions of arbitrary complexity that can be nonlinear and time-varying, e.g. conditional constructs (if-then-else), state automata, or complete sub-tasks specified by data-flow programs themselves. However, in the context of this document we treat the tasks of a DFG as black boxes. Here, these black boxes perform (n, m)-ary functions and we do not discuss topics related to internal structuring or hierarchical composition of tasks as architectures.



Figure 3: Delays, pipelining and feedback in DFGs. Environment inputs and outputs are
intentionally hidden to increase readability. (a) A DFG with tasks A, B, and C
and two distinct delay operators D, forming a three-stage parallel pipeline. (b)
A DFG with tasks A, B, and C located on a feedback cycle with one delay.
Delays, pipelining, feedback and parallelism. Many algorithms require tokens to be delayed,
so that these tokens can be used in future iterations. For this purpose, data-flow graphs allow
for specification of delay operators. A software implementation of a delay operator could be
a variable or a FIFO queue. Delays are typically called registers or latches in electrical
engineering literature. Figure 3(a) shows a DFG with three tasks A, B, and C and two distinct
delay operators D. Inputs from and outputs to the environment are intentionally not shown
for better readability.
The system in figure 3(a) can be executed in three logical units of time if tasks are scheduled
sequentially as A; B; C. Delays in a data-flow system can be leveraged to reduce the total
execution timespan, since they introduce temporal concurrency and allow for pipelining parallelism on a parallel platform. A good introduction to pipelining in microprocessors can be
found in [LS11], p. 190 and following pages. Scheduling the tasks A, B, and C for execution
as parallel pipeline stages, each on an individual processor, can reduce the timespan required
to execute all three tasks. An example is given in section 4.1.2. In the context of iterative
data-flow, the timespan required to execute all of the tasks once is called the iteration period.
Delays and pipelining can be leveraged to increase the iteration rate of data-flow programs, in the same way that pipelining is used in VLSI systems to increase the frequency of microprocessors, for example.
As explained in more detail in sections 3.3 and 3.4, feedback cycles can pose a lower limit on
the iteration period, called the iteration bound. The iteration bound can severely limit the
achievable parallelism. Figure 3(b) shows a DFG with three tasks A, B, and C located on a
feedback cycle that has one delay operator.
2.2 Scheduling
Within this document, scheduling is concerned with the construction of non-preemptive, static schedules. Preemption is not allowed, that is, once a task is started, it runs to completion and cannot be interrupted by other tasks. Schedules are constructed statically at compile time; to be more precise, the schedules are fully static. Cyclo-static scheduling is not covered in the context of this document. Non-overlapping as well as overlapping schedules are presented, the former being associated with a single iteration, the latter being associated with multiple subsequent iterations in order to exploit inter-iteration parallelism.
Schedules are concerned with the timing of schedulable units. Schedulable units are tasks and communication operations among those tasks. Schedules can be given in logical units of time or in physical time. In logical time, each schedulable unit takes n ≥ 1 units of time for execution; in physical time, each schedulable unit has a deterministic worst-case execution time (WCET), known at compile time.
2.3 Causality and Timing
Closely related to the concepts of iterative data-flow (see Sec. 2.1) are the concepts of causality and timing. A straightforward translation scheme from causal models to iterative data-flow models is sketched here. A more detailed discussion on this topic is out of the scope of this document.

Figure 4: Translation schemata from weakly and strongly causal components to iterative data-flow graphs. (a) Translation of a weakly causal component Fw. (b) Translation of a strongly causal component Fs.

Once a causal model is translated, the techniques presented in this document can be applied to it, too. Hence, it is possible to interpret analysis and transformation of concurrency
from a causality-centered point of view. For example, the CASE-Tool AutoFocus [Aut],
presented later in section 6, supports modeling of embedded systems software with causal
components and a notion of global logical time. AutoFocus is based on a theory of stream-processing
Here, components roughly correspond to tasks, and causality to delay operators. A global tick
corresponds to one iteration of a data-flow program.
Figure 4 sketches translation schemata for weakly and strongly causal components. Figure 4(a)
illustrates how a weakly causal component with behavior function Fw , syntactic input interface
I = {i1 , . . . , in }, and syntactic output interface O = {o1 , . . . , om } translates to a task that implements function Fw and has input series i1 (t), . . . , in (t), and output series o1 (t), . . . , om (t).


Figure 4(b) shows how a strongly causal component with behavior function Fs, syntactic input interface I = {i1, ..., in}, and syntactic output interface O = {o1, ..., om} translates to a task that implements function Fs and has input series i1(t), ..., in(t), and output series o1(t), ..., om(t). Each of the output series has a dedicated single unit-delay operator D. The output unit-delay operators satisfy the strongly causal nature of Fs: the outputs of Fs at the next point of time t + 1 depend causally on the inputs to Fs at the current point of time t. In general, a k-causal component, k ≥ 0, demands the outputs at the k-next point of time t + k to depend causally on the inputs at the current point of time t. Consequently, a k-causal component requires k unit-delays on each output.
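The following Python sketch illustrates this translation scheme under stated assumptions: the Delay class, its initial value (0 here) and the helper name make_strongly_causal_task are illustrative choices of this sketch, not constructs of the original document or of AutoFocus.

# A hedged sketch of the translation in figure 4(b): a strongly causal component
# Fs becomes a task whose outputs each pass through one unit-delay operator D.
class Delay:
    """Unit-delay operator D: emits the token received in the previous iteration."""
    def __init__(self, initial=0):
        self.stored = initial

    def step(self, token):
        out, self.stored = self.stored, token
        return out

def make_strongly_causal_task(fs, m, initial=0):
    """Wrap an (n, m)-ary function fs with one unit-delay per output series."""
    delays = [Delay(initial) for _ in range(m)]
    def task(*inputs):
        outputs = fs(*inputs)
        return tuple(d.step(o) for d, o in zip(delays, outputs))
    return task

# Example: a component that forwards its input; the effect of the input at time t
# becomes observable on the output only at time t + 1.
forward = make_strongly_causal_task(lambda x: (x,), m=1)
print([forward(x) for x in [1, 2, 3]])   # [(0,), (1,), (2,)]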
2.4 Composition
Functions are composed by three basic composition operators: sequential, concurrent and
recurrent composition, as illustrated in figure 5. Functional composition is fundamental for the
techniques presented in the following sections. A more detailed discussion on this topic is found
in [BDD+ 92] and [Bro95], for example. The following paragraphs introduce the composition
operators, where a task with n inputs and m outputs is called an (n, m)-ary function.


Figure 5: Forms of composition: (a) sequential, (b) concurrent and (c) recurrent.

Sequential composition. The two functions f and g have to be evaluated in sequence, since the outputs of f are the inputs of g. Let f be an (n, m)-ary function and let g be an (m, o)-ary function. Then f ∘ g is the (n, o)-ary function defined by
(f ∘ g)(x1, ..., xn) = g(f(x1, ..., xn)) .
Concurrent composition. The two functions f and g can be evaluated independently from each other. Let f be an (n, m)-ary function and let g be an (o, p)-ary function. Then f ∥ g is the (n + o, m + p)-ary function defined by
(f ∥ g)(x1, ..., xn+o) = (f(x1, ..., xn), g(xn+1, ..., xn+o)) .


Recurrent composition. The function f feeds one of its inputs from one of its outputs, thus f is defined recursively. Let f be an (n, m)-ary function where n > 0. Then μf is the (n − 1, m)-ary function such that the value of (μf)(x1, ..., xn−1) is the (least) fixed point of the equation
(y1, ..., ym) = f(x1, ..., xn−1, ym) .
The μ-operator used above feeds back the m-th output channel of an (n, m)-ary function with n > 0.
Note that using recurrent composition introduces the difficult problem of solving fixed point equations. Moreover, for a general recurrent function a fixed point may not even exist, thus making this function non-computable. It has been shown that delayed recurrent structures are guaranteed to have a fixed point (see [BS01] for an overview). In the context of this document, we define computable recurrent structures to have at least one unit-delay on each recurrent path.
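The following Python sketch renders the three composition operators as higher-order functions. It is a minimal illustration under assumptions of this sketch: functions take and return tuples, and recurrent composition is shown only in its computable form, i.e. with a unit-delay (initialized here to 0) on the feedback path instead of a general fixed-point solver.

# Sequential, concurrent and recurrent composition as higher-order functions.
def seq(f, g):
    """Sequential composition f ∘ g: the outputs of f are the inputs of g."""
    return lambda *xs: g(*f(*xs))

def par(f, g, n):
    """Concurrent composition f ∥ g: f consumes the first n inputs, g the rest."""
    return lambda *xs: f(*xs[:n]) + g(*xs[n:])

def rec(f, initial=0):
    """Recurrent composition μf with one unit-delay on the feedback path:
    the last output of the previous iteration feeds the last input of f."""
    state = [initial]
    def composed(*xs):
        ys = f(*xs, state[0])
        state[0] = ys[-1]
        return ys
    return composed

inc = lambda x: (x + 1,)
dbl = lambda x: (2 * x,)
print(seq(inc, dbl)(3))             # (8,)
print(par(inc, dbl, 1)(3, 4))       # (4, 8)
acc = rec(lambda x, s: (x + s,))    # running sum via delayed feedback
print([acc(x) for x in [1, 2, 3]])  # [(1,), (3,), (6,)]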


3 Concurrency Analysis
This section presents a number of concurrency analysis techniques that can be performed on iterative data-flow graphs (DFGs, see Sec. 2.1). More formally, an iterative data-flow graph G is defined as a tuple G = {V, E}, with V being a set of vertices and E being a set of directed edges E = {(v1, v2) : v1, v2 ∈ V}. Here, the vertices V represent tasks and the edges E represent directed communication channels. Since multiple edges can exist between two vertices, G belongs to the class of directed multigraphs. The following paragraphs define some useful terms for analyzing concurrency in data-flow graphs.
Definition 3.1. The precedence relation ≺ ⊆ V × V defines whether a constraint on the order of execution of two vertices exists. If v1 ≺ v2 (read: v1 precedes v2) holds for two vertices v1, v2 ∈ V, then v1 has to be scheduled before v2.
Definition 3.2. The concurrency relation ∥ ⊆ V × V defines whether there does not exist a constraint on the order of execution of two vertices. It complements the precedence relation. If v1 ∥ v2 (read: v1 is concurrent to v2) holds for two vertices v1, v2 ∈ V, then v1 and v2 can be arbitrarily scheduled with respect to each other. Precedence cannot be claimed in either direction, that is ¬(v1 ≺ v2) ∧ ¬(v2 ≺ v1).
Definition 3.3. The execution time function τ : V → R returns the execution time associated with a vertex v ∈ V. Note that execution times in unit-time are always defined as τ(v) = 1.
Definition 3.4. The delay function δ : E → N returns the number of unit-delays associated with an edge e ∈ E.
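For the sketches in this section, a possible Python representation of such a DFG is given below. The representation itself (a dataclass with an edge list) and the placement of the delayed edge in the figure 6(a) example are assumptions of this sketch, not definitions from the document.

# A minimal DFG representation: vertices carry execution times τ(v),
# edges carry unit-delay counts δ(e); parallel edges are allowed (multigraph).
from dataclasses import dataclass, field

@dataclass
class DFG:
    tau: dict = field(default_factory=dict)    # vertex -> execution time τ(v)
    edges: list = field(default_factory=list)  # (src, dst, delay δ(e)) triples

    def add_task(self, v, tau=1):
        self.tau[v] = tau

    def add_channel(self, src, dst, delay=0):
        self.edges.append((src, dst, delay))

# The DFG of figure 6(a): tasks A, B, C, D in unit-time.
g = DFG()
for v in "ABCD":
    g.add_task(v)
g.add_channel("A", "B")
g.add_channel("B", "D")
g.add_channel("C", "D")
g.add_channel("A", "C", delay=1)   # assumed placement of the delay in figure 6(a)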
3.1 Concurrency, Parallelism and Precedence
Concurrency and parallelism. Concurrency occurs where no precedence can be claimed (see
definitions 3.1 and 3.2). Two concurrent tasks v1 and v2 can be arbitrarily scheduled with
respect to each other: task v1 can be scheduled any time after task v2 and vice versa. If v1
and v2 are scheduled on different processors at an overlapping period of time, the concurrency
of v1 and v2 has been leveraged to produce parallelism (in time). In a sense, concurrency
reflects the absence of observable causal dependency, while parallelism reflects the coincident
execution at runtime. Hence, concurrency of tasks is a prerequisite for parallelization on the
runtime platform.
Intra-iteration precedence. Intra-iteration precedence is concerned with the precedence constraints that exist within a single iteration of a data-flow program. The first step in intra-iteration precedence analysis of a given data-flow graph G is to construct its acyclic precedence graph
APG(G) = G/ED
by removing the set of edges with at least one unit-delay,
ED = {e ∈ E : δ(e) ≥ 1} ,
from the original graph G.



Figure 6: Precedence in iterative data-flow. (a) A simple iterative DFG with 4 tasks A, B,
C and D. (b) Corresponding intra-iteration precedence graph. (c) Corresponding
intra- and inter-iteration precedence graph for 2 consecutive iterations.
Only edges with zero delay affect computations in the current iteration. Edges with k unit-delays affect computations that lie k iterations in the future; hence, delayed edges do not affect intra-iteration precedence. Note that APG(G) is guaranteed to be acyclic: by definition, every cycle in the original graph G must have at least one edge with a unit-delay greater than zero to be computable (see section 2.4). Thus, each cycle C is guaranteed to be broken up in APG(G) by removing at least that one edge e ∈ C with δ(e) ≥ 1.
The second step is to include tuples in the precedence relation (see definition 3.1). For each vertex v ∈ V, get the set of transitively reachable successors S. For each s ∈ S, add v ≺ s to the precedence relation. In other words, each vertex precedes its transitive successors in the acyclic precedence graph.
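A hedged sketch of these two steps is given below; it reuses the (src, dst, delay) edge-list representation assumed in the sketch of section 3 and reproduces the precedence relation of example 3.1.1.

# Step 1: build APG(G) by dropping delayed edges.
def apg(edges):
    """Acyclic precedence graph: keep only the zero-delay edges."""
    return [(u, v) for (u, v, delay) in edges if delay == 0]

# Step 2: v ≺ s for every transitive successor s in APG(G).
def precedence(vertices, edges):
    succ = {v: set() for v in vertices}
    for u, v in apg(edges):
        succ[u].add(v)
    relation = set()
    for v in vertices:
        stack = list(succ[v])
        while stack:                     # depth-first transitive reachability
            s = stack.pop()
            if (v, s) not in relation:
                relation.add((v, s))
                stack.extend(succ[s])
    return relation

# Figure 6(a) with an assumed delayed edge A -> C:
edges = [("A", "B", 0), ("B", "D", 0), ("C", "D", 0), ("A", "C", 1)]
print(sorted(precedence("ABCD", edges)))
# [('A', 'B'), ('A', 'D'), ('B', 'D'), ('C', 'D')]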
Example 3.1.1 (Intra-iteration precedence). Figure 6(b) shows an acyclic precedence graph of the DFG in figure 6(a). In this example, the precedence relation consists of {(A ≺ B), (A ≺ D), (B ≺ D), (C ≺ D)}. Furthermore, the concurrency relation consists of {(A ∥ C), (B ∥ C)}. Within one iteration, C can be scheduled independently from A and B, as long as A is scheduled before B, B before D and C before D. Here, two processors are sufficient to fully parallelize the data-flow program.
Inter-iteration precedence. Inter-iteration precedence describes the precedence constraints that exist between consecutive iterations of a data-flow program. The following explains how inter-iteration precedence between two tasks A and B is analyzed. As a first step, let Ai and Bi be the execution of task A and task B in the i-th iteration of the data-flow program. Further, let k be the smallest sum of delays on any path that leads from A to B in the original DFG. More formally, k is defined by
k = min{DP}
where the minimum is taken over all paths P that lead from A to B, and DP is the sum of delays on path P,
DP = Σ e∈(P∩E) δ(e) .


The second step is to include Ai ≺ Bi+k in the inter-iteration precedence relation. Note that in the case of self-loops or cycles (see section 3.3), a task can in particular show inter-iteration precedence with itself.
Example 3.1.2 (Inter-iteration precedence). Figure 6(c) shows an acyclic precedence graph of the DFG in figure 6(a) for two consecutive iterations. Ai, Bi, Ci and Di represent the execution of the tasks A, B, C and D in the i-th iteration. Intra- and inter-iteration precedence is shown. In fact, A1 ≺ C2 is the single inter-iteration precedence constraint that links the subgraphs of both iterations. This precedence graph for two consecutive iterations offers more concurrency than the intra-iteration precedence graph for a single iteration (figure 6(b)) does. In a combined schedule for both iterations, four processors can be used to fully parallelize the data-flow program. Three processors are sufficient if the execution time of C is less than or equal to that of A, that is τ(C) ≤ τ(A).
3.2 Weakly Connected Components
Weakly connected components (WCCs) in data-flow graphs represent pairwise concurrent processes in an intuitive way. In general, any directed graph G = {V, E} can be uniquely decomposed into n weakly connected components W1, ..., Wn. The WCCs are disjoint subgraphs of G, that is G = ∪ i=1..n Wi and Wi ∩ Wj = ∅ for all i, j ∈ {1, ..., n} where i ≠ j. Furthermore, the WCCs are pairwise disconnected. Here, disconnected means that there does not exist any edge (v, u) leading from any vertex v ∈ Wi to any vertex u ∈ Wj for all i, j ∈ {1, ..., n} where i ≠ j. The vertices v and u represent tasks. Since the tasks v and u reside in distinct WCCs that are not connected, there does not exist communication via channels among these tasks. Hence, no precedence between these tasks can be claimed, that is ¬(v ≺ u) ∧ ¬(u ≺ v), and therefore v ∥ u.

Figure 7: Weakly connected components analysis of a DFG with 20 tasks in total. The 6 distinct components are highlighted by colored areas. The largest component comprises 9 tasks, 3 components comprise 3 tasks each, and 2 components consist of a single task each.
Now we extend this idea from single tasks v and u to complete weak component clusters Wi and Wj where i ≠ j. It can be concluded that Wi and Wj are pairwise concurrent and can be scheduled in parallel, that is Wi ∥ Wj. Note that precedence between the tasks inside a single weakly connected component still exists. The class of concurrency introduced by weakly connected components is also called spatial concurrency. This emphasizes the topological aspect of this topic. Thus, a DFG with n weakly connected components offers at least n-fold spatial concurrency.
Example 3.2.1 (Analysis of weakly connected components). Figure 7 illustrates a weakly connected components analysis of a DFG with a total of 20 tasks. In this example, 6 distinct WCCs can be identified, which are highlighted by colored areas. The largest component comprises 9 tasks, there are 3 components that comprise 3 tasks each, and there are 2 trivial components that consist of a single task each. Hence, by scheduling each of the 6 WCCs on a dedicated processor, 6-fold parallelism can be achieved in this example. On the one hand, no costly inter-processor communication (IPC) is required by this schedule. On the other hand, the speed-up and efficiency of this schedule heavily depend on the actual execution times of the 20 tasks.
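A minimal sketch of the WCC analysis is given below: direction and delays are ignored and components are collected by a plain graph traversal. The (src, dst, delay) edge-list representation is the same assumption as in the earlier sketches of this section.

# Weakly connected components via undirected depth-first traversal.
def wcc(vertices, edges):
    """Return the weakly connected components as a list of vertex sets."""
    neighbours = {v: set() for v in vertices}
    for u, v, _delay in edges:
        neighbours[u].add(v)      # treat every channel as undirected
        neighbours[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:              # depth-first traversal of one component
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(neighbours[v] - component)
        seen |= component
        components.append(component)
    return components

# Two non-communicating parts: {A, B} and {C}.
print([sorted(c) for c in wcc("ABC", [("A", "B", 0)])])   # [['A', 'B'], ['C']]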
3.3 Strongly Connected Components and Cycles

Figure 8: Strongly connected components and cycles analysis of a DFG with 10 tasks in
total. There exists 1 strongly connected component, which is shown with highlighted edges. This strongly connected component comprises 6 tasks, belonging
to 3 minimal cycles. The 3 minimal cycles are highlighted by colored areas.


Detecting feedback cycles is important in concurrency analysis, since cycles hinder parallelization in three ways:
Cycles reduce data parallelism (see Sec. 4.1.3).
Cycles affect the iteration bound (see Sec. 3.4).
Cycles restrict transformations in modifying the number of delay operators (see Sec. 5.2).
Feedback occurs in data-flow graphs inside so-called feedback cycles. On a cycle C = {v1, ..., vn} with length l = |C|, each vertex v ∈ C can transitively reach any other vertex u ∈ C in a maximum of l steps. This definition of cycles is closely related to the definition of strongly connected components (SCCs) in directed graphs. If a path from a vertex v to a vertex u exists in an SCC, this implies that there also exists a path from u to v. Hence, the vertices v and u and, furthermore, all vertices on the paths between v and u are strongly connected. SCCs in directed graphs can be efficiently detected, e.g. by using Tarjan's algorithm [Tar72] with runtime complexity O(|V| + |E|).
Once data tokens enter a feedback cycle, these tokens (or the effects caused by them) can possibly circulate inside this cycle for an infinite number of iterations. Hence, feedback cycles introduce so-called states in a system. Any task inside a feedback cycle becomes stateful, even if the separate task (viewed in isolation) is stateless. Stateful tasks do not lend themselves to coarse-grain data parallelization [GTA06]. Furthermore, the iteration bound is likely to rise if feedback cycles with a low number of unit-delays are present (see Sec. 3.4). In feedback systems, modifying delays changes the functional behavior or even destroys causality.
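The following hedged sketch detects SCCs over the (src, dst, delay) edge list assumed earlier. For brevity it uses Kosaraju's two-pass algorithm rather than Tarjan's algorithm cited above; both run in O(|V| + |E|).

# Strongly connected components via Kosaraju's two DFS passes.
def scc(vertices, edges):
    """Return the strongly connected components as a list of vertex lists."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for u, v, _delay in edges:
        succ[u].append(v)
        pred[v].append(u)

    order, seen = [], set()
    def dfs1(v):                      # first pass: record finishing order
        seen.add(v)
        for w in succ[v]:
            if w not in seen:
                dfs1(w)
        order.append(v)
    for v in vertices:
        if v not in seen:
            dfs1(v)

    assigned, components = set(), []
    def dfs2(v, comp):                # second pass: walk reversed edges
        assigned.add(v)
        comp.append(v)
        for w in pred[v]:
            if w not in assigned:
                dfs2(w, comp)
    for v in reversed(order):
        if v not in assigned:
            comp = []
            dfs2(v, comp)
            components.append(comp)
    return components

# A 2-cycle {A, B} plus a feed-forward task C:
edges = [("A", "B", 1), ("B", "A", 0), ("B", "C", 0)]
print([sorted(c) for c in scc("ABC", edges)])   # [['A', 'B'], ['C']]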
Example 3.3.1 (Analysis of strongly connected components and cycles). Figure 8 shows the
strongly connected components and cycles of a DFG with 10 tasks in total. There exists 1
strongly connected component, which is shown with highlighted edges. This strongly connected
component comprises 6 tasks, belonging to 3 minimal cycles. These 3 minimal cycles are
highlighted by colored areas.
3.4 Iteration Bound
The time it takes to complete the execution of one iteration of a data-flow program is referred
to as the iteration period. Can we expect that adding more processors will always lead to
a shortened iteration period due to increased parallelism? In theory, it is possible to reduce
the iteration period of any feed-forward system towards zero by adding more processors (see
[Par89a], for example). Unfortunately, any data-flow program with feedback cycles has an
inherent iteration bound, which is a lower bound on the achievable iteration period. If the
iteration period of a system equals its iteration bound, the system is called rate-optimal. It is
not possible to construct a schedule with an iteration period lower than the iteration bound,
regardless of the number of parallel processors available. The notion of a lower bound on
the iteration period in feedback systems was discovered in the late 1960s by Reiter [Rei68] as
maximum cycle ratio with maximum computation rate 1 for periodic admissible schedules.
Parhi showed that rate-optimal schedules for static data-flow programs can always be constructed [PM91]. This is important for two reasons: on the one hand, the maximum achievable parallelism in a feedback system can be determined statically at compile time; on the other


hand, more platform parallelism could be exploited as long as the program is not yet rate-optimal. Nevertheless, even if a vast number of processors is available, an iteration period less than the iteration bound cannot be achieved. Note that in some cases, selective rewriting of program behavior (see section 5.3) may allow for lowering iteration periods below the original iteration bound.
In any data-flow program with feedback cycles the iteration bound is given by
T∞ = max{ TC / DC }
where the maximum is taken over all cycles C ⊆ G in the data-flow graph, and TC is the sum of execution times of vertices in cycle C,
TC = Σ v∈(C∩V) τ(v) ,
and DC is the sum of unit-delays on edges in cycle C,
DC = Σ e∈(C∩E) δ(e) .
Note that DC > 0 always holds in any computable cycle C, which must have at least one unit-delay by definition. Any cycle C for which TC/DC = T∞ is referred to as a critical cycle of the data-flow program.


Figure 9: A simple DFG and a corresponding rate-optimal schedule. (a) DFG of a data-flow program with three tasks: τ(A) = 1, τ(B) = 5 and τ(C) = 2. (b) A corresponding rate-optimal schedule of three successive iterations on four processors P1, ..., P4 with T∞ = 2.
Example 3.4.1. The example shown in figure 9 illustrates the idea of iteration boundedness. The iteration bound of a simple DFG with three vertices A, B and C and feedback cycles is calculated, and a rate-optimal combined schedule of three successive iterations of this DFG on four processors is given. The example in 9(a) shows a simple DFG of a data-flow program with three tasks. Individual execution times of the tasks are given as τ(A) = 1, τ(B) = 5 and τ(C) = 2. Two cycles exist in DFG 9(a): Cycle1 between A and B and a self-loop Cycle2 on C. The iteration bound is derived as
T∞ = max{ (τ(A) + τ(B)) / (δ((A, B)) + δ((B, A))) , τ(C) / δ((C, C)) } = max{ (1 + 5) / (2 + 1) , 2 / 1 } = 2
units of time. Figure 9(b) shows a four-processor schedule of three successive iterations. The four processors P1, ..., P4 are arranged on the vertical axis and time is displayed on the horizontal axis. A1, A2 and A3 refer to the execution of task A in the 1st, 2nd and 3rd iteration. The same applies to Bi and Ci, i ∈ {1, 2, 3}. The total period of the schedule is 6 units of time and the iteration period is 6/3 = 2 units of time, since 3 iterations are executed by this schedule within 6 units of time. Hence, the schedule given in figure 9(b) is already rate-optimal as its iteration period is equal to its iteration bound. Neither by adding more successive iterations, nor by adding processors, can a schedule with a shorter iteration period than 2 units of time be constructed.
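The small sketch below reproduces this calculation. It assumes the cycles have already been enumerated (e.g. starting from the strongly connected components); the iteration bound is then simply the maximum of TC/DC over the given cycles.

# Iteration bound from explicitly listed cycles (example 3.4.1).
from fractions import Fraction

def iteration_bound(cycles):
    """cycles: list of (vertex execution times, edge delay counts) per cycle."""
    return max(Fraction(sum(taus), sum(delays)) for taus, delays in cycles)

cycle1 = ([1, 5], [2, 1])   # A <-> B: τ(A) + τ(B) = 6, δ((A,B)) + δ((B,A)) = 3
cycle2 = ([2], [1])         # self-loop on C: τ(C) = 2, δ((C,C)) = 1
print(iteration_bound([cycle1, cycle2]))   # 2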
3.5 Delay Profiles
Embedded software systems commonly have the property of being reactive systems. Reactiveness means that the system steadily communicates and interacts with its environment: it reacts to input data from the environment by producing output data for the environment within a given period of time. In embedded systems, input data are read from physical sensors and output data are written to physical actuators.


Figure 10: Delay profile with guaranteed delays from inputs to outputs. (a) DFG with 2
inputs and 3 outputs. (b) Corresponding delay profile, showing for each input
its reachable outputs and respective guaranteed delay.
Each environment input of a given data-flow program can affect a set of environment outputs. This set of affected outputs Y ⊆ V is determined by a transitive search beginning at the input x ∈ V. For example, this can be achieved by depth-first search (DFS) within the data-flow graph. Now we know that any output to the environment y ∈ Y can depend (causally) on the input from the environment x. Furthermore, we know that at least one path P ⊆ G from x to y exists. Consequently, we can derive a profile of guaranteed delays between stimuli on x and reactions on any of the y ∈ Y. According to the delay calculus of Broy [Bro10], the guaranteed delay between two vertices x and y is given by
δgar(x, y) = min{DP}
where the minimum is taken over all paths P ⊆ G from x to y, and DP is the sum of unit-delays on edges along the path P,
DP = Σ e∈(P∩E) δ(e) ,
if such a path P exists, or δgar(x, y) = ∞ otherwise. In other words, δgar(x, y) = d means that it takes at least d iterations before stimuli on sensor x can lead to observable effects on actuator y. In particular, δgar(x, y) = ∞ means that stimuli on sensor x never lead to observable effects on actuator y.
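A hedged sketch of this computation is given below: since delays are non-negative, a Dijkstra-style search over the (src, dst, delay) edge list assumed in the earlier sketches yields the minimum delay sum from a source to every reachable vertex; unreachable vertices keep δgar = ∞. The toy DFG at the bottom is an illustrative assumption, not the DFG of figure 10.

# Guaranteed delays δgar(source, v) as shortest delay-weighted paths.
import heapq
from math import inf

def guaranteed_delays(vertices, edges, source):
    """Return {v: δgar(source, v)} for all vertices of the DFG."""
    succ = {v: [] for v in vertices}
    for u, v, delay in edges:
        succ[u].append((v, delay))
    dist = {v: inf for v in vertices}
    dist[source] = 0
    queue = [(0, source)]
    while queue:
        d, v = heapq.heappop(queue)
        if d > dist[v]:
            continue                      # stale queue entry
        for w, delay in succ[v]:
            if d + delay < dist[w]:
                dist[w] = d + delay
                heapq.heappush(queue, (dist[w], w))
    return dist

# Assumed toy DFG: input A reaches C over paths with 4 and 5 delays in total.
edges = [("A", "B", 1), ("B", "C", 3), ("A", "C", 5)]
print(guaranteed_delays("ABC", edges, "A"))   # {'A': 0, 'B': 1, 'C': 4}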
Example 3.5.1 (Delay profiles). Figure 10(b) shows a delay profile of the simple DFG in figure 10(a), created by the toolkit Cadmos (see section 6). Note that the term ticks is synonymous with iterations (see also section 2.3). Input A affects output C after 4 iterations and output D after 1 iteration. Input E affects output F after 2 iterations. Percentages show the distribution of input/output delays, relative to the highest delay (4 in this case).
Why are delay profiles interesting for concurrency and parallelization? Delay profiles can
be used to increase parallelism in multi-iteration schedules and can be applied to determine
jitter-stability of distributed systems. The following two paragraphs summarize how this can
be achieved.
Increasing parallelism in multi-iteration schedules. Guaranteed delays pose a lower bound
on the observable input-/output-latencies of the system under design. In a black-box view,
we observe that the system never reacts on input before some guaranteed amount of time has
passed. Internally, the actual execution of tasks can be delayed for several iterations, as long
as the guaranteed delays are satisfied. This offers additional freedom in scheduling for parallel
systems. In multi-iteration schedules, there can be overpopulated points in time with more
concurrent tasks than processors, and there can be sparse points in time with fewer concurrent
tasks than processors. Using delay profiles, tasks can be moved from overpopulated to sparse
points in the schedule in order to increase parallelism.
Jitter-stability in distributed systems. Jitter in time is a severe problem in distributed real-time systems. The deviation in transmission time of messages over buses is an example of jitter. Usually, this jitter is some ε that is specific to a given bus system. In this case, jitter can be taken into account in advance by using transmission times t ± ε instead of t. More difficult to handle is sporadic jitter that may delay messages in the order of several iterations. Sporadic jitter can be caused by electromagnetic interference on the bus system's physical layer, for example. The idea is to use delay profiles to determine how many iterations a message can be delayed by jitter without affecting the guaranteed input/output latencies of the system. Employing delays for jitter-stability is ongoing research and needs to be discussed in more detail in companion documents.


4 Concurrency Metrics
In the preceding sections 2 and 3, we have shown how applications for embedded systems are
presented as iterative data-flow programs and introduced some basic analysis techniques. This
section gives an overview of some useful concurrency metrics that can be derived from iterative
data-flow models with the help of those analysis techniques. Along the software engineering
process, these metrics support answering concurrency- and parallelism-related questions concerning mainly the software, hardware and deployment of a system. Typical questions for a
system under design have been outlined in the introduction (section 1): what does an adequate concurrent software architecture, an adequate parallel hardware platform, or an adequate distributed deployment look like?
Discussion of metrics is arranged in this section as follows. Three different main areas of metrics
are discussed: software-, hardware- and deployment-concerned. Each area is presented in one
of the following subsections. Each single metric is explained in a dedicated sub-subsection and
organized by three topics: purpose, analysis and calculation of the respective metric.
4.1 Software-concerned metrics
This subsection explains three mainly software-concerned metrics: available spatial concurrency, available temporal concurrency and available data-parallelism.
4.1.1 Spatial concurrency.
Purpose. Available spatial concurrency reflects the pairwise non-communicating parts of an application. These parts neither intercommunicate within a single iteration nor across subsequent iterations. A program with n spatially concurrent parts offers at least n-fold parallelism. The term spatial refers to the fact that the parts are always concurrent to each other, regardless of time (or iterations). Each of the n spatially concurrent parts can be scheduled on a dedicated processor, producing n-fold parallelism in total. On the one hand, no costly inter-processor communication (IPC) is required by exploiting spatial concurrency. On the other hand, speed-up and efficiency heavily depend on the actual execution times of the tasks inside the concurrent parts.
Analysis. Weakly connected components analysis from section 3.2 is employed. The analysis function WCC(G) returns the n disjoint WCCs of a data-flow graph G.
Calculation. The spatial concurrency Cspat ∈ N of a data-flow program with corresponding data-flow graph G is defined by
Cspat(G) := |WCC(G)| .


4.1.2 Temporal concurrency.


Purpose. Available temporal concurrency reflects parts of an application that only communicate across iterations, but never within a single iteration. A program with n temporally concurrent parts offers at least n-fold parallelism. This kind of parallelism is also referred to as pipelining parallelism. Each of the n temporally concurrent parts can be scheduled as a pipeline stage on a dedicated processor, producing n-fold parallelism in total. On the one hand, the iteration period can be significantly reduced by pipelining. On the other hand, costly inter-processor communication (IPC) is required by exploiting temporal concurrency. The different pipeline stages reside on different processors and the stages have to communicate from one iteration to the next.
Analysis. Acyclic precedence graph analysis from section 3.1 and weakly connected components analysis from section 3.2 are employed. Additionally, the spatial concurrency Cspat (see section 4.1.1, above) is required. The analysis function APG(G) returns the acyclic precedence graph of a data-flow graph G. The analysis function WCC(G) returns the n disjoint WCCs of a data-flow graph G.
Calculation. The temporal concurrency Ctemp ∈ N of a data-flow program with corresponding data-flow graph G is defined by
Ctemp(G) := |WCC(APG(G))| − Cspat(G) + 1 .
4.1.3 Data-parallelism.
Purpose. Data-parallelism can be employed for any part (subsystem) of an application that does not depend on its own history of executions, i.e. for the stateless parts of an application. Additionally, the type of data processed by data-parallel subsystems is required to be a complex type like a list, array or matrix. The basic idea is to execute the same operation simultaneously on disjoint parts of the data. This concept is analogous to the single instruction multiple data concept (SIMD, see [Fly72]) in processor architecture.
For a stateless subsystem S ⊆ G, n parallel instances S1, ..., Sn can be added to a schedule for a single iteration. Note that n is virtually only limited by the number of available processors. The complex input data token T^in is split into n parts T^in_1, ..., T^in_n, each dispatched to one of the instances Si, i ∈ {1, ..., n}, that run in parallel. After all Si have finished processing, the resulting n output tokens T^out_1, ..., T^out_n are merged into one output token T^out. On the one hand, data-parallelism of degree n can achieve significant speed-up near n for large input data structures, e.g. as found in audio and image processing. On the other hand, the additional time required for the split and merge operations may outweigh the reduction in time gained by data-parallel execution. An introduction to the nature and use of data-parallelism can be found in [GTA06], for example.
Analysis. Strongly connected components analysis from section 3.3 is employed. The analysis
function SCC(G) returns the strongly connected components (SCCs) of a data-flow graph G.


Calculation. The subgraph that is stateful and, thus, cannot be used for data-parallelism, Stateful(G) ⊆ G, of a data-flow program with corresponding data-flow graph G is defined by
Stateful(G) := ∪ SCC(G) ,
which is the union of all strongly connected components of G. The subgraph that is stateless and, thus, can be used for data-parallelization, Stateless(G) ⊆ G, is defined by
Stateless(G) := G/Stateful(G) .
The number of potentially data-parallel tasks P ∈ N of a data-flow program with corresponding data-flow graph G = {V, E} is defined by
P(G) := |V ∩ Stateless(G)| ,
which is the number of vertices in the stateless subgraph of G.
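The sketch below ties the three software-concerned metrics together. As an assumption of this sketch it uses the third-party networkx library (the original tooling is AutoFocus/Cadmos) and an edge attribute named 'delay' for δ(e); strongly connected components of size one without a self-loop are treated as stateless.

# Software-concerned metrics computed with networkx (illustrative only).
import networkx as nx

def software_metrics(G: nx.MultiDiGraph):
    # Spatial concurrency: number of weakly connected components of G.
    c_spat = nx.number_weakly_connected_components(G)

    # Temporal concurrency: WCCs of APG(G), i.e. of G with all delayed
    # edges removed, minus Cspat plus 1.
    apg = G.copy()
    delayed = [(u, v, k) for u, v, k, d in apg.edges(keys=True, data="delay", default=0) if d >= 1]
    apg.remove_edges_from(delayed)
    c_temp = nx.number_weakly_connected_components(apg) - c_spat + 1

    # Data-parallelism: tasks outside any feedback cycle (non-trivial SCCs
    # and self-loops form the stateful subgraph).
    stateful = set()
    for comp in nx.strongly_connected_components(G):
        if len(comp) > 1:
            stateful |= comp
    stateful |= {v for v in G if G.has_edge(v, v)}
    p_data = G.number_of_nodes() - len(stateful)

    return c_spat, c_temp, p_data

# Toy DFG: a delayed pipeline A -> B, a 2-cycle C <-> D, and an isolated task E.
G = nx.MultiDiGraph()
G.add_nodes_from("ABCDE")
G.add_edge("A", "B", delay=1)
G.add_edge("C", "D", delay=0)
G.add_edge("D", "C", delay=1)
print(software_metrics(G))   # (3, 2, 3)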
4.2 Hardware-concerned metrics
This subsection introduces three mainly hardware-concerned metrics: speed-up, efficiency and
utilization.
4.2.1 Speed-up.
Purpose. An application can be executed sequentially by a single processor or in parallel by p processors. The speed-up S ∈ R reflects how many times faster a given application is executed by the parallel processors than by the single processor. In the context of this document, we are concerned with iterative and non-terminating programs. Hence, speed-up is measured with respect to the iteration period, which is the time to execute all tasks (and communication operations) once. Note that the achievable speed-up is limited by what is also referred to as Amdahl's law [Amd67]: the amount of sequential tasks (compare with sequential composition, section 2.4) severely limits the speed-up achievable by adding more parallel processors.
Analysis. The iteration period of the sequential reference system is Tseq and the iteration
period of the parallel system with p processors is Tpar . After constructing a schedule, the
iteration period (Tseq or Tpar ) is set to the duration of the longest schedule appearing for any
of the processors or buses. In the context of uniform computation time or arbitrary integer
computation time (see section 1), schedules can be efficiently constructed by Hu-level methods
[Hu61], for example. For arbitrary real computation time analysis, e.g. A* methods (see
[HNR68] and [PLM99]) or solver-based methods (see [Gre68] and [Vos10]) can be used.
Calculation. The speed-up of a parallel system with iteration period Tpar compared to a sequential system with iteration period Tseq is defined by
S := Tseq / Tpar .


4.2.2 Efficiency.
Purpose. Efficiency is closely related to speed-up (see section 4.2.1, above). The efficiency E ∈ [0, 1] of a parallel system reflects the average utilization of its parallel processing capabilities. In a system with high efficiency (near one), all parallel processing capabilities are utilized well. A system with low efficiency (near zero) is likely to have higher energy consumption and hardware costs than necessary. In this case, e.g. lowering frequencies or removing the least utilized processors (see section 4.2.3, below) can increase efficiency while lowering energy consumption and hardware costs. Possibly, there are parts of the system that are deliberately redundant, e.g. for safety reasons. Usually, these parts reduce the overall efficiency, but cannot be removed, of course.
Analysis. The speed-up S (see section 4.2.1) and the number of parallel processors p are required. Efficiency can be calculated for uniform computation time, arbitrary integer computation time or real computation time schedules (compare to speed-up in section 4.2.1).
Calculation. The efficiency of a parallel system with speed-up S and number of parallel processors p is defined by

    E := S / p .
4.2.3 Utilization of resources.
Purpose. Utilization of resources (processors and buses) reflects the amount of time a resource is actually used during the execution of an iteration. The utilization U(r) ∈ [0, 1] is calculated for a given resource r ∈ R, with R being the hardware resources of the system. Resources with high utilization (near one) have little reserve in the case of unforeseen events that affect execution times or transmission times. Resources with low utilization (near zero) are likely to add unnecessary energy consumption or hardware cost to the system.
Analysis. The iteration period of the parallel system with resources R is Tpar. After constructing a parallel schedule, each resource r ∈ R is assigned a sequence of scheduled task activations. In the case of buses, these tasks are communication operations. The sum of execution times of these tasks is T(r) ∈ ℝ, with 0 ≤ T(r) ≤ Tpar. Utilization can be calculated for uniform computation time, arbitrary integer computation time or real computation time schedules (compare to speed-up in section 4.2.1 and efficiency in section 4.2.2).
Calculation. The utilization of a scheduled resource (processor or bus) r in a parallel system with iteration period Tpar is defined by

    U(r) := T(r) / Tpar ,

with T(r) being the sum of execution times of tasks scheduled on resource r.
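The three hardware-concerned metrics follow directly from a constructed schedule. A small sketch, under the assumption that the schedule is given as a mapping from each resource to the execution times of the task activations assigned to it within one iteration; the resource names and numbers are illustrative only.

    # Sketch: speed-up S, efficiency E and utilization U(r) from a schedule.
    def speed_up(T_seq, T_par):
        return T_seq / T_par                      # S := T_seq / T_par

    def efficiency(S, p):
        return S / p                              # E := S / p

    def utilization(schedule, T_par):
        """U(r) := T(r) / T_par for every resource r of the schedule."""
        return {r: sum(times) / T_par for r, times in schedule.items()}

    # Example: three processors, sequential period 12, parallel period 6.
    schedule = {"P1": [1, 4], "P2": [4, 1], "P3": [1, 1]}
    S = speed_up(12, 6)                           # 2.0
    E = efficiency(S, p=3)                        # ~0.67
    print(S, E, utilization(schedule, T_par=6))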
4.3 Deployment-concerned metrics
This subsection briefly discusses three mainly deployment-concerned metrics: frequency, reactiveness and jitter-robustness.
4.3.1 Frequency.
Purpose. The deployed system runs with a certain frequency (or iteration rate), which is the reciprocal of the iteration period. Typical software-side techniques to increase the frequency of parallel systems are pipelining (see section 4.1.2) and unfolding (see section 5.1). On the hardware side, fast bus systems with little communication delay (for distributed IPC) can be selected to enable higher frequencies. On the one hand, higher frequencies yield higher reactiveness in embedded software systems. On the other hand, higher frequencies can reduce robustness against timing jitter in distributed systems.
Analysis. After constructing a parallel schedule, the iteration period Tpar is known. Additionally, the iteration bound T∞ (see section 3.4) shows that Tpar ≥ T∞ holds. If Tpar = T∞, the deployed system is rate-optimal. Hence, neither by unfolding and retiming transformations, nor by adding more processors, can a system with a lower iteration period than T∞ be built. Frequency can be calculated in all three granularities of time: uniform, arbitrary integer and real computation time.
Calculation. The iteration rate of a system with iteration period Tpar is defined by

    1 / Tpar .
4.3.2 Reactiveness.
Purpose. Reactiveness is often an important quality metric for embedded software systems. It reflects the end-to-end response times from sensors to actuators of the deployed system. Here, reactiveness is defined on the basis of inputs and outputs of the system. Reactiveness is the amount of time that it takes at least before a stimulus on an input can produce observable effects on an output. In the context of this document, we are concerned with the lower bounds of reactiveness.
It is beyond the scope of this document to analyze whether a stimulus can actually ever affect an output. Further, it is not considered how long it takes at most before a stimulus on an input affects an output. For both analyses ("actually ever" and "at most") it is not sufficient to solely consider a DFG's structure. Rather, analysis of a DFG's behavior, e.g. by model-checking, is necessary to retrieve this information regarding the upper bounds of reactiveness. This is ongoing research and needs to be discussed in companion documents.
Analysis. We use the delay profiles from section 3.5 to get, for each input x, the set of affected outputs Y and the guaranteed delays δ_gar(x, y) for all y ∈ Y. After constructing a parallel schedule, the iteration period Tpar is known. Response times can be calculated in all three granularities of time: uniform, arbitrary integer and real computation time.
Calculation. The guaranteed response time r_gar ∈ ℝ of a stimulus on an input x on an output y in a system with iteration period Tpar is defined as

    r_gar(x, y) := Tpar · δ_gar(x, y) ,

where δ_gar(x, y) is the guaranteed delay (in iterations) from section 3.5. For a data-flow program with DFG G, the guaranteed response time is given by

    r_gar(G) := Tpar · min{δ_gar(x, y)} ,

where the minimum is taken over all guaranteed delays from any input x to any output y in G. The deployed system is guaranteed to respond no faster than r_gar(G) to any input stimulus.
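A minimal sketch of this calculation, assuming the delay profile of section 3.5 is available as a mapping from (input, output) pairs to guaranteed delays in iterations; the port names and values are made up for illustration.

    # Sketch: guaranteed response times derived from a delay profile.
    def guaranteed_response_times(delay_profile, T_par):
        """Per input/output pair: response time := T_par * guaranteed delay."""
        return {(x, y): T_par * d for (x, y), d in delay_profile.items()}

    def guaranteed_response_time(delay_profile, T_par):
        """For the whole program: T_par * minimum guaranteed delay."""
        return T_par * min(delay_profile.values())

    delay_profile = {("sensorIn", "actuatorOut"): 2, ("sensorIn", "diagOut"): 3}
    print(guaranteed_response_times(delay_profile, T_par=2.5))  # 5.0 and 7.5
    print(guaranteed_response_time(delay_profile, T_par=2.5))   # 5.0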
4.3.3 Jitter-robustness.
Purpose. As mentioned in section 3.5, jitter in time is a problem that occurs in distributed real-time systems. An example of jitter is the deviation in transmission times of messages over buses. We define jitter-robustness as the maximum amount of jitter that cannot break the system's expected input-/output-behavior. For reasons of simplicity, jitter is expected to be of positive value only, i.e. communication happens at t + ε, where ε ≥ 0 is the jitter and t is the exact time without jitter. It is ongoing research how transformation techniques like retiming can be applied to maximize the jitter-robustness of deployed systems.
Analysis. Jitter-robustness can be expressed in iterations or in time. To calculate the time, the iteration period Tpar has to be known. In a DFG G = {V, E}, each edge e = (v1, v2) with v1, v2 ∈ V and e ∈ E has a jitter-robustness proportional to the edge's unit-delay δ(e). Messages sent by v1 over e are required by v2 after δ(e) iterations at the latest. If e is transmitted over a bus, then a jitter of up to the duration of δ(e) iterations on that bus is tolerated. This idea can be extended from single edges to multiple edges between two vertices v1 and v2, as explained in the next paragraph.
Calculation. Let H ⊆ E be the set of all edges from v1 to v2 and from v2 to v1. If H is empty, then the jitter-robustness for v1 and v2 is undefined; otherwise continue as follows. Jitter-robustness in iterations is the minimum delay between the vertices v1 and v2, defined as

    JitterRobustnessIterations(v1, v2) := min{δ(h)} ,

where the minimum is taken over all unit-delays of edges h ∈ H. Jitter-robustness in time is defined as

    JitterRobustnessTime(v1, v2) := Tpar · JitterRobustnessIterations(v1, v2) .
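A corresponding sketch for jitter-robustness, assuming the DFG is a networkx MultiDiGraph whose edges carry a 'delay' attribute holding their unit-delay count; the example edges are made up for illustration.

    # Sketch: jitter-robustness of a vertex pair, in iterations and in time.
    import networkx as nx

    def jitter_robustness_iterations(G, v1, v2):
        """Minimum unit-delay over all edges between v1 and v2 (both directions);
        None if there is no such edge (jitter-robustness undefined)."""
        delays = [d["delay"] for u, v, d in G.edges(data=True)
                  if (u, v) in ((v1, v2), (v2, v1))]
        return min(delays) if delays else None

    def jitter_robustness_time(G, v1, v2, T_par):
        it = jitter_robustness_iterations(G, v1, v2)
        return None if it is None else T_par * it

    G = nx.MultiDiGraph()
    G.add_edge("C", "A", delay=1)
    G.add_edge("A", "C", delay=1)
    print(jitter_robustness_time(G, "A", "C", T_par=2.5))  # 2.5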
5 Concurrency Transformation
This section gives an overview of behavior-invariant transformation techniques that can be applied to iterative data-flow models. The models are transformed in order to influence their concurrency properties and the parallelism they offer. The metrics discussed in section 4 may be used to monitor this influence. In the context of this paper, a behavior-invariant transformation is considered to leave the original input-output behavior unaltered, though additional latency may be introduced. Here, three transformation techniques are presented: unfolding, retiming and look-ahead. Each of these techniques is used in typical areas of application. A more detailed discussion of this topic can be found in [Par89a], for example.
5.1 Unfolding-Transformation
Features of the unfolding transformation. The unfolding transformation aims at increasing parallelism by constructing combined schedules for multiple successive iterations of a data-flow program. The main parameter of this transformation is the unfolding factor J. The resulting schedules combine J iterations, thus inter-iteration parallelism can be exploited. By increasing parallelism, unfolding allows for reducing the iteration period if sufficient parallel processors are available. In pure feed-forward data-flow programs, there is virtually no limit to reducing the iteration period by unfolding. Nevertheless, in data-flow programs with feedback, the iteration bound (see section 3.4) poses a lower limit on the achievable iteration period. A profound introduction to the unfolding transformation can be found in [PM91].

Figure 11: Unfolding transformation of a feed-forward system. (a) Original DFG with 3 tasks. (b) Unfolded DFG with factor J = 3.
Example 5.1.1 (Unfolding transformation of a feed-forward system). Figure 11(a) shows a simple feed-forward system with three tasks: SensorProcessing, ControlAlgorithm and ActuatorProcessing. Similar kinds of structures are likely to be found in embedded systems that execute control functions. The execution times of the tasks are given as τ(SensorProcessing) = 1, τ(ControlAlgorithm) = 4 and τ(ActuatorProcessing) = 1. Hence, a one-processor schedule has an iteration period of 6. By using more processors and pipelining parallelism, the iteration period can be reduced to 4 (the execution time of ControlAlgorithm).
Figure 12: Precedence graph and schedule of an unfolded feed-forward system. (a) Precedence graph of the unfolded DFG in figure 11(b). (b) A three-processor schedule of the unfolded DFG in figure 11(b).
By unfolding transformation, the iteration period of this system can be reduced below 4. Figure 11(b) shows the unfolded DFG with unfolding factor J = 3. The unfolded DFG comprises 9 tasks, where SensorProcessing_i, ControlAlgorithm_i and ActuatorProcessing_i represent the execution of the respective task in the i-th iteration (i ∈ {1, 2, 3}). Note that unfolding preserved the sum of delays (2D).
Figure 12(a) shows the acyclic precedence graph of the unfolded DFG in figure 11(b). The critical path (SensorProcessing_1 → ControlAlgorithm_2 → ActuatorProcessing_3) of the three-unfolded system requires 6 units of time. Figure 12(b) illustrates a three-processor schedule of the unfolded DFG in figure 11(b). Task names are abbreviated as follows for better readability: SensorProcessing (S), ControlAlgorithm (C) and ActuatorProcessing (A). This schedule satisfies the precedence constraints from figure 12(a) and has a total duration of 6. Since the schedule executes 3 iterations of the original DFG, the iteration period is 6/3 = 2.

Figure 13: Unfolding transformation of a feedback system. (a) Original DFG with 3 tasks and a feedback cycle. (b) Unfolded DFG with factor J = 2.
Example 5.1.2 (Unfolding transformation of a feedback system). We now continue and extend example 5.1.1 by introducing a feedback channel. Figure 13(a) shows a feedback system with three tasks: SensorProcessing, ControlAlgorithm and ActuatorProcessing. Additionally, there is a feedback channel with a unit-delay from ActuatorProcessing to ControlAlgorithm. The execution times of the tasks are equal to those in example 5.1.1: τ(SensorProcessing) = 1, τ(ControlAlgorithm) = 4 and τ(ActuatorProcessing) = 1. Consequently, a one-processor schedule still has an iteration period of 6. Again, task names are abbreviated in the following for better readability: SensorProcessing (S), ControlAlgorithm (C) and ActuatorProcessing (A).
The iteration bound (see section 3.4) of this system with one feedback cycle is

    (τ(C) + τ(A)) / (δ((C, A)) + δ((A, C))) = (4 + 1) / (1 + 1) = 2.5 .
Thus, by unfolding and using more processors, the iteration period can be reduced to 2.5 (the
iteration bound).
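As a cross-check, this iteration bound can be computed directly from the DFG. The following sketch assumes a networkx DiGraph with a 'time' attribute per vertex (execution time) and a 'delay' attribute per edge (unit-delay count); the edge delays are taken to be one unit-delay each, matching the 3D total of this example.

    # Sketch: iteration bound = max over all cycles of
    # (sum of execution times in the cycle) / (sum of unit-delays in the cycle).
    import networkx as nx

    def iteration_bound(G):
        bound = 0.0
        for cycle in nx.simple_cycles(G):
            exec_time = sum(G.nodes[v]["time"] for v in cycle)
            delays = sum(G.edges[cycle[i], cycle[(i + 1) % len(cycle)]]["delay"]
                         for i in range(len(cycle)))
            bound = max(bound, exec_time / delays)
        return bound

    G = nx.DiGraph()
    G.add_node("S", time=1); G.add_node("C", time=4); G.add_node("A", time=1)
    G.add_edge("S", "C", delay=1)
    G.add_edge("C", "A", delay=1)
    G.add_edge("A", "C", delay=1)   # feedback channel
    print(iteration_bound(G))       # 2.5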

Figure 14: Precedence graph and schedule of an unfolded feedback system. (a) Precedence graph of the unfolded DFG in figure 13(b). (b) A rate-optimal three-processor schedule of the unfolded DFG in figure 13(b).
Figure 13(b) shows the unfolded DFG with unfolding factor J = 2. The unfolded DFG comprises 6 tasks, where SensorProcessing_i, ControlAlgorithm_i and ActuatorProcessing_i represent the execution of the respective task in the i-th iteration (i ∈ {1, 2}). Note that unfolding preserved the sum of delays (3D).
Figure 14(a) shows the acyclic precedence graph of the unfolded DFG in figure 13(b). Figure 14(b) illustrates a three-processor schedule of the unfolded DFG in figure 13(b). The schedule satisfies all precedence constraints from figure 14(a) and has a total duration of 5. Since the schedule executes 2 iterations of the original DFG, the iteration period is 5/2 = 2.5. The iteration period of 2.5 equals the iteration bound, thus this schedule is rate-optimal. Unfolding with J > 2 is possible, but does not yield a lower iteration period than 2.5. Note that the given three-processor schedule has a speed-up of S = 6/2.5 = 2.4 and an efficiency of E = 2.4/3 = 0.8 (see section 4.2). It is possible to construct a two-processor schedule with an iteration period of 3 that has full efficiency of E = 1.0, but less speed-up of S = 2.0.
Figure 15: Periodic overlapping three-processor schedule for the unfolded DFG of example 5.1.1. By overlapped execution, the iteration period is 2.
Overlapping schedules and periodic task execution. Note that, after unfolding, schedules can have deviations in the periodic execution of tasks. For example, the schedule given in example 5.1.1, figure 12(b), activates task S at t = 0, 1, 5, 6, 7, 11, . . .. This can be problematic for systems that need strictly periodic activation of tasks, e.g. if S needs to read data from an analog-to-digital converter (ADC) at t = 0, 2, 4, 6, 8, 10, . . .. To overcome this problem, overlapping schedules can be used in combination with unfolding. Overlapped execution refers to the fact that one processor can start executing the next iteration while other processors are still executing the current iteration. Figure 15 shows a strictly periodic overlapping three-processor schedule for the unfolded DFG of example 5.1.1. This schedule satisfies the precedence constraints. With overlapped execution, the iteration period remains 2. Moreover, each of the original tasks (S, C and A) is activated with a period of 2.
History, application and scope. A detailed description of the algorithm behind the unfolding transformation is given in [PM91]. An important property of the unfolding transformation is that it preserves the number of delays of the original data-flow program. Hence, no additional latency is introduced, which is particularly important for reactive systems. Both examples (5.1.1 and 5.1.2) illustrate that the sum of delays in the unfolded system equals the sum of delays in the original system. Depending on the effects of additional inter-processor communication (IPC), an unfolded program can execute at a significantly higher iteration rate than the original program does. In general, data-flow programs with large-grain tasks (e.g. complex sub-programs inside a task) as well as fine-grain tasks (e.g. simple linear functions) can profit from unfolding.
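The unfolding rule itself is compact. The following sketch applies the J-unfolding rule as described in [PM91]: an edge u → v carrying i unit-delays is mapped, for j = 0, ..., J-1, to an edge from copy j of u to copy (j + i) mod J of v, carrying floor((j + i)/J) unit-delays. The triple-based edge list and the assumption that each edge of example 5.1.1 carries one unit-delay (as suggested by its pipelined schedule) are for illustration only.

    # Sketch of the J-unfolding rule from [PM91] on a (source, target, delay) edge list.
    def unfold(edges, J):
        unfolded = []
        for u, v, i in edges:
            for j in range(J):
                unfolded.append((f"{u}{j + 1}", f"{v}{(j + i) % J + 1}", (j + i) // J))
        return unfolded

    # Example 5.1.1: edges S -> C and C -> A, each carrying one unit-delay, J = 3.
    for edge in unfold([("S", "C", 1), ("C", "A", 1)], J=3):
        print(edge)
    # ('S1', 'C2', 0), ('S2', 'C3', 0), ('S3', 'C1', 1),
    # ('C1', 'A2', 0), ('C2', 'A3', 0), ('C3', 'A1', 1)
    # The sum of delays (2D) is preserved, as in figure 11(b).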
5.2 Retiming-Transformation
Features of the retiming transformation. The retiming transformation aims at increasing parallelism by changing precedence constraints in data-flow programs. The term retiming refers to the fact that delays are moved around in a DFG and, thus, the timing of the DFG is altered. Delays are moved around in such a way that the total number of delays in a cycle (see section 3.3) of the program remains unchanged. Changing the number of delays affects precedence (see section 3.1). The iteration period is reduced if the altered precedence constraints allow for parallel schedules with shorter duration. A typical local retiming transformation is the removal of n unit-delays from each of the incoming edges of a vertex v, and the addition of n unit-delays to each of the outgoing edges of v. This local retiming transformation can be applied to a vertex if all of its incoming edges have at least n unit-delays associated with them. Any global retiming transformation can be described by a combination of local retiming transformations. A comprehensive description of the retiming transformation is given in [LS91].
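A minimal sketch of this local retiming step, on a made-up DFG represented as a mapping from edges to their unit-delay counts; the graph does not reproduce figure 16, it only illustrates the rule.

    # Sketch: local retiming of vertex v by n unit-delays.
    def retime(delays, v, n=1):
        incoming = [e for e in delays if e[1] == v]
        outgoing = [e for e in delays if e[0] == v]
        if any(delays[e] < n for e in incoming):
            raise ValueError("every incoming edge of v needs at least n unit-delays")
        retimed = dict(delays)
        for e in incoming:
            retimed[e] -= n   # remove n unit-delays from each incoming edge
        for e in outgoing:
            retimed[e] += n   # add n unit-delays to each outgoing edge
        return retimed

    # One delay is moved across B; the cycle A -> B -> C -> A keeps its 2 unit-delays,
    # while the total number of delays grows from 2 to 3.
    delays = {("A", "B"): 1, ("B", "C"): 0, ("B", "D"): 0, ("C", "A"): 1}
    print(retime(delays, "B"))
    # {('A', 'B'): 0, ('B', 'C'): 1, ('B', 'D'): 1, ('C', 'A'): 1}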

Figure 16: A data-flow program before retiming. (a) Original DFG with 4 tasks, 2 unit-delays and one cycle. (b) Precedence graph. (c) Two-processor schedule with iteration period 6.
Example 5.2.1 (Retiming transformation). Figure 16(a) shows the DFG of a data-flow program with 4 tasks, 2 unit-delays in total, and one cycle. The precedence graph of this program is given in figure 16(b). Execution times of the tasks are given as τ(A) = 1, τ(B) = 2, τ(C) = 4 and τ(D) = 4. With the given precedence relations and execution times, not more than two parallel processors can be leveraged. A two-processor schedule with an iteration period of 6 is illustrated in figure 16(c).
Now, we apply a local retiming transformation on task B by removing one unit-delay from the incoming edge and adding one unit-delay to each of the two outgoing edges. Figure 17(a) shows the retimed DFG, which now has 3 unit-delays in total. Though the total number of delays in the DFG changed, the number of delays in the cycle remained unchanged (2D). The altered precedence graph of this retimed system is given in figure 17(b). With these altered precedence relations, three parallel processors can be leveraged. A three-processor schedule with a reduced iteration period of 4 is illustrated in figure 17(c).
History, application and scope. Retiming was first proposed by Leiserson, Rose and Saxe [LRS83] to increase the frequency of synchronous circuitry. Likewise, retiming can be used to increase the frequency of data-flow programs that represent software.
Figure 17: A retimed data-flow program that can leverage 3 parallel processors. (a) Retimed DFG with 3 unit-delays. (b) Altered precedence graph of the retimed system. (c) Three-processor schedule with reduced iteration period 4.
Note that retiming leaves the total number of delays in cycles unchanged; hence, the iteration bound (see section 3.4) remains unchanged, too. However, retiming can change the total number of delays in a DFG. For practical schedules, the additional parallelism produced by retiming also relies on pipelining. Pipelining of tasks on different processors requires inter-processor communication (IPC). Depending on the effects of this additional IPC, a retimed program can execute at an increased iteration rate. Data-flow programs with large-grain tasks as well as fine-grain tasks may profit from retiming.
5.3 Look-Ahead-Transformation
Features of the look-ahead transformation. The look-ahead transformation aims at increasing parallelism in recursive systems, i.e. systems that have feedback loops. The main parameter of this transformation is the look-ahead depth L. After a look-ahead transformation together with recursive doubling by depth L, the transformed model performs L iterations in time O(log2 L) if at least L parallel processors are available. In this section, the basic approach of the look-ahead transformation on first-order recurrence systems is outlined.
Example 5.3.1 (Look-ahead transformation of a first-order recursive system). The following example illustrates the idea of the look-ahead transformation. A first-order recursive system is transformed with L = 2 and a two-processor schedule for the transformed system is given. Note that this is a slightly modified version of a more comprehensive example found in [Par89b]. Consider the following equation of a basic first-order recursive system:

    y(t + 1) = a y(t) + b x(t) + c .    (1)

The input series of this system is x(t), the output series is y(t), and a, b and c are constants. While a and b are constant factors, c is a constant summand. For example, this basic system can be configured to realize a stable discrete low-pass filter section by assigning a + b = 1 with 0 ≤ a ≤ 1 and 0 ≤ b ≤ 1 and using c for input offset compensation. This system is first-order recursive, since the term y(t + 1) is calculated depending on its preceding value y(t). It is this recursive dependency in the series y(t) that hinders efficient parallelization of system (1). Often, a recursively defined series like y(t) is also called a state of the system.
Figure 18: A data-flow graph of equation 1 with input series x(t), output series y(t), constants a, b and c, and one unit-delay D in the recursive loop. The upper right box explains the multiply-and-add operator used as a shortcut in this graph.
The following paragraphs show how look-ahead shifts a state's immediate inter-iteration dependency further into the future, thus creating pipeline-interleaved parallelism.
Figure 18 shows a data-flow graph of the system described by equation 1. In the following, a look-ahead transformation with L = 2 is applied in two steps: recasting and static look-ahead computation. First, equation 1 is recast by expressing y(t + 2) as a function of y(t) to derive

    y(t + 2) = a [a y(t) + b x(t) + c] + b x(t + 1) + c .    (2)

Second, static look-ahead computation is applied to equation 2, finally obtaining L - 1 = 1 steps of look-ahead in equation 3:

    y(t + 2) = a² y(t) + a b x(t) + b x(t + 1) + a c + c .    (3)

Figure 19 shows a data-flow graph of the transformed system expressed by equation 3. Note that the terms a c + c, b, a b and a² are constant and can be precomputed at compile time.

Figure 19: An equivalent first-order recursive system transformed with L = 2, leading to L - 1 = 1 steps of look-ahead and two unit-delays 2D in the recursive loop.
The transformed model exposes one step of look-ahead at its inputs, manifested in the new input series x(t + 1) instead of x(t) from the original model. Finally, this model is two-way parallelized using a pipeline-interleaving approach for scheduling. Figure 20 shows a partial two-processor schedule for the look-ahead transformed system of figure 19 using a pipeline-interleaving approach. This schedule offers two-fold parallelism and, thus, a theoretical speed-up of 2. The practical speed-up depends on the efficient realization of the pipeline interleaving on concrete CPU cores and on the overhead imposed by the additional multiplications and delays in the look-ahead transformed system.
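The equivalence of the original recurrence (1) and the transformed recurrence (3) can be checked numerically. The following sketch is a functional check only, not a model of the two-processor pipeline-interleaved execution; the coefficients and the input series are arbitrary example values.

    # Sketch: the look-ahead transformed recurrence (3) reproduces the output of (1).
    def original(x, a, b, c, y0=0.0):
        y = [y0]
        for t in range(len(x)):
            y.append(a * y[t] + b * x[t] + c)     # y(t+1) = a*y(t) + b*x(t) + c
        return y

    def look_ahead(x, a, b, c, y0=0.0):
        a2, ab, acc = a * a, a * b, a * c + c     # constants precomputed "at compile time"
        y = {0: y0, 1: a * y0 + b * x[0] + c}     # y(1) still needs one original step
        for t in range(len(x) - 1):               # y(t+2) = a^2*y(t) + a*b*x(t) + b*x(t+1) + a*c + c
            y[t + 2] = a2 * y[t] + ab * x[t] + b * x[t + 1] + acc
        return [y[t] for t in sorted(y)]

    x, (a, b, c) = [1.0, 2.0, 0.5, 3.0, 1.5], (0.7, 0.3, 0.1)
    assert all(abs(p - q) < 1e-12 for p, q in zip(original(x, a, b, c), look_ahead(x, a, b, c)))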
Time (t)                | 0, 1  | 2, 3  | 4, 5
Processor 1, state y(t) | y(-1) | y(1)  | y(3)
Processor 2, state y(t) | y(0)  | y(2)  | y(4)

Figure 20: A partial two-processor schedule for the system of figure 19 using a pipeline-interleaving approach.
Iteration period and signal processing. The look-ahead transformation is capable of reducing the iteration period below the iteration bound of the original data-flow model. This reduction is achieved by actually modifying the algorithm described by the original data-flow model, while leaving the input-output behavior unchanged. Transforming a data-flow model by a look-ahead of L introduces L-fold parallelism in the transformed model. Look-ahead is an interesting technique for creating additional parallelism in a class of data-intensive signal-processing applications that process large amounts of input data. Digital signal processing, e.g. filtering, usually applies linear functions like additions and multiplications, which are well suited for the look-ahead transformation. The throughput of such data-intensive applications increases significantly by using recursive doubling along with the look-ahead transformation.
Additional latency and reactive systems. Looking ahead L iterations implies that input values for the next L iterations must be available before the transformed data-flow model can produce its first output values. Thus, a look-ahead transformation by L introduces an additional output latency of L iterations. This may result in exceeding the maximum admissible response times in reactive systems. Depending on the concrete scenario, highly reactive embedded applications (that require short response times upon real-time input) may not profit from look-ahead.
History, application and scope. Look-ahead techniques were first presented by Kogge and Stone [KS73] as a solution to a general class of recurrence equations. This work led to what is also known as recursive doubling algorithms. Later, Parhi [Par89a], [Par89b] showed how these techniques increase the parallelism in recursive digital filters. Parhi also describes how look-ahead can even be applied to parallelize state automatons and other non-linear time-varying systems. Originally, this work aimed at the design of synchronous integrated circuits, but it is useful for software applications described by iterative data-flow, too.
6 Tool-Integration
Cadmos - A concurrent architectures research toolkit. Many of the techniques presented in this document are implemented in Cadmos, a toolkit for the Eclipse Rich Client Platform (RCP) [Ecl]. The intended purpose of this toolkit is research in the area of concurrent architectures in embedded software systems. Cadmos is developed by the chair of software and systems engineering of the Institut für Informatik at the Technische Universität München (TUM). Cadmos does not offer a programming language by itself, but rather offers an interface to integrate with existing programming and modeling languages. At the moment of writing, integration with MATLAB/Simulink [Mat] is work in progress, so as to allow for concurrency analysis and transformation of industrial Simulink- and Stateflow-models. Additionally, Cadmos has extensive support for the case-tool AutoFocus [Aut].

Figure 21: Modeling with the case-tool AutoFocus supported by interactive concurrency
analysis of Cadmos.
The Cadmos toolkit offers several analysis views. Some of these views visually present data-flow graphs and precedence graphs with automatic layout. The analysis techniques presented in section 3 are being implemented incrementally. Other views are concerned with the presentation of metrics as discussed in section 4. The unfolding transformation is already implemented; interactive retiming and look-ahead transformations are left for future work. For example, in section 5.1 the figures of DFGs and precedence graphs (figures 11(a), 11(b), 12(a), 13(a), 13(b), 14(a)) are analyzed, transformed and rendered by Cadmos. Figure 21 illustrates the integration of Cadmos with the case-tool AutoFocus. Data-flow graph visualization with WCC- and cycle-highlighting (see sections 3.2 and 3.3) is shown in the upper-right corner in this example. Delay profiles (see sections 3.5 and 4.3.2) are shown in the lower-right corner in this example.
Figure 22: Visualization and transformation in Cadmos. (a) Data-flow graph visualization. (b) Unfolding transformation and visualization with Kamada-Kawai layout. (c) Intra- and inter-iteration precedence graph construction and visualization.
AutoFocus - A research prototype for seamless model-based development. AutoFocus [Aut] is developed and maintained by the chair of software and systems engineering at TUM. The purpose of AutoFocus is to serve as a research prototype for integrated modeling, simulation, verification and deployment of reactive embedded software systems. In AutoFocus, the architecture and behavior of software for embedded systems is specified by causal component networks (see section 2.3). The composition of these networks is based on timed versions of the operators presented in section 2.4. AutoFocus's model of computation closely resembles iterative data-flow (see section 2.1). Thus, an AutoFocus-model can easily be translated to an iterative data-flow model (see section 2.3) and subsequently be analyzed by the Cadmos toolkit. As mentioned above, a similar integration in the design and analysis of MATLAB/Simulink-models is work in progress.
7 Conclusion and Future Work


First experiments with the tool integration (see section 6) show the practicability of the presented analysis and transformation techniques (see sections 3 and 5) in supporting the software engineering process for parallel embedded software systems. The metrics discussed in section 4 give support in design and deployment decisions. Nevertheless, the evaluation of the practicability of these metrics and the introduction of further metrics is future work.
Elaborate hardware description required. The presented techniques mainly focus on iterative data-flow models for the software side. In the future, these need to be more closely coupled with elaborate models for the hardware side to be practical. A sufficient description of the hardware side should comprise embedded controller networks, cores, field-buses and interlink-buses. A promising approach in this direction is the technical perspective explained in [TRS+10].
Time- and space-efficient static schedulers required. Time- and space-efficient static scheduling techniques are required for the construction of arbitrary real-time schedules with communication latencies. For example, imagine the complexity of constructing a static real-time schedule for a complete car with circa 1000 tasks to be scheduled on 100 cores, distributed over 40 embedded controllers, connected by 5 field-buses. It is left for future work to further investigate static scheduling with Hu-level methods [Hu61], A* methods ([HNR68], [PLM99]), and solver-based methods ([Gre68], [Vos10]).
Integration with industry-relevant tools. The integration with more tools that are relevant
in the embedded software systems industry is future work. A major step in this direction is
the integration with MATLAB/Simulink, which is work in progress.
References

[ABC+06] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[Amd67] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS '67 (Spring): Proc. of the April 18-20, 1967, spring joint computer conference, pages 483-485, New York, NY, USA, 1967. ACM.

[Aut] AutoFocus3 Homepage. http://af3.in.tum.de/.

[BDD+92] Manfred Broy, Frank Dederich, Claus Dendorfer, Max Fuchs, Thomas Gritzner, and Rainer Weber. The design of distributed systems - an introduction to Focus. Technical Report TUM-I9202, Technische Universität München, Jan 1992.

[Bod95] A. Bode. Klassifikation paralleler Architekturen. In Parallelrechner: Architekturen-Systeme-Werkzeuge, Leitfäden der Informatik, Teubner, Stuttgart, Germany, pages 11-40, 1995.

[Bro95] Manfred Broy. Advanced component interface specification. In Takayasu Ito and Akinori Yonezawa, editors, Theory and Practice of Parallel Programming - International Workshop TPPP'94, pages 89-104. Springer, 1995.

[Bro06a] M. Broy. Challenges in automotive software engineering. In ICSE '06: Proceedings of the 28th international conference on Software engineering, pages 33-42, New York, NY, USA, 2006. ACM.

[Bro06b] M. Broy. The "grand challenge" in informatics: Engineering software-intensive systems. Computer, 39(10):72-80, 2006.

[Bro10] M. Broy. Relating time and causality in interactive distributed systems. European Review, 18:507-563, 2010.

[BS01] M. Broy and K. Stoelen. Specification and development of interactive systems: Focus on streams, interfaces, and refinement, 2001.

[Den80] J.B. Dennis. Data flow supercomputers. IEEE Computer, 13(11):48-56, 1980.

[DK82] A. L. Davis and R. M. Keller. Data flow program graphs. Computer, 15(2):26-41, 1982.

[Ecl] Eclipse Homepage - The Eclipse Foundation open source community website. http://www.eclipse.org/.

[Fly72] M.J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, 21:948-960, 1972.

[FW78] Steven Fortune and James Wyllie. Parallelism in random access machines. In STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing, pages 114-118, New York, NY, USA, 1978. ACM.

[Gre68] H.H. Greenberg. A branch-bound solution to the general scheduling problem. Operations Research, 16(2):353-361, 1968.

[GTA06] Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. SIGOPS Oper. Syst. Rev., 40(5):151-162, 2006.

[HNR68] P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100-107, Jul 1968.

[Hu61] T. C. Hu. Parallel sequencing and assembly line problems. Operations Research, 9(6):841-848, Nov.-Dec. 1961.

[KM66] Richard M. Karp and Raymond E. Miller. Properties of a model for parallel computations: Determinacy, termination, queueing. SIAM Journal on Applied Mathematics, 14(6):1390-1411, 1966.

[KS73] P.M. Kogge and H.S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22:786-793, 1973.

[Lee06] Edward A. Lee. The problem with threads. Computer, 39:33-42, 2006.

[LM87a] E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers, C-36(1):24-35, January 1987.

[LM87b] Edward A. Lee and David G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235-1245, September 1987.

[LRS83] C.E. Leiserson, F.M. Rose, and J.B. Saxe. Optimizing synchronous circuitry by retiming. In Third Caltech Conference on Very Large Scale Integration, pages 87-116. Computer Science Press, Incorporated, 1983.

[LS91] C.E. Leiserson and J.B. Saxe. Retiming synchronous circuitry. Algorithmica, 6(1):5-35, 1991.

[LS11] Edward A. Lee and Sanjit A. Seshia. Introduction to Embedded Systems - A Cyber-Physical Systems Approach. http://LeeSeshia.org/, 2011.

[Mat] MathWorks Homepage - MATLAB and Simulink for technical computing. http://www.mathworks.com/.

[Par89a] Keshab K. Parhi. Algorithm transformation techniques for concurrent processors. Proceedings of the IEEE, 77:1879-1895, Dec. 1989.

[Par89b] Keshab K. Parhi. Pipeline interleaving and parallelism in recursive digital filters. I. Pipelining using scattered look-ahead and decomposition. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:1099-1117, Jul 1989.

[PLM99] D. Piriyakumar, Paul Levi, and C. Murthy. Optimal scheduling of iterative data-flow programs onto multiprocessors with non-negligible interprocessor communication. In Peter Sloot, Marian Bubak, Alfons Hoekstra, and Bob Hertzberger, editors, High-Performance Computing and Networking, volume 1593 of Lecture Notes in Computer Science, pages 732-743. Springer Berlin / Heidelberg, 1999. doi:10.1007/BFb0100634.

[PM91] Keshab K. Parhi and David G. Messerschmitt. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Trans. Comput., 40(2):178-195, 1991.

[Rei68] Raymond Reiter. Scheduling parallel computations. J. ACM, 15(4):590-599, 1968.

[Sch09] T. Schuele. A coordination language for programming embedded multi-core systems. In 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 201-209. IEEE, 2009.

[Tar72] Robert Endre Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146-160, 1972.

[TRS+10] Judith Thyssen, Daniel Ratiu, Wolfgang Schwitzer, Alexander Harhurin, Martin Feilkas, and Eike Thaden. A system for seamless abstraction layers for model-based development of embedded software. In Software Engineering (Workshops), pages 137-148, 2010.

[Vos10] Sebastian Voss. Integrated Task and Message Scheduling in Time-Triggered Aeronautic Systems. PhD thesis, Universität Duisburg-Essen, Fakultät für Wirtschaftswissenschaften, Institut für Informatik und Wirtschaftsinformatik, 2010.

[XP93] J. Xu and D.L. Parnas. On satisfying timing constraints in hard-real-time systems. IEEE Transactions on Software Engineering, 19:70-84, 1993.