Background

● Async-Finish Parallelism
● Work Stealing Schedulers
● Tracing and Performance Analysis
● Continuation Steps and Continuations
● All tasks are assigned levels; the initial task is at level 0
Background
● Tracing:
○ Tracing captures order of events of interest
○ Used to study runtime behaviour.
● Continuation Step:
○ A sequence of instructions with no intervening parallelism constructs
(async/finish); see the sketch below
● Partial Continuation:
○ A continuation representing a proper subset of the computation
represented by a task
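To make the step boundaries concrete, here is a minimal async-finish fragment, a sketch only: the cotton-style calls (cotton::start_finish, cotton::async, cotton::end_finish) are assumed from the library named in the implementation plans at the end, and a() through d() are placeholder work functions.

    #include "cotton.h"  // assumed header for the cotton runtime
    void a(); void b(); void c(); void d();  // placeholder work functions

    void compute() {
        a();                          // step 0: runs until the first parallel construct
        cotton::start_finish();
        cotton::async([] { b(); });   // child task: its body forms its own step(s)
        c();                          // step 1: continuation after the async
        cotton::end_finish();
        d();                          // step 2: continuation after the finish scope
    }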
Problem Statement
● Want to completely track execution, but help-first and work-first
schedulers have different stealing structures.
● Main challenge: “identify key properties of work-first and help-first
scheduling to compactly identify leaves of the tree of steps rooted at
the initial step in each working phase”
Contributions
● Identify key properties of work-first and help-first schedulers using
async-finish to represent stealing relationships
● Algorithms to trace and replay async-finish programs by efficiently
constructing the steal tree;
● Demonstration of low space overheads and within-variance
perturbation of execution in tracing
Contributions
● Reduce cost of data race detection using an algorithm maintaining
and traversing the dynamic program structure tree
● Retentive stealing algorithms for recursive parallel programs
○ (prior work required explicitly enumerated task collections).
Motivation
● Programming models support dynamic task parallelism using work
stealing (Cilk, OpenMP, X10)
● Dynamic behaviour is hard to analyze.
○ No. of tasks >> No. of threads
○ Less structured mapping of work to threads
○ Traces can be large, hard to analyze
Insights (Help-First)
● Observation: Two tasks are in the same immediately enclosing finish
scope if the closest finish scope that encloses each of them also encloses
their common ancestor.
● Theorem: The tasks executed and steal operations encountered in each
working phase can be fully described by:
○ the level of the root in the total ordering of the steal operations on
the victim’s working phase, and
○ the number of tasks and the step of the continuation stolen at each level.
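A plausible in-memory form of this per-steal description, assuming plain integer fields rather than the paper's exact layout:

    // Help-first: one record per level at which a steal occurred in a
    // victim's working phase. Field names are illustrative assumptions.
    struct HelpFirstStealRecord {
        int level;     // level of the stolen continuation
        int step;      // step of the continuation stolen at this level
        int numTasks;  // number of tasks stolen at this level
    };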
Insights (Work-First)
● Observation: The deque, with l tasks in it, consists of one continuation at
each of levels 0 through l − 1, with level 0 at the steal-end, and the
continuation at level i spawned by the step that precedes the continuation
at level i − 1.
● Theorem: The tasks executed and steal operations encountered in each
working phase can be fully described by:
○ the level of the root in the total ordering of the steal operations on the
victim’s working phase, and
○ the step of the continuation stolen at each level.
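Under the deque invariant above, the task count disappears: each level holds exactly one continuation, so a hypothetical work-first record needs only the step (the level is implied by the steal's position in the total order).

    // Work-first counterpart (illustrative): exactly one continuation sits
    // at each level, so recording the step per steal suffices.
    struct WorkFirstStealRecord {
        int step;  // step of the stolen continuation; level follows from steal order
    };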
Implementation (Tracing)
● ContinuationHdr for continuations and tasks
○ Stores level and step
○ Each new async/finish updates the current level tracked for each continuation
● WorkerStateHdr object contained in every worker
○ Ordered list of working phases executed by the worker
● Help-first schedulers also store the number of tasks stolen at each level
(see the sketch below):
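The original slide shows these headers as a code figure; below is a minimal C++ sketch of the structures just described, with vector-based storage and illustrative field names standing in for the paper's exact layout.

    #include <vector>

    struct ContinuationHdr {
        int level;  // depth in the spawn tree; updated at each async/finish
        int step;   // index of the current continuation step within the task
    };

    struct StealRecord {
        int step;            // step of the stolen continuation
        int numTasksStolen;  // help-first only: tasks stolen at this level
    };

    struct WorkingPhaseInfo {
        std::vector<StealRecord> steals;  // steals suffered in this phase, in order
    };

    struct WorkerStateHdr {
        std::vector<WorkingPhaseInfo> phases;  // ordered working phases of this worker
    };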
Implementation (Tracing)
● Help-first: the thief marks the steal in the victim’s HelpFirstInfo
○ myrank is the thief thread’s rank
● Work-first: the steal is marked in the victim’s WorkingPhaseInfo
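A sketch of the marking step itself, assuming the structures above; in a real scheduler this update must be synchronized with the victim.

    #include <vector>

    struct StealEntry { int step; int numTasks; int thiefRank; };

    // The thief appends a record, tagged with its own rank (myrank), to the
    // victim's current working phase. Names are illustrative.
    void markSteal(std::vector<StealEntry>& victimPhaseSteals,
                   int step, int numTasks, int myrank) {
        victimPhaseSteals.push_back({step, numTasks, myrank});
    }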
Implementation (Replay)
● Traces include timing information, allowing replay
● Threads execute their assigned working phases in order
● The corresponding thief is notified when a previously stolen task is spawned
● Tasks track whether their children could have been stolen
○ An additional field is added to ContinuationHdr (sketched below):
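The slide's listing is not reproduced here; one plausible form of the extra field, stated as an assumption, is a flag on ContinuationHdr:

    struct ContinuationHdr {
        int level;
        int step;
        bool childStolen = false;  // replay-only: set if a spawned child was
                                   // stolen in the traced run, so the recorded
                                   // thief must be notified at this spawn
    };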
Implementation (Replay)
● A worker notifies the thief when it encounters a task that was stolen in
the traced run.
● Threads wait for the initial task to be created by the victim.
● The same code path is used for both help-first and work-first replay.
● Help-first traces: during replay, the number of child tasks spawned by a
task within one finish scope is tracked by the augmented HelpFirstInfo
(sketched below).
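A sketch of that replay bookkeeping, with hypothetical helper names (StealEntry, notifyThief) and an assumed counter field on HelpFirstInfo:

    #include <vector>

    struct StealEntry { int step; int numTasks; int thiefRank; };

    struct HelpFirstInfo {
        std::vector<StealEntry> steals;  // recorded during tracing
        int childrenSpawned = 0;         // children spawned in the current finish scope
    };

    // Stub: hand the just-spawned task to the worker that stole it when traced.
    void notifyThief(int thiefRank) { /* runtime-specific send/notify */ }

    // Called at each child spawn while replaying a help-first trace.
    void onChildSpawn(HelpFirstInfo& info, int currentStep) {
        ++info.childrenSpawned;
        for (const StealEntry& s : info.steals)
            if (s.step == currentStep && info.childrenSpawned <= s.numTasks)
                notifyThief(s.thiefRank);  // this child matches a recorded steal
    }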
Space Utilization
● n: number of working phases
● v: number of bytes required for a thread identifier
● Si: number of steals in working phase i
● m: number of bytes required for a step identifier
● k: number of bytes required to store the maximum number of tasks at a
given level
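Combining these symbols with the theorems above — one thread identifier per working phase, plus per steal a step identifier and, for help-first only, a task count — yields a rough estimate; this is an assembled bound, not the paper's exact formula:

    \mathrm{Space_{HF}} \approx n\,v + \sum_{i=1}^{n} S_i\,(m + k),
    \qquad
    \mathrm{Space_{WF}} \approx n\,v + \sum_{i=1}^{n} S_i\,m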
Experimental Methodology
● Shared-Memory Experiments:
○ POWER7 system, 128 GB memory
○ Quad-chip module
■ Eight cores per chip (3.8 GHz)
■ Supporting four-way simultaneous multithreading
● Distributed-Memory Experiments
○ OLCF Cray XK6 ‘Titan’, (18688 node system)
○ 1 AMD sixteen-core 2.2GHz ‘Bulldozer’ processor / node
○ 32GB DDR3 memory / node
Experimental Methodology
● Compilation:
○ GCC 4.4.6 on POWER 7
○ PGI 12.5.0 on Titan.
● Distributed-Memory Benchmarks used MPI (MPICH2 5.5.0) for
communication.
● Implemented tracing algorithms for shared-memory systems in Cilk 5.4.6.
Results (Shared Memory)
● Low tracing overhead, with high variability between runs
Results (Shared Memory)
● Error bars show the standard deviation over 15 runs.
● Low standard deviation: random stealing schedules do not significantly
impact storage overhead.

Results (Shared Memory)
● Variation in executing identical iterations of Heat on 120 threads
● Significant underutilization when running LU
Results (Distributed Memory)
● Ratio of execution time with tracing to execution time without tracing
● Overhead is low and mostly within the error bars
Results (Distributed Memory)
● Storage overhead in KB/core that we incur with our tracing schemes.
● AQ’s grain size is very fine; work-first (WF) reduces parallel slack
relative to help-first (HF)
Results (Distributed Memory)
● Percentage reduction in tree traversals achieved by exploiting subtree
information to bypass traversing regions of the tree
Results (Distributed Memory)
● Storage per core required over time for retentive stealing
● Steals in subsequent iterations cause more partitioning of the work.
Results (Distributed Memory)
● Convergence rate is application and scale dependent.
Conclusions
● Algorithms presented efficiently construct steal trees
● Enable low overhead tracing and replay of async-finish
programs
● Work can be extended to:
○ Optimizing data race detection
○ Extending the class of programs that can employ
retentive stealing.
Q&A
Implementation Plans
Fairly basic implementation details so far:

● New ‘thread’ class containing its own pthread to carry out the actual parallelism, with added
fields to accommodate WorkerStateHdr, etc.
● Task structs modified to have their own ContinuationHdrs, etc., to keep track of levels.
● New entry points such as cotton::startTrace() and cotton::endTrace() (see the sketch below).
● (If time permits) data race detection implementation (pseudocode given in Sec. 8.1 of the paper).
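A usage sketch of the planned tracing API; the runtime calls around it (cotton::init_runtime, start_finish, async, end_finish, finalize_runtime) are assumed from the cotton course library, and startTrace()/endTrace() are the planned additions, not an existing interface.

    #include "cotton.h"  // assumed header for the cotton runtime

    int main() {
        cotton::init_runtime();
        cotton::startTrace();            // planned: begin recording phases and steals
        cotton::start_finish();
        cotton::async([] { /* task body */ });
        cotton::end_finish();
        cotton::endTrace();              // planned: flush per-worker trace state
        cotton::finalize_runtime();
        return 0;
    }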
