Background

● Async-Finish Parallelism
● Work Stealing Schedulers
● Tracing and Performance Analysis
● Continuation Steps and Continuations
● All tasks are assigned levels; the initial task is at level 0
Background
● Tracing:
○ Tracing captures order of events of interest
○ Used to study runtime behaviour.
● Continuation Step:
○ A sequence of instructions with no intervening parallelism constructs
(async/finish); see the sketch below
● Partial Continuation:
○ A continuation representing a proper subset of the computation
represented by a task
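To make the step boundaries concrete, here is a minimal async-finish fragment, a sketch only: the cotton-style calls (cotton::start_finish, cotton::async, cotton::end_finish) are assumed from the library named in the implementation plans at the end, and a() through d() are placeholder work functions.

    #include "cotton.h"  // assumed header for the cotton runtime
    void a(); void b(); void c(); void d();  // placeholder work functions

    void compute() {
        a();                          // step 0: runs until the first parallel construct
        cotton::start_finish();
        cotton::async([] { b(); });   // child task: its body forms its own step(s)
        c();                          // step 1: continuation after the async
        cotton::end_finish();
        d();                          // step 2: continuation after the finish scope
    }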
Problem Statement
● Want to completely track execution, but help-first and work-first
schedulers have different stealing structures.
● Main challenge: “identify key properties of work-first and help-first
scheduling to compactly identify leaves of the tree of steps rooted at
the initial step in each working phase”
Contributions
● Identify key properties of work-first and help-first schedulers using
async-finish to represent stealing relationships
● Algorithms to trace and replay async-finish programs by efficiently
constructing the steal tree;
● Demonstration of low space overheads and within-variance
perturbation of execution in tracing
Contributions
● Reduce cost of data race detection using an algorithm maintaining
and traversing the dynamic program structure tree
● Retentive stealing algorithms for recursive parallel programs
○ (prior work required explicitly enumerated task collections).
Motivation
● Programming models support dynamic task parallelism using work
stealing (Cilk, OpenMP, X10)
● Dynamic behaviour is hard to analyze.
○ No. of tasks >> No. of threads
○ Less structured mapping of work to threads
○ Traces can be large, hard to analyze
Insights (Help-First)
● Observation: Two tasks are in the same immediately enclosing finish
scope if the closest finish scope that encloses each of them also encloses
their common ancestor.
● Theorem: The tasks executed and steal operations encountered in each
working phase can be fully described by:
○ the level of the root in the total ordering of the steal operations on
the victim’s working phase, and
○ the number of tasks and the step of the continuation stolen at each level.
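A plausible in-memory form of this per-steal description, assuming plain integer fields rather than the paper's exact layout:

    // Help-first: one record per level at which a steal occurred in a
    // victim's working phase. Field names are illustrative assumptions.
    struct HelpFirstStealRecord {
        int level;     // level of the stolen continuation
        int step;      // step of the continuation stolen at this level
        int numTasks;  // number of tasks stolen at this level
    };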
Insights (Work-First)
● Observation: The deque, with l tasks in it, consists of one continuation at
each of levels 0 through l − 1, with level 0 at the steal-end, and the
continuation at level i spawned by the step that precedes the continuation
at level i − 1.
● Theorem: The tasks executed and steal operations encountered in each
working phase can be fully described by:
○ the level of the root in the total ordering of the steal operations on the
victim’s working phase, and
○ the step of the continuation stolen at each level.
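Under the deque invariant above, the task count disappears: each level holds exactly one continuation, so a hypothetical work-first record needs only the step (the level is implied by the steal's position in the total order).

    // Work-first counterpart (illustrative): exactly one continuation sits
    // at each level, so recording the step per steal suffices.
    struct WorkFirstStealRecord {
        int step;  // step of the stolen continuation; level follows from steal order
    };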
Implementation (Tracing)
● ContinuationHdr for continuations and tasks
○ Stores level and step
○ Each new async/finish updates the current level tracked for each continuation
● WorkerStateHdr object contained in every worker
○ Ordered list of working phases executed by the worker
● Help-first schedulers also store the number of tasks stolen at each level
(see the sketch below):
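The original slide shows these headers as a code figure; below is a minimal C++ sketch of the structures just described, with vector-based storage and illustrative field names standing in for the paper's exact layout.

    #include <vector>

    struct ContinuationHdr {
        int level;  // depth in the spawn tree; updated at each async/finish
        int step;   // index of the current continuation step within the task
    };

    struct StealRecord {
        int step;            // step of the stolen continuation
        int numTasksStolen;  // help-first only: tasks stolen at this level
    };

    struct WorkingPhaseInfo {
        std::vector<StealRecord> steals;  // steals suffered in this phase, in order
    };

    struct WorkerStateHdr {
        std::vector<WorkingPhaseInfo> phases;  // ordered working phases of this worker
    };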
Implementation (Tracing)
● Help-first: the thief marks the steal in the victim’s HelpFirstInfo
○ myrank is the thief thread’s rank
● Work-first: the steal is marked in the victim’s WorkingPhaseInfo
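A sketch of the marking step itself, assuming the structures above; in a real scheduler this update must be synchronized with the victim.

    #include <vector>

    struct StealEntry { int step; int numTasks; int thiefRank; };

    // The thief appends a record, tagged with its own rank (myrank), to the
    // victim's current working phase. Names are illustrative.
    void markSteal(std::vector<StealEntry>& victimPhaseSteals,
                   int step, int numTasks, int myrank) {
        victimPhaseSteals.push_back({step, numTasks, myrank});
    }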
Implementation (Replay)
● Traces include timing information, allowing replay
● Threads execute their assigned working phases in order
● The corresponding thief is notified when a previously stolen task is spawned
● Tasks track whether their children could have been stolen
○ An additional field is added to ContinuationHdr (sketched below):
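The slide's listing is not reproduced here; one plausible form of the extra field, stated as an assumption, is a flag on ContinuationHdr:

    struct ContinuationHdr {
        int level;
        int step;
        bool childStolen = false;  // replay-only: set if a spawned child was
                                   // stolen in the traced run, so the recorded
                                   // thief must be notified at this spawn
    };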
Implementation (Replay)
● A worker notifies the thief when it encounters a task that was stolen in
the traced run.
● Threads wait for the initial task to be created by the victim.
● The same code path is used for both help-first and work-first replay.
● Help-first traces: during replay, the number of child tasks spawned by a
task within one finish scope is tracked by the augmented HelpFirstInfo
(sketched below).
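A sketch of that replay bookkeeping, with hypothetical helper names (StealEntry, notifyThief) and an assumed counter field on HelpFirstInfo:

    #include <vector>

    struct StealEntry { int step; int numTasks; int thiefRank; };

    struct HelpFirstInfo {
        std::vector<StealEntry> steals;  // recorded during tracing
        int childrenSpawned = 0;         // children spawned in the current finish scope
    };

    // Stub: hand the just-spawned task to the worker that stole it when traced.
    void notifyThief(int thiefRank) { /* runtime-specific send/notify */ }

    // Called at each child spawn while replaying a help-first trace.
    void onChildSpawn(HelpFirstInfo& info, int currentStep) {
        ++info.childrenSpawned;
        for (const StealEntry& s : info.steals)
            if (s.step == currentStep && info.childrenSpawned <= s.numTasks)
                notifyThief(s.thiefRank);  // this child matches a recorded steal
    }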
Space Utilization
● n: number of working phases
● v: number of bytes required for a thread identifier
● Si: number of steals in working phase i
● m: number of bytes required for a step identifier
● k: number of bytes required to store the maximum number of tasks at a
given level
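Combining these symbols with the theorems above — one thread identifier per working phase, plus per steal a step identifier and, for help-first only, a task count — yields a rough estimate; this is an assembled bound, not the paper's exact formula:

    \mathrm{Space_{HF}} \approx n\,v + \sum_{i=1}^{n} S_i\,(m + k),
    \qquad
    \mathrm{Space_{WF}} \approx n\,v + \sum_{i=1}^{n} S_i\,m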
Experimental Methodology
● Shared-Memory Experiments:
○ POWER7 system, 128 GB memory
○ Quad-chip module
■ Eight cores per chip (3.8 GHz)
■ Supporting four-way simultaneous multithreading
● Distributed-Memory Experiments
○ OLCF Cray XK6 ‘Titan’, (18688 node system)
○ 1 AMD sixteen-core 2.2GHz ‘Bulldozer’ processor / node
○ 32GB DDR3 memory / node
Experimental Methodology
● Compilation:
○ GCC 4.4.6 on POWER 7
○ PGI 12.5.0 on Titan.
● Distributed-Memory Benchmarks used MPI (MPICH2 5.5.0) for
communication.
● Implemented tracing algorithms for shared-memory systems in Cilk 5.4.6.
Results (Shared Memory)
● Low tracing overhead, with high variability between runs
Results (Shared Memory)
● Error bars show the standard deviation over 15 runs.
● Low standard deviation: random stealing schedules do not significantly
impact storage overhead.

Results (Shared Memory)
● Variation in executing identical iterations of Heat on 120 threads
● Significant underutilization when running LU
Results (Distributed Memory)
● Ratio of execution time with tracing to execution time without tracing
● Overhead is low and mostly within the error bars
Results (Distributed Memory)
● Storage overhead in KB/core that we incur with our tracing schemes.
● AQ’s grain size is very fine; work-first (WF) reduces parallel slack
relative to help-first (HF)
Results (Distributed Memory)
● Percentage reduction in tree traversals achieved by exploiting subtree
information to bypass traversing regions of the tree
Results (Distributed Memory)
● Storage per core required over time for retentive stealing
● Steals in subsequent iterations cause more partitioning of the work.
Results (Distributed Memory)
● Convergence rate is application and scale dependent.
Conclusions
● Algorithms presented efficiently construct steal trees
● Enable low overhead tracing and replay of async-finish
programs
● Work can be extended to:
○ Optimizing data race detection
○ Extending the class of programs that can employ
retentive stealing.
Q&A
Implementation Plans
Fairly basic implementation details so far:

● New ‘thread’ class containing its own pthread to carry out the actual parallelism, with added
fields to accommodate WorkerStateHdr, etc.
● Task structs modified to have their own ContinuationHdrs, etc., to keep track of levels.
● New entry points such as cotton::startTrace() and cotton::endTrace() (see the sketch below).
● (If time permits) data race detection implementation (pseudocode given in Sec. 8.1 of the paper).
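A usage sketch of the planned tracing API; the runtime calls around it (cotton::init_runtime, start_finish, async, end_finish, finalize_runtime) are assumed from the cotton course library, and startTrace()/endTrace() are the planned additions, not an existing interface.

    #include "cotton.h"  // assumed header for the cotton runtime

    int main() {
        cotton::init_runtime();
        cotton::startTrace();            // planned: begin recording phases and steals
        cotton::start_finish();
        cotton::async([] { /* task body */ });
        cotton::end_finish();
        cotton::endTrace();              // planned: flush per-worker trace state
        cotton::finalize_runtime();
        return 0;
    }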
