
Repairing Sequential Consistency in C/C++11

Ori Lahav (MPI-SWS, Germany), orilahav@mpi-sws.org
Viktor Vafeiadis (MPI-SWS, Germany), viktor@mpi-sws.org
Jeehoon Kang (Seoul National University, Korea), jeehoon.kang@sf.snu.ac.kr
Chung-Kil Hur (Seoul National University, Korea), gil.hur@sf.snu.ac.kr
Derek Dreyer (MPI-SWS, Germany), dreyer@mpi-sws.org

Abstract

The C/C++11 memory model defines the semantics of concurrent memory accesses in C/C++, and in particular supports racy atomic accesses at a range of different consistency levels, from very weak consistency (relaxed) to strong, sequential consistency (SC). Unfortunately, as we observe in this paper, the semantics of SC atomic accesses in C/C++11, as well as in all proposed strengthenings of the semantics, is flawed, in that (contrary to previously published results) both suggested compilation schemes to the Power architecture are unsound. We propose a model, called RC11 (for Repaired C11), with a better semantics for SC accesses that restores the soundness of the compilation schemes to Power, maintains the DRF-SC guarantee, and provides stronger, more useful, guarantees to SC fences. In addition, we formally prove, for the first time, the correctness of the proposed stronger compilation schemes to Power that preserve load-to-store ordering and avoid out-of-thin-air reads.

Categories and Subject Descriptors  D.1.3 [Concurrent Programming]: Parallel programming; D.3.1 [Programming Languages]: Formal Definitions and Theory (Semantics)

Keywords  Weak memory models; C++11; declarative semantics; sequential consistency

Saarland Informatics Campus.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, contact the Owner/Author(s). Request permissions from permissions@acm.org or Publications Dept., ACM, Inc., fax +1 (212) 869-0481.
Copyright © held by owner/author(s). Publication rights licensed to ACM.

1. Introduction

The C/C++11 memory model (C11 for short) [8] defines the semantics of concurrent memory accesses in C/C++, of which there are two general types: non-atomic and atomic. Non-atomic accesses are intended for normal data: races on such accesses are considered as programming errors and lead to undefined behavior, thus ensuring that they can be compiled to plain machine loads and stores and that it is sound to apply standard sequential optimizations on non-atomic accesses. In contrast, atomic accesses are specifically intended for communication between threads: thus, races on atomics are permitted, but at the cost of introducing hardware fence instructions during compilation and imposing restrictions on how such accesses may be merged or reordered.

The degree to which an atomic access may be reordered with other operations (and, more generally, the implementation cost of an atomic access) depends on its consistency level, concerning which C11 offers programmers several options according to their needs. Strongest and most expensive are sequentially consistent (SC) accesses, whose primary purpose is to restore the simple interleaving semantics of sequential consistency [20] if a program (when executed under SC semantics) only has races on SC accesses. This property is called DRF-SC and was a main design goal for C11. To ensure DRF-SC, the standard compilation schemes for modern architectures typically insert hardware fence instructions appropriately into the compiled code, with those for weaker architectures (like Power and ARMv7) introducing a full (strong) fence adjacent to each SC access.

Weaker than SC atomics are release-acquire accesses, which can be used to perform message passing between threads without incurring the implementation cost of a full SC access; and weaker and cheaper still are relaxed accesses, which are intended to be compiled down to plain loads and stores at the machine level and which provide only the minimal synchronization guaranteed by the hardware. Finally, the C11 model also supports language-level fence instructions, which provide finer-grained control over where hardware fences are to be placed and serve as a barrier to prevent unwanted compiler optimizations.

In this paper, we are mainly concerned with the semantics of SC atomics (i.e., SC accesses and SC fences), and their

interplay with the rest of the model. Since sequential consistency is such a classical, well-understood notion, one might expect that the semantics of SC atomics should be totally straightforward, but sadly, as we shall see, it is not!

The main problem arises in programs that mix SC and non-SC accesses to the same location. Although not common, such mixing is freely permitted by the C11 standard, and has legitimate uses, e.g., as a way of enabling faster (non-SC) reads from an otherwise quite strongly synchronized data structure. Indeed, we know of several examples of code in the wild that mixes SC accesses together with release/acquire or relaxed accesses to the same location: seqlocks [9] and Rust's crossbeam library [2]. Now, consider the following program due to Manerkar et al. [22]:

    a := x_acq  //1  ||  x :=_sc 1  ||  y :=_sc 1  ||  b := y_acq  //1
    c := y_sc   //0  ||             ||             ||  d := x_sc   //0      (IRIW-acq-sc)

Here and in all other programs in this paper, we write a, b, ... for local variables (registers), and assume that all variables are initialized to 0. The program contains two variables, x and y, which are accessed via SC atomic accesses and also read by acquire atomic accesses. The annotated behavior (reading a = b = 1 and c = d = 0) corresponds to the two threads observing the writes to x and y as occurring in different orders, and is forbidden by C11. (We defer the explanation of how C11 forbids this behavior to §2.)

Let's now consider how this program is compiled to Power. Two compilation schemes have been proposed [7]. Both use Power's strongest fence instruction, called sync, for the compilation of SC atomics. The first scheme, the one implemented in the GCC and LLVM compilers, inserts a sync fence before each SC access (leading sync convention), whereas the alternative scheme inserts a sync fence after each SC access (trailing sync convention). The intent of both schemes is to have a strong barrier between every pair of SC accesses, enforcing, in particular, sequential consistency on programs containing only SC accesses. Nevertheless, by mixing SC and release-acquire accesses, one can quickly get into trouble, as illustrated by IRIW-acq-sc.

In particular, if one compiles the program into Power using the trailing sync convention, then the behavior is allowed by Power.¹ Since all SC accesses are at the end of the threads, the trailing sync fences have no effect, and the example reduces to (the result of compilation of) IRIW with only acquire reads, which is allowed by the Power memory model. In §2.1, we show further examples illustrating that the other, leading sync scheme also leads to behaviors in the target of compilation to Power that are not permitted in the source.

Although the C11 model is known to have multiple problems (e.g., the out-of-thin-air problem [31, 11], the lack of monotonicity [30]), none of them until now affected the correctness of compilation to the mainstream architectures. In contrast, the IRIW-acq-sc program from [22] and our examples in §2.1 show that both the suggested compilation schemes to Power are unsound with respect to the C11 model, thereby contradicting the results of [7, 27]. The same problem occurs in some compilation schemes to ARMv7 (see §6), as well as for ARMv8 (see [3] for an example).

In the remainder of the paper, we propose a way to repair the semantics of SC accesses that resolves the problems mentioned above. In particular, our corrected semantics restores the soundness of the suggested compilation schemes to Power. Moreover, it still satisfies the standard DRF-SC theorem in the absence of relaxed accesses: if a program's sequentially consistent executions only ever exhibit races on SC atomic accesses, then its semantics under full C11 is also sequentially consistent. It is worth noting that this correction only affects the semantics of programs mixing SC and non-SC accesses to the same location: we show that, without such mixing, it coincides with the strengthened model of Batty et al. [5].

We also apply two additional, orthogonal, corrections to the C11 model, which strengthen the semantics of SC fences. The first fix corrects a problem already noted before [27, 21, 17], namely that the current semantics of SC fences does not recover sequential consistency, even when SC fences are placed between every two commands in programs with only release/acquire atomic accesses. The second fix provides stronger cumulativity guarantees for programs with SC fences. We justify these strengthenings by proving that the existing compilation schemes for x86-TSO, Power, and ARMv7 remain sound with the stronger semantics.

Finally, we apply another, mostly orthogonal, correction to the C11 model, in order to address the well-known out-of-thin-air problem. The problem is that the C11 standard permits certain executions as a result of causality cycles, which break even basic invariant-based reasoning [11] and invalidate DRF-SC in the presence of relaxed accesses. The correction, which is simple to state formally, is to strengthen the model to enforce load-to-store ordering for atomic accesses, thereby ruling out such causality cycles, at the expense of requiring a less efficient compilation scheme for relaxed accesses. The idea of this correction is not novel (it has been extensively discussed in the literature [31, 11, 30]), but the suggested compilation schemes to Power and ARMv7 have not yet been proven sound. Here, we give the first proof that one of these compilation schemes (the one that places a fake control dependency after every relaxed read) is sound. The proof is surprisingly delicate, and involves a novel argument similar to that in DRF-SC proofs.

Putting all these corrections together, we propose a new model called RC11 (for Repaired C11) that supports nearly all features of the C11 model (§3). We prove correctness of compilation to x86-TSO (§4), Power (§5), and ARMv7 (§6), the soundness of a wide collection of program transformations (§7), and a DRF-SC theorem (§8).

¹ Formally, we use the recent declarative model of Power by Alglave et al. [4].
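The access modes described in this introduction correspond to the `std::memory_order` arguments of C++11 atomics. As a minimal sketch of release-acquire message passing (the names `data`, `ready`, and `message_passing_once` are illustrative assumptions, not taken from the paper), a producer publishes a non-atomic payload with a release write and a consumer picks it up after an acquire read:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative names (assumptions): `data` is a plain non-atomic payload,
// `ready` is the atomic flag used to publish it.
int data = 0;
std::atomic<bool> ready{false};

// Runs one round of message passing and returns the value the consumer saw.
int message_passing_once() {
    data = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread producer([] {
        data = 42;                                     // non-atomic write
        ready.store(true, std::memory_order_release);  // release write publishes it
    });
    std::thread consumer([&seen] {
        // Spin until the acquire read observes the release write.
        while (!ready.load(std::memory_order_acquire)) {
        }
        // Race-free: the release/acquire pair orders the write to `data`
        // before this read.
        seen = data;
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Replacing both memory orders with `std::memory_order_relaxed` would turn the accesses to `data` into a data race, which is undefined behavior under the model described above.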

[Figure 1: execution graph with mo, rf, and sb edges over the events Wna(x, 0), Wna(y, 0), k: Wsc(x, 1), l: Racq(x, 1), m: Rsc(y, 0), n: Racq(y, 1), o: Rsc(x, 0), p: Wsc(y, 1).]
Figure 1. An execution of IRIW-acq-sc yielding the result a = b = 1 ∧ c = d = 0.

2. The Semantics of SC Atomics in C11: What's Wrong, and How Can We Fix It?

The C11 memory model defines the semantics of a program as a set of consistent executions. Each execution is a graph. Its nodes, E, are called events and represent the individual memory accesses and fences of the program, while its edges represent various relations among these events:

• The sequenced-before (sb) relation, a.k.a. program order, captures the order of events in the program's control flow.
• The reads-from (rf) relation associates each write with the set of reads that read from that write. In a consistent execution, the reads-from relation should be functional (and total) in the second argument: a read must read from exactly one write.
• Finally, the modification order (mo) is a union of total orders, one for each memory address, totally ordering the writes to that address. Intuitively, it records for each memory address the globally agreed-upon order in which writes to that address happened.

As an example, in Fig. 1, we depict an execution of the IRIW-acq-sc program discussed in the introduction. In addition to the events corresponding to the accesses appearing in the program, the execution contains two events for the implicit non-atomic initialization writes to x and y, which are assumed to be sb-before all other events.

Notation 1. Given a binary relation $R$, we write $R^?$, $R^+$, and $R^*$ respectively to denote its reflexive, transitive, and reflexive-transitive closures. The inverse relation is denoted by $R^{-1}$. We denote by $R_1; R_2$ the left composition of two relations $R_1$, $R_2$, and assume that $;$ binds tighter than $\cup$ and $\setminus$. Finally, we denote by $[A]$ the identity relation on a set $A$. In particular, $[A]; R; [B] = R \cap (A \times B)$.

Based on these three basic relations, C11 defines some derived relations. First, whenever an acquire or SC read reads from a release or SC write, we say that the write synchronizes with (sw) the read.² Next, we say that one event happens before (hb) another event if they are connected by a sequence of sb or sw edges. Formally, $hb \triangleq (sb \cup sw)^+$. For example, in Fig. 1, event k synchronizes with l and therefore k happens-before l and m. Lastly, whenever a read event e reads from a write that is mo-before another write f to the same location, we say that e reads-before (rb) f (this relation is also called from-read [4], but we find reads-before more intuitive). Formally, $rb \triangleq rf^{-1}; mo \setminus [E]$. The $\setminus [E]$ part is needed so that RMW events (read-modify-write, induced by atomic update operations like fetch-and-add and compare-and-swap) do not read-before themselves. For example, in Fig. 1, we have $\langle m, p \rangle \in rb$ and $\langle o, k \rangle \in rb$.

Consistent C11 executions require that hb is irreflexive (equivalently, $sb \cup sw$ is acyclic), and further guarantee coherence (aka SC-per-location) and atomicity of RMWs. Roughly speaking, coherence ensures that (i) the order of writes to the same location according to mo does not contradict hb (COHERENCE-WW); (ii) reads do not read values written in the future (NO-FUTURE-READ and COHERENCE-RW); (iii) reads do not read overwritten values (COHERENCE-WR); and (iv) two hb-related reads from the same location cannot read from two writes in reversed mo-order (COHERENCE-RR). We refer the reader to Prop. 1 in §3 for a formal definition of coherence.

Now, to give semantics to SC atomics, C11 stipulates that in consistent executions, there should be a strict total order, S, over all SC events, intuitively corresponding to the order in which these events are executed. This order is required to satisfy a number of conditions (but see Remark 1 below), where $E^{sc}$ denotes the set of all SC events in E:

(S1) S must include hb restricted to SC events (formally: $[E^{sc}]; hb; [E^{sc}] \subseteq S$);
(S2) S must include mo restricted to SC events (formally: $[E^{sc}]; mo; [E^{sc}] \subseteq S$);
(S3) S must include rb restricted to SC events (formally: $[E^{sc}]; rb; [E^{sc}] \subseteq S$);
(S4-7) S must obey a few more conditions having to do with SC fences.

Remark 1. The S3 condition above, due to Batty et al. [5], is slightly simpler and stronger than the one imposed by the official C11. Crucially, however, all the problems and counterexamples we observe in this section, concerning the C11 semantics of SC atomics, hold for both Batty et al.'s model and the original C11. The reason we use Batty et al.'s version here is that it provides a cleaner starting point for our discussion, and our solution to the problems with C11's SC semantics will build on it.

Intuitively, the effect of the above conditions is to enforce that, since S corresponds to the order in which SC events are executed, it should agree with the other global orders of events: hb, mo, and rb. However, as we will see shortly, condition S1 is too strong. Before we get there, let us first look at a few examples to illustrate how the conditions on S interact to enforce sequential consistency.

² The actual definition of sw contains further cases, which are not relevant for the current discussion. These are included in our formal model in §3.
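The derived relations and conditions S1 to S3 can be checked mechanically on a small execution graph. The following is a minimal sketch, not the authors' tooling: it hard-codes a six-event execution in which one thread performs k: Wsc(x, 1) followed by l: Rsc(y, _), and another performs m: Wsc(y, 1) followed by n: Rsc(x, 0). It computes hb and rb from sb, rf, mo, and sw as defined above, and tests whether hb, mo, and rb restricted to SC events are acyclic, which is equivalent to the existence of a total order S satisfying S1 to S3. The event numbering and all function names are assumptions of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Event indices (an assumption of this sketch):
// 0: Wna(x,0)  1: Wna(y,0)  2: k=Wsc(x,1)  3: l=Rsc(y,_)  4: m=Wsc(y,1)  5: n=Rsc(x,0)
constexpr std::size_t N = 6;
using Rel = std::array<std::array<bool, N>, N>;  // r[a][b] means <a,b> is in r

Rel unite(const Rel& r, const Rel& s) {
    Rel out{};
    for (std::size_t a = 0; a < N; ++a)
        for (std::size_t b = 0; b < N; ++b) out[a][b] = r[a][b] || s[a][b];
    return out;
}

// Left composition r; s: <a,c> whenever <a,b> is in r and <b,c> is in s.
Rel compose(const Rel& r, const Rel& s) {
    Rel out{};
    for (std::size_t a = 0; a < N; ++a)
        for (std::size_t b = 0; b < N; ++b)
            if (r[a][b])
                for (std::size_t c = 0; c < N; ++c)
                    if (s[b][c]) out[a][c] = true;
    return out;
}

Rel inverse(const Rel& r) {
    Rel out{};
    for (std::size_t a = 0; a < N; ++a)
        for (std::size_t b = 0; b < N; ++b) out[b][a] = r[a][b];
    return out;
}

// Transitive closure R+ (Warshall's algorithm).
Rel plus(Rel r) {
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t a = 0; a < N; ++a)
            for (std::size_t b = 0; b < N; ++b)
                if (r[a][k] && r[k][b]) r[a][b] = true;
    return r;
}

// A relation is acyclic iff its transitive closure is irreflexive.
bool acyclic(const Rel& r) {
    const Rel c = plus(r);
    for (std::size_t a = 0; a < N; ++a)
        if (c[a][a]) return false;
    return true;
}

// Returns true iff [Esc]; (hb, mo, rb union); [Esc] is acyclic.
// If l_reads_init is true, l reads y = 0 from the initialization write;
// otherwise l reads y = 1 from m.
bool sc_conditions_hold(bool l_reads_init) {
    Rel sb{}, rf{}, mo{}, sw{};
    for (std::size_t i = 0; i < 2; ++i)          // initialization writes are
        for (std::size_t e = 2; e < N; ++e)      // sb-before all other events
            sb[i][e] = true;
    sb[2][3] = true;  // k before l
    sb[4][5] = true;  // m before n
    mo[0][2] = true;  // x: initial write, then k
    mo[1][4] = true;  // y: initial write, then m
    rf[0][5] = true;  // n reads x = 0
    if (l_reads_init) {
        rf[1][3] = true;  // l reads y = 0
    } else {
        rf[4][3] = true;  // l reads y = 1 from m
        sw[4][3] = true;  // SC write read by SC read: synchronizes-with
    }
    const Rel hb = plus(unite(sb, sw));
    Rel rb = compose(inverse(rf), mo);
    for (std::size_t a = 0; a < N; ++a) rb[a][a] = false;  // the "\ [E]" part
    Rel all = unite(hb, unite(mo, rb));
    for (std::size_t a = 0; a < N; ++a)          // restrict to SC events 2..5
        for (std::size_t b = 0; b < N; ++b)
            if (a < 2 || b < 2) all[a][b] = false;
    return acyclic(all);
}
```

With `l_reads_init == true`, both reads return 0 and the check fails because of the cycle k, l, m, n, k (two hb edges and two rb edges); with `l_reads_init == false`, no cycle exists and the check succeeds.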

[Figure 2: two execution graphs with mo, rf, and rb edges. First graph (SB): Wna(x, 0), Wna(y, 0), k: Wsc(x, 1), l: Rsc(y, 0), m: Wsc(y, 1), n: Rsc(x, 0). Second graph (2+2W): Wna(x, 0), Wna(y, 0), k: Wsc(x, 1), l: Wsc(y, 2), m: Rrlx(y, 1), n: Wsc(y, 1), o: Wsc(x, 2), p: Rrlx(x, 1).]
Figure 2. Inconsistent C11 executions of SB and 2+2W.

Consider the classic store buffering litmus test:

    x :=_sc 1        ||  y :=_sc 1
    a := y_sc  //0   ||  b := x_sc  //0      (SB)

Here, the annotated behavior is forbidden by C11. To see this, consider the first execution graph in Fig. 2. The rf edges are forced because of the values read, while the mo edges are forced because of COHERENCE-WW. Then, S(k, l) and S(m, n) hold because of condition S1; while S(l, m) and S(n, k) hold because of condition S3. This entails a cycle in S, which is disallowed.

Similarly, C11's conditions guarantee that the following (variant given in [32] of the) 2+2W litmus test disallows the annotated weak behavior:

    x :=_sc 1         ||  y :=_sc 1
    y :=_sc 2         ||  x :=_sc 2
    a := y_rlx  //1   ||  b := x_rlx  //1      (2+2W)

To see this, consider the second execution graph in Fig. 2, which has the outcome a = b = 1: the rf and mo edges are forced because of the values read and COHERENCE-WR. Now, S(k, l) and S(n, o) hold because of condition S1; while S(l, n) and S(o, k) hold because of condition S2. Again, this entails a cycle in S.

Let us now move to the IRIW-acq-sc program from the introduction, whose annotated behavior is also forbidden by C11. To see that, suppose without loss of generality that S(p, k) in Fig. 1. We also know that S(k, m) because of happens-before via l (S1). Thus, by transitivity, S(p, m). However, if the second thread reads y = 0, then m reads-before p, in which case S(m, p) (S3), and S has a cycle.

2.1 First Problem: Compilation to Power is Broken

The IRIW-acq-sc example demonstrates that the trailing sync compilation to Power is unsound for the C11 model. We will now see an example showing that the leading sync compilation is also unsound. Consider the following behavior, where all variables are zero-initialized and FAI(y) represents an atomic fetch-and-increment of y returning its value before the increment:

    x :=_sc 1   ||  b := FAI(y)_sc  //1  ||  y :=_sc 3
    y :=_rel 1  ||  c := y_rlx  //3      ||  a := x_sc  //0      (Z6.U)

We will show that the behavior is disallowed according to C11, but allowed by its compilation to Power.

[Figure 3: execution graph with mo, rf, sw, and rb edges over the events Wna(x, 0), k: Wsc(x, 1), l: Wrel(y, 1), m: RMWsc(y, 1, 2), n: Rrlx(y, 3), o: Wsc(y, 3), p: Rsc(x, 0).]
Figure 3. A C11 execution of Z6.U. The initialization of y is omitted as it is not relevant.

Fig. 3 depicts the only execution yielding the behavior in question that satisfies the coherence constraints. Again, the rf and mo edges are forced: even if all accesses in the program were relaxed atomic, they would have to go this way. S(k, m) holds because of condition S1 (k happens-before l, which happens-before m); S(m, o) holds because of condition S2 (m precedes o in modification order); S(o, p) holds because of condition S1 (o happens-before p). Finally, since p reads x = 0, we have that p reads-before k, so by S3, S(p, k), thus forming a cycle in S.

Under the leading sync compilation to Power, however, the behavior is allowed. Intuitively, all but one of the sync fences introduced for the SC accesses are useless because they are at the beginning of a thread. In the absence of other sync fences, the only remaining sync fence, due to the a := x_sc load in the last thread, is equivalent to an lwsync fence (cf. [17, 7]).

In [3] we provide a similar example using SC fences instead of RMW instructions, which shows that even placing sync fences both before and after SC accesses is unsound.

What Went Wrong and How to Fix It. Generally, in order to provide coherence, hardware memory models provide rather strong ordering guarantees on accesses to the same memory location. Consequently, for conditions S2 and S3, which only enforce orderings between accesses to the same location, ensuring that compilation preserves these conditions is not difficult, even for weaker architectures like Power and ARM.

When, however, it comes to ensuring a strong ordering between accesses of different memory locations, as S1 does, compiling to weaker hardware requires the insertion of appropriate memory fence instructions. In particular, for Power, to enforce a strong ordering between two hb-related accesses to different locations, there should be a Power sync fence occurring somewhere in the hb-path (the sequence of sb and sw edges) connecting the two accesses. Unfortunately, in the presence of mixed SC and non-SC accesses, the Power compilation schemes do not always ensure that a sync exists between hb-related SC accesses. Specifically, if we follow the trailing sync convention, the hb-path (in Fig. 1) from k to m starting with an sw edge avoids the sync fence placed after k. Conversely, if we follow the leading sync convention, the hb-path (in Fig. 3) from k to m ending with an sw edge avoids the fence placed before m. The result is that S1 enforces more ordering than the hardware provides!
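The store buffering test discussed above can be written directly with C++11 atomics. The sketch below is an illustration only (the function names are assumptions): it runs SB with SC accesses repeatedly and counts how often the forbidden outcome a = b = 0 shows up. Because the program uses only SC accesses and does not mix access modes on a location, the count should be zero on any conforming implementation. A run of this kind can only fail to observe a weak behavior; it does not prove its absence.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the SB litmus test once with SC accesses. Returns true if the
// outcome a == 0 && b == 0, which C11 forbids, was observed.
bool sb_weak_outcome_once() {
    std::atomic<int> x{0}, y{0};
    int a = -1, b = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);  // x :=_sc 1
        a = y.load(std::memory_order_seq_cst);  // a := y_sc
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);  // y :=_sc 1
        b = x.load(std::memory_order_seq_cst);  // b := x_sc
    });
    t1.join();
    t2.join();
    return a == 0 && b == 0;
}

// Repeats the test and returns how many runs showed the weak outcome.
int count_sb_weak_outcomes(int runs) {
    int weak = 0;
    for (int i = 0; i < runs; ++i) {
        if (sb_weak_outcome_once()) ++weak;
    }
    return weak;
}
```

Changing all four accesses to `std::memory_order_relaxed` makes the weak outcome permitted by the model, and it is commonly observable on x86-TSO and weaker hardware.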

So, if requiring that hb (on SC events) be included in S is too strong a condition, what should we require instead? The essential insight is that, according to either compilation scheme, we know that a sync fence will necessarily exist between SC accesses a and b if the hb path from a to b starts and ends with an sb edge. Second, if a and b access the same location, then the hardware will preserve the ordering anyway. These two observations lead us to replace condition S1 with the following:

(S1fix) S must relate any two SC events that are related by hb, provided that the hb-path between the two events either starts and ends with sb edges, or starts and ends with accesses to the same location (formally: $[E^{sc}]; (sb \cup sb; hb; sb \cup hb|_{loc}); [E^{sc}] \subseteq S$, where $hb|_{loc}$ denotes hb edges between accesses to the same location).

We note that condition S1fix, although weaker than S1, suffices to rule out the weak behaviors of the basic litmus tests (i.e., SB and 2+2W). In fact, just to rule out these behaviors, it suffices to require sb (on SC events) to be included in S.

In essence, according to S1fix, S must include all the hb-paths between SC accesses to different locations that exist regardless of any synchronization induced by the SC accesses at their endpoints. If a program does not mix SC and non-SC accesses to the same location, then every minimal hb-path between two SC accesses to the same location (i.e., one which does not go through another SC access) must start and end with an sb edge, in which case S1 and S1fix coincide.

Fixing the Model. Before formalizing our fix, let us first rephrase conditions S1-S3 in the more concise style suggested by Batty et al. [5]. Instead of expressing them as separate conditions on a total order S, they require a single acyclicity condition, namely that $[E^{sc}]; (hb \cup mo \cup rb); [E^{sc}]$ be acyclic. (In general, acyclicity of $\bigcup R_i$ is equivalent to the existence of a total order that contains $R_1$, $R_2$, ...)

We propose to correct the condition by replacing hb with $sb \cup sb; hb; sb \cup hb|_{loc}$. Accordingly, we require that

$$[E^{sc}]; (sb \cup sb; hb; sb \cup hb|_{loc} \cup mo \cup rb); [E^{sc}]$$

is acyclic. Note that this condition still ensures SC semantics for programs that have only SC accesses. Indeed, since $[E^{sc}]; rf; [E^{sc}] \subseteq [E^{sc}]; sw; [E^{sc}] \subseteq [E^{sc}]; hb|_{loc}; [E^{sc}]$, our condition implies acyclicity of $[E^{sc}]; (sb \cup rf \cup mo \cup rb); [E^{sc}]$. The latter suffices for this purpose, as it corresponds exactly to the declarative definition of sequential consistency [28].

2.1.1 Enabling Elimination of SC Accesses

We observe that our condition disallows the elimination of an SC write immediately followed by another SC write to the same location, as well as of an SC read immediately preceded by an SC read from the same location. While neither GCC nor LLVM performs these eliminations, they are sound under sequential consistency, as well as under C11 (with the fixes of [30]), and one may wish to preserve their soundness.

[Figure 4: two execution graphs with rb and rf edges. Source: k: Racq(x, 2), l: Rsc(y, 0), m: Wsc(x, 1), n: Wsc(x, 2), o: Wsc(y, 1), p: Rsc(x, 0). Target: the same events without m.]
Figure 4. An abbreviated execution of WWmerge (source), and of the resulting program after eliminating the overwritten write m (target). The source execution has a disallowed cycle (m, l, o, p, m), while the target execution does not.

To see the unsoundness of eliminating an overwritten SC write, consider the following program. The annotated behavior is forbidden, but it will become allowed after eliminating x :=_sc 1 (see Fig. 4).

    a := x_acq  //2  ||  x :=_sc 1  ||  y :=_sc 1
    b := y_sc   //0  ||  x :=_sc 2  ||  c := x_sc  //0      (WWmerge)

Similarly, eliminating a repeated SC read is unsound (see example in [3]). The problem here is that these transformations remove an sb edge, and thus remove an sb; hb; sb path between two SC accesses.

Note that the removed sb edges are all edges between same-location accesses. Thus, supporting these transformations can be achieved by a slight weakening of our condition: we replace $sb; hb; sb$ with $sb|_{\neq loc}; hb; sb|_{\neq loc}$, where $sb|_{\neq loc}$ denotes sb edges that are not between accesses to the same location. Thus, we require acyclicity of $[E^{sc}]; scb; [E^{sc}]$, where scb (SC-before) is given by:

$$scb \triangleq sb \cup sb|_{\neq loc}; hb; sb|_{\neq loc} \cup hb|_{loc} \cup mo \cup rb.$$

We note that this change does not affect programs that do not mix SC and non-SC accesses to the same location.

2.2 Second Problem: SC Fences are Too Weak

In this section we extend our model to cover SC fences, which were not considered so far. Denote by $F^{sc}$ the set of SC fences in E. The straightforward adaptation of the condition of Batty et al. [5] for the full model (obtained by replacing $hb \cup mo \cup rb$ with our scb) is that

$$psc_1 \triangleq ([E^{sc}] \cup [F^{sc}]; sb^?); scb; ([E^{sc}] \cup sb^?; [F^{sc}])$$

is acyclic. This condition generalizes the earlier condition by forbidding scb cycles even between non-SC accesses provided they are preceded/followed by an SC fence. This condition rules out weak behaviors of examples such as SB and 2+2W where all accesses are relaxed and SC fences are placed between them in the two threads.

In general, one might expect that inserting an SC fence between every two instructions restores sequential consistency. This holds for hardware memory models, such as x86-TSO,

Power, and ARM, for programs with aligned word-sized accesses (for their analogue of SC fences), but holds neither in the original C11 model nor in its strengthening [5] for two reasons. The first reason is that C11 declares that programs with racy non-atomic accesses have undefined behavior, and even if fences are placed everywhere such races may exist. There is, however, another way in which putting fences everywhere in C11 does not restore sequential consistency, even if all the accesses are atomic. Consider the following program:

    x :=_rlx 1  ||  a := x_rlx  //1  ||  y :=_rlx 1
                ||  fence_sc         ||  fence_sc
                ||  b := y_rlx  //0  ||  c := x_rlx  //0      (RWC+syncs)

The annotated behavior is allowed according to the model of Batty et al. [5] (and so, also by our weaker condition above). Fig. 5 depicts a consistent execution yielding this behavior, as the only $psc_1$ edge is from f1 to f2. Yet, this behavior is disallowed by all implementations of C11. We believe that this is a serious omission of the standard rendering the SC fences too weak, as they cannot be used to enforce sequential consistency. This weakness has also been observed in a C11 implementation of the Chase-Lev deque by Le et al. [21], who report that the weak semantics of SC fences in C11 requires them to unnecessarily strengthen the access modes of certain relaxed writes to SC. (In the context of the RWC+syncs, it would amount to making the write to x in the first thread into an SC write.)

[Figure 5: execution graph with rf and rb edges over the events k: Wrlx(x, 1), l: Rrlx(x, 1), f1: Fsc, m: Rrlx(y, 0), n: Wrlx(y, 1), f2: Fsc, o: Rrlx(x, 0).]
Figure 5. An execution of RWC+syncs yielding the annotated result. The rb edges are due to the reading from the omitted initialization events and the mo edges from those.

Remark 2 (Itanium). This particular weakness of the standard is attributed to Itanium, whose fences do not guarantee sequential consistency when inserted everywhere. While this would be a problem if C11 relaxed accesses were compiled to plain Itanium accesses, they actually have to be compiled to release/acquire Itanium accesses to guarantee read-read coherence. In this case, Itanium fences guarantee ordering. In fact, Itanium implementations provide multi-copy atomicity for release stores, and thus cannot yield the weak outcome of IRIW even without fences [14, §3.3.7.1].

Fixing the Semantics of SC Fences. Analyzing the execution of RWC+syncs, we note that there is a sb; rb; rf; sb path from f2 to f1, but this path does not contribute to $psc_1$. Although both rb and rf edges contribute to $psc_1$, their composition rb; rf does not.

To repair the model, we define the extended coherence order, $eco \triangleq (rf \cup mo \cup rb)^+$. This order includes the reads-from relation, rf, the modification order, mo, the reads-before relation, rb, and also all the compositions of these relations with one another (namely, all orders forced because of the coherence axioms). Then, we require that $psc_1 \cup [F^{sc}]; sb; eco; sb; [F^{sc}]$ is acyclic.

This stronger condition rules out the weak behavior of RWC+syncs because there are sb; eco; sb paths from one fence to another and vice versa (in one direction via the x accesses and in the other direction via the y accesses). Intuitively speaking, compilation remains correct with this stronger model since eco exists only between accesses to the same location, on which the hardware provides strong ordering guarantees.

Now it is easy to see that, given a program without non-atomic accesses, placing an SC fence between every two accesses guarantees SC. Indeed, by the definition of SC, it suffices to show that $eco \cup sb$ is acyclic. Consider an $eco \cup sb$ cycle. Since eco and sb are irreflexive and transitive, the cycle necessarily has the form $(eco; sb)^+$. Thus, between every two eco steps, there must be an SC fence. So in effect, we have a cycle in $eco; sb; [F^{sc}]; sb$, which can be regrouped to a cycle in $[F^{sc}]; sb; eco; sb; [F^{sc}]$, which is forbidden by our model.

Finally, one might further consider strengthening the model by including eco in scb (which is used to define $psc_1$), thereby ruling out the weak behavior of a variant of RWC+syncs using SC accesses instead of SC fences in threads 2 and 3. We note, however, that this strengthening is unsound for the default compilation scheme to x86-TSO (see Remark 4 in §4).

2.2.1 Restoring Fence Cumulativity

Consider the following variant of the store buffering program, where the write of x := 1 has been moved to another thread with a release-acquire synchronization.

    x :=_rlx 1  ||  a := z_acq  //1  ||  y :=_rlx 1
    z :=_rel 1  ||  fence_sc         ||  fence_sc
                ||  b := y_rlx  //0  ||  c := x_rlx  //0      (W+RWC)

The annotated behavior corresponds to the writes of x and y being observed in different orders by the reads, although SC fences have been used in the observer threads. This behavior is disallowed on x86, Power, and ARM because their fences are cumulative: the fences order not only the writes performed by the thread with the fence instruction, but also the writes of other threads that are observed by the thread in question [23].

[Figure 6: execution graph with sw, rf, and rb edges over the events Wrlx(x, 1), Wrel(z, 1), Racq(z, 1), f1: Fsc, Rrlx(y, 0), Wrlx(y, 1), f2: Fsc, Rrlx(x, 0).]
Figure 6. An abbreviated execution of W+RWC.

In contrast, the behavior is allowed by the model described thus far. Consider the execution shown in Fig. 6. While there

is a $sb; rb; sb$ path from $f_1$ to $f_2$, the only path from $f_2$ back to $f_1$ is $sb; rb; sb; sw; sb$ (or, more generally, $hb; rb; hb$), and so the execution is allowed.

To disallow such behaviors, we can replace $[F^{sc}]; sb$ and $sb; [F^{sc}]$ in the definitions above by $[F^{sc}]; hb$ and $hb; [F^{sc}]$.³ This leads us to our final condition that requires that $psc_{base} \cup psc_F$ is acyclic, where:

$$
\begin{aligned}
psc_{base} &\triangleq \big([E^{sc}] \cup [F^{sc}]; hb^?\big); scb; \big([E^{sc}] \cup hb^?; [F^{sc}]\big) \\
psc_F &\triangleq [F^{sc}]; (hb \cup hb; eco; hb); [F^{sc}]
\end{aligned}
$$

We note that $[F^{sc}]; psc_{base}; [F^{sc}] \subseteq psc_F$. Hence, in programs without SC accesses (but with SC fences) it suffices to require that $psc_F$ is acyclic.

³ To rule out only the cycle shown in Fig. 6, it would suffice to have replaced only the $sb$ to a fence by an $hb$. We can, however, also construct examples, where it is useful for the $sb$ from a fence to be replaced by $hb$.

2.3 A Final Problem: Out-of-Thin-Air Reads

The C11 memory model suffers from a major problem, known as the out-of-thin-air problem [31, 11]. Designed to allow efficient compilation and many optimization opportunities for relaxed accesses, the model happened to be too weak, admitting out-of-thin-air behaviors, which no implementation exhibits. The standard example is load buffering with some form of dependencies in both threads:

    a := x_rlx  //1         ||  b := y_rlx  //1
    if (a) y :=_rlx a       ||  if (b) x :=_rlx b          (LB+deps)

In this program, the formalized C11 model by Batty et al. [8] allows reading a = b = 1 even though the value 1 does not appear in the program. The reason is that the execution where both threads read and write the value 1 is consistent: each read reads from the write of the other thread. As one might expect, such behaviors are very problematic because they invalidate almost all forms of formal reasoning about programs. In particular, the example above demonstrates a violation of DRF-SC, the most basic guarantee that users of C11 were intended to assume: LB+deps has no races under sequential consistency, and yet has some non-SC behavior.

Fixing the model in a way that forbids all out-of-thin-air behaviors and still allows the most efficient compilation is beyond the scope of the current paper (see [16] for a possible solution). In this paper, we settle for a simpler solution of requiring $sb \cup rf$ to be acyclic. This is a relatively straightforward way to avoid the problem, although it carries some performance cost. Clearly, it rules out the weak behavior of LB+deps, but also of the following load-buffering program, which is nevertheless permitted by the Power and ARM architectures.

    a := x_rlx  //1     ||  b := y_rlx  //1
    y :=_rlx 1          ||  x :=_rlx 1                     (LB)

To correctly compile the stronger model to Power and ARM, one has to either introduce a fence between a relaxed atomic read and a subsequent relaxed atomic write or a forced dependency between every such pair of accesses [11]. The latter can be achieved by inserting a dummy control-dependent branch after every relaxed atomic read.

While the idea of strengthening C11 to require acyclicity of $sb \cup rf$ is well known [31, 11], we are not aware of any proof showing that the proposed compilation schemes of Boehm and Demsky [11] are correct, nor that DRF-SC holds under this assumption. The latter is essential for assessing our corrected model, as it is a key piece of evidence showing that our semantics for SC accesses is not overly weak.

Importantly, even in this stronger model, non-atomic accesses are compiled to plain machine loads and stores. This is what makes the compilation correctness proof highly non-trivial, as the hardware models allow certain $sb \cup rf$ cycles involving plain loads and stores. As a result, one has to rely on the catch-fire semantics (races on non-atomic accesses result in undefined behavior) for explaining behaviors that involve such cycles. A similar argument is needed for proving the correctness of non-atomic read-write reordering.

3. The Proposed Memory Model

In this section, we formally define our proposed corrected version of the C11 model, which we call RC11. Similar to C11, the RC11 model is given in a declarative style in three steps: we associate a set of graphs (called executions) to every program (§3.1), filter this set by imposing a consistency predicate (§3.2), and finally define the outcomes of a program based on the set of its consistent executions (§3.3). At the end of the section, we compare our model with C11 (§3.4).

Before we start, we introduce some further notation. Given a binary relation $R$, $dom(R)$ and $codom(R)$ denote its domain and codomain. Given a function $f$, $=_f$ denotes the set of $f$-equivalent pairs ($=_f \;\triangleq \{\langle a, b\rangle \mid f(a) = f(b)\}$), and $R|_f$ denotes the restriction of $R$ to $f$-equivalent pairs ($R|_f \triangleq R \;\cap =_f$). When $R$ is a strict partial order, $R|_{imm}$ denotes the set of all immediate $R$ edges, i.e., pairs $\langle a, b\rangle \in R$ such that for every $c$, $\langle c, b\rangle \in R$ implies $\langle c, a\rangle \in R^?$, and $\langle a, c\rangle \in R$ implies $\langle b, c\rangle \in R^?$.

We assume finite sets Loc and Val of locations and values. We use x, y, z as metavariables for locations and v for values. The model supports several modes for accesses and fences, partially ordered by $\sqsubset$ as follows:

                  rel
                ↗     ↘
    na → rlx            acqrel → sc
                ↘     ↗
                  acq

3.1 From Programs to Executions

First, the program is translated into a set of executions. An execution G consists of:

1. a finite set of events $E \subseteq \mathbb{N}$ containing a distinguished set $E_0 = \{a_0^x \mid x \in Loc\}$ of initialization events. We use a, b, ... as metavariables for events.

2. a function lab assigning a label to every event in E. Labels are of one of the following forms:
   - $R^o(x, v)$ where $o \in \{na, rlx, acq, sc\}$.
   - $W^o(x, v)$ where $o \in \{na, rlx, rel, sc\}$.
   - $F^o$ where $o \in \{acq, rel, acqrel, sc\}$.

   We assume that $lab(a_0^x) = W^{na}(x, 0)$ for every $a_0^x \in E_0$. lab naturally induces the functions typ, mod, loc, $val_r$, and $val_w$ that return (when applicable) the type (R, W or F), mode, location, and read/written value of an event. For $T \in \{R, W, F\}$, T denotes the set $\{e \in E \mid typ(e) = T\}$. We also concatenate the event sets notations, use subscripts to denote the accessed location, and superscripts for modes (e.g., $RW = R \cup W$ and $W_x^{\sqsupseteq rel}$ denotes all events $a \in W$ with $loc(a) = x$ and $mod(a) \sqsupseteq rel$).

3. a strict partial order $sb \subseteq E \times E$, called sequenced-before, which orders the initialization events before all other events, i.e., $E_0 \times (E \setminus E_0) \subseteq sb$.

4. a binary relation $rmw \subseteq [R]; (sb|_{imm} \;\cap =_{loc}); [W]$, called read-modify-write pairs, such that for every $\langle a, b\rangle \in rmw$, $\langle mod(a), mod(b)\rangle$ is one of the following:
   - $\langle rlx, rlx\rangle$ ($RMW^{rlx}$)
   - $\langle acq, rlx\rangle$ ($RMW^{acq}$)
   - $\langle rlx, rel\rangle$ ($RMW^{rel}$)
   - $\langle acq, rel\rangle$ ($RMW^{acqrel}$)
   - $\langle sc, sc\rangle$ ($RMW^{sc}$)

   We denote by At the set of all events in E that are a part of an rmw edge (that is, $At = dom(rmw) \cup codom(rmw)$). Note that our executions represent RMWs differently from C11 executions. Here each RMW is represented as two events, a read and a write, related by the rmw relation, whereas in C11 they are represented by single RMW events, which act as both the read and the write of the RMW. Our choice is in line with the Power and ARM memory models, and simplifies the formal development (e.g., the definition of receptiveness).

5. a binary relation $rf \subseteq [W]; =_{loc}; [R]$, called reads-from, satisfying (i) $val_w(a) = val_r(b)$ for every $\langle a, b\rangle \in rf$; and (ii) $a_1 = a_2$ whenever $\langle a_1, b\rangle, \langle a_2, b\rangle \in rf$.

6. a strict partial order mo on W, called modification order, which is a disjoint union of relations $\{mo_x\}_{x \in Loc}$, such that each $mo_x$ is a strict total order on $W_x$.

In what follows, to resolve ambiguities, we may include a prefix "G." to refer to the components of an execution G.

Executions of a given program represent prefixes of traces of shared memory accesses and fences that are generated by the program. In this paper, we only consider partitioned programs of the form $\|_{i \in Tid}\, c_i$, where Tid is a finite set of thread identifiers, $\|$ denotes parallel composition, and each $c_i$ is a sequential program. Then, the set of executions associated with a given program is defined by induction over the structure of sequential programs. We do not formally define this construction (it depends on the particular syntax and features of the source programming language). In this initial stage the read values are not restricted whatsoever (and rf and mo are arbitrary). Note that the set of executions of a program P is taken to be prefix-closed: an sb-prefix of an execution of P (which includes at least the initialization events) is also considered to be an execution of P. By full executions of P, we refer to executions that represent traces generated by the whole program P.

We show an example of an execution in Fig. 7. This is a full execution of the Z6.U program, and is essentially the same as the C11 execution shown in Fig. 3, except for the representation of RMWs (see Item 4 above).

Figure 7. An execution of Z6.U.

3.2 Consistent Executions

The main part of the memory model is filtering the consistent executions among all executions of the program. The first obvious restriction is that every read should read some written value (formally, $R \subseteq codom(rf)$). We refer to such executions as complete.

To state the other constraints we use a number of derived relations:

$$
\begin{aligned}
rb &\triangleq rf^{-1}; mo && \text{(reads-before)} \\
eco &\triangleq (rf \cup mo \cup rb)^+ && \text{(extended coherence order)} \\
rs &\triangleq [W]; sb|_{loc}^?; [W^{\sqsupseteq rlx}]; (rf; rmw)^* && \text{(release sequence)} \\
sw &\triangleq [E^{\sqsupseteq rel}]; ([F]; sb)^?; rs; rf; [R^{\sqsupseteq rlx}]; (sb; [F])^?; [E^{\sqsupseteq acq}] && \text{(synchronizes with)} \\
hb &\triangleq (sb \cup sw)^+ && \text{(happens-before)}
\end{aligned}
$$

The first two, rb and eco, are as described previously. Note that since the modification order, mo, is transitive, we have $eco = rf \cup (mo \cup rb); rf^?$ in every execution.

The other three relations, rs, sw and hb, are taken from [30]. Intuitively, hb records when an event is globally perceived as occurring before another one. It is defined in terms of two more basic relations. First, the release sequence (rs) of a write contains the write itself and all later atomic writes to the same location in the same thread, as well as all RMWs that recursively read from such writes. Next, a release event a synchronizes with (sw) an acquire event b, whenever b (or, in case b is a fence, some sb-prior read) reads from the release sequence of a (or in case a is a fence, of some sb-later write). Then, we say that an event a happens-before (hb) an event b if there is a path from a to b consisting of sb and sw edges.
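The informal description of rs, sw and hb can be made concrete with a small C++11 program. The sketch below is illustrative only and is not taken from the paper: the function name `release_sequence_handoff` and the variables `data`, `flag` and `seen` are hypothetical. A release write to `flag` is followed by a relaxed RMW in another thread, which reads from it and therefore belongs to its release sequence; an acquire read that reads from the RMW's write then synchronizes with the original release write, so the non-atomic accesses to `data` are ordered by hb and do not race.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (not from the paper): one round of a hand-off
// through a release sequence. Returns the payload seen by the reader.
int release_sequence_handoff() {
    int data = 0;                // non-atomic payload
    std::atomic<int> flag{0};    // synchronization variable
    int seen = -1;

    std::thread writer([&] {
        data = 42;                                  // W^na(data, 42)
        flag.store(1, std::memory_order_release);   // W^rel(flag, 1)
    });

    std::thread bumper([&] {
        // Relaxed RMW that reads the value 1 written by the release store
        // (rf; rmw), so its write extends that store's release sequence.
        int expected = 1;
        while (!flag.compare_exchange_strong(expected, 2,
                                             std::memory_order_relaxed)) {
            expected = 1;        // retry until the release write is visible
            std::this_thread::yield();
        }
    });

    std::thread reader([&] {
        // R^acq(flag, 2) reads from the RMW's write, which lies in the
        // release sequence of W^rel(flag, 1): the release write
        // synchronizes with this read.
        while (flag.load(std::memory_order_acquire) != 2) {
            std::this_thread::yield();
        }
        seen = data;             // R^na(data): hb-after the write of 42
    });

    writer.join();
    bumper.join();
    reader.join();
    assert(flag.load() == 2);    // the RMW has taken place
    return seen;
}
```

Calling `release_sequence_handoff()` is expected to return 42: the write of `data` is sb-before the release write, which synchronizes with the acquire read, which is sb-before the read of `data`. If the acquire load were weakened to relaxed, there would be no sw edge and the two accesses to `data` would form a race on a non-atomic location.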

Finally, we define the SC-before relation, scb, and the partial SC relations, $psc_{base}$ and $psc_F$, as follows:

$$
\begin{aligned}
sb|_{\neq loc} &\triangleq sb \setminus sb|_{loc} \\
scb &\triangleq sb \cup sb|_{\neq loc}; hb; sb|_{\neq loc} \cup hb|_{loc} \cup mo \cup rb \\
psc_{base} &\triangleq \big([E^{sc}] \cup [F^{sc}]; hb^?\big); scb; \big([E^{sc}] \cup hb^?; [F^{sc}]\big) \\
psc_F &\triangleq [F^{sc}]; (hb \cup hb; eco; hb); [F^{sc}] \\
psc &\triangleq psc_{base} \cup psc_F
\end{aligned}
$$

Using these derived relations, RC11 imposes four constraints on executions:

Definition 1. An execution G is called RC11-consistent if it is complete and the following hold:
- $hb; eco^?$ is irreflexive. (COHERENCE)
- $rmw \cap (rb; mo) = \emptyset$. (ATOMICITY)
- $psc$ is acyclic. (SC)
- $sb \cup rf$ is acyclic. (NO-THIN-AIR)

COHERENCE ensures that programs with only one shared location are sequentially consistent, as at least two locations are needed for a cycle in $sb \cup eco$. ATOMICITY ensures that the read and the write comprising a RMW are adjacent in eco: there is no write event in between. The SC condition is the main novelty of RC11 and is used to give semantics to SC accesses and fences. Finally, NO-THIN-AIR rules out thin-air behaviors, albeit at a performance cost, as we will see in §5.

3.3 Program Outcomes

Finally, in order to allow the compilation of non-atomic reads and writes to plain machine load and store instructions (as well as the compiler to reorder such accesses), RC11 follows the catch-fire approach: races on non-atomic accesses result in undefined behavior, that is, any outcome is allowed. Formally, it is defined as follows.

Definition 2. Two events a and b are called conflicting in an execution G if $a, b \in E$, $W \in \{typ(a), typ(b)\}$, $a \neq b$, and $loc(a) = loc(b)$. A pair $\langle a, b\rangle$ is called a race in G (denoted $\langle a, b\rangle \in race$) if a and b are conflicting events in G, and $\langle a, b\rangle \notin hb \cup hb^{-1}$.

Definition 3. An execution G is called racy if there is some $\langle a, b\rangle \in race$ with $na \in \{mod(a), mod(b)\}$. A program P has undefined behavior under RC11 if it has some racy RC11-consistent execution.

Definition 4. The outcome of an execution G is the function assigning to every location x the value written by the mo-maximal event in $W_x$. We say that $O : Loc \to Val$ is an outcome of a program P under RC11 if either O is an outcome of some RC11-consistent full execution of P, or P has undefined behavior under RC11.

3.4 Comparison with C11

Besides the new SC and NO-THIN-AIR conditions, RC11 differs in a few other ways from C11.
- It does not support consume accesses, a premature feature of C11 that is not implemented by major compilers, nor locks, as they can be straightforwardly implemented with release-acquire accesses.
- For simplicity, it assumes all locations are initialized.
- It incorporates the fixes proposed by Vafeiadis et al. [30], namely (i) the strengthening of the release sequences definition, (ii) the removal of restrictions about different threads in the definition of synchronization, and (iii) the lack of distinction between atomic and non-atomic locations (and accordingly omitting the problematic $rf \subseteq hb$ condition for non-atomic locations). The third fix avoids out-of-thin-air problems that arise when performing non-atomic accesses to atomic locations [6, 5].
- It does not consider unsequenced races between atomic accesses to have undefined behavior. Our results are not affected by such undefined behavior.

We have also made three presentational changes: (1) we have a much more concise axiomatization of coherence; (2) we model RMWs using two events; and (3) we do not have a total order over SC atomics.

Proposition 1. RC11's COHERENCE condition is equivalent to the conjunction of the following constraints of C11:
- $hb$ is irreflexive. (IRREFLEXIVE-HB)
- $rf; hb$ is irreflexive. (NO-FUTURE-READ)
- $mo; rf; hb$ is irreflexive. (COHERENCE-RW)
- $mo; hb$ is irreflexive. (COHERENCE-WW)
- $mo; hb; rf^{-1}$ is irreflexive. (COHERENCE-WR)
- $mo; rf; hb; rf^{-1}$ is irreflexive. (COHERENCE-RR)

Proposition 2. The SC condition is equivalent to requiring the existence of a total strict order S on $E^{sc}$ such that $S; psc$ is irreflexive.

Finally, the next proposition ensures that without mixing SC and non-SC accesses to the same location, RC11 supplies the stronger guarantee of C11. As a consequence, programmers that never mix such accesses may completely ignore the difference between RC11 and C11 regarding SC accesses.

Proposition 3. If SC accesses are to distinguished locations (for every $a, b \in E \setminus E_0$, if $mod(a) = sc$ and $loc(a) = loc(b)$ then $mod(b) = sc$) then $[E^{sc}]; hb; [E^{sc}] \subseteq psc^+$.

4. Compilation to x86-TSO

In this section, we present the x86-TSO memory model, and show that its intended compilation scheme is sound. We use a declarative model of x86-TSO from [17], that we denote by TSO. By [25, Theorem 3] and [17, Theorem 5], this definition is equivalent to the better known operational one. TSO executions are similar to the ones defined above, with the following exceptions:
- Read/write/fence labels have the form R(x, v), W(x, v), and F (they do not include a mode). In addition, labels

may also be $RMW(x, v_r, v_w)$, and executions do not include an rmw component (i.e., RMWs are represented with a single event). We use RMW to denote the set of all events $a \in E$ with $typ(a) = RMW$.
- The modification order, mo, is a strict total order on $W \cup RMW \cup F$ (rather than a union of total orders on writes to the same location).
- Happens-before is given by $hb \triangleq (sb \cup rf)^+$.
- Reads-before is given by $rb \triangleq rf^{-1}; mo|_{loc} \setminus [E]$.

Remark 3. Lahav et al. [17] treat fence instructions as syntactic sugar for RMWs of a distinguished location. Here, we have fences as primitive instructions that induce fence events in TSO executions.

Definition 5. A TSO execution G is TSO-consistent if it is complete and the following hold:
1. $hb$ is irreflexive.
2. $mo; hb$ is irreflexive.
3. $rb; hb$ is irreflexive.
4. $rb; mo$ is irreflexive.
5. $rb; mo; rfe; sb$ is irreflexive (where $rfe = rf \setminus sb$).
6. $rb; mo; [RMW \cup F]; sb$ is irreflexive.

Unlike RC11, well-formed TSO programs do not have undefined behavior. Thus, a function $O : Loc \to Val$ is an outcome of a TSO program P if it is an outcome of some TSO-consistent full execution of P (see Def. 4).

    $(|R|) \triangleq$ MOV (from memory)        $(|W^{\sqsubseteq rel}|) \triangleq$ MOV (to memory)
    $(|W^{sc}|) \triangleq$ MOV;MFENCE           $(|RMW|) \triangleq$ CMPXCHG
    $(|F^{\neq sc}|) \triangleq$ No operation    $(|F^{sc}|) \triangleq$ MFENCE

Figure 8. Compilation to TSO.

Fig. 8 presents the compilation scheme from C11 to x86-TSO that is implemented in the GCC and the LLVM compilers. Since TSO provides strong consistency guarantees, it allows most language primitives to be compiled to plain loads and stores. Barriers are only needed for the compilation of SC writes. Our next theorem says that this compilation scheme is also correct for RC11.

Theorem 1. For a program P, denote by $(|P|)$ the TSO program obtained by compiling P using the scheme in Fig. 8. Then, given a program P, every outcome of $(|P|)$ under TSO is an outcome of P under RC11.

Proof (Outline). We consider the compilation as if it happens in three steps, and prove the soundness of each step:
1. All non-atomic/relaxed accesses are strengthened to release/acquire ones, and all relaxed/release/acquire RMWs are strengthened to acquire-release ones. It is easy to see that this step does not introduce new outcomes (see §7).
2. All non-SC fences are removed. Due to the previous step, it is easy to show that non-SC fences have no effect.
3. The mappings in Fig. 8 are applied. The correctness of this step, given in [3], is established by showing that given a TSO-consistent TSO execution $G_t$ of $(|P|)$ (where P has no non-SC fences), there exists an RC11-consistent execution G of P that has the same outcome as $G_t$.

In fact, the proof of Thm. 1 establishes the correctness of compilation even for a strengthening of RC11 obtained by replacing the scb relation by $scb' \triangleq hb \cup mo \cup rb$. This entails that the original C11 model, as well as Batty et al.'s strengthening [5], are correctly compiled to x86-TSO. Additionally, the proof only assumes the existence of an MFENCE between every store originated from an SC write and load originated from an SC read. The compilation scheme in Fig. 8 achieves this by placing an MFENCE after each store that originated from an SC write. An alternative correct compilation scheme may place MFENCE before SC reads, rather than after SC writes [1]. (Since there are typically more SC reads than SC writes in programs, the latter scheme is less preferred.)

Remark 4. The compilation scheme that places MFENCE before SC reads can be shown to be sound even for a very strong SC condition that requires acyclicity of

$$
psc_{strong} = \big([E^{sc}] \cup [F^{sc}]; hb^?\big); (hb \cup eco); \big([E^{sc}] \cup hb^?; [F^{sc}]\big).
$$

To prove this (see [3]), we are able to follow a simpler approach utilizing the recent result of Lahav and Vafeiadis [19] that provides a characterization of TSO in terms of program transformations (or compiler optimizations). This allows one to reduce compilation correctness to soundness of certain transformations. The preferred compilation scheme to x86-TSO, which uses barriers after SC writes (see Fig. 8), is unsound if one requires acyclicity of $psc_{strong}$, or even if one requires acyclicity of $[E^{sc}]; (sb \cup eco); [E^{sc}]$. To see this, consider the following variant of SB:

    x :=_rel 1          ||  y :=_rel 1
    a := x_sc  //1      ||  c := y_sc  //1                 (SB+rfis)
    b := y_sc  //0      ||  d := x_sc  //0

Any execution of this program that yields the annotated behavior has a cycle in $[E^{sc}]; (sb \cup eco); [E^{sc}]$ (we have $rb; rf$ both from $R^{sc}(x, 0)$ to $R^{sc}(x, 1)$, and from $R^{sc}(y, 0)$ to $R^{sc}(y, 1)$). However, since the program has no SC writes, following Fig. 8, all accesses are compiled to plain accesses, and x86-TSO clearly allows this behavior.

5. Compilation to Power

In this section, we present the Power model and the mappings of language operations to Power instructions. We then prove the correctness of compilation from RC11 to Power.

As a model of the Power architecture, we use the recent declarative model by Alglave et al. [4], which we denote by Power. Its executions are similar to RC11's executions, with the following exceptions:

- Power executions track syntactic dependencies between events in the same thread, and derive a relation called preserved program order, denoted ppo, which is a subset of sb guaranteed to be preserved. The exact definition of ppo is quite intricate, and is included in [3].
- Read/write labels have the form R(x, v) and W(x, v) (they do not include a mode). Power has two types of fence events: a lightweight fence and a full fence. We denote by $F^{lwsync}$ and $F^{sync}$ the set of all lightweight fence and full fence events in a Power execution. Power's instruction fence (isync) is used to derive ppo but is not recorded in executions.

In addition to ppo, the following additional derived relations are needed to define Power-consistency (see [4] for further explanations and details).

$$
\begin{aligned}
sync &\triangleq [RW]; sb; [F^{sync}]; sb; [RW] \\
lwsync &\triangleq [RW]; sb; [F^{lwsync}]; sb; [RW] \setminus (W \times R) \\
fence &\triangleq sync \cup lwsync && \text{(fence order)} \\
hb &\triangleq ppo \cup fence \cup rfe && \text{(Power's happens-before)} \\
prop_1 &\triangleq [W]; rfe^?; fence; hb^*; [W] \\
prop_2 &\triangleq (moe \cup rbe)^?; rfe^?; (fence; hb^*)^?; sync; hb^* \\
prop &\triangleq prop_1 \cup prop_2 && \text{(propagation relation)}
\end{aligned}
$$

where for every relation c (e.g., rf, mo, etc.), we denote by ce its thread-external restriction. Formally, $ce = c \setminus sb$.

Definition 6. A Power execution G is Power-consistent if it is complete and the following hold:
1. $sb|_{loc} \cup rf \cup rb \cup mo$ is acyclic. (SC-PER-LOC)
2. $rbe; prop; hb^*$ is irreflexive. (OBSERVATION)
3. $mo \cup prop$ is acyclic. (PROPAGATION)
4. $rmw \cap (rbe; moe)$ is irreflexive. (POWER-ATOMICITY)
5. $hb$ is acyclic. (POWER-NO-THIN-AIR)

Remark 5. The model in [4] contains an additional constraint: $mo \cup [At]; sb; [At]$ should be acyclic (recall that $At = dom(rmw) \cup codom(rmw)$). Since none of our proofs requires this property, we excluded it from Def. 6.

Like in the case of TSO, we say that a function $O : Loc \to Val$ is an outcome of a Power program P if it is an outcome of some Power-consistent full execution of P (see Def. 4).

    $(|R^{na}|) \triangleq$ ld                       $(|W^{na}|) \triangleq$ st
    $(|R^{rlx}|) \triangleq$ ld;cmp;bc               $(|W^{rlx}|) \triangleq$ st
    $(|R^{acq}|) \triangleq$ ld;cmp;bc;isync         $(|W^{rel}|) \triangleq$ lwsync;st
    $(|F^{\neq sc}|) \triangleq$ lwsync              $(|F^{sc}|) \triangleq$ sync
    $(|RMW^{rlx}|) \triangleq$ L:lwarx;cmp;bc Le;stwcx.;bc L;Le:
    $(|RMW^{acq}|) \triangleq (|RMW^{rlx}|)$;isync
    $(|RMW^{rel}|) \triangleq$ lwsync;$(|RMW^{rlx}|)$
    $(|RMW^{acqrel}|) \triangleq$ lwsync;$(|RMW^{rlx}|)$;isync

Figure 9. Compilation of non-SC primitives to Power.

    Leading sync                                  Trailing sync
    $(|R^{sc}|) \triangleq$ sync;$(|R^{acq}|)$       $(|R^{sc}|) \triangleq$ ld;sync
    $(|W^{sc}|) \triangleq$ sync;st                  $(|W^{sc}|) \triangleq (|W^{rel}|)$;sync
    $(|RMW^{sc}|) \triangleq$ sync;$(|RMW^{acq}|)$   $(|RMW^{sc}|) \triangleq (|RMW^{rel}|)$;sync

Figure 10. Compilations of SC accesses to Power.

As already mentioned, the two compilation schemes from C11 to Power that have been proposed in the literature [1] differ only in the mappings used for SC accesses (see Fig. 10). The first scheme follows the leading sync convention, and places a sync fence before each SC access. The alternative scheme follows the trailing sync convention, and places a sync fence after each SC access. Importantly, the same scheme should be used for all SC accesses in the program, since mixing the schemes is unsound. The mappings for the non-SC accesses and fences are common to both schemes and are shown in Fig. 9. Note that our compilation of relaxed reads is stronger than the one proposed for C11 (see §2.3).

Our main theorem says that the compilation schemes are correct.

Theorem 2. For a program P, denote by $(|P|)$ the Power program obtained by compiling P using the scheme in Fig. 9 and either of the schemes in Fig. 10 for SC accesses. Then, given a program P, every outcome of $(|P|)$ under Power is an outcome of P under RC11.

Proof (Outline). The main idea is to consider the compilation as if it happens in three steps, and prove the soundness of each step:
1. Leading sync: Each $R^{sc}$/$W^{sc}$/$RMW^{sc}$ in P is replaced by $F^{sc}$ followed by $R^{acq}$/$W^{rel}$/$RMW^{acqrel}$.
   Trailing sync: Each $R^{sc}$/$W^{sc}$/$RMW^{sc}$ in P is replaced by $R^{acq}$/$W^{rel}$/$RMW^{acqrel}$ followed by $F^{sc}$.
2. The mappings in Fig. 9 are applied.
3. Leading sync: Pairs of the form sync;lwsync that originated from $W^{sc}$/$RMW^{sc}$ are reduced to sync (eliminating the redundant lwsync).
   Trailing sync: Any cmp;bc;isync;sync sequences originated from $R^{sc}$/$RMW^{sc}$ are reduced to sync (eliminating the redundant cmp;bc;isync).

The resulting Power program is clearly identical to the one obtained by applying the mappings in Figures 9 and 10. The soundness for each step (that is, none of them introduces additional outcomes) is established in [3].

The main difficulty (and novelty of our proof) lies in proving soundness of the second step, and more specifically in establishing the NO-THIN-AIR condition. Since Power, unlike RC11, does not generally forbid $sb \cup rf$ cycles, we have to show that such cycles can be untangled to produce a racy RC11-consistent execution, witnessing the undefined behavior. Here, the idea is, similar to DRF-SC proofs, to detect a first rf edge that closes an $sb \cup rf$ cycle, and replace

HH Y
H Roy2 Woy2 RMWoy2 Fo2
X HH
Rox1 o1 v rlx o1 , o2 v rlx (o1 = na o2 = na) o1 = na o2 v acq o1 6= rlx o2 = acq
Wox1 6 sc o2 6= sc
o1 = o2 v rlx o2 v acq o2 = acq
RMWox1 o1 v rel o1 v rel o2 = na o1 w acq o2 = acq
Fo1 o1 = rel o1 = rel o2 6= rlx o1 = rel o2 w rel o1 = rel o2 = acq
Table 1. Deorderable pairs of accesses/fences (x and y are distinct locations).

it by a different rf edge that avoids the cycle. This is highly non-trivial because it is unclear how to define a "first" rf edge when $sb \cup rf$ is cyclic. To solve this problem, we came up with a different ordering of events, which does not include all sb edges, and which Power ensures to be acyclic (a relation we call Power-before in [3]).

For completeness, we also show that the conditional branch after the relaxed read is only necessary if we care about enforcing the NO-THIN-AIR condition. That is, let weakRC11 be the model obtained from RC11 by omitting the NO-THIN-AIR condition, and denote by $(|P|)_{weak}$ the Power program obtained by compiling P as above, except that relaxed reads are compiled to plain loads (again, with either leading or trailing syncs for SC accesses). Then, this scheme is correct with respect to the weakRC11 model.

Theorem 3 (Compilation of weakRC11 to Power). Given a program P, every outcome of $(|P|)_{weak}$ under Power is an outcome of P under weakRC11.

Finally, we note that it is also possible to use a lightweight fence (lwsync) instead of a fake control dependency and an instruction fence (isync) in the compilation of (all or some) acquire accesses.

6. Compilation to ARMv7

The ARMv7 model [4] is very similar to the Power model just presented in §5. There are only two differences.

First, while ARMv7 has analogues for Power's strong fence and instruction fence (dmb for sync, and isb for isync), it lacks an analogue for Power's lightweight fence (lwsync). Thus, on ARMv7 we have $F^{lwsync} = \emptyset$ and so $fence = sync$.

The second difference is that ARMv7 has a somewhat weaker preserved program order, ppo, than Power, which in particular does not always include $[R]; sb|_{loc}; [W]$ (following the model in [4]). In our Power compilation proofs, however, we never rely on this property of Power's ppo (see [3]).

The compilation schemes to ARMv7 are essentially the same as those to Power, substituting the corresponding ARMv7 instructions for the Power ones: dmb instead of sync and lwsync, and isb instead of isync. The soundness of compilation to ARMv7 follows directly from Theorems 2 and 3.

We note that neither GCC (version 5.4) nor LLVM (version 3.9) maps acquire reads into ld;cmp;bc;isb. Instead, they emit ld;dmb (that corresponds to Power's ld;sync). With this stronger compilation scheme, there is no correctness problem in compilation of C11 to ARMv7. Nevertheless, if one intends to use isb's, the same correctness issue arises (e.g., the one in Fig. 1), and RC11 overcomes this issue.

7. Correctness of Program Transformations

In this section, we list program transformations that are sound in RC11, and prove that this is the case. As in [30], to have a simple presentation, all of our arguments are performed at the semantic level, as if the transformations were applied to events in an execution. Thus, to prove soundness of a program transformation $P_{src} \rightsquigarrow P_{tgt}$, we are given an arbitrary RC11-consistent execution $G_{tgt}$ of $P_{tgt}$, and construct a RC11-consistent execution $G_{src}$ of $P_{src}$, such that either $G_{src}$ and $G_{tgt}$ have the same outcome or $G_{src}$ is racy. In the former case, we show that $G_{tgt}$ is racy only if $G_{src}$ is. Consequently, one obtains that every outcome of $P_{tgt}$ under RC11 is also an outcome of $P_{src}$ under RC11.

The soundness proofs (sketched in [3]) are mostly similar to the proofs in [30], with the main difference concerning the new SC condition.

Strengthening  Strengthening transforms the mode $o$ of an event in the source into $o'$ in the target where $o \sqsubseteq o'$. Soundness of this transformation is trivial, because RC11-consistency is monotone with respect to the mode ordering.

Sequentialization  Sequentialization merges two program threads into one, by interleaving their events in sb. Essentially sequentialization just adds edges to the sb relation. Its soundness trivially follows from the monotonicity of RC11-consistency with respect to sb.

Deordering  Table 1 defines the deorderable pairs, for which we proved the soundness of the transformation $X; Y \rightsquigarrow X \,\|\, Y$ in RC11. (Note that reordering is obtained by applying deordering and sequentialization.) Generally speaking, RC11 supports all reorderings that are intended to be sound in C11 [30], except for load-store reorderings of relaxed accesses, which are unsound in RC11 due to the conservative NO-THIN-AIR condition (if one omits this condition, these reorderings are sound). Importantly, load-store reorderings of non-atomic accesses are sound due to the catch-fire semantics. The soundness of these reorderings (in the presence of NO-THIN-AIR) was left open in [30], and requires a non-trivial argument of the same nature as the one

Ro ; Ro Ro Wo ; Wo Wo Theorem 4. If in all SC-consistent executions of a program
Wsc ; Rsc Wsc Wo ; Racq Wo P , every race ha, bi has mod(a) = mod(b) = sc, then the
RMWo ; Ror RMWo RMWo ; RMWo RMWo outcomes of P under RC11 coincide with those under SC.
Wow ; RMWo Wow Fo ; Fo Fo
Figure 11. Mergeable pairs (assuming both accesses are to the same location). $o_r$ denotes the
maximal mode in $\{rlx, acq, sc\}$ satisfying $o_r \sqsubseteq o$; and $o_w$ denotes the maximal
mode in $\{rlx, rel, sc\}$ satisfying $o_w \sqsubseteq o$.

used to show NO-THIN-AIR in the compilation correctness proof.

Merging  Merges are transformations of the form $X; Y \leadsto Z$, eliminating one memory access
or fence. Fig. 11 defines the set of mergeable pairs. Note that using strengthening, the modes
mentioned in Fig. 11 are upper bounds (e.g., $R_x^{acq}; R_x^{rlx}$ can be first strengthened to
$R_x^{acq}; R_x^{acq}$ and then merged). Generally speaking, RC11 supports all mergings that are
intended to be mergeable in C11 [30].

Remark 6. The elimination of redundant read-after-write allows the write to be non-atomic.
Nevertheless, an SC read cannot be eliminated in this case, unless it follows an SC write. Indeed,
eliminating an SC read after a non-SC write is unsound in RC11. We note that the effectiveness of
this optimization seems to be low, and, in fact, it is already unsound for the model in [5] (see
[3] for a counterexample). Note also that read-after-RMW elimination does not allow the read to be
an acquire read unless the update includes an acquire read (unlike read-after-write). This is due
to release sequences: eliminating an acquire read after a relaxed update may remove the
synchronization due to a release sequence ending in this update.

Register Promotion  Finally, register promotion is sound in RC11. This global program
transformation replaces all the accesses to a memory location by those to a register, provided
that the location is used by only one thread. At the execution level, all accesses to a particular
location are removed from the execution, provided that they are all sb-related.

8. Programming Guarantees
In this section, we demonstrate that our semantics for SC atomics (i.e., the SC condition in
Def. 1) is not overly weak. We do so by proving theorems stating that programmers who follow
certain defensive programming patterns can be assured that their programs exhibit no weak
behaviors. The first such theorem is DRF-SC, which says that if a program has no races on non-SC
accesses under SC semantics, then its outcomes under RC11 coincide with those under SC.
In our proofs we use the standard declarative definition of SC: an execution is SC-consistent if
it is complete, satisfies ATOMICITY, and $sb \cup rf \cup mo \cup rb$ is acyclic [28].

Note that the NO-THIN-AIR condition is essential for the correctness of Thm. 4 (recall the LB+deps
example).
Next, we show that adding a fence instruction between every two accesses to shared locations
restores SC, or there remains a race in the program, in which case the program has undefined
behavior.

Definition 7. A location x is shared in an execution G if $\langle a, b \rangle \notin sb \cup sb^{-1}$
for some distinct events $a, b \in E_x$.

Theorem 5. Let G be an RC11-consistent execution. Suppose that for every two distinct shared
locations x and y, $[E_x]; sb; [E_y] \subseteq sb; [F^{sc}]; sb$. Then, G is SC-consistent.

We remark that for the proofs of Theorems 4 and 5, we do not need the full SC condition: for
Thm. 4 it suffices for $[E^{sc}]; (sb \cup rf \cup mo \cup rb); [E^{sc}]$ to be acyclic; and for
Thm. 5 it suffices for $[F^{sc}]; sb; eco; sb; [F^{sc}]$ to be acyclic.

9. Related Work
The C11 memory model was designed by the C++ standards committee based on a paper by Boehm and
Adve [10]. During the standardization process, Batty et al. [8] formalized the C11 model and
proved soundness of its compilation to x86-TSO. They also proposed a number of key technical
improvements to the model (such as some coherence axioms), which were incorporated into the
standard.
Since then, however, a number of problems have been found with the C11 model. In 2012, Batty et
al. [7] and Sarkar et al. [27] studied the compilation of C11 to Power, and incorrectly proved the
correctness of two compilation schemes. In their proofs, from a consistent Power execution, they
constructed a corresponding C11 execution, which they tried to prove consistent, but in doing so
they forgot to check the overly strong condition S1. The examples shown in §1 and in §2.1 are
counterexamples to their theorems.
Quite early on, a number of papers [12, 31, 24, 11] noticed the disastrous effects of thin-air
behaviors allowed by the C11 model, and proposed strengthening the definition of consistency by
disallowing $sb \cup rf$ cycles. Boehm and Demsky [11] further discussed how the compilation
schemes of relaxed accesses to Power and ARM would be affected by the change, but did not formally
prove the correctness of their proposed schemes.
Next, Vafeiadis et al. [30] noticed a number of other problems with the C11 memory model, which
invalidated a number of source-to-source program transformations that were assumed to hold. They
proposed local fixes to those problems, and showed that these fixes enabled proving correctness of
a number of local transformations. We have incorporated their fixes in the RC11-consistency
definition.

Then, in 2016, Batty et al. [5] proposed a more concise semantics for SC atomics, whose
presentation we have followed in our proposed RC11 model. As their semantics is stronger than C11,
it cannot be compiled efficiently to Power, contradicting the claim of that paper. Moreover, as
already discussed, SC fences are still too weak according to their model: in particular, putting
them between every two accesses in a program with only atomic accesses does not guarantee SC.
Recently, Manerkar et al. [22] discovered the problem with trailing-sync compilation to Power (in
particular, they observed the IRIW-acq-sc counterexample), and identified the mistake in the
existing proof. Independently, we discovered the same problem, as well as the problem with
leading-sync compilation. Moreover, in this paper, we have proposed a fix for both problems, and
proven that it works.
A number of previous papers [31, 29, 18, 17] have studied only small fragments of the C11 model,
typically the release/acquire fragment. Among these, Lahav et al. [17] proposed strengthening the
semantics of SC fences in a different way from the way we do here, by treating them as
read-modify-writes to a distinguished location. That strengthening, however, was considered in the
restricted setting of only release/acquire accesses, and does not directly scale to the full set
of C11 access modes. In fact, for the fragment containing only SC fences and release/acquire
accesses, RC11-consistency is equivalent to RA-consistency that treats SC fences as RMWs to a
distinguished location [17].
Finally, several solutions to the out-of-thin-air problem were recently suggested, e.g.,
[26, 15, 16]. These solutions aim to avoid the performance cost of disallowing $sb \cup rf$
cycles, but none of them follows the declarative framework of C11. The conservative approach of
disallowing $sb \cup rf$ cycles allows us to formulate our model in the style of C11.

10. Conclusion
In this paper, we have introduced the RC11 memory model, which corrects all the known problems of
the C11 model (albeit at a performance cost for the out-of-thin-air problem). We have further
proved (i) the correctness of compilation from RC11 to x86-TSO [25], Power and ARMv7 [4]; (ii) the
soundness of various program transformations; (iii) a DRF-SC theorem; and (iv) a theorem showing
that for programs without non-atomic accesses, weak behaviors can be always avoided by placing SC
fences. It would be very useful to mechanize the proofs of this paper in a theorem prover; we
leave this for future work.
A certain degree of freedom exists in the design of the SC condition. A very weak version, which
maintains the two formal programming guarantees of this paper, would require acyclicity of
$[E^{sc}]; (sb \cup rf \cup mo \cup rb); [E^{sc}] \cup [F^{sc}]; sb; eco; sb; [F^{sc}]$. At the
other extreme, one can require the acyclicity of
$psc_{strong} = ([E^{sc}] \cup [F^{sc}]; hb^?); (hb \cup eco); ([E^{sc}] \cup hb^?; [F^{sc}])$,
and either disallow mixing SC and non-SC accesses to the same location, or have rather expensive
compilation schemes (for Power/ARMv7: compile release-acquire atomics exactly as the SC ones; for
TSO: place a barrier before every SC read). Our choice of psc achieves the following: (i) it
allows free mixing of different access modes to the same location in the spirit of C11; (ii) it
ensures the correctness of the existing compilation schemes; and (iii) it coincides with
$psc_{strong}$ in the absence of mixing of SC and non-SC accesses to the same location.
Regarding the infamous out-of-thin-air problem, we employed in RC11 a conservative solution at the
cost of including a fake control dependency after every relaxed read. While this was already
considered a valid solution before, we are the first to prove the correctness of this compilation
scheme, as well as the soundness of reordering of independent non-atomic accesses under this
model. Correctness of an alternative scheme that places a lightweight fence after every relaxed
write is left for future work. It would be interesting to evaluate the practical performance costs
of each scheme. On the one hand, relaxed writes (which are not followed by a fence) are perhaps
rare in real programs, compared to relaxed reads. On the other hand, a control dependency is
cheaper than a lightweight fence, and relaxed reads are often anyway followed by a control
dependency.
Another important future direction would be to combine our SC constraint with our recent
operational model in [16], which prevents out-of-thin-air values (and avoids undefined behaviors
altogether), while still allowing the compilation of relaxed reads and writes to plain loads and
stores. This is, in particular, crucial for adopting a model like RC11 in a type-safe language,
like Java, which cannot allow undefined behaviors. Integrating our SC condition in that model,
however, is non-trivial because the model is defined in a very different style from C11, and thus
we will have to find an equivalent operational way to check our SC condition.
Finally, extending RC11 with additional features of C11 (see §3.4) and establishing the
correctness of compilation of RC11 to ARMv8 [13] are important future goals as well.

Acknowledgments
We thank Hans Boehm, Soham Chakraborty, Doug Lea, Peter Sewell and the PLDI'17 reviewers for their
helpful feedback. This research was supported in part by Samsung Research Funding Center of
Samsung Electronics under Project Number SRFC-IT1502-07, and in part by an ERC Consolidator Grant
for the project "RustBelt" (grant agreement no. 683289). The third author has been supported by a
Korea Foundation for Advanced Studies Scholarship.

References
[1] C/C++11 mappings to processors, available at http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. [Online; accessed 27-September-2016].
[2] Crossbeam: support for concurrent and parallel programming, available at https://github.com/aturon/crossbeam. [Online; accessed 24-October-2016].
[3] Supplementary material for this paper, available at http://plv.mpi-sws.org/scfix/.
[4] J. Alglave, L. Maranget, and M. Tautschnig. Herding cats: Modelling, simulation, testing, and data mining for weak memory. ACM Trans. Program. Lang. Syst., 36(2):7:1–7:74, July 2014.
[5] M. Batty, A. F. Donaldson, and J. Wickerson. Overhauling SC atomics in C11 and OpenCL. In POPL 2016, pages 634–648. ACM, 2016.
[6] M. Batty, K. Memarian, K. Nienhuis, J. Pichon-Pharabod, and P. Sewell. The problem of programming language concurrency semantics. In ESOP 2015, pages 283–307. Springer, 2015.
[7] M. Batty, K. Memarian, S. Owens, S. Sarkar, and P. Sewell. Clarifying and compiling C/C++ concurrency: From C++11 to POWER. In POPL 2012, pages 509–520. ACM, 2012.
[8] M. Batty, S. Owens, S. Sarkar, P. Sewell, and T. Weber. Mathematizing C++ concurrency. In POPL 2011, pages 55–66. ACM, 2011.
[9] H.-J. Boehm. Can seqlocks get along with programming language memory models? In MSPC 2012, pages 12–20. ACM, 2012.
[10] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In PLDI 2008, pages 68–78. ACM, 2008.
[11] H.-J. Boehm and B. Demsky. Outlawing ghosts: Avoiding out-of-thin-air results. In MSPC 2014, pages 7:1–7:6. ACM, 2014.
[12] M. Dodds, M. Batty, and A. Gotsman. C/C++ causal cycles confound compositionality. TinyToCS, 2, 2013.
[13] S. Flur, K. E. Gray, C. Pulte, S. Sarkar, A. Sezgin, L. Maranget, W. Deacon, and P. Sewell. Modelling the ARMv8 architecture, operationally: Concurrency and ISA. In POPL 2016, pages 608–621. ACM, 2016.
[14] Intel. A formal specification of Intel Itanium processor family memory ordering, 2002. http://download.intel.com/design/Itanium/Downloads/25142901.pdf. [Online; accessed 14-November-2016].
[15] A. Jeffrey and J. Riely. On thin air reads: Towards an event structures model of relaxed memory. In LICS 2016, pages 759–767. ACM, 2016.
[16] J. Kang, C.-K. Hur, O. Lahav, V. Vafeiadis, and D. Dreyer. A promising semantics for relaxed-memory concurrency. In POPL 2017, pages 175–189. ACM, 2017.
[17] O. Lahav, N. Giannarakis, and V. Vafeiadis. Taming release-acquire consistency. In POPL 2016, pages 649–662. ACM, 2016.
[18] O. Lahav and V. Vafeiadis. Owicki-Gries reasoning for weak memory models. In ICALP 2015, pages 311–323. Springer, 2015.
[19] O. Lahav and V. Vafeiadis. Explaining relaxed memory models with program transformations. In FM 2016, pages 479–495. Springer, 2016.
[20] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Computers, 28(9):690–691, 1979.
[21] N. M. Le, A. Pop, A. Cohen, and F. Zappa Nardelli. Correct and efficient work-stealing for weak memory models. In PPoPP 2013, pages 69–80. ACM, 2013.
[22] Y. A. Manerkar, C. Trippel, D. Lustig, M. Pellauer, and M. Martonosi. Counterexamples and proof loophole for the C/C++ to POWER and ARMv7 trailing-sync compiler mappings. arXiv preprint arXiv:1611.01507, 2016.
[23] L. Maranget, S. Sarkar, and P. Sewell. A tutorial introduction to the ARM and POWER relaxed memory models. http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf, 2012.
[24] B. Norris and B. Demsky. CDSchecker: checking concurrent data structures written with C/C++ atomics. In OOPSLA 2013, pages 131–150. ACM, 2013.
[25] S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In TPHOLs 2009, pages 391–407. Springer-Verlag, 2009.
[26] J. Pichon-Pharabod and P. Sewell. A concurrency semantics for relaxed atomics that permits optimisation and avoids thin-air executions. In POPL 2016, pages 622–633. ACM, 2016.
[27] S. Sarkar, K. Memarian, S. Owens, M. Batty, P. Sewell, L. Maranget, J. Alglave, and D. Williams. Synchronising C/C++ and POWER. In PLDI 2012, pages 311–322. ACM, 2012.
[28] D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, Apr. 1988.
[29] A. Turon, V. Vafeiadis, and D. Dreyer. GPS: Navigating weak memory with ghosts, protocols, and separation. In OOPSLA 2014, pages 691–707. ACM, 2014.
[30] V. Vafeiadis, T. Balabonski, S. Chakraborty, R. Morisset, and F. Zappa Nardelli. Common compiler optimisations are invalid in the C11 memory model and what we can do about it. In POPL 2015, pages 209–220. ACM, 2015.
[31] V. Vafeiadis and C. Narayan. Relaxed separation logic: A program logic for C11 concurrency. In OOPSLA 2013, pages 867–884. ACM, 2013.
[32] J. Wickerson, M. Batty, T. Sorensen, and G. A. Constantinides. Automatically comparing memory consistency models. In POPL 2017, pages 190–204. ACM, 2017.

A. Further Examples
A.1 Failure of leading sync convention with SC fences
The following behavior is disallowed according to C11, but allowed by its compilation to Power.

    x :=_rlx 2        ‖                ‖  d := y_acq  //1   ‖
    fence_sc          ‖  y :=_sc 1     ‖  x :=_rel 1        ‖  f := x_sc  //1        (Rsync+Rsc)
    b := y_rlx  //0   ‖                ‖  e := x_rlx  //2   ‖

Under C11, this behavior is forbidden. Consider the following execution (the initialization of x is omitted):

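[Execution graph. Initialization: Wna(y, 0). Thread 1: k: Wrlx(x, 2), l: Fsc, m: Rrlx(y, 0).
Thread 2: n: Wsc(y, 1). Thread 3: o: Racq(y, 1), p: Wrel(x, 1), q: Rrlx(x, 2). Thread 4:
r: Rsc(x, 1). Edges: rf from Wna(y, 0) to m, from n to o, from p to r and from k to q; mo from
Wna(y, 0) to n and from p to k; sw from n to o and from p to r.]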

The rf and mo edges are forced because of read values and coherence. Now, the C11 conditions on SC fences
require, in particular, that $[F^{sc}]; sb; rb; [E^{sc}] \subseteq S$ and $[E^{sc}]; rb; sb; [F^{sc}] \subseteq S$.
Hence, we must have $S(l, n)$ (essentially because if we had $S(n, l)$, then m would have been reading from an
overwritten write), as well as $S(r, l)$ (essentially because if we had $S(l, r)$, then r would have been reading
from an mo-overwritten write before the fence). By transitivity, we thus have $S(r, n)$, which contradicts
condition S1, which requires $S(n, r)$ because of the happens-before path via o and p.
The compilation to Power allows the behavior because the sync fences do not provide sufficient synchronization:
again all but one of the sync fences are useless, as they are placed at the beginning and end of a thread. In
fact, this example shows the unsoundness of compilation of C11 to Power even for a compilation scheme that places
a sync fence both before and after each SC access.
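
The contradiction above can be replayed mechanically. The following is a minimal C++ sketch, not part of the original development: it enumerates all total orders on the three SC events l, n and r and checks that none satisfies the three requirements derived in the text. The names `Event`, `Constraints`, `a1_constraints` and `exists_total_order` are hypothetical and introduced only for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <utility>
#include <vector>

// The three SC events of the execution in A.1 (names chosen for this sketch).
enum Event { FenceL = 0, WriteN = 1, ReadR = 2 };

// A pair (a, b) means "a must precede b in the total order S".
using Constraints = std::vector<std::pair<int, int>>;

// The requirements on S derived in the text.
Constraints a1_constraints() {
    return {
        {FenceL, WriteN},  // S(l, n): from [F^sc]; sb; rb; [E^sc] via m
        {ReadR, FenceL},   // S(r, l): from [E^sc]; rb; sb; [F^sc] via k
        {WriteN, ReadR},   // S(n, r): condition S1, happens-before path via o and p
    };
}

// Returns true if some total order on {l, n, r} satisfies every constraint.
bool exists_total_order(const Constraints& required) {
    std::array<int, 3> order = {{FenceL, WriteN, ReadR}};  // sorted start
    do {
        std::array<int, 3> pos{};  // pos[e] = position of event e in `order`
        for (int i = 0; i < 3; ++i) pos[order[i]] = i;
        bool ok = true;
        for (const auto& c : required) ok = ok && (pos[c.first] < pos[c.second]);
        if (ok) return true;
    } while (std::next_permutation(order.begin(), order.end()));
    return false;
}
```

Calling `exists_total_order(a1_constraints())` finds no order, which matches the argument that C11 forbids the behavior. Dropping the S1 requirement leaves the two fence requirements satisfiable.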

A.2 Unsoundness of compilation of C11 to ARMv8


    x :=_sc 1   ‖  a := x_rlx  //1   ‖  y :=_sc 1
                ‖  fence_acq         ‖  c := x_sc  //0        (RWC+acq+sc)
                ‖  b := y_sc  //0    ‖

C11 disallows the annotated behavior of this program (we have hb from the write of x to the read of y; rb
from the read of y to the write of y; hb from the write of y to the read of x; and finally rb from the read of x to
the write of x).
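
The cycle described in this paragraph can be checked with a small graph search. The sketch below is an illustration only: it hard-codes the four edges named in the text over the four SC events and runs a standard depth-first cycle detection. The identifiers `Event`, `Edges`, `rwc_edges` and `has_cycle` are hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// SC events of RWC+acq+sc, numbered for this sketch only.
enum Event { Wx = 0, Ry = 1, Wy = 2, Rx = 3, NumEvents = 4 };

using Edges = std::vector<std::pair<int, int>>;

// The four edges named in the text.
Edges rwc_edges() {
    return {
        {Wx, Ry},  // hb: write of x to read of y
        {Ry, Wy},  // rb: read of y (value 0) to write of y
        {Wy, Rx},  // hb: write of y to read of x
        {Rx, Wx},  // rb: read of x (value 0) to write of x
    };
}

// Three-colour DFS: reaching a node that is still "in progress" closes a cycle.
static bool dfs(int u, const std::vector<std::vector<int>>& adj, std::vector<int>& colour) {
    colour[u] = 1;  // in progress
    for (int v : adj[u]) {
        if (colour[v] == 1) return true;
        if (colour[v] == 0 && dfs(v, adj, colour)) return true;
    }
    colour[u] = 2;  // finished
    return false;
}

// Returns true if the directed graph on events 0..n-1 contains a cycle.
bool has_cycle(int n, const Edges& edges) {
    std::vector<std::vector<int>> adj(n);
    for (const auto& e : edges) adj[e.first].push_back(e.second);
    std::vector<int> colour(n, 0);
    for (int u = 0; u < n; ++u)
        if (colour[u] == 0 && dfs(u, adj, colour)) return true;
    return false;
}
```

With all four edges present, `has_cycle(NumEvents, rwc_edges())` reports a cycle. Removing any one of them breaks it, which shows that each edge in the argument is needed.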
Nevertheless, the compilation to ARMv8 (using its special load and store instructions for SC accesses) allows
the behavior following the model in [13]:

    STLR #1, [x]   ‖  LDR a, [x]   //1   ‖  STLR #1, [y]
                   ‖  DMB LD             ‖  LDAR c, [x]  //0
                   ‖  LDAR b, [y]  //0   ‖

First, the store to x is committed and propagated to thread 2 (but not to thread 3). Then, the load of x in thread 2
is issued, satisfied and committed, the fence is committed, and the load of y is issued. In the storage subsystem,
the issued load of y is propagated to thread 1, reordered with the store to x (as they originate from different
threads), propagated to thread 3 and to the main memory, and satisfied with value 0. Then, thread 3 executes:
the store to y is propagated to the main memory, and the load of x is issued and propagated to the main memory,
satisfied with the value 0. Finally, the store of x propagates to the main memory.

A.3 Failure of read-after-read elimination
Let $psc_{old} = [E^{sc}]; (sb \cup rf \cup mo \cup rb); [E^{sc}]$. We present the executions showing the failure
of read-after-read elimination when requiring acyclicity of $psc_{old}$ (see §2.1.1).
Consider the following program:

    y :=_sc 1    ‖  a := x_sc  //1   ‖  c := x_rlx  //1
    x :=_rel 1   ‖  b := x_sc  //1   ‖  x :=_sc 2              (RRmerge)
                 ‖                   ‖  d := y_sc  //0

The annotated behavior is forbidden, but it will become allowed after replacing b := x_sc by b := a. Indeed, the
following execution is an execution of RRmerge yielding the result $a = b = c = 1 \land d = 0$.

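[Execution graph of RRmerge. Thread 1: k: Wsc(y, 1), l: Wrel(x, 1). Thread 2: m: Rsc(x, 1),
n: Rsc(x, 1). Thread 3: o: Rrlx(x, 1), p: Wsc(x, 2), q: Rsc(y, 0). Drawn edges: rf from l to the
reads of x; mo from l to p; rb edges into p and from q to k.]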

In this execution we have a $psc_{old}$ cycle $(k, n, p, q, k)$. It is, however, consistent using our final psc
relation ($\langle k, n \rangle \notin psc$).
Now, the following execution is an execution of the same RRmerge program, but after replacing b := x_sc by
b := a, again yielding the result $a = b = c = 1 \land d = 0$. This execution is consistent when requiring
acyclicity of $psc_{old}$ (as well as with our final psc relation).

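[Execution graph of RRmerge after the replacement. Thread 1: k: Wsc(y, 1), l: Wrel(x, 1).
Thread 2: m: Rsc(x, 1). Thread 3: o: Rrlx(x, 1), p: Wsc(x, 2), q: Rsc(y, 0). Drawn edges: rf from
l to the reads of x; mo from l to p; rb edges into p and from q to k.]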

A.4 Failure of SC-read-after-non-SC-write elimination


y :=sc 2; y :=sc 2;
x :=sc 2; x :=sc 2;
x :=rlx 1; x :=rlx 1;
y :=sc 1; y :=sc 1;
a := xsc ; //1 a := 1;
c := yrlx ; //2 c := yrlx ; //2
b := xrlx ; //2 b := xrlx ; //2

The annotated behavior is allowed under RC11 for the target, but not for the source. The same applies to the
model of Batty et al. [5].

B. Programs to Executions: Receptiveness Assumption


To carry out the compilation correctness proof, we need to record syntactic dependencies between instructions,
as in the Power model. (This is only needed if one is interested in the NO-THIN-AIR condition; compilation
correctness for weakRC11 may completely ignore this extension.) Dependencies are classified into data, address
and control dependencies. Accordingly, we extend the definition of an execution (see §3.1), with additional
relations data, addr and ctrl. We use deps to denote the union of the three relations. We require data, addr
and ctrl to satisfy the following:

1. $data \subseteq R \times W$.
2. $addr \subseteq R \times (R \cup W)$.
3. $ctrl \subseteq R \times E$.
4. $ctrl; sb \subseteq ctrl$.
5. $rmw \subseteq deps$.
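
As an illustration, the five requirements can be checked on a concrete finite execution. The following C++ sketch assumes a toy representation in which events are integers, each with a kind, and relations are sets of pairs; the types `Kind`, `Rel`, `Execution` and the function `well_formed_deps` are hypothetical and not taken from the paper.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

enum class Kind { Read, Write, Fence };
using Rel = std::set<std::pair<int, int>>;

// Toy execution: kind[e] is the kind of event e; relations are sets of pairs.
struct Execution {
    std::vector<Kind> kind;
    Rel sb, rmw, data, addr, ctrl;
};

// Checks requirements 1-5 on the dependency relations of g.
bool well_formed_deps(const Execution& g) {
    auto is = [&g](int e, Kind k) { return g.kind[e] == k; };
    for (const auto& p : g.data)  // 1. data goes from reads to writes
        if (!is(p.first, Kind::Read) || !is(p.second, Kind::Write)) return false;
    for (const auto& p : g.addr)  // 2. addr goes from reads to reads or writes
        if (!is(p.first, Kind::Read) || is(p.second, Kind::Fence)) return false;
    for (const auto& p : g.ctrl)  // 3. ctrl starts at a read
        if (!is(p.first, Kind::Read)) return false;
    for (const auto& c : g.ctrl)  // 4. ctrl is closed under composition with sb
        for (const auto& s : g.sb)
            if (c.second == s.first &&
                !g.ctrl.count(std::make_pair(c.first, s.second)))
                return false;
    for (const auto& p : g.rmw)   // 5. every rmw pair is some dependency
        if (!g.data.count(p) && !g.addr.count(p) && !g.ctrl.count(p)) return false;
    return true;
}
```

For example, with a read 0 followed in sb by writes 1 and 2, the relation ctrl = {(0,1), (0,2)} passes the check, whereas ctrl = {(0,1)} fails requirement 4 because (0,1); (1,2) is not in ctrl.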

The dependency relations are calculated from the program syntax, together with the generation of the program's
execution (like in Power), and the construction ensures that the above properties hold. Moreover, the
construction of executions from programs provides us with the following receptiveness property:
Definition B.1. A function $lab' : Event \rightharpoonup Label$ is called a reevaluation of
$lab : Event \rightharpoonup Label$ if for every event a, the label $lab'(a)$ is identical to $lab(a)$, except
possibly for read/written value.
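
To make the definition concrete, here is a minimal C++ sketch of a reevaluation check. It assumes a simplified label with a type, mode, location and value, and it assumes that both labelings have the same domain; `Label`, `Labeling` and `is_reevaluation` are hypothetical names used only here.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified label (an assumption of this sketch): type, mode, location, value.
struct Label {
    char type;         // 'R', 'W' or 'F'
    std::string mode;  // e.g. "na", "rlx", "acq", "rel", "sc"
    std::string loc;   // accessed location
    int value;         // value read or written
};

// A partial function from event identifiers to labels.
using Labeling = std::map<int, Label>;

// True if lab2 agrees with lab on every event, except possibly for the value.
bool is_reevaluation(const Labeling& lab2, const Labeling& lab) {
    if (lab2.size() != lab.size()) return false;  // same domain assumed
    for (const auto& entry : lab) {
        auto it = lab2.find(entry.first);
        if (it == lab2.end()) return false;
        const Label& a = entry.second;
        const Label& b = it->second;
        if (a.type != b.type || a.mode != b.mode || a.loc != b.loc) return false;
    }
    return true;
}
```

Changing only the value read by an event yields a reevaluation, while changing its location or mode does not.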
Notation B.1. Given an execution G and a reevaluation lab of G.lab, lab(G) denotes the execution $G'$ given
by: $G'.lab = lab$, $G'.rf = \emptyset$, and $G'.c = G.c$ for every $c \in \{E, sb, rmw, mo, data, addr, ctrl\}$.
Assumption B.1 (receptiveness). Let G be an execution of a program P. Let $a \in R$, and suppose that
$a \notin dom(deps^*; (ctrl \cup addr))$. For every $v \in Val$, there exists a reevaluation lab of G.lab such that:
• lab(G) is an execution of P.
• $lab(G).val_r(a) = v$.
• $lab(b) = G.lab(b)$ whenever $\langle a, b \rangle \notin G.deps^+$.

Note that a more basic receptiveness property follows from this assumption: if $a \notin dom(sb)$ then for every
$v \in Val$, we have that lab(G) is an execution of P, for the reevaluation lab of G.lab that sets the read value
of a to v, and otherwise is identical to G.lab.
In addition, we assume that the set of executions of a program is prefix-closed:
Notation B.2. Given an execution G and a set $E \subseteq \mathsf{E}$ that is downwards closed w.r.t. sb (i.e.,
$a \in E$ whenever $\langle a, b \rangle \in sb$ for some $b \in E$), and contains at least all the initialization
events, the restriction of G to E, denoted $G|_E$, is the execution $G'$ given by $G'.\mathsf{E} = E$,
$G'.lab = G.lab|_E$, and $G'.c = [E]; G.c; [E]$ for $c \in \{sb, rmw, rf, mo, data, addr, ctrl\}$.
Assumption B.2 (prefix-closed executions). Let G be an execution of a program P, and let E be a subset of
$\mathsf{E}$ that is downwards closed w.r.t. sb, and contains at least all the initialization events. Then,
$G|_E$ is an execution of P.
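
The two ingredients of the restriction, downward closure and the relational restriction $[E]; c; [E]$, are simple to state in code. The sketch below assumes events are integers and relations are sets of pairs; the names `Events`, `Rel`, `downward_closed` and `restrict_to` are hypothetical, and the requirement that the set contains the initialization events is not modelled.

```cpp
#include <cassert>
#include <set>
#include <utility>

using Events = std::set<int>;
using Rel = std::set<std::pair<int, int>>;

// True if `sub` is downwards closed w.r.t. sb:
// whenever (a, b) is in sb and b is in `sub`, a is in `sub` as well.
bool downward_closed(const Events& sub, const Rel& sb) {
    for (const auto& p : sb)
        if (sub.count(p.second) && !sub.count(p.first)) return false;
    return true;
}

// Computes [E]; r; [E]: the edges of r with both endpoints in `sub`.
Rel restrict_to(const Rel& r, const Events& sub) {
    Rel out;
    for (const auto& p : r)
        if (sub.count(p.first) && sub.count(p.second)) out.insert(p);
    return out;
}
```

For sb = {(0,1), (1,2)}, the set {0, 1} is downwards closed and restricting sb to it leaves {(0,1)}, whereas {1, 2} is not downwards closed because it omits the sb-predecessor 0.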

C. Properties of RC11
In this section, we present some basic properties of the derived relations eco, sw, hb and of RC11-consistent
executions. We omit some of the proofs that straightforwardly follow from our definitions. For the rest of this
section, consider an arbitrary execution G.
Proposition C.1. eco is a strict partial order.
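Proposition C.2. Suppose that $[W]; sb|_{loc}; [W] \subseteq mo$ and $rmw \subseteq rb$. Then, the following hold:

1. $rs \subseteq eco^?$.
2. $[W]; sw; [R] \subseteq eco$.
3. $[W]; sw; [F] \subseteq eco; sb$.
4. $eco; hb \subseteq eco \cup eco; (sb \setminus rmw); hb^?$.

Proof.
1. Let $\langle a, b \rangle \in rs$. Then, by definition,
$\langle a, b \rangle \in [W]; sb|_{loc}^?; [W^{\sqsupseteq rlx}]; (rf; rmw)^*$. Since
$[W]; sb|_{loc}; [W] \subseteq mo$ and $rmw \subseteq rb$, we have $\langle a, b \rangle \in eco^*$. Since eco is
transitive, we have $\langle a, b \rangle \in eco^?$.
2. Let $\langle a, b \rangle \in [W]; sw; [R]$. Then, by definition, we have $\langle a, b \rangle \in rs; rf$.
Using the previous item, we obtain that $\langle a, b \rangle \in eco^?; eco \subseteq eco$.
3. Let $\langle a, b \rangle \in [W]; sw; [F]$. Then, by definition, we have $\langle a, b \rangle \in rs; rf; sb$.
Using the first item, we obtain that $\langle a, b \rangle \in eco^?; eco; sb \subseteq eco; sb$.
4. Let $\langle a, c \rangle \in eco; hb$, and let $b \in E$ be an eco-maximal event satisfying
$\langle a, b \rangle \in eco$, and $\langle b, c \rangle \in hb^?$. If $b = c$ then $\langle a, c \rangle \in eco$,
and we are done. Otherwise, the maximality of b ensures that $\langle b, b' \rangle \in sb \setminus sw$ and
$\langle b', c \rangle \in hb^?$ for some $b' \in E$. Since $rmw \subseteq rb \subseteq eco$, it follows that
$\langle a, c \rangle \in eco; (sb \setminus rmw); hb^?$.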

Lemma C.1 (Read at end). Let $a \in R \setminus dom(sb)$. Suppose that $G' = G|_{G.E \setminus \{a\}}$ is
RC11-consistent. Then, there exists an event $b \in G'.W$ such that the execution $G''$ given by $G''.c = G.c$
for every $c \in \{E, sb, rmw, data, addr, ctrl, mo\}$,
$G''.lab = G'.lab \cup \{a \mapsto R^{mod(a)}(loc(a), val_w(b))\}$, and
$G''.rf = G'.rf \cup \{\langle b, a \rangle\}$ is RC11-consistent.

Proof. Take b to be the mo-maximal event in $G.W_{loc(a)}$. It is straightforward to show that $G''$, as defined
in the statement, is RC11-consistent.

Proposition C.3. Let $a \in W^{\sqsubseteq rlx} \setminus dom(rf)$. Let $G' = G|_{G.E \setminus \{a\}}$. Then,
$[G'.E]; G.hb; [G'.E] = G'.hb$.
Proposition C.4. Let $G'$ be any execution obtained from G by possibly changing the value read at some
$a \in R^{na}$, and the source of the rf edge entering the event a. Then, $G'.hb = G.hb$.
Proposition C.5. Let $G'$ be an execution, such that $G'.E = G.E \uplus \{a\}$ for some event a. Suppose that
$a \in G'.R^{na}$, $G.sb \subseteq G'.sb$, $G.lab \subseteq G'.lab$, $G.rmw = G'.rmw$, and
$G'.rf = G.rf \cup \{\langle b, a \rangle\}$ for some $b \in G.E$. Then, $[G.E]; G'.hb; [G.E] = G.hb$.

D. The RCna Model


In this section we present a variant of RC11, which has a smaller pscbase relation, and is useful in our
correctness of compilation proofs. It is based on the following additional derived relations:

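$$\begin{aligned}
rb^{na} &\triangleq [R^{na}]; rb \\
rb^{\neq na} &\triangleq rb \setminus rb^{na} \\
eco^{\neq na} &\triangleq rf \cup (mo \cup rb^{\neq na}); rf^? \\
scb^{\neq na} &\triangleq sb \cup sb|_{\neq loc}; hb; sb|_{\neq loc} \cup hb|_{loc} \cup mo \cup rb^{\neq na} \\
psc_{base}^{\neq na} &\triangleq ([E^{sc}] \cup [F^{sc}]; hb^?); scb^{\neq na}; ([E^{sc}] \cup hb^?; [F^{sc}]) \\
psc_F^{\neq na} &\triangleq [F^{sc}]; (hb \cup hb; eco^{\neq na}; hb); [F^{sc}]
\end{aligned}$$

Proposition D.1. If $rb^{na} \subseteq hb$ then $psc_{base} = psc_{base}^{\neq na}$ and $psc_F = psc_F^{\neq na}$.

Proof. Note that $rb^{na} \subseteq hb$ implies that $rb^{na} \subseteq hb|_{loc}$, and
$hb; rb^{na}; rf^?; hb \subseteq hb \cup hb; eco; hb$. In addition, we have
$psc_{base} \setminus psc_{base}^{\neq na} \subseteq [F^{sc}]; hb; rb^{na}; ([E^{sc}] \cup hb; [F^{sc}])$, and
$psc_F \setminus psc_F^{\neq na} \subseteq [F^{sc}]; hb; rb^{na}; rf^?; hb; [F^{sc}]$. Thus, this claim immediately
follows from our definitions.

We call an execution RCna-consistent if it satisfies all conditions of Def. 1, except possibly for SC, and
$psc_{base}^{\neq na} \cup psc_F^{\neq na}$ is acyclic.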
Lemma D.1. Let G be an RCna -consistent execution of a program P . Then, either G is RC11-consistent, or P
has undefined behavior under RC11.

Proof. If rbna hb, then, by Prop. D.1, G is RC11-consistent. Suppose otherwise. We show that P has unde-
fined behavior under RC11. Let a1 , ... , an be an enumeration of E that respects sb rf (that is, i < j whenever
hai , aj i sbrf). For every 1 i n, let Ei = E0 {a1 , ... , ai } and Gi = G|Ei . Let k be the minimal index
such that Gk .rbna 6 Gk .hb. Then, by Prop. D.1, Gk1 .pscbase Gk1 .pscF = Gk1 .psc6= na
base Gk1 .pscF
6=na
na
is acyclic, and so Gk1 is RC11-consistent. Let haR , aW i Gk .rb \Gk .hb. Then, we must have ak {aR , aW }.
Note also that haW , aR i 6 Gk .hb since Gk satisfies COHERENCE.
Now, if Gk is RC11-consistent, then we are done (it is a racy execution of P ). Suppose otherwise. We
show that ak 6= aW . Indeed, otherwise, since Gk is RCna -consistent but not RC11-consistent, and Gk1 is
RC11-consistent, it must be the case that mod(ak ) = sc, and there exist b, f Ek1 such that:
hak , bi Gk .mo; (Gk1 .hb; [F])? ; [Gk1 .Esc ]
hb, f i (Gk1 .pscbase Gk1 .pscF ) ; [Fsc ]
hf, ak i Gk1 .hb; Gk .rbna

Now, since we have [Ek1 ]; Gk .rb; Gk .mo; [Ek1 ] Gk1 .rb, it follows that hf, bi Gk1 .pscbase . This,
however, contradicts the fact that Gk1 is RC11-consistent.
Therefore, we have ak = aR . Let x = G.loc(ak ). By Lemma C.1, there exists an event b Gk1 .Wx
such that the execution G0 given by G0 .c = Gk .c for every c {E, sb, rmw, data, addr, ctrl, mo},
G0 .lab = Gk1 .lab {ak 7 Rna (x, valw (b))}, and G0 .rf = Gk1 .rf {hb, ak i} is RC11-consistent.
By Assumption B.1, G0 is an execution of P . In addition, we have haW , ak i 6 G0 .hb (since haW , ak i 6 Gk .hb
and G0 .hb = Gk .hb by Prop. C.4), and so G0 is racy. Hence, P has undefined behavior under RC11.

Next, we prove some lemmas that allow us (under some restrictions) to add a memory access inside a given
execution. In what follows, we take G to be an arbitrary execution.
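Proposition D.2. If $a \notin dom(sb^?; [E^{\sqsupseteq rel}])$, then for every $b \in E$, we have
$\langle a, b \rangle \in hb$ iff $\langle a, b \rangle \in sb$.

Proof. The assumption that $a \notin dom(sb^?; [E^{\sqsupseteq rel}])$ ensures that $a \notin dom(sb^?; sw)$, and
so we have $\langle a, b \rangle \in hb$ iff $\langle a, b \rangle \in sb$.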

Lemma D.2 (Add write). Let a W \ (dom(sb? ; [Ewrel ]) At). Suppose that G0 = G|G.E\{a} is RCna -
consistent. Let x = loc(a), and suppose that ha, bi sb; [Rx ] implies ha, bi sb; [Wx ]; sb. Then,
there exists a relation T G.Wx G.Wx such that the execution G00 given by G00 .c = G.c for every
c {E, lab, sb, rmw, data, addr, ctrl}, G00 .rf = G0 .rf, and G00 .mo = G0 .mo T is RCna -consistent.

Proof. Let C = {c G0 .Wx | ha, ci G.sb; G0 .mo? }, and take T = ({a} C) ((G0 .Wx \ C) {a}). It
is straightforward to show that G00 , as defined in the statement, is RCna -consistent. In particular, we have
G00 .psc6=na 0 6=na 00
base = G .pscbase and G .pscF
6=na
= G0 .psc6=na
F .

Lemma D.3 (Add rmw write). Suppose that rmw1 ; rf1 ; rf; rmw [G.E]. Let
a (W At) \ dom(sb? ; [Ewrel ]). Suppose that G0 = G|G.E\{a} is RCna -consistent. Let x = loc(a),
and suppose that ha, bi sb; [Rx ] implies ha, bi sb; [Wx ]; sb. Then, there exists a relation T G.Wx G.Wx
such that the execution G00 given by G00 .c = G.c for every c {E, lab, sb, rmw, data, addr, ctrl},
G00 .rf = G0 .rf, and G00 .mo = G0 .mo T is RCna -consistent.

Proof. Let b, d G0 .E such that hb, ai G.rmw and hd, bi G0 .rf. Let C = {c G0 .Wx | hd, ci G0 .mo},
and take T = ({a} C) ((G0 .Wx \ C) {a}). It is straightforward to show that G00 , as defined in the
statement, is RCna -consistent.

Lemma D.4 (Add non-atomic read). Let a Rna \ dom(sb; [Ewrel ]). Suppose that G0 = G|G.E\{a} is
RCna -consistent. Then, there exists an event b G0 .W such that the execution G00 given by G00 .E = G.E,
G00 .lab = G0 .lab {a 7 Rna (loc(a), valw (b))}, G00 .c = G.c for every c {sb, rmw, data, addr, ctrl},
G00 .rf = G0 .rf {hb, ai}, and G00 .mo = G.mo is RCna -consistent.

Proof. Let x = loc(a). Let B = {b G.Wx | hb, ai G.rf? ; G.hb}, and take b be the mo-maximal event in
B. It is straightforward to show that G00 , as defined in the statement, is RCna -consistent.

E. Proof of Global Transformation of SC accesses


In this section we prove the soundness of a global program transformation that either adds an SC fence before
every SC access, or adds an SC fence after every SC access, and then replaces all SC accesses by release/acquire
ones. This will allow us later to prove the correctness of compilation only for programs that do not contain any
SC accesses.
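
At the source level, the two variants of this transformation have the following shape. This is a minimal C++ sketch for `int`-valued atomics, given only to illustrate what the rewritten accesses look like; the helper names `store_leading`, `load_leading`, `store_trailing` and `load_trailing` are hypothetical, and the sketch makes no claim beyond the soundness result proved in this section for RC11.

```cpp
#include <atomic>
#include <cassert>

// Variant 1: an SC fence before every (formerly SC) access,
// and the access itself downgraded to release/acquire.
inline void store_leading(std::atomic<int>& x, int v) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    x.store(v, std::memory_order_release);
}

inline int load_leading(const std::atomic<int>& x) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return x.load(std::memory_order_acquire);
}

// Variant 2: an SC fence after every (formerly SC) access.
inline void store_trailing(std::atomic<int>& x, int v) {
    x.store(v, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

inline int load_trailing(const std::atomic<int>& x) {
    const int v = x.load(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return v;
}
```

A program transformed this way uses one variant consistently for all of its SC accesses; mixing the two variants is not what the transformation describes.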
We use the following additional notation:

$$sb' \triangleq sb \setminus rmw$$

Lemma E.1. Let G be an execution satisfying all conditions of Def. 1, except possibly for SC. Suppose that
$[RW^{sc}]; (sb' \cup sb'; hb; sb'); [RW^{sc}] \subseteq hb; [F^{sc}]; hb$. Let
$T = sb \cup sb'; hb; sb' \cup eco$. Then:

$$[F^{sc}]; hb; eco^?; ([RW^{sc}]; T; [RW^{sc}])^*; eco^?; hb; [F^{sc}] \subseteq psc_F^+.$$

Proof. We show by induction on n, that [Fsc ]; hb; eco? ; ([RWsc ]; T ; [RWsc ])n ; eco? ; hb; [Fsc ] psc+F for every
n 0. For n = 0, the claim holds since eco? ; eco? eco? , and [Fsc ]; hb; eco? ; hb; [Fsc ] pscF .
Suppose now that [Fsc ]; hb; eco? ; ([RWsc ]; T ; [RWsc ])n1 ; eco? ; hb; [Fsc ] psc+
F , and let
sc ? sc sc n ? sc
R = [F ]; hb; eco ; ([RW ]; T ; [RW ]) ; eco ; hb; [F ]. Expanding the definition of T (keeping in
mind that rmw eco) we have R R1 R2 , where:

R1 = [Fsc ]; hb; eco? ; ([RWsc ]; T ; [RWsc ])n1 ; [RWsc ]; (sb0 sb0 ; hb; sb0 ); [RWsc ]; eco? ; hb; [Fsc ],

R2 = [Fsc ]; hb; eco? ; ([RWsc ]; T ; [RWsc ])n1 ; eco; eco? ; hb; [Fsc ].
Since eco; eco? eco, by the induction hypothesis, we have R2 psc+
F . In addition, our assumption entails
that R1 is contained in

R10 = [Fsc ]; hb; eco? ; ([RWsc ]; T ; [RWsc ])n1 ; hb; [Fsc ]; hb; eco? ; hb; [Fsc ],

which, in turn, using the induction hypothesis is also contained in psc+


F .

Lemma E.2. Let G be an execution satisfying all conditions of Def. 1, except possibly for SC. Suppose that
[RWsc ]; (sb0 sb0 ; hb; sb0 ); [RWsc ] hb; [Fsc ]; hb. Then, if pscF is acyclic, then so is pscbase pscF .

Proof. Contrapositively, suppose that pscbase pscF is cyclic. Then, by definition, the union of the following
relations is cyclic:

A1 = [RWsc ]; scb; [RWsc ] A3 = [RWsc ]; scb; hb? ; [Fsc ]


A2 = [Fsc ]; (hb hb; eco; hb); [Fsc ] A4 = [Fsc ]; hb? ; scb; [RWsc ]

Consider first the case that A1 is cyclic. Then, since rmw eco and hb|loc eco sb0 ; hb; sb0 , the relation
[RWsc ]; (sb0 sb0 ; hb; sb0 ); [RWsc ] eco is cyclic. Our assumption on G entails that hb; [Fsc ]; hb eco is cyclic.
Since both hb; [Fsc ]; hb and eco are transitive and irreflexive, we obtain that hb; [Fsc ]; hb; eco is cyclic, which
in turn implies that [Fsc ]; hb; eco; hb; [Fsc ] pscF is cyclic.
Now, consider the case that A1 is acyclic. Let T = sb sb0 ; hb; sb0 eco. It is easy to see that scb T (since
we have rmw eco and hb|loc eco sb0 ; hb; sb0 ). Then, the union of pscF and the following relation must
be cyclic:
B = [Fsc ]; hb? ; scb; ([RWsc ]; T ; [RWsc ]) ; scb; hb? ; [Fsc ]
Now, we have [Fsc ]; hb? ; scb [Fsc ]; hb; eco? and scb; hb? ; [Fsc ] eco? ; hb; [Fsc ]. By Lemma E.1, it
follows that B psc+F , and so pscF is cyclic.

Lemma E.3. Let G be an RC11-consistent execution without any SC accesses. Let A Rwacq Wwrel , such
that [A]; (sb0 sb0 ; hb; sb0 ); [A] hb; [Fsc ]; hb, and [A]; rmw = rmw; [A]. Then, the execution G0 obtained
from G by changing all modes of events in A to sc is RC11-consistent.

Proof. The only constraint that is affected by such modification is SC. Now, in G0 we have
[G0 .RWsc ]; (G0 .sb0 G0 .sb0 ; G0 .hb; G0 .sb0 ); [G0 .RWsc ] G0 .hb; [G0 .Fsc ]; G0 .hb, and by Lemma E.2 it suffices
to show that G0 .pscF is acyclic. This follows from the fact that G satisfies SC, since G0 .pscF = G.pscF .

F. Properties of the Power and ARMv7 Models


In this appendix we provide the full definition of preserved program order (ppo) used by Power and ARMv7,
and prove various properties of these models that are needed in our compilation correctness proof.
Notation F.1. For every relation c (e.g., rf, mo, etc.), we denote by ci and ce (internal c and external c) its
thread-internal and thread-external restrictions. Formally, $ci = c \cap sb$ and $ce = c \setminus sb$.

F.1 Preserved Program Order


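ppo is defined based on the four dependencies $\mathsf{data}$, $\mathsf{addr}$, $\mathsf{ctrl}$, and $\mathsf{ctrl}_{\mathsf{isync}}$, which satisfy the following properties:

1. $\mathsf{data} \subseteq \mathsf{R} \times \mathsf{W}$.
2. $\mathsf{addr} \subseteq \mathsf{R} \times (\mathsf{R} \cup \mathsf{W})$.
3. $\mathsf{ctrl}_{\mathsf{isync}} \subseteq \mathsf{ctrl} \subseteq \mathsf{R} \times \mathsf{E}$.
4. $\mathsf{ctrl}; \mathsf{sb} \subseteq \mathsf{ctrl}$.
5. $\mathsf{ctrl}_{\mathsf{isync}}; \mathsf{sb} \subseteq \mathsf{ctrl}_{\mathsf{isync}}$.
6. $\mathsf{rmw} \subseteq \mathsf{data} \cup \mathsf{addr} \cup \mathsf{ctrl}$.
7. $\mathsf{rmw}; \mathsf{sb} \subseteq \mathsf{ctrl}$.

Properties 1-5 hold by definition (see [4]). Properties 6-7 hold due to the compilation scheme: it always places a dependency from the load to the store that form an RMW pair, and a branch after each (conditional) store in such pairs.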
The relation deps includes all types of dependencies:

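\[
\mathsf{deps} \triangleq \mathsf{data} \cup \mathsf{addr} \cup \mathsf{ctrl}
\]

Herd's definition of ppo is as follows:

\[
\begin{array}{rcl}
\mathsf{rdw} & \triangleq & (\mathsf{rbe}; \mathsf{rfe}) \cap \mathsf{sb} \\
\mathsf{detour} & \triangleq & (\mathsf{moe}; \mathsf{rfe}) \cap \mathsf{sb} \\
\mathsf{ii}_0 & \triangleq & \mathsf{addr} \cup \mathsf{data} \cup \mathsf{rdw} \cup \mathsf{rfi} \\
\mathsf{ic}_0 & \triangleq & \emptyset \\
\mathsf{ci}_0 & \triangleq & \mathsf{ctrl}_{\mathsf{isync}} \cup \mathsf{detour} \\
\mathsf{cc}_0^{\mathrm{Power}} & \triangleq & \mathsf{data} \cup \mathsf{ctrl} \cup \mathsf{addr}; \mathsf{sb}^{?} \cup \mathsf{sb}|_{\mathsf{loc}} \\
\mathsf{cc}_0^{\mathrm{ARMv7}} & \triangleq & \mathsf{data} \cup \mathsf{ctrl} \cup \mathsf{addr}; \mathsf{sb}^{?} \\
\mathsf{ppo} & \triangleq & [\mathsf{R}]; \mathsf{ii}; [\mathsf{R}] \cup [\mathsf{R}]; \mathsf{ic}; [\mathsf{W}]
\end{array}
\]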

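where ii, ic, ci, cc are inductively defined as the least relations satisfying:

\[
\begin{array}{l}
\mathsf{ii}_0 \cup \mathsf{ci} \cup \mathsf{ic}; \mathsf{ci} \cup \mathsf{ii}; \mathsf{ii} \subseteq \mathsf{ii} \\
\mathsf{ic}_0 \cup \mathsf{ii} \cup \mathsf{cc} \cup \mathsf{ic}; \mathsf{cc} \cup \mathsf{ii}; \mathsf{ic} \subseteq \mathsf{ic} \\
\mathsf{ci}_0 \cup \mathsf{ci}; \mathsf{ii} \cup \mathsf{cc}; \mathsf{ci} \subseteq \mathsf{ci} \\
\mathsf{cc}_0 \cup \mathsf{ci} \cup \mathsf{ci}; \mathsf{ic} \cup \mathsf{cc}; \mathsf{cc} \subseteq \mathsf{cc}
\end{array}
\]

Note that $\mathsf{ci} \subseteq \mathsf{ii} \subseteq \mathsf{ic}$, as well as $\mathsf{ci} \subseteq \mathsf{cc} \subseteq \mathsf{ic}$.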
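
The inductive definition of ii, ic, ci, cc can be read as a least fixed point. The following is a minimal sketch, assuming relations are finite sets of pairs; the function names `compose` and `ppo_components` and the four-event example are hypothetical and not part of Herd.

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def ppo_components(ii0, ic0, ci0, cc0):
    """Least relations ii, ic, ci, cc closed under the inductive rules.

    Iterates the rules from the base relations until nothing changes.
    The iteration terminates because each step only adds pairs over a
    finite set of events.
    """
    ii, ic, ci, cc = set(ii0), set(ic0), set(ci0), set(cc0)
    while True:
        new_ii = ii | ci | compose(ic, ci) | compose(ii, ii)
        new_ic = ic | ii | cc | compose(ic, cc) | compose(ii, ic)
        new_ci = ci | compose(ci, ii) | compose(cc, ci)
        new_cc = cc | ci | compose(ci, ic) | compose(cc, cc)
        if (new_ii, new_ic, new_ci, new_cc) == (ii, ic, ci, cc):
            return ii, ic, ci, cc
        ii, ic, ci, cc = new_ii, new_ic, new_ci, new_cc


# Hypothetical chain a -> b -> c -> d built from one ii0, one ci0, one cc0 edge.
ii, ic, ci, cc = ppo_components(
    ii0={("a", "b")}, ic0=set(), ci0={("b", "c")}, cc0={("c", "d")}
)
```

In this example the chain from `a` to `d` ends in a cc0 edge, so it lands in `ic` but not in `ii`; the inclusions ci ⊆ ii ⊆ ic and ci ⊆ cc ⊆ ic hold at the fixed point.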


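Alternatively, the relations ii, ic, ci, cc can be defined as follows:

\[
\mathsf{xy} \triangleq \bigcup_{n \geq 1} (\mathsf{x}_1\mathsf{y}_1)_0; (\mathsf{x}_2\mathsf{y}_2)_0; \ldots; (\mathsf{x}_n\mathsf{y}_n)_0
\]

where:
• $\mathsf{x}, \mathsf{y}, \mathsf{x}_1, \ldots, \mathsf{x}_n, \mathsf{y}_1, \ldots, \mathsf{y}_n \in \{\mathsf{i}, \mathsf{c}\}$.
• If $\mathsf{x} = \mathsf{c}$ then $\mathsf{x}_1 = \mathsf{c}$.
• For every $1 \leq i \leq n - 1$, if $\mathsf{y}_i = \mathsf{c}$ then $\mathsf{x}_{i+1} = \mathsf{c}$.
• If $\mathsf{y} = \mathsf{i}$ then $\mathsf{y}_n = \mathsf{i}$.

Note that the only difference between Power and ARMv7 is in the definition of $\mathsf{cc}_0$. Henceforth, we only assume ARMv7's definition, which is weaker, so our proofs apply to both Power and ARMv7.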

Next, we prove some useful properties of ppo. In all propositions below we assume some Power-consistent
execution.
Proposition F.1. ppo is transitive.

Proof. Immediately follows from the definition.

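Proposition F.2. $[\mathsf{W}]; \mathsf{psbloc} \subseteq \mathsf{ii}$.

Proof. Let $\langle a, b\rangle \in [\mathsf{W}]; \mathsf{psbloc}$ and let $x = \mathsf{loc}(a)$. Then, by definition, $a \in \mathsf{W}_x$, $b \in \mathsf{R}_x$, $\langle a, b\rangle \in \mathsf{sb}$, and there is no $c \in \mathsf{W}_x$ such that $\langle a, c\rangle, \langle c, b\rangle \in \mathsf{sb}$. Since $G$ is complete, there exists some $d \in \mathsf{W}_x$ such that $\langle d, b\rangle \in \mathsf{rf}$. If $d = a$, then we are done since $\mathsf{rfi} \subseteq \mathsf{ii}$. Otherwise, since $G$ satisfies SC-PER-LOC, we have $\langle a, d\rangle \in \mathsf{mo}$, $\langle d, a\rangle \notin \mathsf{sb}$, and $\langle b, d\rangle \notin \mathsf{sb}$. It follows that $\langle a, d\rangle \in \mathsf{moe}$ and $\langle d, b\rangle \in \mathsf{rfe}$. Thus, we have $\langle a, b\rangle \in \mathsf{detour} \subseteq \mathsf{ii}$.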

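Proposition F.3. $(\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); [\mathsf{W}]; \mathsf{psbloc}; \mathsf{ppo}; [\mathsf{W}] \subseteq \mathsf{ppo}$.

Proof. Let $a, b, c, d \in \mathsf{E}$ such that $\langle a, b\rangle \in (\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); [\mathsf{W}]$, $\langle b, c\rangle \in \mathsf{psbloc}$, and $\langle c, d\rangle \in \mathsf{ppo}; [\mathsf{W}]$. If $\langle a, b\rangle \in \mathsf{ctrl}$, then by definition, we have $\langle a, d\rangle \in \mathsf{ctrl}$, and so $\langle a, d\rangle \in \mathsf{ppo}$. If $\langle a, b\rangle \in \mathsf{addr}; \mathsf{sb}$, then by definition, we have $\langle a, d\rangle \in \mathsf{cc}$, and so $\langle a, d\rangle \in \mathsf{ppo}$. Otherwise, $\langle a, b\rangle \in \mathsf{addr} \cup \mathsf{data} \subseteq \mathsf{ii}$. By Prop. F.2, we also have $\langle b, c\rangle \in \mathsf{ii}$. Hence, $\langle a, c\rangle \in \mathsf{ii}$, and so $\langle a, c\rangle \in \mathsf{ppo}$. It follows that $\langle a, d\rangle \in \mathsf{ppo}$.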

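Proposition F.4. $(\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); [\mathsf{R}]; \mathsf{sb}; [\mathsf{W}] \subseteq \mathsf{ppo}$.

Proof. Let $a, b, c \in \mathsf{E}$ such that $\langle a, b\rangle \in (\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); [\mathsf{R}]$ and $\langle b, c\rangle \in \mathsf{sb}; [\mathsf{W}]$. If $\langle a, b\rangle \in \mathsf{ctrl}$, then by definition, we have $\langle a, c\rangle \in \mathsf{ctrl}$, and so $\langle a, c\rangle \in \mathsf{ppo}$. Otherwise, $\langle a, b\rangle \in \mathsf{addr}; \mathsf{sb}^{?}$. In this case, we have $\langle a, c\rangle \in \mathsf{addr}; \mathsf{sb}$, and so $\langle a, c\rangle \in \mathsf{ppo}$.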

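Proposition F.5. Let $R = \mathsf{deps} \cup \mathsf{addr}; \mathsf{sb} \cup \mathsf{psbloc}$. Then, $(\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); R^{*}; [\mathsf{W}] \subseteq \mathsf{ppo}$.

Proof. We prove by induction that for every $n \geq 0$, $(\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); R^{n}; [\mathsf{W}] \subseteq \mathsf{ppo}$. For $n = 0$, we have $(\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); [\mathsf{W}] \subseteq \mathsf{ppo}$ by definition. Let $n \geq 1$ and suppose that $(\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); R^{k}; [\mathsf{W}] \subseteq \mathsf{ppo}$ for every $k < n$. Let $\langle a, b\rangle \in (\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); R^{n}; [\mathsf{W}]$. Let $c \in \mathsf{E}$ such that $\langle a, c\rangle \in (\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb})$ and $\langle c, b\rangle \in R^{n}$. If $c \in \mathsf{R}$, then we are done using Prop. F.4. Otherwise, $c \in \mathsf{W}$, and $\langle c, b\rangle \in \mathsf{psbloc}; R^{n-1}$. Let $d$ be the sb-maximal event satisfying $\langle c, d\rangle \in \mathsf{psbloc}$ and $\langle d, b\rangle \in R^{k}$ for some $k \leq n - 1$. The maximality of $d$ ensures that $\langle d, b\rangle \in (\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); R^{k-1}$. By the induction hypothesis, we have $\langle d, b\rangle \in \mathsf{ppo}$. Hence, we have $\langle a, b\rangle \in (\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); [\mathsf{W}]; \mathsf{psbloc}; \mathsf{ppo}; [\mathsf{W}]$, and the claim follows by Prop. F.3.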

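Proposition F.6. Let $R = \mathsf{deps} \cup \mathsf{addr}; \mathsf{sb} \cup \mathsf{psbloc}$. Then, $\mathsf{rfe}; R^{+}; [\mathsf{W}] \subseteq \mathsf{rfe}; \mathsf{ppo}$.

Proof. Let $\langle a, c\rangle \in \mathsf{rfe}; R^{+}; [\mathsf{W}]$. Let $b$ be the sb-maximal event satisfying $\langle a, b\rangle \in \mathsf{rfe}$ and $\langle b, c\rangle \in R^{+}$. If $\langle b, c\rangle \in (\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); R^{*}$, then we are done by Prop. F.5. Otherwise, let $d$ be the sb-maximal element such that $\langle b, d\rangle \in \mathsf{psbloc}$ and $\langle d, c\rangle \in R^{*}$. Then, $d \in \mathsf{R}$, and since $c \in \mathsf{W}$, we have $\langle d, c\rangle \in R^{+}$. The maximality of $b$ and SC-PER-LOC ensure that $\langle b, d\rangle \in \mathsf{rdw}$, and so $\langle b, d\rangle \in \mathsf{ppo}$. The maximality of $d$ ensures that $\langle d, c\rangle \in (\mathsf{deps} \cup \mathsf{addr}; \mathsf{sb}); R^{*}$. By Prop. F.5, we have $\langle d, c\rangle \in \mathsf{ppo}$, and so $\langle a, c\rangle \in \mathsf{rfe}; \mathsf{ppo}$.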

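Proposition F.7. $\mathsf{ppo}^{?}; \mathsf{rbi} \subseteq \mathsf{ppo}; \mathsf{mo}^{?} \cup \mathsf{mo} \cup \mathsf{rbi}$.

Proof. For any $n \geq 0$, let $\mathsf{ppo}_n$ denote ppo edges that are formed by at most $n$ basic ppo edges ($\mathsf{ii}_0$, $\mathsf{ic}_0$, $\mathsf{ci}_0$, and $\mathsf{cc}_0$). Then, $\mathsf{ppo}^{?} = \bigcup_{n \geq 0} \mathsf{ppo}_n$. The proof proceeds by induction on $n$. For $n = 0$, the claim obviously holds. Suppose now that it holds for $n - 1$, and let $\langle a, b\rangle \in \mathsf{ppo}_n$ and $\langle b, c\rangle \in \mathsf{rbi}$. Then, $b$ must be a read event, and so there exists $a'$ such that $\langle a, a'\rangle \in \mathsf{ppo}_{n-1}$ and $\langle a', b\rangle \in \mathsf{ii}_0 \cup \mathsf{ci}_0$. This leads to five cases:
• $\langle a', b\rangle \in \mathsf{addr}$. In this case we have $\langle a', c\rangle \in \mathsf{cc}_0$, and so $\langle a, c\rangle \in \mathsf{ppo}$.
• $\langle a', b\rangle \in \mathsf{rdw}$. In this case we have $\langle a', c\rangle \in \mathsf{rbi}$, and the claim follows by the induction hypothesis.
• $\langle a', b\rangle \in \mathsf{rfi}$. In this case we have $\langle a', c\rangle \in \mathsf{mo}$, and so $\langle a, c\rangle \in \mathsf{ppo}^{?}; \mathsf{mo}$.
• $\langle a', b\rangle \in \mathsf{ctrl}_{\mathsf{isync}}$. In this case we have $\langle a', c\rangle \in \mathsf{ci}_0$, and so $\langle a, c\rangle \in \mathsf{ppo}$.
• $\langle a', b\rangle \in \mathsf{detour}$. In this case we have $\langle a', c\rangle \in \mathsf{mo}$, and so $\langle a, c\rangle \in \mathsf{ppo}^{?}; \mathsf{mo}$.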

F.2 Additional Properties
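Proposition F.8. $\mathsf{rmw} \cap (\mathsf{rb}; \mathsf{mo}) = \emptyset$.

Proof. The POWER-ATOMICITY condition ensures that $\mathsf{rmw} \cap (\mathsf{rbe}; \mathsf{moe}) = \emptyset$. In addition, in every execution we have $\mathsf{rmw} \subseteq \mathsf{sb}$, $\mathsf{rbe}; \mathsf{sb} \not\subseteq \mathsf{sb}$, $\mathsf{sb}; \mathsf{moe} \not\subseteq \mathsf{sb}$, and $\mathsf{sb}; \mathsf{sb} \not\subseteq \mathsf{rmw}$. It follows that $\mathsf{rmw} \cap (\mathsf{rb}; \mathsf{mo}) = \emptyset$.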
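Proposition F.9. Let $R \in \{\mathsf{sync}, \mathsf{fence}\}$. Then, $R; \mathsf{hb}^{*}; \mathsf{rbi} \subseteq R; \mathsf{hb}^{*}; \mathsf{mo}^{?}$.

Proof. We prove by induction on $n$ that for every $n \geq 0$, we have $R; \mathsf{hb}^{n}; \mathsf{rbi} \subseteq R; \mathsf{hb}^{*}; \mathsf{mo}^{?}$. For $n = 0$, the claim follows since $R; \mathsf{rbi} \subseteq R$. Now, suppose it holds for $n - 1$, and let $a, b, c, d$ be such that $\langle a, b\rangle \in R; \mathsf{hb}^{n-1}$, $\langle b, c\rangle \in \mathsf{hb}$, and $\langle c, d\rangle \in \mathsf{rbi}$. If $\langle b, c\rangle \in \mathsf{rfe}$, then we have $\langle b, d\rangle \in \mathsf{mo}$, and so $\langle a, d\rangle \in R; \mathsf{hb}^{*}; \mathsf{mo}$. If $\langle b, c\rangle \in \mathsf{fence}$, then we have $\langle b, d\rangle \in \mathsf{fence}$, and so $\langle a, d\rangle \in R; \mathsf{hb}^{*}$. Otherwise, we have $\langle b, c\rangle \in \mathsf{ppo}$, and the claim follows using Prop. F.7 and the induction hypothesis.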
Proposition F.10. fence is transitive.

Proof. Immediately follows from the definition of fence.


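Proposition F.11. $\mathsf{fence}; \mathsf{hb}^{*} \subseteq \mathsf{sb} \cup \mathsf{fence}; [\mathsf{W}]; \mathsf{hb}^{*}$.

Proof. Let $a, b, c \in \mathsf{E}$ such that $\langle a, b\rangle \in \mathsf{fence}$ and $\langle b, c\rangle \in \mathsf{hb}^{*}$. If $\langle b, c\rangle \in \mathsf{sb}$, then the claim follows since $\mathsf{fence} \subseteq \mathsf{sb}$. Suppose otherwise. Then, there exists $\langle d, e\rangle \in \mathsf{rfe}$ such that $\langle b, d\rangle \in \mathsf{hb}^{*} \cap \mathsf{sb}^{?}$ and $\langle e, c\rangle \in \mathsf{hb}^{*}$. It follows that $\langle a, d\rangle \in \mathsf{fence}$, and so $\langle a, c\rangle \in \mathsf{fence}; [\mathsf{W}]; \mathsf{hb}^{*}$.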
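Proposition F.12. $[\mathsf{RW}]; \mathsf{sb}; (\mathsf{fence}; \mathsf{hb}^{*})^{?}; \mathsf{sync} \subseteq (\mathsf{fence}; \mathsf{hb}^{*})^{?}; \mathsf{sync}$.

Proof. Immediately follows from the definition of sync and Prop. F.11.

Proposition F.13. $\mathsf{eco}^{?}; (\mathsf{fence}; \mathsf{hb}^{*})^{?}; \mathsf{sync}; \mathsf{hb}^{*}$ is acyclic.

Proof. By definition, we have $\mathsf{eco}^{?} = (\mathsf{mo} \cup \mathsf{rbe})^{?}; \mathsf{rf}^{?} \cup \mathsf{rbi}; \mathsf{rfi}^{?} \cup \mathsf{rbi}; \mathsf{rfe}$. Thus, it suffices to show that the union of the following relations is acyclic:
\[
\begin{array}{rcl}
A & = & ((\mathsf{mo} \cup \mathsf{rbe})^{?}; \mathsf{rf}^{?} \cup \mathsf{rbi}; \mathsf{rfi}^{?}); (\mathsf{fence}; \mathsf{hb}^{*})^{?}; \mathsf{sync}; \mathsf{hb}^{*} \\
B & = & \mathsf{rbi}; \mathsf{rfe}; (\mathsf{fence}; \mathsf{hb}^{*})^{?}; \mathsf{sync}; \mathsf{hb}^{*}
\end{array}
\]
By Prop. F.9, $A; B \subseteq A; A$ and $B; B \subseteq B; A$. Hence, it suffices to show that $A$ is acyclic and $B$ is irreflexive. Acyclicity of $A$ follows from Power's PROPAGATION condition, since we have $A \subseteq \mathsf{mo}^{?}; \mathsf{prop}_2$ (using Prop. F.12). Irreflexivity of $B$ also follows from PROPAGATION, using Prop. F.9.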
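Proposition F.14. Let $A = \{a \in \mathsf{W} \mid \exists b \in \mathsf{F}.\ \langle b, a\rangle \in \mathsf{sb}|_{\mathsf{imm}}; \mathsf{rmw}^{?}\}$.
Then, $(\mathsf{sb}^{?}; [\mathsf{F}]; \mathsf{sb} \cup [A]; \mathsf{moi}^{?}); \mathsf{rfe}; \mathsf{hb}^{*}; (\mathsf{sb}; [\mathsf{F}])^{?}$ is a strict partial order.

Proof. Let $R = (\mathsf{sb}^{?}; [\mathsf{F}]; \mathsf{sb} \cup [A]; \mathsf{moi}^{?}); \mathsf{rfe}; \mathsf{hb}^{*}; (\mathsf{sb}; [\mathsf{F}])^{?}$. The fact that $R$ is transitive follows from the following facts (obtained by expanding the relevant definitions):
• $\mathsf{sb}; [\mathsf{F}]; (\mathsf{sb}^{?}; [\mathsf{F}]; \mathsf{sb} \cup [A]; \mathsf{moi}^{?}); \mathsf{rfe} \subseteq \mathsf{fence}; \mathsf{rfe} \subseteq \mathsf{hb}^{+}$.
• $\mathsf{rfe}; \mathsf{hb}^{*}; \mathsf{sb}^{?}; [\mathsf{F}]; \mathsf{sb}; \mathsf{rfe} \subseteq \mathsf{rfe}; \mathsf{hb}^{*}; \mathsf{fence}; \mathsf{rfe} \subseteq \mathsf{rfe}; \mathsf{hb}^{*}$.
• $\mathsf{rfe}; \mathsf{hb}^{*}; [A]; \mathsf{moi}^{?}; \mathsf{rfe} \subseteq \mathsf{rfe}; \mathsf{hb}^{*}; (\mathsf{rmw}; \mathsf{sb} \cup \mathsf{sb}; [\mathsf{F}]; \mathsf{sb}); \mathsf{rfe} \subseteq \mathsf{rfe}; \mathsf{hb}^{*}; (\mathsf{ppo} \cup \mathsf{fence}); \mathsf{rfe} \subseteq \mathsf{rfe}; \mathsf{hb}^{*}$.
Now, to see that $R$ is irreflexive, note that $\langle a, a\rangle \in R$ implies (using these three properties) that $\langle a, a\rangle \in \mathsf{hb}^{+}$, which contradicts POWER-NO-THIN-AIR.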
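Proposition F.15. $\mathsf{eco}; (\mathsf{sb} \cup \mathsf{fence}; \mathsf{hb}^{*})$ is irreflexive.

Proof. $\mathsf{eco}; \mathsf{sb}$ is irreflexive using SC-PER-LOC. By Prop. F.11, it suffices to show that $\mathsf{eco}; \mathsf{fence}; [\mathsf{W}]; \mathsf{hb}^{*}$ is irreflexive. Suppose otherwise, and let $a, b \in \mathsf{E}$ such that $\langle a, b\rangle \in \mathsf{eco}$ and $\langle b, a\rangle \in \mathsf{fence}; [\mathsf{W}]; \mathsf{hb}^{*}$. First, if $\langle a, b\rangle \in \mathsf{sb}$, then we have $\langle a, a\rangle \in \mathsf{fence}; \mathsf{hb}^{*} \subseteq \mathsf{hb}^{+}$, which contradicts POWER-NO-THIN-AIR. Suppose otherwise, and consider the possible cases:
• $\langle a, b\rangle \in \mathsf{rfe}$. In this case we obtain $\langle a, a\rangle \in \mathsf{hb}^{+}$, which contradicts POWER-NO-THIN-AIR.
• $\langle a, b\rangle \in \mathsf{mo}; \mathsf{rf}^{?}$. Let $c \in \mathsf{E}$ such that $\langle a, c\rangle \in \mathsf{mo}$ and $\langle c, b\rangle \in \mathsf{rf}^{?}$. Then, we have $\langle c, a\rangle \in \mathsf{prop}_1$, and we obtain that $\mathsf{mo}; \mathsf{prop}_1$ is not irreflexive, which contradicts PROPAGATION.
• $\langle a, b\rangle \in \mathsf{rbe}; \mathsf{rf}^{?}$. Let $c \in \mathsf{W}$ such that $\langle a, c\rangle \in \mathsf{rbe}$ and $\langle c, b\rangle \in \mathsf{rf}^{?}$. Let $d \in \mathsf{W}$ such that $\langle b, d\rangle \in \mathsf{fence}$ and $\langle d, a\rangle \in \mathsf{hb}^{*}$. Then, we have $\langle c, d\rangle \in \mathsf{prop}_1$, and obtain a violation of OBSERVATION.
• $\langle a, b\rangle \in \mathsf{rbi}; \mathsf{rfe}$. Let $c \in \mathsf{W}$ such that $\langle a, c\rangle \in \mathsf{rbi}$ and $\langle c, b\rangle \in \mathsf{rfe}$. By Prop. F.9, we have $\langle b, c\rangle \in \mathsf{fence}; \mathsf{hb}^{*}; \mathsf{mo}^{?}$. Let $d \in \mathsf{E}$ such that $\langle b, d\rangle \in \mathsf{fence}; \mathsf{hb}^{*}$ and $\langle d, c\rangle \in \mathsf{mo}^{?}$. Then, we have $\langle c, d\rangle \in \mathsf{prop}_1$, and we obtain that $\mathsf{mo}^{?}; \mathsf{prop}_1$ is not irreflexive, which contradicts PROPAGATION.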

F.3 Removing Redundant Fences


Lemma F.1. Let $G$ be a Power execution, and let $\langle a, b\rangle \in [\mathsf{F}^{\mathsf{sync}}]; \mathsf{sb}|_{\mathsf{imm}}; [\mathsf{F}^{\mathsf{lwsync}}]$. Let $G'$ be the execution obtained from $G$ by removing $b$ ($G' = G|_{G.\mathsf{E} \setminus \{b\}}$). If $G'$ is Power-consistent, then so is $G$.

Proof. Since $b$'s immediate sb-predecessor is a full fence, we have $G'.\mathsf{fence} = G.\mathsf{fence}$. Then, it is easy to see that for every relation $c$ mentioned in Def. 6, we have $G'.c = G.c$, and so if $G'$ is Power-consistent, then so is $G$.

Lemma F.2. Let $G$ be a Power execution, and let $\langle a, b\rangle \in [\mathsf{R}]; (\mathsf{sb}|_{\mathsf{imm}} \cap \mathsf{ctrl}_{\mathsf{isync}}); [\mathsf{F}]$. Let $G'$ be the execution obtained from $G$ by removing the $\mathsf{ctrl}_{\mathsf{isync}}$ dependency edges from $a$ onwards ($G'.\mathsf{ctrl}_{\mathsf{isync}} = G.\mathsf{ctrl}_{\mathsf{isync}} \setminus (\{a\} \times \mathsf{E})$). If $G'$ is Power-consistent, then so is $G$.

Proof. Since $a$'s immediate sb-successor is a fence, we have $\langle a, c\rangle \in G.\mathsf{fence}$ for every $c \in \mathsf{RW}$ such that $\langle a, c\rangle \in \mathsf{sb}$. Now, by omitting $\mathsf{ctrl}_{\mathsf{isync}}$ dependency edges from $a$ onwards, we may remove ppo edges from $a$, but whenever ppo is used to form an hb edge, it can be replaced by a fence edge. Consequently, for every relation $c$ mentioned in Def. 6, we have $G'.c = G.c$, and so if $G'$ is Power-consistent, then so is $G$.

G. Power-before Relation
In this section, we define a relation that we call Power-before (pb), and show that if pb is acyclic in some
execution G of a program P , then either G is RC11-consistent, or P has undefined behavior under RC11. This
relation is the key for showing that NO - THIN - AIR holds when proving compilation correctness. (Thus, if one is
only interested in weakRC11-consistency, this section can be completely ignored.)
In what follows we assume an execution G.
pb is given by:

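\[
\begin{array}{rcll}
\mathsf{psbloc} & \triangleq & (\mathsf{sb}|_{\mathsf{loc}}; [\mathsf{R}]) \setminus (\mathsf{sb}|_{\mathsf{loc}}; [\mathsf{W}]; \mathsf{sb}) & \text{(preserved sb-loc)} \\
\mathsf{pbi} & \triangleq & \mathsf{deps} \cup \mathsf{addr}; \mathsf{sb} \cup [\mathsf{R}^{\sqsupseteq\mathsf{rlx}} \cup \mathsf{W}^{\sqsupseteq\mathsf{rel}} \cup \mathsf{F}]; \mathsf{sb} \cup \mathsf{psbloc} \cup \mathsf{sb}; [\mathsf{E}^{\sqsupseteq\mathsf{rel}}] & \text{(internal Power-before)} \\
\mathsf{pb} & \triangleq & \mathsf{pbi} \cup \mathsf{rfe} & \text{(Power-before)}
\end{array}
\]

Clearly, $\mathsf{pb} \subseteq \mathsf{sb} \cup \mathsf{rf}$, and so pb is acyclic in every RC11-consistent execution.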
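
To make the psbloc component concrete, here is a minimal sketch that reads its definition as (sb|loc; [R]) minus (sb|loc; [W]; sb), i.e., an sb edge to a same-location read with no intervening same-location write, together with a generic acyclicity check. It does not build the full pb relation. The function names, the dictionary `loc`, and the three-event example are assumptions made for illustration, and sb is assumed to be given as a transitive relation.

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def psbloc(sb, loc, reads, writes):
    """Same-location sb edges into a read with no same-location write between."""
    sb_loc = {(a, b) for (a, b) in sb if loc[a] == loc[b]}
    to_read = {(a, b) for (a, b) in sb_loc if b in reads}
    to_write = {(a, c) for (a, c) in sb_loc if c in writes}
    return to_read - compose(to_write, sb)


def is_acyclic(rel):
    """True iff the transitive closure of rel contains no pair (a, a)."""
    closure = set(rel)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            break
        closure |= extra
    return all(a != b for (a, b) in closure)


# Hypothetical single thread on location x: w1; w2; r (sb given transitively).
loc = {"w1": "x", "w2": "x", "r": "x"}
sb = {("w1", "w2"), ("w2", "r"), ("w1", "r")}
preserved = psbloc(sb, loc, reads={"r"}, writes={"w1", "w2"})
```

Here only the edge from `w2` to `r` is preserved, because `w2` sits between `w1` and `r` on the same location.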


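Proposition G.1. If $G$ is weakRC11-consistent, then $\mathsf{rf} \subseteq \mathsf{pb}$.

Proof. COHERENCE guarantees that $\mathsf{rfi} \subseteq \mathsf{psbloc} \subseteq \mathsf{pbi}$, and by definition we have $\mathsf{rfe} \subseteq \mathsf{pb}$.

Proposition G.2. For every weakRC11-consistent execution $G$, $\mathsf{hb} \subseteq \mathsf{sb} \cup \mathsf{pb}^{+}$.

Proof. It suffices to show that $\mathsf{sb}^{?}; \mathsf{swe}; \mathsf{sb}^{?} \subseteq \mathsf{pb}^{+}$. By definition, we have

\[
\mathsf{sb}^{?}; \mathsf{swe}; \mathsf{sb}^{?} \subseteq \mathsf{sb}^{?}; [\mathsf{E}^{\sqsupseteq\mathsf{rel}}]; \mathsf{sb}^{?}; (\mathsf{rf} \cup \mathsf{rmw})^{+}; [\mathsf{R}^{\sqsupseteq\mathsf{rlx}}]; \mathsf{sb}^{?}.
\]

The claim follows because we have:
• $\mathsf{sb}^{?}; [\mathsf{E}^{\sqsupseteq\mathsf{rel}}]; \mathsf{sb}^{?} \subseteq \mathsf{pbi}^{*}$.
• $\mathsf{rf} \subseteq \mathsf{pb}$ and $\mathsf{rmw} \subseteq \mathsf{deps} \subseteq \mathsf{pbi}$.
• $[\mathsf{R}^{\sqsupseteq\mathsf{rlx}}]; \mathsf{sb}^{?} \subseteq \mathsf{pbi}^{?}$.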

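Proposition G.3. If pb is acyclic, but $\mathsf{sb} \cup \mathsf{rf}$ is cyclic, then $(\mathsf{rfe}; [\mathsf{R}^{\mathsf{na}}] \setminus \mathsf{hb}); \mathsf{sb} \neq \emptyset$.

Proof. A cycle in $\mathsf{sb} \cup \mathsf{rf}$ implies a cycle in $\mathsf{rfe}; \mathsf{sb}$. Since $\mathsf{rfe}; [\mathsf{R}^{\sqsupseteq\mathsf{rlx}}]; \mathsf{sb}$ and $(\mathsf{rfe} \cap \mathsf{hb}); \mathsf{sb}$ are contained in $\mathsf{pb}^{+}$ (using Prop. G.2 for the latter), there must exist an edge $\langle a, b\rangle \in \mathsf{rfe}; \mathsf{sb}$ that is neither in $\mathsf{rfe}; [\mathsf{R}^{\sqsupseteq\mathsf{rlx}}]; \mathsf{sb}$ nor in $(\mathsf{rfe} \cap \mathsf{hb}); \mathsf{sb}$. Then, we have $\langle a, b\rangle \in (\mathsf{rfe}; [\mathsf{R}^{\mathsf{na}}] \setminus \mathsf{hb}); \mathsf{sb}$.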
Lemma G.1. Suppose that G is a weakRC11-consistent execution of a program P , and that pb is acyclic, but
G is not RC11-consistent. Then, P has undefined behavior under RC11.

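Proof. Since $G$ is weakRC11-consistent but not RC11-consistent, we have that $\mathsf{sb} \cup \mathsf{rf}$ is cyclic. By Prop. G.3, $\mathsf{rf}; [\mathsf{R}^{\mathsf{na}}] \not\subseteq \mathsf{hb}$. We show that this implies that $P$ has undefined behavior under RC11.
Let $a_1, \ldots, a_n$ be an enumeration of $\mathsf{E}$ that respects pb (that is, $i < j$ whenever $\langle a_i, a_j\rangle \in \mathsf{pb}^{+}$). For every $1 \leq i \leq n$, let $E_i = \{a_1, \ldots, a_i\}$. Let $k$ be the minimal index such that $[E_k]; \mathsf{rf}; [\mathsf{R}^{\mathsf{na}}]; [E_k] \not\subseteq \mathsf{hb}$. Then, we have $\langle a_j, a_k\rangle \in \mathsf{rf}; [\mathsf{R}^{\mathsf{na}}] \setminus \mathsf{hb}$ for some $j < k$. Let $B = \mathit{dom}(\mathsf{sb}^{?}; [E_k])$ and $H = B \setminus E_k$.

Claim 1: $h \in \mathsf{R}^{\mathsf{na}} \cup \mathsf{W}^{\sqsubseteq\mathsf{rlx}}$ for every $h \in H$.

Proof: Otherwise, since $[\mathsf{R}^{\sqsupseteq\mathsf{rlx}} \cup \mathsf{W}^{\sqsupseteq\mathsf{rel}} \cup \mathsf{F}]; \mathsf{sb} \subseteq \mathsf{pb}$, we would obtain $\langle h, a\rangle \in \mathsf{pb}$ for some $a \in E_k$. This contradicts the fact that $h \notin E_k$.

Claim 2: $\langle h, b\rangle \notin \mathsf{sb}^{?}$ for every $h \in H$ and $b \in B \cap \mathsf{E}^{\sqsupseteq\mathsf{rel}}$.

Proof: Suppose otherwise. Let $a \in E_k$ such that $\langle b, a\rangle \in \mathsf{sb}^{?}$. It follows that $\langle h, a\rangle \in \mathsf{sb}^{?}; [\mathsf{E}^{\sqsupseteq\mathsf{rel}}]; \mathsf{sb}^{?}$, and so $\langle h, a\rangle \in \mathsf{pb}^{*}$. Hence, $h \in E_k$ as well, which contradicts our assumption.

Claim 3: $\langle h, b\rangle \notin \mathsf{deps}^{*}; \mathsf{ctrl}$ for every $h \in H$ and $b \in B$.

Proof: Suppose otherwise. Let $a \in E_k$ such that $\langle b, a\rangle \in \mathsf{sb}^{?}$. Since $\mathsf{ctrl}; \mathsf{sb}^{?} \subseteq \mathsf{ctrl}$, it follows that $\langle h, a\rangle \in \mathsf{deps}^{+}$, and so $\langle h, a\rangle \in \mathsf{pb}^{+}$. This contradicts the fact that $h \notin E_k$.

Claim 4: $\langle h, b\rangle \notin \mathsf{deps}^{*}; \mathsf{addr}$ for every $h \in H$ and $b \in B$.

Proof: Suppose otherwise. Let $a \in E_k$ such that $\langle b, a\rangle \in \mathsf{sb}^{?}$. Then, $\langle h, a\rangle \in \mathsf{deps}^{*}; \mathsf{addr}; \mathsf{sb}^{?} \subseteq \mathsf{pb}^{+}$. This contradicts the fact that $h \notin E_k$.

Let $h_1, \ldots, h_m$ be an enumeration of $H$ that respects sb, and let $H_i = \{h_1, \ldots, h_i\}$ for every $0 \leq i \leq m$.

Claim 5: For every $1 \leq i \leq m$, $h_i \notin \mathit{dom}(\mathsf{deps}^{+}; [E_k \cup H_{i-1}])$.

Proof: Suppose otherwise, and let $a \in E_k \cup H_{i-1}$ such that $\langle h_i, a\rangle \in \mathsf{deps}^{+}$. Then, $\langle h_i, a\rangle \in \mathsf{pb}^{+}$. If $a \in E_k$, then $h_i \in E_k$ as well, which contradicts our assumption. Hence, we have $a \in H_{i-1}$. This contradicts the fact that the enumeration of the $h_i$'s respects sb.

Claim 6: Let $1 \leq i \leq m$, and let $x = \mathsf{loc}(h_i)$. Let $a \in (E_k \cup H_{i-1}) \cap \mathsf{R}_x$ and suppose that $\langle h_i, a\rangle \in \mathsf{sb}$. Then, $\langle h_i, a\rangle \in \mathsf{sb}; [(E_k \cup H_{i-1}) \cap \mathsf{W}_x]; \mathsf{sb}$.

Proof: Suppose otherwise. Let $i \leq j \leq m$ be the maximal index satisfying $h_j \in \mathsf{E}_x$, $\langle h_i, h_j\rangle \in \mathsf{sb}^{?}$ and $\langle h_j, a\rangle \in \mathsf{sb}$. Then, $\langle h_j, a\rangle \in \mathsf{psbloc}$, and so $\langle h_j, a\rangle \in \mathsf{pb}$. If $a \in E_k$, then $h_j \in E_k$ as well, which contradicts our assumption. Hence, we have $a \in H_{i-1}$. This contradicts the fact that the enumeration of the $h_i$'s respects sb.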

For every $1 \leq i \leq n$, let $G_i = G|_{E_i}$. Since $G.\mathsf{rf} \subseteq G.\mathsf{pb}$ (Prop. G.1), all the $G_i$'s are weakRC11-consistent. Additionally, $G_i.\mathsf{pb}$ is acyclic for every $1 \leq i \leq n$.

We inductively construct a sequence of labeling functions $\mathsf{lab}_0, \ldots, \mathsf{lab}_m : B \to \mathsf{Label}$ and executions $G'_0, \ldots, G'_m$ such that the following hold:
1. For every $0 \leq i \leq m$, $G'_i.\mathsf{E} = E_k \cup H_i$.
2. For every $0 \leq i \leq m$, $G'_i.\mathsf{lab} = \mathsf{lab}_i|_{G'_i.\mathsf{E}}$.
3. For every $0 \leq i \leq m$, $G'_i$ is RCna-consistent.
4. For every $0 \leq i \leq m$, $\langle a_j, a_k\rangle, \langle a_k, a_j\rangle \notin G'_i.\mathsf{hb}$.
5. For every $0 \leq i \leq m$, $\mathsf{lab}_i(G|_B)$ is an execution of $P$.
6. For every $0 \leq i \leq m$, $G.\mathsf{rmw}^{-1}; G'_i.\mathsf{rf}^{-1}; G'_i.\mathsf{rf}; G.\mathsf{rmw} \subseteq [G.\mathsf{E}]$.
Finally, we would obtain that $G'_m$ is a racy RCna-consistent execution with $G'_m.\mathsf{E} = B$, and $\mathsf{lab}_m(G|_B) = G'_m.\mathsf{lab}(G|_B)$ is an execution of $P$. Hence, $G'_m$ is an execution of $P$, and by Lemma D.1, $G'_m$ is RC11-consistent or $P$ has undefined behavior under RC11. Since $G'_m$ is racy, in any case we would obtain that $P$ has undefined behavior under RC11.
First, we define lab0 and G00 . The minimality of k and Prop. G.3 ensure that Gk1 is RC11-consistent. Hence,
Lemma D.4 ensures that there exists some event b Ek such that the execution G0 given by G0 .c = Gk .c
for every c {E, sb, rmw, data, addr, ctrl, mo}, G0 .lab = Gk .lab[ak 7 Rna (G.loc(ak ), G.valw (b))],
and G0 .rf = Gk .rf {hb, ak i} is RCna -consistent. In addition, ak 6 dom(G|B .deps) (since it is G.pb
maximal in G|B ). By Assumption B.1, there exists a reevaluation lab of G.lab such that lab(G|B ) is an
execution of P , lab(G|B ).valr (ak ) = G.valw (b), and lab(c) = G|B .lab(c) for every c B \ {ak }. We
take lab0 = lab and G00 = G0 . It is straightforward to see that lab and G0 satisfy the six conditions above. In
particular, G|B .rmw1 ; G0 .rf1 ; G0 .rf; G|B .rmw [G.E] follows from the fact that G satisfies ATOMICITY.
Additionally, by Prop. C.4, G0 .hb = Gk .hb, and so, we have haj , ak i, hak , aj i 6 G0 .hb.
Next, let 1 i m, and suppose that labi1 and G0i1 are defined. We construct labi and G0i . By Claim 1
above, we have hi G.Rna G.Wvrlx . Let Gi be the execution obtained from G0i1 by adding the event hi ,
labeled with labi1 (hi ), and the sb, rmw, and dependency edges from/to hi as in G|B . By Claim 2 above, we
also have hi 6 dom(Gi .sb; [Gi .Erel ]). Let x = G.loc(hi ), and consider the two cases:
hi G.Rna : Since G0i1 is RCna -consistent, Lemma D.4 ensures that there exists some
event b Ek Hi1 such that the execution G0 given by G0 .E = Ek Hi ,
G0 .lab = G0i1 .lab {hi 7 Rna (x, Gi .valw (b))}, G0 .c = Gi .c for every
0
c {sb, rmw, data, addr, ctrl, mo}, and G .rf = Gi .rf {hb, ak i} is RCna -consistent. In addi-
tion, by Claims 3 and 4 above, we have that hi 6 dom(G|B .deps ; (G|B .ctrl G|B .addr)). By
Assumption B.1, there exists a reevaluation lab of labi1 such that lab(G|B ) is an execution of P ,
lab(G|B ).valr (hi ) = G.valw (b), and lab(c) = labi1 (c) for every c such that hhi , ci 6 G|B .deps+ .
We take labi = lab and G0i = G0 . Again, it is straightforward to see that lab and G0 satisfy the required
conditions. In particular, G0i .lab = labi |G0i .E follows from the fact that G0i1 .lab = labi1 |Ek Hi1 , and
Claim 5 above. In addition, by Prop. C.5, we have [G0i1 .E]; G0 .hb; [G0i1 .E] = G0i1 .hb, and so, we have
haj , ak i, hak , aj i 6 G0 .hb.
hi G.Wvrlx : By Claim 6 above, we have that for every b Gi .E, if hhi , bi Gi .sb; [Gi .Rx ]
then ha, bi Gi .sb; [Gi .Wx ]; Gi .sb. Thus, since G0i1 is RCna -consistent, and
1 0 1 0
G.rmw ; Gi1 .rf ; Gi1 .rf; G.rmw [G.E], Lemmas D.2 and D.3 ensure that there exists
T Gi .Wx Gi .Wx such that the execution G0 given by G0 .E = Ek Hi , G0 .lab = labi1 |G0 .E ,
G0 .c = Gi .c for every c {sb, rmw, data, addr, ctrl}, G0 .rf = G0i1 .rf, and Gi .mo = G0i1 .mo T
is RCna -consistent. We take labi = labi1 and G0i = G0 . It is straightforward to see that labi1 and G0
satisfy the required conditions. In particular, Prop. C.3 guarantees that haj , ak i, hak , aj i 6 G0 .hb.

H. Proof of Compilation Correctness


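Lemma H.1. Let $G$ be an execution without SC accesses. Let $G_p$ be a Power execution. Suppose that the following hold:
• $G.\mathsf{R} = G_p.\mathsf{R}$, $G.\mathsf{W} = G_p.\mathsf{W}$, $G.\mathsf{sb} \subseteq G_p.\mathsf{sb}$, $G.\mathsf{rmw} = G_p.\mathsf{rmw}$, $G.\mathsf{rf} = G_p.\mathsf{rf}$, and $G.\mathsf{mo} = G_p.\mathsf{mo}$.
• $G.\mathsf{data} \subseteq G_p.\mathsf{data}$, $G.\mathsf{addr} \subseteq G_p.\mathsf{addr}$, and $G.\mathsf{ctrl} \subseteq G_p.\mathsf{ctrl}$.
• $G.\mathsf{rmw}; G.\mathsf{sb} \subseteq G_p.\mathsf{ctrl}$.
• $G.\mathsf{F}^{\neq\mathsf{sc}} \subseteq G_p.\mathsf{F}^{\mathsf{lwsync}}$ and $G.\mathsf{F}^{\mathsf{sc}} = G_p.\mathsf{F}^{\mathsf{sync}}$.
• $G.\mathsf{W}^{\mathsf{rel}} \subseteq A$ where $A = \{a \in G_p.\mathsf{W} \mid \exists b \in G_p.\mathsf{F}.\ \langle b, a\rangle \in G_p.\mathsf{sb}|_{\mathsf{imm}}; G_p.\mathsf{rmw}^{?}\}$.
• $[G.\mathsf{R}^{\mathsf{rlx}} \setminus G.\mathsf{At}]; G.\mathsf{sb} \subseteq G_p.\mathsf{ctrl}$.
• $[G.\mathsf{R}^{\mathsf{acq}}]; G.\mathsf{sb} \subseteq G_p.\mathsf{rmw}^{?}; G_p.\mathsf{ctrl}_{\mathsf{isync}}$.
Then:
• $G$ and $G_p$ have the same outcome.
• If $G_p$ is Power-consistent, then $G$ is weakRC11-consistent and $G.\mathsf{pb}$ is acyclic.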

Proof. The first claim easily follows from our definitions. Suppose that Gp is Power-consistent. Before proving
the second claim, we present some properties relating G and Gp .
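1. $G.\mathsf{swe}; G.\mathsf{sb}^{?} \subseteq (G_p.\mathsf{sb}^{?}; [G_p.\mathsf{F}]; G_p.\mathsf{sb} \cup [A]; G_p.\mathsf{moi}^{?}); G_p.\mathsf{rfe}; G_p.\mathsf{hb}^{*}; (G_p.\mathsf{sb}; [G_p.\mathsf{F}])^{?}$
(follows from the definition of sw)
2. $G.\mathsf{hb} \subseteq G_p.\mathsf{sb} \cup (G_p.\mathsf{sb}^{?}; [G_p.\mathsf{F}]; G_p.\mathsf{sb} \cup ([A] \cup G_p.\mathsf{rmw}); G_p.\mathsf{moi}^{?}); G_p.\mathsf{rfe}; G_p.\mathsf{hb}^{*}; (G_p.\mathsf{sb}; [G_p.\mathsf{F}])^{?}$
(follows from Item 1 using Prop. F.14; note that $G_p.\mathsf{sb}; [A] \subseteq G_p.\mathsf{sb}^{?}; [\mathsf{F}]; G_p.\mathsf{sb} \cup G_p.\mathsf{rmw}$)
3. $[G.\mathsf{RW}]; (G.\mathsf{sb} \setminus G.\mathsf{rmw}); G.\mathsf{hb}^{?} \subseteq G_p.\mathsf{sb} \cup G_p.\mathsf{fence}; G_p.\mathsf{hb}^{*}; (G_p.\mathsf{sb}; [G_p.\mathsf{F}])^{?}$
(again follows from Item 1 using Prop. F.14)
4. $[G.\mathsf{F}^{\mathsf{sc}}]; G.\mathsf{hb}; [G.\mathsf{RW}] \subseteq [G_p.\mathsf{F}^{\mathsf{sync}}]; G_p.\mathsf{sb}; G_p.\mathsf{hb}^{*}; [G_p.\mathsf{RW}]$
(easily follows from Item 2)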
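In addition, in order to apply Prop. C.2 in the proof below, we note that:
• $[G.\mathsf{W}]; G.\mathsf{sb}|_{\mathsf{loc}}; [G.\mathsf{W}] \subseteq G.\mathsf{mo}$: Indeed, we have $[G.\mathsf{W}]; G.\mathsf{sb}|_{\mathsf{loc}}; [G.\mathsf{W}] = [G_p.\mathsf{W}]; G_p.\mathsf{sb}|_{\mathsf{loc}}; [G_p.\mathsf{W}]$ and $G.\mathsf{mo} = G_p.\mathsf{mo}$, and the claim follows by Power's SC-PER-LOC condition.
• $G.\mathsf{rmw} \subseteq G.\mathsf{rb}$: Indeed, we have $G.\mathsf{rmw} = G_p.\mathsf{rmw}$ and $G.\mathsf{rb} = G_p.\mathsf{rb}$, and the claim follows by Power's SC-PER-LOC and the fact that $G$ is complete.
Next, we show that $G$ is weakRC11-consistent. Clearly, it is complete (since $G.\mathsf{R} = G_p.\mathsf{R}$ and $G.\mathsf{rf} = G_p.\mathsf{rf}$).
• COHERENCE. We show that $G.\mathsf{eco}^{?}; G.\mathsf{hb}$ is irreflexive. The irreflexivity of $G.\mathsf{hb}$ follows from Prop. F.14. Now, applying Prop. C.2, it suffices to show that $G.\mathsf{eco} \cup G.\mathsf{eco}; (G.\mathsf{sb} \setminus G.\mathsf{rmw}); G.\mathsf{hb}^{?}$ is irreflexive. First, $G.\mathsf{eco} = G_p.\mathsf{eco}$ is irreflexive because of SC-PER-LOC. Second, by property 3 above, we have $G.\mathsf{eco}; (G.\mathsf{sb} \setminus G.\mathsf{rmw}); G.\mathsf{hb}^{?}; [G.\mathsf{RW}] \subseteq G_p.\mathsf{eco}; (G_p.\mathsf{sb} \cup G_p.\mathsf{fence}; G_p.\mathsf{hb}^{*})$. By Prop. F.15, $G_p.\mathsf{eco}; (G_p.\mathsf{sb} \cup G_p.\mathsf{fence}; G_p.\mathsf{hb}^{*})$ is irreflexive.
• ATOMICITY. By Prop. F.8, we have $G_p.\mathsf{rmw} \cap (G_p.\mathsf{rb}; G_p.\mathsf{mo}) = \emptyset$. Then, $G.\mathsf{rmw} \cap (G.\mathsf{rb}; G.\mathsf{mo}) = \emptyset$ immediately follows since $G.\mathsf{rmw} = G_p.\mathsf{rmw}$, $G.\mathsf{rb} = G_p.\mathsf{rb}$, and $G.\mathsf{mo} = G_p.\mathsf{mo}$.
• SC. We show that $G.\mathsf{psc}$ is acyclic. Assuming no SC accesses, we have $G.\mathsf{psc} = R_1 \cup R_2$ where $R_1 = [G.\mathsf{F}^{\mathsf{sc}}]; G.\mathsf{hb}; G.\mathsf{eco}; G.\mathsf{hb}; [G.\mathsf{F}^{\mathsf{sc}}]$ and $R_2 = [G.\mathsf{F}^{\mathsf{sc}}]; G.\mathsf{hb}; [G.\mathsf{F}^{\mathsf{sc}}]$. Since $R_2$ is irreflexive and $R_2^{+}; R_1 \subseteq R_1$, it suffices to prove the acyclicity of $R_1$. To this end, we show that $G.\mathsf{eco}; G.\mathsf{hb}; [G.\mathsf{F}^{\mathsf{sc}}]; G.\mathsf{hb}; [G.\mathsf{RW}]$ is acyclic. Applying Prop. C.2, it suffices to show that $G.\mathsf{eco}; (G.\mathsf{sb} \setminus G.\mathsf{rmw}); G.\mathsf{hb}^{?}; [G.\mathsf{F}^{\mathsf{sc}}]; G.\mathsf{hb}; [G.\mathsf{RW}]$ is acyclic. Using properties 3-4 above (and applying several simple simplifications), it suffices to show that the following relation is acyclic:

\[
G_p.\mathsf{eco}; (G_p.\mathsf{fence}; G_p.\mathsf{hb}^{*})^{?}; G_p.\mathsf{sb}; [G_p.\mathsf{F}^{\mathsf{sync}}]; G_p.\mathsf{sb}; G_p.\mathsf{hb}^{*}; [G_p.\mathsf{RW}].
\]

Using the definition of sync, this relation is equal to:

\[
G_p.\mathsf{eco}; (G_p.\mathsf{fence}; G_p.\mathsf{hb}^{*})^{?}; G_p.\mathsf{sync}; G_p.\mathsf{hb}^{*}; [G_p.\mathsf{RW}].
\]

Its acyclicity then follows by Prop. F.13.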


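Next, we show that $G.\mathsf{pb}$ is acyclic. Suppose otherwise. Then, there are $a_1, \ldots, a_n$ such that $\langle a_i, a_{i+1}\rangle \in G.\mathsf{rfe}; G.\mathsf{pbi}^{+}$ for every $1 \leq i \leq n$ (where $a_{n+1} = a_1$). We show that $\langle a_i, a_{i+1}\rangle \in G_p.\mathsf{hb}^{+}$ for every $1 \leq i \leq n$ (which contradicts POWER-NO-THIN-AIR). Let $1 \leq i \leq n$, and let $b \in \mathsf{E}$ such that $\langle a_i, b\rangle \in G.\mathsf{rfe} = G_p.\mathsf{rfe}$ and $\langle b, a_{i+1}\rangle \in G.\mathsf{pbi}^{+}$. If $\langle b, a_{i+1}\rangle \in G_p.\mathsf{fence}$, then we are done since $G_p.\mathsf{rfe}, G_p.\mathsf{fence} \subseteq G_p.\mathsf{hb}$. Otherwise, it follows that $\langle b, a_{i+1}\rangle \in (G_p.\mathsf{deps} \cup G_p.\mathsf{addr}; G_p.\mathsf{sb} \cup G_p.\mathsf{psbloc})^{+}$. By Prop. F.6, we have $\langle a_i, a_{i+1}\rangle \in G_p.\mathsf{rfe}; G_p.\mathsf{ppo} \subseteq G_p.\mathsf{hb}^{+}$.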

Lemma H.2. Given a program P without SC accesses, every outcome of (|P |) under Power is an outcome of
P under RC11.

Proof. Given a full Power-consistent Power execution Gp of (|P |), the compilation scheme (see Fig. 9) ensures
that there exists some full execution G of P for which the properties of Lemma H.1 hold. Here we assumed
that all RMW write attempts (stwcx.) succeed in the first attempt. Indeed, otherwise, one could always remove
the RMW reads (lwarx) that precede the failed stwcx. attempts while preserving Power-consistency as well as
the outcome of the execution. Now, Lemma H.1 ensures that $G$ has the same outcome as $G_p$, $G$ is weakRC11-consistent, and $G.\mathsf{pb}$ is acyclic. By Lemma G.1, either $G.\mathsf{sb} \cup G.\mathsf{rf}$ is acyclic (and NO-THIN-AIR holds) or $P$
has undefined behavior under RC11. In any case, we obtain that the outcome of Gp is an outcome of P under
RC11.

I. Proofs for §7 (Correctness of Program Transformations)


In this appendix, we state (and outline the proofs of) the properties that ensure the soundness of the transformations discussed in §7. For this purpose, it is technically convenient to employ a different presentation of RMWs that treats them as single events (as in C11). To this end, we consider RMW-executions, defined as the executions in §3, with the following exceptions:
• Labels in RMW-executions may also be $\mathsf{RMW}^{o}(x, v_r, v_w)$ where $o \in \{\mathsf{rlx}, \mathsf{acq}, \mathsf{rel}, \mathsf{acqrel}, \mathsf{sc}\}$. Both sets $G.\mathsf{R}$ and $G.\mathsf{W}$ include all events $a$ with $\mathsf{typ}(a) = \mathsf{RMW}$, while $G.\mathsf{RMW}$ denotes the set of all events $a$ with $\mathsf{typ}(a) = \mathsf{RMW}$.
• RMW-executions do not include an rmw component.
RC11-consistency for RMW-executions is also defined as for executions, with the following exceptions:
• $G.\mathsf{rb} \triangleq \mathsf{rf}^{-1}; \mathsf{mo} \setminus [\mathsf{E}]$.
• Instead of ATOMICITY we now require:
\[
\mathsf{rf} \cap (\mathsf{mo}; \mathsf{mo}) = \emptyset. \qquad \text{(ATOMICITY-RMW)}
\]
The rest of the notions are defined for RMW-executions exactly as for executions above.
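
The ATOMICITY-RMW condition, rf ∩ (mo; mo) = ∅, says that no event reads from a write that is two or more mo-steps before it; for an RMW event this means it reads from its immediate mo-predecessor. A minimal sketch of the check, with hypothetical event names (`w0`, `w1`, and the RMW event `u`) and relations given as sets of pairs:

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def satisfies_atomicity_rmw(rf, mo):
    """True iff rf and mo; mo are disjoint."""
    return not (rf & compose(mo, mo))


rf = {("w0", "u")}  # the RMW event u reads from w0

# mo order w0 < u < w1: u directly follows the write it reads from.
mo_ok = {("w0", "u"), ("u", "w1"), ("w0", "w1")}

# mo order w0 < w1 < u: w1 intervenes between w0 and u.
mo_bad = {("w0", "w1"), ("w1", "u"), ("w0", "u")}

ok = satisfies_atomicity_rmw(rf, mo_ok)
bad = satisfies_atomicity_rmw(rf, mo_bad)
```

In the second ordering the pair (`w0`, `u`) lies in mo; mo, so the check fails.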
There exists a trivial one-to-one correspondence between executions according to §3 and RMW-executions (the latter are obtained by collapsing rmw edges to single RMW events).
Proposition I.1. Suppose that an execution $G$ corresponds to an RMW-execution $G_{\mathsf{RMW}}$. Then:
• $G$ is RC11-consistent iff $G_{\mathsf{RMW}}$ is RC11-consistent.
• $G$ is racy iff $G_{\mathsf{RMW}}$ is racy.

Using this correspondence, we may define and prove the correctness of transformations on RMW-executions.
Lemma I.1 (Strengthening). Let $G_{\mathsf{tgt}}$ be an RMW-execution, obtained from an RMW-execution $G_{\mathsf{src}}$ by strengthening some access/fence modes ($G_{\mathsf{src}}.\mathsf{mod}(a) \sqsubseteq G_{\mathsf{tgt}}.\mathsf{mod}(a)$ for every $a \in G_{\mathsf{src}}.\mathsf{E}$). Then:
• If $G_{\mathsf{tgt}}$ is RC11-consistent, then so is $G_{\mathsf{src}}$.
• If $G_{\mathsf{tgt}}$ is racy, then so is $G_{\mathsf{src}}$.

Proof. Easily follows from our definitions, because both properties are monotone with respect to the mode ordering.
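
The side condition of Lemma I.1 can be sketched as a check over the mode ordering. The sketch below assumes the usual ordering na ⊏ rlx ⊏ {acq, rel} ⊏ acqrel ⊏ sc, in which acq and rel are incomparable; this ordering is not restated in this appendix and is an assumption here, as are the names `_COVERS`, `mode_leq`, `is_strengthening`, and the example events.

```python
# Covering pairs of the assumed mode order: each mode maps to the modes
# immediately above it.
_COVERS = {
    "na": {"rlx"},
    "rlx": {"acq", "rel"},
    "acq": {"acqrel"},
    "rel": {"acqrel"},
    "acqrel": {"sc"},
    "sc": set(),
}


def mode_leq(m1, m2):
    """True iff m1 is below or equal to m2 in the assumed mode order."""
    if m1 == m2:
        return True
    return any(mode_leq(m, m2) for m in _COVERS[m1])


def is_strengthening(src_mod, tgt_mod):
    """True iff both maps cover the same events and no mode gets weaker."""
    return src_mod.keys() == tgt_mod.keys() and all(
        mode_leq(src_mod[a], tgt_mod[a]) for a in src_mod
    )


# Hypothetical mode maps for two events.
src = {"e1": "rlx", "e2": "acq"}
tgt = {"e1": "rel", "e2": "sc"}
```

With these maps, `is_strengthening(src, tgt)` holds, while swapping the arguments does not, since that direction weakens both events.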
Lemma I.2 (Sequentialization). Let $G_{\mathsf{tgt}}$ be an RMW-execution, and let $\langle a, b\rangle \in \mathsf{sb} \setminus \mathsf{sb}; \mathsf{sb}$. Let $G_{\mathsf{src}}$ be the RMW-execution obtained from $G_{\mathsf{tgt}}$ by removing the sb edge $\langle a, b\rangle$. Then:
• If $G_{\mathsf{tgt}}$ is RC11-consistent, then so is $G_{\mathsf{src}}$.
• If $G_{\mathsf{tgt}}$ is racy, then so is $G_{\mathsf{src}}$.

Proof. Easily follows from our definitions, because both properties are monotone with respect to sb.

Next, to state the soundness of deordering transformations, we use the following definition of adjacency.
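Definition I.1. Let $R$ be a strict partial order on a set $A$. A pair $\langle a, b\rangle \in A \times A$ is called $R$-adjacent if the following hold for every $c \in A$:
• If $\langle c, a\rangle \in R$ then $\langle c, b\rangle \in R$.
• If $\langle b, c\rangle \in R$ then $\langle a, c\rangle \in R$.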
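
The adjacency condition of Definition I.1 translates directly into a check over a finite relation. A minimal sketch, where the function name `is_adjacent` and the event names are hypothetical:

```python
def is_adjacent(rel, universe, a, b):
    """True iff (a, b) is rel-adjacent in the sense of Definition I.1.

    Every rel-predecessor of a must also precede b, and every
    rel-successor of b must also succeed a.
    """
    return all(
        ((c, a) not in rel or (c, b) in rel)
        and ((b, c) not in rel or (a, c) in rel)
        for c in universe
    )


events = {"p", "a", "b", "q"}

# p precedes both a and b, and q follows both: (a, b) is adjacent.
sb_adjacent = {("p", "a"), ("p", "b"), ("a", "q"), ("b", "q"), ("p", "q")}

# p precedes a but not b: (a, b) is not adjacent.
sb_not_adjacent = {("p", "a")}
```

Adjacency does not require a and b to be related by the order themselves, which is what allows the deordering lemmas to add the edge from a to b.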

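Lemma I.3 (Non-load-store deordering). Let $G_{\mathsf{tgt}}$ be an RMW-execution, and let $a, b \in G_{\mathsf{tgt}}.\mathsf{E}$ such that $\langle a, b\rangle$ is $G_{\mathsf{tgt}}.\mathsf{sb}$-adjacent. Let $G_{\mathsf{src}}$ be the RMW-execution obtained from $G_{\mathsf{tgt}}$ by adding an sb edge $\langle a, b\rangle$. Suppose that the labels of $a$ and $b$ form a deorderable pair according to Table 1, except for the load-store deorderable pairs (R; W, R; RMW, and RMW; W). Then:
• If $G_{\mathsf{tgt}}$ is RC11-consistent, then so is $G_{\mathsf{src}}$.
• If $G_{\mathsf{tgt}}$ is racy, then so is $G_{\mathsf{src}}$.

Proof. It is straightforward to verify that all components and derived relations in $G_{\mathsf{src}}$ are identical to those of $G_{\mathsf{tgt}}$ except for: $G_{\mathsf{src}}.\mathsf{sb} = G_{\mathsf{tgt}}.\mathsf{sb} \cup \{\langle a, b\rangle\}$ and $G_{\mathsf{src}}.\mathsf{hb} = G_{\mathsf{tgt}}.\mathsf{hb} \cup \{\langle a, b\rangle\}$. Then, the fact that $G_{\mathsf{src}}$ is RC11-consistent easily follows from the fact that $G_{\mathsf{tgt}}$ is RC11-consistent. In particular, since $a, b$ is not a load-store deorderable pair, assuming that $G_{\mathsf{tgt}}$ satisfies NO-THIN-AIR, we cannot have $\langle b, a\rangle \in (G_{\mathsf{src}}.\mathsf{sb} \cup G_{\mathsf{src}}.\mathsf{rf})^{+}$, so the additional sb edge $\langle a, b\rangle$ cannot close an $\mathsf{sb} \cup \mathsf{rf}$ cycle. Finally, since $G_{\mathsf{src}}.\mathsf{race} = G_{\mathsf{tgt}}.\mathsf{race}$, we have that $G_{\mathsf{src}}$ is racy if $G_{\mathsf{tgt}}$ is racy.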

Lemma I.4 (Load-store deordering). Let $G_{\mathsf{tgt}}$ be an RMW-execution, and let $a, b \in G_{\mathsf{tgt}}.\mathsf{E}$ such that $\langle a, b\rangle$ is $G_{\mathsf{tgt}}.\mathsf{sb}$-adjacent. Let $G_{\mathsf{src}}$ be the RMW-execution obtained from $G_{\mathsf{tgt}}$ by adding an sb edge $\langle a, b\rangle$. Suppose that the labels of $a$ and $b$ form a load-store deorderable pair (R; W, R; RMW, or RMW; W) according to Table 1. Then:
• If $G_{\mathsf{tgt}}$ is RC11-consistent, then $G_{\mathsf{src}}$ is weakRC11-consistent and $G_{\mathsf{src}}.\mathsf{pb}$ is acyclic.
• If $G_{\mathsf{tgt}}$ is racy, then so is $G_{\mathsf{src}}$.

Proof. The proof is similar to the proof of Lemma I.3. The fact that $G_{\mathsf{src}}$ is weakRC11-consistent follows from the fact that $G_{\mathsf{tgt}}$ is RC11-consistent. In addition, since $G_{\mathsf{src}}.\mathsf{pb} = G_{\mathsf{tgt}}.\mathsf{pb} \subseteq G_{\mathsf{tgt}}.\mathsf{sb} \cup G_{\mathsf{tgt}}.\mathsf{rf}$, assuming that $G_{\mathsf{tgt}}$ satisfies NO-THIN-AIR, we have that $G_{\mathsf{src}}.\mathsf{pb}$ is acyclic.

Using Lemma G.1, one obtains the soundness of load-store deordering according to Table 1.
Notation I.1. For a binary relation R on a set A and an element a A, we denote by Ra the set
{b A | hb, ai R}, and by Ra the set {b A | ha, bi R}.
Lemma I.5 (Read-read merging). Let Gtgt be an RC11-consistent RMW-execution. Let a R \ RMW, and let
a0 E such that ha0 , ai rf. Let b 6 E, and let Gsrc be the RMW-execution satisfying:
Gsrc .E = Gtgt .E ] {b}.
Gsrc .lab = Gtgt .lab {b 7 Gtgt .lab(a)}.
Gsrc .sb = Gtgt .sb {ha, bi} (Gtgt .sba {b}) ({b} Gtgt .sba ).
Gsrc .rf = Gtgt .rf {ha0 , bi}.
Gsrc .mo = Gtgt .mo.
Then, Gsrc is RC11-consistent, and it is racy if Gtgt is racy.

Proof. By definition, Gsrc is complete, and ATOMICITY- RMW holds (since Gsrc .mo = Gtgt .mo and
b 6 Gsrc .R \ Gsrc .RMW). It is also easy to see that we have:
Gsrc .eco = Gtgt .eco (Gtgt .ecoa {b}) ({b} Gtgt .ecoa ).
Gsrc .hb = Gtgt .hb {ha, bi} (Gtgt .hba {b}) ({b} Gtgt .hba ).
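Hence, $G_{\mathsf{src}}$ satisfies COHERENCE. To see that NO-THIN-AIR holds, note that if we had $\langle b, a\rangle \in (G_{\mathsf{src}}.\mathsf{sb} \cup G_{\mathsf{src}}.\mathsf{rf})^{+}$, then we would have $\langle a, a\rangle \in (G_{\mathsf{tgt}}.\mathsf{sb} \cup G_{\mathsf{tgt}}.\mathsf{rf})^{+}$; and, similarly, if we had $\langle b, a'\rangle \in (G_{\mathsf{src}}.\mathsf{sb} \cup G_{\mathsf{src}}.\mathsf{rf})^{+}$, then we would have $\langle a, a'\rangle \in (G_{\mathsf{tgt}}.\mathsf{sb} \cup G_{\mathsf{tgt}}.\mathsf{rf})^{+}$. It remains to show that $G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{base}} \cup G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{F}}$ is acyclic. First, note that we have $G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{F}} = G_{\mathsf{tgt}}.\mathsf{psc}_{\mathsf{F}}$. Now, if $G_{\mathsf{tgt}}.\mathsf{mod}(a) \neq \mathsf{sc}$, then we also have $G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{base}} = G_{\mathsf{tgt}}.\mathsf{psc}_{\mathsf{base}}$, and the claim follows since $G_{\mathsf{tgt}}$ satisfies SC. Otherwise, we have: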
Gsrc .pscbase = Gtgt .pscbase {ha, bi} (Gtgt .pscbase a {b}) ({b} Gtgt .pscbase a ).
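This implies that a $G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{base}} \cup G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{F}}$ cycle would imply a $G_{\mathsf{tgt}}.\mathsf{psc}_{\mathsf{base}} \cup G_{\mathsf{tgt}}.\mathsf{psc}_{\mathsf{F}}$ cycle. Finally, if $\langle c, b\rangle \in G_{\mathsf{src}}.\mathsf{race}$, then we have $\langle c, a\rangle \in G_{\mathsf{tgt}}.\mathsf{race}$.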
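
The construction of the source execution in Lemma I.5 can be sketched as follows, reading the two sets of Notation I.1 as the predecessors and the successors of an event under a relation. The dictionary-based execution encoding, the function names, and the example events are assumptions of this sketch rather than part of the formal development.

```python
def preds(rel, a):
    """All b with (b, a) in rel."""
    return {x for (x, y) in rel if y == a}


def succs(rel, a):
    """All b with (a, b) in rel."""
    return {y for (x, y) in rel if x == a}


def unmerge_read(g, a, b):
    """Build the source execution by adding a copy b of read a right after a.

    b gets a's label, a's sb-predecessors and sb-successors, and the same
    rf source as a; mo is unchanged.
    """
    assert b not in g["E"]
    (writer,) = preds(g["rf"], a)  # the unique write that a reads from
    return {
        "E": g["E"] | {b},
        "lab": {**g["lab"], b: g["lab"][a]},
        "sb": g["sb"]
        | {(a, b)}
        | {(c, b) for c in preds(g["sb"], a)}
        | {(b, c) for c in succs(g["sb"], a)},
        "rf": g["rf"] | {(writer, b)},
        "mo": set(g["mo"]),
    }


# Hypothetical target execution: a reads x from w, then c follows a in sb.
tgt = {
    "E": {"w", "a", "c"},
    "lab": {"w": ("W", "x", 1), "a": ("R", "x", 1), "c": ("W", "y", 0)},
    "sb": {("a", "c")},
    "rf": {("w", "a")},
    "mo": set(),
}
src = unmerge_read(tgt, "a", "b")
```

In the resulting `src`, the fresh read `b` sits between `a` and `c` in sb and reads from the same write `w` as `a`.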

Lemma I.6 (Write-write merging). Let Gtgt be an RC11-consistent RMW-execution. Let b W \ RMW, a 6 E,
and v Val. Let Gsrc be the RMW-execution satisfying:
Gsrc .E = Gtgt .E ] {a}.
Gsrc .lab = Gtgt .lab {a 7 WGtgt .mod(b) (Gtgt .loc(b), v)}.
Gsrc .sb = Gtgt .sb {ha, bi} (Gtgt .sbb {a}) ({a} Gtgt .sbb ).
Gsrc .rf = Gtgt .rf.
Gsrc .mo = Gtgt .mo {ha, bi} (Gtgt .mob {a}) ({a} Gtgt .mob ).
Then, Gsrc is RC11-consistent, and it is racy if Gtgt is racy.

Proof. By definition, Gsrc is complete. To see that ATOMICITY- RMW holds, note that we have
Gsrc .mo; Gsrc .mo; [RMW] Gtgt .mo; Gtgt .mo ({a} Gsrc .E), and that a has no outgoing rf edges. It is
also easy to see that we have:
Gsrc .eco = Gtgt .eco {ha, bi} (Gtgt .ecob {a}) ({a} Gtgt .ecob ).
Gsrc .hb = Gtgt .hb {ha, bi} (Gtgt .hbb {a}) ({a} Gtgt .hbb ).
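Hence, $G_{\mathsf{src}}$ satisfies COHERENCE. To see that NO-THIN-AIR holds, note that if we had $\langle b, a\rangle \in (G_{\mathsf{src}}.\mathsf{sb} \cup G_{\mathsf{src}}.\mathsf{rf})^{+}$, then we would have $\langle b, b\rangle \in (G_{\mathsf{tgt}}.\mathsf{sb} \cup G_{\mathsf{tgt}}.\mathsf{rf})^{+}$. It remains to show that $G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{base}} \cup G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{F}}$ is acyclic. First, note that we have $G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{F}} = G_{\mathsf{tgt}}.\mathsf{psc}_{\mathsf{F}}$. Now, if $G_{\mathsf{tgt}}.\mathsf{mod}(b) \neq \mathsf{sc}$, then we also have $G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{base}} = G_{\mathsf{tgt}}.\mathsf{psc}_{\mathsf{base}}$, and the claim follows since $G_{\mathsf{tgt}}$ satisfies SC. Otherwise, we have: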
Gsrc .pscbase = Gtgt .pscbase {ha, bi} (Gtgt .pscbase b {a}) ({a} Gtgt .pscbase b ).
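This implies that a $G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{base}} \cup G_{\mathsf{src}}.\mathsf{psc}_{\mathsf{F}}$ cycle would imply a $G_{\mathsf{tgt}}.\mathsf{psc}_{\mathsf{base}} \cup G_{\mathsf{tgt}}.\mathsf{psc}_{\mathsf{F}}$ cycle. Finally, if $\langle c, a\rangle \in G_{\mathsf{src}}.\mathsf{race}$, then we have $\langle c, b\rangle \in G_{\mathsf{tgt}}.\mathsf{race}$.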

Lemma I.7 (Write/RMW-read merging). Let Gtgt be an RC11-consistent RMW-execution. Let a W and b 6 E.
Let o Ord, such that:
If typ(a) = W and o = sc, then mod(a) = sc.
If typ(a) = RMW, then o v mod(a).
Let Gsrc be the RMW-execution satisfying:
Gsrc .E = Gtgt .E ] {b}.
Gsrc .lab = Gtgt .lab {b 7 Ro (Gtgt .loc(a), Gtgt .valw (a))}.
Gsrc .sb = Gtgt .sb {ha, bi} (Gtgt .sba {b}) ({b} Gtgt .sba ).
Gsrc .rf = Gtgt .rf {ha, bi}.
Gsrc .mo = Gtgt .mo.
Then, Gsrc is RC11-consistent, and it is racy if Gtgt is racy.

Proof. Similar to the proof of Lemma I.5.

Lemma I.8 (Write-RMW merging). Let Gtgt be an RC11-consistent RMW-execution. Let b W \ RMW, a 6 E,
v Val, and o Ord such that ow = mod(b). Let Gsrc be the RMW-execution satisfying:
Gsrc .E = Gtgt .E ] {a}.
Gsrc .lab = Gtgt .lab[b 7 RMWo (Gtgt .loc(b), v, Gtgt .valw (b))] {a 7 Wow (Gtgt .loc(b), v)}.
Gsrc .sb = Gtgt .sb {ha, bi} (Gtgt .sbb {a}) ({a} Gtgt .sbb ).
Gsrc .rf = Gtgt .rf {ha, bi}.
Gsrc .mo = Gtgt .mo {ha, bi} (Gtgt .mob {a}) ({a} Gtgt .mob ).
Then, Gsrc is RC11-consistent, and it is racy if Gtgt is racy.

Proof. By definition, Gsrc is complete. To see that ATOMICITY- RMW holds, note that we have
Gsrc .mo; Gsrc .mo; [RMW] Gtgt .mo; Gtgt .mo ({a} Gsrc .E) (Gsrc .E {b}), and that a has only an rf edge
to its immediate Gsrc .mo-successor b. The rest of the properties are proved as in the proof of Lemma I.6.

Lemma I.9 (RMW-RMW merging). Let Gtgt be an RC11-consistent RMW-execution. Let a E with
lab(a) = RMWo (x, vr , vw ). Let b 6 E and v Val, and let Gsrc be the RMW-execution satisfying:
Gsrc .E = Gtgt .E ] {b}.
Gsrc .lab = Gtgt .lab[a 7 RMWo (x, vr , v)] {b 7 RMWo (x, v, vw )}.
Gsrc .sb = Gtgt .sb {ha, bi} (Gtgt .sba {b}) ({b} Gtgt .sba ).
Gsrc .rf = Gtgt .rf {ha, bi}.
Gsrc .mo = Gtgt .mo {ha, bi} (Gtgt .moa {b}) ({b} Gtgt .moa ).
Then, Gsrc is RC11-consistent, and it is racy if Gtgt is racy.
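Lemma I.10 (Fence-fence merging). Let $G_{tgt}$ be an RC11-consistent RMW-execution. Let $a \in \mathsf{F}$, $b \notin \mathsf{E}$, and let
$G_{src}$ be the RMW-execution satisfying:
$G_{src}.\mathsf{E} = G_{tgt}.\mathsf{E} \uplus \{b\}$.
$G_{src}.lab = G_{tgt}.lab \cup \{b \mapsto G_{tgt}.lab(a)\}$.
$G_{src}.sb = G_{tgt}.sb \cup \{\langle a, b\rangle\} \cup (\{c \mid \langle c, a\rangle \in G_{tgt}.sb\} \times \{b\}) \cup (\{b\} \times \{c \mid \langle a, c\rangle \in G_{tgt}.sb\})$.
$G_{src}.rf = G_{tgt}.rf$.
$G_{src}.mo = G_{tgt}.mo$.
Then, $G_{src}$ is RC11-consistent, and it is racy if $G_{tgt}$ is racy.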

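Proof. By definition, $G_{src}$ is complete, and ATOMICITY-RMW holds since $G_{src}.rf = G_{tgt}.rf$ and
$G_{src}.mo = G_{tgt}.mo$. It is also easy to see that we have $G_{src}.eco = G_{tgt}.eco$ and:
$G_{src}.hb = G_{tgt}.hb \cup \{\langle a, b\rangle\} \cup (\{c \mid \langle c, a\rangle \in G_{tgt}.hb\} \times \{b\}) \cup (\{b\} \times \{c \mid \langle a, c\rangle \in G_{tgt}.hb\})$.
Hence, $G_{src}$ satisfies COHERENCE. To see that NO-THIN-AIR holds, note that if we had
$\langle b, a\rangle \in (G_{src}.sb \cup G_{src}.rf)^+$, then we would have $\langle a, a\rangle \in (G_{tgt}.sb \cup G_{tgt}.rf)^+$. It remains to show that $G_{src}.psc_{base}$
is acyclic. If $G_{tgt}.mod(a) \neq \mathtt{sc}$, then we have $G_{src}.psc_{base} = G_{tgt}.psc_{base}$ and $G_{src}.psc_F = G_{tgt}.psc_F$, and
the claim follows since $G_{tgt}$ satisfies SC. Otherwise, we have:
$G_{src}.psc_{base} = G_{tgt}.psc_{base} \cup (\{c \mid \langle c, a\rangle \in G_{tgt}.psc_{base}\} \times \{b\}) \cup (\{b\} \times \{c \mid \langle a, c\rangle \in G_{tgt}.psc_{base}\})$.
$G_{src}.psc_F = G_{tgt}.psc_F \cup \{\langle a, b\rangle\} \cup (\{c \mid \langle c, a\rangle \in G_{tgt}.psc_F\} \times \{b\}) \cup (\{b\} \times \{c \mid \langle a, c\rangle \in G_{tgt}.psc_F\})$.
This implies that a $G_{src}.psc_{base} \cup G_{src}.psc_F$ cycle would imply a $G_{tgt}.psc_{base} \cup G_{tgt}.psc_F$ cycle. Finally,
$G_{src}.race = G_{tgt}.race$, so $G_{src}$ is racy if $G_{tgt}$ is racy.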
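
The merging lemmas above all build the source sb relation by placing a fresh event $b$ immediately after an existing event $a$. As a rough illustration only, the following Python sketch models a relation as a set of pairs and performs that insertion; the helper name `insert_after` and the event identifiers are hypothetical and not part of the paper's development.

```python
def insert_after(rel, a, b):
    """Extend `rel` so that the fresh event `b` sits immediately after `a`.

    `rel` is a transitive relation given as a set of (x, y) pairs.
    """
    preds = {c for (c, d) in rel if d == a}   # events ordered before a
    succs = {d for (c, d) in rel if c == a}   # events ordered after a
    return (set(rel)
            | {(a, b)}
            | {(c, b) for c in preds}
            | {(b, d) for d in succs})


# Hypothetical target execution: one thread with events 1 -> 2 -> 3.
sb_tgt = {(1, 2), (2, 3), (1, 3)}
sb_src = insert_after(sb_tgt, 2, "b")
print(sorted(sb_src, key=str))
```

Every predecessor of `a` ends up before `b` and every successor of `a` ends up after it, so the result stays transitive whenever the input is.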

Soundness of register promotion is proved in two steps. First, we show that if all accesses to some location are
in one thread, then they can be safely weakened to non-atomic accesses. Second, we show that these non-atomic
accesses can be safely removed (replaced by register assignments at the program level).
Lemma I.11 (Register promotion-a). Let $G_{tgt}$ be an RC11-consistent RMW-execution. Suppose that all accesses
to some location $x$ are related by $G_{tgt}.sb$. Let $G_{src}$ be the RMW-execution obtained by strengthening the access
mode of all accesses to $x$ to $\mathtt{sc}$. Then, $G_{src}$ is RC11-consistent, and it is racy if $G_{tgt}$ is racy.

Proof. By definition, we have $G_{src}.c = G_{tgt}.c$ for $c \in \{sb, rf, mo, eco\}$. It is also easy to see that
$G_{src}.hb = G_{tgt}.hb$. Hence, $G_{src}$ is complete, and ATOMICITY, COHERENCE, and NO-THIN-AIR hold for $G_{src}$
since they hold for $G_{tgt}$. To see that $G_{src}.psc$ is acyclic, it suffices to note that $G_{src}.psc \subseteq G_{tgt}.psc \cup G_{tgt}.sb$
(acyclicity of $G_{tgt}.psc \cup G_{tgt}.sb$ follows from the acyclicity of $G_{tgt}.psc$ since $psc; sb; psc \subseteq psc^+$ in every
execution). Finally, if $\langle a, b\rangle \in G_{tgt}.race$ and $\mathtt{na} \in \{G_{tgt}.mod(a), G_{tgt}.mod(b)\}$, then the same holds in $G_{src}$:
we must have $loc(a) \neq x$ if $\langle a, b\rangle \notin G_{tgt}.hb \cup (G_{tgt}.hb)^{-1}$.
Lemma I.12 (Register promotion-b). Let $G_{tgt}$ be an RC11-consistent RMW-execution. Let $x \in \mathsf{Loc}$ and let
$X = \{b \in \mathsf{E} \mid loc(b) = x\}$. Suppose that all accesses in $X$ are related by $G_{tgt}.sb$. Let $a \notin \mathsf{E}$, and let $G_{src}$ be an
RMW-execution satisfying:
$G_{src}.\mathsf{E} = G_{tgt}.\mathsf{E} \uplus \{a\}$.
$G_{src}.lab = G_{tgt}.lab \cup \{a \mapsto L\}$ where $L$ is some access label with mode $\mathtt{na}$ and location $x$.
$G_{src}.sb \supseteq G_{tgt}.sb$ and every event in $X$ is $G_{src}.sb$-related to $a$.
Gsrc .rf = Gtgt .rf if Gsrc .typ(a) = W \ RMW,
and otherwise Gsrc .rf = Gtgt .rf {hmaxGsrc .sb Gsrc .sba , ai}.

Gsrc .mo = Gtgt .mo if Gsrc .typ(a) = R \ RMW,
and otherwise Gsrc .mo = Gtgt .mo (Gsrc .sba {a}) ({a} Gsrc .sba ).
Then, Gsrc is RC11-consistent, and it is racy if Gtgt is racy.

Proof. Easily follows from our definitions.

J. Proofs for §8 (Programming Guarantees)


Theorem 4. If in all SC-consistent executions of a program $P$, every race $\langle a, b\rangle$ has $mod(a) = mod(b) = \mathtt{sc}$,
then the outcomes of $P$ under RC11 coincide with those under SC.

Proof. Let $P$ be a program, and suppose that every race $\langle a, b\rangle$ in any SC-consistent execution of $P$ has
$mod(a) = mod(b) = \mathtt{sc}$. We prove that $P$ has no weak behaviors. Suppose toward a contradiction that there
exists an execution G of P that is RC11-consistent but not SC-consistent. (Note that if P has undefined behavior
under RC11, then there exists a racy RC11-consistent execution of P , and our assumption ensures that this
execution is not SC-consistent.)
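
To make "RC11-consistent but not SC-consistent" more concrete, here is a small sketch that tests SC-consistency of an execution graph. It assumes the formulation used later in this proof, namely that sb ∪ rf ∪ mo ∪ rb is acyclic with rb = rf⁻¹; mo, and it ignores the atomicity condition since the example has no RMWs. The event names and helper functions are hypothetical; the example is the familiar store-buffering shape.

```python
def reads_before(rf, mo):
    """rb = rf^{-1}; mo: a read is before every mo-later write than the one it reads."""
    return {(r, w2) for (w1, r) in rf for (w, w2) in mo if w == w1}


def is_acyclic(rel):
    """Depth-first search for a cycle in the graph given by the pairs in `rel`."""
    succ = {}
    for a, b in rel:
        succ.setdefault(a, set()).add(b)
    state = {}  # node -> "active" (on the DFS stack) or "done"

    def visit(n):
        state[n] = "active"
        for m in succ.get(n, ()):
            if state.get(m) == "active":
                return False
            if m not in state and not visit(m):
                return False
        state[n] = "done"
        return True

    return all(visit(n) for n in list(succ) if n not in state)


def is_sc_consistent(sb, rf, mo):
    return is_acyclic(sb | rf | mo | reads_before(rf, mo))


# Thread 1: w1 = W(x,1); r1 = R(y).  Thread 2: w2 = W(y,1); r2 = R(x).
# ix and iy are the initialization writes to x and y.
sb = {("w1", "r1"), ("w2", "r2")}
mo = {("ix", "w1"), ("iy", "w2")}
rf_weak = {("iy", "r1"), ("ix", "r2")}    # both reads see the initial value
rf_strong = {("w2", "r1"), ("ix", "r2")}  # r1 sees thread 2's write

print(is_sc_consistent(sb, rf_weak, mo))
print(is_sc_consistent(sb, rf_strong, mo))
```

In the weak execution the edges w1 → r1 → w2 → r2 → w1 (alternating sb and rb) form a cycle, which is exactly the kind of behavior the theorem rules out for programs whose races are all on SC accesses.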
We say that an execution $G'$ is a prefix of an execution $G$ if it is obtained by restricting $G$ to a set $E$ of events that
contains the set $E_0$ of initialization events, and is closed with respect to $G.sb \cup G.rf$ ($a \in E$ whenever $b \in E$
and $\langle a, b\rangle \in G.sb \cup G.rf$). It is easy to show that $G'$ is RC11-consistent, provided that $G$ is RC11-consistent.
Notation J.1. For an execution $G$, $G.rf|_{sc}$ denotes the restriction of $G.rf$ to SC accesses
($G.rf|_{sc} = [G.\mathsf{E}^{sc}]; G.rf; [G.\mathsf{E}^{sc}]$). A similar notation is used for $G.mo$ and $G.rb$.
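
The two notions just introduced, prefixes and restriction of a relation to a set of events, can be sketched in a few lines. This is only an illustration under the assumption that relations are sets of pairs; the function names and the three-event execution are hypothetical.

```python
def restrict(rel, events):
    """[events]; rel; [events]: keep edges whose endpoints are both in `events`."""
    return {(a, b) for (a, b) in rel if a in events and b in events}


def is_prefix_set(events, init_events, sb, rf):
    """True if `events` contains all initialization events and is closed
    under sb ∪ rf predecessors, i.e. it determines a prefix."""
    return init_events <= events and all(
        a in events for (a, b) in (sb | rf) if b in events
    )


# Hypothetical execution: initialization event i, a write w, and a read r of w.
init = {"i"}
sb = {("i", "w"), ("i", "r")}
rf = {("w", "r")}

print(is_prefix_set({"i", "w"}, init, sb, rf))
print(is_prefix_set({"i", "r"}, init, sb, rf))
print(restrict(rf, {"w"}))
```

The second set is rejected because it keeps the read `r` while dropping the write it reads from.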

For a set of events $E$, let $\Delta(E)$ denote the set of all pairs $\langle a, b\rangle \in E \times E$ of conflicting events, such that
$\{G.mod(a), G.mod(b)\} \neq \{\mathtt{sc}\}$ and $\langle a, b\rangle, \langle b, a\rangle \notin (G.sb \cup G.rf|_{sc})^+$. Let $a_1, \ldots, a_n$ be an enumeration of
$E \setminus E_0$ that respects $G.sb \cup G.rf$ (that is, $i < j$ whenever $\langle a_i, a_j\rangle \in G.sb \cup G.rf$). For every $1 \le i \le n$, let
$E_i = E_0 \cup \{a_1, \ldots, a_i\}$ and $G_i = G|_{E_i}$. Since the $G_i$'s are all prefixes of $G$, all of them are RC11-consistent.

Claim: For every $1 \le i \le n$, if $\Delta(E_i) = \emptyset$ then $G_i$ is SC-consistent.

Proof: Suppose that $\Delta(E_i) = \emptyset$. Since $G$ satisfies COHERENCE, it follows that:
$G_i.rf \subseteq (G.sb \cup G.rf|_{sc})^+$.
$G_i.mo \subseteq (G.sb \cup G.rf|_{sc})^+ \cup G.mo|_{sc}$.
$G_i.rb \subseteq (G.sb \cup G.rf|_{sc})^+ \cup G.rb|_{sc}$.
Hence, we have $G_i.sb \cup G_i.rf \cup G_i.mo \cup G_i.rb \subseteq R^+$, where $R = G.sb \cup G.rf|_{sc} \cup G.mo|_{sc} \cup G.rb|_{sc}$.
Since $G$ satisfies the SC condition, we have that $R$ is acyclic, and so $G_i$ is SC-consistent (ATOMICITY holds
since it holds for $G$).

Now, since $G$ is not SC-consistent, we have $\Delta(G.\mathsf{E}) \neq \emptyset$. Let $k = \min\{i \mid \Delta(E_i) \neq \emptyset\}$. Then,
$\Delta(E_{k-1}) = \emptyset$ (and so, $G_{k-1}$ is SC-consistent), and there exists some $j < k$, such that $a_j$ and $a_k$
are conflicting, $\{G.mod(a_j), G.mod(a_k)\} \neq \{\mathtt{sc}\}$, and $\langle a_j, a_k\rangle, \langle a_k, a_j\rangle \notin (G.sb \cup G.rf|_{sc})^+$. Let
$B = \{b \in E_k \mid \langle b, a_k\rangle \in G.sb\}$. Since $\langle a_j, a_k\rangle \notin (G.sb \cup G.rf|_{sc})^+$, and $G_{k-1}.rf \subseteq (G.sb \cup G.rf|_{sc})^+$,
we have $\langle a_j, b\rangle \notin (G.sb \cup G.rf)^+$ for every event $b \in B$. Let $x = loc(a_k)$, and consider two cases:
$typ(a_k) = \mathsf{W}$:

Claim: $\langle a_j, a_k\rangle \in G_k.race$.

Proof: Clearly, we have $\langle a_k, a_j\rangle \notin (G_k.sb \cup G_k.rf)^+$ ($a_k$ has no outgoing sb and rf edges in $G_k$).
In addition, we have $\langle a_j, a_k\rangle \notin (G_k.sb \cup G_k.rf)^+$ (otherwise, $\langle a_j, b\rangle \in (G.sb \cup G.rf)^+$ for some
$b \in B$).

Claim: $G_k$ is not SC-consistent.

Proof: Since $\langle a_j, a_k\rangle \in G_k.race$ and $\{G.mod(a_j), G.mod(a_k)\} \neq \{\mathtt{sc}\}$, the claim follows from our
assumption.

Claim: $a_k \notin G.\mathsf{At}$.

Proof: Suppose otherwise, and let $b \in G.\mathsf{E}$ such that $\langle b, a_k\rangle \in rmw$. Since $G_k$ is not SC-consistent, but
$G_{k-1}$ is SC-consistent, it must be the case that $\langle a_k, c\rangle \in G.mo$ and $\langle c, a_k\rangle \in (G.sb \cup G.rf \cup G.mo \cup G.rb)^+$
for some $c \in E_{k-1}$. Let $d \in E_{k-1}$ such that $\langle c, d\rangle \in (G.sb \cup G.rf \cup G.mo \cup G.rb)^*$ and
$\langle d, a_k\rangle \in G.sb \cup G.mo \cup G.rb$. Then, we also have $\langle c, d\rangle \in (G_{k-1}.sb \cup G_{k-1}.rf \cup G_{k-1}.mo \cup G_{k-1}.rb)^*$.
If $\langle d, a_k\rangle \in G.mo \cup G.rb$, then we obtain $\langle d, c\rangle \in G.mo \cup G.rb$, and so $\langle d, c\rangle \in G_{k-1}.mo \cup G_{k-1}.rb$, which
contradicts the fact that $G_{k-1}$ is SC-consistent. Otherwise, $\langle d, a_k\rangle \in G.sb$. It follows that $\langle d, b\rangle \in G.sb^?$.
Now, COHERENCE ensures that $G.rmw \subseteq G.rb$, and it follows that $\langle b, c\rangle \in G.rb$. Hence, $\langle b, c\rangle \in G_{k-1}.rb$,
which again contradicts the fact that $G_{k-1}$ is SC-consistent.

Let $G'_k$ be the extension of $G_{k-1}$ with the event $a_k$ (with the same label as in $G_k$), the sb edges of $G_k$, and
the mo edges $\{\langle a, a_k\rangle \mid a \in G_{k-1}.\mathsf{W}_x\}$. It is easy to see that $G'_k$ is SC-consistent as well (in particular, it
is important here that $a_k \notin G.\mathsf{At}$). Except for mo, it is identical to $G_k$, and so it is an execution of $P$ and
$\langle a_j, a_k\rangle \in G'_k.race$. Since $\{G.mod(a_j), G.mod(a_k)\} \neq \{\mathtt{sc}\}$, this contradicts our assumption.
$typ(a_k) = \mathsf{R}$:
In this case, we must have $typ(a_j) = \mathsf{W}$. Let

$E = \{a \in G.\mathsf{E} \mid \langle a, a_k\rangle \in (G.sb \cup G_{k-1}.rf)^* \lor \langle a, a_j\rangle \in (G.sb \cup G_{k-1}.rf)^*\}$.

Let $G'$ be the restriction of $G_k$ to the events in $E$. Since $G'|_{E \setminus \{a_k\}}$ is a prefix of $G_{k-1}$, it is SC-consistent.
Let $c = \max_{G.mo} G'.\mathsf{W}_x$, and consider two cases.
$c \neq a_j$:
Let $G''$ be the execution obtained from $G'$ by (i) modifying the value read at $a_k$ to $val_w(c)$, and (ii)
adding the reads-from edge $\langle c, a_k\rangle$. It is easy to see that $G''$ is SC-consistent, and Assumption B.1 ensures
that it is an execution of $P$. Additionally, $\langle a_j, a_k\rangle \notin (G''.sb \cup G''.rf)^+$ (there are no outgoing sb and
rf edges from $a_j$ in $G''$), and so, $\langle a_j, a_k\rangle \in G''.race$. Since $\{G.mod(a_j), G.mod(a_k)\} \neq \{\mathtt{sc}\}$, this
contradicts our assumption.
$c = a_j$:
Let $d$ be the immediate $G.mo$-predecessor of $c$, and let $G''$ be the execution obtained from $G'$ by (i)
modifying the value read at $a_k$ to $val_w(d)$, and (ii) adding the reads-from edge $\langle d, a_k\rangle$. Again, it is easy
to see that $G''$ is SC-consistent, and Assumption B.1 ensures that it is an execution of $P$. As in the
previous case we obtain a contradiction to our assumption.

Theorem 5. Let $G$ be an RC11-consistent execution. Suppose that for every two distinct shared locations $x$
and $y$, $[\mathsf{E}_x]; sb; [\mathsf{E}_y] \subseteq sb; [\mathsf{F}^{sc}]; sb$. Then, $G$ is SC-consistent.

Proof. It suffices to show that $sb \cup ecoe$ is acyclic (where $ecoe \triangleq eco \setminus sb$). Consider a cycle in
$sb \cup ecoe$ of a minimal length. Cycles with at most one $ecoe$ edge are ruled out by COHERENCE.
Hence, our cycle must have at least two $ecoe$ edges. Let $a_1, b_1, a_2, b_2, \ldots, a_n, b_n \in \mathsf{E}$ (where $n \ge 2$)
such that $\langle a_i, b_i\rangle \in ecoe$ and $\langle b_i, a_{i+1}\rangle \in sb$ for every $1 \le i \le n$ (where we take $a_{n+1}$ to be $a_1$).
The events $a_1, b_1, \ldots, a_n, b_n$ are all accesses to shared locations (since $\langle a_i, b_i\rangle \in ecoe \subseteq {=}_{loc}$ for every
$1 \le i \le n$). In addition, we have $loc(b_i) \neq loc(a_{i+1})$ for every $1 \le i \le n$ (otherwise we would have
$\langle a_i, a_{i+1}\rangle \in ecoe; sb|_{loc} \subseteq ecoe$, which contradicts the minimality of the cycle). Therefore, our assumption
entails that there exist $f_1, \ldots, f_n \in \mathsf{F}^{sc}$ such that $\langle b_i, f_i\rangle \in sb$ and $\langle f_i, a_{i+1}\rangle \in sb$ for every $1 \le i \le n$. It
follows that $\langle f_i, f_{i+1}\rangle \in [\mathsf{F}^{sc}]; sb; ecoe; sb; [\mathsf{F}^{sc}] \subseteq psc_F$ for every $1 \le i \le n$ (where we take $f_{n+1}$ to be $f_1$).
This contradicts the fact that $G$ satisfies the SC constraint.
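
The premise of Theorem 5 says that any two sb-ordered accesses to distinct shared locations have an SC fence between them. The sketch below checks that premise on a finite execution, under the simplifying assumption that every location appearing in the (hypothetical) `loc` map is shared; the function name and the event identifiers are illustrative only.

```python
def fences_separate_locations(sb, loc, sc_fences):
    """Check [E_x]; sb; [E_y] ⊆ sb; [F^sc]; sb for all distinct locations x, y.

    `sb` is a set of pairs, `loc` maps access events to locations (fences are
    absent from the map), and `sc_fences` is the set of SC fence events.
    """
    for (a, b) in sb:
        la, lb = loc.get(a), loc.get(b)
        if la is None or lb is None or la == lb:
            continue  # not a pair of accesses to two distinct locations
        if not any((a, f) in sb and (f, b) in sb for f in sc_fences):
            return False
    return True


# One thread of store buffering: write x, SC fence, read y.
loc = {"w": "x", "r": "y"}
sb_fenced = {("w", "f"), ("f", "r"), ("w", "r")}
sb_unfenced = {("w", "r")}

print(fences_separate_locations(sb_fenced, loc, {"f"}))
print(fences_separate_locations(sb_unfenced, loc, set()))
```

Only the fenced variant satisfies the premise, which matches the intuition that a fence between the store and the load is what excludes the weak store-buffering outcome.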

K. Proofs for §4 (Compilation to x86-TSO)
The following proposition is useful in our proof below:
Proposition K.1. The following hold in every TSO-consistent TSO execution:
$rb; mo; hb^?; rfe; hb^?$ is irreflexive.
$rb; mo; hb^?; [\mathsf{RMW} \cup \mathsf{F}]; hb^?$ is irreflexive.
The relation $T = [\mathsf{R}]; hb \cup hb^?; rfe; hb^? \cup hb; [\mathsf{F}] \cup [\mathsf{F}]; hb \cup mo \cup rb$ is acyclic.

Proof. The first two are straightforward. We prove the third claim. Consider a cycle $\langle a_1, \ldots, a_n\rangle$ in $T$ with
a minimal number of events $a_i \in \mathsf{W} \cup \mathsf{F}$. The minimality of the cycle entails that at most two events in
$\mathsf{W} \cup \mathsf{F}$ participate in this cycle (otherwise, it can be shortened since $mo$ is total on $\mathsf{W} \cup \mathsf{F}$). Since $hb$ and $rb; hb$
are irreflexive, there must be at least two such events in the cycle. Hence, we have exactly two indices
$1 \le i < j \le n$ such that $a_i, a_j \in \mathsf{W} \cup \mathsf{F}$. W.l.o.g., we may assume that $\langle a_i, a_j\rangle \in mo$. Since the rest of the events
are not in $\mathsf{W} \cup \mathsf{F}$, and $mo; hb$ is irreflexive, we obtain that $\langle a_j, a_i\rangle \in ([\mathsf{R}]; hb \cup hb^?; rfe; hb^? \cup hb; [\mathsf{F}]; hb^?)^+; rb$.
Since $[\mathsf{W}]; ([\mathsf{R}]; hb \cup hb^?; rfe; hb^? \cup hb; [\mathsf{F}] \cup [\mathsf{F}]; hb)^+ \subseteq hb^?; rfe; hb^? \cup hb; [\mathsf{F}]; hb^?$, we obtain that
$\langle a_j, a_i\rangle \in (hb^?; rfe; hb^? \cup hb; [\mathsf{F}]; hb^?); rb$. This contradicts the previous claims.

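Lemma K.1. Let $G$ be an RMW-execution satisfying $G.\mathsf{F}^{\neq sc} = \emptyset$, $G.mod(a) \sqsupset \mathtt{rlx}$ for every $a \in \mathsf{E}$, and
$G.mod(a) \sqsupset \mathtt{acqrel}$ for every $a \in \mathsf{RMW}$. Let $G_t$ be a TSO execution. Suppose that there exists an injective
function $f : (G.\mathsf{W}^{sc} \setminus G.\mathsf{RMW}) \to \mathbb{N}$ assigning a fresh event $f(a) \notin G.\mathsf{E}$ to every $a \in G.\mathsf{W}^{sc} \setminus G.\mathsf{RMW}$, such that
the following hold:
$G_t.\mathsf{E} = G.\mathsf{E} \cup f[G.\mathsf{W}^{sc} \setminus G.\mathsf{RMW}]$.
$G_t.lab(a) = \mathsf{R}(G.loc(a), G.val_r(a))$ for every $a \in G.\mathsf{R} \setminus G.\mathsf{RMW}$.
$G_t.lab(a) = \mathsf{W}(G.loc(a), G.val_w(a))$ for every $a \in G.\mathsf{W} \setminus G.\mathsf{RMW}$.
$G_t.lab(a) = \mathsf{RMW}(G.loc(a), G.val_r(a), G.val_w(a))$ for every $a \in G.\mathsf{RMW}$.
$G_t.lab(a) = \mathsf{F}$ for every $a \in G.\mathsf{F}^{sc} \cup f[G.\mathsf{W}^{sc} \setminus G.\mathsf{RMW}]$.
$G_t.sb = G.sb \cup \{\langle b, f(a)\rangle \mid \langle b, a\rangle \in G.sb^?; [G.\mathsf{W}^{sc} \setminus G.\mathsf{RMW}]\} \cup \{\langle f(a), b\rangle \mid \langle a, b\rangle \in [G.\mathsf{W}^{sc} \setminus G.\mathsf{RMW}]; G.sb\}$.
$G.rf = G_t.rf$.
$G.mo = \bigcup_{x \in \mathsf{Loc}} [G_t.\mathsf{W}_x]; G_t.mo; [G_t.\mathsf{W}_x]$.
Then:
$G$ and $G_t$ have the same outcome.
If $G_t$ is TSO-consistent, then $G$ is RC11-consistent.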

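Proof. The first claim easily follows from our definitions. Suppose that $G_t$ is TSO-consistent. We show that $G$
is RC11-consistent. Clearly, it is complete (since $G.\mathsf{R} = G_t.\mathsf{R}$ and $G.rf = G_t.rf$).
COHERENCE. Easily follows using Prop. 1 from the fact that $G_t.hb$, $G_t.mo; G_t.hb$ and $G_t.rb; G_t.hb$ are all
irreflexive (note that $G.rf \cup G.hb \subseteq G_t.hb$, $G.mo \subseteq G_t.mo$, and $G.rb \subseteq G_t.rb$).
ATOMICITY-RMW. Trivially follows from the fact that $G_t.rb; G_t.mo$ is irreflexive.
SC. Let $psc'_{base} = ([G.\mathsf{E}^{sc}] \cup [G.\mathsf{F}^{sc}]; G.hb^?); (G.hb \cup G.mo \cup G.rb); ([G.\mathsf{E}^{sc}] \cup G.hb^?; [G.\mathsf{F}^{sc}])$. We show
that $psc'_{base} \cup G.psc_F \subseteq T^+$, where $T$ is the relation defined in Prop. K.1. This implies that SC holds (as
well as that Batty et al.'s [5] condition holds). To prove $psc'_{base} \cup G.psc_F \subseteq T^+$, note that the following
hold (in $G$):
$[\mathsf{F}]; (psc'_{base} \cup psc_F); [\mathsf{F}] \subseteq psc_F \subseteq [\mathsf{F}]; hb \cup [\mathsf{F}]; hb; eco; hb; [\mathsf{F}] \subseteq ([\mathsf{F}]; (sb \cup rf)^+ \cup (sb \cup rf)^+; [\mathsf{F}] \cup mo \cup rb)^+$.
$[\mathsf{F}]; (psc'_{base} \cup psc_F); [\mathsf{R} \cup \mathsf{W}] = [\mathsf{F}]; psc'_{base}; [\mathsf{R} \cup \mathsf{W}] \subseteq [\mathsf{F}]; hb; (mo \cup rb)^?$.
$[\mathsf{R} \cup \mathsf{W}]; (psc'_{base} \cup psc_F); [\mathsf{F}] = [\mathsf{R} \cup \mathsf{W}]; psc'_{base}; [\mathsf{F}] \subseteq [\mathsf{R} \cup \mathsf{W}]; (hb \cup mo \cup rb); hb^?; [\mathsf{F}] \subseteq (mo \cup rb)^?; hb; [\mathsf{F}]$.
$[\mathsf{R}]; (psc'_{base} \cup psc_F); [\mathsf{R} \cup \mathsf{W}] = [\mathsf{R}]; psc'_{base}; [\mathsf{R} \cup \mathsf{W}] \subseteq [\mathsf{R}]; hb \cup mo \cup rb$.
$[\mathsf{W}]; (psc'_{base} \cup psc_F); [\mathsf{R} \cup \mathsf{W}] = [\mathsf{W}^{sc}]; (hb \cup mo \cup rb); [\mathsf{R}^{sc} \cup \mathsf{W}^{sc}] \subseteq [\mathsf{W}^{sc}]; sb; [\mathsf{R}^{sc}] \cup (sb \cup rf)^*; rfe; (sb \cup rf)^* \cup mo \cup rb$.
Using these facts, since we have $G.\mathsf{F} \subseteq G_t.\mathsf{F}$, $G.\mathsf{R} \subseteq G_t.\mathsf{R}$, $G.hb \subseteq G_t.hb$,
$(G.sb \cup G.rf)^+ \subseteq G_t.hb$, $G.mo \subseteq G_t.mo$, $G.rb \subseteq G_t.rb$, $G.rfe \subseteq G_t.rfe$, and
$[G.\mathsf{W}^{sc}]; G.sb; [G.\mathsf{R}^{sc}] \subseteq G_t.sb; [G_t.\mathsf{F}]; G_t.sb \subseteq G_t.hb; [G_t.\mathsf{F}]; G_t.hb$, it immediately follows that
$psc'_{base} \cup G.psc_F \subseteq T^+$.
NO-THIN-AIR. Trivially follows from the facts that $G.sb \cup G.rf \subseteq G_t.hb$, and $G_t.hb$ is irreflexive.
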
Lemma K.2. Let $G$ be an RMW-execution, such that $mod(a) \sqsupset \mathtt{rlx}$ for every $a \in G.\mathsf{E}$, and $mod(a) \sqsupseteq \mathtt{acqrel}$
for every $a \in G.\mathsf{RMW}$. Let $G'$ be the RMW-execution obtained from $G$ by removing all the non-SC fences (that
is: $G'.\mathsf{E} = G.\mathsf{E} \setminus G.\mathsf{F}^{\neq sc}$, $G'.sb = [G'.\mathsf{E}]; G.sb; [G'.\mathsf{E}]$, $G'.rf = G.rf$, and $G'.mo = G.mo$). Then, $G$ and $G'$
have the same outcome, and if $G'$ is RC11-consistent then so is $G$.

Proof. The conditions on the modes of accesses imply that $[G'.\mathsf{E}]; G.hb; [G'.\mathsf{E}] = G'.hb$. Then, the
RC11-consistency of $G'$ trivially implies the RC11-consistency of $G$.

Theorem 1. For a program P , denote by (|P |) the TSO program obtained by compiling P using the scheme in
Fig. 8. Then, given a program P , every outcome of (|P |) under TSO is an outcome of P under RC11.

Proof. First, let $P'$ be the program obtained from $P$ by (i) strengthening all read/write accesses in $P$ to be at
least release/acquire ones, (ii) strengthening all RMWs to be acquire-release RMWs, and (iii) omitting all non-SC fences.
Note that $(|P|) = (|P'|)$, and by Lemmas I.1 and K.2, every outcome of $P'$ under RC11 is an outcome of $P$
under RC11. Hence, it suffices to show that every outcome of $(|P'|)$ under TSO is an outcome of $P'$ under
RC11. Given a full TSO-consistent TSO execution $G_t$ of $(|P'|)$, the compilation scheme ensures that there
exists some full execution $G$ of $P'$ for which the properties of Lemma K.1 hold. The claim then follows by
Lemma K.1.

K.1 Alternative proof of "fences before SC reads" correctness


Here, we follow a different, simpler approach utilizing the recent result of Lahav and Vafeiadis [19]. That
result provides an alternative characterization of the TSO memory model, in terms of program transformations
(or compiler optimizations). They show that every weak behavior of TSO can be explained by a sequence
of:
load-after-store reorderings
(e.g., MOV [x] 1; MOV r [y] ⇝ MOV r [y]; MOV [x] 1); and
load-after-store eliminations
(e.g., MOV [x] 1; MOV r [x] ⇝ MOV [x] 1; MOV r 1).
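
As a rough illustration of this characterization, the sketch below enumerates all SC interleavings of the two-thread store-buffering program and shows that the outcome r1 = r2 = 0, which TSO allows, becomes reachable under SC once a single load-after-store reordering is applied to one thread. The instruction encoding and function names are hypothetical simplifications: stores write constants, loads read into registers, and memory starts at 0.

```python
def interleavings(t1, t2):
    """Yield every merge of two instruction lists that keeps each thread's order."""
    if not t1:
        yield list(t2)
        return
    if not t2:
        yield list(t1)
        return
    for rest in interleavings(t1[1:], t2):
        yield [t1[0]] + rest
    for rest in interleavings(t1, t2[1:]):
        yield [t2[0]] + rest


def sc_outcomes(t1, t2):
    """Set of (r1, r2) values reachable under sequentially consistent execution."""
    outcomes = set()
    for run in interleavings(t1, t2):
        mem, regs = {"x": 0, "y": 0}, {}
        for op, dst, src in run:
            if op == "store":
                mem[dst] = src        # MOV [dst] <- constant
            else:
                regs[dst] = mem[src]  # MOV dst <- [src]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes


t1 = [("store", "x", 1), ("load", "r1", "y")]
t2 = [("store", "y", 1), ("load", "r2", "x")]
t1_reordered = [t1[1], t1[0]]  # load-after-store reordering applied to thread 1

print((0, 0) in sc_outcomes(t1, t2))
print((0, 0) in sc_outcomes(t1_reordered, t2))
```

The first check fails and the second succeeds, so the weak outcome of the original program is an SC outcome of the transformed one.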
They further outline an application of this characterization to prove compilation correctness, which we follow
here. Accordingly, we have to meet three conditions:
1. Every outcome of the compiled program under SC is an outcome of the source program under RC11. This
trivially holds, since obviously RC11 is weaker than SC (even if arbitrary fences are added to the source).
2. Every store-load reordering that can be applied on the compiled program corresponds to a transformation
on the source program that is sound under RC11. Indeed, the compilation scheme ensures that adjacent load
after store in the compiled program (|P |) correspond to adjacent read after non-SC write in the source P .
These can be soundly reordered under RC11 (see §7), resulting in a program $P'$ whose compilation $(|P'|)$ is
identical to the reordered $(|P|)$.
3. Every load-after-store elimination that can be applied on the compiled program corresponds to a
transformation on the source program that is sound under RC11. Again, the compilation scheme ensures
that a load adjacently after a store in the compiled program (|P |) corresponds to an adjacent non-SC read
after a write in the source P . The read can be soundly eliminated under RC11 (see §7).
Note that this simple argument cannot be applied for the compilation scheme that places fences after SC writes,
since a load adjacently after a store in the compiled program (|P |) corresponds in this case to an adjacent
read after a non-SC write in the source P . However, if the read is SC but the write is not SC, it is unsound to
eliminate the read under RC11 (see Remark 6).
