OPTIMIZATION
DCC 888
The material in these slides has been taken from the "Dragon Book", Secs. 8.5 to 8.7, by Aho et al.
[Figure: the structure of an optimizing compiler. Front ends for source languages such as Fortran, COBOL, and Lisp feed an optimizer, which performs machine-independent optimizations, such as constant propagation; a back end then performs machine-dependent optimizations, such as register allocation, and emits code for targets such as PowerPC, x86, and ARM.]
Basic Blocks
Basic blocks are maximal sequences of consecutive instructions with the following properties:
The flow of control can only enter the basic block through its first instruction; there are no jumps into the middle of the block.
Control leaves the block without halting or branching, except possibly at its last instruction.
Example (LLVM IR):
%8 = load i32* %2, align 4
store i32 %8, i32* %sum, align 4
%9 = load i32* %sum, align 4
%10 = add nsw i32 %9, 1
store i32 %10, i32* %sum, align 4
Universidade Federal de Minas Gerais Department of Computer Science Programming Languages Laboratory
LOCAL OPTIMIZATIONS
DCC 888
Local Optimization
Code optimization techniques that work in the scope of a basic block are called local optimizations.
DAG-based optimizations
Peephole optimizations
Local register allocation
DAG-Based Optimizations
Some of the earliest code optimizations performed by compilers relied on a directed acyclic graph (DAG) representation of the instructions in the basic block. These DAGs were constructed as follows:
There is a node in the DAG for each input value appearing in the basic block.
There is a node associated with each instruction in the basic block.
If instruction S uses variables defined in statements S1, ..., Sn, then we have edges from each Si to S.
If a variable is defined in the basic block, but is not used inside it, then we mark it as an output value.
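The construction rules above can be sketched in a few lines. This is a toy model, not a real compiler IR: instructions are (dest, op, arg1, arg2) tuples and nodes are plain tuples.

```python
# A sketch of DAG construction for a basic block, following the rules in
# the text: one node per input value, one node per instruction, edges from
# the nodes an instruction uses, and output values marked at the end.

def build_dag(block):
    current = {}     # variable -> index of the node that currently defines it
    nodes = []       # node: (op, operand node indices, variable it defines)
    used = set()     # indices of nodes used as operands
    defined = set()  # variables defined inside the block
    def node_for(var):
        if var not in current:               # first use: an input value
            nodes.append(("in", (), var))
            current[var] = len(nodes) - 1
        return current[var]
    for dest, op, a1, a2 in block:
        kids = (node_for(a1), node_for(a2))
        used.update(kids)
        nodes.append((op, kids, dest))
        current[dest] = len(nodes) - 1       # dest now names this node
        defined.add(dest)
    # defined in the block but never used inside it: output values
    outputs = {v for v in defined if current[v] not in used}
    return nodes, outputs

block = [("b", "-", "a", "d"),
         ("c", "+", "b", "c"),
         ("d", "-", "a", "d")]
nodes, outputs = build_dag(block)
print(sorted(outputs))  # ['c', 'd']
```

Note that the nodes for instructions 2 and 4 end up with the same operator and the same children: this is exactly the redundancy that value numbers, discussed next, detect.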
Consider the following basic block:
2: b = a - d
3: c = b + c
4: d = a - d
[Figure: the DAG for this block, with nodes (+, a), (-, b), (+, c) and (-, d). The slides ask: could you optimize this DAG in particular? How can we check this property? Note that instructions 2 and 4 compute the same expression, a - d.]
Value Numbers
We can associate each node of the DAG with a value number. Value numbers allow us to refer to nodes in our DAG by their semantics, instead of by their textual program representation.
[Figure: the same DAG, with the question: what is the result of optimizing this DAG?]
Value-Number Table (for the block a = b + c; b = d - a; c = b + c; d = d - a):

Number  Definition    Key
1       b (input)     (in, _)
2       c (input)     (in, _)
3       d (input)     (in, _)
4       a = b + c     (+, 1, 2)
5       b = d - a     (-, 3, 4)
6       c = b + c     (+, 5, 2)
7       d = d - a     (-, 3, 4)

Nodes 5 and 7 have the same key, (-, 3, 4): in the optimized DAG, d reuses the node that defines b.
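Local value numbering can be sketched as follows: each node is identified by a key (operator, value number of first operand, value number of second operand); if the key is already in the table, the instruction recomputes a known value and is redundant. The tuple-based instruction format is our simplification.

```python
# A sketch of local value numbering with redundancy detection.

def value_number(block):
    table = {}       # key (op, vn1, vn2) -> value number
    var_vn = {}      # variable -> its current value number
    redundant = []   # instructions whose value was already computed
    counter = [0]
    def fresh(key):
        counter[0] += 1
        table[key] = counter[0]
        return counter[0]
    def vn_of(var):
        if var not in var_vn:                # inputs get their own numbers
            var_vn[var] = fresh(("in", var))
        return var_vn[var]
    for dest, op, a1, a2 in block:
        key = (op, vn_of(a1), vn_of(a2))
        if key in table:                     # same value computed before:
            redundant.append((dest, key))    # dest can become a copy
            var_vn[dest] = table[key]
        else:
            var_vn[dest] = fresh(key)
    return var_vn, redundant

block = [("a", "+", "b", "c"),
         ("b", "-", "d", "a"),
         ("c", "+", "b", "c"),
         ("d", "-", "d", "a")]
var_vn, redundant = value_number(block)
print(redundant)  # "d = d - a" recomputes the value of "b = d - a"
```

Because d receives the same value number as b, the last instruction can be replaced by the copy d = b, exactly as in the table above.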
Value Numbers
Dead code elimination: which nodes can be safely removed from the DAG? A node is dead if it has no parents in the DAG and does not define an output value; removing a dead node may expose new dead nodes, so the process repeats until it stabilizes.
[Figure: the original DAG next to the optimized DAG, after redundant and dead nodes have been removed.]
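The dead-code elimination question can be answered mechanically: mark every node reachable from a live output value, and discard the rest. The node format matches the earlier sketch; which variables are live on exit is an assumption supplied by the caller.

```python
# A sketch of dead-code elimination on the DAG: keep only nodes reachable
# from the last definition of each output variable.

def eliminate_dead(nodes, outputs):
    live = set()
    def mark(i):                       # transitively mark a node's operands
        if i not in live:
            live.add(i)
            for child in nodes[i][1]:
                mark(child)
    for i, (op, kids, var) in enumerate(nodes):
        # roots: the *last* definition of each output variable
        is_last = all(nodes[j][2] != var for j in range(i + 1, len(nodes)))
        if var in outputs and is_last:
            mark(i)
    return [i for i in range(len(nodes)) if i in live]

# DAG for: b = a - d; c = b + c; d = a - d, assuming only d is live on exit
nodes = [("in", (), "a"), ("in", (), "d"),
         ("-", (0, 1), "b"), ("in", (), "c"),
         ("+", (2, 3), "c"), ("-", (0, 1), "d")]
live = eliminate_dead(nodes, outputs={"d"})
print(live)  # [0, 1, 5]: the computations of b and c are dead
```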
Algebraic Identities
We can exploit algebraic identities to optimize DAGs:
Arithmetic identities: x + 0 = 0 + x = x; x * 1 = 1 * x = x; x - 0 = x; x / 1 = x
Reduction in strength: x^2 = x * x; 2 * x = x + x; x / 2 = x * 0.5
Constant folding: evaluate expressions at compilation time, replacing each expression by its value.
Reduction in strength optimizes code by replacing some sequences of instructions with others that can be computed more efficiently.
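The identities above can be applied instruction by instruction. In this sketch, constants are Python ints, variables are strings, and "copy" is our hypothetical copy operation; a real compiler would match on its own IR.

```python
# A sketch of algebraic simplification on (dest, op, arg1, arg2) tuples,
# mirroring the identities in the text, plus constant folding.

def simplify(dest, op, a, b):
    if isinstance(a, int) and isinstance(b, int):      # constant folding
        val = {"+": a + b, "-": a - b, "*": a * b}[op]
        return (dest, "copy", val, None)
    if op == "+" and 0 in (a, b):                      # x + 0 = 0 + x = x
        return (dest, "copy", b if a == 0 else a, None)
    if op == "*" and 1 in (a, b):                      # x * 1 = 1 * x = x
        return (dest, "copy", b if a == 1 else a, None)
    if op == "-" and b == 0:                           # x - 0 = x
        return (dest, "copy", a, None)
    if op == "*" and 2 in (a, b):                      # 2 * x = x + x
        x = b if a == 2 else a
        return (dest, "+", x, x)
    return (dest, op, a, b)                            # no rule applies

print(simplify("t", "+", "x", 0))   # ('t', 'copy', 'x', None)
print(simplify("t", "*", 2, "x"))   # ('t', '+', 'x', 'x')
print(simplify("t", "*", 3, 4))     # ('t', 'copy', 12, None)
```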
Peephole Optimizations
Peephole optimizations are a category of local code optimizations. The principle is very simple:
the optimizer analyzes sequences of instructions;
only code that is within a small window of instructions is analyzed each time;
this window slides over the code;
once patterns are discovered inside this window, optimizations are applied.
How do we find the best window size?
Branch Transformations
Some branches can be rewritten into faster code. For instance:
if debug == 1 goto L1
goto L2
L1: ...
L2: ...
can be replaced by a single branch with the negated condition:
if debug != 1 goto L2
L1: ...
L2: ...
Jumps to Jumps
How could we optimize this code sequence?
goto L1
...
L1: goto L2
We can replace the jump to L1 by a jump to its final target:
goto L2
...
L1: goto L2
The same optimization applies when the first jump is conditional. Moreover, if there is no other jump to L1, we can remove the instruction at L1 from the code, as it becomes dead code after our transformation.
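Chasing a goto through a chain of unconditional jumps can be sketched as follows. The `target_of` map is our simplification: it records, for each label, the label its block immediately jumps to, when the block consists of a single goto.

```python
# A sketch of "jumps to jumps" elimination: follow each goto through
# chains of unconditional jumps to its final destination. A visited set
# guards against cycles such as L1: goto L1.

def final_target(label, target_of):
    seen = set()
    while label in target_of and label not in seen:
        seen.add(label)
        label = target_of[label]      # hop to the next jump in the chain
    return label

target_of = {"L1": "L2", "L2": "L3"}
print(final_target("L1", target_of))  # L3
print(final_target("L3", target_of))  # L3 (already the final target)
```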
Jumps to Jumps
How could we optimize this code sequence? Under which assumptions is your optimization valid?
goto L1
...
L1: if a < b goto L2
L3: ...
Reduction in Strength
Instead of performing reduction in strength at the DAG level, many compilers do it via a peephole optimizer. This optimizer allows us to replace sequences such as 4 * x by x << 2, for instance.
4 * x is a pattern that is pretty common in loops that range over arrays of 32-bit words. Why?
Machine Idioms
Many computer architectures have efficient instructions to implement some common operations. A typical example is found in the increment and decrement operators:
addl $1, %edi   =>   incl %edi
Are there any other machine idioms that we could think about?
Registers vs Memory
_arith_all_mem:
## BB#0:
subl $12, %esp
movl 16(%esp), %eax
movl %eax, 8(%esp)
movl 20(%esp), %eax
movl %eax, 4(%esp)
movl 24(%esp), %eax
movl %eax, (%esp)
movl 8(%esp), %ecx
addl 4(%esp), %ecx
imull %eax, %ecx
movl %ecx, %eax
shrl $31, %eax
addl %ecx, %eax
sarl %eax
addl $12, %esp
ret
_arith_reg_alloc:
## BB#0:
movl 4(%esp), %ecx
addl 8(%esp), %ecx
imull 12(%esp), %ecx
movl %ecx, %eax
shrl $31, %eax
addl %ecx, %eax
sarl %eax
ret
1) Why do we not need to worry about "memory allocation"?
Spilling
If the register pressure is too high, we may have to evict some variable to memory. This action is called spilling. If a variable is mapped into memory, then we call it a spill.
find_reg(i) {
  if there is a free register r
    return r
  else
    let v be the variable in a register whose next use is farthest after i
    let r be the register assigned to v
    if v does not have a memory slot
      let m be a fresh memory slot
    else
      let m be the memory slot assigned to v
    add "store v m" right after the definition of v
    return r
}
1) Why do we spill the variable whose next use is farthest after the current instruction i?
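The spilling policy of find_reg can be sketched as a small local allocator. This is a simplified model, not the slides' exact algorithm: the store is emitted at the eviction point rather than right after the definition, instructions are (dest, op, args) tuples, and we assume the number of registers is at least the number of operands per instruction.

```python
# A sketch of local register allocation: when no register is free, evict
# the in-register variable whose next use is farthest away (Belady's
# policy), inserting "store"/"load" spill code as needed.

def allocate(block, num_regs):
    in_reg, slot, out = {}, {}, []
    free = list(range(num_regs))

    def next_use(var, i):                 # index of the next read of var
        for j in range(i + 1, len(block)):
            if var in block[j][2]:
                return j
        return float("inf")

    def get_reg(i, keep=()):
        if free:
            return free.pop()
        victim = max((v for v in in_reg if v not in keep),
                     key=lambda v: next_use(v, i))
        if victim not in slot:            # spill: store once to a fresh slot
            slot[victim] = "m_" + victim
            out.append(("store", victim, slot[victim]))
        return in_reg.pop(victim)

    for i, (dest, op, args) in enumerate(block):
        for a in args:                    # bring operands into registers
            if a not in in_reg:
                r = get_reg(i, keep=args)
                out.append(("load", a, slot[a]))
                in_reg[a] = r
        for a in args:                    # registers of dead operands free up
            if a in in_reg and next_use(a, i) == float("inf"):
                free.append(in_reg.pop(a))
        if dest not in in_reg:            # reusing a just-read operand's
            in_reg[dest] = get_reg(i)     # register is fine: store precedes it
        out.append((dest, op, args))
    return out

block = [("a", "load m0", ()), ("b", "load m1", ()),
         ("c", "+", ("a", "b")), ("d", "4*", ("c",)),
         ("e", "1+", ("b",)), ("f", "+", ("e", "a")),
         ("g", "/", ("d", "e")), ("h", "-", ("f", "g")),
         ("r", "ret", ("h",))]
spilled = {e[1] for e in allocate(block, 2) if e[0] == "store"}
print(spilled)  # with two registers, a, d and f get spilled
```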
[Figure: register and memory state after spilling, with "store b m2" and "store d m2" inserted at program points L1 and L2.]
Belady, L. A., "A Study of Replacement Algorithms for a Virtual-Storage Computer", IBM Systems Journal (1966).
Example
Consider allocating registers R0 and R1 for the program below:
a = load m0
b = load m1
c = a + b
d = 4 * c
e = b + 1
f = e + a
g = d / e
h = f - g
ret h
[The slides allocate this program step by step, asking along the way: where can we fit variable c? We need to reload variable b; where? Where can we put variable e? Why is it not necessary to spill any variable, e.g., e, to allocate f? How do we allocate "h = f - g"?]
The final program, with spill code inserted, is:
a = load m0
store a ma
b = load m1
store b mb
c = a + b
d = 4 * c
b = load mb; store d md
e = b + 1
a = load ma
f = e + a
d = load md; store f mf
g = d / e
f = load mf
h = f - g
ret h
Optimality
If there exists an assignment of variables to registers that requires at most K registers, then our algorithm will find it.
If registers had different sizes, e.g., singles and doubles, then this problem would be NP-complete.
llvm-dis
llvm-dwarfdump
llvm-extract
llvm-link
llvm-lit
llvm-lto
llvm-mc
llvm-mcmarkup
pp-trace
llvm-objdump
llvm-ranlib
llvm-readobj
llvm-rtdyld
llvm-stress
llvm-symbolizer
llvm-tblgen
macho-dump
modularize
not
obj2yaml
opt
llvm-size
rm-cstr-calls
tool-template
yaml2obj
llvm-diff
Optimizations in Practice
The opt tool, available in the LLVM toolbox, performs machine-independent optimizations. There are many optimizations available through opt. To have an idea, type opt --help.
[Figure: the compilation pipeline: the front end that parses C into bytecodes, followed by machine-independent optimizations, such as constant propagation, and machine-dependent optimizations, such as register allocation.]
Optimizations in Practice
$> opt --help
Optimizations available:
-adce
-always-inline
-break-crit-edges
-codegenprepare
-constmerge
-constprop
-correlated-propagation
-dce
-deadargelim
-die
-dot-cfg
-dse
-early-cse
-globaldce
-globalopt
-gvn
-indvars
-instcombine
-instsimplify
-ipconstprop
-loop-reduce
-reassociate
-reg2mem
-sccp
-scev-aa
-simplifycfg
...
Levels of Optimizations
Like gcc, clang supports different levels of optimizations, e.g., -O0 (the default), -O1, -O2 and -O3.
To find out which optimizations each level uses, you can try:
How could we further optimize this program?
Constant Propagation
We can fold the computation of expressions that are known at compilation time with the constprop pass.
What is %1 in the left CFG? And what is i32 42 in the CFG on the right side?
How could we optimize this program?
Can you intuitively tell how CSE works?
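The idea behind constant propagation within a block can be sketched on a toy IR. LLVM's constprop pass works on LLVM IR; this model uses (dest, op, arg1, arg2) tuples with Python ints as constants and our hypothetical "const" marker for folded results.

```python
# A sketch of local constant propagation and folding: track variables
# whose values are known constants, substitute them into later uses,
# and fold operations whose operands are all constant.

def const_prop(block):
    env = {}                                  # variable -> known constant
    out = []
    def val(x):
        return env.get(x, x)                  # replace known vars by constants
    for dest, op, a, b in block:
        a, b = val(a), val(b)
        if isinstance(a, int) and isinstance(b, int):
            env[dest] = {"+": a + b, "-": a - b, "*": a * b}[op]
            out.append((dest, "const", env[dest], None))
        else:
            env.pop(dest, None)               # dest is no longer a constant
            out.append((dest, op, a, b))
    return out

block = [("x", "+", 2, 3), ("y", "*", "x", 4), ("z", "+", "y", "w")]
print(const_prop(block))
# [('x', 'const', 5, None), ('y', 'const', 20, None), ('z', '+', 20, 'w')]
```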
Function foo
add: 4
alloca: 5
br: 8
icmp: 3
load: 11
ret: 1
select: 1
store: 9
Example taken from the slides of Gennady Pekhimenko, "The LLVM Compiler Framework and Infrastructure".
Typed representation:
%0 = load i32* %X, align 4
%add = add nsw i32 %0, 1
ret i32 %add
This is C
switch(argc) {
case 1: x = 2;
case 2: x = 3;
case 3: x = 5;
case 4: x = 7;
case 5: x = 11;
default: x = 1;
}
This is LLVM
int main() {
  int T = 4;
  return callee(&T);
}
clang ex0.hack.ll
./a.out
echo $?
5
int callee(int X) {
  return X + 1;
}
int main() {
  int T = 4;
  return callee(T);
}
Targets:
- Alpha [experimental]
- ARM
- Analog Devices Blackfin
- C backend
- STI CBEA Cell SPU
- C++ backend
- MBlaze
- Mips
- Mips64 [experimental]
- Mips64el [experimental]
- Mipsel
- MSP430 [experimental]
- PowerPC 32
- PowerPC 64
- PTX (32-bit) [Experimental]
- PTX (64-bit) [Experimental]
- Sparc
- Sparc V9
- SystemZ
- Thumb
- 32-bit X86: Pentium-Pro
- 64-bit X86: EM64T and AMD64
- XCore
	.globl	_identity
	.align	4, 0x90
_identity:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	xorl	%eax, %eax
	movl	20(%esp), %ecx
	movl	16(%esp), %edx
	movl	%eax, %esi
	jmp	LBB1_1
	.align	4, 0x90
LBB1_3:
	movl	(%edx,%esi,4), %ebx
	movl	$0, (%ebx,%edi,4)
	incl	%edi
LBB1_2:
	cmpl	%ecx, %edi
	jl	LBB1_3
	incl	%esi
LBB1_1:
	cmpl	%ecx, %esi
	movl	%eax, %edi
	jl	LBB1_2
	jmp	LBB1_5
LBB1_6:
	movl	(%edx,%eax,4), %esi
	movl	$1, (%esi,%eax,4)
	incl	%eax
LBB1_5:
	cmpl	%ecx, %eax
	jl	LBB1_6
	popl	%esi
	popl	%edi
	popl	%ebx
	ret
A Bit of History
Some of the first code optimizations were implemented only locally.
Fortran was one of the first programming languages to be optimized by a compiler.
The LLVM infrastructure was implemented by Chris Lattner, during his PhD, at UIUC.
Kam, J. B. and J. D. Ullman, "Monotone Data Flow Analysis Frameworks", Acta Informatica 7:3 (1977), pp. 305-318.
Kildall, G., "A Unified Approach to Global Program Optimization", ACM Symposium on Principles of Programming Languages (1973), pp. 194-206.
Lattner, C. and Adve, V., "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation", CGO (2004), pp. 75-88.