You are on page 1of 11

ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.

com/lpt/a/6968

Published on ONLamp.com (http://www.onlamp.com/)


http://www.onlamp.com/pub/a/onlamp/2007/04/03/getting-familiar-with-gcc-parameters.html
See this if you're having trouble printing code examples

Getting Familiar with GCC Parameters


by Mulyadi Santosa
04/05/2007

gcc
(GNU C Compiler) is actually a collection of frontend tools that does compilation, assembly, and linking. The
goal is to produce a ready-to-run executable in a format acceptable to the OS. For Linux, this is ELF
(Executable and Linking Format) on x86 (32-bit and 64-bit). But do you know what some of the gcc
parameters can do for you? If you're looking for ways to optimize the resulted binary, prepare for a debugging
session, or simply observe the steps gcc
takes to turn your source code into an executable, getting familiar with these parameters is a must. So, please
read on.

Recall that gcc does multiple steps, not just one. Here is a brief explanation of their meaning:

Preprocessing: Producing code that no longer contains directives. Things like "#if" cannot be
understood directly by the compiler, so this must be translated into real code. Macros are also expanded
at this stage, making the resulting code larger than the original. [1]
Compilation: This takes the post-processed code, does lexical and syntax analysis, and then generates
the assembly code. gcc
outputs warning or error messages during this stage as the analyzer parses and finds any mistakes in
your code. If any optimization is requested, gcc will analyze your code further to find the possible spots
and manipulate them further. This work is done in multipass style, which demonstrates that it sometimes
takes more than one scan through the source code to optimize. [2]
Assembly: Accepts the assembler mnemonics and produces object codes containing opcodes. This is a
common misunderstanding: compilation stage doesn't produce opcodes, assembly stage does. The result
is one or more object files containing opcodes, which are really machine-dependent. [3]
Linking: Transforms object files into final executables. Opcodes alone are not enough to make the
operating system recognize and execute them. They must be wrapped into a more complete form. This
form, known as binary format, dictates how the OS loads the binary, arranges the relocation, and does
other necessary work. ELF is the default format for x86-based Linux. [4]

The gcc
parameters described here directly and indirectly touch all those four stages, so for clarity, this article is
organized as follows:

Parameters related to optimization


Parameters related to function call
Parameters related to debugging
Parameters related to preprocessing

Before that, let's meet the accompanying tools that will help us to sneak inside the resulting code:

The ELF tool collection, which includes such programs as objdump and readelf. They parse ELF
information for us.
Oprofile, one of the de facto ways to collect hardware performance counters. We need this tool to see
certain aspects of the code's performance.

第1页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

time, a simple way to collect total runtime of a program.

The following explanations can be applied to gcc version 3.x and 4.x, so it's pretty generic. Let's start digging,
shall we?

Parameters Related To Code Optimization

gcc offers a "no-brainer" way to do a bunch of optimizations for you: the -O option. It either makes your code
run faster or squeezes the final code's size. There are five variants of it:

-O0 (O zero) to -O3. "0" means no optimization, while "3" is highest optimization level. "1" and "2" are
between these extreme edges. If you just use -O without specifying any number, implicitly it means -O1.
-Os. Tells gcc to optimize for size. Basically, it is similar to -O2, but skips some steps that might enlarge
the size.

How much speedup do we really get from them? Well, suppose we have this code:
#include<stdio.h>
int main(int argc, char *argv[])
{
int i,j,k
unsigned long acc=0;
for(i=0;i<10000;i++)
for(j=0;j<5000;j++)
for(k=0;k<4;k++)
acc+=k;
printf("acc = %lu\n",acc);
return 0;
}

Listing 1. Simple code to be optimized.

Using gcc, four different binaries are created, using each -O variants (except -Os). The time tool records their
execution time, e.g.:
$ time ./non-optimized

No optimization -O1 -O2 -O3


real 0.728 0.1 0.1 0.1
user 0.728 0.097 0.1 0.1
sys 0.000 0.002 0.000 0.000

For simplicity's sake, we use the following conventions:

Non-optimized refers to the executable file compiled with -O0.


OptimizedO1 refers to the executable file compiled with -O1.
OptimizedO2 refers to the executable file compiled with -O2.
OptimizedO3 refers to the executable file compiled with -O3.

As you can see, between the non-optimized case and the one compiled using -O1, the elapsed time is seven
times shorter. Notice that -O1, -O2, and -O3
don't really give too much additional benefit--in fact, they are pretty much the same. So what's the magic of
-O1 here?

After a quick sightseeing trip to the source code, you would expect that such code is a dead-end for
optimization. First, let's look at a brief comparison between the disassembled versions of non-optimized and
optimizedO1.
$ objdump -D non-optimized
$ objdump -D optimizedO1

第2页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

(Note: You might get different results, so use this as general guideline.)

Non-optimized OptimizedO1
mov 0xfffffff4(%ebp),%eax add $0x6,%edx
add %eax,0xfffffff8(%ebp) add $0x1,%eax
addl $0x1,0xfffffff4(%ebp) cmp $0x1388,%eax
cmpl $0x3,0xfffffff4(%ebp)

The above snippets implement the innermost loop (for (k=0;k<4;k++)). Notice the clear difference: the
non-optimized code directly loads and stores from a memory address, while optimizedO1 uses CPU registers
as the accumulator and loop counter. As you may already be aware, registers can be accessed hundreds or
thousands times faster than RAM cells.

Not satisfied to just exploit CPU registers for temporary storage, gcc uses another optimization trick. Let's see
the disassembled code of optimizedO1 again, specifically in the main() function:
......
08048390 <main>:
...
80483a1: b9 00 00 00 00 mov $0x0,%ecx
80483a6: eb 1f jmp 80483c7 <main+0x37>
80483a8: 81 c1 30 75 00 00 add $0x7530,%ecx

0x7530 is 30,000 in decimal form, so we can quickly guess the loop is simplified. This code represents the
innermost loop and the outermost loop ("for(j=0;j<5000;j++) ... for(k=0;k<4;k++)") because that is
literally a request to do 30,000 loops. Note that you just need to loop three times inside. When k=0, acc is still
the same, so the first loop can be discarded.
80483ae: 81 f9 00 a3 e1 11 cmp $0x11e1a300,%ecx
80483b4: 74 1a je 80483d0 <main+0x40>
80483b6: eb 0f jmp 80483c7 <main+0x37>

Hmm, now it is compared to 300,000,000 (10,000*5,000*6). This represents all three loops. After we reach
that number of loops, we jump straight to the printf() to print the sum (address 0x80483d0 - 0x80483db)
80483b8: 83 c2 06 add $0x6,%edx
80483bb: 83 c0 01 add $0x1,%eax
80483be: 3d 88 13 00 00 cmp $0x1388,%eax
80483c3: 74 e3 je 80483a8 <main+0x18>
80483c5: eb f1 jmp 80483b8 <main+0x28>

Six is added in the accumulator in each iteration. %edx will eventually hold the total sum after completion of all
the loops. The third and fourth lines shows us that after this is done 5,000 times, it should return to address
0x80483a8 (as mentioned previously).

We can conclude that gcc


creates a simplification here. Instead of looping three times in the innermost loop, it simply adds six for every
middle loop. This sounds simple, but it makes your program do only 100,000,000 loops instead of
300,000,000. This simplification, technically called loop unrolling, is one of the jobs done by -O1/2/3. Of
course, you can do it on your own, but sometimes it is good to know that gcc can detect such things and
optimize them.

With -O2 or -O3, gcc


also tries to optimize the branch. This is usually achieved via reordering [5] and code transformation. The goal
of this procedure is to eliminate as many mispredicted branches as possible, thus improving pipeline
utilization. For example, we can compare how non-optimized and optimizedO2 do most outer loops.
80483d4: 83 45 ec 01 addl $0x1,0xffffffec(%ebp)
80483d8: 81 7d ec 0f 27 00 00 cmpl $0x270f,0xffffffec(%ebp)
80483df: 7e c4 jle 80483a5 <main+0x21>

第3页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

Non-optimized binary use jle


to execute the jump. Mathematically, it means that the probability of taking the branch is 50 percent. Other the
other hand, the optimizedO2 version uses this:
80483b4: 81 c1 30 75 00 00 add $0x7530,%ecx
80483ba: 81 f9 00 a3 e1 11 cmp $0x11e1a300,%ecx
80483c0: 75 e1 jne 80483a3 <main+0x13>

Instead of jle, now jne is used. Assuming any integer could be compared in the previous cmp, you could
easily conclude that it boosts the chance of taking the branch to almost 100 percent. It is a small but useful hint
for the processor to indicate which code should be executed. However, for most modern processors, this kind
of transformation is not terribly necessary because the branch predictor is smart enough on its own.

To prove how much this transformation can help, OProfile comes to the rescue. Oprofile is executed to record
the number of retired branches and retired mispredicted branches. Retired here means it is executed inside the
CPU pipeline.
$ opcontrol --event=RETIRED_BRANCHES_MISPREDICTED:1000 --event=RETIRED_BRANCHES:1000;

We run non-optimized and optimizedO2 five times each. Then we take out the maximum and minimum of the
samples. We calculate the misprediction rate using this formula (adapted from this):

Misprediction ratio = retired mispredicted branch / retired branch

The average misprediction ratio is now calculated for each binary. Non-optimized gets 0.5117 percent, while
optimizedO2 gets 0.4323 percent--a very small gain for our case. The actual gain may vary in a real-world
case, because gcc itself can't do much without external hints. Please read about __builtin_expect() in the
gcc online documentation for further information.

Options Related To Function Calling

gcc
basically offers you several ways to manage how a function is called. Let's take a look at inlining first. By
inlining, you reduce the cost of a function call because the body of the function is directly substituted into the
caller. Please note that this is not done by default, only when you use -O3 or at least -finline-functions.

How does the finished binary look when gcc does inlining? Observe Listing 2:
#include<stdio.h>

inline test(int a, int b, int c)


{
int d;
d=a*b*c;
printf("%d * %d * %d is %d\n",a,b,c,d);
}

static inline test2(int a, int b, int c)


{
int d;
d=a+b+c;
printf("%d + %d + %d is %d\n",a,b,c,d);
}

int main(int argc, char *argv[])


{
test(1,2,3);
test2(4,5,6);
}

Listing 2. Inline in action

Compile Listing 2 using the following parameter:

第4页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

$ gcc -S -O3 -o <result-file.s> <listing-2.c>

-S makes gcc stop right after compilation stage (we'll cover it later in this article). The results are as follows:
....
test:
pushl %ebp
movl %esp, %ebp
pushl %ebx
....
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
...
movl $6, 16(%esp)
movl $3, 12(%esp)
movl $2, 8(%esp)
movl $1, 4(%esp)
movl $.LC0, (%esp)
call printf
...
movl $15, 16(%esp)
movl $6, 12(%esp)
movl $5, 8(%esp)
movl $4, 4(%esp)
movl $.LC1, (%esp)
call printf
...

Both test() and test2() are indeed inlined, but you also see test(), which stays outside main(). This is
where the static keyword plays a role. By saying a function is static, you tell gcc that this function won't be
called by any outside object file, so there is no need to emit the codes on its own. Thus, it is a space saver if
you can mark them as static whenever possible. On the other hand, be wise when deciding which function
should be inlined. Increasing size for a small speedup isn't always worthwhile.

With certain heuristics, gcc


decides whether a function should be inlined or not. One of the considerations is the function size in term of
pseudo-instructions. By default, the limit is 600. You can change this limit via -finline-limit. Experiment
to find better inline limits for your own case. It is also possible to override the heuristics so gcc always inlines
the function. Simply declare your function like this:
__attribute__((always_inline)) static inline test(int a, int b, int c)

Now, on to parameter passing. In x86 architectures, parameters are pushed to the stack and later popped inside
the function for further processing. But gcc
gives you a chance to change this behavior and instead use registers. Functions with up to three parameters
could use this feature by passing -mregparm=<n>, where <n> is the number of registers we want to use. If we
apply this parameter (n=3) to Listing 2, take out the inline attribute, and use no optimization, we get this:
...
test:
pushl %ebp
movl %esp, %ebp
subl $56, %esp
movl %eax, -20(%ebp)
movl %edx, -24(%ebp)
movl %ecx, -28(%ebp)
...
main:
...
movl $3, %ecx
movl $2, %edx
movl $1, %eax
call test

Instead of stack, it uses EAX, EDX, and ECX to hold the first, second, and third parameter. Because register
access time is faster than RAM, it is one way to reduce runtime. However, you must pay attention to these
issues:

第5页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

You MUST compile all your code with the same -mregparm register number. Otherwise, you will have
trouble calling functions on another object file since they assume different calling conventions.
By using -mregparm, you basically break the Intel x86-compatible Application Binary Interface (ABI).
Therefore, you should mention it when you distribute your software in binary only form.

You probably notice this kind of sequence at the beginning of every function:
push %ebp
mov %esp,%ebp
sub $0x28,%esp

This sequence, also known as the function prologue, is written to set up the frame pointer (EBP). It is useful to
help the debugger do a stack trace. The structure below helps you visualize this [6]:

[ebp-01] Last byte of the last local variable


[ebp+00] Old ebp value
[ebp+04] Return address
[ebp+08] First argument

Can we omit it? Yes, with -fomit-frame-pointer, the prologue will be shortened so the function just begins
with a stack reservation (if there are local variables):
sub $0x28,%esp

If the function gets called very frequently, cutting out the prologue saves your program several CPU cycles.
But be careful: by doing this, you also make it hard for the debugger to investigate the stack. For example, let's
add test(7,7,7) at the end of test2() and recompile with -fomit-frame-pointer and no optimization. Now
fire up gdb to inspect the binary:
$ gdb inline
(gdb) break test
(gdb) r
Breakpoint 1, 0x08048384 in test ()
(gdb) cont
Breakpoint 1, 0x08048384 in test ()
(gdb) bt
#0 0x08048384 in test ()
#1 0x08048424 in test2 ()
#2 0x00000007 in ?? ()
#3 0x00000007 in ?? ()
#4 0x00000007 in ?? ()
#5 0x00000006 in ?? ()
#6 0x0000000f in ?? ()
#7 0x00000000 in ?? ()

On the second call of test, the program is stopped and gdb prints the stack trace. Normally, main() should
come up in Frame #2, but we only see question marks. Recall what I said about the stack layout: the absence of
a frame pointer prevents gdb from finding the location of the saved return address in Frame #2.

Options Related To Debugging

Everybody needs to debug his or her code sometimes. When that time comes, usually you fire up gdb, put
breakpoints here and there, analyze backtraces, and so on, to pinpoint the exact location of the offending
code(s). And what exactly do you get? Assuming you haven't used any debugging option, you likely just get
an address pointed to by an EIP register.

The problem is, you don't really want an address. You want gdb or another debugger to simply show the
related lines. But gdb
can't do that without some kind of hint. This hint, specifically called Debugging With Attributed Record
Formats (DWARF), helps you do source-level debugging.

第6页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

How do you do it? Use -g when you compile into object code, e.g.:
gcc -o -g test test.c

What does gcc actually add so that the debugger can correlate an address with source code? You need
dwarfdump
[7] to find out. This tool is packaged inside the "libdwarf" tarball or RPM, so you won't find it as a standalone
package. Compile it on your own, or simply install from your distro's online repository; both should work. In
this section, I use the 20060614 RPM version.

Using readelf, you notice that there are 28 sections inside the non-debug version of Listing 1:
$ readelf -S ./non-optimized

But the debug version has 36 sections. The new sections added are:

debug_arranges
debug_pubnames
debug_info
debug_abbrev
debug_line
debug_frame
debug_str
debug_loc

You don't need to dig into all of the above sections; taking a look into .debug_line is enough for quick
observation. The command you need is:
$ /path/to/dwarfdump -l <object file>

Here is an example of what you'll get:


.debug_line: line number info for a single cu
Source lines (from CU-DIE at .debug_info offset 11):
<source> [row,column] <pc> //<new statement or basic block
/code/./non-optimized.c: [ 3,-1] 0x8048384 // new statement
/code/./non-optimized.c: [ 5,-1] 0x8048395 // new statement
...............

The interpretation of the above messages is quite straightforward. Take the first entry (below the <source>
string) as an example:
line number 3 in file non-optimized.c is located in address 0x8048384.

gdb itself gives the same information:


$ gdb non-optimized-debugging
(gdb) l *0x8048384
0x8048384 is in main (./non-optimized.c:3).

readelf also provides similar information, by using --debug-info:


$ readelf --debug-dump=line <object file>
Line Number Statements:
Extended opcode 2: set Address to 0x8048384
Special opcode 7: advance Address by 0 to 0x8048384 and Line by 2 to 3
Advance PC by constant 17 to 0x8048395
Special opcode 7: advance Address by 0 to 0x8048395 and Line by 2 to 5
....

Both readelf and dwarfdump can analyze debug information, so you're free to choose.

What you should realize is that the source code itself isn't embedded into the object file. In fact, the debugger
must check a separate source code file. The entry in the <source> column helps determine where to load the
source file. Notice that it contains a full path--meaning if the file is moved somewhere or renamed, gdb can't

第7页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

locate it.

gcc itself has the ability to produce many debugging information. Besides DWARF, there are:

Stabs: -gstabs produce native stabs format, while -gstabs+ includes specific
GNU extensions.
Common Object File Format (COFF): Created with -gcoff.
XCOFF: Created with -gxcoff. If you prefer to include GNU extensions,
use -gxcoff+.
Virtual Memory System (VMS): Produced with -gvms.

Each of the above formats is described in the footnotes ([8], [9], and [10]), but in x86-compatible architecture,
without a doubt, you would use DWARF format. The latest DWARF specification is DWARF version 3, and
gdb can produce it via -gdwarf-2. This could be misleading for first-time users, because you might think it
creates DWARF 2-based information. In fact, it is DWARF 2 coupled with some DWARF 3 features. Not
every debugger supports version 3, so use it with caution.

However, things do not always go smoothly. When you combine -O and -g, it is necessary for line information
to relate to the actual code in the mentioned address offset. An example can clarify this--take a file (I use
Listing 1 again) and compile it:
$ gcc -O2 -ggdb -o debug-optimized listing-one.c
$ readelf --debug-dump=line debug-optimized
..
Special opcode 107: advance Address by 7 to 0x80483aa and Line by 4 to 11
...

But what does gdb say?


$ gdb debug-optimized
(gdb) l *0x80483aa
0x80483aa is in main (./non-optimized.c:11).
...
11 printf("acc = %lu\n",acc);
...
(gdb) disassemble main
...
0x080483aa <main+26>: add $0x6,%edx
0x080483ad <main+29>: cmp $0x1388,%eax
...

There you see a complete disagreement. By inspecting debug information alone, you would expect the related
address to contain something like a CALL instruction. But in reality, you get ADD and CMP instructions, more
likely a loop construction. This is the side effect of the jobs done by optimization--instruction reordering in
this case. So as a rule of thumb, don't mix -g (or its variants) with -O.

Options Controlling Compilation Stages

For learning purposes, sometimes you want to know how your source code is transformed into an executable.
Fortunately, gcc provides you options to stop at any processing stage. Recall that gcc has several stages to be
accomplished--for example, linking. The options are:

-c stops at assembly phase but skips linking. The result is an object code.
-E stops after preprocessing stage. All preprocessing directives are expanded so you only see plain code.
-S stops after compilation. It leaves you with assembler code

-c
is mostly used when you have multiple source files and combine them to create the final executable. So,
instead of:
$ gcc -o final-binary test1.c test2.c

第8页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

it would be better to split them as:


$ gcc -c -o test1.o test1.c

$ gcc -c -o test2.o test2.c

and then:
$ gcc -o final-binary ./test1.o ./test1.o

You probably notice the same sequence if you build the program using a Makefile. The advantage of using -c
is clear: you only need to recompile the changed source files. The only phase that has to be redone is linking
all the object files, and that greatly saves time, especially in large projects. An obvious example of this is the
Linux kernel.

-E
is useful if you want to see how your code really looks after macros, definitions, and such are expanded. Take
Listing 3 as an example.
#include<stdio.h>

#define A 2
#define B 4
#define calculate(a,b) a*a + b*b

void plain_dummy()
{
printf("Just a dummy\n");
}

static inline justtest()


{
printf("Hi!\n");
}

int main(int argc, char *argv[])


{
#ifdef TEST
justtest();
#endif
printf("%d\n", calculate(A,B));
return 0;
}

Listing 3. Code contains #define and #ifdef

We compile it like this:


$ gcc -E -o listing2.e listing2.c

Notice that we don't pass any -D parameters, which means TEST is undefined. So what do we have in the
preprocessed file?
void plain_dummy()
{
printf("Just a dummy\n");
}

static inline justtest()


{
printf("Hi!\n");
}

int main(int argc, char *argv[])


{
printf("%d\n", 2*2 + 4*4);
return 0;
}

Where is the call to justtest() inside main()? Nowhere. TEST is undefined--that's why the code is

第9页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

eliminated. You can also see that the calculate()


macro is already expanded into multiplication and addition of constant numbers. In final executable form, this
number will be replaced with the operation result. As you see, -E is quite handy to double-check the
correctness of directives.

Notice that plain_dummy()


is still there even though it is never called. No surprise since no compilation happens here, therefore dead code
elimination doesn't happen at this stage. stdio.h is also expanded but it isn't shown in the above listing.

I found an interesting application of -E


as an HTML authoring tool [11]. In short, it helps you to adopt common programming practices such as code
modularization and macros into the HTML world--something that cannot be done using plain HTML coding.

-S gives you assembly code, much like what you see with objdump -d/-D. However, with -S, you still see
directives and symbol names, which makes it easier to study the code. For example, a call like
printf("%d\n", 20) could be transformed into:

.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "%d\n"
...
movl $20, 4(%esp)
movl $.LC0, (%esp)
call printf

You can see that format string %d is placed in a read-only data section ( .rodata). Also, you can confirm
that arguments are pushed to the stack from right to left, with the format string at the top of the stack.

Conclusion

gcc
gives us many useful options to make our code into whatever we like. By understanding what these options
really do, we can make the program faster and slimmer. However, do not depend entirely on them: you should
pay more attention to writing efficient and well-structured code.

Acknowledgments

I would like to thank the communities in the OFTC chat room (#kernelnewbies and #gcc) and #osdev
(Freenode) for their valuable ideas.

References

1. Wikipedia's article about Preprocessing


2. Wikipedia's article about Compilation
3. Wikipedia's article about Assembler
4. Wikipedia's article about Linker
5. An example of code reordering using gcc
6. Frame pointer omission (FPO) optimization and consequences when debugging, Part 1 and Part 2
7. Explanation of DWARF
8. Explanation of stabs
9. Explanation of COFF
10. Explanation of XCOFF (a COFF variant)
11. Using a C preprocessor as an HTML authoring tool
12. gcc online documentation
13. AMD Athlon Processor x86 Code Optimization Guide

Mulyadi Santosa is a freelance writer who lives in Indonesia.

第10页 共11页 2007-4-9 22:02


ONLamp.com: Getting Familiar with GCC Parameters http://www.onlamp.com/lpt/a/6968

Return to ONLamp.com.
Copyright © 2007 O'Reilly Media, Inc.

第11页 共11页 2007-4-9 22:02

You might also like