Optimizing PowerPC Code CH 13

13 Programming Model
Although the PowerPC processor does not enforce any particular programming
model, a variety of programming conventions that IBM devised for the POWER
architecture have become the standard for the PowerPC. These conventions are
part of the PowerOpen Application Binary Interface (ABI) and Application
Programming Interface (API), which formalize the standard so that compliant
systems are binary and source-code compatible. These same conventions are also
used on Apple's PowerPC-based Macintosh computers.
13.1 Register Usage Conventions

All of the registers that the PowerPC provides are defined as being either volatile
or non-volatile. A volatile register can be freely used by any routine; it does
not need to be saved and restored. A non-volatile register must be saved and
restored if it is used.
Additionally, a few of the registers are defined as dedicated. A dedicated register
is used for one very well-defined purpose and shouldn't be used for anything
else. There are only two dedicated registers: the stack pointer (r1) and the TOC
pointer (r2). Both of these registers are discussed in detail later in this chapter.
GPR Usage
The PowerPC has 32 GPRs, which may be either 32 or 64 bits (depending on the
implementation). Table 13-1 summarizes the standard register usage conven-
tions.
Table 13-1 GPR Register Usage Conventions
Two of the GPRs (r1 and r2) are dedicated for use with OS-related tasks, three
(r0, r11, and r12) are used by compiler, linkage, or glue routines, and eight
more (r3 through r10) are allocated for passing parameters into a function (see
§13.5, "Subroutine Calling Conventions," for more information). This leaves 19
GPRs (r13 through r31), which are available for general use but must be
saved and restored if used.
By convention, a routine should use the volatile registers first because they do
not need to be saved and restored. Thus, a routine should first use GPR0 and any
of the registers GPR3 through GPR12 that are not already being used by the
routine for parameters.
If a routine needs to use still more registers, the non-volatile GPRs should be
used from highest numbered to lowest numbered. That is, GPR31 is used first,
followed by GPR30, and so on. Using the non-volatile registers in this fashion
allows the stmw and lmw instructions to be used to save and restore the registers
in the function prolog/epilog code. However, some important issues are
involved with using the load and store multiple instructions. These issues are
discussed later in §13.8, "Saving Registers on the Stack."
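The GPR conventions above can be restated as a small classification routine. This is only an illustrative sketch in C: the names gpr_class, classify_gpr, and count_gprs are hypothetical and do not come from any ABI header.

```c
#include <assert.h>

/* Usage classes taken from the GPR conventions described in the text. */
enum gpr_class { GPR_VOLATILE, GPR_DEDICATED, GPR_NONVOLATILE };

/* Classify GPR number n (0..31). */
enum gpr_class classify_gpr(int n)
{
    assert(n >= 0 && n <= 31);
    if (n == 1 || n == 2)
        return GPR_DEDICATED;      /* r1 = stack pointer, r2 = TOC pointer */
    if (n >= 13)
        return GPR_NONVOLATILE;    /* r13..r31: save and restore if used   */
    return GPR_VOLATILE;           /* r0 and r3..r12                       */
}

/* Count how many of the 32 GPRs fall into class c. */
int count_gprs(enum gpr_class c)
{
    int n, total = 0;
    for (n = 0; n <= 31; n++)
        if (classify_gpr(n) == c)
            total++;
    return total;
}
```

Counting the non-volatile class gives the 19 registers (r13 through r31) mentioned above.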
FPR Usage
The PowerPC defines thirty-two 64-bit floating-point registers. Of these
registers, one (fr0) is set aside as a scratch register, 13 (fr1 through fr13) are
used for passing parameters to subroutines, and the remaining 18 (fr14 through
fr31) are available for general use. Table 13-2 summarizes the floating-point
register usage conventions.
Table 13-2 FPR Register Usage Conventions
As with the GPRs, a routine that needs to use floating-point registers should first
use the volatile registers: fr0 and any of the registers fr1 through fr13 that are
not being used to hold parameters. The remaining 18 FPRs can be used if there
are not enough volatile registers to hold the required values.
The non-volatile registers should be used from the highest to the lowest (that is,
from fr31 down to fr14) so that the FPR and GPR register save methods follow
the same basic conventions. (See §13.8, "Saving Registers on the Stack," later
in this chapter.)
SPR Usage
Table 13-3 summarizes the standard usage conventions for the common SPRs
available on PowerPC implementations. In general, the system registers do not
need to be preserved across function calls. The only exceptions are that some of
the fields within the CR must always be preserved, and the FPSCR should be
preserved under certain circumstances.
Table 13-3 SPR Register Usage Conventions

SPR            Type          Must be      Usage
                             Preserved?
CR field 0     volatile      no           general use; implicitly used by integer
                                          instructions with the Record bit set
CR field 1     volatile      no           general use; implicitly used by floating-point
                                          instructions with the Record bit set
CR fields 2-4  non-volatile  yes          general use; must be preserved
CR fields 5-7  volatile      no           general use
LR             volatile      no           branch target address;
                                          subroutine return address
CTR            volatile      no           loop counter;
                                          branch address (goto, case, system glue)
XER            volatile      no           fixed-point exceptions
FPSCR          volatile      no           floating-point exceptions
MQ             volatile      no           obsolete; exists on the 601 only
Note that even though the FPSCR is listed as a volatile register that doesn't need
to be preserved, it is considered rude for a routine to change the floating-point
exception enable bits in the FPSCR without restoring them to their original state.
The only exceptions are routines defined to modify the floating-point execution
state.
It is unnecessary to restore the FPSCR if it was only used to record the exception
information as set by the standard arithmetic floating-point instructions.
13.2 Table of Contents (TOC)

Each program module (collection of routines) has associated with it a Table of
Contents (TOC) area that identifies imported symbols and also provides a refer-
ence point for accessing the module's static storage area. A register is (by con-
vention) reserved to always point to the current TOC area. This register is r2,
but it is also referred to as rTOC.
For the most part, TOC maintenance is not something you need to worry about
because all of the functions in the same module will share the same TOC. How-
ever, when calling routines in other modules (for example, the standard library
routines), care must be taken so that the TOC for the called routine is set up
properly, and so that the TOC of the caller is restored before control returns to
the caller. Examples of how this transition is accomplished are given later in
§13.13, "Linking with Global Routines."
A routine accesses its global variables (including external routine descriptors) by
recording the variable's offset from the routine's TOC. This allows the variable
to be accessed by simply adding the offset to the current TOC value in rTOC:
    lwz r3,offset(rTOC)
The data referenced from the TOC typically isn't the actual global data; it is
usually just a pointer to the data. In the above example, the actual data would
then be accessed via the pointer that has been loaded into r3.
The reason pointers are stored in the TOC area instead of the actual data is that
64K is the maximum size for the TOC area (because offset is a signed 16-bit
quantity). If all of the program data needed to be directly accessible from the
TOC, then the global static data would have an obviously unacceptable 64K
limit.
By storing only pointers in the TOC area, the 64K limit applies only to the num-
ber of pointers that can be stored. Because each pointer is four bytes in size, the
maximum of 16,384 global data pointers isn't a practical limitation for most
applications.
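The two-step access and the 16,384-entry limit can be modeled in C. This is a sketch under stated assumptions: toc_max_entries and load_global_via_toc are hypothetical names, and the model uses host-sized pointers for the array indexing, whereas the text assumes 4-byte pointers on the target.

```c
#include <assert.h>
#include <stddef.h>

#define TOC_BYTES     65536u   /* reach of a signed 16-bit offset      */
#define POINTER_BYTES 4u       /* 32-bit pointers, as in the text      */

/* Maximum number of pointer entries that fit in one TOC area. */
unsigned toc_max_entries(void)
{
    return TOC_BYTES / POINTER_BYTES;
}

/* Model of "lwz r3,offset(rTOC)" followed by a load through r3:
   the TOC slot holds a pointer, and the data itself lives elsewhere.
   byte_offset is measured in host bytes for this model. */
int load_global_via_toc(int *const *toc, size_t byte_offset)
{
    const int *p;
    assert(byte_offset % sizeof(int *) == 0);
    p = toc[byte_offset / sizeof(int *)];  /* first load: pointer from TOC */
    return *p;                             /* second load: the actual data */
}
```

The extra load is the price paid for lifting the 64K limit from the data itself to the table of pointers.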
Initializing the TOC

There is no need for a routine or module to initialize its TOC pointer. In fact, it is
allowed to assume that the TOC has already been set up before it is given con-
trol. However, the TOC must be initialized somewhere. Fortunately, that some-
where is in the system loader, so most programmers do not need to worry
about it.
The system loader handles the TOC initialization because it is in control of
where the program's code and data are loaded. After the loader has decided
where the program should be loaded, it knows the location of the TOC and can
set up the TOC and update any portions of the code or data that need to be ini-
tialized with its value.
13.3 The Stack Pointer

By convention, GPR1 contains a 16-byte aligned value that always points to the
top of the stack, and the top of the stack always contains a valid stack frame for
the current routine. The stack frame for a routine identifies the routine's execution
context: it contains the register save area, local storage area, and a few other bits
of information for the given routine. The information in the stack frame allows
the entire calling chain to be examined at any time, because each stack frame also
contains a pointer to the stack frame of the routine that called it. Figure 13-1
shows a sample stack with the call-chain pointers to the previous stack frame
made explicit.

Figure 13-1 Sample Stack Showing Multiple Stack Frames
(The stack is drawn with low addresses at the top and high addresses at the
bottom; each stack frame holds a pointer to the stack frame of its caller.)
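Because every stack frame begins with a pointer to its caller's frame, the calling chain is an ordinary linked list. The following C sketch walks it; struct frame and call_depth are hypothetical names, and the use of a null back chain to mark the outermost frame is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Only the first word of a frame matters here: the back chain. */
struct frame {
    struct frame *back_chain;   /* pointer to the caller's stack frame */
};

/* Walk the call chain starting from the frame the stack pointer
   identifies, returning the number of frames visited. The outermost
   frame is assumed (for this sketch) to have a NULL back chain. */
int call_depth(const struct frame *sp)
{
    int depth = 0;
    while (sp != NULL) {
        depth++;
        sp = sp->back_chain;
    }
    return depth;
}
```

A debugger or traceback routine can follow the same chain, provided r1 identifies a valid frame at every instant.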
It's nice to be able to assume that the stack pointer always points to a valid stack
frame, because interrupt-level code doesn't need to worry about the stack being
in an inconsistent state. However, it does require a little bit of effort on the part
of the program to ensure that the stack pointer does, indeed, always identify a
valid stack frame.
Basically, this effort amounts to making sure that the stack pointer update
operation is atomic (that is, it is accomplished in one instruction). This means that
the stack frame cannot be built by using a series of small steps that each allocate a
small area on the stack. The entire stack frame size must be allocated at once,
and the offsets to each area within the frame must be calculated. These
calculations can become quite cumbersome because of multiple variable-sized areas in
a stack frame. Efforts to simplify these calculations have led to some rather
bizarre stack frame building conventions, which are covered later in "Building
Stack Frames."
Updating the Stack Pointer

Because the stack pointer must always point to a valid stack frame at the top of
the stack, the only time the stack pointer should be updated is when the flow of
control is entering or exiting a function.[1]
[1] Actually, the C library routine alloca(), which dynamically allocates storage on the stack,
also updates the stack pointer. This special case is discussed later in §13.12, "Stack Frames
and alloca()."
When a function is entered, it needs to create a new stack frame above the
current one on the stack and update r1 to point to the new frame, saving the
address of the previous stack frame in the process. When a function exits, it
simply needs to restore the previous stack frame.
Figure 13-2 shows this for the simple case of the routine foo() calling the
routine bar(). At point (A), foo() has initialized itself but has not yet called
another routine. During (B), foo() has called bar(), and bar() has set itself
up with its own stack frame. A pointer back to foo()'s stack frame is recorded
as part of bar()'s initialization process. After the call to bar() is complete (C),
the stack pointer once again points to foo()'s stack frame.
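Figure 13-2 Stack When foo() Calls bar()
(Three views of the stack: (A) foo()'s stack frame alone, (B) bar()'s stack
frame above foo()'s stack frame, (C) foo()'s stack frame alone again.)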
The three states shown in Figure 13-2 are the only states that the stack pointer
has during the subroutine call. There are no intermediate states where bar()'s
stack frame is only partially built.
Stack Pointer Maintenance on Function Entry

When the size of the stack frame is less than 32K, the stack pointer update that is
required when a function is entered can be accomplished with a single
instruction:
    stwu r1,-frame_size(r1)
This instruction calculates the address of the new stack frame from
(r1) - frame_size and stores the old value of r1 (which points to the current stack
frame) at that address. The calculated address is then stored in r1. The first part
of this operation saves a pointer to the old stack frame in the first word of the
new stack frame, and the second part updates the stack pointer to point to the
new stack frame.
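The effect of the Store with Update instruction can be modeled on a toy machine in C. This is a sketch, not an emulator: struct machine, stwu_r1, and back_chain are hypothetical names, the stack is a small byte array, and r1 is treated as an index into that array rather than a real address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 1024u

/* A toy machine: a small byte-addressed stack and the r1 register. */
struct machine {
    uint8_t  stack[STACK_BYTES];
    uint32_t r1;                 /* stack pointer, an index into stack[] */
};

/* Model of "stwu r1,-frame_size(r1)": store the old r1 at the new
   address, then make r1 that address. */
void stwu_r1(struct machine *m, uint32_t frame_size)
{
    uint32_t old_sp = m->r1;
    uint32_t new_sp;
    assert(frame_size % 16u == 0 && frame_size <= old_sp);
    new_sp = old_sp - frame_size;
    assert(new_sp + sizeof old_sp <= STACK_BYTES);
    memcpy(&m->stack[new_sp], &old_sp, sizeof old_sp);  /* back chain */
    m->r1 = new_sp;                                     /* update r1  */
}

/* Read the back chain word at offset 0 of the current frame. */
uint32_t back_chain(const struct machine *m)
{
    uint32_t v;
    memcpy(&v, &m->stack[m->r1], sizeof v);
    return v;
}
```

After the call, the word at offset 0 of the new frame holds the old stack pointer, so adding frame_size back to r1 and reloading r1 from offset 0 both yield the caller's frame.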

Stack Pointer Maintenance on Function Exit

For functions with stack frames less than 32K in size, the code required to
update the stack pointer on function exit is also quite simple:
    addi r1,r1,frame_size
All that is required is that the stack be adjusted down by the same amount that
it was adjusted up during function entry.
Alternatively, because a pointer to the previous stack frame was saved at offset
0 of the current stack frame, the stack pointer could be updated as:
    lwz r1,0(r1)
However, this method is less efficient because it requires a load from memory
that the addi method avoids.
Handling Stack Frames ≥ 32K in Size

Handling stack frames that are 32K or larger in size is not difficult, but care must
be taken so that the stack pointer is updated in one operation and not in a series
of small adjustments.
On function entry, this operation should be performed as:
    # load the 32-bit -frame_size value into r12
    lis r12,<upper 16 bits of -frame_size>
    ori r12,r12,<lower 16 bits of -frame_size>
    # update the stack pointer
    stwux r1,r1,r12
On function exit, the same technique is used as with small stack frames, but the
addi instruction cannot be used (because it only handles 16-bit values). The
frame_size value can be placed in a register and then added to the stack pointer,
like this:
    # load the 32-bit frame_size value into r12
    lis r12,<upper 16 bits of frame_size>
    ori r12,r12,<lower 16 bits of frame_size>
    # update the stack pointer
    add r1,r1,r12
As in the small stack frame case, the previous stack frame can be restored using
a load from offset 0 of the current stack frame, as:
    lwz r1,0(r1)
Which of the two methods is more efficient depends on the particular routine
since the three-instruction sequence provides more opportunities for schedul-
ing the instructions.
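The upper and lower halves used by the lis/ori pair can be computed as follows. This C sketch only models the arithmetic; upper16, lower16, lis_ori, and fits_small_frame are hypothetical helper names. Because ori zero-extends its immediate, a plain split into two 16-bit halves is sufficient.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 16 bits of a 32-bit value: the immediate for lis. */
uint16_t upper16(uint32_t value) { return (uint16_t)(value >> 16); }

/* Lower 16 bits of a 32-bit value: the immediate for ori. */
uint16_t lower16(uint32_t value) { return (uint16_t)(value & 0xFFFFu); }

/* Model of "lis rD,hi" followed by "ori rD,rD,lo". */
uint32_t lis_ori(uint16_t hi, uint16_t lo)
{
    uint32_t reg = (uint32_t)hi << 16;   /* lis: place hi in the upper half */
    reg |= lo;                           /* ori: OR in the lower half       */
    return reg;
}

/* True when the single-instruction stwu/addi forms can be used,
   that is, when the frame is smaller than 32K. */
int fits_small_frame(uint32_t frame_size)
{
    assert(frame_size % 16u == 0);
    return frame_size < 32768u;
}
```

For a 40,000-byte frame, the negated size is 0xFFFF63C0, so the entry sequence would use 0xFFFF for lis and 0x63C0 for ori.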
Building Stack Frames

This section discusses the steps necessary to build a stack frame as they relate to
the stack pointer. It does not provide all the details and the structure of stack
frames (these will be given later in the sections detailing the subroutine calling
conventions), but rather it discusses the order in which the components of the
stack frame should be built.
The values in the stack frame can be initialized during two periods: before the
frame is built and after it has been created. Both of these periods are shown in
Figure 13-3.
After the stack frame is built, the stack pointer (r1) points to the newly created
stack frame. At this time, the area in the stack frame can be initialized by using a
positive offset from the stack pointer (which is also a pointer to this stack frame).
This period is typically when most stack frame initialization is performed.
Figure 13-3 Periods When Stack Frame Values Can Be Initialized
(Two views of the stack: before building bar()'s stack frame, with bar()'s
frame under construction above foo()'s stack frame, and after building bar()'s
stack frame, with bar()'s completed frame above foo()'s stack frame.)
Before the frame is built, r1 points to the stack frame of the previous routine,
which is conveniently located immediately below where the new stack frame is
going to be built. The new stack frame areas can be initialized at this point by
using a negative offset from r1 to write values above the top of the stack.

Writing Above the Stack Pointer... ick!
The calling conventions for most well-designed systems involve initializing the
stack frame values after the frame has been built and consider it a bad idea to
write values above the current stack pointer. There's a very good reason for this:
if an interrupt comes along, it may need to allocate some temporary storage on
the stack for itself. Although it will free the space and return the stack pointer to
its original value, any data that was above the stack pointer is likely to be
trashed.
The PowerPC calling conventions are no exception: it is still considered bad
form to write data above the stack pointer, with the one exception of building
stack frames. Two things make this exception acceptable. First, the only values
written above the stack are the GPR and FPR save areas, guaranteed to be no
larger than a certain maximum size (because only a certain number of registers
will ever need to be saved). Second, interrupts and other system-level code are
aware of this maximum size and skip over that many bytes before they allocate
any space on the stack.
Now, this may seem like too much effort just to write values safely above the
stack, and, to a certain extent, it is. The GPR and FPR save areas could have been
written using offsets from the new stack frame after it had been built. However,
because many areas in the stack frame are of variable width and need to be prop-
erly aligned, the formula to determine the offset to these two areas from the
frame pointer is relatively complicated.
So, it's basically a choice between complicating the interrupt handlers (which
very few people write) or complicating the formula for calculating the offsets to
these areas (which would affect more people). It's important to note that neither
option affects performance. The interrupt handlers simply add the maximum
save area size to the amount of space that they're allocating on the stack, and the
"more complex offset formula" is statically computed at assembly time and doesn't
generate any extra code.
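The adjustment made by interrupt-level code amounts to one extra term in a subtraction. The C sketch below assumes the 32-bit maximum save area of 220 bytes; interrupt_alloc and touches_save_area are hypothetical names, and alignment of the handler's own storage is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SAVE_AREA 220u   /* 32-bit case: 19 GPRs * 4 + 18 FPRs * 8 */

/* Lowest address an interrupt handler uses when it needs `bytes` of
   temporary stack storage while r1 holds `sp`. */
uint32_t interrupt_alloc(uint32_t sp, uint32_t bytes)
{
    assert(sp >= MAX_SAVE_AREA + bytes);   /* stay inside the address space */
    return sp - (MAX_SAVE_AREA + bytes);
}

/* True when [addr, addr + len) overlaps the protected region
   [sp - MAX_SAVE_AREA, sp) that a prolog may be writing into. */
int touches_save_area(uint32_t sp, uint32_t addr, uint32_t len)
{
    uint32_t lo = sp - MAX_SAVE_AREA;
    return addr < sp && addr + len > lo;
}
```

Storage handed out this way never overlaps the region just above the stack pointer, so registers saved there by a prolog survive the interrupt.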
The end result is that it doesn't really matter. The standard calling convention
involves writing above the stack, and the system-level code is designed to han-
dle this. If writing above the stack offends you as a programmer, you can simply
not do it. It is perfectly acceptable to save the GPR and FPR values after the stack
frame has been built because the only code that uses them is the function that
owns the stack frame (no one else can use them since the number of registers
saved isn't even recorded unless debugging information (à la traceback table) is
present).
It's interesting to note that because the size of the GPR and FPR save areas is
dependent on the size and number of registers being saved, the maximum save
area size will change for 64-bit PowerPC implementations. In order to make
room for the nineteen 64-bit GPRs and eighteen 64-bit FPRs, 296 bytes (instead
of the 220 required for nineteen 32-bit GPRs and eighteen 64-bit FPRs) will need
to be "reserved" above the stack.
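The two maximum save area sizes follow directly from the register counts. A minimal C sketch of the arithmetic, with max_save_area as a hypothetical helper name:

```c
#include <assert.h>

#define NONVOLATILE_GPRS 19u   /* r13..r31                        */
#define NONVOLATILE_FPRS 18u   /* fr14..fr31                      */
#define FPR_BYTES         8u   /* FPRs are always 64 bits wide    */

/* Largest possible GPR + FPR save area for a given GPR width in bytes
   (4 for 32-bit implementations, 8 for 64-bit implementations). */
unsigned max_save_area(unsigned gpr_bytes)
{
    assert(gpr_bytes == 4u || gpr_bytes == 8u);
    return NONVOLATILE_GPRS * gpr_bytes + NONVOLATILE_FPRS * FPR_BYTES;
}
```

With 4-byte GPRs this is 76 + 144 = 220 bytes; with 8-byte GPRs it is 152 + 144 = 296 bytes.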
13.4 Brief Interlude: Naming Conventions

In the remainder of this chapter, three routine names will be used in the
examples: sna(), foo(), and bar(). Of these three routines, foo() and bar() will
commonly be used in the examples. The relationship between these routines is
that foo() calls bar(). The sna() routine is only used to refer to the routine
that originally called foo(). This calling order will always hold true for all the
examples given here: sna() always calls foo(), which always calls bar().
The terms caller/callee or pitcher/catcher can also be used when describing how
the functions are interacting with the surrounding routines. However, a routine
is never referred to as simply a caller or a callee; these terms are always used
with respect to some other function. For example, foo() is the caller of bar(),
but is a callee of sna(); how it's referred to depends on the context. Most of the
time, this caller/callee relationship will not be used because using these terms
can lead to confusion when the context of the discussion changes.
13.5 Subroutine Calling Conventions

A subroutine call requires that the caller and the callee both agree on a protocol
for passing data and control back and forth. As mentioned earlier, these
conventions are not enforced by the processor. These conventions are instead
"enforced" by the operating system and the libraries, because a program must
follow these conventions in order to access the system-provided routines.
According to these calling conventions, the calling routine has the responsibility
of setting up the parameters, passing the return address, and then handing off
control to the subroutine.
The called routine has the responsibility of saving the return address and any
non-volatile register that it modifies, setting up any stack structures that it
requires, and then cleaning everything up and restoring registers before return-
ing to the calling routine. These two tasks (set up and clean up) are performed
by code fragments known as prologs and epilogs.
Prologs and Epilogs

Prologs and epilogs are little code snippets at the beginning (prologs) or ending
(epilogs) of a function. The purpose of these 'logs is to set up the proper
environment for the routine and then restore the original environment when the
routine is finished.
Function Prologs
When a function is called, it must set up the stack properly so that it creates room
for all local and register-save storage, and so that it sets up a proper back chain.
The portion of a function that does this is known as the function prolog.
Function Epilogs
The function epilog undoes the work of the prolog. It restores the registers that
were saved and makes sure that the stack pointer once again points to the stack
frame of the routine that originally called this routine.
13.6 A Simple Subroutine Call

The easiest way to describe a function call is to step through a very basic
subroutine call and explain what is happening along the way. For this example, the
routine foo() (which is assumed to have already been set up) is calling the
routine bar(), using a standard Branch with Link instruction:
    bl bar
The parameters being passed are not important for this general discussion, so
they are not specified.
Before Calling bar()

As mentioned earlier, foo() is assumed to have been set up properly already.
This means that the stack pointer currently points to foo()'s stack frame. This
is shown in Figure 13-4.
There are five areas in this stack frame: the link area, the argument area, the local
storage area, the GPR save area, and the FPR save area. The last two areas are
sometimes collectively referred to as the register save area or simply the save area.
These areas are briefly described in the next few paragraphs and more completely
described in §13.9, "Stack Frames."
The link area is used to save the back-chain (the pointer to the previous function's
stack frame) and to provide space to store a few special registers: the Table Of
Contents, the Condition Register, and the Link Register.
Figure 13-4 The Stack Frame of foo()
(foo()'s stack frame, from the top: Link Area, Argument Area, Local Storage
Area, GPR Save, FPR Save.)
The argument area is where the arguments are placed if there isn't enough
room to store all of them in the registers. This area is always at least eight
words in size and is left unused most of the time (because the arguments
are stored in registers if possible). The arguments stored here are the
arguments that foo() (in this example) is sending to another routine
(bar()). These are not the arguments that were passed to foo().
The local storage area is where foo() stores whatever it likes. Its size is
determined by foo() when it creates the stack frame.
The register save area is where foo() stores the original contents of any of
the non-volatile GPRs or FPRs that it needs to use. This way it can restore
them to their original values before returning. If no non-volatile registers
are used by foo(), then this area will be zero bytes in size.
Prolog for bar()

After control is passed to bar(), its prolog code is executed, which performs
these tasks:
• Saves any non-volatile registers that are used by bar().
• Creates bar()'s stack frame (saving a pointer to the previous frame).
After bar() has accomplished these tasks, the stack looks like Figure 13-5.

Figure 13-5 The Stack after bar() Has Built Its Stack Frame
(bar()'s new stack frame, with its own Link Area, Argument Area, and Local
Storage Area, sits above foo()'s stack frame.)
Saving the Non-volatile Registers

The following registers are non-volatile and must be saved if bar() uses them:
GPRs 13 through 31, FPRs 14 through 31, and the Condition Register. In
addition, the Link Register (technically a volatile register) should be saved and
restored.
As mentioned earlier, the GPRs and FPRs are saved in bar()'s stack frame
before the stack frame is built. First, the FPRs are stored immediately above
foo()'s stack frame, and then the GPRs are written above the FPRs. A
simplified diagram showing only the GPR and FPR save areas is given in Figure 13-6.
Figure 13-6 Saving the GPRs and FPRs above the Stack Pointer
(bar()'s stack frame, under construction, holds the GPR Save area above the
FPR Save area, which sits directly above foo()'s stack frame.)
Because of the register usage convention of starting from register 31 and
working down, each routine will have a contiguous range of registers (rN through
r31) that need to be saved. This simplifies the save/restore code and allows the
Load and Store Multiple instructions to be used for this purpose. In reality, the
process of saving and restoring registers ends up being a little more
complicated, but these complications don't affect the fact that the registers are stored in
the register save area in ascending order. Register saving is treated in more detail
later in §13.8, "Saving Registers on the Stack."
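One way to picture the layout just described is to compute where each register lands relative to r1 while r1 still points at foo()'s stack frame. The C sketch below assumes 32-bit GPRs, that fr31 and r31 occupy the highest-addressed slot of their respective areas, and that the GPR save area starts directly above the FPRs actually saved; the function names are hypothetical.

```c
#include <assert.h>

/* Offset from r1 (still pointing at the caller's frame) of the
   8-byte slot for FPR n, with fr31 nearest the caller's frame. */
int fpr_save_offset(int n)
{
    assert(n >= 14 && n <= 31);
    return -8 * (32 - n);
}

/* Offset from r1 of the 4-byte slot for GPR n. first_saved_fpr is the
   lowest-numbered FPR being saved, or 32 when no FPRs are saved. */
int gpr_save_offset(int n, int first_saved_fpr)
{
    int fpr_bytes;
    assert(n >= 13 && n <= 31);
    assert(first_saved_fpr >= 14 && first_saved_fpr <= 32);
    fpr_bytes = 8 * (32 - first_saved_fpr);   /* size of the FPR save area */
    return -fpr_bytes - 4 * (32 - n);
}
```

When every non-volatile register is saved, r13 lands 220 bytes above the stack pointer, which matches the maximum save area size quoted earlier.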
For the CR and LR, storage space is set aside in foo()'s stack frame to save
these values. The CR is saved at offset 4 from the start of foo()'s stack frame,
and the LR is stored at offset 8. Figure 13-7 shows this operation. In this figure,
the highlighted area in foo()'s stack frame is the area in the Link Area where
the register is being stored.
The code to save the CR and LR is:
    # save the Link Register
    mflr r0
    stw r0,8(r1)
    # save the Condition Register
    mfcr r0
    stw r0,4(r1)
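The offsets used by these stores can be checked against a C picture of the six-word link area. Only the first three fields (back chain at 0, CR at 4, LR at 8) and the overall size are taken from the text; the placement of the remaining words, including the slot for the TOC pointer, is an assumption made for this sketch, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The six-word link area at the start of a 32-bit stack frame. */
struct link_area {
    uint32_t back_chain;   /* offset 0: pointer to the caller's frame   */
    uint32_t saved_cr;     /* offset 4: Condition Register              */
    uint32_t saved_lr;     /* offset 8: Link Register                   */
    uint32_t reserved[2];  /* assumed: words not described in the text  */
    uint32_t saved_toc;    /* assumed position for the TOC pointer      */
};

/* Model of the prolog stores: the callee writes into the CALLER's
   link area, because r1 still points at the caller's frame. */
void save_cr_lr(struct link_area *callers_frame, uint32_t cr, uint32_t lr)
{
    assert(callers_frame != NULL);
    callers_frame->saved_lr = lr;   /* stw r0,8(r1) after mflr r0 */
    callers_frame->saved_cr = cr;   /* stw r0,4(r1) after mfcr r0 */
}
```

Note that bar() stores into foo()'s link area, not its own; bar()'s link area is in turn used by whatever routine bar() calls.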

Figure 13-7 Saving Values into foo()'s Stack Frame
(bar()'s stack frame, under construction, sits above foo()'s stack frame; the
CR and LR are stored into the Link Area of foo()'s stack frame.)
One additional register, the FPSCR, is a special case because it generally doesn't
need to be saved and restored, but there are situations where it should be. If any
of the enable (VE, OE, UE, ZE, or XE) or mode (NI or RN) bits of the FPSCR are
changed, then the routine should save and restore the FPSCR, because it is
impolite for a routine to globally change the floating-point model. The other bits of
the FPSCR, which may be set as a side effect of executing floating-point
instructions, are volatile, and the FPSCR does not need to be saved if they are modified.
No special storage location is set aside for the FPSCR. If a routine needs to save
and restore it, the routine must allocate space in its local storage area.
Creating the Stack Frame

After all the registers have been saved, the stack frame for bar() can be created.
This is usually a single instruction:
    stwu r1,-frame_size(r1)
where the frame_size value is calculated from the sum of the sizes of the five areas
comprising the stack frame, plus some padding bytes to ensure that the stack
frame always starts on a quadword boundary. Hence:
    frame_size = link_size + arg_size + local_size + gpr_size + fpr_size + padding
where:
link_size is always six words (24 bytes) in length.
arg_size is large enough to hold all arguments needed for any function that
bar() calls. Note that this is not related to the arguments that are passed
into bar(). This area is for the arguments that bar() will (possibly) pass to
another routine. This is always at least eight words (32 bytes) in length.
local_size is the size of the local storage area that bar() needs, or 0 if it
doesn't need any local storage.
gpr_size is large enough to hold all of the GPRs that bar() is saving and
restoring. This can range from 0 to 19 words (76 bytes).
fpr_size is large enough to hold all of the FPRs that bar() is saving and
restoring. This can range from 0 to 18 doublewords (144 bytes).
padding is the number of extra bytes needed to ensure that the stack pointer
is always quadword (16-byte) aligned.
Because the link, argument, and local storage area are allocated from the top of the
stack frame, and the FPR and GPR save areas are allocated from the bottom, the
padding bytes fall between the local storage area and the GPR save area.
It is important to note that this one Store with Update instruction performs two
critical functions: it allocates the new stack frame on the stack, and it saves the
back chain (pointer to the previous stack frame) at offset 0 into the newly created
stack frame.
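The frame size formula can be worked through in C. This is a sketch for the 32-bit case only: it assumes 4-byte words, treats the padding as whatever rounds the total up to a multiple of 16, and uses hypothetical names throughout.

```c
#include <assert.h>

#define LINK_SIZE     24u   /* six 4-byte words                 */
#define MIN_ARG_SIZE  32u   /* eight 4-byte words               */
#define WORD           4u   /* one saved 32-bit GPR             */
#define DOUBLEWORD     8u   /* one saved 64-bit FPR             */
#define QUADWORD      16u   /* required stack pointer alignment */

/* Round n up to the next multiple of 16. */
unsigned round_up_quadword(unsigned n)
{
    return (n + QUADWORD - 1u) & ~(QUADWORD - 1u);
}

/* Total frame size: link + argument + local + GPR save + FPR save,
   plus the padding needed to reach a quadword boundary. */
unsigned frame_size(unsigned arg_size, unsigned local_size,
                    unsigned saved_gprs, unsigned saved_fprs)
{
    unsigned unpadded;
    assert(saved_gprs <= 19u && saved_fprs <= 18u);
    if (arg_size < MIN_ARG_SIZE)
        arg_size = MIN_ARG_SIZE;           /* argument area has a minimum */
    unpadded = LINK_SIZE + arg_size + local_size
             + saved_gprs * WORD + saved_fprs * DOUBLEWORD;
    return round_up_quadword(unpadded);
}
```

For example, a routine with no locals and no saved registers still needs 24 + 32 = 56 bytes, which is padded to a 64-byte frame.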
Execution of bar()
Finally, bar() can execute its code and accomplish the tasks that it needs to,
including calling other routines.
Epilog of bar()
During the epilog, bar() must restore the registers, restore foo()'s stack
frame, and then return control to foo().
Restoring the Non-volatile Registers

The FPRs and GPRs can be restored by loading from the values stored in the
register save area. For GPRs, the Load Multiple instruction can be used, but since
there is no equivalent instruction for FPRs, some other method must be used.
The problems associated with saving and restoring registers are covered later in
§13.8, "Saving Registers on the Stack."
Restoring the CR and LR is just as easy as saving them. The code to restore these
registers is:
    # restore the Link Register
    lwz r0,frame_size+8(r1)
    mtlr r0
    # restore the Condition Register
    lwz r0,frame_size+4(r1)
    mtcr r0
Technically only the CR fields that have changed need to be restored, but some
PowerPC implementations may execute a complete CR restore instruction sig-
nificantly faster than they would execute a partial CR restore.
Restoring the Stack Frame

To restore foo()'s stack frame, we add the size of bar()'s stack frame to the
stack pointer:
    addi r1,r1,frame_size
Returning Control to foo()
Because the LR has been restored, it now holds the return address in foo().
This means that control can be returned to foo() by simply executing a Branch
to Link Register instruction:
    blr
Return to foo()
At this point, control has been returned to foo(), and the stack and all of the
non-volatile registers have been restored. Execution in foo() continues.
13.7 An Even Simpler Subroutine Call

One thing was not made explicit in the simple subroutine call: it was assumed
that the routine bar() needed a stack frame. A routine needs a stack frame only
in two cases:
• If bar() calls another routine. A stack frame is needed because the
  arguments for the routine being called must be stored in the argument
  area for bar(). foo()'s argument area holds the arguments for
  bar() and cannot be reused to also hold the arguments for another
  routine.
• If bar() requires more than 220 bytes of storage area. The magical
  "220 byte" value comes from the maximum size of the GPR and FPR
  save area that all interrupt-level code will skip over before allocating
  space on the stack. If bar() can fit its register save area and its local
  storage area into 220 bytes, then it can get away without having a
  stack frame.
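The two conditions reduce to a simple test. The C sketch below assumes the 32-bit limit of 220 bytes; needs_stack_frame is a hypothetical name.

```c
#include <assert.h>

#define PROTECTED_BYTES 220u   /* area interrupt-level code skips over */

/* Decide whether a routine must build its own stack frame.
   save_bytes is the size of its register save area and local_bytes
   is the size of its local storage area. */
int needs_stack_frame(int calls_other_routines,
                      unsigned save_bytes, unsigned local_bytes)
{
    assert(save_bytes <= PROTECTED_BYTES);   /* 19 GPRs + 18 FPRs at most */
    if (calls_other_routines)
        return 1;                            /* needs its own argument area */
    return save_bytes + local_bytes > PROTECTED_BYTES;
}
```

A leaf routine that saves two GPRs and keeps 100 bytes of locals, for instance, fits within the limit and can remain frameless.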
If a routine doesn't require a stack frame, then there's no sense in creating one.
This section will step through the subroutine call where it is assumed that
bar() does not need a stack frame.
Before Calling bar()

The frameless routine is called like an ordinary routine, through a bl
instruction. The stack for foo() is set up just as it was in the previous section.
Basically, foo() doesn't know (or care) if any routine it calls has a stack frame or
not. It just passes control to the routine and waits for control to return.
Prolog of bar()
The only task that the prolog needs to perform is saving the non-volatile
registers that bar() needs to use. This is done above the stack pointer (where
bar()'s stack frame would be if it were to build one). §13.8, "Saving Registers
on the Stack," discusses in detail how this is done.
The CR and LR should also be saved. Because space has been allocated for them
in foo ( ) 's stack frame, they can be saved using the same code given in the pre-
vious section:
# save the Link Register
mflr rO
stw r0,8(rl)
# save the Condition Register

mfcr rO
stw r0,4(rl)
In general, the LR should be saved even though it may seem unneeded (because
bar( ) doesn't call any functions). There are two reasons for this. First, there
may be some constructs (like case statements) that use the LR; and, second, some
system-level routines are branched to (saving the LR), but are not considered
real subroutines because they do not require a stack frame (see g13.8, "Saving
Registers on the Stack).
Execution of bar()
As in the previous description, the execution of bar() continues as it normally
would. The only differences are that bar() is not allowed to call any other
routines, and any of bar()'s local variables must be accessed using a negative
offset from the top of the stack (r1), because no new frame has been created.
Epilog of bar()
Because there is no stack frame to restore, the epilog code just restores the
non-volatile registers that were used by bar() and then returns control to foo().
For the GPRs and FPRs, the original values are pulled from above the stack and
stuffed into the appropriate registers. See §13.8, "Saving Registers on the Stack,"
for a full discussion of how this should be done.
For the CR and LR, the original values were stored at offsets 4 and 8 of foo()'s
stack frame. This is the same code used in the previous section, with the
simplification that the frame_size is known to be 0.
# restore the Link Register
lwz r0,8(r1)
mtlr r0
# restore the Condition Register
lwz r0,4(r1)
mtcr r0
Not surprisingly, the instruction to return to foo() is the standard:
blr
13.8 Saving Registers on the Stack

The previous sections hinted that saving the GPRs and FPRs on the stack was
not quite as straightforward as it might seem. This section finally explains the
complications surrounding the seemingly simple task of saving and restoring
registers.
Using the Load and Store Multiple Instructions

For two really good reasons, the Load and Store Multiple instructions are not
encouraged as part of the standard register saving mechanism.
The first is the ominous hint in the PowerPC Architecture manual that the Load
and Store Multiple instructions may (on some PowerPC implementations) execute
significantly slower than an equivalent series of loads or stores. Basically,
this means that the hardware designers aren't going to bother wasting transistors
on these instructions: if it's easy to add support for them, then it might be
thrown in, but otherwise it'll be emulated in software. Note that this doesn't
mean that a series of loads or stores will always be faster than the analogous Load
or Store Multiple instruction. On processors with unified caches (like the 601), the
instruction fetches from the series of loads/stores can collide with the data cache
accesses and thus be less efficient than the Load or Store Multiple.
The second reason is that equivalent instructions do not exist for the FPRs, and
there are no planned instructions to support 64-bit GPRs. This means that the
only time these instructions can be used is with the GPRs on 32-bit PowerPC
implementations, and they might be horribly inefficient.
So, because a mechanism is needed for the FPRs anyway, it might as well be
generalized to handle the GPRs.
Saving GPRs Only

The easiest case to describe is where a series of GPRs need to be saved, but no
FPRs. In this case, the GPR save area is located immediately above the stack
frame for the previous routine. Figure 13-8 shows the save area when only the
GPRs need to be saved.
Figure 13-8 Saving GPRs into bar()'s Soon-to-be-constructed Stack Frame
[Figure: bar's stack frame (under construction), containing the GPR Save Area, shown above foo's stack frame.]
There are three common ways of accomplishing this. The first is to use the Store
Multiple instruction, which has already been presented as a potentially bad idea.
However, it doesn't hurt to show how it would be done. The second and third
methods both involve using a series of Store instructions. The second method
has these instructions inline, and the third branches to a system routine that per-
forms the appropriate register saves.
Using the Store Multiple Instruction

The stmw instruction automatically stores all of the registers, from a specified
starting register up to r31, at a specified effective address. By specifying the
appropriate offset above the current stack pointer as the effective address, the
GPRs can be saved above the stack using one instruction.
There is really only one variable here, the number of registers that need to be
saved. If this variable is called N, and it is allowed to range from one (only r31
is saved) to 19 (r13 through r31 are saved), then the instruction needed to save
this number of registers above the stack pointer would be:
stmw 32-N,-4*N(r1)
So, for N = 1, this would be:
stmw r31,-4(r1)
For N = 19, it would be:
stmw r13,-76(r1)
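The formula can be checked mechanically. The following Python sketch simply substitutes N into the formula above and prints the resulting instruction text; the function name stmw_for is a hypothetical helper, not part of any assembler or toolchain.

```python
def stmw_for(n):
    """Build the Store Multiple instruction that saves the last n GPRs."""
    if not 1 <= n <= 19:
        raise ValueError("n must be between 1 and 19")
    # First register saved is r(32-n); the save area starts 4*n bytes above r1.
    return f"stmw r{32 - n},{-4 * n}(r1)"

for n in (1, 5, 19):
    print(n, stmw_for(n))
```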
Using a Series of Store Instructions

After the formula for the Store Multiple instruction is known, the Store instruc-
tions are easy to generate because the arguments are basically the same. To save
five registers, this series of stores would be used:
stw r27,-20(r1)
stw r28,-16(r1)
stw r29,-12(r1)
stw r30,-8(r1)
stw r31,-4(r1)
To save all the GPRs, a series of 19 stw instructions would be used:
stw r13,-76(r1)
stw r14,-72(r1)
... # 15 more stw instructions
stw r30,-8(r1)
stw r31,-4(r1)
Branching to System-Provided GPR Save Routine

Because the stw instructions are always the same series of instructions, it makes
sense to put the instructions in a system-level routine that any routine can
branch to. If the system provides this, then the registers can be stored by branch-
ing to the appropriate entry point in this system routine.
As an example, consider the following (abbreviated) implementation of such a
system routine with the listed entry points:
_savegpr13: stw r13,-76(r1)
_savegpr14: stw r14,-72(r1)
_savegpr15: stw r15,-68(r1)
...
_savegpr30: stw r30,-8(r1)
_savegpr31: stw r31,-4(r1)
blr
In order to save nine registers (r23 through r31), a routine could branch to the
"gpr23" entry point:
mflr r0
bla _savegpr23
Note that the LR must be saved (in this case, in r0) before the save routine is
called, because the branch with link overwrites the LR.
The bla (Branch with Link Absolute) instruction is used to call the save routine
because it is assumed that the routine is at a fixed location determined by the OS.
If the routine is not at a fixed location, then the bl (Branch with Link) instruction
would be used instead.
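The effect of entering the save routine partway through can be modeled with a short Python sketch. It assumes the routine is laid out exactly as in the abbreviated listing (one stw per register, falling through to a single blr); the function name is hypothetical.

```python
def save_routine_stores(first_reg):
    """List the stores executed when entering at _savegpr<first_reg>.

    Execution falls through every later entry point until the final blr.
    """
    if not 13 <= first_reg <= 31:
        raise ValueError("entry points exist only for r13 through r31")
    return [f"stw r{r},{-4 * (32 - r)}(r1)" for r in range(first_reg, 32)]

stores = save_routine_stores(23)   # the "gpr23" entry point
print(len(stores), "stores, starting with", stores[0])
```

Entering at the "gpr23" entry point executes nine stores, one for each of r23 through r31.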
Restoring GPRs Only

Just as there are three ways to save the GPRs, there are three ways to restore the
registers: using a Load Multiple instruction; using a series of loads; or using a
system-provided routine to load the registers.
It is important to note that these methods must be applied after the original
routine's stack pointer has been restored in r1.
Using the Load Multiple Instruction

A Load Multiple instruction with the same parameters as the Store Multiple
instruction used to save the registers restores all the GPRs in one instruction.
To summarize, this instruction should be used:
lmw 32-N,-4*N(r1)
where N is the number of registers that were saved, ranging from 1 to 19.

Using a Series of Load Instructions

Because this series of load instructions is exactly the same sequence provided in
the system GPR load routine described in the next section, it isn't necessary to go
into detail here; the same information would be duplicated.
However, it is perfectly valid to take this system GPR load routine and copy it
inline into a routine.
Branching to System-Provided GPR Restore Routine

The standard GPR restore routine is a series of load instructions that are analo-
gous to the store instructions in the GPR save routine:
_restgpr13: lwz r13,-76(r1)
...
_restgpr30: lwz r30,-8(r1)
_restgpr31: lwz r31,-4(r1)
blr
A routine can restore the registers by branching to the appropriate entry point.
For example, to restore registers r23 to r31, this branch would be used:
bla _restgpr23
Saving FPRs Only

Now that the save procedure for the GPRs has been discussed, it is much easier
to describe the save procedure for the 18 FPRs. There are two options: the pro-
vided FPR save routine or an inline copy of this routine. Because these are both
the same code, only the system FPR save routine will be discussed.
A system FPR save routine might look like this:
_savefpr14: stfd fr14,-144(r1)
...
_savefpr30: stfd fr30,-16(r1)
_savefpr31: stfd fr31,-8(r1)
blr
In order to save nine floating-point registers (fr23 through fr31), a routine
could branch to the "fpr23" entry point:
mflr r0
bla _savefpr23
As with the GPR case, the LR must be saved before the save routine is called.
Restoring FPRs Only

Not surprisingly, the standard FPR restore routine is just a series of floating-
point load instructions that follow the same pattern as the store instructions in
the FPR save routine:
_restfpr14: lfd fr14,-144(r1)
_restfpr15: lfd fr15,-136(r1)
_restfpr16: lfd fr16,-128(r1)
...
_restfpr30: lfd fr30,-16(r1)
_restfpr31: lfd fr31,-8(r1)
blr
A routine to restore floating-point registers fr23 to fr31 would use this branch
instruction:
bla _restfpr23
Saving Both GPRs and FPRs

Saving both GPRs and FPRs gets somewhat messy because the FPRs get
stored first and the GPRs must be stored immediately above them. Where the
GPRs get stored now depends on the number of FPRs saved, so a simple scheme
of storing them at a constant offset from the current stack pointer (as was done
above) won't work anymore.
The way around this is to calculate and use a GPR base address instead of the
stack pointer, as shown in Figure 13-9. This base address is easy to calculate from
the current stack pointer and the number of FPRs being saved:
subi r12,r1,8*number_of_FPRs

Figure 13-9 The GPR Save Area Is Immediately above the FPR Save Area.
[Figure: from top to bottom, the new frame under construction, the GPR Save Area, the FPR Save Area, and the old frame.]
If the standard GPR save routines are rewritten to use r12 instead of r1, then
this code can be used to save six GPRs and five FPRs:
mflr r0
subi r12,r1,8*5
bla _savegpr26
bla _savefpr27
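The same sequence can be generated for any register counts. The Python sketch below is a hypothetical helper that assumes the r12-based GPR save routine described above and the entry-point naming used in this section; it only produces instruction text and does not assemble anything.

```python
def save_sequence(num_gprs, num_fprs):
    """Return the prolog instructions that save GPRs above the FPR save area."""
    if not (1 <= num_gprs <= 19 and 1 <= num_fprs <= 18):
        raise ValueError("expected 1..19 GPRs and 1..18 FPRs")
    return [
        "mflr r0",                        # preserve the LR before the bla calls
        f"subi r12,r1,{8 * num_fprs}",    # GPR base = r1 minus FPR save area size
        f"bla _savegpr{32 - num_gprs}",   # assumes the r12-based GPR routine
        f"bla _savefpr{32 - num_fprs}",
    ]

for line in save_sequence(6, 5):
    print(line)
```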
Restoring Both GPRs and FPRs

Restoring GPRs and FPRs is done the same way as the GPR/FPR save: an extra
register is used to calculate the GPR base address. As with the save routine, the
GPR restore routine must be rewritten to use this new base register.
To continue the example, this code could be used to restore the six GPRs and five
FPRs saved:
subi r12,r1,8*5
bla _restgpr26
bla _restfpr27
System Save/Restore Routine in Practice

In practice, the GPR/FPR save and restore routines can be slightly different from
the examples. These differences arise because some systems provide two types
of GPR save and restore routines (one based on r1 and another based on r12),
and because the routines are sometimes expanded to perform duties in addition
to saving and restoring GPRs or FPRs.
Multiple GPR Save/Restore Routines

Because it is convenient to have GPR save and restore routines based on both r1
and r12, many systems provide both. These routines are identical except for the
base register used.
To differentiate between these two versions, the standard names of the routines
are changed: _savegpr0_xx and _restgpr0_xx reference the save/restore
off of r1, and _savegpr1_xx and _restgpr1_xx reference the save/restore
off of r12.
Note that the '0' variant is used when only the GPRs are being saved, and the '1'
variant is used when both GPRs and FPRs are being saved.
Additional Functionality
Because these routines are always part of a function's prolog or epilog, it makes
sense to include some other prolog/epilog tasks in their code.
Additional Functionality for Save Routines
For the save routines, it's easy to add the store of the original LR into the caller's
stack frame because it is something that the function prolog needs to do anyway.
If the LR is saved in r0 before calling the save routine (as it should be), the LR
can be saved in the proper place using:
stw r0,8(r1)
For the GPR save routine, this instruction only needs to be added to the '0'
variant because the '1' variant is always used with the FPR save routine, which will
presumably handle the LR save.
The FPR save routine that includes an LR save is differentiated from the standard
FPR save routine by adding an underscore between the 'fpr' and the register
number. Thus, _savefpr_29 would be used instead of _savefpr29.
Additional Functionality for Restore Routines
The restore routines can restore the saved LR value and return directly to the
value stored there, eliminating the need to return back to the function perform-
ing the restore. This is done by adding code to the end of the restore routine:
lwz r0,8(r1)
mtlr r0
blr
When this version of a restore routine is used, it isn't necessary to use the Branch
with Link form to call the restore routine (no damage will occur if it is used). Only
the non-Link branch form is needed because the restore routine will return
directly to the caller.
Rewriting the example that restores six GPRs and five FPRs, the restore routines
would be called as:
subi r12,r1,8*5
bla _restgpr1_26
ba _restfpr_27
As with the save routines, this extra functionality only needs to be added to the
'0' variant of the GPR restore routines. The FPR restore routine that restores the LR
adds an underscore to the name just like the FPR save routine does. Thus,
_restfpr_27 is used in the above example instead of _restfpr27.
13.9 Stack Frames

Stack frames have been discussed throughout this chapter, but not thoroughly
so that other, more interesting topics could be discussed sooner. This section pro-
vides a detailed analysis of the structures that comprise a routine's stack frame.
A stack frame is composed of five basic areas as shown in Figure 13-10. To help
speed up data accesses, stack frames are always quadword aligned and the areas
within the frame are either doubleword or word aligned. The required align-
ment for the stack frame areas is also shown in Figure 13-10. Quadword bound-
aries are marked with a 'Q', doubleword boundaries with a 'D', and word
boundaries with a 'W'.
Because the Link, Argument, and Local Storage areas grow down from the top
of the frame, and the FPR and GPR save areas grow up from the bottom of the
frame, some unused "padding" bytes may be left between the Local Storage
Area and GPR Save Area.
As in the previous sections, the routines in this discussion will be referred to by
name instead of usage (because usage may change with the context). Thus, the
routine that owns the stack frame will be known as foo(), and bar() will be
used as an example routine that is called by foo(). The routine that originally
called foo() will be referred to as sna(). Hence, the complete calling chain is
sna() → foo() → bar(). This terminology helps simplify the discussion by
eliminating some unwieldy phrases needed to identify each routine properly.
Figure 13-10 Stack Frame Structure Showing the Required Alignment
[Figure: the stack frame areas with their required alignment; possible padding bytes are marked.]
Link Area
The Link Area is a 24-byte area that is used by both foo() (the stack frame
owner) and bar(). Table 13-4 lists the area's fields.
Table 13-4 Link Area Fields
The first word (at offset 0) contains a pointer to the stack frame of the routine
that called foo(), in this case sna(). This field is initialized by foo() as part
of its function prolog and is used by foo()'s epilog to restore the original stack
frame when the routine is complete.
The next two fields are used by routines that foo() calls, in this case bar(), so
that they have a place to store the LR and CR. These values are stored here
because it is possible that bar() will not need to create a stack frame and it is
convenient to always save these register values in the same place.
The next two fields are reserved for use by compilers and binders, but they are
generally left unused.
The last field is a storage space set aside for bar() to save its TOC pointer when
it makes out-of-module function calls. Like the LR and CR storage areas, this
area is part of foo()'s stack frame so that all routines have a common place to
store their TOC value, whether or not they create a stack frame.
Because the Saved CR, LR, and TOC fields are presumed to be available to any
routine that foo() calls, foo() must create a stack frame if it calls any other
routine.
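The field layout can be written down as a small table of offsets. The Python sketch below is inferred from the description above and from the offsets used in the prolog code earlier in the chapter (CR at offset 4, LR at offset 8). It assumes six consecutive 32-bit words, and the field labels are illustrative rather than official names.

```python
# Offsets inferred from the text; field labels are illustrative, not official.
LINK_AREA_FIELDS = [
    (0,  "back chain (caller's stack frame)"),
    (4,  "saved CR"),
    (8,  "saved LR"),
    (12, "reserved"),
    (16, "reserved"),
    (20, "saved TOC"),
]
LINK_AREA_SIZE = 4 * len(LINK_AREA_FIELDS)   # six words

def link_area_offset(label_prefix):
    """Return the offset of the first field whose label starts with label_prefix."""
    for offset, label in LINK_AREA_FIELDS:
        if label.startswith(label_prefix):
            return offset
    raise KeyError(label_prefix)

print(LINK_AREA_SIZE, link_area_offset("saved LR"))
```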
Argument Area
The Argument Area is a storage place that foo() can use to hold arguments that
are being passed to bar(). Table 13-5 lists the area's fields. Note that this is not
where the arguments to foo() are stored. The arguments being passed to
foo() are stored in the Argument Area owned by sna().
Table 13-5 Argument Area Fields

Offset  Size  Description                            GPR
0       W     1st parameter word                     3
4       W     2nd parameter word                     4
...
24      W     7th parameter word                     9
28      W     8th parameter word                     10
32            additional parameters (if necessary)
Because the Argument Area is used to hold the arguments that foo() is passing
to bar(), it must be large enough to hold all the arguments that bar() expects.
If foo() calls multiple routines, then this area must be large enough to hold all
the arguments for the routine that requires the most argument space.
One restriction on the size of the Argument Area is that it is always at least eight
words in size. The first eight words of arguments would be placed in these eight
words if they weren't placed in GPRs 3 through 10. These eight words are
always allocated because it may be necessary for bar ( ) to take the address of
one of the arguments. In this case, the value would be written from the GPR into
the analogous slot in the Argument Area and the address into the Argument
Area would be used.
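The sizing rule reduces to taking a maximum with a floor of eight words. A minimal Python sketch follows, assuming the argument byte count of every routine that foo() calls is already known; the function name is hypothetical.

```python
MIN_ARGUMENT_AREA = 8 * 4   # eight words are always allocated

def argument_area_size(callee_arg_bytes):
    """Size of the caller's Argument Area given each callee's argument bytes."""
    return max([MIN_ARGUMENT_AREA, *callee_arg_bytes])

print(argument_area_size([16, 44, 8]))
```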
Local Storage Area

The Local Storage Area is a free-form data area that begins immediately below
the Argument Area. Its size is determined by the amount of stack space that
foo() allocates for its stack frame.
GPR Save Area

The GPR Save Area is an area set aside for storing the original values of the
non-volatile GPRs so that they can be restored before foo() returns control to
sna().
The number of registers saved here depends on the number of non-volatile
registers that foo() uses. This can vary from zero to 19 registers, which results in
this area's size varying from zero to 76 bytes. The position depends on the size
of the FPR Save Area because the GPR Save Area is defined to be placed
immediately above the FPR Save Area.
FPR Save Area

The FPR Save Area is where the original values of the non-volatile FPRs are
stored so that foo() can restore them when it exits. The size of this area varies
from 0 to 144 bytes, depending on the number of FPRs (0 through 18) that need
to be saved.
13.10 Passing Arguments to Routines

The subroutine calling convention used on most PowerPC systems is register-
based. This means the parameters are passed to the routine using agreed-upon
registers. This scheme works very well in most cases. However, in some situa-
tions, this simple scheme breaks down, for example:
When there are >8 words of fixed-point arguments.
When there are >13 doublewords of floating-point arguments.
When there are a large number of both fixed- and floating-point
arguments.
When passing structures or other complex data types into a routine.
The best way of describing the parameter passing scheme is to formalize the
basic system and then to extend it to handle the situations listed. This section
includes many examples that demonstrate how the function parameters are
assigned to the registers.
Arguments for Simple Routines

A simple routine (as far as this section is concerned) has a small number of
parameters, all of which are basic types. For example, the following routine has
four fixed-point parameters:
void f1(int a, char b, short c, long d)
These four parameters are assigned to the GPRs from left to right, starting with
GPR 3 (the first GPR available for use as a parameter). Thus, the register
assignments listed in Table 13-6 are made.
Table 13-6 Arguments for void f1(int a, char b, short c, long d)
Arg     1      2      3      4
Type    int    char   short  long
Offset  (0)    (4)    (8)    (12)
GPR     3      4      5      6
FPR     -      -      -      -
Notes: the char (argument 2) is in the low-order byte of the register; the
short (argument 3) is in the low-order halfword of the register.
Because this table format will be used throughout this section to describe the
argument-passing conventions, it is worthwhile to spend a few paragraphs to
describe how the data is arranged in the table.
First of all, each argument is described in a separate column. Since there are four
arguments, there are four columns numbered 1 through 4. This numbering is
useful because it is convenient to refer to arguments by number instead of name
or type.
For each argument, the argument type, an offset, and a GPR/FPR allocation will
be given. In addition, sometimes notes on the bottom describe some special
feature of the argument.
The argument type is simply the type originally defined for the variable,
typically something like int, char, or long.
The offset for each argument is the offset into the Argument Area that is being
used to hold the arguments. In many cases, the arguments are passed in registers
and the space left in the Argument Area is unused. In these cases, the offset will
be displayed in parentheses.
The last two rows are the GPR or FPR assignment for this argument. For GPRs,
this ranges between 3 and 10, and for FPRs it ranges from 1 to 13. Not all
arguments have register assignments, and it is possible for an argument to be
assigned to both a GPR and an FPR.
Routines with Integer Arguments

Routines with only integer arguments are the most basic in terms of how the
data is allocated.
A couple of things are important to note about the example arguments described
in Table 13-6. The first item of note is that each argument is assigned to a differ-
ent register in order starting with GPR 3. The arguments are not combined in
any way to save registers (as could possibly have been done with arguments 2
and 3). The second is that the data is always placed in the low-order (rightmost)
bits of the register. This second rule is true even when the data is allocated in the
Argument Area.
Routines with Floating-Point Arguments

Floating-point arguments are allocated to floating-point registers in the same
manner that integer arguments are allocated to GPRs.
In Table 13-7 the four floating-point values are assigned to the first four available
FPRs. Each value is also mapped into an appropriate number of bytes in the
Argument Area: eight bytes for double-precision values and four bytes for
single-precision values.
Table 13-7 Arguments for void f2(double a, double b, float c, double d)
Routines with Both Integer and Floating-point Arguments

When both integer and floating-point arguments are being passed to a routine,
things become somewhat complicated. The problem stems from the fact that the
eight words in the argument area are assigned to GPRs 3-10, and now some
floating-point values may need to be mapped into that area.
The conflict is resolved by not using the GPRs that would be assigned to the
same area in the Argument Area as the FPRs that need to be mapped into that
area. The tables show this by allocating the GPRs to the floating-point values
and listing the GPR numbers in parentheses.
Table 13-8 shows how GPRs 4 and 5 are set aside to eliminate any conflict with
the double-precision value stored in FPR 1, and GPR 7 is set aside for FPR 2
(which contains only a single-precision value). These three GPRs are not used
for argument passing, but they can be used by the called routine for any other
purpose.

Table 13-8 Arguments for void f3(int a, double b, char c, float d, int e)
Arg     1      2       3      4      5
Type    int    double  char   float  int
Offset  (0)    (4)     (12)   (16)   (20)
GPR     3      (4,5)   6      (7)    8
FPR     -      1       -      2      -
Routines with More than 32 Bytes of Arguments

If there are more than 32 bytes of arguments, then some arguments will not fit
into the GPRs and must be stored directly in the Argument Area. Whenever
there are more than 32 bytes of arguments, all of the argument bytes beyond the
32nd must be stored in the Argument Area.
In Table 13-9, the first six parameters fill up the entire 32 bytes of the Argument
Area, so it must be expanded to 44 bytes to hold all of the arguments.
Table 13-9 Arguments for void f4(int a, double b, double c,
float d, int e, int f, int g, double h)
Another interesting aspect of this example is that even though the GPRs have all
been expended and the Argument Area has been partially used, the FPRs can
continue to be allocated as long as they are available. In this case, the floating-
point values must still be passed in the Argument Area, but they may also be
passed in the appropriate FPRs.
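The allocation rules from the last few subsections can be combined into one small routine. The Python sketch below is only a model of the rules as stated in this chapter, not any compiler's implementation: it assumes every integer type occupies one word, a float occupies four bytes, a double occupies eight bytes with no alignment padding, GPRs 3 through 10 shadow the first eight words, and FPRs 1 through 13 are handed out in order. All names are hypothetical.

```python
WORD = 4
SIZES = {"int": 4, "char": 4, "short": 4, "long": 4, "float": 4, "double": 8}
FLOAT_TYPES = {"float", "double"}

def assign_arguments(types):
    """Assign each argument an Argument Area offset, GPR slot(s), and FPR."""
    offset, next_fpr, rows = 0, 1, []
    for t in types:
        words = range(offset // WORD, (offset + SIZES[t]) // WORD)
        gprs = [3 + w for w in words if w < 8]     # GPRs 3..10 map to words 0..7
        fpr = None
        if t in FLOAT_TYPES and next_fpr <= 13:
            fpr, next_fpr = next_fpr, next_fpr + 1
        rows.append({
            "type": t,
            "offset": offset,
            "gprs": gprs,                             # set aside (not loaded) for floats
            "fpr": fpr,
            "in_memory": any(w >= 8 for w in words),  # beyond the 32nd byte
        })
        offset += SIZES[t]
    return rows, max(offset, 32)                      # area is never under 8 words

f3, f3_size = assign_arguments(["int", "double", "char", "float", "int"])
f4, f4_size = assign_arguments(
    ["int", "double", "double", "float", "int", "int", "int", "double"])
for row in f4:
    print(row)
```

Under these assumptions the last two arguments of f4 fall past the 32nd byte, so they are stored in the Argument Area, while the final double still receives FPR 4.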
Routines without Prototypes

The only difference between routines with and without prototypes is how the
floating-point parameters are handled. If a called routine does not have a proto-
type, then the floating-point values are passed in both the GPRs and the FPRs.
The mapping follows the same technique used when there were mixed integer
and floating-point arguments. The difference is that now the GPRs aren't just
being allocated. They are actually being used to store the floating-point value (in
addition to the FPR, which stores the same value).
To show some of these register assignments, routines with floating-point arguments
are revisited in Tables 13-10 and 13-11 to show the GPR allocation.
Table 13-10 Arguments for void f2(double a, double b, float c, double d)
Arg     1       2       3      4
Type    double  double  float  double
Offset  (0)     (8)     (16)   (20)
GPR     3,4     5,6     7      8,9
FPR     1       2       3      4
Table 13-11 Arguments for void f3(int a, double b, char c, float d, int e)
Routines with an Ellipsis

An ellipsis indicates that the routine can have a variable number of unspecified
arguments, which is very similar to a routine that doesn't have a prototype.
If a routine prototype has an ellipsis, then it is treated as if the prototype didn't
exist, and the arguments are passed in both the GPRs and FPRs.
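When a double is also passed in the GPRs, it is split across two registers. The Python sketch below illustrates that split under the assumption of a big-endian memory image, so that the first GPR of the pair receives the high-order word. The helper name is hypothetical and the sketch models only the bit pattern, not the calling sequence.

```python
import struct

def double_to_gpr_words(value):
    """Split a double into the two 32-bit words that would be copied into GPRs."""
    high, low = struct.unpack(">II", struct.pack(">d", value))
    return high, low

high, low = double_to_gpr_words(1.0)
print(f"first GPR = 0x{high:08X}, second GPR = 0x{low:08X}")
```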
Passing Complex Arguments

Complex arguments like structures are passed using the registers just as if the
elements of the structure were passed to the routine individually. The only
exception to this rule is that arguments that do not require the entire register
word (like chars and shorts) are left justified in the register instead of right justi-
fied. The reason for this difference is to match the way the structure is stored in
memory.
The example routine in Table 13-12 assumes the definition of the structure:
typedef struct {
    int a;
    short b;
    double c;
} data;
Table 13-12 Arguments for void f5(data abc, double d, int e)
          |------- data -------|
Arg       1      2      3       4       5
Type      int    short  double  double  int
Offset    (0)    (4)    (8)     (16)    (24)
GPR       3      4      (5,6)   (7,8)   9
FPR       -      -      1       2       -
Note: the short (argument 2) is in the high-order halfword of the register.
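The difference between the two justifications is easiest to see as a bit pattern. The following Python sketch assumes 32-bit registers and non-negative values; both helper names are hypothetical.

```python
def right_justified(value, bits):
    """Register image of a scalar argument: value in the low-order bits."""
    return value & ((1 << bits) - 1)

def left_justified(value, bits):
    """Register image of a structure member: value in the high-order bits."""
    return (value & ((1 << bits) - 1)) << (32 - bits)

# A 16-bit short holding 0x1234, passed directly and as a structure member.
print(f"0x{right_justified(0x1234, 16):08X}")
print(f"0x{left_justified(0x1234, 16):08X}")
```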
13.11 Retrieving Results from Routines

Like arguments that are passed to routines, the results returned to the calling
routine are all register based.
Integer Results
Integer function results are returned in registers r3 and r4. The rules governing
the use of these registers are:
int, long, and pointer values are returned in r3.
Unsigned char and short values are returned in r3, where the value is
right justified in the register and zero-extended.
Signed char and short values are returned in r3, where the value is
right justified in the register and sign-extended.
Bit fields of 32 bits or less are returned right justified in r3.
64-bit fixed-point values are returned in r3:r4, where r3 contains the
high-order portion of the value and r4 contains the low-order
portion.
Floating-Point Results
Floating-point function results are returned in registers fr1 through fr4,
according to these rules:
Single-precision (32-bit) values are returned in fr1.
Double-precision (64-bit) values are returned in fr1.
Long double-precision (128-bit) values are returned in fr1:fr2,
where fr1 contains the high-order 64 bits and fr2 contains the
low-order 64 bits.
Single- or double-precision complex values are returned in fr1:fr2,
where fr1 contains the real single- or double-precision portion, and
fr2 contains the imaginary single- or double-precision portion.
Long double-precision complex values are returned in fr1:fr4,
where fr1:fr2 contain the high- and low-order bits of the real
portion, and fr3:fr4 contain the high- and low-order bits of the
imaginary portion.
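Two of the integer rules lend themselves to a quick illustration: splitting a 64-bit result across r3:r4, and extending a char or short result to fill r3. The Python sketch below assumes 32-bit registers; the helper names are hypothetical.

```python
def split_64bit_result(value):
    """Return (r3, r4) for a 64-bit fixed-point result."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value >> 32, value & 0xFFFFFFFF     # high-order word, low-order word

def extend_small_result(value, bits, signed):
    """Return the 32-bit r3 image of a char (bits=8) or short (bits=16) result."""
    value &= (1 << bits) - 1
    if signed and value >> (bits - 1):
        value -= 1 << bits                     # sign-extend a negative value
    return value & 0xFFFFFFFF

r3, r4 = split_64bit_result(0x0123456789ABCDEF)
print(f"r3 = 0x{r3:08X}, r4 = 0x{r4:08X}")
```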
Complex Results
When complex data types like structures or strings (of greater than four charac-
ters) need to be returned as a function result, the caller must first allocate a buffer
large enough to hold the result. A pointer to this buffer is passed to the routine
as its first argument and occupies GPR 3 (and the first four bytes of the Argu-
ment Area).
The first user-visible argument to the routine is passed in GPR 4.
13.12 Stack Frames and alloca()

The C library routine alloca() causes a small problem with the requirement
that the stack pointer always point to a valid stack frame. Since alloca() is defined
to allocate storage space dynamically on the stack, it must update the stack
pointer. The trick is to make sure that it allocates the needed storage while
keeping the current stack frame intact.
This is done by maintaining two pointers into the stack: the stack pointer and a
local storage area pointer for this routine. The stack pointer is always rl and the
local storage area pointer can be any register chosen by the routine. By using these
two pointers, the link area and the argument area can be copied to the top of the
newly expanded stack frame and the new storage space is taken from the area
between the argument area and the local storage area. As Figure 13-11 shows, this
results in the new storage area partially overlapping the area on the stack that
used to contain the link area and argument area.
The local storage area pointer can be any available GPR and must be initialized
and maintained by the routine.
Routines that do not use alloca() do not need to use a separate pointer for
local storage because the top of the local storage area is trivially calculated from
the stack pointer and guaranteed not to change.
In their epilog code, routines that use alloca() must restore the previous stack
frame using the value stored at offset 0 from the current stack frame. This must
be done because there is no easy way to calculate the current size of the stack
frame: the values passed to alloca() are presumably runtime dependent.

Figure 13-11 Stack Frame before and after Call to alloca()
[Figure: bar's stack frame above foo's stack frame.]
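The address arithmetic behind this scheme can be modeled in a few lines. The Python sketch below is a simplified model rather than a real alloca() implementation. It assumes a 24-byte link area, a stack that grows toward lower addresses, and (as an assumption based on the quadword alignment of stack frames) that the request is rounded up to a multiple of 16 bytes. The function and parameter names are hypothetical.

```python
LINK_AREA = 24

def alloca_model(sp, arg_area_size, nbytes, align=16):
    """Return (new_sp, block_address) after allocating nbytes on the stack.

    sp is the stack pointer before the call. The stack grows toward lower
    addresses, so the frame is extended by moving sp down.
    """
    grow = (nbytes + align - 1) // align * align   # keep the frame aligned
    new_sp = sp - grow
    # The link and argument areas are copied to start at new_sp; the block
    # handed back to the caller begins just past them.
    block = new_sp + LINK_AREA + arg_area_size
    return new_sp, block

new_sp, block = alloca_model(0x1000, 32, 100)
print(hex(new_sp), hex(block))
```

In this model the returned block ends where the local storage area begins, so part of it occupies the addresses that previously held the link and argument areas.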
13.13 Linking with Global Routines

Subroutine calls are complicated when the routine being called is not located in
the same module as the routine making the subroutine call. Routines located in
different modules are likely to have different TOC environments, and each rou-
tine needs to have the proper environment to function properly.
Consider the arrangement of routines shown in Figure 13-12.
Two modules are defined, each with a different TOC environment. As far as the
routine foo() is concerned, the routine bar() is a local routine because it is
located in foo()'s local module. A call to bar() from foo() can be accomplished
by simply using the Branch with Link instruction to the address of bar().
This is acceptable because both share the same TOC environment.
Figure 13-12 Two Sample Modules with Different TOC Environments
(Module 1: TOC = 1000; Module 2: TOC = 2000)
The kyoko() routine is located in a module that is foreign to foo()'s module,
so kyoko() has a different TOC environment. If foo() were to simply Branch
with Link to the address of kyoko(), kyoko() would inherit the wrong TOC
environment, and its accesses to global variables would go through foo()'s TOC
rather than its own.
To prevent this, there must be some mechanism for switching TOC environments
whenever an "out-of-module" routine is called.
Function Descriptors
One way of handling the necessary TOC context switch is to use a structure
called a function descriptor (sometimes called a transition vector) instead of just a
simple pointer. A function descriptor is a pointer to the structure described in
Table 13-13.
Table 13-13 Structure of a Function Descriptor
Offset  Description
0       routine address
4       TOC
8       environment pointer
The first word of this structure contains the address of the routine, and the
second word contains the TOC pointer. Following words can contain data such as
an environment pointer, but anything beyond the TOC pointer is optional and
depends on the development environment originally used to create the routine.
In many cases, these following words are zero or are not present.
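The layout in Table 13-13 can be written down as a C structure. This is only a sketch: the type name, field names, and helper functions are invented here, and the words are modeled as uint32_t so that the 0/4/8 offsets hold on any host, not just on a 32-bit PowerPC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of a 32-bit function descriptor. */
typedef struct {
    uint32_t routine_address; /* offset 0: address of the routine's code */
    uint32_t toc;             /* offset 4: TOC pointer for the routine */
    uint32_t environment;     /* offset 8: optional environment pointer */
} FunctionDescriptor;

/* Build a descriptor with no environment pointer (the common case). */
static FunctionDescriptor make_descriptor(uint32_t routine_address, uint32_t toc) {
    /* PowerPC instructions are word aligned, so a code address is too. */
    assert(routine_address % 4 == 0);
    FunctionDescriptor d = { routine_address, toc, 0 };
    return d;
}

/* Returns 1 when the field offsets agree with Table 13-13. */
static int descriptor_layout_matches_table(void) {
    return offsetof(FunctionDescriptor, routine_address) == 0
        && offsetof(FunctionDescriptor, toc) == 4
        && offsetof(FunctionDescriptor, environment) == 8;
}
```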

Using this structure, an external routine can be called by passing the function
descriptor to a special routine provided to take the information in the descriptor
and properly pass control to the target routine.
Pointer Global Linkage (.ptrgl) Routines

A pointer global linkage (.ptrgl) routine is a small glue routine whose purpose
is to pass control to a routine defined by a given function descriptor. Part of the
routine's operation is to handle the TOC context switch properly.
Using the .ptrgl Routine

The method of using a .ptrgl routine is exactly the same as using a normal
routine call, except that the Branch with Link to the target routine is replaced with
these actions:
• Load the function descriptor into r11.
• Branch with Link to the .ptrgl routine.
• Restore the current routine's TOC (if necessary).
Some development environments use r12 instead of r11, but the general idea is
still the same.
Loading the address of the function descriptor and the branch are both
straightforward, but the TOC restoration might seem a bit tricky because the current
TOC doesn't seem to have been saved anywhere. The part that is not apparent
from the bulleted list is that part of the .ptrgl routine's responsibility is to save
the TOC in the Link Area of the stack frame before setting up the target routine's
TOC environment. This allows the TOC to be restored using a simple load
instruction:
lwz rTOC,20(r1) # restore the current routine's TOC
The "20" value is the magic offset from the top of the stack frame to the TOC
storage area in the Link Area. This is where the .ptrgl routine will save the
current TOC.
If it is known that the source and target routines both share the same TOC
environment, then the lwz instruction is not necessary. In this case, a standard no-op
instruction will be added in place of the load. This no-op is commonly encoded
as ori r0,r0,0, but some older development environments used cror 31,31,31
or cror 15,15,15.
This no-op reflects the fact that the compiler is generally unable to tell if the
target routine will be in the same module as the source (the calling) routine. The
compiler cannot figure this out because module assignments are not made until
the object code is linked by the linker. When the compiler encounters a subroutine
call to a routine that is not defined in the same source file, it must assume
that the linker could place the routine in another module. To support this potential
out-of-module call, a no-op placeholder instruction is placed after the subroutine
call. This gives the linker a place to write the required load instruction. If
the two routines are in the same module, the linker doesn't add the load instruction
and the no-op instruction remains.
Of course, if the compiler has enough information available to determine if the
called routine is in the same TOC environment, then the placeholder no-op
instruction is unnecessary and is not generated.
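The linker's side of this arrangement can be sketched in C. The function names below are hypothetical and real linkers differ in detail; the instruction encodings follow from the PowerPC instruction formats, and the sketch assumes that rTOC is r2 (as stated in the register conventions) and that the TOC save slot is at offset 20 from r1.

```c
#include <assert.h>
#include <stdint.h>

/* Encodings of the placeholder no-ops mentioned in the text. */
#define NOP_ORI     0x60000000u /* ori  r0,r0,0   */
#define NOP_CROR_31 0x4FFFFB82u /* cror 31,31,31  */
#define NOP_CROR_15 0x4DEF7B82u /* cror 15,15,15  */

/* Encode lwz rt,d(ra): D-form instruction with primary opcode 32. */
static uint32_t encode_lwz(unsigned rt, unsigned ra, int16_t d) {
    assert(rt < 32 && ra < 32);
    return (32u << 26) | ((uint32_t)rt << 21) | ((uint32_t)ra << 16)
         | (uint16_t)d;
}

static int is_placeholder_nop(uint32_t insn) {
    return insn == NOP_ORI || insn == NOP_CROR_31 || insn == NOP_CROR_15;
}

/* Hypothetical linker step: 'slot' is the instruction word that follows
 * the Branch with Link. For an out-of-module call the placeholder is
 * overwritten with lwz r2,20(r1); otherwise it is left alone.
 * Returns 1 if the slot was patched, 0 if not. */
static int patch_toc_restore(uint32_t *slot, int target_is_out_of_module) {
    if (!target_is_out_of_module || !is_placeholder_nop(*slot))
        return 0;
    *slot = encode_lwz(2, 1, 20);
    return 1;
}
```

Checking for the placeholder before patching means the linker never overwrites a real instruction when the compiler has omitted the no-op.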
How the .ptrgl Routine Works

The actions performed by the .ptrgl routine are best described by examining
the code:
lwz   r0,0(r11)     # r0 <= routine address
stw   rTOC,20(r1)   # save current TOC
mtctr r0            # ctr <= routine address
lwz   rTOC,4(r11)   # setup proper TOC
lwz   r11,8(r11)    # setup proper env ptr
bctr                # call routine
Because this is a short routine, each of its instructions will be examined in turn.
The lwz r0,0(r11) instruction gets the address of the target routine from the
function descriptor and stores it in r0. This value will later be stored in the CTR
so that the bctr instruction can be used to jump to the routine.
The stw rTOC,20(r1) instruction saves the current TOC in the Link Area of
the current routine's stack frame. The value stored here will be restored by the
lwz instruction, which follows the branch to the .ptrgl routine.
The mtctr r0 instruction takes the target routine address and places it in the
CTR register so that it can be jumped to easily.
The lwz rTOC,4(r11) instruction gets the target routine's TOC pointer and
stores it in the TOC register.
The optional lwz r11,8(r11) instruction sets up the target routine's environment
pointer. Because the environment pointer isn't used by all development
systems, this instruction may not be necessary.
The bctr instruction actually (finally) calls the target routine.
Note that the code does not touch the LR. The calling routine calls the .ptrgl
glue using a Branch with Link instruction that sets the LR to hold the address (in
the original routine) where control should return. Since the LR hasn't been modified,
the target routine can return control using the standard blr instruction.
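The six instructions can be traced with a small C model of the registers and memory involved. The Cpu type and helper names are invented for this sketch, and the addresses used when exercising it are arbitrary; only the order of loads and stores comes from the listing above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_BYTES 1024u

/* Toy register file and memory, just enough to trace .ptrgl. */
typedef struct {
    uint32_t mem[MEM_BYTES / 4];
    uint32_t r0, r1, rtoc, r11;
    uint32_t ctr, lr, pc;
} Cpu;

static void cpu_reset(Cpu *c) { memset(c, 0, sizeof *c); }

static uint32_t load_word(const Cpu *c, uint32_t addr) {
    assert(addr % 4 == 0 && addr < MEM_BYTES);
    return c->mem[addr / 4];
}

static void store_word(Cpu *c, uint32_t addr, uint32_t value) {
    assert(addr % 4 == 0 && addr < MEM_BYTES);
    c->mem[addr / 4] = value;
}

/* One C statement per instruction of the .ptrgl listing. */
static void ptrgl(Cpu *c) {
    c->r0 = load_word(c, c->r11 + 0);    /* lwz   r0,0(r11)   */
    store_word(c, c->r1 + 20, c->rtoc);  /* stw   rTOC,20(r1) */
    c->ctr = c->r0;                      /* mtctr r0          */
    c->rtoc = load_word(c, c->r11 + 4);  /* lwz   rTOC,4(r11) */
    c->r11 = load_word(c, c->r11 + 8);   /* lwz   r11,8(r11)  */
    c->pc = c->ctr;                      /* bctr (LR is left untouched) */
}

/* The instruction the caller executes after the target returns. */
static void restore_toc(Cpu *c) {
    c->rtoc = load_word(c, c->r1 + 20);  /* lwz   rTOC,20(r1) */
}
```

Running ptrgl() followed by restore_toc() on a descriptor stored in mem shows the caller's TOC leaving rTOC, resting at offset 20 from r1, and coming back, while lr keeps the return address throughout.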
Global Linkage (.glink) Routines

Global Linkage (.glink) routines are similar to .ptrgl routines except in
usage. A .glink routine is created for every external routine that a module
imports, so there would be a separate .glink routine for every system or
library routine that a program calls.
The main difference between a .glink routine and the .ptrgl routine is where
the function descriptor originates. The .ptrgl routine is a general routine that
is passed a function descriptor; the .glink routines are specific to a particular
target routine.
This does not mean that the function descriptors for .glink routines are
encoded in the .glink routine itself. A separate function descriptor is still
stored in the TOC area for the module. Each .glink routine encodes only the
offset from the TOC to the function descriptor. This method gives both the linker
and the .glink routine easy access to the descriptor.
Using the .glink Routines

Other than this difference in the source of the function descriptor, .glink
routines are used like .ptrgl routines, except the function descriptor (obviously)
needn't be passed in r11. All .glink routines are called using a Branch with
Link instruction, and they are always followed by a lwz or no-op instruction.
How the .glink Routines Work

The code for the standard .glink routine is:
lwz   r12,offset(rTOC)  # r12 <= f() desc
stw   rTOC,20(r1)       # save current TOC
lwz   r0,0(r12)         # r0 <= routine address
lwz   rTOC,4(r12)       # set up routine's TOC
mtctr r0                # ctr <= routine address
bctr                    # call routine
As with the .ptrgl code, it is useful to describe the routine instruction by
instruction.
The first instruction, lwz r12,offset(rTOC), gets a pointer to the target
routine's function descriptor and places it in r12. The offset to the function
descriptor from the module's TOC is hard-coded into this instruction.
The stw rTOC,20(r1) instruction saves the current TOC in the Link Area of
the current routine's stack frame.
The next two instructions, lwz r0,0(r12) and lwz rTOC,4(r12), get the
target routine's address and TOC pointer and load them into r0 and rTOC,
respectively.
The last two instructions put the target routine's address into the CTR register
and then call the routine.
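A C model makes the one structural difference from .ptrgl visible: the descriptor is found through the caller's own TOC instead of arriving in a register. As before, the Cpu type, the helper names, and every address used to exercise the model are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_BYTES 1024u

/* Toy register file and memory, just enough to trace a .glink routine. */
typedef struct {
    uint32_t mem[MEM_BYTES / 4];
    uint32_t r0, r1, rtoc, r12;
    uint32_t ctr, lr, pc;
} Cpu;

static void cpu_reset(Cpu *c) { memset(c, 0, sizeof *c); }

static uint32_t load_word(const Cpu *c, uint32_t addr) {
    assert(addr % 4 == 0 && addr < MEM_BYTES);
    return c->mem[addr / 4];
}

static void store_word(Cpu *c, uint32_t addr, uint32_t value) {
    assert(addr % 4 == 0 && addr < MEM_BYTES);
    c->mem[addr / 4] = value;
}

/* One C statement per instruction of the .glink listing. In real glue
 * code toc_offset is a constant baked into the first instruction; it is
 * a parameter here so one function can stand in for any .glink routine. */
static void glink(Cpu *c, uint32_t toc_offset) {
    c->r12 = load_word(c, c->rtoc + toc_offset); /* lwz   r12,offset(rTOC) */
    store_word(c, c->r1 + 20, c->rtoc);          /* stw   rTOC,20(r1)      */
    c->r0 = load_word(c, c->r12 + 0);            /* lwz   r0,0(r12)        */
    c->rtoc = load_word(c, c->r12 + 4);          /* lwz   rTOC,4(r12)      */
    c->ctr = c->r0;                              /* mtctr r0               */
    c->pc = c->ctr;                              /* bctr (LR is untouched) */
}
```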
Function Pointers in High-Level Languages

In high-level languages, function pointers are always implemented as function
descriptors. However, these descriptors are never directly visible to the
programmer: the compiler handles all necessary descriptor magic so the programmer
only deals with (apparent) function pointers.
Function descriptors are necessary because there is no way for the compiler to
know what the program plans to do with the "function pointer." If it is going to
be used as a callback function for a system routine, then the routine must be
passed along with its TOC environment as a function descriptor.
Because the compiler can't know if a descriptor is needed, it plays it safe and
always creates one.
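What the compiler hides can be imitated in portable C. In this sketch a global variable stands in for the TOC register and each "module" has a one-entry table of globals; the Descriptor type, call_through(), and the table contents are all hypothetical, with kyoko() and the values 1000 and 2000 borrowed from Figure 13-12.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor: a code pointer paired with its module's TOC. */
typedef struct {
    int (*routine)(int); /* models the routine address word */
    const int *toc;      /* models the TOC pointer word */
} Descriptor;

static const int *current_toc; /* models rTOC */

static const int module1_toc[] = { 1000 }; /* globals of foo()'s module */
static const int module2_toc[] = { 2000 }; /* globals of kyoko()'s module */

/* Lives in module 2 and reads its global through the current TOC. */
static int kyoko(int x) {
    return x + current_toc[0];
}

/* What a "function pointer" to kyoko() really refers to. */
static const Descriptor kyoko_descriptor = { kyoko, module2_toc };

/* Call through a descriptor: switch to the target's TOC, call the
 * routine, then restore the caller's TOC. */
static int call_through(const Descriptor *d, int arg) {
    assert(d != NULL && d->routine != NULL);
    const int *saved_toc = current_toc;
    current_toc = d->toc;
    int result = d->routine(arg);
    current_toc = saved_toc;
    return result;
}
```

A callback handed to a routine in another module works the same way: whoever eventually makes the call needs both halves of the descriptor, which is why a bare code address is not enough.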
