Floating-point to fixed-point code conversion with variable trade-off between computational complexity and accuracy loss

Alexandru Bârleanu, Vadim Băitoiu, and Andrei Stan, Member, IEEE

Abstract: This paper describes a method for converting floating-point expressions into equivalent fixed-point code in DSP software. Replacing floating-point expressions with specialized integer operations can greatly improve the performance of embedded applications. The method is developed for Direct-Form I filters with constant coefficients and input variables whose low/high bounds are known. Two conflicting objectives are considered simultaneously: computational complexity and accuracy loss. The algorithm presented here can construct multiple fixed-point solutions for the same floating-point code, from high-complexity-high-accuracy to low-complexity-low-accuracy. A cost function guides the data flow transformation decisions; by changing the cost function coefficients, different fixed-point forms can be obtained. The data flow transformation takes very little time: less than 100 milliseconds for a 32-tap FIR filter. The generated fixed-point code is tested on 8-bit (AVR ATmega), 16-bit (MSP430), and 32-bit (ARM Cortex-M3) microcontrollers. In all cases, it executes faster than the equivalent floating-point code.
I. INTRODUCTION
Floating-point code is often inappropriate for embedded applications. The computing capabilities of microcontrollers are generally limited and, in most cases, no hardware support for floating-point operations is provided. To overcome this problem, the mathematical function contained in the floating-point code must be expressed with fixed-point code. Doing this manually, that is, rewriting a floating-point function by hand into a sequence of integer operations, can be a difficult task.
II. RELATED WORK
There has been a significant effort to develop frameworks
to automate the conversion of floating-point code to integer
code [1]-[4]. Two distinct approaches can be identified:
statistical (simulation-based) and analytical. The difference
between them is in the way the dynamic intervals of
variables are computed. A statistical method performs a
series of simulations and may require a significant amount
of time. An analytical method is necessarily based on a
concrete data model (for example, propagation rules) and
can give precise information in a very short time.
One of the first floating-point to fixed-point converters, AUTOSCALER for C, is described in [1]. It is able to optimize the number of shift operations by equalizing the word-lengths of specific variables or constants. In [2], a method is presented that performs CDFG optimizations under accuracy constraints; it makes extensive use of the equations representing the system. In [4], a genetic algorithm is employed to find the optimal trade-off between signal quality and implementation complexity.
This paper is based on previous work detailed in [5]. Paper [5] addresses the same task as this paper, but is primarily focused on generating ANSI C compliant code.
III. METHOD OVERVIEW
The method presented in this paper is designed to
transform dot products with constant coefficients (floating-
point literals) and integer variables with known intervals:

$y = \sum_{i=0}^{N} a_i x_i$    (1)
Failing to state the correct intervals of the integer
variables can lead to erroneous results. The manipulation of
intervals [6] is central to the optimization procedure.
The following types of nodes are used to represent the
data flow:
- Stand-alone nodes: nodes whose values do not depend on other nodes. Stand-alone nodes are used to represent constants and parameters: a_i, x_i, etc.
- Operators: add, multiply, shift and change sign. An
operator has one or more operands (child nodes). These can
be stand-alone nodes or other operators.
A node has an associated interval and fractional word
length (FWL). The interval represents the extreme values of
the node run-time integer (the memory or register variable).
Scaling a node is a frequent operation in the optimization process. Note that the FWL of a node is an integer value. A scaling operation does not change the fixed-point (or real) value of a node; the node interval is always altered together with the node FWL.
A node can be realized in code as a 16-bit or 32-bit integer. The values that pass through a node must have as many significant bits as possible, to carry precise information, but, on the other hand, must be limited to a specific interval. The FWL is not necessarily the same for every node.
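To make the node abstraction concrete, the following minimal C sketch models a node with an interval and an FWL (the actual implementation is in Java; the structure and names here are illustrative assumptions, not the paper's classes). The real value represented by a node is integer/2^FWL, so shifting the interval and decreasing the FWL together leaves that value unchanged:

#include <stdint.h>

/* Illustrative node model: the represented real value is value/2^fwl. */
typedef struct {
    int32_t low, high;  /* extreme values of the run-time integer      */
    int     fwl;        /* fractional word length (always an integer)  */
} Node;

/* Scale a node right by s bits: the integer interval shrinks and the
   FWL decreases by s, so low/2^fwl and high/2^fwl stay (almost) the
   same; only truncation error is introduced.                          */
static void scale_right(Node *n, int s) {
    n->low  >>= s;
    n->high >>= s;
    n->fwl  -= s;   /* interval and FWL are always altered together    */
}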
The data flow structure is modified in steps. A step can
be viewed as an inference operation:
1. Effect. The integer interval of a node must be decreased or increased.
2. List of possible causes. A list of candidate data flow transformations is constructed.
3. Best cause selection. The optimal data flow transformation is selected with the help of a cost function whose coefficients represent, in essence, the importance given to the computational effort and to the accuracy loss.
The method described in this paper is implemented in
Java (mostly because of the Java support for object-oriented
programming and advanced IDEs available).
IV. DATA FLOW TRANSFORMATION
A. Problem Difficulty
The initial form of the data flow is a faithful image of the floating-point dot product expression. There is one add operator with N child nodes: a multiply operator for each a_i x_i term, as in (1).
Node a_i has a very long fractional part: 24 bits, if the floating-point literals of the dot product expression are parsed as single precision values. If node x_i has, for example, the run-time interval [0; 1023], then the multiply node overflows. Four data types are permitted for a node: signed/unsigned 16-bit and 32-bit integers. To prevent the multiply node from overflowing, it is necessary to make its integer interval smaller (to decrease the fractional part). There are two possibilities: shift a_i to the right at design-time or shift x_i to the right at run-time. Each solution has its own impact on the data flow computational complexity and accuracy. In this case it is simple to decide which solution to select: discarding some least significant bits from the very precise constant at design-time means no run-time overhead and causes less accuracy loss than the right shift of x_i (see the sketch below). But this is a rare case.
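As a minimal illustration with invented values (a_i = 0.743 stored with FWL 24, x in [0; 1023]; these numbers are not from the paper), the two options look like this in C:

#include <stdint.h>

/* The product of a 24-bit-fraction constant and a 10-bit input needs
   about 35 bits, so roughly 3 bits must be discarded somewhere.
   Both results below represent 0.743*x with FWL 21.                  */

int32_t option1(int32_t x) {
    /* Constant pre-shifted at design-time: 0.743 * 2^21 (FWL 21).
       No run-time overhead; only the precise constant loses bits.    */
    return 1558183L * x;
}

int32_t option2(int32_t x) {
    /* Precise constant 0.743 * 2^24 (FWL 24); shift x at run-time,
       paying one shift instruction and 3 bits of the input signal.   */
    return 12465471L * (x >> 3);
}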
In the general case, a solution is either low-complexity-low-accuracy or high-complexity-high-accuracy (not low-complexity-high-accuracy). This makes it difficult to compare candidate solutions: it is necessary to quantify the complexity and accuracy of a particular solution. There is no other way, because the number of alternative possibilities grows very quickly with the size of the data flow area below the node whose integer interval must be modified.
B. Computational Complexity
A data flow node has an associated computational
complexity. This is an estimator of the computational effort
required to obtain at run-time the node value. At design-
time the exact computational complexity is difficult to
evaluate. The Java application simply counts the operators
contained in the data flow area below the target node. This
is a sufficiently good approximation.
C. Node Error Interval (Drift)
Every data flow node stands for a fixed-point value which
can vary within a specific interval. This interval refers to the
node value at run-time, which is in essence an integer very
close to the ideal infinite-precision value. Thus, every node
has a specific error. The interval of the infinite-precision
value that can pass through a node, and the corresponding
interval of the run-time integer, can be calculated at design-
time, which means that, for each node, the interval of the
error can be obtained without actually running the code.
The error interval of an operator node can be calculated
using the integer interval and the error interval of every
child node [6]. For example, the error interval of an add
node can be calculated by adding the error interval of every
child node.
The low/high values of an error interval are considered to be absolute values, not relative values, i.e., not units in the last place (ulps) [7].
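A minimal C sketch of such propagation rules, assuming the error is defined as the ideal value minus the computed value (the interval type and function names are illustrative, not the paper's Java API):

/* Absolute error bounds of a node value (not ulps, cf. [7]). */
typedef struct { double low, high; } ErrInterval;

/* Error of an add node: the error intervals of the children add up. */
static ErrInterval add_err(ErrInterval a, ErrInterval b) {
    ErrInterval r = { a.low + b.low, a.high + b.high };
    return r;
}

/* Error of a right shift by s bits: the child's error is scaled down,
   and truncation can put the computed value up to one unit of the
   new FWL below the ideal value.                                      */
static ErrInterval shift_err(ErrInterval a, int s, double unit) {
    ErrInterval r = { a.low / (1 << s), a.high / (1 << s) + unit };
    return r;
}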
D. Multi-Objective Search
The simplest way to decrease or increase the integer interval of a node is to perform a shift operation. This can be done at design-time if the node represents a constant value, or at run-time if the node represents an operator and (very important) the integer interval is valid (does not overflow). But these are not frequent cases. The most usual situation is that an operator overflows and its integer interval must be decreased. The problem is that the integer interval of an operator cannot be changed directly; the integer intervals of the child nodes must be altered. To force the integer interval of an add node, it is necessary to force the integer interval of every child node (logical AND). To force the integer interval of a multiply node, it is necessary to force one or more child nodes (logical OR).
In the general case, there are multiple ways to increase or decrease the integer interval of a node (logical OR). One possible way is called a solution. A solution involves a number of data flow changes (logical AND). A change can be viewed, in the simplest way, as a node switch: the child node of an operator is replaced with another child node. A change is invertible: it can be applied and it can be undone. This is a very important feature. Because a change is always part of a solution, it makes sense to say that a solution is applied or undone (meaning that all the changes it includes are applied or undone).
Multiple solutions can be viewed as concurrent if all of them are built with the same purpose, for example, to make the integer interval of a specific node smaller. But each solution consists of its own particular set of changes. Thus each solution has its own computational complexity and influence on accuracy. The Java application compares concurrent solutions by these two metrics. To evaluate the complexity and error interval of a solution, the solution is applied (some child nodes are disconnected and others are connected).
This algorithm step (switching between different solutions) is essentially a search. The method described in this paper resembles other methods in the way the data flow is implemented (types of nodes) and in the use of operator properties (value propagation). From this point of view, the method described here can be considered analytical. But it still performs a search! It can be regarded as search-based, yet it is very different from other search-based methods: the method described in this paper connects and disconnects various data flow fragments, while other methods scan very large multi-dimensional spaces that represent fractional word-lengths.
To compare several concurrent solutions it is necessary to combine, for each solution, the complexity and the error interval into a single indicator. For this purpose, a linear function is used:

$cost = k_1 \cdot complexity + k_2 \cdot error$    (2)

Varying the cost function coefficients amounts to, for example, favoring solutions that introduce considerable computational overhead but give high-accuracy results over low-complexity-low-accuracy solutions.
Although the cost function has two parameters, the
variation space is one-dimensional. The cost function can be
represented geometrically as a line which passes through the
origin point in a two-dimensional space (Fig. 1).

Fig. 1: Solution space
In Fig. 1 the cost of one solution is directly proportional to the shortest distance to the cost function line. The complexity coefficient (k_1) and the error coefficient (k_2) together determine the slope of the cost function line. Two coefficients are used because otherwise it would be impossible to represent the vertical line (+INF slope). For simplicity, the sum of the cost coefficients is kept constant:

$k_1 + k_2 = 1$    (3)

The cost of one solution has no meaning if considered separately. It makes sense only in comparison with the costs of other solutions.
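A short C sketch of such a comparison (the Solution fields mirror the paper's two metrics; everything else is an illustrative assumption):

typedef struct { double complexity, error; } Solution;

/* Linear cost (2) under the constraint k1 + k2 = 1 from (3):
   k1 = 1 favors low-complexity solutions, k1 = 0 high accuracy. */
static double cost(const Solution *s, double k1) {
    return k1 * s->complexity + (1.0 - k1) * s->error;
}

/* Select the concurrent solution with the smallest cost. */
static int best_of(const Solution *s, int n, double k1) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (cost(&s[i], k1) < cost(&s[best], k1))
            best = i;
    return best;
}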
E. Transformation Example
Fig. 2 shows two extreme data flow structures obtained
for a dot product with 12 terms. (Such images can be
created with Graphviz software.)

Fig. 2: High-level view of two extreme data flow structures obtained for a dot product expression with 12 terms: low-complexity-low-accuracy (left) and high-complexity-high-accuracy (right).
In Fig. 2, the data flow on the left-hand side has a specific pattern. The fractional word length is the same for most of the nodes. The number of operators is minimal (complexity = 26). In contrast, the data flow on the right-hand side is very developed and does not have a specific pattern (at the global level). Some nodes have very long fractional parts (which is not visible in the figure). The number of operators is maximal (complexity = 53).
V. DESIGN-TIME TECHNIQUES
A. Node Cache Information
The optimization process makes extensive use of node attributes such as the integer interval and the drift. For operator nodes this information depends on the child nodes (operands) and must be computed. The time required for this can become significant for large data flows: the high nodes generate a lot of subsequent calls to the nodes located below them to get the necessary information. This traffic can be diminished. The data flow structure is not itself very dynamic: a change applied during the optimization process has a limited impact area. In many cases the integer interval and drift information can be reused. For this purpose, each operator node is designed with its own cache. In this way, obtaining the integer interval or the drift information can be very fast, unless the cache of the target node has been invalidated. The invalidation of the cache is crucial. Invalidating fewer nodes than required may lead to erroneous results, and invalidating more nodes than required lowers the cache hit rate.
The cache invalidation is triggered in the following manner: whenever a node N is connected to an operator F, a change message is propagated along the chain of operators from F to the root operator, invalidating the corresponding cache data.
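A sketch of this scheme in C (the tool itself is written in Java; the structure and names here are illustrative):

typedef struct OpNode {
    struct OpNode *parent;    /* next operator on the path to the root */
    int            cache_ok;  /* validity flag for the cached interval
                                 and drift information                 */
} OpNode;

/* Called whenever a child node is connected to operator f:
   invalidate the caches on the chain from f up to the root.           */
static void on_connect(OpNode *f) {
    for (OpNode *n = f; n != NULL; n = n->parent)
        n->cache_ok = 0;
}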
The design time is reduced considerably by the node cache information. This is easy to observe as the filter length is increased. Without caching, transforming the data flow of a dot product with 16 terms can take more than 10 seconds. With the cache mechanism turned on, the data flow is optimized in tens of milliseconds.
Fig. 3 shows the execution time of the optimization
procedure for dot products with different lengths (node
cache information is used).

Fig. 3: Average data flow transformation time
B. Automatic Search of Data Flows
Varying the coefficients of the cost function leads to different data flow structures. The coefficients can be set by hand, but this is not very practical, because the coefficients themselves do not carry much information (except in the extreme cases). In a concrete situation it might be more desirable to generate all the possible data flows, create code for all of them, and later select the most convenient function.
From a high-level point of view, the search method, whatever it is, should traverse the one-dimensional search space from 0 to 1, generate various data flows, and pick up the unique ones. The ideal search method should generate as few equivalent data flows as possible. Two data flows are considered equivalent if, while traversing both structures depth-first in parallel, every node that is encountered has the same type (add, multiply), the same integer interval (low/high values, fractional length), and the same number of child nodes as its mirror node.
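This equivalence test can be sketched in C as a recursive parallel traversal (the DNode type is an illustrative assumption about the data flow representation):

typedef struct DNode {
    int  type;                /* add, multiply, shift, ...            */
    long low, high;           /* integer interval                     */
    int  fwl;                 /* fractional word length               */
    int  nchildren;
    struct DNode **children;
} DNode;

/* Two data flows are equivalent if every pair of mirror nodes matches
   in type, integer interval, FWL and number of children.             */
static int equivalent(const DNode *a, const DNode *b) {
    if (a->type != b->type || a->nchildren != b->nchildren ||
        a->low != b->low || a->high != b->high || a->fwl != b->fwl)
        return 0;
    for (int i = 0; i < a->nchildren; i++)
        if (!equivalent(a->children[i], b->children[i]))
            return 0;
    return 1;
}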
Performing a sequential search can be very time-consuming. Varying a coefficient from 0 to 1 with a constant step and generating all the corresponding data flows is very inefficient: a large number of the resulting data flow structures are equivalent, and the generation of a single one takes a significant amount of time (for example, 10-15 milliseconds for a dot product with 10 terms). On the other hand, the increment step has to be small enough to capture all the possible data flow structures.
Fortunately, it is possible to perform a more selective search. It is only necessary to use the following context information: if the end-points of a segment inside the search space generate the same data flow structure, then it does not make sense to sweep this particular segment, because no new data flow structures can be discovered in it. But if the end-points of a segment generate different data flows, then the segment should be halved and the same procedure applied recursively to the resulting segments. This method is very efficient, because the number of equivalent data flow structures that are generated and discarded is minimal.
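A sketch of this selective sweep in C, assuming a hypothetical build_dataflow(k1) that runs the optimization for a given complexity coefficient, a hypothetical collect() that stores unique results, and the equivalence test sketched above:

typedef struct DNode DNode;              /* data flow, as sketched above */
extern DNode *build_dataflow(double k1); /* hypothetical optimizer call  */
extern void   collect(DNode *df);        /* store a unique data flow     */
extern int    equivalent(const DNode *a, const DNode *b);

static void sweep(double lo, DNode *df_lo, double hi, DNode *df_hi) {
    /* Same structure at both end-points: nothing new inside. */
    if (equivalent(df_lo, df_hi) || hi - lo < 1e-4)
        return;
    double mid = 0.5 * (lo + hi);        /* halve the segment            */
    DNode *df_mid = build_dataflow(mid);
    collect(df_mid);
    sweep(lo, df_lo, mid, df_mid);       /* recurse on both halves       */
    sweep(mid, df_mid, hi, df_hi);
}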

Fig. 4: Complexity of the data flows found for a dot product with 20 terms (partial view). The complexity coefficient is swept from 0 to 1, while the drift coefficient is set to the complementary value, according to (3). A horizontal segment represents one or more data flows with the same complexity.
Given an arbitrary filter, the number of non-equivalent data flows that can be found by varying the cost coefficients is proportional to the filter length. As a rule, if N is the filter length, then the selective search method yields between 0.5N and 1.5N non-equivalent data flows.
VI. CODE GENERATION
Generating fixed-point C code for a particular data flow
is, in essence, a straightforward process. However, there are
two important aspects: the declaration of the intermediary
variables and the explicit data type casts [8].
The C code can be generated in two very different forms: as a long sequence of short assignments (one operator in every right-hand side) with many intermediary variables, or as a single, very long arithmetic expression with many parentheses. Although, frankly, both forms of code look unreadable, the first variant can be used for debugging purposes, because all the intermediary variables are declared and can be watched step by step. The second variant is preferable when no compiler optimizations are applied.
Generating a very long line with arithmetic operators poses some problems, because the compiler must deduce the data types of some subexpressions. (When the intermediary variables are declared, their data types are clearly stated.) Examples:
- Short multiplication. The compiler might consider that the result of a multiplication between two 16-bit integers is a 16-bit integer. This is in general not desirable, because most multiply nodes produce 32-bit values; so, when the code is generated, short integers that must be multiplied are explicitly cast to long integers.
- Signed/unsigned arithmetic. There are cases when a signed integer is added to an unsigned integer and the result is known to be nonnegative, but the compiler assumes that it is signed. If such an integer must be shifted to the right, the compiler might perform an arithmetic (not logical) shift, which is wrong, because the most significant bit would be interpreted differently. To avoid this, additional casts are inserted when generating the code, as illustrated in the sketch after this list.
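Hypothetical fragments illustrating both pitfalls and the casts that resolve them (these lines are constructed for illustration, not taken from the tool's actual output):

#include <stdint.h>

void casts_example(int16_t a, int16_t b, uint32_t u) {
    /* Short multiplication: without the casts the product could be
       computed (and truncated) in 16 bits on a 16-bit target.       */
    int32_t p = (int32_t)a * (int32_t)b;

    /* Signed/unsigned arithmetic: suppose p + u is known to be
       nonnegative but may have its most significant bit set. The
       cast to unsigned forces a logical (not arithmetic) shift.     */
    uint32_t s = ((uint32_t)p + u) >> 4;
    (void)s;
}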
VII. RESULTS
A. Accuracy
When a fixed-point C function is generated, the error interval of its result is already known: it is the worst-case indicator computed at design-time, the drift of the data flow root node.
Note: the error is defined as the difference between the floating-point value obtained with the original floating-point expression (the reference value) and the integer value obtained with the generated fixed-point code.
A more relevant accuracy indicator is the signal-to-quantization-noise ratio (SQNR), computed from the mean of the absolute reference values (S) and the mean of the absolute error values (N):

$SQNR = 10 \cdot \log_{10}\left(\frac{S}{N}\right)$    (4)
The SQNR values are computed on a high-speed computer (not on the microcontrollers). It is important to run the fixed-point code with as many different input parameters as possible.
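A sketch of the measurement loop, assuming a hypothetical reference() that evaluates the original floating-point expression and a hypothetical fixed() that runs the generated integer code:

#include <math.h>
#include <stdlib.h>

extern double reference(const int *x);  /* original expression        */
extern long   fixed(const int *x);      /* generated fixed-point code */

/* Estimate the SQNR (4) over many random input vectors. */
double measure_sqnr(int terms, long runs) {
    double sum_s = 0.0, sum_n = 0.0;
    int x[64];                          /* assumes terms <= 64        */
    for (long r = 0; r < runs; r++) {
        for (int i = 0; i < terms; i++)
            x[i] = rand() % 4096;       /* inputs in [0; 4095]        */
        double ref = reference(x);
        sum_s += fabs(ref);                    /* accumulates S       */
        sum_n += fabs(ref - (double)fixed(x)); /* accumulates N       */
    }
    return 10.0 * log10(sum_s / sum_n); /* the runs factor cancels    */
}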
Fig. 5 illustrates the accuracy and the complexity of the
solutions that are found (automatically) for a dot product
with 24 terms. The accuracy is represented as the difference
between the highest possible (ideal) SQNR and the SQNR
of the generated code. The highest possible SQNR is
defined as the SQNR of a function that would return the
integer nearest to the ideal floating-point value.

Fig. 5: Solutions found for a dot product with 24 terms, random coefficients
within the interval [-1, 1] and variables within the interval [0, 4095]
The SQNR degrades as the number of dot product terms grows and, especially, as the complexity cost coefficient is increased.
B. Speed
The execution time of the generated fixed-point code
depends on many factors:
- The filter. The number of data flow nodes is directly
proportional to the number of filter taps. This holds true
before and after the data flow is optimized.
- The cost function. Varying the cost coefficients leads to specific data flow transformation decisions (as discussed in the Multi-Objective Search section).
- The code generation. If the fixed-point code is
generated as one very long expression (everything inline),
then most of the intermediary values are allocated in
registers and, in effect, the number of load/store operations
is decreased. This is especially important when no compiler
optimizations are applied.
- The compiler. Turning the compiler optimizations on
can greatly accelerate the fixed-point code. This is worth
considering especially when the intermediary variables are
declared.
- The microprocessor. The microprocessor capabilities are not considered in detail, because the main purpose is to generate platform-independent code, not assembler. The only assumption is that there is no floating-point unit, which is characteristic of embedded microprocessors. The microprocessors used for testing are shown in Table I. Some instruction sets include integer division (which can be used instead of a bitwise shift [9]), but this is not a general feature and is not considered.




TABLE I
MICROPROCESSORS USED FOR TESTING

Microprocessor   Register width   Compiler
ATmega16         8-bit            IAR
MSP430F149       16-bit           IAR
STM32F           32-bit           gcc
LPC1768          32-bit           IAR
The speed-up factor between the fixed-point code and the floating-point code can vary within a wide range. One very important cause is the cost function used throughout the data flow optimization. For low-complexity-low-accuracy solutions the speed can be increased 15 times or more; for high-complexity-high-accuracy solutions the speed can be increased by at least 3 times. (These results are obtained with randomly generated floating-point dot products with 4-32 terms.)
C. Memory Usage (Flash and SRAM)
The fixed-point code generally takes slightly more Flash space (code memory) than the floating-point code.
The SRAM (data memory) usage is determined mainly by the stack requirements. The fixed-point code, if generated as a single arithmetic expression (no intermediary variables), occupies almost no stack space. The floating-point code needs a certain amount of stack, because it calls low-level functions.
VIII. CONCLUSIONS
A method for transforming floating-point expressions into integer C code for embedded processors has been described. Direct-Form I non-adaptive filters with predefined input bounds are targeted. The presented algorithm uses a parameterizable cost function and is able to produce multiple solutions for the same given floating-point expression.
The method can be applied to FIR filters, as well as to IIR filters if the intervals of the output variables can be specified. (Work on recursive filters is in progress.) The generated code is tested on 8-bit, 16-bit, and 32-bit microprocessors, using different compilers.
There can be two major realizations of the presented algorithm: as a stand-alone application for code conversion (as it is currently implemented) or as a separate type of compiler IR optimization (which requires integration into a compiler system).
ANNEX
A floating-point expression is converted to fixed-point
code, for illustrative purposes:
0.023159746f*x[0]+0.007362494f*x[1]+0.109808266f*x[2]-
0.8996903f*x[3]-0.52352905f*x[4]+0.34677517f*x[5]+
0.50765723f*x[6]+0.9989124f*x[7]+0.5545187f*x[8]-
0.73752284f*x[9]
This expression can be viewed as a FIR filter. The interval of the input variables x[0]-x[9] is set to [0; 4095]. The coefficients are generated randomly in the interval [-1; 1]. The conversion to fixed-point takes 106 milliseconds and yields 11 non-equivalent data flows. ANSI C integer code is generated. Here is the compact form of one solution (code without intermediary variables):
((unsigned long)(1159921664L + (((((unsigned
long)(((((unsigned long)(((((unsigned long)7720 * (unsigned
long)x[1]) + ((unsigned long)24284 * (unsigned long)x[0])) +
(115142L * (unsigned long)x[2])) + (532317L * (unsigned
long)x[6])) >> 2) + ((unsigned long)(1047435L * (unsigned
long)x[7]) >> 2)) + ((unsigned long)(581455L * (unsigned
long)x[8]) >> 2)) + (90905L * (unsigned long)x[5])) >> 1) -
((unsigned long)(193337L * (unsigned long)x[9]) >> 1)) + ((-
117924L) * (signed long)x[3])) + ((-68620L) * (signed
long)x[4]))) >> 17) - 8849
The accuracy of this fixed-point code is estimated by
running 1.9e+6 random test cases.
The SQNR of the fixed-point code is 38.694524 dB. This value is 0.000098 dB less than the ideal SQNR. The error distribution is as follows: in 99.80% of the cases the result of the fixed-point code is the same as the integer nearest to the floating-point expression, in 0.07% of the cases the error is 1, and in 0.13% of the cases the error is -1.
IAR Embedded Workbench for ARM is used to measure the performance of the integer code. LPC1768, an ARM Cortex-M3 microprocessor, is selected as the target architecture. Without compiler optimizations, in the simulator, the floating-point code takes 737-754 cycles and the integer code takes 50 cycles. Thus, the execution time is decreased approximately 15 times.
REFERENCES
[1] K.-I. Kum, J. Kang, and W. Sung, "AUTOSCALER for C: An optimizing floating-point to integer C program converter for fixed-point digital signal processors," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 9, pp. 840-848, Sep. 2000.
[2] D. Menard, D. Chillet, F. Charot, and O. Sentieys, "Automatic floating-point to fixed-point conversion for DSP code generation," in Proc. of the 2002 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, Oct. 2002.
[3] C. Shi and R. W. Brodersen, "An automated floating-point to fixed-point conversion methodology," in Proc. of the IEEE International Conf. on Acoustics, Speech, and Signal Processing, vol. II, pp. 529-532, 2003.
[4] K. Han, "Automating transformations from floating-point to fixed-point," Ph.D. dissertation, The University of Texas at Austin, 1996.
[5] A. Bârleanu, V. Băitoiu, and A. Stan, "Digital filter optimization for C language," Advances in Electrical and Computer Engineering, to be published.
[6] R. B. Kearfott, "Interval computations: Introduction, uses, and resources," Euromath Bulletin, vol. 2, no. 1, pp. 95-112, 1996.
[7] D. Goldberg, "What every computer scientist should know about floating-point arithmetic," ACM Computing Surveys, vol. 23, no. 1, 1991.
[8] Programming Languages - C, International Standard ISO/IEC 9899:TC2.
[9] R. J. Mitchell and P. R. Minchinton, "A note on dividing integers by two," The Computer Journal, vol. 32, no. 4, p. 380, Aug. 1989.
