
Lecture 8 - Floating Point Arithmetic, The IEEE Standard

MIT 18.335J / 6.337J Introduction to Numerical Methods

Per-Olof Persson October 3, 2006

Floating Point Formats

Scientific notation:

$-1.602 \times 10^{-19}$
(sign: $-$, significand: $1.602$, base: $10$, exponent: $-19$)

Floating point representation:

$\pm\left(d_0 + d_1\beta^{-1} + \cdots + d_{p-1}\beta^{-(p-1)}\right)\beta^{e}, \quad 0 \le d_i < \beta$

with base $\beta$ and precision $p$

Exponent range $[e_{\min}, e_{\max}]$

Normalized if $d_0 \ne 0$ (use $e = e_{\min} - 1$ to represent 0)

Floating Point Numbers


The gaps between adjacent numbers scale with the size of the numbers

Relative resolution given by machine epsilon, $\epsilon_{\text{machine}} = 0.5\,\beta^{1-p}$

For all $x$, there exists a floating point number $x'$ such that $|x - x'| \le \epsilon_{\text{machine}}\,|x|$

Example: $\beta = 2$, $p = 3$, $e_{\min} = -1$, $e_{\max} = 2$
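The example system above is small enough to enumerate by hand or by machine. The following is a minimal Python sketch, not part of the original slides: the function name `toy_floats` is hypothetical, and exact rational arithmetic is used so that the toy system is not disturbed by the host's own rounding. It lists the positive normalized numbers and checks the machine-epsilon bound at the worst-case points, the midpoints between neighbours.

```python
from fractions import Fraction

def toy_floats(beta=2, p=3, e_min=-1, e_max=2):
    """Positive normalized numbers of a toy floating point system (hypothetical helper)."""
    values = set()
    for e in range(e_min, e_max + 1):
        # A normalized significand d0.d1...d(p-1) with d0 != 0 is an integer m,
        # beta^(p-1) <= m < beta^p, scaled by beta^-(p-1).
        for m in range(beta ** (p - 1), beta ** p):
            values.add(Fraction(m) * Fraction(beta) ** (e - (p - 1)))
    return sorted(values)

numbers = toy_floats()
eps_machine = Fraction(1, 2) * Fraction(2) ** (1 - 3)  # 0.5 * beta^(1-p) = 1/8

# The relative error of rounding to nearest is largest at the midpoint
# between two adjacent numbers a < b: (b - a)/2 divided by (a + b)/2.
worst = max((b - a) / (a + b) for a, b in zip(numbers, numbers[1:]))

print([str(v) for v in numbers])
print("eps_machine =", eps_machine, " worst relative error =", worst)
```

The gaps grow from 1/8 near 1/2 to 1 near 7, while the worst relative error stays below `eps_machine`, which is the sense in which the resolution is relative rather than absolute.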

Special Quantities
$\pm\infty$ is returned when an operation overflows

$x/\pm\infty = 0$ for any finite number $x$, $x/0 = \pm\infty$ for any nonzero number $x$

Operations with infinity are defined as limits, e.g. $4 - \infty = \lim_{x \to \infty} (4 - x) = -\infty$

NaN (Not a Number) is returned when an operation has no well-defined finite or infinite result

Examples: $\infty - \infty$, $\infty/\infty$, $0/0$, $\sqrt{-1}$, $\text{NaN} \odot x$ (for any arithmetic operation $\odot$)
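These rules can be observed directly on IEEE doubles. The sketch below is an illustration in Python and not part of the original slides. One caveat about the language: Python raises `ZeroDivisionError` for `1.0 / 0.0` and `ValueError` for `math.sqrt(-1)` instead of returning $\pm\infty$ or NaN, so those two cases are left out and the infinities are taken from `math.inf`.

```python
import math

inf = math.inf
nan = math.nan

overflow = 1e308 * 10        # exceeds the largest double, so the product overflows
quotient = 1.0 / inf         # x / inf for finite x
limit = 4 - inf              # defined as the limit of 4 - x as x grows
undefined_diff = inf - inf   # no well-defined limit
undefined_ratio = inf / inf  # no well-defined limit
propagated = nan + 1.0       # NaN combined with any number

print(overflow)              # inf
print(quotient)              # 0.0
print(limit)                 # -inf
print(undefined_diff)        # nan
print(undefined_ratio)       # nan
print(propagated)            # nan
print(nan == nan)            # False: NaN compares unequal to everything, itself included
```

Because NaN never compares equal to itself, `math.isnan` is the reliable way to test for it.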

Denormalized Numbers
With normalized significand there is a gap between 0 and $\beta^{e_{\min}}$

This can result in $x - y = 0$ even though $x \ne y$, and code fragments like "if $x \ne y$ then $z = 1/(x - y)$" might break

Solution: Allow non-normalized significand when the exponent is $e_{\min}$

This gradual underflow guarantees that $x = y \iff x - y = 0$
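A small Python experiment, offered as an illustration and not taken from the slides, shows gradual underflow at work in double precision. It assumes the platform does not flush denormalized numbers to zero, which is the default for standard CPython builds.

```python
import math
import sys

tiny = sys.float_info.min   # smallest positive normalized double, 2**-1022
x = 1.5 * tiny
y = tiny

# The exact difference is 0.5 * tiny = 2**-1023, which lies below the
# normalized range and is only representable as a denormalized number.
diff = x - y

print(x != y)               # True
print(diff != 0.0)          # True: the difference did not underflow to zero
print(0.0 < diff < tiny)    # True: it sits in the gap between 0 and 2**-1022

z = 1.0 / diff              # safe: division by a nonzero denormalized number
print(math.isfinite(z))     # True
```

Without denormalized numbers, `diff` would round to zero and the guarded division would fail even though `x != y`.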

IEEE Single Precision


1 sign bit, 8 exponent bits, 23 significand bits:

S   E          M
0   00000000   00000000000000000000000

Represented number:

$(-1)^S \times 1.M \times 2^{E-127}$

Special cases:

             E = 0          0 < E < 255        E = 255
M = 0        $\pm 0$        Powers of 2        $\pm\infty$
M $\ne$ 0    Denormalized   Ordinary numbers   NaN

IEEE Single Precision, Examples


S   E          M                         Quantity
0   11111111   00000100000000000000000   NaN
1   11111111   00100010001001010101010   NaN
0   11111111   00000000000000000000000   $+\infty$
0   10000001   10100000000000000000000   $+1 \cdot 2^{129-127} \cdot 1.101_2 = 6.5$
0   10000000   00000000000000000000000   $+1 \cdot 2^{128-127} \cdot 1.0_2 = 2$
0   00000001   00000000000000000000000   $+1 \cdot 2^{1-127} \cdot 1.0_2 = 2^{-126}$
0   00000000   10000000000000000000000   $+1 \cdot 2^{-126} \cdot 0.1_2 = 2^{-127}$
0   00000000   00000000000000000000001   $+1 \cdot 2^{-126} \cdot 2^{-23} = 2^{-149}$
0   00000000   00000000000000000000000   $0$
1   00000000   00000000000000000000000   $-0$
1   10000001   10100000000000000000000   $-1 \cdot 2^{129-127} \cdot 1.101_2 = -6.5$
1   11111111   00000000000000000000000   $-\infty$

IEEE Floating Point Data Types

                              Single precision                    Double precision
Significand size (p)          24 bits                             53 bits
Exponent size                 8 bits                              11 bits
Total size                    32 bits                             64 bits
$e_{\max}$                    +127                                +1023
$e_{\min}$                    -126                                -1022
Smallest normalized           $2^{-126} \approx 10^{-38}$         $2^{-1022} \approx 10^{-308}$
Largest normalized            $2^{127} \approx 10^{38}$           $2^{1023} \approx 10^{308}$
$\epsilon_{\text{machine}}$   $2^{-24} \approx 6 \cdot 10^{-8}$   $2^{-53} \approx 10^{-16}$
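The single-precision bit patterns and limits above can be checked mechanically. The following Python sketch is an illustration and not part of the slides: `decode_single` is a hypothetical helper that packs the three fields into a 32-bit word and lets the standard `struct` module reinterpret it as an IEEE single.

```python
import math
import struct

def decode_single(s, e_bits, m_bits):
    """Interpret a sign bit, 8 exponent bits and 23 significand bits as a float."""
    assert len(e_bits) == 8 and len(m_bits) == 23
    word = (s << 31) | (int(e_bits, 2) << 23) | int(m_bits, 2)
    # Pack as a big-endian unsigned 32-bit integer, unpack as a big-endian float.
    return struct.unpack(">f", struct.pack(">I", word))[0]

zeros = "0" * 23
six_and_a_half = decode_single(0, "10000001", "101" + "0" * 20)
smallest_normalized = decode_single(0, "00000001", zeros)
smallest_denormalized = decode_single(0, "00000000", "0" * 22 + "1")
minus_infinity = decode_single(1, "11111111", zeros)
not_a_number = decode_single(0, "11111111", "00000100000000000000000")

print(six_and_a_half)                        # 6.5
print(smallest_normalized == 2.0 ** -126)    # True
print(smallest_denormalized == 2.0 ** -149)  # True
print(minus_infinity)                        # -inf
print(math.isnan(not_a_number))              # True
```

The decoded values are returned as Python floats, which are doubles. Every single-precision value is exactly representable in double precision, so the comparisons against powers of two are exact.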

Floating Point Arithmetic


Define $\mathrm{fl}(x)$ as the closest floating point approximation to $x$

By the definition of $\epsilon_{\text{machine}}$, we have for the relative error:

For all $x \in \mathbb{R}$, there exists $\epsilon$ with $|\epsilon| \le \epsilon_{\text{machine}}$ such that $\mathrm{fl}(x) = x(1 + \epsilon)$

The result of an operation $\circledast$ using floating point numbers is $\mathrm{fl}(a * b)$

If $\mathrm{fl}(a * b)$ is the nearest floating point number to $a * b$, the arithmetic rounds correctly (IEEE does), which leads to the following property:

For all floating point $x, y$, there exists $\epsilon$ with $|\epsilon| \le \epsilon_{\text{machine}}$ such that $x \circledast y = (x * y)(1 + \epsilon)$

Round to nearest even in the case of ties
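Both error bounds and the tie-breaking rule can be checked exactly in double precision. The sketch below is an illustration and not part of the slides. It assumes Python floats are IEEE doubles with correctly rounded arithmetic, which holds on common platforms, and uses `fractions.Fraction` to measure the rounding error without introducing any of its own. The helper `rel_error` is hypothetical.

```python
from fractions import Fraction

EPS_MACHINE = Fraction(1, 2 ** 53)   # 0.5 * 2^(1-53) for double precision

def rel_error(exact, computed):
    """Exact relative error |computed - exact| / |exact| as a rational number."""
    return abs(Fraction(computed) - exact) / abs(exact)

# fl(x) = x(1 + eps): round the real number 1/10 to the nearest double.
x_exact = Fraction(1, 10)
err_fl = rel_error(x_exact, float(x_exact))

# x (*) y = (x * y)(1 + eps): compare a floating point product of two doubles
# with their exact rational product.
x, y = 0.1, 0.3
err_mul = rel_error(Fraction(x) * Fraction(y), x * y)

print(0 < err_fl <= EPS_MACHINE)   # True
print(err_mul <= EPS_MACHINE)      # True

# Ties go to the neighbour whose last significand bit is 0 ("even").
tie_to_lower = (1.0 + 2.0 ** -53) == 1.0                    # halfway between 1 and 1 + 2^-52
tie_to_upper = (1.0 + 3 * 2.0 ** -53) == 1.0 + 2.0 ** -51   # halfway between 1 + 2^-52 and 1 + 2^-51
print(tie_to_lower, tie_to_upper)  # True True
```

In the first tie the even neighbour is the smaller one, in the second it is the larger one, so rounding to even does not bias results in one direction.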



MIT OpenCourseWare http://ocw.mit.edu

18.335J / 6.337J Introduction to Numerical Methods


Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
