@c Copyright @copyright{} 2022 Richard Stallman and Free Software Foundation, Inc.
@c This is part of the GNU C Intro and Reference Manual
@c and covered by its license.

@node Floating Point in Depth
@chapter Floating Point in Depth

@menu
* Floating Representations::
* Floating Type Specs::
* Special Float Values::
* Invalid Optimizations::
* Exception Flags::
* Exact Floating-Point::
* Rounding::
* Rounding Issues::
* Significance Loss::
* Fused Multiply-Add::
* Error Recovery::
@c * Double-Rounding Problems::
* Exact Floating Constants::
* Handling Infinity::
* Handling NaN::
* Signed Zeros::
* Scaling by the Base::
* Rounding Control::
* Machine Epsilon::
* Complex Arithmetic::
* Round-Trip Base Conversion::
* Further Reading::
@end menu

@node Floating Representations
@section Floating-Point Representations
@cindex floating-point representations
@cindex representation of floating-point numbers

@cindex IEEE 754-2008 Standard
Storing numbers as @dfn{floating point} allows representation of
numbers with fractional values, in a range larger than that of
hardware integers. A floating-point number consists of a sign bit, a
@dfn{significand} (also called the @dfn{mantissa}), and a power of a
fixed base. GNU C uses the floating-point representations specified by
the @cite{IEEE 754-2008 Standard for Floating-Point Arithmetic}.

The IEEE 754-2008 specification defines basic binary floating-point
formats of five different sizes: 16-bit, 32-bit, 64-bit, 128-bit, and
256-bit. The formats of 32, 64, and 128 bits are used for the
standard C types @code{float}, @code{double}, and @code{long double}.
GNU C supports the 16-bit floating point type @code{_Float16} on some
platforms, but does not support the 256-bit floating point type.

Each of the formats encodes the floating-point number as a sign bit.
After this comes an exponent that specifies a power of 2 (with a fixed
offset). Then comes the significand.

The first bit of the significand, before the binary point, is always
1 for normalized numbers, so there is no need to store it in memory.
It is called the @dfn{hidden bit} because it doesn't appear in the
floating-point number as used in the computer itself.

All of those floating-point formats are sign-magnitude representations,
so +0 and @minus{}0 are different values.

Besides the IEEE 754 format 128-bit float, GNU C also offers a format
consisting of a pair of 64-bit floating point numbers. This lacks the
full exponent range of the IEEE 128-bit format, but is useful when the
underlying hardware platform does not support that.
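
As a rough illustration of the sign/exponent/significand split
described above, the standard @code{frexp} function splits a value
into a normalized fraction and a power of 2, and the @samp{%a} output
format displays the sign, hexadecimal significand, and exponent
directly; this is only a sketch, since the stored field widths and
offsets vary by format:

@example
#include <math.h>
#include <stdio.h>

int main (void)
@{
  double x = -6.25;
  int expo;
  /* @r{frexp returns f with 0.5 <= |f| < 1 such that x == f * 2**expo.} */
  double frac = frexp (x, &expo);
  printf ("%g = %g * 2**%d\n", x, frac, expo);
  printf ("%a\n", x);   /* @r{prints -0x1.9p+2} */
  return 0;
@}
@end example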

@node Floating Type Specs
@section Floating-Point Type Specifications

The standard library header file @file{float.h} defines a number of
constants that describe the platform's implementation of
floating-point types @code{float}, @code{double} and @code{long
double}. They include:

@findex FLT_MIN
@findex DBL_MIN
@findex LDBL_MIN
@findex FLT_HAS_SUBNORM
@findex DBL_HAS_SUBNORM
@findex LDBL_HAS_SUBNORM
@findex FLT_TRUE_MIN
@findex DBL_TRUE_MIN
@findex LDBL_TRUE_MIN
@findex FLT_MAX
@findex DBL_MAX
@findex LDBL_MAX
@findex FLT_DECIMAL_DIG
@findex DBL_DECIMAL_DIG
@findex LDBL_DECIMAL_DIG

@table @code
@item FLT_MIN
@itemx DBL_MIN
@itemx LDBL_MIN
Defines the minimum normalized positive floating-point values that can
be represented with the type.

@item FLT_HAS_SUBNORM
@itemx DBL_HAS_SUBNORM
@itemx LDBL_HAS_SUBNORM
Defines if the floating-point type supports subnormal (or ``denormalized'')
numbers or not (@pxref{subnormal numbers}).

@item FLT_TRUE_MIN
@itemx DBL_TRUE_MIN
@itemx LDBL_TRUE_MIN
Defines the minimum positive values (including subnormal values) that
can be represented with the type.

@item FLT_MAX
@itemx DBL_MAX
@itemx LDBL_MAX
Defines the largest values that can be represented with the type.

@item FLT_DECIMAL_DIG
@itemx DBL_DECIMAL_DIG
@itemx LDBL_DECIMAL_DIG
Defines the number of decimal digits @code{n} such that any
floating-point number that can be represented in the type can be
rounded to a floating-point number with @code{n} decimal digits, and
back again, without losing any precision of the value.
@end table
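
For example, this sketch prints a few of those limits for
@code{double}; the @code{_TRUE_MIN} and @code{_DECIMAL_DIG} macros
assume a C11 @file{float.h}, and the exact values printed vary by
platform:

@example
#include <float.h>
#include <stdio.h>

int main (void)
@{
  printf ("DBL_MIN         = %g\n", DBL_MIN);      /* @r{smallest normalized} */
  printf ("DBL_TRUE_MIN    = %g\n", DBL_TRUE_MIN); /* @r{smallest subnormal} */
  printf ("DBL_MAX         = %g\n", DBL_MAX);      /* @r{largest finite} */
  printf ("DBL_DECIMAL_DIG = %d\n", DBL_DECIMAL_DIG);
  return 0;
@}
@end example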

@node Special Float Values
@section Special Floating-Point Values
@cindex special floating-point values
@cindex floating-point values, special

IEEE floating point provides for special values that are not ordinary
numbers.

@table @asis

@item infinities
@code{+Infinity} and @code{-Infinity} are two different infinite
values, one positive and one negative. These result from
operations such as @code{1 / 0}, @code{Infinity + Infinity},
@code{Infinity * Infinity}, and @code{Infinity + @var{finite}}, and also
from a result that is finite, but larger than the most positive possible
value or smaller than the most negative possible value.

@xref{Handling Infinity}, for more about working with infinities.

@item NaNs (not a number)
@cindex QNaN
@cindex SNaN
There are two special values, called Not-a-Number (NaN): a quiet
NaN (QNaN), and a signaling NaN (SNaN).

A QNaN is produced by operations for which the value is undefined
in real arithmetic, such as @code{0 / 0}, @code{sqrt (-1)},
@code{Infinity - Infinity}, and any basic operation in which an
operand is a QNaN.

The signaling NaN is intended for initializing
otherwise-unassigned storage, and the goal is that unlike a
QNaN, an SNaN @emph{does} cause an interrupt that can be caught
by a software handler, diagnosed, and reported. In practice,
little use has been made of signaling NaNs, because the most
common CPUs in desktop and portable computers fail to implement
the full IEEE 754 Standard, and supply only one kind of NaN, the
quiet one. Also, programming-language standards have taken
decades to catch up to the IEEE 754 standard, and implementations
of those language standards added a further delay before
programmers became willing to use these features.

To enable support for signaling NaNs, use the GCC command-line option
@option{-fsignaling-nans}, but this is an experimental feature and may
not work as expected in every situation.

A NaN has a sign bit, but its value means nothing.

@xref{Handling NaN}, for more about working with NaNs.

@item subnormal numbers
@cindex subnormal numbers
@cindex underflow, floating
@cindex floating underflow
@anchor{subnormal numbers}
It can happen that a computed floating-point value is too small to
represent, such as when two tiny numbers are multiplied. The result
is then said to @dfn{underflow}. The traditional behavior before
the IEEE 754 Standard was to use zero as the result, and possibly to report
the underflow in some sort of program output.

The IEEE 754 Standard is vague about whether rounding happens
before detection of floating underflow and overflow, or after, and CPU
designers may choose either.

However, the Standard does something unusual compared to earlier
designs, and that is that when the result is smaller than the
smallest @dfn{normalized} representable value (i.e., one in
which the leading significand bit is @code{1}), the normalization
requirement is relaxed, leading zero bits are permitted, and
precision is gradually lost until there are no more bits in the
significand. That phenomenon is called @dfn{gradual underflow},
and it serves important numerical purposes, although it does
reduce the precision of the final result. Some floating-point
designs allow you to choose at compile time, or even at
run time, whether underflows are gradual, or are flushed abruptly
to zero. Numbers that have entered the region of gradual
underflow are called @dfn{subnormal}.

You can use the library functions @code{fesetround} and
@code{fegetround} to set and get the rounding mode. Rounding modes
are defined (if supported by the platform) in @file{fenv.h} as:
@code{FE_UPWARD} to round toward positive infinity; @code{FE_DOWNWARD}
to round toward negative infinity; @code{FE_TOWARDZERO} to round
toward zero; and @code{FE_TONEAREST} to round to the nearest
representable value, the default mode. It is best to use
@code{FE_TONEAREST} except when there is a special need for some other
mode.
@end table
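
Here is a small sketch that produces one value of each special kind;
on IEEE 754 hardware it prints an Infinity, a quiet NaN, and a
subnormal number:

@example
#include <float.h>
#include <math.h>
#include <stdio.h>

int main (void)
@{
  volatile double zero = 0.0;  /* @r{volatile thwarts constant folding} */
  double inf  = 1.0 / zero;    /* @r{+Infinity} */
  double qnan = zero / zero;   /* @r{quiet NaN} */
  double sub  = DBL_MIN / 16;  /* @r{gradual underflow gives a subnormal} */
  printf ("%g %g %g\n", inf, qnan, sub);
  printf ("isnan (qnan) = %d, sub < DBL_MIN = %d\n",
          isnan (qnan), sub < DBL_MIN);
  return 0;
@}
@end example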

@node Invalid Optimizations
@section Invalid Optimizations
@cindex invalid optimizations in floating-point arithmetic
@cindex floating-point arithmetic invalid optimizations

Signed zeros, Infinity, and NaN invalidate some optimizations by
programmers and compilers that might otherwise have seemed obvious:

@itemize @bullet
@item
@code{x + 0} and @code{x - 0} are not the same as @code{x} when
@code{x} is zero, because the result depends on the rounding rule.
@xref{Rounding}, for more about rounding rules.

@item
@code{x * 0.0} is not the same as @code{0.0} when @code{x} is
Infinity, a NaN, or negative zero.

@item
@code{x / x} is not the same as @code{1.0} when @code{x} is Infinity,
a NaN, or zero.

@item
@code{(x - y)} is not the same as @code{-(y - x)} because when the
operands are finite and equal, one evaluates to @code{+0} and the
other to @code{-0}.

@item
@code{x - x} is not the same as @code{0.0} when @var{x} is Infinity or
a NaN.

@item
@code{x == x} and @code{x != x} are not equivalent to @code{1} and
@code{0} when @var{x} is a NaN.

@item
@code{x < y} and @code{isless (x, y)} are not equivalent, because the
first sets a sticky exception flag (@pxref{Exception Flags}) when an
operand is a NaN, whereas the second does not affect that flag. The
same holds for the other @code{isxxx} functions that are companions to
relational operators. @xref{FP Comparison Functions, , , libc, The
GNU C Library Reference Manual}.
@end itemize

The @option{-funsafe-math-optimizations} option enables
these optimizations.
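
Here is a sketch of two of those cases; compiled without
@option{-funsafe-math-optimizations}, it shows that the ``obvious''
simplifications do not hold:

@example
#include <math.h>
#include <stdio.h>

int main (void)
@{
  volatile double x = -0.0;
  /* @r{x + 0.0 is +0.0 here, not x: the zero's sign is lost.} */
  printf ("x = %g, x + 0.0 = %g\n", x, x + 0.0);

  volatile double y = INFINITY;
  /* @r{y - y and y / y are NaN, not 0.0 and 1.0.} */
  printf ("y - y = %g, y / y = %g\n", y - y, y / y);
  return 0;
@}
@end example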

@node Exception Flags
@section Floating Arithmetic Exception Flags
@cindex floating arithmetic exception flags
@cindex exception flags (floating point)
@cindex sticky exception flags (floating point)
@cindex floating overflow
@cindex overflow, floating
@cindex floating underflow
@cindex underflow, floating

@dfn{Sticky exception flags} record the occurrence of particular
conditions: once set, they remain set until the program explicitly
clears them.

The conditions include @emph{invalid operand},
@emph{division-by-zero}, @emph{inexact result} (i.e., one that
required rounding), @emph{underflow}, and @emph{overflow}. Some
extended floating-point designs offer several additional exception
flags. The functions @code{feclearexcept}, @code{feraiseexcept},
@code{fetestexcept}, @code{fegetexceptflags}, and
@code{fesetexceptflags} provide a standardized interface to those
flags. @xref{Status bit operations, , , libc, The GNU C Library
Reference Manual}.

One important use of those @anchor{fetestexcept}flags is to do a
computation that is normally expected to be exact in floating-point
arithmetic, but occasionally might not be, in which case, corrective
action is needed. You can clear the @emph{inexact result} flag with a
call to @code{feclearexcept (FE_INEXACT)}, do the computation, and
then test the flag with @code{fetestexcept (FE_INEXACT)}; the result
of that call is 0 if the flag is not set (there was no rounding), and
nonzero when there was rounding (which, we presume, implies the
program has to correct for that).
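
In code, that pattern looks like this sketch (strict ISO C also calls
for @code{#pragma STDC FENV_ACCESS ON} before touching the flags; on
GNU/Linux, link with @option{-lm}):

@example
#include <fenv.h>
#include <stdio.h>

int main (void)
@{
  volatile double x = 1.0, y = 3.0;
  feclearexcept (FE_INEXACT);
  volatile double q = x / y;   /* @r{1/3 is not exact in binary} */
  if (fetestexcept (FE_INEXACT))
    printf ("rounding occurred: q = %.17g\n", q);
  return 0;
@}
@end example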

@c =====================================================================

@ignore
@node IEEE 754 Decimal Arithmetic
@section IEEE 754 Decimal Arithmetic
@cindex IEEE 754 decimal arithmetic

One of the difficulties that users of computers for numerical
work face, whether they realize it or not, is that the computer
does not operate in the number base that most people are familiar
with. As a result, input decimal fractions must be converted to
binary floating-point values for use in computations, and then
the final results converted back to decimal for humans. Because the
precision is finite and limited, and because algorithms for correct
round-trip conversion between number bases were not known until the
1990s, and are still not implemented on most systems and most
programming languages, the result is frequent confusion for users,
when a simple expression like @code{3.0*(1.0/3.0)} evaluates to
0.999999 instead of the expected 1.0. Here is an
example from a floating-point calculator that allows rounding-mode
control, with the mode set to @emph{round-towards-zero}:

@example
for (k = 1; k <= 10; ++k)
   (void)printf ("%2d\t%10.6f\n", k, k*(1.0/k))
 1    1.000000
 2    1.000000
 3    0.999999
 4    1.000000
 5    0.999999
 6    0.999999
 7    0.999999
 8    1.000000
 9    0.999999
10    0.999999
@end example

Increasing working precision can sometimes help by reducing
intermediate rounding errors, but the reality is that when
fractional values are involved, @emph{no amount} of extra
precision can suffice for some computations. For example, the
nice decimal value @code{1/10} in C99-style binary representation
is @code{+0x1.999999999999ap-4}; that final digit @code{a} is the
rounding of an infinite string of @code{9}'s.

Financial computations in particular depend critically on correct
arithmetic, and the losses due to rounding errors can be
large, especially for businesses with large numbers of small
transactions, such as grocery stores and telephone companies.
Tax authorities are particularly picky, and demand specific
rounding rules, including one that instead of rounding ties to
the nearest number, rounds instead in the direction that favors
the taxman.

Programming languages used for business applications, notably the
venerable Cobol language, have therefore always implemented
financial computations in @emph{fixed-point decimal arithmetic}
in software, and because of the large monetary amounts that must be
processed, successive Cobol standards have increased the minimum
number size from 18 to 32 decimal digits, and the most recent one
requires a decimal exponent range of at least @code{[-999, +999]}.

The revised IEEE 754-2008 standard therefore requires decimal
floating-point arithmetic, as well as the now-widely used binary
formats from 1985. Like the binary formats, the decimal formats
also support Infinity, NaN, and signed zero, and the five basic
operations are also required to produce correctly rounded
representations of infinitely precise exact results.

However, the financial applications of decimal arithmetic
introduce some new features:

@itemize @bullet

@item
There are three decimal formats occupying 32, 64, and 128 bits of
storage, and offering exactly 7, 16, and 34 decimal digits of
precision. If one size has @code{n} digits, the next larger size
has @code{2 n + 2} digits. Thus, a future 256-bit format would
supply 70 decimal digits, and at least one library already
supports the 256-bit binary and decimal formats.

@item
Decimal arithmetic has an additional rounding mode, called
@emph{round-ties-to-away-from-zero}, meaning that a four-digit
rounding of @code{1.2345} is @code{1.235}, and @code{-1.2345}
becomes @code{-1.235}. That rounding mode is mandated by
financial laws in several countries.

@item
The decimal significand is an
@anchor{decimal-significand}@emph{integer}, instead of a fractional
value, and trailing zeros are only removed at user request. That
feature allows floating-point arithmetic to emulate the
@emph{fixed-point arithmetic} traditionally used in financial
computations.

@end itemize

@noindent
We can easily estimate how many digits are likely to be needed for
financial work: seven billion people on Earth, with an average annual
income of less than US$10,000, means a world financial base that can
be represented in just 15 decimal digits. Even allowing for alternate
currencies, future growth, multiyear accounting, and intermediate
computations, the 34 digits supplied by the 128-bit format are more
than enough for financial purposes.

We return to decimal arithmetic later in this chapter
(@pxref{More on decimal floating-point arithmetic}), after we have
covered more about floating-point arithmetic in general.

@c =====================================================================

@end ignore

@node Exact Floating-Point
@section Exact Floating-Point Arithmetic
@cindex exact floating-point arithmetic
@cindex floating-point arithmetic, exact

As long as the numbers are exactly representable (fractions whose
denominator is a power of 2), and intermediate results do not require
rounding, then floating-point arithmetic is @emph{exact}. It is easy
to predict how many digits are needed for the results of arithmetic
operations:

@itemize @bullet

@item
addition and subtraction of two @var{n}-digit values with the
@emph{same} exponent require at most @code{@var{n} + 1} digits, but
when the exponents differ, many more digits may be needed;

@item
multiplication of two @var{n}-digit values requires exactly
2 @var{n} digits;

@item
although integer division produces a quotient and a remainder of
no more than @var{n} digits, floating-point remainder and square
root may require an unbounded number of digits, and the quotient
can need many more digits than can be stored.

@end itemize

Whenever a result requires more than @var{n} digits, rounding
is needed.
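
For instance, values with power-of-2 denominators combine exactly,
while a decimal value like @code{0.1} must be rounded; this sketch
uses the @samp{%a} format (@pxref{Exact Floating Constants}) to make
the difference visible:

@example
#include <stdio.h>

int main (void)
@{
  /* @r{0.5 + 0.25 is exact: both are small powers of 2.} */
  printf ("%a\n", 0.5 + 0.25);   /* @r{prints 0x1.8p-1, exactly 0.75} */
  /* @r{0.1 has no finite binary representation, so it is rounded.} */
  printf ("%.20g\n", 0.1);       /* @r{prints 0.10000000000000000555} */
  return 0;
@}
@end example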

@c =====================================================================

@node Rounding
@section Rounding
@cindex rounding

When floating-point arithmetic produces a result that can't fit
exactly in the significand of the type that's in use, it has to
@dfn{round} the value. The basic arithmetic operations---addition,
subtraction, multiplication, division, and square root---always produce
a result that is equivalent to the exact, possibly infinite-precision
result rounded to storage precision according to the current rounding
rule.

Rounding sets the @code{FE_INEXACT} exception flag (@pxref{Exception
Flags}). This enables programs to determine that rounding has
occurred.

Rounding consists of adjusting the exponent to bring the significand
back to the required base-point alignment, then applying the current
@dfn{rounding rule} to squeeze the significand into the fixed
available size.

The current rule is selected at run time from four options. Here they
are:

@itemize *
@item
@emph{round-to-nearest}, with ties rounded to an even integer;

@item
@emph{round-up}, towards @code{+Infinity};

@item
@emph{round-down}, towards @code{-Infinity};

@item
@emph{round-towards-zero}.
@end itemize

Under those four rounding rules, a decimal value
@code{-1.2345} that is to be rounded to a four-digit result would
become @code{-1.234}, @code{-1.234}, @code{-1.235}, and
@code{-1.234}, respectively.

The default rounding rule is @emph{round-to-nearest}, because that has
the least bias, and produces the lowest average error. When the true
result lies exactly halfway between two representable machine numbers,
the result is rounded to the one that ends with an even digit.

The @emph{round-towards-zero} rule was common on many early computer
designs, because it is the easiest to implement: it just requires
silent truncation of all extra bits.

The two other rules, @emph{round-up} and @emph{round-down}, are
essential for implementing @dfn{interval arithmetic}, whereby
each arithmetic operation produces lower and upper bounds that
are guaranteed to enclose the exact result.

@xref{Rounding Control}, for details on getting and setting the
current rounding mode.

@node Rounding Issues
@section Rounding Issues
@cindex rounding issues (floating point)
@cindex floating-point rounding issues

The default IEEE 754 rounding mode minimizes errors, and most
normal computations should not suffer any serious accumulation of
errors from rounding.

Of course, you can contrive examples where that is not so. Here
is one: iterate a square root, then attempt to recover the
original value by repeated squaring.

@example
#include <stdio.h>
#include <math.h>

int main (void)
@{
  double x = 100.0;
  double y;
  int n, k;

  for (n = 10; n <= 100; n += 10)
    @{
      y = x;
      for (k = 0; k < n; ++k) y = sqrt (y);
      for (k = 0; k < n; ++k) y *= y;
      printf ("n = %3d; x = %.0f\ty = %.6f\n", n, x, y);
    @}
  return 0;
@}
@end example

@noindent
Here is the output:

@example
n =  10; x = 100    y = 100.000000
n =  20; x = 100    y = 100.000000
n =  30; x = 100    y = 99.999977
n =  40; x = 100    y = 99.981025
n =  50; x = 100    y = 90.017127
n =  60; x = 100    y = 1.000000
n =  70; x = 100    y = 1.000000
n =  80; x = 100    y = 1.000000
n =  90; x = 100    y = 1.000000
n = 100; x = 100    y = 1.000000
@end example

After 50 iterations, @code{y} has barely one correct digit, and
soon after, there are no correct digits.

@c =====================================================================

@node Significance Loss
@section Significance Loss
@cindex significance loss (floating point)
@cindex floating-point significance loss

A much more serious source of error in floating-point computation is
@dfn{significance loss} from subtraction of nearly equal values. This
means that the number of bits in the significand of the result is
fewer than the size of the value would permit. If the values being
subtracted are close enough, but still not equal, a @emph{single
subtraction} can wipe out all correct digits, possibly contaminating
all future computations.
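
Here is a tiny sketch of the effect: the two operands below agree in
all but their last decimal digit, and a single subtraction leaves
barely one digit of useful information:

@example
#include <stdio.h>

int main (void)
@{
  double x = 1.000000000000001;  /* @r{nearly equal values} */
  double y = 1.000000000000000;
  /* @r{The mathematical difference is 1.0e-15, but the computed} */
  /* @r{result is off by about 11%: the leading digits cancelled,} */
  /* @r{leaving mostly the rounding error from converting x.} */
  printf ("%.17g\n", x - y);     /* @r{prints 1.1102230246251565e-15} */
  return 0;
@}
@end example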

Floating-point calculations can sometimes be carefully designed so
that significance loss is not possible, such as summing a series where
all terms have the same sign. For example, the Taylor series
expansions of the trigonometric and hyperbolic sines have terms of
identical magnitude, of the general form @code{@var{x}**(2*@var{n} +
1) / (2*@var{n} + 1)!}. However, those in the trigonometric sine series
alternate in sign, while those in the hyperbolic sine series are all
positive. Here is the output of two small programs that sum @var{k}
terms of the series for @code{sin (@var{x})}, and compare the computed
sums with known-to-be-accurate library functions:

@example
x = 10  k = 51
s (x)   = -0.544_021_110_889_270
sin (x) = -0.544_021_110_889_370

x = 20  k = 81
s (x)   = 0.912_945_250_749_573
sin (x) = 0.912_945_250_727_628

x = 30  k = 109
s (x)   = -0.987_813_746_058_855
sin (x) = -0.988_031_624_092_862

x = 40  k = 137
s (x)   = 0.617_400_430_980_474
sin (x) = 0.745_113_160_479_349

x = 50  k = 159
s (x)   = 57_105.187_673_745_720_532
sin (x) = -0.262_374_853_703_929

// sinh(x) series summation with positive signs
// with k terms needed to converge to machine precision

x = 10  k = 47
t (x)    = 1.101_323_287_470_340e+04
sinh (x) = 1.101_323_287_470_339e+04

x = 20  k = 69
t (x)    = 2.425_825_977_048_951e+08
sinh (x) = 2.425_825_977_048_951e+08

x = 30  k = 87
t (x)    = 5.343_237_290_762_229e+12
sinh (x) = 5.343_237_290_762_231e+12

x = 40  k = 105
t (x)    = 1.176_926_334_185_100e+17
sinh (x) = 1.176_926_334_185_100e+17

x = 50  k = 121
t (x)    = 2.592_352_764_293_534e+21
sinh (x) = 2.592_352_764_293_536e+21
@end example

@noindent
We have added underscores to the numbers to enhance readability.

The @code{sinh (@var{x})} series with positive terms can be summed to
high accuracy. By contrast, the series for @code{sin (@var{x})}
suffers increasing significance loss, so that when @var{x} = 30 only
two correct digits remain. Soon after, all digits are wrong, and the
answers are complete nonsense.

An important skill in numerical programming is to recognize when
significance loss is likely to contaminate a computation, and revise
the algorithm to reduce this problem. Sometimes, the only practical
way to do so is to compute in higher intermediate precision, which is
why the extended types like @code{long double} are important.

@c Formerly mentioned @code{__float128}

@c =====================================================================

@node Fused Multiply-Add
@section Fused Multiply-Add
@cindex fused multiply-add in floating-point computations
@cindex floating-point fused multiply-add

In 1990, when IBM introduced the POWER architecture, the CPU
provided a previously unknown instruction, the @dfn{fused
multiply-add} (FMA). It computes the value @code{x * y + z} with
an @strong{exact} double-length product, followed by an addition with a
@emph{single} rounding. Numerical computation often needs pairs of
multiply and add operations, for which the FMA is well-suited.

On the POWER architecture, there are two dedicated registers that
hold permanent values of @code{0.0} and @code{1.0}, and the
normal @emph{multiply} and @emph{add} instructions are just
wrappers around the FMA that compute @code{x * y + 0.0} and
@code{x * 1.0 + z}, respectively.

In the early days, it appeared that the main benefit of the FMA
was getting two floating-point operations for the price of one,
almost doubling the performance of some algorithms. However,
numerical analysts have since shown numerous uses of the FMA for
significantly enhancing accuracy. We discuss one of the most
important ones in the next section.

A few other architectures have since included the FMA, and most
provide variants for the related operations @code{x * y - z}
(FMS), @code{-x * y + z} (FNMA), and @code{-x * y - z} (FNMS).
@c The IEEE 754-2008 revision requires implementations to provide
@c the FMA, as a sixth basic operation.

The functions @code{fmaf}, @code{fma}, and @code{fmal} implement fused
multiply-add for the @code{float}, @code{double}, and @code{long
double} data types. Correct implementation of the FMA in software is
difficult, and some systems that appear to provide those functions do
not satisfy the single-rounding requirement. That situation should
change as more programmers use the FMA operation, and more CPUs
provide FMA in hardware.

Use the @option{-ffp-contract=fast} option to allow generation of FMA
instructions, or @option{-ffp-contract=off} to disallow it.
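
Here is a sketch of the single-rounding property: the exact product
below needs more bits than a @code{double} can hold, and only
@code{fma} recovers the low-order part that plain multiplication
rounds away:

@example
#include <math.h>
#include <stdio.h>

int main (void)
@{
  volatile double x = 1.0 + 0x1p-28;  /* @r{exactly representable} */
  double hi = 1.0 + 0x1p-27;          /* @r{x * x rounded to double} */
  /* @r{x*x is exactly 1 + 2**-27 + 2**-56; the last term needs} */
  /* @r{57 significand bits, so ordinary multiplication drops it.} */
  printf ("x * x - hi      = %a\n", x * x - hi);       /* @r{0x0p+0} */
  printf ("fma (x, x, -hi) = %a\n", fma (x, x, -hi));  /* @r{0x1p-56} */
  return 0;
@}
@end example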

@c =====================================================================

@node Error Recovery
@section Error Recovery
@cindex error recovery (floating point)
@cindex floating-point error recovery

When two numbers are combined by one of the four basic
operations, the result often requires rounding to storage
precision. For accurate computation, one would like to be able
to recover that rounding error. With historical floating-point
designs, it was difficult to do so portably, but now that IEEE
754 arithmetic is almost universal, the job is much easier.

For addition with the default @emph{round-to-nearest} rounding
mode, we can determine the error in a sum like this:

@example
volatile double err, sum, tmp, x, y;

if (fabs (x) >= fabs (y))
  @{
    sum = x + y;
    tmp = sum - x;
    err = y - tmp;
  @}
else /* fabs (x) < fabs (y) */
  @{
    sum = x + y;
    tmp = sum - y;
    err = x - tmp;
  @}
@end example

@noindent
@cindex twosum
Now, @code{x + y} is @emph{exactly} represented by @code{sum + err}.
This basic operation, which has come to be called @dfn{twosum}
in the numerical-analysis literature, is the first key to tracking,
and accounting for, rounding error.

To determine the error in subtraction, just swap the @code{+} and
@code{-} operators.

We used the @code{volatile} qualifier (@pxref{volatile}) in the
declaration of the variables, which forces the compiler to store and
retrieve them from memory, and prevents the compiler from optimizing
@code{err = y - ((x + y) - x)} into @code{err = 0}.

For multiplication, we can compute the rounding error without
magnitude tests with the FMA operation (@pxref{Fused Multiply-Add}),
like this:

@example
volatile double err, prod, x, y;
prod = x * y;             /* @r{rounded product} */
err = fma (x, y, -prod);  /* @r{exact product @code{= @var{prod} + @var{err}}} */
@end example

For addition, subtraction, and multiplication, we can represent the
exact result with the notional sum of two values. However, the exact
result of division, remainder, or square root potentially requires an
infinite number of digits, so we can at best approximate it.
Nevertheless, we can compute an error term that is close to the true
error: it is just that error value, rounded to machine precision.

For division, you can approximate @code{x / y} with @code{quo + err}
like this:

@example
volatile double err, quo, x, y;
quo = x / y;
err = fma (-quo, y, x) / y;
@end example

For square root, we can approximate @code{sqrt (x)} with @code{root +
err} like this:

@example
volatile double err, root, x;
root = sqrt (x);
err = fma (-root, root, x) / (root + root);
@end example

With the reliable and predictable floating-point design provided
by IEEE 754 arithmetic, we now have the tools we need to track
errors in the five basic floating-point operations, and we can
effectively simulate computing in twice working precision, which
is sometimes sufficient to remove almost all traces of arithmetic
errors.

@c =====================================================================

@ignore
@node Double-Rounding Problems
@section Double-Rounding Problems
@cindex double-rounding problems (floating point)
@cindex floating-point double-rounding problems

Most developers today use 64-bit x86 processors with a 64-bit
operating system, with a Streaming SIMD Extensions (SSE) instruction
set. In the past, using a 32-bit x87 instruction set was common,
leading to issues described in this section.

To offer a few more digits of precision and a wider exponent range,
the IEEE 754 Standard included an optional @emph{temporary real}
format, with 11 more bits in the significand, and 4 more bits in the
biased exponent.

Compilers are free to exploit the longer format, and most do so.
That is usually a @emph{good thing}, such as in computing a
lengthy sum or product, or in implementing the computation of the
hypotenuse of a right-triangle as @code{sqrt (x*x + y*y)}: the
wider exponent range is critical there for avoiding disastrous
overflow or underflow.

@findex fesetprec
@findex fegetprec
However, sometimes it is critical to know what the intermediate
precision and rounding mode are, such as in tracking errors with
the techniques of the preceding section. Some compilers provide
options to prevent the use of the 80-bit format in computations
with 64-bit @code{double}, but ensuring that code is always
compiled that way may be difficult. The x86 architecture has the
ability to force rounding of all operations in the 80-bit
registers to the 64-bit storage format, and some systems provide
a software interface with the functions @code{fesetprec} and
@code{fegetprec}. Unfortunately, neither of those functions is
defined by the ISO Standards for C and C++, and consequently,
they are not standardly available on all platforms that use
the x86 floating-point design.

When @code{double} computations are done in the 80-bit format,
results necessarily involve a @dfn{double rounding}: first to the
64-bit significand in intermediate operations in registers, and
then to the 53-bit significand when the register contents are
stored to memory. Here is an example in decimal arithmetic where
such a double rounding results in the wrong answer: round
@code{1_234_499} from seven to five to four digits. The result is
@code{1_235_000}, whereas the correct representation to four
significant digits is @code{1_234_000}.

@cindex -ffloat-store
One way to reduce the use of the 80-bit format is to declare variables
as @code{volatile double}: that way, the compiler is required to store
and load intermediates from memory, rather than keeping them in 80-bit
registers over long sequences of floating-point instructions. Doing
so does not, however, eliminate double rounding. The now-common
x86-64 architecture has separate sets of 32-bit and 64-bit
floating-point registers. The option @option{-ffloat-store} says that
floating-point computation should use only those registers, thus eliminating
the possibility of double rounding.
@end ignore

@c =====================================================================

@node Exact Floating Constants
@section Exact Floating-Point Constants
@cindex exact specification of floating-point constants
@cindex floating-point constants, exact specification of

One of the frustrations that numerical programmers have suffered
with since the dawn of digital computers is the inability to
precisely specify numbers in their programs. On the early
decimal machines, that was not an issue: you could write a
constant @code{1e-30} and be confident of that exact value
being used in floating-point operations. However, when the
hardware works in a base other than 10, then human-specified
numbers have to be converted to that base, and then converted
back again at output time. The two base conversions are rarely
exact, and unwanted rounding errors are introduced.

@cindex hexadecimal floating-point constants
C solves that problem with @dfn{hexadecimal floating-point
constants}, which specify a binary value exactly. For example,
@code{+0x1.fffffcp-1} is the IEEE 754 32-bit value
@code{1 - 2**(-23)}, just below @code{1.0}. The significand is
represented as a hexadecimal fraction, and the @emph{power of two} is
written in decimal following the exponent letter @code{p} (the
traditional exponent letter @code{e} is not possible, because it is a
hexadecimal digit).

In @code{printf} and @code{scanf} and related functions, you can use
the @samp{%a} and @samp{%A} format specifiers for writing and reading
hexadecimal floating-point values. @samp{%a} writes them with lower
case letters and @samp{%A} writes them with upper case letters. For
instance, this code reproduces our sample number:

@example
printf ("%a\n", 1.0 - pow (2.0, -23));
    @print{} 0x1.fffffcp-1
@end example

@noindent
The @code{strtod} family was similarly extended to recognize
numbers in that new format.
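
For example, this sketch reads the constant back with @code{strtod}
and shows that the round trip is exact:

@example
#include <stdio.h>
#include <stdlib.h>

int main (void)
@{
  double x = strtod ("0x1.fffffcp-1", NULL);
  /* @r{prints 0x1.fffffcp-1 = 0.99999988079071045} */
  printf ("%a = %.17g\n", x, x);
  return 0;
@}
@end example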

If you want to ensure exact data representation for transfer of
floating-point numbers between C programs on different
computers, then hexadecimal constants are an optimum choice.

@c =====================================================================

@node Handling Infinity
@section Handling Infinity
@cindex infinity in floating-point arithmetic
@cindex floating-point infinity

As we noted earlier, the IEEE 754 model of computing is not to stop
the program when exceptional conditions occur. It takes note of
exceptional values or conditions by setting sticky @dfn{exception
flags}, or by producing results with the special values Infinity and
QNaN. In this section, we discuss Infinity; @pxref{Handling NaN} for
the other.

In GNU C, you can create a value of negative Infinity in software like
this:

@example
double x;

x = -1.0 / 0.0;
@end example

GNU C supplies the @code{__builtin_inf}, @code{__builtin_inff}, and
@code{__builtin_infl} macros, and the GNU C Library provides the
@code{INFINITY} macro, all of which are compile-time constants for
positive infinity.

GNU C also provides a standard function to test for an Infinity:
@code{isinf (x)} returns @code{1} if the argument is a signed
infinity, and @code{0} if not.
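
For example, in this sketch an overflowing multiplication produces an
Infinity that @code{isinf} detects:

@example
#include <float.h>
#include <math.h>
#include <stdio.h>

int main (void)
@{
  volatile double big = DBL_MAX;
  double inf = big * 2.0;          /* @r{overflow yields +Infinity} */
  printf ("%d\n", isinf (inf));    /* @r{prints 1} */
  printf ("%d\n", inf > DBL_MAX);  /* @r{prints 1} */
  return 0;
@}
@end example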

Infinities can be compared, and all Infinities of the same sign are
equal: there is no notion in IEEE 754 arithmetic of different kinds of
Infinities, as there are in some areas of mathematics. Positive
Infinity is larger than any finite value, and negative Infinity is
smaller than any finite value.

Infinities propagate in addition, subtraction, multiplication,
and square root, but in division, they disappear, because of the
rule that @code{finite / Infinity} is @code{0.0}. Thus, an
overflow in an intermediate computation that produces an Infinity
is likely to be noticed later in the final results. The programmer can
then decide whether the overflow is expected, and acceptable, or whether
the code possibly has a bug, or needs to be run in higher
precision, or redesigned to avoid the production of the Infinity.

@c =====================================================================

@node Handling NaN
@section Handling NaN
@cindex NaN in floating-point arithmetic
@cindex not a number
@cindex floating-point NaN

NaNs are not numbers: they represent values from computations that
produce undefined results. They have a distinctive property that
makes them unlike any other floating-point value: they are
@emph{unequal to everything, including themselves}! Thus, you can
write a test for a NaN like this:

@example
if (x != x)
  printf ("x is a NaN\n");
@end example

@noindent
This test works in GNU C, but some compilers might evaluate that test
expression as false without properly checking for the NaN value.
A more portable way to test for NaN is to use the @code{isnan}
function declared in @code{math.h}:

@example
if (isnan (x))
  printf ("x is a NaN\n");
@end example

@noindent
@xref{Floating Point Classes, , , libc, The GNU C Library Reference Manual}.

One important use of NaNs is marking of missing data. For
example, in statistics, such data must be omitted from
computations. Use of any particular finite value for missing
data would eventually collide with real data, whereas such data
could never be a NaN, so it is an ideal marker. Functions that
deal with collections of data that may have holes can be written
to test for, and ignore, NaN values.

It is easy to generate a NaN in computations: evaluating @code{0.0 /
0.0} is the commonest way, but @code{Infinity - Infinity},
@code{Infinity / Infinity}, and @code{sqrt (-1.0)} also work.
Functions that receive out-of-bounds arguments can choose to return a
stored NaN value, such as with the @code{NAN} macro defined in
@code{math.h}, but that does not set the @emph{invalid operand}
exception flag, and that can fool some programs.

@cindex NaNs-always-propagate rule
Like Infinity, NaNs propagate in computations, but they are even
stickier, because they never disappear in division. Thus, once a
NaN appears in a chain of numerical operations, it is almost
certain to pop out into the final results. The programmer
has to decide whether that is expected, or whether there is a
coding or algorithmic error that needs repair.

In general, when a function gets a NaN argument, it usually returns a
NaN. However, there are some exceptions in the math-library functions
that you need to be aware of, because they violate the
@emph{NaNs-always-propagate} rule (see the sketch after this list):

@itemize @bullet

@item
@code{pow (x, 0.0)} always returns @code{1.0}, even if @code{x} is
0.0, Infinity, or a NaN.

@item
@code{pow (1, y)} always returns @code{1}, even if @code{y} is a NaN.

@item
@code{hypot (INFINITY, y)} and @code{hypot (-INFINITY, y)} both
always return @code{INFINITY}, even if @code{y} is a NaN.

@item
If just one of the arguments to @code{fmax (x, y)} or
@code{fmin (x, y)} is a NaN, it returns the other argument. If
both arguments are NaNs, it returns a NaN, but there is no
requirement about where it comes from: it could be @code{x}, or
@code{y}, or some other quiet NaN.
@end itemize
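
This sketch exercises three of those exceptions; each call receives a
NaN, yet none of them returns one:

@example
#include <math.h>
#include <stdio.h>

int main (void)
@{
  double qnan = NAN;
  printf ("%g\n", pow (qnan, 0.0));         /* @r{prints 1} */
  printf ("%g\n", hypot (INFINITY, qnan));  /* @r{prints inf} */
  printf ("%g\n", fmax (qnan, 2.0));        /* @r{prints 2} */
  return 0;
@}
@end example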

NaNs are also used for the return values of math-library
functions where the result is not representable in real
arithmetic, or is mathematically undefined or uncertain, such as
@code{sqrt (-1.0)} and @code{sin (Infinity)}. However, note that a
result that is merely too big to represent should always produce
an Infinity, such as with @code{exp (1000.0)} (too big) and
@code{exp (Infinity)} (truly infinite).

@c =====================================================================

@node Signed Zeros
@section Signed Zeros
@cindex signed zeros in floating-point arithmetic
@cindex floating-point signed zeros

The sign of zero is significant, and important, because it records the
creation of a value that is too small to represent, but came from
either the negative axis, or from the positive axis. Such fine
distinctions are essential for proper handling of @dfn{branch cuts}
in complex arithmetic (@pxref{Complex Arithmetic}).

The key point about signed zeros is that in comparisons, their sign
does not matter: @code{0.0 == -0.0} must @emph{always} evaluate to
@code{1} (true). However, they are not @emph{the same number}, and
@code{-0.0} in C code stands for a negative zero.
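
This sketch shows both halves of the story: the two zeros compare
equal, yet @code{signbit} and division can tell them apart:

@example
#include <math.h>
#include <stdio.h>

int main (void)
@{
  double pz = +0.0, nz = -0.0;
  printf ("%d\n", pz == nz);               /* @r{prints 1: they compare equal} */
  printf ("%d\n", signbit (nz) != 0);      /* @r{prints 1: the signs differ} */
  printf ("%g %g\n", 1.0 / pz, 1.0 / nz);  /* @r{prints inf -inf} */
  return 0;
@}
@end example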

@c =====================================================================

@node Scaling by the Base
@section Scaling by Powers of the Base
@cindex scaling floating point by powers of the base
@cindex floating-point scaling by powers of the base

We have discussed rounding errors several times in this chapter,
but it is important to remember that when results require no more
bits than the exponent and significand bits can represent, those results
are @emph{exact}.

One particularly useful exact operation is scaling by a power of
the base. While one, in principle, could do that with code like
this:

@example
y = x * pow (2.0, (double)k); /* @r{Undesirable scaling: avoid!} */
@end example

@noindent
that is not advisable, because it relies on the quality of the
math-library power function, and that happens to be one of the
most difficult functions in the C math library to make accurate.
What is likely to happen on many systems is that the returned
value from @code{pow} will be close to a power of two, but
slightly different, so the subsequent multiplication introduces
rounding error.

The correct, and fastest, way to do the scaling is either via the
traditional C library function, or with its C99 equivalent:

@example
y = ldexp (x, k);  /* @r{Traditional pre-C99 style.} */
y = scalbn (x, k); /* @r{C99 style.} */
@end example

@noindent
Both functions return @code{x * 2**k}.
@xref{Normalization Functions, , , libc, The GNU C Library Reference Manual}.

@c =====================================================================

@node Rounding Control
@section Rounding Control
@cindex rounding control (floating point)
@cindex floating-point rounding control

Here we describe how to specify the rounding mode at run time. System
header file @file{fenv.h} provides the prototypes for these functions.
@xref{Rounding, , , libc, The GNU C Library Reference Manual}.

@noindent
That header file also provides constant names for the four rounding modes:
@code{FE_DOWNWARD}, @code{FE_TONEAREST}, @code{FE_TOWARDZERO}, and
@code{FE_UPWARD}.

The function @code{fegetround} examines and returns the current
rounding mode. On a platform with IEEE 754 floating point,
the value will always equal one of those four constants.
On other platforms, it may return a negative value. The function
@code{fesetround} sets the current rounding mode.

Changing the rounding mode can be slow, so it is useful to minimize
the number of changes. For interval arithmetic, we seem to need three
changes for each operation, but we really only need two, because we
can write code like this example for interval addition of two reals:

@example
@{
  struct interval_double
  @{
    double hi, lo;
  @} v;
  extern volatile double x, y;
  int rule;

  rule = fegetround ();

  if (fesetround (FE_UPWARD) == 0)
    @{
      v.hi = x + y;
      v.lo = -(-x - y);
    @}
  else
    fatal ("ERROR: failed to change rounding rule");

  if (fesetround (rule) != 0)
    fatal ("ERROR: failed to restore rounding rule");
@}
@end example

@noindent
The @code{volatile} qualifier (@pxref{volatile}) is essential on x86
platforms to prevent an optimizing compiler from producing the same
value for both bounds.

@ignore
We no longer discuss the double rounding issue.
The code also needs to be compiled with the
option @option{-ffloat-store} that prevents use of higher precision
for the basic operations, because that would introduce double rounding
that could spoil the bounds guarantee of interval arithmetic.
@end ignore

@c =====================================================================

@node Machine Epsilon
@section Machine Epsilon
@cindex machine epsilon (floating point)
@cindex floating-point machine epsilon

In any floating-point system, three attributes are particularly
important to know: @dfn{base} (the number that the exponent specifies
a power of), @dfn{precision} (number of digits in the significand),
and @dfn{range} (difference between most positive and most negative
values). The allocation of bits between exponent and significand
decides the answers to those questions.

A measure of the precision is the answer to the question: what is
the smallest number that can be added to @code{1.0} such that the
sum differs from @code{1.0}? That number is called the
@dfn{machine epsilon}.

We could define the needed machine-epsilon constants for @code{float},
@code{double}, and @code{long double} like this:

@example
static const float       epsf = 0x1p-23; /* @r{about 1.192e-07} */
static const double      eps  = 0x1p-52; /* @r{about 2.220e-16} */
static const long double epsl = 0x1p-63; /* @r{about 1.084e-19} */
@end example

@noindent
Instead of the hexadecimal constants, we could also have used the
Standard C macros, @code{FLT_EPSILON}, @code{DBL_EPSILON}, and
@code{LDBL_EPSILON}.

It is useful to be able to compute the machine epsilons at
run time, and we can easily generalize the operation by replacing
the constant @code{1.0} with a user-supplied value:

@example
double
macheps (double x)
@{ /* @r{Return machine epsilon for @var{x},} */
  /* @r{such that @var{x} + macheps (@var{x}) > @var{x}.} */
  static const double base = 2.0;
  double eps;

  if (isnan (x))
    eps = x;
  else
    @{
      eps = (x == 0.0) ? 1.0 : x;

      while ((x + eps / base) != x)
        eps /= base;            /* @r{Always exact!} */
    @}

  return (eps);
@}
@end example

@noindent
If we call that function with arguments from @code{0} to
@code{10}, as well as Infinity and NaN, and print the returned
values in hexadecimal, we get output like this:

@example
macheps (  0) = 0x1.0000000000000p-1074
macheps (  1) = 0x1.0000000000000p-52
macheps (  2) = 0x1.0000000000000p-51
macheps (  3) = 0x1.8000000000000p-52
macheps (  4) = 0x1.0000000000000p-50
macheps (  5) = 0x1.4000000000000p-51
macheps (  6) = 0x1.8000000000000p-51
macheps (  7) = 0x1.c000000000000p-51
macheps (  8) = 0x1.0000000000000p-49
macheps (  9) = 0x1.2000000000000p-50
macheps ( 10) = 0x1.4000000000000p-50
macheps (Inf) = infinity
macheps (NaN) = nan
@end example

@noindent
Notice that @code{macheps} has a special test for a NaN to prevent an
infinite loop.

@ignore
We no longer discuss double rounding.
To ensure that no expressions are evaluated with an intermediate higher
precision, we can compile with the @option{-fexcess-precision=standard}
option, which tells the compiler that all calculation results, including
intermediate results, are to be put on the stack, forcing rounding.
@end ignore

Our code made another test for a zero argument to avoid getting a
zero return. The returned value in that case is the smallest
representable floating-point number, here the subnormal value
@code{2**(-1074)}, which is about @code{4.941e-324}.

No special test is needed for an Infinity, because the
@code{eps}-reduction loop then terminates at the first iteration.

Our @code{macheps} function here assumes binary floating point; some
architectures may differ.

The C library includes some related functions that can also be used to
determine machine epsilons at run time:

@example
#include <math.h>  /* @r{Include for these prototypes.} */

double      nextafter  (double x, double y);
float       nextafterf (float x, float y);
long double nextafterl (long double x, long double y);
@end example

@noindent
These return the machine number nearest @var{x} in the direction of
@var{y}. For example, @code{nextafter (1.0, 2.0)} produces the same
result as @code{1.0 + macheps (1.0)} and @code{1.0 + DBL_EPSILON}.
@xref{FP Bit Twiddling, , , libc, The GNU C Library Reference Manual}.

It is important to know that the machine epsilon is not symmetric
about all numbers. At the boundaries where normalization changes the
exponent, the epsilon below @var{x} is smaller than that just above
@var{x} by a factor @code{1 / base}. For example, @code{macheps
(1.0)} returns @code{+0x1p-52}, whereas @code{macheps (-1.0)} returns
@code{+0x1p-53}. Some authors distinguish those cases by calling them
the @emph{positive} and @emph{negative}, or @emph{big} and
@emph{small}, machine epsilons. You can produce their values like
this:

@example
eps_neg = 1.0 - nextafter (1.0, -1.0);
eps_pos = nextafter (1.0, +2.0) - 1.0;
@end example

If @var{x} is a variable, such that you do not know its value at
compile time, then you can substitute the @var{y} values with
@code{-INFINITY} or @code{+INFINITY}, like this:

@example
eps_neg = x - nextafter (x, -INFINITY);
eps_pos = nextafter (x, +INFINITY) - x;
@end example

@noindent
In such cases, if @var{x} is Infinity, then @emph{the @code{nextafter}
functions return @var{y} if @var{x} equals @var{y}}. Our two
assignments then produce @code{+0x1.fffffffffffffp+1023} (that is a
hexadecimal floating point constant and its value is around
1.798e+308; see @ref{Floating Constants}) for @var{eps_neg}, and
Infinity for @var{eps_pos}. Thus, the call @code{nextafter (INFINITY,
-INFINITY)} can be used to find the largest representable finite
number, and with the call @code{nextafter (0.0, 1.0)}, the smallest
representable number (here, @code{0x1p-1074} (about 4.941e-324), a
number that we saw before as the output from @code{macheps (0.0)}).

@c =====================================================================

@node Complex Arithmetic
@section Complex Arithmetic
@cindex complex arithmetic in floating-point calculations
@cindex floating-point arithmetic with complex numbers

We've already looked at defining and referring to complex numbers
(@pxref{Complex Data Types}). What is important to discuss here are
some issues that are unlikely to be obvious to programmers without
extensive experience in both numerical computing, and in complex
arithmetic in mathematics.

The first important point is that, unlike real arithmetic, in complex
arithmetic, the danger of significance loss is @emph{pervasive}, and
affects @emph{every one} of the basic operations, and @emph{almost
all} of the math-library functions. To understand why, recall the
rules for complex multiplication and division:

@example
a = u + I*v   /* @r{First operand.} */
b = x + I*y   /* @r{Second operand.} */

prod = a * b
     = (u + I*v) * (x + I*y)
     = (u * x - v * y) + I*(v * x + u * y)

quo  = a / b
     = (u + I*v) / (x + I*y)
     = [(u + I*v) * (x - I*y)] / [(x + I*y) * (x - I*y)]
     = [(u * x + v * y) + I*(v * x - u * y)] / (x**2 + y**2)
@end example

@noindent
There are four critical observations about those formulas:

@itemize @bullet

@item
the multiplications on the right-hand side introduce the
possibility of premature underflow or overflow;

@item
the products must be accurate to twice working precision;

@item
there is @emph{always} one subtraction on the right-hand sides
that is subject to catastrophic significance loss; and

@item
complex multiplication has up to @emph{six} rounding errors, and
complex division has @emph{ten} rounding errors.

@end itemize

@cindex branch cuts
Another point that needs careful study is the fact that many functions
in complex arithmetic have @dfn{branch cuts}. You can view a
function with a complex argument, @code{f (z)}, as @code{f (x + I*y)},
and thus, it defines a relation between a point @code{(x, y)} on the
complex plane with an elevation value on a surface. A branch cut
looks like a tear in that surface, so approaching the cut from one
side produces a particular value, and from the other side, a quite
different value. Great care is needed to handle branch cuts properly,
and even small numerical errors can push a result from one side to the
other, radically changing the returned value. As we reported earlier,
correct handling of the sign of zero is critically important for
computing near branch cuts.
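
As a sketch of why the zero's sign matters near a cut: @code{csqrt}
has its branch cut along the negative real axis, and the sign of a
zero imaginary part selects the side (this assumes the C11
@code{CMPLX} macro from @file{complex.h}):

@example
#include <complex.h>
#include <stdio.h>

int main (void)
@{
  double complex a = csqrt (CMPLX (-4.0, +0.0));  /* @r{just above the cut} */
  double complex b = csqrt (CMPLX (-4.0, -0.0));  /* @r{just below the cut} */
  printf ("%g%+gi\n", creal (a), cimag (a));      /* @r{prints 0+2i} */
  printf ("%g%+gi\n", creal (b), cimag (b));      /* @r{prints 0-2i} */
  return 0;
@}
@end example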
|
|
|
The best advice that we can give to programmers who need complex |
|
arithmetic is to always use the @emph{highest precision available}, |
|
and then to carefully check the results of test calculations to gauge |
|
the likely accuracy of the computed results. It is easy to supply |
|
test values of real and imaginary parts where all five basic |
|
operations in complex arithmetic, and almost all of the complex math |
|
functions, lose @emph{all} significance, and fail to produce even a |
|
single correct digit. |
|
|
|
Even though complex arithmetic makes some programming tasks |
|
easier, it may be numerically preferable to rework the algorithm |
|
so that it can be carried out in real arithmetic. That is |
|
commonly possible in matrix algebra. |
|
|
|
GNU C can perform code optimization on complex number multiplication and |
|
division when certain boundary checks are known to be unnecessary. The
|
command-line option @option{-fcx-limited-range} tells the compiler that |
|
a range reduction step is not needed when performing complex division, |
|
and that there is no need to check if a complex multiplication or |
|
division results in the value @code{NaN + I*NaN}. By default these
|
checks are enabled. You can explicitly enable them with the |
|
@option{-fno-cx-limited-range} option. |
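
The effect of those checks is easy to observe with a quotient whose
intermediate products overflow. In this sketch (a test program of our
own), the quotient is mathematically exactly 1; compiled with the
default checks it should print 1, but with
@option{-fcx-limited-range} the naive divisor @code{x**2 + y**2}
overflows and the result is typically a NaN:

@example
#include <complex.h>
#include <float.h>
#include <stdio.h>

int
main (void)
@{
  double complex z = CMPLX (DBL_MAX / 2, DBL_MAX / 2);
  double complex q = z / z;  /* @r{Mathematically exactly 1.} */

  printf ("z/z = %g + %g*I\n", creal (q), cimag (q));
  return 0;
@}
@end example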
|
|
|
@ignore |
|
@c ===================================================================== |
|
|
|
@node More on Decimal Floating-Point Arithmetic |
|
@section More on Decimal Floating-Point Arithmetic |
|
@cindex decimal floating-point arithmetic |
|
@cindex floating-point arithmetic, decimal |
|
|
|
Proposed extensions to the C programming language call for the |
|
inclusion of decimal floating-point arithmetic, which handles |
|
floating-point numbers with a specified radix of 10, instead of the |
|
unspecified traditional radix of 2. |
|
|
|
The proposed new types are @code{_Decimal32}, @code{_Decimal64}, and |
|
@code{_Decimal128}, with corresponding literal constant suffixes of |
|
@code{df}, @code{dd}, and @code{dl}, respectively. For example, a |
|
32-bit decimal floating-point variable could be defined as: |
|
|
|
@example |
|
_Decimal32 foo = 42.123df; |
|
@end example |
|
|
|
We stated earlier (@pxref{decimal-significand}) that the significand |
|
in decimal floating-point arithmetic is an integer, rather than |
|
fractional, value. Decimal instructions do not normally alter the |
|
exponent by normalizing nonzero significands to remove trailing zeros. |
|
That design feature is intentional: it allows emulation of the |
|
fixed-point arithmetic that has commonly been used for financial |
|
computations. |
|
|
|
One consequence of the lack of normalization is that there are |
|
multiple representations of any number that does not use all of the |
|
significand digits. Thus, in the 32-bit format, the values |
|
@code{1.DF}, @code{1.0DF}, @code{1.00DF}, @dots{}, @code{1.000_000DF}, |
|
all have different bit patterns in storage, even though they compare |
|
equal. Thus, programmers need to be careful about trailing zero |
|
digits, because they appear in the results, and affect scaling. For |
|
example, @code{1.0DF * 1.0DF} evaluates to @code{1.00DF}, which is |
|
stored as @code{100 * 10**(-2)}. |
|
|
|
In general, when you look at a decimal expression with fractional |
|
digits, you should mentally rewrite it in integer form with suitable |
|
powers of ten. Thus, a multiplication like @code{1.23 * 4.56} really |
|
means @code{123 * 10**(-2) * 456 * 10**(-2)}, which evaluates to |
|
@code{56088 * 10**(-4)}, and would be output as @code{5.6088}. |
|
|
|
Another consequence of the decimal significand choice is that |
|
initializing decimal floating-point values to a pattern of |
|
all-bits-zero does not produce the expected value @code{0.}: instead, |
|
the result is the subnormal values @code{0.e-101}, @code{0.e-398}, and |
|
@code{0.e-6176} in the three storage formats. |
|
|
|
GNU C currently supports basic storage and manipulation of decimal |
|
floating-point values on some platforms, and support is expected to |
|
grow in future releases. |
|
|
|
@c ??? Suggest chopping the rest of this section, at least for the time |
|
@c ??? being. Decimal floating-point support in GNU C is not yet complete, |
|
@c ??? and functionality discussed appears to not be available on all |
|
@c ??? platforms, and is not obviously documented for end users of GNU C. --TJR |
|
|
|
The exponent in decimal arithmetic is called the @emph{quantum}, and |
|
financial computations require that the quantum always be preserved. |
|
If it is not, then rounding may have happened, and additional scaling |
|
is required. |
|
|
|
The function @code{samequantumd (x,y)} for 64-bit decimal arithmetic |
|
returns @code{1} if the arguments have the same exponent, and @code{0} |
|
otherwise. |
|
|
|
The function @code{quantized (x,y)} returns a value of @var{x} that has |
|
been adjusted to have the same quantum as @var{y}; that adjustment |
|
could require rounding of the significand. For example, |
|
@code{quantized (5.dd, 1.00dd)} returns the value @code{5.00dd}, which |
|
is stored as @code{500 * 10**(-2)}. As another example, a sales-tax |
|
computation might be carried out like this: |
|
|
|
@example |
|
decimal_long_double amount, rate, total; |
|
|
|
amount = 0.70DL; |
|
rate = 1.05DL; |
|
total = quantizedl (amount * rate, 1.00DL); |
|
@end example |
|
|
|
@noindent |
|
Without the call to @code{quantizedl}, the result would have been |
|
@code{0.7350}, instead of the correctly rounded @code{0.74}. That |
|
particular example was chosen because it illustrates yet another |
|
difference between decimal and binary arithmetic: in the latter, the |
|
factors each require an infinite number of bits, and their product, |
|
when converted to decimal, looks like @code{0.734_999_999@dots{}}. |
|
Thus, rounding the product in binary format to two decimal places |
|
always gets @code{0.73}, which is the @emph{wrong} answer for tax |
|
laws. |
|
|
|
In financial computations in decimal floating-point arithmetic, the |
|
@code{quantized} function family is expected to receive wide use |
|
whenever multiplication or division change the desired quantum of the |
|
result. |
|
|
|
The function call @code{normalized (x)} returns a value that is |
|
numerically equal to @var{x}, but with trailing zeros trimmed. |
|
Here are some examples of its operation: |
|
|
|
@multitable @columnfractions .5 .5 |
|
@headitem Function Call @tab Result |
|
@item normalized (+0.00100DD) @tab +0.001DD |
|
@item normalized (+1.00DD) @tab +1.DD |
|
@item normalized (+1.E2DD) @tab +1E+2DD |
|
@item normalized (+100.DD) @tab +1E+2DD |
|
@item normalized (+100.00DD) @tab +1E+2DD |
|
@item normalized (+NaN(0x1234)) @tab +NaN(0x1234) |
|
@item normalized (-NaN(0x1234)) @tab -NaN(0x1234) |
|
@item normalized (+Infinity) @tab +Infinity |
|
@item normalized (-Infinity) @tab -Infinity |
|
@end multitable |
|
|
|
@noindent |
|
The NaN examples show that payloads are preserved. |
|
|
|
Because the @code{printf} and @code{scanf} families were designed long |
|
before IEEE 754 decimal arithmetic, their format items do not support |
|
distinguishing between numbers with identical values, but different |
|
quanta, and yet, that distinction is likely to be needed in output. |
|
|
|
The solution adopted by one early library for decimal arithmetic is to |
|
provide a family of number-to-string conversion functions that |
|
preserve quantization. Here is a code fragment and its output that |
|
shows how they work. |
|
|
|
@example |
|
decimal_float x; |
|
|
|
x = 123.000DF; |
|
printf ("%%He: x = %He\n", x); |
|
printf ("%%Hf: x = %Hf\n", x); |
|
printf ("%%Hg: x = %Hg\n", x); |
|
printf ("ntosdf (x) = %s\n", ntosdf (x)); |
|
|
|
%He: x = 1.230000e+02 |
|
%Hf: x = 123.000000 |
|
%Hg: x = 123 |
|
ntosdf (x) = +123.000 |
|
@end example |
|
|
|
@noindent |
|
The format modifier letter @code{H} indicates a 32-bit decimal value, |
|
and the modifiers @code{DD} and @code{DL} correspond to the two other |
|
formats. |
|
|
|
@c ===================================================================== |
|
@end ignore |
|
|
|
@node Round-Trip Base Conversion |
|
@section Round-Trip Base Conversion |
|
@cindex round-trip base conversion |
|
@cindex base conversion (floating point) |
|
@cindex floating-point round-trip base conversion |
|
|
|
Most numeric programs involve converting between base-2 floating-point |
|
numbers, as represented by the computer, and base-10 floating-point |
|
numbers, as entered and handled by the programmer. What might not be |
|
obvious is the number of base-2 bits vs. base-10 digits required for |
|
each representation. Consider the following tables, which show the
number of decimal digits needed to recover exactly a value stored in a
given number of bits, and vice versa:
|
|
|
@multitable @columnfractions .5 .1 .1 .1 .1 .1 |
|
@item binary in @tab 24 @tab 53 @tab 64 @tab 113 @tab 237 |
|
@item decimal out @tab 9 @tab 17 @tab 21 @tab 36 @tab 73 |
|
@end multitable |
|
|
|
@multitable @columnfractions .5 .1 .1 .1 .1 |
|
@item decimal in @tab 7 @tab 16 @tab 34 @tab 70 |
|
@item binary out @tab 25 @tab 55 @tab 114 @tab 234 |
|
@end multitable |
|
|
|
We can compute the table numbers with these two functions: |
|
|
|
@example |
|
#include <math.h>  /* @r{For ceil, log2, and log10.} */

int
matula (int nbits)
@{ /* @r{Return output decimal digits needed for nbits-bits input.} */
  return (int) ceil ((double) nbits / log2 (10.0) + 1.0);
@}

int
goldberg (int ndec)
@{ /* @r{Return output bits needed for ndec-digits input.} */
  return (int) ceil ((double) ndec / log10 (2.0) + 1.0);
@}
|
@end example |
|
|
|
One significant observation from those numbers is that we cannot |
|
achieve correct round-trip conversion between the decimal and |
|
binary formats in the same storage size! For example, we need 25 |
|
bits to represent a 7-digit value from the 32-bit decimal format, |
|
but the binary format only has 24 available. Similar |
|
observations hold for each of the other conversion pairs. |
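
You can verify the first table experimentally. In this sketch (a test
of our own), a 24-bit @code{float} value is printed with the 9
significant digits that the table calls for, then read back; the
round trip recovers the original bits exactly:

@example
#include <stdio.h>
#include <stdlib.h>

int
main (void)
@{
  float x = 0.1f;  /* @r{Not exactly representable in binary.} */
  char buf[32];

  /* @r{9 significant digits suffice for a 24-bit significand.} */
  snprintf (buf, sizeof buf, "%.9g", (double) x);
  float y = strtof (buf, NULL);
  printf ("%s reads back %s\n", buf,
          x == y ? "identically" : "differently");
  return 0;
@}
@end example

@noindent
Printing with only @code{FLT_DIG} (6) significant digits would not,
in general, survive the round trip.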
|
|
|
The general input/output base-conversion problem is astonishingly |
|
complicated, and solutions were not generally known until the |
|
publication of two papers in 1990 that are listed later near the end |
|
of this chapter. For the 128-bit formats, the worst case needs more |
|
than 11,500 decimal digits of precision to guarantee correct rounding |
|
in a binary-to-decimal conversion! |
|
|
|
For further details see the references for Bennett Goldberg and David |
|
Matula. |
|
|
|
@c ===================================================================== |
|
|
|
@node Further Reading |
|
@section Further Reading |
|
|
|
The subject of floating-point arithmetic is much more complex |
|
than many programmers seem to think, and few books on programming |
|
languages spend much time in that area. In this chapter, we have |
|
tried to expose the reader to some of the key ideas, and to warn |
|
of easily overlooked pitfalls that can quickly lead to nonsensical
|
results. There are a few good references that we recommend |
|
for further reading, and for finding other important material |
|
about computer arithmetic: |
|
|
|
We include URLs for these references when we were given them and they
are morally legitimate to recommend; we have omitted the URLs that
are paywalled or that require running nonfree JavaScript code in
|
order to function. We hope you can find morally legitimate sites |
|
where you can access these works. |
|
|
|
@c ===================================================================== |
|
@c Each bibliography item has a sort key, so the bibliography can be |
|
@c sorted in emacs with M-x sort-paragraphs on the region with the items. |
|
@c ===================================================================== |
|
|
|
@itemize @bullet |
|
|
|
@c Paywalled. |
|
@item @c sort-key: Abbott |
|
Paul H. Abbott and 15 others, @cite{Architecture and software support |
|
in IBM S/390 Parallel Enterprise Servers for IEEE Floating-Point |
|
arithmetic}, IBM Journal of Research and Development @b{43}(5/6) |
|
723--760 (1999). This article gives a good description of IBM's
|
algorithm for exact decimal-to-binary conversion, complementing |
|
earlier ones by Clinger and others. |
|
|
|
@c Paywalled. |
|
@item @c sort-key: Beebe |
|
Nelson H. F. Beebe, @cite{The Mathematical-Function Computation Handbook: |
|
Programming Using the MathCW Portable Software Library}, |
|
Springer (2017). |
|
This book describes portable implementations of a large superset |
|
of the mathematical functions available in many programming |
|
languages, extended to a future 256-bit format (70 decimal |
|
digits), for both binary and decimal floating point. It includes |
|
a substantial portion of the functions described in the famous |
|
@cite{NIST Handbook of Mathematical Functions}, Cambridge (2018), |
|
ISBN 0-521-19225-0. |
|
See |
|
@uref{https://www.math.utah.edu/pub/mathcw/} |
|
for compilers and libraries. |
|
|
|
@c URL ok |
|
@item @c sort-key: Clinger-1990 |
|
William D. Clinger, @cite{How to Read Floating Point Numbers |
|
Accurately}, ACM SIGPLAN Notices @b{25}(6) 92--101 (June 1990), |
|
@uref{https://doi.org/10.1145/93548.93557}. |
|
See also the papers by Steele & White. |
|
|
|
@c Paywalled. |
|
@c @item @c sort-key: Clinger-2004 |
|
@c William D. Clinger, @cite{Retrospective: How to read floating |
|
@c point numbers accurately}, ACM SIGPLAN Notices @b{39}(4) 360--371 (April 2004), |
|
@c Reprint of 1990 paper, with additional commentary. |
|
|
|
@c URL ok |
|
@item @c sort-key: Goldberg-1967 |
|
I. Bennett Goldberg, @cite{27 Bits Are Not Enough For 8-Digit Accuracy}, |
|
Communications of the ACM @b{10}(2) 105--106 (February 1967), |
|
@uref{https://doi.acm.org/10.1145/363067.363112}. This paper, |
|
and its companions by David Matula, address the base-conversion |
|
problem, and show that the naive formulas are wrong by one or |
|
two digits. |
|
|
|
@c URL ok |
|
@item @c sort-key: Goldberg-1991 |
|
David Goldberg, @cite{What Every Computer Scientist Should Know |
|
About Floating-Point Arithmetic}, ACM Computing Surveys @b{23}(1) |
|
5--58 (March 1991), corrigendum @b{23}(3) 413 (September 1991), |
|
@uref{https://doi.org/10.1145/103162.103163}. |
|
This paper has been widely distributed, and reissued in vendor |
|
programming-language documentation. It is well worth reading, |
|
and then rereading from time to time. |
|
|
|
@c ??? site does not respond, 20 Sep 2022 |
|
@item @c sort-key: Juffa |
|
Norbert Juffa and Nelson H. F. Beebe, @cite{A Bibliography of |
|
Publications on Floating-Point Arithmetic}, |
|
@uref{https://www.math.utah.edu/pub/tex/bib/fparith.bib}. |
|
This is the largest known bibliography of publications about |
|
floating-point, and also integer, arithmetic. It is actively |
|
maintained and, as of mid-2019, contains more than 6400 references to
|
original research papers, reports, theses, books, and Web sites on the |
|
subject matter. It can be used to locate the latest research in the |
|
field, and the historical coverage dates back to a 1726 paper on |
|
signed-digit arithmetic, and an 1837 paper by Charles Babbage, the |
|
intellectual father of mechanical computers. The entries for the |
|
Abbott, Clinger, and Steele & White papers cited earlier contain |
|
pointers to several other important related papers on the |
|
base-conversion problem. |
|
|
|
@c URL ok |
|
@item @c sort-key: Kahan |
|
William Kahan, @cite{Branch Cuts for Complex Elementary Functions, or |
|
Much Ado About Nothing's Sign Bit}, (1987), |
|
@uref{https://people.freebsd.org/~das/kahan86branch.pdf}. |
|
This Web document about the fine points of complex arithmetic |
|
also appears in the volume edited by A. Iserles and |
|
M. J. D. Powell, @cite{The State of the Art in Numerical |
|
Analysis: Proceedings of the Joint IMA/SIAM Conference on the |
|
State of the Art in Numerical Analysis held at the University of |
|
Birmingham, 14--18 April 1986}, Oxford University Press (1987), |
|
ISBN 0-19-853614-3 (xiv + 719 pages). Its author is the famous |
|
chief architect of the IEEE 754 arithmetic system, and one of the |
|
world's greatest experts in the field of floating-point |
|
arithmetic. An entire generation of his students at the |
|
University of California, Berkeley, has gone on to careers in
academia and industry, spreading the knowledge of how to do
|
floating-point arithmetic right. |
|
|
|
@c paywalled |
|
@item @c sort-key: Knuth |
|
Donald E. Knuth, @cite{A Simple Program Whose Proof Isn't}, |
|
in @cite{Beauty is our business: a birthday salute to Edsger |
|
W. Dijkstra}, W. H. J. Feijen, A. J. M. van Gasteren, |
|
D. Gries, and J. Misra (eds.), Springer (1990), ISBN |
|
1-4612-8792-8. This book
|
chapter supplies a correctness proof of the decimal to |
|
binary, and binary to decimal, conversions in fixed-point |
|
arithmetic in the @TeX{} typesetting system. The proof evaded
|
its author for a dozen years. |
|
|
|
@c URL ok |
|
@item @c sort-key: Matula-1968a |
|
David W. Matula, @cite{In-and-out conversions}, |
|
Communications of the ACM @b{11}(1) 47--50 (January 1968),
|
@uref{https://doi.org/10.1145/362851.362887}. |
|
|
|
@item @c sort-key: Matula-1968b |
|
David W. Matula, @cite{The Base Conversion Theorem}, |
|
Proceedings of the American Mathematical Society @b{19}(3) |
|
716--723 (June 1968). See also other papers here by this author, |
|
and by I. Bennett Goldberg. |
|
|
|
@c page is blank without running JS. |
|
@item @c sort-key: Matula-1970 |
|
David W. Matula, @cite{A Formalization of Floating-Point Numeric |
|
Base Conversion}, IEEE Transactions on Computers @b{C-19}(8) |
|
681--692 (August 1970).
|
|
|
@c page is blank without running JS. |
|
@item @c sort-key: Muller-2010 |
|
Jean-Michel Muller and eight others, @cite{Handbook of |
|
Floating-Point Arithmetic}, Birkh@"auser-Boston (2010), ISBN |
|
0-8176-4704-X (xxiii + 572 pages). This is a |
|
comprehensive treatise from a French team who are among the |
|
world's greatest experts in floating-point arithmetic, and among |
|
the most prolific writers of research papers in that field. They |
|
have much to teach, and their book deserves a place on the |
|
shelves of every serious numerical programmer. |
|
|
|
@c paywalled |
|
@item @c sort-key: Muller-2018 |
|
Jean-Michel Muller and eight others, @cite{Handbook of |
|
Floating-Point Arithmetic}, Second edition, Birkh@"auser-Boston (2018), ISBN |
|
3-319-76525-6 (xxv + 627 pages). This is a new |
|
edition of the preceding entry. |
|
|
|
@c URL given is obsolete |
|
@item @c sort-key: Overton |
|
Michael Overton, @cite{Numerical Computing with IEEE Floating |
|
Point Arithmetic, Including One Theorem, One Rule of Thumb, and |
|
One Hundred and One Exercises}, SIAM (2001), ISBN 0-89871-482-6 |
|
(xiv + 104 pages).
|
This is a small volume that can be covered in a few hours. |
|
|
|
@c URL ok |
|
@item @c sort-key: Steele-1990 |
|
Guy L. Steele Jr. and Jon L. White, @cite{How to Print |
|
Floating-Point Numbers Accurately}, ACM SIGPLAN Notices |
|
@b{25}(6) 112--126 (June 1990), |
|
@uref{https://doi.org/10.1145/93548.93559}. |
|
See also the papers by Clinger. |
|
|
|
@c paywalled |
|
@item @c sort-key: Steele-2004 |
|
Guy L. Steele Jr. and Jon L. White, @cite{Retrospective: How to |
|
Print Floating-Point Numbers Accurately}, ACM SIGPLAN Notices |
|
@b{39}(4) 372--389 (April 2004). Reprint of the 1990
|
paper, with additional commentary. |
|
|
|
@item @c sort-key: Sterbenz |
|
Pat H. Sterbenz, @cite{Floating Point Computation}, Prentice-Hall |
|
(1974), ISBN 0-13-322495-3 (xiv + 316 pages). This often-cited book |
|
provides solid coverage of what floating-point arithmetic was like |
|
@emph{before} the introduction of IEEE 754 arithmetic. |
|
|
|
@end itemize
|
|
|