Mathematical Library

User documentation

The mathematical library follows Annex F of C99 and assumes the floating-point semantics specified there.

Floating-point formats are assumed to be IEEE-754 binary32 for float and IEEE-754 binary64 for double.

Supported long double formats: IEEE-754 binary64 (ld64) and the x86 80 bit extended precision format (ld80) are fully supported; there is partial support for IEEE-754 binary128 (ld128).

Supported arithmetic evaluation methods (provided by the underlying platform and compiler) are FLT_EVAL_METHOD 0, 1 and 2 (when FLT_EVAL_METHOD!=0 the excess precision and range are handled according to the ISO C semantics: rounded away at assignments and casts).

The four rounding modes (FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO) are supported on all platforms with fenv support.

The five status flags (FE_INEXACT, FE_INVALID, FE_DIVBYZERO, FE_UNDERFLOW, FE_OVERFLOW) are supported on all platforms with fenv support.

No additional classifications are provided beyond the five standard classes (FP_NAN, FP_INFINITE, FP_ZERO, FP_SUBNORMAL, FP_NORMAL). In the case of ld80 the invalid representations are treated according to the x87 FPU: pseudo-infinities, pseudo-nans and pseudo-normals are classified as FP_NAN, and pseudo-subnormals are classified as FP_NORMAL.

Errors are reported by raising floating-point status flags, i.e. math_errhandling is MATH_ERREXCEPT on all platforms (errno is not set by math functions, and on platforms without IEEE 754 exception semantics there is no error reporting).
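
A minimal sketch of flag-based error detection under this model (the printed message is only illustrative):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);
    double y = log(-1.0); /* domain error: raises the invalid flag, returns nan */
    if (fetestexcept(FE_INVALID))
        printf("log(-1) raised FE_INVALID, returned %f\n", y);
    return 0;
}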

There are no other predefined fenv settings than FE_DFL_ENV.

If the floating-point environment is changed in ways other than through the interfaces provided in fenv.h, the behaviour of the library functions is undefined (e.g. changing the precision setting of the x87 FPU or unmasking floating-point exceptions is not supported).

Status flags are raised correctly in the default fenv, except the inexact flag (inexact is only raised reliably where ISO C explicitly requires it; the underflow flag is handled correctly in all cases, which is a stronger guarantee than the requirement in Annex F).

At least the following functions give correctly rounded results (in all rounding modes, the float, double and long double versions as well): ceil, copysign, fabs, fdim, floor, fma, fmax, fmin, fmod, frexp, ldexp, logb, modf, nearbyint, nextafter, nexttoward, rint, remainder, remquo, round, scalbln, scalbn, sqrt, trunc.

Decimal conversion functions in stdlib.h and stdio.h are correctly rounded as well.

Other functions should have small ulp error in nearest rounding mode and the sign of zero should be correct (the goal is < 1.5 ulp error, i.e. the last bit may be wrong, which is met for most commonly used functions; notable exceptions are several complex functions, the bessel functions, the lgamma and tgamma functions and the arc hyperbolic functions).

Signaling nan is not supported (library functions handle it as a quiet nan, so they may not raise the invalid flag properly).

Quiet nans are treated equally (there is only one logical nan value; it is printed as "nan" or "-nan", and the library otherwise does not care about the sign and payload of a nan).

Infinities are printed as "inf" and "-inf".

Development documentation

Overview

The source code of the math functions is in three directories: src/math/, src/complex/ and src/fenv/ (implementing the math.h, complex.h and fenv.h APIs respectively).

The relevant public header files are include/float.h, include/math.h, include/complex.h, include/fenv.h and include/tgmath.h (they rely on arch specific headers: arch/$ARCH/bits/float.h and arch/$ARCH/bits/fenv.h)

The implementation uses the src/internal/libm.h internal header.

Most other functions in the standard library don't use floating-point arithmetic; the notable exceptions are the decimal conversion functions: strtof, strtod, strtold, printf, scanf. (Note that these conversion functions, when they don't use global stdio buffers, are async-signal-safe in musl, except scanf with %m which calls malloc. No other async-signal-safe code in musl uses floating-point arithmetic. This matters because in C11 the state of the fenv in a signal handler is unspecified: one has to explicitly save the fenv, set it to default and restore it in a signal handler if float code is used.)
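
A sketch of that fenv discipline for a signal handler that uses floating-point code (handler_body is a hypothetical helper; a real handler must also be async-signal-safe in every other respect):

#include <fenv.h>

void handler_body(void)
{
    fenv_t e;
    fegetenv(&e);         /* save the unspecified fenv */
    fesetenv(FE_DFL_ENV); /* work in the default fenv */
    /* ... floating-point code ... */
    fesetenv(&e);         /* restore the saved fenv */
}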

A large part of the mathematical library is based on the freebsd libm, which is in turn based on fdlibm. No effort was made to keep the integrity of the original code; most of it needed heavy cleanups and rewrites. The numerical algorithms and approximation methods are not changed in most cases, though (designing new algorithms requires additional tools and is more effort than conformance cleanups).

Note that floating-point semantics are non-trivial and often misinterpreted or misimplemented; some examples:

  • gcc and clang do not handle FENV_ACCESS (explained below)
  • gcc and clang do not compile relation operators correctly (quiet vs signaling comparisons, inconsistent behaviour on different target archs, broken -mieee and -mno-ieee flag)
  • even experts misinterpret FLT_EVAL_METHOD==2 as "non-deterministic" excess precision, even though the excess precision semantics are well-defined (the most relevant example is gcc spilling x87 registers at double precision by default instead of extended precision)
  • various libraries/tools changing x87 precision setting to non-conformant value while still using libc calls (without solving the double-rounding issues because of the wider exponent range)
  • gcc incorrectly rounds away excess precision at return in -std=c99 mode (Annex F should be followed)
  • gcc incorrectly handles decimal constants in FLT_EVAL_METHOD!=0
  • valgrind does not handle extended precision properly, only emulates double precision fp, giving incorrect results on x86
  • libgcc soft float only implements fenv with hw support, it should provide _Thread_local fenv state otherwise (C11)
  • gcc does not always optimize union{double f; uint64_t i;} correctly (can generate very expensive spurious float load/store)

Bit manipulation

It is often necessary to manipulate the bit representation of floating-point numbers using integer arithmetic (mostly to avoid raising floating-point exceptions, or to do simple operations that are hard or slow with floating-point arithmetic).

libm.h has historical macros for bit manipulation; the convention for new code is to use explicitly declared unions, and the ldshape union in the case of long double. An example of typical bit manipulation code that accesses the sign and exponent:

#include <stdint.h>

/* return x if it's in (-2,1) otherwise 0, without raising invalid flag on nan */

double foo(double x)
{
    union {double f; uint64_t i;} u = {x};
    int e = u.i>>52 & 0x7ff;
    int s = u.i>>63;
    return e<0x3ff || (s && e==0x3ff) ? x : 0;
}

float foof(float x)
{
    union {float f; uint32_t i;} u = {x};
    int e = u.i>>23 & 0xff;
    int s = u.i>>31;
    return e<0x7f || (s && e==0x7f) ? x : 0;
}

#include <float.h>
#if LDBL_MANT_DIG == 53
long double fool(long double x) { return foo(x); }
#elif LDBL_MANT_DIG == 64 || LDBL_MANT_DIG == 113
#include "libm.h"
long double fool(long double x)
{
    union ldshape u = {x};
    int e = u.i.se & 0x7fff;
    int s = u.i.se>>15;
    return e<0x3fff || (s && e==0x3fff) ? x : 0;
}
#endif

On ld64 platforms all long double functions are mapped to the double precision ones, so only ld80 and ld128 require explicit code (and when possible they are treated together).

Note: since signaling nans are not supported (they would raise the invalid exception for !=), the foo function can be implemented without bit manipulation:

double foo(double x)
{
    if (x!=x) return 0;
    return x > -2 && x < 1 ? x : 0;
}

The union convention is used because other solutions would violate the C99 aliasing rules or could not be optimized easily. For float and double the unions are written out explicitly instead of typedefed somewhere because they are sufficiently short, they make the code self-contained, and they should be portable to all platforms with IEEE fp formats. (There is one known exception: the arm floating-point accelerator provided IEEE arithmetic with a mixed-endian double representation, but arm oabi is not supported by musl.)

The ldshape union is for ld80 and ld128 code and depends on byte order.

The bit operations and integer arithmetic need to be done with care and with awareness of the integer promotion rules and of the implementation-defined and undefined behaviours of signed int arithmetic. The freebsd math code heavily relies on implementation-defined behaviour (signed int representation, unsigned to signed conversion, signed right shift) and even undefined behaviour (signed int overflow, signed left shift). Such integer issues should be fixed; they are considered to be bugs. The convention is to use the uint32_t and uint64_t types for bit manipulation to reduce surprises.
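
A small sketch of this convention (top12 is an illustrative name, not a required interface): keeping the whole representation in a uint64_t means the shifts cannot sign-extend and the arithmetic cannot overflow:

#include <stdint.h>

/* return the top 12 bits of the representation: sign and biased exponent */
static uint32_t top12(double x)
{
    union {double f; uint64_t i;} u = {x};
    return u.i >> 52; /* unsigned shift, no sign extension */
}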

Constants

Floating constants can have float, double or long double type, eg. 0.1f, 0.1 or 0.1L respectively.

The usual arithmetic conversion rules convert to the "largest" type used in an expression, so the 'f' suffix should be used consistently in single precision float expressions (a missing 'f' causes the expression to be evaluated at double precision).

The 'L' suffix has to be used for long double constants that are not exactly representable at double precision to avoid precision loss. For clarity the convention in musl is to use 'L' consistently in long double context except for trivial numbers (0.0, 1.0, 2.0, .5) where it is optional.
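
A sketch that demonstrates the precision loss (the comparison assumes an ld80 or ld128 platform; on ld64 the two constants are equal):

#include <stdio.h>

int main(void)
{
    long double a = 3.141592653589793238462643383;  /* rounded to double first */
    long double b = 3.141592653589793238462643383L; /* rounded to long double */
    printf("%d\n", a == b); /* 0 on ld80 and ld128, 1 on ld64 */
    return 0;
}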

Using integers in a floating-point expression is fine, they are converted to the appropriate float type at translation time (e.g. 2*x is just as good as 2.0f*x, however -0*x is not the same as -0.0*x).

Note that 0.1f and (float)0.1 have different meanings: if FLT_EVAL_METHOD==0 the first is 0.1 rounded to float precision, while the second is 0.1 rounded to double precision and then rounded to float, which can cause double-rounding issues (not in this case, but e.g. for (float)0x1.00000100000001p0) and raises floating-point status flags (inexact in this case) if FENV_ACCESS is ON.
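
A sketch of that double-rounding case, assuming FLT_EVAL_METHOD==0:

#include <stdio.h>

int main(void)
{
    float a = 0x1.00000100000001p0f;       /* rounded once to float: 0x1.000002p+0 */
    float b = (float)0x1.00000100000001p0; /* rounded to double, then to float: 0x1p+0 */
    printf("%a %a\n", a, b);
    return 0;
}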

FLT_EVAL_METHOD!=0 means the evaluation precision and range of constants and expressions can be higher than that of the type, i.e. 0.1f==0.1 may be true (this is not implemented correctly by gcc as of gcc-4.8, and clang only supports targets with FLT_EVAL_METHOD==0).

Note that a larger floating-point type can always represent all values of a smaller one (so conversion to a larger type can never cause precision loss).

Even if FENV_ACCESS is ON, static initializers don't raise flags at runtime (the constants are evaluated using the default fenv, i.e. in round to nearest mode), so 'static const' variables are used for named constants when the inexact flag and a higher precision representation should be avoided; otherwise macros or explicit literals are used.

The standard does not require correct rounding for decimal float constants (the converted floating-point value can be the nearest representable number or either of its neighbors, but the recommended practice is correct rounding in nearest rounding mode). Hexfloats, however, can represent all values of a type exactly and unambiguously, so the convention is to prefer hexfloat representation in the math code internally, especially when it makes the intention clearer (e.g. 0x1p52 instead of 4503599627370496.0). (In public header files they must not be used as they are not C++ or C89 compatible.)

There is another i386 specific gcc issue related to constants: gcc can recognize special constants that are built into the x87 FPU (e.g. 3.141592653589793238462643383L) and load them from the FPU instead of memory, however the FPU value depends on the current rounding mode, so in this rare case static const values may not be correctly rounded to the nearest number in non-nearest rounding modes. Musl supports non-nearest rounding modes, but library results may be imprecise for functions that are not listed as correctly rounded.

Excess precision

In C, fp arithmetic and constants may be evaluated at a format that has greater precision and range than the type of the operands, as specified by FLT_EVAL_METHOD. The mathematical library must be compiled with a consistent and well-defined evaluation method (FLT_EVAL_METHOD >= 0).

If FLT_EVAL_METHOD==0 then there is no excess precision; this is the most common case. (It gives the IEEE semantics; clang does not even support other values, and gcc only supports other values on i386, where FLT_EVAL_METHOD==2 is the default.)

There is an additional issue on i386: the x87 FPU has a precision setting; it can do arithmetic at extended precision (64 bit precision and 16 bit sign+exponent), at double precision (53 bit precision and 16 bit sign+exponent) or at single precision. On linux the default setting is extended precision and changing it is not supported by musl. On other operating systems the default is double precision (most notably on freebsd), so taking floating-point code from other systems should be done with care. (The double precision setting provides fp semantics similar to FLT_EVAL_METHOD==0, but there are still issues with the wider exponent range, and the extra precision of long double is lost.)

The convention in musl is to use double_t or float_t when excess precision is not harmful, and to use an explicit cast or assignment when rounding away the excess precision is important (rounding the result means extra load/store operations on i386; on other architectures it is a no-op).
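
A minimal sketch of this convention (sumsq is a hypothetical function): the intermediate sum may carry excess precision, the final cast rounds it away:

#include <math.h> /* double_t */

double sumsq(double a, double b)
{
    double_t t = a*a + b*b; /* excess precision is harmless here */
    return (double)t;       /* drop the excess precision explicitly */
}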

Argument passing is an assignment, so excess precision is dropped there, but a return statement is not (this has changed in C11), so if the extra range or precision is harmful then an explicit cast or assignment should be used before the return (e.g. 'return 0x1p-1000*0x1p-1000;' is not ok for returning 0 with underflow raised).
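
A sketch of the conforming way to return 0 with underflow raised (raise_underflow is a hypothetical name; this assumes the compiler implements the C99 assignment semantics, e.g. gcc with -fexcess-precision=standard):

double raise_underflow(void)
{
    double x = 0x1p-1000*0x1p-1000; /* the assignment rounds away excess precision and range */
    return x;                       /* +0 with underflow and inexact raised */
}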

An important issue with FLT_EVAL_METHOD!=0 is double rounding: rounding an infinite precision result to an intermediate format and then again to a smaller precision format can give a different result than a single rounding to the smaller format in nearest rounding mode. (With the x87 double precision setting the wider exponent range can still cause double rounding in the subnormal range, where the in-memory representation has smaller precision than the fp registers.)

    double x = 0.5 + 0x1p-50; // exact, no rounding
    x = x + 0x1p52;

On FLT_EVAL_METHOD==0 platforms x is set to 0x1p52+1; on i386 the addition is evaluated at extended precision giving 0x1p52+0.5, which is then rounded again to double precision at the assignment, giving 0x1p52 (round-to-nearest, ties-to-even).

In non-nearest rounding mode double rounding cannot cause different results, but it can cause the omission of the underflow flag:

    (double)(0x1p-1070 + 0x1p-1000*0x1p-1000)

The result is inexact and subnormal, so it should raise the underflow flag, but the multiplication does not underflow at extended precision, the addition is first rounded to extended precision and then to double, and the result of the second rounding is an exact subnormal (so the underflow flag is not raised).

Musl raises the underflow flag correctly so it needs to take special care when the result is in the subnormal range.

Fenv and error handling

The correctly rounded functions support all rounding modes, and all math functions support the exception flags other than inexact.

In some cases the rounding mode is changed (e.g. fma), the exception flags are tested and cleared (e.g. nearbyint), or exception flags are raised by some float operation (many functions). This means the compiler must not treat the surrounding floating-point operations as side-effect-free pure computations: they depend on the current rounding mode and change the exception flags (they depend on and modify global fenv state).

To allow compilers to do fp optimizations, the standard invented the FENV_ACCESS pragma:

#pragma STDC FENV_ACCESS ON
#pragma STDC FENV_ACCESS OFF

The OFF state means the code does not access the fenv in any way, so various reordering, constant folding and common subexpression elimination optimizations are valid. The ON state means that in the local scope the compiler must not do such optimizations.

Neither gcc nor clang supports the FENV_ACCESS pragma; in that case they should assume FENV_ACCESS ON for all code to be safe, but both compilers assume FENV_ACCESS OFF and do aggressive fp optimizations which cannot be turned off.

For the gcc issues see http://gcc.gnu.org/wiki/FloatingPointMath

Various FENV_ACCESS issues with gcc and clang (can be worked around using volatile hacks):

Constant folding of exception raising code or of code that depends on the rounding mode (incorrectly optimized to x = 0; y = INFINITY; z = 1;):

#pragma STDC FENV_ACCESS ON
double x = 0x1p-1000*0x1p-1000; // raises underflow, depends on rounding-mode, -frounding-math does not fix it
double y = 0x1p1000*0x1p1000; // raises overflow, depends on rounding-mode, -frounding-math does not fix it
double z = 1 - 0x1p-100; // depends on rounding mode, -frounding-math fixes it

Common subexpression elimination with rounding-mode change (simplified fma code: a + b is not recalculated):

double x, a, b;
int r = fegetround();
fesetround(FE_TONEAREST);
/* ... */
x = a + b;
if (x == 0) {
    /* make sure the sign of zero is right */
    fesetround(r);
    x = a + b;
    return x;
}

Expression reordering (trunc: the absx < 1 check is reordered before the calculation of y, omitting the inexact exception):

/* round to integer, raise inexact if not integer */
y = absx + 0x1p52 - 0x1p52;
if (absx < 1)
    return 0*x;

Expression reordering (old lrint code: the conversion of the result of rint to long is reordered after the fe* checks):

fenv_t e;
long i;
feholdexcept(&e);
i = rint(x);
/* don't raise inexact on overflow */
if (fetestexcept(FE_INVALID))
    feclearexcept(FE_INEXACT);
feupdateenv(&e);
return i;

Whenever the optional FE_* macros are used in library code they should be conditional (#ifdef FE_...) because some platforms may not support them.
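
A sketch of such a conditional use (clear_underflow is a hypothetical helper):

#include <fenv.h>

static void clear_underflow(void)
{
#ifdef FE_UNDERFLOW
    feclearexcept(FE_UNDERFLOW);
#endif
}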

Musl does not provide errno based error handling; math_errhandling is MATH_ERREXCEPT, so it is important that floating-point status flags are raised correctly. The convention is not to use explicit feraiseexcept(flag) calls, because math code should work even if the platform does not support fenv, raising flags through function calls is slow, and arithmetic usually raises the flags correctly anyway. The math code makes sure that exceptions are neither raised spuriously nor omitted, by using appropriate arithmetic.

A simple exception raising example: implement a function f that is odd, that can be approximated by x + x*x*x/2 + x^5*eps when x is small, and that returns g(x) otherwise.

#include <stdint.h>

double g(double); /* the general case */

double f(double x)
{
    union {double f; uint64_t i;} u = {x};
    int e = u.i>>52 & 0x7ff;

    if (e < 0x3ff - 13) { /* |x| < 0x1p-13 */
        if (e < 0x3ff - 26) { /* |x| < 0x1p-26 */
            if (e == 0) { /* x is subnormal or zero */
                #pragma STDC FENV_ACCESS ON
                /* raise underflow if x!=0 */
                float unused = x;
            }
            /* do not fall through, x*x could raise underflow incorrectly */
            /* note: inexact flag is not supported */
            return x;
        }
        return x*(1 + x*x/2);
    }
    return g(x);
}

Unfortunately the unused variable is optimized away, so the convention is to use volatile hacks in such cases:

    if (e == 0)
    /* raise underflow if x!=0 */
    FORCE_EVAL((float)x);

The FORCE_EVAL macro is defined in libm.h and does a volatile assignment which is not optimized away by the compiler.

Usual exception raising methods (inexact and divbyzero rarely need special arithmetic):

// underflow for subnormal x
(float)x;
// underflow for |x| < 1, x!=0
(float)(0x1p-149f*x);
// overflow for |x| >= 2, |x|!=inf
(float)(0x1p127f*x);
// invalid for inf
(x-x);
0*x;
// invalid for any x except nan
(x-x)/(x-x);
// invalid
0.0f/0.0f;

FORCE_EVAL is used if the expression is only evaluated for its fenv side effects. Sometimes more efficient code could be generated without any store if the compilers did not optimize away unused code (efficient in terms of code size; these are usually rare corner cases, so performance is not critical).

Common formulas

Commonly used formulas throughout the mathematical library (some of them are described in the Goldberg paper, "What Every Computer Scientist Should Know About Floating-Point Arithmetic"):

Using the exponent

    union {double f; uint64_t i;} u = {x};
    int e = u.i>>52 & 0x7ff;
    if (e == 0x7ff) { /* +-inf or nan */ }
    if (e == 0x3ff + N) { /* 2^N <= |x| < 2^(N+1) */ }
    if (e == 0) { /* +-0 or subnormal x */ }

Exact add in nearest rounding mode: hi+lo == x+y exactly and hi == (double)(x+y); it assumes FLT_EVAL_METHOD==0 (otherwise double rounding can ruin the exactness):

    /* exact add, assumes exponent_x >= exponent_y, no overflow */
    hi = x + y;
    lo = y - (hi - x);

Exact mul in nearest rounding mode: hi+lo == x*y exactly and hi == (double)(x*y); it assumes FLT_EVAL_METHOD==0:

    /* exact mul, assumes no over/underflow */
    static const double c = 0x1p27 + 1;
    double cx, xh, xl, cy, yh, yl, hi, lo, x, y;

    cx = c*x;
    xh = x - cx + cx;
    xl = x - xh;
    cy = c*y;
    yh = y - cy + cy;
    yl = y - yh;
    hi = x*y;
    lo = (xh*yh - hi) + xh*yl + xl*yh + xl*yl;

Another method to get a reasonably precise x*x is to zero out the low bits (hi+lo has 26 extra bits of precision compared to (double)(x*x)):

    union {double f; uint64_t i;} u = {x};
    u.i &= (uint64_t)-1<<27;
    hi = u.f*u.f;  // exact
    lo = (x+u.f)*(x-u.f); // inexact

Round to int in the current rounding mode (or round to fixed point value):

    static const double_t toint = 1/EPSILON;
    // x >= 0, x < 0x1p52
    r1 = x + toint - toint;
    // |x| < 0x1p51, nearest rounding mode
    r2 = x + 1.5*toint - 1.5*toint;

where EPSILON is DBL_EPSILON (0x1p-52) or LDBL_EPSILON (0x1p-63 or 0x1p-112), according to FLT_EVAL_METHOD.
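
A sketch of the double precision case (toint_demo is a hypothetical name; it assumes FLT_EVAL_METHOD==0, so that double_t is double and EPSILON is DBL_EPSILON):

#include <float.h>
#include <math.h> /* double_t */

/* round x to an integer in the current rounding mode, valid for 0 <= x < 0x1p52 */
double toint_demo(double x)
{
    static const double_t toint = 1/DBL_EPSILON; /* 0x1p52 */
    return x + toint - toint;
}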

Sign of zero

Musl tries to return zero with correct sign (this is important in functions which may be used in complex code, because of branch cuts).

With simple arithmetic the sign of zero cannot be detected silently: 0.0 == -0.0 and in all contexts they are treated equally, except for division (1/-0.0 == -inf, 1/0.0 == inf), but then the divbyzero flag is raised. The signbit macro or bit manipulation can be used to get the sign silently.

To remove the sign of zero, x*x can be used if x==0; alternatively x+0 works too, except in downward rounding mode (-0.0+0 is -0.0 in downward rounding).

To create a 0 with sign according to the sign of x, 0*x can be used if x is finite (otherwise copysign(0,x) or signbit(x)?-0.0:0.0 can be used; since signbit is a macro it is probably faster; nan may need special treatment).
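
A small sketch of silent sign detection (only illustrative):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double z = -0.0;
    printf("%d\n", z == 0.0);        /* 1: -0.0 compares equal to 0.0 */
    printf("%d\n", signbit(z) != 0); /* 1: the sign is still detectable silently */
    return 0;
}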

Complex

Currently some complex functions are dummy ones: naive implementations that may give wildly incorrect results. The reason is that there are non-trivial special cases which are hard to handle correctly. (There are various papers on the topic, e.g. about branch cuts by William Kahan, and about the implementation of complex inverse trigonometric functions by Hull et al.)

The convention is to create complex numbers using the CMPLX(re,im) construct, new in C11, because gcc does not support the _Imaginary type and turns

x*I

into

(x+0*I)*(0+I) = 0*x + x*I

So if x=inf then the result is nan+inf*I instead of inf*I, and thus a+b*I cannot be used to create complex numbers in general.
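
A sketch of the convention (make_iy is a hypothetical name): CMPLX constructs the value directly, so no multiplication is involved:

#include <complex.h>

double complex make_iy(double y)
{
    return CMPLX(0.0, y); /* stays y*I even if y is infinite */
}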