A.7 Accuracy relaxation for floating point conversion

First, define the least radix function, lb, for arguments greater than 0:

lb : Z → Z

lb(r) = min{n ∈ Z | n > 1 and there is an m ∈ Z such that r = n^m}
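As an illustration only, the definition of lb can be evaluated by a direct search. The sketch below assumes r is a positive integer; the function name and the brute-force strategy are choices of this example, not part of LIA-1. For an integer r ≥ 1 only non-negative exponents m need to be tried, since a negative m would give a value below 1.

```python
def lb(r: int) -> int:
    """Least radix of r: the smallest integer n > 1 with r == n**m for some integer m."""
    if r < 1:
        raise ValueError("lb is only defined here for r >= 1")
    n = 2
    while True:
        # Grow powers of n until they reach or pass r.
        power = 1
        while power < r:
            power *= n
        if power == r:
            return n
        n += 1  # terminates at n == r at the latest, since r == r**1
```

Under this definition lb(2) and lb(16) are both 2, while lb(10) is 10, so a binary-to-hexadecimal conversion does not fall under the relaxation, whereas a binary-to-decimal conversion does.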

If this relaxation is allowed, there shall be a max_error_convert_{F→F′} parameter that gives the maximum error when converting from F to F′ and lb(r_F) ≠ lb(r_{F′}), a max_error_convert_{F→D} parameter that gives the maximum error when converting from F to D and lb(r_F) ≠ lb(r_D), and a max_error_convert_{D→F} parameter that gives the maximum error when converting from D to F and lb(r_D) ≠ lb(r_F). These parameters may be required to have the same value, and then only one parameter need be made available to programs.

max_error_convert_{F→F′}, max_error_convert_{F→D}, and max_error_convert_{D→F} should each have a value less than 1. When lb(r_F) = lb(r_{F′}), then max_error_convert_{F→F′}, max_error_convert_{F→D}, and max_error_convert_{D→F} shall all be 0.5.

The convert^∗_{F→F′}, convert^∗_{F→D}, and convert^∗_{D→F} helper functions are introduced to model this pre-rounding approximation:

convert^∗_{F→F′} : F^‡ → R

convert^∗_{F→D} : F^‡ → R

convert^∗_{D→F} : D^∗ → R

convert^∗_{F→F′}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert^∗_{F→F′}(x))| ≤ max_error_convert_{F→F′} · u_F(x)

convert^∗_{F→D}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert^∗_{F→D}(x))| ≤ max_error_convert_{F→D} · u_F(x)

convert^∗_{D→F}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert^∗_{D→F}(x))| ≤ max_error_convert_{D→F} · u_F(x)

Further requirements on the convert^∗_{F→F′} approximation helper functions are:

convert^∗_{F→F′}(x) = x   if x ∈ Z ∩ F

convert^∗_{F→F′}(−x) = −convert^∗_{F→F′}(x)   if x ∈ F^‡

convert^∗_{F→F′}(x) ≤ convert^∗_{F→F′}(y)   if x, y ∈ F^‡ and x < y

Relationship to other floating point to floating point conversion approximation helper functions for conversion operations in the same library shall be:

convert^∗_{F→F′}(x) = convert^∗_{F″→F′}(x)   if lb(r_{F″}) = lb(r_F) and x ∈ F ∩ F″

The convert′_{F→F′} operation:

convert′_{F→F′} : F → F′ ∪ {inexact, underflow, overflow}

convert′_{F→F′}(x) = result_{F′}(convert^∗_{F→F′}(x), nearest_{F′})   if x ∈ F

= convert_{F→F′}(x)   otherwise
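The accuracy requirement on the floating point to floating point approximation helper can be explored numerically. The sketch below is an illustration under stated assumptions, not a conforming implementation: F is taken to be Python's float (assumed to be binary64), the target is a decimal floating point format of a chosen precision, and the hypothetical function approx_decimal stands in for the approximation helper. The error measure follows the inequality as printed earlier in this clause, |x − nearest_F(approximation)| divided by u_F(x). It requires Python 3.9 or later for math.ulp.

```python
import math
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction


def approx_decimal(x: float, digits: int) -> Decimal:
    """Hypothetical stand-in for the helper: x rounded to `digits` decimal digits."""
    ctx = Context(prec=digits, rounding=ROUND_HALF_EVEN)
    # Decimal(x) is the exact value of the float; the context then rounds it.
    return ctx.create_decimal(Decimal(x))


def error_in_ulps(x: float, approx: Decimal) -> Fraction:
    """|x - nearest_F(approx)| / u_F(x), computed exactly with rationals."""
    nearest = float(approx)  # rounds the decimal value to the nearest float
    return abs(Fraction(x) - Fraction(nearest)) / Fraction(math.ulp(x))
```

With 17 significant decimal digits the measured error for a binary64 value is 0, because 17 digits are enough to recover the original value. With far fewer digits the error exceeds 1, which would violate the recommendation that the maximum error be less than 1. The stand-in also leaves integers in F unchanged, is odd in its argument, and is monotonic, matching the further requirements listed above.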

Further requirements on the convert^∗_{F→D} approximation helper functions are:

convert^∗_{F→D}(x) = x   if x ∈ Z ∩ F

convert^∗_{F→D}(−x) = −convert^∗_{F→D}(x)   if x ∈ F

convert^∗_{F→D}(x) ≤ convert^∗_{F→D}(y)   if x, y ∈ F and x < y

Relationship to other floating point to fixed point conversion approximation helper functions for conversion operations in the same library shall be:

convert^∗_{F→D}(x) = convert^∗_{F″→D}(x)   if lb(r_{F″}) = lb(r_F) and x ∈ F ∩ F″

The convert′_{F→D} operation:

convert′_{F→D} : F → D ∪ {inexact, overflow}

convert′_{F→D}(x) = result_D(convert^∗_{F→D}(x), nearest_D)   if x ∈ F

= convert_{F→D}(x)   otherwise

Further requirements on the convert^∗_{D→F} approximation helper functions are:

convert^∗_{D→F}(x) = x   if x ∈ Z ∩ D

convert^∗_{D→F}(−x) = −convert^∗_{D→F}(x)   if x ∈ D

convert^∗_{D→F}(x) ≤ convert^∗_{D→F}(y)   if x, y ∈ D and x < y

Relationship to other floating point and fixed point to floating point conversion approximation helper functions for conversion operations in the same library shall be:

convert^∗_{D→F}(x) = convert^∗_{D′→F}(x)   if lb(r_{D′}) = lb(r_D) and x ∈ D ∩ D′

convert^∗_{D→F}(x) = convert^∗_{F′→F}(x)   if lb(r_{F′}) = lb(r_D) and x ∈ D ∩ F′

The convert′_{D→F} operation:

convert′_{D→F} : D → F ∪ {inexact, underflow, overflow}

convert′_{D→F}(x) = result_F(convert^∗_{D→F}(x), nearest_F)   if x ∈ D

= convert_{D→F}(x)   otherwise

Annex B (informative) IEC 60559 bindings

When the parameter iec_559_F is true for a floating point type F, all the facilities required by IEC 60559 shall be provided for that datatype. Methods shall be provided for a program to access each such facility. In addition, documentation shall be provided to describe these methods.

This means that a complete programming language binding for LIA-1 should provide a binding for all IEC 60559 facilities as well. Such a programming language binding must define syntax for all required facilities, and should define syntax for all optional facilities as well. Defining syntax for optional facilities does not make those facilities required. All it does is ensure that those implementations that choose to provide an optional facility will do so using a standardized syntax.

The normative listing of all IEC 60559 facilities (and their definitions) is given in IEC 60559.

ISO/IEC 10967 does not alter or eliminate any of them. However, to assist the reader, the following summary is offered.

B.1 Summary

A binding of IEC 60559 to a programming language must provide the names of the programming language datatypes that correspond to:

a) binary32, b) binary64, c) binary128, d) decimal64, e) decimal128, if any.

Note that for each of the IEC 60559 datatypes, the LIA-1 parameters denorm_F and iec_559_F are true. The remaining LIA-1 basic parameters for ‘binary32’ are:

r_F = 2, p_F = 24, emin_F = −125, emax_F = 128

For IEC 60559 ‘binary64’ they are:

r_F = 2, p_F = 53, emin_F = −1021, emax_F = 1024
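As an informal cross-check, the ‘binary64’ values can be compared with what a language runtime reports. The sketch below assumes that Python's float is IEC 60559 binary64, which holds on common platforms but is not guaranteed by the language. The function name lia1_parameters is hypothetical. The sys.float_info fields follow the same exponent convention as LIA-1, with the fraction in [1/r, 1).

```python
import math
import sys


def lia1_parameters() -> dict:
    """Collect r, p, emin and emax for Python's float from sys.float_info."""
    info = sys.float_info
    return {
        "r": info.radix,
        "p": info.mant_dig,
        "emin": info.min_exp,
        "emax": info.max_exp,
    }


params = lia1_parameters()
# Smallest normal value: r**(emin - 1).
fmin_normal = math.ldexp(1.0, params["emin"] - 1)
# Largest finite value: (1 - r**(-p)) * r**emax.
fmax = math.ldexp(1.0 - 2.0 ** -params["p"], params["emax"])
```

On a binary64 platform, fmin_normal and fmax computed this way coincide with sys.float_info.min and sys.float_info.max.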

For IEC 60559 ‘binary128’ they are:

r_F = 2

IEC 60559 also specifies ‘binary16’ and ‘decimal32’ with only conversion operations specified, terming them “storage formats”. These storage formats are not included when referring to IEC 60559 conforming datatypes below. IEC 60559 also specifies extended formats, giving just maximum or minimum requirements on the parameters. These are included when referring to IEC 60559 conforming datatypes below.

For each IEC 60559 conforming datatype, the binding must provide:

a) a method for denoting positive infinity,

b) a method for denoting at least one quiet NaN (not-a-number),

c) a method for denoting at least one signalling NaN (not-a-number).
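To make the three denotations concrete, the sketch below shows how they might look for binary64 in Python. This is an example of one existing language, not a binding defined by this document. Python provides math.inf and math.nan but has no notation for a signalling NaN, so the sketch only gives a binary64 bit pattern for one. The helper names and the chosen pattern are assumptions of the example, as is the convention that a set leading fraction bit marks a quiet NaN.

```python
import math
import struct

POSITIVE_INFINITY = math.inf
QUIET_NAN = math.nan
# One possible signalling NaN encoding: exponent all ones, quiet bit clear,
# non-zero fraction. Only the bit pattern is given, no float is created from it.
SIGNALLING_NAN_BITS = 0x7FF0000000000001


def float_bits(x: float) -> int:
    """Return the 64-bit encoding of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_signalling_bits(bits: int) -> bool:
    """True if a binary64 encoding denotes a signalling NaN."""
    exponent_all_ones = (bits >> 52) & 0x7FF == 0x7FF
    fraction = bits & ((1 << 52) - 1)
    quiet_bit = (bits >> 51) & 1
    return exponent_all_ones and fraction != 0 and quiet_bit == 0
```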

For each IEC 60559 conforming datatype provided, the binding should provide the notation for invoking each of the following operations:

a) The basic comparison operations (also required by LIA-1, though not with built-in conversion):

b) The other comparison operations required by IEC 60559:

boolean compareSignalingNotGreater(source1,source2)

boolean compareQuietNotGreater(source1,source2)

boolean compareQuietLessUnordered(source1,source2)

boolean compareQuietNotLess(source1,source2)

boolean compareQuietGreaterUnordered(source1,source2)

boolean compareOrdered(source1,source2)

c) The basic arithmetic operations (also required by LIA-1, though not with built-in conversions, and not rounding mode dependent), plus negation and absolute value.

formatOf-addition(source1, source2)   add_{F→F′}, add^↑_{F→F′}, add^↓_{F→F′}

formatOf-subtraction(source1, source2)   sub_{F→F′}, sub^↑_{F→F′}, sub^↓_{F→F′}

formatOf-multiplication(source1, source2)   mul_{F→F′}, mul^↑_{F→F′}, mul^↓_{F→F′}

formatOf-division(source1, source2)   div_{F→F′}, div^↑_{F→F′}, div^↓_{F→F′}

sourceFormat negate(source)   similar to neg_F

sourceFormat abs(source)   similar to abs_F

sourceFormat copySign(source, source)   similar to mul_{F→F}(abs_F(x), signum_F(y))

d) Remainder, square-root, and fused multiply-add.

sourceFormat remainder(source, source)   residue_F

formatOf-squareRoot(source)   sqrt_{F→F′}, sqrt^↑_{F→F′}, sqrt^↓_{F→F′}

formatOf-fusedMultiplyAdd(source1, source2, source3)   mul_add_{F→F′}, mul_add^↑_{F→F′}, mul_add^↓_{F→F′}

e) Floating point step functions, exponent extraction and change.

sourceFormat nextUp(source)   succ_F

sourceFormat nextDown(source)   pred_F

sourceFormat nextAfter(source, source)

logBFormat logB(source)   exponent_{F→I}(x) − 1

sourceFormat scaleB(source, logBFormat)   scale_{F,I}

sourceFormat quantize(source, source)   (may have denormalisation loss)

f) Maximum and minimum operations.

sourceFormat minNum(source, source)   similar to mmin_F

sourceFormat maxNum(source, source)   similar to mmax_F

sourceFormat minNumMag(source, source)   mmin_F(abs_F(x), abs_F(y))

sourceFormat maxNumMag(source, source)   mmax_F(abs_F(x), abs_F(y))

g) The type and value conversions.

formatOf-convert(int)   convert_{I→F}, convert^↑_{I→F}, convert^↓_{I→F}

intFormatOf-convertToIntegerTiesToEven(source)

intFormatOf-convertToIntegerTowardZero(source)

intFormatOf-convertToIntegerTowardPositive(source)

formatOf-convert(source)   convert_{F→F′}, convert^↑_{F→F′}, convert^↓_{F→F′}

formatOf-convertFromDecimalCharacter(decimalCharacterSequence)

decimalEncodingType encodeDecimal(decimalType)   convert_{F→F′}, convert^↑_{F→F′}, convert^↓_{F→F′}

decimalType decodeDecimal(decimalEncodingType)   convert_{F→F′}, convert^↑_{F→F′}, convert^↓_{F→F′}

binaryEncodingType encodeBinary(decimalType)   convert_{F→F′}, convert^↑_{F→F′}, convert^↓_{F→F′}

decimalType decodeBinary(binaryEncodingType)   convert_{F→F′}, convert^↑_{F→F′}, convert^↓_{F→F′}

h) Some test functions.

boolean isSigned(source)   similar to eq_F(signum_F(x), −1)

boolean isNormal(source)   similar to geq_F(abs_F(x), fminN_F)

boolean isFinite(source)   similar to lss_F(abs_F(x), +∞)

boolean isZero(source)   eq_F(x, 0)

boolean isSubnormal(source)   similar to istiny_F(x) && neq_F(x, 0)

boolean isInfinity(source)   eq_F(abs_F(x), +∞)

boolean isNaN(source)   neq_F(x, x), isnan_F(x)

boolean isSignaling(source)   issignan_F(x)

i) Notification indicator functions.

void lowerFlag(exceptionGroupType)   clear_indicators(C, S)

boolean testFlag(exceptionGroupType)   test_indicators(C, S)

void restoreFlag(flagsType, exceptionGroupType)   similar to set_indicators(C, S) but also clears

flagsType saveFlags(void)   current_indicators(C)

j) “Mode” changes (rounding direction modes, and other modes).

MODEtype getMODE(void)

binaryRoundingDirectionType getBinaryRoundingDirection(void)

decimalRoundingDirectionType getDecimalRoundingDirection(void)

void setMODE(MODEtype)

void setBinaryRoundingDirection(binaryRoundingDirectionType)

void setDecimalRoundingDirection(decimalRoundingDirectionType)

modeGroupType saveModes(void)

void restoreModes(modeGroupType)

void defaultModes(void)

Note that several of the above facilities are already required by LIA-1 even for implementations that do not conform to IEC 60559.
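To show what a notation for a few of the listed operations can look like in practice, the sketch below maps some of them onto Python's standard library for binary64. It is an example only: the wrapper names are hypothetical, Python is not a binding defined by this document, and math.nextafter requires Python 3.9 or later. The logB wrapper uses the relation given in the list, the LIA-1 exponent minus 1, and is meant for finite non-zero arguments only.

```python
import math
import sys


def next_up(x: float) -> float:
    """nextUp: the least float greater than x (succ_F)."""
    return math.nextafter(x, math.inf)


def next_down(x: float) -> float:
    """nextDown: the greatest float less than x (pred_F)."""
    return math.nextafter(x, -math.inf)


def log_b(x: float) -> int:
    """logB for finite non-zero x: the LIA-1 style exponent minus 1."""
    # math.frexp returns (m, e) with 0.5 <= |m| < 1 and x == m * 2**e,
    # so e is the exponent with the fraction in [1/r, 1).
    _, e = math.frexp(x)
    return e - 1


def scale_b(x: float, n: int) -> float:
    """scaleB: x * 2**n."""
    return math.ldexp(x, n)


def is_subnormal(x: float) -> bool:
    """isSubnormal: non-zero with magnitude below the smallest normal value."""
    return x != 0 and abs(x) < sys.float_info.min
```

The remainder and copySign operations have direct counterparts in math.remainder and math.copysign. For example, math.remainder(5.0, 3.0) is −1.0, because the quotient 5/3 is rounded to the nearest integer, 2, before the residue is taken.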
