Accuracy relaxation for floating point conversion

First define the least radix function, lb, for arguments greater than 0:

lb : Z → Z

lb(r) = min{n ∈ Z | n > 1 and there is an m ∈ Z such that r = n^m}
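As an illustration only, the definition of lb can be sketched in Python as a direct search over candidate radices. The function name `lb` follows the text; the error handling and the search strategy are assumptions of this sketch, not part of the specification.

```python
def lb(r: int) -> int:
    """Least radix: the smallest n > 1 such that r == n**m for some integer m."""
    if r <= 0:
        raise ValueError("lb is defined only for arguments greater than 0")
    for n in range(2, max(r, 2) + 1):
        power = 1
        while power < r:  # walk n**0, n**1, ... until reaching or passing r
            power *= n
        if power == r:
            return n
    raise AssertionError("unreachable: n == r always satisfies r == n**1")
```

With this sketch, lb(2), lb(8), and lb(16) all evaluate to 2, while lb(10) evaluates to 10.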

If this relaxation is allowed, there shall be a max_error_convert_{F→F′} parameter that gives the maximum error when converting from F to F′ and lb(r_F) ≠ lb(r_{F′}), a max_error_convert_{F→D} parameter that gives the maximum error when converting from F to D and lb(r_F) ≠ lb(r_D), and a max_error_convert_{D→F} parameter that gives the maximum error when converting from D to F and lb(r_D) ≠ lb(r_F). These parameters may be required to have the same value, and then only one parameter need be made available to programs.

max_error_convert_{F→F′}, max_error_convert_{F→D}, and max_error_convert_{D→F} should each have a value less than 1. When lb(r_F) = lb(r_{F′}), then max_error_convert_{F→F′}, max_error_convert_{F→D}, and max_error_convert_{D→F} shall all be 0.5.

The convert^∗_{F→F′}, convert^∗_{F→D}, and convert^∗_{D→F} helper functions are introduced to model this pre-rounding approximation:

convert^∗_{F→F′} : F^‡ → R

convert^∗_{F→D} : F^‡ → R

convert^∗_{D→F} : D^∗ → R

convert^∗_{F→F′}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert^∗_{F→F′}(x))| ≤ max_error_convert_{F→F′} · u_F(x)

convert^∗_{F→D}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert^∗_{F→D}(x))| ≤ max_error_convert_{F→D} · u_F(x)

convert^∗_{D→F}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert^∗_{D→F}(x))| ≤ max_error_convert_{D→F} · u_F(x)

Further requirements on the convert^∗_{F→F′} approximation helper functions are:

convert^∗_{F→F′}(x) = x    if x ∈ Z ∩ F

convert^∗_{F→F′}(−x) = −convert^∗_{F→F′}(x)    if x ∈ F^‡

convert^∗_{F→F′}(x) ≤ convert^∗_{F→F′}(y)    if x, y ∈ F^‡ and x < y
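The three requirements just listed (exactness on integers, sign symmetry, and monotonicity) lend themselves to a simple property check. The following Python sketch tests a candidate helper function over a finite sample set. The names `check_helper` and `truncate3` are hypothetical, and exact rationals stand in for the elements of F; none of this is prescribed by the text.

```python
import math
from fractions import Fraction

def check_helper(approx, samples):
    """Check the three stated requirements on a candidate helper over `samples`.

    `approx` maps a Fraction to a Fraction; `samples` is an iterable of Fractions
    standing in for values of F. Returns a list of violations (empty if none).
    """
    xs = sorted(set(samples))
    violations = []
    for x in xs:
        if x.denominator == 1 and approx(x) != x:
            violations.append(f"not exact on integer {x}")
        if approx(-x) != -approx(x):
            violations.append(f"not sign-symmetric at {x}")
    # Checking adjacent sorted pairs is enough, because <= is transitive.
    for x, y in zip(xs, xs[1:]):
        if approx(x) > approx(y):
            violations.append(f"not monotonic between {x} and {y}")
    return violations

def truncate3(x):
    """Hypothetical candidate: truncate toward zero to three decimal places."""
    return Fraction(math.trunc(x * 1000), 1000)

samples = [Fraction(n, 7) for n in range(-21, 22)]
ok_report = check_helper(truncate3, samples)
bad_report = check_helper(lambda x: x + Fraction(1, 1000), samples)
```

Here `truncate3` passes all three checks on the sample set, whereas the constant-offset candidate fails both exactness on integers and sign symmetry. A sample-based check can only find violations; it does not prove conformity.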

Relationship to other floating point to floating point conversion approximation helper functions for conversion operations in the same library shall be:

convert^∗_{F→F′}(x) = convert^∗_{F″→F′}(x)    if lb(r_{F″}) = lb(r_F) and x ∈ F ∩ F″

The convert′_{F→F′} operation:

convert′_{F→F′} : F → F′ ∪ {inexact, underflow, overflow}

convert′_{F→F′}(x) = result_{F′}(convert^∗_{F→F′}(x), nearest_{F′})    if x ∈ F

                   = convert_{F→F′}(x)    otherwise

Further requirements on the convert^∗_{F→D} approximation helper functions are:

convert^∗_{F→D}(x) = x    if x ∈ Z ∩ F

convert^∗_{F→D}(−x) = −convert^∗_{F→D}(x)    if x ∈ F

convert^∗_{F→D}(x) ≤ convert^∗_{F→D}(y)    if x, y ∈ F and x < y

Relationship to other floating point to fixed point conversion approximation helper functions for conversion operations in the same library shall be:

convert^∗_{F→D}(x) = convert^∗_{F″→D}(x)    if lb(r_{F″}) = lb(r_F) and x ∈ F ∩ F″

The convert′_{F→D} operation:

convert′_{F→D} : F → D ∪ {inexact, overflow}

convert′_{F→D}(x) = result_D(convert^∗_{F→D}(x), nearest_D)    if x ∈ F

                  = convert_{F→D}(x)    otherwise

Further requirements on the convert^∗_{D→F} approximation helper functions are:

convert^∗_{D→F}(x) = x    if x ∈ Z ∩ D

convert^∗_{D→F}(−x) = −convert^∗_{D→F}(x)    if x ∈ D

convert^∗_{D→F}(x) ≤ convert^∗_{D→F}(y)    if x, y ∈ D and x < y

Relationship to other floating point and fixed point to floating point conversion approximation helper functions for conversion operations in the same library shall be:

convert^∗_{D→F}(x) = convert^∗_{D′→F}(x)    if lb(r_{D′}) = lb(r_D) and x ∈ D ∩ D′

convert^∗_{D→F}(x) = convert^∗_{F′→F}(x)    if lb(r_{F′}) = lb(r_D) and x ∈ D ∩ F′

The convert′_{D→F} operation:

convert′_{D→F} : D → F ∪ {inexact, underflow, overflow}

convert′_{D→F}(x) = result_F(convert^∗_{D→F}(x), nearest_F)    if x ∈ D

                  = convert_{D→F}(x)    otherwise

Annex B (informative) IEC 60559 bindings

When the parameter iec_60559_F is true for a floating point type F, all the facilities required by IEC 60559 shall be provided for that datatype. Methods shall be provided for a program to access each such facility. In addition, documentation shall be provided to describe these methods.

This means that a complete programming language binding for LIA-1 should provide a binding for all IEC 60559 facilities as well. Such a programming language binding must define syntax for all required facilities, and should define syntax for all optional facilities as well. Defining syntax for optional facilities does not make those facilities required. All it does is ensure that those implementations that choose to provide an optional facility will do so using a standardized syntax.

The normative listing of all IEC 60559 facilities (and their definitions) is given in IEC 60559.

ISO/IEC 10967 does not alter or eliminate any of them. However, to assist the reader, the following summary is offered.

B.1 Summary

A binding of IEC 60559 to a programming language must provide the names of the programming language datatypes that correspond to:

a) binary32, b) binary64, c) binary128, d) decimal64, e) decimal128, if any.

Note that for each of the IEC 60559 datatypes, the LIA-1 parameters denorm_F and iec_60559_F are true. The remaining LIA-1 basic parameters for ‘binary32’ are:

r_F = 2    p_F = 24    emin_F = −125    emax_F = 128

For IEC 60559 ‘binary64’ they are:

r_F = 2    p_F = 53    emin_F = −1021    emax_F = 1024

For IEC 60559 ‘binary128’ they are:

r_F = 2    p_F = 113    emin_F = −16381    emax_F = 16384

For IEC 60559 ‘decimal64’ they are:

r_F = 10    p_F = 16    emin_F = −382    emax_F = 385

For IEC 60559 ‘decimal128’ they are:

r_F = 10    p_F = 34    emin_F = −6142    emax_F = 6145
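As a cross-check of the values above, the following Python sketch computes a few derived constants from the basic parameters and compares the binary64 results with what the host platform reports. The formulas for fmax, fminN, fminD, and epsilon are assumed here from the main body of LIA-1 (they are not restated in this annex), and the comparison assumes that the Python `float` type is binary64, which holds on common platforms but is not guaranteed.

```python
import sys
from fractions import Fraction

# Basic LIA-1 parameters (r, p, emin, emax) as listed above.
FORMATS = {
    "binary32":   (2, 24, -125, 128),
    "binary64":   (2, 53, -1021, 1024),
    "binary128":  (2, 113, -16381, 16384),
    "decimal64":  (10, 16, -382, 385),
    "decimal128": (10, 34, -6142, 6145),
}

def derived(r, p, emin, emax):
    """Derived constants, assuming the usual LIA-1 definitions (an assumption here)."""
    r = Fraction(r)
    return {
        "fmax": (1 - r ** -p) * r ** emax,   # largest finite value
        "fminN": r ** (emin - 1),            # smallest positive normal value
        "fminD": r ** (emin - p),            # smallest positive denormal value
        "epsilon": r ** (1 - p),             # relative spacing at 1
    }

b64 = derived(*FORMATS["binary64"])
fi = sys.float_info
host_matches_binary64 = (
    (fi.radix, fi.mant_dig, fi.min_exp, fi.max_exp) == FORMATS["binary64"]
    and b64["fmax"] == Fraction(fi.max)
    and b64["fminN"] == Fraction(fi.min)
    and b64["epsilon"] == Fraction(fi.epsilon)
)
```

Exact rational arithmetic is used so that the comparison itself introduces no rounding.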

IEC 60559 also specifies ‘binary16’ and ‘decimal32’ with only conversion operations specified, terming them “storage formats”. These storage formats are not included when referring to IEC 60559 conforming datatypes below. IEC 60559 also specifies extended formats, giving just maximum or minimum requirements on the parameters. These are included when referring to IEC 60559 conforming datatypes below.

For each IEC 60559 conforming datatype, the binding must provide:

a) a method for denoting positive infinity,

b) a method for denoting at least one quiet NaN (not-a-number),

c) a method for denoting at least one signalling NaN (not-a-number).
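For illustration, Python's standard library offers notations for items a) and b). The sketch below shows them for the built-in `float` type, assumed here to be binary64. It does not attempt item c), since it assumes no portable notation for a signalling NaN of type `float`.

```python
import math

pos_inf = math.inf      # equivalently float("inf")
quiet_nan = math.nan    # a quiet NaN, also obtainable as float("nan")

inf_is_positive_infinity = math.isinf(pos_inf) and pos_inf > 0
nan_is_nan = math.isnan(quiet_nan)
nan_unequal_to_itself = quiet_nan != quiet_nan   # NaN compares unequal to everything
```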

For each IEC 60559 conforming datatype provided, the binding should provide the notation for invoking each of the following operations.

a) Homogeneous general-computational operations.

sourceFormat roundToIntegralTiesToEven(source)         rounding_F
sourceFormat roundToIntegralTiesToAway(source)
sourceFormat roundToIntegralTowardZero(source)
sourceFormat roundToIntegralTowardPositive(source)     ceiling_F
sourceFormat roundToIntegralTowardNegative(source)     floor_F
sourceFormat roundToIntegralExact(source)
sourceFormat nextUp(source)                            succ_F
sourceFormat nextDown(source)                          pred_F
sourceFormat remainder(source, source)                 residue_F
sourceFormat minNum(source, source)                    mmin_F
sourceFormat maxNum(source, source)                    mmax_F
sourceFormat minNumMag(source, source)                 mmin_F(abs_F(x), abs_F(y))
sourceFormat maxNumMag(source, source)                 mmax_F(abs_F(x), abs_F(y))
sourceFormat quantize(source, source)
sourceFormat scaleB(source, logBFormat)                scale_{F,I}
logBFormat logB(source)                                exponent_{F→I}(x) − 1
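To make a few of these correspondences concrete, the sketch below uses Python standard library functions that behave like some of the operations listed under a) for finite arguments. The pairing of each Python function with an IEC 60559 operation is an illustrative assumption of this sketch, not a binding defined by this annex. Note that `round`, `math.trunc`, `math.ceil`, and `math.floor` return integers and reject infinities and NaNs, unlike the roundToIntegral operations. `math.nextafter` requires Python 3.9 or later.

```python
import math
import sys

x = 2.5
ties_to_even = float(round(x))            # like roundToIntegralTiesToEven
toward_zero = float(math.trunc(-x))       # like roundToIntegralTowardZero
toward_positive = float(math.ceil(x))     # like roundToIntegralTowardPositive
toward_negative = float(math.floor(x))    # like roundToIntegralTowardNegative

next_up = math.nextafter(1.0, math.inf)     # like nextUp
next_down = math.nextafter(1.0, -math.inf)  # like nextDown

rem = math.remainder(5.0, 3.0)   # like remainder: 5 - 3*2, since 5/3 rounds to 2
scaled = math.ldexp(1.5, 4)      # like scaleB for radix 2: 1.5 * 2**4

# frexp returns (m, e) with 0.5 <= |m| < 1, so e - 1 mirrors
# the "exponent_{F→I}(x) − 1" entry for logB.
log_b = math.frexp(8.0)[1] - 1
```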

b) formatOf general-computational operations. The basic arithmetic operations are also required by LIA-1, though not with built-in conversion if the arguments are of different type. In LIA-1 they are not permitted to be rounding mode dependent, so changing the rounding mode may put the implementation in a state that does not conform to LIA-1.

formatOf-addition(source1, source2)                    add_{F→F′}, add^↑_{F→F′}, add^↓_{F→F′}
formatOf-subtraction(source1, source2)                 sub_{F→F′}, sub^↑_{F→F′}, sub^↓_{F→F′}
formatOf-multiplication(source1, source2)              mul_{F→F′}, mul^↑_{F→F′}, mul^↓_{F→F′}
formatOf-division(source1, source2)                    div_{F→F′}, div^↑_{F→F′}, div^↓_{F→F′}
formatOf-squareRoot(source)                            sqrt_{F→F′}, sqrt^↑_{F→F′}, sqrt^↓_{F→F′}
formatOf-fusedMultiplyAdd(source1, source2, source3)   mul_add_{F→F′}, mul_add^↑_{F→F′}, mul_add^↓_{F→F′}
formatOf-convertFromInt(int)                           convert_{I→F}, convert^↑_{I→F}, convert^↓_{I→F}
intFormatOf-convertToIntegerTiesToEven(source)         rounding_{F→I}
intFormatOf-convertToIntegerTowardZero(source)
intFormatOf-convertToIntegerTowardPositive(source)     ceiling_{F→I}
intFormatOf-convertToIntegerTowardNegative(source)     floor_{F→I}
intFormatOf-convertToIntegerTiesToAway(source)
intFormatOf-convertToIntegerExactTiesToEven(source)
intFormatOf-convertToIntegerExactTowardZero(source)
intFormatOf-convertToIntegerExactTowardPositive(source)
intFormatOf-convertToIntegerExactTowardNegative(source)
intFormatOf-convertToIntegerExactTiesToAway(source)
formatOf-convertFormat(source)                         convert_{F→F′}, convert^↑_{F→F′}, convert^↓_{F→F′}

formatOf-convertFromDecimalCharacter(decimalCharacterSequence)
    r_{F′} = 10 and r_D = 10
    convert_{F′→F}, convert_{D→F}
    convert^↑_{F′→F}, convert^↑_{D→F}
    convert^↓_{F′→F}, convert^↓_{D→F}

decimalCharacterSequence convertToDecimalCharacter(source, conversionSpecification)
    r_{F′} = 10 and r_D = 10
    convert_{F→F′}, convert_{F→D}
    convert^↑_{F→F′}, convert^↑_{F→D}
    convert^↓_{F→F′}, convert^↓_{F→D}

formatOf-convertFromHexCharacter(hexCharacterSequence)
    r_{F′} = 16, r_D = 16, and r_F = 2
    convert_{F′→F}, convert_{D→F}
    convert^↑_{F′→F}, convert^↑_{D→F}
    convert^↓_{F′→F}, convert^↓_{D→F}

hexCharacterSequence convertToHexCharacter(source, conversionSpecification)
    r_{F′} = 16, r_D = 16, and r_F = 2
    convert_{F→F′}, convert_{F→D}
    convert^↑_{F→F′}, convert^↑_{F→D}
    convert^↓_{F→F′}, convert^↓_{F→D}

NOTE – mul_add_F (mul_add_{F→F′}) is specified in LIA-2 (part 2 of ISO/IEC 10967).
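Several of the conversion operations under b) have close analogues in Python, shown in the sketch below for the built-in `float` type (assumed to be binary64). The pairing of each expression with an IEC 60559 operation is an illustrative assumption; Python is not a conforming binding as described in this annex, and only the default rounding is shown.

```python
import math

# Decimal character sequences
from_decimal = float("0.1")          # like convertFromDecimalCharacter
to_decimal = repr(from_decimal)      # like convertToDecimalCharacter (shortest round-trip form)

# Hexadecimal character sequences
from_hex = float.fromhex("0x1.8p+1")   # like convertFromHexCharacter
to_hex = (3.0).hex()                   # like convertToHexCharacter

# Integer conversions
from_int = float(2 ** 53 + 1)    # like convertFromInt; the tie rounds to the even neighbour
to_int_even = round(2.5)         # like convertToIntegerTiesToEven
to_int_zero = math.trunc(-2.5)   # like convertToIntegerTowardZero
to_int_pos = math.ceil(2.5)      # like convertToIntegerTowardPositive
to_int_neg = math.floor(-2.5)    # like convertToIntegerTowardNegative
```

Because both the hexadecimal notation and binary64 have least radix 2, the hexadecimal round trip is exact, whereas the decimal string "0.1" is not exactly representable and is rounded on input.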

c) Quiet-computational operations.

sourceFormat copy(source)                  convert_{F→F}, but non-signalling
sourceFormat negate(source)                similar to neg_F, but non-signalling
sourceFormat abs(source)                   similar to abs_F, but non-signalling
decimalEncoding encodeDecimal(decimal)
decimal decodeDecimal(decimalEncoding)
binaryEncoding encodeBinary(decimal)
decimal decodeBinary(binaryEncoding)

d) Signaling-computational operations. The basic comparisons are also required by LIA-1, though not with built-in conversion if the arguments are of different type.

boolean compareQuietEqual(source1,source2)               eq_F
boolean compareQuietNotEqual(source1,source2)            neq_F
boolean compareSignallingGreater(source1,source2)        gtr_F
boolean compareSignallingGreaterEqual(source1,source2)   geq_F
boolean compareSignallingLess(source1,source2)           lss_F
boolean compareSignallingLessEqual(source1,source2)      leq_F
boolean compareSignalingEqual(source1, source2)
boolean compareSignalingNotEqual(source1, source2)
boolean compareSignalingNotGreater(source1,source2)
boolean compareSignalingLessUnordered(source1,source2)
boolean compareSignalingNotLess(source1,source2)
boolean compareSignalingGreaterUnordered(source1,source2)
boolean compareQuietGreater(source1,source2)
boolean compareQuietGreaterEqual(source1,source2)
boolean compareQuietLess(source1,source2)
boolean compareQuietLessEqual(source1,source2)
boolean compareQuietUnordered(source1,source2)
boolean compareQuietNotGreater(source1,source2)
boolean compareQuietLessUnordered(source1,source2)
boolean compareQuietNotLess(source1,source2)
boolean compareQuietGreaterUnordered(source1,source2)
boolean compareQuietOrdered(source1,source2)
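The difference between the plain relational predicates and the "Not" and "Unordered" variants only shows up when an operand is a NaN. The Python sketch below models four of the quiet predicates; the function names are hypothetical transliterations of the operation names above. It relies on Python's `float` comparisons returning False, without raising an exception, whenever an operand is a NaN. The signalling variants are not modelled, because standard Python exposes no floating point status flags for `float`.

```python
import math

def compare_quiet_unordered(x: float, y: float) -> bool:
    """True when at least one operand is a NaN."""
    return math.isnan(x) or math.isnan(y)

def compare_quiet_ordered(x: float, y: float) -> bool:
    return not compare_quiet_unordered(x, y)

def compare_quiet_not_greater(x: float, y: float) -> bool:
    """Negation of 'greater': true for less, equal, and unordered."""
    return not (x > y)

def compare_quiet_less_unordered(x: float, y: float) -> bool:
    """True for less and for unordered."""
    return compare_quiet_unordered(x, y) or x < y

nan = math.nan
less_equal_with_nan = nan <= 1.0                           # ordinary comparison: False
not_greater_with_nan = compare_quiet_not_greater(nan, 1.0)  # negated comparison: True
```

The last two lines show why compareQuietNotGreater is not the same predicate as compareQuietLessEqual: they disagree exactly on unordered operands.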

e) Non-computational operations.

boolean is754version1985(void)
boolean is754version2008(void)
enum class(source)
boolean isSignMinus(source)              similar to eq_F(signum_F(x), −1)
boolean isNormal(source)                 similar to geq_F(abs_F(x), fminN_F)
boolean isFinite(source)                 similar to lss_F(abs_F(x), +∞)
boolean isZero(source)                   eq_F(x, 0)
boolean isSubnormal(source)              similar to istiny_F(x) && neq_F(x, 0)
boolean isInfinite(source)               eq_F(abs_F(x), +∞)
boolean isNaN(source)                    isnan_F(x)
boolean isSignaling(source)              issignan_F(x)
boolean isCanonical(source)
enum radix(source)                       r_F
boolean totalOrder(source, source)
boolean totalOrderMag(source, source)
boolean sameQuantum(source, source)
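Some of the classification predicates under e) can be modelled directly on the comparison-based descriptions in the right-hand column. The Python sketch below does so for the built-in `float` type, assumed to be binary64, using `sys.float_info.min` as the counterpart of fminN_F. The function names are hypothetical, and the sign test inspects the sign bit through `math.copysign`, which is a choice made for this sketch.

```python
import math
import sys

FMIN_N = sys.float_info.min   # smallest positive normal value

def is_sign_minus(x: float) -> bool:
    """True when the sign bit is set, including for -0.0."""
    return math.copysign(1.0, x) < 0

def is_normal(x: float) -> bool:
    """Finite with magnitude at least FMIN_N."""
    return math.isfinite(x) and abs(x) >= FMIN_N

def is_subnormal(x: float) -> bool:
    """Nonzero with magnitude below FMIN_N (false for NaN, since the comparison fails)."""
    return x != 0 and abs(x) < FMIN_N

def is_zero(x: float) -> bool:
    return x == 0
```

For example, `is_subnormal(5e-324)` is true and `is_normal(5e-324)` is false, while `is_sign_minus(-0.0)` is true even though -0.0 compares equal to 0.0.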

f) Handling of notification recordings.

void lowerFlags(exceptionGroup)                  clear indicators
void raiseFlags(exceptionGroup)                  set indicators
boolean testFlags(exceptionGroup)                test indicators
boolean testSavedFlags(flags, exceptionGroup)
void restoreFlags(flags, exceptionGroup)
flags saveAllFlags(void)                         current indicators
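Python's `float` type exposes no status flags, but the standard `decimal` module does record exceptional conditions in per-context flags, which gives a convenient way to illustrate clearing and testing indicators. The sketch below is an analogy only: `decimal` is not presented here as an IEC 60559 binding, and the precision of 16 digits is chosen merely to mirror p_F of decimal64.

```python
import decimal
from decimal import Decimal

ctx = decimal.Context(prec=16)        # 16 digits, mirroring p_F of decimal64
ctx.traps[decimal.Inexact] = False    # record inexactness in a flag instead of raising

ctx.clear_flags()                                        # analogous to lowering all flags
third = ctx.divide(Decimal(1), Decimal(3))               # result must be rounded
inexact_after_third = bool(ctx.flags[decimal.Inexact])   # analogous to testFlags

ctx.clear_flags()
half = ctx.divide(Decimal(1), Decimal(2))                # exactly representable
inexact_after_half = bool(ctx.flags[decimal.Inexact])
```

The flag is set after the first division and remains clear after the second, since 1/2 needs no rounding at this precision.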

Note that several of the above facilities are already required by LIA-1 even for implementations that do not conform to IEC 60559.

IEC 60559 allows for dynamic change of rounding mode. Note though that LIA requires that the rounding is static for an operation (which may be simulated at lower level by setting and resetting a dynamic rounding mode, provided that the rounding mode is local to each thread).

Thus explicitly using rounding mode setting at a “higher level” may change an LIA conforming implementation to (a mode in which it is) a non-conforming implementation.
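The idea of simulating static rounding by setting and resetting a dynamic rounding mode can be sketched with Python's `decimal` module, whose arithmetic context is local to each thread. The helper name `add_with_rounding` and the three-digit precision are assumptions made for this illustration; the sketch is not an LIA binding.

```python
import decimal
from decimal import Decimal

def add_with_rounding(a: Decimal, b: Decimal, rounding: str, prec: int = 3) -> Decimal:
    """Add under a fixed rounding, leaving the caller's context untouched."""
    with decimal.localcontext() as ctx:   # works on a copy of the current context
        ctx.prec = prec
        ctx.rounding = rounding
        return a + b                      # the copy is discarded on exit

mode_before = decimal.getcontext().rounding
up = add_with_rounding(Decimal("1.00"), Decimal("0.005"), decimal.ROUND_CEILING)
down = add_with_rounding(Decimal("1.00"), Decimal("0.005"), decimal.ROUND_FLOOR)
mode_after = decimal.getcontext().rounding
```

Each call behaves like an operation with a statically chosen rounding: the exact sum 1.005 is rounded to 1.01 or 1.00 respectively, and the ambient rounding mode is the same before and after.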

IEC 60559 recommends a number of other operations (some of which are covered by LIA-2), but these are not listed here. See IEC 60559 for details.
