
Accuracy relaxation for floating point conversion

In document DRAFT INTERNATIONAL (Page 58-65)

First define the least radix function, lb, defined for arguments that are greater than 0:

lb : Z → Z

lb(r) = min{n ∈ Z | n > 1 and there is an m ∈ Z such that r = n^m}
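As an illustration only, the definition can be evaluated by direct search. The following is a minimal Python sketch; the function name `lb` follows the text, while the search strategy is an assumption of this example and not something the specification prescribes.

```python
def lb(r: int) -> int:
    """Smallest integer n > 1 such that r is an integer power of n (r > 0)."""
    if r <= 0:
        raise ValueError("lb is only defined for arguments greater than 0")
    n = 2
    while True:
        # Walk through n**0, n**1, n**2, ... until reaching or passing r.
        power = 1
        while power < r:
            power *= n
        if power == r:
            return n
        n += 1  # terminates at the latest when n == r, since r == r**1


# Radices 2, 4, 8 and 16 share the least radix 2; radix 10 is its own least radix.
print([lb(r) for r in (2, 4, 8, 16, 10, 100)])
```

Under this definition, conversions between radix 2 and radix 16 formats have equal least radices, whereas conversions between radix 2 and radix 10 formats do not.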

If this relaxation is allowed, there shall be a max_error_convert_{F→F′} parameter that gives the maximum error when converting from F to F′ and lb(r_F) ≠ lb(r_F′), a max_error_convert_{F→D} parameter that gives the maximum error when converting from F to D and lb(r_F) ≠ lb(r_D), and a max_error_convert_{D→F} parameter that gives the maximum error when converting from D to F and lb(r_D) ≠ lb(r_F). These parameters may be required to have the same value, and then only one parameter need be made available to programs.

max_error_convert_{F→F′}, max_error_convert_{F→D}, and max_error_convert_{D→F} should each have a value less than 1. When lb(r_F) = lb(r_F′), then max_error_convert_{F→F′}, max_error_convert_{F→D}, and max_error_convert_{D→F} shall all be 0.5.
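These parameters are expressed in units of u_F(x), so a value of 0.5 corresponds to correct rounding to nearest. The Python sketch below measures the error of one concrete conversion, decimal text to Python `float`, in those units. It assumes that `float` is IEC 60559 binary64 and uses `math.ulp` (Python 3.9 or later) as a stand-in for u_F, evaluated at the converted value rather than at x; the function name is hypothetical.

```python
from fractions import Fraction
import math


def conversion_error_in_ulps(decimal_text: str) -> Fraction:
    """Error of float(decimal_text) as an exact multiple of one ulp."""
    exact = Fraction(decimal_text)       # exact rational value of the numeral
    converted = float(decimal_text)      # the conversion under test
    ulp = Fraction(math.ulp(converted))  # exact, since an ulp is a power of two
    return abs(exact - Fraction(converted)) / ulp


for text in ("0.1", "0.5", "1e-3", "123456789.123456789"):
    print(text, float(conversion_error_in_ulps(text)))
```

A conversion that makes use of the relaxation could report values above 0.5 here, but they would have to stay within the documented maximum error parameter.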

The convert_{F→F′}, convert_{F→D}, and convert_{D→F} helper functions are introduced to model this pre-rounding approximation:

convert_{F→F′} : F → R

convert_{F→D} : F → R

convert_{D→F} : D → R

convert_{F→F′}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert_{F→F′}(x))| ≤ max_error_convert_{F→F′} · u_F(x)

convert_{F→D}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert_{F→D}(x))| ≤ max_error_convert_{F→D} · u_F(x)

convert_{D→F}(x) returns a close approximation to x, satisfying

|x − nearest_F(convert_{D→F}(x))| ≤ max_error_convert_{D→F} · u_F(x)

Further requirements on the convert_{F→F′} approximation helper functions are:

convert_{F→F′}(x) = x    if x ∈ Z ∩ F

convert_{F→F′}(−x) = −convert_{F→F′}(x)    if x ∈ F

convert_{F→F′}(x) ≤ convert_{F→F′}(y)    if x, y ∈ F and x < y

Relationship to other floating point to floating point conversion approximation helper functions for conversion operations in the same library shall be:

convert_{F→F′}(x) = convert_{F″→F′}(x)    if lb(r_F″) = lb(r_F) and x ∈ F ∩ F″

The convert′_{F→F′} operation:

convert′_{F→F′} : F → F′ ∪ {inexact, underflow, overflow}

convert′_{F→F′}(x) = result_F′(convert_{F→F′}(x), nearest_F′)    if x ∈ F

                   = convert_{F→F′}(x)    otherwise
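The structure of this operation (obtain a value, round it to the target type, and report notifications) can be sketched in Python for a binary64 to binary32 conversion. Both types have radix 2, so no relaxation applies and the conversion is exact before rounding. The function name, the string labels for notifications, and the choice of returning an infinity as continuation value on overflow are all assumptions of this sketch.

```python
import math
import struct

FMIN_N_BINARY32 = 2.0 ** -126  # smallest positive normal binary32 value


def convert_to_binary32(x: float) -> tuple[float, set[str]]:
    """Round a binary64 value to binary32; return (value, notifications)."""
    if not math.isfinite(x):
        return x, set()  # special values are outside F and handled separately
    try:
        # struct rounds to the nearest binary32 value when packing.
        y = struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # The rounded result would exceed the largest binary32 value.
        return math.copysign(math.inf, x), {"overflow"}
    notifications = set()
    if y != x:
        notifications.add("inexact")
        if abs(y) < FMIN_N_BINARY32:
            notifications.add("underflow")  # inexact result below the normal range
    return y, notifications


for value in (0.5, 0.1, 1e40, 1e-40):
    print(value, convert_to_binary32(value))
```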

Further requirements on the convertF →D approximation helper functions are:

convert_{F→D}(x) = x    if x ∈ Z ∩ F

convert_{F→D}(−x) = −convert_{F→D}(x)    if x ∈ F

convert_{F→D}(x) ≤ convert_{F→D}(y)    if x, y ∈ F and x < y

Relationship to other floating point to fixed point conversion approximation helper functions for conversion operations in the same library shall be:

convert_{F→D}(x) = convert_{F″→D}(x)    if lb(r_F″) = lb(r_F) and x ∈ F ∩ F″

The convert′_{F→D} operation:

convert′_{F→D} : F → D ∪ {inexact, overflow}

convert′_{F→D}(x) = result_D(convert_{F→D}(x), nearest_D)    if x ∈ F

                  = convert_{F→D}(x)    otherwise

Further requirements on the convertD→F approximation helper functions are:

convert_{D→F}(x) = x    if x ∈ Z ∩ D

convert_{D→F}(−x) = −convert_{D→F}(x)    if x ∈ D

convert_{D→F}(x) ≤ convert_{D→F}(y)    if x, y ∈ D and x < y
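The three requirements (exactness on integers, sign symmetry, and monotonicity) can be spot-checked on an existing conversion. The Python sketch below uses `decimal.Decimal` as a stand-in for D and `float` (assumed to be binary64) for F. It checks the rounded result rather than the pre-rounding helper function, and only for a handful of hypothetical sample values, so it illustrates the properties without proving them.

```python
from decimal import Decimal


def convert_d_to_f(x: Decimal) -> float:
    """Stand-in for a rounded D -> F conversion (Decimal to binary64)."""
    return float(x)


samples = sorted(
    Decimal(s) for s in ("-2.5", "-0.1", "0", "0.1", "0.3", "1", "2.675", "1000")
)

# Integer values (small enough to be in F) convert exactly.
integers_exact = all(
    convert_d_to_f(x) == int(x) for x in samples if x == x.to_integral_value()
)

# Negating the argument negates the result.
sign_symmetric = all(convert_d_to_f(-x) == -convert_d_to_f(x) for x in samples)

# Increasing arguments never produce decreasing results.
monotone = all(
    convert_d_to_f(a) <= convert_d_to_f(b) for a, b in zip(samples, samples[1:])
)

print(integers_exact, sign_symmetric, monotone)
```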

Relationship to other floating point and fixed point to floating point conversion approximation helper functions for conversion operations in the same library shall be:

convert_{D→F}(x) = convert_{D′→F}(x)    if lb(r_D′) = lb(r_D) and x ∈ D ∩ D′

convert_{D→F}(x) = convert_{F′→F}(x)    if lb(r_F′) = lb(r_D) and x ∈ D ∩ F′

The convert′_{D→F} operation:

convert′_{D→F} : D → F ∪ {inexact, underflow, overflow}

convert′_{D→F}(x) = result_F(convert_{D→F}(x), nearest_F)    if x ∈ D

                  = convert_{D→F}(x)    otherwise


Annex B (informative) IEC 60559 bindings

When the parameter iec_559_F is true for a floating point type F, all the facilities required by IEC 60559 shall be provided for that datatype. Methods shall be provided for a program to access each such facility. In addition, documentation shall be provided to describe these methods.

This means that a complete programming language binding for LIA-1 should provide a binding for all IEC 60559 facilities as well. Such a programming language binding must define syntax for all required facilities, and should define syntax for all optional facilities as well. Defining syntax for optional facilities does not make those facilities required. All it does is ensure that those implementations that choose to provide an optional facility will do so using a standardized syntax.

The normative listing of all IEC 60559 facilities (and their definitions) is given in IEC 60559.

ISO/IEC 10967 does not alter or eliminate any of them. However, to assist the reader, the following summary is offered.

B.1 Summary

A binding of IEC 60559 to a programming language must provide the names of the programming language datatypes that correspond to:

a) binary32, b) binary64, c) binary128, d) decimal64, e) decimal128, if any.

Note that, for each of the IEC 60559 datatypes, the LIA-1 parameters denorm_F and iec_559_F are true. The remaining LIA-1 basic parameters for ‘binary32’ are:

r_F = 2    p_F = 24    emin_F = −125    emax_F = 128

For IEC 60559 ‘binary64’ they are:

r_F = 2    p_F = 53    emin_F = −1021    emax_F = 1024
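As an illustration of how a language binding can expose these parameters, Python reports the corresponding values for its `float` type through `sys.float_info`. The sketch below assumes an implementation whose `float` is binary64; the dictionary keys are labels chosen for this example.

```python
import sys

info = sys.float_info

# Python documents min_exp as the minimum integer e such that radix**(e-1)
# is a normalized float, which matches the exponent convention used here.
lia_parameters = {
    "r_F": info.radix,
    "p_F": info.mant_dig,
    "emin_F": info.min_exp,
    "emax_F": info.max_exp,
}
print(lia_parameters)
```

On such an implementation the reported values are expected to agree with the ‘binary64’ parameters listed above.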

For IEC 60559 ‘binary128’ they are:


r_F = 2

IEC 60559 also specifies ‘binary16’ and ‘decimal32’ with only conversion operations specified, terming them “storage formats”. These storage formats are not included when referring to IEC 60559 conforming datatypes below. IEC 60559 also specifies extended formats, giving just maximum or minimum requirements on the parameters. These are included when referring to IEC 60559 conforming datatypes below.

For each IEC 60559 conforming datatype, the binding must provide:

a) a method for denoting positive infinity,

b) a method for denoting at least one quiet NaN (not-a-number),

c) a method for denoting at least one signalling NaN (not-a-number).
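For illustration, Python denotes positive infinity and a quiet NaN through its standard library, while its `float` type has no dedicated notation for a signalling NaN. The sketch below therefore shows a signalling NaN only as a binary64 bit pattern. It assumes that `float` is binary64 and that the platform uses the IEC 60559 recommended encoding in which a set most significant fraction bit marks a quiet NaN; the helper names are hypothetical.

```python
import math
import struct

positive_infinity = math.inf  # equivalently float("inf")
quiet_nan = math.nan          # equivalently float("nan")

# One signalling NaN encoding: exponent all ones, quiet bit clear, nonzero payload.
SIGNALLING_NAN_BITS = 0x7FF0_0000_0000_0001


def bits_of(x: float) -> int:
    """Return the binary64 encoding of x as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_quiet_nan_bits(bits: int) -> bool:
    """True if the bit pattern encodes a NaN with the quiet bit set."""
    exponent_all_ones = (bits >> 52) & 0x7FF == 0x7FF
    fraction = bits & ((1 << 52) - 1)
    quiet_bit = (bits >> 51) & 1
    return exponent_all_ones and fraction != 0 and quiet_bit == 1


print(hex(bits_of(positive_infinity)), hex(bits_of(quiet_nan)))
```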

For each IEC 60559 conforming datatype provided, the binding should provide the notation for invoking each of the following operations:

a) The basic comparison operations (also required by LIA-1, though not with built-in conversion):

b) The other comparison operations required by IEC 60559:

boolean compareSignalingNotGreater(source1,source2)

boolean compareQuietNotGreater(source1,source2)

boolean compareQuietLessUnordered(source1,source2)

boolean compareQuietNotLess(source1,source2)

boolean compareQuietGreaterUnordered(source1,source2)

boolean compareOrdered(source1,source2)

c) The basic arithmetic operations (also required by LIA-1, though not with built-in conversions, and not rounding mode dependent), plus negation and absolute value.

formatOf-addition(source1, source2)    add_{F→F′}, add_{F→F′}, add_{F→F′}

formatOf-subtraction(source1, source2)    sub_{F→F′}, sub_{F→F′}, sub_{F→F′}

formatOf-multiplication(source1, source2)    mul_{F→F′}, mul_{F→F′}, mul_{F→F′}

formatOf-division(source1, source2)    div_{F→F′}, div_{F→F′}, div_{F→F′}

sourceFormat negate(source)    similar to neg_F

sourceFormat abs(source)    similar to abs_F

sourceFormat copySign(source, source)    similar to mul_{F→F}(abs_F(x), signum_F(y))

d) Remainder, square-root, and fused multiply-add.

sourceFormat remainder(source, source)    residue_F

formatOf-squareRoot(source)    sqrt_{F→F′}, sqrt_{F→F′}, sqrt_{F→F′}

formatOf-fusedMultiplyAdd(source1, source2, source3)    mul_add_{F→F′}, mul_add_{F→F′}, mul_add_{F→F′}

e) Floating point step functions, exponent extraction and change.

sourceFormat nextUp(source)    succ_F

sourceFormat nextDown(source)    pred_F

sourceFormat nextAfter(source, source)

logBFormat logB(source)    exponent_{F→I}(x) − 1

sourceFormat scaleB(source, logBFormat)    scale_{F,I}

sourceFormat quantize(source, source)    (may have denormalisation loss)

f) Maximum and minimum operations.

sourceFormat minNum(source, source)    similar to mmin_F

sourceFormat maxNum(source, source)    similar to mmax_F

sourceFormat minNumMag(source, source)    mmin_F(abs_F(x), abs_F(y))

sourceFormat maxNumMag(source, source)    mmax_F(abs_F(x), abs_F(y))

g) The type and value conversions.

formatOf-convert(int)    convert_{I→F}, convert_{I→F}, convert_{I→F}

intFormatOf-convertToIntegerTiesToEven(source)

intFormatOf-convertToIntegerTowardZero(source)


intFormatOf-convertToIntegerTowardPositive(source)

formatOf-convert(source)    convert_{F→F′}, convert_{F→F′}, convert_{F→F′}

formatOf-convertFromDecimalCharacter(decimalCharacterSequence)

decimalEncodingType encodeDecimal(decimalType)    convert_{F→F′}, convert_{F→F′}, convert_{F→F′}

decimalType decodeDecimal(decimalEncodingType)    convert_{F→F′}, convert_{F→F′}, convert_{F→F′}

binaryEncodingType encodeBinary(decimalType)    convert_{F→F′}, convert_{F→F′}, convert_{F→F′}

decimalType decodeBinary(binaryEncodingType)    convert_{F→F′}, convert_{F→F′}, convert_{F→F′}

h) Some test functions.

boolean isSigned(source)    similar to eq_F(signum_F(x), −1)

boolean isNormal(source)    similar to geq_F(abs_F(x), fminN_F)

boolean isFinite(source)    similar to lss_F(abs_F(x), +∞)

boolean isZero(source)    eq_F(x, 0)

boolean isSubnormal(source)    similar to istiny_F(x) && neq_F(x, 0)

boolean isInfinity(source)    eq_F(abs_F(x), +∞)

boolean isNaN(source)    neq_F(x, x), isnan_F(x)

boolean isSignaling(source)    issignan_F(x)

i) Notification indicator functions.

void lowerFlag(exceptionGroupType)    clear_indicators(C, S)

boolean testFlag(exceptionGroupType)    test_indicators(C, S)

void restoreFlag(flagsType, exceptionGroupType)    similar to set_indicators(C, S) but also clears

flagsType saveFlags(void)    current_indicators(C)

j) “Mode” changes (rounding direction modes, and other modes).

MODEtype getMODE(void)

binaryRoundingDirectionType getBinaryRoundingDirection(void)

decimalRoundingDirectionType getDecimalRoundingDirection(void)

void setMODE(MODEtype)

void setBinaryRoundingDirection(binaryRoundingDirectionType)

void setDecimalRoundingDirection(decimalRoundingDirectionType)

modeGroupType saveModes(void)

void restoreModes(modeGroupType)

void defaultModes(void)

Note that several of the above facilities are already required by LIA-1 even for implementations that do not conform to IEC 60559.
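To illustrate what providing a notation for these operations can look like, the Python sketch below pairs a few of the operations listed in b) to e) with expressions from Python's standard library. It assumes Python 3.9 or later and a binary64 `float`. The wrapper names are hypothetical, and the quiet comparison predicates rely on Python comparisons involving a NaN returning False without raising.

```python
import math


# b) Quiet comparison predicates, true also when the operands are unordered.
def compare_quiet_not_greater(x: float, y: float) -> bool:
    return not (x > y)


def compare_quiet_less_unordered(x: float, y: float) -> bool:
    return not (x >= y)


def compare_quiet_not_less(x: float, y: float) -> bool:
    return not (x < y)


def compare_quiet_greater_unordered(x: float, y: float) -> bool:
    return not (x <= y)


# c) copySign and d) remainder are available directly.
copy_sign = math.copysign
remainder = math.remainder


# e) nextUp, nextDown and, for radix 2, scaleB.
def next_up(x: float) -> float:
    return math.nextafter(x, math.inf)


def next_down(x: float) -> float:
    return math.nextafter(x, -math.inf)


scale_b = math.ldexp

print(compare_quiet_not_greater(math.nan, 1.0), remainder(5.0, 3.0), next_up(1.0))
```

The signalling comparison variants and the notification indicator functions have no direct counterpart for Python's `float` and are left out of this sketch.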
