Floating point operations

In document DRAFT INTERNATIONAL (Page 32-39)

5.2 Floating point datatypes and operations

5.2.6 Floating point operations

For each provided conforming floating point datatype, the following operations shall be provided.

eqF : F × F → Boolean

eqF(x, y) = true            if x, y ∈ F ∪ {−∞, +∞} and x = y
          = false           if x, y ∈ F ∪ {−∞, +∞} and x ≠ y
          = eqF(0, y)       if x = −0 and y ∈ F ∪ {−∞, −0, +∞}
          = eqF(x, 0)       if x ∈ F ∪ {−∞, +∞} and y = −0
          = false           if x is a quiet NaN and y is not a signalling NaN
          = false           if y is a quiet NaN and x is not a signalling NaN
          = invalid(false)  if x is a signalling NaN or y is a signalling NaN

neqF : F × F → Boolean

neqF(x, y) = true           if x, y ∈ F ∪ {−∞, +∞} and x ≠ y
           = false          if x, y ∈ F ∪ {−∞, +∞} and x = y
           = neqF(0, y)     if x = −0 and y ∈ F ∪ {−∞, −0, +∞}
           = neqF(x, 0)     if y = −0 and x ∈ F ∪ {−∞, +∞}
           = true           if x is a quiet NaN and y is not a signalling NaN
           = true           if y is a quiet NaN and x is not a signalling NaN
           = invalid(true)  if x is a signalling NaN or y is a signalling NaN

lssF : F × F → Boolean

lssF(x, y) = true            if x, y ∈ F and x < y
           = false           if x, y ∈ F and x ≥ y
           = lssF(0, y)      if x = −0 and y ∈ F ∪ {−∞, −0, +∞}
           = lssF(x, 0)      if y = −0 and x ∈ F ∪ {−∞, +∞}
           = true            if x = −∞ and y ∈ F ∪ {+∞}
           = false           if x = +∞ and y ∈ F ∪ {−∞, +∞}
           = false           if x ∈ F ∪ {−∞} and y = −∞
           = true            if x ∈ F and y = +∞
           = invalid(false)  if x is a NaN or y is a NaN

leqF : F × F → Boolean

leqF(x, y) = true            if x, y ∈ F and x ≤ y
           = false           if x, y ∈ F and x > y
           = leqF(0, y)      if x = −0 and y ∈ F ∪ {−∞, −0, +∞}
           = leqF(x, 0)      if y = −0 and x ∈ F ∪ {−∞, +∞}
           = true            if x = −∞ and y ∈ F ∪ {−∞, +∞}
           = false           if x = +∞ and y ∈ F ∪ {−∞}
           = false           if x ∈ F and y = −∞
           = true            if x ∈ F ∪ {+∞} and y = +∞
           = invalid(false)  if x is a NaN or y is a NaN

gtrF : F × F → Boolean

gtrF(x, y) = lssF(y, x)

geqF : F × F → Boolean

geqF(x, y) = leqF(y, x)

isnegzeroF : F → Boolean

isnegzeroF(x) = true            if x = −0
              = false           if x ∈ F ∪ {−∞, +∞}
              = invalid(false)  if x is a NaN

istinyF : F → Boolean

istinyF(x) = true            if (x ∈ F and |x| < fminNF) or x = −0
           = false           if (x ∈ F and |x| ≥ fminNF) or x ∈ {−∞, +∞}
           = invalid(false)  if x is a NaN

isnanF : F → Boolean

isnanF(x) = false          if x ∈ F ∪ {−∞, −0, +∞}
          = true           if x is a quiet NaN
          = invalid(true)  if x is a signalling NaN

issignanF : F → Boolean

issignanF(x) = false  if x ∈ F ∪ {−∞, −0, +∞}
             = false  if x is a quiet NaN
             = true   if x is a signalling NaN
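The comparison and classification rules above can be observed on an IEC 60559 binary64 implementation. The following is a minimal Python sketch, not part of the specification. It assumes Python's `float` is binary64; the helper names `isnegzero` and `istiny` are illustrative only. Python exposes only quiet NaNs and does not signal `invalid` on ordered comparisons with a NaN, so the NaN cases of the helpers raise `ValueError` as a stand-in for the notification.

```python
import math
import sys

def isnegzero(x: float) -> bool:
    """True only for -0.0; a NaN operand is reported by raising ValueError."""
    if math.isnan(x):
        raise ValueError("invalid: NaN operand")
    return x == 0.0 and math.copysign(1.0, x) < 0.0

def istiny(x: float) -> bool:
    """True for zeros and subnormals, i.e. |x| below the smallest normal value."""
    if math.isnan(x):
        raise ValueError("invalid: NaN operand")
    return abs(x) < sys.float_info.min

nan = math.nan
checks = {
    "eq(-0, 0)": -0.0 == 0.0,        # -0 is compared as if it were 0
    "eq(nan, nan)": nan == nan,      # a quiet NaN is equal to nothing
    "neq(nan, nan)": nan != nan,     # ... and unequal to everything
    "lss(nan, 1)": nan < 1.0,        # ordered comparisons with NaN are false
    "lss(-inf, inf)": -math.inf < math.inf,
    "leq(inf, inf)": math.inf <= math.inf,
}
```

Because `-0.0 == 0.0` is true, a sign test such as `math.copysign` is needed to tell the two zeros apart, which is what `isnegzero` does.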

5.2.6.2 Basic arithmetic

For each provided conforming floating point datatype, the following round to nearest operations shall be provided, and the round towards negative and positive infinity operations should be provided. For the non-conforming case that denormF = false, the operations below using upF or downF as rounding shall not be provided.

NOTE 1 – If denormF = false, then any result whose magnitude is smaller than fminNF is replaced by zero. Neither rounding direction (nearest, up, down) is then heeded for would-be subnormal results; they are “flushed to zero”. Directed rounding would therefore be unreliable for interval arithmetic, as well as for other uses. That is why the directed rounding operations are not to be provided when denormF = false.
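The effect described in NOTE 1 can be illustrated with a small Python sketch. It is a model only: `flush_to_zero` is a hypothetical helper that imitates denormF = false on top of a platform that does support subnormals (assumed here to be binary64).

```python
import math
import sys

FMIN_N = sys.float_info.min  # smallest positive normal binary64 value (fminN)

def flush_to_zero(v: float) -> float:
    """Model denorm = false: replace any would-be subnormal result by a zero."""
    if 0.0 < abs(v) < FMIN_N:
        return math.copysign(0.0, v)
    return v

exact = FMIN_N / 4                # a subnormal, exactly representable here
flushed = flush_to_zero(exact)    # what a flush-to-zero implementation returns

# A "round up" result is only usable as an upper bound if it is >= the exact value.
upper_bound_ok = flushed >= exact
```

Here `upper_bound_ok` is false: the flushed result 0 lies below the exact positive value, so it cannot serve as the upper end of an enclosing interval.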

The operations in this clause are specified only for the case that rF = rF′, denormF = denormF′, and iec_559F = iec_559F′. If iec_559F = false, then the operations are required only if F = F′. The addF→F′ and subF→F′ operations can underflow only if denormF′ = false (the non-conforming case) or eminF − pF < eminF′ − pF′.

negF : F → F ∪ {−0}

negF(x) = −x               if x ∈ F and x ≠ 0
        = −0               if x = 0
        = 0                if x = −0
        = −∞               if x = +∞
        = +∞               if x = −∞
        = no_resultF→F(x)  otherwise

addF→F′ : F × F → F′ ∪ {inexact, underflow, overflow}

addF→F′(x, y) = resultF′(x + y, nearestF′)  if x, y ∈ F
              = −0                         if x = −0 and y = −0
              = addF→F′(0, y)              if x = −0 and y ∈ F ∪ {−∞, +∞}
              = addF→F′(x, 0)              if x ∈ F ∪ {−∞, +∞} and y = −0
              = +∞                         if x = +∞ and y ∈ F ∪ {+∞}
              = +∞                         if x ∈ F and y = +∞
              = −∞                         if x = −∞ and y ∈ F ∪ {−∞}
              = −∞                         if x ∈ F and y = −∞
              = no_result2F→F′(x, y)       otherwise

add↑F→F′ : F × F → F′ ∪ {inexact, underflow, overflow}

add↑F→F′(x, y) = resultF′(x + y, upF′)  if x, y ∈ F
               = addF→F′(x, y)          otherwise

add↓F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}

add↓F→F′(x, y) = resultF′(x + y, downF′)  if x, y ∈ F and (x + y ≠ 0 or x = 0)
               = −0                       if x, y ∈ F and x + y = 0 and x ≠ 0
               = −0                       if addF(x, y) = 0 and (x = −0 or y = −0)
               = addF→F′(x, y)            otherwise
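The three addition variants differ only in the rounding function and in the sign of an exact zero sum. The following Python sketch emulates the upward and downward variants for binary64 with F = F′. It is an illustration under stated assumptions, not the specified operations: `add_up` and `add_down` are hypothetical names, exact rational arithmetic from `fractions` stands in for the mathematical sum, `math.nextafter` requires Python 3.9 or later, and notifications (inexact, overflow, underflow) are not modelled.

```python
import math
from fractions import Fraction

def add_up(x: float, y: float) -> float:
    """Smallest float >= x + y for finite operands (round toward +infinity)."""
    s = x + y  # hardware sum, rounded to nearest
    if not (math.isfinite(x) and math.isfinite(y) and math.isfinite(s)):
        return s  # special operands and overflow are not modelled here
    if Fraction(s) < Fraction(x) + Fraction(y):
        return math.nextafter(s, math.inf)  # nearest rounded down: step up once
    return s

def add_down(x: float, y: float) -> float:
    """Largest float <= x + y for finite operands (round toward -infinity)."""
    s = x + y
    if not (math.isfinite(x) and math.isfinite(y) and math.isfinite(s)):
        return s
    exact = Fraction(x) + Fraction(y)
    if exact == 0:
        # An exact zero sum is -0 under downward rounding, except for (+0) + (+0).
        both_plus_zero = (x == 0.0 and y == 0.0
                          and math.copysign(1.0, x) > 0.0
                          and math.copysign(1.0, y) > 0.0)
        return 0.0 if both_plus_zero else -0.0
    if Fraction(s) > exact:
        return math.nextafter(s, -math.inf)  # nearest rounded up: step down once
    return s
```

For example, `add_down(1.0, -1.0)` yields −0 while `add_up(1.0, -1.0)` yields 0, matching the zero-sum rows of the definitions above.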

subF→F′ : F × F → F′ ∪ {inexact, underflow, overflow}

subF→F′(x, y) = addF→F′(x, negF(y))

sub↑F→F′ : F × F → F′ ∪ {inexact, underflow, overflow}

sub↑F→F′(x, y) = add↑F→F′(x, negF(y))

sub↓F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}

sub↓F→F′(x, y) = add↓F→F′(x, negF(y))

mulF→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}

mulF→F′(x, y) = resultF′(x · y, nearestF′)  if x, y ∈ F and x ≠ 0 and y ≠ 0
              = 0                          if x = 0 and y ∈ F and y ≥ 0
              = −0                         if x = 0 and ((y ∈ F and y < 0) or y = −0)
              = −0                         if x = −0 and y ∈ F and y ≥ 0
              = 0                          if x = −0 and ((y ∈ F and y < 0) or y = −0)
              = 0                          if x ∈ F and x > 0 and y = 0
              = −0                         if x ∈ F and x < 0 and y = 0
              = −0                         if x ∈ F and x > 0 and y = −0
              = 0                          if x ∈ F and x < 0 and y = −0
              = +∞                         if x = +∞ and ((y ∈ F and y > 0) or y = +∞)
              = −∞                         if x = +∞ and ((y ∈ F and y < 0) or y = −∞)
              = −∞                         if x = −∞ and ((y ∈ F and y > 0) or y = +∞)
              = +∞                         if x = −∞ and ((y ∈ F and y < 0) or y = −∞)
              = +∞                         if x ∈ F and x > 0 and y = +∞
              = −∞                         if x ∈ F and x < 0 and y = +∞
              = −∞                         if x ∈ F and x > 0 and y = −∞
              = +∞                         if x ∈ F and x < 0 and y = −∞
              = no_result2F→F′(x, y)       otherwise
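The zero rows of the multiplication definition amount to one rule: the result is a zero whose sign is the product of the operand signs. A short Python sketch, assuming binary64 `float` and using the illustrative helper name `mul_zero_case`, compares that rule with the hardware product.

```python
import math

def is_neg_zero(v: float) -> bool:
    """Distinguish -0.0 from 0.0, which compare equal."""
    return v == 0.0 and math.copysign(1.0, v) < 0.0

def mul_zero_case(x: float, y: float) -> float:
    """Zero whose sign is the product of the operand signs (at least one operand is a zero)."""
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    return math.copysign(0.0, sign)

# (x, y) pairs in which at least one operand is a zero and the other is finite.
cases = [(0.0, 3.0), (0.0, -3.0), (0.0, 0.0), (0.0, -0.0),
         (-0.0, 3.0), (-0.0, -3.0), (-0.0, 0.0), (-0.0, -0.0),
         (3.0, 0.0), (-3.0, 0.0), (3.0, -0.0), (-3.0, -0.0)]

# True for every pair if the hardware product has the sign the rule predicts.
agreement = [is_neg_zero(x * y) == is_neg_zero(mul_zero_case(x, y))
             for x, y in cases]
```

An infinity times a zero is not covered by any row and falls under the final "otherwise" case; in Python `math.inf * 0.0` evaluates to a NaN.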

mul↑F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}

mul↑F→F′(x, y) = resultF′(x · y, upF′)  if x, y ∈ F and x ≠ 0 and y ≠ 0
               = mulF→F′(x, y)          otherwise

mul↓F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}

mul↓F→F′(x, y) = resultF′(x · y, downF′)  if x, y ∈ F and x ≠ 0 and y ≠ 0
               = mulF→F′(x, y)            otherwise

divF→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}

divF→F′(x, y) = resultF′(x/y, nearestF′)  if x, y ∈ F and x ≠ 0 and y ≠ 0
              = 0                        if x = 0 and y ∈ F and y > 0
              = −0                       if x = 0 and y ∈ F and y < 0
              = −0                       if x = −0 and y ∈ F and y > 0
              = 0                        if x = −0 and y ∈ F and y < 0
              = infinitary(+∞)           if x ∈ F and x > 0 and y = 0
              = infinitary(−∞)           if x ∈ F and x < 0 and y = 0
              = infinitary(−∞)           if x ∈ F and x > 0 and y = −0
              = infinitary(+∞)           if x ∈ F and x < 0 and y = −0
              = 0                        if x ∈ F and x ≥ 0 and y = +∞
              = −0                       if x ∈ F and x ≥ 0 and y = −∞
              = −0                       if ((x ∈ F and x < 0) or x = −0) and y = +∞
              = 0                        if ((x ∈ F and x < 0) or x = −0) and y = −∞
              = +∞                       if x = +∞ and y ∈ F and y ≥ 0
              = −∞                       if x = −∞ and y ∈ F and y ≥ 0
              = −∞                       if x = +∞ and ((y ∈ F and y < 0) or y = −0)
              = +∞                       if x = −∞ and ((y ∈ F and y < 0) or y = −0)
              = no_result2F→F′(x, y)     otherwise
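Python raises `ZeroDivisionError` for float division by zero, so the infinitary rows above cannot be observed directly. The sketch below models them by returning a (value, notification) pair. It is a simplified illustration: the function name `div` and the string notifications are assumptions of this example, the "otherwise" cases 0/0 and ∞/∞ are assumed to yield `invalid` with a NaN, and the inexact, overflow and underflow notifications are not modelled.

```python
import math

def div(x: float, y: float):
    """Return (value, notification); notification is None, 'infinitary' or 'invalid'."""
    if math.isnan(x) or math.isnan(y):
        return (math.nan, None)          # quiet NaN propagates; signalling NaNs not modelled
    if y == 0.0:
        if x == 0.0:
            return (math.nan, "invalid")  # zero divided by zero
        sign = math.copysign(1.0, x) * math.copysign(1.0, y)
        value = math.copysign(math.inf, sign)
        # An infinite dividend gives an exact infinity; a finite one is a pole.
        return (value, None) if math.isinf(x) else (value, "infinitary")
    if math.isinf(x) and math.isinf(y):
        return (math.nan, "invalid")      # infinity divided by infinity
    return (x / y, None)                  # remaining cases follow hardware division
```

For instance, `div(1.0, -0.0)` gives `(-inf, 'infinitary')`, while `div(-math.inf, -0.0)` gives `(inf, None)` because the result is exact.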

div↑F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}

div↑F→F′(x, y) = resultF′(x/y, upF′)  if x, y ∈ F and x ≠ 0 and y ≠ 0
               = divF→F′(x, y)        otherwise

div↓F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}

div↓F→F′(x, y) = resultF′(x/y, downF′)  if x, y ∈ F and x ≠ 0 and y ≠ 0
               = divF→F′(x, y)          otherwise

absF : F → F

absF(x) = |x|              if x ∈ F
        = 0                if x = −0
        = +∞               if x ∈ {−∞, +∞}
        = no_resultF→F(x)  otherwise

signumF : F → F

signumF(x) = 1                if (x ∈ F and x ≥ 0) or x = +∞
           = −1               if (x ∈ F and x < 0) or x ∈ {−0, −∞}
           = no_resultF→F(x)  otherwise

residueF : F × F → F ∪ {−0, invalid}

residueF(x, y) = resultF(x − (round(x/y) · y), nearestF)
                                        if x, y ∈ F and y ≠ 0 and
                                           (x ≥ 0 or x − (round(x/y) · y) ≠ 0)
               = −0                     if x, y ∈ F and y ≠ 0 and
                                           x < 0 and x − (round(x/y) · y) = 0
               = −0                     if x = −0 and y ∈ F ∪ {−∞, +∞} and y ≠ 0
               = x                      if x ∈ F and y ∈ {−∞, +∞}
               = no_result2F→F(x, y)    otherwise

NOTE 2 – The residueF operation is informally known as “IEEE remainder”.
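The defining expression x − round(x/y) · y can be evaluated exactly with rational arithmetic and compared with Python's `math.remainder`, which implements the IEEE remainder. This is a sketch for finite x and finite nonzero y only; the name `residue` is illustrative, and it relies on `round` applied to a `Fraction` rounding ties to even.

```python
import math
from fractions import Fraction

def residue(x: float, y: float) -> float:
    """x - round(x/y) * y for finite x and finite nonzero y, ties rounded to even."""
    n = round(Fraction(x) / Fraction(y))   # nearest integer to the exact quotient
    r = Fraction(x) - n * Fraction(y)      # exact residue
    if r == 0:
        return math.copysign(0.0, x)       # a zero residue takes the sign of x
    return float(r)                        # conversion is exact: r is representable
```

With x = 5 and y = 3 the quotient 1.67 rounds to 2, giving a residue of −1 rather than the 2 that a truncating remainder would give.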

sqrtF→F′ : F → F′ ∪ {inexact, underflow, invalid}

sqrtF→F′(x) = resultF′(√x, nearestF′)  if x ∈ F and x ≥ 0
            = x                       if x ∈ {−0, +∞}
            = no_resultF→F′(x)        otherwise

sqrt↑F→F′ : F → F′ ∪ {inexact, underflow, invalid}

sqrt↑F→F′(x) = resultF′(√x, upF′)  if x ∈ F and x ≥ 0
             = sqrtF→F′(x)         otherwise

sqrt↓F→F′ : F → F′ ∪ {inexact, underflow, invalid}

sqrt↓F→F′(x) = resultF′(√x, downF′)  if x ∈ F and x ≥ 0
             = sqrtF→F′(x)           otherwise
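The directed square roots can be emulated from the round-to-nearest result by checking, in exact arithmetic, on which side of √x it lies. The sketch below assumes binary64, assumes that `math.sqrt` is correctly rounded (true for IEC 60559 conforming platforms), and needs Python 3.9 or later for `math.nextafter`. The names `sqrt_up` and `sqrt_down` are illustrative. A negative operand makes `math.sqrt` raise `ValueError`, which plays the role of the `invalid` notification here.

```python
import math
from fractions import Fraction

def sqrt_up(x: float) -> float:
    """Smallest float >= sqrt(x) for finite x > 0; other operands as math.sqrt."""
    s = math.sqrt(x)  # assumed to be the nearest float to the exact root
    if x > 0.0 and math.isfinite(x) and Fraction(s) ** 2 < Fraction(x):
        return math.nextafter(s, math.inf)   # nearest was below the root
    return s

def sqrt_down(x: float) -> float:
    """Largest float <= sqrt(x) for finite x > 0; other operands as math.sqrt."""
    s = math.sqrt(x)
    if x > 0.0 and math.isfinite(x) and Fraction(s) ** 2 > Fraction(x):
        return math.nextafter(s, 0.0)        # nearest was above the root
    return s
```

For an operand with an exact root, such as 4, both variants return the same value; for 2 they return adjacent floats that bracket √2.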

5.2.6.3 Value dissection

For each provided floating point type, the following operations shall be provided. For the non-conforming case of denormF = false, ulpF may underflow, and the operations succF and predF shall not be provided.

The exponentF→I operation is specified only when minintI < eminF − pF and maxintI > emaxF (or a stronger requirement) holds. Further, this requirement (or a stronger requirement) should hold also for scaleF,I.

exponentF→I : F → I ∪ {infinitary}

exponentF→I(x) = ⌊log_rF(|x|)⌋ + 1  if x ∈ F and x ≠ 0
               = infinitary(−∞)     if x ∈ {−0, 0}
               = +∞                 if x ∈ {−∞, +∞}
               = qNaN               if x is a quiet NaN
               = invalid(qNaN)      if x is a signalling NaN

NOTES

1 Since most integer datatypes cannot represent any infinitary (or NaN) values, documented “well out of range” finite integer values of the correct sign may here be used instead of the infinities.

2 The related IEC 60559 operation logb returns a floating point value, to guarantee the representability of the infinitary (and NaN) return values.

fractionF : F → F

fractionF(x) = x / rF^exponentF→Z(x)  if x ∈ F and x ≠ 0
             = x                      if x ∈ {−∞, −0, 0, +∞}
             = no_resultF→F(x)        otherwise
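For radix 2, the exponent and fraction operations correspond to what `math.frexp` returns: a fraction m with 0.5 ≤ |m| < 1 and an integer e with x = m · 2^e. The following sketch assumes binary64; the wrapper names `exponent` and `fraction` are illustrative, and the zero and infinity cases of the exponent operation are reported by an exception instead of an infinitary result.

```python
import math

def exponent(x: float) -> int:
    """floor(log2(|x|)) + 1 for finite nonzero x (radix 2)."""
    if x == 0.0 or not math.isfinite(x):
        raise ValueError("exponent is only sketched for finite nonzero operands")
    return math.frexp(x)[1]

def fraction(x: float) -> float:
    """x / 2**exponent(x) for finite nonzero x; zeros and infinities map to themselves."""
    if x == 0.0 or not math.isfinite(x):
        return x  # covers -inf, -0, 0, +inf (a NaN is also passed through here)
    return math.frexp(x)[0]
```

For x = 8 this gives exponent 4 and fraction 0.5, since ⌊log2 8⌋ + 1 = 4 and 8 / 2^4 = 0.5. The original value is recovered with `math.ldexp(fraction(x), exponent(x))`.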

scaleF,I : F × I → F ∪ {underflow, overflow}

scaleF,I(x, n) = resultF(x · rF^n, nearestF)         if x ∈ F and n ∈ I
               = x                                  if x ∈ {−∞, −0, +∞}
               = no_result2F→F(x, convertI→F(n))    otherwise

succF : F → F ∪ {overflow}

succF(x) = resultF(min {z ∈ F | z > x}, nearestF)  if x ∈ F
         = succF(0)                                if x = −0
         = no_resultF→F(x)                         otherwise

predF : F → F ∪ {overflow}

predF(x) = resultF(max {z ∈ F | z < x}, nearestF)  if x ∈ F
         = predF(0)                                if x = −0
         = no_resultF→F(x)                         otherwise

ulpF : F → F

ulpF(x) = resultF(uF(x), nearestF)  if x ∈ F
        = ulpF(0)                   if x = −0
        = no_resultF→F(x)           otherwise
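In Python 3.9 and later, `math.nextafter` and `math.ulp` provide close counterparts to the successor, predecessor and ulp operations for binary64. The wrappers below are a sketch with illustrative names; they reject non-finite operands with an exception, and at the largest finite value `succ` returns +∞ where the definition above reports overflow.

```python
import math
import sys

def succ(x: float) -> float:
    """Least float greater than x (finite x only; -0 behaves like 0)."""
    if not math.isfinite(x):
        raise ValueError("no result for infinities and NaNs")
    return math.nextafter(x, math.inf)

def pred(x: float) -> float:
    """Greatest float less than x (finite x only; -0 behaves like 0)."""
    if not math.isfinite(x):
        raise ValueError("no result for infinities and NaNs")
    return math.nextafter(x, -math.inf)

# The ulp of 1.0 is the gap to its successor; the gap to its predecessor is half of that.
gap_above = succ(1.0) - 1.0
gap_below = 1.0 - pred(1.0)
```

The asymmetry between `gap_above` and `gap_below` appears at every power of the radix, because the exponent changes there.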

5.2.6.4 Value splitting

For each provided floating point type, the following operations shall be provided. The truncF,I and roundF,I operations are specified only if maxintI > pF (or a stronger requirement) holds.

intpartF : F → F ∪ {−0}

intpartF(x) = ⌊x⌋                  if x ∈ F and x ≥ 0
            = negF(intpartF(−x))   if x ∈ F and x < 0
            = x                    if x ∈ {−∞, −0, +∞}
            = no_resultF→F(x)      otherwise

fractpartF : F → F ∪ {−0}

fractpartF(x) = x − ⌊x⌋                if x ∈ F and x ≥ 0
              = negF(fractpartF(−x))   if x ∈ F and x < 0
              = x                      if x = −0
              = no_resultF→F(x)        otherwise

truncF,I : F × I → F ∪ {−0}

truncF,I(x, n) = ⌊x / rF^(eF(x)−n)⌋ · rF^(eF(x)−n)  if x ∈ F and x ≥ 0 and n ∈ I
               = negF(truncF,I(−x, n))              if x ∈ F and x < 0 and n ∈ I
               = x                                  if x ∈ {−∞, −0, +∞}
               = no_result2F→F(x, n)                otherwise

roundF,I : F × I → F ∪ {−0, overflow}

roundF,I(x, n) = resultF(round(x / rF^(eF(x)−n)) · rF^(eF(x)−n), nearestF)
                                          if x ∈ F and x ≥ 0 and n ∈ I
               = negF(roundF,I(−x, n))    if x ∈ F and x < 0 and n ∈ I
               = x                        if x ∈ {−∞, −0, +∞}
               = no_result2F→F(x, n)      otherwise
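The value splitting operations can be sketched in Python for radix 2. `math.modf` returns the fractional and integral parts, both carrying the sign of the operand, and truncation or rounding to n leading digits follows from `math.frexp` and `math.ldexp`. The function names are illustrative, binary64 is assumed, n is assumed to be a small positive integer, and the built-in `round` is used because it rounds ties to even.

```python
import math

def intpart(x: float) -> float:
    """Integral part of x, rounded toward zero, keeping the sign (also for zero results)."""
    return math.modf(x)[1]

def fractpart(x: float) -> float:
    """Fractional part of x, keeping the sign (also for zero results)."""
    return math.modf(x)[0]

def trunc_digits(x: float, n: int) -> float:
    """Keep the n leading binary digits of finite x, truncating toward zero."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(math.trunc(math.ldexp(m, n)), e - n)

def round_digits(x: float, n: int) -> float:
    """Round finite x to n leading binary digits, ties to even."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)
    return math.ldexp(round(math.ldexp(m, n)), e - n)
```

As an example, 7 is 111 in binary: truncating to two digits gives 110 (that is, 6), while rounding to two digits gives 1000 (that is, 8).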
