5.2 Floating point datatypes and operations
5.2.6 Floating point operations
For each provided conforming floating point datatype, the following operations shall be provided.
eqF : F × F → Boolean
eqF(x, y) = true if x, y ∈ F ∪ {−∞, +∞} and x = y
= false if x, y ∈ F ∪ {−∞, +∞} and x ≠ y
= eqF(0, y) if x = −0 and y ∈ F ∪ {−∞, −0, +∞}
= eqF(x, 0) if x ∈ F ∪ {−∞, +∞} and y = −0
= false if x is a quiet NaN and y is not a signalling NaN
= false if y is a quiet NaN and x is not a signalling NaN
= invalid(false) if x is a signalling NaN or y is a signalling NaN
neqF : F × F → Boolean
neqF(x, y) = true if x, y ∈ F ∪ {−∞, +∞} and x ≠ y
= false if x, y ∈ F ∪ {−∞, +∞} and x = y
= neqF(0, y) if x = −0 and y ∈ F ∪ {−∞, −0, +∞}
= neqF(x, 0) if y = −0 and x ∈ F ∪ {−∞, +∞}
= true if x is a quiet NaN and y is not a signalling NaN
= true if y is a quiet NaN and x is not a signalling NaN
= invalid(true) if x is a signalling NaN or y is a signalling NaN
lssF : F × F → Boolean
lssF(x, y) = true if x, y ∈ F and x < y
= false if x, y ∈ F and x ≥ y
= lssF(0, y) if x = −0 and y ∈ F ∪ {−∞, −0, +∞}
= lssF(x, 0) if y = −0 and x ∈ F ∪ {−∞, +∞}
= true if x = −∞ and y ∈ F ∪ {+∞}
= false if x = +∞ and y ∈ F ∪ {−∞, +∞}
= false if x ∈ F ∪ {−∞} and y = −∞
= true if x ∈ F and y = +∞
= invalid(false) if x is a NaN or y is a NaN
leqF : F × F → Boolean
leqF(x, y) = true if x, y ∈ F and x ≤ y
= false if x, y ∈ F and x > y
= leqF(0, y) if x = −0 and y ∈ F ∪ {−∞, −0, +∞}
= leqF(x, 0) if y = −0 and x ∈ F ∪ {−∞, +∞}
= true if x = −∞ and y ∈ F ∪ {−∞, +∞}
= false if x = +∞ and y ∈ F ∪ {−∞}
= false if x ∈ F and y = −∞
= true if x ∈ F ∪ {+∞} and y = +∞
= invalid(false) if x is a NaN or y is a NaN
gtrF : F × F → Boolean
gtrF(x, y) = lssF(y, x)
geqF : F × F → Boolean
geqF(x, y) = leqF(y, x)
isnegzeroF : F → Boolean
isnegzeroF(x) = true if x = −0
= false if x ∈ F ∪ {−∞, +∞}
= invalid(false) if x is a NaN
istinyF : F → Boolean
istinyF(x) = true if (x ∈ F and |x| < fminNF) or x = −0
= false if (x ∈ F and |x| ≥ fminNF) or x ∈ {−∞, +∞}
= invalid(false) if x is a NaN
isnanF : F → Boolean
isnanF(x) = false if x ∈ F ∪ {−∞, −0, +∞}
= true if x is a quiet NaN
= invalid(true) if x is a signalling NaN
issignanF : F → Boolean
issignanF(x) = false if x ∈ F ∪ {−∞, −0, +∞}
= false if x is a quiet NaN
= true if x is a signalling NaN
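
As an illustration only, the comparison rules above can be modelled as follows. The sketch assumes Python, whose float is an IEC 60559 binary64 value on common platforms. Python exposes neither signalling NaNs nor the invalid notification, so the hypothetical helpers below return a (value, notification) pair to model invalid(false); the names eq_f, lss_f and is_neg_zero are not part of this document.

```python
import math

def eq_f(x, y):
    """Model of eqF for non-signalling operands.

    -0 compares equal to 0, and a quiet NaN is never equal to anything.
    """
    return x == y

def lss_f(x, y):
    """Model of lssF; returns (result, notification)."""
    if math.isnan(x) or math.isnan(y):
        # invalid(false): notification with continuation value false
        return (False, "invalid")
    return (x < y, None)

def is_neg_zero(x):
    """Model of isnegzeroF for non-NaN x: true only for the value -0."""
    return x == 0.0 and math.copysign(1.0, x) < 0.0
```

Note that lss_f(-0.0, 0.0) yields false without a notification, matching the rule that −0 is compared as if it were 0.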
5.2.6.2 Basic arithmetic
For each provided conforming floating point datatype, the following round to nearest operations shall be provided, and the round towards negative and positive infinity operations should be provided. For the non-conforming case that denormF = false, the operations below that use upF or downF as rounding shall not be provided.
NOTE 1 – If denormF = false, then any result that is smaller in magnitude than fminNF is replaced by zero. This means that no rounding direction (nearest, up, down) is heeded: would-be subnormal results are “flushed to zero”. Thus, if denormF = false, the directed rounding operations would be unreliable for interval arithmetic as well as for other uses. That is why the directed rounding operations are not to be provided when denormF = false.
The operations in this clause are specified only for the case that rF = rF′, denormF = denormF′, and iec_559F = iec_559F′. If iec_559F = false, then the operations are required only if F = F′. The addF→F′ and subF→F′ operations can underflow only if denormF′ = false (non-conforming case) or eminF − pF < eminF′ − pF′.
negF : F → F ∪ {−0}
negF(x) = −x if x ∈ F and x ≠ 0
= −0 if x = 0
= 0 if x = −0
= −∞ if x = +∞
= +∞ if x = −∞
= no_resultF→F(x) otherwise
addF→F′ : F × F → F′ ∪ {inexact, underflow, overflow}
addF→F′(x, y) = resultF′(x + y, nearestF′) if x, y ∈ F
= −0 if x = −0 and y = −0
= addF→F′(0, y) if x = −0 and y ∈ F ∪ {−∞, +∞}
= addF→F′(x, 0) if x ∈ F ∪ {−∞, +∞} and y = −0
= +∞ if x = +∞ and y ∈ F ∪ {+∞}
= +∞ if x ∈ F and y = +∞
= −∞ if x = −∞ and y ∈ F ∪ {−∞}
= −∞ if x ∈ F and y = −∞
= no_result2F→F′(x, y) otherwise
add↑F→F′ : F × F → F′ ∪ {inexact, underflow, overflow}
add↑F→F′(x, y) = resultF′(x + y, upF′) if x, y ∈ F
= addF→F′(x, y) otherwise
add↓F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}
add↓F→F′(x, y) = resultF′(x + y, downF′) if x, y ∈ F and (x + y ≠ 0 or x = 0)
= −0 if x, y ∈ F and x + y = 0 and x ≠ 0
= −0 if addF(x, y) = 0 and (x = −0 or y = −0)
= addF→F′(x, y) otherwise
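
The sign of an exact zero sum is the main place where the rounding towards negative infinity variant differs from round to nearest. Python's binary float arithmetic offers no directed rounding modes, so the sketch below uses the standard decimal module as a stand-in model: it is a radix-10 datatype, not a datatype F of this document, but it follows the same zero-sign convention. The helper name zero_sum_signs and the precision of 16 digits are assumptions made for illustration.

```python
from decimal import Context, Decimal, ROUND_FLOOR, ROUND_HALF_EVEN

# Two arithmetic contexts that differ only in rounding direction.
nearest = Context(prec=16, rounding=ROUND_HALF_EVEN)
down = Context(prec=16, rounding=ROUND_FLOOR)

def zero_sum_signs(x, y):
    """Return the sum x + y as text, first rounded to nearest, then rounded down."""
    a, b = Decimal(x), Decimal(y)
    return str(nearest.add(a, b)), str(down.add(a, b))

# Cancellation of non-zero operands: 0 under nearest, -0 under down.
print(zero_sum_signs("1", "-1"))
# Both operands zero with equal sign: the sign is kept in both modes.
print(zero_sum_signs("0", "0"))
# Zero operands of opposite sign: 0 under nearest, -0 under down.
print(zero_sum_signs("0", "-0"))
```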
subF→F′ : F × F → F′ ∪ {inexact, underflow, overflow}
subF→F′(x, y) = addF→F′(x, negF(y))
sub↑F→F′ : F × F → F′ ∪ {inexact, underflow, overflow}
sub↑F→F′(x, y) = add↑F→F′(x, negF(y))
sub↓F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}
sub↓F→F′(x, y) = add↓F→F′(x, negF(y))
mulF→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}
mulF→F′(x, y) = resultF′(x · y, nearestF′) if x, y ∈ F and x ≠ 0 and y ≠ 0
= 0 if x = 0 and y ∈ F and y ≥ 0
= −0 if x = 0 and ((y ∈ F and y < 0) or y = −0)
= −0 if x = −0 and y ∈ F and y ≥ 0
= 0 if x = −0 and ((y ∈ F and y < 0) or y = −0)
= 0 if x ∈ F and x > 0 and y = 0
= −0 if x ∈ F and x < 0 and y = 0
= −0 if x ∈ F and x > 0 and y = −0
= 0 if x ∈ F and x < 0 and y = −0
= +∞ if x = +∞ and ((y ∈ F and y > 0) or y = +∞)
= −∞ if x = +∞ and ((y ∈ F and y < 0) or y = −∞)
= −∞ if x = −∞ and ((y ∈ F and y > 0) or y = +∞)
= +∞ if x = −∞ and ((y ∈ F and y < 0) or y = −∞)
= +∞ if x ∈ F and x > 0 and y = +∞
= −∞ if x ∈ F and x < 0 and y = +∞
= −∞ if x ∈ F and x > 0 and y = −∞
= +∞ if x ∈ F and x < 0 and y = −∞
= no_result2F→F′(x, y) otherwise
mul↑F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}
mul↑F→F′(x, y) = resultF′(x · y, upF′) if x, y ∈ F and x ≠ 0 and y ≠ 0
= mulF→F′(x, y) otherwise
mul↓F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow}
mul↓F→F′(x, y) = resultF′(x · y, downF′) if x, y ∈ F and x ≠ 0 and y ≠ 0
= mulF→F′(x, y) otherwise
divF→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}
divF→F′(x, y) = resultF′(x/y, nearestF′) if x, y ∈ F and x ≠ 0 and y ≠ 0
= 0 if x = 0 and y ∈ F and y > 0
= −0 if x = 0 and y ∈ F and y < 0
= −0 if x = −0 and y ∈ F and y > 0
= 0 if x = −0 and y ∈ F and y < 0
= infinitary(+∞) if x ∈ F and x > 0 and y = 0
= infinitary(−∞) if x ∈ F and x < 0 and y = 0
= infinitary(−∞) if x ∈ F and x > 0 and y = −0
= infinitary(+∞) if x ∈ F and x < 0 and y = −0
= 0 if x ∈ F and x ≥ 0 and y = +∞
= −0 if x ∈ F and x ≥ 0 and y = −∞
= −0 if ((x ∈ F and x < 0) or x = −0) and y = +∞
= 0 if ((x ∈ F and x < 0) or x = −0) and y = −∞
= +∞ if x = +∞ and y ∈ F and y ≥ 0
= −∞ if x = −∞ and y ∈ F and y ≥ 0
= −∞ if x = +∞ and ((y ∈ F and y < 0) or y = −0)
= +∞ if x = −∞ and ((y ∈ F and y < 0) or y = −0)
= no_result2F→F′(x, y) otherwise
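
The case analysis for division can be sketched as below, assuming Python with binary64 floats standing in for both F and F′. Python itself raises ZeroDivisionError for float division by zero, so the hypothetical helper div_f handles the special operands explicitly and returns a (value, notifications) pair. Only the infinitary, invalid and overflow notifications are modelled; inexact, underflow and signalling NaNs are left out of this sketch.

```python
import math

def div_f(x, y):
    """Model of the special cases of division; returns (value, notifications)."""
    if math.isnan(x) or math.isnan(y):
        return math.nan, set()                 # quiet NaN operands propagate
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    if math.isinf(x):
        if math.isinf(y):
            return math.nan, {"invalid"}       # infinity / infinity
        return math.copysign(math.inf, sign), set()
    if math.isinf(y):
        return math.copysign(0.0, sign), set() # finite / infinity is a signed zero
    if y == 0.0:
        if x == 0.0:
            return math.nan, {"invalid"}       # zero / zero
        # non-zero finite / zero: infinitary with a signed infinity as continuation value
        return math.copysign(math.inf, sign), {"infinitary"}
    q = x / y                                  # round to nearest on binary64
    if math.isinf(q):
        return q, {"overflow"}
    return q, set()
```

For instance, div_f(1.0, -0.0) gives −∞ together with the infinitary notification, while div_f(math.inf, -0.0) gives −∞ with no notification.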
div↑F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}
div↑F→F′(x, y) = resultF′(x/y, upF′) if x, y ∈ F and x ≠ 0 and y ≠ 0
= divF→F′(x, y) otherwise
div↓F→F′ : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}
div↓F→F′(x, y) = resultF′(x/y, downF′) if x, y ∈ F and x ≠ 0 and y ≠ 0
= divF→F′(x, y) otherwise
absF : F → F
absF(x) = |x| if x ∈ F
= 0 if x = −0
= +∞ if x ∈ {−∞, +∞}
= no_resultF→F(x) otherwise
signumF : F → F
signumF(x) = 1 if (x ∈ F and x ≥ 0) or x = +∞
= −1 if (x ∈ F and x < 0) or x ∈ {−0, −∞}
= no_resultF→F(x) otherwise
residueF : F × F → F ∪ {−0, invalid}
residueF(x, y) = resultF(x − (round(x/y) · y), nearestF) if x, y ∈ F and y ≠ 0 and (x ≥ 0 or x − (round(x/y) · y) ≠ 0)
= −0 if x, y ∈ F and y ≠ 0 and x < 0 and x − (round(x/y) · y) = 0
= −0 if x = −0 and y ∈ F ∪ {−∞, +∞} and y ≠ 0
= x if x ∈ F and y ∈ {−∞, +∞}
= no_result2F→F(x, y) otherwise
NOTE 2 – The residueF operation is informally known as “IEEE remainder”.
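
As a sketch of the “IEEE remainder” behaviour, the hypothetical helper residue_f below wraps Python's math.remainder (available from Python 3.7), assuming binary64 floats. math.remainder raises ValueError for an infinite first operand or a zero second operand, so those cases are mapped by hand to an invalid notification; signalling NaNs are not modelled.

```python
import math

def residue_f(x, y):
    """Model of residueF; returns (value, notification)."""
    if math.isnan(x) or math.isnan(y):
        return math.nan, None              # quiet NaN operands propagate
    if math.isinf(x) or y == 0.0:
        return math.nan, "invalid"         # no result is defined for these operands
    # x - round(x/y) * y, with round() rounding halfway cases to even
    return math.remainder(x, y), None

print(residue_f(5.0, 3.0))   # round(5/3) = 2, so the residue is 5 - 6
print(residue_f(1.0, 2.0))   # 1/2 is a tie and rounds to the even integer 0
print(residue_f(-6.0, 3.0))  # an exact zero residue takes the sign of x
```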
sqrtF→F′ : F → F′ ∪ {inexact, underflow, invalid}
sqrtF→F′(x) = resultF′(√x, nearestF′) if x ∈ F and x ≥ 0
= x if x ∈ {−0, +∞}
= no_resultF→F′(x) otherwise
sqrt↑F→F′ : F → F′ ∪ {inexact, underflow, invalid}
sqrt↑F→F′(x) = resultF′(√x, upF′) if x ∈ F and x ≥ 0
= sqrtF→F′(x) otherwise
sqrt↓F→F′ : F → F′ ∪ {inexact, underflow, invalid}
sqrt↓F→F′(x) = resultF′(√x, downF′) if x ∈ F and x ≥ 0
= sqrtF→F′(x) otherwise
5.2.6.3 Value dissection
For each provided floating point type, the following operations shall be provided. For the non-conforming case of denormF = false, ulpF may underflow, and the operations succF and predF shall not be provided.
The exponentF→I operation is specified only when minintI < eminF − pF and maxintI > emaxF (or a stronger requirement) holds. Further, this requirement (or a stronger requirement) should also hold for scaleF,I.
exponentF→I : F → I ∪ {infinitary}
exponentF→I(x) = ⌊log_{rF}(|x|)⌋ + 1 if x ∈ F and x ≠ 0
= infinitary(−∞) if x ∈ {−0, 0}
= +∞ if x ∈ {−∞, +∞}
= qNaN if x is a quiet NaN
= invalid(qNaN) if x is a signalling NaN
NOTES
1 Since most integer datatypes cannot represent any infinitary (or NaN) values, documented “well out of range” finite integer values of the correct sign may be used here instead of the infinities.
2 The related IEC 60559 operation logb returns a floating point value, to guarantee the representability of the infinitary (and NaN) return values.
fractionF : F → F
fractionF(x) = x / rF^(exponentF→Z(x)) if x ∈ F and x ≠ 0
= x if x ∈ {−∞, −0, 0, +∞}
= no_resultF→F(x) otherwise
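
For radix 2, the exponent and fraction of a finite non-zero value as defined above coincide with what the C-style frexp function returns, which normalises the fraction to the range [0.5, 1). The following sketch assumes Python binary64 floats (rF = 2); the helper names exponent_f and fraction_f are hypothetical, and zero, infinite and NaN arguments of the exponent operation are deliberately not modelled.

```python
import math

def exponent_f(x):
    """floor(log2(|x|)) + 1 for finite non-zero x, taken from frexp."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite non-zero values are modelled in this sketch")
    return math.frexp(x)[1]

def fraction_f(x):
    """x / 2**exponent_f(x); zeros and infinities are returned unchanged."""
    if x == 0.0 or math.isinf(x):
        return x
    return math.frexp(x)[0]

# 8 = 0.5 * 2**4, so the exponent is 4 and the fraction is 0.5.
print(exponent_f(8.0), fraction_f(8.0))
```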
scaleF,I : F × I → F ∪ {underflow, overflow}
scaleF,I(x, n) = resultF(x · rF^n, nearestF) if x ∈ F and n ∈ I
= x if x ∈ {−∞, −0, +∞}
= no_result2F→F(x, convertI→F(n)) otherwise
succF : F → F ∪ {overflow}
succF(x) = resultF(min {z ∈ F† | z > x}, nearestF) if x ∈ F
= succF(0) if x = −0
= no_resultF→F(x) otherwise
predF : F → F ∪ {overflow}
predF(x) = resultF(max {z ∈ F† | z < x}, nearestF) if x ∈ F
= predF(0) if x = −0
= no_resultF→F(x) otherwise
ulpF : F → F
ulpF(x) = resultF(uF(x), nearestF) if x ∈ F
= ulpF(0) if x = −0
= no_resultF→F(x) otherwise
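
A minimal sketch of the successor and predecessor operations for finite arguments, assuming Python 3.9 or later (for math.nextafter and math.ulp) and binary64 floats with subnormals. The helper names succ_f and pred_f are hypothetical. Stepping past the largest finite value is reported here as an overflow notification with an infinite continuation value; infinite and NaN arguments are not modelled.

```python
import math
import sys

def succ_f(x):
    """Least representable value greater than finite x; returns (value, notification)."""
    nxt = math.nextafter(x, math.inf)
    return nxt, ("overflow" if math.isinf(nxt) else None)

def pred_f(x):
    """Greatest representable value less than finite x; returns (value, notification)."""
    prv = math.nextafter(x, -math.inf)
    return prv, ("overflow" if math.isinf(prv) else None)

# -0 is treated like 0: the successor is the smallest positive subnormal value.
print(succ_f(-0.0))
# The unit in the last place of 1.0 equals the machine epsilon of binary64.
print(math.ulp(1.0) == sys.float_info.epsilon)
# Scaling by a power of the radix: 0.75 * 2**4.
print(math.ldexp(0.75, 4))
```

Here math.ldexp plays the role of the scale operation for radix 2; note that it raises OverflowError instead of returning a continuation value when the scaled result is too large.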
5.2.6.4 Value splitting
For each provided floating point type, the following operations shall be provided. The truncF,I and roundF,I operations are specified only if maxintI > pF (or a stronger requirement) holds.
intpartF : F → F ∪ {−0}
intpartF(x) = ⌊x⌋ if x ∈ F and x ≥ 0
= negF(intpartF(−x)) if x ∈ F and x < 0
= x if x ∈ {−∞, −0, +∞}
= no_resultF→F(x) otherwise
fractpartF : F → F ∪ {−0}
fractpartF(x) = x − ⌊x⌋ if x ∈ F and x ≥ 0
= negF(fractpartF(−x)) if x ∈ F and x < 0
= x if x ∈ {−0}
= no_resultF→F(x) otherwise
truncF,I : F × I → F ∪ {−0}
truncF,I(x, n) = ⌊x / rF^(eF(x)−n)⌋ · rF^(eF(x)−n) if x ∈ F and x ≥ 0 and n ∈ I
= negF(truncF,I(−x, n)) if x ∈ F and x < 0 and n ∈ I
= x if x ∈ {−∞, −0, +∞}
= no_result2F→F(x, n) otherwise
roundF,I : F × I → F ∪ {−0, overflow}
roundF,I(x, n) = resultF(round(x / rF^(eF(x)−n)) · rF^(eF(x)−n), nearestF) if x ∈ F and x ≥ 0 and n ∈ I
= negF(roundF,I(−x, n)) if x ∈ F and x < 0 and n ∈ I
= x if x ∈ {−∞, −0, +∞}
= no_result2F→F(x, n) otherwise
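
The effect of keeping only the n leading digits can be sketched as follows, assuming Python binary64 floats (rF = 2), n ≥ 1, and that eF(x) is the exponent delivered by frexp. The helper names trunc_f and round_f are hypothetical. NaN arguments are simply passed through, and the overflow notification that rounding may raise near the largest finite value is not modelled (math.ldexp raises OverflowError there).

```python
import math

def trunc_f(x, n):
    """Keep the n leading binary digits of x, rounding towards zero."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    if x < 0.0:
        return -trunc_f(-x, n)             # negF(truncF,I(-x, n))
    e = math.frexp(x)[1]                   # exponent of x
    # Scale so that n digits lie left of the radix point, floor, scale back.
    return math.ldexp(math.floor(math.ldexp(x, n - e)), e - n)

def round_f(x, n):
    """Keep the n leading binary digits of x, rounding to nearest (ties to even)."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    if x < 0.0:
        return -round_f(-x, n)
    e = math.frexp(x)[1]
    return math.ldexp(round(math.ldexp(x, n - e)), e - n)

# 13 is 1101 in binary; its two leading digits give 1100, that is 12.
print(trunc_f(13.0, 2), round_f(13.0, 2))
# 15 is 1111 in binary; rounding to two digits carries up to 10000, that is 16.
print(trunc_f(15.0, 2), round_f(15.0, 2))
```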