
5.2 Floating point datatypes and operations

5.2.6 Floating point operations

For each provided conforming floating point datatype, the following operations shall be provided.

eq_F : F × F → Boolean

eq_F(x, y) = true if x, y ∈ F ∪ {−∞, +∞} and x = y

= false if x, y ∈ F ∪ {−∞, +∞} and x ≠ y

= eq_F(0, y) if x = −0 and y ∈ F ∪ {−∞, −0, +∞}

= eq_F(x, 0) if x ∈ F ∪ {−∞, +∞} and y = −0

= false if x is a quiet NaN and y is not a signalling NaN

= false if y is a quiet NaN and x is not a signalling NaN

= invalid(false) if x is a signalling NaN or y is a signalling NaN

neq_F : F × F → Boolean

neq_F(x, y) = true if x, y ∈ F ∪ {−∞, +∞} and x ≠ y

= false if x, y ∈ F ∪ {−∞, +∞} and x = y

= neq_F(0, y) if x = −0 and y ∈ F ∪ {−∞, −0, +∞}

= neq_F(x, 0) if y = −0 and x ∈ F ∪ {−∞, +∞}

= true if x is a quiet NaN and y is not a signalling NaN

= true if y is a quiet NaN and x is not a signalling NaN

= invalid(true) if x is a signalling NaN or y is a signalling NaN

lss_F : F × F → Boolean

lss_F(x, y) = true if x, y ∈ F and x < y

= false if x, y ∈ F and x ≥ y

= lss_F(0, y) if x = −0 and y ∈ F ∪ {−∞, −0, +∞}

= lss_F(x, 0) if y = −0 and x ∈ F ∪ {−∞, +∞}

= true if x = −∞ and y ∈ F ∪ {+∞}

= false if x = +∞ and y ∈ F ∪ {−∞, +∞}

= false if x ∈ F ∪ {−∞} and y = −∞

= true if x ∈ F and y = +∞

= invalid(false) if x is a NaN or y is a NaN

leq_F : F × F → Boolean

leq_F(x, y) = true if x, y ∈ F and x ≤ y

= false if x, y ∈ F and x > y

= leq_F(0, y) if x = −0 and y ∈ F ∪ {−∞, −0, +∞}

= leq_F(x, 0) if y = −0 and x ∈ F ∪ {−∞, +∞}

= true if x = −∞ and y ∈ F ∪ {−∞, +∞}

= false if x = +∞ and y ∈ F ∪ {−∞}

= false if x ∈ F and y = −∞

= true if x ∈ F ∪ {+∞} and y = +∞

= invalid(false) if x is a NaN or y is a NaN

gtr_F : F × F → Boolean

gtr_F(x, y) = lss_F(y, x)

geq_F : F × F → Boolean

geq_F(x, y) = leq_F(y, x)

isnegzero_F : F → Boolean

isnegzero_F(x) = true if x = −0

= false if x ∈ F ∪ {−∞, +∞}

= invalid(false) if x is a NaN

istiny_F : F → Boolean

istiny_F(x) = true if (x ∈ F and |x| < fminN_F) or x = −0

= false if (x ∈ F and |x| ≥ fminN_F) or x ∈ {−∞, +∞}

= invalid(false) if x is a NaN

isnan_F : F → Boolean

isnan_F(x) = false if x ∈ F ∪ {−∞, −0, +∞}

= true if x is a quiet NaN

= invalid(true) if x is a signalling NaN

issignan_F : F → Boolean

issignan_F(x) = false if x ∈ F ∪ {−∞, −0, +∞}

= false if x is a quiet NaN

= true if x is a signalling NaN
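As an informal illustration of the comparison and classification operations above, the following Python sketch mirrors eq_F, isnegzero_F and istiny_F for the binary64 datatype behind Python's float. It assumes that every NaN is quiet (Python offers no portable way to create or detect a signalling NaN) and it models the invalid notification as a ValueError. The function names are illustrative and are not part of this specification.

```python
import math
import sys

FMIN_N = sys.float_info.min  # smallest positive normal value (fminN_F for binary64)


def eq_f(x: float, y: float) -> bool:
    # IEC 60559 equality already treats -0 like 0 and yields False
    # whenever either operand is a (quiet) NaN.
    return x == y


def isnegzero_f(x: float) -> bool:
    if math.isnan(x):
        raise ValueError("invalid")  # stands in for invalid(false)
    return x == 0.0 and math.copysign(1.0, x) < 0.0


def istiny_f(x: float) -> bool:
    if math.isnan(x):
        raise ValueError("invalid")  # stands in for invalid(false)
    # Zero, negative zero and subnormal values are tiny; infinities are not.
    return abs(x) < FMIN_N
```

For example, eq_f(-0.0, 0.0) is true, while eq_f(math.nan, math.nan) is false.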


5.2.6.2 Basic arithmetic

For each provided conforming floating point datatype, the following round to nearest operations shall be provided, and the round towards negative and positive infinity operations should be provided. For the non-conforming case that denorm_F = false, the operations below using up_F or down_F as rounding shall not be provided.

NOTE 1 – If denorm_F = false, then any result that is smaller in magnitude than fminN_F is replaced by zero. This implies that no rounding direction (nearest, up, down) is heeded, since would-be subnormal results are “flushed to zero”. Thus if denorm_F = false, the directed rounding operations would be unreliable for interval arithmetic, as well as for other uses. That is why the directed rounding operations are not to be provided when denorm_F = false.
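The effect described in NOTE 1 can be sketched in Python, whose float type is binary64 with subnormals (denorm_F = true). The helper flush_to_zero is a hypothetical model of the denorm_F = false behaviour and is not an operation of this specification.

```python
import sys

FMIN_N = sys.float_info.min  # fminN_F for binary64 (2**-1022)


def flush_to_zero(z: float) -> float:
    # Hypothetical model of denorm_F = false: would-be subnormals become 0.
    return 0.0 if abs(z) < FMIN_N else z


x = 1.5 * FMIN_N
y = FMIN_N
exact = x - y                   # 0.5 * fminN_F, exactly representable as a subnormal
flushed = flush_to_zero(exact)  # what a flush-to-zero datatype would deliver

# A result "rounded upwards" must never lie below the true value,
# but the flushed result does.
print(exact > 0.0, flushed < exact)  # prints: True True
```

An interval upper bound computed this way would exclude the true difference, which is why the directed rounding operations cannot be relied upon in that case.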

The operations in this clause are specified only for the case that r_F = r_{F′}, denorm_F = denorm_{F′}, and iec_559_F = iec_559_{F′}. If iec_559_F = false then the operations are required only if F = F′. The add_{F→F′} and sub_{F→F′} operations can underflow only if denorm_{F′} = false (non-conforming case) or emin_F − p_F < emin_{F′} − p_{F′}.

neg_F : F → F ∪ {−0}

neg_F(x) = −x if x ∈ F and x ≠ 0

= −0 if x = 0

= 0 if x = −0

= −∞ if x = +∞

= +∞ if x = −∞

= no_result_{F→F}(x) otherwise

add_{F→F′} : F × F → F′ ∪ {inexact, underflow, overflow}

add_{F→F′}(x, y) = result_{F′}(x + y, nearest_{F′}) if x, y ∈ F

= −0 if x = −0 and y = −0

= add_{F→F′}(0, y) if x = −0 and y ∈ F ∪ {−∞, +∞}

= add_{F→F′}(x, 0) if x ∈ F ∪ {−∞, +∞} and y = −0

= +∞ if x = +∞ and y ∈ F ∪ {+∞}

= +∞ if x ∈ F and y = +∞

= −∞ if x = −∞ and y ∈ F ∪ {−∞}

= −∞ if x ∈ F and y = −∞

= no_result2_{F→F′}(x, y) otherwise

add^↑_{F→F′} : F × F → F′ ∪ {inexact, underflow, overflow}

add^↑_{F→F′}(x, y) = result_{F′}(x + y, up_{F′}) if x, y ∈ F

= add_{F→F′}(x, y) otherwise

add^↓_{F→F′} : F × F → F′ ∪ {−0, inexact, underflow, overflow}

add^↓_{F→F′}(x, y) = result_{F′}(x + y, down_{F′}) if x, y ∈ F and (x + y ≠ 0 or x = 0)

= −0 if x, y ∈ F and x + y = 0 and x ≠ 0

= −0 if add_F(x, y) = 0 and (x = −0 or y = −0)

= add_{F→F′}(x, y) otherwise

sub_{F→F′} : F × F → F′ ∪ {inexact, underflow, overflow}

sub_{F→F′}(x, y) = add_{F→F′}(x, neg_F(y))

sub^↑_{F→F′} : F × F → F′ ∪ {inexact, underflow, overflow}

sub^↑_{F→F′}(x, y) = add^↑_{F→F′}(x, neg_F(y))

sub^↓_{F→F′} : F × F → F′ ∪ {−0, inexact, underflow, overflow}

sub^↓_{F→F′}(x, y) = add^↓_{F→F′}(x, neg_F(y))
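The directed addition operations can be illustrated with the following Python sketch. It assumes F = F′ = binary64, a platform whose float addition rounds to nearest, and Python 3.9 or later (for math.nextafter). It covers only finite operands with a finite rounded sum, so overflow and the infinite and NaN cases are deliberately left out. The names add_up and add_down are illustrative.

```python
import math
from fractions import Fraction


def _checked_sum(x: float, y: float) -> float:
    s = x + y  # rounded to nearest by the hardware
    if not (math.isfinite(x) and math.isfinite(y) and math.isfinite(s)):
        raise ValueError("sketch covers finite operands and results only")
    return s


def add_up(x: float, y: float) -> float:
    s = _checked_sum(x, y)
    exact = Fraction(x) + Fraction(y)  # exact rational sum
    # If round-to-nearest landed below the exact sum, step up once.
    return math.nextafter(s, math.inf) if Fraction(s) < exact else s


def add_down(x: float, y: float) -> float:
    s = _checked_sum(x, y)
    exact = Fraction(x) + Fraction(y)
    if exact == 0:
        # Only (+0) + (+0) yields +0 when rounding down; every other
        # zero sum, including x + (-x), yields -0.
        both_plus_zero = (
            x == 0.0
            and math.copysign(1.0, x) > 0.0
            and math.copysign(1.0, y) > 0.0
        )
        return 0.0 if both_plus_zero else -0.0
    # If round-to-nearest landed above the exact sum, step down once.
    return math.nextafter(s, -math.inf) if Fraction(s) > exact else s
```

Subtraction with directed rounding then follows the definitions above, for instance add_down(x, -y) for the downward rounded difference.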

mul_{F→F′} : F × F → F′ ∪ {−0, inexact, underflow, overflow}

mul_{F→F′}(x, y) = result_{F′}(x · y, nearest_{F′}) if x, y ∈ F and x ≠ 0 and y ≠ 0

= 0 if x = 0 and y ∈ F and y ≥ 0

= −0 if x = 0 and ((y ∈ F and y < 0) or y = −0)

= −0 if x = −0 and y ∈ F and y ≥ 0

= 0 if x = −0 and ((y ∈ F and y < 0) or y = −0)

= 0 if x ∈ F and x > 0 and y = 0

= −0 if x ∈ F and x < 0 and y = 0

= −0 if x ∈ F and x > 0 and y = −0

= 0 if x ∈ F and x < 0 and y = −0

= +∞ if x = +∞ and ((y ∈ F and y > 0) or y = +∞)

= −∞ if x = +∞ and ((y ∈ F and y < 0) or y = −∞)

= −∞ if x = −∞ and ((y ∈ F and y > 0) or y = +∞)

= +∞ if x = −∞ and ((y ∈ F and y < 0) or y = −∞)

= +∞ if x ∈ F and x > 0 and y = +∞

= −∞ if x ∈ F and x < 0 and y = +∞

= −∞ if x ∈ F and x > 0 and y = −∞

= +∞ if x ∈ F and x < 0 and y = −∞

= no_result2_{F→F′}(x, y) otherwise

mul^↑_{F→F′} : F × F → F′ ∪ {−0, inexact, underflow, overflow}

mul^↑_{F→F′}(x, y) = result_{F′}(x · y, up_{F′}) if x, y ∈ F and x ≠ 0 and y ≠ 0

= mul_{F→F′}(x, y) otherwise

mul^↓_{F→F′} : F × F → F′ ∪ {−0, inexact, underflow, overflow}

mul^↓_{F→F′}(x, y) = result_{F′}(x · y, down_{F′}) if x, y ∈ F and x ≠ 0 and y ≠ 0

= mul_{F→F′}(x, y) otherwise
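The zero cases of the multiplication operations amount to combining the signs of the operands. The following Python sketch checks a few of them for binary64, assuming Python's float multiplication follows IEC 60559. The helper sign_bit is illustrative.

```python
import math


def sign_bit(z: float) -> int:
    # 1 for negative values including -0, else 0 (illustrative helper).
    return 1 if math.copysign(1.0, z) < 0.0 else 0


# (x, y, expected sign bit of the zero product)
cases = [
    (0.0, 2.0, 0),    # 0 * positive  -> 0
    (0.0, -2.0, 1),   # 0 * negative  -> -0
    (-0.0, 2.0, 1),   # -0 * positive -> -0
    (-0.0, -0.0, 0),  # -0 * -0       -> 0
    (0.0, -0.0, 1),   # 0 * -0        -> -0
]
for x, y, expected in cases:
    product = x * y
    assert product == 0.0 and sign_bit(product) == expected

# The "otherwise" branch: an infinity times a zero has no result,
# which Python's float arithmetic delivers as a quiet NaN.
assert math.isnan(math.inf * 0.0)
```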

div_{F→F′} : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}

div_{F→F′}(x, y) = result_{F′}(x/y, nearest_{F′}) if x, y ∈ F and x ≠ 0 and y ≠ 0

= 0 if x = 0 and y ∈ F and y > 0

= −0 if x = 0 and y ∈ F and y < 0

= −0 if x = −0 and y ∈ F and y > 0

= 0 if x = −0 and y ∈ F and y < 0

= infinitary(+∞) if x ∈ F and x > 0 and y = 0

= infinitary(−∞) if x ∈ F and x < 0 and y = 0

= infinitary(−∞) if x ∈ F and x > 0 and y = −0

= infinitary(+∞) if x ∈ F and x < 0 and y = −0

= 0 if x ∈ F and x ≥ 0 and y = +∞

= −0 if x ∈ F and x ≥ 0 and y = −∞

= −0 if ((x ∈ F and x < 0) or x = −0) and y = +∞

= 0 if ((x ∈ F and x < 0) or x = −0) and y = −∞

= +∞ if x = +∞ and y ∈ F and y ≥ 0

= −∞ if x = −∞ and y ∈ F and y ≥ 0

= −∞ if x = +∞ and ((y ∈ F and y < 0) or y = −0)

= +∞ if x = −∞ and ((y ∈ F and y < 0) or y = −0)

= no_result2_{F→F′}(x, y) otherwise

div^↑_{F→F′} : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}

div^↑_{F→F′}(x, y) = result_{F′}(x/y, up_{F′}) if x, y ∈ F and x ≠ 0 and y ≠ 0

= div_{F→F′}(x, y) otherwise

div^↓_{F→F′} : F × F → F′ ∪ {−0, inexact, underflow, overflow, infinitary, invalid}

div^↓_{F→F′}(x, y) = result_{F′}(x/y, down_{F′}) if x, y ∈ F and x ≠ 0 and y ≠ 0

= div_{F→F′}(x, y) otherwise

abs_F : F → F

abs_F(x) = |x| if x ∈ F

= 0 if x = −0

= +∞ if x ∈ {−∞, +∞}

= no_result_{F→F}(x) otherwise

signum_F : F → F

signum_F(x) = 1 if (x ∈ F and x ≥ 0) or x = +∞

= −1 if (x ∈ F and x < 0) or x ∈ {−0, −∞}

= no_result_{F→F}(x) otherwise

residue_F : F × F → F ∪ {−0, invalid}

residue_F(x, y) = result_F(x − (round(x/y) · y), nearest_F) if x, y ∈ F and y ≠ 0 and (x ≥ 0 or x − (round(x/y) · y) ≠ 0)

= −0 if x, y ∈ F and y ≠ 0 and x < 0 and x − (round(x/y) · y) = 0

= −0 if x = −0 and y ∈ F ∪ {−∞, +∞} and y ≠ 0

= x if x ∈ F and y ∈ {−∞, +∞}

= no_result2_{F→F}(x, y) otherwise

NOTE 2 – The residue_F operation is informally known as “IEEE remainder”.
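Python's math.remainder (available from Python 3.7) implements the IEEE remainder for binary64 and can serve as an informal illustration of residue_F. It rounds halfway quotients to the nearest even integer and gives a zero result the sign of x. The invalid notification surfaces as a ValueError in Python, which is a property of that library and not of this specification.

```python
import math

# The quotient x/y is rounded to the nearest integer, ties to even.
assert math.remainder(5.0, 3.0) == -1.0  # 5/3 rounds to 2, so 5 - 2*3 = -1
assert math.remainder(5.0, 2.0) == 1.0   # 2.5 rounds to 2, so 5 - 2*2 = 1
assert math.remainder(7.0, 2.0) == -1.0  # 3.5 rounds to 4, so 7 - 4*2 = -1

# A zero residue of a negative x is negative zero.
zero = math.remainder(-6.0, 3.0)
assert zero == 0.0 and math.copysign(1.0, zero) == -1.0

# A finite x is returned unchanged when y is infinite.
assert math.remainder(2.5, math.inf) == 2.5

# y = 0 falls under the "otherwise" case.
try:
    math.remainder(5.0, 0.0)
except ValueError:
    print("invalid")
```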

sqrt_{F→F′} : F → F′ ∪ {inexact, underflow, invalid}

sqrt_{F→F′}(x) = result_{F′}(√x, nearest_{F′}) if x ∈ F and x ≥ 0

= x if x ∈ {−0, +∞}

= no_result_{F→F′}(x) otherwise

sqrt^↑_{F→F′} : F → F′ ∪ {inexact, underflow, invalid}

sqrt^↑_{F→F′}(x) = result_{F′}(√x, up_{F′}) if x ∈ F and x ≥ 0

= sqrt_{F→F′}(x) otherwise

sqrt^↓_{F→F′} : F → F′ ∪ {inexact, underflow, invalid}

sqrt^↓_{F→F′}(x) = result_{F′}(√x, down_{F′}) if x ∈ F and x ≥ 0

= sqrt_{F→F′}(x) otherwise
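A directed square root can be derived from a round-to-nearest square root by one exact comparison, as the following Python sketch suggests. It assumes F = F′ = binary64, Python 3.9 or later, and that math.sqrt is correctly rounded on the platform in use. Negative arguments raise the ValueError that math.sqrt itself produces. The names sqrt_up and sqrt_down are illustrative.

```python
import math
from fractions import Fraction


def sqrt_up(x: float) -> float:
    s = math.sqrt(x)  # assumed correctly rounded to nearest
    # If s*s is below x in exact arithmetic, s lies below the true root.
    if x > 0.0 and math.isfinite(x) and Fraction(s) ** 2 < Fraction(x):
        s = math.nextafter(s, math.inf)
    return s


def sqrt_down(x: float) -> float:
    s = math.sqrt(x)
    # If s*s is above x in exact arithmetic, s lies above the true root.
    if x > 0.0 and math.isfinite(x) and Fraction(s) ** 2 > Fraction(x):
        s = math.nextafter(s, -math.inf)
    return s
```

For an irrational root such as √2 the two results are adjacent values that enclose the true root, while for exact roots such as √4 they coincide.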

5.2.6.3 Value dissection

For each provided floating point type, the following operations shall be provided. For the non-conforming case of denorm_F = false, ulp_F may underflow, and the operations succ_F and pred_F shall not be provided.

The exponent_{F→I} operation is specified only when minint_I < emin_F − p_F and maxint_I > emax_F (or a stronger requirement) holds. Further, this requirement (or a stronger requirement) should hold also for scale_{F,I}.

exponent_{F→I} : F → I ∪ {infinitary}

exponent_{F→I}(x) = ⌊log_{r_F}(|x|)⌋ + 1 if x ∈ F and x ≠ 0

= infinitary(−∞) if x ∈ {−0, 0}

= +∞ if x ∈ {−∞, +∞}

= qNaN if x is a quiet NaN

= invalid(qNaN) if x is a signalling NaN

NOTES

1 Since most integer datatypes cannot represent any infinitary (or NaN) values, documented “well out of range” finite integer values of the correct sign may here be used instead of the infinities.

2 The related IEC 60559 operation logb returns a floating point value, to guarantee the representability of the infinitary (and NaN) return values.

fraction_F : F → F

fraction_F(x) = x / r_F^{exponent_{F→Z}(x)} if x ∈ F and x ≠ 0

= x if x ∈ {−∞, −0, 0, +∞}

= no_result_{F→F}(x) otherwise
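For a binary datatype (r_F = 2) the exponent and fraction operations correspond closely to Python's math.frexp, which returns a pair (m, e) with x = m · 2^e and 0.5 ≤ |m| < 1. The sketch below assumes binary64. It raises ValueError where exponent_{F→I} would return an infinitary or NaN value, since Python's int cannot represent those. The function names are illustrative.

```python
import math


def exponent_f(x: float) -> int:
    if x == 0.0 or not math.isfinite(x):
        raise ValueError("sketch covers nonzero finite x only")
    # frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1,
    # so e equals floor(log2(|x|)) + 1.
    return math.frexp(x)[1]


def fraction_f(x: float) -> float:
    if x == 0.0 or math.isinf(x):
        return x  # zeros and infinities are returned unchanged
    return math.frexp(x)[0]  # x / 2**exponent_f(x); a NaN passes through
```

For example, exponent_f(8.0) is 4 and fraction_f(8.0) is 0.5, so that 8 = 0.5 · 2^4.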


scale_{F,I} : F × I → F ∪ {underflow, overflow}

scale_{F,I}(x, n) = result_F(x · r_F^n, nearest_F) if x ∈ F and n ∈ I

= x if x ∈ {−∞, −0, +∞}

= no_result2_{F→F}(x, convert_{I→F}(n)) otherwise

succ_F : F → F ∪ {overflow}

succ_F(x) = result_F(min {z ∈ F^† | z > x}, nearest_F) if x ∈ F

= succ_F(0) if x = −0

= no_result_{F→F}(x) otherwise

pred_F : F → F ∪ {overflow}

pred_F(x) = result_F(max {z ∈ F^† | z < x}, nearest_F) if x ∈ F

= pred_F(0) if x = −0

= no_result_{F→F}(x) otherwise

ulp_F : F → F

ulp_F(x) = result_F(u_F(x), nearest_F) if x ∈ F

= ulp_F(0) if x = −0

= no_result_{F→F}(x) otherwise
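For binary64, the operations above have close counterparts in Python's math module: math.ldexp for scale_{F,I}, math.nextafter for succ_F and pred_F, and math.ulp for ulp_F (the latter two need Python 3.9 or later). The sketch below wraps nextafter so that stepping past the largest finite value is reported as overflow and non-finite arguments are rejected. The wrapper names and the choice of exceptions are assumptions made for illustration.

```python
import math


def succ_f(x: float) -> float:
    if not math.isfinite(x):
        raise ValueError("no result for infinities and NaNs")
    z = math.nextafter(x, math.inf)  # -0 behaves like 0 here
    if math.isinf(z):
        raise OverflowError("overflow")
    return z


def pred_f(x: float) -> float:
    if not math.isfinite(x):
        raise ValueError("no result for infinities and NaNs")
    z = math.nextafter(x, -math.inf)
    if math.isinf(z):
        raise OverflowError("overflow")
    return z


assert math.ldexp(0.75, 4) == 12.0        # scale: 0.75 * 2**4
assert succ_f(1.0) == 1.0 + 2.0**-52      # spacing above 1 is 2**-52
assert pred_f(1.0) == 1.0 - 2.0**-53      # spacing below 1 is 2**-53
assert succ_f(-0.0) == succ_f(0.0)        # succ(-0) = succ(0)
assert math.ulp(1.0) == 2.0**-52
assert math.ulp(0.0) == succ_f(0.0)       # smallest positive subnormal
```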

5.2.6.4 Value splitting

For each provided floating point type, the following operations shall be provided. The trunc_{F,I} and round_{F,I} operations are specified only if maxint_I > p_F (or a stronger requirement) holds.

intpart_F : F → F ∪ {−0}

intpart_F(x) = ⌊x⌋ if x ∈ F and x ≥ 0

= neg_F(intpart_F(−x)) if x ∈ F and x < 0

= x if x ∈ {−∞, −0, +∞}

= no_result_{F→F}(x) otherwise

fractpart_F : F → F ∪ {−0}

fractpart_F(x) = x − ⌊x⌋ if x ∈ F and x ≥ 0

= neg_F(fractpart_F(−x)) if x ∈ F and x < 0

= x if x ∈ {−0}

= no_result_{F→F}(x) otherwise

trunc_{F,I} : F × I → F ∪ {−0}

trunc_{F,I}(x, n) = ⌊x / r_F^{e_F(x)−n}⌋ · r_F^{e_F(x)−n} if x ∈ F and x ≥ 0 and n ∈ I

= neg_F(trunc_{F,I}(−x, n)) if x ∈ F and x < 0 and n ∈ I

= x if x ∈ {−∞, −0, +∞}

= no_result2_{F→F}(x, n) otherwise

round_{F,I} : F × I → F ∪ {−0, overflow}

round_{F,I}(x, n) = result_F(round(x / r_F^{e_F(x)−n}) · r_F^{e_F(x)−n}, nearest_F) if x ∈ F and x ≥ 0 and n ∈ I

= neg_F(round_{F,I}(−x, n)) if x ∈ F and x < 0 and n ∈ I

= x if x ∈ {−∞, −0, +∞}

= no_result2_{F→F}(x, n) otherwise
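The value splitting operations can be sketched in Python for binary64 as follows. math.modf returns the fractional and integer parts, both carrying the sign of x, which matches the neg_F formulation above for finite x. trunc_f keeps the n leading binary digits of x. The sketch assumes finite, normal x and a small positive n, because for subnormal x the exponent delivered by math.frexp may differ from e_F(x). The function names are illustrative.

```python
import math


def intpart_f(x: float) -> float:
    return math.modf(x)[1]  # integer part, carrying the sign of x


def fractpart_f(x: float) -> float:
    return math.modf(x)[0]  # fractional part, carrying the sign of x


def trunc_f(x: float, n: int) -> float:
    # Keep the n leading binary digits of x and discard the rest.
    if x == 0.0 or not math.isfinite(x):
        return x
    if x < 0.0:
        return -trunc_f(-x, n)
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= m < 1
    return math.ldexp(math.floor(math.ldexp(m, n)), e - n)
```

For example, 13 is 1101 in binary, so trunc_f(13.0, 3) keeps 110 followed by a zero digit and returns 12.0.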
