
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2020

An error assessment of matrix multiplications on posit matrices

KTH Bachelor Thesis Report

August Danell Håkansson

Mirja Johnsson


Authors

August Danell Håkansson <aughak@kth.se> and Mirja Johnsson <mirjaj@kth.se>

Electrical Engineering and Computer Science, KTH Royal Institute of Technology

Swedish project title

En bedömning av beräkningsfel vid matrismultiplikation av matriser fyllda med tal i Positformat.

Examiners

Pawel Herman, Örjan Ekeberg

KTH Royal Institute of Technology

Supervisor

Stefano Markidis

Acknowledgements

We are grateful to our supervisor Stefano Markidis and to his team for their guidance and their willingness to share their expertise.


Abstract

The posit is a floating-point number format proposed by John Gustafson to improve upon the accuracy of the IEEE-754 format, which is the standard today. The goal of this paper was to look specifically at matrix multiplication and examine how the posit format compares to the IEEE-754. The quire, which is part of the posit standard, was not included in this paper due to scope limitations. We used the library softPosit in Python to construct arrays of matrices referred to as matrix chains. These matrices were filled with numbers in one format and bit size at a time. The chains were then multiplied together with normal matrix multiplication, and the errors of the two formats were compared. An IEEE-754 matrix chain with more bits than the ones being compared was used as the reference for comparing the accuracy of the IEEE-754 and the posit as it pertains to matrix multiplication. The result was that the posit format could yield more accurate matrix multiplications, especially for chains with few matrices of low dimension. When the dimensions and the number of matrices increased, however, the posit matrix produced an error that was greater than that of the IEEE-754 matrix. The conclusion was that posits, if used sensibly, can be a more accurate format for matrix multiplication, but it is important to consider the properties inherent to the posit when dealing with matrix multiplication of matrices inhabited by posits.

Keywords

Posit, softPosit Library, Moore’s Law


Summary

The posit is John Gustafson's proposal for a new format for representing floating-point numbers with better precision than the established IEEE-754 format. Due to the delimitations of this work, the quire accumulator, which is part of the posit standard, was not used. The purpose of this report was to perform matrix multiplications and compare the results of the two formats. We used the Python library softPosit to create lists of matrices that we call matrix chains. These matrices were filled with numbers expressed in the different formats and bit sizes, in turn. Matrix multiplication was then performed on the chains, and the resulting errors were compared between the types. As a reference, an IEEE-754 representation with more bits than the numbers under test was used. The result was that the posit format gave more accurate answers. This was especially noticeable for the matrix chains with few, smaller matrices. When the dimensions and the number of matrices increased, the IEEE-754 had the clear upper hand. The conclusion was that informed and strategic use of the posit format can be the better choice for matrix multiplication, as it is important to take its properties into consideration.

Keywords

Posit, softPosit, Moore's Law


Contents

1 Introduction . . . . 1
1.1 Research Question . . . . 2
1.2 Scope . . . . 2
1.3 Thesis outline . . . . 3
2 Theoretical Background . . . . 5
2.1 Terminology and definitions used in the text . . . . 5
2.2 Matrix Multiplication . . . . 7
2.3 How does the float-754 format work? . . . . 8
2.4 How does the posit format work? . . . . 10
2.5 Related Work: Positive aspects of the posit format . . . . 13
2.6 Related Work: Negative aspects of the posit format . . . . 15
3 Method . . . . 16
3.1 Benchmark Testing . . . . 16
3.2 Evaluation . . . . 19
4 Results . . . . 22
4.1 16-bit representations . . . . 24
4.2 32-bit representations . . . . 25
4.3 A summary of the results . . . . 27
5 Discussion . . . . 28
5.1 The Results . . . . 28
5.2 Future Work . . . . 30
6 Conclusion . . . . 32
References . . . . 33


1 Introduction

Many fields of science, finance, and engineering require the representation and manipulation of real numbers [1]. However, since the real numbers are infinite, representing them all exactly in a computer with a finite bit string is an impossible task [2]. Therefore almost every programming language has a floating-point data type, which is an approximate representation of a real number [2]. The floating-point representation of a real number such as 9.4 will not be exactly 9.4 [3].

The reigning standard for the representation of floating-point numbers in computers is called the IEEE-754, referred to hereafter as the float-754. The float-754 format offers a large dynamic range and high precision in most cases. However, there are downsides to the float-754 as well, one of which is its rigidity. With each bit statically assigned, both range and precision can suffer. Numbers close to each other run the risk of being approximated into the same bit representation, i.e. the same number. When this happens during calculations it causes cancellation, which in turn results in errors and the loss of data.

The posit is one of several suggested ways to improve on the float-754 standard. The fundamental idea that separates the posit format from the established float-754 format is that the user should be able to calibrate the posit's precision and range in accordance with their present need. Allowing the user to decide how the available bits are used to calibrate range and precision could lead to a more efficient way of using a computer's capacity. This means that the user can choose between either a high precision or a large dynamic range, and can adjust between these extremes [4].

There has been previous research demonstrating that the posit can in practice be more accurate than the float-754 standard within a specific interval referred to as the golden zone. The training of neural networks has been one suggested use of the posit [4]. Furthermore, fluid dynamics and weather prediction is one area in which the posit format produced a more accurate simulation than the float-754 format [5].


1.1 Research Question

The prospect of the posit is particularly interesting to test because if the same accuracy can be achieved with fewer bits using the posit, this will decrease the energy cost and require less hardware to produce the same calculations. Furthermore, with a format that exhibits higher accuracy, an operation will have a more accurate result with the same number of bits: a higher bit-for-bit accuracy. According to Gustafson, the bit efficiency is so high that once hardware support is realized there is potential for up to two turns of improvement according to Moore's law without transistors decreasing in size [4].

One important operation for evaluating the posit against the float-754 format is matrix multiplication. To this end, the goal of this paper is to compare the two floating-point formats, posit and float-754, as it pertains to matrix multiplication, and the formulated research question is the following:

How does the posit-standard compare to the float-754 standard in terms of the error in regards to matrix multiplication?

1.2 Scope

The scope of this report is to specifically compare the error rates of the float-754 and the posit, as the research question suggests. Other floating-point formats can be regarded as potential candidates to replace the float-754, but those are outside the scope of this thesis. The comparison between the float-754 and the posit is furthermore performed in sizes of 16-bit and 32-bit.

There is also an accumulator called the quire, which is an important part of the posit standard. Despite its importance, the results of this paper do not use a quire for accumulations but only the posit as it is. The reason for omitting the quire is twofold. By John Gustafson's admission, there is nothing quite like the quire in the float-754 standard, therefore it seems fairer to compare the float-754 with the posit as it is [6]. Secondly, the focus of this bachelor thesis is on the use of matrix multiplication, and in omitting the quire the implementation is kept simpler and more reader-friendly.

1.3 Thesis outline

The thesis starts with a background that provides definitions of the terms used throughout the text. What follows is a subsection on how matrix multiplication works and what its uses are. The background section furthermore explains how the posit works in comparison to the float-754 format and what the conceptual differences between the two formats are. Despite the quire being outside the limitations of this report, it is briefly mentioned and discussed in a subsection on the posit. After all, it is a part of the posit standard, and it is later brought up in section 5.2, Future Work, that a continuation of this paper could be to look at the quire. As such, the quire section relates to the discussion of this paper.

After the theoretical background of the posit, the background section then explores the related work of what has already been researched on the posit format. The related work is divided into two subsections. The first subsection focuses on the positive aspects of the posit in comparison to the float-754. It explains some useful applications that have been found for the format and successes from changing the paradigm by replacing the float-754 with the posit. The second subsection then focuses on the negative aspects and what dangers might come from such a paradigm shift. It is important that the reader understands the background section because all of its content, with the exception of the quire, relates to the implementation of the method and the results that follow.

The method section outlines the tools that are available for simulating the posit and how they have been used to create a comparison of error between the float-754 and the posit. The method consists of the benchmark test as one subsection and the evaluation as another. The first subsection covers the process of attaining a set of four matrices to compare the posit to the float-754 in regards to matrix multiplication. In this, matrices are filled with the two formats and subjected to matrix multiplications. These matrices are then evaluated in the evaluation section, to answer the research question of how the different formats compare. As such, to understand both the evaluation and the process of matrix multiplication as well as the nature of the two formats, the background section is important to understand.


After the background section, the results of the benchmark test are presented in the results section. The results are presented in text and with reference to figures, and those figures follow after the text. After the figures, there is a summary of the results, where a table encapsulates all the results that were previously discussed, before moving on to the discussion.

The results are then further discussed in the discussion section. The first subsection explains how the results are logical concerning the concept of the golden zone. The golden zone is a property inherent to posits that is very important and is as such covered in the background. After the discussion of the results, there is also a subsection that explores future work. Lastly, a conclusion is formed in the final section.


2 Theoretical Background

In this section, the reader will be acquainted with the terminology of this text as well as background knowledge regarding matrix multiplication, the float-754, and the posit. As mentioned in the thesis outline, this background knowledge is what later sections such as the method, the results, and the discussion build upon. As such, it is important to understand this section. However, if the reader already has deep and extensive knowledge of the topic, parts of this section can be skipped.

2.1 Terminology and definitions used in the text

• Accuracy - When accuracy is mentioned in the text what is referred to is decimal accuracy. It is a measurement of how many digits of an answer are correct and it is the inverse of error [4]. If the error is small, the decimal accuracy is high and vice versa.

• Moore’s law - Moore’s law states that the number of transistors in an integrated circuit grows exponentially over time. The law is connected with the definition of accuracy, as the inverse of error, for the reason that was mentioned in the research question: if a posit exhibits higher bit-for-bit efficiency than the float-754, that will essentially improve upon Moore’s law without actually adding more transistors or decreasing their size.

• Norms - Two norms are used for the error evaluation of the results: the Euclidean norm and the max norm. The Euclidean norm takes the root of the sum of the squared residuals of all the elements in a matrix, whereas the max norm takes the single largest residual in the matrix. Let i be the index for the row and j the index for the column of an N x N matrix M:

$$\text{Euclidean norm} = \sqrt{\sum_{i=1}^{N} \sum_{j=1}^{N} (M_{i,j})^2}, \qquad \text{Max norm} = \max_{i,j} |M_{i,j}|$$

A small code sketch of both norms follows after this list.


• The Golden Zone - The golden zone is an interval in which the posit tends to perform well. The golden zone for a posit32 is $[10^{-6}, 10^{6}]$ [7]. The idea of a golden zone is important when considering whether the posit is the correct tool for certain calculations.

• Matrix Chain - When a matrix chain is mentioned in the text it refers to an array of m matrices that are matrix multiplied with each other. Matrix multiplication in this paper is executed without considering the order in which to perform the operations, performing the multiplications naively from left to right. In the method section, three different matrix chains are described, with lengths 5, 10, and 15, respectively.

• Deferred Rounding, FMA - Since the 2008 version of the float-754, the format has used the concept of deferred rounding. It is used whenever a sum that follows the structure $(a \cdot b) + c$ (fused multiply-add, FMA) is calculated. An additional register in memory is used, which needs to be many times larger than the size of the elements of the calculation. The exact value at every step of the way is stored in this register, and then finally, as the last step, the answer is rounded to fit the bit size explicitly used for the sum.
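The two norms mentioned above are straightforward to compute. Below is a minimal Python sketch, written for this text, that evaluates both norms of a residual matrix given as a nested list:

```python
import math

def euclidean_norm(M):
    """Square root of the sum of all squared elements (the Frobenius norm)."""
    return math.sqrt(sum(x * x for row in M for x in row))

def max_norm(M):
    """The single largest element by absolute value."""
    return max(abs(x) for row in M for x in row)

# Example on a small residual matrix:
residuals = [[0.01, -0.02], [0.00, 0.03]]
print(euclidean_norm(residuals))  # ~0.0374
print(max_norm(residuals))        # 0.03
```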


2.2 Matrix Multiplication

Matrix multiplication is an operation on two matrices that results in a new matrix. Provided that the number of columns in the first matrix equals the number of rows in the second matrix, matrix multiplication can be performed. If there are two matrices A and B where this requirement is met, the matrix multiplication AB results in a new matrix C. Each cell in the matrix C is calculated by taking a row in A and taking its dot product with a column in B. Mathematically, the definition of matrix multiplication looks like this [8]:

$$C_{ij} = (AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

Similar to the number one in multiplication, or the number zero in addition, matrix multiplication has an identity element, called the identity matrix, that does not affect the other matrix. Just like in normal multiplication, a matrix B can have an inverse that maps the matrix B to the identity matrix. This is not always the case, however; a matrix is not guaranteed to have an inverse. Matrix multiplication also differs from normal multiplication in that it is not commutative: a matrix multiplication AB will in most cases not be equal to BA. Furthermore, geometric transformations such as rotations, reflections, stretches, shears, dilations, and contractions can be represented as matrices [8].

An important application reliant on matrix multiplication is the Markov chain, a probabilistic model used in a variety of fields. In the book ”Introduction to Probability”, multiple Markov matrices are used as examples modeling different problems in probability, one example being that of gene-modeling [9]. A Markov chain is a matrix chain, as was mentioned in section 2.1, but one in which the same matrix, a probability matrix, is repeated.

There are many other applications of matrix multiplication across many different fields. In the book ”Essential MATLAB for Engineers and Scientists”, matrix multiplication is described as probably the most important matrix operation. A few mentioned usages of the operation are within network theory, the solution of linear systems of equations, transformations of coordinate systems, and population modeling [10].
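To make the definition above concrete, the following is a minimal Python sketch, written for this text, of the naive triple-loop matrix multiplication. The element type is kept generic so that the same routine works for plain floats or, for example, posit objects:

```python
def matmul(A, B, zero=0.0):
    """Multiply two square matrices given as nested lists.

    `zero` is the additive identity of the element type, so the same
    routine works for floats or other number-like objects.
    """
    n = len(A)
    C = [[zero] * n for _ in range(n)]
    for i in range(n):             # row of A
        for j in range(n):         # column of B
            acc = zero
            for k in range(n):     # dot product of row i and column j
                acc = acc + A[i][k] * B[k][j]
            C[i][j] = acc
    return C

# Multiplying by the identity matrix leaves a matrix unchanged:
I = [[1.0, 0.0], [0.0, 1.0]]
print(matmul([[1.0, 2.0], [3.0, 4.0]], I))  # [[1.0, 2.0], [3.0, 4.0]]
```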


2.3 How does the float-754 format work?

The float-754 format consists of three segments: a sign bit to determine if the number is positive or negative; an exponent segment that determines the range; and a fraction segment, often called the mantissa, that determines the precision of the fractional part [3]. There are three sizes of the float-754 that are relevant in this work: the 16, 32, and 64-bit floats, more commonly called the half, single, and double-precision floats, respectively. The bits of each format are statically assigned. As an example, a double-precision float-754, consisting of 64 bits, has one bit assigned to its sign, 11 bits assigned to its exponent, and 52 bits assigned to its mantissa. The fact that the float-754 has its segments statically assigned is an important notion when later comparing it to the posit format. A floating-point number is often represented in the following way [3]:

$$\pm 1.b_1 b_2 \ldots b_n \times 2^{p}$$

Essentially, a number is represented by bit shifting it with respect to an exponent; this is called normalized form. An example illustrated in the book Numerical Analysis by Timothy Sauer is the number 9. The highest power of two in the bit representation of 9 is $2^3 = 8$, and as such its normalized form looks like this [3]:

$$9 = 1001.0_2 = 1.001_2 \times 2^3$$

Similarly, the number 9.375 can be represented as:

$$9.375 = 1001.011_2 = 1.001011_2 \times 2^3$$

As can be seen, there are numbers such as 9 and 9.375 that can be represented exactly. Both of these numbers can also be represented with a half-precision floating-point number, which has an exponent of 5 bits and a mantissa of 10 bits. However, there are limitations to this representation, which is illustrated in Numerical Analysis with the example of the number 9.4 [3].


The bit representation of 9.4 is an infinite bit string, and as such the bit string will have to be truncated in some way. Two common methods for truncation are chopping and rounding. The first method simply throws away the bits that fall off the end, whereas the second rounds to the nearest representable value. In having to perform a truncation, there will necessarily be a small error in the representation of 9.4. That is the important takeaway about floating-point numbers: they are not always exact. A floating-point number representing 9.4 is essentially a close approximation; it is not exactly 9.4 [3].
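This inexactness is easy to observe directly. As a small illustration written for this text, Python's decimal module can print the exact value that the double-precision literal 9.4 actually stores:

```python
from decimal import Decimal

# Decimal(float) expands the exact binary value stored for the literal.
print(Decimal(9.4))
# 9.4000000000000003552713678800500929355621337890625
```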


2.4 How does the posit format work?

The structure of the posit format is similar to that of the float-754. The posit format has four segments: a sign segment, a new segment called the regime segment, an exponent segment, and lastly a fraction segment, or mantissa, just like the float-754 standard. Though the two formats are similar, there is more to the posit format than what might be apparent at first. The user can provide a number, usually denoted es, which denotes the maximum number of exponent bits. It is also used to generate what is called the useed. Mathematically, the useed is $u = 2^{2^{es}}$ [4].

The new regime segment consists of 1 to n-1 identical bits, and since these bits are identical it becomes possible to determine where the next segment begins. Letting m be the number of bits in the regime segment, a number k is defined as m - 1 if the regime segment consists of ones, and as -m if it consists of zeros. Put simply, k becomes a negative number if the regime bits are zeros and a positive number if they are ones; the absolute value of k depends on the size of the regime segment. A floating-point number Y is then represented as [4] [11]:

$$Y = s \times u^{k} \times 2^{e} \times (1 + f)$$

where s is the sign, u is the useed defined above, k is the number defined above that depends on the regime segment, e is the exponent bits read as an integer, and f is the fraction segment, or mantissa [4] [11].

The important aspect of the posit is the useed, u, which is used as a scaling factor. The scaling factor can be used to reach numbers that are usually out of range. As an example, with 4 bits the maximum value for es is 4 - 2 = 2, and the maximum value, encoded by the bit string ”0111”, is 256.0 [4]:

$$Y_{0111} = 1 \times 16^{2} \times 2^{0} \times (1 + 0) = 256.0$$


The scaling factor provided by the useed, and by k, which is defined by the regime segment, enables the format in theory to reach both a greater range and a greater accuracy. Furthermore, it can be tailored to fit the situation, adjusting its accuracy and range accordingly. Consider for instance the real number 9.4, which was written in normalized form in section 2.3 on how the float-754 works. The exponent of 9.4 fits in 4 bits, but the double-precision float-754 format uses 11 bits for the exponent, so 7 bits are effectively wasted. These bits would be more useful as part of the fraction segment instead. With es = 4, the posit would not let these bits go unused.
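The decoding rules above can be captured in a few lines of Python. The following is a simplified sketch written for this text, not taken from any posit library; it handles only positive posit bit strings (zero, NaR, and the two's-complement encoding of negative values are omitted):

```python
def decode_posit(bits: str, es: int) -> float:
    """Decode a positive posit bit string into its real value.

    Simplified sketch: handles only positive posits (sign bit 0);
    zero, NaR, and negative encodings are omitted for brevity.
    """
    useed = 2 ** (2 ** es)
    assert bits[0] == "0", "sketch handles positive posits only"
    rest = bits[1:]
    # Regime: a run of identical bits, terminated by the opposite bit
    # (or by the end of the posit).
    run = len(rest) - len(rest.lstrip(rest[0]))
    k = run - 1 if rest[0] == "1" else -run
    rest = rest[run + 1:]                      # skip regime and terminator
    # Exponent: up to es bits; missing low bits count as zeros.
    ebits = rest[:es]
    e = int(ebits, 2) << (es - len(ebits)) if ebits else 0
    # Fraction: whatever bits remain, read as 1.f.
    fbits = rest[es:]
    f = int(fbits, 2) / 2 ** len(fbits) if fbits else 0.0
    return useed ** k * 2 ** e * (1 + f)

# The 4-bit example from the text: "0111" with es = 2 gives 16^2 = 256.0.
print(decode_posit("0111", es=2))  # 256.0
```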


2.4.1 The Quire

The posit expands on the concept of deferred rounding and uses it for FMA and four additional kinds of serialized calculations. The additional ones are:

• Fused add-multiply: $(a + b) \times c$

• Fused multiply-multiply-subtract: $(a \times b) - (c \times d)$

• Fused sum: $\sum a_i$

• Fused dot product (scalar product): $\sum a_i b_i$

In posit arithmetic, this register is called the quire, and its suggested sizes for 16-bit and 32-bit posits are 128-bit and 1024-bit, respectively. Gustafson states that when there is hardware support for the posit format, the quire will offer the guarantee that identical calculations obtain the same result independent of the system. This is not guaranteed by the float-754 [4].

The quire is used to perform accumulation, and in the work ”Posit Arithmetic” by John Gustafson its accumulative use is demonstrated with several examples. Essentially, what is illustrated in the examples is to calculate all the accumulations in the quire instead of the posit, and then to save the result of the quire to a posit. According to Gustafson, there is nothing quite like the quire in the float-754 standard. Though it is limited to accumulations, the quire is a powerful concept that allows posits to ”punch above their weight class” [6].

As was mentioned in the scope and limitations of this paper, the quire was omitted in this work to avoid the paper becoming too extensive. However, given that the quire is a part of the posit standard, further work could be to incorporate the quire into the test, which is brought up as part of the discussion [12]. Therefore it is still useful for the reader to know about the quire.
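Although the quire is not used in this work, its core idea, deferred rounding, is easy to illustrate. The sketch below is a toy illustration written for this text (it does not use the SoftPosit quire API): a dot product accumulated in float32, rounding at every step, is compared with one accumulated in float64 and rounded to float32 only once at the end, which is the pattern the quire generalizes with exact accumulation:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, 10_000).astype(np.float32)
b = rng.uniform(-1, 1, 10_000).astype(np.float32)

# Rounding at every step: each partial sum is rounded to float32.
stepwise = np.float32(0.0)
for x, y in zip(a, b):
    stepwise = np.float32(stepwise + x * y)

# Deferred rounding: accumulate in higher precision, round once at the end.
deferred = np.float32(np.dot(a.astype(np.float64), b.astype(np.float64)))

print(stepwise, deferred)  # the two results typically differ slightly
```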


2.5 Related Work: Positive aspects of the posit format

The precision of the posit is usually higher than that of the float-754 with the same number of bits within the range $(10^{-6}, 10^{6})$, known as the golden range, or the golden zone. The difference is especially noticeable for numbers close to 0. In John Gustafson's paper ”Beating Floating Point at its Own Game: Posit Arithmetic”, the accuracy of an 8-bit posit is demonstrated to exceed the accuracy of the 8-bit float-754 for values close to zero [4]. Many scientific constants, such as Euler's number e and the number pi, lie well within this range. Being able to accurately represent those numbers is certainly a valuable aspect of the posit format.

In the article by John Gustafson it is furthermore stated that in some cases a 32-bit posit can safely replace a 64-bit float-754, provided that fused operations are in use to prevent rounding errors from accumulating. Making a shift from 64-bit to 32-bit while maintaining the same level of accuracy would save power and hardware, achieving the goal of using fewer resources, as well as increase the speed of a calculation [4].

One practical application of the posit is as an alternative to regular float-754 floats when building models for weather prediction. In the article ”Posits as an alternative to floats for weather and climate models” it was demonstrated that the representation of the system was significantly improved by using a 16-bit posit as opposed to a 16-bit float-754. In a fluid system such as weather prediction, where computationally heavy algorithms are executed, using fewer bits to accommodate the same precision could greatly reduce the execution time [5].

Another area where the posit could be a beneficial tool is the training of neural networks. During training, a sigmoid function is used a large number of times to predict the probability of different events. This is commonly computed with a 16-bit float-754, which is computationally heavy and time-consuming, as each call to the sigmoid function can require over a hundred clock cycles to evaluate. A close approximation can be obtained using an 8-bit posit instead, where the value is found by performing a bit flip followed by a bit shift 2 bits to the right. Since these steps can be executed easily in hardware, this method has the potential to be up to 4 times faster than the current one [4].


All in all, the posit format can be a giant leap towards the ultimate goal of achieving the same accuracy with a smaller number of bits, and as such towards improving upon Moore's law without adding more transistors, especially for accurately representing numbers that are close to zero. However, there has also been work that contradicts these positive prospects of the posit format and demonstrates that this shift is not safe.


2.6 Related Work: Negative aspects of the posit format

In ”Posits: the good, the bad and the ugly” the authors describe when the posit generates an error that is larger than that of the float-754. This happens when calculations are computed on numbers whose magnitude at some point surpasses roughly one million in 32-bit. On the one hand, it is not difficult to mitigate this problem: simply scaling the data so that it remains in the golden zone is enough to ensure that the precision remains. However, the drawback is that the need for additional checks and calculations arises. Every re-scaled data value also needs to have its scaling factor saved, thus requiring more bits of storage and possibly nullifying the goal of using less memory [7].

Another aspect mentioned in the article is the risks associated with naive use of the posit format. The authors state that it is possible to inflict even greater damage with naive use of the posit than with naive use of floats. When the posit is more accurate, it provides one or two more digits of accuracy than the float-754 of the same size, but ”When they [Posits] are worse than floats, the degradation of accuracy can be arbitrarily large.” Caution is needed, and a rush towards a total switch to posits is not a safe idea at present. Though there certainly are areas in which the posit can be used, the article recommends that the posit should potentially only be a storage format as it pertains to general-purpose computing [7].


3 Method

The goal of this thesis work is to continue the previous work and further increase the knowledge about the performance potential of the posit format, specifically applied to matrix multiplication. The previously mentioned research question of this report is as such the following: ”How does the posit-standard compare to the float-754 standard in terms of the error in regards to matrix multiplication?”. As part of the work we have devised a quantitative benchmark test that generates matrices in the different formats mentioned in the background and then evaluates the error of these matrices. The purpose of the benchmark test was to test the error rate of both formats and as such to find an answer to the research question.

3.1 Benchmark Testing

In implementing the benchmark tests for matrix multiplication, we have used the library SoftPosit, which emulates the posit format with es = 1 for 16-bit and es = 2 for 32-bit. SoftPosit also includes the quire and fused operations as tools for avoiding accumulation of rounding errors. There are multiple versions of the SoftPosit library; we have been using the Python version and the Python language in developing our benchmark test [12]. The benchmark test implemented chained matrix multiplications with m randomized matrices, all with a dimension of N by N. Each cell in the N by N matrices was a randomized number within the range (-1, 1), randomized by a simple function, see figure 3.1.


The next step was to have one matrix filled with the posit format and another filled with the float-754 format, where the matrices held the same randomized numbers cell for cell, but in their respective formats. As an example, the element $a_{1,2}$ might have received the randomized value 0.54321. This means that the posit matrix would have held 0.54321 at index [1][2] in the form of a posit, whereas the float-754 matrix would have represented the same value, at the same index, but in the float-754 format instead. There was also a matrix for a high-precision float-754, which became the comparison matrix later on.

Figure 3.2: Casting a value into different formats
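As a minimal sketch of this casting step, written for this text and assuming the SoftPosit Python package (import name softposit, with posit16 and posit32 constructors) together with numpy for the half- and single-precision floats, one random value can be cast into all five formats like this:

```python
import random
import numpy as np
import softposit as sp  # assumed import name of the SoftPosit Python bindings

def cast_value(x):
    """Cast one randomized value into the five formats under comparison."""
    return {
        "posit16": sp.posit16(x),
        "posit32": sp.posit32(x),
        "float16": np.float16(x),
        "float32": np.float32(x),
        "float64": x,  # Python float: the high-precision comparison format
    }

value = random.uniform(-1.0, 1.0)
print(cast_value(value))
```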

All the identical matrix chains, represented in the different formats, were multiplied through with matrix multiplication, and the result of the test was five matrices. To multiply through the matrix chains, we started by initializing the matrices as identity matrices in their respective formats. We then let these matrices accumulate through a series of m matrix multiplications, where m was the length of the requested matrix chain. The generateRandom() function in figure 3.3 generated a random matrix where each cell was a random number in the range (-1, 1), but it was the same number for each format, as can be seen in figure 3.2. For each run, every matrix was multiplied with a random matrix, for a chain that was m matrices long, see figure 3.3.

Figure 3.3: Performing chained matrix multiplications
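The accumulation loop itself is short. Below is a minimal, plain-float sketch written for this text; the thesis ran the same loop once per format, with identical random values cast into each format:

```python
import random

def matmul(A, B):
    """Naive square-matrix product (see the sketch in section 2.2)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def multiply_chain(m, n):
    """Accumulate m random n-by-n matrices from left to right,
    starting from the identity matrix, as in figure 3.3."""
    acc = identity(n)
    for _ in range(m):
        M = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
        acc = matmul(acc, M)
    return acc

result = multiply_chain(m=5, n=4)  # a chain of five 4x4 matrices
```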


The result of the chained matrix multiplication was five matrices. One of the five was a high-bit value in the float-754 format, and that is the matrix referred to in this text as the comparison matrix, because it was used to evaluate the error in the other matrices. The comparison matrix was not in and of itself interesting to this work; it was only used to evaluate the other four matrices.

The reason why the matrices were initially initialized as the identity matrix in their respective formats, as seen in figure 3.3, and why that works, was covered in the background section: the identity matrix is the identity element under matrix multiplication [8]. Essentially, what was done with the identity matrix in figure 3.3 is analogous to initializing a real number as 1 and then multiplying it with a chain of real numbers. It yields the multiplicative accumulation of that chain, which was exactly what this test was evaluating.

Two of the matrix chains were accumulated in the float-754 format, where each cell held a 16-bit float-754 and a 32-bit float-754, respectively. The last two matrix chains were accumulated in the posit format, and just like the two aforementioned matrices they held a 16-bit value and a 32-bit value for each cell, only represented in the posit format, as seen in figure 3.2. The reason why we compared the float-754 matrix to the posit matrix in both 16-bit and 32-bit was to see if there was a difference between the formats when looking at two different bit sizes. As such, there was a set of four matrices at this stage, accumulated from four matrix chains, and the accumulated matrices were subsequently evaluated with respect to the comparison matrix.


3.2 Evaluation

At this stage, the float-754 and posit results each consisted of a 16-bit matrix and a 32-bit matrix. Therefore a total of four matrices were evaluated with respect to the comparison matrix, the fifth matrix. Two different norms for evaluating the error were used, the Euclidean norm and the max norm, both defined in the background section. The Euclidean error looked at the overall, geometric error of a matrix, whereas the max error instead represented the error as the largest error of any single cell. The reason for the two error norms was to see not only the error of a format but also how the error was distributed in the matrix.

An error was simply the absolute difference between a cell in a matrix and the cell at the same index in the comparison matrix. If, for instance, the posit32 matrix had the value 1.01 in its first cell and the comparison matrix had the value 1.0 in its first cell, the residual error was simply the difference between the two, that is 0.01, see residual in figure 3.4.

Figure 3.4: Calculating an Euclidean error and a max error.
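A minimal sketch of this evaluation step, written for this text and mirroring figure 3.4 (each result cell is converted to a float before the residual against the comparison matrix is taken):

```python
import math

def evaluate(result, reference):
    """Euclidean and max error of `result` against the comparison matrix."""
    n = len(result)
    residuals = [abs(float(result[i][j]) - reference[i][j])
                 for i in range(n) for j in range(n)]
    euclidean_error = math.sqrt(sum(r * r for r in residuals))
    max_error = max(residuals)
    return euclidean_error, max_error

# Example: a result that is off by 0.01 in one cell.
print(evaluate([[1.01, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
# (~0.01, ~0.01)
```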

For each of these matrix chains, we incrementally increased the dimension up to a maximum of 18 and measured the error for each run. As such, the complexity in terms of dimensions and number of matrices was increased up to the most complex measurement, which was that of 15 18x18 matrices. The reason for this was to examine the error against the complexity of the chain and to see how the two formats were affected by more complex matrix chains. To design the test to be as fair as possible, we also repeated the runs 100 times and measured an average error rate for both the posit and the float-754 formats.

Figure 3.5: Averaging the Euclidean error and finding max errors.
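A minimal sketch of the averaging step in figure 3.5, written for this text around two hypothetical helpers, run_chain(m, n) for one benchmark run and evaluate(result) for the two error norms:

```python
RUNS = 100

def averaged_errors(m, n, run_chain, evaluate):
    """Average both error norms over RUNS independent runs.

    `run_chain(m, n)` and `evaluate(result)` are hypothetical helpers:
    the first multiplies a chain of m random n-by-n matrices, the second
    returns (euclidean_error, max_error) against the comparison matrix.
    """
    euclid_sum = max_sum = 0.0
    for _ in range(RUNS):
        euclid, max_err = evaluate(run_chain(m, n))
        euclid_sum += euclid
        max_sum += max_err
    return euclid_sum / RUNS, max_sum / RUNS
```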

Lastly, the matplotlib library in Python was used to plot the results. Each plot held two subplots: one that plotted the error of the two formats in terms of the averaged Euclidean error norm, and one that plotted the error in terms of the max error norm.


To test for different complexities of matrix chains, the test was performed with three different matrix chains: one of length 5, one of length 10, and one of length 15. In each of those chains, the dimension of the matrices was incremented to see how this rise in complexity affected the error. There was therefore a total of six plots: three plots for the errors of the three matrix chains in 16-bit, and three plots in 32-bit. Table 3.1 summarises this and describes all the different constellations that were tested.

Number of matrices   16-bit formats          32-bit formats
5 matrices           Euclidean / Max error   Euclidean / Max error
10 matrices          Euclidean / Max error   Euclidean / Max error
15 matrices          Euclidean / Max error   Euclidean / Max error

Table 3.1: Summary of the planned matrix chain tests

In table 3.1 above there are six entries that say ”Euclidean / Max error”. Every such entry was a plot with two subplots: one for the averaged Euclidean error on the left-hand side and one for the averaged max error on the right-hand side. Both subplots were plotted with the number of dimensions on the x-axis and the respective error, averaged over 100 runs, on the y-axis. The averaged error of the posit was plotted in blue and the averaged error of the float-754 in red.

As seen in the table, we tested both 16-bit and 32-bit formats. As mentioned in the background section, the posit operates well within what is called the golden zone, where it is expected to be at least as accurate as the float-754 format [7]. As also seen in the background section, the float-754 format has its bits statically assigned; a double-precision float-754 has 11 bits assigned to its exponent [3]. The more bits the float-754 has, the more of those are distributed into the exponent. When assessing the accuracy for numbers close to zero, the static allocation to the exponent segment is a waste, which was brought up in the background section. Therefore, the 32-bit float-754 misallocates more bits for numbers close to zero than the 16-bit float-754 does. That was why we presumed there would be a difference between the two bit sizes and compared the posit to the float-754 in both 16-bit and 32-bit.


4 Results

The result of the benchmark test was that the posit can be more accurate than the float-754 and have a lower error as it pertains to matrix multiplication. The posit format started with a lower error in both the 16-bit case and the 32-bit case. Increasing the number of dimensions and the number of matrices favored the float-754 format: eventually a cut-off point was passed, after which the float-754 matrix had a lower error than the posit matrix. However, there was a significant difference in when this cut-off point was reached when comparing the 16-bit representations to the 32-bit representations.

The figures below (figures 4.1 - 4.6) illustrate the errors of the posit in blue and the float-754 in red, represented in 16-bit as well as 32-bit, with the averaged error on the y-axis and the number of dimensions of the square matrices on the x-axis. As can be seen from the plot in figure 4.1, the error of the blue posit mark surpassed the red float-754 error at an early stage. What this means is that with a matrix chain of 5 NxN matrices, after several increments of N, the accuracy of the posit matrix was lower than that of the float-754 matrix. This point was reached at around a dimension of 12; after it, the error of the posit matrix exceeded that of the float-754.

Figures 4.2 and 4.3 were reiterations of the same experiment but with matrix chains of 10 and 15 matrices respectively, as was illustrated in table 3.1 in the method section. From the plot in figure 4.2, it again appears that after a dimension of 12 the posit matrix generated an error that was larger than that of the float-754 matrix. As can be seen from figures 4.2 and 4.3, with a longer matrix chain the error of the posit, compared to the float-754, was much larger.


In summary, in the case of the 16-bit representations, the posit did not fare well in comparison to the float-754 format in higher dimensions. In the cases where the posit outperformed the float-754, the difference was small, as can be seen in the first plot, figure 4.1. However, when looking at the 32-bit matrices the result was entirely different, as can be seen in figure 4.4. Figure 4.4 shows a test with a matrix chain of five matrices in 32-bit, with the posit matrix again in blue and the float-754 matrix in red. In figure 4.4 the posit matrix outperformed the float-754 matrix across all measured points. The posit format dominated this test.

In figure 4.5, more complexity was added by increasing the number of matrices to 10. As can be seen in the figure, the 32-bit posit had a lower Euclidean error than the float-754 across all dimensions for a matrix chain of 10 matrices, yet the max error was larger for the posit. When the matrix chain was extended further to 15 matrices, seen in figure 4.6, the 32-bit posit was eventually surpassed by the 32-bit float-754 in terms of Euclidean accuracy as well. For long matrix chains, even the 32-bit posit was eventually surpassed by the float-754 matrix in terms of both norms.


4.1 16-bit representations

Figure 4.1: 5 NxN matrix multiplications in 16-bit representations.

Figure 4.2: 10 NxN matrix multiplications in 16-bit representations.

Figure 4.3: 15 NxN matrix multiplications in 16-bit representations.

4.2 32-bit representations

Figure 4.4: 5 NxN matrix multiplications in 32-bit representations.


Figure 4.5: 10 NxN matrix multiplications in 32-bit representations.

Figure 4.6: 15 NxN matrix multiplications in 32-bit representations.


4.3 A summary of the results

Table 4.1 below lists all the constellations of matrix chain lengths and bit formats and what the results were. ”Better” in this context simply means that the error was lower and the accuracy as such higher.

Chain length   16-bit formats                     32-bit formats
5 matrices     Float-754 eventually better        Posit better
10 matrices    Float-754 eventually much better   Stalemate
15 matrices    Float-754 eventually much better   Float-754 eventually better

Table 4.1: Summary of averaged runs on different matrix chains

The main observation was that increasing the length of a matrix chain, as well as the number of dimensions, meant that the float-754 matrix eventually generated less error than the posit matrix. In the case of the 32-bit matrices, the posit matrix had an overall lower error, and as such a higher accuracy, across a larger interval. However, even in the 32-bit case, the float-754 was eventually more accurate when presented with a matrix chain of 15 matrices, as can be seen in table 4.1.

Regarding the 32-bit entry for 10 matrices that says ”stalemate” in table 4.1, this refers to figure 4.5. In figure 4.5, the Euclidean error of the posit was less than that of the float-754, but the max error was slightly greater at high dimensions. This means that one or more of the cells in the posit matrix had a larger error, but the error of the matrix as a whole was smaller.

Another observation was that once the posit was outperformed by the float-754, the error disparity between them grew rapidly with more dimensions and longer matrix chains. Looking at figure 4.3 as an example, a long matrix chain results in a posit error that grows quickly after a high dimension is reached, while the error of the float-754 matrix at the same level of complexity is comparatively small. This can be seen when the chain is 10 or 15 matrices long in the 16-bit comparison. In table 4.1 these runs are labeled ”Float-754 eventually much better”.


5 Discussion

5.1 The Results

The results illustrate that, as it pertains to matrix multiplication, a posit matrix can outperform a float-754 matrix, especially at the larger bit size. Presumably, the 32-bit posit is more accurate because the 32-bit float-754 statically assigns more bits to the exponent, which reduces its accuracy severely for numbers close to zero. However, as can be seen even for the 32-bit case in figure 4.6, when the matrix chain is long and the dimensions of the matrices grow high, the float-754 will eventually have a lower error. These results are not that surprising considering that the posit operates well within a certain interval known as the golden zone.

In ”Posits: the good, the bad and the ugly” the authors mention that the golden zone for the posit32 is the range $[10^{-6}, 10^{6}]$ [7]. This could explain the results of the matrix multiplications as well. As the dimensions and the length of the matrix chain increase, each cell in the matrix has to represent a number that is more and more unlikely to stay within the golden zone. This behavior can be observed in the matrix chain of length 10 in figure 4.5: the max error of the posit starts to grow rapidly at high dimensions, which could be because a point has been reached at which some cells of the matrix fall just outside the golden zone. Because each cell in the posit matrix is a posit, it inherits the golden zone property. As was brought up among the negative aspects of the posit in section 2.6, pertaining to the article ”Posits: the good, the bad and the ugly”, when the posit is worse the error can be arbitrarily large [7].

Even if the Euclidean error is smaller for the posit matrix, if one or more cells are outside the golden zone the result could be a large max error, as might be the case in figure 4.5. If a cell has a high max error due to being outside the golden zone, another matrix multiplication might push other cells out of bounds as well, since that specific cell is used in further matrix multiplications.


More complexity, in the form of a longer matrix chain and higher dimensions, will likely lead to cells falling outside the golden zone, which could cause large errors that spread via further multiplication. A matrix of posits must adhere to the properties of the golden zone, which is an explanation of why the posit matrix generated a larger error in a more complex matrix chain.


5.2 Future Work

From the perspective of general-purpose computing, this paper aligns itself with previous work in that it is not safe to simply replace the float-754 with the posit in matrix multiplication, presumably due to the inherent golden zone property. However, from a scientific perspective, with a sensible implementation, the posit matrix can yield a more accurate result, as seen earlier. This subsection elaborates on tangents and applications that were outside the scope of this report but could be important for future work on matrix multiplication using posit matrices.

As it pertains to future work, one similar tangent could be to explore matrices other than those with cells in the range (-1, 1), which were explored in this paper. Matrices where the cells hold values in the range (0, 1) are commonly used in probability to create Markov chains, as mentioned in section 2.2 on matrix multiplication. A Markov chain is dependent on chained matrix multiplications, similar to how the chain was created in this paper, but with no negative numbers and with the same probability matrix (or Markov matrix) repeated throughout the chain. Just like in this paper, Markov matrices are also square matrices, so a benchmark test for Markov matrices could be almost identical to that of this paper.

Given that the result of this paper was that posit matrix multiplication performed better under certain levels of complexity, a posit Markov matrix could perhaps also perform better using the same number of bits. The benchmark test of this paper was, after all, very similar to that of a Markov chain. Markov chains have found their use in a wide variety of fields; gene-modeling was brought up in section 2.2 regarding matrix multiplication. As such, work to further improve the accuracy of Markov matrix multiplication using posit matrices could be worth exploring, if it is explored sensibly with regard to the golden zone property. Markov chains could as such be a useful application of posit matrix multiplication.

Another tangent to explore further could be how the order in which the matrices are chain multiplied affects the error, specifically for posit matrices. One cannot simply swap the places of the matrices because, as was brought up in section 2.2 regarding matrix multiplication, matrix multiplication is not commutative. As a consequence, a re-factorization into something like A*C*D*B does not work, simply because it is not guaranteed to result in the same matrix.

However, the associative property of multiplication could be used. To illustrate: A*B*C*D = A*(B*C)*D. The matrices do not change place, only the order in which the operations are performed, and that is perfectly allowed under the rules of matrix multiplication. Future work could look at how this property could be used in conjunction with posit matrix multiplication to further improve the accuracy of the format. Markov chains and optimization of the multiplication order could thus be two tangents to explore further in posit matrix multiplication.
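As a toy illustration, written for this text rather than taken from the thesis benchmark, the sketch below shows why the grouping can matter numerically in a low-precision format: two mathematically equal groupings round their intermediate products differently and can therefore disagree:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.uniform(-1, 1, (8, 8)).astype(np.float16) for _ in range(3))

# Mathematically (A B) C == A (B C); in float16 each grouping rounds its
# intermediate product differently, so the results can differ slightly.
left = (A @ B) @ C
right = A @ (B @ C)
print(np.max(np.abs(left.astype(np.float64) - right.astype(np.float64))))
```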

However, the use of the quire to complement the posit is probably the most valuable tangent to explore outside of this report. As was mentioned in the section on the scope of this report, the quire was left out for two reasons. One was that it seemed fairer to compare the two formats without the inclusion of the quire, since there is no equivalent of the quire in the float-754 standard. As such, the benchmark test was probably fairer in addressing the research question. The second reason was simply that the inclusion of the quire might have made the method less reader-friendly. Given that the method section already describes five matrix chains, in three different lengths, evaluated using two different norms, it seemed like the inclusion of the quire and fused operations might confuse the reader even more.

As such, the quire was left out of this report. However, the concept of the quire accumulator should be further explored in more sophisticated work as it pertains to matrix multiplication. The results demonstrated that even without its inclusion the posit can be more accurate. The quire is also part of the posit standard, as well as being supported by the SoftPosit library that was used in this report [12]. Furthermore, according to John Gustafson, the quire is a powerful concept that allows the posit to punch above its weight class [6]. Therefore, the quire's inclusion might radically improve the accuracy of matrix multiplication with posit matrices.


6 Conclusion

The conclusion of this paper is that posit matrices can be used in matrix multiplication and produce an error that is smaller than that of the IEEE-754 format, called float-754 in this text, which is the standard format of today. However, caution has to be exercised when replacing float-754 matrix multiplications with posit matrix multiplications. It is important to understand that the accuracy of the posit matrix depends on which numbers the cells are representing, presumably as a result of the golden zone property inherent to posits. If a posit matrix is used in an inappropriate matrix multiplication, the error might be much larger than that of the float-754 matrix, and the error could be arbitrarily large, as previous research on the posit format has noted. Therefore, posit matrices should, as of right now, be used sensibly, as a possible replacement only for specific matrix multiplication problems.

However, looking back at Moore's law, which via John Gustafson motivates the research question of this paper, there is certainly a possibility of using the posit to improve the accuracy of an approximated real number. Important applications like matrix multiplication, where real-number approximations inhabit the cells of the matrix, can in turn also be improved with the posit format; the results of this paper demonstrate that. The potential improvement of using posit matrices in matrix multiplication is a prospect that should be explored further, incorporating ideas such as the quire and deferred rounding that were left out of this work. After all, John Gustafson mentions that the quire is the powerful concept that allows the posit to punch above its weight class. As such, maybe that punching power of the quire is what could truly improve upon Moore's law and the bit-for-bit accuracy as it pertains to matrix multiplication.


References

[1] Muller, Jean-Michel et al. Handbook of Floating-Point Arithmetic. 2018. URL: https://link.springer.com/chapter/10.1007/978-3-319-76526-6_1.

[2] Goldberg, David. What every computer scientist should know about floating-point arithmetic. 1991. URL: https://dl.acm.org/doi/abs/10.1145/103162.103163.

[3] Sauer, Timothy. Numerical Analysis. 2014, pp. 8-14. ISBN: 978-1-292-02358-8.

[4] Gustafson, John and Yonemoto, Isaac. Beating Floating Point at its Own Game: Posit Arithmetic. 2017. URL: http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf.

[5] Klöwer, Milan, Düben, Peter D, and Palmer, Tim N. Posits as an alternative to floats for weather and climate models. 2019. URL: https://dl.acm.org/doi/abs/10.1145/3316279.3316281.

[6] Gustafson, John L. ”Posit Arithmetic”. In: (2017), pp. 5-6, 80-84. URL: https://posithub.org/docs/Posits4.pdf.

[7] Dinechin, Florent de et al. Posits: the good, the bad and the ugly. 2019. URL: https://hal.inria.fr/hal-01959581v3/document.

[8] Norman, Daniel and Wolczuk, Dan. SF1624 Algebra and Geometry: Introduction to Linear Algebra for Science and Engineering. 2017, pp. 127-132, 149-154. ISBN: 978-1-78449-244-1.

[9] Grinstead, Charles M. and Snell, J. Laurie. Introduction to Probability. 2003, pp. 405-411. URL: https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf.

[10] Hahn, Brian D. and Valentine, Daniel T. Essential MATLAB for Engineers and Scientists. 2019.

[11] Chien, Steven WD, Peng, Ivy B, and Markidis, Stefano. ”Posit NPB: Assessing the Precision Improvement in HPC Scientific Applications”. In: arXiv preprint arXiv:1907.05917 (2019).

[12] SoftPosit library. URL: https://gitlab.com/cerlane/SoftPosit.


TRITA-EECS-EX-2020:349
