Constructions for efficient MDS diffusion layers

NIKOLAOS TATSIS

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Summary (English)

Matrices are widely used in block cipher diffusion layers, usually chosen for offering maximal branch numbers and for allowing lightweight hardware implementations through their low XOR count. When implemented in software, however, is the XOR count the only metric that matters?

This project will utilize the parallelism provided by modern SIMD vector instructions to evaluate metrics, by implementing different algorithmic approaches.

Timing their executions will hopefully provide some insight into the effect of the immense complexity of modern architectures and software tools on the expected outcome of an algorithm.

A further focus will be the construction of matrices with interesting properties over large extension fields, which can apply in recent white-box cipher designs [BIT16]. This involves implementing a 1971 paper [Mac71], and the translation from mathematical paper to working implementation will involve many coding challenges.

Having both an evaluation of software implementation approaches for matrix diffusion and another method for finding matrices can ultimately help with the design of more efficient block ciphers.


Preface

This thesis was prepared at DTU Compute in fulfilment of the requirements for acquiring an M.Sc. in Engineering, under the supervision and advice of Professor Elmar Wolfgang Tischhauser.

The thesis deals with certain aspects of using vector-matrix multiplication for diffusion in block ciphers. When exploring these mathematical properties and procedures, a full implementation in different environments helps with both understanding and insight. This approach pushed me towards this topic.

Lyngby, 23-June-2017

Nikolaos Tatsis


Contents

Summary (English)

Preface

1 Introduction
  1.1 Organization
  1.2 Tools Used
  1.3 Block Ciphers
  1.4 Substitution-Permutation Network
  1.5 Finite Fields
  1.6 Diffusion Efficiency Metrics
    1.6.1 Hamming Weight
    1.6.2 XOR Count
  1.7 Significant Matrix Classes
    1.7.1 Maximum Distance Separable Matrices
    1.7.2 Circulant Matrices
    1.7.3 Involutory Matrices
    1.7.4 Orthogonal Matrices

2 SIMD Implementation Metrics
  2.1 SIMD Instructions
  2.2 XOR Count Implementation
  2.3 C Implementation Choices
  2.4 SIMD Multiplication Implementations
    2.4.1 Shuffle Multiplication
    2.4.2 Times-Two Multiplication
    2.4.3 XOR Multiplication
  2.5 Data Arrangement Implementations
    2.5.1 Direct Arrangement
    2.5.2 Byte-slice Arrangement
    2.5.3 Bit-slice Arrangement
  2.6 Results

3 Finding COMDS Matrices
  3.1 Process overview
    3.1.1 Key concepts
    3.1.2 Algorithm
  3.2 Implementation
    3.2.1 Optimization
  3.3 Findings

4 Conclusion
  4.1 Ethics and Sustainability
  4.2 Future Work

A SIMD Implementation

B COMDS search

Bibliography


Chapter 1

Introduction

The security of symmetric cryptographic primitives such as block ciphers, hash functions, and authenticated encryption schemes is based on the combination of nonlinear S-boxes and linear diffusion layers as basic building blocks. Optimal diffusion layers require maximal branch numbers, as outlined by the creators of AES [DR13]. This can be achieved using finite field matrices based on maximum distance separable (MDS) linear codes, which are often relatively computationally expensive.

Over the last two decades, many constructions for such MDS diffusion layers have been proposed, often with a particular focus on lightweight implementation characteristics in hardware [SKOP15] [BKL16]. A popular metric for a matrix is the XOR count, defined as the number of XOR operations required to multiply it with a random (column) vector in the same field. This chapter was written to help the reader understand these concepts, and consequently the work shown in the next chapters.

The next chapter aims at analyzing the implementation of MDS diffusion in software, and the applicability of hardware metrics to software approaches.

SIMD instructions have been used to create implementations for similar matrix diffusion algorithms [AFK14], but those implementations were not focused on their extensibility to larger fields, or on their evaluation for different matrices.


Another problem we plan to deal with is the construction of efficient MDS diffusion layers over large extension fields, which has applications in recent white-box cipher designs [BIT16], and provides another tool for cipher design.

1.1 Organization

The thesis is organized as follows: This chapter introduces the key concepts and mathematical tools needed to understand the project. In Chapter 2, an attempt is made to evaluate the performance of vector-matrix multiplication over a finite field using Single Instruction, Multiple Data (SIMD) instructions on modern processors. Chapter 3 presents an implementation of an algorithm designed to find Circulant Orthogonal Maximum Distance Separable (COMDS) matrices over a finite field. Finally, a brief conclusion is presented in Chapter 4, outlining the results and possible directions for future work.

1.2 Tools Used

The C code of this project, provided in the folder code/c/ of [Tat17], was compiled with GCC [Sta01], Versions Ubuntu 5.4.0-6ubuntu1 16.04.4 and Ubuntu 5.2.1-22ubuntu2.

The SIMD multiplication implementations (Section 2.4) were created using the SSE4 (and earlier) instruction sets [Geo07], utilizing 128-bit XMM registers.

The Sage/Python code of this project, provided in the folder code/sage/ of [Tat17], was created and executed using the SageMath [Dev17] system, Version 7.5.1. In this implementation, tools from the NumPy [WCV11] (Version 1.11.1) library were utilized, and Matplotlib [Hun07] (Version 1.5.1) was used to create the figures in this document.

The code was developed on a machine running Ubuntu 16.04.2 LTS with the 4.4.0-81-generic kernel, using an Intel Pentium 2117U CPU @1.80GHz of the Ivy Bridge micro-architecture. This machine had the Ubuntu 5.4.0-6ubuntu1 16.04.4 version of GCC.

Additional benchmarking was executed on the DTU Compute skylake.imm.dtu.dk device, running Ubuntu 15.10 with the 4.2.3-040203-generic kernel, using an Intel Core i7-6700 CPU @3.40GHz of the Skylake micro-architecture.


The Ubuntu 5.4.0-6ubuntu1 16.04.4 version of GCC was installed on this device.

1.3 Block Ciphers

Block ciphers are the prevalent encryption scheme used for symmetric cryptography today. A block cipher uses a number of invertible mathematical operations to encrypt a predefined block of data of fixed size. The provided block of data (referred to as clear-text) is converted into an equally sized block of data (cipher-text) using another secret block of data (key). The same key can be used to decrypt the data, giving the scheme its "symmetric" title, and the operations used are designed to make it almost impossible to extract the clear-text from the cipher-text.

Two properties were introduced by Claude Shannon for the design of a secure cipher: confusion and diffusion [Sha45]. A cipher with the confusion property has each bit in the produced cipher-text depend on multiple parts of the key, making the key extremely hard to guess. Diffusion, on the other hand, refers to the high interdependence between every bit of the clear-text and the cipher-text, such that if a bit is changed (flipped) in the input block, on average half of the bits in the output block will consequently change.

1.4 Substitution-Permutation Network

A Substitution-Permutation Network (SPN) [KD79] is an encryption scheme that has been widely adopted in the design of block ciphers. Examples include AES [DR13], 3-Way [DGV93], SHARK [RDP+96], and PRESENT [BKL+07].

A block cipher using this scheme accepts a block of data of fixed length (dependent on the cipher), and applies a number of rounds of substitution functions (S-Boxes) and permutation functions (P-Boxes) in order to encrypt it. A fixed-length secret key is the other input of the cipher, usually expanded into round keys, which in turn are applied on every round of the block cipher. In order to decrypt, all the functions used are reversible, and the encrypted data has the inverse functions applied in reverse order.

The S-Box substitutes small blocks of the input data with other blocks of the same size. The size of this small block is usually 4-8 bits, as the entire mapping needs to be stored; an n-bit S-Box requires at least 2^n · n bits to store. This substitution function is bijective, with every small input block corresponding to exactly one output block to allow inversion, and both the domain and the image of this function are all possible small blocks of that size. During encryption (or decryption), the entire input block is divided into these smaller blocks, and the S-Box function is individually applied to each.

A good S-Box needs to create the avalanche effect [WT86], where small changes to the input result in big changes in the output. More precisely, if one of the input bits is flipped in any possible small block input, the S-Box output should have about half of its bits flipped.

The P-Box applies a permutation or bit-shuffling to the entire input block, aiming to diffuse the effect of the multiple individual S-Box applications across the entire block. As such, a good P-Box requires the output of each S-Box to affect as many S-Box inputs in the next round as possible.

This encryption process satisfies the properties of confusion and diffusion: A bit changed in the initial block of input is expanded to multiple bits in the S-Box output, and then communicated to input bits of multiple S-Boxes. These two functions, the S-Box and the P-Box, are applied in alternating order for a number of rounds, and, as the rounds progress, the avalanche effect appears, and tiny input modifications are diffused throughout the block. The secret key modifies the output of each round (usually by XORing the round key, as the XOR function is its own inverse), "confusing" the final cipher-text output in a pseudo-random fashion.

In this project, we will be focusing on a prevalent technique used in the implementation of P-Boxes: matrix-vector multiplication in finite fields. This is not necessarily the complete P-Box design: for example, in AES the diffusion is achieved through the combination of the MixColumns operation (matrix-vector multiplication) and the ShiftRows operation (cyclical shifts).

1.5 Finite Fields

Most arithmetic operations presented henceforth will be operations in extensions of the binary finite field, since their elements can be directly represented in computer memory. A finite field, or Galois field, is a finite set of elements on which the operations of multiplication, addition, subtraction, and division are defined, and a number of properties apply: for both addition and multiplication, associativity and commutativity apply, as well as the existence of identity elements


and inverses (except for the multiplicative inverse of zero). Multiplication is also distributive over addition.

A finite field will be denoted as GF(q), where q is either a prime number q = p, or a positive power of a prime, q = p^r, with the field having q elements. The elements of a prime field are the natural numbers modulo p. This modulus operation applies to all operations in the field; for example, in GF(7): (6 + 3) (mod 7) = 9 (mod 7) = 2. In the latter case of q = p^r, we have a (prime) extension field. A similar modulus operation constricts operations in case of overflow, with the field defined by an irreducible polynomial irr(x) of degree r and coefficients in GF(p). The elements are isomorphic to, and represented as, the polynomials in GF(p)[x] (mod irr(x)). Therefore there are again q = p^r elements in the field, represented as polynomials of degree up to r − 1 with coefficients in GF(p).

As aforementioned, we will be focusing on binary extension fields, i.e. GF(2^r)/irr(x). To represent elements in these fields, we will be using different representations, in a similar fashion to [SKOP15]. The polynomial representation of elements (or of the modulus) has binary coefficients, so they can also be shown as binary or hexadecimal strings. For example, in GF(2^4) the element x^3 + x + 1 is equivalent to (1011)_2 and 0xb. Addition in these fields is equivalent to the binary XOR operation, and multiplication by x is equivalent to a logical left shift, possibly followed by an XOR with the modulus. As an example, in GF(2^4)/0x13: (1011)_2 · (0010)_2 (mod (10011)_2) = (10110)_2 (mod (10011)_2) = (10110)_2 ⊕ (10011)_2 = (0101)_2. The choice of the field dimension r is usually the size of the S-Box input and output, or the "small blocks" of Section 1.4. As such, larger fields are not usually considered, due to the same S-Box implementation size limitations.
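To make the shift-and-reduce arithmetic above concrete, the following is a minimal C sketch (illustrative only, not taken from the thesis code) of general multiplication in GF(2^8) by an arbitrary modulus:

    #include <stdint.h>

    /* Multiply a and b in GF(2^8)/mod by shift-and-reduce; mod includes
     * the degree-8 term, e.g. 0x11b for the AES field. */
    uint8_t gf256_mul(uint8_t a, uint8_t b, uint16_t mod) {
        uint8_t acc = 0;
        uint16_t aa = a;
        for (int i = 0; i < 8; i++) {
            if (b & 1)                /* coefficient of x^i in b is set: */
                acc ^= (uint8_t)aa;   /* XOR in the current multiple of a */
            b >>= 1;
            aa <<= 1;                 /* multiply the running value by x */
            if (aa & 0x100)           /* degree-8 overflow: reduce */
                aa ^= mod;
        }
        return acc;
    }

For instance, gf256_mul(0x57, 0x83, 0x11b) returns 0xc1, the standard worked example from the AES specification.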

1.6 Diffusion Efficiency Metrics

This project focuses on using matrix-vector multiplication over a finite field as part of the P-Box diffusion in an SPN. This method allows the use of well-studied mathematical terminology and techniques in order to achieve the highest possible diffusion. The input block can be easily represented as a number of elements in GF(2^r), and then divided into vectors. Multiplying these vectors with a predefined matrix will yield another vector, with all input elements potentially affecting all output elements. These matrices are chosen to be invertible in the chosen field, to allow for reversibility: if a vector v is multiplied with matrix M to produce the diffused vector v′, then multiplying v′ with M^{-1} will yield the original vector v.


The choice of the matrix depends on two factors, once a field has been chosen:

First, a matrix needs to achieve high diffusion, as described in Section 1.4, which is further analyzed in Section 1.7.1. Secondly, since matrix-vector multiplication is a non-trivial operation with high complexity [CW82], the elements forming the matrix need to be carefully chosen. This section describes two commonly used metrics, Hamming Weight and XOR Count.

1.6.1 Hamming Weight

Hamming Weight is a simple metric to approximate the complexity of a hardware implementation of a matrix-vector multiplication. It was used in the design of AES [DR13] to choose the matrix used in the MixColumns operation.

The metric was first introduced as a tool in error-correction codes:

Definition 1.1 [MS77, Definition on p.8] The Hamming Weight of a vector x = (x_1, x_2, ..., x_n) is the number of non-zero x_i, and is denoted as wt(x).

This metric can be easily applied to elements of binary extension fields, represented as strings containing their binary coefficients. It can also be expanded to cover a matrix, as the sum of Hamming Weights of all its entries:

Definition 1.2 Assume the following n × n matrix M, with elements in GF(2^r):

$$M = \begin{pmatrix} \gamma_{11} & \gamma_{12} & \gamma_{13} & \dots & \gamma_{1n} \\ \gamma_{21} & \gamma_{22} & \gamma_{23} & \dots & \gamma_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_{n1} & \gamma_{n2} & \gamma_{n3} & \dots & \gamma_{nn} \end{pmatrix}$$

The Hamming Weight of M will be $wt(M) = \sum_{j=1}^{n} \sum_{i=1}^{n} wt(\gamma_{ij})$.
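Both definitions map directly onto a bit count, as in the following small C sketch (illustrative; not part of the thesis code):

    #include <stdint.h>

    /* wt of a GF(2^r) element: count the non-zero binary coefficients. */
    int wt_element(uint32_t gamma) {
        return __builtin_popcount(gamma);
    }

    /* wt of an n x n matrix stored row-major: sum over all entries. */
    int wt_matrix(const uint32_t *M, int n) {
        int wt = 0;
        for (int i = 0; i < n * n; i++)
            wt += wt_element(M[i]);
        return wt;
    }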

1.6.2 XOR Count

The XOR count is one of the most commonly used metrics for the computational efficiency of a matrix-vector multiplication. Hardware implementations of block ciphers intend to minimize the number of gates required, and the XOR count is directly correlated to that number of gates, or equivalently, to the area required for the implementation [KPPY14]. It was also shown in [KPPY14] that the XOR count is a more accurate metric than Hamming Weight, as elements with a higher Hamming Weight can have a very low XOR count. It has also been shown in [SKOP15] that the choice of the field-defining irreducible polynomial is important in finding good matrices with a low XOR count.

Definition 1.3 Assume that a specific element a in the field GF(2^r) is multiplied with another random element b. The XOR Count for a is the number of bitwise XOR operations required for that multiplication; it is independent of the choice of b, and will be denoted here as XORCount(a).

An easy example to illustrate this concept is the times-two multiplication (a = 2). Let the field be GF(2^8)/0x11b, and let b = (b0, b1, b2, b3, b4, b5, b6, b7)_2 be the random element to be multiplied, where b_i is the ith most significant bit of b.

To multiply, we shift all the bits by one to the left, (b0, b1, b2, b3, b4, b5, b6, b7, 0)_2, which can result in an overflow if b0 is one. Therefore we XOR the modulus multiplied by the most significant bit of b, 0x11b · b0 = (b0, 0, 0, 0, b0, b0, 0, b0, b0)_2, to obtain the final result, (b1, b2, b3, b4⊕b0, b5⊕b0, b6, b7⊕b0, b0)_2. The XOR count for a = 2 in this field, the Rijndael MixColumns field, is XORCount(2) = 3.
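This times-two step is commonly implemented branch-free; a minimal C sketch (illustrative, not from the thesis code) for GF(2^8)/0x11b:

    #include <stdint.h>

    /* Times-two ("xtime") in GF(2^8)/0x11b: one shift plus a conditional
     * XOR of the reduced modulus 0x1b, computed without branches (cf. the
     * timing-attack discussion in Chapter 2). */
    uint8_t xtime(uint8_t b) {
        uint8_t mask = (uint8_t)-(b >> 7);          /* 0xff iff b0 is set */
        return (uint8_t)((b << 1) ^ (0x1b & mask));
    }

Of the four bit positions touched by 0x1b, the least significant lands on the zero bit of the shifted value, leaving the three genuine XORs that give XORCount(2) = 3.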

The XOR count metric can be extended to vector-matrix multiplications, representing the overall number of XOR operations required to multiply an n × n matrix with an arbitrary vector over GF(2^r). The following formula is derived from the one given in [KPPY14].

Assume the following n × n matrix M, with elements in GF(2^r):

$$M = \begin{pmatrix} \gamma_{11} & \gamma_{12} & \gamma_{13} & \dots & \gamma_{1n} \\ \gamma_{21} & \gamma_{22} & \gamma_{23} & \dots & \gamma_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_{n1} & \gamma_{n2} & \gamma_{n3} & \dots & \gamma_{nn} \end{pmatrix}$$

For each row j, the XOR count equals $\sum_{i=1}^{n} XORCount(\gamma_{ji}) + (n^{\star}_{j} - 1) \cdot r$, where $n^{\star}_{j}$ is the number of non-zero elements in the jth row. The second part of the formula, $(n^{\star}_{j} - 1) \cdot r$, is derived from the matrix-vector multiplication algorithm: the non-zero elements of the row have to be XORed together after being multiplied, and each element contains r bits.

Definition 1.4 The XOR count for an n × n matrix M, with elements in GF(2^r), is $XORCount(M) = \sum_{j=1}^{n} \left( \sum_{i=1}^{n} XORCount(\gamma_{ji}) + (n^{\star}_{j} - 1) \cdot r \right)$.

When symmetric block ciphers utilize matrices for diffusion, the XOR count of the matrix is only part of the consideration. Since these ciphers need to implement both encryption and decryption, the XOR count of the inverse matrix M^{-1} is also important.
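As a worked check of this formula, using values stated elsewhere in this thesis: each row of the ANUBISH matrix of Chapter 2 contains the entries (0x01, 0x02, 0x04, 0x06), whose XOR counts over GF(2^8)/0x11d are (0, 3, 6, 13) (Section 2.4.3). With $n^{\star}_{j} = 4$ non-zero entries and r = 8, each row costs 0 + 3 + 6 + 13 + (4 − 1) · 8 = 46 XORs, so XORCount(ANUBISH) = 4 · 46 = 184, matching the value given in Chapter 2.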


1.7 Significant Matrix Classes

For the purpose of using matrix-vector multiplication to achieve diffusion, several classes of matrices have been studied. These classes offer properties that facilitate lightweight hardware and software implementations of the multiplication, and/or a better rate of diffusion.

1.7.1 Maximum Distance Separable Matrices

The concept of Maximum Distance Separable (MDS) matrices is introduced from error correction codes, and the equivalent MDS codes. To explain this concept, first we introduce the following definition of branch numbers:

Definition 1.5 [SKOP15, Definition 3] Assume an n × n matrix M over GF(2^r), and a vector v of length n with elements in GF(2^r). We denote $wt_M(v) = wt(v) + wt(v \cdot M)$, i.e. the Hamming Weight (element-wise, instead of bit-wise) of the vector v plus the Hamming Weight of the vector multiplied by the matrix M. The minimum $wt_M(v)$ amongst all possible non-zero vectors of length n is the branch number of the matrix, or $br(M) = \min_{v \neq 0} wt_M(v) = \min_{v \neq 0} (wt(v) + wt(v \cdot M))$.

The branch number leads to the definition of an MDS matrix:

Definition 1.6 [SKOP15, Definition 3] An n × n matrix M over GF(2^r) is MDS if br(M) = n + 1.

n + 1 is the highest possible branch number for a matrix, as such a matrix is then the non-identity part of the generator matrix of an MDS linear code, which meets the Singleton bound [SKOP15, Definition 4].

Using an MDS matrix for diffusion closely relates to the design goal of a P-Box (Section 1.4): for the output of an S-Box (or input vector element here) to affect as many S-Box inputs as possible (or output vector elements). If the number of GF(2^r) elements in the data block undergoing diffusion is larger than the matrix dimension n, then additional steps are necessary (like the ShiftRows operation in AES).

Since MDS matrices achieve optimal diffusion, they are widely used in ciphers, such as AES [DR13], SHARK [RDP+96], ANUBIS [RB00], and KHAZAD [BR00], and in hash functions, such as WHIRLPOOL [BR11].


In order to check a matrix for the MDS property there are easier ways than iterating through all vectors:

Proposition 1.7 [MS77, Theorem 8 on p.321] [SKOP15, Proposition 2] An n × n matrix M is MDS if and only if every square sub-matrix of M is nonsingular (has a non-zero determinant, or, equivalently, is invertible).

Checking that all the minors, or determinants of square sub-matrices, of an n × n matrix M are non-zero is a computationally expensive process. There are $\binom{n}{k}$ choices of k rows (or columns) to select from the matrix, and therefore $\binom{n}{k}^2$ k × k square sub-matrices. So checking the matrix for the MDS property involves calculating (at most) $\sum_{k=1}^{n} \binom{n}{k}^2$ determinants.
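For small parameters, the branch-number definition can also be checked directly by brute force. A C sketch (illustrative; function names are hypothetical, not from the thesis code) for a 4 × 4 matrix over GF(2^4)/0x13, which only needs to scan the 2^16 − 1 non-zero input vectors:

    #include <stdint.h>

    /* Multiply in GF(2^4)/(x^4 + x + 1) by shift-and-reduce. */
    uint8_t gf16_mul(uint8_t a, uint8_t b) {
        uint8_t acc = 0;
        for (int i = 0; i < 4; i++) {
            if (b & 1) acc ^= a;
            b >>= 1;
            a <<= 1;
            if (a & 0x10) a ^= 0x13;
        }
        return acc;
    }

    /* MDS check via Definitions 1.5/1.6: every non-zero vector v must give
     * wt(v) + wt(v * M) >= n + 1 = 5. */
    int is_mds4(const uint8_t M[4][4]) {
        for (int v = 1; v < 0x10000; v++) {
            uint8_t in[4] = { (uint8_t)(v & 0xf), (uint8_t)((v >> 4) & 0xf),
                              (uint8_t)((v >> 8) & 0xf), (uint8_t)((v >> 12) & 0xf) };
            int wt = 0;
            for (int j = 0; j < 4; j++)
                wt += (in[j] != 0);
            for (int j = 0; j < 4; j++) {       /* out_j = XOR_i in_i * M[i][j] */
                uint8_t out = 0;
                for (int i = 0; i < 4; i++)
                    out ^= gf16_mul(in[i], M[i][j]);
                wt += (out != 0);
            }
            if (wt < 5)
                return 0;                        /* branch number below n + 1 */
        }
        return 1;
    }

Proposition 1.7 gives the cheaper sub-matrix test; this sketch simply favors clarity over speed.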

1.7.2 Circulant Matrices

Definition 1.8 An n × n matrix C is circulant when each row is its preceding row, rotated one element to the right:

$$C = \begin{pmatrix} \gamma_0 & \gamma_{n-1} & \gamma_{n-2} & \dots & \gamma_1 \\ \gamma_1 & \gamma_0 & \gamma_{n-1} & \dots & \gamma_2 \\ \gamma_2 & \gamma_1 & \gamma_0 & \dots & \gamma_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_{n-1} & \gamma_{n-2} & \gamma_{n-3} & \dots & \gamma_0 \end{pmatrix}$$

Circulant matrices are commonly used for diffusion, such as in the AES block cipher [DR13] and the WHIRLPOOL hash function [BR11]. This is because an n × n circulant matrix can have at most n different element multiplications to be implemented, either in software or in hardware. Searching for matrices that offer better diffusion is also easier, as the XOR count is equal for each row of the matrix, allowing for faster evaluation.

1.7.3 Involutory Matrices

Definition 1.9 An n × n matrix M is involutory if it is equal to its inverse, or M = M^{-1} ⇔ M^2 = I, with I the identity matrix.


By using an involutory matrix for diffusion, a cipher can use the exact same hardware or software implementation for the diffusion layer of encryption and decryption. This is a very promising design property. Multiple ciphers use involutory matrices, like ANUBIS [RB00], KHAZAD [BR00] and PRINCE [BCG+12].

1.7.4 Orthogonal Matrices

Definition 1.10 An n × n matrix M is orthogonal if its inverse is equal to its transpose, or M^{-1} = M^T ⇔ M · M^T = I.

An orthogonal matrix, just like an involutory matrix, has the same XOR count as its inverse, allowing encryption with a cipher utilizing it to be of equal performance to its decryption. Since the inverse matrix shares the same matrix entries, encryption and decryption can potentially share parts of the implementation. Thus, there is research on using them for diffusion [ZWS17].

However, due to the prevalence of involutory matrices and the existence of efficient construction methods to produce low XOR count matrices [SKOP15], orthogonal matrices are not widely adopted in ciphers.


Chapter 2

SIMD Implementation Metrics

As mentioned in Section 1.6.2, the space efficiency of hardware implementations of vector-matrix multiplication is highly correlated to the XOR count. In a software implementation, however, it is not as simple. Modern processors execute instructions in a mostly linear fashion, and XOR operations would only take a part of the multiplication process, as all data processed must first be loaded from memory into registers. Processors also support technologies like pipelining, where different parts of the processor can execute parts of previous or upcoming instructions, allowing instructions to effectively execute in under a processor cycle. During compilation, there are a number of optimizations that can be applied, in an attempt to rearrange the instructions required for a specific result in a way that makes the most of such technologies.

In this chapter, we will look at multiple data arrangements, implementation techniques and possible optimizations, to evaluate software implementations of vector-matrix multiplication and their scaling based on the choice of matrix.

A key design decision for a secure implementation is creating code that is resistant to timing attacks. A timing attack (see [WHS12] for an example) involves measuring the time taken for a large number of similar operations to execute, in order to extract information


about the clear-text or key, using statistical analysis. To prevent such attacks, a software implementation needs to execute in the same number of processor cycles independently of the contents of the clear-text (or cipher-text during decryption) and key. This means that we need to avoid branching, or using conditional statements, in the implementation.

Another decision is the choice of field and matrix. For this we chose the 4 × 4 involutory MDS matrix used in the ANUBIS [RB00] cipher, denoted ANUBISH here, and a 4 × 4 involutory MDS matrix found by [SKOP15], denoted OTHERH here.

$$\mathrm{ANUBISH} = \begin{pmatrix} 0x01 & 0x02 & 0x04 & 0x06 \\ 0x02 & 0x01 & 0x06 & 0x04 \\ 0x04 & 0x06 & 0x01 & 0x02 \\ 0x06 & 0x04 & 0x02 & 0x01 \end{pmatrix} \qquad \mathrm{OTHERH} = \begin{pmatrix} 0x01 & 0x02 & 0xb0 & 0xb2 \\ 0x02 & 0x01 & 0xb2 & 0xb0 \\ 0xb0 & 0xb2 & 0x01 & 0x02 \\ 0xb2 & 0xb0 & 0x02 & 0x01 \end{pmatrix}$$

The main choice of field is GF(2^8)/0x11d for ANUBISH, and GF(2^8)/0x165 for OTHERH. In these fields, XORCount(ANUBISH) = 184 and XORCount(OTHERH) = 160. ANUBISH is also used over GF(2^16)/0x1002d, where XORCount(ANUBISH) = 320.

2.1 SIMD Instructions

We will be using Single Instruction, Multiple Data (SIMD) instructions for these implementations. As the name suggests, this is a method to parallelize processing at the instruction level, with the operation defined by a single instruction being applied to multiple variables contained within a larger register. This technology was introduced to personal computers with Intel's MMX instruction set [PW96], equipping Pentium processors with eight 64-bit registers. The number of concurrent operations depends on the "packing", or the logical division of that 64-bit register into smaller integers: a single instruction can be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers.
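As a brief illustration of packing (using SSE2 intrinsics and 128-bit registers rather than MMX, since that is what the rest of this chapter uses), the same 128 bits can be treated as sixteen 8-bit lanes or four 32-bit lanes simply by choosing a different instruction:

    #include <emmintrin.h>   /* SSE2 */

    __m128i add_sixteen_bytes(__m128i a, __m128i b) {
        return _mm_add_epi8(a, b);    /* sixteen independent byte additions */
    }

    __m128i add_four_ints(__m128i a, __m128i b) {
        return _mm_add_epi32(a, b);   /* the same bits as four 32-bit lanes */
    }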

The amount of data that can be processed in any way by a processor instruction is limited by the size of the registers used. Thus, the SIMD instructions were gradually expanded from 64-bit MMX registers to 128-bit XMM registers, 256-bit YMM registers, and 512-bit ZMM registers. The available operations were also expanded, as well as the packing formats and the treatment of the variables as integers or floating-point values.

For this project we chose to use 128-bit XMM registers, due to hardware constraints on the development machine (Section 1.2), and the higher number of different operations compared to larger-register instruction sets. An implementation using larger registers would likely almost double performance. Compiling the same code on a later architecture also improves instruction cycle performance, due to hardware optimizations and the availability of more registers.

2.2 XOR Count Implementation

This project is largely based on the concept of XOR count, and an implementation was required for generating vector-matrix multiplication code in the bit-slice arrangement (Section 2.5.3), as well as for the COMDS search of the next chapter (Section 3.2). This was done in the SageMath [Dev17] system, in the file sage/xorcount.sage of [Tat17].

The XOR count metric applies to any element a in GF(2^r)/mod (Section 1.6.2). A multiplication by a can be decomposed into a number of multiplications by two (bit-wise left shifts followed by reduction, i.e. XORing in the modulus mod), and finally XORing together all the multiplications by two. We also wish to calculate which bits are XORed to produce each bit of the multiplication output, so this process was fully implemented.

For example, let us use the multiplication of a random element b by a = x^3 + x + 1 = 0xb = (1011)_2 in GF(2^4)/(x^4 + x + 1) = (10011)_2. The following table shows the 4 shift and reduction operations required to find the output bits:

Factor (dec)  Factor (bin)  Shift                         Reduction
1             (0001)_2      (0, b0, b1, b2, b3)_2         (b0, b1, b2, b3)_2
2             (0010)_2      (b0, b1, b2, b3, 0)_2         (b1, b2, b0⊕b3, b0)_2
4             (0100)_2      (b1, b2, b0⊕b3, b0, 0)_2      (b2, b0⊕b3, b0⊕b1, b1)_2
8             (1000)_2      (b2, b0⊕b3, b0⊕b1, b1, 0)_2   (b0⊕b3, b0⊕b1, b1⊕b2, b2)_2

As a = (1011)_2, we need to XOR together the reduction results of rows 8, 2, and 1 to get the bits composing each output bit: (b0, b1, b2, b3)_2 · (1011)_2 = (b0, b1, b2, b3)_2 ⊕ (b1, b2, b0⊕b3, b0)_2 ⊕ (b0⊕b3, b0⊕b1, b1⊕b2, b2)_2 = (b0⊕b0⊕b1⊕b3, b0⊕b1⊕b1⊕b2, b0⊕b1⊕b2⊕b2⊕b3, b0⊕b2⊕b3)_2 = (b1⊕b3, b0⊕b2, b0⊕b1⊕b3, b0⊕b2⊕b3)_2. This is done using Python sets containing the bit indices, with the symmetric difference operation combining these sets. From this form, the XOR count is trivial to calculate, as the number of indices in each set minus one, summed over all sets: XORCount(a) = 6.

For smaller fields, all elements can have their XOR count calculated serially.

Larger fields would take an exponentially longer time, so a custom class, delay_dict, containing a Python dict is used; whenever an XOR count is accessed through the __getitem__ method, it is calculated and stored for future reference.
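The same row-based decomposition translates directly to C; the sketch below (illustrative only; the thesis implementation is the SageMath code referenced above) keeps each set of input-bit indices as an r-bit mask, numbering bits from the least significant end (bit i is the coefficient of x^i, the reverse of the b0-first notation above):

    #include <stdint.h>

    int xor_count(uint32_t a, uint32_t mod, int r) {
        uint32_t rows[32] = {0};   /* rows[i]: input bits feeding output bit i */
        uint32_t cur[32];          /* rows of the current multiple of two */
        for (int i = 0; i < r; i++)
            cur[i] = 1u << i;      /* multiplication by 1 is the identity */
        for (int bit = 0; bit < r; bit++) {
            if ((a >> bit) & 1)                  /* this power of two is used */
                for (int i = 0; i < r; i++)
                    rows[i] ^= cur[i];           /* set symmetric difference */
            uint32_t top = cur[r - 1];           /* row overflowing on shift */
            for (int i = r - 1; i > 0; i--)
                cur[i] = cur[i - 1];             /* multiply by x: shift up */
            cur[0] = 0;
            for (int i = 0; i < r; i++)          /* reduce with the modulus */
                if ((mod >> i) & 1)
                    cur[i] ^= top;
        }
        int count = 0;
        for (int i = 0; i < r; i++)              /* set size minus one, summed */
            count += __builtin_popcount(rows[i]) - 1;
        return count;
    }

This reproduces the values above: xor_count(0xb, 0x13, 4) == 6 and xor_count(0x2, 0x11b, 8) == 3.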

Any linear operation in the field can be expressed in a similar bit-wise fashion: every element a ≠ 0 in GF(2^r)/mod has a multiplicative inverse a^{-1}, which can be similarly multiplied with a random element b. Likewise, b can be multiplied with a ratio, b · (c/a) = b · c · a^{-1}. To find this multiplicative inverse, the sets describing the multiplication are expressed as an r × r binary matrix. Multiplying an element, expressed as a vector, by that matrix creates exactly the same result as multiplying the two elements within the field. By inverting this matrix and converting back to a list of sets, we obtain the bits required for a multiplication by the inverse. To multiply by a ratio, or any combination of factors, we express each bit index within each set of one factor as the set of bit indices of the other. Thus, a multiplication by a ratio is again expressed as a list of sets containing the random element's indices to XOR, and the XOR count of that ratio is calculated in exactly the same way. Ratios are considered in Sections 2.4.2 and 2.5.3.

In order to test the correctness of the XOR count implementation, the method assert_XORcount_GF256_correct generates XOR counts for selected moduli in GF(2^8) and compares them to the XOR counts provided in [SKOP15, Table 12].

2.3 C Implementation Choices

When attempting to create efficient C code, a consideration is how to organize the code into functions. On the one hand, function calls can create additional instructions upon compilation, inducing some overhead. On the other hand, placing all the code in a single function will significantly reduce readability, as many parts will be repeated, and make code reuse and debugging harder. When using functions for better organization, another issue is handling function input and output. An easy strategy is passing array pointers as function arguments, to be used for both input and output. Whether this strategy results in additional overhead needed to be evaluated, however.

The overwhelming source of optimization is the GCC compiler itself, through the use of the -O3 flag. This optimization can, however, result in the compiler removing important parts of the code as "dead code", as the compiler can detect when the data produced by the benchmarked operations is not output in any way. To avoid this, the function use_m128i_variable was created, which conditionally prints the contents of the variable. This function is not defined as inline, making it harder for the compiler to detect during static analysis that the printing conditional flag is set to zero.

In order to find out which organization strategy compiles into more efficient code, two were considered and implemented: The first is exclusively using static inline functions in the benchmarked portions of the code (asserted to be as fast as the macro approach: https://gcc.gnu.org/onlinedocs/gcc/Inline.html), with input and output done through array pointers. This results in more readable code, but its efficiency needed to be checked, mainly for the handling of input and output arrays. The other approach was using macro functions, which are expanded by the GCC preprocessor before the application of optimizations. These are harder to read, implement, and debug, but are popular for efficient implementations.

Code developed for mathematical operations is hard to assess for correctness, as more fringe cases are possible. Towards this end, all implemented methods offer an input-output interface in raw byte format through mult_io.c. The test_c_mult SageMath method was also developed, available in the file sage/ffmulttest.sage. This method executes mult_io with random vectors, and checks its output against the same multiplication in SageMath. This does not guarantee correctness, as errors might exist in the SageMath implementation, and all possible vectors and matrices would need to be tested.

To benchmark the code we will use the __rdtsc() function, which compiles into the rdtsc instruction and returns the number of processor clock cycles since the last reset. When code is executed on a modern operating system, there are a multitude of unpredictable delays, like context switches and page faults. To mitigate this, the benchmarked operation is repeated in a loop for a large number of repetitions. By getting the processor cycle count before and after that loop, and dividing by the number of repetitions, we can have a more accurate measurement of the benchmarked operation. The overhead from the loop itself should be equal for all operations, allowing a fairer comparison.
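The benchmarking pattern described above reduces to a small skeleton; this is an illustrative sketch (the constant and the placeholder loop body are hypothetical, not the thesis's harness):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc() with GCC */

    #define REPS 50000000ULL

    int main(void) {
        uint64_t start = __rdtsc();
        for (uint64_t i = 0; i < REPS; i++) {
            /* benchmarked operation goes here, e.g. one vector-matrix multiply */
            __asm__ volatile("" ::: "memory");  /* discourage loop elision */
        }
        uint64_t end = __rdtsc();
        printf("cycles/iteration: %.2f\n", (double)(end - start) / REPS);
        return 0;
    }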



2.4 SIMD Multiplication Implementations

A vector-matrix multiplication can be executed with different multiplication algorithms; three different algorithms were considered, with their applicability dependent on the choice of data arrangement (Section 2.5).

2.4.1 Shuffle Multiplication

This is a method that utilizes the capabilities of the pshufb instruction. This instruction takes as input an XMM register a to act upon and an XMM mask register b, and outputs an XMM register c. Each of the 16 bytes c_i in c is a byte from a, chosen by the index formed by the 4 least significant bits of b_i. Essentially, ∀i ∈ [0, 15], c_i = a_{b_i}. The purpose of this instruction is to reorder the bytes of a, with the capability of repetition, omission, and zeroing if the most significant bit of b_i is set.

This method can also be used as a table lookup, and therefore for multiplication in a binary extension field [PGM13]. GF(2^4) has 16 elements, so the multiplication result of every element by a factor k can be put in order into a, i.e. a_i = i · k. If 16 GF(2^4) elements are supplied in b, then c_i = a_{b_i} = b_i · k. This method can be extended to larger binary extension fields with multiple shuffle calls. In GF(2^8), we need two shuffle calls to multiply 16 elements, the first indexed by the 4 least significant bits of every element, and the second indexed by the 4 most significant bits. XORing the two shuffle results together yields the multiplication result, due to the distributivity property. In GF(2^16), elements can be divided into four 4-bit portions, but two bytes are needed to represent one element, so we would require 8 shuffle multiplications to multiply 16 elements. The number of required shuffles can be reduced for smaller factors.

In general, by using the 128-bit register version of shuffle, in GF(2^{8n}) we can multiply 16 field elements using (up to) 2 · n^2 shuffle calls. However, in order to utilize the shuffles to their full capacity and multiply 16 elements for n > 1, we need additional instructions to arrange the element parts in their correct positions, as well as to rearrange partial multiplication results so that they can be XORed into the final multiplication results. It is also important to avoid setting the most significant bit of the mask to one, as it will zero that byte's output.
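For GF(2^8), the two-shuffle multiplication by a fixed factor k reduces to the following C sketch (illustrative; the lookup tables, holding i · k and (i << 4) · k in the chosen field, are assumed to be precomputed):

    #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */

    /* Multiply sixteen GF(2^8) elements in v by a fixed factor k.
     * tab_lo[i] = i * k and tab_hi[i] = (i << 4) * k in the chosen field. */
    __m128i gf256_mul_shuffle(__m128i v, __m128i tab_lo, __m128i tab_hi) {
        const __m128i nib = _mm_set1_epi8(0x0f);
        __m128i lo = _mm_shuffle_epi8(tab_lo, _mm_and_si128(v, nib));
        __m128i hi = _mm_shuffle_epi8(tab_hi,
                         _mm_and_si128(_mm_srli_epi16(v, 4), nib));
        return _mm_xor_si128(lo, hi);   /* distributivity: k*lo XOR k*hi */
    }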


2.4.2 Times-Two Multiplication

This multiplication method is essentially the same algorithm as the one described in Section 2.2. By shifting the bits of an element left by one, and XORing in the modulus with each of its ones multiplied by the shift overflow bit, we obtain the element times two. By doing this shift up to the position of the most significant non-zero bit of the factor, and XORing together all results where the factor bit is one, we have the multiplication result. Thus in GF(2^{8n}), to multiply 16/n elements by k we need ⌊log2(k)⌋ shifts, and ⌊log2(k)⌋ + wt(k) XORs.
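A single SIMD times-two step over sixteen GF(2^8)/0x11d elements (the ANUBISH field) can be written branch-free with SSE2 intrinsics; a sketch (illustrative, not the thesis code):

    #include <emmintrin.h>   /* SSE2 */

    __m128i gf256_xtime(__m128i v) {
        /* 0xff in every byte whose most significant bit is set */
        __m128i msb = _mm_cmpgt_epi8(_mm_setzero_si128(), v);
        __m128i shifted = _mm_add_epi8(v, v);     /* per-byte left shift by one */
        __m128i red = _mm_and_si128(msb, _mm_set1_epi8(0x1d));
        return _mm_xor_si128(shifted, red);       /* conditional reduction */
    }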

2.4.3 XOR Multiplication

The XOR multiplication is the direct analogue of hardware multiplication using XORs. This is only possible in a bit-slice arrangement (Section 2.5.3). In this arrangement, we have an input register for every bit position i in the element, containing the ith bit of every element. The same applies to output registers. To multiply, each output register is produced by XORing specified input registers, as described in Section 2.2. The XORs required match the XOR count of the factor, and with XMM registers, 128 elements can be multiplied by k with XORCount(k) XOR operations.

Since the number of available registers, the average cycle count of the XMM XOR instruction, and the compiler's instruction rearrangement optimization all affect the performance of this algorithm, a few alternatives were tested.

The first approach is allowing the compiler to optimize as best it can, by just writing the required XORs. The second approach is using ratios (Section 2.2): for example, in ANUBISH we have the factors (0x01, 0x02, 0x04, 0x06), with respective XOR counts (0, 3, 6, 13) over GF(2^8)/0x11d. We can choose to instead multiply the data in place by (1, 2, 2, 6/4), which have XOR counts (0, 3, 3, 11), before XORing the results to the output. This discrepancy in XOR count is caused by the calculation of factors having common intermediate XOR results, which are repeated in the simple approach. The correct choice of ratios can eliminate some common intermediates.

The third approach was to pre-calculate all common intermediates, in an attempt to use the fewest XOR operations possible. This was done in the SageMath method generate_bitslice_xormult_code within the sage/ffmulttest.sage file. To find the intermediates, we first calculate the set of input bits that need to be XORed to produce each output bit. Then all possible input bit pairs are calculated for each set, and the most common pair is replaced by a tmp variable. This process is repeated until no more common pairs can be found, at which point the code is output.

2.5 Data Arrangement Implementations

As we are multiplying GF(2^r) vectors with a 4 × 4 matrix, the vectors will have 4 elements, or be 4r bits long. A 128-bit register can contain 4 vectors for GF(2^8), and 2 vectors for GF(2^16). An assumption is made that these vectors are by default represented as a sequence of bytes containing the binary representations of their elements. We assume large amounts of data need to be processed, and are concurrently available. Different multiplication algorithms (Section 2.4) require the data to be arranged in different formats before they can be applied. Therefore a vector-matrix multiplication involves rearranging the input data into the proper format, applying a multiplication algorithm, and then arranging it back into the default representation.

2.5.1 Direct Arrangement

A direct arrangement refers to directly using the data in the default arrangement, with no need to reorganize. The shuffle multiplication method applies well here, if we are using the registers and instructions to their full capacity. Times-two also applies, but was not considered due to performance.

Only ANUBISH was implemented, since the shuffle multiplication complexity's dependence on the matrix entries is low in GF(2^8). It is constant, as long as each row is a permutation of every other row and has the same number of ones.

Both a GF(2^8) version with 4 input vectors and a GF(2^16) version with 2 input vectors are provided, as well as their macro versions. For the GF(2^16) version, multiplication is done with 6 shuffles instead of 8, as the low absolute values of the entries render certain shuffles obsolete.

First the input vector is multiplied with the required number of shuffle calls, then the result is rearranged with another shuffle so that it lines up with the correct output vector according to the matrix, before XORing it to the output variable. This required shuffling masks and multiplication arrays to be prepared, which was done in SageMath.


Figure 2.1: Default to Byte-Slice Arrangement, each box represents a byte

2.5.2 Byte-slice Arrangement

In a byte-slice arrangement (Figure 2.1) for GF(2^8) elements, 4 128-bit variables containing 16 vectors are accepted as input, and rearranged so that the first input variable contains the first byte of every vector, the second contains each second byte, and so on. This arrangement allows for shuffle multiplication, as we are dealing with bytes, and also allows for times-two multiplication, by applying the same operation on all 16 bytes at once. Both multiplication algorithms were implemented for both the ANUBISH and OTHERH matrices, to showcase the shuffle multiplication's relative independence from the matrix, and the scaling of the times-two multiplication.

Rearranging the input data will impose some overhead. First, a shuffle is applied on each input variable that contains 4 input vectors, so that all first bytes appear first, then the second bytes, and so on. Then a process similar to the _MM_TRANSPOSE4_PS macro in the Intel Intrinsics Guide is applied. This is a matrix transposition, treating the four 128-bit variables as a 4 × 4 matrix with 32-bit elements. Applying these two operations in the reverse order returns the byte-slice to the default arrangement.

Once in the byte-slice arrangement, we can easily implement the vector-matrix multiplication: each input vector can be seen as a vector entry, so we multiply that entry with each matrix entry in the corresponding row, and XOR the result to the output vector entry of the corresponding column.
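In the byte-sliced form, the multiplication for ANUBISH (entries 1, 2, 4, and 6, with 6 = 4 ⊕ 2 over GF(2^8)/0x11d) can be sketched as follows, reusing the gf256_xtime() sketch from Section 2.4.2 (illustrative; the actual implementations are in the thesis code):

    #include <emmintrin.h>

    __m128i gf256_xtime(__m128i v);   /* as sketched in Section 2.4.2 */

    /* in[i] holds byte i of sixteen input vectors; out[j] receives entry j
     * of the sixteen results. ANUBISH is symmetric, so rows equal columns. */
    void anubis_bytesliced(const __m128i in[4], __m128i out[4]) {
        __m128i x1[4], x2[4], x4[4], x6[4];
        for (int i = 0; i < 4; i++) {
            x1[i] = in[i];                          /* in[i] * 0x01 */
            x2[i] = gf256_xtime(x1[i]);             /* in[i] * 0x02 */
            x4[i] = gf256_xtime(x2[i]);             /* in[i] * 0x04 */
            x6[i] = _mm_xor_si128(x4[i], x2[i]);    /* in[i] * 0x06 */
        }
        /* rows of ANUBISH: (1,2,4,6), (2,1,6,4), (4,6,1,2), (6,4,2,1) */
        out[0] = _mm_xor_si128(_mm_xor_si128(x1[0], x2[1]), _mm_xor_si128(x4[2], x6[3]));
        out[1] = _mm_xor_si128(_mm_xor_si128(x2[0], x1[1]), _mm_xor_si128(x6[2], x4[3]));
        out[2] = _mm_xor_si128(_mm_xor_si128(x4[0], x6[1]), _mm_xor_si128(x1[2], x2[3]));
        out[3] = _mm_xor_si128(_mm_xor_si128(x6[0], x4[1]), _mm_xor_si128(x2[2], x1[3]));
    }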

2.5.3 Bit-slice Arrangement

The bit-slice arrangement is similar to the byte-slice, but for bits. Thus, in a bit-slice arrangement for GF(2^8) elements, 32 128-bit variables containing 128 vectors are accepted as input, and rearranged so that the first input variable contains the first bit of the first byte of every vector, and so on. This arrangement only allows for XOR multiplication.

The arrangement is a very costly operation; each input variable contains 128 bits, of which only 1 is in its correct position, 3 need to move to other positions, and 124 need to be moved into the other variables, 4 per variable. First, the variables have the byte-slice arrangement applied to them in 8 consecutive groups of 4. They are then rearranged (a free operation), so that the first group only contains the first four bits, the second contains bits 5-8, and so on.

The next step utilizes the sse_trans_slice function from [mis], which rearranges the bits in a 128-bit variable, so that the first two bytes contain all the first bits of every byte, the third and fourth bytes contain all the second bits of every byte, and so on. This function is applied on every input variable.

Now each of the 128-bit variables contains 8 16-bit values. A transpose function is again applied, treating each group as an 8 × 8 matrix of 16-bit values. After another rearrangement, the bit-slice is complete. To turn the bit-slice arrangement back into the default one, the operations are applied in reverse order.

Four implementations were created using the ANUBISH matrix: first, two hand-coded versions using ratios, in both inline and macro fashion. Two other versions were automatically generated: the simple direct XOR sum for each output variable (also done for OTHERH), and one with all intermediates stored in temporary variables.

2.6 Results

Benchmarks were run on both the Skylake and Ivy Bridge machines (Section 1.2), over 50,000,000 iterations for each implementation. The benchmark output is shown in Appendix A. These timings seem to be consistent enough to allow comparison.

The results consistently show very small discrepancies between the macro and inline implementations, with inline usually outperforming. This suggests that inline functions are preferable, due to the additional ease of readability and debugging.

As expected, the byte-slice shuffle implementations perform equally with respect to the matrix used. In the times-two implementations, however, the lower values of the entries in ANUBISH allow it to dramatically outperform OTHERH. Times-two also significantly outperforms shuffle for ANUBISH on Skylake.

The direct implementations using shuffle seem to perform equally on Ivy Bridge and Skylake. In the byte-slice implementations Skylake begins to slightly outperform, and continues to do so in the bit-slice versions. This is expected, as explained in Section 2.1. The direct arrangement is, however, worse than the byte-slice arrangement, and would only apply for small amounts of data.

The byte-slice arrangement (and its reverse) seems to require about 10-13 cycles, while the bit-slice arrangement requires 850-925 cycles. While these are significant overheads, especially for bit-slice, they need to occur only once in a block cipher, assuming all other operations can be implemented in this arrangement. If a block cipher has a large number of rounds, the parallelization provided by SIMD instructions makes diffusion a much less costly operation.

An important note is the comparison of the three XOR multiplication schemes, which can be seen in Table 2.1. The XOR count as defined in the C code is provided, as well as the number of XOR instructions in the compiled assembly code. The ratio implementation is the worst on both machines, as it is both not fully optimized for intermediates and does not allow the compiler to properly optimize it. The intermediates version cannot be further optimized by the compiler with respect to the XOR count, but requires the most variables; this extra variable requirement forces the compiler to load from memory more often, and makes it second in performance. The simple XOR sum performs best on both machines. Therefore, optimization is best left to the compiler. The table also shows that OTHERH performed better than ANUBISH, while having a better XOR count and a worse Hamming Weight. This further supports that XOR multiplication in this arrangement scales according to XOR count.

Table 2.1: XOR Multiplication Optimization Comparison

Method         Matrix   XORCount (code)  XORCount (compiled)  Skylake (cycles/vector)  Ivy Bridge (cycles/vector)
Ratios         ANUBISH  156              143                  0.68                     0.73
Intermediates  ANUBISH  122              122                  0.58                     0.71
Simple XOR     ANUBISH  184              157                  0.49                     0.67
Simple XOR     OTHERH   160              141                  0.46                     0.66


Chapter 3

Finding COMDS Matrices

This chapter describes the process of implementing, in the SageMath [Dev17] system, the method described in F. J. MacWilliams' 1971 mathematics paper, "Orthogonal Circulant Matrices over Finite Fields, and How to Find Them" [Mac71]. The paper provides a computationally efficient way of finding orthogonal circulant matrices over a finite field, which is then further refined here to find the MDS matrices with the least XOR count amongst them. First we will describe the algorithm used, then the implementation process and choices, and finally provide some interesting results.

3.1 Process overview

The paper looks at three different cases for finding n × n circulant matrices over GF(q): the first case is for q = 2 and n odd, the second extends it to q prime and n co-prime to q (not sharing any factors other than 1), and the third is for n = sq. Since we wish to use a binary extension field (q = 2^r) for easy binary representation and operations, and it has been shown that no 2^d × 2^d COMDS matrices exist [GR14], we cannot use the third case to find 2^d × 2^d matrices that offer perfect diffusion. We will instead focus on the second case, which has been shown to also apply to extension fields [JBG94].


3.1.1 Key concepts

The paper first introduces an alternative representation, or isomorphism, of circulant matrices as polynomials. A circulant square n × n matrix C over GF(q),

$$C = \begin{pmatrix} \gamma_0 & \gamma_1 & \gamma_2 & \dots & \gamma_{n-1} \\ \gamma_{n-1} & \gamma_0 & \gamma_1 & \dots & \gamma_{n-2} \\ \gamma_{n-2} & \gamma_{n-1} & \gamma_0 & \dots & \gamma_{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_1 & \gamma_2 & \gamma_3 & \dots & \gamma_0 \end{pmatrix}$$

is fully defined by its first row, or C = circ(γ_0, γ_1, γ_2, ..., γ_{n−1}).

This more clearly shows that there are q^n such circulant matrices. The isomorphism introduced is between the space of all such matrices and the polynomial ring R_n = GF(q)[x]/(x^n − 1). This ring contains all polynomials of degree up to n − 1 with coefficients in GF(q), and therefore has q^n polynomials. The main difference between a ring and a field (explained in Section 1.5) is that the division operation and multiplicative inverses are not (fully) defined in a ring.

The polynomial corresponding to the matrix C is

$$c(x) = \gamma_0 + \gamma_1 x + \gamma_2 x^2 + \dots + \gamma_{n-1} x^{n-1} = \sum_{i=0}^{n-1} \gamma_i x^i.$$

As we are searching for orthogonal matrices, we also need to define C^T in this ring:

$$c(x)^T = \gamma_0 + \gamma_{n-1} x + \gamma_{n-2} x^2 + \dots + \gamma_1 x^{n-1} = \sum_{i=0}^{n-1} \gamma_{n-i} x^i \quad (\text{where } \gamma_n = \gamma_0).$$

This isomorphism carries over all operations between circulant matrices, and we will be using that fact for orthogonal matrices:

$$C \cdot C^T = E \Leftrightarrow c(x) \cdot c(x)^T = 1 \pmod{x^n - 1}$$

We will therefore call a polynomial c(x) orthogonal if c(x) · c(x)^T = 1 (mod x^n − 1).
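The polynomial orthogonality test is easy to state in code as well; the following C sketch (illustrative; gf16_mul is the GF(2^4)/0x13 helper from the Section 1.7.1 sketch, and the thesis's actual implementation is in SageMath) multiplies c(x) by c(x)^T in R_n, using x^n ≡ 1 to wrap exponents:

    #include <stdint.h>

    uint8_t gf16_mul(uint8_t a, uint8_t b);   /* GF(2^4)/0x13, sketched earlier */

    /* c[] holds the first row (gamma_0 .. gamma_{n-1}); n <= 32. */
    int circ_is_orthogonal(const uint8_t c[], int n) {
        uint8_t ct[32], prod[32] = {0};
        for (int i = 0; i < n; i++)
            ct[i] = c[(n - i) % n];                 /* c(x)^T: gamma_{n-i} */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                prod[(i + j) % n] ^= gf16_mul(c[i], ct[j]);   /* x^n == 1 */
        if (prod[0] != 1)
            return 0;
        for (int k = 1; k < n; k++)
            if (prod[k] != 0)
                return 0;
        return 1;                                   /* c(x) * c(x)^T == 1 */
    }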

Another key concept is ideals. The following is a simplified definition of two-sided ideals, as multiplication is commutative in R_n = GF(q)[x]/(x^n − 1).

Definition 3.1 For a ring R and some (defining) element α ∈ R, an ideal A is a subset of R such that ∀β ∈ R, α · β ∈ A.

An easy way to understand an ideal (as used here) is to think of it as the subset of R_n containing all multiples of the defining element.

3.1.2 Algorithm

The first step of the algorithm is decomposing x^n − 1 into its irreducible polynomial factors, $x^n - 1 = \prod_{i=0}^{k} f_i(x)$. If $f_i(x) = \sum_{j=0}^{w} \gamma_j x^j$ is an irreducible factor, then so is the normalized reciprocal polynomial $f_i^{\star}(x) = \sum_{j=0}^{w} (\gamma_{w-j}/\gamma_0) x^j$. If f_i(x) = f_i^{\star}(x), we will call that an unpaired factor, and if f_i(x) ≠ f_i^{\star}(x) we will call them a pair of factors, for easy reference. x − 1 is always one of the unpaired factors, and so is x + 1, if q is odd and n is even. We denote by d_i the degree of f_i(x); d_i is even for unpaired factors, except for x − 1 and x + 1. Since n and q are chosen to be co-prime, x^n − 1 has no multiple zeroes, i.e. no factor appears more than once.

For each factor f_i(x) we form the ideal A_i (or A_{f_i(x)}), using the polynomial (x^n − 1)/f_i(x) as its defining element. As with factors, ideals are either paired or unpaired, and A_i will contain q^{d_i} polynomials, including the zero polynomial. To iterate through all polynomials of A_i, we iterate lexicographically through all polynomials e(x) of degree less than d_i, and e(x) · (x^n − 1)/f_i(x) is the corresponding polynomial within the ideal.

Each ideal also contains a multiplicative unit E_i(x), such that ∀β(x) ∈ A_i, E_i(x)β(x) = β(x). It is also the only polynomial within the ideal (apart from the zero polynomial) to be its own square, which is the test we perform to find it while iterating through all polynomials of the ideal. For paired ideals, with f_i(x) and f_j(x) = f_i^{\star}(x), we have that E_i(x) = E_j(x)^T. In order to find the orthogonal polynomials, a key point is that no single o(x) ∈ A_i, i ∈ [0, k], exists such that o(x) is orthogonal [Mac71, Lemma 1.7]. Instead, for a polynomial o(x) ∈ R_n to be orthogonal, it needs to be a sum of specific polynomials, one from each ideal. We will denote these ideal members as o_i(x) for ease of reference:

$$o(x) \in R_n \text{ orthogonal} \Leftrightarrow o(x) = \sum_{i=0}^{k} o_i(x), \quad o_i(x) \in A_i$$

The definition of o_i(x) differs between unpaired and paired ideals: for an unpaired ideal A_i, o_i(x)o_i(x)^T = E_i(x). For paired ideals A_i, A_j, ∀o_i(x) ∈ A_i ∃o_j(x) ∈ A_j such that o_i(x)o_j(x)^T = E_i(x).

In order to find all orthogonal polynomials in R_n, we first need to find all o_i(x). In an unpaired ideal A_i, there are q^{d_i/2} + 1 such elements. To find them, we search through all polynomials c_i(x) ∈ A_i, checking for each whether c_i(x)c_i(x)^T = E_i(x). The ideals formed by x − 1 and x + 1 are special: A_{x−1} contains 1 o_i(x) for q even and 2 o_i(x) for q odd, while A_{x+1} contains 2 o_i(x) if x + 1 is a factor of x^n − 1 (q odd and n even). This is the source of the factors 1, 2 and 4 in [JBG94, Theorem 1].

In a pair of ideals A_i, A_j, every polynomial of one has a corresponding polynomial in the other such that o_i(x)o_j(x)^T = E_i(x), except for the zero polynomial. Paired ideals therefore have q^{d_i} − 1 = q^{d_j} − 1 pairs of o_i(x), o_j(x). To find all pairs, for every polynomial o_i(x) ∈ A_i its corresponding o_j(x) needs to be found amongst the q^{d_j} − 1 elements of A_j.

After finding all o_i(x), all combinations of choices of o_i(x) from all A_i form all the circulant orthogonal matrices. By checking these for the MDS property, we can finally find all COMDS matrices amongst them.

3.2 Implementation

The algorithm outlined above was implemented in the SageMath [Dev17] system, in the file sage/comds.sage of [Tat17]. SageMath was chosen as it offers native support for finite fields and their elements, polynomials and polynomial rings, and matrices with any type of coefficients, as well as the full capabilities of the Python language.


3.2.1 Optimization

We wish to find the MDS matrix of least XOR count amongst them, and since finding the XOR count is faster than checking for the MDS property (Section 1.7.1), we can optimize the search: the XOR count of a circulant matrix can be derived from the coefficients of its polynomial representation, and the MDS property then only needs to be checked if a matrix improves on the best XOR count found so far. This significantly reduces the time needed for the search, compared to checking all matrices for the MDS property.

During this search, there are significant bottlenecks, or operations that scale exponentially with the field degree. The three main problems identified were the search for the multiplicative units E_i(x), finding the o_i(x) in an unpaired ideal, and finding the o_j(x) corresponding to a specified o_i(x) in a pair of ideals.

The first implementation step to address these bottlenecks is moving the search for the o_i(x) polynomials from pre-computation to running concurrently with the search. This is done mainly by implementing Python generators for the two types of search. This however does not work by itself: if the obvious choice of itertools.product is used to create all combinations, it will convert the provided iterables into arrays, even though it does not store intermediaries. Thus providing itertools.product with generators negates most of their usefulness. To amend this, the custom implementation cache_gen_product was created, inspired by a Stack Overflow answer [sh], which does not store intermediaries, uses generators properly, and also stores elements that have already been generated.

In order to find the multiplicative units E_i(x), an optimization was found using the echelon_form method of matrices in SageMath. By converting the defining polynomial of an ideal A_i into its matrix form, converting that matrix into its echelon form, and then converting back to a polynomial, we can somehow directly find E_i(x). This optimization seems to always work for 5 × 5 matrices in GF(2^{2b}), but does not work in most combinations of n and q.

Another optimization was developed based on the capability of SageMath to solve systems of linear equations, via the solve_right method of matrices. This optimization, implemented in solve_AXeqB_circ, helps find the corresponding element o_j(x) in paired ideals in a much more computationally efficient manner. By transforming the chosen o_i(x), E_i(x), and a variable polynomial g(x) into their corresponding matrices O_i, E_i and G, we can attempt to solve the system of equations O_i · G = E_i. Since all matrices are circulant, G and E_i can be expressed as vectors without losing any accuracy, as the matrix representation would only repeat the linear equations generated by the first row. Solving O_i · G = E_i for G does not yield a single solution, as O_i is not invertible [Mac71, Proof of Theorem 1.1]. The provided result, when multiplied with the kernel of the matrix, will yield all results, which is a similar search space to the naive approach. However, the first solution returned by SageMath is somehow a polynomial that, when multiplied by the defining element of the ideal A_j, yields o_j(x). This optimization seems to always work for 5 × 5 matrices in GF(2^{2b}), but does not work in most combinations of n and q.

When searching for COMDS matrices in a large field, memory is also an issue. The o_i(x) need to be stored to avoid recomputing them, which is done through the cache_gen_product method. The same method is used in many places to iterate through polynomial elements up to a specified degree, using the helper method gen_poly_ring_elements. The memory consumption optimizations are still not free of bugs, and the search will consume inordinate amounts of memory for large fields.

3.3 Findings

The implementation, provided in the find_COMDS_matrices method, has been tested against a number of combinations of n and q, and seems to work. In this section we will mainly focus on 5 × 5 matrices over GF(2^r), as a recent development in white-box cryptography [BIT16] utilizes such a matrix and can provide some comparison. It is important to note, however, that the matrices used for diffusion in the aforementioned paper were chosen to have low Hamming Weights instead of low XOR counts. Results of the search can be seen in Table 3.1.

Despite the size difference, these MDS matrices do not perform as well as the ones found for even-sized matrices by techniques like the one in [SKOP15]. However, that technique does not apply to odd-sized matrices, and should a cipher design require odd-sized matrices, this implementation might be helpful.

Table 3.1: Best 5 × 5 COMDS matrices found over GF(2^r)

Field              XOR Count  Matrix                                                Search
GF(2^4)/0x13       150        circ(0x4, 0x5, 0x2, 0x1, 0x3)                         Exhaustive
GF(2^6)/0x6d       300        circ(0x1, 0x3, 0xc, 0x4, 0xb)                         Exhaustive
GF(2^8)/0x10eb     390        circ(0xa6, 0x29, 0x3, 0x1, 0x8c)                      Exhaustive
GF(2^12)/0x1002d   990        circ(0x8, 0xf5c, 0x86f, 0x10, 0x72a)                  Incomplete
GF(2^16)/0x1002d   2110       circ(0x6386, 0x8086, 0x1488, 0x730f, 0x8486)          Incomplete
GF(2^20)/0x1006f3  3580       circ(0x9e337, 0xa0ea4, 0xf2a73, 0x1e34f, 0xd24ae)     Incomplete
GF(2^24)/0x14d     5540       circ(0x904d, 0xc49454, 0x7c9eb9, 0x6cd91a, 0xd443bb)  Incomplete

The figures in Appendix B show the minimum XOR count compared to the XOR count standard deviation in an exhaustive search over all irreducible moduli. While inconclusive, it does seem that by choosing a field with a higher standard deviation of XOR counts, one is more likely to find lower XOR count MDS matrices. This argument was first presented in [SKOP15].


Chapter 4

Conclusion

This project mainly focused on implementing vector-matrix multiplication with SIMD instructions, and on the COMDS search algorithm. For the former, multiple implementations were compared. When choosing a software implementation, the choice of field and matrix determines the optimal approach: for matrices with large coefficients, shuffle multiplication in a byte-slice arrangement would perform best, as it is mostly independent of the matrix XOR count and Hamming Weight. For matrices chosen due to their low XOR count, a bit-slice arrangement mimicking the hardware XOR multiplication scheme would be optimal, with full compiler optimization. Matrices in larger fields with low Hamming Weight and low absolute value coefficients can also benefit from times-two multiplication.

The COMDS search algorithm was also successfully implemented and shown to produce results. While the matrices found are not useful in most current cipher designs, the implementation would be helpful in a cipher design using odd-sized matrices, or one operating in a non-binary extension field.

4.1 Ethics and Sustainability

As is the case with all scientific research, it is important to consider any ethical implications. This project provides tools and some insight for designing crypto-
