The Performance of Post-Quantum Key Encapsulation Mechanisms: A Study on Consumer, Cloud and Mainframe Hardware


Master of Science in Engineering: Computer Security June 2021

The Performance of Post-Quantum Key Encapsulation Mechanisms

A Study on Consumer, Cloud and Mainframe Hardware

Alex Gustafsson Carl Stensson

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Engineering: Computer Security. The thesis is equivalent to 20 weeks of full time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:

Authors:

Alex Gustafsson
E-mail: algc16@student.bth.se

Carl Stensson
E-mail: casg16@student.bth.se

University advisor:

Prof. Håkan Grahn

Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


Abstract

Background. People use the Internet for communication, work, online banking and more. Public-key cryptography enables this use to be secure by providing confidentiality and trust online. Though these algorithms may be secure from attacks from classical computers, future quantum computers may break them using Shor’s algorithm. Post-quantum algorithms are therefore being developed to mitigate this issue. The National Institute of Standards and Technology (NIST) has started a standardization process for these algorithms.

Objectives. In this work, we analyze what specialized features applicable for post-quantum algorithms are available in the mainframe architecture IBM Z. Furthermore, we study the performance of these algorithms on various hardware in order to understand what techniques may increase their performance.

Methods. We apply a literature study to identify the performance characteristics of post-quantum algorithms as well as what features of IBM Z may accommodate and accelerate these. We further apply an experimental study to analyze the practical performance of the two prominent finalists NTRU and Classic McEliece on consumer, cloud and mainframe hardware.

Results. IBM Z was found to be able to accelerate several key symmetric primitives such as SHA-3 and AES via the Central Processor Assist for Cryptographic Functions (CPACF). Though the available Hardware Security Modules (HSMs) did not support any of the studied algorithms, they were found to be able to accelerate them via a Field-Programmable Gate Array (FPGA). Based on our experimental study, we found that computers with support for the Advanced Vector Extensions (AVX) were able to significantly accelerate the execution of post-quantum algorithms. Lastly, we identified that vector extensions, Application-Specific Integrated Circuits (ASICs) and FPGAs are key techniques for accelerating these algorithms.

Conclusions. When considering the readiness of hardware for the transition to post-quantum algorithms, we find that the proposed algorithms do not perform nearly as well as classical algorithms. Though the algorithms are likely to improve until the post-quantum transition occurs, improved hardware support via faster vector instructions, increased cache sizes and the addition of polynomial instructions may significantly help reduce the impact of the transition.

Keywords: Public-Key Cryptography, Benchmark, x86, IBM Z, z15


Sammanfattning

Bakgrund. Människor använder internet för bland annat kommunikation, arbete och bankärenden. Asymmetrisk kryptering möjliggör att detta sker säkert genom att erbjuda sekretess och tillit online. Även om dessa algoritmer förväntas vara säkra från attacker med klassiska datorer, riskerar framtida kvantdatorer att knäcka dem med Shors algoritm. Därför utvecklas kvantsäkra krypton för att mitigera detta problem. National Institute of Standards and Technology (NIST) har påbörjat en standardiseringsprocess för dessa algoritmer.

Syfte. I detta arbete analyserar vi vilka specialiserade funktioner för kvantsäkra algoritmer som finns i stordator-arkitekturen IBM Z. Vidare studerar vi prestandan av dessa algoritmer på olika hårdvara för att förstå vilka tekniker som kan öka deras prestanda.

Metod. Vi utför en litteraturstudie för att identifiera vad som är karaktäristiskt för kvantsäkra algoritmers prestanda samt vilka funktioner i IBM Z som kan möta och accelerera dessa. Vidare applicerar vi en experimentell studie för att analysera den praktiska prestandan av de två framträdande finalisterna NTRU och Classic McEliece på konsument-, moln- och stordatormiljöer.

Resultat. Vi fann att IBM Z kunde accelerera flera centrala symmetriska primitiver såsom SHA-3 och AES via en hjälpprocessor för kryptografiska funktioner (CPACF). Även om befintliga hårdvarusäkerhetsmoduler inte stödde några av de undersökta algoritmerna, fann vi att de kan accelerera dem via en på-plats-programmerbar grindmatris (FPGA). Baserat på vår experimentella studie fann vi att datorer med stöd för avancerade vektorfunktioner (AVX) möjliggjorde en signifikant acceleration av kvantsäkra algoritmer. Slutligen identifierade vi att vektorfunktioner, applikationsspecifika integrerade kretsar (ASICs) och FPGAs är centrala tekniker som kan nyttjas för att accelerera dessa algoritmer.

Slutsatser. Gällande beredskapen hos hårdvara för en övergång till kvantsäkra krypton, finner vi att de föreslagna algoritmerna inte presterar närmelsevis lika bra som klassiska algoritmer. Trots att det är sannolikt att de kvantsäkra kryptona fortsatt förbättras innan övergången sker, kan förbättrat hårdvarustöd för snabbare vektorfunktioner, ökade cachestorlekar och tillägget av polynomoperationer signifikant bidra till att minska påverkan av övergången till kvantsäkra krypton.

Nyckelord: Asymmetrisk Kryptering, Prestandatest, x86, IBM Z, z15


Acknowledgments

We would like to thank our university advisor Prof. Håkan Grahn for his commitment to keep us on the right path, focusing on the goal of the thesis. His valuable feedback, inspiring words and sense of humor kept us going through stressful times.

We extend our gratitude to Robert Nyqvist for his engaging courses in mathematics and cryptology over the years. As Robert provided us with numerous challenges throughout the years, we would like to return the favor. At the bottom of this page is a small puzzle of sorts.

We wish to thank Anders Westberg and Emma Bachner of IBM for providing us with the opportunity of working with IBM for our thesis. We would also like to thank our external advisor Niklas Dahl of IBM for his time.

Lastly, thank you Marcus Lenander for inspiring us to research the progress of the post-quantum standardization process.

(The puzzle is written in the esoteric programming language Hexagony; its two-dimensional source layout from the original thesis is not reproduced here.)


Contents

Abstract
Sammanfattning
Acknowledgments
1 Introduction
2 Background
  2.1 Cryptography
    2.1.1 Symmetric-Key Cryptography
    2.1.2 Public-Key Cryptography
    2.1.3 Key Establishment
    2.1.4 Key Encapsulation Mechanism
    2.1.5 Forward Secrecy
  2.2 Classical Cryptography
    2.2.1 Applications
    2.2.2 RSA
    2.2.3 Elliptic Curve Cryptography
    2.2.4 Diffie-Hellman
    2.2.5 Elliptic-Curve Diffie-Hellman
    2.2.6 Ephemeral Diffie-Hellman
    2.2.7 Threats to classical cryptography
  2.3 Post-Quantum Cryptography
    2.3.1 The NIST Post-Quantum Standardization Process
    2.3.2 Lattice-Based Cryptography
    2.3.3 NTRU
    2.3.4 Code-Based Cryptography
    2.3.5 Classic McEliece
    2.3.6 Security Categories
  2.4 Architectures and Performance
    2.4.1 RISC and CISC
    2.4.2 Single Instruction Multiple Data (SIMD)
    2.4.3 The x86 Architecture and AVX
    2.4.4 Mainframe Hardware - IBM Z
3 Related Work
  3.1 Experimental Performance Studies
  3.2 Literature on Post-Quantum Characteristics
4 Method
  4.1 Selected Methods
  4.2 Literature Study
  4.3 Experiment
    4.3.1 Goal
    4.3.2 Subjects
    4.3.3 Target Environments
    4.3.4 Phase One - Profiling
    4.3.5 Phase Two - Throughput and Scalability
5 Results and Analysis
  5.1 IBM Z Features for Post-Quantum
  5.2 Identifying Hot Paths
    5.2.1 NTRU
    5.2.2 Classic McEliece
  5.3 The Performance of Post-Quantum Algorithms
    5.3.1 Heap Usage
    5.3.2 Stack Usage
    5.3.3 Parameter Sizes
    5.3.4 Sequential Performance
    5.3.5 Throughput Performance
    5.3.6 Micro-benchmarks
6 Discussion
  6.1 Post-Quantum Cryptography on IBM Z
  6.2 Readiness for the Post-Quantum Transition
  6.3 The Performance of Post-Quantum Key Encapsulation Mechanisms
  6.4 The Security of Post-Quantum Key Encapsulation Mechanisms
  6.5 On Performance Measurements
  6.6 Threats to Validity
    6.6.1 Internal validity
    6.6.2 Construct validity
    6.6.3 Content validity
    6.6.4 Criterion validity
    6.6.5 External validity
7 Conclusions and Future Work
References
Acronyms
Glossary
List of Figures
List of Tables
A Appendix

Chapter 1

Introduction

People worldwide use the Internet every day for a myriad of things. We shop for food, clothes and services, communicate with family and friends and entertain ourselves using streaming services. Businesses and enterprises rely on the Internet not only to serve their ever-growing customer base, but also to transmit confidential information, personally identifiable information, credit card transactions and more. It is imperative that this Internet traffic is kept secure for people and businesses alike. Public-key cryptography is a fundamental technology in providing this security [71]. Used heavily in the Transport Layer Security (TLS) protocol, Virtual Private Networks (VPNs) and other applications, public-key cryptography serves the purpose of ensuring that a party is who they say they are and that encryption keys may be exchanged over insecure channels without ever jeopardizing the confidentiality of the traffic [71].

The algorithms in use today are believed to remain secure against attacks by conventional computers for the foreseeable future. The rise of a new type of computer and a new set of algorithms - the quantum computer and quantum algorithms - has shown that the fundamental security of today’s cryptography is threatened and is likely to be made obsolete in the near future [11, 16, 70].

Today, many protocols use public-key cryptography to exchange a key between two parties. The recommended public-key cryptography suites are based on the assumption that either integer factorization or the elliptic-curve discrete logarithm problem is hard to solve [59, 62]. One of the most prominent threats to these encryption algorithms is Shor’s algorithm [75], which can solve both of the previously mentioned problems. The algorithm is difficult or impractical to run on a classical computer, but can be run efficiently on a quantum computer. Today’s quantum computers are not powerful enough to execute Shor’s algorithm on the large numbers that are used in modern cryptography. As quantum computers become more powerful and available to more people, the threat increases.

The transition to a new set of algorithms that are not built on the same underlying mathematics is becoming increasingly important. These post-quantum algorithms have not yet been standardized. The National Institute of Standards and Technology (NIST) has started their standardization process with an open call for submissions of post-quantum Key Encapsulation Mechanisms (KEMs) and digital signature algorithms. The process has, at the time of writing, gone through three rounds of submissions - with algorithmic changes and performance optimizations made in each iteration. During this standardization process, the security and performance of the submissions have been researched for various use cases and in various environments.

We have identified a lack of research on the performance of these post-quantum KEMs on mainframe hardware.

A mainframe is a uniquely engineered computer that is designed to handle a large amount of data and bulk transactions [47]. Their availability, resilience, high throughput and security are core features [47]. Although they may not be used directly by most people, mainframes such as those running IBM Z are used all around the world to process millions of hotel bookings daily as well as 90% of airline reservations and 90% of credit card transactions made every day [53]. It is critical that personal information, trade secrets and more are kept secure when transferred for processing and storage. The use of these modern cryptography algorithms makes mainframes susceptible to the issues imposed by the progress of quantum computing and the use of Shor’s algorithm. Due to the vast use of mainframes and their reliance on strong public-key cryptography, it is central to our society that the move to post-quantum cryptography can be performed without sacrificing the availability, resilience, high throughput or security of mainframes.

We have investigated whether the transition to post-quantum cryptography can be made in the near future, in terms of the performance of post-quantum KEMs on consumer, cloud and mainframe hardware. With a study of the performance of post-quantum algorithms on various architectures, we provide up-to-date data to shed light on the readiness of hardware for the post-quantum transition. This data may be used by individuals and businesses alike to understand how the transition may impact them.

We also identify what specialized features of a mainframe computer can be utilized to increase the performance of post-quantum KEMs. The following research questions are answered in this thesis.

RQ1 What specialized instructions and features applicable for post-quantum Key Encapsulation Mechanisms are available in IBM Z?

RQ2 Does the performance of post-quantum Key Encapsulation Mechanisms differ between architectures and if so, how?

RQ3 What techniques may be used to increase the performance of post-quantum Key Encapsulation Mechanisms?


The rest of this thesis is structured as follows.

Chapter 2 covers general topics in cryptography and computing related to this work. Not all related topics are described, as the reader is assumed to be accustomed to general terms of computing, such as compilers and how computers work in broad terms. The chapter further provides an overview of the mathematics used in classical and post-quantum Key Exchange Algorithms (KEXs) and KEMs. Furthermore, the chapter describes various computer architectures and performance-related topics.

Chapter 3 discusses research that has been conducted prior to this thesis, on topics related to the performance of post-quantum KEMs or analysis of the performance of various computer architectures.

Chapter 4 describes in detail the method used to provide data and information to help answer the stated research questions.

Chapter 5 presents and analyzes the data and information collected as outlined in the method.

Chapter 6 discusses the results and the validity of the method. Potential areas of improvement in terms of new features and instructions are discussed, based on the results previously gathered.

Chapter 7 concludes the thesis and discusses future work.


Chapter 2

Background

2.1 Cryptography

Delfs et al. [24] state that electronic communication has grown rapidly. They further discuss that this expansion has led to increased requirements on digital confidentiality - keeping data safe from manipulation and prying eyes. Although just a part of the vast field of cryptology, cryptography studies the act of keeping information safe - encrypting it on one end and decrypting it on another. As outlined in [11], cryptographic functions are sorted into two categories - symmetric-key functions and public-key functions, both of which refer to how the keys to the data are used.

In this section, symmetric-key, public-key and related topics will be further discussed to provide an overview of cryptography.

2.1.1 Symmetric-Key Cryptography

Symmetric-key cryptography is where the same key is shared between both parties and used for both encryption and decryption [11]. Symmetric functions can, besides providing confidentiality, also be used to provide authenticity. If only two parties know of the secret key, one of them may request the other to prove possession of it.

A potential problem when using a symmetric key is the fact that the two parties have to somehow securely exchange a secret key [24]. It is not a trivial task to exchange the keys securely, as anyone may intercept their communication channel.

There are many ways of securely exchanging a symmetric key by relying on public-key cryptography.

2.1.2 Public-Key Cryptography

Rivest et al. [72] describe the functions of a public-key cryptosystem as follows. Let EA be an encryption algorithm created by party A, Alice, known by the public. Let DA be a decryption algorithm created and kept secret by Alice. Then, a public-key cryptosystem can be described as DA(EA(M)) = M for any message M. Such a cryptosystem must be efficiently computable. Furthermore, it is assumed that Alice does not compromise DA by revealing EA.

In [72] it is further described that whenever a party B, Bob, wants to send a confidential message M to Alice, they use Alice’s public encryption algorithm and send the ciphertext C received from C = EA(M) to Alice. On the receiving end, Alice may access the plaintext message M by decrypting the ciphertext, M = DA(C) [72].

For the cryptosystem to be considered secure, it must be computationally unfeasible for a malicious party, Eve, to efficiently compute DA from information found in EA [72]. A trivial, but inefficient, way of deciphering a ciphertext C is to enumerate all possible messages M until EA(M) = C.

In practice, each party is in possession of two keys - a public key and a private key [11]. The public key is known to everyone; the private key is known only to the party that owns the key pair. Anyone may encrypt a message using a party’s public key, but only the owner of the private key may decrypt the contents. Conversely, by using the private key to encrypt data, anyone with the public key may decrypt it. This mechanism provides a way to digitally sign a message and can be used to authenticate a party.

2.1.3 Key Establishment

A key establishment protocol, sometimes referred to as a Key Exchange Algorithm (KEX), is a process in which two or more parties securely exchange a shared symmetric key known only to them [14]. Such a protocol should be able to be performed securely over an untrusted communication channel.

One class of key establishment protocols is the key transport protocols. A key transport protocol is a protocol where one party generates a shared key which is then securely transferred to one or more parties [14].

In another class of protocols, the key agreement protocols, the parties jointly influence the outcome of the shared key by deriving the key from information supplied by the involved parties. This should be performed in such a way that no single party can predetermine the resulting shared key on their own [14].

2.1.4 Key Encapsulation Mechanism

A Key Encapsulation Mechanism (KEM) is a form of key transport protocol designed to generate a new random shared key. The key is encapsulated in a form which can only be unpacked by the chosen recipient. Key Encapsulation Mechanisms (KEMs) are often more efficient than general encryption schemes and are therefore used in many key establishment designs [14]. A KEM consists of four sets and three algorithms [14], listed below.

• K_E - a set of public keys for encryption.

• K_D - a set of private keys for decryption.

• R - a randomness set.

• C - a ciphertext set.

• A key generation algorithm which outputs a valid public key K ∈ K_E, as well as a private key K^-1 ∈ K_D.

• (c, k) = Encapsulate_K() - an encapsulation algorithm which takes a public key K and outputs a new symmetric key k as well as the corresponding ciphertext c ∈ C.


• k = Decapsulate_(K^-1)(c) - a decapsulation algorithm which takes a private key K^-1 and a ciphertext c and produces the same symmetric key k as received from the encapsulation algorithm.

To further restrict the definition, we require that if (c, k) is output by the encapsulation function Encapsulate_K(), then k must be output by the corresponding decapsulation function Decapsulate_(K^-1)(c) [14].

Although the notation used in the listing above does not respect the chosen randomness set, KEMs in practice often rely on the use of randomness so that a different (c, k) is returned for each invocation of Encapsulate_K() [14].
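To make the definition above concrete, the following minimal sketch captures the three algorithms and the correctness requirement as a Python interface. The class and function names are illustrative only and are not taken from any NIST submission.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class KeyPair:
    public: bytes    # K in K_E
    private: bytes   # K^-1 in K_D


class KEM:
    """Interface for the three KEM algorithms described above."""

    def generate(self) -> KeyPair:
        """Key generation: outputs a public key K and a private key K^-1."""
        raise NotImplementedError

    def encapsulate(self, public: bytes) -> Tuple[bytes, bytes]:
        """Returns (c, k): a ciphertext c in C and a fresh symmetric key k."""
        raise NotImplementedError

    def decapsulate(self, private: bytes, ciphertext: bytes) -> bytes:
        """Recovers the same k that the encapsulation produced."""
        raise NotImplementedError


def correctness_check(kem: KEM, runs: int = 100) -> None:
    # The defining requirement: Decapsulate_(K^-1)(c) == k for every (c, k)
    # output by Encapsulate_K().
    for _ in range(runs):
        pair = kem.generate()
        c, k = kem.encapsulate(pair.public)
        assert kem.decapsulate(pair.private, c) == k
```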

2.1.5 Forward Secrecy

Forward secrecy is a term used to describe the security of session keys after one or more long-term keys have been exposed [14]. A key establishment protocol is said to provide forward secrecy if the compromise of long-term keys does not compromise any previously exchanged session key.

Key agreement protocols may provide forward secrecy if the long-term keys are only used for authentication of the exchange [14]. Another way to provide forward secrecy is to use ephemeral keys - keys that are only used for a single run of a key establishment protocol.

2.2 Classical Cryptography

Classical cryptography is cryptography in use on classical computers such as consumer PCs, smartphones and cloud servers. In classical cryptography, several algorithms based on RSA and Elliptic-Curve Cryptography (ECC) are used for key exchange.

This section will further describe cryptographic systems in use to keep data confidential from attacks run on classical hardware, as well as the threats they face.

2.2.1 Applications

In Transport Layer Security (TLS), key establishment is used to let parties establish a shared symmetric key [11]. Digital signatures are used to ensure the authenticity of the public keys used in the key establishment. The rest of the communication is secured using symmetric cryptography. TLS may use any of a wide variety of different key establishment algorithms. Some of these algorithms and parameter sets are recommended for use in a pre-quantum era by organizations such as the National Institute of Standards and Technology (NIST) and the Internet Engineering Task Force (IETF), namely X25519 [56], Ephemeral Elliptic-Curve Diffie-Hellman (ECDHE) [59] and Ephemeral Diffie-Hellman (DHE) [59].

The elliptic-curve-based X25519 is used for key exchange in the Virtual Private Network (VPN) WireGuard [26]. The algorithm uses the Curve25519 curve [56]. In some contexts, the X25519 algorithm is known by the name of the curve [10].

In TLS, Elliptic-Curve Diffie-Hellman (ECDH), ECDHE, Diffie-Hellman (DH) and DHE are used to exchange session keys [71]. Other protocols such as SSH also use the same algorithms for key exchange [84]. Some VPNs such as OpenVPN [68] and IPSec [18] also use the same algorithms. Variants of the mentioned key exchange algorithms are also used in messaging applications such as the Signal protocol [22].

2.2.2 RSA

In 1977, Rivest et al. [72] wrote a paper on digital signatures and public-key cryptosystems. In the paper they introduced a public-key cryptosystem based on exponentiation and modulo operations. In the system, Alice first decides on two different primes p and q as well as an integer s which is relatively prime to (p − 1)(q − 1).

Alice makes r = p ∗ q and s public, but keeps p and q secret. As seen in section 2.1.2, the encryption function in a public-key cryptosystem is defined as EA(M). In RSA, the function EA is defined as EA(M) = M^s (mod r), for any message M.

The decryption function is defined as DA(C) = C^t (mod r), where t is solved from s ∗ t = 1 (mod φ(r)) [72]. The value φ(r) is the so-called Euler totient function, which produces the number of positive integers less than r which are relatively prime to r.

The security of the cryptosystem is based on the mathematical problem of factoring a composite number into its prime factors [72]. That is, under the assumption that it is difficult to factorize the number r = p ∗ q into its factors p and q, the system is considered secure.
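As a worked example, the following toy computation follows the notation above with deliberately tiny, insecure primes; real deployments use moduli of thousands of bits. The specific numbers are illustrative only.

```python
# Textbook RSA with toy parameters, mirroring the r, s, t notation above.
p, q = 61, 53                # secret primes
r = p * q                    # public modulus, 3233
phi = (p - 1) * (q - 1)      # Euler totient of r (since p and q are prime)
s = 17                       # public exponent, relatively prime to phi
t = pow(s, -1, phi)          # private exponent, s * t = 1 (mod phi)

M = 65                       # a message encoded as an integer smaller than r
C = pow(M, s, r)             # E_A(M) = M^s mod r
assert pow(C, t, r) == M     # D_A(C) = C^t mod r recovers the message
```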

2.2.3 Elliptic Curve Cryptography

Elliptic curve cryptography is a type of cryptosystem built on finite fields [24]. Its security lies in the assumption that solving the discrete logarithm problem - over the group of points on an elliptic curve rather than the multiplicative group of a finite field Zp - is unfeasible.

Elliptic curves generally yield much denser ciphertexts and require smaller keys compared to RSA, without sacrificing security [24]. Implementations are often very fast and outperform comparable RSA implementations.

To use an elliptic curve for cryptography, one must establish the following domain parameters [24].

• Fq - the finite base field

• a, b ∈ Fq - the parameters of the curve E

• Q, n - the base point Q ∈ E(Fq), whose order is a large prime number n

• h - the cofactor defined by hn = |E(Fq)|

The domain parameters a, b ∈ Fq describe the curve E as y^2 = x^3 + ax + b. Illustrations of these curves may be seen in figure 2.1. Not all curves are suitable for elliptic-curve cryptography, which has led to some being standardized for mass usage [6].

Elliptic curves are used to replace the multiplicative subgroup Zp in other cryptosystems based on discrete logarithms, such as ElGamal and the Digital Signature Algorithm (DSA) [24]. The key difference is that elliptic curves define the group operation as (P, Q) → P + Q ∈ E(Fq) for some curve E and two points on the curve P and Q - that is, the group operation on two points on the curve will result in another point on the curve.


Figure 2.1: Illustrations of y^2 = x^3 + ax + b for various values of a and b

2.2.4 Diffie-Hellman

The DH Key Exchange Algorithm (KEX) is a key establishment protocol conceived by Ralph Merkle and published in 1978 [60]. The work was centered around a novel idea: that anything sent across a communication channel will be intercepted by a malicious party. To enable a secure key exchange under this assumption, the idea was for the two trusting parties to ensure that the malicious party has to perform a much greater amount of work to derive the shared secret than the exchanging parties do. That is, although it is possible for the malicious party to break the key exchange, doing so must be unfeasible in practice.

In the DH protocol, two parties (Alice and Bob) publicly agree on an element g that generates a multiplicative group G [14, 60]. Each party selects a random value, r_A and r_B respectively, in the range between 1 and the order of G. Alice calculates t_A = g^(r_A) and Bob calculates t_B = g^(r_B). The values t_A and t_B are exchanged on a channel that is not required to be secure. Alice and Bob may now both calculate a shared secret Z = g^(r_A r_B) as Z = t_B^(r_A) = t_A^(r_B).

The security of the key exchange comes from the assumption that it is computationally difficult for a malicious party to recover g^(r_A r_B) from the values seen on the public channel [14]. The problem is that of computing a discrete logarithm, which was deemed unfeasible in the pre-quantum era, given a sufficient key size.
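The exchange can be sketched in a few lines of Python. The prime and base below are illustrative toy values, far too small for real use, where groups of 2048 bits or more are recommended.

```python
# Toy Diffie-Hellman exchange mirroring the notation above.
import secrets

p = 4294967291                       # a small prime (2^32 - 5), illustrative only
g = 5                                # public base agreed upon by both parties

r_A = secrets.randbelow(p - 2) + 1   # Alice's secret exponent
r_B = secrets.randbelow(p - 2) + 1   # Bob's secret exponent

t_A = pow(g, r_A, p)                 # sent from Alice to Bob in the clear
t_B = pow(g, r_B, p)                 # sent from Bob to Alice in the clear

Z_alice = pow(t_B, r_A, p)           # Z = t_B^(r_A) = g^(r_A * r_B)
Z_bob = pow(t_A, r_B, p)             # Z = t_A^(r_B) = g^(r_A * r_B)
assert Z_alice == Z_bob              # both parties derive the same shared secret
```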

2.2.5 Elliptic-Curve Diffie-Hellman

The ECDH KEX is an adaptation of DH over elliptic curves [6]. In this case, ECC refers to the use of one of the so-called NIST curves (P-256 etc.).

Alice computes the point P = h · d_A · Q_B, where d_A is Alice’s private key, Q_B is Bob’s public key and h is an ECC domain parameter (the cofactor) [6]. If the point evaluates to the point at infinity, the calculation has failed. Otherwise, let z = x_P be the x-coordinate of P and convert the element z to a byte string Z, the shared secret.
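The point arithmetic involved can be illustrated on a deliberately tiny curve. The curve, base point and key values below are toy assumptions, not a standardized NIST curve, and the cofactor multiplication is omitted for brevity.

```python
# Toy ECDH over the curve y^2 = x^3 + 2x + 3 (mod 97) with base point G = (3, 6).
p, a, b = 97, 2, 3
G = (3, 6)            # base point of small prime order (5)
O = None              # the point at infinity

def add(P, Q):
    # Group law for points on the curve in short Weierstrass form.
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O
    if P == Q:
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def mul(k, P):
    # Double-and-add scalar multiplication.
    R = O
    while k:
        if k & 1:
            R = add(R, P)
        P = add(P, P)
        k >>= 1
    return R

d_A, d_B = 2, 3                           # private keys (random in a real system)
Q_A, Q_B = mul(d_A, G), mul(d_B, G)       # corresponding public keys
assert mul(d_A, Q_B) == mul(d_B, Q_A)     # shared point; its x-coordinate gives Z
```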


2.2.6 Ephemeral Diffie-Hellman

In the case of both DH and ECDH, ephemeral keys may be used to provide forward secrecy. These ephemeral variants are referred to as DHE and ECDHE. As mentioned in section 2.2.1, the use of ephemeral keys is recommended for all versions of DH.

2.2.7 Threats to classical cryptography

Cryptography does not usually come with any guarantee of being secure forever.

Algorithms and parameters are updated continuously to mitigate attacks as they are found and as already known attacks become more practical [59]. As the performance of classical computers has increased throughout the years, some known attacks based on prime factorization and discrete logarithms have become more computationally feasible [79]. The increase in performance has not yet led to any major breakage, as the algorithms in use are projected to withstand thousands of years of attacks using classical algorithms [80].

Since the 1980s, research has been conducted on utilizing quantum mechanics for computation, thus introducing the quantum computer [8]. By utilizing quantum bits, or qubits, instead of the bits used in classical computers, quantum computers are able to represent several states per qubit. This property enables a quantum computer to feasibly perform calculations that have been deemed impossible or impractical on a classical computer [54]. Recently, progress on quantum computers has been increasing exponentially [31].

Parallel to the development of the theoretical quantum computer, algorithms have been developed to make use of these mechanics [54, 75]. Two of these algorithms, Shor’s algorithm and Grover’s algorithm, have been shown to threaten pre-quantum cryptographic systems. They have been shown to be impractical or impossible to implement or use on a classical system, but feasible to use on a quantum computer.

Demonstrating that a quantum computer is able to feasibly perform an algorithm that a classical computer cannot is referred to as Quantum Supremacy [28].

Shor’s algorithm can solve the problems underlying many of the traditional cryptosystems - integer factorization and discrete logarithms - rendering the cryptosystems useless [75]. Today’s quantum computers are however not powerful enough to use Shor’s algorithm to break the security offered by classical cryptography [11]. A study [33] suggests that it would require 20 million so-called noisy qubits to break 2048-bit RSA. Among today’s most powerful quantum computers are Google’s 53-qubit system [3] and IBM’s 65-qubit system [31]. IBM has a 1,000-qubit system on their road map scheduled for release in 2023 [31]. Though these roadmaps suggest that there is a long way to go until the classical cryptography algorithms are broken, some estimates place the quantum advantage in 3-10 years [16, 70].

Grover’s algorithm was first proposed as a way to search in databases in O(√N) quantum operations, where N is the number of available items [35]. However, a more practical application of the algorithm is to find the root of a function [11]. This enables attacks on some cryptosystems, such as AES, cutting the bit security of those cryptosystems in half. The bit security corresponds to the best security a key of n bits can provide under the best known attack. Values for some widely deployed cryptographic systems are presented in Table 2.1. Pre-quantum and post-quantum refer to the bit security of the algorithm in the corresponding epoch.

To mitigate Grover’s algorithm, one may double the size of the AES key. However, some cryptosystems are completely broken by Shor’s algorithm and cannot be repaired by larger keys.

Table 2.1: Security levels for widely deployed cryptographic systems. Based on [11].

Name Function Pre-Quantum Post-Quantum Attack

Symmetric Cryptography

AES-128 block cipher 128 64 Grover

AES-256 block cipher 256 128 Grover

Salsa20 stream cipher 256 128 Grover

GMAC MAC 128 128 -

Poly1305 MAC 128 128 -

SHA-256 hash 256 128 Grover

SHA-3 hash 256 128 Grover

Public-key Cryptography

RSA-3072 encryption 128 broken Shor

RSA-3072 signature 128 broken Shor

DH-3072 key exchange 128 broken Shor

DSA-3072 signature 128 broken Shor

256-bit ECDH key exchange 128 broken Shor

256-bit ECDSA signature 128 broken Shor

The performance increase offered by quantum computers and the threats that Shor’s and Grover’s algorithms impose have led to a new epoch in computing and cryptography - post-quantum.

2.3 Post-Quantum Cryptography

Classical cryptosystems rely on mathematical problems shown to be easy for quantum computers to solve, resulting in their security being diminished or entirely broken [54, 75]. With the dawn of practical quantum computers, a new set of mathematical problems needs to be used to protect from the attacks made feasible by the quantum computers. Cryptography built on such problems is referred to as post-quantum cryptography [66]. Note, however, that such cryptography may still be used by classical computers.

This section further describes post-quantum cryptography, the NIST standardization process and post-quantum cryptosystems.

2.3.1 The NIST Post-Quantum Standardization Process

The National Institute of Standards and Technology (NIST) is an American organization under the Department of Commerce. By advancing measurements, standards and technologies, the institute’s goal is to promote U.S. innovation and industrial competitiveness [63].


Table 2.2: NIST round three finalists [66]

Name Use Type
Classic McEliece Key Encapsulation Mechanism Code-based
Kyber Key Encapsulation Mechanism Lattice-based
NTRU Key Encapsulation Mechanism Lattice-based
SABER Key Encapsulation Mechanism Lattice-based
Dilithium Digital Signature Lattice-based
FALCON Digital Signature Lattice-based
Rainbow Digital Signature Multivariate-based

The organization is split into various divisions [64]. One of these divisions, the Computer Security Division, has assembled the Cryptographic Technology Group.

The group focuses on the topics of cryptographic algorithms such as block ciphers, digital signatures, hash functions and post-quantum cryptography.

The Cryptographic Technology Group has previously held standardization processes for the globally used algorithm suites AES and SHA-3 [65]. On January 3rd, 2017, the Cryptographic Technology Group posted another call for submissions to an open standardization contest, this time for post-quantum cryptography algorithms. The process was estimated to take three to five years with multiple rounds of submissions.

At the time of writing, the process has been ongoing for four years and has reached a third round of submissions [66]. For the third round, NIST published finalists and alternate candidates grouped into public-key encryption and key-establishment algorithms as well as digital signature algorithms. The finalists are presented in Table 2.2. The KEM finalists rely on two types of cryptography - code-based and lattice-based.

NIST has identified that a relevant attack on post-quantum KEMs is a chosen-ciphertext attack [67]. Resistance to such an attack is referred to as IND-CCA2 security.

2.3.2 Lattice-Based Cryptography

A lattice is the set of all integer linear combinations of a set of basis vectors in a Euclidean space [15]. It is defined as follows, where Z denotes the integers and x_1, x_2, ..., x_n is a basis of the Euclidean space R^n.

L = Z x_1 + Z x_2 + ... + Z x_n = { Σ_{i=1}^{n} a_i x_i | a_1, a_2, ..., a_n ∈ Z }

A lattice has several applications in number theory and has been applied to cryptography, where the hard problems associated with lattices are taken advantage of [15]. One significant problem is the shortest vector problem, which revolves around approximating the minimal Euclidean length of a non-zero lattice vector. The problem has been shown to be hard to solve efficiently and is thought to remain hard for future classical and quantum algorithms alike [77].
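The following small Python snippet illustrates the definition by enumerating a finite window of a two-dimensional lattice. The basis vectors and coefficient range are arbitrary illustrative choices, and the brute-force search over the sample only hints at the shortest vector problem, which becomes hard in high dimensions.

```python
# Enumerate integer combinations a1*x1 + a2*x2 of two basis vectors in R^2.
from itertools import product

x1, x2 = (2, 1), (1, 3)                    # an illustrative lattice basis
points = {
    (a1 * x1[0] + a2 * x2[0], a1 * x1[1] + a2 * x2[1])
    for a1, a2 in product(range(-2, 3), repeat=2)
}

# The shortest non-zero vector in this finite sample (by squared length).
shortest = min((v for v in points if v != (0, 0)),
               key=lambda v: v[0] ** 2 + v[1] ** 2)
print(len(points), "sampled lattice points; shortest sampled vector:", shortest)
```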

One of the proposed lattice-based cryptosystems is Kyber, which relies on the learning with errors problem over module lattices being hard to solve [5]. Another lattice-based submission is the SABER cryptosystem, which relies on the learning with rounding problem over module lattices being hard to solve [7].

2.3.3 NTRU

The NTRU KEM is a unification of several lattice-based variants [20]. It has its roots in two submissions to the NIST standardization process: NTRUEncrypt and NTRU-HRSS-KEM. The original paper on the NTRU cryptosystem was published in 1998, following the interest in creating a new, efficient and computationally inexpensive public-key cryptosystem [40]. NTRU then used a mixing system based on polynomial algebra (modulo two numbers p and q) for encryption, while decryption was centered around an unmixing system. The security of NTRU relies on the shortest vector problem being hard to solve [40, 77].

The third-round submission to NIST is a merger of several variants of NTRU [20].

Two of the prominent ones are NTRUEncrypt and NTRU-HRSS-KEM. The NTRUEncrypt algorithm has its origin in the original paper published in 1998. The submission had several issues and lacked a correct recommended parameter set. The NTRU-HRSS-KEM submission brought performance improvements which enhanced the decapsulation routine without sacrificing security. The merged algorithm, NTRU, offers correct parameter sets with trade-offs between performance and security.

2.3.4 Code-Based Cryptography

Code-based cryptosystems take advantage of error-correction codes [11]. Typically, an error-correction code is used to detect or correct a bit flip that has occurred. In cryptography, however, one may use it to encrypt by adding errors, which may then be corrected (decrypted). There are different types of error-correction codes, for example quasi-cyclic codes and Goppa codes [74].
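The error-correction building block can be illustrated with a trivial 3-repetition code in Python. This only demonstrates the idea of adding and correcting errors; it is not a cryptosystem, and Classic McEliece relies on far more powerful binary Goppa codes.

```python
# Encode each bit three times; a majority vote corrects any single flipped bit
# within a group of three symbols.
def encode(bits):
    return [b for b in bits for _ in range(3)]

def decode(codeword):
    return [1 if sum(codeword[i:i + 3]) >= 2 else 0
            for i in range(0, len(codeword), 3)]

message = [1, 0, 1, 1]
codeword = encode(message)
codeword[4] ^= 1                     # introduce a single-bit error
assert decode(codeword) == message   # the error is corrected
```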

2.3.5 Classic McEliece

Classic McEliece is a code-based cryptosystem that uses random binary Goppa codes as the public key. It adds a specific number of errors to the plaintext to encrypt it.

The original McEliece cryptosystem was first proposed in 1979 by McEliece [58] and is the basis of Classic McEliece, together with various improvements made over the years [1]. Decryption uses the private key to perform so-called error correction to correct the errors added in the encryption step [1].

For each configuration of Classic McEliece, there are two versions - systematic form (non-f) and semi-systematic form (f) which are two ways of generating the keypair [1]. The non-f variant is faster at generating the keypair but fails more often and may need several retries. The f variant has a very low probability of failing, but is theoretically slower than the non-f variant. The semi-systematic form results in a loss of up to 2 bits of security, however, which the authors claim is not of concern.

The (µ, ν)-semi-systematic form of Classic McEliece is designed to take both time and success probability into account [1]. The systematic form, where (µ, ν) = (0, 0), yields a lower probability of correctly finding a key than the (µ, ν) = (32, 64) semi-systematic form. The failure probability of the semi-systematic

form is below 2^-30, which in practice means that only one key-generation attempt should be required. The semi-systematic form requires extra computational work per attempt in order to compute the key.

2.3.6 Security Categories

NIST [67] anticipated that submissions to the standardization effort would face uncertainties in terms of estimating the security strength of post-quantum algorithms.

They state that the uncertainties are largely caused by two issues. The first issue is the possibility that new attacks will be discovered. The second is that there is a limited ability to predict how a quantum computer will behave in terms of cost, speed and memory.

NIST [67] therefore defined security categories that are not based on the traditional measure of bit security. The following categories were defined, listed in order of increasing strength.

1. Any attack that breaks the relevant security definition must require computa- tional resources comparable to or greater than those required for key search on a block cipher with a 128-bit key (e.g. AES-128)

2. Any attack that breaks the relevant security definition must require compu- tational resources comparable to or greater than those required for collision search on a 256-bit hash function (e.g. SHA256/ SHA-3-256)

3. Any attack that breaks the relevant security definition must require computa- tional resources comparable to or greater than those required for key search on a block cipher with a 192-bit key (e.g. AES-192)

4. Any attack that breaks the relevant security definition must require compu- tational resources comparable to or greater than those required for collision search on a 384-bit hash function (e.g. SHA384/ SHA-3-384)

5. Any attack that breaks the relevant security definition must require computa- tional resources comparable to or greater than those required for key search on a block cipher with a 256-bit key (e.g. AES-256)

NIST [67] further discuss that the security categories previously presented provide more quantum security than a naïve analysis might suggest. They further state that security categories 1, 3 and 5 are defined in terms of block ciphers - which may be broken by Grover’s algorithm, given a quadratic quantum speedup. The potential impact, presented in Table 2.1, could be that the AES bit security is halved. NIST [67]

further claim that such an attack requires a long-running serial computation, which is difficult to implement in practice. For the attack to be practical, NIST believe that many smaller instances of the algorithm must be run in parallel, making the quantum speedup less dramatic.

In Table 2.3, the security levels of the NIST submissions are presented. Each parameter set is listed together with its claimed security level. Chen et al. [20] present two different models for assessing the security level of NTRU - a local model and a stricter non-local model. Both models are included in the table and labeled accordingly.

Table 2.3: Security levels of various NIST submissions [1, 5, 7, 20]

Algorithm Parameter set Security level
NTRU HPS 2048509 1 (local), - (non-local)
NTRU HPS 2048677 3 (local), 1 (non-local)
NTRU HPS 4096821 5 (local), 3 (non-local)
NTRU HRSS 701 3 (local), 1 (non-local)
Classic McEliece 348864(f) 1
Classic McEliece 460896(f) 3
Classic McEliece 6688128(f) 5
Classic McEliece 6960119(f) 5
Classic McEliece 8192128(f) 5
SABER LightSABER 1
SABER SABER 3
SABER FireSABER 5
Kyber 512 1
Kyber 768 3
Kyber 1024 5

2.4 Architectures and Performance

Computer architectures define the different characteristics of a system. An architecture can use different instruction sets, micro-architectures and system designs. These will, in one way or another, affect the performance of the system.

This section will describe Reduced Instruction Set Computer (RISC) and Complex Instruction Set Computer (CISC), x86 and IBM Z, as well as their respective design decisions.

2.4.1 RISC and CISC

The Reduced Instruction Set Computer (RISC) design methodology is centered around a load/store architecture [30]. This means that only the load and store instructions may access the memory system [19]. The aim is to reduce the amount of complexity in the instruction set and regularize the instruction format. The regularization simplifies the decoding of the instructions with the aim to improve the overall performance [30]. The Complex Instruction Set Computer (CISC) design methodology, on the other hand, is centered around a register/memory architecture [30]. The architecture permits arithmetic and other instructions to read their input from, or write to, the memory system [19].

A CISC computer generally requires fewer instructions than a RISC computer to perform a computation [19]. A CISC computer may therefore perform better than a RISC computer that performs instructions at the same rate. A RISC computer may, however, be implemented at a higher clock rate than a CISC computer, as its instructions can be decoded more efficiently.

The load/store and register/memory natures of RISC and CISC respectively result in memory-related arithmetic requiring fewer instructions on a CISC machine than on a RISC machine [19]. The hardware required for the CISC machine is more complex, however. This trade-off is one of many when comparing the architectures.

Examples of architectures using CISC are x86 and IBM Z. Examples of architectures using RISC are ARM and RISC-V.

2.4.2 Single Instruction Multiple Data (SIMD)

The classification of computer architectures proposed by Flynn [29] describes the four organizations presented below.

• The Single Instruction Single Data (SISD) organization represents the most conventional type of computer. Such an organization is limited by data dependencies. Branching is particularly limiting.

• The Single Instruction Multiple Data (SIMD) organization represents array processors. The performance of such a processor increases as O(log2 N) with respect to the number of data stream processors.

• The Multiple Instruction Single Data (MISD) organization was typically represented by the plug-board type machines no longer in use.

• The Multiple Instruction Multiple Data (MIMD) organization is typically referred to as multi-processors. The organization may be subject to saturation.

Due to the properties of SIMD, it has been an attractive target for software optimization, with practical use seeing significant throughput increases [25].

In order to use the properties of SIMD, the target architecture has to be a vector architecture or one that supports SIMD extensions [38]. Such a target gathers sets of data elements from memory into sequential registers, operates on the vectors of data using a single instruction and then disperses the results back to main memory. In the case of a vector architecture, these vector payloads are heavily pipelined. The memory overhead is constant for an entire vector of operands, as opposed to linear in regular SISD architectures. Furthermore, a piece of code may have to be vectorized to use the properties offered by SIMD. That is, it will need to be written in such a way that the same operation can be performed on multiple units of data. Not all algorithms can be written in such a way [25].

Although usage of a CPU’s vectorization capabilities may require explicit input from a programmer, advanced compilers may in some cases be able to automatically vectorize code in order to make it run more efficiently [25].
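As a high-level illustration of the idea (not of any specific instruction set), the following Python snippet contrasts an element-by-element loop with a single vectorized operation; NumPy's array operations follow this data-parallel model and commonly dispatch to SIMD instructions internally.

```python
# One operation applied across a whole vector versus an explicit scalar loop.
import numpy as np

a = np.arange(100_000, dtype=np.int64)
b = a[::-1].copy()

scalar_sum = [int(x) + int(y) for x, y in zip(a, b)]   # SISD-style loop
vector_sum = a + b                                     # SIMD-style vector operation

assert vector_sum.tolist() == scalar_sum
```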

2.4.3 The x86 Architecture and AVX

The x86 family of architectures includes some of the most popular architectures for consumer and cloud hardware [19]. The family has a long history and includes architectures such as the 32-bit IA-32 and the 64-bit amd64, also known as x86-64. In order to avoid confusion, amd64 will be referred to as x86 in this thesis.

x86 is in theory a CISC, but there has been a fair amount of convergence between RISC and CISC, which makes x86 a hybrid of sorts [19].

In 1996 the x86 architecture saw the addition of the MMX instruction set [39]. The instruction set used the 64-bit floating-point registers of the CPU to enable either eight 8-bit vector operations or four 16-bit operations simultaneously. The instruction set was superseded by SSE in 1999, which added special 128-bit wide registers. This enabled instructions to simultaneously perform sixteen 8-bit operations, eight 16-bit operations, or four 32-bit operations. SSE was in turn superseded by SSE2 in 2001, SSE3 in 2004 and SSE4 in 2007. Each generation saw improvements in available instructions and floating-point performance.

In 2010, AVX was introduced and doubled the width of the registers to 256 bits [39, 51]. This increase effectively doubled the number of simultaneous operations that could be performed. The 256-bit width supported 64-bit operands in parallel. AVX2 and AVX512 superseded AVX. AVX2 extended the number of 256-bit instructions [52] and AVX512 added support for 512-bit wide instructions [48].

AVX2 and AVX512 are power-hungry instructions that make the CPU run hot. To prevent the CPU from overheating, the CPU downclocks its cores during AVX operations [37, 52]. This might have a negative performance impact on non-AVX code that runs directly after AVX-related code.

By using AVX, a developer may optimize a program and make use of the performance properties offered by SIMD [39].

2.4.4 Mainframe Hardware - IBM Z

Mainframe hardware is designed to handle a large amount of data and bulk transactions [47]. To achieve this, mainframes use custom-made CPUs and special co-processors. Further features are availability, resilience, backward compatibility and security.

Many features now seen in consumer platforms such as x86 were first seen in mainframe hardware [53], including virtualization, caches and cache hierarchies for reduced memory latency, as well as branch prediction, out-of-order execution and hardware-assisted encryption.

IBM Z is among the oldest mainframe families, launched as the S/360 in 1964 [53]. Banks, airlines and enterprises worldwide use IBM Z mainframes as part of their IT infrastructure. The Z platform brought the innovation of focusing on an Instruction Set Architecture (ISA), which specifies hardware behavior that software can rely on regardless of the underlying hardware.

With the rise of high-performance workstations and the client-server computing model in the 1990s, IBM refocused the S/390 platform to handle commercial batch and transaction processing workloads [53].

Since its original launch in 1964, the platform has been developed to become one of the most modern large-scale computing platforms [53]. The workloads for which the Z platform is optimized have continued to be characterized by large instruction and data footprints, with a high level of input/output activity.

In 2019, IBM released the z15 [53] - a 64-bit CISC architecture. To further increase performance, focus has shifted from general speedups to providing functions tailored for specific aspects of enterprise computing, such as on-chip acceleration of compression and encryption.

In z15, virtualization is a core feature [57]. The system is designed from the bottom up with virtualization in mind, with specially engineered hardware to handle it. In the first layer of virtualization, the hardware is divided into logical partitions (LPARs) that run their own virtual systems. Each CPU has 12 custom cores with a Simultaneous Multithreading (SMT) grade of 2, which means that each core has two logical cores.

A mainframe contains multiple of these CPUs. Special co-processors are built into the CPU. Examples are the Nest Accelerator Unit (NXU), which accelerates compression, and the Central Processor Assist for Cryptographic Functions (CPACF), which delivers high-speed, low-latency cryptographic functions. CPACF supports multiple variants of DES, AES, SHA and SHAKE. Further, the processor has support for SIMD, which enables vectorization and other performance optimizations.

The z15 mainframe has support for CryptoCards [45]. CryptoCards are IBM’s cryptographic Hardware Security Modules (HSMs). An HSM is a physical device placed into a computer to deliver high throughput for cryptographic functions. The devices are designed to be physically secured from tampering and their firmware may be cryptographically verifiable. CryptoCards are available for the IBM Z, POWER and x86 architectures, depending on the chosen HSM.

At the time of writing, the newest card for IBM Z and x86 is the CEX7S / 4769 [44]. The newest card for the POWER platform is the CEX5S / 4767 [42].

Common to both cards is certification according to Federal Information Processing Standard (FIPS) 140-2 Level 4. The CEX7S / 4769 is also certified according to the Payment Card Industry (PCI) HSM standard [42, 44]. The two cards greatly accelerate the use of common cryptographic algorithms such as AES, SHA-3 and DH [42, 44].


Chapter 3

Related Work

In the following sections, related work for this thesis is discussed. First, work related to our experimental study is presented and discussed. Then, we present information on the underlying primitives and algorithms, as well as potential optimizations as outlined in our literature study.

3.1 Experimental Performance Studies

In [23], Dang et al. evaluated the second round of submissions to the National Institute of Standards and Technology (NIST) standardization process in terms of performance. They analyzed six lattice-based Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) implementations, 12 software-based Key Encapsulation Mechanism (KEM) implementations on hardware platforms, as well as submissions using a software/hardware co-design. They did not focus their performance benchmarks on software implementations. As we solely aim to evaluate the performance of the post-quantum KEMs in software, our focus is different from theirs. Furthermore, we will evaluate the performance of the third round of submissions, which represent almost a year of further progress. Not all algorithms represented in the second round made it through to the third round. Furthermore, several algorithms were changed and some algorithms were merged into one. In their work, Dang et al. did not present data on the software performance of all of the algorithms present in the third round of submissions. For example, the Classic McEliece submission is not presented. Furthermore, Dang et al. only ran the implementations on a single x86 processor, which makes the results less generalizable for processors of different generations and architectures. The method used to measure the performance of the software-based algorithms provided a simplified view of the performance, as it only takes the elapsed time and cycles into account. We believe a more in-depth method of measurement based on accurate hardware-based counters may provide a more detailed and complete picture.

In [21], the aim of the authors was to evaluate the performance of various post-quantum public-key schemes for resource-constrained smart mobile devices in terms of computational time, required memory and power consumption. Though a public-key encryption scheme may be converted into a KEM, their work does not cover such topics. As the main purpose of the NIST standardization process and of our work is to study post-quantum KEMs, we believe one needs to focus on the KEMs themselves, rather than the underlying schemes.


In [55], Kumar and Pattnaik discussed the underlying mathematics of the NIST submissions NewHope, Frodo, NTRU, Kyber, SABER and Classic McEliece.

They further described the classical algorithms Diffie-Hellman (DH) and Elliptic-Curve Diffie-Hellman (ECDH). Their focus on the mathematics and fundamental performance costs provided an up-to-date and low-level comparison of the various KEMs. Given their focus on the algebraic constructs, they did not investigate the practicalities of running the algorithms on real hardware. We believe that it is important to understand the underlying reasons for performance differences found in the submissions, but in order to provide a full understanding of the readiness of today’s hardware, one should study the practical performance when run in various environments.

In [82], Vambol et al. evaluated two post-quantum public-key schemes - McEliece and Niederreiter. As the work was published before the round one NIST submissions were released, it does not use the algorithms that are likely to be standardized. At the time of writing, it has been almost four years since the work of Vambol et al. was published. Since then, a lot of progress has been made on the potential future algorithms, as seen in the various rounds of submissions to NIST. Furthermore, Vambol et al. focused on the characteristics of the underlying cryptosystem for the proposed Classic McEliece KEM; as such, their work does not provide a fair view of the performance characteristics of the KEM itself, which we are interested in. The authors also focused on computational complexity and parameter sizes of the cryptosystems. Though these topics are relevant and interesting, they do not constitute enough evidence of the performance characteristics of the post-quantum KEMs on various hardware, such as the z15 mainframe.

3.2 Literature on Post-Quantum Characteristics

In [12], Bernstein describes the fundamentals of a KEM: a correct public-key encryption scheme is applied to a random input to obtain a ciphertext. Bernstein further describes that it is standard discipline to avoid data flow from secrets to array indices and branch conditions. Although it is simple to construct valid key generation and encapsulation for a KEM, it may be difficult to provide constant-time decapsulation. Some x86 processors support the AES instruction set [2]. This instruction set speeds up AES-related operations and may provide cryptosystems with a high-performance choice for noise sampling. As previously explained in section 2.4.2, x86 also supports the Single Instruction Multiple Data (SIMD) instruction set AVX. Sinha Roy [76] notes that on high-end platforms, SIMD instructions such as those found in AVX2 only provide a limited speedup. On Intel hardware, Sinha Roy measured a 1.5 times higher throughput for an optimized SABER variant, even though the algorithm was expected to achieve nearly four times higher throughput.

One reason behind the identified performance bottleneck is the overhead of vector processing. With improved computer architectures, this overhead can be expected to become lower. Bernstein [9] further highlights four problems with Intel's x86 instruction set that make it hard to implement cryptographic algorithms securely. The first problem is that Intel does not commit to providing an instruction set that will not change in ways that break cryptographic functions. For example, they do not commit to keeping instructions constant-time, which is even more important for vector operations. The second problem is that integer vector multiplication is limited to 25 bits; he wants to be able to use the 53-bit multiplier. The third is that the instruction set does not support a vector version of the CARRY instruction. The last problem is that Intel does not supply any documentation on its pipelines.
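The discipline of avoiding data flow from secrets to branch conditions is typically realized with branch-free selection idioms. The following is a minimal sketch of such an idiom in C; the function name ct_select and its interface are our own illustration and are not taken from any particular submission.

    #include <stdint.h>

    /* Return a if choice == 1 and b if choice == 0, without branching on the
     * secret value choice. The mask is either all-ones or all-zeros, so the
     * running time does not depend on the secret. */
    static uint32_t ct_select(uint32_t a, uint32_t b, uint32_t choice)
    {
        uint32_t mask = (uint32_t)(-(int32_t)choice);
        return (a & mask) | (b & ~mask);
    }

KEM decapsulation routines commonly use this kind of pattern to select between a correctly decrypted value and a fallback value without leaking which case occurred.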

In [83], Classic McEliece is described as making use of matrix multiplication for encryption. As for key generation, Classic McEliece relies heavily on the computation of a random permutation of selected field elements. Besides permutations, matrix-related algorithms such as Gaussian elimination are used during key generation [1]. An operation distinct to the code-based Classic McEliece submission is finding the unique codeword in the Goppa code at a certain distance [1].
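As the matrix multiplication used during encryption is carried out over GF(2), it is typically implemented on bit-packed data. The following is a minimal, unoptimized sketch of such a bit-packed matrix-vector product; the dimensions are placeholders and do not correspond to an actual Classic McEliece parameter set, and real implementations operate on whole machine words or vector registers rather than single bytes.

    #include <stdint.h>
    #include <string.h>

    #define ROWS      768   /* placeholder dimensions */
    #define COL_BYTES 436

    /* s = H * e over GF(2), with rows and vectors packed eight bits per byte. */
    void gf2_matvec(uint8_t s[ROWS / 8],
                    const uint8_t H[ROWS][COL_BYTES],
                    const uint8_t e[COL_BYTES])
    {
        memset(s, 0, ROWS / 8);
        for (unsigned i = 0; i < ROWS; i++) {
            uint8_t parity = 0;
            /* AND row and vector, then fold the parity of the accumulator */
            for (unsigned j = 0; j < COL_BYTES; j++)
                parity ^= H[i][j] & e[j];
            parity ^= parity >> 4;
            parity ^= parity >> 2;
            parity ^= parity >> 1;
            s[i / 8] ^= (uint8_t)((parity & 1u) << (i % 8));
        }
    }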

In [2], it is said that one of the inefficiencies of lattice-based cryptosystems stems from the misconception that high-quality Gaussian noise is crucial for encryption based on learning with errors. Using this Gaussian noise was found to make implementations slower and more complex than they have to be. The Gaussian sampler is also hard to protect against timing attacks. Therefore, one may use a centered binomial distribution instead. This distribution may be further optimized by applying vector instructions such as those found in AVX2. In the case of the NewHope cryptosystem and AVX2, the operations are carried out on unsigned 16-bit integers.
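As a scalar illustration of sampling from a centered binomial distribution, the sketch below consumes 16 uniformly random bits and returns the difference of two 8-bit population counts, matching the NewHope parameter choice. The vectorized AVX2 variants referenced above apply the same idea to many coefficients at once; the function name and types are our own.

    #include <stdint.h>

    /* Centered binomial sample with parameter k = 8: popcount of the low
     * byte minus popcount of the high byte, yielding a value in [-8, 8]. */
    static int16_t cbd8_sample(uint16_t random_bits)
    {
        int16_t a = 0, b = 0;
        for (int i = 0; i < 8; i++) {
            a += (random_bits >> i) & 1;       /* bits 0..7  */
            b += (random_bits >> (8 + i)) & 1; /* bits 8..15 */
        }
        return (int16_t)(a - b);
    }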

To increase the performance of polynomial operations, and the performance of lattice-based cryptosystems as a whole, one may turn to Number-Theoretic Transforms (NTTs) [2]. The NTT provides an efficient way of multiplying polynomials. The NTT, which is a generalization of the Fast Fourier Transform (FFT), has an asymptotically fastest time complexity of O(n log n) [73]. In [2], the bottleneck of the NTT operations was found to be the butterfly operations, each consisting of one addition, one subtraction and one multiplication by a pre-computed constant. Another performance-sensitive topic is the encoding of polynomials in byte arrays [2]. There has been research on exchanging this encoding for one that is particularly well-suited for NTT-based polynomial multiplication.
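A single butterfly of the kind described above can be sketched as follows. The modulus and the use of the % operator are for illustration only; actual implementations pick a modulus that admits the required roots of unity and use Montgomery or Barrett reduction instead of division.

    #include <stdint.h>

    #define Q 12289  /* the NewHope modulus, used here only as an example */

    /* One Cooley-Tukey butterfly: one multiplication by a precomputed twiddle
     * factor, one addition and one subtraction, all modulo Q. */
    static void ct_butterfly(int32_t *a, int32_t *b, int32_t twiddle)
    {
        int32_t t = (int32_t)(((int64_t)(*b) * twiddle) % Q);
        *b = (*a - t + Q) % Q;  /* keep the result non-negative */
        *a = (*a + t) % Q;
    }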

Roberto et al. [5] argue that NTTs can be vectorized very efficiently on large processors. State-of-the-art performance may be achieved by carefully optimizing NTTs using AVX2 integer instructions.

The importance of SIMD instructions such as AVX is further established in [36]. By applying a suitable representation of the polynomials used in lattice-based cryptosystems, one may utilize the 256-bit wide AVX2 registers to represent four double-precision floats. In the case of AVX, one may perform one double-precision vector multiplication and one addition every cycle. It is further established that the main bottleneck might not always be the arithmetic cost. As only 64 polynomial coefficients fit into the 16 available registers in AVX2, many additional loads and stores are often necessary. The performance of these loads and stores is also harder to determine than the arithmetic throughput.
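To illustrate the kind of vector code this representation enables, the sketch below multiplies and accumulates polynomial coefficients stored as doubles, four lanes at a time, using AVX2 and FMA intrinsics. The array length is a placeholder, a real implementation would interleave reductions to keep coefficients small, and compiling requires AVX2/FMA support (for example -mavx2 -mfma with GCC or Clang).

    #include <immintrin.h>

    #define N 1024  /* placeholder coefficient count, a multiple of 4 */

    /* acc[i] += a[i] * b[i] for all i, four doubles per iteration. */
    void pointwise_fma(double *acc, const double *a, const double *b)
    {
        for (int i = 0; i < N; i += 4) {
            __m256d va = _mm256_loadu_pd(&a[i]);
            __m256d vb = _mm256_loadu_pd(&b[i]);
            __m256d vc = _mm256_loadu_pd(&acc[i]);
            /* one fused multiply-add covers four coefficients */
            _mm256_storeu_pd(&acc[i], _mm256_fmadd_pd(va, vb, vc));
        }
    }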

The post-quantum KEM SABER is designed to be serial by nature [76]. The design was chosen to attain simplicity and efficiency on constrained devices. The SABER algorithm relies heavily on pseudo-random number generation implemented using SHAKE-128, which occupies a significant portion of SABER's execution time [76]. Some measurements suggest that around 50-70% of the overall computation time is spent generating pseudo-random numbers [7].
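The role of SHAKE-128 in such schemes is essentially that of a seed expander. A minimal sketch of this use is shown below; it assumes a one-shot shake128(out, outlen, in, inlen) function such as the one provided by common FIPS-202 reference code, and the buffer sizes are placeholders rather than SABER's actual parameters.

    #include <stdint.h>
    #include <stddef.h>

    /* Assumed to be provided by a FIPS-202 implementation. */
    void shake128(uint8_t *out, size_t outlen, const uint8_t *in, size_t inlen);

    #define SEED_BYTES   32
    #define OUTPUT_BYTES 3744  /* placeholder size for a matrix of polynomials */

    /* Expand a short public seed into the pseudo-random bytes from which a
     * polynomial matrix is later parsed. In profiled implementations, calls
     * like this one account for a large share of the overall run time. */
    void expand_seed(uint8_t buf[OUTPUT_BYTES], const uint8_t seed[SEED_BYTES])
    {
        shake128(buf, OUTPUT_BYTES, seed, SEED_BYTES);
    }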

In [88], Zhu et al. discuss the performance of cryptosystems based on learning with rounding, such as SABER. In SABER, one of the main computational bottlenecks is the polynomial multiplication, which cannot be accelerated by using the NTT fast multiplication algorithm. This is because SABER uses a power-of-two modulus while the NTT requires a prime modulus for ciphertexts. When it comes to hardware implementations, it is therefore important to discuss how one may efficiently implement polynomial multiplication without using the NTT. One may exchange the NTT for the Toom-Cook and Karatsuba algorithms. However, high-speed implementations of Toom-Cook multiplication have been found to add additional overhead. For implementations in hardware, a Karatsuba algorithm may be adopted for accelerating learning with rounding, as found in SABER. In one case, a 100 MHz hardware implementation required roughly 5.2 microseconds to encapsulate a key, which was found to be 14 times faster than implementations on a more conventional Intel Core i7 processor. Further efforts have been made to implement SABER in hardware [73]. In [73], a high-speed instruction-set coprocessor for lattice-based KEMs is presented. Just like Zhu et al. [88], Sinha Roy et al. [73] also identified that polynomial multiplication plays a performance-critical role in lattice-based public-key cryptography.
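One level of the Karatsuba recursion replaces four half-size polynomial multiplications with three, at the cost of extra additions. The sketch below shows this split with a schoolbook base case; the size is deliberately tiny, the coefficients are 16-bit integers so that arithmetic wraps modulo 2^16 (matching a power-of-two modulus), and the reduction modulo x^N + 1 used by SABER-like schemes is omitted.

    #include <stdint.h>
    #include <string.h>

    #define N 16  /* small illustrative size, must be even */

    /* Plain schoolbook multiplication of two degree-(n-1) polynomials. */
    static void schoolbook(uint16_t *r, const uint16_t *a, const uint16_t *b, int n)
    {
        memset(r, 0, sizeof(uint16_t) * (2 * n - 1));
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                r[i + j] += a[i] * b[j];  /* wraps modulo 2^16 */
    }

    /* One Karatsuba level: a*b = lo + x^h (mid - lo - hi) + x^N hi. */
    void karatsuba(uint16_t r[2 * N - 1], const uint16_t a[N], const uint16_t b[N])
    {
        const int h = N / 2;
        uint16_t a01[N / 2], b01[N / 2];
        uint16_t lo[N - 1], hi[N - 1], mid[N - 1];

        for (int i = 0; i < h; i++) {
            a01[i] = a[i] + a[h + i];
            b01[i] = b[i] + b[h + i];
        }
        schoolbook(lo, a, b, h);          /* a_lo * b_lo */
        schoolbook(hi, a + h, b + h, h);  /* a_hi * b_hi */
        schoolbook(mid, a01, b01, h);     /* (a_lo + a_hi)(b_lo + b_hi) */

        memset(r, 0, sizeof(uint16_t) * (2 * N - 1));
        for (int i = 0; i < N - 1; i++) {
            r[i]     += lo[i];
            r[i + N] += hi[i];
            r[i + h] += mid[i] - lo[i] - hi[i];
        }
    }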

The post-quantum KEM Kyber uses NTTs as an efficient way of performing multiplications of polynomials in R_q [5]. To increase performance when performing calculations using AVX SIMD instructions, one may keep the factors in bit-reversed order. Kyber further uses SHA3-256, SHA3-512, SHAKE-128 and SHAKE-256. The use of NTTs has some advantages over the Karatsuba and Toom multiplication algorithms: NTTs are extremely fast, do not require additional memory and may be implemented in very little code. The performance of Kyber is largely bound to the performance of the symmetric primitives of the SHA-3 family of algorithms. The hash computations are considerably less performant than the polynomial arithmetic found in Kyber when run on recent Intel processors. In an AVX2-optimized Kyber implementation, the hashes constitute almost half of the encapsulation cycles.
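The bit-reversed order mentioned above is a permutation of coefficient indices. A straightforward, non-vectorized sketch of applying it to a polynomial of 256 coefficients (the degree used in Kyber) is shown below; optimized implementations typically avoid an explicit pass like this by precomputing tables in the required order.

    #include <stdint.h>

    #define NTT_N    256
    #define NTT_LOGN 8

    /* Reverse the lowest `bits` bits of x, e.g. 0b00000110 -> 0b01100000. */
    static unsigned bit_reverse(unsigned x, unsigned bits)
    {
        unsigned r = 0;
        for (unsigned i = 0; i < bits; i++)
            r |= ((x >> i) & 1u) << (bits - 1 - i);
        return r;
    }

    /* Permute coefficients into bit-reversed order, swapping each pair once. */
    void to_bitrev_order(int16_t poly[NTT_N])
    {
        for (unsigned i = 0; i < NTT_N; i++) {
            unsigned j = bit_reverse(i, NTT_LOGN);
            if (j > i) {
                int16_t t = poly[i];
                poly[i] = poly[j];
                poly[j] = t;
            }
        }
    }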

In Kyber, the choice of random number generation during key generation is considered a local decision that each user may make independently [5]. On platforms with hardware support for AES, one may use AES-based generators. On platforms with support for KECCAK-based algorithms, one may instead rely on them for pseudo-random numbers.

In [73], it is said that there are two general methodologies for implementing computationally intensive cryptographic algorithms in hardware. A hardware/software co-design offers a shorter design cycle and higher flexibility, but it may not result in the best performance. The best performance is found in a full-hardware design. By using an instruction-set coprocessor architecture for SABER, the authors achieved programmability and flexibility, which allowed them to easily extend the instruction set and adapt the architecture to other algorithms and tasks. The architecture followed best practices by making the implementation constant-time. On platforms with support for SIMD instructions such as those found in AVX2, the cost of pseudo-random number generation is reduced by vectorizing the implementation by a factor of four. As KECCAK is very efficient on hardware platforms, one may run the algorithm in parallel.

Chapter 4

Method

4.1 Selected Methods

We used two methods to answer our three research questions: a literature study and an experiment.

A literature study was conducted to answer RQ1 and RQ3. We did so as the scope of these questions, as well as our limited access to mainframe hardware, made it difficult for us to perform a practical study of the performance potential of various hardware features.

To provide information to help us answer RQ2, we conducted an experiment in the form of a performance test. An experiment was chosen as we could isolate variables and control the environment in which the study was conducted. Furthermore, we were only interested in the raw performance metrics from different computer platforms. Therefore, a survey would not be relevant, as we were not interested in user experience but rather in the algorithms' performance on various platforms. Neither was a case study relevant, because it does not allow for a controlled environment and is mostly used in social sciences [85].

4.2 Literature Study

As previously mentioned, we aimed to answer RQ1 and RQ3 by conducting a literature study. We limited the scope to focus on z15 and the four finalists in round three of the NIST standardization process.

By studying the algorithms' underlying mathematics, the submission authors' own optimizations and relevant literature on cryptographic optimization, we aimed to identify which parts of the algorithms are possible and suitable to optimize.

When potential areas of improvement were identified, literature was studied to find relevant methods available on z15, such as specialized instruction sets and other hardware features. IBM's official documentation as well as research conducted by third parties was also studied. That way, we hoped to get a balanced view of the capabilities of the platform.

To identify relevant research papers, we searched for peer-reviewed papers in databases such as Scopus and Web of Science. By searching for keywords such as mainframe cryptography, z15 cryptography, cpacf performance, mainframe cpu cache, cryptocards, z15 simd, z15 memory security and z15 alu, we selected papers that seemed relevant based on their titles and abstracts. We then read the selected papers in their entirety to determine their quality and relevance.
