Huanyu Wang: Side-Channel Analysis of AES Based on Deep Learning


DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS


Abstract

Side-channel attacks avoid complex analysis of cryptographic algorithms; instead, they use side-channel signals captured from a software or hardware implementation of the algorithm to recover its secret key. Recently, deep learning models, especially Convolutional Neural Networks (CNNs), have been shown to be successful in assisting side-channel analysis. The attacker first trains a CNN model on a large set of power traces captured from a device with a known key. The trained model is then used to recover the unknown key from a few power traces captured from a victim device. However, previous work has three important limitations: (1) little attention has been paid to the effects of training and testing on traces captured from different devices; (2) the effect of different power models on the efficiency of the attack has not been thoroughly evaluated; (3) it is believed that, in order to recover all bytes of a key, the CNN model must be trained as many times as there are bytes in the key.

This thesis aims to address these limitations. First, we show that it is easy to overestimate the efficiency of the attack if the CNN model is trained and tested on the same device. Second, we evaluate the effect of two common power models, identity and Hamming weight, on the efficiency of CNN-based side-channel attacks. The results show that the identity power model is more effective under the same training conditions. Finally, we show that it is possible to recover all key bytes using a CNN model trained only once.

Keywords




Acknowledgements


Authors

Huanyu Wang <huanyu@kth.se>

Electrical Engineering and Computer Science KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Electrum 229, 164 40 Kista

Examiner

Prof. Elena Dubrova

KTH Royal Institute of Technology

Supervisor

Prof. Mark T Smith


1 Introduction

Cryptography is an important part of information security and communication confidentiality. At present, the algorithms, protocols and corresponding standards strictly guarantee the theoretical security of cryptography. However, for cryptographic systems, one problem that cannot be ignored is that security in theory is not equivalent to security in implementation. Because a cryptographic algorithm relies on a hardware or software implementation, there is a risk of information leakage when it runs on a device or a chip. The attacker can observe the side-channel leakage and combine it with the details of the specific cryptographic algorithm for cryptanalysis. The available side-channel information includes execution time [15], power consumption [16], electromagnetic radiation [28], acoustic information [32], cache information [14][24], etc. This type of attack is called a Side-Channel Attack (SCA). Implementations of many well-known cryptographic algorithms, including the Advanced Encryption Standard (AES) [7], have been broken by SCA.

One powerful tool for side-channel attacks is Deep Learning (DL). DL helps explore the correlation between the leakage information and the key. Unlike traditional side-channel attacks, a DL-based side-channel attack enables the attacker to use little leakage information (e.g. few power traces in power analysis) at the attack stage with a trained DL model. This makes the side-channel attack significantly more efficient. Recent works have explored SCA based on different deep learning techniques, including the Multilayer Perceptron (MLP) [21][22][23] and Convolutional Neural Networks (CNN) [3][21]. These works demonstrate that SCA with properly used deep learning algorithms can perform better than template attacks [4].


Figure 1.1: An overview of how the DL-based SCA works.

Specifically, CNNs can be applied against jitter-based countermeasures [3] and masked AES implementations [25]. Further details about deep learning and convolutional neural networks can be found in Section 2.3. Previous CNN-based SCAs have some limitations. Building on [2][3][20][26][31], this thesis explores CNN-based SCA with the following contributions:

1. This thesis explores how board diversity can affect the performance of CNN-based side-channel attacks. The results show that it is easy to overestimate the accuracy of the side-channel attack if the CNN models are trained and tested on traces captured from the same board.

2. Few works pay attention to how different power models affect CNN-based side-channel attacks. This thesis compares the 9-classifier (Hamming weight power model) and the 256-classifier (identity power model).

3. The previous work [25] claims that, to recover an entire key, the number of times a neural network must be trained is equal to the number of bytes in the key. This thesis demonstrates that for CNN-based SCA, it is enough to train a model on one byte of the key to recover the entire key.


2 Background

The ability of deep learning to explore relationships in raw data makes it a good candidate for side-channel analysis. In recent years, many studies on side-channel attacks based on deep learning have emerged in order to make SCA more efficient. Based on these previous works, this thesis aims to explore a more efficient side-channel attack based on CNNs. This section introduces the theoretical background of cryptography, side-channel attacks, and machine learning. The review of each field generally includes an overview as well as theoretical descriptions, traditional analytical methods, evaluation criteria and examples.

2.1 Cryptography Basics and Advanced Encryption Standard

Since side-channel attacks aim at breaking an implementation of a cryptographic algorithm, it is necessary to learn the basics of cryptography. This section first presents a theoretical overview of cryptography and then some research milestones. It then describes some important cryptographic algorithms, especially AES, which is the target algorithm in our attacks.

2.1.1 Theoretical Overview

Figure 2.1 shows a cryptographic system transmission model in an ideal communication environment. The ideal communication environment means that the attacker can only intercept the information transmitted on the public channel and then perform key analysis. This is also a common assumption of traditional cryptanalysis.

Figure 2.1: A cryptographic system transmission model in an ideal communication environment.

2.1.2 Historical Overview

In wartime, cryptography was mainly used for intelligence and command transmission. In peacetime, especially in today’s information society, cryptography has penetrated every aspect of people’s lives. It is often used to provide information confidentiality, that is, to protect messages in transmission and storage. In addition, cryptography can also be used for digital signatures, identity authentication, system control, source confirmation, and more. The history of cryptography can be summarized in the following three stages:

1. The development of symmetric cryptography.

n(n − 1)/2 different secret keys to be managed. Key management becomes significantly inefficient when the number of users n is large, and it becomes considerably difficult for the network to accomplish key production and distribution. Therefore, the difficulty of key distribution is the main obstacle to using symmetric cryptography in a large communication system.

2. The development of modern cryptography.

From 1976 to 1996, two highly influential events marked the birth of modern cryptography.

The first one is the proposal of the public key concept [8]. This was the first time that secure communication without key transfer was shown to be possible, which led to the birth of public key cryptography. The problem of key distribution in symmetric cryptography is fundamentally solved by public key cryptography, which is widely used in today’s computer networks. At present, the widely used public key algorithms include Rivest-Shamir-Adleman (RSA) cryptosystems [30], ElGamal public-key cryptosystems [9] and Elliptic Curve Cryptography (ECC).

The second one is the Data Encryption Standard (DES) [36] established by the United States in 1977, which is a symmetric cryptographic algorithm designed by International Business Machines Corporation (IBM). It embodies the design ideas from Shannon’s communication security theory, marking a new stage in the design and analysis of cryptographic algorithms.

3. The development of applied cryptography.

an international encryption standard, and many countries have adopted AES in their banking systems. For these reasons, most side-channel attacks are now devoted to breaking the AES encryption system. The target algorithm of this thesis is also AES; we use AES with the key size n = 128 bits, i.e. AES-128. Section 2.1.3 explains the mathematical details of AES-128.

2.1.3 Advanced Encryption Standard, AES

The AES algorithm is a symmetric cryptographic algorithm adopted by the U.S. National Institute of Standards and Technology after the DES algorithm became outdated. AES-128 divides the plaintext into 16-byte blocks before encryption, and takes a plaintext block together with a 128-bit key as input. Algorithm 1 shows the pseudo-code for the AES-128 encryption algorithm. It is an iterative process with a total of 10 rounds, each of which contains four basic operations: SubBytes, ShiftRows, MixColumns and AddRoundKey (there is no MixColumns operation in the last round, as Algorithm 1 shows). Before the AES encryption and decryption operations, the 16-byte input array is reorganized into a 4 × 4 matrix (Fig. 2.2 shows how the matrix is arranged).


Algorithm 1 Pseudo-code of the AES-128 algorithm.

// AES-128 Cipher
// in: 128 bits (plaintext)
// out: 128 bits (ciphertext)
// Nr: number of rounds, Nr = 10 for AES-128
// Nb: number of columns in state, Nb = 4
// w: expanded key K, Nb * (Nr + 1) = 44 words, (1 word = Nb bytes)
state = in;
AddRoundKey(state, w[0, Nb − 1]);
for round = 1 step 1 to Nr − 1 do
    SubBytes(state); // Attack Point, for round = 1.
    ShiftRows(state);
    MixColumns(state);
    AddRoundKey(state, w[round * Nb, (round + 1) * Nb − 1]);
end for
SubBytes(state);
ShiftRows(state);
AddRoundKey(state, w[Nr * Nb, (Nr + 1) * Nb − 1]);
out = state;

1. SubBytes.

SubBytes is a reversible nonlinear operation, which is represented by formula 1. For every byte x_7x_6x_5x_4x_3x_2x_1x_0, its multiplicative inverse in GF(2^8) is computed first, and the affine transformation is then applied to it.

The SubBytes process can also be described by Figure 2.3, which introduces the substitution box (SBox), a non-linear substitution. The output of the SBox in the first round is the attack point of the attacks in this thesis. See the details of the SBox in the appendix.

Figure 2.3: The SubBytes process of using SBox [29]
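As a minimal sketch of the SubBytes construction described above, the following Python code builds the substitution table from the multiplicative inverse in GF(2^8) followed by the affine transformation. The reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) and the affine constant 0x63 come from the AES standard rather than from this section, and all function names are illustrative, not taken from the thesis code.

```python
from typing import List


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inverse(a: int) -> int:
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0 by convention."""
    if a == 0:
        return 0
    # Brute-force search is sufficient for a field with 256 elements.
    for candidate in range(1, 256):
        if gf_mul(a, candidate) == 1:
            return candidate
    raise ValueError("no inverse found")


def rotl8(x: int, shift: int) -> int:
    """Rotate a byte left by `shift` bits."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF


def build_sbox() -> List[int]:
    """Build the 256-entry SBox: field inverse followed by the affine map."""
    sbox = []
    for value in range(256):
        inv = gf_inverse(value)
        # Affine transformation: XOR of the inverse with four of its
        # rotations, plus the constant 0x63.
        s = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
        sbox.append(s)
    return sbox


SBOX = build_sbox()
```

With such a table, the first-round SBox output for a plaintext byte p and a key byte k would be `SBOX[p ^ k]`, which is the quantity targeted by the attacks in this thesis.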

2. ShiftRows.

ShiftRows applies a cyclic shift to each row. The specific operations are: the 0th row is unchanged, the 1st row is shifted left by one byte, the 2nd row is shifted left by two bytes, and the 3rd row is shifted left by three bytes, as shown in Figure 2.4.

Figure 2.4: The ShiftRows process [29]
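The row shifts can be illustrated with a short Python sketch. It assumes the state is stored as a list of four rows of four bytes each; the function name and the example values are hypothetical.

```python
from typing import List


def shift_rows(state: List[List[int]]) -> List[List[int]]:
    """Cyclically shift row r of a 4x4 state to the left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Example state whose entries encode (row, column) as two hex digits.
state = [
    [0x00, 0x01, 0x02, 0x03],
    [0x10, 0x11, 0x12, 0x13],
    [0x20, 0x21, 0x22, 0x23],
    [0x30, 0x31, 0x32, 0x33],
]
shifted = shift_rows(state)
```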

3. MixColumns.

The purpose of MixColumns is to mix all the bytes in each column. Mathematically, each column is multiplied by the polynomial c(x) = 3x^3 + x^2 + x + 2 modulo x^4 + 1, which corresponds to the matrix multiplication in formula 2. See also the overview of the MixColumns process in Figure 2.5.

$$
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
\tag{2}
$$
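The matrix product in formula 2 can be sketched in Python for a single column. The multiplications by 02 and 03 are carried out in GF(2^8) with the AES reduction polynomial 0x11B, which is taken from the AES standard and not stated in this section; the function names are illustrative.

```python
from typing import List


def xtime(a: int) -> int:
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def mul3(a: int) -> int:
    """Multiply a byte by 03, i.e. (02 * a) XOR a."""
    return xtime(a) ^ a


def mix_column(col: List[int]) -> List[int]:
    """Apply the MixColumns matrix to one 4-byte column (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Each row of the returned list mirrors one row of the matrix, so the correspondence with formula 2 can be checked term by term.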

Figure 2.5: The MixColumns process [29]

4. AddRoundKey.

This operation adds (XORs) the subkey k_ij to the state matrix a_ij for every byte; see Figure 2.6.

Our work uses CNN-based side-channel attacks to break AES; the next section describes side-channel attacks.

2.2 Side-Channel Attacks

Side-channel attacks avoid complex analysis of cryptographic algorithms. They aim to use the leakage information from the software and hardware implementations of the encryption algorithm to recover the secret key. Figure 2.7 shows the cryptographic system transmission model in a communication environment with side-channel leakage. Compared to the traditional model shown in Figure 2.1, the attacker in an SCA can not only get the messages transmitted over the public channel, but can also exploit the side-channel leakage generated by the encryption and decryption implementations.

Figure 2.7: A cryptographic system transmission model in a communication environment with the side-channel leakage

2.2.1 Historical Overview

The historical development of the side-channel attacks can be divided into 3 stages:

1. The beginning of SCA (from 1996 to 2000).


the power consumption leakage model was used for breaking DES [16]. In 2000, Quisquater and Samyde found that the electromagnetic radiation is also suitable for side-channel attacks [28].

2. The initial development of SCA (from 2001 to 2010).

The main feature of this stage is that the evaluation, countermeasures, and applications of SCA received more attention, and at the same time more leakage models were found. In 2008, the DPA contest [10], a side-channel analysis contest, appeared. Many subsequent machine learning-based SCA studies [17][18][20][27][31] are based on the traces from this DPA contest. In 2010, several side-channel attack methods became hot topics: the flash memory bumping attack [35], the watermark-based side-channel attack [1] and the fault sensitivity side-channel attack [19].

3. The peak development of SCA (after 2011).

The main feature of this stage is that more cross-domain technologies are used in SCA, especially deep learning methods such as the Multi-Layer Perceptron (MLP) and Convolutional Neural Networks (CNN). Since CNNs have been shown to overcome power trace misalignment and jitter-based countermeasures [3] and to break masked AES implementations [25], they are used in this thesis. See the technical details of CNNs and deep learning in Section 2.3.

2.3 Deep Learning and Convolutional Neural Networks

Deep Learning is a subfield of machine learning concerned with models inspired by the structure and function of the brain, called artificial neural networks, which are explained in Section 2.3.2.

2.3.1 Historical Overview of Deep Learning

and the multilayer perceptron is also the earliest deep learning network model. In 1997, Jürgen Schmidhuber proposed Long Short-Term Memory (LSTM), which promoted the development of recurrent neural networks. In 1998, Y. LeCun proposed the Convolutional Neural Network (CNN). In 2009, Yoshua Bengio proposed another common model for deep learning, the Stacked Auto-Encoder (SAE), which uses an auto-encoder instead of the basic unit of the deep belief network.

2.3.2 Convolutional Neural Networks

CNNs are made up of neurons that have weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The task is to classify the data x ∈ R^k based on their labels l(x) ∈ C, where k is the number of points in the data and C = {0, 1, . . . , |C| − 1} is the set of classification classes. A neural network can be viewed as a mapping N : R^k → R^|C| which takes as input the data x to classify, and produces as output a score vector s = N(x).

Formula 3 shows the one-hot encoded ground truth vector t of the label l(x):

$$
t_i = \begin{cases} 1 & \text{if } i = l(x) \\ 0 & \text{otherwise} \end{cases} \tag{3}
$$

Formula 4 shows the categorical cross-entropy loss used to quantify the classification error of the network, which is one of several types of loss functions. The cross entropy describes the distance between two probability distributions.

$$
CE = -\sum_{i \in C} t_i \log\left( \frac{e^{s_i}}{\sum_{j \in C} e^{s_j}} \right) \tag{4}
$$
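Formulas 3 and 4 can be illustrated with a small pure-Python sketch. Subtracting the maximum score before exponentiating is a common numerical-stability step that is not part of formula 4 and does not change its value; the function names are illustrative.

```python
import math
from typing import List, Sequence


def one_hot(label: int, num_classes: int) -> List[int]:
    """Formula 3: ground truth vector t with t_i = 1 only for i = label."""
    return [1 if i == label else 0 for i in range(num_classes)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalised exponentials e^{s_i} / sum_j e^{s_j}."""
    shift = max(scores)  # stability shift, cancels out in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(scores: Sequence[float], label: int) -> float:
    """Formula 4: CE = -sum_i t_i * log(softmax(s)_i)."""
    probs = softmax(scores)
    target = one_hot(label, len(scores))
    # Terms with t_i = 0 contribute nothing and are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


uniform_loss = categorical_cross_entropy([0.0, 0.0, 0.0, 0.0], label=2)
```

For a score vector that is uniform over four classes, the loss equals log 4, the value obtained when the network has no preference for any class.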


After the CNN model is successfully trained, it is used to recover a key byte from input power traces captured from a device with an unknown key. Formula 5 gives the predicted label; when it equals l(x), the process can be considered successful.

$$
\hat{l} = \arg\max_{i \in C} s_i \tag{5}
$$

A CNN architecture consists of basic elements called neurons. Each neuron in the network has a bias value b and weights w. The activation function f of a neuron maps the resulting values into the desired range and defines the output of that neuron. Formula 6 shows the output of the neuron, where x_i, i ∈ {1, 2, ..., n}, represents an input value of this neuron, which is also the output value of a neuron in the previous layer:

$$
y = f\left( \sum_{i=1}^{n} w_i x_i + b \right) \tag{6}
$$

In this thesis, we use the CNN architecture suggested in [2].

Figure 2.8 is the general architecture of Visual Geometry Group 16 (VGG16) network which is one of the most commonly used CNN models.

Figure 2.8: A general example of a VGG16 structure [34]

layer, each input passes through a series of convolution layers with filters (kernels), pooling layers and fully connected (FC) layers to classify an object with probability values between 0 and 1.

1. Convolution layer.

Convolution layers apply convolution operations to the pre-processed data by sliding a set of filters along the traces, with the aim of extracting features from the input data. Different filters can extract different features from the initial data. By using a set of different filters, the convolution layer preserves the main features of the input data by exploring the relationship between the data points. Figure 2.9 shows an example of part of the convolution process using one specific 3 × 3 filter and pre-processed 6 × 6 data. With a stride equal to 1, the output is a 4 × 4 feature matrix. In this example the filter is fixed, but in practice the goal of this layer is to find the most suitable filters, i.e. those that extract the most useful features from the input data.

Figure 2.9: An example to illustrate how the kernel of the convolution layer extracts the features
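The sliding-filter operation in this example can be sketched in pure Python. The sketch assumes square inputs and filters and no padding, and the input values and the all-ones filter are made up for illustration; a trained layer would learn its filter weights instead.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(data: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Slide a square kernel over square data without padding."""
    k = len(kernel)
    out_size = (len(data) - k) // stride + 1
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Weighted sum of the k x k window whose top-left corner
            # is at (r * stride, c * stride).
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += data[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 6 x 6 input and 3 x 3 filter.
data = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
kernel = [[1.0, 1.0, 1.0] for _ in range(3)]
features = conv2d_valid(data, kernel, stride=1)
```

The output size follows from (6 − 3) / 1 + 1 = 4, matching the 4 × 4 feature matrix mentioned in the text.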

2. Pooling Layer.


average pooling process works.

Figure 2.10: An example to illustrate how the filter of the average pooling layer reduces the dimensions
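A one-dimensional average pooling step can be sketched as follows. The pool size of 2 with a stride of 2 is an assumption; it is chosen here because it reproduces the sequence of output lengths 95, 47, 23, 11, 5, 2 reported for the pooling layers in Table 3.1.

```python
from typing import List


def average_pool_1d(values: List[float], pool_size: int = 2, stride: int = 2) -> List[float]:
    """Replace each window of `pool_size` points by its mean.

    Trailing points that do not fill a complete window are dropped.
    """
    pooled = []
    for start in range(0, len(values) - pool_size + 1, stride):
        window = values[start:start + pool_size]
        pooled.append(sum(window) / pool_size)
    return pooled


# Track how a 95-point trace shrinks through five pooling layers.
trace = [0.0] * 95
lengths = []
for _ in range(5):
    trace = average_pool_1d(trace)
    lengths.append(len(trace))
```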

3. Fully connected layer.

Fully connected layers (FCs) act as the final classifier of the entire convolutional neural network. If the convolution and pooling layers map the original data to the hidden-layer feature space, the fully connected layers map the learned distributed feature representation to the sample label space l(x) ∈ C.


3 CNN based Side Channel Analysis

This section describes the technical details of how CNNs can be used in side-channel analysis, as well as the experimental setups and the parameters. In this work, side-channel analysis is based on power consumption.

3.1 Setup

During the experiments, power traces are captured from two different ATXmega128D4 microcontroller boards which implement AES, using the ChipWhisperer [12]. All training runs are executed on the Tegner computing system of the PDC Center of KTH. Tegner is a heterogeneous system which has 67 micro nodes with Intel CPUs; 9 of those nodes have NVIDIA Tesla GPUs.

(a) Xmega 1, the first board of ATXmega128D4 microcontroller (right) connected to the ChipWhisperer.

(b) Xmega 2, the second board of ATXmega128D4 microcontroller (right) connected to the ChipWhisperer.

Figure 3.1: Two different printed circuit boards of the AES implementations are used in the experiments

captured from a printed circuit board of the ATXmega128D4 microcontroller called Xmega 1, and testing traces are captured from both Xmega 1 and Xmega 2. See Figure 3.1.

3.2 Assumptions

We assume that the attacker has a device implementing AES which is similar to the target device. For the training device, there is no limitation on the attacker obtaining the plaintext, ciphertext, key, and the corresponding power traces. For the target device, the attacker has physical access to capture power traces. For example, when using a point of sale (POS) terminal, the attacker may have 3 chances to enter a password within a short time and record the leaked information.

3.3 Attack Point and AES Implementation

During the acquisition of power traces ⃗T, the variable P denotes the 16-byte plaintext and K is the unknown secret key that the attacker is going to recover. With 1 ≤ i ≤ 16, P[i] and K[i] respectively represent the i-th byte of the plaintext and of the key. L denotes the label for the corresponding ⃗T. During the experiments, the data is divided into 3 different parts: training data Dtrain, which is used for training the CNN model; testing data Dtest, which is used for evaluating the performance of the trained CNN models; and validation data Dvalid, which is used for validating the CNN model during the training process. Each dataset contains the power traces ⃗T, the corresponding key values K, plaintext values P and the labels ⃗L. ˆg denotes the trained CNN models.

The most vulnerable point of the AES implemented by the ATXmega128D4 microcontroller is the output of the SBox in the first AES round. The power consumption captured at this point has the strongest correlation with the key. Point_a represents the attack point in formula 7, which corresponds to the line of pseudo-code marked with the comment "Attack Point" (see Algorithm 1).


Since the leakage power traces have the best feature points during the 3rd byte of the target pair Z, we use the 3rd-byte attack point in the training process. The training point Point_t is given by formula 8:

$$
Point_t = \mathrm{SBox}[P[3] \oplus K[3]] \tag{8}
$$

Like any block cipher, AES can be used in different modes of operation. Our work uses the Electronic Codebook (ECB) mode, which splits the entire plaintext into blocks and then encrypts each block separately using the block cipher. Figure 3.2 (a) shows a power trace of Xmega 1 with the first 3000 data points in ECB mode, and Figure 3.2 (b) shows the training points, i.e. the 3rd-byte power trace with 95 data points. The interval [249, 344] contains the data points of the 3rd-byte power consumption trace used for training the CNN model. The interval of the attacking trace data points depends on which key byte the attacker aims to recover.

(a) An example of power consumption trace from Xmega 1 with all 16 bytes data points

(b) The 3rd-byte power trace extracted from (a)

Figure 3.2: Power trace of Xmega 1 during the execution of AES-128 in ECB mode.

3.4 Training parameters


except for the input and the output sizes.

3.4.1 Power Models

Since the deep-learning model is used here as a classifier, the number of outputs of the classifier (CNN model) is defined by the power model. There are 3 popular power models which can be used in deep-learning side-channel analysis: identity, Hamming weight and Hamming distance.

1. The identity power model uses the correlation between the different features of power traces and the data at the attack point Point_a. In this case, for a single byte, since the data at Point_a ranges from 0 to 255, L_ID = {0, 1, . . . , 254, 255}. Here L_ID denotes the set of labels for the power traces ⃗T corresponding to the identity power model.


Name: Model 1

Layer Type           Output Shape        Parameter #
Input (Dense)        (None, 95, 1)       0
Conv1D 1             (None, 95, 64)      768
AveragePooling1 1    (None, 47, 64)      0
Conv1D 2             (None, 47, 128)     90240
AveragePooling1 2    (None, 23, 128)     0
Conv1D 3             (None, 23, 256)     360704
AveragePooling1 3    (None, 11, 256)     0
Conv1D 4             (None, 11, 512)     1442304
AveragePooling1 4    (None, 5, 512)      0
Conv1D 5             (None, 5, 512)      2884096
AveragePooling1 5    (None, 2, 512)      0
Flatten              (None, 1024)        0
Dense 1              (None, 4096)        4198400
Dense 2              (None, 4096)        16781312
Output (Dense)       (None, 256)         1048832
Total Parameters: 26,806,656

Table 3.1: The CNN architecture for the identity power model.
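The parameter counts in Table 3.1 can be cross-checked with a short Python sketch. The kernel size of 11 is not stated in this section; it is an assumption inferred from the first convolution layer, since 11 · 1 · 64 + 64 = 768. Each layer is assumed to have one bias per output unit.

```python
def conv1d_params(kernel_size: int, in_channels: int, filters: int) -> int:
    """Weights plus one bias per filter for a 1-D convolution layer."""
    return kernel_size * in_channels * filters + filters


def dense_params(inputs: int, units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


KERNEL_SIZE = 11  # assumption inferred from the 768 parameters of Conv1D 1

# Channel counts before and after each of the five convolution layers.
conv_channels = [1, 64, 128, 256, 512, 512]
conv_counts = [
    conv1d_params(KERNEL_SIZE, c_in, c_out)
    for c_in, c_out in zip(conv_channels, conv_channels[1:])
]

# Flatten output (2 * 512 = 1024), two hidden layers, 256 output classes.
dense_sizes = [1024, 4096, 4096, 256]
dense_counts = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(dense_sizes, dense_sizes[1:])
]

total_params = sum(conv_counts) + sum(dense_counts)
```

Under these assumptions the per-layer counts and the total of 26,806,656 agree with the table. The pooling and flatten layers have no trainable parameters.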

Table 3.2 shows the parameters of both training and validating data.

Data          Set        Board      Quantity    Target Byte
Training      Dtrain1    Xmega 1    720000      3rd
Validation    Dvalid1    Xmega 1    180000      3rd

Table 3.2: The parameters of the training and validation data

2. The Hamming weight power model applies the Hamming weight of the data at the attack point Point_a. The label L_HW is described by formula 9. Here the function HW(x) is the Hamming weight of x. Since the Hamming weight of one byte ranges from 0 to 8, the label L_HW has 9 classes, L_HW ∈ {0, 1, . . . , 8}.

Here, i ∈ {1, 2, . . . , 15, 16}. In our experiments, we use i = 3. Table 3.3 shows the CNN architecture summary for the Hamming weight power model, which has 9 outputs. The RMSprop optimizer is used with a learning rate of 0.00001. The same training and validation sets as in Table 3.2 are used. The batch size is also 500 and the number of epochs is 100.

Name Model 2

Table 3.3: The CNN architecture for the Hamming weight power model.
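The two labelings can be contrasted with a short Python sketch. It takes the intermediate byte at the attack point as given, and the function names are illustrative rather than taken from the thesis code.

```python
from collections import Counter


def hamming_weight(x: int) -> int:
    """Number of bits set to 1 in x."""
    return bin(x).count("1")


def identity_label(value: int) -> int:
    """Identity power model: the label is the byte itself (256 classes)."""
    return value


def hamming_weight_label(value: int) -> int:
    """Hamming weight power model: the label is HW(value) (9 classes)."""
    return hamming_weight(value)


# How many of the 256 byte values fall into each Hamming weight class.
class_sizes = Counter(hamming_weight_label(v) for v in range(256))
sizes_by_weight = [class_sizes[w] for w in range(9)]
```

The class sizes follow the binomial coefficients C(8, w), so the 9 classes are unbalanced: weights 0 and 8 each correspond to a single byte value, while weight 4 covers 70 values. A Hamming weight label therefore identifies the intermediate byte uniquely only in those two extreme cases.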

3.5 Evaluation

first assigns a score to each key candidate k, then ranks the scores of all key candidates. During the multi-trace attack, the attacker inputs multiple power traces (suppose the number is n) into the trained CNN model ˆg. The rank function ranks the score of the real key (denoted k∗) among all key candidates K = {k_1, k_2, . . . , k_256}. When the rank function evaluates to zero, the real key is recovered and the attack succeeds. The rank function is defined by formulas (10) and (11), where i ≤ 256, s_Dtest(k_i) denotes the score of the key candidate k_i, and ∂_j(k_i) denotes the j-th model output for the key candidate k_i.

$$
\vec{s}_{D_{test}}(k_i) = \prod_{j=1}^{|D_{test}|} \vec{\partial}_j(k_i) \tag{10}
$$

$$
\mathrm{rank}(\hat{g}, D_{test}) = \left| \{ k_i \in K \mid \vec{s}_{D_{test}}(k_i) > \vec{s}_{D_{test}}(k^*) \} \right| \tag{11}
$$

The single-trace recovery rate is used in a different attack scenario, where the attacker has only one trace for each attack. With a single trace as the input, the attack succeeds when the prediction value of the real key k∗ is the largest among all candidates’ outputs. The single-trace recovery rate measures how often the model ˆg classifies correctly with a single trace. It is defined by formula (12), where the indicator 1[·] equals 1 when the condition holds and 0 otherwise.

$$
\mathrm{Rate}(\hat{g}, D_{test}) = \frac{\sum_{j=1}^{|D_{test}|} \mathbb{1}\left[ \max\{ \vec{\partial}_j(k_1), \vec{\partial}_j(k_2), \ldots, \vec{\partial}_j(k_{256}) \} = \vec{\partial}_j(k^*) \right]}{|D_{test}|} \tag{12}
$$
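The two evaluation metrics can be sketched in pure Python as follows. The sketch assumes that each trace has already been turned into a list of model outputs indexed by key candidate. It accumulates scores as sums of logarithms instead of the product in formula (10), a substitution that preserves the ordering of candidates while avoiding numerical underflow for many traces. The small floor value and all names are illustrative assumptions.

```python
import math
from typing import List, Sequence

LOG_FLOOR = 1e-40  # assumed guard against log(0), not part of the formulas


def key_scores(outputs: Sequence[Sequence[float]]) -> List[float]:
    """Log-domain version of formula (10): one accumulated score per key."""
    num_keys = len(outputs[0])
    return [
        sum(math.log(max(trace[k], LOG_FLOOR)) for trace in outputs)
        for k in range(num_keys)
    ]


def key_rank(outputs: Sequence[Sequence[float]], true_key: int) -> int:
    """Formula (11): number of candidates scoring strictly above the real key."""
    scores = key_scores(outputs)
    return sum(1 for s in scores if s > scores[true_key])


def single_trace_rate(outputs: Sequence[Sequence[float]], true_keys: Sequence[int]) -> float:
    """Formula (12): fraction of traces whose largest output is the real key."""
    hits = sum(1 for trace, k in zip(outputs, true_keys) if max(trace) == trace[k])
    return hits / len(outputs)


# Toy example with 4 key candidates and 3 traces; the real key is index 2.
toy_outputs = [
    [0.1, 0.4, 0.3, 0.2],
    [0.2, 0.3, 0.4, 0.1],
    [0.1, 0.2, 0.6, 0.1],
]
rank_one_trace = key_rank(toy_outputs[:1], true_key=2)
rank_all_traces = key_rank(toy_outputs, true_key=2)
rate = single_trace_rate(toy_outputs, [2, 2, 2])
```

In the toy data the first trace alone ranks a wrong candidate first, but combining all three traces brings the rank of the real key down to zero, which is the effect the multi-trace attack relies on.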


4 Experimental Results

To achieve our goal, three different experiments are designed:

1. The first experiment applies the 256-classifier CNN model based on the previous work [2] to attack a board which is not exactly the same as the board used for training. Since in an actual attack scenario it might be difficult for the attacker to get an identical target device for training the CNN model, this experiment shows the effect when the attacker trains the CNN model on the Xmega 1 board but uses the trained model to attack Xmega 2. Xmega 1 and Xmega 2 are programmed with the same version of the AES implementation but are different printed circuit boards.

2. The second experiment compares the 9-classifier and 256-classifier CNN models. As mentioned in Sections 1 and 3, few works consider this comparison. This experiment compares CNNs using two different power models: the identity and the Hamming weight power model. The experiment evaluates the results of the single-trace attacks, the multi-trace attacks and the training time.

3. The third experiment verifies that a CNN model trained on a single byte can recover the entire 16-byte key. The previous work [25] assumes that the number of times a neural network must be trained is equal to the number of bytes in the key. However, during our experiments, we found that the location of the key byte used at the training stage has little effect on the recovery of other key bytes. We show that it is possible to use a CNN model trained on one key byte to recover the entire key.

4.1 Comparison Between Different Target Boards


Testing Data    Board      Quantity    Target Byte    Data Point Interval    Key Value
DtestX1_1       Xmega 1    128000      3rd byte       249−344                Random
DtestX2_1       Xmega 2    128000      3rd byte       249−344                Random

Table 4.1: The parameters of the single-trace testing data

Table 4.2 describes the testing data for the multiple-trace tests, which determine the average number of traces needed to recover the key. The model is trained on the 3rd byte.

Testing Data    Board      Quantity    Target Byte    Data Point Interval    Key Value
DtestX1_2       Xmega 1    50000       3rd byte       249−344                Fixed, 43
DtestX2_2       Xmega 2    50000       3rd byte       249−344                Fixed, 43

Table 4.2: The parameters of the multi-trace testing data

Figure 4.1 compares the recovery rates for the single-trace attacks on DtestX1_1 and DtestX2_1, using Model 1. The results show that the average recovery rate of the 3rd key byte using a single trace from Xmega 1 is 79.76%, while the recovery rate of the single-trace attack on Xmega 2 data is only 2.12%.

(a) Xmega 1 (b) Xmega 2

Figure 4.1: The recovery rates of the single-trace attacks on DtestX1_1 and DtestX2_1, using Model 1.

DtestX2_2 are randomly permuted 1000 times and used to compute the average ranks. The results show that the average number of traces needed to recover the key byte from Xmega 1 is 5, while Xmega 2 needs 18 traces on average.

Figure 4.2: The ranks of the multi-trace attacks on DtestX1_2 and DtestX2_2, using Model 1.

Table 4.3 summarizes the performances of the 256-classifier CNN model tested on different testing sets.

                         Xmega 1    Xmega 2
Average recovery rate    79.76%     2.12%
Average traces needed    5          18

Table 4.3: The performance of Model 1 tested on different target boards

From Figure 4.1, Figure 4.2 and Table 4.3, we can conclude that it is easy to overestimate the accuracy of the attack if the attacker uses the same device for both training and testing. Evaluations of CNN-based side-channel attacks should therefore avoid using the same device for training and testing.

4.2 Comparison Between Different Power Models


with the identical training parameters (see section 3).

Since getting the Hamming weight from the training point P ointtis an irreversible process, except the attacking traces which have the Hamming weight 0 and 8, it is impossible to use only one single trace to recover the target key byte. Therefore the single-trace tests needs to be modified to the 2-trace tests.

Figure 4.3 shows the probability of recovering the 3rd key byte from 2 traces of D_{testX1_}1 by two different CNN models, Model 1 and Model 2. The recovery rate

of the identity CNN model (Model 1) is 81.65% while the recovery rate of the Hamming weight CNN model (Model 2) is 58.28%.

(a) The identity CNN Model (b) The Hamming weight CNN Model

Figure 4.3: Probabilities of recovering the 3rd key byte from 2 traces of DtestX1_1 using Model 1 and Model 2.

Figure 4.4 compares the 3rd key byte ranking for multiple traces of DtestX1_2 between Model 1 and Model 2. During this part, the testing data DtestX1_2 is


Figure 4.4: The ranks of the multi-trace attacks on DtestX1_2 using Model 1 and Model 2.

Table 4.4 summarizes the performance of the CNNs based on different power models tested on Xmega 1. The average recovery rate in the table is based on the 2-trace tests.

                        Identity   Hamming weight
Average recovery rate   81.65%     58.28%
Average traces needed   5          13

Table 4.4: The performance of CNN models based on different power models tested on Xmega 1.

To fully compare the different power models, it is necessary to apply the two CNN models to the Xmega 2 data as well. Figure 4.5 shows the probability of recovering the 3rd key byte from 2 traces of DtestX2_1 by Model 1 and Model 2. The recovery rate


Figure 4.5: Probabilities of recovering the 3rd key byte from 2 traces of DtestX2_1 using Model 1 and Model 2.

Figure 4.6 compares the 3rd key byte ranking for multiple traces of DtestX2_2 using Model 1 and Model 2. The testing data DtestX2_2 is randomly permuted. The results show that the CNN with the identity power model needs 18 traces on average to recover the 3rd key byte on Xmega 2, while the Hamming weight power model needs 29 traces.

Figure 4.6: Result of the 3rd key byte ranking for multiple traces of DtestX2_2 using Model 1 and Model 2.


                        Identity   Hamming weight
Average recovery rate   2.33%      0.38%
Average traces needed   18         29

Table 4.5: The performance of Model 1 and Model 2 tested on Xmega 2.

Training the identity model takes 6.11 hours, while training the Hamming weight model takes 6.31 hours.

The conclusion of this section is that, when the training parameters are the same, the identity power model is the more suitable choice for CNN-based side-channel attacks.

4.3 Full Key Recovery

Previous work [25] assumes that a neural network must be trained as many times as there are bytes in the key. This section shows that a CNN model trained on only one key byte can recover the entire 16-byte key. Model 1 and Model 2 from the previous experiments are trained to achieve a high recovery rate for one specific key byte; they cannot recover the entire 16-byte key.


Name: Model 3

Table 4.6: The CNN architecture based on identity power model for full key recovery.

The RMSprop optimizer is used with a learning rate of 0.00001. Table 4.7 shows the parameters of the training and validation data for the CNN with the identity power model used to recover the full 16-byte key. The batch size is 200 and the number of epochs is 75.

             Data Set   Board     Quantity   Target Byte
Training     Dtrain2    Xmega 1   50000      3rd
Validation   Dvalid2    Xmega 1   1000       3rd

Table 4.7: The training parameters of Model 3

Table 4.8 shows the parameters of the testing data for the full key recovery. The testing data sets DtestX1_2 and DtestX2_2 used for the multi-trace tests are part of


DtestX1_3   Xmega 1   50000   All 16 bytes   1–3000   Fixed
DtestX2_3   Xmega 2   50000   All 16 bytes   1–3000   Fixed

Table 4.8: The parameters of the full key recovery testing data

(a) Xmega 1 (b) Xmega 2

Figure 4.7: The cumulative distribution of using Model 3 to recover the entire 16-byte key from Xmega 1 and Xmega 2

Figure 4.7 shows the cumulative distribution of using Model 3 to recover the entire 16-byte key from Xmega 1 and Xmega 2, using the data sets DtestX1_3 and DtestX2_3 respectively. The testing data is randomly permuted 1000 times. The results show that the average number of traces Model 3 needs to recover the entire 16-byte key is 160.3 for Xmega 1 and 400.2 for Xmega 2.
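A cumulative distribution of this kind, together with the average, can be derived from the list of trace counts obtained over the random permutations. The Python sketch below assumes one integer per permutation, namely the number of traces after which the full key was recovered. The function names are hypothetical and the toy values are not measured data.

```python
def empirical_cdf(traces_needed):
    """Return (x, fraction of trials that needed at most x traces) pairs."""
    n = len(traces_needed)
    return [(x, sum(1 for t in traces_needed if t <= x) / n)
            for x in sorted(set(traces_needed))]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy trace counts from eight hypothetical permutations.
    toy_counts = [120, 150, 150, 160, 170, 180, 200, 230]
    for x, fraction in empirical_cdf(toy_counts):
        print(f"<= {x} traces: {fraction:.3f}")
    print("average:", mean(toy_counts))
```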


Figure 4.8: The recovery rates of the single-trace attacks on DtestX1_1 and DtestX2_1, using Model 3.


5 Conclusion

Side-channel attacks are a real threat to society and business, especially with the help of deep learning. In this study, building on the work in [2], we examined how CNNs can be used in side-channel attacks. Three experiments were designed, and the results show the following:

1. It is easy to overestimate the accuracy of the trained CNN models if the attacker uses the same device for both training and testing. CNN-based side-channel attacks should avoid using the same device for both training and testing.

2. It is more suitable to use the identity power model in the CNN-based side-channel attacks rather than the Hamming weight power model when the training parameters are the same.

3. It is possible to use one CNN model trained on one key byte to recover the entire 16-byte key.

5.1 Future Work

The following two extensions can be made in the future:

1. Since the CNN models used in the experiments are certainly not the optimal models, it is necessary to further explore the CNN models with different parameters.

2. We will further explore another well-known power model called Hamming distance model.


References

[1] Becker, Georg T et al. “Side-channel based watermarks for integrated circuits”. In: 2010 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST). IEEE. 2010, pp. 30–35.

[2] Benadjila, Ryad et al. “Study of deep learning techniques for side-channel analysis and introduction to ASCAD database”. In: ANSSI, France & CEA, LETI, MINATEC Campus, France. Available online at https://eprint.iacr.org/2018/053.pdf (2018).

[3] Cagli, Eleonora, Dumas, Cécile, and Prouff, Emmanuel. “Convolutional Neural Networks with Data Augmentation against Jitter-Based Countermeasures.” In: Cryptographic Hardware and Embedded Systems-CHES 2017-19th International Conference. 2017.

[4] Chari, Suresh, Rao, Josyula R, and Rohatgi, Pankaj. “Template attacks”. In: International Workshop on Cryptographic Hardware and Embedded Systems. Springer. 2002, pp. 13–28.

[5] Choudary, Marios O and Kuhn, Markus G. “Efficient, portable template attacks”. In: IEEE Transactions on Information Forensics and Security 13.2 (2018), pp. 490–501.

[6] Daemen, Joan and Rijmen, Vincent. “Advanced encryption standard (AES) (FIPS 197)”. In: Technical report, Katholieke Universiteit Leuven/ESAT (2001).

[7] Daemen, Joan and Rijmen, Vincent. The design of Rijndael: AES-the advanced encryption standard. Springer Science & Business Media, 2013.

[8] Diffie, Whitfield and Hellman, Martin. “New directions in cryptography”. In: IEEE transactions on Information Theory 22.6 (1976), pp. 644–654.

[9] ElGamal, Taher. “A public key cryptosystem and a signature scheme based on discrete logarithms”. In: IEEE transactions on information theory 31.4 (1985), pp. 469–472.


[11] Hanley, Neil et al. “Empirical evaluation of multi-device profiling side-channel attacks”. In: 2014 IEEE Workshop on Signal Processing Systems (SiPS). IEEE. 2014, pp. 1–6.

[12] NewAE Technology Inc. ChipWhisperer. https://newae.com/tools/chipwhisperer/.

[13] Kayser, Richard F. “Announcing request for candidate algorithm nominations for a new cryptographic hash algorithm (SHA-3) family”. In: Federal Register 72.212 (2007), p. 62.

[14] Kelsey, John et al. “Side channel cryptanalysis of product ciphers”. In: European Symposium on Research in Computer Security. Springer. 1998, pp. 97–110.

[15] Kocher, Paul C. “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems”. In: Annual International Cryptology Conference. Springer. 1996, pp. 104–113.

[16] Kocher, Paul, Jaffe, Joshua, and Jun, Benjamin. “Differential power analysis”. In: Annual International Cryptology Conference. Springer. 1999, pp. 388–397.

[17] Lerman, Liran, Bontempi, Gianluca, and Markowitch, Olivier. “A machine learning approach against a masked AES”. In: Journal of Cryptographic Engineering 5.2 (2015), pp. 123–139.

[18] Levina, Alia, Sleptsova, Daria, and Zaitsev, Oleg. “Side-channel attacks and machine learning approach”. In: 2016 18th Conference of Open Innovations Association and Seminar on Information Security and Protection of Information Technology (FRUCT-ISPIT). IEEE. 2016, pp. 181–186.


[20] Maghrebi, Houssem, Portigliatti, Thibault, and Prouff, Emmanuel. “Breaking cryptographic implementations using deep learning techniques”. In: International Conference on Security, Privacy, and Applied Cryptography Engineering. Springer. 2016, pp. 3–26.

[21] Martinasek, Zdenek, Dzurenda, Petr, and Malina, Lukas. “Profiling power analysis attack based on MLP in DPA contest V4.2”. In: 2016 39th International Conference on Telecommunications and Signal Processing (TSP). IEEE. 2016, pp. 223–226.

[22] Martinasek, Zdenek, Hajny, Jan, and Malina, Lukas. “Optimization of power analysis using neural network”. In: International Conference on Smart Card Research and Advanced Applications. Springer. 2013, pp. 94–107.

[23] Martinasek, Zdenek, Malina, Lukas, and Trasy, Krisztina. “Profiling power analysis attack based on multi-layer perceptron network”. In: Computational Problems in Science and Engineering. Springer, 2015, pp. 317–339.

[24] Page, Dan. “Theoretical use of cache memory as a cryptanalytic side-channel.” In: IACR Cryptology ePrint Archive 2002.169 (2002).

[25] Perin, Guilherme, Ege, Baris, and Woudenberg, Jasper van. “Lowering the Bar: Deep Learning for Side Channel Analysis”. In: (2018).

[26] Pfeifer, Christophe and Haddad, Patrick. Spread: a new layer for profiled deep-learning side-channel attacks. Tech. rep. Cryptology ePrint Archive, Report 2018/880, 2018.

[27] Picek, Stjepan et al. “On the performance of convolutional neural networks for side-channel analysis”. In: International Conference on Security, Privacy, and Applied Cryptography Engineering. Springer. 2018, pp. 157–176.


[29] Rijmen, Vincent and Daemen, Joan. “Advanced encryption standard”. In: Proceedings of Federal Information Processing Standards Publications, National Institute of Standards and Technology (2001), pp. 19–22.

[30] Rivest, Ronald L, Shamir, Adi, and Adleman, Leonard. “A method for obtaining digital signatures and public-key cryptosystems”. In: Communications of the ACM 21.2 (1978), pp. 120–126.

[31] Samiotis, Ioannis Petros. “Side-Channel Attacks using Convolutional Neural Networks: A Study on the performance of Convolutional Neural Networks on side-channel data”. In: (2018).

[32] Shamir, Adi and Tromer, Eran. “Acoustic cryptanalysis: on nosy people and noisy machines”. In: Online at http://people.csail.mit.edu/tromer/acoustic (2004).

[33] Shannon, Claude E. “Communication theory of secrecy systems”. In: Bell system technical journal 28.4 (1949), pp. 656–715.

[34] Simonyan, Karen and Zisserman, Andrew. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).

[35] Skorobogatov, Sergei. “Flash memory ‘bumping’ attacks”. In: International Workshop on Cryptographic Hardware and Embedded Systems. Springer. 2010, pp. 158–172.


Appendices



A Rijndael S-box

     00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
00   63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
10   ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
20   b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
30   04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
40   09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
50   53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
60   d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
70   51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
80   cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
90   60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
a0   e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
b0   e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
c0   ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
d0   70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
e0   e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
f0   8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16
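The S-box can also be regenerated from its definition in the AES specification: each byte is replaced by its multiplicative inverse in GF(2^8) (with 0 mapped to 0), followed by a fixed affine transformation. The Python sketch below is a straightforward, unoptimized illustration; the function names are hypothetical. The row label gives the high nibble of the input and the column the low nibble.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """Build the 256-entry AES S-box from its algebraic definition."""
    sbox = []
    for x in range(256):
        # Multiplicative inverse by exhaustive search; 0 has no inverse and maps to 0.
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        # Affine transformation: XOR of four rotations plus the constant 0x63.
        s = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
        sbox.append(s)
    return sbox


SBOX = build_sbox()

if __name__ == "__main__":
    print("     " + " ".join(f"{col:02x}" for col in range(16)))
    for row in range(16):
        values = " ".join(f"{SBOX[(row << 4) | col]:02x}" for col in range(16))
        print(f"{row << 4:02x}   {values}")
```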
