
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Design and Implementation of an Architecture-aware In-memory Key-Value Store

OMAR GIORDANO

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Design and Implementation of an Architecture-aware In-memory Key-Value Store

OMAR GIORDANO

Master’s Programme, ICT Innovation, 120 credits
Date: 29th January 2021
Academic Supervisor: Alireza Farshin
Industrial Supervisor: Amir Roozbeh
Examiner: Dejan Manojlo Kostic
Co-examiner: Fabrizio Granelli
School of Electrical Engineering and Computer Science
Host company: Ericsson AB – Cloud Division


Design and Implementation of an Architecture-aware In-memory Key-Value Store

© 2021 Omar Giordano


Abstract

Key-Value Stores (KVSs) are a type of non-relational database whose data is represented as key-value pairs; they are often used for cache and session data storage. Among them, Memcached is one of the most popular, as it is widely used in various Internet services such as social networks and streaming platforms.

Given the continuous and increasingly rapid growth of networked devices that use these services, the commodity hardware on which the databases run must process packets faster to meet the needs of the market. However, in recent years, the performance improvements of new hardware have become smaller and smaller. Since the purchase of new products is no longer synonymous with significant performance improvements, companies need to exploit the full potential of the hardware already in their possession, consequently postponing the purchase of more recent hardware. One of the latest ideas for increasing the performance of commodity hardware is the use of slice-aware memory management. This technique exploits the Last Level of Cache (LLC) by making sure that the individual cores fetch data from memory locations that are mapped to their respective cache portions (i.e., LLC slices).

This thesis focuses on the realisation of a KVS prototype—based on the Intel Haswell micro-architecture—built on top of the Data Plane Development Kit (DPDK), and to which the principles of slice-aware memory management are applied. To test its performance, given the non-existence of a DPDK-based traffic generator that supports the Memcached protocol, an additional prototype of a traffic generator supporting these features has also been developed. Performance was measured using two distinct machines: one for the traffic generator and one for the KVS. First, the “regular” KVS prototype was tested; then, to see the actual benefits, the slice-aware one. Both KVS prototypes were subjected to two types of traffic: (i) uniform traffic, where the keys are always different from each other, and (ii) skewed traffic, where keys are repeated and some keys are more likely to be repeated than others.

The experiments show that, in a real-world scenario (i.e., one characterised by skewed key distributions), employing a slice-aware memory management technique in a KVS can slightly improve the end-to-end latency (i.e., ~2%). Additionally, such a technique highly impacts the look-up time required by the CPU to find the key and the corresponding value in the database, decreasing the mean time by ~22.5% and improving the 99th percentile by ~62.7%.


Keywords

Key-Value Store, Data Plane Development Kit, Last Level of Cache, Slice-aware memory management, Memcached, Haswell microarchitecture.


Sammanfattning (Swedish Abstract)

Key-Value Stores (KVSs) are a type of non-relational database whose data is represented as key-value pairs and which are often used for cache and session data storage. Among them, Memcached is one of the most popular, as it is widely used in various Internet services such as social networks and streaming platforms.

Given the continuous and increasingly rapid growth of networked devices that use these services, the commodity hardware on which the databases run must process packets faster to meet the needs of the market. In recent years, however, the performance improvements of new hardware have become smaller and smaller. Since the purchase of new products is no longer synonymous with significant performance improvements, companies must exploit the full potential of the hardware already in their possession, thereby postponing the purchase of newer hardware. One of the latest ideas for increasing the performance of commodity hardware is the use of slice-aware memory management. This technique exploits the Last Level of Cache (LLC) by ensuring that the individual cores fetch data from memory locations that are mapped to their respective cache portions (i.e., LLC slices).

This thesis focuses on the realisation of a KVS prototype—based on the Intel Haswell micro-architecture—built on top of the Data Plane Development Kit (DPDK), to which the principles of slice-aware memory management are applied. To test its performance, given that no DPDK-based traffic generator supporting the Memcached protocol exists, an additional prototype of a traffic generator supporting these features was also developed. Performance was measured using two different machines: one for the traffic generator and one for the KVS. First, the “regular” KVS prototype was tested; then, to see the actual benefits, the slice-aware one. Both KVS prototypes were subjected to two types of traffic: (i) uniform traffic, where the keys always differ from one another, and (ii) skewed traffic, where keys are repeated and some keys are more likely to be repeated than others.

The experiments show that, in real-world scenarios (i.e., those characterised by skewed key distributions), the use of a slice-aware memory management technique in a KVS can slightly improve the end-to-end latency (i.e., ~2%). Moreover, such a technique highly impacts the look-up time required by the CPU to find the key and the corresponding value in the database, decreasing the mean time by ~22.5% and improving the 99th percentile by ~62.7%.


Nyckelord (Keywords)

Key-Value Store, Data Plane Development Kit, Last Level of Cache, Slice-aware Memory Management, Memcached, Haswell microarchitecture.


Acknowledgments

I would like to thank the hosting company, Ericsson, for having allowed me to work on a particularly innovative and interesting topic. In particular, a big “thank you” goes to Amir Roozbeh, my supervisor at Ericsson Research, for his constant support during this work, on both the academic and company sides. I am also very grateful to Alireza Farshin, my supervisor at KTH, for having guided me throughout the entire work with his excellent feedback and suggestions.

I also thank EIT Digital for having given me the opportunity to travel, experience different cultures, and study with a different approach than the classic academic one.

I would also like to thank my friends, who supported me and with whom I shared the best moments of these last two years. A special mention goes to my family, whose ever-present support allowed me to start and complete this long journey through my studies.

Stockholm, January 2021
Omar Giordano


Contents

1 Introduction
  1.1 Problem
  1.2 Purpose
  1.3 Goals
  1.4 Research Methodology
  1.5 Delimitations
  1.6 Contributions
  1.7 Structure of the Thesis

2 Background
  2.1 Cache
    2.1.1 Hardware Cache
    2.1.2 Slice-aware Memory Management
  2.2 Non-Uniform Memory Access
  2.3 Key-Value Store
    2.3.1 Memcached
  2.4 Data Plane Development Kit
  2.5 Hashing, Collisions, and Hash Functions
  2.6 Related Works
    2.6.1 Traffic Generators
    2.6.2 Key-Value Stores
  2.7 Summary

3 Methodology
  3.1 Research Paradigm
  3.2 Experimental Set-up
    3.2.1 Testbed
    3.2.2 Test Environment
  3.3 Data Analysis
  3.4 Data Collection Validity and Reliability

4 Design
  4.1 Traffic Generator
    4.1.1 Packet Management
    4.1.2 Statistical Computations
    4.1.3 Multi-Core Management
  4.2 Key-Value Store
    4.2.1 Database Creation
    4.2.2 Keys and Values Management
    4.2.3 Statistical Computations

5 Implementation
  5.1 Traffic Generator
    5.1.1 Warm-up Phase
    5.1.2 Main-loop Phase
    5.1.3 Closure Phase
  5.2 KVS Server Application
    5.2.1 Difference between Regular and Slice-aware KVS
  5.3 Limitations
  5.4 Summary

6 Results and Analysis
  6.1 Memcached
  6.2 Regular and Slice-aware KVS
    6.2.1 Single Core on both Client and Server Side
    6.2.2 Multiple Cores on both Client and Server Sides
  6.3 Low-level Impact
  6.4 Analysis

7 Conclusions and Future Works
  7.1 Future Works

8 Reflections
  8.1 Sustainability
  8.2 Security and Ethics

References


List of Figures

1.1 Last 48 years of microprocessors trend data [1].
1.2 Development approach.
2.1 8-core Haswell EP die configuration [2].
2.2 Simplified model of an 8-core Haswell CPU cache.
2.3 Cache level memory organisation [3].
2.4 Intel Xeon E5-2667 v3 cache addresses – level division.
2.5 8-core Haswell LLC slices [4].
2.6 Comparison between regular and slice-aware memory management. The colours represent the mapping between CPU cores, LLC slices, and memory locations.
2.7 NUMA memory architecture.
2.8 Memcached datagram header. Light grey coloured fields have variable length [5].
2.9 Linux Kernel with and without DPDK.
4.1 Working scheme of the traffic generator.
4.2 Comparison between regular and slice-aware database creation.
5.1 Packet size – byte division.
5.2 Zipfian distribution of 1000 samples – occurrence histogram.
5.3 Comparison between regular and slice-aware working scheme of the KVS. Note that Core 0 is not present, as the master core is responsible for checking the status (e.g., working, idle) of the other cores.
6.1 Throughput comparison between memaslap-Memcached and the DPDK-based traffic generator and KVS with a uniform key distribution. Both client and server sides use either one or seven cores – average of 10 runs.
6.2 Latency comparison between memaslap-Memcached and the DPDK-based traffic generator and KVS with a uniform key distribution. Both client and server sides use either one or seven cores – average of 10 runs.
6.3 Median latency comparison of different workloads in a single-core scenario. The lower error bar represents the 25th percentile and the upper error bar the 75th percentile – the values are the average obtained from 10 runs.
6.4 Mean throughput comparison of different workloads in a single-core scenario. The error bars represent the standard deviation (~0.5 Mpps in every result) – the values are the average obtained from 10 runs.
6.5 Median latency comparison of different workloads in a multi-core (7) scenario. The lower error bar represents the 25th percentile and the upper error bar the 75th percentile – the values are the average obtained from 10 runs.
6.6 Mean throughput comparison of different workloads in a multi-core (7) scenario. The error bars represent the standard deviation (~1.5 Mpps in every result) – the values are the average obtained from 10 runs.
6.7 Clock cycles spent among all the CPU cores of the KVS during GET operations @ 3.2 GHz. The maximum value corresponds to the 99th percentile, the minimum to the 0.01st, while the box represents the interquartile percentiles – the values are the average obtained from 10 runs.
6.8 Clock cycles spent among all the CPU cores of the KVS while performing GET operations @ 3.2 GHz during skewed key distribution. The maximum value corresponds to the 99th percentile, the minimum to the 0.01st, while the box represents the interquartile percentiles – the values are the average obtained from 10 runs.


List of Tables

2.1 Description of cache memory parameters.
2.2 Example of Key-Value Store association.
6.1 100% GET workload. Latency percentiles, mean and standard deviation – all measures are in [µs] and represent the average of 10 runs.
6.2 90% GET workload. Latency percentiles, mean and standard deviation – all measures are in [µs] and represent the average of 10 runs.
6.3 Mean throughput conversion of the single-core scenario from Mpps to Gbps – the values are the average obtained from 10 runs.
6.4 100% GET workload. Latency percentiles, mean and standard deviation in a multi-core (7) scenario – all measures are in [µs] and represent the average of 10 runs.
6.5 90% GET workload. Latency percentiles, mean and standard deviation in a multi-core (7) scenario – all measures are in [µs] and represent the average of 10 runs.
6.6 Mean throughput conversion of the multi-core (7) scenario from Mpps to Gbps – the values are the average obtained from 10 runs.
6.7 Clock-cycle percentiles, mean and standard deviation in a multi-core (7) scenario while performing GET operations during skewed key distribution.


List of acronyms and abbreviations

API    Application Programming Interface
CapEx  Capital Expenditure
CAS    Compare-And-Swap
COTS   Commercial Off-The-Shelf
CPU    Central Processing Unit
DB     Database
DDoS   Distributed Denial of Service
DDR4   Double Data Rate 4
DIMM   Dual in-line Memory Module
DPDK   Data Plane Development Kit
DRAM   Dynamic Random Access Memory
EP     Extreme Processor
GPU    Graphic Processing Unit
IP     Internet Protocol
KVS    Key-Value Store
LLC    Last Level of Cache
LRU    Least Recently Used
NIC    Network Interface Card
NoSQL  Non-Structured Query Language
NUMA   Non-Uniform Memory Access
PMD    Poll Mode Driver
QPI    Quick Path Interconnect
RAM    Random Access Memory
RSS    Receive Side Scaling
RTT    Round Trip Time
SQL    Structured Query Language
SRAM   Static Random Access Memory
TCP    Transmission Control Protocol
TPS    Transactions Per Second
UDP    User Datagram Protocol
UPI    Ultra Path Interconnect


Chapter 1

Introduction

The large majority of Internet services operate using collections of data, commonly called Databases (DBs). For example, without access to previously stored contacts, it would not be possible to maintain a record of a user’s social network. DBs, in turn, rely on physical infrastructures composed of Commercial Off-The-Shelf (COTS) hardware, also known as “commodity hardware”, and are mainly divided into two sub-categories: Structured Query Language (SQL) and Non-Structured Query Language (NoSQL)—commonly known as relational and non-relational databases, respectively.

Relational DBs store data in tables. There is one table for each type of information to be processed, and each consists of different columns: one for every aspect of the data. Each table has one or more columns that play the role of a primary key (i.e., an index that uniquely identifies a specific row among all the other rows). Additionally, there can also be relationships between tables: a row of one table can refer to a row of a different table. The rigid content structure of relational DBs is not present in non-relational ones. The information no longer finds its place in rows listed in tables, but in completely different and not necessarily structured objects (e.g., documents).

Being relational or not makes a DB suitable for different applications and purposes. In fact, relational DBs are mainly used for data analysis, banking systems, and online shopping, whereas non-relational DBs are typically used for big data, Internet of Things, and real-time web applications [6]. While Oracle [7], Microsoft SQL Azure [8], and PostgreSQL [9] are examples of relational DBs, examples of non-relational DBs are Key-Value Stores (KVSs) (i.e., DBs that associate values with unique keys) such as Redis [10], Memcached [11], and Aerospike [12], but there are many others [13].

More details about Key-Value Stores can be found in Section 2.3.


In the next decade, the number of connected devices is expected to grow from 22 billion in 2018 to approximately 50 billion in 2030 [14]. Such massive growth in devices—and, consequently, in data traffic—requires commodity hardware to handle more than double the requests in less than a decade. Since non-relational DBs are the ones that best tackle the problem of massive data traffic [6], they will play an even more relevant role in the future than they do today. Consequently, increasing their performance and reducing their costs has become essential. However, to improve the performance of commodity hardware, it is essential to find the performance inefficiencies and try to address them. For example, many are using the Data Plane Development Kit (DPDK) [15] to mitigate the costly operations performed by the Linux network stack. DPDK has also been shown [16, 17] to bring enormous benefits to databases—e.g., improving Transactions Per Second (TPS) and latency.

Nowadays, in-memory KVSs such as Memcached [11] take advantage of particular memory management techniques, such as software caching, that exploit the Dynamic Random Access Memory (DRAM) to increase the overall performance (i.e., decrease the DB load). Memcached has become widely popular, and companies such as Facebook [18], Google [19], Netflix [20], and Twitter [21] have adopted this solution to handle the millions of requests they receive every day [22]. However, in a scenario with multi-hundred-gigabit networks [23], DRAM is not fast enough anymore [24], and we need to find faster solutions. Farshin et al. [25] showed that an efficient usage of the Last Level of Cache (LLC) could further improve the performance of applications. They showed that an emulated slice-aware KVS—i.e., one employing the memory management technique further explained in Section 2.1.2—is expected to perform 20% faster (in terms of latency) than regular KVSs. However, Farshin et al. only evaluated a simple scenario, which may not apply to real-world systems, as in-use KVSs may have a larger memory footprint—i.e., the amount of memory that an application needs to run—which could kill the benefits of slice-aware memory management. This motivated the implementation of an architecture-aware in-memory key-value store employing the principles of slice-aware memory management and the evaluation of its performance.

Software cache is an emulated version of hardware cache within the Random Access Memory (RAM). More information can be found in Section 2.1.


1.1 Problem

Capital Expenditures (CapExs) play an important role in any company [27]. Due to their high initial costs and irreversibility, it is imperative to ensure that CapExs have the longest-term effect possible. When dealing with commodity hardware, the main problems are performance and durability. Hardware performance does not keep up with the necessity—or demand—of the market, affecting KVSs and consequently creating the need for hardware upgrades—which, in turn, impacts the CapEx. During the last 50 years, Moore’s law described this trend [28]. However, in 2019, NVIDIA’s CEO affirmed that Moore’s law is now dead [29], since Graphic Processing Units (GPUs) are advancing at a much faster pace than Central Processing Units (CPUs); CPUs are no longer meeting Moore’s law expectations [30]. In addition, if we also consider Dennard scaling [31], during the years 2014–2015 CPU frequency almost stopped increasing [32], with a consequent attenuation of single-thread performance growth. In 2018, the performance gap between the actual and expected values reached almost two orders of magnitude [33], and it has been maintained to date, as shown in Fig. 1.1.

Figure 1.1 – Last 48 years of microprocessors trend data [1]. (Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010–2019 collected by K. Rupp.)

CapExs are funds that companies use to purchase property, equipment, etc. [26].


Even though “Moore’s law is now dead” [29], companies still need more performant hardware. However, buying new products can no longer be considered an efficient solution: the CapEx increases, but there is no guarantee of substantial performance improvements, as the performance gap between new and old products is no longer particularly remarkable.

By addressing this problem and focusing on increasing the performance of the commodity hardware that companies already own, they could save on their equipment investments. In addition, the companies that deploy digital services could also start addressing the problem of an almost three-fold increase in connected devices within 10 years [14].

1.2 Purpose

The primary purpose of this thesis is to discover the real benefits of slice awareness with respect to regularly designed memory management. As a consequence, the thesis evaluates the effectiveness of a proposed slice-aware memory management for KVS applications [25] on a specific CPU micro-architecture (i.e., Haswell [34]). Ultimately, this could result in considerable monetary savings: slice-aware memory management is expected to increase hardware performance and consequently postpone companies’ need to upgrade their hardware.

1.3 Goals

Given the hypothesis that better memory management (e.g., employing slice-aware memory management) could improve the performance of high-speed KVSs [25], the goal of this project is to develop a prototype of a slice-aware Memcached KVS on top of DPDK. To test the prototype, it is necessary to have a packet generator that supports the Memcached protocol. However, there is no available open-source, copyright-free, DPDK-based packet generator that supports it [35]. Consequently, this work also includes the development of a (prototype) DPDK-based benchmarking tool to test the developed KVS applications.

Not all the functions are implemented. More details can be found in Section 1.5.


1.4 Research Methodology

This project employs an iterative methodology, repeating the same loop until the project’s end. Each task has been divided into sub-tasks that typically follow the structure shown in Fig. 1.2.

Figure 1.2 – Development approach.

The literature study was necessary to better understand the tools (e.g., the DPDK libraries) that I would later use during the implementation phase. By following such an approach—and after a proper configuration of the testbed—I managed to run the tests (experiments) with high control over the background variables—i.e., all the variables related to the experiment, such as the number of keys and values, the number of CPU cores, etc.—granting high internal validity. Moreover, since proper documentation has been written, replicability is ensured.

1.5 Delimitations

This work intends to implement a basic Memcached server application on top of DPDK and evaluate its performance. The goal is to examine the full potential of a slice-aware KVS. The benchmarking tool is also a basic version of a packet generator, developed using the DPDK libraries and the Memcached protocol. By “basic” I mean that both the generator and the Memcached server are prototypes; they support neither all the functions nor all the operations of the Memcached protocol. In fact, from the list of all Memcached operations [5], only GET and SET have been taken into account during the development of this project. The reason behind this choice is that both memslap [36] and memaslap [37], the default benchmarking tools for Memcached server applications, use those two specific instructions to run their tests—and the usage of such operations is also a widely adopted choice [38, 39]. Also, the developed applications currently only support the User Datagram Protocol (UDP); supporting the Transmission Control Protocol (TCP) is left as future work—see Sections 3.2.2 and 7.1.

1.6 Contributions

This thesis makes the following contributions:

A) Development of a benchmarking tool to test the KVS application. Since there is—at the moment—no DPDK-based traffic generator that supports the Memcached protocol, its development was necessary to test the developed KVS. However, this tool is a prototype: the main purpose of this thesis is the performance analysis of a slice-aware KVS using its two primary operations (i.e., read and write). For this reason, the developed benchmarking tool only supports such operations—more details in Section 4.1.

B) Development of a DPDK-based KVS prototype that supports both the Memcached protocol and slice-aware memory management. At the moment, only a single DPDK-based KVS that supports Memcached exists (i.e., ScyllaDB [40]), but it does not make use of slice-awareness. Additionally, in the context of KVSs, slice-awareness has been tested only using an emulated KVS [25]. Therefore, this thesis contributes to a further understanding of the real potential of slice-awareness—when applied to KVSs—by evaluating its performance in a scenario closer to reality.


1.7 Structure of the Thesis

After this introductory chapter, the thesis is divided into 7 additional chapters, each of which describes a specific topic:

Chapter 2 presents an extensive overview of the relevant background information useful to understand the different topics described in this thesis.

Chapter 3 describes the methodology used to approach the problem and all the essential information needed to replicate the experiments, together with an explanation of the data analysis, its validity, and its reliability.

Chapter 4 describes the design of the developed applications. It explains their structure in detail and how the applications communicate with each other, and also provides reasons for specific choices.

Chapter 5 is about the implementation of the traffic generator and the key-value store application. It describes how specific functionalities are implemented, the limitations of the developed applications, and an overview of their behaviour.

Chapter 6 explains the obtained results, together with a detailed analysis that also takes into account the limitations described in the previous chapter.

Chapter 7 takes up the discourse begun in the introductory chapter, drawing the conclusions of the project and describing potential future works.

Chapter 8 focuses on the social and economic impact that this work may imply, also describing the ethical, sustainability, and security aspects related to this thesis.


Chapter 2

Background

This chapter discusses the background information required to understand the thesis. We start by discussing the hardware-related background—i.e., the cache hierarchy and Non-Uniform Memory Access (NUMA)—then we move to more software-related content, such as KVSs, the Memcached protocol, the libraries used (i.e., DPDK), and hash tables. Finally, this chapter also describes the existing works related to this project.

2.1 Cache

The cache is a particular type of memory—i.e., Static Random Access Memory (SRAM)—designed to speed up memory access operations (e.g., reading and writing). It is a temporary and volatile memory located inside the CPU, characterised by limited capacity and high data access speed [41]. When there is a request for a particular datum in memory, the CPU first checks inside the cache. Finding the datum inside the cache corresponds to a cache hit, while not finding it corresponds to a cache miss. The cache is structured to contain the most requested data that the CPU may need. Since the CPU cache can contain only a small amount of data, there are different variations of the Least Recently Used (LRU) algorithm [42, 43] that aim to improve spatial and temporal locality. The core of these solutions is to provide a proper method to remove some data from the cache when a new datum is requested and there is no more space left in the cache.

Spatial locality refers to the memory location: an instruction/datum located—or stored—near a recently executed/accessed one has a high probability of being executed/accessed.

Temporal locality refers to the re-execution time of an instruction and the re-access time of data: a just-executed instruction has a high chance of being executed again within a short time.
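To illustrate the eviction principle these algorithms implement, the following minimal C sketch shows a timestamp-based LRU policy over a small fixed-size table. It is purely illustrative (real CPU caches implement LRU approximations in hardware, and all names here are hypothetical), but it captures the idea of replacing the least recently used entry when space runs out.

```c
#include <stdint.h>

#define CACHE_ENTRIES 8

/* One cache entry: a key, its data, and the time of its last access. */
struct lru_entry {
    uint64_t key;
    uint64_t data;
    uint64_t last_used; /* logical timestamp, 0 = empty slot */
};

static struct lru_entry table[CACHE_ENTRIES];
static uint64_t clock_tick;

/* Look up a key; on a miss, evict the least recently used slot. */
uint64_t *lru_get(uint64_t key)
{
    struct lru_entry *victim = &table[0];

    clock_tick++;
    for (int i = 0; i < CACHE_ENTRIES; i++) {
        if (table[i].last_used != 0 && table[i].key == key) {
            table[i].last_used = clock_tick;   /* cache hit: refresh */
            return &table[i].data;
        }
        if (table[i].last_used < victim->last_used)
            victim = &table[i];                /* track the oldest entry */
    }
    /* Cache miss: reuse the oldest (or an empty) slot for the new key. */
    victim->key = key;
    victim->data = 0;  /* to be filled from the backing store */
    victim->last_used = clock_tick;
    return &victim->data;
}
```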


Before continuing, we must distinguish between hardware and software cache. Even though the working principle is the same, the software cache does not necessarily take advantage of the real hardware cache. A software cache creates a small memory container that keeps the most recently used entities, allowing applications to search that specific memory portion first rather than checking a much larger container. Such a system increases the probability of keeping those ‘cache’ entries (i.e., the software cache) in the cache hierarchy (i.e., the hardware cache).

In this thesis, I want to maximise the benefits of the hardware cache. To run the tests, I used Intel Haswell [34] Extreme Processors (EPs); more specifically, the Intel(R) Xeon(R) E5-2667 v3 [44]—described in more detail in Section 3.2.1. As a consequence, I consider only this specific CPU micro-architecture (see Fig. 2.1) to explain how the cache is structured and how it works. Without going into details that deviate too much from the general topics of this thesis, looking at Fig. 2.1 we can say that—in the Haswell micro-architecture—each core is associated with an LLC slice of 2.5 MB. Each core also uses a CBo (i.e., a caching agent) that provides a ring interface between itself and the LLC [45]. The caching agent is also the one that allows the cores to write and read across all the different LLC slices.

This explanation can be extended to many other CPU micro-architectures, but the figures and structures shown later may differ from one CPU generation to another.

This concept is explained in detail in Section 2.1.1 and Section 2.1.2.


Figure 2.1 – 8-core Haswell EP die configuration [2].


2.1.1 Hardware Cache

The cache of Haswell CPUs is organised into three hierarchical levels: level one, level two, and level three, abbreviated as L1, L2, and L3 respectively. The third level of cache is also called the Last Level of Cache (LLC). From the micro-architectural point of view, the first two levels of cache are located inside each CPU core. By contrast, the LLC is shared among all CPU cores, which are interconnected via a ring bus (see Fig. 2.2).

Figure 2.2 – Simplified model of an 8-core Haswell CPU cache.

Fig. 2.3 shows how cache memory is structured. Cache memory is organised in arrays of sets, where each set is composed of minimum addressable units called cache lines. Cache lines, in turn, consist of three different elements:

• Block. A B-byte memory portion that stores the data.

• Tag. A field that uniquely identifies the data block contained in the cache line.

• V bit, or validity bit. It establishes whether the data inside the specific block within the cache line is valid or not.

To determine the cache size, we multiply the size of a block (B) by the number of cache lines within a set (E) and the number of sets (S), resulting in C = B × E × S (see Fig. 2.3 and Tab. 2.1). Note that the tag and validity fields are not included.


Figure 2.3 – Cache level memory organisation [3].

Table 2.1 – Description of cache memory parameters.

Parameter          Description
E                  Total number of lines per set.
S = 2^s            Total number of sets.
B = 2^b            Total number of bytes per block.
b = log2(B)        Number of bits of the block offset.
s = log2(S)        Number of bits of the set index.
C = B × E × S      Cache size in bytes.

The number of lines within a set is called the “set associativity” and determines the CPU cache mapping technique (i.e., how the main memory contents are stored in the cache). If a CPU cache is N-way set associative, each set contains N lines. The peculiarity of this mapping technique is that a specific RAM block can only be mapped to one specific cache set. However, within this set, the memory block can be mapped to any available cache line.


As we can see from Fig. 2.3, the parameters B and S induce a partitioning of the bits of a memory address into different fields. More precisely, in set-associative mapping, addresses are composed of three fields: offset, index, and tag—represented from the lowest to the highest bit order in Fig. 2.4.

• Offset identifies how many bytes to ‘skip’—within the cache block—to reach the data. For example, an offset equal to 11 in binary (i.e., 3 in decimal form) means that the requested data starts from byte 3 of that specific cache block.

• Set index specifies the cache set in which to search for the line containing the data. For example, the n-th set is located at set index n.

• Tag represents the ‘ID’ of an address and identifies what the CPU is searching for. The CPU compares the tag field of a given address with the tag field present in the cache line. If the compared tags are the same and the validity bit is set, the search results in a hit; otherwise, it results in a miss.

All three fields have variable dimensions depending on the size of the cache and the set associativity (i.e., the number of lines contained in a set). For example, the Intel Xeon E5-2667 v3 has 2 × 8 × 32 KB of L1 cache (32 KB for instructions and 32 KB for data per core). Since the cache block size is 64 B, the offset size is log2(64) = 6 bits, and there are 32 KB / 64 B = 512 cache blocks per core. The L1 cache is 8-way set associative (i.e., eight cache lines per set), which implies 512/8 = 64 different indexes (i.e., the L1 index size is log2(64) = 6 bits). Knowing that a CPU address is 64 bits, we obtain a tag field of 64 − 6 − 6 = 52 bits. By applying the same reasoning to the L2 cache (8 × 256 KB, 8-way set associative) and the L3 cache (20 MB, 20-way set associative), we obtain the cache address level division shown in Fig. 2.4.

Figure 2.4 – Intel Xeon E5-2667 v3 cache addresses – level division.
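The field division in Fig. 2.4 follows mechanically from the cache parameters. The short, self-contained C program below is a sketch that reproduces the arithmetic above, with the E5-2667 v3 parameters hard-coded:

```c
#include <stdio.h>
#include <math.h>  /* log2; compile with -lm */

/* Derive the offset/index/tag bit split for one cache level,
 * assuming 64-bit addresses and parameters that are powers of two. */
static void field_split(const char *name, unsigned cache_bytes,
                        unsigned ways, unsigned block_bytes)
{
    unsigned sets        = cache_bytes / (block_bytes * ways);
    unsigned offset_bits = (unsigned)log2(block_bytes);
    unsigned index_bits  = (unsigned)log2(sets);
    unsigned tag_bits    = 64 - offset_bits - index_bits;

    printf("%-22s offset=%2u  index=%2u  tag=%2u\n",
           name, offset_bits, index_bits, tag_bits);
}

int main(void)
{
    field_split("L1 (32 KB, 8-way)",  32 * 1024,        8,  64);
    field_split("L2 (256 KB, 8-way)", 256 * 1024,       8,  64);
    field_split("L3 (20 MB, 20-way)", 20 * 1024 * 1024, 20, 64);
    return 0;
}
```

Running it prints the 6/6/52, 6/9/49, and 6/14/44 bit splits that correspond to the level division of Fig. 2.4.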


Another important concept to keep in mind when dealing with the cache is that, moving down the hierarchy (i.e., from L1 to L3), the cache size increases. However, the rate at which data can be read and stored (i.e., the bandwidth) decreases considerably [46]. Moreover, when a cache miss occurs, the CPU starts searching inside the next cache level. Consequently, before reaching the RAM, the CPU must incur a miss in all three cache levels.

2.1.2 Slice-aware Memory Management

The third level of cache is divided into different slices. In particular, in Haswell CPUs, each core is associated with one slice. To simplify the Haswell EP micro-architecture shown in Fig. 2.1, we can refer to Fig. 2.5, where an LLC slice is associated with each CPU core.

Figure 2.5 – 8-core Haswell LLC slices [4].

The difference between regular and slice-aware memory management can be seen in Fig. 2.6. On the right of the figure, it is possible to see that the cells of the main memory are mapped to different cores, and such mapping is the same for both regular and slice-aware memory management (as previously described in Section 2.1.1). What changes between the two memory management schemes lies in how data are stored in the LLC slices.

In regular memory management (Fig. 2.6a), every LLC slice contains data associated with memory locations mapped to different cores. This means that, for example, Core 0 needs to read inside the LLC slice of Core 5 to fetch the data associated with its colour, and so it is for all the other cores. Since the LLC slices are “distant” from each other and the CPU cores need to use the CPU ring to reach other LLC slices, this results in an increase in fetching time [47]. Slice awareness addresses this problem by mapping the memory addresses to the proper slice, so that no core has to fetch data from a different slice (see Fig. 2.6b).

(a) Regular memory management. (b) Slice-aware memory management.

Figure 2.6 – Comparison between regular and slice-aware memory management. The colours represent the mapping between CPU cores, LLC slices, and memory locations.


We can summarise by saying that slice-aware memory management is the principle by which an application uses a memory portion that is mapped to specific LLC slices. Since each core can access any part of the cache, any core can access an LLC slice that is not mapped to itself—which is also the classic cache behaviour. However, slice awareness (i.e., reading data only from the associated LLC slice) can remarkably decrease the time that a core needs to access the memory, resulting in lower latency [25, 47].
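To make this concrete, below is a minimal sketch of how a slice-aware allocator could select memory for a given core. Both helper functions are assumptions, not code from the thesis: phys_addr_to_slice() stands for the undocumented, reverse-engineered address-to-slice hash of the CPU, and virt_to_phys() for a virtual-to-physical translation (e.g., via /proc/self/pagemap).

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical mapping from a physical address to its LLC slice;
 * on Haswell this hash must come from prior reverse-engineering work. */
extern unsigned phys_addr_to_slice(uintptr_t phys_addr);

/* Hypothetical virtual-to-physical translation helper. */
extern uintptr_t virt_to_phys(const void *virt_addr);

/*
 * From a pool of pre-allocated memory chunks, keep only the chunks whose
 * physical address maps to the LLC slice associated with a given core.
 * These chunks then back the per-core portion of the key-value store.
 */
size_t filter_chunks_for_core(void **pool, size_t pool_len,
                              unsigned core_slice, void **out)
{
    size_t kept = 0;

    for (size_t i = 0; i < pool_len; i++) {
        uintptr_t phys = virt_to_phys(pool[i]);
        if (phys_addr_to_slice(phys) == core_slice)
            out[kept++] = pool[i];  /* local-slice chunk: keep it */
        /* chunks mapping to other slices are left to the other cores */
    }
    return kept;
}
```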

2.2 Non-Uniform Memory Access

NUMA is a memory architecture used in multi-CPU-socket environments to achieve higher performance. Such an architecture creates multiple memory nodes, allowing each CPU—or core—to access all of them (see Fig. 2.7). However, doing so results in non-uniform access times, because every remote memory access has to go through a Quick Path Interconnect (QPI) or an Ultra Path Interconnect (UPI) link that connects the CPUs (refer also to Fig. 2.1), increasing the memory access time [48]. Consequently, to improve performance, applications should allocate memory in a NUMA-aware way—i.e., allocating memory from the closest memory node (if possible).

Figure 2.7 – NUMA memory architecture.
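As a minimal sketch of what NUMA-aware allocation looks like with DPDK's allocator (the buffer name and size are illustrative, not taken from the thesis code), a per-core buffer can be requested on the memory node local to the calling core:

```c
#include <rte_malloc.h>
#include <rte_lcore.h>

/*
 * Allocate a 1 MiB, cache-line-aligned buffer from the memory node
 * (socket) on which the calling lcore runs, so that later accesses
 * do not have to cross the QPI/UPI link.
 */
void *alloc_local_buffer(void)
{
    return rte_malloc_socket("kvs_buffer",          /* debug name   */
                             1 << 20,               /* size (bytes) */
                             RTE_CACHE_LINE_SIZE,   /* alignment    */
                             rte_socket_id());      /* local node   */
}
```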

2.3 Key-Value Store

Key-Value Stores are NoSQL databases. As already discussed in Chapter 1, KVSs are a form of database characterised by the usage of non-relational models. KVSs are inspired by data structures typically called dictionaries or maps. These DBs do not have the ‘classical’ data structures, such as tables; in fact, data must be provided in the form of key-value pairs. The value is the actual information, while the key is what allows its recovery in the search phase. A key-value database can be considered as a single large map (or dictionary). As can be seen from the example in Tab. 2.2, KVSs are a way to associate keys—typically names (e.g., devices, sensors)—with values (e.g., numbers, locations) or, in other words, to associate values (data) with keys (names).

Table 2.2 – Example of Key-Value Store association.

Key           Value
Name       →  Phone number
Wind speed →  m/s

KVSs are also principally used to perform (software) cache operations, where data are saved by associating them with a key for repeated use, without having to retrieve them again from the source [49]. In fact, they are often called upon to represent cache or session data storage; an example is Memcached [11].

2.3.1 Memcached

Memcached is a “distributed memory object caching system” [11]. It is used to decrease the database load and, at the same time, accelerate dynamic web applications. Memcached acts as a software cache itself: it reserves a piece of memory which acts as a cache, potentially avoiding scrolling through the entire memory to find the requested objects and consequently speeding up the processing of the received instructions. For this reason, Memcached is positioned in front of the back-end DB, but this does not necessarily mean that the DB is local. In fact, the DB can also be located somewhere else in the network (e.g., in a stand-alone machine). In this case, Memcached acts as an intermediary, containing the most requested data and contacting the DB only when a client asks for a key not present in Memcached.

In computer science, dictionaries and maps are abstract data types composed of an extensive collection of keys and values.


The working principle of Memcached is based on a request-response mechanism. For example, a client sends a request—containing a key—to a Memcached server, asking for the specific information associated with that key (i.e., a value). The server checks the key contained in the request and starts looking into its cache. If there is a hit, Memcached returns the data to the client (i.e., the response). If the look-up results in a miss, Memcached starts looking inside the DB (in the case of a non-local DB, Memcached contacts it and requests the data), fetches the data, saves it (i.e., it updates its cache), and then returns the value to the client.

Memcached Packet Header

Fig. 2.8 shows the binary protocol used by Memcached with all its fields. A basic explanation of each of them follows.

The fixed part of the header is 24 bytes long, laid out as follows:

Bytes    Field
0        Magic
1        Opcode
2–3      Key length
4        Extra length
5        Data type
6–7      vbucket ID / Status
8–11     Total body length
12–15    Opaque
16–23    CAS

The variable-length fields Command-specific extras, Key, and Value follow the fixed header.

Figure 2.8 – Memcached datagram header. Light grey coloured fields have variable length [5].

Magic is the number that identifies the packet type. For example, the hexadecimal number 0x80 corresponds to a request and 0x81 to a response packet.

Opcode is the operation code. Each operation that Memcached can perform is identified by a code. For example, the operation GET has code 0x00 and the operation SET has code 0x01.

For more information, refer to the official Memcached wiki webpage [5].


Key length specifies the length of the key field in bytes.

Extra length specifies the length of the Command-specific extras field in bytes.

Data type is not used yet. For now, only the code 0x00 is used, which represents ‘raw’ bytes.

vbucket ID is only used in a request packet. It is the virtual bucket ID assigned to the packet.

Status is used only in a response packet. It gives information about potential errors that may have occurred for a received packet. The value 0x0000 corresponds to the “no error” message.

Total body length is the sum (in bytes) of the lengths of the last three fields: command-specific extras, key, and value.

Opaque identifies the packet. It is an ID that is copied back into the response and allows the sender to associate the different replies with each sent packet.

CAS is the compare-and-swap (or check-and-set) identifier. It acts as a data version checker, setting the data only if it has not been updated since the last fetch.

Command-specific extras is a field which can store several “extra” features. For instance, it can store flags and expiration commands.

Key is the field where the key is stored.

Value is the field where the value is stored.
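The fixed 24-byte part of this header maps naturally onto a packed C structure. The sketch below follows the field widths given above (the multi-byte fields travel in network byte order); the struct and constant names are illustrative, not taken from the thesis code.

```c
#include <stdint.h>

#define MC_MAGIC_REQUEST  0x80
#define MC_MAGIC_RESPONSE 0x81
#define MC_OPCODE_GET     0x00
#define MC_OPCODE_SET     0x01

/* Fixed 24-byte Memcached binary header; extras, key, and value follow. */
struct memcached_header {
    uint8_t  magic;             /* request (0x80) or response (0x81)        */
    uint8_t  opcode;            /* e.g., GET = 0x00, SET = 0x01             */
    uint16_t key_length;        /* length of the key field, in bytes        */
    uint8_t  extras_length;     /* length of the command-specific extras    */
    uint8_t  data_type;         /* always 0x00 ('raw' bytes) for now        */
    uint16_t vbucket_or_status; /* vbucket ID (request) / status (response) */
    uint32_t total_body_length; /* extras + key + value, in bytes           */
    uint32_t opaque;            /* echoed back to match replies to requests */
    uint64_t cas;               /* data version checker                     */
} __attribute__((packed));
```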

GET and SET operations

Every Memcached operation must respect different rules—or requirements. However, since this thesis only focuses on the operations GET and SET (see Section 1.5), I only report the rules that these two operations must follow.

GET is the operation used to retrieve the value corresponding to a specific key. In the request phase, when using this operation, both the extras and value fields must be empty. By contrast, the response does not contain the key field, but it must have extras (e.g., a flag field) and the value—if the key has been found.

SET associates a value with a key, regardless of whether the key already exists or not. It differs from the operations ADD and REPLACE because the former fails if the key already exists and the latter fails if the key does not exist. As a consequence, a SET request must have both key and value and may have extras fields. The response will have no extras, key, or value field, but it will contain the data version checker if and only if the operation succeeds. Otherwise, the status field will report the failure.
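Using the header structure sketched above, a GET request for a fixed-size key could be assembled as follows. This is a minimal illustration under those rules (no extras, no value, body = key only), not the thesis' actual packet-building code; htons/htonl convert the multi-byte fields to network byte order.

```c
#include <arpa/inet.h>  /* htons, htonl */
#include <stdint.h>
#include <string.h>

/* Fill `buf` with a GET request for `key`; returns the packet length. */
size_t build_get_request(uint8_t *buf, const void *key, uint16_t key_len,
                         uint32_t opaque)
{
    struct memcached_header *hdr = (struct memcached_header *)buf;

    memset(hdr, 0, sizeof(*hdr));
    hdr->magic = MC_MAGIC_REQUEST;
    hdr->opcode = MC_OPCODE_GET;
    hdr->key_length = htons(key_len);
    /* GET requests carry no extras and no value: the body is the key. */
    hdr->total_body_length = htonl(key_len);
    hdr->opaque = opaque;   /* echoed back to match the reply */

    memcpy(buf + sizeof(*hdr), key, key_len);
    return sizeof(*hdr) + key_len;
}
```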

2.4 Data Plane Development Kit

DPDK is a programming framework whose purpose is to accelerate packet processing on a wide variety of CPU architectures [15]. Thanks to its libraries and external network driver support (e.g., MLX5 for Mellanox network cards), DPDK allows network devices—e.g., Network Interface Cards (NICs)—to communicate directly with the applications, bypassing the (costly) Linux kernel networking stack (see Fig. 2.9).

Figure 2.9 – Linux Kernel with and without DPDK.

DPDK uses a Poll Mode Driver (PMD) to communicate with the NIC. The PMD contains several Application Programming Interfaces (APIs), and it is also what removes the overhead of Linux interrupts, allowing DPDK to fetch packets directly from the NIC. DPDK organises the messages (i.e., packets) coming from—and going to—a network application inside fixed-length objects called mbufs, i.e., message buffers. In turn, mbufs are stored inside a memory allocator called a mempool. Mbufs contain all the relevant information needed for the packets to be processed. For example, mbufs contain the starting memory address of the data segment and the size of the message. In addition, in case multiple mbufs are used, they also contain the location of the next mbuf. DPDK is also NUMA-aware and uses memory alignment techniques to further improve performance. Further information about the DPDK working principles and drivers can be found in the official documentation [50].
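To make the mbuf/mempool/PMD interplay concrete, the following condensed sketch shows the canonical DPDK receive path: create a mempool, then busy-poll a port for bursts of mbufs. Error handling and port configuration (rte_eth_dev_configure(), queue set-up, rte_eth_dev_start()) are omitted for brevity, and names such as NUM_MBUFS and BURST_SIZE are illustrative.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define NUM_MBUFS  8191
#define BURST_SIZE 32

int main(int argc, char **argv)
{
    /* Initialise the Environment Abstraction Layer (hugepages, lcores). */
    rte_eal_init(argc, argv);

    /* Mempool backing the RX mbufs, allocated on the local NUMA node. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbuf_pool", NUM_MBUFS, 250 /* per-lcore cache */,
        0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    (void)pool;

    /* ... rte_eth_dev_configure(), rte_eth_rx_queue_setup(..., pool),
       and rte_eth_dev_start() on port 0 would go here ... */

    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        /* Busy-poll the NIC: no interrupts, no kernel involvement. */
        uint16_t nb_rx = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                          bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* A Memcached request carried by bufs[i] would be parsed here. */
            rte_pktmbuf_free(bufs[i]);
        }
    }
    return 0;
}
```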

2.5 Hashing, Collisions, and Hash Functions

Hashing is a technique that uniquely identifies a specific object within a group of similar objects by assigning a unique key to that particular element [51]. Hashing can be used in applications when many elements need to be stored and a linear access cost (e.g., that of scanning an array) cannot be accepted. In the specific case of KVSs, the main goal of hash tables is to make both the search and the insertion of hashed keys as fast as possible. The hash function is used to compute an index that indicates where an entry can be found or inserted. Sometimes pairs of elements might be mapped to the same hash value: this is called a “collision”. To avoid collisions and keep the data structure organised, hash tables need to implement some algorithm that periodically updates and maintains the table. In this way, the degradation of performance due to high occupancy can be avoided [52].

Hashing is a process that generates a fixed-dimension output starting from a variable-dimension input. This is achieved thanks to mathematical formulas known as hash functions. In this thesis, I opted for Cuckoo hashing [53] because it guarantees a worst-case look-up of O(1)—and it is already implemented inside the DPDK libraries. Cuckoo hashing maps each item to multiple candidate buckets using hashing; the item is then stored in one of those buckets. The insertion of a new item may relocate existing items to their alternate candidate buckets. It is worth mentioning that, in order to keep the insertion time low, Cuckoo hash tables should not exceed a utilisation of 90% [54].
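DPDK exposes its cuckoo-hash implementation through the rte_hash library. The sketch below shows the basic create/insert/look-up cycle; the table name, capacity, and the 64-byte key length (matching the fixed key size used in this thesis, see Section 3.2.2) are illustrative parameters, not the thesis' configuration.

```c
#include <rte_hash.h>
#include <rte_jhash.h>
#include <rte_lcore.h>

/* Create a cuckoo hash table for fixed-size 64-byte keys. */
struct rte_hash *create_kvs_table(void)
{
    struct rte_hash_parameters params = {
        .name = "kvs_table",
        .entries = 1 << 20,            /* capacity; keep utilisation < 90% */
        .key_len = 64,                 /* fixed key size, in bytes */
        .hash_func = rte_jhash,        /* hash used to pick candidate buckets */
        .hash_func_init_val = 0,
        .socket_id = (int)rte_socket_id(), /* NUMA-local allocation */
    };
    return rte_hash_create(&params);
}

/* Insert a key/value pair, then look the key up again. */
int kvs_demo(struct rte_hash *h, const void *key, void *value)
{
    void *found;

    if (rte_hash_add_key_data(h, key, value) < 0)
        return -1;                           /* table full or error */
    return rte_hash_lookup_data(h, key, &found); /* >= 0 on a hit */
}
```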

2.6 Related Works

In this section, we go through existing works related to this project. The related works are grouped into two sub-sections: traffic generators and key-value stores. In such a way, we can focus on one topic at a time, explaining how I got inspired by these works, what I decided to use—and not to use—and the reasons behind these choices.

2.6.1 Traffic Generators

Even though several traffic generators are available, only a few are based on DPDK [35]. More precisely, they are MoonGen [55], mTCP [56], Pktgen-DPDK [57], and TRex [58]. They are all open-source traffic generators developed in C/C++ and use Lua scripts—except for mTCP—for their main settings. However, none of them is copyright-free and, as already stated in Chapter 1, the development of a copyright-free benchmarking tool for KVSs was an additional part of this project.

Modifying an existing traffic generator to employ the Memcached protocol would have infringed the copyright if the hosting company had decided to pursue its development for business reasons. As a consequence, I decided to use the existing ones as “literature studies”, analysing how they handle multi-core transmitting and receiving functions. Also, by getting inspiration from the DPDK sample applications [59], I started developing the DPDK-based Memcached traffic generator. In such a way, the hosting company of this project can continue working on the application while avoiding any potential copyright infringement deriving from already licensed software.

2.6.2 Key-Value Stores

To increase the performance of Memcached-related KVSs—and KVSs in general—one of the most popular solutions is the combination of two approaches: implementing the protocol directly within the application, and moving the DB from disk to RAM [60]. For example, MICA [38] is a fast DPDK-based in-memory KVS that implements the Memcached protocol over UDP. ScyllaDB [40] is another in-memory KVS, implemented using the SeaStar framework [61], which uses a TCP/IP user-space stack to deploy high-performance server applications. But when speaking about in-memory key-value DBs, Redis [10] can be considered one of the fastest and most widely used [62]. It cannot be considered a strict KVS but, in a sense, it represents an evolution over canonical KVSs: Redis transforms values into advanced data structures. For example, a GET operation that uses a user’s ID as a key can return several data items (e.g., name, surname, address) rather than only one—which is typically the case for KVSs (see Section 2.3). Compared to Memcached, Redis also implements object persistence. The data associations (i.e., the key-value pairs) are kept in memory—and are therefore limited by the amount of RAM present in the system—but Redis can also write the entire data set to the hard drives, either at configurable intervals or each time a modification occurs. Due to its intrinsic complexity with respect to regular KVS applications, I opted not to use Redis, as—in this thesis—values can be single 64-byte entities (maximum) and not data structures—more details are given in Chapter 5.

Moving to in-memory key-value systems that focus on low-latency request processing, RAMCloud [39] deserves a mention. In fact, RAMCloud reaches a Round Trip Time (RTT) of approximately 5 µs for read operations and 15 µs for write operations [63]. It achieves these results thanks to the combination of two techniques. The first consists of bypassing the Linux kernel by taking advantage of the drivers of Intel NICs—i.e., the NIC communicates directly with the CPU. The second is the polling function: CPU cores do not go to sleep after finishing their operations; instead, they move to a busy-waiting state called “polling”. In this way, when the NIC receives a packet, the CPU can immediately process it without spending any time waking up the core—which may take 2 µs [39]. Additionally, unlike systems that employ Memcached, RAMCloud applications do not have to deal with (software) cache misses (e.g., by implementing internal cache replacement policies), since the entire DB is stored in RAM. This can also result in higher throughput, because replacement schemes can be expensive in terms of resources (i.e., time) and consequently reduce the throughput [38].

2.7 Summary

In this chapter, we went through the topics that I used—and from which I got inspiration—during the development of this project. After explaining both the software- and hardware-related content, we moved to existing works to show how others approached the problem of in-memory KVSs and the solutions they implemented. To avoid any potential copyright infringement that the hosting company might incur, I decided to develop both the traffic generator and the KVS “from scratch”, using others’ implementations as examples and sources of inspiration for the project. More precisely, I chose to implement the Memcached protocol over UDP, like MICA, and to apply the same polling technique used in RAMCloud to achieve low latency.

More details about the implemented applications are provided in Chapter 5.


Chapter 3

Methodology

This chapter reports the research methodology used. After introducing the research paradigm, I provide some information about the data collection techniques used, together with their validity and reliability. In addition, I provide a description of the hardware used to run the experiments and an explanation of how to replicate them.

3.1 Research Paradigm

To perform this work, I used a scientific paradigm, because the final goal is to provide evidence with credible and justifiable explanations (internal validity). The scientific paradigm can be considered part of positivism [64]. The approach used for this thesis differs from post-positivism for intrinsic reasons. While post-positivism affirms that scientific theories can never be proven right [65]—the principle of falsification—and that ‘every scientific statement must remain tentative for ever’ [66], this research had to be based on “not-yet-falsified” facts and theories. Moreover, I used quantitative data from potential forecasts to justify the reasons for this research. I based the knowledge on the truth of current and already tested hypotheses, which collides with post-positivist principles. This approach also aims to identify the causes that influenced and led to a particular result [67].

3.2 Experimental Set-up

To run the tests, I used two machines, the specifications of which are described in Section 3.2.1. One machine was designated to act as a traffic generator (i.e., it represents the clients of a real-world scenario), while the other machine acts as the KVS server application. In the rest of the thesis, I may refer to the machine representing the traffic generator as the ‘client side’, and to the machine behaving as the KVS as the ‘server side’.

3.2.1 Testbed

The devices used to run the experiments are two Dell servers, each having a dual-socket configuration with two Intel(R) Xeon(R) E5-2667 v3 CPUs @ 3.20 GHz equipped with 20 MB of LLC [44]. The total amount of RAM (DDR4) in each machine is 256 GB—8 Dual in-line Memory Modules (DIMMs) of 16 GB for each CPU—running at a frequency of 2133 MHz and working in quad-channel mode. The machines are also equipped with two 100 Gb/s Mellanox Technologies NICs: an MT27700 Family model ConnectX-4 [68] on the client side, and an MT27800 Family model ConnectX-5 [69] on the server side.

3.2.2 Test Environment

To perform the experiments, the traffic generators (i.e., memaslap and the custom DPDK-based generator) ran on one machine and the server applications (i.e., the KVS implementation and Memcached) on the other. To achieve maximum performance, I also ensured NUMA awareness in all the tests. Each experiment was repeated 10 times, finally reporting the averages to ensure their stability. In addition, keys and values have fixed sizes of 64 B and 16 B respectively (additional details in Section 5.2), and the Memcached protocol was encapsulated inside the UDP protocol. The current implementation does not support TCP. However, it is possible to extend the prototypes and use user-space network stacks (e.g., mTCP [56]) to support TCP, which remains future work (see Section 7.1).

3.3 Data Analysis

Latency is the main feature of interest in this project, as the focus was on optimising the server side. Throughput is also reported—for completeness—but the primary focus is to save precious nanoseconds on the KVS (i.e., during the search phase). Throughput is represented in millions of packets per second, and latency is split into two sub-categories: end-to-end RTT—i.e., the time interval between a packet just sent by the client and the response received from the server—and tail latency. More specifically, the tail latency consists of the 25th, 50th (i.e., the median), 75th, 90th, and 99th percentiles.

To extract the measurements of the KVS application, the implemented traffic generator was used, while I relied on memaslap for the measurements of the default Memcached server application. Due to the initial warm-up phase, it was not possible to use the same generator to test both Memcached and the KVS application. While memaslap creates a file containing the key-value pairs and sends it to Memcached, the implemented traffic generator sends each key-value pair one by one to the KVS (see Section 5.1). A more exhaustive comparison would require additional features (extensions) to the generator, and these have been left as future work (see Section 7.1).

Therefore, the implemented applications cannot be strictly compared with the default ones (as they are different set-ups), but we can still obtain an estimate of how DPDK could potentially improve such applications. For this purpose, I set all the parameters of memaslap and Memcached to the same values used in the traffic generator and the KVS application respectively (e.g., key and value size, communication protocol, number of packets sent and processed per core). I also avoided using features (in memaslap and Memcached) that are not implemented in the prototypes—e.g., concurrency (see Section 5.1)—and tested only the “regular” version of the KVS application. The slice-aware one is only compared with the regular version (see Section 6.2).
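As an illustration of how the reported tail-latency figures can be derived, the snippet below computes a given percentile from an array of latency samples using the simple nearest-rank method; this is a generic sketch, not the thesis' measurement code.

```c
#include <stdlib.h>
#include <stdint.h>

/* Comparator for qsort over 64-bit latency samples. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: p in (0, 100]; samples are sorted in place. */
uint64_t percentile(uint64_t *samples, size_t n, double p)
{
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    size_t rank = (size_t)((p / 100.0) * n + 0.5); /* round to nearest */
    if (rank == 0)
        rank = 1;
    if (rank > n)
        rank = n;
    return samples[rank - 1];
}
```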

3.4 Data Collection Validity and Reliability

In this thesis, the developed traffic generator was also responsible for the large majority of the data collection. Different verification tests were run to guarantee the validity of the data (more details in Chapter 5). To ensure reliability, the tests were performed multiple times—reporting the mean and tail percentiles—before finally obtaining the statistics and graphs. Additionally, the experiments were run by managing all the possible variables and keeping them unvaried (e.g., CPU clock frequency, key and value lengths). These precautions granted both high reliability and high validity to the data and the consequent results. As for the default Memcached application, I used memaslap [37] as the traffic generator, relying on the reliability of the default benchmarking tool for Memcached applications.
