
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING 300 CREDITS, SECOND CYCLE

STOCKHOLM, SWEDEN 2016

Hyper-NF: synthesizing chains of virtualized network functions

MARCEL ENGUEHARD

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Hyper-NF: synthesising chains of virtualised network functions

Marcel Enguehard

Thesis performed at KTH – School of Information and Communication Technology (ICT)

Stockholm, Sweden

Supervisor and Examiner: Dejan Kostic


Abstract

Middleboxes are essential to the functioning of today's internet. They are for instance used to secure networks, to enhance performance (e.g., throughput, scalability or end-user latency) or to monitor traffic. Although middleboxes are usually deployed as expensive dedicated hardware, the past 15 years have seen the emergence of a new paradigm: network function virtualisation (NFV).

In the NFV context, middleboxes are implemented in software on commodity hardware, thus reducing costs and increasing flexibility. Some of the most recent frameworks even boast performance equal to that of hardware middleboxes (e.g., line-rate throughput).

However, none of the frameworks that we could find are suited to implementing chains of virtualised middleboxes. Indeed, a packet must often cross numerous middleboxes when traversing the internet. For an NFV deployment to scale out and reduce the network overhead, one would wish to deploy all of these middleboxes on the same physical machine.

We provide an evaluation of a state-of-the-art NFV framework showing that the throughput of a chain of 8 middleboxes running on the same server can be as much as 5 times lower than the throughput of a single middlebox. We then introduce Hyper-NF, a new NFV framework specifically designed for implementing chains of virtualised middleboxes. Hyper-NF eliminates redundant packet and I/O operations. Given a chain of middleboxes, it uses graph search and set theory to generate a single equivalent middlebox that only performs one read and one write operation per packet.

Experimentation with middlebox deployments inspired by real-world use cases shows that Hyper-NF achieves constant throughput and latency despite an increasing number of chained middleboxes. It thus achieves considerably better performance than traditional deployments. On a chain of 8 virtualised middleboxes, Hyper-NF has 5 times higher throughput, 10 times lower latency, and uses 8.5 times fewer CPU cycles per packet.


Sammanfattning

Middleboxes are indispensable in today's internet. They are needed to secure networks, to improve performance (for example throughput, latency, or scalability), and to monitor network traffic. Middleboxes are usually built with expensive, dedicated hardware, but a new paradigm has emerged in recent years: network function virtualization (NFV). In NFV, middleboxes are realised as flexible software running on inexpensive commodity hardware. Some recent frameworks even deliver the same performance as hardware middleboxes (e.g., line-rate throughput).

These frameworks are, however, not suited to implementing chains of virtualised middleboxes. A network packet must often pass through many middleboxes as it crosses the internet. It would therefore be better to deploy several virtual middleboxes on the same machine; this would reduce the load on the network and allow an NFV deployment to scale out. We show that even with a state-of-the-art NFV framework, the throughput of a chain of 8 middleboxes running on the same computer is 5 times lower than that of a single virtual middlebox. We introduce Hyper-NF, a new NFV framework that we designed specifically for implementing chains of virtualised middleboxes. Hyper-NF eliminates redundant I/O and packet operations. Using set theory and graph search, it transforms a chain of middleboxes into a single middlebox with equivalent functionality.

The resulting middlebox uses only one read and one write operation per packet.

We tested Hyper-NF with middlebox deployments inspired by real-world use cases. Hyper-NF achieves constant performance (throughput and latency) even as the number of middleboxes in the chain increases. It is thus considerably better than traditional frameworks. On a chain of 8 virtualised middleboxes, Hyper-NF has 5 times higher throughput, 10 times lower latency, and uses 8.5 times fewer CPU cycles per packet.


Acknowledgements

I would first and foremost like to thank my academic supervisor and examiner, Professor Dejan Kostic, for offering me the opportunity to do my master's thesis on such an interesting subject. I would also like to thank him for his clairvoyant advice and support. I learned a lot during this project thanks to him.

I would also like to thank Professor Gerald Q. Maguire Jr. for his precious and always so precise feedback.

I wish to acknowledge the amazing involvement of Georgios Katsikas, who literally worked days and nights to make this project succeed.

Finally, I would like to express my gratitude towards everyone at the Network Systems Lab. You have made me feel welcome from day one and I very much enjoyed my time with you.


Contents

List of Figures
List of Tables

1 Introduction
1.1 Overview
1.1.1 Network Middleboxes
1.1.2 Network function virtualization
1.1.3 Chaining network functions
1.2 Problem Statement
1.3 Proposed solution
1.4 Goals
1.5 Methodology
1.5.1 Background study
1.5.2 Design and implementation
1.5.3 Evaluation
1.6 Limitations
1.7 Contributions
1.8 Outline

2 Background study
2.1 CPU architecture
2.1.1 Non-uniform memory access
2.1.2 The cost of context switches
2.2 Network middleboxes
2.2.1 Hardware Middleboxes and the case for NFV
2.2.2 Examples of network functions
2.3 The Click Modular Router
2.3.1 Click elements
2.3.2 Pull and push transmissions
2.3.3 The Click Language
2.4 Click enhancements
2.4.1 Multicore architectures
2.4.2 Fast packet classification
2.4.3 Batching packets
2.5 Fast packet processing frameworks
2.5.1 What goes wrong in the Linux network stack?
2.5.2 Intel DPDK
2.5.3 Netmap
2.6 Virtual switches
2.6.1 Open vSwitch
2.6.2 VALE and mSwitch
2.7 Related work

3 Design
3.1 Hyper-NF architecture
3.2 Abstract packet and NF representation
3.2.1 Packet representation
3.2.2 Network function representation
3.2.3 Traffic classes
3.3 Synthesizing a chain of network functions
3.3.1 Chain configurator
3.3.2 Chain parser
3.3.3 Traffic class builder
3.3.4 Hyper-NF generator
3.3.5 Complexity
3.4 Stateful functions management

4 Implementation
4.1 Hyper-NF packet model
4.1.1 Filters
4.1.2 Operations and BPUs
4.1.3 Traffic classes
4.2 Hyper-NF generator implementation
4.2.1 Modifications to Click
4.2.2 Hyper-NF optimisations

5 Analysis
5.1 Testbed
5.1.1 Architecture
5.1.2 Test scenarios
5.1.3 Experimental measurements
5.2 Performance evaluation on specific use cases
5.2.1 Expected results
5.2.2 Long chains of stateful middleboxes
5.2.3 Large ISP-level configurations
5.3 Discussion
5.3.1 Validity of the experiments
5.3.2 Statistical significance of the experiments
5.3.3 When is Hyper-NF suited?

6 Conclusion
6.1 Conclusions
6.2 Future Work
6.2.1 Multi-core architecture
6.2.2 Compatibility with additional network functions
6.3 Sustainable and ethical engineering

Bibliography
List of Acronyms and Protocols

List of Figures

1.1 Simple NF deployment for an enterprise network
1.2 Throughput versus the number of chained NFs on the same CPU core. Input packets are 64-byte UDP packets at 500 kpps. The NFs are connected through VALE. Note: Copyright for this figure belongs to Georgios Katsikas; it appears here with his permission.
1.3 Typical implementation of a chain of virtualized NFs
1.4 Chain of NFs that cannot be synthesized
2.1 Example of a NUMA architecture
2.2 Handling of a new outgoing connection by a NAT with public IP address 203.0.113.1
2.3 Example of a Click element that separates HTTP, TCP and UDP traffic
2.4 The two modes of packet transmission in Click
2.5 An IP router configuration in Click
2.6 Approaches for parallelizing NFs
2.7 Comparative design of netmap, Intel DPDK and the Linux network stack
2.8 Example of virtual switch usage
2.9 Flowchart of the forwarding mechanism in Open vSwitch
3.1 The Hyper-NF architecture. Note: Original figure by Georgios Katsikas, modified here with his permission.
3.2 Example of chained NAT configurations where two domains share the same IP prefix
3.3 State management in Hyper-NF, while passing through stateful Rewrite (RW) operations. Note: Original figure by Georgios Katsikas, modified here with his permission.
4.1 The SegmentNode data structure
4.2 The BPU class
4.3 The traffic class structure
4.4 Click optimisation for consecutive IPRewriter and DecIPTTL elements
5.1 Experimental setups for chains of 3 middleboxes
5.2 Throughput (in kpps) vs the number of chained NATs. Input packets are 64-byte UDP packets at 500 kpps. Note: Copyright for this figure belongs to Georgios Katsikas; it appears here with his permission.
5.3 Latency (in ns) vs the number of chained NATs. Input packets are 64-byte UDP packets at 200 kpps. Note: Copyright for this figure belongs to Georgios Katsikas; it appears here with his permission.
5.4 CPU cycles and L1 cache misses per packet vs the number of chained NATs. Input packets are 64-byte UDP packets at 500 kpps. Note: Copyright for this figure belongs to Georgios Katsikas; it appears here with his permission.
5.5 An ISP's service chain that uses three NFs
5.6 Throughput (in kpps) vs the length of the ACL, while replaying a trace of 5 million 64-byte packets at 500 kpps. Note: Copyright for this figure belongs to Georgios Katsikas; it appears here with his permission.
5.7 CPU cycles and L1 cache misses per packet vs the length of the ACL, while replaying a trace of 5 million 64-byte packets at 500 kpps. Note: Copyright for this figure belongs to Georgios Katsikas; it appears here with his permission.

List of Tables

2.1 Latency of memory accesses on Intel i7 Xeon 5500 series (as given in table 2 from [1])
2.2 Key Click elements per type of NF [2]
2.3 Median CPU cycles/packet spent by Click's IPClassifier or FastIPClassifier while exponentially increasing the number of traffic classes. The injected packets match the first or last entry of the classifiers' tables.
3.1 Complexity of the different modules of Hyper-NF. N is the number of chained NFs, B the number of BPUs/Click elements, T the number of traffic classes, and S the maximal number of segments in a field filter.
5.1 Summary of the testbed characteristics

Chapter 1

Introduction

This chapter provides an introduction to the world of Network Function Virtualization and the specific problem tackled by this thesis. In Section 1.1, we give a summary of why virtualized middleboxes are crucial in today's networks.

In Section 1.2, we introduce the difficulties of chaining network functions. We present our solution in Section 1.3 and fix the corresponding goals in Section 1.4.

We then describe our methodology in Section 1.5 and the limitations of our approach in Section 1.6. We conclude with the outline of the thesis in Section 1.8.

1.1 Overview

We first present what network middleboxes are and why they are crucial in today's networks in Section 1.1.1, then introduce network function virtualization in Section 1.1.2, before explaining the challenges raised by chaining network functions in Section 1.1.3.

1.1.1 Network Middleboxes

Network middleboxes are key devices used for managing communication networks. They are now used in many different settings: cellular networks [3], enterprise networks [4], and datacentres [5]. They enforce specific network functions (NFs) related, for instance, to security policies (e.g., firewalls, deep packet inspection tools) or to network utilisation optimisation (e.g., WAN optimizers, load-balancers). A broad list of network functions and their uses can be found in the Internet Engineering Task Force's (IETF) Request for Comments (RFC) 3234 [6]. We present a simplified example of a middlebox deployment in an enterprise network in Figure 1.1, with a proxy, load-balancer, firewall and intrusion detection system (IDS). As such, middleboxes must be able to perform complex operations on packets while maintaining high performance (such as line-rate throughput, low latency, and low failure rate). Sherry et al. conducted a survey of middlebox deployments in 57 different enterprise networks in 2012 [4].


They found that a typical enterprise network contains as many middleboxes as Internet Protocol (IP) routers. Their survey also showed that these deployments are costly: maintaining a large enterprise network (between 10,000 and 100,000 hosts) can cost more than one million dollars over five years, even without taking the cost of human labour into account.

Figure 1.1: Simple NF deployment for an enterprise network

1.1.2 Network function virtualization

The aforementioned costs have led, in recent years, to the development of a new paradigm: network function virtualization (NFV). In NFV, the network functions are implemented in software instead of hardware, and run on commodity hardware instead of specialised devices. This has been made possible by improvements in computing capacity and software design over the years. Using the software plane instead of static hardware configurations allows network engineers to create more flexible and cheaper NF platforms. With a simple configuration that exported around 60% of the network processing to the cloud, Sherry et al. showed that they could provide the same functionality while reducing the cost by as much as 12 times [4].

A successful NFV platform must meet specic criteria:

• Identical functionality: the virtualized NF must provide exactly the same functionality as its hardware equivalent (i.e., packets must be processed and modified in the same fashion)


• High performance: even with the increased performance and technical knowledge in high-speed packet processing, it is difficult for a software-based implementation to match the performance of dedicated hardware. In particular, a virtualised NF must ideally provide line-rate throughput while keeping the latency low (e.g., on the order of a millisecond).

• Ease of use: ideally, an NFV developer must design the platform so that it is easily configurable, deployable and maintainable for the network operator.

1.1.3 Chaining network functions

Regardless of the NF paradigm, packets often have to go through several network functions before leaving a specific domain. In [3], Wang et al. describe how TCP packets originating from cellular phones traverse network address translation (NAT) middleboxes, firewalls, and TCP reordering proxies. In the much simpler case of Figure 1.1, connections between the internet and the web server must traverse the firewall, IDS, and load-balancer.

As identified in RFC 7498 [7], there are several challenges associated with chaining network functions:

• Ensuring order and functionality: the operator must make sure that packets go through their designated NFs in the right order

• Stateful functions: certain NFs, described as stateful, make decisions depending on previously seen traffic. Typically, they require packets belonging to the same flow, or the same group of flows, to go through the same instance of the NF.

• Performance: the more middleboxes a flow has to go through, the more likely its throughput and latency are to be negatively influenced.

1.2 Problem Statement

Several research efforts have tried to use the flexibility provided by NFV to solve the challenges introduced in Section 1.1.3. In [8], Gember-Jacobson et al. propose a framework to cleverly move state between different instances of the same virtualized network function. They propose to spawn chains of virtualized NFs on demand to accommodate elastic traffic. However, they do not explore the performance of these chains and use other available machines to scale out their throughput. Recent work with state-of-the-art NFV frameworks has shown that performance decreases rapidly with increasing numbers of NFs deployed on the same machine. In [9] (more specifically in Figure 12), Hwang et al. report a 30% loss in throughput when chaining five consecutive virtualized NFs. In [2], Martins et al. report a loss of more than 85% of throughput when chaining nine consecutive NFs.

We performed our own measurements and found similar performance figures.

Figure 1.2 presents the throughput of a chain of NFs deployed on the same CPU core using state-of-the-art network functions (Click [10] with VALE [11] based on netmap [12]) on an Intel Xeon processor at 3.2 GHz. The input packet rate is 500,000 packets per second. We describe the experimental setup in greater detail in Section 5.1. The fitted equation shows that the throughput is approximately inversely proportional to the number of NFs. This phenomenon is partly due to the fact that the processes that host the NFs must share the computing resources, but also due to the redundancies between the different operations implemented in the NFs (see Section 5.2.2).

Figure 1.2: Throughput versus the number of chained NFs on the same CPU core. Input packets are 64-byte UDP packets at 500 kpps. The NFs are connected through VALE. The fitted curve for the Multi-Process setup is Throughput ≈ 498.5 · NFs^(-0.97) kpps, with R² = 1.00. Note: Copyright for this figure belongs to Georgios Katsikas; it appears here with his permission.

In this thesis, we tackle the scalability problem of chaining virtualized network functions on the same machine. We investigate methods to simultaneously increase throughput and reduce latency.

1.3 Proposed solution

To be able to implement chains of network functions without simply scaling up by adding machines and CPU cores, we must address the different issues that hamper their performance. As we show in Chapter 2, the main problems are:

• Context switching: if the NFs are implemented in dierent processes, they compete for CPU time. Switching between processes on a core is expensive as it requires the core to clear its cache and reload from memory the necessary data for the new process. On our Intel Xeon CPU, the minimal cost of a context switch is estimated to be 2µs [13].


• Redundant operations: certain fields of the packet header will be modified several times along the chain. For instance, in Figure 1.1, a packet going from one of the clients in the internal network to one of the servers will have its IP checksum recomputed both at the proxy and at the load-balancer.

• Late drops: certain packets that will be dropped along the chain might still be processed by previous network functions. For instance, in Figure 1.1, if the firewall is configured to drop all outgoing web traffic to www.example.com (a web server outside of the enterprise network), packets from the client to www.example.com will still be processed by the proxy, thus wasting computation time and memory.

• I/O interactions: a common way to implement chains of network functions is to have an underlying virtual switch (such as Open vSwitch [14]) forward packets between network functions. Figure 1.3 presents such a configuration. Forwarding packets through that virtual switch is an additional source of overhead that causes the performance to degrade with increasing numbers of chained NFs on the same computer.

Figure 1.3: Typical implementation of a chain of virtualized NFs

With that in mind, we propose a new NFV framework called Hyper-NF. Hyper-NF is based on a simple idea: simplifying the processing as much as possible in order to remove all the aforementioned sources of overhead. If possible, each field of the packet header should be read and rewritten only once. We thus base our approach on the following ideas:

• Unique process: a chain of middleboxes should run as a unique process, thus removing any form of context switching and the need for virtual switches.


• Optimized I/O: to remove the overhead of going through the Linux Kernel network stack, the framework should use a state-of-the-art I/O mechanism such as netmap [12] or DPDK [15]. We discuss these details further in Section 2.5.

• Single-read and single-write: the chain should not have redundant read or write accesses to the packet.

• Early drops: packets that are dropped by the chain should be dropped as early as possible to avoid unnecessary processing.

To achieve these goals, Hyper-NF realizes a high-level synthesis of the chain of network functions. It first creates an abstract representation of the NF chain and models its behaviour for every possible input packet. Hyper-NF is thus able to derive which packets are dropped and which operations are redundant.

Then, thanks to this abstract model, it is able to create a synthesized NF that only performs two operations: one read access to the header fields to determine how the packet should be handled, and one write operation that synthesizes the actions of the original chain on the packet.
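As a toy illustration of the single-write idea (this is our own minimal sketch, not the actual Hyper-NF algorithm, which also synthesizes the read and drop decisions), the example below models each NF's action on a packet as a set of header-field assignments; composing the chain then collapses into a single map whose final values are written once.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// One NF's write operation, modelled as "header field -> new value".
// (Real header operations can be relative, e.g. decrementing the TTL;
// they are modelled here as absolute assignments for simplicity.)
using WriteOp = std::map<std::string, uint32_t>;

// Collapse a chain of write operations into a single one:
// for each field, only the last assignment in the chain matters.
WriteOp synthesize(const std::vector<WriteOp>& chain) {
    WriteOp combined;
    for (const auto& op : chain)
        for (const auto& [field, value] : op)
            combined[field] = value;  // last writer wins
    return combined;
}

int main() {
    // Hypothetical chain: two elements touch the TTL, one rewrites the source IP.
    std::vector<WriteOp> chain = {
        {{"ip_ttl", 63}},
        {{"ip_src", 0xC0A80001u}},  // 192.168.0.1
        {{"ip_ttl", 62}},
    };
    for (const auto& [field, value] : synthesize(chain))
        std::cout << field << " <- " << value << "\n";  // one write per field
}
```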

Hyper-NF is based on the Click Modular Router [10] (we describe Click in Section 2.3) and uses netmap as an I/O mechanism (as described in Section 2.5.3).

1.4 Goals

The goals of this master's thesis project are to:

• Investigate the state-of-the-art in NFV with a focus on NF chains

• Understand how NUMA architectures influence the performance of NFV chains

• Create a new NFV-framework that maintains high performance when handling long NF chains

• Implement and deploy the aforementioned framework

• Evaluate the performance of the approach in terms of throughput, latency and CPU consumption

1.5 Methodology

In this thesis, we intend to investigate whether synthesising operations on a chain of virtualized NFs solves the scalability problem described in Section 1.2.

We mainly use Hyper-NF as a proof-of-concept to show how NF synthesis can be realised and how it impacts the performance of chains of NFs deployed on the same machine.

To complete this project, we used a three-step approach. We first performed a background study (3 weeks) of the relevant research areas. We then designed and implemented Hyper-NF (8 weeks), an NFV framework that synthesises chains of NFs. Finally, we conducted a thorough evaluation (4 weeks) of our prototype.

1.5.1 Background study

During this phase, we studied the state-of-the-art in several research areas related to our project. This was necessary to understand all the underlying issues that we would have to solve to propose a novel solution. We identified Click as the most widely used NFV framework, and we studied its enhancements. We analysed how the NUMA architecture is used by Click and by the Linux kernel to understand its consequences for chains of NFs. Finally, we looked at novel packet processing frameworks to increase the performance of our virtualized NFs.

1.5.2 Design and implementation

During this phase, we used the knowledge gathered through our background study to design and implement Hyper-NF. We created the different tools that allowed us to first input a chain of network functions and their topology configuration, and then produce a fully functioning synthesized NF with identical functionality.

1.5.3 Evaluation

We then evaluated the performance of Hyper-NF. We measured throughput, latency and CPU consumption in different scenarios, including access control lists (ACLs) obtained from Internet service providers (ISPs). Finally, we found some issues in the design of Hyper-NF that severely hampered its performance, and we corrected them.

1.6 Limitations

The first limitation of this project is that Hyper-NF is not compatible with every chain of NFs. In particular, it is not always possible to synthesize a chain of NFs with only one read and one write operation. Let us for instance consider the chain in Figure 1.4. After going through the network address translator (NAT), an outgoing packet will have its source port rewritten to a randomly chosen value between 2000 and 3000. Since the write access of the load-balancer depends on the parity of the source port number, and thus on the value written by the NAT, we cannot synthesize this chain with a one-read-one-write NF. However, this particular problem can be solved by switching to another load-balancing policy (such as round-robin) supported by Hyper-NF. We did not encounter a situation where we could not find a suitable alternative.

Figure 1.4: Chain of NFs that cannot be synthesized: an SNAT (src_ip = 1.1.1.1, port_range = 2000-3000) followed by a load-balancer (if src_port%2==1, dst_ip=ip1; if src_port%2==0, dst_ip=ip2)

Furthermore, Hyper-NF does not scale on multicore architectures. As of now, it synthesises a chain of NFs as a single process to be run on a single core. As we discuss in Section 2.1, exploiting multicore architectures is crucial for developing scalable network functions. We discuss several ways of doing so in Section 6.2.1.

Finally, Hyper-NF is only applicable when the network operator has full control over, and understanding of, the NF chain's architecture. There are cases where this operator might simply default to already packaged or even proprietary NFV implementations, which Hyper-NF cannot synthesize. Such a situation can for instance arise if the operator wants to rely on a third party's technical support, and must thus use a commercial NFV framework.

1.7 Contributions

This work was conducted at the Network Systems Laboratory of KTH Royal Institute of Technology in Stockholm, Sweden. It was a joint project with Georgios Katsikas, under the supervision of Professor Dejan Kostic and Professor Gerald Q. Maguire.

The theoretical concepts behind Hyper-NF can be divided into two parts:

• the graph composition of a set of NFs into an abstract graph of basic processing units

• the synthesis of this abstract graph into a single-read-single-write equivalent NF.

Georgios and I contributed equally to the conception of the aforementioned concepts, and we were both equally involved in all design decisions. The use cases and scenarios presented in Section 5.2 were also jointly designed by Georgios and me.

The different steps of the synthesis described in Section 3.1 can also be divided according to the two aforementioned parts. The first part contains the first two steps (Chain Configurator and Chain Parser), while the second part is composed of steps 3 and 4 (Traffic Class Builder, Synthesizer). The last step (Generator) is purely technical.

Regarding the implementation of Hyper-NF, our respective contributions were:


• I personally implemented the Traffic Class Builder and the Synthesizer, as well as the module that handles stateful functions (see Section 3.4). I created the structures used to implement the abstract model (see Section 4.1) and wrote the algorithm that performs the theoretical synthesis (see Section 3.3.3). I also contributed to the code of the Chain Parser: I created the function that transforms Click elements into Basic Processing Units (see Section 3.2). Finally, I implemented the two compressions presented in Section 4.2.2.

• Georgios implemented both the Chain Configurator and most of the Chain Parser. He thus modified Click to integrate the Click libraries into Hyper-NF, and he created the modules to parse the input topology (Chain Configurator). He also implemented the graph composition and wrote the code necessary to find end-to-end paths in the NFs' graph (the Chain Parser). He also modified Click to implement the optimisations described in Section 4.2.1 that were essential to the practical realisation of the synthesis.

• The last step of Section 3.1 (Hyper-NF Generator) was jointly implemented by both of us.

In the actual evaluation, our respective contributions were:

• I ran Hyper-NF to create the synthesised Click configurations used in both the Hyper-NF and the Hyper-NF(Opt) scenarios (see Section 5.1.2).

• Georgios created the testbed to deploy, run, and monitor all the tested chain types. He wrote the Click configurations for the Single- and Multi-Process scenarios, which I used as input to run Hyper-NF. He also created the scripts to generate graphs from the gathered data. He ran the experiments and plotted the graphs presented in Section 5.2; they appear here with his authorisation.

Finally, I created the abstract representation described in Section 3.2. The text and ideas presented in this thesis are my own.

1.8 Outline

In Chapter 2, we investigate the state-of-the-art in NFV implementation. Then, in Chapter 3, we present our design decisions for Hyper-NF. We briefly describe its implementation in Chapter 4 before analysing its performance in Chapter 5.

In Chapter 6, we draw conclusions from our work and investigate potential future research.


Chapter 2

Background study

This chapter contains the necessary background for the reader to understand the concepts used in this thesis. We start in Section 2.1 by describing one of the most popular contemporary CPU architectures and why it should be taken into account when designing NFV frameworks. We present the Click Modular Router in Section 2.3 and how it can be enhanced in Section 2.4. We discuss state-of-the-art packet processing technologies in Section 2.5 and several virtual switches in Section 2.6. Finally, we briefly summarise the related work in Section 2.7.

2.1 CPU architecture

In this section, we introduce the basic knowledge of recent multicore architectures necessary to understand the challenges they raise for implementing NFV. We first describe non-uniform memory access architectures in Section 2.1.1 and then explain what context switches are and why they matter in Section 2.1.2.

2.1.1 Non-uniform memory access

Non-uniform memory access (NUMA) refers to a specific type of multicore CPU architecture designed to address the decrease in performance caused by multiple cores trying to access the same zone of shared memory [16].

In NUMA architectures, CPUs are assigned designated zones of memory to reduce such conflicts. More specifically, cores are grouped into sockets. Each socket has a dedicated zone of memory (usually dynamic random-access memory (DRAM)) and then several layers of cache that are accessible by its cores.

The access time to each layer of memory is then relative to its distance to the core.

As an example, Figure 2.1 shows the architecture of our CPU (Intel Xeon E5-2667). In this case, there are three layers of cache in each socket. The L3 cache is shared by all cores on the same socket and has 20 MB of space. The L2 cache is shared by two cores per socket and has 2048 kB of space. Finally, each core has 512 kB of dedicated L1 cache. Table 2.1 lists the corresponding latencies for accessing the different layers of memory of a similar processor. As we can see, it costs up to 19 times more cycles to access the local L3 cache than the L1 cache, but the L1 cache is 39 times smaller than the L3, thus underlining the importance of NUMA-aware software engineering.

Figure 2.1: Example of a NUMA architecture (two sockets connected by an interconnect; each socket groups four cores with per-core L1 caches, L2 caches shared by pairs of cores, a socket-wide L3 cache, and local DRAM)

Table 2.1: Latency of memory accesses on Intel i7 Xeon 5500 series (as given in table 2 from [1])

Data source and approximate cost:
L1 cache hit: ≈4 cycles
L2 cache hit: ≈10 cycles
L3 cache hit, unshared line: ≈40 cycles
L3 cache hit, shared line in another core: ≈65 cycles
L3 cache hit, modified line in another core: ≈75 cycles
Remote L3 cache: ≈100-300 cycles
Local DRAM: ≈60 ns
Remote DRAM: ≈100 ns

2.1.2 The cost of context switches

On top of the cores competing for memory access, another factor that must be taken into account is that several processes compete for processing time on the cores. Thus, the operating system's kernel contains a piece of code called a scheduler that regulates the time given to each process on a core. It implements a certain policy that forces transitions (context switches) between the different processes competing for the CPU. A process can also manually interrupt itself, either by running to completion or by yielding the CPU with the sched_yield() system call. There are two main sources of overhead during a context switch:

• Scheduler code execution: when the kernel decides to change the current process, it must run an algorithm to select the next one according to the current scheduling policy. For instance, the current Linux default scheduling policy can select the next task in constant time [17].

• Flushing the cache: A new process is often going to need new sets of data from the memory. Thus the current cached data will be rendered useless and the new process will need to fetch everything from the memory.

In [13], Li et al. showed that reloading the entire L2 cache could have a significant impact on the cost of a context switch, bringing it up to 1 ms in their setup. More generally, they showed that the latency of a context switch grows monotonically with the amount of data that must be reloaded.

The measurements from Li et al. show that the overhead of the kernel scheduler's execution is negligible compared to the cost of cache misses when switching processes.
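As a rough way to reproduce this kind of measurement, the sketch below (our own minimal example on a POSIX system, not the benchmark used by Li et al.) bounces a byte between two processes over a pair of pipes. If both processes are pinned to the same core (e.g., with taskset), each round trip forces the scheduler to switch between them, so the average round-trip time gives an upper bound on the direct cost of a switch; the indirect cache-related cost depends on the working set, as discussed above.

```cpp
#include <chrono>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

// Pin both processes to the same core (e.g. with taskset) so that every
// round trip forces the scheduler to switch between them.
int main() {
    int ping[2], pong[2];
    if (pipe(ping) != 0 || pipe(pong) != 0) return 1;
    const long iters = 100000;
    char b = 0;

    if (fork() == 0) {                       // child: echo every byte back
        for (long i = 0; i < iters; i++) {
            if (read(ping[0], &b, 1) != 1) return 1;
            if (write(pong[1], &b, 1) != 1) return 1;
        }
        return 0;
    }

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; i++) {       // parent: send, wait for the echo
        write(ping[1], &b, 1);
        read(pong[0], &b, 1);
    }
    auto stop = std::chrono::steady_clock::now();
    wait(nullptr);

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    // Each iteration needs at least two switches (parent to child and back).
    std::printf("~%.0f ns per switch (upper bound)\n", ns / (2.0 * iters));
}
```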

2.2 Network middleboxes

Middleboxes are devices used to perform operations on network traffic. They are used to retrieve information and statistics from the network, modify packets, or drop unwanted traffic. The services that they provide are called network functions or service functions. In this section, we present hardware middleboxes and their disadvantages compared to NFV. We then introduce examples of NFs that we used during this project.

2.2.1 Hardware Middleboxes and the case for NFV

Hardware middleboxes are network devices built to perform complex packet processing at line rate. They are built using dedicated hardware and are widely used in production networks. A survey conducted by Sherry and Ratnasamy in 2012 on 57 different enterprise networks showed that the number of middleboxes was often equal to or larger than the number of routers [18].

Hardware middleboxes have numerous disadvantages: their cost, their complexity, and the limited flexibility that they offer.


Cost

A single middlebox can cost between thousands and hundreds of thousands of dollars [18]. On top of that, the human cost of managing a network of middleboxes is substantial. In the survey conducted by Sherry and Ratnasamy, most networks containing more than 10 middleboxes required between 2 and 5 full-time employees. In certain cases, that number went up to several hundred employees dedicated to managing the network. In the NFV paradigm, NFs run on commodity hardware, which reduces the cost of the hardware by several orders of magnitude [19].

Complexity

Large-scale networks often contain middleboxes from different vendors. Indeed, when buying a middlebox, enterprises often cite the price as their first criterion, ahead of ease of management [18]. Thus, instead of buying their full middlebox deployment from a single vendor, they will pick the cheapest vendor available for each device. Each of these vendors has a different configuration interface, for instance a web interface or a scripting language. Then, for a given vendor, the network operator also needs to learn the specific interface for each and every NF. This diversity in configuration systems creates an additional overhead in terms of human costs: the network operators have to be taught to manage every different type of middlebox on the network. The complexity of managing many different middleboxes also causes network failures: in [18], more than 54% of the surveyed network operators cite misconfiguration as the most common cause of middlebox failures. Since NFV largely reduces the price of the hardware, it allows both NFV manufacturers and clients to focus on making the interface as easy to use as possible. As a new, purely software-based platform, NFV also offers the opportunity to standardise a certain amount of the process, for instance through the IETF.

Flexibility

The monolithic implementation of NFs in hardware middleboxes renders the device inflexible. Because of the specialised hardware, a given device can only perform the NF it was originally designed for. This is especially inefficient when confronted with elastic traffic demands. For instance, in [19] (Figure 1), the authors present the utilisation rate of four different middleboxes. This rate varies between 10% and 100% and behaves differently from one middlebox to another. To create scalable and fault-tolerant deployments, network operators are thus forced to over-provision the network. Furthermore, the possibility of upgrading middleboxes to offer new features is limited by the flexibility of the dedicated hardware. All of these drawbacks are mitigated by NFV. Indeed, since the NF runs in software, an NFV operator can dynamically adapt the processing power given to each and every NF. Furthermore, once new features are implemented in NFV, it is easy to deploy them on a production network at low cost by simply upgrading the software.


2.2.2 Examples of network functions

In this section, we introduce several NFs that we mention in this report. More specifically, we describe the role of a firewall, an IDS, a load-balancer, and a NAT.

Firewall

A firewall is one of the most common NFs on the internet [18]. It is used to filter network packets according to specific criteria. A firewall enforces access control lists (ACLs), that is to say, lists of rules that describe packets that should either be dropped or forwarded. We can distinguish two types of firewalls: network firewalls and application firewalls.

A network firewall only separates packets according to their headers (e.g., in a standard network, their Ethernet, IPv4/IPv6, and TCP/UDP headers). An example rule for a network firewall would be: "drop all incoming IPv4/TCP packets from host 192.168.3.56 to port 80".

An application firewall understands the payload of the packets and can enforce rules depending on it. An example rule for an application firewall (in this case an HTTP-aware firewall) would be: "drop all HTTP POST requests to host 192.168.5.32" (see RFC 2616 [20] for more information on the HTTP protocol).

Intrusion Detection System

An IDS is a tool used to detect intrusions into a private network. It typically works by monitoring the network traffic: it listens to the packets on the network and runs an algorithm to detect intrusion patterns. An intrusion can for instance be detected through port-scanning patterns [21]. Different methods have been used in the literature to create the intrusion detection algorithm: finite-state machines [22], graph theory [23], or neural networks [24]. An IDS is thus mostly a passive device that triggers alerts for the operator when intrusion patterns are detected.

Load-balancer

A load-balancer is used to distribute traffic between different servers providing the same service, thus allowing said service to scale out. For instance, it can divide incoming web traffic between several different servers hosting the same page. A load-balancer typically sits on the path between clients and servers.

When a client tries to connect to a service, the load-balancer redirects the connection (for instance by changing the destination IP address) to one of the available servers. The server is chosen according to a policy defined by the network operator. The choice can for instance be made through a round-robin between the servers, according to a probability distribution, or depending on the IP address of the client.
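The sketch below (a generic standalone illustration, not tied to any particular load-balancer implementation) shows two of the selection policies mentioned above: round-robin and a choice keyed on the client's IP address.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct Backend { std::string ip; };

class LoadBalancer {
public:
    explicit LoadBalancer(std::vector<Backend> servers)
        : servers_(std::move(servers)) {}

    // Round-robin: spread new connections evenly, regardless of the client.
    const Backend& pick_round_robin() {
        return servers_[next_++ % servers_.size()];
    }

    // Client-IP hashing: the same client always lands on the same server,
    // which keeps any per-client state on a single backend.
    const Backend& pick_by_client(uint32_t client_ip) const {
        return servers_[std::hash<uint32_t>{}(client_ip) % servers_.size()];
    }

private:
    std::vector<Backend> servers_;
    std::size_t next_ = 0;
};
```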


Network Address Translation

A NAT allows a network operator to host several million clients behind a single IP address. The use of NATs has become essential with the exhaustion of the public IPv4 address space [25]. A NAT is essentially a non-symmetric function that separates a private zone (typically with private, non-routable IP addresses [26]) from the public internet. Figure 2.2 represents how a NAT (with public IP address 203.0.113.1) handles a new outgoing connection. It works by mapping outgoing connections (i.e., the 5-tuple (source IP address, destination IP address, transport protocol, source port, destination port)) to a new source port [27]. The source IP address is replaced by the public IP address of the NAT, and the mapping between the allocated source port and the connection is stored in a table. Thus, if the NAT receives ingress packets on the aforementioned source port, it will know the actual destination of the connection.

Figure 2.2: Handling of a new outgoing connection by a NAT with public IP address 203.0.113.1. A new connection from 10.0.0.3:4242 to 130.237.28.40:80 is assigned the unallocated port 5555, the mapping 10.0.0.3:4242 → 5555 is registered in the NAT table, and the packet is translated so that the connection becomes 203.0.113.1:5555 to 130.237.28.40:80.
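To make the mapping concrete, here is a minimal standalone sketch of the table lookup and port allocation described above (an illustration of the principle, not the Click IPRewriter implementation; port reuse and timeouts are ignored).

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <tuple>
#include <utility>

// 5-tuple identifying an outgoing connection seen on the private side.
using FiveTuple = std::tuple<uint32_t /*src ip*/, uint16_t /*src port*/,
                             uint32_t /*dst ip*/, uint16_t /*dst port*/,
                             uint8_t  /*protocol*/>;

class Nat {
public:
    explicit Nat(uint32_t public_ip) : public_ip_(public_ip) {}

    // Returns the public (ip, port) to use as the new source of the packet.
    std::pair<uint32_t, uint16_t> translate_outgoing(const FiveTuple& conn) {
        auto it = out_.find(conn);
        if (it == out_.end()) {                  // new connection: allocate a port
            uint16_t port = next_port_++;
            it = out_.emplace(conn, port).first;
            in_[port] = conn;                    // remember it for return traffic
        }
        return {public_ip_, it->second};
    }

    // Reverse lookup for packets arriving on an allocated public port.
    std::optional<FiveTuple> translate_incoming(uint16_t public_port) const {
        auto it = in_.find(public_port);
        if (it == in_.end()) return std::nullopt;
        return it->second;
    }

private:
    uint32_t public_ip_;
    uint16_t next_port_ = 2000;                  // naive allocator, no reuse
    std::map<FiveTuple, uint16_t> out_;
    std::map<uint16_t, FiveTuple> in_;
};
```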

2.3 The Click Modular Router

Introduced in 1999 by Kohler et al. [10], the Click Modular Router (which we will simply call "Click" for brevity) is the de facto standard of the scientific community for developing NFV. Numerous works on improving the performance of network functions have been based on Click [19, 28, 29]. Click is a purely software-based router that runs either as a Linux kernel module or as a user-space process. Packet processing units are developed as C++ objects called elements, which the user then connects to one another through virtual ports to achieve a full router configuration.

2.3.1 Click elements

An element is the most basic packet processing unit. It is characterized by its class (what type of operation it performs), its configuration string (which parameters it should use), and its input and output ports. We show an example of a Click IPClassifier element in Figure 2.3 that redirects HTTP traffic, the rest of the TCP traffic, and UDP traffic to three different output ports.

As of today, Click has elements that can perform a large number of classic network functions: network address (and port) translation, firewalling, load-balancing, IPsec encryption, congestion control, WAN optimization, etc. We give a list of key elements per type of NF in Table 2.2. A complete list of the available elements and their descriptions can be found on the Click website1.

Figure 2.3: Example of a Click element that separates HTTP, TCP and UDP traffic: IPClassifier(tcp port 80, tcp, udp)
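As an illustration of what such an element looks like in code, below is a sketch of a trivial element that counts and forwards packets. It follows the element API used by upstream Click (class_name(), port_count(), simple_action()) and requires the Click source tree to compile, but it is only a sketch: exact signatures and required boilerplate can differ between Click versions.

```cpp
// countpackets.hh -- sketch of a minimal Click element (requires the Click tree).
#include <click/config.h>
#include <click/element.hh>
CLICK_DECLS

class CountPackets : public Element {
public:
    CountPackets() : _count(0) {}

    const char *class_name() const { return "CountPackets"; }
    const char *port_count() const { return PORTS_1_1; }   // one input, one output

    // Called once per packet travelling through the element.
    Packet *simple_action(Packet *p) {
        _count++;
        return p;                                           // forward unchanged
    }

private:
    unsigned long long _count;
};

CLICK_ENDDECLS
EXPORT_ELEMENT(CountPackets)
```

Such an element can then be declared and connected in a configuration file exactly like the built-in elements, using the Click language described in Section 2.3.3.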

Table 2.2: Key Click elements per type of NF [2]

NF Click elements

Load balancer RatedSplitter, HashSwitch, RoundRobinIPMapper

Firewall IPFilter

NAT IPRewriter

DPI Classifier, IPClassifier

Traffic shaper BandwidthShaper, DelayShaper

Tunnel IPEncap, IPsecESPEncap

Multicast IPMulticastEtherEncap, IGMP
Monitoring PPPControlProtocol, GREEncap
DDoS prevention IPRateMonitor, TCPCollector

IDS IPFilter

IPS Classifier, IPFilter

Congestion control RED, SetECN

IPv6/IPv4 proxy ProtocolTranslator46

2.3.2 Pull and push transmissions

To form a fully working NF configuration, the operator must then link elements together according to the desired behaviour. To transmit packets between elements, Click offers two kinds of links: push and pull connections. In a pull connection, the transmitting element buffers the processed packet and waits for a query from the receiving element before transmitting it (see Figure 2.4a). In a push connection, the transmitting element sends packets to the receiving element as soon as they have been processed (see Figure 2.4b). The operator must ensure that only ports of the same type are connected (e.g., push output ports are connected only to push input ports).

1http://read.cs.ucla.edu/click/elements


Pull transitions are typically suited when Click does not control the timing of the next operation. For instance, to send a packet through a network interface, it must wait for the interface to be ready to transmit. On the contrary, push transitions are used when Click does not control the timing of the previous operations. For instance, when the Click instance receives a packet from an interface it must start to process it.

Figure 2.4: The two modes of packet transmission in Click: (a) pull connection, where the receiving element calls pull() and the transmitting element dequeues and returns a packet when it is ready; (b) push connection, where the transmitting element processes a packet and then calls push(p) on the receiving element.

The notion of push and pull ports is essential to understanding how packets are processed in Click. Typically, push and pull transitions regulate the scheduling in Click. A packet going through a sequence of push transitions will continue being processed, while a pull transition is used to schedule a new packet to process.

2.3.3 The Click Language

This division of the processing into elements makes Click extremely modular, providing the network operator with base operations that can be combined to create a customised monolithic NF. For instance, Figure 2.5 is a graphical representation of an IP router configuration in Click. To declare such a configuration, Click has its own language. It is based on two main actions: declarations and connections.

A declaration instantiates a Click element and provides its configuration. It has the following syntax:

name :: ElementClass(option1,option2,...);

where name is an alias for the element chosen by the user.

A connection declares a link between an output port and an input port of two different elements. It uses the following syntax:

element1[output_port_number] -> [input_port_number]element2;

The declarations and connections are input by the user in a configuration file, typically with the .click extension. The Click semantic analyser checks the configuration file for loops and for incorrect port mappings before loading it.


Figure 2.5: An IP router configuration in Click. Packets enter through FromDevice(eth0) and FromDevice(eth1), pass through Classifier, Strip, CheckIPHeader, GetIPAddress, and RadixIPLookup, and then follow a per-output path through DropBroadcasts, IPGWOptions, FixIPSrc, DecIPTTL, IPFragmenter, EtherEncap, and a Queue before leaving through ToDevice(eth0) or ToDevice(eth1).


2.4 Click enhancements

The research community has produced a lot of work surrounding Click. We describe how Click can be optimised for multicore architectures in Section 2.4.1. We then explain how packet classification can be sped up in Section 2.4.2. Finally, we investigate the influence of batching in Section 2.4.3.

2.4.1 Multicore architectures

There are two general approaches to parallelising packet processing in an NF [30]:

Splitting (called cloning in [30]): incoming traffic is split between cores, and for a given packet the complete processing then happens on the same core (see Figure 2.6a).

Pipelining: each core is in charge of a certain number of processing tasks. Incoming packets then flow between multiple cores depending on the tasks that they have to go through. We represent this approach in Figure 2.6b.

Figure 2.6: Approaches for parallelizing NFs: (a) splitting approach, in which a splitting unit divides incoming traffic between CPUs that each run the full sequence of tasks; (b) pipelining approach, in which each CPU runs a subset of the tasks and packets move from CPU to CPU.

Chen and Morris argued in 2001 that any benefits gained from the pipelining approach are outweighed by the overhead of moving data between CPU caches [31]. In [32], Egi et al. pushed the analysis even further by using hardware multiqueueing to classify the incoming packets and ensure that their entire processing, including socket access on the NIC, is done on the same CPU. Multiqueueing can for instance be realized on Intel NICs through Receive Side Scaling (RSS) [33], a technology that allows developers to create multiple network queues depending on header fields. RSS has an API accessible within the DPDK framework (see Section 2.5.2). Although Dobrescu et al. also describe multiqueueing as "essential" [34], measurements by Barbette et al. indicate that it depends on the I/O mechanism's capacity to handle multiple queues [29].

Following this insight, several Click-based NF frameworks switched the Click configuration to a full-push path [28, 35], thus pinning the full processing path of a given packet to a unique CPU. Pull elements benefit from multi-core architectures by allowing multiple threads to pull data from the same elements as they go. However, this means that packets must be transmitted from one core's cache to another, thus creating a large overhead, as described in Section 2.1.1.

2.4.2 Fast packet classification

A key pair of elements in Click is the IPClassifier and the IPFilter. Both are based on the same fundamental mechanisms and are used to classify packets through their network and transport header values. They rely on classifying patterns similar to tcpdump [36], that is, logical operations on header values such as (ip_src==10.0.0.8) && (tcp_port==80). While IPFilter is used to enforce ACLs and to drop packets or allow them to pass through, IPClassifier sends traffic matching each classifying pattern to a different output port. Their functioning is hierarchical: the packet follows the behaviour given by the first matching pattern (in the order in which the patterns are given by the user).

Increasing the number of classifying patterns also increases the average time taken by the element to process a packet. In order to speed up the classifying process, Click also contains a tool called the fast-classifier. The fast-classifier hardcodes the classification given by the user in C++, thus avoiding the overhead of fetching from memory the values against which the packet must be matched (see Section 2.1.1). In Table 2.3, we compare the number of CPU cycles required by the standard IPClassifier and by the fast-classifier, depending on the number of patterns and on which pattern the packet matches (either the first one or the last one). The input packets are 64-byte IP/UDP packets at a rate of 500,000 packets per second. The results show that the fast-classifier is between 3 and 4 times quicker than the standard IPClassifier.

Table 2.3: Median CPU cycles/packet spent by Click's IPClassifier or FastIPClassifier while exponentially increasing the number of traffic classes. The injected packets match the first or last entry of the classifiers' tables.

#TrafficClasses  IPClassifier(First)  IPClassifier(Last)  FastIPClassifier(First)  FastIPClassifier(Last)

4 156 305 99 125

8 153 482 101 170

16 151 880 102 258

32 154 1544 103 423

64 157 2791 102 845

128 153 5092 96 1455

256 152 9394 108 2800

512 154 17210 114 5255

1024 157 23885 111 10332
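To illustrate why hardcoding helps, the sketch below (a generic standalone example, not the code Click's fast-classifier actually generates) contrasts a table-driven classifier, which walks a rule list stored in memory, with an equivalent hardcoded classifier whose constants live directly in the instruction stream.

```cpp
#include <cstdint>
#include <vector>

struct Headers { uint32_t ip_src; uint16_t tcp_port; uint8_t proto; };

// Table-driven: rules are data, fetched from memory for every packet.
struct Rule { uint32_t ip_src; uint16_t tcp_port; int output; };

int classify_table(const Headers& h, const std::vector<Rule>& rules) {
    for (const auto& r : rules)
        if (h.ip_src == r.ip_src && h.tcp_port == r.tcp_port)
            return r.output;
    return -1;  // no match
}

// Hardcoded: the same two rules compiled into branches and immediates,
// in the spirit of what the fast-classifier generates from the user's patterns.
int classify_hardcoded(const Headers& h) {
    if (h.ip_src == 0x0A000008 && h.tcp_port == 80) return 0;  // 10.0.0.8:80
    if (h.ip_src == 0x0A000009 && h.tcp_port == 22) return 1;  // 10.0.0.9:22
    return -1;
}
```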


2.4.3 Batching packets

Two different types of batching can be used: I/O batching and computation batching. These are described below.

Batching I/O calls

I/O batching is a technique where the NF retrieves packets in batches. This mitigates the overhead imposed by the system calls necessary to retrieve packets from the network interface cards (NICs). I/O batching's impact is described in [28, 29, 35, 34] and it is now included in the master Click implementation.

Batching computations

Computation batching is a technique where small units of computation are performed on several packets in a row, instead of having a linear chain of computation for each packet. In Click, this translates into packets being transferred in batches between elements. Each element then applies its action to all the packets in the batch. This has several benefits. First, it reduces the number of virtual function calls necessary to transmit packets between Click elements. Second, it is more efficient in terms of CPU cache utilisation in NUMA architectures: by repeating the same operation, batching keeps the corresponding CPU instructions in the instruction cache longer, thus reducing the overhead induced by frequent cache misses. Combined with I/O batching, this becomes a powerful tool, as shown by the results presented by Kim et al. [35] and Barbette et al. [29].
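The sketch below (a generic illustration rather than Click's actual batching code) shows the difference: in the per-packet style the whole chain is traversed once per packet, whereas in the batched style each element sweeps the whole batch before handing it to the next element, keeping its instructions hot in the cache.

```cpp
#include <vector>

struct Packet { unsigned ttl; };

struct Stage {                                   // one processing element
    virtual void process(Packet& p) = 0;
    virtual void process_batch(std::vector<Packet>& batch) {
        for (auto& p : batch) process(p);        // sweep the whole batch at once
    }
    virtual ~Stage() = default;
};

struct DecTtl : Stage {
    void process(Packet& p) override { if (p.ttl > 0) p.ttl--; }
};

// Per-packet: the chain is traversed once per packet.
void run_per_packet(const std::vector<Stage*>& chain, std::vector<Packet>& pkts) {
    for (auto& p : pkts)
        for (auto* s : chain) s->process(p);
}

// Batched: the chain is traversed once per batch, so each element's code
// stays in the instruction cache while it works through all the packets.
void run_batched(const std::vector<Stage*>& chain, std::vector<Packet>& pkts) {
    for (auto* s : chain) s->process_batch(pkts);
}
```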

2.5 Fast packet processing frameworks

Several new packet processing frameworks have been put forward by the community in the past few years [12, 15, 37, 38]. They were all motivated by the fact that the Linux network stack cannot handle line-rate packet processing [12]. A comprehensive study of their respective performance can be found in [29]. We first explain the problems related to the Linux stack, then present two popular frameworks: netmap and Intel DPDK.

2.5.1 What goes wrong in the Linux network stack?

There are three principal sources of overhead in the Linux network stack:

• Too many function calls: simple network system calls require more than a dozen function calls in the Linux kernel. In [12], Rizzo shows for instance that a UDP sendto() call requires 14 function calls.

• Memory-to-memory copies: the Linux network stack also copies the packet's data (or part of it) several times between different structures. As we showed in Section 2.1.1, accessing memory is costly in terms of CPU cycles.

• Interrupts: in the current Linux kernel, the arrival of packets is handled through interrupts: whenever a packet arrives, the network interface driver sends a signal to the kernel with the physical address where the packet can be found. In [10], Kohler et al. estimated that the accumulated overhead of the interrupt generated by the arrival of a frame was 5µs on a 700 MHz Pentium III PC. On our CPU, this corresponds to approximately 1.1µs.

Even without the overhead caused by interrupts (for a transmission function such as sendto()), the total overhead of the Linux network stack can add up to values of the order of a microsecond [12].

Numerous novel I/O frameworks have focused on eliminating this overhead by bypassing the network stack and providing user processes with zero-copy access to the packets on the card [12, 38, 15].

2.5.2 Intel DPDK

The Intel® Data Plane Development Kit (DPDK) [15] is a packet processing framework developed by Intel, 6WIND and several other companies. It is distributed as free software and the source code is available at http://dpdk.org/download.

Several research efforts have integrated DPDK with Click [9, 28, 29] and it was integrated into the main Click repository2 on October 14, 2015.

DPDK provides an interface to directly access Intel network cards' buffers from user processes, thus entirely bypassing the kernel. The device's reception and transmission queues are directly mapped to buffers (called mbufs) in user space through memory-mapped I/O. The application can then write or read frames directly to/from the NIC using direct memory access (DMA) without the frame contents being copied at any point. DPDK also contains functions to read or write frames in bursts, thus reducing the number of function calls necessary.

When using DPDK on a specific hardware interface, NIC access is entirely disconnected from the Linux network stack and handled in user space by a library called a Poll-Mode Driver. Poll-mode drivers have two benefits:

• They remove the overhead caused by kernel interrupts by having the application poll the interface directly for incoming frames.

• They interact with other Intel technologies such as RSS or Flow Director to allow hardware packet classification or multiqueueing.

DPDK thus allows NFV developers to entirely bypass the kernel and access packets directly on the network card queues. However, such a degree of control over the network card means that the application developer has to implement the logic usually handled by the NIC driver.

2https://github.com/kohler/click/


2.5.3 Netmap

Netmap, introduced by Rizzo [12], is another zero-copy packet processing framework. It is also distributed as free software and is available at https://github.com/luigirizzo/netmap. One of the first presented use cases for netmap was an implementation of Click's I/O elements and netmap I/O access has been integrated into the main Click distribution since February 2012.

Similarly to DPDK, netmap works by abstracting the device queues as data structures called rings in user space. These netmap rings contain pointers to packet buffers that are shared between the network interface and the user processes. The application learns about the rings by accessing an interface descriptor called netmap_if that contains the number of rings associated with each interface and the pointers to these rings.

Although it also provides zero-copy access to NIC hardware queues, netmap is different from DPDK in that it is integrated into the machine's kernel (implementations exist for Linux and FreeBSD kernels). Instead of entirely bypassing the kernel, netmap overrides the NIC driver and provides two different kernel APIs to access the network queues:

• A synchronous API using the ioctl() system call. In the case of a receiving ioctl() call, the application receives from the card the number of buffered packets that can be read. In the case of a transmitting ioctl() call, the application informs the NIC of the number of buffered packets that are ready for transmission. Using ioctl() removes the overhead of interrupts and allows fetching packets in burst mode.

• An asynchronous API using traditional blocking select() and poll() system calls. Using the asynchronous API ensures that only one system call is needed per packet transmission.

Hence, netmap provides zero-copy communications between user space processes and network interfaces but does not remove the overhead of context switching between these processes and the kernel code executed by the netmap system calls.
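To make the ring-based access pattern concrete, the sketch below shows a minimal receive loop written against the user-space helpers of net/netmap_user.h (nm_open(), NETMAP_RXRING(), NETMAP_BUF()). The interface name is an arbitrary example, and field and macro names may differ slightly between netmap versions.

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
    /* Put eth0 in netmap mode; the NIC rings are mapped into user space and
     * described by the netmap_if structure reachable through d->nifp. */
    struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
    if (d == NULL)
        return 1;

    struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
    for (;;) {
        /* One blocking system call synchronises a whole batch of frames. */
        poll(&pfd, 1, -1);

        for (unsigned r = d->first_rx_ring; r <= d->last_rx_ring; r++) {
            struct netmap_ring *ring = NETMAP_RXRING(d->nifp, r);
            while (!nm_ring_empty(ring)) {
                struct netmap_slot *slot = &ring->slot[ring->cur];
                char *frame = NETMAP_BUF(ring, slot->buf_idx);
                printf("received %u-byte frame at %p\n",
                       (unsigned)slot->len, (void *)frame);
                /* Return the slot to the kernel and move to the next one. */
                ring->head = ring->cur = nm_ring_next(ring, ring->cur);
            }
        }
    }
    nm_close(d);
    return 0;
}

Transmission works symmetrically on the TX rings, with either ioctl(NIOCTXSYNC) or poll() flushing the queued frames.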

Figure 2.7 illustrates how the different frameworks access data on the network interfaces.

2.6 Virtual switches

One of the main problems when implementing chains of network functions is making sure that each packet is forwarded through the correct functions in the right order [7]. Indeed, depending on the type of traffic, flows might or might not need to traverse a specific network function (e.g., outgoing HTTP traffic goes through a web proxy while outgoing VoIP traffic does not). This is also true when virtualising these chains. A common way to implement chains of middleboxes is to use a virtual switch to route packets between the different middleboxes [8, 39]. The switches then typically support complex switching policies such as OpenFlow [40] in order to forward packets to the correct middleboxes.


Figure 2.7: Comparative design of netmap, Intel DPDK and the Linux network stack

In the example of Figure 2.8, the virtual switch must for instance be able to separate traffic based on the TCP destination port, since incoming HTTP traffic must follow a different NF chain and go through a load-balancer before reaching the web servers. The middleboxes are then deployed in specific containers such as virtual machines or Linux containers.

Figure 2.8: Example of virtual switch usage — a virtual switch connecting a firewall, an IDS, a DNAT and a load-balancer, steering incoming TCP traffic to port 80 (towards the web servers) separately from other incoming TCP traffic (towards the private zone)

We describe two examples of such virtual switches: (i) Open vSwitch and (ii) VALE and its successor mSwitch.

2.6.1 Open vSwitch

Introduced in 2009 by Pfaff et al. [41], Open vSwitch (OVS) is a widely used virtual switch. Numerous features have been added to OVS over time: software-defined networking (SDN) with OpenFlow, traffic analysis with NetFlow, traffic tunnelling with IPsec or VXLAN, etc. OVS runs by default as a kernel module associated with a user space daemon called ovs-vswitchd, but it can also be launched as a purely user space process.

Figure 2.9 presents the mechanism used by OVS to make forwarding decisions. The user space daemon is responsible for making the forwarding decision for each flow, depending on the current forwarding configuration (e.g., it can ask the OpenFlow controller for directives). The resulting decision is transmitted to the kernel module, which stores it in its cache. All subsequent packets belonging to this flow can thus be forwarded without going through the user space daemon.


Figure 2.9: Flowchart of the forwarding mechanism in Open vSwitch

By default in OVS, the kernel cache table performs an exact match on all the fields of the frame's different headers (Ethernet, internet and transport headers). Thus, the table can be implemented as a simple and efficient hash table, albeit at the price of a quickly growing table when confronted with a large number of short-lived connections. To reduce the size of this table, megaflows were introduced in the OVS implementation. Megaflows are cache entries that support generic matching, i.e., each entry matches several different connections.
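The following sketch is a simplified illustration of the two matching schemes; it is not Open vSwitch code, and the key layout and the example "HTTP megaflow" are assumptions made for the example. An exact-match entry must equal the packet's whole key, whereas a megaflow entry carries a mask and compares only the masked bits, so a single entry can cover, for instance, every TCP connection to port 80.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flow_key {                 /* simplified 5-tuple */
    uint32_t ip_src, ip_dst;
    uint16_t tp_src, tp_dst;
    uint8_t  proto;
};

/* Exact match: every field must be identical, so each connection needs its
 * own cache entry (cheap hash-table lookup, but a potentially huge table). */
static bool exact_match(const struct flow_key *a, const struct flow_key *b)
{
    return a->ip_src == b->ip_src && a->ip_dst == b->ip_dst &&
           a->tp_src == b->tp_src && a->tp_dst == b->tp_dst &&
           a->proto == b->proto;
}

/* Megaflow-style match: only the bits selected by the mask are compared,
 * so one entry can match many different connections. */
struct megaflow { struct flow_key key, mask; };

static bool megaflow_match(const struct megaflow *m, const struct flow_key *p)
{
    return (p->ip_src & m->mask.ip_src) == (m->key.ip_src & m->mask.ip_src) &&
           (p->ip_dst & m->mask.ip_dst) == (m->key.ip_dst & m->mask.ip_dst) &&
           (p->tp_src & m->mask.tp_src) == (m->key.tp_src & m->mask.tp_src) &&
           (p->tp_dst & m->mask.tp_dst) == (m->key.tp_dst & m->mask.tp_dst) &&
           (p->proto  & m->mask.proto)  == (m->key.proto  & m->mask.proto);
}

int main(void)
{
    /* "Any TCP flow to port 80": only the protocol and destination port are
     * part of the mask, all other fields are wildcarded. */
    struct megaflow http = {
        .key  = { .tp_dst = 80, .proto = 6 },
        .mask = { .tp_dst = 0xffff, .proto = 0xff },
    };
    struct flow_key pkt = { .ip_src = 0x0a000001, .ip_dst = 0x0a000002,
                            .tp_src = 51000, .tp_dst = 80, .proto = 6 };

    printf("exact match: %d, megaflow match: %d\n",
           exact_match(&http.key, &pkt), megaflow_match(&http, &pkt));
    return 0;
}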

Even with the introduction of megaflows, OVS suffers from performance issues that are inherent to the Linux kernel (see Section 2.5.1). It leaves most of the network operations in the kernel and user processes must use standard system calls to retrieve the packets. DPDK has thus been integrated as an experimental feature in OVS to allow zero-copy switching in user space with minimal overhead. Using DPDK allows OVS to process packets directly in the network card's buffers and to send them to applications without replicating the Linux network stack.

2.6.2 VALE and mSwitch

VALE is a virtual switch based on netmap, introduced by Rizzo and Lettieri in 2012 [11]. It runs entirely as a Linux kernel module and transfers packets using the netmap API.

VALE is a simple layer-2 learning switch (a minimal sketch of this behaviour follows the list below):

1. it learns mappings between Ethernet addresses and its ports by observing incoming packets

(43)

2. if it receives a packet for whose destination Ethernet address it has a mapping, it forwards the packet to the corresponding port

3. if it receives a packet for whose destination Ethernet address it does not have any mapping, it broadcasts it to all its ports
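The sketch below gives a conceptual view of this behaviour; it is not VALE's kernel code, and the table size, the sentinel value and the hash function are arbitrary assumptions.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 1024
#define PORT_BCAST 0xff           /* sentinel: broadcast to all ports */

struct mac_entry { uint8_t mac[6]; uint8_t port; };
static struct mac_entry table[TABLE_SIZE];

static unsigned mac_hash(const uint8_t mac[6])
{
    unsigned h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 31 + mac[i];
    return h % TABLE_SIZE;
}

/* Returns the output port for the frame, or PORT_BCAST on an unknown
 * destination. The first 12 bytes of an Ethernet frame hold the destination
 * and source addresses, in that order. */
static uint8_t switch_frame(const uint8_t *frame, uint8_t in_port)
{
    const uint8_t *dst = frame;
    const uint8_t *src = frame + 6;

    /* Step 1: learn which port the source address was seen on. */
    struct mac_entry *e = &table[mac_hash(src)];
    memcpy(e->mac, src, 6);
    e->port = in_port;

    /* Steps 2 and 3: forward on a known destination, otherwise broadcast. */
    struct mac_entry *d = &table[mac_hash(dst)];
    if (memcmp(d->mac, dst, 6) == 0)
        return d->port;
    return PORT_BCAST;
}

int main(void)
{
    uint8_t a_to_b[60] = { 2,2,2,2,2,2, 1,1,1,1,1,1 };  /* dst B, src A */
    uint8_t b_to_a[60] = { 1,1,1,1,1,1, 2,2,2,2,2,2 };  /* dst A, src B */

    printf("A->B goes to port %u\n", (unsigned)switch_frame(a_to_b, 0)); /* broadcast */
    printf("B->A goes to port %u\n", (unsigned)switch_frame(b_to_a, 1)); /* learned: port 0 */
    return 0;
}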

Although it uses netmap rings to read and forward packets, VALE is not a zero-copy switch. This is mainly because VALE enforces memory protection and does not trust its clients not to interfere with each other. Indeed, if VALE were entirely zero-copy, its clients would share access to the same netmap rings, and a client could then overwrite a packet that is supposed to have already been forwarded. Instead, VALE uses an optimised copy function: by storing packets in a compact data structure and preloading them into the CPU cache, it can copy 60-byte packets in less than 60 nanoseconds [11]. Although VALE was mainly created to connect to processes hosted on virtual machines or containers through virtual interfaces, it can also connect to hardware interfaces or the host network stack through netmap bridges.
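The optimised copy mentioned above can be sketched roughly as follows; this is not VALE's actual routine, and the 64-byte cache-line size and the GCC/Clang builtin __builtin_prefetch are assumptions about the target platform.

#include <stdint.h>
#include <string.h>

/* Copy len bytes of a frame, prefetching the source into the cache first so
 * that the subsequent line-by-line copy mostly hits the CPU cache. */
static void pkt_copy(const void *src, void *dst, unsigned len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;

    for (unsigned off = 0; off < len; off += 64)
        __builtin_prefetch(s + off);          /* warm one cache line */

    for (unsigned off = 0; off < len; off += 64) {
        unsigned chunk = (len - off < 64) ? len - off : 64;
        memcpy(d + off, s + off, chunk);      /* copy line by line */
    }
}

int main(void)
{
    uint8_t frame[60], out[60];
    memset(frame, 0xab, sizeof(frame));
    pkt_copy(frame, out, sizeof(frame));      /* a 60-byte minimum-size frame */
    return out[59] == 0xab ? 0 : 1;
}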

Compared to OVS, VALE is lacking in features. Its inability to apply forwarding policies based on anything but the destination Ethernet address makes it ill-suited for complex NFV deployments. Honda et al. therefore introduced mSwitch in 2015 [42], a VALE extension that supports modular forwarding policies. It allows the user to switch between different forwarding behaviours by associating every mSwitch instance with a specific kernel module. Implemented use cases include a layer-2 learning switch, a firewall, and an Open vSwitch datapath.

2.7 Related work

The NFV area has been explored extensively in the past few years. As we mentioned in Section 2.4, many efforts took Click [10] as a starting point and explored various research directions. RouteBricks [34] explored how to scale software routers across multi-core and multi-server architectures. PacketShader [38] and NBA [28] show how cheap and powerful graphics cards can be used to further increase throughput. ClickOS [2] proposes a minimal operating system (OS), based on Xen's MiniOS, to implement a Click-based NF using netmap. FastClick [29] explores several research directions mentioned in the aforementioned works by providing a Click implementation that supports DPDK, netmap and the PacketShader I/O engine, as well as computation batching, and by exploring the benefits of hardware multiqueueing. Finally, NetVM [9] uses DPDK to provide a fast platform for virtualising Click instances. Although these studies all improve the overall throughput of a single Click instance, they often do not explore scenarios where NFs are chained on the same machine. In the few studies where such an experiment is performed, large throughput degradation was reported (30% for a chain of five NFs with NetVM, more than 85% for a chain of 9 NFs with ClickOS).

Chaining virtualised NFs has nonetheless been the subject of numerous research efforts. OpenNF [8] is a framework that allows network operators to scale up deployments of chained NFs by moving state between different servers and redirecting flows thanks to an SDN dataplane. CoMB [19] is closer to our work, as it consolidates packet processing by providing dedicated modules for tasks that are common to several NFs (e.g., session reconstruction or protocol parsing). DPIaaS [43] focuses on consolidating the costly operations associated with deep packet inspection. While some of these works consolidate parts of the packet processing, they do not go as far as Hyper-NF and do not mitigate the problems caused by context switches.

Another approach to the problem of chaining NFs has been to look at how to steer packets between them, as defined in RFC 7498 [7]. Simple [44] uses a combination of header tagging (such as VXLAN [45]) and algorithmic payload recognition to steer packets over a network of middleboxes. Slick [46] is similar to Hyper-NF in its approach to user configuration: Slick provides its own programming language in which the user can specify middleboxes as well as declare how each flow should be processed. It then installs NFs on the available machines in the network in a distributed and intelligent way, and installs forwarding rules on the network switches to steer the traffic correctly.

References
