
Linux Kernel Packet Transmission Performance in High-speed Networks


DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Linux Kernel Packet Transmission Performance in High-speed Networks

CLÉMENT BERTIER

Kungliga Tekniska högskolan

Master thesis

Linux Kernel packet transmission performance in high-speed networks

Clément Bertier

August 27, 2016


Abstract

The Linux kernel protocol stack is getting more and more additions as time goes by. As new technologies arise, more functions are implemented, which may result in a certain amount of bloat. However, new methods have been added to the kernel to circumvent common throughput issues and to maximize overall performance under certain circumstances. To assess the ability of the kernel to produce packets at a given rate, we will use the pktgen tool.

Pktgen is a loadable kernel module dedicated to traffic generation based on UDP. Its philosophy is to sit low in the kernel protocol stack to minimize the amount of overhead caused by the usual APIs. As measurements are usually done in packets per second instead of bandwidth, the UDP protocol makes perfect sense to minimize the time spent creating a packet. It has several options which will be investigated, and for further insight its transmission algorithm will be analysed.

But software is not just a compiled piece of code: it is a set of instructions run on top of hardware. And this hardware may or may not comply with the design of one's software, making the execution slower than expected or, in extreme cases, not functional at all.

This thesis aims to investigate the maximum capabilities of Linux packet transmission in high-speed networks, e.g. 10 Gigabit or 40 Gigabit. To go deeper into the understanding of kernel behaviour during transmission we will use profiling tools, such as perf and the newly adopted eBPF framework.

Abstract

The Linux kernel protocol stack receives more and more additions as time goes by. As new technologies arise, more functions are implemented, which can lead to a certain amount of bloat. However, new methods have been added to the kernel to circumvent common throughput problems and to maximize overall performance under certain circumstances. To determine the ability of the kernel to produce packets at a given rate, we will use the pktgen tool.

Pktgen is a loadable kernel module dedicated to traffic generation based on UDP. Its philosophy is to sit at a low position in the kernel protocol stack in order to minimize the amount of overhead caused by the usual APIs. Since measurements are usually made in packets per second rather than bandwidth, the UDP protocol makes sense to minimize the time spent creating a packet. It has several options that will be investigated, and for further insight its transmission algorithm will be analysed.

But software is not just a compiled piece of code; it is a set of instructions run on top of hardware. And this hardware may or may not suit the design of the software, making execution slower than expected or, in extreme cases, not even functional.

This thesis aims to investigate the maximum capabilities of Linux packet transmission in high-speed networks, e.g. 10 Gigabit or 40 Gigabit. To go deeper into the understanding of kernel behaviour during transmission, we will use profiling tools such as perf and the recently adopted eBPF framework.

Contents

1 Introduction
1.1 Problem
1.2 Methodology
1.3 Goal
1.4 Sustainability and ethics
1.5 Delimitation
1.6 Outline

2 Background
2.1 Computer hardware architecture
2.1.1 CPU
2.1.2 SMP
2.1.3 NUMA
2.1.4 DMA
2.1.5 Ethernet
2.1.6 PCIe
2.1.7 Networking terminology
2.2 Linux
2.2.1 OS Architecture design
2.2.2 /proc pseudo-filesystem
2.2.3 Socket Buffers
2.2.4 xmit_more API
2.2.5 NIC drivers
2.2.6 Queuing in the networking stack
2.3 Related work – Traffic generators
2.3.1 iPerf
2.3.2 KUTE
2.3.3 PF_RING
2.3.4 Netmap
2.3.5 DPDK
2.3.6 Moongen
2.3.7 Hardware solutions
2.4 Pktgen
2.4.1 pktgen flags
2.4.2 Commands
2.4.3 Transmission algorithm
2.4.4 Performance checklist
2.5 Related work – Profiling
2.5.1 perf

3 Methodology
3.1 Data yielding
3.2 Data evaluation
3.3 Linear statistical correlation

4 Experimental setup
4.1 Speed advertisement
4.2 Hardware used
4.2.1 Machine A – KTH
4.2.2 Machine B – KTH
4.2.3 Machine C – Ericsson
4.2.4 Machine D – Ericsson
4.3 Choice of Linux distribution
4.4 Creating a virtual development environment
4.5 Empirical testing of settings
4.6 Creation of an interface for pktgen
4.7 Enhancing the system for pktgen
4.8 pktgen parameters clone conflict

5 eBPF Programs with BCC
5.1 Introduction
5.2 kprobes
5.3 Estimation of driver transmission function execution time

6 Results
6.1 Settings tuning
6.1.1 Influence of kernel version
6.1.2 Optimal pktgen settings
6.1.3 Influence of ring size
6.2 Evidence of faulty hardware
6.3 Study of the packet size scalability
6.3.1 Problem detection
6.3.2 Profiling with perf
6.3.3 Driver latency estimation with eBPF

7 Conclusion
7.1 Future work

A Bifrost install
A.1 How to create a bifrost distribution
A.2 Compile and install a kernel for bifrost

B Scripts

C Block diagrams

List of Figures

2.1 Caches location in a 2-core CPU
2.2 Theoretical limits of the link according to packet size on a 10G link
2.3 Theoretical limits of the link according to packet size on a 40G link
2.4 Tux, the mascot of Linux
2.5 Overview of the kernel [4]
2.6 How pointers are mapped to retrieve data within the socket buffer [18]
2.7 Example of a shell command to interact with pktgen
2.8 pktgen transmission algorithm
2.9 Example of call-graph generated by perf record -g foo [38]
2.10 Assembly code required to filter packets on eth0 with tcp ports 22
3.1 Representation of the methodology algorithm used
3.2 Pearson product-moment correlation coefficient formula
4.1 Simplification of block diagram of the S7002 motherboard configuration [46, p. 19]
4.2 Simplification of block diagram of the ProLiant DL380 Gen9 motherboard configuration
4.3 Simplification of block diagram of the S2600IP [47] motherboard configuration
4.4 Simplification of block diagram of the S2600CWR [48] motherboard configuration
4.5 Output using the --help parameter on the pktgen script
6.1 Benchmarking of different kernel versions under bifrost (Machine A)
6.2 Performance of pktgen on different machines according to burst variance
6.3 Influence of ring size and burst value on the throughput
6.4 Machine C parameter variation to amount of cores
6.5 Machine C bandwidth test with MTU packets
6.6 Throughput to packet size, in millions of packets per second
6.7 Throughput to packet size, in Mbps
6.8 Superposition of the amount of cache misses and the throughput "sawtooth" behaviour
C.1 Block diagram of motherboard Tyan S7002
C.2 Block diagram of the motherboard S2600IP
C.3 Block diagram of the motherboard S2600CW
C.4 Patch proposed to fix the burst anomalous cloning behaviour

List of Tables

2.1 PCIe speeds
2.2 Flags available in pktgen
6.1 Comparison of throughput with eBPF program


Chapter 1

Introduction

Throughout the evolution of network interface cards towards high speeds such as 10, 40 or even 100 Gigabit per second, the number of packets to handle on a single interface has increased drastically. While enhancing the NIC is the first step for a system to handle more traffic, there is an inherent consequence: the remainder of the system must be capable of handling the same amount of traffic. We are in an era where the bottleneck of the system is shifting towards the CPU [1], due to an increasingly bloated protocol stack.

To ensure the capabilities of the operating system to produce or receive a given amount of data, we need to assess them with the help of network testing tools. There are two main categories of network testing tools: software and hardware based. Hardware network testing tools are usually seen as accurate, reliable and powerful in terms of throughput [2], but expensive nonetheless. While software-based testing might in fact be less trustworthy than hardware-based testing, it has a tremendous advantage in malleability. Modifying the behaviour of the software (e.g. for a protocol update) is easily done; for hardware it is not only complex but also likely to increase the price of the product [3], and usually impossible for the consumer to tamper with, as such products are commonly proprietary. Neither solution is strictly better than the other: they are different approaches to the same problem, and testing a system from both perspectives is recommended when possible. However, in this document we will focus solely on software testing, as we did not have specialised hardware.

The Linux operating system will be used to conduct this research, as it is fully open source and recent additions aiming at high performance have been developed for it. It is based on a monolithic-kernel design, meaning the OS can be seen as split into two parts: kernel-space and user-space [4]. The kernel-space is a contiguous chunk of memory in which everything related to the hardware is handled, as well as core system functions, for instance process scheduling. The user-space is where regular user programs are executed; they have much more freedom as they ignore the underlying architecture and access it through system calls: secure interfaces to the kernel.

The issue in this model for a software-based network tool is the trade-off regarding the level at which the software is located: a user-space network testing program is likely to be slowed down by the numerous system calls it must perform, and has no control over the path a packet is going to take through the stack. A kernel-space network testing program will be faster but much more complex to design, as the rules within the kernel are paramount to its stability: since it is a single address space, any part of the kernel can call any other function located in the kernel. This paradigm can have disastrous effects on the system if not manipulated cautiously.

As we require high performance to achieve line rate, we will therefore use a kernel-space network testing tool: pktgen [5]. It is a purely packet-oriented traffic generator which does not mimic actual protocol behaviour, located at the lowest level of the stack to allow minimum overhead. Its design allows it to fully assess the maximum amount of throughput, as smaller-sized packets should never yield a higher throughput, given the same parameters. A notable advantage of pktgen is that the module is found within the official kernel: it does not require any additional installation and can be found in all common distributions.

Having a low-level traffic generator is not enough to tell whether the system is correctly optimized, since it does not always reveal the bottleneck of the system. To go deeper into the performance we must get a profile: an overview of the current state of the system. In order to perform such investigations we will use perf_events [6], a framework to monitor the performance of the kernel by watching well-known bottleneck functions or hardware events likely to reveal problems, and outputting a human-readable summary.

To complement the profiling done by perf, we will use the extended Berkeley Packet Filter, aka eBPF [7]. It is an in-kernel VM which is supposedly secure (e.g. it cannot crash the kernel and programs must be finite) thanks to strict verification of the code before it is executed. It can be hooked onto functions and will be used to monitor certain points of interest by executing small programs inside the kernel and reporting their results to user-space.

1.1 Problem

While the speed of network interface cards increases, Linux's protocol stack also keeps gaining additions: for instance to implement new protocols or to enhance already-existing features. More packets to treat, as well as more instructions per packet, intrinsically end up in heavier CPU loads. However, some countermeasures have been introduced to mitigate the performance loss due to outdated designs in certain parts of the kernel: for instance NAPI [8], which reduces the flow of interrupts by switching to a polling mode when overwhelmed, or the recently added xmit_more API [9], which allows bulking of packets so that usual per-packet actions are deferred to groups of packets.

Considering all the recent improvements, can the vanilla kernel scale up to performances high enough to saturate 100G links?

We will assess the kernel's performance at the lowest possible level to avoid as much overhead as possible, therefore allowing maximum packet throughput, hence the use of pktgen. It is important to understand that pktgen's results will not reflect any kind of realistic behaviour: its purpose is to determine the performance of a system by doing aggressive packet transmission, and the absence of overhead is key to its functionality. It has to be seen as a tool to reveal underlying problems rather than a measure of regular protocol stack overhead. In other words, it is the first step in verifying a system's transmission abilities and should be seen as an upper bound on real-life transmission: if its results are below the maximum NIC speed, then actual transmission scenarios cannot exceed them either. The follow-up question is: can pktgen's performance scale up to 100G-link saturation?

Ideally the performance indicated by pktgen should be double-checked, meaning a second method should testify to the accuracy of pktgen's reported performance. Hence we will use eBPF to bind a program onto the NIC driver's transmission function in order to measure the throughput. Can eBPF correctly quantify the number of outgoing packets, knowing each call potentially lasts on the order of nanoseconds? If so, do the measured performances match pktgen's results?

We hypothesize that, with the technologies recently added to the kernel, we will be able to reach line rate at 100G with minimum-sized packets using the pktgen module, given proper hardware.

1.2 Methodology

We will use an empirical approach throughout this thesis. The harvesting method will consist in running pktgen while modifying certain parameters to assess the impact of each parameter on the final result. This will be done by iterating over the parameters with a stepping big enough to finish within a reasonable amount of time but small enough to pinpoint any important change in the results. The value of the stepping will require prior tuning. Each experiment, in order to assert its validity, has to be run several times and on different machines with similar software configurations. To make the results human-readable and concise they will be processed into relevant figures, comparing the parameters which were adjusted with their related pktgen experiment result.

To realize the performance assessment the following experiment will be realized in order:

• Draw a simple baseline with straightforward parameters.

• Verify whether the kernel version is improving or downgrading the performances, and select the best-suited one for the rest of the experiments.

• Assess the performance of the packet bulking technique through the xmit_more API option of pktgen, and verify whether it improves the packet throughput.

• Tamper with the size of the NIC’s buffer as an attempt to increase the performance of packet bulking.

• Find the optimal performance parameter of pktgen.

We will also monitor certain metrics through profiling frameworks, which are not guaranteed to be directly correlated with the experiment. To test the linear correlation of two variables (i.e. an experiment result and a monitored metric) we will use a complementary statistical analysis based on the Pearson product-moment correlation coefficient, computed as sketched below.
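As a reference for the reader, the following minimal C sketch (not part of any tool used in this thesis; the function name pearson is purely illustrative) shows how the coefficient is computed from two series of measurements:

#include <math.h>
#include <stddef.h>

/* Pearson product-moment correlation coefficient of two series x and y of
 * length n: r = cov(x, y) / (stddev(x) * stddev(y)), a value in [-1, 1].
 * Illustrative sketch only. */
double pearson(const double *x, const double *y, size_t n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    /* The terms below are n times the covariance and the variances, so the
     * common factor n cancels out in the ratio. */
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}

A result close to 1 or −1 indicates a strong linear relationship between the monitored metric and the experiment result, while a value close to 0 indicates none.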

1.3 Goal

As the performance reached by NICs is now extremely high, we need to know whether the systems they are supposed to be used with are capable of handling the load without any extra products or libraries. The purpose is to understand whether or not 100Gb/s NICs are in fact of any use with vanilla Linux kernels. The goal is therefore to provide a study of the current performance of the kernel.

1.4 Sustainability and ethics

As depicted above, the goal is to assess the ability of the kernel to output a maximum amount of packets.

In other words, with perfect hardware and software tuning the system should be able to reach a certain rate of packets per second, sent over a wire. While this does not address the environmental aspect directly (e.g. power-saving capabilities are disabled in favour of performance), assessing the global tuning of the system logically helps to understand whether a system is using more resources than it should, hence also indirectly assessing its power consumption. If an issue reducing the global throughput is revealed, it could imply that machines running the same configuration also have to put in extra computing power to counteract the loss, raising ecological issues on a larger scale.

1.5 Delimitation

To limit the length of the thesis and impose boundaries, we will focus solely on the transmission side. Simple examples of the use of eBPF will be provided, to avoid having to go into too much detail inside that framework. Regarding the bulking of packets, we will exclusively look into (packet) throughput performance, while in reality such an addition might introduce some latency and could therefore create dysfunctions in latency-sensitive applications.

The kernel should not be modified with extra libraries specialized in packet processing.

1.6 Outline

• Chapter 3 will explain the methodology used behind the experiments.

• Chapter 4 will summarize the experimental setup including:

– Detailed hardware description.

– Research behind the performance optimization.

– Practical description of how the experiments were realized and how their results were exported into exploitable data.

– How a prototype of an interface for pktgen was realized to standardize the results.

• Chapter 5 will be a brief introduction to BCC programming, presenting the structure to create programs with the framework.

• Chapter 6 will present the most probative results from the experiments as graphical data, together with their associated analysis.

• Chapter 7 will conclude and wrap up the results.


Chapter 2

Background

This section is dedicated to providing the required knowledge for the reader to fully understand the results at the end of the thesis. Going into deep detail of the system was necessary to interpret the results, and hence a great part of this thesis was dedicated to understanding various software and hardware techniques and technologies. To do so we will follow a path divided into several sections:

• Firstly we will introduce technical terms related to hardware, as those factors will be investigated to give a deeper overview of the system. This will be done by examining different bottlenecks, like the speed of a PCIe bus or the maximum theoretical throughput on an Ethernet wire.

• Secondly we will dig into the inner workings of the Linux operating system, mainly to understand the global architecture of the system but also to provide insights into how the structures and different sections interact to transmit a packet over the wire. This will include interaction with the hardware, hence a brief study of the drivers.

• Then we will do a thorough literature review of the related work accomplished on software traffic generation, to compare their perks and drawbacks.

• Then a thorough study of the pktgen module will be realised, from its internal workings to the parameters most influential on throughput performance.

• Last but not least there will be a brief introduction to profiling, which consists of tracing the system to assess its choke-points by analysing the amount of time spent executing functions.

We will also explain eBPF, an extended version of the Berkeley Packet Filter originally created for simple packet analysis, which is now a fully functional in-kernel virtual machine and may be used to investigate parts of the kernel by binding small programs to certain functions.


2.1 Computer hardware architecture

As we are going to introduce numerous terms closely related to the hardware of the machine, this section clarifies most of them for the reader.

2.1.1 CPU

A CPU, or central processing unit, is the heart of the system as it executes all the instructions stored in the memory.

CPU caches are a part of the CPU that stores data which will supposedly be needed again by the CPU. An entry in the cache table is called a cache line. When the CPU needs to access data, it first checks the cache, which is implemented directly inside the CPU. If the needed data is found, it is a hit, otherwise a miss. In case of a miss, the CPU must fetch the needed data from the main memory, making the whole process slower.

In principle, the size of the cache needs to be small, for two reasons: the first one being that it is implemented directly in the CPU, making the lack of space an issue, and the second that the bigger the cache, the longer the lookup, therefore introducing latency inside the CPU.

Multi-level caches are a way to counteract the trade-off between cache size and lookup time. There are different levels of caches, which are all "on-chip", meaning on the CPU itself.

• The first level cache, abbreviated L1 cache, is small, fast, and the first one to be checked. Note that in real-life scenarios this cache is actually divided in two: one that stores instructions and one that stores data.

• The second level cache, abbreviated L2 cache, is bigger than the L1 cache, with about 8 to 10 times more storage space.

• The third and last level cache, abbreviated L3 cache, is much larger than the L2 cache; its size varies greatly with the price of the CPU. This cache is not implemented in all brands of CPUs, but the ones used for this thesis do have one (cf. Hardware used, Section 4.2).

Moreover L3 caches have the particularity of being shared between all the cores, which leads us to the notion of Symmetric Multiprocessing.

Figure 2.1: Caches location in a 2-core CPU.


Please note that the Figure 2.1 is a simplification of the actual architecture.

2.1.2 SMP

Symmetric Multiprocessing involves two or more processing units in a single system which run the same operating system and share a common memory and I/O devices, e.g. hard drives or network interface cards. The notion of SMP applies both to completely separate CPUs and to CPUs that have several cores.

The obvious aim of having such an architecture is benefiting from the parallelism of programs to maximize the speed of the overall tasks to be executed by the OS.

Hyperthreading is Intel's proprietary version of SMT (Simultaneous Multi-Threading), another technique to improve parallel thread execution, which adds logical cores to the physical ones.

2.1.3 NUMA

Non-Uniform Memory Access is a design in SMP architectures in which each CPU has a dedicated space in memory that can be accessed much faster than the rest due to its proximity.

This is done by segmenting the memory and assigning a specific part of it to each CPU. CPUs are joined by a dedicated bus (called the QPI, for Quick Path Interconnect, on modern systems). The memory segment assigned to a specific CPU is called the local memory of that CPU. If a CPU needs to access another part of the memory than its own, that part is designated as remote memory, since the request must go through a network of bus connections in order to reach the requested data.

This technique aims to mitigate the issue of memory access in an SMP architecture, as a single bus shared by all the CPUs is a latency bottleneck in modern system architectures [10]. A NUMA system is sub-divided into NUMA nodes, which represent the combination of a local memory and its dedicated CPU. With the help of the command lscpu one can view all the NUMA nodes present on a system; it also reports the relative cost of accessing one memory node from another.
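As an illustration of what "local" allocation means in practice, the short user-space sketch below relies on the libnuma library (an assumption of this example; it is not otherwise used in the thesis) to place a buffer on the local memory of node 0:

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: allocate a 1 MiB buffer on the local memory of NUMA
 * node 0 so that CPUs of that node access it without crossing the QPI.
 * Build with: gcc -o numa_local numa_local.c -lnuma */
int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, 0);   /* memory taken from node 0 */
    if (!buf) {
        perror("numa_alloc_onnode");
        return EXIT_FAILURE;
    }

    printf("nodes available: 0..%d\n", numa_max_node());
    numa_free(buf, size);
    return EXIT_SUCCESS;
}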

2.1.4 DMA

Direct Memory Access is a technique to avoid having the CPU intervene between an I/O device and the memory to copy data from one to the other. The CPU simply initiates the transfer with the help of a DMA controller, which then takes care of the transfer between the two entities. When the transfer is done, the device involved raises an interrupt to notify the OS, and therefore the CPU, that the operation has been completed and that the consequential actions should be taken, e.g. processing the packets in case of reception, or cleaning up the buffer memory in case of transmission.

2.1.5 Ethernet

Ethernet is nowadays the standard for layer-2 frame transmission and the one we will be using throughout this thesis. The minimum size of an Ethernet frame was originally 64 bytes due to the CSMA/CD technique being used on the link. The idea was to have a minimum time slot, ensured by this fixed size, so that the time taken to send those bits on the wire would be enough for all stations within a maximum cable radius to hear the transmission of the frame before it ended. Therefore if two stations started transmitting over the common medium (i.e. the wire), they would be able to detect the collision.

When a collision happens, a jam sequence is sent by the station that notices it. Its aim is to make the CRC (located at the end of the frame) bogus, making the NIC discard the entire frame before any further computation.

The minimum packet size of 64 bytes makes sense in 10Mb/100Mb Ethernet networks, as the maximum length of the cable is respectively 2500 and 250 meters. However, if we push the same calculation to 1000Mb (1G) Ethernet, the resulting maximum length of 25 m can be considered too small, not to mention higher speeds.

In reality, when one sends a 64-byte packet on the wire, a total of 84 bytes can be counted per frame:

• A 64-byte frame composed of:

– a 14-byte MAC header: destination MAC, source MAC and packet type;

– a 46-byte payload, typically an IP packet with TCP or UDP on top of it;

– a 4-byte CRC at the end.

• An 8-byte preamble, for the sender and receiver to synchronise their clocks.

• 12 bytes of interframe gap. No actual transmission takes place, but it is the required amount of bit-time that must be respected between frames.

Theoretical limit As shown above, for a 60-byte payload (including IP and TCP/UDP headers) we must in reality count 84 bytes on the wire.

This implies that for a 10-Gigabit transmission we will have a maximum of:

Max = Bandwidth / (Frame size × 8) = (10 × 10^9) / (84 × 8) = (10 × 10^9) / 672 ≈ 14 880 952 ≈ 14.88 × 10^6 frames per second

We can conclude that the maximum number of minimum-sized frames that can be sent over a 10G link is 14.88 million per second. Applying the same calculation to a 40G and a 100G link gives respectively 59.52 and 148.81 million frames per second.
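The following small C sketch (illustrative only) reproduces these figures, counting the 8-byte preamble and the 12-byte interframe gap on top of each frame:

#include <stdio.h>

/* For a frame of "frame_bytes" (including the 4-byte CRC), the wire carries
 * frame_bytes + 8 (preamble) + 12 (interframe gap) bytes. */
static double max_frames_per_second(double link_bps, int frame_bytes)
{
    int on_wire_bits = (frame_bytes + 8 + 12) * 8;   /* 672 bits for 64 bytes */
    return link_bps / on_wire_bits;
}

int main(void)
{
    const double links[] = { 10e9, 40e9, 100e9 };

    for (int i = 0; i < 3; i++)
        printf("%3.0fG link: %.2f million 64-byte frames per second\n",
               links[i] / 1e9, max_frames_per_second(links[i], 64) / 1e6);
    return 0;
}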

Figure 2.2: Theoretical limits of the link according to packet size on a 10G link.

Figure 2.3: Theoretical limits of the link according to packet size on a 40G link.

Figure 2.3 will be useful as a benchmark during our experiments, as it gives the upper bound.

2.1.6 PCIe

Peripheral Component Interconnect Express, usually called PCIe, is a type of bus used to attach components to a motherboard. It was developed in 2004 and, as of 2016, its latest release is version 3.1, but only 3.0 products are available; a new 4.0 standard is expected in 2017. PCIe 3.0 (sometimes called Revision 3) is the most common type of bus found among high-speed NICs, because older revisions are in fact too slow to provide the bus speed required to sustain 40 or even 10 Gigabit per second if the number of lanes is too small (see the next paragraphs).

Bandwidth To actually understand the speed of PCIe buses we must define the notion of "transfer", as the speed is given in "gigatransfers per second" (GT/s) in their specification [11]. A transfer is the action of sending one bit of data on each lane of the channel; it does not by itself specify the amount of data sent, because the channel width is needed to compute it. In other words, without the number of lanes used in a transaction, we cannot calculate the actual bandwidth of the channel.

Circumventing the complex design details: PCIe versions 1.x and 2.0 use an 8b/10b encoding [11, p. 192].

This forces 10 bits to be sent for every 8 bits of data, implying an overhead of 1 − 8/10 = 0.2, i.e. 20%, for every transfer.

The 3.0 revision uses a 128b/130b encoding, limiting the overhead to 1 − 128/130 ≈ 0.015, i.e. about 1.5%.

Now that we know the encoding overhead and the channel width, we can calculate the bandwidth B:

B = Transfer rate × (1 − overhead) × number of lanes

Table 2.1 holds the results of the bandwidth calculation. We highlighted the bandwidths compatible with 10G in blue and with 40G in red (10G being compatible with 40G).

Version        | 1.1      | 2.0     | 3.0
Speed          | 2.5 GT/s | 5 GT/s  | 8 GT/s
Encoding       | 8b/10b   | 8b/10b  | 128b/130b
Bandwidth 1x   | 2 Gb/s   | 4 Gb/s  | 7.88 Gb/s
Bandwidth 4x   | 8 Gb/s   | 16 Gb/s | 31.50 Gb/s
Bandwidth 8x   | 16 Gb/s  | 32 Gb/s | 63.01 Gb/s
Bandwidth 16x  | 32 Gb/s  | 64 Gb/s | 126.03 Gb/s

Table 2.1: PCIe speeds
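The values of Table 2.1 can be reproduced with the sketch below (illustrative only), multiplying the transfer rate by the encoding efficiency and the number of lanes:

#include <stdio.h>

/* Effective PCIe bandwidth = GT/s * encoding efficiency * number of lanes. */
int main(void)
{
    const struct { const char *ver; double gts; double eff; } gen[] = {
        { "1.1", 2.5,   8.0 / 10.0  },
        { "2.0", 5.0,   8.0 / 10.0  },
        { "3.0", 8.0, 128.0 / 130.0 },
    };
    const int lanes[] = { 1, 4, 8, 16 };

    for (int g = 0; g < 3; g++)
        for (int l = 0; l < 4; l++)
            printf("PCIe %s x%-2d: %7.2f Gb/s\n",
                   gen[g].ver, lanes[l], gen[g].gts * gen[g].eff * lanes[l]);
    return 0;
}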

2.1.7 Networking terminology

DUT Device Under Test: the targeted device whose performance we aim to assess.

Throughput The throughput is the fastest rate at which the count of test frames transmitted by the DUT is equal to the number of test frames sent to it by the test equipment. [12]


2.2 Linux

The Linux operating system was started in 1991 by Linus Torvalds as a common effort to provide a fully open-source operating system. It is UNIX-like, and the usage of Linux is between 1 and 5% of the global desktop market, implying that it is scarcely used by end users. However this data is quite unreliable, as most companies or researchers rely on publicly available data, for instance the User-Agent header passed in an HTTP request, which can be forged, or worldwide device shipments, which tend to be unreliable as well since most laptops will at least allow dual-booting with a second OS.

While not frequently used by the general population, it is extremely popular in the server market: its stability, open-source code and constant updates make it a weapon of choice for most system administrators. While it will be referred to as "Linux" in this document, the correct term would be GNU/Linux, as the operating system is a collection of programs on top of the Linux kernel and depends on GNU software.

Figure 2.4: Tux, the mascot of Linux

2.2.1 OS Architecture design

Linux has a monolithic kernel design [4, p. 7], meaning that it is loaded as a single binary image at boot, stored and run in a single address space. In other words, the base kernel is always loaded into one big contiguous area of real memory, whose real addresses are equal to its virtual addresses [13]. The main perk of such an architecture is the ability to run all needed functions and drivers directly from kernel space, making it fast. However it comes at the price of stability issues: as the whole kernel runs as a single entity, if there is an error in any subset of it, the stability of the system as a whole cannot be guaranteed.

Whilst such drawbacks could seem an impediment for the OS, monolithic kernels are not only mature nowadays, but the almost-exclusive design used in industry. This design is opposed to micro-kernels, which we will not detail as they are outside the scope of this study.

But it is not realistic to talk about a "pure" monolithic kernel, as Linux actually has ways to dynamically load code inside the kernel space, more precisely pre-compiled portions of code called loadable kernel modules or LKMs. As this code cannot be loaded inside the same address space that the kernel uses, its memory is allocated dynamically [13].

The flexibility offered by LKMs is absolutely crucial to Linux's malleability: if every component had to be loaded at boot, the size of the boot image would be colossal.

The operating system can be seen as being split into three parts: the hardware, which is obviously fixed, the kernel-space and the user-space. This segmentation makes sense when it comes to memory allocation, as explained above. The kernel-space is static and contiguous; it runs all the functions that interact directly with the hardware (drivers) and its code cannot change (unless the code being executed is an LKM). The user-space has much more freedom of action, as the memory it uses can be allocated dynamically, making the loading and evolution of programs quite seamless. However, to interact with hardware, e.g. memory or I/O devices, it must go through system calls.

System calls are functions that request a service from the kernel while abstracting away the underlying complexity behind simple interfaces.

Figure 2.5: Overview of the kernel [4]

2.2.2 /proc pseudo-filesystem

The /proc folder is actually a whole separate file-system on its own, called procfs [4, p. 126]. Loaded at boot, its purpose is to provide a way to harvest information from the kernel. In reality it does not contain any physical files (i.e. written to hard disks): all the files represented inside it are actually stored in the memory of the computer (a RAM-based file-system) rather than on a hard drive, which also implies they disappear at shutdown.

It was designed to expose any kind of information the user could need to inspect about the kernel, often related to performance. A lot of programs interact directly with this information to gain knowledge of the system; for instance the well-known command ps makes use of different statistics included in /proc. It is even more powerful than that, as we can directly "hot-plug" functionality inside the kernel by interacting with /proc: for instance, which CPUs are pinned to a particular interrupt can be changed with its help. Needless to say, not all functionality inside the kernel can be changed by simply writing a number or a string into /proc.

This becomes a key element when considering not only the vanilla kernel but also its modules. As explained previously, we can load or unload LKMs, and as they are technically part of the kernel, their status and configuration interfaces are also found in /proc. A minimal sketch of this kind of interaction is given below.
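As a hypothetical illustration of this interaction (the interrupt number 42 is made up and the write requires root privileges), pinning an interrupt to CPU 0 boils down to writing a CPU bitmask into the corresponding smp_affinity file:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: pin (hypothetical) interrupt 42 to CPU 0 by writing
 * the CPU bitmask 0x1 into its procfs entry. */
int main(void)
{
    const char *path = "/proc/irq/42/smp_affinity";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return EXIT_FAILURE;
    }
    fprintf(f, "1\n");              /* bitmask: bit 0 set -> CPU 0 only */
    fclose(f);
    return EXIT_SUCCESS;
}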

Other information systems

• sysfs: another RAM-based file-system, this one aiming to export kernel data structures, their attributes, and the linkages between them to user-space [14].

Usually mounted on /sys.

• configfs: complementary to sysfs, it allows the user to create, configure and delete kernel objects [15].

2.2.3 Socket Buffers

Socket buffers or SKBs are single-handedly the most important structure in the Linux networking stack.

For every packet present in the operating system, an SKB must be affiliated with it in order to store its data in memory. This has to be done in kernel space, as the interaction with the driver happens inside the kernel [16].

The structure sk_buff is implemented as a doubly linked list in order to loop through the different SKBs easily. Since the content of the sk_buff structure is gigantic we will not go into too much detail here, but here are the basics [17]:

Figure 2.6: How pointers are mapped to retrieve data within the socket buffer [18].

The socket buffers were designed to encapsulate easily any kind of protocol, hence there are ways to access the different parts of the payload by moving a pointer around and mapping its content into a structure.

As shown in Figure 2.6, the data is located in a contiguous chunk of memory and pointers indicate the location of the structure. When going up the stack, extra pointers are mapped to easily recognize and access the desired part of the packet, e.g. the IP header or the TCP header. Important note: the data pointer does NOT refer to the payload of the packet, and reading from it will most likely yield gibberish values for the user.

With the help of system calls, SKBs are abstracted away from user-space programs, which most likely will not make use of the underlying stack architecture. However, those system calls are not accessible from inside kernel-space.

To easily decode data from within the kernel, pre-existing structures with the usual protocol fields are available, and by mapping a pointer onto such a structure one can interpret packet content trivially.
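A minimal kernel-side sketch of this mapping (the function itself is hypothetical, but ip_hdr() and udp_hdr() are the standard helpers) could look as follows:

#include <linux/in.h>
#include <linux/ip.h>
#include <linux/skbuff.h>
#include <linux/udp.h>

/* Illustrative sketch: once the header offsets of an sk_buff are set, the
 * ip_hdr() and udp_hdr() helpers map the raw bytes onto protocol structures. */
static u16 example_udp_dest_port(const struct sk_buff *skb)
{
    const struct iphdr *iph = ip_hdr(skb);        /* network header */

    if (iph->protocol != IPPROTO_UDP)
        return 0;

    return ntohs(udp_hdr(skb)->dest);             /* transport header field */
}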

Reference counter Another very important variable the structure holds is called atomic_t users.

It is a reference counter: a simple integer that counts the number of entities currently using the SKB.

It is implemented as an atomic integer, meaning it must only be modified through specific functions that ensure the integrity of the data across all cores.

It is initialized to the value 1, and if it reaches 0 the SKB ought to be deleted. Users should not interact with such counters directly; however, as we will see with pktgen, the latter rule is not always respected.
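Outside of pktgen, the counter is normally manipulated through helpers, as in the illustrative sketch below: skb_get() atomically increments users, while kfree_skb() decrements it and only releases the memory when it reaches zero.

#include <linux/skbuff.h>

/* Illustrative sketch of the reference-counting rules (kernel 4.x era,
 * where users is an atomic_t). */
static void skb_refcount_example(struct sk_buff *skb)
{
    skb_get(skb);          /* users: 1 -> 2, we now hold our own reference */

    /* ... hand the skb over to another consumer, e.g. a transmit queue ... */

    kfree_skb(skb);        /* users: 2 -> 1, the buffer is NOT freed yet    */
    kfree_skb(skb);        /* users: 1 -> 0, the memory is finally released */
}

pktgen, as described in section 2.4.3, instead adds the whole burst amount to this counter at once, a behaviour revisited in section 4.8.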

2.2.4 xmit_more API

Since kernel 3.18, some efforts were made to optimize the global throughput through batching, i.e. bulking packets as a block to be sent instead of treating them one by one. Normally, when a packet is given to the hardware through the driver, several actions are performed, like locking the queue, copying the packet to the hardware buffer, telling the hardware to start the transmission, etc. [9].

The idea was simply to communicate to the driver that several more packets are coming, so that it can delay some of those actions, knowing it is a better fit to postpone them until there are no more packets to be sent. It is important to note that the driver is not forced in any way to delay its usual procedures: it is the one taking the decision. To make this functionality available to drivers while not breaking compatibility with older ones, a new boolean, xmit_more, was added to the SKB structure. If set to true, the driver knows there are more packets to come.
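On the driver side the decision looks roughly like the sketch below; this is not taken from any real driver, and ring_almost_full stands for whatever condition the driver uses to decide it cannot postpone any longer:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative sketch only: the costly doorbell write that tells the NIC to
 * start fetching descriptors is skipped while the stack announces more
 * packets, and issued for the last packet of the batch (or when the ring is
 * close to full). */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct net_device *dev,
                                bool ring_almost_full)
{
    /* ... place skb into the next free descriptor of the hardware TX ring ... */

    if (!skb->xmit_more || ring_almost_full) {
        /* ... write the tail/doorbell register: DMA of the whole batch
         *     accumulated so far starts here ... */
    }
    return NETDEV_TX_OK;
}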

2.2.5 NIC drivers

NIC drivers handle the communication between the NIC and the OS, primarily packet sending and reception. There are two approaches to receiving packets:

• Interrupt: upon reception of a packet, the NIC sends an interrupt to the OS so that it retrieves the packet. But in case of high-speed reception, the CPU will most likely be overwhelmed by the interrupts, as they are all executed with a higher priority than other tasks.

• NAPI: to mitigate the latter issue, interrupts are temporarily disabled and the driver switches to a polling mode. This is done through the New API, an extension to the device driver packet processing framework [8]: the interrupts of a NIC are switched off when it reaches a certain threshold fixed at driver initialization [16].

Here are the common pitfalls that can influence the NIC driver performance [19]:

• DMA should have better performance than programmed I/O; however, due to the high overhead it causes, DMA should not be used below a certain threshold.

• For PCI network cards (the only relevant type for high-speed networks nowadays) the DMA burst size is not always fixed and must be determined. It should coincide with the cache size of the CPU, making the process faster.

• Some drivers have the ability to compute TCP checksums, offloading the work from the CPU and gaining efficiency thanks to optimized hardware.

2.2.6 Queuing in the networking stack

The queuing system in Linux is implemented through an abstraction layer called Qdisc [20]. Its uses range from a classical FIFO algorithm to more advanced QoS-oriented queuing (e.g. HTB or SFQ), though those methods can be circumvented if one user-level application fires multiple flows at once [21].

Driver queues are the lowest-level networking queues that can be found in the OS. They interact directly with the NIC through DMA. However this interaction is done asynchronously between the two entities (the opposite would make the communication extremely inefficient), hence the need for locks to ensure data integrity. The lowest-level function one can use to interact directly with the driver queue is dev_hard_start_xmit().

Nowadays most NICs have multiple queues, to benefit best from the SMP capabilities of the system [22]. For instance, the 82580 NIC from Intel and its variants support multiple queues. Some frameworks (cf. 2.3.5) allow direct access to the NIC registers for better analysis and tuning of the hardware.

2.3 Related work – Traffic generators

2.3.1 iPerf

iPerf [23] is a user-space tool made to measure the bandwidth of a network. Due to its user-space design, it cannot achieve high packet rates because of the need to use system calls to interact with the lower interfaces, e.g. NIC drivers or even qdiscs. To mitigate this overhead the user might use a zero-copy option to make access to the packet content faster. It is able to saturate links through the use of large packets, and can even report the MTU if unknown to the user. It may measure jitter and latency through UDP packets. Both a server and a client instance of iPerf must be running for the program to work. An interesting new option is the handling of the SCTP protocol in version 3.

The simplicity of installation and use makes it a weapon of choice for network administrators who wish to check their configurations. It is important to note that this project is still maintained and being updated frequently at the time of this thesis.

Note that this is the only purely user-space oriented traffic generation tool described here, as the performance of such tools cannot match the other optimized frameworks.

Other user-space examples include (C)RUDE [24], NetPerf [25], Ostinato [26], lmbench [27].

2.3.2 KUTE

KUTE [28] is a UDP in-kernel traffic generator. The program is divided into two LKMs, a sender and a receiver. Once loaded, the sender will compute a static inter-frame gap based on the speed specified by the user during setup. One improvement they advertise is to directly use the cycle counter located in the CPU registers instead of the usual kernel function to check the time, as the latter was considered not precise enough. Note that as this technology is from 2005, this information might be outdated. An interesting feature is that it does not handle the layer-2 header, making it theoretically possible to use over any L2 network. The receiver module will provide statistics to the user at the end, when it is unloaded.

2.3.3 PF_RING

PF_RING [29] is a packet processing framework developed by the ntop company. The idea was, as for pktgen and KUTE, to put the entire program inside the kernel. However it goes a step further by proposing actual kernel to user-space communication. The architecture, as the name suggests, is based on a ring buffer. It polls packets from the driver buffers into the ring [30] through the use of NAPI.

While it does not require particular drivers, PF_RING-aware drivers can be added and should provide extra efficiency.

Entirely implemented as an LKM, they advertise a speed of 14.8 Mpps "and above" on a "low-end 2,5GHz Xeon". However they do not state clearly whether that concerns packet transmission or capture, leaving the statement ambiguous.

PF_RING ZC is a proprietary variant of PF_RING and is not open source. On top of the previous features they offer an API which is able to handle NUMA nodes, as well as zero-copy packet operation, supposedly enhancing the global speed of the framework.

In this version traffic generation is explicitly possible. It can also share data among several threads easily, thanks to the ring buffer architecture coupled with zero-copying.

2.3.4 Netmap

Netmap [31] aims to reduce kernel-related overhead by bypassing the kernel with its own home-brewed network stack.

They advertise a 10G wire-speed (i.e. 14.88 Mpps) transfer with a single core at 1.2 GHz. Among the improvements, they:

• Do a shadow copy (snapshot) of the NIC's buffer into their own ring buffer to support batching, bypassing the need for sk_buffs, hence gaining speed on (de)allocations.

• Use efficient synchronization to make the best use of the ring buffer.

• Natively support multi-queues on SMP architectures through the setting of interrupt affinities.

• Keep the API completely independent from the hardware used. Device drivers ought to be modified to interact correctly with the netmap ring buffer, but those changes should always be minimal.

• Do not block any kind of "regular" transmission from or to the host, even with a NIC being used by their drivers.

• Also support the widely used libpcap library by implementing their own version on top of the native API.

• Expose the API through /dev/netmap, whose content is updated by polling. The packets are checked by the kernel for consistency.

It is also implemented as an LKM, making it easy to install; however, drivers might need to be changed for full usability of the framework.

2.3.5 DPDK

The Data Plane Development Kit [32] is a "set of libraries and drivers for fast packet processing". It was developed by Intel and is only compatible with Intel's x86 processor architecture.

They advertise a speed of 80 Mpps on a single Xeon CPU (8 cores), which is enough to saturate a 40G link.

DPDK moves its entire processing into user-space, including ring buffers, NIC polling and other features usually located inside the kernel. It does not go through the kernel to push those changes or actions, as it features an Environment Abstraction Layer, an interface that hides the underlying components and bypasses the kernel by loading its own drivers. They offer numerous enhancements regarding software and hardware, e.g. prefetching or setting up core affinity, among many other concepts.

2.3.6 Moongen

Moongen [33] is based on the DPDK framework, therefore inheriting its perks and drawbacks. Moongen brings new capabilities to the latter framework by adding several paradigms as "rules" for the software: it must be fully implemented in software, and therefore run on off-the-shelf hardware; it must be able to saturate links at 10G wire-speed (i.e. 14.88 Mpps); it must be as flexible as possible; and, last but not least, it must support precise time-stamping and rate control (i.e. inter-packet latency). They found that the requirements were best fulfilled by implementing malleability through Lua scripting, as the language also has fast performance thanks to JIT support (cf. 2.5.2). The architecture behind Moongen lies on a master/slave interaction, set up within the script the user must provide. The master process sets up the counters, including the ones located on the NICs, and the slaves perform the traffic generation.

An interesting feature introduced in this traffic generator is a new approach to rate control.

As explained previously, NICs have an asynchronous buffer to take care of packet transmission, and the usual approach to control the rate is to wait some time between packets. However the NIC might not send the packets exactly as they arrive. Instead of waiting, Moongen fills the inter-packet gap with a faulty packet: it forges a voluntarily incorrect packet checksum so that the receiving NIC will discard it upon arrival. However this method is limited by the NIC having a minimum packet size, forcing the faulty filler packets to have a certain size, which can be impractical in some situations.

They advertise a speed of 178.5 Mpps at 120 Gbit/s, with a CPU clock at 2 GHz.


2.4 Pktgen

Introduction pktgen is a module of the Linux kernel that aims to analyse the networking performance of a system by sending as many packets as possible [5]. It was developed by Robert Olsson and was integrated into the Linux main tree in 2002 [36].

The usage of the tool is made through procfs. All pktgen-related files mentioned in the following paragraphs are located in /proc/net/pktgen.

To interact with the module, one must write into pre-existing files representing the kernel threads dedicated to pktgen. There are as many threads as there are cores; for instance the file kpktgend_0 is the file bound to the thread for core number 0. This information is important as nowadays all CPUs have SMP, hence the need to support such architectures. The user then passes commands by writing directly into those files.

# echo "add_device eth0" > /proc/net/pktgen/kpktgend_0

Figure 2.7: Example of a shell command to interact with pktgen.

Figure 2.7 shows a typical example of interaction between the pktgen module and the user. By redirecting the output of the echo command, we pass the command "add_device" with the argument "eth0" to thread 0. Please note that all write operations in the proc filesystem must be done as superuser (aka root). If the operation is unsuccessful, the command will report an I/O error.

While this might seem slightly disconcerting at first, the design choice behind this interface comes from the module being in-kernel, making a proc directory the simplest design to allow interaction with the user.

Example Now that the interaction between the user and the module has been clarified, here is a representative description of how pktgen is typically used, which can be logically split into 3 steps.

1. Binding: The user must bind one or more NICs to a kernel thread.

Fig 2.7 is an example of such action.

2. Setting: If the operation is successful, a new file will be created, matching the name of the NIC (or associated queue). For instance by executing the command in Fig 2.7, a new file eth0 will be created in the folder. The user must then pass the parameters that he or she wishes by writing in the latter file.

A non exhaustive list of parameters would be:

• count 100000 – Send 100000 packets.

• pkt_size 60 – Set the packet size to 60 bytes; this includes the IP/UDP headers. Note that 4 extra bytes are added to the frame by the CRC.

• dst 192.168.1.2 – Set the destination IP.

3. Transmitting: When all the parameters are set, the transmission may start by passing the parameter start to the pktgen control file pgctrl. The transmission will either stop by interrupting the writing operation (typically CTRL+C in the terminal) or when the total amount of packets to be sent will be matched by the pktgen counter.

The transmission statistics, as in time spent transmitting or number of packets per seconds will be found in the file(s) matching the name of the interfaces used in the second step, e.g. eth0.

While it is possible to associate one thread with several NICs, the opposite is not possible. However pktgen has a workaround to profit from multi-core capacities, by adding the number of a core after the name of the NIC: eth0@0 will result in interacting with the NIC eth0 through core 0. A user-space sketch of the whole sequence is given below.
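The following short C sketch is equivalent to a handful of echo commands and uses hypothetical values for the interface, destination and packet count:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch of the bind / set / transmit sequence described above,
 * writing pktgen commands into procfs exactly like "echo ... > file".
 * Must be run as root; eth0 and the addresses are placeholders. */
static void pg_write(const char *path, const char *cmd)
{
    FILE *f = fopen(path, "w");

    if (!f || fprintf(f, "%s\n", cmd) < 0) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    fclose(f);
}

int main(void)
{
    /* 1. Binding: attach eth0 to the thread of core 0 */
    pg_write("/proc/net/pktgen/kpktgend_0", "add_device eth0");

    /* 2. Setting: packet size, destination and amount */
    pg_write("/proc/net/pktgen/eth0", "pkt_size 60");
    pg_write("/proc/net/pktgen/eth0", "dst 192.168.1.2");
    pg_write("/proc/net/pktgen/eth0", "count 100000");

    /* 3. Transmitting: blocks until the count is reached or interrupted */
    pg_write("/proc/net/pktgen/pgctrl", "start");

    return EXIT_SUCCESS;
}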

2.4.1 pktgen flags

pktgen has several flags that can be set upon configuration of the software. The following list is exhaustive and up to date, as it was directly fetched and interpreted from the latest version of the code (v2.75 – kernel 4.4.8).

Flag            | Purpose
IPSRC_RND       | Randomize the IP source.
IPDST_RND       | Randomize the destination IP.
UDPSRC_RND      | Randomize the UDP source port of the packet.
UDPDST_RND      | Randomize the UDP destination port of the packet.
MACSRC_RND      | Randomize the source MAC address.
MACDST_RND      | Randomize the destination MAC address.
TXSIZE_RND      | Randomize the size of the packet to send.
IPV6            | Enable IPv6.
MPLS_RND        | Get random MPLS labels.
VID_RND         | Randomize VLAN ID label.
SVID_RND        | Randomize SVLAN ID label.
FLOW_SEQ        | Make the flows sequential.
IPSEC_ON        | Turn IPsec on for flows.
QUEUE_MAP_RND   | Match packet queue randomly.
QUEUE_MAP_CPU   | Match packet queue to bound CPU.
NODE_ALLOC      | Bind memory allocation to a specific NUMA node.
UDPCSUM         | Include UDP checksums.
NO_TIMESTAMP    | Do not include a timestamp in packets.

Table 2.2: Flags available in pktgen.

Two flags in Table 2.2, QUEUE_MAP_CPU and NODE_ALLOC, stand out as the most important ones for enforcing the performance of the system.

QUEUE_MAP_CPU is a huge performance boost because of the thread behaviour of pktgen. In short, when the pktgen module is loaded it creates a thread for each CPU core detected on the system, including logical cores; then a queue is created to handle the packets to be sent (or received) for each thread, so that they can all be used independently instead of a single queue that would require heavy concurrency control. It also takes advantage of the ability of recent NICs to do multi-queuing.

Setting this flag ensures that the queue a packet will be sent to is handled by the same core as the one currently treating the packet.

NODE_ALLOC is obviously only needed on a NUMA-based system, and allows binding an interface (or queue, as explained) to a particular NUMA memory bank, avoiding the latency caused by having to fetch from remote memory.

Note that during the scope of this thesis we will not be treating pktgen options that change or modify the protocol used during the transmission, e.g. VLAN tagging, IPsec, or MPLS. This is outside the scope as we only care about maximum throughput and therefore will not have any use for such technologies.

2.4.2 Commands

2. The commands used on the pgctrl file are also obvious: start begins the transmission (or reception) and stop ends it.

3. Most of the commands passed to the device are easily understandable and well documented in [37]. We will only list the commands that need further explanation:

• node < integer >: when the NODE ALLOC flag is on, this binds the selected device to the wanted memory node.

• xmit mode < mode >: set the type of mode pktgen should be running. By default the value is start xmit, which is the normal transmission mode which we will detail further in the next paragraph.

The other mode is netif receive which turns pktgen into a receiver instead of a trans- mitter. We will not go into the details on the algorithm here as it will not be charted here;

however the algorithm is summarized through a diagram in the appendix.

• count < integer >: select the amount of packets to be sent. A zero will result in a infinite loop until stopped.

It is important to note that because of the granularity of the timestamping inside pktgen, an amount of packets considered too small will result in a biased speed advertised. As a recommendation the program must run for at least a few millisecond, therefore the count number must match the speed of the medium.

• clone_skb <integer>: this option aims to mitigate the overhead caused by having to perform a memory allocation for each packet sent. This is done by "recycling" the SKB structure used, hence sending carbon copies of the packet over the wire, through a simple increment of its reference counter to avoid its destruction by the system.

The integer passed as an argument is the number of copies sent over the network per SKB allocation. For example, with clone_skb 1000, packets 1 to 1000 will be identical, then packets 1001 to 2000 will be identical, and so on.

• burst <integer>: this option is the most important one for maximum throughput, as demonstrated by the experiments later on. It makes use of the xmit_more API, hence allowing bulk transmission as explained previously. A minimal configuration sketch combining these commands is given below.
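As an illustration, pktgen is driven entirely by writing textual commands to its /proc files. The following minimal C sketch applies the commands and flags discussed above and then starts a run; the device name eth0 and all values are only examples, and the device is assumed to have already been added to one of the kpktgend threads:

#include <stdio.h>
#include <stdlib.h>

/* Write one textual pktgen command into a /proc/net/pktgen file. */
static void pg_cmd(const char *path, const char *cmd)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    fprintf(f, "%s\n", cmd);
    fclose(f);
}

int main(void)
{
    /* Per-device control file ("eth0" is only an example). */
    const char *dev = "/proc/net/pktgen/eth0";

    pg_cmd(dev, "count 10000000");     /* packets to send (0 = run until stopped) */
    pg_cmd(dev, "clone_skb 1000");     /* reuse each allocated SKB 1000 times     */
    pg_cmd(dev, "burst 8");            /* bulk transmission via the xmit_more API */
    pg_cmd(dev, "flag QUEUE_MAP_CPU"); /* TX queue follows the CPU running pktgen */

    /* Start the run; the write returns once transmission is finished or stopped. */
    pg_cmd("/proc/net/pktgen/pgctrl", "start");
    return 0;
}

The reference shell scripts shipped with recent kernels under samples/pktgen follow the same pattern, using echo instead of fprintf().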

2.4.3 Transmission algorithm

Through a code review, we will now explain the internal workings of pktgen when it comes to packet transmission. The following explanation concerns the pktgen_xmit() function located in net/core/pktgen.c.

Everything described in this section is condensed in figure 2.8, and a short code sketch is given after the step-by-step walkthrough below.

1. At the start, options such as the burst value (equal to 1 by default) are retrieved, through atomic accesses if necessary. The device is checked to be up and to have a carrier; if not, the function returns. This implies that if the device is not up, no error is returned to the user.

2. If there is no valid SKB to be sent, or it is time for a new allocation, pktgen frees the current SKB pointer with kfree_skb() (if it is NULL the function simply returns). A new packet is then allocated and filled with the correct headers through the fill_packet() function. If the latter fails, the function returns.

3. If an inter-packet delay is required, the spin() function is called.

4. The final steps before sending packets are to retrieve the correct transmission queue, disable soft-irqs (as bottom halves could delay the traffic) and lock the queue for this CPU.

5. The SKB reference counter is incremented by the amount of bursting data about to be sent. This should not happen here, as will be discussed in section 4.8.


6. Start of the sending loop: a packet is sent with the xmit_more-capable function netdev_start_xmit().

The latter takes as an argument, among others, a boolean indicating whether there is more data to come; if the SKB is unique it is set to false, otherwise it remains true until we run out of bursting data to send.

7. If netdev_start_xmit() returns a transmission error, the loop exits, except if the device was busy, in which case the transmission is retried.

8. In case of success the counters are updated: number of packets sent, number of bytes sent and sequence number.

9. If there is still data to be sent (i.e. burst > 0) and the queue is not frozen, go back to the start of the sending loop. Otherwise exit the loop.

10. On exiting the loop: unlock the queue bound to the CPU and re-enable soft-irqs.

11. If all programmed transmissions are done, pktgen checks that the reference counter of the last SKB is back to 1, then stops the run.

12. Otherwise the function ends here.
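To make this walkthrough more concrete, the following heavily condensed C sketch paraphrases the sending part of pktgen_xmit() (roughly steps 4 to 10). The symbol names (skb_get_tx_queue(), HARD_TX_LOCK(), netdev_start_xmit(), etc.) are taken from the kernel sources, but allocation, delay handling, most error paths and some statistics updates are omitted, so it illustrates the structure rather than being a compilable excerpt:

/* Condensed paraphrase of the sending part of pktgen_xmit() (net/core/pktgen.c). */
txq = skb_get_tx_queue(odev, pkt_dev->skb);        /* step 4: pick the TX queue       */
local_bh_disable();                                /* step 4: block soft-irqs         */
HARD_TX_LOCK(odev, txq, smp_processor_id());       /* step 4: lock the queue for CPU  */

atomic_add(burst, &pkt_dev->skb->users);           /* step 5: one reference per copy  */

do {
        /* step 6: hand the SKB to the driver; the last argument is the xmit_more
         * hint, true as long as further copies of this burst will follow. */
        ret = netdev_start_xmit(pkt_dev->skb, odev, txq, --burst > 0);

        if (ret == NETDEV_TX_OK) {                 /* step 8: update the counters     */
                pkt_dev->sofar++;
                pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
        } else if (ret != NETDEV_TX_BUSY) {        /* step 7: give up on other errors */
                break;
        }
        /* step 9: continue while burst data remains and the queue is usable */
} while (burst > 0 && !netif_xmit_frozen_or_drv_stopped(txq));

HARD_TX_UNLOCK(odev, txq);                         /* step 10: unlock the queue       */
local_bh_enable();                                 /* step 10: re-enable soft-irqs    */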


Figure 2.8: pktgen transmission algorithm


2.4.4 Performance checklist

Turull et al. [36] issued a series of recommendations to ensure the system is properly configured to yield the best performance for pktgen traffic generation.

• Disable frequency scaling, as we will not focus on energy matters.

• The same goes for CPU C-states: their purpose being power saving, we should limit their use so that the CPU does not introduce latency by falling into a "sleep" state.

• Pin the NIC queue interrupts to the matching CPU (or core), a.k.a. "CPU affinity". This recommendation was already issued by Olsson [5].

• Because of the previous point, one should also deactivate interrupt load balancing, as it spreads the interrupts among all the cores.

• NUMA affinity, which maps a packet to a NUMA node, can be a problem if, for instance, the node is far from the CPU in use. As explained previously, pktgen supports assigning a device to a specific node.

• Ethernet flow control can send a "pause frame" to temporarily stop the transmission of packets. We disable this, as we will not focus on the receiver side.

• Adaptive Interrupt Moderation (IM) must be kept on for maximum throughput, as it minimizes the CPU overhead.

• Place the sender and receiver on different machines to avoid having the bottleneck located on the bus of a single machine.

We will later carefully adjust the parameters of the machines used with the help of scripts and/or kernel and BIOS settings where possible; a small sketch of how some of these settings can be applied is given below.
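As an example, the following C sketch applies two of these recommendations under common assumptions: it forces the performance cpufreq governor on CPU 0 (disabling frequency scaling) and pins one NIC queue interrupt to that CPU. The IRQ number (42) is purely hypothetical and must be looked up on the actual system (e.g. in /proc/interrupts); flow control and interrupt moderation are usually adjusted with ethtool instead:

#include <stdio.h>

int main(void)
{
    FILE *f;

    /* Disable frequency scaling on CPU 0 by forcing the "performance" governor. */
    f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "w");
    if (f) {
        fputs("performance\n", f);
        fclose(f);
    }

    /* Pin IRQ 42 (hypothetical NIC queue interrupt) to CPU 0 (hexadecimal mask 1). */
    f = fopen("/proc/irq/42/smp_affinity", "w");
    if (f) {
        fputs("1\n", f);
        fclose(f);
    }

    return 0;
}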


2.5 Related work – Profiling

Profiling consists in collecting records of a system (or several systems), called the profile. It is commonly used to evaluate the performance of a system by estimating whether certain parts of it are too greedy or too slow, e.g. consuming too many CPU cycles compared to the rest of the actions being executed. We will only pay attention to Linux profiling, as the entire subject is based on this specific OS, and will therefore discuss techniques that might not be shared by other commonly used operating systems (e.g. Windows or BSD-based systems).

Two profiling systems were investigated during this thesis: the first one is perf [38]; the second one, eBPF [7], is in fact more than a profiling tool, as it has several other purposes and only recently gained profiling capabilities in the latest kernel versions.

2.5.1 perf

Perf, also called perf_events, covers a fairly broad spectrum of profiling capabilities. It is based on the notion of events, and by default the tool supports several kinds of events from different sources [39]:

• Hardware Events: use CPU performance counters to gain knowledge of CPU cycles used, cache misses and so on (a minimal example based on this interface is shown after this list).

• Software Events: low level events based on kernel counters. For example, minor faults, major faults, etc.

• Tracepoint Events: static instrumentation points pre-programmed inside the kernel, which perf can hook into. They are located on "important" functions, i.e. functions that are almost always executed for a system call to work correctly. For example, the tracepoint fired when an SKB structure is deallocated is skb:kfree_skb.

The list of tracepoints known to perf can be obtained by running sudo perf list tracepoint.

• Dynamic tracing: this is not exclusive to perf; it is a kernel facility that perf uses for monitoring.

The principle is to create a "probe", called a kprobe if located in kernel code or a uprobe if in user code.

This is an interesting functionality as it gives us the ability to monitor any precise function we wish to investigate, instead of relying on general-purpose instrumentation points (tracepoints).

• Snapshot frequency: perf is able to take samples at a given frequency to estimate CPU usage.

The more often a function is observed, the more samples are aggregated for it and the more CPU it is considered to use (as a percentage of the total samples). One of the perks is perf's ability to also record the PID and the call stack, providing full knowledge of what (and who) caused the CPU usage, since a single function may not only be hard to pinpoint but may also be called from several places.
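As a minimal illustration of the hardware events mentioned above, the following sketch uses the perf_event_open() system call, the same kernel interface the perf tool is built on, to count the CPU cycles spent in a simple loop. The pattern follows the perf_event_open(2) man page, and the loop is only a placeholder for the code one would want to measure:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;    /* hardware event: CPU cycles */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
    if (fd == -1) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile long i = 0; i < 1000000; i++) /* code being measured */
        ;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));
    printf("CPU cycles: %lld\n", count);

    close(fd);
    return 0;
}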

Kernel symbols

The kernel keeps a mapping of addresses to names in order to translate execution results into a human-readable output. The names can refer to several things, but we will only pay attention to function and variable names.

Overhead calculation

With the -g option, which records the entire call stack for the calculation of the total utilization percentage, perf shows two percentages per function: self and children. A function can obviously call other functions (possibly recursively), which would bias the "actual" total amount of time attributed to the caller. The split therefore makes perfect sense: the self number represents the percentage of samples taken in the body of the function itself, while the children number corresponds to the total percentage induced by the function, including the calls it performs (whose percentages are therefore included in that number). For instance, if a function accounts for 5% of the samples in its own body and the functions it calls account for another 20%, perf reports self = 5% and children = 25% for it.


Figure 2.9: Example of call-graph generated by perf record -g foo [38]

2.5.2 eBPF

BPF

Historically, there was a need for efficient packet capturing. Other solutions existed but were usually costly. Along came BPF, the Berkeley Packet Filter, with the idea of making user-level packet capture efficient. eBPF is the extended version of BPF, which has been greatly enhanced in recent kernel versions; we will discuss these differences shortly.

The idea is to run user-supplied programs, i.e. filters, inside kernel space. While this may sound dangerous, the code provided by user space must be safe: only a restricted set of instructions can actually be put inside such a filter.

To restrict the available instructions, BPF defines its own interpreted language, a small assembly-like instruction set.

A structure defined in linux/filter.h can be used by the user-space program to explicitly pass the BPF code to the kernel:

struct sock_filter {    /* Filter block */
        __u16 code;
        __u8  jt;
        __u8  jf;
        __u32 k;
};

Listing 2.1: Structure of a BPF program

The variables within this structure are:

• code: unsigned integer which contains the opcode to be executed.

• jt: unsigned char containing the offset to jump to if the test is true.

• jf: unsigned char containing the offset to jump to if the test is false.

• k: a generic 32-bit field, typically holding the constant operand of the instruction (e.g. a value to compare against or an offset to load from).


struct sock_fprog {                 /* Required for SO_ATTACH_FILTER. */
        unsigned short len;         /* Number of filter blocks */
        struct sock_filter __user *filter;
};

[...]

struct sock_fprog val;

[...]

setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));

Listing 2.2: Binding to a socket

The first member is simply the number of instructions and the second one a pointer to an array of the previous structures. The __user macro adds an attribute telling the kernel that the memory it points to comes from user space and shall not be trusted; this is needed for security. Last but not least, assuming a socket was correctly opened with file descriptor sock, the connection between the sock_fprog structure and the socket itself is made by calling setsockopt() as shown in listing 2.2.

The complexity of BPF programming resides in having to write filters in this pseudo-assembly. Tools such as libpcap and tcpdump generate it automatically (tcpdump -d prints the compiled filter in such a form); writing it by hand for a C program quickly becomes too inconvenient and is best avoided.

Figure 2.10: Assembly code required to filter packets on eth0 with TCP port 22.

Figure 2.10 illustrates how complex creating even a simple BPF program is.

As you might recognize from listing 2.1, each row is indeed divided into the four fields explained above.
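To give a sense of what such a filter looks like when written by hand against the structure of listing 2.1, the following self-contained sketch builds a small classic BPF program with the BPF_STMT()/BPF_JUMP() helper macros from linux/filter.h and attaches it to a raw packet socket. The filter logic (accept only IPv4/UDP frames) is our own illustrative choice, not the tcp-port-22 filter of figure 2.10:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void)
{
    /* Accept only IPv4/UDP frames, drop everything else (return 0). The
     * offsets assume the filter sees full Ethernet frames, as delivered
     * by an AF_PACKET socket. */
    struct sock_filter code[] = {
        BPF_STMT(BPF_LD  | BPF_H | BPF_ABS, 12),                /* A = EtherType       */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 3),    /* not IPv4 -> drop    */
        BPF_STMT(BPF_LD  | BPF_B | BPF_ABS, 23),                /* A = IP protocol     */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, IPPROTO_UDP, 0, 1), /* not UDP -> drop     */
        BPF_STMT(BPF_RET | BPF_K, 0xFFFF),                      /* accept, 65535 bytes */
        BPF_STMT(BPF_RET | BPF_K, 0),                           /* drop                */
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };

    /* Raw packet socket; requires root or CAP_NET_RAW. */
    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt");
        return 1;
    }

    /* From here on, reads on the socket only return IPv4/UDP packets. */
    close(sock);
    return 0;
}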

extended BPF

Over the last few years, BPF has been remodelled. It is no longer limited to packet filtering and can now be seen as a virtual machine inside the kernel, thanks to its dedicated instruction set [7].

Breaking the shackles of packet filtering came with a wealth of new features, which we explain below.

• The register width and arithmetic operations switched from 32 bits to 64 bits, matching today's 64-bit CPUs, at least on performance-oriented systems.
