
2006:274 CIV

MASTER'S THESIS

Code Generated Data Structures and Algorithms for Classification of Internet Traffic

Peter Enberg

Luleå University of Technology
MSc Programmes in Engineering
Computer Science and Engineering

Department of Computer Science and Electrical Engineering


Code generated data structures and algorithms for classification of Internet traffic

Peter Enberg

Luleå University of Technology

Dept. of Computer Science and Electrical Engineering CDT/EISLAB

September 27, 2006


Abstract

One of the goals of this master thesis was to implement a code generator for the static hybrid data structure. Both the generator and the generated code have proved to work, but the throughput of the lookup function can be greatly improved. It is possible to enter any stride sequence to generate code, which allows for a future merge of the code generator and Sundström's automated cross-breeding tool, Strider. Some measures needed for a faster lookup are to inline all functions concerned with query key lookup and to avoid loops in the code.


Preface

This master thesis is the last part of my examination to become a Master of Science in Computer Communication. The work was conducted in co-operation with the Department of Computer Science and Electrical Engineering at Luleå University of Technology in Sweden from February to September 2006.

I would like to thank my family and all of my friends for their support during my studies. Thanks to Malcolm Russell for reviewing the language in this report. Finally, my supervisor Mikael Sundström also deserves a big thanks for all the tips and tricks and help with difficulties.


Contents

1 Introduction
  1.1 Motivation
  1.2 Goals

2 Theory
  2.1 The Longest Prefix Matching problem
  2.2 Block trees
    2.2.1 Properties
  2.3 Tries
    2.3.1 Binary tries
    2.3.2 Multi-bit tries
  2.4 The hybrid data structure
    2.4.1 Lookup budget and memory blocks
    2.4.2 Structure
    2.4.3 Storage
    2.4.4 Lookups
    2.4.5 Strides - hybrid configuration

3 Method
  3.1 Types
    3.1.1 Hybrid keys and data
    3.1.2 Hybrid interval list
  3.2 Generic block tree implementation
    3.2.1 Build
    3.2.2 Lookup
    3.2.3 Space optimality
  3.3 Trie node implementation
  3.4 The code generator
  3.5 Building the hybrid

4 Evaluation and discussion
  4.1 Performance
    4.1.1 Other improvements
  4.2 Other optimizations


Chapter 1

Introduction

1.1 Motivation

The need for high performance network hardware is of increasing importance in today's data networks. The continuous increase in network bandwidth, together with real-time traffic demanding quality of service, requires network hardware with higher throughput. Throughput improvements can be achieved with faster hardware or with more efficient classification algorithms and data structures.

Mikael Sundström, Ph.D. student at the Department of Computer Science and Electrical Engineering at Luleå University of Technology, has developed a package of efficient algorithms and data structures for packet classification.

The package can guarantee not only classification performance but also storage and maintenance (update) costs. The data structure is really a hybrid of two data structures, tries and block trees, which will be explained in chapter 2. Sundström's research treats both a static and a dynamic hybrid data structure, but this master thesis will focus on the first of the two structures.

1.2 Goals

• Study Sundström’s research to gain knowledge of the data structure’s construction and complexity.

• Implement a code generator for the static hybrid data structure. An application for testing will be generated alongside the source code.


Chapter 2

Theory

This section will focus on describing a special part of the routing/forwarding procedure occurring on layer 3 of the OSI model [5] in a piece of generic network equipment. The task is to find the longest prefix matching an IP destination address, and we will show how to revise it into interval matching.

Following this revision we will begin explaining the two data structures, tries and block trees, that the hybrid data structure is based on.

2.1 The Longest Prefix Matching problem

A routing table is a list of IP address prefixes. Each prefix is associated with some sort of next-hop information. We will just consider the next-hop information as an index into a next-hop information table or array and that the real information is retrieved from that table once the index is found.

When making a forwarding decision the destination IP address is compared to the prefixes and the next-hop information associated with the longest matching prefix from the routing table is retrieved.

A typical IP address (IPv4) is a 32-bit integer [3] which means that there are 2^32 = 4 294 967 296 ≈ 4.3·10^9 different addresses and they are in the range [0..2^32 − 1]. A prefix is a bit string with a length in the range [0..32] and represents the most significant bits of an IP address. If a longest matching prefix is of length l, the remaining 32 − l least significant bits in the query IP address are of no importance. The default route is often represented with a * (wildcard), is of length zero and matches any address.

In fact, a prefix of length i constitutes an interval of size 2^(32−i) [1]. Take an IPv4 prefix x of length i bits. Treat x as an i-bit unsigned integer and multiply x by 2^(32−i) to get the interval starting point. The same interval ends with (x + 1) · 2^(32−i) − 1. One prefix consequently partitions the address space into three intervals in the worst case. In general, the worst case is that n prefixes partition the address space into 2n + 1 intervals. By using this processing of routing table prefixes we can transform the Longest Prefix Matching problem into interval matching and instead use data structures and algorithms for finding the Closest Dominating Point, i.e. finding the matching interval.
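To make the transformation concrete, the conversion amounts to two shift operations. The helper below is our own illustrative sketch for 32-bit IPv4 prefixes, not code from the thesis implementation.

#include <stdint.h>

/* Convert an IPv4 prefix (its i bits right-aligned in prefix_value,
 * length i) into the interval [start, end] that it covers. */
static void prefix_to_interval(uint32_t prefix_value, unsigned i,
                               uint32_t *start, uint32_t *end)
{
    if (i == 0) {                         /* default route * matches all */
        *start = 0;
        *end   = UINT32_MAX;
        return;
    }
    *start = prefix_value << (32 - i);                 /* x * 2^(32-i)   */
    *end   = (i == 32) ? *start                        /* single address */
                       : (*start | (UINT32_MAX >> i)); /* (x+1)*2^(32-i) - 1 */
}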

2.2 Block trees

A block tree is a comparison based data structure consisting of nodes and leaves, designed to partition a known universe U of integers into intervals. Each block tree node stores a maximum of B_N keys. The partitioned universe U consists of integers of length w (in binary representation). There are consequently 2^w integers, in the range [0..2^w − 1]. Since both min(U) and max(U) are known, the B_N keys partition the universe into B_N + 1 sub-universes, which in turn means that each node has a maximum of B_N + 1 sub-trees. The leaves contain keys and, in our case, also some sort of target data. Each leaf has a maximum of B_L keys, which partition the current sub-universe into B_L + 1 sub-universes and, since all intervals are associated with target data, there is a maximum of B_L + 1 pieces of target data in every leaf. The height of the block tree is denoted t. The levels [0..t − 2] consist of nodes and the last level, t − 1, consists of leaves.

Figure 2.1: Block tree with 2 keys in every node.

When looking up a query key q, we want to find the Closest Dominating Point c, where c is the biggest key in the block tree that is less than q. We start in the root node, where there are B_N keys. For each node, we want to find the Closest Dominating Point as an index in the range [0..B_N]. This can be accomplished in a number of ways. Since the keys are in sorted order we choose to use binary search, which runs in O(log B_N) time. Once the index is retrieved we know in which sub-tree to continue searching. When reaching a leaf we must again find an index, but this time in the range [0..B_L]. This index tells us which target data matches our interval.

2.2.1 Properties

With the different parameters we can express how many intervals can be stored in a block tree, but also how many memory blocks a complete block tree occupies. The number of intervals is denoted n_max and is a function of t, B_N and B_L:

n_max(t, B_N, B_L) = (B_N + 1)^(t−1) · (B_L + 1)

The size of a complete block tree, expressed as the number of nodes and leaves, is

s(t, B_N) = ((B_N + 1)^t − 1) / B_N

In the description of our implementation in section 3.2, we will talk about the size of block tree keys and target data. The space optimality concept depends heavily on these parameters. A block tree is said to be space optimal if its size is less than or equal to the size required if the keys and data were simply stored in an array. More on space optimality in section 3.2.3.

2.3 Tries

A trie (or prefix tree) is an ordered tree data structure where some nodes and all leaves store some value that is retrieved when doing a lookup. Nodes and leaves do not contain any keys; instead their position represents some part of the query key. The word trie comes from the word retrieval [2] and should be pronounced tree, but since trees are also discussed in this context we prefer to pronounce it try to avoid confusion.


2.3.1 Binary tries

A binary trie is probably the most basic way to represent a routing table.

Each node corresponds to one bit, namely the bit in the IP address that is being examined; the IP address is examined starting with the most significant bit and ending with the least significant. The root node has two sub-tries, the 0-sub-trie and the 1-sub-trie, often represented as left and right respectively.

When doing a lookup on a w-bit key, the children of the root node contain lookup information for the remaining w − 1 bits. Continue in the 0-sub-trie if the first bit is 0 or the 1-sub-trie if it is 1. Descend the trie by repeated examination of the most significant not yet examined bit of the query IP address until reaching an endpoint. We reach an endpoint either when a leaf is reached or when the current node does not have the sub-trie corresponding to the currently examined bit. If the endpoint contains a prefix we have found the longest matching prefix and the target data stored in the endpoint is returned. However, if the endpoint does not contain a prefix, the last visited node containing a prefix holds the longest match. The latest prefix and/or target data is recorded during the lookup in order to avoid backtracking.
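As an illustration of this procedure, the lookup without backtracking can be written as the sketch below. The node layout is our own assumption for the example, not the thesis code.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical binary trie node: 0-sub-trie, 1-sub-trie and an optional
 * next-hop index (has_data == 1 when the node stores a prefix). */
struct bt_trie_node {
    struct bt_trie_node *child[2];
    int                  has_data;
    uint32_t             data;       /* index into the next-hop table    */
};

/* Longest prefix match of a 32-bit query key.  The last prefix seen on
 * the way down is remembered, so no backtracking is needed. */
static int binary_trie_lookup(const struct bt_trie_node *root,
                              uint32_t key, uint32_t *data_out)
{
    const struct bt_trie_node *node = root;
    int found = 0;

    while (node != NULL) {
        if (node->has_data) {        /* record latest matching prefix    */
            *data_out = node->data;
            found = 1;
        }
        node = node->child[(key >> 31) & 1];   /* most significant bit   */
        key <<= 1;                             /* expose the next bit    */
    }
    return found;                    /* 0: only a default route applies  */
}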

Figure 2.2: Binary trie.

A binary trie node splits the current universe, as discussed in section 2.1, into two sub-universes, or buckets as we would like to call them.


2.3.2 Multi-bit tries

In a binary trie, the depth (= lookup cost) is proportional to the key size w in the worst case. A standard approach to reduce the depth is to inspect more than one bit in each node. The resulting data structure is sometimes referred to as a multi-bit trie. A multi-bit trie node has a maximum of 2^k sub-tries, where k is called the stride and represents the number of bits that are examined at the current level. For a binary trie k = 1. One variant of multi-bit tries is the k-stride trie, which has the same stride, k, at all levels. Then there is the fixed stride trie, which has varying strides [k_0, k_1, ..., k_{t−1}], one for each of the t trie levels. All nodes at the same level in a fixed stride trie do, however, have the same stride, which is not the case for a variable stride trie, where all nodes can have different strides. The fixed stride trie is the kind that will be used in the hybrid data structure, and the stride sequence, which is also called the hybrid configuration, will be further discussed in section 2.4.5.

Figure 2.3: 3-stride trie node with 2^3 elements.

2.4 The hybrid data structure

Assuming that we have arbitrary block trees and trie nodes, we will now describe how they are cross-bred to form the hybrid data structure. Before starting we have to clarify how lookups are budgeted for the data structure and how the hybrid is built according to a given lookup budget.

2.4.1 Lookup budget and memory blocks

Instead of budgeting the number of comparisons, as is done in many undergraduate algorithm courses, the idea is to limit the number of time-consuming memory accesses. We want to minimize the number of times that the CPU has to read from the main memory (RAM) into the cache memory. When the CPU does this caching, it reads a whole cache line at a time, which might typically be 256 bits in size. This caching stalls the running process until the whole cache line has been read. Once the caching is done, accesses to data in the cache line (which of course contains the data that was read from the main memory) are considered free operations.

To avoid this time-consuming operation the hybrid data structure is designed according to a lookup budget t, which constitutes how many times we are prepared to read from memory for one query lookup. One cache line as an entity will be referred to as a memory block, and its size is also an important parameter, which we will refer to as b from now on.

One block tree node/leaf occupies exactly one memory block, hence the name of the data structure. Looking up a query key in a block tree of height t consumes exactly t memory accesses, since we have to access one memory block for each level in the tree, whereas reading one trie node element only requires one memory access. A trie node can, however, occupy more than one memory block, but since the elements are accessed with direct indexing, only one memory block is accessed. When designing the trie node it is important to ensure that no trie node element crosses the boundary between two memory blocks.

2.4.2 Structure

As in the description of the two data structures in the previous sections, we have a set of common parameters that will be used to characterize the data structure. These parameters are:

t - lookup budget, i.e. the number of memory accesses we allow ourselves to spend, which is also the height of the hybrid data structure

w - key length (bits)

b - memory block size (bits)

d - target data size (bits)

p - trie element pointer size (bits)

There is also an array of strides, [k_0, k_1, ..., k_{t−1}], that represents the configuration of the hybrid data structure. The configuration will be discussed in subsection 2.4.5. Each trie node has a certain number of elements, e.g. the trie node at level t_a (0 ≤ a < t) has 2^(k_a) elements. Each trie node element contains either target data or a pointer to either another trie node or a block tree. In the case of a pointer to a trie node or a block tree, these two are said to be at the next level in the hybrid data structure. If a trie node at level t_a has an element that contains a pointer to another trie node, this node is located at level t_a + 1, and correspondingly for the case of a block tree pointer. When a trie element contains a pointer to a block tree, this block tree can be regarded as the endpoint for the data structure. If the block tree root node is located at level t_e, the block tree will have a maximum height of t − t_e − 1 levels.

Figure 2.4: Overview of the hybrid data structure

The data structure as a whole can be seen as a tree where trie nodes constitute either nodes or leaves (leaves when they contain target data) but block trees only constitute leaves.

2.4.3 Storage

Block trees are stored with the root node first, then the sub-trees are stored recursively, starting with the leftmost one. A trie node is stored in one or several memory blocks depending on the parameters k, p and b. A common factor for block trees and trie nodes is that both are stored in a continuous memory area. When they are cross-bred into the hybrid they are stored back-to-back in a large, pre-allocated memory area. Since the first part of the hybrid is always a trie node, this is also what occupies the first part of the allocated memory area. A trie node occupies

⌈2^k / ⌊b/p⌋⌉

memory blocks. In most cases the trie node will have pointers to trie nodes and block trees on the next level, and they will be stored directly after the last used memory block of the trie at the current level.
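A hedged sketch of this size computation, using integer arithmetic (the function name is ours, not from the thesis code):

#include <stdint.h>

/* Number of b-bit memory blocks occupied by a trie node with stride k,
 * when each trie element is p bits wide: ceil(2^k / floor(b / p)). */
static uint32_t trie_node_blocks(unsigned k, unsigned b, unsigned p)
{
    uint32_t elements_per_block = b / p;              /* floor(b/p)      */
    uint32_t elements           = (uint32_t)1 << k;   /* 2^k             */
    return (elements + elements_per_block - 1) / elements_per_block;
}

With k = 8, b = 256 and p = 32 this evaluates to ⌈256/8⌉ = 32 memory blocks.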

2.4.4 Lookups

A lookup on a w-bit query key is performed by inspecting the first k_0 bits and using this unsigned integer as an array index in the first trie node, see figure 2.4. The retrieved trie element is then examined in order to determine what kind of data it contains.

• If target data is found, it is immediately returned and the lookup is complete.

• If a pointer to a block tree is found, the lookup continues in the block tree lookup function, where the remaining w − k_0 bits are used as the query key. The data returned from the block tree lookup is immediately returned and the lookup is complete.

If a pointer to a trie node is found, the k_1 most significant bits of the remaining w − k_0 bits are used as the array index in this trie node, which is located at the next level of the data structure. Again we must determine what the trie element contains and take the necessary actions in order to continue the search until target data is returned and the lookup is complete. A sketch of this lookup loop is given below.
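The whole procedure can be summarized by the following sketch. The element representation and the helper functions are assumptions made for illustration; the generated code does not necessarily look like this.

#include <stdint.h>

/* Hypothetical view of one trie element: target data, a pointer to a
 * block tree, or a pointer to the trie node at the next level. */
enum elem_kind { ELEM_DATA, ELEM_BLOCK_TREE, ELEM_TRIE };

struct trie_elem {
    enum elem_kind kind;
    const void    *child;      /* next trie node or block tree           */
    uint32_t       data;       /* valid when kind == ELEM_DATA           */
    unsigned       height;     /* valid when kind == ELEM_BLOCK_TREE     */
};

/* Helpers assumed to exist; names and signatures are illustrative only. */
extern struct trie_elem trie_read(const void *trie_node, uint32_t index);
extern uint32_t bt_lookup(const void *tree, unsigned height,
                          uint32_t key, unsigned remaining_bits);
extern const unsigned stride[];    /* configuration [k0, ..., k_{t-1}]   */

/* Sketch of the lookup described above for a w = 32 bit query key; the
 * strides are assumed to sum to 32. */
static uint32_t hybrid_lookup(const void *first_trie_node, uint32_t key)
{
    const void *node  = first_trie_node;
    unsigned consumed = 0;

    for (unsigned level = 0; ; level++) {
        unsigned k     = stride[level];
        uint32_t index = (key << consumed) >> (32 - k);  /* next k bits  */
        struct trie_elem e = trie_read(node, index);

        consumed += k;
        if (e.kind == ELEM_DATA)              /* target data found: done */
            return e.data;
        if (e.kind == ELEM_BLOCK_TREE) {      /* finish in the block tree */
            uint32_t rest = consumed < 32 ? key << consumed : 0;
            return bt_lookup(e.child, e.height, rest, 32 - consumed);
        }
        node = e.child;                       /* descend to the next trie */
    }
}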

2.4.5 Strides - hybrid configuration

The stride array [k_0, k_1, ..., k_{t−1}] constitutes the configuration of the hybrid data structure. It is subject to change in order to create an optimal data structure in terms of both storage and lookup cost [1]. The storage cost for a trie node at level t_a is O(2^(k_a)), which means that it increases by a factor of two for the smallest increase of k_a. An increase of k_a might also mean an increase of B_N and/or B_L in a block tree at level t_a + 1, which in turn leads to a growth of the interval capacity of this block tree. It is, however, not necessary to have a block tree with an interval capacity larger than the number of integers in the current sub-universe.


At present, it is not known whether an optimal configuration can be computed directly, so the only way to reach an optimal configuration is to do an exhaustive search over all possible stride sequences, a so-called brute-force search. This is an enormous number of permutations, which grows as the key length increases. The goal is to minimize the maximum relative size, c(t), of the data structure:

c(t) = max{ b · s_h(n, t) / n : ∀n }

where s_h(n, t) is the size in memory blocks of the whole hybrid data structure and n is the number of intervals. Sundström has developed an application called Strider [1] that performs the search for the best stride sequence. Strider takes five arguments (t, w, b, d and p, which are explained in chapter 3) and returns an array of strides and the maximum relative size for this configuration. Several improvements, based on empirical knowledge of computing optimal stride sequences, have been implemented to decrease the running time of Strider.


Chapter 3

Method

This chapter will describe the main part of this thesis: the implementation. Descriptions of how problems were solved will be given, and the general line of thought will hopefully be clarified.

3.1 Types

This section will describe the most important types of the implementation.

We will explain how a list of keys (interval starting points) with corresponding target data is represented using standard C types and structs.

3.1.1 Hybrid keys and data

Since almost all modern CPUs have 32-bit registers, or can at least handle 32-bit integers, it was decided to use an unsigned 32-bit integer as the default type in the code, for example to represent keys with a size w of 32 bits or less. A delicate problem occurs when the key size w exceeds 32 bits. It was solved by using the following C struct.

struct hybrid_key {
    u_int32_t base[HYBRID_BASE_SIZE];
    u_int32_t *key;
    u_int32_t mask;
};

typedef struct hybrid_key hybrid_key_t;


where HYBRID_BASE_SIZE is defined as ⌈w/32⌉. The w-bit key is right-aligned in the base array. This means that elements [1..HYBRID_BASE_SIZE−1] of base are fully utilized while element 0 is not, unless w is a multiple of 32. The pointer key points to the currently examined quad of base, and mask is a bit mask used to extract the least significant bits of element 0 of base, which holds the most significant part of the whole key.
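As an illustration of how the struct might be initialized and read, the helpers below are a hypothetical sketch (with uint32_t standing in for u_int32_t and an example key size of 48 bits), not part of the thesis code.

#include <stdint.h>

#define W                 48                 /* example key size in bits */
#define HYBRID_BASE_SIZE  ((W + 31) / 32)    /* ceil(w / 32)             */

typedef struct {
    uint32_t  base[HYBRID_BASE_SIZE];        /* key, right-aligned       */
    uint32_t *key;                           /* currently examined quad  */
    uint32_t  mask;                          /* valid bits of base[0]    */
} sketch_hybrid_key_t;

/* Initialise the cursor and the mask for a W-bit key. */
static void sketch_key_init(sketch_hybrid_key_t *k)
{
    unsigned msb_bits = (W % 32) ? (W % 32) : 32;  /* bits used in base[0] */
    k->key  = k->base;                             /* start at the MSB quad */
    k->mask = (msb_bits == 32) ? UINT32_MAX
                               : (((uint32_t)1 << msb_bits) - 1);
}

/* Read the most significant quad of the key, masked to its valid bits. */
static uint32_t sketch_key_msq(const sketch_hybrid_key_t *k)
{
    return k->base[0] & k->mask;
}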

The target data is represented with the type hybrid_data_t, which is defined according to the parameter d using unsigned integer types of different sizes (a compile-time selection sketch is given after the list).

• 0 < d ≤ 8 typedef u_int8_t hybrid_data_t;

• 8 < d ≤ 16 typedef u_int16_t hybrid_data_t;

• 16 < d ≤ 32 typedef u_int32_t hybrid_data_t;
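In generated code, this selection could be made at compile time with a preprocessor conditional. The sketch below assumes a macro D holding the value of d; it is an illustration, not necessarily how the generator emits it.

#include <stdint.h>

#define D 16                        /* target data size d in bits        */

#if D <= 8
typedef uint8_t  hybrid_data_t;     /* 0 < d <= 8                        */
#elif D <= 16
typedef uint16_t hybrid_data_t;     /* 8 < d <= 16                       */
#else
typedef uint32_t hybrid_data_t;     /* 16 < d <= 32                      */
#endif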

3.1.2 Hybrid interval list

A hybrid interval list is a list of keys (interval starting points) and their corresponding target data. Let us start by looking at one key and its target data. They are stored in a node, which is represented using the following C struct:

struct hybrid_interval_node {
    struct hybrid_interval_node *next;
    union {
        u_int32_t klein;
        hybrid_key_t *large;
    } min;
    hybrid_data_t nextdata;
};

typedef struct hybrid_interval_node hybrid_interval_node_t;

The nodes with keys and data form a linked list where each node has a pointer to the next node, which holds the next key in the list. A problem is encountered when a small part (less than or equal to 32 bits) of a large key (more than 32 bits) is about to be stored in a block tree. The block tree build function expects keys of the type u_int32_t, but since the original key is more than 32 bits and not of the type u_int32_t we have to work around this. It was solved by using a union, as seen in the struct hybrid_interval_node above. The union is accessed through the member name min and there are different C macros for retrieving the key in a way that suits the current scenario. There is also a struct hybrid_interval that represents the interval list as a whole and abstracts the linked list.

struct hybrid_interval {
    hybrid_interval_node_t *first, *last, *current, *prev;
    hybrid_data_t prevdata;
    u_int32_t size;
    u_int32_t cursize;
};

typedef struct hybrid_interval hybrid_interval_t;

It has pointers to the first, last, current and previous hybrid interval nodes. The current pointer points to the currently investigated key's node and the prev pointer points to the previous key's node. The field prevdata stores the previous key's target data, which is useful to have easily accessible when building block trees. The integers size and cursize correspond to the number of keys in the list and the number of keys not yet processed, respectively. These struct members simplify the creation of small functions and macros for use when working with the key list.
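As an example of how the list might be filled during construction, the helper below is a hypothetical sketch built on the type definitions above; it is not taken from the thesis code.

#include <stdlib.h>
#include <stddef.h>

/* Append a small (<= 32 bit) interval starting point and its target
 * data to the list.  Uses the structs and typedefs defined above. */
static int hybrid_interval_append(hybrid_interval_t *list,
                                  u_int32_t key, hybrid_data_t data)
{
    hybrid_interval_node_t *node = malloc(sizeof *node);

    if (node == NULL)
        return -1;
    node->next      = NULL;
    node->min.klein = key;
    node->nextdata  = data;

    if (list->last != NULL)          /* link at the tail of the list     */
        list->last->next = node;
    else
        list->first = node;
    list->last = node;
    list->size++;
    list->cursize++;
    return 0;
}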

3.2 Generic block tree implementation

You may recall that the primary goal for this thesis is to construct a code generator for the static hybrid data structure. Trie nodes and block trees will be generated by the code generator, either from a given stride sequence as input or from a sequence calculated by the code generator itself. Since arbitrary block trees will be part of the output from the code generator, there will be some waste of memory, as we will not be able to fill every memory block with keys or target data up to 100%. A so-called space optimal implementation was discarded early on; see sub-section 3.2.3 for an explanation of space optimality. The approach was instead to store as many whole keys as can fit in one node and, correspondingly, as many keys and target data as can fit in one leaf. The number of keys that fit in one node is

B_N = ⌊b / w⌋,

and the number of keys that fit in one leaf is

B_L = ⌊(b − d) / (w + d)⌋.

Recalling the formula from section 2.2.1, where we compute how many intervals can be stored in one complete block tree, we can now express that amount directly from the parameters t, w, b and d:

n_max(t, w, b, d) = (⌊b / w⌋ + 1)^(t−1) · (⌊(b − d) / (w + d)⌋ + 1)

Using this implementation's header and source files as templates, we will be able to generate arbitrary block trees from the code generator.
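As a worked example with parameters of our own choosing: for b = 256, w = 32 and d = 16 we get B_N = ⌊256/32⌋ = 8 and B_L = ⌊(256 − 16)/(32 + 16)⌋ = 5, so a complete block tree of height t = 3 can store n_max = (8 + 1)^2 · (5 + 1) = 486 intervals.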

3.2.1 Build

The block tree build function takes three arguments: tree height (t), memory area pointer (mem) and hybrid interval list (interval). The allocated memory area that mem points to is assumed to be large enough for whatever size the produced block tree will be. A tree is often built using a recursive function for creating nodes, and this block tree implementation is no exception. The build function, bt_build, is a recursive function that builds the nodes. When the function reaches the last level it calls the function for building the leaves, bt_build_leaf.
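A schematic and deliberately simplified outline of such a recursive build is given below. The helper functions, constants and layout details are assumptions made for illustration; this is not the actual bt_build.

#include <stdint.h>

typedef struct hybrid_interval hybrid_interval_t;   /* see section 3.1.2 */

/* Helpers assumed to exist; signatures are illustrative only. */
extern void bt_build_leaf(uint32_t *block, hybrid_interval_t *interval);
extern void bt_write_node_key(uint32_t *block, unsigned i,
                              hybrid_interval_t *interval);

#define BT_BN        8              /* keys per node, e.g. floor(256/32) */
#define BLOCK_WORDS  8              /* one 256-bit block as 32-bit words */

/* Build a block tree of height t into mem; returns blocks consumed.
 * The root block comes first, then the sub-trees, leftmost first. */
static uint32_t bt_build_sketch(unsigned t, uint32_t *mem,
                                hybrid_interval_t *interval)
{
    uint32_t used = 1;                       /* this node's/leaf's block */

    if (t == 1) {                            /* last level: build a leaf */
        bt_build_leaf(mem, interval);
        return used;
    }
    for (unsigned i = 0; i <= BT_BN; i++) {  /* B_N + 1 sub-trees        */
        if (i > 0)                           /* B_N separator keys       */
            bt_write_node_key(mem, i - 1, interval);
        used += bt_build_sketch(t - 1, mem + used * BLOCK_WORDS, interval);
    }
    return used;
}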

3.2.2 Lookup

Searching a node for a matching interval returns an index in the range [0..B_N]. The index corresponds to which subtree the search continues in at the next level. When searching a leaf, an index in the range [0..B_L] is returned and is used to decide which data to extract and return from the same memory block. Since all keys are sorted in ascending order, the chosen search algorithm is a binary search, which runs in O(log B_N) time for nodes.
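A minimal sketch of the node search, assuming the B_N keys are stored in ascending order in the node's memory block (the function name is illustrative):

#include <stdint.h>

/* Return the index in [0..bn] of the sub-tree to continue in: the number
 * of keys in the node that are smaller than the query key, found with a
 * binary search over the bn sorted keys. */
static unsigned bt_node_search(const uint32_t *keys, unsigned bn,
                               uint32_t query)
{
    unsigned lo = 0, hi = bn;            /* answer lies in [lo, hi]      */

    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;
        if (keys[mid] < query)           /* key mid is below the query   */
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                           /* 0 means "leftmost sub-tree"  */
}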


3.2.3 Space optimality

Block tree space optimality is reached when b · s(n, t) ≤ n · (w + d). In a non-compressing implementation we can instead say that all memory blocks are fully utilized, which only occurs when

w | b (w divides b) and

(w + d) | (b − d).

We could for example ignore the real memory block size and customize the parameter to fit the key and data sizes. The lookup time will then increase, since some or maybe all node/leaf accesses require more than one memory access. There are, however, ways to code extra information into a memory block, for example by changing the order of the keys so that they are in non-sorted order. Compression is also considered as an alternative in [1]. A space optimal implementation was, however, not considered. It is not impossible to implement a space optimal block tree with a set of given parameters, but to do it with arbitrary parameters during runtime would have been difficult. Achieving space optimality with arbitrary parameters calls for a framework with extended rules for storing or compressing parts of keys/data in nodes and leaves. Creating such a framework and implementing it in the code generator is a project in itself and would have been very time consuming, which is why it was not implemented.

3.3 Trie node implementation

The implementation of the trie node can, however, not be called generic, since the trie element pointer size p was set to a constant of 32 bits. This decision was made in consensus with my supervisor, and the line of thought was to exploit the fact that the cache line size in most modern CPUs is a multiple of 32 bits. By choosing the element size as 32 bits we can ensure that reading one trie element will never cause the CPU to read from two different cache lines. One trie element will also require only one CPU register, since most CPUs work with 32-bit or bigger registers. Operations will be trivial and fast, since machine operations on 32-bit registers are accomplished in one or very few CPU cycles.

The trie node is represented as an array of unsigned 32-bit integers that we call trie elements. An element consists of two parts: an 8-bit code and a 24-bit pointer. The pointer refers either to a piece of data, a block tree or a trie node at the next level. To determine how to interpret and use the pointer, the code field is inspected: 0x00 means it points at another trie node, 0xff means target data and the 254 remaining combinations (0x01 to 0xfe) mean it points to a block tree of the corresponding height. For example, 0x07abcdef points to a block tree of height 7 located at an offset of 0xabcdef memory blocks from the beginning of the trie node. The 24 bits used for the pointer mean that we can address 2^24 (= 16 777 216) memory blocks (not bytes), which covers all of our current needs.
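A hedged sketch of how such elements could be packed and unpacked; the macro names are ours and not necessarily those used in the generated code.

#include <stdint.h>

/* Trie element layout: 8-bit code in the high byte, 24-bit block offset
 * in the low three bytes. */
#define TRIE_CODE_TRIE   0x00u             /* points to another trie node */
#define TRIE_CODE_DATA   0xffu             /* element holds target data   */

#define TRIE_ELEM(code, offset)  (((uint32_t)(code) << 24) | \
                                  ((uint32_t)(offset) & 0x00ffffffu))
#define TRIE_ELEM_CODE(e)        ((uint32_t)(e) >> 24)
#define TRIE_ELEM_PTR(e)         ((uint32_t)(e) & 0x00ffffffu)

/* Example from the text: a block tree of height 7 stored 0xabcdef memory
 * blocks after the beginning of the trie node:
 *     uint32_t e = TRIE_ELEM(0x07, 0xabcdef);    ==  0x07abcdef          */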

3.4 The code generator

The idea for the code generator is to first use Sundström's tool for computing optimal stride sequences, Strider. The stride sequence is then fed to the code generator, which generates code for a hybrid package with the best configuration.

The approach so far regarding the implementation of the hybrid has been to write generic code for header and source files that can be used as templates when generating code. The only things that had to be changed in those files are the parameters t, w, b and d plus the names of the critical functions that are exported outside the module. When writing the block tree code we used some numeric values for the parameters w, b and d just to be able to test both the build and lookup functions; t is actually given as an argument at runtime. But when preparing the source files as templates for the code generator, the parameters had to be exchanged for variable expressions. It was decided to use the dollar sign ($) to mark the variables, and the variable name itself was set to a single character. Since the parameters in the code are accessed as C preprocessor macros declared with the #define keyword, we simply wrote

#define BT_W $w

where BT_W is the parameter w for the current block tree module. The other parameters, b and d, were defined correspondingly. Function names for build and lookup had to be altered in the same way; they were set to bt_build_$w_$b_$d and bt_lookup_$w_$b_$d. When the code generator processes the template file it looks for "$"; if the character directly after "$" matches any of the critical parameters, a replacement is made. The dollar sign and the following character are replaced by the actual numerical value given to the code generator as an argument. As an example, the lookup function for a block tree might be bt_lookup_32_256_16.
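The substitution itself can be done with a single scan over the template. The following sketch is our own illustration of the idea, not the actual generator code; it replaces $w, $b and $d with their numeric values while copying a template stream.

#include <stdio.h>

/* Copy a template stream to an output stream, replacing the variables
 * $w, $b and $d with the given parameter values.  Unknown "$x" pairs
 * are copied through unchanged. */
static void gen_substitute(FILE *in, FILE *out, int w, int b, int d)
{
    int c;

    while ((c = fgetc(in)) != EOF) {
        if (c != '$') {
            fputc(c, out);
            continue;
        }
        switch ((c = fgetc(in))) {           /* character after '$'      */
        case 'w': fprintf(out, "%d", w); break;
        case 'b': fprintf(out, "%d", b); break;
        case 'd': fprintf(out, "%d", d); break;
        case EOF: fputc('$', out); return;
        default:  fputc('$', out); fputc(c, out); break;
        }
    }
}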

There were a few other obstacles that had to be cleared before the hybrid data structure could be built. We had to declare the stride sequence [k_0, k_1, ..., k_{t−1}], and all block tree header files had to be included somewhere in order for the hybrid module to know what to call. The calls to block tree functions were also a problem, since it is not possible to call an arbitrary function name at runtime. The solution is to create a header file, which we named "hybrid_wrapper.h", that is included in the hybrid module. First of all, the t − 1 block tree header files are included with #include "block_tree_module.h". Then the stride sequence is declared as an array of unsigned integers. Finally, the wrapper file has two inline functions, one for calling block tree build functions and one for block tree lookup functions. These functions are mainly switch statements that determine, from the stride sequence and the current level of the hybrid (given as an argument), which block tree module to call. The code generator writes these cases with a call to the proper block tree module. As an example, let us look at a call to the block tree build function at level 1 for a hybrid with the stride sequence [8, 8, 8, 8] (w = 32), b = 256 and d = 16. At this level the key length has decreased from 32 to 24 bits, but b and d are unchanged.

case 1:
    s = bt_build_24_256_16(h, mem, interval);
    break;

The parameter h is the maximum height for the block tree, mem is a pointer to a memory area where the block tree is to be built and interval is a pointer to a hybrid interval list. The build function returns the number of memory blocks consumed to s, which is in turn returned to the hybrid build function. A sketch of a complete build wrapper is given below.
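Put together, a build wrapper for the [8, 8, 8, 8] example might look like the sketch below. The generated function names follow the bt_build_$w_$b_$d pattern from this section; the wrapper name and exact signatures are illustrative assumptions, not the generated file itself.

#include <stdint.h>

typedef struct hybrid_interval hybrid_interval_t;

/* Generated block tree build functions for key lengths 24, 16 and 8 bits
 * (b = 256, d = 16), one per hybrid level below the first trie node. */
extern uint32_t bt_build_24_256_16(unsigned h, uint32_t *mem,
                                   hybrid_interval_t *interval);
extern uint32_t bt_build_16_256_16(unsigned h, uint32_t *mem,
                                   hybrid_interval_t *interval);
extern uint32_t bt_build_8_256_16(unsigned h, uint32_t *mem,
                                  hybrid_interval_t *interval);

/* Dispatch to the block tree module matching the current hybrid level. */
static inline uint32_t hybrid_bt_build(unsigned level, unsigned h,
                                       uint32_t *mem,
                                       hybrid_interval_t *interval)
{
    uint32_t s = 0;

    switch (level) {
    case 1:
        s = bt_build_24_256_16(h, mem, interval);
        break;
    case 2:
        s = bt_build_16_256_16(h, mem, interval);
        break;
    case 3:
        s = bt_build_8_256_16(h, mem, interval);
        break;
    }
    return s;
}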

When generating the source file package, the code generator not only reports which files it is generating but also the memory block utilization for block tree nodes and leaves. The total utilization for a full block tree of the maximum available height is also reported.

3.5 Building the hybrid

Forming, or rather building, the hybrid data structure is accomplished by the function hybrid_build in the hybrid module. This function starts the building process by calling a function named hybrid_build_trie, which splits the current sub-universe into smaller sub-universes that we call buckets. These buckets are then examined in order to determine what kind of data the trie element representing the current bucket will point to. There are three different cases, which depend directly on the bucket's key density:

• The bucket contains no interval starting points: target data will be stored in the trie element.

• Sparse number of interval starting points (all keys fit in a block tree of the remaining height or lower): a block tree will be constructed and a pointer to it will be stored in the trie element.

• Dense number of interval starting points (a block tree would be insufficient to store the keys): a trie node will be built and a pointer to it will be stored in the trie element.

When the build trie function decides to build a trie at the next level it will not make a recursive call to itself but instead enqueue a trie build task. These tasks are processed by hybrid_build, but not until hybrid_build_trie returns. When processing a trie build task, hybrid_build_trie is called again. By using this queueing technique we can avoid a heavy system load due to an excessive amount of recursive calls; in other words, the hybrid is built breadth first instead of depth first. However, when the decision is to build a block tree it is done immediately and a pointer to the block tree is stored in the trie element. hybrid_build returns the number of memory blocks that have been consumed when all trie build tasks have been processed.
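The task queue can be as simple as a singly linked FIFO of pending trie builds. The sketch below is a hypothetical illustration of such a queue, not the thesis code.

#include <stdint.h>
#include <stddef.h>

typedef struct hybrid_interval hybrid_interval_t;

/* One queued "build a trie node at the next level" task (illustrative). */
struct trie_task {
    struct trie_task  *next;
    unsigned           level;        /* hybrid level of the trie to build */
    uint32_t          *elem;         /* trie element to patch afterwards  */
    hybrid_interval_t *interval;     /* keys of this bucket               */
};

struct task_queue {
    struct trie_task *head, *tail;
};

/* Enqueue at the tail, so tasks are processed in breadth-first order. */
static void task_enqueue(struct task_queue *q, struct trie_task *t)
{
    t->next = NULL;
    if (q->tail != NULL)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* Dequeue from the head; returns NULL when no tasks remain. */
static struct trie_task *task_dequeue(struct task_queue *q)
{
    struct trie_task *t = q->head;

    if (t != NULL) {
        q->head = t->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    return t;
}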

Let us take a closer look at the storage of the two data structures in an example. We have a trie node with 4 elements that partitions the address space. The trie node is stored in the first part of the allocated memory. When processing the 4 sub-universes we find that a block tree is sufficient for storing the keys of the first sub-universe; this block tree is immediately built, starting in the memory block directly after the trie node, and a pointer is stored in the first trie element. The block tree pointer is not a direct memory reference but the offset from the trie node's first memory block to the block tree's first memory block. In this example the pointer would probably be equal to 1 (one), since the trie node only has four elements, which fit in one memory block, so the block tree is stored starting one memory block from the beginning of the trie node. Remember that the trie element consists of two parts, the code field and the pointer field; in the case of a block tree the code field indicates the height of the block tree. The second sub-universe is so dense with keys that a block tree cannot store all of them, so we have to build a trie node at the next level. A trie build task is enqueued and we proceed with the third sub-universe. This bucket contains no keys, so the target data of the last (biggest) key in the previous sub-universe is stored in the trie node element. The fourth bucket is again sparse on keys and a block tree is sufficient to store them. A block tree is built starting directly after the block tree of the first bucket, and the memory block offset from the trie node is stored as the pointer in the fourth trie node element, together with the height. Processing of the first universe partition is now done and we continue to process trie build tasks from the queue. The second bucket of the first trie node had to be represented with a new trie node, and this new trie node at the next level is stored directly after the last block tree of the previous trie node. See figure 3.1 below. The trie node in the upper part of the figure is the first part of the memory area representation in the lower part; then the two block trees follow and finally the trie node representing the next level of the second sub-universe. The lines between the trie node and the memory area are just intended to clarify what is what.

Figure 3.1: Memory storage order of the hybrid data structure.


Chapter 4

Evaluation and discussion

Alongside the source code for type definitions, build and lookup for the hybrid data structure, trie nodes and block trees, a test application is also provided. This application takes the number of keys, n, as an argument, generates a (pseudo) random list of keys of the desired key length, sorts them and creates a hybrid interval list. Then the hybrid build is called, which builds the hybrid data structure. After the build is complete, lookups are performed to test the validity of the data structure. Each key accounts for three lookups: one query on the key itself, one on the last key in the current interval and one in the middle of the interval. Benchmarking of lookups is a compile option in the form of a preprocessor macro. The number of CPU cycles a lookup takes to complete is recorded and saved in an array that is printed after all lookups are complete. The average is also calculated and printed.

4.1 Performance

Some performance tests have been run to get a feel for the throughput of the hybrid lookup function. A 1.5 GHz Pentium 4 with 512 MB of RAM was used for testing. The tests were run in a command prompt in Windows XP Professional with a few different key amounts. It is currently not known how performance was affected by other running processes or the operating system itself.

With a key size of 128 bits (the IPv6 address length [4]), memory block size 256 bits, data size 16 bits, lookup budget/height 16 and a

{8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}

stride sequence we got the following results. The third column, lookups/second, is computed as CPU clock speed / CPU cycles (average).

Key amount    CPU cycles (average)    Lookups/second
10000         2700                    555000
50000         4200                    357000
100000        4270                    351000

Key size 32 bits, memory block size 256 bits, data size 16 bits, lookup budget/height 4 and a {8,8,8,8} stride sequence gave the following results.

Key amount    CPU cycles (average)    Lookups/second
10000         1000                    1500000
50000         1840                    815000
100000        1850                    810000

With his own implementation of a hybrid data structure, where t = 4, w = 32, b = 256 and d = 16, Sundström has obtained a throughput of almost 30 M lookups/sec (Pentium 4, 3.0 GHz), whereas this implementation reaches 1.5 M lookups/sec at best (roughly 3.0 M lookups/sec scaled to a Pentium 4 at 3.0 GHz). The difference indicates a far from optimal implementation of the data structure.

After some self-criticism we have arrived at a few things that can be done to improve lookup speed.

• Inline the main hybrid lookup function.

• Inline the block tree lookup function.

• Write direct calls (by using the code generator) to block tree lookup functions in the hybrid lookup function instead of using the hybrid wrapper and its slow switch-case block.

Function calls and switch-case blocks both create branch delays, which must be avoided in order to improve performance.

4.1.1 Other improvements

The linked list implementation used to represent the list of interval starting points and target data was somewhat unnecessary, since the data structure is static. There is no need to insert any key in the list after it has been created, and hence a static array would be sufficient. Handling keys and data would be easier, but it would not improve the lookup speed, since the list is only used during construction of the hybrid data structure.

Comparing large keys (w > 32) is currently done by reading a key from the memory block into a key struct and then comparing it to a query key which is also stored in a key struct. Instead, the comparison could be done one quad (32 bits) at a time, starting with the most significant bits. If the most significant quads are equal, continue with the less significant ones; as soon as they differ, the comparison is decided. We believe this would be a faster and less cumbersome comparison method for large keys.
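The proposed quad-wise comparison could look like the following sketch, assuming both keys are stored with the most significant quad first, as in the base array of hybrid_key_t.

#include <stdint.h>

/* Compare two w-bit keys stored as arrays of 32-bit quads, most
 * significant quad first.  Returns <0, 0 or >0 like memcmp, but stops
 * at the first quad that differs. */
static int key_compare(const uint32_t *a, const uint32_t *b,
                       unsigned nquads)
{
    for (unsigned i = 0; i < nquads; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;  /* decided by this quad        */
    }
    return 0;                             /* all quads equal             */
}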

4.2 Other optimizations

It might be interesting, in future co-operation between the code generator and Strider, to take the storage efficiency report from the code generator into account when calculating the stride sequence. There are of course ways to code and compress block trees with difficult parameters to better utilize the memory blocks, but to do that on the fly with the code generator is difficult, as discussed before. If the utilization is regarded as an important parameter, a more memory consuming data structure with better memory utilization could be built.


References

[1] M. Sundström, Time and Space Efficient Algorithms for Packet Classification and Forwarding, Ph.D. thesis, Luleå University of Technology, 2006. (Tentative title, in preparation.)

[2] Trie, Wikipedia, Wikimedia Foundation, Inc., 2006. URL: http://en.wikipedia.org/wiki/Trie

[3] Internet Protocol, DARPA Internet Program, Protocol Specification, Information Sciences Institute, University of Southern California, 1981. URL: http://www.ietf.org/rfc/rfc0791.txt

[4] S. Deering and R. Hinden, Internet Protocol, Version 6 (IPv6) Specification, Network Working Group, 1998. URL: http://www.ietf.org/rfc/rfc2460.txt

[5] OSI model, Wikipedia, Wikimedia Foundation, Inc., 2006. URL: http://en.wikipedia.org/wiki/OSI_model
