Exploring NAT Host Counting Using Network Traffic Flows

Sebastian Salomonsson

Faculty of Health, Science and Technology

Degree Project for Master of Science in Engineering

Supervisor: Johan Garcia

Examiner: Kerstin Andersson

30 Hp

2017-06-09

Abstract

Network Address Translation (NAT) is used to represent a private internal network, which may consist of several hosts, with a single outward IP-address. This is an important method for distributing Internet access, as the IP-addresses provided by IPv4 are becoming scarce. NAT is therefore often used by home routers in order to provide Internet access for local devices. The private network behind a NAT becomes hidden from the public domain, and only one IP-address may be visible from outside the private network. It is in the interest of Internet Service Providers (ISPs) and network operators to know how their services are used, and with NAT it can be difficult to know how many hosts are using their Internet service. This study focuses on analyzing data flows from real network traffic in order to identify indications of NAT behavior and then count the number of hosts in the NAT network. There exist several studies of how to determine the number of hosts behind a NAT. However, some of the methods rely on signatures in the header fields of communication protocols, which can be unreliable and cannot be used when the traffic is encrypted. This study focuses primarily on the detection of large numbers of hosts running a Windows Operating System (OS). An empirical study is made to identify the distribution and characteristics of Windows Update flows, and the results are used in the NAT host counting methods. NAT characteristics were found in the analyzed datasets, as well as some indication of the number of hosts. However, more research is needed to estimate the number of hosts with higher accuracy.

Keywords: NAT, NAT host counting, Data analysis, NAT detection using network traffic flows.

Acknowledgements

I would like to thank Procera Networks for giving me the opportunity to work on this highly interesting project. Special thanks to Anders Waldenborg, who was my supervisor at Procera and who provided me with some of the datasets. I would also like to thank Johan Garcia, my supervisor at Karlstad University, who helped me with the dissertation and gave me advice during the course of the project.

List of Figures

1 IPv4 Header.

2 TCP Header.

3 Static NAT.

4 Dynamic NAT.

5 NAPT.

6 Distribution of Windows 7 Update flows with the start time against the amount of bytes they transfer.

7 Frequency of the start time difference between the Windows 7 Update flows.

8 Windows 7 Update flow sizes versus start time difference.

9 Distribution of Windows 8.1 Update flows with the time plotted against the amount of bytes they download.

10 Frequency of start time difference between the Windows 8.1 Update flows.

11 Zoomed in on the frequency for the 5 hour start time difference for Windows 8.1 Update flows.

12 Windows 8.1 Update flow sizes versus their start time difference.

13 Distribution of Windows 10 Update flows with the start time against the amount of bytes they transfer.

14 Frequency of start time difference between the Windows 10 Update flows.

15 Windows 10 Update flow sizes versus start time difference.

16 Small Cellular dataset information.

17 CDF for the logarithmic amount of flows per individual IP-address in the Small Cellular dataset.

18 The 15 IP-addresses which received the highest amount of flows in the Small Cellular dataset. The left plot show the total amount of flows per IP-address and the right plot shows the same IP-addresses but with their amount of NAT detection flows.

19 CDF for the logarithmic amount of NAT detection flows per individual IP-address in the Small Cellular dataset. The left plot is zoomed in on the fraction of IP-addresses which received NAT detection flows. The right plot shows the fraction of IP-addresses which only received NAT detection flows.

20 Windows Update flow start time versus the number of flows for different IP-addresses in the Small Cellular dataset.

21 Large Cellular dataset information.

22 CDF for the logarithmic amount of flows per individual IP-address in the Large Cellular dataset.

23 The 15 IP-addresses which received the highest amount of flows in the Large Cellular dataset. The left plot show the total amount of flows per IP-address and the right plot shows the same IP-addresses but with their amount of NAT detection flows.

24 CDF for the logarithmic amount of NAT detection flows per individual IP-address in the Large Cellular dataset. The left plot is zoomed in on the fraction of IP-addresses which received NAT detection flows. The right plot shows the fraction of IP-addresses which only received NAT detection flows.

25 A third of all Windows Update flows present in the Large Cellular dataset with each flows size versus the flow start time.

26 Zoomed in on all Windows Update flows with their flow size and start time.

27 Windows Update flow arrival time versus the number of flows for different IP-addresses in the Large Cellular dataset.

28 NAT detection flow distribution for IP-address 167784134, with the size and amount of flows versus time.

29 DSL and Cellular dataset information.

30 CDF for the logarithmic amount of flows per individual IP-address in the DSL and Cellular dataset.

31 The 15 IP-addresses which received the highest amount of flows in the mixed DSL and Cellular dataset. The left plot shows the total amount of flows per IP-address and the right plot shows the same IP-address but with their amount of NAT detection flows.

32 CDF for the logarithmic amount of NAT detection flows per individual IP-address in the DSL and Cellular dataset. The left plot is zoomed in on the fraction of IP-addresses which received NAT detection flows. The right plot shows the fraction of IP-addresses which only received NAT detection flows.

33 A tenth of all Windows Update flows present in the DSL and Cellular dataset with each flows size versus the flow start time.

34 Zoomed in on the time difference frequency for the DSL and Cellular dataset.

35 Windows Update flow sizes versus start time difference for the DSL and Cellular dataset. Shows every 10th flow in the dataset.

36 The size and amount of flows versus the flow arrival time for the 1 - 5 IP-addresses which contained the highest amount of Windows Update flows.

37 The size and amount of flows versus the flow arrival time for the 6 - 10 IP-addresses which contained the highest amount of Windows Update flows.

38 The size and amount of flows versus the flow arrival time for the 11 - 15 IP-addresses which contained the highest amount of Windows Update flows.

List of Tables

1 The 8 features for network traffic analyzing.

2 Top NetFlow features based on unique IP-addresses.

3 Most important features classified by C4.5.

4 Features extracted from HTTP logs.

5 Antivirus Flow Labels.

6 Software-update Flow Labels.

7 Size statistics for downloaded Windows 8.1 Updates.

8 Size statistics for downloaded Windows 8.1 Updates.

9 The largest Windows Update flows found in the dataset with their date, compared to the updates found in Microsoft’s Update Catalog.

10 Size statistics for downloaded Windows 10 Updates.

11 Average number of flows in a flow burst.

12 The 10 IP-addresses with the highest number of NAT detection flows in the Small Cellular dataset.

13 IP-addresses which received flows from two antiviruses.

14 The IP-addresses which contains Windows Update and two different antivirus flows.

15 Large Cellular flow statistics.

16 The 15 IP-addresses which contained the highest number of flows and with their number of NAT detection flows in the Large Cellular dataset.

17 The top 10 IP-addresses which contained the greatest number of NAT detection flows in the Large Cellular dataset.

18 Downloaded Windows Update flow sizes for the Large Cellular dataset.

19 DSL and Cellular flow statistics.

20 The top 15 IP-addresses with their number of NAT detection flows and total number of flows in the DSL and Cellular dataset.

21 The top 10 IP-addresses which contained the greatest amount of NAT detection flows in the DSL and Cellular dataset.

22 Downloaded Windows Update flow size statistics for the DSL and Cellular dataset.

1 Introduction

1.1 Motivation

The number of Internet-connected devices and users has increased dramatically during the last couple of years. Never before have Internet usage and traffic been so large. It is estimated that around 3.6 billion users and 20.35 billion devices have access to the Internet today, and the numbers are still increasing [32], [33].

Internet Service Providers (ISPs) provide Internet access to their customers, but with many users and devices connected it is in their interest to know how the connectivity is used and possibly redistributed. In order for all of these devices to get a connection to the Internet, each of them has to be provided with its own unique IP-address. Unfortunately, the number of IP-addresses available from the existing Internet Protocol IPv4 is becoming scarce. There have been issues with the transition to the more modern IPv6 protocol, which is why IPv4 is still used to carry the majority of network traffic, with the help of Network Address Translation (NAT).

NAT [29] can be used to represent a local network consisting of several devices with a single IP-address or a pool of IP-addresses. This makes it possible for an ISP to distribute Internet access to its customers and all of their devices even though the number of IP-addresses provided by IPv4 is limited. NAT can also provide security and anonymity by hiding the inner network topology, filtering content, and so on. But even if NAT solves the problem of depleted IP-addresses, it introduces other issues. With NAT, ISPs may not know how their services are used or how large the inner network might be. Unauthorized NAT devices might be installed to illegally sell and redistribute an Internet connection. It is therefore in an ISP’s interest to know whether a single IP-address represents a large network, in order to determine whether a customer is sharing their connection inappropriately with other hosts. Since NAT can represent a network of several individual devices with only one IP-address in the public address space, the customer is able to act as an ISP of their own, possibly without permission. By determining how many hosts are in a network, the ISP can detect suspected inappropriate sharing and gain more knowledge of how its service is used.

Procera Networks [19] is a company that offers several solutions to network operators. It structures, monitors, analyzes and uses data on behalf of the operators in order to increase the quality of their service. It classifies Internet traffic with the help of its Deep Packet Inspection (DPI) engine. DPI is a form of network packet examination that has long been used in network management for monitoring and classifying data directly from network traffic flows [23]. With the help of Procera’s DPI engine, real network traffic flows were collected for this study and analyzed in order to evaluate different methods for NAT host identification.

1.2 Goal

Given the problem formulation, this study will focus on detecting NAT and the number of hosts for Windows computers by analyzing Internet traffic flows. The host identification process will be based on the behavior and analysis of the Internet traffic from anonymous hosts.

By analyzing specific flows for each IP-address, inferences will be made about whether an IP-address might belong to a NAT device and about the number of hosts in the network behind it. There exist several methods that have been able to determine the number of hosts behind a NAT. One of the first NAT host detection methods was presented by S. M. Bellovin [3] in 2002. Several other studies have since been conducted, showing promising results for determining the number of hosts behind a NAT [22], [12].

The analysis will focus on the identification of larger private networks, in the range of 20 to 100 hosts, where the number of flows each NAT device receives should be much larger than for a single host. The specific flows used for the identification will be selected based on expert knowledge provided by Procera Networks.

This thesis analyzes datasets from two different sources. The first data source is used to analyze Windows Update flows from three virtual machines (VMs) running different Windows Operating Systems (OSes). This was done in order to provide ground truth data for the detection of different hosts behind a NAT.

The other datasets are provided by Procera Networks and contain real network traffic. Different methods are derived in this thesis in order to analyze the traffic flows.

1.3 Disposition

Chapter 2 provides background information regarding IP, TCP, NAT and the different tools used in this study. The chapter explains what a NAT is, how it works and which types of NAT exist. It also provides background information regarding the TCP and IP header fields, which is important for this study. The previous studies that have been reviewed are presented in Chapter 3, which explains several approaches that can be used to determine the number of hosts behind a NAT.

The different datasets and the specific flows used in the detection methods will be explained in Chapter 4.

Chapter 5 presents the research and results from an empirical study made on Windows Update flows from three different Windows OSes. The different methods that are used to evaluate and analyze the Internet traffic flows for this study on the datasets provided by Procera Networks are summarized in Chapter 6. Chapter 7 presents the results and evaluations for the datasets that were used for NAT detection and host counting.

The summary and conclusions are presented in Chapter 8, as well as the future work aiming to increase the accuracy of the host counting methods.

2 Background

This chapter will give an introduction to IPv4, TCP and NAT which will be referred to later in the study. Information regarding what they are used for and how they work will be explained. The various tools that are used in the study will also be mentioned here.

2.1 TCP and IP header Information

This section describes the TCP and IP headers and their fields. The aim is to give a better understanding of the fields that are affected by a NAT and to introduce them as a reference for later explanations. The header fields can be examined to detect NAT and to determine the number of hosts behind a NAT device. It is therefore important to know what the header fields are used for.

IPv4

Internet Protocol version 4 (IPv4) [25] is an Internet protocol that is used to transmit packets, also referred to as datagrams, between hosts in a network. It is a connectionless protocol with a limited scope of functions. It does not guarantee delivery of datagrams; reliability is left to an upper-layer transport protocol such as TCP. Today it is one of the core protocols that route traffic on the Internet. IP is used for two major tasks, addressing and fragmentation. IP delivers datagrams from a source to a destination host solely by using addresses in its header. It also fragments and reassembles datagrams with the help of the fields in the header. The header and its fields can be seen in Figure 1. Each datagram consists of an IP header and the data that is transported. Each header consists of 13 mandatory fields and one optional field.

Figure 1: IPv4 Header.

The Version field specifies the IP version in use; in an IPv4 header its value is always 4.

The Internet Header Length (IHL) contains the length of the IP header in 32-bit words, ranging from 5 to 15.

The Differentiated Services Code Point (DSCP) field is used to classify and manage network traffic in order to provide Quality of Service (QoS).

Explicit Congestion Notification (ECN) is used in order to provide end to end notification of network congestion.

Total Length contains the value of the total length of the IP datagram in bytes, which includes the header and the data.

The Identification field, which will be referred to as IPid in this report, is used to identify the group of fragments that belong to a single datagram. This field is used to reassemble the correct fragments at the destination.

There are three flags in the IP header, each one bit in size. The X flag is reserved and must be zero. Don’t Fragment (DF) indicates whether the datagram is allowed to be fragmented. If the More Fragments (MF) flag is zero, the fragment is the last one; if MF is set to one, more fragments will follow.

Fragment Offset is a 13-bit field which indicates the position of the fragment within the original datagram.

Time to Live (TTL) is an eight-bit field which determines the maximum time a datagram may remain on the Internet. It contains a value that will decrement by one for each router that processes the datagram. When the value reaches zero the datagram will be dropped. The default TTL values may be different for different OSes, which can be used for NAT detection. Methods for this kind of NAT detection will be further explained in section 3.

The Protocol field defines which transport protocol is carried in the data section of the datagram; for example, the value 6 is used for TCP and 17 for UDP.

The Header Checksum is a 16-bit field that is used for error detection of the header. When a datagram arrives at a router the checksum will be recalculated. If the recalculated value does not match the value in the checksum field the datagram will be dropped.

Source IP-address is a 32-bit sized field that contains the IPv4 address of the sender which is changed when the datagram passes through a NAT.

Destination IP-address is a 32-bit sized field that contains the IPv4 address of the destination. This field may also be altered when the datagrams pass through a NAT.

The IP Options consist of one or more 32-bit optional fields that do not have to be present if no IP options are used.
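The field layout described above can be illustrated with a minimal Python sketch that unpacks the fixed 20-byte part of an IPv4 header. The function name `parse_ipv4_header` and the dictionary keys are chosen here for illustration only and are not taken from the tools used in this study; IP options are ignored.

```python
import socket
import struct


def parse_ipv4_header(raw: bytes) -> dict:
    """Unpack the fixed 20-byte part of an IPv4 header (options are ignored)."""
    if len(raw) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    (version_ihl, dscp_ecn, total_length, ipid, flags_fragment,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": version_ihl >> 4,        # upper four bits
        "ihl": version_ihl & 0x0F,          # header length in 32-bit words
        "dscp": dscp_ecn >> 2,
        "ecn": dscp_ecn & 0x03,
        "total_length": total_length,       # header plus data, in bytes
        "ipid": ipid,                       # Identification field
        "dont_fragment": bool(flags_fragment & 0x4000),
        "more_fragments": bool(flags_fragment & 0x2000),
        "fragment_offset": flags_fragment & 0x1FFF,  # 13-bit field
        "ttl": ttl,
        "protocol": protocol,               # 6 = TCP, 17 = UDP
        "header_checksum": checksum,
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }
```

The sketch reads the raw bytes in network byte order and separates the sub-byte fields (Version, IHL, DSCP, ECN, flags) with shifts and masks.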

TCP

The Transmission Control Protocol (TCP) [26] is a common transport layer protocol used for reliable host to host communication that sits just above the Internet protocol. It is used to complement IP to transfer a stream of bytes between applications on IP networks.

TCP divides the data from the data stream into chunks and adds a TCP header to the data in order to create a TCP segment. The TCP segment contains a header and data section and is forwarded to the IP layer to be sent as Internet datagrams.

The header follows the IP header and is used to supply TCP specific information.

It contains ten required fields and one optional field as can be seen in Figure 2. The data section contains the data payload carried for the application.

Figure 2: TCP Header.

Source Port is a 16-bit field containing the host’s source port, which may be altered if the datagram travels through a Network Address Port Translation (NAPT) device.

Destination Port is a 16-bit field containing the host’s destination port. This field may also be altered when traveling through an NAPT.

The Sequence Number field contains the initial sequence number generated by the OS if the SYN flag is set. The corresponding Acknowledgment number will be this number plus one. Otherwise, if the SYN flag is not set, this field will contain the sequence number of the first data byte in the packet.

If the ACK flag is set, the Acknowledgement Number field contains the next sequence number that the sender of the segment expects to receive.

Data Offset is a four-bit sized field that contains the total size of the TCP header in 32-bit words.

The Reserved field is only for future use and must be set to zero.

TCP contains six standard and three extended control flags that are one bit each to represent on and off. They are used to manage the data flow in specific situations.

The SYN flag, for example, is used to initiate a connection; only the first packet in each direction should contain this flag. The ACK flag acknowledges that data has been received, and the FIN flag indicates that the connection is being shut down. Some of these flags can be used for NAT detection.

The Window Size is a 16-bit sized field that is used to regulate the amount of data in bytes that the sender of the packet is willing to receive.

The Checksum is a 16-bit field that is used to detect errors in the header and the data, letting the receiver know if the TCP segment has been altered. The checksum calculation also covers a pseudo header, which contains the Source Address, the Destination Address and the Protocol from the IP header, together with the TCP length.

The pseudo header is 96 bits long, and the checksum is calculated over the pseudo header as well as the TCP segment. This value is then placed in the TCP checksum field. Including the addresses in the calculation lets the receiver verify both that the segment was delivered to the right address and that the data is error-free.
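As an illustration of why a NAT has to recompute this checksum, the following Python sketch computes a TCP checksum over the 96-bit pseudo header and the segment. It is a simplified example for IPv4 only; the function names `ones_complement_sum` and `tcp_checksum` are chosen here and do not come from the study.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit big-endian words, folding any carry back into the low bits."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the pseudo header plus the TCP segment.

    `segment` is the TCP header and payload with the checksum field set to zero.
    """
    pseudo_header = struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),   # Source Address
        socket.inet_aton(dst_ip),   # Destination Address
        0,                          # zero padding
        6,                          # Protocol number for TCP
        len(segment),               # TCP length in bytes
    )
    return ~ones_complement_sum(pseudo_header + segment) & 0xFFFF
```

Because the source and destination addresses are part of the input, the same segment yields a different checksum once a NAT has rewritten either address.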

The Urgent Pointer field is used when the URG flag is set, in order to indicate that a segment of data must be delivered immediately. The 16-bit field will point to the position of the packet that holds the end of the urgent data.

2.2 Network Address Translation

The fundamental role of Network Address Translation (NAT) is to change address information in the Internet Protocol (IP) header of network packets [24], [29]. It was originally created as a solution to the shortage of IP-addresses provided by Internet Protocol version 4 (IPv4). IPv4 does not contain enough IP-addresses for each device to have its own unique IP-address on the Internet. To solve this problem the Internet Assigned Numbers Authority (IANA) reserved a range of private IP-addresses that, for example, a company can use in its private network. IANA has provided three blocks of private IP-address space which can be used without coordinating with IANA or any other Internet registry. However, if an organization that uses private addresses wants to communicate with a public network, the addresses have to be translated: because several organizations can use the same address spaces, private addresses are not unique and cannot be routed on the public Internet. The three blocks of IP-addresses reserved for private use are 10.0.0.0 - 10.255.255.255, 172.16.0.0 - 172.31.255.255 and 192.168.0.0 - 192.168.255.255.
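A short Python sketch, using the standard `ipaddress` module, shows how an address can be tested against these three private blocks. The helper name `is_private_ipv4` is hypothetical and only serves the example.

```python
import ipaddress

# The three private blocks listed above, written in CIDR notation.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),       # 10.0.0.0 - 10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),    # 172.16.0.0 - 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),   # 192.168.0.0 - 192.168.255.255
]


def is_private_ipv4(address: str) -> bool:
    """Return True if the IPv4 address lies in one of the private blocks."""
    ip = ipaddress.IPv4Address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)
```

An address for which this returns True cannot appear as a source on the public Internet without first being translated.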

NAT is used to map private IP-addresses, which anyone can use, to the scarcer public IP-addresses. The private network can consist of a number of hosts, which are connected to a NAT gateway. The NAT modifies or translates the private IP-addresses of the hosts to provide a connection to the public network, most often the Internet. The only IP-addresses visible in the public domain are the public IP-addresses provided by an Internet Service Provider (ISP). This makes it possible to change the public IP-addresses without changing the private ones. Users are also able to change ISP without changing the addresses on the local network. Because a NAT device modifies the address fields in the IP header, hosts in the private domain are not visible to hosts in the public domain. Companies and organizations might therefore employ NAT for security purposes, in order to keep internal IP-addresses unreachable from external networks.

The NAT sits between the private and the public network and translates the private IP-address to a public one. This is done by creating bindings between addresses. In the next sections, the NAT types and the different binding methods will be explained.

Static NAT

Static NAT is the simplest form of NAT and works by mapping one IP-address in the private network space to one in the public network space. Each private host has a single public IP-address mapped to its private IP-address.

The main reason to use a Static NAT is to hide the real IP-addresses while still allowing certain network resources to be accessible via a fixed IP-address, for example when a customer has to access certain services from a database inside a private network. As can be seen in Figure 3, the hosts’ private IP-addresses are each mapped to their own public IP-address but the port numbers remain unchanged.

Figure 3: Static NAT.

Dynamic NAT

Dynamic NAT maps a private IP-address to a public IP-address chosen from a pool of public IP-addresses. The public IP-addresses are stored in NAT tables which are shared between all the hosts in the private network. When a private host initializes a connection the NAT chooses a currently unused IP-address from the pool. The host can be reached as long as the connection lives. When the connection is terminated the binding expires and the public IP is returned to the pool and can be reused for another host. As can be seen in Figure 4 the NAT router changes the private IP-addresses of a packet to a public IP-address chosen from a pool of available addresses in the Dynamic NAT table. When the packet returns the destination IP is changed back to the original private IP-address of the client.

This type of NAT is mostly used by larger organizations where several of the private network hosts have to communicate with hosts on a public network or vice versa. The method is somewhat more complex than static NAT but increases security, because the mappings change, which makes it hard to target a single host.

Figure 4: Dynamic NAT.

NAT Overload / NAPT

NAT overload or Network Address Port Translation (NAPT) can be seen as a combination of static and dynamic NAT but with the added functionality of Port Address Translation (PAT). This is the most commonly used NAT configuration and is often used in homes or small office networks which enable several computers to connect to the Internet while only utilizing one IP-address.

Often when people talk about NAT it is actually NAPT they refer to. NAPT is referred to by several different terms created by different groups, some of them are PAT, NAT overload, many-to-one NAT and IP masquerading. IP masquerading is often used to describe the security measure of hiding private IP-addresses behind a single public IP-address.

PAT is a technique that is used to translate port numbers. In Figure 5 two clients are represented by a single public IP-address but are differentiated by the port number, as can be seen in the NAPT table. This allows several hosts to share one single IP-address. All the packets that leave the NAT gateway contain the same NAT IP-address but have different source port numbers. When a host inside the local network sends a packet to an address on the Internet, the packet goes through the NAT gateway, which stores the host’s IP-address and port number in the translation table and replaces them with the global IP-address and a port number. The reply packet arrives at the public NAT IP-address, where the IP-address and port number are translated back using the NAPT table so that the packet can be forwarded to the correct private host in the local network. If a client wants to initiate a connection, the port number should be chosen from the range 1 024 - 65 535, as recommended by RFC 2663 [29].
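The translation table described above can be sketched in a few lines of Python. This is a deliberately simplified model and not an implementation of any particular NAT device: the class name `NaptTable` and its methods are hypothetical, ports are handed out sequentially, and bindings never expire.

```python
class NaptTable:
    """Toy NAPT table: many private (address, port) pairs share one public address."""

    def __init__(self, public_ip, first_port=1024, last_port=65535):
        self.public_ip = public_ip
        self._free_ports = iter(range(first_port, last_port + 1))
        self._outbound = {}  # (private_ip, private_port) -> public_port
        self._inbound = {}   # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip, private_port):
        """Return the (public_ip, public_port) used for an outgoing packet."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            public_port = next(self._free_ports, None)
            if public_port is None:
                raise RuntimeError("no free public ports left")
            self._outbound[key] = public_port
            self._inbound[public_port] = key
        return self.public_ip, self._outbound[key]

    def translate_inbound(self, public_port):
        """Return the private (ip, port) for a reply, or None if no binding exists."""
        return self._inbound.get(public_port)
```

Two private hosts that use the same source port receive different public ports, which is what allows replies to be forwarded to the correct host.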

Figure 5: NAPT.

Changes Made by NAT

When a packet traverses a NAT, several fields are changed. The most obvious changes are to the source address of outgoing packets and the destination address of incoming packets in the IP header. When the IP-addresses are changed the IP header checksum has to be recomputed; otherwise the packet could be dropped because the receiver would believe that it had been corrupted. This also applies to TCP and UDP, which are the most common transport protocols used for Internet traffic. Their checksums cover all of their data as well as a pseudo header that contains the source and destination addresses. Therefore, when an address changes, the checksum has to be recalculated based on the translated IP-addresses and not the original ones. The source and destination port numbers for TCP and UDP are also changed when a packet passes through a NAPT.

The Time to Live (TTL) field in the IP header is also affected by a NAT device in a network. The field is decremented by one for each router the packet passes through. Because of this, and because the initial TTL values of different OSes are known, this field can be used for NAT detection.
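The idea behind TTL-based detection can be sketched as follows. The sketch assumes the commonly seen initial TTL values 64, 128 and 255 and a monitoring point with a known expected hop distance to the customers; the function names and the `expected_hops` parameter are illustrative assumptions, not the method evaluated later in this thesis.

```python
# Commonly observed initial TTL values (an assumption of this sketch).
COMMON_INITIAL_TTLS = (64, 128, 255)


def infer_initial_ttl(observed_ttl: int) -> int:
    """Guess the initial TTL as the smallest common default not below the observed value."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial
    raise AssertionError("unreachable: 255 is the largest possible TTL")


def hops_travelled(observed_ttl: int) -> int:
    """Estimate how many routers have decremented the TTL."""
    return infer_initial_ttl(observed_ttl) - observed_ttl


def ttl_hints(observed_ttls, expected_hops):
    """Summarize two weak NAT indications for the packets seen from one IP-address."""
    initials = {infer_initial_ttl(ttl) for ttl in observed_ttls}
    return {
        # Different inferred initial TTLs suggest different OSes behind one address.
        "mixed_initial_ttls": len(initials) > 1,
        # More hops than expected suggests an extra routing device such as a NAT.
        "extra_hop": any(hops_travelled(ttl) > expected_hops for ttl in observed_ttls),
    }
```

Both hints are only indications: a host can change its default TTL, and a single device may legitimately send packets with different initial values.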

Problems Caused by NAT

Unfortunately, NAT may cause problems in certain circumstances. For example, if the TCP header is encrypted, the NAT cannot update the checksum field and the packet might be dropped. Some application protocols also carry an IP-address in their payload that might have to be translated. In order to do this, an Application-Level Gateway (ALG) is needed.

There are also problems where unauthorized NAT devices may provide unrestricted access to, for example, a private company network, thereby causing a significant security threat [28].

Another case is when NAT is used to distribute Internet access to more hosts than the ISP allows, without the ISP’s knowledge.

2.3 Tools

This section explains the libraries, software and tools used in the study.

Python

Python [9] is the programming language chosen for this study. It is a popular object-oriented high-level programming language. It is consistently designed, easy to read and has a minimal syntax, which makes it easy to learn and shortens development time. It was chosen because it offers a large set of libraries for scientific computing, more specifically modules for machine learning and for handling large datasets. It does have some disadvantages, in that execution can be slower than in other languages, but the advantages are believed to outweigh the disadvantages.

Jupyter Notebook

Jupyter Notebook [11], formerly known as IPython Notebook, is a web application environment that can be used for Python. It provides a cell-based environment in which the project’s code was created. The cells can contain code, plots and text, which made it possible to organize the code in a structured way. This is very helpful when working with several calculations and when trying to understand large datasets.

NumPy

NumPy[7] is a package for Python which contains several functions for scientific calculations. In this study, NumPy was used to handle large arrays and matrices with higher dimensions. Operations could be performed on these datasets with the provided functionalities from NumPy.

Pandas

Pandas[21] is a package for Python that is used to create data structures with high performance and provides tools to easily analyze the data. In this study, it was used to categorize and label the collected data into data structures to analyze it.

2.4 Chapter Summary

This chapter gave an introduction to IP, TCP and NAT, which will be referred to in upcoming parts of the study. The TCP and IP section explains how TCP and IP work and what the different fields in the headers are used for. The NAT section explains how the different types of NAT work and what they are used for. The Tools section explains the various tools that have been used in the evaluations.


3 Related Studies

NAT host identification is used to determine the number of hosts that are connected behind a NAT device. This can be very valuable for network management, since it gives information on how many users and devices are connected to the network. Several studies examine different methods for identifying the number of hosts behind a NAT. In this chapter, some of these methods will be explained in order to give a basic understanding of how others have performed NAT host detection.

3.1 Signature Based Detection Methods

A technique for counting NATted hosts, Bellovin 2002

Steven M. Bellovin[3] wrote a paper which, according to him, contained the first technique for determining the number of hosts behind a NAT device. The technique was based on examining the identification field (IPid) of the IP header. It was observed that on many operating systems the IPid acted as a simple packet counter.

Consecutive packets emitted from a host would carry sequential IPid fields. By counting the number of consecutive packets grouped as strings from an IP-address it is possible to determine the number of hosts behind a NAT.

The IPid field is used to order fragment packets. It has to be unique in all fragmented packets that have the same protocol number, source and destination address. This is done in order to allow fragmentation and reassembly for a packet.

The algorithm employed by Bellovin builds sets of IPid sequences. When a new packet was received, the algorithm scanned the existing sequences in order to find the best match. If a match was found, the IPid was added to that sequence; otherwise a new sequence was created.

A new IPid was considered a perfect match for a sequence if it was exactly one higher than the IPid of the last packet in that sequence, within reasonable gap and time bounds.
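The sequence-building idea described above can be illustrated with a minimal sketch. It is not Bellovin's implementation: the function name, the `max_gap` and `max_idle` bounds and the packet layout are assumptions chosen only for illustration, and refinements such as discarding very short sequences are left out.

```python
def build_ipid_sequences(packets, max_gap=64, max_idle=5.0):
    """Group (timestamp, ipid) pairs from one IP-address into IPid sequences.

    max_gap and max_idle are illustrative bounds, not values from the paper.
    """
    sequences = []  # each entry: {"last_id", "last_time", "length"}
    for timestamp, ipid in sorted(packets):
        best, best_gap = None, None
        for seq in sequences:
            # The IPid is a 16-bit field, so the counter wraps around at 65536.
            gap = (ipid - seq["last_id"]) % 65536
            recent = timestamp - seq["last_time"] <= max_idle
            if recent and 1 <= gap <= max_gap and (best is None or gap < best_gap):
                best, best_gap = seq, gap
        if best is None:
            # No sequence fits: treat the packet as the start of a new sequence.
            sequences.append({"last_id": ipid, "last_time": timestamp, "length": 1})
        else:
            best.update(last_id=ipid, last_time=timestamp, length=best["length"] + 1)
    return sequences


# Two interleaved counters, as two hosts behind one NAT address might produce.
trace = [(0.0, 100), (0.1, 5000), (0.2, 101), (0.3, 5001),
         (0.4, 102), (0.5, 5002), (0.6, 103)]
estimated_hosts = len(build_ipid_sequences(trace))
```

In this sketch the number of sequences is used directly as the host estimate; a gap of exactly one corresponds to the perfect match mentioned above, and larger gaps are tolerated up to the assumed bound.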

The tests were performed by using real packet traces where the machines were grouped together in order to behave as if they were behind a real NAT. The subnet used was from their organization’s wireless LAN, which contained hundreds of computers. The output of the algorithm was compared to the real IP-address data.

It showed good results, where the algorithm was able to identify nearly all hosts.

However, the algorithm had some problems when there were collisions in the IPid space. This could cause the algorithm to wrongly identify the number of hosts. The algorithm only looks for IPids that increase sequentially, which makes it applicable only to operating systems and packets that behave in this way. Another limitation is that the IPid field is not required to behave as a counter, only to be unique, in which case the method does not work. It is also possible that the Don’t Fragment bit in the IP header is set, in which case some operating systems do not fill the IPid field with a counter value. Some NAT devices also reset the IPid to zero.

Bellovin concluded that the technique was best suited for smaller networks behind a NAT, where the data can be collected by an Internet Service Provider (ISP) for analyzing the number of hosts in homes and small business networks.

Passive detection of NAT routers and client counting, Straka 2006

Kenneth Straka and Gavin Manes[34] described another NAT detection method and the limits of some existing methods. Bellovin wrote that NAT detection using the IPid field can be somewhat unreliable for distinguishing hosts, since the field does not have to behave as a counter. Their solution was to combine the IPid field with the TTL values extracted from the IP packet header. Most OSes have default TTL values and can therefore be easily distinguished from each other. The TTL value is also decremented by one for each router the packet passes through, so examining the TTL value can indicate whether a NAT exists in the network.

Unfortunately, this method still has some limitations, since a client can change the default TTL value. Together with Bellovin’s method of grouping sequential IPids, they were able to gain a more accurate estimation of the number of clients in a network. To improve their method, they chose to also look at application-level packet patterns, and gave an example of how the detection would work for bursts of Post Office Protocol (POP) traffic. If such packets were found at close intervals, it could indicate several hosts checking their e-mail. The usernames submitted to the POP server could also be used to separate hosts. This method of identification can also be used on operating system updates such as Windows Update and application updates such as antivirus updates. Even though they did not perform any experiments on a dataset, these types of host detection have been used in other research, as will be seen below.

Application presence fingerprinting for NAT-aware router, Bi 2006

Bi et al.[4] proposed a new fingerprinting method based on application layer data in order to discover NAT. This detection method was introduced because it is hard for NAT routers to modify application layer data in order to avoid detection, and because application developers do not design features to avoid NAT detection. Usually, one host runs only one instance of an application, and this fact can be used for host separation. In their research, they used Instant Messaging (IM) applications to distinguish hosts. Some IM applications only allow one instance to be active on a desktop, they have a large user population and the applications are usually active for long periods. They found that the IP-addresses of IM servers, TCP/UDP port numbers, and certain IM packets have characteristics that can be used for fingerprinting.

Their algorithm checks each IP-address from which IM packets originate. It examines the destination IP-address and source port number to determine whether a packet belongs to an existing channel between the client and the application server; otherwise it creates a new channel. The algorithm then counts the number of channels between the given IP-address and the IM application in order to make a verdict. If the number of channels is equal to or greater than the maximum number of allowed channels for each IM application, it is concluded that the IP-address is a NAT gateway address. The algorithm then removes each expired channel given a maximum idle time.
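The channel-counting logic can be sketched as follows. This is an illustration only, not the code of Bi et al.: the packet tuple layout, the function name and the `nat_threshold` and `max_idle` parameters are assumptions, with `nat_threshold` standing in for the per-application channel limit at which an address is flagged.

```python
def detect_nat_by_channels(packets, nat_threshold=2, max_idle=300.0):
    """Flag source addresses that hold too many concurrent IM channels.

    packets: iterable of (timestamp, src_ip, src_port, dst_ip) tuples for one
    IM application. nat_threshold and max_idle are hypothetical parameters.
    """
    channels = {}    # src_ip -> {(dst_ip, src_port): time the channel was last seen}
    nat_ips = set()
    for timestamp, src_ip, src_port, dst_ip in sorted(packets):
        active = channels.setdefault(src_ip, {})
        # Remove channels that have been idle for longer than max_idle seconds.
        for key in [k for k, seen in active.items() if timestamp - seen > max_idle]:
            del active[key]
        # A channel is identified by destination address and source port.
        active[(dst_ip, src_port)] = timestamp
        if len(active) >= nat_threshold:
            nat_ips.add(src_ip)
    return nat_ips
```

With `nat_threshold=2`, an address is flagged as soon as two channels to the IM service are alive at the same time, which reflects the assumption that a single host runs at most one instance of the application.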

Their experiment was conducted on a campus network and on the China Education and Research Network (CERNET), which is a network used for education and research. The IM applications they used were MSN Messenger and Google Talk. Even though they mention that they performed an experiment, no results are presented to show the accuracy of the algorithm.

NAT usage in residential broadband networks, Maier 2011

Maier et al.[14] used an approach to detect NAT and estimate the number of hosts behind it by using the TTL field in the IP header and HTTP user-agent strings. Their approach used the knowledge from previous studies regarding TCP/IP fingerprinting in order to come up with a more reliable NAT detection method.

Their dataset is based on packet-level observations of Digital Subscriber Line (DSL) connections from a large ISP. They were able to monitor 20,000 anonymized DSL lines.

In order to detect if NAT was used on a DSL line, they used the fact that OSes have different initial TTL values on outgoing packets. Windows has an initial TTL value of 128, while MacOS and Linux use 64. For every router that the packet travels through, the TTL value is decremented by one. They also knew that the hop distance from the customer’s equipment to their monitoring point was one. They used this knowledge to determine if there was a NAT in the customer’s network. To count the number of hosts behind a NAT on the DSL line, they counted the number of TTL observations from each line to distinguish different OSes. HTTP user-agent strings of regular browsers contain information about OS and browser versions. By combining the TTL values with the observed OS and browser versions, they were able to identify the number of different hosts behind a NAT. They made the conclusion that it was unlikely for several customers to have both the same OS and browser family versions.
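The TTL reasoning above can be made concrete with a small sketch. It is an interpretation rather than the implementation of Maier et al.: it assumes that a directly connected host arrives at the monitoring point with its initial TTL reduced by exactly one, that a NAT gateway decrements the TTL once more, and that the observation tuples and function names are hypothetical.

```python
INITIAL_TTLS = (64, 128)  # MacOS/Linux and Windows, as stated in the text


def infer_initial_ttl(observed_ttl):
    """Return the smallest known initial TTL not below the observed value."""
    candidates = [ttl for ttl in INITIAL_TTLS if ttl >= observed_ttl]
    return min(candidates) if candidates else None


def analyse_line(observations, monitor_distance=1):
    """observations: iterable of (observed_ttl, os_name, browser) for one line.

    Returns (nat_suspected, estimated_hosts). monitor_distance is the assumed
    hop count between the customer equipment and the monitoring point.
    """
    nat_suspected = False
    hosts = set()
    for ttl, os_name, browser in observations:
        initial = infer_initial_ttl(ttl)
        # More hops than the known distance suggests an extra routing device.
        if initial is not None and initial - ttl > monitor_distance:
            nat_suspected = True
        # Distinct (initial TTL, OS, browser) combinations count as distinct hosts.
        hosts.add((initial, os_name, browser))
    return nat_suspected, len(hosts)


line = [(126, "Windows", "Firefox"), (126, "Windows", "Chrome"), (62, "Linux", "Firefox")]
nat_suspected, estimated_hosts = analyse_line(line)
```

Hosts with an identical OS and browser collapse into one set entry in this sketch, so they are counted once.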

In their results, more than 90% of the lines they observed used NAT and they also found that 30-50% of the lines had more than one active host.

In order to see if multiple hosts on one line were active at the same time, they computed a minimal interactivity time, which showed that 10% of the DSL lines had more than one host active at the same time. They were not able to distinguish between hosts with identical OS and browser. It is also possible that they wrongly classify a computer that has two OSes as two hosts, or count a host twice if a user updates the browser during the observation period. If a NAT gateway does not decrement the TTL value, it will not be classified as a NAT. This method will also be unable to detect hosts based on the HTTP knowledge if the flows are encrypted, since encryption hides the user-agent information.

Counting NATted hosts by observing TCP/IP field behaviors, Mongkolluksamee 2012

Mongkolluksamee et al.[18] developed a technique to study long-term NAT trends from 2001-2010 using traces from the Measurement and Analysis on the WIDE Internet (MAWI) group. In their study, they examined the behavior of IPid, TCP sequence number and source port to identify different operating systems. By examining the behavior of these three fields they were able to identify patterns of different operating systems.

They extended Bellovin’s method for estimating the number of NAT hosts from network traffic by including the TCP source port and sequence number. In contrast to Maier et al., who looked at HTTP user-agent strings to identify different OSes, they discovered that each OS has its own way of selecting a starting TCP sequence number and a TCP source port for each connection. They added this to Bellovin’s method in order to identify different hosts when the IPid is assigned per flow or in a random manner.

The method they used is divided into two parts. In the first part, they collect IPid sequences, TCP sequences, and TCP source ports and then order the sequences in increasing order. The TCP sequence is created by calculating the arrival time of the packet and the gap limit in the sequence. The arriving packet is then classified based on the difference between the previous packet and the arriving packet.

The IPid sequence was calculated in the same way, but they had to create two sets for each new IPid which is added to the best matching sequence set. The parameters chosen in the algorithm to construct the sequences are based on table parameters from Bellovin.

When the sequence construction phase is done, they start to classify hosts. This is done in three steps. In the first step, they try to associate TCP sequence numbers with IPid sequences. When a TCP sequence and an IPid sequence have been associated, they observe patterns of the TCP sequence starting number to distinguish different OSes from each other. The second step applies when they have found only one TCP sequence for each IPid sequence, and tries to distinguish which type of IPid the TCP connection is associated with. The last step observes the TCP source port behavior. This is done if a TCP sequence cannot be associated with any IPid sequence, in which case they try to discover patterns by looking at the source ports.

The method was evaluated by using two trace files, one from a synthetic NAT and one from real NAT traffic collected from a wireless router. The synthetic NAT traffic was collected from 16 hosts with different OSes. By using this method they found 18 estimated hosts, with 15 true positives and three false positives. This result was slightly better when they compared it with Bellovin’s method. Their method was able to identify Windows, FreeBSD and MacOS with a high accuracy but had problems with detecting OpenBSD, where the algorithm would count several OpenBSD hosts as one host.
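As a worked example, the figures reported for the synthetic trace can be turned into precision and recall. The calculation below is an illustration and not taken from the paper: it assumes that all 16 hosts were present in the trace, so that every host that is not a true positive counts as a false negative.

```python
actual_hosts = 16
true_positives = 15
false_positives = 3

# Assumption: the one host that was not correctly identified is a false negative.
false_negatives = actual_hosts - true_positives

precision = true_positives / (true_positives + false_positives)  # 15 / 18
recall = true_positives / (true_positives + false_negatives)     # 15 / 16

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```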

When evaluating the results from the real NAT traffic, they compared their results with p0f, a passive OS fingerprinting tool, which gave similar results.

They used their method to study the long-term NAT trends on the data from the MAWI traces from 2001 - 2010. Their results showed that around 2% of the IP-addresses were NATed and on average there were five hosts behind every NAT.

A hybrid packet clustering approach for NAT host analysis, Zhang 2015

Zhang et al.[36] also performed a study in order to discover how many hosts were located behind a NAT. They used a method where they gathered HTTP data and extracted cookie ID, application ID and user-agent information from each packet group in order to learn which IDs belong to which host. They created an environment where the network traffic was collected inside two laboratory networks before a NAT device, so all IP-addresses were visible.

Their method would work on the outside of the NAT network, but the traffic captured before the NAT gateway was used to validate the results. A sliding time window technique in combination with cookie ID verification was used to verify when two HTTP requests belong to the same host. They analyzed the HTTP header to employ the rules in each time window. If two HTTP requests were sent to the same destination within a small time period, they made the conclusion that the requests are probably from the same host. They also checked if the User-agent was the same for each HTTP header. A cookie cluster was used which contained cookies that would not change in short periods of time, like session cookies. This cookie cluster was combined with a cookie ID cluster in order to verify if they were from the same host. Cookie IDs were used to verify and connect two HTTP requests from the sliding time window cluster or the cookie cluster, to see if there was any cookie ID conflict between them. They created a cookie ID table that contained cookie IDs from different websites to check the connections. Their results revealed that for their NAT networks with around 100 hosts, they had an average accuracy of more than 90% and a coverage of more than 50%. However, this method will not work if the HTTP header is encrypted.
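A strongly simplified sketch of the clustering idea is given below. It is not the algorithm of Zhang et al.: it only links requests that share destination and User-agent within a time window, or that carry the same cookie ID, and it omits the cookie ID conflict check. The request layout, the function name and the `window` value are assumptions, and the requests are expected to be sorted by time.

```python
def estimate_hosts_from_http(requests, window=2.0):
    """requests: time-ordered list of (timestamp, destination, user_agent, cookie_id).

    cookie_id may be None. Returns the number of request clusters, used here
    as a rough host estimate.
    """
    parent = list(range(len(requests)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    last_seen = {}     # (destination, user_agent) -> (timestamp, request index)
    cookie_owner = {}  # cookie_id -> index of the first request carrying it
    for i, (timestamp, destination, user_agent, cookie_id) in enumerate(requests):
        key = (destination, user_agent)
        # Sliding window rule: same destination and User-agent close in time.
        if key in last_seen and timestamp - last_seen[key][0] <= window:
            union(i, last_seen[key][1])
        last_seen[key] = (timestamp, i)
        # Cookie rule: the same cookie ID links requests regardless of time.
        if cookie_id is not None:
            if cookie_id in cookie_owner:
                union(i, cookie_owner[cookie_id])
            else:
                cookie_owner[cookie_id] = i
    return len({find(i) for i in range(len(requests))})
```

Requests that cannot be linked by either rule remain in their own cluster, so this sketch tends to overestimate the number of hosts when few cookies are visible.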

Identification of hosts behind a NAT device utilizing multiple fields of IP and TCP, Park 2016

In a study made by Park H et al.[22] they tried to identify the number of hosts behind a NAT by utilizing multiple IP and TCP header fields. They used the IPid, TTL, SYN, source port and timestamp fields in order to separate the hosts. Their method to identify the number of hosts was made in two phases. The first phase was to determine what OS the packets were directed to. The algorithm examined if the protocol in use was TCP and that the SYN flag was used. In order to determine the OS a table with TTL initial values was used, similar to Maier’s research[14].

The next phase separated the individual hosts, which was done with two methods. One of the methods used the IPid and source port number, and the other was based on the TCP timestamp. They generated a host list which contained host information. If a packet came from a new host, a host list item was created and added to the list. If the packet came from an already existing host, the packet items were added to the appropriate host list item. In order to determine if a packet came from the same host, they used a table that contained predetermined parameter values such as the difference in packet arrival time, IPid value, source port number and the number of items in the list. If the difference between the arriving packet and the last packet in the list was in range according to the parameter table, and the number of items in the list was greater than the value in the parameter table, the packets were considered to come from the same host. Since Windows uses sequential source port numbers in TCP SYN packets and sequential IPids, these fields were used to identify Windows hosts. The TCP timestamp method can be used to identify hosts by identifying the pattern of a linear equation, which depends on the OS, the initial timestamp and the current time.
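The linear relationship used by the TCP timestamp method can be sketched as follows. The sketch assumes that a host's timestamp value grows as `tsval = rate * time + offset`, that the tick rate is known in advance, and that hosts differ in their offset. The function name, `rate_hz` and `tolerance` are hypothetical, and wrap-around of the 32-bit timestamp field is ignored.

```python
def group_by_timestamp_offset(packets, rate_hz=1000.0, tolerance=50.0):
    """packets: iterable of (arrival_time_seconds, tsval) from one IP-address.

    Packets whose implied offset agrees within `tolerance` ticks are assumed
    to lie on the same line and therefore to come from the same host.
    """
    clusters = []  # each entry: {"offset": float, "count": int}
    for arrival, tsval in packets:
        # Solve tsval = rate * time + offset for the offset of this packet.
        offset = tsval - rate_hz * arrival
        for cluster in clusters:
            if abs(offset - cluster["offset"]) <= tolerance:
                cluster["count"] += 1
                break
        else:
            clusters.append({"offset": offset, "count": 1})
    return clusters


# Two hosts whose timestamp clocks started at different times.
trace = [(0.0, 1_000_000), (0.5, 5_000_500), (1.0, 1_001_000), (1.5, 5_001_500)]
estimated_hosts = len(group_by_timestamp_offset(trace))
```

In practice the tick rate depends on the OS, so it would have to be estimated first, for example from the slope between consecutive packets of the same connection.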

They performed the test using two OSes in a small-scale local environment. They achieved an accuracy ranging from 71 - 100%, a precision of 83 - 100% and a sensitivity of 83 - 100%. The results were considerably better than those obtained by only using the IPid, like Bellovin, or by only using user-agent strings.

3.2 Behavior Based Detection Methods

Remote NAT detect algorithm based on support vector machine, Rui 2009

Based on previous research, Rui et al.[13] proposed a new method for remote NAT host detection that does not depend on any special fields in the packet header. The technique uses a Support Vector Machine (SVM) method to analyze the network traffic from a NAT together with a remote NAT detection algorithm.

They proposed a method to detect NAT devices passively by analyzing network traffic based on eight features. The chosen features can be seen in Table 1.

Table 1: The 8 features for network traffic analyzing.

Features

1 The number of packets sent out
2 The number of packets received
3 The number of UDP packets
4 The number of TCP packets
5 The number of DNS request packets
6 The number of FIN packets
7 The number of RST packets
8 The number of SYN packets

The features were chosen based on the behavior of NAT network traffic. A NAT network will send out more bytes, connections, DNS requests and protocols, and show more complex behavior, than regular host traffic. These features are used to represent the network traffic from a NAT and were filtered from data traces. The network traffic is represented as a series of eight-dimensional vectors over a set duration.

They discovered that sometimes the network traffic from hosts behind a NAT and from ordinary hosts does not differ that much, for example when all the hosts behind a NAT are inactive. This makes it harder to see the difference between a NAT host and an ordinary host. In order to gain a higher accuracy for their method, they proposed a function to filter out the inactive hosts from the network traffic, leaving only the active ones in the dataset.

After the network traffic had been filtered, the SVM was used to analyze the vectors. The vectors were evaluated as a binary classification problem where the SVM determined whether the data came from an ordinary host or a NAT host. The network traffic was captured from five networks. Four of the networks were placed behind a NAT and their number of hosts ranged from two to five. One network only contained one host and was not placed behind a NAT. The results showed that as the number of hosts behind the NAT device increased, so did the accuracy and the specificity. When the number of hosts reached five, the accuracy and specificity reached nearly 100%, with sensitivity reaching 80%.

Passive Remote Source NAT Detection Using Behavior Statistics Derived from Netflow, Dietz 2013

Dietz et al.[1] proposed a passive NAT detection method in order to detect rogue NAT devices. They did this by employing machine learning algorithms based on behavior statistics from NetFlow data. The machine learning algorithms used were a Support Vector Machine (SVM) and a C4.5 decision tree algorithm. Their method is based on the assumption that NAT traffic will behave differently than network traffic from only one host. In order to model the NAT behavior, they used nine features which they extracted from the NetFlow records.

The NetFlow data was collected from an ISP during eight days. Most of the traffic was DNS traffic. In total, the dataset amounted to 6 631 383 anonymized records with labeled NAT and non-NAT traffic.

The method is divided into two steps where they first train the machine learning algorithm with the data from a NAT which is based on the features from the NetFlow data. The trained classifier was then fed with unlabeled feature vectors to classify NAT and non-NAT traffic.

Using several NetFlow records they computed feature vectors based on each unique IP-address found in the records which started during a chosen time window.

The result from this is a set of feature vectors that is based on each unique IP-address.

The features derived from NetFlow are visible in Table 2.


Table 2: Top NetFlow features based on unique IP-addresses.

Features

1 The number of TCP NetFlow records
2 The number of UDP NetFlow records
3 The number of NetFlow records belonging to DNS
4 The number of NetFlow records belonging to SMTP
5 The number of NetFlow records belonging to Email traffic
6 The number of NetFlow records with SYN flag set
7 The number of NetFlow records with RST flag set
8 The number of bytes exchanged within a flow
9 The number of packets transmitted within a specific flow
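How such per-address feature vectors might be computed is sketched below. This is an illustration and not the code of Dietz et al.: the record layout, the function name and the port sets used to recognize DNS, SMTP and e-mail traffic are assumptions, the byte and packet features are simply summed per address, and the selection of records by time window is omitted.

```python
from collections import defaultdict

# Assumed port sets; the source does not state how the traffic types were recognized.
DNS_PORTS = {53}
SMTP_PORTS = {25}
EMAIL_PORTS = {25, 110, 143, 465, 587, 993, 995}


def netflow_feature_vectors(records):
    """Aggregate hypothetical NetFlow records into one feature dict per source IP.

    Each record is a dict with the keys src_ip, proto, dst_port, tcp_flags
    (a set of flag names), bytes and packets.
    """
    vectors = defaultdict(lambda: defaultdict(int))
    for rec in records:
        vec = vectors[rec["src_ip"]]
        # Booleans add as 0 or 1, so each line counts matching records.
        vec["tcp_records"] += rec["proto"] == "TCP"
        vec["udp_records"] += rec["proto"] == "UDP"
        vec["dns_records"] += rec["dst_port"] in DNS_PORTS
        vec["smtp_records"] += rec["dst_port"] in SMTP_PORTS
        vec["email_records"] += rec["dst_port"] in EMAIL_PORTS
        vec["syn_records"] += "SYN" in rec["tcp_flags"]
        vec["rst_records"] += "RST" in rec["tcp_flags"]
        vec["bytes"] += rec["bytes"]
        vec["packets"] += rec["packets"]
    return {ip: dict(vec) for ip, vec in vectors.items()}
```

The resulting dictionaries, one per unique IP-address, correspond to the feature vectors that are later passed to the classifiers.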

The machine learning algorithms were then fed with training and testing sets from the feature vectors. In order to not introduce bias when applying the machine learning algorithms, they derived a balanced dataset by randomly sampling the feature vectors on the NAT class. This was done because an imbalance was detected between the non-NAT traffic and the feature vectors created. The number of feature vectors created was rather small in comparison to the amount of non-NAT traffic.
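One common way to obtain a balanced dataset is random undersampling, where every class is reduced to the size of the smallest class. The sketch below shows this technique under the assumption that it resembles what was done; the exact sampling procedure is not described here, and the function name and `seed` parameter are hypothetical.

```python
import random


def balance_by_undersampling(vectors, labels, seed=0):
    """Return (vector, label) pairs with the same number of samples per class."""
    rng = random.Random(seed)  # fixed seed so that the sampling is repeatable
    by_class = {}
    for vector, label in zip(vectors, labels):
        by_class.setdefault(label, []).append(vector)
    # Every class is cut down to the size of the smallest class.
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        balanced.extend((vector, label) for vector in rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


vectors = [[i] for i in range(13)]
labels = ["non-NAT"] * 10 + ["NAT"] * 3
balanced = balance_by_undersampling(vectors, labels)
```

Undersampling discards data from the larger class, which is one possible reason for a classifier to score lower on balanced data than on the original distribution.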

The results showed that the C4.5 algorithm had an accuracy of 95.35% on the unbalanced data and 89.35% on the balanced data. The SVM had an accuracy of 95.10% on the unbalanced data and 81.29% on the balanced data. This revealed that the classifier presented better results when it was trained with biased data. They also concluded that C4.5 performed better and faster than the SVM.

In their finding, they describe that they are not yet sure if the unbalanced dataset should be classified as bias or if it could be used as a feature to discover NAT.

In their experiments, they also found two phenomena that could be used for NAT host identification. The first one was that the source port translation done by the NAT gateway on outgoing packets was performed in a deterministic way according to programmatic conditions. This was not further investigated but left as future work. The second phenomenon was that the SYN packets extracted from the NetFlow records had different sizes depending on the OS, which might be used to identify the number of different operating systems behind a NAT.

They have also released a master thesis in which they expanded their classifier to include source port sequences and average SYN-flow byte sizes, as well as user behavior, which is instead based on 11 features, for the NAT detection[8].

The SYN flows on which the SYN-based detection relies appeared seldom, which resulted in a low detection rate for this approach. Only 0.22% of the total number of flows were classified as SYN flows and only a subset of those could be used for detecting NAT, but they were able to determine different OSes from those flows. The number of source port sequences found in the dataset was also low, because some flows that could have been part of a sequence were lost when the provider set the time frame of the flow exports to five minutes. This also resulted in a low detection rate.

Can we identify NAT behavior by analyzing Traffic Flows? Gokcen 2014

Gokcen et al.[10] identified behavior from a NAT by analyzing traffic flows using machine learning principles. They did not try to estimate the number of hosts behind a NAT but only wanted to discover the existence of NAT in the connection. The approach of Maier et al.[14] was re-implemented and compared to their own machine learning approach.

The datasets they used came from two different organizations, which included encrypted and non-encrypted traffic. One of the datasets was collected during a week on their own network and the other set was delivered to them through a private partner. They performed tests on both of the datasets and labeled all flows as NAT flows and OTHER flows. The number of flows used was 321 209.

From the TCP dump traffic, they computed the features for each flow. They used NetMate which is an open-source flow generator to generate flows and to retrieve statistical features from the traffic traces. They did not use IP-address and port numbers because they believed they might cause bias in the results. In their approach, they wanted to find patterns for the behavior of a NAT and did not use any application layer information.

The fingerprinting approach was evaluated in four steps based on methods tested by Maier et al[14]. They tested it by evaluating the TTL range, distinct TTL values for each IP, different OS information and browser information in the HTTP user agent strings.

This approach was then compared to their own proposed approach where they employed two machine learning techniques. They used a C4.5 decision tree classifier and a Naive Bayes probabilistic classifier. These two classifiers were then compared to determine which worked best on the features.

Their results showed that the passive fingerprint classifiers worked on certain NAT behaviors, but as the NAT behavior became more complex it was harder to gain accurate results. On the datasets, the best detection rate for the NAT flows was 100%, with false positive rates of 6% and 2.7%.

The machine learning approach showed high-performance accuracy with a detection rate of 98-99% for C4.5 on the NAT and OTHER flows, with a low false positive rate of 2-4%. Naive Bayes was not as good and performed differently on the two data sets. Compared to the results from the fingerprint approach it can be shown that the C4.5 classification worked best on the two datasets. The C4.5 algorithm was also able to identify the most important features to detect NAT behavior. These features are visible in Table 3.

Table 3: Most important features classified by C4.5.

Features

1 The average number of bytes in a sub flow in the forwarded direction
2 Total bytes in backwards direction
3 Mean size of packets sent in the forwarded direction
4 Maximum duration the flows were active
5 The size of the smallest packet in the forwarded direction
6 The size of the biggest packet in the backwards direction
7 Standard deviation from the mean of the packet sent in the backwards direction

Passive NAT detection using HTTP access logs, Komarek 2016

In one of the most recent papers, Komarek et al.[12] proposed a NAT detection method using HTTP logs. Similar to Gokcen et al.[10] and Rui et al.[13], they looked at the behavior of a NAT to discover it with the help of a machine learning classifier. Their method is divided into two steps. In the first step, they analyzed the host behavior, which consists of feature extraction, statistics collection, and window selection. In the second step, they trained an SVM in order to label IP-addresses as NAT or end host.

For each host that was identified in the HTTP logs they performed feature extraction based on eight features in Table 4.

Table 4: Features extracted from HTTP logs

Features

1 The number of unique contacted IP-addresses
2 The number of unique User-Agent strings
3 The number of unique OSes
4 The number of unique Internet browsers
5 The number of persistent connections
6 The number of uploaded bytes
7 The number of downloaded bytes
8 The number of sent HTTP requests

These features were collected to distinguish a NAT from an ordinary end host. They were collected in sequences of 30-minute time windows, with each sequence covering 24 hours.

They gathered the HTTP access logs from four different corporate networks during two working days. The networks were selected to be of different sizes with 3 000, 5 000, 10 000 and 25 000 hosts. From these datasets, they generated an artificial NAT by joining HTTP logs together to simulate hosts behind NAT. The data was then labeled in two sets: Artificial NAT and end hosts to train the classifier.

To evaluate their NAT detection method they conducted an evaluation in five separate scenarios:

• Time-drift: The classifier is trained on the data captured from the first day and is then evaluated by the data captured from the second day. This is done to show that the classifier can operate on data from the same network in future days with a high accuracy.

• Cross-validation: In order to see if the classifier can be used on new network data.

• Hosts with the same OS and/or Internet browser: To test if their method was able to identify the hosts where both Bellovin’s[3] and Maier’s[14] methods would fail.

• Sensitivity to contaminated training sets: In order to measure the impact on the classifier if a real NAT was present in the training set.

• Evaluation on the Network with real NATs: where they used HTTP logs from a corporate network to test their classifier, which contained 1 717 end hosts and 166 NAT devices.

All of the scenarios achieved an accuracy ranging from 93.91-99.36%, a precision of 89.21-98% and a recall of 84.35-95%. The test on the network with real NAT devices with the trained classifier achieved the highest score on all the measurements. It was also observed that NAT devices with five or more hosts were detected in more than 96% of the cases.

3.3 Chapter Summary

The different approaches described in this chapter can be categorized into two kinds of detection methods, namely signature-based and behavior-based.

The signature-based or fingerprinting approaches are detection methods which look for signatures in the IP, TCP and HTTP fields, such as the TTL, IPid, User-agent strings, port numbers and more. By using these fields it is possible to gain information about the number of hosts by observing the number of similar sequences in the packets from an IP-address. It is also possible to detect OSes, browser versions and applications in use. By observing and making assumptions about how different applications work, these can be used to identify different hosts behind one IP-address. In order to increase the detection accuracy, some methods use different fields together. There are some problems associated with this approach, since the fields used for the detection might not be in use. For example, the IPid might not be set or might behave in an unpredictable manner, or the TTL values might not be set to the standard OS values. Some OSes create the values in the header fields differently than others, which makes them harder to detect, and because of this some hosts might avoid detection. Although building a method on fields that were not originally created for host detection can cause some problems, most of the methods achieved a high accuracy during the tests.

The behavior-based approaches are methods that examine how the traffic from a NAT behaves. Papers that use these methods are written by Gokcen et al, Dietz et al, Rui et al and Komarek et al. By assuming that traffic from a NAT behaves differently than traffic from an ordinary host, they are able to identify NATed networks.

Some of the most common features that are examined in these approaches are the number of different packets sent and received, and the number of bytes exchanged.

A NAT usually sends out more traffic than a single ordinary host, because usually there are several hosts hidden behind the NAT gateway. Often the accuracy of these methods increases when the number of hosts in the NAT network is greater.

Usually, these methods use machine learning in order to classify whether a host is a NAT or not. They used different methods in order to first label the hosts and then train the classifier on the data. Unfortunately, the machine learning methods are only able to classify whether a host is a NAT device or not, and are not able to determine the number of hosts in a NATed network. The detection methods were able to classify a NAT device with nearly 100% accuracy if the NAT network consisted of more than five hosts.

Unfortunately, some of these methods might not be viable when the headers are encrypted, such as in HTTPS. The methods described by Komarek et al, Gokcen et al, Zhang et al and Maier et al might not work with encrypted fields because they use information from the HTTP header fields.

The methods described in this section that are able to identify the number of hosts behind a NAT device are those of Bellovin, Straka et al, Bi et al, Maier et al, Mongkolluksamee et al, Zhang et al and Park H et al.

The next section will explain the datasets and the aspects used in the NAT host detection methods.


4 Datasets and Detection Aspects

This section will explain the data sources and datasets used in the study. It will explain the attributes of the different datasets, how they were collected and for how long. This section will also present the NAT host aspects that were used in the study. These aspects were used in the analysis in order to create an accurate NAT host detection model.

4.1 Data Sources

The datasets used in this study were gathered from real network traffic traces from two different sources. The first source of data was Procera Networks, which collected the data with a Deep Packet Inspection (DPI) engine. The engine collects information from an ISP and inspects the header and the data of each packet that passes through the inspection point.

The second data source was the Karlstad University network.

The datasets from this source consist only of traffic related to how Windows receives updates. This was done in order to gain ground-truth information about what the flows look like when a Windows computer downloads updates, so that an accurate detection model could be created. The datasets were collected by a Procera DPI appliance located between the Internet and three VMs running different versions of Windows. The traffic was processed and then transferred to an HDF file.

The gathered packets were grouped into traffic flows [5], where a flow is a sequence of packets that travels between a source and a destination. The packets are grouped into flows based on the combination of source and destination addresses, port numbers and transport protocol. Grouping the packets into flows reduces the amount of data in the datasets while still maintaining a good representation of the traffic.
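
The grouping described above can be sketched in Python as follows. The `Packet` record and the function name are hypothetical, and the sketch treats each direction as a separate flow and ignores timeouts, which may differ from how the DPI engine actually builds its flows.

```python
from collections import defaultdict, namedtuple

# Hypothetical packet record used only for this illustration.
Packet = namedtuple("Packet", "src_ip dst_ip src_port dst_port protocol size")


def group_into_flows(packets):
    """Group packets by source/destination address, ports and protocol.

    Returns a dict that maps each flow key to its packet and byte counts.
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.protocol)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt.size
    return dict(flows)
```

Each flow is stored as a single record with counters instead of one record per packet, which is how the grouping reduces the amount of data.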

4.2 Procera Network Datasets

Three different real network traffic datasets were provided by Procera Networks. They were analyzed and used to detect different hosts behind a NAT based on the detection models. These datasets consisted of traffic gathered from several different IP-addresses, for different durations, and from different countries.

Small Cellular

The small cellular dataset was gathered by Procera’s DPI device. It was collected over approximately three hours. The dataset consisted of 781 793 unique flows with 12 features. It used anonymized IP-addresses and did not contain any information on whether the hosts were NAT devices or regular hosts.

The dataset was used to learn what the data looked like and to make the first attempts at detecting the hosts behind a NAT. Flows from different applications were examined in order to detect the hosts. The flows used for the detection of the hosts can be seen in Tables 5 and 6. These applications were searched for in the dataset, but not all of them were present. The flows were created by grouping ten or more packets which had the same properties.

Large Cellular

This was a much larger dataset than the previous one and contained 42 488 795 flows. Like the first dataset, it was collected from a cellular network, but over 18.7 hours.

It was similar to the first dataset in the sense that it did not have any information on whether NAT hosts existed in the dataset. All IP-addresses were anonymized and each flow had the same attributes as in the first dataset, except that the size of each flow in bytes was now present. The same measurements were performed on this dataset as on the first one, in addition to some new measurements. Another difference is that some of this dataset's flows were grouped differently. In the previous dataset, a flow was only created if it contained ten or more packets that had similar properties. In order to increase the presence of antivirus flows in this dataset, such flows were created from as few as one packet. This was done for the applications present in Table 5, because these applications might send packets to check for new updates and those flows may consist of fewer than ten packets.
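
The two flow-creation thresholds can be summarized in a small Python sketch. The function name, the parameter names and the application label in the usage note are hypothetical; only the thresholds of ten packets and one packet come from the text above.

```python
DEFAULT_MIN_PACKETS = 10  # threshold used for ordinary flows


def keep_flow(application, packet_count, low_threshold_apps, low_min_packets=1):
    """Decide whether a group of packets is large enough to become a flow.

    Applications listed in `low_threshold_apps` (for example the update
    checks of antivirus software) are kept from `low_min_packets` packets.
    """
    if application in low_threshold_apps:
        minimum = low_min_packets
    else:
        minimum = DEFAULT_MIN_PACKETS
    return packet_count >= minimum
```

For example, with `low_threshold_apps = {"example-antivirus"}`, a three-packet update check from that application is kept, while a three-packet group from any other application is discarded.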

DSL and Cellular

This was the largest dataset provided by Procera. It originally consisted of 102 866 346 flows, but some flows were not classified correctly and had to be removed. The cleaned dataset consisted of 68 596 470 flows, and the traffic was collected over approximately seven days. As with the previous datasets, it did not contain any ground truth regarding the presence of NAT routers, and all the IP-addresses were anonymized. The dataset contained the same attributes as the Large Cellular dataset, and the same measurements were performed.

4.3 KAU Lab Datasets

Three Virtual Machines (VMs) with different Windows OSes were set up on a computer at Karlstad University. The VMs had only Windows installed on them, and the purpose was to receive Windows updates from Microsoft so the traffic
