
Exploration of 5G Traffic Models using Machine Learning


Academic year: 2021


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-A--20/52--SE

Exploration of 5G Traffic Models

using Machine Learning

Analys av trafikmodeller i 5G-nätverk med maskininlärning

Aron Gosch

Supervisor: Patrick Lambrix
Examiner: Niklas Carlsson


Upphovsrätt (Copyright)

This document is kept available on the Internet - or its future replacement - for 25 years from the date of publication, provided that no extraordinary circumstances arise.

Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A later transfer of the copyright cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security and accessibility, there are solutions of a technical and administrative nature.

The author's moral rights include the right to be named as the author to the extent required by good practice when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or distinctive character.

For further information about Linköping University Electronic Press, see the publisher's website http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

The Internet is a major communication tool that handles massive information exchanges, sees rapidly increasing usage, and offers an increasingly wide variety of services. In addition to these trends, the services themselves have highly varying quality of service (QoS) requirements, and the network providers must take into account the frequent releases of new network standards like 5G. This has resulted in a significant need for new theoretical models that can capture different network traffic characteristics. Such models are important both for understanding the existing traffic in networks and for generating better synthetic traffic workloads that can be used to evaluate future generations of network solutions using realistic workload patterns under a broad range of assumptions and based on how the popularity of existing and future application classes may change over time. To better meet these changes, new flexible methods are required.

In this thesis, a new framework aimed towards analyzing large quantities of traffic data is developed and used to discover key characteristics of application behavior for IP network traffic. Traffic models are created by breaking down IP log traffic data into different abstraction layers with descriptive values. The aggregated statistics are then clustered using the K-means algorithm, which results in groups with closely related behaviors. Lastly, the model is evaluated with cluster analysis and three different machine learning algorithms to classify the network behavior of traffic flows. From the analysis framework, a set of observed traffic models with distinct behaviors is derived that may be used as building blocks for traffic simulations in the future. Based on the framework, we have seen that machine learning achieves high performance on the classification of network traffic, with a Multilayer Perceptron getting the best results. Furthermore, the study has produced a set of ten traffic models that have been demonstrated to be able to reconstruct traffic for various network entities.


Acknowledgments

Thanks to my supervisor Georgios Almyras, Vengatanathan Krishnamoorthi, Vivek Dasari and the traffic modeling team at Ericsson for technical support during this thesis. I am very grateful to Niklas Carlsson and Pontus Sandberg for providing me with this thesis opportunity. Also, many thanks to my friends and family.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Background
1.2 Motivation
1.3 Aim
1.4 Research questions
1.5 Delimitations

2 Background
2.1 Preliminaries
2.2 Radio Access Network
2.3 Network protocols
2.4 Internet communication
2.5 Traffic packet flow
2.6 Machine learning algorithms
2.7 Related work

3 Method
3.1 Investigation of application behavior
3.2 Network traffic definition
3.3 Data analysis
3.4 PySpark
3.5 System setup

4 Results
4.1 Data exploration
4.2 Cluster analysis
4.3 Classification
4.4 Parameter study
4.5 Traffic models
4.6 Network content analysis

5 Discussion
5.1 Results
5.2 Method
5.3 The work in a wider context

6 Conclusion
6.1 Future work

Bibliography

A Appendix A

B Appendix B
B.1 Volume
B.2 Duration
B.3 IAT
B.4 Overlap
B.5 Packets
B.6 Packet bursts for UL/DL

List of Figures

2.1 Traffic Forecast Update from the 2019 Cisco VNI Forecast.
2.2 RAN-architectures.
2.3 Interconnected mobile traffic network.
2.4 The OSI reference model and protocol stack for TCP/IP.
2.5 Data transmission with TCP and UDP.
2.6 Datagram format for IPv4 protocol.
2.7 Packet flow for network traffic.
3.1 Proposed stack layers in network model.
3.2 UE and application traffic for network levels.
3.3 Visualization of how thresholds are used to divide packets into network-layer segments.
3.4 CRISP-DM process [CRISP_DM_WIKI].
3.5 Analysis framework for traffic data.
3.6 Spark execution process.
3.7 Spark and HDFS configurations.
4.1 Fraction of service providers and application-clients from input data.
4.2 CDF of packet characteristics.
4.3 Elbow and Silhouette score for K-means model.
4.4 Clustering with K-means model using t-SNE with ten clusters labeled with numbers zero to nine.
4.5 Heat-maps for t-SNE representation of clustering model. The scatter-plots depict parameter intensity from low to high.
4.6 Heat-map of average parameter values per cluster. Black boxes indicate missing values for a parameter. Distinct parameter values are separated with a horizontal line.
4.7 CDF for cluster volume in Bytes. Input data is signified by thicker line.
4.8 CDF of duration for bursts and connections in seconds.
4.9 CDF of IAT in seconds.
4.10 CDF for number of packets and volume for up/downlink.
4.11 Cluster fraction of server IP addresses.
4.12 Fraction of total percentage from clusters for application-clients and service providers.

List of Tables

3.1 Format of statistics aggregated from IP-log data.
3.2 Hardware specifications
3.3 Software specifications
4.1 Highly correlated feature groups.
4.2 Conclusion from network layer study.
4.3 Results for network service classifiers.
4.4 Traffic model characteristic behaviors. Abbreviations in the table: Packet (P), Packet Bursts (PB), Connection Sessions (CS), Flow Sessions (FS), Uplink (UL) and Downlink (DL).

1 Introduction

Understanding the characteristics of network traffic and their patterns in cellular radio access systems is vital to the design, implementation, testing and development of new network features and concepts. In contrast to the preceding network architectures, the new 5G networks are predicted to be highly modular and are designed to meet future demands [1]. Furthermore, the introduction of 5G is foreseen to result in a massive shift in traffic behavior. Adopting viable strategies to meet these changes in network behavior will consequently be of critical importance in order to enable further technological advancement for the mobile network operators [2].

Live networks are inherently dynamic environments where several traffic-generating applications evolve and run side-by-side with multiple users. The practical aspects of characterizing an observed traffic behavior manifest themselves in the development of a mathematical representation that is known as a traffic model. Advanced traffic models that build upon a relevant set of traffic behaviors are thus necessary to provide a better understanding of a network and a significant basis from which to evaluate and improve the system in terms of design, troubleshooting, performance and dimensioning [3].

Understanding the behavior of network applications is especially important since different applications have different demands on the quality of service. The interaction with an application in this context could, for instance, refer to Voice over IP (VoIP) calls with Skype, watching a video on YouTube or a user accessing emails with Outlook. Today, there exist many possible network analysis methods, such as Deep Packet Inspection (DPI) and IP port-lookup, but this master thesis instead adopts another popular approach, which is to use statistical classification based on network packet traces.

The goal of the thesis is to determine if this approach is viable to unearth application-level network traffic behavior in a 5G context. Data from the network is stored in IP packet header logs, where a line in the log contains all recorded information for a packet. Based on the data in the IP packet header logs, traffic models will be identified by investigating characteristic statistical distributions of the network data. This framework will provide a basis for using machine learning models to categorize application and user equipment (UE) behavior, with features such as burstiness, call duration or activity-inactivity intervals, and find significant correlations in the traffic of the network.

1.1 Background

Ericsson is a major telecommunications company with its headquarters based in Stockholm, Sweden. The new 5G network is predicted to be more service-centric and thus requires a deeper understanding of the underlying network traffic behavior. For this reason, telecom providers such as Ericsson desire solutions that can be used to analyze massive data flows and create reliable traffic models for the relatively small user base before the large-scale launch of the 5G network.

1.2 Motivation

Generating application-level traffic models is a non-trivial task as there is an unknown mix of device types and configurations as well as different variations on user and application behavior. In accordance with traditional traffic modeling studies, real-life network data from customer measurements can be used in advanced data analytics methodologies to utilize the full potential of the data. This is beneficial both from the perspectives of subscribers that may get better service and the service provider that can optimize the network based on indications from the traffic models.

1.3 Aim

The thesis aims to examine how complex behaviors work in a communications network for IP packet header log data and explore the patterns that may emerge from the analysis of user data. This kind of exploration and model construction is essential to construct reliable network traffic models that may be used by mobile network operators in the future.

1.4 Research questions

The following research questions are answered in the thesis:

1. How can traffic characteristics be used for the design of traffic models that mimic characteristic behaviors of applications?

2. Which machine learning clustering model works best to define groups of traffic characteristics that are most representative of the traffic behavior variety in a network?

3. What traffic models can be defined by the identified groups of behaviors and how are these correlated with real-life applications?

1.5 Delimitations

The results in this thesis are based on user data provided by Ericsson and the traffic models generated may not be representative of other network types. Restrictions that can arise from big data processing are hardware limitations and data complexity, which may limit the amount of data that can be processed. Additionally, we have limited knowledge of packet content and of how network conditions may affect the results. For example, the same video watched by two UEs in the network can be played back at various degrees of quality even though both devices utilize the same application, e.g. due to limitations in bandwidth and access to the network. Lastly, the theoretical network model might be a simplification of reality and may not be precise enough to yield satisfactory results.

2 Background

This section introduces the relevant background necessary to understand the most critical concepts of the thesis.

2.1 Preliminaries

Today, mobile connectivity has become an essential part of our everyday life and most people already consider mobile services a necessity. Global mobile data traffic grew 71 percent, compared to the previous year, and reached 11.5 exabytes per month in 2017, which means that mobile data traffic had grown 17-fold over 5 years [4]. Thus, the task of creating reliable models that can be used in network analysis often involves processing of massive amounts of data, which is a considerable undertaking and of critical importance for any modern mobile network service provider. Additionally, the introduction of 5G is predicted to induce massive changes in data generated from the mobile networks. By 2022, 5G traffic is predicted to make up more than ten percent of the total generated mobile traffic. A visualization of the predicted traffic growth can be found in Figure 2.1.

To meet the growing demands, new network technology standards must evolve to address challenges in areas such as energy consumption, speed and connectivity [2]. The new development also involves changes in the volume of data generated from the networks, the variety of the data, which comes from a larger number of different kinds of connected devices, and the velocity, meaning the frequency of incoming data that needs to be processed. A key component to finding solutions to these problems is to explore and analyze the data generated from the network to be able to create reliable models. Understanding the behavior of traffic is essential to develop models that can be used to support the evolution of the network utilization.

Due to the increasing complexity of networks and Internet traffic, big data-driven development and large-scale simulations are becoming increasingly relevant in the network field [5]. A popular method used to analyze network behavior is to perform simulations based on traffic generators. The generators use traffic models to generate series of packet flows that populate the network with traffic. Input to the models can be based on parameters such as data volume, packet burst patterns and inter-arrival time (IAT), which are aggregated from IP packet header logs that store massive amounts of information about network traffic. Establishing reliable models for traffic is essential in dimensioning, design and test phases to discover new trends and use-cases and to optimize the performance of an existing system [3].

Figure 2.1: Traffic Forecast Update from the 2019 Cisco VNI Forecast.

In a global scheme, as well as within companies such as Ericsson, traffic models are traditionally developed based on UE behavior, without taking the app perspective into account. In that perspective, many devices can be connected to the network and run different kinds of applications in parallel, which is predicted to play a much larger role as network technology evolves [3]. It is thus necessary to investigate which parameters are crucial to include when designing new traffic models and selecting input for traffic generators. In this thesis, a four-layered network model is proposed with the goal of increasing granularity and capturing complex network behaviors. To solve this problem and identify the most critical parameters for a traffic model, we develop a framework for processing large quantities of network traffic data with PySpark and Hadoop.

2.2 Radio Access Network

Network traffic can be defined as communication between end-hosts that leads to an information exchange, which is packetized to be carried over digital networks and routed by equipment in the network. The devices in a mobile network are connected through a Radio Access Network (RAN), which consists of several base stations that provide over-the-air connectivity for the nearby area. Several UEs of different types (phones, computers, tablets, etc.) may be connected to a base station, which in turn connects to the core network. At the core level, critical decisions regarding UE access rights and connectivity to the Internet or other networks are handled [3]. There exist many different kinds of RAN-architectures based on various technologies and application areas. Today, cellular networks are in the process of transitioning from Long Term Evolution (LTE) with 4G towards New Radio (NR) 5G. A high-level illustration of the most commonly used RAN-architectures can be found in Figure 2.2.

The architectures for 4G LTE, 4G-5G dual connection and standalone 5G connect a user device to the core network through a series of intermediary steps. LTE in 4G communicates directly over an evolved Node B (eNB). Dual connectivity is a solution that enables interactions between 4G and 5G by establishing a link between a Master evolved Node B (MeNB) and a Secondary next generation Node B (SgNB). This makes it possible to boost the performance of a standard 4G network and is a quick route for an operator to get some 5G connectivity benefits before full stand-alone deployment [6]. The final RAN deployment mode is standalone 5G, which connects a device to the 5G core network directly through the gNB. The traffic in the network can be divided into two different categories: a user plane for data traffic and a control plane that contains administrative traffic from routing protocols. Together the base stations make up a large-scale mobile network. Figure 2.3 contains a small representation of the mobile network structure.

Figure 2.2: RAN-architectures.

2.3 Network protocols

Network communication is built on the utilization of several protocols, which are models connected to different system structures within the network. A protocol defines the format and the order of messages exchanged between two or more communicating entities, as well as the actions taken on the transmission or receipt of a message or other event [7]. Different protocols are used to accomplish various communication tasks, and all activity over the Internet that involves two or more communicating remote entities is governed by a protocol. A communication network can be organized as a stack of layers that can be implemented in either software or hardware. The layered architecture has several benefits and can provide an abstraction of a complex system that is relatively modular and easy to understand. The combination of all the protocols used by the network layers is called a protocol stack.

Protocol layering has several conceptual and structural advantages and increases the modularity of network functionality at the cost of some overhead for the lower layers. Each layer in the protocol stack implements a service via its own internal actions, which rely on services provided by the layers below in the protocol stack. Two of the most commonly studied protocol stacks are the Open Systems Interconnection (OSI) protocols and the TCP/IP family, which are used exclusively for Internet communication. Each protocol stack model consists of a number of layers that are connected to a certain network-level functionality [8]. The structure of the OSI and TCP/IP network protocol models can be found in Figure 2.4. As seen in Figure 2.4, the traditional OSI model uses seven abstraction layers while the TCP/IP family only uses five layers. The main difference between the two models is that TCP/IP reduces the number of layers, which leads to less complexity. More specifically, it does not deal with the presentation and session layers. Below, a brief explanation of each of the layers present in the TCP/IP model is provided.

OSI Model (top to bottom): Application, Presentation, Session, Transport, Network, Data link, Physical

TCP/IP Model (top to bottom): Application, Transport, Network, Data link, Physical

Figure 2.4: The OSI reference model and protocol stack for TCP/IP.

Application layer

The top layer in the TCP/IP protocol stack is called the application layer, which defines the format in which the data should be received from or handed over to the applications [8]. There exists a vast variety of network applications with different purposes. Many of the popular Internet applications like World Wide Web (WWW), P2P file sharing, e-mail, VoIP and social networking applications like Facebook and Twitter have become an integral part of our society. The architecture of these applications can be divided into two groups: the client-server architecture and the peer-to-peer (P2P) architecture [7]. For client-server applications, such as web-based applications, there always exists an active host, which is a device with a unique IP address, called a server, which provides services for many other hosts called clients. P2P applications, like BitTorrent and Skype, on the other hand do not rely on servers and instead use direct communication between hosts, which are called peers.

Network applications communicate through processes that send and receive messages via a socket, which works as a sliding door that must be passed before a message can be delivered. This message exchange procedure is controlled by the application-layer protocols, for example HyperText Transfer Protocol (HTTP), File Transfer Protocol (FTP), Simple Mail Transfer Protocol (SMTP) and Domain Name System (DNS). Each protocol is connected to a specific application type: HTTP is connected to web document request and response, FTP to file transfer between two end-systems, SMTP to email transfer and DNS to domain name lookup [7]. Based on specific application quality requirements on data integrity, throughput, timing and security, a transport protocol is also needed, which leads us to the next layer in the TCP/IP protocol stack.

Transport layer

The transport layer handles communication services for message exchanges between application processes that are running on different hosts [7]. This is done by applying protocols to establish a logical communication where the sending host breaks down the application message into segments that are passed to the network layer. On the receiver side, the transport layer reassembles the segments into a message that is passed to the application layer and delivered to the intended application. For the Internet, there are two main transport-layer protocols: Transmission Control Protocol (TCP) and User Datagram Protocol (UDP). TCP transports data using TCP segments that are addressed to individual applications, while UDP transports data using UDP packets. One distinguishing factor between TCP and UDP is that TCP is a connection-oriented service, where the destination confirms the data received. If some data gets lost, the destination requests a re-transmission of the lost data. In contrast, UDP is a connectionless service that transports data using packets where the delivery is not guaranteed [8]. Figure 2.5 depicts the data exchange process between a sender and receiver for the TCP and UDP protocols.

Figure 2.5: Data transmission with TCP and UDP.

At the beginning of each connection between a sender and receiver, TCP performs a handshake where three segments are exchanged. First, the connection is established by a device sending a Synchronize Sequence Number (SYN), which informs the receiver that a sender is ready to initiate communication. In response, the receiver sends back a SYN-ACK signal, which acknowledges that it has received the message. Lastly, the sender sends out an acknowledgment that it has detected the response from the server. Once both devices are synced, the actual data transfer can begin. This contrasts with the exchange process of the UDP protocol, where transmission is initiated by a request and data exchanges can start immediately. Generally, this structure makes UDP more common in smaller applications with low energy cost, where speed is essential, while TCP is more common for larger applications where high transmission quality is necessary. For example, this could mean that TCP is used for websites and UDP for streaming a Skype conversation [7].

Additionally, the TCP protocol is further differentiated from UDP by how the protocol handles concepts such as congestion and flow control. These are two fundamental concepts that have a large impact and need to be taken into consideration for network problems with data transactions between multiple devices, such as the one studied in this thesis. Congestion occurs in a network when too many sources send too much data at too high a speed for the network to handle. It can result in lost packets, due to buffer overflow in routers, and long delays from queuing in router buffers [7]. In order to reduce data loss, we can specify in a window (WIN) how much unconfirmed data the source can send before the network gets congested. The source-side window is called the congestion window (CWND), and the amount of unconfirmed data sent by the source must never exceed the receiver window (RWND) or the CWND [8].

The congestion control mechanism for TCP is often referred to as Additive-Increase, Multiplicative-Decrease (AIMD), which produces a sharp wave-like pattern. With this congestion control algorithm, the CWND grows additively while no loss is detected and is decreased by a factor of two when loss occurs. This procedure means that the algorithm works to achieve connections that have an equal share of bandwidth within the network and to limit potential congestion from developing. If a TCP sender detects that very little congestion is occurring between itself and the receiver, it can increase its transmission rate, and conversely, if there is much congestion, it can reduce its transmission rate. The control algorithm follows a fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average data rate of R/K. Flow control is a process where the receiver controls the sender, so that the sender will not overflow the buffer on the receiver side by transmitting too much data at too high a speed.
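The wave-like AIMD pattern and the R/K fairness goal can be illustrated with a small sketch. This is a simplified illustration and not an implementation of TCP: the function names, the fixed `capacity` threshold that stands in for packet loss, and the one-segment-per-round increase are all assumptions made for the example.

```python
def simulate_aimd(rounds, capacity, cwnd=1.0, increase=1.0, decrease=0.5):
    """Return the congestion window after each round trip.

    Assumption: a loss is signalled whenever cwnd exceeds `capacity`.
    Real TCP detects loss through timeouts and duplicate ACKs.
    """
    history = []
    for _ in range(rounds):
        if cwnd > capacity:
            # Multiplicative decrease: halve the window on loss.
            cwnd = max(1.0, cwnd * decrease)
        else:
            # Additive increase: grow by a fixed step per round trip.
            cwnd += increase
        history.append(cwnd)
    return history


def fair_share(link_rate, sessions):
    """Fairness goal: K sessions on a bottleneck of rate R get R/K each."""
    return link_rate / sessions


trace = simulate_aimd(rounds=12, capacity=8.0)
print(trace)                 # window climbs linearly, then drops by half
print(fair_share(10.0, 4))   # average rate per session on a shared link
```

The printed trace rises one unit per round until it passes the assumed capacity and then falls to half, which is the sawtooth shape described above.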

Network layer

The network layer, for the Internet, is responsible for transferring packets known as IP datagrams from one host to another [7]. The network layer then provides the service of delivering the segments to the transport layer in the destination host. A difference between the transport and network layers is that the transport layer focuses on logical communication between processes while the network layer provides logical communication between hosts. On the sending side, transport segments are encapsulated into datagrams, and the receiving side extracts the segments and delivers them to the transport layer. In order to accomplish this task, every host runs network-layer IP and routing protocols. A router examines header fields in all IP datagrams passing through it. This procedure can be compared to putting a letter in an envelope with a specific address and dropping it into a mailbox [7]. The structure and content of the IPv4 protocol datagram format can be found in Figure 2.6.

The functionality of the network layer can be divided into two main categories: forwarding, which moves packets from router input to router output, and routing, which determines the route that packets will take from source to destination. Routing protocols are often based on algorithms that provide solutions to the shortest-path problem. Some common routing protocols are Routing Information Protocol (RIP), Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP). Included in the network layer is also the essential Internet Protocol (IP), which determines the structure of the datagram that is sent. There is only one IP protocol, and all Internet components that have a network layer must run the IP protocol. There exist different versions of IP, such as IPv4, as seen in Figure 2.6, and the more recent IPv6 protocol, which has a different datagram format.
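As an illustration of the shortest-path problem mentioned above, the following sketch computes least-cost routes with Dijkstra's algorithm, the approach used by link-state protocols such as OSPF. The topology, router names and link costs are hypothetical and chosen only for the example.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra's algorithm: least total link cost from source to each node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbour, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


# Hypothetical four-router topology with symmetric link costs.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
costs = shortest_paths(topology, "A")
print(costs)
```

In this topology the cheapest route from A to D goes through B and C rather than over the direct B-D link, which is the kind of decision a routing protocol encodes in its forwarding tables.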

IPv4 datagram layout, one row per 32-bit word (bits 0 to 31):

ver | header len | type of service | total length
16-bit identifier | flags | fragment offset
time to live | upper layer | header checksum
32-bit source IP address
32-bit destination IP address
options (if any)
data (typically a TCP or UDP segment)

Figure 2.6: Datagram format for IPv4 protocol.
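The fixed 20-byte part of the IPv4 header in Figure 2.6 can be decoded with Python's standard `struct` module, as in the sketch below. The function name, the dictionary keys and the sample addresses (taken from documentation address ranges) are assumptions for illustration; options and payload are ignored.

```python
import socket
import struct


def parse_ipv4_header(raw):
    """Decode the fixed 20-byte IPv4 header (options are not parsed)."""
    if len(raw) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes")
    (ver_ihl, tos, total_length, identifier, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,                    # upper 4 bits
        "header_len_bytes": (ver_ihl & 0x0F) * 4,   # length in 32-bit words
        "type_of_service": tos,
        "total_length": total_length,
        "identifier": identifier,
        "flags": flags_frag >> 13,                  # upper 3 bits
        "fragment_offset": flags_frag & 0x1FFF,     # lower 13 bits
        "ttl": ttl,
        "upper_layer_protocol": protocol,           # 6 = TCP, 17 = UDP
        "header_checksum": checksum,
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }


# Hand-built sample header: IPv4, 20-byte header, 40-byte datagram,
# "don't fragment" flag set, TTL 64, TCP payload, checksum left at zero.
sample = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0x4000, 64, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"))
header = parse_ipv4_header(sample)
print(header)
```

Fields such as the total length, the upper-layer protocol and the two addresses are examples of the per-packet values that a header log line could record.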

Data link layer

In the network, hosts and routers can be considered as nodes that are connected for every data exchange. The data-link layer has the responsibility of transferring datagrams handed down from the network layer from one node to a physically adjacent node over a link [7]. Datagrams often need to traverse several nodes on their way from source to destination, which means that they may pass several links along the route that utilize different link-layer protocols. The protocols on the link layer may include wired links like Ethernet or wireless links such as WiFi. A link can be as simple as a single sender and receiver, or it can have multiple senders and receivers, in which case it utilizes a multiple access protocol.

The link layer provides a number of services that facilitate communication between nodes. Similar to the upper layers in the protocol stack, the link layer can encapsulate the data to be transmitted between nodes into frames, a service called framing. A frame consists of a data field, in which the network-layer datagram is inserted, and a number of header fields. Link access is controlled by a Medium Access Control (MAC) protocol, which specifies the rules by which a frame is transmitted onto the link. The link can provide reliable delivery that ensures that the datagram will be delivered without error. The link can also be used to discover and correct errors with methods such as parity-bit checks or checksums.
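A single even-parity bit is the simplest of the error-detection methods mentioned above. The sketch below is a minimal illustration with hypothetical function names; it only detects errors and cannot correct them, and real link layers typically use stronger codes such as cyclic redundancy checks.

```python
def even_parity_bit(data):
    """Return the bit that makes the total number of 1-bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def frame_with_parity(data):
    """Sender side: attach the parity bit to the payload."""
    return data, even_parity_bit(data)


def is_intact(data, parity):
    """Receiver side: recompute the parity and compare."""
    return even_parity_bit(data) == parity


payload, parity = frame_with_parity(b"5G")

# Flip one bit in the first byte to imitate a transmission error.
corrupted = bytes([payload[0] ^ 0b00000100]) + payload[1:]

# Flip two bits: the parity is unchanged, so the error goes unnoticed.
double_error = bytes([payload[0] ^ 0b00000110]) + payload[1:]

print(is_intact(payload, parity))       # unmodified frame passes
print(is_intact(corrupted, parity))     # single-bit error is detected
print(is_intact(double_error, parity))  # even number of flips is missed
```

The last case shows the limitation of a single parity bit: any even number of flipped bits leaves the parity unchanged.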

Physical layer

The physical layer describes the low-level electric or optical signals used for communicating between two computers [8]. It consists of protocols that operate only on a physical link and is dependent on an actual transmission medium like twisted-pair copper wire, coaxial cables or fiber. The goal of the services provided by the layer is to enable bits to travel over the wire to their intended destination.

2.4

Internet communication

Information exchange over the Internet is based on a technique called packet switching, which refers to the hand-over of data packets. To send a message from a source end system to a destination end system, the source breaks long messages into smaller chunks of data known as packets [7]. Between source and destination, each packet travels through communication links and packet switches. Each network interface on the Internet has one or more IP addresses that are unique worldwide. One network interface can have several IP addresses, but one IP address cannot be used by many network interfaces [8]. The Internet is composed of individual networks that are interconnected via routers. Each IP packet header contains the destination address, which is the complete routing information used for delivering the packet to its destination. Different links can transmit data at different rates, with the transmission rate of a link measured in bits per second. Messages can perform a control function or contain data such as an email, a PNG image, or an MP4 video file [7].

2.5

Traffic packet flow

Devices generate network traffic by sending packets at defined time intervals. For a particular device configuration, packets can have varying sizes and a varying distribution over a time-based transfer interval. Burst traffic patterns are characterized by a sequence of packets sent in rapid succession, with an Inter-Arrival Time (IAT) between them lower than 1 ms, which is a common time interval used in network communication to reduce the impact of network conditions. The time between bursts is called Idle Time (IT), and a device will be disconnected if it exceeds 10 seconds. It is also possible to measure time in terms of Connection Time (CT), which is the total amount of time that a device has been connected to the network. Additionally, each packet has a direction, either Uplink (Ul) or Downlink (Dl), that indicates whether the packet is sent or received by a device [9]. Figure 2.7 contains a graphical representation of the traffic packet flow definition.

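[Figure: packet size (Size [B]) over time (Time [s]) for uplink and downlink packets, showing two bursts (Burst 1, Burst 2), the IAT between packets and the IT between the bursts.]

Figure 2.7: Packet flow for network traffic.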
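The quantities in this definition can be computed directly from a list of packets. The sketch below is a minimal illustration that assumes packets are given as hypothetical `(timestamp, size, direction)` tuples sorted by time. It derives the IAT values, the idle times between bursts, the connection time and the byte totals per direction.

```python
def inter_arrival_times(timestamps):
    """Time differences between consecutive packets (sorted timestamps, in seconds)."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

def idle_times(timestamps, burst_iat=0.001):
    """Gaps of at least `burst_iat` seconds are treated as idle time between bursts."""
    return [gap for gap in inter_arrival_times(timestamps) if gap >= burst_iat]

def connection_time(timestamps):
    """Total time between the first and the last observed packet."""
    return timestamps[-1] - timestamps[0]

def bytes_per_direction(packets):
    """Sum of packet sizes for uplink (Ul) and downlink (Dl)."""
    totals = {"Ul": 0, "Dl": 0}
    for _, size, direction in packets:
        totals[direction] += size
    return totals

# Hypothetical packets: (timestamp [s], size [B], direction)
packets = [
    (0.0000, 61, "Ul"), (0.0004, 47, "Dl"), (0.0008, 1400, "Dl"),
    (2.0000, 42, "Ul"), (2.0005, 1400, "Dl"), (14.0000, 61, "Ul"),
]
timestamps = [t for t, _, _ in packets]
idle = idle_times(timestamps)
# The device counts as disconnected once an idle time exceeds 10 seconds.
disconnected = any(gap > 10.0 for gap in idle)
```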

2.6

Machine learning algorithms

Machine learning (ML) refers to a field within computer science closely related to mathematics and statistics. The purpose of a model in machine learning is to simulate processes and solve complex problems. A model is based on data that is used to train the model and data that is used to test and validate it. There are two main approaches in machine learning: supervised learning, where input is mapped against an expected output, and unsupervised learning, which can be used to discover patterns in input data. The algorithms described in this section are all based on the implementations available in the ML framework for PySpark, which is a development platform that can be used to process large amounts of data in parallel with ML algorithms.

Unsupervised learning

Unsupervised learning is most commonly used in clustering problems. Clustering refers to creating groups of data points based on a specific criterion or methodology. The groups can then be studied in more detail with cluster analysis.


K-means

K-means is part of a family of unsupervised learning techniques where n data points are assigned to k different clusters. A cluster is created by choosing an arbitrary center point and evaluating the Sum of Squared Errors (SSE), computed from the Euclidean distances between the center and all points within the cluster. The algorithm stops when the best center, which minimizes the cost in terms of SSE within the cluster, has been found [10]. The algorithm has a number of drawbacks: the number of clusters k has to be specified by the user beforehand, and the initial choice of centroids influences the outcome of the clustering. Another limitation is that it depends on the geometric distance between the points and is hence sensitive to the problems of high-dimensional vector spaces known as the curse of dimensionality [11]. The algorithm has the following steps:

1. Specify number of clusters k.

2. Initialize centroids by randomly selecting k data points for the centroid placement.

3. For each center, identify the subset of data points that are closer to it than to any other center.

4. Calculate the mean of each feature for the data points in each cluster; this mean vector becomes the new center for that cluster.

The algorithm continues to iterate between steps 3 and 4 until convergence is reached. A possible improvement is the K-means++ method, an initialization technique that aims to avoid selecting sub-optimal cluster centers. It chooses the first centroid at random and selects each subsequent centroid with a probability proportional to the squared distance from the point to its closest already chosen center. In PySpark, the ML framework utilizes a scalable variant of K-means++, called K-means||, introduced by Bahmani et al. [12]. Two popular methods for finding an appropriate value of k for a data-set are the Silhouette score and the Elbow method. The Silhouette score is a measure of similarity within clusters (cohesion) and distance between clusters (separation), where high cohesion and separation are sought after. The metric lies in the range [-1, 1], where a high value indicates good clustering. The Elbow method instead works by finding a good cut-off point at the "elbow" of a curve, e.g. of the Within-Set Sum of Squared Errors (WSSE), after which the returns from adding more clusters diminish. In this sense, using the number of clusters at the elbow reduces the risk of over- or under-fitting the model to a particular data-set.
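The four steps can be sketched in plain Python as follows. This is a minimal single-machine illustration with random initialization, not the distributed K-means|| implementation in PySpark. The function names and the toy data are assumptions made for the example. The returned WSSE is the quantity that would be plotted against k when applying the Elbow method.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # step 2: random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # step 3: assign to nearest center
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [                                  # step 4: mean vector per cluster
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
        if new_centers == centers:                       # converged: centers no longer move
            break
        centers = new_centers
    wsse = sum(squared_distance(p, centers[c])
               for c, cluster in enumerate(clusters) for p in cluster)
    return centers, clusters, wsse

# Two well-separated toy groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters, wsse = k_means(points, k=2)           # step 1: k chosen by the user
```

Running the function for several values of k and comparing the returned WSSE values gives the curve whose elbow is inspected.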

t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique primarily used in data exploration that provides support for interpreting and understanding distributions [13]. It works by creating a probability distribution, which is then translated into coordinates in a vector space with a selected number of dimensions. This means that high-dimensional problems can be visualized in two- or three-dimensional plots. A drawback of traditional SNE is that it is complicated and generally only applied to a smaller sample of a dataset. t-SNE addresses this issue by using a Student t-distribution in the low-dimensional space, alongside the Gaussian distribution employed in the data representation [7]. The Student t-distribution has a longer tail than the Gaussian distribution, which helps the model to map faraway points in the high-dimensional space to faraway points in the embedded space as well. Another significant difference is that t-SNE uses a symmetric version of the SNE cost function [14].

PCA

Principal Component Analysis (PCA) is a common method for processing a large amount of data and extracting the essential information needed for algorithms and analysis. It works by finding new axes, i.e. lines, along which the variance of the data is highest. The data is then projected onto those axes, which are the principal components (PCs) themselves. The best line is the first PC and explains the most variance of the total data. Each PC is a linear combination of variables and weights that determine the impact of a parameter on the component [11]. Variables either increase together (correlated) or one increases while the other decreases (inversely correlated). These relations can be shown by visualizing the covariance between each pair of variables in a covariance matrix, where a positive value indicates correlation between two variables, while a negative value indicates inverse correlation. The algorithm has the following steps:

1. Normalize the variables so that they have the same amount of impact.

2. Calculate the covariance matrix to explain the correlation between variables.

3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the impact of the parameters on the principal components.
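The three steps can be illustrated with a small dependency-free sketch. Note one substitution: instead of a full eigendecomposition, the example uses power iteration, which only recovers the first principal component and its eigenvalue. The function names, the start vector and the toy data are assumptions made for the illustration.

```python
import math

def standardize(rows):
    """Step 1: give every variable zero mean and unit (sample) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def covariance_matrix(rows):
    """Step 2: sample covariance between every pair of variables."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]

def first_principal_component(cov, iterations=200):
    """Step 3 (simplified): power iteration for the dominant eigenvector."""
    d = len(cov)
    vec = [1.0 / (i + 1) for i in range(d)]   # arbitrary non-symmetric start vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d))
    return vec, eigenvalue

# Two strongly correlated toy variables.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
cov = covariance_matrix(standardize(rows))
pc1, eigenvalue = first_principal_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
```

Here `explained` is the share of the total variance captured by the first PC, and the entries of `pc1` are the weights of the two variables on that component.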

Supervised learning

Supervised learning is most commonly used in classification and prediction problems. It has two sub-categories: regression and classification. Regression encapsulates several different methods where the output is continuous, while the output of classification models is discrete.

Random forest

Random Forest is a relatively modern tree-based classification technique, which was introduced by Breiman in 2001 [15]. It is an ensemble learning method that consists of multiple decision trees, where every tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [11]. The randomness injected into the training process means that the resulting trees will differ somewhat from one another. After training a certain number of trees, their outputs are combined to create a robust model with a low chance of over-fitting to the data. For classification problems, Random Forest receives a class vote from each tree and then makes a final decision based on the majority vote. The algorithm has the following steps:

1. Create sets of random samples from the input data.

2. Construct a decision tree for every sample.

3. Make predictions and gather votes from each tree.

4. Select the majority vote as the final model prediction.

The model is initiated by selecting a desired number of trees t and a maximum tree depth d. In general, increasing these parameters leads to a more complex model with higher accuracy at the cost of longer training time. It has been shown that the accuracy of Random Forest stabilizes at approximately 200 trees, after which the gains become negligible [11]. Another common application of Random Forest is feature selection, where the goal is to find the most important features in a data-set by evaluating the average impact of each feature across all trees in the model. The structure of Random Forest lends itself well to parallel execution and can also be modified to increase scalability [16].
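The bootstrap-and-vote structure of the four steps above can be sketched as follows. To keep the example short, each tree is replaced by a decision stump (a tree of depth one) and the random feature subsampling of a full Random Forest is omitted, so this is a simplified illustration rather than the PySpark implementation. All names and the toy data are assumptions.

```python
import random
from collections import Counter

def train_stump(samples):
    """Pick the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(samples[0][0])):
        for threshold in sorted({x[f] for x, _ in samples}):
            left = [y for x, y in samples if x[f] <= threshold]
            right = [y for x, y in samples if x[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, x):
    f, threshold, left_label, right_label = stump
    return left_label if x[f] <= threshold else right_label

def train_forest(samples, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]   # step 1: sample with replacement
        forest.append(train_stump(bootstrap))                # step 2: one tree per sample
    return forest

def predict_forest(forest, x):
    votes = Counter(predict_stump(stump, x) for stump in forest)  # step 3: gather votes
    return votes.most_common(1)[0][0]                             # step 4: majority vote

# Toy data: (features, label); the first feature separates the two classes.
samples = [((0.1, 5.0), "low"), ((0.2, 4.0), "low"), ((0.3, 6.0), "low"), ((0.4, 5.5), "low"),
           ((2.1, 5.2), "high"), ((2.5, 4.8), "high"), ((3.0, 6.1), "high"), ((2.2, 5.0), "high")]
forest = train_forest(samples)
```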


Multinomial logistic regression

Multinomial logistic regression is a classification method that expands upon standard logistic regression for problems with more than two classes. The model is based on the posterior probabilities of K classes, each with a separate linear function of a feature vector x multiplied by a corresponding regression coefficient β, where the probabilities together sum to one [11]. For the multi-class problem the model utilizes the softmax function, which is a generalization of the sigmoid function and is used to compute the probability of an item y belonging to a class c. One of the properties of the softmax function is that values close to the maximum are pushed towards 1, while values far from the maximum are pushed towards 0. Multinomial logistic regression can be formulated as the following log-linear model:

$$P(y_i = c) = \mathrm{softmax}(c, \beta_1 \cdot x_i, \dots, \beta_K \cdot x_i), \tag{2.1}$$

where the softmax function is defined as

$$\mathrm{softmax}(c, x_1, \dots, x_n) = \frac{e^{x_c}}{\sum_{i=1}^{n} e^{x_i}}. \tag{2.2}$$

For classification, the model chooses a class based on maximum likelihood, thus the value closest to one is the best possible choice. The model is based on the assumption that each variable in the model can be considered independent, meaning that the variables are not linear combinations of each other.
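A minimal sketch of this prediction step is given below. It computes one linear score per class, applies the softmax function and selects the most probable class. The coefficient values are arbitrary illustrative numbers, not fitted parameters, and subtracting the maximum score before exponentiation is a common numerical-stability measure that does not change the result.

```python
import math

def softmax(scores):
    """Softmax over a list of scores, shifted by the maximum for numerical stability."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(x, betas):
    """One linear score beta_k . x per class, turned into probabilities."""
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    return softmax(scores)

x = [1.0, 2.0]                                   # feature vector of one item
betas = [[0.5, 0.1], [0.2, 0.9], [-0.3, 0.0]]    # hypothetical coefficients, K = 3 classes
probs = class_probabilities(x, betas)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```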

Multilayer perceptron

A multilayer perceptron is a type of feed-forward neural network, which is a non-linear classification method [11]. These kinds of models are inspired by biological phenomena, where artificial neurons mimic the way that biological neurons are connected to process information in a brain. The multilayer perceptron is organized into distinct nodes and layers, where each node applies an activation function f, e.g. a Heaviside or sigmoid function. Each node in the network is connected with a specific weight w, and depending on the activation function the multilayer perceptron can perform different tasks such as classification or regression. The network is structured in three main layers: input layer, hidden layers and output layer. Nodes in the input layer correspond to the provided input data x, while all other layers consist of linear combinations of the input with the individual weights w and bias b for the nodes. This can be described as

$$y(\hat{x}) = f_K(\dots f_2(\hat{w}_2^T f_1(\hat{w}_1^T x + b_1) + b_2) \dots + b_K). \tag{2.3}$$

In classification, the multilayer perceptron uses back-propagation to train a model, which is a generalization of least-squares minimization. A general drawback of neural networks is that they are often complex and difficult to train because they have many parameters, which may lead to unstable optimization if initialized incorrectly [11]. An overly complex model with too many weights might also lead to over-fitting the model to the input data [11]. Thus, it is desirable to strike a balance, with enough hidden layers to make the model flexible enough to fit a specific problem and input. The network architecture used in this thesis is inspired by the U-net architecture, where the number of parameters is doubled for each level [17]. This process results in a network with four layers and the following structure:

1. First layer: same number of parameters as input variables, n.

2. Second layer: 2*n.

3. Third layer: n.
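The nested form of Equation 2.3 corresponds to a simple loop over layers. The sketch below shows the forward pass only (no back-propagation) for an input of size n = 2 that is expanded to 2n nodes and reduced back to n nodes. The weights are arbitrary illustrative values and the sigmoid activation in every layer is an assumption made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, bias, activation):
    """One layer: activation(W x + b), with W stored as a list of rows."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def forward(x, layers):
    """Apply the layers in order, feeding each output into the next layer."""
    for weights, bias, activation in layers:
        x = dense(x, weights, bias, activation)
    return x

n = 2
layers = [
    # n -> 2n nodes
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.1]], [0.0] * (2 * n), sigmoid),
    # 2n -> n nodes
    ([[0.2, -0.1, 0.3, 0.5], [-0.4, 0.1, 0.2, 0.0]], [0.1] * n, sigmoid),
]
output = forward([1.0, 0.5], layers)
```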


Performance metrics

In classification, the performance of the machine learning algorithms is evaluated by calculating the confusion matrix, which is a table containing cells for True Positives (TP), False Negatives (FN), False Positives (FP) and True Negatives (TN). Here TP and TN correspond to the number of positive and negative instances that are correctly classified, while FP and FN denote the number of misclassified negative and positive instances [18]. These values can be used to calculate precision and recall:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}. \tag{2.4}$$

Lastly, these two measures can be combined by calculating the harmonic mean of precision and recall, called the F1-score:

$$F_1\text{-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{2.5}$$
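As a small worked example, the sketch below computes the three metrics from confusion-matrix counts using the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts are made-up numbers chosen for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that are found by the classifier."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts for one class.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(tp, fp, fn)
```

For these counts the F1-score equals 2·TP / (2·TP + FP + FN) = 16/22, which lies between the precision of 0.8 and the recall of 2/3.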

Data normalization

Normalization is a common technique in data preprocessing for machine learning algorithms. This step is necessary for many problems where the features have different contributions to the ML model due to high variety in range and distributions.

Percentile scaling

In this thesis, some features have binomial or heavy-tailed distributions. In order to get a fair contribution to our ML models we employ percentile scaling, which scales all features to the range between zero and one. The new value is based on the rank of an element and the total number of elements in a feature vector. The formula used for feature normalization is given in Equation 2.6, where, for an element e, there are exactly r elements which are less than e and s elements which are equal to e, N is the total number of values for a specific feature and $\hat{x}$ is the normalized output of the scaling:

$$\hat{x}_k = \frac{r + 0.5\,(s - 1)}{N}, \qquad i = 0 \dots N. \tag{2.6}$$
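Equation 2.6 can be illustrated with a direct, quadratic-time sketch. It assumes that s counts the element e itself, so that a unique smallest value is mapped to zero. This reading of the definition, the function name and the toy values are assumptions made for the example.

```python
def percentile_scale(values):
    """Scale each value to (r + 0.5 * (s - 1)) / N based on its rank in the feature."""
    n = len(values)
    scaled = []
    for e in values:
        r = sum(v < e for v in values)    # number of elements smaller than e
        s = sum(v == e for v in values)   # number of elements equal to e (including e)
        scaled.append((r + 0.5 * (s - 1)) / n)
    return scaled

# Heavy-tailed toy feature: the outlier 1000 no longer dominates the scale.
scaled = percentile_scale([10, 20, 20, 1000])
```

Because only the rank of each value matters, the outlier is mapped to 0.75 and the two tied values share the same scaled value.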

2.7

Related work

This section covers the related work that has been done in the fields relevant to this thesis. The thesis expands on the findings of previous research in the area and proposes a more complex network model. The motivation behind this strategy is to be able to adapt to changes in network communication towards a more complex service-oriented structure and to identify a set of characteristic behaviors that may be used to simulate different traffic types. In this procedure we take advantage of the IP packet header log analysis framework suggested by Lin et al. We also adopt a combination of the traffic classification procedures used by Rojas et al. and Erman et al., where a four-layered model is created from time-based features. From the analysis framework a set of statistics is aggregated and processed with a feature selection technique and unsupervised learning to create groups that can be used to analyze and discover trends and characteristic behaviors for traffic flows in the network.

Big data analysis

Big data analysis is becoming an increasingly important way to discover network patterns and gather the knowledge needed to construct and verify the quality of traffic models. To meet the growing demands of 5G technology, Kan Zhen et al. propose a framework for big data-driven (BDD) optimization [19]. The framework consists of four parts: data collection, storage management, data analytics and network optimization. The authors present three case studies where BDD schemes can be utilized to improve the performance of mobile networks toward 5G. The paper also discusses the potential challenges of adopting BDD, such as data collection and the difficulty of processing large amounts of data due to limitations of available techniques.

IP packet header log analysis

Traffic logs contain much information that is relevant in many contexts, such as security and user and application behavior in a system [20, 21, 22]. However, as the complexity and amount of data increase, an efficient solution for processing the logs generated by the system becomes necessary. A log analysis system not only needs to handle massive and stable data processing but also needs to be flexible enough to account for different scenarios. Lin et al. propose a framework for log analysis that consists of a combination of Apache Spark and Hadoop [23]. Being able to process the data with a distributed file system and in smaller partitions in parallel is crucial for obtaining an acceptable execution time and not running out of memory when applying more complex algorithms. This is corroborated by a study by Ilias Mavridis et al., which shows how execution time decreases with the number of nodes in the cluster [24].

Machine learning in network analysis

There have been many prior studies with the aim of analyzing the behavior of network traffic. One example is the approach described by Rojas et al., where the authors propose a method to identify over-the-top (OTT) applications with statistical classification based on IP-log data, with the goal of establishing a personalized service degradation policy that limits the amount of data that can be transferred over a certain period [25]. The authors apply K-means to the data, which is clustered into three groups based on the level of energy consumption: low, medium and high, where each cluster is assigned a specific policy. The results are validated using eight different machine learning classification algorithms. From the analysis, the authors present a table with policy recommendations for each application.

Another approach for traffic classification, described by Li Wei et al., utilizes semi-automated machine learning. The proposed framework consists of a series of mechanisms that are applied to find the most relevant parameters from network data to use in classification [26]. The data-set used contains 248 different features, which are reduced to a subset by applying a correlation-filtering method to the data. The resulting subset has ten behavior features and two features for specific port numbers. These parameters are used as input for classification models on a hand-labeled data-set with ten different application classes. All of the models tested achieve an accuracy of over 90 percent on the classification problem for all except one of the application classes.

Others have used multi-fractal analysis to extract timing-based features to better classify individual encrypted traffic flows [27, 28]. The evolution of network technology means that many previous methods, such as Deep Packet Inspection (DPI) and port-based classification, are no longer reliable options, because applications use dynamic port numbers and encryption to avoid detection. The authors show that it is possible to apply a feature selection technique like PCA and man-in-the-middle (MITM) based flow labeling to achieve high accuracy for traffic type classification with time-based information for traffic flows. Furthermore, Erman et al. suggest that an alternative approach to traffic classification is to exploit the distinctive characteristics of applications [29]. For this purpose the authors show that the clustering algorithms K-means and DBSCAN can be used to identify these behaviors in a network environment.


3

Method

This chapter describes the method that was used during the thesis work to answer the research questions.

3.1

Investigation of application behavior

A starting point for this thesis is the basic assumption that mobile applications generate traffic patterns that may be utilized as a basis for similarity measures. To confirm the validity of this approach to network modeling, it is thus necessary to find traffic models that can be used to describe the behavior of an application in the network over time. This is closely connected to the first research question from section 1.4 in the introduction and it is a non-trivial task that requires several steps to investigate:

1. Data-processing from IP packet header logs.

2. Feature selection to determine the importance of each characteristic on the network behavior definition.

3. Clustering of traffic data based on characteristic behaviors and analysis of the resulting clusters.

4. Definition of traffic models based on the clusters' characteristic behaviors and their correlation with application types.

Traffic characteristics

A proposed holistic model of traffic characteristics is to investigate behavior at the packet, packet burst, connection and flow levels that occur in the lifetime of an application. Each of these levels can be considered a building block, i.e. a burst contains a number of packets and the size of the burst is the sum of the packet sizes in that burst. Additionally, we consider the possible connections for an application, where a connection is defined as a link between an IP address and a port. An application may have several connections and traffic flows active during its lifespan, and each one can be further divided into sections based on specified threshold values.


3.2

Network traffic definition

In communication networks, traffic consists of a mixture of both devices and services. A UE can be any kind of device that connects to the network (phone, iPad, computer, etc.), while a service is something that can run on the device, such as Chrome, YouTube, or Skype. Going even further, a service may have multiple applications running in the background; for example, Skype can be used for both VoIP and chat at the same time. It is possible that each of these underlying functions affects the traffic pattern of the service as a whole. Finding similar patterns in traffic flow between applications therefore first requires establishing characteristics that are sufficient for describing the behavior of an application in the network. Since IP packet header logs only contain a single line of information per packet, these layers have to be created by a traffic model designer. For this reason, the model that we propose for capturing application behavior utilizes several different abstraction layers. The model used in this thesis has a four-layered structure, which can be seen in Figure 3.1.

Figure 3.1: Proposed stack layers in network model.

The connection between UE, application and each network layer is depicted in Figure 3.2. Here we can see the process of running multiple applications in parallel from the UE perspective. To accurately model the traffic flow for these separate entities, it is necessary to examine network behavior with increased granularity and separate the ongoing traffic-generating processes from one another. Traffic for the network layers is separated by grouping on the unique combinations of IP addresses and ports. The interaction between a UE and a server consists of many flows and connections. Applications in the network are characterized by their use of multiple flows for underlying processes, where a UE IP address can be matched with various server IPs that carry out specific tasks.

A flow combination is created from an IP tuple, $(IP_n, IP_m)$, where the UE can be connected to several different servers for the same application, which results in 1, ..., n possible flows per UE. The second group of network traffic in our model is IP-port connections, referred to as connections in this thesis, which in addition to IP addresses also include ports for both UE and server. This results in a unique four-tuple, $((IP_i, P_j), (IP_m, P_n))$, which allows for more variation. A connection has an exact origin and destination, and it is possible to have many connections for a UE that only differ in one of the elements of the four-tuple.

Proceeding to an even deeper network level, we can zoom into each flow and connection. Here we see how the information exchange is made up of packet deliveries between a UE and a server. Each set of packet transmissions exists within a packet burst, and it is possible to have an arbitrary number of bursts and packets for a flow or connection; this is denoted as 1, ..., n in the figure. The full perspective of the network model is gained by combining the information from each level. This results in a four-layered model consisting of packets, packet bursts, connections and flows, which can be used to investigate traffic characteristics at different abstraction levels of the network. Information for the packet level is directly available in traffic logs and can be used as a starting point to aggregate information for each of the higher levels in the hierarchy of the model.
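The separation of packets into flows and connections amounts to grouping on two different keys. The sketch below groups hypothetical packet records by the IP tuple (flows) and by the IP and port four-tuple (connections). The dictionary field names and the addresses are assumptions for illustration and do not reflect the actual log schema. The thesis pipeline performs this kind of grouping in PySpark.

```python
from collections import defaultdict

def group_packets(packets):
    """Group packets into flows (IP tuple) and connections (IP and port four-tuple)."""
    flows = defaultdict(list)
    connections = defaultdict(list)
    for p in packets:
        flow_key = (p["ue_ip"], p["server_ip"])
        connection_key = ((p["ue_ip"], p["ue_port"]), (p["server_ip"], p["server_port"]))
        flows[flow_key].append(p)
        connections[connection_key].append(p)
    return flows, connections

# Hypothetical packets from one UE talking to two servers.
packets = [
    {"ue_ip": "10.0.0.1", "ue_port": 1133, "server_ip": "192.0.2.10", "server_port": 443},
    {"ue_ip": "10.0.0.1", "ue_port": 1134, "server_ip": "192.0.2.10", "server_port": 443},
    {"ue_ip": "10.0.0.1", "ue_port": 1133, "server_ip": "192.0.2.20", "server_port": 80},
    {"ue_ip": "10.0.0.1", "ue_port": 1133, "server_ip": "192.0.2.10", "server_port": 443},
]
flows, connections = group_packets(packets)
```

In this example the first server is reached over two connections that differ only in the UE port, so the four packets form two flows but three connections.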

The structure of the proposed network model is highly flexible, and layers may be added or removed based on preference. The goal of creating this kind of network model is that traffic models may be created based on different priorities of network properties. An example of a network property that is specific to this thesis is the connection layer, which is based exclusively on TCP connections for device communication over the Internet and does not include UDP logic.

[Figure: hierarchy of a UE running applications A1 ... An; each application has flows F1 ... Fn, identified by (IPn, IPm), and connections C1 ... Cn, identified by ((IPi, Pj), (IPm, Pn)); each flow and connection contains packet bursts PB1 ... PBn, which in turn contain packets P1 ... Pn.]

Figure 3.2: UE and application traffic for network levels.

Time perspective

Each packet has a certain size in bytes and follows the structure described in Section 2.5. A packet burst consists of a sequence of consecutive packet exchanges for a UE and server combination and is concluded when a threshold on the idle time between packets is met. More specifically, we compare the inter-arrival time (IAT) between two packets with arrival times $t_i$ and $t_{i+1}$ with a threshold $\delta$. Whenever $t_{i+1} - t_i \le \delta$, we say that both packets belong to the same packet burst. With this definition, packet bursts (e.g., numbered with index j) are separated by an IAT greater than $\delta$. Each such arrival instance therefore increases the burst counter j, which results in a time section that contains a number of packets. The definition used for the threshold can be found in Equation 3.1, where t is time and PB is short for packet burst.

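$$
\begin{aligned}
& IAT = t_{i+1} - t_i, \qquad t_i = \text{time of reception of packet}_i \\
& j = 0 \\
& \text{if } IAT > \delta: \\
& \qquad \forall\, \text{packet} \rightarrow PB_j \\
& \qquad j \mathrel{+}= 1
\end{aligned}
\tag{3.1}
$$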
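One way to read Equation 3.1 as code is sketched below. Every packet receives the index j of its burst, and j is incremented whenever the IAT to the previous packet exceeds δ. The burst size is then the sum of the packet sizes in the burst. The function names and the sample values are illustrative assumptions.

```python
def assign_bursts(timestamps, delta):
    """Return the burst index j for every packet (timestamps sorted, in seconds)."""
    burst_ids, j = [], 0
    for i, t in enumerate(timestamps):
        if i > 0 and t - timestamps[i - 1] > delta:
            j += 1                      # IAT above the threshold: a new burst starts
        burst_ids.append(j)
    return burst_ids

def burst_sizes(burst_ids, sizes):
    """Burst size in bytes: the sum of the packet sizes within each burst."""
    totals = {}
    for j, size in zip(burst_ids, sizes):
        totals[j] = totals.get(j, 0) + size
    return totals

timestamps = [0.00, 0.02, 0.05, 0.40, 0.45, 1.20]   # hypothetical arrival times [s]
sizes = [61, 47, 1400, 42, 1400, 61]                # hypothetical packet sizes [B]
ids = assign_bursts(timestamps, delta=0.1)
totals = burst_sizes(ids, sizes)
```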

In a similar fashion, thresholds can be used to define more complex structures and capture network behavior over a longer interval. By extending the above burst logic to higher levels in our network model we get a definition for "sessions", which are time slices connected to a unique identifier and threshold. The idea is that these time intervals can be used to divide the packet exchange into sections that should preserve some specific network characteristics. For the connection and flow levels, the thresholds are approximated from a designer's perspective with background from network theory. Figure 3.3 illustrates how we define packet bursts, connection sessions, and flows based on inactive time at the packet level. We next describe each of these key concepts one by one:


1. Flow: As we only collect packet level data it is difficult to determine the exact life time of an end-to-end flow. Technically, an application for a UE may remain inactive in the background for long periods of time. The flow lifetime is therefore unknown.

2. Flow session: We define a flow session using a single flow session threshold $t_{fs}$. In particular, a flow session is deemed to terminate as soon as no packets have been observed for the flow over the past $t_{fs} = 80$ s. The decision to use 80 s as a threshold for flow sessions is based on observations from the Cumulative Distribution Function (CDF) of the packet IAT, which can be found in Figure 4.2b. From this study we aim to select the threshold that captures 98 percent of the total packet IAT values.

3. Connection session: A connection session is defined by applying a connection threshold $t_{cs}$. For this threshold we use $t_{cs} = 0.5$ s, which is based on the assumption that whenever a TCP connection is idle for longer than the Retransmission Time-Out (RTO), it enters slow start, which is also done at the beginning of every connection. With this threshold assumption, we make the distinction that whenever a connection is inactive for more than 0.5 s, it has experienced an RTO and can be considered a new network connection.

4. Packet burst: We define a packet burst to last for as long as there are packet exchanges at a certain rate. A packet burst ends when the threshold $t_b = 0.1$ s is exceeded. The motivation for using a threshold of 0.1 s for packet bursts is that this is a very commonly utilized interval in network analysis, mainly due to buffering and radio condition issues that make it necessary to bundle packets into bursts.

[Figure: four aligned timelines (Time [s]) showing packets, packet bursts (b1 to b4), connection sessions (cs1 to cs3) and flow sessions (fs1, fs2).]

Figure 3.3: Visualization of how thresholds are used to divide packets into network-layer segments.
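The segmentation illustrated in Figure 3.3 can be sketched by applying the same idle-time rule with the three thresholds from the list above. As a simplification, the example applies all thresholds to a single timestamp sequence, i.e. it assumes one connection within one flow. In the model itself, flows and connections are first separated by their IP tuple and four-tuple. The names and sample timestamps are illustrative assumptions.

```python
# Thresholds in seconds, taken from the definitions above.
THRESHOLDS = {"packet_burst": 0.1, "connection_session": 0.5, "flow_session": 80.0}

def segment_ids(timestamps, threshold):
    """Start a new segment whenever the idle time between packets exceeds the threshold."""
    ids, current = [], 0
    for i, t in enumerate(timestamps):
        if i > 0 and t - timestamps[i - 1] > threshold:
            current += 1
        ids.append(current)
    return ids

def label_levels(timestamps, thresholds=THRESHOLDS):
    """Segment index of every packet on each level of the model."""
    return {level: segment_ids(timestamps, limit) for level, limit in thresholds.items()}

timestamps = [0.0, 0.05, 0.3, 1.0, 1.05, 100.0]   # hypothetical arrival times [s]
levels = label_levels(timestamps)
```

Since the thresholds increase from level to level, every packet burst lies within exactly one connection session, and every connection session within exactly one flow session.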

Data pre-processing

Each event in the communication network is documented in IP packet header logs, which contain all of the data necessary to monitor network traffic. These measurements are made at the packet-level and include recordings of timestamps, IP addresses, size, transport direction, protocol and more. Below we provide a small sample with the first five lines from an IP packet header log test-file, where all IP-addresses have been replaced to preserve user anonymity.


2. 1509373886.128775,u,"xx.xxx.xx.xxx",1133,"yy.yyy.yy.yyy",1024,47,d,454063850073515,1234567,geran,,454:6:40:11492
3. 1509373887.022907,u,"xx.xx.xxx.xxx",1024,"yy.yyy.yy.yyy",1133,61,u,454063850073515,1234567,geran,,454:6:40:11492
4. 1509373887.319837,u,"xx.xxx.xx.xxx",1133, ,"yy.yyy.yy.yyy",1024,47,d,454063850073515,1234567,geran,,454:6:40:11492
5. 1509373888.199975,u,"xx.xx.xxx.xxx",1024,"yy.yyy.yy.yyy",1133,42,u,454063850073515,1234567,geran,,454:6:40:11492

The server-client mapping that we use depends on the RAN direction, which means that uplink is up to the RAN and downlink is down to the UE. The logs contain much information, but not everything is important in terms of traffic modeling. To obtain data entries that match the established traffic characteristics, we remove the columns that are not relevant to the theoretical model established in Section 3.1. Each layer has a number of statistics that are used to model traffic behavior; a brief explanation of the parameter categories is listed below:

1. Flow session level info: Identified by the UE and server IP tuple. Statistics are aggregated after a flow burst, i.e. when the inter-arrival time exceeds the flow session threshold tfs.

2. Connection session level info: Identified by the UE and server IP and port four-tuple. Statistics are aggregated after a connection burst, i.e. when the inter-arrival time exceeds the connection session threshold tcs.

3. DL packet burst level info: Statistics aggregated at the packet-burst level in the downlink direction, i.e. when the inter-arrival time exceeds the packet-burst threshold tb.

4. UL packet burst level info: Statistics aggregated at the packet-burst level in the uplink direction, i.e. when the inter-arrival time exceeds the packet-burst threshold tb.

5. Packet burst level info: Statistics aggregated after a packet burst, i.e. when the inter-arrival time exceeds the packet-burst threshold tb.

6. Packet level info: Parameters gathered directly from packets in the IP packet header logs: timestamp, packet size and IAT.
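The difference between the two session identifiers can be sketched as follows. This is a simplified illustration, not the PySpark implementation: the record layout, the addresses, and the threshold values T_FS and T_CS are assumptions chosen for the example, and each level is sessionized independently rather than nested.

```python
from collections import defaultdict

# Hypothetical packet records:
# (timestamp, ue_ip, ue_port, srv_ip, srv_port, size_bytes)
packets = [
    (0.00, "10.0.0.1", 1133, "10.0.0.9", 1024, 47),
    (0.90, "10.0.0.1", 1133, "10.0.0.9", 1024, 61),
    (1.20, "10.0.0.1", 2000, "10.0.0.9", 1024, 42),
    (9.00, "10.0.0.1", 1133, "10.0.0.9", 1024, 47),
]


def flow_key(pkt):
    """UE and server IP tuple identifies a flow."""
    return (pkt[1], pkt[3])


def conn_key(pkt):
    """UE and server IP and port four-tuple identifies a connection."""
    return (pkt[1], pkt[2], pkt[3], pkt[4])


def sessionize(pkts, key_fn, threshold):
    """Group packets by key; open a new session when the IAT exceeds threshold."""
    sessions = defaultdict(list)
    for pkt in sorted(pkts):  # tuples sort by timestamp first
        runs = sessions[key_fn(pkt)]
        if runs and pkt[0] - runs[-1][-1][0] <= threshold:
            runs[-1].append(pkt)
        else:
            runs.append([pkt])
    return dict(sessions)


def aggregate(session):
    """Per-session statistics: packet count, volume and duration."""
    return {
        "packets": len(session),
        "volume": sum(p[5] for p in session),
        "duration": session[-1][0] - session[0][0],
    }


T_FS = 5.0  # assumed flow session threshold in seconds (illustrative only)
T_CS = 1.0  # assumed connection session threshold in seconds (illustrative only)

flow_sessions = sessionize(packets, flow_key, T_FS)
conn_sessions = sessionize(packets, conn_key, T_CS)
flow_stats = [aggregate(s) for s in flow_sessions[("10.0.0.1", "10.0.0.9")]]
```

With these made-up packets, the single UE and server pair yields two flow sessions, while the four-tuple separates the traffic on port 2000 into its own connection.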

After choosing threshold values to split the flows into flow sessions, the data is loaded into PySpark and statistics are aggregated for each parameter. To ensure that the entire spectrum of parameter characteristics is preserved, we store values in a four-item array containing the 20th, 50th, 80th and 100th percentiles of a specific feature. This approach is necessary since most parameters from the network have distributions that are either binomial, with only a few peaks for distinct values, or heavy-tailed, with a couple of values that are significantly larger than the rest. Using the average or standard deviation for a feature would, in these instances, misrepresent the parameter behavior. Furthermore, the proposed method allows for more complex behavior combinations for different parameter percentiles. The statistics are collected as separate Parquet files that are combined at the end of the pre-processing script, and the combined result is used as input for the data analysis. This procedure results in the set of statistics shown in Table 3.1.
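A small sketch shows why the four-percentile array is more informative than the mean for heavy-tailed features. It uses the nearest-rank percentile definition, which is an assumption; the percentile estimator used in the PySpark script is not specified in the text, and the packet sizes are made up.

```python
import math
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values, levels=(20, 50, 80, 100)):
    """Four-item summary of a feature, one entry per percentile level."""
    return [percentile(values, q) for q in levels]


# Made-up packet sizes with one very large value (heavy tail).
packet_sizes = [40, 40, 42, 47, 47, 52, 61, 61, 64, 1500]

summary = summarize(packet_sizes)
print("percentile summary:", summary)
print("mean:", mean(packet_sizes))
```

In this example the mean is pulled above the 80th percentile by a single large packet, whereas the percentile array keeps both the typical sizes and the maximum visible.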

3.3 Data analysis

To ensure that the network traffic definition from Section 3.1 approximates network traffic behavior with sufficient accuracy, a thorough analysis of a large amount of the IP packet header log data is needed. To do this we employ CRISP-DM, a well-established data-mining framework that has been used to solve similar problems in the past [30].


Table 3.1: Format of statistics aggregated from IP-log data.

Flow identifiers  Description                                                       Type
UE_IP             UE IP-address                                                     String
srv_IP            Server IP-address                                                 String
fs_ID             Unique identifier for a UE and server combination                 String

Flow parameters   Description                                                       Type
prov              List of service providers                                         Array[String]
app               List of client-applications                                       Array[String]
pr                Transport protocol                                                String
O                 List of overlapping flow sessions                                 Array[String]
V                 Flow session total Volume                                         Int
Dv                List of Duration for volume ramp-up of the flow session           Array[Float]
cs_N              Number of connection sessions in a flow session                   Int
cs_V              List of total volumes of the underlying connection sessions       Array[Int]
cs_D              List of session durations of the underlying connection sessions   Array[Float]
cs_P              List of number of packets in the underlying connection sessions   Array[Int]
cs_O              List of total Volumes for underlying connection sessions          Array[Int]
cs_IAT            List of inter-arrival times for underlying connection sessions    Array[Float]
pb_N              Number of packet bursts in the particular flow session            Int
pb_V              List of total Volumes of the underlying packet bursts             Array[Int]
pb_D              List of session Durations of the underlying packet bursts         Array[Float]
pb_P              List of number of packets in the underlying packet bursts         Array[Int]
pb_Dl             List of number of dl bursts in the underlying packet bursts       Array[Int]
pb_Ul             List of number of ul bursts in the underlying packet bursts       Array[Int]
pb_IAT            List of inter-arrival times for underlying packet bursts          Array[Float]
dl_N              Number of dl bursts in the particular flow session                Int
dl_V              List of total Volumes of the underlying dl bursts                 Array[Int]
dl_D              List of session Durations of the underlying dl bursts             Array[Float]
dl_P              List of number of packets in the underlying dl bursts             Array[Int]
dl_O              List of dl overlapping bursts                                     Array[Int]
ul_N              Number of ul bursts in the particular flow session                Int
ul_V              List of total Volumes of the underlying ul bursts                 Array[Int]
ul_D              List of session Durations of the underlying ul bursts             Array[Float]
ul_P              List of number of packets in the underlying ul bursts             Array[Int]
ul_O              List of ul overlapping bursts                                     Array[Int]
p_N               Number of packets in the particular flow session                  Int
p_S               List of packet sizes                                              Array[Int]
p_IAT             List of inter-arrival times for packets                           Array[Float]
O_N               Number of overlapping flow sessions                               Short

CRISP-DM

The Cross-Industry Standard Process for Data Mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is one of the most widely used models in data analytics. CRISP-DM [31] breaks down the data-mining process into six phases, which can be seen in Figure 3.4.


Figure 3.4: CRISP-DM process [31].

The list below describes how each step relates to the analysis process in this thesis:

1. Business Understanding: Introduction to company, colleagues and problem.

2. Data Understanding: Explore content of data, create plots, interpret results.

3. Data Preparation: Preprocessing in PySpark to aggregate statistics for network data.

4. Modeling: Sample data, apply machine learning models, plot and interpret results.

5. Evaluation: Create groupings based on observed behavior, evaluate how well machine learning classifiers perform with this gold standard.

6. Deployment: Deliver results from analysis, motivate choice of parameters for creating a traffic model.

Analysis method

After aggregating values for characteristics at the different levels of the network model, the data is normalized with percentile scaling, which adjusts the data to the range of zero to one. The mathematical formula used for normalization can be found in Equation 2.6. This is done to ensure that different metrics and scaling of individual features do not affect the results. Exceptions to this method are protocols, which are defined as zero for UDP and one for TCP, and null-values, which are represented by minus one. Using extreme values enables us to more easily identify these tendencies with machine learning algorithms; an example of this is the heat-map in Figure 4.6, where null-values result in black boxes. Packet volume is especially heavy-tailed with large values and needs to be normalized for a fair comparison with the other features. Following the feature selection process we split the data into two halves: one is used in clustering analysis and the other in traffic classification.
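The encoding rules above can be sketched in a few lines. Note the assumption: Equation 2.6 is not reproduced in this section, so plain min-max scaling is used here as a stand-in that also maps values to the range of zero to one; it may differ from the percentile scaling actually applied. The function name and the sample values are hypothetical.

```python
def scale_feature(values):
    """Scale non-null values to [0, 1]; null-values are encoded as -1.

    Min-max scaling is an assumed stand-in for Equation 2.6.
    """
    present = [v for v in values if v is not None]
    if not present:
        return [-1.0] * len(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    scaled = []
    for v in values:
        if v is None:
            scaled.append(-1.0)            # null-value marker from the text
        elif span == 0:
            scaled.append(0.0)             # constant feature carries no spread
        else:
            scaled.append((v - lo) / span)
    return scaled


# Protocol encoding as described in the text: zero for UDP, one for TCP.
PROTOCOL_CODE = {"UDP": 0.0, "TCP": 1.0}

# Made-up flow volumes with one missing entry.
volumes = [100, None, 300, 500]
scaled_volumes = scale_feature(volumes)
protocols = [PROTOCOL_CODE[p] for p in ["TCP", "UDP", "TCP", "TCP"]]

print(scaled_volumes)
print(protocols)
```

Keeping the null marker outside the zero-to-one range means a missing value can never be confused with a genuinely small one after scaling.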

Analysis of traffic behavior is done by applying a K-means clustering model to the data structure. If the modeling approach works as expected, groupings with similar characteristics will be formed. To get the optimal number of clusters to fit the data we utilize the Elbow method, which aims to minimize the within-cluster sum of squared errors, and we maximize the Silhouette Score, which ranges from -1 to 1. From the clustering we receive groups of packets from the network data with similar characteristic behavior. Validation of the process
