Study of TCP Available Bandwidth Using NS2 and Its Forecasting Based on Genetic Algorithm


Cristian Hernandez Benet

Francisco Domingo Sanchez Vizcaino

Faculty of Health, Science and Technology Computer Science

30 ECTS

Andreas Kassler and Enrica Zola Donald Ross

140603


This thesis is submitted in partial fulfilment of the requirements for the Master’s degree in Computer Science. All material in this thesis which is not our own work has been identified, and no material is included for which a degree has previously been conferred.

Cristian Hernandez Benet

Francisco Domingo Sanchez Vizcaino

Approved, June 03, 2014

Opponent: Jonathan Vestin

Advisor: Andreas Kassler Co-advisor: Enrica Zola

Examiner: Donald Ross


Abstract

On the one hand, the available bandwidth in a bandwidth-limited medium such as the wireless medium is a topic of study in high demand. On the other hand, the Transmission Control Protocol (TCP) is one of the most widely used transport protocols on the Internet. Together, available bandwidth and TCP constitute the most typical scenario in Wireless Local Area Networks (WLAN). This thesis places the study in the 2.4 GHz frequency band, where Primary Users can be present and modify the behaviour of the WLAN medium. This band is unlicensed and, as a consequence, congestion is considerable. Nowadays, several studies of this band deal with Cognitive Users operating in it. This thesis, however, studies the impact of Primary Users on the TCP available bandwidth in a classic IEEE 802.11g WLAN. The second part of the dissertation takes a step forward: a tool to forecast the TCP available bandwidth in a WLAN, based on a genetic algorithm, has been developed. This tool estimates the future available bandwidth by finding the function that best fits the future behaviour of the network. A genetic algorithm programmed specifically for this purpose finds this function. A significant number of tests have been carried out in the study. The TCP available bandwidth study shows a relation between MAC busy time and available bandwidth in some cases. In addition, the study shows that the TCP available bandwidth increases if the idle periods are longer. Reliable forecasting results have been achieved, with limitations in some specific scenarios.


List of Figures

Figure 1: General diagram of the thesis work

Figure 2: MAC layer throughput Max rate vs throughput [18]

Figure 3: Evolution of TCP's congestion window [23]

Figure 4: Example phenotype and genotype

Figure 5: Genetic algorithm structure [37]

Figure 6: One point crossover example

Figure 7: Mutation example

Figure 8: From parents to offspring process

Figure 9: Example of Lorenz chaotic attractor [53]

Figure 10: Binding between C++ and OTcl

Figure 11: NS2-CRAHN schema [4]

Figure 12: PU-log file first part

Figure 13: ON-OFF model

Figure 14: Example of ON-OFF model distribution in time

Figure 15: ON-OFF distribution programmed in NS2-CRAHN

Figure 16: TCP throughput for different alpha beta combinations for CRAHN [4]

Figure 17: Map of Scenario 1


Figure 20: Simulation overview diagram

Figure 21: Primary Users flow chart and location

Figure 22: MAC busy time

Figure 23: GA general outline

Figure 24: Encoded space and solutions space

Figure 25: Shifted window and training set

Figure 26: Chromosome generator function

Figure 27: Fitness calculation

Figure 28: Elitism process

Figure 29: Mating pool creation

Figure 30: Effect of the SP on the probability

Figure 31: Effect of C on the probabilities

Figure 32: Ranking vs Exponential

Figure 33: Three selection method example

Figure 34: New population

Figure 35: Best chromosome for the prediction

Figure 36: Available bandwidth for 50% PU ON

Figure 37: Available bandwidth for 25% PU ON

Figure 38: Available bandwidth for alpha equal to 0.0768 and different beta values

Figure 39: Available bandwidth for alpha equal to 2.8 and different beta values

Figure 40: 3D graphs with the available bandwidth for different alpha/beta combinations and non-random patterns

Figure 41: 3D graph with the MAC business for different alpha/beta combinations and non-random patterns

Figure 42: 3D graph with the MAC business and normalized throughput different alpha/beta combinations and non-random patterns

Figure 43: Available bandwidth for 50% ON and random generation

Figure 44: Available bandwidth for 25% ON and random generation


Figure 46: 3D graphs with the available bandwidth for different alpha/beta combinations and random patterns

Figure 47: 3D graph with the MAC business for different alpha/beta combinations and random patterns

Figure 48: 3D graph with the MAC business and normalized throughput different alpha/beta combinations and random patterns

Figure 49: alpha=1.5 and beta=0.5

Figure 50: alpha=1.5 and beta=0.5

Figure 51: Throughput over time of the real traffic played back

Figure 53: Fitness selection evaluation for the same initial population with periodic and symmetric function

Figure 54: Fitness selection evaluation for random initial population with a periodic and symmetric function

Figure 55: Results for the selection method with and without diversity for a periodic and symmetric function

Figure 52: Cosine function

Figure 56: Throughput from non-random pattern with alpha 2.2 and beta 1.1

Figure 57: Fitness selection evaluation for the same initial population with Non-random ON-OFF pattern

Figure 58: Best and worst prediction equation 2 for the same initial population

Figure 59: Example of Equation 2 comparing the MSE

Figure 60: Fitness selection evaluation for random initial population with non-random ON-OFF pattern

Figure 61: Random initial population for non-random alpha 2.2 and beta 1.1 with MAPE less than 100%

Figure 62: Results for the selection method with and without diversity for non-random ON-OFF pattern

Figure 63: Throughput from a random pattern with alpha 2.2 and beta 0.08

Figure 64: Training set and prediction zone for random traffic with alpha 2.2 and beta 0.08


Figure 66: Fitness selection evaluation for the same initial population with Random ON-OFF pattern (with errors criterion)

Figure 67: Fitness selection evaluation for random initial population with random ON-OFF pattern

Figure 68: Results for the selection method with and without diversity for random ON-OFF pattern

Figure 69: Throughput from the real traffic simulated in NS2

Figure 70: Training set area and prediction area for real traffic with scaled values

Figure 71: Fitness selection evaluation for the same initial population with real traffic

Figure 72: Fitness selection evaluation for random initial population with real traffic

Figure 73: Results for the selection method with and without diversity for real traffic

Figure 74: Random activity pattern for an alpha 2.6 and beta 1.3

Figure 75: Graphs of TCP throughput over time for 50% PU ON

Figure 76: TCP throughput heat-map

Figure 77: UDP throughput heat-map


List of Tables

Table 1: MAC layer throughput [18]

Table 2: Genetic Algorithm vocabulary [36, p. 7]

Table 3: Example of reconstructed vector

Table 4: Example training set window

Table 5: Chromosome set for a two chromosome population and a training set of ten

Table 6: Genotype and phenotype of the chromosome set

Table 7: Main parameters

Table 8: Set of graphs for the user

Table 9: Prediction and statistical results

Table 10: Test results for Scenario 2

Table 11: Test results for Scenario 3

Table 12: Nomenclature of the fitness equations

Table 13: Nomenclature of the selection method

Table 15: GA set-up for the cosine function without diversity and same initial population

Table 14: Example 1 for a cosine function

Table 18: Parameters GA for non-random using same initial population


Table 22: Best and worst results from random and same initial population

Table 24: Best prediction example for a non-random traffic with ON-OFF pattern

Table 25: GA set-up for the random traffic alpha 2.2 and beta 0.08 with the same initial population

Table 28: Best prediction example for a random traffic with ON-OFF pattern

Table 29: GA set-up for the real traffic scenario with the same initial population

Table 31: Best and worst results from random and same initial population with real traffic

Table 33: Best prediction example for a real traffic played back in NS2

Table 34: Frame parameters of 802.11g [15]

Table 35: Timing parameters of IEEE 802.11g standard [15]

Table 36: Throughput over time, Congestion window, Current RTO multiplicative factor and Slow-start threshold for alpha 1.5 and beta 0.5

Table 37: Throughput over time, Congestion window, Current RTO multiplicative factor and Slow-start threshold for alpha 0.5 and beta 0.1

Table 38: Throughput over time, Congestion window, Current RTO multiplicative factor and Slow-start threshold for alpha 1.5 and beta 0.1

Table 39: Results of equation 2 with same initial population with non-random alpha 2.2 and beta 1.1

Table 40: Average of equation 2 resulting from running the GA 50 times

Table 41: General fitness functions results

Table 42: General results for the diversity and selection method

Table 43: MAPE best and worst results


Chapter 1 Introduction

1.1 Project overview

The available bandwidth at the Transmission Control Protocol (TCP) layer is a very important topic in Wireless Local Area Networks (WLAN). This thesis places the study of the available bandwidth in the 2.4 GHz band, also called the Industrial, Scientific and Medical (ISM) band [1]. This band is unlicensed and, as a consequence, congestion is considerable. In addition to the WLAN users, also called Secondary Users, Primary or Licensed Users can use the 2.4 GHz frequency band.

Primary Users (PUs) are users with priority to operate in the band, and they interfere with Secondary Users, leading to the loss of all their data. The Primary Users use the medium without taking into account the Secondary Users in the network, thereby interfering with them. Nowadays, several studies deal with Cognitive Users [2] operating in this band [3] [4] [5] [6]. However, this thesis studies the impact of these Primary Users on the TCP available bandwidth in a classic IEEE 802.11g WLAN. IEEE 802.11 does not define any Primary-Secondary user management. IEEE 802.11 uses Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA), which gives all users an equal opportunity to access the network. However, Primary Users do not necessarily have to implement a similar strategy. Therefore, this project considers that if a Primary User is active in the wireless medium, all the Secondary Users in the same frequency band and inside the transmission range of the Primary User will lose all their sent packets due to interference. The activity of the Primary Users is defined by means of ON-OFF patterns governed by a birth-death Markov process. The use of these patterns, which are completely defined by only two parameters (alpha and beta), offers the advantage of covering a wide range of cases in an easily controlled way. A classic situation of a WLAN with only Secondary Users, with and without the “hidden node problem”, is also studied. Some interesting conclusions have been drawn.

The second part of the thesis takes a step forward: a tool to forecast the TCP available bandwidth for WLAN by means of a genetic algorithm (GA) has been developed. This tool is able to estimate the future available bandwidth finding the best function that will fit better to the future behaviour of the network. A genetic algorithm that has been specifically programmed for this purpose finds this function. This GA has been tested in different scenarios simulated in NS2.

Furthermore, different GA functionalities such as fitness equations, selection methods and diversity methods are studied. These functionalities are tested and evaluated for the scenarios proposed. Figure 1 depicts a very general outline of the most important work process done in this thesis.

[Diagram blocks: NS2 simulation set-up; PU activity pattern; NS2 traffic trace; Simulation of TCP available throughput; TCP available throughput and MAC busy time calculation; Genetic algorithm set-up; Find the best function for forecasting; Forecasting of TCP available throughput]

Figure 1: General diagram of the thesis work


1.2 Motivation

On the one hand, the available bandwidth in a bandwidth-limited medium such as the wireless medium is a topic of study in high demand. On the other hand, the Transmission Control Protocol (TCP) is one of the most widely used transport protocols on the Internet. Together they constitute the most typical scenario in Wireless Local Area Networks (WLAN). This thesis goes a step beyond the typical WLAN situation by placing the study in a very congested frequency band where Primary Users (PU) can be present and modify the behaviour of the WLAN medium in a particular way. Most current studies deal with Cognitive Users operating in this band. However, this thesis studies the impact of the Primary Users on the TCP available bandwidth in a classic IEEE 802.11g WLAN. These reasons motivated this team to analyse the performance of TCP in terms of available bandwidth in these situations and to develop a powerful tool to forecast the available bandwidth in the immediate future of the WLAN environment.

This study could help in the future to identify which types of PU traffic are most detrimental to a WLAN. In addition, it might be possible to estimate the available bandwidth for a user that wants to join a network. This might be done by identifying the pattern of the PU activity in the network and then looking up the corresponding available bandwidth for that pattern in the presented results. Furthermore, the study of the effects of the PUs on the Secondary Users could help to understand the effect of the technologies that currently use the ISM band [7], such as Bluetooth, ZigBee, WiMAX and DECT, among others. The genetic algorithm could help to forecast the available bandwidth in those situations, so that this information can be used to take the appropriate decisions according to the specific network requirements.

1.3 Objectives

This project is divided into two main blocks: the study of TCP available bandwidth with a network simulator, and the forecasting of the available bandwidth using a genetic algorithm. The goal is to use the results of the former as input and reference for the latter. The main aim is to study the impact of different ON-OFF patterns of Primary Users (PU) on the available throughput of Secondary Users (SU). In order to accomplish this goal, an implementation in the NS2 [8] network simulator is to be developed. This is followed by the forecasting of the available throughput for different patterns of ON-OFF PU activity using the GA. The purpose is to develop the full algorithm step by step in MATLAB and to test it in several situations. In addition, the intention is to carry out a study of the relation between TCP available bandwidth and MAC busy time. Another objective is to play back real captured WLAN traffic in the network simulator and analyse the available bandwidth in different scenarios with only SUs. The last objective is to test the reliability of the GA in real situations, with the aim of evaluating the impact on quality/error for different available bandwidths with real traffic processed in the simulator.

1.4 Methodology

The methodology is the procedure followed during the research and development work [9]. The Engineering Design Process [10] has been the main method taken into account. This method first defines the problem. Next, thorough background research on the topic is carried out. Once the background is properly understood, the requirements are specified and possible solutions are brainstormed. The next step is to choose the best solution to the problem, which is then developed and built. The last step is to test the solution and redesign it if required.

Within this process, it is common to jump from one step to another at any time if something to be improved or modified is found.

The research work is mainly based on the analysis of scientific research publications, books, science magazines and online sources.

1.5 Results

This section briefly presents the results obtained in this thesis. A large number of available bandwidth results for different combinations of PU activity patterns have been obtained. The available bandwidth tends to grow slightly as the parameters alpha and beta grow (alpha and beta are the parameters that define the pattern behaviour), and the retransmission time has a fundamental impact on the available bandwidth. Randomness in the PU activity makes predictions very unreliable. Regarding the MAC busy time (the time during which the Media Access Control layer is busy), there seems to be a relationship between TCP available bandwidth and MAC busy time.


In a scenario with only Secondary Users and real traffic, the users share the bandwidth if they can sense each other. This is in contrast to a situation with the “hidden node problem”, where the hidden node harms the available bandwidth. The results for the available bandwidth with PU activity can also be applied to the “hidden node problem” because the response is similar, although not identical, since the hidden node is affected by the transmission of acknowledgements from the node that is suffering the interference.

The implementation of the GA is successful and its correct functioning has been proved. However, some limitations have been encountered in the GA for the proposed scenario. Regarding the available bandwidth for ON-OFF PU activity patterns tested with the GA, even though the behaviour of the available bandwidth is chaotic and its forecasting is very complicated, acceptable results have been obtained.

For further details, please refer to Chapter 4 and Chapter 5.

1.6 Organization of the dissertation

This dissertation is structured progressively. First, the basic concepts are presented in order to give the reader the knowledge needed to understand the subsequent chapters. All the terminology is explained when it first appears in the text. The thesis is divided into the following parts:

• Introduction

The first part of the dissertation introduces the reason why the topic of the thesis was selected, the main goals to be achieved, some guidelines about methodology and an outline of the results of the study.

• Background and Related work

The aim of this chapter is to describe the basic knowledge about the developed topics in this thesis so as to understand better the implementation presented in the next chapter.

• Design and Implementation


In this chapter, the whole design and implementation of the scenario developed in this thesis is explained. This part covers all the specific work done to obtain software able to run the tests that are evaluated in the following chapter.

• Evaluation

In this part, the results and evaluation of several tests carried out within the developed implementation in this thesis are presented.

• Conclusions

Finally, the dissertation draws the conclusions derived from the implementation and the results of the whole project.


Chapter 2 Background and Related work

This chapter describes the background knowledge considered necessary to properly understand the implementation and the scope of this master thesis. Some of the related work that is relevant for this thesis is also pointed out.

2.1 Introduction

This chapter is divided into two parts: the background study and the related work. The former is broken down again into four main parts. Firstly, general previous information regarding standards and communications protocols is explained. Secondly, basic concepts about genetic algorithms that may be helpful in understanding the project are presented. Thirdly, the time series and forecasting are tackled and finally, some features and basic knowledge regarding the network simulator used in this thesis are described. The second part of the chapter analyses some previous studies that can be used as a reference for this thesis.


2.2 Background

The section describes the basic knowledge about the developed tools in this thesis that will help to understand better the chapters concerning the implementation and evaluation.

2.2.1 Wireless network and protocols

In this part, the network standards and protocols that apply in this thesis implementation are explained, including IEEE 802.11g wireless local area network standard and the transmission control protocol.

2.2.1.1 IEEE 802.11g

IEEE 802.11 is a Wireless Local Area Network (WLAN) standard developed by the Institute of Electrical and Electronics Engineers (IEEE) and published in 1997 [11]. IEEE 802.11 is a standard containing a set of Media Access Control (MAC) and physical layer specifications for the implementation of Wireless Local Area Networks (WLAN) [12] in the 2.4, 3.6, 5 and 60 GHz frequency bands.

IEEE 802.11g [13] is an advanced version of 802.11 that supports data rates per stream of 6, 9, 12, 18, 24, 36, 48, 54 Mbps. IEEE 802.11g employs a transmission scheme based on Orthogonal Frequency-Division Multiplexing (OFDM) in the 2.4 GHz frequency band, also called Industrial, Scientific and Medical (ISM) band.

2.2.1.1.1 IEEE 802.11 MAC protocol

In a WLAN, the MAC protocol (the protocol used to manage the MAC layer) is what primarily determines how efficiently the bandwidth of the wireless channel is shared [14]. The IEEE 802.11 standard defines two access methods: the Distributed Coordination Function (DCF), which provides distributed, asynchronous, contention-based access; and the Point Coordination Function (PCF), which provides centralized, contention-free access. As described in [15], the DCF method uses Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA). With this method, before delivering any data packet, the station senses the medium for a DCF Interframe Space (DIFS) [16] to detect whether other transmissions are ongoing. If the medium is sensed as free for a DIFS duration, the transmitting station keeps sensing the medium for a time corresponding to a random multiple of the slot time, between 0 and the so-called Contention Window (CW) minus 1. This time is called the back-off interval and it is used to minimise the possibility of collisions. If the medium becomes busy during the back-off interval, the back-off counter stops until the channel has been idle again for more than a DIFS duration.

The contention window size is not a fixed value. It follows an exponential law: the CW is set to a specified minimum value, denoted CWmin, for the first transmission attempt, and it is doubled after every failed transmission up to the maximum contention window, denoted CWmax. After a successful transmission, the CW is reset to CWmin. When the back-off interval ends, the transmitter can either transmit a data packet or send a Request to Send (RTS), which is answered with a Clear to Send (CTS) if the receiver is available. The RTS/CTS handshake is an optional mechanism whose aim is to avoid problems with interference and hidden nodes [17]. Once the packet is received at the destination, the receiver transmits a MAC layer acknowledgement (ACK) after a Short Interframe Space (SIFS).
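The back-off rule can be illustrated with a minimal Python sketch. It assumes the usual binary exponential rule, in which the window doubles after each failed transmission and returns to CWmin after a success, and it follows the convention used above of drawing the back-off from 0 to CW minus 1. The values of `CW_MIN` and `CW_MAX` and the function names are illustrative assumptions, not values taken from the thesis set-up.

```python
import random

# Assumed window limits, expressed so that the back-off is drawn from 0..CW-1.
CW_MIN = 16
CW_MAX = 1024


def next_cw(cw, success):
    """Return the contention window to use for the next transmission attempt."""
    if success:
        return CW_MIN              # reset after a successful transmission
    return min(2 * cw, CW_MAX)     # double after a failure, capped at CWmax


def draw_backoff_slots(cw, rng=random):
    """Draw the back-off counter (in slot times) uniformly from 0..cw-1."""
    return rng.randrange(cw)


# Contention window after five consecutive failed transmissions.
cw = CW_MIN
history = [cw]
for _ in range(5):
    cw = next_cw(cw, success=False)
    history.append(cw)

print(history)  # [16, 32, 64, 128, 256, 512]
```

The actual back-off time is the drawn number of slots multiplied by the slot time of the physical layer in use.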

2.2.1.1.2 Throughput at MAC layer

Figure 2 shows the approximate throughput at the MAC layer for different configurations of the maximum WLAN data rate per stream. For a maximum WLAN data rate of 54 Mbps, the expected throughput is around 23 Mbps. As the data in Table 1 show, the throughput never reaches the maximum available data rate of the link. Several factors affect the throughput, for example: the ACK, which is always sent at a lower data rate; the Physical Layer Convergence Protocol (PLCP) preamble, which is sent at a lower data rate; overhead; transmission range; and interference.


Figure 2: MAC layer throughput Max rate vs throughput [18]

Max rate [Mbps]    Throughput [Mbps]
 6                  5.12
 9                  7.23
12                  9.1
18                 12.28
24                 14.98
36                 18.86
48                 22.13
54                 23.31

Table 1: MAC layer throughput [18]
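The gap between the nominal data rate and the MAC layer throughput can be quantified directly from the values in Table 1. The short Python sketch below computes the ratio of throughput to maximum rate for each row; the term "efficiency" and the function name are only illustrative.

```python
# (max rate, throughput) pairs in Mbps, copied from Table 1.
TABLE_1 = [(6, 5.12), (9, 7.23), (12, 9.1), (18, 12.28),
           (24, 14.98), (36, 18.86), (48, 22.13), (54, 23.31)]


def mac_efficiency(max_rate, throughput):
    """Fraction of the nominal data rate that is seen as MAC throughput."""
    return throughput / max_rate


efficiencies = {rate: mac_efficiency(rate, tput) for rate, tput in TABLE_1}

for rate, eff in efficiencies.items():
    print(f"{rate:>2} Mbps -> {eff:.1%}")
```

With these values the ratio falls from about 85% at 6 Mbps to about 43% at 54 Mbps, which is consistent with the fixed-rate overheads listed above weighing more heavily at higher data rates.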

2.2.1.1.3 Physical layer

The PHY [19] is in turn divided into two sub-layers, as is the MAC layer: the Physical Layer Convergence Protocol (PLCP) and the Physical Medium Dependent (PMD) sub-layers. The PLCP sub-layer is in charge of the Carrier Sense (CS) part of the CSMA/CA protocol described in 2.2.1.1.1, whereas the PMD sub-layer is in charge of managing the modulation scheme and the signal encoding.

2.2.1.2 Transmission Control Protocol

Transmission Control Protocol (TCP) is defined in RFC 793 [20] [21]. TCP is described as “a reliable connection-oriented delivery service”. According to the RFC, TCP was designed to provide a highly reliable host-to-host protocol for packet-switched computer communication networks. The term connection-oriented means that a connection must be established before the hosts start transmitting data. This connection is established by means of the so-called three-way handshake. TCP splits the data to be sent into segments. If a segment is not received properly, it has to be retransmitted. Reliability in TCP is achieved by assigning a sequence number to each segment to be transmitted. In order to confirm that a segment is received free of errors, TCP sends an ACK for every received packet/segment. TCP also preserves the packet order.

If an ACK is not received on time -before the retransmission timeout (RTO)-, the data are retransmitted. RTO is based on the estimated round-trip time (RTT) between the sender and receiver, as well as the variance in this round-trip time. This is explained in detail in 2.2.1.2.2.

TCP defines a sliding window protocol [22] in which the receiver announces a receive window size (the receiver's advertised window), which can be increased as an option, so that the sender does not exceed the data processing capacity of the receiver.

Each TCP packet/segment contains the current value of the receiver's advertised window. This is useful because with this option the sender can send “bursts” with the maximum window size without waiting for the ACK of each packet sent in the burst.

Additionally, a congestion window [23] is defined on the transmission side to avoid exceeding the capacity of the network. TCP uses a mechanism called slow start to increase the congestion window quickly after a connection is initialized and after a timeout event. It starts with a window of two times the maximum segment size (MSS). The MSS is the largest amount of data that can be carried in a single segment without fragmentation, and it is defined by the TCP protocol. The congestion window doubles every RTT until it reaches a threshold, as depicted in Figure 3. While the congestion window is below the threshold, it grows exponentially; once it is above the threshold, it grows linearly (1 MSS each RTT). Whenever a timeout occurs, the threshold is set to one half of the current congestion window and the congestion window is then set to one MSS.


Figure 3: Evolution of TCP's congestion window [23]
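The window evolution described in this section can be reproduced with a small Python sketch. It is a simplified per-RTT model under stated assumptions: the window is counted in MSS units, growth during slow start is capped at the threshold, and timeouts are injected at hypothetical RTT indices. The function name and the parameter values are illustrative and do not come from the NS2 configuration used in the thesis.

```python
def evolve_cwnd(rtts, ssthresh, timeouts=(), initial_cwnd=2):
    """Return the congestion window (in MSS) at the start of each RTT."""
    cwnd = initial_cwnd
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in timeouts:
            # Timeout: halve the threshold and restart from one MSS.
            ssthresh = max(cwnd // 2, 1)
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: double every RTT (capped at the threshold here).
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: grow by one MSS every RTT.
            cwnd += 1
    return trace


# Hypothetical run: threshold of 16 MSS and a single timeout during RTT 6.
trace = evolve_cwnd(rtts=12, ssthresh=16, timeouts={6})
print(trace)  # [2, 4, 8, 16, 17, 18, 19, 1, 2, 4, 8, 9]
```

The trace shows the exponential phase up to the threshold, the linear phase above it, and the restart from one MSS with a halved threshold after the timeout.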

2.2.1.2.1 Throughput TCP

The congestion window and the receiver window (the receiver’s advertised window) limit the TCP throughput. Section 3.3 of RFC 6349 [24] defines how the TCP throughput is to be calculated. Equation 2.1 gives the calculation method.

$$\text{TCP throughput} = \frac{\text{TCP RWND} \times 8}{\text{RTT}} \qquad (2.1)$$

Where:

- TCP RWND is the Receive Window size, or receiver's advertised window

- RTT is the Round-Trip Time
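As a worked example, the following Python sketch evaluates Equation 2.1 under the assumption that it takes the form given in RFC 6349, that is, the receive window in bytes multiplied by 8 and divided by the RTT in seconds, giving a result in bits per second. The window size and RTT used are hypothetical values chosen only for illustration.

```python
def tcp_throughput_bps(rwnd_bytes, rtt_seconds):
    """Window-limited TCP throughput in bit/s: RWND (bytes) * 8 / RTT (s)."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return rwnd_bytes * 8 / rtt_seconds


# Hypothetical example: 65535-byte receive window and a 10 ms round-trip time.
throughput = tcp_throughput_bps(rwnd_bytes=65535, rtt_seconds=0.010)
print(f"{throughput / 1e6:.3f} Mbps")  # 52.428 Mbps
```

For a fixed receive window, doubling the RTT halves this upper bound, which is why the round-trip time matters as much as the window size.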

2.2.1.2.2 TCP Retransmission TimeOut

As described in [25, pp. 199-200], TCP sends back an acknowledgement for every received packet. If the acknowledgement is not received before a given Retransmission TimeOut (RTO), the TCP sender assumes that the packet is lost and retransmits it.


The computation of the current RTO value is a trade-off between a small value, which implies many unnecessary retransmissions, and a high value, which results in a high latency in packet loss detection. RTO is a function of the Round Trip Time (RTT). The RTT value is not the same for all packets, so a Smoothed RTT and an RTT variation [26] are calculated from several RTT samples. RTO depends on the instantaneous RTT, the Smoothed RTT, the RTT variation and the Binary Exponential Back-off (BEB). The BEB, or RTO back-off multiplicative factor, is set to 1 at the beginning and doubled with every timeout until an ACK is received. As a result, after several consecutive timeouts the RTO grows to a considerable value.
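The interplay between the smoothed RTT, the RTT variation and the back-off factor can be sketched in Python. The sketch follows the standard estimator of RFC 6298 (gains of 1/8 and 1/4, RTO equal to the smoothed RTT plus four times the variation, minimum of 1 s), which is an assumption here: the TCP agent in NS2 may use different constants and clock granularity. The class name, the 60 s cap and the sample values are illustrative.

```python
SRTT_GAIN = 1 / 8      # weight of a new sample in the smoothed RTT (RFC 6298)
RTTVAR_GAIN = 1 / 4    # weight of a new sample in the RTT variation (RFC 6298)
K = 4                  # multiplier of the RTT variation
INITIAL_RTO = 1.0      # seconds, used before any RTT sample is available
MIN_RTO = 1.0          # seconds
MAX_RTO = 60.0         # seconds, assumed upper bound


class RtoEstimator:
    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.backoff = 1   # RTO back-off multiplicative factor

    def on_rtt_sample(self, rtt):
        """Update the estimator with an RTT measured from a received ACK."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = ((1 - RTTVAR_GAIN) * self.rttvar
                           + RTTVAR_GAIN * abs(self.srtt - rtt))
            self.srtt = (1 - SRTT_GAIN) * self.srtt + SRTT_GAIN * rtt
        self.backoff = 1   # an ACK arrived, so the back-off factor is reset

    def on_timeout(self):
        """Double the back-off factor after a retransmission timeout."""
        self.backoff *= 2

    @property
    def rto(self):
        if self.srtt is None:
            base = INITIAL_RTO
        else:
            base = max(self.srtt + K * self.rttvar, MIN_RTO)
        return min(base * self.backoff, MAX_RTO)


est = RtoEstimator()
est.on_rtt_sample(0.5)
rto_after_first_sample = est.rto       # 0.5 + 4 * 0.25 = 1.5 s
for _ in range(3):
    est.on_timeout()
rto_after_three_timeouts = est.rto     # 1.5 * 8 = 12.0 s
est.on_rtt_sample(0.5)
rto_after_recovery = est.rto           # 0.5 + 4 * 0.1875 = 1.25 s
print(rto_after_first_sample, rto_after_three_timeouts, rto_after_recovery)
```

Three consecutive timeouts already multiply the RTO by eight, which illustrates how a long interference period can leave the sender idle well after the medium becomes free again.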

2.2.2 Genetic Algorithm

In this section, some definitions and explanations about the genetic algorithms such as the basic structure, main terminology and the main functions are presented.

2.2.2.1 Concept of genetic algorithm

A genetic algorithm is a stochastic search algorithm that mimics the process of natural selection and genetics to try to solve complex [27] problems [28]. It is based on the genetic processes of living beings. It uses historical data to find new search points for an optimal solution of the problem, trying to improve the results and converge to the best or expected value [29]. The search procedure is based on Darwinian theories of natural selection and survival. According to these theories, populations in nature evolve according to the principles of natural selection and survival of the fittest [30].

The main feature of this algorithm is its efficiency in exploiting historical information in order to speculate on new search points with the aim of finding a better point [29]. In addition, genetic algorithms can be successfully applied to a wide variety of problems from different areas.
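To make the idea concrete, the following Python sketch shows a generic genetic algorithm loop on a toy problem. It is not the algorithm developed in this thesis: the binary encoding, the target value, the tournament selection, the one-point crossover, the mutation rate and the single-elite rule are all assumptions chosen for brevity.

```python
import random

GENES = 8       # bits per chromosome (assumed)
TARGET = 200    # hypothetical optimum that the search should approach


def decode(chromosome):
    """Decode a list of bits into the integer it represents."""
    return int("".join(map(str, chromosome)), 2)


def fitness(chromosome):
    """Higher is better: negative distance to the target value."""
    return -abs(decode(chromosome) - TARGET)


def tournament(population, rng, size=3):
    """Pick the fittest of a few randomly chosen individuals."""
    return max(rng.sample(population, size), key=fitness)


def crossover(parent_a, parent_b, rng):
    """One-point crossover producing a single child."""
    point = rng.randint(1, GENES - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(chromosome, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def run_ga(generations=30, pop_size=20, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(GENES)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        elite = max(population, key=fitness)
        best_per_generation.append(fitness(elite))
        offspring = []
        while len(offspring) < pop_size - 1:
            parent_a = tournament(population, rng)
            parent_b = tournament(population, rng)
            offspring.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = [elite] + offspring   # the best individual always survives
    best = max(population, key=fitness)
    best_per_generation.append(fitness(best))
    return best, best_per_generation


best, history = run_ga()
print(decode(best), history[0], history[-1])
```

Because the best individual is copied unchanged into each new population, the best fitness per generation can never decrease, which is the sense in which the search exploits the information gathered in earlier generations.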

2.2.2.2 Elements and biological translation

In order to make the terminology used during the implementation and explanation of the algorithm easier to understand, some biological concepts and the parallels between biological evolution and evolutionary algorithms are described in the following sections.


2.2.2.2.1 Chromosome

All organisms are composed of one or more cells. Each cell contains at least one chromosome (DNA strand) in which the genetic information is encoded [31]. This structure carries hereditary factors, or genes. The genetic algorithm uses this structure to encode and store the solution.

Hence, the chromosome is the set of parameters defining a proposed solution to the problems to be solved.

2.2.2.2.2 Genes

A chromosome can be divided into genes, the functional blocks of DNA. Genes are therefore the basic units into which a chromosome can be split [32]. A gene is a position or set of positions in a chromosome. Consequently, a chromosome contains as many genes as it has variable slots. Each gene encodes a particular feature of an individual. The possible values that a gene can take, drawn from a fixed set of symbols, are known as alleles [33]. The position that a gene occupies in a chromosome is called its locus.

2.2.2.2.3 Genotype and phenotype

The genotype is the complete set of genes contained in a genome, and thereby the set of inherited factors of an individual, which may or may not be manifested in that individual. The genotype gives the encoded solution of the problem to be solved [32]. This information is copied at the time of reproduction and passed from one generation to the next. For this reason, the genotype can only be determined by biological tests, not by observation as the phenotype can.

The phenotype is the set of parameters represented in a chromosome, in other words, it contains the required information to create an individual (e.g. eye colour) [29]. The phenotype will give a decoded solution for the given problem [32].

The adaptation of an individual to the problem depends on the evaluation of the genotype, which can be inferred from the phenotype (chromosome) using the fitness function.

In order to clarify the concepts of genotype and phenotype, the following examples are presented:


• Example 1: Given an optimisation problem on integers, a set of integers forms the set of phenotypes. One of these phenotypes, for example 27, corresponds to the genotype 11011. As this example shows, the phenotype space and the genotype space are different, because a genotype can evolve into one whose phenotype is not in the selected set of integers. This is because the evolutionary search is carried out in the genotype space. Thus, the optimum phenotype is obtained by decoding the genotype [34].

• Example 2: Given a child with haemophilia (a group of hereditary disorders that impair the body’s ability to control blood coagulation), it could be that the parents did not suffer from this disorder themselves but carried the haemophilia genes. Then, these parents have the same phenotype but not the same genotype [35].

Genotype:    1 1 0 1 1    → (evolution of genotype) →    0 1 0 1 1

Phenotype:   27                                          11

Figure 4: Example phenotype and genotype
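The mapping between the two spaces in Figure 4 can be sketched in a few lines of Python. This is only an illustration, assuming a plain unsigned binary encoding; the helper names `decode` and `encode` are hypothetical and are not taken from the implementation described in this thesis.

```python
def decode(genotype):
    """Map a genotype (list of bits, most significant first) to its integer phenotype."""
    return int("".join(str(bit) for bit in genotype), 2)


def encode(phenotype, length):
    """Map a non-negative integer phenotype to a genotype of the given length."""
    return [int(bit) for bit in format(phenotype, f"0{length}b")]


# The two genotypes shown in Figure 4.
print(decode([1, 1, 0, 1, 1]))  # 27
print(decode([0, 1, 0, 1, 1]))  # 11
```

Flipping a single gene changes the phenotype from 27 to 11, which illustrates why the search moves in the genotype space while the quality of a solution is judged on the decoded value.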

2.2.2.2.4 Generation

The chromosomes evolve through iterations (called generations) during the progression of the genetic algorithm. In each iteration, the chromosomes are evaluated using some fitness measure and are exposed to the genetic operations of selection, crossover and mutation, producing new chromosomes in each generation. Therefore, the best individuals tend to survive and reproduce, in this way propagating their genetic material to future generations.

2.2.2.2.5 Population

A population is a set of chromosomes (possible solutions) whose size remains constant during the evolutionary search (across generations). As a starting point for the genetic algorithm, an initial population has to be created.


2.2.2.2.6 Summary of GA vocabulary

All the vocabulary explained above is summarized in Table 2 based on the table given by [36, p. 7].

Genetic Algorithm    Explanation
Chromosome           Solution
Genes                Part of the solution
Locus                Position of the gene
Allele               Value of the gene
Phenotype            Decoded solution
Genotype             Encoded solution
Population           Set of chromosomes
Generation           Iterations of the GA

Table 2: Genetic Algorithm vocabulary [36, p. 7]

2.2.2.3 Genetic algorithm structure

The general procedure of a genetic algorithm is depicted in Figure 5. First, all parameters, such as the length of the chromosome, the population size and the probabilities of crossover and mutation, are set. Afterwards, the initial population is randomly generated and evaluated by the fitness function. If this population does not meet the selected stopping criteria, such as the number of iterations, the time elapsed or the optimisation criteria, the best chromosomes are selected for reproduction. Then, these chromosomes are subjected to the crossover process, generating offspring. Finally, the offspring undergo the mutation process, giving a new population for the next generation. This new population is evaluated again and all the steps previously explained (with the exception of the parameter setting) are repeated until the criteria are met.


Figure 5: Genetic algorithm structure [37]
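The loop of Figure 5 can be sketched as follows. This is a minimal illustration on a toy problem (maximising the number of ones in a binary chromosome), not the forecasting tool developed in this thesis. The function name `run_ga`, all parameter values, the fitness-proportional selection and the bit-flip mutation are assumptions chosen for brevity; the sketch also assumes a non-negative fitness and an even population size.

```python
import random


def run_ga(fitness, length=20, pop_size=30, p_cross=0.8, p_mut=0.01,
           max_generations=100, target=None, seed=0):
    """Toy genetic algorithm over binary chromosomes (illustrative sketch only)."""
    rng = random.Random(seed)
    # Parameter setting is done by the caller; create a random initial population.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_generations):
        # Evaluation of the current population.
        scores = [fitness(chromosome) for chromosome in population]
        # Stopping criterion: optional target fitness reached.
        if target is not None and fitness(best) >= target:
            break
        # Selection: fitness-proportional; the offset keeps every weight positive.
        weights = [score + 1e-9 for score in scores]
        parents = rng.choices(population, weights=weights, k=pop_size)
        # Crossover: pair the parents and apply one point crossover.
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            if rng.random() < p_cross:
                cut = rng.randint(1, length - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            offspring += [a[:], b[:]]  # copies, so mutation never aliases a parent
        # Mutation: flip each gene with probability p_mut.
        for chromosome in offspring:
            for i in range(length):
                if rng.random() < p_mut:
                    chromosome[i] = 1 - chromosome[i]
        population = offspring
        # Remember the best chromosome seen so far.
        best = max(population + [best], key=fitness)
    return best


best = run_ga(fitness=sum)
print(best, sum(best))
```

Keeping track of the best chromosome seen so far is a design choice of this sketch: without it, the final population is not guaranteed to contain the fittest individual found during the search.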

2.2.2.4 Operations of the genetic algorithm

All genetic algorithms have common elements such as the creation of a population of chromosomes, the selection depending on their adaptation, the crossover to produce a new generation and finally the random mutation in the new generation.

2.2.2.4.1 Initial population

The first population is obtained through a random process in which the chromosomes are created. The number of individuals in the population determines the required computational resources, which increase as the population grows. With a larger population, the number of possible solutions and the search space covered are greater, but more resources are needed and the speed of the algorithm decreases. Thus, there is a trade-off between efficiency and effectiveness [33].

Depending on the problem to be solved, different methods can be used to encode the solution, for example binary encoding, value encoding, tree encoding, permutation encoding, etc. [38].


2.2.2.4.2 Fitness Function

The fitness function, also called evaluation function, tests the performance of each chromosome (potential solution) by measuring how good it is with respect to the current problem domain [39].

2.2.2.4.3 Selection

This operator selects chromosomes in the population for reproduction. The selection is random but favours the chromosomes with better fitness, so that the fittest have a higher chance of being selected to reproduce. Different techniques can be used for this selection, as in [40]; among others, roulette-wheel, ranking-based and tournament selection [41, pp. 17-18].

The method proposed in [27], [42] and [43] is roulette wheel selection.

The roulette wheel selection creates a roulette wheel where all chromosomes of the generation are placed and in which each chromosome has a proportion of the roulette wheel according to its fitness. Therefore, the chromosome with a better fitness will have more probability to be chosen because it has a bigger area of this roulette wheel [44]. Having n chromosomes, the roulette will be split into n portions. In order to select the chromosome, the roulette wheel spins n times, selecting the portion (chromosome) where it stops each time it spins. The probability of selection of an individual is given by the equation 2.2 [44].

$$p_j = \frac{f_j}{\sum_{i=1}^{n} f_i} \qquad (2.2)$$

where $p_j$ is the probability of selecting chromosome j and $f_j$ is the fitness of chromosome j.
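The roulette wheel described above can be sketched as follows. This is an illustrative example rather than the code used in this thesis: the function names are hypothetical, and the fitness values are assumed to be non-negative with a positive sum.

```python
import random


def selection_probabilities(fitnesses):
    """Equation 2.2: each chromosome gets a share of the wheel equal to f_j / sum(f)."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def roulette_wheel_select(population, fitnesses, rng=random):
    """Spin the wheel n times and return the n selected chromosomes."""
    probabilities = selection_probabilities(fitnesses)
    selected = []
    for _ in range(len(population)):
        spin = rng.random()          # position where the wheel stops, in [0, 1)
        cumulative = 0.0
        for chromosome, p in zip(population, probabilities):
            cumulative += p
            if spin < cumulative:    # the spin falls inside this portion
                selected.append(chromosome)
                break
        else:
            # Guard against floating-point rounding in the cumulative sum.
            selected.append(population[-1])
    return selected


print(selection_probabilities([1, 3, 4]))  # [0.125, 0.375, 0.5]
print(roulette_wheel_select(["a", "b", "c"], [1, 3, 4]))
```

A chromosome can be selected more than once, which is how fitter individuals come to contribute more genetic material to the next generation.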

The problem of this selection method is that, although it maintains a chance of diversity, it may converge very quickly [37].


2.2.2.4.4 Reproduction

Reproduction is performed by exchanging the genetic material of two chromosomes using the crossover operator. Mutation is then applied to the outcome of this operation. The result of these operations is two new chromosomes with the features of both parents and, where applicable, a random mutation.

2.2.2.4.4.1 Crossover

The crossover operator is the most important in GA, as it allows the exchange of features from one generation to the next and thereby the evolution of the species [32]. The main objective is to get an improvement in the fitness of the offspring.

In the crossover, the chromosomes selected for reproduction are paired up and crossed over to produce offspring with a given probability. The maximum value of this probability is 1, which means that every pair is crossed over and the parents do not survive.

Within this crossover process, there are different techniques to implement it such as the one point crossover, two point crossover, uniform crossover [45] and arithmetic crossover [46]. The one point crossover is now explained and a description of the other techniques can be found in [47].

The one point crossover strategy randomly chooses a point along the chromosome (a locus) and exchanges the genes before and after that point between the two chromosomes to create the new offspring. One example of this technique is shown in Figure 6.

Parents:      1 1 1 | 1 1 1 0 0 0        1 0 1 | 0 1 0 1 0 1

Offspring:    1 1 1 | 0 1 0 1 0 1        1 0 1 | 1 1 1 0 0 0

Figure 6: One point crossover example
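The one point crossover can be sketched with list slicing. The example below reads the bit strings of Figure 6 as two parents of nine genes each; the crossover point of 3 is inferred from those strings, and the function name is hypothetical.

```python
def one_point_crossover(parent_a, parent_b, point):
    """Exchange the genes after the crossover point between two parents."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


# Parents taken from Figure 6.
parent_a = [1, 1, 1, 1, 1, 1, 0, 0, 0]
parent_b = [1, 0, 1, 0, 1, 0, 1, 0, 1]

child_a, child_b = one_point_crossover(parent_a, parent_b, point=3)
print(child_a)  # [1, 1, 1, 0, 1, 0, 1, 0, 1]
print(child_b)  # [1, 0, 1, 1, 1, 1, 0, 0, 0]
```

In a genetic algorithm the point would normally be drawn at random for each pair; it is fixed here only so that the output can be compared with the figure.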

2.2.2.4.4.2 Mutation

Once the crossover process is finished, the mutation procedure is carried out. This process is performed in order to prevent all solutions in a population from falling into a local optimum of the fitness value [48]. Moreover, it is beneficial because it makes it possible to search areas that have not yet been explored. The mutation process modifies some random genes with a certain probability of mutation [49]. This is done by changing one gene to another value or by interchanging the values of two genes placed at different loci. After this process, the chromosomes are evaluated again using the fitness function.

There is a trade-off in the choice of the mutation probability. This is because if this mutation probability is too high, the search will be random, as it will not guarantee the survival of the fittest chromosomes for the next generation. One example of this technique is shown in Figure 7.

Before mutation:    1 2 3 4 5 6 7 8 9

After mutation:     1 2 5 4 3 6 7 8 9

Figure 7: Mutation example
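The mutation of Figure 7 interchanges the values of two genes. A minimal sketch of this variant is given below; the function names and the choice of applying at most one swap per chromosome are assumptions made for illustration.

```python
import random


def swap_genes(chromosome, i, j):
    """Return a copy of the chromosome with the genes at loci i and j interchanged."""
    mutated = list(chromosome)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def swap_mutation(chromosome, p_mut, rng=random):
    """With probability p_mut, interchange two randomly chosen genes."""
    if rng.random() < p_mut:
        i, j = rng.sample(range(len(chromosome)), 2)
        return swap_genes(chromosome, i, j)
    return list(chromosome)


# The chromosome of Figure 7: the genes at loci 2 and 4 (counting from 0) are swapped.
print(swap_genes([1, 2, 3, 4, 5, 6, 7, 8, 9], 2, 4))  # [1, 2, 5, 4, 3, 6, 7, 8, 9]
```

A low value of `p_mut` leaves most chromosomes untouched, whereas a value close to 1 changes almost every chromosome and makes the search nearly random, which is the trade-off described above.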

The steps explained so far are illustrated in Figure 8.

Parents → Selection → Crossover → Mutation → Offspring

Figure 8: From parents to offspring process

2.2.3 Time series and forecasting

In this section, time series and how they are used with the genetic algorithm are explained, in order to help in understanding how the genetic algorithm is used for forecasting.

2.2.3.1 Time series

A time series is a sequence of data points, $\{x_t\}$ in the discrete-time case and $\{X(t)\}$ in the continuous-time case, belonging to a system, measured at different instants of time and ordered chronologically, which can represent physical, financial or other variables. The main objective of the time series analysis carried out here is to study the behaviour of the time series in order to predict its future evolution up to a certain time horizon (also called prediction horizon). Time series analysis can be divided into linear and nonlinear, and univariate and multivariate [50]. Linear time series usually follow stable patterns at regular intervals that depend linearly on the past values of the series, in contrast with nonlinear time series, which display a chaotic behaviour [27].

It is possible to characterize the complexity of a system by observing a single variable belonging to that system, as demonstrated by the theorem proposed by Takens (1981) [51].

2.2.3.2 Phase space and attractor construction

The theorem proposed by Takens states that the statistical properties of an attractor are conserved in the delay coordinates¹; these coordinates are reconstructed from a time series of one variable belonging to that system.

Thus, to construct the attractor by means of the delay coordinates proposed by Packard et al. [52], the time delay and the embedding dimension are needed. One example of this can be seen in Figure 9 with the construction of the Lorenz attractor by means of Takens’ theorem.

Figure 9: Example of Lorenz chaotic attractor [53]

¹ In the vector space, $v = (v_1, v_2, v_3, \ldots, v_n)$ is called a point or vector and $v_1, v_2, v_3, \ldots, v_n$ are called the coordinates of $v$, where n is a natural number called the dimension of the space.


Given the scalar time series $x_1, x_2, x_3, \ldots, x_N$, obtained from observations at constant intervals, it is possible to reconstruct a vector with embedding dimension m in an m-dimensional space [54], [55], [56] by means of the method of delays as follows:

$$Y_n(m) = \left[x_n, x_{n+\tau}, \ldots, x_{n+(m-1)\tau}\right], \quad x_n \in \mathbb{R}^1, \quad n = 1, 2, \ldots, N-(m-1)\tau \qquad (2.3)$$

where $Y_n$ is the reconstructed vector with embedding dimension m, $x_n$ is the observed discrete value at time n and $\tau$ is the time delay or embedding time. Thus, the m coordinates of each reconstructed vector are samples of the time series separated by a fixed $\tau$. The result is a series of vectors as follows:

$$Y = Y_1, Y_2, \ldots, Y_{N-(m-1)\tau} \qquad (2.4)$$

where N is the length of the original series. The purpose is to form an attractor that preserves the topological properties of the original unknown attractor. Thus, the idea of such a reconstruction is to capture the original system states in each observation of the system output.

An example applied to the evolution of a stock market is the following:

Date        Price
1/1/2014    10 000
2/1/2014    10 005
3/1/2014    10 008
4/1/2014    10 003
5/1/2014    10 006

Then, for a time delay of 1 (d = 1) and m = 4, the reconstructed vectors, listed from the most recent sample backwards, are defined as:

$Y_1$ = {10 003, 10 008, 10 005, 10 000}

$Y_2$ = {10 006, 10 003, 10 008, 10 005}

Table 3: Example of reconstructed vector

[Chart: "Stock Market", showing the price series]

Therefore, two parameters have to be computed. These parameters are the time delay $\tau$ and the embedding dimension m.
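Once both parameters are known, the method of delays of equation 2.3 reduces to a sliding selection of samples. The sketch below is an illustration only; the function name `delay_vectors` is hypothetical and the indices are 0-based, unlike the 1-based notation of the equations. It is applied to the stock prices of the example above.

```python
def delay_vectors(series, m, tau):
    """Build the N - (m - 1) * tau delay vectors [x_n, x_{n+tau}, ..., x_{n+(m-1)tau}]."""
    count = len(series) - (m - 1) * tau
    return [[series[n + k * tau] for k in range(m)] for n in range(count)]


prices = [10000, 10005, 10008, 10003, 10006]
for vector in delay_vectors(prices, m=4, tau=1):
    print(vector)
# [10000, 10005, 10008, 10003]
# [10005, 10008, 10003, 10006]
```

These are the same two vectors as in the example, written in chronological order as in equation 2.3 rather than from the most recent sample backwards.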

2.2.3.2.1 Time delay

The method used to calculate the delay is important for reconstructing the attractor from a scalar time series and, once it is reconstructed, for estimating the correlation dimension in order to evaluate whether the scalar time series is chaotic or stochastic.

There is a trade-off in the choice of $\tau$: it has to be large enough to obtain the largest amount of new information between $x_n$ and $x_{n+\tau}$ and, at the same time, small enough that the two samples are not independent [57].

Different methods and techniques to calculate the time delay $\tau$ and the embedding dimension have been proposed and implemented, such as those in [55], [56] and [58]. The methods used in this thesis are the series correlation approaches of autocorrelation and average mutual information.

These methods are not explained in detail in this thesis, but further information can be found in [59] and [60]. The implementations used are those of the Nonlinear Time Series Analysis package (TISEAN) [61].

The calculation of the delay time is based on the average mutual information, and the delay is determined by finding the position of its first minimum in a graphical representation [62]. Alternatively, as in [27], this parameter is set to 1 and compensated by increasing the embedding dimension, as is also demonstrated in [63, p. 31].
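The first-minimum criterion can be illustrated with a simple histogram-based estimate of the average mutual information. The sketch below is not the TISEAN implementation used in this thesis: the function names, the number of bins and the maximum lag are assumptions, and the estimate depends on how the histogram is built.

```python
import numpy as np


def average_mutual_information(x, lag, bins=16):
    """Histogram estimate of the mutual information between x_n and x_{n+lag}."""
    x = np.asarray(x, dtype=float)
    a, b = x[:-lag], x[lag:]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()                 # joint probabilities
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal of x_n
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal of x_{n+lag}
    mask = p_ab > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))


def first_minimum_lag(x, max_lag=50, bins=16):
    """Return the lag of the first local minimum of the AMI curve, or None."""
    ami = [average_mutual_information(x, lag, bins) for lag in range(1, max_lag + 1)]
    for i in range(1, len(ami) - 1):
        if ami[i] < ami[i - 1] and ami[i] <= ami[i + 1]:
            return i + 1                       # ami[i] corresponds to lag i + 1
    return None


signal = np.sin(np.linspace(0, 20 * np.pi, 2000))
print(first_minimum_lag(signal, max_lag=100))
```

If no local minimum is found within `max_lag`, the function returns `None`; in that case a larger maximum lag or the fallback of a delay of 1 mentioned above could be considered.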

2.2.3.2.2 Embedding dimension

The other parameter to be found is the embedding dimension, which is the dimension of the space in which the attractor is reconstructed. Different methods and techniques to calculate the embedding dimension have been proposed and implemented in [56], [64] and [65]. Kennel et al. [64] proposed a method to calculate the minimum embedding dimension called FNN (False Nearest Neighbours).

This is the method selected for the calculation of m, for which the tool TISEAN is used; further information about this method can be found in [56], [59] and [64].


2.2.3.3 Forecasting

The univariate analysis is carried out with a single variable, with the objective of finding the dynamic dependence of $x_t$ on its past values $\{x_{t-1}, x_{t-2}, \ldots, x_{t-N}\}$ [66]. On the other hand, with multivariate time series analysis, more than one variable is studied and observed at a time [67].

The forecasting of the values is done using time series analysis. The work of Takens (1981), Casdagli (1989) [68] and others established the methodology for the creation of a dynamic model from a chaotic time series [42]. According to the Takens theorem, nonlinear chaotic dynamic systems can be reconstructed from a sequence of observations [27]. This theorem states that, given a deterministic time series, there exists a function F(·) that satisfies equation 2.5 [42].

$$x_t = F\left(x_{t-\tau}, x_{t-2\tau}, \ldots, x_{t-m\tau}\right) \qquad (2.5)$$

where $\tau$ is the delay factor and m is the embedding dimension. Therefore, the theorem guarantees that future values can be forecast considering only past values. The difficulty lies in finding the function F(·); the genetic algorithm is implemented and used to find a good approximation of this function. The forecasts are carried out by deterministic models built directly from observations of the system evolution [43].

2.2.3.4 Forecasting methods

Given a univariate time series $[x_t]_{t=1}^{T}$ representing the observations, it is possible to predict the next n points of this series, i.e. $x_{T+1}, x_{T+2}, \ldots, x_{T+n}$, with only some of the previous samples.

The prediction can be performed with different methods: the prediction of just one point, $x_{t+1}$ (1-step ahead [69]); the prediction of several points at one time (direct strategy); and the iterative prediction, in which $x_{t+1}$ is used as the input for the next prediction until $x_{t+n}$ is reached.

The method used in this thesis is the direct multistep-ahead prediction of several points, also known as independent value prediction in [70] or direct strategy in [71]. To forecast the next samples, the Takens theorem is applied: a function is sought that connects the previous samples, capturing a pattern in the past values that can be used to predict the next ones.

$$x_{T+n} = F\left(x_T, x_{T-1}, \ldots, x_{T-(m-1)}\right) \qquad (2.6)$$


In order to forecast $[x_t]_{t=T+1}^{T+n}$ from a time series $[x_t]_{t=1}^{T}$, a training set is created from the time series using a shifted window of length $(m-1)\cdot\tau$ [70], where m is the embedding dimension, $\tau$ the time delay and n the number of samples to predict. Therefore, the prediction horizon is fixed once the training set is chosen, and vice versa. The training set is necessary in order to find a nonlinear function of the data set.
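The construction of such a training set can be sketched as follows. The sketch assumes that each training example pairs a window of m samples spaced $\tau$ apart (spanning $(m-1)\cdot\tau$ time units, most recent sample first) with the value observed a fixed number of steps later; the function name `direct_training_set` and the parameter name `horizon` are hypothetical.

```python
def direct_training_set(series, m, tau, horizon):
    """Pair each delay window ending at time t with the sample at time t + horizon."""
    span = (m - 1) * tau                      # time covered by one window
    pairs = []
    for t in range(span, len(series) - horizon):
        window = [series[t - k * tau] for k in range(m)]   # x_t, x_{t-tau}, ...
        pairs.append((window, series[t + horizon]))
    return pairs


series = list(range(10))                      # toy series 0, 1, ..., 9
for window, target in direct_training_set(series, m=3, tau=1, horizon=2):
    print(window, "->", target)
# first pair: [2, 1, 0] -> 4, last pair: [7, 6, 5] -> 9
```

Changing `horizon` changes every target in the set, which is why the prediction horizon and the training set cannot be chosen independently of each other.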

On the other hand, the iterated prediction [71], also known as multi-stage prediction [70], consists in taking the predicted sample as an input for the next forecast until the n-th prediction. In this manner, the first predicted sample is used along with the past samples to predict the next one. Hence, the samples used to predict are shifted by one time unit, adding the newly predicted sample in each iteration.

This is described mathematically by equations 2.7, 2.8 and 2.9 [71].

$$x_{T+1} = F\left(x_T, x_{T-1}, x_{T-2}, \ldots, x_{T-(m-1)}\right) \qquad (2.7)$$

$$x_{T+2} = F\left(x_{T+1}, x_T, x_{T-1}, \ldots, x_{T-(m-2)}\right) \qquad (2.8)$$

$$x_{T+3} = F\left(x_{T+2}, x_{T+1}, x_T, \ldots, x_{T-(m-3)}\right) \qquad (2.9)$$

From the previous equations, the general equation 2.10 can be written, as also described in [71].

$$x_{T+n} = F\left(x_{T+(n-1)}, x_{T+(n-2)}, x_{T+(n-3)}, \ldots, x_{T-(m-n)}\right) \qquad (2.10)$$

The main problem of this method is that the error accumulates in each iteration, because each predicted sample is included as an input and its inherited error is carried into the next prediction [70]. In contrast to the iterated prediction, the direct prediction minimises the associated squared multistep-ahead error. However, the direct strategy requires more computational resources [71] as more samples n are to be predicted, because the larger n is, the more training data are needed to obtain a good prediction model. This is due to the considerable absence of samples between T and T + n. Nonetheless, a better function could be found using the latter strategy. For this reason, and also following the work done in [42], [27] and [43], the direct multistep-ahead prediction is used.
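For comparison, the iterated strategy of equations 2.7 to 2.10 can be sketched as a loop that feeds each prediction back into the input window. The function name `iterated_forecast` is hypothetical, and the one-step model used in the example (a linear extrapolation of the last two samples) is only a placeholder for the function F(·), not the function found by the genetic algorithm.

```python
def iterated_forecast(history, one_step_model, m, steps):
    """Predict `steps` values, reusing each predicted sample as an input (eq. 2.7 to 2.10)."""
    values = list(history)
    for _ in range(steps):
        window = values[-1:-m - 1:-1]         # the m latest samples, most recent first
        values.append(one_step_model(window))
    return values[len(history):]


def linear_extrapolation(window):
    """Placeholder one-step model: continue the last observed increment."""
    return window[0] + (window[0] - window[1])


print(iterated_forecast([1, 2, 3], linear_extrapolation, m=2, steps=3))  # [4, 5, 6]
```

From the second step onwards the window contains predicted rather than observed samples, which is the mechanism by which the error of one prediction propagates into the following ones.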
