
Linköping University | IDA | Bachelor thesis, 16 hp | IT | Spring term 2016 | LIU-IDA/LITH-EX-G--16/060--SE

Adaptive Checkpointing

for Emergency Communication Systems

Sebastian Karlsson

Christoffer Nilsson

Supervisor: Mikael Asplund
Examiner: Nahid Shahmehri


Students in the 5-year Information Technology programme complete a semester-long software development project during their sixth semester (third year). The project is completed in mid-sized groups, and the students implement a mobile application intended to be used in a multi-actor setting, currently a search and rescue scenario. In parallel they study several topics relevant to the technical and ethical considerations in the project. The project culminates in a demonstration of a working product and a written report documenting the results of the practical development process, including requirements elicitation. During the final stage of the semester, students form small groups and specialise in one topic, resulting in a bachelor thesis. The current report represents the results obtained during this specialization work. Hence, the thesis should be viewed as part of a larger body of work required to pass the semester, including the conditions and requirements for a bachelor thesis.


Abstract

The purpose of an emergency communication system is to be ready and available at all times during an emergency situation. This means that emergency systems have specific usage characteristics: they experience long idle periods followed by usage spikes. To achieve high availability it is important to have a fault-tolerant solution. In this report, warm passive replication is in focus. When using warm passive replication, checkpointing is the procedure of transferring the current state from a primary server to its replicas. In order to utilize resources more effectively than with a fixed interval checkpointing method, an adaptive checkpointing method is proposed. A simulation-based comparison is carried out using MATLAB and Simulink to test both the proposed adaptive method and the fixed interval method. Two metrics, response time and time to recover, and four parameters are used in the simulation. The results show that an adaptive method can increase efficiency, but in order to make a good adaptive method it is necessary to have specific information regarding system configuration and usage characteristics.


Contents

1 Introduction
  1.1 Purpose
  1.2 Problem Formulation
  1.3 Method
  1.4 Limitations

2 Background
  2.1 Availability
  2.2 Failures and Probabilities
  2.3 Failover
  2.4 Replication
  2.5 Related Work

3 Warm Passive Simulation Model
  3.1 Model Structure
    3.1.1 Request Generator
    3.1.2 Checkpoint Controller
    3.1.3 Primary Server
    3.1.4 Log/State Transfer
    3.1.5 Backup Server
    3.1.6 Data Collector
  3.2 Parameters
  3.3 Simulation Procedure

4 Warm Passive Simulation Results
  4.1 Checkpointing Interval
    4.1.1 Default Values of Checkpointing Interval
  4.2 Request Service Time Scale
  4.3 Request Scarcity Scale
  4.4 State Transfer Scale
  4.5 System Availability Case Study

5 Discussion, Conclusion and Future Work
  5.1 Discussion
  5.2 Conclusion
  5.3 Future Work


List of Figures

1  Basic overview of the system.
2  Flowchart which shows how the implemented logic handles a request.
3  The model overview, showing all subsystem components.
4  The inside of the Request Generator subsystem from Section 3.1, Figure 3.
5  The content of the adaptive Checkpoint Controller subsystem from Section 3.1, Figure 3.
6  The content of the fixed Checkpoint Controller subsystem from Section 3.1, Figure 3.
7  The content of the Primary server subsystem from Section 3.1, Figure 3.
8  The content of the Log/State Transfer subsystem from Section 3.1, Figure 3.
9  The content of the Backup server subsystem from Figure 3 in Section 3.1.
10 The content of the Data Collector subsystem from Figure 3 in Section 3.1.
11 The effects on response time when varying the checkpoint interval.
12 The effects on time to recover when varying the checkpoint interval.
13 The effects on response time when varying the request service time scale.
14 The effects on time to recover when varying the request service time scale.
15 The effects on response time when varying the request scarcity scale.
16 The effects on time to recover when varying the request scarcity scale.
17 The effects on time to recover when varying the request scarcity scale, excluding adaptive max.
18 The effects on response time when varying the state transfer scale.
19 The effects on time to recover when varying the state transfer scale.


List of Tables

1  Default values for all parameters, except Checkpoint Interval.
2  Default values for the Checkpoint Interval parameter.


1 Introduction

Emergency situations occur unexpectedly and often have unique characteristics, which makes them hard to plan for. However, by using knowledge of earlier disasters, people who work with catastrophe preparation analyse and try to build plans for how to act and respond to disastrous events. When a disaster occurs it is imperative that rescue personnel have a stable line of communication. In today's society, computer systems form the infrastructure for communication and for managing large-scale operations. Clearly, an emergency communication system is a critical part of rescue operations, and it is very important that its servers are available and operating at all times.

Typical servers today are configured to handle many different tasks; most of them are for commercial use, configured to handle a steady stream of requests, and may therefore not be the optimal solution for an emergency communication system. Emergency communication systems have specific tasks, and many of them may sit idle or perform very few operations between emergencies. A catastrophe can occur without warning, and an emergency system may therefore see a drastic spike in traffic due to the issues caused by the catastrophe. This results in highly fluctuating traffic characteristics.

This kind of fluctuating characteristic and periodically intense load is the main focus of this report. How should the emergency communication system be configured in order to handle load spikes efficiently? This report focuses on one specific part of the system, namely server redundancy. Server redundancy is commonly used for maintaining system availability, but is also known to affect system performance depending on how it is carried out. A common redundancy method is called passive replication and uses checkpointing intervals to decide when information should be transferred from one server to another.

Our purpose is to explore an adaptive strategy for checkpointing in emergency communication systems in order to investigate the availability of a client-server based system which uses warm passive (primary-backup) replication. This should result in a recommendation as to how the checkpointing procedure can be specialized for an emergency communication system. The analysis will be performed using probabilistic models and simulations.

Figure 1 shows an overview of the hardware configuration which is used as a basis for the investigation.

Figure 1: Basic overview of the system.

We refer to this hardware configuration as the reference system. The reference system consists of two servers connected to mobile clients through the internet. One of the servers acts as primary and handles all requests. Server states can then be sent to the backup using checkpointing. The backup takes over request handling if the Primary server goes down.

1.1 Purpose

Motivated by an emergency communication application that uses warm passive replication, the purpose is to investigate whether fixed interval checkpointing can be improved upon by introducing special checkpointing rules that consider the current load on the system before determining if a checkpoint operation is suitable at this time. The purpose of this dynamic checkpointing is to increase resource efficiency and to decrease the performance drawbacks of a redundancy solution focusing on high availability.

1.2 Problem Formulation

The purpose will be fulfilled by providing answers to the following questions:

• What level of availability can be expected in the presence of crash failures for a system with warm passive replication and a load characteristic based on an emergency communication scenario?

• How can adaptive/dynamic checkpointing be performed to effectively distribute resources during irregular system loads where traffic may go from idle to extremely busy in a short period of time?

• How does the performance of the proposed adaptive checkpointing compare to the performance achieved using checkpointing with a fixed interval?

1.3 Method

Passive replication theory is used to create a simulation model of a system which implements warm passive replication. This model is then used to explore how varying different parameters impacts the system's overall availability and response time. In addition, another simulation model is created for the proposed adaptive checkpointing method. The parameters investigated in the simulation, in order to study the two metrics, are called Checkpoint Interval, Request Service Time Scale, Request Scarcity Scale and State Transfer Scale, and are further described in Section 3.2. The default values for the parameters are chosen based on experimentation and logical reasoning in accordance with the assumed characteristics of an emergency communication system.


The simulation models are created in Simulink, which is a block diagram environment for multi-domain simulation and model-based design integrated into MATLAB [1]. As MATLAB supports scripting, it allows for easy configuration to perform many simulation runs where system parameters, system load and checkpointing strategies are changed.

The simulation-based evaluation serves two purposes. First, the results are used to determine if an adaptive approach to checkpointing can be made in a way that is more resource efficient than fixed interval checkpointing. The second purpose is to use the result to assess the level of availability that can be achieved by the emergency system when using two servers while disregarding network issues, with the exception of handshakes between the Primary server and its Backup servers.

1.4 Limitations

The work in this thesis is based on the following assumptions.

• Network is reliable at all times.

• Only server crash failures are considered.

• There is a linear relation between the time it takes to process a request and the time it takes to make a state transfer containing the same request.

• An unlimited incoming buffer for the Primary server is used.


2 Background

This section presents theory necessary for understanding the remainder of the thesis. The concepts of availability, failures, failover, replication and the types of passive replication are briefly explained. Related work is also presented to put this work into context.

2.1 Availability

As presented by Avizienis et al. [2], availability is the readiness for correct service. Equation 1 shows how the availability of a system can be expressed as a function of failure rate and repair time. MTBF stands for Mean Time Before Failure and is the expected time the system will run before failing. MTTR stands for Mean Time To Recovery and is the expected time for the system to recover from a failure.[3]

Availability = MTBF / (MTBF + MTTR)    (1)

To be able to calculate the availability for a system using this method it is necessary to determine both the MTBF and the MTTR.
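As a minimal numeric illustration of Equation 1, the short MATLAB sketch below computes the availability for a pair of example figures; the numbers are hypothetical and merely chosen to be of the same magnitude as those derived later in Section 4.5.

MTBF = 25000;                         % mean time before failure, in hours (example value)
MTTR = 9;                             % mean time to recovery, in hours (example value)
availability = MTBF / (MTBF + MTTR)   % Equation 1, roughly 0.9996, i.e. about 99.96%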

2.2 Failures and Probabilities

There are different causes, so called faults, which can result in the inaccessibility of the Primary server. These can be divided into two categories: node faults and channel faults. Channel faults are caused by problems in the communication between nodes. This report does not take channel faults into consideration, so they will not be explained further. Node faults are problems that originate from within a node such as a server. Only faults that cause a complete server failure are considered; these are called crash failures.[2]

Faults can be of a nature that is either transient, intermittent or permanent. Transient faults are momentary issues from which the service recovers. Intermittent faults are issues that recover at some point, and permanent faults are issues that the system cannot recover from without changes being made to the system; they must therefore be fixed by a system administrator [2]. The crash failures are assumed to be the result of faults that are permanent in nature, which means that if a server goes down it must be manually repaired before the recovery-after-repair process can be initiated.

2.3 Failover

When a fault is detected, the failover process is initiated. The failover process switches processing from the Primary server to the Backup server. In passive replication this means that the Backup server has to get up to date by reading and processing everything in the log file which has been added after the latest received state update.[4, 5]
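As a rough sketch of this replay step, the MATLAB fragment below shows how a backup could bring itself up to date from the log on failover. The function and variable names (failoverReplay, processRequest, logEntries, lastCheckpointIndex) are hypothetical and only illustrate the idea; they are not part of the simulation model described later.

function state = failoverReplay(state, logEntries, lastCheckpointIndex)
    % Replay every logged request added after the latest state update
    % received from the Primary server (processRequest is a hypothetical helper)
    for k = lastCheckpointIndex + 1 : numel(logEntries)
        state = processRequest(state, logEntries(k));
    end
end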

2.4 Replication

The goal of replication is to increase availability and integrity by having more than one copy of the data. A copy is referred to as a replica. If the data changes, all replicas must implement the same changes or they will no longer be replicas. This is the essential principle of replication.[5]

There are two primary types of replication, passive and active. When using active replication, requests are sent to all replicas and they each process the request independently and answer. If one replica goes down the others still continue answering and the system is not affected. Since the scope of this report does not include active replication it will not be described in further detail. In passive replication the data replication is accomplished through the Primary server updating the replicas. Passive replication has three subtypes usually referred to as warm, hot and cold.[5, 6]

In a warm passive replication structure there is a single primary replica (server) which responds to client messages. The primary server has one or several backup servers that are prepared to take over and act as the primary server in case the previous primary server is affected by a crash failure. In order to reduce the amount of updates needed for the backup server when taking the role as primary server, the state of the primary server is continually transferred to the backup servers. Such a state transfer is called a checkpoint. When using warm passive replication, checkpoints are carried out according to an interval in time; this is called the checkpoint interval.[5]

Hot passive replication can be seen as a special case of warm passive where state updates are done after every request. This type of replication method can be resource consuming as it needs to perform state updates often, but allows for very quick failover as the backup replica states should always be up to date.[5]

Cold passive replication can be seen as the opposite of hot passive replication as there are no state updates. Requests are only put in a log and a replica has to replay everything since system start before it can take the role as primary.[5]

2.5 Related Work

Nadjm-Tehrani [7] presents a thorough investigation of primary-backup replication using a mathematical model to optimize the checkpointing interval with both availability and response time parameters. There has been a lot of other work on passive replication, such as [8, 9, 10, 11]. A form of passive replication called hot passive replication is often overlooked, but is thoroughly examined in [12]. All forms of passive replication are considered in this thesis, which is therefore related to all work on passive replication. When it comes to adaptive approaches, the closest work found is that of Balasubramanian et al. [13], which presents investigations regarding adaptive failover using a Fault-tolerant Load-aware and Adaptive middlewaRe they call FLARe. This thesis differs from [13] in that it specifically investigates emergency communication systems, their specific characteristics and how an adaptive approach fits in this area of use.

An alternative look at an adaptive approach to replication can be found in [14], which investigates the possibility of adaptively choosing the number of replicas and which replication strategy is suitable for this type of solution. The popular alternative to passive replication is active replication, described in [15, 5, 6], but since this work focuses on passive replication it is not as closely related. Another alternative which has been investigated is the use of a hybrid strategy, combining both active and passive replication, as investigated in [16].


3 Warm Passive Simulation Model

This section presents the system model and the simulation procedure. In order to compare how a system with an adaptive approach to checkpointing performs versus a system using a fixed checkpointing interval, two Simulink models have been created. Except for the checkpointing mechanism, the models are identical. Both of these models are used when simulating, and they provide data for the two different checkpointing alternatives.

Figure 2 shows how the implemented logic handles a request. When the request enters the primary buffer, which is a FIFO (First In, First Out) queue, it stays there until it is its turn to be processed by the Primary server. The request is then stored in the log file and is available to the Backup server if a failover takes place. If the Primary server requests a checkpoint, it stops processing from its incoming buffer and instead initiates a transfer that carries over all states updated since the last checkpoint to the Backup server.

Figure 2: Flowchart which shows how the implemented logic handles a request.

Information on which parameters are investigated is explained in Section 3.2 and the simulation procedure is explained in Section 3.3. First however, Section 3.1 describes the two models in detail.

3.1 Model Structure

The simulation models are designed to represent a warm passive replication structure which is able to handle ordinary operations and send statistics to a data collector. In order to form a good structure for the models, several subsystems are created to represent the different components. Viewed from outside the subsystems the two models are identical, and this component overview is shown in Figure 3, which was extracted directly from Simulink:

Figure 3: The model overview, showing all subsystem components.

The difference between the two models is how the Checkpoint Controller subsystem works. All subsystems will be explained in further detail in the following subsections.

3.1.1 Request Generator

The Request Generator (See Figure 4) produces items that represent messages sent by mobile users. The Set Attribute block sets a service time for the item. The FIFO queue has near unlimited capacity and is only there to make sure request generation speed is not affected by connections outside the component. The start timer is a timestamp that will later be used to measure response time for requests.


Figure 4: The inside of the Request Generator subsystem from Section 3.1, Figure 3.

The items are generated with the help of a gamma probability distribution. The gamma distribution is used to simulate the variety of load that a system may experience. The gamma distribution, along with its parameters, was picked for its fluctuating behaviour in order to represent the characteristics of a crisis communication system as mentioned in Section 1. Yang et al. [17] show that the gamma distribution fits well with empirical data which has very large differences between the mean and median value. This description of the empirical data fits well with this system load characteristic, which has both calm periods and high load spikes. A gamma distribution in MATLAB uses the parameters shape and scale. For this thesis the shape parameter has been set, through experimentation, to a value which appropriately fits the assumed characteristics of emergency communication systems. The scale parameter is varied in order to simulate different system loads. In this thesis, the gamma distribution is used for two parameters related to system load. These parameters scale the size of the gamma distribution outputs and will be accounted for in Section 3.2. The first gamma distribution component is used to generate the length of time before producing a new request item. This length of time is generated for each new item and varies greatly. The variation in time between new request items represents the variations in system activity. The second gamma distribution component is used to generate the time it takes for the system to process the corresponding request item. This variation in request processing time is motivated by the differences in the amount of data which is downloaded and/or uploaded to the system.
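The plain MATLAB fragment below sketches how the two gamma-distributed quantities could be drawn outside Simulink; the shape value and the variable names are assumptions made for illustration, since the exact shape parameter used in the models is not stated here.

% Hypothetical sketch of the two gamma-distributed draws (Statistics Toolbox)
shape            = 0.5;    % assumed shape parameter, not the value used in the models
requestScarcity  = 1000;   % scale for the time between new requests (see Section 3.2)
serviceTimeScale = 100;    % scale for the request processing time (see Section 3.2)

interArrivalTime = gamrnd(shape, requestScarcity);    % time until the next request
serviceTime      = gamrnd(shape, serviceTimeScale);   % time to process that request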

3.1.2 Checkpoint Controller

The adaptive checkpointing method is based on the assumption that checkpoints should be avoided while the server is experiencing heavy load. We investigate this assumption by implementing a checkpointing method that does not checkpoint unless there are at least 50 requests to transfer to the replicas (there has to be a need for performing a checkpoint) and fewer than 150 requests queued for processing at the Primary server. These values would of course be interesting to scale in different ways according to a specific system, but we only consider these values, which are tailored for our particular scenario.

The Checkpointing Controller contains the logic for triggering checkpointing at the Primary server. In the system using adaptive checkpointing, this subsystem is constantly (every time there is a new request) checking the values of the Primary's incoming buffer size and the number of requests processed since the last checkpoint (the system log size) to determine if a checkpoint should occur. The adaptive checkpointing is designed to trigger when the incoming buffer is lower than 150 and the number of requests processed since the last checkpoint is higher than 50. The adaptive checkpointing component (inside the subsystem) is shown in Figure 5 and the custom component (the MATLAB function) code is shown in Listing 1.

Figure 5: The content of the adaptive Checkpoint Controller subsystem from Section 3.1, Figure 3.


function OUT = fcn(bufferSize, logSize)
    % Persistent flag so the checkpoint state is kept between calls
    persistent checkpointProcessActiveFlag;
    if isempty(checkpointProcessActiveFlag)
        checkpointProcessActiveFlag = 0;
    end

    serverBufferSizeAdaptiveMaximum = 150;   % Adaptive Maximum (requests)
    logSizeAdaptiveMinimum = 50;             % Adaptive Minimum (requests)

    if checkpointProcessActiveFlag == 0
        % Start a checkpoint only when the server is not too busy and
        % there are enough logged requests to make a transfer worthwhile
        if bufferSize < serverBufferSizeAdaptiveMaximum && ...
                logSize > logSizeAdaptiveMinimum
            checkpointProcessActiveFlag = 1;
        end
    elseif logSize == 0
        % The log has been emptied, so the checkpoint has completed
        checkpointProcessActiveFlag = 0;
    end

    OUT = checkpointProcessActiveFlag;
end

Listing 1: Simplified pseudo code for the adaptive Checkpoint Controller.

In the model using a fixed checkpointing period, the Checkpointing Controller is controlled by a clock and logic which tells it to checkpoint if there is at least one request in the log and the time since the last checkpoint ended is greater than or equal to the checkpointing interval. The fixed checkpointing component is shown in Figure 6 and the custom component (the MATLAB function) code is shown in Listing 2.


Figure 6: The content of the fixed Checkpoint Controller subsystem from Section 3.1, Figure 3.

function OUT = fcn(logSize, checkpointingRequestPeriod, currentTime)
    % Persistent state kept between calls
    persistent checkpointProcessActiveFlag;
    persistent lastCheckpointTimestamp;
    if isempty(checkpointProcessActiveFlag)
        checkpointProcessActiveFlag = 0;
        lastCheckpointTimestamp = 0;
    end

    if logSize ~= 0
        % Start a checkpoint when none is running and the configured
        % interval has passed since the last checkpoint ended
        if checkpointProcessActiveFlag == 0 && ...
                currentTime >= lastCheckpointTimestamp + checkpointingRequestPeriod
            checkpointProcessActiveFlag = 1;
        end
    elseif checkpointProcessActiveFlag == 1
        % The log has been emptied, so the checkpoint has completed
        checkpointProcessActiveFlag = 0;
        lastCheckpointTimestamp = currentTime;
    end

    OUT = checkpointProcessActiveFlag;
end

Listing 2: Simplified pseudo code for the fixed Checkpoint Controller.

3.1.3 Primary Server

The subsystem shown in Figure 7 represents the Primary server and handles all requests before a simulated crash. Incoming requests are put in a FIFO queue and are then sent to the server one by one for processing in the processing unit. When the processing unit has finished processing a request, the response time is read by the Read Timer component. All processed requests are put into a log, and when checkpointing, the Primary server sends state updates to the Backup server using the Log/State Transfer component.

Figure 7: The content of the Primary server subsystem from Section 3.1, Figure 3.

3.1.4 Log/State Transfer

The Log/State Transfer component (see Figure 8) handles the communication between the Primary server and the Backup server. The Primary server stores all processed requests in a queue (the Log component in Figure 8) in this subsystem. These requests are stored until they have been sent to the replicas during a checkpoint. The Delay Signal component is used to apply an overhead for performing a state transfer (described in Section 3.2). Both gates are used for controlling the flow into and out of the log. The Avoid Race Condition component is only required for the MATLAB model to work properly and can otherwise be ignored.


Figure 8: The content of the Log/State Transfer subsystem from Section 3.1, Figure 3.

3.1.5 Backup Server

The Backup server (see Figure 9) has a Get Attribute component that reads the service time of incoming items and passes it on to the Data Collector subsystem (see Figure 3 in Section 3.1). It also contains a flow regulator (the Scale Attribute component) and a sink. Items that have been processed by both servers are sent to the sink, where they are discarded. The flow regulator is made up of a gate, a process time scale function and a server (processing unit). This is where the State Transfer Scale value is used to scale the time it takes to process a state-transferred request before it is processed by the server (described in Section 3.2).

Figure 9: The content of the Backup server subsystem from Figure 3 in Section 3.1.


3.1.6 Data Collector

The Data Collector contains functions for calculating the maximum and average values of the Time to Recover (TTR) and the Response Time (RT), which are then sent to the MATLAB workspace for graph generation. This subsystem is shown in Figure 10.

Figure 10: The content of the Data Collector subsystem from Figure 3 in Section 3.1.

3.2 Parameters

In order to investigate the behaviour of different system configurations, several parameters are introduced. The parameters are measured in number of requests, in general units of time, or as scalars. A time unit should be seen as the time it takes to process the smallest type of request for a specific system. It is therefore simple to use the simulator for a specific system by multiplying the time unit by the actual time it takes to process the simplest of requests. The scalars are numeric values which are used to scale either a gamma distribution or the processing time of a request. The following four parameters have been chosen to be investigated through simulation:

• Checkpointing Interval is the interval between checkpoints for the fixed checkpointing method. This parameter is measured in time units.


• Request Service Time Scale is a value which controls how far the service time of requests can scale; in other words, how large a request can be. This parameter is a scalar value.

• Request Scarcity Scale is a value which controls the length of the periods between new requests. This parameter is a scalar value.

• State Transfer Scale is a value which controls how long it takes to transfer a processed request from the Primary server to the Backup servers. This parameter is a scalar value.

The Request Service Time Scale and Request Scarcity Scale are used as inputs for the gamma distribution components as described in Section 3.1.1.

In addition to the above parameters, there are a few more used in the simulation. They are not investigated through simulation and are strictly used with default values. The parameters considered as constants are listed below:¹

• Checkpointing Delay represents the time it takes for the Primary server to initiate a checkpoint. It is based on the assumption that there is a need for a handshake process between the servers, and can be seen as a static overhead for performing a state transfer. This parameter is measured in time units.

• Server Buffer Size - Adaptive Maximum is a parameter used by the adaptive checkpointing method to decide if the server is too busy to perform a checkpoint. It is otherwise referred to simply as Adaptive Maximum. This parameter is measured in number of requests.

• Log Size - Adaptive Minimum is used by the adaptive checkpointing method to decide whether there is a need for a checkpoint. If the number of requests in the log is lower than this minimum value, there is no need for a checkpoint. It is otherwise referred to simply as Adaptive Minimum. This parameter is measured in number of requests.

¹The adaptive parameters, Adaptive Maximum and Adaptive Minimum, could be investigated further if the adaptive method proves to be a useful approach. The Checkpointing Delay is left to future work as it is mostly a network issue and not the primary focus of this report.


3.3 Simulation Procedure

The simulation is configured with default values for all parameters, except for the variable that is currently being investigated. The first parameter, the checkpointing interval, is used to extract three interesting default values to use for the fixed interval checkpointing. All other parameters are run three times, once for each checkpointing interval, in order to compare the adaptive approach to several fixed interval configurations. Each parameter is configured with a maximum and a minimum value for which it should be tested. The density of points simulated along the interval is also set. Each point/value is repeatedly run 100 times with different probability distribution seeds. Each of these simulations is run for a period of 50,000 time units, after which the average response time, maximum response time, average time to recover and maximum time to recover are extracted. (The maximum value is the highest value achieved during simulation and can be seen as the worst case scenario.) Once the entire range of values has been simulated, two graphs are produced, one for response time and one for time to recover. The chosen default values for the parameters are shown in Table 1 (except for the Checkpointing Interval, which is determined through simulation and is presented in Section 4.1.1).

Table 1: Default values for all parameters, except Checkpoint Interval.

Parameter name               Value   Unit
Request Service Time Scale   100     Scalar
Request Scarcity Scale       1000    Scalar
State Transfer Scale         0.1     Scalar
Checkpoint Delay             2       Time Units
Adaptive Maximum             150     Requests
Adaptive Minimum             50      Requests
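As a rough illustration of this procedure, the sketch below shows how such a sweep could be scripted around the Simulink models from MATLAB. The model name, the logged signal name and the use of assignin are assumptions made for the example; they are not the exact scripts used in this work.

% Hypothetical driver for one parameter sweep (model and signal names are assumed)
paramValues = linspace(1, 121, 25);   % values of the investigated parameter
numSeeds    = 100;                    % repetitions with different distribution seeds
stopTime    = 50000;                  % simulation length in time units

avgRT = zeros(size(paramValues));
maxRT = zeros(size(paramValues));

for i = 1:numel(paramValues)
    rtMean = zeros(numSeeds, 1);
    rtMax  = zeros(numSeeds, 1);
    for s = 1:numSeeds
        rng(s);                                               % new seed for the distributions
        assignin('base', 'CheckpointInterval', paramValues(i));
        out = sim('warm_passive_fixed', 'StopTime', num2str(stopTime));
        rt  = out.get('responseTimes');                       % assumed logged output name
        rtMean(s) = mean(rt);
        rtMax(s)  = max(rt);
    end
    avgRT(i) = mean(rtMean);   % average response time for this parameter value
    maxRT(i) = max(rtMax);     % worst case response time over all seeds
end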


4 Warm Passive Simulation Results

Two models are used for the simulation, as described in Section 3: one model for simulating the fixed interval checkpointing approach and one model for the proposed adaptive approach. This section presents the simulation results for the four parameters presented in Section 3.2. All simulated parameters are presented using the average and maximum of both the response time and the time to recover.

4.1 Checkpointing Interval

Since the adaptive checkpointing does not use a time interval to decide when a state transfer should occur, it is not affected by the checkpointing interval parameter. For the fixed interval checkpointing model, however, it is a very important parameter and will change the system behaviour considerably depending on which value is used. This means that it is necessary to test several checkpointing interval values in order to make a fair comparison between the fixed interval checkpointing approach and the adaptive checkpointing approach. A range of fixed interval checkpointing values is investigated, ranging from 1 to 121, in order to find three appropriate default values for the checkpointing interval to be used in the other parameter simulations. The default values are marked in the figures using dashed vertical lines.

The three default values for the fixed interval checkpointing method have been chosen by picking one point at or near an intersection between the two checkpointing methods, one point some distance before it and one some distance after. This is in order to get intervals that may perform well in different scenarios and make it possible to compare all of these to the adaptive method. The chosen values are marked using dashed vertical lines in Figures 11 and 12 below.


Figure 11: The effects on response time when varying the checkpoint interval.

Figure 12: The effects on time to recover when varying the checkpoint interval.

Figure 11 shows that a lower checkpointing interval results in a higher response time for the fixed method. Conversely, Figure 12 shows that a higher checkpointing interval results in a longer time to recover for the fixed method. It is also possible to see (roughly) where the fixed interval approach meets the adaptive approach for both the average and the maximum. The intersection points are at roughly 70 for both the average and maximum lines for the response time. When it comes to the time to recover there is only one visible intersection point, which is at roughly 43 for the average lines. As mentioned earlier, the adaptive method is not affected by this parameter and has a strictly horizontal line. The most interesting characteristic of these figures is that the adaptive checkpointing method has a good average time to recover, but its maximum time to recover is very high in comparison.

4.1.1 Default Values of Checkpointing Interval

In order to compare the proposed adaptive checkpointing method to the fixed interval checkpointing method, default values for the Checkpoint Interval parameter are needed before proceeding with the other parameters. There is an intersection between the average time to recover of the adaptive and fixed approaches that provides three suitable values: one value before the intersection, one (near) the intersection itself and one value after the intersection. These values also fit fairly well with the response time intersections. The three default values are shown in Table 2 (and have already been presented in the previous Figures 11 and 12):

Table 2: Default values for the Checkpoint Interval parameter.

Description           Checkpoint Interval
Before Intersection   21
Intersection          41
After Intersection    71

4.2 Request Service Time Scale

This parameter affects the load by scaling the amount of time a request takes to process. It can be seen as a parameter that changes the size of the requests that are sent into the system.


Figure 13: The effects on response time when varying the request service time scale.

Figure 14: The effects on time to recover when varying the request service time scale.

Figure 13 shows that the adaptive method is faster than two of the fixed interval methods. However, the fixed interval 71 is slightly faster than the adaptive method.


Figure 14 shows that the adaptive approach is roughly equal to fixed 41 when it comes to the average, but has a significantly higher maximum (than all fixed) for large requests. In more general terms, one can also see that the relation between response time (see Figure 13) and Request Service Time Scale is roughly linear. This correlates well with how the processing time is scaled up along with the size of incoming requests. The reason the response time differs between the checkpointing methods is that resources are used in a more efficient way by the adaptive checkpointing method, since it adapts to the server load. However, the downside is that the maximum time to recover (see Figure 14) is significantly higher. This is due to the fact that it postpones state transfers in order to focus on processing requests. This in turn could have a negative impact on system availability, as a backup would need to process more requests directly from the log instead of already being up to date.


4.3 Request Scarcity Scale

The Request Scarcity Scale adjusts the amount of time between new requests.

Figure 15: The effects on response time when varying the request scarcity scale.

Figure 16: The effects on time to recover when varying the request scarcity scale.

Figure 17: The effects on time to recover when varying the request scarcity scale, excluding adaptive max.

Figure 15 shows that the adaptive approach has a faster response time for low scale values, which is due to the adaptive approach postponing checkpointing as long as there are a lot of requests waiting at the Primary server. This means that, when there are a lot of incoming requests, the adaptive checkpointing method does not perform checkpoints and is able to allocate more resources for processing incoming requests. The downside of not checkpointing during high system load is that it significantly increases the adaptive method's maximum time to recover, as shown in Figure 16, making it higher than the fixed checkpointing methods' maximum time to recover. Simply put, the worst case time to recover is the adaptive method's downside. In extreme cases it is also significant enough to affect the average time to recover of the adaptive method, as shown in Figure 17. The adaptive method does not seem ideal for systems with low request scarcity, which translates to systems with constantly high load characteristics. The cause of the strange behaviour of the adaptive method's maximum line shown in Figure 16 has not been identified, but since the maximum line is based on single worst case seeds (scenarios) it should not be significant for the study, as the average line is stable. In order to identify ideal situations it would be interesting to dig deeper into this issue, but for this bachelor thesis it has been scoped out.

4.4 State Transfer Scale

Varying the State Transfer Scale parameter shows how different types of data can affect the state transfer procedure. For example, if many requests update the same data, the state transfer of these requests would be small, because only the most recent update needs to be sent. However, if the requests just add a lot of data, a lot of data would also need to be sent to the Backup, and the state transfer would therefore take some effort. It is also possible that an initial request needs to be processed to calculate an answer and that only the answer needs to be sent to the Backup during the state transfer. This would also mean that the state transfer for these requests would be small.

Figure 18: The effects on response time when varying the state transfer scale.

Figure 19: The effects on time to recover when varying the state transfer scale.

Figure 18 shows that varying the state transfer scale has almost no impact on the adaptive checkpointing method's response time, but that it has a (roughly) logarithmic effect on the fixed checkpointing method. Conversely, Figure 19 shows that the adaptive checkpointing method's time to recover is significantly affected by the increase in state transfer scale and has an almost linear characteristic. The fixed checkpointing method is barely affected by the increasing state transfer overhead. The figures presented in this section clearly show the strength and weakness of the adaptive method: its ability to focus on incoming requests, at the expense of the maximum (worst case) time to recover.

4.5 System Availability Case Study

Before presenting the System Availability Case Study, we give a short summary of the data gathered in the simulation sections above. The adaptive method is flexible to change and allows for quick response times in most situations. The downside is the high maximum (worst case) time to recover.


As an example of its use, a general system specification is used to estimate the availability and the performance drawbacks achieved using both the adaptive checkpointing method and fixed interval checkpointing. The assumed hardware configuration consists of two separate servers, and the clients may vary from phones to personal computers.

In [18] there is a figure showing the breakdown of failures by the types of faults that caused them. Hardware failures make up over 60% of all failures and are the largest cause of failure. The rest is roughly split between software faults and other types of faults. Furthermore, the authors have normalized the number of failures by the number of processors in a particular system, and the average annual failure rate (AFR) is roughly 0.35 if systems one and three, which have a much larger failure rate than any of the others, are disregarded.

Since the availability equation is based on MTBF and not on AFR, a conversion is made from AFR to MTBF using equation 3 presented in [19]; this is shown in Equation 2.

MTBF = Hours in a year / AFR = 8760 / 0.35 ≈ 25,000 hours    (2)

As shown in Equation 2 above, the resulting MTBF is roughly 25,000 hours, which is just below three years. However, in a more complex system with more than one processor, this MTBF would quickly drop [18]. Compare this to the MTBF for a node in a large-scale system, which in [20] is said to be 4.3 months, translating to 3096 hours as shown in Equation 3:

4.3 months = 4.3 · 30 days · 24 hours = 3096 hours    (3)

The adaptive checkpointing method has a maximum (worst case) MTTR of about 32,000 time units, according to Figure 16 in Section 4.3. By assuming that one time unit is as long as the average time to load wikipedia.org, a time unit would be roughly 1 second long [21]. This includes more than just the time it takes for the server to handle a request, such as network delay, but in this case it can be attributed to a slow server. This results in an MTTR of 32,000 seconds, which is roughly 8.89 hours. As shown in Equation 4, this results in an availability of 99.96%.

Availability = MTBF / (MTBF + MTTR) = 25000 / (25000 + 8.89) = 0.999644526 ≈ 99.96%    (4)

The fixed checkpointing method has a maximum MTTR of about 1000 time units, according to Figure 19 (line fixed 71) in Section 4.4. Converting the fixed checkpointing MTTR from time units to seconds (using the same reference as for the adaptive calculation in the previous paragraph) gives an MTTR of 1000 seconds, which is roughly 0.28 hours. This results in an availability of 99.99%, as shown in Equation 5.

Availability = MTBF / (MTBF + MTTR) = 25000 / (25000 + 0.28) = 0.9999888 ≈ 99.99%    (5)

When compared to the fixed checkpointing method, the adaptive checkpointing method has less availability, but as shown in the simulation graphs it generally has a lower response time.
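For reference, the arithmetic in Equations 2, 4 and 5 can be reproduced with a few lines of MATLAB; this is only a sketch of the calculation above, under the same assumption that one time unit corresponds to one second.

AFR  = 0.35;                          % annual failure rate from [18]
MTBF = 8760 / AFR;                    % roughly 25,000 hours (Equation 2)

mttrAdaptive = 32000 / 3600;          % 32,000 time units -> about 8.89 hours
mttrFixed    = 1000  / 3600;          % 1,000 time units  -> about 0.28 hours

availAdaptive = MTBF / (MTBF + mttrAdaptive)   % about 0.9996  (Equation 4)
availFixed    = MTBF / (MTBF + mttrFixed)      % about 0.99999 (Equation 5)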


5 Discussion, Conclusion and Future Work

In this section the results are analyzed through a discussion and conclusions are presented. A list of suggestions for future work is also presented at the end.

5.1 Discussion

The simulation results have shown that a low MTTR, and thus high availability, is not a problem when using a fixed interval checkpointing method, as long as the interval is not too high. However, it has also been shown that response times can suffer from having to perform checkpoints often. The adaptive checkpointing approach manages to utilize resources better and achieves lower response times and, often, a reasonable time to recover. However, since it does not have any restrictions on how long it can put off checkpointing, it sometimes suffers from a very long time to recover. This is a possible subject for future work.

It should be possible to tailor the adaptive approach specifically to system requirements by imposing strict limitations on how it should behave. This could of course result in not being able to accomplish both good availability and quick response times. However, if one were to base it on the level of availability required in a well-defined worst case scenario, this information could be used to estimate the type of system that would be necessary in order to also meet the response time requirements.

Warm passive replication is highly dependent on the specific use case, and the more specific the system requirements that can be imposed, the easier it is to accomplish a good fault-tolerant solution. The simulations have shown that general solutions do not fit specific scenarios very well and that there is much to gain from tailoring the solution to specific requirements.

Our limitations have resulted in more work being put into the simulation models than into perfecting the suggested adaptive approach. A simulation model requires a lot of work to properly reflect how something works in reality, and there will still be idealizations. Therefore our results may not entirely reflect the reality of warm passive replication. The strength of a simulator is that different system configurations can be tested quickly to determine how important a specific parameter is. The extent of the variations between our simulator and live situations is still a subject for research. The adaptive approach is far from finished, and the possible restrictions that can be put on it should be further explored. It is also important to note that the failure data used as the basis for the case study is not from an emergency communication system, as such information was not found.


5.2 Conclusion

We had three questions to answer. First, What level of availability can be expected in the presence of crash failures for a system with warm passive replication and a load characteristic based on an emergency communication scenario?

By carefully considering the situation and parameters, close to 100% availability can be achieved (see Section 4.5, Equation 4). However, there are problems outside the scope of this work which make this a little more complicated. For example, most network issues have been disregarded.

Second and third, How can adaptive/dynamic checkpointing be performed to effectively distribute resources during irregular system loads where traffic may go from idle to extremely busy in a short period of time? How does the performance of the proposed adaptive checkpointing compare to the performance achieved using checkpointing with a fixed interval?

By making use of checkpointing rules which ensure that checkpoints are performed in idle periods, efficiency (as measured by a combination of average response time and average time to recover) can be increased beyond that of any fixed checkpointing interval.

The simulations successfully show how an adaptive checkpointing approach can be used to improve resource efficiency (with resource efficiency understood as a combination of response time and time to recover). The thesis has been strengthened, but due to the complicated nature of these simulations and systems, further experiments will be needed to prove the benefits for specific systems. Possible starting points for further research are presented below in Future Work.


5.3 Future Work

Data collected using simulations does not always reflect reality. One important aspect of continuing this work would be to determine the extent of the variations between our simulator and live systems. This may lead to having to extend the simulation to include things not regarded in this report, such as network issues.

When the extent of variation has been determined, more work could be put into further developing the adaptive checkpointing method. It could be done by optimizing the values of the Minimum Log Size and the Maximum Buffer Size, which affect how the adaptive checkpointing method works. It could also be done by introducing and testing a maximum value for how many requests are allowed to be stored in the log while postponing checkpoints. A more advanced adaptive approach could, for example, make use of entire system history logs to predict future behaviour. These could all be interesting fields to explore.

A further developed adaptive method, or even the adaptive method created in this report, could be used to perform empirical studies on the use of an adaptive checkpointing approach on specific emergency systems or other systems with similar characteristics.


References

[1] Mathworks. Simulink. http://se.mathworks.com/products/simulink/, last visited 2016-05-25.

[2] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Trans. Dependable Secur. Comput., vol. 1, no. 1, pp. 11–33, Jan. 2004, ISSN: 1545-5971.

[3] EventHelix. Reliability and availability basics. http://www.eventhelix.com/RealtimeMantra/FaultHandling/reliability_availability_basics.htm#.V0TVOpGLSUk, last visited 2016-05-25.

[4] A. Silberschatz, P. Galvin, and G. Gagne, Operating System Concepts, 9th Edition. Wiley Global Education, 2012, ISBN: 9781118559635.

[5] B. Charron-Bost, F. Pedone, and A. Schiper (Eds.), Replication: Theory and Practice. Springer-Verlag Berlin Heidelberg, 2010, vol. 5959, ISBN: 978-3-642-11293-5.

[6] G. Coulouris, J. Dollimore, T. Kindberg, and G. Blair, Distributed Systems: Concepts and Design, 5th ed. USA: Addison-Wesley Publishing Company, 2011, ISBN: 9780132143011.

[7] D. Szentiványi, S. Nadjm-Tehrani, and J. M. Noble, “Configuring fault-tolerant servers for best performance,” Proceedings of the International Workshop on High Availability of Distributed Systems (HADIS05), part of the 16th International DEXA Workshops, 2005, pp. 310–314, Aug. 2005, ISSN: 1529-4188.

[8] E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, “A survey of rollback-recovery protocols in message-passing systems,” ACM Comput. Surv., vol. 34, no. 3, pp. 375–408, Sep. 2002, ISSN: 0360-0300.

[9] S. Garg, C. M. Kintala, S. Yajnik, and Y. Huang, “Performance and reliability evaluation of passive replication schemes in application level fault tolerance,” Fault-Tolerant Computing, International Symposium on, vol. 0, p. 322, 1999, ISSN: 0731-3071.


[10] A. S. Gokhale, B. Natarajan, D. C. Schmidt, and J. K. Cross, “Towards real-time fault-tolerant corba middleware,” Cluster Computing, vol. 7, no. 4, pp. 331–346, 2004, ISSN: 1573-7543.

[11] D. Szentiványi and S. Nadjm-Tehrani, “Building and evaluating a fault-tolerant CORBA infrastructure,” Proceedings of the Workshop on Dependable Middleware-Based Systems (WDMS02), 2002.

[12] R. de Juan-Marin, H. Decker, and F. D. Muñoz-Escoí, “Revisiting hot passive replication,” 2012 Seventh International Conference on Availability, Reliability and Security, vol. 0, pp. 93–102, 2007, ISBN: 0-7695-2775-2.

[13] J. Balasubramanian, S. Tambe, C. Lu, A. Gokhale, C. Gill, and D. C. Schmidt, “Adaptive failover for real-time middleware with passive replication,” 2009.

[14] Z. Guessoum, J.-P. Briot, A. Hamel, O. Marin, and P. Sens, Dynamic and Adaptive Replication for Large-Scale Reliable Multi-Agent Systems. Springer Verlag, Apr. 2003, pp. 182–198.

[15] R. Guerraoui and A. Schiper, “Software-based replication for fault tolerance,” Computer, vol. 30, no. 4, pp. 68–74, Apr. 1997, ISSN: 0018-9162.

[16] P. Felber, X. Défago, P. T. Eugster, and A. Schiper, “Replicating CORBA objects: a marriage between active and passive replication,” in DAIS, ser. IFIP Conference Proceedings, L. Kutvonen, H. König, and M. Tienari, Eds., vol. 143. Kluwer, 1999, pp. 375–388, ISBN: 0-7923-8527-6.

[17] J. Yang, D. Jin, Y. Li, K.-S. Hielscher, and R. German, “Modeling and simulation of performance analysis for a cluster-based web server,” Simulation Modelling Practice and Theory, vol. 14, pp. 188–200, 2006, ISSN: 1569-190X.

[18] B. Schroeder and G. A. Gibson, “A large-scale study of failures in high-performance computing systems,” in Proceedings of the International Conference on Dependable Systems and Networks, ser. DSN ’06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 249–258, ISBN: 0-7695-2607-1.


[19] W. Torell and V. Avelar, “Performing effective MTBF comparisons for data center infrastructure,” WP-112, 2005.

[20] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, “Availability in globally distributed storage systems,” in OSDI, R. H. Arpaci-Dusseau and B. Chen, Eds. USENIX Association, 2010, pp. 61–74, ISBN: 978-1-931971-79-9.

[21] Alexa Internet Inc. Actionable analytics for the web. http://www.alexa.com/ siteinfo/wikipedia.org, last visited 2016-05-25.


Linköping University Electronic Press


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.
