
Linköping University | Department of Computer and Information Science
Bachelor Thesis | Information Technology
Autumn 2016 | LIU-IDA/LITH-EX-G--16/072--SE

Estimating Time to Repair Failures in

a Distributed System

Lisa Habbe

Matilda Söderholm

Supervisor: Mikael Asplund
Examiner: Nahid Shahmehri


Abstract

To ensure the quality of important services, high availability is critical. One aspect to be considered in availability is the downtime of the system, which can be measured in time to recover from failures. In this report we investigate current research on the subject of repair time and the possibility of estimating this metric based on relevant parameters such as hardware and the type of fault. We thoroughly analyze a set of data containing 43 000 failure traces from Los Alamos National Laboratory on 22 different cluster-organized systems. To enable the analysis we create and use a program which parses the raw data, sorts and categorizes it based on certain criteria and formats the output to enable visualization. We analyze this data set in consideration of type of fault, memory size, processor quantity and at what time repairs were started and completed. We visualize our findings of number of failures and average times of repair dependent on the different parameters. For different faults and times of day we also display the empirical cumulative distribution function to give an overview of the probability for different times of repair. The failures are caused by a variety of different faults, where hardware and software are most frequently occurring. These two along with network faults have the highest average downtime. Time of failure proves important, since both day of week and hour of day show patterns that can be explained by for example work schedules. The hardware characteristics of nodes seem to affect the repair time as well, although how this correlation works is difficult to conclude. Based on the data extracted we suggest two simple methods of formulating a mathematical model estimating downtime, which both prove insufficient; more research on the subject and on how the parameters affect each other is required.


Contents

Abstract i

Contents ii

List of Figures iii

List of Tables iv

1 Introduction 1

1.1 Objective . . . 1

1.2 Research Questions . . . 1

1.3 Scope and Limitation of Study . . . 2

2 Theory 3

2.1 Present Research on Downtime Due to Failures . . . 3

2.2 Publicly Available Failure Traces . . . 4

2.3 Historical Data: The Los Alamos National Laboratory Data Set . . . 4

3 Method 7

3.1 Available Failure Traces . . . 7

3.2 Java Program . . . 8

4 Results 12

4.1 Downtime in Consideration of Fault . . . 12

4.2 Time of Failure . . . 13

4.3 Hardware in Node Experiencing Failure . . . 15

4.4 Cumulative Distribution of the Downtimes . . . 18

5 Mathematical Model 22

5.1 Test of our Mathematical Models . . . 23

6 Discussion 24

6.1 Results . . . 24

6.2 Method . . . 25

6.3 Conclusion . . . 26


List of Figures

2.1 FTA format structure . . . 4

2.2 Event types interlacing . . . 5

3.1 Overview of the Java program . . . 8

3.2 Overview of the program testing the mathematical models . . . 10

4.1 Distribution of number of failures per type of fault . . . 12

4.2 Average downtime per type of fault . . . 13

4.3 Average downtime per weekday . . . 14

4.4 Average downtime and average number of completed repairs per hour . . . 14

4.5 Average number of started repairs and average number of completed repairs per hour . . . 15

4.6 Average downtime for different sizes of memory. . . 17

4.7 Average downtime per node with different number of processors. . . 18

4.8 The empirical cumulative distribution function for all failure traces. Note the logarithmic scale. . . 19

4.9 The distribution of downtimes of all failure traces. . . 20


List of Tables

4.1 Number of failures occurred each weekday. . . 13

4.2 Memory Sizes . . . 16

4.3 Processor Quantities . . . 16

4.4 Number of failures for different sizes of memory. . . 16

4.5 Number of failures for different sizes of processor quantities. . . 17

4.6 The 95th percentile for the different fault types . . . 21


1

Introduction

Availability is the ratio of time during which a system is up and functioning, and it is an important attribute of a dependable system. The consequences of unscheduled downtime for a company that provides a service may be graver than the outage itself, as the company might lose the trust of its clients. Availability is difficult to measure in real systems; however, one factor to be considered is the downtime of a system, which we are going to investigate using raw data. The cause of unscheduled downtime is a failure. A fault, also known as a root cause, may lead to an error in the system. When the error makes the system unavailable, the system experiences a failure.

An important way to give customers and users a better experience when a failure does occur is to estimate how much time will pass before the service becomes available again. Knowledge of a system's probable downtime is always useful, especially when working directly with a client who may want an estimate of the downtime when the system becomes unavailable. If that estimate can be easily obtained with the help of a mathematical model, the information can be provided quickly.

In our report we aim to investigate and create an understanding of downtime, also referred to as time of repair, in a real system. This understanding can be used to adapt the repair process to the system and to different aspects of the failure; depending on the cause of the failure, it will probably matter what method is used to prevent and correct it. Our aim is to investigate the relationship between the relevant parameters of a failure and the time of repair.

We will study current research on the subject of availability and downtime, as well as publicly available failure traces. Furthermore, we want to determine which data about system failures are relevant and in what way. One data set which includes the specific parameters we deem relevant will be selected for a historical raw-data study. We will reformat the data set with a Java program and use MATLAB to visualize our findings. We will analyze the obtained information and use it to formulate and test mathematical models for estimating downtime.

1.1

Objective

We want to be able to use our mathematical model, which will be based on historical failure-trace data, to accurately estimate recovery time. We want the model to estimate a system's unscheduled downtime based on certain parameters, enabling repair adaption and a way to quantify the system's availability.

1.2

Research Questions

We have chosen to investigate the following questions:

• Which factors are relevant in estimating downtime?

• Are there any patterns or correlations between these parameters and downtime?

• Can downtime be estimated based on knowledge of different relevant parameters?



1.3

Scope and Limitation of Study

Finding relevant data is difficult because many companies are not keen on revealing to their customers how their systems fail. Much of the failure data is therefore not available to the public. Further obstacles in finding relevant data are the large number of different kinds of systems that exist and the different types of failures that can occur in them. Many system failure logs are useless to us as they simply check at regular time intervals whether a component is available, which does not contribute to the understanding of recovery time. We are interested in data that state the exact time of failure and time of repair, so that we can analyze the downtime.


2

Theory

When conducting background research it becomes obvious that there is a lack of resources concerning this subject. Although a significant amount of material on system availability can be studied, few of the articles actually analyze a set of traces, and even fewer focus primarily on downtime. As clarified by Schroeder and Gibson [5], the lack of publicly released failure traces is the main reason for this void. There is a limited number of data sets containing failure logs, and the few publicly available ones that actually display data with relevant parameters are not as detailed as desired.

2.1

Present Research on Downtime Due to Failures

Schroeder and Gibson analyze failure data collected at Los Alamos National Laboratory (LANL) between 1996 and 2005. Their study was executed before the public release of the data, and the objective of their paper was to:

• Provide a description of the statistical properties of the data

• Give information for other researchers on how to interpret the data.

Schroeder and Gibson evaluate the availability of the systems as a whole rather than focusing on downtime. They consider the downtime as one aspect of availability, whereas we focus solely on the downtime. A disadvantage of the LANL data set, according to Schroeder and Gibson, is that the faults leading to a failure are determined by the system administrators; the data quality therefore depends on them. However, because of the routine in reporting failures and the highly trained staff, they do not consider the reports to be a problem. In cases where the administrators do not have enough knowledge about a system, most failures will be categorized as unknown, a correlation which Schroeder and Gibson show in their report. When many of the failures have an unknown fault, it becomes harder to analyze the aspects we want to. Worth mentioning is that the recording of the time and location of failures is executed automatically, so the data concerning time can be considered accurate. Moreover, Schroeder and Gibson made a breakdown of failures and downtime depending on the fault and the type of system. The most common fault was hardware, followed by software, for all system types. They also presented the median and the mean time to repair for the systems in correlation with the fault and found that both varied depending on the fault. They found that the type of system had a large impact on the downtime and that the size of the system does not contribute much. They also found that downtimes vary considerably, both across systems and within a system.

Yigitbasi et al. [7] analyzed the time-varying behaviour of failures in large-scale systems and created a model for this. In their introduction they claim that "[...] most previous work focuses on failure models that assume the failures to be non-correlated, but this may not be realistic for the failures occurring in large-scale distributed systems." This is because large-scale distributed systems may have periods during which the rate of failures increases. The focus of their work was therefore to create a peak model that shows these periods. They found that the failure rate is variable but that the failures still show periodic behaviour and correlation in time. This applies to most of the 19 systems studied.



Bahman et al. [2] compared several data sets of failure traces with regard to the statistical properties of availability and unavailability intervals, such as mean, median and standard deviation. They also fit different distributions to the data sets. While they investigated several data sets and compared them in different aspects, we want to concentrate on investigating one data set more thoroughly and see which aspects are important for downtime in that one data set.

2.2

Publicly Available Failure Traces

Because of the great void in available system failure traces, some organizations have been founded with the sole purpose of making this kind of data accessible to the public. We have considered records provided by the Failure Trace Archive (FTA) [2] as well as the Computer Failure Data Repository (CFDR) [6]. While the CFDR provides raw failure logs retrieved from a wide range of systems, the FTA provides all traces formatted according to a hierarchically organized structure of the system components. This enables comparison between different system traces as well as a better understanding of the traces and of how desired information can be retrieved. Because of the FTA's standardized formatting of all traces and its broad range of traces, we choose to proceed with focus on the FTA resources.

The FTA format is structured hierarchically as described in figure 2.1. A platform is a collection of nodes, which are constructed from different components. Each component has a creator, which is a log of who created the trace, as well as a component type code. Each component has an event trace with an event type code which describes whether the target component is available or not. The event end reason code is relevant to us since it categorizes the reasons for unavailability.

Figure 2.1: FTA format structure [2]

Each of these parts, represented as coloured blocks in figure 2.1, has traces stored in separate tables for each data set. The FTA provides 26 sets of data containing failure traces of different system types.

2.3

Historical Data: The Los Alamos National Laboratory Data Set

The LANL data set covers 22 high-performance-computing (HPC) systems consisting of 4750 nodes and 24101 processors. The system logs span from year 1996 to 2005, and the majority



of its workload consists of scientific 3D-modeling and simulation tasks, which often run for several months. These tasks amount to a lot of CPU computation along with regular checkpointing. Data from the results of these are also visualized by the system, and furthermore some nodes of course need to handle front-end tasks as well. Every failure that required a system administrator to solve has been included in the data [5]. We initially place our focus on the event trace logs, since these provide information about the relevant parameters.

Each event trace can be categorized as available or unavailable, depending on the event type code. In cases where the component is unavailable the event type is 0, and where it is available it is 1. The component ID as well as the node ID are also logged, enabling location of the failure in the system. The timestamps for event start and end are logged in Epoch time, which can easily be converted into UTC time. The time unit used is seconds; however, all measurements are made in multiples of 60, i.e. whole minutes.
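The Epoch-to-UTC conversion mentioned above can be done directly with the Java standard library; a minimal sketch (the class name and output format are our own choices, not part of the data set):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochToUtc {
    // Convert an epoch timestamp in seconds to a human-readable UTC string.
    static String toUtc(long epochSeconds) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochSecond(epochSeconds));
    }

    public static void main(String[] args) {
        // 946684800 is the epoch second for 1 Jan 2000, 00:00:00 UTC.
        System.out.println(toUtc(946684800L));
    }
}
```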

Each trace has an event ID, which together with the node ID is unique for each trace. This is because a node will experience several events, so these two fields make up the composite primary key of the table. Every trace where the event type is 0, meaning that the node is unavailable, has the event end reason 0, meaning that the fault is undetermined. This is because no fault ends a session of unavailability; faults only end sessions of availability. Therefore we need to combine the relevant data from the unavailability trace with the event end reason of the previous trace of the same node with matching event ID, which shows why the availability ended. Consider figure 2.2: the event end time for event 1 is the same as the event start time for event 2, and the event 1 end reason is the event start reason for event 2. In this example we would therefore extract all relevant data from event 2 except the event end reason, which we would get from event 1, to analyse the parameters concerning the downtime.
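The pairing described above can be sketched in Java. The record fields and names below are illustrative simplifications, not the actual FTA table layout:

```java
import java.util.ArrayList;
import java.util.List;

public class EventPairing {
    // Simplified event record; field names are our own, not the FTA schema.
    record Event(int nodeId, int eventId, int type, long start, long end, int endReason) {}
    record Failure(int nodeId, long downtimeSeconds, int faultCode) {}

    // Pair each unavailability event (type 0) with the preceding availability
    // event (type 1) on the same node: the availability event's end reason
    // identifies the fault, and the unavailability event's span is the downtime.
    // An end reason of 0 would mean the fault is undetermined.
    static List<Failure> pair(List<Event> trace) {
        List<Failure> failures = new ArrayList<>();
        for (int i = 1; i < trace.size(); i++) {
            Event prev = trace.get(i - 1), cur = trace.get(i);
            if (cur.type() == 0 && prev.type() == 1 && cur.nodeId() == prev.nodeId()) {
                failures.add(new Failure(cur.nodeId(), cur.end() - cur.start(), prev.endReason()));
            }
        }
        return failures;
    }
}
```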

Figure 2.2: Event types interlacing

The data set covers a total of 43325 traces; however, this includes traces both when a node is available and when it is not. Only 17096 of the traces showing node availability are provided with an event end reason, each of which represents one failure. The event end reasons are categorized as follows:



Type of fault    Range of failure codes

Not determined   0
Infrastructure   1–999
Hardware         1000–1999
I/O              2000–2999
Network          3000–3999
Software         4000–4999
Human Error      5000–5999

Infrastructural failures are caused by faults such as power outages and problems with chillers. Hardware failures are caused by hardware faults at the actual nodes. I/O is short for Input/Output and represents faults in the communication flow, for example in the fibre cable or disk drive. Network failures are caused by faults in the hardware of the network, such as different cables and switches. The software failures concern faults in the software of the whole system, as well as software in the network and in user code. Human error is anything caused by a human, for example an improperly installed component. In total there are 122 unique error codes recorded in the data set.
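The code ranges above map directly to fault categories; a minimal Java sketch (the class and method names are ours):

```java
public class FaultCategory {
    // Map a LANL failure code to its fault category, following the
    // code ranges listed in section 2.3.
    static String categorize(int code) {
        if (code == 0) return "Not determined";
        if (code <= 999) return "Infrastructure";
        if (code <= 1999) return "Hardware";
        if (code <= 2999) return "I/O";
        if (code <= 3999) return "Network";
        if (code <= 4999) return "Software";
        if (code <= 5999) return "Human Error";
        return "Unknown";
    }
}
```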

The data set also covers information about each node; we have focused on the memory size and the number of processors each node possesses. In the set there are in total nine different memory sizes, ranging from 1 Gigabyte to 1 Terabyte. The number of processors ranges from 2 to 256, and there are in total seven different processor quantities across all nodes.


3

Method

We will try to meet the objectives of this report by conducting the following in chronological order:

1. A study of relevant failure traces and parameters to be considered when estimating downtime

2. The development of a program to parse raw failure trace data into needed format with relevant information

3. Visualization and analysis of extracted data for a chosen system

4. Development of mathematical models to estimate downtime based on selected param-eters

3.1

Available Failure Traces

When searching for a set of failure traces to investigate, we will evaluate each data set after a few characteristics that we classify as relevant. These characteristics are:

• Length of the time span during which data was collected

• Number of traces in the data set

• Detailed information about relevant parameters, such as:

  – Downtime

  – Fault, error and/or failure categorization

  – Time of failure

  – Hardware of failed node

Evidently it is difficult to find a publicly available data set that is appropriate for our use in every characteristic. Ideally the data should have been collected quite recently from a modern system; however, due to the lack of more recent traces, this is an aspect we choose not to weigh in as much, since it does not directly affect our actual analysis of the particular system. The time span needs to be quite vast and cover information from a big set of nodes to provide a wide foundation for identifying patterns based on the different parameters. For most failure traces from systems with a Peer-To-Peer structure (often Internet services) there is no identification of the fault behind a failure, and the time stamps are not very useful, since those traces often only check whether a host is available at certain intervals, with frequencies too low for our analysis. The lack of fault identification pervades many of the other system traces as well.

We choose to focus on the LANL data set, which can be collected from the FTA, since this set is most appropriate for our analysis according to the previously mentioned characteristics. The analysis and conclusions of Schroeder and Gibson [5] will be of use while we conduct our research, as they analyze the same data collection. The data set will be inserted



into our own database, since this will give us a better understanding of the structure. The set is divided into several tables, which enables us to navigate through the data using MySQL queries as well as to parse the data into text files to be read by the Java program. Only 90% of the data set will be used in the analysis; the remaining 10% will be used as an evaluation set for testing our mathematical models. The evaluation set will be randomly selected based on the time of the failure. This is done to avoid an evaluation set in which failures are correlated or dependent on each other.
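A random 90/10 split like the one described above can be sketched as follows. This is a simplified illustration (a plain shuffle with a fixed seed for reproducibility), not the thesis's actual selection-by-time procedure:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TraceSplit {
    // Randomly partition trace identifiers into a ~90% analysis set
    // and a ~10% evaluation set. A fixed seed keeps the split reproducible.
    static List<List<Integer>> split(List<Integer> traces, long seed) {
        List<Integer> shuffled = new ArrayList<>(traces);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * 0.9);
        return List.of(shuffled.subList(0, cut), shuffled.subList(cut, shuffled.size()));
    }
}
```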

3.2

Java Program

We will create the Java program from scratch and customize it both to read the current format of the data set and to reformat the data for use in our analysis. The structure of the program is described in figure 3.1.

Figure 3.1: Overview of the Java program

The input will consist of two trace files, Event Trace and Node. Because of the format of the data we use, the program needs to retrieve information about the same failure from both files. In the first module of the program the files will be read two traces at a time. We will compare the node ID and event ID of these two traces, as well as the event type (available/unavailable), to determine whether they concern the same failure and need to be reformatted. If so, a new trace for the failure will be created, in which the parameters type of fault, memory size, number of processors and time of failure will be logged, as well as the downtime. All traces are added to a trace list which is then sorted in the next program module based on the different parameters. The sorting module is divided into several parts, one for each



parameter. Each one will only save trace data on the relevant parameters, for example when sorting traces in the Time of failure module, only downtime, start hour and end hour of trace will be required.

The output of the program is a text file for each parameter value with its respective average downtime and total number of failures, as well as the number of repairs started and ended each hour of the day, to visualize workload during a day. We also print a separate file with all downtimes, both for the whole system and for the different faults, to create cumulative distribution functions (CDFs).
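The per-parameter output described above boils down to accumulating a total downtime and a failure count per parameter value; a minimal Java sketch (class and method names are ours, not the thesis code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DowntimeAggregator {
    // For each parameter value: [0] = total downtime in seconds, [1] = failure count.
    private final Map<String, long[]> stats = new LinkedHashMap<>();

    // Accumulate one failure under a parameter value (e.g. a fault type or weekday).
    void add(String parameterValue, long downtimeSeconds) {
        long[] s = stats.computeIfAbsent(parameterValue, k -> new long[2]);
        s[0] += downtimeSeconds;
        s[1]++;
    }

    double averageDowntime(String parameterValue) {
        long[] s = stats.get(parameterValue);
        return (double) s[0] / s[1];
    }

    long failureCount(String parameterValue) {
        return stats.get(parameterValue)[1];
    }
}
```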

Analysis of the Failure Trace

In our analysis we will first investigate the different types of faults for the system. One reason for doing this is that every failure is recorded with a fault in the failure trace. The system administrators must always give the failure a fault code, and if they do not know the reason they mark it as unknown. This, together with the previous research presented in chapter 2, indicates that the different types of fault matter for the downtime. Therefore we will investigate whether the type of fault makes a difference in the downtime and, if so, how this difference presents itself. We will also present how often the different types of fault occur in the failure trace.

Because we have the exact time of when a failure starts and ends, we will display the average downtime in consideration of when the fault occurred. We have chosen to study the downtime during an average day and week. We will observe whether any patterns occur in these periods, which we expect since the usage of the system will probably differ between, for example, the weekend and a workday. Moreover, we will display the number of repairs over the same periods.

A failure is always recorded with the id of the node in which it occurs. This can be used for identifying the hardware characteristics of the node. As the system performs heavy computations, it is not unlikely that hardware-related failures are common. Therefore we will investigate the differences between the various hardware characteristics.

Lastly, before we construct our mathematical models, we investigate the probability of a certain downtime in the system by making use of the CDF, both for the system as a whole and for the different fault categories, as a complement to the average downtime and occurrence figures.
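An empirical CDF of this kind can be computed directly from the sorted downtimes; a small Java sketch (illustrative, not the thesis code; the percentile method matches the kind of 95th-percentile figures reported in table 4.6):

```java
import java.util.Arrays;

public class EmpiricalCdf {
    private final double[] sorted;

    EmpiricalCdf(double[] downtimes) {
        sorted = downtimes.clone();
        Arrays.sort(sorted);
    }

    // Fraction of observed downtimes less than or equal to x.
    double at(double x) {
        int count = 0;
        for (double v : sorted) if (v <= x) count++;
        return (double) count / sorted.length;
    }

    // Smallest downtime d such that at(d) >= p, e.g. p = 0.95.
    double percentile(double p) {
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }
}
```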

Development and Evaluation of the Mathematical Models

Based on the analysis of the failure trace we will develop two simplistic mathematical models. Both models will estimate the downtime τ based on weighted parameters, where the weighting is determined by the results of our analysis. We formulate the models accordingly:

\tau = \sum_{x=1}^{n} \alpha_x \beta_x

where n is the number of parameters considered, \alpha_x is the value of parameter x and \beta_x its weight. Both mathematical models will be evaluated with an automated test on our evaluation set, which reads each trace in the set and compares its actual downtime with our estimate. The test is also written in Java and its structure is described in figure 3.2.
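The weighted-sum model is straightforward to compute; a minimal Java sketch (the parameter values and weights passed in would come from the analysis, and the example numbers below are purely illustrative):

```java
public class DowntimeModel {
    // Estimate downtime as a weighted sum of parameter values:
    // tau = sum over x of alpha_x * beta_x.
    static double estimate(double[] values, double[] weights) {
        double tau = 0.0;
        for (int x = 0; x < values.length; x++) {
            tau += values[x] * weights[x];
        }
        return tau;
    }
}
```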



Figure 3.2: Overview of the program testing the mathematical models

We use the same reading module as in the parsing program since the traces have the same format; the only difference is that we only read the evaluation data together with the node information. A similar trace list is created, where each trace contains data about the parameters of the mathematical model as well as the downtime. The program will determine the value and the weight of each parameter for each trace by reading data from our analysis. It will then calculate the difference between our estimate and the actual downtime by inserting the parameter data into our model. When all traces have been estimated by the model, the program will calculate the mean absolute error (MAE), the root mean squared error (RMSE), the mean absolute percentage error (MAPE) and the standard deviation (σ) of the MAE.

For n traces considered, with actual downtime a_i and estimation error e_i, the MAE, RMSE and MAPE will be calculated according to equations 3.1, 3.2 and 3.3 respectively [1][4]:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |e_i| \quad (3.1)

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2} \quad (3.2)

\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|e_i|}{a_i} \quad (3.3)



The standard deviation σ will be calculated based on the MAE:

\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (e_i - \mathrm{MAE})^2} \quad (3.4)

The MAE weighs all estimation errors e_i equally when calculating the mean. The RMSE averages the squared estimation errors e_i^2, which makes larger errors weigh more in the mean. The percentage error shows how large the estimation error e_i is compared to the actual downtime a_i, meaning that the MAPE is the only measure of error relative to the actual downtime. We will also calculate the standard deviation σ, since it contributes to the understanding of the MAE.
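Equations 3.1–3.4 translate directly into code; a Java sketch (method names are ours):

```java
public class ErrorMetrics {
    // e[i] is the estimation error for trace i; a[i] is the actual downtime.
    static double mae(double[] e) {
        double sum = 0;
        for (double v : e) sum += Math.abs(v);
        return sum / e.length;
    }

    static double rmse(double[] e) {
        double sum = 0;
        for (double v : e) sum += v * v;
        return Math.sqrt(sum / e.length);
    }

    static double mape(double[] e, double[] a) {
        double sum = 0;
        for (int i = 0; i < e.length; i++) sum += Math.abs(e[i]) / a[i];
        return sum / e.length;
    }

    // Standard deviation of the errors around the MAE, as in equation 3.4.
    static double sigma(double[] e) {
        double mae = mae(e), sum = 0;
        for (double v : e) sum += (v - mae) * (v - mae);
        return Math.sqrt(sum / e.length);
    }
}
```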


4

Results

There are in total 38992 traces considered in the analysis, which, as previously stated, have been continuously logged during a nine-year period. In this chapter we show how the parameters which we have identified as relevant may have affected the downtime.

4.1

Downtime in Consideration of Fault

The first parameter we consider is the original fault leading to the failure and its correlation with the time of repair. In this part of the analysis there are 13 869 traces considered, as we exclude the failures where the reason for the downtime is unknown. The faults are distributed over the different categories according to figure 4.1.

Hardware 59%, Software 26%, Network 8%, I/O 4%, Infrastructure 2%, Human Error 1%

Figure 4.1: Distribution of number of failures per type of fault

The clear majority of failures are caused by hardware faults, but software faults are also common. Human error is extremely rare; this is not surprising since the system is used to visualize scientific data. The average downtime for each failure category is shown in figure 4.2. The average downtimes for I/O and Infrastructure both lie at about 37,000 seconds (about 10.3 hours), while the average downtime for hardware and network related faults is about 12,000 seconds (about 3.3 hours). The one fault that deviates in terms of average downtime is human error, with a downtime under 10,000 seconds (about 2.8 hours). We saw in figure 4.1 that human errors are the least common of the recorded faults; the average downtime and the occurrence of human errors therefore show that they are not a large contributor to the total downtime of this system. Hardware, software and network related faults are the three most frequent faults, although they have the lowest average downtimes. Since hardware and software faults are so common, we can nevertheless argue that these are the biggest contributors to the total downtime.



Figure 4.2: Average downtime per type of fault

4.2

Time of Failure

We have investigated whether the downtime depends on when the failure occurs. Here we regard the data both with and without a known fault, in total 15 811 failures. In particular we investigate the average time of failure over a week and over a day. We removed all traces for which the downtime was over a day, since these are outliers that would affect the averages in a misleading way.

Table 4.1: Number of failures occurred each weekday.

Day of week # of failures

Monday 2295
Tuesday 2703
Wednesday 2965
Thursday 2541
Friday 2336
Saturday 1611
Sunday 1360

Table 4.1 shows the number of failures occurring each weekday; not surprisingly there are fewer failures reported during the weekend, which indicates a common work schedule. In figure 4.3 we have plotted the average downtime for the system displayed over a week. The graph shows the day that the fault starts and the average downtime that the fault leads to. We can see that Monday has considerably lower downtime compared to all other days of the week. The average repair time for Tuesdays, Wednesdays and Fridays is slightly lower than 2 hours. For Thursdays, Saturdays and Sundays the average repair time is over 2 hours.



Figure 4.3: Average downtime per weekday

In figure 4.4 the plot to the left shows the average downtime of failures for each hour of the day, while the plot to the right shows the number of repairs that are completed at a certain time of the day. We can see that the average time of repair is lowest between 10 am and about 6 pm. For the same period the right plot shows the largest number of repairs completed. The time of day is registered according to local time, and we can see that the graphs roughly mirror each other: while one has its highs the other has its lows. This suggests a correlation between downtime, number of repairs completed and regular working hours.

Figure 4.4: Average downtime and average number of completed repairs per hour




Figure 4.5: Average number of started repairs and average number of completed repairs per hour

Figure 4.5 shows the average number of repairs per hour, both the number of completed repairs and the number of started repairs. We can see that the number of started repairs is followed quite well by the number of completed repairs. This is logical, since when many repairs are started, a peak of completed repairs follows after some time. Hence, when a number of repairs are started, about the same number are finished, although shifted in time. A peak occurs at 8 o'clock, which also indicates that this is in fact the first hour of work. The peak at 18 o'clock is difficult to explain; it could for example be caused by a routine availability check of a certain group of rarely used nodes. Without better insight into the system it is hard to tell.

4.3 Hardware in Node Experiencing Failure

As shown in figure 4.1, hardware faults are the most common; we therefore find it interesting to investigate characteristics of the node where the failure occurs. Because of the hardware


characteristics known for each node, we can use these as parameters when investigating downtime. In this part of the analysis we investigate all 16 001 failures.

All memory sizes and processor quantities are displayed in tables 4.2 and 4.3, together with the number of nodes in the system possessing each memory size and processor quantity respectively. As seen, the majority of the nodes have smaller memories and fewer processors. This data is stored in the separate node table in the data set.

Table 4.2: Memory Sizes

Memory size   # of nodes
1 GB          120
4 GB          1041
8 GB          1365
16 GB         856
32 GB         185
64 GB         2
80 GB         1
128 GB        4
1 TB          1

Table 4.3: Processor Quantities

# of processors   # of nodes
2                 803
4                 2700
8                 1
32                2
80                1
128               68
256               1

Table 4.4 shows the number of failures that occurred per memory size during the time span of the limited failure trace.

Table 4.4: Number of failures for different sizes of memory.

Memory size   # of failures
1 GB          486
4 GB          1027
8 GB          3272
16 GB         2903
32 GB         6917
64 GB         472
80 GB         5
128 GB        876
1 TB          43

Failures in nodes with 80 Gigabytes and 1 Terabyte of memory were only recorded 5 and 43 times respectively. As seen, nodes with 1 and 64 Gigabytes of memory experience quite few failures as well: 486 and 472 respectively. Nodes with a memory size of 32 Gigabytes experience by far the largest number of failures, which is quite surprising since only 185 of the 3577 nodes have a memory of this size. To investigate this further we would need better insight into the functionality of the system and the mapping of nodes, since this can depend on for example the specific jobs of a node.

Figure 4.6 shows the average downtime for each memory size, excluding the sizes 80 GB and 1 TB due to their low number of failures. The average has been calculated by dividing the total downtime by the total number of failures of all nodes with the specific memory size. Notably, the memory sizes with the fewest failures also exhibit the longest average downtimes.

(22)


Figure 4.6: Average downtime for different sizes of memory.

No strict pattern can be detected in this data; the downtime does however seem to decrease for the medium sizes of memory and increase for smaller and bigger memories, excluding the memory size of 128 GB. This memory size is vastly bigger than the others, which might explain its deviation from the pattern.

We have also considered the number of processors each node has. In table 4.5 we show the number of failures per quantity of processors.

Table 4.5: Number of failures for different processor quantities.

Processor quantity   # of failures
2                    832
4                    6750
8                    98
32                   306
80                   5
128                  7967
256                  43

Nodes with 4 and 128 processors are heavily represented in the number of failures. This does not seem to affect the average downtime, which is plotted in figure 4.7. The average downtime is quite consistent, with a lower downtime for the nodes with more processors. The only abnormality is the downtime of nodes with two processors, which is extremely high relative to the other processor quantities.

The number of failures and the average downtime show no visible correlation, for either the different memory sizes or the processor quantities; combined, however, the two provide a better picture of the performance of a given memory size or processor quantity. For example, consider the nodes with the smallest memory, 1 Gigabyte. These experience a very low percentage of the total number of failures, as seen in table 4.4, yet they have by far the longest average downtime, as seen in figure 4.6. The considerably low number of failures is quite good taking the number of nodes with this memory size, 120, into account, but the average time it takes to repair these failures is very poor.

(23)


Figure 4.7: Average downtime for nodes with different numbers of processors.

4.4 Cumulative Distribution of the Downtimes

We have plotted the cumulative distribution function, CDF, of the repair time for our whole failure trace on a logarithmic scale, as shown in figure 4.8. The CDF F(t) shows the cumulative probability as a function of repair time t, meaning that the probability of a repair time less than t is equal to the value of F(t). A steep CDF which increases in value early therefore indicates a good probability of short repair times. We can estimate the probability that we will solve a failure within a given range of time. For example: if you know there is a fault at the beginning of your work day, how large is the chance that you will solve it before you go home the same day? Assume that one work day is 8 hours of effective time, which is 28,800 seconds. You can see from figure 4.8 that the probability of solving it in time, without knowing beforehand what kind of fault it is, is about 95%.
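The probability read off figure 4.8 can be reproduced mechanically from a sorted list of repair times. The sketch below, in Java like the program used for our parsing, computes the empirical CDF on a handful of made-up repair times; the values are illustrative, not from the LANL trace.

```java
import java.util.Arrays;

public class Ecdf {
    // Empirical CDF: F(t) = fraction of observed repair times <= t.
    // Expects a sorted array.
    static double cdf(double[] sortedTimes, double t) {
        int i = Arrays.binarySearch(sortedTimes, t);
        if (i < 0) {
            i = -i - 1;                   // insertion point: first value > t
        } else {
            // t was found: step past any duplicates so we count all values <= t
            while (i < sortedTimes.length && sortedTimes[i] <= t) i++;
        }
        return (double) i / sortedTimes.length;
    }

    public static void main(String[] args) {
        // Hypothetical repair times in seconds
        double[] times = {300, 600, 1200, 3600, 7200, 14400,
                          28800, 43200, 86400, 172800};
        Arrays.sort(times);
        double workday = 28_800;          // 8 hours in seconds
        System.out.printf("P(repair <= 8 h) = %.2f%n", cdf(times, workday));
    }
}
```

With these ten invented values, 7 of 10 repairs finish within a work day, so the printed probability is 0.70.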


Figure 4.8: The empirical cumulative distribution function for all failure traces. Note the logarithmic scale.

As stated in chapter 2, Schroeder and Gibson have plotted the CDF of the total downtime for the failure trace and matched it with four different standard distributions in their study [5]. It showed that all of the data is a close match to the log-normal distribution. The data was quite a good match to the Weibull and Gamma distributions as well, but proved a poor fit to the exponential distribution. To gain a clearer understanding of what a log-normal distribution is, we have plotted the distribution of our data, displayed in figure 4.9. As seen there, the failure trace data is roughly log-normally distributed. Observe that we have limited the x-axis to only show downtimes up to 20 000 seconds; there are still occurrences of longer downtimes not visible in figure 4.9. The distribution exhibits the appearance of a long tail, suggesting that most failure occurrences have a short downtime, but as previously mentioned there are some occurrences of longer downtimes. Although these occurrences seem to be few, their extreme downtimes will probably affect the accuracy of our estimation negatively.
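For reference, fitting a log-normal distribution to a sample amounts to taking the mean and standard deviation of the logarithms of the downtimes (the maximum-likelihood estimates). A minimal sketch on invented values, not the actual trace:

```java
public class LogNormalFit {
    // Maximum-likelihood fit of a log-normal distribution: mu and sigma
    // are the mean and standard deviation of the natural log of the data.
    static double[] fit(double[] downtimes) {
        double sum = 0;
        for (double d : downtimes) sum += Math.log(d);
        double mu = sum / downtimes.length;

        double ss = 0;
        for (double d : downtimes) {
            double dev = Math.log(d) - mu;
            ss += dev * dev;
        }
        double sigma = Math.sqrt(ss / downtimes.length);
        return new double[] {mu, sigma};
    }

    public static void main(String[] args) {
        // Hypothetical downtimes in seconds
        double[] downtimes = {300, 900, 1800, 3600, 7200, 14400, 86400};
        double[] p = fit(downtimes);
        // The median of a log-normal distribution is exp(mu)
        System.out.printf("mu=%.3f sigma=%.3f median=%.0f s%n",
                p[0], p[1], Math.exp(p[0]));
    }
}
```

Because the fit works on logarithms, the occasional extreme downtime shifts the estimate far less than it would shift a plain arithmetic mean, which matches the long-tailed shape seen in figure 4.9.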


Figure 4.9: The distribution of downtimes of all failure traces.

To further investigate whether the patterns shown in section 4.1 are prominent when considering the root cause of failure, we plotted each failure type's corresponding CDF in figure 4.10.


If we consider the point where the time of repair is 10^4 seconds, we can see that for infrastructure related faults the probability that the time of repair will be less than 10^4 seconds is about 30%. Meanwhile hardware, software and network faults have a probability of about 80% at the same time of repair. Statistically, an infrastructure or I/O related fault will take longer to repair than the other faults. Human error is, over most of the range, the fault repaired the fastest. Network and software have about the same cumulative probability for all downtimes. The cumulative probability for hardware indicates that it takes longer to repair than a software, network or human error related fault for downtimes shorter than the point where the CDFs cross. The probability of a longer downtime is higher for hardware failures than for I/O failures up to a downtime of 1000 seconds, where the two functions cross. For longer downtimes, hardware failures have a higher probability of being repaired more quickly than I/O failures. This can be for several reasons; one is that I/O failures have a small number of very long downtimes, while hardware failures are many in number and short on average.

Based on the CDFs for the different types of fault we have derived the 95th percentile for each fault, displayed in table 4.6. The 95th percentile is the value of t in the CDF for which F(t) = 0.95; it represents the downtime by which 95% of the repairs are predicted to be completed. Infrastructure and I/O have the highest 95th percentiles, while hardware and software faults have the lowest. A low 95th percentile is good because it indicates that most of the failures have been repaired by this time.

Table 4.6: The 95th percentile for the different fault types

Type of fault    Downtime at 95th percentile (h)
Hardware         7.50
Human error      12.64
Infrastructure   23.42
I/O              27.45
Network          12.50
Software         8.00
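A percentile such as those in table 4.6 can be read directly from the sorted sample: the 95th percentile is the smallest observed downtime t for which the empirical CDF reaches 0.95. A small illustration with invented per-fault downtimes:

```java
import java.util.Arrays;

public class Percentile {
    // Smallest sample value t such that F(t) >= q for the empirical CDF:
    // the element at index ceil(q * n) - 1 of the sorted sample.
    static double percentile(double[] times, double q) {
        double[] s = times.clone();
        Arrays.sort(s);
        int idx = (int) Math.ceil(q * s.length) - 1;
        return s[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        // Hypothetical downtimes in hours for one fault type
        double[] downtimes = {0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 7.5};
        System.out.println("95th percentile: "
                + percentile(downtimes, 0.95) + " h");
    }
}
```

On these ten invented values the 95th percentile is simply the largest one, 7.5 hours; on a sample the size of the full trace it falls well inside the sorted data.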

5 Mathematical Model

A mathematical model to roughly estimate downtime can be made based on the five most relevant parameters:

• Type of fault, Ai

• Day of week, Bj

• Hour of day, Ck

• Memory size of node, Dl

• Number of processors at node, Em

Each parameter has a set of values which all have a calculated average downtime, T, as well as a proportion of the total number of failures, N. All of these parameter values have been derived in previous sections of the results.

Type of fault can be set to six different values: hardware, human error, infrastructure, I/O, network and software. Similarly, each of the other parameters has its own number of possible values, all of which have T and N defined.

To indicate how many values a certain parameter has we make use of indexes. Each parameter is represented by a letter, A, B, C, D or E, with the corresponding index i, j, k, l or m. The index i ranges from one to six and indicates the type of fault of parameter A. The indexes j and m range from one to seven and represent, respectively, the day of the week on which the fault occurs and the number of processors of the node in which the failure occurs. The index k indicates the hour of the day at which the failure starts and therefore ranges from one to 24. The index l ranges from one to eight and indicates the memory size of the node in which the failure occurs. For example, the parameter value C_3 (third hour of the day) would give the average downtime at this hour of the day, T(C_3), and the number of failures that occurred at this hour, N(C_3).

We have developed two different mathematical models based on these parameters to estimate downtime, both of which are very simple.

The first model makes use of the average downtime for each parameter value. All parameters are weighed equally, so the sum is divided by the number of parameters, n.

τ = (1/n) · (T(A_i) + T(B_j) + T(C_k) + T(D_l) + T(E_m))    (5.1)

The second model also makes use of the number of failures and has a specific weight for each parameter value, normalizing by the total number of failures over the parameter values.

τ = (T(A_i)·N(A_i) + T(B_j)·N(B_j) + T(C_k)·N(C_k) + T(D_l)·N(D_l) + T(E_m)·N(E_m)) / (N(A_i) + N(B_j) + N(C_k) + N(D_l) + N(E_m))    (5.2)
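Once the lookup tables of T and N values have been computed, both models reduce to a few lines of code. The sketch below uses invented T and N values for one failure's five parameter values; the normalization of the second model is assumed here to be the sum of the N values:

```java
public class DowntimeModels {
    // Model 1: unweighted mean of the average downtimes T for the
    // observed value of each of the n parameters.
    static double model1(double[] t) {
        double sum = 0;
        for (double v : t) sum += v;
        return sum / t.length;
    }

    // Model 2: mean of the T values weighted by the failure counts N
    // for the same parameter values (assumed normalization: sum of N).
    static double model2(double[] t, double[] n) {
        double num = 0, den = 0;
        for (int i = 0; i < t.length; i++) {
            num += t[i] * n[i];
            den += n[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        // Hypothetical average downtimes (seconds) and failure counts
        // for one failure's values of: fault type, weekday, hour of day,
        // memory size, processor quantity.
        double[] t = {9000, 6500, 7000, 20000, 8000};
        double[] n = {8000, 2500, 700, 486, 832};
        System.out.printf("model 1: %.0f s, model 2: %.0f s%n",
                model1(t), model2(t, n));
    }
}
```

Note how the second model pulls the estimate toward the T values of frequently failing parameter values, while rarely observed ones (here the memory-size term) contribute little.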


5.1 Test of our Mathematical Models

By testing against the randomly selected evaluation data we can evaluate the accuracy of our mathematical models. We have disregarded downtimes over 24 hours, which trims the evaluation data set from 1984 traces to 1941. This is done since we suspected a long tail in the distribution of the downtimes in the original data set, indicating that such a tail can also be present in the evaluation data. The difference in number of traces is subtle, but because of the difficulty our models have in estimating large deviations in downtime, this makes a huge difference for the accuracy. We have evaluated the models in terms of the MAPE, MAE, RMSE and the standard deviation based on the MAE.
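The three error measures are straightforward to compute. The sketch below, on invented actual and predicted downtimes, also illustrates why a model can combine a modest MAE with an enormous MAPE: a few short actual downtimes dominate the percentage error.

```java
public class ErrorMetrics {
    // Mean absolute error
    static double mae(double[] actual, double[] pred) {
        double s = 0;
        for (int i = 0; i < actual.length; i++)
            s += Math.abs(pred[i] - actual[i]);
        return s / actual.length;
    }

    // Root mean square error: weighs large errors more than the MAE does
    static double rmse(double[] actual, double[] pred) {
        double s = 0;
        for (int i = 0; i < actual.length; i++) {
            double e = pred[i] - actual[i];
            s += e * e;
        }
        return Math.sqrt(s / actual.length);
    }

    // Mean absolute percentage error: blows up on short actual downtimes
    static double mape(double[] actual, double[] pred) {
        double s = 0;
        for (int i = 0; i < actual.length; i++)
            s += Math.abs((pred[i] - actual[i]) / actual[i]);
        return 100.0 * s / actual.length;
    }

    public static void main(String[] args) {
        // Hypothetical actual vs. predicted downtimes in hours
        double[] actual = {0.1, 0.5, 2.0, 8.0};
        double[] pred   = {3.0, 3.0, 3.0, 3.0};
        System.out.printf("MAE=%.2f RMSE=%.2f MAPE=%.0f%%%n",
                mae(actual, pred), rmse(actual, pred), mape(actual, pred));
    }
}
```

In this toy case the MAE is under three hours, yet the single 0.1-hour downtime alone contributes 2900 percentage points to the MAPE sum, mirroring the pattern seen in table 5.1.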

To obtain an idea of how good each model is, we have calculated the overall average downtime of all traces in the data set used in the analysis and compared each evaluation data trace to this rough estimate. This estimate can be evaluated with the same measures, enabling us to compare our models' results to the simplest possible way of estimating downtime. Note that all results have been rounded to the nearest decimal or whole percentage.

Table 5.1: Results from the test of the mathematical models

Model                   Measure   Result
Model 1                 MAE       3.2 hours
                        RMSE      3.4 hours
                        MAPE      814 %
                        σ         6.1 hours
Model 2                 MAE       3.6 hours
                        RMSE      3.6 hours
                        MAPE      894 %
                        σ         6.4 hours
Comparison to Average   MAE       3.5 hours
                        RMSE      3.9 hours
                        MAPE      946 %
                        σ         6.3 hours

Both our models give inaccurate results, as showcased by the measures we have chosen. Even if the MAE and RMSE show quite low values, ranging from 3.2 to 3.6 hours, the MAPE suggests great inaccuracy in the estimations. With MAPEs of 814% and 894% respectively, the error of the estimated downtime relative to the actual downtime is extremely big. This suggests that we have many greatly inaccurate estimations of shorter downtimes.

The RMSE is greater than the MAE for our first mathematical model. Since the RMSE weighs large errors more heavily than small errors, this suggests that this model has a few estimations that differ greatly from the actual downtime. The second model has the same MAE and RMSE, suggesting that there are not as many greatly inaccurate errors affecting the resulting mean values.

The rough estimation of comparing to the overall average has slightly higher measures than the two models, with some exceptions. We can clearly see that Model 1 is better in all measures, which means that this model is in fact a better estimation than the average downtime. Model 2 only beats the rough average estimation in RMSE and standard deviation, which indicates that this model is not as beneficial compared to the rough estimation based solely on the average. A higher RMSE than MAE in the comparison to the average suggests that there are quite a few big errors which greatly contribute to the MAPE of the rough estimation.

6 Discussion

This chapter is devoted to further analysis of the results presented in the previous section, as well as discussion of how our method was conducted.

6.1 Results

The examination of the data set has led to many interesting observations. The parameters we chose to focus on proved to have some relevance to the downtime of a system after a failure; however, we need to analyze them further to fully understand their meaning.

Downtime in Consideration of Fault

The average downtimes for the different types of faults were quite consistent throughout, excluding faults of types infrastructure and I/O, which had much longer average downtimes than the rest of the fault categories. Failures due to hardware faults are the most frequent; they do however have the second to lowest average downtime. When studying the CDFs of the different types of faults it is clear that hardware in fact is one of the faults with the probable shortest downtime after a failure. Human error shows both the best average downtime and one of the best probabilities of low downtime based on the CDF. For the 95th percentile, hardware and software have the lowest values, followed by network and human error. According to what can be seen in the CDF, human error has the probable shortest downtime over most of the range. However, this is not the case for the longer downtimes, which is also shown in the 95th percentile. It is nonetheless hard to make a credible conclusion about the downtime of human errors in a larger context since the number of traces from human errors is considerably low. Although I/O and infrastructure faults have the worst average downtimes and CDFs, it is also hard to conclude that these are the worst faults to experience since they only constitute 2% and 4% of the failure trace respectively.

Time When a Failure Occurs

The time of a failure seems to display clearer patterns than the root cause, especially since it is a parameter which can be measured linearly. Both day of week and time of day indicate a regularity in both average downtime and number of failures repaired or reported. Saturdays and Sundays unsurprisingly have quite a high average downtime; the peak on Thursday is however deviant. Excluding Thursday, there is a trend of the downtime increasing as the week goes on. What this depends on is hard to conclude without knowing more about the administrators' routines. One big factor to take into consideration is that we have plotted the average downtime for faults that start on a specific day. All the failures in the failure trace are the ones that required the attention of a system administrator. This is probably why the weekends have higher average downtimes and a lower number of started repairs than the rest of the days, disregarding Thursday and assuming that not as many system administrators work on the weekend as during the week.


Thursdays have a higher average downtime, which could be explained by a greater workload during this day because of more repairs being started. However, the number of repairs started is consistent with the other non-weekend days. To investigate further we would need a more in-depth understanding of the system and the workers' routines and workload. We have seen that the number of repairs completed and the average duration of repair correlate when considering the hour of day. We also saw that the curve of the average number of repairs completed almost matches the curve of the average number of repairs started, although shifted in time, and that both happen to match regular working hours.

Hardware in Node Experiencing Failure

Even if we cannot find a direct correlation between memory size or number of processors and downtime, some memory sizes and processor quantities do exhibit patterns worth discussing.

That failures occur more often for a smaller memory size is not unexpected, since such a memory reaches its threshold much faster than a larger one and therefore causes errors; however, the extremely high average downtime of this memory size can arguably not be explained by this alone. The same behaviour can be noted in the nodes with the lowest quantity of processors, 2. A majority of the nodes have this processor quantity, and the number of failures for these is relatively low. The average downtime is however extremely high for nodes with this number of processors compared to the other processor quantities. Even if a lower processor quantity can explain the frequent occurrence of some failures, this average downtime is too extreme, as in the case of the smallest memory size.

There are other factors of the nodes which we have not been able to take into consideration due to the lack of insight into the structure of the system. The extreme peaks in average downtime for nodes of a certain memory size or a specific processor quantity could for example be explained by a certain workload being put on these nodes, or by their being dedicated to a certain kind of task which results in failures that take longer to repair. It is likely that the nodes with a small memory size and a low quantity of processors are not dedicated to the most important tasks of the system. Failures in these nodes would then not have a high priority and would therefore exhibit longer downtimes.

Cumulative Distributions of Downtimes

The CDFs, both over the complete collection of traces and per type of failure, are quite interesting since they all fit the log-normal distribution. A log-normal distribution describes a stochastic variable whose logarithm is normally distributed, and is very useful in describing different developments in nature. This is due to the fact that many natural processes vary logarithmically, their progression being driven by the multiplicative relation of their causing factors, as "[...] chemistry and physics are fundamental in life, and the prevailing operation in the laws of these disciplines is multiplication" [3].

Considering how the log-normal distribution is usually fitted to multiplicatively caused progressions, the good fit of our data's distribution to the log-normal distribution does suggest some kind of correlation between the downtime and a product of several parameters causing it. It would be beneficial to investigate further, to identify these parameters and what product or other combination of them relates to the downtime, in order to enable estimation of downtime.

6.2 Method

Our method has sufficiently helped us investigate the downtime of the LANL system. A big portion of the time was spent simply writing the Java program which parsed the raw data and formatted it to fit our analysis. Although this time might not be directly visible in the


result of the study, it did ease the visualization of our findings and helped us focus on the chosen parameters. Since we wrote the program from scratch, it allowed us to customize the data output to our needs as well as to select what data to trim.

The lack of insight into the LANL system affected the study negatively, since it limited further analysis of certain parameters. Although some patterns were detected, they could have been more thoroughly explained had we not been limited by this. Due to this, and the general lack of public failure traces, this study would ideally be performed in cooperation with a company as a given assignment of analyzing the availability of their system. The time constraint of this project does however prevent us from setting up this kind of contact with a relevant corporation.

For the same reason, the mathematical models were not developed further. The models did not reach the desired accuracy; they do however show us that the parameters cannot provide a foundation for estimating downtime when investigated solely by adding them with different weights. As also suggested by the analysis of the cumulative distribution function, a mathematical model based on the product of different parameters might give more accurate results.

6.3 Conclusion

Although some of the results might not be as specific as desired, conclusions can still be drawn from them. The aim of this report was to answer the following research questions:

• Which factors are relevant in estimating downtime?

• Are there any patterns or correlations between these parameters and downtime?

• Can downtime be estimated based on knowledge of different relevant parameters?

We have identified type of fault, time of failure and hardware of node as relevant parameters. Just by studying the average downtime of failures it is clear that the fault type is important, even if no evident pattern has presented itself. The different CDFs of the fault types show that the cumulative probability of downtime does differ. The time when a failure occurs clearly relates to the resulting time of repair, since the system has a shorter average downtime during typical work hours and workdays, as well as more frequently initiated repairs. More information about system repair routines and work hours would however provide further clarity about this correlation. The hardware of the node also proves relevant, although the actual correlation is hard to identify. We have shown how different memory sizes and different processor quantities differ in average downtime, but to investigate this correlation more information about the system is required.

Even if we cannot explain some of these patterns for certain, they do function as parameters in our mathematical models. Our findings indicate that downtime can be estimated based on the parameters, but do not provide enough material to model a perfectly accurate estimation. We have developed two simple models that roughly estimate the downtime of the system. To create better models based on ours, more time should be spent on the weighting of the parameters with respect to which of them is of the greatest importance. In our mathematical models we have studied how the parameters affect downtime separately, since each model is a sum over the different parameters; the good fit of the log-normal distribution however suggests that the downtime is based on a product of parameters.

In conclusion, we have answered our research questions and constructed two mathematical models with a large MAPE but quite good MAE. They are not as accurate as desired and not suitable for their intended use, but they do show that it is possible to estimate downtime based on relevant parameters. The inaccessibility of failure data is an obstacle in the research of repair time, but worth overcoming because of the possibilities of estimating, quantified in time, the consequence a failure has for a system.


Bibliography

[1] Tianfeng Chai and Roland Draxler. "Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature". In: Geoscientific Model Development 7.3 (2014), pp. 1247–1250. DOI: 10.5194/gmd-7-1247-2014.

[2] Bahman Javadi et al. "The Failure Trace Archive: Enabling the Comparison of Failure Measurements and Models of Distributed Systems". In: Journal of Parallel and Distributed Computing (2013).

[3] Eckhard Limpert, Werner A. Stahel, and Markus Abbt. "Log-normal Distributions across the Sciences: Keys and Clues". In: BioScience 51.5 (2001), pp. 341–352. DOI: 10.1641/0006-3568(2001)051[0341:LNDATS]2.0.CO;2.

[4] Arnaud de Myttenaere et al. "Mean Absolute Percentage Error for regression models". In: Neurocomputing 192 (2016). Advances in artificial neural networks, machine learning and computational intelligence. Selected papers from the 23rd European Symposium on Artificial Neural Networks (ESANN 2015), pp. 38–48. ISSN: 0925-2312. DOI: 10.1016/j.neucom.2015.12.114.

[5] Bianca Schroeder and Garth Gibson. "A Large-Scale Study of Failures in High-Performance Computing Systems". In: IEEE Transactions on Dependable and Secure Computing 7.4 (Oct. 2010), pp. 337–350. ISSN: 1545-5971. DOI: 10.1109/TDSC.2009.4.

[6] Bianca Schroeder and Garth Gibson. "The Computer Failure Data Repository (CFDR)". In: USENIX The Advanced Computing Systems Association (2016).

[7] Nezih Yigitbasi et al. "Understanding Time-Varying Behavior of Failures in Large-Scale Distributed Systems". In: Proceedings of the Sixteenth Annual Conference of the Advanced School for Computing and Imaging. ASCI 2010. Veldhoven, the Netherlands, pp. 1–8.


Linköping University Electronic Press


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.
