Hadoop Read Performance During Datanode Crashes


Linköpings universitet | Institutionen för datavetenskap

Bachelor's thesis, 16 hp | Information Technology

2016 | LIU-IDA/LITH-EX-G--16/056--SE

Hadoops läsprestanda vid datanodkrascher

Hadoop Read Performance During Datanode Crashes

Fabian Johannsen

Mattias Hellsing

Supervisor: Mikael Asplund    Examiner: Nahid Shahmehri

Upphovsrätt (Copyright)

This document is made available on the Internet – or its possible future replacement – for a period of 25 years from the date of publication barring exceptional circumstances. Access to the document implies permission for anyone to read, to download, to print out single copies for personal use and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security and accessibility there are solutions of a technical and administrative nature. The author's moral rights include the right to be mentioned as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in such a form or in such a context as is offensive to the author's literary or artistic reputation or distinctiveness. For additional information about Linköping University Electronic Press see the publisher's home page: http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Fabian Johannsen, Mattias Hellsing


Abstract

This bachelor thesis evaluates the impact of datanode crashes on the performance of the read operation of the Hadoop Distributed File System, HDFS. The goal is to better understand how datanode crashes, as well as certain parameters, affect the performance of the read operation by looking at the execution time of the get command. The parameters used are the number of crashed nodes, block size and file size. Data was collected by setting up a Linux test environment with ten virtual machines running Hadoop and performing tests on it. From this data the average execution time and standard deviation of the get command were calculated. The network activity during the tests was also measured. The results showed that neither the number of crashed nodes nor the block size had any significant effect on the execution time. They also showed that the execution time of the get command was not directly proportional to the size of the fetched file: a four times larger file sometimes resulted in an execution time up to 4.5 times as long. The consequences of a datanode crash while fetching a small file, however, appear to be much greater than with a large file. The average execution time increased by up to 36% when a large file was fetched, but by as much as 85% when fetching a small file.

Contents

Abstract
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Purpose
  1.2 Problem statement
  1.3 Limitations
2 Background
  2.1 HDFS
  2.2 Tools used
  2.3 Related work
3 Method
  3.1 Experimental environment
  3.2 Measurements
  3.3 Network activity
  3.4 Files used
  3.5 System state
  3.6 Test execution
4 Results
  4.1 Impact of datanode crashes
  4.2 File size
  4.3 Number of crashed nodes
  4.4 Block size
5 Discussion
  5.1 Limitations
  5.2 Results
  5.3 Method
  5.4 Future work
Bibliography

List of Figures

2.1 HDFS
3.1 Loading the saved system state
4.1 Average execution time of test configurations 1 to 8
4.2 The rate of packets arriving at the client during tests with configuration 1
4.3 Comparison of the average execution time with configurations 5 and 6
4.4 Comparison of the average execution time with configurations 3 and 4
4.5 The rate of packets arriving at the client during crash-tests with configurations 3 and 4
4.6 Comparison of the average execution time of no-crash tests with configurations 1 and 2
4.7 Comparison of the average execution time of no-crash tests with configurations 7 and 8
4.8 Comparison of the average execution time with configurations 1 and 3
4.9 Comparison of the average execution time with configurations 6 and 8

List of Tables

3.1 Test configurations

1 Introduction

Hardware failures are inevitable and sooner or later some part of your system will go down. It is important for any system to be prepared for the event of data loss due to hardware failure, and it is therefore a necessity to have some sort of backup. In a distributed storage system the data is stored on a number of different nodes. Without a backup for your data, the service you provide will not be reliable and therefore probably not desirable. [7]

Having a well thought out backup plan for your data is not the only challenge with distributed data storage. The need for processing and storing data for any company has increased rapidly in recent years. One of the key challenges for any company that deals with huge amounts of data is to create cost-efficient storage and processing systems that can handle all the data in an efficient and reliable way. These companies often use some kind of distributed storage system to cope with this challenge [8].

One of the most popular distributed storage systems right now is Hadoop. It is a widely used open source software framework for distributed storage and distributed processing. It serves to support data-intensive distributed applications with the ability to store and process massive amounts of data [7]. Hadoop consists of four major parts: Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce. A wide variety of companies, such as Amazon, Adobe and eBay, use Hadoop and HDFS. Companies and organizations like these process massive amounts of data every day. Because of this they are highly dependent on reliable software, and in the event of node crashes, for example as the result of a severe hardware failure, it is important that the servers can maintain the quality of service.

We have in this thesis chosen to examine HDFS, which is a distributed file system that provides reliable and scalable data storage designed for commodity hardware. Hadoop and HDFS can scale up from a single server to large clusters of servers, where each server has a set of inexpensive internal disk drives. Hadoop does not rely on the hardware to deliver high reliability, which means that the system itself can detect and handle, for example, server crashes. The main goal of Hadoop and HDFS is to deliver a highly reliable service, with the presumption that hardware failure is inevitable [6].

1.1 Purpose

Replicating data and distributing it across datanodes is a good way of keeping the data intact in case of a datanode crash. The purpose of this thesis is to investigate how well Hadoop and HDFS deal with these crashes. We consider a node to be crashed when it is not available to the master node.

1.2 Problem statement

These are the two major problems we will investigate:

1. How is the performance of HDFS read, with respect to execution time, affected when a datanode becomes unavailable but is not yet confirmed dead?

2. What different parameters will affect the performance of HDFS read with respect to execution time?

We study these questions by running experiments on the Hadoop framework. This is done by configuring a Hadoop cluster with ten nodes and performing tests where each test has a different set of test parameters. This thesis mainly focuses on the impact of these parameters on the performance of HDFS read, measured through the execution time of the get command. The parameters used are block size, number of crashed nodes and size of the fetched file. We chose to look at the file size because it would be interesting to see whether the execution time of the get command is directly proportional to the size of the fetched file. We perform tests with different block sizes because the block size is one of the most commonly tuned parameters in HDFS. How to set the block size for best performance depends mainly on how big the files in the system are [9]. Since different file sizes will be used it is therefore interesting to use different block sizes as well. By crashing different numbers of nodes we want to get a better idea of how node crashes actually affect HDFS; for example, we would be able to see if a potential increase in the execution time of the get command is directly proportional to the number of crashed nodes. The main metrics are the average execution time and the standard deviation for every test configuration. We also measure the network activity while the tests are running. This is done to see if there are any correlations between the rate at which packets arrive at the client and the changed test parameters.

1.3 Limitations

We are limited to testing on a relatively small scale as we only have access to ten nodes. We are also limited to testing in a virtual environment as these nodes are virtual machines running on the same physical machine.

2 Background

This chapter serves to provide a sufficient explanation of how HDFS works and of its main components in order to understand the remainder of this report. It also provides an explanation of the terms used and of the tools used for conducting the experiments.

2.1 HDFS

This section is based on the Hadoop Architecture Guide [2]. HDFS is typically set up on a cluster of several servers, where the cluster consists of one namenode and a number of datanodes; there is usually one datanode per server. HDFS has a master/slave architecture, which means that one server in the cluster, the namenode, has unidirectional control over the other servers, the datanodes. A brief visual presentation of the HDFS read and write operations is given in figure 2.1. The client to the left of the figure shows a write operation, and the client to the right shows a read operation.


Figure 2.1: HDFS

Namenode

The namenode, represented by the green box in figure 2.1, is responsible for keeping track of the namespace of the distributed file system, the location of all the data blocks and their status, as well as the health of the datanodes. It uses this information to serve the clients when they request and insert files into the file system.

Datanode

The blue boxes in figure 2.1 represent the datanodes. A datanode is a node which serves to store the data blocks. In a cluster there is usually one namenode and many datanodes. The stored data blocks are replicated on a number of datanodes.

HDFS write

Every file that is inserted into HDFS, for example by running the put command, will be split into data blocks, represented as the black squares in figure 2.1, by the client. Each data block is the same size, except for the last block of a file, which may be smaller depending on how many bytes remain to be written. For each data block the client will consult the namenode, which will provide a list of metadata containing information about which datanodes to put the blocks on. The list also contains information about the replication factor, that is, how many times each block should be replicated in the file system. The data blocks are then distributed to each of the determined datanodes.

HDFS read

When a client requests a file, for example by running the get command, it does so by asking the namenode which datanodes hold the data blocks that make up the file. For each requested data block the namenode will return a list of all datanodes that contain that data block. The client will then proceed to request these data blocks from the datanodes. It will do so one datanode at a time until all blocks are collected, and will then reassemble the file.
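In practice this whole sequence is hidden behind a single client command. As a hedged illustration, the following Python sketch runs the get command described above (in its standard "hdfs dfs -get" CLI form) and reports the client-side elapsed time; the HDFS and local paths are placeholders of our own, not taken from the thesis.

    # Illustrative sketch: fetch a file with the HDFS get command and time it.
    # The paths below are placeholders.
    import subprocess
    import time

    def hdfs_get(hdfs_path: str, local_path: str) -> float:
        """Run 'hdfs dfs -get' and return the elapsed wall-clock time in seconds."""
        start = time.monotonic()
        subprocess.run(["hdfs", "dfs", "-get", hdfs_path, local_path], check=True)
        return time.monotonic() - start

    if __name__ == "__main__":
        elapsed = hdfs_get("/user/test/small_file.txt", "./small_file.txt")
        print(f"get finished in {elapsed:.2f} s")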

Heartbeat

Every datanode in the cluster periodically sends a heartbeat to the namenode. This is a message which implies that the datanode is alive and well. A heartbeat is sent every three seconds. Every 200 seconds the namenode checks whether a datanode has failed to send a heartbeat for 600 seconds [1]. If it has not sent a heartbeat for 600 seconds the datanode is declared dead and will no longer be addressed by the namenode.
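To make these timings concrete, the sketch below simulates the bookkeeping described above (a heartbeat every three seconds, a liveness check every 200 seconds, a 600-second threshold). It is purely illustrative: the constants follow the text, but the structure and names are ours, not Hadoop's implementation. It shows why a crashed datanode can remain "alive" in the namenode's view for several hundred seconds, which is the situation studied in this thesis.

    # Illustrative model of the heartbeat timing described above (not Hadoop code).
    HEARTBEAT_INTERVAL = 3    # seconds between heartbeats from a live datanode
    CHECK_INTERVAL = 200      # seconds between the namenode's liveness checks
    DEAD_THRESHOLD = 600      # seconds of silence before a node is declared dead

    def detection_time(crash_time: float) -> float:
        """Return the simulated time at which the namenode declares the node dead."""
        last_heartbeat = (crash_time // HEARTBEAT_INTERVAL) * HEARTBEAT_INTERVAL
        check = 0.0
        while True:
            check += CHECK_INTERVAL
            if check - last_heartbeat >= DEAD_THRESHOLD:
                return check

    if __name__ == "__main__":
        for crash in (10.0, 150.0, 401.0):
            dead = detection_time(crash)
            print(f"crash at {crash:6.1f} s -> declared dead at {dead:6.1f} s "
                  f"({dead - crash:.1f} s after the crash)")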

Datanode crash

In this report we define a datanode crash as a crash failure where the datanode becomes unreachable to the system for the duration of the test. This means that the datanode has stopped sending heartbeats to the namenode.

2.2 Tools used

Here we present the tools used for conducting the experiments.

• GNU time. This is a bash command which runs a specified command and displays information about the resources used by the program, such as the time elapsed since execution (http://www.gnu.org/software/time/).

• Tcpdump. This is a bash command which tracks network activity and outputs a description of the content of packets sent or received, on the screen or to a file (http://www.tcpdump.org/tcpdump_man.html).

2.3 Related work

There are many published articles on this subject, and in this section we present a selection of them in order to give an idea of what has been done prior to this thesis.

In existing literature it has been shown that Hadoop does not handle compute node crashes very well [1]. A compute node is a datanode with the addition of a tasktracker. It has been shown that the execution time of a job in Hadoop is significantly delayed when a compute node crashes during runtime. It is worth noting that a job in Hadoop is a MapReduce task and does not equal a read or write operation in HDFS. In our thesis we instead focus on HDFS to see if it also has trouble dealing with datanode crashes.

It is well known that Hadoop handles large files more efficiently than small files. When Hadoop stores a large number of small files the result is a distinct performance degradation [5]. One situation that causes issues for HDFS is when the files are much smaller than the block size. As each file still has to be put into blocks, the metadata describing the location of all of the blocks will grow rapidly [2]. Our thesis aims to examine to what extent the performance is degraded when HDFS is handling small files.

Research has been done investigating the efficiency of Hadoop and HDFS [6]. It has been shown that Hadoop as a whole has a low efficiency when it comes to utilizing the disk and processor. However, it was concluded that this was not because of HDFS but rather because of the way applications are scheduled in Hadoop. In the same article the authors also show that the performance of the write operation in HDFS degrades when concurrent read or write operations are added.

A large number of enhancements for Hadoop have been proposed and there are plenty of extensions available. A sample of these is presented in an article by A. Kala Karun and K. Chitharanjan [4]. One example is CoHadoop, which provides a solution for co-locating related files in HDFS. Another example is the Elastic Replica Management System for HDFS, ERMS, which enables flexible replication factors for different kinds of data. It divides data into hot data and cold data. Hot data is data that is accessed often, which ERMS claims requires a higher replication factor in order to distribute the workload better. Cold data is data that is accessed less frequently, which ERMS claims allows for a lower replication factor and thereby reduces overhead.

3 Method

We will set up and configure a Hadoop environment with ten nodes and perform tests on the HDFS. The metric we are measuring is the execution time of the get command with different configurations of our parameters. The parameters we will be using are block size, number of crashed nodes and size of the requested file. These parameters will be set up in eight different configurations, one for each scenario we will be testing. Every crash test is performed while the crashed datanode is unavailable but not yet confirmed dead. The test configurations are presented in table 3.1. The reason for using these file sizes, 486MB and 1840MB, is that we have a limited amount of space on each virtual machine and we want to have at least a few files in the system when running our tests. The file sizes used are fairly small in this context [3] but, as mentioned, we have limited space on our machines. The reason for the files not being larger is that the runtime of the tests would otherwise be too long: a test using the larger file size, 1840MB, took up to 36 hours during pretesting, and using any larger file sizes would push this thesis beyond its deadline. The number of nodes we can crash is limited by the system itself. As we use a replication factor of three, which is the default value, we cannot crash more than two nodes at a time. Crashing more nodes could make all three replicas of a certain block unavailable at the same time, if these replicas happen to be stored on the crashed nodes. This would cause the get command to fail and therefore we would not be able to get any results from the tests. Finally, the block sizes of 64MB and 128MB have been chosen because both have been the default value in HDFS: 64MB was the default in previous versions of Hadoop and 128MB is the default in the current version, 2.7.2.

3.1 Experimental environment

The evaluation was carried out using ten computing nodes, where one node acts as the namenode and the remaining nine are datanodes. All nodes are virtual machines on the same physical machine. The virtual machines are the only nodes running on the physical machine, and when the nodes are on standby there is close to zero load on the physical machine. The physical machine uses two Intel(R) Xeon(R) E5-2620 2.00GHz CPUs and has 24GB of installed RAM. All machines run the Linux operating system and the installed Hadoop version is 2.7.2.

Table 3.1: Test configurations

Test configuration | Block size (MB) | Number of crashed nodes | Size of fetched file (MB)
-------------------|-----------------|-------------------------|--------------------------
1                  | 64              | 1                       | 486
2                  | 64              | 1                       | 1840
3                  | 64              | 2                       | 486
4                  | 64              | 2                       | 1840
5                  | 128             | 1                       | 486
6                  | 128             | 1                       | 1840
7                  | 128             | 2                       | 486
8                  | 128             | 2                       | 1840
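The eight configurations in table 3.1 are simply the Cartesian product of the three parameter values. A minimal Python sketch that reproduces the table (variable names are ours):

    # Reproduce the eight test configurations of table 3.1 as the Cartesian
    # product of block size, number of crashed nodes and file size.
    from itertools import product

    BLOCK_SIZES_MB = (64, 128)
    CRASHED_NODES = (1, 2)
    FILE_SIZES_MB = (486, 1840)

    configurations = [
        {"block_size_mb": b, "crashed_nodes": n, "file_size_mb": f}
        for b, n, f in product(BLOCK_SIZES_MB, CRASHED_NODES, FILE_SIZES_MB)
    ]

    for i, cfg in enumerate(configurations, start=1):
        print(i, cfg)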

3.2 Measurements

In order to measure the time it takes for the get command to execute we will use GNU time, which returns the elapsed time of the execution. This is the real time elapsed between running the command and it finishing. We are conducting 200 tests with each configuration of the parameters and for each test we write the elapsed time to a log file. 100 of these tests will be crash tests and the other 100 will be no-crash tests. The first test of every no-crash test suite will be removed from the result. This is because the first run has an abnormally long execution time due to the system needing a few seconds to recover after a restart, which was discovered during pre-testing of the system. It is therefore removed to prevent it from distorting the result. We then calculate the average elapsed time of the get command using equation 3.1:

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (3.1)

where N is the number of requests and x_i is the response time of each request. We also calculate the standard deviation σ using equation 3.2:

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \qquad (3.2)

where N is the number of requests and x_i is the response time of each request.
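As an illustration of equations 3.1 and 3.2, the following Python sketch computes both metrics from a log file containing one elapsed time (in seconds) per line; the log file name and format are assumptions of ours, not the thesis's actual scripts. Note that equation 3.2 is the population standard deviation (dividing by N), which corresponds to statistics.pstdev rather than statistics.stdev.

    # Compute the average execution time (eq. 3.1) and the population standard
    # deviation (eq. 3.2) from a log file with one elapsed time per line.
    # The file name and format are assumptions made for illustration.
    import statistics

    def summarize(log_path: str) -> tuple[float, float]:
        with open(log_path) as f:
            times = [float(line) for line in f if line.strip()]
        mu = statistics.fmean(times)      # eq. 3.1: (1/N) * sum(x_i)
        sigma = statistics.pstdev(times)  # eq. 3.2: sqrt((1/N) * sum((x_i - mu)^2))
        return mu, sigma

    if __name__ == "__main__":
        mu, sigma = summarize("get_times_config1_crash.log")
        print(f"average execution time: {mu:.2f} s, standard deviation: {sigma:.2f} s")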

3.3 Network activity

To get a better understanding of the actions performed internally by HDFS we also look at the network activity. We do this by running a tcpdump in parallel with the tests. We run the tcpdump with the option -n src port 50010, which means we only capture packets arriving at the client from the datanodes, since port 50010 is the port used by the datanodes to send data to the clients. This gives us a good idea of how much of the execution time of the get command is used for file transfer, as well as at what rate and when this transfer takes place.

Six different tcpdumps are selected from each test configuration. Three of these are tcpdumps from no-crash tests and three are from crash tests. The tcpdumps that are selected are those that have an execution time close to the average execution time of the specific test configuration. This is in order to get tcpdumps that are as representative of the test configuration as possible.
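Below is a sketch of the kind of per-second binning described here. It assumes the capture was written as text with epoch timestamps, for example with tcpdump -tt -n src port 50010 > dump.txt; that exact pipeline, and the file name, are our assumptions rather than the thesis's processing scripts.

    # Bin tcpdump text output into packets per second, as described above.
    # Assumes each line starts with an epoch timestamp (tcpdump -tt ...).
    from collections import Counter

    def packets_per_second(dump_path: str) -> dict[int, int]:
        counts: Counter[int] = Counter()
        with open(dump_path) as f:
            for line in f:
                try:
                    ts = float(line.split()[0])  # epoch timestamp at line start
                except (ValueError, IndexError):
                    continue                     # skip continuation or garbled lines
                counts[int(ts)] += 1
        return dict(sorted(counts.items()))

    if __name__ == "__main__":
        rate = packets_per_second("dump.txt")
        start = min(rate)
        for second, n in rate.items():
            print(f"t = {second - start:4d} s: {n} packets")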


3.4 Files used

In order to produce the files we need to fill HDFS, we run a Java program that writes a sentence on each row for a certain number of rows. We use two different file sizes: a small file which is roughly 486MB and a large file which is about 1840MB. Both are created using the same program; the only difference is that the large file contains four times as many rows as the small file.
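The thesis used a Java program for this step. Purely to illustrate the idea, the Python sketch below writes a fixed sentence per row until a target size is reached; the sentence, the file names and the exact row counts are placeholders.

    # Illustrative stand-in for the file generator described above (the thesis
    # used a Java program). Writes one fixed sentence per row; the large file
    # simply contains four times as many rows as the small file.
    SENTENCE = "This is a row of test data for HDFS.\n"  # placeholder sentence

    def write_rows(path: str, rows: int) -> None:
        with open(path, "w") as f:
            for _ in range(rows):
                f.write(SENTENCE)

    if __name__ == "__main__":
        target_small_mb = 486
        rows_small = (target_small_mb * 1024 * 1024) // len(SENTENCE)
        write_rows("small_file.txt", rows_small)      # roughly 486 MB
        write_rows("large_file.txt", rows_small * 4)  # four times as many rows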

3.5 System state

For each set of test configurations that use the same block size we begin by saving the state of the file system. This is done by running a script that copies the contents of the dfs directory inside the hadoop directory and stores it in another directory we chose to call savedstate. The dfs directory contains the data on the datanodes and the namespace on the namenode. The reason for this is to allow us to recreate the system as it was at a certain point. This cannot be done by simply resetting the system and re-inserting the files, as the system will distribute the data blocks differently each time a file is inserted. To recreate the saved state we run a script that formats all of the nodes by using the built-in formatting command. The script then copies the contents of the savedstate directory to the dfs directory on each node. This is illustrated in figure 3.1. At step 1 savedstate contains the original set of files while dfs contains a modified state where files have been altered. In step 2 the datanode has been formatted, causing the dfs folder to be emptied. In step 3 the files in savedstate are copied to dfs in order to restore the system to its previous state. The same files are used for each configuration, but the state will be different for those test configurations that use different block sizes. The state changes since the files have to be reinserted when changing block size, as the files must be split up into blocks of a different size. The files that are inserted are the same for both block size setups: four big files and one small file.

Figure 3.1: Loading the saved system state
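A minimal sketch of the save and restore steps illustrated in figure 3.1, for a single node. The thesis did this with shell scripts run on every node; the install path and the single-node scope are assumptions made for illustration.

    # Sketch of the save/restore steps from figure 3.1 for a single node.
    # Directory names follow the text (dfs inside the hadoop directory,
    # savedstate as the copy); the install location is an assumption.
    import shutil
    from pathlib import Path

    HADOOP_DIR = Path("/opt/hadoop")  # assumed install location
    DFS_DIR = HADOOP_DIR / "dfs"
    SAVED_STATE_DIR = HADOOP_DIR / "savedstate"

    def save_state() -> None:
        """Step 1: copy the current dfs contents to savedstate."""
        if SAVED_STATE_DIR.exists():
            shutil.rmtree(SAVED_STATE_DIR)
        shutil.copytree(DFS_DIR, SAVED_STATE_DIR)

    def restore_state() -> None:
        """Steps 2 and 3: empty dfs (the thesis formats the nodes) and copy savedstate back."""
        if DFS_DIR.exists():
            shutil.rmtree(DFS_DIR)
        shutil.copytree(SAVED_STATE_DIR, DFS_DIR)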

3.6 Test execution

To simulate the conditions of one or two datanode crashes we run a script that enters one or two datanodes, depending on the test configuration, and turns them off. We then run the get command straight away, meaning HDFS does not have time to mark the shut-down datanodes as stale or dead. If the system had time to do this it would ignore the datanode [2], which would probably mean a low or nonexistent impact on performance. The get command is executed on a Hadoop client that is located on the same

4 Results

This chapter provides the results from the conducted experiments. The results are presented with respect to datanode crashes, file size, number of crashed nodes and finally block size. The data from the tests is presented in bar and line graphs.

4.1 Impact of datanode crashes

Figure 4.1 shows the increase in average execution time for all of the test configurations when nodes are crashed. Each group of bars represents one test configuration. The standard deviation can also be seen at the top of each bar. The large difference in size between the bars in the graph comes from the different file sizes used, which can be seen in table 3.1; since the large file is four times larger than the small file, the execution time is expected to be roughly four times as long. Test configurations 1, 3, 5 and 7 had the highest increases in average execution time: 69% for configuration 1, 70% for configuration 3, 85% for configuration 5 and 84% for configuration 7. All of these configurations used the small file.

The lowest increases in average execution time were found in test configurations 2, 4, 6 and 8. Figure 4.1 shows that test configurations 2 and 8 both had the same increase when a node was crashed, namely 36%. For test configuration 4 there was an increase of 48%, and for configuration 6 an increase of 32%. All of these configurations used the large file.


Figure 4.1: Average execution time of test configurations 1 to 8.

A potential cause of these increases can be seen in figure 4.2, which shows the network activity during tests with configuration 1. In these figures the y-axis represents the throughput from the datanodes to the client. The slimmer lines represent the three different tcpdumps that we have picked out and the thick line is the average of these three. All packets received during each second have been summed up, so the graph has a precision of one second. The data has also been smoothed in order to make it easier to spot trends in the throughput. The throughput is on average lower in the crash case than in the no-crash case. This means that it takes longer for all of the data blocks to arrive at the client in the crash case. The same observation could be made for all test configurations.

(a) Small file (b) Small file

Figure 4.2: The rate of packets arriving at the client during tests with configuration 1.

4.2 File size

The comparisons made in this section are between configurations with the same block size and number of crashed nodes, but different size of fetched file.

Figure 4.3 shows that configuration 5, which had a file size of 486MB, had an increase in average execution time of 85% when a node was crashed, while configuration 6, with a file size of 1840MB, had an increase of 32% in the same case. Both of these configurations had a block size of 128MB and one crashed node. This was the biggest difference in average execution time when comparing two configurations.

Figure 4.3: Comparison of the average execution time with configurations 5 and 6.

In figure 4.4 we see that test configuration 3, with a file size of 486MB, also had a significantly higher increase in average execution time during crash tests. The increase was 70%, compared to configuration 4, with a file size of 1840MB, which had an increase of 48%. Both of these configurations had a block size of 64MB and two crashed nodes. Although the increase in average execution time was much higher for the smaller file in this comparison as well, it was the smallest difference that was observed.


Figure 4.4: Comparison of the average execution time with configurations 3 and 4.

The throughput from the datanodes to the client during these tests did not give any clear indication of what could cause this difference. As can be seen in figure 4.5, the average throughput is fairly similar between configurations 3 and 4 during crash tests. It is important to note that, due to the different ranges of the x-axis, the characteristics of the left graph are to be compared to the first thirty seconds of the right graph. The same observations were made for all comparisons between test configurations where the only difference was the size of the fetched file.

(a) Small file (b) Large file

Figure 4.5: The rate of packets arriving at the client during crash-tests with configurations 3 and 4.

Another noteworthy observation is the relative increase in execution time when comparing two configurations with the same parameters except for the size of the fetched file. When comparing test configurations 1 and 2, figure 4.6, we see a close to four times longer execution time with the large file than with the small file. This makes sense as the large file is four times as large as the small file. When comparing configurations 7 and 8, however, the difference is more significant, figure 4.7. The average execution time with the large file in this case is close to 4.5 times longer than with the small file. The difference between configurations 1 and 2 and configurations 7 and 8 is the block size, which is 64MB in the first case and 128MB in the second.

Figure 4.6: Comparison of the average execution time of no-crash tests with configurations 1 and 2.

Figure 4.7: Comparison of the average execution time of no-crash tests with configurations 7 and 8.

4.3 Number of crashed nodes

The impact of increasing the number of crashed nodes from one to two was hard to determine. In one case, when comparing configurations 1 and 3, the average execution time was lower when crashing two nodes than when crashing one, as can be seen in figure 4.8. In another case, when comparing configurations 6 and 8, the result was the opposite, figure 4.9. We could not find any obvious cause for this, and there were no significant differences in the network activity either.


Figure 4.9: Comparison of the average execution time with configurations 6 and 8.

4.4 Block size

When comparing configurations with the same parameters except for the block size, there was little or no difference in average execution time. This can be seen in figure 4.10, and all of the other comparison graphs look much the same. When looking at the network activity we could not find any distinct differences either.

5 Discussion

To conclude this thesis we will begin by answering the problem statement described in section 1.2.

1. How is the performance of HDFS read, with respect to execution time, affected when a datanode becomes unavailable but is not yet confirmed dead?

We can conclude that it has a fairly large effect on the execution time. The execution time increased by between 32% and as much as 85% when a datanode became unavailable while a file was being fetched from the file system.

2. What different parameters will affect the performance of HDFS read with respect to execution time?

Based on the results of our study we can conclude that the number of crashed nodes does not seem to have any significant effect on the execution time; our results are inconclusive on this matter. It would seem as if other factors play a larger role in this case. For example, we could see a shorter execution time when crashing two nodes instead of one when using a block size of 64MB. The block size on its own appears to have no clear effect on the execution time. There was a small difference in execution time between configurations 4 and 8, although it was not large enough to be significant. When it comes to the file size we could conclude that HDFS does not seem to do a better job handling large files than small ones. A four times larger file did sometimes result in a more than four times longer execution time: a factor of four was expected since the file is four times larger, but in some cases the execution time was as much as 4.5 times longer. We could, however, also see much greater consequences when running crash tests with small files than with large files. The execution time increased by as much as 85% when a crash occurred while fetching a small file. By measuring the network activity during our tests we could also conclude that what appears to cause this increase in execution time is a drop in throughput. In some cases the throughput dropped by as much as 500 packets per second, comparing figures 4.2(a) and 4.2(b).

5.1 Limitations

Hadoop and HDFS are designed for systems where storing and processing a huge amount of data is crucial. This thesis was limited to testing on only ten nodes, each node capable of storing just under 8 gigabytes. Since Hadoop and HDFS are capable of storing petabytes of data in clusters with thousands of servers, the tests conducted in this thesis did not stretch the Hadoop framework to its full potential, and the framework was not used for what it was originally designed for.

As presented earlier, the ten nodes which were used in the experiments were all virtual machines on the same physical machine. In a normal Hadoop configuration the namenode and the datanodes are usually on different machines and sometimes in different geographical locations. This means that in this thesis all potential network delays between machines were neglected. One could say that the experiments were conducted in an ideal environment where the execution time depended only on the Hadoop framework and not on other external causes. On the other hand, network delays are in some sense a part of any Hadoop setup, because companies which use Hadoop will always have to take network delays into consideration. Testing Hadoop and HDFS on a real network with nodes distributed across different geographical locations might have given us a more useful, but also more unpredictable, result.

It is hard to predict how the results would differ if we had had more resources and more time at our disposal. We managed to do 200 tests for every test configuration, where 100 of these tests were without any node crash and 100 were with a node crash. The time to run a test configuration varied from 24 hours to 36 hours. Since we had eight test configurations to run, the time it took to conduct all experiments was a proportionally large part of this thesis. Ideally we would have collected a larger set of test samples to increase the statistical significance, but since time was a delimitation in this thesis, we believed that around 200 tests for each test configuration would be enough to get an interesting result.

5.2 Results

The most notable thing in our results was the large increase in execution time in the event of a node crash. The system coped better when fetching a large file, but when fetching a small file the execution time almost doubled in some cases when a node crash occurred. As we simply did not have the time to examine the source code of HDFS during this project, we could not draw any conclusions about what causes this. We could, however, see from our tcpdumps that the transfer rate during crash tests was fairly constant, albeit lower than during no-crash tests. The most unexpected result from our tests was the execution time of the tests where two nodes were crashed. Sometimes those tests had a slightly longer execution time than the one-node-crash tests and sometimes a shorter one. The common denominator seems to be the 64MB block size: this was the only constant parameter amongst those tests where the two-node-crash tests had a shorter execution time than the one-node-crash tests. Another peculiar thing was a drop in the network activity around twenty-five to thirty seconds into the tests. This was present in configurations 1, 2 and 5. There does not appear to be any common denominator between these configurations and therefore we could not draw any conclusions from this.

5.3 Method

GNU time is not the most precise tool for measuring the execution time of a command, but we would argue that it was precise enough for the measurements we did. Even if the tool had an inaccuracy of up to a second it would not affect our conclusions from the tests. We also took this inaccuracy into consideration when analyzing our results.

The state of the system was something we had to devise our own way of restoring. After locating the files specifying the namespace and the state of the data blocks in the system, we found that copying these should be sufficient to copy the state of the system. We did not, however, know for sure whether this was enough to reproduce the state entirely. Judging by our results it seems to have been enough, as our results from each iteration of each test configuration were very similar. Something that we would have preferred to do differently is preserving the same state for every test configuration. This was impossible because of the block size change between test configurations 1 to 4 and 5 to 8: all the blocks in the saved state for configurations 1 to 4 were 64MB, while configurations 5 to 8 required a block size of 128MB.

The fact that we only captured packets sent from the datanodes to the client when running our tcpdumps does mean that we missed out on certain data from the tests. Capturing both directions could perhaps have given a better picture of the communication between the client and the datanodes. After running all the tests, however, we found that processing additional data would have pushed this project beyond our final deadlines.

5.4 Future work

When and why you would use the Hadoop get command differs from system to system. A node crash can have different implications for a system depending on what it is designed to do, but as our results showed, execution time is a factor that will be affected. Hardware failure will occur if your system consists of a large number of servers, and as we concluded in this thesis, the execution time of the get command will then be greatly increased. Hadoop is a time consuming framework when it comes to storing and processing data, and because of that it is not suitable for applications where fast performance is crucial. Therefore, when configuring a Hadoop cluster, one should be aware that execution time should not be the most important aspect of one's system.


Bibliography

[1] F. Dinu and T. S. Eugene Ng. “Understanding the effects and implications of compute node related failures in Hadoop”. In: Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (2014), pp. 187–198. ISBN: 978-1-4503-0805-2. DOI: 10.1145/2287076.2287108.

[2] Apache Software Foundation. Hadoop Architecture Guide. 2016. URL: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html (visited on 05/13/2016).

[3] Shadi Ibrahim, Hai Jin, Lu Lu, Li Qi, Song Wu, and Xuanhua Shi. “Evaluating MapReduce on Virtual Machines: The Hadoop Case”. In: Cloud Computing: First International Conference, CloudCom 2009, Beijing, China, December 1-4, 2009. Proceedings. Ed. by Martin Gilje Jaatun, Gansen Zhao, and Chunming Rong. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 519–528. ISBN: 978-3-642-10665-1. DOI: 10.1007/978-3-642-10665-1_47.

[4] A. Kala Karun and K. Chitharanjan. “A review on Hadoop – HDFS infrastructure extensions”. In: Information Communication Technologies (ICT), 2013 IEEE Conference on. Apr. 2013, pp. 132–137. DOI: 10.1109/CICT.2013.6558077.

[5] X. Liu, J. Han, Y. Zhong, C. Han, and X. He. “Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS”. In: IEEE International Conference on Cluster Computing and Workshops (2009), pp. 1–8. ISSN: 1552-5244. DOI: 10.1109/CLUSTR.2009.5289196.

[6] J. Shafer, S. Rixner, and A. L. Cox. “The Hadoop Distributed Filesystem: Balancing Portability and Performance”. In: IEEE International Symposium on Performance Analysis of Systems and Software (2010), pp. 122–133. DOI: 10.1109/ISPASS.2010.5452045.

[7] E. Sivaraman and R. Manickachezian. “High performance and fault tolerant distributed file system for big data storage and processing using Hadoop”. In: Proceedings - 2014 International Conference on Intelligent Computing Applications (2014), pp. 32–36. ISBN: 9781479939664. DOI: 10.1109/ICICA.2014.16.

[8] A. Thusoo, Z. Shao, S. Antony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu. “Data Warehousing and Analytics Infrastructure at Facebook”. In: SIGMOD (2010), pp. 1013–1020. ISBN: 978-1-4503-0032-2. DOI: 10.1145/1807167.1807278.

[9] J. Venner, S. Wadkar, and M. Siddalingaiah. Pro Apache Hadoop. 2014, p. 183. ISBN: 9781430248637.
