
Degree project in Computer Science and Engineering, second cycle, 30 credits

Stockholm, Sweden 2016

A Global Ecosystem for Datasets on Hadoop

JOHAN SVEDLUND NORDSTRÖM

KTH


A Global Ecosystem for Datasets on Hadoop

TRITA-ICT-EX-2016:131

Johan Peter Svedlund Nordström

Master of Science Thesis

Software Engineering of Distributed Systems School of Information and Communication Technology

KTH Royal Institute of Technology Stockholm, Sweden

11 September 2016


Abstract

The immense growth of the web has led to the age of Big Data. Companies like Google, Yahoo and Facebook generate massive amounts of data every day. In order to gain value from this data, it needs to be effectively stored and processed. Hadoop, a Big Data framework, can store and process Big Data in a scalable and performant fashion. Both Yahoo and Facebook, two major IT companies, deploy Hadoop as their solution to the Big Data problem. Many application areas for Big Data would benefit from the ability to share datasets across cluster boundaries. However, Hadoop does not support searching for datasets, either local to a single Hadoop cluster or across many Hadoop clusters. Similarly, there is only limited support for copying datasets between Hadoop clusters (using DistCp). This project presents a solution to this weakness using the Hadoop distribution Hops and its frontend HopsWorks. Clusters advertise their peer-to-peer and search endpoints to a central server called Hops-Site. The advertised endpoints build a global Hadoop ecosystem and give clusters the ability to participate in public search or peer-to-peer sharing of datasets. HopsWorks users are given a choice to write data into Kafka as it is being downloaded. This opens up new possibilities for data scientists, who can interactively analyse remote datasets without having to download everything in advance. By writing data into Kafka as it is being downloaded, it can be consumed by entities like Spark Streaming or Flink.


Acknowledgements

I would like to acknowledge my examiner Jim Dowling and my supervisor Alex Ormenisan. Both of them have contributed with advice and smart ideas which have helped me throughout the project. I would also like to thank all the coworkers at SICS, who were always glad to offer help.


Contents

1 Introduction 1

1.1 Problem description . . . 2

1.2 Problem statement and Purpose . . . 3

1.3 Goals, Ethics and Sustainability . . . 4

1.4 Structure of this thesis . . . 5


3.2 Experiments and evaluation . . . 20

4 Implementation 21

4.1 Rules . . . 21

4.2 Hops-Site interactions with a HopsWorks instance . . . 23

4.3 Public Search . . . 25

4.3.1 What is needed? . . . 25

4.3.2 Producing a public search and handling responses . . . 26

4.4 GVoD peer-to-peer upload and download . . . 28

4.4.1 What is needed? . . . 28

4.4.2 Upload . . . 28

4.4.3 Download . . . 29

4.5 Real-time processing . . . 30

5 Analysis 33

5.1 Evaluation of implementation and current technologies . . . 33

5.2 Results of Experiments . . . 34

5.3 P2P test . . . 34

5.4 Real-time processing tests . . . 37

6 Conclusions 39

6.1 Goals . . . 39

6.2 Future work . . . 40

6.3 Conclusion . . . 43


List of Figures

2.1 HopsWorks and Hops . . . 10

2.2 HDFS . . . 15

2.3 Hops HDFS . . . 16

4.1 Register to Hops-Site . . . 23

4.2 Ping Hops-Site . . . 24

4.3 Public Search . . . 27

5.1 Download with one uploader . . . 35

5.2 Download with two uploaders . . . 35

5.3 Download with three uploaders . . . 36

5.4 Download with four uploaders . . . 36


List of Acronyms and Abbreviations

HDFS Hadoop Distributed File System

YARN Yet Another Resource Negotiator

Hops Hadoop Open Platform-as-a-Service

JSON JavaScript Object Notation

REST Representational State Transfer

NAT Network Address Translation

FTP File Transfer Protocol

NDB Network Database

API Application Programming Interface


Chapter 1

Introduction

Modern computing is generating massive volumes of data at ever growing speeds. The data generated has different characteristics. Not only are the volumes large, but data is structured, semi-structured and unstructured [1]. Data originates from different kinds of sources like web pages, logs, social media, e-mail, documents, sensor devices and many more [2]. The different characteristics, size, complexity and origins of this data make it difficult for traditional storage and processing systems to handle. A commonly recognized term for this kind of data is "Big Data".

Storing, processing and extracting value from Big Data is no trivial matter and a problem that companies like Google and Yahoo spend lots of resources on. The most notable framework for Big Data storage and processing is called Hadoop [3]. The base of Hadoop was developed at Yahoo together with the creator of the famous search-engine library Apache Lucene [4]. Hadoop is a framework consisting of several different projects such as the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN). Hadoop is capable of storing and processing large amounts of data in a scalable, efficient and effective way. Although Hadoop excels at handling Big Data, it is a fairly young framework and lacks some important capabilities that would be beneficial for its progress.

1.1 Problem description


users will be forced to wait for everything to finish downloading. This will likely make them less inclined to share datasets.

1.2 Problem statement and Purpose

Hadoop has no solutions for either searching or scalable sharing of datasets across cluster boundaries. Also, sharing datasets between datacenters could be problematic in the face of NAT endpoints. This means that data will likely be bound to one cluster and sharing will not happen. The lack of shared datasets is a hindrance for further progress within application areas that use the utilities of Hadoop. Even if Hadoop had scalable capabilities to share data, downloading large datasets might take too long and therefore be avoided. There needs to be a way to do processing on data as it is being downloaded.


entities like Spark streaming [10], Flink [11] or similar technologies.

1.3 Goals, Ethics and Sustainability

The goal of this thesis and work will be a working, scalable and efficient implementation of a global ecosystem for Hops datasets. HopsWorks users of different Hops clusters will be able to share and search for datasets in this global ecosystem. Also in the case of downloading datasets, real-time processing will be supported. The explicit goals are listed below.

• Implement search for public datasets. HopsWorks users should be able to find data in their own Hops cluster and in remote Hops clusters.

• Implement peer-to-peer sharing of public datasets. HopsWorks users should be able to upload and download data to and from other Hops clusters.

• Implement support for real-time processing of downloading data. HopsWorks users should not be forced to wait for complete downloads in order to investigate interesting data.

• Demonstrate that peer-to-peer sharing of data is a scalable solution and a better solution than Hadoop's built-in copying mechanism DistCp and similar technologies.

If these goals are met then people creating, storing and processing interesting data can share it with others from the related or unrelated application domains in order to further progress their products and goals. It can directly benefit entities that work in the Big Data society but could also indirectly benefit those who are only affected by it, for example visitors of enterprise web applications.


and a sustainable standpoint. Peer-to-peer downloads of large datasets could potentially mean large usage of bandwidth, which could strain network infrastructure. Also, from an ethical perspective, sharing data can be problematic if there is no sufficient access control to the data being shared. Fortunately, GVoD, which is the process that conducts the peer-to-peer downloads, uses a special network protocol known as Ledbat [12]. This protocol is different from protocols such as TCP and UDP, as it will adapt its usage to the current network characteristics. Concerning ethical problems, access control is managed by the HopsWorks web application, where users can choose to make their own data publicly available.

1.4 Structure of this thesis


Chapter 2

Background

This chapter presents the background of the thesis. It introduces different entities that the reader needs to understand in order to comprehend the remainder of the thesis. First, Hadoop, Hops and HopsWorks are introduced, as they outline the base upon which the solution is built. After that, the different parts of the solution architecture are introduced: first the central server (Hops-Site), then relational persistence (MySQL Cluster), search (ElasticSearch and Epipe) and peer-to-peer sharing (GVoD). Lastly, the different dataset-store components (HDFS and Kafka) are presented, along with what they offer for this particular solution. This chapter does not discuss how these different techniques accomplish the overall solution; that is done in Chapter 4.

2.1 Hadoop

Apache Hadoop is a framework that provides distributed storage and processing of large datasets [3]. Hadoop was designed to scale from single servers up to massive clusters of nodes where each node offers both storage and processing power. Today, companies like Yahoo, Facebook and Spotify deploy Hadoop stacks in their datacenters in order to manage the large amounts of data they generate [13]. Rather than relying on central and expensive solutions, Hadoop utilizes the power of parallel computing and inexpensive hardware. The main modules of Hadoop are Hadoop Common, HDFS, YARN and MapReduce. Hadoop Common consists of the utilities that the other modules need in order to function properly. HDFS is a distributed filesystem, the default filesystem for Hadoop. YARN is a resource negotiator, with responsibilities similar to a traditional operating system. MapReduce is a programming model that is widely supported inside Hadoop and allows for things like distributed processing of data.
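The MapReduce programming model mentioned above can be illustrated with the classic word-count example. The sketch below simulates the two phases locally in plain Python; it is not Hadoop's actual Java API, only the map-then-reduce idea.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data needs big tools", "hadoop stores big data"]
print(reduce_phase(map_phase(lines)))
# {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'stores': 1}
```

In real Hadoop, the map tasks run in parallel across the cluster and the framework shuffles the intermediate pairs to the reduce tasks; this sketch only shows the data flow.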

2.2 Hops

Hadoop Open Platform-as-a-Service, or "Hops", is a Hadoop distribution developed at SICS [14]. Hops has several improvements over Hadoop, some of which are listed below.

• Hadoop-as-a-Service

• Project-Based Multi-Tenancy

• Secure sharing of DataSets across HopsWorks projects

• Extensible metadata that supports free-text search using Elasticsearch

• YARN quotas for projects


and 2.7. Also, with HDFS metadata stored inside the MySQL Cluster instead of on the heap of a NameNode, Hops becomes more scalable than other Hadoop architectures [15]. The secure sharing, multi-tenancy and service improvements are enabled by the HopsWorks web application, which is described further below.

2.3 HopsWorks

HopsWorks is the frontend for Hops. It introduces concepts like Users, Projects and Datasets, which help organize the different services that Hops offers. For example, a user of HopsWorks can create a Project, run a Spark job and store the results as a dataset inside the project. Technically speaking, HopsWorks is an AngularJS [16] front-end and a Java Jersey [17] REST (Representational State Transfer) [18] back-end. The back-end talks with Hops services (like ElasticSearch and GVoD) via REST calls. Most data exchanged over REST in the Hops and HopsWorks architecture is serialized into JSON (JavaScript Object Notation) [19] format. Figure 2.1 illustrates the architecture of Hops and HopsWorks.


Figure 2.1: HopsWorks and Hops


2.4 Hops-Site

Searching publicly for datasets means that those datasets need to be globally unique, i.e., unique across different clusters as well as within clusters. A Hops cluster also needs to know the search endpoints of other clusters in order to direct public-search queries to them. Hops-Site serves as a solution to both of these problems. Hops-Site is a Jersey [17] RESTful web service deployed on a Glassfish web server [20] local to a certain Hops cluster. Hops-Site offers a REST API to Hops clusters where they can advertise their search endpoints, obtain a unique cluster id and find information about other registered clusters. Hops-Site also maintains other types of information about registered clusters, such as how active they are (how often they ping) and what GVoD endpoint they have. Hops-Site uses a MySQL Cluster for persistence, same as HopsWorks.

2.5 MySQL Cluster

HopsWorks, Hops and Hops-Site all utilize a relational database to persist important information. HopsWorks and Hops (YARN and HDFS) need to store information about users, projects, datasets and more, while Hops-Site persists information about registered Hops clusters. Some of the reasons for choosing MySQL Cluster as a persistent store for Hops-Site are listed below.

• Integration

• High Availability and Scalability

• No Single Point of Failure


Since HopsWorks already utilizes MySQL Cluster for persistence, it became a natural choice for Hops-Site, which also runs inside a Hops cluster. The second and perhaps most important reason is the performance that MySQL Cluster offers. MySQL Cluster has both high availability and scalability [21], both of which are critical for up-time and for handling a high load of requests. Also, because MySQL Cluster is a distributed relational database, there is no single point of failure, which further improves potential up-time.

2.6 ElasticSearch


2.7 Epipe

Inside a Hops cluster, an application called Epipe is responsible for writing data from the MySQL Cluster into ElasticSearch. This application employs the NDB (Network Database) event API [23] and listens for events that are generated when something changes inside the MySQL Cluster, for example an update to a table. When an event is generated, Epipe looks at the event and writes the changes into ElasticSearch. In this manner, parts of the MySQL Cluster are replicated in ElasticSearch, which means that users of HopsWorks can direct search queries to ElasticSearch and search for data that is stored in the MySQL Cluster. An example of this would be when a user makes a dataset public; this would change a column inside the dataset table in the MySQL Cluster. Epipe would receive an event and write the column change into ElasticSearch. Users would then be able to search for that public dataset by querying the local ElasticSearch instance.
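The replication step described above can be sketched as follows. This is a hypothetical illustration only: the event shape and field names are not Hops' actual NDB event schema, and a plain dict stands in for the Elasticsearch index.

```python
# Hypothetical sketch of Epipe's replication step: react to a database
# change event by mirroring the changed row into a search index.

def handle_event(event, index):
    """Mirror one MySQL Cluster table change into the search index."""
    # Only index rows of the dataset table that have been made public.
    if event["table"] == "dataset" and event["row"].get("public"):
        row = event["row"]
        index[row["id"]] = {"name": row["name"], "public": True}

index = {}  # stands in for an Elasticsearch index
# Simulate the event fired when a user makes the "climate" dataset public:
handle_event(
    {"table": "dataset", "row": {"id": 7, "name": "climate", "public": True}},
    index,
)
print(index[7]["name"])  # climate
```

The real Epipe does this continuously, so the search index lags the database only by the event-propagation delay.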

2.8 GVoD


Downloading datasets with GVoD means transferring a lot of these pieces and building blocks from them. In order for a GVoD instance to know that it has obtained correct blocks, it needs to verify block hash values. GVoD incorporates "on-demand hashing" of blocks in order to support this. GVoD downloads data in an orderly fashion, which differs from other common peer-to-peer applications that usually download data out of order. The fact that GVoD downloads data in order is necessary since HDFS is an append-only filesystem [24]. GVoD has also been altered to write to HDFS and Kafka, where HDFS represents the persistent store and Kafka the temporary store that enables real-time processing. Writing data into HDFS is done by transferring pieces, building blocks from them, verifying the block hash values and then writing them to the DataNodes of HDFS. Writing data into Kafka is a bit different and is described further in section 2.10.
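The per-block verification step can be sketched as below. The block size and the choice of SHA-256 are assumptions for illustration; the thesis does not specify which hash algorithm or block size GVoD actually uses.

```python
import hashlib

# Illustrative sketch of verifying downloaded blocks against advertised
# digests before writing them to HDFS. Block size and hash algorithm are
# assumptions, not GVoD's actual parameters.

BLOCK_SIZE = 1024

def split_blocks(data: bytes):
    """Split a byte stream into fixed-size blocks (last one may be short)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def verify_block(block: bytes, expected_digest: str) -> bool:
    """Accept a downloaded block only if its hash matches the advertised one."""
    return hashlib.sha256(block).hexdigest() == expected_digest

# The uploader advertises one digest per block; the downloader re-hashes
# each received block and compares.
data = b"x" * 3000
digests = [hashlib.sha256(b).hexdigest() for b in split_blocks(data)]
ok = all(verify_block(b, d) for b, d in zip(split_blocks(data), digests))
print(ok)  # True
```

Because HDFS is append-only, blocks must also arrive in order before being written, which is why GVoD's in-order download matters here.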

2.9 HDFS

HDFS is the default filesystem for both Hadoop and Hops. It is a distributed filesystem designed to run on inexpensive hardware. HDFS was built according to some particular design goals, namely fault tolerance, streaming data access, large datasets, a simple read-write model and moving the application to the data [24]. Figure 2.2 illustrates the architecture of HDFS.


Figure 2.2: HDFS

other DataNodes at the command of the NameNode.

The Hops filesystem is different from HDFS in that it migrates the filesystem metadata from the NameNode heap to a MySQL Cluster [15]. See figure 2.3 for an example of the Hops-HDFS architecture.


2.10 Kafka


Chapter 3

Method

This chapter presents the type of methodology used to produce the thesis, work and results. It discusses the analysis and tests. The actual results and implementation are presented in the upcoming chapters.

3.1 Methodology

For this thesis, a quantitative research method was chosen [27]. First, a system was created that sought to meet the proposed goals of the project. These were, as mentioned before, to implement search and scalable sharing of public datasets as well as support for real-time processing of downloading datasets. Along with the quantitative research method, a deductive research approach [27] was chosen, and experimental tests and evaluations were made to verify the goals.


3.2 Experiments and evaluation

Due to the nature of the project as well as the limited time frame, only a couple of experiments were conducted. The main experiment for testing the performance of the implementation was a test that transferred datasets between clusters with an increasing number of participating peers. By downloading datasets and increasing the number of participating peers, the scalability and performance of the peer-to-peer sharing could be verified.

No tests were conducted to establish the performance of public search. The main reason for this was that GVoD at the time of writing did not have the ability to build its own overlay. Therefore, in order to make the peer-to-peer sharing optimal, public search had to take a hit in performance; more on this later in chapter 4, section 4.3. Also, because no Hadoop implementation had the ability to do public search, there was nothing to benchmark against.

In order to test the ability of real-time processing, a simple test was conducted. This test evaluated how much time it took before downloading data started to appear in a Kafka Topic.


Chapter 4

Implementation

This chapter presents the implementation of the search and peer-to-peer downloading as well as the real-time processing support. First, a couple of rules/assumptions are presented; some of these are temporary limitations and others are the results of logical conclusions. After that, the interactions between Hops-Site and HopsWorks are discussed, as the results from those interactions are essential for public search. Next, both public search and peer-to-peer sharing of datasets are presented in depth. Lastly, the real-time processing support is explained.

4.1 Rules

The first rule says that public datasets are immutable. This means that once a dataset is made public, it cannot be changed, i.e., you cannot add or remove files from a public dataset. It turns out that this rule is a logical choice, as HDFS originally was designed for immutable data [28], and even though it is now possible to append to files, it is common that large datasets remain static.


The second rule says that public datasets are identified by the cluster id, the project name, the dataset name and a unix timestamp. The cluster id is obtained through registration with Hops-Site, which is described further in section 4.2. The project name and dataset name are provided by HopsWorks and its structure of users having projects and projects having datasets. The cluster id makes the public dataset unique across different clusters, the project name makes the dataset unique within a cluster, and the dataset name is there for convenience. The unix timestamp is there because a HopsWorks user might want to remove the public property of the dataset but then make it public again later.
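The four-part identity above can be sketched as an immutable value type. The field names here are illustrative, not HopsWorks' actual schema.

```python
from dataclasses import dataclass

# Sketch of the four-part public-dataset identity described by rule two.
# Field names are assumptions for illustration.

@dataclass(frozen=True)
class PublicDatasetId:
    cluster_id: str      # globally unique, issued by Hops-Site
    project_name: str    # makes the dataset unique within the cluster
    dataset_name: str
    made_public_at: int  # unix timestamp: re-publishing yields a new identity

a = PublicDatasetId("c1", "weather", "2016-obs", 1473552000)
b = PublicDatasetId("c1", "weather", "2016-obs", 1473638400)
print(a == b)  # False: the same dataset re-published gets a distinct identity
```

Including the timestamp is what lets a user un-publish and later re-publish a dataset without colliding with the earlier public identity.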

The third rule is really a result of a temporary version of GVoD. When this thesis was written, GVoD did not have the ability to build its own overlay for peer-to-peer sharing of data. Instead, it needed to know all the peers that it was going to download data from. This limited the ability to optimize public search, and we will discuss improvements to this later in chapter 6, section 6.2.

The fourth rule is another one of those temporary assumptions that can be improved upon. At the moment, public datasets are considered to be one-level in the sense that they do not have any directories, only files.


4.2 Hops-Site interactions with a HopsWorks instance

Hops-Site is a centralized server that is crucial to the functionality of the global Hops ecosystem. Below are two figures that present the interactions between HopsWorks and Hops-Site. The first, figure 4.1, shows the Register REST call to Hops-Site and the second, figure 4.2, shows the Ping REST call.

Figure 4.1: Register to Hops-Site
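The Register and Ping interactions can be sketched as a minimal REST client. The URL paths ("/register", "/ping") and the payload field names below are assumptions for illustration; they are not Hops-Site's documented API.

```python
import json
import urllib.request

# Hypothetical client for the Register and Ping calls of figures 4.1 and
# 4.2; paths and field names are assumed, not Hops-Site's actual API.

def register_payload(search_endpoint, gvod_endpoint):
    """Body of the Register call: advertise this cluster's endpoints."""
    return {"searchEndpoint": search_endpoint, "gvodEndpoint": gvod_endpoint}

def ping_payload(cluster_id):
    """Body of the periodic Ping call, which lets Hops-Site track activity."""
    return {"clusterId": cluster_id}

def post_json(url, payload):
    """POST a JSON body and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# A cluster would first register, keep the returned cluster id, then ping:
#   reply = post_json(base + "/register", register_payload(es, gvod))
#   post_json(base + "/ping", ping_payload(reply["clusterId"]))
```

Keeping the payload builders separate from the transport makes the interaction easy to reason about even though the real endpoints are not shown in the text.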


public-datasets.

Figure 4.2: Ping Hops-Site


4.3 Public Search

This section describes public search in detail. First, the information needed to perform public search is summarized. After that, the different steps of public search are described, along with what the results are.

4.3.1 What is needed?


4.3.2 Producing a public search and handling responses


implementation described above.

Figure 4.3: Public Search


and returned to the original frontend.

4.4 GVoD peer-to-peer upload and download

This section describes the peer-to-peer sharing in detail. First, the information needed before a download can be made is presented. After that, the actual upload and download are described.

4.4.1 What is needed?

In chapter 2, section 2.8, we mentioned that GVoD is the application that takes care of the download and upload of datasets. In this chapter, in section 4.1, we also mentioned that GVoD has no ability to build an overlay on demand. This means that in order to produce an optimal download, i.e., a download with the maximum number of participating peers, GVoD needs to get the peers from somewhere. It turns out that public search does just that. Public search returns a list of unique public datasets corresponding to the query it received as input; each of these datasets also comes with a list of GVoD endpoints, which happen to be all of the peers that GVoD can utilize to download the dataset. This means that after a public search is performed, a HopsWorks user has all the information needed to perform an optimal peer-to-peer download of a dataset.

4.4.2 Upload


HopsWorks and Hops involves several steps. The first thing that happens is that a user of HopsWorks right-clicks on a dataset icon and selects "make public". The next step involves the creation of the so-called Manifest. A Manifest is a JSON file that contains information about the contents of a public dataset. It describes the files and whether they support writing into Kafka. The Manifest also contains other metadata such as the creator, creation date and so on. When the Manifest is created, it is written to the dataset folder in HDFS. After that, HopsWorks makes a REST call to GVoD, informing it about the path to the HDFS folder and other information such as the public-dataset-id and HDFS endpoint information. GVoD then looks at the path provided and tries to read the Manifest and parse it as JSON. If successful, GVoD knows the structure of the dataset it should upload and also the torrent-id it should use (the public-dataset-id). GVoD then replies to HopsWorks with a REST call indicating that everything went fine. HopsWorks then persists the fact that this dataset is now public, along with its public-dataset-id, into the MySQL Cluster. Epipe will then receive an event and write the changes into ElasticSearch, making the dataset available for public search.
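A Manifest of the kind described above might look as follows. The exact field names used by HopsWorks are not given in the text, so these are assumptions for illustration.

```python
import json

# Illustrative Manifest for a public dataset; field names are assumed,
# not HopsWorks' actual Manifest schema.

manifest = {
    "datasetName": "taxi-trips",
    "creator": "alice",
    "createDate": "2016-09-11",
    "files": [
        {"name": "trips.avro", "kafkaSupported": True,  "schema": "trip-v1"},
        {"name": "README.txt", "kafkaSupported": False},
    ],
}

# HopsWorks writes this file into the dataset folder in HDFS; GVoD later
# reads and parses it to learn the dataset's structure.
text = json.dumps(manifest, indent=2)
parsed = json.loads(text)
print(parsed["files"][0]["kafkaSupported"])  # True
```

The round-trip through `json.dumps`/`json.loads` mirrors the write-then-parse handoff between HopsWorks and GVoD.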

4.4.3 Download


local GVoD instance to download the Manifest and present it to the HopsWorks user. In order to do this, there must first be a location where the local GVoD can write the Manifest. The HopsWorks user must therefore create a destination dataset folder so that the local GVoD instance can write the Manifest into it. After a destination dataset is created, HopsWorks sends the path of this dataset to GVoD in a REST call, together with other important information such as the GVoD endpoints to download from and the torrent-id (public-dataset-id). GVoD then downloads the Manifest from the peers it was presented with and writes the Manifest into the path that HopsWorks gave it. Then it sends a REST call back indicating that the Manifest is now present in the path that it was given. HopsWorks can now read the Manifest from the destination dataset and present the information to the user. Depending on what kind of files and schemas are present in the dataset, the user can choose to either write the rest of the dataset into only HDFS or into both HDFS and Kafka. After that choice is made, HopsWorks sends a REST call to GVoD informing it about what kind of download should be made. When GVoD receives this REST call, it proceeds to download the rest of the data into the desired storage components.
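The second phase of the flow depends on what the Manifest allows. In the real system the HopsWorks user makes the HDFS-only versus HDFS-plus-Kafka choice; the sketch below merely derives which options the Manifest makes available, with an assumed `kafkaSupported` flag per file.

```python
# Hypothetical sketch: given the parsed Manifest's file entries, determine
# which storage sinks are available for phase two of the download. In
# HopsWorks the user picks among these; this only derives the options.

def available_sinks(manifest_files):
    """Kafka is only an option when every file carries a schema for it."""
    if manifest_files and all(f.get("kafkaSupported") for f in manifest_files):
        return ["hdfs", "kafka"]
    return ["hdfs"]

print(available_sinks([{"kafkaSupported": True}]))       # ['hdfs', 'kafka']
print(available_sinks([{"kafkaSupported": True}, {}]))   # ['hdfs']
```

Downloading the small Manifest first, before the bulk data, is what lets the user make this choice without waiting for the whole dataset.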

4.5 Real-time processing


Chapter 5

Analysis

This chapter presents the results and analysis of the implementation and tests. First, the evaluation of the implementation and existing technology is presented. After that, the different test results are shown.

5.1 Evaluation of implementation and current technologies

This thesis and project has introduced a peer-to-peer sharing service that enables scalable and efficient sharing of public datasets. Copying data between filesystems, servers and datacenters is no novel idea. There exist countless solutions for this type of problem, but almost none of them fit particularly well in a Hadoop ecosystem. First of all, datasets in a Hadoop cluster like Hops are often very large, hence simple transfer protocols like FTP will not scale well. Technologies like DistCp do perform well when copying large datasets, but not from one datacenter to another. Also, because neither of these solutions uses peer-to-peer technology, they are unlikely to achieve maximum performance. Another major obstacle for technologies such as DistCp is that they cannot traverse NAT endpoints. This is a major problem, as most of today's internet uses NATs to extend network infrastructure.

The implementation developed throughout this thesis suffers from none of these above-mentioned problems. It is peer-to-peer by nature and has built-in NAT-traversal capabilities.

5.2 Results of Experiments

This section presents the results of the tests that were conducted to validate the implementation. First, the scalability of the peer-to-peer sharing service is presented. Then the performance of the real-time processing is presented and discussed.

5.3 P2P test


first figure is a download with one uploader and the next with two uploaders and so on.

Figure 5.1: Download with one uploader

Figure 5.2: Download with two uploaders


Figure 5.3: Download with three uploaders


5.4 Real-time processing tests


Chapter 6

Conclusions

This chapter concludes the thesis by presenting the author's reflections on the project. First, an evaluation of the goals is made. Then some reflections about the work and future work are presented. The thesis is summed up with a final conclusion in the last section.

6.1 Goals

The explicit goals of this project can be found in chapter 1, section 1.3. Overall, the goal was to create a scalable and effective solution for sharing datasets between Hadoop clusters, with support for real-time processing of downloading datasets. The tests and evaluation in chapter 5 confirm that the peer-to-peer sharing is scalable and that the real-time processing is effective and very useful. The public-search part was not tested, and I therefore cannot claim anything about its performance. However, as described in chapter 4, section 4.3, the main thread that does the search blocks to wait for all clusters to respond.


This is obviously a performance flaw and improvements to this will be discussed below.

6.2 Future work

Even though the implementation fulfilled the goals of the project, there are plenty of improvements that would make the system more complete and performant.


chapter 4, section 4.1. The fact that public datasets are immutable could be changed by incorporating some kind of versioning system. Instead of forcing public datasets to be static, a public dataset could have different versions, where added data means a new version of the dataset. The peer-to-peer system could then recognize that two datasets have the same base version and perhaps use that to optimize a download or upload. Another obvious limitation in those rules is the fact that public datasets are one-level. This should be changed so that public datasets can have directories to further structure their data.

In both chapters 2 and 4 it was mentioned that the HopsWorks web application has a REST call that is available for anyone to call. This is an obvious weak point for DDoS attackers to exploit, but it is not trivial to fix. The fix needs to allow different HopsWorks web applications to differentiate themselves from random DDoS attackers. Another solution could be to incorporate some kind of DDoS detection, where spam behavior is detected and dealt with correctly.

Another problem that was not really clear from the tests is the way that GVoD writes data to Kafka. At the time of writing, this is done with synchronous producers, which basically means that GVoD writes data to a topic, awaits a confirmation that it was written, and then writes again. This is of course not optimal; it would be better if data could be written in an asynchronous way, similar to how search queries to other clusters are handled.
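The cost of the synchronous pattern can be illustrated with a toy simulation. This is not the Kafka producer API; it only models the pattern, with an assumed 10 ms acknowledgement delay per write, to show why overlapping the waits helps.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy model of synchronous vs. asynchronous producing: a synchronous
# producer pays the (simulated) acknowledgement latency once per record,
# while an asynchronous one overlaps the waits. Not the Kafka API.

ACK_DELAY = 0.01  # simulated broker acknowledgement latency, an assumption

def send(record):
    time.sleep(ACK_DELAY)  # wait for the (simulated) ack
    return record

def produce_sync(records):
    return [send(r) for r in records]            # ack awaited per record

def produce_async(records):
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(send, records))     # acks overlap

records = list(range(16))
t0 = time.time(); produce_sync(records);  sync_s = time.time() - t0
t0 = time.time(); produce_async(records); async_s = time.time() - t0
print(async_s < sync_s)  # True: overlapping acks finishes sooner
```

With 16 records the synchronous variant pays roughly 16 times the ack delay, while the overlapped variant pays only a couple of rounds of it, which is the intuition behind switching GVoD to asynchronous producing.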


6.3 Conclusion


Bibliography

[1] S. Kaisler, F. Armour, J. A. Espinosa, and W. Money, “Big data: Issues and challenges moving forward,” in System Sciences (HICSS), 2013 46th Hawaii International Conference on, Jan 2013, pp. 995–1004.

[2] A. Katal, M. Wazid, and R. H. Goudar, “Big data: Issues, challenges, tools and good practices,” in Contemporary Computing (IC3), 2013 Sixth International Conference on, Aug 2013, pp. 404–409.

[3] "Hadoop homepage," http://hadoop.apache.org/, accessed: 2016-09-09.

[4] "Apache lucene," https://lucene.apache.org/core/, accessed: 2016-09-09.

[5] "Apache distcp homepage," https://hadoop.apache.org/docs/r1.2.1/distcp2.html, accessed: 2016-09-09.

[6] “Ftp rfc,”https://www.ietf.org/rfc/rfc959.txt, accessed: 2016-09-09.

[7] “Gvod homepage,” http://www.decentrify.io/?q=content/video, accessed: 2016-09-09.

[8] “Elasticsearch guide,” https://www.elastic.co/guide/en/elasticsearch/guide/ current/getting-started.html, accessed: 2016-09-09.


[10] “Spark streaming and kafka,” http://spark.apache.org/docs/latest/streaming-kafka-integration.html.

[11] "Flink and kafka," https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/connectors/kafka.html, accessed: 2016-09-09.

[12] D. Rossi, C. Testa, S. Valenti, and L. Muscariello, “Ledbat: The new bittorrent congestion control protocol,” in Computer Communications and Networks (ICCCN), 2010 Proceedings of 19th International Conference on, Aug 2010, pp. 1–6.

[13] “Hadoop usages,” http://wiki.apache.org/hadoop/PoweredBy, accessed: 2016-09-09.

[14] "Hadoop open platform-as-a-service," http://www.hops.io/?q=content/docs.

[15] K. Hakimzadeh, H. Peiro Sajjad, and J. Dowling, Scaling HDFS with a Strongly Consistent Relational Model for Metadata. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 38–51. [Online]. Available: http://dx.doi.org/10.1007/978-3-662-43352-2_4

[16] "Angularjs doc," https://angularjs.org/, accessed: 2016-09-09.

[17] "Jersey web services," https://jersey.java.net/.

[18] L. Richardson and S. Ruby, RESTful Web Services. O'Reilly Media, Inc., 2008.

[19] "Json rfc," https://tools.ietf.org/html/rfc7159, accessed: 2016-09-09.

[20] "Glassfish server," https://glassfish.java.net/.


[22] "Lucene inverted index," https://lucene.apache.org/core/3_0_3/fileformats.html, accessed: 2016-09-09.

[23] “Ndb cluster api,” https://dev.mysql.com/doc/ndbapi/en/mysql-cluster-api-overview.html.

[24] “Hdfs architecture,” http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.

[25] "Csv rfc," https://www.ietf.org/rfc/rfc4180.txt, accessed: 2016-09-09.

[26] "Avro docs," http://avro.apache.org/docs/1.7.5/spec.html, accessed: 2016-09-09.

[27] A. Håkansson, "Portal of research methods and methodologies for research projects and degree projects," in Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering FECS'13. CSREA Press U.S.A, 2013, pp. 67–73.

[28] “Older hdfs version,” https://hadoop.apache.org/docs/r1.2.1/hdfs design. html, accessed: 2016-09-09.

[29] “Elasticsearch score,” , accessed: 2016-09-09.
